## Final Project - Group 3 - Credit Card Fraud Detection
Team: Sean Ely, Xiwang Li, Pedram, Amir, Rox, Arash

### Kaggle - Credit Card Fraud Detection
https://www.kaggle.com/mlg-ulb/creditcardfraud
#### Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
#### Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
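
The imbalance figures quoted above can be checked with a few lines of arithmetic. This is only a minimal sketch: the two counts come from the description, while the variable names are illustrative and not part of the dataset or the notebook.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284807

n_normal = n_total - n_fraud                   # legitimate transactions
fraud_pct = 100.0 * n_fraud / n_total          # share of the positive class, in percent
normal_per_fraud = float(n_normal) / n_fraud   # legitimate transactions per fraud

print("legitimate transactions: %d" % n_normal)
print("fraud share: %.4f%%" % fraud_pct)
print("legitimate transactions per fraud: %.1f" % normal_per_fraud)
```

In other words, there are several hundred legitimate transactions for every fraudulent one, which is why plain accuracy says little about how well frauds are detected.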

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
#### Inspiration
Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
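
To make the point about accuracy concrete, the sketch below compares the accuracy of a classifier that labels every transaction as normal with a ranking-based metric. It uses average precision, a common step-wise estimate of the area under the precision-recall curve; treating it as a stand-in for AUPRC is an assumption here. The helper `average_precision` and the toy labels and scores are made up for illustration.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive.

    Assumes binary labels (1 = fraud, 0 = normal) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += float(hits) / rank
    return total / n_pos


# A classifier that labels every transaction "normal" is right
# 284315 times out of 284807, yet it finds no fraud at all.
trivial_accuracy = 284315.0 / 284807

# Toy, made-up scores for five transactions (higher = more suspicious).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)

print("accuracy of the all-normal classifier: %.4f" % trivial_accuracy)
print("average precision on the toy scores: %.4f" % toy_ap)
```

The all-normal classifier scores above 99.8% accuracy with zero recall, whereas a precision-recall based metric only rewards a model for ranking frauds ahead of normal transactions.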
#### Acknowledgements
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

In [2]:
%sh
#need to run ***ONCE*** to install SMOTE package
/home/ubuntu/databricks/python/bin/pip install 'imbalanced-learn<0.2.1'
pip freeze | grep imbalanced-learn

In [3]:
# File location and type
file_location = "/FileStore/tables/creditcard.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,0
0,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.0089830991432281,0.0147241691924927,2.69,0
1,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66,0
1,-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.23760893977178,0.377435874652262,-1.38702406270197,-0.0549519224713749,-0.226487263835401,0.178228225877303,0.507756869957169,-0.28792374549456,-0.631418117709045,-1.0596472454325,-0.684092786345479,1.96577500349538,-1.2326219700892,-0.208037781160366,-0.108300452035545,0.0052735967825345,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,123.5,0
2,-1.15823309349523,0.877736754848451,1.548717846511,0.403033933955121,-0.407193377311653,0.0959214624684256,0.592940745385545,-0.270532677192282,0.817739308235294,0.753074431976354,-0.822842877946363,0.53819555014995,1.3458515932154,-1.11966983471731,0.175121130008994,-0.451449182813529,-0.237033239362776,-0.0381947870352842,0.803486924960175,0.408542360392758,-0.0094306971323291,0.79827849458971,-0.137458079619063,0.141266983824769,-0.206009587619756,0.502292224181569,0.219422229513348,0.215153147499206,69.99,0
2,-0.425965884412454,0.960523044882985,1.14110934232219,-0.168252079760302,0.42098688077219,-0.0297275516639742,0.476200948720027,0.260314333074874,-0.56867137571251,-0.371407196834471,1.34126198001957,0.359893837038039,-0.358090652573631,-0.137133700217612,0.517616806555742,0.401725895589603,-0.0581328233640131,0.0686531494425432,-0.0331937877876282,0.0849676720682049,-0.208253514656728,-0.559824796253248,-0.0263976679795373,-0.371426583174346,-0.232793816737034,0.105914779097957,0.253844224739337,0.0810802569229443,3.67,0
4,1.22965763450793,0.141003507049326,0.0453707735899449,1.20261273673594,0.191880988597645,0.272708122899098,-0.0051590028825098,0.0812129398830894,0.464959994783886,-0.0992543211289237,-1.41690724314928,-0.153825826253651,-0.75106271556262,0.16737196252175,0.0501435942254188,-0.443586797916727,0.002820512472347,-0.61198733994012,-0.0455750446637976,-0.21963255278686,-0.167716265815783,-0.270709726172363,-0.154103786809305,-0.780055415004671,0.75013693580659,-0.257236845917139,0.0345074297438413,0.0051677689062491,4.99,0
7,-0.644269442348146,1.41796354547385,1.0743803763556,-0.492199018495015,0.948934094764157,0.428118462833089,1.12063135838353,-3.80786423873589,0.615374730667027,1.24937617815176,-0.619467796121913,0.291474353088705,1.75796421396042,-1.32386521970526,0.686132504394383,-0.0761269994382006,-1.2221273453247,-0.358221569869078,0.324504731321494,-0.156741852488285,1.94346533978412,-1.01545470979971,0.057503529867291,-0.649709005559993,-0.415266566234811,-0.0516342969262494,-1.20692108094258,-1.08533918832377,40.8,0
7,-0.89428608220282,0.286157196276544,-0.113192212729871,-0.271526130088604,2.6695986595986,3.72181806112751,0.370145127676916,0.851084443200905,-0.392047586798604,-0.410430432848439,-0.705116586646536,-0.110452261733098,-0.286253632470583,0.0743553603016731,-0.328783050303565,-0.210077268148783,-0.499767968800267,0.118764861004217,0.57032816746536,0.0527356691149697,-0.0734251001059225,-0.268091632235551,-0.204232669947878,1.0115918018785,0.373204680146282,-0.384157307702294,0.0117473564581996,0.14240432992147,93.2,0
9,-0.33826175242575,1.11959337641566,1.04436655157316,-0.222187276738296,0.49936080649727,-0.24676110061991,0.651583206489972,0.0695385865186387,-0.736727316364109,-0.366845639206541,1.01761446783262,0.836389570307029,1.00684351373408,-0.443522816876142,0.150219101422635,0.739452777052119,-0.540979921943059,0.47667726004282,0.451772964394125,0.203711454727929,-0.246913936910008,-0.633752642406113,-0.12079408408185,-0.385049925313426,-0.0697330460416923,0.0941988339514961,0.246219304619926,0.0830756493473326,3.68,0


In [4]:
# Imported Libraries
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import functions as F

In [5]:
df.printSchema()

In [6]:
df.describe('Amount').show()

In [7]:
# number of null or nan values
df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [8]:
# The classes are heavily skewed; we address this later with weighting and resampling.
n_total = df.count()
n_fraud = df.filter(F.col('Class') == 1).count()
print 'Number of frauds: ', n_fraud, ", ", round(n_fraud / float(n_total) * 100, 3), '% of the dataset'
print 'Number of records', n_total

In [9]:
display(df.groupBy('class').count())

class,count
1,492
0,284315


In [12]:
df_fraud = df.filter(F.col('Class') == 1)
df_no_fraud_sample = df.filter(F.col('Class') == 0).sample(False, 0.1).limit(4920)  # downsample the no-fraud class to 10x the 492 fraud rows

df_sample = df_fraud.union(df_no_fraud_sample)
display(df_sample)

Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
406,-2.3122265423263,1.95199201064158,-1.60985073229769,3.9979055875468,-0.522187864667764,-1.42654531920595,-2.53738730624579,1.39165724829804,-2.77008927719433,-2.77227214465915,3.20203320709635,-2.89990738849473,-0.595221881324605,-4.28925378244217,0.389724120274487,-1.14074717980657,-2.83005567450437,-0.0168224681808257,0.416955705037907,0.126910559061474,0.517232370861764,-0.0350493686052974,-0.465211076182388,0.320198198514526,0.0445191674731724,0.177839798284401,0.261145002567677,-0.143275874698919,0.0,1
472,-3.0435406239976,-3.15730712090228,1.08846277997285,2.2886436183814,1.35980512966107,-1.06482252298131,0.325574266158614,-0.0677936531906277,-0.270952836226548,-0.838586564582682,-0.414575448285725,-0.503140859566824,0.676501544635863,-1.69202893305906,2.00063483909015,0.666779695901966,0.599717413841732,1.72532100745514,0.283344830149495,2.10233879259444,0.661695924845707,0.435477208966341,1.37596574254306,-0.293803152734021,0.279798031841214,-0.145361714815161,-0.252773122530705,0.0357642251788156,529.0,1
4462,-2.30334956758553,1.759247460267,-0.359744743330052,2.33024305053917,-0.821628328375422,-0.0757875706194599,0.562319782266954,-0.399146578487216,-0.238253367661746,-1.52541162656194,2.03291215755072,-6.56012429505962,0.0229373234890961,-1.47010153611197,-0.698826068579047,-2.28219382856251,-4.78183085597533,-2.61566494476124,-1.33444106667307,-0.430021867171611,-0.294166317554753,-0.932391057274991,0.172726295799422,-0.0873295379700724,-0.156114264651172,-0.542627889040196,0.0395659889264757,-0.153028796529788,239.93,1
6986,-4.39797444171999,1.35836702839758,-2.5928442182573,2.67978696694832,-1.12813094208956,-1.70653638774951,-3.49619729302467,-0.248777743025673,-0.24776789948008,-4.80163740602813,4.89584422347523,-10.9128193194019,0.184371685834387,-6.77109672468083,-0.0073261825777121,-7.35808322132346,-12.5984185405511,-5.13154862842983,0.308333945758691,-0.17160787864796,0.573574068424352,0.176967718048195,-0.436206883597401,-0.0535018648884285,0.252405261951833,-0.657487754764504,-0.827135714578603,0.849573379985768,59.0,1
7519,1.23423504613468,3.0197404207034,-4.30459688479665,4.73279513041887,3.62420083055386,-1.35774566315358,1.71344498787235,-0.496358487073991,-1.28285782036322,-2.44746925511151,2.10134386504854,-4.6096283906446,1.46437762476188,-6.07933719308005,-0.339237372732577,2.58185095378146,6.73938438478335,3.04249317830411,-2.72185312222835,0.0090608363953452,-0.37906830709218,-0.704181032215427,-0.656804756348389,-1.63265295692929,1.48890144838237,0.566797273468934,-0.0100162234965625,0.146792734916988,1.0,1
7526,0.0084303648955825,4.13783683497998,-6.24069657194744,6.6757321631344,0.768307024571449,-3.35305954788994,-1.63173467271809,0.15461244822474,-2.79589246446281,-6.18789062970647,5.66439470857116,-9.85448482287037,-0.306166658250084,-10.6911962118171,-0.638498192673322,-2.04197379107768,-1.12905587703585,0.116452521226364,-1.93466573889727,0.488378221134715,0.36451420978479,-0.608057133838703,-0.539527941820093,0.128939982991813,1.48848121006868,0.50796267782385,0.735821636119662,0.513573740679437,1.0,1
7535,0.0267792264491516,4.13246389713003,-6.56059996809658,6.34855667313983,1.32966566904142,-2.51347884762413,-1.68910220031328,0.303252800547589,-3.13940905736457,-6.04546779778801,6.75462544809695,-8.94817857893317,0.702724998099873,-10.7338541032306,-1.37951985681718,-1.63896011485587,-1.74635013628103,0.776744097926754,-1.32735663549015,0.587743219006407,0.370508651493253,-0.57675247317433,-0.669605371766238,-0.759907529538618,1.60505555017462,0.540675396428899,0.737040381683977,0.496699108168337,1.0,1
7543,0.329594333318222,3.71288929524103,-5.77593510831666,6.07826550560828,1.66735901311948,-2.42016841351562,-0.812891249491333,0.133080117970748,-2.21431131204961,-5.13445447110633,4.56072010550223,-8.87374836164535,-0.797483599628474,-9.17716637009146,-0.25702477514424,-0.871688490451564,1.31301362907797,0.773913872552923,-2.37059945059811,0.269772775978284,0.156617169389793,-0.652450440932299,-0.551572219392364,-0.716521635357197,1.41571661508922,0.555264739787582,0.530507388890912,0.404474054528712,1.0,1
7551,0.316459000444982,3.80907594667829,-5.61515901119457,6.04744510216478,1.55402595692572,-2.6513531120137,-0.746579273100222,0.0555863112529252,-2.6786785422399,-4.95949291161496,6.43905335158373,-7.52011739288703,0.38635166741077,-9.25230724747513,-1.36518841502051,-0.502362190618164,0.784426598154274,1.49430460743838,-1.80801215867357,0.388307428238927,0.208828369001674,-0.511746619200722,-0.583813220813723,-0.219845029091423,1.47475258440688,0.491191925656006,0.518868284577287,0.40252806767232,1.0,1
7610,0.725645739819857,2.30089443776603,-5.32997618300917,4.007682804682,-1.73041059025206,-1.73219256822244,-3.96859261813707,1.06372815344105,-0.486096552344833,-4.62498495406596,5.5887239146762,-7.14824263637845,1.68045074096412,-6.21025774661028,0.495282117814298,-3.5995402092184,-4.83032424210571,-0.649090120211694,2.2501232487881,0.504646226103286,0.589669127323198,0.109541319229913,0.601045276521079,-0.364700278220039,-1.84307769215194,0.351909298434892,0.594549978086464,0.0993722360416487,1.0,1


In [13]:
import numpy as np
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import RFormula

# Assemble V1..V27 into a single "features" vector column (V28 is not part of the formula)
rf = RFormula(formula="~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + V22 + V23 + V24 + V25 + V26 + V27")
final_df_rf = rf.fit(df_sample).transform(df_sample)

# Project the assembled features onto the first two principal components
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(final_df_rf)

result = model.transform(final_df_rf).select("pcaFeatures")

# Collect the two components to the driver and rebuild a two-column DataFrame
df_pca_arr = np.array(result.collect())
df_pca_no_class = sc.parallelize(df_pca_arr[:,0]).map(lambda x: x.tolist()).toDF(['V1', 'V2'])

# Number the rows so the components can be joined back to their class labels.
# Order by the id column: a double-quoted "id" is a string literal in Spark SQL, not a column reference.
df_pca_no_class = df_pca_no_class.withColumn("id", F.monotonically_increasing_id())
df_pca_no_class.createOrReplaceTempView('df_pca_no_class')
df_pca_no_class = spark.sql('select row_number() over (order by id) as num, * from df_pca_no_class')
df_pca_class = df_sample.select('class')
df_pca_class = df_pca_class.withColumn("id", F.monotonically_increasing_id())
df_pca_class.createOrReplaceTempView('df_pca_class')
df_pca_class = spark.sql('select row_number() over (order by id) as num, * from df_pca_class')

df_pca = df_pca_no_class.join(df_pca_class, "num", 'inner').drop("id")
display(df_pca)

num,V1,V2,class
1,8.683077875598103,-2.750988390642488,1
2,0.1972612809263669,-0.5282734746561177,1
3,7.177251269775736,-3.0127958902660863,1
4,18.711151509644218,-7.740982410722202,1
5,2.3475581645592003,-2.873541741426004,1
6,14.94176304743996,-7.884547067276009,1
7,14.873980360997232,-7.918092680537963,1
8,11.40613264494057,-6.454776614523022,1
9,11.374433486109629,-6.486436525065428,1
10,13.882699306547895,-6.1057288756230985,1


In [14]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

## Weighted Logistic Regression Approach

In [16]:
# Calculate the balancing ratio for the dataset
# balancing_ratio = NumberPositives / TotalRecords
balancing_ratio = float(492) / float(284807)
df_weighted = df.withColumn("weight", F.when(F.col('Class') == 0, balancing_ratio)\
                           .otherwise(1 - balancing_ratio))
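
As a quick check on this weighting scheme, the sketch below works through the arithmetic with the class counts from the dataset description. It is plain Python rather than Spark, and the variable names other than `balancing_ratio` are illustrative.

```python
# Class counts from the dataset description.
n_fraud = 492
n_total = 284807
n_normal = n_total - n_fraud

balancing_ratio = float(n_fraud) / n_total
weight_normal = balancing_ratio         # weight assigned to each Class == 0 row
weight_fraud = 1.0 - balancing_ratio    # weight assigned to each Class == 1 row

# Total weight carried by each class.
total_normal = n_normal * weight_normal
total_fraud = n_fraud * weight_fraud

print("weight per normal row: %.6f" % weight_normal)
print("weight per fraud row:  %.6f" % weight_fraud)
print("summed weight, normal: %.2f" % total_normal)
print("summed weight, fraud:  %.2f" % total_fraud)
```

Both sums equal `n_fraud * n_normal / n_total`, so under this scheme the two classes contribute the same total weight to a weighted loss, even though individual fraud rows weigh far more than individual normal rows.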

## Resampling
We undersample the normal (non-fraud) transactions in the training set.
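
The idea behind random undersampling can be sketched in plain Python, independent of Spark. The function `undersample`, its parameters, and the toy rows are hypothetical and only illustrate the technique: keep every fraud row and draw a uniform random subset of the normal rows. This is not the notebook's Spark code, which instead combines `sample(False, 0.1)` with `limit(n)`.

```python
import random


def undersample(rows, label_of, ratio=1, seed=42):
    """Keep every fraud row and a uniform random subset of the normal rows.

    ratio is the number of normal rows kept per fraud row.
    """
    fraud = [r for r in rows if label_of(r) == 1]
    normal = [r for r in rows if label_of(r) == 0]
    n_keep = min(len(normal), len(fraud) * ratio)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return fraud + rng.sample(normal, n_keep)


# Made-up data: 5 fraud rows and 995 normal rows, as (row_id, label) pairs.
toy_rows = [(i, 1 if i < 5 else 0) for i in range(1000)]
balanced = undersample(toy_rows, label_of=lambda r: r[1], ratio=1)

print("rows kept: %d" % len(balanced))
print("fraud rows kept: %d" % sum(r[1] for r in balanced))
```

Raising `ratio` keeps more normal rows per fraud row, which is the knob the later cells vary through `fraud_ratio`.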

In [18]:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Random Undersampling
df_fraud = df_train.filter(F.col('Class') == 1)
print df_fraud.count()
df_no_fraud_sample = df_train.filter(F.col('Class') == 0).sample(False, 0.1).limit(df_fraud.count())  # downsample the no-fraud class to match the fraud count
print df_no_fraud_sample.count()

df_train_sample = df_fraud.union(df_no_fraud_sample)
print df_train_sample.count()

In [19]:
def confusionmatrix(predictions):
  # Use 'Class' as the label column when it is present, otherwise fall back to 'label'
  label = F.col('Class') if 'Class' in predictions.columns else F.col('label')
  pred = F.col('prediction')

  tp = predictions.filter((label == 1) & (pred == 1)).count()
  tn = predictions.filter((label == 0) & (pred == 0)).count()
  fp = predictions.filter((label == 0) & (pred == 1)).count()
  fn = predictions.filter((label == 1) & (pred == 0)).count()

  print "True Positives:", tp
  print "True Negatives:", tn
  print "False Positives:", fp
  print "False Negatives:", fn
  print "Total", predictions.count()

  # Guard against empty denominators (no actual positives or no predicted positives)
  r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
  print "recall", r

  p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
  print "precision", p
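
The recall and precision formulas used in `confusionmatrix` can be tried on plain integers without a Spark session. The helper `precision_recall` and the counts below are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with a zero denominator is reported as 0.0."""
    precision = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Made-up counts: 120 frauds caught, 30 frauds missed, 600 false alarms.
p, r = precision_recall(tp=120, fp=600, fn=30)
print("precision: %.3f" % p)
print("recall:    %.3f" % r)
```

With these toy counts the model catches most frauds (high recall) but most of its alerts are false alarms (low precision), a trade-off that overall accuracy would hide.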

In [20]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.evaluation import BinaryClassificationMetrics

def data_prep_train(trainDF):
  
  # Assemble the input columns into a single feature vector
  featureAssembler = VectorAssembler()\
    .setInputCols(["Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12",\
                   "V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount"])\
    .setOutputCol("features")

  rf = RandomForestClassifier()\
    .setLabelCol("Class")\
    .setFeaturesCol("features")\
    .setNumTrees(100)\
    .setMaxBins(32)\
    .setMaxDepth(10)

  rfpipeline = Pipeline()\
    .setStages([featureAssembler, rf])

  rfmodel=rfpipeline.fit(trainDF)
  return rfmodel

In [21]:
# Try random forest for classification
rfmodel = data_prep_train(df_train_sample)

# Make predictions
predictions = rfmodel.transform(df_test)


evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [23]:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Random Undersampling
df_fraud = df_train.filter(F.col('Class') == 1)
print "Fraud count:", df_fraud.count()

for fraud_ratio in range(1, 20):
  print "---------- Predictions #", fraud_ratio, "----------"
  df_no_fraud_sample = df_train.filter(F.col('Class') == 0).sample(False, 0.1).limit(df_fraud.count()*fraud_ratio)  # downsample the no-fraud class
  print "No fraud count:", df_no_fraud_sample.count()
  
  df_train_sample = df_fraud.union(df_no_fraud_sample)
  print "Total record count:", df_train_sample.count()
  
  # Try random forest for classification
  rfmodel = data_prep_train(df_train_sample)

  # Make predictions
  predictions = rfmodel.transform(df_test)


  evaluator = MulticlassClassificationEvaluator()\
    .setLabelCol("Class")\
    .setPredictionCol("prediction")\
    .setMetricName("accuracy")
  accuracy = evaluator.evaluate(predictions)

  print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
  confusionmatrix(predictions)

In [24]:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Random Undersampling
df_fraud = df_train.filter(F.col('Class') == 1)
no_fraud_count = df_train.filter(F.col('Class') == 0).count()
print "Fraud count:", df_fraud.count()

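for fraud_ratio in range(2, 5):
  print "---------- Predictions #", fraud_ratio, "----------"
  # Downsample the no-fraud class to roughly 1/fraud_ratio of its rows.
  # A fixed 10% sample would always be smaller than no_fraud_count/fraud_ratio, so a limit on it would never bind.
  df_no_fraud_sample = df_train.filter(F.col('Class') == 0).sample(False, 1.0 / fraud_ratio)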
  print "No fraud count:", df_no_fraud_sample.count()
  
  df_train_sample = df_fraud.union(df_no_fraud_sample)
  print "Total record count:", df_train_sample.count()
  
  # Try random forest for classification
  rfmodel = data_prep_train(df_train_sample)

  # Make predictions
  predictions = rfmodel.transform(df_test)


  evaluator = MulticlassClassificationEvaluator()\
    .setLabelCol("Class")\
    .setPredictionCol("prediction")\
    .setMetricName("accuracy")
  accuracy = evaluator.evaluate(predictions)

  print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
  confusionmatrix(predictions)

In [25]:
# Continuing where the last cell stopped:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Random Undersampling
df_fraud = df_train.filter(F.col('Class') == 1)
print "Fraud count:", df_fraud.count()

for fraud_ratio in range(8, 20):
  print "---------- Predictions #", fraud_ratio, "----------"
  df_no_fraud_sample = df_train.filter(F.col('Class') == 0).sample(False, 0.1).limit(df_fraud.count()*fraud_ratio)  # downsample the non-fraud class
  print "No fraud count:", df_no_fraud_sample.count()
  
  df_train_sample = df_fraud.union(df_no_fraud_sample)
  print "Total record count:", df_train_sample.count()
  
  # Try random forest for classification
  rfmodel = data_prep_train(df_train_sample)

  # Make predictions
  predictions = rfmodel.transform(df_test)


  evaluator = MulticlassClassificationEvaluator()\
    .setLabelCol("Class")\
    .setPredictionCol("prediction")\
    .setMetricName("accuracy")
  accuracy = evaluator.evaluate(predictions)

  print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
  confusionmatrix(predictions)

In [26]:
# Continuing where the last cell stopped:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Random Undersampling
df_fraud = df_train.filter(F.col('Class') == 1)
print "Fraud count:", df_fraud.count()

fraud_ratio = 20
print "---------- Predictions #", fraud_ratio, "----------"
df_no_fraud_sample = df_train.filter(F.col('Class') == 0).sample(False, 0.1).limit(df_fraud.count()*fraud_ratio)  # downsample the non-fraud class
print "No fraud count:", df_no_fraud_sample.count()

df_train_sample = df_fraud.union(df_no_fraud_sample)
print "Total record count:", df_train_sample.count()

# Try random forest for classification
rfmodel = data_prep_train(df_train_sample)

# Make predictions
predictions = rfmodel.transform(df_test)


evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

A graph of these results was created in Excel.
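
Because only 492 of the 284,807 transactions are fraudulent, the accuracy printed by the cells above is dominated by the non-fraud class. The following is a minimal sketch in plain Python (not part of the original notebook; the helper name `summarize` is hypothetical) showing how precision and recall for the fraud class can be read off confusion-matrix counts, and why a model that never predicts fraud would still score above 99.8% accuracy.

```python
# Illustrative sketch: metrics from confusion-matrix counts for the fraud class.
# `summarize` is a hypothetical helper, not a function from the notebook.

def summarize(tp, fp, fn, tn):
    """Return accuracy, precision and recall given confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Dataset totals quoted in the dataset description: 492 frauds in 284,807 rows.
n_total, n_fraud = 284807, 492

# A "model" that labels every transaction as non-fraud.
always_negative = summarize(tp=0, fp=0, fn=n_fraud, tn=n_total - n_fraud)

print("accuracy  = %.4f%%" % (always_negative["accuracy"] * 100))
print("precision = %.2f" % always_negative["precision"])
print("recall    = %.2f" % always_negative["recall"])
```

Recall on the fraud class is zero here even though accuracy looks excellent, which is why the confusion matrix printed after each run is more informative than the accuracy line.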

In [28]:
# Keep cluster running

In [29]:
# Continuing where the last cell stopped:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Try random forest for classification
rfmodel = data_prep_train(df_train)

# Make predictions
predictions = rfmodel.transform(df_test)


evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [30]:
from pyspark.ml.classification import LogisticRegression

def data_prep_train_lr(trainDF):
  # Separate name so the random-forest data_prep_train defined earlier stays available to later cells

  # Assemble the input columns into a single feature vector
  featureAssembler = VectorAssembler()\
    .setInputCols(["Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12",
                   "V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount"])\
    .setOutputCol("features")

  # Logistic regression on the assembled features
  lr = LogisticRegression(maxIter=10, regParam=0.001)\
    .setLabelCol("Class")\
    .setFeaturesCol("features")

  lrpipeline = Pipeline()\
    .setStages([featureAssembler, lr])

  lrmodel = lrpipeline.fit(trainDF)

  return lrmodel

In [31]:
df_train, df_test = df.randomSplit([0.7, 0.3], 42)

# Try logistic regression for classification
lrmodel = data_prep_train_lr(df_train)

# Make predictions
predictions = lrmodel.transform(df_test)


evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [32]:
# Standardize Amount to zero mean and unit standard deviation
mean_Amount, stddev_Amount = df.select(F.mean("Amount"), F.stddev("Amount")).first()
df_scaled = df.withColumn("Amount_scaled", (F.col("Amount") - mean_Amount) / stddev_Amount)
df_scaled = df_scaled.drop('Time','Amount')
display(df_scaled)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class,Amount_scaled
-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,0,0.2449638333166789
1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.0089830991432281,0.0147241691924927,0,-0.3424739398649419
-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,0,1.1606838875569188
-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.23760893977178,0.377435874652262,-1.38702406270197,-0.0549519224713749,-0.226487263835401,0.178228225877303,0.507756869957169,-0.28792374549456,-0.631418117709045,-1.0596472454325,-0.684092786345479,1.96577500349538,-1.2326219700892,-0.208037781160366,-0.108300452035545,0.0052735967825345,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,0,0.1405340052658796
-1.15823309349523,0.877736754848451,1.548717846511,0.403033933955121,-0.407193377311653,0.0959214624684256,0.592940745385545,-0.270532677192282,0.817739308235294,0.753074431976354,-0.822842877946363,0.53819555014995,1.3458515932154,-1.11966983471731,0.175121130008994,-0.451449182813529,-0.237033239362776,-0.0381947870352842,0.803486924960175,0.408542360392758,-0.0094306971323291,0.79827849458971,-0.137458079619063,0.141266983824769,-0.206009587619756,0.502292224181569,0.219422229513348,0.215153147499206,0,-0.0734032113879591
-0.425965884412454,0.960523044882985,1.14110934232219,-0.168252079760302,0.42098688077219,-0.0297275516639742,0.476200948720027,0.260314333074874,-0.56867137571251,-0.371407196834471,1.34126198001957,0.359893837038039,-0.358090652573631,-0.137133700217612,0.517616806555742,0.401725895589603,-0.0581328233640131,0.0686531494425432,-0.0331937877876282,0.0849676720682049,-0.208253514656728,-0.559824796253248,-0.0263976679795373,-0.371426583174346,-0.232793816737034,0.105914779097957,0.253844224739337,0.0810802569229443,0,-0.3385558222734495
1.22965763450793,0.141003507049326,0.0453707735899449,1.20261273673594,0.191880988597645,0.272708122899098,-0.0051590028825098,0.0812129398830894,0.464959994783886,-0.0992543211289237,-1.41690724314928,-0.153825826253651,-0.75106271556262,0.16737196252175,0.0501435942254188,-0.443586797916727,0.002820512472347,-0.61198733994012,-0.0455750446637976,-0.21963255278686,-0.167716265815783,-0.270709726172363,-0.154103786809305,-0.780055415004671,0.75013693580659,-0.257236845917139,0.0345074297438413,0.0051677689062491,0,-0.3332783577624597
-0.644269442348146,1.41796354547385,1.0743803763556,-0.492199018495015,0.948934094764157,0.428118462833089,1.12063135838353,-3.80786423873589,0.615374730667027,1.24937617815176,-0.619467796121913,0.291474353088705,1.75796421396042,-1.32386521970526,0.686132504394383,-0.0761269994382006,-1.2221273453247,-0.358221569869078,0.324504731321494,-0.156741852488285,1.94346533978412,-1.01545470979971,0.057503529867291,-0.649709005559993,-0.415266566234811,-0.0516342969262494,-1.20692108094258,-1.08533918832377,0,-0.1901071425059848
-0.89428608220282,0.286157196276544,-0.113192212729871,-0.271526130088604,2.6695986595986,3.72181806112751,0.370145127676916,0.851084443200905,-0.392047586798604,-0.410430432848439,-0.705116586646536,-0.110452261733098,-0.286253632470583,0.0743553603016731,-0.328783050303565,-0.210077268148783,-0.499767968800267,0.118764861004217,0.57032816746536,0.0527356691149697,-0.0734251001059225,-0.268091632235551,-0.204232669947878,1.0115918018785,0.373204680146282,-0.384157307702294,0.0117473564581996,0.14240432992147,0,0.0193922062636124
-0.33826175242575,1.11959337641566,1.04436655157316,-0.222187276738296,0.49936080649727,-0.24676110061991,0.651583206489972,0.0695385865186387,-0.736727316364109,-0.366845639206541,1.01761446783262,0.836389570307029,1.00684351373408,-0.443522816876142,0.150219101422635,0.739452777052119,-0.540979921943059,0.47667726004282,0.451772964394125,0.203711454727929,-0.246913936910008,-0.633752642406113,-0.12079408408185,-0.385049925313426,-0.0697330460416923,0.0941988339514961,0.246219304619926,0.0830756493473326,0,-0.3385158414816995


In [33]:
def data_prep_train_scaled(trainDF):
  
  # Assemble the PCA components and the scaled amount into a single feature vector
  featureAssembler = VectorAssembler()\
    .setInputCols(["V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12",
                   "V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount_scaled"])\
    .setOutputCol("features")

  rf = RandomForestClassifier()\
    .setLabelCol("Class")\
    .setFeaturesCol("features")\
    .setNumTrees(100)\
    .setMaxBins(32)\
    .setMaxDepth(10)

  rfpipeline = Pipeline()\
    .setStages([featureAssembler, rf])

  rfmodel=rfpipeline.fit(trainDF)
  return rfmodel

In [34]:
import pandas as pd
from imblearn.over_sampling import SMOTE,RandomOverSampler
from sklearn.model_selection import train_test_split
from collections import Counter
df_pd = df.toPandas()

#df_train, df_test = df.randomSplit([0.7, 0.3], 42)
#df_train_pd = df_train.toPandas()

X = df_pd.iloc[:, df_pd.columns != 'Class']
Y = df_pd.iloc[:, df_pd.columns == 'Class']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

sm = SMOTE(random_state=12, ratio = 'auto', kind = 'regular')

x_train_res, y_train_res = sm.fit_sample(x_train, y_train)
print 'Resampled dataset shape {}'.format(Counter(y_train_res))

In [35]:
xcol = df.columns
ycol = [xcol.pop()]
df_pd_res_x = pd.DataFrame(x_train_res, columns=xcol)
df_pd_res_y = pd.DataFrame(y_train_res, columns=ycol)
df_pd_res = pd.concat([df_pd_res_x,df_pd_res_y], axis=1)
df_pd_res.head()

In [36]:
df_pd_x_test = pd.DataFrame(x_test, columns=xcol)
df_pd_y_test = pd.DataFrame(y_test, columns=ycol)
df_pd_test = pd.concat([df_pd_x_test,df_pd_y_test], axis=1)
df_pd_test.head()

In [37]:
df_train_res = spark.createDataFrame(df_pd_res)
df_test = spark.createDataFrame(df_pd_test)

In [38]:
# Try random forest for classification
rfmodel = data_prep_train(df_train_res)

# Make predictions
predictions = rfmodel.transform(df_test)

evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [39]:
# Standardize Amount to zero mean and unit standard deviation
mean_Amount, stddev_Amount = df.select(F.mean("Amount"), F.stddev("Amount")).first()
df_scaled = df.withColumn("Amount_scaled", (F.col("Amount") - mean_Amount) / stddev_Amount)
df_scaled = df_scaled.drop('Time','Amount')

df_pd = df_scaled.toPandas()

X = df_pd.iloc[:, df_pd.columns != 'Class']
Y = df_pd.iloc[:, df_pd.columns == 'Class']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

sm = SMOTE(random_state=12, ratio = 'auto', kind = 'regular')

x_train_res, y_train_res = sm.fit_sample(x_train, y_train)

xcol = ["V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount_scaled"]
ycol = ["Class"]
df_pd_res_x = pd.DataFrame(x_train_res, columns=xcol)
df_pd_res_y = pd.DataFrame(y_train_res, columns=ycol)
df_pd_res = pd.concat([df_pd_res_x,df_pd_res_y], axis=1)

df_pd_x_test = pd.DataFrame(x_test, columns=xcol)
df_pd_y_test = pd.DataFrame(y_test, columns=ycol)
df_pd_test = pd.concat([df_pd_x_test,df_pd_y_test], axis=1)

df_train_res = spark.createDataFrame(df_pd_res)
df_test = spark.createDataFrame(df_pd_test)

# Try random forest for classification
rfmodel = data_prep_train_scaled(df_train_res)

# Make predictions
predictions = rfmodel.transform(df_test)

evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [40]:
# DOES NOT WORK IN SPARK
import numpy as np
from imblearn.over_sampling import SMOTE, RandomOverSampler

df_train, df_test = df.randomSplit([0.7, 0.3], 42)

xcol = df_train.columns
ycol = xcol.pop()

df_train_x = df_train.select(xcol) # Feature columns
df_train_y = df_train.select(ycol) # Class column


df_train_x_array =  np.array(df_train_x.select(df_train_x.columns).collect())
df_train_y_array =  np.array(df_train_y.select(df_train_y.columns).collect())

sm = SMOTE(random_state=12, ratio = 'auto', kind = 'regular')
X_resampled, y_resampled = sm.fit_sample(df_train_x_array, df_train_y_array)

In [41]:
df_X_resampled = sc.parallelize(X_resampled).toDF(df_train_x.columns)
#X_resampled

### Model selection (a.k.a. hyperparameter tuning)
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning. Tuning may be done for individual Estimators such as LogisticRegression, or for entire Pipelines which include multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately.
https://spark.apache.org/docs/latest/ml-tuning.html
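
To make the cost of tuning concrete, here is a small plain-Python sketch (an illustration only, not Spark code) that enumerates a hyperparameter grid as a Cartesian product, which is conceptually what `ParamGridBuilder` builds, and counts the model fits that k-fold cross-validation implies. The values mirror the grid used in the next cells; the helper names `build_grid` and `count_fits` are hypothetical.

```python
from itertools import product

def build_grid(param_values):
    """Return one dict per combination of the given hyperparameter values."""
    names = sorted(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def count_fits(grid, num_folds):
    """Fits during k-fold CV: one per (combination, fold), plus a final refit
    of the best combination on the full training set."""
    return len(grid) * num_folds + 1

grid = build_grid({
    "numTrees": [75, 100, 125],
    "maxBins": [20, 32, 45],
    "maxDepth": [8, 10, 15],
})

print("combinations: %d" % len(grid))
print("model fits with 10 folds: %d" % count_fits(grid, num_folds=10))
```

With 27 combinations and 10 folds this comes to 271 random-forest fits, so a smaller grid or fewer folds may be worth considering on a small cluster.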

In [43]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [44]:
# Assemble the input columns into a single feature vector
featureAssembler = VectorAssembler()\
  .setInputCols(["Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount"])\
  .setOutputCol("features")

# Starting values; numTrees, maxBins and maxDepth are overridden by the parameter grid below
rf_cv = RandomForestClassifier()\
  .setLabelCol("label")\
  .setFeaturesCol("features")\
  .setNumTrees(10)\
  .setMaxBins(50)\
  .setMaxDepth(20)

pipeline_cv = Pipeline(stages=[featureAssembler,rf_cv])

trainDF_cv = trainDF.withColumnRenamed('Class', 'label')
testDF_cv = testDF.withColumnRenamed('Class', 'label')

# addGrid expects the Param objects (rf_cv.numTrees, ...), not the setter methods
paramGrid = ParamGridBuilder() \
    .addGrid(rf_cv.numTrees, [75, 100, 125]) \
    .addGrid(rf_cv.maxBins, [20, 32, 45]) \
    .addGrid(rf_cv.maxDepth, [8, 10, 15]) \
    .build()

crossval = CrossValidator(estimator=pipeline_cv,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=10)

cvModel = crossval.fit(trainDF_cv)

In [45]:
prediction = cvModel.transform(testDF_cv)

evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("label")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(prediction)

print "Accuracy = %5.2f%%" % ((accuracy)*100)

confusionmatrix(prediction)

In [46]:
# Make predictions on the unseen imbalanced dataset
im_prediction = cvModel.transform(im_test)
confusionmatrix(im_prediction)

Try slightly imbalanced data for training.
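
As a rough illustration of what "slightly imbalanced" means here, the sketch below (plain Python, with a hypothetical helper `undersample` and made-up row counts) keeps every fraud row and a random subset of non-fraud rows so that the non-fraud class is a chosen multiple of the fraud class. With a multiple of 3, as in the next cell, frauds make up about a quarter of the training sample.

```python
import random

def undersample(labels, ratio, seed=42):
    """Keep all positives (1) and at most ratio * n_positives negatives (0).

    Returns the sorted indices of the rows that are kept.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), ratio * len(pos))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)

# Made-up toy data: 10 fraud rows among 1,000 transactions.
labels = [1] * 10 + [0] * 990
kept = undersample(labels, ratio=3)

n_fraud = sum(labels[i] for i in kept)
print("rows kept: %d" % len(kept))
print("fraud share: %.2f" % (n_fraud / float(len(kept))))
```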

In [48]:
df_weighted_train, im_test = df_weighted.randomSplit([0.9, 0.1], 42)

# Random Undersampling
df_fraud = df_weighted_train.filter(F.col('Class') == 1)
print df_fraud.count()
df_no_fraud_sample = df_weighted_train.filter(F.col('Class') == 0).sample(False, 0.1).limit(3*df_fraud.count())  # downsample the non-fraud class to 3x the fraud count
print df_no_fraud_sample.count()

df_weighted_sample = df_fraud.union(df_no_fraud_sample)
print df_weighted_sample.count()

# Split the undersampled data into training and test sets
splits = df_weighted_sample.randomSplit([0.7, 0.3], 42) # 70/30 split for training/testing
(trainDF, testDF) = (splits[0], splits[1])

In [49]:
# Try random forest for classification
rfmodel = data_prep_train(trainDF)

# Make predictions
predictions = rfmodel.transform(testDF)

evaluator = MulticlassClassificationEvaluator()\
  .setLabelCol("Class")\
  .setPredictionCol("prediction")\
  .setMetricName("accuracy")
accuracy = evaluator.evaluate(predictions)

print "Test Accuracy = %5.2f%%" % ((accuracy)*100)
confusionmatrix(predictions)

In [50]:
im_prediction = rfmodel.transform(im_test)
confusionmatrix(im_prediction)