# Credit Card fraud detection - Using H2O AutoML

#### We explored using the popular H2O.ai platform, specifically the AutoML package 
#### This is the 2nd of 2 notebooks on H2O

This notebook shows the call to AutoML with max_models=5, both with and without oversampling.

The results for all hyperparameter-tuning combinations are included in the paper write-up for our project.

One drawback of AutoML is that it is both time and resource intensive, because it trains and evaluates multiple models as part of a single run.

References for this notebook (H2O documentation):
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/max_after_balance_size.html


### Importing and initiating H2O instance

In [2]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,4 hours 4 mins
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,27 days
H2O cluster name:,H2O_from_python_ubuntu_00mwx9
H2O cluster total nodes:,1
H2O cluster free memory:,12.38 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


### Importing and preparing the dataset

#### Loading the data

We used the credit card fraud dataset from [Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
It contains credit card transactions made by European cardholders over a period of two days in September 2013, with 492 frauds out of 284,807 transactions. The frauds account for 0.172% of the total transactions.
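
As a quick sanity check on the class imbalance quoted above, the fraud rate can be recomputed from the two counts. This is a minimal plain-Python sketch; the variable names are illustrative and are not part of the original notebook.

```python
# Counts quoted in the dataset description
n_transactions = 284_807
n_frauds = 492

# Fraction of rows labelled as fraud
fraud_rate = n_frauds / n_transactions

# Number of legitimate transactions for every fraudulent one
normal_per_fraud = (n_transactions - n_frauds) / n_frauds

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Legitimate transactions per fraud: {normal_per_fraud:.0f}")
```

With several hundred legitimate rows for every fraud, plain accuracy says very little about a model, which is why the leaderboards in this notebook are better read through `auc` and `mean_per_class_error`.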

Kaggle describes the data as having 30 numerical features. For privacy reasons, 28 of them (V1 to V28) are principal components obtained from a PCA transformation of the original features.

The two features that haven't been changed are Time and Amount. Time contains the seconds elapsed between each transaction and the first transaction in the dataset.

The 'Class' column is the target label, with 1 representing a fraudulent transaction and 0 representing a normal one.

In [3]:
credit = h2o.import_file("creditcard.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
# For binary classification, response should be a factor
credit['Class'] = credit['Class'].asfactor()

# Drop the Time column, which is not used as a predictor here
credit = credit.drop(['Time'], axis=1)
credit.head()

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
-1.35981,-0.0727812,2.53635,1.37816,-0.338321,0.462388,0.239599,0.0986979,0.363787,0.0907942,-0.5516,-0.617801,-0.99139,-0.311169,1.46818,-0.470401,0.207971,0.0257906,0.403993,0.251412,-0.0183068,0.277838,-0.110474,0.0669281,0.128539,-0.189115,0.133558,-0.0210531,149.62,0
1.19186,0.266151,0.16648,0.448154,0.0600176,-0.0823608,-0.078803,0.0851017,-0.255425,-0.166974,1.61273,1.06524,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.0690831,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.0089831,0.0147242,2.69,0
-1.35835,-1.34016,1.77321,0.37978,-0.503198,1.8005,0.791461,0.247676,-1.51465,0.207643,0.624501,0.0660837,0.717293,-0.165946,2.34586,-2.89008,1.10997,-0.121359,-2.26186,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.0553528,-0.0597518,378.66,0
-0.966272,-0.185226,1.79299,-0.863291,-0.0103089,1.2472,0.237609,0.377436,-1.38702,-0.0549519,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.05965,-0.684093,1.96578,-1.23262,-0.208038,-0.1083,0.0052736,-0.190321,-1.17558,0.647376,-0.221929,0.0627228,0.0614576,123.5,0
-1.15823,0.877737,1.54872,0.403034,-0.407193,0.0959215,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.34585,-1.11967,0.175121,-0.451449,-0.237033,-0.0381948,0.803487,0.408542,-0.0094307,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
-0.425966,0.960523,1.14111,-0.168252,0.420987,-0.0297276,0.476201,0.260314,-0.568671,-0.371407,1.34126,0.359894,-0.358091,-0.137134,0.517617,0.401726,-0.0581328,0.0686531,-0.0331938,0.0849677,-0.208254,-0.559825,-0.0263977,-0.371427,-0.232794,0.105915,0.253844,0.0810803,3.67,0
1.22966,0.141004,0.0453708,1.20261,0.191881,0.272708,-0.005159,0.0812129,0.46496,-0.0992543,-1.41691,-0.153826,-0.751063,0.167372,0.0501436,-0.443587,0.00282051,-0.611987,-0.045575,-0.219633,-0.167716,-0.27071,-0.154104,-0.780055,0.750137,-0.257237,0.0345074,0.00516777,4.99,0
-0.644269,1.41796,1.07438,-0.492199,0.948934,0.428118,1.12063,-3.80786,0.615375,1.24938,-0.619468,0.291474,1.75796,-1.32387,0.686133,-0.076127,-1.22213,-0.358222,0.324505,-0.156742,1.94347,-1.01545,0.0575035,-0.649709,-0.415267,-0.0516343,-1.20692,-1.08534,40.8,0
-0.894286,0.286157,-0.113192,-0.271526,2.6696,3.72182,0.370145,0.851084,-0.392048,-0.41043,-0.705117,-0.110452,-0.286254,0.0743554,-0.328783,-0.210077,-0.499768,0.118765,0.570328,0.0527357,-0.0734251,-0.268092,-0.204233,1.01159,0.373205,-0.384157,0.0117474,0.142404,93.2,0
-0.338262,1.11959,1.04437,-0.222187,0.499361,-0.246761,0.651583,0.0695386,-0.736727,-0.366846,1.01761,0.83639,1.00684,-0.443523,0.150219,0.739453,-0.54098,0.476677,0.451773,0.203711,-0.246914,-0.633753,-0.120794,-0.38505,-0.069733,0.0941988,0.246219,0.0830756,3.68,0




In [5]:
# Set the predictor names and the response column name.
# After dropping Time, columns[0:28] selects V1 to V28; Amount (index 28) is not included.
predictors = credit.columns[0:28]
response = 'Class'

In [6]:
# split into train and validation sets
train, valid = credit.split_frame(ratios = [.7], seed = 1234)
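
H2O's `split_frame` uses probabilistic splitting, so the 70/30 ratio is approximate rather than exact, and with only 492 frauds it is worth checking how many land in each split. The sketch below imitates that idea in plain Python on synthetic labels with the same class counts. The function name is hypothetical, and Python's `random` module will not reproduce the rows H2O selects for the same seed.

```python
import random

def probabilistic_split(labels, ratio=0.7, seed=1234):
    """Assign each label to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)
    train, valid = [], []
    for y in labels:
        (train if rng.random() < ratio else valid).append(y)
    return train, valid

# Synthetic labels with the dataset's class counts: 492 frauds, the rest normal
labels = [1] * 492 + [0] * (284_807 - 492)

train_labels, valid_labels = probabilistic_split(labels)

# Rows and fraud cases in each split
print("train:", len(train_labels), "rows,", sum(train_labels), "frauds")
print("valid:", len(valid_labels), "rows,", sum(valid_labels), "frauds")
```

On the real frames, the equivalent check would be to count the values of the `Class` column in `train` and `valid` before training.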

### Calling the AutoML package (without oversampling to balance Class)

In [7]:
# Run AutoML for up to 5 base models (limited to 1 hour max runtime by default)
aml_ub = H2OAutoML(max_models=5, seed=1234)
aml_ub.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%


### Model leaderboard shows XGBoost as the top-performing technique

In [9]:
# View the AutoML Leaderboard
lb = aml_ub.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

model_id,auc,logloss,mean_per_class_error,rmse,mse
XGBoost_1_AutoML_20181220_034656,0.981113,0.00249092,0.080723,0.0198301,0.000393234
GLM_grid_1_AutoML_20181220_034656_model_1,0.978821,0.00398823,0.0999025,0.0261906,0.000685946
XGBoost_2_AutoML_20181220_034656,0.96814,0.0030263,0.0968344,0.0205132,0.000420792
StackedEnsemble_BestOfFamily_AutoML_20181220_034656,0.956275,0.0029775,0.0938868,0.0203581,0.000414451
XRT_1_AutoML_20181220_034656,0.955907,0.00426309,0.0895131,0.0203825,0.000415446
StackedEnsemble_AllModels_AutoML_20181220_034656,0.950469,0.00297353,0.0953556,0.0203903,0.000415764
DRF_1_AutoML_20181220_034656,0.941789,0.00536516,0.0982831,0.0204059,0.000416402
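
The `mean_per_class_error` column in the leaderboard is worth reading alongside `auc` for data this imbalanced. As a rough illustration, it is the average of the error rates computed separately for each class at a chosen threshold. The confusion-matrix counts below are hypothetical, chosen only to show the arithmetic; they are not results from this notebook.

```python
def mean_per_class_error(tn, fp, fn, tp):
    """Average the error rate of the negative class and the positive class."""
    error_negative = fp / (tn + fp)   # share of normal rows flagged as fraud
    error_positive = fn / (fn + tp)   # share of fraud rows that were missed
    return (error_negative + error_positive) / 2

def accuracy(tn, fp, fn, tp):
    """Overall share of correctly classified rows."""
    return (tn + tp) / (tn + fp + fn + tp)

# Hypothetical counts for a validation set with 150 frauds
tn, fp, fn, tp = 85_000, 10, 24, 126

print("accuracy:", accuracy(tn, fp, fn, tp))
print("mean per class error:", mean_per_class_error(tn, fp, fn, tp))
```

In this made-up case the accuracy is very close to 1 even though 24 of 150 frauds are missed, while the per-class average makes that miss rate visible.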




### Calling the AutoML package (using oversampling to balance Class)

In [6]:
# Run AutoML for up to 5 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=5, seed=1234, balance_classes = True, max_after_balance_size = 0.85)
aml.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%
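
The `max_after_balance_size` argument caps the size of the training data after balancing, relative to its original size, so 0.85 keeps the balanced frame at no more than 85% of the original row count. The sketch below only works through that arithmetic. It assumes a training frame of roughly 70% of the 284,807 rows and, as a simplification, equal class counts after balancing. The helper name is hypothetical and this is not H2O's internal sampling code.

```python
def balanced_size_cap(n_rows, max_after_balance_size, n_classes=2):
    """Return the row cap after balancing and an equal per-class share of it."""
    cap = int(n_rows * max_after_balance_size)
    per_class = cap // n_classes      # assumption: classes end up equally sized
    return cap, per_class

# Approximate size of the 70% training split (assumption, the split is not exact)
n_train_approx = int(0.7 * 284_807)

cap, per_class = balanced_size_cap(n_train_approx, 0.85)

print("approximate training rows:", n_train_approx)
print("maximum rows after balancing:", cap)
print("rows per class if balanced equally:", per_class)
```

Because the cap is below 1.0, balancing under this setting cannot simply replicate fraud rows until they match the majority class; the majority class has to shrink as well.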


### Model leaderboard shows XGBoost as the top-performing technique

In [7]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)


model_id,auc,logloss,mean_per_class_error,rmse,mse
XGBoost_1_AutoML_20181220_015035,0.981113,0.00249092,0.080723,0.0198301,0.000393234
GLM_grid_1_AutoML_20181220_015035_model_1,0.978821,0.00398823,0.0999025,0.0261906,0.000685946
XGBoost_2_AutoML_20181220_015035,0.964032,0.00359919,0.0953707,0.0204597,0.000418601
DRF_1_AutoML_20181220_015035,0.952295,0.0108255,0.0924557,0.035489,0.00125947
StackedEnsemble_BestOfFamily_AutoML_20181220_015035,0.947073,0.00314415,0.101216,0.0208256,0.000433706
StackedEnsemble_AllModels_AutoML_20181220_015035,0.946946,0.00316611,0.102682,0.0209337,0.000438221
XRT_1_AutoML_20181220_015035,0.946124,0.0114528,0.096847,0.0356493,0.00127087




In [None]:
# The leader model is stored here
aml.leader

In [None]:
# If you need to generate predictions on a test set, you can call
# predict on the `H2OAutoML` object, or on the leader model object directly.
# No separate `test` frame is created in this notebook, so the validation
# frame `valid` is used here as a stand-in.

preds = aml.predict(valid)

In [None]:
# or:
preds = aml.leader.predict(valid)
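
Beyond raw predictions, the leader can also be scored on a held-out frame. The helper below is a minimal sketch: `summarize_leader` is a hypothetical name, and it relies only on the `model_performance`, `auc` and `logloss` methods of H2O models and metrics objects. It assumes `aml` has already been trained and that a frame such as `valid` is available.

```python
def summarize_leader(aml, frame):
    """Return headline metrics for the AutoML leader evaluated on `frame`."""
    # model_performance scores the model on the given frame and
    # returns a metrics object
    perf = aml.leader.model_performance(frame)
    return {
        "auc": perf.auc(),
        "logloss": perf.logloss(),
    }

# Example call inside this notebook (requires a running H2O cluster):
# summary = summarize_leader(aml, valid)
# print(summary)
```

Running the same helper on `aml_ub` and `aml` gives a like-for-like comparison of the runs without and with class balancing on the same frame.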