# Prediction of Fraudulent ATM Transactions

The profitability and reputation of PredCatch Analytics' Australian banking client are being hit by fraudulent ATM transactions. We need to build a predictive model that catches such fraudulent transactions in real time so that they can be declined. Let's walk through one approach to building such a model.

### Importing the libraries required for model building

In [1]:
#Basic Libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import math
from sklearn.model_selection import train_test_split,KFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
import datetime
import random
import tensorflow as tf
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#Classification Libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#Other Data Processing libraries
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold

### Reading given data in CSV files into data frames

In [2]:
#Reading all the given data into respective data frames
geo_df = pd.read_csv(r"D:\Hackathon\Data\Data\Geo_scores.csv")
instance_df = pd.read_csv(r"D:\Hackathon\Data\Data\instance_scores.csv")
lambda_df = pd.read_csv(r"D:\Hackathon\Data\Data\Lambda_wts.csv")
qset_df = pd.read_csv(r"D:\Hackathon\Data\Data\Qset_tats.csv")
train_df = pd.read_csv(r"D:\Hackathon\Data\Data\train.csv")
test_df = pd.read_csv(r"D:\Hackathon\Data\Data\test_share.csv")

### There are multiple values of geo_score, instance_scores and qset score for a single ID, whereas lambda_wt has a single value for each Group.

### Since there are multiple values for these three scores, we will group them by ID and aggregate on the mean

In [12]:
sum(instance_df['id'].value_counts() > 1)

284807

In [11]:
sum(lambda_df['Group'].value_counts() > 1)

0

In [16]:
sum((qset_df['id'].value_counts() > 1))

284807

In [14]:
sum(geo_df['id'].value_counts() > 1)

284807

#### Below we group by ID and aggregate on the mean, then reset the index so that the ID column can be used when joining with the train and test data

In [18]:
#grouping by unique columns and aggregating based on mean for exporting the column to train and test data frame
geo_scores = geo_df.groupby(['id']).mean().reset_index()
instance_score = instance_df.groupby(['id']).mean().reset_index()
qset_score = qset_df.groupby(['id']).mean().reset_index()
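The duplicate check and the mean aggregation can be illustrated on a toy frame. This is only a sketch: `toy_scores` and its values are made up and merely stand in for one of the score files.

```python
import pandas as pd

# Toy stand-in for one of the score files: several rows per id (values are made up).
toy_scores = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3],
    "geo_score": [0.2, 0.4, -1.0, 0.0, 1.0, 5.0],
})

# Number of ids that appear more than once (same check as value_counts() > 1).
n_repeated_ids = int((toy_scores["id"].value_counts() > 1).sum())

# Collapse to one row per id by averaging; as_index=False keeps id as a regular column,
# which has the same effect as groupby(...).mean().reset_index().
toy_mean = toy_scores.groupby("id", as_index=False).mean()

print(n_repeated_ids)
print(toy_mean)
```

After the aggregation every id occurs exactly once, so a later join on id cannot multiply rows.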

### Merging the different scores and lambda_wt with the train and test data. A left join is used to retain the original order of records

In [20]:
#Merging the above scores with the train data frame using a left join to retain the observation order as it is.
train_df_all = pd.merge(train_df, geo_scores, on='id',how='left')
train_df_all = pd.merge(train_df_all, instance_score, on='id',how='left')
train_df_all = pd.merge(train_df_all,lambda_df, on='Group',how='left')
train_df_all = pd.merge(train_df_all, qset_score, on='id',how='left')

In [21]:
#Merging the above scores with the test data frame using a left join to retain the observation order as it is.
test_df_all = pd.merge(test_df, geo_scores, on='id',how='left')
test_df_all = pd.merge(test_df_all, instance_score, on='id',how='left')
test_df_all = pd.merge(test_df_all, lambda_df, on='Group',how='left')
test_df_all = pd.merge(test_df_all, qset_score, on='id',how='left')
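A left join keeps the row count and order of the left frame only if the right-hand keys are unique. The sketch below, on made-up toy frames, uses the `validate` argument of `DataFrame.merge` to make that assumption explicit; the frame names and values are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values) mirroring the shape of the real inputs.
toy_train = pd.DataFrame({"id": [10, 11, 12], "Group": ["G1", "G2", "G1"]})
toy_geo = pd.DataFrame({"id": [10, 11, 12], "geo_score": [0.1, 0.2, 0.3]})
toy_lambda = pd.DataFrame({"Group": ["G1", "G2"], "lambda_wt": [1.5, -0.5]})

# validate raises MergeError if the key relationship is not what we expect,
# e.g. if an id were duplicated in the aggregated score frame.
merged = toy_train.merge(toy_geo, on="id", how="left", validate="one_to_one")
merged = merged.merge(toy_lambda, on="Group", how="left", validate="many_to_one")

# The left join must not add or drop rows.
assert len(merged) == len(toy_train)
print(merged)
```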

### Now let's check if there is any data processing required for modelling.

In [22]:
#checking for number of features and observations in train data
train_df_all.shape

#There are 32 features with 227845 observations(This includes Target variable as well)

(227845, 32)

In [23]:
#checking for number of features and observations in test data
test_df_all.shape

#There are 31 features with 56962 observations(This doesn't include Target variable)

(56962, 31)

### Are there any NA values in our data

In [24]:
#looking at number of NA's
train_df_all.isna().sum().sum()

0

In [25]:
#looking at number of NA's
test_df_all.isna().sum().sum()

0

We can see there are no missing values in either the train or the test data, hence no imputation is required.

In [26]:
#Verifying with data types in our data
train_df_all.dtypes

id                        int64
Group                    object
Per1                    float64
Per2                    float64
Per3                    float64
Per4                    float64
Per5                    float64
Per6                    float64
Per7                    float64
Per8                    float64
Per9                    float64
Dem1                    float64
Dem2                    float64
Dem3                    float64
Dem4                    float64
Dem5                    float64
Dem6                    float64
Dem7                    float64
Dem8                    float64
Dem9                    float64
Cred1                   float64
Cred2                   float64
Cred3                   float64
Cred4                   float64
Cred5                   float64
Cred6                   float64
Normalised_FNT          float64
Target                    int64
geo_score               float64
instance_scores         float64
lambda_wt               float64
qsets_no

All data types are numeric (either int64 or float64) except Group. Let's see whether dummies are required for the Group variable.

In [29]:
#checking for number of unique value of column group
len(train_df['Group'].unique())

1301

We can see there are 1301 unique values in the Group column. Let's break this down further and see how many groups have more than 150 observations.

In [31]:
sum(train_df['Group'].value_counts()>150)

316

There are 316 groups with more than 150 observations each, which suggests that Group behaves like an ID; we also already have the Lambda_wt corresponding to each Group. So Group can be ignored in our modelling process.

Similarly, the ID value is unique for each observation, so we ignore ID as well. Below we drop the ID and Group columns.

In [32]:
#removing ID since they don't make any sense in our prediction and presently Group not considered as it will involve huge number of dummies
train = train_df_all.drop(['id','Group'],axis=1)
test = test_df_all.drop(['id','Group'],axis=1)

### Our data now has no missing values and all data types are suitable for modelling. Now checking the distribution of our target variable.

In [33]:
#checking the class distribution
print('No Frauds (Target:0)', round(train_df['Target'].value_counts()[0]/len(train) * 100,2), '% of the dataset')
print('Frauds (Target:1)', round(train_df['Target'].value_counts()[1]/len(train) * 100,2), '% of the dataset')

No Frauds (Target:0) 99.83 % of the dataset
Frauds (Target:1) 0.17 % of the dataset
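The same percentages can be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up target column (the 998/2 split is an assumption chosen for illustration, not the real data):

```python
import pandas as pd

# Hypothetical target column: 998 non-fraud rows and 2 fraud rows.
toy_target = pd.Series([0] * 998 + [1] * 2, name="Target")

# normalize=True returns class shares instead of counts; convert to percentages.
class_share = toy_target.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```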


#### Clearly 99.83% of records are non-fraud and 0.17% are fraud, so our data is highly imbalanced. We need to take care of this during modelling, otherwise our predictions will lean towards No Fraud (class 0)

In [35]:
#looking how features spread in data
train.describe()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,geo_score,instance_scores,lambda_wt,qsets_normalized_tat
count,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,...,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0
mean,0.666006,0.667701,0.666315,0.666687,0.666723,0.667378,0.666934,0.666279,0.666688,0.666576,...,0.666755,0.666878,0.666566,0.666776,-227.95417,0.001729,-0.000478,-0.000123,0.00035,0.000115
std,0.654133,0.548305,0.506357,0.471956,0.461393,0.444573,0.415657,0.401546,0.366537,0.340436,...,0.174204,0.160803,0.135762,0.111612,61.951661,0.041548,1.076016,1.091488,0.957957,0.945602
min,-18.136667,-23.573333,-15.443333,-1.226667,-37.246667,-8.053333,-13.853333,-23.74,-3.81,-0.893333,...,-2.766667,-0.08,-6.856667,-4.476667,-250.0,0.0,-25.983333,-24.59,-19.21,-31.45
25%,0.36,0.47,0.37,0.383333,0.436667,0.41,0.483333,0.596667,0.453333,0.413333,...,0.56,0.556667,0.643333,0.65,-248.6175,0.0,-0.43,-0.54,-0.43,-0.52
50%,0.67,0.69,0.726667,0.66,0.65,0.576667,0.68,0.673333,0.65,0.656667,...,0.673333,0.65,0.666667,0.67,-244.51,0.0,0.15,-0.09,0.05,-0.07
75%,1.103333,0.933333,1.01,0.913333,0.87,0.8,0.856667,0.776667,0.866667,0.913333,...,0.783333,0.746667,0.696667,0.693333,-230.75,0.0,0.65,0.45,0.49,0.4375
max,1.483333,8.02,3.793333,6.163333,12.266667,25.1,40.863333,7.336667,5.863333,4.673333,...,3.173333,1.84,11.203333,11.95,6172.79,1.0,7.85,23.75,10.53,10.233333


All features are on almost the same scale; only Normalised_FNT lies in [-250, 6172.79]. Let's scale it so that this variable is not given more weight in our model.

In [36]:
# We could see that Normalised_FNT has a different scale when compared to other features, hence scaling for modelling purposes
from sklearn.preprocessing import StandardScaler, RobustScaler

# RobustScaler centres on the median and scales by the interquartile range,
# which makes it less sensitive to outliers than StandardScaler or min-max scaling

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

# Fit the scaler on train only and reuse the fitted median/IQR on test, so both are scaled identically
train['Normalised_FNT_Scaled'] = rob_scaler.fit_transform(train['Normalised_FNT'].values.reshape(-1,1))
test['Normalised_FNT_Scaled'] = rob_scaler.transform(test['Normalised_FNT'].values.reshape(-1,1))

#Now dropping Normalised_FNT, as we have added normalised FNT with scale
train.drop(["Normalised_FNT"], axis=1, inplace=True)
test.drop(["Normalised_FNT"], axis=1, inplace=True)
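To make the scaling step concrete, the sketch below checks on synthetic data that RobustScaler computes (x - median) / IQR, and that a scaler fitted on the training column can be reused on a test column. The arrays `toy_train` and `toy_test` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-ins for a train and a test column (values are made up).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=-240.0, scale=10.0, size=(200, 1))
toy_train[0, 0] = 6000.0  # one extreme outlier, as seen in Normalised_FNT
toy_test = rng.normal(loc=-240.0, scale=10.0, size=(50, 1))

# Fit on train, then apply the same median and IQR to test.
scaler = RobustScaler()
train_scaled = scaler.fit_transform(toy_train)
test_scaled = scaler.transform(toy_test)

# Manual version of the same transformation.
median = np.median(toy_train)
q1, q3 = np.percentile(toy_train, [25, 75])
manual_train = (toy_train - median) / (q3 - q1)
manual_test = (toy_test - median) / (q3 - q1)

print(np.allclose(train_scaled, manual_train), np.allclose(test_scaled, manual_test))
```

Because the median and IQR barely move when a single outlier is present, the bulk of the scaled values stays in a narrow range while the outlier remains large, which matches the maximum seen in the describe() output after scaling.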

In [37]:
train.describe()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred3,Cred4,Cred5,Cred6,Target,geo_score,instance_scores,lambda_wt,qsets_normalized_tat,Normalised_FNT_Scaled
count,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,...,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0
mean,0.666006,0.667701,0.666315,0.666687,0.666723,0.667378,0.666934,0.666279,0.666688,0.666576,...,0.666755,0.666878,0.666566,0.666776,0.001729,-0.000478,-0.000123,0.00035,0.000115,0.926589
std,0.654133,0.548305,0.506357,0.471956,0.461393,0.444573,0.415657,0.401546,0.366537,0.340436,...,0.174204,0.160803,0.135762,0.111612,0.041548,1.076016,1.091488,0.957957,0.945602,3.467282
min,-18.136667,-23.573333,-15.443333,-1.226667,-37.246667,-8.053333,-13.853333,-23.74,-3.81,-0.893333,...,-2.766667,-0.08,-6.856667,-4.476667,0.0,-25.983333,-24.59,-19.21,-31.45,-0.307262
25%,0.36,0.47,0.37,0.383333,0.436667,0.41,0.483333,0.596667,0.453333,0.413333,...,0.56,0.556667,0.643333,0.65,0.0,-0.43,-0.54,-0.43,-0.52,-0.229887
50%,0.67,0.69,0.726667,0.66,0.65,0.576667,0.68,0.673333,0.65,0.656667,...,0.673333,0.65,0.666667,0.67,0.0,0.15,-0.09,0.05,-0.07,0.0
75%,1.103333,0.933333,1.01,0.913333,0.87,0.8,0.856667,0.776667,0.866667,0.913333,...,0.783333,0.746667,0.696667,0.693333,0.0,0.65,0.45,0.49,0.4375,0.770113
max,1.483333,8.02,3.793333,6.163333,12.266667,25.1,40.863333,7.336667,5.863333,4.673333,...,3.173333,1.84,11.203333,11.95,1.0,7.85,23.75,10.53,10.233333,359.160487


#### Now that the features are on a similar scale, we can proceed with modelling. Separating the features and the target variable for fitting the model.

In [38]:
X = train.drop('Target', axis=1)
Y = train['Target']

#### Splitting the data into train and validation sets to validate the model. Stratified sampling is used so that the split preserves the percentage of samples for each class

In [47]:
#Our data is highly imbalanced,
#i.e. there are very few records for the fraud class (Target:1), hence using stratified sampling from each class

from sklearn.model_selection import train_test_split

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for train_index, test_index in sss.split(X, Y):
    print("Train:", train_index, "Test:", test_index)
    actual_Xtrain, actual_Xtest = X.iloc[train_index], X.iloc[test_index]
    actual_Ytrain, actual_Ytest = Y.iloc[train_index], Y.iloc[test_index]

actual_Xtrain = actual_Xtrain.values
actual_Xtest = actual_Xtest.values
actual_Ytrain = actual_Ytrain.values
actual_Ytest = actual_Ytest.values


train_unique_label, train_counts_label = np.unique(actual_Ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(actual_Ytest, return_counts=True)
print('---' * 85)

print('Label Distributions: \n')
print(train_counts_label/ len(actual_Ytrain))
print(test_counts_label/ len(actual_Ytest))

Train: [225122  61988 197455 ... 195669 152118 100486] Test: [197832 111272  39995 ... 161441 162197  66521]
Train: [ 94872 216389  49742 ... 212564  15634  28205] Test: [218589 111697 194742 ...   2918 166454  55967]
Train: [ 32315 180277 157046 ... 137763 192051 182988] Test: [ 74377  81640 166170 ... 119451  81607 204468]
Train: [ 43131 115016 219145 ... 219246 193635 214208] Test: [112229 227312  70281 ...  18812  45834 188756]
Train: [169045  18030 143025 ...  18168  91155 146404] Test: [223113 114665 210205 ...  40527 224978 204904]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Label Distributions: 

[0.99827185 0.00172815]
[0.99826637 0.00173363]


##### The loop above splits the data into train and validation sets using stratified sampling and retains only the last split. Modelling could be done on each split to check how the performance varies.
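If only a single stratified split is needed, `train_test_split` with the `stratify` argument is a shorter alternative. This is a sketch on synthetic data with an assumed 0.2% positive rate, not the notebook's actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10,000 rows with 20 positives (0.2%), values are made up.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_val.mean())
```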

# Logistic Regression

The basic model for any classification problem is Logistic Regression. Since our data is imbalanced, we use class_weight="balanced".
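With the "balanced" setting, scikit-learn weights each class by n_samples / (n_classes * class_count). The sketch below reproduces that formula on a made-up label vector and compares it with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 negatives and 10 positives.
y_toy = np.array([0] * 990 + [1] * 10)
classes = np.array([0, 1])

# Manual formula: n_samples / (n_classes * count of each class).
counts = np.bincount(y_toy)
manual_weights = len(y_toy) / (len(classes) * counts)

# The same weights as computed by scikit-learn.
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(manual_weights, sk_weights)
```

The rare class receives a much larger weight, so each misclassified fraud record costs the model far more than a misclassified non-fraud record.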

In [48]:
import time
t0 = time.time()
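#solver="liblinear" supports the L1 penalty; recent scikit-learn versions default to lbfgs, which does not
logr=LogisticRegression(penalty="l1",solver="liblinear",class_weight="balanced",random_state=2)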
logr.fit(actual_Xtrain, actual_Ytrain)
t1 = time.time()
print("Fitting Logistic Regression Model took {} sec".format(t1 - t0))

Fitting Logistic Regression Model took 120.6010115146637 sec


### logr is the fitted model; now let's make predictions on the validation data.

Hard class predictions on imbalanced data tend to favour the majority class. Here we use logr.predict, which predicts class labels for the validation data.

In [49]:
#predicting on our validation data
pred_test = logr.predict(actual_Xtest)

In [50]:
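#checking for AUC Score metric
#note: roc_auc_score is documented as roc_auc_score(y_true, y_score); the predictions are passed first here,
#so this value is not comparable with calls that pass the true labels first
roc_auc_score(pred_test,actual_Ytest)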

0.5260035076314027
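roc_auc_score expects the true labels first and, ideally, a continuous score such as the predicted fraud probability. The sketch below shows that usage on synthetic data; the dataset from `make_classification` and the model settings are assumptions for illustration only. It also shows that, when hard 0/1 labels are passed as the score, the result equals the balanced accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 3% positives), standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
labels = model.predict(X_val)             # hard 0/1 predictions

# True labels go first, scores second.
auc_from_proba = roc_auc_score(y_val, proba)
auc_from_labels = roc_auc_score(y_val, labels)

# With hard labels the "AUC" reduces to (TPR + TNR) / 2, i.e. balanced accuracy.
bal_acc = balanced_accuracy_score(y_val, labels)
print(auc_from_proba, auc_from_labels, bal_acc)
```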

In [51]:
#checking for Cohen Kappa Score metric
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(pred_test,actual_Ytest)

0.0957475309656366

The above metrics are obtained from a basic model. We can further train the model on the entire training data and make predictions on the test_share data.

In [None]:
#Fitting on entire train data
import time
t0 = time.time()
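#solver="liblinear" supports the L1 penalty; recent scikit-learn versions default to lbfgs, which does not
logr=LogisticRegression(penalty="l1",solver="liblinear",class_weight="balanced",random_state=2)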
logr.fit(X, Y)
t1 = time.time()
print("Fitting Logistic Regression Model took {} sec".format(t1 - t0))

In [None]:
#Prediction on given test data
pred_test_share = logr.predict(test)

In [None]:
#converting into data frame with corresponding ID for submission
predictions = pd.DataFrame(list(zip(test_df_all['id'],pred_test_share)),columns=["id","Target"])

# Randomized search with XGBoost

### Let's train a model using XGBoost and check how it performs

Due to time and system performance constraints, only the 4 parameters below are tuned. It would be worth tuning other parameters as well, since that may improve performance.

XGBoost is widely used because of its built-in regularisation, which penalises model complexity and which plain GBM lacks. XGBoost can also handle missing values on its own.

In [52]:
#Parameters for training XGBoost, limited to 4 parameters due to system processing and time constraints
param_dist_xgb = {
    "max_depth": [2,3,4,5,6,7],
    "learning_rate": [0.01,0.05,0.1,0.2,0.3,0.5],
    #"min_child_weight": [4,5,6],
    #"subsample": [i/10.0 for i in range(6,10)],
    #"colsample_bytree": [i/10.0 for i in range(6,10)],
    #"reg_alpha": [1e-5, 1e-2, 0.1, 1, 100],
    #"gamma": [i/10.0 for i in range(0,5)],
    "n_estimators": [100,200,500,700],
    "scale_pos_weight": [2,3,4,5,6,7,8,9]
}

In [55]:
#Again limiting to 10 iterations (the parameter lists above give over 1000 combinations).
#Sampling about 10% of the possible combinations would give a better chance of finding the best params
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
n_iter=10
clf=XGBClassifier(objective='binary:logistic')
random_search=RandomizedSearchCV(clf,n_jobs=-1,verbose=2,cv=10,n_iter=n_iter,scoring='roc_auc',
                                 param_distributions=param_dist_xgb)
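The size of the search space and the share covered by 10 random draws can be worked out directly from the four active parameter lists. A small arithmetic sketch (the dictionary repeats only the uncommented entries from above):

```python
import math

# The four parameter lists that are actually searched.
param_dist_xgb = {
    "max_depth": [2, 3, 4, 5, 6, 7],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "n_estimators": [100, 200, 500, 700],
    "scale_pos_weight": [2, 3, 4, 5, 6, 7, 8, 9],
}

# Total number of distinct combinations: 6 * 6 * 4 * 8.
n_combinations = math.prod(len(values) for values in param_dist_xgb.values())

n_iter = 10    # candidates sampled by the randomized search
cv_folds = 10  # cv=10 in the search above
coverage_pct = 100 * n_iter / n_combinations
n_fits = n_iter * cv_folds

print(n_combinations, round(coverage_pct, 2), n_fits)
```

So 10 iterations sample under 1% of the combinations, and covering about 10% would need roughly 115 candidates, i.e. about 1150 fits with 10-fold cross-validation.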

In [56]:
#fitting on validation train data
import time
t0 = time.time()
random_search.fit(actual_Xtrain,actual_Ytrain)
t1 = time.time()
print("Execution through Random Search for XGB took {} sec".format(t1-t0))

Fitting 10 folds for each of 10 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 28.8min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 136.5min finished


Execution through Random Search for XGB took 8294.710326910019 sec


In [57]:
#Best fit
best_XGB_fit = random_search.best_estimator_
#best_params saved from our iteration (learning_rate=0.01,n_estimators=700,max_depth=2,scale_pos_weight=6)
#these vary between runs

In [58]:
random_search.best_params_

{'learning_rate': 0.01,
 'max_depth': 2,
 'n_estimators': 700,
 'scale_pos_weight': 6}

In [61]:
#Prediction on validation test data
pred_XGB = best_XGB_fit.predict(actual_Xtest)

In [62]:
#AUC on hard class predictions (predictions passed first, true labels second)
roc_auc_score(pred_XGB,actual_Ytest)

0.9100915424803666

In [63]:
cohen_kappa_score(pred_XGB,actual_Ytest)

0.8149678892871739

### Both roc_auc_score and cohen_kappa_score improve considerably compared to Logistic Regression

Let's train the model on our entire training data and make predictions on the test_share data

In [64]:
#Fitting on entire train data
best_XGB_fit.fit(X,Y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=700,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=6, seed=None,
       silent=True, subsample=1)

In [65]:
#Predictions on given test data
pred_test_share=best_XGB_fit.predict(test)

In [66]:
#combining results into data frame with respective ID's for submission
predictions = pd.DataFrame(list(zip(test_df_all['id'],pred_test_share)),columns=["id","Target"])

# Randomized Search with Random Forests

### Building model with hyper parameter tuning using Random Forests to see if we get improved performance.

In [67]:
#Random Forests using randomized search with below parameters
from scipy.stats import randint as sp_randint
param_dist_rf = {"n_estimators":[10,100,500,700],
              "max_depth": [3,5, None],
              "max_features": sp_randint(5, 13),
              "min_samples_split": sp_randint(5, 11),
              "min_samples_leaf": sp_randint(5, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

In [69]:
#Fitting on validation train data
n_iter_search = 10
rf_clf = RandomForestClassifier(verbose=1,n_jobs=-1)
random_search = RandomizedSearchCV(rf_clf, param_distributions=param_dist_rf,
                                   n_iter=n_iter_search)
random_search.fit(actual_Xtrain, actual_Ytrain)
#This takes quite some time to execute and increases with number of iterations

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   29.7s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    1.1s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.9s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  1

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 700 out of 700 | elapsed:  5.1min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.6s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 700 out of 700 | elapsed:    2.2s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    3.2s
[Parallel(n_jobs=4)]: Done 700 out of 700 | elapsed:    5.0s finished
[

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   22.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   53.0s
[Parallel(n_jobs=-1)]: Done 700 out of 700 | elapsed:  1.4min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    1.0s
[Parallel(n_jobs=4)]: Done 700 out of 700 | elapsed:    1.5s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    2.4s
[Parallel(n_jobs=4)]: Done 700 out of 700 | elapsed:    3.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[

RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=-1,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': [10, 100, 500, 700], 'max_depth': [3, 5, None], 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000020186799518>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000020186799D68>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000201864778D0>, 'bootstrap': [True, False], 'criterion': ['gini

In [70]:
random_search.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 12,
 'min_samples_leaf': 6,
 'min_samples_split': 8,
 'n_estimators': 10}

In [71]:
#Best estimator
rand_best=random_search.best_estimator_
#Best Params taken from Iterations run
#n_estimators=10,max_depth=None,max_features=12,min_samples_leaf=6,min_samples_split=8,criterion=entropy,
#bootstrap=False

In [75]:
#prediction on validation test data and viewing confusion matrix
predicted_rf=rand_best.predict(actual_Xtest)

df_test=pd.DataFrame(list(zip(actual_Ytest,predicted_rf)),columns=["real","predicted"])

k=pd.crosstab(df_test['real'],df_test["predicted"])
print(k)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished


predicted      0   1
real                
0          45487   3
1             17  62


In [77]:
#checking with AUC and cohen kappa metrics
roc_auc_score(actual_Ytest,predicted_rf)

0.8923720890110778

In [78]:
cohen_kappa_score(actual_Ytest,predicted_rf)

0.8608933971908824
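As a worked check, both scores can be recomputed by hand from the confusion matrix printed above (TN=45487, FP=3, FN=17, TP=62). For hard 0/1 predictions the AUC reduces to the mean of the true positive rate and the true negative rate, and Cohen's kappa compares observed agreement with the agreement expected by chance.

```python
# Counts taken from the confusion matrix above (rows: real, columns: predicted).
tn, fp, fn, tp = 45487, 3, 17, 62
n = tn + fp + fn + tp

# AUC for hard predictions: average of recall on each class.
tpr = tp / (tp + fn)  # recall on fraud
tnr = tn / (tn + fp)  # recall on non-fraud
auc_hard = (tpr + tnr) / 2

# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
p_observed = (tp + tn) / n
p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

# Precision on the fraud class, for reference.
precision = tp / (tp + fp)

print(round(auc_hard, 6), round(kappa, 6), round(tpr, 4), round(precision, 4))
```

The recall on fraud is 62/79, so about one in five frauds in the validation split is still missed even though accuracy is close to 100%.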

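#### With Random Forests the AUC score is slightly lower than XGBoost's (0.892 vs 0.910), whereas the Cohen kappa score improves (0.861 vs 0.815). The two AUC values are not strictly comparable, because the XGBoost call passed the predictions as the first argument of roc_auc_score

Let's build our model on the entire train data and make predictions on the test_share data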

In [79]:
#fitting on entire train data
rand_best.fit(X,Y)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    9.7s finished


RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features=12,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=6,
            min_samples_split=8, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=None,
            verbose=1, warm_start=False)

In [80]:
#predictions on given test share data
pred_test_share=rand_best.predict(test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished


In [81]:
#combining into data frame with respective ID's
predictions = pd.DataFrame(list(zip(test_df_all['id'],pred_test_share)),columns=["id","Target"])

# Using SMOTE for Oversampling

In [82]:
#Directly using parameters from previous Randomized params for faster view on results
#saved from our previous iteration without balancing(learning_rate=0.01,n_estimators=700,max_depth=2,scale_pos_weight=6 )
clf=XGBClassifier(objective='binary:logistic',max_depth=2, learning_rate=0.01, n_estimators=700,scale_pos_weight=6)

In [83]:
#Using SMOTE to balance our validation train data set
sm = SMOTE(sampling_strategy='minority', random_state=42)
# Oversampled training data; the validation split is left untouched
Xsm_train, Ysm_train = sm.fit_resample(actual_Xtrain, actual_Ytrain)
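The idea behind SMOTE is to create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation; the function name `smote_like_oversample` and the toy data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    base_idx = rng.integers(0, len(X_min), size=n_new)           # random base samples
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                                   # factor in [0, 1)
    return X_min[base_idx] + gap * (X_min[nb_idx] - X_min[base_idx])

# Toy data: 200 majority rows and 10 minority rows (values are made up).
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(200, 2))
X_minor = rng.normal(3.0, 0.5, size=(10, 2))

# Generate enough synthetic minority rows to match the majority count.
X_new = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.concatenate(
    [np.zeros(len(X_major)), np.ones(len(X_minor) + len(X_new))]
).astype(int)

print(np.bincount(y_bal))
```

Only the training split should be oversampled, as is done above; the validation split keeps its original class distribution so that the metrics reflect real data.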

In [84]:
#fitting on our balanced validation train data set
import time
t0 = time.time()
clf.fit(Xsm_train,Ysm_train)
t1 = time.time()
print("Time taken for simple XGBoost(for given set of params) is {}".format(t1-t0))

Time taken for simple XGBoost(for given set of params) is 468.5534701347351


In [85]:
#prediction on our validation test data set
preds = clf.predict(actual_Xtest)

In [86]:
roc_auc_score(actual_Ytest,preds)

0.9160022650686896

In [90]:
cohen_kappa_score(preds,actual_Ytest)

0.026785607024297442

Here, although the AUC score is good, the cohen_kappa_score is very low.

In [91]:
accuracy_score(preds,actual_Ytest)

0.8953674647238254

In [92]:
f1_score(preds,actual_Ytest)

0.030105777054515868

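#### With the above parameters we got relatively low performance (kappa and F1). This may be due to the parameters: for example, scale_pos_weight=6 still up-weights the fraud class even though SMOTE has already balanced the data. Tuning the hyperparameters on the oversampled data may improve performance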

In [93]:
#Using SMOTE to balance our entire train data set
sm = SMOTE(sampling_strategy='minority', random_state=42)
# Oversampled version of the full training data, passed as numpy arrays
Xsm, Ysm = sm.fit_resample(X.values, Y.values)

In [94]:
#model fitting on our entire train data set balanced
clf.fit(Xsm,Ysm)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=700,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=6, seed=None,
       silent=True, subsample=1)

In [97]:
#Predictions on given test share data.
#After oversampling the model is fit on a numpy array, so we need to convert our test data to a numpy array as well.
test_values = test.values
pred_test_share = clf.predict(test_values)

In [98]:
#combining classes with respective ID's into data frame for submission
predictions = pd.DataFrame(list(zip(test_df_all['id'],pred_test_share)),columns=["id","Target"])

In [99]:
#writing to a csv file.
predictions.to_csv("saikumar_ganneboyina_Finhack.csv",index=False)

# Summary

With respect to cohen_kappa_score, the best model is Random Forests, followed by XGBoost. After obtaining the best parameters through randomized search, a grid search around them may find still better parameters.
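One way to follow a randomized search with a grid search is to build a small grid around the best values found. The sketch below does this on synthetic data with a decision tree as a fast stand-in for the real models; the helper `neighbours`, the `best` dictionary and the step sizes are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def neighbours(value, step, minimum):
    """Return value-step, value and value+step, dropping anything below minimum."""
    return sorted({v for v in (value - step, value, value + step) if v >= minimum})

# Hypothetical best values, standing in for random_search.best_params_.
best = {"max_depth": 3, "min_samples_leaf": 6}

# Narrow grid centred on the best values from the randomized search.
narrow_grid = {
    "max_depth": neighbours(best["max_depth"], 1, 1),
    "min_samples_leaf": neighbours(best["min_samples_leaf"], 2, 1),
}

# Synthetic imbalanced data (about 10% positives), values are made up.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=narrow_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

The grid has only 9 combinations, so an exhaustive search stays cheap while still refining the region the randomized search pointed to.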

The model might also be improved by resampling. Here SMOTE was used but did not give better performance; the hyperparameters need to be tuned on the resampled data, which may yield a better model.