# Context 

For this challenge, we will create a multi-layer perceptron neural network model to predict fraudulent transactions on Kaggle's competition [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) dataset. We will also compare this model to a random forest model and describe the relative tradeoffs between complexity and accuracy.

## Data Overview 
The datasets contains historical Vesta's real-world e-commerce transaction. There is a high imbalance of classes since the positive class (frauds) only account for 3.95% of all transactions. To treat this issue, we have undersampled the dataset and reduced the amount of normal transactions to equal the amount of fraudulent transactions. 

Due to confidentiality issues, original features are masked, which presents a challenge when interpreting the features.

In [1]:
# Libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Datetime
from datetime import datetime

# Data Cleaning 
from sklearn.impute import SimpleImputer

# Model Selection 
from sklearn.model_selection import train_test_split

# Evaluate 
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# Ensemble Model
from sklearn.ensemble import RandomForestClassifier

# SVM 
from sklearn.svm import SVC

# Network 
from sklearn.neural_network import MLPClassifier

In [2]:
df = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/ieee-fraud-detection/clean_df_le.csv')

In [3]:
df.head()

Unnamed: 0,isFraud,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,...,id_38,DeviceType,transaction_day_of_week,transaction_hour,average_trans_amt_for_card1,average_trans_amt_for_card4,average_id_02_for_card1,average_id_02_for_card4,P_major_email,R_major_email
0,0,107.95,4,12695,490.0,150.0,3,226.0,1,325.0,...,0,0,6.0,23.0,,,1.033636,0.970722,15,15
1,0,25.0,1,12929,285.0,150.0,3,226.0,1,184.0,...,1,1,1.0,17.0,1.16,0.21083,1.0,0.207972,1,1
2,0,57.95,4,9500,321.0,150.0,3,226.0,1,204.0,...,0,0,4.0,22.0,0.396588,0.428929,1.00609,0.970722,15,15
3,0,100.0,1,12769,555.0,150.0,2,224.0,1,204.0,...,0,0,5.0,17.0,,,1.0,0.936085,15,15
4,1,32.356,0,12778,500.0,185.0,2,224.0,0,290.733794,...,0,0,5.0,23.0,1.090654,0.379163,3.187676,3.95732,15,15


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32902 entries, 0 to 32901
Columns: 422 entries, isFraud to R_major_email
dtypes: float64(395), int64(27)
memory usage: 105.9 MB


In [5]:
df['isFraud'].value_counts()

1    16451
0    16451
Name: isFraud, dtype: int64

## Split Train and Test set¶
Let's split dataset by using function train_test_split(). Here, the Dataset is broken into two parts in a ratio of 80:20. It means 80% data will be used for model training and 20% for model testing.

To continue feature selection, we will start by using the original attributes in the raw training set.

In [6]:
# X is the feature set
X = df.drop(labels=['isFraud'], axis=1)

# Y is the target variable
y = df['isFraud']

In [7]:
# df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [9]:
print('X_shapes:\n', 'X_train:', 'X_test:\n', X_train.shape, X_test.shape, '\n')
print('Y_shapes:\n', 'Y_train:', 'Y_test:\n', y_train.shape, y_test.shape)

X_shapes:
 X_train: X_test:
 (26321, 421) (6581, 421) 

Y_shapes:
 Y_train: Y_test:
 (26321,) (6581,)


### Imputing Missing Values Again
Standard machine learning models cannot deal with missing values, and which means we have to find a way to fill these in or disard any features with missing values. Imputing also helps to reduce bias due to missingness: ‘rather than deleting cases that are subject to item-nonresponse, the sample size is maintained resulting in a potentially higher efficiency than for case deletion'[Durrant](https://www.tandfonline.com/doi/full/10.1080/1743727X.2014.979146#).

Here, we will fill in missing values with the mean of the column.

In [10]:
# Create an imputer object with a mean filling strategy
imputer = SimpleImputer(missing_values=np.NaN, strategy='mean')

# Train on the training features
imputer.fit(X_train)

# Transform both training data and testing data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [11]:
print('Missing values in training features: ', np.sum(np.isnan(X_train)))
print('Missing values in testing features:  ', np.sum(np.isnan(X_test)))

Missing values in training features:  0
Missing values in testing features:   0


In [12]:
# Make sure all values are finite
print(np.where(~np.isfinite(X_train)))
print(np.where(~np.isfinite(X_test)))

(array([], dtype=int64), array([], dtype=int64))
(array([], dtype=int64), array([], dtype=int64))


## Random Forest

In [13]:
# Random Forest Classifer
start_time = datetime.now()

print('Random Forest Classifer')

#Create a Gaussian Classifier
clf = RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
print(cross_val_score(clf, X_train, y_train, cv=5))

clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

Random Forest Classifer
[0.8317189  0.83833587 0.83985562 0.84004559 0.83453647]
Accuracy: 0.8378665856252849
[[2846  459]
 [ 608 2668]]

Duration: 0:01:15.136509


Here, the model misclassified 3.24% of transactions. 13.8% of normal transactions were labeled as fraud and 18.5% of fraudulent transactions were classifed as normal transactions. 

### Model Interpretation: Feature Importances
For model interpretability, we will take a look at the feature importances of our initial random forest. We may use these feature importances as a method of dimensionality reduction in future work.

In [14]:
# Top N importances

clf.fit(X_train, y_train)

N = 10
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)

# Create a dataframe
importances_df = pd.DataFrame({'Variable':X.columns, 'Importance': importances})

top_N = importances_df.sort_values(by=['Importance'], ascending=False).head(10)

top_N

Unnamed: 0,Variable,Importance
0,TransactionAmt,0.024504
23,C13,0.022797
24,C14,0.019286
2,card1,0.019266
417,average_id_02_for_card1,0.018177
3,card2,0.01726
415,average_trans_amt_for_card1,0.01598
416,average_trans_amt_for_card4,0.01556
18,C8,0.015486
414,transaction_hour,0.014654


## Support Vector Machine

In [15]:
# Support Vector Classifer
start_time = datetime.now()

print('Support Vector Classifer')

#Create a Gaussian Classifier
svc = SVC(C=1e-9, kernel='rbf')

#Train the model using the training sets y_pred=clf.predict(X_test)
print(cross_val_score(svc, X_train, y_train, cv=5))

y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

Support Vector Classifer




[0.50047483 0.50056991 0.50056991 0.50056991 0.50056991]
Accuracy: 0.8393861115332016
[[2851  454]
 [ 603 2673]]

Duration: 0:29:21.422980


Here, the model misclassified 3.21% of transactions. This is slightly better than the random forest model. 13.7% of normal transactions were labeled as fraud and 18.4% of fraudulent transactions were classifed as normal transactions.

## Challenge 4.3.6 Make Your Network
Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

In [16]:
# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X_train, y_train)

mlp.score(X_train, y_train)

0.712738877702215

In [17]:
cross_val_score(mlp, X_train, y_train, cv=5)

array([0.73162393, 0.73347264, 0.72473404, 0.71162614, 0.64266717])

In [18]:
# Let's see how the testing data performs.
mlp_pred = mlp.predict(X_test)

confusion_matrix(y_test, mlp_pred)

array([[3090,  215],
       [1671, 1605]])

It doesn't seem like this model is able to perform better than SVM or random forest. Let's try adding hidden layers. 

In [19]:
# Attempt more hidden layers. 
mlp2 = MLPClassifier(hidden_layer_sizes=(1000,))
mlp2.fit(X_train, y_train)

mlp2.score(X_train, y_train)

0.7187036966680598

In [20]:
# Attempt more hidden layers. 
mlp3 = MLPClassifier(hidden_layer_sizes=(1000,100))
mlp3.fit(X_train, y_train)

mlp3.score(X_train, y_train)

0.629459367045325

In [21]:
# Change the activation method. 
mlp4 = MLPClassifier(hidden_layer_sizes=(1000,100), activation='logistic')
mlp4.fit(X_train, y_train)

mlp4.score(X_train, y_train)

0.509023213403746

In [22]:
cross_val_score(mlp4, X_train, y_train, cv=5)

array([0.57948718, 0.56591945, 0.5949848 , 0.57408815, 0.57940729])

In [23]:
# Increase the amount of iterations. 
mlp5 = MLPClassifier(hidden_layer_sizes=(1000,100,100), activation='logistic', max_iter=500)
mlp5.fit(X_train, y_train)

mlp5.score(X_train, y_train)

0.6009650089282322

Increasing hidden layers and changing the activation method did not increase the model's performance. Since neural networks perform better with more data, maybe oversampling would improve the model's performance. 