# IEEE-CIS Fraud Detection Part 2

In this series of notebooks, we are working on a supervised, binary classification machine learning problem. Using the dataset from Kaggle's [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) competition, we want to predict whether a transaction is fraudulent or not.

 ### Workflow 
 1. Understand the problem (we're almost there already)
 2. Exploratory Data Analysis
 3. Feature engineering to create a dataset for machine learning
 4. Create a baseline machine learning model
 5. Try more complex machine learning models
 6. Optimize the selected model
 7. Investigate model predictions in context of problem
 8. Draw conclusions and lay out next steps
 
The first notebook covered steps 1-3, and in this notebook, we will cover 4-6.

# Read in Data



In [1]:
# Numpy and pandas
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistics tools
import scipy.stats as stats

# Sklearn data clean
from sklearn.preprocessing import LabelEncoder
from dask_ml.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Model selection

from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Logistic Regression
from dask_ml.linear_model import LogisticRegression

# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

# Decision Trees
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from IPython.display import Image
import pydotplus
import graphviz

# Random Forests 
from sklearn.ensemble import RandomForestClassifier

# SVM
from sklearn.svm import SVC

# Gradient Boost
from dask_ml.xgboost import XGBClassifier

# Evaluate
from sklearn import metrics
from sklearn.metrics import f1_score,roc_auc_score, confusion_matrix, classification_report, accuracy_score

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Datetime
from datetime import datetime

# Warnings
import warnings

# Dask 
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client

from dask_ml.preprocessing import OneHotEncoder, MinMaxScaler
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

from dask_ml.wrappers import ParallelPostFit

# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
from joblib import parallel_backend

In [3]:
# Load Data
df = dd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/ieee-fraud-detection/clean_train_trans.csv')

In [4]:
# Display summary statistics of the data
print('Training Feature Size: ', df.describe().compute())

Training Feature Size:         TransactionID        isFraud  TransactionDT  TransactionAmt  \
count   5.905400e+05  590540.000000   5.905400e+05   590540.000000   
mean    3.282270e+06       0.034990   7.372311e+06      135.027176   
std     1.704744e+05       0.183755   4.617224e+06      239.162522   
min     2.987000e+06       0.000000   8.640000e+04        0.251000   
25%     3.173778e+06       0.000000   4.151084e+06       48.950000   
50%     3.351566e+06       0.000000   9.055706e+06       82.950000   
75%     3.565136e+06       0.000000   1.537790e+07      150.000000   
max     3.577539e+06       1.000000   1.581113e+07    31937.391000   

               card1          card2          card3          card5  \
count  590540.000000  581607.000000  588975.000000  586281.000000   
mean     9898.734658     362.555488     153.194925     199.278897   
std      4901.170153     157.793246      11.336444      41.244453   
min      1000.000000     100.000000     100.000000     100.000000   


In [5]:
 len(df.columns) 

397

Dask knows the number of columns up front, but it does not know the number of rows until the data is computed, since it loads the dataset one chunk at a time.
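
To make the point about chunked loading concrete, here is a minimal sketch that uses pandas' chunked CSV reader as a stand-in for Dask partitions. The toy CSV and the chunk size are assumptions for illustration only; the real file is read with Dask as shown above.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for the real transaction file
csv_text = "TransactionID,isFraud\n" + "\n".join(f"{i},{i % 2}" for i in range(10))

# Read the file 4 rows at a time, similar in spirit to Dask partitions
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

n_rows = 0
n_cols = None
for chunk in reader:
    # Column names are already available from the first chunk
    if n_cols is None:
        n_cols = len(chunk.columns)
    # The row count is only known once every chunk has been visited
    n_rows += len(chunk)

print(n_cols, n_rows)
```

In Dask itself, calling `len(df)` triggers the equivalent full pass over all partitions, which is why it can be slow on a large file.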

In [6]:
df.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V333,V334,V335,V336,V337,V338,V339,TransactionAmt_Log,Transaction_day_of_week,Transaction_hour
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,4.226834,0.0,0.0
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,3.367296,0.0,0.0
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,4.077537,0.0,0.0
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,3.912023,0.0,0.0
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.912023,0.0,0.0


In [7]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 397 entries, TransactionID to Transaction_hour
dtypes: object(14), float64(379), int64(4)

# Evaluating and Comparing Machine Learning Models
In this section we will build, train, and evaluate several machine learning models for our supervised classification task. The objective is to determine which model holds the most promise for further development (such as hyperparameter tuning).

### Imputing Missing Values 
Standard machine learning models cannot deal with missing values, which means we have to find a way to fill these in or discard any features with missing values. Imputing also helps to reduce bias due to missingness: 'rather than deleting cases that are subject to item-nonresponse, the sample size is maintained resulting in a potentially higher efficiency than for case deletion' ([Durrant](https://www.tandfonline.com/doi/full/10.1080/1743727X.2014.979146#)).

Here, we will fill in missing values with the mean of the column.
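
As a small illustration of what mean imputation does, the sketch below uses scikit-learn's `SimpleImputer` on a toy array. The array and the names `X_toy` and `imputer_toy` are assumptions made up for this example; the notebook itself uses the Dask-ML imputer on the full dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value in each column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# Each missing value is replaced by the mean of the observed values in its column
imputer_toy = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer_toy.fit_transform(X_toy)

print(X_filled)
```

The first column is filled with the mean of 1 and 3, and the second with the mean of 10 and 20.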

In [12]:
# Create an imputer object with a mean filling strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Mean imputation only applies to numeric columns
num_cols = list(df.select_dtypes(include='number').columns)

# Fit on the numeric features, then fill in their missing values
imputer.fit(df[num_cols])
df[num_cols] = imputer.transform(df[num_cols])

In [None]:
print('Missing values in training set: ', df.isnull().sum().sum().compute())

### Encoding Categorical Variables
Most machine learning models cannot handle categorical variables directly. Therefore, we have to encode (represent) these variables as numbers before modeling. The two main encoding methods are: 
 * __Label encoding__: assign each unique category in a categorical variable an integer. No new columns are created.
 * __One-hot encoding__: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.
 
The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. 

For this project, we will use Label Encoding for any categorical variables for [tree-based models](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/) and One-Hot Encoding for any categorical variables for other models. 
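
The difference between the two encodings can be sketched on a toy column. The example below uses plain pandas and a handful of `card4` values; the tiny `cards` frame is an assumption for illustration, not the notebook's data.

```python
import pandas as pd

# Toy categorical column using card brands that appear in the dataset
cards = pd.DataFrame({"card4": ["visa", "mastercard", "discover", "visa"]})

# Label encoding: one integer per category, kept in a single column.
# Codes follow alphabetical category order, which carries no real meaning.
label_encoded = cards["card4"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(cards["card4"], prefix="card4")

print(label_encoded.tolist())
print(list(one_hot.columns))
```

Label encoding keeps the column count unchanged, while one-hot encoding adds one column per category, which matters for high-cardinality features.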

In [None]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

df_le = df.copy()
# Iterate through the columns
for col in df_le:
    if df_le[col].dtype == 'object':
        # Fit the encoder on this column
        le.fit(df_le[col])
        # Replace the column with its integer codes
        df_le[col] = le.transform(df_le[col])
            
        # Keep track of how many columns were label encoded
        le_count += 1
            
print('%d columns were label encoded.' % le_count)
print('Training Features shape: ', df_le.shape)

In [None]:
# One-hot encoding of categorical variables
# (Dask needs known categories, so categorize the object columns first)
df = dd.get_dummies(df.categorize())

print('Training Features shape: ', df.shape)

## Split Train and Test set
Let's split the dataset using the function train_test_split(). Here, the dataset is broken into two parts in a ratio of 80:20, meaning 80% of the data will be used for model training and 20% for model testing.

To continue feature selection, we will start by using the original attributes in the raw training set.
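
The 80:20 split can be sketched on toy data with scikit-learn's `train_test_split`. The toy arrays are assumptions, and `stratify` is an addition that the notebook's own split does not use; it is shown here only because it keeps the rare fraud class represented in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, and a rare positive class (5%)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

# 80% for training, 20% for testing; stratify preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())
```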

__Dataframe with OneHotEncoder__

In [None]:
# Y is the target variable
y = df['isFraud']

# X is the feature set
X = df.drop(labels=['TransactionID','TransactionDT','isFraud'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [None]:
print('X_shapes:\n', 'X_train:', 'X_test:\n', X_train.shape, X_test.shape, '\n')
print('Y_shapes:\n', 'Y_train:', 'Y_test:\n', y_train.shape, y_test.shape)

__Dataframe with Label Encoder__

In [None]:
# Y is the target variable
y_le = df_le['isFraud']

# X is the feature set
X_le = df_le.drop(labels=['TransactionID','TransactionDT','isFraud'], axis=1)

In [None]:
X_le_train, X_le_test, y_le_train, y_le_test = train_test_split(X_le, y_le, test_size=0.2, random_state=42)

In [None]:
X_le_train = X_le_train.values
X_le_test = X_le_test.values
y_le_train = y_le_train.values
y_le_test = y_le_test.values

In [None]:
print('X_shapes:\n', 'X_le_train:', 'X_le_test:\n', X_le_train.shape, X_le_test.shape, '\n')
print('Y_shapes:\n', 'Y_le_train:', 'Y_le_test:\n', y_le_train.shape, y_le_test.shape)

### Scaling Features
The final step before we can build our models is to scale the features. This is necessary because the features are in different units, and we want to normalize them so the units do not affect the algorithm. Logistic regression and random forests do not strictly require feature scaling, but other methods, such as support vector machines and k-nearest neighbors, do because they rely on the Euclidean distance between observations. For this reason, it is good practice to scale features when comparing multiple algorithms.
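
A minimal sketch of min-max scaling, which maps each feature to (x - min) / (max - min), with the minimum and maximum taken from the training data only. The toy arrays and the name `scaler_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature training and test sets
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Learn min and max from the training data only
scaler_toy = MinMaxScaler(feature_range=(0, 1))
scaler_toy.fit(train)

# Apply the same transformation to both sets
train_scaled = scaler_toy.transform(train)
test_scaled = scaler_toy.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values can land outside the 0-1 range when they exceed the training minimum or maximum, which is expected when the scaler never sees the test data.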

In [None]:
# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data only
scaler.fit(X_train)

# Transform both the training and testing data
X = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Convert y to one-dimensional array (vector)
y = da.array(y_train).reshape((-1, ))
y_test = da.array(y_test).reshape((-1, ))

# Baseline 
For a naive baseline, we will use logistic regression to predict the probability of fraud. Unlike linear regression, which gives a continuous output, logistic regression produces a probability that is converted into a discrete class label. If the probability 'p' is greater than 0.5, the observation is labeled '1'; otherwise it is labeled '0'.
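
The thresholding rule can be sketched in a few lines. The helper names `sigmoid` and `predict_label` are hypothetical and exist only for this illustration; `z` stands for the linear score that a fitted logistic regression would compute from the features.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Label the observation 1 when the probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

# A positive score gives p > 0.5 (label 1); a negative score gives p < 0.5 (label 0)
print(sigmoid(2.0), predict_label(2.0))
print(sigmoid(-2.0), predict_label(-2.0))
```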

## Logistic Regression Implementation
First, we'll create and train the model, then make predictions on the testing data.

In [None]:
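# Optional smoke test on synthetic data. Left commented out because running it
# would overwrite the real X, y, X_test, and y_test prepared above.
# X, y = make_classification(chunks=10000)
# X_test, y_test = make_classification(chunks=10000)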

In [None]:
# Logistic Regression
start_time = datetime.now()

# Instantiate the model (using the default parameters)
logreg = LogisticRegression()

# Fit the model with data
logreg.fit(X, y)

# Predict on test set
y_pred=logreg.predict(X_test)

# Compute ROC AUC score, accuracy score, confusion matrix, and classification report
print('ROC AUC score: %0.4f' % roc_auc_score(y_test, y_pred))
print('Accuracy score: %0.4f' % accuracy_score(y_test, y_pred))      
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

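As we can see, the ROC AUC and accuracy scores are very different. This model has an accuracy score of 96.5%, calculated by adding the 113,732 true negatives and 231 true positives and dividing by the total count of 118,108. In contrast, the ROC AUC score of 52.7% is close to the random-guess score of 50%, partly because it is computed here from hard 0/1 predictions rather than predicted probabilities. Note that the 113,732 correctly classified transactions are legitimate, not fraudulent. The 134 errors read from the confusion matrix are consistent with legitimate transactions wrongly flagged as fraud, which would leave about 4,011 fraudulent transactions undetected. With only about 3.5% of transactions being fraud, a model can reach high accuracy while missing most of the fraud.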
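
The arithmetic behind these scores can be checked directly. In the sketch below, the true negative count, true positive count, and total come from the text; treating the 134 as false positives is an assumption, chosen because it reproduces the reported ROC AUC of 52.7%.

```python
# Counts quoted in the text above
tn = 113_732      # legitimate transactions correctly classified
tp = 231          # fraudulent transactions correctly flagged
total = 118_108   # size of the test set

# Assumption: the 134 errors are false positives (legitimate flagged as fraud)
fp = 134
fn = total - tn - tp - fp  # fraudulent transactions that were missed

accuracy = (tp + tn) / total

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the average of the true positive and true negative rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
auc_from_labels = (tpr + (1 - fpr)) / 2

print(fn, round(accuracy, 3), round(auc_from_labels, 3))
```

Under this reading the model catches only about 5% of fraud, which accuracy alone hides.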

In [None]:
# ROC Curve
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Using predicted probabilities instead of hard labels, the logistic regression baseline reaches a ROC AUC of around 0.77, which is better than the random-guess score of 0.5.
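
The gap between a label-based and a probability-based ROC AUC can be reproduced on a toy example. The labels and probabilities below are made up for illustration and are not taken from the model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted fraud probabilities (all below the 0.5 threshold)
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.40, 0.35, 0.45])

# Hard labels at a 0.5 threshold: every transaction is predicted as legitimate
hard = (proba > 0.5).astype(int)

auc_hard = roc_auc_score(y_true, hard)    # ranking information is lost
auc_proba = roc_auc_score(y_true, proba)  # uses the full ranking

print(auc_hard, auc_proba)
```

Even though no probability crosses the threshold, the fraud cases are still ranked above most legitimate ones, and only the probability-based score reflects that.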


## Improved Model: Random Forest
Let's try using a Random Forest on the same training data to see if it will beat the performance of our baseline. The Random Forest is a much more powerful model especially when we use hundreds of trees. We will use 100 trees in the random forest.

In [None]:
# Random Forest Classifier
start_time = datetime.now()

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training set
clf.fit(X, y)

# Predict on the test set
y_pred = clf.predict(X_test)

# Compute ROC AUC score, accuracy score, confusion matrix, and classification report
print('ROC AUC score: %0.4f' % roc_auc_score(y_test, y_pred))
print('Accuracy score: %0.4f' % accuracy_score(y_test, y_pred))      
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

The ROC AUC and accuracy scores for this model have increased from the baseline, indicating that the random forest is the better model for this data.

In [None]:
# ROC Curve (using the random forest's own predicted probabilities)
y_pred_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Note that the probabilities must come from clf. Plotting logreg.predict_proba here would simply reproduce the baseline curve, which makes the two graphs look exactly alike.

### Model Interpretation: Feature Importances
As a simple method to see which variables are the most relevant, we can look at the feature importances of the random forest. We may use these feature importances as a method of dimensionality reduction in future work.
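
A minimal sketch of reading feature importances, using a synthetic dataset in which only one column drives the label. The column names `signal`, `noise_a`, and `noise_b` and the data itself are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the 'signal' column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

The importances sum to 1, so each value can be read as a share of the total impurity reduction across the forest.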

In [None]:
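# Feature names are the columns left after dropping the ID, timestamp, and target
feature_names = df.drop(labels=['TransactionID','TransactionDT','isFraud'], axis=1).columns

feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=feature_names,
                                   columns=['importance']).sort_values('importance', ascending=False)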

In [None]:
#'Feature_importances_' from random forest
print((feature_importances).head(10))

## Models to Evaluate
We will compare five different machine learning models:

1. Logistic Regression
2. Random Forest Classifier
3. Support Vector Machine 
4. K-Nearest Neighbors Classifier
5. Extreme Gradient Boosting Classifier
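
One compact way to run such a comparison is a loop over candidate models scored with cross-validated ROC AUC. The sketch below is illustrative only: it uses synthetic data and scikit-learn's own estimators rather than the Dask versions in this notebook, and it covers three of the five models to stay fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the fraud data (about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Mean 3-fold cross-validated ROC AUC for each candidate
scores = {}
for name, model in candidates.items():
    cv_auc = cross_val_score(model, X_toy, y_toy, cv=3, scoring="roc_auc")
    scores[name] = cv_auc.mean()

# Print the models from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation gives a less noisy ranking than a single train/test split, at the cost of fitting each model several times.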

In [None]:
# Function to calculate roc auc score
def auc(y_test, y_pred_proba):
    return metrics.roc_auc_score(y_test, y_pred_proba)

# Takes in a model, trains the model, and evaluates the model on the test set
def fit_and_evaluate(model):
    # To train on the Dask cluster, wrap this body in: with parallel_backend('dask'):
    # Train the model
    model.fit(X, y)

    # Make predictions and evaluate
    model_pred = model.predict(X_test)

    # Compute ROC AUC score (from hard labels), accuracy score, confusion matrix, and classification report
    print('ROC AUC score: %0.4f' % roc_auc_score(y_test, model_pred))
    print('Accuracy score: %0.4f' % accuracy_score(y_test, model_pred))
    print(confusion_matrix(y_test, model_pred))
    print(classification_report(y_test, model_pred))

    # ROC curve from this model's own predicted probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    auc_score = auc(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label="data 1, auc=" + str(auc_score))
    plt.legend(loc=4)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()

    # Return the probability-based ROC AUC so the models can be compared
    return auc_score

In [None]:
# Logistic Regression
start_time = datetime.now()

lr = LogisticRegression()
lr_score = fit_and_evaluate(lr)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

In [None]:
# Random Forest Classifier
start_time = datetime.now()

rfc = RandomForestClassifier()
rfc_score = fit_and_evaluate(rfc)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

In [None]:
# Support Vector Classifier
start_time = datetime.now()

# probability=True enables predict_proba, which the ROC curve needs
svc = SVC(probability=True)
svc_score = fit_and_evaluate(svc)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

In [None]:
# KNN Classifier
start_time = datetime.now()

knn = KNeighborsClassifier()
knn_score = fit_and_evaluate(knn)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

In [None]:
# Extreme Gradient Boosting Classifier
start_time = datetime.now()

xgb = XGBClassifier()
xgb_score = fit_and_evaluate(xgb)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

In [None]:
# Dataframe to hold the results
model_comparison = pd.DataFrame({'Model': ['Logistic Regression', 'RandomForest Classifier',
                                           'Support Vector Classifier', 'KNeighbors Classifier', 'XGB Classifier'],
                                 'ROC_AUC': [lr_score, rfc_score, svc_score, knn_score, xgb_score]})

# Rank the models by ROC AUC
model_comparison.sort_values('ROC_AUC', ascending = True)

# Conclusion 
We followed the general outline of a machine learning project:

1. Understand the problem and the data
2. Data cleaning and formatting (this was mostly done for us)
3. Exploratory Data Analysis
4. Baseline model
5. Improved model
6. Model interpretation (just a little)