# IEEE-CIS Fraud Detection Part 2

In this series of notebooks, we are working on a supervised, regression machine learning problem. Using Kaggle's competition [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) dataset, we want to predict whether a transaction is fraud or not. 

 ### Workflow 
 1. Understand the problem (we're almost there already)
 2. Exploratory Data Analysis
 3. Feature engineering to create a dataset for machine learning
 4. Create a baseline machine learning model
 5. Try more complex machine learning models
 6. Optimize the selected model
 7. Investigate model predictions in context of problem
 8. Draw conclusions and lay out next steps
 
The first notebook covered steps 1-3, and in this notebook, we will cover 4-6.

# Read in Data



In [1]:
# Numpy and pandas
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistics tools
import scipy.stats as stats

# Sklearn data clean
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Model selection
from sklearn.decomposition import PCA 
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Logistic Regression
from sklearn.linear_model import Lasso, LogisticRegression

# KNN Classifer 
from sklearn.neighbors import KNeighborsClassifier

# Decision Trees
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from IPython.display import Image
import pydotplus
import graphviz

# Random Forests 
from sklearn.ensemble import RandomForestClassifier

# SVM
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVR
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# Gradient Boost
from xgboost import XGBClassifier

# Evaluate
from sklearn import metrics
from sklearn.metrics import log_loss,accuracy_score, f1_score,roc_auc_score, confusion_matrix

# Import data
from sqlalchemy import create_engine
import warnings

In [2]:
# Load Data
train_df = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/ieee-fraud-detection/train.csv')
test_df = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/ieee-fraud-detection/test.csv')


In [3]:
# Display sizes of data
print('Training Feature Size: ', train.shape)
print('Testing Feature Size:  ', test.shape)

Training Feature Size:  (590540, 2161)
Testing Feature Size:   (506691, 2160)


In [4]:
train_df.head()

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,...,Transaction_hour,card1_TransactionAmt_mean,card1_TransactionAmt_std,card2_TransactionAmt_mean,card2_TransactionAmt_std,card3_TransactionAmt_mean,card3_TransactionAmt_std,card5_TransactionAmt_mean,card5_TransactionAmt_std,isFraud
0,2987000,86400,68.5,13926,,150.0,142.0,315.0,87.0,19.0,...,0.0,316.570357,351.513997,,,147.65346,255.330369,185.236343,322.134467,0
1,2987001,86401,29.0,2755,404.0,150.0,102.0,325.0,87.0,,...,0.0,213.053819,391.543884,227.107106,373.703941,147.65346,255.330369,212.7937,396.390243,0
2,2987002,86469,59.0,4663,490.0,150.0,166.0,330.0,87.0,287.0,...,0.0,104.87694,130.380968,136.179809,228.571548,147.65346,255.330369,98.77496,141.059909,0
3,2987003,86499,50.0,18132,567.0,150.0,117.0,476.0,87.0,,...,0.0,120.958705,196.463487,133.628801,226.771834,147.65346,255.330369,124.389514,191.8809,0
4,2987004,86506,50.0,4497,514.0,150.0,102.0,420.0,87.0,,...,0.0,99.811667,69.829736,223.770752,457.894839,147.65346,255.330369,212.7937,396.390243,0


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 2161 entries, TransactionID to isFraud
dtypes: float64(410), int64(1751)
memory usage: 9.5 GB


In [6]:
test_df.head()

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,...,Transaction_day_of_week,Transaction_hour,card1_TransactionAmt_mean,card1_TransactionAmt_std,card2_TransactionAmt_mean,card2_TransactionAmt_std,card3_TransactionAmt_mean,card3_TransactionAmt_std,card5_TransactionAmt_mean,card5_TransactionAmt_std
0,3663549,18403224,31.95,10409,111.0,150.0,226.0,170.0,87.0,1.0,...,2.0,0.0,111.438993,127.021956,150.513374,258.890796,147.65346,255.330369,141.865993,242.457293
1,3663550,18403263,49.0,4272,111.0,150.0,226.0,299.0,87.0,4.0,...,2.0,0.0,154.212902,315.255714,150.513374,258.890796,147.65346,255.330369,141.865993,242.457293
2,3663551,18403310,171.0,4476,574.0,150.0,226.0,472.0,87.0,2635.0,...,2.0,0.0,137.671446,134.993751,162.600916,161.299536,147.65346,255.330369,141.865993,242.457293
3,3663552,18403310,284.95,10989,360.0,150.0,166.0,205.0,87.0,17.0,...,2.0,0.0,89.571312,130.253829,97.760244,144.61787,147.65346,255.330369,98.77496,141.059909
4,3663553,18403317,67.95,18018,452.0,150.0,117.0,264.0,87.0,6.0,...,2.0,0.0,115.629317,214.010137,117.029127,210.349436,147.65346,255.330369,124.389514,191.8809


In [7]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506691 entries, 0 to 506690
Columns: 2160 entries, TransactionID to card5_TransactionAmt_std
dtypes: float64(410), int64(1750)
memory usage: 8.2 GB


## Evaluating and Comparing Machine Learning Models
In this section we will build, train, and evalute several machine learning methods for our supervised regression task. The objective is to determine which model holds the most promise for further development (such as hyperparameter tuning).

## Not Splitting the Train and  Test set 
Usually, we would split the training dataset to continue feature selection and perform model evaluation. However, since Kaggle has provided us with a train and test set, we will move forward using these tables. 

Here is a reference for the train and test split labels:

* X_train -> training table with target removed
* Y_train -> target of training table
* X_test -> test table with target removed
* Y_test -> target of test data

In [8]:
# Modify train and test table for modeling 
train_label = train_df['isFraud']
train_df = train_df.drop(labels=['TransactionID','TransactionDT','isFraud'], axis=1)
test_df = test_df.drop(labels=['TransactionID','TransactionDT'], axis=1)


In [None]:
# Copy datasets
train = train_df.copy()
test = test_df.copy()

## Imputing Missing Values
Standard machine learning models cannot deal with missing values, and which means we have to find a way to fill these in or disard any features with missing values. Here, we will fill in missing values with the mode of the column.

In [9]:
# Create an imputer object with a most frequent filling strategy
imputer = SimpleImputer(strategy='most_frequent')

# Train on the training features
imputer.fit(train)

# Transform both training data and testing data
train = imputer.transform(train)
test = imputer.transform(test)



In [10]:
print('Missing values in training features: ', np.sum(np.isnan(train)))
print('Missing values in testing features:  ', np.sum(np.isnan(test)))

Missing values in training features:  0
Missing values in testing features:   0


In [11]:
# Make sure all values are finite
print(np.where(~np.isfinite(train)))
print(np.where(~np.isfinite(test)))

(array([], dtype=int64), array([], dtype=int64))
(array([], dtype=int64), array([], dtype=int64))


### Scaling Features 
The final step to take before we can build our models is to scale the features. This is necessary because features are in different units, and we want to normalize the features so the units do not affect the algorithm. Linear Regression and Random Forest do not require feature scaling, but other methods, such as support vector machines and k nearest neighbors, do require it because they take into account the Euclidean distance between observations. For this reason, it is a best practice to scale features when we are comparing multiple algorithms.

In [12]:
# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(train)

# Transform both the training and testing data
train = scaler.transform(train)
test = scaler.transform(test)

In [13]:
# Convert train_label to one-dimensional array (vector)
train_label = np.array(train_label).reshape((-1, ))


# Baseline 
For a naive baseline, logistic regression to predict the probability of occurrence of fraud by utilizing a logit function. 

## Logistic Regression 
First, we will use lasso regression to conduct feature selection.  first create the model and train the model using .fit.  Then, we make predictions on the testing data using .predict_proba (remember that we want probabilities and not a 0 or 1).

### Selecting features using Lasso regularisation

Here I will do the model fitting and feature selection, altogether in one line of code. First I specify the Logistic Regression model, and I make sure I select the Lasso (L1) penalty.Then I use the selectFromModel object from sklearn, which will select in theory the features which coefficients are non-zero.

In [14]:
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1'))
sel_.fit(scaler.transform(train), train_label)



SelectFromModel(estimator=LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
        max_features=None, norm_order=1, prefit=False, threshold=None)

#### Make a list of with the selected features.


In [15]:
selected_feat = train_df.columns[(sel_.get_support())]
print('total features: {}'.format((train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
      np.sum(sel_.estimator_.coef_ == 0)))

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

#### Identifying the removed features

In [16]:
removed_feats = train_df.columns[(sel_.estimator_.coef_ == 0).ravel().tolist()]
removed_feats

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

#### Removing the features from training an test set

In [17]:
train_selected = sel_.transform(train)
test_selected = sel_.transform(test)
train_selected.shape, test_selected.shape

((590540, 780), (506691, 780))

In [18]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(train_selected, train_label)

#
y_pred=logreg.predict(test_selected)



In [19]:
# ROC Curve
y_pred_proba = logreg.predict_proba(test_selected)[::,1]
fpr, tpr, _ = metrics.roc_curve(train_label,  y_pred_proba)
auc = metrics.roc_auc_score(train_label, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

ValueError: Found input variables with inconsistent numbers of samples: [590540, 506691]

The logistic regression basline scored around 0.81. This model is slightly better than using only transaction dataset which scored 0.78.

## Models to Evaluate
We will compare five different machine learning models:

1. Logistic Regression
2. Random Forest Classifer
3. Support Vector Machine 
4. Gradient Boosting Classifer
5. K-Nearest Neighbors Classifer

In [25]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
def fit_and_evaluate(model):
    kfold = KFold(n_splits=10, random_state=42)
    cv_results = cross_val_score(model, train_selected, train_label, cv=kfold, scoring='roc_auc')
    score = cv_results
    name = str(model)
    msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [26]:
lr = LogisticRegression()
fit_and_evaluate(lr)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False): 0.850680 (0.016009)


In [27]:
rfc = RandomForestClassifier()
fit_and_evaluate(rfc)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False): 0.828824 (0.020985)


In [None]:
smv = SVC()
fit_and_evaluate(smv)



In [None]:
xgb = XGBClassifier()
fit_and_evaluate(xgb)

In [None]:
knn = KNeighborsClassifier()
fit_and_evaluate(knn)

## Improved Model: Random Forest
Let's try using a Random Forest on the same training data to see if it will beat the performance of our baseline. The Random Forest is a much more powerful model especially when we use hundreds of trees. We will use 100 trees in the random forest.

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
clf.fit(train_selected,train_label)

y_pred=clf.predict(test_selected)

In [None]:
# ROC Curve
y_pred_proba = clf.predict_proba(test_selected)[::,1]
fpr, tpr, _ = metrics.roc_curve(train_label,  y_pred_proba)
auc = metrics.roc_auc_score(train_label, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

This model score is around 0.957 which is 0.02 better than only using transaction dataset. This is also a drastic improvement from our baseline model. 

### Model Interpretation: Feature Importances
As a simple method to see which variables are the most relevant, we can look at the feature importances of the random forest. We may use these feature importances as a method of dimensionality reduction in future work.

In [None]:
feature_imp = pd.Series(clf.feature_importances_).sort_values(ascending=False)

In [None]:
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()

Well, this graph isn't helpful. How can I fix this? Maybe I'll try turning it into a list and return only the most top features. 

# Conclusion 
We followed the general outline of a machine learning project:

1. Understand the problem and the data
2. Data cleaning and formatting (this was mostly done for us)
3. Exploratory Data Analysis
4. Baseline model
5. Improved model
6. Model interpretation (just a little)