# Fraud detection

In case of comments or questions: please contact <a href="mailto:R.vdnAkker@uvt.nl">Ramon van den Akker</a>.

The notebook contains illustrations and exercises corresponding to the module Data Science II.

# 0. Install additional packages (you only need to do this once)

We need an additional package, imbalanced-learn.

In [None]:
# Install the package imblearn in the current Jupyter kernel
!pip install imbalanced-learn

# 0. Import standard packages

Important Python packages are <a href="http://www.numpy.org/">numpy</a> (arrays, linear algebra, pseudorandom numbers, etc.), <a href="http://pandas.pydata.org/">pandas</a> (provides a convenient data structure, the "pandas dataframe"), <a href="http://matplotlib.org/">matplotlib</a> and <a href="http://seaborn.pydata.org/">seaborn</a> (data visualisation), and <a href="http://scikit-learn.org/stable/">sklearn</a> (scikit-learn; a powerful package containing machine learning and statistical learning functions).

Typically, all import statements are collected at the top of the notebook. If you get an error stating that a package is missing, open the <i>Anaconda prompt</i> and enter <i>conda install name-package</i>. If this does not work, use <i>pip install name-package</i> instead.

In [None]:
# Standard packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib import cm as cm
%matplotlib inline
import seaborn as sns
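# Newer matplotlib releases renamed "seaborn-deep" to "seaborn-v0_8-deep"; use whichever is available
plt.style.use("seaborn-v0_8-deep" if "seaborn-v0_8-deep" in plt.style.available else "seaborn-deep")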
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, precision_recall_curve
from sklearn.ensemble import IsolationForest

In [None]:
# Special packages:
from imblearn.over_sampling import RandomOverSampler, SMOTE

# 1. Getting started

## 1.1 Data retrieval

The dataset we will use originates from Kaggle; see <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud/version/3">link</a>. The data is available as a csv-file.

The cell below reads the file directly from a copy hosted on GitHub. To work offline, store the file in the same folder as this notebook and let `url_data` point to it.

##### Load data into a pandas dataframe from provided csv-file

In [None]:
url_data = "https://raw.githubusercontent.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/master/creditcard.csv"
df = pd.read_csv(url_data, sep=",")

##### Inspect first rows of dataframe and information on dataframe

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
print(f"The number of observations is {df.shape[0]}")
print(f"The number of variables is {df.shape[1]}")

Define the list of labels corresponding to Class=0 and Class=1:

In [None]:
target_names = ["non-fraud", "fraud"]

If performance turns out to be too slow, downsample the dataframe (take a random subset of the rows) before constructing the train, validation, and test sets.
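
A minimal sketch of such a downsampling step. The original lines are not reproduced here, so the helper name `downsample`, the fraction 0.2, and the toy dataframe are assumptions chosen for illustration:

```python
import pandas as pd

def downsample(df, frac=0.2, seed=140):
    """Return a random subsample of the rows of df (hypothetical helper)."""
    return df.sample(frac=frac, random_state=seed).reset_index(drop=True)

# Toy stand-in for the credit card dataframe (assumption, for illustration only)
toy = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
toy_small = downsample(toy, frac=0.2)
print(toy_small.shape)
```

In the notebook this would be applied as `df = downsample(df)`. Random downsampling preserves the fraud rate only approximately, so it is worth checking the class counts afterwards.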

## 1.2 Generate train, validation, and test sets

Construct a train, validation, and test set (50%, 25%, and 25% of the observations, respectively), and organize the features in dataframes X and the target in series y.

In [None]:
seed = 140
X_train, X_aux, y_train, y_aux = train_test_split(df.drop(columns=["Class"]), df["Class"], test_size=0.5, random_state=seed)
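# Fix the seed for the second split as well, so that the validation and test sets are reproducible
X_val, X_test, y_val, y_test = train_test_split(X_aux, y_aux, test_size=0.5, random_state=seed)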
print(f"data_train shape: {X_train.shape}")
print(f"data_validation shape: {X_val.shape}")
print(f"data_test shape: {X_test.shape}")

# 2. Elementary Data Exploration

### Check the descriptive statistics:

In [None]:
X_train.describe()

### Inspect estimated correlation matrix:

In [None]:
X_train.corr(method="pearson")

### Perhaps easier to analyze estimated correlations via a visualization:

In [None]:
def VizCorrelationMatrix(df):
    fig = plt.figure()
    ax1 = fig.add_subplot(111)
    cmap = plt.get_cmap("jet", 30)  # cm.get_cmap is no longer available in recent matplotlib releases
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title("Estimated Correlation Matrix")
    fig.colorbar(cax, ticks=[.75,.8,.85,.90,.95,1])
    plt.show()
VizCorrelationMatrix(X_train)
print(pd.DataFrame(X_train.columns, columns=["Name feature"]))

### Distribution of the target:

In [None]:
print("Recall, from the description of the data, that target=1 corresponds to a fraud.")
print("\n")
print("The data type of the target is " + str(type(y_train)))
print("\n")
print("The distribution of the target in the train set:")
unique, counts = np.unique(y_train, return_counts=True)
print(" - value " + str(unique[0]) + ": " + str(counts[0]) + " observations;" )
print(" - value " + str(unique[1]) + ": " + str(counts[1]) + " observations.")
print("\n")
print("The frequency of observations with Y=1 equals (in the train set): " + str(np.round(100*counts[1]/(counts[0]+counts[1]),1)) + "%.")

#### Question
Is this dataset imbalanced?

### Check histograms features

In [None]:
X_train.hist(bins=50, figsize=(20,15))

Several packages provide an extensive exploratory data analysis out of the box. See, for example, the pandas-profiling package (since renamed to ydata-profiling): https://github.com/pandas-profiling/pandas-profiling.

# 3. Univariate predictive performance
The following histograms show the marginal distributions of the features in the groups $\{Y=1\}$ (True) and $\{Y=0\}$ (False).

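(Note that the histograms are normalized: each group's histogram integrates to one, so the two groups can be compared despite their very different sizes.)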

In [None]:
Z = X_train.copy()
Z["target"] = y_train  # aligned on the index
for name in X_train.columns:
    print("Consider feature " + name + ":")
    x = Z.loc[Z["target"] == 1, name]  # feature values of the frauds
    y = Z.loc[Z["target"] == 0, name]  # feature values of the non-frauds
    left = min(x.min(), y.min())
    right = max(x.max(), y.max())
    plt.hist([x, y], bins=25, range=[left, right], label=["1", "0"], density=True)
    plt.legend(loc="upper right")
    plt.show()

##### Question
Which features seem to be promising?

##### Question
What other analyses could you think of to assess the predictive performance of features?
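
One possible analysis, sketched under assumptions: use each feature on its own as a score and compute its AUC with `roc_auc_score`. The helper name `univariate_auc` and the synthetic data are illustrative; this is one option among several, not a prescribed answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def univariate_auc(X, y):
    """AUC of each feature used on its own as a score (hypothetical helper)."""
    aucs = {}
    for name in X.columns:
        auc = roc_auc_score(y, X[name])
        aucs[name] = max(auc, 1 - auc)  # the direction of the relation does not matter
    return pd.Series(aucs).sort_values(ascending=False)

# Synthetic stand-in data (assumption): "informative" shifts with the target, "noise" does not
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.integers(0, 2, size=1000))
X_toy = pd.DataFrame({
    "informative": rng.normal(size=1000) + 2 * y_toy,
    "noise": rng.normal(size=1000),
})
auc_toy = univariate_auc(X_toy, y_toy)
print(auc_toy)
```

On the notebook data the call would be `univariate_auc(X_train, y_train)`. An AUC close to 0.5 suggests that the feature on its own carries little information about the target.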

# 3. Elementary Model Exploration

## 3.1  Decision Tree

See the scikit-learn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier, for further details.

Initialization of learner (create object):

In [None]:
dt = tree.DecisionTreeClassifier(max_depth=10)

Estimate model:

In [None]:
dt.fit(X_train, y_train)

Determine classifications on train and validation sets using estimated model:

In [None]:
hat_y_train_dt = dt.predict(X_train)
hat_y_val_dt = dt.predict(X_val)

Inspect type of output that <i>predict</i> yields:

In [None]:
plt.hist(hat_y_train_dt)
print(f"Unique values: {np.unique(hat_y_train_dt)}")

In [None]:
# As we are going to plot a lot of confusion matrices, we create a function:
def plot_confusion_matrix(hat_y, y, target_names):
    matrix = confusion_matrix(y, hat_y)  # note that true label corresponds to first argument
    sns.heatmap(matrix.T, square=True, annot=True, fmt="d", cbar=False,
    xticklabels=target_names, yticklabels=target_names)
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    accuracy = accuracy_score(y, hat_y, normalize=True, sample_weight=None)
    print("The accuracy is " + str(np.round(100*accuracy,1)) + "%")
plot_confusion_matrix(hat_y_train_dt, y_train, target_names)

##### Question
Check, using the command `sum(...)`, that the matrix is indeed correct.
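
As a sketch of how such a check could look, the four cells of a confusion matrix can be counted by summing boolean conditions. The toy arrays below are assumptions for illustration; in the notebook they would be `hat_y_train_dt` and `y_train`.

```python
import numpy as np

# Toy labels and predictions (assumption, for illustration only)
y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
hat_y_toy = np.array([0, 1, 0, 1, 0, 0, 1, 0])

n_TP = sum((hat_y_toy == 1) & (y_toy == 1))  # predicted fraud, truly fraud
n_FP = sum((hat_y_toy == 1) & (y_toy == 0))  # predicted fraud, truly non-fraud
n_FN = sum((hat_y_toy == 0) & (y_toy == 1))  # predicted non-fraud, truly fraud
n_TN = sum((hat_y_toy == 0) & (y_toy == 0))  # predicted non-fraud, truly non-fraud
print(n_TP, n_FP, n_FN, n_TN)
```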

##### Question
What is the number of False Positives?  

##### Question  
Determine classifications on validation set using the estimated model.

##### Question
The accuracy is almost perfect. How do you assess the quality of the model?

## 3.2 Random forest

You can reduce n_estimators to speed up calculations.

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=123)

In [None]:
rf = clf.fit(X_train, y_train)

In [None]:
hat_y_train_rf = rf.predict(X_train)
hat_y_val_rf = rf.predict(X_val)

In [None]:
print("Results for train set:")
plot_confusion_matrix(hat_y_train_rf, y_train, target_names)

In [None]:
print("Results for validation set:")
plot_confusion_matrix(hat_y_val_rf, y_val, target_names)

# 4. Cost-sensitive learning

Let us first verify that our classifiers can also deliver a score / probability as output.

In [None]:
rf.predict_proba(X_val)

You see that two columns are generated. The second one contains the estimated probability of the "fraud" class.

### Cost misclassification
Let us use as misclassification costs:
- for True Negative and True Positive: 0
- for False Negative (missed fraud): 250
- for False Positive (false alert): 1

In [None]:
cost_FP = 1
cost_FN = 250

## 4.1. Optimal threshold for "probability learners"

In [None]:
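# Classify as fraud when the expected cost of a missed fraud exceeds that of a false alert;
# the zeros are the costs of the correct decisions (True Negative and True Positive)
optimal_treshold = (cost_FP - 0) / (cost_FP - 0 + cost_FN - 0)
print("For `probability learners' the optimal threshold (assuming we are dealing with true probabilities) is " + str(np.round(100*optimal_treshold, 3)) + "%")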
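
The formula can be motivated by comparing expected costs. As a sketch, write $p$ for the probability of fraud and $c_{TP}, c_{TN}, c_{FP}, c_{FN}$ for the four costs (this notation is introduced here for illustration and assumes that $p$ is a true probability):

$$
\begin{aligned}
\text{expected cost of predicting 1} &= p\,c_{TP} + (1-p)\,c_{FP},\\
\text{expected cost of predicting 0} &= p\,c_{FN} + (1-p)\,c_{TN}.
\end{aligned}
$$

Predicting 1 is cheaper when

$$
(1-p)\,(c_{FP} - c_{TN}) < p\,(c_{FN} - c_{TP})
\quad\Longleftrightarrow\quad
p > \frac{c_{FP} - c_{TN}}{(c_{FP} - c_{TN}) + (c_{FN} - c_{TP})}.
$$

With $c_{TP} = c_{TN} = 0$, $c_{FP} = 1$ and $c_{FN} = 250$ the threshold equals $1/251 \approx 0.398\%$.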

### Let us reconsider the estimated decision tree

We will compare the standard decision tree to the "optimal-threshold decision tree", which classifies an observation as "1" if the estimated probability exceeds the threshold above.

Classifications of the decision tree using `optimal_treshold`:

In [None]:
hat_y_val_dt_ot = (dt.predict_proba(X_val)[:,1] > optimal_treshold)
hat_y_train_dt_ot = (dt.predict_proba(X_train)[:,1] > optimal_treshold)

In [None]:
print("The confusion matrix on validation set for `standard' decision tree:")
plot_confusion_matrix(hat_y_val_dt, y_val, target_names)

In [None]:
print("The confusion matrix on validation set for `optimal-threshold' decision tree:")
plot_confusion_matrix(hat_y_val_dt_ot, y_val, target_names)

### Question:
Also evaluate the costs on the train set.

### Question:
Are the results as expected?

Let us evaluate costs.

In [None]:
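def estimate_cost(hat_y, y, cost_FP, cost_FN):
    # hat_y * (1 - y) flags the false positives; (1 - hat_y) * y flags the false negatives
    n_FP = np.sum(np.multiply(hat_y, 1 - y))
    n_FN = np.sum(np.multiply(1 - hat_y, y))
    return n_FP * cost_FP + n_FN * cost_FN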
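
A small worked example of this cost computation on toy vectors (the arrays are assumptions for illustration). It also shows that boolean predictions, such as those obtained by thresholding probabilities, are handled in the same way:

```python
import numpy as np

def estimate_cost(hat_y, y, cost_FP, cost_FN):
    # same computation as in the notebook cell
    return np.sum(np.multiply(hat_y, (1 - y)) * cost_FP) + np.sum(np.multiply((1 - hat_y), y) * cost_FN)

y_toy = np.array([0, 0, 1, 1, 0])
hat_y_toy = np.array([1, 0, 0, 1, 1])  # two false positives, one false negative
cost_toy = estimate_cost(hat_y_toy, y_toy, cost_FP=1, cost_FN=250)
cost_toy_bool = estimate_cost(hat_y_toy.astype(bool), y_toy, cost_FP=1, cost_FN=250)
print(cost_toy, cost_toy_bool)  # 2 * 1 + 1 * 250 = 252 in both cases
```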

In [None]:
c_dt_st = estimate_cost(hat_y_val_dt, y_val, cost_FP, cost_FN)
c_dt_ot = estimate_cost(hat_y_val_dt_ot, y_val, cost_FP, cost_FN)
# np.int was removed from NumPy; the built-in int does the same job
print("Cost misclassification using `standard' decision tree: " + str(int(c_dt_st)))
print("Cost misclassification using `optimal-threshold' decision tree: " + str(int(c_dt_ot)))
print("Ratio (standard/o-threshold): " + str(np.round(100 * c_dt_st / c_dt_ot, 1)) + "%")

##### Question
Analyze the performance of the estimated random forest in combination with the "optimal threshold".

## 4.2. Cost-sensitive decision tree

The decision tree of scikit-learn accepts class-weights as input (argument `class_weight`).

##### Question
Check, using the cells below, that such class-weights indeed have an impact on the <i>internal</i> structure of the tree (different splits and/or different selection of features).

In [None]:
# auxiliary function
def class_weight(cost_FP, cost_FN):
    return {0: cost_FP, 1: cost_FN}
# standard decision tree
ctree = tree.DecisionTreeClassifier(max_depth=2)
tree.plot_tree(ctree.fit(X_train, y_train.values))

In [None]:
# cost-sensitive decision tree
ctree = tree.DecisionTreeClassifier(max_depth=2, class_weight=class_weight(cost_FP, cost_FN))
tree.plot_tree(ctree.fit(X_train, y_train.values))

Next we fit a cost-sensitive decision tree and compare the resulting performance to that of a standard decision tree (with the same depth).

In [None]:
ctree = tree.DecisionTreeClassifier(max_depth=10, class_weight=class_weight(cost_FP, cost_FN))
ctree.fit(X_train, y_train.values)
hat_y_val_ctree = ctree.predict(X_val)
hat_y_train_ctree = ctree.predict(X_train)

In [None]:
print("Recall the confusion matrix on train set for `standard' decision tree:")
plot_confusion_matrix(hat_y_train_dt, y_train, target_names)

In [None]:
print("The confusion matrix for cost-sensitive decision tree:")
plot_confusion_matrix(hat_y_train_ctree, y_train, target_names)

In [None]:
cost_DT_ot = estimate_cost(hat_y_train_dt_ot, y_train, cost_FP, cost_FN)
cost_CSDT = estimate_cost(hat_y_train_ctree, y_train, cost_FP, cost_FN)
print("Cost misclassification on train set using `standard' decision tree with optimal threshold: " + str(cost_DT_ot))
print("Cost misclassification on train set using cost-sensitive decision tree: " + str(cost_CSDT))
print("Ratio (o-t DT/CSDT): " + str(np.round(100 * cost_DT_ot / cost_CSDT, 1)) + "%.")

##### Question
Also evaluate the costs on the validation set.

##### Question
Build a cost-sensitive random forest and evaluate its performance. See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier

##### Question (optional)
You could also use a wrapper, that is, a search over candidate class-weights evaluated on the validation set, to determine the optimal specification of the class-weights.
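
A minimal sketch of such a wrapper, under stated assumptions: the synthetic data from `make_classification`, the grid `candidate_weights`, and the tree depth are all illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def estimate_cost(hat_y, y, cost_FP, cost_FN):
    return np.sum(hat_y * (1 - y)) * cost_FP + np.sum((1 - hat_y) * y) * cost_FN

# Synthetic imbalanced stand-in data (assumption, not the credit card data)
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.97], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

cost_FP, cost_FN = 1, 250
candidate_weights = [1, 10, 50, 250, 1000]  # hypothetical grid for the weight of class 1
costs = {}
for w in candidate_weights:
    model = tree.DecisionTreeClassifier(max_depth=5, class_weight={0: 1, 1: w}, random_state=0)
    model.fit(X_tr, y_tr)
    # evaluate the misclassification cost on the validation part
    costs[w] = estimate_cost(model.predict(X_va), y_va, cost_FP, cost_FN)

best_w = min(costs, key=costs.get)
print(costs)
print("Weight with the lowest validation cost:", best_w)
```

Because the weight is selected on the validation set, the final comparison of models should be made on the test set.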

## 4.3. Sampling methods - oversampling of minority class

We will use the imbalanced-learn package. See https://imbalanced-learn.readthedocs.io/en/stable/index.html for the documentation.

We use the "over-sampler" from the imbalanced-learn package to obtain a balanced distribution of the target.

In [None]:
seed = 123
ros = RandomOverSampler(sampling_strategy="not majority", random_state=seed)

In [None]:
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)

In [None]:
print("Distribution of target in train set:")
print(pd.DataFrame(y_train).groupby("Class").size())  # the axis argument of groupby is deprecated in recent pandas
print("\n")
print("Distribution of target in oversampled train set:")
print(pd.DataFrame(y_train_over).groupby("Class").size())

##### Estimate decision tree on this rebalanced set

In [None]:
clf_over = tree.DecisionTreeClassifier()
dto = clf_over.fit(X_train_over, y_train_over)

In [None]:
hat_y_train_dto = dto.predict(X_train_over)
hat_y_val_dto = dto.predict(X_val)

In [None]:
print("Results for standard decision tree on validation set:")
plot_confusion_matrix(hat_y_val_dt, y_val, target_names)

In [None]:
print("Results for over-sampled decision tree on original validation set:")
plot_confusion_matrix(hat_y_val_dto, y_val, target_names)

In [None]:
cost_DT = estimate_cost(hat_y_val_dt, y_val, cost_FP, cost_FN)
cost_CSDT = estimate_cost(hat_y_val_ctree, y_val, cost_FP, cost_FN)
cost_dto = estimate_cost(hat_y_val_dto, y_val, cost_FP, cost_FN)
print("Ratio (DT/DTOver): " + str(np.round(100 * cost_DT / cost_dto, 1)) + "%.")
print("Ratio (CSDT/DTOver): " + str(np.round(100 * cost_CSDT / cost_dto, 1)) + "%.")

### Question:
Analyze the performances.

## 4.4. Sampling methods - SMOTE

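The `SMOTE` class in imbalanced-learn implements the Synthetic Minority Over-sampling Technique: it creates new minority-class observations by interpolating between existing minority observations and their nearest minority-class neighbours
(the default setting yields a uniform class distribution).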
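
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used by imbalanced-learn; the helper name `smote_like_samples` and the toy data are assumptions.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority points
    and one of their k nearest minority neighbours (simplified illustration)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)             # random minority points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    u = rng.random((n_new, 1))                                  # interpolation weights in [0, 1)
    return X_min[base] + u * (X_min[chosen] - X_min[base])

X_min_toy = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class (assumption)
X_new = smote_like_samples(X_min_toy, n_new=50)
print(X_new.shape)
```

Every synthetic point lies on the line segment between two existing minority observations, so the new points stay within the region already covered by the minority class.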

In [None]:
X_train_SMOTE, y_train_SMOTE = SMOTE(random_state=38).fit_resample(X_train, y_train)
print("Distribution of target in train set:")
print(pd.DataFrame(y_train).groupby("Class").size())  # the axis argument of groupby is deprecated in recent pandas
print("\n")
print("Distribution of target in oversampled train set:")
print(pd.DataFrame(y_train_SMOTE).groupby("Class").size())

In [None]:
clf_smote = tree.DecisionTreeClassifier()
clf_smote.fit(X_train_SMOTE, y_train_SMOTE)
hat_y_train_dt_SMOTE = clf_smote.predict(X_train_SMOTE)
hat_y_val_dt_SMOTE = clf_smote.predict(X_val)

In [None]:
cost_dt_smote = estimate_cost(hat_y_val_dt_SMOTE, y_val, cost_FP, cost_FN)
print("Ratio (DT/DTSMOTE): " + str(np.round(100 * cost_DT / cost_dt_smote, 1)) + "%")
print("Ratio (CSDT/DTOver): " + str(np.round(100 * cost_CSDT / cost_dto,1)) + "%")
print("Ratio (DTSMOTE/DTOver): " + str(np.round(100 * cost_dt_smote / cost_dto,1)) + "%")

### SMOTE with wrapper

In [None]:
n_fraud = sum(y_train)
n_nfraud = sum(1 - y_train)
factors = [1, 10, 100, 500]
for factor in factors:
    sm = SMOTE(sampling_strategy={0 : n_nfraud, 1 : int(n_fraud * factor)}, random_state=123)
    X_train_SMOTEaux, y_train_SMOTEaux = sm.fit_resample(X_train, y_train)
    DTSMOTEaux = tree.DecisionTreeClassifier(random_state=123)
    DTSMOTEaux.fit(X_train_SMOTEaux, y_train_SMOTEaux)
    hat_y_val_SMOTEaux = DTSMOTEaux.predict(X_val)
    cost_DTSMOTEaux = estimate_cost(hat_y_val_SMOTEaux, y_val, cost_FP, cost_FN)
    print("Oversampling Y=1, using SMOTE, by factor " + str(int(factor)) + ", yields ratio (CSDT/DTSMOTE): " + str(np.round(100 * cost_CSDT / cost_DTSMOTEaux,1)) + "%")

### Question:
Determine your favourite SMOTE-factor.

### Question:  
Choose your top three models (you are also allowed to estimate new ones)
and evaluate them on the test set.

# 5. Anomaly detection

We will use the unsupervised learning algorithm isolation forest as an alternative to the supervised methods we analyzed above.

First we determine the precision-recall curve for our random forest:

In [None]:
def draw_precision_recall(scores, y):
    precision, recall, thresholds = precision_recall_curve(y, scores)
    plt.fill_between(recall, precision, alpha=0.2, color="b")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title("Precision-Recall curve")
    plt.show()
draw_precision_recall(rf.predict_proba(X_val)[:, 1], y_val)

#### Next, we consider the isolation forest.

In [None]:
isof = IsolationForest(n_estimators=100, random_state=123)  # fixed seed for reproducible scores

Train an isolation forest (note that y_train is not used):  

In [None]:
isof.fit(X_train)

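Determine anomaly scores on the validation set (the scores are based on the average depth at which an observation is isolated; lower scores indicate more anomalous observations):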

In [None]:
isof_scores_val = isof.decision_function(X_val)

Evaluate distribution of scores in groups "fraud" and "non-fraud":

In [None]:
isof_scores_val_fraud = isof_scores_val[y_val==1]
isof_scores_val_nfraud = isof_scores_val[y_val==0]
left =min(np.nanmin(isof_scores_val_fraud), np.nanmin(isof_scores_val_nfraud))
right =max(np.nanmax(isof_scores_val_fraud), np.nanmax(isof_scores_val_nfraud))
plt.hist([isof_scores_val_fraud, isof_scores_val_nfraud], bins=25, range=[left, right], label=['1', '0'], density=True)
plt.legend(loc="upper right")
plt.show()

Precision-recall curve of the isolation forest (the scores are multiplied by -1, so that higher values point towards fraud):

In [None]:
draw_precision_recall(-1 * isof_scores_val, y_val)

##### Question
Consider the test set. Suppose that we are allowed to generate 150 alerts. Determine the alerts generated by the random forest and the alerts generated by the isolation forest. Determine the precision and the recall of both sets of alerts.
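
As a starting point, a hedged sketch of how precision and recall for a fixed number of alerts could be computed. The helper name `precision_recall_at_k` and the toy data are assumptions for illustration.

```python
import numpy as np

def precision_recall_at_k(scores, y, k):
    """Flag the k observations with the highest scores and return (precision, recall)."""
    scores = np.asarray(scores)
    y = np.asarray(y)
    alerts = np.argsort(-scores)[:k]  # indices of the k highest scores
    n_hits = y[alerts].sum()          # number of frauds among the alerts
    return n_hits / k, n_hits / y.sum()

# Toy scores and labels (assumption, for illustration only)
scores_toy = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
y_toy = np.array([1, 0, 0, 0, 1, 1])
precision_toy, recall_toy = precision_recall_at_k(scores_toy, y_toy, k=3)
print(precision_toy, recall_toy)
```

For the random forest the scores would be the second column of `predict_proba`. For the isolation forest the output of `decision_function` should be multiplied by -1 first, because lower values indicate more anomalous observations.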