## ================================================
## PLEASE RUN THIS DEMO WITH PYTHON 2 AND SPARK 2.1
## ================================================

## Credit Card Fraud Detection

**Use Case:** A credit-card company has released an anonymized list of time-stamped European card transactions, recorded over two days in September 2013, for detecting fraudulent credit-card transactions. The data are highly imbalanced: only a small percentage of the transactions are fraudulent. The goal is to train the computer to detect fraudulent transactions. To assess the performance of various machine/deep-learning algorithms, we train and evaluate three popular models that are often used to cope with imbalanced datasets: (1) *autoencoder neural networks*, (2) *support vector machines* and (3) *decision trees*.

**Data Source:** Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud


### Data inspection and ETL

#### The CSV file creditcard.csv is downloaded and stored in IBM Cloud Object Storage. Its contents are then read into a Spark dataframe using the appropriate credentials.

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
import ibmos2spark

configuration_name = 'os_3253229b087f481987a8b59b2f9ce876_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession

spark=SparkSession.builder.getOrCreate()
df_credit = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('creditcard.csv', 'ibmadvanceddatasciencecapstone-donotdelete-pr-iivs2mtioqigul'))
df_credit.take(5)

#### Data properties are checked.

In [None]:
num_smp = df_credit.count() # Number of samples
num_feat = len(df_credit.columns) # Number of features
print(num_smp,num_feat)

In [None]:
df_credit.printSchema()

#### Cast the entire dataframe to float format. Note that PCA has already been applied to the original confidential dataset to produce this public dataset and protect the identities of the card users.

In [None]:
from pyspark.sql.functions import col

df_credit=df_credit.select(*(col(c).cast("float").alias(c) for c in df_credit.columns))
df_credit.printSchema()

#### We now write the preprocessed dataframe to Object Storage in Parquet format for later use.

In [None]:
from pyspark.sql import SparkSession

spark=SparkSession.builder.getOrCreate()
df_credit=df_credit.repartition(1)
df_credit.write.mode('overwrite').parquet(cos.url('credit.parquet','ibmadvanceddatasciencecapstone-donotdelete-pr-iivs2mtioqigul'))

### ===========================

### Load data from Object Storage

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
import ibmos2spark

from pyspark.sql import SparkSession
spark=SparkSession.builder.getOrCreate()

configuration_name = 'os_3253229b087f481987a8b59b2f9ce876_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

df_credit=spark.read.parquet(cos.url('credit.parquet','ibmadvanceddatasciencecapstone-donotdelete-pr-iivs2mtioqigul'))

### Load all packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import itertools
import json

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report, f1_score,
                             precision_recall_fscore_support)

from keras.models import Model, load_model
from keras.layers import Input, Dense, Dropout
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers

from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, Normalizer, StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.mllib.evaluation import MulticlassMetrics, BinaryClassificationMetrics
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.sql import DataFrame

from functools import reduce

### Data visualization

In [None]:
df=df_credit.toPandas()

In [None]:
df_legal=df.loc[df['Class'] == 0]
df_fraud=df.loc[df['Class'] == 1]

df_legal=df_legal.values
df_fraud=df_fraud.values

df_legal.shape

In [None]:
size_l = len(df_legal)
size_f = len(df_fraud)

plt.bar([0,1], [size_l,size_f], align='center', alpha=1)
plt.xticks([0,1],['Legal','Fraudulent'])
plt.ylabel('Frequency')
plt.title('Credit-card transactions')
 
plt.show()

fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
ax.plot(df_legal[:,0], df_legal[:,29], '-', color='blue', animated = True, linewidth=1)
ax.plot(df_fraud[:,0], df_fraud[:,29], '-', color='red', animated = True, linewidth=1)
plt.ylabel('Credit-card transaction amount')
plt.xlabel('Time')
plt.legend(['Legal','Fraudulent'],loc='best')
plt.show()

#### As expected, the dataset is highly imbalanced, with a very low percentage of fraudulent transactions. The above time series suggests that, apart from the anonymized features V1 to V28, small transaction amounts are also a signature of fraudulent credit-card activity.
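
To make the imbalance concrete, the sketch below computes the fraud fraction and the accuracy of a trivial classifier that always predicts the majority class. The label list is a hypothetical toy example, not the Kaggle data, and `class_balance` is an illustrative helper name that does not appear elsewhere in this notebook.

```python
def class_balance(labels):
    """Return (fraud_fraction, baseline_accuracy) for a list of 0/1 labels.

    baseline_accuracy is the accuracy of always predicting the majority class.
    """
    n = len(labels)
    n_fraud = sum(1 for y in labels if y == 1)
    fraud_fraction = n_fraud / float(n)
    baseline_accuracy = max(n_fraud, n - n_fraud) / float(n)
    return fraud_fraction, baseline_accuracy


# Hypothetical toy labels: 2 fraudulent transactions out of 1000.
toy_labels = [0] * 998 + [1] * 2
fraud_fraction, baseline_accuracy = class_balance(toy_labels)
print(fraud_fraction, baseline_accuracy)
```

Because such a baseline already reaches a very high accuracy without detecting a single fraud, accuracy alone says little on data like these; precision and recall are more informative.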

### (1) We first train the machine to detect fraudulent transactions with Autoencoders using Keras

### Preprocess the transactions data and group the legal transactions for training
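
The idea behind this step can be sketched without any library: each feature column is rescaled to zero mean and unit variance (using the population standard deviation), and only legal rows (class 0) are kept for training the autoencoder. The rows below are hypothetical toy values, and `standardize_columns` is an illustrative helper, not part of the notebook.

```python
import math


def standardize_columns(rows):
    """Scale each column of a list of equal-length rows to zero mean, unit variance."""
    n = float(len(rows))
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_cols)]
    # Guard against constant columns to avoid division by zero.
    return [[(r[j] - means[j]) / (stds[j] if stds[j] > 0 else 1.0)
             for j in range(n_cols)] for r in rows]


# Hypothetical toy data: two features per transaction, plus 0/1 class labels.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
labels = [0, 0, 1, 0]

scaled = standardize_columns(features)
# Keep legal transactions only for training the autoencoder.
train_rows = [row for row, y in zip(scaled, labels) if y == 0]
print(len(train_rows))
```

Training on legal transactions only means the autoencoder learns to reconstruct normal behaviour, so that unusual transactions are expected to reconstruct poorly.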

In [None]:
df_prep=df.drop('Time',axis=1)                           # Remove the 'Time' column.
df_class=df_prep['Class']                                # Extract and store the class labels.
header = list(df_prep)
df_prep=StandardScaler().fit(df_prep).transform(df_prep) # Perform standard feature scaling
df_prep=pd.DataFrame(data=df_prep,columns=header)        # Reinstate column headers
df_prep['Class']=df_class                                # Reinstate correct class labels

X_train, X_rest = train_test_split(df_prep, test_size=0.4)  # Split data to 60% training, 20% validation and 20% test
X_val, X_test = train_test_split(X_rest, test_size=0.5)

X_train = X_train[X_train.Class == 0]
X_train = X_train.drop('Class', axis=1)
y_val = X_val['Class']
X_val = X_val.drop('Class', axis=1)
y_test = X_test['Class']
X_test = X_test.drop('Class', axis=1)

X_train = X_train.values
X_val = X_val.values
X_test = X_test.values

print(X_train.shape)
print(X_val.shape)
print(y_val.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
df_prep.head()

### Define the autoencoder model

In [None]:
input_dim = X_train.shape[1]

input_layer = Input(shape=(input_dim, ))

# encoder = Dense(40, activation="tanh", 
#                 activity_regularizer=regularizers.l1(1e-10))(input_layer)
# encoder = Dense(35, activation="tanh")(encoder)
# encoder = Dense(30, activation="tanh")(encoder)
# encoder = Dense(20, activation="tanh")(encoder)
# decoder = Dense(30, activation="relu")(encoder)
# decoder = Dense(35, activation="relu")(decoder)
# decoder = Dense(40, activation="relu")(decoder)
# decoder = Dense(input_dim, activation='relu')(decoder)

encoder = Dense(40, activation="tanh", 
                activity_regularizer=regularizers.l1(1e-5))(input_layer)
encoder = Dense(20, activation="tanh")(encoder)
decoder = Dense(20, activation="relu")(encoder)
decoder = Dense(40, activation="relu")(decoder)
decoder = Dense(input_dim, activation='relu')(decoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)

autoencoder.compile(optimizer='adam', 
                    loss='mae', 
                    metrics=['accuracy'])
checkpointer = ModelCheckpoint(filepath="model_autoenc.h5",
                               verbose=0,
                               save_best_only=True)
tensorboard = TensorBoard(log_dir='./logs',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)

history = autoencoder.fit(X_train, X_train,
                    epochs=100,
                    batch_size=32,
                    validation_data=(X_val, X_val),
                    verbose=1,
                    callbacks=[checkpointer, tensorboard]).history

with open('model_autoenc_hist.txt', 'w') as outfile:  
    json.dump(history, outfile)

### Evaluating the autoencoder results

#### We judge that the autoencoder converges well from both the loss and accuracy learning curves.

In [None]:
with open('model_autoenc_hist.txt') as json_file:  
    history_ld = json.load(json_file)
    
autoencoder_ld = load_model('model_autoenc.h5')

plt.plot(history_ld['val_loss'])
plt.plot(history_ld['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Validation set','Training set'],loc='best');
plt.show()

plt.plot(history_ld['val_acc'])
plt.plot(history_ld['acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Validation set','Training set'],loc='lower right');
plt.show()

#### Since this dataset is imbalanced, we first investigate the confusion matrix for a fixed threshold. It turns out that while the autoencoder is capable of rooting out a large portion of fraudulent transactions (high recall), it also flags many legal transactions as fraudulent (low precision).
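
A minimal sketch of the thresholding step, assuming reconstruction errors are already available: a transaction is flagged as fraudulent when its error exceeds a chosen fraction of the largest error, and precision and recall follow from the resulting counts. The error values, labels and function names below are hypothetical and serve only as an illustration.

```python
def flag_by_error(errors, rel_threshold):
    """Label 1 (fraud) when the error exceeds rel_threshold * max(errors)."""
    cutoff = rel_threshold * max(errors)
    return [1 if e > cutoff else 0 for e in errors]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical reconstruction errors and true classes.
errors = [0.1, 0.2, 0.6, 0.3, 5.0, 10.0]
y_true = [0, 0, 0, 0, 1, 1]

y_pred = flag_by_error(errors, rel_threshold=0.05)
precision, recall = precision_recall(y_true, y_pred)
print(y_pred, precision, recall)
```

In this toy case both frauds are caught (recall 1.0) at the cost of one false alarm (precision 2/3), which mirrors the high-recall, low-precision behaviour described above.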

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.grid(False)

In [None]:
predictions = autoencoder_ld.predict(X_test) # Autoencoder predictions are feature values, not class labels.
mse = np.mean(np.power(X_test - predictions, 2), axis=1) # Compute reconstruction error based on mean squared-error.
error_df = pd.DataFrame({'reconstruction_error': mse,
                        'true_class': y_test})

threshold=1e-2 # Relative threshold: a fraction of the maximum MSE (multiplied by maxval below).

LABELS=['Legal','Fraudulent']
maxval=np.max(error_df.reconstruction_error.values)
y_pred = [1 if e > threshold*maxval else 0 for e in error_df.reconstruction_error.values] # Classify based on reconstruction error.
cm = confusion_matrix(error_df.true_class, y_pred).astype('float64')

accuracy=(cm[0][0]+cm[1][1])/cm.sum()
precision=(cm[1][1])/(cm[1][1]+cm[0][1])
recall=(cm[1][1])/(cm[1][1]+cm[1][0])
print(precision, recall)
F1=2*precision*recall/(precision+recall)
print("Statistics for test data: accuracy,precision,recall,F1 ",accuracy,precision,recall,F1)

cm = cm.astype('int')
plt.figure(figsize=(6, 6))
plot_confusion_matrix(cm, classes=LABELS,normalize= False,  title='Confusion matrix for test data')

#### For such a highly imbalanced dataset, it is more appropriate to look at the precision-recall curve (https://doi.org/10.1371/journal.pone.0118432) and check that the area under it is large, which generally corresponds to high precision and high recall. For our autoencoder, the area under the curve turns out to be small.
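
As a rough illustration of what this area measures, the sketch below builds a precision-recall curve by lowering the score threshold one sample at a time and integrates it with the trapezoidal rule. It is a simplified stand-in, not the scikit-learn implementation; the scores and labels are hypothetical, and anchoring the curve at recall 0 and precision 1 is a convention assumed here.

```python
def pr_curve(y_true, scores):
    """Return (recalls, precisions) obtained by sweeping the threshold downward."""
    n_pos = sum(y_true)
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    recalls, precisions = [0.0], [1.0]  # anchor point at recall 0
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            recalls.append(tp / float(n_pos))
            precisions.append(tp / float(tp + fp))
    return recalls, precisions


def area_under_curve(x, y):
    """Trapezoidal rule for points ordered by increasing x."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2.0
               for i in range(len(x) - 1))


# Hypothetical scores (e.g. reconstruction errors) and true classes.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]

recalls, precisions = pr_curve(y_true, scores)
pr_auc = area_under_curve(recalls, precisions)
print(pr_auc)
```

A classifier that ranks every fraud above every legal transaction would give an area of 1, so the closer the area is to 1, the better precision and recall can be achieved together.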

In [None]:
precision, recall, th = precision_recall_curve(error_df.true_class, error_df.reconstruction_error)
plt.plot(recall, precision)
plt.title('Precision-Recall curve for test data')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

print("Area Under PR Curve:", auc(recall,precision))

#### We therefore need a better machine-learning algorithm to increase this area.

### (2) Let us train the computer with Support Vector Machines using Sklearn 

In [None]:
X_train, X_test = train_test_split(df_prep, test_size=0.3)  # Split data to 70% training and 30% test

y_train = X_train['Class']
X_train = X_train.drop('Class', axis=1)
y_test = X_test['Class']
X_test = X_test.drop('Class', axis=1)

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

clf = svm.SVC(kernel='rbf',verbose=True,max_iter=200,probability=True)
clf.fit(X_train, y_train) 
yhat = clf.predict(X_test)

#### The confusion matrix reads

In [None]:
cm = confusion_matrix(y_test, yhat).astype('float64')

accuracy=(cm[0][0]+cm[1][1])/cm.sum()
precision=(cm[1][1])/(cm[1][1]+cm[0][1])
recall=(cm[1][1])/(cm[1][1]+cm[1][0])
F1=2*precision*recall/(precision+recall)
print("Statistics for test data: accuracy,precision,recall,F1 ",accuracy,precision,recall,F1)

cm = cm.astype('int')
plt.figure(figsize=(6, 6))
plot_confusion_matrix(cm, classes=LABELS,normalize= False,  title='Confusion matrix for test data') # Labels come from clf.predict (decision function), not from thresholding predict_proba

#### We see that this time, compared to the autoencoder, the SVM gives a much better prediction with relatively high precision and recall. Its precision-recall curve also has a larger area.

In [None]:
yhat_prob = clf.predict_proba(X_test)
precision, recall, th = precision_recall_curve(y_test, yhat_prob[:,1])
plt.plot(recall, precision)
plt.title('Precision-Recall curve for test data')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

print("Area Under PR Curve:", auc(recall,precision))

#### We shall see that it is indeed possible to improve this result further!

### (3) We now train the machine with Decision Trees using Apache Spark

In [None]:
splits=df_credit.randomSplit([0.7,0.3]) # Split data to 70% training and 30% test
df_train=splits[0]
df_test=splits[1]

In [None]:
df_train=reduce(DataFrame.drop, ['Time'], df_train)   # Drop 'Time' column from the training dataset
df_test=reduce(DataFrame.drop, ['Time'], df_test)     # Drop 'Time' column from the test dataset

print(len(df_train.columns),len(df_test.columns))

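header = [c for c in df_train.columns if c != 'Class']                                        # Feature columns only; exclude the label 'Class' to avoid leakage

vectorAssembler=VectorAssembler(inputCols=header,outputCol='features')                        # Assemble the feature vector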
labelIndexer = StringIndexer(inputCol="Class", outputCol="indexedClass").fit(df_train)        # Index class labels

normalizer = Normalizer(inputCol='features',outputCol='features_norm',p=2.0)                  # Normalize the feature matrix

tree_classifier =RandomForestClassifier(labelCol='indexedClass',featuresCol='features_norm',numTrees=3,seed=0)    # Define a classifier

pipeline=Pipeline(stages=[vectorAssembler,labelIndexer,normalizer,tree_classifier])           # Assemble the ML pipeline

In [None]:
model =pipeline.fit(df_train)
predict_train=model.transform(df_train)
predict_test=model.transform(df_test)

### Evaluate the Random-Forest predictions first with the confusion matrix

#### We find that RF not only roots out most fraudulent cases, but also recognizes almost all legal transactions in the test dataset.

In [None]:
results = predict_test.select(['prediction', 'Class'])
predictionAndLabels=results.rdd
metrics = MulticlassMetrics(predictionAndLabels)

cm=metrics.confusionMatrix().toArray()
accuracy=(cm[0][0]+cm[1][1])/cm.sum()
precision=(cm[1][1])/(cm[1][1]+cm[0][1])
recall=(cm[1][1])/(cm[1][1]+cm[1][0])
F1=2.*precision*recall/(precision+recall)
print("Statistics for test data: accuracy,precision,recall,F1 ",accuracy,precision,recall,F1)

cm=cm.astype('int')
plt.figure(figsize=(6, 6))
plot_confusion_matrix(cm, classes=LABELS,normalize= False,  title='Confusion matrix for test data') # Fixed at standard probability threshold of 0.5

#### As a matter of fact, we obtain a rather large area for the precision-recall curve.

In [None]:
y_proba=predict_test.select("probability").collect()
y_proba=np.asarray([x[0] for x in y_proba])
y_test=predict_test.select("Class").collect()
y_test=np.asarray([x[0] for x in y_test])

In [None]:
precision, recall, th = precision_recall_curve(y_test, y_proba[:,1])
plt.plot(recall, precision)
plt.title('Precision-Recall curve for test data')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0,1.03])
plt.show()

# Named 'evaluator' to avoid shadowing the Python built-in eval()
evaluator = BinaryClassificationEvaluator().setMetricName('areaUnderPR').setLabelCol('Class').setRawPredictionCol("rawPrediction")
print('Area under PR curve:',evaluator.evaluate(predict_test))

#### We therefore conclude that Decision Trees are the best of the three alternatives for detecting fraudulent credit-card activities. In effect, we have reproduced the general observation reported in the proceedings paper http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf: Decision-Tree classifiers outperform SVMs in credit-card fraud detection.

### Conclusion

We have tested autoencoder neural networks, SVMs and decision-tree classifiers for credit-card fraud detection. We find that autoencoder neural networks quite often misclassify legal transactions as fraudulent on test/validation data, even though they manage to root out a large portion of fraudulent transactions. SVMs improve on this result by raising the precision, but decision-tree algorithms turn out to be the most reliable in recognizing legal and fraudulent transactions in highly imbalanced situations.