Bryan Chen (bc2vf)<br>
Comparative Algorithm Study<br><br>
**Abstract**<br>
This study compares the Decision Tree, k-Nearest Neighbor, Random Forest, AdaBoost, and Support Vector Machine algorithms on detecting fraud in a dataset of credit card transactions from [Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud). I compared the five algorithms on accuracy, precision, and recall and determined Random Forest to be the best algorithm overall.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot
from matplotlib import gridspec

%matplotlib inline

In [None]:
df = pd.read_csv('./input/creditcard.csv')
df.head()

**Data**<br>
The dataset is highly unbalanced. There are 492 frauds out of 284,807 transactions, so the positive class (frauds) accounts for only 0.172% of all transactions. There are no missing values, which is convenient. There are 31 columns overall; apart from Time, Amount, and Class, the features are anonymized (the V columns).
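
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and shows the accuracy a trivial classifier would reach by labeling every transaction as non-fraud. The variable names and the baseline classifier are illustrative and are not part of the notebook.

```python
# Counts quoted in the text above.
n_transactions = 284807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# A hypothetical baseline that predicts "non fraud" for every row
# is right on every valid transaction and wrong on every fraud.
baseline_accuracy = 1 - fraud_rate
baseline_recall = 0 / n_frauds  # it catches none of the frauds

print(f"fraud rate:        {fraud_rate:.3%}")
print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

A baseline that scores this well on accuracy while catching no fraud is the reason precision and recall matter for this dataset.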

In [None]:
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

counts = df['Class'].value_counts().sort_index()  # transactions per class, ordered 0 then 1
counts.plot(kind='bar')
pyplot.title('Fraud classes')
pyplot.xlabel('Class (0 is non fraud)')
pyplot.ylabel('Count')

The relationship between these V features is unknown, so I removed every feature whose variance fell below a threshold of 0.16. The threshold is borrowed from the boolean case: a feature that is either one or zero in more than 80% of the samples carries little information, and because boolean features are Bernoulli random variables, its variance is Var[x] = p(1-p) = 0.8 * (1 - 0.8) = 0.16. The goal is to remove features that have similar distributions between fraudulent and valid transactions in order to identify values where fraudulent transactions are more common.
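
As a rough illustration of the Bernoulli variance formula above, the sketch below computes the 0.16 threshold and checks two made-up boolean columns against it. The helper `bernoulli_variance` and the columns `mostly_ones` and `balanced` are hypothetical names introduced only for this example.

```python
import statistics


def bernoulli_variance(p):
    """Variance of a 0/1 feature that equals 1 with probability p."""
    return p * (1 - p)


threshold = bernoulli_variance(0.8)  # the 0.16 used in the text

# Hypothetical boolean features, ten samples each.
mostly_ones = [1] * 9 + [0] * 1   # one in 90% of samples
balanced = [1] * 5 + [0] * 5      # one in 50% of samples

for name, column in [("mostly_ones", mostly_ones), ("balanced", balanced)]:
    variance = statistics.pvariance(column)  # population variance
    decision = "keep" if variance > threshold else "drop"
    print(name, round(variance, 4), decision)
```

The nearly constant column falls below the threshold and would be dropped, while the balanced one stays.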

In [None]:
print(df.isnull().sum())  # no null values
X = df.drop(columns='Class').values  # keep the label out of the feature matrix
y = df['Class'].values
sel = VarianceThreshold(threshold=0.16)
X = sel.fit_transform(X)  # fit the selector and drop the low-variance features in one step
print(X.shape)

If I used the same test set both to select the parameter values and to evaluate the model, I would risk optimistically biasing my model evaluations. For this reason, when a test set is used to select model parameters, a different test set is needed to get an unbiased evaluation of the selected model. I overcame this problem using nested cross-validation. First, an inner cross-validation tunes the parameters and selects the best model. Second, an outer cross-validation evaluates the model selected by the inner cross-validation. I also used scoring based on ROC AUC instead of accuracy.
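
The nested scheme described above can be sketched on a small synthetic dataset. This is a minimal illustration, not the notebook's actual experiment: the data come from `make_classification`, the grid is tiny, and `StratifiedKFold` is swapped in for plain `KFold` (my assumption) so that every fold contains both classes and ROC AUC is always defined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in data (about 10% positives); not the credit card set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

# Inner loop: tune max_depth using only the data it is fitted on.
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 3, 5]},
    cv=inner_cv,
    scoring='roc_auc',
)

# Outer loop: refit the whole search on each outer training split and
# score the tuned model on the held-out fold the search never saw.
outer_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring='roc_auc')
print(np.mean(outer_scores), np.std(outer_scores))
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what keeps the tuning inside each outer training split.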

For KNN with 10-fold cross-validation, the score was 0.99816 at k=1, 0.99842 at k=3, and 0.998367 at k=5, so k=3 is the optimal value.
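
The selection of k from those three scores amounts to taking the argmax, as in this small sketch. The dictionary name `knn_cv_scores` is illustrative; the values are the ones reported above.

```python
# Cross-validation scores reported in the text, keyed by k.
knn_cv_scores = {1: 0.99816, 3: 0.99842, 5: 0.998367}

# Pick the k whose score is highest.
best_k = max(knn_cv_scores, key=knn_cv_scores.get)
print(best_k, knn_cv_scores[best_k])
```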

In [None]:
# random_state only takes effect when shuffle=True
inner_fold = KFold(n_splits=4, shuffle=True, random_state=123)
outer_fold = KFold(n_splits=5, shuffle=True, random_state=123)
dtc = DecisionTreeClassifier(max_features='sqrt')
knn = KNeighborsClassifier(metric='euclidean')
svc = SVC()
rfc = RandomForestClassifier(oob_score=True, max_features='sqrt')
adb = AdaBoostClassifier(estimator=dtc)  # this argument was named base_estimator before scikit-learn 1.2
models = [dtc, knn, svc, rfc, adb]
titles = ['DT', 'KNN', 'SVM', 'RF', 'ADA']
params = [
    {'max_depth': [1, 5, 10, 20], 'criterion': ['gini', 'entropy']},
    {'n_neighbors': [1, 3, 5, 7, 9, 11]},
    # gamma accepts floats or 'auto'
    {'C': [1, 10, 100], 'kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'gamma': [0.01, 0.001, 'auto']},
    {'n_estimators': [25, 50, 100], 'max_depth': [1, 5, 10, 20], 'min_samples_leaf': [1, 5, 10, 50, 100]},
    {'n_estimators': [25, 50, 100], 'learning_rate': [0.05, 0.1, 0.2]}
]
# one row per outer fold, one column per model
cv_scores = pd.DataFrame(0.0, index=np.arange(outer_fold.get_n_splits()), columns=titles)
cm = [np.zeros((2, 2), dtype=int) for _ in models]  # running confusion matrix per model
iters = 0  # number of outer folds completed

for train, test in outer_fold.split(X):
    for i, clf in enumerate(models):
        # inner loop: tune the hyperparameters on the outer training split only
        best_clf = GridSearchCV(
            estimator=clf,
            param_grid=params[i],
            cv=inner_fold,
            scoring='recall',  # best recall/precision because data unbalanced
            n_jobs=-1
        )
        best_clf.fit(X[train], y[train])
        y_pred = best_clf.predict(X[test])
        score = accuracy_score(y_true=y[test], y_pred=y_pred)
        print(titles[i], best_clf.best_params_, best_clf.best_score_, score, sep='\t')
        cv_scores.iloc[iters, i] = score
        # accumulate this fold's confusion matrix in place
        cm[i] += confusion_matrix(y_true=y[test], y_pred=y_pred, labels=[0, 1])
    iters += 1

In [None]:
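mse = 1 - np.mean(cv_scores, axis=0)  # mean error rate (1 - accuracy) per model across outer folds
var = np.var(cv_scores, axis=0)  # variance of the accuracy per model across outer folds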

# construct some data like what you have:

# create stacked errorbars:
# pyplot.figure()
# pyplot.boxplot(cv_


import itertools
from matplotlib import pyplot as plt  # this cell refers to pyplot as plt


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

np.set_printoptions(precision=2)
class_names = ['0', '1']  # class labels: 0 is non fraud, 1 is fraud

# Plot normalized confusion matrix
for i, cnf_matrix in enumerate(cm):
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                          title=titles[i])

plt.show()
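
Since the comparison rests on accuracy, precision, and recall, the sketch below shows how all three can be read off a 2x2 confusion matrix laid out the way `confusion_matrix` returns it (rows are true classes, columns are predicted classes). The counts in `cm_demo` are made up for illustration and are not results from this study.

```python
# Hypothetical 2x2 confusion matrix, rows = true class, columns = predicted
# class, in the order [0 (non fraud), 1 (fraud)]. The counts are made up.
cm_demo = [[56850, 12],
           [20, 80]]

(tn, fp), (fn, tp) = cm_demo

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of frauds that were flagged

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

With counts like these, accuracy stays close to 1 even though a fifth of the frauds are missed, which is why the three metrics are reported together.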