# Machine Learning Project

## Introduction

* Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

* Content
The dataset can be downloaded from here: https://www.kaggle.com/mlg-ulb/creditcardfraud.
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
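
To make the imbalance concrete, here is a small arithmetic check using only the two counts quoted above. It also shows why plain accuracy is a weak metric for this dataset: a hypothetical model that always answers "not fraud" scores very high accuracy while catching nothing.

```python
# Figures quoted in the dataset description above.
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total
print(f"fraud rate: {fraud_rate:.4%}")

# A model that always predicts "not fraud" is right on every genuine row...
baseline_accuracy = (n_total - n_fraud) / n_total
print(f"always-negative accuracy: {baseline_accuracy:.3%}")

# ...but it detects none of the frauds, so its recall is zero.
baseline_recall = 0 / n_fraud
print(f"always-negative recall: {baseline_recall:.0%}")
```

This is why the comparisons later in the notebook lean on precision, recall and F1 rather than accuracy.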

### Prerequisites
Install imbalanced-learn (the `imblearn` package) and upgrade scikit-learn to 0.19.0 or later.

In [None]:
#!pip install imbalanced-learn
#!pip install --upgrade scikit-learn

In [None]:
import os

* Import data `creditcard.csv`. 

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('../input/creditcardfraud/creditcard.csv', sep=',')

In [None]:
data.head().T

* The prediction should not use `Time`; in this dataset it acts like an ID field.
* Normalize the `Amount` column.
* Rename `Class` to `Fraud` and cast it to boolean.

In [None]:
data.drop(['Time'], axis=1, inplace=True)
data.rename(columns={'Class': 'Fraud'}, inplace=True)
data['Fraud'] = data['Fraud'].astype(bool)  # np.bool was removed from recent NumPy releases

mean_amount = data['Amount'].mean()
std_amount = data['Amount'].std()
data['Amount'] = (data['Amount'] - mean_amount) / std_amount

data.head().T

In [None]:
mean_amount, std_amount
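
The cell above standardizes `Amount` with the mean and standard deviation of the whole dataset. A common variation, which is not what this notebook does, is to compute those statistics on the training rows only and reuse them for the test rows. The sketch below shows that idea in plain Python; the amounts and the helper names `fit_standardizer` and `standardize` are made up for illustration.

```python
import statistics

def fit_standardizer(values):
    """Return (mean, sample standard deviation), matching pandas' default ddof=1."""
    return statistics.mean(values), statistics.stdev(values)

def standardize(values, mean, std):
    """Apply the z-score transform with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts, split by hand for illustration.
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

mean, std = fit_standardizer(train_amounts)           # statistics come from train only
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)    # test reuses the train statistics

print(mean, std)
print(test_scaled)
```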

Examine the data types.

In [None]:
data.dtypes

In [None]:
data[['Fraud']].dtypes

## Check correlation
If two features are very highly correlated, keeping both is usually a bad idea: the second one adds little information and can contribute to overfitting.

In [None]:
data.corr()

In [None]:
import seaborn as sn
sn.heatmap(data.corr())
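
A heatmap is easy to scan by eye, but a short helper can also list the offending pairs directly. The following is a minimal sketch in plain Python on toy columns; `pearson`, `correlated_pairs` and the 0.9 threshold are illustrative choices, not part of the original notebook. Since V1 to V28 are PCA outputs, one would expect them to be nearly uncorrelated with each other, so few or no pairs should show up on the real data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Toy data: B is roughly 2 * A, C is unrelated to both.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(correlated_pairs(toy))
```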

## Check the count of each class

In [None]:
data.Fraud.value_counts()

In [None]:
data.Fraud.value_counts(normalize=True)

In [None]:
import matplotlib.pyplot as plt
data["Fraud"].value_counts().plot(kind = 'pie',explode=[0, 0.1],figsize=(6, 6),autopct='%1.1f%%',shadow=True)
plt.title("Fraudulent and Non-Fraudulent Distribution",fontsize=20)
plt.legend(["Genuine","Fraud"])
plt.show()

## Split data

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

feature_cols = [x for x in data.columns if x != 'Fraud']

# Split the data into train and test sets, with 30% of the rows in the test set
# split() returns a generator
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Get the index values from the generator
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['Fraud']))

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'Fraud']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'Fraud']

In [None]:
y_train.value_counts(normalize=False), y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=False), y_test.value_counts(normalize=True)

In [None]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score

def scores(y_train, y_train_pred, y_test, y_test_pred):
    score_df = pd.DataFrame({
        'set': ['Train', 'Test'], 
        'accuracy': [accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)],
        'precision': [precision_score(y_train, y_train_pred), precision_score(y_test, y_test_pred)],
        'recall': [recall_score(y_train, y_train_pred), recall_score(y_test, y_test_pred)],
        'f1': [f1_score(y_train, y_train_pred), f1_score(y_test, y_test_pred)],
        'auc': [roc_auc_score(y_train, y_train_pred), roc_auc_score(y_test, y_test_pred)]
    })
    score_df = score_df.set_index('set')  # set_index returns a new frame, so keep the result
    print(score_df)
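
For reference, every number in that table can be derived from the four confusion-matrix counts. The sketch below does this in plain Python with made-up counts; `binary_metrics` is a hypothetical helper, not part of the notebook. One detail worth knowing: when `roc_auc_score` is given hard 0/1 predictions instead of scores, the ROC curve has a single interior point and the area reduces to the average of recall and specificity, i.e. balanced accuracy.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Equals ROC AUC when the "scores" are hard 0/1 predictions.
        "auc_from_labels": (recall + specificity) / 2,
    }

# Hypothetical counts for an imbalanced test set of 10,000 rows.
m = binary_metrics(tp=80, fp=20, fn=20, tn=9880)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note how accuracy stays close to 1 while precision and recall are visibly lower; that gap is typical for data as imbalanced as this.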

In [None]:
test_scores = []

def add_test_scores(name, y_test, y_test_pred):
    test_scores.append({
        'name': name, 
        'accuracy': accuracy_score(y_test, y_test_pred),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'auc': roc_auc_score(y_test, y_test_pred)
    })

## PCA - prepare data

In [None]:
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)
print(pca.explained_variance_ratio_)

In [None]:
n_components = 10
print(sum(list(pca.explained_variance_ratio_[0:n_components])))

The first 10 principal components explain about 2/3 of the variance.
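
Instead of fixing the number of components up front, one could also ask how many are needed to reach a target share of the variance. A minimal sketch follows; the ratios are invented and `components_for` is a hypothetical helper. With the real data you would pass `pca.explained_variance_ratio_` instead.

```python
from itertools import accumulate

def components_for(ratios, target):
    """Smallest k whose first k ratios sum to at least `target`, else None."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return None

# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

print(components_for(ratios, 0.70))
print(components_for(ratios, 0.95))
```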

In [None]:
# Fit the PCA on the training set only, then apply the same projection to both sets
pca = PCA(n_components=n_components).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
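
The important detail in this step is that the projection is learned from the training rows and only applied to the test rows. The NumPy sketch below illustrates that split of responsibilities on random data; it is a simplified stand-in and not scikit-learn's implementation, and `fit_pca` and `apply_pca` are hypothetical names.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the centering vector and principal directions from training data."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project any data set with statistics learned elsewhere."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
X_test = rng.normal(size=(50, 5))

mean, components = fit_pca(X_train, n_components=2)
X_train_proj = apply_pca(X_train, mean, components)
X_test_proj = apply_pca(X_test, mean, components)   # no refitting on the test set

print(X_train_proj.shape, X_test_proj.shape)
```

Refitting on the test set would produce different directions, so the model would see test features in a coordinate system it was never trained on.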

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [None]:
add_test_scores('Logistic Regression', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## Logistic Regression on PCA

In [None]:
clf_pca = LogisticRegression().fit(X_train_pca, y_train)
y_train_pca_pred = clf_pca.predict(X_train_pca)
y_test_pca_pred = clf_pca.predict(X_test_pca)

add_test_scores('Logistic Regression PCA', y_test, y_test_pca_pred)
scores(y_train, y_train_pca_pred, y_test, y_test_pca_pred)

## SVM

In [None]:
#Import svm model
from sklearn.svm import SVC

#Create a svm Classifier
svm_linear = SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
svm_linear.fit(X_train, y_train)

In [None]:
#Predict the response for test dataset
y_train_pred = svm_linear.predict(X_train)
y_test_pred = svm_linear.predict(X_test)

add_test_scores('SVM Linear', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## SVM with polynomial kernel

In [None]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='poly', degree=8)
svclassifier = svclassifier.fit(X_train, y_train)
y_test_pred = svclassifier.predict(X_test)
y_train_pred = svclassifier.predict(X_train)

In [None]:
add_test_scores('SVM poly', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## SVM PCA

In [None]:
svm_rbg_pca = SVC(kernel='rbf')
svm_rbg_pca.fit(X_train_pca, y_train)

In [None]:
y_train_pred = svm_rbg_pca.predict(X_train_pca)
y_test_pred = svm_rbg_pca.predict(X_test_pca)

add_test_scores('SVM RBF - PCA', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

### Don't run! Very slow!

In [None]:
#from sklearn.model_selection import GridSearchCV

#param_grid = { 'C': [c for c in range(1, 11)] }

#GR = GridSearchCV(SVC(kernel='linear'),
#                  param_grid=param_grid,
#                  scoring='f1',
#                  n_jobs=-1)

#GR = GR.fit(X_train_pca, y_train)
#GR.best_estimator_

The result is:
SVC(C=1, kernel='linear')

In [None]:
svm_linear_pca = SVC(kernel='linear', C=1)
svm_linear_pca.fit(X_train_pca, y_train)
y_train_pred = svm_linear_pca.predict(X_train_pca)
y_test_pred = svm_linear_pca.predict(X_test_pca)

add_test_scores('SVM Linear - PCA', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
dt = DecisionTreeClassifier()
dt = dt.fit(X_train,y_train)

The number of nodes and the maximum actual depth.

In [None]:
dt.tree_.node_count, dt.tree_.max_depth

In [None]:
dt.feature_importances_

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [None]:
add_test_scores('Decision tree', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## Grid search on decision tree

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='f1',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)
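
To get a feel for the cost of a search like this, it helps to count the model fits before launching it. The sketch below assumes the default 5-fold cross-validation of recent scikit-learn versions; the maximum depth of 20 is a placeholder, since the real value depends on the tree fitted earlier, and `n_fits` is a hypothetical helper. The 29 features are V1 to V28 plus `Amount`.

```python
def n_fits(param_grid, cv=5):
    """Number of model fits a full grid search performs, excluding the final refit."""
    n_candidates = 1
    for values in param_grid.values():
        n_candidates *= len(values)
    return n_candidates * cv

# Placeholder grid: odd depths up to an assumed maximum of 20, and 1..29 features.
hypothetical_grid = {
    "max_depth": range(1, 21, 2),
    "max_features": range(1, 30),
}

print(n_fits(hypothetical_grid))
```

The same count explains why the SVM and KNN searches in this notebook are commented out: each individual fit is already slow on this many rows.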

The number of nodes and the maximum depth of the tree.

In [None]:
GR.best_estimator_

In [None]:
GR.best_estimator_.tree_.node_count, GR.best_estimator_.tree_.max_depth

In [None]:
y_train_pred = GR.predict(X_train)
y_test_pred = GR.predict(X_test)

In [None]:
add_test_scores('Decision tree grid search', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

The tree fit without cross validation.

In [None]:
from io import StringIO
from IPython.display import Image, display

from sklearn.tree import export_graphviz

try:
    import pydotplus
    pydotplus_installed = True
    
except ImportError:
    print('PyDotPlus must be installed to execute the remainder of the cells associated with this question.')
    print('Please see the instructions for this question for details.')
    pydotplus_installed = False

In [None]:
if pydotplus_installed:
    
    # Create an output destination for the file
    dot_data = StringIO()

    export_graphviz(dt, out_file=dot_data, filled=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    print(graph)
    # View the tree image
    filename = 'fraud_tree.png'
    graph.write_png(filename)
    img = Image(filename=filename)
    display(img)
    
else:
    print('This cell not executed because PyDotPlus could not be loaded.')

The tree fit with cross validation.

In [None]:
if pydotplus_installed:
    
    # Create an output destination for the file
    dot_data = StringIO()

    export_graphviz(GR.best_estimator_, out_file=dot_data, filled=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

    # View the tree image
    filename = 'fraud_tree_prune.png'
    graph.write_png(filename)
    img = Image(filename=filename) 
    display(img)
    
else:
    print('This cell not executed because PyDotPlus could not be loaded.')

## KNN

### Grid Search for ideal n_neighbors - Don't run! Very slow!

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
#param_grid = { 'n_neighbors': [n for n in range(1, 10, 2)] }

#GR = GridSearchCV(KNeighborsClassifier(),
#                 param_grid=param_grid,
#                 scoring='f1',
#                 n_jobs=-1)

#GR_knn = GR.fit(X_train, y_train)
#GR_knn.best_estimator_

The result is:
KNeighborsClassifier(n_neighbors=3)


In [None]:
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

### Also very slow, unfortunately...

In [None]:
#y_train_pred = classifier.predict(X_train)
#y_test_pred = classifier.predict(X_test)
#scores(y_train, y_train_pred, y_test, y_test_pred)

`    set  accuracy  precision    recall        f1       auc
0  Train  0.999649   0.969178  0.822674  0.889937  0.911315
1   Test  0.999473   0.918699  0.763514  0.833948  0.881698`
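
As a quick sanity check on the pasted numbers, F1 is the harmonic mean of precision and recall, so the test-row F1 can be recomputed from the other two columns. The inputs below are the rounded values from the table, so the result agrees with the reported 0.833948 only up to rounding.

```python
# Test-row values copied from the table above.
precision = 0.918699
recall = 0.763514
reported_f1 = 0.833948

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1: {f1:.5f}")
print(f"difference from reported: {abs(f1 - reported_f1):.1e}")
```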

In [None]:
# so will update the scores manually
test_scores.append({
        'name': 'KNN', 
        'accuracy': 0.999473,
        'precision': 0.918699,
        'recall': 0.763514,
        'f1': 0.833948,
        'auc': 0.881698
    })

### GridSearch for KNN on PCA gives same results - n_neighbors=3

In [None]:
#param_grid = { 'n_neighbors': [n for n in range(1, 10, 2)] }

#GR = GridSearchCV(KNeighborsClassifier(),
#                 param_grid=param_grid,
#                 scoring='f1',
#                 n_jobs=-1)

#GR_knn_pca = GR.fit(X_train_pca, y_train)
#GR_knn_pca.best_estimator_

### And it is lighter to run, but worse f1 results:

In [None]:
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train_pca, y_train)
y_train_pred = classifier.predict(X_train_pca)
y_test_pred = classifier.predict(X_test_pca)

In [None]:
add_test_scores('KNN - PCA', y_test, y_test_pred)
scores(y_train, y_train_pred, y_test, y_test_pred)

## Balanced Data - Over Sampling

In [None]:
from imblearn import over_sampling as os_smote
# n_jobs is omitted: recent imbalanced-learn releases no longer accept it for SMOTE
os_sm = os_smote.SMOTE(random_state=42, sampling_strategy=0.1)
X_train_os, y_train_os = os_sm.fit_resample(X_train, y_train)
y_train_os.value_counts(), y_train.value_counts()
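
For intuition about what SMOTE adds to the training set: each synthetic row is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python; it is a simplification and not imbalanced-learn's implementation, the neighbour search is skipped, and `smote_like_sample` is a hypothetical helper.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a random point on the segment between x and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Two hypothetical fraud rows; fraud_b is assumed to be a near neighbour of fraud_a.
fraud_a = [1.0, 2.0]
fraud_b = [3.0, 6.0]

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because the new rows are interpolated rather than duplicated, they only ever appear in the training set; the test set keeps its original class balance.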

In [None]:
pca_os = PCA().fit(X_train_os)
print(pca_os.explained_variance_ratio_)

In [None]:
n_components = 10
print(sum(list(pca_os.explained_variance_ratio_[0:n_components])))

In [None]:
pca_os = PCA(n_components=n_components).fit(X_train_os)
X_train_os_pca = pca_os.transform(X_train_os)
# Only transform the test set: refitting on it would yield a projection the model was not trained on
X_test_pca = pca_os.transform(X_test)

## Logistic Regression on Over Sampling Balanced Data

In [None]:
from sklearn.linear_model import LogisticRegression
clf_os = LogisticRegression(max_iter=1000).fit(X_train_os, y_train_os)
y_train_pred = clf_os.predict(X_train_os)
y_test_pred = clf_os.predict(X_test)

In [None]:
add_test_scores('Logistic Regression on Over Sampling Balanced Data', y_test, y_test_pred)
scores(y_train_os, y_train_pred, y_test, y_test_pred)

In [None]:
clf_pca = LogisticRegression().fit(X_train_os_pca, y_train_os)
y_train_pred = clf_pca.predict(X_train_os_pca)
y_test_pred = clf_pca.predict(X_test_pca)

add_test_scores('Logistic Regression on Over Sampling Balanced Data PCA', y_test, y_test_pred)
scores(y_train_os, y_train_pred, y_test, y_test_pred)

## Gradient Boosting on Over Sampling Balanced Data

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = { 
    'learning_rate': [0.1], #[0.01, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [200], #range(100, 400, 50),
    'max_features': [5] # [1, 5, 10]
}

GR = GridSearchCV(GradientBoostingClassifier(subsample=0.5, random_state=42),
                 param_grid=param_grid,
                 scoring='f1',
                 n_jobs=-1)

GR_boosting = GR.fit(X_train_os, y_train_os)
GR_boosting.best_estimator_

In [None]:
y_train_pred = GR_boosting.predict(X_train_os)
y_test_pred = GR_boosting.predict(X_test)

add_test_scores('Boosting - Over Sampling', y_test, y_test_pred)
scores(y_train_os, y_train_pred, y_test, y_test_pred)

## Gradient Boosting on Over Sampling Balanced Data PCA

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = { 
    'learning_rate': [0.1], #[0.01, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [200], #range(100, 400, 50),
    'max_features': [5] #[1, 5, 10]
}

GR = GridSearchCV(GradientBoostingClassifier(subsample=0.5, random_state=42),
                 param_grid=param_grid,
                 scoring='f1',
                 n_jobs=-1)

GR_boosting = GR.fit(X_train_os_pca, y_train_os)
GR_boosting.best_estimator_

In [None]:
y_train_pred = GR_boosting.predict(X_train_os_pca)
y_test_pred = GR_boosting.predict(X_test_pca)

add_test_scores('Boosting - Over Sampling PCA', y_test, y_test_pred)
scores(y_train_os, y_train_pred, y_test, y_test_pred)

### SVM on Balanced Data - too heavy to compute.
### Trying with under sampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler
us = RandomUnderSampler(sampling_strategy=0.1, random_state=42)  # fixed seed for reproducible sampling
X_train_us, y_train_us = us.fit_resample(X_train, y_train)

In [None]:
y_train_us.value_counts()
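
According to the imbalanced-learn documentation, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The sketch below works out roughly what 0.1 means for both samplers used in this notebook. The class counts are placeholders rather than the real training counts, the helpers are hypothetical, and the library's exact rounding may differ by a row.

```python
def oversample_minority_target(n_majority, ratio):
    """Approximate minority count after over-sampling (majority is left untouched)."""
    return round(n_majority * ratio)

def undersample_majority_target(n_minority, ratio):
    """Approximate majority count after under-sampling (minority is left untouched)."""
    return round(n_minority / ratio)

# Placeholder class counts for a training set.
n_majority, n_minority = 200_000, 350
ratio = 0.1

print("over-sampling: minority grows to", oversample_minority_target(n_majority, ratio))
print("under-sampling: majority shrinks to", undersample_majority_target(n_minority, ratio))
```

The under-sampled set is far smaller, which is what makes the SVM and KNN fits in the following cells tractable.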

## Logistic Regression Under Sampling

In [None]:
clf_os = LogisticRegression(max_iter=1000).fit(X_train_us, y_train_us)
y_train_pred = clf_os.predict(X_train_us)
y_test_pred = clf_os.predict(X_test)

add_test_scores('Logistic Regression - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## SVM Polynomial Kernel

In [None]:
from sklearn.svm import SVC
svm_linear = SVC(kernel='poly', degree=1, C=0.01)
svm_linear.fit(X_train_us, y_train_us)

In [None]:
y_train_pred = svm_linear.predict(X_train_us)
y_test_pred = svm_linear.predict(X_test)

add_test_scores('SVM Poly - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## SVM Kernel RBF

In [None]:
svm_rbg = SVC(kernel='rbf', C=0.01)
svm_rbg.fit(X_train_us, y_train_us)

y_train_pred = svm_rbg.predict(X_train_us)
y_test_pred = svm_rbg.predict(X_test)

add_test_scores('SVM RBF - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## KNN Under Sampling

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = { 'n_neighbors': [n for n in range(1, 11)] }

GR = GridSearchCV(KNeighborsClassifier(),
                 param_grid=param_grid,
                 scoring='f1',
                 n_jobs=-1)

GR_knn = GR.fit(X_train_us, y_train_us)
GR_knn.best_params_

In [None]:
y_train_pred = GR_knn.predict(X_train_us)
y_test_pred = GR_knn.predict(X_test)

add_test_scores('KNN - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## Gradient Boosting Under Sampling

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = { 
    'learning_rate': [0.05], #[0.01, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [300], #range(100, 400, 50),
    'max_features': [1] #[1, 5, 10]
}

GR = GridSearchCV(GradientBoostingClassifier(random_state=42),
                 param_grid=param_grid,
                 scoring='f1',
                 n_jobs=-1)

GR_boosting = GR.fit(X_train_us, y_train_us)
GR_boosting.best_estimator_

In [None]:
y_train_pred = GR_boosting.predict(X_train_us)
y_test_pred = GR_boosting.predict(X_test)

add_test_scores('Boosting - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## Bagging (Extra Trees) - Under Sampling

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
EC = ExtraTreesClassifier(n_estimators=100, max_features=1)
EC = EC.fit(X_train_us, y_train_us)

In [None]:
y_train_pred = EC.predict(X_train_us)
y_test_pred = EC.predict(X_test)

add_test_scores('Bagging - Under Sampling', y_test, y_test_pred)
scores(y_train_us, y_train_pred, y_test, y_test_pred)

## Show all the results, sorted by F1 field

In [None]:
results = pd.DataFrame(test_scores).sort_values(by='f1', ascending=False, ignore_index=True)
results