# Lab: Kaggle Credit Card Fraud Detection

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers two days of transactions, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
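As a quick check of these figures, the following sketch recomputes the fraud share from the two counts quoted in the text, and also shows the accuracy of a trivial classifier that labels every transaction as normal. The variable names are illustrative only.

```python
# Counts quoted in the dataset description
n_total = 284807
n_fraud = 492
n_normal = n_total - n_fraud

# Share of fraudulent transactions, in percent
fraud_pct = 100 * n_fraud / n_total

# Accuracy of a naive model that always predicts "normal" (class 0)
baseline_accuracy = n_normal / n_total

print(f"fraud share: {fraud_pct:.4f}%")
print(f"always-normal accuracy: {baseline_accuracy:.4%}")
```

Such a model reaches more than 99.8% accuracy while detecting no fraud at all, which is why accuracy alone is a misleading metric for this dataset.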

**Objectives:** Compare Logistic Regression classifiers on skewed data. The idea is to check whether preprocessing techniques such as resampling improve the predictive model when an overwhelming majority class would otherwise dominate training. Learn how to apply cross-validation (CV) for hyper-parameter tuning.

In [None]:
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)
warnings.filterwarnings('ignore',category=DeprecationWarning)
warnings.filterwarnings('ignore',category=Warning)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

### Load the anonymised dataset

The dataset contains only numerical features, which are the result of a PCA (Principal Component Analysis) transformation. Due to confidentiality issues, the original features and further background information cannot be provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount.

The last column is the Class: normal transaction (0), fraud transaction (1).

Load the dataset stored in the file *"creditcard.csv"*. 

In [None]:
data = ?


# What is the dimension of the data => (284807, 31)    
?

# Compute the mean of each column and check that the anonymised features V1-V28
# have a mean around 0
?

# show the first few rows from the dataset 
 ?

#### Normalize the values of Column Amount  
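StandardScaler rescales a feature to zero mean and unit variance, and it expects a 2-D array of shape (n_samples, n_features), which is why the 1-D 'Amount' values are reshaped first. A minimal NumPy sketch of the same computation on made-up amounts (the values and variable names are assumptions for illustration):

```python
import numpy as np

# Hypothetical transaction amounts (not taken from the dataset)
amounts = np.array([1.0, 10.0, 25.0, 100.0, 250.0])

# reshape(-1, 1) turns the 1-D array into a single column:
# -1 lets NumPy infer the number of rows, 1 fixes one column
column = amounts.reshape(-1, 1)

# Standardize: subtract the mean, divide by the population std (ddof=0)
standardized = (column - column.mean()) / column.std()

print(column.shape)
print(standardized.ravel())
```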

In [None]:
from sklearn.preprocessing import StandardScaler
# x.reshape(-1, 1) does not mean normalizing to the range [-1, 1].
# It reshapes x into a column vector: -1 lets NumPy infer the number of rows, the second dimension is 1.

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# drop column Time as an irrelevant feature
# drop column Amount because column normAmount replaces it
data = data.drop(['Time','Amount'],axis=1)  

# show again the first few rows of the normalized dataset 
?

#### Compute the number of samples per class


In [None]:
N_records_fraud = len(data[data.Class == 1])

N_records_normal = ?

# How many samples of Class 1 ( fraud transaction) ?   =>ANSWER: Class 1 = 492

# How many samples of Class 0 (normal transaction) ? => ANSWER: Class 0 = 284315

### The data is highly unbalanced! How to deal with such a classification problem:

- Collect more data. A nice strategy, but not always applicable.
- Change the performance metric (do not rely only on Accuracy): compute other metrics such as Precision, Recall and F1 score.
- Resample the dataset to have an approximate 50-50 ratio:
    - By OVER-sampling => add copies of the under-represented class.
    - By UNDER-sampling => delete instances from the over-represented class.
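The two resampling strategies listed above can be sketched on a toy label vector. This is only an illustration under assumed data (10 hypothetical labels, a fixed seed of 0); it is not the lab's solution.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy labels: 8 normal (0) and 2 fraud (1) samples; hypothetical data
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

minority_idx = np.where(y_toy == 1)[0]
majority_idx = np.where(y_toy == 0)[0]

# OVER-sampling: draw minority indices with replacement up to the majority size
over_minority_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
over_idx = np.concatenate([majority_idx, over_minority_idx])

# UNDER-sampling: draw majority indices without replacement down to the minority size
under_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([under_majority_idx, minority_idx])

print(np.bincount(y_toy[over_idx]))   # class counts after over-sampling
print(np.bincount(y_toy[under_idx]))  # class counts after under-sampling
```

Over-sampling keeps every original sample but repeats minority rows, while under-sampling discards most of the majority class.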
   

Extract the features in matrix X and the class labels in vector y

In [None]:
X = data.iloc[:, data.columns != 'Class']
y =  ?


####  UNDER-sampling 

Apply UNDER-sampling by randomly selecting x samples from the majority class (0), where x is the total number of records with the minority class (1). 

The under-sampled dataset has a 50/50 class ratio of samples. 

In [None]:
# Picking the indices of the minority (fraud) class
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal class
normal_indices = ?

# Number of data points in the minority (fraud) class
N_records_fraud = ?


# Out of the normal class indices, randomly select N_records_fraud samples 
random_normal_indices = np.random.choice(normal_indices, N_records_fraud, replace = False)


# Appending the indices of normal and fraud classes
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

# Check if the under-sampled Data is balanced 

# Total number of transactions in the under_sample_data ? => ANSWER:  984


# Number of normal under sample transactions ? =>   ANSWER:  492


# Number of fraud transactions? => ANSWER:  492


# Extract the features in matrix X_undersample, the class labels in vector  y_undersample

X_undersample = ?

y_undersample = ?


### Explanation of random_state

A computer generates random numbers with a pseudo-random number generator: an algorithm that produces a seemingly random sequence of numbers that is in fact deterministic and would eventually repeat.
The point where the generator starts is known as the seed. When you specify the random_state parameter, you are setting the seed of the random number generator.

Suppose you set random_state = 0. The random number generator might then produce the sequence of integers
0, 19, 11, 2, 34, 5, 23, 24, 0, 1, 89, …

and by fixing random_state=0, you will always see this sequence each time you call your train_test_split function. 

On the other hand, suppose you set random_state=1 and got the following sequence of integers:
91, 18, 11, 34, 34, 5, 19, 18, 0, 0, 1, …

You will always see these random numbers when you set random_state = 1. 
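The idea can be checked directly with NumPy's generator, which is what scikit-learn builds internally when random_state is given as an integer. The range 0 to 99 and the length of 10 are arbitrary choices for illustration; the printed numbers will differ from the made-up sequences above.

```python
import numpy as np

# Two generators started from the same seed
first = np.random.RandomState(0).randint(0, 100, size=10)
second = np.random.RandomState(0).randint(0, 100, size=10)

# A generator started from a different seed
other = np.random.RandomState(1).randint(0, 100, size=10)

print(first)
print(second)
print(other)
print("same seed, same sequence:", np.array_equal(first, second))
```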

### Train-test data splitting

Apply *train_test_split* to the whole dataset and to the under-sampled dataset, holding out 30% of the samples for testing, with random_state = 0.
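When the test size is given as a fraction, scikit-learn rounds the number of test samples up and assigns the remainder to training. A small sketch of that arithmetic for the two datasets used here (the helper name `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_samples, test_fraction=0.3):
    """Return (n_train, n_test) for a fractional test size, rounding the test set up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

print("whole data:", split_sizes(284807))
print("under-sampled data:", split_sizes(984))
```

These counts are useful as a sanity check on the shapes returned by the actual split.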

In [None]:
from sklearn.model_selection import train_test_split

# Call train_test_split to divide the WHOLE dataset
# into 30% test data and 70% training data.

X_train, X_test, y_train, y_test = ?

print("WHOLE DATA:")

# train dataset ?  => ANSWER: 199364
?

# test dataset ? => ANSWER: 85443
?



# Do the same split for the under-sampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = ?

print() 
print("UNDER-SAMPLED DATA:")

# train dataset ?  
?

# test dataset ? 
?



###  MODEL 1: Logistic regression classifier - Undersampled data

- Accuracy = (TP+TN)/total
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)

**Our goal is not to miss a fraudulent transaction.** We are therefore interested in the Recall score, because it measures the fraction of fraudulent transactions that the model captures. Because the data is imbalanced, many fraudulent observations could be predicted as normal; these are False Negatives, and Recall penalizes them.

Precision is a less important metric for this problem: predicting that a transaction is fraudulent when it is not (a False Positive) is not a massive problem compared to the opposite.
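The three formulas above can be evaluated by hand from the four cells of a confusion matrix. The counts below are hypothetical and chosen only to show how the metrics diverge on imbalanced data.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem
tp, fn = 90, 10       # fraud cases caught / missed
fp, tn = 900, 99000   # normal cases flagged / correctly passed

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

In this example recall is high while precision is low: most frauds are caught at the cost of many false alarms, which is the trade-off accepted in this lab.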

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

### K-fold Cross Validation (CV) to find the best hyper-parameter C of Logistic Regression.  

$C = 1/\lambda$, where $\lambda$ is the regularization parameter, so a smaller C means stronger regularization.
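For reference, one common way to write the L2-regularized objective minimized by scikit-learn's LogisticRegression, with labels $y_i \in \{-1, +1\}$, weight vector $w$, intercept $b$ and $m$ training samples (this notation is an assumption made here, not taken from the lab):

$$\min_{w,\, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \log\left(1 + e^{-y_i (w^T x_i + b)}\right)$$

Dividing the whole expression by $C$ leaves the minimizer unchanged and gives the data-fit term plus $\frac{\lambda}{2} w^T w$ with $\lambda = 1/C$, which makes the inverse relation between C and $\lambda$ explicit.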

In [None]:
# Find the best hyper-parameter C, optimizing for the recall metric
def print_gridsearch_scores(x_train_data,y_train_data):
    c_param_range = [0.01,0.1,1,10]

    clf = GridSearchCV(LogisticRegression(), {"C": c_param_range}, cv=5, scoring='recall')
    clf.fit(x_train_data,y_train_data)

    print("Best hyper-parameter C")
    print(clf.best_params_)

    print("K-fold Score (Recall):")
    
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    
    # K-fold Recall results for different values of C
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
        
    return clf.best_params_["C"]

In [None]:
#Apply print_gridsearch_scores to get the best C with the Undersampled dataset
best_c = ?

### Model 1.1: Logistic Regression trained and tested with undersampled data


In [None]:
# Use best C to train LogReg model with undersampled train data and 
# test it with undersampled test data 

lr = LogisticRegression(C = best_c)
lr.fit(X_train_undersample,y_train_undersample)
y_pred_undersample = lr.predict(X_test_undersample)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)

print('Confusion matrix (undersample test dataset)')
print(cnf_matrix)

#Compute Recall metric
?


### Model 1.2: Logistic Regression trained on under-sampled data and tested with the whole test data

Apply the same approach as above. 


In [None]:
# Apply the same approach to train LogReg model with undersampled train dataset 
# but test it with WHOLE test dataset

#train on undersampled data
?

#predict whole test data 


# Compute and print confusion matrix on test data
?

# Compute and print Recall metric
?


###  ROC curve & AUC

Plot the Receiver Operating Characteristic (ROC) curve and compute the Area Under the ROC Curve (AUC). 
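One way to interpret the AUC is as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen normal case. The sketch below computes it that way by brute force over all pairs; the scores are hypothetical, not model output.

```python
# Hypothetical classifier scores (higher = more likely fraud)
fraud_scores = [0.9, 0.8, 0.4]
normal_scores = [0.7, 0.3, 0.2, 0.1]

# Count fraud/normal pairs where the fraud case is ranked higher (ties count half)
wins = 0.0
for f in fraud_scores:
    for n in normal_scores:
        if f > n:
            wins += 1.0
        elif f == n:
            wins += 0.5

auc_estimate = wins / (len(fraud_scores) * len(normal_scores))
print(f"AUC = {auc_estimate:.3f}")
```

An AUC of 0.5 corresponds to the red diagonal of a random classifier, and 1.0 to a perfect ranking.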


In [None]:
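# The L1 penalty needs a solver that supports it; 'liblinear' does, the default 'lbfgs' does not
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')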
lr.fit(X_train_undersample,y_train_undersample)
y_pred_undersample_score=lr.decision_function(X_test_undersample)

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test_undersample,y_pred_undersample_score)

# Compute Area Under the ROC Curve (AUC), it is a scalar 
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### REMARK
To create the under-sampled data, we randomly picked some samples from the majority class. This is a valid technique; however, it doesn't represent the real (huge) population.
For sufficient statistical credibility, it would be useful to repeat the process with different under-sampled configurations and check whether the previously chosen parameters are still the most effective. In the end, the idea is to use a wider random representation of the whole dataset and rely on the averaged best parameters.
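A minimal sketch of drawing several under-sampled configurations, one per seed, is shown below. The helper `draw_undersample_indices` and the toy labels are hypothetical; only the index-drawing step is illustrated.

```python
import numpy as np

def draw_undersample_indices(labels, seed):
    """Return indices of a balanced subset: all fraud (1) plus an equal-sized random draw of normal (0)."""
    rng = np.random.RandomState(seed)
    fraud_idx = np.where(labels == 1)[0]
    normal_idx = np.where(labels == 0)[0]
    sampled_normal = rng.choice(normal_idx, size=len(fraud_idx), replace=False)
    return np.concatenate([fraud_idx, sampled_normal])

# Toy labels: 95 normal, 5 fraud (hypothetical, not the real dataset)
labels = np.array([0] * 95 + [1] * 5)

# Several independent under-sampled configurations, one per seed
draws = [draw_undersample_indices(labels, seed) for seed in range(3)]
for seed, idx in enumerate(draws):
    print(seed, np.bincount(labels[idx]))
```

Each draw could then be passed through the grid search, and the selected values of C compared across draws.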

### MODEL 2: Logistic regression classifier - Skewed data

Now apply K-fold Cross Validation (CV) to find the best hyper-parameter C with the whole train data, as was done above.

K-fold CV is now computationally much more expensive, because every fold is fitted on far more samples.

In [None]:
best_c = ?

Use the best C to train LogReg model with the whole train data and test it with whole test data. 


In [None]:
lr = ?
?
y_pred = ?

# Compute and print confusion matrix
?

# Compute and print Recall metric.
?

