### Business Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 

This is a bespoke dataset of transactions made by credit cards. It contains 180 frauds out of 4700 transactions, so it is highly imbalanced: the positive class (frauds) accounts for 3.83% of all transactions.
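
As a quick sanity check, the class shares can be recomputed from the two counts quoted above. The variable names are illustrative only.

```python
# Counts quoted in the text above
n_fraud = 180
n_total = 4700

fraud_share = n_fraud / n_total * 100
print(f"Fraud share:     {fraud_share:.2f}%")        # 3.83%
print(f"Non-fraud share: {100 - fraud_share:.2f}%")  # 96.17%
```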

It contains only numerical input variables, which are the result of a PCA transformation. The original features, and further background information about the data, are withheld to maintain confidentiality. Features PC1, PC2, … PC5 are the principal components obtained with PCA; the only columns which have not been transformed with PCA are 'ID' and 'Class'. Feature 'Class' is the response variable and it takes the value 1 in case of fraud and 0 otherwise.

Here is a very famous dataset on [fraud detection](https://www.kaggle.com/mlg-ulb/creditcardfraud), which is available on Kaggle and is similar to this dataset.

### Task 1: Loading libraries and dataset

In [None]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# import dataset



### Task 2: Exploring the dataset

In [None]:
# check the shape of dataset



In [None]:
# check head of the dataset



In [None]:
# check information about the dataset like - missing values, datatypes etc



In [None]:
# check the target class distribution



In [None]:
# create a visual plot to see the target distribution



In [None]:
# create some scatterplots based on features and see if you can see some pattern in the data



### Task 3: Evaluation metric selection

In [None]:
# baseline accuracy of the model

round(data['Class'].value_counts(normalize=True) * 100, 2)
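
The majority-class share acts as a baseline because a model that always predicts "non-fraud" already reaches that accuracy. The sketch below illustrates this with labels rebuilt from the counts quoted in the Business Context (180 frauds in 4700 rows), not with the real `data` frame.

```python
import numpy as np

# Labels rebuilt from the quoted counts: 180 frauds, 4520 non-frauds (stand-in for data['Class'])
y_true = np.array([1] * 180 + [0] * 4520)

# A "model" that always predicts the majority class (non-fraud)
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
frauds_caught = int(((y_pred == 1) & (y_true == 1)).sum())
recall = frauds_caught / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.4f}")  # 0.9617
print(f"recall   = {recall:.4f}")    # 0.0000
```

High accuracy with zero recall is why accuracy alone is a poor metric for this problem.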

For this dataset, let's treat fraudulent transactions (denoted as 1 in the dataset) as the `positive` class and non-fraudulent transactions (denoted as 0 in the dataset) as the `negative` class.

- TP - transactions which are actually fraudulent and which the model correctly identifies as fraudulent
- FP - transactions which are actually non-fraudulent but which the model predicts as fraudulent
- TN - transactions which are actually non-fraudulent and which the model also predicts as non-fraudulent
- FN - transactions which are actually fraudulent but which the model predicts as non-fraudulent
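
The four outcomes can be counted directly from labels and predictions. The sketch below uses small toy arrays invented for illustration, not the notebook's data.

```python
import numpy as np

# Toy labels and predictions, invented for illustration (1 = fraud, 0 = non-fraud)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # fraud, predicted fraud
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # non-fraud, predicted fraud
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # non-fraud, predicted non-fraud
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # fraud, predicted non-fraud

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=2 FP=1 TN=5 FN=2
```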

![alt text](images/confusion_matrix.png "confusion matrix")

Of the two kinds of misclassification, FN (false negative) and FP (false positive), which one is costlier for this business problem?

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{precision} = \frac{TP}{TP + FP}$$
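
As a worked example of the two formulas, the sketch below plugs in hypothetical counts; the numbers are made up and are not results from this dataset.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual frauds that the model catches."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted frauds that really are frauds."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only
tp, fp, fn = 30, 20, 10

print(recall(tp, fn))     # 0.75
print(precision(tp, fp))  # 0.6
```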

### Task 4: Baseline model

In [None]:
# drop the ID variable as it is unique for all the transactions and does not have any meaningful information about the data



In [None]:
# extract features and target from the original dataset



In [None]:
# split the dataset into training and test sets to train and evaluate the model respectively



In [None]:
# create a random forest model



In [None]:
# predict for the test dataset



In [None]:
# plot the confusion matrix



In [None]:
# print the classification report



### Task 5: Resampling techniques for imbalanced data

Models trained on imbalanced datasets are generally biased towards the majority class of the target variable. In this case the majority class is non-fraudulent transactions and the minority class is fraudulent transactions. If we don't balance these two classes, the machine learning algorithm will tend to favour the majority class, so it becomes important to balance the classes of the target variable. There are two ways in which we can balance them:

- **Undersampling**: we randomly select as many observations from the majority class as there are in the minority class, so that the two classes are balanced
- **Oversampling**: we increase the number of minority class observations until it matches the number of majority class observations. This can be done in two ways:
    - **Minority Oversampling**: here we create duplicates of existing observations from the minority class
    - **SMOTE (Synthetic Minority Oversampling Technique)**: here we create new observations for the minority class, based on those that already exist. It randomly picks a point from the minority class and computes the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.
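
Random undersampling and minority oversampling, as described in the list above, can be sketched with NumPy alone. The toy data, the seed and all variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data, invented for illustration: 95 non-fraud rows, 5 fraud rows
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Undersampling: keep a random majority subset the same size as the minority class
keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([keep, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Minority oversampling: draw minority rows with replacement until the classes match
extra = rng.choice(minority_idx, size=len(majority_idx), replace=True)
over_idx = np.concatenate([majority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [5 5]
print(np.bincount(y_over))   # [95 95]
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.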

![alt text](images/resampling.png "data resampling")

#### Synthetic Minority Oversampling Technique (SMOTE)

![alt text](images/smote.png "SMOTE")
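
To make the interpolation idea concrete, here is a minimal NumPy sketch of the SMOTE step: pick a minority point, pick one of its k nearest minority neighbours, and place a synthetic point on the segment between them. The function `smote_sketch` and its parameters are hypothetical and simplified; in practice the `imblearn` library would be used instead.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # randomly chosen minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class, invented for illustration
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because each synthetic row is a convex combination of two existing rows, it always falls between real minority observations rather than duplicating one.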

In [None]:
# import imblearn library and resample the original data using SMOTE technique



In [None]:
# train a random forest model on SMOTE data



In [None]:
# predict the classes on test data using model built on SMOTE data and plot the confusion matrix



In [None]:
# print the classification report



### Task 6: Computing ROC AUC Curve

By default, most probabilistic classifiers use a probability threshold of 0.5 to separate the positive and negative classes. If we tune this threshold to a value which increases the true positive rate, we will be able to increase the recall for fraudulent transactions.
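
The effect of the threshold can be seen on a handful of made-up probabilities. The arrays and the helper `recall_precision_at` are assumptions for illustration and do not come from the notebook's model.

```python
import numpy as np

# Toy labels and predicted fraud probabilities, invented for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.90, 0.60, 0.30, 0.10, 0.40, 0.20, 0.08, 0.05, 0.03, 0.01])


def recall_precision_at(threshold):
    """Classify as fraud when the probability reaches the threshold, then score."""
    y_pred = (proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return tp / (tp + fn), tp / (tp + fp)


print(recall_precision_at(0.50))  # (0.5, 1.0)
print(recall_precision_at(0.25))  # (0.75, 0.75)
```

In this toy case, lowering the threshold catches more frauds (higher recall) but also flags a legitimate transaction (lower precision).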

To do that we first plot the ROC curve and compute the AUC score. The AUC score is the probability that a random observation from the positive class (i.e. fraudulent transactions) receives a higher predicted probability than a random observation from the negative class (i.e. non-fraudulent transactions). An AUC value of 1 means that every fraudulent transaction has a higher predicted probability of being fraudulent than every non-fraudulent transaction, which is the ideal case.
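
This ranking interpretation of AUC can be checked by brute force: compare every (fraud, non-fraud) pair and count how often the fraud gets the higher score. The toy arrays below are invented for illustration.

```python
import numpy as np

# Toy labels and predicted fraud probabilities, invented for illustration
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.90, 0.40, 0.35, 0.50, 0.30, 0.20, 0.10])

pos = proba[y_true == 1]
neg = proba[y_true == 0]

# Compare every (positive, negative) pair; a tie counts as half a win
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(round(float(auc), 4))  # 0.8333
```

Here 10 of the 12 pairs are ranked correctly. `sklearn.metrics.roc_auc_score` computes the same quantity without the explicit pairwise loop.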

In [None]:
# let's compute the ROC curve and the AUC score for the model we developed on SMOTE data



In [None]:
# let's use another probability threshold so that we can get to the elbow position in the above curve



### Task 7: Adjusting probability threshold

In [None]:
# compute the probabilities of test observations using rf_smote model



In [None]:
# compare these probabilities against the probability threshold of 6% rather than the default threshold of 50%



In [None]:
# plot the confusion matrix



In [None]:
# print the classification report

