<a href="https://colab.research.google.com/github/vividesigns/Credit-Card-Fraud-Detection/blob/main/CreditCardFraudDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Credit Card Fraud Detection**

# Background

* Anonymized credit card transactions labeled as fraudulent or genuine
* Data can be downloaded from [kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv) or from [here](https://raw.githubusercontent.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/master/creditcard.csv)
* The dataset contains transactions made by credit cards in September 2013 by European cardholders.
* It covers transactions that occurred over two days: 492 frauds out of 284,807 transactions.
* The dataset is **highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.**
* It contains only numerical input variables which are the result of a PCA transformation. 
> Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. 
> Features V1, V2, … V28 are the principal components obtained with PCA
* The only features which have not been transformed with PCA are 'Time' and 'Amount'. 
* Feature _'Time'_ contains the seconds elapsed between each transaction and the first transaction in the dataset. 
* Feature _'Amount'_ is the transaction amount; it can be used for example-dependent cost-sensitive learning.
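
The imbalance figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts stated in the list (492 frauds, 284,807 transactions); the variable names are illustrative.

```python
# Counts taken from the dataset description above.
n_total = 284_807
n_fraud = 492
n_genuine = n_total - n_fraud

# Share of the positive (fraud) class and the genuine-to-fraud ratio.
fraud_share = n_fraud / n_total
genuine_per_fraud = n_genuine / n_fraud

print(f"Fraud share: {fraud_share:.4%}")
print(f"Genuine transactions per fraud: {genuine_per_fraud:.0f}")
```

The share comes out at roughly 0.17%, which means several hundred genuine transactions for every fraud.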

  ## [Cost Sensitive Learning](https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/#:~:text=Cost%2DSensitive%20Learning%20is%20a%20type%20of%20learning%20that%20takes,Encyclopedia%20of%20Machine%20Learning%2C%202010.)
  * A subfield of ML that takes the costs of prediction errors (and potentially other costs) into account when training a model.

  * Cost-sensitive techniques fall into three groups: data resampling, algorithm modifications, and ensemble methods.

  * Costs are assigned per cell of the confusion matrix (TP, FP, TN, FN), so that, for example, a missed fraud (false negative) can be penalized more heavily than a false alarm (false positive).

  * Scikit-learn supports cost-sensitive variants of some algorithms through the `class_weight` argument, for example on `SVC` and `DecisionTreeClassifier`.

  * Feature _'Class'_ is the response variable and it takes value 1 in case of fraud and 0 otherwise.
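
To make the `class_weight` idea concrete, the sketch below computes the weights that scikit-learn's documented `"balanced"` heuristic, `n_samples / (n_classes * class_count)`, would assign for this dataset's class counts. It is written in plain Python so it runs without scikit-learn; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This mirrors the "balanced" heuristic documented for scikit-learn's
    class_weight argument; the function itself is an illustrative helper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Labels with the same class counts as the dataset: 492 frauds (1), the rest genuine (0).
labels = [1] * 492 + [0] * (284_807 - 492)
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary weights each fraud several hundred times more heavily than a genuine transaction. A dictionary of this shape, or the string `"balanced"`, can be passed as `class_weight` to estimators such as `DecisionTreeClassifier` or `SVC`.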

# Recommendation:

> Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
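
The following sketch illustrates why accuracy misleads on imbalanced data while the area under the precision-recall curve does not. It computes average precision (a step-wise estimate of AUPRC) by hand on toy labels and scores that are invented for illustration; the function name `average_precision` is an assumption, not part of the notebook.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Averages the precision measured at the rank of each true positive,
    with samples sorted by descending score. Assumes no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Toy labels and scores, invented for illustration (not from the dataset).
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.90, 0.20, 0.80, 0.30, 0.35, 0.05, 0.15, 0.25]

# A model that always predicts "genuine" still scores high accuracy.
baseline_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
auprc = average_precision(y_true, scores)

print(f"Always-genuine accuracy: {baseline_accuracy:.2f}")
print(f"Average precision:       {auprc:.2f}")
```

In practice, `sklearn.metrics.average_precision_score` computes the same kind of summary from true labels and predicted scores.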

# **Load Libraries**

In [1]:
import pandas as pd
!pip install pycaret

Collecting pycaret
  Downloading https://files.pythonhosted.org/packages/33/4d/792832e86c34eb7f8c06f1805f19ef72a2d38b11435502b69fca3409b84c/pycaret-2.2.2-py3-none-any.whl (249kB)

In [2]:
data_link='https://raw.githubusercontent.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/master/creditcard.csv'

# **Load & Explore Data**

In [3]:
# Load Data

data=pd.read_csv(data_link)

In [9]:
# Convert all columns to float

data = data.astype(float)

In [13]:
data.shape

(284807, 31)

In [20]:
print("Instances of Class 1: {}".format(data[data.Class == 1].shape[0]))

Instances of Class 1: 492
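
A fuller view of the class distribution can be obtained with `value_counts`. The sketch below uses a small toy frame in place of the real data so that it runs on its own; the values and the name `toy` are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values invented for illustration).
toy = pd.DataFrame({"Class": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]})

# Absolute counts and relative shares per class.
counts = toy["Class"].value_counts()
shares = toy["Class"].value_counts(normalize=True)

print(counts)
print(shares)
```

Applied to the notebook's `data`, the same two calls on `data["Class"]` would show the fraud count next to its share of all transactions.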


# **Setup PyCaret**

In [21]:
# Import Classification module

from pycaret.classification import *

# Specify data and Target Class

clf=setup(data=data,target='Class') 

| | Description | Value |
|---|---|---|
| 0 | session_id | 4080 |
| 1 | Target | Class |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (284807, 31) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 30 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


# **Compare Models**

In [None]:
# compare_models() trains and evaluates every available model using k-fold cross-validation (10 folds by default)

compare_models()


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
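
The 10-fold cross-validation mentioned in the comment above can be sketched in plain Python. The version below is stratified, meaning each fold receives roughly the same share of frauds, which matters with so few positives. Whether `compare_models()` stratifies by default depends on the PyCaret version, so treat this as an illustration of the idea rather than PyCaret's implementation; the function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy labels: 20 frauds among 1,000 samples (invented for illustration).
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, k=10)

for i, test_idx in enumerate(folds):
    n_fraud = sum(labels[j] for j in test_idx)
    print(f"fold {i}: {len(test_idx)} samples, {n_fraud} frauds")
```

Each fold serves once as the held-out set while the model trains on the other nine, and the reported metrics are averaged over the ten runs.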
