# **Credit Card Fraud Detection**

**Problem**

In recent years, many people have moved their purchasing online and prefer credit or debit cards for transactions. Attackers exploit card features such as credit limits, and new attack methods emerge as technology advances. Fraud has become a significant problem because an online transaction can be completed quickly with nothing more than the card details.

**Approach using Machine Learning:**


When we talk about security in digital life, the main challenge is detecting fraudulent activity. Fraud detection is a set of actions taken to prevent money from being accessed illegally. In this project, we use three machine learning algorithms to classify transactions as normal or fraudulent. The algorithms learn patterns in transaction data, and a transaction flagged as fraudulent should be aborted. We use accuracy, precision, recall, and F1 score to decide which algorithm best fits this data.

**Dataset:**

We are using a dataset from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). Please download this dataset, or the CSV file from the shared folder (preferred), and upload it to the runtime session.

This dataset contains transactions made by credit cards in September 2013 by European cardholders. We have 30 features and one final class column. 

The dataset covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables. Features V1 to V28 are the result of a PCA transformation, while Time and Amount are not transformed.
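The imbalance figure follows directly from the two counts quoted above. A quick check in plain Python (the variable names are ours and are not used elsewhere in the notebook):

```python
# Counts quoted in the dataset description
n_fraud = 492
n_total = 284_807

fraud_share = n_fraud / n_total * 100  # percentage of fraudulent transactions
print(f"Fraud share: {fraud_share:.4f}%")

# A model that always predicts "normal" would already reach this accuracy,
# which is why accuracy alone says little on the raw, imbalanced data.
baseline_accuracy = 100 - fraud_share
print(f"Always-normal baseline accuracy: {baseline_accuracy:.4f}%")
```

The baseline line is the reason precision, recall, and F1 are reported alongside accuracy later on.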

**Importing the Dependencies**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The following link points to a shared folder that contains our CSV file.
Please download the file and upload it to the Colab session to run the notebook:
https://drive.google.com/drive/folders/1myGUb3fehpbl-ohiWDxNtKfcYfFe0UYi?usp=sharing

In [None]:
# loading the dataset to a Pandas DataFrame

path="/content/drive/MyDrive/credit card /creditcard.csv"
ccData = pd.read_csv(path)

## Exploratory Data Analysis

In [None]:
ccData.shape

In [None]:
# first 5 rows of the dataset
ccData.head()

In [None]:
ccData.tail()

In [None]:
# dataset information
ccData.info()

In [None]:
# checking the number of missing values in each column
ccData.isnull().sum()

In [None]:
ccData['Class'].value_counts()

0 --> Normal Transaction

1 --> fraudulent transaction

In [None]:
ccData.drop_duplicates(inplace=True)

**Removing Duplicate Transactions**

Duplicate rows act as a form of nonrandom sampling and can bias the fitted model: including them gives that subset of points extra weight, so the model tends to overfit it. We therefore remove all duplicate transactions.
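As a small illustration of what `drop_duplicates` does, here is a sketch on a toy DataFrame. The column values are made up for the example and have nothing to do with the real dataset:

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values)
toy = pd.DataFrame({
    "Amount": [10.0, 25.5, 10.0, 99.9],
    "Class":  [0,    0,    0,    1],
})

n_before = len(toy)
n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence by default
n_after = len(deduped)

print(n_before, n_dupes, n_after)
```

Counting with `duplicated().sum()` before dropping is a cheap way to see how many rows will disappear.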

In [None]:
# distribution of legit transactions & fraudulent transactions
ccData['Class'].value_counts()

The credit card dataset initially had 284,807 transactions; after removing the duplicates, it has 283,726 transactions.

This dataset is highly imbalanced (as shown below).

In [None]:
labels = ["Normal", "Fraud"]
plt.figure(figsize=(10, 10))
ccData['Class'].plot(kind = "hist",rot = 0)
plt.title("Class Imbalance")
plt.ylabel("Count")
plt.xticks(range(2),labels)
plt.show()

In [None]:
# separating the data for analysis
normal = ccData[ccData.Class == 0]
fraud = ccData[ccData.Class == 1]

In [None]:
print(normal.shape)
print(fraud.shape)

In [None]:
# statistical measures of the data
normal.Amount.describe()

In [None]:
fraud.Amount.describe()

In [None]:
# compare the values for both transactions
ccData.groupby('Class').mean()

## Under-Sampling

Build a sample dataset containing a similar distribution of normal transactions and fraudulent transactions.

Number of fraudulent transactions = 473
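The under-sampling idea can be sketched on toy data as follows. The frame, the `_toy` names, and the seed are assumptions for illustration only. Passing `random_state` makes the sample repeatable; without it, each run draws different normal rows and the downstream metrics shift slightly.

```python
import pandas as pd

# Toy imbalanced data: 20 normal rows, 4 fraud rows (hypothetical values)
toy = pd.DataFrame({
    "Amount": range(24),
    "Class":  [0] * 20 + [1] * 4,
})

normal_toy = toy[toy.Class == 0]
fraud_toy = toy[toy.Class == 1]

# Sample as many normal rows as there are fraud rows; the seed makes it repeatable
sampled_toy = normal_toy.sample(n=len(fraud_toy), random_state=42)
balanced_toy = pd.concat([sampled_toy, fraud_toy], axis=0)

print(balanced_toy["Class"].value_counts())
```

Using `len(fraud_toy)` instead of a hard-coded count keeps the two classes balanced even if the number of fraud rows changes.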

In [None]:
sampled_normal = normal.sample(n=473)

**Concatenating two DataFrames**

In [None]:
ccData_new = pd.concat([sampled_normal, fraud], axis=0)

In [None]:
ccData_new.head()

In [None]:
ccData_new.tail()

In [None]:
ccData_new['Class'].value_counts()

In [None]:
labels = ["Normal", "Fraud"]
plt.figure(figsize=(10, 10))
ccData_new['Class'].plot(kind = "hist",rot = 0)
plt.title("After under sampling")
plt.ylabel("Count")
plt.xticks(range(2),labels)
plt.show()

In [None]:
ccData_new.groupby('Class').mean()

## Splitting the data into Features & Targets

In [None]:
X = ccData_new.drop(columns='Class')
Y = ccData_new['Class']

In [None]:
print(X)

In [None]:
print(Y)

## Split the data into Training data & Testing Data

To estimate how well the models generalize, we split the data into training and testing sets: 20% of the data is held out for testing and 80% is used to train the model.
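The resulting set sizes can be worked out by hand. The sketch below assumes that the test count is rounded up and the train count is rounded down when the fraction does not divide evenly, which is our understanding of how `train_test_split` behaves; the printed shapes in the next cells are the authoritative answer.

```python
import math

n_rows = 473 + 473   # balanced dataset: sampled normal rows + fraud rows
test_size = 0.2

# Assumption: test count rounded up, train count rounded down
n_test = math.ceil(test_size * n_rows)
n_train = math.floor((1 - test_size) * n_rows)

print(n_train, n_test)
```

Under that assumption the test set has 190 rows, which matches the totals used when the confusion matrices are discussed.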

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.2, random_state=16)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

**Feature Scaling**

We first standardize the data because otherwise a feature with high variance may dominate distance computations. We use StandardScaler, which centers each feature (by subtracting its mean) and scales it to unit variance (by dividing by its standard deviation), so that no single feature dominates.
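To make the transformation concrete, here is a sketch of standardization on a single made-up feature column using only the standard library. It uses the population standard deviation, which is what we understand StandardScaler to use as well.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature column

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

# z = (x - mean) / std for every value in the column
scaled = [(v - mean) / std for v in values]

print(mean, std)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

In the notebook, the mean and standard deviation are learned from the training set only and then reused to transform the test set, so no information from the test data leaks into the scaling.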

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

## Model Training

**Logistic Regression**

Logistic regression is a widely used discriminative classification model. Since our target is binary, we use binary logistic regression. Binary logistic regression uses one or more predictor variables, which can be continuous or categorical, to predict the class of the target variable. It also helps identify the features (Xi) that influence the target variable (Y) and the nature of the relationship between each feature and the target.
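At prediction time, binary logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability. The sketch below shows that step with hypothetical weights; the fitted model learns its own weights from the training data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for a two-feature model
weights, bias = [1.5, -2.0], 0.5
p = predict_proba([1.0, 0.25], weights, bias)
label = int(p >= 0.5)  # threshold at 0.5 to turn the probability into a class
print(round(p, 3), label)
```

The sign and size of each weight indicate the direction and strength of a feature's influence on the predicted class, which is what makes the model easy to interpret.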

In [None]:
model = LogisticRegression()

In [None]:
# training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)

**Model Evaluation**

In [None]:
# accuracy on training data (scikit-learn metrics take y_true first, then y_pred)
train_pred = model.predict(X_train)
acc_train = accuracy_score(Y_train, train_pred)

In [None]:
print('Accuracy on Training data : ', acc_train)

In [None]:
# accuracy on test data (scikit-learn metrics take y_true first, then y_pred)
Y_pred = model.predict(X_test)
acc = accuracy_score(Y_test, Y_pred)*100

In [None]:
print('Accuracy score on Test Data : ', acc)

In [None]:
# y_true comes first; swapping the arguments exchanges recall and precision
rec = recall_score(Y_test, Y_pred)*100
print('Recall score on Test Data : ', rec)

In [None]:
f1 = f1_score(Y_test, Y_pred)*100
print('F1 score on Test Data : ', f1)

In [None]:
prec = precision_score(Y_test, Y_pred)*100
print('Precision on Test Data : ', prec)

In [None]:
LABELS = ['Normal', 'Fraud']
conf_matrix = confusion_matrix(Y_test,Y_pred)
sns.heatmap(conf_matrix, xticklabels = LABELS, 
            yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Out of 90 normal transactions, the logistic regression model predicted 86 correctly and 4 incorrectly. Out of 100 fraudulent transactions, it predicted 93 correctly and 7 incorrectly. In total, the logistic regression model classified 179 of the 190 test points correctly.
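Precision, recall, and F1 can be recomputed by hand from the per-class counts in the paragraph above, treating fraud as the positive class. This is a plain-Python sketch with our own variable names, useful as a sanity check on the scikit-learn scores:

```python
# Per-class counts from the paragraph above (fraud is the positive class)
tn, fp = 86, 4   # normal transactions: predicted normal / predicted fraud
fn, tp = 7, 93   # fraud transactions:  predicted normal / predicted fraud

precision_manual = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall_manual = tp / (tp + fn)     # share of real frauds that were caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"precision={precision_manual:.4f} recall={recall_manual:.4f} f1={f1_manual:.4f}")
```

For fraud detection, recall is often the metric to watch most closely, since a missed fraud (false negative) is usually costlier than a false alarm.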

**Random Forest Classifier**

A random forest is a supervised machine learning method built from many decision trees. The trees are trained with bagging (bootstrap aggregation), a meta-algorithm that improves accuracy by training each tree on a bootstrap sample of the data and combining the trees' outputs. The forest derives its prediction from the predictions of the individual trees: for classification it aggregates their votes (scikit-learn averages the trees' predicted class probabilities).
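The two ingredients described above, bootstrap sampling and vote aggregation, can be sketched in a few lines of plain Python. This is a simplified illustration with made-up tree votes, not how `RandomForestClassifier` is implemented internally.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most common label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                # stand-in for training row indices
sample = bootstrap_sample(rows, rng)  # each tree would train on such a sample

tree_votes = [1, 0, 1, 1, 0]          # hypothetical predictions from five trees
forest_prediction = majority_vote(tree_votes)
print(len(sample), forest_prediction)
```

Because every tree sees a slightly different sample, their individual errors tend to differ, and aggregating them reduces variance compared with a single deep tree.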

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(max_depth = 4)
rf_model.fit(X_train, Y_train)
rf_pred = rf_model.predict(X_test)

In [None]:
# scikit-learn metrics take y_true first; swapping the arguments exchanges recall and precision
rf_acc = accuracy_score(Y_test, rf_pred)*100
print('Accuracy score on Test Data : ', rf_acc)

In [None]:
rf_rec = recall_score(Y_test, rf_pred)*100
print('Recall score on Test Data : ', rf_rec)

In [None]:
rf_prec = precision_score(Y_test, rf_pred)*100
print('Precision score on Test Data : ', rf_prec)

In [None]:
rf_f1 = f1_score(Y_test, rf_pred)*100
print('F1 score on Test Data : ', rf_f1)

In [None]:
rfCM = confusion_matrix(Y_test, rf_pred)
sns.heatmap(rfCM, xticklabels = LABELS, 
            yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Out of 90 normal and 100 fraud transactions, the random forest model correctly predicted 89 normal transactions and 91 fraud transactions. In total, the random forest model classified 180 of the 190 test points correctly.

**K-nearest Neighbours Classifier**

k-NN is a supervised algorithm used to solve classification problems. In k-NN, the data point of interest (unseen, unlabelled data) is assigned a class based on the classes of the k closest known data points, by majority vote. The value of k determines how many neighboring points we consider. The crux of this algorithm is choosing a good value of k: too large a value may bias predictions against classes with fewer samples, while too small a value makes the model susceptible to outliers.

We plotted the mean error obtained for different values of k (from 1 to 39). In our case, k=8 gives the least mean error (maximum accuracy), so we proceed with that.
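To make the neighbor vote concrete, here is a minimal from-scratch k-NN classifier on made-up 2-D points. It uses Euclidean distance and a simple majority vote; it is a sketch of the idea, not a replacement for `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical points: class 0 near the origin, class 1 near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
pred_near_five = knn_predict(points, labels, (5.5, 5.5), k=3)
print(pred_near_origin, pred_near_five)
```

Because the vote depends on raw distances, the feature scaling step earlier in the notebook matters a great deal for this model.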

In [None]:
from sklearn.neighbors import KNeighborsClassifier
error = []

# Calculating error for K values from 1 to 39
for i in range(1, 40):
    knn_model = KNeighborsClassifier(n_neighbors=i)
    knn_model.fit(X_train, Y_train)
    pred_i = knn_model.predict(X_test)
    error.append(np.mean(pred_i != Y_test))

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, marker='o', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

In [None]:
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(X_train, Y_train)
knn_pred = classifier.predict(X_test)

In [None]:
# scikit-learn metrics take y_true first; swapping the arguments exchanges recall and precision
knn_acc = accuracy_score(Y_test, knn_pred)*100
print('Accuracy score on Test Data : ', knn_acc)

In [None]:
knn_rec = recall_score(Y_test, knn_pred)*100
print('Recall score on Test Data : ', knn_rec)

In [None]:
knn_prec = precision_score(Y_test, knn_pred)*100
print('Precision score on Test Data : ', knn_prec)

In [None]:
knn_f1 = f1_score(Y_test, knn_pred)*100
print('F1 score on Test Data : ', knn_f1)

In [None]:
knnCM = confusion_matrix(Y_test, knn_pred)
sns.heatmap(knnCM, xticklabels = LABELS, 
            yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Out of 90 normal transactions and 100 fraud transactions, the KNN model correctly predicted 87 normal transactions and 92 fraud transactions. In total, the KNN model classified 179 of the 190 test points correctly.

# Results

In [None]:
comp = {'Algorithm': ['Logistic Regression', 'K-Nearest Neighbours', 'Random Forest'], 
        'Accuracy': [acc,knn_acc,rf_acc],
        'Recall': [rec, knn_rec, rf_rec],
        'Precision': [prec, knn_prec, rf_prec],
        'F1 Score': [f1,knn_f1,rf_f1]} 

In [None]:
df = pd.DataFrame(comp)  
df

# Conclusion
In this project, we tested three algorithms, Logistic Regression, K-nearest neighbors, and Random Forest, to predict whether a credit card transaction is fraudulent. After training and testing the three models, we observed that all performed comparably well, with Random Forest performing slightly better in a few aspects.