# Credit Card Fraud Detection

## Description: Logistic Regression and SVM for Credit Card Fraud Detection. In this assignment, you will use logistic regression and SVM to predict whether a credit card transaction is fraudulent, using the Credit Card Fraud Detection dataset.

## Instructions:

### 1. Load the dataset from “creditcardfraud.csv”.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns

In [2]:
data = pd.read_csv("creditcardfraud.csv")

In [3]:
data.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,82450,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.28839,-0.132137,...,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,0.76,0
1,50554,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,...,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.39603,-0.112901,4.18,0
2,55125,-0.391128,-0.24554,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,...,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.0081,0.163716,0.239582,15.0,0
3,116572,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,...,0.355576,0.90757,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.04233,57.0,0
4,90434,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,...,0.103563,0.620954,0.197077,0.692392,-0.20653,-0.021328,-0.019823,-0.042682,0.0,0


In [4]:
data.shape

(600, 31)

### 2. Preprocess the data by:

● Dropping the Time column since it is not useful for classification.

● Scaling the Amount column using a standard scaler.

In [5]:
# Dropping the Time column since it is not useful for classification.

data.drop("Time", axis=1, inplace=True)

In [6]:
# Scale the Amount column with StandardScaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Reshape 'Amount' to a 2D array
Amount = data['Amount'].values.reshape(-1, 1)

data['Amount'] = scaler.fit_transform(Amount)

In [7]:
data['Amount']

0     -0.443051
1     -0.427821
2     -0.379637
3     -0.192602
4     -0.446435
         ...   
595    0.248265
596   -0.443051
597    0.565779
598   -0.311458
599   -0.001159
Name: Amount, Length: 600, dtype: float64
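
For reference, `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below checks this on a small made-up `Amount` column; the values and the name `toy` are illustrative and are not read from `creditcardfraud.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical amounts, used only to illustrate the transformation
toy = pd.DataFrame({"Amount": [0.76, 4.18, 15.0, 57.0, 0.0]})

# StandardScaler expects 2D input, so select the column with double brackets
scaled = StandardScaler().fit_transform(toy[["Amount"]]).ravel()

# Manual z-score: (x - mean) / population standard deviation
manual = ((toy["Amount"] - toy["Amount"].mean()) / toy["Amount"].std(ddof=0)).to_numpy()

# The two results should agree up to floating-point error
print(np.allclose(scaled, manual))
```

Selecting `toy[["Amount"]]` returns a one-column DataFrame, which avoids the explicit `reshape(-1, 1)` step.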

### 3. Split the data into training and test sets using an 80:20 split ratio. Use random_state=42 for reproducibility.

In [8]:
X = data.drop('Class', axis=1)
y = data['Class']

In [9]:
X

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.288390,-0.132137,-0.597739,...,-0.058040,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,-0.443051
1,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,0.170547,...,-0.081298,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.396030,-0.112901,-0.427821
2,-0.391128,-0.245540,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,2.051312,...,0.065716,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.008100,0.163716,0.239582,-0.379637
3,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,-1.238598,...,-0.169706,0.355576,0.907570,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.042330,-0.192602
4,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,1.166335,...,-0.282777,0.103563,0.620954,0.197077,0.692392,-0.206530,-0.021328,-0.019823,-0.042682,-0.446435
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,-2.783865,1.596824,-2.084844,2.512986,-1.446749,-0.828496,-0.732262,-0.203329,-0.347046,-2.162061,...,-0.515001,0.203563,0.293268,0.199568,0.146868,0.163602,-0.624085,-1.333100,0.428634,0.248265
596,-1.532810,2.232752,-5.923100,3.386708,-0.153443,-1.419748,-3.878576,1.444656,-1.465542,-5.208335,...,0.520840,0.632505,-0.070838,-0.490291,-0.359983,0.050678,1.095671,0.471741,-0.106667,-0.443051
597,-0.440095,1.137239,-3.227080,3.242293,-2.033998,-1.618415,-3.028013,0.764555,-1.801937,-4.711769,...,0.895841,0.764187,-0.275578,-0.343572,0.233085,0.606434,-0.315433,0.768291,0.459623,0.565779
598,-13.086519,7.352148,-18.256576,10.648505,-11.731476,-3.659167,-14.873658,8.810473,-5.418204,-13.202577,...,-1.376298,2.761157,-0.266162,-0.412861,0.519952,-0.743909,-0.167808,-2.498300,-0.711066,-0.311458


In [10]:
y

0      0
1      0
2      0
3      0
4      0
      ..
595    1
596    1
597    1
598    1
599    1
Name: Class, Length: 600, dtype: int64

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(480, 29) (480,)
(120, 29) (120,)
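
With only 600 rows, a plain random split can shift the class balance slightly between the training and test sets. One option, which this notebook does not use, is to pass `stratify=y` to `train_test_split`. A minimal sketch on synthetic data (`X_demo` and `y_demo` are made up, and the perfectly balanced labels are an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 600 rows, 29 features, perfectly balanced labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 29))
y_demo = np.repeat([0, 1], 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both subsets keep the original class ratio
print(np.bincount(y_tr), np.bincount(y_te))
```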


### 4. Train a logistic regression model on the training set using the default hyperparameters.

In [12]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

In [13]:
# Make predictions on the test data
y_pred_lr = model.predict(X_test)

### 5. Evaluate the model's performance on the test set using:

● Confusion matrix

● Classification report

In [14]:
from sklearn.metrics import confusion_matrix, classification_report

# Evaluate the model
conf_matrix_LR = confusion_matrix(y_test, y_pred_lr)
report_LR = classification_report(y_test, y_pred_lr)

print(f"Logistic Regression Confusion Matrix:\n{conf_matrix_LR}")
print("Logistic Regression Classification Report:\n", report_LR)

Logistic Regression Confusion Matrix:
[[59  3]
 [ 4 54]]
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.95      0.94        62
           1       0.95      0.93      0.94        58

    accuracy                           0.94       120
   macro avg       0.94      0.94      0.94       120
weighted avg       0.94      0.94      0.94       120
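
The figures in the report can be recovered from the confusion matrix. In scikit-learn, rows are true labels and columns are predicted labels, so for a binary problem `ravel()` yields `tn, fp, fn, tp`. The sketch below hard-codes the matrix printed above rather than recomputing it:

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above
cm = np.array([[59, 3],
               [4, 54]])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

precision_fraud = tp / (tp + fp)   # share of flagged transactions that are fraud
recall_fraud = tp / (tp + fn)      # share of fraud cases that were caught
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision_fraud:.2f} recall={recall_fraud:.2f} accuracy={accuracy:.2f}")
```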



### 6. Train an SVM model on the training set using the default hyperparameters. Evaluate the model's performance on the test set using the same evaluation metrics as in step 5.

In [15]:
from sklearn.svm import SVC

# Create an SVM model
svm_model = SVC()

# Train the model on the training data
svm_model.fit(X_train, y_train)

In [16]:
# Make predictions on the test data
y_pred_svm = svm_model.predict(X_test)

In [17]:
from sklearn.metrics import confusion_matrix, classification_report

# Evaluate the model
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)

print("SVM Confusion Matrix:\n", conf_matrix_svm)
print("SVM Classification Report:\n", report_svm)

SVM Confusion Matrix:
 [[62  0]
 [ 6 52]]
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.91      1.00      0.95        62
           1       1.00      0.90      0.95        58

    accuracy                           0.95       120
   macro avg       0.96      0.95      0.95       120
weighted avg       0.95      0.95      0.95       120



### 7. Tune the hyperparameters of the logistic regression model and the SVM model using grid search cross-validation. Use a range of values for the hyperparameters of your choice. Choose an evaluation metric (e.g., accuracy score) to optimize the hyperparameters.

##### Hyperparameter Tuning for Logistic Regression

In [18]:
from sklearn.model_selection import GridSearchCV

lr_model = LogisticRegression(max_iter=1000)

lr_grid_search = GridSearchCV(lr_model, {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2']}, cv=5, scoring='accuracy')

# Perform grid search cross-validation
lr_grid_search.fit(X_train, y_train)

In [19]:
dir(lr_grid_search)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_select_best_index',
 '_validate_data',
 '_validate_params',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'feature_names_in_',
 'fit',
 'get_params',
 'inverse_transform',
 

In [20]:
# Get the best score
best_lr_score = lr_grid_search.best_score_
print(best_lr_score)

0.94375


In [21]:
# Get the best hyperparameters
best_lr_params = lr_grid_search.best_params_
print(best_lr_params)

{'C': 0.1, 'penalty': 'l2'}
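
The per-candidate scores behind `best_score_` and `best_params_` are stored in `cv_results_`, which loads cleanly into a DataFrame. A sketch on synthetic data from `make_classification` (the data and the name `demo_search` are illustrative, not the notebook's objects):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked setting comes first
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Looking at `std_test_score` next to `mean_test_score` shows whether the winning value of `C` is clearly better or only marginally ahead of its neighbours.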


##### Hyperparameter Tuning for Support Vector Machine

In [22]:
# Define the hyperparameter grid for SVM
svm_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf', 'poly']}

# Create an SVM model
svm_model = SVC()

# Perform grid search cross-validation
svm_grid_search = GridSearchCV(svm_model, svm_param_grid, cv=5, scoring='accuracy')
svm_grid_search.fit(X_train, y_train)

In [23]:
# Get the best score
best_svm_score = svm_grid_search.best_score_
print(best_svm_score)

0.94375


In [24]:
# Get the best hyperparameters
best_svm_params = svm_grid_search.best_params_
print(best_svm_params)

{'C': 0.1, 'kernel': 'linear'}
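
Because `GridSearchCV` defaults to `refit=True`, it already retrains the winning configuration on the full training data and exposes it as `best_estimator_`, so a separate retraining step is optional. A sketch on synthetic data (all `demo` names and the smaller grid are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

# best_estimator_ is an SVC refit on all of X_tr with best_params_
refit_model = demo_search.best_estimator_

# Retraining by hand with the same parameters gives the same predictions
manual_model = SVC(**demo_search.best_params_).fit(X_tr, y_tr)
same = bool((refit_model.predict(X_te) == manual_model.predict(X_te)).all())
print(demo_search.best_params_, same)
```

`best_estimator_` also keeps settings that are not part of the grid. In this notebook, for example, `LogisticRegression(**best_lr_params)` falls back to the default `max_iter`, whereas `lr_grid_search.best_estimator_` retains `max_iter=1000`.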


### 8. Train the logistic regression model and the SVM model with the optimal hyperparameters on the training set. Evaluate their performance on the test set using the same evaluation metrics as in step 5.

##### Training Logistic Regression with the optimal hyperparameters on the training set

In [25]:
# Train the Logistic Regression model with the best hyperparameters
best_lr_model = LogisticRegression(**best_lr_params)
best_lr_model.fit(X_train, y_train)

In [26]:
# Make predictions on the test set
lr_predictions = best_lr_model.predict(X_test)

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the model
accuracy_LR_best = accuracy_score(y_test, lr_predictions)
conf_matrix_LR_best = confusion_matrix(y_test, lr_predictions)
report_LR_best = classification_report(y_test, lr_predictions)

print("\n Accuracy of Logistic Regression after Hyperparameter Tuning:\n", accuracy_LR_best)
print(f"Logistic Regression Confusion Matrix:\n{conf_matrix_LR_best}")
print("Logistic Regression Classification Report:\n", report_LR_best)


 Accuracy of Logistic Regression after Hyperparameter Tuning:
 0.9583333333333334
Logistic Regression Confusion Matrix:
[[61  1]
 [ 4 54]]
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96        62
           1       0.98      0.93      0.96        58

    accuracy                           0.96       120
   macro avg       0.96      0.96      0.96       120
weighted avg       0.96      0.96      0.96       120



##### Training SVM with the optimal hyperparameters on the training set

In [28]:
# Train the SVM model with the best hyperparameters
best_svm_model = SVC(**best_svm_params)
best_svm_model.fit(X_train, y_train)

In [29]:
# Make predictions on the test set
svm_predictions = best_svm_model.predict(X_test)

In [30]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the model
accuracy_svm_best = accuracy_score(y_test, svm_predictions)
conf_matrix_svm_best = confusion_matrix(y_test, svm_predictions)
report_svm_best = classification_report(y_test, svm_predictions)

print("\n Accuracy of SVM after Hyperparameter Tuning:\n", accuracy_svm_best)
print("SVM Confusion Matrix:\n", conf_matrix_svm_best)
print("SVM Classification Report:\n", report_svm_best)


 Accuracy of SVM after Hyperparameter Tuning:
 0.9583333333333334
SVM Confusion Matrix:
 [[62  0]
 [ 5 53]]
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.93      1.00      0.96        62
           1       1.00      0.91      0.95        58

    accuracy                           0.96       120
   macro avg       0.96      0.96      0.96       120
weighted avg       0.96      0.96      0.96       120



### 9. Compare the performance of the logistic regression model and the SVM model using the evaluation metrics from steps 5 and 8. Interpret the results and provide insights on which model performed better and why.

In [35]:
from sklearn.metrics import accuracy_score

# Logistic Regression
lr_accuracy = accuracy_score(y_test, lr_predictions)

# SVM
svm_accuracy = accuracy_score(y_test, svm_predictions)

# Print the results
print("Logistic Regression")
print(f"Accuracy: {lr_accuracy:.4f}")

print("\nSVM")
print(f"Accuracy: {svm_accuracy:.4f}")

Logistic Regression
Accuracy: 0.9583

SVM
Accuracy: 0.9583


#### Since both accuracy scores are the same, we compare recall on the fraud class to choose between the two models.

In [38]:
from sklearn.metrics import recall_score

# Logistic Regression
lr_recall = recall_score(y_test, lr_predictions)

# SVM
svm_recall = recall_score(y_test, svm_predictions)

# Print the results
print("Logistic Regression")
print(f"Recall: {lr_recall:.4f}")

print("\nSVM")
print(f"Recall: {svm_recall:.4f}")

Logistic Regression
Recall: 0.9310

SVM
Recall: 0.9138
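
Recall alone hides the other side of the trade-off. The sketch below recomputes precision, recall, and F1 for the fraud class from the two tuned confusion matrices printed in step 8; the helper `fraud_metrics` is a hypothetical name introduced for illustration.

```python
import numpy as np

def fraud_metrics(cm):
    """Return error counts, precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "missed_fraud": int(fn),
        "false_alarms": int(fp),
        "precision": round(float(precision), 4),
        "recall": round(float(recall), 4),
        "f1": round(float(f1), 4),
    }

# Matrices copied from the tuned-model outputs in step 8
tuned = {
    "Logistic Regression": [[61, 1], [4, 54]],
    "SVM": [[62, 0], [5, 53]],
}

summary = {name: fraud_metrics(cm) for name, cm in tuned.items()}
for name, row in summary.items():
    print(name, row)
```

On this test set, logistic regression misses 4 fraudulent transactions and raises 1 false alarm, while the SVM misses 5 and raises none.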


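#### The Logistic Regression model edges out the SVM model on recall for the fraud class (0.9310 compared to 0.9138): it caught 54 of the 58 fraudulent test transactions, against 53 for SVM. The SVM, in turn, produced no false positives, while Logistic Regression produced one.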

### 10. Summarize your findings and conclusions in a brief report.

#### After hyperparameter tuning, Logistic Regression and SVM reached the same accuracy (0.9583) on the 120-transaction test set. Recall on the fraud class, which measures how many fraudulent transactions are actually caught, separates them: Logistic Regression scored 0.9310 against 0.9138 for SVM, so it is preferred when a missed fraud is the costlier error. The margin is a single transaction, and SVM had the higher fraud-class precision (1.00 compared to 0.98), so the two models are close on this dataset.