# CREDIT CARD FRAUD DETECTION

U.S. adults lost a record $10 billion to fraudsters in 2023, according to the Federal Trade Commission:
https://www.ftc.gov/news-events/news/press-releases/2024/02/nationwide-fraud-losses-top-10-billion-2023-ftc-steps-efforts-protect-public

The aim of this project is to build a machine learning model that predicts whether a transaction is fraudulent.

In [1]:
import pandas as pd

In [2]:
#loading dataset to the notebook
dataframe = pd.read_csv("card_transdata.csv")

In [3]:
#preview of the first 5 rows of the dataset
top_five = dataframe.head(5)
top_five

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


### Data Understanding

In [4]:
#Function to check the number of rows and columns in the dataset
def shape_of(data):
    rows, columns = data.shape
    print(f"This dataset has {columns} columns and {rows} rows")

In [5]:
shape_of(dataframe)

This dataset has 8 columns and 1000000 rows


In [6]:
# Function to print column info: non-null counts and datatypes
def information_of(data):
    return data.info()


In [7]:
information_of(dataframe)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


The dataset has only 8 columns, all with self-explanatory names, so no further column-by-column review or domain research is needed.

### Feature selection

> Because this is a machine learning project, we perform feature selection first. Rather than carrying along variables that have little effect on the dependent variable, we keep only the features that most influence fraud.

In [8]:
# import libraries from sklearn to use for feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [9]:
#Split the data into the target (dependent) and predictor (independent) variables.
X = dataframe.drop("fraud", axis=1)
y = dataframe["fraud"]

In [10]:
#Split the data into training and testing datasets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3, random_state = 29)
#Fit a random forest on the training data
rfc= RandomForestClassifier(random_state=29)
rfc.fit(X_train,y_train)
#Rank the features by importance, highest first
feature_importance = pd.Series(rfc.feature_importances_,index=X.columns).sort_values(ascending=False)
print(feature_importance)

ratio_to_median_purchase_price    0.522565
online_order                      0.179292
distance_from_home                0.133756
used_pin_number                   0.060114
used_chip                         0.055498
distance_from_last_transaction    0.042845
repeat_retailer                   0.005930
dtype: float64


In [11]:
#keep only the four most important features, plus the target column
important_features = dataframe[["ratio_to_median_purchase_price", "online_order", "distance_from_home","used_pin_number","fraud"]]
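
To see how much of the total importance the four selected features carry, the printed scores can be summed directly. The sketch below hard-codes the importances from the output above. The names `importances`, `top_k`, and `kept_share` are illustrative and not part of the notebook.

```python
# Feature importances copied from the random forest output above.
importances = {
    "ratio_to_median_purchase_price": 0.522565,
    "online_order": 0.179292,
    "distance_from_home": 0.133756,
    "used_pin_number": 0.060114,
    "used_chip": 0.055498,
    "distance_from_last_transaction": 0.042845,
    "repeat_retailer": 0.005930,
}

top_k = 4  # assumption: keep the four highest-ranked features, as in the cell above

# Rank features from most to least important and keep the first top_k.
ranked = sorted(importances, key=importances.get, reverse=True)
selected = ranked[:top_k]

# Share of the total importance carried by the selected features.
kept_share = sum(importances[name] for name in selected)

print(selected)
print(f"Share of total importance kept: {kept_share:.3f}")
```

With these scores, the four selected features account for roughly 90% of the total importance, which is one way to justify dropping the remaining three.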

Reuse the earlier helper functions to inspect the reduced dataset.

In [12]:
#checking for the shape of the data
shape_of(important_features)

This dataset has 5 columns and 1000000 rows


### Data cleaning

In [13]:
#checking the information of the dataset
information_of(important_features)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   ratio_to_median_purchase_price  1000000 non-null  float64
 1   online_order                    1000000 non-null  float64
 2   distance_from_home              1000000 non-null  float64
 3   used_pin_number                 1000000 non-null  float64
 4   fraud                           1000000 non-null  float64
dtypes: float64(5)
memory usage: 38.1 MB


There are no null values and the datatypes are correct, so the only data cleaning step is to check for duplicates.

In [14]:
#function to check for duplicates.
def checking_duplicates(data):
    return data.duplicated().sum()

In [15]:
checking_duplicates(important_features)

0

There are no duplicates, so the data is ready for analysis.

# Machine Learning

## Logistic Regression Base Model

Because this is a binary classification problem, we use logistic regression as our base model.

Logistic regression classifies data into two categories; in our case, fraud or non-fraud.
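
As a rough illustration of how logistic regression turns features into a class, the sketch below computes a weighted sum, passes it through the sigmoid function, and applies a 0.5 threshold. The weights, bias, and feature values are hypothetical and are not taken from the model fitted in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one transaction."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical weights for two scaled features; not taken from the fitted model.
weights = [1.5, 0.8]
bias = -2.0

prob, label = predict_label([2.0, 1.0], weights, bias)
print(f"P(fraud) = {prob:.2f}, predicted class = {label}")
```

Lowering `threshold` would flag more transactions as fraud, trading precision for recall.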

In [16]:
# Base model: logistic regression, with StandardScaler to scale the features.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [17]:
#choose the predictor and target variables
X = important_features.drop("fraud", axis = 1)
y = important_features["fraud"]

In [18]:
#split the data into training and testing data
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3, random_state=29)

#initialize the scaler
scaler = StandardScaler()

#fit the scaler on the training set only, then apply it to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#initialize and fit the model
logreg = LogisticRegression(random_state=29)
logreg.fit(X_train_scaled, y_train)
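
The scaling step above learns its statistics from the training set and reuses them on the test set. The sketch below shows the same idea in plain Python for a single feature, assuming standardization with the mean and population standard deviation. The helper names `fit_scaler` and `transform` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)

def transform(values, center, scale):
    """Standardize values with statistics learned from the training data."""
    return [(v - center) / scale for v in values]

# Hypothetical distance_from_home values, chosen only for illustration.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

center, scale = fit_scaler(train)
train_scaled = transform(train, center, scale)
test_scaled = transform(test, center, scale)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set out of the model, which is why the test set is only transformed and never fitted.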

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

y_pred = logreg.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)


print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

Accuracy: 0.96
Confusion Matrix:
 [[271999   1845]
 [ 11520  14636]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273844
         1.0       0.89      0.56      0.69     26156

    accuracy                           0.96    300000
   macro avg       0.92      0.78      0.83    300000
weighted avg       0.95      0.96      0.95    300000
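
The fraud-class metrics in the report can be recomputed by hand from the confusion matrix, which also shows why accuracy alone is misleading on imbalanced data. This sketch uses the four counts printed above. The variable names are illustrative.

```python
# Confusion matrix from the logistic regression output above:
# rows are actual classes, columns are predicted classes.
tn, fp = 271999, 1845
fn, tp = 11520, 14636

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every transaction as non-fraud.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"always-non-fraud baseline accuracy={baseline_accuracy:.2f}")
```

A model that never predicts fraud would already reach about 0.91 accuracy on this test set, so the fraud-class recall of 0.56 is the more informative number: the baseline misses 11,520 of the 26,156 fraudulent transactions.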



In [23]:
#checking for overfitting
logistic_train_pred = logreg.predict(X_train_scaled)
logistic_test_pred = logreg.predict(X_test_scaled)


logistic_train_accuracy = accuracy_score(y_train, logistic_train_pred)
logistic_test_accuracy = accuracy_score(y_test, logistic_test_pred)

print(f"Training Accuracy: {logistic_train_accuracy:.2f}")
print(f"Testing Accuracy: {logistic_test_accuracy:.2f}")


Training Accuracy: 0.96
Testing Accuracy: 0.96

Training and testing accuracy are equal, so the baseline model shows no sign of overfitting.


## Random Forest

We use a random forest because it is an ensemble method that can capture both linear and non-linear relationships. Random forests are also fairly robust to noise and outliers, which should help improve accuracy.

In [20]:
# rfc was initialized during feature selection; calling fit again retrains it from scratch on the four scaled features
rfc.fit(X_train_scaled, y_train)
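
Tree-based models split on thresholds, so standardizing the features is not expected to change which rows a split separates. Reusing the scaled data here is harmless, but it is not required. The sketch below illustrates this with a single threshold split, the building block of each tree, on hypothetical values. `best_split` is an illustrative helper, not a scikit-learn function.

```python
from statistics import mean, pstdev

def best_split(values, labels):
    """Find the threshold split with the fewest misclassifications.

    Returns the set of row indices that fall on the low side of the split.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = len(values) + 1, frozenset()
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        # Errors when predicting 0 on the low side and 1 on the high side.
        errors = sum(labels[i] for i in left) + sum(1 - labels[i] for i in right)
        # The reverse labelling makes the complementary number of errors.
        errors = min(errors, len(values) - errors)
        if errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

# Hypothetical feature values and fraud labels, for illustration only.
raw = [2.2, 5.1, 10.8, 44.2, 57.9, 120.0]
labels = [0, 0, 0, 0, 1, 1]

center, scale = mean(raw), pstdev(raw)
scaled = [(v - center) / scale for v in raw]

# Standardization preserves the ordering of values, so the best split
# separates exactly the same rows before and after scaling.
print(best_split(raw, labels) == best_split(scaled, labels))
```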



In [25]:
rfc_y_pred = rfc.predict(X_test_scaled)
rfc_accuracy = accuracy_score(y_test, rfc_y_pred)
rfc_conf_matrix = confusion_matrix(y_test, rfc_y_pred)
rfc_classification_report = classification_report(y_test, rfc_y_pred)


print(f"Accuracy: {rfc_accuracy:.2f}")
print("Confusion Matrix:\n", rfc_conf_matrix)
print("Classification Report:\n", rfc_classification_report)

Accuracy: 0.98
Confusion Matrix:
 [[271435   2409]
 [  3123  23033]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.99      0.99    273844
         1.0       0.91      0.88      0.89     26156

    accuracy                           0.98    300000
   macro avg       0.95      0.94      0.94    300000
weighted avg       0.98      0.98      0.98    300000



This model improves on the baseline: recall for the fraud class rises from 0.56 to 0.88, and the number of missed frauds (false negatives) drops from 11,520 to 3,123.

In [22]:
#Checking for overfitting
rfc_train_pred = rfc.predict(X_train_scaled)
rfc_test_pred = rfc.predict(X_test_scaled)

rfc_train_accuracy = accuracy_score(y_train, rfc_train_pred)
rfc_test_accuracy = accuracy_score(y_test, rfc_test_pred)

print(f"Training Accuracy: {rfc_train_accuracy:.2f}")
print(f"Testing Accuracy: {rfc_test_accuracy:.2f}")


Training Accuracy: 1.00
Testing Accuracy: 0.98

Training accuracy is slightly higher than testing accuracy (1.00 vs 0.98), which points to mild overfitting, although performance on the test set remains strong.
