# Credit Card Fraud Detection

![image.jpg](https://st.depositphotos.com/1252160/4236/i/450/depositphotos_42366967-stock-photo-internet-theft-concept.jpg)

Fraudulent activity in financial transactions poses significant challenges to businesses and consumers alike. Detecting and preventing it is crucial for maintaining trust and security in online transactions. In this project, we use machine learning, specifically scikit-learn's `RandomForestClassifier`, to build a fraud detection model.

## Dataset Loading and Preprocessing

### Load the dataset

In [1]:
import pandas as pd

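# NOTE: both frames are read from the same file (fraudTest.csv), so train_df and
# test_df hold identical rows and the "test" scores further down are not a held-out estimate.
train_df = pd.read_csv('/content/fraudTest.csv')
test_df = pd.read_csv('/content/fraudTest.csv')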

In [2]:
train_df.head()

| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2020-06-21 12:14:25 | 2291163933867244 | fraud_Kirlin and Sons | personal_care | 2.86 | Jeff | Elliott | M | 351 Darlene Green | ... | 33.9659 | -80.9355 | 333497.0 | Mechanical engineer | 1968-03-19 | 2da90c7d74bd46a0caf3777415b3ebd3 | 1371817000.0 | 33.986391 | -81.200714 | 0.0 |
| 1 | 1 | 2020-06-21 12:14:33 | 3573030041201292 | fraud_Sporer-Keebler | personal_care | 29.84 | Joanne | Williams | F | 3638 Marsh Union | ... | 40.3207 | -110.436 | 302.0 | Sales professional, IT | 1990-01-17 | 324cc204407e99f51b0d6ca0055005e7 | 1371817000.0 | 39.450498 | -109.960431 | 0.0 |
| 2 | 2 | 2020-06-21 12:14:53 | 3598215285024754 | fraud_Swaniawski, Nitzsche and Welch | health_fitness | 41.28 | Ashley | Lopez | F | 9333 Valentine Point | ... | 40.6729 | -73.5365 | 34496.0 | Librarian, public | 1970-10-21 | c81755dbbbea9d5c77f094348a7579be | 1371817000.0 | 40.49581 | -74.196111 | 0.0 |
| 3 | 3 | 2020-06-21 12:15:15 | 3591919803438423 | fraud_Haley Group | misc_pos | 60.05 | Brian | Williams | M | 32941 Krystal Mill Apt. 552 | ... | 28.5697 | -80.8191 | 54767.0 | Set designer | 1987-07-25 | 2159175b9efe66dc301f149d3d5abf8c | 1371817000.0 | 28.812398 | -80.883061 | 0.0 |
| 4 | 4 | 2020-06-21 12:15:17 | 3526826139003047 | fraud_Johnston-Casper | travel | 3.19 | Nathan | Massey | M | 5783 Evan Roads Apt. 465 | ... | 44.2529 | -85.01700000000001 | 1126.0 | Furniture designer | 1955-07-06 | 57ff021bd3f328f8738bb535c302a31b | 1371817000.0 | 44.959148 | -85.884734 | 0.0 |


In [3]:
test_df.head()

| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2020-06-21 12:14:25 | 2291163933867244 | fraud_Kirlin and Sons | personal_care | 2.86 | Jeff | Elliott | M | 351 Darlene Green | ... | 33.9659 | -80.9355 | 333497.0 | Mechanical engineer | 1968-03-19 | 2da90c7d74bd46a0caf3777415b3ebd3 | 1371817000.0 | 33.986391 | -81.200714 | 0.0 |
| 1 | 1 | 2020-06-21 12:14:33 | 3573030041201292 | fraud_Sporer-Keebler | personal_care | 29.84 | Joanne | Williams | F | 3638 Marsh Union | ... | 40.3207 | -110.436 | 302.0 | Sales professional, IT | 1990-01-17 | 324cc204407e99f51b0d6ca0055005e7 | 1371817000.0 | 39.450498 | -109.960431 | 0.0 |
| 2 | 2 | 2020-06-21 12:14:53 | 3598215285024754 | fraud_Swaniawski, Nitzsche and Welch | health_fitness | 41.28 | Ashley | Lopez | F | 9333 Valentine Point | ... | 40.6729 | -73.5365 | 34496.0 | Librarian, public | 1970-10-21 | c81755dbbbea9d5c77f094348a7579be | 1371817000.0 | 40.49581 | -74.196111 | 0.0 |
| 3 | 3 | 2020-06-21 12:15:15 | 3591919803438423 | fraud_Haley Group | misc_pos | 60.05 | Brian | Williams | M | 32941 Krystal Mill Apt. 552 | ... | 28.5697 | -80.8191 | 54767.0 | Set designer | 1987-07-25 | 2159175b9efe66dc301f149d3d5abf8c | 1371817000.0 | 28.812398 | -80.883061 | 0.0 |
| 4 | 4 | 2020-06-21 12:15:17 | 3526826139003047 | fraud_Johnston-Casper | travel | 3.19 | Nathan | Massey | M | 5783 Evan Roads Apt. 465 | ... | 44.2529 | -85.01700000000001 | 1126.0 | Furniture designer | 1955-07-06 | 57ff021bd3f328f8738bb535c302a31b | 1371817000.0 | 44.959148 | -85.884734 | 0.0 |
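
The two previews above show the same rows, which follows from both frames being read from `/content/fraudTest.csv`. A minimal sketch of a check that could be run after loading, assuming `trans_num` identifies a transaction uniquely; the helper name `frames_overlap` and the toy frames are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd


def frames_overlap(train: pd.DataFrame, test: pd.DataFrame, key: str) -> float:
    """Return the fraction of test rows whose key value also appears in train."""
    return test[key].isin(train[key]).mean()


# Toy frames standing in for train_df and test_df (hypothetical values)
toy_train = pd.DataFrame({"trans_num": ["a", "b", "c"], "amt": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"trans_num": ["c", "d"], "amt": [3.0, 4.0]})

overlap = frames_overlap(toy_train, toy_test, key="trans_num")
print(f"Share of test transactions also present in train: {overlap:.0%}")
```

An overlap well above zero suggests the test frame is not a held-out sample, so scores computed on it would overstate how the model generalizes.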


### Data preprocessing

In [4]:
train_df.isnull().sum()

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 1
job                      1
dob                      1
trans_num                1
unix_time                1
merch_lat                1
merch_long               1
is_fraud                 1
dtype: int64
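
The counts above show exactly one missing value in each of the last eight columns, which is consistent with a single incomplete row. Before dropping anything, it can help to look at the affected rows. The sketch below does this on a toy frame with hypothetical values; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (values are made up for illustration)
toy = pd.DataFrame({
    "amt": [2.86, 29.84, 41.28],
    "city_pop": [333497.0, 302.0, np.nan],
    "is_fraud": [0.0, 0.0, np.nan],
})

# Select every row that has at least one missing value
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete)

# Drop those rows and report how many were removed
cleaned = toy.dropna()
print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```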

In [5]:
train_df.dropna(inplace=True)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3900 entries, 0 to 3899
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             3900 non-null   int64  
 1   trans_date_trans_time  3900 non-null   object 
 2   cc_num                 3900 non-null   int64  
 3   merchant               3900 non-null   object 
 4   category               3900 non-null   object 
 5   amt                    3900 non-null   float64
 6   first                  3900 non-null   object 
 7   last                   3900 non-null   object 
 8   gender                 3900 non-null   object 
 9   street                 3900 non-null   object 
 10  city                   3900 non-null   object 
 11  state                  3900 non-null   object 
 12  zip                    3900 non-null   int64  
 13  lat                    3900 non-null   float64
 14  long                   3900 non-null   object 
 15  city_pop 

In [6]:
test_df.isnull().sum()

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 1
job                      1
dob                      1
trans_num                1
unix_time                1
merch_lat                1
merch_long               1
is_fraud                 1
dtype: int64

In [7]:
test_df.dropna(inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3900 entries, 0 to 3899
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             3900 non-null   int64  
 1   trans_date_trans_time  3900 non-null   object 
 2   cc_num                 3900 non-null   int64  
 3   merchant               3900 non-null   object 
 4   category               3900 non-null   object 
 5   amt                    3900 non-null   float64
 6   first                  3900 non-null   object 
 7   last                   3900 non-null   object 
 8   gender                 3900 non-null   object 
 9   street                 3900 non-null   object 
 10  city                   3900 non-null   object 
 11  state                  3900 non-null   object 
 12  zip                    3900 non-null   int64  
 13  lat                    3900 non-null   float64
 14  long                   3900 non-null   object 
 15  city_pop 

### Feature engineering

In [8]:
train_df['trans_date_trans_time'] = pd.to_datetime(train_df['trans_date_trans_time'])
test_df['trans_date_trans_time'] = pd.to_datetime(test_df['trans_date_trans_time'])

train_df['transaction_hour'] = train_df['trans_date_trans_time'].dt.hour
test_df['transaction_hour'] = test_df['trans_date_trans_time'].dt.hour

train_df['transaction_day'] = train_df['trans_date_trans_time'].dt.day
test_df['transaction_day'] = test_df['trans_date_trans_time'].dt.day

train_df['transaction_month'] = train_df['trans_date_trans_time'].dt.month
test_df['transaction_month'] = test_df['trans_date_trans_time'].dt.month

In [9]:
train_df.drop(columns=['trans_date_trans_time'], inplace=True)
test_df.drop(columns=['trans_date_trans_time'], inplace=True)
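
To make the effect of the `.dt` accessors concrete, here is a small sketch using two timestamps taken from the preview rows. The extra `transaction_dayofweek` column is a hypothetical addition, not something the notebook computes.

```python
import pandas as pd

# Two timestamps copied from the first and last preview rows
sample = pd.DataFrame(
    {"trans_date_trans_time": ["2020-06-21 12:14:25", "2020-06-21 12:15:17"]}
)
sample["trans_date_trans_time"] = pd.to_datetime(sample["trans_date_trans_time"])

ts = sample["trans_date_trans_time"]
sample["transaction_hour"] = ts.dt.hour
sample["transaction_day"] = ts.dt.day
sample["transaction_month"] = ts.dt.month
# Hypothetical extra feature: Monday is 0, Sunday is 6
sample["transaction_dayofweek"] = ts.dt.dayofweek

# The raw timestamp is dropped once the parts have been extracted
print(sample.drop(columns=["trans_date_trans_time"]))
```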

### Drop unnecessary columns

In [10]:
drop_columns = ['Unnamed: 0', 'first', 'last', 'street', 'dob', 'trans_num']
train_df.drop(columns=drop_columns, inplace=True)
test_df.drop(columns=drop_columns, inplace=True)

### Encode categorical variables

In [11]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

label_encoder = LabelEncoder()
categorical_columns = ['merchant', 'category', 'gender', 'state', 'job', 'city']
for col in categorical_columns:
    train_df[col] = label_encoder.fit_transform(train_df[col])
    test_df[col] = label_encoder.transform(test_df[col])
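
`LabelEncoder.transform` raises a `ValueError` when it meets a value that was not present during `fit`, which can happen as soon as the test data contains a merchant or job the training data lacks. One possible alternative, shown here as a sketch on toy data rather than as the notebook's method, is scikit-learn's `OrdinalEncoder`, which can map unseen categories to a sentinel value. The sentinel `-1` and the toy frames are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames: "health_fitness" appears only in the test data
toy_train = pd.DataFrame({
    "category": ["travel", "misc_pos", "personal_care"],
    "gender": ["M", "F", "M"],
})
toy_test = pd.DataFrame({
    "category": ["travel", "health_fitness"],
    "gender": ["F", "M"],
})

cols = ["category", "gender"]
# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[cols])
test_codes = encoder.transform(toy_test[cols])

print(test_codes)
```

Because one encoder handles all listed columns at once, it can also be saved alongside the model and reused on new data.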

In [12]:
# Inspect column dtypes (note that `long` is still object, not numeric)
print(train_df.dtypes)

cc_num                 int64
merchant               int64
category               int64
amt                  float64
gender                 int64
city                   int64
state                  int64
zip                    int64
lat                  float64
long                  object
city_pop             float64
job                    int64
unix_time            float64
merch_lat            float64
merch_long           float64
is_fraud             float64
transaction_hour       int32
transaction_day        int32
transaction_month      int32
dtype: object
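
The listing above shows `long` stored as `object` even though its previewed values look numeric. A minimal sketch of an explicit conversion with `pd.to_numeric`, on toy values; `errors="coerce"` turns anything unparseable into `NaN` so it can be inspected. Whether the real column contains such entries is not shown in the notebook, so the third value below is purely hypothetical.

```python
import pandas as pd

# Toy column: two longitudes from the preview plus one hypothetical bad entry
toy = pd.DataFrame({"long": ["-80.9355", "-110.436", "not_a_number"]})

# Convert to float; unparseable strings become NaN instead of raising
toy["long"] = pd.to_numeric(toy["long"], errors="coerce")

print(toy.dtypes)
print(f"Unparseable entries: {toy['long'].isna().sum()}")
```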


### Scaling numeric features

In [13]:
scaler = StandardScaler()
feature_columns = ['cc_num', 'amt', 'zip', 'lat', 'long', 'city_pop', 'unix_time', 'merch_lat', 'merch_long', 'transaction_hour', 'transaction_day', 'transaction_month']
train_df[feature_columns] = scaler.fit_transform(train_df[feature_columns])
test_df[feature_columns] = scaler.transform(test_df[feature_columns])
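
The cell above fits the scaler on the training frame and reuses its statistics for the test frame. The toy sketch below, with made-up amounts, shows what that means in practice: `transform` applies the training mean and standard deviation rather than recomputing them. Note that in this notebook the scaler is fitted before the train/validation split, so validation rows contribute to those statistics; fitting after the split would avoid that.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts
train_amt = np.array([[2.0], [4.0], [6.0]])
test_amt = np.array([[4.0], [8.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_amt)  # learns mean and std from train only
test_scaled = scaler.transform(test_amt)        # reuses the training statistics

print(f"Training mean: {scaler.mean_[0]}, training std: {scaler.scale_[0]:.4f}")
print(test_scaled.ravel())
```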

## Split data into training and validation sets

In [14]:
from sklearn.model_selection import train_test_split

X_train = train_df.drop('is_fraud', axis=1)
y_train = train_df['is_fraud']
X_test = test_df.drop('is_fraud', axis=1)
y_test = test_df['is_fraud']

In [15]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
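
With so few fraud cases, a plain random split can leave the validation set with very few positives, or none. Passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below illustrates this on synthetic data with a 1% positive rate; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10 of them labelled as fraud
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000)
y[:10] = 1

# stratify=y preserves the 1% fraud rate in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Fraud cases in train: {int(y_tr.sum())}, in validation: {int(y_va.sum())}")
```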

## Model Training and Evaluation

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize the model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Evaluate on the validation set
y_val_pred = rf_model.predict(X_val)
y_val_pred_proba = rf_model.predict_proba(X_val)[:, 1]

# Print evaluation metrics
print("Validation Set Evaluation:")
print(classification_report(y_val, y_val_pred))
print(confusion_matrix(y_val, y_val_pred))
print(f'Validation ROC-AUC Score: {roc_auc_score(y_val, y_val_pred_proba)}')

Validation Set Evaluation:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       776
         1.0       1.00      0.75      0.86         4

    accuracy                           1.00       780
   macro avg       1.00      0.88      0.93       780
weighted avg       1.00      1.00      1.00       780

[[776   0]
 [  1   3]]
Validation ROC-AUC Score: 1.0
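
The report above combines a ROC-AUC of 1.0 with a fraud recall of 0.75 (3 of 4 cases caught). An AUC of 1.0 means every fraud case in the validation set was scored above every legitimate transaction, so the missed case comes from the 0.5 cut-off that `predict` effectively applies, not from the ranking. The sketch below shows, on hypothetical scores, how a lower threshold can be picked from `precision_recall_curve`. With only four positives, a threshold chosen this way on the real data would be very noisy.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Hypothetical labels and predicted fraud probabilities (not from the notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.00, 0.01, 0.02, 0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90])

# Default decision rule: flag a transaction when its score reaches 0.5
default_pred = (y_score >= 0.5).astype(int)
default_recall = recall_score(y_true, default_pred)

# recall[i] is the recall obtained when flagging scores >= thresholds[i]
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
full_recall_thresholds = thresholds[recall[:-1] == 1.0]
chosen_threshold = full_recall_thresholds.max()  # highest cut-off that catches every fraud

tuned_pred = (y_score >= chosen_threshold).astype(int)
tuned_recall = recall_score(y_true, tuned_pred)

print(f"Recall at 0.5: {default_recall}, recall at {chosen_threshold}: {tuned_recall}")
```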


## Hyperparameter Tuning

In [18]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, scoring='roc_auc')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

# Retrieve the best model (GridSearchCV has already refit it on X_train with the best parameters)
best_rf_model = grid_search.best_estimator_

# Evaluate on the validation set with tuned model
y_val_pred = best_rf_model.predict(X_val)
y_val_pred_proba = best_rf_model.predict_proba(X_val)[:, 1]

# Print evaluation metrics
print("Tuned Validation Set Evaluation:")
print(classification_report(y_val, y_val_pred))
print(confusion_matrix(y_val, y_val_pred))
print(f'Tuned Validation ROC-AUC Score: {roc_auc_score(y_val, y_val_pred_proba)}')

# Evaluate on the test set with tuned model
y_test_pred = best_rf_model.predict(X_test)
y_test_pred_proba = best_rf_model.predict_proba(X_test)[:, 1]

# Print test set evaluation metrics
print("\nTest Set Evaluation:")
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
print(f'Test ROC-AUC Score: {roc_auc_score(y_test, y_test_pred_proba)}')


Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Tuned Validation Set Evaluation:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       776
         1.0       1.00      0.75      0.86         4

    accuracy                           1.00       780
   macro avg       1.00      0.88      0.93       780
weighted avg       1.00      1.00      1.00       780

[[776   0]
 [  1   3]]
Tuned Validation ROC-AUC Score: 1.0

Test Set Evaluation:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3882
         1.0       1.00      0.94      0.97        18

    accuracy                           1.00      3900
   macro avg       1.00      0.97      0.99      3900
weighted avg       1.00      1.00      1.00      3900

[[3882    0]
 [   1   17]]
Test ROC-AUC Score: 0.9999856889346843
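
To get a feel for the cost of the grid search above, the number of model fits can be worked out from the grid itself. This sketch uses scikit-learn's `ParameterGrid` with the same `param_grid` and `cv=3` as the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 combinations
cv_folds = 3
cv_fits = n_candidates * cv_folds  # one fit per combination and fold

# GridSearchCV refits the best combination once more on the full training data
print(f"{n_candidates} candidates, {cv_fits} cross-validation fits, plus 1 final refit")
```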


In [19]:
from joblib import dump

dump(best_rf_model, 'fraud_detection_model.joblib')

['fraud_detection_model.joblib']
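
A saved model is only useful if it can be restored and produces the same predictions. The following self-contained sketch trains a small stand-in forest on synthetic data, writes it to a temporary directory, and reloads it with `joblib.load`. The synthetic data and the small `n_estimators` value are assumptions made to keep the example fast.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed training data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save to a temporary location, then restore
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fraud_detection_model.joblib"
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X), restored.predict(X))
print(f"Predictions identical after reload: {same}")
```

Since the model was trained on encoded and scaled features, new data would need the same fitted encoders and scaler applied before calling `predict`, so it may be worth saving those objects as well.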