**TASK-5: CREDIT CARD FRAUD DETECTION**

> Author: Trisha Jaiswal

> Batch: June 2025 batch B33

> Domain: Data Science


> Aim: To build a machine learning model that identifies fraudulent credit card transactions.



## Load data

### Subtask:
Load the data from the CSV file into a pandas DataFrame.


**Reasoning**:
Import pandas and load the data into a DataFrame, then display the first 5 rows.



In [1]:
import pandas as pd

df = pd.read_csv('creditcard.csv')
display(df.head())

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0.0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0.0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0.0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0.0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0.0 |


## Preprocess data

### Subtask:
Preprocess and normalize the transaction data.


**Reasoning**:
Check for missing values in the DataFrame to understand the data quality before preprocessing.



In [2]:
print(df.isnull().sum())

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    1
Class     1
dtype: int64


**Reasoning**:
Handle the missing values before scaling. 'Amount' and 'Class' each contain exactly one missing value, so dropping the affected row (or rows) discards almost no data. Then standardize the 'Time' and 'Amount' columns to zero mean and unit variance, and separate the features from the target variable.
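
Before dropping anything, it can be worth confirming where the missing values sit. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) to list the affected rows and count how many rows `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: the last row is incomplete.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Amount": [149.62, 2.69, 378.66, np.nan],
    "Class": [0.0, 0.0, 1.0, np.nan],
})

# Rows that contain at least one missing value.
incomplete_rows = toy[toy.isnull().any(axis=1)]
print(incomplete_rows)

# Drop them and report how many rows were lost.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} row(s); {len(cleaned)} remain")
```

If both missing values turn out to share one row, as in this toy frame, a single row is lost.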



In [3]:
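from sklearn.preprocessing import StandardScaler

# Drop the row(s) holding the missing 'Amount' / 'Class' values.
df.dropna(inplace=True)

# Standardize 'Time' and 'Amount' (zero mean, unit variance); V1..V28 stay as loaded.
# Caveat: the scaler is fit on the full dataset, so rows that later land in the
# test set also contribute to the mean and variance.
scaler = StandardScaler()
df[['Time', 'Amount']] = scaler.fit_transform(df[['Time', 'Amount']])

# Separate the features from the target.
X = df.drop('Class', axis=1)
y = df['Class']

display(X.head())
display(y.head())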

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.997319 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 0.23126 |
| 1 | -1.997319 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | -0.350171 |
| 2 | -1.997295 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.52498 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 1.137617 |
| 3 | -1.997295 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.208038 | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 0.127898 |
| 4 | -1.99727 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | -0.083851 |


|  | Class |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
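
One possible refinement of the scaling step is to split the data first and fit the scaler on the training rows only, so that test rows never influence the mean and variance. The following is a minimal sketch on hypothetical random data (`toy` and its value ranges are assumptions, not the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same two unscaled columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Time": rng.uniform(0, 100_000, size=200),
    "Amount": rng.exponential(80.0, size=200),
})

# Split first, so the test rows never influence the scaler.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

cols = ["Time", "Amount"]
scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])  # learn mean/std on train only
test[cols] = scaler.transform(test[cols])        # reuse the training statistics

print(train[cols].mean().round(6).tolist())
print(train[cols].std(ddof=0).round(6).tolist())
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.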


## Handle class imbalance

### Subtask:
Handle class imbalance issues in the dataset.


**Reasoning**:
Check the distribution of the target variable `y` to understand the extent of class imbalance.



In [4]:
display(y.value_counts())
display(y.value_counts(normalize=True))

| Class | count |
|---|---|
| 0.0 | 235289 |
| 1.0 | 430 |


| Class | proportion |
|---|---|
| 0.0 | 0.998176 |
| 1.0 | 0.001824 |
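
These counts also show why plain accuracy is a poor yardstick here. As a quick worked example, a trivial baseline that labels every transaction as legitimate would score as follows (the baseline is an illustration, not a model from this notebook):

```python
# Class counts copied from the output above.
n_legit = 235289
n_fraud = 430
n_total = n_legit + n_fraud

# A trivial "model" that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total  # every legitimate row is correct
baseline_recall = 0 / n_fraud          # no fraud is ever caught

print(f"Accuracy: {baseline_accuracy:.4f}")        # Accuracy: 0.9982
print(f"Recall on fraud: {baseline_recall:.4f}")   # Recall on fraud: 0.0000
```

This is the motivation for reporting precision, recall, and F1-score on the fraud class rather than accuracy.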


**Reasoning**:
Apply RandomOverSampler to handle class imbalance by oversampling the minority class.



In [5]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

display(y_resampled.value_counts())

| Class | count |
|---|---|
| 0.0 | 235289 |
| 1.0 | 235289 |
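
Random oversampling in its basic form balances the classes by repeating existing minority rows rather than creating new ones. The sketch below illustrates the idea in plain NumPy on hypothetical toy arrays (`X_toy`, `y_toy`); it is not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 8 legitimate rows (class 0) and 2 fraud rows (class 1).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw extra minority rows with replacement until both classes are the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X_toy, X_toy[extra_idx]])
y_balanced = np.concatenate([y_toy, y_toy[extra_idx]])

n_fraud_rows = int((y_balanced == 1).sum())
n_distinct_fraud_rows = len(np.unique(X_balanced[y_balanced == 1], axis=0))
print(n_fraud_rows, n_distinct_fraud_rows)  # 8 2
```

The class counts match after resampling, but the number of distinct fraud rows stays the same, which matters for how the data is split afterwards.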


## Split data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Import the necessary function for splitting the data.



In [6]:
from sklearn.model_selection import train_test_split

**Reasoning**:
Split the resampled data into training and testing sets.



In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (376462, 30)
Shape of X_test: (94116, 30)
Shape of y_train: (376462,)
Shape of y_test: (94116,)
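
Because the split above is made after oversampling, copies of one fraud row can land on both sides of it. The following sketch measures that overlap on hypothetical toy data in which each of five fraud rows is repeated 40 times to mimic oversampling (all sizes are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 distinct fraud rows, each repeated 40 times.
distinct_fraud = rng.normal(size=(5, 3))
X_fraud = np.repeat(distinct_fraud, 40, axis=0)
X_legit = rng.normal(size=(200, 3))

X_all = np.vstack([X_legit, X_fraud])
y_all = np.array([0] * 200 + [1] * 200)

# Split AFTER oversampling, mirroring the order used in this notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Count test fraud rows that have an exact copy in the training set.
train_rows = {row.tobytes() for row in X_tr[y_tr == 1]}
test_fraud = X_te[y_te == 1]
n_seen = sum(row.tobytes() in train_rows for row in test_fraud)
print(f"{n_seen} of {len(test_fraud)} test fraud rows also appear in training")
```

When most test fraud rows have an identical twin in the training set, a model can score well by memorizing them, so test metrics measured this way tend to be optimistic.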


## Train model

### Subtask:
Train a classification algorithm (e.g., Logistic Regression or Random Forest) to classify transactions.


**Reasoning**:
Import the necessary classifier and instantiate it.



In [8]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)

**Reasoning**:
Fit the Random Forest classifier to the training data.



In [9]:
rf_model.fit(X_train, y_train)

## Evaluate model

### Subtask:
Evaluate the model's performance using appropriate metrics (precision, recall, F1-score).


**Reasoning**:
Calculate and print the precision, recall, and F1-score of the trained Random Forest model on the test set.



In [10]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = rf_model.predict(X_test)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Precision: 0.9999
Recall: 1.0000
F1-score: 1.0000
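
For reference, the three metrics follow directly from the confusion-matrix counts of the fraud class: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A small worked example with hypothetical counts (the helper `precision_recall_f1` is illustrative, not part of scikit-learn):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"Precision: {precision:.4f}")  # 80 / 100 -> 0.8000
print(f"Recall: {recall:.4f}")        # 80 / 90  -> 0.8889
print(f"F1-score: {f1:.4f}")          # 160 / 190 -> 0.8421
```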


## Improve model

### Subtask:
Consider techniques like oversampling or undersampling to improve the model's results.
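
One way to apply oversampling without inflating the test scores is to split first and resample only the training portion. The sketch below shows that order of operations on hypothetical synthetic data from `make_classification`, using `sklearn.utils.resample` for the oversampling step; imbalanced-learn's `RandomOverSampler`, as used earlier in this notebook, could be fit on the training portion instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical synthetic data with roughly 5% positives (not the real dataset).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# 1. Split first, stratified so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2. Oversample the minority class in the training part only.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate(
    [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
)

# 3. Train on the balanced training data, evaluate on the untouched test part.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_bal, y_bal)
y_hat = model.predict(X_te)

print(f"Precision: {precision_score(y_te, y_hat, zero_division=0):.4f}")
print(f"Recall: {recall_score(y_te, y_hat, zero_division=0):.4f}")
print(f"F1-score: {f1_score(y_te, y_hat, zero_division=0):.4f}")
```

The test part keeps the original class ratio, so the printed scores reflect performance on imbalanced data. Undersampling would follow the same pattern, drawing a subset of the majority rows in step 2 instead.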


## Summary:

### Data Analysis Key Findings

*   The dataset is heavily imbalanced: after removing missing values, fraudulent transactions make up only 0.1824% of the data (430 of 235,719 transactions).
*   Random oversampling was applied to the full dataset, before the train/test split, yielding 235,289 transactions in each class.
*   The 'Time' and 'Amount' features were standardized with StandardScaler.
*   A Random Forest classifier was trained on the oversampled and standardized data.
*   On the test set drawn from the oversampled data, the model reached a precision of 0.9999, a recall of 1.0000, and an F1-score of 1.0000.

### Insights or Next Steps

*   The near-perfect scores are measured on a test set drawn from the oversampled data. Because oversampling was applied before the train/test split, copies of the same fraudulent transactions can appear in both the training and test sets, so these scores are likely optimistic. Splitting first, oversampling only the training portion, and evaluating on a held-out set with the original class distribution would give a more realistic assessment.
*   Further steps could include comparing other classification algorithms or tuning the hyperparameters of the Random Forest model. Any such comparison should be judged on an imbalanced held-out set, since the scores reported here leave no visible room for improvement.
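
As a rough sketch of how such a comparison might look, the example below scores two candidate models with stratified cross-validation on hypothetical synthetic data. It swaps resampling for `class_weight="balanced"`, a different way of handling imbalance than the oversampling used in this notebook, and the model settings are assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data (about 5% positives), standing in for the real transactions.
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Candidate models; class_weight="balanced" reweights classes instead of resampling.
candidates = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the class ratio in every validation fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {scores[name]:.4f}")
```

Each validation fold keeps the original class ratio, so the mean F1 values are comparable across models without any duplicated rows crossing the train/validation boundary.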
