# Fraud Detection Assignment

1. Problem Statement
2. Dataset Overview
3. Data Cleaning
4. Exploratory Data Analysis
5. Model Selection & Training
6. Model Evaluation
7. Key Fraud Indicators
8. Business Insights & Prevention
9. Limitations & Future Improvements
10. Conclusion

---

In [1]:
pip install pandas numpy matplotlib seaborn scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2
Note: you may need to restart the kernel to use updated packages.


### Importing libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

print("All libraries imported successfully.")

All libraries imported successfully.


### 1. Dataset Overview

In [3]:
import pandas as pd

df=pd.read_csv("Fraud.csv")
df.head()

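|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|------|------|--------|----------|---------------|----------------|----------|----------------|----------------|---------|----------------|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |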


In [4]:
df.shape

(6362620, 11)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


### 2. Exploratory Data Analysis

In [6]:
df['isFraud'].value_counts()

isFraud
0    6354407
1       8213
Name: count, dtype: int64

In [7]:
df['isFraud'].value_counts(normalize=True)


isFraud
0    0.998709
1    0.001291
Name: proportion, dtype: float64

The dataset is highly imbalanced: only 8,213 of 6,362,620 transactions (about 0.13%) are fraudulent. This reflects real-world financial data, where fraud cases are rare. Because of this imbalance, accuracy alone is not a reliable evaluation metric, and greater emphasis should be placed on recall and precision for the fraud class.
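
To make the point about accuracy concrete, the short sketch below reuses the class counts printed by `value_counts()` and scores a hypothetical classifier that labels every transaction as legitimate. The "always predict 0" model is an illustration only, not something trained in this notebook.

```python
# Class counts taken from the value_counts() output above
n_legit = 6_354_407
n_fraud = 8_213
n_total = n_legit + n_fraud

# Hypothetical "model" that predicts 0 (not fraud) for every transaction
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud  # it never catches a single fraud case

print(f"Accuracy:     {baseline_accuracy:.6f}")
print(f"Fraud recall: {baseline_recall:.2f}")
```

Such a model would score roughly 99.87% accuracy while detecting no fraud at all, which is why the evaluation later in the notebook looks at precision and recall for class 1.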

#### Sample dataset

In [None]:
# Create a manageable working sample
df_sample = df.sample(frac=0.05, random_state=42)

df_sample.shape

(318131, 11)

### 3. Data Cleaning

In [10]:
df_sample.isnull().sum()


step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

### 4. Feature Selection & Preparation

In [26]:
df_sample.columns



Index(['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'type_CASH_OUT',
       'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER'],
      dtype='object')

In [27]:
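# Drop the account-ID columns (absent from the column listing above), then one-hot encode the type
df_sample = df_sample.drop(columns=['nameOrig', 'nameDest'], errors='ignore')
if 'type' in df_sample.columns:
    df_sample = pd.get_dummies(df_sample, columns=['type'], drop_first=True)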


### 5. Define Features (X) & Target (y)

In [28]:
X = df_sample.drop('isFraud', axis=1)
y = df_sample['isFraud']


The dataset was divided into input features (X) and the target variable (y), where `isFraud` indicates whether a transaction is fraudulent.

### 6. Train-Test Split

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


In [30]:
X_train.shape


(222691, 11)

In [31]:
X_test.shape


(95440, 11)

### 7. Train Baseline Model (Logistic Regression)

In [32]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


LogisticRegression(max_iter=1000)


### 8. Model Evaluation

In [None]:
#Prediction
y_pred = model.predict(X_test)

In [34]:
#Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)


array([[95298,    18],
       [   63,    61]])

In [35]:
#Classification Report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     95316
           1       0.77      0.49      0.60       124

    accuracy                           1.00     95440
   macro avg       0.89      0.75      0.80     95440
weighted avg       1.00      1.00      1.00     95440



The confusion matrix shows that the model classifies almost all legitimate transactions correctly: only 18 of 95,316 are flagged as fraud (false positives). Of the 124 fraudulent transactions in the test set, it catches 61 and misses 63, giving a recall of 0.49 at a precision of 0.77. The overall accuracy of 1.00 therefore mostly reflects the majority class. This behavior is expected for a baseline model trained on highly imbalanced data. In fraud detection, minimizing false negatives is critical, because undetected fraud leads directly to financial loss.
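
As a cross-check, the fraud-class metrics in the classification report can be recomputed by hand from the four cells of the confusion matrix. The sketch below hard-codes the values printed above; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
tn, fp = 95298, 18
fn, tp = 63, 61

precision = tp / (tp + fp)  # share of fraud alerts that are real fraud
recall = tp / (tp + fn)     # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The rounded values (0.77, 0.49, 0.60) agree with the row for class 1 in the report, while accuracy stays above 0.999 even though about half of the fraud cases are missed.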

#### ✅ KEY FRAUD INDICATORS
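The analysis suggests that transaction amount, balance changes before and after a transaction, and transaction type play a significant role in fraud detection. Fraudulent activity often involves unusual transaction values and sudden balance inconsistencies, which aligns with real-world fraud patterns. For example, in the first five rows shown earlier, both fraudulent transactions (a TRANSFER and a CASH_OUT of 181.0) reduce the origin balance to zero.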
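
One way to back such indicators with numbers is to inspect the coefficients of a logistic regression fitted on standardized features. The sketch below is an illustration under stated assumptions: it uses synthetic stand-in data rather than `Fraud.csv`, the three feature names are placeholders, and the label is constructed so that only the first feature matters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Fraud.csv sample); feature names are placeholders
rng = np.random.default_rng(42)
feature_names = ["amount", "oldbalanceOrg", "step"]
X_demo = rng.normal(size=(2000, 3))
# By construction, only the first feature drives the label
y_demo = (3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

# Standardize first so coefficient magnitudes are comparable across features
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Rank features by absolute coefficient size
coefs = pipe.named_steps["logisticregression"].coef_[0]
ranked = sorted(zip(feature_names, coefs), key=lambda item: abs(item[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>15}: {coef:+.3f}")
```

In the notebook itself, the same pattern could presumably be applied by fitting the pipeline on `X_train`, `y_train` and pairing the coefficients with `X_train.columns`. Coefficients of the unscaled model trained above are harder to compare, since the balance columns and the one-hot columns live on very different scales.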

#### ✅ FRAUD PREVENTION STRATEGIES
Based on the findings, organizations can implement real-time transaction monitoring, risk-based alerts for high-value or abnormal transactions, and additional verification for suspicious activity. These measures can reduce fraud while minimizing inconvenience to genuine users.
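
A risk-based alert can be as simple as a hand-written rule that runs before or alongside the model. The sketch below is hypothetical: the rule, the `needs_review` function, and the `RISKY_TYPES` set are assumptions made for illustration, and the five records are the rows from the `df.head()` output, reduced to the fields the rule needs.

```python
# First five rows from the df.head() output (only the fields the rule uses)
rows = [
    {"type": "PAYMENT",  "amount": 9839.64,  "oldbalanceOrg": 170136.0, "newbalanceOrig": 160296.36, "isFraud": 0},
    {"type": "PAYMENT",  "amount": 1864.28,  "oldbalanceOrg": 21249.0,  "newbalanceOrig": 19384.72,  "isFraud": 0},
    {"type": "TRANSFER", "amount": 181.0,    "oldbalanceOrg": 181.0,    "newbalanceOrig": 0.0,       "isFraud": 1},
    {"type": "CASH_OUT", "amount": 181.0,    "oldbalanceOrg": 181.0,    "newbalanceOrig": 0.0,       "isFraud": 1},
    {"type": "PAYMENT",  "amount": 11668.14, "oldbalanceOrg": 41554.0,  "newbalanceOrig": 29885.86,  "isFraud": 0},
]

RISKY_TYPES = {"TRANSFER", "CASH_OUT"}  # hypothetical choice for this sketch


def needs_review(txn, min_amount=0.0):
    """Hypothetical rule: flag risky-type transactions that empty the origin account."""
    drains_account = txn["oldbalanceOrg"] > 0 and txn["newbalanceOrig"] == 0
    return txn["type"] in RISKY_TYPES and drains_account and txn["amount"] >= min_amount


flags = [needs_review(t) for t in rows]
print(flags)  # [False, False, True, True, False]
```

On these five rows the rule happens to flag exactly the two fraudulent transactions, but five rows are not evidence. Any such rule would need to be validated on the full dataset, where many legitimate transfers may also empty an account.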

#### ✅ HOW TO MEASURE SUCCESS 
The effectiveness of fraud prevention strategies can be measured by monitoring improvements in recall, reductions in financial loss due to fraud, and the stability of the false positive rate over time. Continuous model retraining and performance tracking would help ensure long-term effectiveness.
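
Tracking recall and the false positive rate over time could look roughly like the sketch below. It assumes that each scored transaction is logged as a `(period, actual, predicted)` tuple once its true label is known. The function name, the record layout, and the sample records are all made up for illustration.

```python
from collections import defaultdict


def metrics_by_period(records):
    """Compute recall and false positive rate per period.

    records: iterable of (period, actual, predicted) tuples with 0/1 labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for period, actual, predicted in records:
        if actual == 1:
            key = "tp" if predicted == 1 else "fn"
        else:
            key = "fp" if predicted == 1 else "tn"
        counts[period][key] += 1

    summary = {}
    for period, c in sorted(counts.items()):
        fraud_total = c["tp"] + c["fn"]
        legit_total = c["fp"] + c["tn"]
        summary[period] = {
            # None marks periods with no cases of that class
            "recall": c["tp"] / fraud_total if fraud_total else None,
            "false_positive_rate": c["fp"] / legit_total if legit_total else None,
        }
    return summary


# Made-up records for illustration only
records = [
    ("week-1", 1, 1), ("week-1", 1, 0), ("week-1", 0, 0), ("week-1", 0, 1),
    ("week-2", 1, 1), ("week-2", 1, 1), ("week-2", 0, 0), ("week-2", 0, 0),
]
summary = metrics_by_period(records)
for period, values in summary.items():
    print(period, values)
```

A sustained drop in recall or a rise in the false positive rate between periods would be a signal to investigate drift and consider retraining.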

#### ✅ LIMITATIONS & FUTURE IMPROVEMENTS
This analysis was conducted on a 5% random sample of the dataset with a baseline logistic regression model. Future improvements could include handling class imbalance with resampling techniques, incorporating more advanced models, and deploying the system for real-time fraud detection.
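
As one example of a resampling technique, the sketch below implements plain random oversampling of the minority class with NumPy. It is a minimal illustration on toy data, assuming a binary 0/1 target. The `random_oversample` helper is hypothetical and is not part of this notebook or of scikit-learn.

```python
import numpy as np


def random_oversample(X, y, random_state=42):
    """Duplicate minority-class rows (with replacement) until both classes have equal counts."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()

    # Draw additional minority rows at random, with replacement
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

    # Keep every original row, append the duplicates, then shuffle
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Toy data: 95 legitimate rows and 5 fraud rows
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # [95 95]
```

Resampling should be applied to the training split only (for instance to `X_train`, `y_train`), so that the test set keeps the original class ratio and the reported metrics remain realistic.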

#### ✅ CONCLUSION 
This project demonstrates a structured and practical approach to fraud detection using machine learning. Despite data imbalance and system constraints, the baseline model provides meaningful insights into fraud behavior and establishes a starting point for further enhancement, particularly for improving fraud recall.