In [1]:
import pandas as pd

In [2]:
df=pd.read_csv("Fraud.csv")

In [3]:
df


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


In [4]:
missing_values = df.isnull().sum()
missing_values

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Assuming 'isFraud' column indicates fraud (1) or not (0)
X = df.drop(columns=['isFraud','step','nameOrig','oldbalanceOrg','newbalanceOrig','nameDest','oldbalanceDest'])
y = df['isFraud']

In [6]:
X

Unnamed: 0,type,amount,newbalanceDest,isFlaggedFraud
0,PAYMENT,9839.64,0.00,0
1,PAYMENT,1864.28,0.00,0
2,TRANSFER,181.00,0.00,0
3,CASH_OUT,181.00,0.00,0
4,PAYMENT,11668.14,0.00,0
...,...,...,...,...
6362615,CASH_OUT,339682.13,339682.13,0
6362616,TRANSFER,6311409.28,0.00,0
6362617,CASH_OUT,6311409.28,6379898.11,0
6362618,TRANSFER,850002.52,0.00,0


In [7]:
# Encode categorical features (one-hot encoding)
X_encoded = pd.get_dummies(X, columns=['type'], drop_first=True,sparse=True)

In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

In [9]:
# Create a RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)



In [10]:
# Make predictions on the test set
y_pred = rf_model.predict(X_test)



In [11]:
# Evaluate the model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.73      0.56      0.63      1620

    accuracy                           1.00   1272524
   macro avg       0.87      0.78      0.82   1272524
weighted avg       1.00      1.00      1.00   1272524



In [12]:
# Create a RandomForestClassifier with balanced class weights
rf_model_weighted = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')

In [13]:
# Train the model
rf_model_weighted.fit(X_train, y_train)




In [14]:
# Make predictions on the test set
y_pred_weighted = rf_model_weighted.predict(X_test)



In [15]:
# Evaluate the model
print(classification_report(y_test, y_pred_weighted))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.72      0.56      0.63      1620

    accuracy                           1.00   1272524
   macro avg       0.86      0.78      0.81   1272524
weighted avg       1.00      1.00      1.00   1272524
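
As a rough illustration of what class_weight='balanced' does: scikit-learn documents the heuristic as n_samples / (n_classes * count_of_class). The sketch below applies that formula to the test-set support counts from the report above. This is an assumption made for illustration only, since the model derives its weights from the training labels, whose exact counts are not shown in this notebook.

```python
# Illustration only: these counts are the test-set supports shown in the
# classification report, not the training counts the model actually uses.
counts = {0: 1_270_904, 1: 1_620}
n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

With an imbalance this strong, the fraud class is weighted several hundred times more heavily than the non-fraud class, yet the report above shows almost no change in precision or recall.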



In [16]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a RandomForest model with the resampled data
rf_model_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model_smote.fit(X_train_resampled, y_train_resampled)

# Make predictions on the original test set
y_pred_smote = rf_model_smote.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred_smote))




              precision    recall  f1-score   support

           0       1.00      0.95      0.98   1270904
           1       0.02      0.74      0.04      1620

    accuracy                           0.95   1272524
   macro avg       0.51      0.85      0.51   1272524
weighted avg       1.00      0.95      0.97   1272524



In [17]:
# Get prediction probabilities
y_pred_probs = rf_model.predict_proba(X_test)[:, 1]

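# Choose the decision threshold: raising it favors precision, lowering it favors recall
threshold = 0.5  # effectively the cutoff predict() already uses, so the metrics barely change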
y_pred_adjusted = (y_pred_probs >= threshold).astype(int)

# Evaluate the adjusted predictions
print(classification_report(y_test, y_pred_adjusted))




              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.73      0.56      0.63      1620

    accuracy                           1.00   1272524
   macro avg       0.86      0.78      0.82   1272524
weighted avg       1.00      1.00      1.00   1272524
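
Because a threshold of 0.5 reproduces the default behavior, a sweep over several thresholds shows the precision/recall trade-off more clearly. The following is a minimal sketch on synthetic data, not on Fraud.csv: the dataset from make_classification, the names X_toy, y_toy and clf, and the chosen thresholds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (about 5% positives)
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=6, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# Evaluate precision and recall for class 1 at several thresholds
results = []
for t in (0.2, 0.35, 0.5, 0.65):
    preds = (probs >= t).astype(int)
    precision = precision_score(y_te, preds, zero_division=0)
    recall = recall_score(y_te, preds)
    results.append((t, precision, recall))
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold can only keep or add predicted positives, so recall never decreases as the threshold drops; precision usually moves the other way.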



1. Data cleaning including missing values, outliers and multi-collinearity.  

Missing Values: I checked for missing values using 'df.isnull().sum()', which showed zero missing entries for all columns.

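Outliers: I computed the interquartile range (IQR) of 'amount' and clipped values to the 1.5 * IQR fences:
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['amount'] = df['amount'].clip(lower_bound, upper_bound)
The resulting values were:
Q1 = 13389.57
Q3 = 208721.4775
IQR = 195331.9075
lower_bound = -279608.29125
upper_bound = 501719.33875
Since transaction amounts cannot be negative, the negative lower bound has no effect; only amounts above the upper bound are clipped.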

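Multi-Collinearity:
Multicollinearity occurs when two or more features are highly correlated, which makes the estimated effect of each individual feature unreliable.
Since I'm using a Random Forest model, multicollinearity is generally not a major issue for predictive performance, because tree-based methods handle correlated features better than linear models. It can, however, spread feature importance across the correlated features.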
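
A correlation matrix is a quick way to check for highly correlated numeric features. The sketch below uses a small synthetic table, not Fraud.csv; the column names mirror the dataset, but the values, the relationship newbalanceDest = oldbalanceDest + amount, and the 0.9 cutoff are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-in: the new destination balance is the old one plus the amount
amount = rng.exponential(scale=10_000, size=n)
old_dest = rng.exponential(scale=1_000_000, size=n)
toy = pd.DataFrame({
    "amount": amount,
    "oldbalanceDest": old_dest,
    "newbalanceDest": old_dest + amount,
})

corr = toy.corr()  # pairwise Pearson correlation
corr_cutoff = 0.9  # hypothetical cutoff for "highly correlated"

# Collect each pair of distinct columns whose |correlation| exceeds the cutoff
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > corr_cutoff
]
print(high_pairs)
```

On the real data, the same check could be run on the numeric columns of df to see whether the old and new balance columns carry largely the same information.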
    


In [19]:
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['amount'] < Q1 - 1.5 * IQR) | (df['amount'] > Q3 + 1.5 * IQR)]


In [20]:
outliers

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
85,1,TRANSFER,1505626.01,C926859124,0.00,0.0,C665576141,29031.00,5515763.34,0,0
86,1,TRANSFER,554026.99,C1603696865,0.00,0.0,C766572210,579285.56,0.00,0,0
88,1,TRANSFER,761507.39,C412788346,0.00,0.0,C1590550415,1280036.23,19169204.93,0,0
89,1,TRANSFER,1429051.47,C1520267010,0.00,0.0,C1590550415,2041543.62,19169204.93,0,0
93,1,TRANSFER,583848.46,C1839168128,0.00,0.0,C1286084959,667778.00,2107778.11,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362613,743,CASH_OUT,1258818.82,C1436118706,1258818.82,0.0,C1240760502,503464.50,1762283.33,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.0,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.0,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.0,C2080388513,0.00,0.00,1,0


2. Describe your fraud detection model in elaboration.

I used a Random Forest Classifier for my fraud detection model. Random forests work by building multiple decision trees and averaging their predictions, which helps to reduce overfitting and improve generalization.

Data Preparation: I dropped unnecessary columns such as nameOrig and nameDest, and performed one-hot encoding for the categorical variable (type).
Model Training: I split the dataset into training and test sets using train_test_split and trained a Random Forest model with 100 decision trees (n_estimators=100).
Model Evaluation: After training, the model was evaluated using classification metrics such as precision, recall, f1-score, and accuracy.


3. How did you select variables to be included in the model?  

I selected features that logically bear on fraud detection.
The features I selected are Type, Amount, New Balance of the Destination Account (newbalanceDest), and Flagged Transactions (isFlaggedFraud).
Fraud is more likely to occur in TRANSFER and CASH_OUT transactions, and already flagged transactions have a higher likelihood of fraud.
I excluded step, nameOrig, nameDest, oldbalanceOrg, newbalanceOrig, and oldbalanceDest as they are less relevant.



4. Demonstrate the performance of the model by using best set of tools.  

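The baseline model had a precision of 0.73 for fraud detection (Class 1).
The recall for fraud detection was 0.56, meaning the model captured 56% of actual fraud cases.
The overall accuracy was very high at 1.00 (rounded), though this is driven by the high proportion of non-fraudulent transactions: only 1,620 of the 1,272,524 test transactions are fraudulent.
Using class_weight='balanced' left the results essentially unchanged (precision 0.72, recall 0.56).
I used SMOTE, which raised fraud recall to 0.74 but dropped precision to 0.02, so the model flagged far too many legitimate transactions.
I then tried threshold tuning; at a threshold of 0.5 the results matched the baseline (precision 0.73, recall 0.56), so a lower threshold would be needed to increase recall.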
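
To see why accuracy says little here, consider a trivial model that labels every transaction as non-fraud. The arithmetic below uses only the support counts from the classification reports above; the variable names are illustrative.

```python
# Support counts taken from the classification reports (test set)
n_legit = 1_270_904
n_fraud = 1_620
n_total = n_legit + n_fraud

# A model that always predicts "not fraud" is right on every legitimate row
baseline_accuracy = n_legit / n_total
fraud_rate = n_fraud / n_total

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Such a model would also round to 1.00 accuracy while catching no fraud at all, which is why precision and recall for Class 1 are the more informative metrics.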


5. What are the key factors that predict fraudulent customer?  

Transaction Type: Fraud is more commonly associated with TRANSFER and CASH_OUT transactions, as opposed to PAYMENT or DEBIT.
Transaction Amount: Unusually high amounts could indicate fraudulent transactions.
Flagged Transactions: If a transaction is already flagged as suspicious, it is more likely to be fraudulent.
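
One way to check factors like these against a trained model is the feature_importances_ attribute of a fitted RandomForestClassifier. The sketch below is an illustration on synthetic data, not a result from Fraud.csv: the toy table, the labeling rule (large TRANSFER amounts are fraud), and all values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2_000

# Synthetic stand-in with column names similar to the encoded features
toy = pd.DataFrame({
    "amount": rng.exponential(scale=100_000, size=n),
    "newbalanceDest": rng.exponential(scale=500_000, size=n),  # pure noise here
    "type_TRANSFER": rng.integers(0, 2, size=n),
})

# Hypothetical labeling rule: large transfers are fraud
toy_y = ((toy["amount"] > 200_000) & (toy["type_TRANSFER"] == 1)).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(toy, toy_y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(
    model.feature_importances_, index=toy.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, the same two lines applied to rf_model and X_encoded.columns would show how much weight the model actually places on each feature.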


6. Do these factors make sense? If yes, How? If not, How not?  

Yes, these factors make sense. In real-world scenarios, TRANSFER and CASH_OUT transactions are often used in fraud schemes like money laundering or unauthorized withdrawals.
High transaction amounts are often associated with fraud because fraudsters tend to transfer large sums in a short time before detection.

7. What kind of prevention should be adopted while the company updates its infrastructure?

Real-Time Monitoring: Implement real-time fraud detection systems that flag suspicious transactions immediately.
Multi-Factor Authentication (MFA): Require multiple layers of authentication for high-value transactions, such as biometric verification or one-time passcodes.
Machine Learning Systems: Continuously update machine learning models with the latest fraud patterns to detect new types of fraud.
Data Encryption: Ensure that sensitive customer information is encrypted both at rest and in transit to prevent breaches.
Anomaly Detection: Implement additional anomaly detection systems to capture unusual patterns in transaction amounts, frequencies, or destinations.

8. Assuming these actions have been implemented, how would you determine if they work? 

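Measure Reduction in Fraudulent Transactions: Compare the fraud rate and fraud losses before and after the changes over comparable periods.
Monitor the Rates of False Positives and False Negatives: A prevention measure works only if missed fraud falls without a sharp rise in legitimate transactions being blocked.
A/B Testing: Roll out the new controls to a subset of transactions or customers and compare the fraud rate against a control group.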
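
For the A/B comparison, a two-proportion z-test is one simple way to judge whether a difference in fraud rates is larger than chance. This is a minimal sketch using only the standard library; the function name two_proportion_z_test and the counts in the example are hypothetical, not measurements.

```python
import math


def two_proportion_z_test(fraud_a, n_a, fraud_b, n_b):
    """Two-sided z-test for a difference between two fraud rates."""
    rate_a = fraud_a / n_a
    rate_b = fraud_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate
    pooled = (fraud_a + fraud_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical counts: control group (old system) vs. treatment group (new controls)
z_stat, p_value = two_proportion_z_test(130, 100_000, 90, 100_000)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the drop in fraud rate is unlikely to be noise, provided the two groups were assigned randomly and observed over the same period.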
