# **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
from imblearn.over_sampling import SMOTE

pandas & numpy: Used for data manipulation and numerical operations.

train_test_split: Helps split data into training and testing sets.

LabelEncoder & StandardScaler: Perform categorical encoding and feature scaling.

accuracy_score, classification_report, confusion_matrix: Metrics to evaluate model performance.

xgboost: Implements the XGBoost classifier, known for its efficiency and performance.

SMOTE: Addresses class imbalance by synthetically oversampling the minority class.

# **Data Loading**

In [None]:
df = pd.read_csv('/content/Fraud.csv')

# **Data Exploration**

In [None]:
df.shape

(6362620, 11)

In [None]:
df.dtypes

step,int64
type,object
amount,float64
nameOrig,object
oldbalanceOrg,float64
newbalanceOrig,float64
nameDest,object
oldbalanceDest,float64
newbalanceDest,float64
isFraud,int64


In [None]:
df.isnull().sum()

step,0
type,0
amount,0
nameOrig,0
oldbalanceOrg,0
newbalanceOrig,0
nameDest,0
oldbalanceDest,0
newbalanceDest,0
isFraud,0


In [None]:
df = df.dropna()

Drop missing values: Removes any rows with missing data, which could otherwise cause errors or bias during model training. The check above found no missing values in any column, so this step leaves the dataset unchanged here.

In [None]:
df.isnull().sum()

step,0
type,0
amount,0
nameOrig,0
oldbalanceOrg,0
newbalanceOrig,0
nameDest,0
oldbalanceDest,0
newbalanceDest,0
isFraud,0


In [None]:
df.shape

(6362620, 11)

In [None]:
df.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |


# **Feature Engineering**

In [None]:
df['hour'] = df['step'] % 24
df['day'] = df['step'] // 24
df['balance_diff_orig'] = df['newbalanceOrig'] - df['oldbalanceOrg']
df['balance_diff_dest'] = df['newbalanceDest'] - df['oldbalanceDest']

Hour & Day Calculation: Splits the 'step' variable into hour-of-day and day components, on the assumption that each step represents one hour. These features can capture daily cycles or longer temporal trends in transactions.

Balance Differences: Creates new features that represent the change in balance for both the origin and destination accounts, offering additional insight into transaction behavior.
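The derived columns can be checked on a handful of rows. The sketch below uses a made-up toy table (the values are illustrative, not taken from Fraud.csv) and assumes one step equals one hour, as the modulo-24 arithmetic implies.

```python
import pandas as pd

# Toy stand-in for the transaction table; all values are invented for illustration.
toy = pd.DataFrame({
    "step": [1, 23, 24, 49, 743],
    "oldbalanceOrg": [1000.0, 500.0, 0.0, 250.0, 9000.0],
    "newbalanceOrig": [800.0, 0.0, 0.0, 250.0, 0.0],
})

# Same arithmetic as the notebook cell above.
toy["hour"] = toy["step"] % 24    # position within a 24-step cycle
toy["day"] = toy["step"] // 24    # number of completed 24-step cycles
toy["balance_diff_orig"] = toy["newbalanceOrig"] - toy["oldbalanceOrg"]

print(toy)
```

Step 24 maps to hour 0 of day 1, and a fully drained account shows up as a negative balance_diff_orig equal to the old balance.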

### **Encoding**

In [None]:
le = LabelEncoder()
df['type'] = le.fit_transform(df['type'])

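Label Encoding: Converts the categorical 'type' variable into integer codes, since the model as configured here expects numerical input. The codes follow the sorted order of the labels, so the ordering they imply is arbitrary; tree-based models usually tolerate this, while one-hot encoding is a common alternative for models that do not.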
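To see which integer each category receives, the fitted encoder's classes_ attribute can be inspected. This is a small sketch on a hypothetical list of transaction types; the real 'type' column may contain other values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of transaction types, for illustration only.
types = ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "DEBIT"]

le = LabelEncoder()
codes = le.fit_transform(types)

# classes_ holds the labels in sorted order; each label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes)

# inverse_transform recovers the original labels from the codes.
print(le.inverse_transform(codes))
```

Keeping the fitted encoder around makes it possible to translate codes back to labels when interpreting results.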

# **Feature Selection**

In [None]:
features = ['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'hour', 'day', 'balance_diff_orig', 'balance_diff_dest']
X = df[features]
y = df['isFraud']

Feature Selection: Identifies which columns (or engineered features) will be used as predictors.

Target Variable: Assigns 'isFraud' as the output variable that the model will learn to predict.

# **Splitting Data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

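Train-Test Split: Divides the data into training (80%) and testing (20%) sets to evaluate the model’s performance on unseen data. The split is not stratified, so with fraud at roughly 0.13% of rows the fraud share can differ slightly between the two sets; passing stratify=y would keep it equal.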

Random State: Ensures reproducibility of results by fixing the random seed.
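As a possible variation, the split can be stratified and a separate validation set carved out of the training portion. The sketch below runs on synthetic data with a 1% positive rate; the array sizes and the 60/20/20 proportions are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (1% positives), for illustration only.
rng = np.random.default_rng(42)
n = 10_000
X_toy = rng.normal(size=(n, 3))
y_toy = np.zeros(n, dtype=int)
y_toy[:100] = 1

# First hold out 20% as the final test set, preserving the class ratio.
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Then take 25% of the remainder (20% of the total) as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

for name, labels in [("train", y_tr), ("validation", y_val), ("test", y_te)]:
    print(f"{name}: {len(labels)} rows, positive rate {labels.mean():.4f}")
```

With stratification each subset keeps approximately the same positive rate, which matters when positives are this rare.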

# **Feature Scaling**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

StandardScaler: Standardizes features by removing the mean and scaling to unit variance. Tree-based models such as XGBoost are largely insensitive to feature scale, so the scaling matters here mainly for SMOTE, which relies on distances between samples to find nearest neighbors.

Fitting and Transforming: The scaler is fitted on the training set only and then applied to both the training and test sets. This keeps the scaling consistent and prevents test-set statistics from leaking into training.
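The fit-on-train, transform-both pattern can be verified on a tiny example. The arrays below are made up; the point is that the test row is scaled with the training mean and standard deviation rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up arrays: two features on very different scales.
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from training rows
X_te_scaled = scaler.transform(X_te)      # reuses those statistics unchanged

print("training means:", scaler.mean_)
print("scaled training column means:", X_tr_scaled.mean(axis=0))
print("scaled test row:", X_te_scaled)
```

After scaling, both features of the test row land on the same value, because each sits the same number of training standard deviations above the training mean.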

# **Applying SMOTE for handling class imbalance**

In [None]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

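SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic examples of the minority class (fraud cases) by interpolating between existing minority samples and their nearest neighbors. It is applied to the training data only, so the test set keeps its original class distribution. Balancing the training set typically raises recall on the minority class, often at some cost in precision.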
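The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imblearn implementation: the function name smote_like_samples and its parameters are hypothetical, and details such as neighbor search and sampling strategy differ in the real library.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances among the minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # pick a minority point
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        lam = rng.random()                             # position along the segment
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Four made-up minority points at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_min, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, which is why feature scaling before this step affects which neighbors are chosen.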

# **Training**

### **Initializing XGBoost classifier**

In [None]:
xgb_model = xgb.XGBClassifier(
    max_depth=6,
    learning_rate=0.05,
    n_estimators=1000,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1,
    random_state=42,
    n_jobs=-1,
    early_stopping_rounds=50
)


Hyperparameters:

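max_depth: Limits the maximum depth of each tree; shallower trees are less likely to overfit.

learning_rate: Shrinks the contribution of each new tree; smaller values make boosting more gradual and usually require more rounds.

n_estimators: Sets the maximum number of boosting rounds (trees).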

subsample & colsample_bytree: Used to introduce randomness by sampling a fraction of observations and features, helping to reduce overfitting.

objective: Specifies a binary classification task.

scale_pos_weight: Helps to address imbalance; here it’s set to 1 because SMOTE has already balanced the classes.

n_jobs: Utilizes all available CPU cores for parallel processing.

early_stopping_rounds: Stops training if the evaluation metric on the last entry in eval_set has not improved for 50 consecutive rounds, which limits overfitting and reduces training time.
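If SMOTE were skipped, scale_pos_weight would be the usual lever for imbalance; a commonly suggested starting value is the ratio of negative to positive training samples. The sketch below computes that ratio on made-up labels (y_toy is a hypothetical stand-in for the training target).

```python
import numpy as np

# Made-up labels: 990 negatives and 10 positives.
y_toy = np.array([0] * 990 + [1] * 10)

n_neg = int((y_toy == 0).sum())
n_pos = int((y_toy == 1).sum())

# Ratio of negatives to positives, a common starting point for scale_pos_weight.
suggested_weight = n_neg / n_pos
print(suggested_weight)
```

Combining a large scale_pos_weight with SMOTE would correct for the imbalance twice, which is why the notebook leaves it at 1.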

### **Model Fit**

In [None]:
xgb_model.fit(
    X_train_resampled,
    y_train_resampled,
    eval_set=[(X_train_resampled, y_train_resampled), (X_test_scaled, y_test)],
    verbose=100
)

[0]	validation_0-logloss:0.64945	validation_1-logloss:0.64948
[100]	validation_0-logloss:0.02330	validation_1-logloss:0.02499
[200]	validation_0-logloss:0.01259	validation_1-logloss:0.01364
[300]	validation_0-logloss:0.00796	validation_1-logloss:0.00886
[400]	validation_0-logloss:0.00588	validation_1-logloss:0.00667
[500]	validation_0-logloss:0.00457	validation_1-logloss:0.00532
[600]	validation_0-logloss:0.00376	validation_1-logloss:0.00447
[700]	validation_0-logloss:0.00318	validation_1-logloss:0.00386
[800]	validation_0-logloss:0.00278	validation_1-logloss:0.00344
[900]	validation_0-logloss:0.00246	validation_1-logloss:0.00311
[999]	validation_0-logloss:0.00221	validation_1-logloss:0.00286


Model Fitting: Trains the XGBoost classifier on the resampled training data.

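Evaluation Set: Reports log loss on both the training and test sets at each iteration, showing how performance evolves during training. Early stopping monitors the last set in the list, which here is the test set; because the test set then influences when training stops, the final test metrics can be slightly optimistic, and a separate validation set is the safer choice. In this run the test log loss was still decreasing at every logged checkpoint through round 999, so early stopping never triggered.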

Verbose: Displays progress and performance metrics every 100 iterations, useful for monitoring training progress.
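The patience rule behind early stopping can be illustrated without any library. The function below is a simplified sketch with a hypothetical name and made-up loss values; XGBoost's own implementation may differ in details such as tie handling.

```python
def best_round_with_patience(losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses."""
    best_idx = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_idx]:
            best_idx = i            # new best score, reset the wait
        elif i - best_idx >= patience:
            return best_idx, i      # no improvement for `patience` rounds
    return best_idx, len(losses) - 1  # ran to the end without stopping early

# Made-up validation losses that bottom out at index 3.
val_losses = [0.65, 0.40, 0.30, 0.28, 0.29, 0.31, 0.30, 0.33]
print(best_round_with_patience(val_losses, patience=3))

# A curve that keeps improving never triggers the rule.
print(best_round_with_patience([5.0, 4.0, 3.0, 2.0, 1.0], patience=2))
```

The second call mirrors the log above: when the monitored loss keeps falling, training runs for the full number of rounds.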

# **Testing**

### **Making Predictions**

In [None]:
y_pred = xgb_model.predict(X_test_scaled)

# **Evaluation**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9989

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.54      0.98      0.70      1620

    accuracy                           1.00   1272524
   macro avg       0.77      0.99      0.85   1272524
weighted avg       1.00      1.00      1.00   1272524


Confusion Matrix:
[[1269555    1349]
 [     34    1586]]
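The figures in the classification report can be re-derived from the confusion matrix, which also shows why accuracy is a weak yardstick here. The sketch below hard-codes the matrix printed above and compares the model's accuracy with that of a trivial baseline that labels every transaction as legitimate.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true classes, columns are predictions.
cm = np.array([[1269555, 1349],
               [34, 1586]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that are flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Accuracy of always predicting "not fraud".
baseline_accuracy = cm[0].sum() / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} baseline_accuracy={baseline_accuracy:.4f}")
```

The baseline scores 0.9987 while catching no fraud at all, so precision and recall on class 1 are the numbers to watch.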


# **Feature Importance**

In [None]:
feature_importance = xgb_model.feature_importances_
# Pair each feature name with its score instead of indexing by position.
for name, importance in zip(features, feature_importance):
    print(f"{name}: {importance}")

type: 0.06987999379634857
amount: 0.044050171971321106
oldbalanceOrg: 0.0379924550652504
newbalanceOrig: 0.41221657395362854
oldbalanceDest: 0.004255259409546852
newbalanceDest: 0.013701528310775757
hour: 0.017167041078209877
day: 0.009573615156114101
balance_diff_orig: 0.35802698135375977
balance_diff_dest: 0.03313641622662544


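Feature Importance: Retrieves the importance score for each feature. The scores are normalized to sum to 1 and, under XGBoost's default setting for tree boosters, are understood to reflect the average gain from splits on each feature.

Iterative Print: Loops over the features and prints each score. Here newbalanceOrig (0.41) and balance_diff_orig (0.36) together account for about 77% of the total importance, so the origin account's balance after the transaction dominates the model's fraud signal.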
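Sorting the scores makes the ranking easier to read. The sketch below copies the values printed above into a pandas Series rather than reading them from the fitted model, so it runs on its own.

```python
import pandas as pd

# Importance scores copied from the output above.
importance = pd.Series({
    "type": 0.06987999379634857,
    "amount": 0.044050171971321106,
    "oldbalanceOrg": 0.0379924550652504,
    "newbalanceOrig": 0.41221657395362854,
    "oldbalanceDest": 0.004255259409546852,
    "newbalanceDest": 0.013701528310775757,
    "hour": 0.017167041078209877,
    "day": 0.009573615156114101,
    "balance_diff_orig": 0.35802698135375977,
    "balance_diff_dest": 0.03313641622662544,
})

# Rank features from most to least important.
ranked = importance.sort_values(ascending=False)
print(ranked.round(4))

top_two_share = ranked.iloc[:2].sum()
print(f"Top two features: {top_two_share:.2%} of total importance")
```

In the notebook itself, the same Series could be built with pd.Series(xgb_model.feature_importances_, index=features).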

**The model reaches 99.89% accuracy, but with fraud making up only about 0.13% of test transactions, accuracy alone says little. The more meaningful result is the 98% recall on the fraud class: 1,586 of 1,620 fraudulent transactions are caught, which is critical in this field. Precision on that class is 54%, meaning 1,349 legitimate transactions are also flagged, so nearly half of all alerts are false alarms. This is a solid foundation for a fraud detection system built on SMOTE and XGBoost, with precision as the main area for further tuning.**