# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** The distance from home to where the transaction happened.
- **distance_from_last_transaction:** The distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** The ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [3]:
# Define features (X) and target (y)
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split the data first to prevent data leakage during resampling
# Resampling should only be applied to the training set!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
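
Because the split uses `stratify=y`, both partitions should keep roughly the fraud rate of the full dataset. The sketch below is only an arithmetic check: it does not re-run the split, and it reuses the class counts that this notebook prints in later cells (target distribution, training counts, and test-set support).

```python
# Class counts copied from this notebook's printed outputs.
counts = {
    "full":  {"legit": 912597, "fraud": 87403},
    "train": {"legit": 730078, "fraud": 69922},
    "test":  {"legit": 182519, "fraud": 17481},
}

def fraud_rate(c):
    """Share of rows labelled as fraud."""
    return c["fraud"] / (c["fraud"] + c["legit"])

rates = {name: fraud_rate(c) for name, c in counts.items()}
for name, rate in rates.items():
    print(f"{name:>5}: {rate:.4%}")
```

All three rates should agree to within a fraction of a percentage point, which is what a stratified split is meant to guarantee.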

In [4]:
# 1. Distribution of the target variable
target_counts = y.value_counts(normalize=False)
target_ratios = y.value_counts(normalize=True)

print("### Target Distribution (fraud)")
print(target_counts)
print("\n### Target Ratios")
print(target_ratios.round(4))

# Check for imbalance
is_imbalanced = target_ratios[1] < 0.1 # A common heuristic for imbalance is less than 10%
print(f"\nCan we say we're dealing with an imbalanced dataset? {'YES' if is_imbalanced else 'NO'}")

### Target Distribution (fraud)
fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

### Target Ratios
fraud
0.0    0.9126
1.0    0.0874
Name: proportion, dtype: float64

Can we say we're dealing with an imbalanced dataset? YES
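
With a split this uneven, accuracy alone can look good while the model is useless for fraud. As a minimal illustration, the sketch below uses the counts printed above to score a hypothetical "model" that always predicts the majority class.

```python
# Class counts copied from the target distribution printed above.
n_legit = 912_597
n_fraud = 87_403
n_total = n_legit + n_fraud

# A "model" that always predicts the majority class (0 = legit)
# gets every legitimate transaction right and every fraud wrong.
majority_accuracy = n_legit / n_total
majority_fraud_recall = 0 / n_fraud

print(f"Accuracy of always predicting legit: {majority_accuracy:.4f}")
print(f"Fraud recall of always predicting legit: {majority_fraud_recall:.4f}")
```

Any accuracy figure in the following steps is best read against this do-nothing reference, which is why recall on class 1 is tracked separately.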


In [5]:
# 2. Train a Logistic Regression model (Baseline)
model_baseline = LogisticRegression(solver='liblinear', random_state=42)
model_baseline.fit(X_train, y_train)

In [6]:
# 3. Evaluate the model
y_pred_baseline = model_baseline.predict(X_test)

print("### Baseline Model (Imbalanced Data) Evaluation")
print("---")
print(classification_report(y_test, y_pred_baseline))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred_baseline):.4f}")
print(f"Recall (Fraud, Class 1): {recall_score(y_test, y_pred_baseline):.4f}")

### Baseline Model (Imbalanced Data) Evaluation
---
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.60      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000

Overall Accuracy: 0.9592
Recall (Fraud, Class 1): 0.6037
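
`confusion_matrix` is imported but its output is not shown, so the raw error counts have to be inferred. The sketch below approximates them from the accuracy, recall, and support values in the report above. The helper `approx_confusion` is a hypothetical name introduced here, and the counts are approximate because the inputs are rounded to four decimals.

```python
def approx_confusion(accuracy, recall, n_pos, n_neg):
    """Approximate (tn, fp, fn, tp) from reported, rounded metrics."""
    n_total = n_pos + n_neg
    tp = round(recall * n_pos)                # frauds caught
    fn = n_pos - tp                           # frauds missed
    errors = round((1 - accuracy) * n_total)  # all misclassified rows
    fp = errors - fn                          # legit rows flagged as fraud
    tn = n_neg - fp
    return tn, fp, fn, tp

# Values copied from the baseline report above.
tn, fp, fn, tp = approx_confusion(accuracy=0.9592, recall=0.6037,
                                  n_pos=17481, n_neg=182519)
precision = tp / (tp + fp)

print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
print(f"Precision implied by these counts: {precision:.2f}")
```

Read this way, the baseline misses about 6,900 of the 17,481 frauds in the test set, a cost that the 0.96 accuracy hides.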


In [7]:
# 4. Oversample the training data
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

In [8]:
print(f"Original training shape: {Counter(y_train)}")
print(f"Oversampled training shape: {Counter(y_train_ros)}")
print("---")

Original training shape: Counter({0.0: 730078, 1.0: 69922})
Oversampled training shape: Counter({0.0: 730078, 1.0: 730078})
---
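
The minority class grows to match the majority because random oversampling repeats existing fraud rows. The sketch below shows the idea in plain Python on toy data. It is a simplified illustration, not imblearn's implementation, and `random_oversample` and the toy values are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every
    class matches the largest one. Illustrative only."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        # Draw the missing rows with replacement from the existing ones.
        extra = rng.choices(rows, k=target - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Toy data (hypothetical): 8 legitimate rows, 2 fraud rows.
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_ros_toy, y_ros_toy = random_oversample(X_toy, y_toy)
print(Counter(y_ros_toy))
```

No new information is created: every added fraud row is an exact copy of an existing one.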


In [9]:
# Train model with Oversampled data
model_ros = LogisticRegression(solver='liblinear', random_state=42)
model_ros.fit(X_train_ros, y_train_ros)


In [10]:
# Evaluate the model on the original test set
y_pred_ros = model_ros.predict(X_test)

print("### Oversampling Model Evaluation")
print(classification_report(y_test, y_pred_ros))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred_ros):.4f}")
print(f"Recall (Fraud, Class 1): {recall_score(y_test, y_pred_ros):.4f}")

### Oversampling Model Evaluation
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Overall Accuracy: 0.9348
Recall (Fraud, Class 1): 0.9481


In [11]:
# 5. Undersample the training data
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

In [12]:
print(f"Original training shape: {Counter(y_train)}")
print(f"Undersampled training shape: {Counter(y_train_rus)}")
print("---")


Original training shape: Counter({0.0: 730078, 1.0: 69922})
Undersampled training shape: Counter({0.0: 69922, 1.0: 69922})
---
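
Undersampling balances the classes in the opposite direction, by discarding majority rows. The sketch below illustrates this on toy data and then uses the counts printed above to show how much legitimate data is dropped. `random_undersample` is a hypothetical helper, not imblearn's implementation.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to the smallest class.
    Illustrative only."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        kept = rng.sample(rows, k=target)  # sampled without replacement
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data (hypothetical): 8 legitimate rows, 2 fraud rows.
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 8 + [1] * 2
X_rus_toy, y_rus_toy = random_undersample(X_toy, y_toy)
print(Counter(y_rus_toy))

# Share of legitimate training rows discarded in this notebook,
# using the counts printed above.
discarded = 1 - 69922 / 730078
print(f"Legitimate rows discarded: {discarded:.1%}")
```

The trade-off is a much smaller training set, which trains faster but discards most of the legitimate examples.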


In [13]:
# Train model with Undersampled data
model_rus = LogisticRegression(solver='liblinear', random_state=42)
model_rus.fit(X_train_rus, y_train_rus)


In [14]:
# Evaluate the model on the original test set
y_pred_rus = model_rus.predict(X_test)

print("### Undersampling Model Evaluation")
print(classification_report(y_test, y_pred_rus))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred_rus):.4f}")
print(f"Recall (Fraud, Class 1): {recall_score(y_test, y_pred_rus):.4f}")

### Undersampling Model Evaluation
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Overall Accuracy: 0.9346
Recall (Fraud, Class 1): 0.9475


In [15]:
# 6. SMOTE the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

In [16]:
print(f"Original training shape: {Counter(y_train)}")
print(f"SMOTEd training shape: {Counter(y_train_smote)}")
print("---")

Original training shape: Counter({0.0: 730078, 1.0: 69922})
SMOTEd training shape: Counter({0.0: 730078, 1.0: 730078})
---
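
SMOTE reaches the same class counts as random oversampling, but its added rows are synthetic rather than copies. Each new point is interpolated between a minority row and one of its nearest minority neighbours. The sketch below shows a single interpolation step on toy data. `smote_like_sample`, `k=2`, and the toy values are assumptions for illustration (imblearn's `SMOTE` defaults to `k_neighbors=5`), and this is not the library's code.

```python
import math
import random

def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point between a minority row and one of its
    k nearest minority neighbours. Illustrative only."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Nearest neighbours of `base` among the other minority rows.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority rows (hypothetical values, two features each).
minority_toy = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic = smote_like_sample(minority_toy)
print(synthetic)
```

The synthetic point always lies on the line segment between two real fraud rows, so it stays inside the region the minority class already occupies.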


In [17]:
# Train model with SMOTEd data
model_smote = LogisticRegression(solver='liblinear', random_state=42)
model_smote.fit(X_train_smote, y_train_smote)

In [18]:
# Evaluate the model on the original test set
y_pred_smote = model_smote.predict(X_test)

print("### SMOTE Model Evaluation")
print(classification_report(y_test, y_pred_smote))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred_smote):.4f}")
print(f"Recall (Fraud, Class 1): {recall_score(y_test, y_pred_smote):.4f}")

### SMOTE Model Evaluation
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

Overall Accuracy: 0.9351
Recall (Fraud, Class 1): 0.9461


In [19]:
# All three resampling methods substantially improved recall for class 1 (fraud), the critical business metric: from 0.60 at baseline to about 0.95.
# The gain came at a cost: fraud precision fell from 0.90 to 0.58, overall accuracy dropped from 0.96 to about 0.93, and the fraud F1-score stayed at 0.72.
# The change in model behavior is directly due to the balancing: the model is no longer biased toward the majority class (legitimate) and treats both classes as equally important.
# The differences between Oversampling, Undersampling, and SMOTE are small, which suggests that the balancing itself, rather than the specific technique, is the main driver of the change.
# SMOTE had the highest accuracy of the three (0.9351) but the lowest fraud recall (0.9461); random oversampling had the highest fraud recall (0.9481).
# Together, these steps demonstrate key techniques for modeling imbalanced, real-world datasets.
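
Every report above shows a fraud F1-score of 0.72, even though precision and recall change sharply after resampling. As a quick check, the sketch below recomputes F1 from the rounded precision and recall values in the four reports. Small deviations from the true scores are possible because the inputs are rounded to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Fraud-class precision and recall as rounded in the reports above.
reports = {
    "baseline":      (0.90, 0.60),
    "oversampling":  (0.58, 0.95),
    "undersampling": (0.58, 0.95),
    "SMOTE":         (0.58, 0.95),
}

f1_scores = {name: f1(p, r) for name, (p, r) in reports.items()}
for name, score in f1_scores.items():
    print(f"{name:<14} F1 = {score:.2f}")
```

Resampling therefore moved the model along the precision-recall trade-off rather than improving both at once. Whether that is a win depends on how costly a missed fraud is compared with a false alarm.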