# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from the cardholder's home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the previous transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, precision_score, recall_score, classification_report, confusion_matrix, f1_score, mean_absolute_error, mean_squared_error, root_mean_squared_error, make_scorer
from sklearn.utils import resample
import scipy.stats as st
import time


In [None]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the appropriate metric.
- **4.** Apply **oversampling** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now apply **undersampling** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, apply **SMOTE** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
fraud.isnull().sum()

In [None]:
fraud.info()

In [None]:
fraud.shape

1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

In [None]:
frauds = fraud["fraud"].value_counts()
frauds.plot(kind="bar");
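
The bar chart shows the imbalance visually; computing the class shares makes it explicit. The sketch below is dependency-free and uses a small hypothetical label list (`labels`), not the real dataset, and the helper name `class_shares` is illustrative only.

```python
from collections import Counter

def class_shares(labels):
    """Return ({class: count}, {class: fraction of rows}) for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    shares = {cls: n / total for cls, n in counts.items()}
    return counts, shares

# Hypothetical labels for illustration only: 9 legit (0) and 1 fraud (1).
labels = [0] * 9 + [1]
counts, shares = class_shares(labels)

# Majority-to-minority ratio, computed from the raw counts.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(shares)
print(imbalance_ratio)
```

On the notebook's DataFrame, `fraud["fraud"].value_counts(normalize=True)` gives the same shares directly.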

2. Train a LogisticRegression.

In [None]:
target = fraud["fraud"]
features = fraud.drop(columns = ["fraud"])

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)

# Fit the scaler on the training split only, so no test statistics leak into training
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled_np = scaler.transform(X_train)
X_test_scaled_np = scaler.transform(X_test)

# Keep column names and row indices after scaling
X_train_scaled_df = pd.DataFrame(X_train_scaled_np, columns=X_train.columns, index=X_train.index)
X_test_scaled_df  = pd.DataFrame(X_test_scaled_np, columns=X_test.columns, index=X_test.index)

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled_df, y_train)
# score() returns accuracy, which is dominated by the majority (legit) class
log_reg.score(X_test_scaled_df, y_test)
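
`log_reg.score` reports accuracy, which can look strong on imbalanced data even when a model is useless for the minority class. As a rough sketch of why, using hypothetical labels rather than the real data, a baseline that never flags fraud still gets high accuracy while its fraud recall is zero:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical test labels: 95 legit (0) and 5 fraud (1).
y_true = [0] * 95 + [1] * 5
# A trivial baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(baseline_accuracy, baseline_recall)
```

This is why the evaluation step looks at per-class precision and recall instead of accuracy alone.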

3. Evaluate your model. Take class importance into consideration and evaluate the model by selecting the appropriate metric.

In [None]:
y_pred_test_log = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = y_pred_test_log, y_true = y_test))

- Strengths: High precision for fraud cases (0.89), meaning flagged transactions are likely actual fraud.
- Weaknesses: Recall for fraud (0.60) is too low, meaning 40% of fraudulent transactions go undetected.
- Overall: Unacceptable for fraud detection because many fraud cases are missed.
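
To make the link between these metrics and missed fraud explicit, here is a minimal sketch of the arithmetic. The counts are hypothetical, chosen only so that they roughly reproduce the precision and recall quoted above; they are not the notebook's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)   # of all flagged transactions, how many are fraud
    recall = tp / (tp + fn)      # of all fraud cases, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts (assumption, for illustration only).
tp, fp, fn = 600, 74, 400

p, r, f1 = precision_recall_f1(tp, fp, fn)
missed_share = fn / (tp + fn)    # equals 1 - recall

print(round(p, 2), round(r, 2), round(f1, 2), missed_share)
```

The share of fraud that goes undetected is always `1 - recall`, which is why recall is the metric to watch when missing a fraud case is the costly error.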

4. Apply oversampling to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
# Rebuild a single training frame (scaled features + target) so rows can be resampled by class
train = X_train_scaled_df.copy()
train["fraud"] = y_train.values

In [None]:
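# Note: this rebinds `fraud`, which previously held the full dataset, to the fraud-only training rows
fraud = train[train["fraud"] == 1]
not_fraud = train[train["fraud"] == 0]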

In [None]:
fraud_oversampled = resample(fraud, replace=True, n_samples = len(not_fraud),random_state=0)

In [None]:
train_over = pd.concat([fraud_oversampled, not_fraud])
train_over

In [None]:
fraud_plot = train_over["fraud"].value_counts()
fraud_plot.plot(kind="bar")
plt.show()

In [None]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)
pred = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = pred, y_true = y_test))

- Strengths: Recall for fraud (0.95) is excellent; most fraud cases are detected.
- Weaknesses: Precision drops significantly (0.57), leading to many false alarms.
- Overall: A better approach for fraud detection as it minimizes missed fraud cases, though the trade-off is more manual reviews.
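
Random oversampling, which the `resample(..., replace=True)` call performs, draws minority rows with replacement until both classes have the same size. A dependency-free sketch on hypothetical toy rows (the helper name `random_oversample` is illustrative, not a library function):

```python
import random

def random_oversample(minority, majority, seed=0):
    """Draw minority rows with replacement until they match the majority count."""
    rng = random.Random(seed)
    return rng.choices(minority, k=len(majority))

# Hypothetical toy rows as (feature_value, label) pairs.
majority = [(float(i), 0) for i in range(8)]
minority = [(100.0, 1), (101.0, 1)]

minority_over = random_oversample(minority, majority)
train_over_toy = majority + minority_over
print(len(minority_over), len(majority))
```

Every added row is an exact copy of an existing fraud row, so the classes are balanced without adding new information. As in the notebook, only the training split is resampled; the test set keeps its original class distribution.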

5. Now apply undersampling to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
train

In [None]:
not_fraud_undersampled = resample(not_fraud, replace=False, n_samples = len(fraud), random_state=0)
not_fraud_undersampled.head()

In [None]:
train_under = pd.concat([not_fraud_undersampled, fraud])
train_under.head()

In [None]:
fraud_plot = train_under["fraud"].value_counts()
fraud_plot.plot(kind="bar")
plt.show()

In [None]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)
pred = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = pred, y_true = y_test))

- Strengths: Same recall improvement (0.95) as oversampling, detecting most fraud cases.
- Weaknesses: Precision remains low (0.57), and majority-class data is discarded, potentially impacting model robustness.
- Overall: Similar to oversampling but less desirable due to the potential loss of critical majority-class information.
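
The data loss mentioned above is easy to quantify. As a sketch on hypothetical toy rows (the helper name `random_undersample` is illustrative), random undersampling keeps a subset of majority rows drawn without replacement and discards the rest:

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows, without replacement, matching the minority count."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority))

# Hypothetical toy rows as (feature_value, label) pairs.
majority = [(float(i), 0) for i in range(8)]
minority = [(100.0, 1), (101.0, 1)]

majority_under = random_undersample(majority, minority)
train_under_toy = majority_under + minority

# Number of majority rows the model never sees during training.
discarded = len(majority) - len(majority_under)
print(len(train_under_toy), discarded)
```

The more severe the imbalance, the larger the share of legitimate transactions that is thrown away.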

6. Finally, apply SMOTE to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
sm = SMOTE(random_state=1, sampling_strategy=1.0)
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled_df, y_train)

In [None]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)
pred = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = pred, y_true = y_test))

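For credit card fraud detection, we should prioritize SMOTE because:

- It achieves high recall (0.95) for fraud cases, so about 95% of the fraudulent transactions in the test set are detected.
- It avoids the data loss associated with undersampling. Because it synthesizes new minority samples by interpolating between neighbors instead of duplicating existing rows, it may also be less prone to overfitting on repeated samples than random oversampling.
- Although precision (0.57) is low, the trade-off is acceptable given the importance of recall in this context.

The reported recall and precision match those obtained with oversampling and undersampling, so this preference rests on the qualitative properties above, not on a measured gap between the methods.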
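
What distinguishes SMOTE from random oversampling is that new minority points are interpolated, not copied. The following is a simplified, dependency-free sketch of that idea on hypothetical 2-D points; it is not the `imblearn` implementation, and the helper name `smote_like` and the choice `k=2` are assumptions made for the toy example.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical minority-class points (two scaled features each).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Each synthetic point lies between two existing fraud points, so the minority region is filled in with new values where random oversampling would only repeat existing rows.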