# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to the place where the transaction happened.
- **distance_from_last_transaction:** Distance from the place where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **5.** Now, run **Undersample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **6.** Finally, run **SMOTE** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?

In [3]:
fraud.fraud.value_counts(normalize=True)*100

fraud
0.0    91.2597
1.0     8.7403
Name: proportion, dtype: float64

In [4]:
# The dataset is imbalanced: about 91.3% of transactions are legit and only about 8.7% are fraud.
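
To see why this imbalance matters for choosing a metric, consider a "model" that labels every transaction as legit. The sketch below uses made-up labels with the same class split as the output above; the names `y_true`, `y_pred` and the sample size are assumptions for illustration only, not part of the lab data.

```python
import numpy as np

# Hypothetical labels with roughly the class split reported above (8.7403% fraud)
n_samples = 100_000
n_fraud = round(n_samples * 0.087403)
y_true = np.concatenate([np.zeros(n_samples - n_fraud), np.ones(n_fraud)])

# A "model" that always predicts the majority class (legit = 0)
y_pred = np.zeros_like(y_true)

# Accuracy looks good, but no fraud is ever caught
accuracy = (y_true == y_pred).mean()
fraud_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy:     {accuracy:.4f}")
print(f"fraud recall: {fraud_recall:.4f}")
```

Accuracy stays above 91% while recall on the fraud class is zero, which is why the fraud-class recall, precision and F1 are the more informative numbers in the reports that follow.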

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [6]:
# Separate features and target
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

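# Standardization
# Note: the scaler is fit on the full dataset before the split, so test-set
# statistics leak into training. Fitting it on the training split only avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)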

In [7]:
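# Split the data into training and testing sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))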

[[271911   1960]
 [ 10456  15673]]


In [8]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273871
         1.0       0.89      0.60      0.72     26129

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.95    300000
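
The fraud-class numbers in this report can be recomputed by hand from the confusion matrix printed above, which makes it clearer what each metric measures. This is only a worked check; the variable names are chosen here for illustration.

```python
# Confusion matrix from the baseline run above (rows = true class, columns = predicted class)
tn, fp = 271911, 1960
fn, tp = 10456, 15673

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many really were fraud
recall = tp / (tp + fn)     # of the real fraud cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.96 precision=0.89 recall=0.60 f1=0.72
```

Accuracy is 0.96, yet recall on fraud is only 0.60: about 40% of fraudulent transactions (10456 of 26129) go undetected. Since missing a fraud is usually the costlier error, recall on class 1 is the metric to watch here.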



In [9]:
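from sklearn.utils import resample

# Oversampling
# Note: the resampling below happens before the train/test split, so copies of the
# same fraud rows can end up in both sets and the test set is balanced as well.
# The scores are therefore not directly comparable to the imbalanced baseline.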

# Separate features and target variable
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Concatenate our features and target variable for resampling
df_resample = pd.concat([X, y], axis=1)

# Separate majority and minority classes
df_majority = df_resample[df_resample.fraud == 0]
df_minority = df_resample[df_resample.fraud == 1]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=42) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Separate features and target variable from the upsampled dataframe
X_upsampled = df_upsampled.drop('fraud', axis=1)
y_upsampled = df_upsampled['fraud']

# Preprocess the data (e.g., standardization)
scaler = StandardScaler()
X_upsampled_scaled = scaler.fit_transform(X_upsampled)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_upsampled_scaled, y_upsampled, test_size=0.3, random_state=42)

# Train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[255630  18256]
 [ 13873 259800]]
              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    273886
         1.0       0.93      0.95      0.94    273673

    accuracy                           0.94    547559
   macro avg       0.94      0.94      0.94    547559
weighted avg       0.94      0.94      0.94    547559
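
Because the rows were duplicated before the split, the report above is measured on a balanced test set that shares rows with the training set. A common alternative, sketched below, is to split first and oversample only the training part. The sketch runs on synthetic data from `make_classification` (an assumption standing in for the fraud dataset), and names such as `X_demo` and `X_bal` are illustrative, so its numbers say nothing about the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the fraud data: roughly 9% positives (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, weights=[0.91, 0.09], random_state=42
)

# 1. Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training part only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# 3. Oversample the minority class inside the training part only
majority = X_tr_s[y_tr == 0]
minority = X_tr_s[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([np.zeros(len(majority), dtype=int), np.ones(len(minority_up), dtype=int)])

# 4. Train on balanced data, evaluate on the untouched, still imbalanced test part
model = LogisticRegression().fit(X_bal, y_bal)
fraud_recall = recall_score(y_te, model.predict(X_te_s), pos_label=1)
print(f"test fraud share: {y_te.mean():.3f}, fraud recall: {fraud_recall:.3f}")
```

With this ordering the test set keeps the original class ratio, so its precision and recall can be compared directly with the baseline model.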



In [10]:
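from sklearn.utils import resample

# Undersampling
# Note: the majority class is downsampled before the train/test split, so the
# test set is balanced too and its scores describe a 50/50 class mix.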
# Separate features and target variable
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Concatenate our features and target variable for resampling
df_resample = pd.concat([X, y], axis=1)

# Separate majority and minority classes
df_majority = df_resample[df_resample.fraud == 0]
df_minority = df_resample[df_resample.fraud == 1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,    # sample without replacement
                                   n_samples=len(df_minority), # to match minority class
                                   random_state=42) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

# Separate features and target variable from the downsampled dataframe
X_downsampled = df_downsampled.drop('fraud', axis=1)
y_downsampled = df_downsampled['fraud']

# Preprocess the data (e.g., standardization)
scaler = StandardScaler()
X_downsampled_scaled = scaler.fit_transform(X_downsampled)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_downsampled_scaled, y_downsampled, test_size=0.3, random_state=42)

# Train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[24373  1838]
 [ 1278 24953]]
              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94     26211
         1.0       0.93      0.95      0.94     26231

    accuracy                           0.94     52442
   macro avg       0.94      0.94      0.94     52442
weighted avg       0.94      0.94      0.94     52442
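
The precision of 0.93 above is measured on a 50/50 test set. A rough way to estimate what it would look like at the real fraud share is to take the true-positive and false-positive rates from the confusion matrix above and reweight them with the original prevalence (8.7403% from `value_counts`). This is a back-of-the-envelope sketch that assumes both rates carry over unchanged to the imbalanced data.

```python
# Confusion matrix from the undersampled run above (rows = true class, columns = predicted class)
tn, fp = 24373, 1838
fn, tp = 1278, 24953

tpr = tp / (tp + fn)  # recall on fraud
fpr = fp / (fp + tn)  # share of legit transactions wrongly flagged as fraud

# Precision as measured on the balanced test set
precision_balanced = tp / (tp + fp)

# Estimated precision at the original fraud share (assumes tpr and fpr stay the same)
prevalence = 0.087403
precision_original = (tpr * prevalence) / (tpr * prevalence + fpr * (1 - prevalence))

print(f"recall (tpr):                 {tpr:.3f}")
print(f"precision on balanced test:   {precision_balanced:.3f}")
print(f"precision at real prevalence: {precision_original:.3f}")
```

Under that assumption, recall rises from 0.60 to about 0.95 compared with the baseline, but precision at the real class ratio drops well below both the 0.93 shown above and the baseline's 0.89. Balancing trades more false alarms for fewer missed frauds.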



In [11]:
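# Running SMOTE
# Note: SMOTE is applied to the full dataset before the split, so synthetic fraud
# points also end up in the test set and the scores below describe balanced data.
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_scaled, y)

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)

# Train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))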

[[255571  18376]
 [ 13963 259649]]
              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    273947
         1.0       0.93      0.95      0.94    273612

    accuracy                           0.94    547559
   macro avg       0.94      0.94      0.94    547559
weighted avg       0.94      0.94      0.94    547559
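
Unlike plain oversampling, SMOTE does not copy existing rows: it creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not the `imblearn` implementation; `smote_sketch` and the toy data are hypothetical names introduced here for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between all minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # k nearest minority neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)              # random starting points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one random neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment

    # Each synthetic point lies on the segment between a sample and its neighbour
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: 20 made-up points in 2 dimensions
X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_synthetic = smote_sketch(X_minority, n_new=100)
print(X_synthetic.shape)  # (100, 2)
```

Every synthetic point lies between two real minority points, which explains why the SMOTE results above are so close to those from random oversampling: for a linear model on this data, interpolated points add little beyond what duplicated points already provide.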

