# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home at which the transaction happened.
- **distance_from_last_transaction:** Distance from the location of the last transaction.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0 = legit**, **1 = fraud**


In [None]:
# The imblearn module is distributed on PyPI as imbalanced-learn.
%pip install imbalanced-learn

In [18]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [3]:
# 1) What is the distribution of our target variable?
# Can we say we're dealing with an imbalanced dataset?
# Check for missing values.
fraud.isnull().sum()

# There are no null values in any of the 8 columns.


distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64
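
The null check above does not by itself show how the target classes are distributed. A minimal sketch of how that could be inspected with `value_counts` follows. The `toy` frame and its 9:1 split are made-up stand-ins, not the lab's data; with the real dataset the same calls would be applied to the chosen target column.

```python
import pandas as pd

# Hypothetical toy labels standing in for the lab's target column.
toy = pd.DataFrame({"fraud": [0.0] * 9 + [1.0] * 1})

# Absolute counts and relative frequencies per class.
counts = toy["fraud"].value_counts()
shares = toy["fraud"].value_counts(normalize=True)

# Ratio of majority to minority class size; values far above 1 suggest imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 would point to a roughly balanced target, while a large ratio is a sign that accuracy alone may be misleading.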

In [4]:
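# 2) Train a logistic regression.
# Note: this cell uses online_order as the target (the outputs below reflect that),
# whereas the lab's stated target variable is fraud.
features = fraud.drop(columns = ["online_order"])
target = fraud["online_order"]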

X_train, X_test, y_train, y_test = train_test_split(features, target)
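
When classes are imbalanced, a plain random split can shift the class proportions between train and test. The sketch below shows one common option, a stratified and reproducible split. The toy data, the 10% positive rate, and the names `toy_X`, `toy_y`, `X_tr`, `X_te` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 10% positive class.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
toy_y = pd.Series([1] * 100 + [0] * 900, name="label")

# stratify keeps the class proportions (approximately) equal in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Share of the positive class in each split.
print(y_tr.mean(), y_te.mean())
```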

In [5]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

In [7]:
log_reg.score(X_test_scaled, y_test)

0.66928

In [8]:
# 3) Evaluate your model. Take class importance into account and evaluate it by selecting the correct metric.
prediccion = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = prediccion, y_true = y_test))


              precision    recall  f1-score   support

         0.0       0.79      0.07      0.13     87148
         1.0       0.67      0.99      0.80    162852

    accuracy                           0.67    250000
   macro avg       0.73      0.53      0.46    250000
weighted avg       0.71      0.67      0.56    250000
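
In the report above, accuracy is 0.67 while recall for class 0.0 is only 0.07, which is why a per-class metric is more informative here than accuracy. The following sketch shows how per-class metrics and the confusion matrix could be computed directly. The labels and predictions are hypothetical toy values, not the model's output.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions; class 0 plays the role of the minority class.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# pos_label selects the class the metric is computed for.
recall_0 = recall_score(y_true_toy, y_pred_toy, pos_label=0)
precision_0 = precision_score(y_true_toy, y_pred_toy, pos_label=0)
f1_0 = f1_score(y_true_toy, y_pred_toy, pos_label=0)

print(cm)
print(recall_0, precision_0, f1_0)
```

In this toy case accuracy is 0.7, yet only one of the four class-0 rows is recovered, which mirrors the pattern in the report.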



In [None]:
# 4) Run **Oversample** to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?
fraude = pd.DataFrame(X_train_scaled, columns = X_train.columns)
fraude["online_order"] = y_train.values
survived = fraude[fraude["online_order"] == 1]
# Build the mask from fraude itself; a mask built from fraud does not align with fraude's rows.
no_survived = fraude[fraude["online_order"] == 0]


In [None]:
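# Note: this draws 50 rows with replacement from the whole training frame;
# it does not by itself balance the two classes.
compra_online = resample(fraude, 
                        replace=True, 
                        n_samples = 50,
                        random_state=0)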
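
Random oversampling is usually applied to the minority class only, so that it grows to the size of the majority class. A minimal sketch under that assumption follows. The frame `toy_train`, its `label` column, and the 8:2 split are hypothetical; with the lab's data the same pattern would be applied to the training frame and its target column.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame: 8 majority rows (label 1), 2 minority rows (label 0).
toy_train = pd.DataFrame({"x": range(10), "label": [1] * 8 + [0] * 2})

majority = toy_train[toy_train["label"] == 1]
minority = toy_train[toy_train["label"] == 0]

# Draw minority rows with replacement until they match the majority count.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

# Recombine into a balanced training set.
train_over = pd.concat([majority, minority_up])
print(train_over["label"].value_counts())
```

Only the training data should be resampled; the test set stays untouched so that the evaluation reflects the original class distribution.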

In [14]:
# 5) Now, run **Undersample** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
compra_online_undersampled = resample(fraude, 
                                    replace=False, 
                                    n_samples = 50,
                                    random_state=0)
compra_online_undersampled

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,fraud,online_order
629436,-0.163032,-0.181921,-0.591284,0.367049,1.361808,-0.33497,-0.309219,1.0
525015,-0.377159,-0.170181,-0.597709,0.367049,-0.734318,2.985338,-0.309219,1.0
114041,-0.350523,0.128243,-0.031784,0.367049,-0.734318,-0.33497,-0.309219,1.0
523286,-0.326495,-0.070161,0.038095,0.367049,1.361808,-0.33497,-0.309219,1.0
613265,-0.311583,-0.172478,-0.448258,0.367049,-0.734318,-0.33497,-0.309219,1.0
643946,-0.168236,-0.179637,1.246809,0.367049,-0.734318,2.985338,-0.309219,1.0
434455,0.295264,-0.058867,-0.348066,0.367049,-0.734318,-0.33497,-0.309219,0.0
286429,-0.408151,-0.120541,0.20041,-2.72443,-0.734318,-0.33497,-0.309219,0.0
576014,-0.410093,-0.133414,1.125018,-2.72443,1.361808,-0.33497,3.233949,1.0
711485,-0.252146,0.011637,-0.504423,0.367049,-0.734318,-0.33497,-0.309219,1.0
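
Random undersampling mirrors oversampling: the majority class is sampled without replacement down to the size of the minority class. The sketch below illustrates this on made-up data. The names `toy_frame`, `label`, and `train_under` are assumptions for illustration, not part of the lab.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame: 8 majority rows (label 1), 2 minority rows (label 0).
toy_frame = pd.DataFrame({"x": range(10), "label": [1] * 8 + [0] * 2})

major = toy_frame[toy_frame["label"] == 1]
minor = toy_frame[toy_frame["label"] == 0]

# Draw majority rows without replacement down to the minority count.
major_down = resample(major, replace=False, n_samples=len(minor), random_state=0)

# Recombine into a balanced (but smaller) training set.
train_under = pd.concat([major_down, minor])
print(train_under["label"].value_counts())
```

The trade-off is that undersampling discards majority rows, so the model trains on less data than with oversampling.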


In [19]:
# 6) Finally, run **SMOTE** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
sm = SMOTE(random_state = 1,sampling_strategy=1.0)
X_train_sm,y_train_sm = sm.fit_resample(X_train_scaled,y_train)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [20]:
prediction = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = prediction, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.39      0.69      0.50     87148
         1.0       0.72      0.43      0.53    162852

    accuracy                           0.52    250000
   macro avg       0.55      0.56      0.52    250000
weighted avg       0.60      0.52      0.52    250000

