## Problem Statement
Find an imbalanced dataset (or generate one) with a small number of classes and apply oversampling and undersampling techniques: random oversampling, random undersampling, Tomek links, SMOTE, and class weighting. Train a model on each rebalanced dataset, compute the performance metrics (accuracy, F1 score, and AUC), and compare which technique improves model performance.
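
For the "generate a dataset" option mentioned above, one possible sketch uses scikit-learn's `make_classification`. The sample count, feature counts, and the 99/1 class ratio are illustrative assumptions, and the names `X_syn` and `y_syn` are hypothetical; the notebook itself loads `creditcard.csv` instead.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Synthetic binary dataset: roughly 99% negatives and 1% positives
# (all sizes and ratios here are illustrative assumptions).
X_syn, y_syn = make_classification(
    n_samples=10_000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.99, 0.01],
    flip_y=0,  # no label noise, so the class ratio stays close to `weights`
    random_state=42,
)

# Inspect how skewed the labels are before choosing a resampling technique.
class_counts = Counter(y_syn.tolist())
minority_share = class_counts[1] / len(y_syn)
print(class_counts, f"minority share = {minority_share:.2%}")
```

Checking the class counts first is useful because the degree of imbalance determines how much data undersampling would discard.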

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

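# Load the credit card dataset; 'Class' is the binary target column
df = pd.read_csv('creditcard.csv')
X = df.drop('Class', axis=1)
y = df['Class']

# Hold out 30% of the rows as a test set that is never resampled
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)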
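
The split above does not pass `stratify`, so the share of positives in the train and test sets can drift apart when the minority class is rare. A minimal sketch of a stratified split follows, on toy data rather than `creditcard.csv`; the names `X_toy`, `y_toy`, and the 2% positive rate are assumptions for illustration. Using it in this notebook would change the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 2% positive rate (illustrative, not the credit card data).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(5_000, 3))
y_toy = np.zeros(5_000, dtype=int)
y_toy[:100] = 1

# stratify=y_toy keeps the class proportions (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.4f}, test={test_rate:.4f}")
```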


In [9]:
# Baseline: logistic regression trained on the original (imbalanced) data
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

baseline_acc = accuracy_score(y_test, y_pred)
baseline_f1 = f1_score(y_test, y_pred)
baseline_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
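
The warning above suggests two remedies: scale the features or raise `max_iter`. One way to do both is sketched below on synthetic stand-in data; the pipeline, the `max_iter=1000` value, and the names `X_demo`, `y_demo`, and `scaled_lr` are assumptions and not part of the original notebook. The metrics printed later in this notebook were produced without scaling, so they would change if this were applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not creditcard.csv); columns are rescaled to very
# different magnitudes to mimic unscaled raw features.
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_demo = X_demo * np.array([1, 10, 100, 1_000, 1, 1, 1, 10_000])

# Standardize inside the pipeline so the scaler is fit on the training data
# only, and give the solver a larger budget than the default of 100 iterations.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

n_iter = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print(f"lbfgs iterations used: {n_iter} (limit 1000)")
```

Keeping the scaler inside the pipeline also means the test set is transformed with statistics learned from the training set, which avoids leaking test information into the fit.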


In [10]:

# 1. Random Oversampling: duplicate minority-class rows at random until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
lr.fit(X_res, y_res)
y_pred_ros = lr.predict(X_test)

ros_acc = accuracy_score(y_test, y_pred_ros)
ros_f1 = f1_score(y_test, y_pred_ros)
ros_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])





In [11]:

# 2. Random Undersampling: drop majority-class rows at random until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
lr.fit(X_rus, y_rus)
y_pred_rus = lr.predict(X_test)

rus_acc = accuracy_score(y_test, y_pred_rus)
rus_f1 = f1_score(y_test, y_pred_rus)
rus_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])




In [12]:

# 3. SMOTE: synthesize new minority samples by interpolating between minority-class neighbors
smote = SMOTE(random_state=42)
X_sm, y_sm = smote.fit_resample(X_train, y_train)
lr.fit(X_sm, y_sm)
y_pred_smote = lr.predict(X_test)

smote_acc = accuracy_score(y_test, y_pred_smote)
smote_f1 = f1_score(y_test, y_pred_smote)
smote_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])




In [13]:

# 4. Tomek Links: remove majority-class samples that form a Tomek link with a minority-class sample
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X_train, y_train)
lr.fit(X_tl, y_tl)
y_pred_tl = lr.predict(X_test)

tl_acc = accuracy_score(y_test, y_pred_tl)
tl_f1 = f1_score(y_test, y_pred_tl)
tl_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])




In [14]:

# 5. Class Weighting: no resampling; weight each class inversely to its frequency in the loss
lr_weighted = LogisticRegression(class_weight='balanced')
lr_weighted.fit(X_train, y_train)
y_pred_weighted = lr_weighted.predict(X_test)

weighted_acc = accuracy_score(y_test, y_pred_weighted)
weighted_f1 = f1_score(y_test, y_pred_weighted)
weighted_auc = roc_auc_score(y_test, lr_weighted.predict_proba(X_test)[:, 1])
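
To make `class_weight='balanced'` concrete: scikit-learn computes each class weight as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on a toy label vector and compares it with `compute_class_weight`; the 990/10 split and the name `y_small` are illustrative assumptions, not the credit card data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 990 negatives and 10 positives (illustrative only).
y_small = np.array([0] * 990 + [1] * 10)
classes = np.unique(y_small)

# The "balanced" heuristic: n_samples / (n_classes * count_per_class).
manual = len(y_small) / (len(classes) * np.bincount(y_small))

# The same weights as computed by scikit-learn's helper.
from_sklearn = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_small
)

print(dict(zip(classes.tolist(), manual.round(4).tolist())))
```

With these toy counts the minority class gets a weight of 50 and the majority class a weight of about 0.505, so each class contributes the same total weight to the loss.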




In [15]:
# Print performance metrics
print(f"Baseline: Accuracy={baseline_acc}, F1={baseline_f1}, AUC={baseline_auc}")
print(f"Random Oversampling: Accuracy={ros_acc}, F1={ros_f1}, AUC={ros_auc}")
print(f"Random Undersampling: Accuracy={rus_acc}, F1={rus_f1}, AUC={rus_auc}")
print(f"SMOTE: Accuracy={smote_acc}, F1={smote_f1}, AUC={smote_auc}")
print(f"Tomek Links: Accuracy={tl_acc}, F1={tl_f1}, AUC={tl_auc}")
print(f"Class Weighting: Accuracy={weighted_acc}, F1={weighted_f1}, AUC={weighted_auc}")

Baseline: Accuracy=0.9991807403766253, F1=0.7222222222222222, AUC=0.9400659486601679
Random Oversampling: Accuracy=0.961342649485622, F1=0.06931530008453085, AUC=0.9739207492109813
Random Undersampling: Accuracy=0.9697459124796648, F1=0.08882622488544237, AUC=0.975813049615265
SMOTE: Accuracy=0.9825146588954039, F1=0.1423650975889782, AUC=0.9775235240332667
Tomek Links: Accuracy=0.9988881476539916, F1=0.6387832699619772, AUC=0.9214372105178597
Class Weighting: Accuracy=0.963800428355746, F1=0.07533632286995516, AUC=0.9768245778741003
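
To compare the techniques side by side, the printed metrics can be collected into a table and ranked. The sketch below copies the values from the output above by hand (the `results` frame and its column names are assumptions, not variables from the notebook). On these numbers, random oversampling, random undersampling, SMOTE, and class weighting all raise AUC over the baseline, yet their F1 at the default 0.5 threshold drops sharply, which is consistent with many more false positives at that threshold; Tomek Links lowers both F1 and AUC slightly.

```python
import pandas as pd

# Metrics copied from the printed output above, one row per technique.
results = pd.DataFrame(
    {
        "accuracy": [0.9991807403766253, 0.961342649485622, 0.9697459124796648,
                     0.9825146588954039, 0.9988881476539916, 0.963800428355746],
        "f1": [0.7222222222222222, 0.06931530008453085, 0.08882622488544237,
               0.1423650975889782, 0.6387832699619772, 0.07533632286995516],
        "auc": [0.9400659486601679, 0.9739207492109813, 0.975813049615265,
                0.9775235240332667, 0.9214372105178597, 0.9768245778741003],
    },
    index=["Baseline", "Random Oversampling", "Random Undersampling",
           "SMOTE", "Tomek Links", "Class Weighting"],
)

# Rank by F1, then report which technique wins on each headline metric.
ranked = results.sort_values("f1", ascending=False).round(4)
best_f1 = results["f1"].idxmax()
best_auc = results["auc"].idxmax()

print(ranked)
print(f"best F1: {best_f1}; best AUC: {best_auc}")
```

Because the ranking differs by metric (the baseline leads on F1, SMOTE on AUC), the choice of "best" technique depends on which metric matters for the application.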
