**Table of contents**<a id='toc0_'></a>    
- [Imbalanced Data](#toc1_)    
- [A relatively bad model](#toc2_)    
  - [Changing weights internally](#toc2_1_)    
- [Oversampling / undersampling](#toc3_)    
  - [Oversampling](#toc3_1_)    
  - [Undersampling](#toc3_2_)    
- [SMOTE](#toc4_)    
- [Acknowledgements](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Imbalanced Data](#toc0_)

Why is this important? Most of the events we are trying to predict (e.g. fraudulent card transactions, disease, customer churn) are **minority** events.

In [None]:
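# You know the drill...
# What library do I need to install? :)
# Hint: the `imblearn` module ships in the imbalanced-learn package:
# !pip install imbalanced-learn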

In [None]:
import imblearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [None]:
diabetes = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/diabetes.csv')
diabetes.head()

In [None]:
diabetes.shape

In [None]:
diabetes.Outcome.value_counts()

> There are far more imbalanced datasets out there, but this imbalance is still significant, and the cost of missing the minority class (an undiagnosed case of diabetes) is high:

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
count_classes = diabetes['Outcome'].value_counts()
count_classes.plot(kind='bar')

In [None]:
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# <a id='toc2_'></a>[A relatively bad model](#toc0_)

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train, y_train)
LR.score(X_test, y_test)

> Accuracy does not look terrible, but a closer look at precision and recall reveals some serious problems:

In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

pred = LR.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

> Recall is below 60%: we fail to identify over 40% of the actual diabetes cases!

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)
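
The metrics printed earlier can all be read off a confusion matrix. The sketch below uses made-up counts (an assumption for illustration, not the output of the cell above) laid out the way `confusion_matrix` returns them, with true classes as rows and predicted classes as columns.

```python
# Hypothetical confusion-matrix counts (NOT the notebook's real output),
# laid out like sklearn's confusion_matrix: rows = true, columns = predicted.
tn, fp = 110, 20   # true class 0: correctly rejected / false alarms
fn, tp = 26, 36    # true class 1: missed cases / correctly detected

precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
miss_rate = fn / (tp + fn)   # share of real positives we failed to detect

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
print(f"missed:    {miss_rate:.3f}")
```

The miss rate is simply `1 - recall`, which is why recall is the metric to watch when a missed positive is costly.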

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, pred))
cm

In [None]:
# Rename columns to predicted values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
# Rename rows to real values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

## <a id='toc2_1_'></a>[Changing weights internally](#toc0_)

In [None]:
LR = LogisticRegression(max_iter=1000, class_weight='balanced')
LR.fit(X_train, y_train)
LR.score(X_test, y_test)
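
According to the scikit-learn documentation, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`, so the rarer class gets the larger weight. A minimal sketch of that formula follows; the labels in `y_example` are invented for illustration and are not the notebook's `y_train`.

```python
from collections import Counter

# Hypothetical training labels (NOT the notebook's y_train):
# 6 negatives and 2 positives.
y_example = [0, 0, 0, 0, 0, 0, 1, 1]

counts = Counter(y_example)
n_samples = len(y_example)
n_classes = len(counts)

# Formula documented for class_weight='balanced':
# n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
print(balanced_weights)

# After weighting, each class contributes the same total weight to the loss.
total_weight = {label: counts[label] * w for label, w in balanced_weights.items()}
print(total_weight)
```

No rows are added or removed with this approach; misclassifying a minority sample simply costs the model more during training.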

In [None]:
pred = LR.predict(X_test)

print("precision: ", precision_score(y_test, pred))
print("recall: ", recall_score(y_test, pred))
print("f1: ", f1_score(y_test, pred))

In [None]:
confusion_matrix(y_test, pred)

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, pred))
# Rename columns to predicted values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
# Rename rows to real values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
print(classification_report(y_test, pred))

# <a id='toc3_'></a>[Oversampling / undersampling](#toc0_)

In [None]:
from sklearn.utils import resample

**Oversampling / undersampling must only be done on the TRAINING set!** Otherwise copies of the same sample can end up on both sides of the split, leaking test data into the training set. Compare the two orders:
- **Resampling + Train-Test Split:** X = ABCD -resampling-> AABBCCDD -train test split-> AABBC / CDD   
- **Train-Test Split + Resampling:** X = ABCD -train test split-> ABC / D -resampling-> AABBCC / D 
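
The two orders above can be checked with a toy example. This is only a sketch: the row ids, the `split` helper, and the "duplicate every row" resampling are stand-ins for illustration, not the notebook's data or scikit-learn's `train_test_split`.

```python
import random

rng = random.Random(0)
rows = list(range(10))  # 10 hypothetical row ids standing in for the dataset

def split(items, test_fraction=0.25):
    """Shuffle a copy of `items` and cut it into (train, test)."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Wrong order: duplicate every row first, then split.
duplicated = rows * 2
train_bad, test_bad = split(duplicated)
leaked = set(train_bad) & set(test_bad)

# Right order: split first, then duplicate only the training rows.
train_ok, test_ok = split(rows)
train_ok = train_ok * 2
overlap_ok = set(train_ok) & set(test_ok)

print("rows in both train and test (resample, then split):", len(leaked))
print("rows in both train and test (split, then resample):", len(overlap_ok))
```

Any row counted in the first line was seen during training and then reused for evaluation, which makes test scores look better than they really are.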

In [None]:
train = pd.concat([X_train, y_train], axis=1)
display(train.shape)
train.head()

## <a id='toc3_1_'></a>[Oversampling - Resampling](#toc0_)

In [None]:
# separate majority/minority classes
no_diabetes = train[train['Outcome']==0]
yes_diabetes = train[train['Outcome']==1]

In [None]:
display(no_diabetes.shape)
display(yes_diabetes.shape)

In [None]:
# oversample minority
yes_diabetes_oversampled = resample(yes_diabetes, #<- sample from here
                                    replace=True, #<- we need replacement, since we don't have enough data otherwise
                                    n_samples = len(no_diabetes),#<- make both sets the same size
                                    random_state=0)
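
Why `replace=True` is needed here can be seen with the standard library alone. In this sketch `random.choices` and `random.sample` stand in for sampling with and without replacement; the three-row minority class and the target size are made up for illustration.

```python
import random

rng = random.Random(0)
minority = ["a", "b", "c"]   # hypothetical minority rows
target_size = 7              # hypothetical majority-class size

# With replacement: rows can repeat, so any target size is reachable.
oversampled = rng.choices(minority, k=target_size)
print(oversampled)

# Without replacement: asking for more rows than exist is impossible.
try:
    rng.sample(minority, k=target_size)
    without_replacement_ok = True
except ValueError:
    without_replacement_ok = False
print("sampling without replacement worked:", without_replacement_ok)
```

The oversampled set contains no new information, only repeated rows, which is worth keeping in mind when interpreting the results.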

In [None]:
# both classes now have the same number of rows
display(no_diabetes.shape)
display(yes_diabetes_oversampled.shape)

In [None]:
train_oversampled = pd.concat([no_diabetes, yes_diabetes_oversampled])
train_oversampled.head()

In [None]:
y_train_over = train_oversampled['Outcome'].copy()
X_train_over = train_oversampled.drop('Outcome',axis = 1).copy()

> Our Logistic Regression, while still not amazing, has improved substantially, especially at detecting instances of diabetes.

In [None]:
LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_over, y_train_over)
pred = LR.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

In [None]:
confusion_matrix(y_test, pred)

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, pred))
# Rename columns to predicted values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
# Rename rows to real values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
print(classification_report(y_test, pred))

## <a id='toc3_2_'></a>[Undersampling](#toc0_)

In [None]:
# undersample majority
no_diabetes_undersampled = resample(no_diabetes, #<- downsample from here
                                    replace=False, #<- no need to reuse data now, we have an abundance
                                    n_samples = len(yes_diabetes),
                                    random_state=0)

> Both sets are the same size - small, but balanced, and no repeated data.

In [None]:
display(yes_diabetes.shape)
display(no_diabetes_undersampled.shape)

In [None]:
train_undersampled = pd.concat([yes_diabetes,no_diabetes_undersampled])
train_undersampled.head()

In [None]:
y_train_under = train_undersampled['Outcome'].copy()
X_train_under = train_undersampled.drop('Outcome',axis = 1).copy()

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_under, y_train_under)
pred = LR.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, pred))
# Rename columns to predicted values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
# Rename rows to real values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
print(classification_report(y_test, pred))

# <a id='toc4_'></a>[SMOTE](#toc0_)

In [None]:
from imblearn.over_sampling import SMOTE

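> A bit of magic; you can find the documentation here: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html. SMOTE builds each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours (`k_neighbors=5` by default):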

In [None]:
sm = SMOTE(random_state=123,sampling_strategy=1.0)
X_train_SMOTE, y_train_SMOTE = sm.fit_resample(X_train,y_train)
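
To take some of the magic out of SMOTE, here is a rough standard-library sketch of its core idea: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The points and the helper names `nearest_neighbours` and `synthetic_point` are invented for illustration, and this is not imbalanced-learn's actual implementation.

```python
import math
import random

rng = random.Random(0)

# Hypothetical 2-D minority-class points (not taken from the diabetes data).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.5), (2.5, 2.8), (1.2, 2.6)]

def nearest_neighbours(point, points, k):
    """Return the k points closest to `point`, excluding the point itself."""
    others = [p for p in points if p != point]
    return sorted(others, key=lambda p: math.dist(point, p))[:k]

def synthetic_point(point, points, k=5):
    """Interpolate between `point` and one randomly chosen near neighbour."""
    neighbour = rng.choice(nearest_neighbours(point, points, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(point, neighbour))

# Generate four synthetic minority points from randomly chosen seeds.
new_points = [synthetic_point(rng.choice(minority), minority) for _ in range(4)]
for p in new_points:
    print(p)
```

Unlike plain oversampling, the new rows are not exact copies, but they always lie between existing minority samples, so SMOTE cannot invent information outside the region the minority class already covers.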

In [None]:
y_train_SMOTE.value_counts()

> Yet another small improvement, but bear in mind that we saved 12 hypothetical people with these "small improvements".

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred = LR.predict(X_test)

print("precision: ",precision_score(y_test, pred))
print("recall: ",recall_score(y_test, pred))
print("f1: ",f1_score(y_test,pred))

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, pred))
# Rename columns to predicted values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - Pred', 1: 'Yes - Pred'}, axis=1, inplace=True)
# Rename rows to real values - 0 = No diabetes, 1 = Diabetes
cm.rename({0: 'No - True', 1: 'Yes - True'}, axis=0, inplace=True)
px.imshow(cm, text_auto=True, color_continuous_scale='RdBu', color_continuous_midpoint=0)

In [None]:
print(classification_report(y_test, pred))

# <a id='toc5_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure and content!