# Lab | Handling Data Imbalance Classification

<br>

<details><summary>▶ Instructions:</summary>
<p>

In this lab, we will first take a look at the degree of imbalance in the data and correct it using the techniques we learned on the class.
Here is the list of steps to be followed (building a simple model without balancing the data):
* Import the required libraries and modules that you would need.
* Read that data into Python and call the dataframe churnData.
* Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.
* Check for null values in the dataframe. Replace the null values.
* Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
    * Scale the features either by using normalizer or a standard scaler.
    * Split the data into a training set and a test set.
    * Fit a logistic regression model on the training data.
    * Check the accuracy on the test data.
Note: So far we have not balanced the data.
Managing imbalance in the dataset
* Check for the imbalance.
* Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
* Each time fit the model and see how the accuracy of the model is.

</p>
</details>

In [1]:
# Import the required libraries and modules that you would need.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import imblearn
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import r2_score
from sklearn.metrics import plot_confusion_matrix
# Read that data into Python and call the dataframe churnData.
churnData = pd.read_csv('files_for_lab/Customer-Churn.csv')
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [2]:
# Check the datatypes of all the columns in the data. You would see that the column 
# TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.
#churnData.info()
churnData['TotalCharges'] = churnData['TotalCharges'].apply(pd.to_numeric, errors='coerce')
churnData['TotalCharges'].dtype

dtype('float64')

In [3]:
# Check for null values in the dataframe. Replace the null values.
churnData.isnull().sum()/len(churnData)  

gender              0.000000
SeniorCitizen       0.000000
Partner             0.000000
Dependents          0.000000
tenure              0.000000
PhoneService        0.000000
OnlineSecurity      0.000000
OnlineBackup        0.000000
DeviceProtection    0.000000
TechSupport         0.000000
StreamingTV         0.000000
StreamingMovies     0.000000
Contract            0.000000
MonthlyCharges      0.000000
TotalCharges        0.001562
Churn               0.000000
dtype: float64

In [4]:
# decide to drop the few rows with NaN in TotalCharges
churnData = churnData.dropna(subset=['TotalCharges'])
churnData.isnull().sum()/len(churnData)  

gender              0.0
SeniorCitizen       0.0
Partner             0.0
Dependents          0.0
tenure              0.0
PhoneService        0.0
OnlineSecurity      0.0
OnlineBackup        0.0
DeviceProtection    0.0
TechSupport         0.0
StreamingTV         0.0
StreamingMovies     0.0
Contract            0.0
MonthlyCharges      0.0
TotalCharges        0.0
Churn               0.0
dtype: float64

In [5]:
# Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
#     Scale the features either by using normalizer or a standard scaler.
#     Split the data into a training set and a test set.
#     Fit a logistic regression model on the training data.
#     Check the accuracy on the test data.Note: So far we have not balanced the data

In [6]:
# Scaling
churnData.head()
# first change Churn into numerical
churnData['Churn'] = churnData['Churn'].replace({'No': 0, 'Yes': 1})
numerical = churnData._get_numeric_data()
numerical.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0,0,1,29.85,29.85,0
1,0,34,56.95,1889.5,0
2,0,2,53.85,108.15,1
3,0,45,42.3,1840.75,0
4,0,2,70.7,151.65,1


In [7]:
# Test-split
from sklearn.model_selection import train_test_split
numerical = churnData._get_numeric_data()
X = numerical.drop(labels='Churn', axis=1)
y = numerical['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Scale 
from sklearn.preprocessing import MinMaxScaler

MinMaxtransformer = MinMaxScaler().fit(X_train)
X_normalized_train = MinMaxtransformer.transform(X_train)
X_train_normalized = pd.DataFrame(X_normalized_train, columns=X_train.columns)

MinMaxtransformer = MinMaxScaler().fit(X_test)
X_normalized_test = MinMaxtransformer.transform(X_test)
X_test_normalized = pd.DataFrame(X_normalized_test, columns=X_test.columns)

y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [9]:
# Fit 
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train_normalized, y_train)

DecisionTreeClassifier(max_depth=5)

In [10]:
print("Accuracy test data:",model.score(X_test_normalized, y_test))
print("Acuracy train data:",model.score(X_train_normalized, y_train))

Accuracy test data: 0.7668798862828714
Acuracy train data: 0.7982222222222223


In [11]:
# Managing imbalance in the dataset
# * Check for the imbalance.
y.value_counts()
y_train.value_counts()
# with target values nearly 3 times more 0, it can be called imbalanced

0    4130
1    1495
Name: Churn, dtype: int64

In [12]:
# * Use the resampling strategies used in class for upsampling and downsampling to create a 
#   balance between the two classes.
# * Each time fit the model and see how the accuracy of the model is.
from sklearn.utils import resample
# Only on the TRAINING set we do resampling, TEST set still shows reality
train = pd.concat([X_train_normalized, y_train],axis=1)
churn_0 = train[train['Churn'] == 0]     # 4130
churn_1 = train[train['Churn'] == 1]     # 1495

## Upsampling

In [13]:
# us = upsampling    1495 --> 4130 

In [14]:
churn_1_us = resample(churn_1, replace=True, n_samples = len(churn_0))
print("0 =", churn_0.shape)
print("1 =", churn_1_us.shape)

0 = (4130, 5)
1 = (4130, 5)


In [15]:
churn_us = pd.concat([churn_0, churn_1_us], axis=0)
churn_us['Churn'].value_counts()

y_train_us = churn_us['Churn']
X_train_us = churn_us[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]

In [16]:
from sklearn.linear_model import LogisticRegression
LR_us = LogisticRegression(max_iter=1000)
LR_us.fit(X_train_us, y_train_us)
pred_us = LR_us.predict(X_test_normalized)
print("accuracy: ",LR_us.score(X_test_normalized, y_test))
print("precision:",precision_score(y_test,pred_us))
print("recall:   ",recall_score(y_test,pred_us))
print("f1:       ",f1_score(y_test,pred_us))

confusion_matrix(y_test,pred_us)

accuracy:  0.3582089552238806
precision: 0.25117591721542804
recall:    0.713903743315508
f1:        0.37160751565762


array([[237, 796],
       [107, 267]])

## Downsampling

In [17]:
# DS = Downsampling    4130 --> 1495 

In [18]:
churn_0_ds = resample(churn_0, replace=False, n_samples = len(churn_1))
print("0 =", churn_0_ds.shape)
print("1 =", churn_1.shape)

0 = (1495, 5)
1 = (1495, 5)


In [19]:
churn_ds = pd.concat([churn_0_ds, churn_1], axis=0)
churn_ds['Churn'].value_counts()

y_train_ds = churn_ds['Churn']
X_train_ds = churn_ds[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]

In [20]:
LR_ds = LogisticRegression(max_iter=1000)
LR_ds.fit(X_train_ds, y_train_ds)
pred_ds = LR_ds.predict(X_test_normalized)
print("accuracy: ",LR_ds.score(X_test_normalized, y_test))
print("precision:",precision_score(y_test,pred_ds))
print("recall:   ",recall_score(y_test,pred_ds))
print("f1:       ",f1_score(y_test,pred_ds))

confusion_matrix(y_test,pred_ds)

accuracy:  0.3603411513859275
precision: 0.2514177693761815
recall:    0.7112299465240641
f1:        0.3715083798882682


array([[241, 792],
       [108, 266]])

In [21]:
# Repeat from upsampling
confusion_matrix(y_test,pred_us)

array([[237, 796],
       [107, 267]])

In [22]:
# Re-running seems to make a competition in false positives: sometimes upsampling 
# sometimes downsampling gives more false positives, so difficult to say which 
# resampling is more appropriate. Same goes for the other KPI's