## Lab | Handling Data Imbalance in Classification Models

**Scenario**

You are working as an analyst at an internet service provider. You are given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are likely to churn, so that it can act to prevent losses from those customers.

**Instructions**

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

 - **Import the libraries and modules you will need.**

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")

 - **Read the data into Python and call the dataframe churnData.**

In [2]:
churnData= pd.read_csv('Customer-Churn.csv')

 - **Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.**

In [3]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

**Converting TotalCharges to numeric**

In [4]:
churnData['TotalCharges']= pd.to_numeric(churnData['TotalCharges'], errors = 'coerce' )
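
The effect of `errors='coerce'` can be seen on a tiny example. This is a minimal sketch on a hypothetical Series (`raw` is made up for illustration), assuming the non-numeric entries in the column are blank strings: any value that cannot be parsed becomes `NaN` instead of raising an error, which is why the null count changes after the conversion.

```python
import pandas as pd

# Hypothetical values; the blank string stands in for an unparseable entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns unparseable strings into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # dtype of the converted column
print(converted.isna().sum())  # number of values that could not be parsed
```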

In [5]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

 - **Check for null values in the dataframe. Replace the null values.**

In [6]:
churnData.isnull().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

 - The TotalCharges column has 11 null values.
 - Dropping these 11 rows removes only about 0.16% of the 7043 entries, so the rows are dropped rather than imputed.

In [7]:
churnData.dropna(subset=['TotalCharges'], inplace=True)

In [8]:
churnData.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
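
The instruction asks for the nulls to be replaced, while the cells above drop them. If imputation is preferred, one possible approach is to fill the missing values with the column median. The sketch below uses a small hypothetical dataframe (`demo`) rather than the real data, and the choice of the median is an assumption, not something the lab prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "tenure": [0, 12, 24, 36],
    "TotalCharges": [np.nan, 600.0, 1500.0, 2500.0],
})

filled = demo.copy()
# The median ignores NaN by default, so it is computed from the known values.
median_total = filled["TotalCharges"].median()
filled["TotalCharges"] = filled["TotalCharges"].fillna(median_total)

print(filled)
```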

 - **Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:**

1. Scale the features either by using normalizer or a standard scaler.

In [9]:
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges'] 
X= churnData[features]
y= churnData['Churn']

 - Using MinMaxScaler to rescale each feature to the range [0, 1].

In [10]:
scaler= MinMaxScaler()
X_scaled= scaler.fit_transform(X)

2. Split the data into a training set and a test set.

In [11]:
X_train, X_test, y_train, y_test= train_test_split(X_scaled, y, test_size=0.2, random_state=42)
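
The cells above fit the scaler on all rows and split afterwards, so the test rows contribute to the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data (`X_demo` and `y_demo` are made up) and also checks that `MinMaxScaler` matches the formula (x - min) / (max - min) computed by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max from training rows
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training min/max

# Same result as applying the min-max formula column by column.
manual = (X_tr - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))
print(np.allclose(X_tr_scaled, manual))
```

With this order, scaled test values can fall slightly outside [0, 1], which is expected because the test rows did not define the range.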

3. Fit a logistic regression model on the training data.

In [12]:
lr= LogisticRegression()
lr.fit(X_train, y_train)
print('Done')

Done


In [13]:
# checking predictions
predictions= lr.predict(X_test)

In [14]:
predictions

array(['No', 'No', 'Yes', ..., 'No', 'Yes', 'No'], dtype=object)

4. Check the accuracy on the test data.

In [15]:
accuracy = accuracy_score(y_test,predictions)

In [16]:
accuracy

0.7803837953091685

In [17]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.91      0.86      1033
         Yes       0.63      0.43      0.51       374

    accuracy                           0.78      1407
   macro avg       0.72      0.67      0.69      1407
weighted avg       0.76      0.78      0.77      1407
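
To put the accuracy in context, it helps to compare it with a model that always predicts the majority class. The arithmetic below uses only the support counts and the accuracy reported above; the variable names are illustrative.

```python
# Support counts taken from the classification report above.
support_no = 1033
support_yes = 374

# Accuracy of a classifier that always predicts "No".
baseline_accuracy = support_no / (support_no + support_yes)

# Accuracy reported for the logistic regression model.
model_accuracy = 0.7803837953091685

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"gain over baseline: {model_accuracy - baseline_accuracy:.4f}")
```

The model beats the majority-class baseline by less than five percentage points, and the recall of 0.43 for the Yes class shows that more than half of the churners in the test set are missed.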



 - **Managing imbalance in the dataset**

1. Check for the imbalance.

In [18]:
churnData['Churn'].value_counts()

No     5163
Yes    1869
Name: Churn, dtype: int64

 - The classes are imbalanced: 5163 customers (about 73%) did not churn, while 1869 (about 27%) did, a ratio of roughly 2.8 to 1.
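
The degree of imbalance can be quantified from the counts above. The sketch rebuilds the `value_counts()` result by hand so that it runs without the CSV file; on the real dataframe, `churnData['Churn'].value_counts(normalize=True)` gives the shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 5163, "Yes": 1869}, name="Churn")

shares = counts / counts.sum()          # fraction of each class
ratio = counts["No"] / counts["Yes"]    # majority-to-minority ratio

print(shares.round(3))
print(f"No:Yes ratio = {ratio:.2f}")
```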

2. Use the resampling strategies covered in class for upsampling and downsampling to balance the two classes.
3. After each resampling, fit the model again and check how its accuracy changes.
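
As a reference point for the simplest versions of both strategies, random upsampling and random downsampling can be written with plain pandas. This is a sketch on a hypothetical toy dataframe (`toy`), not the method used in the cells below; the imported `RandomOverSampler` and `RandomUnderSampler` from imblearn serve the same purpose.

```python
import pandas as pd

# Toy data: 7 majority rows ("No") and 3 minority rows ("Yes").
toy = pd.DataFrame({
    "tenure": range(10),
    "Churn": ["No"] * 7 + ["Yes"] * 3,
})

majority = toy[toy["Churn"] == "No"]
minority = toy[toy["Churn"] == "Yes"]

# Upsampling: draw minority rows with replacement until both classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)

# Downsampling: keep a random subset of majority rows, without replacement.
majority_down = majority.sample(n=len(minority), random_state=0)
downsampled = pd.concat([majority_down, minority])

print(upsampled["Churn"].value_counts())
print(downsampled["Churn"].value_counts())
```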

In [19]:
# Applying imblearn.over_sampling.SMOTE to the dataset
# Note: this resamples the unscaled features X, and it does so before the train/test split
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

No     5163
Yes    5163
Name: Churn, dtype: int64

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=100, multi_class='ovr').fit(X_train, y_train)
s_predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7346675274370562

In [21]:
print(classification_report(y_test, s_predictions))

              precision    recall  f1-score   support

          No       0.74      0.72      0.73      1531
         Yes       0.73      0.75      0.74      1567

    accuracy                           0.73      3098
   macro avg       0.73      0.73      0.73      3098
weighted avg       0.73      0.73      0.73      3098



In [22]:
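# Applying imblearn.under_sampling.TomekLinks to the dataset.
# Note: as with SMOTE above, this uses the unscaled features X and runs before the train/test split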
from imblearn.under_sampling import TomekLinks
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

No     4610
Yes    1869
Name: Churn, dtype: int64

In [23]:
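# A second TomekLinks pass removes further majority rows.
# Note: the model in the next cell is fit on the first pass (X_tl, y_tl), not on X_tl2, y_tl2
X_tl2, y_tl2 = tl.fit_resample(X_tl, y_tl)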
y_tl2.value_counts()

No     4443
Yes    1869
Name: Churn, dtype: int64
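
A Tomek link is a pair of points from different classes that are each other's nearest neighbour. With `sampling_strategy='majority'`, the majority-class member of each link is removed, and removing points can create new links, which is consistent with the second pass above dropping another 167 rows. The function below is a simplified illustration of the definition using scikit-learn's `NearestNeighbors`; `find_tomek_links` is a hypothetical helper, not the imblearn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_tomek_links(X, y):
    """Return index pairs (i, j) with i < j that form Tomek links."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # kneighbors() without arguments excludes each point from its own neighbours.
    nn = NearestNeighbors(n_neighbors=1).fit(X)
    nearest = nn.kneighbors(return_distance=False)[:, 0]
    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbours with different labels form a link.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# One feature, five points; indices 1 and 2 are close but differ in class.
X_toy = [[0.0], [1.0], [1.2], [3.0], [4.0]]
y_toy = ["No", "No", "Yes", "Yes", "Yes"]

links = find_tomek_links(X_toy, y_toy)
print(links)  # [(1, 2)]
```

Points 3 and 4 are also mutual nearest neighbours, but they share a class, so they do not form a link.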

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=100, multi_class='ovr').fit(X_train, y_train)
t_predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7911522633744856

In [25]:
print(classification_report(y_test, t_predictions))

              precision    recall  f1-score   support

          No       0.82      0.90      0.86      1355
         Yes       0.70      0.55      0.61       589

    accuracy                           0.79      1944
   macro avg       0.76      0.72      0.74      1944
weighted avg       0.78      0.79      0.78      1944
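
The three runs can be placed side by side by collecting the reported metrics into one table. The values below are copied from the classification reports above; the row labels are illustrative. Note that each run was scored on a different test set (the SMOTE and TomekLinks test sets were drawn from resampled data), so the rows are indicative rather than strictly comparable.

```python
import pandas as pd

# Accuracy and Yes-class metrics copied from the three reports above.
summary = pd.DataFrame(
    {
        "accuracy": [0.78, 0.73, 0.79],
        "precision_yes": [0.63, 0.73, 0.70],
        "recall_yes": [0.43, 0.75, 0.55],
        "f1_yes": [0.51, 0.74, 0.61],
    },
    index=["no resampling", "SMOTE", "TomekLinks (one pass)"],
)

print(summary)

# Which run recovered the largest share of churners?
best_recall = summary["recall_yes"].idxmax()
print(best_recall)
```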

