# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will build a model for a customer churn binary classification problem. You will be using the **files_for_lab/Customer-Churn.csv** file.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

 - Import the required libraries and modules that you would need.
 - Read that data into Python and call the dataframe churnData.
 - Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. 
   Convert this column to a numeric type using the pd.to_numeric function.
 - Check for null values in the dataframe. Replace the null values.
 - Use the following features: *tenure*, *SeniorCitizen*, *MonthlyCharges* and *TotalCharges*:
            
     - Scale the features either by using normalizer or a standard scaler.
     - Split the data into a training set and a test set.
     - Fit a logistic regression model on the training data.
    
     - Check the accuracy on the test data.
**Note**: So far we have not balanced the data.

Managing imbalance in the dataset

 - Check for the imbalance.
 - Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
 - Each time, fit the model and check how its accuracy changes.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE 
from imblearn.under_sampling import TomekLinks
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_data=pd.read_csv(r"files_for_lab/Customer-Churn.csv")
df_data.head()

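|  | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.7 | 151.65 | Yes |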


In [3]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [4]:
# Normalize the column names: lowercase, with spaces replaced by underscores.
col_names=[col.lower().replace(" ", "_") for col in df_data.columns]

In [5]:
# Apply the normalized names to the dataframe in one assignment.
df_data.columns=col_names

In [6]:
df_data.head(2)

|  | gender | seniorcitizen | partner | dependents | tenure | phoneservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |


In [7]:
df_data['totalcharges']=df_data['totalcharges'].replace(" ", np.nan) # blank strings become NaN; assigning back avoids a chained inplace call

In [8]:
df_data["totalcharges"]=pd.to_numeric(df_data["totalcharges"]) # row 488 is empty, unable to convert it to numeric
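
The two cells above could arguably be collapsed into a single call: `pd.to_numeric` accepts `errors="coerce"`, which turns any value it cannot parse into `NaN` instead of raising. A minimal sketch on a toy Series (the values and the names `raw` and `numeric` are made up for illustration, not taken from the lab file):

```python
import pandas as pd

# Toy stand-in for the raw "totalcharges" column; the " " entry mimics the blank rows.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="totalcharges")

# errors="coerce" converts unparseable strings to NaN instead of raising a ValueError.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

The trade-off is that coercion silently hides every unparseable value, so it is worth counting the resulting nulls right afterwards, as the notebook does in the next cells.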

In [9]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   seniorcitizen     7043 non-null   int64  
 2   partner           7043 non-null   object 
 3   dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   phoneservice      7043 non-null   object 
 6   onlinesecurity    7043 non-null   object 
 7   onlinebackup      7043 non-null   object 
 8   deviceprotection  7043 non-null   object 
 9   techsupport       7043 non-null   object 
 10  streamingtv       7043 non-null   object 
 11  streamingmovies   7043 non-null   object 
 12  contract          7043 non-null   object 
 13  monthlycharges    7043 non-null   float64
 14  totalcharges      7032 non-null   float64
 15  churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

In [10]:
df_data.isnull().sum() # we can see that there are 11 null values in the "totalcharges" column

gender               0
seniorcitizen        0
partner              0
dependents           0
tenure               0
phoneservice         0
onlinesecurity       0
onlinebackup         0
deviceprotection     0
techsupport          0
streamingtv          0
streamingmovies      0
contract             0
monthlycharges       0
totalcharges        11
churn                0
dtype: int64

In [11]:
df_data["totalcharges"].describe()

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: totalcharges, dtype: float64

In [12]:
median=df_data["totalcharges"].median() # we are going to replace the null values with the median

In [13]:
median

1397.475

In [14]:
df_data["totalcharges"]=df_data["totalcharges"].fillna(median) # replace the nulls with the median value

In [15]:
df_data.isnull().sum()

gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [16]:
cols_to_scale=["tenure", "seniorcitizen", "monthlycharges","totalcharges"]

In [17]:
df_scale=df_data[cols_to_scale]

In [18]:
df_data["churn"].value_counts() # As we can see below, the data is imbalanced.
                                # Only about 26.5% (1869 of 7043) of the target values are "Yes".

No     5174
Yes    1869
Name: churn, dtype: int64
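
To read the imbalance as shares rather than raw counts, `value_counts` can be called with `normalize=True`. The sketch below rebuilds a label column with the same counts as the output above, so it runs without the CSV file; the name `churn` for the Series is an assumption made for the example.

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above.
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="churn")

# normalize=True returns the share of each class instead of its raw count.
shares = churn.value_counts(normalize=True).round(3)
print(shares)
```

On the real dataframe the equivalent call would be `df_data["churn"].value_counts(normalize=True)`.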

* We are going to use the StandardScaler to scale the data

In [19]:
scaler=StandardScaler()

In [20]:
scaled=scaler.fit_transform(df_scale)

In [21]:
df_st_scaled=pd.DataFrame(scaled,columns=cols_to_scale)

In [22]:
df_st_scaled

|  | tenure | seniorcitizen | monthlycharges | totalcharges |
|---|---|---|---|---|
| 0 | -1.277445 | -0.439916 | -1.160323 | -0.994242 |
| 1 | 0.066327 | -0.439916 | -0.259629 | -0.173244 |
| 2 | -1.236724 | -0.439916 | -0.362660 | -0.959674 |
| 3 | 0.514251 | -0.439916 | -0.746535 | -0.194766 |
| 4 | -1.236724 | -0.439916 | 0.197365 | -0.940470 |
| ... | ... | ... | ... | ... |
| 7038 | -0.340876 | -0.439916 | 0.665992 | -0.128655 |
| 7039 | 1.613701 | -0.439916 | 1.277533 | 2.243151 |
| 7040 | -0.870241 | -0.439916 | -1.168632 | -0.854469 |
| 7041 | -1.155283 | 2.273159 | 0.320338 | -0.872062 |


In [23]:
X=df_st_scaled
y=df_data["churn"]

In [24]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=42)
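
Note that the notebook scales all rows before splitting, so the mean and standard deviation used by the scaler also include the test rows. A common alternative is to split first (optionally with `stratify` so both splits keep the class ratio) and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; `X_demo`, `y_demo` and the other names are assumptions for the example and are not part of the lab's notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the four features and an imbalanced target (roughly 26% positives).
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))
y_demo = (rng.random(1000) < 0.26).astype(int)

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately zero for every feature
print(y_tr.mean(), y_te.mean())  # similar positive rates in both splits
```

With this order the test set stays unseen until evaluation, at the cost of test features that are no longer exactly zero-mean.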

* Next we are going to create a Logistic Regression model

In [25]:
model=LogisticRegression()
model.fit(X_train,y_train)

In [26]:
y_pred=model.predict(X_test)

In [27]:
# print('Accuracy on the test set: {:.2f}'.format(model.score(X_test, y_test))) # score the held-out test data, not the training data

In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.82      0.92      0.87      1539
         Yes       0.68      0.45      0.54       574

    accuracy                           0.79      2113
   macro avg       0.75      0.68      0.70      2113
weighted avg       0.78      0.79      0.78      2113



## Managing imbalance in the dataset

* As the support column of the classification report shows, we are working with imbalanced data. In the test set, about 73% (1539) of the "churn" labels are "No" and only about 27% (574) are "Yes". 

* Accuracy measures the proportion of correct predictions (true positives plus true negatives) over the total number of predictions. In our test set the proportion of "No" is about 0.73. This means that if we predicted "No" for every customer, we would already reach an accuracy of about 73% without learning anything about the "Yes" class. That is why, with imbalanced data, even a trivial model is very likely to score above 70% accuracy, and why accuracy alone is a misleading metric here.
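
The majority-class argument can be checked with a few lines of arithmetic. This is only a sketch that uses the test-set support counts from the classification report; the variable names are illustrative.

```python
# Test-set class counts taken from the support column of the classification report.
support = {"No": 1539, "Yes": 574}

total = sum(support.values())
majority_class = max(support, key=support.get)

# A model that always predicts the majority class is correct exactly
# as often as that class appears in the data.
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))  # No 0.728
```

Such a model never predicts "Yes", so its recall on the churners is zero even though its accuracy looks acceptable. The logistic regression's 0.79 should be judged against this baseline rather than against 0.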

1. Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes. Each time, fit the model and check how its accuracy changes.
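
Before moving to SMOTE and Tomek links, plain random resampling is worth sketching, since it is the simplest form of upsampling and downsampling. The example uses `DataFrame.sample` on a toy frame; the data and the names `train`, `majority` and `minority` are made up for illustration.

```python
import pandas as pd

# Toy training frame with a 6:2 class imbalance; the values are made up.
train = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "churn":  ["No", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

majority = train[train["churn"] == "No"]
minority = train[train["churn"] == "Yes"]

# Upsampling: draw minority rows with replacement until both classes have the same size.
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Downsampling: keep only as many majority rows as there are minority rows.
downsampled = pd.concat([
    majority.sample(n=len(minority), random_state=42),
    minority,
])

print(upsampled["churn"].value_counts().to_dict())
print(downsampled["churn"].value_counts().to_dict())
```

Either way, resampling is normally applied to the training split only, so that the test set keeps the original class distribution.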


### SMOTE (over-sampling)

In [29]:
smote = SMOTE()

# X=df_st_scaled
# y=df_data["churn"]
X_train_sm, X_test_sm, y_train_sm, y_test_sm = train_test_split(X, y, test_size=0.3, random_state=42)
X_res_sm, y_res_sm = smote.fit_resample(X_train_sm, y_train_sm)
y_res_sm.value_counts()

No     3635
Yes    3635
Name: churn, dtype: int64

In [30]:


classification = LogisticRegression(random_state=42, max_iter=10000)
classification.fit(X_res_sm, y_res_sm)

y_sm_predictions = classification.predict(X_test_sm)
print(classification_report(y_test_sm, y_sm_predictions))

              precision    recall  f1-score   support

          No       0.89      0.73      0.80      1539
         Yes       0.51      0.77      0.61       574

    accuracy                           0.74      2113
   macro avg       0.70      0.75      0.71      2113
weighted avg       0.79      0.74      0.75      2113



As the results above show, the overall accuracy decreased from 0.79 to 0.74. For the "No" class, precision increased (0.82 to 0.89) while recall and the f1-score decreased (0.92 to 0.73 and 0.87 to 0.80). For the "Yes" class, the effect of SMOTE was the inverse: precision decreased (0.68 to 0.51) while recall and the f1-score increased (0.45 to 0.77 and 0.54 to 0.61).

### Tomek links (down-sampling)

In [31]:
X_train_tl, X_test_tl, y_train_tl, y_test_tl = train_test_split(X, y, test_size=0.3, random_state=42)

In [32]:
tomek = TomekLinks()
X_res_tl, y_res_tl = tomek.fit_resample(X_train_tl, y_train_tl)
y_res_tl.value_counts()

No     3264
Yes    1295
Name: churn, dtype: int64

In [33]:
classification = LogisticRegression(random_state=42, max_iter=10000)
classification.fit(X_res_tl, y_res_tl)

y_tl_predictions = classification.predict(X_test_tl)
print(classification_report(y_test_tl, y_tl_predictions))

              precision    recall  f1-score   support

          No       0.83      0.88      0.86      1539
         Yes       0.62      0.52      0.57       574

    accuracy                           0.78      2113
   macro avg       0.73      0.70      0.71      2113
weighted avg       0.77      0.78      0.78      2113



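In this case, the metrics for "No" stayed close to those of the original model. For "Yes", recall (0.45 to 0.52) and the f1-score (0.54 to 0.57) improved moderately, while precision dropped (0.68 to 0.62). The accuracy stayed almost the same, moving from 0.79 to 0.78. Note that Tomek links only removed majority-class samples that sit next to minority-class samples, so the training set is still imbalanced afterwards (3264 "No" vs. 1295 "Yes").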
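
Putting the three classification reports side by side makes the trade-off easier to see. The sketch below only re-tabulates the "Yes" metrics and the accuracy already printed above; the dictionary layout and names are illustrative.

```python
# Test-set metrics for the "Yes" class plus overall accuracy, copied from the three reports.
reports = {
    "no resampling": {"precision": 0.68, "recall": 0.45, "f1": 0.54, "accuracy": 0.79},
    "SMOTE":         {"precision": 0.51, "recall": 0.77, "f1": 0.61, "accuracy": 0.74},
    "Tomek links":   {"precision": 0.62, "recall": 0.52, "f1": 0.57, "accuracy": 0.78},
}

print(f"{'model':<15}{'precision':>10}{'recall':>8}{'f1':>6}{'accuracy':>10}")
for name, m in reports.items():
    print(f"{name:<15}{m['precision']:>10.2f}{m['recall']:>8.2f}{m['f1']:>6.2f}{m['accuracy']:>10.2f}")

# Which model wins depends on the metric that is chosen.
best_recall = max(reports, key=lambda name: reports[name]["recall"])
best_accuracy = max(reports, key=lambda name: reports[name]["accuracy"])
print(best_recall, best_accuracy)  # SMOTE no resampling
```

If catching churners matters most, recall on "Yes" is arguably the number to watch, and there SMOTE is clearly ahead despite its lower accuracy.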

Regardless of whether the metrics go up or down, imbalanced data should be balanced in order to get trustworthy results. Otherwise, the results can be strongly biased toward the majority class and, therefore, would not be meaningful.