<a href="https://colab.research.google.com/github/undefined-ankit/teleco_churn_prediction/blob/main/handling_of_imbalanced_telecom_churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import confusion_matrix, classification_report

In [3]:
import pandas as pd
import numpy as np
data = pd.read_csv("/content/drive/My Drive/datasets/Customer-Churn.csv")

In [4]:
data.shape

(7043, 21)

## **Data Cleaning**

In [6]:
# customerID is not needed for prediction, so drop it
data.drop('customerID',axis='columns',inplace=True)

In [7]:
def unique_values(data):
  '''Print the unique values of each column in the DataFrame.'''
  for col in data:
    print(f'{col}: {data[col].unique()}')

In [8]:
# TotalCharges is stored as strings (object dtype) and must be converted to numeric.
# Rows where the conversion fails are the ones with a blank TotalCharges:
data[pd.to_numeric(data.TotalCharges,errors='coerce').isnull()]
# Drop those blank rows from the dataset
data1 = data[data['TotalCharges']!=' ']

In [9]:
# Convert TotalCharges from str to float. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice.
data1 = data1.assign(TotalCharges=pd.to_numeric(data1['TotalCharges']))


In [10]:
data1.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [11]:
unique_values(data1)

gender: ['Female' 'Male']
SeniorCitizen: [0 1]
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
tenure: [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 39]
PhoneService: ['No' 'Yes']
MultipleLines: ['No phone service' 'No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes' 'No internet service']
OnlineBackup: ['Yes' 'No' 'No internet service']
DeviceProtection: ['No' 'Yes' 'No internet service']
TechSupport: ['No' 'Yes' 'No internet service']
StreamingTV: ['No' 'Yes' 'No internet service']
StreamingMovies: ['No' 'Yes' 'No internet service']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
MonthlyCharges: [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]
TotalCharges: [  29.85 1889.5   108.15 ...  346.45  306.6  6844.5 ]
Churn: ['No' 'Yes']

In [12]:
# Collapse the "no service" labels into a plain 'No'.
# replace() without inplace returns a new DataFrame, so no copy warning is raised.
data1 = data1.replace({'No phone service': 'No', 'No internet service': 'No'})


In [13]:
unique_values(data1)

gender: ['Female' 'Male']
SeniorCitizen: [0 1]
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
tenure: [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 39]
PhoneService: ['No' 'Yes']
MultipleLines: ['No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes']
OnlineBackup: ['Yes' 'No']
DeviceProtection: ['No' 'Yes']
TechSupport: ['No' 'Yes']
StreamingTV: ['No' 'Yes']
StreamingMovies: ['No' 'Yes']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
MonthlyCharges: [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]
TotalCharges: [  29.85 1889.5   108.15 ...  346.45  306.6  6844.5 ]
Churn: ['No' 'Yes']


In [14]:
# The model needs numeric inputs, so map Yes/No to 1/0.
# Note: the replacement is applied to the whole DataFrame, not column by column,
# so the 'No' category of InternetService also becomes 0. This is why the
# one-hot column is later named InternetService_0.
data1 = data1.replace({'Yes':1,'No':0})


In [15]:
unique_values(data1)

gender: ['Female' 'Male']
SeniorCitizen: [0 1]
Partner: [1 0]
Dependents: [0 1]
tenure: [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 39]
PhoneService: [0 1]
MultipleLines: [0 1]
InternetService: ['DSL' 'Fiber optic' 0]
OnlineSecurity: [0 1]
OnlineBackup: [1 0]
DeviceProtection: [0 1]
TechSupport: [0 1]
StreamingTV: [0 1]
StreamingMovies: [0 1]
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: [1 0]
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
MonthlyCharges: [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]
TotalCharges: [  29.85 1889.5   108.15 ...  346.45  306.6  6844.5 ]
Churn: [0 1]


In [16]:
# Encode gender as Female=0, Male=1
data1 = data1.replace({'gender': {'Female':0,'Male':1}})


In [17]:
# One-hot encode the remaining categorical columns so that they become numeric
data2 = pd.get_dummies(data=data1,columns=['InternetService','Contract','PaymentMethod'])


**The data is now cleaned and ready for processing, but the numeric columns are not on the same scale. We therefore scale those columns so the model fits better.**

In [18]:
data2.sample(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,InternetService_0,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
6157,1,0,0,0,3,1,0,0,0,0,0,0,0,0,19.85,64.55,1,1,0,0,1,0,0,0,0,0,1
3800,0,0,0,1,66,1,1,1,0,1,1,0,0,1,90.95,5930.05,0,0,0,1,0,0,1,0,1,0,0
960,1,1,1,0,7,0,0,1,0,0,0,0,0,0,29.8,220.45,0,0,1,0,1,0,0,1,0,0,0
858,0,0,1,0,66,1,1,1,1,0,1,0,0,0,89.0,5898.6,0,0,0,1,0,1,0,0,0,1,0
2659,0,0,0,0,61,1,1,1,0,1,1,1,1,0,86.45,5175.3,0,0,1,0,0,0,1,0,1,0,0


In [19]:
# Scale tenure, TotalCharges and MonthlyCharges to the [0, 1] range
cols_to_scale = ["tenure",'TotalCharges','MonthlyCharges']

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

data2[cols_to_scale] = scaler.fit_transform(data2[cols_to_scale])
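
As a rough illustration of what `MinMaxScaler` does to each column, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to a small made-up frame and compares the result with scikit-learn's output. The values and the names `toy`, `manual` and `scaled` are assumptions for illustration only; they are not rows from the churn data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, not taken from the churn dataset
toy = pd.DataFrame({"tenure": [1, 12, 72], "MonthlyCharges": [20.0, 70.0, 120.0]})

# Min-max scaling by hand: each column is mapped onto [0, 1]
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

# True when both approaches agree
print(np.allclose(manual.to_numpy(), scaled))
```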


In [20]:
data2.sample(2)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,InternetService_0,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
4720,1,1,0,0,0.887324,1,1,0,0,1,1,1,1,0,0.834328,0.769594,0,0,0,1,0,1,0,0,0,1,0
2769,0,0,0,0,0.197183,1,0,0,0,0,0,0,0,0,0.014925,0.032137,1,1,0,0,1,0,0,1,0,0,0


## **Build The Model**

In [21]:
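def ANN(x_train,y_train,x_test,y_test):
  '''Train a small feed-forward network on the training data, print its
  test metrics and return the rounded 0/1 predictions for x_test.
  '''
  # 26 input features -> two hidden ReLU layers -> one sigmoid output
  model = keras.Sequential([
                            keras.layers.Dense(13,input_shape=(26,),activation='relu'),
                            keras.layers.Dense(7,activation='relu'),
                            keras.layers.Dense(1,activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss = 'binary_crossentropy',
                metrics=['accuracy'])

  model.fit(x_train,y_train,epochs =100,verbose = 0)

  # evaluate() returns [loss, accuracy] on the test set
  print(model.evaluate(x_test,y_test))

  # Round the sigmoid probabilities to hard 0/1 labels
  y_pred = model.predict(x_test)
  y_pred = np.round(y_pred)

  print('Classification Report: \n',classification_report(y_test,y_pred))

  return y_pred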

## **Model Training and fitting on imbalanced dataset**

In [22]:
# x_data holds the model features and y_data the target
x_data = data2.drop('Churn',axis='columns')
y_data= data2['Churn']

In [23]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_data,y_data,test_size = 0.2,random_state = 5)

In [24]:
y_train.value_counts()

0    4164
1    1461
Name: Churn, dtype: int64

In [25]:
y_pred_imbalanced = ANN(x_train,y_train,x_test,y_test)

[0.4434742331504822, 0.7882018685340881]
Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.90      0.86       999
           1       0.68      0.52      0.59       408

    accuracy                           0.79      1407
   macro avg       0.75      0.71      0.72      1407
weighted avg       0.78      0.79      0.78      1407



**In the model above, the f1-score is 0.86 for class 0 but only 0.59 for class 1. Because the dataset is imbalanced, the model appears biased towards the majority class. Next we balance the dataset and check whether the f1-score stays the same or improves.**



*   For a good model, the f1-score should be similar across all classes.
*   Precision and recall should also be balanced (one should not be much higher than the other).
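
To make the relationship between these metrics concrete, here is a minimal sketch that derives precision, recall and f1-score for the positive class from a confusion matrix. The label arrays `y_true` and `y_hat` are hypothetical toy data, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 8 negatives and 4 positives (not results from this notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Because the f1-score is a harmonic mean, it is pulled towards the lower of precision and recall, which is why a low recall on the minority class drags its f1-score down.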



# **Methods for Balancing the Dataset**

# **Method 1: Under Sampling of Majority Class**

In this method, we randomly draw from the majority class the same number of samples as the minority class contains.


*   Let class A have 1000 data points.
*   Let class B have 5000 data points.
*   Then randomly draw 1000 samples from class B to match class A, and fit the model on the balanced data.
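
The class A / class B example can be sketched on a toy frame as follows. The frame `toy`, its columns and the fixed `random_state` are assumptions made for illustration; the notebook cells below apply the same idea to the churn data.

```python
import pandas as pd

# Hypothetical toy frame: 1000 rows of class A and 5000 rows of class B
toy = pd.DataFrame({"label": ["A"] * 1000 + ["B"] * 5000})
toy["feature"] = range(len(toy))

minority = toy[toy["label"] == "A"]
majority = toy[toy["label"] == "B"]

# Draw as many majority rows (without replacement) as there are minority rows
balanced = pd.concat([majority.sample(len(minority), random_state=0), minority])

print(balanced["label"].value_counts())
```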



In [26]:
# We try to balance the dataset
count_class0,count_class1= data2.Churn.value_counts()
print(f'class 0: {count_class0} and class 1: {count_class1}')
#divide by class
data_class_0 = data2[data2['Churn']==0]
data_class_1 = data2[data2['Churn']==1]
print(data_class_0.shape,data_class_1.shape)

class 0: 5163 and class 1: 1869
(5163, 27) (1869, 27)


In [27]:
# The data has 5163 samples in class 0 and 1869 samples in class 1.
# Randomly draw 1869 samples from class 0 to match class 1.
under_sampled_data = pd.concat([data_class_0.sample(count_class1),data_class_1],axis = 0)
under_sampled_data.Churn.value_counts()

1    1869
0    1869
Name: Churn, dtype: int64

In [28]:
# the under-sampled dataset is now balanced
x = under_sampled_data.drop('Churn',axis='columns')
y= under_sampled_data['Churn']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 15,stratify=y)

In [29]:
y_pred_under_sampled = ANN(x_train,y_train,x_test,y_test)

[0.5493006110191345, 0.740641713142395]
Classification Report: 
               precision    recall  f1-score   support

           0       0.73      0.76      0.75       374
           1       0.75      0.72      0.73       374

    accuracy                           0.74       748
   macro avg       0.74      0.74      0.74       748
weighted avg       0.74      0.74      0.74       748



# **Method 2: Over Sampling of Minority Class**

In this method, we duplicate samples of the minority class (sampling with replacement) until the dataset is balanced.


*   Let class A have 1000 data points.
*   Let class B have 5000 data points.
*   Then create 4000 duplicate samples in class A and fit the model on the balanced data.
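
A minimal sketch of the class A / class B example on a toy frame is shown below. The frame `toy`, its columns and the fixed `random_state` are illustrative assumptions. Unlike the notebook cell that follows, this sketch keeps every original minority row and adds 4000 resampled copies on top.

```python
import pandas as pd

# Hypothetical toy frame: 1000 rows of class A and 5000 rows of class B
toy = pd.DataFrame({"label": ["A"] * 1000 + ["B"] * 5000})
toy["feature"] = range(len(toy))

minority = toy[toy["label"] == "A"]
majority = toy[toy["label"] == "B"]

# Draw the missing 4000 minority rows with replacement, so rows repeat
extra = minority.sample(len(majority) - len(minority), replace=True, random_state=0)
balanced = pd.concat([majority, minority, extra])

print(balanced["label"].value_counts())
```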



In [30]:
# The data has 5163 samples in class 0 and 1869 samples in class 1.
# Sample class 1 with replacement (creating duplicates) until it also has 5163 rows.
over_sampled_data = pd.concat([data_class_0,data_class_1.sample(count_class0,replace=True)],axis = 0)
over_sampled_data.Churn.value_counts()

1    5163
0    5163
Name: Churn, dtype: int64

In [31]:
x=over_sampled_data.drop('Churn',axis = 'columns')
y=over_sampled_data['Churn']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 15,stratify=y)

In [32]:
y_pred_over_sampled = ANN(x_train,y_train,x_test,y_test)

[0.46310967206954956, 0.7710551619529724]
Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.71      0.76      1033
           1       0.74      0.83      0.78      1033

    accuracy                           0.77      2066
   macro avg       0.78      0.77      0.77      2066
weighted avg       0.78      0.77      0.77      2066



# **Method 3: Over Sampling of Minority Class using SMOTE**

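In this method, we balance the dataset with SMOTE (Synthetic Minority Over-sampling Technique). Instead of duplicating rows, SMOTE creates new synthetic minority samples by interpolating between existing minority samples and their nearest neighbours.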
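
The core idea behind SMOTE is to place a synthetic sample somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step on made-up 2-D points; it is not imbalanced-learn's implementation, and the helper `synth_point` and its parameter `k` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synth_point(points, i, k=2):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

new_point = synth_point(minority, 0)
print(new_point)
```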




In [33]:
x = data2.drop('Churn',axis='columns')
y= data2['Churn']

In [34]:
y.value_counts()

0    5163
1    1869
Name: Churn, dtype: int64

In [35]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='minority')

x_sm,y_sm = smote.fit_resample(x,y)  # fit_sample was removed in newer imbalanced-learn releases
x_sm = pd.DataFrame(x_sm)
y_sm = pd.Series(y_sm)
y_sm.value_counts()



1    5163
0    5163
dtype: int64

In [36]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_sm,y_sm,test_size = 0.2,random_state = 15,stratify=y_sm)

In [37]:
y_train.value_counts()# our training data is balanced now.

1    4130
0    4130
dtype: int64

In [38]:
y_pred_smote = ANN(x_train,y_train,x_test,y_test)


[0.47740474343299866, 0.7686350345611572]
Classification Report: 
               precision    recall  f1-score   support

           0       0.79      0.73      0.76      1033
           1       0.75      0.80      0.78      1033

    accuracy                           0.77      2066
   macro avg       0.77      0.77      0.77      2066
weighted avg       0.77      0.77      0.77      2066



# **Method 4: Ensemble Method**

In this method, we split the majority class into chunks, pair each chunk with the full minority class, train one model per balanced group, and combine the models by majority vote.


*   Let class A have 1000 data points.
*   Let class B have 3000 data points.
*   Then split the majority class into 3 parts of 1000 samples each, pair every part with the 1000 samples of class A, and fit one model per group.
*   For each prediction, take the class that receives the majority of votes across the models.
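
The voting step can be sketched with NumPy as follows. The three prediction arrays are hypothetical toy values, not outputs of the models trained below.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five samples
pred_a = np.array([1, 0, 1, 0, 1])
pred_b = np.array([1, 0, 0, 0, 1])
pred_c = np.array([0, 1, 1, 0, 1])

# A sample is labelled 1 when at least two of the three models vote 1
votes = pred_a + pred_b + pred_c
final = (votes >= 2).astype(int)

print(final)
```

With an odd number of models a tie cannot occur, which is one reason to use three groups here.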


In [39]:
data2.Churn.value_counts()


0    5163
1    1869
Name: Churn, dtype: int64

In [40]:
# As seen above, the dataset has 5163 samples in class 0 and 1869 samples in class 1.
x = data2.drop('Churn',axis='columns')
y= data2['Churn']

In [41]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 15,stratify=y)

In [42]:
y_train.value_counts()

0    4130
1    1495
Name: Churn, dtype: int64

In [43]:
df3 = x_train.copy()
df3['Churn'] = y_train

In [44]:
data_class0 = df3[df3.Churn==0]
data_class1 = df3[df3.Churn==1]

In [45]:
def get_train_batch(df_majority,df_minority,start,end):
  '''Combine rows start:end of the majority class with the full minority
  class and return the features and the target of that group.
  '''
  data_train = pd.concat([df_majority[start:end],df_minority],axis=0)
  x_train = data_train.drop('Churn',axis='columns')
  y_train= data_train['Churn']
  return x_train,y_train

In [46]:
# 1st model: 1495 samples from each class.
x_train,y_train = get_train_batch(data_class0,data_class1,0,1495)
print(y_train.value_counts())
y_pred1 = ANN(x_train,y_train,x_test,y_test)

1    1495
0    1495
Name: Churn, dtype: int64
[0.5204752087593079, 0.7306325435638428]
Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.70      0.79      1033
           1       0.50      0.81      0.62       374

    accuracy                           0.73      1407
   macro avg       0.70      0.76      0.70      1407
weighted avg       0.80      0.73      0.75      1407



In [47]:
# 2nd model: the next 1495 class 0 samples with all 1495 class 1 samples.
x_train,y_train = get_train_batch(data_class0,data_class1,1495,2990)
print(y_train.value_counts())
y_pred2 = ANN(x_train,y_train,x_test,y_test)

1    1495
0    1495
Name: Churn, dtype: int64
[0.5718969106674194, 0.7292110919952393]
Classification Report: 
               precision    recall  f1-score   support

           0       0.89      0.72      0.80      1033
           1       0.49      0.76      0.60       374

    accuracy                           0.73      1407
   macro avg       0.69      0.74      0.70      1407
weighted avg       0.79      0.73      0.74      1407



In [48]:
# 3rd model: the remaining 1140 class 0 samples with all 1495 class 1 samples.
x_train,y_train = get_train_batch(data_class0,data_class1,2990,4130)
print(y_train.value_counts())
y_pred3 = ANN(x_train,y_train,x_test,y_test)

1    1495
0    1140
Name: Churn, dtype: int64
[0.6204428672790527, 0.6680881381034851]
Classification Report: 
               precision    recall  f1-score   support

           0       0.92      0.60      0.73      1033
           1       0.44      0.86      0.58       374

    accuracy                           0.67      1407
   macro avg       0.68      0.73      0.65      1407
weighted avg       0.79      0.67      0.69      1407



In [49]:
# y_pred_final holds, for each test sample, the class that gets the majority of votes from the three models
y_pred_final = y_pred1.copy()
for i in range(len(y_pred1)):
  n_ones = y_pred1[i]+y_pred2[i]+y_pred3[i]
  if n_ones>1:
    y_pred_final[i]=1
  else:
    y_pred_final[i]=0

In [50]:
print(classification_report(y_test,y_pred_final))

              precision    recall  f1-score   support

           0       0.91      0.67      0.78      1033
           1       0.48      0.83      0.61       374

    accuracy                           0.71      1407
   macro avg       0.70      0.75      0.69      1407
weighted avg       0.80      0.71      0.73      1407

