# 4.0 Pre-processing and Training Data - Customer Churn

## 4.1 Introduction
### Customer Churn
The Telco company provides telephone and internet service in California. Thanks to its promotions, the company reached 7,073 customers during the third quarter. However, it has serious difficulty retaining them: the disengagement (churn) rate during this quarter was around 26.54%. This difficulty is reflected in the structure of the company's customer base:

- 25% of customers have a relationship of less than 9 months
- The median duration of the customer relationship is 29 months, far from the 72 months of the most loyal customers
- 75% of customers have a relationship duration of less than 55 months

Telco management would like to commit to a loyalty policy that reduces the disengagement rate to less than 10% by:

- Identifying, at the start of the relationship, the customers most likely to leave the company after the promotion period
- Taking concrete actions likely to build loyalty among current customers

The Data Science team was entrusted with the mission of identifying the factors on which to act to retain existing customers, and of developing a model to identify the potential customers towards whom future promotions should be directed. This work involves the participation of:

- General management, for orientations
- Director of the Customer Service team
- Director of the Marketing team
- Director of Technology

It was produced using the information available in the customer database, which contains 33 variables for each customer. The approach adopted is:

- To retain the 'Churn Label' column as the dependent (target) variable
- To determine to what extent each of the other variables helps predict the target variable
- To retain the most relevant variables to develop a model based on the regression model with the smallest absolute error

This document is the continuation of the EDA work. It contains:

 - Creation of dummy or indicator features for categorical variables
 - Split of the data into testing and training datasets
 - Standardization of the magnitude of numeric features using a scaler

## 4.2 Import Common Libraries

In [1]:
# Code task 1#
# Importation of common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.ticker as ticker

## 4.3 Read the Dataset

In [2]:
# Code task 2#
# Read the dataset using read_csv
df_churn = pd.read_csv('df_churn_eda_f.csv')

In [3]:
# Configure pandas to display all the columns of the dataset
pd.set_option('display.max_columns', None)

In [4]:
# Code task 3#
# Display the first 5 rows of the dataset
df_churn.head()

Unnamed: 0,CustomerID,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Churn Label,Churn Score,CLTV,Name,County
0,3668-QPYBK,Male,0,0,0,2.0,1,0,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,1,86.0,3239.0,Los Angeles,Los Angeles County
1,3668-QPYBK,Male,0,0,0,2.0,1,0,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,1,86.0,3239.0,Los Angeles,Los Angeles County
2,3668-QPYBK,Male,0,0,0,2.0,1,0,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,1,86.0,3239.0,Los Angeles,Los Angeles County
3,9237-HQITU,Female,0,0,1,2.0,1,0,Fiber optic,0,0,0,0,0,0,Month-to-month,1,Electronic check,70.7,1,67.0,2701.0,Los Angeles,Los Angeles County
4,9237-HQITU,Female,0,0,1,2.0,1,0,Fiber optic,0,0,0,0,0,0,Month-to-month,1,Electronic check,70.7,1,67.0,2701.0,Los Angeles,Los Angeles County
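
The rows above show the same `CustomerID` several times, so the number of rows may be larger than the number of distinct customers. A quick way to quantify this is sketched below. The toy DataFrame is hypothetical and only mimics the repeated rows; on the real data the same three calls would be applied to `df_churn`.

```python
import pandas as pd

# Hypothetical toy frame mimicking the repeated rows seen in head()
toy = pd.DataFrame({
    "CustomerID": ["3668-QPYBK", "3668-QPYBK", "3668-QPYBK", "9237-HQITU", "9237-HQITU"],
    "Monthly Charges": [53.85, 53.85, 53.85, 70.70, 70.70],
})

n_rows = len(toy)                                # total number of rows
n_customers = toy["CustomerID"].nunique()        # distinct customers
n_exact_duplicates = int(toy.duplicated().sum()) # rows identical to an earlier row

print(f"rows={n_rows}, unique customers={n_customers}, exact duplicate rows={n_exact_duplicates}")
```

Whether repeated rows should be kept or dropped depends on how they were produced in the earlier steps, so this check only reports them.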


In [5]:
#Code task 3#
# Check the columns and their contents using the info() method
df_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11779 entries, 0 to 11778
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         11779 non-null  object 
 1   Gender             11779 non-null  object 
 2   Senior Citizen     11779 non-null  int64  
 3   Partner            11779 non-null  int64  
 4   Dependents         11779 non-null  int64  
 5   Tenure Months      11779 non-null  float64
 6   Phone Service      11779 non-null  int64  
 7   Multiple Lines     11779 non-null  int64  
 8   Internet Service   11779 non-null  object 
 9   Online Security    11779 non-null  int64  
 10  Online Backup      11779 non-null  int64  
 11  Device Protection  11779 non-null  int64  
 12  Tech Support       11779 non-null  int64  
 13  Streaming TV       11779 non-null  int64  
 14  Streaming Movies   11779 non-null  int64  
 15  Contract           11779 non-null  object 
 16  Paperless Billing  117

## 4.4 Modify the Column 'County'

In [6]:
#Code task 4#
# Create a new DataFrame with the number of customers by County
count_counties = df_churn.groupby('County').size() # Count the number of customers by County
count_counties2 = count_counties.to_frame(name='Count') # Convert the result to a DataFrame with a 'Count' column
count_counties2.index.name = 'County' # Give the index a name
count_counties2.head()

County,Count
Alameda County,184
Alpine County,4
Alpine County and Amador County,4
Amador County,40
Butte County,96


In [7]:
#Code task 5#
# Visualize the index of the County
count_counties2.index

Index([' Alameda County', ' Alpine County', ' Alpine County and Amador County',
       ' Amador County', ' Butte County', ' Calaveras County',
       ' Colusa County', ' Contra Costa County', ' Del Norte County',
       ' El Dorado County',
       ...
       'Stanislaus County', 'Sutter County', 'Tehama County', 'Trinity County',
       'Tulare County', 'Tuolumne County', 'Ventura County',
       'Wuk Village CDP ', 'Yolo County', 'Yuba County'],
      dtype='object', name='County', length=121)
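
The index above mixes names with and without surrounding spaces (for example `' Alameda County'` and `'Wuk Village CDP '`), so the same county may be counted under two different labels. A minimal sketch of normalizing the labels before counting is shown below. The toy Series is hypothetical; on the real data the same `str.strip()` call would be applied to `df_churn['County']`.

```python
import pandas as pd

# Hypothetical county labels with stray leading/trailing spaces
toy = pd.Series(
    [" Alameda County", "Alameda County", "Wuk Village CDP ", "Yolo County"],
    name="County",
)

n_before = toy.nunique()     # ' Alameda County' and 'Alameda County' count as two labels
cleaned = toy.str.strip()    # remove leading and trailing whitespace
n_after = cleaned.nunique()  # the two Alameda variants now collapse into one

print(n_before, n_after)
```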

### Divide the counties into two categories: those whose number of customers is within the interquartile range, and the outliers

In [8]:
# Code task 6#
# Quartiles of the per-county customer counts
q25 = count_counties2['Count'].quantile(0.25)
q75 = count_counties2['Count'].quantile(0.75)

# Create mask for counties whose customer count lies strictly inside the interquartile range
mask_interquantile = df_churn['County'].isin(
    count_counties2[(count_counties2['Count'] < q75) & (count_counties2['Count'] > q25)].index
)
df_churn.loc[mask_interquantile, 'County'] = 'County_Interquantile'

# Code task 7#
# Create mask for the remaining counties (count at or beyond either quartile), labelled 'Outlier'
mask_outlier = df_churn['County'].isin(
    count_counties2[(count_counties2['Count'] >= q75) | (count_counties2['Count'] <= q25)].index
)
df_churn.loc[mask_outlier, 'County'] = 'Outlier'


In [9]:
#Code task 8#
# Visualize the modified column
df_churn['County'].value_counts()

County
Outlier                 8798
County_Interquantile    2981
Name: count, dtype: int64
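
To make the boundary handling concrete, here is a small worked example of the same bucketing rule on hypothetical data. It is a sketch, not the notebook's code: the county names and the `County_group` column are made up, and `np.where` replaces the two masks. Counties whose count equals a quartile fall into the 'Outlier' group, as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per customer, five counties of different sizes
toy = pd.DataFrame({"County": ["A"] * 1 + ["B"] * 3 + ["C"] * 4 + ["D"] * 6 + ["E"] * 10})

counts = toy["County"].value_counts()     # customers per county: A=1, B=3, C=4, D=6, E=10
q25, q75 = counts.quantile([0.25, 0.75])  # quartiles of the per-county counts

# Map each row to its county's customer count, then bucket the row
row_counts = toy["County"].map(counts)
toy["County_group"] = np.where(
    (row_counts > q25) & (row_counts < q75), "County_Interquantile", "Outlier"
)

print(toy.groupby("County")["County_group"].first())
```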

## 4.5 Convert the categorical columns to a one-hot encoded DataFrame

In [10]:
#Code task 9#
# Select columns to one-hot encode and define prefixes
columns_to_encoded = ['Gender', 'Internet Service','Contract','Payment Method', 'County']

prefix = {'Gender': 'G_', 'Internet Service':'IS_', 'Contract':'CO_', 'Payment Method':'PM_','County':'Coun_'}

#Code task 10#
# One-Hot Encode 
df_churn_encoded = pd.get_dummies(df_churn, columns = columns_to_encoded, prefix = prefix)

#Code task 11#
# Display the first 5 rows of the encoded Data Frame
df_churn_encoded.head()

Unnamed: 0,CustomerID,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Paperless Billing,Monthly Charges,Churn Label,Churn Score,CLTV,Name,G__Female,G__Male,IS__DSL,IS__Fiber optic,IS__No,CO__Month-to-month,CO__One year,CO__Two year,PM__Bank transfer (automatic),PM__Credit card (automatic),PM__Electronic check,PM__Mailed check,Coun__County_Interquantile,Coun__Outlier
0,3668-QPYBK,0,0,0,2.0,1,0,1,1,0,0,0,0,1,53.85,1,86.0,3239.0,Los Angeles,False,True,True,False,False,True,False,False,False,False,False,True,False,True
1,3668-QPYBK,0,0,0,2.0,1,0,1,1,0,0,0,0,1,53.85,1,86.0,3239.0,Los Angeles,False,True,True,False,False,True,False,False,False,False,False,True,False,True
2,3668-QPYBK,0,0,0,2.0,1,0,1,1,0,0,0,0,1,53.85,1,86.0,3239.0,Los Angeles,False,True,True,False,False,True,False,False,False,False,False,True,False,True
3,9237-HQITU,0,0,1,2.0,1,0,0,0,0,0,0,0,1,70.7,1,67.0,2701.0,Los Angeles,True,False,False,True,False,True,False,False,False,False,True,False,False,True
4,9237-HQITU,0,0,1,2.0,1,0,0,0,0,0,0,0,1,70.7,1,67.0,2701.0,Los Angeles,True,False,False,True,False,True,False,False,False,False,True,False,False,True


In [11]:
#Code task 13# 
# Display the columns of the DataFrame
df_churn_encoded.columns

Index(['CustomerID', 'Senior Citizen', 'Partner', 'Dependents',
       'Tenure Months', 'Phone Service', 'Multiple Lines', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Paperless Billing', 'Monthly Charges',
       'Churn Label', 'Churn Score', 'CLTV', 'Name', 'G__Female', 'G__Male',
       'IS__DSL', 'IS__Fiber optic', 'IS__No', 'CO__Month-to-month',
       'CO__One year', 'CO__Two year', 'PM__Bank transfer (automatic)',
       'PM__Credit card (automatic)', 'PM__Electronic check',
       'PM__Mailed check', 'Coun__County_Interquantile', 'Coun__Outlier'],
      dtype='object')
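
The double underscore in names such as `G__Female` comes from combining a prefix that already ends in `_` with the default separator of `pd.get_dummies` (`prefix_sep='_'`). The sketch below, on a hypothetical toy frame, shows how prefixes without the trailing underscore give single-underscore names, and how `dtype=int` produces 0/1 columns directly instead of True/False.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Contract": ["One year", "Two year", "One year"],
})

encoded = pd.get_dummies(
    toy,
    columns=["Gender", "Contract"],
    prefix={"Gender": "G", "Contract": "CO"},  # no trailing underscore in the prefix
    dtype=int,                                 # emit 0/1 integers rather than booleans
)

print(list(encoded.columns))
```

Renaming the dummy columns would change the column names used in the rest of this notebook, so this is only an option for a fresh run.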

In [12]:
#Code task 14#
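# Convert the True/False dummy columns into numeric values
# Note: astype(int) is applied to every column listed below, so float columns are
# truncated as well (e.g. Monthly Charges 53.85 becomes 53)
# Columns to convert to numeric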
columns_to_numeric =['Senior Citizen', 'Partner', 'Dependents', 'Tenure Months',
       'Phone Service', 'Multiple Lines', 'Online Security', 'Online Backup',
       'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies',
       'Paperless Billing', 'Monthly Charges', 'Churn Label',
       'Churn Score', 'CLTV', 'G__Female', 'G__Male', 'IS__DSL',
       'IS__Fiber optic', 'IS__No', 'CO__Month-to-month', 'CO__One year',
       'CO__Two year', 'PM__Bank transfer (automatic)',
       'PM__Credit card (automatic)', 'PM__Electronic check',
       'PM__Mailed check', 'Coun__County_Interquantile', 'Coun__Outlier']

# Code task 15#
# Convert the columns
df_churn_encoded[columns_to_numeric] = df_churn_encoded[columns_to_numeric].astype(int)

In [13]:

df_churn_encoded.head()

Unnamed: 0,CustomerID,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Paperless Billing,Monthly Charges,Churn Label,Churn Score,CLTV,Name,G__Female,G__Male,IS__DSL,IS__Fiber optic,IS__No,CO__Month-to-month,CO__One year,CO__Two year,PM__Bank transfer (automatic),PM__Credit card (automatic),PM__Electronic check,PM__Mailed check,Coun__County_Interquantile,Coun__Outlier
0,3668-QPYBK,0,0,0,2,1,0,1,1,0,0,0,0,1,53,1,86,3239,Los Angeles,0,1,1,0,0,1,0,0,0,0,0,1,0,1
1,3668-QPYBK,0,0,0,2,1,0,1,1,0,0,0,0,1,53,1,86,3239,Los Angeles,0,1,1,0,0,1,0,0,0,0,0,1,0,1
2,3668-QPYBK,0,0,0,2,1,0,1,1,0,0,0,0,1,53,1,86,3239,Los Angeles,0,1,1,0,0,1,0,0,0,0,0,1,0,1
3,9237-HQITU,0,0,1,2,1,0,0,0,0,0,0,0,1,70,1,67,2701,Los Angeles,1,0,0,1,0,1,0,0,0,0,1,0,0,1
4,9237-HQITU,0,0,1,2,1,0,0,0,0,0,0,0,1,70,1,67,2701,Los Angeles,1,0,0,1,0,1,0,0,0,0,1,0,0,1
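
In the output above, `Monthly Charges` appears as 53 and 70, whereas the earlier `head()` showed 53.85 and 70.7: `astype(int)` truncates the decimals of float columns. If the decimals should be preserved, one option is to convert only the boolean dummy columns. The sketch below uses a hypothetical toy frame; `bool_cols` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame: one float column and two boolean dummy columns
toy = pd.DataFrame({
    "Monthly Charges": [53.85, 70.70],
    "G__Female": [False, True],
    "G__Male": [True, False],
})

# Select only the boolean columns and cast them to int; floats are left untouched
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})

print(toy)
print(toy.dtypes)
```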


In [14]:
#Code task 16#
# Remove the columns 'CustomerID' and 'Name', which will not be included in the model
df_churn_encoded = df_churn_encoded.drop(['CustomerID','Name'], axis=1)
df_churn_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11779 entries, 0 to 11778
Data columns (total 31 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Senior Citizen                 11779 non-null  int32
 1   Partner                        11779 non-null  int32
 2   Dependents                     11779 non-null  int32
 3   Tenure Months                  11779 non-null  int32
 4   Phone Service                  11779 non-null  int32
 5   Multiple Lines                 11779 non-null  int32
 6   Online Security                11779 non-null  int32
 7   Online Backup                  11779 non-null  int32
 8   Device Protection              11779 non-null  int32
 9   Tech Support                   11779 non-null  int32
 10  Streaming TV                   11779 non-null  int32
 11  Streaming Movies               11779 non-null  int32
 12  Paperless Billing              11779 non-null  int32
 13  Monthly Charges 

## 4.6 Verify the Structure of the Dataset

In [16]:
#Code task 17#
# Verify if the dataset is balanced
# Get the count and percentage of each unique value in the label column
value_counts = df_churn_encoded['Churn Label'].value_counts()
percentage_counts = df_churn_encoded['Churn Label'].value_counts(normalize=True) * 100

#Code task 18#
# Create a new DataFrame with the results
structure_dependent_variable = pd.DataFrame({'Count': value_counts, 'Percentage': percentage_counts})

print(structure_dependent_variable)


             Count  Percentage
Churn Label                   
0             8620   73.181085
1             3159   26.818915


The dataset is imbalanced: about 73% of the rows belong to class 0 and about 27% to class 1. We will use RandomOverSampler, which generates new samples by randomly sampling the currently available minority-class samples with replacement.

The documentation for this method can be found here: https://imbalanced-learn.org/stable/over_sampling.html

## 4.7 Split the Data into Training and Test

In [17]:
#Code task 19#
# Dependent variable or target: Churn Label
y = df_churn_encoded['Churn Label']

#Code task 20#
# Independent variables or features
X = df_churn_encoded.drop(['Churn Label'], axis=1)

### 4.7.1 Generate a balanced dataset

In [18]:
#Code task 21#
# Oversampling the minority Class
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

#Code task 22#
# Print the class counts after resampling
from collections import Counter
print(sorted(Counter(y_resampled).items()))


[(0, 8620), (1, 8620)]


In [19]:
#Code task 23#
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.20, random_state=42,stratify=y_resampled)
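
One caveat, offered as a suggestion rather than a correction of the pipeline above: when oversampling happens before the split, copies of the same minority row can land in both the training and the test set, which tends to make test scores look better than they are. A minimal sketch of the alternative order (split first, then oversample the training part only) follows. The toy DataFrame and its values are hypothetical, and plain pandas sampling with replacement stands in for RandomOverSampler.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df_churn_encoded: 5 churners, 15 non-churners
toy = pd.DataFrame({
    "Tenure Months": range(20),
    "Churn Label": [1] * 5 + [0] * 15,
})
X = toy.drop(columns="Churn Label")
y = toy["Churn Label"]

# 1) Split first, so the test set contains only original rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 2) Oversample the minority class inside the training part only
train = X_train.assign(**{"Churn Label": y_train})
counts = train["Churn Label"].value_counts()
minority_label = counts.idxmin()
n_extra = int(counts.max() - counts.min())
extra = train[train["Churn Label"] == minority_label].sample(
    n=n_extra, replace=True, random_state=42
)
train_balanced = pd.concat([train, extra])

print(train_balanced["Churn Label"].value_counts())
print("test rows:", len(X_test))
```

With this order the test set keeps the original class proportions, so metrics computed on it reflect the imbalance the model would face in practice.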

## 4.8 Scale The Data

In [20]:
#Code task 24#
# Initialize the StandardScaler()
scaler = StandardScaler()

#Code task 25#
# Fit and transform the training data
X_train = scaler.fit_transform(X_train)

#Code task 26#
# Transform the testing data 
X_test = scaler.transform(X_test)
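
For reference, the transformation applied here is z = (x - mean) / std, where the mean and standard deviation of each column are computed on the training data only and then reused for the test data. The sketch below reproduces this with NumPy on hypothetical numbers; it illustrates the formula and is not a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical toy arrays: 3 training rows and 1 test row, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std  # each training column now has mean 0 and std 1
test_scaled = (test - mean) / std    # test rows reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.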

In [21]:
#Code task 27#
# Print the scaled data
print("Scaled Training Features:\n", X_train) 
print("Scaled Testing Features:\n", X_test)

Scaled Training Features:
 [[-0.48802888 -0.88712538 -0.46765108 ... -0.51283883 -0.57935961
   0.57935961]
 [-0.48802888  1.12723637  2.13834638 ... -0.51283883 -0.57935961
   0.57935961]
 [ 2.04905906 -0.88712538 -0.46765108 ... -0.51283883  1.72604368
  -1.72604368]
 ...
 [-0.48802888 -0.88712538 -0.46765108 ...  1.94993036  1.72604368
  -1.72604368]
 [-0.48802888  1.12723637  2.13834638 ... -0.51283883 -0.57935961
   0.57935961]
 [-0.48802888 -0.88712538 -0.46765108 ... -0.51283883 -0.57935961
   0.57935961]]
Scaled Testing Features:
 [[-0.48802888  1.12723637 -0.46765108 ... -0.51283883  1.72604368
  -1.72604368]
 [-0.48802888 -0.88712538  2.13834638 ... -0.51283883 -0.57935961
   0.57935961]
 [-0.48802888 -0.88712538 -0.46765108 ... -0.51283883 -0.57935961
   0.57935961]
 ...
 [ 2.04905906  1.12723637 -0.46765108 ... -0.51283883 -0.57935961
   0.57935961]
 [ 2.04905906 -0.88712538 -0.46765108 ... -0.51283883 -0.57935961
   0.57935961]
 [-0.48802888 -0.88712538 -0.46765108 ... -0.

In [22]:
#Code task 28#
# Save the encoded data (before resampling and scaling)
df_churn_encoded.to_csv('df_churn_model_f', index=False)

### End