# Telecom Data - Predict Customer Churn

### How many customers stop using the product?

Dataset Link - https://www.kaggle.com/radmirzosimov/telecom-users-dataset

# Process

![img](https://github.com/SSaishruthi/Guvi_AI_For_India/blob/main/steps.png?raw=1)

##### The AI Fairness 360 toolkit (AIF360) is an open source software toolkit that can help detect and remove bias in machine learning models. It lets developers use state-of-the-art algorithms to regularly check for unwanted bias entering their machine learning pipeline and to mitigate any bias that is discovered.

[AIF360](https://developer.ibm.com/technologies/artificial-intelligence/projects/ai-fairness-360/)

In [1]:
!pip install aif360

Collecting aif360
  Downloading https://files.pythonhosted.org/packages/4c/71/0e19eaf2c513b2328b2b6188770bf1692437380c6e7a1eec3320354e4c87/aif360-0.4.0-py3-none-any.whl (175kB)

In [3]:
!pip install fairlearn

Collecting fairlearn
  Downloading https://files.pythonhosted.org/packages/ea/a4/87a3ee19c036860a0b04dc5c9d51c86b0e147a379981f05fec0b34f8cdfc/fairlearn-0.6.2-py3-none-any.whl (24.6MB)
Installing collected packages: fairlearn
Successfully installed fairlearn-0.6.2


In [4]:
# Load all necessary packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import aif360
from aif360.metrics import BinaryLabelDatasetMetric # requires: !pip install fairlearn
from aif360.algorithms.preprocessing import Reweighing
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

# Data Extraction

In [5]:
# Mount Google Drive so the dataset can be read from it
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
path = "/content/drive/MyDrive/Colab_Notebooks/telecom_users.csv"
df = pd.read_csv(path)
# Dataset is now stored in a Pandas Dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1869,7010-BRBUU,Male,0,Yes,Yes,72,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Credit card (automatic),24.1,1734.65,No
1,4528,9688-YGXVR,Female,0,No,No,44,Yes,No,Fiber optic,No,Yes,Yes,No,Yes,No,Month-to-month,Yes,Credit card (automatic),88.15,3973.2,No
2,6344,9286-DOJGF,Female,1,Yes,No,38,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),74.95,2869.85,Yes
3,6739,6994-KERXL,Male,0,No,No,4,Yes,No,DSL,No,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,55.9,238.5,No
4,432,2181-UAESM,Male,0,No,No,2,Yes,No,DSL,Yes,No,Yes,No,No,No,Month-to-month,No,Electronic check,53.45,119.5,No


In [8]:
# Let's display columns 
df.columns

Index(['Unnamed: 0', 'customerID', 'gender', 'SeniorCitizen', 'Partner',
       'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges',
       'Churn'],
      dtype='object')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5986 entries, 0 to 5985
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        5986 non-null   int64  
 1   customerID        5986 non-null   object 
 2   gender            5986 non-null   object 
 3   SeniorCitizen     5986 non-null   int64  
 4   Partner           5986 non-null   object 
 5   Dependents        5986 non-null   object 
 6   tenure            5986 non-null   int64  
 7   PhoneService      5986 non-null   object 
 8   MultipleLines     5986 non-null   object 
 9   InternetService   5986 non-null   object 
 10  OnlineSecurity    5986 non-null   object 
 11  OnlineBackup      5986 non-null   object 
 12  DeviceProtection  5986 non-null   object 
 13  TechSupport       5986 non-null   object 
 14  StreamingTV       5986 non-null   object 
 15  StreamingMovies   5986 non-null   object 
 16  Contract          5986 non-null   object 


# Data Pre-processing

#### List of columns we will be using to experiment with the fairness toolkit

- gender - Categorical
- tenure - Continuous
- PhoneService - Categorical
- InternetService - Categorical
- DeviceProtection - Categorical
- TechSupport - Categorical
- MonthlyCharges - Continuous
- TotalCharges - Continuous
- Churn - Categorical

In [10]:
# List of categorical variables
cat = ['PhoneService', 'InternetService', 'DeviceProtection', 'TechSupport']

# List of continuous variables
continous = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Variable in which we suspect bias
expected_bias = ['gender']

# Target variable
target = ['Churn']

In [11]:
# Use selected columns; .copy() avoids pandas' SettingWithCopyWarning on later assignments
selected_data = df[['gender', 'tenure', 'PhoneService', 'InternetService', 'DeviceProtection', 
                   'TechSupport', 'MonthlyCharges', 'TotalCharges', 'Churn']].copy()

In [12]:
#Display selected data
selected_data.head()

Unnamed: 0,gender,tenure,PhoneService,InternetService,DeviceProtection,TechSupport,MonthlyCharges,TotalCharges,Churn
0,Male,72,Yes,No,No internet service,No internet service,24.1,1734.65,No
1,Female,44,Yes,Fiber optic,Yes,No,88.15,3973.2,No
2,Female,38,Yes,Fiber optic,No,No,74.95,2869.85,Yes
3,Male,4,Yes,DSL,No,No,55.9,238.5,No
4,Male,2,Yes,DSL,Yes,No,53.45,119.5,No


In [13]:
# Basic information
selected_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5986 entries, 0 to 5985
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            5986 non-null   object 
 1   tenure            5986 non-null   int64  
 2   PhoneService      5986 non-null   object 
 3   InternetService   5986 non-null   object 
 4   DeviceProtection  5986 non-null   object 
 5   TechSupport       5986 non-null   object 
 6   MonthlyCharges    5986 non-null   float64
 7   TotalCharges      5986 non-null   object 
 8   Churn             5986 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 421.0+ KB


In [14]:
# Observe that TotalCharges has blank values
print('Before removing blank values')
print(selected_data[selected_data['TotalCharges'] == ' '].index)

# Replace values that contain only whitespace (any number of spaces) with 0
selected_data['TotalCharges'] = selected_data['TotalCharges'].replace(r'^\s*$', 0, regex=True)

print('After removing blank values')
print(selected_data[selected_data['TotalCharges'] == ' '].index)

# Convert the column from object to float
selected_data['TotalCharges'] = selected_data['TotalCharges'].astype(float)

Before removing blank values
Int64Index([356, 634, 2771, 3086, 3255, 4326, 5375, 5382, 5695, 5951], dtype='int64')
After removing blank values
Int64Index([], dtype='int64')
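
An alternative way to handle the blank strings, shown here as a minimal sketch on a made-up three-row series rather than the real dataset, is `pd.to_numeric` with `errors="coerce"`. It turns unparseable entries into `NaN`, so the number of affected rows stays visible before a fill value is chosen. The `raw` series and the choice of filling with 0 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy column mimicking TotalCharges, including one blank entry
raw = pd.Series(["1734.65", " ", "238.5"])

# Unparseable strings (such as " ") become NaN instead of raising an error
charges = pd.to_numeric(raw, errors="coerce")
n_blank = int(charges.isna().sum())

# Fill the missing values explicitly; 0 mirrors the choice made in the cell above
charges_filled = charges.fillna(0.0)

print(n_blank)
print(charges_filled.tolist())
```

Whether 0 is a sensible fill value depends on the data; dropping the rows or filling with `MonthlyCharges * tenure` are other options one might consider.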


### Encode Data (String -> Numeric)

#### Target Column

In [15]:
# Encode target column
# First let's see unique values in the target column
print('Before encoding:', selected_data['Churn'].unique())

# Encode target column: assign `Yes` to 1 and `No` to 0
selected_data["Churn"] = np.where(selected_data["Churn"].str.contains("Yes"), 1, 0)
print('After encoding:', selected_data['Churn'].unique())

Before encoding: ['No' 'Yes']
After encoding: [0 1]


#### Bias Feature

In [16]:
# We suspect bias in the gender column, so let's encode it separately to keep it as a single column

# First let's see unique values in the gender column
print('Before encoding:', selected_data['gender'].unique())

# Encode gender column: assign `Male` to 1 and `Female` to 0
selected_data["gender"] = np.where(selected_data["gender"] == "Male", 1, 0)

print('After encoding:', selected_data['gender'].unique())

Before encoding: ['Male' 'Female']
After encoding: [1 0]


#### Categorical Column

![img1](https://github.com/SSaishruthi/Guvi_AI_For_India/blob/main/encoding.png?raw=1)

In [17]:
# Display unique values in categorical columns
for col in cat:
    print(selected_data[col].unique())

['Yes' 'No']
['No' 'Fiber optic' 'DSL']
['No internet service' 'Yes' 'No']
['No internet service' 'No' 'Yes']


In [18]:
# Encode the remaining categorical columns using one-hot encoding
dum = pd.get_dummies(selected_data[cat].astype('category'),prefix_sep='=')

In [19]:
dum.head(2)

Unnamed: 0,PhoneService=No,PhoneService=Yes,InternetService=DSL,InternetService=Fiber optic,InternetService=No,DeviceProtection=No,DeviceProtection=No internet service,DeviceProtection=Yes,TechSupport=No,TechSupport=No internet service,TechSupport=Yes
0,0,1,0,0,1,0,1,0,0,1,0
1,0,1,0,1,0,0,0,1,1,0,0


#### Get Processed Dataset

Combine the encoded data with the continuous variables, gender (expected bias), and the target variable.

In [20]:
encoded_df = pd.concat([dum, selected_data[continous], selected_data[expected_bias], selected_data[target]], axis=1)
# axis=1 concatenates the frames column-wise (side by side)

In [21]:
encoded_df.columns

Index(['PhoneService=No', 'PhoneService=Yes', 'InternetService=DSL',
       'InternetService=Fiber optic', 'InternetService=No',
       'DeviceProtection=No', 'DeviceProtection=No internet service',
       'DeviceProtection=Yes', 'TechSupport=No',
       'TechSupport=No internet service', 'TechSupport=Yes', 'tenure',
       'MonthlyCharges', 'TotalCharges', 'gender', 'Churn'],
      dtype='object')

In [22]:
encoded_df.head()

Unnamed: 0,PhoneService=No,PhoneService=Yes,InternetService=DSL,InternetService=Fiber optic,InternetService=No,DeviceProtection=No,DeviceProtection=No internet service,DeviceProtection=Yes,TechSupport=No,TechSupport=No internet service,TechSupport=Yes,tenure,MonthlyCharges,TotalCharges,gender,Churn
0,0,1,0,0,1,0,1,0,0,1,0,72,24.1,1734.65,1,0
1,0,1,0,1,0,0,0,1,1,0,0,44,88.15,3973.2,0,0
2,0,1,0,1,0,1,0,0,1,0,0,38,74.95,2869.85,0,1
3,0,1,1,0,0,1,0,0,1,0,0,4,55.9,238.5,1,0
4,0,1,1,0,0,0,0,1,1,0,0,2,53.45,119.5,1,0


# Model Development - Without Bias Analysis

In [23]:
# Check if the dataset is balanced
encoded_df['Churn'].value_counts()

0    4399
1    1587
Name: Churn, dtype: int64

#### How to deal with data imbalance?

- https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

In this example, we use adaptive boosting (AdaBoost) to deal with the data imbalance.
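
To put any accuracy figure on this imbalanced target in context, it may help to compute the accuracy of a trivial model that always predicts the majority class. The sketch below uses only the class counts printed above (4399 non-churners, 1587 churners); the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_no_churn = 4399
n_churn = 1587
n_total = n_no_churn + n_churn

# Share of customers who churned
churn_rate = n_churn / n_total

# Accuracy of a model that always predicts "no churn"
majority_baseline = n_no_churn / n_total

print(f"Churn rate: {churn_rate:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

These figures are for the full dataset; a held-out split will have a slightly different baseline, but of roughly the same size.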

#### An AdaBoost classifier.

AdaBoost, one of the earliest boosting techniques, builds a highly accurate prediction rule by combining many weak, individually inaccurate rules. The classifiers are trained sequentially, and each round focuses on correctly classifying the examples that the previous round misclassified.
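
The reweighting idea can be sketched with a single boosting round. This is the classic discrete AdaBoost update on made-up labels and predictions, not necessarily the exact variant scikit-learn runs internally; `y_true` and `y_pred` are hypothetical.

```python
import math

# Toy labels in {-1, +1} and the predictions of one hypothetical weak learner
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last example is misclassified

# Start from uniform example weights
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the weak learner
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Vote of this learner in the final ensemble (larger when the error is smaller)
alpha = 0.5 * math.log((1 - err) / err)

# Increase weights of misclassified examples, decrease the others, then normalize
updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
weights = [w / total for w in updated]

print(weights)
```

After the update, the misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.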

In [24]:
# Get only features
feature_df = encoded_df.drop(['Churn'], axis=1)

# Extract target column 
target_df = encoded_df[['Churn']]

# Split dataset into train and test (best practice is to split into train, validation, and test)
x_train,x_test,y_train,y_test = train_test_split(feature_df, target_df, test_size=0.2, random_state = 0)

# Initialize AdaBoost classifier
cls = AdaBoostClassifier(n_estimators=100)

# Fit the model (ravel() passes the target as a 1-D array, as scikit-learn expects)
cls.fit(x_train, y_train.values.ravel())

# Predict and calculate metrics
print("Accuracy:", metrics.accuracy_score(y_test, cls.predict(x_test)))

Accuracy: 0.7821368948247078


# Model Development - With Bias Analysis using AI Fairness 360

![img](https://github.com/SSaishruthi/Guvi_AI_For_India/blob/main/aif.png?raw=1)

#### Terms:
    
##### Bias
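In the statistical sense, bias is an error that stems from erroneous assumptions in the learning algorithm. In the fairness setting used here, it refers to a systematic difference in outcomes between groups.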

---
##### Protected attribute
An attribute that partitions the population into groups.

---
##### Privileged protected attribute
A value of the protected attribute indicating a group that has historically been at a systematic advantage.

---
##### Unwanted Bias
Bias that places privileged groups at a systematic advantage and unprivileged groups at a systematic disadvantage.

---
##### Favorable Label & Unfavorable Label
A favorable label is a label whose value corresponds to an outcome that provides an advantage to the recipient. The opposite is an unfavorable label.

In [25]:
privileged_groups = [{'gender': 1}] #male
unprivileged_groups = [{'gender': 0}] #female

favorable_label=0
unfavorable_label=1

#### Convert into AIF360 compatible dataset

In [26]:
# Convert into AIF360 compatible dataset
aif360_dataset = aif360.datasets.BinaryLabelDataset(
    favorable_label=favorable_label,
    unfavorable_label=unfavorable_label,
    df=encoded_df,
    label_names=['Churn'],
    protected_attribute_names=['gender'])

In [27]:
aif360_dataset.label_names

['Churn']

In [28]:
aif360_dataset.feature_names

['PhoneService=No',
 'PhoneService=Yes',
 'InternetService=DSL',
 'InternetService=Fiber optic',
 'InternetService=No',
 'DeviceProtection=No',
 'DeviceProtection=No internet service',
 'DeviceProtection=Yes',
 'TechSupport=No',
 'TechSupport=No internet service',
 'TechSupport=Yes',
 'tenure',
 'MonthlyCharges',
 'TotalCharges',
 'gender']

In [29]:
# Split the dataset into train (70%) and test (30%)
aif360_train, aif360_test = aif360_dataset.split([0.7])

#### Check for bias

##### statistical parity difference

This measure is based on the following formula:

Pr(Y=favorable|D=unprivileged)−Pr(Y=favorable|D=privileged)
 
Here the bias, or statistical disparity, is the difference between the probability that a random individual drawn from the unprivileged group receives the favorable label (here `Churn` = 0, i.e. the customer does not churn) and the probability that a random individual from the privileged group receives it.

A value close to 0 indicates that the two groups are treated similarly; a negative value means the unprivileged group receives the favorable label less often.
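
As a worked illustration of statistical parity difference, the sketch below computes it by hand on a made-up eight-row sample, using the same conventions as this notebook (gender 1 is privileged, label 0 is favorable). The data and the helper `favorable_rate` are hypothetical.

```python
# Hypothetical toy data: gender (1 = male, 0 = female) and churn labels
gender = [1, 1, 1, 1, 0, 0, 0, 0]
churn = [0, 0, 0, 1, 0, 0, 1, 1]

FAVORABLE = 0  # "no churn" is treated as the favorable outcome

def favorable_rate(group_value):
    """Share of a gender group that received the favorable label."""
    labels = [c for g, c in zip(gender, churn) if g == group_value]
    return sum(1 for c in labels if c == FAVORABLE) / len(labels)

# Unprivileged rate minus privileged rate
spd = favorable_rate(0) - favorable_rate(1)
print(spd)
```

In this toy sample the unprivileged group has a favorable rate of 0.5 against 0.75 for the privileged group, so the difference is negative.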

In [30]:
# Metric for the original dataset
metric_orig_train = BinaryLabelDatasetMetric(aif360_train, 
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)
print("Original training dataset")
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_orig_train.statistical_parity_difference())

Original training dataset
Difference in mean outcomes between unprivileged and privileged groups = -0.004643


##### Reweighing

Reweighing is a preprocessing technique that weights the examples in each (group, label) combination differently to ensure fairness before classification.
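
A minimal sketch of how such weights can be derived, assuming the scheme described by Kamiran and Calders (which the AIF360 documentation cites for `Reweighing`): each (group, label) cell gets the ratio of its expected count under independence to its observed count. The toy data and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: gender (1 = male, 0 = female) and churn labels
gender = [1, 1, 1, 1, 0, 0, 0, 0]
churn = [0, 0, 0, 1, 0, 0, 1, 1]
n = len(gender)

group_counts = Counter(gender)
label_counts = Counter(churn)
pair_counts = Counter(zip(gender, churn))

# Weight = expected count under independence / observed count
weights = [
    group_counts[g] * label_counts[c] / (n * pair_counts[(g, c)])
    for g, c in zip(gender, churn)
]

def weighted_favorable_rate(group_value, favorable=0):
    """Weighted share of a group that has the favorable label."""
    num = sum(w for w, g, c in zip(weights, gender, churn)
              if g == group_value and c == favorable)
    den = sum(w for w, g in zip(weights, gender) if g == group_value)
    return num / den

spd_after = weighted_favorable_rate(0) - weighted_favorable_rate(1)
print(abs(spd_after) < 1e-9)
```

With these weights both groups have the same weighted favorable rate, so the weighted statistical parity difference is zero up to rounding.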

In [32]:
RW = Reweighing(unprivileged_groups=unprivileged_groups,
               privileged_groups=privileged_groups)
RW.fit(aif360_train)
transf_dataset = RW.transform(aif360_train)

In [33]:
# Metric for the transformed (reweighed) dataset
metric_transf_train = BinaryLabelDatasetMetric(transf_dataset, 
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)
print('Modified training dataset')
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_transf_train.statistical_parity_difference())

Modified training dataset
Difference in mean outcomes between unprivileged and privileged groups = 0.000000


#### Model Development

In [34]:
transf_dataset.features

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 2.41000e+01,
        1.73465e+03, 1.00000e+00],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 8.81500e+01,
        3.97320e+03, 0.00000e+00],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 7.49500e+01,
        2.86985e+03, 0.00000e+00],
       ...,
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 9.51500e+01,
        1.77995e+03, 1.00000e+00],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 2.46000e+01,
        1.26640e+03, 1.00000e+00],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 8.46000e+01,
        1.11520e+03, 0.00000e+00]])

In [35]:
transf_dataset.labels

array([[0.],
       [0.],
       [1.],
       ...,
       [1.],
       [0.],
       [1.]])

In [36]:
transf_dataset.instance_weights

array([0.99685649, 1.0032184 , 0.99140811, ..., 1.00859189, 0.99685649,
       0.99140811])

In [None]:
# n_estimators is the maximum number of estimators at which boosting is terminated. The default is 50, and it can be tuned.
cls = AdaBoostClassifier(n_estimators=100)
# ravel() passes the labels as a 1-D array, as scikit-learn expects
cls.fit(transf_dataset.features, transf_dataset.labels.ravel(), sample_weight=transf_dataset.instance_weights)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=100, random_state=None)

In [37]:
print("Accuracy:", metrics.accuracy_score(aif360_test.labels, cls.predict(aif360_test.features)))

Accuracy: 0.7989977728285078
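
Overall accuracy does not show how the model treats each group. As a possible follow-up, the sketch below compares per-group accuracy and the rate of predicted favorable outcomes on made-up predictions; in the notebook, `y_pred` would come from `cls.predict(...)`. All data and helper names here are hypothetical.

```python
# Hypothetical toy data: gender (1 = male, 0 = female), true and predicted churn
gender = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [0, 0, 0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]

def group_accuracy(group_value):
    """Accuracy restricted to one gender group."""
    pairs = [(t, p) for g, t, p in zip(gender, y_true, y_pred) if g == group_value]
    return sum(1 for t, p in pairs if t == p) / len(pairs)

def predicted_favorable_rate(group_value, favorable=0):
    """Share of a group for which the model predicts the favorable label."""
    preds = [p for g, p in zip(gender, y_pred) if g == group_value]
    return sum(1 for p in preds if p == favorable) / len(preds)

acc_gap = group_accuracy(0) - group_accuracy(1)
pred_spd = predicted_favorable_rate(0) - predicted_favorable_rate(1)

print(acc_gap, pred_spd)
```

In this toy case both groups have the same accuracy, yet the predicted favorable rate differs sharply between them, which is why a group-wise check is worth running alongside the accuracy score.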
