# C4021 - Group Project 2.3

### Problem Statement: Telecom Customer Churn Prediction

A classification problem: predict telecom customer churn, one of the major problems the telecom industry faces today.


## Dataset Information

####   Source: https://www.kaggle.com/blastchar/telco-customer-churn 

#### Context: 
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

#### Content: 
Each row represents a customer; each column contains a customer attribute described in the column metadata.

##### The data set includes information about:
- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

## Team Contributions:

#### Training data
#### Pre-processing
#### Algorithm training and evaluation
#### Visualization of outputs

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Load the CSV file into a pandas DataFrame

In [2]:
dataframe = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [3]:
# Features available in the dataset
dataframe.columns.values

array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)

In [4]:
# Let's see how many rows fall under each Churn value.
dataframe.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [5]:
# This is a binary classification problem, since there are two outcomes.
# Encode the outcome as 0 if Churn is No and as 1 if Churn is Yes

dataframe["Churn"] = [1 if each == 'Yes' else 0 for each in dataframe["Churn"]]
dataframe.Churn.value_counts()

0    5174
1    1869
Name: Churn, dtype: int64
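
The class counts above also give the churn rate directly: with a 0/1 encoding, the mean of the column is the share of churned customers. The snippet below is a minimal sketch on a stand-in Series built from the counts reported above (the names `churn`, `churn_binary`, and `churn_rate` are illustrative, not part of the notebook); it uses `Series.map` as one alternative to the list comprehension.

```python
import pandas as pd

# Stand-in for the Churn column, rebuilt from the class counts shown above
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# Map the labels to 0/1 (same result as the list comprehension in the cell above)
churn_binary = churn.map({"No": 0, "Yes": 1})

# Relative frequency of each class
class_share = churn_binary.value_counts(normalize=True)

# For a 0/1 column the mean equals the positive-class (churn) rate
churn_rate = churn_binary.mean()

print(class_share)
print(round(churn_rate, 4))
```

Roughly a quarter of the customers churned, so the classes are imbalanced; this is worth keeping in mind when choosing evaluation metrics later.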

### Basic data exploration of the input Dataframe

In [6]:
# Check the data dimension
dataframe.shape

(7043, 21)

In [7]:
# Print the first 5 rows of the dataframe to get a feel for the data
dataframe.head(5)

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | 1 |


In [8]:
# Let's have a look at the numeric features
dataframe.describe()

| | SeniorCitizen | tenure | MonthlyCharges | Churn |
|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 | 0.26537 |
| std | 0.368612 | 24.559481 | 30.090047 | 0.441561 |
| min | 0.0 | 0.0 | 18.25 | 0.0 |
| 25% | 0.0 | 9.0 | 35.5 | 0.0 |
| 50% | 0.0 | 29.0 | 70.35 | 0.0 |
| 75% | 0.0 | 55.0 | 89.85 | 1.0 |
| max | 1.0 | 72.0 | 118.75 | 1.0 |


In [9]:
# Let's have a look at the categorical features
dataframe.describe(include=['O'])

| | customerID | gender | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 |
| unique | 7043 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 6531 |
| top | 3066-RRJIO | Male | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 20.2 |
| freq | 1 | 3555 | 3641 | 4933 | 6361 | 3390 | 3096 | 3498 | 3088 | 3095 | 3473 | 2810 | 2785 | 3875 | 4171 | 2365 | 11 |


#### Which features are categorical?
- Categorical features: gender, Partner, Dependents, SeniorCitizen, PhoneService, PaperlessBilling, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaymentMethod  

#### Which features are numerical?
- Continuous: tenure, MonthlyCharges, TotalCharges (currently stored as strings)
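
The split between numeric and non-numeric columns can also be read off programmatically. The following is a small sketch on a made-up four-column frame (the `toy` data and variable names are illustrative only), using `DataFrame.select_dtypes`. It also shows why a column such as TotalCharges ends up on the non-numeric side while it is stored as text.

```python
import pandas as pd

# Made-up miniature of the dataset; TotalCharges is deliberately stored as text
toy = pd.DataFrame({
    "gender": ["Female", "Male"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": ["29.85", " "],  # strings, so not a numeric dtype
})

# Columns pandas already treats as numbers
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text columns), which need encoding or conversion
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Note that a dtype check alone would not flag SeniorCitizen as categorical, since it is stored as 0/1 integers; that judgment still has to be made by inspecting the data.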

In [10]:
# Checking the data types and counts of all the columns
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null int64
dtypes: float64(1), int64(3), obje

#### TotalCharges is stored as a string rather than a number. Let's investigate and fix that

In [11]:
# The TotalCharges column is of string (object) type
dataframe.TotalCharges.describe()

count     7043
unique    6531
top       20.2
freq        11
Name: TotalCharges, dtype: object

In [12]:
# Let's convert TotalCharges to a numeric data type; errors='coerce' turns unparseable entries into NaN.
dataframe.TotalCharges = pd.to_numeric(dataframe.TotalCharges, errors='coerce')
dataframe.TotalCharges.head()

0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
Name: TotalCharges, dtype: float64
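
Because `errors='coerce'` silently replaces anything it cannot parse with `NaN`, it is useful to look at which raw entries were affected. The sketch below uses a made-up Series; the blank-string entry is an assumption about what an unparseable value might look like, not something shown in the output above.

```python
import pandas as pd

# Made-up raw values; the whitespace-only entry stands in for an unparseable cell
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to convert
failed_mask = converted.isna() & raw.notna()

print(converted)
print(raw[failed_mask].tolist())
```

Running the same kind of mask against the original column before overwriting it would show exactly which values produce the missing entries counted later.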

### Basic data cleaning process

As per the Depy2016 lesson, let's do some data cleaning.

#### A. Dealing with data types
- Models can only handle numeric features
- Categorical and ordinal features must be converted into numeric features
    - Create dummy features
    - Transform a categorical feature into a set of dummy features, each representing a unique category
    - In the set of dummy features, 1 indicates that the observation belongs to that category
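
As a small illustration of the dummy-feature idea described above, the sketch below encodes a single categorical column of a made-up frame with `pd.get_dummies`. The `toy` data is illustrative, and `dtype=int` is passed so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 34, 45, 2],
})

# One new 0/1 column per unique category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the three `Contract_*` columns, marking the category that observation belongs to.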

##### Now we need to see if there are any missing values in the dataset

In [13]:
dataframe.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

There are 11 missing values for TotalCharges. Since this is a small proportion of the input samples (11 of 7043, about 0.16%), we can remove these 11 rows from our data set.

In [14]:
# Drop rows with missing values (TotalCharges is the only column that has any)
dataframe.dropna(axis=0, inplace = True)
dataframe.shape

(7032, 21)

#### B. Handling missing data
- Models cannot handle missing data
- Solution:
    - Remove observations/features that have missing data
    - An alternative solution is to use imputation
        - Replace missing value with another value
        - Strategies: mean, median, highest frequency value of given feature
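
The two options listed above can be compared side by side. This is a minimal sketch on made-up numbers (the `toy` frame and variable names are illustrative); it uses median imputation via `fillna`, which is only one of the strategies mentioned, whereas the notebook itself drops the affected rows.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing TotalCharges value
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 50.0, 80.0, 30.0],
    "TotalCharges": [20.0, np.nan, 800.0, 60.0],
})

# Option 1: remove observations that contain missing data
dropped = toy.dropna(axis=0)

# Option 2: impute, here with the median of the observed values
median_value = toy["TotalCharges"].median()
imputed = toy.assign(TotalCharges=toy["TotalCharges"].fillna(median_value))

print(dropped.shape, imputed.shape)
print(imputed)
```

Dropping loses a row but invents nothing; imputing keeps every row at the cost of a filled-in value, so the choice depends on how many rows are affected.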
    

In [15]:
# Find the categorical variables in the dataset (code inspired by the Depy2016 tutorial)
for col_name in dataframe.columns:
    if dataframe[col_name].dtypes == 'object':
        unique_cat = len(dataframe[col_name].unique())
        print("Feature '{}' has {} unique categories".format(col_name, unique_cat))

Feature 'customerID' has 7032 unique categories
Feature 'gender' has 2 unique categories
Feature 'Partner' has 2 unique categories
Feature 'Dependents' has 2 unique categories
Feature 'PhoneService' has 2 unique categories
Feature 'MultipleLines' has 3 unique categories
Feature 'InternetService' has 3 unique categories
Feature 'OnlineSecurity' has 3 unique categories
Feature 'OnlineBackup' has 3 unique categories
Feature 'DeviceProtection' has 3 unique categories
Feature 'TechSupport' has 3 unique categories
Feature 'StreamingTV' has 3 unique categories
Feature 'StreamingMovies' has 3 unique categories
Feature 'Contract' has 3 unique categories
Feature 'PaperlessBilling' has 2 unique categories
Feature 'PaymentMethod' has 4 unique categories


In [16]:
# customerID carries no meaningful information, so it can be dropped from the data set
dataframe.drop(["customerID"], axis=1, inplace=True)
dataframe.head(5)

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | 1 |


In [17]:
# Let's convert all the categorical variables into dummy variables (one-hot encoding)

df_dummies = pd.get_dummies(dataframe)
df_dummies.head(5)

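| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | Churn | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | ... | StreamingMovies_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 34 | 56.95 | 1889.5 | 0 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 45 | 42.3 | 1840.75 | 0 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 2 | 70.7 | 151.65 | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |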


#### Split the dataframe into Features and Labels

In [18]:
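# Assign X as a DataFrame of features by dropping the outcome column (Churn).
# Note: X is built from dataframe, not df_dummies, so it still holds the raw categorical columns.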
X = dataframe.drop("Churn", axis=1)

# Assign y as a Series holding the outcome variable
y = dataframe["Churn"]

In [19]:
X.shape

(7032, 19)
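
The 19 columns reported above are the un-encoded features, so `X` still contains string-valued categoricals. If the one-hot encoded frame is the intended model input (an assumption, since the later cells are not shown here), the same split can be taken from the encoded frame instead. The sketch below shows the idea on a made-up frame; `toy`, `toy_dummies`, `X_encoded`, and `y_toy` are illustrative names.

```python
import pandas as pd

# Made-up frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": [0, 0, 1],
})

# Encode the text columns first; numeric columns (including Churn) pass through
toy_dummies = pd.get_dummies(toy, dtype=int)

# Features: every encoded column except the outcome
X_encoded = toy_dummies.drop("Churn", axis=1)

# Labels: the outcome column
y_toy = toy_dummies["Churn"]

print(X_encoded.columns.tolist())
print(X_encoded.shape, y_toy.shape)
```

With this ordering every feature column is numeric, which is what the data-type notes earlier in the notebook call for.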