# Churn prediction assignment


# Index
* [Assignment Summary](#1)
* [Import Libraries](#2)
* [Understanding & Preparing Data](#3)
  - [Understanding the Data](#4)
  - [Preparing Data](#5)
  - [Visualization of Data & Findings](#6)
    - [Gender, SeniorCitizens & InternetService Analysis](#7)
    - [Tenure & Monthly Charges Analysis](#8)
    - [Contract, PhoneService & Paperless Billing Analysis](#9)
* [Data Preprocessing](#10)
  - [Feature Selection](#11)
  - [Feature Encoding](#12)
  - [Transform Categorical Values](#13)
    - [Scaling Data](#14)
* [Train Test Split](#15)
* [Modeling](#16)
  - [Model Building](#17)
  - [Model Training](#18)
  - [Model Evaluation](#19)
* [Summary](#20)
* [References](#21)

# Assignment Summary <a id ="1"></a>

Given the Telco company customer dataset, I want to find out which customers are most likely to churn.

In this assignment I will look not only at which attributes lead customers to terminate their services, but also at what can be done about it.

Let's start!

# Import Libraries <a id ="2"></a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import seaborn as sns
import pandas_profiling
import plotly.offline as po
import plotly.graph_objs as go
%matplotlib inline

# For splitting and scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# For modeling (Embedding is already imported from keras.layers, so the
# duplicate import from keras.layers.embeddings is not needed)
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU, Flatten, Dropout, Lambda
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Understanding & Preparing Data <a id ="3"></a>

Let's load data and check out what's inside to better understand it.

In [None]:
telco_churn = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
telco_churn.head()

In [None]:
# Dataset rows & columns. It will help us to better understand this dataset.
telco_churn.shape

## Understanding the Data <a id ="4"></a>

From this first look:

* each row represents one customer
* the columns hold that customer's attributes
* the "Churn" column records whether the customer left -> we can use it later to see how many customers terminated their services
* there are 7043 customers and 21 data points for every customer

## Preparing Data <a id ="5"></a>

Now that the initial check of the data is done, let's prepare it for machine learning modeling.

In [None]:
telco_churn.dtypes

In [None]:
len(telco_churn[telco_churn['TotalCharges'] == " "])

In [None]:
# Converting Total Charges to a numerical data type.
telco_churn.TotalCharges = pd.to_numeric(telco_churn.TotalCharges, errors='coerce')
telco_churn.isnull().sum()

In [None]:
telco_churn['TotalCharges'].dtypes

In [None]:
telco_churn.groupby('Churn')[['MonthlyCharges', 'tenure', 'TotalCharges']].agg(['min', 'max', 'mean'])

Since we have 11 null values in the dataset, we can either fill them or remove them. 11 is a small number, so I will drop them.
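
As a minimal sketch of the two options, the snippet below uses a hypothetical three-row toy column (not the real dataset) in which a blank string stands in for the empty `TotalCharges` entries. It only illustrates the difference between dropping and filling.

```python
import pandas as pd

# Hypothetical toy column; the blank string mimics an empty TotalCharges entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Option 1: drop the rows with missing values.
dropped = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# Option 2: keep the rows and fill the gap (0.0 is an arbitrary choice here).
filled = toy.fillna({"TotalCharges": 0.0})

print(len(dropped), filled["TotalCharges"].tolist())
```

Dropping keeps only observed values, while filling keeps every row at the cost of an assumed value.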

In [None]:
# Drop rows where 'TotalCharges' is null
telco_churn = telco_churn[telco_churn["TotalCharges"].notnull()]
telco_churn = telco_churn.reset_index(drop=True)

# Make sure the 'TotalCharges' column has a float data type
telco_churn["TotalCharges"] = telco_churn["TotalCharges"].astype(float)

Let's see if we still have missing values.

In [None]:
# Calculate the missing value counts for each column
missing = telco_churn.isnull().sum()
# Show the number of columns that have missing values
missing[missing > 0].count()

# Visualization of Data & Findings <a id ="6"></a>

In [None]:
# Visualization of Total Customer Churn

plot_by_churn_labels = telco_churn["Churn"].value_counts().keys().tolist()
plot_by_churn_values = telco_churn["Churn"].value_counts().values.tolist()

plot_data= [
    go.Pie(labels = plot_by_churn_labels,
           values = plot_by_churn_values,
           marker = dict(colors = [ 'Olive', 'Ivory'],
                         line = dict(color = "white",
                                     width = 1.5)),
           rotation = 90,
           hoverinfo = "label+value+text",
           hole = .6)
]

# Total Number of Customers that will churn
count_of_cust_churn_yes = telco_churn[telco_churn.Churn == 'Yes'].shape[0]
# Total Number of Customers that will not churn
count_of_cust_churn_no = telco_churn[telco_churn.Churn == 'No'].shape[0]

# Percentage of customer that will churn
percent_of_cust_churn_yes = round((count_of_cust_churn_yes / (count_of_cust_churn_yes + count_of_cust_churn_no) * 100),2)
# Percentage of customer that will not churn (retain)
percent_of_cust_churn_no = round((count_of_cust_churn_no / (count_of_cust_churn_yes + count_of_cust_churn_no) * 100 ),2)


plot_layout = go.Layout(dict(title = f'{percent_of_cust_churn_yes} % ({count_of_cust_churn_yes} number) of customers will churn & {percent_of_cust_churn_no} % ({count_of_cust_churn_no} number) of customers will retain',
                   plot_bgcolor = "rgb(243,243,243)",
                  paper_bgcolor = "rgb(243,243,243)",))


fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

Summary of the chart above:

* 73.4% of customers remain with the company. 26.6% terminate their service.
* Because 73.42% of customers keep their services, a trivial model that predicts "stays" for every customer would already be right 73.42% of the time.
* So our first objective is to get the model's accuracy above this 73.42% baseline.
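
The baseline idea can be sketched as follows. The labels are hypothetical and only mimic the class balance reported above; the real notebook would use the `Churn` column instead.

```python
import numpy as np

# Hypothetical labels with roughly the same class balance as the dataset (1 = churn).
y_true = np.array([0] * 734 + [1] * 266)

# A "model" that always predicts the majority class.
majority_class = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_class)

# Accuracy of that constant prediction equals the share of the majority class.
baseline_accuracy = (y_baseline == y_true).mean()
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should beat this number before its accuracy says anything useful.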


Let's look at how the other attributes relate to customer retention.

## Gender, SeniorCitizens & InternetService Analysis <a id ="7"></a>

In [None]:
plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'gender', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'Gender and Customer Churn connection')


plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'SeniorCitizen', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'SeniorCitizens have higher customer churn')
plt.xlabel('SeniorCitizens(0: No, 1: Yes)')

plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'InternetService', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'Effect of internet service on customer churn')

The three charts above show the attributes Gender, SeniorCitizen and InternetService. I wanted to see what effect these attributes have on customer churn.

Analysis:

1. Gender has no visible effect on customer churn.
2. Senior citizens are more likely to churn. In this case the company can focus more on the younger group of customers.
3. DSL internet service customers tend to continue with their services, which is the main difference from Fiber optic customers. A deeper analysis of DSL and Fiber optic could reveal the reasons. For customers without any internet service, the churn count is very low. This raises the question of whether the company should review its internet offerings.

Let's have a look at the effect of Tenure & Monthly Charges on customer churn.

## Tenure & Monthly Charges Analysis <a id ="8"></a>

In [None]:
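plt.figure(figsize=(10, 8))
# sns.distplot is deprecated; histplot with a KDE overlay gives a similar density view
ax = sns.histplot(telco_churn['tenure'], kde=True, stat="density", color="y")

plt.figure(figsize=(10, 8))
sns.histplot(telco_churn['MonthlyCharges'], kde=True, stat="density");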

Attributes like Tenure and Monthly Charges have an effect on customer churn:

* Tenure is recorded in months. Some customers have stayed with this company for about 70 months, while customers in their first 1 - 8 months are the most likely to leave and terminate their services.

* Most customers have a low monthly charge.
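
The distribution plots alone do not show churn, so one way to check the tenure claim is to compute the churn rate per tenure bucket. The sketch below uses a hypothetical ten-row toy frame and arbitrary bucket edges; with the real data, `telco_churn` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data; tenure is in months.
toy = pd.DataFrame({
    "tenure": [1, 3, 5, 8, 20, 30, 45, 60, 70, 72],
    "Churn":  ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Bucket tenure into (0, 12], (12, 24], (24, 48], (48, 72] months.
bins = [0, 12, 24, 48, 72]
toy["tenure_group"] = pd.cut(toy["tenure"], bins=bins)

# Share of churners in each bucket.
churn_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("tenure_group", observed=True)["churned"]
       .mean()
)
print(churn_rate)
```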

## Contract, PhoneService & Paperless Billing Analysis <a id ="9"></a>

In [None]:
plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'Contract', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'Contract')


plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'PhoneService', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'PhoneService')
plt.xlabel('PhoneService')  # values are 'Yes'/'No' strings, not 0/1

plt.figure(figsize=(8,4))
ax = sns.countplot(x= 'PaperlessBilling', hue='Churn', data=telco_churn, palette="Set3")
ax.set_title(f'PaperlessBilling')
plt.xlabel('PaperlessBilling')  # values are 'Yes'/'No' strings, not 0/1

I find the Contract attribute very informative. Customers on a "Month-to-month" contract are more likely to leave than customers with contracts of one year or longer. A suggestion for the company is to focus on migrating short-term contracts to long-term ones with better offers.
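
The count plots compare absolute numbers. A row-normalized cross-tabulation makes the churn share per contract type explicit. This is a small sketch on hypothetical toy data, not on the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the three contract types from the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" turns the counts in each row into shares that sum to 1.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```

Reading the "Yes" column then gives the churn rate for each contract type directly.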

# Data Preprocessing <a id ="10"></a>

Data visualization showed us which attributes affect customer churn. Now let's clean and preprocess the data to make it ready for modeling.


### Feature Selection <a id ="11"></a>

Not all attributes in the dataset will help us predict customer churn. In this assignment we do not need "customerID": it is only an identifier and adds no value to the accuracy of the model.
I will remove the "customerID" column from this dataset and store the result in a new pandas dataframe, increasing the number in its name by 1.




In [None]:
telco_churn_1 = telco_churn.drop('customerID', axis = 'columns')
telco_churn_1.shape # Print the shape of new dataframe

The dataset now contains 7032 rows (the 11 rows with missing values were dropped earlier) and 20 columns, including the 'Churn' label. Let's review the data types of the features.

### Feature Encoding <a id ="12"></a>
* I make sure that each feature has a datatype that matches the data it represents. In this case both 'MonthlyCharges' and 'TotalCharges' have the float64 datatype. I already changed 'TotalCharges' from 'object' to float at the beginning.

In [None]:
telco_churn_1.dtypes

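Let's check the values present in the 'TotalCharges' column. The column was already converted to numeric and the blank rows were dropped earlier, so the next cells act as a sanity check and should find nothing left to fix.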

In [None]:
# Print the values in TotalCharges
telco_churn_1.TotalCharges.unique()

In [None]:
# Print rows with missing TotalCharges values
telco_churn_1[telco_churn_1.TotalCharges == ' ']

In [None]:
telco_churn_1.TotalCharges =  telco_churn_1.TotalCharges.replace(r' ', '0')
telco_churn_1[telco_churn_1.tenure == 0]

In [None]:
telco_churn_1.TotalCharges = pd.to_numeric(telco_churn_1.TotalCharges)
print(f'New datatype of TotalCharges : { telco_churn_1.TotalCharges.dtype}')    

### Transform Categorical Values <a id ="13"></a>

One-Hot Encoding

* Most of the columns have values such as 'Yes', 'No', 'No phone service' and 'No internet service'.
* I will print the unique values first and then merge the duplicate categories into a single 'No' category.
* After the replacement is done, I will convert them into numeric format (Yes: 1, No: 0).


In [None]:
def unique_col_values(df):
    """Print unique values from categorical columns of the given dataframe"""
    print('Unique values from categorical columns,\n')
    for column in df.columns:
        if df[column].dtypes == 'object':
            print(f'column: {column}, Unique values: {df[column].unique()}')

unique_col_values(telco_churn_1)

I will replace 'No phone service' with 'No' in the MultipleLines column and 'No internet service' with 'No' in the columns OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies.

In [None]:
telco_churn_1.replace('No internet service', 'No', inplace = True)
telco_churn_1.replace('No phone service', 'No', inplace = True)
# Print unique values again
unique_col_values(telco_churn_1)

None of the categorical values have a numeric order or relationship between them.

One-hot encoding should therefore be used rather than label encoding to convert them into numeric format. As Churn contains our class labels, I can manually assign numeric values to it.

In [None]:
# Converting churn to numeric (assigning the result avoids chained in-place modification)
telco_churn_1['Churn'] = telco_churn_1['Churn'].map({'Yes': 1, 'No': 0})

Now let's create telco_churn_2 for the cleaned dataset.

In [None]:
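# dtype=float keeps the dummy columns numeric (newer pandas versions default to bool)
telco_churn_2 = pd.get_dummies(data = telco_churn_1, dtype = float)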

print(f'So I have added {telco_churn_2.shape[1]- telco_churn_1.shape[1]} more columns to our list. New shape : {telco_churn_2.shape}')
telco_churn_2.sample(5)
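
To make the effect of one-hot encoding concrete, here is a minimal sketch on a hypothetical three-row frame. Each category of the text column becomes its own 0/1 column, while the numeric column is left untouched.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 42.30],
})

# Only object (text) columns are expanded; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Because no category is given a larger number than another, the model cannot read an artificial order into the values.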

### Scaling data <a id ="14"></a>

In this section I use sklearn's MinMaxScaler to rescale the input features to a common range, which helps the machine learning model learn quickly from the data.

Attributes to be scaled: tenure, MonthlyCharges and TotalCharges

In [None]:
atts_scaling = ['tenure','MonthlyCharges','TotalCharges']

scaler = MinMaxScaler()
telco_churn_2[atts_scaling] = scaler.fit_transform(telco_churn_2[atts_scaling]) # Fit to data, then transform it
telco_churn_2[atts_scaling].describe()

Great. Now all values are in the range from 0 to 1. Let's prepare the training and testing datasets for modeling.
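
MinMaxScaler maps each value to (x - min) / (max - min). The notebook fits the scaler on the whole dataset. A common alternative, sketched here with hypothetical numbers, is to fit it on the training rows only and reuse those statistics for the test rows, so that no information from the test set reaches the model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tenure values, already split into training and test rows.
train = np.array([[1.0], [13.0], [25.0], [72.0]])
test = np.array([[6.0], [60.0]])

# Learn min and max from the training rows only, then apply them to both sets.
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# The same result computed by hand with the min-max formula.
manual = (train - train.min()) / (train.max() - train.min())
print(np.allclose(train_scaled, manual), test_scaled.ravel())
```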


# Train Test Split <a id ="15"></a>

I'm going to create separate datasets for training and testing. The training set will have 80% of the data and the test set 20%.

Reference [link](https://satishgunjal.com/train_test_split/)

In [None]:
# Create feature matrix X without label column 'Churn'
X = telco_churn_2.drop('Churn',axis = 'columns')
# Create label vector y
y = telco_churn_2['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)
print(f'X_train: {X_train.shape}, y_train: {y_train.shape}')
print(f'X_test: {X_test.shape}, y_test: {y_test.shape}')

# Look at training dataset
X_train.sample(5)

# Modeling <a id ="16"></a>

## Model Building <a id ="17"></a>

To build the neural network model I have to configure the input, hidden and output layers. As seen above, there are 38 input features.

* I will create the first dense layer with 38 neurons and the 'relu' activation function.
* I will add the second dense layer with 14 nodes (or neurons) and the 'relu' activation function.

As the expected output is binary, the last layer has only one neuron with a sigmoid activation function.

The loss function is 'binary_crossentropy', because the output is binary (churn or not).
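
As a rough illustration of what binary cross-entropy measures, the sketch below computes it by hand with NumPy for hypothetical labels and predicted probabilities. The helper `binary_crossentropy` is written for this example only and is not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; clipping avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# Hypothetical labels and two sets of predicted churn probabilities.
y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # close to the true labels
unsure = np.array([0.6, 0.4, 0.5, 0.5])      # hovering around 0.5

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Predictions that are both correct and confident give a lower loss, which is what training tries to minimize.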

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(38, input_shape= (38,), activation= 'relu'),
    tf.keras.layers.Dense(14, activation= 'relu'),
    tf.keras.layers.Dense(1, activation= 'sigmoid')     
])

model.compile(optimizer= 'adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])

## Model Training <a id ="18"></a>
For model training we will run for 100 epochs.

In [None]:
model.fit(X_train, y_train, epochs= 100)

## Model Evaluation <a id ="19"></a>
For model evaluation we will use test data.

In [None]:
model.evaluate(X_test, y_test)

Now I will check the model's predictions on the test data.

In [None]:
predictions = model.predict(X_test)
predictions[:5]

The predictions for the test data come as a 2D array with values ranging from 0 to 1.
To convert them to binary format we use a threshold of 0.5: anything above 0.5 becomes 1 (churn: yes), everything else 0 (churn: no).

In [None]:
# Flatten the (n, 1) prediction array and apply the 0.5 threshold in one step
y_pred = (predictions.ravel() > 0.5).astype(int)

y_pred[:10]

Let's create a dataframe of true values and predicted values for comparison.

In [None]:
df_true_pred = pd.DataFrame({'y_test':y_test, 'y_pred':y_pred}) 
df_true_pred[:10]

Let's check stats like precision, recall and f1-score.

In [None]:
print(classification_report(y_test,y_pred))

What we can see in the report:

1. Precision - the fraction of predicted positives that are actually positive
2. Recall - the fraction of actual positives that were correctly identified
3. F1 score - the harmonic mean of precision and recall

The scores for class 0 are above 80% and around 55% for class 1.

* The accuracy of the model is about 75%, which is better than the 73.42% baseline (73.42% of the customers in the given data do not churn).
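
The three metrics follow directly from the counts in a confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow, and are not the results of this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)          # how many predicted churners really churned
    recall = tp / (tp + fn)             # how many real churners were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts: true positives, false positives, false negatives, true negatives.
precision, recall, f1, accuracy = classification_metrics(tp=60, fp=40, fn=60, tn=240)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f} accuracy={accuracy:.2f}")
```

The example shows how a model can reach a decent overall accuracy while precision and recall for the churn class stay much lower, which is why the per-class scores matter on an imbalanced dataset.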


Let's plot the confusion matrix for better visualization.

In [None]:
co_m= tf.math.confusion_matrix(labels= y_test, predictions= y_pred)
plt.figure(figsize = (10, 7))
sns.heatmap(co_m, annot= True, cmap="YlGnBu")
plt.xlabel('Predicted')
plt.ylabel('Truth')

# Summary <a id ="20"></a>

In this assignment I applied several data science techniques to help me identify customer churn for the Telco company.

I have gone through the major attributes with a high impact on customers.

My approach was to use an Artificial Neural Network (ANN) to predict customer churn / customer turnover. I used the TensorFlow deep learning framework to build the model.

Next time I would try Logistic Regression as a different approach to prediction.


I can conclude it was fun to do this analysis, build a model, train it and evaluate it at the end.


# References <a id ="21"></a>


* https://www.youtube.com/watch?v=ocMd2loRfWE
* https://www.youtube.com/watch?v=MSBY28IJ47U&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=18
