<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">Outline</h2>

1. [Package Imports](#imports)
2. [Quick Data Check](#check)
    - [Read the data](#read)
    - [Fix misspecified feature types](#fix)
    - [Check for null values](#nulls)
    - [Validate the value range of each feature](#valid)
3. [Exploratory Data Analysis](#explore)
    - [Display the balance of the class labels (Churn)](#balance)
    - [Distribution of main variables](#dist)
    - [Correlation analysis](#corr)
4. [Churn Analysis](#churn)
    - [Creating Cohorts based on Tenure](#chorot)
5. [Predictive Modeling](#preds)
    - [Single Decision Tree](#tree)
    - [Random Forest](#rs)
    - [Boosted Trees](#bt)
    - [Support Vector Machine](#svm)
    - [KNN Classifier](#knn)

<a id = "imports"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">1. Package Imports</h2>

In [None]:
# Data Prep and Visuals
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#set max rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

# Evaluation
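from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; keep the old name as an alias
# for ConfusionMatrixDisplay.from_estimator, which takes the same (estimator, X, y, ax=...) arguments
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator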

# Cross Validation
from sklearn.model_selection import GridSearchCV

<a id = "check"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">2. Quick Data Check</h2>

<a id = "read"></a>
#### **- Read the data**

In [None]:
# Read the data frame
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
# Display the top 5 rows
df.head()

In [None]:
# Statistical Summary
df.describe()

<a id = "fix"></a>
#### **- Fix misspecified feature types**

In [None]:
# Main Info
df.info()

TotalCharges is stored as object, when in fact it should be float. Let's fix that.
If you run the following line of code, it will raise a ValueError (could not convert string to float). This means that at least one row in the column holds a string that cannot be parsed as a number. To find out which rows, we will use the value_counts() function.

In [None]:
# convert TotalCharges to float (this first attempt fails with a ValueError)
df["TotalCharges"] = df["TotalCharges"].astype(float)

In [None]:
# trying to catch the cause of the problem
df["TotalCharges"].value_counts()[:5]

In [None]:
# the rows with the problem
df[df.TotalCharges == " "]

In [None]:
# fill the blank entries with the most frequent non-blank value
mode = df.loc[df["TotalCharges"] != " ", "TotalCharges"].mode()[0]
df["TotalCharges"] = df["TotalCharges"].replace(" ", mode)

In [None]:
# convert TotalCharges to float
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])
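
As an aside, unparseable entries can also be located without reading the error message. The sketch below is not part of the original workflow: it runs on a made-up toy series standing in for `TotalCharges` and uses `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed into NaN.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (values are made up for illustration)
charges = pd.Series(["29.85", " ", "108.15", "20.2", " "], name="TotalCharges")

# errors="coerce" converts unparseable entries to NaN instead of raising
numeric = pd.to_numeric(charges, errors="coerce")

# The rows that failed to parse are exactly the ones that became NaN
bad_rows = charges[numeric.isna()]
print(bad_rows)
```

Whether the resulting NaN values should then be filled with the mode, as above, or handled differently is a modelling choice.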

In [None]:
# Let's check that it is actually corrected
df.info()

Now all features have the correct type.

<a id = "nulls"></a>
#### **- Check for null values**

In [None]:
# check for nulls
nulls = df.isna().sum()
pd.DataFrame(data = nulls, columns = ["Nulls"]).reset_index()

There are no null values.

<a id = "valid"></a>
#### **- Validate the value range of each feature**

In [None]:
# Display column names
df.columns

In [None]:
# Feature 1
df[df.columns[1]].unique()

In [None]:
# Feature 2
df[df.columns[2]].unique()

In [None]:
# Feature 3
df[df.columns[3]].unique()

In [None]:
# Feature 4
df[df.columns[4]].unique()

In [None]:
# Feature 5
df[df.columns[5]].unique()

In [None]:
# Feature 6
df[df.columns[6]].unique()

In [None]:
# Feature 7
df[df.columns[7]].unique()

In [None]:
# Feature 8
df[df.columns[8]].unique()

In [None]:
# Feature 9
df[df.columns[9]].unique()

The values of all features are in the expected range according to the definition of each variable in the dataset.
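
The per-feature checks above could also be written as a single loop. This is a minimal sketch on a small made-up frame; the column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for a few of the categorical columns (illustrative values)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 1, 0],
    "Partner": ["Yes", "No", "No"],
})

# Collect the distinct values of every column in one pass
unique_values = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```

On the real data, the same loop could run over `df.columns[1:10]` to reproduce the nine cells above.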

<a id = "explore"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">3. Exploratory Data Analysis</h2>

<a id = "balance"></a>
#### **- Display the balance of the class labels (Churn)**

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.countplot(data = df, x = "Churn")
plt.show()

The classes are imbalanced, so we need to take that into consideration when building the model.
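
One possible way to account for the imbalance is to quantify it and then keep the class proportions equal in the train and test sets. The sketch below assumes a toy 80/20 label split (not the real churn proportions) and uses the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an assumed 80/20 imbalance (not the real churn proportions)
y_demo = pd.Series(["No"] * 80 + ["Yes"] * 20, name="Churn")
X_demo = pd.DataFrame({"feature": range(100)})

# Share of each class in the full data
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=101
)
print(y_te.value_counts(normalize=True))
```

Many scikit-learn classifiers also accept `class_weight="balanced"`, which is another option worth comparing.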

<a id = 'dist'></a>
#### **- Distribution of main variables**

In [None]:
# The distribution of TotalCharges between Churn categories with a Box Plot
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(data = df, x = "Churn", y = "TotalCharges")
plt.show()

In [None]:
#The distribution of TotalCharges per Contract type
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(data = df, x = "Contract", y = "TotalCharges", hue = "Churn")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

In [None]:
# The distribution of MonthlyCharges between Churn categories with a Box Plot
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(data = df, x = "Churn", y = "MonthlyCharges")
plt.show()

In [None]:
#The distribution of MonthlyCharges per Contract type
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(data = df, x = "Contract", y = "MonthlyCharges", hue = "Churn")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

In [None]:
# The distribution of SeniorCitizen between Churn categories with a Count Plot
plt.figure(figsize = (8, 4), dpi = 100)
sns.countplot(data = df, x = "SeniorCitizen", hue = 'Churn')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

<a id = 'corr'></a>
#### **- Correlation analysis**

We select only a subset of the features below. We should not check the correlation for every feature, because some of them, such as customerID, have too many unique values for such an analysis.

Keep in mind, for the categorical features, we will need to convert them into dummy variables first, as you can only calculate correlation for numeric features.
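
To illustrate the idea on a small scale, the sketch below one-hot encodes a made-up four-row frame and correlates every resulting column with the `Churn_Yes` dummy. The values are invented for illustration, and `dtype=float` is passed so the dummies are plainly numeric.

```python
import pandas as pd

# Made-up rows; column names mirror the dataset, values are illustrative
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "tenure": [1, 60, 3, 24],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Categorical columns become 0/1 dummy columns; numeric columns pass through
dummies = pd.get_dummies(toy, dtype=float)

# Correlation of each feature with Churn_Yes, excluding the Churn dummies themselves
corr_with_churn = (
    dummies.corr()["Churn_Yes"].drop(["Churn_Yes", "Churn_No"]).sort_values()
)
print(corr_with_churn)
```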

In [None]:
# Select the subset of features 
corr_feats = df.drop("customerID", axis = 1)

In [None]:
# convert them to dummy vars
corr_feats = pd.get_dummies(corr_feats)

In [None]:
# preview the dummy-encoded features
corr_feats.head()

In [None]:
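# calculate the correlation of every feature with Churn_Yes
corr_array = corr_feats.corr()
# drop the two Churn dummies themselves (they correlate trivially with the target)
corr_array = corr_array["Churn_Yes"].drop(["Churn_No", "Churn_Yes"]).sort_values()
corr_array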

In [None]:
# visualize the correlation array
plt.figure(figsize = (10, 4), dpi = 100)
sns.barplot(x = corr_array.index, y = corr_array.values)
plt.xticks(rotation = 90)
plt.show()

<a id = "churn"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">4. Churn Analysis</h2>

This section focuses on segmenting customers based on their tenure, creating "cohorts" that allow us to examine differences between customer segments.

In [None]:
# What are the 3 contract types available?
df['Contract'].unique()

In [None]:
# Histogram displaying the distribution of 'tenure' column
plt.figure(figsize = (10, 4), dpi = 100)
sns.histplot(data = df, x = "tenure", bins = 60)
plt.show()

In [None]:
# Create histograms separated by two additional features, Churn and Contract
# (displot is figure-level and creates its own figure, so no plt.figure call is needed)
sns.displot(data=df,x='tenure',bins=70,col='Contract',row='Churn');

In [None]:
#Display a scatter plot of Total Charges versus Monthly Charges, and color hue by Churn
plt.figure(figsize=(10,4),dpi=200)
sns.scatterplot(data=df,x='MonthlyCharges',y='TotalCharges',hue='Churn', alpha=0.5, palette='Dark2', linewidth=0.5)
plt.show()

<a id = 'chorot'></a>
#### **- Creating Cohorts based on Tenure**

Let's begin by treating each unique tenure length (1 month, 2 months, 3 months, ..., N months) as its own cohort and calculating the churn rate (the percentage of customers with Churn = Yes) per cohort. We should get cohorts for 1-72 months, with a general trend: the longer the tenure of the cohort, the lower the churn rate. This makes sense, as customers are less likely to stop a service the longer they have had it.

In [None]:
# churn rate per months of tenure
no_churn = df.groupby(['Churn','tenure']).count().transpose()['No']
yes_churn = df.groupby(['Churn','tenure']).count().transpose()['Yes']

churn_rate = 100 * yes_churn / (no_churn+yes_churn)
churn_rate = churn_rate.transpose()['customerID'][1:]
churn_rate
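
A more compact way to get a per-cohort churn rate is to average a boolean "churned" flag within each tenure group. The sketch below uses a made-up toy frame, so the numbers are illustrative only.

```python
import pandas as pd

# Toy data: tenure in months and whether the customer churned (illustrative values)
toy = pd.DataFrame({
    "tenure": [1, 1, 1, 1, 2, 2, 3, 3],
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# The mean of a boolean flag is the fraction of True values, i.e. the churn rate
churned = toy["Churn"].eq("Yes")
churn_rate_demo = 100 * churned.groupby(toy["tenure"]).mean()
print(churn_rate_demo)
```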

In [None]:
# plot the churn rate per month of tenure
plt.figure(figsize=(10,4),dpi=200)
churn_rate.plot()
plt.show()

Based on the tenure column values, create a new column called Tenure Cohort with 4 separate categories:
- '0-12 Months'
- '12-24 Months'
- '24-48 Months'
- 'Over 48 Months'

In [None]:
def cohort(tenure):
    if tenure < 13:
        return '0-12 Months'
    elif tenure < 25:
        return '12-24 Months'
    elif tenure < 49:
        return '24-48 Months'
    else:
        return "Over 48 Months"
    
df['Tenure Cohort'] = df['tenure'].apply(cohort)
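
The same binning can also be expressed with `pd.cut`. This is a sketch under the assumption that the bin edges should match the `cohort` function above: each interval is closed on the right, so 12 falls in '0-12 Months' and 13 in '12-24 Months'.

```python
import pandas as pd

# A few tenure values around the cohort boundaries (illustrative)
tenure = pd.Series([0, 1, 12, 13, 24, 25, 48, 49, 72])

# Right-closed intervals: (-1, 12], (12, 24], (24, 48], (48, inf)
bins = [-1, 12, 24, 48, float("inf")]
labels = ["0-12 Months", "12-24 Months", "24-48 Months", "Over 48 Months"]

cohorts = pd.cut(tenure, bins=bins, labels=labels)
print(cohorts.tolist())
```

Unlike `apply`, `pd.cut` returns an ordered categorical, which keeps the cohorts in their natural order in plots.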

In [None]:
df.head(10)[['tenure','Tenure Cohort']]

In [None]:
# Create a scatterplot of Total Charges versus Monthly Charges, colored by Tenure Cohort
plt.figure(figsize=(10,4),dpi=200)
sns.scatterplot(data=df,x='MonthlyCharges',y='TotalCharges',hue='Tenure Cohort', alpha=0.5, palette='Dark2', linewidth=0.5)
plt.show()

In [None]:
# Create a count plot showing the churn count per cohort
plt.figure(figsize=(10,4),dpi=200)
sns.countplot(data=df,x='Tenure Cohort',hue='Churn');

In [None]:
# Create a grid of Count Plots showing counts per Tenure Cohort, separated out by contract type
# (catplot is figure-level and creates its own figure, so no plt.figure call is needed)
sns.catplot(data=df,x='Tenure Cohort',hue='Churn',col='Contract',kind='count')
plt.show()

<a id = "preds"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">5. Predictive Modeling</h2>

In [None]:
# X, y split (customerID is a unique identifier, so it is dropped rather than one-hot encoded)
X = df.drop(["customerID", "Churn"], axis = 1)
y = df["Churn"]

In [None]:
# dummies
X = pd.get_dummies(X)

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=101)

In [None]:
# Feature Scaling
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

<a id  = 'tree'></a>
#### **- Single Decision Tree** 

In [None]:
# initiate the model
model = DecisionTreeClassifier(max_depth = 3)

# fit the model
model.fit(scaled_X_train, y_train)

# predict
preds = model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(model, scaled_X_test, y_test, ax = ax);
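
Because the classes are imbalanced, accuracy alone can look better than the model really is. The sketch below uses made-up labels and a hypothetical model that always predicts "No" to show how `classification_report` (imported above) exposes per-class recall.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels: 8 non-churners, 2 churners, and a model that always says "No"
y_true = ["No"] * 8 + ["Yes"] * 2
y_pred = ["No"] * 10

# Accuracy looks acceptable even though no churner is detected
acc = accuracy_score(y_true, y_pred)
print(acc)

# Per-class precision, recall and F1; zero_division=0 silences the warning
# raised when a class is never predicted
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["Yes"])
```

On the real data, `print(classification_report(y_test, preds))` would give the same breakdown for any of the models in this section.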

In [None]:
# plot the tree
plt.figure(figsize=(12,8),dpi=200)
plot_tree(model,filled=True);

In [None]:
# Hyper Parameter tuning
param_grid = {
    'criterion': ["gini", "entropy"],
    'max_depth': [1, 2, 3, 4, 5]
}

In [None]:
# initiating the grid model
grid_model = GridSearchCV(model, param_grid)

# fit the grid model
grid_model.fit(scaled_X_train, y_train)
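
After fitting, the grid search object records which parameter combination scored best during cross-validation. The sketch below runs on synthetic data from `make_classification` (not the churn data) with an assumed 3-fold split, purely to show the `best_params_` and `best_score_` attributes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid_demo = {"criterion": ["gini", "entropy"], "max_depth": [1, 2, 3]}

# cv=3 is an assumption to keep the sketch fast; GridSearchCV defaults to 5 folds
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid_demo, cv=3)
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))
```

In the notebook, the same attributes are available on `grid_model` once it has been fitted.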

In [None]:
# predict 
preds = grid_model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(grid_model, scaled_X_test, y_test, ax = ax);

<a id  = 'rs'></a>
#### **- Random Forest** 

In [None]:
# initiate the model
model = RandomForestClassifier()

# fit the model
model.fit(scaled_X_train, y_train)

# predict
preds = model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(model, scaled_X_test, y_test, ax = ax);

<a id  = 'bt'></a>
#### **- Boosted Trees** 

In [None]:
# initiate the model
model = GradientBoostingClassifier()

# fit the model
model.fit(scaled_X_train, y_train)

# predict
preds = model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(model, scaled_X_test, y_test, ax = ax);

In [None]:
# Hyper Parameter tuning
param_grid = {"n_estimators":[1,5,10,20,40,100],'max_depth':[3,4,5,6]}

In [None]:
# initiating the grid model
grid_model = GridSearchCV(model, param_grid)

# fit the grid model
grid_model.fit(scaled_X_train, y_train)

In [None]:
# predict 
preds = grid_model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(grid_model, scaled_X_test, y_test, ax = ax);

<a id  = 'svm'></a>
#### **- Support Vector Machine** 

In [None]:
# initiate the model
model = SVC()

# fit the model
model.fit(scaled_X_train, y_train)

# predict
preds = model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(model, scaled_X_test, y_test, ax = ax);

<a id  = 'knn'></a>
#### **- KNN Classifier** 

In [None]:
# initiate the model
model = KNeighborsClassifier()

# fit the model
model.fit(scaled_X_train, y_train)

# predict
preds = model.predict(scaled_X_test)

# print accuracy score 
print(accuracy_score(y_test,preds))

# plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 6), dpi = 100)
plot_confusion_matrix(model, scaled_X_test, y_test, ax = ax);