# I. Introduction
Subscription businesses increasingly track the churn rate as a key indicator of company health. If a company can predict which customers are likely to churn, it can take retention measures in advance. The goal of this project is therefore to build a churn prediction model so that Telco can proactively optimize its products and services.

# II. Data Description
The raw data contains 7043 rows (customers) and 21 columns (features).
* customerID: Customer ID
* gender: Whether the customer is a male or a female
* SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
* Partner: Whether the customer has a partner or not (Yes, No)
* Dependents: Whether the customer has dependents or not (Yes, No)
* tenure: Number of months the customer has stayed with the company
* PhoneService: Whether the customer has a phone service or not (Yes, No)
* MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
* InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
* OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
* OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
* DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
* TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
* StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
* StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
* Contract: The contract term of the customer (Month-to-month, One year, Two year)
* PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
* PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* MonthlyCharges: The amount charged to the customer monthly
* TotalCharges: The total amount charged to the customer
* Churn: Whether the customer churned or not (Yes or No)

# III. Data Collection

### 1. Importing Modules

In [None]:
# data preprocessing
import numpy as np # linear algebra
import pandas as pd # data processing

# plot
import seaborn as sns 
sns.set_style('whitegrid') 

import matplotlib.pyplot as plt 
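# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')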
from mpl_toolkits.mplot3d import Axes3D 
!pip install chart-studio
import chart_studio.plotly as py
from plotly import __version__

import graphviz

### 2. Loading Dataset

In [None]:
# loading dataset
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

### 3. About the Data
This dataset has 7,043 samples and 21 attributes (2 integer, 1 float, and 18 object).
* Target Feature: Churn
* Numeric Features: Tenure, MonthlyCharges, and TotalCharges
* Categorical Features: CustomerID, Gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod

In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
df.tail(5)

* TotalCharges is stored as an object column (see `df.info()` above), so it is absent from the numeric summary table below. This suggests that it contains non-numeric entries, i.e. **missing values**.

In [None]:
df.describe()

### 4. Data Reshaping
* Rename the features 'tenure' and 'gender'
* Convert the feature 'TotalCharges' to numerical data type
* Convert the feature 'SeniorCitizen' to object data type

In [None]:
# renaming 'tenure' and 'gender'
df = df.rename(columns={'tenure': 'Tenure', 'gender': 'Gender'})

# converting 'TotalCharges' to numerical data type
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce', downcast='float')

# converting 'SeniorCitizen' to object data type
df['SeniorCitizen'] = df['SeniorCitizen'].astype(object)  # np.object was removed in NumPy 1.24

# check
df.info()
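
To illustrate why `errors='coerce'` matters in the conversion above, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset. Entries that cannot be parsed as numbers, such as a blank string, become `NaN` instead of raising an error.

```python
import pandas as pd

# Toy stand-in for the raw 'TotalCharges' column (illustrative values only)
raw = pd.Series(["29.85", " ", "108.15"])

# Unparseable entries such as " " are turned into NaN rather than raising
converted = pd.to_numeric(raw, errors="coerce")

n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Counting the `NaN` values right after the conversion is a quick way to see how many rows will need imputation later.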

# IV. Exploratory Data Analysis (EDA)

### 1. Target Variable
(1) Churn: The customer churn rate in this dataset is 27%, which makes this an **imbalanced dataset**.

In [None]:
# Pie chart of churn
churn_rate = df.Churn.value_counts() / len(df.Churn)
labels = 'Non-Churn', 'Churn'

fig, ax = plt.subplots()
ax.pie(churn_rate, labels=labels, autopct='%.f%%')  
ax.set_title('Churn vs Non Churn', fontsize=16)

### 2. Numeric Features
(1) Tenure: Customers with shorter tenure are more likely to churn.

(2) Monthly Charges: Customers with low monthly charges are less likely to churn; however, the distributions for churned and non-churned customers become similar as monthly charges go up.

(3) Total Charges: The distribution is similar for churned and non-churned customers, implying that Total Charges may not be a good predictor.

In [None]:
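# numerical features grouped by churn
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    fig = plt.figure(figsize=(8,5))
    # histplot replaces the deprecated distplot; stat='density' keeps both groups on the same scale
    sns.histplot(x=df.loc[df.Churn == 'No', col],
                 bins=10,
                 color='orange',
                 label='Non-Churn',
                 kde=True,
                 stat='density')
    sns.histplot(x=df.loc[df.Churn == 'Yes', col],
                 bins=10,
                 color='blue',
                 label='Churn',
                 kde=True,
                 stat='density')
    plt.legend()  # uses the label of each histogram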

(4) Outliers: The box plots show that there are **no outliers** in this data set.

In [None]:
# check outliers
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    fig = plt.figure(figsize=(8,3))
    sns.boxplot(x=df[col])  # pass the data by keyword; positional use is deprecated in recent seaborn
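
The visual check above can be backed by a numeric one. The following is a minimal sketch, assuming the usual box-plot whisker rule of 1.5 times the interquartile range; the helper name `iqr_outlier_count` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_count(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = values.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Illustrative data: 100 lies far above the upper whisker
toy = pd.Series([1, 2, 3, 4, 5, 100])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```

In the notebook, the same helper could be applied to `df['Tenure']`, `df['MonthlyCharges']`, and `df['TotalCharges']`; a count of zero would agree with the box plots.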

(5) Skewness: The density plots show that the features are **not normally distributed**.

In [None]:
# distribution
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    fig = plt.figure(figsize=(8,3))
    sns.kdeplot(df[col])
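
Skewness can also be quantified rather than read off a plot. A minimal sketch on synthetic data (not the Telco data), using the pandas `skew()` method: values near 0 indicate a roughly symmetric distribution, while clearly positive values indicate a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns for illustration: one symmetric, one right-skewed
toy = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(size=1000),
})

skewness = toy.skew()
print(skewness)
```

In the notebook, the equivalent call would be `df[['Tenure', 'MonthlyCharges', 'TotalCharges']].skew()`.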

(6) Correlation: The correlation matrix shows that these numeric features are positively correlated with one another.

In [None]:
# correlation between numerical features
plt.figure(figsize=(10, 8))
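# select the numeric columns explicitly; df.corr() raises on object columns in pandas 2.x
feature_corr = df[['Tenure', 'MonthlyCharges', 'TotalCharges']].corr()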
sns.heatmap(feature_corr, annot=True, cmap='coolwarm')

### 3. Categorical Features

(1) Gender: The churn rate is similar for male and female customers, indicating that **Gender may not be a good predictor**.

(2) Senior Citizen: Senior citizens are more likely to churn.

(3) Partner: Customers without a partner are more likely to churn.

(4) Dependents: Customers without dependents are more likely to churn.

In [None]:
for col in ['Gender', 'SeniorCitizen', 'Partner', 'Dependents']:
    plt.figure(figsize=(8,5))
    sns.countplot(x=col, hue='Churn', data=df, palette="tab10")
    plt.show()

(5) Phone Service: PhoneService is a **redundant feature**, since the same information is available from MultipleLines (its 'No phone service' level). We could therefore drop this column.

(6) Multiple Lines: Customers with multiple lines are slightly more likely to churn.

In [None]:
for col in ['PhoneService', 'MultipleLines']:
    plt.figure(figsize=(8,5))
    sns.countplot(x=col, hue='Churn', data=df, palette="tab10")
    plt.show()

(7) Internet Service: Customers whose internet service is Fiber optic are more likely to churn.

(8) Online Security: Customers without online security are more likely to churn.

(9) Online Backup: Customers without online backup are more likely to churn.

(10) Device Protection: Customers without device protection are more likely to churn.

(11) Tech Support: Customers without tech support are more likely to churn.

(12) Streaming TV / Streaming Movies: Streaming TV and Streaming Movies have little effect on the churn rate; however, customers without internet service are less likely to churn.

In [None]:
for col in ['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport','StreamingTV', 'StreamingMovies']:
    plt.figure(figsize=(8,5))
    sns.countplot(x=col, hue='Churn', data=df, palette="tab10")
    plt.show()

(13) Paperless Billing: Customers with paperless billing are more likely to churn.

(14) Payment Method: Customers who pay by electronic check are more likely to churn than those who use other payment methods.

(15) Contract: The churn rate goes down as the length of the contract increases.

In [None]:
for col in ['PaperlessBilling', 'PaymentMethod', 'Contract',]:
    plt.figure(figsize=(8,5))
    sns.countplot(x=col, hue='Churn', data=df, palette="tab10")
    plt.show()
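
Count plots show absolute counts, so groups of different sizes can be hard to compare. As a complement, the churn rate per category can be computed directly. This is a minimal sketch on toy data; the helper name `churn_rate_by` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

def churn_rate_by(frame: pd.DataFrame, col: str) -> pd.Series:
    """Share of rows with Churn == 'Yes' within each level of `col`."""
    return frame["Churn"].eq("Yes").groupby(frame[col]).mean()

rates = churn_rate_by(toy, "Contract")
print(rates)
```

Applied to the notebook's `df`, a call such as `churn_rate_by(df, 'PaymentMethod')` would return one rate per payment method.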

# V. Data Preprocessing

### 1. Removing Duplicates
There are no duplicate customer IDs in this data set.

In [None]:
# summarize duplicates
sum(df.duplicated('customerID'))
#df2 = df.drop_duplicates('customerID')

### 2. Dropping Unnecessary Columns

Remove customerID, which is an identifier and carries no predictive information.

In [None]:
# remove customerID
df2 = df.drop(['customerID'], axis = 1)
df2.head(5)

### 3. Categorical Data Encoding

To encode the categorical variables, we use One-Hot Encoding for nominal variables and Label Encoding for ordinal variables.

* One-Hot Encoding: Gender, Partner, Dependents, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling, PaymentMethod, Churn

* Label Encoding: Contract

In [None]:
# Dummy Variables(One-Hot Encoding)
Gender = pd.get_dummies(df2['Gender'], prefix='Gender', drop_first=True)
Partner = pd.get_dummies(df2['Partner'], prefix='Partner', drop_first=True)
Dependents = pd.get_dummies(df2['Dependents'], prefix='Dependents', drop_first=True)
MultipleLines = pd.get_dummies(df2['MultipleLines'], prefix='MultipleLines', drop_first=True)
InternetService = pd.get_dummies(df2['InternetService'], prefix='InternetService', drop_first=True)
OnlineSecurity = pd.get_dummies(df2['OnlineSecurity'], prefix='OnlineSecurity', drop_first=True)
OnlineBackup = pd.get_dummies(df2['OnlineBackup'], prefix='OnlineBackup', drop_first=True)
DeviceProtection = pd.get_dummies(df2['DeviceProtection'], prefix='DeviceProtection', drop_first=True)
TechSupport = pd.get_dummies(df2['TechSupport'], prefix='TechSupport', drop_first=True)
StreamingTV = pd.get_dummies(df2['StreamingTV'], prefix='StreamingTV', drop_first=True)
StreamingMovies = pd.get_dummies(df2['StreamingMovies'], prefix='StreamingMovies', drop_first=True)
PaperlessBilling = pd.get_dummies(df2['PaperlessBilling'], prefix='PaperlessBilling', drop_first=True)
PaymentMethod = pd.get_dummies(df2['PaymentMethod'], prefix='PaymentMethod', drop_first=True)
Churn = pd.get_dummies(df2['Churn'], prefix='Churn', drop_first=True)
# PhoneService is not encoded: it is redundant with MultipleLines and is dropped below


df3 = pd.concat([df2, Gender, Partner, Dependents, MultipleLines, InternetService, 
                 OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, 
                 StreamingMovies, PaperlessBilling, PaymentMethod, Churn], axis=1)

In [None]:
# Label Encoding
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df3['Contract']= label_encoder.fit_transform(df3['Contract']) 

In [None]:
# drop original columns (Contract is kept because it now holds the label-encoded values)
cols_to_drop = ['Gender', 'Partner', 'Dependents', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 
'PaymentMethod', 'Churn', 'PhoneService']
df3.drop(columns=cols_to_drop, inplace=True)
df3.head()
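
The two encoding schemes described above can be seen side by side in a small sketch on toy data. Note one assumption: instead of `LabelEncoder`, which assigns codes in sorted label order, this sketch uses an explicit mapping (`contract_order`, a hypothetical name) so that the ordinal order is visible in the code.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Two year", "Month-to-month", "One year"],
})

# One-hot encoding; drop_first=True drops the first level ("No") to avoid a redundant column
partner = pd.get_dummies(toy["Partner"], prefix="Partner", drop_first=True)

# Explicit ordinal mapping (an assumption of this sketch, not the notebook's LabelEncoder call)
contract_order = {"Month-to-month": 0, "One year": 1, "Two year": 2}
contract = toy["Contract"].map(contract_order)

encoded = pd.concat([partner, contract], axis=1)
print(encoded)
```

For the Contract levels, sorted label order happens to coincide with the contract length, so both approaches give the same codes here. An explicit mapping avoids relying on that coincidence.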

### 4. Splitting the Data into Training Set (70%) and Test Set (30%)
We split the data in 70:30 ratio so that 70% of the data will be used for training the model while 30% will be used for testing the model.

In [None]:
# train test split
from sklearn.model_selection import train_test_split # split dataset
X_train, X_test, y_train, y_test = train_test_split(df3.drop('Churn_Yes',axis=1),df3['Churn_Yes'],test_size=0.3,random_state=101)

In [None]:
# reset the index of each split and check
for i in [X_train, X_test, y_train, y_test]:
    i.index = range(i.shape[0]) 
    print(i.index)
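
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the churn proportion the same in both sets. The sketch below uses synthetic labels, and `stratify=y` is an addition of this sketch that the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target with 30% positives (illustrative, not the real data)
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

# stratify=y preserves the class proportions in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

print(y_tr.mean(), y_te.mean())
```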

### 5. Identifying Missing Values

The feature TotalCharges has **null/missing values**, which we impute with the mean, first in the training set and then in the test set.

In [None]:
#summarize missing values - X_train
X_train.isnull().sum()

In [None]:
# fill missing values w/ the training mean (assignment avoids chained inplace fillna)
X_train['TotalCharges'] = X_train['TotalCharges'].fillna(X_train['TotalCharges'].mean())
# check missing values
X_train.isnull().sum()

In [None]:
#summarize missing values - y_train
y_train.isnull().sum()

In [None]:
#summarize missing values - X_test
X_test.isnull().sum()

In [None]:
# fill missing values w/ the training mean so that no test-set statistic is used
X_test['TotalCharges'] = X_test['TotalCharges'].fillna(X_train['TotalCharges'].mean())
# check missing values
X_test.isnull().sum()

In [None]:
#summarize missing values - y_test
y_test.isnull().sum()

In [None]:
y_train
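
One common way to organize mean imputation is to learn the mean from the training set and then apply that same value to both sets. A minimal sketch with scikit-learn's `SimpleImputer` on toy frames (the values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test frames, each with one missing value
train = pd.DataFrame({"TotalCharges": [100.0, np.nan, 300.0]})
test = pd.DataFrame({"TotalCharges": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns mean = 200.0 from the training rows
test_filled = imputer.transform(test)        # reuses the training mean for the test rows

print(train_filled.ravel(), test_filled.ravel())
```

Fitting on the training rows only mirrors how the scalers in the next step are fitted.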

### 6. Identifying Outliers

The training data set does not have outliers.

In [None]:
# check outliers
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    fig = plt.figure(figsize=(8,3))
    sns.boxplot(x=X_train[col])  # pass the data by keyword; positional use is deprecated in recent seaborn

### 7. Feature Scaling - Standardization / Normalization

In [None]:
## Standardization
standard_scaler = preprocessing.StandardScaler().fit(X_train)
X_train_standard = standard_scaler.transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

#from sklearn.preprocessing import StandardScaler
#scaler_s = StandardScaler() 
#data_standard_scaled = scaler_s.fit_transform(data)

In [None]:
# Normalization
minmax_scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train_minmax = minmax_scaler.transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

#from sklearn.preprocessing import MinMaxScaler
#scaler_m = MinMaxScaler() 
#data_normal_scaled = scaler_m.fit_transform(data)
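
To make the two transformations concrete: standardization computes (x - mean) / std and min-max normalization computes (x - min) / (max - min), in both cases with statistics taken from the training set. The sketch below checks the scalers against these formulas on a toy array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature arrays for illustration
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[4.0]])

std_scaled = StandardScaler().fit(X_tr).transform(X_te)
mm_scaled = MinMaxScaler().fit(X_tr).transform(X_te)

# Manual equivalents using training statistics only
z_manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)  # population std (ddof=0)
mm_manual = (X_te - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))

print(std_scaled, mm_scaled)
```

A test value above the training maximum maps to a number greater than 1 under min-max scaling, which is expected when the scaler is fitted on the training set only.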

# VI. Model Building & Evaluation

## 1. Logistic Regression 

Since this is an imbalanced dataset, we use class-weighted logistic regression (`class_weight='balanced'`).

In [None]:
# training 
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression(random_state=0, max_iter=1000, solver='lbfgs', class_weight='balanced')
lm.fit(X_train_standard, y_train)
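
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`, so the rarer class receives the larger weight. A minimal sketch on toy labels, comparing scikit-learn's `compute_class_weight` with the formula:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: three non-churners and one churner
y = np.array([0, 0, 0, 1])
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same weights from the formula n_samples / (n_classes * bincount(y))
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
```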

In [None]:
# predicting
y_pred = lm.predict(X_test_standard)

In [None]:
# evaluation
from sklearn.metrics import classification_report 
print(classification_report(y_test, y_pred))

In [None]:
# confusion_matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# performance metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
accuracy = round(accuracy_score(y_test, y_pred),2)
f1 = round(f1_score(y_test, y_pred),2)  # named f1 so the imported f1_score function is not overwritten
precision = round(precision_score(y_test, y_pred),2)
recall = round(recall_score(y_test, y_pred),2)
y_prob_scores_test = lm.predict_proba(X_test_standard)[:,1]
auc_score = round(roc_auc_score(y_test, y_prob_scores_test),2)

# logistic regression
from astropy.table import Table
dict1 = [{'accuracy': accuracy, 'f1_score': f1, 'precision': precision, 'recall': recall, 'auc_score': auc_score}]
logis_matrix = Table(rows=dict1)
print(logis_matrix)
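
The precision, recall, and F1 values reported above all derive from the confusion matrix. A minimal sketch on toy predictions (not the model's output) shows the relationship and checks it against scikit-learn:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels and predictions for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted churners that really churned
recall = tp / (tp + fn)     # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```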

In [None]:
# roc plot
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
fig,ax = plt.subplots(figsize=(7,7))
RocCurveDisplay.from_estimator(lm, X_test_standard, y_test, ax=ax)

## 2. K Nearest Neighbors 

#### Data Transformation: Normalization (Min-Max Scaler) 

In [None]:
# training and predicting
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_minmax,y_train)
y_pred = knn.predict(X_test_minmax)
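
The model above fixes `n_neighbors=1`. One possible way to choose the number of neighbors is cross-validation on the training data. The sketch below uses synthetic data from `make_classification`, and the candidate grid and the F1 scoring are assumptions of this sketch, not choices made in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, mildly imbalanced data standing in for the training set
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.73, 0.27], random_state=0)

# Scaling sits inside the pipeline so it is refitted on each training fold
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
candidates = [1, 5, 15, 25]
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": candidates},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(grid.best_score_, 2))
```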

In [None]:
# evaluation
from sklearn.metrics import classification_report 
print(classification_report(y_test, y_pred))

In [None]:
# confusion_matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# performance metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
accuracy = round(accuracy_score(y_test, y_pred),2)
f1 = round(f1_score(y_test, y_pred),2)  # named f1 so the imported f1_score function is not overwritten
precision = round(precision_score(y_test, y_pred),2)
recall = round(recall_score(y_test, y_pred),2)
y_prob_scores_test = knn.predict_proba(X_test_minmax)[:,1]
auc_score = round(roc_auc_score(y_test, y_prob_scores_test),2)

# knn
from astropy.table import Table
dict2 = [{'accuracy': accuracy, 'f1_score': f1, 'precision': precision, 'recall': recall, 'auc_score': auc_score}]
knn_matrix = Table(rows=dict2)
print(knn_matrix)

In [None]:
# roc plot
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
fig,ax = plt.subplots(figsize=(7,7))
RocCurveDisplay.from_estimator(knn, X_test_minmax, y_test, ax=ax)

## 3. Decision Tree

In [None]:
# training and predicting
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0, max_depth=5)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

In [None]:
# evaluation
from sklearn.metrics import classification_report 
print(classification_report(y_test, y_pred))

In [None]:
# confusion_matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# performance metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
accuracy = round(accuracy_score(y_test, y_pred),2)
f1 = round(f1_score(y_test, y_pred),2)  # named f1 so the imported f1_score function is not overwritten
precision = round(precision_score(y_test, y_pred),2)
recall = round(recall_score(y_test, y_pred),2)
y_prob_scores_test = clf.predict_proba(X_test)[:,1]
auc_score = round(roc_auc_score(y_test, y_prob_scores_test),2)

# decision tree
from astropy.table import Table
dict3 = [{'accuracy': accuracy, 'f1_score': f1, 'precision': precision, 'recall': recall, 'auc_score': auc_score}]
tree_matrix = Table(rows=dict3)
print(tree_matrix)

In [None]:
# roc plot
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
fig,ax = plt.subplots(figsize=(7,7))
RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax)

In [None]:
from sklearn import tree
# class_names expects one name per class in ascending label order, not a single string
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=list(X_train.columns),
                                class_names=['Non-Churn', 'Churn'],
                                filled=True,
                                max_depth=3)

graph = graphviz.Source(dot_data, format="png") 

graph

In [None]:
importances = clf.feature_importances_
weights = pd.Series(importances,index=X_train.columns.values)
weights.sort_values()[-10:].plot(kind = 'barh')

## 4. Random Forest

In [None]:
# training and predicting
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=0)  # fixed seed so that the results are reproducible
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

In [None]:
# evaluation
from sklearn.metrics import classification_report 
print(classification_report(y_test, y_pred))

In [None]:
# confusion_matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# performance metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
accuracy = round(accuracy_score(y_test, y_pred),2)
f1 = round(f1_score(y_test, y_pred),2)  # named f1 so the imported f1_score function is not overwritten
precision = round(precision_score(y_test, y_pred),2)
recall = round(recall_score(y_test, y_pred),2)
y_prob_scores_test = forest.predict_proba(X_test)[:,1]
auc_score = round(roc_auc_score(y_test, y_prob_scores_test),2)

# random forest
from astropy.table import Table
dict4 = [{'accuracy': accuracy, 'f1_score': f1, 'precision': precision, 'recall': recall, 'auc_score': auc_score}]
forest_matrix = Table(rows=dict4)
print(forest_matrix)

In [None]:
# roc plot
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
fig,ax = plt.subplots(figsize=(7,7))
RocCurveDisplay.from_estimator(forest, X_test, y_test, ax=ax)

In [None]:
importances = forest.feature_importances_
weights = pd.Series(importances,index=X_train.columns.values)
weights.sort_values()[-10:].plot(kind = 'barh')

# VII. Conclusion

Based on the performance metrics below, the best model is Logistic Regression, with an F1-score of 62% and an AUC score of 0.84.

In [None]:
print('logistic regression')
print(logis_matrix)
print('knn')
print(knn_matrix)
print('decision tree')
print(tree_matrix)
print('random forest')
print(forest_matrix)
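
The metric cells in the previous section repeat the same steps for each model. As a possible simplification, they could be folded into one helper that returns a single comparison table. The sketch below runs on synthetic data, so its numbers say nothing about the Telco results; the helper name `summarize` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def summarize(model, X_eval, y_eval) -> dict:
    """Return the five metrics used in this notebook, rounded to two decimals."""
    y_hat = model.predict(X_eval)
    y_prob = model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": round(accuracy_score(y_eval, y_hat), 2),
        "f1_score": round(f1_score(y_eval, y_hat), 2),
        "precision": round(precision_score(y_eval, y_hat), 2),
        "recall": round(recall_score(y_eval, y_hat), 2),
        "auc_score": round(roc_auc_score(y_eval, y_prob), 2),
    }

# Synthetic data standing in for the prepared features and target
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=101)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# One row per model, one column per metric
comparison = pd.DataFrame(
    {name: summarize(m.fit(X_tr, y_tr), X_te, y_te) for name, m in models.items()}
).T
print(comparison)
```

Sorting the table, for example with `comparison.sort_values('f1_score', ascending=False)`, makes the ranking of the models explicit.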