# Introduction

A company's growth depends not only on attracting new customers but also on keeping existing ones. It is therefore vital to investigate what motivates customers to leave and to predict who is likely to churn, so that we can act on those customers before they drop out. The telecom-users dataset contains around 6 thousand customer records from a telecom company. The attributes include customer demographics, the services they subscribe to, billing information, and, most importantly, whether the contract was renewed. The objectives of this notebook are to explore the relationships between customer features and churn, and to build models that predict whether a customer will leave. The notebook consists of five sections:
* Import Data and Cleaning
* Explore the Distribution of Target and Features
* Explore the Effects of Features on Target
* Preprocess
* Build Models and Make Prediction

In [None]:

import numpy as np 
import pandas as pd 
from math import floor
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import  plot_confusion_matrix, classification_report
%matplotlib inline
sns.set_theme()


# Import Data and Cleaning

In [None]:
# import data from csv file and show the head
df = pd.read_csv('../input/telecom-users-dataset/telecom_users.csv')
df.head()

In [None]:
# first take a brief look at data types & non-null counts in each column
df.info()
# count duplicated entries
df.duplicated().sum()

It appears that the first column is redundant, so I drop it and set customerID as the index.
The data types of TotalCharges and SeniorCitizen are also changed to make further inspection easier.

In [None]:
# drop Unnamed: 0 and set customerID as index
df.drop('Unnamed: 0',axis = 1, inplace = True)
df.set_index('customerID',inplace = True)
# change the data types of TotalCharges and SeniorCitizen 
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')
df.info()
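
As a small illustration of what `errors='coerce'` does in the cell above, here is a sketch on a made-up column. The values are assumptions for illustration, not rows from the dataset: any entry that cannot be parsed as a number becomes NaN instead of raising an error, which is how null TotalCharges values can appear after the conversion.

```python
import pandas as pd

# Toy column stored as text; the blank entry is an assumed example of a non-numeric value
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" converts unparseable entries to NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("null entries:", converted.isna().sum())
```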

There are 10 entries with null TotalCharges. From the column names we can infer that TotalCharges could be tenure * MonthlyCharges, so let's make a scatter plot to see if that is the case.


In [None]:
eval_TotalCharges = df.tenure * df.MonthlyCharges
ax = sns.scatterplot(x = eval_TotalCharges,y = df.TotalCharges )
ax.set(xlabel = 'Evaluated TotalCharges',ylabel ='Actual TotalCharges' )
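# after seeing the scatter plot, I think it's now safe to fill null TotalCharges with tenure * MonthlyCharges
# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update df in recent pandas versions
df['TotalCharges'] = df['TotalCharges'].fillna(df.tenure * df.MonthlyCharges)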
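
The visual check can be complemented with a number. The sketch below uses a hypothetical four-row frame (`toy` and its values are assumptions, not dataset rows): it measures how well tenure * MonthlyCharges tracks the known TotalCharges and then fills the missing entry with the estimate.

```python
import numpy as np
import pandas as pd

# Hypothetical records for illustration only
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 0],
    "MonthlyCharges": [50.0, 80.0, 20.0, 60.0],
    "TotalCharges": [50.0, 975.0, 470.0, np.nan],
})

estimate = toy["tenure"] * toy["MonthlyCharges"]
known = toy["TotalCharges"].notna()

# Pearson correlation between estimate and actual, on rows where the actual value is known
corr = estimate[known].corr(toy.loc[known, "TotalCharges"])
print(f"correlation: {corr:.4f}")

# fill the gaps with the estimate
toy["TotalCharges"] = toy["TotalCharges"].fillna(estimate)
print(toy)
```

A correlation close to 1 supports using the product as a fill value; a noticeably lower value would suggest looking for another imputation rule.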

# Explore the Distribution of Target and Features

**First take a look at the target variable.**

In [None]:
# set the palette to colorblind-friendly style
sns.set_palette('colorblind')
sns.countplot(data = df, x = 'Churn')
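
Alongside the count plot, the class shares can be printed directly. This is a minimal sketch on a made-up Series (the labels below are assumptions, not the dataset's actual proportions); on the real data the same call would be applied to `df['Churn']`.

```python
import pandas as pd

# Made-up target column for illustration
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"], name="Churn")

# normalize=True returns proportions instead of raw counts
shares = churn.value_counts(normalize=True)
print(shares)
```

Knowing how unbalanced the target is helps when reading accuracy later, and it motivates a stratified train/test split.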

**Distribution of numeric features**

In [None]:
# extract numeric and categorical columns
num_cols = df.columns[df.dtypes!='object']
non_num_cols = df.columns[(df.dtypes=='object') & (df.columns!='Churn')]

fig, axes = plt.subplots(1,3,figsize=(18, 5))
for i,col in enumerate(num_cols):
    sns.violinplot(ax = axes[i], y = df[col])

**Distribution of categorical features**

In [None]:
fig, axes = plt.subplots(8,2,figsize=(15,60))
# one pie chart per categorical feature, filling the 8x2 grid row by row
for ax,col in zip(axes.flat, non_num_cols):
    counts = df[col].value_counts()
    counts.plot.pie(ax = ax,explode=[0.03]*len(counts),autopct="%.1f%%",labeldistance = 1.05,radius = 0.8)

The dataset appears to be fairly clean already: there are no unreasonable outliers and no categories that do not belong to their columns. So we can move on to the next section without further cleaning.

# Explore the Effects of Features on Target

**Effects of numeric features**

In [None]:
sns.pairplot(df,vars =num_cols, hue = "Churn",kind = 'kde')

* The first pattern observed is that clients are most likely to drop out within the first couple of months.
* This is particularly the case for those who received expensive bills right after signing up: churn is densest where the monthly charges are above 70 and the tenure is close to 0.
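
The tenure pattern above can also be expressed as a churn rate per tenure band. The following is a sketch on a hypothetical frame; the band edges (6, 24, 72 months) and the rows are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical customers for illustration only
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 10, 20, 30, 50, 70],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# bucket tenure into bands; include_lowest keeps tenure == 0 in the first band
toy["tenure_band"] = pd.cut(
    toy["tenure"], bins=[0, 6, 24, 72],
    labels=["0-6", "7-24", "25-72"], include_lowest=True,
)

# share of churned customers within each band
churn_rate = toy["Churn"].eq("Yes").groupby(toy["tenure_band"], observed=False).mean()
print(churn_rate)
```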

In [None]:
fig, axes = plt.subplots(4,4,figsize=(20, 20),sharey = True)
# one count plot per categorical feature, split by churn, filling the 4x4 grid row by row
for i,col in enumerate(non_num_cols):
    plt_col = i%4
    plt_row = i//4
    chart = sns.countplot(ax =axes[plt_row,plt_col],x= col ,hue = 'Churn' ,data = df)
    # tilt the tick labels when the category names are long
    if df[col].astype(str).str.len().max()>20:
        chart.tick_params(axis = 'x', labelrotation = 15)

* Demographics have little impact on churn.
* Among clients who subscribe to internet service, those who also subscribe to additional services are more likely to stay.
* Clients on month-to-month contracts appear more likely to churn, presumably because such contracts make it easier to change service providers.
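
Because the count plots show absolute counts, group sizes can hide the actual churn rates. A row-normalized cross-tabulation makes the comparison explicit. This is a sketch on invented rows (the frame `toy` is an assumption); on the real data, the same call would take `df['Contract']` and `df['Churn']`.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, giving the churn share per category
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```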

# Preprocess

Before fitting the models, we need to convert the categorical data to numeric form. As most categories are not ordinal, one-hot encoding is applied instead of integer encoding.

In [None]:
# keep drop_first argument false, and manually select redundant dummies to drop and keep the relevant ones
df_dummies = pd.get_dummies(df)
df_dummies.drop(df_dummies.columns[df_dummies.columns.str.endswith('No internet service')],axis=1,inplace = True)
df_dummies.drop(['gender_Male', 'SeniorCitizen_0', 'Partner_No', 'Dependents_No', 'PhoneService_No','PaperlessBilling_No','Churn_No'],axis=1,inplace = True)
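
To make the encoding step concrete, here is a sketch on a tiny invented frame (its column values are assumptions for illustration). Each category becomes its own indicator column, and for a two-level feature one of the two indicators is redundant because it is the complement of the other.

```python
import pandas as pd

# Tiny invented frame for illustration only
toy = pd.DataFrame({
    "PaymentMethod": ["Mailed check", "Electronic check", "Mailed check"],
    "Partner": ["Yes", "No", "Yes"],
})

# one indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy)

# Partner_No carries the same information as Partner_Yes, so drop it
dummies = dummies.drop(columns=["Partner_No"])
print(list(dummies.columns))
```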

In [None]:
corMatrix = df_dummies.corr()
plt.figure(figsize=(40,25))
sns.heatmap(corMatrix,  annot = True)

In [None]:
corMatrix['Churn_Yes'].sort_values()

In [None]:
X = df_dummies.drop(['Churn_Yes'],axis = 1)
scaler = StandardScaler()
scaler.fit(X)
X1 = scaler.transform(X)
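
One possible refinement: the scaler above is fit on all rows, including those that are later held out for testing, so the test set influences the scaling statistics. A common alternative is to split first and put the scaler in a pipeline, so it is fit on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the encoded churn features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# split before any fitting so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# the pipeline fits the scaler on the training rows only, then reuses it at predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)

score = pipe.score(X_te, y_te)
print(f"test accuracy: {score:.3f}")
```

With a dataset of this size the effect on the scores is probably small, but the pipeline form keeps the evaluation clean.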

# Build Models and Make Prediction

In [None]:

y = df_dummies.Churn_Yes
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.3, random_state=42,stratify =y)
lr = LogisticRegression(max_iter = 500,random_state = 42)
lr.fit(X_train,y_train)
plot_confusion_matrix(lr, X_test, y_test)
predy = lr.predict(X_test)
print(classification_report(y_test,predy))

In [None]:
features = pd.DataFrame(lr.coef_.transpose(),index = X.columns)
features.sort_values(by = 0, ascending = False)


In [None]:
rfclf = RandomForestClassifier(n_estimators = 500,criterion = 'entropy',random_state = 42)
rfclf.fit(X_train,y_train)
plot_confusion_matrix(rfclf, X_test, y_test)
predy = rfclf.predict(X_test)
print(classification_report(y_test,predy))

In [None]:
features = pd.DataFrame(rfclf.feature_importances_,index = X.columns)
features.sort_values(by = 0, ascending = False)

In [None]:
gbclf = GradientBoostingClassifier(n_estimators = 500,random_state = 42)
gbclf.fit(X_train,y_train)
plot_confusion_matrix(gbclf, X_test, y_test)
predy = gbclf.predict(X_test)
print(classification_report(y_test,predy))

In [None]:
features = pd.DataFrame(gbclf.feature_importances_,index = X.columns)
features.sort_values(by = 0, ascending = False)


LogisticRegression scored best in overall accuracy, but GradientBoostingClassifier performed slightly better in terms of churn recall, and its top features align more closely with what we observed in the EDA.
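
To put accuracy and churn recall side by side for several models, a small loop like the following can help. It is a sketch on synthetic, moderately unbalanced data: the class weights, the reduced `n_estimators`, and all names are assumptions, and the numbers it prints say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic data with roughly 25% positives, standing in for churners
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

models = {
    "lr": LogisticRegression(max_iter=500),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    summary[name] = {
        "accuracy": accuracy_score(y_te, pred),
        # recall on the positive class: share of actual churners that were caught
        "churn_recall": recall_score(y_te, pred, pos_label=1),
    }

for name, scores in summary.items():
    print(name, scores)
```

When the goal is to reach customers before they leave, recall on the churn class is arguably the more relevant of the two numbers.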