# Building Models to Predict Telco Customer Churn
Using the Telco Customer Churn dataset, I'll build models to predict customer churn.

### Outline:
1. Import libraries and data
2. Get an understanding of the data: relationships between features, check for an imbalanced target, remove possible outliers
3. Build models using the original dataset and data resampled with SMOTE to account for imbalanced classes
4. Build models based on internet service type

This notebook illustrates that an effective model for the entire dataset can be supplemented with models built for specific subsets of the data.

More specifically, I'll first build a general model to predict customer churn given no specifics about the customer. I'll then build separate models for customers grouped by internet service type. To get the best predictions, a company could use the general model by default, or, if it knows the customer has DSL or no internet, use those specific models to generate more accurate predictions.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Import dataset
file_path = '/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(file_path)
df.head(10)

In [None]:
# Check for missing values
if df.isna().sum().sum() == 0:
    print('No missing values.')

# Check for duplicate customerIDs
if not df['customerID'].duplicated().any():
    print('No duplicate customers.')

In [None]:
# Check feature datatypes
df.dtypes

In [None]:
# Understand number of unique values for each categorical feature
df.select_dtypes(object).nunique()

Below, I'll plot three features with customer churn:

In [None]:
# Choose a few features to look into with regards to churn
cols_vis = ['InternetService','PaymentMethod','Contract']
sns.set()
cust_pal = ['#157DEC','#4CC552']
sns.set_palette(cust_pal)
for col in cols_vis:
    # Count customers in each (feature value, churn) group
    df_plot = df.groupby([col,'Churn']).size().reset_index(name='TotalCustomers')
    sns.barplot(x=col, y='TotalCustomers', hue='Churn', data=df_plot)
    plt.ylabel("Total Customers", size=14)
    plt.xlabel(col, size=14)
    plt.xticks(rotation=30)
    plt.title("Customer Churn by " + col, size=18)
    plt.show()

The contract vs. customer churn plot seems obvious - those in a two-year contract might face some monetary penalty for leaving, thus leading to very low customer churn.

Customer churn looks fairly stable across payment methods, with the exception of electronic check: more of the customers who pay that way churn than with any other method. That could be something to look into in further work.

Customer churn by internet service was particularly interesting to me. It appears that customers with no internet have the lowest churn, and those with the fastest speed (fiber optic) have the highest rate of churn. Could it be that the speeds aren't as advertised, or perhaps the service is too expensive? It may be that customers feel the company's fiber optic internet service isn't worth the money in some way.
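The bar charts show raw counts, so groups of different sizes are hard to compare directly. One way to put them on the same scale is a row-normalised crosstab, which gives the churn rate within each group. This is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not taken from the Telco data):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Telco dataset)
toy = pd.DataFrame({
    'InternetService': ['DSL', 'DSL', 'Fiber optic', 'Fiber optic',
                        'Fiber optic', 'No', 'No', 'No'],
    'Churn': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

# normalize='index' makes each row sum to 1,
# so the 'Yes' column is the churn rate within that group
churn_rate = pd.crosstab(toy['InternetService'], toy['Churn'], normalize='index')
print(churn_rate.round(2))
```

On the real data, the same call with `df['InternetService']` and `df['Churn']` should give the per-group churn rates behind the chart.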

In [None]:
# Relationship between monthly charges and churn
df_charge_churn = df.groupby(['Churn'])['MonthlyCharges'].mean()
display(df_charge_churn)

It seems logical that those who are generally paying more per month are more likely to churn. The more expensive a service, the more a customer would consider that when looking at their budget. They're also more likely to be skeptical of any mishaps with that service if they're paying more than they'd like.

In [None]:
# Check number of classes
print(df['Churn'].unique())

# Map Churn from Yes/No to 1/0
# (avoid naming the mapping `dict`, which would shadow the Python built-in)
churn_map = {'Yes': 1, 'No': 0}
df['Churn'] = df['Churn'].map(churn_map)

# Check class balance
df.groupby(['Churn'])['Churn'].count()

The classes are imbalanced, which I'll account for with SMOTE when building the models.
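SMOTE, which this notebook applies later through `imblearn`, creates synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbours. The snippet below is a simplified NumPy sketch of that idea, not the `imblearn` implementation; the function name `smote_like_oversample`, its parameters, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly chosen
    minority row and one of its k nearest minority neighbours (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)           # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class (hypothetical values)
X_minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5], [4.0, 5.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.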

In [None]:
# Note that TotalCharges is numeric, not categorical
# It likely loaded as object because of bad values, so coerce those to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print(df['TotalCharges'].isna().sum(), 'rows with non-numeric TotalCharges will be dropped.')
df.dropna(subset=['TotalCharges'], axis = 0, inplace = True)

Now I'll do a very basic check for outliers by creating a box and whisker chart for customer monthly charges and customer total charges.
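For reference, a box plot with default settings draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and plots anything further out as an individual point. The sketch below computes those fences by hand on made-up numbers (the `charges` array is hypothetical), which can help when reading the charts that follow:

```python
import numpy as np

# Toy charges (hypothetical values)
charges = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 300.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(charges, [25, 75])
iqr = q3 - q1

# Points beyond these fences would be drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = charges[(charges < lower_fence) | (charges > upper_fence)]
print(lower_fence, upper_fence, outliers)
```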

In [None]:
# Look for outliers in monthly charges and total charges
sns.boxplot(x=df['MonthlyCharges'])
plt.show()
sns.boxplot(x=df['TotalCharges'])
plt.show()

The total charges chart is skewed, so I'll focus my work on the data between the 5th and 95th percentile of charges.

In [None]:
# Looks to be a significant amount of variation in total charges
# Focus on 5% to 95%
Perc5 = round(np.percentile(df['TotalCharges'],5),2)
Perc95 = round(np.percentile(df['TotalCharges'],95),2)

print('Focus on customers with total charges between ${} and ${}.'.format(Perc5,Perc95))
    
# Filter dataset (copy so later in-place edits don't act on a slice of the original)
df = df.loc[(df['TotalCharges'] <= Perc95) & (df['TotalCharges'] >= Perc5)].copy()

In [None]:
# Noticing some redundant information in columns for customers without internet service
cols = ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
df[cols] = df[cols].replace({'No internet service': 'No'})

Next, I will encode the categorical variables to prepare for building the models:
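As a small illustration of what one-hot encoding with `drop_first = True` produces, here is a sketch on a made-up frame (the `toy` data is hypothetical). A column with three categories becomes two indicator columns, and the dropped first category is implied when both indicators are off:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'tenure': [1, 12, 24, 6],
})

# Three categories -> two indicator columns; 'Month-to-month' is the dropped baseline
encoded = pd.get_dummies(toy[['Contract']], drop_first=True)
print(list(encoded.columns))
```

Dropping one level avoids perfectly collinear indicator columns, which matters most for linear models such as logistic regression.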

In [None]:
# Encode categorical variables
df.drop('customerID', axis = 1, inplace = True)
df['SeniorCitizen'] = df['SeniorCitizen'].astype(object)
df_encoded = pd.get_dummies(df.select_dtypes(object), drop_first = True)
df_numeric = df.select_dtypes(include = 'number')
df_final = pd.merge(df_encoded, df_numeric, left_index = True, right_index = True)

# Separate the features (X) from the target (Y)
X = df_final.drop(['Churn'], axis = 1)
Y = df_final['Churn']

I define a function to easily build and evaluate models:

In [None]:
# Define Build Models Function
def BuildModel(X,Y,Algorithm,imb_class):
    # Split first so the test set contains only real, untouched customers
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
    if imb_class == 1:
        # Oversample the training data only; resampling before the split
        # would leak synthetic neighbours of test rows into training
        oversample = SMOTE(random_state = 0)
        X_train, Y_train = oversample.fit_resample(X_train, Y_train)
    if Algorithm == 'LogisticRegression':
        Classifier = LogisticRegression(max_iter=1000)
    elif Algorithm == 'RandomForest':
        Classifier = RandomForestClassifier(n_estimators = 1000)
    elif Algorithm == 'NaiveBayes':
        Classifier = GaussianNB()
    else:
        raise ValueError('Unknown algorithm: ' + str(Algorithm))
    Classifier = Classifier.fit(X_train,Y_train)
    Y_pred = Classifier.predict(X_test)
    # Score on the held-out test set
    precision = round(precision_score(Y_test,Y_pred),2)
    recall = round(recall_score(Y_test,Y_pred),2)
    fscore = round(f1_score(Y_test,Y_pred),2)
    accuracy = round(accuracy_score(Y_test,Y_pred),2)
    return precision, recall, fscore, accuracy

Now that I have my BuildModel function defined, I am going to loop through a few different algorithms, with and without using SMOTE, to see how well they perform.
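For reference, the four scores BuildModel returns can all be derived from the counts in a binary confusion matrix. The sketch below works through the arithmetic with made-up counts (the helper `metrics_from_counts` and the numbers passed to it are hypothetical):

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted churners, how many churned
    recall = tp / (tp + fn)             # of actual churners, how many were caught
    fscore = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, fscore, accuracy

# Hypothetical counts: 30 true positives, 10 false positives,
# 20 false negatives, 140 true negatives
precision, recall, fscore, accuracy = metrics_from_counts(tp=30, fp=10, fn=20, tn=140)
print(round(precision, 2), round(recall, 2), round(fscore, 2), round(accuracy, 2))
# prints: 0.75 0.6 0.67 0.85
```

With imbalanced classes, accuracy can look high even when recall on the churn class is poor, which is why it helps to read all four together.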

In [None]:
rows = []
Alg = ['LogisticRegression','RandomForest','NaiveBayes']

for i in range(0,2):  # 0 = original data, 1 = SMOTE
    for Algorithm in Alg:
        precision, recall, fscore, accuracy = BuildModel(X,Y,Algorithm,i)
        rows.append({'Algorithm': Algorithm, 'SMOTE': i, 'Precision': precision, 'Recall': recall, 'FScore': fscore, 'Accuracy': accuracy})

# DataFrame.append was removed in pandas 2.0, so build the table from a list of rows
model_efficacy = pd.DataFrame(rows, columns = ['Algorithm','SMOTE','Precision','Recall','FScore','Accuracy'])

display(model_efficacy)

Logistic Regression and Random Forest, both with SMOTE, perform the best. My business-informed decision would be to deploy Logistic Regression in this case, as it is easier to explain to internal and external clients.
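One reason Logistic Regression is easier to explain is that each coefficient converts to an odds ratio: `exp(coefficient)` is the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. The sketch below shows this on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration and are unrelated to the Telco features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): the first feature drives the label,
# the second is pure noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Odds ratio per feature: values above 1 raise the odds of class 1,
# values near 1 indicate little effect
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)
```

Applied to the churn model, the same conversion would need the fitted classifier to be returned from BuildModel, which the function does not currently do.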

Since I noticed a distinct difference in churn by the type of internet service a customer has, I'm going to build models based on the subsets of data grouped by internet service:

In [None]:
# Try building a different model for each group of internet users..

IS_DSL = df.loc[df['InternetService'] == 'DSL']
IS_FiberOptic = df.loc[df['InternetService'] == 'Fiber optic']
IS_No = df.loc[df['InternetService'] == 'No']

# Check count of classes
dsl = IS_DSL.groupby(['Churn'])['Churn'].count()
fo = IS_FiberOptic.groupby(['Churn'])['Churn'].count()
no = IS_No.groupby(['Churn'])['Churn'].count()
class_compare = \
pd.DataFrame(columns = ['InternetService','Churn_0','Churn_1'] \
             ,data = [['DSL',dsl[0],dsl[1]], \
                      ['FiberOptic',fo[0],fo[1]], \
                      ['None',no[0],no[1]]])
display(class_compare)

There is obviously some class imbalance in DSL and No Internet, but Fiber Optic looks ok. I'll stick with Logistic Regression for simplicity, since it was among the best performers on the general dataset.
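The per-segment class counts could also be produced in a single call with `pd.crosstab`, which fills in a zero when a segment has no rows for one class instead of raising on a missing label. A minimal sketch on made-up rows (the `toy` frame is hypothetical), with the minority share added as a rough imbalance measure:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'InternetService': ['DSL', 'DSL', 'DSL', 'Fiber optic', 'Fiber optic', 'No', 'No'],
    'Churn': [0, 0, 1, 1, 0, 0, 0],
})

# Rows are segments, columns are the Churn classes
class_counts = pd.crosstab(toy['InternetService'], toy['Churn'])

# Share of the smaller class in each segment (0.5 means perfectly balanced)
minority_share = class_counts.min(axis=1) / class_counts.sum(axis=1)
print(class_counts)
print(minority_share.round(2))
```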

In [None]:
# Encode categorical variables
# Reassign instead of dropping in place, since these frames are slices of df
IS_DSL = IS_DSL.drop('InternetService', axis = 1)
IS_FiberOptic = IS_FiberOptic.drop('InternetService', axis = 1)
IS_No = IS_No.drop('InternetService', axis = 1)

IS_DSL_encoded = pd.get_dummies(IS_DSL.select_dtypes(object), drop_first = True)
IS_DSL_numeric = IS_DSL.select_dtypes(include = 'number')
IS_DSL_final = pd.merge(IS_DSL_encoded, IS_DSL_numeric, left_index = True, right_index = True)

IS_FiberOptic_encoded = pd.get_dummies(IS_FiberOptic.select_dtypes(object), drop_first = True)
IS_FiberOptic_numeric = IS_FiberOptic.select_dtypes(include = 'number')
IS_FiberOptic_final = pd.merge(IS_FiberOptic_encoded, IS_FiberOptic_numeric, left_index = True, right_index = True)

IS_No_encoded = pd.get_dummies(IS_No.select_dtypes(object), drop_first = True)
IS_No_numeric = IS_No.select_dtypes(include = 'number')
IS_No_final = pd.merge(IS_No_encoded, IS_No_numeric, left_index = True, right_index = True)

In [None]:
rows = []

# (label, data, SMOTE flag): SMOTE is used for the imbalanced DSL and None groups only
segments = [('DSL', IS_DSL_final, 1),
            ('Fiber Optic', IS_FiberOptic_final, 0),
            ('None', IS_No_final, 1)]

for name, data, use_smote in segments:
    X = data.drop(['Churn'], axis = 1)
    Y = data['Churn']
    precision, recall, fscore, accuracy = BuildModel(X,Y,'LogisticRegression',use_smote)
    rows.append({'InternetService': name, 'Precision': precision, 'Recall': recall, 'FScore': fscore, 'Accuracy': accuracy})

# DataFrame.append was removed in pandas 2.0, so build the table from a list of rows
model_efficacy_separate = pd.DataFrame(rows, columns = ['InternetService','Precision','Recall','FScore','Accuracy'])

display(model_efficacy_separate)

Ultimately, I was able to get slightly higher performance for DSL and No Internet, and lower performance for Fiber Optic. This is expected given the general dataset performance.

In this case, I would recommend the business use the DSL or No Internet specific models if the customer falls into those groups. If the customer has Fiber Optic internet, I'd recommend using the general model. If the business is unsure which internet service a customer has for some reason, or the customer frequently changes service, the general model should be used.
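The recommendation above amounts to a simple lookup with a fallback. A minimal sketch of that routing follows, where the function `pick_model` and the placeholder strings standing in for fitted classifiers are hypothetical names introduced for illustration:

```python
def pick_model(internet_service, segment_models, general_model):
    """Return the segment-specific model when one exists, else the general model."""
    return segment_models.get(internet_service, general_model)

# Placeholder strings stand in for fitted classifiers (hypothetical names)
segment_models = {'DSL': 'dsl_model', 'No': 'no_internet_model'}
general_model = 'general_model'

# Fiber optic and unknown service (None) both fall back to the general model
for service in ['DSL', 'No', 'Fiber optic', None]:
    print(service, '->', pick_model(service, segment_models, general_model))
```

Keeping Fiber Optic out of the dictionary encodes the choice to use the general model for that group.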