# Telco Customer Churn Prediction

This project predicts customer churn using Python with the pandas and scikit-learn libraries. Two models are trained: logistic regression and a random forest classifier. SMOTE (from the imbalanced-learn library) is applied to address the class imbalance in the dataset.

Link to dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn/discussion?sort=hotness

## Import libraries

In [398]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [399]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [400]:
df = pd.read_csv('/Users/vivianchung/Desktop/projects/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [401]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [402]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [403]:
# Check for null values
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [404]:
# Check for duplicates
df.duplicated().sum()

0

In [405]:
# Calculate the percentage distribution of 'Churn' values
round(df['Churn'].value_counts()/df.shape[0]*100)

Churn
No     73.0
Yes    27.0
Name: count, dtype: float64

### Class imbalance

About 73% of customers did not churn and about 27% did. The dataset is therefore imbalanced: the non-churn class is roughly 2.7 times as large as the churn class.
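The same percentages can also be obtained with `value_counts(normalize=True)`. A minimal sketch on a toy series (the values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy target column; made-up values, not the Telco data
churn = pd.Series(['No', 'No', 'No', 'Yes'], name='Churn')

# normalize=True returns class proportions instead of raw counts
churn_pct = churn.value_counts(normalize=True).mul(100).round(1)
print(churn_pct)
```
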

### Actions to consider

SMOTE (Synthetic Minority Over-sampling Technique) will be applied to the training data to address the class imbalance.
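As a rough intuition, SMOTE creates new minority-class rows by interpolating between an existing minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step. It is a simplification, not imbalanced-learn's implementation: the neighbor search is omitted, and the function name and sample values are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
minority_a = [1.0, 10.0]  # hypothetical minority-class sample
minority_b = [3.0, 20.0]  # hypothetical nearest minority-class neighbor

synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

The synthetic point always lies between the two originals, which is why SMOTE adds variety compared with simply duplicating minority rows.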


In [406]:
#ok = [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]

## 1. Change 'TotalCharges' from object to numeric

In [407]:
# Change 'TotalCharges' from object to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
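To illustrate what `errors='coerce'` does, here is a minimal sketch on a toy series. The values are made up, and the string `'unknown'` is only a stand-in for an entry that cannot be parsed as a number.

```python
import pandas as pd

# Toy values; 'unknown' is a hypothetical unparseable entry
raw = pd.Series(['29.85', '1889.5', 'unknown'])

# errors='coerce' converts anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
print(converted.isnull().sum())
```

Entries that fail to parse become NaN instead of raising an error, so they show up later in `isnull()` checks.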

In [408]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


## 2. Remove NaN in 'TotalCharges' column

In [409]:
df['TotalCharges'].isnull().sum()

11

In [410]:
df[df['TotalCharges'].isnull()]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


In [411]:
df['TotalCharges'].isnull().sum() / df.shape[0] * 100

0.1561834445548772

About 0.16% of the rows (11 of 7,043) are missing a value in the 'TotalCharges' column. Since this share is small, I decided to remove those rows.

In [412]:
# Drop rows with missing values in 'TotalCharges' column
df.dropna(subset=['TotalCharges'], inplace=True)

In [413]:
df['TotalCharges'].isnull().sum()

0

In [414]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## 3. Encode categorical variables

In [415]:
# Drop 'customerID' (an identifier, not a feature)
df.drop('customerID', axis=1, inplace=True)

# Replace 'No phone service' and 'No internet service' with 'No' in the following columns:
# MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which newer pandas versions warn about
df['MultipleLines'] = df['MultipleLines'].replace('No phone service', 'No')
df = df.replace('No internet service', 'No')

# gender, Partner, Dependents, PhoneService, PaperlessBilling, Churn --> binary
df['gender'] = df['gender'].replace({'Male': 0, 'Female': 1})
# Note: this frame-wide replace also maps the 'No' category of InternetService to 0,
# which is why its dummy column is named 'InternetService_0'
df = df.replace({'No': 0, 'Yes': 1})

# SeniorCitizen --> already numeric, no change

# tenure, TotalCharges, MonthlyCharges --> numeric, no change

# InternetService, Contract, PaymentMethod --> one-hot
# dtype=int yields 0/1 columns directly instead of booleans
df = pd.get_dummies(df, columns=['InternetService', 'Contract', 'PaymentMethod'], dtype=int)
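One-hot encoding turns each category of a column into its own 0/1 indicator column. A minimal sketch on a toy frame (the rows are made up for illustration); passing `dtype=int` to `pd.get_dummies` is one way to get integer indicators directly:

```python
import pandas as pd

# Toy frame; made-up rows, not the Telco data
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
})

# Each distinct Contract value becomes a separate indicator column
encoded = pd.get_dummies(toy, columns=['Contract'], dtype=int)
print(encoded)
```
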

In [416]:
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,...,InternetService_0,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,1,0,1,0,1,0,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
1,0,0,0,0,34,1,0,1,0,1,...,0,1,0,0,1,0,0,0,0,1
2,0,0,0,0,2,1,0,1,1,0,...,0,1,0,1,0,0,0,0,0,1
3,0,0,0,0,45,0,0,1,0,1,...,0,1,0,0,1,0,1,0,0,0
4,1,0,0,0,2,1,0,0,0,0,...,0,0,1,1,0,0,0,0,1,0


# Logistic Regression 

In [417]:
# Define features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

In [418]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
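The split above does not pass `stratify`, so the churn ratio in the test set can differ slightly from the ratio in the full dataset. One option, which this notebook does not use, is a stratified split. A minimal sketch on toy labels (the data below is made up):

```python
from sklearn.model_selection import train_test_split

# Toy data: 80 non-churn (0) and 20 churn (1) labels; illustrative only
X_toy = [[i] for i in range(100)]
y_toy = [0] * 80 + [1] * 20

# stratify=y_toy keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(sum(y_te) / len(y_te))
```
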

In [419]:
# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
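The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. A minimal sketch of the underlying arithmetic for a single feature, using made-up values:

```python
from statistics import mean, pstdev

train = [1.0, 2.0, 3.0]  # hypothetical training values for one feature
test = [4.0]             # hypothetical test value

mu = mean(train)
sigma = pstdev(train)    # population standard deviation (ddof=0), as StandardScaler uses

# Both sets are standardized with the statistics of the training set
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]
print(train_scaled, test_scaled)
```
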

In [420]:
# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

In [421]:
# Train the model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_resampled, y_resampled)

In [422]:
# Make predictions
y_pred = lr_model.predict(X_test)

In [423]:
# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))

[[736 297]
 [ 73 301]]
              precision    recall  f1-score   support

           0       0.91      0.71      0.80      1033
           1       0.50      0.80      0.62       374

    accuracy                           0.74      1407
   macro avg       0.71      0.76      0.71      1407
weighted avg       0.80      0.74      0.75      1407

Accuracy Score: 0.7370291400142146


# Random Forest Classifier

In [424]:
# Train the model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_resampled, y_resampled)

In [425]:
# Make predictions
y_pred = rf_model.predict(X_test)

In [426]:
# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))

[[876 157]
 [152 222]]
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      1033
           1       0.59      0.59      0.59       374

    accuracy                           0.78      1407
   macro avg       0.72      0.72      0.72      1407
weighted avg       0.78      0.78      0.78      1407

Accuracy Score: 0.7803837953091685


## Logistic regression
Precision for Class 0 (non-churn) is high (0.91): when the model predicts non-churn, it is usually right. Precision for Class 1 (churn) is much lower (0.50): only about half of the customers flagged as churners actually churned.

Recall for Class 1 is high (0.80), so the model identifies most of the actual churn cases. Recall for Class 0 is lower (0.71): 297 of the 1,033 non-churn customers are misclassified as churners.

The model correctly predicts the outcome 74% of the time.
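The figures in the classification report follow directly from the confusion matrix. A short sketch that recomputes the churn-class metrics from the matrix printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix from the Logistic Regression evaluation: [[TN, FP], [FN, TP]]
tn, fp = 736, 297
fn, tp = 73, 301

precision_churn = tp / (tp + fp)   # share of predicted churners who actually churned
recall_churn = tp / (tp + fn)      # share of actual churners the model caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_churn, 2), round(recall_churn, 2),
      round(f1_churn, 2), round(accuracy, 2))
```
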

#### Strengths:
The model is fairly good at detecting churn cases (recall of 0.80 for Class 1).

It has high precision for non-churn cases (0.91), so its non-churn predictions are reliable.

#### Weaknesses:
Precision for churn cases is low (0.50), so the model produces many false positives (predicting churn for customers who do not churn).

The F1-score for churn is 0.62, reflecting a trade-off between precision (0.50) and recall (0.80) that could be improved.

## Random Forest
Precision is higher for Class 0 (non-churn) at 0.85, showing the model is good at predicting non-churn cases accurately. For Class 1 (churn), precision is lower at 0.59, indicating that the model is less reliable when predicting churn.

Within each class, precision and recall are equal (0.85 for Class 0 and 0.59 for Class 1). Recall is much lower for Class 1 than for Class 0, meaning the model misses 152 of the 374 actual churn cases.

The model correctly predicts the outcome 78% of the time. This accuracy is better than the Logistic Regression model but still shows room for improvement, particularly in the prediction of churn cases.

#### Strengths:
Better accuracy compared to Logistic Regression.

Precision and recall are balanced within each class, and precision for churn is higher than with Logistic Regression (0.59 vs. 0.50).

#### Weaknesses:
Precision and recall for churn cases (Class 1) are both only 0.59, well below the Class 0 values. Churn recall in particular is lower than with Logistic Regression (0.59 vs. 0.80), so the model misses more actual churners.


# Conclusion
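Both models provide valuable insights. The Random Forest model's higher accuracy (78% vs. 74%) and more even precision and recall may make it more suitable for scenarios where better overall performance is desired, while the Logistic Regression model's higher churn recall (0.80 vs. 0.59) is worth keeping in mind when catching as many churners as possible is the priority.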
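To make the comparison concrete, the sketch below puts the two confusion matrices from the evaluation cells side by side. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the dictionary keys are illustrative names only.

```python
# Confusion matrices from the evaluation cells: [[TN, FP], [FN, TP]]
results = {
    'Logistic Regression': [[736, 297], [73, 301]],
    'Random Forest': [[876, 157], [152, 222]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in results.items():
    total = tn + fp + fn + tp
    summary[name] = {
        'accuracy': round((tn + tp) / total, 2),
        'churn_recall': round(tp / (tp + fn), 2),
        'missed_churners': fn,   # actual churners predicted as non-churn
        'false_alarms': fp,      # non-churners predicted as churn
    }
    print(name, summary[name])
```

Which model is preferable depends on whether a missed churner or a false alarm is the more costly mistake for the business.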