In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Styling the plots
sns.set_style('whitegrid')

#Loading the csv
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [2]:
# Displaying the first few rows of the dataframe
print(df.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

In [3]:
#To know the info of the dataframe
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


*Observations*

- We can see that TotalCharges is stored as object (string) and not float64
- We have to convert it to a numeric type before analyzing the data
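
Before converting, it can help to look at which entries fail to parse. The sketch below uses a small made-up series as a stand-in for TotalCharges; the assumption that the offending entries are blank strings is illustrative only and should be checked against the real column.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (made-up values, for illustration only)
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15", " "])

# Coerce to numeric; anything that cannot be parsed becomes NaN
as_numeric = pd.to_numeric(total_charges, errors="coerce")

# Entries that failed to parse: NaN after coercion but present before it
bad_mask = as_numeric.isna() & total_charges.notna()
bad_values = total_charges[bad_mask]

print(bad_values.tolist())
print(as_numeric.dtype)
```

Inspecting `bad_values` first shows whether the unparseable entries are blanks, placeholders, or genuine typos, which in turn informs how they should be filled.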

***Data Cleaning Process***

In [4]:
#1/ Converting TotalCharges to numeric

#If the values cannot be converted to numeric, coerce will set them to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

#Checking for null values
print("Check for null values in TotalCharges column:")
#Counts how many NaN values are present in TotalCharges
print(df['TotalCharges'].isnull().sum())

#Filling null values with the median
#Assigning the result back avoids the chained inplace call, which operates on a copy
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

Check for null values in TotalCharges column:
11


In [5]:
#2/ Check if there are any missing values.

print("\nCheck for any missing values in the dataframe:")
print(df.isnull().sum())


Check for any missing values in the dataframe:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


In [6]:
#3/ Data Cleaning in categorical columns

# Get all the categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
# Display unique values in each categorical column
for col in categorical_columns:
    print(f"\nUnique values in column '{col}':")
    print(df[col].value_counts())


Unique values in column 'customerID':
customerID
7590-VHVEG    1
5575-GNVDE    1
3668-QPYBK    1
7795-CFOCW    1
9237-HQITU    1
             ..
6840-RESVB    1
2234-XADUH    1
4801-JZAZL    1
8361-LTMKD    1
3186-AJIEK    1
Name: count, Length: 7043, dtype: int64

Unique values in column 'gender':
gender
Male      3555
Female    3488
Name: count, dtype: int64

Unique values in column 'Partner':
Partner
No     3641
Yes    3402
Name: count, dtype: int64

Unique values in column 'Dependents':
Dependents
No     4933
Yes    2110
Name: count, dtype: int64

Unique values in column 'PhoneService':
PhoneService
Yes    6361
No      682
Name: count, dtype: int64

Unique values in column 'MultipleLines':
MultipleLines
No                  3390
Yes                 2971
No phone service     682
Name: count, dtype: int64

Unique values in column 'InternetService':
InternetService
Fiber optic    3096
DSL            2421
No             1526
Name: count, dtype: int64

Unique values in column 'OnlineSec

*Observations*

- This check ensures that no column contains differently written labels that mean the same thing (for example, 'Yes' and 'yes'). This problem exists in most datasets, but in this case every label within each categorical column is distinct.
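
As a rough sketch of what this check guards against, the toy column below (made-up values, not taken from the dataset) contains labels that differ only in case or whitespace. Normalizing before counting reveals the true number of categories.

```python
import pandas as pd

# Toy column with inconsistent spellings (hypothetical values for illustration)
partner = pd.Series(["Yes", "yes ", "No", "NO", "Yes"])

# Number of distinct labels as stored
raw_labels = partner.nunique()

# Normalize: trim whitespace and lowercase, then count again
normalized = partner.str.strip().str.lower()
clean_labels = normalized.nunique()

print(raw_labels, clean_labels)
```

If the two counts differ for a column, the column has duplicate labels that should be merged before encoding.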

In [7]:
#4/ Data cleaning in numerical columns

numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_columns:
    print(f"\nStatistical summary of column '{col}':")
    print(df[col].describe())


Statistical summary of column 'SeniorCitizen':
count    7043.000000
mean        0.162147
std         0.368612
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: SeniorCitizen, dtype: float64

Statistical summary of column 'tenure':
count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64

Statistical summary of column 'MonthlyCharges':
count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64

Statistical summary of column 'TotalCharges':
count    7043.000000
mean     2281.916928
std      2265.270398
min        18.800000
25%       402.225000
50%      1397.475000
75%      3786.600000
max      8684.800000
Name: TotalCharges, dtype: float64


*Observations*

- Here, 'MonthlyCharges' and 'tenure' should be non-negative values.
- In this dataset, the minimum of both columns is non-negative (18.25 and 0), so no cleaning is needed.
- In general, decide the valid range of each numerical column from prior knowledge of what it measures, and clean the values that fall outside it.
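
One way to make the range check explicit is sketched below. The helper `out_of_range_counts`, the `valid_ranges` dictionary, and the toy values are all hypothetical; the bounds would come from domain knowledge about each column.

```python
import pandas as pd

# Toy frame with two deliberately invalid values (made-up data)
toy = pd.DataFrame({
    "tenure": [1, 34, -2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, -5.0],
})

# Hypothetical valid ranges as (lower, upper); None means unbounded on that side
valid_ranges = {"tenure": (0, None), "MonthlyCharges": (0, None)}

def out_of_range_counts(frame, ranges):
    """Count, per column, the values falling outside the given range."""
    counts = {}
    for col, (low, high) in ranges.items():
        mask = pd.Series(False, index=frame.index)
        if low is not None:
            mask |= frame[col] < low
        if high is not None:
            mask |= frame[col] > high
        counts[col] = int(mask.sum())
    return counts

violations = out_of_range_counts(toy, valid_ranges)
print(violations)
```

A count of zero for every column corresponds to the "clean" situation described above; non-zero counts point at rows to inspect or drop.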

***Feature Engineering Process***

In [8]:
#1/ Encoding categorical data using one hot encoding

#In our case the target variable is 'Churn'
df_processed = df.copy()
df_processed['Churn'] = df_processed['Churn'].map({'No': 0, 'Yes': 1})

# Identify the remaining categorical (object) columns; Churn is excluded because it is now numeric
# Note: customerID is also an object column, so it is one-hot encoded here as well
categorical_cols = df_processed.select_dtypes(include=['object']).columns

# Apply one-hot encoding to these columns
df_processed = pd.get_dummies(df_processed, columns=categorical_cols, drop_first=True, dtype=int)

print("--- Data after one-hot encoding ---")
print(df_processed.head())
print(f"\nNew shape of the dataframe: {df_processed.shape}")
print(f"\nOriginal Shape : {df.shape}")

--- Data after one-hot encoding ---
   SeniorCitizen  tenure  MonthlyCharges  TotalCharges  Churn  \
0              0       1           29.85         29.85      0   
1              0      34           56.95       1889.50      0   
2              0       2           53.85        108.15      1   
3              0      45           42.30       1840.75      0   
4              0       2           70.70        151.65      1   

   customerID_0003-MKNFE  customerID_0004-TLHLJ  customerID_0011-IGKFF  \
0                      0                      0                      0   
1                      0                      0                      0   
2                      0                      0                      0   
3                      0                      0                      0   
4                      0                      0                      0   

   customerID_0013-EXCHZ  customerID_0013-MHZWF  ...  \
0                      0                      0  ...   
1               
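
The truncated output above lists columns such as customerID_0003-MKNFE, which suggests that the customer identifier was one-hot encoded along with the real features. An identifier carries no predictive signal and adds one column per customer. A minimal sketch of dropping it before encoding, using a made-up toy frame and assuming `customerID` is purely an identifier:

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset (made-up rows)
toy = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "gender": ["Female", "Male", "Male"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# Drop the identifier first, then encode the target as 0/1
encoded = toy.drop(columns=["customerID"]).copy()
encoded["Churn"] = encoded["Churn"].map({"No": 0, "Yes": 1})

# Select the remaining non-numeric columns and one-hot encode them
categorical_cols = encoded.select_dtypes(exclude=["number"]).columns
encoded = pd.get_dummies(encoded, columns=categorical_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Applied to the full dataframe, this would change the column count and therefore the shapes and model scores reported later, so the downstream cells would need to be re-run.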

In [9]:
#2/ Scaling the numerical columns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Get the numerical columns
numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# Fit and transform the numerical columns
# Standardization rescales each column to zero mean and unit variance (it does not change the shape of the distribution)
df_processed[numerical_cols] = scaler.fit_transform(df_processed[numerical_cols])

# Printing the numerical columns after scaling
print("\nNumerical columns after scaling")
print(df_processed[numerical_cols].head())



Numerical columns after scaling
     tenure  MonthlyCharges  TotalCharges
0 -1.277445       -1.160323     -0.994242
1  0.066327       -0.259629     -0.173244
2 -1.236724       -0.362660     -0.959674
3  0.514251       -0.746535     -0.194766
4 -1.236724        0.197365     -0.940470


*Observations*

- Scaling puts the features on a comparable scale.
- Without scaling, models that rely on distances or gradient updates tend to train poorly, because features with large ranges (such as TotalCharges) dominate the others.
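
The cell above fits the scaler on the whole dataframe before the train/test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here on randomly generated toy data (all names and values below are illustrative), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data standing in for two numerical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(18, 119, size=200),
})
toy_target = rng.integers(0, 2, size=200)

# Split first so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=30, stratify=toy_target
)

# Learn mean and standard deviation from the training rows only
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)

# Reuse the training statistics for the test rows
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))
```

With this ordering, only the training columns are guaranteed to have zero mean and unit variance; the test columns will be close to that but not exact, which is expected.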

In [10]:
#3/ Check for duplicates

# Get the number of duplicate rows
# duplicate_rows = df_processed.duplicated().sum()
# print(f"Number of duplicate rows: {duplicate_rows}")

# # If there are duplicate rows, display them
# if duplicate_rows > 0:
#     print("\nDuplicate rows:")
#     print(df_processed[df_processed.duplicated()])
# else:
#     print("No duplicate rows found.")

In [11]:
#4/ Split it into train and test data
from sklearn.model_selection import train_test_split

#Split the input and target variable
X = df_processed.drop('Churn', axis=1)
y = df_processed['Churn']

#Split into train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30, stratify=y)

#Print train and test splits
print(f"Training set's shape : {X_train.shape}")
print(f"Test set's shape : {X_test.shape}")

Training set's shape : (5634, 7072)
Test set's shape : (1409, 7072)


***Model Building Process***

In [12]:
# 1/ Building a Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
#Create an instance of the Logistic Regression model
print("\n--- Logistic Regression Model Evaluation ---")
lr_model = LogisticRegression(max_iter=1000)
#Fit the model on the training data
lr_model.fit(X_train, y_train)
#Predict on the test data
y_pred_log = lr_model.predict(X_test)
#Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred_log)}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log, target_names=['Not Churn (0)', 'Churn (1)']))



--- Logistic Regression Model Evaluation ---
Accuracy: 0.78708303761533
Confusion Matrix:
[[912 123]
 [177 197]]

Classification Report:
               precision    recall  f1-score   support

Not Churn (0)       0.84      0.88      0.86      1035
    Churn (1)       0.62      0.53      0.57       374

     accuracy                           0.79      1409
    macro avg       0.73      0.70      0.71      1409
 weighted avg       0.78      0.79      0.78      1409



In [13]:
# 2/ Building a Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create an instance of the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=30)
#Fit the model on the training data
rf_model.fit(X_train, y_train)
#Predict on the test data
y_pred_rf = rf_model.predict(X_test)
#Evaluate the model
print("\n--- Random Forest Model Evaluation ---")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf)}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Not Churn (0)', 'Churn (1)']))



--- Random Forest Model Evaluation ---
Confusion Matrix:
[[939  96]
 [197 177]]
Accuracy: 0.7920511000709723

Classification Report:
               precision    recall  f1-score   support

Not Churn (0)       0.83      0.91      0.87      1035
    Churn (1)       0.65      0.47      0.55       374

     accuracy                           0.79      1409
    macro avg       0.74      0.69      0.71      1409
 weighted avg       0.78      0.79      0.78      1409



*Observations*

- Of the two models that we have trained, we choose the logistic regression model to make predictions.
- Why?
    - Here the cost of being wrong is not symmetrical, so we do not choose based on accuracy alone.
    - Missing a customer who is about to leave (Churn = 1, a false negative) costs more than giving a discount to someone who was going to stay anyway (Churn = 0 but predicted as churn, a false positive).
    - So **recall** is the better indicator, because a high recall for the churn class means there are few false negatives.
    - For Churn = 1, logistic regression has a recall of 0.53 against 0.47 for the random forest, so logistic regression is the better choice.
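
The recall figures can be recomputed directly from the confusion matrices printed above. The helper name `churn_recall` is introduced here for illustration only; the matrices are copied from the two model outputs, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrices copied from the model outputs: rows = actual, columns = predicted
cm_log = np.array([[912, 123], [177, 197]])
cm_rf = np.array([[939, 96], [197, 177]])

def churn_recall(cm):
    """Recall for the positive (churn) class: TP / (TP + FN)."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn)

recall_log = churn_recall(cm_log)
recall_rf = churn_recall(cm_rf)

print(f"Logistic regression recall: {recall_log:.2f}")
print(f"Random forest recall: {recall_rf:.2f}")
```

Logistic regression catches 197 of the 374 actual churners and the random forest catches 177, which is where the 0.53 and 0.47 in the classification reports come from.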

***Insights***

In [None]:
#Exporting data
#We go with the logistic regression model as it has a better recall score for churn = 1

df_results = X_test.copy()
df_results['Actual_Churn'] = y_test
df_results['Predicted_Churn_LogReg'] = y_pred_log

# Add the probability of churning (the probability of the '1' class)
df_results['Churn_Probability'] = lr_model.predict_proba(X_test)[:, 1]

# Export the results to a CSV file
df_results.to_csv('churn_predictions.csv', index=False)

print("Data for dashboard has been exported")

Data for dashboard has been exported
