---
### Contributors: Brian Waweru, Start Date: 08 May 2025
---

# 0.1: Working Libraries and Preliminaries

In [98]:
# Python Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [99]:
# dataset location
file = "churn_in_telecoms_dataset.csv"
# creating a dataframe
df = pd.read_csv(file)
# shape of the dataset
print(df.shape)

(3333, 21)


# 0.2: Feature Engineering and Preprocessing

In [100]:
# General information of each column
# Including entry types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [101]:
# drop the 'phone number' column
df = df.drop(columns='phone number')
# columns
print(df.columns)

Index(['state', 'account length', 'area code', 'international plan',
       'voice mail plan', 'number vmail messages', 'total day minutes',
       'total day calls', 'total day charge', 'total eve minutes',
       'total eve calls', 'total eve charge', 'total night minutes',
       'total night calls', 'total night charge', 'total intl minutes',
       'total intl calls', 'total intl charge', 'customer service calls',
       'churn'],
      dtype='object')


In [None]:
# Inspect the unique entries of the 'churn' column
print(f"The unique entries in the 'churn' column are: {df.churn.unique()}")
# >>> [False  True]
# Convert boolean values in the 'churn' column to integers: False → 0, True → 1
df['churn'] = df['churn'].astype(int)
print(f"After conversion, the entries used for classification are: {df.churn.unique()}")

The unique entries in the 'churn' column are: [False  True]
After conversion, the entries used for classification are: [0 1]


In [None]:
# Column Description
df.describe()

| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 | 0.144914 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 | 0.352067 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 | 0.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 | 0.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 | 0.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 | 1.0 |
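
The `churn` mean of 0.144914 in the summary above is the share of churners, roughly 14.5% of the 3,333 customers, so the two classes are imbalanced. A minimal sketch of how that share could be checked directly is given here. It uses a small made-up frame (`toy`, a hypothetical stand-in for `df`) rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for df: 2 churners out of 10 customers
toy = pd.DataFrame({"churn": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# Share of each class; for a 0/1 column the share of 1s equals the column mean
class_share = toy["churn"].value_counts(normalize=True)
churn_rate = toy["churn"].mean()

print(class_share)
print(f"Churn rate: {churn_rate:.1%}")
```

With a minority class this small, a model that never predicts churn would still score about 85.5% accuracy on the real data, which is why precision and recall on the churn class deserve attention alongside accuracy.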


In [None]:
# Categorical columns: those excluded from df.describe(), which summarises numeric columns only
numeric_columns = df.describe().columns
missing_columns = [col for col in df.columns if col not in numeric_columns]
print("Columns missing from df.describe():", missing_columns)
# Preview the first rows of the categorical columns
df[missing_columns].head(3)

Columns missing from df.describe(): ['state', 'international plan', 'voice mail plan']


| | state | international plan | voice mail plan |
|---|---|---|---|
| 0 | KS | no | yes |
| 1 | OH | no | yes |
| 2 | NJ | no | no |
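
The two plan columns hold only `yes`/`no` values in the rows shown above, so one option (an assumption, not necessarily the approach taken later in this notebook) is to map them to 1/0 in the same spirit as the `churn` conversion. The frame `toy` below is a hypothetical copy of those three rows:

```python
import pandas as pd

# Hypothetical rows mirroring the three shown above
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "international plan": ["no", "no", "no"],
    "voice mail plan": ["yes", "yes", "no"],
})

# Map the two yes/no columns to 1/0; 'state' is left as a string column
plan_cols = ["international plan", "voice mail plan"]
for col in plan_cols:
    toy[col] = toy[col].map({"no": 0, "yes": 1})

print(toy)
```

The alternative is to leave these columns as strings and let `OneHotEncoder`, already imported above, handle them together with `state`.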


In [None]:
# Shape of the dataset
print(f"The shape of the dataset is: {df.shape}")
# The 'area code' column has only 3 distinct values
print(f"The 'area code' column has only 3 unique entries: {df['area code'].unique()}")

The shape of the dataset is: (3333, 20)
The 'area code' column has only 3 unique entries: [415 408 510]
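
Because `area code` takes only three values, it behaves more like a label than a quantity. A possible treatment, sketched on made-up rows (`toy` is hypothetical), is to cast it to string so that it is one-hot encoded rather than scaled as a number:

```python
import pandas as pd

# Hypothetical rows using the three area codes seen above
toy = pd.DataFrame({"area code": [415, 408, 510, 415]})

# Cast to string so downstream encoders treat the column as categorical
toy["area code"] = toy["area code"].astype(str)

# One indicator column per distinct code
dummies = pd.get_dummies(toy["area code"], prefix="area")
print(dummies.columns.tolist())
```

Whether this helps depends on the model: a linear model would otherwise read 510 as "larger" than 408, which has no meaning for area codes.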


# 1: Overview

## 1.1: Project Overview: Customer Churn Prediction for SyriaTel

This project aims to predict customer churn for SyriaTel, a telecommunications company, using a sample of its historical customer data. Because each customer either churns (`1`) or does not (`0`), this is a binary classification problem. The model is meant to surface the patterns and factors that influence whether a customer leaves, so that the company can target at-risk customers with retention strategies, reducing attrition and preserving revenue.

The overall project pipeline consists of:

1. **Business Understanding**: Understanding churn's impact on SyriaTel’s business.

2. **Data Understanding and Preparation**: Exploring the structure and distribution of data. Cleaning, transforming, and encoding the dataset.

3. **Exploratory Data Analysis (EDA)**: Finding patterns and feature relationships with churn.

4. **Model Building**: Training and tuning classifiers such as Logistic Regression, Decision Trees or Random Forests.

5. **Evaluation**: Measuring performance using metrics such as accuracy, precision, recall, F1-score, and the area under the ROC curve (ROC AUC).

6. **Interpretation**: Identifying key drivers of churn.

7. **Recommendations and Actionable Insights**: Translating the analytical findings into recommendations for customer retention and business interventions that reduce churn.
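
The preparation, modeling and evaluation steps above can be sketched with the libraries imported in section 0.1. This is only an illustration under stated assumptions: the data is synthetic, the rule generating `churn` is made up, and the choice of columns and hyperparameters is hypothetical rather than the final model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (all values are made up)
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "total day minutes": rng.normal(180, 54, n),
    "customer service calls": rng.integers(0, 6, n),
    "international plan": rng.choice(["no", "yes"], n, p=[0.9, 0.1]),
})
# Made-up rule so that the label depends on the features
score = (0.02 * (toy["total day minutes"] - 180)
         + 0.6 * toy["customer service calls"] - 2.5)
toy["churn"] = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)

X = toy.drop(columns="churn")
y = toy["churn"]

numeric_cols = ["total day minutes", "customer service calls"]
categorical_cols = ["international plan"]

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratify so both splits keep the same churn share
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred, zero_division=0))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

Keeping preprocessing inside the `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set information into the model.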

## 1.2: Objectives

The key objectives of this project are:

1. **Build a Predictive Model for Churn**  
2. **Identify Key Drivers of Churn**  
3. **Improve Churn Prediction Accuracy**  
4. **Support Retention Strategy Development**  
5. **Communicate Findings Clearly**: Present model insights and business recommendations in a format accessible to both technical and non-technical stakeholders.

If time allows, the following secondary objectives are also worth exploring:

1. **Understand Customer Behavior**  
2. **Segment At-Risk Customers**  
3. **Evaluate Cost-Benefit Trade-offs**  
   Analyze which churn-prone customers are most valuable to retain based on their potential lifetime value.
4. **Develop a Repeatable ML Pipeline**  
   Build a clean and modular workflow that can be reused with updated customer data in the future.
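
For the cost-benefit objective, one simple framing (an assumption here, not something defined by the dataset) is to rank customers by expected revenue at risk: predicted churn probability multiplied by an estimate of lifetime value. The column names `churn_prob` and `lifetime_value` and all numbers below are hypothetical:

```python
import pandas as pd

# Hypothetical customers with a model's churn probability and an estimated lifetime value
customers = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "churn_prob": [0.80, 0.30, 0.60, 0.05],
    "lifetime_value": [200.0, 1000.0, 400.0, 2000.0],
})

# Expected revenue at risk = P(churn) * lifetime value
customers["expected_loss"] = customers["churn_prob"] * customers["lifetime_value"]

# Customers with the most revenue at risk come first
ranked = customers.sort_values("expected_loss", ascending=False)
print(ranked[["customer_id", "expected_loss"]])
```

In this toy example customer A is the most likely to churn, yet B ranks first because more revenue is at stake, which is the kind of trade-off this objective is meant to expose.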

# 2: Business and Data Understanding

# 3: Data Preparation 

# 4: Modeling

# 5: Evaluation

# 6: Conclusions and Recommendations