# Final Project Submission



* Student name: **Samwel Ongechi**
* Student pace: part time
* Instructor name: George Kamundia
* Project Title: **Telecom Customer Churn Prediction for SyriaTel: A Data-Driven Retention Strategy**

### 1. **Business Understanding & Project Goal**

In the highly competitive telecommunications industry, customer retention is a critical driver of profitability. Acquiring a new customer can cost five to ten times more than retaining an existing one. Understanding and predicting customer churn (the likelihood that a customer will discontinue their service) is therefore a strategic priority for SyriaTel, a leading telecom provider.

#### **Project Objective:**
This project aims to develop a robust machine learning classification model to predict customer churn. By identifying at-risk customers early, SyriaTel can implement proactive retention strategies.

#### **Specific Objectives:**

**Diagnose Drivers of Churn**: Conduct in-depth exploratory data analysis (EDA) to uncover patterns and behavioral signals associated with churn.

**Build Predictive Models**: Train and evaluate machine learning classifiers to predict which customers are most likely to churn.

**Deliver Business Value**: Provide actionable insights that enable marketing, customer success, and product teams to reduce churn, improve customer satisfaction, and increase lifetime value.

#### **Stakeholder Impact:**
Business units can leverage model outputs to design targeted interventions, such as loyalty offers, personalized communication, or service plan optimization, ultimately reducing revenue loss and strengthening customer relationships.



### 2. **Data Loading & Initial Exploration**
In this section, I began by importing and loading the dataset, followed by an initial examination of its structure and contents. This step is crucial for:

- *Verifying that the data has been correctly imported*.

- *Understanding the dimensions, column types, and general distribution of the features*.

- *Identifying any immediate data quality issues such as missing values or duplicates*.

This early exploration lays the groundwork for informed decisions during cleaning, feature engineering, and modeling.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set consistent visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load Dataset
try:
    df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("File not found. Please check the file path.")
    raise  # Stop here: every later cell depends on `df` being defined

# Preview the dataset
df.head()


Dataset loaded successfully.


|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


### 3. **Dataset Summary & Basic Diagnostics**

Before diving deeper, I performed a quick structural check of the dataset to answer the following questions:

- What is the size and shape of the dataset?
- What are the data types and presence of null values?
- Are there any duplicated rows?
- What do the basic statistics (mean, std, min, max, etc.) tell us about the numerical features?

These checks helped to identify early issues related to data quality and guide the next steps in cleaning and exploration.


In [2]:
#  Dataset Summary and Basic Diagnostics

# General structure and data types
print(" Dataset Info:\n")
df.info()

# Descriptive statistics for numeric features
print("\n Statistical Summary:\n")
display(df.describe())

# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"\n Number of duplicate rows: {duplicate_count}")


 Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       

|   | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |



 Number of duplicate rows: 0


###  **Initial Observations & Dataset Overview**

- The dataset contains **3,333 entries** and **21 columns**, as reported by `df.info()`.
- There are **no missing values**, which simplifies preprocessing.
- The target variable, **`churn`**, is of type `bool` (True = churned, False = retained).
- All column names are understandable, but we may apply formatting improvements (e.g., lowercase, underscores) for consistency.
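
These observations can also be confirmed programmatically instead of being read off the `df.info()` output. The following is a minimal sketch, assuming a small stand-in frame built from three of the previewed rows and a subset of columns; `summarize_quality` is a hypothetical helper name, and in the notebook it would be called on the full `df`.

```python
import pandas as pd

# Stand-in sample built from three previewed rows (subset of columns);
# in the notebook, pass the full `df` instead
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "account length": [128, 107, 137],
    "international plan": ["no", "no", "no"],
    "customer service calls": [1, 1, 0],
    "churn": [False, False, False],
})

def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return basic data-quality facts: shape, missing cells, duplicate rows."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "target_dtype": str(frame["churn"].dtype),
    }

report = summarize_quality(sample)
print(report)
```

Collecting the checks in one dictionary makes it easy to assert on them later, for example after a cleaning step that should not change the row count.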



###  **Data Dictionary**

Since no formal documentation was provided, I inferred the feature definitions from the column names:

| Feature                    | Inferred Description                                            | Data Type  |
|----------------------------|------------------------------------------------------------------|------------|
| `state`                   | US state where the customer resides                             | object     |
| `account_length`          | Number of days the account has been active                      | int64      |
| `area_code`               | Area code of the customer                                       | int64      |
| `phone_number`            | Customer's unique phone number                                  | object     |
| `international_plan`      | Whether the customer has an international calling plan (yes/no) | object     |
| `voice_mail_plan`         | Whether the customer has a voice mail plan (yes/no)             | object     |
| `number_vmail_messages`   | Number of voice mail messages                                   | int64      |
| `total_day_minutes`       | Total minutes used during the day                              | float64    |
| `total_day_calls`         | Total number of calls during the day                           | int64      |
| `total_day_charge`        | Total charge for daytime calls                                 | float64    |
| `total_eve_minutes`       | Total minutes used in the evening                              | float64    |
| `total_eve_calls`         | Total number of calls in the evening                           | int64      |
| `total_eve_charge`        | Total charge for evening calls                                 | float64    |
| `total_night_minutes`     | Total minutes used at night                                    | float64    |
| `total_night_calls`       | Total number of calls at night                                 | int64      |
| `total_night_charge`      | Total charge for night calls                                   | float64    |
| `total_intl_minutes`      | Total international call minutes                               | float64    |
| `total_intl_calls`        | Total number of international calls                            | int64      |
| `total_intl_charge`       | Total charge for international calls                           | float64    |
| `customer_service_calls`  | Number of calls made to customer service                       | int64      |
| `churn`                   | Target variable; whether the customer churned (`True`/`False`) | bool       |


### 4. **Data Cleaning & Preparation**
In this section, I addressed inconsistencies and prepared the data for analysis and modeling.

#### 4.1 **Standardizing Column Names & Dropping Irrelevant Features**
To enhance code readability and avoid potential issues when referencing column names, I performed the following actions:

- Standardized column names by converting them to snake_case (lowercase with underscores).

- Dropped irrelevant features:

  - `phone_number`: acts as a unique identifier and holds no predictive value.

I treated area_code as categorical, even though it is stored as a numerical value, since it represents location rather than a continuous variable.

In [3]:
#  Data Cleaning: Standardize and Simplify Feature Set

# Convert all column names to snake_case for consistency
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Drop 'phone_number' as it is a unique identifier with no predictive value
df.drop(columns=['phone_number'], inplace=True)

# Convert 'area_code' to string to treat it as a categorical feature
df['area_code'] = df['area_code'].astype(str)

# Confirm the updated column names
print(" Cleaned column names:")
print(df.columns.tolist())


 Cleaned column names:
['state', 'account_length', 'area_code', 'international_plan', 'voice_mail_plan', 'number_vmail_messages', 'total_day_minutes', 'total_day_calls', 'total_day_charge', 'total_eve_minutes', 'total_eve_calls', 'total_eve_charge', 'total_night_minutes', 'total_night_calls', 'total_night_charge', 'total_intl_minutes', 'total_intl_calls', 'total_intl_charge', 'customer_service_calls', 'churn']


####  **Standardization & Feature Pruning**

To improve consistency and simplify modeling:

- Column names were converted to `snake_case` format.
- The `phone_number` column was dropped, as it is a unique identifier with no predictive value.
- The `area_code` feature was explicitly cast to a string type to reflect its categorical nature.

These steps helped to prepare the dataset for effective encoding and analysis.
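
To illustrate why the cast of `area_code` matters for encoding, here is a minimal sketch using toy values (not the real dataset). It assumes one-hot encoding with `pd.get_dummies`, which is only one possible encoder and is not necessarily the one used later in this project. By default, `pd.get_dummies` expands string and categorical columns and leaves numeric columns untouched.

```python
import pandas as pd

# Toy frame; the area codes are ones that appear in the dataset summary
toy = pd.DataFrame({
    "area_code": [415, 408, 510, 415],
    "account_length": [128, 84, 75, 107],
})

# Left as integers, area_code is passed through unchanged as one numeric column
encoded_numeric = pd.get_dummies(toy)

# Cast to string, area_code is expanded into one indicator column per code
toy_str = toy.assign(area_code=toy["area_code"].astype(str))
encoded_categorical = pd.get_dummies(toy_str)

print(encoded_numeric.columns.tolist())
print(encoded_categorical.columns.tolist())
```

Without the cast, a model would treat 510 as "larger" than 408, an ordering that has no meaning for area codes.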


####  4.2 **Converting Binary Categorical Features**

Some columns in the dataset, such as `international_plan` and `voice_mail_plan`, are currently stored as `"yes"`/`"no"` strings. These are binary categorical features and will be converted to a numerical format for compatibility with machine learning algorithms.

To do this, I took the following steps:
- Normalized these string values to lowercase for consistency.
- Mapped `"yes"` to `1` and `"no"` to `0`.
- Converted the target variable `churn` from boolean (`True`/`False`) to integer (`1`/`0`) for easier evaluation and compatibility with classifiers.

This preprocessing step ensured that our model could interpret these features correctly during training.


In [4]:
#  Convert binary categorical features and target to numerical format

# Normalize string values
df['international_plan'] = df['international_plan'].str.lower()
df['voice_mail_plan'] = df['voice_mail_plan'].str.lower()

# Map 'yes' to 1 and 'no' to 0
binary_cols = ['international_plan', 'voice_mail_plan']
for col in binary_cols:
    df[col] = df[col].apply(lambda x: 1 if x == 'yes' else 0)

# Convert boolean target to integer
df['churn'] = df['churn'].astype(int)

# Confirm data type changes
print("Data types after binary conversion:")
print(df[binary_cols + ['churn']].dtypes)


Data types after binary conversion:
international_plan    int64
voice_mail_plan       int64
churn                 int32
dtype: object
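
One caveat of the `lambda` mapping above is that any value other than `'yes'` (a typo, stray whitespace, or a missing value) is silently coded as `0`. The following is a hedged sketch of a stricter alternative using `Series.map` with an explicit dictionary. The toy column and its messy values are hypothetical and do not come from the dataset, which contains only clean `yes`/`no` values.

```python
import pandas as pd

# Toy column with deliberately messy values (hypothetical, not from the dataset)
plans = pd.Series(["yes", "No", " no ", "maybe"], name="international_plan")

# Normalize case and whitespace, then map; values missing from the dict become NaN
normalized = plans.str.strip().str.lower()
mapped = normalized.map({"yes": 1, "no": 0})

# Surface anything that was neither 'yes' nor 'no' instead of coding it as 0
unexpected = normalized[mapped.isna()].unique().tolist()
print("Unexpected values:", unexpected)

# Only cast to int once every value has been mapped
if not unexpected:
    mapped = mapped.astype(int)
```

On clean data both approaches give the same result; the explicit dictionary simply makes unexpected categories visible before they reach the model.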
