# SyriaTel Customer Churn Analysis & Prediction

---

### **Author:** Rose Miriti

### Problem Statement

Customer churn is eroding SyriaTel’s revenue. Each subscriber who leaves takes their recurring revenue with them, and replacing them incurs additional acquisition costs. In a market characterized by strong competition and low switching barriers, even a modest increase in churn can substantially reduce profitability.

In this notebook, churn is framed as a binary classification task: predicting which customers are likely to discontinue service within the target window. Identifying at-risk subscribers early enables SyriaTel’s retention team to deploy targeted offers or service adjustments before cancellation occurs.

### Stakeholders   
- **Customer Retention Team:** needs to know which customers to target with retention offers  
- **Marketing Department:** wants to allocate budget efficiently on high-risk segments  
- **Finance & Executive Leadership:** cares about the financial impact of reducing churn, i.e. monitors the ROI of retention efforts and their P&L impact  

### Specific Objectives  
1. Load and clean the SyriaTel churn dataset, ensuring all key fields (tenure, billing, service calls, geography, etc.) are ready for analysis.  
2. Explore and visualize relationships between customer attributes and churn to uncover the strongest risk factors.  
3. Engineer new features such as support call counts, usage ratios, tenure buckets, and region-level flags to boost model signal.  
4. Build and compare classification models (logistic regression, random forest, gradient boosting) to find the best predictor of churn.  
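
The model comparison in objective 4 could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not the SyriaTel data, and the hyperparameters, the 5-fold split, and the `class_weight="balanced"` setting are illustrative assumptions rather than the notebook's final choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared feature matrix (NOT the SyriaTel data):
# about 15% positives to mimic an imbalanced churn target.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# The three model families named in the objectives; settings are assumptions.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=50, random_state=42
    ),
}

# Stratified folds keep the churn share similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["recall", "roc_auc"])
    results[name] = {
        "recall": scores["test_recall"].mean(),
        "roc_auc": scores["test_roc_auc"].mean(),
    }
    print(f"{name:>20}: recall={results[name]['recall']:.3f}  "
          f"roc_auc={results[name]['roc_auc']:.3f}")
```

Scoring every candidate on the same folds keeps the comparison fair, and reporting recall next to ROC-AUC mirrors the two success metrics used in this notebook.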

### Research Questions  
1. Does a high volume of calls or tickets correlate with higher churn risk?  
2. How does actual usage (minutes, texts, data) versus plan allowance predict churn?  
3. Are there specific regions or ZIP codes where churn is significantly above average?  
4. How do contract length and tenure buckets (e.g. 0–6 mo, 6–12 mo, > 12 mo) interact to affect churn probability?  
5. Can we derive other early warning features (e.g. months since last plan change, add-on uptake, billing anomalies)?  
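
As a rough illustration of how questions 1 and 4 might be explored, the sketch below buckets tenure with `pd.cut` and flags heavy support users. The rows are made up, and the bucket boundaries (in days) and the four-call cutoff are hypothetical choices, not values taken from the analysis.

```python
import pandas as pd

# Toy rows (made up for illustration) using the dataset's column names.
toy = pd.DataFrame({
    "account length": [12, 45, 95, 130, 200, 30, 160, 75],
    "customer service calls": [5, 1, 0, 4, 2, 6, 1, 0],
    "churn": [True, False, False, False, False, True, True, False],
})

# Hypothetical tenure buckets; the boundaries are an assumption.
toy["tenure_bucket"] = pd.cut(
    toy["account length"],
    bins=[0, 60, 120, float("inf")],
    labels=["0-60", "61-120", "120+"],
)

# Hypothetical flag: four or more customer service calls.
toy["high_service_calls"] = toy["customer service calls"] >= 4

# The mean of a boolean column is the share of True values, i.e. the churn rate.
churn_by_calls = toy.groupby("high_service_calls")["churn"].mean()
churn_by_bucket = toy.groupby("tenure_bucket", observed=False)["churn"].mean()

print(churn_by_calls)
print(churn_by_bucket)
```

On the real data, the same two `groupby` calls would give a first look at whether churn is concentrated among heavy support users or in particular tenure ranges.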

### Success Metrics  
- **Recall ≥ 80%** on the churn class, so we catch most customers who actually leave.  
- **ROC-AUC ≥ 0.75**, indicating strong separation between churners and stayers.  
- In production, these predictions should contribute to a **5% net reduction in churn** over six months.  
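
The first two targets can be checked with `recall_score` and `roc_auc_score`. The sketch below uses ten made-up labels and scores and assumes a 0.5 decision threshold, so it only shows the mechanics of the check.

```python
from sklearn.metrics import recall_score, roc_auc_score

# Made-up ground truth and predicted churn probabilities (1 = churned).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.65, 0.4, 0.2, 0.1, 0.05]

THRESHOLD = 0.5  # assumed decision threshold; tuning it trades recall for precision
y_pred = [int(score >= THRESHOLD) for score in y_score]

recall = recall_score(y_true, y_pred)    # share of actual churners that were caught
auc = roc_auc_score(y_true, y_score)     # ranking quality, independent of the threshold

meets_targets = recall >= 0.80 and auc >= 0.75
print(f"recall={recall:.2f}, roc_auc={auc:.2f}, meets_targets={meets_targets}")
```

Recall depends on the chosen threshold while ROC-AUC does not, so a model that clears the AUC target may still need its threshold lowered to reach the recall target.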

### Implications  
By achieving a recall of 80% or higher, SyriaTel can proactively target at least four-fifths of likely churners with retention offers, potentially preserving millions in annual revenue. An ROC-AUC of 0.75 or higher will give confidence in model reliability, and a reduction in churn of roughly 5% translates directly into lower customer acquisition costs and stronger lifetime value.

With these targets met, SyriaTel’s retention and marketing teams can allocate resources more effectively, and Finance will realize measurable improvements in profitability.

---

## 2. Data Understanding

In this section we will:

1. Load the [`syria_telco_churn.csv`](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset) file.  
2. Inspect its shape, column types and sample rows.  
3. Check for missing values and duplicates.  

In [2]:
# Import the libraries needed to load, clean, analyze, visualize, and model the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, KBinsDiscretizer, FunctionTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, recall_score, roc_auc_score, confusion_matrix,
    ConfusionMatrixDisplay, classification_report, roc_curve, precision_score,
    f1_score, auc, RocCurveDisplay, PrecisionRecallDisplay,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('whitegrid')

In [3]:
# Load the data
df_churn = pd.read_csv('syria_telco_churn.csv')

# Preview the first five rows
df_churn.head()

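|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
| - | ----- | -------------- | --------- | ------------ | ------------------ | --------------- | --------------------- | ----------------- | --------------- | ---------------- | --- | --------------- | ---------------- | ------------------- | ----------------- | ------------------ | ------------------ | ---------------- | ----------------- | ---------------------- | ----- |
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |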


### Data Inspection
- First, I checked the dataset's shape, column names, data types, summary statistics, missing values, and duplicate rows using `.shape`, `.columns`, `.dtypes`, `.describe()`, `.isnull().sum()`, and `.duplicated().sum()`.
- To streamline this process, I created a function that runs all of these checks at once and prints an overview, avoiding the need to run them one by one.
- I’ll use the function, `inspect_df`, to print:

1. Number of rows and columns  
2. Column names and data types  
3. Descriptive statistics  
4. Missing value counts  
5. Duplicate row counts  

This gives me a fast, reproducible overview of the loaded dataset and highlights areas that need cleaning, standardization, or manipulation before analysis.

In [4]:
# Create a function that prints an overview of a DataFrame
from IPython.display import display  # explicit import so the function also works outside a notebook cell

def inspect_df(df, name=None):
    """Print shape, columns, dtypes, summary statistics, null counts, and duplicate count."""

    # Show the name of the DataFrame being inspected
    print(f"\n=== Inspecting: {name or 'DataFrame'} ===")

    # Print the shape of the DataFrame (rows, columns)
    print(f"Shape: {df.shape}")

    # Print the list of column names
    print(f"Columns: {df.columns.tolist()}")

    # Print the data type of each column
    print("\nData Types:")
    print(df.dtypes)

    # Display descriptive statistics for all columns (numeric and non-numeric)
    print("\nDescriptive Statistics:")
    display(df.describe(include='all'))

    # Print the number of null values in each column
    print("\nMissing Values per Column:")
    print(df.isnull().sum())

    # Print the number of fully duplicated rows
    print(f"\nDuplicate Rows: {df.duplicated().sum()}")


# Call the function on the churn DataFrame
inspect_df(df_churn, name="df_churn")


=== Inspecting: df_churn ===
Shape: (3333, 21)
Columns: ['state', 'account length', 'area code', 'phone number', 'international plan', 'voice mail plan', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls', 'churn']

Data Types:
state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night c

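|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
| - | ----- | -------------- | --------- | ------------ | ------------------ | --------------- | --------------------- | ----------------- | --------------- | ---------------- | --- | --------------- | ---------------- | ------------------- | ----------------- | ------------------ | ------------------ | ---------------- | ----------------- | ---------------------- | ----- |
| count | 3333 | 3333.0 | 3333.0 | 3333 | 3333 | 3333 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | ... | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333 |
| unique | 51 |  |  | 3333 | 2 | 2 |  |  |  |  | ... |  |  |  |  |  |  |  |  |  | 2 |
| top | WV |  |  | 382-4657 | no | no |  |  |  |  | ... |  |  |  |  |  |  |  |  |  | False |
| freq | 106 |  |  | 1 | 3010 | 2411 |  |  |  |  | ... |  |  |  |  |  |  |  |  |  | 2850 |
| mean |  | 101.064806 | 437.182418 |  |  |  | 8.09901 | 179.775098 | 100.435644 | 30.562307 | ... | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |  |
| std |  | 39.822106 | 42.37129 |  |  |  | 13.688365 | 54.467389 | 20.069084 | 9.259435 | ... | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |  |
| min |  | 1.0 | 408.0 |  |  |  | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 25% |  | 74.0 | 408.0 |  |  |  | 0.0 | 143.7 | 87.0 | 24.43 | ... | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |  |
| 50% |  | 101.0 | 415.0 |  |  |  | 0.0 | 179.4 | 101.0 | 30.5 | ... | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |  |
| 75% |  | 127.0 | 510.0 |  |  |  | 20.0 | 216.4 | 114.0 | 36.79 | ... | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |  |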



Missing Values per Column:
state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

Duplicate Rows: 0


## Dataset Overview
- **Shape:** 3,333 records × 21 columns (20 features plus the `churn` target)  
- **Data types:**  
  - Object: `state`, `phone number`, `international plan`, `voice mail plan`  
  - Integer: `account length`, `area code`, `number vmail messages`, `total day calls`, `total eve calls`, `total night calls`, `total intl calls`, `customer service calls`  
  - Float: `total day minutes`, `total day charge`, `total eve minutes`, `total eve charge`, `total night minutes`, `total night charge`, `total intl minutes`, `total intl charge`  
  - Boolean: `churn`

- **Missing values:** 0 across all 21 columns  
- **Duplicate rows:** 0  

- **Account length** ranges from 1 day to 242 days (median = 101 days), suggesting a mix of new and long‑term customers.
- **Usage distributions** are roughly symmetrical for day/eve/night minutes, with 25th–75th percentiles indicating moderate variability.
- **Plan adoption:** “International plan” and “voice mail plan” each have two categories (`yes`/`no`), which can be one‑hot encoded.
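
One possible way to encode the two plan columns is `pd.get_dummies` with `drop_first=True`, which leaves a single 0/1 indicator per plan. The three rows below are made up for illustration, and the resulting column names are simply what pandas generates, not names used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows using the dataset's two plan columns.
toy = pd.DataFrame({
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
})

plan_cols = ["international plan", "voice mail plan"]

# drop_first=True drops the "no" level, so each plan becomes one "<column>_yes" indicator.
encoded = pd.get_dummies(toy, columns=plan_cols, drop_first=True, dtype=int)

print(encoded)
```

Because each column has only two levels, keeping one indicator per plan avoids carrying a redundant, perfectly correlated "no" column into linear models.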

Because there are no missing or duplicate entries and all data types align with expectations, the dataset is clean and ready for deeper exploration.

## Feature Overview

| **Feature Group**   | **Columns**                                                                                                                                      |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| Identifier          | `phone number`                                                                                                                                   |
| Tenure              | `account length`                                                                                                                                 |
| Location / Tier     | `state`, `area code`                                                                                                                             |
| Plans               | `international plan`, `voice mail plan`                                                                                                          |
| Voice Mail Usage    | `number vmail messages`                                                                                                                          |
| Usage (Day)         | `total day minutes`, `total day calls`, `total day charge`                                                                                       |
| Usage (Eve)         | `total eve minutes`, `total eve calls`, `total eve charge`                                                                                       |
| Usage (Night)       | `total night minutes`, `total night calls`, `total night charge`                                                                                 |
| Usage (Intl)        | `total intl minutes`, `total intl calls`, `total intl charge`                                                                                   |
| Support             | `customer service calls`                                                                                                                         |
| Target              | `churn` (Boolean: **True** = churned, **False** = stayed)                                                                                         |
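
The class balance of the target can be worked out from the inspection output above, which reports 3,333 rows and a frequency of 2,850 for `churn == False`. The short calculation below only uses those two numbers; the variable names are illustrative.

```python
# Both inputs come from the inspect_df output above.
total_rows = 3333   # number of records
stayed = 2850       # frequency of churn == False

churned = total_rows - stayed
churn_rate = churned / total_rows

# A model that always predicts "stayed" gets this accuracy while catching no churners.
majority_baseline_accuracy = stayed / total_rows

print(f"churned={churned}, churn rate={churn_rate:.1%}")
print(f"always-'stayed' baseline accuracy={majority_baseline_accuracy:.1%}")
```

With only about one customer in seven churning, accuracy alone would be misleading, which is consistent with using churn-class recall and ROC-AUC as the success metrics.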

---

*Next, I will move into the Exploratory Data Analysis (EDA) phase to visualize these patterns and surface the strongest predictors of churn.* 