## TELCO CUSTOMER CHURN - MACHINE LEARNING PROJECT

Dataset - Location : https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data

### Project Overview :

* This project predicts whether a customer of a telecom company (Telco) will churn (i.e., stop using the company's services) or stay.
* Churn prediction matters to telecom companies because retaining customers is cheaper than acquiring new ones.
* The dataset includes customer demographics, account information, and services subscribed, along with a churn label (Yes/No).

### Aim of the Project :

* Primary Goal : Predict customer churn from historical data.
* Business Objective : Help the telecom company identify customers likely to leave, so it can take proactive retention actions (discounts, offers, improved support).
* ML Objective : Build a binary classification model.

NOTE :
* Churn means a customer leaves (cancels a service/subscription).
  In the Telco Customer Churn dataset, each row is a customer, and the target variable Churn tells us:
  "Yes" → customer left (churned).
  "No" → customer stayed (retained).
* Business importance:
  Churn prediction helps a company identify which customers are likely to leave so it can give discounts, improve support, or launch retention campaigns.
  Example: If churn rate = 26%, the company loses roughly 1 out of 4 customers → huge revenue loss.
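
As a minimal sketch of how the churn rate in the example is computed, the snippet below uses a small made-up list of labels (not the real dataset). With the actual data, the same calculation would run on the `Churn` column.

```python
# Toy labels standing in for the Churn column (illustrative values, not real data).
churn_labels = ["No", "Yes", "No", "No", "Yes", "No", "No", "No"]

# Churn rate = number of churned customers / total number of customers.
churn_rate = sum(label == "Yes" for label in churn_labels) / len(churn_labels)

print(f"Churn rate: {churn_rate:.0%}")
```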

### Features in the Telco Customer Churn Dataset :

* customerID → Unique identifier for each customer (not useful for ML, so we drop it).

Customer Demographics :

* gender → Male / Female.

* SeniorCitizen → 1 if customer is a senior citizen, 0 otherwise.

* Partner → Yes/No → Whether the customer has a spouse/partner.

* Dependents → Yes/No → Whether the customer has dependents (kids, family).

Account Information :

* tenure → Number of months the customer has stayed with the company (important feature: long-tenure customers are usually less likely to churn).

* Contract → Month-to-month / One year / Two year → Type of contract (longer contracts usually mean lower churn).

* PaperlessBilling → Yes/No → Whether the customer gets paperless bills (digital).

* PaymentMethod → Type of payment (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).

Services Signed Up :

* PhoneService → Yes/No → Whether customer has a phone service.

* MultipleLines → No / Yes / No phone service → Whether customer has multiple lines.

* InternetService → DSL / Fiber optic / No.

* OnlineSecurity → Yes/No/No internet service → Extra internet security service.

* OnlineBackup → Yes/No/No internet service → Cloud backup service.

* DeviceProtection → Yes/No/No internet service → Insurance/protection for devices.

* TechSupport → Yes/No/No internet service → Whether they receive tech support service.

* StreamingTV → Yes/No/No internet service.

* StreamingMovies → Yes/No/No internet service.

Billing Information :

* MonthlyCharges → The amount the customer pays per month.

* TotalCharges → Total amount charged to the customer over their tenure.

* ⚠ In the raw CSV this column is read as text and contains a few blank entries, so it must be converted to numeric and cleaned (see Step 4).

Target :

* Churn → Yes/No (did the customer leave?).

### NOTE : Quick Notes on Feature Importance 
* tenure: Customers with longer tenure usually stay → very strong predictor.
* Contract: Month-to-month contracts have the highest churn.
* MonthlyCharges: Higher charges often go with higher churn, but this also depends on the services subscribed.
* InternetService: Fiber optic customers tend to churn more than DSL customers (a pattern commonly reported in analyses of this dataset).
* PaymentMethod: Customers paying via electronic check have higher churn.
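
The notes above are claims that can be checked directly by computing the churn rate within each category of a feature. A minimal sketch on a tiny made-up frame (the rows are illustrative, not taken from the dataset); on the real data the same `groupby` would be applied to `df` before encoding:

```python
import pandas as pd

# Tiny illustrative frame (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Share of churned customers within each category of a feature.
churn_by_contract = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("Contract")["churned"]
       .mean()
       .sort_values(ascending=False)
)
print(churn_by_contract)
```

Swapping `"Contract"` for `"PaymentMethod"` or `"InternetService"` gives the corresponding breakdowns.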

In [1]:
import warnings
warnings.filterwarnings("ignore")  # Suppress all warnings (keeps output tidy, but can also hide useful deprecation notices).

## STEP 1 : IMPORT LIBRARIES 

In [2]:
# Data handling

import numpy as np
import pandas as pd # pandas,numpy -- handle dataset.

In [3]:
# Data visualization

import matplotlib.pyplot as plt
import seaborn as sns   # matplotlib,seaborn -- visualizations.

## STEP 2 : LOAD THE DATASET 

In [4]:
df = pd.read_csv("/Users/b.tejateja/Downloads/Telco_customer_ML/Telco_customer_churn_ML.csv")

df.head()  # Display first 5 rows.

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## STEP 3 : EXPLORE THE DATASET

In [5]:
df.shape         # Shape of the dataset (rows, columns)

(7043, 21)

In [6]:
df.info()         # Columns and their data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [7]:
df.describe()   # Summary statistics (only for numeric columns).

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [8]:
df.isnull().sum()  # Check missing values (blank strings are not counted as NaN).

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

## STEP 4 : HANDLE MISSING VALUES 

In [9]:
# Convert TotalCharges to numeric (it may be stored as string)

df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Check again for missing values

print("Missing values after conversion:")
print(df.isnull().sum())

# Drop the rows with missing TotalCharges (alternatively, fill them with the mean/median; here we drop)

df = df.dropna()

Missing values after conversion:
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
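
As an alternative to dropping, the affected rows could be kept by filling the missing `TotalCharges` values. A minimal sketch on a toy frame, assuming median imputation is acceptable for the use case (the values below are made up):

```python
import pandas as pd

# Toy frame: a blank string mimics how a missing TotalCharges value can appear in a CSV.
toy = pd.DataFrame({
    "tenure": [0, 12, 24, 36],
    "TotalCharges": [" ", "600.5", "1500", "2400.25"],
})

# Blank strings cannot be parsed, so they become NaN after coercion.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = toy["TotalCharges"].isna().sum()

# Fill with the median of the observed values instead of dropping the row.
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(n_missing, median_value)
```

Dropping is simpler and loses very few rows here; imputation keeps every customer at the cost of introducing an estimated value.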


In [10]:
print("Missing values in Churn:", df["Churn"].isnull().sum())
print(df["Churn"].unique())

Missing values in Churn: 0
['No' 'Yes']


## STEP 5 : ENCODE CATEGORICAL VARIABLES 

In [11]:
# Drop CustomerID (not useful for ML)
df = df.drop("customerID", axis=1)

# Encode 'Yes/No' columns into 1/0
yes_no_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
for col in yes_no_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# Encode gender (Male=1, Female=0)
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})

# One-hot encode the remaining multi-category columns (Contract, PaymentMethod, InternetService, etc.);
# drop_first=True drops one level per column to avoid redundant dummy columns.
df = pd.get_dummies(df, drop_first=True)

# Check processed dataset
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0,1,0,1,29.85,29.85,0,...,False,False,False,False,False,False,False,False,True,False
1,1,0,0,0,34,1,0,56.95,1889.5,0,...,False,False,False,False,False,True,False,False,False,True
2,1,0,0,0,2,1,1,53.85,108.15,1,...,False,False,False,False,False,False,False,False,False,True
3,1,0,0,0,45,0,0,42.3,1840.75,0,...,True,False,False,False,False,True,False,False,False,False
4,0,0,0,0,2,1,1,70.7,151.65,1,...,False,False,False,False,False,False,False,False,True,False


In [12]:
# Machine learning preprocessing & modeling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # LabelEncoder is not needed: encoding was done above
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression


## STEP 6 : TRAIN TEST SPLIT

In [13]:

# Separate features (X) and target (y)
X = df.drop("Churn", axis=1)
y = df["Churn"]

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (5625, 30)
Test shape: (1407, 30)


## STEP 7 : FEATURE SCALING 

In [14]:

# Scale features (important for models like Logistic Regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


## STEP 8 : BUILD LOGISTIC REGRESSION MODEL

In [15]:

# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.8038379530916845

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87      1033
           1       0.65      0.57      0.61       374

    accuracy                           0.80      1407
   macro avg       0.75      0.73      0.74      1407
weighted avg       0.80      0.80      0.80      1407


Confusion Matrix:
 [[916 117]
 [159 215]]


### SUMMARY OF THIS MODEL :
* Of the 1407 customers in the test set, the model classified about 80% correctly (1131 of 1407).

### Classification Report
For Class 0 (No Churn, i.e. customers who stayed):

* Precision = 0.85 → When the model predicts "No Churn", it’s correct 85% of the time.

* Recall = 0.89 → Out of all actual non-churners, the model correctly identified 89%.

* F1 = 0.87 → Balanced score between precision & recall.
👉 Model is very good at detecting people who stay.

For Class 1 (Churn, i.e. customers who left):

* Precision = 0.65 → When the model predicts "Churn", 65% are truly churners (35% false alarms).

* Recall = 0.57 → Out of all actual churners, the model only caught 57% (missed 43%).

* F1 = 0.61 → Moderate, but weaker than class 0.
👉 Model is not as good at catching churners (but still better than random).

### Confusion Matrix :
* 916 customers correctly identified as staying (true negatives).

* 215 customers correctly identified as leaving (true positives).

* 159 churners missed (false negatives, bad for business!).

* 117 loyal customers falsely flagged as churners (false positives, i.e. false alarms).
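
The precision, recall, and F1 figures in the classification report follow directly from these four counts. A short worked sketch, using only the numbers from the confusion matrix above:

```python
# Counts taken from the confusion matrix reported above.
tn, fp = 916, 117   # actual "stay": predicted stay / predicted churn
fn, tp = 159, 215   # actual "churn": predicted stay / predicted churn

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_churn = tp / (tp + fp)   # of predicted churners, how many really churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.2f} precision={precision_churn:.2f} "
      f"recall={recall_churn:.2f} f1={f1_churn:.2f}")
```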

### Business Interpretation

* Model is strong at predicting loyal customers (class 0).

* But it misses ~43% of churners (class 1).

* From a business view, this is risky — missing churners = lost revenue.
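👉 That's why we next try Decision Tree, Random Forest, XGBoost, and SVC, to see whether they improve recall for churners. Resampling techniques such as SMOTE are another option (not shown here).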

### Summary in plain words:
The Logistic Regression model is about 80% accurate overall. It is good at spotting customers who will stay, but weaker at identifying those who will leave (it misses ~43% of them). For churn prediction, this makes the model a decent starting point, but other models are worth trying to improve recall for churners.

## STEP 8.1 : BUILDING DECISION TREE CLASSIFIER 

In [16]:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))


Decision Tree Accuracy: 0.7782515991471215
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1033
           1       0.58      0.60      0.59       374

    accuracy                           0.78      1407
   macro avg       0.72      0.72      0.72      1407
weighted avg       0.78      0.78      0.78      1407



## STEP 8.2 : BUILDING RANDOM FOREST CLASSIFIER

In [17]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.7896233120113717
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.63      0.50      0.56       374

    accuracy                           0.79      1407
   macro avg       0.73      0.70      0.71      1407
weighted avg       0.78      0.79      0.78      1407



In [18]:
!pip install xgboost



## STEP 8.3 : BUILDING XGBCLASSIFIER 

In [19]:
from xgboost import XGBClassifier

# use_label_encoder is deprecated in recent XGBoost releases and is not needed for 0/1 labels.
xgb = XGBClassifier(eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))


XGBoost Accuracy: 0.7782515991471215
              precision    recall  f1-score   support

           0       0.84      0.86      0.85      1033
           1       0.59      0.55      0.57       374

    accuracy                           0.78      1407
   macro avg       0.71      0.70      0.71      1407
weighted avg       0.77      0.78      0.78      1407



## STEP 8.4 : BUILDING SVC MODEL

In [20]:
from sklearn.svm import SVC

svm = SVC(kernel="rbf", probability=True, random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.7867803837953091
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.63      0.49      0.55       374

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.71      1407
weighted avg       0.78      0.79      0.78      1407



## NOTE : Why not just accuracy?

In the Telco dataset, about 73% of customers stay (No Churn) and 27% leave (Churn).

If a model predicts everyone as “No Churn”, accuracy = 73% — but it catches zero churners (useless for business).
👉 So accuracy can be misleading when classes are imbalanced.
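
To make the point concrete, the sketch below computes the accuracy and churn recall of a hypothetical "model" that always predicts "No Churn", using the class counts of the test set reported in the classification reports above (1033 stayers, 374 churners):

```python
# Class counts from the test set used above (support column of the reports).
n_stay, n_churn = 1033, 374

# Always predicting "No Churn" is right for every stayer and wrong for every churner.
baseline_accuracy = n_stay / (n_stay + n_churn)
baseline_recall_churn = 0 / n_churn  # it never flags a single churner

print(f"baseline accuracy={baseline_accuracy:.3f}, churn recall={baseline_recall_churn:.1f}")
```

Any real model therefore needs to be judged against roughly 73% accuracy, not against 50%.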

## Precision vs Recall

Precision (for churn) → Of all customers we predicted as churn, how many actually churn?

Recall (for churn) → Of all customers who actually churned, how many did we catch?

## Why Recall is more important here

In customer churn, the business priority is to identify as many churners as possible, because:

* Missing a churner = lost customer, lost revenue.

* Acting on a “false churn” (a loyal customer we mistakenly flagged) costs less (maybe a discount offer, which is cheaper than losing the customer entirely).

👉 That means Recall (catching churners) is often more critical than Precision.

## Why compare recall across models

Different models (Logistic Regression, Decision Tree, Random Forest, XGBoost, SVC) give different trade-offs between precision & recall.

By comparing recall values, we can select the model that catches the most churners.

Later, we can tune the decision threshold (default = 0.5) to adjust the balance between precision & recall.
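
A minimal sketch of threshold tuning, assuming a classifier that exposes `predict_proba`. It uses synthetic data from `make_classification` as a stand-in for the churn data, so the printed numbers are illustrative only; the thresholds 0.5 and 0.3 are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 27% positives) standing in for the churn data.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.73, 0.27],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_churn = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3):
    # Flag a customer as "churn" when the predicted probability reaches the threshold.
    y_hat = (proba_churn >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, y_hat, zero_division=0):.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only flag more customers, so recall never decreases, while precision typically falls. The right threshold depends on the relative cost of a missed churner versus an unnecessary retention offer.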

## Bottom line:
We compare recall values because in churn prediction, catching churners (Recall) is more valuable than just having a high overall accuracy.

In [21]:
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Decision Tree", "XGBClassifier", "SVC"],
    "Accuracy": [0.8038, 0.7896, 0.7782, 0.7782, 0.7867],
    "Recall (Churn)": [0.57, 0.50, 0.60, 0.55, 0.49],
    "Precision (Churn)": [0.65, 0.63, 0.58, 0.59, 0.63],
    "F1 (Churn)": [0.61, 0.56, 0.59, 0.57, 0.55]
})
results

Unnamed: 0,Model,Accuracy,Recall (Churn),Precision (Churn),F1 (Churn)
0,Logistic Regression,0.8038,0.57,0.65,0.61
1,Random Forest,0.7896,0.5,0.63,0.56
2,Decision Tree,0.7782,0.6,0.58,0.59
3,XGBClassifier,0.7782,0.55,0.59,0.57
4,SVC,0.7867,0.49,0.63,0.55
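
Since recall on churners is the priority, it can help to rank the table by that column. A small sketch that rebuilds the relevant columns from the figures above and sorts them; the variable names `ranked`, `best_recall_model`, and `best_f1_model` are introduced here for illustration:

```python
import pandas as pd

# Same figures as the results table above.
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Decision Tree", "XGBClassifier", "SVC"],
    "Recall (Churn)": [0.57, 0.50, 0.60, 0.55, 0.49],
    "F1 (Churn)": [0.61, 0.56, 0.59, 0.57, 0.55],
})

# Rank models by how many churners they catch.
ranked = results.sort_values("Recall (Churn)", ascending=False).reset_index(drop=True)
best_recall_model = ranked.loc[0, "Model"]
best_f1_model = results.loc[results["F1 (Churn)"].idxmax(), "Model"]

print(ranked)
print("Best recall:", best_recall_model, "| Best F1:", best_f1_model)
```

On these numbers the Decision Tree has the highest churn recall, while Logistic Regression has the highest F1, so the choice depends on how much precision the business is willing to trade for recall.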


## DEPLOYMENT OF MODEL :

### STEP 1 : Save your model and preprocessing objects 

In [24]:
import streamlit as st
import pandas as pd
import pickle

In [25]:
with open('logistic_model.pkl', 'rb') as f:
    model = pickle.load(f)
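
The load step above assumes a pickle file already exists; the notebook does not show how it was written. A minimal sketch of one way to save the fitted model together with its scaler and column order, using tiny synthetic data in place of the real training set. The bundle keys, the file name `churn_bundle_example.pkl`, and the column names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the training data (illustrative only).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Bundle everything the app needs at prediction time; the key names are hypothetical.
bundle = {"model": model, "scaler": scaler, "columns": ["f0", "f1", "f2"]}

path = os.path.join(tempfile.gettempdir(), "churn_bundle_example.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Loading mirrors the cell above.
with open(path, "rb") as f:
    loaded = pickle.load(f)

same = bool((loaded["model"].predict(loaded["scaler"].transform(X_train))
             == model.predict(scaler.transform(X_train))).all())
print("Round trip preserved predictions:", same)
```

Saving the scaler alongside the model matters because the model was trained on scaled features, so raw app inputs need the same transformation before prediction.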

In [26]:
st.title("📊 Telco Customer Churn Prediction App")
st.write("""
This app predicts whether a customer will **churn**...
""")



In [27]:
option = st.selectbox("Choose prediction type:", ["Single Customer", "Upload CSV"])




In [29]:
def user_input_features():
    gender = st.sidebar.selectbox('Gender', ('Male', 'Female'))
    gender_num = 1 if gender=='Male' else 0

    has_internet = st.sidebar.selectbox('Has Internet Service?', (True, False))
    has_internet_num = 1 if has_internet else 0

    SeniorCitizen = st.sidebar.selectbox('Senior Citizen', (0,1))
    tenure = st.sidebar.slider('Tenure (Months)', 0, 72, 12)
    # Use float bounds so decimal amounts such as 29.85 can be entered.
    MonthlyCharges = st.sidebar.number_input('Monthly Charges', 0.0, 1000.0, 70.0)
    TotalCharges = st.sidebar.number_input('Total Charges', 0.0, 10000.0, 500.0)

    data = {
        'gender': gender_num,
        'SeniorCitizen': SeniorCitizen,
        'HasInternetService': has_internet_num,
        'tenure': tenure,
        'MonthlyCharges': MonthlyCharges,
        'TotalCharges': TotalCharges
    }
    features = pd.DataFrame(data, index=[0])
    return features

In [30]:
input_df = user_input_features()
st.subheader("Customer Input")
st.dataframe(input_df)



DeltaGenerator()
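
Before the input can be passed to the model, its columns have to match the features the model was trained on (the encoded frame had 30 columns, whereas the form above collects only a handful and includes `HasInternetService`, which is not a training column). A minimal sketch of one way to align them with `reindex`. The `training_columns` list is a hypothetical subset, and the dummy column names are assumed to follow the `get_dummies` naming used earlier.

```python
import pandas as pd

# Hypothetical subset of the training columns produced by get_dummies.
training_columns = ["gender", "SeniorCitizen", "tenure", "MonthlyCharges",
                    "TotalCharges", "InternetService_Fiber optic", "InternetService_No"]

# One row of raw app input (illustrative values).
raw_input = pd.DataFrame([{
    "gender": 1, "SeniorCitizen": 0, "tenure": 12,
    "MonthlyCharges": 70.0, "TotalCharges": 500.0,
    "InternetService": "Fiber optic",
}])

# One-hot encode, then force the exact training column order; columns the
# input does not produce are filled with 0.
encoded = pd.get_dummies(raw_input)
aligned = encoded.reindex(columns=training_columns, fill_value=0)

print(aligned)
```

In the app, `aligned` would then go through the saved scaler before calling `model.predict`, assuming the scaler was persisted alongside the model.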