<h1 style="font-family: 'poppins'; font-weight: bold; color: Green;">👨‍💻Author: SOBIA ALAMGIR</h1>

[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/sobiahashmi) 
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.linkedin.com/in/sobia-alamgir-a027b939/) 
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/sobia-alamgir-a027b939/)



<!-- [![Streamlit](https://img.shields.io/badge/Streamlit-Open%20App-FF4B4B?style=for-the-badge&logo=streamlit&logoColor=white)](https://predict-podcast-listening-time-fgkp77kmvwwpruyistfzhj.streamlit.app/) -->

<a id="13"></a>
<h1 style="background-color:#435420;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;color:#FF9900;">Customer Churn Prediction for a Telecom Company</h1>

**Table of contents**<a id='toc0_'></a>    
- [Step-01 Load Libraries](#toc1_1_)    
  - [Step-02 Load Dataset](#toc1_2_)    
  - [Step-03 Data Preprocessing](#toc1_3_)    
  - [Step-04 Split the dataset into Training and Testing](#toc1_4_)    
  - [Step-05 Model Selection](#toc1_5_)    
    - [(i) Apply Logistic Regression](#toc1_5_1_)    
    - [(ii) Apply XG Boost](#toc1_5_2_)    
    - [(iii) Apply XG Boost with Optuna](#toc1_5_3_)    
  - [Step-06 Model Prediction](#toc1_6_)    
  - [Step-07 Save and Load Model](#toc1_7_)    
  - [Step-08 Model Evaluation](#toc1_8_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

- This is a binary classification task: predict whether a customer churns (`Churn` = 1) or stays (`Churn` = 0).

## <a id='toc1_1_'></a>[Step-01 Load Libraries](#toc0_)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.model_selection import train_test_split, StratifiedKFold ,cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix,roc_auc_score, accuracy_score, classification_report

from xgboost import XGBClassifier

import optuna
import joblib

import warnings
warnings.filterwarnings("ignore")

## <a id='toc1_2_'></a>[Step-02 Load Dataset](#toc0_)

In [3]:
np.random.seed(42)

n = 10000

data = pd.DataFrame({
    'CustomerID': np.arange(n),
    'Gender': np.random.choice(['Male', 'Female'], size=n),
    'SeniorCitizen': np.random.choice([0, 1], size=n),
    # randint excludes the upper bound, so Tenure ranges from 1 to 71
    'Tenure': np.random.randint(1, 72, size=n),
    'MonthlyCharges': np.round(np.random.uniform(20, 120, size=n), 2),
    # Placeholder that keeps the column order; a lambda passed here would be
    # stored as an object instead of being evaluated, so the values are computed below
    'TotalCharges': np.nan,
    'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], size=n),
    'PaymentMethod': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], size=n),
    # Churn is drawn independently of every other column (about 27% positives)
    'Churn': np.random.choice([0, 1], size=n, p=[0.73, 0.27])
})

# TotalCharges = Tenure * MonthlyCharges, rounded to two decimals
data['TotalCharges'] = (data['Tenure'] * data['MonthlyCharges']).round(2)



In [None]:
df = data.copy()
display(df.head())
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

| | CustomerID | Gender | SeniorCitizen | Tenure | MonthlyCharges | TotalCharges | Contract | PaymentMethod | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 0 | 55 | 111.88 | 6153.4 | Two year | Mailed check | 0 |
| 1 | 1 | Female | 1 | 36 | 58.7 | 2113.2 | Two year | Electronic check | 0 |
| 2 | 2 | Male | 0 | 37 | 118.86 | 4397.82 | One year | Electronic check | 0 |
| 3 | 3 | Male | 1 | 14 | 96.14 | 1345.96 | Month-to-month | Mailed check | 1 |
| 4 | 4 | Male | 1 | 27 | 28.05 | 757.35 | Two year | Mailed check | 0 |


Number of rows: 10000
Number of columns: 9
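
Since `Churn` is drawn with `np.random.choice` independently of the other columns, it may be worth confirming the class balance and checking that a feature such as `Tenure` carries no signal about the target before modelling. The following is a minimal sketch, assuming the same generation recipe as above. It uses its own random generator, so the individual values differ from the notebook's, and the names `tenure`, `churn`, `churn_rate`, and `corr` are illustrative only.

```python
import numpy as np

# Assumption: a separate generator, so the draws differ from the notebook's data
rng = np.random.default_rng(42)
n = 10_000

tenure = rng.integers(1, 72, size=n)                 # 1..71, upper bound excluded
churn = rng.choice([0, 1], size=n, p=[0.73, 0.27])   # drawn independently of tenure

churn_rate = churn.mean()                            # share of positive labels
corr = np.corrcoef(tenure, churn)[0, 1]              # Pearson correlation

print(f"Churn rate: {churn_rate:.3f}")
print(f"Correlation between Tenure and Churn: {corr:.3f}")
```

A correlation close to zero suggests that no model can be expected to beat the majority-class rate on labels generated this way.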


## <a id='toc1_3_'></a>[Step-03 Data Preprocessing](#toc0_)

* **Let's check the column types and non-null counts of the dataset**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CustomerID      10000 non-null  int32  
 1   Gender          10000 non-null  object 
 2   SeniorCitizen   10000 non-null  int32  
 3   Tenure          10000 non-null  int32  
 4   MonthlyCharges  10000 non-null  float64
 5   TotalCharges    10000 non-null  float64
 6   Contract        10000 non-null  object 
 7   PaymentMethod   10000 non-null  object 
 8   Churn           10000 non-null  int32  
dtypes: float64(2), int32(4), object(3)
memory usage: 547.0+ KB


* **Let's check the percentage of null values in each column**

In [6]:
df.isnull().sum()/len(df)*100

CustomerID        0.0
Gender            0.0
SeniorCitizen     0.0
Tenure            0.0
MonthlyCharges    0.0
TotalCharges      0.0
Contract          0.0
PaymentMethod     0.0
Churn             0.0
dtype: float64

* **Let's drop CustomerID; it is only an identifier and carries no predictive information**

In [7]:
df.drop("CustomerID", axis = 1, inplace = True)

In [8]:
df.head()

| | Gender | SeniorCitizen | Tenure | MonthlyCharges | TotalCharges | Contract | PaymentMethod | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | Male | 0 | 55 | 111.88 | 6153.4 | Two year | Mailed check | 0 |
| 1 | Female | 1 | 36 | 58.7 | 2113.2 | Two year | Electronic check | 0 |
| 2 | Male | 0 | 37 | 118.86 | 4397.82 | One year | Electronic check | 0 |
| 3 | Male | 1 | 14 | 96.14 | 1345.96 | Month-to-month | Mailed check | 1 |
| 4 | Male | 1 | 27 | 28.05 | 757.35 | Two year | Mailed check | 0 |


* **Let's apply one-hot encoding to the categorical columns**
 
   - The `Churn` labels are already encoded as 0/1, so label encoding is not needed.

In [9]:

df_encoded = pd.get_dummies(df, columns = ['Gender','Contract','PaymentMethod'])
df_encoded.head()

| | SeniorCitizen | Tenure | MonthlyCharges | TotalCharges | Churn | Gender_Female | Gender_Male | Contract_Month-to-month | Contract_One year | Contract_Two year | PaymentMethod_Bank transfer | PaymentMethod_Credit card | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 55 | 111.88 | 6153.4 | 0 | False | True | False | False | True | False | False | False | True |
| 1 | 1 | 36 | 58.7 | 2113.2 | 0 | True | False | False | False | True | False | False | True | False |
| 2 | 0 | 37 | 118.86 | 4397.82 | 0 | False | True | False | True | False | False | False | True | False |
| 3 | 1 | 14 | 96.14 | 1345.96 | 1 | False | True | True | False | False | False | False | False | True |
| 4 | 1 | 27 | 28.05 | 757.35 | 0 | False | True | False | False | True | False | False | False | True |


In [None]:
df_encoded.shape

(10000, 14)
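
`pd.get_dummies` keeps one column per category by default, so each group of dummy columns always sums to one. For a linear model such as logistic regression, it can help to drop one level per feature with `drop_first=True`. The sketch below illustrates the difference on a small hypothetical frame (`toy` is not part of the notebook's data).

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Tenure": [1, 12, 24, 6],
})

# Default behaviour: one dummy column per category
full = pd.get_dummies(toy, columns=["Contract"])

# drop_first=True removes the first category, which becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Tree-based models such as XGBoost are generally not affected by the redundant column, so the choice mainly matters for the logistic regression step.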

## <a id='toc1_4_'></a>[Step-04 Split the dataset into Training and Testing](#toc0_)

In [None]:
X = df_encoded.drop('Churn', axis = 1)
y = df_encoded['Churn']

X_train, X_test , y_train , y_test  = train_test_split(X, y , test_size = 0.2 , random_state = 42)

X_train.shape , X_test.shape , y_train.shape , y_test.shape

((8000, 13), (2000, 13), (8000,), (2000,))
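
With roughly 27% positives, a plain random split can leave the train and test sets with slightly different churn rates. One option is to pass `stratify=y` to `train_test_split`, which preserves the class proportions in both parts. This is a minimal sketch on hypothetical data (`X_demo`, `y_demo`), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))        # hypothetical features
y_demo = np.array([1] * 270 + [0] * 730)   # 27% positives, like the Churn column

# stratify=y_demo keeps the 27/73 ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train churn rate: {y_tr.mean():.3f}")
print(f"Test churn rate:  {y_te.mean():.3f}")
```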

## <a id='toc1_5_'></a>[Step-05 Model Selection](#toc0_)

### <a id='toc1_5_1_'></a>[(i) Apply Logistic Regression](#toc0_)

In [None]:
lr = LogisticRegression()
model_lr = lr.fit(X_train,y_train)

### <a id='toc1_5_2_'></a>[(ii) Apply XG Boost](#toc0_)

In [None]:
xgb = XGBClassifier()
model_xgb = xgb.fit(X_train, y_train)
model_xgb

### <a id='toc1_5_3_'></a>[(iii) Apply XG Boost with Optuna](#toc0_)

In [36]:
# Step 1 : Define Objective Function

def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'booster': 'gbtree',
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 5),
    }

    model_xgboost_optuna = XGBClassifier(**params)
    skf = StratifiedKFold(n_splits=5 , shuffle=True , random_state = 42)

    scores = cross_val_score(model_xgboost_optuna, X_train, y_train, scoring = 'accuracy' , cv = skf)

    return scores.mean()

In [37]:
# step 2: Run Optimization
study = optuna.create_study(direction = 'maximize')
study.optimize(objective , n_trials = 50)

[I 2025-06-24 08:36:01,181] A new study created in memory with name: no-name-d16a2401-5b69-4c03-8e86-bc903c327157
[I 2025-06-24 08:36:03,658] Trial 0 finished with value: 0.7321250000000001 and parameters: {'max_depth': 12, 'learning_rate': 0.09886326567581917, 'n_estimators': 268, 'gamma': 1.6260804221459602, 'min_child_weight': 3, 'subsample': 0.591293239642561, 'colsample_bytree': 0.8573825341147474, 'reg_alpha': 4.225562107668251, 'reg_lambda': 2.471570087307508}. Best is trial 0 with value: 0.7321250000000001.
[I 2025-06-24 08:36:09,705] Trial 1 finished with value: 0.730375 and parameters: {'max_depth': 8, 'learning_rate': 0.1326490610375761, 'n_estimators': 624, 'gamma': 1.5216503551752165, 'min_child_weight': 3, 'subsample': 0.8323269890551637, 'colsample_bytree': 0.7002966010750518, 'reg_alpha': 0.9684563965506265, 'reg_lambda': 1.1153090191860477}. Best is trial 0 with value: 0.7321250000000001.
[I 2025-06-24 08:36:15,949] Trial 2 finished with value: 0.7187499999999999 and p

In [None]:
# Step 3: Train Final Model with best parameters
best_params = study.best_params
print("Best Parameters:", best_params)

model_xgboost_optuna = XGBClassifier(**best_params)
model_xgboost_optuna.fit(X_train , y_train) 

Best Parameters: {'max_depth': 9, 'learning_rate': 0.0860172028810596, 'n_estimators': 261, 'gamma': 4.0748562445153915, 'min_child_weight': 5, 'subsample': 0.9365194180845771, 'colsample_bytree': 0.9997945072095886, 'reg_alpha': 1.3198267147953258, 'reg_lambda': 4.966644364270188}
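
The objective above maximizes cross-validated accuracy. On an imbalanced target, accuracy can be maximized by always predicting the majority class, so a ranking metric such as ROC AUC is one alternative to consider for the `scoring` argument. The sketch below compares the two scorers on hypothetical noise data. It uses `LogisticRegression` as a lightweight stand-in estimator rather than the tuned XGBoost model, and `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 4))
# Labels drawn independently of the features, as in the synthetic Churn column
y_demo = rng.choice([0, 1], size=2000, p=[0.73, 0.27])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression()  # stand-in estimator, assumed for illustration only

# Same folds, two different scorers
acc = cross_val_score(clf, X_demo, y_demo, scoring="accuracy", cv=skf).mean()
auc = cross_val_score(clf, X_demo, y_demo, scoring="roc_auc", cv=skf).mean()

print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
```

On labels like these, accuracy stays near the majority-class share while ROC AUC stays near 0.5, which makes the absence of signal easier to spot.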


## <a id='toc1_6_'></a>[Step-06 Model Prediction](#toc0_)

In [None]:
y_pred_lr = model_lr.predict(X_test)
y_pred_xgb = model_xgb.predict(X_test)
y_pred_xgb_optuna = model_xgboost_optuna.predict(X_test)

## <a id='toc1_7_'></a>[Step-07 Save and Load Model](#toc0_)

In [41]:
joblib.dump(model_xgboost_optuna, 'xgboost_model_optuna.pkl')

['xgboost_model_optuna.pkl']

In [None]:
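# Keep a reference to the loaded model so it can be reused for predictions
loaded_model = joblib.load('xgboost_model_optuna.pkl')
loaded_model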
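
A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. The sketch below does this with a small hypothetical `LogisticRegression` model and a temporary directory; the file name `demo_model.pkl` and the data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)   # hypothetical, easily learnable target

model = LogisticRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```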

## <a id='toc1_8_'></a>[Step-08 Model Evaluation](#toc0_)

* **Evaluation Metrics for Logistic Regression**

In [23]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy of Logistic Regression: {accuracy: .2f}")

cf_lr = confusion_matrix(y_test, y_pred_lr)
print(f"Confusion Matrix: \n {cf_lr}")

cr = classification_report(y_test, y_pred_lr)
print(f"Classification Report: \n {cr}")

Accuracy of Logistic Regression:  0.74
Confusion Matrix: 
 [[1472    0]
 [ 528    0]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.74      1.00      0.85      1472
           1       0.00      0.00      0.00       528

    accuracy                           0.74      2000
   macro avg       0.37      0.50      0.42      2000
weighted avg       0.54      0.74      0.62      2000
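
scikit-learn's metric functions take the true labels first and the predictions second. Passing them in the opposite order leaves accuracy unchanged but transposes the confusion matrix and swaps precision with recall in the classification report. A minimal sketch on hypothetical toy labels (`y_true` and `y_pred` here are not the notebook's variables):

```python
from sklearn.metrics import confusion_matrix

# Toy example: the model predicts class 0 for every sample
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes
cm_correct = confusion_matrix(y_true, y_pred)

# Swapped arguments: rows become predicted classes, so the matrix is transposed
cm_swapped = confusion_matrix(y_pred, y_true)

print(cm_correct)
print(cm_swapped)
```

In the correctly ordered matrix, an all-zero second column means the model never predicted class 1.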



* **Evaluation Metrics for XGBoost**

In [32]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred_xgb)
print(f"Accuracy of XG Boost Classifier: {accuracy: .2f}")

cf_xgb = confusion_matrix(y_test, y_pred_xgb)
print(f"Confusion Matrix: \n {cf_xgb}")

cr = classification_report(y_test, y_pred_xgb)
print(f"Classification Report: \n {cr}")

Accuracy of XG Boost Classifier:  0.71
Confusion Matrix: 
 [[1386   86]
 [ 497   31]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.74      0.94      0.83      1472
           1       0.26      0.06      0.10       528

    accuracy                           0.71      2000
   macro avg       0.50      0.50      0.46      2000
weighted avg       0.61      0.71      0.63      2000



* **Evaluation Metrics for XGBoost with Optuna**

In [43]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred_xgb_optuna)
print(f"Accuracy of XG Boost Classifier with Optuna: {accuracy: .2f}")

cf_xgb_optuna = confusion_matrix(y_test, y_pred_xgb_optuna)
print(f"Confusion Matrix: \n {cf_xgb_optuna}")

cr = classification_report(y_test, y_pred_xgb_optuna)
print(f"Classification Report: \n {cr}")

Accuracy of XG Boost Classifier with Optuna:  0.74
Confusion Matrix: 
 [[1472    0]
 [ 528    0]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.74      1.00      0.85      1472
           1       0.00      0.00      0.00       528

    accuracy                           0.74      2000
   macro avg       0.37      0.50      0.42      2000
weighted avg       0.54      0.74      0.62      2000
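
The confusion matrices above show that logistic regression and the Optuna-tuned XGBoost model assign every one of the 2000 test rows to class 0. An accuracy of 0.74 therefore matches what a model that always predicts the majority class would score. One way to make that comparison explicit is a `DummyClassifier` baseline. The sketch below assumes the test-set class counts reported above (1472 non-churners, 528 churners), and the feature matrix `X_demo` is a hypothetical placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 3))          # placeholder features, ignored by the baseline
y_demo = np.array([0] * 1472 + [1] * 528)    # class counts taken from the reports above

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

Since the synthetic `Churn` column is generated independently of the features, matching this baseline is the most any of the three models can be expected to achieve here.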



<a id="13"></a>
<h1 style="background-color:#435420;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;color:#FF9900;">Thank you</h1>