**Customer Churn Prediction Using Machine Learning**

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

In [2]:
df=pd.read_csv("Telco-Customer-Churn.csv")

In [3]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


**Applying Exploratory Data Analysis (EDA)**

In [4]:
df.shape

(7043, 21)

Remove columns from dataset 

->Some columns add noise
->Some are duplicate information
->Some have zero effect on churn
->Removing them increases model accuracy

In [5]:
drop_cols = [
    'customerID',
    'gender',
    'PhoneService',
    'MultipleLines',
    'StreamingTV',
    'StreamingMovies',
    'TotalCharges'
]

df.drop(drop_cols, axis=1, inplace=True)


In [6]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,Churn
0,0,Yes,No,1,DSL,No,Yes,No,No,Month-to-month,Yes,Electronic check,29.85,No
1,0,No,No,34,DSL,Yes,No,Yes,No,One year,No,Mailed check,56.95,No
2,0,No,No,2,DSL,Yes,Yes,No,No,Month-to-month,Yes,Mailed check,53.85,Yes
3,0,No,No,45,DSL,Yes,No,Yes,Yes,One year,No,Bank transfer (automatic),42.3,No
4,0,No,No,2,Fiber optic,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,Yes


In [7]:
df.dtypes

SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
Churn                object
dtype: object

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   SeniorCitizen     7043 non-null   int64  
 1   Partner           7043 non-null   object 
 2   Dependents        7043 non-null   object 
 3   tenure            7043 non-null   int64  
 4   InternetService   7043 non-null   object 
 5   OnlineSecurity    7043 non-null   object 
 6   OnlineBackup      7043 non-null   object 
 7   DeviceProtection  7043 non-null   object 
 8   TechSupport       7043 non-null   object 
 9   Contract          7043 non-null   object 
 10  PaperlessBilling  7043 non-null   object 
 11  PaymentMethod     7043 non-null   object 
 12  MonthlyCharges    7043 non-null   float64
 13  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(11)
memory usage: 770.5+ KB


In [9]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


Check unique value in each columns 

In [10]:
def unique_values():
    for i in df.columns : 
        print(i,df[i].unique())
        print()
unique_values()

SeniorCitizen [0 1]

Partner ['Yes' 'No']

Dependents ['No' 'Yes']

tenure [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39]

InternetService ['DSL' 'Fiber optic' 'No']

OnlineSecurity ['No' 'Yes' 'No internet service']

OnlineBackup ['Yes' 'No' 'No internet service']

DeviceProtection ['No' 'Yes' 'No internet service']

TechSupport ['No' 'Yes' 'No internet service']

Contract ['Month-to-month' 'One year' 'Two year']

PaperlessBilling ['Yes' 'No']

PaymentMethod ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']

MonthlyCharges [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]

Churn ['No' 'Yes']



Convert target column (Churn) to numeric

**Feature Encoding**

In [11]:
df['Churn'] = df['Churn'].map({'Yes':1, 'No':0})


Convert other categorical columns

In [12]:
df_encoded = pd.get_dummies(df, drop_first=True)


In [13]:
df_encoded

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,Churn,Partner_Yes,Dependents_Yes,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,...,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No internet service,TechSupport_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,0,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
1,0,34,56.95,0,False,False,False,False,False,True,...,False,True,False,False,True,False,False,False,False,True
2,0,2,53.85,1,False,False,False,False,False,True,...,False,False,False,False,False,False,True,False,False,True
3,0,45,42.30,0,False,False,False,False,False,True,...,False,True,False,True,True,False,False,False,False,False
4,0,2,70.70,1,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,24,84.80,0,True,True,False,False,False,True,...,False,True,False,True,True,False,True,False,False,True
7039,0,72,103.20,0,True,True,True,False,False,False,...,False,True,False,False,True,False,True,True,False,False
7040,0,11,29.60,0,True,True,False,False,False,True,...,False,False,False,False,False,False,True,False,True,False
7041,1,4,74.40,1,True,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,True


In [14]:
df.shape

(7043, 14)

Calculate correlation

In [15]:
correlation = df_encoded.corr()


See correlation with Churn only

In [16]:
churn_corr = correlation['Churn'].sort_values(ascending=False)
print(churn_corr)


Churn                                    1.000000
InternetService_Fiber optic              0.308020
PaymentMethod_Electronic check           0.301919
MonthlyCharges                           0.193356
PaperlessBilling_Yes                     0.191825
SeniorCitizen                            0.150889
DeviceProtection_Yes                    -0.066160
OnlineBackup_Yes                        -0.082255
PaymentMethod_Mailed check              -0.091683
PaymentMethod_Credit card (automatic)   -0.134302
Partner_Yes                             -0.150448
Dependents_Yes                          -0.164221
TechSupport_Yes                         -0.164674
OnlineSecurity_Yes                      -0.171226
Contract_One year                       -0.177820
OnlineSecurity_No internet service      -0.227890
OnlineBackup_No internet service        -0.227890
DeviceProtection_No internet service    -0.227890
InternetService_No                      -0.227890
TechSupport_No internet service         -0.227890


We removed low-correlation features like Device Protection, Online Backup, and Mailed Check payment because they had very weak relationship with churn. Removing them reduces noise and improves model performance

In [17]:
# drop_low_impact = [
#     'DeviceProtection_Yes',
#     'OnlineBackup_Yes',
#     'PaymentMethod_Mailed check'
# ]

# df_encoded.drop(drop_low_impact, axis=1, inplace=True)

In [18]:
df_encoded

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,Churn,Partner_Yes,Dependents_Yes,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,...,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No internet service,TechSupport_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,0,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
1,0,34,56.95,0,False,False,False,False,False,True,...,False,True,False,False,True,False,False,False,False,True
2,0,2,53.85,1,False,False,False,False,False,True,...,False,False,False,False,False,False,True,False,False,True
3,0,45,42.30,0,False,False,False,False,False,True,...,False,True,False,True,True,False,False,False,False,False
4,0,2,70.70,1,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,24,84.80,0,True,True,False,False,False,True,...,False,True,False,True,True,False,True,False,False,True
7039,0,72,103.20,0,True,True,True,False,False,False,...,False,True,False,False,True,False,True,True,False,False
7040,0,11,29.60,0,True,True,False,False,False,True,...,False,False,False,False,False,False,True,False,True,False
7041,1,4,74.40,1,True,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,True


Machine learning models need two separate inputs:

Features (X) â†’ input variables for prediction (e.g., Tenure, MonthlyCharges, Age).

Target (y) â†’ the output variable you want to predict (e.g., Churn yes/no).

**Feature and Target Separation**

In [21]:
X = df.drop("Churn", axis=1)
y = df["Churn"]


In [22]:
X = pd.get_dummies(X, drop_first=True)


**Feaature Scaling**

-> Feature scaling is very important Because models like SVM, KNN, Logistic Regression depend on distance and magnitude.
-> We used StandardScaler to normalize all numeric features so that no variable dominates the model. This is especially important for distance-based models like KNN and SVM

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


**Train Test Split**

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# random_state ensures same split every time you run the code.
# It helps in reproducibility.
# Without this, train/test data will change on every run.

In [44]:
df["Churn"].value_counts(normalize=True) * 100

Churn
0    73.463013
1    26.536987
Name: proportion, dtype: float64

This line calculates the percentage distribution of each class (0 and 1) in the Churn column so we can check whether the dataset is balanced or imbalanced.
Balanced dataset : e.g. 50:50, 60:40
Mildly imbalanced : e.g. 70:30 
Highly imbalanced : e.g. 95:5

**Logistic Regression**

In [51]:
from sklearn.metrics import classification_report

In [49]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(X_train , y_train)

lr_pred = lr.predict(X_test)
print("Logistic Regression Classification Report")
print(classification_report(y_test, lr_pred))

Logistic Regression Classification Report
              precision    recall  f1-score   support

           0       0.86      0.91      0.88      1036
           1       0.70      0.57      0.63       373

    accuracy                           0.82      1409
   macro avg       0.78      0.74      0.76      1409
weighted avg       0.81      0.82      0.82      1409



-> linear classification algo. 
-> tries to find straight line.
-> used when target var is cat hav only 2 cls.
-> calculates a score using a linear formula ,applies a sigmoid function.
-> predict probabilty 
-> define decision boundry 0.5 
-> if probability > 0.5 = 1 otherwise = 0 

**Random Forest Classifier**

In [50]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200,max_depth=10)
rfc.fit(X_train,y_train)

rfc_pred = rfc.predict(X_test)
print("Decision Tree Classification Report")
print(classification_report(y_test, dt_pred))


Decision Tree Classification Report
              precision    recall  f1-score   support

           0       0.82      0.94      0.87      1036
           1       0.71      0.43      0.54       373

    accuracy                           0.80      1409
   macro avg       0.76      0.68      0.70      1409
weighted avg       0.79      0.80      0.78      1409



In [None]:
-> ensemble model , uses many decision trees.
-> Each tree is trained on different random parts of the data.
-> Each tree gives its own prediction.
-> The final result is decided by majority voting.
-> That is why Random Forest usually performs better than a single Decision Tree.

**Gradient Boosting Classifier**

In [48]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)
print("Gradient Boosting Classification Report")
print(classification_report(y_test, y_pred_gb))

Gradient Boosting Classification Report
              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1036
           1       0.67      0.54      0.60       373

    accuracy                           0.81      1409
   macro avg       0.76      0.72      0.73      1409
weighted avg       0.80      0.81      0.80      1409



**Models Kept**

Logistic Regression, Random Forest, Gradient Boosting

Reason:
These models provided better recall and F1-score for churn, and handled imbalanced data more effectively.

**Models Removed**

KNN, Decision Tree, SVM

Reason:
These models gave low recall for churn customers, meaning they failed to identify many customers who actually left. Since missing a churn customer causes business loss, these models were not suitable.

In [59]:
print("Decision Tree Classification Report")
print(classification_report(y_test, dt_pred))

print("SVM Classification Report")
print(classification_report(y_test, svc_pred))

print("KNN Classification Report")
print(classification_report(y_test, knn_pred))

Decision Tree Classification Report
              precision    recall  f1-score   support

           0       0.82      0.94      0.87      1036
           1       0.71      0.43      0.54       373

    accuracy                           0.80      1409
   macro avg       0.76      0.68      0.70      1409
weighted avg       0.79      0.80      0.78      1409

SVM Classification Report
              precision    recall  f1-score   support

           0       0.82      0.94      0.87      1036
           1       0.71      0.43      0.53       373

    accuracy                           0.80      1409
   macro avg       0.77      0.68      0.70      1409
weighted avg       0.79      0.80      0.78      1409

KNN Classification Report
              precision    recall  f1-score   support

           0       0.83      0.86      0.84      1036
           1       0.56      0.49      0.53       373

    accuracy                           0.77      1409
   macro avg       0.69      0.68      0

**ðŸ”¹ Project Objective**

To predict whether a customer will churn or stay using machine learning.

This helps the company take preventive actions before losing customers.

**ðŸ”¹ Data Preprocessing**
Missing values were handled to avoid incorrect predictions.

Categorical data was converted using One-Hot Encoding.

Feature scaling was applied so that all features contribute equally.


**ðŸ”¹ Models Used**
Logistic Regression

KNN

Decision Tree

Random Forest

SVM

Boosting Models (AdaBoost, Gradient Boosting)


**ðŸ”¹ Evaluation Metrics**
Accuracy

Precision

Recall

F1-Score

Recall for churn was given highest importance because missing a churn customer causes business loss.


**ðŸ”¹ Model Selection**

KNN, Decision Tree and SVM were removed due to low churn recall.

Logistic Regression, Random Forest and Gradient Boosting performed better.


**ðŸ”¹ Final Model**

Logistic Regression was selected because it gave the highest recall and F1-score for churn customers and is easy to explain.


**ðŸ”¹ Business Use**

The model identifies high-risk customers.

The company can offer discounts, calls, or special offers to retain them.

This improves customer retention and profit.

In [57]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3) #class_weight='balanced')
dt.fit(X_train,y_train)

dt_pred = dt.predict(X_test)

In [58]:
from sklearn.svm import SVC 

svm = SVC()
svm.fit(X_train,y_train)

svc_pred = svm.predict(X_test)

In [60]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train,y_train)

knn_pred = knn.predict(X_test)