## Predicting Business Customer Churn in the Bulgarian Telecom Sector

#### Author: Rishi Koushik Sridharan

### 1. Introduction

This project focuses on **B2B churn prediction**: forecasting which business clients are likely to terminate their telecom contracts.  
Unlike consumer (B2C) churn, **business churn** often involves a higher revenue loss per customer and is influenced by factors such as contract type, company size, and service complaints.

**Source**: The dataset was collected from a Bulgarian telecom company and contains 8,453 business accounts, each labeled as churned or retained (https://data.mendeley.com/datasets/nrb55gr66h/1).




1️⃣ Introduction

2️⃣ Dataset Overview

3️⃣ Exploratory Data Analysis (EDA)

4️⃣ Data Cleaning & Preprocessing

5️⃣ Feature Engineering

6️⃣ Model Building & Evaluation

7️⃣ Interpretation & Business Insights

8️⃣ Conclusion

### Installing and Importing Libraries 

In [None]:
# Uncomment the following line to install required packages:

# %pip install pandas numpy matplotlib scikit-learn statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

### 2. Dataset Overview

In [None]:
df = pd.read_csv("Baza customer Telecom v2.csv")
print(df.shape)
df.head()

(8453, 14)


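| | PID | CRM_PID_Value_Segment | EffectiveSegment | Billing_ZIP | KA_name | Active_subscribers | Not_Active_subscribers | Suspended_subscribers | Total_SUBs | AvgMobileRevenue | AvgFIXRevenue | TotalRevenue | ARPU | CHURN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123759242 | Bronze | SOHO | 6000.0 | VM | 2 | NaN | NaN | 2 | 40.17 | 0.0 | 40.17 | NaN | No |
| 1 | 126145737 | Bronze | SOHO | 6400.0 | VM | 3 | NaN | NaN | 3 | 40.17 | 0.0 | 40.17 | 13.39 | No |
| 2 | 123506355 | Bronze | SOHO | 6000.0 | DI | 2 | 3.0 | NaN | 5 | 40.17 | 0.0 | 40.17 | 20.09 | No |
| 3 | 112595585 | Bronze | SOHO | 4400.0 | MT | 1 | 2.0 | NaN | 3 | 40.17 | 0.0 | 40.17 | 40.17 | No |
| 4 | 115097935 | Iron | SOHO | 4000.0 | AD | 2 | 1.0 | NaN | 3 | 40.17 | 0.0 | 40.17 | 20.09 | No |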
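
In the five preview rows, `ARPU` appears to equal `TotalRevenue / Active_subscribers` rounded to two decimals. This is an inference from the preview only, not something the dataset documentation is known to state. A minimal sketch that checks the assumption on those rows (the names `preview`, `expected`, and `arpu_filled` are illustrative):

```python
import numpy as np
import pandas as pd

# The five preview rows, reduced to the columns needed for the check.
preview = pd.DataFrame({
    "Active_subscribers": [2, 3, 2, 1, 2],
    "TotalRevenue": [40.17, 40.17, 40.17, 40.17, 40.17],
    "ARPU": [np.nan, 13.39, 20.09, 40.17, 20.09],
})

# Assumption: ARPU = TotalRevenue / Active_subscribers (rounded to 2 decimals).
expected = preview["TotalRevenue"] / preview["Active_subscribers"]

# Compare only the rows where ARPU is present, allowing for rounding.
known = preview["ARPU"].notna()
matches = np.isclose(preview.loc[known, "ARPU"], expected[known], atol=0.01)
print(matches.all())

# If the relationship holds on the full data, a missing ARPU could be recomputed.
arpu_filled = preview["ARPU"].fillna(expected.round(2))
print(arpu_filled)
```

Before relying on this to fill the missing `ARPU` value, the same comparison should be run on the full `df`, since five rows are weak evidence.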


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8453 entries, 0 to 8452
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PID                     8453 non-null   object 
 1   CRM_PID_Value_Segment   8448 non-null   object 
 2   EffectiveSegment        8453 non-null   object 
 3   Billing_ZIP             8451 non-null   float64
 4   KA_name                 8453 non-null   object 
 5   Active_subscribers      8453 non-null   int64  
 6   Not_Active_subscribers  4304 non-null   float64
 7   Suspended_subscribers   352 non-null    float64
 8   Total_SUBs              8453 non-null   int64  
 9   AvgMobileRevenue        8453 non-null   float64
 10  AvgFIXRevenue           8453 non-null   float64
 11  TotalRevenue            8453 non-null   float64
 12  ARPU                    8452 non-null   float64
 13  CHURN                   8453 non-null   object 
dtypes: float64(7), int64(2), object(5)
memor

In [4]:
df.describe()

| | Billing_ZIP | Active_subscribers | Not_Active_subscribers | Suspended_subscribers | Total_SUBs | AvgMobileRevenue | AvgFIXRevenue | TotalRevenue | ARPU |
|---|---|---|---|---|---|---|---|---|---|
| count | 8451.0 | 8453.0 | 4304.0 | 352.0 | 8453.0 | 8453.0 | 8453.0 | 8453.0 | 8452.0 |
| mean | 4879.727725 | 7.774636 | 4.163336 | 1.576705 | 9.960132 | 148.011956 | 0.821185 | 148.833141 | 24.441789 |
| std | 1061.095394 | 6.680524 | 9.462847 | 1.979905 | 10.246648 | 102.570539 | 11.73788 | 103.250779 | 22.820585 |
| min | 1000.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4.67 | 0.0 |
| 25% | 4003.0 | 4.0 | 1.0 | 1.0 | 5.0 | 71.5 | 0.0 | 71.83 | 14.07 |
| 50% | 4400.0 | 6.0 | 2.0 | 1.0 | 7.0 | 113.17 | 0.0 | 113.67 | 19.315 |
| 75% | 6000.0 | 10.0 | 4.0 | 1.0 | 12.0 | 191.17 | 0.0 | 192.33 | 27.255 |
| max | 9644.0 | 110.0 | 214.0 | 22.0 | 235.0 | 499.83 | 480.5 | 499.83 | 462.83 |


In [5]:
df.isna().sum()

PID                          0
CRM_PID_Value_Segment        5
EffectiveSegment             0
Billing_ZIP                  2
KA_name                      0
Active_subscribers           0
Not_Active_subscribers    4149
Suspended_subscribers     8101
Total_SUBs                   0
AvgMobileRevenue             0
AvgFIXRevenue                0
TotalRevenue                 0
ARPU                         1
CHURN                        0
dtype: int64
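
Most of the missing values sit in `Not_Active_subscribers` and `Suspended_subscribers`. Both columns have a minimum of 1 in `df.describe()`, and in the preview rows `Total_SUBs` equals the sum of the three subscriber counts when a missing value is read as 0. That suggests, but does not prove, that a missing count means "none". A minimal sketch of that imputation on the preview rows, with the zero-fill treated as an assumption to verify on the full data (`sample`, `filled`, and `totals_match` are illustrative names):

```python
import numpy as np
import pandas as pd

# Subscriber columns from the five preview rows.
sample = pd.DataFrame({
    "Active_subscribers": [2, 3, 2, 1, 2],
    "Not_Active_subscribers": [np.nan, np.nan, 3.0, 2.0, 1.0],
    "Suspended_subscribers": [np.nan] * 5,
    "Total_SUBs": [2, 3, 5, 3, 3],
})

# Assumption: a missing count means zero subscribers in that state.
count_cols = ["Not_Active_subscribers", "Suspended_subscribers"]
filled = sample.copy()
filled[count_cols] = filled[count_cols].fillna(0).astype(int)

# Sanity check: the three states should add up to Total_SUBs.
parts = (
    filled["Active_subscribers"]
    + filled["Not_Active_subscribers"]
    + filled["Suspended_subscribers"]
)
totals_match = (parts == filled["Total_SUBs"]).all()
print(totals_match)
```

If the same check fails for some rows of the full `df`, those rows are worth inspecting before the zero-fill is applied everywhere.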

In [7]:
df['CHURN'].value_counts(normalize=True)

CHURN
No     0.935053
Yes    0.064947
Name: proportion, dtype: float64
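
Only about 6.5% of accounts are labeled `Yes`, so the classes are heavily imbalanced: a model that always predicts `No` would already reach roughly 93.5% accuracy. One common way to keep the churn rate comparable between training and test sets is a stratified split. The sketch below uses synthetic stand-in data (the `demo` frame and its values are made up for illustration) and the `train_test_split` function already imported above; the 80/20 split and `random_state` are arbitrary choices, not the author's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 6.5% churn rate, similar to the real dataset.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "TotalRevenue": rng.uniform(5, 500, size=n),
    "CHURN": np.where(np.arange(n) < 65, "Yes", "No"),
})

X = demo[["TotalRevenue"]]
y = (demo["CHURN"] == "Yes").astype(int)  # encode Yes/No as 1/0

# stratify=y keeps the proportion of churners the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train churn rate: {y_train.mean():.3f}")
print(f"test churn rate:  {y_test.mean():.3f}")
```

With this level of imbalance, metrics such as recall, precision, or ROC AUC are usually more informative than plain accuracy when the models are evaluated later.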