# Telco Customer Churn ML Pipeline

This notebook implements a complete machine learning pipeline for predicting customer churn in a telecommunications company. The pipeline includes data loading, preprocessing, model training, and evaluation.

## 1. Import Required Libraries

Import essential Python libraries for data manipulation, visualization, and machine learning model development.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score
import joblib
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load Dataset

Load the Telco Customer Churn dataset from the datasets folder on the mounted Google Drive into a pandas DataFrame.

In [3]:
import os
os.chdir('/content/drive/MyDrive/datasets')
print(os.getcwd())


/content/drive/MyDrive/datasets


In [4]:
import pandas as pd

df = pd.read_csv('Telco-Customer-Churn.csv')
df.head()


,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## 3. Dataset Overview

Display fundamental information about the dataset structure, dimensions, and basic statistics.

In [5]:
# Display dataset shape
print("=" * 80)
print("DATASET SHAPE")
print("=" * 80)
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print()

DATASET SHAPE
Number of rows: 7043
Number of columns: 21



In [6]:
# Display first few rows
print("=" * 80)
print("FIRST 5 ROWS")
print("=" * 80)
print(df.head())

FIRST 5 ROWS
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        

In [7]:
# Display dataset information
print("\n" + "=" * 80)
print("DATASET INFORMATION")
print("=" * 80)
df.info()


DATASET INFORMATION
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  704

In [8]:
# Display statistical summary
print("\n" + "=" * 80)
print("STATISTICAL SUMMARY")
print("=" * 80)
print(df.describe())


STATISTICAL SUMMARY
       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


## Handling Missing Values

The Telco dataset contains hidden missing values stored as blank strings (" ")
instead of proper NaN values. Neither isnull() nor scikit-learn's SimpleImputer
(with its default settings) treats a blank string as missing, so these entries
go undetected.

Therefore, we first convert all blank entries into NaN so that the preprocessing
pipeline (SimpleImputer) can handle them correctly.
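As a small illustration of the point above, the sketch below builds a toy DataFrame (the values are hypothetical, not taken from the dataset) and compares an exact-match replacement with a regex-based one. The regex variant is an assumption about what might be useful if blanks also appear as empty strings or runs of several spaces; the notebook itself only replaces the single-space string.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: three different kinds of "hidden" blanks.
toy = pd.DataFrame({
    "tenure": [1, 0, 0, 5],
    "TotalCharges": ["29.85", " ", "", "   "],
})

# Blank strings are ordinary values, so nothing is reported as missing yet.
missing_before = int(toy["TotalCharges"].isnull().sum())

# Exact-match replacement catches only the single-space entry.
exact = toy.replace(" ", np.nan)
missing_exact = int(exact["TotalCharges"].isnull().sum())

# A regex replacement also catches empty and multi-space strings.
pattern = toy.replace(r"^\s*$", np.nan, regex=True)
missing_pattern = int(pattern["TotalCharges"].isnull().sum())

print(missing_before, missing_exact, missing_pattern)
```

If the exact-match count and the regex count differ on real data, some blanks are not single spaces and the regex form is the safer choice.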


In [9]:
import numpy as np

# check missing values
print("Missing values before cleaning:")
print(df.isnull().sum())

# replace blank strings with NaN
df = df.replace(" ", np.nan)

print("\nMissing values after replacing blanks:")
print(df.isnull().sum())


Missing values before cleaning:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

Missing values after replacing blanks:
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges    

## Converting TotalCharges to Numeric

The 'TotalCharges' column is stored as an object (string) datatype because of
the blank values it contains. Machine learning models require numeric input.

We convert this column to a numeric datatype with errors='coerce', so any value
that cannot be parsed becomes NaN, which will later be handled by the imputer.


In [10]:
# convert to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# verify datatype
print(df.dtypes['TotalCharges'])

# check new missing values created
print("\nMissing values in TotalCharges:")
print(df['TotalCharges'].isnull().sum())


float64

Missing values in TotalCharges:
11
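To see which entries a coercing conversion turns into NaN, it can help to keep the raw strings next to the converted values. The sketch below does this on a toy Series with hypothetical values; the same masking idea could be applied to the real column before it is overwritten.

```python
import pandas as pd

# Hypothetical raw values; one entry is a blank string that cannot be parsed.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Rows that failed conversion: NaN after conversion, but present in the raw data.
failed = raw[converted.isna() & raw.notna()]

print(converted.isna().sum())
print(failed.tolist())
```

Inspecting the failed rows before imputing helps confirm that the NaN values come from blanks rather than from malformed numbers that deserve a different fix.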


## Feature–Target Separation

To train a machine learning model, we separate the dataset into:

X (features): all input variables describing a customer  
y (target): the variable we want to predict (Churn)

The model will learn the relationship between X and y.
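A minimal sketch of this separation on a toy DataFrame (hypothetical rows) is shown below. It goes one step beyond the notebook cell by also dropping the identifier column and mapping the target to 0/1. Both steps are assumptions about what a later modelling stage may want, not something the notebook does.

```python
import pandas as pd

# Toy data with hypothetical customer IDs and values.
toy = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# Features: drop the target and the identifier, which is unique per row
# and therefore carries no generalizable signal.
X_toy = toy.drop(columns=["Churn", "customerID"])

# Target: encode churn as 1 and non-churn as 0.
y_toy = toy["Churn"].map({"No": 0, "Yes": 1})

print(X_toy.columns.tolist())
print(y_toy.tolist())
```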


In [11]:
# features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

print("Feature shape:", X.shape)
print("Target shape:", y.shape)

Feature shape: (7043, 20)
Target shape: (7043,)


## 8. Split Dataset into Training and Testing Sets

Split the data into training (80%) and testing (20%) sets using train_test_split, with a fixed random state for reproducibility and stratification on y so that both sets keep the same class distribution.

In [12]:
# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Maintain class distribution
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nTraining features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"\nClass distribution in training set:")
print(y_train.value_counts())
print(f"\nClass distribution in testing set:")
print(y_test.value_counts())

Training set size: 5634 samples (80.0%)
Testing set size: 1409 samples (20.0%)

Training features shape: (5634, 20)
Testing features shape: (1409, 20)

Class distribution in training set:
Churn
No     4139
Yes    1495
Name: count, dtype: int64

Class distribution in testing set:
Churn
No     1035
Yes     374
Name: count, dtype: int64
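The class counts printed above can be used to check that stratification worked. The short sketch below recomputes the churn rate in each split from those counts; the helper name churn_rate is introduced here for illustration only.

```python
# Class counts copied from the split output above.
train_counts = {"No": 4139, "Yes": 1495}
test_counts = {"No": 1035, "Yes": 374}


def churn_rate(counts):
    """Return the fraction of customers labelled 'Yes'."""
    return counts["Yes"] / (counts["No"] + counts["Yes"])


train_rate = churn_rate(train_counts)
test_rate = churn_rate(test_counts)

print(f"Train churn rate: {train_rate:.3f}")
print(f"Test churn rate:  {test_rate:.3f}")
```

Both rates come out at roughly 26.5%, so the split preserved the class balance. It also shows that the classes are imbalanced, which is worth keeping in mind when reading accuracy figures later.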


## 9. Automatically Detect Categorical and Numerical Feature Columns

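Separate features into categorical and numerical columns based on their data types for appropriate preprocessing. Note that customerID is detected as a categorical feature because it is a string column, but it is a unique identifier rather than a descriptive attribute and should be excluded before encoding.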
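One way the detected column lists might be used is to route each group to its own transformer. The sketch below does this on a toy DataFrame with hypothetical rows. It uses include="number" and exclude="number" rather than explicit dtype names, which is an assumption made here so that the selection does not depend on exact integer or string dtypes. The choice of OneHotEncoder with handle_unknown="ignore" is likewise an illustrative assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features with hypothetical rows: two numeric columns, one categorical.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

# Detect column groups by dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Route each group to a transformer suited to its type.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

encoded = preprocessor.fit_transform(toy)
print(num_cols, cat_cols, encoded.shape)
```

Here the two numeric columns stay as two scaled columns and the two contract types become two indicator columns, giving four output columns in total.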

In [13]:
# Automatically detect categorical and numerical columns
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical features ({len(numerical_features)}):")
print(numerical_features)
print(f"\nCategorical features ({len(categorical_features)}):")
print(categorical_features)

# Display data types
print(f"\nData types in training set:")
print(X_train.dtypes)
print(f"\nShape of training features: {X_train.shape}")
print(f"Number of numerical features: {len(numerical_features)}")
print(f"Number of categorical features: {len(categorical_features)}")

Numerical features (4):
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

Categorical features (16):
['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

Data types in training set:
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
dtype: object

Shape of training features: 

## 10. Create Numerical Preprocessing Pipeline

Create a preprocessing pipeline for numerical features using SimpleImputer (to fill missing values with the median) and StandardScaler (to standardize each feature to zero mean and unit variance).
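The effect of the two steps can be checked on a tiny example. The sketch below applies the same imputer and scaler configuration to a single toy column (hypothetical values) that contains one missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One toy feature column with a missing value (hypothetical numbers).
toy = np.array([[1.0], [2.0], [np.nan], [9.0]])

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scaler", StandardScaler()),                   # rescale to mean 0, std 1
])

out = pipe.fit_transform(toy)

# The median learned from the observed values 1, 2 and 9.
learned_median = pipe.named_steps["imputer"].statistics_[0]

print(learned_median)
print(out.ravel())
```

The imputer learns its median from the data passed to fit, so on the real data the pipeline should be fitted on the training set only and then applied to the test set with transform.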

In [None]:
# Import preprocessing tools
# Note: SimpleImputer lives in sklearn.impute, not sklearn.preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create numerical preprocessing pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing values with median
    ('scaler', StandardScaler())  # Standardize features (mean=0, std=1)
])

print("Numerical Preprocessing Pipeline created:")
print(numerical_pipeline)
print(f"\nThis pipeline will be applied to {len(numerical_features)} numerical features:")
print(numerical_features)
print("\nPipeline steps:")
print("  1. SimpleImputer: Handles missing values using median strategy")
print("  2. StandardScaler: Standardizes features to have mean=0 and standard deviation=1")
