# **Case Study: Customer Churn Prediction**

You are working as a data analyst at a subscription-based company.  
The company has noticed that many customers cancel their subscriptions and wants to understand **who is likely to churn and why**.

You are given historical customer data containing subscription details, usage behavior, and a label indicating whether the customer churned or not.

Your task is to use **logistic regression** to predict whether a customer will churn.

Run the code cell below to access the dataset from the Finance & Accounting Team.

In [3]:
import kagglehub
import os

path = kagglehub.dataset_download("safrin03/predictive-analytics-for-customer-churn-dataset")
os.listdir(path)

Using Colab cache for faster access to the 'predictive-analytics-for-customer-churn-dataset' dataset.


['data_descriptions.csv',
 '.nfs00000000147fed0400000908',
 'train.csv',
 'test.csv']

Let's have a look at the dataset.

In [4]:
import pandas as pd

df = pd.read_csv(os.path.join(path, "train.csv"))

# Print the first 5 rows of the dataset
df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.39516,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


Let's have a look at the features we have.

In [5]:
# Print the names of all the columns in the dataset

df.columns


Index(['AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType',
       'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess',
       'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration',
       'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating',
       'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl',
       'SubtitlesEnabled', 'CustomerID', 'Churn'],
      dtype='object')

# Data can also have problems!

After reviewing the data, the **Finance & Accounting team** reports that some customer attributes were incorrectly captured due to issues in the billing and content-logging systems.

As a result, you are instructed to **remove the following columns from the dataset and not use them for your analysis**:
- `PaymentMethod`
- `PaperlessBilling`
- `DeviceRegistered`
- `GenrePreference`

Make sure these columns are **dropped in place** (set `inplace=True` when you call the `drop` method).

In [6]:
# Drop the columns flagged by the Finance & Accounting team
dropcols = ["PaymentMethod", "PaperlessBilling", "DeviceRegistered", "GenrePreference"]
df.drop(dropcols, axis=1, inplace=True)

Before training the model, review all remaining features and ask yourself:

- Does this variable contain information about customer behavior, or is it only an identifier?
- Would knowing this value help predict churn for a **new** customer?

Identify any column(s) that act only as identifiers and drop them in place as well.
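
One rough way to spot identifier-like columns is to look for columns in which no value ever repeats. The sketch below uses a small made-up table (the rows are hypothetical, not taken from the churn dataset). Treat it as a heuristic only: a continuous measurement can also be unique in every row, so the final decision still needs your judgment.

```python
import pandas as pd

# Toy data for illustration only (hypothetical rows)
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3", "D4"],
    "SubscriptionType": ["Basic", "Premium", "Basic", "Standard"],
    "Churn": [0, 1, 0, 0],
})

# A column whose number of distinct values equals the number of rows
# behaves like an identifier: no value ever repeats
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```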

In [7]:
# CustomerID only identifies a customer, so it carries no predictive information
dropcols = ["CustomerID"]
df.drop(dropcols, axis=1, inplace=True)

Check if there are any missing values in the dataset.

In [8]:
# Check for NA values
df.isna().sum()

Unnamed: 0,0
AccountAge,0
MonthlyCharges,0
TotalCharges,0
SubscriptionType,0
ContentType,0
MultiDeviceAccess,0
ViewingHoursPerWeek,0
AverageViewingDuration,0
ContentDownloadsPerMonth,0
UserRating,0


# **Encoding Categorical Variables for Modeling**

During data preparation, you notice that several features in the dataset are **categorical** (they contain labels instead of numbers).  
Since logistic regression works only with numerical inputs, these variables must be **encoded** before modeling.



### Why Encoding Is Important
- Machine learning models cannot interpret text values such as *“Basic”* or *“Yes”*
- Encoding converts categories into numbers **without introducing false meaning**
- Poor encoding can mislead the model and hurt predictions



### Encoding Rules to Follow

1. **Columns with exactly two categories**  
   Encode them using **0 / 1**  
   - Example: `No → 0`, `Yes → 1`

2. **Columns with more than two categories**  
   Use **One-Hot Encoding**, then **drop the original column**



### What Is One-Hot Encoding?

One-hot encoding represents each category as its **own binary column**.

Instead of giving categories numbers (which can imply order), it creates **one column per category**.

Example:

SubscriptionType can be Basic, Standard, or Premium.

Create 3 separate columns: SubscriptionTypeBasic, SubscriptionTypeStandard, SubscriptionTypePremium.

If SubscriptionType is Basic for a row, set SubscriptionTypeBasic = 1, SubscriptionTypeStandard = 0, and SubscriptionTypePremium = 0 for that row. Do the same for every row, then drop the SubscriptionType column in place.
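
As a minimal sketch of the idea, the snippet below one-hot encodes a tiny made-up column with pandas. The toy rows are hypothetical, and `prefix_sep=""` and `dtype=int` are choices made here only so that the result matches the column names and 0 / 1 values described above.

```python
import pandas as pd

# Toy data for illustration only (not taken from the churn dataset)
toy = pd.DataFrame({"SubscriptionType": ["Basic", "Premium", "Standard", "Basic"]})

# get_dummies creates one column per category and removes the original column.
# prefix_sep="" gives names like SubscriptionTypeBasic; dtype=int gives 0 / 1 values.
encoded = pd.get_dummies(toy, columns=["SubscriptionType"], prefix_sep="", dtype=int)

print(encoded)
```

Each row ends up with exactly one 1 across the three new columns, because the categories are mutually exclusive.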



# **Hint from the Analytics Team**

One of the categorical columns appears to have three values.  
Before applying one-hot encoding, think carefully about what these values mean.

Ask yourself:
- Are these three values mutually exclusive choices?
- Or do they represent **different combinations of the same underlying options**?

Do you really need three separate columns for that feature, or would fewer be sufficient?

In [9]:
# Encode the 6 categorical features
# For binary features, replace the text values with 0 / 1
# For more than 2 categories, create separate binary columns for each category and drop the original column

# Binary encoding (0/1) for features with two categories
df['MultiDeviceAccess'] = df['MultiDeviceAccess'].map({'Yes': 1, 'No': 0})
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['ParentalControl'] = df['ParentalControl'].map({'Yes': 1, 'No': 0})
df['SubtitlesEnabled'] = df['SubtitlesEnabled'].map({'Yes': 1, 'No': 0})

# Special handling for ContentType
df['ContentType_Movies'] = df['ContentType'].apply(lambda x: 1 if x == 'Movies' or x == 'Both' else 0)
df['ContentType_TVShows'] = df['ContentType'].apply(lambda x: 1 if x == 'TV Shows' or x == 'Both' else 0)
df.drop('ContentType', axis=1, inplace=True)

# One-Hot Encoding for SubscriptionType
df = pd.get_dummies(df, columns=['SubscriptionType'], drop_first=False)

# get_dummies names the new columns 'SubscriptionType_<category>';
# rename them to match the names used in the one-hot encoding example
df.rename(columns={'SubscriptionType_Basic': 'SubscriptionTypeBasic',
                   'SubscriptionType_Premium': 'SubscriptionTypePremium',
                   'SubscriptionType_Standard': 'SubscriptionTypeStandard'}, inplace=True)

# Convert the True / False dummy columns to 1 / 0 integers
subscription_cols = ['SubscriptionTypeBasic', 'SubscriptionTypePremium', 'SubscriptionTypeStandard']
df[subscription_cols] = df[subscription_cols].astype(int)
display(df.head())

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,Churn,ContentType_Movies,ContentType_TVShows,SubscriptionTypeBasic,SubscriptionTypePremium,SubscriptionTypeStandard
0,20,11.055215,221.104302,0,36.758104,63.531377,10,2.176498,4,1,3,0,0,0,1,1,0,1,0
1,57,5.175208,294.986882,0,32.450568,25.725595,18,3.478632,8,1,23,0,1,0,1,0,1,0,0
2,73,12.106657,883.785952,0,7.39516,57.364061,23,4.238824,6,1,1,1,1,0,1,0,1,0,0
3,32,7.263743,232.439774,0,27.960389,131.537507,30,4.276013,2,1,24,1,1,0,0,1,1,0,0
4,57,16.953078,966.325422,0,20.083397,45.356653,20,3.61617,4,0,0,0,0,0,0,1,0,1,0


# **What is Scikit-learn?**

**Scikit-learn** is a Python library that helps you build machine learning models easily.  
Instead of coding all the mathematical formulas yourself, scikit-learn provides reliable, ready-made implementations.

It offers tools to:
- Prepare data (splitting, scaling, encoding, etc.)
- Train machine learning models
- Evaluate how well models perform
- Avoid common mistakes in machine learning workflows

Scikit-learn is one of the most widely used machine learning libraries.  
Going forward, we will use scikit-learn step by step to complete our task.



# **Preparing Data for Modeling**

1. Create
   - **x** → all feature columns used for prediction (a new DataFrame identical to the current `df`, but without the `Churn` column)
   - **y** → the target column


2. Once x and y are defined, split them into training and test sets using scikit-learn's `train_test_split` function
- **Training set** - used to train the model
- **Test set** - used to evaluate the model on unseen data

Follow these rules:
- Keep **20%** of the data for testing
- The result should give you **four outputs**:
  - training features
  - test features
  - training labels
  - test labels

Since this is your first time using scikit-learn, the splitting code is provided. Make sure you understand the inputs and outputs of `train_test_split`.
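
To see the inputs and outputs in isolation, here is a small sketch on made-up arrays (`x_toy` and `y_toy` are hypothetical names and values, unrelated to the churn data). With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy inputs: 10 rows, 2 feature columns, and one label per row
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Inputs: features, labels, the test fraction, and a seed for reproducibility
# Outputs (in this order): train features, test features, train labels, test labels
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, random_state=0)

print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)
# (8, 2) (2, 2) (8,) (2,)
```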


In [10]:
y = df["Churn"]
x = df.drop(["Churn"], axis=1)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# **A Note from the Statistics Team**

Before training the churn prediction model, the statisticians review your data and point out an important issue.

They observe that some numerical features have **much larger values** than others. These large-scale features could influence the model more than other features.

To avoid this, you are asked to **normalize all continuous numerical features** so that they are on a similar scale.  
Use **standardization**, which adjusts each feature to have:
- a mean of 0  
- a standard deviation of 1  

The statisticians also clarify that **binary features (encoded as 0 and 1)** should **not** be normalized. These already represent simple yes/no information and work correctly without scaling.

You'll use scikit-learn's StandardScaler for this task.
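
Standardization replaces each value x with (x - mean) / standard deviation. The sketch below checks this on a tiny made-up column (the numbers are hypothetical). Note that the mean and standard deviation are learned from the training data only and then reused for the test data, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data for illustration only
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the training mean and std

# The same calculation done by hand (StandardScaler uses the population std, ddof=0)
manual = (train - train.mean()) / train.std()

print(train_scaled.ravel())
print(test_scaled.ravel())
print(np.allclose(train_scaled, manual))
```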

In [17]:
print(x)
# Continuous features to standardize; binary 0/1 columns are left unscaled.
# Note: TotalCharges is also continuous but is not in this list, so it stays unscaled here.
continuous_cols = ["AccountAge", "MonthlyCharges", "ViewingHoursPerWeek", "AverageViewingDuration",
                   "ContentDownloadsPerMonth", "UserRating", "SupportTicketsPerMonth", "WatchlistSize"]

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to the test data
x_train[continuous_cols] = scaler.fit_transform(x_train[continuous_cols])
x_test[continuous_cols] = scaler.transform(x_test[continuous_cols])


        AccountAge  MonthlyCharges  TotalCharges  MultiDeviceAccess  \
0               20       11.055215    221.104302                  0   
1               57        5.175208    294.986882                  0   
2               73       12.106657    883.785952                  0   
3               32        7.263743    232.439774                  0   
4               57       16.953078    966.325422                  0   
...            ...             ...           ...                ...   
243782          77        9.639902    742.272460                  0   
243783         117       13.049257   1526.763053                  1   
243784         113       14.514569   1640.146267                  0   
243785           7       18.140555    126.983887                  0   
243786          90       11.593774   1043.439704                  0   

        ViewingHoursPerWeek  AverageViewingDuration  ContentDownloadsPerMonth  \
0                 36.758104               63.531377               

# **Imbalanced data**

First, check the number of samples in each output class (churn vs non-churn).  
Determine whether there is a **severe class imbalance**.

If a strong imbalance exists, apply **SMOTE** to balance the classes, as discussed in the previous session.

You're given the code for SMOTE since it's your first time, but carefully go through it and understand the inputs and outputs.
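
To build some intuition first, the sketch below shows the core idea behind SMOTE in a simplified form: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. This is an illustration written for this notebook, not the imbalanced-learn implementation; the helper `synthetic_point` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points (hypothetical, two features each)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthetic_point(points, k=2):
    """Create one synthetic sample by interpolating towards a near neighbour."""
    base = points[rng.integers(len(points))]
    # Distances from the chosen point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours, skipping the point itself (distance 0)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random position on the segment between base and neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

new_points = np.array([synthetic_point(minority) for _ in range(5)])
print(new_points)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies.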

In [18]:
# Print the counts of each output class in the training data
print(y_train.value_counts())

Churn
0    159637
1     35392
Name: count, dtype: int64


In [19]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)
# Print the counts of each output class again after SMOTE
print(y_train_resampled.value_counts())

Churn
0    159637
1    159637
Name: count, dtype: int64


# Time to train the model, finally!

In [22]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Use the model.fit() method; look up which arguments it takes
# Make sure you pass resampled data i.e. the data after SMOTE
model.fit(x_train_resampled, y_train_resampled)

# Test your model on the testing data

In [23]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Predict churn labels for the held-out test set
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.7551786373518192
Confusion Matrix:
 [[32874  7094]
 [ 4843  3947]]
Precision: 0.35748573498777286
Recall: 0.449032992036405
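
As a quick check, the three scores can be recomputed by hand from the confusion matrix. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The variable names below are chosen for this sketch only.

```python
# Counts copied from the confusion matrix in the output
tn, fp = 32874, 7094
fn, tp = 4843, 3947

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were caught

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```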


# A bit of consulting, since it's an ICG project

Now take some time to think about which metrics matter most to your business. After this churn analysis, the business will try to retain likely churners by contacting them or offering them incentives. False negatives are churners the model missed, so you lose those customers. False positives are non-churners the model flagged as churners, so the business spends extra money contacting them.

It's up to your business to decide which is more significant: the revenue lost when churners leave, or the cost of contacting customers who would have stayed anyway. That's the precision-recall tradeoff we discussed.
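
One way to make this comparison concrete is to attach a cost to each type of error. The sketch below uses the false positive and false negative counts from the confusion matrix reported earlier, but the two cost figures are invented purely for illustration; replace them with your own business estimates.

```python
# Counts from the confusion matrix reported earlier in this notebook
fp, fn = 7094, 4843

# Hypothetical business figures (assumptions for illustration only)
cost_per_contact = 5.0          # cost of an offer sent to a customer who would have stayed
loss_per_missed_churner = 50.0  # revenue lost when a churner is not flagged

fp_cost = fp * cost_per_contact
fn_cost = fn * loss_per_missed_churner

print("Cost of false positives:", fp_cost)
print("Cost of false negatives:", fn_cost)
print("Prioritise recall" if fn_cost > fp_cost else "Prioritise precision")
```

Under these assumed figures, missed churners cost far more than unnecessary contacts, which would argue for favouring recall. With different cost estimates the conclusion can flip.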