# **Case Study: Customer Churn Prediction**

You are working as a data analyst at a subscription-based company.  
The company has noticed that many customers cancel their subscriptions and wants to understand **who is likely to churn and why**.

You are given historical customer data containing subscription details, usage behavior, and a label indicating whether the customer churned or not.

Your task is to use **logistic regression** to predict whether a customer will churn.

Run the code cell below to access the dataset from the Finance & Accounting Team.

In [None]:
import kagglehub
import os

path = kagglehub.dataset_download("safrin03/predictive-analytics-for-customer-churn-dataset")
os.listdir(path)

Let's have a look at the dataset.

In [None]:
import pandas as pd

df = pd.read_csv(os.path.join(path, "train.csv"))

# Print the first 5 rows of the dataset
print(df.head())

Let's have a look at the features we have.

In [None]:
# Print the names of all the columns in the dataset
print(df.columns)

# Data can also have problems!

After reviewing the data, the **Finance & Accounting team** reports that some customer attributes were incorrectly captured due to issues in the billing and content-logging systems.

As a result, you are instructed to **remove the following columns from the dataset and not use them for your analysis**:
- `PaymentMethod`
- `PaperlessBilling`
- `DeviceRegistered`
- `GenrePreference`

Make sure these columns are **dropped in place** (set `inplace=True` when you call the `drop` method).

In [None]:
# Drop the mentioned columns
df.drop(
    ["PaymentMethod", "PaperlessBilling", "DeviceRegistered", "GenrePreference"],
    axis=1,
    inplace=True
)



Before training the model, review all remaining features and ask yourself:

- Does this variable contain information about customer behavior, or is it only an identifier?
- Would knowing this value help predict churn for a **new** customer?

Identify any column(s) that are only identifiers and drop them in place as well.
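
One possible way to shortlist identifier-like columns is to look for columns that hold a different value in every row. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative assumptions, not the real dataset).

```python
import pandas as pd

# Toy data for illustration only; not the real dataset
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3", "D4"],
    "MonthlyCharges": [10.0, 12.5, 10.0, 15.0],
    "Churn": [0, 1, 0, 1],
})

# A column with one distinct value per row behaves like an identifier
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```

Treat the result as a list of candidates only: a continuous numeric column can also be unique in every row and still carry useful information.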

In [None]:
# Drop such columns
df.drop(["CustomerID"],
        axis=1, inplace=True)
print(df.columns)

Check if there are any missing values in the dataset.

In [None]:
# Check for NA values
print(df.isna().any())

# **Encoding Categorical Variables for Modeling**

During data preparation, you notice that several features in the dataset are **categorical** (they contain labels instead of numbers).  
Since logistic regression works only with numerical inputs, these variables must be **encoded** before modeling.



### Why Encoding Is Important
- Machine learning models cannot interpret text values such as *“Basic”* or *“Yes”*
- Encoding converts categories into numbers **without introducing false meaning**
- Poor encoding can mislead the model and hurt predictions



### Encoding Rules to Follow

1. **Columns with exactly two categories**  
   Encode them using **0 / 1**  
   - Example: `No → 0`, `Yes → 1`

2. **Columns with more than two categories**  
   Use **One-Hot Encoding**, then **drop the original column**



### What Is One-Hot Encoding?

One-hot encoding represents each category as its **own binary column**.

Instead of giving categories numbers (which can imply order), it creates **one column per category**.

Example:

`SubscriptionType` can be `Basic`, `Standard`, or `Premium`.

Create 3 separate columns: `SubscriptionTypeBasic`, `SubscriptionTypeStandard`, and `SubscriptionTypePremium`.

If `SubscriptionType` is `Basic` for some row, then for that row set `SubscriptionTypeBasic = 1`, `SubscriptionTypeStandard = 0`, and `SubscriptionTypePremium = 0`. Do the same for every row, then drop the `SubscriptionType` column in place.
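
The idea above can be sketched on a tiny made-up column using `pd.get_dummies`. The `toy` frame and its values are assumptions for illustration. Note that with a `prefix`, pandas joins the prefix and category with an underscore, so the generated names look like `SubscriptionType_Basic`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"SubscriptionType": ["Basic", "Premium", "Standard", "Basic"]})

# One 0/1 column per category; dtype=int gives integers instead of booleans
dummies = pd.get_dummies(toy["SubscriptionType"], prefix="SubscriptionType", dtype=int)

# Replace the original text column with the new binary columns
toy = pd.concat([toy.drop(columns="SubscriptionType"), dummies], axis=1)
print(toy)
```

Each row has exactly one `1` across the three new columns, which is what "one-hot" refers to.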



# **Hint from the Analytics Team**

One of the categorical columns appears to have three values.  
Before applying one-hot encoding, think carefully about what these values mean.

Ask yourself:
- Are these three values mutually exclusive choices?
- Or do they represent **different combinations of the same underlying options**?

Do you really need three separate columns for that feature, or would fewer be sufficient?
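
As a rough sketch of what "fewer columns" could look like, suppose the column in question is `ContentType` and that its three values are `Movies`, `TV Shows`, and `Both`. These values, and the new column names `WatchesMovies` and `WatchesTVShows`, are assumptions made for illustration; check the actual values with `df["ContentType"].unique()` first.

```python
import pandas as pd

# Toy data for illustration only; the category values are assumed
toy = pd.DataFrame({"ContentType": ["Movies", "TV Shows", "Both", "Movies"]})

# "Both" is a combination of the two underlying options,
# so two binary columns can describe all three values
toy["WatchesMovies"] = toy["ContentType"].isin(["Movies", "Both"]).astype(int)
toy["WatchesTVShows"] = toy["ContentType"].isin(["TV Shows", "Both"]).astype(int)

toy = toy.drop(columns="ContentType")
print(toy)
```

Under that assumption, a row with `Both` becomes `1, 1` rather than needing its own third column.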

In [None]:
# Encode the 6 categorical features
# For binary features, replace the text values with 0 / 1
# For features with more than 2 categories, create a separate binary column for each category and drop the original column
binary_cols = ["MultiDeviceAccess","ParentalControl","SubtitlesEnabled"]
for col in binary_cols:
    df[col] = df[col].map({"No": 0, "Yes": 1})

df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df.isna().sum())  # Verify the mappings produced no NaNs (unmapped values become NaN)

subscription_dummies = pd.get_dummies(
    df["SubscriptionType"],
    prefix="SubscriptionType"
)

df = pd.concat([df, subscription_dummies], axis=1)
df.drop("SubscriptionType", axis=1, inplace=True)

content_dummies = pd.get_dummies(
    df["ContentType"],
    prefix="ContentType"
)

df = pd.concat([df, content_dummies], axis=1)
df.drop("ContentType", axis=1, inplace=True)


# **What is Scikit-learn?**

**Scikit-learn** is a Python library that helps you build machine learning models easily.  
Instead of coding all the mathematical formulas yourself, scikit-learn provides reliable, ready-made implementations.

It offers tools to:
- Prepare data (splitting, scaling, encoding, etc.)
- Train machine learning models
- Evaluate how well models perform
- Avoid common mistakes in machine learning workflows

Scikit-learn is one of the most widely used machine learning libraries.  
Going forward, we will use scikit-learn step by step to complete our task.



# **Preparing Data for Modeling**

1. Create
   - **x** → all feature columns used for prediction (a new DataFrame identical to the current `df`, but without the `Churn` column)
   - **y** → the target column


2. Once `x` and `y` are defined, split them into training and test sets using scikit-learn's `train_test_split` function
   - **Training set** - used to train the model
   - **Test set** - used to evaluate the model on unseen data

Follow these rules:
- Keep **20%** of the data for testing
- The result should give you **four outputs**:
  - training features
  - test features
  - training labels
  - test labels

Since this is your first time using scikit-learn, the splitting code is provided below. Make sure you understand the inputs and outputs of `train_test_split`.
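
To see the four outputs in isolation, here is a minimal sketch on made-up data. The `toy_x` and `toy_y` objects and the `tx_`/`ty_` names are illustrative assumptions, chosen so they do not clash with the notebook's own variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels for illustration only: 10 rows
toy_x = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series([0, 1] * 5, name="label")

# Inputs: features, labels, share of rows held out, and a seed for reproducibility
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing
print(tx_train.shape, tx_test.shape, ty_train.shape, ty_test.shape)
```

The features and labels are split with the same row selection, so each training row keeps its matching label.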


In [None]:
x = df.drop("Churn", axis=1)  # entire df excluding the Churn column
y = df["Churn"]  # the Churn column

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# **A Note from the Statistics Team**

Before training the churn prediction model, the statisticians review your data and point out an important issue.

They observe that some numerical features have **much larger values** than others. These large-scale features could influence the model more than other features.

To avoid this, you are asked to **normalize all continuous numerical features** so that they are on a similar scale.  
Use **standardization**, which adjusts each feature to have:
- a mean of 0  
- a standard deviation of 1  

The statisticians also clarify that **binary features (encoded as 0 and 1)** should **not** be normalized. These already represent simple yes/no information and work correctly without scaling.

You'll use scikit-learn's StandardScaler for this task.
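
Standardization replaces each value with (value - mean) / standard deviation. The sketch below, on made-up numbers, checks that `StandardScaler` matches this formula and shows that the test data is transformed with the mean and standard deviation learned from the training data. The names `toy_train`, `toy_test`, and `toy_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data for illustration only
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training data
test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean and std

# Manual version of the same formula (population standard deviation)
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(train_scaled, manual))
```

Fitting only on the training set keeps information about the test set out of the preprocessing step.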

In [None]:
continuous_cols = ["AccountAge", "MonthlyCharges", ] # Fill the rest as well

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

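# Fit the scaler on the training data only, then transform it
x_train[continuous_cols] = scaler.fit_transform(x_train[continuous_cols])
# Apply the already-fitted scaler to x_test (transform only, no refitting)
x_test[continuous_cols] = scaler.transform(x_test[continuous_cols])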

# **Imbalanced data**

First, check the number of samples in each output class (churn vs non-churn).  
Determine whether there is a **severe class imbalance**.

If a strong imbalance exists, apply **SMOTE** to balance the classes, as discussed in the previous session.

You're given the code for SMOTE since it's your first time, but carefully go through it and understand the inputs and outputs.
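
Besides raw counts, class shares and the ratio between the largest and smallest class can make the degree of imbalance easier to judge. This is a minimal sketch on made-up labels (the `toy_labels` series is an assumption for illustration); how large a ratio counts as "severe" is a judgment call.

```python
import pandas as pd

# Toy labels for illustration only: 80 non-churners, 20 churners
toy_labels = pd.Series([0] * 80 + [1] * 20, name="Churn")

counts = toy_labels.value_counts()                # rows per class
shares = toy_labels.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()     # majority size relative to minority size

print(counts.to_dict(), shares.round(2).to_dict(), imbalance_ratio)
```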

In [None]:
# Print the counts of each output class in the training data
print(y_train.value_counts())

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

# Print the counts of each output class again after SMOTE
print(y_train_resampled.value_counts())

# Time to train the model, finally!

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Use the model.fit() method; look up which arguments it takes
# Make sure you pass the resampled data, i.e. the data after SMOTE
model.fit(x_train_resampled, y_train_resampled)

# Test your model on the testing data

In [None]:
y_pred = model.predict(x_test) # use model.predict() method, look up what goes as argument in that method

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred)      # use accuracy_score() method, look up the arguments
conf_matrix = confusion_matrix(y_test, y_pred) # use confusion_matrix() method, look up the arguments
precision = precision_score(y_test, y_pred)    # use precision_score() method, look up the arguments
recall = recall_score(y_test, y_pred)          # use recall_score() method, look up the arguments

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Precision:", precision)
print("Recall:", recall)
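
To connect the confusion matrix to precision and recall, the sketch below recomputes both by hand on made-up labels and compares them with scikit-learn's results. The `toy_true` and `toy_pred` lists are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (1 = churn, 0 = no churn)
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 0, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_precision = tp / (tp + fp)  # of predicted churners, how many really churned
manual_recall = tp / (tp + fn)     # of real churners, how many were caught

print(manual_precision, precision_score(toy_true, toy_pred))
print(manual_recall, recall_score(toy_true, toy_pred))
```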

# A bit of consulting, since it's an ICG project

Now take some time to think about which metrics matter most to your business. After this churn analysis, the business will try to retain likely churners by contacting them or making them offers. False negatives are churners the model missed, so you lose those customers. False positives are non-churners the model classified as churners, so the business spends extra money contacting them.

It's up to your business to decide which matters more: the revenue lost when customers churn, or the cost of marketing to predicted churners. That's the precision-recall tradeoff we discussed.
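
One way to make this tradeoff concrete is to attach a cost to each kind of error and compare decision thresholds. The sketch below uses made-up probabilities and made-up costs; `COST_MISSED_CHURNER`, `COST_CONTACT`, and the `total_cost` helper are hypothetical and would need to come from the business. It also simplifies by ignoring the cost of contacting true churners.

```python
import numpy as np

# Toy values for illustration only
toy_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])                       # 1 = churned
toy_proba = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05])     # predicted churn probability

COST_MISSED_CHURNER = 100.0  # hypothetical revenue lost per false negative
COST_CONTACT = 10.0          # hypothetical cost per false positive


def total_cost(threshold):
    """Total cost of errors when predicting churn for probabilities >= threshold."""
    pred = (toy_proba >= threshold).astype(int)
    fn = int(((toy_true == 1) & (pred == 0)).sum())  # churners we missed
    fp = int(((toy_true == 0) & (pred == 1)).sum())  # non-churners we contacted
    return fn * COST_MISSED_CHURNER + fp * COST_CONTACT


costs = {t: total_cost(t) for t in (0.25, 0.5, 0.75)}
best_threshold = min(costs, key=costs.get)
print(costs, best_threshold)
```

With these assumed costs, missing a churner is far more expensive than an unnecessary contact, so the lowest threshold (higher recall, lower precision) comes out cheapest. For the trained model, the probabilities could be taken from `model.predict_proba(x_test)[:, 1]`.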