# **Case Study: Customer Churn Prediction**

You are working as a data analyst at a subscription-based company.  
The company has noticed that many customers cancel their subscriptions and wants to understand **who is likely to churn and why**.

You are given historical customer data containing subscription details, usage behavior, and a label indicating whether the customer churned or not.

Your task is to use **logistic regression** to predict whether a customer will churn.

Run the code cell below to access the dataset from the Finance & Accounting Team.

In [None]:
import kagglehub
import os

path = kagglehub.dataset_download("safrin03/predictive-analytics-for-customer-churn-dataset")
os.listdir(path)

Using Colab cache for faster access to the 'predictive-analytics-for-customer-churn-dataset' dataset.


['data_descriptions.csv', 'train.csv', 'test.csv']

Let's have a look at the dataset

In [None]:
import pandas as pd

# Churn means to stop doing business with the company

df = pd.read_csv(os.path.join(path, "train.csv"))

# Print the first 5 rows of the dataset
df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.39516,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


In [None]:
df

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.395160,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.616170,4,Female,0,No,No,4LGYPK7VOL,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243782,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,47,Sci-Fi,3.697451,1,Male,8,Yes,No,FBZ38J108Z,0
243783,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,35,Comedy,1.449742,4,Male,20,No,No,W4AO1Y6NAI,0
243784,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,44,Action,4.012217,6,Male,13,Yes,Yes,0H3SWWI7IU,0
243785,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,36,Fantasy,2.135789,7,Female,5,No,Yes,63SJ44RT4A,0


Let's have a look at the features we have.

In [None]:
# Print the names of all the columns in the dataset
list(df.columns)

['AccountAge',
 'MonthlyCharges',
 'TotalCharges',
 'SubscriptionType',
 'PaymentMethod',
 'PaperlessBilling',
 'ContentType',
 'MultiDeviceAccess',
 'DeviceRegistered',
 'ViewingHoursPerWeek',
 'AverageViewingDuration',
 'ContentDownloadsPerMonth',
 'GenrePreference',
 'UserRating',
 'SupportTicketsPerMonth',
 'Gender',
 'WatchlistSize',
 'ParentalControl',
 'SubtitlesEnabled',
 'CustomerID',
 'Churn']

# Data can also have problems!

After reviewing the data, the **Finance & Accounting team** reports that some customer attributes were incorrectly captured due to issues in the billing and content-logging systems.

As a result, you are instructed to **remove the following columns from the dataset and not use them for your analysis**:
- `PaymentMethod`
- `PaperlessBilling`
- `DeviceRegistered`
- `GenrePreference`

Make sure these columns are **dropped in place** (pass `inplace=True` to the `drop` method).

In [None]:
# Drop the mentioned columns
df.drop(["PaymentMethod", "PaperlessBilling", "DeviceRegistered", "GenrePreference"], axis=1, inplace=True)

In [None]:
df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,ContentType,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Both,No,36.758104,63.531377,10,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Movies,No,32.450568,25.725595,18,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Movies,No,7.39516,57.364061,23,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,TV Shows,No,27.960389,131.537507,30,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,TV Shows,No,20.083397,45.356653,20,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


Before training the model, review all remaining features and ask yourself:

- Does this variable contain information about customer behavior, or is it only an identifier?
- Would knowing this value help predict churn for a **new** customer?

Identify any such column(s) and drop them in place as well.

In [None]:
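# Drop such columns
# CustomerID is a pure identifier and carries no behavioural signal
df.drop(["CustomerID"], axis=1, inplace=True)
# Note: SubtitlesEnabled is a behavioural feature rather than an identifier;
# it is also dropped here, and every output below reflects that choice
df.drop(["SubtitlesEnabled"], axis=1, inplace=True)
df.head()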

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,ContentType,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,Churn
0,20,11.055215,221.104302,Premium,Both,No,36.758104,63.531377,10,2.176498,4,Male,3,No,0
1,57,5.175208,294.986882,Basic,Movies,No,32.450568,25.725595,18,3.478632,8,Male,23,No,0
2,73,12.106657,883.785952,Basic,Movies,No,7.39516,57.364061,23,4.238824,6,Male,1,Yes,0
3,32,7.263743,232.439774,Basic,TV Shows,No,27.960389,131.537507,30,4.276013,2,Male,24,Yes,0
4,57,16.953078,966.325422,Premium,TV Shows,No,20.083397,45.356653,20,3.61617,4,Female,0,No,0


Check if there are any missing values in the dataset.

In [None]:
# Check for NA values
df.isnull().sum()
# There are no NA values in the DataFrame

Unnamed: 0,0
AccountAge,0
MonthlyCharges,0
TotalCharges,0
SubscriptionType,0
ContentType,0
MultiDeviceAccess,0
ViewingHoursPerWeek,0
AverageViewingDuration,0
ContentDownloadsPerMonth,0
UserRating,0


# **Encoding Categorical Variables for Modeling**

During data preparation, you notice that several features in the dataset are **categorical** (they contain labels instead of numbers).  
Since logistic regression works only with numerical inputs, these variables must be **encoded** before modeling.



### Why Encoding Is Important
- Machine learning models cannot interpret text values such as *“Basic”* or *“Yes”*
- Encoding converts categories into numbers **without introducing false meaning**
- Poor encoding can mislead the model and hurt predictions



### Encoding Rules to Follow

1. **Columns with exactly two categories**  
   Encode them using **0 / 1**  
   - Example: `No → 0`, `Yes → 1`

2. **Columns with more than two categories**  
   Use **One-Hot Encoding**, then **drop the original column**



### What Is One-Hot Encoding?

One-hot encoding represents each category as its **own binary column**.

Instead of giving categories numbers (which can imply order), it creates **one column per category**.

Example:

SubscriptionType can be Basic, Standard or Premium

Create 3 separate columns: SubscriptionTypeBasic, SubscriptionTypeStandard, SubscriptionTypePremium.

If `SubscriptionType` is Basic for a row, set `SubscriptionTypeBasic = 1`, `SubscriptionTypeStandard = 0` and `SubscriptionTypePremium = 0` for that row. Do the same for every row, then drop the `SubscriptionType` column in place.
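
As a minimal sketch of the idea above, pandas' `get_dummies` can build the indicator columns on a small made-up frame. The `toy` data is hypothetical, and `prefix_sep=""` is chosen only so the column names match the ones described here.

```python
import pandas as pd

# Hypothetical toy data, not taken from the churn dataset
toy = pd.DataFrame({"SubscriptionType": ["Basic", "Premium", "Standard", "Basic"]})

# One indicator column per category; the original column is replaced
encoded = pd.get_dummies(
    toy, columns=["SubscriptionType"], prefix_sep="", dtype=int
)
print(encoded)
```

Each row has exactly one `1` across the three new columns, because the three subscription types are mutually exclusive.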



# **Hint from the Analytics Team**

One of the categorical columns appears to have three values.  
Before applying one-hot encoding, think carefully about what these values mean.

Ask yourself:
- Are these three values mutually exclusive choices?
- Or do they represent **different combinations of the same underlying options**?

Do you really need three separate columns for that feature, or would fewer be sufficient?

### Note:

1. **Simple binary columns**  
   - `MultiDeviceAccess`  
   - `Gender`  
   - `ParentalControl`  

   For these columns, we simply replace the binary responses with **0 or 1**, and there is **no need to create any extra columns**.

2. **Ternary column formed from two binary columns**  
   - `ContentType`

   This column can be represented as a combination of two binary columns:
   - `ContentTypeMovies`
   - `ContentTypeTVShows`

   - If the entry is **both**, we put `1` in **both columns**.
   - If the response is **only one**, we put `1` in the corresponding column and `0` in the other.

3. **Ternary column formed from three binary columns**  
   - `SubscriptionType`

   `SubscriptionType` can take the values **Basic**, **Standard**, or **Premium**.  
   Hence, we create three binary columns:
   - `SubscriptionTypeBasic`
   - `SubscriptionTypeStandard`
   - `SubscriptionTypePremium`

   We put `1` in the column corresponding to the value present in the original DataFrame and `0` in the other columns.


In [None]:
# Encode the remaining categorical features
# For binary features, replace the text values with 0 / 1
# For more than 2 categories, create separate binary columns for each category and drop the original column
df['MultiDeviceAccess'] = df['MultiDeviceAccess'].map({'Yes': 1, 'No': 0})
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['ParentalControl'] = df['ParentalControl'].map({'Yes': 1, 'No': 0})
# Binary columns are encoded
df.head()



Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,ContentType,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,Churn
0,20,11.055215,221.104302,Premium,Both,0,36.758104,63.531377,10,2.176498,4,1,3,0,0
1,57,5.175208,294.986882,Basic,Movies,0,32.450568,25.725595,18,3.478632,8,1,23,0,0
2,73,12.106657,883.785952,Basic,Movies,0,7.39516,57.364061,23,4.238824,6,1,1,1,0
3,32,7.263743,232.439774,Basic,TV Shows,0,27.960389,131.537507,30,4.276013,2,1,24,1,0
4,57,16.953078,966.325422,Premium,TV Shows,0,20.083397,45.356653,20,3.61617,4,0,0,0,0


In [None]:
# Represent ContentType with two binary columns; 'Both' sets both of them to 1
df['ContentTypeMovies'] = df['ContentType'].map({'Movies': 1, 'Both': 1, 'TV Shows': 0})
df['ContentTypeTVShows'] = df['ContentType'].map({'Movies': 0, 'Both': 1, 'TV Shows': 1})
# The original column is no longer needed
df.drop(['ContentType'], axis=1, inplace=True)
df.head()


Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,Churn,ContentTypeMovies,ContentTypeTVShows
0,20,11.055215,221.104302,Premium,0,36.758104,63.531377,10,2.176498,4,1,3,0,0,1,1
1,57,5.175208,294.986882,Basic,0,32.450568,25.725595,18,3.478632,8,1,23,0,0,1,0
2,73,12.106657,883.785952,Basic,0,7.39516,57.364061,23,4.238824,6,1,1,1,0,1,0
3,32,7.263743,232.439774,Basic,0,27.960389,131.537507,30,4.276013,2,1,24,1,0,0,1
4,57,16.953078,966.325422,Premium,0,20.083397,45.356653,20,3.61617,4,0,0,0,0,0,1


In [None]:
# Create three indicator columns for SubscriptionType (1 where the row has that type, else 0)
df['SubscriptionTypeBasic'] = (df['SubscriptionType'] == 'Basic').astype(int)
df['SubscriptionTypeStandard'] = (df['SubscriptionType'] == 'Standard').astype(int)
df['SubscriptionTypePremium'] = (df['SubscriptionType'] == 'Premium').astype(int)
# The original column is no longer needed
df.drop(['SubscriptionType'], axis=1, inplace=True)
df.head()


Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,Churn,ContentTypeMovies,ContentTypeTVShows,SubscriptionTypeBasic,SubscriptionTypeStandard,SubscriptionTypePremium
0,20,11.055215,221.104302,0,36.758104,63.531377,10,2.176498,4,1,3,0,0,1,1,0,0,1
1,57,5.175208,294.986882,0,32.450568,25.725595,18,3.478632,8,1,23,0,0,1,0,1,0,0
2,73,12.106657,883.785952,0,7.39516,57.364061,23,4.238824,6,1,1,1,0,1,0,1,0,0
3,32,7.263743,232.439774,0,27.960389,131.537507,30,4.276013,2,1,24,1,0,0,1,1,0,0
4,57,16.953078,966.325422,0,20.083397,45.356653,20,3.61617,4,0,0,0,0,0,1,0,0,1


In [None]:
# Move the target column 'Churn' to the end of the DataFrame
col = df.pop('Churn')
df['Churn'] = col
df.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,MultiDeviceAccess,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,ContentTypeMovies,ContentTypeTVShows,SubscriptionTypeBasic,SubscriptionTypeStandard,SubscriptionTypePremium,Churn
0,20,11.055215,221.104302,0,36.758104,63.531377,10,2.176498,4,1,3,0,1,1,0,0,1,0
1,57,5.175208,294.986882,0,32.450568,25.725595,18,3.478632,8,1,23,0,1,0,1,0,0,0
2,73,12.106657,883.785952,0,7.39516,57.364061,23,4.238824,6,1,1,1,1,0,1,0,0,0
3,32,7.263743,232.439774,0,27.960389,131.537507,30,4.276013,2,1,24,1,0,1,1,0,0,0
4,57,16.953078,966.325422,0,20.083397,45.356653,20,3.61617,4,0,0,0,0,1,0,0,1,0


In [None]:
# Check for NA values
df.isnull().sum()

Unnamed: 0,0
AccountAge,0
MonthlyCharges,0
TotalCharges,0
MultiDeviceAccess,0
ViewingHoursPerWeek,0
AverageViewingDuration,0
ContentDownloadsPerMonth,0
UserRating,0
SupportTicketsPerMonth,0
Gender,0


# **What is Scikit-learn?**

**Scikit-learn** is a Python library that helps you build machine learning models easily.  
Instead of coding all the mathematical formulas yourself, scikit-learn provides reliable, ready-made implementations.

It offers tools to:
- Prepare data (splitting, scaling, encoding, etc.)
- Train machine learning models
- Evaluate how well models perform
- Avoid common mistakes in machine learning workflows

Scikit-learn is one of the most widely used machine learning libraries.  
Going forward, we will use scikit-learn step by step to complete our task.



# **Preparing Data for Modeling**

1. Create
   - **x** → all feature columns used for prediction (a new DataFrame identical to the current `df`, just without the `Churn` column)
   - **y** → the target column


2. Once x and y are defined, split them into training and test sets using scikit-learn's `train_test_split` function
- **Training set** - used to train the model
- **Test set** - used to evaluate the model on unseen data

Follow these rules:
- Keep **20%** of the data for testing
- The result should give you **four outputs**:
  - training features
  - test features
  - training labels
  - test labels

Since this is your first time using scikit-learn, the splitting code is given below, but make sure you understand the inputs and outputs of `train_test_split`.


In [None]:
x = df.drop('Churn', axis=1)
y = df['Churn']

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

## Explanation of `train_test_split`

For our model, we have to split the data so that:

- One part of the data is used for **training** the model  
- The other part is used for **testing** the model  

This ensures that the model **does not get trained on the same data** that it is later tested on.

---

### Parameters Used

- **`x`** → Input features  
- **`y`** → Target / output labels  

- **`test_size = 0.2`**  
  - Means **20% of the data** is used for testing  
  - Remaining **80%** is used for training  

- **`random_state`**  
  - Used for **shuffling the data in a fixed and reproducible way**  
  - Ensures that the same train-test split is obtained every time the code is run  

---

### What All Is Returned

The function returns the following four objects:

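- **`x_train`** → features used to fit the model  
- **`x_test`** → features held out for evaluation  
- **`y_train`** → labels that correspond to `x_train`  
- **`y_test`** → labels that correspond to `x_test`  

They are always returned in this order.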

---

### Key Idea

> Splitting the data helps in evaluating how well the model performs on **unseen data**, which is crucial for building reliable machine learning models.
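
To make the inputs and outputs concrete, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical, not the churn data). It shows the 80/20 shapes and that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# 20% of 10 rows -> 2 test rows and 8 training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives an identical split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```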



# **A Note from the Statistics Team**

Before training the churn prediction model, the statisticians review your data and point out an important issue.

They observe that some numerical features have **much larger values** than others. These large-scale features could influence the model more than other features.

To avoid this, you are asked to **normalize all continuous numerical features** so that they are on a similar scale.  
Use **standardization**, which adjusts each feature to have:
- a mean of 0  
- a standard deviation of 1  

The statisticians also clarify that **binary features (encoded as 0 and 1)** should **not** be normalized. These already represent simple yes/no information and work correctly without scaling.

You'll use scikit-learn's StandardScaler for this task.
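
Standardization replaces each value x with z = (x - mean) / std, where the mean and standard deviation are learned from the training data. A minimal sketch on made-up numbers (`train_vals` and `test_vals` are hypothetical stand-ins for one continuous feature):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for one continuous feature
train_vals = np.array([[10.0], [20.0], [30.0], [40.0]])
test_vals = np.array([[25.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_vals)  # learns mean and std from the training values
test_scaled = scaler.transform(test_vals)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test values are transformed with the statistics learned from the training values, so a test value equal to the training mean (25 here) maps to 0.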

# **Imbalanced data**

First, check the number of samples in each output class (churn vs non-churn).  
Determine whether there is a **severe class imbalance**.

If a strong imbalance exists, apply **SMOTE** to balance the classes, as discussed in the previous session.

You're given the code for SMOTE since it's your first time, but carefully go through it and understand the inputs and outputs.

## What is SMOTE?

**SMOTE** stands for **Synthetic Minority Over-sampling Technique**.

In many real-world datasets, one class has **far fewer samples** than the other, which leads to **class imbalance**.

To handle this, we use **SMOTE**.

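- It is **not simple random oversampling**
- It **does not copy** existing data points
- It **generates new synthetic data points**  
- The new points are created **between existing minority-class points**, by interpolating between a point and one of its nearest minority-class neighbours

### When do we use it?
- Recall the cancer-patient example: positive cases are so rare that the model cannot learn them well when the class split is highly disproportionate. In such cases we generate additional minority-class points, and SMOTE is a good way to do this. Apply it to the **training data only**, so that the test set still reflects the real class distribution.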
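
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: it always uses the single nearest neighbour, whereas SMOTE picks randomly among several nearest neighbours. The `minority` points and the helper `synth_point` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synth_point(points, i, rng):
    """Interpolate between point i and its nearest minority-class neighbour."""
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]
    lam = rng.random()                     # random position along the segment, in [0, 1)
    return points[i] + lam * (neighbour - points[i])

new_point = synth_point(minority, 0, rng)
print(new_point)
```

The synthetic point lies on the line segment between two real minority-class points, which is why it is a new sample rather than a copy.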

In [None]:
# Columns with continuous (non-binary) values that need scaling
continuous_cols = [
    "AccountAge", "MonthlyCharges", "TotalCharges", "ViewingHoursPerWeek",
    "AverageViewingDuration", "ContentDownloadsPerMonth", "UserRating",
    "SupportTicketsPerMonth", "WatchlistSize",
]

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse its mean and std on the test data
x_train[continuous_cols] = scaler.fit_transform(x_train[continuous_cols])
x_test[continuous_cols] = scaler.transform(x_test[continuous_cols])

In [None]:
import numpy as np
np.unique(y_train, return_counts=True)

(array([0, 1]), array([159637,  35392]))
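
To judge how severe the imbalance is, the counts above can be turned into a share and a ratio. A small arithmetic sketch using the two numbers just printed:

```python
# Class counts reported for y_train above
n_stay, n_churn = 159637, 35392

churn_share = n_churn / (n_stay + n_churn)   # fraction of training rows that churned
ratio = n_stay / n_churn                     # non-churners per churner

print(f"Churn share: {churn_share:.3f}")
print(f"Non-churners per churner: {ratio:.2f}")
```

Roughly 18% of the training rows are churners, so there are about 4.5 non-churners for every churner.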

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

np.unique(y_train_resampled, return_counts=True)

(array([0, 1]), array([159637, 159637]))

# Time to train the model, finally!

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Use the model.fit() method; look up which arguments it expects
model.fit(x_train_resampled, y_train_resampled)
# Make sure you pass the resampled data, i.e. the data after SMOTE


# Test your model on the testing data

In [None]:
y_pred = model.predict(x_test)
# Important: the SMOTE-resampled data is used only for training the model.
# Performance is measured on the original test data, so we pass x_test
# here and not x_train_resampled.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred)

conf_matrix = confusion_matrix(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.687148775585545
Confusion Matrix:
 [[27679 12289]
 [ 2965  5825]]
Precision: 0.3215744727834824
Recall: 0.6626848691695109


# A bit of Consulting.... Since it's an ICG Project

Now, take some time to think about which metrics matter most to your business. After this churn analysis, the business will ultimately try to retain likely churners by contacting them or making them offers. False negatives are churners you missed, so you lose those customers. False positives are non-churners you classified as churners, so the business spends extra money contacting them.

It's up to your business to decide which is more significant: the revenue lost when customers leave, or the marketing cost of contacting customers flagged as churners. That's the precision-recall tradeoff we discussed.

We see that recall is fairly high (about 0.66) but precision is low (about 0.32), so:  
- Because of the higher recall, the company identifies a larger share of the actual churners, so it should lose fewer customers.  
- Because of the low precision, the company will contact and spend extra on many customers who were not likely to churn.
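
One way to explore this tradeoff is to change the probability threshold at which a customer is flagged as a churner. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the churn dataset) and compares three hypothetical thresholds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (hypothetical; not the churn dataset)
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # estimated probability of the positive class

recalls = []
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    recalls.append(r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold flags more customers, so recall can only stay the same or rise, usually at the cost of precision.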

# Extra Tasks

## Question

Think about what will happen if we standardize the **entire dataset** before splitting it into **training** and **testing** sets.

**Hint:**  
This can make the accuracy on the test dataset look slightly better, but it is **not a good practice**.  
Why don’t we do something that improves accuracy?

---

## Answer

This situation is called **data leakage** (it is related to, but not the same as, overfitting).

What happens is that the model's score on the test data looks **better than it really is**, and its accuracy can **drop on new, unseen inputs**.

This happens because:
- The scaler's mean and standard deviation were computed using the test rows, so information from the test data has already reached the preprocessing step  
- The test set is therefore no longer truly **unseen data**
- In real-world use, new customers arrive **after** the scaler has been fitted, so their values cannot influence its statistics

As a result, the evaluation gives **false confidence** about the model's performance.
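
One common way to keep test-set statistics out of preprocessing is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is fitted only on whatever data is passed to `fit`. A minimal sketch on synthetic data (an assumption for illustration, not the churn dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical; not the churn dataset)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)            # scaler statistics come from X_tr only
test_score = pipe.score(X_te, y_te)   # X_te is scaled with the training mean and std
print(test_score)

# The fitted scaler's mean matches the training mean, not the full-data mean
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```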


## Training the model without SMOTE and looking at the recall

In [None]:
model2 = LogisticRegression(max_iter=1000)


model2.fit(x_train, y_train)
y_pred2 = model2.predict(x_test)

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

accuracy2 = accuracy_score(y_test, y_pred2)

conf_matrix2 = confusion_matrix(y_test, y_pred2)

precision2 = precision_score(y_test, y_pred2)

recall2 = recall_score(y_test, y_pred2)

print("Accuracy:", accuracy2)
print("Confusion Matrix:\n", conf_matrix2)
print("Precision:", precision2)
print("Recall:", recall2)

Accuracy: 0.8252594446039624
Confusion Matrix:
 [[39213   755]
 [ 7765  1025]]
Precision: 0.5758426966292135
Recall: 0.11660978384527873


## Observation

We observe that **recall falls significantly**, and as a **trade-off**, **precision increases**.

---

## What Happened?

The positive class made up only about 18% of the training data (35,392 of 195,029 samples), so the model was biased toward the majority class and **rarely predicted churn** at the default 0.5 probability threshold.

This happened because **oversampling was not applied** in this case, so the **class imbalance** in the training data was left uncorrected.

---

## Effect on the Company

- The company would **lose many customers**
- The model fails to identify a **large percentage of potential churners**
- This is due to **low recall**
- Missing potential churners is **harmful for the business**

Overall, a model with low recall in this scenario is **not suitable for the company**.
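
To compare the two models in business terms, the error counts can be weighted by a cost per mistake. The unit costs below are purely hypothetical placeholders, and the sketch ignores the cost of contacting true churners and the chance that an offer fails to retain them.

```python
# Hypothetical unit costs; replace with figures from the business
COST_LOST_CUSTOMER = 100.0   # assumed revenue lost per missed churner (false negative)
COST_CONTACT = 5.0           # assumed cost of contacting a non-churner (false positive)

def error_cost(fn, fp):
    """Total cost of the two kinds of mistakes under the assumed unit costs."""
    return fn * COST_LOST_CUSTOMER + fp * COST_CONTACT

# FN and FP counts from the two confusion matrices reported above
cost_with_smote = error_cost(fn=2965, fp=12289)
cost_without_smote = error_cost(fn=7765, fp=755)

print(cost_with_smote, cost_without_smote)
```

Under these assumed costs the SMOTE-trained model is cheaper overall, but the conclusion flips if contacting a customer costs much more relative to losing one.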


## Min–Max Scaling vs Standardization

Feature scaling is used to bring numerical features to a similar scale so that no feature dominates the model.

---

## Min–Max Scaling

Rescales data to a **fixed range**, usually **[0, 1]**.

**Pros**
- Simple and intuitive  
- Bounded output  
- Useful for distance-based models

**Cons**
- Very sensitive to outliers  
- New data can fall outside the range

---

## Standardization

Rescales data to have:
- **Mean = 0**
- **Standard Deviation = 1**

**Pros**
- Less sensitive to outliers than min–max scaling  
- Works well with linear models

**Cons**
- Values are unbounded  
- Less intuitive

---

## Key Difference

| Aspect | Min–Max | Standardization |
|------|--------|----------------|
| Range | Fixed [0, 1] | Unbounded |
| Outliers | Sensitive | More robust |
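
A small sketch on made-up numbers shows the difference in output range. The `values` array is hypothetical and contains one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature with one large outlier (the last value)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(values)      # rescaled to the range [0, 1]
standard = StandardScaler().fit_transform(values)  # rescaled to mean 0, std 1

print(minmax.ravel())
print(standard.ravel())
```

With min–max scaling the outlier becomes exactly 1 and the other values are squeezed close to 0. Standardization keeps the mean at 0 and leaves the outlier as a large, unbounded value.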

---