## Step 3: Feature Engineering and Preprocessing

In [2]:
# Necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [3]:
# Load clean csv data 
df = pd.read_csv('cleaned_telco.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
# Summary 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   object 
 4   Dependents        7032 non-null   object 
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   object 
 7   MultipleLines     7032 non-null   object 
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   object 
 10  OnlineBackup      7032 non-null   object 
 11  DeviceProtection  7032 non-null   object 
 12  TechSupport       7032 non-null   object 
 13  StreamingTV       7032 non-null   object 
 14  StreamingMovies   7032 non-null   object 
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   object 


In [5]:
# Drop irrelevant columns that don't contribute to the prediction
# 'customerID' is a unique identifier and has no impact on churn behavior
df = df.drop('customerID', axis=1)

In [6]:
# Encode the target variable 'Churn' to numeric values (0 = No, 1 = Yes)
# This is essential for machine learning models to understand the target
# Using map() for an efficient and readable conversion
df['Churn'] = df['Churn'].map({'Yes':1, 'No':0})

In [7]:
# Binary encode gender: Female = 0, Male = 1
df['gender'] = df['gender'].map({'Male':1, 'Female':0})

In [8]:
# Encode binary columns: Yes = 1, No = 0
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
df[binary_cols] = df[binary_cols].apply(lambda x: x.map({'Yes':1, 'No':0}))

To prepare binary categorical columns for modeling, we encoded the values as numeric:

- `Yes` was mapped to `1`  
- `No` was mapped to `0`  

Most machine learning algorithms require numeric input, so this mapping makes these columns usable as features.  
We used `apply()` with a `lambda` function to apply the same mapping to all four binary columns at once.
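
One detail worth checking after this kind of mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a stray label such as a lowercase `yes` would silently become missing. The sketch below uses a small toy DataFrame (not rows from the Telco data) and a hypothetical helper, `encode_yes_no`, to show how such values could be detected.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Telco dataset
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Dependents": ["No", "No", "yes"],  # lowercase "yes" is not in the mapping
})

def encode_yes_no(frame, cols):
    """Hypothetical helper: map Yes/No to 1/0 and report unmapped values."""
    encoded = frame[cols].apply(lambda s: s.map({"Yes": 1, "No": 0}))
    # Values missing from the mapping become NaN; count them per column
    unmapped = encoded.isna().sum()
    return encoded, unmapped[unmapped > 0]

encoded, problems = encode_yes_no(toy, ["Partner", "Dependents"])
print(problems)  # lists only the columns that contain unmapped values
```

If `problems` is empty, every value was covered by the mapping and the encoded columns can be used as they are.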


In [10]:
# One-hot encode multiclass categorical columns (drop first to avoid dummy variable trap)
multi_cat_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                  'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                  'Contract', 'PaymentMethod']

df = pd.get_dummies(df, columns=multi_cat_cols, drop_first=True)

# Shape after encoding
print("Shape after encoding", df.shape)

Shape after encoding (7032, 31)


To prepare the multi-class categorical columns for machine learning, we applied one-hot encoding.  
We used `drop_first=True` to avoid the dummy variable trap: dropping one indicator column per feature removes the perfectly collinear column, which prevents multicollinearity in linear models.  
This step converts categorical text data into a format that can be interpreted by the model.
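
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy `Contract` column (illustrative values only, not the notebook's DataFrame). It compares the full set of indicator columns with the reduced set.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# Full encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["Contract"])

# Reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding every row sums to 1, so any one indicator
# is fully determined by the others; this is the redundancy that
# drop_first=True removes
row_sums = full.sum(axis=1)
print(row_sums.tolist())
```

The dropped category becomes the implicit baseline: a row with all remaining indicators set to `False` belongs to it.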


In [12]:
# Check class imbalance in the target
# Cleaning can remove rows, so re-check the class proportions here
print(df['Churn'].value_counts(normalize=True))

Churn
0    0.734215
1    0.265785
Name: proportion, dtype: float64


In [13]:
# Scale numerical features using standard scaler
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

We scaled the numerical features using **StandardScaler**, which standardizes the values by removing the mean and scaling to unit variance.

This ensures that all numeric features (like `tenure`, `MonthlyCharges`, `TotalCharges`) are on a similar scale, which helps many machine learning models (like Logistic Regression or SVM) perform better.

Formula used:
\[
z = \frac{x - \mu}{\sigma}
\]

Where:
- \( x \) is the original value  
- \( \mu \) is the mean of the feature  
- \( \sigma \) is the standard deviation of the feature  
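
The formula can be checked against `StandardScaler` directly. The sketch below uses toy `tenure` values (illustrative only) and shows two things: that the scaler matches the formula when the population standard deviation (`ddof=0`) is used, and, as an assumption about the later modeling step, how the scaler could be fitted on the training rows only if a train/test split is applied. The notebook above fits the scaler on the full dataset; the split shown here is a common alternative, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only
toy = pd.DataFrame({"tenure": [1.0, 34.0, 2.0, 45.0, 2.0, 8.0, 22.0, 10.0]})

# 1) StandardScaler matches z = (x - mean) / std, where std is the
#    population standard deviation (ddof=0)
scaled = StandardScaler().fit_transform(toy[["tenure"]]).ravel()
manual = (
    (toy["tenure"] - toy["tenure"].mean()) / toy["tenure"].std(ddof=0)
).to_numpy()
print(np.allclose(scaled, manual))

# 2) Alternative (assumed workflow): learn mean and std from the
#    training rows only, then reuse them for the test rows
train, test = train_test_split(toy, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(train[["tenure"]])
train_scaled = scaler.transform(train[["tenure"]])
test_scaled = scaler.transform(test[["tenure"]])
```

Fitting on the training rows alone keeps information about the test rows out of the learned mean and standard deviation, at the cost of the test set no longer having exactly zero mean after scaling.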


In [15]:
# Summary 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   gender                                 7032 non-null   int64  
 1   SeniorCitizen                          7032 non-null   int64  
 2   Partner                                7032 non-null   int64  
 3   Dependents                             7032 non-null   int64  
 4   tenure                                 7032 non-null   float64
 5   PhoneService                           7032 non-null   int64  
 6   PaperlessBilling                       7032 non-null   int64  
 7   MonthlyCharges                         7032 non-null   float64
 8   TotalCharges                           7032 non-null   float64
 9   Churn                                  7032 non-null   int64  
 10  MultipleLines_No phone service         7032 non-null   bool   
 11  Mult

- As the summary shows, all columns have been processed and now have numeric or boolean dtypes.

In [17]:
# Export this data into csv format for further processing
df.to_csv("preprocessed_telco_churn.csv", index=False)

### Final Summary
In this notebook, we performed the essential steps to transform the cleaned data into a numeric, model-ready format:

1. **Dropped Irrelevant Columns**  
   - Removed `customerID` as it does not carry predictive value.

2. **Encoded Target Variable**  
   - Converted `Churn` from "Yes"/"No" to `1`/`0`.

3. **Binary Encoding**  
   - Encoded binary categorical features (`Yes`/`No`) into `1`/`0` for columns like `Partner`, `Dependents`, `PhoneService`, and `PaperlessBilling`.

4. **Gender Encoding**  
   - Encoded `gender` as `1` for Male and `0` for Female.

5. **One-Hot Encoding for Multiclass Features**  
   - Applied one-hot encoding to features like `InternetService`, `Contract`, `PaymentMethod`, etc., while avoiding the dummy variable trap using `drop_first=True`.

6. **Feature Scaling**  
   - Applied `StandardScaler` to standardize numerical features: `tenure`, `MonthlyCharges`, and `TotalCharges`.

7. **Exported the Preprocessed Data**  
   - Saved the preprocessed dataset to a CSV file, `preprocessed_telco_churn.csv`, for reuse in modeling and later steps.

---

With this, our dataset is clean, consistent, and ready for model training in the next step.
