# Notebook 02: Data Cleaning & Preprocessing

This notebook prepares the dataset for modeling. We perform data cleaning, encoding, scaling, and train-test splitting.

### Objectives of This Notebook:
- Drop irrelevant or redundant columns
- Encode categorical and target variables
- Scale numerical features
- Split the dataset into training and testing sets
- Save cleaned files for future modeling

### Step 1: Import Libraries
We import the necessary preprocessing tools from `sklearn`, along with `pandas` for data manipulation.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

### Step 2: Load the Dataset
We load the same raw dataset used in Notebook 01. Cleaning will be performed on this DataFrame.

In [2]:
df = pd.read_csv("../data/raw_data.csv")
df.columns = df.columns.str.strip()
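
The effect of the `str.strip()` call can be seen on a tiny in-memory CSV. This is only a sketch: the header padding and the `df_demo` name are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical CSV whose header cells carry stray spaces (assumption for illustration).
raw = io.StringIO("CLIENTNUM , Customer_Age\n1,45\n2,51\n")
df_demo = pd.read_csv(raw)

# Strip leading/trailing whitespace from every column name.
df_demo.columns = df_demo.columns.str.strip()
print(df_demo.columns.tolist())  # ['CLIENTNUM', 'Customer_Age']
```

Without this step, a lookup such as `df_demo["CLIENTNUM"]` would raise a `KeyError` on the padded header.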

### Step 3: Drop Irrelevant Columns
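We remove columns that won't help with predictions: `CLIENTNUM` is only an identifier, and the two `Naive_Bayes_Classifier_...` columns are outputs of a pre-built model for the attrition flag, so keeping them would leak the target into the features.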

In [3]:
df.drop(columns=[
    "CLIENTNUM",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"
], inplace=True)

### Step 4: Encode the Target Column
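We convert the target label to a binary integer (0 = existing, 1 = churned), since most ML algorithms expect a numeric target. Note that the lambda maps every value other than `"Attrited Customer"` to 0, including any missing or misspelled labels.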

In [4]:
df["Attrition_Flag"] = df["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
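
A stricter variant is sketched below, assuming the non-churned label is spelled `"Existing Customer"` (the notebook itself only spells out `"Attrited Customer"`). It uses an explicit mapping so that an unexpected label surfaces as an error instead of silently becoming 0. The toy `target` series and `label_map` name are illustrative.

```python
import pandas as pd

# Toy stand-in for the raw target column (assumed label spellings).
target = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

# Explicit mapping: labels missing from the dict become NaN rather than 0.
label_map = {"Existing Customer": 0, "Attrited Customer": 1}
encoded = target.map(label_map)

# Fail loudly if any label was not covered by the mapping.
assert encoded.notna().all(), "unexpected label in target column"
encoded = encoded.astype("int64")
print(encoded.tolist())  # [0, 1, 0]
```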

### Step 5: Encode Categorical Features
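Label encoding converts each categorical feature into integer codes so that models can consume it. `LabelEncoder` assigns codes in sorted order of the category names, so the resulting numbers carry an arbitrary ordering rather than a meaningful one.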

In [5]:
categorical_cols = df.select_dtypes(include="object").columns.tolist()
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
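
Because the loop reuses a single `le` object, only the mapping for the last column survives. One possible alternative, sketched here on toy data, keeps a separate encoder per column so the codes can be decoded later. The `toy` frame and `encoders` dict are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (values chosen for illustration).
toy = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Card_Category": ["Blue", "Gold", "Blue"],
})

# Fit one encoder per column and keep it for later decoding.
encoders = {}
for col in ["Gender", "Card_Category"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy["Gender"].tolist())                      # [1, 0, 0]
print(encoders["Gender"].classes_.tolist())        # ['F', 'M']
print(encoders["Card_Category"].inverse_transform([0, 1]).tolist())  # ['Blue', 'Gold']
```

The stored encoders make it possible to translate model inputs back into readable categories when interpreting results.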

### Step 6: Train-Test Split
We separate the data into features (X) and target (y), then split into training and test sets with stratification to preserve churn ratios.

In [6]:
X = df.drop("Attrition_Flag", axis=1)
y = df["Attrition_Flag"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
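
To illustrate what `stratify=y` does, here is a small sketch on synthetic data with a 20% positive class. The data and the `_demo` variable names are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 positives and 80 negatives.
y_demo = pd.Series([1] * 20 + [0] * 80)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, both splits keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```

Without `stratify`, the positive rate in a small test set could drift noticeably from the full-data rate just by chance.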

### Step 7: Feature Scaling
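We standardize the numeric columns to zero mean and unit variance. Because the categorical features were label-encoded in Step 5, every column in `X` is numeric at this point, so the encoded categories are scaled along with the continuous features. The scaler is fit on the training set only and then applied to the test set, which avoids leaking test statistics into training. Scaling benefits many models, such as Logistic Regression and SVM.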

In [7]:
numeric_cols = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
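
If scaling the encoded categories is not desired, one option is to name the continuous columns explicitly. The sketch below assumes a hypothetical `continuous_cols` list and toy frames; it is not the notebook's method, only an illustration of fitting on train and reusing those statistics on test.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames: one continuous column, one already-encoded category.
train = pd.DataFrame({"Credit_Limit": [1000.0, 2000.0, 3000.0], "Gender": [0, 1, 0]})
test = pd.DataFrame({"Credit_Limit": [4000.0], "Gender": [1]})

continuous_cols = ["Credit_Limit"]  # hypothetical choice of columns to scale

# Work on copies so the original frames stay untouched.
train_scaled = train.copy()
test_scaled = test.copy()

demo_scaler = StandardScaler()
# Fit on the training rows only, then reuse the same mean/std for the test rows.
train_scaled[continuous_cols] = demo_scaler.fit_transform(train[continuous_cols])
test_scaled[continuous_cols] = demo_scaler.transform(test[continuous_cols])

print(train_scaled["Credit_Limit"].round(4).tolist())  # [-1.2247, 0.0, 1.2247]
print(test_scaled["Gender"].tolist())                  # [1]
```

The test value lands outside the training range after scaling, which is expected: the scaler only knows the training mean and standard deviation.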

### Step 8: Preview Cleaned Datasets
Let’s preview the transformed features and labels to ensure everything looks clean and consistent.

In [8]:
print("X_train preview:")
display(X_train.head())

print("\ny_train distribution:")
print(y_train.value_counts())

X_train preview:


| | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2856 | -1.301232 | -0.946732 | -1.8099 | 1.58682 | -0.627024 | 0.760924 | -0.256393 | -1.513205 | 0.122196 | 0.646495 | -0.411236 | -0.665434 | 1.303942 | -0.783057 | -0.335597 | -0.775093 | -0.961644 | -1.093573 | 2.148867 |
| 6515 | -0.29934 | 1.056265 | 0.501503 | -0.05625 | 0.726432 | -0.565148 | 4.128624 | -0.0007 | 0.765411 | -0.342413 | 0.489635 | 1.848571 | -1.433246 | 1.977862 | 0.407489 | -0.61996 | -1.131083 | -0.537244 | -1.000072 |
| 7141 | -0.048867 | 1.056265 | -0.268965 | -0.05625 | -0.627024 | 0.097888 | -0.256393 | -0.756952 | 1.408625 | 0.646495 | -1.312106 | 0.342117 | -0.310139 | 0.370094 | 0.603518 | -0.032694 | 1.029259 | 0.579534 | -0.717103 |
| 632 | -1.301232 | -0.946732 | -0.268965 | -0.60394 | -0.627024 | 0.760924 | -0.256393 | -1.513205 | -0.521019 | -1.33132 | 0.489635 | -0.604095 | 0.522064 | -0.65119 | 0.498665 | -0.805413 | -1.004004 | -1.42737 | 0.850111 |
| 3496 | 0.452079 | 1.056265 | -1.039432 | -0.60394 | 0.726432 | -1.891219 | -0.256393 | 0.503468 | 0.122196 | -0.342413 | 0.489635 | 2.871622 | 0.021269 | 2.869713 | -0.157803 | -0.151325 | 0.309145 | 0.002601 | -0.876727 |



y_train distribution:
Attrition_Flag
0    6799
1    1302
Name: count, dtype: int64


The cleaned training dataset is now ready for modeling:
- `X_train` shows all features have been numerically encoded and scaled, ensuring compatibility with most ML models.
- `y_train` class counts:
  - **0 (Existing Customers): 6799**
  - **1 (Attrited Customers): 1302**
  - Attrited customers make up about 16.1% of the training set (1302 of 8101), so the target distribution remains imbalanced.

In [9]:
print("\ny_test distribution:")
print(y_test.value_counts())


y_test distribution:
Attrition_Flag
0    1701
1     325
Name: count, dtype: int64


**Stratification Check:**  
The `y_test` distribution is as follows:
- **0 (Existing Customers): 1701**
- **1 (Attrited Customers): 325**

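About 16.0% of the test rows are attrited customers (325 of 2026), compared with about 16.1% in the training set (1302 of 8101). The proportions match closely, confirming that **stratified splitting worked as intended**.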
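
The comparison can be made explicit with a few lines of arithmetic on the counts printed above. The `churn_rate` helper is a name introduced here for illustration.

```python
# Class counts copied from the y_train and y_test outputs above.
train_counts = {0: 6799, 1: 1302}
test_counts = {0: 1701, 1: 325}

def churn_rate(counts):
    """Share of attrited customers (class 1) among all rows."""
    return counts[1] / (counts[0] + counts[1])

print(f"train churn rate: {churn_rate(train_counts):.4f}")  # 0.1607
print(f"test churn rate:  {churn_rate(test_counts):.4f}")   # 0.1604
```

In a notebook, `y_train.value_counts(normalize=True)` would give the same proportions directly.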


### Step 9: Save Preprocessed Data
We export the cleaned datasets as CSVs for use in modeling.

In [10]:
X_train.to_csv("../data/X_train_clean.csv", index=False)
X_test.to_csv("../data/X_test_clean.csv", index=False)
y_train.to_csv("../data/y_train_clean.csv", index=False)
y_test.to_csv("../data/y_test_clean.csv", index=False)
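
When these files are loaded again later, note that `pd.read_csv` returns a one-column DataFrame for the target files rather than a Series. A minimal round-trip sketch follows, using a temporary directory and a toy target so it does not touch the real `../data` folder.

```python
import os
import tempfile
import pandas as pd

# Toy target series standing in for y_train.
y_demo = pd.Series([0, 1, 0], name="Attrition_Flag")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_train_clean.csv")
    y_demo.to_csv(path, index=False)
    # read_csv yields a DataFrame; select the column to get a Series back.
    y_loaded = pd.read_csv(path)["Attrition_Flag"]

print(y_loaded.tolist())  # [0, 1, 0]
```

Because `index=False` drops the original row labels, the reloaded features and targets line up by position only, so the files should be kept in the same row order.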

### Step 10: Summary & Next Steps
We now have cleaned CSVs that are ready for machine learning algorithms.

In Notebook 03, we’ll:

- Build predictive models to classify churn risk, starting with a baseline Logistic Regression model.
- Evaluate a more powerful ensemble model (Random Forest) and compare the two models' performance.
- Identify which approach offers the best balance between accuracy and interpretability.