# Feature Engineering and Data Preparation

This notebook prepares the customer churn dataset for machine learning models:
it splits the data into stratified training and test sets, then builds a
preprocessing pipeline that scales numerical features and one-hot encodes
categorical ones.

**1. Imports**

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

**2. Load Data**

In [4]:
df = pd.read_csv("../data/raw_data.csv")

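# Fix identified during EDA: coerce TotalCharges to numeric.
# Entries that cannot be parsed become NaN instead of raising an error.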
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
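
Coercion can silently introduce missing values, so it may be worth counting them right after loading. The snippet below is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw column: numbers stored as strings,
# with one blank entry that cannot be parsed.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns unparseable strings into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count how many values were lost to coercion.
n_missing = int(toy["TotalCharges"].isna().sum())
print(toy["TotalCharges"].dtype, n_missing)
```

Running the same `isna().sum()` check on `df["TotalCharges"]` would show whether the real data needs imputation or row removal before scaling.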


**3. Target & Features Split**

In [5]:
X = df.drop("Churn", axis=1)
y = df["Churn"].map({"Yes": 1, "No": 0})

X.shape, y.shape

((7043, 20), (7043,))
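
`Series.map` with a dictionary returns NaN for any label that is not a key, so a stray spelling in the target would pass unnoticed. A small sketch of a guard, using a hypothetical label series with a deliberately misspelled value:

```python
import pandas as pd

# Hypothetical labels; the lowercase "yes" is not a key in the mapping.
labels = pd.Series(["Yes", "No", "No", "yes"])
y_toy = labels.map({"Yes": 1, "No": 0})

# Collect the raw labels that failed to map.
unmapped = labels[y_toy.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was mapped; anything else points to values that need cleaning before the split.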

**4. Train / Test Split**

## Train-Test Split

The dataset is split 80/20 into training and test sets using stratified
sampling, so both sets keep the class distribution of the target variable.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

X_train.shape, X_test.shape

((5634, 20), (1409, 20))
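
One way to confirm that stratification worked is to compare the positive-class rate across the full, training, and test targets. The sketch below uses a synthetic target with an assumed 25% positive rate, not the real churn labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target (assumption: 25% positives).
y_toy = pd.Series([1] * 250 + [0] * 750)
X_toy = pd.DataFrame({"x": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify=y, the three rates should be nearly identical.
rates = {"full": y_toy.mean(), "train": y_tr.mean(), "test": y_te.mean()}
print(rates)
```

The same comparison on `y`, `y_train`, and `y_test` would apply to the split above.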

**5. Feature Groups**

In [7]:
num_features = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_features = [col for col in X.columns if col not in num_features]

num_features, len(cat_features)

(['tenure', 'MonthlyCharges', 'TotalCharges'], 17)
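
Defining the categorical group as "every column that is not numeric" also sweeps in identifier columns such as `customerID`, which appears in the loaded data. A hedged sketch of one alternative, where the `id_cols` list is an assumption introduced here and the toy frame is illustrative:

```python
import pandas as pd

# Toy frame with the same kinds of columns as the real feature matrix.
X_toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3"],
    "Contract": ["Month-to-month", "One year", "One year"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# Assumption: identifiers carry no predictive signal and should be excluded.
id_cols = ["customerID"]
num_features = ["tenure", "MonthlyCharges"]
cat_features = [c for c in X_toy.columns if c not in num_features + id_cols]
print(cat_features)
```

Columns left out of both groups are dropped by `ColumnTransformer` by default (`remainder="drop"`), so the identifier never reaches the encoder.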

**6. Preprocessing Pipeline**

## Preprocessing Pipeline

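A single preprocessing pipeline handles both numerical scaling and categorical
encoding. Fitting it on the training set alone keeps test-set statistics out of
the learned scaling parameters and category lists, which prevents data leakage.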

In [8]:
numeric_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ]
)
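
`StandardScaler` ignores NaN when fitting and passes it through when transforming, so any missing values created by the `TotalCharges` coercion would survive this pipeline. The following is a sketch of one possible variant with median imputation; the imputation strategy is an assumption, not something the notebook prescribes, and the toy data is illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric branch: fill missing values first, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["tenure", "TotalCharges"]),
    ("cat", categorical_transformer, ["Contract"]),
])

# Toy data with one missing TotalCharges value.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, 1889.5, np.nan, 1840.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

out = preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density.
dense = out.toarray() if sparse.issparse(out) else out
print(dense.shape, np.isnan(dense).any())
```

Because the imputer sits inside the pipeline, its median is learned from whatever data `fit` receives, which keeps it consistent with the leakage-avoidance goal.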

**7. Apply Preprocessing (test run)**

## Applying the Preprocessing Pipeline

The preprocessing pipeline is fitted on the training data and applied to both
training and test sets.

In [9]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed.shape, X_test_processed.shape

((5634, 5680), (1409, 5680))
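
An output width of 5,680 is much larger than 20 input columns would suggest. One plausible explanation, assuming every customer has a distinct ID, is that `customerID` contributes one column per training row: 5,634 of the 5,680 columns, leaving 46 for all other features. The sketch below shows how per-column widths can be inspected through the encoder's `categories_` attribute; the toy frame reuses four IDs from the preview above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one identifier column and one genuine categorical column.
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

encoder = OneHotEncoder(handle_unknown="ignore").fit(toy)

# categories_ holds one array per input column; its length is the number
# of one-hot columns that input column produces.
widths = {col: len(cats) for col, cats in zip(toy.columns, encoder.categories_)}
print(widths)
```

On the fitted pipeline, the equivalent lookup would go through `preprocessor.named_transformers_["cat"].named_steps["encoder"].categories_`.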

**8. Summary**

### Summary

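- The dataset was split 80/20 into training and test sets using stratified sampling.
- Numerical features (tenure, MonthlyCharges, TotalCharges) were standardized.
- Categorical features were one-hot encoded, with categories unseen during fitting ignored at transform time.
- The preprocessing pipeline can now be combined with machine learning models. Note that the 5,680 output columns indicate customerID is being one-hot encoded (5,634 training rows leave only 46 columns for the other features), so it should be excluded from the feature set before modeling.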
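
As a rough illustration of using the preprocessor with a model, it can be chained with an estimator in a single `Pipeline`, so that fitting and prediction apply the same transformations. The classifier choice (`LogisticRegression`) and the toy data below are assumptions for the sketch, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features and target, made up for illustration.
toy_X = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 60, 5, 48],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "Two year"],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Preprocessing and the classifier are fitted together on the same data.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(toy_X, toy_y)
preds = model.predict(toy_X)
print(preds)
```

With the real data, the call would be `model.fit(X_train, y_train)` followed by evaluation on `X_test`, which keeps the test set out of every fitting step.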