# Day 04: Feature Engineering

Feature engineering turns raw columns into signals a model can learn from.

We will cover:
- Handling missing values
- Encoding categorical variables
- Scaling numeric features
- Creating interaction/ratio features
- Building a repeatable preprocessing pipeline


## 1) Create a mixed-type dataset
We will simulate a small customer dataset with numeric and categorical columns, where some numeric entries are missing.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

customers = pd.DataFrame(
    {
        "age": [25, 32, 47, None, 52, 23, 39, 41, None, 36],
        "income": [48000, 62000, 72000, 51000, None, 39000, 68000, 59000, 61000, 65000],
        "region": ["north", "south", "west", "east", "north", "south", "west", "west", "east", "north"],
        "is_premium": [0, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    }
)

customers.head()
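Before choosing an imputation strategy, it helps to know where the gaps are. The snippet below is a minimal sketch that rebuilds the same frame so it runs on its own and counts missing entries per column; the names `missing_counts` and `missing_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuilt copy of the `customers` frame so this snippet runs on its own.
customers = pd.DataFrame(
    {
        "age": [25, 32, 47, None, 52, 23, 39, 41, None, 36],
        "income": [48000, 62000, 72000, 51000, None, 39000, 68000, 59000, 61000, 65000],
        "region": ["north", "south", "west", "east", "north", "south", "west", "west", "east", "north"],
        "is_premium": [0, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    }
)

# Number of missing entries in each column.
missing_counts = customers.isna().sum()

# Fraction of rows that are missing in each column (0.0 to 1.0).
missing_share = customers.isna().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

In this dataset only `age` (two rows) and `income` (one row) have gaps; `region` and `is_premium` are complete.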


## 2) Add a simple engineered feature
A ratio can express a relationship between two columns that neither column shows on its own. Here we relate income to age.


In [None]:
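# Income relative to age (+1 keeps the denominator nonzero for age 0); a missing age or income gives NaN.
customers["income_per_age"] = customers["income"] / (customers["age"] + 1)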
customers.head()
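One thing to watch with ratio features is that missing values propagate: if either input is missing, the ratio is missing too. The sketch below illustrates this on a hypothetical three-row frame called `toy` (not part of the original data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 lacks age, row 2 lacks income.
toy = pd.DataFrame(
    {
        "age": [25.0, np.nan, 52.0],
        "income": [48000.0, 51000.0, np.nan],
    }
)

# Same formula as the notebook cell above.
toy["income_per_age"] = toy["income"] / (toy["age"] + 1)

# Only the complete row gets a real value; the other two become NaN.
print(toy)
print(toy["income_per_age"].isna().tolist())
```

So the engineered column can have more gaps than either of its inputs, which is why it also needs to go through the numeric imputer.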


## 3) Define preprocessing steps
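We treat numeric and categorical columns differently. Numeric columns get median imputation followed by standardization; categorical columns get most-frequent imputation followed by one-hot encoding. The same cell also holds out 20% of the rows as a test set, stratified on `is_premium`.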


In [None]:
X = customers.drop(columns="is_premium")
y = customers["is_premium"]

numeric_features = ["age", "income", "income_per_age"]
categorical_features = ["region"]

numeric_transformer = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

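# Hold out 20% of the rows (2 of 10) for testing; stratify keeps the class mix as similar as the small sample allows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)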


## 4) Train a model with preprocessing in a pipeline
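Wrapping preprocessing and the classifier in one pipeline makes the steps repeatable and helps avoid data leakage: imputation values, scaling statistics, and category lists are learned from the training rows only, then applied unchanged to the test rows.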


In [None]:
model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

model.fit(X_train, y_train)

preds = model.predict(X_test)
accuracy_score(y_test, preds)
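With only two rows in the test set, the accuracy above can only be 0.0, 0.5, or 1.0, so it says little about the model. One common alternative is cross-validation, sketched below under the assumption that three stratified folds are acceptable for ten rows (the fold count and the name `cv` are choices made for this example, not part of the original notebook). Because the whole pipeline is passed to `cross_val_score`, the preprocessing is refit on the training part of each fold.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rebuilt copy of the notebook data so this snippet runs on its own.
customers = pd.DataFrame(
    {
        "age": [25, 32, 47, None, 52, 23, 39, 41, None, 36],
        "income": [48000, 62000, 72000, 51000, None, 39000, 68000, 59000, 61000, 65000],
        "region": ["north", "south", "west", "east", "north", "south", "west", "west", "east", "north"],
        "is_premium": [0, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    }
)
customers["income_per_age"] = customers["income"] / (customers["age"] + 1)

X = customers.drop(columns="is_premium")
y = customers["is_premium"]

preprocessor = ColumnTransformer(
    [
        (
            "num",
            Pipeline(
                [
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            ["age", "income", "income_per_age"],
        ),
        (
            "cat",
            Pipeline(
                [
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            ["region"],
        ),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

# Three stratified folds; each fold refits imputer, scaler, and encoder on its own training rows.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Even so, ten rows are too few for a reliable estimate; treat the numbers as a demonstration of the workflow rather than a measurement of model quality.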


## 5) What to do next
Once features are in good shape, the next improvement often comes from
**hyperparameter tuning** and model selection (Day 05).
