# Guided Project: Feature Engineering for Logistic Regression

**Student Version (with explanations, examples, and TODO tasks)**  
Date: 2025-08-20

---


## 🎯 Learning Objectives
- Understand why **feature engineering (FE)** matters in ML.  
- Practice FE steps: handling missing values, encoding, scaling, and feature selection.  
- Use **Logistic Regression** (an algorithm you already know) to test the impact of FE choices.  

Each section follows: **Example (done for you) → TODO (your turn)**.  


## 1) Setup & Data Loading


import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

# Load data
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name="target")
X, y = shuffle(X, y, random_state=42)

X.head()


## 2) Baseline Model
**Example:** simple imputer + logistic regression.


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

baseline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression(max_iter=1000))
])
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, y_pred))


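**TODO:** Try `SimpleImputer(strategy='median')` and compare results. The original dataset has no missing values, so the imputer changes nothing here and the accuracy should be identical; the choice of strategy only matters once values are missing (Section 3).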

## 3) Handling Missing Values
**Example:** inject missing values and impute with mean.


rng = np.random.default_rng(0)
X_miss = X.copy()
col = "mean radius"
n = len(X_miss)
idx = rng.choice(n, size=int(0.05*n), replace=False)
X_miss.loc[X_miss.index[idx], col] = np.nan

pipe_mean = Pipeline([
    ("imp", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression(max_iter=1000))
])
Xtr, Xte, ytr, yte = train_test_split(X_miss, y, test_size=0.2, random_state=42, stratify=y)
pipe_mean.fit(Xtr, ytr)
print("Mean imputation acc:", accuracy_score(yte, pipe_mean.predict(Xte)))


**TODO:** Replace imputer with `strategy='median'` and check if accuracy changes.
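One way to run this comparison is to loop over both strategies on the same split. This is a minimal sketch, not a reference solution: it repeats the missing-value injection from the example above, and the `impute_scores` dictionary is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# Inject 5% missing values into one column, as in the example above.
rng = np.random.default_rng(0)
X_miss = X.copy()
idx = rng.choice(len(X_miss), size=int(0.05 * len(X_miss)), replace=False)
X_miss.loc[X_miss.index[idx], "mean radius"] = np.nan

Xtr, Xte, ytr, yte = train_test_split(
    X_miss, y, test_size=0.2, random_state=42, stratify=y
)

# Fit one pipeline per imputation strategy on the same train/test split.
impute_scores = {}  # hypothetical name: strategy -> test accuracy
for strategy in ("mean", "median"):
    pipe = Pipeline([
        ("imp", SimpleImputer(strategy=strategy)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(Xtr, ytr)
    impute_scores[strategy] = accuracy_score(yte, pipe.predict(Xte))

print(impute_scores)
```

With only 5% of one column missing, the two strategies may well give the same accuracy; a larger missing fraction or a more skewed column is more likely to show a difference.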

## 4) Encoding Categorical Variables + Scaling
**Example:** bin `mean radius` into three quantile-based categories and one-hot encode (OHE) the result.


X_enc = X.copy()
X_enc['radius_cat'] = pd.qcut(X_enc['mean radius'], q=3, labels=['small','medium','large'])
cat_cols = ['radius_cat']
num_cols = [c for c in X_enc.columns if c not in cat_cols]

pre_no_scale = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", SimpleImputer(strategy="mean"), num_cols)
])
pipe = Pipeline([("pre", pre_no_scale), ("clf", LogisticRegression(max_iter=2000))])
Xtr, Xte, ytr, yte = train_test_split(X_enc, y, test_size=0.2, random_state=42, stratify=y)
pipe.fit(Xtr, ytr)
print("OHE no scaling acc:", accuracy_score(yte, pipe.predict(Xte)))


**TODO:** Add `StandardScaler()` for numeric columns and compare accuracy vs without scaling.
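A possible shape for this comparison is sketched below. It assumes the scaler is chained after the imputer inside a nested `Pipeline` for the numeric columns; the helper `make_pipe` and the `scale_scores` dictionary are hypothetical names introduced for illustration. The binned column is cast to plain strings here to keep the encoder input simple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = load_breast_cancer()
X_enc = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# Same binning as the example above, stored as plain strings.
X_enc["radius_cat"] = pd.qcut(
    X_enc["mean radius"], q=3, labels=["small", "medium", "large"]
).astype(str)
cat_cols = ["radius_cat"]
num_cols = [c for c in X_enc.columns if c not in cat_cols]


def make_pipe(scale):
    """Build the OHE pipeline, optionally scaling the numeric columns."""
    num_steps = [("imp", SimpleImputer(strategy="mean"))]
    if scale:
        num_steps.append(("sc", StandardScaler()))
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", Pipeline(num_steps), num_cols),
    ])
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=2000))])


Xtr, Xte, ytr, yte = train_test_split(
    X_enc, y, test_size=0.2, random_state=42, stratify=y
)

scale_scores = {}  # hypothetical name: variant -> test accuracy
for name, scale in (("no_scaling", False), ("standard_scaler", True)):
    pipe = make_pipe(scale)
    pipe.fit(Xtr, ytr)
    scale_scores[name] = accuracy_score(yte, pipe.predict(Xte))

print(scale_scores)
```

Because the scaler sits inside the pipeline, its mean and standard deviation are learned from the training split only, which avoids leaking test-set statistics.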

## 5) Feature Selection
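**Example:** `SelectKBest` with `chi2`, k=20. The chi-squared score requires non-negative feature values, which is why the features are scaled with `MinMaxScaler` before selection.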


pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer()), ("mms", MinMaxScaler())]), X.columns)
])
pipe = Pipeline([("pre", pre),
                 ("sel", SelectKBest(score_func=chi2, k=20)),
                 ("clf", LogisticRegression(max_iter=2000))])
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe.fit(Xtr, ytr)
print("Acc with k=20:", accuracy_score(yte, pipe.predict(Xte)))


**TODO:** Try k=5, 10, and 30 and note which works best. The dataset has only 30 features, so `k` must not exceed 30.
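A loop over several `k` values is one way to run this comparison. The sketch below uses a flat `Pipeline` instead of the `ColumnTransformer` from the example, which is equivalent here because every column is numeric. The chosen values of `k` are illustrative and stay at or below the 30 features in the dataset; `k_scores` is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

k_scores = {}  # hypothetical name: k -> test accuracy
for k in (5, 10, 20, 30):
    pipe = Pipeline([
        ("imp", SimpleImputer()),
        ("mms", MinMaxScaler()),  # chi2 needs non-negative inputs
        ("sel", SelectKBest(score_func=chi2, k=k)),
        ("clf", LogisticRegression(max_iter=2000)),
    ])
    pipe.fit(Xtr, ytr)
    k_scores[k] = accuracy_score(yte, pipe.predict(Xte))

for k, acc in k_scores.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```

A single train/test split is noisy, so small differences between values of `k` should be read with caution.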

## 6) Mini Experiments
**Example:** log-transform one numeric feature.


X_log = X.copy()
X_log['mean area'] = np.log1p(X_log['mean area'])
Xtr, Xte, ytr, yte = train_test_split(X_log, y, test_size=0.2, random_state=42, stratify=y)
pipe = Pipeline([("imp", SimpleImputer()), ("clf", LogisticRegression(max_iter=2000))])
pipe.fit(Xtr, ytr)
print("Acc with log(mean area):", accuracy_score(yte, pipe.predict(Xte)))



**TODO (pick 2+):**
- Try another log-transform, or bin a variable and one-hot encode it.  
- Compare StandardScaler vs MinMaxScaler.  
- Add `class_weight='balanced'` in LogisticRegression.  

Record your results in a small DataFrame.  
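One possible way to collect results is to keep the candidate pipelines in a dictionary and append one row per experiment. The sketch below is illustrative only: the experiment names, the three variants chosen, and the `results` DataFrame layout are assumptions, not a required format.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical experiment names mapped to the pipelines they evaluate.
experiments = {
    "standard_scaler": Pipeline([
        ("sc", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000)),
    ]),
    "minmax_scaler": Pipeline([
        ("sc", MinMaxScaler()),
        ("clf", LogisticRegression(max_iter=2000)),
    ]),
    "standard_scaler_balanced": Pipeline([
        ("sc", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
    ]),
}

rows = []
for name, pipe in experiments.items():
    pipe.fit(Xtr, ytr)
    rows.append({
        "experiment": name,
        "accuracy": accuracy_score(yte, pipe.predict(Xte)),
    })

# One row per experiment, best accuracy first.
results = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Keeping every experiment on the same split makes the rows directly comparable, which helps when answering the reflection questions.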


## 7) Reflection
- Which FE step helped most?
- Which had little/no effect?
- What would you try next?


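## ✅ What You Learned
- FE choices can affect results as much as the choice of model.  
- Logistic Regression can benefit from proper preprocessing, particularly scaling, which helps the solver converge; how large the gain is depends on the dataset.  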
- You practiced: imputation, encoding, scaling, feature selection.  

---
### 📊 Grading Rubric (100 points)
- Baseline model (10)  
- Missing values (15)  
- Encoding + scaling (20)  
- Feature selection (20)  
- Experiments (20)  
- Reflection (15)  
