# Day 03 — Decision trees + comparison

Today I add a **decision tree classifier** and compare it to the Day 02 logistic regression baseline.
I’ll annotate each step so it reads like a mini-tutorial.


## What we are building
We will:
1. Load a clean dataset (Breast Cancer Wisconsin).
2. Train a logistic regression baseline.
3. Train a decision tree.
4. Compare accuracy, precision, and recall side by side.

This mirrors the Day 02 flow but adds a non-linear model.


## Imports
- `load_breast_cancer` gives us numeric features with a binary target.
- We keep metrics simple (accuracy/precision/recall) for quick comparison.


In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score


## Load data
The dataset is already cleaned. I still wrap it in a DataFrame so
column names and quick inspection are easy.


In [None]:
dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target
df.head()
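
A quick sanity check can back up the claim that the data is clean. The sketch below reloads the dataset so it runs on its own, then reports the shape, the number of missing cells, and how the target codes map to class names. The helper names (`n_missing`, `class_names`, `class_counts`) are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target

# Total number of missing cells across all columns.
n_missing = int(df.isna().sum().sum())

# target_names is ordered by target code: index 0 names code 0, index 1 names code 1.
class_names = list(dataset.target_names)
class_counts = df["target"].value_counts().sort_index()

print("shape:", df.shape)
print("missing values:", n_missing)
for code, count in class_counts.items():
    print(f"target={code} ({class_names[code]}): {count} rows")
```

In scikit-learn's copy of this dataset, code 0 is malignant and code 1 is benign, which matters when reading precision and recall.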


## Train/test split
I stratify on the target so the train and test splits keep the same class proportions as the full dataset.
Both models are then trained and evaluated on the same split, which keeps the comparison fair.


In [None]:
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
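
To see what stratification does, the class share can be compared across the full data and the two splits. This is a minimal, self-contained sketch using the same split settings as above; the `*_ratio` names are my own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The target is 0/1, so the mean is the share of rows with target == 1.
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"rows  -> train: {len(y_train)}, test: {len(y_test)}")
print(f"share of target == 1 -> full: {full_ratio:.3f}, "
      f"train: {train_ratio:.3f}, test: {test_ratio:.3f}")
```

With `stratify=y` the three shares should agree closely; dropping that argument lets them drift apart depending on the random seed.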


## Baseline: logistic regression
This is the Day 02 baseline. Logistic regression is a linear model,
so it’s a good reference before trying the tree.


In [None]:
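# max_iter is raised above the default of 100 because the features are
# unscaled, which makes the solver slow to converge.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
log_preds = log_reg.predict(X_test)

# precision_score and recall_score use pos_label=1 by default,
# so both are reported for the class coded as 1.
log_metrics = {
    "model": "logistic_regression",
    "accuracy": accuracy_score(y_test, log_preds),
    "precision": precision_score(y_test, log_preds),
    "recall": recall_score(y_test, log_preds),
}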
log_metrics
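
One detail worth keeping in mind when reading these numbers: `precision_score` and `recall_score` treat label 1 as the positive class by default, and in this dataset label 1 is benign. If the malignant class is the one of interest, `pos_label=0` flips the perspective. The sketch below is self-contained and repeats the baseline fit; the `malignant_*` and `benign_recall` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same baseline settings as the notebook; on unscaled features the solver
# may still print a ConvergenceWarning, which does not stop the fit.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
log_preds = log_reg.predict(X_test)

# Default behaviour: label 1 (benign) is the positive class.
benign_recall = recall_score(y_test, log_preds)

# Treat label 0 (malignant) as the positive class instead.
malignant_precision = precision_score(y_test, log_preds, pos_label=0)
malignant_recall = recall_score(y_test, log_preds, pos_label=0)

print(f"recall (benign as positive):       {benign_recall:.3f}")
print(f"precision (malignant as positive): {malignant_precision:.3f}")
print(f"recall (malignant as positive):    {malignant_recall:.3f}")
```

Recall with malignant as the positive class answers "what share of malignant cases did the model catch?", which is often the more relevant question for this kind of data.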


## Decision tree
A tree splits on one feature threshold at a time; stacking these splits lets it capture non-linear patterns and feature interactions.
I cap the depth at 4 to reduce overfitting for this quick demo.


In [None]:
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)

tree_metrics = {
    "model": "decision_tree",
    "accuracy": accuracy_score(y_test, tree_preds),
    "precision": precision_score(y_test, tree_preds),
    "recall": recall_score(y_test, tree_preds),
}
tree_metrics
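
The effect of the depth cap can be checked by comparing train and test accuracy for a capped tree and an unrestricted one, and the learned thresholds can be printed with `export_text`. This is a rough, self-contained sketch: the `results` dict, the `"unlimited"` label, and the extra depth-2 tree used only to keep the printed rules short are my assumptions, not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.2, random_state=42, stratify=dataset.target,
)

# Fit one capped tree and one tree that grows until its leaves are pure.
results = {}
for label, depth in [("depth_4", 4), ("unlimited", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    results[label] = {
        "fitted_depth": model.get_depth(),
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    }

for label, row in results.items():
    print(label, {k: round(v, 3) for k, v in row.items()})

# Print the split rules of a very shallow tree (depth 2 chosen for brevity).
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_train, y_train)
rules = export_text(small_tree, feature_names=list(dataset.feature_names))
print(rules)
```

A large gap between train and test accuracy for the unrestricted tree is the usual sign of overfitting; the printed rules show each split as a single `feature <= threshold` test.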


## Compare results
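Putting the metrics side by side makes it easy to see which model performs better
on this test split. With a single 80/20 split, small differences between the two
models should not be read as conclusive.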


In [None]:
# Index by model name and round so the rows are easier to compare.
pd.DataFrame([log_metrics, tree_metrics]).set_index("model").round(3)
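
Because the table above comes from one train/test split, a cross-validated comparison gives a steadier picture. The sketch below is one possible way to do that with `cross_validate`; the 5-fold `StratifiedKFold` setup and the `cv_results` table layout are my assumptions rather than something from Day 02.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same model settings as in the notebook.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

# Stratified folds keep the class balance in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall"]

rows = []
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    row = {"model": name}
    for metric in scoring:
        fold_scores = scores[f"test_{metric}"]
        row[f"{metric}_mean"] = fold_scores.mean()
        row[f"{metric}_std"] = fold_scores.std()
    rows.append(row)

cv_results = pd.DataFrame(rows).set_index("model")
print(cv_results.round(3))
```

If the difference between the two models is smaller than the fold-to-fold standard deviation, the single-split ranking is probably not reliable.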
