# Day 03 — Decision Trees + Comparison

Decision trees are intuitive and flexible, but an unconstrained tree can overfit the training data.

We will cover:
- How decision trees split data
- Gini/entropy intuition
- Controlling depth to avoid overfitting
- Comparing a decision tree to logistic regression


## 1) Load a real dataset
We will use the Breast Cancer Wisconsin dataset from scikit-learn.


In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X.head()


## 2) Train a decision tree
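A decision tree recursively splits the data on one feature threshold at a time, choosing each split so that the resulting regions are as “pure” as possible, meaning each region is dominated by a single class.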
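
To make "purity" concrete, here is a minimal sketch of the two impurity measures usually used for classification trees, Gini impurity and entropy. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are illustrative assumptions, not part of scikit-learn, and the functions do not handle empty inputs. A split is better when the size-weighted impurity of its children is lower.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over classes."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini):
    """Impurity of a candidate split, weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Toy parent node (hypothetical labels): four samples of each class.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that leaves both children as mixed as the parent.
mixed = weighted_impurity([0, 0, 1, 1], [0, 0, 1, 1])
# A split that separates the classes perfectly.
clean = weighted_impurity([0, 0, 0, 0], [1, 1, 1, 1])

print(gini(parent), entropy(parent))  # 0.5 1.0
print(mixed, clean)                   # 0.5 0.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why they tend to pick similar splits in practice.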


In [None]:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))

train_acc, test_acc
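
A gap between training and test accuracy is easier to interpret next to the size of the fitted tree. The sketch below refits the same unconstrained tree in a self-contained way and reports its depth and leaf count with `get_depth()` and `get_n_leaves()`. The variable names (`X_full`, `full_tree`, and so on) are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split settings as the notebook, reloaded to keep this cell standalone.
X_full, y_full = load_breast_cancer(return_X_y=True)
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42, stratify=y_full
)

# No max_depth: the tree keeps splitting until it cannot improve purity further.
full_tree = DecisionTreeClassifier(random_state=42).fit(Xf_train, yf_train)

full_train_acc = full_tree.score(Xf_train, yf_train)
full_test_acc = full_tree.score(Xf_test, yf_test)

print("depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("train:", full_train_acc, "test:", full_test_acc)
```

If the training accuracy is at or near 1.0 while the test accuracy is lower, the tree has fit details of the training set that do not carry over to new data.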


## 3) Control overfitting with max_depth
Decision trees can memorize the training set. Limiting depth often improves generalization, at the cost of some training accuracy.


In [None]:
depths = list(range(1, 11))
train_scores = []
test_scores = []

for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    model.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

pd.DataFrame({"depth": depths, "train_acc": train_scores, "test_acc": test_scores})
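
The table above is useful for building intuition, but picking `max_depth` by looking at test accuracy lets information from the test set leak into model selection. One common alternative, sketched here as an assumption rather than as this course's prescribed method, is to choose the depth with cross-validation on the training split only and touch the test set once at the end. The names `cv_means`, `best_depth`, and `final_tree` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload the data so this cell runs on its own (same split settings as the notebook).
X_cv, y_cv = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=42, stratify=y_cv
)

# Mean 5-fold cross-validated accuracy for each candidate depth, training data only.
cv_means = {}
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(candidate, Xc_train, yc_train, cv=5, scoring="accuracy")
    cv_means[depth] = scores.mean()

# Pick the depth with the best mean CV score (ties go to the shallower tree).
best_depth = max(cv_means, key=cv_means.get)

# Refit on the full training split and evaluate once on the held-out test set.
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_tree.fit(Xc_train, yc_train)
final_test_acc = final_tree.score(Xc_test, yc_test)

print("best depth:", best_depth, "test accuracy:", final_test_acc)
```

With a dataset this small the cross-validated scores for neighboring depths can be very close, so the selected depth should be read as a reasonable choice rather than a uniquely correct one.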


## 4) Visualize a small tree
A shallow tree is small enough to read directly: each node in the plot shows its split rule, impurity, sample count, and class distribution.


In [None]:
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
small_tree.fit(X_train, y_train)

plt.figure(figsize=(12, 6))
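# Pass plain lists: some scikit-learn versions reject NumPy arrays for these arguments.
plot_tree(
    small_tree,
    feature_names=list(cancer.feature_names),
    class_names=list(cancer.target_names),
    filled=True,
)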
plt.show()
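
When a figure is inconvenient (for example in a terminal or a log file), the same rules can be printed as text with `sklearn.tree.export_text`. This is a small self-contained sketch; `text_tree` and `rules` are illustrative names, and the tree is refit here with the same settings as `small_tree` so the cell runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Same settings as the plotted tree: depth capped at 3.
text_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
text_tree.fit(Xt_train, yt_train)

# One line per node: split conditions on the way down, predicted class at the leaves.
rules = export_text(text_tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented branch corresponds to one path from the root to a leaf, so a prediction can be traced by following the conditions that a sample satisfies.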


## 5) Compare with logistic regression
Logistic regression is a linear model, which can be a strong baseline.


In [None]:
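# max_iter is raised from the default of 100 to give the solver more room to converge
# on unscaled features; a ConvergenceWarning may still appear.
logreg = LogisticRegression(max_iter=1000)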
logreg.fit(X_train, y_train)

logreg_acc = accuracy_score(y_test, logreg.predict(X_test))
small_tree_acc = accuracy_score(y_test, small_tree.predict(X_test))

logreg_acc, small_tree_acc
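
The comparison above feeds raw features to both models. Tree splits are thresholds on single features, so they are unaffected by feature scale, whereas regularized logistic regression is sensitive to it. A fairer baseline, sketched here under that assumption, standardizes the features inside a pipeline so the scaler is fit on the training split only. The names `scaled_logreg` and `scaled_logreg_acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reload the data so this cell runs on its own (same split settings as the notebook).
X_s, y_s = load_breast_cancer(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s
)

# The pipeline fits the scaler on training data, then applies it before the classifier.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(Xs_train, ys_train)

scaled_logreg_acc = accuracy_score(ys_test, scaled_logreg.predict(Xs_test))
print("scaled logistic regression test accuracy:", scaled_logreg_acc)
```

Keeping the scaler inside the pipeline also means it can be passed to `cross_val_score` without leaking test-fold statistics into the scaling step.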


## 6) What to do next
Next we’ll focus on **feature engineering**, which often yields bigger gains than model changes.
