# 🌳 Decision Tree Pruning with Cost-Complexity Alpha (`ccp_alpha`)
This notebook walks through how to prune a Decision Tree Classifier using cost-complexity pruning in `scikit-learn`. We'll:
- Train a decision tree
- Get the pruning path
- Cross-validate to find the best `ccp_alpha`
- Plot the results
- Train the final tree using the best `ccp_alpha`
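
For reference, here is the quantity that `ccp_alpha` controls, following the description of minimal cost-complexity pruning in the scikit-learn documentation. The symbols $R$, $\widetilde{T}$ and $T_t$ are notation introduced here for illustration; they are not names used in the code below.

$$
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|,
\qquad
\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
$$

Here $R(T)$ is the total impurity of the leaves of tree $T$, $|\widetilde{T}|$ is its number of leaves, and $T_t$ is the subtree rooted at an internal node $t$. The node with the smallest effective alpha is the "weakest link" and is pruned first, so a larger `ccp_alpha` removes more of the tree.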

In [None]:
# Step 1: Import libraries and load the Iris example data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import numpy as np

# Load the data and hold out a test set (train_test_split keeps 25% for testing by default)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# Step 2: Generate cost-complexity pruning path
# ----------------------------------------------------
# Create an unfitted Decision Tree Classifier.
# Setting random_state=0 ensures consistent results across runs.
clf = DecisionTreeClassifier(random_state=0)

# Generate the pruning path from the training data (this fits a full tree internally).
# It returns the effective ccp_alpha values and the total leaf impurity of the subtree at each one.
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Extract the array of ccp_alpha values from the pruning path.
# Values are in increasing order: larger alphas prune more aggressively.
ccp_alphas = path.ccp_alphas

# Extract the total leaf impurity of the pruned tree at each alpha.
# Impurity rises as the tree is simplified.
impurities = path.impurities

# Display the alpha values to inspect pruning options
print("Available ccp_alpha values:", ccp_alphas)
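
To make the pruning path more concrete, the sketch below fits one tree per alpha and reports how many nodes survive. It is a minimal, self-contained illustration that reloads the same Iris split; the names `alphas` and `node_counts` and the clamp to zero are assumptions added here, not part of the original steps.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Defensive clamp (an assumption, not in the original notebook):
# ccp_alpha must be non-negative, so guard against rounding noise below zero.
alphas = np.maximum(path.ccp_alphas, 0.0)

node_counts = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    node_counts.append(tree.tree_.node_count)
    print(f"alpha={alpha:.5f}  nodes={tree.tree_.node_count}  depth={tree.get_depth()}")
```

The node count should shrink as alpha grows, and the largest alpha typically collapses the tree to very few nodes, which is why some workflows skip that last value when searching.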

In [None]:
# Step 3: Cross-validate to find the best alpha
# ----------------------------------------------------
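# We loop through each alpha value from the pruning path,
# build a decision tree with that alpha, and run 5-fold cross-validation on the training set.
# cross_val_score fits clones of the estimator, so the objects stored in `clfs` stay unfitted.
# We keep the mean CV accuracy and the (unfitted) estimator for each alpha.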
clfs = []
alpha_scores = []

for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    clfs.append(clf)
    alpha_scores.append(np.mean(scores))

# Display alpha and scores
for a, s in zip(ccp_alphas, alpha_scores):
    print(f"Alpha: {a:.5f}  |  CV Accuracy: {s:.4f}")

# 🧠 Pick the alpha that gives the highest CV accuracy
# np.argmax returns the first maximum, so ties go to the smallest alpha (least pruning).
best_alpha_index = np.argmax(alpha_scores)  # index of highest accuracy
best_alpha = ccp_alphas[best_alpha_index]   # corresponding best alpha
print(f"\n✅ Best alpha based on cross-validation: {best_alpha:.5f}")
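
Several alphas often share the top CV accuracy on a small dataset. If a simpler tree is preferred in that case, one option is to take the largest tied alpha rather than the first one. The helper below is a hypothetical sketch of that choice (`pick_largest_best_alpha` and `tol` are names invented here), and the toy numbers are illustrative values, not results from this notebook.

```python
import numpy as np

def pick_largest_best_alpha(alphas, scores, tol=1e-12):
    """Return the largest alpha whose score is within `tol` of the best score."""
    alphas = np.asarray(alphas, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Indices of every alpha that ties (up to tol) with the best score.
    tied = np.flatnonzero(scores >= scores.max() - tol)
    return float(alphas[tied].max())

# Toy values for illustration only.
toy_alphas = [0.0, 0.01, 0.02, 0.3]
toy_scores = [0.95, 0.96, 0.96, 0.60]

first_best = toy_alphas[int(np.argmax(toy_scores))]          # what np.argmax selects
chosen = pick_largest_best_alpha(toy_alphas, toy_scores)     # most-pruned tied alpha
print(first_best, chosen)
```

In the notebook this could be called as `pick_largest_best_alpha(ccp_alphas, alpha_scores)`; whether to favor the simpler tree is a judgment call rather than a rule.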

In [None]:
# Step 4: Plot accuracy vs alpha
plt.figure(figsize=(10, 6))
plt.plot(ccp_alphas, alpha_scores, marker='o', drawstyle="steps-post")
plt.xlabel("ccp_alpha")
plt.ylabel("Mean Cross-Validation Accuracy")
plt.title("Cross-Validation Accuracy vs CCP Alpha")
plt.grid(True)
plt.show()

In [None]:
# Step 5: Train final model with best alpha
# ---------------------------------------------
# Now that we've found the best alpha from cross-validation,
# we fit a final decision tree with that alpha on the full training set
# and evaluate it once on the held-out test set.
final_clf = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
final_clf.fit(X_train, y_train)

print(f"Best alpha: {best_alpha:.5f}")
print(f"Final model accuracy on test set: {final_clf.score(X_test, y_test):.4f}")
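
As a sanity check on what pruning bought us, it can help to put the pruned tree next to an unpruned one. The sketch below is self-contained and uses a hypothetical `example_alpha = 0.02` purely for illustration; in the notebook you would substitute `best_alpha`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical value for illustration; replace with best_alpha from Step 3.
example_alpha = 0.02

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=example_alpha).fit(X_train, y_train)

# Compare tree size and held-out accuracy side by side.
for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(
        f"{name}: nodes={model.tree_.node_count}, "
        f"leaves={model.get_n_leaves()}, "
        f"test accuracy={model.score(X_test, y_test):.4f}"
    )
```

A pruned tree with similar test accuracy and fewer nodes is usually the easier one to interpret, though on a dataset as small as Iris the accuracy gap may be a sample or two either way.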