# Case Study: Decision Tree Pruning for Iris Dataset

In this notebook, we explore **pre-pruning** and **post-pruning** in decision trees. Using the Iris dataset, we train one tree with pre-pruning (limits on the maximum depth and on the minimum number of samples required to split a node) and one tree with post-pruning (cost-complexity pruning). We then compare the two models by their training and testing accuracy.

In [None]:
# Import the libraries used in this notebook
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

### Load Dataset

We will use the Iris dataset, a classic dataset for classification tasks. It contains 150 samples of iris flowers, categorized into three species: setosa, versicolor, and virginica, based on four features.

In [None]:
# Load dataset
data = load_iris()
X = data.data
y = data.target
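
As a quick sanity check on the description above, the following sketch prints the dataset's shape, feature names, class names, and class balance. It reloads the data so that it can run on its own; `np.bincount` is used here only as a convenient way to count samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

# Shape is (n_samples, n_features)
print("Samples, features:", X.shape)
print("Feature names:", data.feature_names)
print("Class names:", list(data.target_names))

# Number of samples in each of the three classes
print("Samples per class:", np.bincount(y))
```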

### Split Data into Training and Test Sets

We will split the data into training (80%) and testing (20%) sets for evaluation.

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
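
The split above is random, so the three species are not guaranteed to appear in equal proportions in the test set. The sketch below checks the resulting sizes and class counts. It also shows an optional variation that is not part of the original notebook: passing `stratify=y` so that each class keeps the same proportion in both sets. The `_s` variable names are introduced here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the notebook: 80% train, 20% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train/test sizes:", len(X_train), len(X_test))
print("Test class counts:", np.bincount(y_test, minlength=3))

# Optional variation (an assumption, not in the original notebook):
# stratify on y so class proportions match in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Stratified test class counts:", np.bincount(y_test_s, minlength=3))
```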

### Pre-pruning Decision Tree

Pre-pruning is done by setting restrictions during tree construction, such as limiting the maximum depth or the minimum number of samples required to split a node. Here, we set the maximum depth to 3 and the minimum number of samples required to split a node to 5.

In [None]:
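# Train a decision tree classifier with pre-pruning
# (random_state fixes tie-breaking between equally good splits, so reruns give the same tree)
clf_preprune = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
clf_preprune.fit(X_train, y_train)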
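
To see what the depth and split restrictions actually produce, one option is to print the fitted tree as text rules with `export_text`. The sketch below rebuilds the same pre-pruned model so that it is self-contained; `random_state=42` on the classifier is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Same pre-pruning settings as above; random_state is an added assumption
clf_preprune = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
clf_preprune.fit(X_train, y_train)

# The depth can never exceed max_depth, which also bounds the number of leaves
print("Depth:", clf_preprune.get_depth())
print("Leaves:", clf_preprune.get_n_leaves())

# Human-readable if/else rules learned by the tree
rules = export_text(clf_preprune, feature_names=data.feature_names)
print(rules)
```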

### Evaluate Pre-pruned Model

We evaluate the performance of the pre-pruned decision tree on both the training and testing sets.

In [None]:
# Evaluate pre-pruned model
train_acc_preprune = clf_preprune.score(X_train, y_train)
test_acc_preprune = clf_preprune.score(X_test, y_test)

train_acc_preprune, test_acc_preprune
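
With only 30 test samples, a single test accuracy can move noticeably with the split. As a complementary estimate, the sketch below runs 5-fold cross-validation on the training set with `cross_val_score`. The choice of 5 folds and the classifier's `random_state` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same pre-pruning settings as above; random_state is an added assumption
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)

# Accuracy on each of the 5 held-out folds of the training data
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
print("Fold accuracies:", cv_scores)
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```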

### Post-pruning Decision Tree (Cost-Complexity Pruning)

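Post-pruning first grows the full tree and then removes branches that contribute little to the model's predictive performance. scikit-learn implements this as minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter: larger values prune more aggressively, and the default of 0 disables pruning. Here, we set `ccp_alpha=0.01`.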

In [None]:
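# Perform post-pruning using cost-complexity pruning
# (random_state fixes tie-breaking between equally good splits, so reruns give the same tree)
clf_postprune = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
clf_postprune.fit(X_train, y_train)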
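
The value `ccp_alpha=0.01` is a fixed choice. To see which values are meaningful for this data, scikit-learn can compute the pruning path with `cost_complexity_pruning_path`, which returns the effective alphas at which the tree shrinks. The sketch below refits one tree per alpha and reports its leaf count. Clipping the alphas at zero is a defensive assumption against floating-point noise, and `leaf_counts` is a name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Effective alphas of the pruning path and the total leaf impurity at each step
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values
impurities = path.impurities

# Refit one tree per alpha to see how the tree size changes
leaf_counts = []
for alpha, impurity in zip(ccp_alphas, impurities):
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaf impurity={impurity:.4f}  leaves={leaf_counts[-1]}")
```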

### Evaluate Post-pruned Model

We evaluate the performance of the post-pruned decision tree on both the training and testing sets.

In [None]:
# Evaluate post-pruned model
train_acc_postprune = clf_postprune.score(X_train, y_train)
test_acc_postprune = clf_postprune.score(X_test, y_test)

train_acc_postprune, test_acc_postprune
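
Rather than fixing `ccp_alpha` by hand, one common approach is to pick it by cross-validation on the training set. The sketch below is one way to do that under stated assumptions: candidates come from the pruning path, the largest alpha is dropped because it typically collapses the tree to a single node, and 5 folds are used. The names `candidates`, `mean_scores`, and `best_alpha` are introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas from the pruning path (clipped at 0, duplicates removed)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
candidates = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Mean 5-fold accuracy on the training data for each candidate
mean_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha),
        X_train, y_train, cv=5,
    ).mean()
    for alpha in candidates
]

# Keep the alpha with the best cross-validated accuracy and refit on all training data
best_alpha = candidates[int(np.argmax(mean_scores))]
best_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
best_tree.fit(X_train, y_train)

print(f"Best ccp_alpha: {best_alpha:.4f}")
print(f"Test accuracy with best alpha: {best_tree.score(X_test, y_test):.3f}")
```

The test set is only touched once, after the alpha has been chosen, so the reported test accuracy is not used for model selection.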

### Conclusion

The accuracies printed below let us compare the two approaches. Pre-pruning limits model complexity during construction by capping the depth and requiring a minimum number of samples before a node can be split. Post-pruning instead grows the full tree first and then removes branches whose contribution does not justify their added complexity. A large gap between training and testing accuracy would suggest overfitting.

The output reports, for each approach:
- **Pre-pruning**: Train and test accuracy
- **Post-pruning**: Train and test accuracy

In [None]:
# Print both results with three decimal places for easier comparison
print(f"Pre-pruning: Train Accuracy = {train_acc_preprune:.3f}, Test Accuracy = {test_acc_preprune:.3f}")
print(f"Post-pruning: Train Accuracy = {train_acc_postprune:.3f}, Test Accuracy = {test_acc_postprune:.3f}")
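
To make the effect on model complexity concrete, it can help to put an unpruned tree next to the two pruned ones and report size alongside accuracy. The sketch below does this. The unpruned baseline and the `random_state` values are additions for illustration and are not part of the original comparison.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The unpruned baseline is an addition; the other two use the notebook's settings
models = {
    "Unpruned": DecisionTreeClassifier(random_state=42),
    "Pre-pruned": DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42),
    "Post-pruned": DecisionTreeClassifier(ccp_alpha=0.01, random_state=42),
}

# For each model, record (leaves, depth, train accuracy, test accuracy)
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (
        model.get_n_leaves(),
        model.get_depth(),
        model.score(X_train, y_train),
        model.score(X_test, y_test),
    )
    leaves, depth, train_acc, test_acc = results[name]
    print(f"{name:<12} leaves={leaves:<3} depth={depth:<2} "
          f"train={train_acc:.3f} test={test_acc:.3f}")
```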