<a href="https://colab.research.google.com/github/s34836/EWD/blob/main/lab12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab - Trees

## Tasks

1. Load the `Carseats.csv` dataset. Drop the `Sales` column and replace it with a categorical column `SalesHigh`, which should take the value `Yes` if `Sales >= 8` and `No` otherwise. Use decision trees to predict the value of `SalesHigh` based on the other variables.
    - Divide the data into training, validation, and test sets (or use cross-validation instead of a validation set).
    - Fit a [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) model.
    - Apply pruning to reduce the size of the tree. Generate `ccp_alpha` values with the `cost_complexity_pruning_path()` method. Find the best `ccp_alpha` using the validation set or cross-validation (e.g. [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
    - Compare the pruned and unpruned trees. How did pruning affect the quality of predictions? How did it affect the size of the model? Compare tree sizes using the methods `get_depth()` and `get_n_leaves()`.
    - Fit a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and compare it to single-tree models.
    - Select the best decision tree, taking into account both prediction accuracy and model size. Visualize the tree using the method [`plot_tree()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html). Use the test set to evaluate the model.
2. Use the [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) and [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) models to predict the value of `medv` based on the other variables in the `boston.csv` dataset. Follow the steps from Task 1.

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('Carseats.csv')
# Binary target: 'Yes' when Sales >= 8, otherwise 'No'
df['SalesHigh'] = np.where(df['Sales'] >= 8, 'Yes', 'No')
df = df.drop('Sales', axis=1)

print(df.head())

   CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0        138      73           11         276    120       Bad   42   
1        111      48           16         260     83      Good   65   
2        113      35           10         269     80    Medium   59   
3        117     100            4         466     97    Medium   55   
4        141      64            3         340    128       Bad   38   

   Education Urban   US SalesHigh  
0         17   Yes  Yes       Yes  
1         10   Yes  Yes       Yes  
2         12   Yes  Yes       Yes  
3         14   Yes  Yes        No  
4         13   Yes   No        No  


In [12]:

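# Hold out 20% for testing, then split the remaining 80% into 75/25,
# which yields a 60/20/20 train/validation/test split.
train_val_data, test_data = train_test_split(df, test_size=0.2, random_state=1)
train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=1)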

print(f"Full dataset shape: {df.shape}")
print(f"Training set shape: {train_data.shape}")
print(f"Validation set shape: {val_data.shape}")
print(f"Test set shape: {test_data.shape}")


Full dataset shape: (400, 11)
Training set shape: (240, 11)
Validation set shape: (80, 11)
Test set shape: (80, 11)
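
The split above is not stratified, so the share of `Yes` labels can drift between the three sets. A minimal sketch of a stratified variant follows. It uses a hypothetical stand-in frame named `demo` with an assumed 40/60 class balance, not the real Carseats data; the `stratify` argument of `train_test_split` is the only change to the technique.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Carseats frame: 400 rows, assumed 40% 'Yes'.
demo = pd.DataFrame({
    "Price": range(400),
    "SalesHigh": ["Yes"] * 160 + ["No"] * 240,
})

# Hold out 20% for testing, then take 25% of the remaining 80%
# (0.25 * 0.8 = 0.2 of the full data) for validation.
train_val, test = train_test_split(
    demo, test_size=0.2, random_state=1, stratify=demo["SalesHigh"]
)
train, val = train_test_split(
    train_val, test_size=0.25, random_state=1, stratify=train_val["SalesHigh"]
)

# Report the size and the share of 'Yes' labels in each part.
for name, part in [("train", train), ("val", val), ("test", test)]:
    share_yes = (part["SalesHigh"] == "Yes").mean()
    print(f"{name}: {len(part)} rows, share of Yes = {share_yes:.3f}")
```

With only 80 rows in the validation and test sets, a stable class balance makes accuracy figures easier to compare across sets.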


In [13]:
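# Separate the features from the target
X_train = train_data.drop('SalesHigh', axis=1)
y_train = train_data['SalesHigh']

# One-hot encode the categorical predictors (ShelveLoc, Urban, US)
X_train = pd.get_dummies(X_train, drop_first=True)

# Fit an unpruned tree as the baseline
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)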

# Print model information
print("Decision Tree model has been fitted.")
print(f"Number of nodes: {dt_model.tree_.node_count}")
print(f"Depth of tree: {dt_model.get_depth()}")

Decision Tree model has been fitted.
Number of nodes: 81
Depth of tree: 10
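
Calling `pd.get_dummies` separately on each split can produce different columns when a split happens to lack one of the categories, and with `drop_first=True` it can even drop a different reference level. One way to guard against this, sketched below under assumptions, is to fix the category levels from the training split before encoding. The toy frames and the `encode` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy splits; the validation split happens to contain no "Bad" rows.
train_part = pd.DataFrame({"Price": [120, 83, 97],
                           "ShelveLoc": ["Bad", "Good", "Medium"]})
val_part = pd.DataFrame({"Price": [128, 80],
                         "ShelveLoc": ["Good", "Medium"]})

# Learn the category levels from the training split only.
levels = {"ShelveLoc": sorted(train_part["ShelveLoc"].unique())}

def encode(frame, levels):
    """One-hot encode `frame` using fixed category levels (hypothetical helper)."""
    out = frame.copy()
    for col, cats in levels.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    return pd.get_dummies(out, drop_first=True)

X_tr = encode(train_part, levels)
X_va = encode(val_part, levels)

# Naive per-split encoding, shown for contrast: "Good" becomes the dropped level.
naive_va = pd.get_dummies(val_part, drop_first=True)

print(list(X_tr.columns))
print(list(X_va.columns))
print(list(naive_va.columns))
```

Encoding the full data frame once before splitting is a simpler alternative when all the data is available up front.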


In [14]:
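# Prepare validation data the same way as training data
X_val = val_data.drop('SalesHigh', axis=1)
y_val = val_data['SalesHigh']
X_val = pd.get_dummies(X_val, drop_first=True)
# Align the columns with the training set so the tree sees identical features
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)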

# Get the cost complexity pruning path
path = dt_model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Remove the last alpha, which is the largest and prunes the tree down to the root node alone
ccp_alphas = ccp_alphas[:-1]

# Create trees with different alphas and evaluate on validation set
acc_scores = []
trees = []

for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    dt.fit(X_train, y_train)
    y_val_pred = dt.predict(X_val)
    acc = accuracy_score(y_val, y_val_pred)
    acc_scores.append(acc)
    trees.append(dt)

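# Find the best alpha. ccp_alphas is sorted in ascending order and np.argmax
# returns the first maximum, so ties resolve to the smallest alpha (largest tree).
best_alpha_idx = np.argmax(acc_scores)
best_alpha = ccp_alphas[best_alpha_idx]
best_tree = trees[best_alpha_idx]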

print(f"Best ccp_alpha: {best_alpha}")
print(f"Validation accuracy with best alpha: {acc_scores[best_alpha_idx]}")
print(f"Original tree nodes: {dt_model.tree_.node_count}, depth: {dt_model.get_depth()}")
print(f"Pruned tree nodes: {best_tree.tree_.node_count}, depth: {best_tree.get_depth()}")


Best ccp_alpha: 0.005263157894736845
Validation accuracy with best alpha: 0.8
Original tree nodes: 81, depth: 10
Pruned tree nodes: 59, depth: 10
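
Task 1 also allows choosing `ccp_alpha` by cross-validation instead of a single validation set, which tends to be less sensitive to how 80 validation rows happen to fall. The following is a minimal sketch of that alternative with `GridSearchCV`. It runs on synthetic data from `make_classification` as a stand-in (an assumption; the encoded Carseats features would take its place), and the names `X_demo`, `y_demo`, and `alphas` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): substitute the encoded training features.
X_demo, y_demo = make_classification(n_samples=320, n_features=10, random_state=0)

# Fully grown reference tree and its cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

# Drop the last alpha (root-only tree) and clip tiny negative values
# that can arise from floating-point rounding.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# 5-fold cross-validation over the candidate alphas.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_estimator_ is refit on all of X_demo with the selected alpha.
pruned = search.best_estimator_
print(f"Best ccp_alpha: {search.best_params_['ccp_alpha']:.5f}")
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Full tree leaves: {full_tree.get_n_leaves()}, pruned leaves: {pruned.get_n_leaves()}")
```

With this approach the training and validation sets can be merged before fitting, leaving the test set untouched for the final evaluation.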


In [15]:
# Compare pruned and unpruned trees

# Evaluate unpruned tree on validation set
y_val_pred_unpruned = dt_model.predict(X_val)
unpruned_accuracy = accuracy_score(y_val, y_val_pred_unpruned)

# Evaluate pruned tree on validation set
y_val_pred_pruned = best_tree.predict(X_val)
pruned_accuracy = accuracy_score(y_val, y_val_pred_pruned)

# Evaluate both models on test set
X_test = test_data.drop('SalesHigh', axis=1)
y_test = test_data['SalesHigh']
X_test = pd.get_dummies(X_test, drop_first=True)

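# Ensure the test set has the same columns, in the same order, as the training set;
# reindex adds any missing dummy column as zeros instead of raising a KeyError
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)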

y_test_pred_unpruned = dt_model.predict(X_test)
y_test_pred_pruned = best_tree.predict(X_test)

unpruned_test_accuracy = accuracy_score(y_test, y_test_pred_unpruned)
pruned_test_accuracy = accuracy_score(y_test, y_test_pred_pruned)

# Compare model sizes
print("\nComparison of Tree Sizes:")
print(f"Unpruned Tree - Depth: {dt_model.get_depth()}, Leaves: {dt_model.get_n_leaves()}, Nodes: {dt_model.tree_.node_count}")
print(f"Pruned Tree - Depth: {best_tree.get_depth()}, Leaves: {best_tree.get_n_leaves()}, Nodes: {best_tree.tree_.node_count}")
print(f"Reduction in Depth: {dt_model.get_depth() - best_tree.get_depth()} ({(1 - best_tree.get_depth()/dt_model.get_depth())*100:.1f}%)")
print(f"Reduction in Leaves: {dt_model.get_n_leaves() - best_tree.get_n_leaves()} ({(1 - best_tree.get_n_leaves()/dt_model.get_n_leaves())*100:.1f}%)")
print(f"Reduction in Nodes: {dt_model.tree_.node_count - best_tree.tree_.node_count} ({(1 - best_tree.tree_.node_count/dt_model.tree_.node_count)*100:.1f}%)")

print("\nComparison of Prediction Quality:")
print(f"Validation Set - Unpruned Accuracy: {unpruned_accuracy:.4f}, Pruned Accuracy: {pruned_accuracy:.4f}")
print(f"Test Set - Unpruned Accuracy: {unpruned_test_accuracy:.4f}, Pruned Accuracy: {pruned_test_accuracy:.4f}")

# Check whether pruning reduced overfitting by comparing the train-test accuracy gaps
y_train_pred_unpruned = dt_model.predict(X_train)
y_train_pred_pruned = best_tree.predict(X_train)
unpruned_train_accuracy = accuracy_score(y_train, y_train_pred_unpruned)
pruned_train_accuracy = accuracy_score(y_train, y_train_pred_pruned)

unpruned_overfit = unpruned_train_accuracy - unpruned_test_accuracy
pruned_overfit = pruned_train_accuracy - pruned_test_accuracy

print("\nOverfitting Analysis:")
print(f"Unpruned Tree - Train Accuracy: {unpruned_train_accuracy:.4f}, Test Accuracy: {unpruned_test_accuracy:.4f}, Difference: {unpruned_overfit:.4f}")
print(f"Pruned Tree - Train Accuracy: {pruned_train_accuracy:.4f}, Test Accuracy: {pruned_test_accuracy:.4f}, Difference: {pruned_overfit:.4f}")
if unpruned_overfit > pruned_overfit:
    print("The pruned tree shows less overfitting compared to the unpruned tree.")
else:
    print("The pruned tree does not show reduced overfitting compared to the unpruned tree.")


Comparison of Tree Sizes:
Unpruned Tree - Depth: 10, Leaves: 41, Nodes: 81
Pruned Tree - Depth: 10, Leaves: 30, Nodes: 59
Reduction in Depth: 0 (0.0%)
Reduction in Leaves: 11 (26.8%)
Reduction in Nodes: 22 (27.2%)

Comparison of Prediction Quality:
Validation Set - Unpruned Accuracy: 0.7750, Pruned Accuracy: 0.8000
Test Set - Unpruned Accuracy: 0.8000, Pruned Accuracy: 0.8125

Overfitting Analysis:
Unpruned Tree - Train Accuracy: 1.0000, Test Accuracy: 0.8000, Difference: 0.2000
Pruned Tree - Train Accuracy: 0.9750, Test Accuracy: 0.8125, Difference: 0.1625
The pruned tree shows less overfitting compared to the unpruned tree.
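
Task 1 also asks for a `RandomForestClassifier` compared against the single-tree models. A minimal sketch of that comparison follows, again on synthetic stand-in data from `make_classification` (an assumption; the accuracies it prints say nothing about Carseats). The choice of `n_estimators=200` is illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): substitute the encoded Carseats splits.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fit a single unpruned tree and a forest on the same training data.
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Compare on the validation split; keep the test set for the final model only.
tree_acc = accuracy_score(y_va, tree.predict(X_va))
forest_acc = accuracy_score(y_va, forest.predict(X_va))

# Model size: a forest stores many trees, so sum the leaves over its estimators.
forest_leaves = sum(est.get_n_leaves() for est in forest.estimators_)

print(f"Single tree - accuracy: {tree_acc:.4f}, leaves: {tree.get_n_leaves()}")
print(f"Random forest - accuracy: {forest_acc:.4f}, total leaves: {forest_leaves}")
```

Summing leaves over `forest.estimators_` makes the size trade-off explicit: any accuracy gain from the forest comes with a model that is far larger and harder to visualize than one pruned tree.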
