# Overfitting/Underfitting for Decision Tree
https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting

Assume we have a Decision Tree model.

-Overfitting: the MAE on the training data is low but the validation MAE is high. It happens when the tree is too deep, i.e. has too many leaves.

-Underfitting: the MAE is high on the training data and on the validation data too. It happens when the tree is too shallow, i.e. has too few leaves.

-How to overcome: vary the tree size (here through max_leaf_nodes) and keep the value that minimizes the validation MAE.
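The two failure modes above can be seen by printing the training MAE next to the validation MAE for a few tree sizes. The following is a minimal sketch, not part of the original notebook: the helper name `train_and_val_mae` and the candidate sizes `(2, 5, 50)` are assumptions chosen for illustration, and the data setup mirrors the iris cells further down.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same kind of setup as the cells below: iris, first two features, fixed seed.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X[:, :2], y, random_state=0)

def train_and_val_mae(max_leaf_nodes):
    """Hypothetical helper: return (training MAE, validation MAE) for one tree size."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    return train_mae, val_mae

# Candidate sizes are illustrative assumptions, from very small to large.
results = {n: train_and_val_mae(n) for n in (2, 5, 50)}
for n, (train_mae, val_mae) in results.items():
    print(f"max_leaf_nodes={n:<3d} train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

A large gap between the two columns (low training MAE, higher validation MAE) points to overfitting, while two similarly high values point to underfitting.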

In [22]:
# A utility function to compute the validation MAE for a given max_leaf_nodes
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    """Fit a tree capped at max_leaf_nodes leaves and return its MAE on the held-out data."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)  # Predict on the held-out features
    mae = mean_absolute_error(y_test, y_pred)  # Compare predictions with the held-out targets
    return mae

In [23]:
# Load the iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Use the first two features
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare the validation MAE for several values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print("Max Leaf nodes: %d \t \t Mean Absolute Error: %f" % (max_leaf_nodes, my_mae))

Max Leaf nodes: 5 	 	 Mean Absolute Error: 0.334646
Max Leaf nodes: 50 	 	 Mean Absolute Error: 0.342105
Max Leaf nodes: 500 	 	 Mean Absolute Error: 0.342105
Max Leaf nodes: 5000 	 	 Mean Absolute Error: 0.342105
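
Picking the best candidate from these results can be done in one line. This is a small sketch, not from the original notebook: the dictionary name `mae_by_leaf_nodes` is an assumption, and its values are copied by hand from the output printed above.

```python
# Validation MAE per candidate, copied from the printed output above.
mae_by_leaf_nodes = {5: 0.334646, 50: 0.342105, 500: 0.342105, 5000: 0.342105}

# min() iterates over the keys and ranks them by their MAE;
# on a tie it keeps the first key encountered.
best_leaf_nodes = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_leaf_nodes, mae_by_leaf_nodes[best_leaf_nodes])  # prints: 5 0.334646
```

In a real run the dictionary would more likely be built inside the loop, for example with a dict comprehension over the candidate values, rather than typed in by hand.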


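## Conclusion. Avoid over/underfitting by taking max_leaf_nodes = 5, the candidate with the lowest validation MAE

The values 50, 500 and 5000 all give the same MAE, which suggests the tree stops growing before it reaches 50 leaves on this small training set (112 rows), so raising the cap further changes nothing.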