### Q3.1 What does the max leaf nodes argument describe for a Decision Tree model?

The `max_leaf_nodes` argument of a Decision Tree model sets the maximum number of leaf (terminal) nodes the tree may have. In scikit-learn, when this argument is set the tree is grown best-first, splitting the node that gives the largest impurity reduction until the cap is reached; the default, `None`, places no limit on the number of leaves. Limiting the number of leaves controls the model's complexity and therefore the trade-off between overfitting and underfitting. Overfitting occurs when the model is too complex and captures noise in the training data, which leads to poor performance on new data. Underfitting occurs when the model is too simple to capture the underlying structure of the data, which leads to poor performance on both training and new data. Tuning `max_leaf_nodes` against a validation set is a way to find the point where the model is complex enough to capture the essential patterns without fitting the noise.
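To make the effect of the cap concrete, here is a minimal sketch that fits trees with different `max_leaf_nodes` values and reports how many leaves each one ends up with. The data is synthetic (a noisy sine curve invented for this illustration, not the housing data used below), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration only): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

leaf_counts = {}
for cap in [2, 8, 32, None]:  # None means no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0).fit(X, y)
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

Each capped tree should have no more leaves than its cap, while the uncapped tree is free to keep splitting until its leaves are pure, which is where overfitting tends to creep in.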

### Q3.2 Write a script which finds the optimal number of max leaf nodes for your Decision Tree model. What is the optimal number of max leaf nodes?

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the dataset
melbourne_data = pd.read_csv('melb_data.csv')

# Select the target variable and features ('Price' is assumed to be the target).
y = melbourne_data['Price']
features = ['Rooms', 'Distance']  # numeric columns only

# 'Type' is left out for simplicity. If it is categorical, it would typically be
# one-hot encoded (for example with pd.get_dummies()) before being added to the features.
X = melbourne_data[features]

# Split data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at `max_leaf_nodes` leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Define the range of `max_leaf_nodes` to test
max_leaf_nodes_options = [5, 50, 500, 5000]

# Initialize variables to store the minimum MAE and the optimal number of max_leaf_nodes
min_mae = float('inf')
best_max_leaf_nodes = None

# Find the optimal number of max_leaf_nodes
for max_leaf_nodes in max_leaf_nodes_options:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes} \t\t Mean Absolute Error:  {my_mae}")
    
    if my_mae < min_mae:
        min_mae = my_mae
        best_max_leaf_nodes = max_leaf_nodes

print(f"\nOptimal number of max leaf nodes: {best_max_leaf_nodes}")


Max leaf nodes: 5 		 Mean Absolute Error:  362208.44519981806
Max leaf nodes: 50 		 Mean Absolute Error:  320536.05856001494
Max leaf nodes: 500 		 Mean Absolute Error:  296622.1652119891
Max leaf nodes: 5000 		 Mean Absolute Error:  296046.56250849954

Optimal number of max leaf nodes: 5000
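
Note that 5000 is the largest value tried, and the improvement over 500 is only about 0.2% in MAE, so the result is best read as "the best of the four candidates" rather than a true optimum. A finer search could narrow it down. The sketch below shows one possible way to do that with log-spaced candidates; it runs on synthetic stand-in data (an assumption made so the example is self-contained), and the candidate range and the names `candidates`, `scores`, and `best` are hypothetical choices, not part of the original script.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 2))
y = 100 * X[:, 0] - 20 * X[:, 1] ** 2 + rng.normal(scale=50, size=2000)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at `max_leaf_nodes` leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Log-spaced candidates between 5 and 1000 (range chosen arbitrarily for this sketch).
candidates = sorted({int(round(n)) for n in np.geomspace(5, 1000, num=12)})

# Validation MAE for every candidate, then pick the candidate with the lowest MAE.
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}
best = min(scores, key=scores.get)

for n in candidates:
    print(f"Max leaf nodes: {n:>5}  MAE: {scores[n]:.2f}")
print(f"\nBest candidate on this grid: {best}")
```

On the real data, the same loop could be pointed at a range around 500 to 5000. Because a single validation split is noisy, small differences in MAE between neighbouring candidates may not be meaningful; averaging over several splits (cross-validation) is one way to get a steadier estimate.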
