In [4]:
# general
import numpy as np

## Solution 1: Splitting criteria

### a) 

In [5]:
#| label: 1-a-1
# self-defined function for computing the MSE of an array around its mean
def mse(y):
    residuals = y - y.mean()  # deviations from the constant L2-optimal prediction
    return np.mean(residuals**2)

In [6]:
#| label: 1-a-2
# self-defined function that evaluates all candidate splits of a single feature and returns the best threshold and its empirical risk
def find_best_split(x_train, y_train):
    best_threshold = None
    min_risk = np.inf
    unique_sorted_x = np.unique(x_train)  # np.unique already returns sorted values
    # candidate thresholds: midpoints between adjacent unique values (biggest margin)
    thresholds = unique_sorted_x[1:] - np.diff(unique_sorted_x) / 2
    
    for t in thresholds:
        y_left_ix = x_train < t # boolean mask, True for observations in the left node
        y_left, y_right = y_train[y_left_ix], y_train[~y_left_ix] # ~ selects all other observations
        weight_left = len(y_left) / len(y_train) # share of observations in the left node
        t_mse = weight_left * mse(y_left) + (1 - weight_left) * mse(y_right) # empirical risk of split t
        print("split at %.2f: empirical risk = %.2f" % (t, t_mse)) # track the empirical risk of each split
        
        if t_mse < min_risk: # save best split
            min_risk = t_mse
            best_threshold = t
            
    print("best split at ", best_threshold)
    return {'threshold': best_threshold, 'empirical_risk': min_risk}


In [7]:
#| label: 1-a-3
# define the example data (one feature x, one target y)
x = np.array([1, 2, 7, 10, 20])
y = np.array([1, 1, 0.5, 10, 11])

In [8]:
#| label: 1-a-4
# run function
find_best_split(x, y)

split at 1.50: empirical risk = 19.14
split at 4.50: empirical risk = 13.43
split at 8.50: empirical risk = 0.13
split at 15.00: empirical risk = 12.64
best split at  8.5


{'threshold': 8.5, 'empirical_risk': 0.13333333333333333}

In [9]:
#| label: 1-a-5
# test with log transformed feature
find_best_split(np.log(x), y)

split at 0.35: empirical risk = 19.14
split at 1.32: empirical risk = 13.43
split at 2.12: empirical risk = 0.13
split at 2.65: empirical risk = 12.64
best split at  2.1242476210246797


{'threshold': 2.1242476210246797, 'empirical_risk': 0.13333333333333333}
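
The two runs above report the same empirical risk at every candidate split. A minimal sketch of why, assuming the same toy data: a strictly increasing transformation such as the logarithm keeps the order of the feature values, so the midpoint thresholds move but the resulting left/right partitions do not. The helper `candidate_partitions` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

x = np.array([1, 2, 7, 10, 20])
y = np.array([1, 1, 0.5, 10, 11])

def candidate_partitions(feature):
    """Hypothetical helper: left-node index sets for all midpoint thresholds."""
    values = np.unique(feature)                  # sorted unique feature values
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    return [tuple(np.flatnonzero(feature < t)) for t in thresholds]

partitions_raw = candidate_partitions(x)
partitions_log = candidate_partitions(np.log(x))

# The thresholds differ, but the induced partitions of the observations agree.
same_partitions = partitions_raw == partitions_log
print(same_partitions)
```

Since the empirical risk of a split depends only on which targets land in each child node, identical partitions imply identical risks.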

### b)

- For regression trees, we usually identify *impurity* with *variance*. Here is why:
  - It is reasonable to define impurity via the deviation between actual target values and the predicted constant -- using either absolute or squared distances to enforce symmetry of positive and negative residuals.
  - Recall the constant \(L2\) risk minimizer for a node \(\mathcal{N}\):
    $$
    \bar y = \arg\min_c \frac{1}{|\mathcal{N}|}  \sum_{i = 1}^{|\mathcal{N}|} (y_i - c)^2,
    $$
    because

    \begin{align*}
    \min_c \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y_i - c)^2 &\rightarrow \frac{\partial}{\partial c} \left( \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y_i^2 - 2y_i c + c^2) \right) = 0 \\
    &\rightarrow \frac{1}{|\mathcal{N}|} \left( \sum_{i = 1}^{|\mathcal{N}|} (-2y_i + 2c) \right) = 0 \\
    &\rightarrow \sum_{i = 1}^{|\mathcal{N}|} (-2y_i + 2c) = 0 \\
    &\rightarrow -2 \sum_{i = 1}^{|\mathcal{N}|} y_i + 2|\mathcal{N}|c = 0
    \end{align*}
    
    This implies $\hat{c} = \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} y_i = \bar{y}$, which is indeed a minimum because the second derivative with respect to $c$ equals $2 > 0$.

  - Consequently, we have 
    $$
    \min_c \frac{1}{|\mathcal{N}|}  \sum_{i = 1}^{|\mathcal{N}|} (y_i - c)^2 = \frac{1}{|\mathcal{N}|}  \sum_{i = 1}^{|\mathcal{N}|} (y_i - \bar y)^2,
    $$
    where the right-hand side is the (biased) sample variance of the target values in \(\mathcal{N}\).
  - Therefore, predicting the sample mean minimizes the risk under \(L2\) loss, and the resulting minimal risk is exactly the variance impurity of the node.
  - Since constant mean prediction is equivalent to an intercept-only linear model (minimizing the sum of squared residuals!), regression trees with \(L2\) loss fit a piecewise constant function, i.e., a separate intercept-only least-squares regression in each leaf.
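  - The same correspondence holds between impurity via absolute distances and \(L1\) regression: the constant \(L1\) risk minimizer is the median, and the impurity is the mean absolute deviation from it.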
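
The argument can also be checked numerically. The following is a minimal sketch, assuming the target values from part a) form a single node; the grid of candidate constants and all variable names are illustrative choices, not part of the exercise. It evaluates the \(L2\) and \(L1\) risk for each candidate constant and compares the minimizers with the mean and the median.

```python
import numpy as np

# Target values from part a), treated as one node (assumption for illustration).
y_node = np.array([1.0, 1.0, 0.5, 10.0, 11.0])

# Candidate constants c on an evenly spaced grid (step 0.005).
grid = np.linspace(y_node.min(), y_node.max(), 2101)

# Empirical risk of predicting the constant c under L2 and L1 loss.
l2_risk = np.array([np.mean((y_node - c) ** 2) for c in grid])
l1_risk = np.array([np.mean(np.abs(y_node - c)) for c in grid])

c_l2 = grid[np.argmin(l2_risk)]  # grid minimizer of the L2 risk
c_l1 = grid[np.argmin(l1_risk)]  # grid minimizer of the L1 risk

print("L2 minimizer:", c_l2, "| mean:", y_node.mean())
print("minimal L2 risk:", l2_risk.min(), "| biased variance:", np.var(y_node))
print("L1 minimizer:", c_l1, "| median:", np.median(y_node))
```

Note that `np.var` uses the biased estimator (`ddof=0`) by default, which matches the normalization by \(|\mathcal{N}|\) used in the derivation.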
