In [8]:
import numpy as np

Hyperparameters are the configurable aspects of your algorithm that can be adjusted to improve its performance. For example, k-Nearest Neighbors has the hyperparameter “k” that determines the size of the neighborhood.
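
To make the role of "k" concrete, here is a minimal NumPy sketch of a k-Nearest Neighbors classifier. The helper name `knn_predict` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict the majority label among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest training points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical 1-D toy data: one point of class 1 sits among points of class 0.
x_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0, 0, 1, 0, 1])
x_query = np.array([2.1])

pred_k1 = knn_predict(x_train, y_train, x_query, k=1)  # follows the single nearest point
pred_k3 = knn_predict(x_train, y_train, x_query, k=3)  # majority vote over three neighbors
print(pred_k1, pred_k3)
```

On this toy data, k=1 returns label 1 while k=3 returns label 0, so the same training set yields different predictions depending on the hyperparameter.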

One useful property of CART trees is that you can essentially always build a tree that gets zero training error on a real-world data set. The reason is that whenever two data points have different labels, we can split between them. This only fails when two points have identical features but different labels.
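
The zero-training-error claim can be checked with a short sketch. It assumes scikit-learn's `DecisionTreeClassifier` as the CART implementation and uses synthetic data with purely random labels; both are choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))     # continuous features, so all points are distinct
y = rng.integers(0, 2, size=200)  # labels are pure noise, unrelated to X

# With no depth limit, the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

train_error = np.mean(tree.predict(X) != y)
print(train_error)
```

Even though the labels carry no signal, the unrestricted tree memorizes them, which is exactly why training error alone says little about generalization.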

## Overfitting

We adjust hyperparameters to avoid underfitting or overfitting the training data.

A classic sign of overfitting is high test set error and low training set error.




In [9]:
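def createLearningRate(learningRate, steps):
    """
    Print learningRate / steps while steps counts down to 1, and return the last value.

    Note that the printed rate grows as the divisor shrinks, so the final
    value returned equals learningRate itself.
    """
    a = learningRate

    # range(steps) is evaluated once, so decrementing steps inside the loop
    # changes only the divisor, not the number of iterations.
    for i in range(steps):
        a = learningRate / steps
        steps = steps - 1
        print(a)

    return a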



In [10]:
print(createLearningRate(0.1, 100))

0.001
0.00101010101010101
0.0010204081632653062
0.0010309278350515464
0.0010416666666666667
0.0010526315789473684
0.0010638297872340426
0.001075268817204301
0.0010869565217391304
0.001098901098901099
0.0011111111111111111
0.0011235955056179776
0.0011363636363636365
0.0011494252873563218
0.0011627906976744186
0.0011764705882352942
0.0011904761904761906
0.0012048192771084338
0.0012195121951219512
0.0012345679012345679
0.00125
0.0012658227848101266
0.001282051282051282
0.0012987012987012987
0.0013157894736842105
0.0013333333333333335
0.0013513513513513514
0.0013698630136986301
0.001388888888888889
0.0014084507042253522
0.0014285714285714286
0.0014492753623188406
0.0014705882352941176
0.0014925373134328358
0.0015151515151515152
0.0015384615384615385
0.0015625
0.0015873015873015873
0.0016129032258064516
0.001639344262295082
0.0016666666666666668
0.0016949152542372883
0.001724137931034483
0.0017543859649122807
0.0017857142857142859
0.0018181818181818182
0.001851851851851852
0.001886792452830

### Underfitting

Both the training error and the test error are high.

### Overfitting

Training error keeps decreasing as the model becomes more complex or trains longer, but test error starts to increase again.
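
The two regimes can be seen by comparing training and test error as model complexity grows. This is a minimal sketch, assuming scikit-learn's `DecisionTreeRegressor`, a synthetic noisy sine data set, and tree depth as the complexity knob; none of these come from the original notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)  # noisy target
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

errors = {}
for depth in [1, 3, 20]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(x_train, y_train)
    train_err = np.mean((model.predict(x_train) - y_train) ** 2)
    test_err = np.mean((model.predict(x_test) - y_test) ** 2)
    errors[depth] = (train_err, test_err)
    print(depth, train_err, test_err)
```

A very shallow tree is expected to show high error on both sets (underfitting), while a very deep tree drives training error down and leaves a gap to the test error (overfitting).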

## Cross-validation

Cross-validation is a method for estimating the test error of a machine learning model.

A common practice is to split the training data into two partitions: training and validation.
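
A minimal sketch of such a split, assuming NumPy arrays and a random shuffle. The helper name `train_validation_split` and the 80/20 ratio are illustrative assumptions.

```python
import numpy as np

def train_validation_split(x, y, validation_fraction=0.2, seed=0):
    """Shuffle the indices once, then hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    n_val = int(round(validation_fraction * len(x)))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Hypothetical data: 10 examples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)
x_tr, y_tr, x_val, y_val = train_validation_split(x, y)
print(x_tr.shape, x_val.shape)
```

The model is fit only on the training partition, and the validation partition stands in for unseen test data.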

## Pruning

Pruning is a technique for reducing overfitting in CART by finding the right balance between complexity and simplicity in a decision tree.

We first build a full tree on the training data set until we reach zero training error, without worrying about overfitting. Once we have this gigantic tree, we prune away nodes whose removal does not increase, or even decreases, the error on the validation data set.
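
The sketch below illustrates the "grow fully, then prune against validation data" idea. Note the substitution: it uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is a different pruning criterion from the node-by-node validation check described above, and uses the validation set only to choose the pruning strength. The data set is synthetic and assumed for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)
x_train, y_train = x[:200], y[:200]
x_val, y_val = x[200:], y[200:]

# Grow the full tree, then ask scikit-learn for the sequence of pruning strengths.
full_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(x_train, y_train).ccp_alphas

best_alpha, best_val_err, best_tree = None, np.inf, None
for alpha in alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(x_train, y_train)
    val_err = np.mean((tree.predict(x_val) - y_val) ** 2)
    if val_err < best_val_err:
        best_alpha, best_val_err, best_tree = alpha, val_err, tree

print(full_tree.tree_.node_count, best_tree.tree_.node_count, best_val_err)
```

Comparing the two node counts shows how much of the full tree the validation data judged to be unnecessary.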

## Grid Search

Grid search is a common method for setting multiple hyperparameters.

We train a model for every combination of hyperparameter values in the specified parameter grid, then pick the combination that gives the lowest validation error.
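
The procedure can be sketched with a plain loop over the grid. The model here, ridge regression on polynomial features with hyperparameters `degree` and `lam`, is a stand-in chosen so the example needs only NumPy; the data and grid values are likewise assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=120)
x_train, y_train, x_val, y_val = x[:80], y[:80], x[80:], y[80:]

def fit_ridge_poly(x, y, degree, lam):
    """Closed-form ridge regression on polynomial features 1, x, ..., x^degree."""
    features = np.vander(x, degree + 1, increasing=True)
    gram = features.T @ features + lam * np.eye(degree + 1)
    return np.linalg.solve(gram, features.T @ y)

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    return np.mean((np.vander(x, len(w), increasing=True) @ w - y) ** 2)

param_grid = {"degree": [1, 3, 5, 9], "lam": [1e-3, 1e-1, 10.0]}

results = []
for degree, lam in itertools.product(param_grid["degree"], param_grid["lam"]):
    w = fit_ridge_poly(x_train, y_train, degree, lam)  # train on the training split
    results.append((mse(w, x_val, y_val), degree, lam))  # score on the validation split

best_val_err, best_degree, best_lam = min(results)
print(len(results), best_degree, best_lam, best_val_err)
```

The number of models trained is the product of the grid sizes (here 4 × 3 = 12), which is why grid search becomes expensive as hyperparameters are added.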

## Other Techniques to mitigate overfitting

#### Regularization

In regularization, we attach penalties to model complexity, which compels the model to be simpler.

Classifiers that get too complex are penalized.

For example, for each split, we would assess whether the decrease in impurity is greater than the increase in the penalty we incur for introducing additional nodes.
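
The split test described above can be sketched as follows. Gini impurity and a fixed per-split `penalty` are assumptions made for illustration; the helper names are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_is_worthwhile(parent, left, right, penalty):
    """Accept a split only if the weighted impurity decrease exceeds the penalty."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return bool(gini(parent) - children > penalty)

parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes perfectly reduces impurity from 0.5 to 0.
clean_split = split_is_worthwhile(parent, parent[:3], parent[3:], penalty=0.1)

# A split that barely separates the classes reduces impurity by only about 0.056.
weak_split = split_is_worthwhile(parent, np.array([0, 0, 1]), np.array([0, 1, 1]), penalty=0.1)

print(clean_split, weak_split)
```

With a penalty of 0.1, the clean split is accepted and the weak split is rejected, so the tree stays smaller than it would without the penalty.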

#### Early Stopping

As you work through the training data, you regularly "peek" at the validation data. Once you have passed the minimum of the validation error, you recover the weight vector from that minimum, either by undoing steps or by reloading a checkpoint saved to disk, and that is the classifier you return.

In other words, in early stopping, we stop training once validation error starts to increase again.
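
A minimal sketch of early stopping for gradient descent on linear regression. The synthetic data, the learning rate, and the `patience` rule (stop after 20 steps without improvement) are assumptions added for illustration; the checkpoint is kept in memory rather than on disk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(60, 30))
w_true = np.zeros(30)
w_true[:3] = [1.0, -2.0, 0.5]
y = x @ w_true + rng.normal(scale=1.0, size=60)
x_train, y_train, x_val, y_val = x[:40], y[:40], x[40:], y[40:]

def mse(w, x, y):
    return np.mean((x @ w - y) ** 2)

w = np.zeros(30)
best_w, best_val_err, best_step = w.copy(), mse(w, x_val, y_val), 0
patience, learning_rate = 20, 0.01

for step in range(1, 2001):
    grad = 2 * x_train.T @ (x_train @ w - y_train) / len(y_train)
    w -= learning_rate * grad
    val_err = mse(w, x_val, y_val)  # "peek" at the validation data
    if val_err < best_val_err:
        # New minimum: save a checkpoint of the weights.
        best_w, best_val_err, best_step = w.copy(), val_err, step
    elif step - best_step >= patience:
        break  # validation error has not improved for `patience` steps

w = best_w  # return the checkpoint with the lowest validation error
print(best_step, best_val_err)
```

Keeping a copy of the best weights avoids having to undo gradient steps after the minimum has been passed.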





In [11]:
def generate_kFold(n, k):
    """
    Generates [(training_indices, validation_indices), ...] for k-fold validation.
    
    Input:
        n: number of training examples
        k: number of folds
    
    Output:
        kfold_indices: a list of length k. Each entry takes the form (training indices, validation indices)
    """
    assert k >= 2
    kfold_indices = []
    
    indices = list(range(n))
    fold_size = n // k
    remainder = n % k
    
    start = 0

    for i in range(k):
        fold_indices = indices[start:start+fold_size]
        validation_indices = fold_indices.copy()
        start += fold_size
        
        if remainder > 0:
            validation_indices.append(indices[start])
            start += 1
            remainder -= 1
        
        training_indices = [idx for idx in indices if idx not in validation_indices]
        kfold_indices.append((training_indices, validation_indices))
    
   
    return kfold_indices
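
As a cross-check on the fold sizes, the same partition can be sketched with `np.array_split`, which also gives the first `n % k` folds one extra index. The helper name `kfold_with_array_split` is hypothetical, and it returns NumPy arrays rather than lists.

```python
import numpy as np

def kfold_with_array_split(n, k):
    """Return [(training_indices, validation_indices), ...] using np.array_split."""
    folds = np.array_split(np.arange(n), k)
    return [
        (np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
        for i in range(k)
    ]

splits = kfold_with_array_split(10, 3)
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx)
```

For n=10 and k=3 the validation folds have sizes 4, 3, and 3, and every index appears in exactly one validation fold.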

In [12]:
from sklearn.tree import DecisionTreeRegressor


def square_loss(y_true, y_pred):
    # Assumption: square_loss is not defined in this excerpt, so mean squared
    # error is used here. Replace with the notebook's own definition if it differs.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)


def cross_validation(xTr, yTr, depths, indices):
    """
    Evaluate each candidate depth with k-fold cross-validation.

    Returns the depth with the lowest average validation loss, plus the
    per-depth training and validation losses averaged over the folds.
    """
    training_losses = []
    validation_losses = []

    for depth in depths:
        fold_training_losses = []
        fold_validation_losses = []

        for training_indices, validation_indices in indices:
            xTr_fold, yTr_fold = xTr[training_indices], yTr[training_indices]
            xVal_fold, yVal_fold = xTr[validation_indices], yTr[validation_indices]

            model = DecisionTreeRegressor(max_depth=depth)
            model.fit(xTr_fold, yTr_fold)

            fold_training_losses.append(square_loss(yTr_fold, model.predict(xTr_fold)))
            fold_validation_losses.append(square_loss(yVal_fold, model.predict(xVal_fold)))

        # Average over folds so that each depth gets one training and one validation loss.
        training_losses.append(np.mean(fold_training_losses))
        validation_losses.append(np.mean(fold_validation_losses))

    # Pick the depth whose fold-averaged validation loss is lowest.
    best_depth = depths[int(np.argmin(validation_losses))]
    return best_depth, training_losses, validation_losses


In [13]:
np.random.randn(100)

array([-0.05307258, -2.2338863 , -0.59348454,  0.11202077, -0.76738004,
       -1.26311451,  0.66433584,  0.91055912, -1.6646629 ,  0.56281574,
        0.83704072, -1.4855316 ,  0.54125185,  1.10432153, -1.36060803,
        0.35580083,  0.1210704 , -1.1964292 ,  0.26010688, -1.25826104,
        0.41170476,  1.23538567, -1.41498507, -2.71778568,  0.83997386,
       -1.47754469,  0.73456265,  1.58256232,  1.06842739, -0.36561153,
       -1.2489387 ,  1.28895897,  1.56850344, -0.11643947,  0.37507747,
        0.45292638, -0.45394039,  0.80868525, -0.46816592,  0.19007972,
        0.28944009,  1.33937448, -0.57175225,  0.36074114,  0.60156309,
        2.30251948, -1.17517046, -0.36540592, -1.14453543,  0.40875822,
        1.79469481, -0.15581822, -1.13609731, -2.07397915, -0.44395563,
       -1.50158435, -0.51508367,  0.10574328,  0.96978416, -1.97569999,
        0.02669615,  0.02376358,  0.51097562, -0.78536143, -1.43596569,
       -1.11454219, -1.44983201, -0.60624508, -1.14518402,  1.50