---
layout: exercises
chapter: 4
chapter-title: Training Models
permalink: /ml-book/chapter4/exercises.html
---

## Exercise 1

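If you have a training set with millions of features, use Stochastic or Mini-batch Gradient Descent. Batch Gradient Descent also works, provided the training set is small enough to fit in memory. The Normal Equation is a poor choice here because its cost grows quickly (roughly cubically) with the number of features.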

## Exercise 2

All forms of Gradient Descent can suffer when features have very different scales. They converge more slowly because Gradient Descent takes a circuitous route to the minimum, as Figure 4-7 illustrates. I like to picture the contours of the cost function as ellipses: Gradient Descent first heads toward the major axis, and once it gets there it creeps along it toward the minimum. To fix this, scale the features with standardization or min-max scaling.
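
As a minimal sketch of the fix, here is standardization done by hand in NumPy on two features with very different scales. The data is made up for illustration; scikit-learn's `StandardScaler` applies the same per-feature computation.

```python
import numpy as np

# Hypothetical data: feature 0 is in the thousands, feature 1 is below 1
rng = np.random.default_rng(42)
X = np.column_stack((rng.normal(5000, 1500, size=200),
                     rng.normal(0.5, 0.1, size=200)))

# Standardize: subtract each column's mean, divide by its standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print("before:", X.std(axis=0))
print("after: ", X_scaled.std(axis=0))
```

Compute `mean` and `std` on the training set only and reuse them for the validation and test sets.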

## Exercise 3

No, the log loss function for Logistic Regression is convex so there's no need to worry about it getting stuck in a local minimum.

## Exercise 4

The Gradient Descent algorithms do not all lead to exactly the same model, even if you let them run for a long time. Batch Gradient Descent converges smoothly to a specific minimum, whereas Stochastic Gradient Descent and Mini-batch Gradient Descent may keep "bouncing" around the global minimum. Most of the time the results are close enough, and if you gradually lower the learning rate, SGD and Mini-batch GD get closer and closer to the BGD solution.
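
To make the last point concrete, here is a rough sketch of Stochastic Gradient Descent on a made-up linear regression problem, with a learning rate that shrinks over time. The schedule constants `t0` and `t1`, the data, and the number of epochs are my own assumptions, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(0, 1, size=(m, 1))
X_b = np.hstack((np.ones((m, 1)), X))  # add bias column

# Reference: the exact least-squares solution
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

def learning_schedule(t, t0=5, t1=50):
    """Learning rate that shrinks as the step count t grows (assumed constants)."""
    return t0 / (t + t1)

theta_sgd = rng.normal(size=(2, 1))  # random initialization
for epoch in range(50):
    for step in range(m):
        i = rng.integers(m)  # pick one random sample
        xi, yi = X_b[i:i + 1], y[i:i + 1]
        gradients = 2 * xi.T @ (xi @ theta_sgd - yi)
        eta = learning_schedule(epoch * m + step)
        theta_sgd = theta_sgd - eta * gradients

print("exact:", theta_exact.ravel())
print("SGD:  ", theta_sgd.ravel())
```

The SGD parameters should land near the exact solution without matching it digit for digit.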

## Exercise 5

If the validation error goes up at every epoch, you're likely overfitting. There are various ways to counter overfitting, including using a less complicated model, applying Ridge, Lasso, or Elastic Net regularization, and increasing the size of your training set. Geron notes that the learning rate could also be too high. This is definitely the case if the training error is going up as well.

## Exercise 6

I don't think it's a good idea to stop immediately when the validation error goes up. For example, the validation error could be going down consistently, increase for one epoch, and then go down again for the next 10 epochs. Instead, early stopping is typically implemented with a patience factor, which says something like "if n epochs go by without a decrease in the validation error, then stop (and revert to the model with the minimum validation error)".
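
Here is a small sketch of that patience rule in plain Python. The function name `early_stopping_epoch` and the list of validation errors are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering early stopping
    return len(val_errors) - 1, best_epoch


# Made-up errors: a temporary bump at epoch 3, then real overfitting after epoch 5
val_errors = [0.90, 0.70, 0.60, 0.65, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_errors, patience=3)
print(stop_epoch, best_epoch)  # prints: 8 5
```

With `patience=1` the same sequence would stop at epoch 3 and miss the better models at epochs 4 and 5.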

## Exercise 7

The Normal Equation is fast when the number of features is low, because its cost is linear in the number of samples, and it computes the exact solution directly instead of iterating. Batch Gradient Descent is slow for a large number of samples but scales well with the number of features, and it converges to the minimum. Stochastic Gradient Descent and Mini-batch Gradient Descent are both fast, but they require that the learning rate be gradually decreased so that they actually converge.
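
A quick sanity check of the first two claims, as a sketch on made-up data: the closed-form least-squares solution (via `np.linalg.lstsq`, which solves the same problem as the Normal Equation) and Batch Gradient Descent with a fixed learning rate end up at the same parameters. The learning rate and iteration count are assumptions that happen to work for this data.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(0, 0.5, size=(m, 1))
X_b = np.hstack((np.ones((m, 1)), X))  # add bias column

# Closed-form least-squares solution
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Batch Gradient Descent: every step uses the full training set
eta = 0.1
theta_bgd = np.zeros((2, 1))
for _ in range(5000):
    gradients = (2 / m) * X_b.T @ (X_b @ theta_bgd - y)
    theta_bgd = theta_bgd - eta * gradients

print("largest difference:", np.abs(theta_bgd - theta_exact).max())
```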

## Exercise 8

A gap between the training error and the validation error in Polynomial Regression means the model is overfitting. Three ways to solve it are: increase the size of the dataset, apply regularization, or use a less complicated model. Nailed this answer : )

## Exercise 9

If the training error and validation error are almost equal and fairly high, that indicates high bias (the model is underfitting). You should reduce the regularization parameter $\alpha$.

## Exercise 10

* Ridge Regression instead of Linear Regression?
  * You want to prevent overfitting (high variance) in your model.
* Lasso instead of Ridge Regression?
  * You want to completely eliminate the impact of the least important features instead of just penalizing them.
* Elastic Net instead of Lasso?
  * You want to reduce the complexity of your model and only use the most important features, but want to avoid the erratic behavior of Lasso (when # features > # samples or several features are strongly correlated).
  * Good tip from Geron: If you want to use Lasso without the negative effects, just use Elastic Net with an `l1_ratio` close to 1.
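
To see the Ridge vs. Lasso difference from the list above, here is a hand-rolled NumPy sketch. The helper names `ridge_fit` and `lasso_fit`, the `alpha` values, and the data are all made up for illustration; in practice you would reach for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` classes. Elastic Net is not implemented here.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form Ridge solution (no intercept, so it assumes centered data)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

def lasso_fit(X, y, alpha, n_iters=200):
    """Lasso via coordinate descent on (1 / 2m) * ||y - Xw||^2 + alpha * ||w||_1."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        for j in range(n):
            # Residual with feature j's current contribution removed
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / m
            z = X[:, j] @ X[:, j] / m
            # Soft-thresholding sets small coefficients to exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

# Made-up data: only the first of five features actually drives y
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=200)

w_ridge = ridge_fit(X, y, alpha=10.0)
w_lasso = lasso_fit(X, y, alpha=0.5)
print("Ridge:", w_ridge.round(3))
print("Lasso:", w_lasso.round(3))
```

Ridge shrinks every weight but keeps them all nonzero, while Lasso drives the weights of the four irrelevant features to exactly zero.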

## Exercise 11

If you want to classify pictures as outdoor/indoor and daytime/nighttime, then you should implement two Logistic Regression classifiers instead of one Softmax Regression classifier. Softmax Regression is multi-class, not multi-output.
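
A tiny sketch of why this matters, with made-up scores for a single picture. Softmax forces the outputs to compete and sum to 1, while two independent sigmoid outputs can both be high at the same time (for example, "outdoor" and "daytime").

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())  # shift for numerical stability
    return exps / exps.sum()

def sigmoid(score):
    return 1 / (1 + np.exp(-score))

# Hypothetical raw scores for one picture
scores = np.array([2.0, 1.5])

# Softmax: one multi-class decision, probabilities sum to 1
softmax_probs = softmax(scores)
print(softmax_probs, softmax_probs.sum())

# Two Logistic Regression outputs: each label is decided independently
p_outdoor, p_daytime = sigmoid(scores[0]), sigmoid(scores[1])
print(p_outdoor, p_daytime)
```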

## Exercise 12

Implement Batch Gradient Descent with early stopping for Softmax Regression

In [134]:
# This sounds difficult, but let's give it a shot
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

In [135]:
X, y = iris["data"], iris["target"]
arr = np.hstack((X, y[np.newaxis, :].T))
np.random.seed(42)
p = np.random.permutation(len(arr))
arr = arr[p]
s = (np.array([0.7, 0.2, 0.1]) * len(arr)).astype(int)
s = np.cumsum(s)
print(s)
X_train, y_train = arr[0:s[0], :-1], arr[0:s[0], -1][:, np.newaxis]
print(X_train.shape, y_train.shape)
X_val, y_val = arr[s[0]:s[1], :-1], arr[s[0]:s[1], -1][:, np.newaxis]
print(X_val.shape, y_val.shape)
X_test, y_test = arr[s[1]:s[2], :-1], arr[s[1]:s[2], -1][:, np.newaxis]
print(X_test.shape, y_test.shape)

[105 135 150]
(105, 4) (105, 1)
(30, 4) (30, 1)
(15, 4) (15, 1)


In [136]:
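def softmax_eval(X, y, theta):
    # Cross-entropy loss of the softmax model theta on (X, y); y has shape (m, 1)
    y = y[:, 0]
    m = X.shape[0]
    K = theta.shape[1]  # number of classes, read from theta so a split missing a class still works
    loss = 0
    for k in range(K):
        y_tmp = (y == k).astype(int)
        for i in range(m):
            s_k = theta[:, k].T @ X[i]
            # Normalize over all classes; j avoids shadowing the outer k
            p_k = np.exp(s_k) / sum(np.exp(theta[:, j].T @ X[i]) for j in range(K))
            loss += y_tmp[i] * np.log(p_k)
    return (-1 / m) * loss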
        

def train(X, y, iters=1000, lr=0.01):
    # Softmax Regression with per-class gradient steps; y has shape (m, 1)
    y = y[:, 0]
    m = X.shape[0]
    K = len(set(y))
    theta = np.ones((X.shape[1], K))
    eta = lr
    for iteration in range(iters):
        for k in range(K):
            y_tmp = (y == k).astype(int)
            grad_k = 0
            for i in range(m):
                s_k = theta[:, k].T.dot(X[i])
                # Normalize over all classes; j avoids shadowing the outer k
                p_k = np.exp(s_k) / sum(np.exp(theta[:, j].T @ X[i]) for j in range(K))
                grad_k += (p_k - y_tmp[i]) * X[i]
            grad_k = (1 / m) * grad_k
            # Note: column k is updated before the next class's gradient is computed
            theta[:, k] = theta[:, k] - eta * grad_k
        if (iteration + 1) % 100 == 0:
            loss = softmax_eval(X, y.reshape((-1, 1)), theta)
            print("Loss: ", round(loss, 2))
    return theta

theta = train(X_train, y_train)

Loss:  0.68
Loss:  0.56
Loss:  0.5
Loss:  0.46
Loss:  0.43
Loss:  0.41
Loss:  0.39
Loss:  0.38
Loss:  0.37
Loss:  0.35


This was my first attempt... looking at Geron's work there's still a little to do here.
The main difference is that he vectorized everything. He also added the bias term which I forgot to do.
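
For reference, here is roughly what a vectorized version of the loss and gradient can look like. This is my own sketch, not Geron's exact code: the function names, the `eps` guard inside the log, and the tiny demo arrays are all assumptions.

```python
import numpy as np

def softmax_proba(X, theta):
    """Row-wise softmax probabilities, shape (m, K)."""
    logits = X @ theta
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exps = np.exp(logits)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(X, Y_hot, theta, eps=1e-12):
    """Mean cross-entropy; Y_hot is an (m, K) one-hot matrix."""
    probs = softmax_proba(X, theta)
    return -np.mean(np.sum(Y_hot * np.log(probs + eps), axis=1))

def softmax_gradient(X, Y_hot, theta):
    """Gradient of the mean cross-entropy with respect to theta, shape (n, K)."""
    return X.T @ (softmax_proba(X, theta) - Y_hot) / X.shape[0]

# Tiny made-up check: 4 samples, bias + 2 features, 3 classes
X_demo = np.array([[1.0, 0.2, 1.0],
                   [1.0, 1.5, 0.3],
                   [1.0, 0.1, 0.1],
                   [1.0, 2.0, 2.0]])
Y_demo = np.eye(3)[[0, 1, 2, 1]]
theta_demo = np.ones((3, 3))

demo_loss = cross_entropy(X_demo, Y_demo, theta_demo)
demo_grad = softmax_gradient(X_demo, Y_demo, theta_demo)
print(demo_loss, demo_grad.shape)
```

With all-equal weights every class gets probability 1/3, so the loss should come out at about ln 3.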

In [137]:
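def one_hot(y):
    # y has shape (m, 1) with class labels 0..K-1; returns an (m, K) 0/1 matrix
    K = len(set(y[:, 0]))
    # Grid of column indices, shape (m, K); comparing it with y broadcasts row-wise
    j = np.indices((len(y), K))[1]
    return (j == y).astype(int)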
    
y_train_hot = one_hot(y_train)
y_val_hot = one_hot(y_val)
y_test_hot = one_hot(y_test)

# Add bias
X_train = np.hstack((np.ones((X_train.shape[0], 1)), X_train))
X_val = np.hstack((np.ones((X_val.shape[0], 1)), X_val))
X_test = np.hstack((np.ones((X_test.shape[0], 1)), X_test))

In [None]:
def train_with_early_stopping(X, y, X_v, y_v, iters=1000, lr=0.01, patience=10):
    # X, X_v include the bias column; y, y_v have shape (m, 1) with labels 0..K-1
    m = X.shape[0]
    Y_hot = one_hot(y)
    K = Y_hot.shape[1]
    theta = np.ones((X.shape[1], K))
    eta = lr
    best_theta, best_val_loss = theta.copy(), np.inf
    patience_counter = 0
    for iteration in range(iters):
        # Sketch filling in the TODO: one vectorized batch gradient step for all classes
        logits = X @ theta
        exps = np.exp(logits - logits.max(axis=1, keepdims=True))  # shift for numerical stability
        probs = exps / exps.sum(axis=1, keepdims=True)
        grad = X.T @ (probs - Y_hot) / m
        theta = theta - eta * grad
        if (iteration + 1) % 100 == 0:
            l = softmax_eval(X, y, theta)
            print("Loss: ", round(l, 2))
        # Early stopping portion
        new_val_loss = softmax_eval(X_v, y_v, theta)
        if new_val_loss <= best_val_loss:
            best_val_loss = new_val_loss
            best_theta = theta.copy()  # copy so later updates cannot change the saved best
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                return best_theta, best_val_loss
    print("Warning: did not Early Stop...")
    return best_theta, best_val_loss

# theta, val_loss = train_with_early_stopping(X_train, y_train, X_val, y_val, lr=11, iters=100000)
# theta