1. Which linear regression training algorithm can you use if you have a training set
   with millions of features?

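> I would choose Gradient Descent (the mini-batch or stochastic version). The Normal Equation and the SVD approach scale poorly with the number of features, while mini-batch Gradient Descent scales well and still benefits from vectorization.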
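
As a rough illustration, here is a minimal sketch of mini-batch Gradient Descent for linear regression in NumPy. The synthetic data, the learning rate, and the batch size are all assumptions chosen for the example, not values from the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): y = 4 + 3*x + Gaussian noise
m = 1000
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(scale=0.5, size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column

eta = 0.05        # learning rate (illustrative value)
batch_size = 32   # illustrative value
n_epochs = 100
theta = rng.normal(size=(2, 1))

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)  # reshuffle the instances every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[idx], y[idx]
        # Gradient of the MSE computed on the current mini-batch only
        gradients = 2 / len(idx) * X_batch.T @ (X_batch @ theta - y_batch)
        theta -= eta * gradients

print(theta.ravel())  # should land close to [4, 3]
```

Each update touches only `batch_size` rows, so the cost per step does not grow with the size of the training set.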

<br>
<br>

2. Suppose the features in your training set have very different scales. Which algorithms
   might suffer from this, and how? What can you do about it?

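> Gradient Descent suffers when the features have very different scales: the cost function becomes an elongated bowl, so the parameter of a small-scale feature (say $\theta_1$) must change a lot before it has a real effect on the cost, and vice versa.
> For a convex cost function it will still converge to the global minimum, but it takes much longer.
> We can fix this by scaling the features before training, for example with a StandardScaler (standardization).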

> Additional: Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: since regularization penalizes large weights, features with smaller values will tend to be ignored compared to features with larger values.
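
One way to see the effect of scaling is to look at the condition number of $X^\top X$, which is proportional to the Hessian of the MSE for linear regression. The sketch below uses two made-up features with very different scales; the data and the helper name `condition_number` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
m = 500
X = np.c_[rng.normal(0, 1, m), rng.normal(0, 1000, m)]

def condition_number(X):
    # A large condition number of X^T X means an elongated bowl,
    # which is what slows Gradient Descent down.
    return np.linalg.cond(X.T @ X)

# Standardize each feature: zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(condition_number(X))         # very large
print(condition_number(X_scaled))  # close to 1
```

In a real project the mean and standard deviation would be computed on the training set only and then reused for the validation and test sets.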

<br>
<br>

3. Can gradient descent get stuck in a local minimum when training a logistic
   regression model?

> **No.** The cost function of logistic regression (the log loss) is **convex**, so it has no local minima, only a single global minimum.
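
The convexity claim can be checked from the second derivative. A short derivation, writing $\hat{p}^{(i)} = \sigma(\theta^\top x^{(i)})$ for the predicted probability (this notation is assumed here for the sketch):

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\hat{p}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{p}^{(i)}\big)\Big]
$$

$$
\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(\hat{p}^{(i)} - y^{(i)}\big)\,x^{(i)}
$$

$$
\nabla^2_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\hat{p}^{(i)}\big(1-\hat{p}^{(i)}\big)\,x^{(i)}x^{(i)\top} \succeq 0
$$

Each term is a non-negative scalar times the positive semi-definite matrix $x^{(i)}x^{(i)\top}$, so the Hessian is positive semi-definite everywhere and $J$ is convex.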

<br>
<br>

4. Do all gradient descent algorithms lead to the same model, provided you let them run long enough?

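> **No**, not exactly. If the problem is convex and the learning rate is not too high, all of them approach the global minimum and produce very similar models. But with a constant learning rate, SGD (stochastic) and mini-batch GD keep bouncing around the minimum instead of settling on it, while batch GD converges.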

<br>
<br>

5. Suppose you use batch gradient descent and you plot the validation error at every
   epoch. If you notice that the validation error consistently goes up, what is likely
   going on? How can you fix this?

> If the training error also goes up after each epoch, the learning rate is too large and the algorithm is diverging, so the learning rate should be reduced. If the training error keeps going down while the validation error rises, the model is **overfitting**. There are several ways to fix that; in this case I would use early stopping and restore the best weights.
> Other options are weight penalties such as $\ell_2$ and $\ell_1$ regularization.
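
The divergence case is easy to reproduce on a toy problem. The sketch below minimizes $f(\theta) = \theta^2$ with two learning rates; the function `run_gd` and the chosen values are assumptions for illustration only.

```python
def run_gd(eta, n_steps=20):
    # Minimize f(theta) = theta^2, whose gradient is 2 * theta
    theta = 1.0
    losses = []
    for _ in range(n_steps):
        theta -= eta * 2 * theta
        losses.append(theta ** 2)
    return losses

print(run_gd(eta=0.1)[-1])  # loss shrinks toward 0
print(run_gd(eta=1.5)[-1])  # loss grows: every update overshoots the minimum
```

When both the training and the validation curves behave like the second run, lowering the learning rate is the first thing to try.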

<br>
<br>

6. Is it a good idea to stop mini-batch gradient descent immediately when the
   validation error goes up?

> **No.** The validation error may go down again afterwards: mini-batch updates are noisy, because each step depends on how hard the current (shuffled) batch happens to be.
> Instead, we can save the best model at regular intervals and give training a few epochs of tolerance to improve. If it does not improve within that window, we stop and revert to the best saved model.
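
The tolerance idea can be sketched as a small helper that scans a sequence of validation losses. The function name `early_stopping_epoch`, the `patience` parameter, and the loss values are hypothetical; a real training loop would also snapshot the weights whenever the best loss improves.

```python
import numpy as np

def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss, best_epoch, wait = np.inf, -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Noisy made-up validation curve: the bump at epoch 2 does not stop training
losses = [1.0, 0.8, 0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.71, 0.72, 0.5]
print(early_stopping_epoch(losses, patience=5))  # → (4, 9)
```

With `patience=1` the same curve would stop at epoch 2 and miss the later improvements, which is the failure mode described above.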

<br>
<br>

7. Which gradient descent algorithm (among those we discussed) will reach the
   vicinity of the optimal solution the fastest? Which will actually converge? How
   can you make the others converge as well?

> **Mini-batch GD** is often the fastest in practice, since it benefits from vectorization. We can make SGD and mini-batch GD converge by using a learning rate schedule.

> Additional: Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
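
A gradually decreasing learning rate can be sketched as follows for Stochastic GD on linear regression. The schedule form `t0 / (t + t1)`, its hyperparameters, and the synthetic data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): y = 4 + 3*x + Gaussian noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(scale=0.5, size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]

t0, t1 = 5, 50  # schedule hyperparameters (illustrative values)

def learning_schedule(t):
    # The learning rate shrinks as the step count t grows
    return t0 / (t + t1)

n_epochs = 50
theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)  # pick one training instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradients

print(theta.ravel())  # should settle close to [4, 3]
```

Early steps are large enough to move quickly toward the optimum, while late steps are small enough to damp the bouncing.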

<br>
<br>

8. Suppose you are using polynomial regression. You plot the learning curves and
   you notice that there is a large gap between the training error and the validation
   error. What is happening? What are three ways to solve this?

> This is **overfitting**: the model is too complex for the data, so it fits the training set much better than the validation set. Three ways to fix it: reduce the polynomial degree, regularize the model (for example with an $\ell_2$ or $\ell_1$ penalty), or get more training data.
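
The gap between training and validation error can be reproduced with a small experiment. This is only a sketch: the quadratic data-generating function, the sample sizes, and the two degrees compared are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Hypothetical quadratic relationship plus Gaussian noise
    x = rng.uniform(-3, 3, n)
    y = 0.5 * x**2 + x + 2 + rng.normal(scale=1.0, size=n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(coefs, x, y):
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

errors = {}
for degree in (2, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_val, y_val))
    print(degree, errors[degree])  # (training MSE, validation MSE)
```

The high-degree fit drives the training error down by chasing the noise, which is what opens the gap to the validation error.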

<br>
<br>

9. Suppose you are using ridge regression and you notice that the training error
   and the validation error are almost equal and fairly high. Would you say that
   the model suffers from high bias or high variance? Should you increase the
   regularization hyperparameter α or reduce it?

> The model suffers from **high bias** (it underfits). In this case we should reduce the regularization hyperparameter $\alpha$.
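
To see why a large regularization hyperparameter pushes toward high bias, the sketch below fits ridge regression in closed form for increasing values of the hyperparameter and records the training error. The data, the helper names, and the omission of a bias term are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 100, 5
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=(n, 1))
y = X @ theta_true + rng.normal(scale=0.1, size=(m, 1))

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y
    # (no bias term, to keep the sketch short)
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, theta):
    return float(np.mean((X @ theta - y) ** 2))

alphas = [0.0, 1.0, 100.0, 10_000.0]
train_errors = [mse(X, y, ridge_fit(X, y, a)) for a in alphas]
print(train_errors)  # training error never decreases as alpha grows
```

With a very large value the weights are shrunk almost to zero and the model can no longer fit even the training set.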

<br>
<br>

10. Why would you want to use:
    a. Ridge regression instead of plain linear regression (i.e., without any
    regularization)?
    b. Lasso instead of ridge regression?
    c. Elastic net instead of lasso regression?

> We use **Ridge** instead of plain Linear Regression because a regularized model generalizes better and falls into the overfitting trap less easily.
> We choose **Lasso** over Ridge when we want automatic feature selection, since it tends to drive the weights of the least important features to exactly zero.
> We choose **ElasticNet** over Lasso because Lasso can behave erratically when several features are strongly correlated or when there are more features than training instances.
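
The feature-selection behavior of Lasso shows up clearly in a special case. Assuming an orthonormal design ($X^\top X = I$), the cost $\tfrac12\lVert y - X\theta\rVert^2 + \alpha\lVert\theta\rVert_1$ is minimized by soft-thresholding the least-squares weights, while the ridge cost $\tfrac12\lVert y - X\theta\rVert^2 + \tfrac{\alpha}{2}\lVert\theta\rVert_2^2$ just rescales them. The function names and the example weights below are hypothetical.

```python
import numpy as np

def lasso_orthonormal(theta_ols, alpha):
    # Soft-thresholding: shrink toward zero and set small weights to exactly 0
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - alpha, 0.0)

def ridge_orthonormal(theta_ols, alpha):
    # Uniform shrinkage: weights get smaller but never reach exactly 0
    return theta_ols / (1.0 + alpha)

theta_ols = np.array([3.0, -0.4, 0.1, -2.0])  # hypothetical least-squares weights
alpha = 0.5

print(lasso_orthonormal(theta_ols, alpha))  # two weights are zeroed out
print(ridge_orthonormal(theta_ols, alpha))  # all four weights stay non-zero
```

Real data rarely has an orthonormal design, so this only illustrates the mechanism, not a general solver.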

<br>
<br>

11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
    Should you implement two logistic regression classifiers or one softmax regression classifier?

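> I would train two Logistic Regression classifiers (or implement some multi-output classifier), because the two labels are not mutually exclusive: all four combinations are possible. Softmax regression predicts exactly one class out of several, so it does not fit this task.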
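
The difference is visible in the output probabilities. In this sketch the two scores for a single picture are made up; two independent sigmoids can both be confident, while softmax forces the outputs to compete.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    exps = np.exp(logits)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical scores for one picture: [outdoor, daytime]
logits = np.array([[2.0, 3.0]])

print(sigmoid(logits))  # two independent probabilities, both can be high
print(softmax(logits))  # forced to sum to 1, so one label must lose
```

An outdoor daytime picture needs both outputs to be high at once, which only the independent sigmoids allow.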


---


12. Implement batch gradient descent with early stopping for softmax regression
    without using Scikit-Learn, only NumPy. Use it on a classification task such as
    the iris dataset.


In [31]:
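from sklearn.datasets import load_iris
import numpy as np

data = load_iris(as_frame=True)
# Convert to NumPy arrays so that the positional indexing below is unambiguous
X, y = data['data'].to_numpy(), data['target'].to_numpy()
list(data.keys())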

['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']


In [34]:
X_with_b = np.c_[np.ones(len(X)), X]
X_with_b[:2]

array([[1. , 5.1, 3.5, 1.4, 0.2],
       [1. , 4.9, 3. , 1.4, 0.2]])

In [43]:
test_ratio = .2
validation_ratio = .2
total_size = len(X_with_b)

test_size = int(test_ratio*total_size)
validation_size = int(validation_ratio*total_size)
train_size = total_size - validation_size - test_size

np.random.seed(42)
shuffled = np.random.permutation(total_size)

X_train = X_with_b[shuffled[:train_size]]
X_valid = X_with_b[shuffled[train_size:-test_size]]
X_test = X_with_b[shuffled[-test_size:]]

y_train = y[shuffled[:train_size]]
y_valid = y[shuffled[train_size:-test_size]]
y_test = y[shuffled[-test_size:]]

In [62]:
number_of_classes = 3


def onehot(y):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.identity(number_of_classes)[y]


y_train_enc = onehot(y_train)
y_valid_enc = onehot(y_valid)
y_test_enc = onehot(y_test)

Scaling


In [71]:
# Statistics come from the training set only; column 0 is the bias and is left alone
mean = X_train[:, 1:].mean(axis=0)
std = X_train[:, 1:].std(axis=0)

X_train[:, 1:] = (X_train[:, 1:]-mean)/std
X_valid[:, 1:] = (X_valid[:, 1:]-mean)/std
X_test[:, 1:] = (X_test[:, 1:]-mean)/std

In [73]:
def softmax(logits):
    exps = np.exp(logits)
    exp_sums = exps.sum(axis=1, keepdims=True)
    return exps / exp_sums


n_inputs = X_train.shape[1]
n_outputs = number_of_classes
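
A note on numerical stability: `np.exp` overflows for large logits. A common variant, sketched here under the hypothetical name `stable_softmax`, subtracts the row-wise maximum first, which leaves the probabilities unchanged.

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the row-wise max does not change the result
    # but keeps np.exp from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

big_logits = np.array([[1000.0, 1001.0, 1002.0]])
print(stable_softmax(big_logits))  # finite probabilities that sum to 1
```

On standardized iris features the logits stay small, so the plain version above is fine for this exercise.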

In [74]:
learning_rate = .5
n_epochs = 5001
m = len(X_train)
epsilon = 1e-5

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)

for epoch in range(n_epochs):
    z = np.dot(X_train, Theta)
    a = softmax(z)

    if epoch % 1000 == 0:
        Y_proba_valid = softmax(X_valid @ Theta)
        xentropy_losses = -(y_valid_enc * np.log(Y_proba_valid + epsilon))
        print(epoch, xentropy_losses.sum(axis=1).mean())

    error = a - y_train_enc
    grads = (1/m)*(X_train.T@error)
    Theta = Theta - learning_rate*grads

0 4.626481998690278
1000 0.11433651664960792
2000 0.09771828547727753
3000 0.09482470188955566
4000 0.09774000000102669
5000 0.1031457416377819


In [75]:
Theta

array([[ 1.23389181,  6.24071307, -6.46846648],
       [-2.52703783,  2.47299879,  1.10877856],
       [ 3.64053358,  0.04103479, -1.80439521],
       [-5.10552678, -2.6481553 ,  7.36709468],
       [-5.3114502 , -2.65232032,  4.56753472]])

In [83]:
z_preds = X_valid@Theta
a_preds = softmax(z_preds)
preds = np.argmax(a_preds, axis=1)
accuracy_score = (preds == y_valid).mean()
accuracy_score

0.9666666666666667

Regularization


In [88]:
learning_rate = .5
n_epochs = 5001
m = len(X_train)
epsilon = 1e-5
alpha = .01

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)

for epoch in range(n_epochs):
    z = np.dot(X_train, Theta)
    a = softmax(z)

    if epoch % 1000 == 0:
        Y_proba_valid = softmax(X_valid @ Theta)
        xentropy_losses = -(y_valid_enc * np.log(Y_proba_valid + epsilon))
        l2_loss = 1 / 2 * (Theta[1:] ** 2).sum()
        total_loss = xentropy_losses.sum(axis=1).mean() + alpha * l2_loss
        print(epoch, total_loss.round(4))

    error = a - y_train_enc
    grads = (1/m)*(X_train.T@error)
    # Penalize only the weights W, not the bias row (hence the concatenation with zeros)
    grads += np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - learning_rate*grads

0 4.6922
1000 0.2835
2000 0.2834
3000 0.2834
4000 0.2834
5000 0.2834


In [89]:
z_preds = X_valid@Theta
a_preds = softmax(z_preds)
preds = np.argmax(a_preds, axis=1)
accuracy_score = (preds == y_valid).mean()
accuracy_score

0.9333333333333333

In [93]:
eta = 0.5
n_epochs = 50_001
m = len(X_train)
epsilon = 1e-5
C = 100  # regularization hyperparameter
best_loss = np.inf

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)
tolerance = 5  # stop after 5 non-improving epochs in total (the counter is never reset)
for epoch in range(n_epochs):
    logits = X_train @ Theta
    Y_proba = softmax(logits)
    Y_proba_valid = softmax(X_valid @ Theta)
    xentropy_losses = -(y_valid_enc * np.log(Y_proba_valid + epsilon))
    l2_loss = 1 / 2 * (Theta[1:] ** 2).sum()
    total_loss = xentropy_losses.sum(axis=1).mean() + 1 / C * l2_loss
    if epoch % 1000 == 0:
        print(epoch, total_loss.round(4))
    if total_loss < best_loss:
        best_loss = total_loss
    else:
        tolerance -= 1
        if tolerance == 0:
            print(epoch - 1, best_loss.round(4))
            print(epoch, total_loss.round(4), "early stopping!")
            break
    error = Y_proba - y_train_enc
    gradients = 1 / m * X_train.T @ error
    gradients += np.r_[np.zeros([1, n_outputs]), 1 / C * Theta[1:]]
    Theta = Theta - eta * gradients

0 4.6922
1000 0.2835
2000 0.2834
3000 0.2834
4000 0.2834
5000 0.2834
5514 0.2834
5515 0.2834 early stopping!


In [94]:
z_preds = X_valid@Theta
a_preds = softmax(z_preds)
preds = np.argmax(a_preds, axis=1)
accuracy_score = (preds == y_valid).mean()
accuracy_score

0.9333333333333333