A linear model makes a prediction by computing a weighted sum of the input features plus a constant called the *bias term* (also called the *intercept term*).

The linear equation can be written in a vectorized form:

$\hat{y} = h_{\boldsymbol\theta}(\textbf{x}) = \boldsymbol\theta \cdot \textbf{x}$

$\hat{y}$ is the predicted value.

$\boldsymbol\theta$ is the model's *parameter vector*, containing the bias term $\boldsymbol\theta_{0}$ and the feature weights $\boldsymbol\theta_{1}$ to $\boldsymbol\theta_{n}$.

$\textbf{x}$ is the instance's *feature vector*, containing $\textbf{x}_{0}$ to $\textbf{x}_{n}$, with $\textbf{x}_{0}$ always equal to 1.

$h_{\boldsymbol\theta}$ is the hypothesis function, using the model parameters $\boldsymbol\theta$.

The MSE cost function for a Linear Regression model is:

$\text{MSE} (\textbf{X},h_{\boldsymbol\theta}) = \frac{1}{m} \sum\limits_{i = 1}^m (\boldsymbol\theta^{T}\textbf{x}^{(i)} - y^{(i)})^2$
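The vectorized prediction and the MSE can be sketched in a few lines of NumPy. This is only an illustration: the toy data and the names `X_b`, `predict` and `mse` are assumptions, not part of the original notes.

```python
import numpy as np

def predict(X_b, theta):
    """Vectorized prediction: the dot product theta . x for every row of X_b."""
    return X_b @ theta

def mse(X_b, y, theta):
    """Mean squared error of the linear model over the m rows of X_b."""
    errors = predict(X_b, theta) - y
    return float(np.mean(errors ** 2))

# Toy data (made up): the first column is x_0 = 1, the second is one feature.
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])      # generated by y = 1 + 2 * x_1
theta = np.array([1.0, 2.0])       # bias term 1, feature weight 2

print(predict(X_b, theta))
print(mse(X_b, y, theta))
```

With the parameters that generated the data, the error is zero; any other `theta` gives a positive MSE.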

The *closed-form solution* (in other words, a mathematical equation that gives the result directly) for the value of $\boldsymbol\theta$ that minimizes the cost function is called the *Normal Equation*:

$\hat{\boldsymbol\theta} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}y$

$\hat{\boldsymbol\theta}$ is the value of $\boldsymbol\theta$ that minimizes the cost function, and $y$ is the vector of target values $y^{(1)}$ to $y^{(m)}$.
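A minimal sketch of the Normal Equation in NumPy, assuming synthetic data generated from $y = 4 + 3x$ plus Gaussian noise (the data, the seed and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))                      # one feature in [0, 2)
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)     # assumed true model plus noise

X_b = np.c_[np.ones((m, 1)), X]                 # prepend x_0 = 1 to every instance

# Direct transcription of theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Solving the linear system avoids forming the inverse explicitly
theta_solve = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

print(theta_hat)
```

Both lines compute the same estimate; solving the system is usually the numerically safer choice when $\textbf{X}^{T}\textbf{X}$ is invertible.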

The function `np.linalg.lstsq` computes $\hat{\boldsymbol\theta} = \textbf{X}^{+} y$, where $\textbf{X}^{+}$ is the *pseudoinverse* of $\textbf{X}$ (specifically, the Moore-Penrose inverse).

The pseudoinverse itself is computed using a standard matrix factorization technique called *Singular Value Decomposition* (SVD).
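The relationship between `np.linalg.lstsq`, the pseudoinverse and the SVD can be checked on a small example. The data and the `1e-12` cutoff for "zero" singular values are assumptions made for this sketch.

```python
import numpy as np

X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Least-squares solution as returned by NumPy
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

# Same solution through the Moore-Penrose pseudoinverse
theta_pinv = np.linalg.pinv(X_b) @ y

# Pseudoinverse built by hand from the SVD: X = U S V^T  =>  X^+ = V S^+ U^T,
# where S^+ inverts the non-zero singular values and leaves the zeros at zero.
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)
s_inv = np.zeros_like(s)
nonzero = s > 1e-12                 # assumed cutoff for this sketch
s_inv[nonzero] = 1.0 / s[nonzero]
theta_svd = Vt.T @ (s_inv * (U.T @ y))

print(theta_lstsq, theta_pinv, theta_svd)
```

All three vectors agree up to floating-point error.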

*Gradient Descent* is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

*Batch Gradient Descent* uses the whole batch of training data at every step of the training.

The equation for a Gradient Descent step is:

$\boldsymbol\theta^{\text{next step}} = \boldsymbol\theta - \eta \nabla_{\boldsymbol\theta} \text{MSE}(\boldsymbol\theta)$

where $\eta$ is called the *learning rate*.

Gradient Descent iterations can be stopped when the norm of the gradient vector becomes smaller than a tiny number $\epsilon$ called the *tolerance*.
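The update rule and the tolerance-based stopping criterion can be sketched as follows. The gradient used here, $\frac{2}{m}\textbf{X}^{T}(\textbf{X}\boldsymbol\theta - y)$, is the gradient of the MSE defined earlier; the function name, default values and toy data are assumptions.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta=0.1, tol=1e-6, max_iter=10_000):
    """Batch GD on the MSE; stops when the gradient norm drops below tol."""
    m, n = X_b.shape
    theta = np.zeros(n)
    for _ in range(max_iter):
        # The gradient uses the whole training set at every step
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradients) < tol:
            break
        theta = theta - eta * gradients
    return theta

# Noise-free toy data generated by y = 1 + 2 * x_1
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = 1.0 + 2.0 * X_b[:, 1]

theta = batch_gradient_descent(X_b, y)
print(theta)
```

The learning rate matters: on this data a value of `eta` much larger than about 0.2 would make the iterations diverge instead of converging.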

Unlike Batch Gradient Descent, *Stochastic Gradient Descent* picks a random instance in the training set at every step and computes the gradients based only on that single instance.

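*Simulated annealing* (SA) is a probabilistic technique for approximating the global optimum of a given function. Gradually reducing the learning rate during Stochastic Gradient Descent follows a similar idea: large steps early on help the algorithm escape local minima, and smaller steps later let it settle near the minimum.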

The function that determines the learning rate at each iteration is called the *learning schedule*.

By convention, we iterate in rounds of $m$ iterations, where $m$ is the number of training instances; each round is called an *epoch*.
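A rough sketch of Stochastic Gradient Descent with a learning schedule and epochs. The schedule $\eta(t) = t_0 / (t + t_1)$ with $t_0 = 5$ and $t_1 = 50$ is just one possible choice, and the data, seeds and function names are assumptions.

```python
import numpy as np

def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate at iteration t; it shrinks as training progresses."""
    return t0 / (t + t1)

def sgd(X_b, y, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.normal(size=n)                  # random initialization
    for epoch in range(n_epochs):               # one epoch = m iterations
        for i in range(m):
            idx = rng.integers(m)               # pick one random instance
            xi, yi = X_b[idx], y[idx]
            gradients = 2 * xi * (xi @ theta - yi)   # gradient on that instance only
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients
    return theta

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, m)     # assumed model y = 4 + 3x plus noise
X_b = np.c_[np.ones((m, 1)), X]

theta = sgd(X_b, y)
print(theta)
```

Because instances are drawn at random, some may be picked several times per epoch and others not at all; shuffling the training set once per epoch is a common alternative.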

*Mini-batch Gradient Descent* is simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.
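Mini-batch GD can be sketched by shuffling the training set each epoch and stepping through it in small slices. The batch size, learning rate, epoch count and noise-free toy data below are all assumptions chosen to keep the example short.

```python
import numpy as np

def minibatch_gd(X_b, y, batch_size=20, n_epochs=200, eta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = np.zeros(n)
    for epoch in range(n_epochs):
        shuffled = rng.permutation(m)           # new random order every epoch
        for start in range(0, m, batch_size):
            batch = shuffled[start:start + batch_size]
            X_mb, y_mb = X_b[batch], y[batch]
            # Gradient of the MSE computed on the mini-batch only
            gradients = (2 / len(batch)) * X_mb.T @ (X_mb @ theta - y_mb)
            theta = theta - eta * gradients
    return theta

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0]                             # noise-free, so the optimum is exactly (4, 3)
X_b = np.c_[np.ones((100, 1)), X]

theta = minibatch_gd(X_b, y)
print(theta)
```

Setting `batch_size` to the training set size gives Batch GD, and setting it to 1 gives a shuffled variant of Stochastic GD.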

Adding powers of each feature as new features, then training a linear model on this extended set of features, is called *Polynomial Regression*.
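As a small illustration, the sketch below adds the square of a single feature and then fits an ordinary linear model on the extended features. The helper `add_polynomial_features` is hypothetical and only adds powers of each feature, not cross terms between different features; the quadratic data is made up.

```python
import numpy as np

def add_polynomial_features(X, degree):
    """Stack X, X**2, ..., X**degree column-wise (powers of each feature only)."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3                # one feature in [-3, 3)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2            # noise-free quadratic

X_poly = add_polynomial_features(X, degree=2)   # columns: x, x^2
X_b = np.c_[np.ones((len(X_poly), 1)), X_poly]  # prepend x_0 = 1

# A plain linear least-squares fit on the extended features
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)                                    # bias, weight of x, weight of x^2
```

The model is still linear in its parameters; only the inputs have been transformed.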

*Learning curves* are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration).
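The numbers behind a learning curve can be computed by refitting the model on growing subsets of the training set. This sketch assumes RMSE as the performance measure and a plain linear model fitted to made-up quadratic data; plotting is left out.

```python
import numpy as np

def rmse(X_b, y, theta):
    return float(np.sqrt(np.mean((X_b @ theta - y) ** 2)))

def learning_curve(X_train, y_train, X_val, y_val):
    """Training and validation RMSE for training subsets of increasing size."""
    sizes, train_errors, val_errors = [], [], []
    for size in range(2, len(X_train) + 1):
        theta, *_ = np.linalg.lstsq(X_train[:size], y_train[:size], rcond=None)
        sizes.append(size)
        train_errors.append(rmse(X_train[:size], y_train[:size], theta))
        val_errors.append(rmse(X_val, y_val, theta))
    return sizes, train_errors, val_errors

rng = np.random.default_rng(1)
X = 6 * rng.random((80, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, 80)   # quadratic data
X_b = np.c_[np.ones((80, 1)), X]                # linear model: bias and x only

sizes, train_errors, val_errors = learning_curve(X_b[:60], y[:60], X_b[60:], y[60:])
print(train_errors[0], train_errors[-1], val_errors[-1])
```

With only two training instances a straight line fits them perfectly, so the first training error is essentially zero; as more instances are added the training error rises because a line cannot fit quadratic data.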

An important theoretical result of statistics and Machine Learning is the fact that a model’s generalization error can be expressed as the sum of three very different errors:

*Bias*

This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

*Variance*

This part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data.

*Irreducible error*

This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a Bias/Variance trade-off.

*Ridge Regression* (also called *Tikhonov regularization*) is a regularized version of Linear Regression: a *regularization term* equal to $\frac{\alpha}{2} \sum\limits^{n}_{i=1} \theta_{i}^{2}$ is added to the cost function, which forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, its performance should be evaluated with the unregularized performance measure.

Ridge Regression cost function:
    
$J(\boldsymbol\theta) = \text{MSE}(\boldsymbol\theta) + \frac{\alpha}{2} \sum\limits^{n}_{i=1} \theta_{i}^{2}$

If we define **w** as the vector of feature weights ($\boldsymbol\theta_{1}$ to $\boldsymbol\theta_{n}$), then the regularization term is equal to $\frac{\alpha}{2}\parallel \textbf{w} \parallel_{2}^{2}$, where $\parallel \textbf{w} \parallel_{2}$ represents the $l_{2}$ norm of the weight vector.

Ridge Regression closed-form solution

$\hat{\boldsymbol\theta} = (\textbf{X}^{T}\textbf{X} + \alpha \textbf{E})^{-1}\textbf{X}^{T}y$

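where **E** is the $(n+1) \times (n+1)$ identity matrix, except with a 0 in the top-left cell, which corresponds to the bias term (the bias term is not regularized).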
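A minimal sketch of the Ridge closed-form solution. It assumes the bias term is left unregularized, which is done here by zeroing the top-left entry of the matrix added to $\textbf{X}^{T}\textbf{X}$ (consistent with the regularization sum starting at $i = 1$). The function name and toy data are illustrative.

```python
import numpy as np

def ridge_closed_form(X_b, y, alpha):
    """Solve (X^T X + alpha * E) theta = X^T y without forming the inverse."""
    n_params = X_b.shape[1]
    E = np.identity(n_params)
    E[0, 0] = 0.0                   # assumption: do not regularize the bias term
    return np.linalg.solve(X_b.T @ X_b + alpha * E, X_b.T @ y)

X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta_ols = ridge_closed_form(X_b, y, alpha=0.0)     # no regularization
theta_ridge = ridge_closed_form(X_b, y, alpha=10.0)  # strong regularization

print(theta_ols, theta_ridge)
```

With `alpha=0` the result is the ordinary least-squares solution; increasing `alpha` shrinks the feature weight toward zero while the bias adjusts to compensate.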

*Least Absolute Shrinkage and Selection Operator Regression* (usually simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the $l_{1}$ norm of the weight vector instead of half the square of the $l_{2}$ norm:

$J(\boldsymbol\theta) = \text{MSE}(\boldsymbol\theta) + \alpha \sum\limits^{n}_{i=1} |\theta_{i}|$

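*Sparse models* are models in which only a small fraction of the parameters are non-zero. Lasso Regression tends to produce such models because it drives the weights of the least important features to zero.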

Lasso Regression subgradient vector:

$g(\boldsymbol\theta, J) = \nabla_{\boldsymbol\theta} \text{MSE}(\boldsymbol\theta) + \alpha \times \text{sign}(\boldsymbol\theta)$
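The subgradient vector can be computed directly in NumPy. In this sketch `np.sign` returns 0 for a weight that is exactly 0, and the bias term is excluded from the penalty; both choices are assumptions, as are the function name and the data.

```python
import numpy as np

def lasso_subgradient(X_b, y, theta, alpha):
    """Gradient of the MSE plus alpha * sign(theta), leaving the bias unpenalized."""
    m = len(y)
    mse_gradient = (2 / m) * X_b.T @ (X_b @ theta - y)
    penalty = alpha * np.sign(theta)
    penalty[0] = 0.0                # assumption: the bias term theta_0 is not regularized
    return mse_gradient + penalty

# Noise-free toy data generated by y = 1 + 2 * x_1
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = 1.0 + 2.0 * X_b[:, 1]

# At the unregularized optimum the MSE gradient vanishes, so only the penalty remains
g = lasso_subgradient(X_b, y, np.array([1.0, 2.0]), alpha=0.5)
print(g)
```

This vector can be used in place of the gradient in a Gradient Descent step, since the $l_{1}$ penalty is not differentiable where a weight equals zero.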

*Elastic Net* is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio $r$. When $r = 0$, Elastic Net is equivalent to Ridge Regression, and when $r = 1$, it is equivalent to Lasso Regression.
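The mixing can be illustrated with a small function for the penalty alone. It assumes the common formulation $r \alpha \sum |\theta_{i}| + \frac{1 - r}{2} \alpha \sum \theta_{i}^{2}$ over the feature weights, which is not spelled out in these notes; the function name and the weight values are made up.

```python
import numpy as np

def elastic_net_penalty(w, alpha, r):
    """Mix of the l1 penalty (weight r) and half the squared l2 penalty (weight 1 - r)."""
    l1_term = np.sum(np.abs(w))
    l2_term = 0.5 * np.sum(w ** 2)
    return float(r * alpha * l1_term + (1 - r) * alpha * l2_term)

w = np.array([3.0, -4.0])           # feature weights only, without the bias term

print(elastic_net_penalty(w, alpha=2.0, r=0.0))   # pure Ridge penalty
print(elastic_net_penalty(w, alpha=2.0, r=1.0))   # pure Lasso penalty
print(elastic_net_penalty(w, alpha=2.0, r=0.5))   # equal mix
```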

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called *early stopping*.
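One way to sketch early stopping is to run Batch GD, measure the validation error after every epoch, and keep a copy of the best parameters seen so far. The `patience` argument (how many epochs to wait after the last improvement before giving up) is an assumption of this sketch, as are the function name, default values and data.

```python
import numpy as np

def early_stopping_gd(X_train, y_train, X_val, y_val,
                      eta=0.01, max_epochs=500, patience=10):
    m, n = X_train.shape
    theta = np.zeros(n)
    best_theta, best_val_mse, best_epoch = theta.copy(), np.inf, 0
    for epoch in range(max_epochs):
        gradients = (2 / m) * X_train.T @ (X_train @ theta - y_train)
        theta = theta - eta * gradients
        val_mse = np.mean((X_val @ theta - y_val) ** 2)
        if val_mse < best_val_mse:
            # New minimum of the validation error: remember these parameters
            best_theta, best_val_mse, best_epoch = theta.copy(), val_mse, epoch
        elif epoch - best_epoch >= patience:
            break                   # no improvement for `patience` epochs
    return best_theta, best_val_mse, best_epoch

rng = np.random.default_rng(0)
X = 2 * rng.random((60, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, 60)
X_b = np.c_[np.ones((60, 1)), X]
X_train, y_train, X_val, y_val = X_b[:40], y[:40], X_b[40:], y[40:]

best_theta, best_val_mse, best_epoch = early_stopping_gd(X_train, y_train, X_val, y_val)
print(best_theta, best_val_mse, best_epoch)
```

Waiting a few epochs before stopping guards against noisy validation curves, where the error may go up briefly before reaching a lower minimum.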

*Logistic Regression* (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), and otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.

Logistic Regression model estimated probability (vectorized form)

$\hat{p} = h_{\boldsymbol\theta}(\textbf{x}) = \sigma(\textbf{x}^{T} \boldsymbol\theta)$

The logistic function, denoted $\sigma(\cdot)$, is a *sigmoid function* (i.e., S-shaped) that outputs a number between 0 and 1:

$\sigma (t) = \frac{1}{1 + \text{exp} (-t)}$

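The model predicts $\hat{y} = 0$ if $\hat{p} < 0.5$ and $\hat{y} = 1$ if $\hat{p} \geq 0.5$. Since $\sigma(t) \geq 0.5$ exactly when $t \geq 0$, this amounts to predicting 1 whenever $\textbf{x}^{T} \boldsymbol\theta$ is non-negative. The score $t$ is often called the *logit*.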
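The estimated probability and the resulting prediction can be sketched as follows. The parameter values and instances are made up, and the function names are illustrative.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real score t to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X_b, theta):
    """Estimated probability of the positive class for every row of X_b."""
    return sigmoid(X_b @ theta)

def predict(X_b, theta):
    """Class 1 when the estimated probability is at least 0.5, else class 0."""
    return (predict_proba(X_b, theta) >= 0.5).astype(int)

theta = np.array([-1.0, 1.0])                   # assumed bias and weight
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0]])   # scores t = -1, 0, 2

print(predict_proba(X_b, theta))
print(predict(X_b, theta))
```

The second instance has a score of exactly 0, so its estimated probability is 0.5 and it falls on the decision boundary.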

Logistic Regression cost function (log loss)

$J(\boldsymbol\theta) = - \frac{1}{m} \sum\limits^{m}_{i=1} [y^{(i)}\log (\hat p^{(i)}) + (1 - y^{(i)})\log (1 - \hat p^{(i)})]$
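The log loss translates almost literally into NumPy. The clipping constant `eps`, which avoids taking the logarithm of 0, is an assumption added for numerical safety and is not part of the formula.

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    """Average log loss for labels y in {0, 1} and estimated probabilities p_hat."""
    p = np.clip(p_hat, eps, 1 - eps)            # keep log() away from 0
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0])

print(log_loss(y, np.array([0.5, 0.5])))        # uninformative estimates
print(log_loss(y, np.array([0.9, 0.1])))        # confident and correct
print(log_loss(y, np.array([0.1, 0.9])))        # confident and wrong
```

The loss is small when the model assigns high probability to the correct class and grows quickly when it is confidently wrong.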

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called *Softmax Regression*, or *Multinomial Logistic Regression*.

Softmax score for class $k$

$s_{k}(\textbf{x}) = \textbf{x}^{T} \boldsymbol\theta^{(k)}$

Each class has its own dedicated parameter vector $\boldsymbol\theta^{(k)}$. All these vectors are typically stored as rows in a *parameter matrix* $\Theta$.


Softmax function

$\hat{p}_{k} = \sigma(\textbf{s}(\textbf{x}))_{k} = \frac{\text{exp}(s_{k}(\textbf{x}))}{\sum\limits^{K}_{j=1} \text{exp}(s_{j}(\textbf{x}))}$

$K$ is the number of classes, $\textbf{s}(\textbf{x})$ is a vector containing the scores of each class for the instance **x**, $\sigma(\textbf{s}(\textbf{x}))_{k}$ is the estimated probability that the instance **x** belongs to class $k$, given the scores of each class for that instance.

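Softmax Regression classifier prediction

$\hat y = \underset{k}{\text{argmax }} \sigma(\textbf{s}(\textbf{x}))_{k} = \underset{k}{\text{argmax }} s_{k}(\textbf{x}) = \underset{k}{\text{argmax }} ((\boldsymbol\theta^{(k)})^{T}\textbf{x})$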
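A small sketch of the score, softmax and prediction steps: each score is exponentiated and divided by the sum of the exponentials over all classes. Subtracting the row maximum before exponentiating is a standard numerical-stability trick that does not change the result; the parameter matrix `Theta` and the instances are made up.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: exp of each score divided by the sum of exps in its row."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # stability only
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical parameter matrix: one row theta^(k) per class (K = 3, bias + 1 feature)
Theta = np.array([[ 1.0, -1.0],
                  [ 0.0,  0.0],
                  [-1.0,  1.0]])
X_b = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 3.0]])

scores = X_b @ Theta.T              # s_k(x) for every instance and class
probas = softmax(scores)            # estimated class probabilities
y_pred = probas.argmax(axis=1)      # class with the highest probability

print(probas)
print(y_pred)
```

Because the exponential is increasing, taking the argmax of the probabilities gives the same class as taking the argmax of the raw scores.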

Cross entropy cost function

$J(\Theta) = - \frac{1}{m} \sum\limits^{m}_{i=1} \sum\limits^{K}_{k=1} y^{(i)}_{k}\log (\hat p^{(i)}_{k})$

$y^{(i)}_{k}$ is the target probability that the i-th instance belongs to class $k$. In general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or not.
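The cross entropy can be sketched with one-hot encoded targets, so that only the log-probability of the true class contributes for each instance. The clipping constant `eps` and the example values are assumptions.

```python
import numpy as np

def cross_entropy(Y_onehot, P_hat, eps=1e-15):
    """Mean over instances of -sum_k y_k * log(p_k)."""
    log_p = np.log(np.clip(P_hat, eps, 1.0))    # keep log() away from 0
    return float(-np.mean(np.sum(Y_onehot * log_p, axis=1)))

# Two instances, three classes: true classes are 0 and 1
Y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
P_hat = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])

print(cross_entropy(Y_onehot, P_hat))
```

With only two classes ($K = 2$) this cost function reduces to the log loss of Logistic Regression.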
