Indeed, in many situations you don't really need to know the implementation details. However, having a good understanding of how things work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task.

In this chapter, we will start by looking at the Linear Regression model, one of the simplest models there is. We will discuss two different ways to train it:
- Using a "closed-form" equation that directly computes the model parameters that best fit the model to the training set (i.e., the model parameters that minimize the cost function over the training set).
- Using an iterative optimization approach, called Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.

Next we will look at Polynomial Regression, a more complex model that can fit nonlinear datasets. It has more parameters than Linear Regression and is therefore more prone to overfitting the training data. Then we will look at several regularization techniques that can reduce the risk of overfitting the training set.

Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression.

# Linear Regression
Generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the *bias term* (also called the *intercept term*).
$$\hat{y} = \theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$$
- $\hat{y}$ is the predicted value.
- $n$ is the number of features.
- $x_i$ is the $i^{th}$ feature value.
- $\theta_j$ is the $j^{th}$ model parameter (including the bias term $\theta_0$ and the feature weights $\theta_1,\theta_2,\dots,\theta_n$).
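To make the weighted sum concrete, here is a minimal sketch in plain Python. The function name `predict` and the parameter and feature values are hypothetical, chosen only for illustration.

```python
def predict(theta, x):
    """Return theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta: list of n + 1 parameters (theta[0] is the bias term).
    x:     list of n feature values x_1 ... x_n.
    """
    bias, weights = theta[0], theta[1:]
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical parameters: theta_0 = 4, theta_1 = 3, theta_2 = -2
theta = [4.0, 3.0, -2.0]
# Hypothetical instance with n = 2 features: x_1 = 1, x_2 = 2
x = [1.0, 2.0]

y_hat = predict(theta, x)  # 4 + 3*1 + (-2)*2 = 3.0
print(y_hat)
```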

This can also be written much more concisely using a vectorized form.
$$\hat{y}=h_\theta(X)=\theta^T\cdot X$$
- $\theta$ is the model's *parameter vector*, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.
- $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector).
- $X$ is the instance's *feature vector*, containing $x_0$ to $x_n$, with $x_0$ always equal to 1.
- $\theta^T\cdot X$ is the dot product of $\theta^T$ and $X$.
- $h_\theta$ is the hypothesis function, using the model parameters $\theta$.
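The vectorized form can be sketched with NumPy as follows. This assumes NumPy is available; the variable names and numeric values are hypothetical. The key step is prepending $x_0 = 1$ to the feature vector so that the bias term is handled by the same dot product.

```python
import numpy as np

# Hypothetical parameter vector as a column: theta_0 = 4, theta_1 = 3, theta_2 = -2
theta = np.array([[4.0], [3.0], [-2.0]])

# Hypothetical feature values x_1 = 1, x_2 = 2
features = np.array([1.0, 2.0])

# Prepend x_0 = 1 and reshape into a column vector of shape (n + 1, 1)
x = np.concatenate(([1.0], features)).reshape(-1, 1)

# theta^T . X: (1, n + 1) times (n + 1, 1) gives a 1x1 matrix
y_hat_vec = (theta.T @ x).item()
print(y_hat_vec)  # same weighted sum: 4*1 + 3*1 + (-2)*2
```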

That is the Linear Regression model. Now how do we train it? For this purpose, we first need a measure of how well (or poorly) the model fits the training data. A common performance measure for a regression model is the Root Mean Square Error (RMSE). Therefore, to train a Linear Regression model, you need to find the value of $\theta$ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a non-negative function also minimizes its square root).

The MSE of a Linear Regression hypothesis $h_\theta$ on a training set $X$ is calculated using this equation:
$$MSE(X,h_\theta)=\frac{1}{m}\sum_{i=1}^m(\theta^T\cdot X^{(i)} - y^{(i)})^2$$

NOTE: We write $h_\theta$ instead of $h$ in order to make it clear that the model is parametrized by the vector $\theta$. To simplify the notation, we will just write $MSE(\theta)$ instead of $MSE(X,h_\theta)$.
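As an illustration, the MSE can be computed for a whole training set at once. The sketch below assumes NumPy, stores $\theta$ as a 1-D array for convenience, and uses a hypothetical helper `mse` and a tiny made-up dataset that lies exactly on the line $y = 1 + 2x$.

```python
import numpy as np


def mse(theta, X, y):
    """MSE(theta) = (1/m) * sum_i (theta^T . x^(i) - y^(i))^2.

    theta: 1-D array of n + 1 parameters (bias first).
    X:     array of shape (m, n), one instance per row, without x_0.
    y:     1-D array of m target values.
    """
    m = X.shape[0]
    X_b = np.c_[np.ones((m, 1)), X]  # add x_0 = 1 to every instance
    errors = X_b @ theta - y         # prediction minus target, per instance
    return float(np.mean(errors ** 2))


# Hypothetical training set: three instances, one feature, y = 1 + 2x exactly
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([1.0, 3.0, 5.0])

mse_perfect = mse(np.array([1.0, 2.0]), X_train, y_train)  # parameters match the data
mse_off = mse(np.array([0.0, 2.0]), X_train, y_train)      # bias is off by 1 everywhere
print(mse_perfect, mse_off)
```

With the second parameter vector every prediction misses its target by exactly 1, so the mean of the squared errors is 1, while the first parameter vector fits the data perfectly and gives an MSE of 0.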

## The Normal Equation
To find the value of $\theta$ that minimizes the cost function, there is a *closed-form solution*: in other words, a mathematical equation that gives the result directly. This is called the **Normal Equation**.
$$\hat{\theta}=(X^T\cdot X)^{-1}\cdot X^T\cdot y$$
- $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function.
- $y$ is the vector of target values containing $y^{(1)}$ to $y^{(m)}$.
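The Normal Equation translates almost directly into NumPy. The following is a minimal sketch rather than a definitive implementation: the helper name `normal_equation`, the random seed, and the synthetic data (generated from a hypothetical ground truth $y = 4 + 3x$ plus Gaussian noise) are all assumptions made for illustration.

```python
import numpy as np


def normal_equation(X, y):
    """Return theta_hat = (X_b^T . X_b)^-1 . X_b^T . y.

    X: array of shape (m, n), one instance per row, without x_0.
    y: array of shape (m, 1) with the target values.
    """
    m = X.shape[0]
    X_b = np.c_[np.ones((m, 1)), X]  # add x_0 = 1 to every instance
    return np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y


# Synthetic data from a hypothetical ground truth: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))

theta_hat = normal_equation(X, y)
print(theta_hat.ravel())  # expected to land near [4, 3], but not exactly, because of the noise
```

Note that `np.linalg.inv` raises an error when $X^T\cdot X$ is singular (for example, when features are redundant). In that case `np.linalg.pinv` or `np.linalg.lstsq` are commonly used alternatives that still return a least-squares solution.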