# Table of Contents
* [1) Multiple Features](#1%29-Multiple-Features)
	* [1) Notation](#1%29-Notation)
	* [2) Hypothesis](#2%29-Hypothesis)
* [2) Gradient Descent for Multiple Variables](#2%29-Gradient-Descent-for-Multiple-Variables)
	* [New algorithm](#New-algorithm)
* [3) Gradient Descent in Practice: Feature Scaling](#3%29-Gradient-Descent-in-Practice:-Feature-Scaling)
	* [1) Idea: Make sure features are on a similar scale.](#1%29-Idea:-Make-sure-features-are-on-a-similar-scale.)
	* [2) Mean Normalization](#2%29-Mean-Normalization)
* [4) Gradient Descent in Practice: Learning Rate](#4%29-Gradient-Descent-in-Practice:-Learning-Rate)
	* [1) Making sure gradient descent is working correctly](#1%29-Making-sure-gradient-descent-is-working-correctly)
	* [2) Summary](#2%29-Summary)
* [5) Features and polynomial regression](#5%29-Features-and-polynomial-regression)
	* [1) Polynomial regression](#1%29-Polynomial-regression)
	* [2) Choice of features](#2%29-Choice-of-features)


# 1) Multiple Features

## 1) Notation

<img src="images/lec4_pic1.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/6Nj1q/multiple-features) 3:05*

<!--TEASER_END-->

<img src="images/lec4_pic2.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/6Nj1q/multiple-features) 3:24*

<!--TEASER_END-->

## 2) Hypothesis

$$\large h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} $$

- For convenience of notation, we define $\large x_{0} = 1$.
- Previously, we had n features, $\large x_{1}, x_{2}, ..., x_{n}$. With the additional zeroth feature $\large x_{0}$, the feature vector x becomes an (n + 1)-dimensional, zero-indexed vector.
- The parameter vector $\large \theta$ is likewise zero-indexed, starting from $\large \theta_{0}$, so it is also an (n + 1)-dimensional vector.

<img src="images/lec4_pic3.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/6Nj1q/multiple-features) 8:15*

<!--TEASER_END-->

- $\large \theta^{T}$ is a (1 by (n + 1)) matrix
- x is an ((n + 1) by 1) matrix
- The result of $\large \theta^{T}x$ is a (1 by 1) matrix, i.e. a single real number

The hypothesis is just the inner product between our parameter vector $\theta$ and our feature vector x.
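
As a minimal sketch of this inner product in plain Python (the function name `hypothesis` and the numeric values are made up for illustration, not taken from the lecture), the code below prepends $x_{0} = 1$ and then sums $\theta_{j}x_{j}$ over j = 0..n:

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x for a single example.

    `features` holds x_1..x_n; the constant x_0 = 1 is prepended here,
    so `theta` must have n + 1 entries (theta_0..theta_n).
    """
    x = [1.0] + list(features)  # zero-indexed vector with n + 1 entries
    if len(theta) != len(x):
        raise ValueError("theta must have n + 1 entries")
    return sum(theta_j * x_j for theta_j, x_j in zip(theta, x))


# Illustrative values only: theta = [80, 0.5, 10], x_1 = 100, x_2 = 3
prediction = hypothesis([80.0, 0.5, 10.0], [100.0, 3.0])
print(prediction)
```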

The term multivariable is just a fancy way of saying that we have multiple features, or variables, with which to predict the value y.

# 2) Gradient Descent for Multiple Variables

<img src="images/lec4_pic4.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Z9DKX/gradient-descent-for-multiple-variables) 1:20*

<!--TEASER_END-->

- For this quiz, we have:
$$\large h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} = \theta^{T}x$$

    - Answer 1:  $\large h_{\theta}(x^{(i)}) = \theta^{T}x^{(i)} $
    - Answer 2:
    $\large h_{\theta}(x^{(i)})= \theta_{0}x_{0}^{(i)} + \theta_{1}x_{1}^{(i)} + \theta_{2}x_{2}^{(i)} + ... + \theta_{n}x_{n}^{(i)} = \sum_{j=0}^n \theta_{j}x_{j}^{(i)} $

<img src="images/lec4_pic5.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Z9DKX/gradient-descent-for-multiple-variables) 1:47*

<!--TEASER_END-->

## New algorithm

The new and old algorithms are very similar. Consider the case of two features (n = 2), where we have three update rules, one each for the parameters $\theta_{0}, \theta_{1}, \theta_{2}$.

- The update rule for $\theta_{0}$ is the same as the one we had previously for n = 1, because by our notational convention $x_{0}^{(i)} = 1$.
- Likewise, the new update rule for $\theta_{1}$ matches the old one; the only difference is that the feature is now written $x_{1}^{(i)}$ instead of $x^{(i)}$.
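
The update rule for every parameter can be sketched in plain Python as below. This assumes the standard batch rule $\theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_{j}^{(i)}$ with all $\theta_{j}$ updated simultaneously; the function names and the toy data are illustrative assumptions, not part of the lecture.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """Return the new theta after one simultaneous update of every theta_j."""
    m = len(X)
    # Errors are computed once with the OLD theta, so the update is simultaneous.
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        theta_j - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j, theta_j in enumerate(theta)
    ]


# Made-up toy data generated from y = 1 + 2*x1 + 3*x2 (first column is x_0 = 1).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)  # should end up close to [1, 2, 3]
```

Note that for j = 0 the factor `xi[0]` is always 1, which is why the $\theta_{0}$ rule looks the same as in the single-feature case.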


<img src="images/lec4_pic6.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Z9DKX/gradient-descent-for-multiple-variables) 4:57*

<!--TEASER_END-->

# 3) Gradient Descent in Practice: Feature Scaling

## 1) Idea: Make sure features are on a similar scale.

For example, suppose you have 2 features:
- x1: size of the house
- x2: number of bedrooms

If those features are not on a similar scale, the contours of the cost function $\large J_{\theta}$ form very skewed ellipses. Gradient descent on such a cost function can take a long time, oscillating back and forth before it finally finds its way to the global minimum.

When you divide x1 by 2000 and x2 by 5, the contours become much less skewed, and gradient descent can find a much more direct path to the global minimum.

<img src="images/lec4_pic7.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling) 3:00*

<!--TEASER_END-->

**Make sure features are on a similar scale.**

- Get every feature into approximately a $-1 \le x_{i} \le 1$ range.

## 2) Mean Normalization

- $x_{i}$: feature
- $\mu_{i}$: mean (average) value of feature $x_{i}$
- $s_{i}$: range of the feature (max - min); alternatively, you can set $s_{i}$ to the standard deviation of the feature.

$$\large x_{i} = \dfrac{x_{i} - \mu_{i}}{s_{i}}$$

For example,
- $x_{1}$: size of the house. If the average size of a house is 1000, we set $x_{1} = \dfrac{size - 1000}{2000}$.
- $x_{2}$: number of bedrooms. If the average number of bedrooms is 2, we set $x_{2} = \dfrac{bedrooms - 2}{5}$.
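
The formula above can be sketched in plain Python as follows. The helper name `mean_normalize`, its `scale` argument, and the sample sizes are assumptions made for illustration; the mean and range are computed from the data rather than taken from the rounded values in the example.

```python
def mean_normalize(values, scale="range"):
    """Return (x - mu) / s for every value of one feature.

    s is the range (max - min) by default, or the (population)
    standard deviation when scale="std".
    """
    mu = sum(values) / len(values)
    if scale == "range":
        s = max(values) - min(values)
    else:
        s = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    if s == 0:
        raise ValueError("feature is constant; cannot scale")
    return [(v - mu) / s for v in values]


# Made-up house sizes whose mean is 1000.
sizes = [100.0, 600.0, 1300.0, 2000.0]
scaled_sizes = mean_normalize(sizes)
print(scaled_sizes)
```

After normalization the feature has mean zero and, when dividing by the range, every value lies within $-1 \le x_{i} \le 1$.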

<img src="images/lec4_pic8.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling) 8:15*

<!--TEASER_END-->

<img src="images/lec4_pic9.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling) 8:35*

<!--TEASER_END-->

# 4) Gradient Descent in Practice: Learning Rate

## 1) Making sure gradient descent is working correctly

If gradient descent is working properly, then the cost function $J_{\theta}$ should decrease after every iteration.

In the plot below, between 300 and 400 iterations $J_{\theta}$ has barely gone down, and by 400 iterations the curve has flattened out. This means that at 400 iterations gradient descent has more or less <u>**converged**</u>, because the cost function is no longer decreasing much.

So this plot can also help you judge whether or not gradient descent has converged. Note that the number of iterations gradient descent takes to converge can vary a lot from one application to another.
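
One way to produce the data for such a plot is to record $J_{\theta}$ after every iteration. The sketch below assumes the usual squared-error cost $\frac{1}{2m}\sum_{i}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$; the stopping threshold `epsilon`, the function names, and the toy data are illustrative assumptions, not values from the lecture notes above.

```python
def cost(theta, X, y):
    """Squared-error cost: 1/(2m) * sum of squared prediction errors."""
    m = len(X)
    return sum(
        (sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    ) / (2 * m)


def run_gradient_descent(X, y, alpha, max_iters=1000, epsilon=1e-6):
    """Run batch gradient descent from theta = 0, recording the cost each step.

    Stops early once the cost fell by less than `epsilon` in one step
    (this also triggers if the cost went up).
    """
    m = len(X)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(max_iters):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        theta = [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
        history.append(cost(theta, X, y))
        if history[-2] - history[-1] < epsilon:
            break
    return theta, history


# Made-up data from y = 1 + 2*x (first column is x_0 = 1).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta, history = run_gradient_descent(X, y, alpha=0.1)
print(len(history) - 1, "iterations, final cost", history[-1])
```

Plotting `history` against the iteration number gives the kind of curve discussed above; a fixed threshold like `epsilon` is hard to choose in general, so looking at the plot is usually the more reliable check.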

<img src="images/lec4_pic10.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate) 4:22*

<!--TEASER_END-->

If you see $J_{\theta}$ actually increasing, that is a clear sign that gradient descent is not working. It usually means the learning rate $\alpha$ is too large and should be decreased.

<img src="images/lec4_pic11.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate) 6:50*

<!--TEASER_END-->

<img src="images/lec4_pic12.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate) 6:51*

<!--TEASER_END-->

## 2) Summary

When running gradient descent, I would try a range of values for $\alpha$, for example 0.001, 0.003, 0.01, ..., until I find a good learning rate for the problem.
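
A rough sketch of such a sweep is shown below. The candidate values beyond 0.01 simply continue the roughly threefold spacing and are an assumption, as are the toy data, the fixed 50 iterations, and the function names.

```python
def cost(theta, X, y):
    """Squared-error cost: 1/(2m) * sum of squared prediction errors."""
    m = len(X)
    return sum(
        (sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    ) / (2 * m)


def cost_history(X, y, alpha, iters=50):
    """Run batch gradient descent from theta = 0 and record the cost each step."""
    m = len(X)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iters):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        theta = [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
        history.append(cost(theta, X, y))
    return history


# Made-up data from y = 1 + 2*x (first column is x_0 = 1).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

results = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    history = cost_history(X, y, alpha)
    # A cost that ever goes up suggests alpha is too large.
    increased = any(b > a for a, b in zip(history, history[1:]))
    results[alpha] = (history[-1], increased)
    print(f"alpha={alpha:<6} final J={history[-1]:.3g} increased={increased}")
```

On this toy problem the smallest rates make very slow progress and the largest one makes the cost grow, which is the trade-off the sweep is meant to expose.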

<img src="images/lec4_pic13.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate) 8:35*

<!--TEASER_END-->

# 5) Features and polynomial regression

- Polynomial regression allows you to use the machinery of linear regression to fit very complicated, even very non-linear functions.

Let's take the example of predicting the price of a house. Suppose the house has 2 features: frontage and depth. You might build a linear regression model using frontage as your first feature x1 and depth as your second feature x2. But you don't have to use the features exactly as given: you can create a new feature, $area = frontage \times depth$, and use that in the model instead.

<img src="images/lec4_pic14.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Rqgfz/features-and-polynomial-regression) 2:03*

<!--TEASER_END-->

## 1) Polynomial regression

Let's take a look at an example below:

- It doesn't look like a straight line fits this data very well, so maybe you want to fit a **<u>quadratic model</u>**, which gives a better fit. But a quadratic model may not make sense here: a quadratic function eventually comes back down, and we don't think housing prices should go down when the size gets very large.

- Instead, we may choose a **<u>cubic function</u>** with a third-order term. The resulting green line is a better fit to the data because it doesn't eventually come back down.

<img src="images/lec4_pic15.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Rqgfz/features-and-polynomial-regression) 2:03*

<!--TEASER_END-->

So if we want to fit a model with the cubic function, we need to have 3 features:
- $x_{1}:$ size of the house
- $x_{2}:$ the square of the size of the house
- $x_{3}:$ the cube of the size of the house

And, just by choosing my three features this way and applying the machinery of linear regression, I can fit this model and end up with a cubic fit to my data.

**<u>Feature scaling</u>** becomes increasingly important when we choose features like this. Because the three features take on very different ranges of values, it's important to apply feature scaling if you're using gradient descent, so that they end up in comparable ranges.
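
To see how quickly the ranges diverge, here is a small sketch that builds the three features from a size and then mean-normalizes each column by its range. The sizes and the helper names `cubic_features` and `scale_columns` are made up for illustration.

```python
def cubic_features(size):
    """Return [x_1, x_2, x_3] = [size, size^2, size^3]."""
    return [size, size ** 2, size ** 3]


def scale_columns(rows):
    """Mean-normalize each feature column, dividing by its range (max - min)."""
    scaled_columns = []
    for column in zip(*rows):
        mu = sum(column) / len(column)
        s = max(column) - min(column)
        scaled_columns.append([(v - mu) / s for v in column])
    # Transpose back so each row is one example again.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical sizes spanning 1 to 1000.
sizes = [1.0, 10.0, 100.0, 1000.0]
raw = [cubic_features(s) for s in sizes]
scaled = scale_columns(raw)

for raw_row, scaled_row in zip(raw, scaled):
    print(raw_row, "->", [round(v, 3) for v in scaled_row])
```

With sizes up to 1000, the squared feature reaches $10^{6}$ and the cubed feature $10^{9}$, while after scaling every column stays within $-1 \le x_{i} \le 1$.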

## 2) Choice of features

Rather than using a **<u>cubic model</u>**, you can choose a **<u>square root function</u>**; the resulting curve flattens out a bit and never comes back down.
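
As a sketch, such a hypothesis could take the following form, where $size$ stands for the house-size feature (the exact form is an assumption here, since the text above does not spell it out):

$$\large h_{\theta}(x) = \theta_{0} + \theta_{1}(size) + \theta_{2}\sqrt{size}$$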

<img src="images/lec4_pic16.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Rqgfz/features-and-polynomial-regression) 6:32*

<!--TEASER_END-->

<img src="images/lec4_pic17.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Rqgfz/features-and-polynomial-regression) 6:35*

<!--TEASER_END-->