# Gradient Descent

- Start with some $\theta$
- Keep changing $\theta$ to reduce the cost function $J(\theta)$
- Repeat until it ends up at a minimum

Gradient descent algorithm

$$
\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$

Here $j$ indexes the parameters: with $p$ features there is one $\theta_j$ for each $j = 0, ..., p$. Repeat the update until convergence. In each iteration, take the partial derivative of the cost function with respect to each parameter, then update all the $\theta_j$ simultaneously.

Correct (Simultaneous update)

temp0 = $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta)$

temp1 = $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta)$

$\theta_0 = $ temp0

$\theta_1 = $ temp1

Incorrect (Not simultaneous update)

temp0 = $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta)$

$\theta_0 = $ temp0

temp1 = $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta)$

$\theta_1 = $ temp1
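The difference between the two orderings can be seen in a minimal sketch. The cost function $J(\theta) = \theta_0^2 + \theta_0 \theta_1 + \theta_1^2$ and the function names here are assumptions chosen only for illustration; any cost whose partial derivatives depend on both parameters would show the same effect.

```python
def grad(theta0, theta1):
    """Partial derivatives of the illustrative cost
    J = theta0^2 + theta0*theta1 + theta1^2 (a hypothetical example)."""
    return 2 * theta0 + theta1, theta0 + 2 * theta1


def step_simultaneous(theta0, theta1, alpha):
    # Both partial derivatives are evaluated at the OLD (theta0, theta1).
    g0, g1 = grad(theta0, theta1)
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1


def step_sequential(theta0, theta1, alpha):
    # theta0 is overwritten first, so the derivative for theta1
    # is evaluated at the already-updated theta0.
    g0, _ = grad(theta0, theta1)
    theta0 = theta0 - alpha * g0
    _, g1 = grad(theta0, theta1)
    theta1 = theta1 - alpha * g1
    return theta0, theta1


print(step_simultaneous(1.0, 1.0, alpha=0.5))  # (-0.5, -0.5)
print(step_sequential(1.0, 1.0, alpha=0.5))    # (-0.5, 0.25)
```

The two functions return different values for $\theta_1$ from the same starting point, because the sequential version no longer follows the gradient evaluated at a single point.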

Gradient descent can converge to a local minimum even with the learning rate $\alpha$ **fixed**, provided $\alpha$ is small enough. As it approaches a local minimum, the gradient gets smaller, so gradient descent automatically takes smaller steps. There is therefore no need to decrease $\alpha$ over time.

## Batch Gradient Descent

Use **all $n$ examples** in each iteration. I think the name is **confusing**, because "batch" sounds like a part of the data, but it actually uses all the data.

In **linear regression**, the hypothesis function is $h_{\theta}(x) = \sum_{j = 0}^{p} \theta_j x_j$ and the squared-error cost function is $J(\theta) = \frac{1}{2n} \sum_{i = 1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)})^2$. The derivative of the cost function with respect to parameter $\theta_j$ is $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{n} \sum_{i = 1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$, so the batch gradient descent update is,

$$
\theta_j = \theta_j - \alpha \frac{1}{n} \sum_{i = 1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \quad \text{for every } j = 0, ..., p
$$

$\sum_{i = 1}^{n}$ tells us that, even for a single update of the parameters $\theta$, we need to sum over all the data. This is **slow** when, for example, $n = 300{,}000{,}000$.
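The batch update for linear regression can be sketched in plain Python as follows. The function names, the toy data ($y = 1 + 2x$ with $x_0 = 1$), the learning rate, and the iteration count are all assumptions for illustration, not tuned values.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def batch_gradient_step(theta, X, y, alpha):
    """One batch update: every example contributes to every theta_j."""
    n = len(X)
    # Errors are computed once with the old theta (simultaneous update).
    errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
    gradient = [
        sum(e * x[j] for e, x in zip(errors, X)) / n
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, gradient)]


# Hypothetical toy data generated from y = 1 + 2x; the first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = batch_gradient_step(theta, X, y, alpha=0.1)

print(theta)  # close to [1.0, 2.0]
```

Each call to `batch_gradient_step` loops over the whole dataset, which is exactly the cost that grows with $n$.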


## Stochastic Gradient Descent

Use **1 example** in each iteration. Stochastic gradient descent defines the cost for a single example as follows,

$$
cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} (h_{\theta}(x^{(i)}) - y^{(i)})^2
$$

And define the overall cost function as,

$$
J(\theta) = \frac{1}{n} \sum_{i = 1}^{n} cost(\theta, (x^{(i)}, y^{(i)}))
$$

But in each iteration, the update of parameter $\theta$ uses only the single example $i$ and **does not sum over all the examples**,

$$
\theta_j = \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \quad \text{for every } j = 0, ..., p
$$
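A minimal sketch of this per-example update is shown below. Shuffling the examples at the start of each pass, the fixed seed, the toy data, and the function names are assumptions added for illustration; they are not specified by the update rule itself.

```python
import random


def predict(theta, x):
    """Hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def sgd_step(theta, x, y_i, alpha):
    """Update every theta_j using a single example (x, y_i)."""
    error = predict(theta, x) - y_i  # computed once with the old theta
    return [t - alpha * error * xj for t, xj in zip(theta, x)]


def sgd(X, y, alpha, epochs, seed=0):
    rng = random.Random(seed)        # fixed seed: an assumption, for repeatability
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)           # visit the examples in random order (assumption)
        for i in order:
            theta = sgd_step(theta, X[i], y[i], alpha)
    return theta


# Hypothetical noise-free toy data from y = 1 + 2x; the first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = sgd(X, y, alpha=0.05, epochs=2000)
print(theta)  # close to [1.0, 2.0]
```

Because this toy data is noise-free, a constant $\alpha$ is enough for the parameters to settle; on noisy data they would keep fluctuating near the minimum.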

In stochastic gradient descent the learning rate $\alpha$ is typically held constant, in which case the parameter $\theta$ tends to wander around the minimum instead of settling on it. If we want $\theta$ to converge, we can slowly decrease $\alpha$ over time, for example,

$$
\alpha = \frac{\text{constant}_1}{\text{Iteration number} + \text{constant}_2}
$$

So **as the iterations proceed, the denominator gets larger and the learning rate $\alpha$ gets smaller**. In practice people often skip this, because it takes extra time to tune the additional constants $\text{constant}_1$ and $\text{constant}_2$.
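The decaying learning rate can be written as a small helper. The function name and the constants used in the example ($\text{constant}_1 = 1$, $\text{constant}_2 = 10$) are assumptions picked only to show the shape of the decay.

```python
def decayed_alpha(iteration, const1, const2):
    """Learning rate alpha = const1 / (iteration + const2)."""
    return const1 / (iteration + const2)


# Hypothetical constants, chosen only for illustration.
for iteration in (0, 10, 90, 990):
    print(iteration, decayed_alpha(iteration, const1=1.0, const2=10.0))
```

With these constants the learning rate falls from about 0.1 at iteration 0 to about 0.001 at iteration 990.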

## Mini-Batch Gradient Descent

Use **b examples** in each iteration. 

For example, let $b = 10$ and $n = 1000$, where $p$ is the number of features.

In the 1st iteration,

$$
\theta_j = \theta_j - \alpha \frac{1}{10} \sum_{k = i}^{i + 9} (h_{\theta}(x^{(k)}) - y^{(k)}) x_j^{(k)} \quad (\text{for every } j = 0, ..., p)
$$

The index starts at $i = 1$, so the 1st iteration uses examples 1 through 10 to update $\theta_j$. The 2nd iteration starts from $i = 11$ and uses examples 11 through 20. Repeat until the last batch, which starts at $i = 991$ and ends at example 1000.

People argue that, with a good implementation of **vectorization**, mini-batch gradient descent runs faster than stochastic gradient descent.
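A plain-Python sketch of the mini-batch loop, using $b = 10$ and $n = 1000$ as in the example above, might look like this. The toy data, learning rate, number of passes, and function names are assumptions for illustration. The sketch is not vectorized, so it shows the indexing of the batches rather than the speed advantage.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def minibatch_gd(X, y, alpha, b, epochs):
    n = len(X)
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        # start = 0, b, 2b, ... corresponds to i = 1, 11, 21, ... in 1-based indexing.
        for start in range(0, n, b):
            X_batch = X[start:start + b]
            y_batch = y[start:start + b]
            errors = [predict(theta, x) - yi for x, yi in zip(X_batch, y_batch)]
            gradient = [
                sum(e * x[j] for e, x in zip(errors, X_batch)) / len(X_batch)
                for j in range(len(theta))
            ]
            theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical noise-free toy data: n = 1000 examples from y = 1 + 2x, with x_0 = 1.
n = 1000
X = [[1.0, i / n] for i in range(n)]
y = [1.0 + 2.0 * x[1] for x in X]

theta = minibatch_gd(X, y, alpha=0.5, b=10, epochs=200)
print(theta)  # close to [1.0, 2.0]
```

Each pass over the data performs $n / b = 100$ parameter updates, compared with 1 for batch gradient descent and 1000 for stochastic gradient descent.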

## Feature Scaling

If we make sure multiple features are on **similar scales**, meaning they have similar ranges of values, then gradient descent can **converge more quickly**.

When features are on very different scales, the contours of the loss function (cost function) become skewed ellipses. Because the gradient is **perpendicular** to the contour, gradient descent is likely to oscillate and take a long time to reach the global optimum.

After scaling, the contours become closer to circles, and gradient descent is less likely to oscillate.

For example, get every feature into approximately a $-1 \le x \le 1$ range.

**Mean normalization** replaces $x_j$ with $x_j - \mu_j$ so that each feature has approximately zero mean (do not apply it to the intercept $x_0 = 1$).
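A minimal sketch of mean normalization is shown below. Dividing by the feature's range (max minus min) after subtracting the mean is an assumption added here so that the values also land in roughly the $-1 \le x \le 1$ range; the function name and the toy data are likewise hypothetical.

```python
def mean_normalize(X):
    """Return a copy of X with every feature except x_0 mean-normalized."""
    n = len(X)
    p = len(X[0])
    scaled = [row[:] for row in X]
    for j in range(1, p):  # skip j = 0, the intercept column x_0 = 1
        column = [row[j] for row in X]
        mu = sum(column) / n
        spread = max(column) - min(column)  # range scaling: an assumption
        if spread == 0:
            spread = 1.0  # constant feature: avoid division by zero
        for i in range(n):
            scaled[i][j] = (X[i][j] - mu) / spread
    return scaled


# Hypothetical data: one feature in the thousands, one in single digits.
X = [[1.0, 1000.0, 1.0],
     [1.0, 2000.0, 3.0],
     [1.0, 3000.0, 5.0]]

print(mean_normalize(X))
# [[1.0, -0.5, -0.5], [1.0, 0.0, 0.0], [1.0, 0.5, 0.5]]
```

After normalization both features cover the same range, even though their original scales differed by a factor of several hundred.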

## Reference

- [Machine Learning by Stanford University | Coursera](https://www.coursera.org/learn/machine-learning)