# Gradient Descent

---

## Cost Function:
- Find the *weights* and *biases* that minimize the **Mean Squared Error (MSE)** Cost Function, ideally so that $C(w,b) \approx 0$:
$$ C(w,b) \equiv \frac{1}{2n}\sum_x \lVert y(x) - a \rVert^2$$
    - $w$: all the weights in the network
    - $b$: all the biases
    - $n$: the total number of training inputs
    - $y(x)$: the desired output for training input $x$
    - $a$: the vector of outputs from the network when $x$ is input
    - The sum is over all training inputs, $x$
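
As a minimal sketch of the cost formula above, the function below computes the quadratic cost for a handful of made-up examples. The name `mse_cost` and the sample values are assumptions for illustration only; `targets` plays the role of the desired outputs (written $y(x)$ here) and `outputs` the role of the network outputs $a$.

```python
def mse_cost(targets, outputs):
    """Quadratic cost C = 1/(2n) * sum over x of ||y(x) - a||^2.

    targets: list of desired output vectors, one per training input
    outputs: list of network output vectors, in the same order
    """
    n = len(targets)
    total = 0.0
    for y, a in zip(targets, outputs):
        # Squared Euclidean distance between desired and actual output
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)


# Two hypothetical training inputs, each with a 2-component output
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
cost = mse_cost(targets, outputs)
print(cost)  # approximately 0.075
```

The cost is zero only when every output matches its target exactly, which is why learning is framed as driving $C$ toward 0.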

## Generalized Cost Function:
- Minimize the Cost Function: $C(v) \approx 0 $
- Small changes in parameters $v$ change the Cost Function $C$ as follows:
$$ \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 $$

## Gradient:
- The *Gradient of C* can be written as the vector of partial derivatives of the *Cost Function* with respect to each parameter: 
$$ \nabla C \equiv (\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2})$$
- The changes in parameters $\Delta v$ can be written as a vector:
$$\Delta v \equiv (\Delta v_1, \Delta v_2)$$
- The expression for the change in the *Cost Function*, $\Delta C$, can be rewritten as:
$$ \Delta C \approx \nabla C \cdot \Delta v $$
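
The approximation $\Delta C \approx \nabla C \cdot \Delta v$ can be checked numerically. The sketch below uses a hypothetical two-parameter cost, $C(v_1, v_2) = v_1^2 + 3 v_2^2$, chosen only because its gradient is easy to write by hand; it is not taken from the notes.

```python
def cost(v1, v2):
    # Hypothetical toy cost, used only for illustration
    return v1 ** 2 + 3 * v2 ** 2


def grad(v1, v2):
    # Partial derivatives of the toy cost with respect to v1 and v2
    return (2 * v1, 6 * v2)


v = (1.0, 2.0)          # current position
dv = (0.001, -0.002)    # a small change in each parameter

# Exact change in the cost after moving by dv
actual = cost(v[0] + dv[0], v[1] + dv[1]) - cost(*v)

# First-order estimate: dot product of the gradient with dv
g = grad(*v)
approx = g[0] * dv[0] + g[1] * dv[1]

print(actual, approx)
```

The two values agree to within second-order terms in $\Delta v$, so the estimate improves as the change gets smaller.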

## Choosing $\Delta v$ and the Learning Rate:
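- To make $\Delta C$ negative, so that $C$ decreases, choose $\Delta v = - \eta \nabla C$, where $\eta$ is the **learning rate** (a small positive number). Then $\Delta C \approx -\eta \lVert \nabla C \rVert^2 \le 0$
- Applying this $\Delta v$ gives the update rule for $v$: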
$$ v \rightarrow v^\prime = v - \eta \nabla C$$
- Use the above formula to update $v$ after each move in *Gradient Descent*, changing the position $v$ step by step to find a minimum of the function $C$
- Choose the *learning rate* $\eta$ small enough that the approximation for $\Delta C$ holds and $C$ does not increase, but not so small that the *Gradient Descent* algorithm runs very slowly
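
The trade-off in choosing $\eta$ can be seen on a toy problem. This sketch assumes the one-parameter cost $C(v) = v^2$ (gradient $2v$, minimum at $v = 0$); the function name `descend` and the three learning rates are illustrative choices, not values from the notes.

```python
def descend(eta, steps=20, v=1.0):
    """Apply v -> v - eta * dC/dv repeatedly for the toy cost C(v) = v^2."""
    for _ in range(steps):
        v = v - eta * 2 * v  # gradient of v^2 is 2v
    return v


well_chosen = descend(0.1)    # moves steadily toward the minimum at 0
too_large = descend(1.1)      # overshoots more on every step and diverges
too_small = descend(0.001)    # barely moves away from the start in 20 steps

print(well_chosen, too_large, too_small)
```

For this cost each update multiplies $v$ by $(1 - 2\eta)$, so the iteration shrinks $v$ only when $0 < \eta < 1$.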

## Gradient Descent with *weights* and *biases*:
- The rule which can be used to learn in a neural network:

$$ w_k \rightarrow w_k^\prime = w_k - \eta \frac{\partial C}{\partial w_k} $$

$$ b_l \rightarrow b_l^\prime = b_l - \eta \frac{\partial C}{\partial b_l} $$

- Find a minimum of the *Cost Function* by repeatedly applying the update rule 
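
A minimal sketch of these update rules, assuming the simplest possible "network": a single linear neuron $a = wx + b$ with one weight and one bias, trained with the quadratic cost on made-up data generated from $y = 2x + 1$. The data, learning rate, and iteration count are assumptions for illustration.

```python
# Hypothetical training data drawn from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # initial weight and bias
eta = 0.1         # learning rate
n = len(xs)

for _ in range(2000):
    # Error (a - y) for every training input at the current w and b
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]

    # Partial derivatives of C = 1/(2n) * sum (y - a)^2
    grad_w = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n

    # Update rules: w -> w - eta * dC/dw,  b -> b - eta * dC/db
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # close to 2 and 1
```

In a real network the partial derivatives are computed for every weight and bias (typically by backpropagation), but the update applied to each one has this same form.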

---

## Stochastic Gradient Descent:
- With large datasets, *Gradient Descent* can be slow because every step computes the gradient over all training inputs; **Stochastic Gradient Descent** can be used to speed up learning
- **Stochastic Gradient Descent** reduces the number of terms that need to be computed at each step
- Especially useful when there are redundancies in the data
- As with **Gradient Descent**, it is common to start with a relatively *large* **Learning Rate** and make it *smaller* with each step, following a **learning rate schedule**
- **Stochastic Gradient Descent** uses 1 sample per step, whereas **Mini-batch Gradient Descent** uses a small subset of data at each step
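
The sketch below illustrates the idea on the same kind of toy problem: a single linear neuron $a = wx + b$ fitted to made-up data from $y = 2x + 1$. The function name `minibatch_sgd`, the inverse-decay schedule, and all default values are assumptions for illustration. Setting `batch_size=1` gives plain Stochastic Gradient Descent.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, epochs=500, eta0=0.1, decay=0.01, seed=0):
    """Fit a = w*x + b using gradients estimated from small random batches."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for epoch in range(epochs):
        # Simple schedule: the learning rate shrinks as training proceeds
        eta = eta0 / (1.0 + decay * epoch)
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from this batch only, not the full dataset
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


# Hypothetical training data drawn from y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)                        # mini-batches of 4
w_sgd, b_sgd = minibatch_sgd(xs, ys, batch_size=1)  # one sample per step
print(w, b, w_sgd, b_sgd)
```

Each update touches only `batch_size` samples, so a step costs far less than a full pass over the data, at the price of a noisier gradient estimate.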

---

## Gradient Descent Algorithm:
1. Take the **Gradient** of the Cost Function: take the derivative of the Cost Function with respect to each parameter
2. Plug the initial parameter values into the **Gradient** (derivatives)
3. Calculate the **Step Size**: **Step Size = Slope x Learning Rate**
4. Calculate the new parameter values: **New Parameter = Old Parameter - Step Size**
5. Repeat Steps 2-4 with the new parameter values until **Step Size** is very small or until the **Maximum Number of Steps** is reached
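
The steps above can be sketched as a short loop with both stopping conditions. The function name `gradient_descent`, its default values, and the toy cost $C(v_1, v_2) = (v_1 - 3)^2 + (v_2 + 1)^2$ are assumptions for illustration; the gradient is written by hand, which corresponds to step 1.

```python
def gradient_descent(grad, initial, learning_rate=0.1, min_step=1e-6, max_steps=1000):
    """Run the update loop until the step is tiny or max_steps is reached.

    grad: function returning the list of slopes at the given parameters
    initial: list of starting parameter values
    Returns the final parameters and the number of steps taken.
    """
    params = list(initial)
    steps_taken = 0
    for steps_taken in range(1, max_steps + 1):
        slopes = grad(params)                                  # plug in current values
        step_sizes = [learning_rate * s for s in slopes]       # slope x learning rate
        params = [p - d for p, d in zip(params, step_sizes)]   # old parameter - step size
        if max(abs(d) for d in step_sizes) < min_step:         # step size is very small
            break
    return params, steps_taken


def toy_grad(v):
    # Hand-derived gradient of the toy cost (v1 - 3)^2 + (v2 + 1)^2
    return [2 * (v[0] - 3), 2 * (v[1] + 1)]


params, steps_taken = gradient_descent(toy_grad, [0.0, 0.0])
print(params, steps_taken)  # parameters close to [3, -1]
```

Here the loop stops on the step-size condition well before the maximum number of steps; with a much smaller learning rate the step cap would be what ends it.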