# Feature Scaling in Gradient Descent

The range of values each feature takes influences how quickly gradient descent converges:

Find $w_{1}$ and $w_{2}$ for:
$$\hat{y} = w_{1}x_{1} + w_{2}x_{2} + b $$
where $x_{1}$ has a large range (300 to 2000) and $x_{2}$ has a small range (0 to 5).

Large values of $w_{1}$, and even small changes in $w_{1}$, have a larger effect on $\hat{y}$ than the same values or changes in $w_{2}$, because $w_{1}$ is multiplied by the large values of $x_{1}$.

- If the values in a feature's range are large, a good model will likely learn a relatively small parameter for that feature, and vice versa. (A larger range usually means the values have a larger magnitude.)



## How changes in features and parameters affect the estimation
![unchanged](unscaled.png)

**Features graph:** the values on the x axis span a much larger range than those on the y axis.

**Contour plot:** the x axis spans a narrower range and the y axis a wider one.

- A small change to $w_{1}$ has a large impact on the estimate and the cost function $J$, because $w_{1}$ is usually multiplied by a large number $x_{1}$.
    - So only a small change to $w_{1}$ is needed for a meaningful change in cost.
- A small change to $w_{2}$ has a small impact on the estimate and the cost function $J$, because $w_{2}$ is usually multiplied by a smaller number $x_{2}$.
    - So a larger change to $w_{2}$ is needed for a meaningful change in cost.
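
The bullets above can be checked numerically. The following is a minimal sketch, assuming a single hypothetical example with $x_{1} = 2000$ and $x_{2} = 5$ and illustrative parameter values that do not come from the notes; it nudges each parameter by the same amount and compares the change in $\hat{y}$.

```python
import numpy as np

# Hypothetical example: x1 from the large range, x2 from the small range.
x = np.array([2000.0, 5.0])
# Illustrative parameters (assumed): small w1, larger w2.
w = np.array([0.1, 50.0])
b = 50.0

def predict(x, w, b):
    """Linear model: y_hat = w1*x1 + w2*x2 + b."""
    return float(np.dot(w, x) + b)

delta = 0.01  # the same small nudge, applied to one parameter at a time
base = predict(x, w, b)
change_w1 = predict(x, w + np.array([delta, 0.0]), b) - base
change_w2 = predict(x, w + np.array([0.0, delta]), b) - base

print(f"change in y_hat from nudging w1: {change_w1:.2f}")
print(f"change in y_hat from nudging w2: {change_w2:.2f}")
```

The change caused by $w_{1}$ is `delta * x1` (about 20) while the change caused by $w_{2}$ is `delta * x2` (about 0.05), so the same nudge is 400 times more influential on the large-range feature.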


## The effect on Gradient Descent:

![scaled](scaled.png)
- **Unscaled:** Because the contours are long and skinny (top right), gradient descent may bounce back and forth for a long time before it reaches the minimum.
- **Scaled features:** Transform the training data features so that the feature axes have similar scales (bottom left).
    - Rescaling the features, e.g. $x_{1}$ and $x_{2}$, means each feature now takes comparable values.
    - In the contour plot of the cost function $J$, the contours are more circular (bottom right).
    - Gradient descent finds a much more direct and quicker path to the global minimum.
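
To make the difference concrete, here is a rough sketch that runs plain batch gradient descent for the same number of iterations on unscaled and on scaled features. Everything in it is an assumption for illustration: the synthetic data (drawn from the 300 to 2000 and 0 to 5 ranges used earlier), the "true" parameters, the learning rates, and the choice of z-score normalisation as the scaling method.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared-error cost J = (1 / 2m) * sum((X @ w + b - y)^2)."""
    err = X @ w + b - y
    return float(err @ err / (2 * len(y)))

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent starting from w = 0, b = 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / len(y)
        b = b - alpha * err.mean()
    return w, b

# Synthetic data (assumed): x1 in [300, 2000], x2 in [0, 5], no noise.
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([rng.uniform(300, 2000, m), rng.uniform(0, 5, m)])
y = 0.1 * X[:, 0] + 50.0 * X[:, 1] + 50.0

# Z-score scaling: each column ends up with mean 0 and standard deviation 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

iters = 1000
# Unscaled features force a tiny learning rate, otherwise the updates diverge.
w_u, b_u = gradient_descent(X, y, alpha=1e-7, iters=iters)
# Scaled features tolerate a much larger learning rate.
w_s, b_s = gradient_descent(X_scaled, y, alpha=0.1, iters=iters)

cost_unscaled = cost(X, y, w_u, b_u)
cost_scaled = cost(X_scaled, y, w_s, b_s)
print(f"cost after {iters} iterations, unscaled: {cost_unscaled:.4f}")
print(f"cost after {iters} iterations, scaled:   {cost_scaled:.4e}")
```

In this setup the unscaled run is limited by the large-range feature: the learning rate has to be small enough for $w_{1}$, which leaves $w_{2}$ and $b$ moving very slowly, so its cost stays far above that of the scaled run.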

## How to scale features (Normalisation)?

### Divide Feature by max
- Divide the feature by the maximum of its range so the upper limit is 1
![divide_max](divide_by_max_scale.png)
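
A minimal NumPy sketch of dividing by the maximum, assuming a small made-up feature matrix with one column per feature and a positive maximum in every column:

```python
import numpy as np

# Made-up training data: column 0 is x1 (300-2000), column 1 is x2 (0-5).
X = np.array([[300.0, 0.0],
              [1150.0, 2.5],
              [2000.0, 5.0]])

# Column-wise maximum, then divide each feature by its own maximum.
X_max = X.max(axis=0)
X_scaled = X / X_max

print(X_scaled)
```

After this step $x_{1}$ lies between 0.15 and 1 and $x_{2}$ between 0 and 1, so both features have an upper limit of 1.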

### Mean Normalisation

- Shifts each feature so its mean is around zero
- Scales the feature by its range, so the values usually fall within [-1, 1]
- Find the mean of the training data for each feature ($\mu$) and compute the following:

$$x_{new} = \frac{x_{old} - \mu}{x_{max} - x_{min}}$$

- Subtracting the mean centres the values around 0, so values above the mean become positive and values below it become negative

![mean](mean_normal.png)
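
The mean normalisation formula above could be sketched as follows. The function name `mean_normalise` and the sample data are hypothetical, and the sketch assumes no feature is constant (a zero range would cause a division by zero).

```python
import numpy as np

def mean_normalise(X):
    """Return (X - mu) / (max - min) per column, plus the statistics used."""
    mu = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / value_range, mu, value_range

# Made-up training data: column 0 is x1 (300-2000), column 1 is x2 (0-5).
X = np.array([[300.0, 0.0],
              [1150.0, 2.5],
              [2000.0, 5.0]])

X_norm, mu, value_range = mean_normalise(X)
print(X_norm)
```

Returning `mu` and `value_range` is a design choice in this sketch: the same training statistics can then be applied to any new input before it is passed to the model.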


### Z-Score Normalisation

- Calculate the mean ($\mu$) and the standard deviation ($\sigma$) of each feature and compute the following:

$$x_{new} = \frac{x_{old} - \mu}{\sigma}$$

![z_score](z_score.png)
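
A short sketch of the z-score formula above, assuming a hypothetical helper `zscore_normalise`, made-up data, and the population standard deviation (NumPy's default, `ddof=0`):

```python
import numpy as np

def zscore_normalise(X):
    """Return (X - mu) / sigma per column, plus the statistics used."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mu) / sigma, mu, sigma

# Made-up training data: column 0 is x1 (300-2000), column 1 is x2 (0-5).
X = np.array([[300.0, 0.0],
              [1150.0, 2.5],
              [2000.0, 5.0]])

X_z, mu, sigma = zscore_normalise(X)
print(X_z)
print(X_z.mean(axis=0), X_z.std(axis=0))
```

Each scaled column has mean 0 and standard deviation 1, regardless of the original range of the feature.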

## Scaling rule of thumb:
![thumb](scaling_rule_thumb.png)
Try to keep the range of each feature small enough that small parameter changes don't affect the cost function too much.  
When in doubt, do the scaling.

Feature scaling is useful when one feature is much larger (or smaller) than another feature.