# Part I: Feature Scaling

**Feature and Parameter Values**

- price_hat = $w_{1}x_{1} + w_{2}x_{2} + b$  
$x_{1}$: size (square feet), range 300 to 2,000  
$x_{2}$: number of bedrooms, range 0 to 5  
- Training example: $x_{1}$ = 2000, $x_{2}$ = 5, price = 500k  

**Size of parameters $w_{1}, w_{2}$**

- Case 1:  
$w_{1} = 50, w_{2} = 0.1, b = 50$  
-> price_hat = 50 * 2,000 + 0.1 * 5 + 50 = 100,050.5k, far above the actual 500k  
- Case 2:  
$w_{1} = 0.1, w_{2} = 50, b = 50$  
-> price_hat = 0.1 * 2,000 + 50 * 5 + 50 = 500k, which matches the actual price
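
The two cases can be checked with a few lines of Python. This is only a sketch: the function name `predict_price` is made up for illustration, and prices are assumed to be in thousands of dollars, as in the notes.

```python
def predict_price(w1, w2, b, size_sqft, bedrooms):
    """Linear model: price_hat = w1 * x1 + w2 * x2 + b (in thousands of dollars)."""
    return w1 * size_sqft + w2 * bedrooms + b

# Training example from the notes: 2,000 square feet, 5 bedrooms, price 500k.
size_sqft, bedrooms = 2000, 5

case_1 = predict_price(50, 0.1, 50, size_sqft, bedrooms)   # large w1, small w2
case_2 = predict_price(0.1, 50, 50, size_sqft, bedrooms)   # small w1, large w2

print(f"Case 1: {case_1:,.1f}k")
print(f"Case 2: {case_2:,.1f}k")
```

A feature with a large range pairs well with a small parameter, and a feature with a small range pairs well with a larger one.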

**Feature size and Parameter Size**
![image.png](attachment:image.png)

- A very small change to $w_{1}$ can have a very large impact on the estimated price, and therefore on the cost $J$, because $w_{1}$ is multiplied by a large number (the size in square feet).
- It takes a much larger change in $w_{2}$ to change the predictions noticeably, so small changes to $w_{2}$ barely change the cost function.
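
To make the difference in sensitivity concrete, the sketch below nudges each parameter by the same small amount and compares the change in the prediction. The step size `delta = 0.01` and the helper name `predict_price` are arbitrary choices for illustration.

```python
def predict_price(w1, w2, b, size_sqft, bedrooms):
    """Linear model: price_hat = w1 * x1 + w2 * x2 + b (in thousands of dollars)."""
    return w1 * size_sqft + w2 * bedrooms + b

size_sqft, bedrooms = 2000, 5
w1, w2, b = 0.1, 50, 50
delta = 0.01  # same nudge applied to each parameter in turn

base = predict_price(w1, w2, b, size_sqft, bedrooms)
change_from_w1 = predict_price(w1 + delta, w2, b, size_sqft, bedrooms) - base
change_from_w2 = predict_price(w1, w2 + delta, b, size_sqft, bedrooms) - base

# The change equals delta times the feature value: about 20 for w1, about 0.05 for w2.
print(f"Change from nudging w1: {change_from_w1:.2f}k")
print(f"Change from nudging w2: {change_from_w2:.2f}k")
```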

**Feature size and Gradient Descent**
![image.png](attachment:image.png)

- Feature scaling means transforming the data so that all features take comparable ranges of values
- With comparable ranges, the contours of the cost function are less elongated, so gradient descent can find a much more direct path to the global minimum

**Feature Scaling**
![image.png](attachment:image.png)

For any $x_{j}$:  
$x_{j, scaled} = \frac{x_{j}}{\max(x_{j})}$  
=> Then $0 \le x_{j, scaled} \le 1$ (assuming $x_{j}$ is non-negative)
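
A minimal sketch of scaling by the maximum, in plain Python. The function name `scale_by_max` and the sample values are hypothetical; the code assumes non-negative values with a positive maximum.

```python
def scale_by_max(values):
    """Divide every value by the largest one.

    Assumes all values are non-negative and the maximum is positive.
    """
    largest = max(values)
    return [v / largest for v in values]

sizes = [300, 1000, 2000]   # hypothetical house sizes in square feet
bedrooms = [0, 2, 5]        # hypothetical bedroom counts

scaled_sizes = scale_by_max(sizes)        # 0.15, 0.5, 1.0
scaled_bedrooms = scale_by_max(bedrooms)  # 0.0, 0.4, 1.0
print(scaled_sizes)
print(scaled_bedrooms)
```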

**Mean Normalization**
![image.png](attachment:image.png)

For any $x_{j}$:  
$x_{j, scaled} = \frac{x_{j} - \mu_{j}}{\max(x_{j}) - \min(x_{j})}$  
=> Then $-1 \le x_{j, scaled} \le 1$
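
Mean normalization can be sketched the same way. The function name `mean_normalize` and the sample values are hypothetical, and the code assumes the feature is not constant (so the denominator is non-zero).

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min).

    Assumes the values are not all identical, so the range is non-zero.
    """
    mu = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mu) / spread for v in values]

sizes = [300, 1000, 2000]   # hypothetical house sizes in square feet
normalized_sizes = mean_normalize(sizes)
print(normalized_sizes)     # values are centered around 0 and lie within [-1, 1]
```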

**Z-Score Normalization**
![image.png](attachment:image.png)

For any $x_{j}$:  
$x_{j, scaled} = \frac{x_{j} - \mu_{j}}{\sigma_{j}}$
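
A minimal z-score sketch using Python's `statistics` module. The function name `z_score_normalize` is hypothetical, and the population standard deviation (`pstdev`) is an assumption here; the notes do not say whether the population or sample formula is intended.

```python
import statistics

def z_score_normalize(values):
    """Subtract the mean and divide by the standard deviation.

    Uses the population standard deviation (an assumption) and requires
    that the values are not all identical.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

sizes = [300, 1000, 2000]   # hypothetical house sizes in square feet
z_sizes = z_score_normalize(sizes)
print(z_sizes)              # rescaled feature has mean 0 and standard deviation 1
```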

**Rule of Thumb**

- Aim for about $-1 \le x_{j} \le 1$ for each feature  
  also acceptable: $-3 \le x_{j} \le 3$ or $-0.3 \le x_{j} \le 0.3$  
- Other acceptable ranges:  
  $0 \le x_{j} \le 3$  
  $-0.5 \le x_{j} \le 2$
- Ranges that need rescaling:  
  $-100 \le x_{j} \le 100$ (too large)  
  $-0.001 \le x_{j} \le 0.001$ (too small)  
  $98 \le x_{j} \le 105$ (large values, not centered near 0)

# Part II: Checking Gradient Descent for Convergence

![image.png](attachment:image.png)

- Objective: $\min_{\vec{w}, b} J(\vec{w}, b)$
- $J(\vec{w}, b)$ should decrease after every iteration; if it does not, consider decreasing the learning rate $\alpha$

**Automatic Convergence Test**  
Let $\epsilon$ be a small threshold, e.g. $10^{-3}$:  
if $J(\vec{w}, b)$ decreases by $\le \epsilon$ in one iteration, declare convergence
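
The convergence test can be sketched for a single-feature linear regression as follows. Everything here is illustrative: the function names, the toy data, the squared-error cost with a $\frac{1}{2m}$ factor, and the default `alpha` are assumptions, not values from the notes.

```python
def compute_cost(xs, ys, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w * x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.1, epsilon=1e-3, max_iters=10_000):
    """Run batch gradient descent until the cost decreases by <= epsilon."""
    m = len(xs)
    w, b = 0.0, 0.0
    history = [compute_cost(xs, ys, w, b)]  # cost after each iteration
    for _ in range(max_iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        w -= alpha * dj_dw
        b -= alpha * dj_db
        history.append(compute_cost(xs, ys, w, b))
        decrease = history[-2] - history[-1]
        # Declare convergence only if the cost went down, and by at most epsilon.
        if 0 <= decrease <= epsilon:
            break
    return w, b, history

# Toy data with an already-scaled feature, generated from y = 2x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]

w, b, history = gradient_descent(xs, ys)
print(f"stopped after {len(history) - 1} iterations, final cost {history[-1]:.4f}")
```

Plotting `history` against the iteration number gives the learning curve. A threshold as loose as $10^{-3}$ can stop well before the parameters reach their best values, so choosing $\epsilon$ is itself a judgment call.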

# Part III: Choosing the Learning Rate

**Identify the Problem with Gradient Descent**
![image.png](attachment:image.png)

- If the cost does not decrease consistently from one iteration to the next, there are two likely causes:  
there is a bug in the code,  
or $\alpha$ is too large

**Adjust Learning Rate**
![image.png](attachment:image.png)

- If $\alpha$ is too big, each update is likely to overshoot the minimum, so gradient descent may never reach it
- With a smaller $\alpha$, the cost converges gradually
- If $\alpha$ is too small, gradient descent may take a very large number of iterations to converge
- Values of $\alpha$ to try: ..., 0.001, 0.01, 0.1, 1, ... until the cost curve decreases quickly and consistently
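
Trying several learning rates can be sketched as a small sweep. The toy data, the fixed count of 20 iterations, and the helper names are assumptions made for illustration; the feature is deliberately left unscaled so that the largest $\alpha$ diverges.

```python
def compute_cost(xs, ys, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w * x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def cost_history(xs, ys, alpha, num_iters=20):
    """Run a fixed number of gradient descent steps and record the cost."""
    m = len(xs)
    w, b = 0.0, 0.0
    history = [compute_cost(xs, ys, w, b)]
    for _ in range(num_iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        w -= alpha * dj_dw
        b -= alpha * dj_db
        history.append(compute_cost(xs, ys, w, b))
    return history

def is_decreasing(history):
    """True if the cost never goes up from one iteration to the next."""
    return all(later <= earlier for earlier, later in zip(history, history[1:]))

# Toy, unscaled data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

decreasing = {}
final_cost = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    history = cost_history(xs, ys, alpha)
    decreasing[alpha] = is_decreasing(history)
    final_cost[alpha] = history[-1]
    print(f"alpha={alpha}: decreasing={decreasing[alpha]}, final cost={final_cost[alpha]:.4g}")
```

On this data the three smaller rates all decrease the cost, with larger rates getting further in the same number of steps, while $\alpha = 1$ makes the cost grow at every iteration.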

# Part IV: Polynomial Regression

**Choice of Feature**
![image.png](attachment:image.png)

- Quadratic model: $$f_{\vec{w}, b}(x) = w_{1}x + w_{2} x^2 + b$$  
However, a parabola eventually comes back down, and we don't expect the price to decrease as the size increases, so a quadratic model is not a good fit here

- Cubic model: $$f_{\vec{w}, b}(x) = w_{1}x + w_{2} x^2 + w_{3} x^3 + b$$  

- Square root model: $$f_{\vec{w}, b}(x) = w_{1}x + w_{2} \sqrt{x} + b$$  
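
The models above are still linear in the parameters. Only the input features change, so they can be built by hand before fitting. The sketch below uses hypothetical helper names and a made-up input of 1,000 square feet; the weights passed to `predict` are placeholders, not fitted values.

```python
import math

def cubic_features(x):
    """Features for the cubic model: x, x^2, x^3."""
    return [x, x ** 2, x ** 3]

def sqrt_features(x):
    """Features for the square root model: x, sqrt(x)."""
    return [x, math.sqrt(x)]

def predict(features, weights, b):
    """f(x) = w1 * feature1 + w2 * feature2 + ... + b."""
    return sum(w * f for w, f in zip(weights, features)) + b

size = 1000  # hypothetical house size in square feet

print(cubic_features(size))   # [1000, 1000000, 1000000000]
print(sqrt_features(size))

# Placeholder weights, chosen only to show the call; they are not fitted.
example_prediction = predict(cubic_features(size), [1.0, 0.0, 0.0], 50.0)
print(example_prediction)
```

The engineered features span very different ranges (here $10^3$, $10^6$, and $10^9$), which is why feature scaling matters even more for polynomial regression.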