# Linear Regression



### Goal: 
- To find the line $h(x) = \theta_0 + \theta_1x$ that best fits the training examples

### Process:
1. Training data is supplied to our algorithm
2. The algorithm learns a function, known as the hypothesis $h(x)$ (i.e. it learns $\theta_0$ and $\theta_1$)
3. We can then apply this hypothesis to unseen inputs Xtest to generate predictions Ytest
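
As a minimal sketch of step 3, the hypothesis can be written as a small function and applied to test inputs. The parameter values and `x_test` below are made up purely for illustration; they are not learned from any data.

```python
import numpy as np

def hypothesis(theta, x):
    """Evaluate h(x) = theta_0 + theta_1 * x for a scalar or an array x."""
    theta_0, theta_1 = theta
    return theta_0 + theta_1 * np.asarray(x, dtype=float)

# Hypothetical parameters and test inputs, chosen only for illustration.
theta = (1.0, 2.0)
x_test = np.array([0.0, 1.0, 2.0])
y_test = hypothesis(theta, x_test)
print(y_test)  # [1. 3. 5.]
```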

### Pseudo code:
1. Initialise $\theta$ with random values (where $\theta = [\theta_0, \theta_1]$)
2. Measure how good this value of $\theta$ is, using an error function
3. Minimise the error function to get the optimal $\theta$ (using an algorithm such as Gradient Descent)


#### Error function
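1. Total absolute error: the sum of the absolute differences between the actual values and the hypothesised values: <br><br>
> $E = \sum_{i=1}^{m} |y_i - h(x_i)|$

2. Total squared error: the modulus function is not differentiable at 0, so we use the squared error instead. The factor $\frac{1}{2}$ is a convenience that cancels the 2 produced when differentiating: <br><br>
> $E = \frac{1}{2} \sum_{i=1}^{m} (y_i - h(x_i))^2$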
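
The two error functions can be compared on a tiny example. This is only a sketch: the arrays `y` and `y_pred` are invented values standing in for the actual and hypothesised outputs.

```python
import numpy as np

def absolute_error(y, y_pred):
    """Sum of absolute differences between actual and predicted values."""
    return np.sum(np.abs(y - y_pred))

def squared_error(y, y_pred):
    """Half the sum of squared differences, matching the formula above."""
    return 0.5 * np.sum((y - y_pred) ** 2)

# Invented values, for illustration only.
y = np.array([1.0, 3.0, 5.0])
y_pred = np.array([1.5, 2.0, 5.0])

print(absolute_error(y, y_pred))  # 1.5
print(squared_error(y, y_pred))   # 0.625
```

Note how the squared error weights the larger miss (1.0) four times as heavily as the smaller one (0.5), whereas the absolute error weights it only twice as heavily.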

## Gradient Descent Algorithm
- Used to minimise error functions
- Guaranteed to reach the global minimum only for convex functions (given a suitable learning rate); on non-convex functions it may stop at a local minimum


**Algorithm**:
1. Initialise theta with a random value <br>
2. Repeat until convergence (until a minimum is reached):
    - Update theta in the direction of the negative gradient (downhill), according to the update rule, until the slope becomes zero (a minimum of the function)

##### Update Rule:
$\theta = \theta - \eta \frac{\partial f(\theta)}{\partial\theta}$ <br>
where $\eta$ is the learning rate, which sets the step size (how far $\theta$ moves in each update)

If $\eta$ is very small, it will take too long to converge to the minimum, as only very small steps are taken <br>
If $\eta$ is very large, the steps are so large that we overshoot the minimum and keep oscillating left and right around it, or even diverge

- If the initial theta is on the left side of the minimum, the slope is -ve, so $-\eta \cdot$ slope is +ve, and theta moves towards the right
- If the initial theta is on the right side of the minimum, the slope is +ve, so $-\eta \cdot$ slope is -ve, and theta moves towards the left
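
The effect of the learning rate and of the starting side can be tried out on a toy problem. The sketch below assumes the function $f(\theta) = (\theta - 3)^2$, which is not from these notes; it is picked only because its minimum ($\theta = 3$) and its slope ($2(\theta - 3)$) are easy to check by hand.

```python
def gradient(theta):
    """Slope of the assumed toy function f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def descend(theta, eta, steps):
    """Apply the update rule theta = theta - eta * slope a fixed number of times."""
    for _ in range(steps):
        theta = theta - eta * gradient(theta)
    return theta

# Starting left of the minimum: slope is negative, so theta moves right.
print(descend(0.0, eta=0.1, steps=1) > 0.0)    # True
# Starting right of the minimum: slope is positive, so theta moves left.
print(descend(10.0, eta=0.1, steps=1) < 10.0)  # True

print(descend(0.0, eta=0.1, steps=50))    # close to 3
print(descend(0.0, eta=0.001, steps=50))  # still far from 3: steps too small
print(descend(0.0, eta=1.0, steps=2))     # 0.0: jumps 0 -> 6 -> 0, oscillating
print(descend(0.0, eta=1.1, steps=50))    # very far from 3: overshoots and diverges
```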

#### Convergence Criteria:
1. A maximum number of steps is reached (e.g. stop after at most 1000 steps)
2. $\Delta E$  $\approx$ 0 (change in error is almost zero), <br> i.e. $|\Delta E|$ < Threshold
3. Gradient is close to 0
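
One way the three criteria might be combined in a single loop is sketched below. The function name `minimise`, its defaults (`eta`, `max_steps`, `tol`), and the toy function used to exercise it are all assumptions for illustration; the sketch handles a single scalar theta only.

```python
def minimise(f, grad, theta, eta=0.1, max_steps=1000, tol=1e-9):
    """Gradient descent on a scalar theta; returns (theta, steps taken, stop reason)."""
    prev_error = f(theta)
    for step in range(1, max_steps + 1):
        g = grad(theta)
        if abs(g) < tol:                   # criterion 3: gradient close to 0
            return theta, step - 1, "gradient"
        theta = theta - eta * g            # update rule
        error = f(theta)
        if abs(prev_error - error) < tol:  # criterion 2: |delta E| below threshold
            return theta, step, "delta_error"
        prev_error = error
    return theta, max_steps, "max_steps"   # criterion 1: step budget exhausted

# Assumed toy problem: f(theta) = (theta - 3)^2 with its minimum at theta = 3.
def f(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta, steps, reason = minimise(f, grad, theta=0.0)
print(theta, steps, reason)
```

With a very small `max_steps` the loop ends on criterion 1 before the other two are met, which is a hint that the step budget or the learning rate needs revisiting.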


![image.png](attachment:image.png)


$J(\theta) = \frac{1}{2}   \sum_{i=1}^{m} (h(x_i) - y_i)^2$ <br><br>
$J(\theta) = \frac{1}{2}   \sum_{i=1}^{m} (\theta_0 + \theta_1x_i - y_i)^2$
<br><br>

So, $J(\theta)$ depends on both $\theta_0$ and $\theta_1$ => bowl-shaped surface in 3D
<br>
To get to the minimum of this surface, we use gradient descent: we differentiate $J(\theta)$ wrt both $\theta_0$ and $\theta_1$ and calculate the updated values of $\theta_0$ and $\theta_1$
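
For reference, applying the chain rule to $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta_0 + \theta_1x_i - y_i)^2$ gives the two partial derivatives (the factor $\frac{1}{2}$ cancels the 2 from the square):

$$\frac{\partial J}{\partial \theta_0} = \sum_{i=1}^{m} (\theta_0 + \theta_1x_i - y_i)$$

$$\frac{\partial J}{\partial \theta_1} = \sum_{i=1}^{m} (\theta_0 + \theta_1x_i - y_i)\, x_i$$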

$\theta = \theta - \eta \, \nabla_\theta J(\theta)$ <br>
1. Update $\theta_0$
2. Update $\theta_1$ <br>
(See pic for calculations)
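
Putting the pieces together, a minimal sketch of fitting a line by gradient descent might look like the following. The function name `fit_line`, its default values, and the synthetic data are assumptions for illustration. It starts from zeros rather than random values so the run is reproducible, and it uses the gradients of $J(\theta)$: the sum of residuals for $\theta_0$, and the sum of residuals times $x_i$ for $\theta_1$.

```python
import numpy as np

def fit_line(x, y, eta=0.01, max_steps=10000, tol=1e-12):
    """Fit h(x) = theta_0 + theta_1 * x by batch gradient descent on J(theta)."""
    theta_0, theta_1 = 0.0, 0.0  # assumption: fixed start instead of random init
    prev_cost = np.inf
    for _ in range(max_steps):
        residual = theta_0 + theta_1 * x - y   # h(x_i) - y_i for every example
        cost = 0.5 * np.sum(residual ** 2)     # J(theta)
        if abs(prev_cost - cost) < tol:        # stop when the error barely changes
            break
        prev_cost = cost
        grad_0 = np.sum(residual)              # dJ/d(theta_0)
        grad_1 = np.sum(residual * x)          # dJ/d(theta_1)
        # Update both parameters together, using gradients from the same theta.
        theta_0, theta_1 = theta_0 - eta * grad_0, theta_1 - eta * grad_1
    return theta_0, theta_1

# Synthetic, noise-free data generated from y = 1 + 2x, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

theta_0, theta_1 = fit_line(x, y)
print(theta_0, theta_1)  # should be close to 1 and 2
```

Because the gradients here are sums over all examples rather than averages, the largest usable `eta` shrinks as the dataset grows; dividing the gradients by the number of examples is a common way to make the learning rate less sensitive to dataset size.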