# Machine Learning - Week 1

## What is Machine Learning?

**Machine Learning:**
 - Computers learn w/o being explicitly programmed
 - A computer learns from experience $E$ w/ respect to task $T$ if its performance on $T$, as measured by $P$, improves w/ experience $E$
 
**Supervised Learning:** Algorithm is taught (fed labeled data, i.e. inputs paired w/ the correct outputs)
 - *Regression:* Predict a continuous valued output
 - *Classification:* Predict a discrete valued output

**Unsupervised Learning:** Algorithm learns by itself (data has no correct outputs attached)
 - *Clustering:* Automatically group data into clusters

Others: **reinforcement learning**, **recommender systems**

## Linear Regression

**Linear Regression:** Fit a line through the data and use it to predict outputs for new inputs
 - *Univariate/Simple:* One input variable
 - *Multivariate/Multiple:* More than one input variable

### Representation

$m$ - Number of training examples, $x$ - Input variables, $y$ - Output variables

$(x,y)$ - One training example, $(x_i,y_i)$ - $i$th training example

Training Set $\rightarrow$ Learning Algorithm $\rightarrow$ $h$

**Hypothesis:** $h$ maps $x$'s to $y$'s $\rightarrow$ In univariate regression, this is: $h_{\theta}(x) = \theta_0 + \theta_1x$
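
A minimal sketch of the univariate hypothesis as a function. The name `hypothesis` and the parameter values are made up for illustration only:

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters, chosen only to show the calculation.
theta0, theta1 = 1.0, 2.0
prediction = hypothesis(theta0, theta1, 3.0)
print(prediction)  # 1.0 + 2.0 * 3.0 = 7.0
```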

### Cost

**Cost Function:** A measure of how well the line fits

**MSE (Mean Squared Error):** A measure of the average squared difference between predicted and actual values $\rightarrow$ $J(\theta_0,\theta_1) = {\frac{1}{2m}} {\sum_{i=1}^m {(h_{\theta}(x_i) - y_i)^2}}$
 - We square so that +ve and -ve errors do not cancel out
 - We divide the average by 2 so that the 2 from differentiating the square cancels (makes gradient descent differentiation easier)
 - MSE can be visualized as a function of $\theta_0, \theta_1$ using contour plots (2D depictions of 3D plots)
 - Goal of linear regression is to $\min_{\theta_0, \theta_1} J(\theta_0,\theta_1) = \min_{\theta_0, \theta_1} {\frac{1}{2m}} {\sum_{i=1}^m {(h_{\theta}(x_i) - y_i)^2}}$
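
The cost can be computed directly from the formula. This is a sketch on made-up toy data; the function name `cost` and the data points are assumptions, not part of the course material:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum over i of (h_theta(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)


# Made-up toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 1.0, xs, ys))  # errors are -1, -2, -3, so the cost is (1 + 4 + 9) / 6
```

A worse-fitting line gives a larger $J$, which is what the minimization goal relies on.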

### Gradient Descent

**Gradient Descent Algorithm:** Used to find a minimum of a function: Start with some $\theta_0, \theta_1$ $\rightarrow$ Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until a minimum is reached
 - Formally: $\text{repeat until convergence: } \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)$
     - *Simultaneously update* $j=0$ and $j=1$
 - $\alpha$ represents the *learning rate* (speed of gradient descent / step size)
     - Always +ve
 - $\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)$ represents the *derivative term*
     - The slope of $J$ at the current point controls the direction of descent
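
A sketch of the update rule with a simultaneous update. The bowl-shaped cost used here is a hypothetical stand-in (not the regression cost), and a fixed iteration count replaces "repeat until convergence" to keep the example short:

```python
def gradient_descent(grad, theta0, theta1, alpha=0.1, iterations=200):
    """Repeatedly apply theta_j := theta_j - alpha * dJ/dtheta_j for j = 0, 1."""
    for _ in range(iterations):
        # Evaluate both partial derivatives at the current point first...
        d0, d1 = grad(theta0, theta1)
        # ...then update both parameters together (simultaneous update).
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1


def bowl_gradient(theta0, theta1):
    """Gradient of the hypothetical cost J = (theta0 - 1)^2 + (theta1 + 2)^2."""
    return 2 * (theta0 - 1), 2 * (theta1 + 2)


theta0, theta1 = gradient_descent(bowl_gradient, 0.0, 0.0)
print(theta0, theta1)  # approaches the minimum at (1, -2)
```

Updating $\theta_0$ first and then using the new $\theta_0$ to compute the derivative for $\theta_1$ would not be a simultaneous update.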


Notes:
 - Small values of $\alpha$ cause slow descent
 - Large values of $\alpha$ could overshoot the minimum, might diverge instead of converge
 - Do not need to decrease $\alpha$ over time, algorithm will take smaller steps as slope flattens closer to minimum
 - **Batch Gradient Descent:** Each step of the gradient descent uses all training examples
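
The effect of the learning rate can be seen on a one-parameter toy cost. This sketch assumes $J(\theta) = \theta^2$ (derivative $2\theta$) and three arbitrary values of $\alpha$; none of these come from the notes above:

```python
def descend(alpha, theta=1.0, steps=20):
    """Gradient descent on the toy cost J(theta) = theta**2 (derivative 2 * theta).

    Returns the final theta and the absolute size of every step taken.
    """
    step_sizes = []
    for _ in range(steps):
        step = alpha * 2 * theta
        theta = theta - step
        step_sizes.append(abs(step))
    return theta, step_sizes


slow, _ = descend(alpha=0.01)          # small alpha: still far from 0 after 20 steps
good, good_steps = descend(alpha=0.1)  # moderate alpha: close to 0
diverged, _ = descend(alpha=1.1)       # large alpha: overshoots and moves away from 0

print(slow, good, diverged)
print(good_steps[0], good_steps[-1])   # steps shrink even though alpha is fixed
```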
     

### Gradient Descent and Linear Regression

Key derivatives:

$\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)$

$ = \frac{\partial}{\partial \theta_j} \frac{1}{2m} {\sum_{i=1}^m {(h_{\theta}(x_i) - y_i)^2}}$

$ = \frac{\partial}{\partial \theta_j} \frac{1}{2m} {\sum_{i=1}^m {(\theta_0 + \theta_1x_i - y_i)^2}}$

$\text{Case 1}\rightarrow j=0: \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)$

$\text{Case 2}\rightarrow j=1: \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \cdot x_i$
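
Both cases follow from the chain rule applied to each squared term. Written out as an intermediate step, using only the definitions above:

$$
\frac{\partial}{\partial \theta_j} (h_{\theta}(x_i) - y_i)^2 = 2 (h_{\theta}(x_i) - y_i) \cdot \frac{\partial}{\partial \theta_j} h_{\theta}(x_i)
$$

$$
\frac{\partial}{\partial \theta_0} h_{\theta}(x_i) = 1, \qquad \frac{\partial}{\partial \theta_1} h_{\theta}(x_i) = x_i
$$

The factor of 2 cancels the $\frac{1}{2}$ in $\frac{1}{2m}$, leaving $\frac{1}{m}$ in both cases.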

The final gradient descent algorithm is therefore:

$\text{repeat until convergence: }$    
&emsp; $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \\
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \cdot x_i \\
\text{return } \theta_0, \theta_1$
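
A minimal sketch of batch gradient descent for univariate linear regression. The function name `fit_line`, the toy data, the learning rate, and the fixed iteration count (used in place of a convergence check) are all assumptions made for illustration:

```python
def fit_line(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Each step uses all m training examples (batch gradient descent).
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)  # approaches theta0 = 1, theta1 = 2
```

The learning rate here suits this particular toy data; other data may need a different $\alpha$ or feature scaling to converge.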


The cost function for linear regression is always *convex*, meaning any optimum it has is a global one, so gradient descent cannot get stuck in a local optimum that is not global


## Scrap Equations

$$
\min_{\theta_0, \theta_1} J(\theta_0,\theta_1) = \min_{\theta_0, \theta_1} {\frac{1}{2m}} {\sum_{i=1}^m {(h_{\theta}(x_i) - y_i)^2}}
$$

$$
\text{repeat until convergence: } \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)
\\ \text{ for } j=0, j=1
$$

$$
\text{repeat until convergence: } \\
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \\
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \cdot x_i \\
\text{return } \theta_0, \theta_1
$$

$$
\text{return } \theta_0 + \theta_1x
$$