# Machine Learning

## Lecture 1: Introduction
Machine learning: the field of study that gives computers the ability to learn without being explicitly programmed. A program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

### Supervised Learning
The algorithm is given a dataset in which every input comes with the right answer (the expected output). We want the algorithm to learn to predict the output for a given input.

**Regression**: The output to predict is continuous-valued (for example, a price).

#### Classification Problem
Example: given tumor properties (let's say size), predict if the tumor is malignant or benign (y-axis is 0 or 1).

Term: Support Vector Machines

### Unsupervised Learning
The given dataset is unlabeled. Unsupervised learning asks for interesting structure within the data, for example by clustering or partitioning it.

Unsupervised learning can also be used in image processing. It can be used to group pixels together.

Term: Independent Component Analysis

### Reinforcement Learning
Problems where it's not one-shot decision making: the algorithm is required to make a series of decisions over time. This is done through a **REWARD function**. Try to maximize the positive actions and minimize the negative actions.


## Lecture 2: Application of Supervised Learning
### Linear Regression
|Living Area ($m^2$)|Price (\$1000s)|
|-------------------|--------------|
|2104               |      400     |
|1416               |      232     |
|1534               |      315     |
|852                |      178     |
|1940               |      240     |
|...                |      ...     |

There are 47 training rows.

Notation:
* m = # of training examples
* x = input variables/features
* y = output variables/target variable
* (x, y) = training example
* i-th training example = $(x^{(i)}, y^{(i)})$
* n = # of features
* $\theta$'s are called the **parameters**

Steps:
1. Given a training set, feed it to a learning algorithm.
2. Algorithm then has to output a function _h_. This is called a **hypothesis**.
3. Hypothesis takes an input "living area" and outputs an "estimated price".

For learning purposes, with a single feature, $h(x) = \theta_0 + \theta_1x$

If there are multiple inputs (living area, # of bedrooms):
Then denote $x_1$ = size, $x_2$ = # bedrooms

Then $h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2$

For conciseness, $x_0 = 1$

Therefore, h(x) = $\sum_{i=0}^{n} \theta_i x_i = \theta^Tx$

Define $J(\theta)$ as half the sum of the squared differences between the predicted values and the actual values for the m training data points:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$

Want to minimize J($\theta$).
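
As a minimal sketch of the hypothesis and cost function above, the following pure-Python snippet evaluates $h_\theta(x)$ and $J(\theta)$ on the five table rows. The function names `h` and `cost` are illustrative, and scaling the living area to thousands is an assumption made only to keep the numbers small.

```python
def h(theta, x):
    """Linear hypothesis h(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared errors over the training set."""
    return 0.5 * sum((h(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))


# Each row is [x_0, x_1] with x_0 = 1 and x_1 = living area / 1000 (assumed scaling).
X = [[1.0, 2.104], [1.0, 1.416], [1.0, 1.534], [1.0, 0.852], [1.0, 1.940]]
y = [400.0, 232.0, 315.0, 178.0, 240.0]

# Cost of the all-zeros starting point; any better theta should score lower.
print(cost([0.0, 0.0], X, y))
```

A $\theta$ that fits the data better gives a smaller value of `cost`, which is what the minimization methods in the next section exploit.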

### Ways to Minimize the function
Start with some $\theta$. Say $\theta = \vec{0}$ (vector of all zeros)

Keep changing $\theta$ to reduce J($\theta$).

#### Algorithm 1: Gradient Descent
<img src="images/gradient_descent.png" styles={ width = 50%}/>

1. Choose some initial parameters (for example, randomly).
2. Start at that point. Look in all directions and see which direction would decrease the value the most.
3. Take a step in that direction and iterate until you reach a local minimum.

Gradient descent depends on where the start point is: in general, it can end up at different local minima depending on the starting point. For linear regression, however, J($\theta$) is a convex quadratic (bowl-shaped) function and has only **one** global minimum.

$\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta)$

Solving for $\frac{\partial}{\partial\theta_i}J(\theta)$, for a single training example $(x, y)$:

> $= \frac{\partial}{\partial\theta_i}\frac{1}{2}(h_\theta(x)-y)^2$

> $= 2 * \frac{1}{2}(h_\theta(x)-y)*\frac{\partial}{\partial\theta_i}(h_\theta(x)-y)$

> $= (h_\theta(x)-y)*\frac{\partial}{\partial\theta_i}(\theta_0x_0 + ... + \theta_nx_n-y)$

> only $\theta_ix_i$ term will be left

> $= (h_\theta(x)-y)*x_i$

Therefore, for a single training example,
$\theta_i := \theta_i - \alpha(h_\theta(x)-y)*x_i$
* $\alpha$ is the learning rate
* larger values would involve larger steps, smaller values would have smaller steps

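For m training examples, the derivative of J($\theta$) is the sum over all of them:

$\theta_i := \theta_i - \alpha\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})*x_i^{(j)}$

Repeat until convergence.

This is known as **Batch Gradient Descent**. The issue with this is that every update looks at all m training examples, so if m is very large, each iteration will take a long time to run.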
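
A minimal sketch of batch gradient descent for linear regression, summing the gradient over every training example before each update. The function name, the learning rate of 0.05, the fixed iteration count (used in place of a convergence test), and the scaling of living area to thousands are all assumptions chosen for this illustration.

```python
def batch_gradient_descent(X, y, alpha=0.05, iterations=5000):
    """Minimize J(theta) by repeatedly stepping against the full gradient."""
    n = len(X[0])
    theta = [0.0] * n  # start at the zero vector
    for _ in range(iterations):
        # Prediction error h(x) - y for every training example.
        errors = [sum(t * xi for t, xi in zip(theta, x)) - yi for x, yi in zip(X, y)]
        # i-th partial derivative: sum over examples of error * x_i.
        grad = [sum(e * x[i] for e, x in zip(errors, X)) for i in range(n)]
        # Update all parameters simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Table rows with x_0 = 1 and living area divided by 1000 (assumed scaling).
X = [[1.0, 2.104], [1.0, 1.416], [1.0, 1.534], [1.0, 0.852], [1.0, 1.940]]
y = [400.0, 232.0, 315.0, 178.0, 240.0]
print(batch_gradient_descent(X, y))
```

Without the feature scaling, a learning rate this large would make the updates diverge, which is why the sketch rescales the living area first.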

#### Stochastic (Incremental) Gradient Descent
Look at each data point and update your parameters based on that data point.

for j = 1 to m:

$\theta_i := \theta_i-\alpha(h_\theta(x^{(j)})-y^{(j)})*x_i^{(j)}$ (for all i)

This will not converge to the global minimum exactly but will get approximately there. Individual steps will not always decrease J($\theta$) and the parameters will wander around, but they will end up near the minimum. This method is a lot better for large data sets.
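
A minimal sketch of the stochastic update, visiting the training examples in order and updating $\theta$ after each one. The function name, learning rate, number of passes (`epochs`), and the scaling of living area to thousands are assumptions for illustration.

```python
def stochastic_gradient_descent(X, y, alpha=0.01, epochs=200):
    """Update theta after every single training example instead of after a full pass."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            # Error on this one example only.
            error = sum(t * xi for t, xi in zip(theta, x_j)) - y_j
            # Update every parameter using just this example.
            theta = [t - alpha * error * xi for t, xi in zip(theta, x_j)]
    return theta


# Table rows with x_0 = 1 and living area divided by 1000 (assumed scaling).
X = [[1.0, 2.104], [1.0, 1.416], [1.0, 1.534], [1.0, 0.852], [1.0, 1.940]]
y = [400.0, 232.0, 315.0, 178.0, 240.0]
print(stochastic_gradient_descent(X, y))
```

On noisy data like this the result depends on where the loop stops, since the parameters keep wandering near the minimum rather than settling on it.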

#### Matrix Derivative Notation for Deriving Normal Equations
$\nabla_\theta J$ = gradient of $J$ = $\begin{bmatrix}\frac{\partial J}{\partial\theta_0}\\\frac{\partial J}{\partial\theta_1}\\\vdots\\\frac{\partial J}{\partial\theta_n}\end{bmatrix} \in \mathbb{R}^{n+1}$

Can now rewrite batch gradient descent as:

$\theta := \theta - \alpha\nabla_\theta J$.

> _Note that $\theta$ and $\nabla_\theta J$ are both $(n+1)$-dimensional vectors_

More generally, suppose there is a function that maps matrices to real numbers:

$f: \mathbb{R}^{m \times n} \mapsto \mathbb{R}$
> so that $f(A)$ is a real number for any $A \in \mathbb{R}^{m \times n}$

Define the derivative of $f(A)$ with respect to A as:

$\nabla_A f(A)$ = $\begin{bmatrix} \frac{\partial f}{\partial A_{11}} & ... &\frac{\partial f}{\partial A_{1n}}\\... & ... & ...\\\frac{\partial f}{\partial A_{m1}} & ... & \frac{\partial f} {\partial A_{mn}}\end{bmatrix}$

If $A \in \mathbb{R}^{n \times n}$ (a square matrix):

trace of A = tr A = $\sum_{i=1}^{n} A_{ii}$, or the sum of the diagonal elements

Facts:
* tr AB = tr BA
* tr ABC = tr CAB = tr BCA
* If f(A) = tr AB, then $\nabla_A tr\,AB = B^T$
* tr A = tr $A^T$
* If $a \in \mathbb{R}$, tr a = a. This means that the trace of a real number is just itself.
* $\nabla_A tr\,ABA^TC = CAB + C^TAB^T$
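
The trace facts can be sanity-checked numerically. The sketch below uses small hand-picked matrices and a central-difference gradient to check tr AD = tr DA, $\nabla_A tr\,AD = D^T$, and the last fact in the form $\nabla_A tr\,ABA^TC = CAB + C^TAB^T$. All helper names and the example matrices are assumptions for illustration; a numerical check like this is evidence, not a proof.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def numerical_gradient(f, A, eps=1e-4):
    """Central-difference estimate of the matrix of partial derivatives of f at A."""
    grad = []
    for i in range(len(A)):
        grad_row = []
        for j in range(len(A[0])):
            plus = [r[:] for r in A]
            minus = [r[:] for r in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad_row.append((f(plus) - f(minus)) / (2 * eps))
        grad.append(grad_row)
    return grad


def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


A = [[1.0, 2.0, 0.5], [-1.0, 0.0, 3.0]]                    # 2 x 3
B = [[2.0, 1.0, 0.0], [0.0, -1.0, 4.0], [1.0, 1.0, 1.0]]   # 3 x 3
C = [[0.5, 2.0], [-3.0, 1.0]]                              # 2 x 2
D = [[1.0, -2.0], [0.0, 3.0], [2.0, 1.0]]                  # 3 x 2

# tr AD = tr DA
cyclic_gap = abs(trace(matmul(A, D)) - trace(matmul(D, A)))

# grad_A tr AD = D^T
grad_linear = numerical_gradient(lambda M: trace(matmul(M, D)), A)
linear_gap = max_abs_diff(grad_linear, transpose(D))

# grad_A tr A B A^T C = C A B + C^T A B^T
f_quad = lambda M: trace(matmul(matmul(matmul(M, B), transpose(M)), C))
cab = matmul(matmul(C, A), B)
ctabt = matmul(matmul(transpose(C), A), transpose(B))
expected = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(cab, ctabt)]
quad_gap = max_abs_diff(numerical_gradient(f_quad, A), expected)

# All three gaps should be close to zero.
print(cyclic_gap, linear_gap, quad_gap)
```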

Set X to be the **design matrix** with all the training inputs.

$X = \begin{bmatrix} --(x^{(1)})^T--\\--(x^{(2)})^T--\\...\\--(x^{(m)})^T--\end{bmatrix}$

Want to multiply it by the parameter vector $\theta$:

$X\theta = \begin{bmatrix} (x^{(1)})^T\theta\\(x^{(2)})^T\theta\\...\\(x^{(m)})^T\theta\end{bmatrix} = \begin{bmatrix}h_\theta(x^{(1)})\\h_\theta(x^{(2)})\\...\\h_\theta(x^{(m)})\end{bmatrix}$

Also define:

$\vec{y} = \begin{bmatrix}y^{(1)}\\y^{(2)}\\...\\y^{(m)}\end{bmatrix}$

Therefore, $X\theta - y = \begin{bmatrix}h_\theta(x^{(1)}) - y^{(1)}\\h_\theta(x^{(2)}) - y^{(2)}\\...\\h_\theta(x^{(m)}) - y^{(m)}\end{bmatrix}$

Recall that if z is a vector, then $z^Tz = \sum_iz_i^2$

So if we take $\frac{1}{2}(X\theta - y)^T(X\theta - y) = \frac{1}{2} \sum_{i=1}^{m} (h(x^{(i)})-y^{(i)})^2 = J(\theta)$

In order to minimize $J(\theta)$ wrt $\theta$, set $\nabla_\theta J(\theta) = \vec{0}$:

$\nabla_\theta \frac{1}{2}(X\theta - y)^T(X\theta - y)$

$= \frac{1}{2}\nabla_\theta tr(\theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta+y^Ty)$ 
> can add the "tr" since the expansion is just a real number and the trace of a real number is still the real number

$= \frac{1}{2}[\nabla_\theta tr\,\theta\theta^TX^TX - \nabla_\theta tr\,y^TX\theta - \nabla_\theta tr\,y^TX\theta]$
> can change $\theta^TX^Ty$ to $y^TX\theta$ because the value is just a real number and the transpose of a real number is still itself

> $y^Ty$ drops out since it does not depend on $\theta$, so its derivative with respect to $\theta$ is 0

We know:

> Write the first term as $\nabla_\theta tr\,\theta I \theta^TX^TX$, where $I$ is the identity matrix. Here $\theta$ is A, $I$ is B, and $X^TX$ is C. Using the property $\nabla_A tr\,ABA^TC = CAB + C^TAB^T$, we can say: $\nabla_\theta tr\,\theta I \theta^TX^TX = X^TX\theta + X^TX\theta$

> $\nabla_\theta tr\,y^TX\theta = X^Ty$ using the property $\nabla_A tr\,AB = B^T$ together with $tr\,AB = tr\,BA$, where $y^TX$ is B and $\theta$ is A.

Therefore:

$\nabla_\theta J(\theta) = \frac{1}{2}[X^TX\theta + X^TX\theta - X^Ty - X^Ty] = X^TX\theta - X^Ty$ and set to 0.

We then get the **Normal Equations**:

### $X^TX\theta = X^Ty$

From this we can solve for $\theta$ (assuming $X^TX$ is invertible):

$\theta = (X^TX)^{-1}X^Ty$
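
A minimal sketch of solving the normal equations directly. Rather than forming the inverse, it builds $X^TX$ and $X^Ty$ and solves the linear system with a small Gaussian-elimination helper. The helper `solve`, the function name `normal_equations`, and the scaling of living area to thousands are assumptions for illustration.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def normal_equations(X, y):
    """Return theta satisfying X^T X theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(x[i] * x[k] for x in X) for k in range(n)] for i in range(n)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)


# Table rows with x_0 = 1 and living area divided by 1000 (assumed scaling).
X = [[1.0, 2.104], [1.0, 1.416], [1.0, 1.534], [1.0, 0.852], [1.0, 1.940]]
y = [400.0, 232.0, 315.0, 178.0, 240.0]
print(normal_equations(X, y))
```

Unlike gradient descent, this needs no learning rate and no iteration, at the price of solving an $(n+1) \times (n+1)$ linear system.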

## Lecture 3: Concept of Underfitting and Overfitting
When given a data set and trying to find a relationship that represents it, we need to be careful not to overfit or underfit.

For example, say we are using the data from before with regards to housing prices. The y-axis is the price and the x-axis is the size of the house.

<img src="images/overfitting.png" styles={ width = 50%}/>

Given the data points in red, we can fit a linear relationship ($\theta_0+\theta_1X_1$), where $X_1$ is the size. This results in the green line. This model does not capture the quadratic relationship in the data. This is known as **"under-fitting"**.

We can also take $X_2 = (X_1)^2$, the square of the size, giving us a quadratic (2nd-order) equation ($\theta_0 + \theta_1X_1 + \theta_2X_1^2$) and the blue line. This is a good model to use for the data.

We can also use a high-order polynomial (a 4th-order polynomial, with 5 parameters, is already enough to pass through all 5 points exactly). Although this model fits the training data perfectly, it will not be a good predictor for new data. This is known as **"over-fitting"**.

Linear regression is a "parametric" learning algorithm. This is defined as an algorithm that has a fixed number of parameters that is fit to the data. The $\theta$'s are a fixed set of parameters.

### Non-parametric Learning Algorithms
Non-parametric algorithms alleviate the need to choose features carefully to avoid over- or under-fitting.

A non-parametric learning algorithm is defined as an algorithm where the number of parameters grows linearly with the size of the training set, $m$.

#### Locally Weighted Regression (also known as loess/lowess)
To determine a prediction of y given x, LWR will take the data points in the training set that are close to x and apply linear regression on those data points instead of trying to model the whole data set.

Formal Definition:

LWR: Fit $\theta$ to minimize $\sum_i w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2$

Where $w^{(i)}$ = weights = exp(-$\frac{(x^{(i)}-x)^2}{2\tau^2}$). _Note: there can be other functions too, this is just an example._

> If $|x^{(i)}-x|$ is small, then $w^{(i)}$ will be close to 1.

> If $|x^{(i)}-x|$ is large, then $w^{(i)}$ will be close to 0.

> If $\tau$ is small, then the bell shape will be narrow and values further away from the desired point will have less weight as its value decreases rapidly.

> If $\tau$ is large, then the bell shape will be wide and values further away from the desired point will have relatively larger weights.

Because of the above, points closer to the desired x will have a larger weight and points further away from x will have a smaller weight.

For each query point x, LWR fits its own weighted linear regression, which gives a linear approximation that fits the training points near x. If you do this for every query point, then LWR is able to trace out a non-linear curve.
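
A minimal sketch of locally weighted regression for a single feature plus an intercept. For each query point it computes the Gaussian weights and solves the weighted least-squares problem in closed form. The closed-form 2x2 solution is the standard weighted least-squares result and is not derived in these notes; the function name, the synthetic data, and the value of $\tau$ are assumptions for illustration.

```python
import math


def lwr_predict(x_query, xs, ys, tau=0.5):
    """Predict y at x_query by fitting a weighted line around x_query."""
    # Gaussian weights: near 1 close to the query point, near 0 far away.
    w = [math.exp(-((xi - x_query) ** 2) / (2 * tau ** 2)) for xi in xs]
    # Weighted sums needed for the 2x2 weighted least-squares system.
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, xs))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    s_wy = sum(wi * yi for wi, yi in zip(w, ys))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = s_w * s_wxx - s_wx ** 2
    theta1 = (s_w * s_wxy - s_wx * s_wy) / det   # local slope
    theta0 = (s_wy - theta1 * s_wx) / s_w        # local intercept
    return theta0 + theta1 * x_query


# Synthetic data on a curve that a single global line cannot fit well.
xs = [0.5 * k for k in range(9)]   # 0.0, 0.5, ..., 4.0
ys = [x ** 2 for x in xs]

# One local fit per query point traces out the non-linear curve.
curve = [lwr_predict(x, xs, ys, tau=0.3) for x in xs]
print(curve)
```

Note that the whole training set is needed at prediction time, which is what makes the method non-parametric.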

### Probabilistic Interpretation

Assume $y^{(i)} = \theta^TX^{(i)} + \epsilon^{(i)}$

where $\epsilon^{(i)}$ is the error term (or the unmodelled effects/random noise)

Assume $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, which is a Gaussian distribution with mean 0 and variance $\sigma^2$.

The density for Gaussian is $P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(\epsilon^{(i)})^2}{2\sigma^2})$

<img src="images/gaussian_dist.gif" styles={ width = 50%}/>

This means that the conditional density of $y^{(i)}$ given $x^{(i)}$ is:

$P(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)}-\theta^TX^{(i)})^2}{2\sigma^2})$
> _Notation_: $\theta$ is not a random variable; it is a true but unknown value. Read the above as the probability of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$.

Or in other words, $y^{(i)} | x^{(i)}; \theta \sim \mathcal{N}(\theta^TX^{(i)}, \sigma^2)$

Assume $\epsilon^{(i)}$ are IID (independent and identically distributed). This means that the errors are independent of each other and have the same Gaussian distribution.

The **likelihood** of $\theta: L(\theta) = P(\vec{y} | X;\theta)$

$= \prod_{i=1}^{m} P(y^{(i)} | X^{(i)}; \theta)$

$= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)}-\theta^TX^{(i)})^2}{2\sigma^2})$

> Likelihood is the same quantity as probability, except that the likelihood views it as a function of $\theta$ for fixed $\vec{y}$ and X.

> Therefore, we say "likelihood of the parameters" and "probability of the data".

#### Maximum Likelihood Estimation
Choose $\theta$ to maximize L($\theta$)= $P(\vec{y} | X;\theta)$. Or choose the parameters to make it as likely as possible to see the data that was just seen.

$l(\theta) = logL(\theta)$

$= log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)}-\theta^TX^{(i)})^2}{2\sigma^2})$
> log of a product is the sum of the logs

$= \sum_{i=1}^{m} log[\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)}-\theta^TX^{(i)})^2}{2\sigma^2})]$

$= m * log\frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m -\frac{(y^{(i)}-\theta^TX^{(i)})^2}{2\sigma^2}$

So maximizing the $l(\theta)$ is the same as minimizing $\frac {\sum_{i=1}^m (y^{(i)}-\theta^TX^{(i)})^2}{2} = J(\theta)$
> Note: the value of $\sigma$ doesn't matter in this case
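
To make the equivalence concrete, the sketch below evaluates both $J(\theta)$ and the Gaussian log likelihood $l(\theta)$ for a few candidate parameter vectors, so that the relation $l(\theta) = m\log\frac{1}{\sqrt{2\pi}\sigma} - \frac{J(\theta)}{\sigma^2}$ can be checked numerically. The data, the candidate $\theta$ values, and $\sigma = 0.5$ are made up for illustration.

```python
import math


def J(theta, X, y):
    """Least-squares cost: half the sum of squared errors."""
    return 0.5 * sum((yi - sum(t * xi for t, xi in zip(theta, x))) ** 2 for x, yi in zip(X, y))


def log_likelihood(theta, X, y, sigma):
    """Sum of log Gaussian densities of y given x, with mean theta^T x."""
    total = 0.0
    for x, yi in zip(X, y):
        mean = sum(t * xi for t, xi in zip(theta, x))
        total += math.log(1.0 / (math.sqrt(2 * math.pi) * sigma)) - (yi - mean) ** 2 / (2 * sigma ** 2)
    return total


# Hypothetical data lying close to the line y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
sigma = 0.5
candidates = [[0.0, 0.0], [1.0, 1.0], [1.0, 2.0]]

# The candidate with the smallest J is also the one with the largest l.
for theta in candidates:
    print(theta, J(theta, X, y), log_likelihood(theta, X, y, sigma))
```

Changing `sigma` rescales and shifts the log likelihood but never changes which candidate ranks best.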

### Classification
Where $y \in \{0, 1\}$. Discrete values.
> The case of two values is a Binary Classifier.

Since you know that $y \in \{0, 1\}$, you can set $h_\theta(X)\in[0, 1]$

Choose: $h_\theta(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}}$

Where g(z) = $\frac{1}{1+e^{-z}}$.
> This is known as the **sigmoid function** or the **logistic function**

<img src="images/sigmoid.png" styles={ width = 50%}/>

Endow the hypothesis with a probabilistic interpretation:

> $P(y=1 | x; \theta) = h_\theta(x)$

> $P(y=0 | x; \theta) = 1 - h_\theta(x)$

Combining both equations: $P(y| x; \theta) = h_\theta(x)^y (1-h_\theta(x))^{1-y}$
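
A minimal sketch of the sigmoid hypothesis and the combined probability $P(y | x; \theta)$. The two-branch form of `sigmoid` is an implementation choice (an assumption, not part of the notes) that avoids floating-point overflow for very negative inputs; the function names and the example values are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so e is between 0 and 1
    return e / (1.0 + e)


def h(theta, x):
    """Hypothesis h(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def prob(y, x, theta):
    """P(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y) for y in {0, 1}."""
    p = h(theta, x)
    return p ** y * (1.0 - p) ** (1 - y)


# Hypothetical parameters and input (x_0 = 1, x_1 = tumor size).
theta = [-4.0, 1.0]
x = [1.0, 5.0]
print(prob(1, x, theta), prob(0, x, theta))
```

For any input, the two probabilities printed at the end add up to 1, as the two cases $y = 1$ and $y = 0$ require.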

To fit the parameters:

$L(\theta) = \prod_i P(y^{(i)} | x^{(i)}; \theta)$

$= \prod_i h_\theta(x^{(i)})^{y^{(i)}} (1-h_\theta(x^{(i)}))^{1-y^{(i)}}$

The log likelihood:

$l(\theta) = logL(\theta)$

$= \sum_{i=1}^{m} y^{(i)}log h_\theta(x^{(i)}) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))$

Can use gradient ascent to maximize the log likelihood.

$\theta := \theta + \alpha\nabla_\theta l(\theta)$
> this is now adding because we are trying to maximize

$\frac{\partial}{\partial\theta_j}l(\theta) = \sum_{i=1}^{m} (y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$

This gives us:

$\theta_j := \theta_j + \alpha\sum_{i=1}^{m} (y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$
> Although this looks like it is the same equation as the linear regression equation, the definition of $h_\theta(x^{(i)})$ is different so the learning rule is different
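
A minimal sketch of fitting logistic regression by batch gradient ascent on the log likelihood. The tumor-size data, the learning rate, and the fixed iteration count are made up for illustration, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(theta, X, y):
    """l(theta) = sum of y*log(h) + (1 - y)*log(1 - h) over the training set."""
    total = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def logistic_regression(X, y, alpha=0.01, iterations=20000):
    """Maximize the log likelihood with batch gradient ascent."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # y - h(x) for every training example.
        residuals = [yi - sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x, yi in zip(X, y)]
        # j-th partial derivative: sum over examples of residual * x_j.
        grad = [sum(r * x[j] for r, x in zip(residuals, X)) for j in range(len(theta))]
        # Plus sign: we are climbing the log likelihood.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data: x_0 = 1, x_1 = tumor size; y = 1 means malignant.
X = [[1.0, float(size)] for size in range(1, 9)]
y = [0, 0, 0, 1, 0, 1, 1, 1]
theta = logistic_regression(X, y)
print(theta, log_likelihood(theta, X, y))
```

The classes here overlap on purpose (size 4 is labeled 1, size 5 is labeled 0); with perfectly separable data the likelihood has no finite maximizer and the parameters would keep growing.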

#### Digression: Perceptron Algorithm
Everything is the same as before except the function g(z) now becomes a step function.

$g(z) = \left\{\begin{array}{ll} 1 & z\geq 0 \\ 0 & \text{otherwise} \\ \end{array} \right.$

$h_\theta(x) = g(\theta^TX)$

$\theta_j := \theta_j + \alpha\sum_{i=1}^{m} (y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$
> Although this looks like it is the same learning rule as before, it is different as the hypothesis is different.

> This is relatively easier as you just threshold and choose either 0 or 1 as the output.
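
A minimal sketch of the perceptron with the step function as $g$. Note one deliberate difference from the rule written above: this sketch applies the update one training example at a time rather than summing over all examples, which is the classic online form of the algorithm. The data, learning rate, and epoch limit are assumptions for illustration.

```python
def predict(theta, x):
    """Step-function hypothesis: 1 if theta^T x >= 0, otherwise 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0


def train_perceptron(X, y, alpha=1.0, max_epochs=1000):
    """Sweep over the data, correcting theta only on misclassified examples."""
    theta = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, yi in zip(X, y):
            error = yi - predict(theta, x)  # -1, 0, or +1
            if error != 0:
                theta = [t + alpha * error * xi for t, xi in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return theta


# Hypothetical linearly separable data: x_0 = 1, x_1 = tumor size.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 6.0], [1.0, 7.0], [1.0, 8.0]]
y = [0, 0, 0, 1, 1, 1]
theta = train_perceptron(X, y)
print(theta, [predict(theta, x) for x in X])
```

The early exit only triggers when the two classes can be separated by a line; on overlapping data the loop simply runs until `max_epochs`.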

## Lecture 4: Newton's Method

Terms:
Frequentist vs Bayesian