# 1. Introduction

Supervised learning : we are given a data set and already know what the correct outputs are. Ex. regression and classification.
- Regression : continuous output. Map input to continuous function.
- Classification : discrete output. Map input to discrete categories.

Unsupervised learning : we have little or no idea what our correct outputs are. We derive structure from data.

Reinforcement learning & Recommender system : TBD.

# 2. Regression with two variables

- $m$ = number of training examples
- $x$ = inputs
- $y$ = outputs
- $h$ = hypothesis
- $\theta$ = parameter
- Hypothesis : $h_\theta(x) = \theta_{0} + \theta_{1}x$
- Parameters : $\theta_{0}, \theta_{1}$
- Cost : $J(\theta_{0}, \theta_{1}) = \dfrac{1}{2m}\displaystyle\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
- Objective : $\displaystyle\min J(\theta_{0}, \theta_{1})$

## Gradient descent

- Start with some $\theta_{0}, \theta_{1}$
- Keep changing $\theta_{0}, \theta_{1}$ to reduce $J(\theta_{0}, \theta_{1})$ until reaching minimum.
- Repeat until convergence. (Simultaneously update all $j$)
- $\theta_{j} := \theta_{j} - \alpha\dfrac{\partial}{\partial \theta_{j}}J(\theta_{0}, \theta_{1})$
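- Small $\alpha$ : slow convergence
- Large $\alpha$ : may overshoot the minimum and diverge
- No need to decrease $\alpha$ over time: with a fixed $\alpha$, gradient descent automatically takes smaller steps as it approaches a local minimum, because the gradient itself shrinks.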
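The update rule above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the default `alpha`, and the iteration count are assumptions chosen for the example, and the partial derivatives are written out for the squared-error cost defined in this section.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared prediction errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Simultaneous update: both gradients use the old parameter values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```

For data generated by $y = 1 + 2x$, the returned parameters should approach $\theta_0 = 1$ and $\theta_1 = 2$, provided $\alpha$ is small enough for the data at hand.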

# 3. Regression with Multiple variables

- Hypothesis : $h_\theta(x) = \theta^{T}x$
- Parameters : $\theta$
- Cost : $J(\theta) =  \dfrac{1}{2m}\displaystyle\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
- Objective : Min $J(\theta)$

## Gradient descent

- Repeat until convergence. (Simultaneously update all $j$)
- $\theta_{j} := \theta_{j} - \alpha\dfrac{\partial}{\partial \theta_{j}}J(\theta)$
- Feature scaling : make sure features are on a similar scale, roughly between $-1$ and $+1$
- Mean normalization : replace $x_{i}$ with $x_{i} - \mu_{i}$ so that each feature has approximately zero mean (not applied to $x_{0} = 1$).
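A small sketch of both preprocessing steps for a single feature column. Dividing by the range (max minus min) is one common choice and is an assumption here; dividing by the standard deviation would work as well. The function name is hypothetical.

```python
def mean_normalize(column):
    """Return (scaled values, mean, range) for one feature column."""
    mu = sum(column) / len(column)
    spread = max(column) - min(column)
    if spread == 0:
        spread = 1.0  # constant feature: avoid dividing by zero
    return [(v - mu) / spread for v in column], mu, spread
```

The returned `mu` and `spread` should be kept so that new inputs can be transformed the same way before prediction.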

## Normal Equation

- Method to solve $\theta$ analytically.
- Solve for $\theta_{0}, \theta_{1} \dots \theta_{n}$
    - where $J(\theta_{0}, \theta_{1} \dots \theta_{n}) =  \dfrac{1}{2m}\displaystyle\sum_{i=1}^{m} (h_\theta(x^{(i)})-y^{(i)})^2$ and $\dfrac{\partial}{\partial \theta_{j}}J(\theta) = 0$ for all $j$
- Solution is $\theta = (X^{T}X)^{-1}X^{T}y$
- If $X^{T}X$ is non-invertible, the usual causes are redundant (linearly dependent) features or too many features ($n > m$).
- Gradient descent - need to choose $\alpha$, need to iterate, works when $n$ is large.
- Normal equation - no need to choose $\alpha$, no need to iterate, slow when $n$ is large.
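As an illustration, the normal equation can be solved without forming the inverse explicitly, by running Gauss-Jordan elimination on the system $(X^{T}X)\theta = X^{T}y$. This is a standard-library sketch under that assumption; in practice a linear algebra library would normally be used. Each row of `X` is expected to start with $x_0 = 1$.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (redundant features or n > m)")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]
```

The `ValueError` branch corresponds to the non-invertible case listed above.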

# 4. Logistic regression : classification

- Threshold classifier output
- If $h_\theta(x) \ge 0.5$, then $y = 1$
- If $h_\theta(x) \lt 0.5$, then $y = 0$

- We want to model $h_\theta(x)$ as probability such that $0 \le h_\theta(x) \le 1$
- Let $h_\theta(x) = g(\theta^{T}x)$ where $g(z) = \dfrac{1}{1 + e^{-z}}$
    - Then $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^{T}x}}$ (sigmoid function)
- Logistic regression cost function
    - $J(\theta) = -\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$
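A minimal sketch of the hypothesis and cost above. The function names are illustrative, and no numerical safeguards are included: `math.exp` can overflow for very negative `z`, and `math.log` fails if a prediction is exactly 0 or 1.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include x_0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def logistic_cost(theta, X, y):
    """Average cross-entropy over the m training examples."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = hypothesis(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m
```

With all parameters at zero, every prediction is 0.5 and the cost equals $\log 2$, which is a convenient sanity check.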

## Advanced Gradient descent

- Conjugate gradient, BFGS, L-BFGS

## One vs All

- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict probability that $y = i$
- Pick the class that maximizes $h_\theta^{(i)}(x)$
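The prediction step can be sketched as follows, assuming the per-class parameter vectors have already been trained. `thetas` is a hypothetical list in which `thetas[i]` belongs to the classifier for class `i`, and classes are indexed from 0 here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_all(thetas, x):
    """Return the class whose classifier reports the highest probability."""
    scores = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)
```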

## Problem of overfitting

- With too many features, the hypothesis may fit the training examples very well but fail to generalize to new examples
- Regularization: reduce magnitude of parameters
    - $J(\theta) =  \dfrac{1}{2m}\left[\displaystyle\sum_{i=1}^{m} (h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\displaystyle\sum_{j=1}^{n}\theta_j^2\right]$
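A short sketch of the regularized cost for linear regression. Note that the penalty sum starts at $j = 1$, so $\theta_0$ is left out; `lam` stands for $\lambda$ and the function name is an assumption.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta_1..theta_n."""
    m = len(X)
    squared_error = sum(
        (sum(t * xi for t, xi in zip(theta, row)) - target) ** 2
        for row, target in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (squared_error + penalty) / (2 * m)
```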

# 5. Neural Network

![1](images/machine-learning/1.png){width=10%}

- $a_i^{(j)}$ = activation of unit i in layer j
- $\Theta^{(j)}$ = weight matrix controlling mapping from layer $j$ to $j+1$
- $a_1^{(2)} = g\left(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3\right)$
- $a_2^{(2)} = g\left(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3\right)$
- $a_3^{(2)} = g\left(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3\right)$
- $h_{\Theta}(x) = a_1^{(3)} = g\left(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}\right)$
- $z^{(2)} = \Theta^{(1)}a^{(1)}$
- $a^{(2)} = g(z^{(2)})$
- $z^{(3)} = \Theta^{(2)}a^{(2)}$
- $a^{(3)} = g(z^{(3)}) = h_{\Theta}(x)$
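The vectorized equations above can be sketched as a loop over layers. In this illustration each `Theta` is a list of rows, one per unit in the next layer, with column 0 multiplying the bias unit; the function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(Theta, activations):
    """Compute a^(j+1) = g(Theta^(j) a^(j)), adding the bias unit a_0 = 1."""
    a = [1.0] + list(activations)
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in Theta]


def forward(Thetas, x):
    """Propagate input x (without bias) through every layer in turn."""
    a = list(x)
    for Theta in Thetas:
        a = layer_forward(Theta, a)
    return a
```

As a usage example, the hand-picked weights `[[-30, 20, 20], [10, -20, -20]]` followed by `[[-10, 20, 20]]` make the network compute XNOR of two binary inputs.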

## Cost function

- Logistic regression: $J(\theta) = -\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \dfrac{\lambda}{2m}\displaystyle\sum_{j=1}^{n}\theta_j^2$
- Neural network: $J(\Theta) = -\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m}\displaystyle\sum_{k=1}^{K} y_{k}^{(i)}\log h_\Theta(x^{(i)})_{k} + (1-y_{k}^{(i)})\log(1-h_\Theta(x^{(i)})_{k})\right] + \dfrac{\lambda}{2m}\displaystyle\sum_{l=1}^{L-1}\displaystyle\sum_{i=1}^{s_{l}}\displaystyle\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
- Need to compute
    - $J(\Theta)$
    - $\dfrac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

![2](images/machine-learning/2.png){width=20%}

- $a^{(1)} = x$
- $z^{(2)} = \Theta^{(1)}a^{(1)}$
- $a^{(2)} = g(z^{(2)})$ (add $a_{0}^{(2)}$)
- $z^{(3)} = \Theta^{(2)}a^{(2)}$
- $a^{(3)} = g(z^{(3)})$ (add $a_{0}^{(3)}$)
- $z^{(4)} = \Theta^{(3)}a^{(3)}$
- $a^{(4)} = g(z^{(4)}) = h_{\Theta}(x)$

## Backpropagation

- $\delta_{j}^{(l)}$ = error of node $j$ in layer $l$
- $\delta_{j}^{(4)} = a_{j}^{(4)} - y_{j}$
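- $\delta^{(3)} = (\Theta^{(3)})^{T}\delta^{(4)} .* g^{\prime}(z^{(3)})$
- $\delta^{(2)} = (\Theta^{(2)})^{T}\delta^{(3)} .* g^{\prime}(z^{(2)})$
- Here $.*$ is the element-wise product, and for the sigmoid $g^{\prime}(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})$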

## Gradient checking

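- Compare $\dfrac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ from backpropagation against a numerical estimate of the gradient, e.g. $\dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$ with $\epsilon$ perturbing one parameter at a time
- Turn gradient checking off during training; otherwise the code runs very slowly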
- Random initialization
- Symmetry breaking
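A sketch of the numerical gradient estimate used for gradient checking, based on a two-sided difference. `J` is any cost function that takes a flat list of parameters; the step size `eps` is an assumed default.

```python
def numerical_gradient(J, theta, eps=1e-4):
    """Estimate dJ/dtheta_k for every k with a central difference."""
    grad = []
    for k in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[k] += eps    # perturb only parameter k
        minus[k] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad
```

The result is compared element by element with the gradient from backpropagation; the two should agree to several decimal places. Because every parameter needs two evaluations of `J`, this is far too slow to leave enabled during training.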

## Putting all together

1. Randomly initialize weights
2. Compute forward prop to get $h_{\Theta}(x^{(i)})$ for any $x^{(i)}$
3. Compute cost $J(\Theta)$
4. Compute backward prop to get $\dfrac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$
5. Iterate through $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}) \dots (x^{(m)}, y^{(m)})$ and do forward and backward prop for $(x^{(i)}, y^{(i)})$. Get $a^{(l)}$ and $\delta^{(l)}$ for $l = 2 \dots L$
6. Gradient checking
7. Use gradient descent to minimize $J(\Theta)$

# 6. Deciding what to do

- Bias(underfit) : both $J_{train}(\theta)$ and $J_{cv}(\theta)$ are high
- Variance(overfit) : $J_{train}(\theta)$ is low but $J_{cv}(\theta)$ is high
- High variance? Get more data, try a smaller set of features, increase the regularization parameter $\lambda$
- High bias? Try adding polynomial features, try additional features, decrease the regularization parameter $\lambda$
- In neural network, use regularization $\lambda$ to overcome overfitting
- Precision = true positive / (true positive + false positive)
- Recall = true positive / (true positive + false negative)
- We want to predict 1 when $h_{\theta}(x) \ge$ threshold
- F1 score = $2\dfrac{PR}{P+R}$
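The three metrics can be computed from label lists as in the sketch below. Returning 0 when a denominator is zero is a convention assumed here, not something the definitions above prescribe.

```python
def precision_recall_f1(y_true, y_pred):
    """Labels are 0 or 1; 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Raising the threshold typically increases precision and lowers recall; the F1 score gives a single number for comparing thresholds.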

# 7. Support Vector Machine

- Alternative view of logistic regression
- $h_{\theta}(x) = \dfrac{1}{1+\exp(-\theta^{T}x)}$
    - if $y = 1$, we want $h_{\theta}(x) \approx 1$, i.e. $\theta^{T}x \gg 0$ (for the SVM, $\theta^{T}x \ge 1$)
    - if $y = 0$, we want $h_{\theta}(x) \approx 0$, i.e. $\theta^{T}x \ll 0$ (for the SVM, $\theta^{T}x \le -1$)

## Cost function

- Logistic regression: $J(\theta) = -\frac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\displaystyle\sum_{j=1}^{n}\theta_j^2$
- Support vector machine: $C\frac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}cost_{1}(\theta^{T}x^{(i)}) + (1-y^{(i)})cost_{0}(\theta^{T}x^{(i)})\right] + \frac{1}{2}\displaystyle\sum_{j=1}^{n}\theta_j^2$
- Kernel : idea is to compute features based on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)} \dots$
- $f_{1}$ = Similarity$(x, l^{(1)}) = exp\left(-\dfrac{\left\|x-l^{(1)}\right\|^2}{2\sigma^2}\right)$
    - If $x$ is close to $l^{(1)} : f_{1} = 1$
    - If $x$ is far from $l^{(1)} : f_{1} = 0$
- Hypothesis : Given $x$, compute features $f$. Predict $y = 1$ if $\theta^{T}f \ge 0$
- Training : $C\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}cost_{1}(\theta^{T}f^{(i)}) + (1-y^{(i)})cost_{0}(\theta^{T}f^{(i)})\right] + \dfrac{1}{2}\displaystyle\sum_{j=1}^{n}\theta_j^2$
<br><br>
- Parameters : Large $C$ or small $\sigma^{2}$, low bias and high variance. Small $C$ or large $\sigma^{2}$, high bias and low variance
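The Gaussian similarity and the resulting feature vector can be sketched as follows. The function names are illustrative; how landmarks are chosen is left open here.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """exp(-||x - l||^2 / (2 sigma^2)): 1 at the landmark, near 0 far away."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """Map x to f = (f_1, ..., f_k), one similarity per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]
```

A larger `sigma` makes the similarity fall off more slowly with distance, which is why large $\sigma^{2}$ gives a smoother, higher-bias model.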

# 8. Clustering

- In supervised learning, training set : $\{(x^{(1)}, y^{(1)}) \dots (x^{(m)}, y^{(m)})\}$
- In unsupervised learning, training set : $\{x^{(1)} \dots x^{(m)}\}$

## K-means algorithm

- Input : $K$ (number of clusters) and training set $\{x^{(1)} \dots x^{(m)}\}$
- Randomly initialize centroids : $\mu_{1} \dots \mu_{K}$
- Repeat
    - for $i$ = 1 to $m$
        - $c^{(i)}$ = index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
    - for $k$ = 1 to $K$
        - $\mu_{k}$ = average of points assigned to cluster $k$
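The two alternating steps can be sketched as below. A few choices are assumptions made for the example: centroids are initialized by sampling $K$ training points, clusters are indexed from 0 rather than 1, the loop runs a fixed number of iterations, and an empty cluster simply keeps its previous centroid.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, K, iterations=20, seed=0):
    rng = random.Random(seed)
    # Random initialization: pick K distinct training points as centroids.
    centroids = [list(p) for p in rng.sample(points, K)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Cluster assignment step: c^(i) = index of the closest centroid.
        assignments = [
            min(range(K), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Move centroid step: mu_k = mean of the points assigned to cluster k.
        for k in range(K):
            members = [p for p, c in zip(points, assignments) if c == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

Because the result depends on the initialization, it is common to run the algorithm several times with different seeds and keep the clustering with the lowest total distance.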

# 9. Principal Component Analysis

Dimension reduction technique
- Reduce from $n$ dimensions to $k$ dimensions : find $k$ vectors $u^{(1)} \dots u^{(k)}$ onto which to project the data so as to minimize the projection error
- Training Set : $x^{(1)} \dots x^{(m)}$
- Preprocessing: feature scaling and mean normalization
- Compute "covariance matrix"
- $\Sigma = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^{T}$
- Compute "eigenvectors of matrix $\Sigma$"
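A standard-library sketch of the covariance step and of extracting the first principal direction. Power iteration is used here only to keep the example dependency-free; it is a substitute for the eigen-decomposition or SVD routine a numerical library would provide, and it returns just the top eigenvector. The data is assumed to be mean-normalized already, and $\Sigma$ is assumed to be non-zero.

```python
def covariance_matrix(X):
    """Sigma = (1/m) * sum over examples of x x^T (X already mean-normalized)."""
    m, n = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / m for b in range(n)]
            for a in range(n)]


def top_eigenvector(Sigma, iterations=200):
    """Power iteration: repeatedly apply Sigma and renormalize."""
    n = len(Sigma)
    v = [1.0] * n  # assumed not orthogonal to the top eigenvector
    for _ in range(iterations):
        w = [sum(Sigma[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v


def project(x, u):
    """z = u^T x, the coordinate of x along direction u."""
    return sum(ui * xi for ui, xi in zip(u, x))
```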

# 10. Anomaly Detection

Anomaly detection algorithm
1. Choose features $x_{i}$ that you think might be indicative of anomalous examples
2. Fit parameters $\mu_{1} \dots \mu_{n}$ and $\sigma_{1}^{2} \dots \sigma_{n}^{2}$ of Gaussian distribution
    - $\mu_{j} = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}x_{j}^{(i)}$, $\sigma_{j}^{2} = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left(x_{j}^{(i)}-\mu_{j}\right)^{2}$
<br><br>
3. Compute $p(x) = \displaystyle\prod_{j=1}^{n}p(x_{j};\mu_{j},\sigma_{j}^{2})$ 
<br><br>
4. Anomaly if $p(x) < \epsilon$
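The four steps can be sketched as follows, treating each feature as an independent Gaussian. The function names are illustrative, and a feature with zero variance would cause a division by zero in this minimal version.

```python
import math


def fit_gaussian(X):
    """Per-feature mean and variance (dividing by m, as in the formulas above)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2


def density(x, mu, sigma2):
    """p(x) = product over features of the univariate Gaussian density."""
    p = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        p *= math.exp(-(xj - mj) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    return density(x, mu, sigma2) < epsilon
```

The threshold `epsilon` is usually chosen on a labeled cross-validation set, for example by maximizing the F1 score.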

Anomaly detection vs supervised learning
- Very small number of positive examples / large number of positive and negative examples
- Anomalies can look very different from one another / positive examples are similar to one another
- Ex. fraud detection / Ex. email spam classification

# 11. Recommender Systems

## Problem formulation

- $r(i,j) = 1$ if user $j$ rated movie $i$ (0 otherwise)
- $y^{(i,j)}$ = rating by user $j$ on movie $i$ (if defined)
- $\theta^{(j)}$ = parameter vector for user $j$
- $x^{(i)}$ = feature vector for movie $i$
- For user $j$, movie $i$, predicted rating : $\left(\theta^{(j)}\right)^{T}(x^{(i)})$
- $m^{(j)}$ = number of movies rated by user $j$

## Optimization objective

- To learn $\theta^{(j)}$ (parameter for user $j$)
- $\displaystyle\min_{\theta^{(j)}} \dfrac{1}{2}\displaystyle\sum_{i:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2}\displaystyle\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$
- Thus, to learn $\theta^{(1)} \dots \theta^{(n_{u})}$ 
- $\displaystyle\min_{\theta^{(1)} \dots \theta^{(n_{u})}} \dfrac{1}{2}\displaystyle\sum_{j=1}^{n_{u}}\displaystyle\sum_{i:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2}\displaystyle\sum_{j=1}^{n_{u}}\displaystyle\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$

## Gradient update

- $\theta_{k}^{(j)} := \theta_{k}^{(j)} - \alpha\displaystyle\sum_{i:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)x_{k}^{(i)}$ for $k = 0$
- $\theta_{k}^{(j)} := \theta_{k}^{(j)} - \alpha\left[\displaystyle\sum_{i:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)x_{k}^{(i)} + \lambda\theta_{k}^{(j)}\right]$ for $k \ne 0$

## Similarly

- To learn $x^{(i)}$ 
- $\displaystyle\min_{x^{(i)}} \dfrac{1}{2}\displaystyle\sum_{j:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2}\displaystyle\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}$
- Thus, to learn $x^{(1)} \dots x^{(n_{m})}$
- $\displaystyle\min_{x^{(1)} \dots x^{(n_{m})}} \dfrac{1}{2}\displaystyle\sum_{i=1}^{n_{m}}\displaystyle\sum_{j:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2}\displaystyle\sum_{i=1}^{n_{m}}\displaystyle\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}$
- $x_{k}^{(i)} := x_{k}^{(i)} - \alpha\displaystyle\sum_{j:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)\theta_{k}^{(j)}$ for $k = 0$
- $x_{k}^{(i)} := x_{k}^{(i)} - \alpha\left[\displaystyle\sum_{j:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)\theta_{k}^{(j)} + \lambda x_{k}^{(i)}\right]$ for $k \ne 0$

## Collaborative filtering

Guess $\theta \rightarrow x \rightarrow \theta \rightarrow x \dots$
1. Initialize $x^{(1)} \dots x^{(n_{m})}, \theta^{(1)} \dots \theta^{(n_{u})}$ to small random values
2. Minimize the cost over $\theta^{(1)} \dots \theta^{(n_{u})}$ and $x^{(1)} \dots x^{(n_{m})}$ simultaneously
- $\displaystyle\min_{x^{(1)} \dots x^{(n_{m})}, \theta^{(1)} \dots \theta^{(n_{u})}} \dfrac{1}{2}\displaystyle\sum_{(i,j):r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2}\displaystyle\sum_{i=1}^{n_{m}}\displaystyle\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2} + \dfrac{\lambda}{2}\displaystyle\sum_{j=1}^{n_{u}}\displaystyle\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$
- $x_{k}^{(i)} := x_{k}^{(i)} - \alpha\left[\displaystyle\sum_{j:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)\theta_{k}^{(j)} + \lambda x_{k}^{(i)}\right]$
- $\theta_{k}^{(j)} := \theta_{k}^{(j)} - \alpha\left[\displaystyle\sum_{i:r(i,j)=1}\left(\left(\theta^{(j)}\right)^{T}x^{(i)} - y^{(i,j)}\right)x_{k}^{(i)} + \lambda\theta_{k}^{(j)}\right]$
3. For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^{T}x$
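One gradient step of the joint objective can be sketched as below. The layout `Y[i][j]` (movie `i`, user `j`) follows the notation above; the function names, and the choice to regularize every component of $x$ and $\theta$, are assumptions of this example.

```python
def predict(theta, x):
    """Predicted rating theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def cf_cost(X, Theta, Y, R, lam):
    """Squared error over rated entries plus L2 penalties on X and Theta."""
    squared = sum((predict(Theta[j], X[i]) - Y[i][j]) ** 2
                  for i in range(len(X)) for j in range(len(Theta)) if R[i][j])
    penalty = (sum(v * v for row in X for v in row)
               + sum(v * v for row in Theta for v in row))
    return 0.5 * squared + 0.5 * lam * penalty


def cf_step(X, Theta, Y, R, lam, alpha):
    """Update all x^(i) and theta^(j) simultaneously with one gradient step."""
    n_m, n_u, n = len(X), len(Theta), len(X[0])
    # Prediction errors, set to zero where no rating exists.
    E = [[(predict(Theta[j], X[i]) - Y[i][j]) if R[i][j] else 0.0
          for j in range(n_u)] for i in range(n_m)]
    new_X = [[X[i][k] - alpha * (sum(E[i][j] * Theta[j][k] for j in range(n_u))
                                 + lam * X[i][k])
              for k in range(n)] for i in range(n_m)]
    new_Theta = [[Theta[j][k] - alpha * (sum(E[i][j] * X[i][k] for i in range(n_m))
                                         + lam * Theta[j][k])
                  for k in range(n)] for j in range(n_u)]
    return new_X, new_Theta
```

Calling `cf_step` repeatedly from small random initial values, with a suitably small `alpha`, should drive `cf_cost` down; `predict` then fills in the missing ratings.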

# 12. Large Scale Machine Learning

## Batch Gradient Descent

- $h_{\theta}(x) = \displaystyle\sum_{j=0}^{n}\theta_{j}x_{j}$
- $J_{train}(\theta) = \dfrac{1}{2m}\displaystyle\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}$
- Repeat 
    - $\theta_{j} := \theta_{j} - \alpha\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$ for every $j = 0 \dots n$
- For large $m$, computation is very expensive - use all $m$ examples in each iteration

## Stochastic Gradient Descent

- For the same objective function, randomly shuffle dataset
- Repeat
    - for $i := 1 \dots m$
        - $\theta_{j} := \theta_{j} - \alpha\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$ for every $j = 0 \dots n$
- Use $1$ example in each iteration

## Mini-batch Gradient Descent

- Say $b = 10$ and $m = 1000$
- Repeat
    - for $i := 1, 11, 21, 31 \dots 991$
        - $\theta_{j} := \theta_{j} - \alpha\dfrac{1}{10}\displaystyle\sum_{k=i}^{i+9}\left(h_{\theta}(x^{(k)})-y^{(k)}\right)x_{j}^{(k)}$ for every $j = 0 \dots n$
- Use $b$ examples in each iteration
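A sketch of the mini-batch loop for linear regression. Setting `b=1` gives stochastic gradient descent and `b=len(X)` gives batch gradient descent. Reshuffling on every pass and the default values are assumptions made for the example; each row of `X` is expected to start with $x_0 = 1$.

```python
import random


def minibatch_gradient_descent(X, y, b=10, alpha=0.01, epochs=100, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    order = list(range(m))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for start in range(0, m, b):
            batch = order[start:start + b]
            errors = [sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
                      for i in batch]
            # Average the gradient over the batch, then update every theta_j.
            theta = [t - alpha * sum(e * X[i][j] for e, i in zip(errors, batch))
                     / len(batch)
                     for j, t in enumerate(theta)]
    return theta
```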

# 13. Photo OCR

1. Text detection
2. Character segmentation
3. Character classification

## Discussion on getting more data

1. Keep increasing # of features / hidden units in neural network until you have low bias classifier
2. Ask how much effort it is to get 10 times more data

## Ceiling analysis

- Which part of the pipeline should you spend the most time improving?