### Supervised and Unsupervised Learning

*Supervised learning* is an approach in machine learning where we have a *labeled* dataset, meaning that each data point consists of certain *features* and a corresponding label. Such datasets supervise the learning algorithm, whose goal is to deduce a function that maps input feature vectors to output labels. In supervised learning, unlike unsupervised learning, we have a *ground truth*, meaning that we know what the outputs should be for certain input samples. We use **feature extraction** techniques to describe the raw data with features that are useful for our predictive model, e.g. the RGB values of an image. Sometimes the extracted features do not represent the data well enough for our model, and we instead derive new features from existing ones. This process is called **feature engineering**: for example, we might use acceleration, derived from velocity as a function of time, or the log of a feature instead of the feature itself.
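
The two feature engineering examples mentioned above can be sketched with NumPy. This is a minimal illustration, not a recipe: the arrays `t`, `velocity`, and `income` are made-up toy data, and `np.gradient` is just one of several ways to approximate a derivative numerically.

```python
import numpy as np

# Hypothetical raw measurements: velocity (m/s) sampled at times t (s).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
velocity = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

# Engineered feature 1: acceleration as the numerical derivative dv/dt.
acceleration = np.gradient(velocity, t)

# Engineered feature 2: the log of a strictly positive, heavy-tailed feature
# (hypothetical income values), used instead of the raw feature.
income = np.array([20_000.0, 35_000.0, 50_000.0, 120_000.0, 1_000_000.0])
log_income = np.log(income)

# Stack the engineered features into a feature matrix, one row per sample.
features = np.column_stack([acceleration, log_income])
print(features.shape)
```

The log transform compresses the range of `income` from a factor of 50 to a difference of about 4, which is often easier for a linear model to work with.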

The most widely used supervised learning algorithms include:
- Linear regression
- Logistic regression
- Support-vector machines (SVM)
- Naive Bayes
- Decision trees
- K-nearest neighbors (KNN)
- Neural networks


*Unsupervised learning* is characterized by the lack of ground truth. The algorithms do not take labeled input; instead, the goal is to infer the inherent structure present in the dataset and to perform exploratory analysis. The output can take many forms, such as features in an image or, most commonly, clusters in the data. **Dimensionality reduction** is a key technique within unsupervised learning. Working in high-dimensional spaces is often difficult: it is computationally expensive, and the data tends to be sparse. Dimensionality reduction is the transformation of data from a high-dimensional space (many distinct features or independent variables) into a low-dimensional space such that it retains the intrinsic characteristics of the original data. It enables us to reduce noise and redundancy in the dataset and find an approximate version of the dataset using fewer features. Unsupervised learning (along with supervised learning) is also used for **representation/feature learning**, the set of techniques that let a system *automatically* extract features from raw data or discover the representation needed for feature detection or classification. This allows the model to learn the features automatically (as opposed to manual feature extraction and engineering) and then use them, perhaps in a supervised learning model, to perform a certain task.
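
As one concrete instance of dimensionality reduction, the sketch below uses principal component analysis (PCA) computed through a singular value decomposition. PCA is my choice of technique for illustration; the text above does not single out a method. The helper name `pca_reduce` and the toy data are hypothetical.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto their top-k principal components."""
    # Center each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Coordinates of every sample in the k-dimensional subspace.
    Z = X_centered @ components.T
    # Fraction of the total variance retained by the k components.
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return Z, components, explained

rng = np.random.default_rng(0)
# Toy data with a redundant feature: column 3 is the sum of columns 1 and 2.
latent = rng.normal(size=(200, 2))
X = np.column_stack([latent[:, 0], latent[:, 1], latent[:, 0] + latent[:, 1]])

Z, components, explained = pca_reduce(X, k=2)
print(Z.shape, explained)
```

Because the third feature is redundant, two components should retain essentially all of the variance, which is the "fewer features, same intrinsic characteristics" idea in miniature.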

Unsupervised learning models are used in three main tasks:
- Clustering
- Association
- Dimensionality reduction
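
To make the clustering task concrete, here is a minimal sketch of k-means (Lloyd's algorithm), chosen only as a familiar example. The function name `kmeans`, the choice of the first `k` rows as initial centroids, and the toy points are all assumptions made for illustration; practical implementations use more careful initialization.

```python
import numpy as np

def kmeans(X, k, n_iters=100):
    """Lloyd's algorithm with the first k rows of X as initial centroids."""
    centroids = X[:k].astype(float).copy()
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of unlabeled points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```

No labels are supplied at any point; the grouping comes only from the distances between the points.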

### Linear Regression

*Linear regression* is used to find a linear relation between a target variable and one or more features:

$$
\begin{align}
y_i &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_m x_{im} + \varepsilon_i = \boldsymbol{x}_i \cdot \boldsymbol{\beta} + \varepsilon_i, \ \ i = 1, \cdots, n \\
\boldsymbol{y} &= \boldsymbol{X} \cdot \boldsymbol{\beta} + \boldsymbol{\varepsilon} \ \ \textsf{in matrix notation, where} \\
\boldsymbol{y} &= 
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots\\
y_n
\end{bmatrix}
, \ \ \boldsymbol{X} = 
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1m} \\
1 & x_{21} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nm}
\end{bmatrix}
, \ \ \boldsymbol{\beta} = 
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_m
\end{bmatrix}
, \ \ \boldsymbol{\varepsilon} = 
\begin{bmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{bmatrix}
\end{align}
$$

In a linear regression model, each target $y_i$ is a linear combination of $m$ features plus the intercept term $\beta_0$, which is the value of the prediction when all the features are $0$. The elements of $\boldsymbol{\beta}$ are known as *regression coefficients*. $\boldsymbol{\varepsilon}$ is the *error term*, or *noise*, as opposed to the signal provided by the rest of the model; it captures all factors that influence the dependent variable $y$ other than the *regressors* $\boldsymbol{x}$. Fitting a linear model to a given dataset requires estimating the regression coefficients such that the error term, $\boldsymbol{\varepsilon} = \boldsymbol{y} - \boldsymbol{X} \cdot \boldsymbol{\beta}$, is minimized. For example, it is common to minimize the **mean squared error**, $\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2$, as a measure of the size of $\boldsymbol{\varepsilon}$, as we'll see in the next section.
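
As a quick illustration of the matrix form, the sketch below builds a design matrix with a leading column of ones and estimates $\boldsymbol{\beta}$ with NumPy's least-squares solver. This is a direct solve rather than the iterative training discussed next, and the data, `true_beta`, and the noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2
features = rng.normal(size=(n, m))
true_beta = np.array([1.0, 2.0, -3.0])  # hypothetical [beta_0, beta_1, beta_2]

# Design matrix X: the leading column of ones carries the intercept beta_0.
X = np.column_stack([np.ones(n), features])
noise = 0.1 * rng.normal(size=n)        # plays the role of the error term
y = X @ true_beta + noise

# Least-squares estimate: the beta that minimizes the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
mse = np.mean(residuals ** 2)
print(beta_hat, mse)
```

The estimate should land close to `true_beta`, and its mean squared error cannot exceed that of the true coefficients on this sample, since least squares minimizes exactly that quantity.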

#### Training

Imagine we have a batch of $n$ observations that we want to model with linear regression. We want to find the parameters of the model (the regression coefficients and the intercept term, here represented by $\boldsymbol{W}$ and $\boldsymbol{B}$) such that our predictions $\boldsymbol{p}$ are as close to the target labels $\boldsymbol{y}$ as possible. One way to measure this closeness is to use the mean squared error (MSE) as our **loss function**: the closer it is to zero, the better our model parameters are:

$$
\begin{align}
\boldsymbol{X} &=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix}, \ \ \boldsymbol{W} = 
\begin{bmatrix}
w_1 \\
w_2 \\
\vdots \\
w_m
\end{bmatrix}, \ \ \boldsymbol{B} = 
\begin{bmatrix}
b \\
b \\
\vdots \\
b
\end{bmatrix}_{n \times 1} \\
\boldsymbol{p} &= 
\begin{bmatrix}
p_1 \\
p_2 \\
\vdots \\
p_n
\end{bmatrix} = 
\boldsymbol{X} \cdot \boldsymbol{W} + \boldsymbol{B} = 
\begin{bmatrix}
x_{11}w_1 + x_{12}w_2 + \cdots + x_{1m}w_m + b \\
x_{21}w_1 + x_{22}w_2 + \cdots + x_{2m}w_m + b \\
\vdots \\
x_{n1}w_1 + x_{n2}w_2 + \cdots + x_{nm}w_m + b
\end{bmatrix} \\
L &= \text{MSE}(\boldsymbol{y}, \boldsymbol{p}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - p_i)^2
\end{align}
$$

We can use the gradient of $L$ with respect to $W$, $\nabla L$, to update each element of $W$ such that the MSE, and with it our loss in modelling the dataset, decreases. Note that the intercept term in linear regression is the same for all observations, so every element of $B$ is the same number $b$; this is the **bias** term.
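
One common way to turn the gradient into an update is plain gradient descent, written out below. The step size $\eta$ (the learning rate) is a symbol introduced here for illustration and is not part of the notation above; other update rules are possible.

$$
\begin{align}
\boldsymbol{W} &\leftarrow \boldsymbol{W} - \eta \, \frac{\partial L}{\partial \boldsymbol{W}} \\
b &\leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{align}
$$

Moving against the gradient decreases $L$ as long as $\eta$ is small enough.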

```python
import numpy as np


def linear_regression_forward_pass(X, W, y, b):
    """Compute predictions p = X W + b and the MSE loss against labels y."""
    # make sure the number of data points and the number of labels match
    assert X.shape[0] == y.shape[0]
    # we should be able to take the dot product of X and W
    assert X.shape[1] == W.shape[0]
    # bias should be a scalar value
    assert isinstance(b, (int, float, np.number))

    # do the forward pass of the computation graph
    N = np.dot(X, W)
    p = N + b  # the scalar bias is broadcast to every observation
    # guard against silent broadcasting, e.g. y of shape (n,) vs. p of shape (n, 1)
    assert y.shape == p.shape
    L = np.mean(np.power(y - p, 2))

    # save the forward pass data
    forward_pass_data = {'N': N, 'p': p}
    return L, forward_pass_data
```

##### Calculating Gradients

In order to do the backward pass in the computation graph we compute the derivative of each constituent function and evaluate those derivatives at the inputs that those functions receive on the forward pass, and then multiply these derivatives together:

$$
\begin{align}
\nu(X, W) &= X \cdot W  = N\\
\sigma(N, B) &= X \cdot W + B = N + B = \boldsymbol{p} \\
\lambda(\boldsymbol{y}, \boldsymbol{p}) &= \frac{1}{n}\sum_{i=1}^{n}(y_i - p_i)^2 \\
L &= \lambda(\boldsymbol{y}, \sigma(\nu(X, W), B)) = \lambda(\boldsymbol{y}, \sigma(N, B)) = \lambda(\boldsymbol{y}, \boldsymbol{p})\\
\frac{\partial L}{\partial W} &= \frac{\partial \lambda }{\partial \boldsymbol{p}}(\boldsymbol{y}, \boldsymbol{p}) \frac{\partial \sigma}{\partial N}(N, B) \frac{\partial \nu}{\partial W}(X, W) \ \ (\textsf{chain rule}) \\
\frac{\partial L}{\partial B} &= \frac{\partial \lambda}{\partial \boldsymbol{p}}(\boldsymbol{y}, \boldsymbol{p})\frac{\partial \sigma}{\partial B}(N, B) \ \ (\textsf{$\nu$ does not depend on $B$})\\
\frac{\partial \lambda}{\partial \boldsymbol{p}} &= -\frac{2}{n}(\boldsymbol{y} - \boldsymbol{p}) \\
\frac{\partial \sigma}{\partial N} &= \boldsymbol{I}_{N}, \ \frac{\partial \sigma}{\partial B} = \boldsymbol{I}_B \\
\frac{\partial \nu}{\partial W} &= X^T \ \ (\textsf{proof in foundations notebook}) \\
\Rightarrow \frac{\partial L}{\partial W} &= \boldsymbol{X}^T \cdot \left(\boldsymbol{I}_N \odot \frac{2}{n}(\boldsymbol{p} - \boldsymbol{y})\right) = \frac{2}{n}\boldsymbol{X}^T \cdot (\boldsymbol{p} - \boldsymbol{y}) \\
\Rightarrow \frac{\partial L}{\partial B} &= \frac{2}{n}(\boldsymbol{p} - \boldsymbol{y}) \odot \boldsymbol{I}_B \\
\Rightarrow \frac{\partial L}{\partial b} &= \frac{2}{n}\sum_{i=1}^{n}(p_i - y_i) \ \ (\textsf{every element of $B$ is the same $b$})
\end{align}
$$
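
The backward pass can be sketched in NumPy and combined with a simple gradient descent loop. This is an illustrative sketch under stated assumptions, not the notebook's own implementation: the helper names `forward` and `backward`, the `learning_rate` value, the iteration count, and the noise-free toy data are all invented here. The sketch takes the loss as the mean over the $n$ observations, so its gradients carry a factor of $1/n$, and it treats the bias as a single scalar $b$.

```python
import numpy as np

def forward(X, W, b, y):
    """Forward pass: predictions p = X W + b and the MSE loss."""
    p = X @ W + b
    return np.mean((y - p) ** 2), p

def backward(X, y, p):
    """Gradients of the MSE loss with respect to W and the scalar bias b."""
    n = X.shape[0]
    dL_dp = 2.0 * (p - y) / n   # derivative of the mean squared error
    dL_dW = X.T @ dL_dp         # chain rule through N = X W
    dL_db = dL_dp.sum()         # b is shared by every observation
    return dL_dW, dL_db

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_W, true_b = np.array([1.5, -2.0, 0.5]), 0.7  # hypothetical parameters
y = X @ true_W + true_b                           # noise-free toy targets

W, b = np.zeros(3), 0.0
learning_rate = 0.1                               # hypothetical step size
initial_loss, _ = forward(X, W, b, y)
for _ in range(500):
    loss, p = forward(X, W, b, y)
    dL_dW, dL_db = backward(X, y, p)
    W -= learning_rate * dL_dW
    b -= learning_rate * dL_db
final_loss, _ = forward(X, W, b, y)
print(initial_loss, final_loss, W, b)
```

On this toy problem the loss should fall from its initial value to nearly zero, with `W` and `b` approaching the parameters used to generate the data. A finite-difference comparison against `forward` is a cheap way to check a hand-derived gradient like this one.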

- **TODO**: bias, variance, and bias-variance tradeoff

### Resources
- [Wikipedia - Supervised Learning](https://en.wikipedia.org/wiki/Supervised_learning)
- [IBM Learn - Unsupervised Learning](https://www.ibm.com/cloud/learn/unsupervised-learning)
- [A Review on Linear Regression Comprehensive in Machine Learning](https://jastt.org/index.php/jasttpath/article/download/57/20)
- [Wikipedia - Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)