# Machine Learning Taxonomy 

Machine learning models can be classified along two main axes:

### Taxonomy by Supervision (The "How")

- ### Supervised Learning
  - We're provided with labeled data to train on.
  - We wish to learn to predict the corresponding labels for new samples.
  - We usually do this using Regression or Classification, depending on the label type (continuous or categorical).
- ### Unsupervised Learning 
  - The data we're provided contains no labels.
  - We aim to discover structural/spatial relations in the data.
  - Common tasks: Clustering (K-means), Dimensionality Reduction, Association

- ### Semi-Supervised Learning
  - Only some of the data we're provided is labeled; most of it is unlabeled.
  - A common goal is for the model to label the rest of the dataset itself, an approach known as self-training.
  - We apply similar or hybrid solutions to the general approaches described above.
  
- ### Self-Supervised Learning
  - The task is designed so that the data provides its own labels.
  - There's no need for human annotation.
  - We aim to teach a model to extract the useful features of the data provided.
  - Only later on in the modelling would we use this pre-trained model for specific tasks

### Taxonomy by Model Objective (The "What")

Regardless of the supervision above, the model will usually fall into one of the following buckets:

- ### Discriminative Models
  - Focus on mapping inputs to outputs directly
  - Example: Provide an image with a dog and the model classifies the image as a dog from a cat/dog label set
- ### Generative Models
  - Focus on understanding how the data was "made"
  - Example: Provide an "empty matrix" and the model outputs an image of a dog, drawn from the pool of different dog and cat images it can create

# Supervised Learning 

We now dive into the main concepts applied in supervised learning, since these are the fundamental building blocks used repeatedly throughout Deep Learning (no matter the complexity of the model).

Supervised learning can be summarised as building a model that defines a mathematical function: when we pass inputs through the function, it computes an output. Computing the output from the inputs is referred to as $\color{lightblue}inference$.

This function contains $\color{lightblue}parameters$, which affect the output produced for a given input. As such, the model equation describes a family of possible relationships between inputs and outputs, and the parameters specify the particular relationship.

$\color{lightblue}Training$ a model essentially means trying to find the parameters that describe the true relationship between the inputs and outputs.

We train the model by trial and error over the training set: on each trial we measure the error and then adjust the parameters in the direction that reduces the error on the next trial.

After training a model, we assess its performance: we run the model on separate test data to see how well it performs on examples it has not seen, i.e. how well it $\color{lightblue}generalises$.

If the results are good enough, the model is ready for deployment.
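
The assessment step can be sketched in a few lines of Python. This is only an illustration: the data values, the 8/2 split, the straight-line model, and the "already trained" parameters `phi_trained` are all assumptions made for the example, not something these notes prescribe.

```python
# Hypothetical data that roughly follows y = 1 + 2x (values made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.9, 19.1]

# Hold out the last two samples as test data.
# (In practice the split is usually random; a fixed split keeps the sketch simple.)
split = 8
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

def predict(x, phi):
    """Inference with a straight-line model: phi[0] + phi[1] * x."""
    return phi[0] + phi[1] * x

def mean_squared_error(phi, xs, ys):
    """Average squared mismatch between predictions and true values."""
    return sum((predict(x, phi) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Assumed outcome of training; how to obtain it is a separate topic.
phi_trained = (1.0, 2.0)

train_error = mean_squared_error(phi_trained, train_x, train_y)
test_error = mean_squared_error(phi_trained, test_x, test_y)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

A test error close to the training error suggests the model generalises; a much larger test error suggests it does not.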

#### Formalization

$\text{Let } \vec x \in \R^n \text{ be our input and } \vec y \in \R^m  \text{ be the output}$

$\text{To make a prediction we need a model } f[\cdot] \text{ which takes } x \text{ and returns } y$

$$ f[x]= y $$

$\text{Since we need parameters to describe the relation then } f \text{ needs to accept these parameters } \phi \text{ therefore:}$ 

$$ f[x, \phi] = y$$

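$\text{To train the model we quantify the degree of mismatch between the inference and the true values with a loss } L[\phi] \text{ and look for the parameters that minimise it over the training pairs } \{x_i, y_i\}\text{:}$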

$$ \hat\phi = \argmin_{\phi}\left[L[\phi]\right] = \argmin_{\phi}\left[L[\{x_i, y_i\}, \phi]\right]  $$

## 1D Linear Regression Model Example

Assume our input and output are scalar values

$\text{model: }$  $$y = f[x, \phi] = \phi_0 + \phi_1x$$

$\text{Parameters: }$  $$\phi = [\phi_0, \phi_1]$$

$\text{Loss Function: }$ $$L[\phi] = \sum_{i=1}^N(f[x_i, \phi] - y_i)^2 = \sum_{i=1}^N(\phi_0 + \phi_1x_i - y_i)^2$$
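
As a minimal sketch, the model and least squares loss above translate directly into Python. The toy data and the function names `model` and `loss` are assumptions chosen for illustration.

```python
# Hypothetical toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def model(x, phi):
    """1D linear model f[x, phi] = phi_0 + phi_1 * x."""
    phi_0, phi_1 = phi
    return phi_0 + phi_1 * x

def loss(phi, xs, ys):
    """Least squares loss: sum over samples of (f[x_i, phi] - y_i)^2."""
    return sum((model(x, phi) - y) ** 2 for x, y in zip(xs, ys))

loss_true = loss((1.0, 2.0), xs, ys)  # the parameters that generated the data
loss_off = loss((0.0, 1.0), xs, ys)   # a poorer choice of parameters
print(loss_true, loss_off)            # prints: 0.0 30.0
```

The parameters that match the data give zero loss, while a mismatched line is penalised by the squared size of its errors.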

The interactive figures linked at the end of this section visualise the parameters and their effect on the loss function.


$\text{Our Goal: }$ $$\hat\phi = \argmin_{\phi} \left[\sum_{i=1}^N(\phi_0+\phi_1x_i - y_i)^2\right]$$

The following link provides a visualization of the above concepts:
1. 1D linear model
2. Least squares loss function
3. Loss function space with respect to parameters

https://udlbook.github.io/udlfigures/
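
To get a feel for the loss surface without the interactive figures, one option is to evaluate the loss on a coarse grid of parameter values and pick the lowest point. This brute-force grid search is only an illustration of the $\argmin$ idea, not the training method used in practice; the toy data and grid range are assumptions.

```python
# Hypothetical toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def loss(phi_0, phi_1):
    """Least squares loss of the line phi_0 + phi_1 * x on the toy data."""
    return sum((phi_0 + phi_1 * x - y) ** 2 for x, y in zip(xs, ys))

# Candidate values -2.0, -1.5, ..., 4.0 for each parameter.
candidates = [i * 0.5 for i in range(-4, 9)]

# The "loss surface": one loss value per (phi_0, phi_1) grid point.
surface = {(p0, p1): loss(p0, p1) for p0 in candidates for p1 in candidates}

# Brute-force argmin over the grid.
best_phi = min(surface, key=surface.get)
print(best_phi, surface[best_phi])  # prints: (1.0, 2.0) 0.0
```

Grid search scales poorly as the number of parameters grows, which is one reason a more directed way of improving the parameters is needed.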

## Extending Linear Regression to Higher Dimensions

Suppose we have $d$ features $x = (x_1, x_2, ..., x_d)$.

The general structure remains the same. From a mathematical and implementation standpoint, we now represent it using vectors and vector operations.

$\text{model: }$ $$y = f[x, \phi] = \phi_0 + \sum_{i=1}^dx_i\phi_i = \phi^T\tilde{x}, \qquad \tilde{x} = [1, \, x_1, \dots, x_d]^T$$

$\text{Parameters: }$ $$\phi = [\phi_0, \, \phi_1, \dots, \phi_d]$$

What applies to a single instance can also be applied to a subset (or all) of the training data at once, by stacking the $N$ samples as the rows of a matrix:

$$
X = \begin{bmatrix}
— & x_1 & — \\
— & x_2 & — \\
— & x_3 & — \\
 & \vdots & \\
— & x_N & — 
\end{bmatrix}
$$

Here each $x_{i} \in \R^d$ is a row vector containing the $d$ feature values of the $i$-th sample. The inference for all $N$ samples is then:

$$ 
f[X, \phi]  = X\begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_d \end{bmatrix} + \phi_0
\begin{bmatrix} 
1 \\ 1 \\ \vdots \\ 1 
\end{bmatrix}
$$

We can also look at this from a dimension perspective:

$$
(N \times 1) = (N \times d) (d \times 1) + (N \times 1)
$$
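
The batched inference and its shapes can be checked with a short NumPy sketch. The sizes `N` and `d`, the random data, and the variable names are assumptions for illustration. The second half shows an equivalent formulation that prepends a column of ones to `X` so the bias $\phi_0$ is absorbed into a single matrix product.

```python
import numpy as np

N, d = 5, 3                           # hypothetical number of samples and features
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))           # each row is one sample x_i
phi_w = rng.normal(size=(d, 1))       # column vector [phi_1, ..., phi_d]
phi_0 = 0.5                           # bias term

# (N x 1) = (N x d)(d x 1) + (N x 1)
y_hat = X @ phi_w + phi_0 * np.ones((N, 1))

# Equivalent form: augment X with a leading column of ones.
X_aug = np.hstack([np.ones((N, 1)), X])   # shape (N, d + 1)
phi = np.vstack([[phi_0], phi_w])         # shape (d + 1, 1)
y_hat_aug = X_aug @ phi

print(y_hat.shape, np.allclose(y_hat, y_hat_aug))  # prints: (5, 1) True
```

In practice NumPy broadcasting lets `X @ phi_w + phi_0` produce the same result without building the vector of ones explicitly.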

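$\text{Loss Function: }$ $$L[\phi] = \frac{1}{2N}\sum_{i=1}^N(f[x_i, \phi] - y_i)^2  = \frac{1}{2N} ||f[X, \phi] - y||^2$$

Here $y \in \R^N$ stacks the targets $y_i$. The constant factor $\frac{1}{2N}$ only rescales the loss, so it does not change which $\phi$ minimises it.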
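
A small NumPy sketch can confirm that the per-sample sum and the squared-norm form of the loss agree. The random data, sizes, and the helper name `predict` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 2                           # hypothetical number of samples and features
X = rng.normal(size=(N, d))           # inputs, one sample per row
y = rng.normal(size=(N, 1))           # targets as a column vector
phi_w = rng.normal(size=(d, 1))       # weights [phi_1, ..., phi_d]
phi_0 = 0.1                           # bias

def predict(X, phi_0, phi_w):
    """Batched linear model: X @ phi_w + phi_0 (bias broadcast over rows)."""
    return X @ phi_w + phi_0

residual = predict(X, phi_0, phi_w) - y          # shape (N, 1)

# Form 1: explicit sum over samples of squared residuals.
loss_sum = sum(float(r) ** 2 for r in residual[:, 0]) / (2 * N)

# Form 2: squared Euclidean norm of the residual vector.
loss_norm = np.linalg.norm(residual) ** 2 / (2 * N)

print(np.isclose(loss_sum, loss_norm))  # prints: True
```

The norm form is usually preferred in code because it avoids a Python-level loop over samples.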


### Questions That Remain to Be Answered:
1. The current model would only perform well on linear data. What if the data isn't linearly correlated?
2. We haven't yet described the method for improving the parameters.
3. How did we come up with this loss function, and how do we know it's a good loss to use?