# Machine Learning Recap

## Objective

In this chapter we go over two main problems that are solved in Machine Learning: 
1. Regression Problem
2. Classification Problem

Both problem types are used and discussed throughout the rest of the chapters.

Before getting started, the section below gives a general overview of the Machine Learning taxonomy: how we describe the different types of problems and the different methods for solving them.

# Machine Learning Taxonomy 

Machine Learning methods are commonly organized along two taxonomies:

### Taxonomy by Supervision (The "How")

- ### Supervised Learning
  - We're provided with labeled data to train on.
  - We wish to learn to predict the corresponding labels for new samples.
  - We usually do this using Regression or Classification based on the label type.
- ### Unsupervised Learning 
  - The data we're provided contains no labels.
  - We aim to discover structural/spatial relations in the data.
  - Common tasks: Clustering (K-means), Dimensionality Reduction, Association

- ### Semi-Supervised Learning
  - Only some of the data we're provided is labeled; most of it isn't.
  - The general task is for the model to label the full dataset, an approach known as self-training.
  - We apply similar or hybrid versions of the approaches described above.
  
- ### Self-Supervised Learning
  - The task is designed so that the data is its own label
  - There's no need for human annotation.
  - We aim to teach a model to extract the useful features of the data provided.
  - Only later on in the modelling would we use this pre-trained model for specific tasks

### Taxonomy by Model Objective (The "What")

Regardless of the supervision above, the model will usually fall into one of the following buckets:

- ### Discriminative Models
  - Focus on mapping inputs to outputs directly
  - Example: Provide an image with a dog and the model classifies the image as a dog from a cat/dog label set
- ### Generative Models
  - Focus on understanding how the data was "made"
  - Example: Provide an "Empty Matrix" and the model outputs an image of a dog <br> from a pool of images of different dogs and cats it can create

# Supervised Learning 

We now dive into the main concepts applied in supervised learning, since these are the fundamental building blocks used repeatedly throughout Deep Learning (no matter the complexity of the model).

Supervised learning can be summarised as a model that produces a mathematical function, such that given an input the function produces an output.
<br>The output is referred to as the $\color{lightblue}inference\color{silver}^1$.

This function contains $\color{lightblue}parameters\color{silver}^2$, which affect the output for a given input.<br> As such, the model equation (this function) describes a family of possible relationships between inputs and outputs, where the parameters specify a particular relationship between an input instance and its output.

$\color{lightblue}training\color{silver}^3$ a model essentially means trying to find the parameters that describe the true relationship between the inputs and outputs. 

We train the model by trial and error over the training set: for each trial we measure the error and then adjust the parameters in the direction that we expect to reduce the error on the next trial (this is our hope/hypothesis).

After training a model, we assess its performance: we run the model on a separate test dataset to see how well it performs, or how well it $\color{lightblue}generalises$.

If the results are good enough then the model is ready for deployment (brought to the outside world).
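
The train-then-test workflow above can be sketched as follows. This is a minimal illustration, assuming synthetic noisy linear data and using NumPy's `np.polyfit` as a stand-in for the training step; the dataset, the 150/50 split, and the helper name `mse` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: a noisy line y = 3x + 0.5.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=200)

# Keep the last 50 points aside as a test set the model never trains on.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# "Training": fit a degree-1 polynomial using the training data only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return float(np.mean((slope * xs + intercept - ys) ** 2))

train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)   # measures how well the model generalises
print(train_mse, test_mse)
```

If the test error is close to the training error, the model is generalising rather than memorising the training set.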

#### Formalization

$\text{Let } \vec x \in \R^n \text{ be our input and } \vec y \in \R^m  \text{ be the output}$

$\text{To make a prediction we need a model } f[\cdot] \text{ which takes } x \text{ and returns } y$

$$ f[x]= y\color{silver}^1 $$

$\text{Since we need parameters to describe the relation then } f \text{ needs to accept these parameters } \phi\color{silver}^2 \text{ therefore:}$ 

$$ f[x, \phi] = y$$

$\text{To train the model we quantify the degree of mismatch between the inference and the true values with a loss } L[\phi] \text{ and seek the parameters that minimize it:}$

$$ \hat\phi\color{silver}^3 = \argmin_{\phi}\left[L[\phi]\right] = \argmin_{\phi}\left[L[\{x_i, y_i\}, \phi]\right]  $$

## 1D Linear Regression Model Example

This is a problem where we aim to find the linear function that best fits a set of points on a graph.

Assume our input and output are scalar values.

$\text{model: }$  $$y = f[x, \phi] = \phi_0 + \phi_1x$$

$\text{Parameters: }$  $$\phi = [\phi_0, \phi_1]$$

$\text{Loss Function: }$ $$L[\phi] = \sum_{i=1}^N(f[x_i, \phi] - y_i)^2 = \sum_{i=1}^N(\phi_0 + \phi_1x_i - y_i)^2$$


$\text{Our Goal: }$ $$\hat\phi = \argmin_{\phi} \left[\sum_{i=1}^N(\phi_0+\phi_1x_i - y_i)^2\right]$$
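
The model, loss, and goal above can be sketched in a few lines of plain Python. The data values and the helper names `f` and `loss` are illustrative assumptions, and the coarse grid search is only a crude stand-in for the $\argmin$, not how the minimum is found in practice.

```python
# Assumed toy data: points lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def f(x, phi):
    """1D linear model: phi[0] is the intercept, phi[1] is the slope."""
    return phi[0] + phi[1] * x

def loss(phi):
    """Least-squares loss: sum of squared differences over all points."""
    return sum((f(x, phi) - y) ** 2 for x, y in zip(xs, ys))

# Try every (phi_0, phi_1) on a grid from -4 to 4 in steps of 0.5
# and keep the pair with the smallest loss.
candidates = [(a / 2, b / 2) for a in range(-8, 9) for b in range(-8, 9)]
phi_hat = min(candidates, key=loss)

print(phi_hat, loss(phi_hat))  # (1.0, 2.0) 0.0
```

Each candidate pair corresponds to one point in the loss surface over the parameters; the search simply picks the lowest point it visited.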

The following link provides a visualization of the above concepts:
1. 1D linear model
2. Least Square loss Function
3. Loss function space with respect to parameters

https://udlbook.github.io/udlfigures/

## Extending Linear Regression to higher Dimension

Suppose we have d features $x = (x_1, x_2, ..., x_d)$

The general structure remains the same; from a mathematical and implementation standpoint, we now represent the model using vectors and vector operations.

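$\text{model: }$ $$y = f[x, \phi] = \phi_0 + \sum_{i=1}^dx_i\phi_i =  \phi^Tx$$

The last equality holds when $x$ is augmented with a constant first entry $x_0 = 1$, so that the bias $\phi_0$ is absorbed into the dot product.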

$\text{Parameters: }$ $$\phi = [\phi_0, \, \phi_1, \dots, \phi_d]$$

The same model can be applied to a whole batch of $N$ training instances at once by stacking them as the rows of a matrix:

$$
X = \begin{bmatrix}
— & x^1 & — \\
— & x^2 & — \\
— & x^3 & — \\
 & \vdots & \\
— & x^N & — 
\end{bmatrix}
$$

Note: It's easy to get confused since there's a lot of notation:
- superscript = instance
- subscript = feature
  
Where each $x^{i} \in \R^d$ is a row vector containing d values representing the d features. The inference would be: 

$$ 
f[X, \phi]  = X\begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_d \end{bmatrix} + \phi_0
\begin{bmatrix} 
1 \\ 1 \\ \vdots \\ 1 
\end{bmatrix}
$$

We also present what's happening to our dimensions:
$$
(N \times 1) = (N \times d) (d \times 1) + (N \times 1)
$$
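
The batched inference and its dimensions can be checked with a short NumPy sketch. The sizes `N = 5`, `d = 3` and the parameter values are arbitrary assumptions; the second half shows the equivalent form in which a column of ones is prepended to $X$ so the bias is absorbed into the parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                            # assumed sizes: 5 instances, 3 features
X = rng.normal(size=(N, d))            # each row is one instance x^i
w = np.array([[0.5], [-1.0], [2.0]])   # (d x 1) column holding phi_1..phi_d
phi_0 = 0.25                           # scalar bias

ones = np.ones((N, 1))
y_hat = X @ w + phi_0 * ones           # (N x d)(d x 1) + (N x 1) -> (N x 1)
print(y_hat.shape)                     # (5, 1)

# Equivalent form: absorb the bias by prepending a column of ones to X.
X_aug = np.hstack([ones, X])           # (N x (d+1))
phi = np.vstack([[phi_0], w])          # ((d+1) x 1)
y_hat_aug = X_aug @ phi

print(np.allclose(y_hat, y_hat_aug))   # True
```

The augmented form is what allows the whole model to be written as a single product $X\phi$.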

$\text{Loss Function: }$ $$L[\phi] = \frac{1}{2N}\sum_{i=1}^N(f[x^i, \phi] - y^i)^2  = \frac{1}{2N} \|f[X, \phi] - y\|^2$$

To Clarify: 
- We're summing over the number of **instances** N
- The division by 2 is a convenience which we'll see later (it rescales the loss but doesn't change which parameters minimize it)
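
A small NumPy sketch can confirm that the per-instance sum and the squared-norm form of the loss agree. The tiny dataset and parameter values are assumptions chosen so the numbers are easy to follow; the first column of `X` is a column of ones so the bias sits inside `phi`.

```python
import numpy as np

# Assumed toy data: 3 instances, bias column + 1 feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
phi = np.array([0.5, 1.5])     # arbitrary parameters to evaluate

N = len(y)

# Form 1: sum the squared error of each instance, then divide by 2N.
per_instance = [(X[i] @ phi - y[i]) ** 2 for i in range(N)]
loss_sum = sum(per_instance) / (2 * N)

# Form 2: squared Euclidean norm of the residual vector, divided by 2N.
loss_norm = np.linalg.norm(X @ phi - y) ** 2 / (2 * N)

print(loss_sum, loss_norm)
```

Here the residuals are $-0.5$, $0$ and $-0.5$, so both forms evaluate to $0.5 / 6$.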

## Analytically Solving Linear Regression
We mentioned earlier that: <br>"training the model is trying to find the parameters that describe the true relationship between the inputs and outputs."<br>
In theory we're done training when our inference matches the true results for every instance: $$(x^i)\phi - y^i = 0$$ <br>
Ideally we want the loss function to equal 0, which would mean the parameters fit the data exactly. In practice noisy data rarely allows this, so we look for the parameters that make the loss as small as possible.<br>

This means we're solving for $\phi$. We do this by setting the partial derivatives of the loss to zero.

The partial derivatives tell us how a change in the parameters affects the loss, which is exactly what shows us how to tune the parameters for better results.

At a minimum of the loss, a small **change** in any parameter no longer decreases the loss, so every partial derivative must vanish there:

$$\frac{\partial L }{\partial \phi} = \lim_{h \to 0} \frac{L(\phi + h) - L(\phi)}{h} = 0$$

and 

$$\frac{\partial L }{\partial \phi_0} = \lim_{h \to 0} \frac{L(\phi_0 + h) - L(\phi_0)}{h} = 0$$

Can be interpreted as:

"How does the loss **change** if we change the parameters?"

Or formally: 

"The partial derivative with respect to the parameters"

### Solution

$$
\begin{aligned}
\frac{\partial L}{\partial \phi} &= \frac{\partial }{\partial \phi} \frac{\|X\phi - y\|^2}{2N} \\
&= \frac{1}{2N} \frac{\partial}{\partial \phi} \|X\phi - y \|^2 \\
&= \frac{2}{2N} X^T (X\phi - y) \\
&= \frac{1}{N} X^T (X\phi - y) \\
&= 0 
\end{aligned}
$$
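
The gradient $\frac{1}{N} X^T (X\phi - y)$ from this derivation can be sanity-checked against the limit definition of the partial derivative by using a small but finite $h$. The sketch below assumes random data and the bias absorbed into $\phi$; the function names `loss`, `analytic_grad` and `numeric_grad` are illustrative.

```python
import numpy as np

def loss(X, y, phi):
    """L[phi] = ||X phi - y||^2 / (2N)."""
    r = X @ phi - y
    return (r @ r) / (2 * len(y))

def analytic_grad(X, y, phi):
    """Closed-form gradient: X^T (X phi - y) / N."""
    return X.T @ (X @ phi - y) / len(y)

def numeric_grad(X, y, phi, h=1e-6):
    """Forward-difference estimate: nudge one parameter at a time by h."""
    grad = np.zeros_like(phi)
    for j in range(len(phi)):
        step = np.zeros_like(phi)
        step[j] = h
        grad[j] = (loss(X, y, phi + step) - loss(X, y, phi)) / h
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # assumed random design matrix
y = rng.normal(size=20)
phi = rng.normal(size=3)

print(analytic_grad(X, y, phi))
print(numeric_grad(X, y, phi))
```

The two vectors should agree to several decimal places; the small gap comes from using a finite $h$ instead of the limit.
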
We now solve for the parameters $\phi$:
$$
\begin{aligned}
X^T(X \phi - y) &= 0 \\
X^T X \phi - X^T y &= 0 \\
X^T X \phi &= X^T y
\end{aligned}
$$

We now have two cases: $X^T X$ is invertible or it isn't.

If there exists an inverse for the matrix $(X^T X)$, denoted as $(X^T X)^{-1}$ (which can be verified via a non-zero determinant), we can solve for $\phi$:

$$
\begin{aligned}
(X^T X)^{-1} X^T X \phi &= (X^T X)^{-1} X^T y \\
I \phi &= (X^T X)^{-1} X^T y \\
\phi &= (X^T X)^{-1} X^T y
\end{aligned}
$$
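
A minimal NumPy sketch of this closed-form solution, assuming noise-free synthetic data generated from known parameters (`true_phi` is an assumption made so the result can be checked). It uses `np.linalg.solve` on the system $X^T X \phi = X^T y$ rather than forming the inverse explicitly, which gives the same $\phi$ when the inverse exists.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

# Bias column of ones followed by 2 random features.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_phi = np.array([1.0, -2.0, 0.5])   # assumed ground-truth parameters
y = X @ true_phi                        # noise-free targets

# Solve (X^T X) phi = X^T y for phi.
phi_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(phi_hat)
```

With noise-free data the recovered parameters match `true_phi` up to floating-point error.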


If the determinant of $(X^T X)$ is zero, the inverse does not exist. This typically occurs in two scenarios:
1. **Redundant Features**: Two or more features are perfectly correlated (Linearly Dependent).
2. **Data Sparsity**: There are more features than training samples ($d > N$).

In these cases, we use the **Moore-Penrose Pseudoinverse**, denoted as $(X^T X)^{+}$, which yields the minimum-norm "least-squares" solution even when the matrix is singular:

$$
\begin{aligned}
\phi &= (X^T X)^{+} X^T y \\
     &= X^{+} y
\end{aligned}
$$
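
The redundant-feature case can be illustrated with `np.linalg.pinv`. In this sketch the second feature is deliberately built as twice the first, so $X^T X$ is singular; the data and the explicit tolerances (`tol`, `rcond`) are assumptions chosen to keep the example numerically unambiguous.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])   # second feature = 2 * first (redundant)
y = 5 * x1                          # assumed targets

# Rank 1 instead of 2, so X^T X has no inverse.
rank = np.linalg.matrix_rank(X, tol=1e-10)

# phi = X^+ y: among all exact fits, the one with the smallest norm.
phi_hat = np.linalg.pinv(X, rcond=1e-10) @ y

print(rank)       # 1
print(phi_hat)    # approximately [1. 2.]
```

Any $\phi$ with $\phi_1 + 2\phi_2 = 5$ fits this data exactly; the pseudoinverse picks the shortest such vector.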

Alternatively, we apply **Regularization** (like Ridge Regression), which makes the matrix invertible by adding a small value $\lambda > 0$ to the diagonal:

$$
\phi = (X^T X + \lambda I)^{-1} X^T y
$$
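
A short sketch of the ridge formula on a design matrix with a duplicated (scaled) feature, where $X^T X$ alone is singular. The data and the value `lam = 1e-3` are assumptions for illustration only.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])   # redundant features: X^T X is singular
y = 5 * x1                          # assumed targets
lam = 1e-3                          # assumed regularization strength

# Adding lam to the diagonal makes the matrix invertible.
A = X.T @ X + lam * np.eye(X.shape[1])
phi_ridge = np.linalg.solve(A, X.T @ y)

print(phi_ridge)   # close to [1, 2], shrunk very slightly toward zero
```

A larger $\lambda$ shrinks the parameters further toward zero, trading a little fit for numerical stability.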

### Analytical Solution fails in Deep Learning

While $\phi = (X^T X)^{-1} X^T y$ is elegant, it is impractical for Deep Learning for three reasons:

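1. **Scalability**: Inverting a $(d \times d)$ matrix is $O(d^3)$ with standard methods. With $d > 10^6$, this is computationally prohibitive.
2. **Memory**: Storing the $(d \times d)$ matrix $(X^T X)$ requires $O(d^2)$ space. For $d = 10^6$ that is $10^{12}$ entries (about 4 TB in 32-bit floats), which exceeds the memory of a single modern GPU.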
3. **Non-Convexity**: Deep Neural Networks use non-linear activations, making the loss surface non-convex. There is no closed-form algebraic solution for:
   $$ \nabla_\phi L(\phi) = 0 $$

Therefore, we use a general method known as **Gradient Descent**, an iterative method for finding the optimal parameters.
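
To give a feel for the iterative idea, here is a minimal gradient descent sketch on the same linear regression loss, using the gradient $\frac{1}{N} X^T (X\phi - y)$ derived earlier. The synthetic data, the step size `lr = 0.1` and the 500 iterations are assumptions picked for this toy problem, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias + 2 features
true_phi = np.array([1.0, -2.0, 0.5])   # assumed ground-truth parameters
y = X @ true_phi

phi = np.zeros(3)   # start from an arbitrary guess
lr = 0.1            # assumed step size (learning rate)

for _ in range(500):
    grad = X.T @ (X @ phi - y) / N   # gradient of the loss at the current phi
    phi -= lr * grad                 # take a small step downhill

print(phi)
```

No matrix is inverted here: each step only needs matrix-vector products, which is what makes the approach scale.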

### Summary: The Transition to Iterative Methods

While the analytical solution is mathematically beautiful, it acts as a bottleneck for high-dimensional data. In modern Deep Learning, where we deal with millions of parameters and non-linear transformations, we shift from **solving** for the minimum to **searching** for it.

---

### ❓ Questions Remaining

> **1. Non-Linearity**
> The current model assumes $y$ is a linear combination of $X$. What if the underlying relationship is curved or complex? 
> * *Sneak peek: We will introduce **Activation Functions** and **Hidden Layers**.*

> **2. The Iterative Process**
> We've established that inverting the matrix is impractical at scale. How do we actually take those "small steps" toward the minimum without knowing the global solution?
> * *Sneak peek: We will dive into **Gradient Descent** and **Backpropagation**.*

> **3. Loss Function Justification**
> We used the Squared Error $\|X\phi - y\|^2$. Why this specific formula? Does a "good" loss function have mathematical properties that make optimization easier?
> * *Sneak peek: We will explore **Maximum Likelihood Estimation (MLE)** and **Information Theory**.*

# 1D Classification

This is a problem where we're provided with a set of instances, each described by a number of features and assigned a class. 
$$ \left\{ (x^{(i)}, c^{(i)}) \quad |\quad c^{(i)} \in \{1, \dots ,k\} \quad \forall  i \ \in \{1, \dots, N\} \right\}$$

A typical example is deciding whether an image from a set of dog and cat images shows a dog. <br>
Our input data is pixels and our output is a class label. <br>

### Objectives

1. We'll first cover the standard vanilla version of this problem, known as binary classification, where $k=2$
2. We'll go over a sequence of improving loss functions and explain about their properties
3. We'll then deal with $k > 2$ also known as Multiclass Classification and learn the typical loss functions for this case

### Binary Classification

**Data**: Our training set $$\{(x^{(i)}, c^{(i)})\}_{i=1}^N \quad | \quad c^{(i)} \in \{-1, +1\}$$

**Model**: A linear function of $x$ followed by a threshold

$$z = f[x, \phi] = \phi_0 + \sum_{i=1}^dx_i\phi_i =  \phi^Tx$$

$$y = g(z) = 
\begin{cases}
+1, & z \ge 0 \\
-1, & z < 0
\end{cases}
$$
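
The thresholded linear model can be sketched in plain Python. The function name `predict` and the parameter values are illustrative assumptions; `phi[0]` plays the role of the bias $\phi_0$ and the remaining entries are the feature weights.

```python
def predict(x, phi):
    """Return +1 if the linear score z is >= 0, otherwise -1."""
    z = phi[0] + sum(w * xi for w, xi in zip(phi[1:], x))
    return 1 if z >= 0 else -1

phi = [-1.0, 2.0, 0.5]             # assumed parameters: bias, then 2 weights

print(predict([1.0, 0.0], phi))    # z = 1.0  -> 1
print(predict([0.0, 0.0], phi))    # z = -1.0 -> -1
print(predict([0.5, 0.0], phi))    # z = 0.0  -> 1 (boundary counts as +1)
```

The set of inputs where $z = 0$ is the decision boundary separating the two classes.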