**CS596 - Machine Learning**
<br>
Date: **9 September 2020**


Title: **Lecture 3**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Bibliography:
<br>
[1] Bishop, Christopher M., *Pattern Recognition and Machine Learning*, Springer, 2006

<h1 align="center">Linear Regression (Part II)</h1>

<h3 align="center">ML Steps</h3>

Before we go into details, I would like to list the **steps** that an ML workflow goes through in the case of linear regression:

1. **Loading the Data**, i.e. $\mathbf{X}$ and $\mathbf{y}$.


2. **Exploring the Data**, i.e. perform **Exploratory Data Analysis (EDA)** for $\mathbf{X}$ and $\mathbf{y}$.


3. **Slicing the Data**, i.e. split the data into **Train**, **Validation** and **Test** sets.


4. **Choosing the Model**, i.e. identifying the form of the function $f(\mathbf{x}, \mathbf{w})$.

 
5. **Defining the Cost function**, i.e. the way you measure the distance $E$ between observed $y$ and predicted $\hat{y}$.


6. **Finding the Extrema** of the Cost function, i.e. optimal values $\hat{\theta} = \arg \min_{\theta} E(\theta)$.


7. **Evaluating the Accuracy**, i.e. run the model with optimal values of $\hat{\theta}$ on **Validation** set.


8. **Revisiting Hyperparameters**, i.e. **optimize the hyperparameters** of the model and **re-run steps 4-7**.


9. **Checking the Final Result**, i.e. run the obtained model on **Test** set and validate the accuracy.
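
Step 3 of the list above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `train_val_test_split`, the 60/20/20 proportions and the toy data are assumptions made for the example, not something prescribed by the lecture.

```python
import random


def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the row indices once, then cut them into three disjoint parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible

    n_test = int(len(idx) * test_frac)
    n_val = int(len(idx) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]

    def pick(ids):
        return [X[i] for i in ids], [y[i] for i in ids]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Toy data (hypothetical): ten one-dimensional inputs with a linear target.
X = [[float(i)] for i in range(10)]
y = [2.0 * row[0] + 1.0 for row in X]

(train_X, train_y), (val_X, val_y), (test_X, test_y) = train_val_test_split(X, y)
print(len(train_X), len(val_X), len(test_X))
```

The test part is set aside first and is only touched in step 9, so the hyperparameter search in steps 7 and 8 never sees it.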

<h3 align="center">Linear Basis Function Models</h3>

- The simplest linear model for regression is one that involves a **linear combination** of the input variables:

  $$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i = w_0 + w_1 x_1 + ... + w_D x_D,$$

  where $\mathbf{x} = [x_1, ..., x_D]^{\mathbf{T}}$. 


- The **key property** of this model is that it is a **linear function** of the parameters $w_0, ..., w_D$.


- We can **extend** the class of models by considering **linear combinations of fixed nonlinear functions** of the input variables, of the form:

  $$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i \phi_i(\mathbf{x}) = \mathbf{w}^{\mathbf{T}} \mathbf{\phi}(\mathbf{x}),$$

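  where $\phi_i(\mathbf{x})$ are known as **basis functions**, and an additional dummy basis function $\phi_0(\mathbf{x}) = 1$ absorbs the bias parameter $w_0$ into the vector notation: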
  
  $$\mathbf{w} = 
\begin{bmatrix}
w_0 \\ \vdots \\w_{D}
\end{bmatrix}
 \text{ and }
\mathbf{\phi} = 
\begin{bmatrix}
\phi_0 \\ \vdots \\ \phi_{D}
\end{bmatrix}.
$$

<h3 align="center">Fixed Basis Functions</h3>

- All of the algorithms are **equally applicable** if we first make a **fixed nonlinear transformation** of the inputs using a vector of basis functions $\phi(\mathbf{x})$.


- **Suitable choices** of nonlinearity can **make the process** of modelling **easier**.

<img src="images/L3_Fixed_Basis_Functions.png" width="800" alt="Example" />



<h3 align="center">Examples of Basis Functions</h3>

There are many possible choices of basis functions:


- **Polynomial** basis function: 
  
  $$\phi_i(x) = x^i.$$


- **Gaussian** basis function: 
  
  $$\phi_i(x) = \textrm{exp} \left (-\frac{(x-\mu_i)^2}{2s^2} \right ).$$
 
 
- **Sigmoidal** basis function: 
  
  $$\phi_i(x) = \sigma \left ( \frac{x-\mu_i}{s} \right ),$$
  
  where $\sigma(a)$ is the **Logistic Sigmoid Function** defined by:
    
  $$\sigma(a) = \frac{1}{1 + e^{-a}}.$$


- **Fourier** basis function:  $\phi_0(x) = 1$, $\phi_{2k-1}(x) = \cos{k x}$ and $\phi_{2k}(x) = \sin{k x}$ for $k = 1, 2, ...$

  In many signal processing applications, it is of interest to consider basis functions that are localized in both space and frequency, leading to a class of functions known as **wavelets**.


- Figure showing **Polynomials** (on the left), **Gaussians** (in the centre), and **Sigmoidal** (on the right) basis functions.

<img src="images/L3_Basis_Functions.png" width="1800" alt="Example" />
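
The three localized families above translate directly into code. The sketch below is only an illustration: the helper names (`feature_vector`, `predict`), the choice of Gaussian centres and the weights are assumptions made for the example.

```python
import math


def polynomial_basis(x, i):
    """phi_i(x) = x^i."""
    return x ** i


def gaussian_basis(x, mu, s):
    """phi_i(x) = exp(-(x - mu_i)^2 / (2 s^2))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * s ** 2))


def logistic_sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)); fine for moderate |a|."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoidal_basis(x, mu, s):
    """phi_i(x) = sigma((x - mu_i) / s)."""
    return logistic_sigmoid((x - mu) / s)


def feature_vector(x, centres, s):
    """phi(x) with the dummy function phi_0(x) = 1 followed by Gaussian bumps."""
    return [1.0] + [gaussian_basis(x, mu, s) for mu in centres]


def predict(x, w, centres, s):
    """y(x, w) = w^T phi(x): linear in w, nonlinear in x."""
    phi = feature_vector(x, centres, s)
    return sum(w_i * phi_i for w_i, phi_i in zip(w, phi))


# Hypothetical weights and centres, chosen only to exercise the functions.
print(predict(0.0, w=[1.0, 2.0, 0.0], centres=[0.0, 1.0], s=1.0))
```

Swapping `gaussian_basis` for `polynomial_basis` or `sigmoidal_basis` inside `feature_vector` changes the model family without changing `predict`, which is the point of keeping the model linear in $\mathbf{w}$.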




<h3 align="center">The Gaussian Distribution</h3>

- In the case of a **single variable** $x$, the **Gaussian distribution**, also known as the **Normal distribution**, can be written in the form:

  $$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{{\left ( 2\pi \sigma^2 \right )}^{1/2}}\textrm{exp}\left\{ -\frac{1}{2\sigma^2} \left( x-\mu\right)^2\right\},$$

  where $\mu$ is the **mean** and $\sigma^2$ is the **variance**.


- For a $D$-**dimensional vector** $\textbf{x}$, the **multivariate Gaussian distribution** takes the form:

  $$\mathcal{N}(\textbf{x} | \mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{{\left ( 2\pi\right )}^{D/2}} \frac{1}{{\left | \Sigma \right |}^{1/2}} \textrm{exp}\left\{ -\frac{1}{2}\left( \textbf{x} - \mu \right)^{\textbf{T}} \Sigma^{-1} \left( \textbf{x} - \mu \right)\right\},$$

  where $\mu$ is a $D$-dimensional **mean** vector, $\mathbf{\Sigma}$ is a $D \times D$ **covariance matrix**, and $|\mathbf{\Sigma}|$ denotes the **determinant** of $\mathbf{\Sigma}$.
  
  
  <center><img src="images/L3_Normal.svg" width="700" alt="Example" /></center>
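
As a quick sanity check of the single-variable formula, the density can be evaluated and integrated numerically. This is a rough sketch: the function name `normal_pdf`, the integration range $[-8, 8]$ and the step size are assumptions made for the illustration.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density N(x | mu, sigma^2) of the univariate Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Riemann sum over [-8, 8] for the standard normal; the tails beyond are negligible.
step = 0.001
area = sum(normal_pdf(-8.0 + k * step, 0.0, 1.0) for k in range(16001)) * step

peak = normal_pdf(0.0, 0.0, 1.0)  # equals 1 / sqrt(2 * pi)
print(area, peak)
```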

<h3 align="center">Sigmoid Function</h3>

- **Historical reference**. 

  - A **Sigmoid function** is a mathematical function whose name comes from the Greek letter **Sigma**.
  
  - A **Sigmoid function** has a characteristic **$S$-shaped curve** or **Sigmoid curve**.
  
  - The **logistic function** is one type of **Sigmoid function**; the term refers more generally to any function that retains the $S$-shape. 
  
  - Traditionally, a sigmoidal function takes values between $0$ and $1$.


- **Sigmoid functions** are frequently used in **Machine Learning**, specifically in **Artificial Neural Networks (NN)**, as an activation function that determines the output of a node or **Neuron**. 
  
  
  <center><img src="images/L3_Sigmoid_Function.png" width="700" alt="Example" /></center>


- **Sigmoid functions** have several useful properties:

  1. $\sigma(x)$ is **between 0 and 1**, i.e. $0 < \sigma(x) < 1$ for any $x \in (-\infty , +\infty )$;

  2. $\sigma(x)$ is **not linear**, i.e. $\sigma \left ( \alpha x + \beta y \right ) \neq \alpha \sigma \left ( x \right ) + \beta \sigma \left ( y \right )$;

  3. $\sigma(x)$ is **monotonic**, i.e. if $x < y$ then $\sigma(x) < \sigma(y)$;

  4. The **derivative** ${\sigma}'(x)= \frac{d}{dx}\sigma(x)$ is **continuous** and it **can be expressed through** $\sigma(x)$.


- Several **Sigmoid** functions with their **derivatives** are listed below:

  - **Logistic** function:
    
    $$f(x) = \frac{1}{1+ e^{-x}} \text { and } \frac{df}{dx}(x) = f(x)(1-f(x)).$$
    
  - **Hyperbolic tangent** function:
  
    $$f(x) = \tanh x = \frac{e^x - e^{-x}}{e^x+ e^{-x}} \text{ and } \frac{df}{dx}(x) = 1-f(x)^2.$$
    
  - **Arctangent** function:
  
    $$f(x) = \arctan x  \text{ and } \frac{df}{dx}(x) = \frac{1}{1+x^2}.$$
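
The closed-form derivatives can be checked against a finite-difference approximation. The sketch below uses $f(1-f)$ for the logistic function, $1 - \tanh^2 x$ for the hyperbolic tangent and $1/(1+x^2)$ for the arctangent; the helper `central_difference`, the step `h` and the test points are assumptions made for the illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_logistic(x):
    f = logistic(x)
    return f * (1.0 - f)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def d_arctan(x):
    return 1.0 / (1.0 + x * x)


def central_difference(f, x, h=1e-5):
    """Numerical derivative (f(x + h) - f(x - h)) / (2 h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-2.0, -0.5, 0.0, 0.5, 2.0]
candidates = {
    "logistic": (logistic, d_logistic),
    "tanh": (math.tanh, d_tanh),
    "arctan": (math.atan, d_arctan),
}

# Largest gap between the analytic and the numerical derivative per function.
max_errors = {
    name: max(abs(central_difference(f, x) - df(x)) for x in points)
    for name, (f, df) in candidates.items()
}
for name, err in max_errors.items():
    print(name, err)
```

The errors should be at the level of the finite-difference approximation itself, which is a cheap way to catch a wrong derivative before it goes into a gradient-based optimizer.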


  
  


<h3 align="center">Fourier Transform</h3>

- The **Fourier Transform** is used in ML in a wide range of applications, such as **image analysis**, **image filtering**, **image reconstruction** and **image compression**.

 
  <center><img src="images/L3_Fourier_Transform.gif" width="500" alt="Example" /></center>


<h3 align="center">Maximum Likelihood and Least Square Methods</h3>

- Let's show that the **least squares approach** can be motivated as the **maximum likelihood solution** under an assumed **Gaussian noise** model.


- Let's assume that for **some input value** $\mathbf{x}$ the **target variable** $t$ is given by a deterministic function with additive **Gaussian noise**, so that:

  $$t = y( \mathbf{x}, \mathbf{w}) + \varepsilon,$$
  
  where $\varepsilon$ is a **zero mean Gaussian random variable** with precision (inverse variance) $\beta = \frac{1}{\sigma^2}$.


- Thus we can write the **likelihood function** of the **target variable** $t$ given $\mathbf{x}$, $\mathbf{w}$ and $\beta$:

  $$p(t | \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N} (t | y( \mathbf{x}, \mathbf{w}), \beta^{-1}).$$



- Now let's consider a **data set of inputs** $\mathbf{X} = \{\mathbf{x_1}, ..., \mathbf{x_N}\}$ with corresponding **target values** $\mathbf{t} = \{t_1, ..., t_N\}$.

  Making the assumption that these data points are **drawn independently** we obtain the following expression for the **likelihood function**:

$$p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N} (t_n | \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n}), \beta^{-1}).$$


- Let's rewrite the **likelihood function** using the **standard form** for the **Normal distribution** with variance $\sigma^2 = \beta^{-1}$:

  $$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{{\left ( 2\pi \sigma^2 \right )}^{1/2}}\textrm{exp}\left\{ -\frac{1}{2\sigma^2} \left( x-\mu\right)^2\right\} \Rightarrow $$
  
  $$\mathcal{N}(t_n | \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n}), \beta^{-1}) = 
    \left ( \frac{\beta}{2\pi} \right )^{1/2}\textrm{exp}\left\{ -\frac{\beta}{2} \left( t_n - \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n}) \right)^2\right\}.$$

  Taking the product over all $N$ data points gives:

  $$p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = 
    \left ( \frac{\beta}{2\pi} \right )^{N/2}\textrm{exp}\left\{ -\beta E_D(\mathbf{w}) \right\},$$
    
  where the **sum-of-squares error** function is defined by:

  $$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N}\left\{  t_n - \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n})\right\}^2.$$

- Taking the **logarithm of the likelihood function**, we have:

  $$\ln p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = 
\sum_{n=1}^{N} \ln \mathcal{N} (t_n | \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n}), \beta^{-1}) = 
\frac{N}{2} \ln \beta  - \frac{N}{2} \ln (2\pi)  - \beta E_D(\mathbf{w}),$$
 

- We see that **maximization of the likelihood** function **under a conditional Gaussian noise** distribution for a linear model is equivalent to **minimizing a sum-of-squares error** function given by $E_D(\mathbf{w})$.
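
The equivalence can be checked numerically: summing the Gaussian log densities term by term should give the same value as $\frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi) - \beta E_D(\mathbf{w})$. The sketch below is a minimal illustration; the tiny design matrix, the targets, the two weight vectors and the value of $\beta$ are made-up assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sum_of_squares_error(w, Phi, t):
    """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2."""
    return 0.5 * sum((t_n - dot(w, phi_n)) ** 2 for phi_n, t_n in zip(Phi, t))


def log_likelihood_direct(w, Phi, t, beta):
    """Sum of ln N(t_n | w^T phi(x_n), 1/beta), evaluated term by term."""
    variance = 1.0 / beta
    total = 0.0
    for phi_n, t_n in zip(Phi, t):
        residual = t_n - dot(w, phi_n)
        total += -0.5 * math.log(2.0 * math.pi * variance) - residual ** 2 / (2.0 * variance)
    return total


def log_likelihood_closed_form(w, Phi, t, beta):
    """N/2 ln(beta) - N/2 ln(2 pi) - beta * E_D(w)."""
    n = len(t)
    return (0.5 * n * math.log(beta)
            - 0.5 * n * math.log(2.0 * math.pi)
            - beta * sum_of_squares_error(w, Phi, t))


# Hypothetical data: phi(x) = [1, x] for x = 0, 1, 2, 3.
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [1.1, 2.9, 5.2, 6.8]
beta = 4.0
w_good, w_bad = [1.0, 2.0], [0.0, 1.0]

for w in (w_good, w_bad):
    print(sum_of_squares_error(w, Phi, t),
          log_likelihood_direct(w, Phi, t, beta),
          log_likelihood_closed_form(w, Phi, t, beta))
```

For a fixed $\beta$, the weight vector with the smaller $E_D(\mathbf{w})$ always has the larger log likelihood, because the first two terms do not depend on $\mathbf{w}$.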

<h3 align="center">Finding the Extrema of the Cost Function</h3>

- To **find the extrema** of the function we need to **calculate the gradient** of this function and **set it to zero**!


- The **gradient** of the **sum-of-squares error** function $E_D(\mathbf{w})$ takes the form:

  $$\nabla E_D(\mathbf{w}) = 
- \sum_{n=1}^{N}\left\{  t_n - \mathbf{w}^{\mathbf{T}}\mathbf{\phi}(\mathbf{x_n})\right\} \mathbf{\phi}(\mathbf{x_n})^{\mathbf{T}}.$$

- **Setting** the **gradient** to **zero** gives:

  $$ 0 = \sum_{n=1}^{N} t_n \mathbf{\phi}(\mathbf{x_n})^{\mathbf{T}} - \mathbf{w}^{\mathbf{T}} \left ( \sum_{n=1}^{N} \mathbf{\phi}(\mathbf{x_n})\mathbf{\phi}(\mathbf{x_n})^{\mathbf{T}} \right ).$$


- Solving for $\mathbf{w}$ we obtain:

  $$\mathbf{w}_{ML}  = \left ( \Phi^{\mathbf{T}} \Phi \right )^{-1} \Phi^{\mathbf{T}} \mathbf{t} = \Phi^{\dagger}  \mathbf{t}$$

  which are known as the **normal equations** for the **least squares problem**.


- Here $\Phi$ is an $N \times M$ matrix, where $M$ is the number of basis functions (including $\phi_0$), called the **design matrix**, whose elements are given by $\Phi_{ni} = \phi_i(\mathbf{x_n})$, so that:

  $$\Phi = \begin{bmatrix}
 \phi_0(\mathbf{x_1}) & \phi_1(\mathbf{x_1}) & \cdots & \phi_{M-1}(\mathbf{x_1})  \\ 
 \phi_0(\mathbf{x_2}) & \phi_1(\mathbf{x_2}) & \cdots & \phi_{M-1}(\mathbf{x_2})  \\ 
\vdots  & \vdots  & \ddots  & \vdots \\ 
\phi_0(\mathbf{x_N}) & \phi_1(\mathbf{x_N}) & \cdots & \phi_{M-1}(\mathbf{x_N})
\end{bmatrix}.$$

  And the quantity:

  $$\Phi^{\dagger} \equiv \left ( \Phi^{\mathbf{T}} \Phi \right )^{-1} \Phi^{\mathbf{T}}$$

  is known as the **Moore-Penrose pseudo-inverse** of the matrix $\Phi$.
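
A minimal sketch of the normal-equation solution follows. To stay self-contained it uses only the standard library and solves $\Phi^{\mathbf{T}} \Phi \, \mathbf{w} = \Phi^{\mathbf{T}} \mathbf{t}$ by Gaussian elimination instead of forming the inverse explicitly; in practice a linear algebra library would normally be used. The helper names, the polynomial basis and the noise-free toy data are assumptions made for the illustration.

```python
def design_matrix(xs, degree):
    """Rows [x^0, x^1, ..., x^degree], i.e. Phi_{nj} = phi_j(x_n) for a polynomial basis."""
    return [[x ** j for j in range(degree + 1)] for x in xs]


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def fit_least_squares(Phi, t):
    """Solve the normal equations (Phi^T Phi) w = Phi^T t."""
    m = len(Phi[0])
    gram = [[sum(row[i] * row[j] for row in Phi) for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * t_n for row, t_n in zip(Phi, t)) for i in range(m)]
    return solve_linear_system(gram, rhs)


# Noise-free samples of a quadratic, so the fit should recover its coefficients.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
t = [4.0 * x ** 2 + 3.0 * x + 2.0 for x in xs]
w_ml = fit_least_squares(design_matrix(xs, degree=2), t)
print([round(w, 6) for w in w_ml])
```

Solving the linear system rather than inverting $\Phi^{\mathbf{T}} \Phi$ gives the same $\mathbf{w}_{ML}$ and is generally the numerically safer choice.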








<h3 align="center">Multiple Outputs</h3>

- So far, we have considered the case of a **single target variable** $t$.


- In some applications, we may wish to predict $K > 1$ target variables.


- This could be done by introducing a **different set of basis functions** for each component of $\mathbf{t}$, leading to **multiple, independent regression problems** $\rightarrow$ **Not interesting**.


- A more **common approach** is to **use** the **same set of basis functions** for all components of the target vector:

  $$\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathbf{T}}\phi(\mathbf{x}),$$

  where $\mathbf{y}$ is a $K$-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of parameters,
and $\phi(\mathbf{x})$ is an $M$-dimensional column vector with elements $\phi_j(\mathbf{x})$.


- Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form:

  $$p(\mathbf{t} | \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N} \left ( \mathbf{t} | \mathbf{W}^{\mathbf{T}} \phi(\mathbf{x}), \beta^{-1}\mathbf{I} \right ).$$


- If we have a set of observations $\{\mathbf{t_1}, ..., \mathbf{t_N}\}$, we can combine these into a matrix $\mathbf{T}$ of size $N \times K$.

  Similarly, we can combine the input vectors $\{\mathbf{x_1}, ..., \mathbf{x_N}\}$ into a matrix $\mathbf{X}$ of size $N \times D$, where $D$ is the dimension of each input vector.


- The **log likelihood function** is then given by:

  $$\ln p(\mathbf{T} | \mathbf{X}, \mathbf{W}, \beta) = 
\sum_{n=1}^{N} \ln \mathcal{N} \left ( \mathbf{t_n} | \mathbf{W}^{\mathbf{T}} \phi(\mathbf{x_n}), \beta^{-1}\mathbf{I}\right ) = \frac{NK}{2} \ln \left( \frac{\beta}{ 2\pi}\right)  - \frac{\beta}{2} \sum_{n=1}^{N} \left \| \mathbf{t_n} - \mathbf{W}^{\mathbf{T}} \phi(\mathbf{x_n}) \right \|^2.$$


- As before, we can maximize this function with respect to $\mathbf{W}$, giving:

  $$\mathbf{W}_{ML} = \left ( \Phi^{\mathbf{T}} \Phi \right )^{-1} \Phi^{\mathbf{T}} \mathbf{T}.$$


- If we examine this result for each target variable $t_k$, we have:

  $$\mathbf{w}_k = \left ( \Phi^{\mathbf{T}} \Phi \right )^{-1} \Phi^{\mathbf{T}} \mathbf{t}_k = \Phi^{\dagger} \mathbf{t}_k.$$

- Thus the solution to the regression problem **decouples** between the different target variables, and **we need only compute a single pseudo-inverse matrix** $\Phi^{\dagger}$, which is shared by all of the vectors $\mathbf{w}_k$.



<h3 align="center">Batch Techniques and Sequential Learning</h3>

- **Batch techniques** involve processing the entire training set in **one go**.<br>

  For example, **Normal Equation solution** for **sum-of-squares error** function. 
  
  
- **Sequential algorithms** consider the data points one at a time, and the model parameters are updated after each such presentation.

  For example, **Stochastic Gradient Descent (SGD)**:
  
  $$\mathbf{w}^{(\tau + 1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n, $$
  
  where $\tau$ denotes the **iteration number**, $\eta$ is a **learning rate parameter**, and $E_n$ is the contribution of the $n$-th data point to the total error, $E = \sum_n E_n$.

  <center><img src="images/L3_Learning_Rate.png" width="500" alt="Example" /></center>


- In the case of the **sum-of-squares error** function, this gives:

  $$\mathbf{w}^{(\tau + 1)} = \mathbf{w}^{(\tau)} + \eta (t_n  - {\mathbf{w}^{(\tau)}}^{\mathbf{T}}\phi(\mathbf{x_n}))\phi(\mathbf{x_n}),$$
  
  which is known as **least-mean-squares** or the **LMS algorithm**.
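
The LMS update can be sketched as a short loop. This is an illustration under assumptions: the function name `lms_fit`, the learning rate $\eta = 0.5$, the number of passes over the data and the noise-free toy data are all chosen for the example, and the learning rate in particular would need tuning on real data.

```python
import random


def lms_fit(Phi, t, eta=0.5, epochs=200, seed=0):
    """Least-mean-squares: one weight update per presented data point."""
    rng = random.Random(seed)
    w = [0.0] * len(Phi[0])
    order = list(range(len(Phi)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the points in a fresh random order each pass
        for n in order:
            phi_n = Phi[n]
            error = t[n] - sum(w_i * p_i for w_i, p_i in zip(w, phi_n))
            # w <- w + eta * (t_n - w^T phi_n) * phi_n
            w = [w_i + eta * error * p_i for w_i, p_i in zip(w, phi_n)]
    return w


# Noise-free targets t = 1 + 2 x with basis phi(x) = [1, x] on [0, 1].
xs = [k / 10.0 for k in range(11)]
Phi = [[1.0, x] for x in xs]
t = [1.0 + 2.0 * x for x in xs]

w_lms = lms_fit(Phi, t)
print([round(w_i, 4) for w_i in w_lms])
```

Because the targets here are exactly linear in the basis functions, the iterates approach the same weights that the batch solution would give; with noisy data and a fixed $\eta$ they would instead keep fluctuating around it.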
  
  

<h3 align="center">Evaluating the Accuracy</h3>

- As we discussed before, **Bias** and **Variance** are very useful in **evaluating the accuracy** of our fitted model.


- **Bias** of an estimator is the **expected difference** between its **estimates** and the **true values** in the data:

$$\operatorname{Bias}(\hat{f}(x_0))=f(x_0)-\mathbf{E}\left [\hat{f}(x_0) \right ]$$

- **Variance** of an estimator is defined as the **expected value** of the **squared difference** between the **estimate** of a model and the **expected value** of the **estimate**:

$$\operatorname{Var}(\hat{f}(x_0))=\mathbf{E}\left [ \left(\hat{f}(x_0)-\mathbf{E} \left[\hat{f}(x_0)\right] \right)^2 \right]$$

<h3 align="center">Example</h3>

- Let's consider the dataset generated by the **true function** $f(x) = 4x^2 + 3x + 2$.


- On the **left side** plot you can see the result of fitting this function with **four different models**.

- On the **right side** the visualization of the **bias** and **variance** is shown.

- The true value is calculated at $x_0 = 5$.

<center><img src="images/L3_Polynomials.png" width="1000" alt="Example" /></center>
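
Bias and variance at $x_0 = 5$ can be estimated by simulation: draw many training sets from the true function, refit the model on each, and look at the spread of the predictions at $x_0$. The sketch below is only an illustration and does not reproduce the figure: the two models (a constant and a straight line), the input range $[0, 6]$, the noise level and the number of data sets are all assumptions.

```python
import random
import statistics


def true_f(x):
    return 4.0 * x ** 2 + 3.0 * x + 2.0


def fit_constant(xs, ts):
    """Degree-0 model: always predict the mean target."""
    mean_t = statistics.fmean(ts)
    return lambda x: mean_t


def fit_line(xs, ts):
    """Degree-1 model fitted by ordinary least squares."""
    mx, mt = statistics.fmean(xs), statistics.fmean(ts)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxt = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    slope = sxt / sxx
    return lambda x: mt + slope * (x - mx)


def bias_and_variance(fit, x0, n_datasets=500, n_points=20, noise_sd=5.0, seed=0):
    """Monte Carlo estimates of bias, variance and squared error of f_hat(x0)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_datasets):
        xs = [rng.uniform(0.0, 6.0) for _ in range(n_points)]
        ts = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(fit(xs, ts)(x0))
    mean_pred = statistics.fmean(preds)
    bias = true_f(x0) - mean_pred
    variance = statistics.fmean([(p - mean_pred) ** 2 for p in preds])
    mse = statistics.fmean([(true_f(x0) - p) ** 2 for p in preds])
    return bias, variance, mse


bias_c, var_c, mse_c = bias_and_variance(fit_constant, x0=5.0)
bias_l, var_l, mse_l = bias_and_variance(fit_line, x0=5.0)
print("constant:", bias_c, var_c, mse_c)
print("line:    ", bias_l, var_l, mse_l)
```

In this setup the squared error against the noise-free value $f(x_0)$ equals the squared bias plus the variance, which is a convenient check that the two quantities were computed consistently.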



<h3 align="center">Another Example</h3>


<center><img src="images/L3_Number_of_Degree.gif" width="800" alt="Example" /></center>

<center><img src="images/L3_Generated_Samples.gif" width="800" alt="Example" /></center>



- As you can see, the **degree of the polynomial**, i.e. the **number of features**, is a **hyperparameter**.


- **Questions**:
  - How to choose the hyperparameter correctly?  $\rightarrow$ **Bias-Variance Trade-Off**
  - Is there any other way to solve the problem of **overfitting**? $\rightarrow$ **Regularization**


<h3 align="center">Bias–Variance Decomposition</h3>

- **Total expected error** at a point $x_0$, where the observed target is $y_0 = f(x_0) + \varepsilon$ with zero-mean noise $\varepsilon$ of variance $\sigma^2$, is defined as follows:

$$\mathbf{E} \left [ \left ( y_0 - \hat{f}(x_0)\right )^2\right ].$$

- **Whichever model**  $\hat{f}$ we choose, its expected error can be further **decomposed** as:

$$\mathbf{E} \left [\left(y_0 - \hat{f}(x_0)\right)^2\right]
= \left(\operatorname{Bias}\left[\hat{f}(x_0)\right] \right) ^2 + \operatorname{Var}\left[\hat{f}(x_0)\right] + \sigma^2,$$

&emsp; &emsp; &ensp;where $\sigma^2$ is an **irreducible error** which we can't get rid of.


<h3 align="center">Bias-Variance Trade-Off</h3>

- Usually in **Machine Learning** there is a **trade-off** between a model's **Bias** and **Variance**.

- As we gradually **grow the model's capacity**, we gradually **reduce bias** and **increase variance**.

- The **goal** is to **find a sweet spot** where both bias and variance are acceptable.

<center><img src="images/L3_Bias_vs_Variance.png" width="500" alt="Example" /></center>

<h3 align="center">Regularization</h3>

- **Regularization** is a technique used to **address overfitting**.


- **Regularization** can significantly **reduce the variance** of the model **without** a substantial **increase in its bias**.


- The **main idea** of regularization is to **keep all the features**, but **reduce the magnitude of the parameters** $\boldsymbol{\theta}$.


- A **regularization term** (or **regularizer**) $R(f)$ is added to a **loss function**:

  $$E_{reg}(\boldsymbol{\theta}) = E(\boldsymbol{\theta}) + R(f).$$


- There are two main forms of regularization:
  - **Ridge regularization**, or **Tikhonov regularization**, uses the squared $L_2$ **norm** of the parameter vector (written $\mathbf{w}$ here, playing the role of $\boldsymbol{\theta}$):

    $$R(f) = \lambda \cdot \| \mathbf{w} \|^2_2 = \lambda \cdot \mathbf{w}^T\mathbf{w} = \lambda \cdot \sum_{i=1}^n w_i^2$$
  
  - **Lasso regularization** uses the $L_1$ **norm** of the vector $\mathbf{w}$:

    $$R(f) = \lambda \cdot \| \mathbf{w} \|_1 = \lambda \cdot \sum_{i=1}^n |w_i|$$

  - $\lambda$ is a **hyperparameter**, called **regularization parameter**.


- Plot of the contours of the **unregularized error function (blue)** along with the **constraint region** for the **Ridge regularizer** $q = 2$ on the **left** and the **Lasso regularizer** $q = 1$ on the **right**, in which the **optimum value** for the parameter **vector $\mathbf{w}$** is **denoted by $\mathbf{w}^{\star}$**.

<center><img src="images/L3_Regularization_2.png" width="800" alt="Example" /></center>



- A more general regularizer is sometimes used, for which the **regularization term** takes the form:

  $$R(f) = \lambda \cdot \sum_{i=1}^n  | w_i |^q,$$

  where $q = 2$ corresponds to **Ridge** and $q = 1$ to **Lasso**.


- Contours of the **regularization term** for various values of the **parameter $q$** are shown below:

<center><img src="images/L3_Regularization.png" width="800" alt="Example" /></center>





- The results of **fitting** a **polynomial function of $10$-th degree** to the **true function** $f(x) = \sin 2\pi x$ with **regularization**, for different values of the **regularization parameter** $\lambda$, are presented below:

<center><img src="images/L3_Learning_Rate.gif" width="800" alt="Example" /></center>
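
The effect of the Ridge term on the weights can be sketched in closed form. Minimizing $\frac{1}{2} \sum_{n} \left( t_n - \mathbf{w}^{\mathbf{T}}\phi(x_n) \right)^2 + \lambda \mathbf{w}^{\mathbf{T}}\mathbf{w}$ and setting the gradient to zero gives $\left( \Phi^{\mathbf{T}} \Phi + 2\lambda \mathbf{I} \right) \mathbf{w} = \Phi^{\mathbf{T}} \mathbf{t}$. The example below is an illustration under assumptions, not a reproduction of the animation: it uses a degree-5 polynomial (to keep the pure-Python solver well conditioned), ten noisy samples of $\sin 2\pi x$, a made-up noise level, and it penalizes every weight including $w_0$.

```python
import math
import random


def design_matrix(xs, degree):
    """Polynomial design matrix with rows [1, x, ..., x^degree]."""
    return [[x ** j for j in range(degree + 1)] for x in xs]


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def fit_ridge(Phi, t, lam):
    """Solve (Phi^T Phi + 2 * lam * I) w = Phi^T t."""
    m = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) + (2.0 * lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * t_n for row, t_n in zip(Phi, t)) for i in range(m)]
    return solve_linear_system(A, rhs)


rng = random.Random(0)
xs = [k / 9.0 for k in range(10)]
t = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.1) for x in xs]
Phi = design_matrix(xs, degree=5)

norms = {}
for lam in (1e-6, 1e-3, 1.0):
    w = fit_ridge(Phi, t, lam)
    norms[lam] = math.sqrt(sum(w_i * w_i for w_i in w))
    print(lam, norms[lam])
```

The printed norms shrink as $\lambda$ grows, which is the "reduce the magnitude of the parameters" behaviour described above; choosing $\lambda$ itself is a job for the validation set.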


<h1 align="center">End of Lecture</h1>