**Artificial Intelligence (CS550)**
<br>
Date: **24 February 2020**
<br>
Location: **SU, NEW STEM building**
<br>
Room: **302**

Title: **Lecture 6: Linear Regression**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Bibliography: 
<br>
[1] Ian Goodfellow and Yoshua Bengio and Aaron Courville, *Deep Learning*, MIT Press, 2016.

<h1 align="center">Linear Regression</h1>

<h3 align="center">Regression Model</h3>

$\bullet$ In practice, researchers:
<br> &emsp; $\bullet$ **select a model** they would like to estimate,
<br> &emsp; $\bullet$ **use a chosen method** to estimate the parameters of that model.
 
$\bullet$ Regression models involve the following components:
<br> &emsp; $\bullet$ The **unknown parameter**s, often denoted as a scalar or vector $\boldsymbol{\theta}$.
<br> &emsp; $\bullet$ The **independent variables**, which are observed in data and are often denoted as a vector $X_i$.
<br> &emsp; $\bullet$ The **dependent variable**, which is observed in data and is often denoted as a scalar $y_{i}$.
<br> &emsp; $\bullet$ The **error term**s, which are not directly observed in data and are often denoted as a scalar $\varepsilon_i$.

$\bullet$ Regression models propose that $y_{i}$ is a function of $X_{i}$ and $\boldsymbol{\theta}$, with $\varepsilon_i$ representing an **additive error term**:

$$y_i = f(X_i, \boldsymbol{\theta}) + \varepsilon_i.$$

$\bullet$ The researchers' **goal** is to **estimate the function** $f(X_i, \boldsymbol{\theta})$ that most closely fits the data.

<h3 align="center">Multivariate Linear Regression Problem</h3>


$\bullet$ The **linear regression** model assumes that the relationship between the **dependent variable** $y_i$ and the 
<br> &ensp; **independent variables** (**regressors**) $X_i$ is **linear**, i.e. the **regression function** takes the form:

$$f(X_i, \boldsymbol{\theta}) = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_n x_{in}.$$

$\bullet$ Suppose we have $m$ training examples $(X_i, y_i)$ and $n$ features $X_i = [x_{i1}, x_{i2}, \cdots, x_{in} ]^T\in \mathbb{R}^n$.

$\bullet$ We can represent $X_i$ as a **Design Matrix** $X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n}\\
x_{21} & x_{22} & \cdots & x_{2n}\\ 
\vdots & \vdots & \ddots & \vdots \\
x_{m1} &  x_{m2} &\cdots & x_{mn}\\
\end{bmatrix}$, and $y_i$ as a vector $\mathbf{y} = \begin{bmatrix}
y_{1}\\ 
y_{2}\\ 
\vdots\\
y_{m}
\end{bmatrix}$.

$\bullet$ Thus we have a system:
$$ X  \boldsymbol{\theta} = \mathbf{y}.$$

$\bullet$ How do we solve it, and if there's no solution, how do we find the best possible $\boldsymbol{\theta}$?

<h3 align="center">Note</h3>

$\bullet$ Usually, there is an additional constant feature $x_{i0} = 1$, whose coefficient $\theta_0$ is the intercept (bias) term, so:

$$\boldsymbol{\theta} = 
\begin{bmatrix}
\theta_{0}\\ 
\theta_{1}\\ 
\vdots\\
\theta_{n}
\end{bmatrix}
\in \mathbb{R}^{n+1}
\text{, } 
X_i = 
\begin{bmatrix}
x_{i0}\\ 
x_{i1}\\ 
\vdots\\
x_{in}
\end{bmatrix}
\in \mathbb{R}^{n+1}
\text{, } 
\mathbf{y} = 
\begin{bmatrix}
y_{1}\\ 
y_{2}\\ 
\vdots\\
y_{m}
\end{bmatrix}
\text{, and } 
X = \begin{bmatrix}
x_{10} & x_{11} & \cdots & x_{1n}\\
x_{20} & x_{21} & \cdots & x_{2n}\\ 
\vdots & \vdots & \ddots & \vdots \\
x_{m0} &  x_{m1} &\cdots & x_{mn}\\
\end{bmatrix}
\in \mathbb{R}^{m\times (n + 1)}.$$

$\bullet$  Thus we have a system: 

$$X  \boldsymbol{\theta} = \mathbf{y}$$
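
$\bullet$ As a minimal sketch of this convention, the design matrix with the constant feature can be assembled in NumPy as follows. The toy data and the values of $\boldsymbol{\theta}$ are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical toy data: m = 4 examples, n = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])

# Prepend the constant feature x_{i0} = 1 to every row.
ones = np.ones((features.shape[0], 1))
X = np.hstack([ones, features])

# Arbitrary parameters [theta_0, theta_1, theta_2], for illustration only.
theta = np.array([0.5, 2.0, -1.0])

# One prediction per training example: y_hat = X theta.
y_hat = X @ theta

print(X.shape)  # (4, 3), i.e. m x (n + 1)
print(y_hat)
```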

<h3 align="center">Residuals and Coefficient of Determination</h3>

$\bullet$ In **regression analysis**, the difference between the **observed value** of the dependent variable $\mathbf{y}$ and 
<br> &ensp; the predicted value $\hat{\mathbf{y}}$ is called the **residual** $\mathbf{e}$. Each data point has one residual:

$$e_i = y_i - \hat{y_i}.$$

$\bullet$ For a least-squares fit that includes an intercept term, the mean of the residuals is zero: $\mathbf{E}(\mathbf{e}) = 0$.

$\bullet$ What about $\mathbf{Var}(\mathbf{e})$?
$$\mathbf{Var}(\mathbf{e}) = \mathbf{Var}(\mathbf{y})(1 - R^2).$$

&ensp; where $R^2$ is called the **coefficient of determination**:

$$R^2 \equiv 1 - \frac{SS_{res}}{SS_{tot}} \equiv 1 - \frac{\sum_{i} e_i^2}{\sum_{i}(y_i - \bar{y})^2}$$

$\bullet$ It determines how much of the **total variation in $\mathbf{y}$** is explained by the **variation in $X$**:
<br> &emsp; $\bullet$ $R^2=0$: the linear relationship **explains nothing** (so no linear relationship between $X$ and $\mathbf{y}$);
<br> &emsp; $\bullet$ $R^2=1$: the linear relationship **explains everything** - no left-overs, no uncertainty.
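
$\bullet$ The two statements about the residuals can be checked numerically. The sketch below uses synthetic data (the slope, intercept and noise level are assumptions made for this example) and a least-squares fit with an intercept:

```python
import numpy as np

# Synthetic data: a noisy line (all constants here are illustrative assumptions).
rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=m)

# Least-squares fit with an intercept column.
X = np.column_stack([np.ones(m), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals and coefficient of determination.
e = y - X @ theta_hat
ss_res = np.sum(e ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(abs(e.mean()) < 1e-10)                       # mean of residuals is numerically zero
print(np.isclose(e.var(), y.var() * (1.0 - r2)))   # Var(e) = Var(y) (1 - R^2)
```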


<h3 align="center">Least Squares Error</h3>

$\bullet$ When there's **no solution** to the system, we fit the data as well as possible.
<br>
$\bullet$ We look for the prediction $\hat{\mathbf{y}}$ whose residuals $\mathbf{e}$ are as small as possible.
<br>
$\bullet$ One way of measuring the **performance of the model** is to compute the **sum of squared errors** of the residuals (equal to the **MSE** up to the constant factor $1/m$):

$$J(\boldsymbol{\theta}) = \| \mathbf{e}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}} \|^2$$
$\bullet$ So our task is to find $\hat{\boldsymbol{\theta}}$:
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}J(\boldsymbol{\theta})$$
$\bullet$ Let's expand $J(\boldsymbol{\theta})$:

$$J(\boldsymbol{\theta}) = \| \mathbf{y} - \hat{\mathbf{y}} \|^2 = (\mathbf{y} - X\boldsymbol{\theta})^T(\mathbf{y} - X\boldsymbol{\theta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\theta}^TX^T\mathbf{y} + \boldsymbol{\theta}^TX^TX\boldsymbol{\theta}$$

$\bullet$ Use **Fermat's Theorem** to find the minimum:

$$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2X^T\mathbf{y} + 2X^TX\boldsymbol{\theta} = 0 \Rightarrow X^TX\boldsymbol{\theta} = X^T \mathbf{y} \Rightarrow \hat{\boldsymbol{\theta}} = (X^TX)^{-1}X^T\mathbf{y} = X^{+}\mathbf{y},$$

&ensp; where $X^{+} = (X^TX)^{-1}X^T$ is the **Pseudoinverse** of $X$ (this form requires $X^TX$ to be invertible).
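
$\bullet$ A minimal NumPy sketch of this closed-form solution, on synthetic data whose sizes and true parameters are assumptions made for the example. It solves the linear system $X^TX\boldsymbol{\theta} = X^T\mathbf{y}$ rather than forming the inverse explicitly, which is usually the numerically safer choice:

```python
import numpy as np

# Synthetic regression problem (sizes and true parameters are illustrative).
rng = np.random.default_rng(1)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Normal equation: solve X^T X theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate through the pseudoinverse X^+.
theta_pinv = np.linalg.pinv(X) @ y

# The gradient of J vanishes at the solution.
grad = -2.0 * X.T @ y + 2.0 * X.T @ X @ theta_normal

print(np.allclose(theta_normal, theta_pinv))
print(np.allclose(grad, 0.0, atol=1e-8))
```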

<h3 align="center">System of Linear Equations</h3>

$\bullet$ In **Linear Algebra** we typically use **different notation**.
<br>
$\bullet$ A general system of $m$ **linear equations** with $n$ **unknowns** can be written as:
$$\begin{matrix}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1\\ 
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2\\  
\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m
\end{matrix}$$
where $x_j$ are the unknowns, $a_{ij}$ are the coefficients of the system, and $b_i$ are the constant terms.
<br>
$\bullet$ We can write this system of linear equations in the equivalent matrix form:

$$A \mathbf{x} = \mathbf{b},$$

where $A = 
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\ 
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} &\cdots & a_{mn}\\
\end{bmatrix} \in \mathbb{R}^{m \times n}$, 
$\mathbf{x} = \begin{bmatrix}
x_{1}\\ 
x_{2}\\ 
\vdots\\
x_{n}
\end{bmatrix}  \in \mathbb{R}^{n}$ and $\mathbf{b} = \begin{bmatrix}
b_{1}\\ 
b_{2}\\ 
\vdots\\
b_{m}
\end{bmatrix}  \in \mathbb{R}^{m}$.

<h3 align="center">Column Space</h3>

Let $\mathcal{F}$ be a field and let $A$ be an $m \times n$ matrix, with column vectors $\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_n$, where $\mathbf{v}_i \in \mathcal{F}^m$.

$\bullet$ The **set** of **all possible linear combinations** of column vectors $\mathbf{v}_1, 
\cdots, \mathbf{v}_n$ is called the **column space**, $C(A)$:

$$\mathbf{v} = \alpha_1 \mathbf{v}_1 + \cdots + \alpha_n \mathbf{v}_n,$$

&ensp; where $\alpha_1, ..., \alpha_n \in \mathcal{F}$ are the scalars.

$\bullet$ Any linear combination of the column vectors can be written as the product of $A$ with a column vector:


$$\mathbf{v} = 
\alpha_1
\begin{bmatrix}
a_{11}\\ 
\vdots\\
a_{m1}
\end{bmatrix}
+ \cdots + \alpha_n
\begin{bmatrix}
a_{1n}\\ 
\vdots\\
a_{mn}
\end{bmatrix}
=
\begin{bmatrix}
\alpha_1 a_{11} + \cdots + \alpha_n a_{1n}\\
\alpha_1 a_{21} + \cdots + \alpha_n a_{2n}\\
\vdots \\
\alpha_1 a_{m1} + \cdots + \alpha_n a_{mn}
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & \cdots & a_{1n}\\
\vdots & \ddots & \vdots\\
a_{m1} & \cdots & a_{mn}\\
\end{bmatrix}
\begin{bmatrix}
\alpha_{1}\\ 
\vdots\\
\alpha_{n}
\end{bmatrix}
= A 
\begin{bmatrix}
\alpha_{1}\\ 
\vdots\\
\alpha_{n}
\end{bmatrix}
$$ 

$\bullet$ Therefore, the **column space** of $A$ is the same as the **range** of the corresponding matrix **transformation**.
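
$\bullet$ One possible numerical test of whether a vector lies in $C(A)$ is sketched below. The helper `in_column_space` and its tolerance are hypothetical names introduced for this example: the vector is in the column space exactly when the best linear combination of the columns reproduces it.

```python
import numpy as np

def in_column_space(A, b, tol=1e-10):
    """Return True if b is (numerically) a linear combination of the columns of A."""
    # Best coefficients alpha in the least-squares sense.
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    # b is in C(A) if A @ alpha reproduces b up to rounding error.
    return bool(np.linalg.norm(A @ alpha - b) <= tol * max(1.0, np.linalg.norm(b)))

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

b_in = 2.0 * A[:, 0] - 3.0 * A[:, 1]   # built as a combination of the columns
b_out = np.array([1.0, 1.0, 0.0])      # third entry would have to be 2

print(in_column_space(A, b_in))   # True
print(in_column_space(A, b_out))  # False
```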

<h3 align="center">Projection onto Column Space</h3>

Let's consider a system of linear equations in the matrix form $A \mathbf{x} = \mathbf{b}$.

$\bullet$ If $\mathbf{b} \notin C(A)$, then the system does not have a solution.

$\bullet$ We can find an **approximate solution** by projecting $\mathbf{b}$ onto $C(A)$. Let's multiply both sides by $A^T$:

$$A^TA \mathbf{x} = A^T\mathbf{b},$$
&ensp; i.e. we obtain the system of **Normal Equations**.

$\bullet$ If $A^TA$ is invertible, we can find the solution analytically, by simply solving the system of equations:

$$\hat{\mathbf{x}} = (A^TA)^{-1}A^T \mathbf{b}.$$

$\bullet$ $A^{+} = (A^TA)^{-1}A^T$ is also called the **Pseudoinverse** of the matrix $A$.

$\bullet$ It also happens to be the best solution in terms of **Least Squares error**:

$$\|\mathbf{b} - A\hat{\mathbf{x}}\|_2 =  \min_{\mathbf{x}}(\|\mathbf{b} - A\mathbf{x}\|_2)$$
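
$\bullet$ A small worked example of the projection, with a matrix and right-hand side chosen for illustration. The residual $\mathbf{b} - A\hat{\mathbf{x}}$ comes out orthogonal to every column of $A$, which is exactly what the Normal Equations state:

```python
import numpy as np

# Illustrative overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the Normal Equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat   # projection of b onto C(A)
r = b - p       # residual

print(x_hat)    # approximately [5, -3]
print(p)        # approximately [5, 2, -1]
print(A.T @ r)  # approximately [0, 0]: the residual is orthogonal to C(A)
```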


<h3 align="center">Invertibility of $A^TA$</h3>

$\bullet$ When is $A^TA$ invertible?
<br> &emsp; $\bullet$ According to $\textbf{Theorem} \space \textbf{11}$ from $\textbf{Lecture} \space \textbf{2}$, the matrix $A^TA$ is invertible if $\mathfrak{N}(A^TA) = \{0\}$;
<br> &emsp; $\bullet$ But $\mathfrak{N}(A^TA) = \mathfrak{N}(A)$, therefore the matrix $A^TA$ is invertible if $\mathfrak{N}(A)=\{0\}$;
<br> &emsp; $\bullet$ In other words, the matrix $A^TA$ is **invertible** if the **column vectors** of $A$ are **linearly independent**.

$\bullet$ When does $A^TA$ have no inverse?
<br> &emsp; $\bullet$ Similarly, $A^TA$ is not invertible if $\mathfrak{N}(A^TA) = \mathfrak{N}(A) \neq \{0\}$, i.e. if $\operatorname{rank}(A) < n$;
<br> &emsp; $\bullet$ In other words, some **columns** are **linearly dependent**;
<br> &emsp; $\bullet$ Or, for example, when we have too many features $(m < n)$.

$\bullet$ **Solution**:
<br> &emsp; $\bullet$ Remove the linear dependency;
<br> &emsp; $\bullet$ Delete some features.
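
$\bullet$ These cases can be detected numerically from the rank of $A^TA$. In the sketch below the matrices, the helper `gram_rank` and the threshold `TOL` are assumptions introduced for illustration:

```python
import numpy as np

TOL = 1e-8  # assumed threshold below which a singular value counts as zero

def gram_rank(A):
    """Numerical rank of A^T A."""
    return int(np.linalg.matrix_rank(A.T @ A, tol=TOL))

# Third column equals column 1 plus 2 * column 2: linearly dependent columns.
A_dep = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 2.0],
                  [1.0, 1.0, 3.0],
                  [2.0, 1.0, 4.0]])
print(gram_rank(A_dep), A_dep.shape[1])      # 2 3 -> A^T A is singular

# Removing the dependent column restores invertibility.
A_fixed = A_dep[:, :2]
print(gram_rank(A_fixed), A_fixed.shape[1])  # 2 2 -> A^T A is invertible

# Too many features (m < n): A^T A is n x n but has rank at most m.
A_wide = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(gram_rank(A_wide), A_wide.shape[1])    # 2 3 -> A^T A is singular
```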




<h3 align="center">Gradient Descent vs Normal Equation</h3>

|                                                      Gradient Descent                                                      |                                Normal Equation                                |
|:---------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------|
| **Has a hyperparameter** (learning rate $\alpha$)                                                                              | **Does not have a hyperparameter**                                                  |
| An **iterative process**, <br> i.e. many iterations are required to achieve the required accuracy | **No iterations required**, i.e. computed in one step                              |
| **No inverse computation is required**                                                                                         | **Computation of the inverse of the matrix $X^TX$ is required**      |
| Works well with a **large number of features** $n$. <br>Time complexity is $O(nC\log(1/\varepsilon))$ for **GD**, and <br> $O(C/\varepsilon)$ for **SGD** | **Very slow** if the number of features $n$ is large: $n \geq 10^4$. <br>Time complexity is $O(n^3)$ |
| An iterative process **may not converge**                                                                                      | **May not have an analytical solution**, i.e. when $X^TX$ is not invertible               |
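
$\bullet$ The comparison can be tried on a small synthetic problem. In this sketch the data, the learning rate `alpha` and the iteration count are assumptions, and gradient descent is applied to the cost scaled by $1/m$ so that the step size does not depend on the dataset size:

```python
import numpy as np

# Synthetic data (sizes, parameters and noise level are illustrative).
rng = np.random.default_rng(3)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Normal Equation: one linear solve, no hyperparameter.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

def gradient_descent(X, y, alpha, n_iters):
    """Batch gradient descent on (1/m) * ||X theta - y||^2."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = (2.0 / m) * X.T @ (X @ theta - y)
        theta = theta - alpha * grad
    return theta

# Gradient Descent: iterative, needs a learning rate.
theta_gd = gradient_descent(X, y, alpha=0.1, n_iters=2000)

print(np.allclose(theta_gd, theta_ne, atol=1e-6))
```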

<h3 align="center">Gradient Descent Learning Rate Problems</h3>

$\bullet$ If we **get stuck in a local extremum** or saddle point, we can **stay there for a long time**;

$\bullet$ A **large learning rate** might cause the so-called **bouncing gradient**.

<br>

<center><img src="images/GD_0.png" width="1800" alt="Example" /></center>
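
$\bullet$ The bouncing behaviour can be reproduced on the simplest possible cost, $J(\theta) = \theta^2$. The function `run_gd` and the two learning rates below are illustrative choices, not values from the lecture:

```python
def run_gd(alpha, n_iters=20, theta0=1.0):
    """Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - alpha * 2.0 * theta
    return theta

# Small learning rate: theta is multiplied by 0.8 each step and shrinks towards 0.
small = run_gd(alpha=0.1)

# Large learning rate: theta is multiplied by -1.2 each step,
# so it flips sign (bounces) and grows in magnitude.
large = run_gd(alpha=1.1)

print(abs(small) < 0.02)
print(abs(large) > 30.0)
```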


<h3 align="center">Optimization Methods for Gradient Descent</h3>

$\bullet$ The following two animations give some intuition about the behaviour of the optimization methods.
<br>

<center><img src="images/GD_Optimizers2.gif" width="1800" alt="Example" /></center>


<h3 align="center">Overfitting and Underfitting</h3>

$\bullet$ The central challenge in machine learning is that we must perform well on **previously unseen (new) inputs**.
<br>
$\bullet$ The ability to perform well on previously unobserved inputs is called **generalization**.
<br>
$\bullet$ There are two types of error we can compute:
<br> &emsp; $\bullet$ the error measured on the training set, called the **training error**;
<br> &emsp; $\bullet$ the **test error**, or **generalization error**, defined as the expected value of the error on a **new input**.

$\bullet$ In our linear regression example, the **training** and **test error** are:

$$\text{training error} = \frac{1}{m^{(train)}} \| X^{(train)}\boldsymbol{\theta} - \mathbf{y}^{(train)}\|_2^2$$

$$\text{test error} = \frac{1}{m^{(test)}} \| X^{(test)}\boldsymbol{\theta} - \mathbf{y}^{(test)}\|_2^2$$

$\bullet$ The factors determining **how well** a **machine learning** algorithm will **perform** are its ability to:
<br> &emsp; $\bullet$ Make the **training error** small.
<br> &emsp; $\bullet$ Make the gap between **training** and **test error** small.
<br>
$\bullet$ These factors correspond to the two central challenges in machine learning: **underfitting** and **overfitting**.
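
$\bullet$ A sketch of how the two errors are computed in practice. The synthetic data, the 90/30 split and the helper `mse` are assumptions made for this example; the test error is estimated on held-out examples that were not used for fitting:

```python
import numpy as np

def mse(X, y, theta):
    """Mean squared error (1/m) * ||X theta - y||_2^2."""
    return np.sum((X @ theta - y) ** 2) / X.shape[0]

# Synthetic data (parameters and noise level are illustrative).
rng = np.random.default_rng(4)
m = 120
true_theta = np.array([0.5, 1.0, -1.0, 2.0])
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ true_theta + rng.normal(scale=0.3, size=m)

# Hypothetical split: first 90 examples for training, last 30 for testing.
X_train, y_train = X[:90], y[:90]
X_test, y_test = X[90:], y[90:]

# Fit on the training set only.
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_error = mse(X_train, y_train, theta_hat)
test_error = mse(X_test, y_test, theta_hat)
print(train_error, test_error)
```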

<h3 align="center">Capacity</h3>



$\bullet$ A model’s **capacity** is its ability to fit a wide variety of functions. 
<br> &ensp; It **can control** whether a model is more likely to **underfit** or **overfit**.

$\bullet$  One way to control the **capacity** of a ML algorithm is by choosing its **hypothesis space**, 
<br> &ensp; the set of functions that the learning algorithm is allowed to select as being the solution.

<center><img src="images/Capacity.png" width="1000" alt="Example" /></center>
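
$\bullet$ For polynomial regression the hypothesis space is set by the polynomial degree. The sketch below (data, noise level and degrees are assumptions for illustration) shows that the training error can only decrease as the degree, and hence the capacity, grows:

```python
import numpy as np

# Noisy samples of a quadratic (all constants are illustrative).
rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 15)
y = 4.0 * x**2 + 3.0 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

def train_mse(degree):
    """Training MSE of the least-squares polynomial of the given degree."""
    # Design matrix with columns 1, x, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ theta - y) ** 2)

# Degree 1 has too little capacity; degree 9 has far more than needed.
errors = {d: train_mse(d) for d in (1, 2, 9)}
for d, err in errors.items():
    print(d, err)
```

&ensp; A low training error for degree 9 does not by itself mean a better model: the gap to the test error also has to stay small.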

<h3 align="center">Bias and Variance of an Estimator</h3>


-  **Bias** of an estimator is the **expected difference** between its **estimates** and the **true values** in the data:

$$\operatorname{Bias}(\hat{f}(x_0))=f(x_0)-\mathbf{E}\left [\hat{f}(x_0) \right ]$$

 - **Variance** of an estimator is defined as the **expected value** of the **squared difference** between the **estimate** of a model and the **expected value** of the **estimate**:

$$\operatorname{Var}(\hat{f}(x_0))=\mathbf{E}\left [ \left(\hat{f}(x_0)-\mathbf{E} \left[\hat{f}(x_0)\right] \right)^2 \right]$$

<h3 align="center">Bias–Variance Decomposition</h3>

- **Total expected error** at a point $x_0$, where the observed value is $y_0 = f(x_0) + \varepsilon$ with $\mathbf{E}[\varepsilon] = 0$ and $\mathbf{Var}(\varepsilon) = \sigma^2$, is defined as follows:

$$\mathbf{E} \left [ \left ( y_0 - \hat{f}(x_0)\right )^2\right ].$$

- **Whichever model**  $\hat{f}$ we choose, its expected error can be further **decomposed** as:

$$\mathbf{E} \left [\left(y_0 - \hat{f}(x_0)\right)^2\right]
= \left(\operatorname{Bias}\left[\hat{f}(x_0)\right] \right) ^2 + \operatorname{Var}\left[\hat{f}(x_0)\right] + \sigma^2,$$

&emsp; &emsp; &ensp;where $\sigma^2$ is the **irreducible error**, which we cannot get rid of.

- **Estimating expected error** is, in some sense, a **good way** to estimate model's **generalization ability**.


<h3 align="center">Bias-Variance Trade-Off</h3>

$\bullet$ Usually in **Machine Learning** there is a **trade-off** between a model's **Bias** and **Variance**.

$\bullet$ As we gradually **grow the model's capacity**, we gradually **reduce bias** and **increase variance**.

$\bullet$ The **goal** is to **find a sweet spot** where both bias and variance are acceptable.

<center><img src="images/biasvariance.png" width="800" alt="Example" /></center>

$\bullet$ Let's consider the dataset generated by the function $f(x) = 4x^2 + 3x + 2$.
<br> 
$\bullet$ True value is calculated at $x_0 = 5$.

<center><img src="images/Polynomials3.png" width="1500" alt="Example" /></center>
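
$\bullet$ The bias and variance at $x_0 = 5$ can be estimated by simulation: generate many noisy datasets from $f$, fit a polynomial to each, and look at the spread of the predictions at $x_0$. The sampling grid, noise level `sigma`, number of trials and polynomial degrees below are assumptions, not the settings used for the figure:

```python
import numpy as np

def f(x):
    return 4.0 * x**2 + 3.0 * x + 2.0

rng = np.random.default_rng(6)
x_train = np.linspace(0.0, 10.0, 20)    # assumed sampling points
x0, sigma, n_trials = 5.0, 10.0, 2000   # sigma and n_trials are assumptions

def predictions_at_x0(degree):
    """Fit a degree-`degree` polynomial to n_trials noisy datasets; return predictions at x0."""
    X = np.vander(x_train, degree + 1, increasing=True)
    row0 = np.vander(np.array([x0]), degree + 1, increasing=True)[0]
    # Each column of Y is one simulated dataset y = f(x) + noise.
    noise = rng.normal(scale=sigma, size=(x_train.size, n_trials))
    Y = f(x_train)[:, None] + noise
    # lstsq fits all datasets at once: Theta has one column per dataset.
    Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return row0 @ Theta

bias_sq, variance = {}, {}
for degree in (1, 2, 5):
    preds = predictions_at_x0(degree)
    bias_sq[degree] = (f(x0) - preds.mean()) ** 2
    variance[degree] = preds.var()
    print(degree, bias_sq[degree], variance[degree])
```

&ensp; Under these assumptions the degree-1 model shows a large squared bias, while the variance grows with the degree; adding $\sigma^2$ to the sum of the two terms gives the total expected error.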

<h3 align="center">Regularization</h3>

$\bullet$ **Regularization** is a technique used to **address overfitting**;
<br>
$\bullet$ **Regularization** can significantly **reduce the variance** of the model **without** a substantial **increase in its bias**.
<br>
$\bullet$ The **main idea** of regularization is to **keep all the features**, but **reduce the magnitude of the parameters** $\boldsymbol{\theta}$;
<br>
$\bullet$ A **regularization term** (or **regularizer**) $R(f)$ is added to a **loss function**:

$$J_{reg}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + R(f).$$

$\bullet$ There are two main forms of regularization:
<br> &emsp; $\bullet$ **Ridge regularization**, or **Tikhonov regularization**, uses the squared $L_2$ **norm** of the vector $\boldsymbol{\theta}$:
$$R(f) = \lambda \cdot \| \boldsymbol{\theta} \|^2_2 = \lambda \cdot \boldsymbol{\theta}^T\boldsymbol{\theta} = \lambda \cdot \sum_{i=1}^n \theta_i^2$$
&emsp; $\bullet$ **Lasso regularization** uses the $L_1$ **norm** of the vector $\boldsymbol{\theta}$:
$$R(f) = \lambda \cdot \| \boldsymbol{\theta} \|_1 = \lambda \cdot \sum_{i=1}^n |\theta_i|$$
$\bullet$ $\lambda$ is a **hyperparameter**, called **regularization parameter**.
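
$\bullet$ A short sketch of the two penalties. The function names and the example vector are hypothetical; following the sums above, which start at $i = 1$, the intercept $\theta_0$ is assumed to be left out of the penalty:

```python
import numpy as np

def ridge_penalty(theta, lam):
    """lam * sum_{i>=1} theta_i^2; theta[0] (the intercept) is not penalized."""
    return lam * np.sum(theta[1:] ** 2)

def lasso_penalty(theta, lam):
    """lam * sum_{i>=1} |theta_i|; theta[0] (the intercept) is not penalized."""
    return lam * np.sum(np.abs(theta[1:]))

theta = np.array([10.0, 3.0, -4.0])   # [theta_0, theta_1, theta_2], illustrative

print(ridge_penalty(theta, lam=0.5))  # 12.5 = 0.5 * (9 + 16)
print(lasso_penalty(theta, lam=0.5))  # 3.5  = 0.5 * (3 + 4)
```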

<h3 align="center">Regularization Parameter $\lambda$</h3>

$\bullet$ We need to choose $\lambda$ carefully: **large** $\lambda$ will lead to **underfitting**.

$\bullet$ **Green** - True distribution, **Red** - Linear Regression (**LR**);
<br> &ensp; **Blue** - LR with **Lasso** regularization, **Magenta** - LR with **Ridge** regularization.
<center><img src="images/preg2.png" width="1300" alt="Example" /></center>

<h3 align="center">Regularization in case of Normal Equations</h3>

$\bullet$ As shown above, the analytical solution of the Normal Equation is:

$$\hat{\boldsymbol{\theta}} = (X^TX)^{-1}X^T\mathbf{y} = X^{+}\mathbf{y}.$$

$\bullet$ With the **regularization term** added, the solution becomes:

$$\hat{\boldsymbol{\theta}} = (X^TX + \lambda E^{+})^{-1}X^T\mathbf{y},$$

&ensp; where $E^{+} \in \mathbb{R}^{(n+1)\times(n+1)}$ is almost the identity matrix:

$$E^{+} = \begin{bmatrix}
0 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\ 
0 & \vdots & \ddots & \vdots \\
0 & 0 &\cdots & 1
\end{bmatrix}.$$

$\bullet$ For $\lambda > 0$, the matrix $X^TX + \lambda E^{+}$ is **always invertible**!
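
$\bullet$ A minimal sketch of the regularized solution on a deliberately degenerate problem. The helper `ridge_normal_equation`, the data and the value of $\lambda$ are assumptions for illustration; the feature is duplicated so that $X^TX$ itself is singular:

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * E) theta = X^T y, with E the identity except E[0, 0] = 0."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0   # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)

# The same feature appears twice, so the columns of X are linearly dependent.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x, x])
y = 1.0 + 2.0 * x

# X^T X has rank 2 < 3, so the unregularized Normal Equation has no unique solution.
print(np.linalg.matrix_rank(X.T @ X, tol=1e-8))

# The regularized system is solvable and splits the weight evenly
# between the two identical features.
theta = ridge_normal_equation(X, y, lam=0.1)
print(theta)
```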


<h1 align="center">End of Lecture</h1>

<h1 align="center">Next: Midterm Exam №1</h1>