# Machine Learning Quiz 2 Practice
- toc: true
- badges: true
- comments: true
- author: Sachin Yadav
- categories: [MLCourse2022]

In [1]:
#collapse
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [2]:
np.random.seed(0)

In [3]:
#collapse
%pip install -q blackcellmagic
%load_ext blackcellmagic



## Maths for ML

1. Given a vector $\epsilon$, we can calculate $\sum\epsilon_{i}^{2}$ using $\epsilon^{T} \epsilon$

In [4]:
#collapse
# As per the convention we take epsilon to be a 2D column vector
y = np.array([1.3, 2.5, 6.4, 8.1, 9.0]).reshape(-1, 1)
y_hat = np.array([1.5, 2.0, 5.9, 8.5, 9.0]).reshape(-1, 1)

epsilon = np.abs(y - y_hat)

epsilon_square_sum1 = np.sum(epsilon**2)
epsilon_square_sum2 = (epsilon.T @ epsilon).item()

assert np.allclose(epsilon_square_sum1, epsilon_square_sum2)

👍 works!

2. $(AB)^{T} = B^{T}A^{T}$

In [5]:
#collapse
A = np.random.randn(50, 10)
B = np.random.randn(10, 20)

ab_t = (A @ B).T
b_t_a_t = B.T @ A.T
assert np.allclose(ab_t, b_t_a_t)

🙄 Knew it already!

3. For a scalar $s$, $s = s^{T}$

4. Derivative of a scalar $s$ with respect to (yes, I wrote "wrt" in full here 😁) a vector $\theta$
   $$\theta = \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \end{bmatrix}$$
    $$\frac{\partial s}{\partial \theta} = \begin{bmatrix}
     \frac{\partial s}{\partial \theta_{1}} \\
     \frac{\partial s}{\partial \theta_{2}} \\
     \frac{\partial s}{\partial \theta_{3}} \\
     \vdots \\
     \frac{\partial s}{\partial \theta_{n}} 
     \end{bmatrix} $$
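
A minimal sketch of this definition, assuming a hypothetical scalar function `s(theta) = sum(theta**2)` chosen only for illustration. The helper `numerical_gradient` is not part of the original notes; it estimates each partial derivative with central differences and shows that the gradient has the same shape as $\theta$.

```python
import numpy as np


def s(theta):
    # Hypothetical scalar function, for illustration only: s = sum_i theta_i^2
    return float(np.sum(theta**2))


def numerical_gradient(f, theta, h=1e-6):
    # Central-difference estimate of each partial derivative ds/dtheta_i
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[i, 0] = h
        grad[i, 0] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad


theta = np.array([1.0, -2.0, 3.0]).reshape(-1, 1)
grad = numerical_gradient(s, theta)

# One partial derivative per component, stacked in a column like theta
assert grad.shape == theta.shape
# For this particular s, the analytic gradient is 2 * theta
assert np.allclose(grad, 2 * theta, atol=1e-4)
```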

5. If $A$ is a matrix and $\theta$ is a vector such that $A\theta$ is a scalar, then
   $$ \frac{\partial A \theta}{\partial \theta} = A^{T} $$

🤔 By analogy with $a\theta$, where both $a$ and $\theta$ are scalars, I expect the answer to be $A$. But the gradient must have the same shape as $\theta$, i.e. $N \times 1$, so $A^{T}$ is my guess before starting any calculations.

In [6]:
#collapse
N = 20
# A @ theta must be a scalar, so A.shape[0] should be 1.
A = torch.randn((1, N))
theta = torch.randn((N, 1), requires_grad=True)
scalar = A @ theta
scalar.backward()
assert torch.allclose(theta.grad, A.T)

👍 all good

6. Assume $Z$ is a matrix of form $X^{T}X$, then 
   $$ \frac{\partial (\theta^{T}Z\theta)}{\partial \theta} = 2Z^{T}\theta$$

🤔 Let me again make a good guess before any calculation. If $\theta$ and $Z$ were both scalars, the derivative would be $2Z\theta$. So my guess is $2Z\theta$, which equals $2Z^{T}\theta$ because $Z = X^{T}X$ is symmetric.

In [7]:
#collapse
X = torch.randn((N, N))
Z = X.T @ X
theta = torch.randn((N, 1), requires_grad=True)

scalar = theta.T @ Z @ theta
scalar.backward()

assert torch.allclose(theta.grad, 2 * Z.T @ theta)

👍 good

Let's skip most of the rank topic for now and keep only the essentials.

The maximum possible rank of a matrix with $R$ rows and $C$ columns is $\min(R, C)$.

But an interesting question 🤔: what is the minimum possible rank of a matrix? Is it 0 or 1?

Ans: The minimum rank is 0, attained only by the zero matrix. Every non-zero matrix has rank at least 1.
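
A quick check of these rank facts using `np.linalg.matrix_rank`. The matrices below are made up for illustration; a random Gaussian matrix is full rank with probability 1, so its rank should equal the smaller of its two dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random tall and wide matrices: rank is limited by the smaller dimension
tall = rng.standard_normal((5, 3))
wide = rng.standard_normal((3, 5))
rank_tall = np.linalg.matrix_rank(tall)
rank_wide = np.linalg.matrix_rank(wide)

# The zero matrix has rank 0
rank_zero = np.linalg.matrix_rank(np.zeros((4, 4)))

# A non-zero outer product of two vectors has rank exactly 1
outer = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
rank_outer = np.linalg.matrix_rank(outer)

assert rank_tall == 3 and rank_wide == 3
assert rank_zero == 0
assert rank_outer == 1
```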

Just a parting thought: if I had been a developer of NumPy, I would not have allowed `np.eye` as the name of the identity-matrix function. Better to use `np.identity` only. 😞

## Linear Regression Introduction

Considering `weight` as a linear function of `height`:
- $weight_{1} \approx \theta_{0} + \theta_{1} * height_{1}$
- $weight_{2} \approx \theta_{0} + \theta_{1} * height_{2}$
- $weight_{N} \approx \theta_{0} + \theta_{1} * height_{N}$

Add an extra column of $1$s to the feature matrix for the bias term $\theta_{0}$:

$$ W_{N\times1} = X_{N\times2} \, \theta_{2\times1} $$
where the feature matrix is $X = \begin{bmatrix}
1 & height_{1} \\
1 & height_{2} \\
\vdots & \vdots \\
1 & height_{N}
\end{bmatrix}$

- $\theta_{0}$, Bias/Intercept term : (the value of $y$, when $x$ is set to zero)
- $\theta_{1}$, Slope term : (the increase in $y$, when $x$ is increased by 1 unit)
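
A small sketch of the matrix form above. The heights are made-up values and the parameters are simply the slider defaults used later in this post, so treat all numbers as assumptions for illustration.

```python
import numpy as np

# Made-up heights and illustrative parameter values
heights = np.array([60.0, 65.0, 70.0, 75.0])
theta = np.array([-300.0, 7.5]).reshape(-1, 1)  # [theta_0, theta_1]

# Prepend a column of ones so that theta_0 acts as the intercept
X = np.column_stack([np.ones_like(heights), heights])  # shape (N, 2)

weight_hat = X @ theta  # shape (N, 1)

# Same result as evaluating theta_0 + theta_1 * height sample by sample
assert X.shape == (4, 2)
assert np.allclose(weight_hat.ravel(), theta[0, 0] + theta[1, 0] * heights)
```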

**Generalized Linear Regression**
- $N$: Number of training samples
- $M$: Number of features
  
$$ \begin{bmatrix}
\hat{y}_{1} \\
\hat{y}_{2} \\
\vdots \\
\hat{y}_{N} \\
\end{bmatrix}
_{N \times 1}
= \begin{bmatrix}
1 & x_{1, 1} & x_{1, 2} & \ldots & x_{1, M} \\
1 & x_{2, 1} & x_{2, 2} & \ldots & x_{2, M} \\
\vdots & \vdots & \vdots & \ldots & \vdots \\
1 & x_{N, 1} & x_{N, 2} & \ldots & x_{N, M} \\
\end{bmatrix} _{N \times (M + 1)}
\begin{bmatrix}
\theta_{0} \\
\theta_{1} \\
\vdots \\
\theta_{M}
\end{bmatrix} _{(M + 1)\times 1}
$$

$$ \hat{y} = X \theta $$




Now, the task at hand is to estimate "good" values of $\theta$, which will give a "good" approximation to the actual values. But how do we decide whether one set of values of $\theta$ is "better" than another? We need an evaluation metric here.

Let $\epsilon_{i} = y_{i} - \hat{y}_{i}$. We assume that each $\epsilon_{i}$ comes from a normal distribution, $\epsilon_{i} \sim \mathcal{N} (0, \sigma^{2})$.

We want $|\epsilon_{1}|$, $|\epsilon_{2}|$, $|\epsilon_{3}|$ ... , $|\epsilon_{N}|$ to be small.

So we can try to minimize the squared L2 norm of $\epsilon$ (the sum of squared errors) or its L1 norm (the sum of absolute errors).
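
As a rough illustration of the two candidate losses, the sketch below reuses the small `y` and `y_hat` vectors from the "Maths for ML" section and computes both quantities. The names `sse` and `sae` are introduced here only for this example.

```python
import numpy as np

y = np.array([1.3, 2.5, 6.4, 8.1, 9.0])
y_hat = np.array([1.5, 2.0, 5.9, 8.5, 9.0])
epsilon = y - y_hat

sse = np.sum(epsilon**2)  # squared L2 norm: sum of squared errors
sae = np.sum(np.abs(epsilon))  # L1 norm: sum of absolute errors
mse = np.mean(epsilon**2)  # mean squared error, SSE divided by N

# Both agree with NumPy's built-in vector norms
assert np.isclose(sse, np.linalg.norm(epsilon, ord=2) ** 2)
assert np.isclose(sae, np.linalg.norm(epsilon, ord=1))
```

Squaring penalizes large residuals more heavily than small ones, while the L1 norm weights every unit of error equally.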

In [8]:
#collapse
weight_height_df = pd.read_csv(
    "assets/2022-02-17-machine-learning-quiz2-practice/weight-height.csv"
)
# take 30 points
sampled_idx = np.random.choice(np.arange(len(weight_height_df)), size=30, replace=False)
weight_height_df = weight_height_df.iloc[sampled_idx][["Height", "Weight"]].sort_values(
    by=["Height"]
)


def plot_func(theta0, theta1):
    x = weight_height_df["Height"]
    y = weight_height_df["Weight"]
    y_hat = theta0 + x * theta1
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(x, y, label="Actual")
    ax.plot(x, y_hat, label="Pred", linestyle="--")
    ax.legend()
    ax.set_ylim([50, 400])
    mse_val = np.mean((y - y_hat) ** 2)
    # Doubled braces keep the LaTeX subscripts literal inside the f-string
    ax.set_title(
        rf"$\theta_{{0}}$={theta0}, $\theta_{{1}}$={theta1}    MSE val: {mse_val:.3f}"
    )


interact(
    plot_func,
    # FloatSlider labels are set with `description`, not `name`
    theta0=widgets.FloatSlider(description="theta0 (bias)", value=-300, min=-1000, max=1000, step=1),
    theta1=widgets.FloatSlider(description="theta1 (slope)", value=7.5, min=-20, max=20, step=0.01),
)

>Note: Run the notebook in Colab to use the interactive plot above, where we change the parameters manually (using sliders) and fit the line through the training points, guided by the mean squared error.

### Normal Equation
$$ y = X\theta + \epsilon$$
Objective: To minimize $\epsilon^{T} \epsilon$
$$\epsilon^{T} \epsilon = y^{T}y - 2 y^{T}X\theta + \theta^{T}X^{T}X\theta$$
$$\frac{\partial (\epsilon^{T} \epsilon)}{\partial \theta} = -2X^{T}y + 2X^{T}X\theta$$ 
(we use results 5 and 6 from the "Maths for ML" section above)
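
The intermediate steps can be sketched as follows, assuming $\epsilon = y - X\theta$ as in the model above:

$$\epsilon^{T}\epsilon = (y - X\theta)^{T}(y - X\theta) = y^{T}y - y^{T}X\theta - \theta^{T}X^{T}y + \theta^{T}X^{T}X\theta$$

Since $\theta^{T}X^{T}y$ is a scalar, it equals its own transpose $y^{T}X\theta$ (result 3), so the two middle terms combine into $-2y^{T}X\theta$. Differentiating term by term: $y^{T}y$ does not depend on $\theta$, result 5 with $A = y^{T}X$ gives

$$\frac{\partial (y^{T}X\theta)}{\partial \theta} = X^{T}y$$

and result 6 with $Z = X^{T}X$ gives

$$\frac{\partial (\theta^{T}X^{T}X\theta)}{\partial \theta} = 2X^{T}X\theta$$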

Setting it to zero and assuming $X^{T}X$ is invertible,
$$ \theta^{*} = (X^{T}X)^{-1}X^{T}y$$
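
A minimal sketch of the normal equation on synthetic data. The data-generating line $y = 2 + 3x$ plus small noise is an assumption made up for this example. The code solves the linear system $(X^{T}X)\theta = X^{T}y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is usually the numerically safer choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 + 3x + small Gaussian noise
N = 100
x = rng.uniform(0, 10, size=N)
y = (2.0 + 3.0 * x + rng.normal(0, 0.1, size=N)).reshape(-1, 1)

# Feature matrix with a leading column of ones for the bias term
X = np.column_stack([np.ones(N), x])

# Normal equation: solve (X^T X) theta = X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta_star, theta_lstsq)

# At the minimizer, the gradient -2 X^T y + 2 X^T X theta vanishes
gradient = -2 * X.T @ y + 2 * X.T @ X @ theta_star
assert np.allclose(gradient, 0, atol=1e-6)
```

The recovered `theta_star` should land close to the assumed intercept and slope, up to the noise in the synthetic data.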