# Lab Practice Week 7

This notebook does not need to be submitted. This is only for you to gain experience and get some practice.

In [None]:
# as always
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Problem 1: Multi-dimensional data fitting and gradient checking

This problem uses an artificial dataset generated by the [`make_regression` function](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html) in `scikit-learn`'s `datasets` submodule. This time we have two features in each training sample. Recall that in Week 6's lectures we studied the regressions of citric acid (feature) vs fixed acidity (target) and volatile acidity (feature) vs fixed acidity (target). In both examples, the key problem is essentially 1-dimensional, since we are using a line (a 1-d object) to "fit" the general trend.

If we have two features, say a data point is $\mathbf{x} = (x_1, x_2)$ with label $y$, then linear regression seeks a linear function 
$$ h(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{w}^{\top} [1, \mathbf{x}]$$

such that $h(\mathbf{x}) \approx y$. Suppose we have $N$ training samples in total; our loss function can then be written as:

$$
L(\mathbf{w}) = 
\frac{1}{N}\sum_{i=1}^N  \left( [1, \;\mathbf{x}^{(i)}]^{\top} \mathbf{w} - y^{(i)} \right)^2
= \frac{1}{N}\sum_{i=1}^N  
\left( w_0 x_0^{(i)} + w_1 x_1^{(i)} + w_2 x_2^{(i)}  - y^{(i)} \right)^2,
$$
where $x_0^{(i)} = 1$ is artificially added to each training sample as the 0-th feature.
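
As a small illustration of this bookkeeping, the sketch below prepends the artificial feature $x_0 = 1$ to a tiny made-up dataset and evaluates the loss in both the matrix form and the expanded form. The data values and the names `X_aug`, `loss_with_bias`, and `loss_expanded` are hypothetical and chosen only for this example; they are not part of the lab's code.

```python
import numpy as np

# Tiny made-up dataset: N = 3 samples with 2 features each (illustrative values only)
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [2.0, 0.0]])
y = np.array([3.0, 1.0, 2.0])
w = np.array([0.5, 1.0, -1.0])  # (w_0, w_1, w_2)

# Prepend a column of ones so that x_0 = 1 for every sample
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

def loss_with_bias(w, X_aug, y):
    # Mean of the squared residuals, written as a matrix-vector product
    return np.mean((X_aug @ w - y) ** 2)

# The same loss written out term by term
loss_expanded = np.mean((w[0] + w[1] * X[:, 0] + w[2] * X[:, 1] - y) ** 2)

print(loss_with_bias(w, X_aug, y), loss_expanded)
```

The two printed numbers should agree, since the column of ones simply turns $w_0$ into an ordinary weight.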

Taking the partial derivative with respect to $w_0$, $w_1$, and $w_2$: for $k=0,1,2$

$$
\frac{\partial L}{\partial w_k} = \frac{2}{N}\sum_{i=1}^N x^{(i)}_k \left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)
= \frac{2}{N}\sum_{i=1}^N x^{(i)}_k \left(w_0 x_0^{(i)} + w_1 x_1^{(i)} + w_2 x_2^{(i)}  - y^{(i)} \right),
$$
which is $2/N$ times the sum over all samples of the product of the $k$-th feature and the residual $h(\mathbf{x}^{(i)}) - y^{(i)}$. An expression that is easier to vectorize is:
$$
\frac{\partial L}{\partial \mathbf{w}} 
= \frac{2}{N}\sum_{i=1}^N [1, \;\mathbf{x}^{(i)}] \left(w_0 x_0^{(i)} + w_1 x_1^{(i)} + w_2 x_2^{(i)}  - y^{(i)} \right).
$$
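
To see why this form is convenient, here is a minimal sketch that computes the gradient twice on random made-up data: once by looping over samples exactly as in the sum, and once as a single matrix-vector product. It assumes each sample has already been augmented with the artificial feature $x_0 = 1$; the names `X_aug`, `grad_loop`, and `grad_vec` are hypothetical and used only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
# Each row is an augmented sample [1, x_1, x_2] (random illustrative data)
X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = rng.normal(size=N)
w = rng.normal(size=3)

# Sample-by-sample sum, following the formula literally
grad_loop = np.zeros(3)
for i in range(N):
    residual_i = X_aug[i] @ w - y[i]   # h(x^(i)) - y^(i)
    grad_loop += X_aug[i] * residual_i
grad_loop *= 2 / N

# Vectorized: the same sum expressed as X^T times the residual vector
grad_vec = 2 / N * X_aug.T @ (X_aug @ w - y)

print(np.allclose(grad_loop, grad_vec))
```

The vectorized line avoids the explicit Python loop, which matters once $N$ is in the thousands as in this lab.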

## Zero bias dataset

If we already know that our dataset has zero bias, i.e., the optimal weight $w_0$ is $0$, then our loss simplifies to:

$$
L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N  \left(w_1 x_1^{(i)} + w_2 x_2^{(i)}  - y^{(i)} \right)^2,
$$

and, with $\mathbf{w} = (w_1, w_2)$ and $\mathbf{x}^{(i)} = (x_1^{(i)}, x_2^{(i)})$,

$$
\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{N}\sum_{i=1}^N\mathbf{x}^{(i)} 
\left(w_1 x_1^{(i)} + w_2 x_2^{(i)}  - y^{(i)} \right).
$$

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# generate a zero bias dataset
X, y = make_regression(n_samples=5000, n_features= 2, bias = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
X_train.shape # training sample's shape

In [None]:
y_test.shape # testing target values/labels

In [None]:
# import seaborn for better visualization than pyplot
# visualization of all data
import seaborn as sns
sns.set(style='white')
_, ax = plt.subplots(figsize=[12,8])
sns.scatterplot(ax = ax, x = X[:,0], y = X[:,1],  s = y, 
                # s is the size of the dot, we use our target value y as the size of the dot
                alpha=0.4, edgecolors='w')
ax.set(xlabel='Feature 1', ylabel='Feature 2')
plt.show()

# Question 1: GD

Adapt the gradient descent method from Lecture 16 to this case using the exact gradient. Implement the exact gradient (not the numerical gradient) of the loss function in the cell below. Once you have finished training the model, compute the predicted target values with it, then visualize the trained model's predictions using the cell that follows.

In [None]:
# model
def h(w, X): 
    return np.matmul(X,w)

# loss
def loss(w, X, y):
    residual_components = h(w, X) - y
    return np.mean(residual_components**2)

# gradient of the loss
def gradient_loss(w, X, y):
    # implement the gradient here
    pass

In [None]:
# your implementation of gradient descent here


In [None]:
# compute the y_pred using your model
y_pred = h(w, X_test)

In [None]:
fig, axes = plt.subplots(1,2, figsize=(16, 8))
sns.scatterplot(ax = axes[0], x = X_test[:,0], y = X_test[:,1],  s = y_pred, 
                alpha=0.4, color='b').set_title("Predicted value", fontsize = 20)
sns.scatterplot(ax = axes[1], x = X_test[:,0], y = X_test[:,1],  s = y_test, 
                alpha=0.4, color='r').set_title("Actual value", fontsize = 20)

plt.show()

# Question 2: Gradient checking

So far we have worked with relatively straightforward loss functions whose gradients can be derived with pen and paper and then implemented directly. 

We have also used a forward difference to approximate the gradient.

For more complex models that we will see later, for example backpropagation for neural networks, which involves the gradients of multiple nonlinear composite functions, the gradient computation can be notoriously difficult to debug and get right. Sometimes a subtly buggy implementation will manage to learn something that looks surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not be at all apparent that anything is amiss. 

In this problem, we simply check whether the gradient we implemented agrees with the numerical gradient (usually the central difference) up to a tolerance we set. This is called *gradient checking*. Carrying out the gradient checking procedure described here will significantly increase your confidence in the correctness of your code.



Suppose we have a function $g_k(\mathbf{w})$ that purportedly computes $\partial L/\partial w_k$; we would like to check if $g_k(\mathbf{w})$ is outputting correct derivative values. Let 

$$\mathbf{e}_k = (0,\dots, 1, \dots, 0)^{\top},$$

be the unit vector whose $k$-th component is 1 and whose other components are 0. Then the gradient checking reads:

>  $\mathbf{w}$: the weights, $\epsilon$: a small positive value close to 0, $\texttt{tol}$: the tolerance we set<br>
> If $\displaystyle \left| g_k(\mathbf{w}) - \frac{L(\mathbf{w} + \epsilon\mathbf{e}_k) - L(\mathbf{w} - \epsilon\mathbf{e}_k)}{2\epsilon}\right| < \texttt{tol}$ for every $k$ <br><br>
> &nbsp;&nbsp;&nbsp;&nbsp; Proceed to do gradient descent

## Question:
Add the gradient checking algorithm to your gradient descent code; you can copy the numerical gradient implementation from Lecture 14. You can choose $\epsilon$ to be $10^{-5}$ and $\texttt{tol}$ to be $10^{-2}$. If the gradient check fails in an iteration of gradient descent, your function should print a warning message (you do not have to `break`).
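
As a rough sketch of the check itself, the cell below applies the central-difference test to a toy function rather than to the lab's loss. The helper names `numerical_gradient` and `gradient_check`, and the toy functions `f`, `g_correct`, and `g_buggy`, are hypothetical and written for this illustration; this is not necessarily the Lecture 14 implementation.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w, dtype=float)
    for k in range(w.size):
        e_k = np.zeros_like(w, dtype=float)
        e_k[k] = 1.0  # unit vector in the k-th component
        grad[k] = (f(w + eps * e_k) - f(w - eps * e_k)) / (2 * eps)
    return grad

def gradient_check(f, g, w, eps=1e-5, tol=1e-2):
    """Return True if every component of g(w) is within tol of the central difference."""
    return bool(np.all(np.abs(g(w) - numerical_gradient(f, w, eps)) < tol))

# Toy function (not the lab's loss): f(w) = w_1^2 + 3 w_2, with gradient (2 w_1, 3)
f = lambda w: w[0] ** 2 + 3 * w[1]
g_correct = lambda w: np.array([2 * w[0], 3.0])
g_buggy = lambda w: np.array([w[0], 3.0])  # deliberately wrong: missing the factor 2

w = np.array([1.5, -0.5])
print(gradient_check(f, g_correct, w))  # the correct gradient should pass
print(gradient_check(f, g_buggy, w))    # the buggy gradient should fail
```

Inside a gradient descent loop, one possible use is to call such a check once per iteration and print a warning whenever it returns `False`.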

In [None]:
# your code here