# Skill Check 2

In this assignment, you will import three external packages (`scikit-learn`, `autograd`, and `scipy`) that were introduced in the lectures and write a few lines of code using some of the useful objects in each package. This practice will give you a working feel for object-oriented programming (OOP). You do not need to know exactly what OOP means; for the purposes of this course, you can think of it as **using programmable "objects" made by others to save time**.

## 1. scikit-learn (30 pts)

`scikit-learn` is a widely used Python package that provides a large collection of functions and objects for machine learning. You will go through the workflow of building a simple regression model with `scikit-learn`. You will use this skill often to build more complicated models during the rest of the semester.

Let's import the `scikit-learn` package (no alias needed). (5 pts)

In [1]:
########################################
# Start your code here
import sklearn
########################################

In [2]:
assert sklearn.__version__, "scikit-learn not imported"

Simple linear regression can be implemented with `scikit-learn`. First, create a `LinearRegression` model and assign it to a variable named `lr`. (5 pts)

In [3]:
########################################
# Start your code here
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
########################################

In [4]:
assert isinstance(lr, sklearn.linear_model.LinearRegression), "lr is not a LinearRegression model"

The `LinearRegression` object takes several parameters (or arguments) so that users can easily change the model settings. You can find the details of the parameters, as well as of the model itself, in the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#). In most cases, the `scikit-learn` developers have already set a default value for each parameter. For `LinearRegression`, the `fit_intercept` parameter indicates whether the model fits an intercept term during training. The default value for `fit_intercept` is `True`.
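As a minimal sketch of how parameters can be inspected and changed, the snippet below uses the estimator methods `get_params` and `set_params`. The variable names `model` and `default_value` are only illustrative and are not part of the assignment.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()                 # every parameter takes its default value
default_value = model.get_params()["fit_intercept"]
print(default_value)                       # default setting of fit_intercept

model.set_params(fit_intercept=False)      # change a parameter after construction
print(model.fit_intercept)                 # the parameter is also exposed as an attribute
```

Parameters can also be passed as keyword arguments when the object is created, for example `LinearRegression(fit_intercept=False)`.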

Change `fit_intercept` of `lr` to `False`. (5 pts)

In [5]:
########################################
# Start your code here
lr.fit_intercept = False
########################################

In [6]:
assert lr.fit_intercept == False, "fit_intercept is True"

Write a function `return_coeff` that takes training data `X` and target values `y` as arguments. The function should train a `LinearRegression` model on `X` and `y` and return the resulting model coefficients. Make sure that `fit_intercept` is set to `False`. The data type of the returned variable should be `list`. (15 pts)

In [7]:
def return_coeff(X, y):
########################################
# Start your code here
    lr = LinearRegression(fit_intercept = False)
    lr.fit(X, y)
    
    return list(lr.coef_)
########################################

In [8]:
import pandas as pd
import numpy as np

df = pd.read_csv('ethanol_IR.csv')

X = df['wavenumber [cm^-1]'].values[500:520].reshape(-1, 1)
y = df['absorbance'].values[500:520].reshape(-1, 1)

assert np.isclose(np.linalg.norm(return_coeff(X, y)) * np.linalg.norm(return_coeff(y, X)) , 0.6785527874159363), "return_coeff not correct"
assert type(return_coeff(X, y)) == list, "the return values should be in list!"
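One way to sanity-check a no-intercept fit is to compare it with the closed-form least-squares slope for a single feature, $\sum x_i y_i / \sum x_i^2$. The sketch below does this on a small synthetic data set (the values in `x_demo` and `y_demo` are made up for illustration and do not come from `ethanol_IR.csv`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: y is roughly 2 * x.
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8])

model = LinearRegression(fit_intercept=False)
model.fit(x_demo.reshape(-1, 1), y_demo)   # scikit-learn expects a 2-D feature matrix

slope_sklearn = model.coef_[0]
# Closed-form least-squares slope for a line through the origin.
slope_closed_form = np.dot(x_demo, y_demo) / np.dot(x_demo, x_demo)
print(slope_sklearn, slope_closed_form)
```

The two numbers should agree up to floating-point error, which confirms that the model is fitting a line through the origin.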

## 2. Gradient Descent with autograd (40 pts)

In this problem, you will implement gradient descent optimization with functions in the `autograd` package. We will fit IR spectrum peaks with multiple Gaussians to find the optimal positions and widths of the peaks.

$$y = \sum_{i=1}^{N} w_i \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$
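To make the model concrete, here is a small sketch that evaluates a sum of Gaussians with plain NumPy. The helper name `gaussian_sum` and the example parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_sum(x, w, mu, sigma):
    """Evaluate sum_i w_i * exp(-(x - mu_i)^2 / (2 * sigma_i^2)) at each point in x."""
    x = np.asarray(x, dtype=float)[:, None]                    # shape (n_points, 1)
    terms = w * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))    # shape (n_points, N)
    return terms.sum(axis=1)                                   # add up the N Gaussians

# Two well-separated peaks (illustrative values).
w = np.array([1.0, 0.5])
mu = np.array([0.0, 10.0])
sigma = np.array([1.0, 2.0])

y_demo = gaussian_sum([0.0, 10.0], w, mu, sigma)
print(y_demo)
```

Because the peaks barely overlap, the value at each center is approximately the corresponding weight $w_i$, which is why $w_i$ can be read as a peak height, $\mu_i$ as its position, and $\sigma_i$ as its width.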

First, import `autograd`. (5 pts)

In [9]:
########################################
# Start your code here
import autograd
########################################

In [10]:
assert autograd, "autograd not imported"

To implement the gradient descent method, you need a well-defined loss function. Create a function `loss` that returns the mean squared error of the estimate. `loss` takes the following arguments:

- a parameter vector `lamda` $\vec{\lambda} = [\vec{w}, \vec{\mu}, \vec{\sigma}]$ (1-dimensional numpy array)
- training data `X` (wavenumbers in this case) (1-dimensional numpy array)
- target values `y` (absorbance in this case) (1-dimensional numpy array)
- the number of Gaussians `N` (int)

You may assume that the length of `lamda` is 3 × `N`. (10 pts)
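The layout of the parameter vector can be illustrated with a short sketch: the first `N` entries are the weights, the next `N` the centers, and the last `N` the widths. The snippet below unpacks an example vector with `np.split`; the names `w`, `mu`, and `sigma` are just illustrative.

```python
import numpy as np

N = 3
# Example parameter vector laid out as [w_1..w_N, mu_1..mu_N, sigma_1..sigma_N].
lamda = np.array([5., 5., 5., 3000., 3200., 3300., 50., 50., 50.])

# Split into three equal-length blocks of N entries each.
w, mu, sigma = np.split(lamda, 3)
print(w, mu, sigma)
```

Equivalently, the `i`-th Gaussian uses `lamda[i]`, `lamda[i + N]`, and `lamda[i + 2 * N]`.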

In [11]:
def loss(lamda, X, y, N):
########################################
# Start your code here
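    predict = np.zeros(X.shape[0])
    for i in range(N):
        # lamda[i] is the weight, lamda[i+N] the center, lamda[i+2*N] the width
        # use `a = a + b` rather than `a += b`: autograd advises against in-place updates
        predict = predict + lamda[i] * np.exp(-(X - lamda[i+N])**2 / 2 / lamda[i+2*N]**2)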
        
    return ((predict-y)**2).mean()
########################################

In [12]:
import pandas as pd
import numpy as np

In [13]:
df = pd.read_csv('ethanol_IR.csv')

X = df['wavenumber [cm^-1]'].values[500:520]
y = df['absorbance'].values[500:520]

l1 = np.array([5., 5., 5., 3000., 3200., 3300., 50., 50., 50.])
ans = loss(l1, X, y, 3)

l2 = np.array([10., 11., 12., 13., 14., 2850, 2900., 2950, 3000., 3050., 30., 40., 50., 60., 70.])
ans *= loss(l2, X, y, 5)

assert np.isclose(ans, 2.2449682077520627), "loss function not correct"

Using the `grad` function in the `autograd` package, create a function `diff_g` that returns the gradient of the `loss` function with respect to `lamda`. You may assume that `N` equals 3. (10 pts)

In [14]:
import autograd.numpy as np
from autograd import grad

In [15]:
########################################
# Start your code here
def g(lamda, X = X, y = y, N = 3):
    return loss(lamda, X, y, N)

diff_g = grad(g)
########################################

In [16]:
l1 = np.array([5., 5., 5., 3000., 3200., 3300., 50., 50., 50.])
assert np.isclose(np.linalg.norm(diff_g(l1)), 0.0034468711), "diff_g not correct"

Finally, write a function `grad_descent` that implements the gradient descent method. This function returns the optimal `lamda` and takes the following arguments:

- a parameter vector `lamda` $\vec{\lambda}$ (1-dimensional numpy array)
- a derivative function `diff_g` (function)
- a step size `h` (float)
- a tolerance `tol` (float)

In numerical optimization, it is very important to set a proper convergence criterion: optimization should stop once the criterion is met. Various options are available, but for now you will compare the $L_2$ norm of (`current_lamda` - `previous_lamda`) to the tolerance `tol`. If the norm is smaller than `tol`, your code should return the `lamda` at that iteration as the optimal solution. (15 pts)
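The stopping rule can be seen in isolation on a toy problem. The sketch below minimizes $f(x) = (x - 3)^2$ using its analytic gradient; the names `descend` and `quad_grad`, and the `max_iter` safeguard, are assumptions added for illustration and are not required by the assignment.

```python
import numpy as np

def descend(x0, grad_fn, h, tol, max_iter=10_000):
    """Gradient descent that stops when the update step is shorter than tol."""
    x_prev = np.asarray(x0, dtype=float)
    for _ in range(max_iter):                      # safeguard against endless loops
        x_curr = x_prev - h * grad_fn(x_prev)      # step against the gradient
        if np.linalg.norm(x_curr - x_prev) < tol:  # L2 norm of the update
            return x_curr
        x_prev = x_curr
    return x_prev                                  # not converged within max_iter

def quad_grad(x):
    # Gradient of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x_opt = descend(np.array([0.0]), quad_grad, h=0.1, tol=1e-8)
print(x_opt)
```

A small step between iterations does not guarantee that the gradient itself is small, so the result also depends on the step size `h`; this is one reason other criteria are sometimes preferred.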

In [17]:
def grad_descent(lamda, diff_g, h, tol):
########################################
# Start your code here
    err = np.inf
    previous_lamda = lamda
    while err > tol:
        current_lamda = previous_lamda - h * np.array(diff_g(previous_lamda))
        err = np.linalg.norm(current_lamda - previous_lamda)
        previous_lamda = current_lamda
        
    return current_lamda
########################################

In [18]:
l1 = np.array([5., 5., 5., 3000., 3200., 3300., 50., 50., 50.])
assert np.isclose(np.linalg.norm(grad_descent(l1, diff_g, .1, .001)), 5489.76999348897)

## 3. scipy (30 pts)

You will now simplify the code you wrote in the previous problem by taking advantage of the `scipy` package. The `minimize` function in `scipy.optimize` performs numerical optimization and is typically faster and more reliable than a hand-written gradient descent loop. You will find that coding becomes much more convenient and faster when you find an existing function that fits your purpose, although it's always good to have a basic understanding of what is happening "under the hood". For more information on the `minimize` function, refer to the [official documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html).

Import the `minimize` function from `scipy`. (5 pts)

In [19]:
########################################
# Start your code here
from scipy.optimize import minimize
########################################

In [20]:
assert minimize, "minimize not imported"

As covered in the lecture, the `minimize` function requires a loss function with only one unknown argument. Write a function `g` that takes the same arguments as `loss`, but with `X`, `y`, and `N` predefined so that only `lamda` remains unknown. The value returned by `g` should be the same as that returned by `loss`. The default values for `X`, `y`, and `N` are provided below. You may wish to review the lecture notes if you are unsure of how to do this. (5 pts)
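Default arguments are one way to fix the known inputs. For comparison, the sketch below shows two other common patterns on a toy objective: the `args` parameter of `minimize` and `functools.partial`. The function `sq_dist` and its inputs are hypothetical and only serve the illustration.

```python
from functools import partial

import numpy as np
from scipy.optimize import minimize

def sq_dist(p, target, scale):
    # Toy objective: scaled squared distance between p and a fixed target
    return scale * np.sum((p - target) ** 2)

target = np.array([1.0, -2.0])
start = np.zeros(2)

# Option 1: pass the fixed inputs through the `args` parameter.
res_args = minimize(sq_dist, start, args=(target, 2.0))

# Option 2: bind the fixed inputs first, leaving only `p` unknown.
res_partial = minimize(partial(sq_dist, target=target, scale=2.0), start)

print(res_args.x, res_partial.x)
```

All three approaches give `minimize` a function of a single unknown array; which one to use is mostly a matter of style.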

In [21]:
df = pd.read_csv('ethanol_IR.csv')

X = df['wavenumber [cm^-1]'].values[500:520]
y = df['absorbance'].values[500:520]

N = 3

In [22]:
########################################
# Start your code here
def g(lamda, X = X, y = y, N = 3):
    return loss(lamda, X, y, N)
########################################

In [23]:
l = np.array([10., 11., 12., 13., 14., 2850, 2900., 2950, 3000., 3050., 30., 40., 50., 60., 70.])
assert np.isclose(g(l, X, y, 5), 290.43601541265457)

Minimize the `g` function with respect to `lamda` using the scipy `minimize` function. You should use the `L-BFGS-B` algorithm for the optimization. Save the result to a variable `res`. The initial guess for `lamda` is provided below. The `BFGS` family of algorithms is a good default since its members are usually fast and robust. The details of how the algorithms work are beyond the scope of this course, but you can read about them [here](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm). The short version is that they use gradient information from previous iterations to approximate the curvature of the loss function, which improves the search direction and step size at each iteration. (20 pts)
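The object returned by `minimize` carries more than the optimal parameters. As a sketch on a standard test problem (the Rosenbrock function shipped with `scipy.optimize`, not the assignment's loss), the snippet below runs `L-BFGS-B` and inspects a few fields of the result. The name `res_demo` and the starting point are chosen only for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])   # illustrative starting point

# `jac` supplies an analytic gradient; if it is omitted,
# the gradient is estimated numerically with finite differences.
res_demo = minimize(rosen, x0, method="L-BFGS-B", jac=rosen_der)

print(res_demo.success)   # whether the optimizer reports convergence
print(res_demo.x)         # parameters at the reported minimum
print(res_demo.fun)       # objective value at res_demo.x
print(res_demo.nit)       # number of iterations performed
```

Checking `success` and `fun` before using `x` is a cheap way to catch runs that stopped early or landed in a poor local minimum.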

In [24]:
init_lamda = np.array([6., 7., 8., 3500., 3200., 3300., 40., 40., 40.])

In [25]:
########################################
# Start your code here
res = minimize(g, init_lamda, method = 'L-BFGS-B')
########################################

In [26]:
assert np.isclose(np.linalg.norm(res.x), 5777.971010657634)