# Homework 5: Regression, Part 1

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from typing import Callable, List, Tuple

In this homework we'll be exploring the diabetes dataset to try to predict blood sugar level. To do this we'll be using linear regression models as our prediction methods. Next week's tutorial will build more advanced methods on top of this week's.

This homework is dedicated to Wilford Brimley, who helped millions of Americans get their diabetes under control by promoting reasonably priced diabetes testing equipment supplied by Liberty Medical and Quaker Oats brand oatmeal, a great addition to a diabetes diet. In part thanks to Liberty Medical's testing equipment, the wonderful taste of Quaker Oats, and having sufficient income as a successful actor and brand spokesperson, thus affording him proper medical care, Brimley lived to the ripe old age of 85.

<center>
    <a href="https://www.youtube.com/watch?v=kyxBDARsGEw">
        <img src="wbrimley.jpg" alt="GD image" style="width:30%;">
    </a>    
    
    26 September 1934 -- 1 August 2020
</center>

## **Attention**: Expectations for this assignment

We're a bit more than halfway through the course by now, which means we're removing some of the structure of previous assignments. Here's a list of the expectations we have:

- You will have to write all of your own code. There is no skeleton code or tests.
- You will need to write your own tests. We've provided some hints along the way as sanity checks, but these are no replacement for proper tests. Some of these functions might be difficult to come up with test cases for, but still, we would at least encourage you all to have a good hard think about how you could potentially test a given function or method.
- You will need to write your own docstrings. We've included descriptions for each function we'd like you to write, but these are not docstrings. With every function, please also include a docstring, even if it's just a copy of the description that we've provided. Our solutions, when released, will include numpy-style docstrings.
- You will need to add type hints to all of your functions. This may be a requirement in the exam, so you might as well make a habit of it now, while mistakes are cheap, rather than right before the exam.
- You will need to add input validation to ensure that your inputs are valid, and raise an exception otherwise.
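
To make these expectations concrete, here is a minimal sketch of a function that has type hints, a numpy-style docstring, input validation, and a small test. The function `column_means` and its test are hypothetical illustrations, not part of the assignment.

```python
import numpy as np


def column_means(data: np.ndarray) -> np.ndarray:
    """Compute the mean of every column of a 2D array.

    Parameters
    ----------
    data : np.ndarray
        Array of shape (n_samples, n_features).

    Returns
    -------
    np.ndarray
        Array of shape (n_features,) holding the mean of each column.

    Raises
    ------
    TypeError
        If `data` is not a numpy array.
    ValueError
        If `data` is not 2D or has no rows.
    """
    # Input validation: fail early with a clear message.
    if not isinstance(data, np.ndarray):
        raise TypeError("data must be a numpy array")
    if data.ndim != 2 or data.shape[0] == 0:
        raise ValueError("data must be a non-empty 2D array")
    return data.sum(axis=0) / data.shape[0]


def test_column_means() -> None:
    """A small, hand-checkable test for column_means."""
    result = column_means(np.array([[1.0, 2.0], [3.0, 6.0]]))
    assert np.allclose(result, [2.0, 4.0])
    # Invalid input should raise instead of returning garbage.
    try:
        column_means(np.array([1.0, 2.0]))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for 1D input")


test_column_means()
```

Choosing inputs whose answer you can compute by hand, plus at least one invalid input, is usually enough for a first test.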

## Load the Data

Write a function `load_diabetes` that loads the data from the file `diabetes.tab`. This is a tab-separated values file, so when splitting each line, be sure to use `\t`.

This function should return a tuple in this order:

- `data` -- A 2D numpy array that contains our data as floating point numbers, with individual data points being on the 0th axis and features being on the 1st axis.
- `target` -- A 1D array that contains target values for each data point as floating point numbers. They're the last value in each row, and their feature label is `Y`.
- `feature_labels` -- A list of strings that contain the labels for our features. They're the first line of the `.tab` file.

Load the diabetes dataset by invoking this function. 
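
As a rough sketch of the parsing step, the snippet below splits tab-separated lines held in memory instead of reading a file. The helper name `parse_tab_lines` and the three-column sample are made up for illustration. Leaving the target label `Y` out of the returned labels is an assumption about what `feature_labels` should contain.

```python
import numpy as np
from typing import List, Tuple


def parse_tab_lines(lines: List[str]) -> Tuple[np.ndarray, np.ndarray, List[str]]:
    """Split tab-separated lines into features, target, and feature labels."""
    # The first line is the header; the last column is the target.
    header = lines[0].rstrip("\r\n").split("\t")
    rows = [
        [float(value) for value in line.rstrip("\r\n").split("\t")]
        for line in lines[1:]
        if line.strip()  # skip empty lines
    ]
    values = np.array(rows, dtype=float)
    # Assumption: the target label is excluded from the feature labels.
    return values[:, :-1], values[:, -1], header[:-1]


# Hypothetical three-column sample, not the real diabetes data.
sample = ["AGE\tBMI\tY\n", "59\t32.1\t151\n", "48\t21.6\t75\n"]
data, target, feature_labels = parse_tab_lines(sample)
print(data.shape, target, feature_labels)
```

For the real file, the same logic would apply to the lines read from `diabetes.tab`.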

In [None]:

### Please enter your solution here ###



## Exploring the Data

Write a function `print_table` that prints numerical values in a table with row and column labels. Each numerical value should be printed with **three** decimal places. Each column should be as wide as its longest entry, that is, the largest number of non-whitespace characters in any single row of that column. To achieve this, you will need to pad most entries in a column with whitespace. Columns are separated by `sep`.

Parameters:

- `row_names` -- A list of strings that label each row
- `col_names` -- A list of strings that label each column
- `data` -- A list of numpy arrays, the same length as `col_names`, where each numpy array has the same length as `row_names`
- `sep` -- A string that separates each column. Default is three spaces.

Return: None

Write another function `print_stats_table` that uses `print_table` to print the mean, standard deviation, minimum value, maximum value, and range (max-min) of each feature. Use `print_stats_table` to print the statistics of the diabetes dataset. 
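
The padding rule can be sketched for a single column as follows. The helper `pad_column` is a hypothetical building block, not the required `print_table`. It assumes left-aligned cells, as in the example output in the cell below.

```python
from typing import List


def pad_column(label: str, values: List[float]) -> List[str]:
    """Format one column: the label, then values with three decimals, all equally wide."""
    cells = [label] + [f"{value:.3f}" for value in values]
    # The column width is the length of its longest cell.
    width = max(len(cell) for cell in cells)
    # ljust pads each cell on the right with spaces up to the column width.
    return [cell.ljust(width) for cell in cells]


column = pad_column("feat1", [1.4994, 13.492])
print("\n".join(column))
```

A full table could then be assembled by formatting every column this way and joining the cells of each row with `sep`.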

In [None]:
"""
For example, it should print something like

Feat    feat1   feat2
row0    1.499   0.589
row1    3.492   2.449
"""


### Please enter your solution here ###



### Pearson Correlation Coefficient 

This is wonderful information!... but it doesn't get us much closer to figuring out how we're going to predict the target (which is blood sugar level). Let's see how much each feature correlates with the target feature. To do this, we'll calculate the Pearson correlation coefficient $r_{x,y}$ for each feature. A positive correlation indicates that the target tends to increase as the feature's value increases, and a negative correlation indicates that the target tends to decrease as the feature's value increases. The magnitude of the correlation coefficient indicates to what extent this occurs. If you care to know more, [here's](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) a link to the Wikipedia page for Pearson correlation. The formula for this metric given two vectors $x$ and $y$ is:

$$ r_{x,y} = \frac{\sum^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum^n_{i=1}(x_i-\bar{x})^2}\sqrt{\sum^n_{i=1}(y_i-\bar{y})^2}} $$
where the sample mean is defined as
$$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$
with $\bar{y}$ defined similarly with respect to $y$.
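
As a sanity check on the formula, here is a minimal sketch for the special case of two 1D vectors. The name `pearson_1d` and the toy vectors are illustrative only. `np.corrcoef` is used purely as an independent check, not inside the computation.

```python
import numpy as np


def pearson_1d(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between two 1D arrays of equal length."""
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))
    return float(numerator / denominator)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])
r = pearson_1d(x, y)

# Independent check against numpy's implementation.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
# A perfectly linear, decreasing relationship gives r = -1.
assert np.isclose(pearson_1d(x, -3.0 * x + 1.0), -1.0)
```

The matrix version asked for below applies the same formula to every column, which broadcasting can do without explicit loops.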

Write the function `pearson_correlation` that computes the Pearson correlation coefficients with the following arguments and return values. Do **not** use `np.corrcoef` in your implementation, but feel free to use it to check your work.

Arguments:
- `X` A matrix of datapoints with individual datapoints in the rows and features in the columns
- `Y` A matrix or vector of datapoints to find the correlation with `X`
  
Return:
- Numpy array with the Pearson correlation coefficients

Print the Pearson correlation coefficients between each feature and the target values in a table with feature labels as the rows and the Pearson correlations in the first column.

In [None]:

### Please enter your solution here ###



If you did this correctly, the highest correlation should be `bmi` at 0.586, and the lowest `s3` at -0.395. 

### Pearson Correlation Matrix



Write a function `plot_matrix` that plots the Pearson correlation matrix as a heatmap along a scale that goes from -1 to 1 and includes feature labels on the left and top of the plot. The labels on top should be slanted at a 45 degree angle. Plot the Pearson correlation matrix.

In [None]:

### Please enter your solution here ###



If you did this correctly, aside from those on the diagonal, you should see `s1` and `s2` have the highest correlation a bit above 0.75, and the lowest should be between `s3` and `s4` between -0.5 and -0.75.

## Linear Regression

For this task, we will be exploring fitting data to a linear model $\hat{y}_i = \theta^\top x_i + \theta_0$. The goal of fitting is to find the parameters $\theta$ and $\theta_0$. 

By using a linear model we are making the following assumptions about our data. 

- **Linearity** -- The target variable can be represented as a linear combination of input features, or otherwise said that the target variable was generated by a linear process of all input features, plus some noise.
- **Independence** -- The datapoints, or more importantly the intrinsic noise of each datapoint, are independent of one another.
- **Homoscedasticity** -- The intrinsic noise, or rather the error, of all datapoints has a constant variance.
- **No multicollinearity** -- No feature is perfectly correlated with another feature or with a linear combination of the other features.
- **Zero mean of residuals** -- The mean of the error, or intrinsic noise, is 0.

To fit a model to data we need a metric that tells us how wrong the current model is:

A **loss function** defines how we measure the error of a single prediction, and for this we'll be using squared error. $$L(x_i, y_i, \theta, \theta_0) = (y_i - \hat{y}_i)^2 = (y_i - (\theta^\top x_i + \theta_0))^2$$

An **objective function** is the function we'll be minimizing, which in this case will be mean squared error (MSE). $$J(X, Y, \theta, \theta_0) = \frac{1}{N}\sum^N_{i=1}L(x_i, y_i, \theta, \theta_0) = \frac{1}{N}\sum^N_{i=1}(y_i - \hat{y}_i)^2$$

In practice, you'll often see the terms loss function, objective function, error function, and cost function, among others, used interchangeably.

Generally, you might see this optimization task denoted as:

$$\theta, \theta_0 = \underset{\theta, \theta_0}{\arg\min}\ J(X, Y, \theta, \theta_0)$$

Luckily, for linear regression with mean squared error there's a closed-form solution, known as the ordinary least squares (OLS) estimator.

$$\theta, \theta_0 = (X^\top X)^{-1}X^\top Y$$


To account for the bias term $\theta_0$, we will need to add a feature to our data set where every value is $1$. 
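
To see the closed-form solution and the bias column in action, here is a small sketch on noise-free toy data. The data and all variable names are made up for illustration. Placing the column of ones last, so that the last parameter plays the role of $\theta_0$, is an arbitrary choice.

```python
import numpy as np

# Toy data generated exactly by y = 2*x1 - 1*x2 + 3 (no noise).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

# Append a column of ones so the last parameter acts as the bias theta_0.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form OLS solution: (X^T X)^{-1} X^T y.
params = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Predictions and mean squared error of the fitted model.
y_hat = X_b @ params
mse_value = np.mean((y - y_hat) ** 2)
print(params, mse_value)
```

Because the toy data is exactly linear, the recovered parameters should be close to the generating values and the MSE close to zero. Real data with noise will not fit this cleanly.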

Write a function called `least_squares` that has the following parameters and return values:

Parameters:
- `X` -- Data points as a numpy array
- `Y` -- Target values as a numpy array

Returns: Parameter vector as a numpy array

Feel free to use `np.linalg.inv()`.

Write another function called `mse` that calculates the mean squared error given our target variable $y$ and our predictions $\hat{y}$ as numpy arrays.

In [None]:

### Please enter your solution here ###



Using `least_squares`, calculate the parameters of the model. Don't forget to add a column of 1's for the bias term. Print the model weights and the Pearson correlations between each feature and the target in the same table. Set the Pearson correlation for the bias term to a silly value, such as 42, since there is no Pearson correlation for the bias term. Also, print the MSE for the fit model.

In [None]:

### Please enter your solution here ###



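Often when working with a dataset we'll standardize or normalize the data. Some methods specifically require this. In the case of the diabetes dataset, it shouldn't change the quality of an ordinary least squares fit: the parameter values change, but the predictions and the MSE should stay the same (up to floating point error).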

- **Normalization** -- This linearly shifts the range of each feature to a predetermined range, usually $[0,1]$ or $[-1, 1]$.

The formula for the interval $[0, 1]$ looks like:
$$ x' = \frac{x - x_{min}}{x_{max} - x_{min}} $$

and the formula for the interval $[-1, 1]$ looks like:
$$ x' = 2 \cdot \frac{x - x_{min}}{x_{max} - x_{min}} - 1$$
  
- **Standardization** -- This centers the data so the mean of each feature is $0$ and the standard deviation of each feature is $1$. The formula for this looks like:

    $$ x' = \frac{x - \bar{x}}{\sigma(x)} $$
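
The three formulas above can be sketched column-wise on a toy array as follows. The array and variable names are illustrative. Using the population standard deviation (numpy's default, `ddof=0`) for $\sigma(x)$ is an assumption, since the text does not say which variant is meant. A constant feature would cause a division by zero, which this sketch does not handle.

```python
import numpy as np

# Toy data: three datapoints (rows), two features (columns).
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

# Per-feature minimum and maximum (axis=0 reduces over the datapoints).
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# Normalization to [0, 1].
X_01 = (X - x_min) / (x_max - x_min)

# Normalization to [-1, 1]: stretch [0, 1] by 2 and shift by -1.
X_11 = 2.0 * X_01 - 1.0

# Standardization: zero mean and unit standard deviation per feature.
# Assumption: population standard deviation (ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_01)
print(X_11)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

Broadcasting applies the per-feature statistics to every row, so no loops over features are needed.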

Write three functions `normalize_0_1`, `normalize_1_1` and `standardize`, each of which takes the data array and returns that array normalized to the range $[0,1]$, normalized to the range $[-1,1]$, or standardized, respectively. For each modified dataset, print a table with statistics on each feature, as you did after you loaded the dataset. Also use ordinary least squares to find the model parameters for each modified dataset and calculate the MSE. Print these parameters and the MSE. Don't forget to add the column of $1$s for the bias term.

In [None]:

### Please enter your solution here ###



## Ridge Regression

Often, in machine learning tasks, we'll find that the weights of our model become extremely large. This is often a sign of **overfitting**, where the model memorizes the data it was trained on. This would be fine if our model only ever encountered our training data, but this is almost never the case in practice. When a model overfits on its training data, it often fails to **generalize** to new data. To help a model generalize we're going to introduce a **regularization** term. There are two main types of regularization:

**L1 Regularization** -- Also known as LASSO when used with linear regression, where we add the sum of the absolute values of all parameters, multiplied by the hyperparameter $\lambda$, to the objective function. This encourages sparser models, pushing many parameters towards 0.

$$ J(X, Y, \theta, \theta_0) = \frac{1}{n}\sum^n_{i=1}L(x_i, y_i, \theta, \theta_0) + \lambda \|[\theta,\theta_0]\|_1 = \frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2  + \lambda \|[\theta,\theta_0]\|_1$$

**L2 Regularization** -- Also called ridge regression when used with linear regression, where we add the sum of the squares of all parameters, multiplied by hyperparameter $\lambda$, to the objective function. 

$$ J(X, Y, \theta, \theta_0) = \frac{1}{n}\sum^n_{i=1}L(x_i, y_i, \theta, \theta_0) + \lambda \|[\theta,\theta_0]\|^2 = \frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2  + \lambda \|[\theta,\theta_0]\|^2$$ 

In general, there is no closed-form solution for LASSO, so if you use L1 regularization with a linear regression model you have to use an iterative optimization method. Ridge regression, however, has a nice closed-form solution:

$$\theta, \theta_0 = (X^\top X + \lambda I)^{-1}X^\top Y$$
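
The shrinking effect of $\lambda$ can be seen in a small sketch on toy data. The helper name `ridge_sketch` and the data are made up for illustration. The code follows the formula exactly as stated above, so the bias parameter is regularized along with the other parameters.

```python
import numpy as np


def ridge_sketch(X_b: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    identity = np.eye(X_b.shape[1])
    return np.linalg.inv(X_b.T @ X_b + lam * identity) @ X_b.T @ y


# Toy data generated exactly by y = 2*x1 - 1*x2 + 3, with a bias column of ones.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

params_ols = ridge_sketch(X_b, y, 0.0)     # lam = 0 reduces to ordinary least squares
params_ridge = ridge_sketch(X_b, y, 10.0)  # a larger lam shrinks the parameters

print(params_ols, np.linalg.norm(params_ols))
print(params_ridge, np.linalg.norm(params_ridge))
```

With `lam = 0` the result matches the OLS estimator, and with a positive `lam` the norm of the parameter vector is smaller.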

Write a function `ridge_regression` that calculates $\theta$ and $\theta_0$. This should have the same arguments and return type as `least_squares`, with the exception of an additional parameter `lam`, which is a float.

In [None]:

### Please enter your solution here ###



From this point on we will be using standardized data. Feel free to overwrite your data variable with the standardized version. Find the parameters of the model using ridge regression and a $\lambda$ value of 0.2.

In [None]:

### Please enter your solution here ###

