# TDA233 / DIT382 Machine Learning: Homework 1 <br />
**Goal:** Start working with Jupyter notebooks, introduction to probability and regression models <br />
**Grader:** Jack Sandberg <br />
**Submitted by:** 📝 Name, Personal no., email <br />

# Read this before starting


## General guidelines
* Answer all fields marked with 📝 (and `# TODO` in the code blocks). This includes
    * your name, personal number and email address above, and
    * all later fields marked with "📝 Your answer here:".
* Feel free to add more cells if needed.
* All solutions to theoretical and practical problems must be submitted in this notebook, and equations, wherever required, should be formatted using LaTeX math mode.
    * Do NOT hand in an assignment that isn't runnable!
* All discussion regarding practical problems, along with solutions and plots should be specified in this notebook.
* All plots/results should be visible so that the notebook does not have to be run, but the code in the notebook should reproduce the plots/results if we choose to run it.
* All tables and other additional information should be included in this notebook.
* **Before submitting, make sure that your code can run on another computer, i.e. that all plots can show on another computer including all your text and equations. It is good to check if your code can run here: https://colab.research.google.com**
* **Submit your solutions as a notebook file (`.ipynb`) and in HTML format (`.html`). To export this notebook to HTML format click `File` $\rightarrow$ `Download as` $\rightarrow$ `HTML`.**

**Jupyter/IPython Notebook** is a collaborative, web-based Python environment that will be used in all our homework assignments. It is installed in the halls ES61-ES62, E-studio and MT9. You can also use Google Colab (https://colab.research.google.com) to run these notebooks in a browser, without having to download or install anything on your own computer.
Some useful resources:
1. https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/ (Quick-start guide)
2. https://www.kdnuggets.com/2016/04/top-10-ipython-nb-tutorials.html
3. http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html (markdown)

## Required software

For the practical problem in this assignment you will need to install the following Python packages:

- `numpy`: The fundamental package for scientific computing with Python (so fundamental there is a [Nature review](https://www.nature.com/articles/s41586-020-2649-2) on it)
- `matplotlib`: Visualization with Python
- `sklearn`: Scikit-learn is a package for performing machine learning in Python.

> **Note:** In Google Colab you can install packages using `!pip install <package_name>`

> **Note:** In Google Colab these packages should be preinstalled but it is a good habit to check if all required packages are installed beforehand and the installed versions of packages. Use `!pip list` to list packages installed by pip on Google Colab.

> **Note:** We recommend that you install these packages in a [virtual environment](https://docs.python.org/3/library/venv.html) if you are running this on your own computer.

# Part 1: Theoretical problems

## Problem 1: Bayes Rule [3 points]

After your yearly checkup, the doctor has bad news and good news. The
bad news is that you tested positive for a very serious cancer and
that the test is 99.5% accurate i.e. the probability of testing
positive given you have the disease is 0.995. The probability of
testing negative if you don’t have the disease is the same (also 0.995). The good news is that it is a rare condition affecting only 1 in 500 people. **What is the probability you actually have the disease?**

After doing all your calculations you realize that there was a misprint on the test, and the accuracy was actually only 95% (both for testing positive given that you have the disease and for testing negative given that you do not have the disease). **How will this change your probability of having the disease?**

Show all calculations and the final result. Remember to use LaTeX math-mode to format mathematical expressions and equations.


In [None]:
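def smitten(prob, acc):
    """Posterior probability of disease given a positive test (Bayes' rule).

    prob: prior probability of having the disease, P(A).
    acc: test accuracy, used both as P(B|A) and as P(not B|not A).
    """
    true = prob * acc  # P(B|A) P(A): positive test and diseased
    false = (1 - prob) * (1 - acc)  # P(B|not A) P(not A): positive test, healthy
    return true / (true + false)

s1 = smitten(1 / 500, 0.995)
s2 = smitten(1 / 500, 0.95)
print(f"Probability of having the disease (test accuracy 0.995): {s1:.1%}")
print(f"Probability of having the disease (test accuracy 0.95): {s2:.1%}")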


Probability of having the disease (test accuracy 0.995): 28.5%
Probability of having the disease (test accuracy 0.95): 3.7%


Bayes' Rule is given by:

$ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} $

Let $A$ be the event of having the disease and $B$ the event of receiving a positive test, so that $P(A)$ is the probability of having the disease and $P(B)$ the probability of receiving a positive test.

The probability of having the disease given receiving a positive test result is thus $ P(A|B) $.

From the data we have that $ P(A) = 1/500, P(B|A) = 0.995$ and $P(\neg B|\neg A)= 0.995 $ (and $ P(B| \neg A) = 0.005$, $P(\neg B| A) = 0.005 $).

$P(B)$ can be calculated as the probability of receiving a positive test while having the disease plus the probability of receiving a positive test while not having the disease, i.e. $P(B) = P(B \land A) + P(B \land \neg A) = P(B|A)P(A) + P(B|\neg A)P(\neg A)$.

Hence, we get the following formula:

$ P(A|B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)} = \frac{0.995 \cdot 1/500}{0.995 \cdot 1/500 + 0.005 \cdot 499/500} \approx 28.5\%$

For $P(B|A) = 0.95$ and $P(\neg B|\neg A)= 0.95$ we get $\frac{0.95 \cdot 1/500}{0.95 \cdot 1/500 + 0.05 \cdot 499/500} \approx 3.7\%$, so the lower accuracy reduces the probability of having the disease from about 28.5% to about 3.7%.

## Problem 2: Setting hyperparameters [2 points]

Suppose $\theta$ is a random variable generated from a beta distribution as: $\theta \sim \text{Beta}(a^2,b)$. Also assume that  the expectation of $\theta$ is $m$: $E[\theta] = m$
and the variance of $\theta$ is $v$: $\text{Var}(\theta) = v$. Express $a$ and $b$ in terms of (only) $m$ and $v$.
For more information about the $\text{Beta}$ distribution see https://en.wikipedia.org/wiki/Beta_distribution.

#### 📝 Your answer here:
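For $\theta \sim \text{Beta}(\alpha, \beta)$ the mean and variance are

$$ m = \frac{\alpha}{\alpha + \beta}, \qquad v = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{m(1-m)}{\alpha+\beta+1}. $$

Solving the second equation for the sum gives $\alpha + \beta = \frac{m(1-m)}{v} - 1$, and the first equation then gives $\alpha = m(\alpha+\beta)$ and $\beta = (1-m)(\alpha+\beta)$. Here $\alpha = a^2$ and $\beta = b$, so (taking the positive root for $a$, and requiring $v < m(1-m)$ so that both parameters are positive):

$$ a = \sqrt{m\left(\frac{m(1-m)}{v} - 1\right)} $$
$$ b = (1 - m)\left(\frac{m(1-m)}{v} - 1\right) $$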
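
As a sanity check on the moment-matching step, the following minimal sketch maps a target mean and variance to $(a, b)$ and then recomputes the mean and variance of $\text{Beta}(a^2, b)$ from the standard closed-form expressions. The function names `beta_hyperparameters` and `beta_mean_var` and the example values $m = 0.3$, $v = 0.01$ are illustrative assumptions, not part of the assignment.

```python
import math


def beta_hyperparameters(m, v):
    """Return (a, b) such that Beta(a**2, b) has mean m and variance v.

    Hypothetical helper: uses alpha + beta = m(1 - m)/v - 1 and takes
    the positive square root for a.
    """
    total = m * (1 - m) / v - 1  # alpha + beta
    if total <= 0:
        raise ValueError("need v < m * (1 - m) for a valid Beta distribution")
    a = math.sqrt(m * total)  # alpha = a**2 = m * (alpha + beta)
    b = (1 - m) * total       # beta = b = (1 - m) * (alpha + beta)
    return a, b


def beta_mean_var(alpha, beta):
    """Closed-form mean and variance of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var


# Round trip: choose (m, v), solve for (a, b), recompute the moments.
a, b = beta_hyperparameters(0.3, 0.01)
mean, var = beta_mean_var(a ** 2, b)
print(a ** 2, b, mean, var)
```

If the round trip does not return the chosen mean and variance (up to floating-point error), the expressions for $a$ and $b$ are wrong.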

## Problem 3: Correlation and Independence [2 points]

Let $X$ be a continuous random variable, uniformly distributed in $[-2, +2]$ and let $Y := X^4$. Clearly $Y$ is not independent of $X$ -- in fact it is uniquely determined by $X$. However, show that the covariance of $X$ and $Y$ is 0: $\text{cov}(X, Y ) = 0$.
Show and justify every step of the proof. Statements like "it is obvious that, it is trivial ..." will not be accepted.

The definition of covariance is given below.

$\mathrm{cov}(X,Y) = \mathrm{E}[(X-\mathrm{E}[X])(Y-\mathrm{E}[Y])] $

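Expanding the product and using linearity of expectation (note that $\mathrm{E}[X]$ and $\mathrm{E}[Y]$ are constants) gives

$$\mathrm{cov}(X,Y) = \mathrm{E}\bigl[XY - X\,\mathrm{E}[Y] - \mathrm{E}[X]\,Y + \mathrm{E}[X]\mathrm{E}[Y]\bigr] = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y] - \mathrm{E}[X]\mathrm{E}[Y] + \mathrm{E}[X]\mathrm{E}[Y] = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y].$$

Substituting $Y = X^4$:

$$\mathrm{cov}(X,Y) = \mathrm{cov}(X,X^4) = \mathrm{E}[X \cdot X^4] - \mathrm{E}[X]\,\mathrm{E}[X^4] = \mathrm{E}[X^5] - \mathrm{E}[X]\,\mathrm{E}[X^4].$$

Since $X$ is uniform on $[-2, 2]$, its density is $f(x) = \frac{1}{4}$ for $x \in [-2, 2]$ and $0$ elsewhere. The required expectations are therefore

$$\mathrm{E}[X] = \int_{-2}^{2} \frac{x}{4}\,dx = \left[\frac{x^2}{8}\right]_{-2}^{2} = \frac{4}{8} - \frac{4}{8} = 0,$$

$$\mathrm{E}[X^5] = \int_{-2}^{2} \frac{x^5}{4}\,dx = \left[\frac{x^6}{24}\right]_{-2}^{2} = \frac{64}{24} - \frac{64}{24} = 0,$$

$$\mathrm{E}[X^4] = \int_{-2}^{2} \frac{x^4}{4}\,dx = \left[\frac{x^5}{20}\right]_{-2}^{2} = \frac{32}{20} + \frac{32}{20} = \frac{16}{5},$$

where the last value only needs to be finite. Hence

$$\mathrm{cov}(X,Y) = \mathrm{E}[X^5] - \mathrm{E}[X]\,\mathrm{E}[X^4] = 0 - 0 \cdot \frac{16}{5} = 0.$$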
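
The moments used in this argument can be double-checked with exact rational arithmetic. The sketch below is only a verification aid, not a proof; the helper name `uniform_moment` is an assumption introduced here. It evaluates $\mathrm{E}[X^k] = \frac{1}{4}\int_{-2}^{2} x^k\,dx$ in closed form and combines the results into the covariance.

```python
from fractions import Fraction


def uniform_moment(k, lo=-2, hi=2):
    """E[X^k] for X ~ Uniform[lo, hi], from the antiderivative x^(k+1)/(k+1)."""
    return Fraction(hi ** (k + 1) - lo ** (k + 1), (k + 1) * (hi - lo))


e_x = uniform_moment(1)   # E[X]
e_x4 = uniform_moment(4)  # E[X^4] = E[Y]
e_x5 = uniform_moment(5)  # E[X^5] = E[XY]

# cov(X, Y) = E[XY] - E[X] E[Y] with Y = X^4
cov = e_x5 - e_x * e_x4
print(e_x, e_x4, e_x5, cov)
```

Both odd moments vanish because the integrand is odd over an interval symmetric about zero, which is exactly what makes the covariance zero despite $Y$ being a function of $X$.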

## Problem 4: Spherical Gaussian estimation [4 points]

Consider a dataset $X$ consisting of i.i.d. observations
generated from a spherical Gaussian distribution $N(\mu, \sigma^2I)$, where $\mu \in \mathbb{R}^p$, $I$
is the $p \times p $ identity matrix, and $\sigma^2$ is a scalar.

Write the mathematical expression for the Maximum Likelihood Estimator (MLE) for $\mu$ and $\sigma$ in the above setup.

For more information about the spherical Gaussian distribution, see https://en.wikipedia.org/wiki/Multivariate_normal_distribution.
For more information about the identity matrix see: https://en.wikipedia.org/wiki/Identity_matrix

#### 📝 Your answer here:

The likelihood function: a single observation $\mathbf{x}_i \in \mathbb{R}^p$ from $\mathcal{N}(\mu, \sigma^2 I)$ has the density

$$
p(\mathbf{x}_i \mid \mu, \sigma^2)
= \frac{1}{\bigl(2 \pi \sigma^2\bigr)^{\tfrac{p}{2}}}
\exp\!\Bigl(-\tfrac{1}{2\sigma^2}\,\|\mathbf{x}_i - \mu\|^2\Bigr),
\qquad \text{where } \|\mathbf{x}_i - \mu\|^2 = (\mathbf{x}_i - \mu)^\top (\mathbf{x}_i - \mu).
$$

Since $\mathbf{x}_1, \dots, \mathbf{x}_n$ are assumed i.i.d., the joint likelihood $L(\mu, \sigma^2)$ is the product of the individual densities:

$$
L(\mu, \sigma^2)
= \prod_{i=1}^n
\frac{1}{\bigl(2 \pi \sigma^2\bigr)^{\tfrac{p}{2}}}
\exp\!\Bigl(-\,\frac{1}{2\,\sigma^2}\,\|\mathbf{x}_i - \mu\|^2\Bigr).
$$

Taking the logarithm turns the product into a sum:

$$
\ell(\mu,\sigma^2)
= \sum_{i=1}^n
\log\!\Bigl[
  \frac{1}{(2\pi\sigma^2)^{\tfrac{p}{2}}}
  \exp\!\bigl(-\tfrac{1}{2\sigma^2}\|\mathbf{x}_i - \mu\|^2\bigr)
\Bigr].
$$

Each term of the sum can be simplified by splitting the logarithm:

$$
\log\Bigl((2\pi\sigma^2)^{-\tfrac{p}{2}}\Bigr)
\;+\;
\log\!\Bigl(\exp\!\bigl(-\tfrac{1}{2\sigma^2}\|\mathbf{x}_i - \mu\|^2\bigr)\Bigr)
= -\tfrac{p}{2}\,\log(2\pi\sigma^2) \;-\; \tfrac{1}{2\sigma^2}\,\|\mathbf{x}_i - \mu\|^2,
$$

so that

$$
\ell(\mu,\sigma^2)
= \sum_{i=1}^n
\Bigl[
  -\tfrac{p}{2}\,\log(2\pi\sigma^2)
  \;-\;
  \tfrac{1}{2\sigma^2}\,\|\mathbf{x}_i - \mu\|^2
\Bigr]
= -\,\tfrac{n p}{2}\,\log\bigl(2\pi\sigma^2\bigr)
\;-\;
\tfrac{1}{2\sigma^2}\,
\sum_{i=1}^n
\|\mathbf{x}_i - \mu\|^2.
$$

To find the MLE of $\mu$, we take the partial derivative of $\ell(\mu,\sigma^2)$ with respect to $\mu$, set it to zero, and solve:

$$
\frac{\partial}{\partial \mu}\,\ell(\mu,\sigma^2)
= -\,\frac{1}{2\,\sigma^2}\;\frac{\partial}{\partial \mu}
\sum_{i=1}^n \|\mathbf{x}_i - \mu\|^2
= \frac{1}{\sigma^2} \sum_{i=1}^n (\mathbf{x}_i - \mu).
$$

$$
\sum_{i=1}^n (\mathbf{x}_i - \mu) = 0
\;\;\Longrightarrow\;\;
\sum_{i=1}^n \mathbf{x}_i = n\,\mu
\;\;\Longrightarrow\;\;
\hat{\mu} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i.
$$

We then do the same for $\sigma^2$. Substituting $\hat{\mu}$ into the log-likelihood and writing $S = \sum_{i=1}^n \|\mathbf{x}_i - \hat{\mu}\|^2$ gives

$$
\ell\bigl(\hat{\mu},\sigma^2\bigr)
= -\,\tfrac{n p}{2}\,\log\bigl(2\pi\sigma^2\bigr)
\;-\;
\frac{1}{2\sigma^2}\,S.
$$

Differentiating with respect to $\sigma^2$ and setting the derivative to zero:

$$
\frac{\partial}{\partial (\sigma^2)} \ell\bigl(\hat{\mu},\sigma^2\bigr)
= -\,\frac{n p}{2}\,\frac{1}{\sigma^2}
\;+\;
\frac{S}{2\,(\sigma^2)^2}
= 0.
$$

Multiplying by $2(\sigma^2)^2$ and solving for $\sigma^2$:

$$
-\,n p\,\sigma^2 + S = 0
\;\;\Longrightarrow\;\;
\hat{\sigma}^2
= \frac{S}{n p}
= \frac{1}{n\,p} \sum_{i=1}^n \|\mathbf{x}_i - \hat{\mu}\|^2.
$$

Since the MLE is invariant under the transformation $\sigma = \sqrt{\sigma^2}$, the estimator for $\sigma$ itself is $\hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n\,p} \sum_{i=1}^n \|\mathbf{x}_i - \hat{\mu}\|^2}$.
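
The two estimators derived above can be tried out numerically. The following is a minimal sketch assuming the data are stored as an $n \times p$ NumPy array with one observation per row; the function name `spherical_gaussian_mle` and the synthetic parameters are illustrative choices, not part of the assignment.

```python
import numpy as np


def spherical_gaussian_mle(X):
    """MLE of (mu, sigma^2) for i.i.d. rows of X drawn from N(mu, sigma^2 I)."""
    n, p = X.shape
    mu_hat = X.mean(axis=0)            # (1/n) * sum_i x_i
    S = np.sum((X - mu_hat) ** 2)      # sum_i ||x_i - mu_hat||^2
    sigma2_hat = S / (n * p)           # S / (n p)
    return mu_hat, sigma2_hat


# Synthetic data with known parameters (assumed values, for illustration only).
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0, 0.5])
true_sigma = 1.5
X = true_mu + true_sigma * rng.standard_normal((5000, 3))

mu_hat, sigma2_hat = spherical_gaussian_mle(X)
print("mu_hat:", mu_hat)
print("sigma_hat:", np.sqrt(sigma2_hat))
```

Note that $\hat{\sigma}^2$ pools the squared deviations over all $p$ coordinates, so it equals the average of the per-coordinate (biased, `ddof=0`) sample variances.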



# Part 2: Practical problems

## Problem 5: Linear Regression with regularization [9 points]

You are newly recruited as a Data Scientist at a leading consultancy company in Gothenburg. Your first task at the job is to help the Swedish Public Health Agency (Folkhälsomyndigheten) predict the diabetes progression of patients. Assume that you are given a dataset D of $n$ patients with $D = \{ (\mathbf{x}_i, y_i)\}_{i=1}^n$ where $\mathbf{x}_i \in \mathbb{R}^p$ represents the numerical features of each patient and $y_i \in \mathbb{R}$ represents the numerical diabetes progression. One can also view the dataset D as a pair of matrices $(\mathbf{X}, \mathbf{y})$ with $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{y} \in \mathbb{R}^{n \times 1}$.

Fresh from the lectures in the machine learning course at Chalmers, you would like to use a linear model to quickly perform the task. In other words, you would like to find a vector $\mathbf{w} \in \mathbb{R}^{p \times 1}$ such that $\mathbf{y} = \mathbf{X} \mathbf{w}$. However, you have just read one of the most popular machine learning books and it argues that standard linear regression (for finding $\mathbf{w}$) can lead to various problems such as non-uniqueness of the solution, overfitting, etc. As a result, you decide to add a penalty term, called regularization, to control the optimisation problem. More specifically, you want to solve for: $\min_{\mathbf{w}}  \mathcal{L}(\mathbf{w})$ where  $\mathcal{L}(\mathbf{w}) = \left(\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 \right) + \left(\alpha \sum_{j=1}^p w_j^2 \right) $ with $\alpha \in \mathbb{R}$ a small coefficient that you will decide later on.

Please note the slight changes in the notation. Recall that in the lectures we had a dataset $\{ (\mathbf{x}_n, t_n)\}_{n=1}^N$ where $\mathbf{x}_n \in \mathbb{R}^D$ and $t_n \in \mathbb{R}$. We also appended $1$ to the beginning of $\mathbf{x}_n$ so both $\mathbf{x}_n$ and $\mathbf{w}$ were in $\mathbb{R}^{D+1}$. Thus, here $p$ is the same thing as $D+1$. Compare $w_1, w_2, \dots, w_p$ with $w_0, w_1, \dots, w_D$.

### a) Expression for loss function [1 point]
Write down $\mathcal{L}(\mathbf{w})$ in matrix/vector form using only $\mathbf{X}$, $\mathbf{y}$, $\mathbf{w}$ and the L2 norm. In other words, you are not allowed to use any of the components $y_i$, $w_j$ or $\mathbf{x}_i$. (For any vector $\mathbf{z}$, use the notation $\|\mathbf{z}\|_2$ to mean the L2 norm of $\mathbf{z}$. See http://mathworld.wolfram.com/L2-Norm.html for more information about the L2 norm.)

$\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha \|\mathbf{w}\|_2^2$

### b) Gradient of loss function [1 point]
Derive and write down in matrix/vector forms the gradient of $\mathcal{L}(\mathbf{w})$ with respect to $\mathbf{w}$. Show all the derivations. (Hint: You can start by  computing the gradient of the full expression and then convert it to matrix/vector forms. You can also directly get the gradients from your answer in a))

#### 📝 Your answer here:

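Writing the loss as $\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha \|\mathbf{w}\|_2^2$ and expanding the squared norms:

$$\mathcal{L}(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + \alpha\,\mathbf{w}^T\mathbf{w} = \mathbf{y}^T\mathbf{y} - 2\,\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} + \alpha\,\mathbf{w}^T\mathbf{w}.$$

Differentiating term by term, using $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{b}) = \mathbf{b}$ and $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$:

$$\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = -2\,\mathbf{X}^T\mathbf{y} + 2\,\mathbf{X}^T\mathbf{X}\mathbf{w} + 2\alpha\,\mathbf{w} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\alpha\,\mathbf{w}.$$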
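
A quick way to guard against sign or factor-of-two mistakes in a hand-derived gradient is to compare it with central finite differences. The sketch below does this for the regularized loss $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_2^2$ and the candidate gradient $-2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\alpha\mathbf{w}$. All function names and the random test data are assumptions made for illustration; the list comprehension is used only in this check and is not meant as a template for the loop-free implementation requested later.

```python
import numpy as np


def ridge_loss(w, X, y, alpha):
    """Squared error plus L2 penalty: ||y - Xw||^2 + alpha * ||w||^2."""
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)


def ridge_gradient(w, X, y, alpha):
    """Candidate analytic gradient: -2 X^T (y - Xw) + 2 alpha w."""
    return -2 * X.T @ (y - X @ w) + 2 * alpha * w


def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences along each coordinate direction."""
    steps = eps * np.eye(w.size)  # row j perturbs only coordinate j
    return np.array([(f(w + d) - f(w - d)) / (2 * eps) for d in steps])


# Small random problem (arbitrary sizes, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
w = rng.standard_normal(4)
alpha = 0.5

analytic = ridge_gradient(w, X, y, alpha)
numeric = numerical_gradient(lambda v: ridge_loss(v, X, y, alpha), w)
print(np.max(np.abs(analytic - numeric)))
```

Because the loss is quadratic in $\mathbf{w}$, central differences are exact up to rounding, so the printed discrepancy should be tiny.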


### c) Derive optimal solution [2 points]
Derive and write down in matrix/vector forms the solution $\mathbf{w}^*$ to the optimization problem $\min_{\mathbf{w}}  \mathcal{L}(\mathbf{w})$. Show all your derivations. (Hint: $\mathcal{L}(\mathbf{w})$ is convex in $\mathbf{w}$)

#### 📝 Your answer here:
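
One possible outline of the derivation, given as a sketch rather than a model answer: it assumes the gradient $\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}) = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\alpha\,\mathbf{w}$ and that the matrix $\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I}$ is invertible. Because $\mathcal{L}$ is convex (as the hint states), any stationary point is a global minimizer, so it is enough to set the gradient to zero:

$$
\begin{aligned}
-2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\alpha\,\mathbf{w} &= \mathbf{0} \\
\mathbf{X}^T\mathbf{X}\mathbf{w} + \alpha\,\mathbf{w} &= \mathbf{X}^T\mathbf{y} \\
(\mathbf{X}^T\mathbf{X} + \alpha\,\mathbf{I})\,\mathbf{w} &= \mathbf{X}^T\mathbf{y} \\
\mathbf{w}^* &= (\mathbf{X}^T\mathbf{X} + \alpha\,\mathbf{I})^{-1}\,\mathbf{X}^T\mathbf{y}.
\end{aligned}
$$

Here $\mathbf{I}$ is the $p \times p$ identity matrix.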

### d) Uniqueness of solution [2 points]
Under which condition on $\alpha$ is the solution $\mathbf{w}^*$ unique? Rigorously prove your statement. Make no assumptions on $\mathbf{X}$. (Hint: If your solution $\mathbf{w}^*$ requires inverting a matrix, then one necessary condition for uniqueness is that the matrix is invertible. Any positive definite matrix https://en.wikipedia.org/wiki/Definiteness_of_a_matrix is invertible. You might also want to look at the properties of transposition https://en.wikipedia.org/wiki/Transpose)


#### 📝 Your answer here:
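
A sketch of one possible argument, assuming the solution takes the form $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$: the claim is that $\alpha > 0$ suffices for uniqueness, whatever $\mathbf{X}$ is. For any $\mathbf{z} \in \mathbb{R}^p$ with $\mathbf{z} \neq \mathbf{0}$,

$$
\mathbf{z}^T(\mathbf{X}^T\mathbf{X} + \alpha\,\mathbf{I})\,\mathbf{z}
= (\mathbf{X}\mathbf{z})^T(\mathbf{X}\mathbf{z}) + \alpha\,\mathbf{z}^T\mathbf{z}
= \|\mathbf{X}\mathbf{z}\|_2^2 + \alpha\,\|\mathbf{z}\|_2^2
\;\geq\; \alpha\,\|\mathbf{z}\|_2^2 \;>\; 0 .
$$

The first equality uses $(\mathbf{X}\mathbf{z})^T = \mathbf{z}^T\mathbf{X}^T$, and the final inequality uses $\alpha > 0$ and $\mathbf{z} \neq \mathbf{0}$. So $\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I}$ is positive definite and therefore invertible, and the linear system $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ has exactly one solution. The same matrix, multiplied by 2, is the Hessian of $\mathcal{L}$, so $\mathcal{L}$ is strictly convex and this solution is its unique minimizer. For $\alpha = 0$ the matrix reduces to $\mathbf{X}^T\mathbf{X}$, which is invertible only if $\mathbf{X}$ has full column rank (an assumption on $\mathbf{X}$), and for $\alpha < 0$ the matrix can be singular or indefinite.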


### e) Implement linear model fitting [2 points]
In the code block below, implement a well-commented function `fit_linear_with_regularization` that takes as input $\mathbf{X}$, $\mathbf{y}$ and $\alpha$ and returns $\mathbf{w}^*$ as computed in part c). You are not allowed to use any loops (for-loop, while-loop ...) in the implementation. Instead you must use numpy operations as much as possible.

Fill in the `TODO` sections in the `fit_linear_with_regularization` function below.

### f) Implement linear model prediction [3 points]
In the code block below, implement a well-commented function `predict` that takes as input a dataset $\mathbf{X_{\text{test}}}$ with the same number of features as $\mathbf{X}$ (together with the fitted $\mathbf{w}$) and returns the predictions. Write down the mean squared error (https://en.wikipedia.org/wiki/Mean_squared_error) of your predictions. Then, on the same plot with legends, show:

 a) A scatter plot of the first feature of $\mathbf{X_{\text{test}}}$ (x-axis) and the diabetes progression $\mathbf{y_{\text{test}}}$

 b) A plot of your prediction for $\mathbf{X_{\text{test}}}$

 The skeleton code in the cell below already implements most of the data loading and you should only have to fill in the `TODO` sections. Again, no loops (for-loop, while-loop) are allowed in the implementation of the plots or of `predict`.


In [None]:
# Make it possible to show plots in the notebooks.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets


def fit_linear_with_regularization(X, y, alpha):
  """TODO"""
  return np.random.randn(X.shape[1]) # TODO: Change me

def predict(X_test, w):
  """TODO"""
  return np.random.randn(X_test.shape[0]) # TODO: Change me

def plot_prediction(X_test, y_test, y_pred):
  """TODO"""
  # Scatter plot the first feature of X_test (x-axis) and y_test (y-axis)
  plt.plot(X_test) # TODO: Change me

  # TODO: Plot y_pred using the first feature of X_test as x-axis

  # TODO: Compute the mean squared error
  mean_squared_error = 0.
  return mean_squared_error

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True)

# Split the dataset into training and test set
num_test_elements = 20

X_train = X[:-num_test_elements]
X_test = X[-num_test_elements:]

y_train = y[:-num_test_elements]
y_test = y[-num_test_elements:]

# Set alpha
alpha = 0.01

# Train using linear regression with regularization and find optimal model
w = fit_linear_with_regularization(X_train, y_train, alpha)

# Make predictions using the testing set X_test
y_pred = predict(X_test, w)

# Plots and mean squared error
error = plot_prediction(X_test, y_test, y_pred)
print('Mean Squared error is ', error)

# Show the plot
plt.show()
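
For orientation, the closed-form regularized least-squares computation can be sketched without loops as follows. This is an illustrative sketch under stated assumptions, not the reference solution: it assumes the solution has the form $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, uses hypothetical function names ending in `_sketch` to avoid clashing with the skeleton above, and runs on synthetic data rather than the diabetes dataset.

```python
import numpy as np


def fit_ridge_sketch(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y for w without explicit loops."""
    p = X.shape[1]
    # np.linalg.solve avoids forming the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


def predict_sketch(X_test, w):
    """Linear predictions: one dot product per row of X_test."""
    return X_test @ w


def mean_squared_error_sketch(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)


# Synthetic regression problem with known weights (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ w_true + 0.1 * rng.standard_normal(100)

w_demo = fit_ridge_sketch(X_demo, y_demo, alpha=0.01)
mse_demo = mean_squared_error_sketch(y_demo, predict_sketch(X_demo, w_demo))
print("estimated weights:", w_demo)
print("training MSE:", mse_demo)
```

The sketch has no intercept term. If the features do not already contain a column of ones, as in the lecture notation where $1$ is appended to each $\mathbf{x}_n$, adding one before fitting is a design choice worth considering, since the targets need not be centered around zero.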

# Bonus problem
Answering this question will not give you any additional points. Not answering this question will not prevent you from getting full points (if all other questions are answered correctly). However, if you answer this question, we will pick exactly one question where you didn't receive full points in this assignment and give you **at most** 4 more points there. In particular, between the questions for which you have reasonably attempted a solution, we will pick the one where the difference between the full point and the point you received is the maximum.

## Problem 6: Bayesian Linear Regression [up to 4 bonus points]

Proud of finishing the task using a linear model with regularization, you show your results to a representative of the Swedish Public Health Agency. You have barely finished explaining your solution when the face of the representative turns red and you distinctly hear: "Bayesian is the only way: How come you didn't use any probabilities?".

You quickly head back to your desk and now assume a Gaussian prior on the solution $\mathbf{w}$, that is $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I})$, where $\lambda \in \mathbb{R}$ is a constant real number, $\mathbf{I}$ is the $p \times p$ identity matrix and $\mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I})$ denotes the multivariate Gaussian distribution with mean $\mathbf{0} \in \mathbb{R}^p$ (a vector of zeros of dimension $p$) and covariance matrix $\lambda^{-1} \mathbf{I}$. Then, you use the following likelihood:

$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \prod_{i=1}^n \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \gamma^{-1})$$

where $\gamma \in \mathbb{R}$ is a constant real number and $\mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \gamma^{-1})$ is the Gaussian distribution with mean  $\mathbf{w}^T \mathbf{x}_i$ and variance $\gamma^{-1}$.



### a) Derive log posterior [2 points]
Derive an expression for the log posterior $\ln p(\mathbf{w} | \mathbf{y}, \mathbf{X})$ in vector/matrix form as a function of $\mathbf{X}, \mathbf{y}, \mathbf{w}$. Show all derivations. You can ignore normalizing constants.



#### 📝 Your answer here:
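
A sketch of how the derivation can go, assuming $\lambda > 0$ and $\gamma > 0$ so that both Gaussians are well defined. By Bayes' rule, $\ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) + \ln p(\mathbf{w}) - \ln p(\mathbf{y} \mid \mathbf{X})$, where the last term does not depend on $\mathbf{w}$. Writing out the two Gaussian log-densities and collecting every term that does not involve $\mathbf{w}$ into a constant:

$$
\begin{aligned}
\ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
&= \sum_{i=1}^n \Bigl[ \tfrac{1}{2}\ln\tfrac{\gamma}{2\pi} - \tfrac{\gamma}{2}\,(y_i - \mathbf{w}^T\mathbf{x}_i)^2 \Bigr]
= -\tfrac{\gamma}{2}\,\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \text{const}, \\
\ln p(\mathbf{w})
&= \tfrac{p}{2}\ln\tfrac{\lambda}{2\pi} - \tfrac{\lambda}{2}\,\mathbf{w}^T\mathbf{w}
= -\tfrac{\lambda}{2}\,\|\mathbf{w}\|_2^2 + \text{const}, \\
\ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})
&= -\tfrac{\gamma}{2}\,\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 - \tfrac{\lambda}{2}\,\|\mathbf{w}\|_2^2 + \text{const}.
\end{aligned}
$$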


### b) Equivalence of Bayesian and regularized regression [2 points]
Show that maximizing the posterior in a) is equivalent to minimizing the function $\mathcal{L}(\mathbf{w})$ seen in the previous section. Show your derivations. (Note: You should show this without computing the maximum of the posterior. Instead, you should express the log posterior in terms of $\mathcal{L}(\mathbf{w})$, ignoring constants if necessary. Then find the $\alpha$ of $\mathcal{L}(\mathbf{w})$ in terms of $\lambda$ and $\gamma$).

#### 📝 Your answer here:
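
A sketch of the comparison, assuming $\gamma > 0$ and that the log posterior has the form $\ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = -\tfrac{\gamma}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 - \tfrac{\lambda}{2}\|\mathbf{w}\|_2^2 + \text{const}$. Factoring out $-\tfrac{\gamma}{2}$:

$$
\ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})
= -\tfrac{\gamma}{2}\Bigl[\,\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \tfrac{\lambda}{\gamma}\,\|\mathbf{w}\|_2^2\Bigr] + \text{const}
= -\tfrac{\gamma}{2}\,\mathcal{L}(\mathbf{w}) + \text{const},
\qquad \text{with } \alpha = \frac{\lambda}{\gamma}.
$$

Since $-\tfrac{\gamma}{2} < 0$ and the constant does not depend on $\mathbf{w}$, the $\mathbf{w}$ that maximizes the log posterior is exactly the $\mathbf{w}$ that minimizes $\mathcal{L}(\mathbf{w})$ with $\alpha = \lambda / \gamma$.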