# Exercise sheet 11
**Hello everyone!**

**Points: 15**

Please let us know if you have questions or problems! <br>
Contact us during the exercise session, on [ADAM](https://adam.unibas.ch/goto_adam_crs_1266890.html), [Piazza](https://piazza.com/class/kzy15kp8s5t6ku), or [via email](https://sada.dmi.unibas.ch/en/teaching/pids22).

Please submit this exercise sheet on **ADAM**.
Naming conventions:

1. Make a folder called "exercise11".
2. Put your submission "Exercise sheet 11.ipynb" in there.
3. Complete the sheet. Only put code or text inside the blocks where "# YOUR CODE HERE" or "YOUR ANSWER HERE" is written. Everything else will be deleted during grading. Don't add new blocks.
4. Then zip the folder called "exercise11". This will create a zip file called "exercise11.zip". Rename that zip file to your Unibas short name, e.g. "blabla0000.zip".

Common mistakes:
- Don't use capital letters in your short name or in the exercise folder name.
- Don't put previous sheets or datasets in the submission folder. Include only this sheet, in .ipynb format.



**Handout date**: 2022/05/24 <br>
**Submission date**: 2022/05/31 <br>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from nose.tools import assert_is_instance, assert_equal, assert_almost_equal, assert_true

In this exercise, we perform a simple regression task using the correlation coefficient, as discussed in the last lecture.
First, let's create the dataset:

In [None]:
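def perturbed_line_data(N):
    np.random.seed(0)  # fixed seed so every run yields the same dataset
    x = np.random.uniform(low=-2, high=+2, size=N)  # inputs, uniform on [-2, 2)
    n = np.random.randn(N) * 0.1  # Gaussian noise with standard deviation 0.1
    a = np.random.rand(1)  # random slope in [0, 1)
    b = np.random.rand(1)  # random intercept in [0, 1)
    y = (b + a * x) + n  # noisy line

    return x, y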

N = 200 # The number of samples
x,y = perturbed_line_data(N)

ax = plt.subplot(111)
ax.scatter(x, y, alpha=0.5)
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

### 1a (6 points)
Now we would like to predict the corresponding $\hat{y}$ for a given $x$. As discussed in the last lecture, we can make a linear estimator as follows:
$$
\hat{y} = \mu_y + \rho \left( \frac{x - \mu_x}{\sigma_x} \right) \sigma_y
$$
In this section, we want to implement this linear estimator for the given dataset. First, calculate the correlation coefficient $\rho$ and store it in `rho`; then use it to calculate $\hat{y}$ and store it in `y_hat`.

Hint: All you need is NumPy!
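
To illustrate the hint, here is a minimal sketch of two ways to obtain a correlation coefficient with NumPy. The arrays `x_toy` and `y_toy` are hypothetical toy data, not the exercise dataset, and this is only one possible approach.

```python
import numpy as np

# Hypothetical toy data (NOT the exercise dataset).
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation from its definition: the mean product of z-scores.
z_x = (x_toy - x_toy.mean()) / x_toy.std()
z_y = (y_toy - y_toy.mean()) / y_toy.std()
rho_manual = np.mean(z_x * z_y)

# The same quantity read off the 2x2 correlation matrix.
rho_numpy = np.corrcoef(x_toy, y_toy)[0, 1]

print(rho_manual, rho_numpy)
```

Both values should agree up to floating-point error, as long as the mean and standard deviation use the same normalization (here NumPy's default, `ddof=0`).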


In [None]:
N = 200
x,y = perturbed_line_data(N)

# YOUR CODE HERE
raise NotImplementedError()
# rho = 
# y_hat = 
print('Correlation coefficient: {:0.3f}'.format(rho))

ax = plt.subplot(111)
ax.scatter(x, y, alpha=0.5)
ax.plot(x, y_hat, alpha=0.5 , color = 'r')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

In [None]:
print(len(y_hat))
print(y_hat.mean())
print(y_hat.std())
assert_equal(len(y_hat), N)
assert_almost_equal(y_hat.mean(), 0.294 , places=2)
assert_almost_equal(y_hat.std() , 0.497 , places=2)

Why is this estimator called "linear"? Can you explain it from the formula and the plot?

YOUR ANSWER HERE

### 1b (2 points)
Since we will use this linear estimator in the next parts, it is convenient to wrap it in a function that takes the samples $x$ and $y$ as input and returns the predictions $\hat{y}$ and the correlation coefficient $\rho$:

In [None]:
def linear_regressor(x,y):
    # YOUR CODE HERE
    raise NotImplementedError()
    return y_hat, rho  

y_hat, rho = linear_regressor(x,y)
print('Correlation coefficient: {:0.3f}'.format(rho))

In [None]:
print(len(y_hat))
print(y_hat.mean())
print(y_hat.std())
assert_equal(len(y_hat), N)
assert_almost_equal(y_hat.mean(), 0.294 , places=2)
assert_almost_equal(y_hat.std() , 0.497 , places=2)

We want to assess the performance of our linear regression on a more complicated dataset. First, let's create it:

In [None]:
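def perturbed_cos_data(N):
    np.random.seed(0)  # fixed seed so every run yields the same dataset
    x = np.random.uniform(low=-2, high=2, size=N)  # inputs, uniform on [-2, 2)
    n = np.random.rand(N) * 0.5  # uniform noise on [0, 0.5), not zero-mean
    y = np.cos(4.5 * x) + n  # noisy cosine

    return x, y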

N = 500 # The number of samples
x,y = perturbed_cos_data(N)

ax = plt.subplot(111)
ax.scatter(x, y, alpha=0.5)
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

### 1c (2 points)
Now let's use the linear regressor function to calculate $\hat{y}$ and the correlation coefficient for this dataset:

In [None]:
N = 500 # The number of samples
x,y = perturbed_cos_data(N)

# YOUR CODE HERE
raise NotImplementedError()
# y_hat, rho = 

print('Correlation coefficient: {:0.3f}'.format(rho))

ax = plt.subplot(111)
ax.scatter(x, y, alpha=0.5)
ax.plot(x, y_hat, alpha=0.5 , color = 'r')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

In [None]:
print(len(y_hat))
print(y_hat.mean())
print(y_hat.std())
assert_equal(len(y_hat), N)
assert_almost_equal(y_hat.mean(), 0.247 , places=2)
assert_almost_equal(y_hat.std() , 0.032 , places=2)

Is this a reasonable estimate? How could we design a better estimator?

Can you compare $\rho$ for these two datasets? Which one is greater, and what does it mean to you?
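
As a hint for the comparison, the sketch below uses a hypothetical toy dataset (`x_toy`, `y_toy`, not the exercise data) in which $y$ is fully determined by $x$, yet the relationship is not linear. It may help to think about what the resulting correlation coefficient does and does not measure.

```python
import numpy as np

# Hypothetical toy data: y depends on x deterministically, but not linearly.
x_toy = np.linspace(-1.0, 1.0, 101)  # symmetric around zero
y_toy = x_toy ** 2

# Pearson correlation only captures the linear part of the dependence.
rho_toy = np.corrcoef(x_toy, y_toy)[0, 1]
print('Correlation coefficient: {:0.3f}'.format(rho_toy))
```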

YOUR ANSWER HERE

### 1d (5 points)
SciPy provides a linear regression function, `scipy.stats.linregress`, that is similar to our function. Use it to calculate the regression line's slope, intercept, and correlation coefficient ($\rho$).
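
For orientation, here is a minimal sketch of calling `linregress` on hypothetical toy data (`x_toy`, `y_toy`, not the exercise dataset). It reads the `slope`, `intercept`, and `rvalue` attributes of the result and checks numerically that the fitted slope matches $\rho \, \sigma_y / \sigma_x$, the slope implied by the estimator formula from 1a.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical toy data (NOT the exercise dataset).
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

result = linregress(x_toy, y_toy)

# Slope implied by the correlation-based estimator: rho * sigma_y / sigma_x.
slope_from_rho = result.rvalue * y_toy.std() / x_toy.std()
# The fitted line passes through the point (mean of x, mean of y).
intercept_from_means = y_toy.mean() - result.slope * x_toy.mean()

print(result.slope, slope_from_rho)
print(result.intercept, intercept_from_means)
```

Each printed pair should agree up to floating-point error, which suggests that the two approaches describe the same line.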

In [None]:
from scipy.stats import linregress
N = 200
x,y = perturbed_line_data(N)

# YOUR CODE HERE
raise NotImplementedError()
# slope = 
# intercept = 
# rho = 
print('Correlation coefficient: {:0.3f}'.format(rho))

ax = plt.subplot(111)
ax.scatter(x, y, alpha=0.5)
ax.plot(x, slope * x + intercept, alpha=0.5 , color = 'r')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

In [None]:
print(slope * intercept * rho)
print(slope/intercept)
assert_almost_equal(slope * intercept * rho, 0.126 , places=2)
assert_almost_equal(slope/intercept , 1.494 , places=2)