# Tutorial 5 - Regression Simulation


*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---

## This week's lab:

We will show you how to perform linear regression with one or more predictors in Python. We will start by creating synthetic data from a regression model with known parameters, then fit a linear regression model to that data and check how well we can recover the parameters of the first model. This approach is very common and provides a benchmark for evaluating the performance of different models. After visualizing the regression and evaluating model performance, we will use bootstrapping to estimate confidence intervals around our parameter values, which show how certain we are about them. Finally, we will show you how to use regression models to test hypotheses.


In this notebook, many explanations are included as comments in the code cells. Please read them carefully instead of just pressing the run button.  

**Learning goals:** \
After finishing this tutorial, you should be able to ...
* generate synthetic data from a linear model
* fit a linear model to data and evaluate its performance
* model uncertainty around your parameter estimates
* test hypotheses with linear models


**Estimated time to complete:** 2 hours \
**Deadline:** Next Wednesday, 10:00


## 1. Import libraries

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from scipy import linalg, stats

## 2. Simulating data based on a regression model

* Equation for the predicted line:
$Y_{pred}=B_0+B_1X$

* Equation for the data generation:
$Y=B_0+B_1X+Error$

In our example, we are trying to predict the price of an apartment based on the age of the house. To start, we randomly generate the ages of 150 houses below, as whole years from 0 up to (but not including) 200.

In [None]:
n = 150
age = np.random.randint(0, 200, n)

Now we want to generate apartment prices for these 150 houses using a linear regression model. For our scenario, let's assume that

1. a brand new apartment costs EUR 500,000,
2. the error in the model has a standard deviation of EUR 100,000, and
3. for each additional year, the price is lowered by EUR 1,000.

With this information, we are ready to go. A brand new apartment is 0 years old, so the price for an apartment with $age=0$ represents our intercept (`b0`). The price change per year is -EUR 1,000, which is our slope (`b1`), and the error, well ... Let's put everything together:

In [None]:
b0 = 500_000
b1 = -1_000
sd = 100_000
err = np.random.normal(0, sd, n)

price = b0 + b1 * age + err

price[:5]  # show the first 5

To get a better feel for what our data looks like, it's often useful to visualize it. We will use a scatterplot and also add some meaningful labels using Matplotlib, specifically its `pyplot` submodule, which we imported earlier under the alias `plt`.

In [None]:
plt.scatter(x=age, y=price)
plt.xlabel("age", fontsize=14)
plt.ylabel("price (EUR)", fontsize=14)

## 3. Fitting a regression line

Next, we fit a regression model to the data using least squares estimation (`linalg.lstsq()`). This requires adding a column of ones to the predictor variable. For now, it doesn't matter much why, but for those who are interested: regression models are computed as a matrix-vector multiplication for efficiency, and adding a column of ones is a way to include the intercept term (`b0`) in the matrix equation.
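
To make the role of the column of ones concrete, here is a minimal sketch. The names `age_demo`, `beta_demo` and `X_demo` and the three toy ages are made up for illustration and are not part of the tutorial data.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
age_demo = np.array([0.0, 10.0, 50.0])
beta_demo = np.array([500_000.0, -1_000.0])  # [b0, b1]

# Design matrix: a column of ones (for the intercept) next to the predictor
X_demo = np.column_stack((np.ones(len(age_demo)), age_demo))

# A single matrix-vector product computes b0 * 1 + b1 * age for every row
pred_matrix = X_demo @ beta_demo
pred_manual = beta_demo[0] + beta_demo[1] * age_demo

print(X_demo)
print(pred_matrix)
print(np.allclose(pred_matrix, pred_manual))
```

Each row of the matrix multiplies the intercept by 1 and the slope by the age, which is exactly the equation for the predicted line.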

In [None]:
X = np.column_stack((np.ones(n), age))
print(np.shape(X))
reg_result = linalg.lstsq(X, price)

In [None]:
X[0:10, :]

In [None]:
?linalg.lstsq

`lstsq` returns several values. The first one is the most important for us now: it holds the fitted coefficients (two values here, the intercept and the slope), so `print` it out. The second one is the sum of squared residuals (the total squared error); `print` it out as well.
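
As a small sketch of what `scipy.linalg.lstsq` hands back, the four return values can be unpacked into named variables. The data here is freshly simulated with a fixed seed, and the variable names (`coefs`, `sse`, and so on) are our own choice, not names used elsewhere in this notebook.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
x_demo = rng.uniform(0, 200, 50)
y_demo = 500_000 - 1_000 * x_demo + rng.normal(0, 100_000, 50)
X_demo = np.column_stack((np.ones(50), x_demo))

# lstsq returns: coefficients, sum of squared residuals,
# rank of the design matrix, singular values of the design matrix
coefs, sse, rank, singular_values = linalg.lstsq(X_demo, y_demo)

print("intercept and slope:", coefs)
print("sum of squared residuals:", sse)
print("rank of X:", rank)
```

For this tutorial only the first two values matter; the rank and singular values are diagnostics of the design matrix.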

In [None]:
print(reg_result[0])
# print(reg_result[1])

# ?linalg.lstsq

In [None]:
reg_result[0][1]

You can hopefully observe that the fitted values are similar to the ones we used to create the data, but not exactly the same.
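
If you want an independent check of the fitted values, one option is `scipy.stats.linregress`, which fits the same one-predictor model. The sketch below uses its own simulated data with an assumed seed, so the numbers will differ from those in your notebook.

```python
import numpy as np
from scipy import linalg, stats

rng = np.random.default_rng(1)  # fixed seed: an assumption, for reproducibility
x_demo = rng.integers(0, 200, 150)
y_demo = 500_000 - 1_000 * x_demo + rng.normal(0, 100_000, 150)

# Fit with lstsq, as in the tutorial
X_demo = np.column_stack((np.ones(150), x_demo))
b0_hat, b1_hat = linalg.lstsq(X_demo, y_demo)[0]

# Fit the same model with linregress as a cross-check
check = stats.linregress(x_demo, y_demo)

print("intercept:", b0_hat, check.intercept)
print("slope:", b1_hat, check.slope)
```

Both approaches should agree up to numerical precision, while both differ from the true values of 500,000 and -1,000 because of the random error.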


## 4. Visualizing the regression line
1. Use the scatter plot again to visualize the age and price data, as before.
2. Add the regression line (in red), based on the result of `lstsq()`.

In [None]:
price_pred = reg_result[0][0] + reg_result[0][1] * age

In [None]:
plt.scatter(age, price)
plt.plot(age, price_pred, color="r")
plt.xlabel("age", fontsize=14)
plt.ylabel("price €", fontsize=14)

In [None]:
residuals = price - price_pred
residuals[:5]  # show the first 5

## 5. Calculate the residuals and the total error for the fitted model

Use the `c=` argument of `plt.scatter` to color the dots based on the residual error. The regression line can again be added with `plt.plot`, but take some care with how you pass the x values, since they are in random order. You can also try coloring by the squared error.
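
For a straight line the order of the x values does not change the picture, but for curved predictions unsorted values produce a zigzag. A minimal sketch of sorting both arrays with the same index, using made-up values and hypothetical names (`x_demo`, `y_line`):

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed: an assumption, for reproducibility
x_demo = rng.integers(0, 200, 10)
y_line = 500_000 - 1_000 * x_demo  # hypothetical predicted values

# argsort gives the index order that sorts x; apply it to both arrays
order = np.argsort(x_demo)
x_sorted = x_demo[order]
y_sorted = y_line[order]

print(x_sorted)
print(y_sorted)
```

The sorted pair can then be passed to `plt.plot(x_sorted, y_sorted, color="r")`.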

In [None]:
plt.scatter(age, price, c=residuals**2)
plt.plot(age, price_pred, color="r")
plt.xlabel("age", fontsize=14)
plt.ylabel("price (EUR)", fontsize=14)

plt.colorbar()

In [None]:
plt.scatter(age, price, c=residuals)
plt.plot(age, price_pred, color="r")
plt.xlabel("age", fontsize=14)
plt.ylabel("price (EUR)", fontsize=14)

plt.colorbar()

Compare what you calculated with the output of `linalg.lstsq`.

In [None]:
print(reg_result[1])

print(np.sum((price - price_pred) ** 2))
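
The sum of squared residuals depends on the scale of the data and on the number of points, so it can be hard to interpret on its own. As a sketch, two common summaries derived from it are the root mean squared error and R². These are not computed elsewhere in this notebook, and the data below is simulated separately with an assumed seed.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(3)  # fixed seed: an assumption, for reproducibility
x_demo = rng.integers(0, 200, 150)
y_demo = 500_000 - 1_000 * x_demo + rng.normal(0, 100_000, 150)
X_demo = np.column_stack((np.ones(150), x_demo))

coefs, sse, _, _ = linalg.lstsq(X_demo, y_demo)

# Total sum of squares: error of a model that always predicts the mean
sst = np.sum((y_demo - y_demo.mean()) ** 2)

r_squared = 1 - sse / sst            # share of variance explained
rmse = np.sqrt(sse / len(y_demo))    # typical size of a residual, in EUR

print("R squared:", r_squared)
print("RMSE:", rmse)
```

With one predictor and an intercept, R² equals the squared correlation between predictor and outcome.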

## 6. Bootstrapping for a confidence interval in the regression line

+++ Advanced Part 1 +++

Resample the data with replacement and visualize the obtained confidence interval for the regression line.

In [None]:
nsim = 100
# YOUR CODE
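
One possible approach, shown here only as a sketch, is the percentile bootstrap: resample (age, price) pairs with replacement, refit the line each time, and take the 2.5th and 97.5th percentiles of the fitted lines. The data is simulated inside the block with an assumed seed, and names such as `grid`, `lines` and `nsim_demo` are our own.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(4)  # fixed seed: an assumption, for reproducibility
n_demo = 150
x_demo = rng.integers(0, 200, n_demo)
y_demo = 500_000 - 1_000 * x_demo + rng.normal(0, 100_000, n_demo)

nsim_demo = 100
grid = np.linspace(0, 200, 50)            # ages at which each line is evaluated
lines = np.empty((nsim_demo, grid.size))  # one fitted line per bootstrap sample
slopes = np.empty(nsim_demo)

for i in range(nsim_demo):
    # Draw n row indices with replacement; x and y stay paired
    idx = rng.integers(0, n_demo, n_demo)
    X_boot = np.column_stack((np.ones(n_demo), x_demo[idx]))
    b0_boot, b1_boot = linalg.lstsq(X_boot, y_demo[idx])[0]
    slopes[i] = b1_boot
    lines[i] = b0_boot + b1_boot * grid

# 95% percentile interval at every grid point, and for the slope itself
lower, upper = np.percentile(lines, [2.5, 97.5], axis=0)
slope_ci = np.percentile(slopes, [2.5, 97.5])

print("95% interval for the slope:", slope_ci)
```

The band can then be drawn over the scatter plot with `plt.fill_between(grid, lower, upper, alpha=0.3)`. With only 100 resamples the interval limits are fairly noisy, so more simulations give a smoother band.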

## 7. Hypothesis test with randomization

+++ Advanced Part 2 +++


Is the relationship between age and price different from chance?
Use randomization to simulate 1000 slopes under the null hypothesis of no relationship.
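
A randomization (permutation) test can be sketched as follows: shuffling the prices breaks any link between age and price, so slopes fitted to shuffled data show what chance alone produces. The block simulates its own data with an assumed seed, and the two-sided p-value at the end is one common choice rather than the only one.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(5)  # fixed seed: an assumption, for reproducibility
n_demo = 150
x_demo = rng.integers(0, 200, n_demo)
y_demo = 500_000 - 1_000 * x_demo + rng.normal(0, 100_000, n_demo)
X_demo = np.column_stack((np.ones(n_demo), x_demo))

# Slope fitted to the real (unshuffled) data
observed_slope = linalg.lstsq(X_demo, y_demo)[0][1]

n_perm = 1000
null_slopes = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle the outcome so that it is unrelated to the predictor
    y_shuffled = rng.permutation(y_demo)
    null_slopes[i] = linalg.lstsq(X_demo, y_shuffled)[0][1]

# Two-sided p-value: share of null slopes at least as extreme as the observed one
p_value = np.mean(np.abs(null_slopes) >= np.abs(observed_slope))

print("observed slope:", observed_slope)
print("p-value:", p_value)
```

A histogram of `null_slopes` with a vertical line at `observed_slope` makes the comparison visible.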

### 7.1. Simulating data with a regression model with two predictors

Of course, our model of apartment prices is limited, since many other factors influence the price. Perhaps the most important one is the size of the apartment.

1. Make an additional predictor, the size, which ranges from 20 to 200 m² with uniform random values.
2. We know that for each additional m², the price increases by EUR 2,000.
3. Simulate a new price data set that has 2 predictors: age as above, and size as defined here.
4. The error should stay the same, but it makes sense to have a lower intercept value of EUR 300,000. Why?

In [None]:
n = 150
age = np.random.randint(0, 200, n)
sqm2 = np.random.randint(20, 200, n)
b1 = -1000
b2 = 2000
error_sd = 100_000  # same error SD as in the one-predictor simulation
price = 300000 + b1 * age + b2 * sqm2 + np.random.normal(0, error_sd, n)

### 7.2. Visualize the data-set and ...

1. ... make a figure with 2 horizontally arranged subplots (1 for each predictor), using scatter plots again
2. ... make a new figure with age on the x axis, where the size of the dots in the scatter plot is proportional to the size of the apartment (parameter `s` of `scatter`)



In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(7, 3))
ax[0].scatter(age, price)
ax[0].set_xlabel("age")
ax[0].set_ylabel("price €")

ax[1].scatter(sqm2, price)
ax[1].set_xlabel("Size (m2)")

plt.tight_layout()
# your code
# your code

Showing both factors in a single figure (size of the dots as a new dimension, `s=` argument):

In [None]:
plt.figure()
plt.scatter(age, price, s=sqm2)
plt.xlabel("age", fontsize=14)
plt.ylabel("price €", fontsize=14)


### 7.3. Fit a linear regression model with intercept and the two predictors using `scipy.linalg` to the above data

#### 7.3.1. Calculate the error of the model


In [None]:
X = np.column_stack((np.ones(n), age, sqm2))
print(np.shape(X))
linalg.lstsq(X, price)[1]  # sum of squared residuals of the two-predictor model

#### 7.3.2. Observe the fitted coefficients B0,B1,B2

In [None]:
linalg.lstsq(X, price)[0]

#### 7.3.3. Fit 2 regressions to the above data

1. only intercept and age as predictors
2. only intercept and size as predictors

Compare the obtained errors and weights (slopes) with the ones obtained using two predictors.

In [None]:
X1 = np.column_stack((np.ones(n), age))

print(
    "age only ",
    np.round(linalg.lstsq(X1, price)[0], 1),
    "error",
    np.round(np.sqrt(linalg.lstsq(X1, price)[1]), 1),
)

X2 = np.column_stack((np.ones(n), sqm2))
print(
    "Size only ",
    np.round(linalg.lstsq(X2, price)[0], 1),
    "error",
    np.round(np.sqrt(linalg.lstsq(X2, price)[1]), 1),
)

print(
    "Both ",
    np.round(linalg.lstsq(X, price)[0], 1),
    "error",
    np.round(np.sqrt(linalg.lstsq(X, price)[1]), 1),
)

---
## Homework 1

Write a function `my_mult_regr()` that performs the above calculation. This function should take 3 inputs in the following order: 1. predictor 1 (age), 2. predictor 2 (size), 3. outcome variable (price).

Your function has to
1. create a predictor matrix (as above) that starts with a column of ones, followed by the two predictors (3 columns in total),
2. use `lstsq()` to fit the regression model, as above, and
3. return 2 outputs: the first is an array containing the 3 fitted regression parameters (the first value returned by `lstsq()`), the second is the residual error (the second value returned by `lstsq()`).

!! Make sure that your function works for inputs of any length (this is important when you add the column of ones). You can assume that all 3 input vectors have the same length (otherwise the analysis does not make sense).





In [None]:
# YOUR CODE

## Homework 2

### Standardized predictors

- Standardize (z-score) your predictors by subtracting the mean and dividing by the standard deviation.

- Fit both the single-predictor and the two-predictor models, and compare the errors and beta weights obtained for the standardized and non-standardized data sets.

- Use the `my_mult_regr()` function in this solution.



In [None]:
# YOUR CODE

## Homework 3

**Car price simulation:**

A new car costs EUR 30,000 on average. Simulate 200 car prices from the last 70 years under the following assumption: cars get cheaper as they get older, but very old cars have a vintage value, so a car that is old enough can eventually be worth more than a new car. Use a standard deviation of EUR 10,000. Hint: Use a linear model for the simulation with two predictors, a linear and a quadratic term. Test different values for the two slopes and simulate data until you obtain realistic car prices that satisfy the above criteria.
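
In the notation of Section 2, the model structure described in the hint can be written as follows (the symbol $B_2$ for the second slope is our own labeling):

$Price=B_0+B_1Age+B_2Age^2+Error$

For the price to fall at first and rise again for very old cars, the curve has to open upward, which suggests a negative $B_1$ combined with a positive $B_2$. The concrete values are left for you to find.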

In [None]:
# YOUR CODE

Once you have found good values for this simulation, make a nice visualization of the simulated data.

In [None]:
# YOUR CODE

Once the data simulation is ready, fit 3 regression models to the simulated data:
1. intercept + linear predictor age
2. intercept + linear predictor + quadratic predictor x<sup>2</sup>
3. intercept + linear predictor + quadratic predictor x<sup>2</sup> + cubic predictor x<sup>3</sup>

`print` the obtained residual error for the three models and visualize the model predictions

In [None]:
# YOUR CODE

## Bonus task

*No need to submit* 

**Reliability of regression analysis:**

Since we created the data, we can see how close the fitted parameter values are to the true values of the 'generative' model. The next task is to investigate this relationship systematically. You will have to manipulate the number of data points and the error in the model, and analyze the difference between the data-generating and the fitted regression parameters. This task is somewhat analogous to the *t*-test simulation task.

In [None]:
# YOUR CODE