# Week 1 

The purpose of this week's exercise is twofold. First, it introduces you to NumPy and familiarizes you with the library and some of its pitfalls. Second, you will use this knowledge to estimate the linear model using OLS.

## A short introduction to Numpy and Linear Algebra (Linalg)
First, import all necessary packages. If you are missing a package, you can either install it through your terminal using pip, or an Anaconda terminal using conda.

In [1]:
import numpy as np
from numpy import linalg as la
from numpy import random as random
from tabulate import tabulate
# NB: if you do not have tabulate yet, install it with `pip install tabulate`.
from matplotlib import pyplot as plt


### Entering matrices manually
To create a $1\times9$ *row* vector write,

In [2]:
row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(row)

print(row.shape)

[1 2 3 4 5 6 7 8 9]
(9,)


To create a $9\times1$ *column* vector write,

In [3]:
col = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9]])
print(col)
print(col.shape)

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
(9, 1)


An easier method is to define a row vector, and transpose it. Notice the double [[]]. Try to see what happens if you transpose a row vector using only [].

In [4]:
col = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]).T
print(col)

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]


**A short note on numpy vectors**
Numpy does not treat vectors and matrices the same. A *true* numpy vector has the shape `(k,)`, with a single axis, whereas a matrix has a two-dimensional shape such as `(k, 1)` or `(1, k)`. The shape of a numpy array is an attribute. How do you access this attribute for the `row` and `col` arrays? What is the shape of the `row.T` array?

In [5]:
# Call the shape attribute for the row and col vars. Check the shape of row.T

# FILL IN HERE

print(col.shape)
print(row.shape)
print(row.T.shape)

(9, 1)
(9,)
(9,)
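The shapes printed above show that transposing a 1-D array changes nothing. A minimal sketch of how to move between 1-D and explicit 2-D vectors follows; the names `v`, `as_row`, `as_col` and `as_col_alt` are illustrative and not part of the exercise.

```python
import numpy as np

v = np.arange(1, 10)            # 1-D array, shape (9,)

# Transposing a 1-D array is a no-op: there is only one axis to permute.
assert v.T.shape == (9,)

# Common ways to obtain explicit 2-D row and column vectors.
as_row = v.reshape(1, -1)       # shape (1, 9)
as_col = v.reshape(-1, 1)       # shape (9, 1)
as_col_alt = v[:, np.newaxis]   # shape (9, 1), same values as as_col

print(as_row.shape, as_col.shape, as_col_alt.shape)

# Transposing a 2-D array does swap the axes.
print(as_row.T.shape)
```

Using `-1` in `reshape` lets NumPy infer that dimension from the number of elements, so the same line works for vectors of any length.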


Combine what you have learned to manually create a $3 \times 3$ matrix called `x` that contains the numbers 0 to 8.

In [6]:
# FILL IN HERE

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(x)

[[0 1 2]
 [3 4 5]
 [6 7 8]]


Create the same $3 \times 3$ matrix using `np.arange()` and `np.reshape()`.

In [9]:
# FILL IN HERE

x = np.arange(0,9).reshape(3,3)
print(x)

[[0 1 2]
 [3 4 5]
 [6 7 8]]


### Matrix calculations 
There are several types of matrix calculations available to us with the numpy library, and we will introduce some here.

For matrix **multiplication** you can for the matrices `a` and `b` use `a@b`, `np.dot(a, b)` or `a.dot(b)`

Perform matrix multiplication on the following:
- `row`$\cdot$`row'`, `row'`$\cdot$`row`, `row`$\cdot$`row`;
- `col`$\cdot$`col'`, `col'`$\cdot$`col`, `col`$\cdot$`col`;
- `x`$\cdot$`x`, `row`$\cdot$`col'`, `col`$\cdot$`row'`.

Does the `row` vector behave as you would expect?

In [11]:
print(row.shape)

(9,)


So Python does not actually know that `row` is a row vector, even though we intended to create it as a $1\times9$ row vector: its shape is `(9,)`, i.e. it is one-dimensional.

In [26]:
print( row@row.T)
print( row.T@row)
print( row@row)

# The vector does not behave as expected: transposing a 1-D array does nothing, so all three are the inner product.

285
285
285


In [12]:
print(col.shape)

(9, 1)


The column vector is actually stored as a $9\times1$ column vector.

In [24]:
print( col@col.T)
print( col.T@col)
print( col@col)

[[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]
[[285]]


ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 9 is different from 1)

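The column vector behaved as expected: `col@col.T` is the $9\times9$ outer product, `col.T@col` is the $1\times1$ inner product, and `col@col` fails because the dimensions do not conform.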

In [29]:
print( x@x) #(3x3) x (3x3) = (3x3)
print(row@col.T) # (9,) @ (1x9): dimensions do not conform -> ValueError
print(col@row.T) # (9x1) @ (9,): dimensions do not conform -> ValueError

[[ 15  18  21]
 [ 42  54  66]
 [ 69  90 111]]


ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 9)
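The results above follow from how `@` treats 1-D and 2-D operands. The sketch below restates those shape rules on freshly created arrays; the variable names are illustrative, and `row` and `col` are rebuilt here so the block runs on its own.

```python
import numpy as np

row = np.arange(1, 10)            # shape (9,)
col = row.reshape(-1, 1)          # shape (9, 1)

# 1-D @ 1-D is always the inner product (a scalar), with or without .T.
inner_1d = row @ row              # 285

# With explicit 2-D shapes the usual matrix rules apply.
inner_2d = col.T @ col            # (1, 9) @ (9, 1) -> (1, 1)
outer_2d = col @ col.T            # (9, 1) @ (1, 9) -> (9, 9)

# 1-D @ 2-D treats the 1-D array as a row: (9,) @ (9, 1) -> (1,)
mixed = row @ col

# Non-conforming shapes raise ValueError, e.g. (9, 1) @ (9, 1).
try:
    col @ col
    conforming = True
except ValueError:
    conforming = False

print(inner_1d, inner_2d.shape, outer_2d.shape, mixed.shape, conforming)
```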

What happens if you use `*` and `/` operators on the same pairs as above? Does the `col` vector behave as you would expect?

In [31]:
print(row*row.T) # (9,) * (9,) -> elementwise (row.T is still 1-D)
print(row.T*row) # (9,) * (9,) -> elementwise
print(row*row) # (9,) * (9,) -> elementwise
print(row/row.T) 
print(row.T/row)
print(row/row)

[ 1  4  9 16 25 36 49 64 81]
[ 1  4  9 16 25 36 49 64 81]
[ 1  4  9 16 25 36 49 64 81]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]


Note: Because `row` is one-dimensional with shape `(9,)`, `row.T` is identical to `row`, so `*` and `/` simply operate element by element and return a single row.
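When the two operands have different shapes, `*` and `/` broadcast instead. A small sketch of the rule, assuming NumPy 1.20 or later for `np.broadcast_shapes`; `row` and `col` are rebuilt here so the block is self-contained.

```python
import numpy as np

row = np.arange(1, 10)        # shape (9,)
col = row.reshape(-1, 1)      # shape (9, 1)

# Same shapes: * and / work element by element, no broadcasting needed.
assert (row * row).shape == (9,)

# Shapes are aligned from the right; a dimension of size 1 (or a missing
# dimension) is stretched to match the other operand.
shape_row_col = np.broadcast_shapes((9,), (9, 1))      # shape of row * col
shape_col_colT = np.broadcast_shapes((9, 1), (1, 9))   # shape of col * col.T
print(shape_row_col, shape_col_colT)

# The broadcast product of a (9,) and a (9, 1) array equals the outer product.
assert np.array_equal(row * col, np.outer(row, row))
```

This is why `row*col` and `col*row` produce a full $9\times9$ table rather than an error.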

In [39]:
print(col*col.T) # (9x1) * (1x9) -> broadcasts to (9x9)
print(col.T*col) # (1x9) * (9x1) -> broadcasts to (9x9)
print(col*col) # (9x1) * (9x1) -> elementwise, stays (9x1)
print(col/col.T)
print(col.T/col)
print(col/col)

[[1.         0.5        0.33333333 0.25       0.2        0.16666667
  0.14285714 0.125      0.11111111]
 [2.         1.         0.66666667 0.5        0.4        0.33333333
  0.28571429 0.25       0.22222222]
 [3.         1.5        1.         0.75       0.6        0.5
  0.42857143 0.375      0.33333333]
 [4.         2.         1.33333333 1.         0.8        0.66666667
  0.57142857 0.5        0.44444444]
 [5.         2.5        1.66666667 1.25       1.         0.83333333
  0.71428571 0.625      0.55555556]
 [6.         3.         2.         1.5        1.2        1.
  0.85714286 0.75       0.66666667]
 [7.         3.5        2.33333333 1.75       1.4        1.16666667
  1.         0.875      0.77777778]
 [8.         4.         2.66666667 2.         1.6        1.33333333
  1.14285714 1.         0.88888889]
 [9.         4.5        3.         2.25       1.8        1.5
  1.28571429 1.125      1.        ]]
[[1.         2.         3.         4.         5.         6.
  7.         8.         9

In [40]:
print(x*x) #3x3 * 3x3
print(row*col)
print(col*row)
print(x/x)
print(row/col)
print(col/row)

[[ 0  1  4]
 [ 9 16 25]
 [36 49 64]]
[[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]
[[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]
[[nan  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
[[1.         2.         3.         4.         5.         6.
  7.         8.         9.        ]
 [0.5        1.         1.5        2.         2.5        3.
  3.5        4.         4.5       ]
 [0.33333333 0.66666667 1.         1.33333333 1.66666667 2.
  2.33333333 2.66666667 3.        ]
 [0.25       0.5        0.75       1.         1.25       1.5
  1.75       2.         2.



For OLS we need to be able to calculate the inverse. This is done with the `linalg` submodule. Create a new matrix that we can calculate the inverse on. Why can't we take the inverse of `x`?

In [43]:
# FILL IN HERE
print(x)

la.inv(x)

[[0 1 2]
 [3 4 5]
 [6 7 8]]


LinAlgError: Singular matrix

`x` cannot be inverted because its columns are linearly dependent, so it does not have full rank.

What do we normally need to check before we take the inverse? Which `numpy.linalg` function can we use to help us check this?

In [6]:
# FILL IN HERE
np.linalg.matrix_rank(x)


2

Check the rank of the matrix: it must equal the number of columns (full rank) for the inverse to exist. Here the rank is 2 while `x` has 3 columns.
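As a sketch of the full workflow, the block below checks the rank first and only then inverts. The matrix `a` is a hypothetical full-rank example chosen for illustration; it is not part of the exercise.

```python
import numpy as np
from numpy import linalg as la

x = np.arange(9).reshape(3, 3)
# Column 3 equals 2 * column 2 - column 1, so x has rank 2, not 3.
rank_x = la.matrix_rank(x)
print(rank_x)

# A hypothetical full-rank matrix, chosen only for illustration.
a = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
assert la.matrix_rank(a) == a.shape[1]

a_inv = la.inv(a)
# A @ A^{-1} should reproduce the identity up to floating-point error.
is_identity = np.allclose(a @ a_inv, np.eye(3))
print(is_identity)
```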

### Stack vectors or matrices together
If you have several 1-D vectors (each with shape `(k,)`), you can use `np.column_stack()` to get a matrix with the input vectors placed side by side as columns.

If you have matrices (or arrays) that are multidimensional (have the shape (k, t)), you can use `np.hstack()` (means horizontal stack). This is very useful if you already have a matrix, and you want to add a vector.

Try to make a matrix from two `row` vectors. This should give you a $9 \times 2$ matrix.

Make a new vector, and add it to the `x` matrix. This should then be a $3 \times 4$ matrix.

In [13]:
# FILL IN HERE
np.column_stack([row,row])

array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4],
       [5, 5],
       [6, 6],
       [7, 7],
       [8, 8],
       [9, 9]])

In [18]:
# FILL IN HERE
y = np.array([[1, 2, 3]]).T
print(y)
print(x)
z=np.hstack([x,y])
print(z)

[[1]
 [2]
 [3]]
[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[0 1 2 1]
 [3 4 5 2]
 [6 7 8 3]]


### Other methods that you need to know.
The numpy library is vast. Some other methods that are useful are `ones`, `diag`, `diagonal`, `eye`.
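A brief sketch of these four functions, using small illustrative arrays (the names `m`, `d`, `d_again` and `diag_matrix` are made up for this example):

```python
import numpy as np

ones = np.ones((3, 2))            # 3x2 matrix of ones
identity = np.eye(3)              # 3x3 identity matrix

m = np.arange(9).reshape(3, 3)
d = m.diagonal()                  # main diagonal as a 1-D array: [0 4 8]

# np.diag extracts the diagonal from a 2-D input
# and builds a diagonal matrix from a 1-D input.
d_again = np.diag(m)              # [0 4 8]
diag_matrix = np.diag(d)          # 3x3 matrix with 0, 4, 8 on the diagonal

print(ones.shape, identity.shape, d, diag_matrix.shape)
```

`np.ones` is handy for the constant term in a regression, and `np.diag` is the usual way to pull variances out of a covariance matrix.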

## Exercise 1 - Data generation
### 1.1 
Create a synthetic dataset with the following characteristics

\begin{align}
    y_i &= \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \varepsilon_i
\end{align}

where $\beta_0=1$, $\beta_1 = -0.5$, $\beta_2 = 2$, $x_{1i} \sim \mathcal{N}(0, 4)$, $x_{2i} \sim \mathcal{N}(5, 9)$, $\varepsilon_i \sim \mathcal{N}(0, 1)$, and $(x_{1i},x_{2i})$ and $\varepsilon_i$ are independent, and where $i = 0, ..., 99$. <br>

Important: `rng.normal` takes the standard deviation as its scale argument, not the variance.
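A quick way to convince yourself of this is to draw a large sample and compare the sample variance with the square of the `scale` argument. This is only a sanity-check sketch; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily

# x_2 ~ N(5, 9) has variance 9, so the standard deviation passed as scale is 3.
draws = rng.normal(loc=5, scale=3, size=100_000)

sample_mean = draws.mean()
sample_var = draws.var()
print(sample_mean, sample_var)   # close to 5 and 9, not 5 and 3
```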

In [32]:
# Create a seed to always have identical draws.
seed = 42
# Instance a random number generator using this seed.
rng = random.default_rng(seed=seed)
n = 100
b = np.array([1, -0.5, 2]).reshape(-1, 1)
# reshape(-1, 1) turns b into a column vector of shape (3, 1); -1 lets NumPy
# infer that dimension from the number of elements.


# Make random draws from a normal distribution.
def random_draws(n):
    x0 = np.ones(n)
    x1 = rng.normal(0,2,n)
    x2 = rng.normal(5,3,n)
    eps = rng.normal(0,1,n).reshape(-1,1)
    
    # Stack the single columns into a matrix.
    X = np.column_stack([x0,x1,x2])
    
    return X, eps

X, eps = random_draws(n)

print(X.shape)
print(eps.shape)

# Create y using the betas, X and eps.
y = X@b+eps

# Does y have the dimensions you expect?
print(y.shape)

(100, 3)
(100, 1)
(100, 1)


In [33]:
np.mean(y)

10.92414332185814

### 1.2 
Imagine that you had not generated the dataset yourself, but that you were given a similar data set that was already collected (generated) and ready to analyze. What would you observe and not observe in that data set?

FILL IN HERE

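We would observe the data $(y_i, x_{1i}, x_{2i})$ and therefore sample moments such as the sample mean and variance. We would not observe the true parameters $\boldsymbol{\beta}$, the error terms $\varepsilon_i$, or population moments such as the population mean and variance.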

## Exercise 2 - OLS
### 2.1
Make sure that you remember the mathematical equation for the OLS estimator, which we will later use to estimate the beta coefficients using data from the previous exercise. <br> 
**Write out the OLS estimator in matrix form:**


$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'y$, the sample analogue of the population expression $\boldsymbol{\beta} = [E(x'x)]^{-1}E(x'y)$

*Hint: Look it up on p.57 in Wooldridge*

### 2.2
As you might remember, to perform inference on the OLS estimators, we need to calculate the standard errors of the previously estimated OLS coefficients. Again, make sure you remember the equation, *and write up the OLS standard errors in matrix form:*

$\mathbf{\widehat{Var(\boldsymbol{\hat{\beta}})}} = \hat{\sigma}^2(X'X)^{-1}$, for $\hat{\sigma}^2 = SSR/(N-K)$, <br>

where $SSR = \sum_{i=0}^{N - 1} \hat{u}^2_i$, $N$ is the number of observations, and $K$ is the number of explanatory variables including the constant.

*Hint: Look it up on p.60 in Wooldridge* <br>
*Hint: Remember that the variance estimate is a function of $\hat{\sigma}^2$, which is calculated using SSR*

### 2.3
Estimate $\boldsymbol{\hat{\beta}}$ from the synthetic data set. Furthermore, calculate standard errors and t-values (assuming that the assumptions of the classical linear regression model are satisfied).

In [None]:
def ols_estimation(y, X):
    # Make sure that y and X are 2-D.
    y = y.reshape(-1, 1)
    if len(X.shape)<2:
        X = X.reshape(-1, 1)

    # Estimate beta
    b_hat =  # Fill in here

    # Calculate standard errors
    residual = # Fill in here
    sigma = # Fill in here
    varb = # Fill in here
    se = # Fill in here

    # Calculate t-values
    t_values = b_hat/se
    
    return b_hat, se, t_values

b_hat, se, t_values = ols_estimation(y, X)
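One possible way to fill in the blanks is sketched below. It is a direct translation of the formulas in 2.1 and 2.2, not necessarily the intended solution; the function name `ols_sketch` is hypothetical, and the data are regenerated as in Exercise 1 (with an arbitrary seed) so that the block runs on its own.

```python
import numpy as np
from numpy import linalg as la

def ols_sketch(y, X):
    """Return OLS coefficients, standard errors and t-values as (K, 1) arrays."""
    y = y.reshape(-1, 1)
    n, k = X.shape

    # beta_hat = (X'X)^{-1} X'y
    xtx_inv = la.inv(X.T @ X)
    b_hat = xtx_inv @ (X.T @ y)

    # sigma^2_hat = SSR / (N - K), Var(beta_hat) = sigma^2_hat (X'X)^{-1}
    residual = y - X @ b_hat
    sigma2 = (residual.T @ residual).item() / (n - k)
    varb = sigma2 * xtx_inv
    se = np.sqrt(np.diag(varb)).reshape(-1, 1)

    t_values = b_hat / se
    return b_hat, se, t_values

# Data generated as in Exercise 1 (seed chosen arbitrarily).
rng = np.random.default_rng(42)
n = 100
X = np.column_stack([np.ones(n), rng.normal(0, 2, n), rng.normal(5, 3, n)])
b = np.array([1, -0.5, 2]).reshape(-1, 1)
y = X @ b + rng.normal(0, 1, n).reshape(-1, 1)

b_hat, se, t_values = ols_sketch(y, X)
print(b_hat.ravel(), se.ravel(), t_values.ravel())
```

Explicitly inverting $X'X$ mirrors the textbook formula; for larger or ill-conditioned problems, `la.solve` or `la.lstsq` is usually the more numerically stable choice.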

Python stores vectors as one-dimensional rather than two-dimensional objects. This can sometimes cause havoc when we want to compute matrix products. Compute the outer and inner products of the residuals from above using `np.inner()` and `np.outer()`. Compare these with your computed outer and inner products when using matrix multiplication `@`. When computing outer and inner products of a column vector $a$, recall that $a'a$ is the inner product and $aa'$ is the outer product.

In [None]:
res = # FILL IN HERE
inner = # FILL IN HERE
outer = # FILL IN HERE
matmul_inner = # FILL IN HERE
matmul_outer = # FILL IN HERE

print('res shape:         ', res.shape)
print('inner shape:       ', inner.shape)
print('outer shape:       ', outer.shape)
print('matmul_inner shape:', matmul_inner.shape)
print('matmul_outer shape:', matmul_outer.shape)

Now if we flatten the residuals to be stored in Python's default mode (i.e. one-dimensional) what happens?

In [None]:
res=res.flatten()
inner = # FILL IN HERE
outer = # FILL IN HERE
matmul_inner = # FILL IN HERE
matmul_outer = # FILL IN HERE

print('res shape:         ', res.shape)
print('inner shape:       ', inner.shape)
print('outer shape:       ', outer.shape)
print('matmul_inner shape:', matmul_inner.shape)
print('matmul_outer shape:', matmul_outer.shape)
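The comparison can be sketched on a small made-up vector. Here `res` is a hypothetical $3\times1$ residual vector, not the residuals from the regression above, so only the shapes carry over to the exercise.

```python
import numpy as np

res = np.array([[1.0], [-2.0], [0.5]])   # hypothetical (3, 1) column vector

# 2-D column vector: np.inner sums over the LAST axis of each input, so for
# two (3, 1) arrays it returns a (3, 3) array, which is not res'res.
inner = np.inner(res, res)     # shape (3, 3)
outer = np.outer(res, res)     # shape (3, 3); inputs are flattened first
matmul_inner = res.T @ res     # shape (1, 1)
matmul_outer = res @ res.T     # shape (3, 3)

# 1-D version of the same vector.
flat = res.flatten()                 # shape (3,)
inner_flat = np.inner(flat, flat)    # scalar, shape ()
outer_flat = np.outer(flat, flat)    # shape (3, 3)
matmul_flat = flat @ flat            # scalar; flat @ flat.T gives the same

print(inner.shape, outer.shape, matmul_inner.shape, matmul_outer.shape)
print(inner_flat.shape, outer_flat.shape, matmul_flat)
```

With 1-D input there is no way to obtain the outer product from `@`, which is why `np.outer` (or an explicit reshape) is needed.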

I have written code to print a table using the `tabulate` package. You will need to add the row names for this code to work; each row contains information about the coefficient on one of the explanatory variables.

In [None]:
def print_table(row_names, b, b_hat, se, t_values):
    table = []

    # Make a list, where each row contains the estimated and calculated values.
    for index, name in enumerate(row_names):
        table_row = [
            name, b[index], b_hat[index], se[index], t_values[index]
        ]
        table.append(table_row)

    # Print the list using the tabulate function.
    headers = ['', '\u03b2', '\u03b2\u0302 ', 'Se', 't-value']
    print('OLS Estimates:\n')
    print(tabulate(table, headers, floatfmt=['', '.1f', '.3f', '.3f', '.1f']))

row_names = # Fill in here
print_table(# Fill in here)

Alternatively, you can print a table that you can paste straight into LaTeX using the following code. This uses pandas DataFrames, which we will cover next week.

In [None]:
import pandas as pd
dat = pd.DataFrame(zip(b,b_hat.round(4),se.round(4),t_values.round(4)))
dat.columns = ['\u03b2','\u03b2\u0302','se','t-values']
dat.index = ['beta1','beta2','beta3']
print(dat.style.to_latex())

## Exercise 3 - a simple Monte Carlo Experiment
Carry out a Monte Carlo experiment with $S = 200$ replications and $N = 100$ observations to check whether the OLS estimator provides an unbiased estimate of $\boldsymbol{\beta}$.
### 3.1
Generate 200 data sets similar to what you did in exercise 1, and estimate $\boldsymbol{\beta}$ on each of them.

*Hint:* Start by preallocating two arrays using `np.zeros`: one to store the estimated beta coefficients and one to store the estimated standard errors. What shape should these arrays have?

Then write a loop in which each iteration makes a random draw, estimates the model on that draw, and stores the estimated coefficients and standard errors.

In [None]:
# Initialize the variables and lists
s = 200
n = 100

# Allocate memory for arrays to later fill
b_coeffs = np.zeros((s, b.size))
b_ses = np.zeros((s, b.size))

for i in range(s):
    # Generate data
    X, eps = # Fill in here
    y = # Fill in here

    # Estimate coefficients and variance
    b_hat, se, t_values = # Fill in here

    # Store estimates
    b_coeffs[i, :] = # Fill in here
    b_ses[i, :] = # Fill in here

# Make sure that there are no more zeros left in the arrays.
assert np.all(b_coeffs) and np.all(b_ses), 'Not all coefficients or standard errors are non-zero.'

### 3.2
Do the following three calculations:
- Calculate the means of the estimates (means across simulations)
- Calculate the means of the standard errors (means across simulations)
- Calculate the standard error of the MC estimates

In [None]:
mean_b_hat = # Fill in here
mean_b_se = # Fill in here
mean_mc_se = # Fill in here
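The experiment in 3.1 and the summaries in 3.2 can be sketched as a single self-contained block. This is one possible implementation under stated assumptions: the seed is arbitrary, the OLS formulas are written inline with 1-D arrays instead of calling `ols_estimation`, and the standard error of the MC estimates is taken to be the sample standard deviation of the estimates across simulations.

```python
import numpy as np
from numpy import linalg as la

rng = np.random.default_rng(42)   # seed chosen arbitrarily
s, n = 200, 100
b = np.array([1, -0.5, 2])

# Preallocate one row per replication and one column per coefficient.
b_coeffs = np.zeros((s, b.size))
b_ses = np.zeros((s, b.size))

for i in range(s):
    # Generate data as in Exercise 1.
    X = np.column_stack([np.ones(n), rng.normal(0, 2, n), rng.normal(5, 3, n)])
    y = X @ b + rng.normal(0, 1, n)

    # Estimate coefficients and standard errors.
    xtx_inv = la.inv(X.T @ X)
    b_hat = xtx_inv @ X.T @ y
    resid = y - X @ b_hat
    sigma2 = resid @ resid / (n - b.size)

    # Store estimates.
    b_coeffs[i, :] = b_hat
    b_ses[i, :] = np.sqrt(sigma2 * np.diag(xtx_inv))

mean_b_hat = b_coeffs.mean(axis=0)      # mean of estimates across simulations
mean_b_se = b_ses.mean(axis=0)          # mean of estimated standard errors
mc_se = b_coeffs.std(axis=0, ddof=1)    # spread of the estimates across simulations
print(mean_b_hat, mean_b_se, mc_se)
```

If OLS is unbiased, `mean_b_hat` should be close to the true `b`, and if the standard-error formula is appropriate, `mean_b_se` should be close to `mc_se`.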

### 3.3
Draw a histogram for the 200 estimates of $\beta_1$. This can be done using matplotlib with the method `plt.hist()`.

In [None]:
# Fill in here