### Final exam practice questions

#### Question 1
In this question, we will create a function to generate linear data step-by-step. 
1. Define a function called `gen_lin_data` that takes as its input an integer `n`, and returns a matrix `X` and a vector `Y`. 
2. First we will generate `X`. Within the function,
 - Create two random (numpy) vectors of size `n` named `X_1` and `X_2` where 
    - Each element of `X_1` is independent and distributed according to the Uniform(-2,2) distribution
    - Each element of `X_2` is independent and distributed according to the Normal(2,2) distribution. 
    
  - Finally, concatenate `X_1` and `X_2` to generate `X` so that `X` is a numpy object with `n` rows and 2 columns.


3. Now we will work towards generating `Y`. Within the function,
  - Define a numpy array called `beta` and set it equal to `[-1,2]`. 
  - Define another `n`-dimensional vector called `eps` where each element is independent and distributed according to the Normal(0,1/2) distribution. 
  - Finally, generate a vector of size `n` called `Y` according to the following equation:
$$
Y = X \cdot \beta + eps
$$
4. Now that your function is defined, test your function in the designated cell and call it with **`n` equal to 500**. Save the resulting arrays as `Y500` and `X500`. **Print** the average of `Y500`.

In [1]:
# Your function here
import numpy as np

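def gen_lin_data(n):
    # Features: X_1 ~ Uniform(-2, 2); X_2 ~ Normal with mean 2 and standard deviation 2
    X_1 = np.random.uniform(-2,2,size=n)
    X_2 = np.random.normal(2,2,size=n)
    X = np.column_stack((X_1,X_2))  # shape (n, 2)
    beta = np.array([-1,2])
    # Noise: the second argument of np.random.normal is the standard deviation (here 1/2)
    eps = np.random.normal(0,1/2,size=n)
    Y = np.dot(X,beta) + eps
    return X,Y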

In [2]:
np.random.seed(10) #ignore this line and do not delete it

# Define Y500 and X500 below this line:
X500,Y500 = gen_lin_data(500)
# Your code here
print(Y500.mean())

3.803125339392462
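
As a sanity check on the printed average, the expected value of each entry of `Y` can be worked out from the definitions in the question (Uniform(-2,2) has mean 0, Normal(2,2) has mean 2, and `eps` has mean 0):

$$
E[Y_i] = -1 \cdot E[X_{1,i}] + 2 \cdot E[X_{2,i}] + E[eps_i] = -1 \cdot 0 + 2 \cdot 2 + 0 = 4
$$

A sample average of roughly 3.80 from 500 draws is in the right neighborhood of this value.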


#### Question 2
In this question, we will create our own function to perform basic OLS estimation.
1. Define a function called `simple_OLS` that takes a vector `Y` and a matrix `X` as inputs and returns a vector of OLS (linear regression) coefficient estimates. **To get full marks, you must explicitly calculate the OLS coefficients using the analytical formula for linear regression.** You will get some partial credit for estimating the equations using another method.
2. **In the Markdown Cell below your function**, answer this question: If `Y` is an $n$-vector and `X` is an $n\times d$ matrix, what are the dimensions of the objects that will be output by the `simple_OLS` function? Verify this with your code or justify it using linear algebra.
3. In the designated cell, define `beta_hat` to be the output of `simple_OLS` using `Y500` and `X500` from the previous problem.
4. In the second provided Markdown Cell, explain what values you would expect the entries of `beta_hat` to be close to, and why.


In [3]:
# Your function here
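def simple_OLS(X,Y):
    # Analytical OLS: solve the normal equations (X'X) beta = X'Y,
    # which gives beta = (X'X)^{-1} X'Y without forming the inverse explicitly
    beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)
    return beta_hats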

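`simple_OLS` returns a vector of length $d$ (a numpy array of shape `(d,)`), with one coefficient estimate per column of `X`. By linear algebra, $X^\top X$ is $d \times d$ and $X^\top Y$ is a $d$-vector, so $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ is a $d$-vector. With `X500`, $d = 2$, so the output has shape `(2,)`.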
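
The dimension question can also be checked numerically. The sketch below is only an illustration: the sizes `n = 50` and `d = 3`, the seed, and the `_demo` names are arbitrary choices, not part of the exam.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
n, d = 50, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))   # n x d design matrix
Y = rng.normal(size=n)        # n-vector of responses

XtX = X.T @ X                 # (d x n)(n x d) -> d x d
XtY = X.T @ Y                 # (d x n)(n,)    -> (d,)
beta_hat_demo = np.linalg.solve(XtX, XtY)

print(XtX.shape, XtY.shape, beta_hat_demo.shape)
```

Each shape follows from the matrix products in the normal equations, so the result always has one entry per column of `X`.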

In [5]:
# Define beta_hat here
beta_hat = simple_OLS(X500,Y500)
print(beta_hat)
print(beta_hat.shape)

[-0.98368016  1.99845442]
(2,)


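We would expect the entries of `beta_hat` to be close to -1 and 2, the true coefficients `beta = [-1, 2]` used to generate `Y` in `gen_lin_data`. The noise `eps` has mean zero and is independent of `X`, so OLS is unbiased and consistent for `beta`, and with 500 observations the estimates should land close to the true values.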
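
To see how the estimates tighten around the true coefficients as the sample grows, here is a minimal sketch. It regenerates data in the same way as Question 1 but uses a seeded `np.random.default_rng` generator. The helper name `ols_on_simulated_data`, the seed, and the sample sizes are assumptions of this illustration.

```python
import numpy as np

def ols_on_simulated_data(n, rng):
    """Simulate data as in Question 1 and return the OLS estimate."""
    X = np.column_stack((rng.uniform(-2, 2, size=n),
                         rng.normal(2, 2, size=n)))
    Y = X @ np.array([-1, 2]) + rng.normal(0, 1 / 2, size=n)
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch
true_beta = np.array([-1, 2])
for n in (50, 500, 50_000):
    estimate = ols_on_simulated_data(n, rng)
    # Print the sample size, the estimate, and its largest absolute error
    print(n, estimate, np.abs(estimate - true_beta).max())
```

The largest absolute error should typically shrink as `n` increases, though any single run can deviate from that pattern.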

#### Question 3
In this question, we will evaluate the performance of our OLS estimator. 
1. Define a function called `simple_eval` that takes `X`, `Y`, and `beta_hat` as inputs and returns `est_mse`, the mean-squared error of `beta_hat`. 
2. Within that function
    - Define `Yhat` to be the predicted values of `beta_hat` applied to `X`. 
    - Calculate the mean-squared error and set it to `est_mse`. To do this, you can either use the metrics module from scikit-learn or calculate the mean-squared error by hand. 
3. In the cell below your function, call your `simple_eval` function using `X500`, `Y500`, and `beta_hat` from Question 2. Save that value to a variable called `ols_mse`. 

In [6]:
# Your function here
def simple_eval(X,Y,beta_hat):
    Yhat = np.dot(X,beta_hat)
    residuals = Y - Yhat
    est_mse = np.mean(residuals ** 2)
    return est_mse

In [7]:
# Define ols_mse here
ols_mse = simple_eval(X500,Y500,beta_hat)
ols_mse

0.2335128919003309
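
The question allows either a hand calculation or scikit-learn's metrics module. The short sketch below, on made-up arrays (`Y_demo` and `Yhat_demo` are hypothetical names), suggests the two routes agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
Y_demo = rng.normal(size=20)                          # made-up targets
Yhat_demo = Y_demo + rng.normal(scale=0.1, size=20)   # made-up predictions

mse_by_hand = np.mean((Y_demo - Yhat_demo) ** 2)      # average squared residual
mse_sklearn = mean_squared_error(Y_demo, Yhat_demo)   # same quantity via scikit-learn
print(np.isclose(mse_by_hand, mse_sklearn))
```

For the exam data, the OLS training MSE of about 0.234 is close to 0.25, the variance of `eps` when 1/2 is treated as a standard deviation, as it is in the `np.random.normal(0,1/2,size=n)` call.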

#### Question 4
Now, fit a single regression tree on `Y500` and `X500`. Set the `random_state` and `max_depth` arguments equal to 123 and 10 respectively. Calculate the resulting training mean-squared error of this tree and set it equal to `tree_mse`.

Is the MSE from this regression tree lower or higher than the mean-squared error from linear regression (OLS)? Why would that be? Enter your answer in the markdown cell.

In [12]:
# Fit Tree Here
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn import metrics

tree_reg = DecisionTreeRegressor(random_state=123, max_depth=10)
tree_reg.fit(X500, Y500)

Yhat_tree = tree_reg.predict(X500)

tree_mse = mean_squared_error(Y500, Yhat_tree)

print("Training mean-squared error of the tree:", tree_mse)

Training mean-squared error of the tree: 0.018038859089902624


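The training MSE from the regression tree (about 0.018) is lower than that of the OLS regression (about 0.234). A tree of depth 10 can have up to $2^{10} = 1024$ leaves, more than the 500 training points, so it can fit the training data almost point by point, noise included. This is overfitting: the noise `eps` is generated with standard deviation 1/2 (variance 0.25), so a training MSE far below 0.25 means the tree has memorized noise, not found a better model.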
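
The link between tree depth and training error can be illustrated with a small sketch. It simulates data as in Question 1 with a seeded generator (the seed and the list of depths are arbitrary assumptions) and records the training MSE at several depths.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch
n = 500
X = np.column_stack((rng.uniform(-2, 2, size=n), rng.normal(2, 2, size=n)))
Y = X @ np.array([-1, 2]) + rng.normal(0, 1 / 2, size=n)

train_mse_by_depth = {}
for depth in (2, 5, 10):
    model = DecisionTreeRegressor(random_state=123, max_depth=depth).fit(X, Y)
    train_mse_by_depth[depth] = mean_squared_error(Y, model.predict(X))
    # Report depth, number of leaves, and training MSE
    print(depth, model.get_n_leaves(), train_mse_by_depth[depth])
```

Deeper trees split the training data into more leaves, so the training MSE falls with depth even though the underlying relationship is linear.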

#### Question 5
Now we will generate test data by using `gen_lin_data` with `n` equal to 200. Save the resulting arrays as `X200` and `Y200`. Using `simple_eval` and this testing data, calculate the testing mean-squared error of the `beta_hat` estimate from Question 2. Do the same for the tree estimator from Question 4. In the Markdown cell below, answer the following question: Which estimator has a lower test mean-squared error? Why?

In [17]:
# Your code here
X200,Y200 = gen_lin_data(200)
ols_mse_2 = simple_eval(X200,Y200,beta_hat)

tree_reg = DecisionTreeRegressor(random_state=123, max_depth=10)
tree_reg.fit(X200, Y200)

print(ols_mse_2)

tree_mse_test = metrics.mean_squared_error(Y200,tree_reg.predict(X200))
print(tree_mse_test)

0.22798934969660486
0.001410974261751797


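The numbers above show the tree with the lower MSE (about 0.0014 versus 0.228 for OLS), but that comparison is not a test error for the tree: the code refits `tree_reg` on `X200` and `Y200` and then predicts on the same data, so 0.0014 is a training error. A test error requires keeping the tree from Question 4 (fit on `X500` and `Y500`) and only calling `predict` on `X200`. Evaluated that way, OLS should have the lower test MSE. The data are generated from a linear model, so OLS is correctly specified and its test MSE (about 0.228) stays near the noise variance of 0.25 (standard deviation 1/2 in the code). The depth-10 tree has fit the noise in the training sample and can only approximate a linear trend with piecewise-constant predictions.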
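
A test error for the tree means predicting on new data with the model fit on the training data only. A minimal sketch of that comparison follows. It simulates its own training and test sets as in Question 1, so the `simulate` helper, the seed, and the `_demo` names are assumptions of this illustration, and its numbers will not match the outputs above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def simulate(n, rng):
    """Generate (X, Y) following the linear model from Question 1."""
    X = np.column_stack((rng.uniform(-2, 2, size=n), rng.normal(2, 2, size=n)))
    Y = X @ np.array([-1, 2]) + rng.normal(0, 1 / 2, size=n)
    return X, Y

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch
X_train, Y_train = simulate(500, rng)
X_test, Y_test = simulate(200, rng)

# Fit both models on the training data only
beta_hat_demo = np.linalg.solve(X_train.T @ X_train, X_train.T @ Y_train)
tree_demo = DecisionTreeRegressor(random_state=123, max_depth=10).fit(X_train, Y_train)

# Evaluate both on the held-out test data, with no refitting
ols_test_mse = np.mean((Y_test - X_test @ beta_hat_demo) ** 2)
tree_test_mse = mean_squared_error(Y_test, tree_demo.predict(X_test))
print(ols_test_mse, tree_test_mse)
```

The key design choice is that `fit` is called only on the training arrays. Calling it again on the test arrays would turn the reported number back into a training error.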