# Discussion 6

## Geometry of Least Squares

#### **1)** 

Suppose we have a dataset represented with the design matrix $\mathbb{X}$ and response vector $\mathbb{Y}$. We fit a linear regression model to this data and obtain the optimal weights $\hat{\theta}$. Label the following terms on the geometric interpretation of ordinary least squares:

<img src = "images/blank.jpg"></img>

answer:

#### **2)** 
Using the geometry of least squares, let’s answer a few questions about Ordinary Least Squares (OLS)!


**a)** Which of the following are true about the optimal solution $\hat{\theta}$ to OLS? Recall that the least squares estimate $\hat{\theta}$ solves the normal equation $(\Bbb{X}^T\Bbb{X})\theta = \Bbb{X}^T\Bbb{Y}$.


A. Using the normal equation, we can derive an optimal solution for simple linear regression with an $L_2$ loss.

B. Using the normal equation, we can derive an optimal solution for simple linear regression with an $L_1$ loss.

C. Using the normal equation, we can derive an optimal solution for a constant model with an $L_2$ loss.

D. Using the normal equation, we can derive an optimal solution for a constant model with an $L_1$ loss.

E. Using the normal equation, we can derive an optimal solution for the model $\hat{y} = \theta_1 x + \theta_2 \sin(x^2)$.


In [None]:
q2a = ...
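
As a numerical companion to the normal equation quoted in part a), the sketch below solves $(\Bbb{X}^T\Bbb{X})\theta = \Bbb{X}^T\Bbb{Y}$ with NumPy and compares the result against NumPy's own least squares routine. The random data and the names `X`, `Y`, `theta_hat`, and `theta_lstsq` are assumptions made up for illustration; they are not the discussion's dataset.

```python
import numpy as np

# Hypothetical data: 6 observations, 3 features (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Y = rng.normal(size=6)

# Solve the normal equation (X^T X) theta = X^T Y directly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_hat)
print(np.allclose(theta_hat, theta_lstsq))
```

Using `np.linalg.solve` rather than forming an explicit inverse is a common design choice for a linear system like this one; it relies on $\Bbb{X}^T\Bbb{X}$ being invertible.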

**(b)** Which of the following conditions are required for the least squares estimate in the previous subpart to be unique, i.e. for $\hat{\theta} = (\Bbb{X}^T\Bbb{X})^{-1}\Bbb{X}^T\Bbb{Y}$ to exist?

A) $\Bbb{X}$ must be full column rank.

B) $\Bbb{Y}$ must be full column rank.

C) $\Bbb{X}$ must be invertible.

D) $\Bbb{X}^T$ must be invertible.

In [None]:
q2b = ...

**c)** What is always true about the residuals in least squares regression? Select all that apply.

A) They are orthogonal to the column space of the design matrix.

B) They represent the errors of the predictions.

C) Their sum is equal to the mean squared error.

D) Their sum is equal to zero.

E) None of the above.

In [None]:
q2c = ...

**d)** Which are true about the predictions made by OLS? Select all that apply.

A) They are projections of the observations onto the column space of the design matrix.

B) They are linear combinations of the features.

C) They are orthogonal to the residuals.

D) They are orthogonal to the column space of the features.

E) None of the above.

In [2]:
q2d = ...

**e)** We fit a simple linear regression to our data $(x_i, y_i)$ for $i \in \{1, 2, \dots, n\}$, where $n$ is the number of samples, $x_i$ is the independent variable, and $y_i$ is the dependent variable. Our regression line is of the form $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$. Suppose we plot the residuals of the model against the $\hat{y}_i$'s and find a curved pattern. What does this tell us about our model?

**(f)** Which of the following are true of the mystery quantity $\vec{v} = (I - \Bbb{X}(\Bbb{X}^T\Bbb{X})^{-1}\Bbb{X}^T) \Bbb{Y}$?

A) The vector $\vec{v}$ represents the residuals for any linear model.

B) If the $\Bbb{X}$ matrix contains the $\vec{1}$ vector, then the sum of the elements in vector $\vec{v}$ is 0 (i.e. $\sum_i v_i = 0$).

C) All the column vectors $x_i$ of $\Bbb{X}$ are orthogonal to $\vec{v}$.

D) If $\Bbb{X}$ is of shape $n$ by $p$, there are $p$ elements in vector $\vec{v}$.

E) For any $\alpha$, $\Bbb{X}\alpha$ is orthogonal to $\vec{v}$.

In [1]:
q2f = ...
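
One way to probe the options in part (f) is to compute $\vec{v}$ numerically and inspect it. The sketch below does this for a made-up design matrix that includes a column of ones; the sizes `n` and `p`, the random data, and the variable names are all assumptions for illustration.

```python
import numpy as np

# Hypothetical design matrix: an intercept column plus 2 random features.
rng = np.random.default_rng(1)
n, p = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

# v = (I - X (X^T X)^{-1} X^T) Y
H = X @ np.linalg.inv(X.T @ X) @ X.T
v = (np.eye(n) - H) @ Y

print(v.shape)   # number of elements in v
print(X.T @ v)   # dot product of each column of X with v
print(v.sum())   # sum of the elements of v
```

Try removing the column of ones from `X` and rerunning to see which of the printed quantities change.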

**g)** Derive the least squares estimate $\hat{\theta}$ by leveraging the geometry of least squares. 

*Note:* While this isn't a "proof" or "derivation" class (and you certainly will not be asked to derive anything of this sort on an exam), we believe that understanding the geometry of least squares enough to derive the least squares estimate shows great understanding of all the linear regression concepts we want you to know! Additionally, it provides great practice with tricky linear algebra concepts such as rank, span, orthogonality, etc.

## Driving with a Constant Model

Adam is trying to use modeling to drive his car autonomously. To do this, he collects a lot of data while driving around his neighborhood, and he wants your help designing models whose outputs can drive the car on his behalf in the future. We will tackle two aspects of this autonomous car modeling framework: going forward and turning.

We show some statistics from the collected dataset below using *DataFrame.describe* from pandas, which returns the count, mean, standard deviation, quartiles, minimum, and maximum for the two columns in the dataset: *target_speed* and *degree_turn*.


<img src="images/describe.png"></img>

**(a)** Suppose the model predicts the target speed of the car. Using constant models trained on the speeds of the collected data shown above with $L_1$ and $L_2$ loss functions, which of the following is true?

A. The model trained with the $L_1$ loss will always drive slower than the model trained with $L_2$ loss.

B. The model trained with the $L_2$ loss will always drive slower than the model trained with $L_1$ loss.

C. The model trained with the $L_1$ loss will sometimes drive slower than the model trained with $L_2$ loss.

D. The model trained with the $L_2$ loss will sometimes drive slower than the model trained with $L_1$ loss.

In [3]:
q3a = ...
q3a

Ellipsis
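
To build intuition for how the two loss functions pick a constant, the sketch below brute-forces the best constant prediction under each average loss and prints it next to the sample mean and median. The `speeds` list, the candidate grid, and the helper names are assumptions made up for illustration; they are not the values in the table above.

```python
from statistics import mean, median

# Hypothetical speeds (made up; not the values from the table).
speeds = [20, 25, 25, 30, 60]

def avg_loss(theta, ys, loss):
    """Average loss of the constant prediction theta over the data ys."""
    return sum(loss(y, theta) for y in ys) / len(ys)

def l2_loss(y, theta):
    return (y - theta) ** 2

def l1_loss(y, theta):
    return abs(y - theta)

# Brute-force search over candidate constants 0.0, 0.1, ..., 80.0.
grid = [i / 10 for i in range(0, 801)]
best_l2 = min(grid, key=lambda t: avg_loss(t, speeds, l2_loss))
best_l1 = min(grid, key=lambda t: avg_loss(t, speeds, l1_loss))

print(best_l2, mean(speeds))
print(best_l1, median(speeds))
```

Changing the largest value in `speeds` and rerunning shows how differently the two constants react to a single extreme observation.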

**(b)** Finding that the model trained with the $L_2$ loss drives too slowly, Adam changes the loss function for the constant model so that errors are penalized **more** when the true speed is higher. Because the loss is larger at high speeds, the model is pushed to fit those cases better, which should make it drive faster. Adam writes this as $L(y, \hat{y}) = y(y - \hat{y})^2$.

Find the optimal $\hat{\theta}$ for the constant model using the new empirical risk function $R(\theta)$ below:

$$
R(\theta) = \frac{1}{n} \sum_i y_i (y_i - \theta)^2
$$
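
A hand-derived minimizer of $R(\theta)$ can be sanity-checked numerically. The sketch below evaluates the risk on a grid of candidate constants for a small made-up sample; the data `ys`, the grid spacing, and the function name `risk` are assumptions for illustration only.

```python
# Hypothetical positive speeds (made up for illustration).
ys = [10.0, 20.0, 30.0, 40.0]

def risk(theta, ys):
    """Empirical risk R(theta) = (1/n) * sum_i y_i * (y_i - theta)^2."""
    return sum(y * (y - theta) ** 2 for y in ys) / len(ys)

# Candidate constants 0.00, 0.01, ..., 50.00.
grid = [i / 100 for i in range(0, 5001)]
theta_best = min(grid, key=lambda t: risk(t, ys))

print(theta_best)          # grid minimizer of the weighted risk
print(sum(ys) / len(ys))   # plain mean, for comparison
```

Plugging the same `ys` into your closed-form expression should reproduce `theta_best` up to the grid spacing.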

answer:

**(c)** Suppose he is working on a model that predicts the degree of turning at a particular time between 0 and 359 degrees using the data in the *degree_turn* column. Explain why a constant model is likely inappropriate in this use case.

*Extra:* If you've studied some physics, you may recognize the behavior of our constant model!

answer:

**(d)** Suppose we finally expand our modeling framework to use simple linear regression (i.e. $f_\theta(x) = \theta_{w,0} + \theta_{w,1}x$). For our first simple linear regression model, we predict the turn angle ($y$) using target speed ($x$). Our optimal parameters are: $\hat{\theta}_{w,1} = 0.019$ and $\hat{\theta}_{w,0} = 143.1$.

However, we realize that we actually want a model that predicts target speed (our new $y$) using turn angle, our new $x$ (instead of the other way around)! What are our new optimal parameters for this new model? (*Hint: use the information in the table.*)
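
A common pitfall here is to assume the new slope is simply the reciprocal of the old one. The sketch below fits a least squares line in both directions on made-up data so the two slopes can be compared; the `speed` and `angle` values and the helper `fit_line` are hypothetical and unrelated to the table.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line predicting ys from xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical paired measurements (made up).
speed = [10.0, 20.0, 30.0, 40.0, 50.0]
angle = [100.0, 150.0, 120.0, 180.0, 160.0]

b0, b1 = fit_line(speed, angle)  # angle predicted from speed
c0, c1 = fit_line(angle, speed)  # speed predicted from angle

print(b1, 1 / b1, c1)  # compare the reversed slope with the naive reciprocal
```

Both fitted lines pass through the point of means, which is why the summary statistics of each column are enough to recover the reversed model.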

answer:

## Modeling using Multiple Regression

We wish to model exam grades for *Data 100* students. We collect various information about student habits, such as how many hours they studied, how many hours they slept before the exam, and how many lectures they attended, and observe how well they did on the exam. Suppose you collect such information on $n$ students and wish to use a multiple-regression model to predict exam grades.
$$
\begin{bmatrix}
         1 & \text{study}_1 & \text{sleep}_1 & \text{lectures}_1 \\
         \vdots & \vdots & \vdots & \vdots \\
         1 & \text{study}_n & \text{sleep}_n & \text{lectures}_n \\
\end{bmatrix}
$$
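
As a rough sketch of how a design matrix of this shape could be assembled and fit in NumPy, the block below stacks a column of ones with three feature columns and calls a least squares solver. The five students, their values, and all variable names are made up for illustration.

```python
import numpy as np

# Hypothetical habits and grades for 5 students (values made up).
study = np.array([10.0, 12.0, 8.0, 15.0, 5.0])
sleep = np.array([7.0, 6.0, 8.0, 5.0, 9.0])
lectures = np.array([20.0, 22.0, 18.0, 25.0, 10.0])
grades = np.array([80.0, 85.0, 75.0, 92.0, 60.0])

# Design matrix: intercept column first, then one column per feature.
X = np.column_stack([np.ones(len(study)), study, sleep, lectures])

# Least squares fit and in-sample predictions.
theta_hat, *_ = np.linalg.lstsq(X, grades, rcond=None)
predictions = X @ theta_hat

print(X.shape)
print(theta_hat)
```

The order of the columns in `X` determines which entry of `theta_hat` goes with which feature, so it is worth fixing that order once and reusing it for any new rows.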

**(a)** Suppose on our $n$ individuals, we construct our design matrix $\mathbb{X}$, adding an **intercept term**, and use the OLS formula to obtain the following $\hat{\theta}$:
$$
\begin{align*}
    \hat{\theta} = \begin{bmatrix} 0.5 \\ 3 \\ 2 \\ 1 \end{bmatrix}
\end{align*}
$$

Suppose our design matrix $\mathbb{X}$ was constructed such that the first column is the bias (intercept), the second column represents how many hours each of the $n$ students studied, the third contains how many hours each student slept before the exam, and the fourth represents how many lectures each student attended. With this knowledge, give an interpretation of what each entry of $\hat{\theta}$ means in context.

answer:

**(b)** After fitting this model, suppose we would like to predict the exam grades of two individuals using these variables. Individual 1 slept 10 hours, studied 15 hours, and attended 4 lectures. Individual 2 slept 5 hours, studied 20 hours, and attended 10 lectures. Construct a matrix $\mathbb{X}'$ such that, if you computed $\mathbb{X}'\hat{\theta}$, you would obtain a vector of each individual's predicted exam scores.

answer:

**(c)** Let $\mathbb{Y}'$ denote a $2 \times 1$ vector that represents the actual exam scores of the individuals we are predicting on. Write out an expression that evaluates to the MSE of our predictions, written as a function of the squared $L_2$ norm of a vector.

answer: