# Simple Linear Regression

This notebook contains exercises to give you more experience with the material presented in the Simple Linear Regression lecture. You'll get some practice fitting models and gain a stronger theoretical understanding of the technique. We'll also introduce some important new concepts that weren't explicitly covered in the lecture.

In [2]:
# import the packages we'll use
## For data handling
import pandas as pd
import numpy as np
from numpy import meshgrid

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

### Theoretical Questions

##### 1. Mixing up $X$ and $y$

Explain how simple linear regression works. Suppose we go out and collect some data, with $X$ a single feature and $y$ the target variable. If the true relationship between $y$ and $X$ is $y = X + \epsilon$, what should the output of SLR be? Now suppose we mistakenly treat $X$ as the target and $y$ as the feature, and regress $X$ on $y$. What would you expect to happen to the estimate $\hat{\beta_1}$? What about in the limit as the variance of $\epsilon$ goes to $\infty$?

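#### answer1
1. The output should be $\hat{\beta_1} \approx 1$ and $\hat{\beta_0} \approx 0$.
2. The slope estimate becomes $\hat{\beta_1} = \frac{\sum_{i=1}^n \left( X_i - \overline{X} \right)\left( y_i - \overline{y} \right)}{\sum_{i=1}^n \left( y_i - \overline{y} \right)^2}$, which is the original slope estimate multiplied by $s_X^2 / s_y^2$. Because $\text{Var}(y) = \text{Var}(X) + \sigma^2$ when $\epsilon$ is independent of $X$, we expect this to be close to $\frac{\text{Var}(X)}{\text{Var}(X) + \sigma^2}$, which is less than $1$.
3. As the variance of $\epsilon$ goes to $\infty$, the ratio $\frac{\text{Var}(X)}{\text{Var}(X) + \sigma^2}$ goes to $0$, so the estimated slope shrinks toward $0$.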
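
A quick simulation can make the effect of swapping $X$ and $y$ concrete. The sketch below is only an illustration: the sample size, the choice $\text{Var}(X) = 1$, the noise levels, and the helper `slope` are all assumptions made for this example. It compares the slope from regressing $y$ on $X$ with the slope from regressing $X$ on $y$ as the noise standard deviation grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(loc=0.0, scale=1.0, size=n)  # assumed: Var(X) = 1


def slope(feature, target):
    """Least squares slope from regressing `target` on `feature`."""
    fc = feature - feature.mean()
    tc = target - target.mean()
    return np.sum(fc * tc) / np.sum(fc ** 2)


results = {}
for sigma in [0.5, 1.0, 3.0, 10.0]:
    # true relationship y = X + epsilon with Var(epsilon) = sigma^2
    y = X + rng.normal(scale=sigma, size=n)
    forward = slope(X, y)                  # regress y on X
    reverse = slope(y, X)                  # regress X on y (the mix-up)
    expected_reverse = 1.0 / (1.0 + sigma ** 2)  # Var(X) / (Var(X) + sigma^2)
    results[sigma] = (forward, reverse, expected_reverse)
    print(f"sigma={sigma:5.1f}  y~X slope={forward:.3f}  "
          f"X~y slope={reverse:.3f}  expected={expected_reverse:.3f}")
```

Under these assumptions the slope from regressing $y$ on $X$ stays near $1$ at every noise level, while the slope from regressing $X$ on $y$ tracks $\text{Var}(X)/(\text{Var}(X) + \sigma^2)$ and shrinks as $\sigma$ grows.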






##### 2. An Introduction to Maximum Likelihood Estimation (MLE)

In this question we'll introduce the concept of maximum likelihood estimation to derive the formula for $\hat{\beta_1}$. Assume the standard SLR assumptions. Let $y$ denote the target variable, let $X$ denote the feature variable and suppose the true relationship between $y$ and $X$ is $y = \beta_0 + \beta_1 X + \epsilon$. As usual assume there are $n$ observations.

For now let's look at the first observation, $(X_1,y_1)$. The likelihood of observing $y_1$ given $X_1$ is
$$
f\left(y_1|X_1;\beta_0,\beta_1\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{\left(y_1 - \left(\beta_0 + \beta_1 X_1\right)\right)^2}{\sigma^2}\right)
$$
because we have assumed that $\epsilon\sim N(0,\sigma^2)$. You can think of this as the probability density of observing $y_1$ given $X_1$ and our model parameters. The goal of maximum likelihood estimation is to choose the parameters, in this case $\beta_0$ and $\beta_1$, that maximize the likelihood.

Because we've assumed independence of our observations the likelihood of observing $y$ given $X$ is:
$$
f\left(y|X;\beta_0,\beta_1\right) = \prod_{i=1}^n f\left(y_i|X_i;\beta_0,\beta_1\right) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{\left(y_i - \left(\beta_0 + \beta_1 X_i\right)\right)^2}{\sigma^2}\right)
$$

Take the partial derivatives of $f\left(y|X;\beta_0,\beta_1\right)$ with respect to $\beta_0$ and $\beta_1$, then set these equal to $0$ and solve to find the maximum likelihood estimator for simple linear regression.

Hint: Try maximizing $\log\left(f\left(y|X;\beta_0,\beta_1\right)\right)$ instead. Because $\log$ is a strictly increasing function, this is the same as maximizing $f\left(y|X;\beta_0,\beta_1\right)$.
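
Once you have a candidate answer, you can sanity check it numerically. The sketch below assumes synthetic data with made-up true parameters and a known $\sigma$ (none of these values come from the lecture). It evaluates the log-likelihood at the usual closed-form SLR estimates and checks, with central finite differences, that both partial derivatives are approximately $0$ there.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.5  # assumed values for illustration
X = rng.uniform(0, 10, size=n)
y = beta0_true + beta1_true * X + rng.normal(scale=sigma, size=n)


def log_likelihood(beta0, beta1):
    """Log of the product of normal densities, treating sigma as known."""
    resid = y - (beta0 + beta1 * X)
    return (-0.5 * n * np.log(2 * np.pi * sigma ** 2)
            - np.sum(resid ** 2) / (2 * sigma ** 2))


# closed-form SLR estimates
beta1_hat = (np.sum((X - X.mean()) * (y - y.mean()))
             / np.sum((X - X.mean()) ** 2))
beta0_hat = y.mean() - beta1_hat * X.mean()

# central finite differences approximate the partial derivatives
h = 1e-4
d_beta0 = (log_likelihood(beta0_hat + h, beta1_hat)
           - log_likelihood(beta0_hat - h, beta1_hat)) / (2 * h)
d_beta1 = (log_likelihood(beta0_hat, beta1_hat + h)
           - log_likelihood(beta0_hat, beta1_hat - h)) / (2 * h)

# the log-likelihood at nearby parameter values should be lower
best = log_likelihood(beta0_hat, beta1_hat)
nearby = [log_likelihood(beta0_hat + d0, beta1_hat + d1)
          for d0 in (-0.1, 0.1) for d1 in (-0.05, 0.05)]

print("partial derivatives at the estimates:", d_beta0, d_beta1)
print("log-likelihood at estimates exceeds all nearby values:",
      all(best > value for value in nearby))
```

This does not replace the derivation, but it is a cheap way to catch an algebra slip before moving on.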

##### 3. Deriving the Standard Error for $\hat{\beta_0}$ and $\hat{\beta_1}$

For any parameter, $\theta$, you can find the standard error of the estimate, $\hat{\theta}$, by taking the square root of the variance of the estimate.

Recall that the formulas for $\hat{\beta_0}$ and $\hat{\beta_1}$ from SLR are:
- $\hat{\beta_1} = \frac{\sum_{i=1}^n \left( X_i - \overline{X} \right)\left( y_i - \overline{y} \right)}{\sum_{i=1}^n \left( X_i - \overline{X} \right)^2}$ 

- $\hat{\beta_0} = \overline{y} - \hat{\beta_1} \overline{X}$

First find the standard error of $\hat{\beta_1}$, then use that to find the standard error of $\hat{\beta_0}$. 

Hint: Recall that $\overline{y} = \sum_{i=1}^n y_i/n$ and $y_i = \beta_0 + \beta_1 X_i + \epsilon_i$.

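#### answer3
Write $S_{XX} = \sum_{i=1}^n \left( X_i - \overline{X} \right)^2$. Then
$$
V\left(\hat{\beta_1}\right) = \frac{\sigma^2}{S_{XX}}, \qquad SE\left(\hat{\beta_1}\right) = \frac{\sigma}{\sqrt{S_{XX}}}
$$

$$
V\left(\hat{\beta_0}\right) = \sigma^2\left[\frac{1}{n} + \frac{\overline{X}^2}{S_{XX}}\right], \qquad SE\left(\hat{\beta_0}\right) = \sigma\sqrt{\frac{1}{n} + \frac{\overline{X}^2}{S_{XX}}}
$$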
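
One way to check standard error formulas like these is a Monte Carlo experiment. The sketch below holds a made-up set of $X$ values fixed, simulates many data sets from assumed values of $\beta_0$, $\beta_1$ and $\sigma$, and compares the empirical standard deviation of the estimates with $\sigma/\sqrt{S_{XX}}$ and $\sigma\sqrt{1/n + \overline{X}^2/S_{XX}}$. All numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sims = 50, 20_000
beta0, beta1, sigma = 1.0, 2.0, 3.0   # assumed true parameters
X = rng.uniform(0, 10, size=n)        # feature values held fixed across simulations
Xc = X - X.mean()
S_XX = np.sum(Xc ** 2)

# each row of Y is one simulated data set of n observations
eps = rng.normal(scale=sigma, size=(n_sims, n))
Y = beta0 + beta1 * X + eps

# closed-form SLR estimates, computed for every simulated data set at once
beta1_hats = (Y - Y.mean(axis=1, keepdims=True)) @ Xc / S_XX
beta0_hats = Y.mean(axis=1) - beta1_hats * X.mean()

se_beta1_formula = sigma / np.sqrt(S_XX)
se_beta0_formula = sigma * np.sqrt(1 / n + X.mean() ** 2 / S_XX)

print("SE(beta1_hat): formula", se_beta1_formula, "simulated", beta1_hats.std(ddof=1))
print("SE(beta0_hat): formula", se_beta0_formula, "simulated", beta0_hats.std(ddof=1))
```

In practice $\sigma$ is unknown and is replaced by the residual standard error, $\sqrt{\sum_{i=1}^n (y_i - \hat{y_i})^2/(n-2)}$.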








##### 4. Deriving the Standard Error for $E(y|X=X^*)$

Use the solution to 3. to find the standard error of $E(y|X=X^*)$.

In [None]:
## Code here or write here










##### 5. Prediction Intervals for SLR

Recall our discussion on confidence intervals for $E(y|X=X^*)$.

In addition to a confidence interval for the conditional mean, you can also produce what are known as prediction intervals for $y|X=X^*$, which give us a sense of what reasonable lower and upper bounds are for $y|X=X^*$ for a given confidence level, $1-\alpha$.

Recall that the $(1-\alpha)$ confidence interval formula for $E(y|X=X^*)$ was given by:
$$
\hat{y} \pm t_{n-2,(1-\alpha/2)}\sqrt{\frac{\sum_{i=1}^n\left(y_i - \hat{y_i}\right)^2}{n-2}}\sqrt{\frac{1}{n} + \frac{\left(X^* - \overline{X}\right)^2}{(n-1)s_X^2}}.
$$

The formula for the $(1-\alpha)$ prediction interval is quite similar:
$$
\hat{y} \pm t_{n-2,(1-\alpha/2)}\sqrt{\frac{\sum_{i=1}^n\left(y_i - \hat{y_i}\right)^2}{n-2}}\sqrt{1 + \frac{1}{n} + \frac{\left(X^* - \overline{X}\right)^2}{(n-1)s_X^2}}.
$$
To see a derivation of this formula, check out <a href="https://online.stat.psu.edu/stat414/node/298/">https://online.stat.psu.edu/stat414/node/298/</a>, and note that what they refer to as MSE is $\frac{\sum_{i=1}^n\left(y_i - \hat{y_i}\right)^2}{n-2}$, so the first square root above is $\sqrt{MSE}$. The addition of $1$ in the second square root reflects the extra uncertainty involved in predicting the actual $y$ value for a value of $X$, and comes from the error term in the statistical model, $\epsilon$. This does not show up in the confidence interval because $E(\bullet)$ is linear and $E(\epsilon)$ is assumed to be $0$.
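
The two formulas can be sketched in code as follows. This is a minimal illustration on synthetic data, not a solution for the `baseball` data: the helper name `slr_intervals` and the simulated values are assumptions made for this example, and the $t$ quantile comes from `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats


def slr_intervals(X, y, X_star, alpha=0.05):
    """Return fitted values at X_star plus the half-widths of the
    (1 - alpha) confidence and prediction intervals for SLR."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_star = np.atleast_1d(np.asarray(X_star, dtype=float))
    n = len(X)

    S_XX = np.sum((X - X.mean()) ** 2)  # equals (n - 1) * s_X^2
    beta1_hat = np.sum((X - X.mean()) * (y - y.mean())) / S_XX
    beta0_hat = y.mean() - beta1_hat * X.mean()

    y_hat = beta0_hat + beta1_hat * X_star
    residuals = y - (beta0_hat + beta1_hat * X)
    root_mse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

    inner = 1 / n + (X_star - X.mean()) ** 2 / S_XX
    ci_half = t_crit * root_mse * np.sqrt(inner)      # for E(y | X = X*)
    pi_half = t_crit * root_mse * np.sqrt(1 + inner)  # for y | X = X*
    return y_hat, ci_half, pi_half


# synthetic example data (assumed, for illustration only)
rng = np.random.default_rng(4)
X = rng.normal(size=200)
y = 3 + 2 * X + rng.normal(scale=1.0, size=200)

y_hat, ci_half, pi_half = slr_intervals(X, y, X_star=X, alpha=0.02)
coverage = np.mean(np.abs(y - y_hat) <= pi_half)
print("share of observations inside the 98% prediction interval:", coverage)
```

The prediction interval is always wider than the confidence interval at the same $X^*$, and both are narrowest at $X^* = \overline{X}$.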

Return to the `baseball` data and produce a $98\%$ prediction interval around the regression line created by regressing `W` on `RD`.

In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










### Applied Questions

##### 1. Origins of Regression to the Mean.

From Wikipedia:

<q><i>
    The concept of regression comes from genetics and was popularized by Sir Francis Galton during the late 19th century with the publication of Regression towards mediocrity in hereditary stature. Galton observed that extreme characteristics (e.g., height) in parents are not passed on completely to their offspring. Rather, the characteristics in the offspring regress towards a mediocre point (a point which has since been identified as the mean). By measuring the heights of hundreds of people, he was able to quantify regression to the mean, and estimate the size of the effect. Galton wrote that, "the average regression of the offspring is a constant fraction of their respective mid-parental deviations". This means that the difference between a child and its parents for some characteristic is proportional to its parents' deviation from typical people in the population. If its parents are each two inches taller than the averages for men and women, then, on average, the offspring will be shorter than its parents by some factor (which, today, we would call one minus the regression coefficient) times two inches. For height, Galton estimated this coefficient to be about 2/3: the height of an individual will measure around a midpoint that is two thirds of the parents' deviation from the population average. 
    </i></q>

Load in the data set `galton.csv`.

Create two subsets called `male` and `female`. 

For the `male` data, regress height on the father's height. For the `female` data, regress height on the mother's height.

Check the linear regression assumptions in the case of the `male` model.

Perform a hypothesis test to check for evidence of a linear relationship between father height and son height.

Interpret your output. Is what you find consistent with the Wikipedia entry?

Create prediction intervals around both of the regression lines. Use $\alpha = 0.05$.

In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










##### 2. Let's do that Hockey

Predicting which teams will do well is a pretty common goal in sports. A common approach is to assume that the hot team will just keep winning (for basketball shooting this is called the <a href="https://en.wikipedia.org/wiki/Hot_hand">hot hand fallacy</a>). While a win may be an indicator of a team's overall skill level, there are some sports where a win is more an indicator of luck; see this YouTube video: <a href="https://www.youtube.com/watch?v=HNlgISa9Giw&t=123s">https://www.youtube.com/watch?v=HNlgISa9Giw&t=123s</a>. An example of a more luck-based sport is <a href="https://www.wired.com/2012/11/luck-and-skill-untangled-qa-with-michael-mauboussin/">hockey</a>. We'll examine winning trends in hockey in this problem.

Load in the data from `hockey.csv`. This data contains the total wins from the first half of the season and the total wins from the second half of the season for each NHL team from 2016-2019. You'll look at this in both the explanatory and predictive sense.

Hold out the 2019 season as a test set.

Using the training data, plot `second_half_wins` against `first_half_wins`. Does there appear to be a linear relationship? Build an SLR model regressing `second_half_wins` on `first_half_wins`. What is the estimate of the slope? Plot the estimated line over the training data, also include the line $y=x$ for comparison. Calculate the Root Mean Square Error on both the training and test data.
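
The hold-out and RMSE steps might look roughly like the sketch below. It runs on a synthetic stand-in rather than `hockey.csv`, and the `season` column name and the simulated win totals are assumptions made for this example. Only `first_half_wins` and `second_half_wins` are named in the problem.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic stand-in for hockey.csv; the `season` column name is an assumption
seasons = np.repeat([2016, 2017, 2018, 2019], 30)
first = rng.integers(12, 30, size=seasons.size)
second = np.round(10 + 0.5 * first + rng.normal(scale=3, size=seasons.size))
df = pd.DataFrame({"season": seasons,
                   "first_half_wins": first,
                   "second_half_wins": second})

# hold out the 2019 season as the test set
train = df.loc[df["season"] < 2019]
test = df.loc[df["season"] == 2019]

# fit the SLR line on the training data only
slope, intercept = np.polyfit(train["first_half_wins"].to_numpy(),
                              train["second_half_wins"].to_numpy(),
                              deg=1)


def rmse(data):
    """Root mean square error of the fitted line on `data`."""
    pred = intercept + slope * data["first_half_wins"].to_numpy()
    return np.sqrt(np.mean((data["second_half_wins"].to_numpy() - pred) ** 2))


train_rmse, test_rmse = rmse(train), rmse(test)
print("estimated slope:", slope)
print("training RMSE:", train_rmse, "test RMSE:", test_rmse)
```

Fitting on the training seasons only, then scoring the held-out season, keeps the test RMSE an honest estimate of predictive performance.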

In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










In [None]:
## Code here or write here










According to the article and video linked above, basketball is supposed to be more skill-based than hockey. Let's explore that!

Load in the data from `basketball.csv`. This data contains the total wins from the first half of the season and the total wins from the second half of the season for each NBA team from 2016-2019. Repeat the steps for the NHL data above, but on the NBA data.

Compare and contrast your findings. Which line had the higher slope? Which model had the better RMSE? Make a plot that contains both regression lines and the line $y=x$. What do you notice?