<a href="https://colab.research.google.com/github/tsvoronos/API202-students/blob/main/API_202_ReviewSession1_MD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# API-202 Review Session #1
**Friday, January 27**

TF: Matthew Dodier

# Table of Contents
1. [Lecture Recap](#Lecture-Recap)
2. [Exercises](#Exercises)
3. [Appendix](#Appendix)

# Lecture Recap

* Bivariate regression: a simple analytical tool to quantify the relation between any two numeric variables. We'll be able to describe:
  1. The **direction** of the relation;
  2. The **magnitude** of the relation;
  3. Our **confidence in the statistical significance** of the relation.

## Bivariate Regression

* Suppose we want to know the relationship (association) between class size and test scores at the school-district level.
* Assume we have data for each school district. For each district, we have:
  * A mean test score that will be our outcome Y 
  * A class size measure (student-teacher ratio) that will be our treatment X
____________________  

## What Questions Might We Answer?
Some questions these data might answer:
  * *On average, what is the association between test scores and student-teacher ratio?*
  * *What is the predicted change in test score associated with an increase in the student-teacher ratio?*
____________________

## How Do We Answer These Questions? 
* Ordinary Least Squares (OLS) regression analysis will identify the **line** that **best fits** the data.
  * OLS chooses the line that *minimizes* the sum of squared residuals

$$
\hat{Y}_i=\hat{\beta}_0+\hat{\beta}_1 * X_i
$$

by 

$$
\min \sum_i \hat{\mu}_i^2=\min \sum_i\left(Y_i-\hat{Y}_i\right)^2
$$

* Two interpretation notes:
  1. Our line of best fit tells us the expected or predicted value of outcome $Y$ given a value of $X$.
  2. The line of best fit gives an average (predicted) value, not the actual value for any individual observation.
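In the bivariate case, the minimization above has a standard closed-form solution: $\hat{\beta}_1=\sum_i (X_i-\bar{X})(Y_i-\bar{Y}) / \sum_i (X_i-\bar{X})^2$ and $\hat{\beta}_0=\bar{Y}-\hat{\beta}_1 \bar{X}$. The sketch below is only an illustration, assuming R's built-in `cars` dataset (speed and stopping distance) rather than the class-size data. It computes these estimates by hand and compares them with `lm()`. The helper `ssr()` is a hypothetical name introduced for this example.

```r
# Built-in data: speed (mph) and stopping distance (ft) for 50 cars
data(cars)
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates for a bivariate regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Hypothetical helper: sum of squared residuals for any candidate line
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

# lm() should report the same intercept and slope
fit <- lm(dist ~ speed, data = cars)
coef(fit)
c(b0, b1)

# Moving away from the OLS line in either direction increases the SSR
ssr(b0, b1)
ssr(b0, b1 + 0.1)
ssr(b0 + 1, b1)
```

Comparing `ssr()` at nearby lines is a quick way to see what "best fit" means: no other intercept and slope produce a smaller sum of squared residuals.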


## Step Back: Theorize for Population…


* Bivariate regression assumes a linear relationship between two variables. The **population regression function** (PRF) is written:

  $$
  Y_i=\beta_0+\beta_1 * X_i+\mu_i
  $$

  * $Y_i$ : Dependent variable, LHS variable
  * $\beta_0$ : Constant, intercept (*value of $Y$ when $X$ is 0*)
  * $\beta_1$ : Slope, coefficient of interest (*change in $Y$ associated with a one-unit increase in $X$*)
  * $X_i$ : Independent variable, RHS variable, explanatory variable, covariate, regressor
  * $\mu_i$ : Error term (*all the variation in $Y$ not explained by $X$*); its sample counterpart is the residual $\hat{\mu}_i$

* This simple theoretical model is what we expect to hold in our population of reference. It describes how we think the world works.


## …and Estimate in a Sample
* When we actually estimate the coefficients on a sample, the sample regression function (SRF) is written: 

$$Y_i=\hat{\beta}_0+\hat{\beta}_1 * X_i+\hat{\mu}_i$$

* The OLS line of best fit chooses values of the slope ($\hat{\beta}_1$) and intercept ($\hat{\beta}_0$) that minimize the sum of the squared residuals in the sample at hand.

$$
\hat{Y}_i=\hat{\beta}_0+\hat{\beta}_1 X_i
$$

* The difference between an observation's actual $Y$ and the value predicted by the line of best fit is its residual $\left(\hat{\mu}_i=Y_i-\hat{Y}_i\right)$.
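To make fitted values and residuals concrete, here is a minimal sketch, again assuming the built-in `cars` dataset as a stand-in sample (it is not the school-district data). It extracts $\hat{Y}_i$ and the residuals from a fitted model and checks that each residual equals the actual value minus the predicted value.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

y_hat <- fitted(fit)     # predicted values from the line of best fit
u_hat <- residuals(fit)  # residuals: actual minus predicted

# Each residual is Y_i minus its fitted value
all.equal(unname(u_hat), cars$dist - unname(y_hat))

# With an intercept in the model, OLS residuals sum to (numerically) zero
sum(u_hat)

# The quantity OLS minimizes: the sum of squared residuals
ssr_value <- sum(u_hat^2)
ssr_value
```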


## Estimate Interpretation

* What is the meaning of $\hat{\beta}_1$? How does it relate $Y$ to $X$?

The slope coefficient $\hat{\beta}_1$ tells us the average change in $Y$ associated with a one-unit increase in $X$.


* Note: $\hat{\beta}_1$ has the *same sign* as the *correlation* between $X$ and $Y$.

The constant (or Y-intercept) $\hat{\beta}_0$ tells us the predicted value of $Y$ for observations with $X=0$.
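The sketch below illustrates these interpretations on the built-in `cars` dataset (an assumption made for illustration only). It checks that the estimated slope shares the sign of the sample correlation, that the slope equals the correlation rescaled by the ratio of standard deviations (a standard identity for bivariate OLS), and that raising $X$ by one unit changes the prediction by exactly the slope.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# The slope has the same sign as the sample correlation
r <- cor(cars$speed, cars$dist)
sign(b[["speed"]]) == sign(r)

# Slope = correlation * sd(Y) / sd(X)
r * sd(cars$dist) / sd(cars$speed)
b[["speed"]]

# A one-unit increase in X changes the predicted Y by the slope
pred <- predict(fit, newdata = data.frame(speed = c(10, 11)))
diff(pred)

# The intercept is the predicted Y at X = 0
predict(fit, newdata = data.frame(speed = 0))
```

In this dataset a speed of 0 lies outside the observed range, so the intercept is an extrapolation and should be read with caution.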

## Making Inferences, i.e., Hypothesis Testing
* With a sample of data, we construct our best guess of the line of best fit for the population; that is, we use the sample to learn about the theoretical model (the PRF).
* We denote the true population coefficients as $\beta$ and our estimated coefficients from the sample as $\hat{\beta}$.
* Statistical inference quantifies the effect of sampling fluctuation, meaning we want to know how likely it is that $\hat{\beta}$ is close to $\beta$.
* When we reject the null hypothesis $H_0: \beta_1=0$ at the $5 \%$ level, we say that $\hat{\beta}_1$ is statistically significant. This means we have strong evidence that the relationship between $X$ and $Y$ in the population is not zero.
  * This does not, however, necessarily imply that $X$ causes $Y$!
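In R, the pieces needed for this test can be read off a fitted model. As a rough sketch, assuming the `broom` package and the built-in `cars` dataset (chosen only for illustration), `tidy()` returns one row per coefficient with its estimate, standard error, t-statistic, and p-value.

```r
library(broom)

data(cars)
fit <- lm(dist ~ speed, data = cars)

# One row per coefficient: estimate, std.error, statistic (t), p.value,
# plus conf.low / conf.high because conf.int = TRUE
results <- tidy(fit, conf.int = TRUE, conf.level = 0.95)
results

# Is the slope statistically significant at the 5% level?
slope <- results[results$term == "speed", ]
slope$p.value < 0.05
```

A significant slope here says only that speed and stopping distance are associated in the sample in a way unlikely under $H_0$. It does not by itself establish a causal effect.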

## Takeaways
* Bivariate regression helps us establish the direction and measure the magnitude of the relationship between two variables.
* Hypothesis testing and confidence intervals help us measure how certain we are about the magnitude in the population.
* If we reject the null hypothesis $H_0$: $\beta_1 = 0$ at the $5 \%$ level, we say:
  * "$X$ is statistically significantly associated with $Y$."
* For reasons we will explore more in future lectures, this is not the same as:
  * "$X$ causes $Y$."

# Exercises

The purpose of these exercises is to help you learn the basic mechanics of ordinary least squares (OLS) regression in R, including reading summaries of regression results and hypothesis testing. We will also do data visualization using the ggplot2 package.

```r
# suppress warnings
options(warn = -1, dplyr.summarise.inform = FALSE)

# load packages
library(tidyverse)
library(broom)

# load data
data(mtcars)
```

```
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
```



This dataset includes cars in the 1974 issue of Motor Trend magazine. We want to look at the relationship between weight and miles per gallon.

**a. Let's start by visualizing the data with a scatterplot. Put miles per gallon on the y-axis and weight on the x-axis.**

* ggplot2 is a package within tidyverse for creating graphics.
  * You must supply data and aesthetics.
  * Always start with something like ggplot(data, aes(x = var1, y = var2, color = var3, ...)).
  * Aesthetics map variables in your data to x/y positions, colors, line types, groups, and more.
* Add geometries like geom_histogram(), geom_point(), geom_line(), geom_abline(), and more with +.
* Label axes with labs().
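The pattern described above can be sketched as follows. This example deliberately uses the built-in `cars` dataset rather than `mtcars`, so the variable names and axis labels are assumptions for illustration and not the answer to the exercise.

```r
library(ggplot2)

data(cars)

# Supply data and aesthetics first, then add a geometry and labels with +
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

p  # printing the object draws the plot
```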

```r
# Your answer here!

# Start


# End
```


**b. Run a linear regression of miles per gallon (mpg) on weight (wt). From the summary R output, what do $\hat{\beta_0}$ and $\hat{\beta_1}$ equal?**

```r
# Your code here

# Start

# End
```

#### Your answer here

#### Start

#### End

**c. Interpret the coefficient on weight.  Also, consider the hypothesis $H_0: \beta_1 = 0$ in the context of the above regression.  Would you reject this hypothesis at the $5 \%$ significance level?  Explain.**

#### Your answer here

#### Start

#### End


**d. Now that we have some evidence about the relationship between miles per gallon and weight, let's go back to our scatterplot and add a line that captures this relationship. Again, put miles per gallon on the y-axis and weight on the x-axis.**

```r
# Your answer here!

# Start

# End
```


**e. This dataset contains information on whether a car model has a manual or automatic transmission. How could we adjust our scatterplot to differentiate between data points of automatic and manual transmission cars?**

Hint: Within the aes() function of your ggplot command, you can add color = factor(am) to color the points by transmission type. (Adding group = am alone splits the data into groups, for example for separate fitted lines, but does not change how the points are drawn.)

```r
# Your answer here!

# Start

# End
```


# Appendix

## Hypothesis Testing: Technical Recap
1. State the null hypothesis, $H_0: \beta_1=a$ (usually, $a=0$)
2. Set a significance level, $\alpha$ (usually, $\alpha=0.05$)
3. Calculate sample coefficient (estimate), $\hat{\beta}_1$
4. Calculate standard error, $S E\left(\hat{\beta}_1\right)$
5. Calculate t-statistic, $t=\frac{\widehat{\beta}_1-a}{S E\left(\widehat{\beta}_1\right)}$
  * In large samples, the t-statistic approximately follows a standard normal distribution when $H_0$ is true, so we compare $|t|$ to 1.96.
6. Calculate the p-value: if $H_0$ were true, what is the probability of drawing a sample whose estimate $\widehat{\beta}_1$ is at least as far from $a$ as the one observed?
  * The p-value is computed assuming the null hypothesis holds in the population. It is not the probability that $H_0$ is true, nor the probability that our estimate is "due to chance."
  * We compare the p-value to 0.05.
7. We can construct a 95% *confidence interval* as:
$$
\left[\widehat{\beta}_1-1.96 * S E\left(\widehat{\beta}_1\right),\ \widehat{\beta}_1+1.96 * S E\left(\widehat{\beta}_1\right)\right]
$$
  * Statistical interpretation: 95% of intervals constructed in this way from different samples will contain the true slope value.
8. Perform the test: reject $H_0$ if the p-value $< \alpha$; otherwise, fail to reject.
(With $\alpha = 0.05$, reject $H_0$ if $|t| > 1.96$, if the p-value $< 0.05$, or if the 95% confidence interval does not contain $a$.)
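
The steps above can be traced by hand in R. The following is a sketch under stated assumptions: it uses the built-in `cars` dataset (not the data from the exercises), and the variable names `a`, `alpha`, `t_stat`, `crit`, and `reject` are introduced only for this example. Note that `lm()` bases its p-values and confidence intervals on a t distribution with $n-2$ degrees of freedom rather than the large-sample value 1.96, so the critical value below is slightly larger than 1.96 in this 50-observation sample.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

a     <- 0     # Step 1: hypothesized slope under H0
alpha <- 0.05  # Step 2: significance level

# Steps 3-4: estimate and standard error of the slope
coefs <- summary(fit)$coefficients
b1 <- coefs["speed", "Estimate"]
se <- coefs["speed", "Std. Error"]

# Step 5: t-statistic
t_stat <- (b1 - a) / se

# Step 6: two-sided p-value from the t distribution with n - 2 df
df_resid <- df.residual(fit)
p_value  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# Step 7: 95% confidence interval (exact t critical value instead of 1.96)
crit <- qt(1 - alpha / 2, df = df_resid)
ci   <- c(b1 - crit * se, b1 + crit * se)

# Step 8: the three decision rules agree
reject <- p_value < alpha
c(p_rule = reject,
  t_rule = abs(t_stat) > crit,
  ci_rule = a < ci[1] || a > ci[2])
```

The hand-computed interval should match `confint(fit)`, and the t-statistic and p-value should match the `speed` row of `summary(fit)`.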
