# Chapter 8 - Linear Regression

* can **model** the relationship between two variables with a line and give its equation
* correlation tells us how strong the relationship is, but it doesn't tell us what the line is
* the **linear model** is an equation of a straight line through the data
* being a model, it won't match reality exactly, but it can help us understand how the variables are associated

## Residuals

* **predicted value** is the estimate made from a model
* denoted as $\hat{y}$ to distinguish it from the _true_ value $y$
* the difference between an observed value and its predicted value is called the **residual**
* $\text{residual} = \text{observed value} - \text{predicted value}$
    * a _negative_ value means an _overestimate_
    * a _positive_ value means an _underestimate_
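
As a minimal sketch of the sign convention above, the snippet below computes residuals for a handful of made-up observations. The data and the model $\hat{y} = 2.2 + 0.6x$ are assumptions chosen purely for illustration; they do not come from the chapter.

```python
# Made-up data and an assumed model y-hat = 2.2 + 0.6x (illustration only).
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [2.0, 4.0, 5.0, 4.0, 5.0]


def predict(x):
    """Predicted value (y-hat) from the assumed linear model."""
    return 2.2 + 0.6 * x


def residual(y, y_hat):
    """Residual = observed value - predicted value."""
    return y - y_hat


residuals = [residual(y, predict(x)) for x, y in zip(x_values, observed)]

for x, y, e in zip(x_values, observed, residuals):
    # Negative residual: the model predicted too high (overestimate).
    # Positive residual: the model predicted too low (underestimate).
    label = "overestimate" if e < 0 else "underestimate"
    print(f"x={x}: observed={y}, predicted={predict(x):.1f}, residual={e:+.1f} ({label})")
```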

## "Best Fit" Means Least Squares

* some residuals are positive, some negative -- simply adding them up would let them cancel each other out
* we deal with that by squaring the residual values
* the **line of best fit** is the line for which the sum of the squared residuals is smallest, the **least squares** line
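
The sketch below illustrates both points on made-up data: the raw residuals of the least squares line sum to zero, while the sum of squared residuals is smaller for that line than for a few nearby lines. The data and the candidate lines are hypothetical; the coefficients 2.2 and 0.6 are the least squares values for this particular data set.

```python
# Made-up data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def residuals(b0, b1, xs, ys):
    """Residuals y - y-hat for the line y-hat = b0 + b1*x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum(e ** 2 for e in residuals(b0, b1, xs, ys))


# Least squares coefficients for this data set.
best_b0, best_b1 = 2.2, 0.6
raw_sum = sum(residuals(best_b0, best_b1, x, y))
best = sum_squared_residuals(best_b0, best_b1, x, y)
print(f"sum of residuals: {raw_sum:.3f}, sum of squared residuals: {best:.3f}")

# A few hypothetical alternative lines, each slightly different.
candidates = [(2.0, 0.6), (2.2, 0.7), (2.5, 0.5), (1.9, 0.7)]
for b0, b1 in candidates:
    sse = sum_squared_residuals(b0, b1, x, y)
    print(f"b0={b0}, b1={b1}: sum of squared residuals = {sse:.3f}")
```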

## The Linear Model

* $\hat{y} = b_0 + b_1x$
    * $\hat{y}$: predicted values
    * $b_0$ and $b_1$: **coefficients** of the linear model
    * $b_0$: **intercept**: where the line intercepts the $y$-axis
    * $b_1$: **slope**: how rapidly $\hat{y}$ changes with respect to $x$
* slopes are always expressed in $y$-units per $x$-units; they tell how the $y$-variable changes (in its units) for a one unit change in the $x$-variable
* the intercept ($b_0$) serves largely as a "starting value" for predictions, particularly when $x = 0$ is not a reasonable value for the data

## The Least Squares Line

* Need the following to determine the least squares line:
    * correlation ($r$)
    * standard deviations ($s_x$ and $s_y$)
    * means ($\bar{x}$ and $\bar{y}$)
* $b_1 = r\frac{s_y}{s_x}$
* $b_0 = \bar{y} - b_1\bar{x}$

* think in terms of $y$-units per $x$-units
* check the same conditions as checked for correlation:
    * quantitative variables condition
    * straight enough condition
    * outlier condition
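
The two formulas above can be sketched in a few lines of Python using only the standard library. The data are made up for illustration, and the helper names `correlation` and `least_squares_line` are hypothetical; the correlation is computed as the sum of products of $z$-scores divided by $n - 1$.

```python
from statistics import mean, stdev

# Made-up data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (b0, b1) from the summary statistics r, s_x, s_y, x-bar, y-bar."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)   # slope: y-units per x-unit
    b0 = mean(ys) - b1 * mean(xs)    # intercept: the line passes through (x-bar, y-bar)
    return b0, b1


b0, b1 = least_squares_line(x, y)
print(f"r = {correlation(x, y):.2f}")
print(f"y-hat = {b0:.2f} + {b1:.2f}x")
```

Because $b_0 = \bar{y} - b_1\bar{x}$, the fitted line always passes through the point $(\bar{x}, \bar{y})$.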

## Step-By-Step Example: Calculating a Regression Equation

* plan: state the problem
* variables: identify the variables and report the W's
* check the conditions: plot a scatterplot; quantitative, straight enough, outliers
* mechanics: use summary statistics to find equation of regression line
    * find slope
    * find intercept
    * write equation of model
* conclusion: interpret the model, discussing it in terms of the variables and their units

## Correlation and the Line

* If $x$ and $y$ are converted to standard units, and the regression line is plotted:
    * slope ($b_1$) = $r$
    * intercept ($b_0$) = 0
    * equation of line: $\hat{z}_y = rz_x$
* moving one standard deviation from the mean in $x$, we can expect to move $r$ standard deviations from the mean in $y$
* in general, cases that are one standard deviation away from the mean in $x$ are, on average, $r$ standard deviations away from the mean in $y$
* if $r = 0$, there is no linear relationship
* if $r = \pm{}1$, there is a perfect linear relationship 
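
A quick way to check the standardized form is to convert both variables to $z$-scores and refit. The sketch below uses made-up data and, as an assumption, the equivalent sums-of-products form of the slope (cross-products divided by the sum of squared $x$ deviations) so that the fit does not use $r$ directly.

```python
from statistics import mean, stdev

# Made-up data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def standardize(values):
    """Convert values to z-scores (standard units)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


z_x, z_y = standardize(x), standardize(y)

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(z_x, z_y)) / (len(x) - 1)

# Least squares fit on the z-scores (sums-of-products form of the slope).
zx_bar, zy_bar = mean(z_x), mean(z_y)
slope = (sum((a - zx_bar) * (b - zy_bar) for a, b in zip(z_x, z_y))
         / sum((a - zx_bar) ** 2 for a in z_x))
intercept = zy_bar - slope * zx_bar

print(f"r = {r:.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```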

## How Big Can Predicted Values Get?

* each predicted $y$ tends to be closer to its mean (in standard deviations) than its corresponding $x$ was
* this property of the linear model is called **regression to the mean**
* example:
    * tall men will, on average, have larger shoe sizes
    * short men will, on average, have smaller shoe sizes
    * given a man whose height is 2 SDs above the mean, we would predict his shoe size to fall somewhere between 0 and 2 SDs above the mean shoe size
    * if $r$ = 0, the prediction would be the mean shoe size
    * if $r$ = 1, the prediction would be 2 SDs above the mean shoe size
    * for an $r$ between 0 and 1, the prediction would fall somewhere between these two extremes, pulled from 2 SDs back towards the mean by an amount that depends on the actual $r$
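
The shoe size example follows directly from $\hat{z}_y = rz_x$. A minimal sketch, with $r = 0.5$ as a hypothetical in-between value (the notes do not give an actual correlation for height and shoe size):

```python
def predicted_z_y(r, z_x):
    """Predicted y in standard units for an x that is z_x SDs from its mean."""
    return r * z_x


height_z = 2.0  # a man whose height is 2 SDs above the mean

# r = 0.5 is a made-up in-between value for illustration.
for r in (0.0, 0.5, 1.0):
    shoe_z = predicted_z_y(r, height_z)
    print(f"r = {r}: predicted shoe size is {shoe_z} SDs above the mean")
```

Whenever $|r| < 1$, the predicted $z$-score is smaller in size than the $z$-score of $x$, which is the regression to the mean described above.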

## Residuals Revisited

* residuals are the part of the data that hasn't been modeled
* $\text{Residual} = \text{Data} - \text{Model}$
* use $e$ to denote residuals
* $e = y - \hat{y}$
* after fitting a regression line, plot the residuals to confirm there's no pattern
    * a scatterplot of the residuals against the $x$-values should have no interesting features (no direction, no shape)
    * it should stretch horizontally with even scatter throughout
    * it should show no bends and no outliers
* many software packages plot residuals against $\hat{y}$, rather than $x$
    * if $r$ is positive, the only difference is the axis values/units
    * if $r$ is negative, the plots are mirror images of each other
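
The difference between plotting residuals against $x$ and against $\hat{y}$ can be checked without drawing anything, by listing the residuals in left-to-right plot order. This is a rough sketch on two made-up data sets (one with a positive association, one with a negative association); the helper names are hypothetical, and the slope uses the equivalent sums-of-products form.

```python
from statistics import mean


def fit(xs, ys):
    """Least squares (b0, b1) using the sums-of-products form of the slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
          / sum((xi - x_bar) ** 2 for xi in xs))
    return y_bar - b1 * x_bar, b1


def residuals_in_plot_order(xs, ys, against):
    """Residuals ordered left to right as plotted against 'x' or 'y_hat'."""
    b0, b1 = fit(xs, ys)
    y_hats = [b0 + b1 * xi for xi in xs]
    resid = [yi - yh for yi, yh in zip(ys, y_hats)]
    keys = xs if against == "x" else y_hats
    return [round(e, 6) for _, e in sorted(zip(keys, resid))]


# Made-up data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pos = [2.0, 4.0, 5.0, 4.0, 5.0]  # positive association
y_neg = [5.0, 4.0, 5.0, 4.0, 2.0]  # negative association

for name, ys in (("positive r", y_pos), ("negative r", y_neg)):
    by_x = residuals_in_plot_order(x, ys, "x")
    by_y_hat = residuals_in_plot_order(x, ys, "y_hat")
    print(f"{name}: against x     -> {by_x}")
    print(f"{name}: against y-hat -> {by_y_hat}")
```

With a positive slope the two orderings match; with a negative slope one is the reverse of the other, which is the mirror image noted above.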

## The Residual Standard Deviation

* standard deviation of the residuals, $s_e$, gives a measure of how much the points spread around the regression line
* this depends on the residuals having even scatter (as shown by the residuals plot noted above)
* this leads to a new assumption: **Equal Variance Assumption**
    * associated condition to check: **Does the Plot Thicken? Condition**
* $s_e = \sqrt{\frac{\sum{e^2}}{n-2}}$
* make a histogram of the residuals to see how they are distributed around 0 (least squares residuals always have a mean of 0)
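
The formula for $s_e$ can be sketched as follows. The data are made up for illustration, the function name `residual_sd` is hypothetical, and the coefficients 2.2 and 0.6 are the least squares values for this particular data set.

```python
from math import sqrt

# Made-up data and its least squares line y-hat = 2.2 + 0.6x (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6


def residual_sd(xs, ys, b0, b1):
    """Residual standard deviation: sqrt(sum of e^2 / (n - 2))."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in residuals) / (len(residuals) - 2))


s_e = residual_sd(x, y, b0, b1)
print(f"s_e = {s_e:.3f}")
```

The divisor is $n - 2$ rather than $n - 1$ because two coefficients, $b_0$ and $b_1$, were estimated from the data.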

## $R^2$: The Variation Accounted For