**SM339 ▪ Applied Statistics ▪ Spring 2023 ▪ Uhan**

# Lesson 16. Confidence and Prediction Intervals for Response

## Overview

- As in simple linear regression, often we would like to use our model to make predictions

- We may want to predict:

    - The __mean response__ for predictor values $x_1^*, x_2^*, \dots, x_k^*$
    
    - An __individual response__ for predictor values $x_1^*, x_2^*, \dots, x_k^*$

## Confidence interval for response

- To estimate the __mean response__ for predictor values $x_1^*, x_2^*, \dots, x_k^*$, we use a __confidence interval__ for $\mu_{Y | x^*}$

- It estimates the mean of $Y$ when $X_1$ has the value $x_1^*$, $X_2$ has the value $x_2^*$, etc.

- Formula:
    $$\hat{y} \pm t_{\alpha / 2, n - (k + 1)} \mathit{SE}_{\hat{\mu}}$$

    - We will let R handle the particulars of calculating the standard error $\mathit{SE}_{\hat{\mu}}$

- Interpretation:
    > We are <mark>95%</mark> confident that the true average <mark>response</mark> for all <mark>observational units</mark> with <mark>$x_1^*, \dots, x_k^*$ values</mark> is between <mark>lower endpoint of CI</mark> and <mark>upper endpoint of CI</mark> <mark>units</mark>
    
    > Being <mark>95%</mark> confident means that, with repeated use, the procedure of forming a CI will capture the true mean response $\mu_{Y | x^*}$ <mark>95%</mark> of the time.
    
    - Rephrase the highlighted parts so that they match the context of the problem

### Example 1
Continuing with the `RailsTrails` data and model from the previous lessons...

In [1]:
library(Stat2Data)
data(RailsTrails)

fit <- lm(Price2014 ~ SquareFeet + Distance, data = RailsTrails)
summary(fit)


Call:
lm(formula = Price2014 ~ SquareFeet + Distance, data = RailsTrails)

Residuals:
    Min      1Q  Median      3Q     Max 
-152.15  -30.27   -4.14   25.75  337.93 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   78.985     25.607   3.085  0.00263 ** 
SquareFeet   147.920     12.765  11.588  < 2e-16 ***
Distance     -15.788      7.586  -2.081  0.03994 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 65.55 on 101 degrees of freedom
Multiple R-squared:  0.6574,	Adjusted R-squared:  0.6506 
F-statistic: 96.89 on 2 and 101 DF,  p-value: < 2.2e-16


#### a.
Use the fitted model equation directly to predict the price (in 2014) of a home that is 1800 square feet and 1.2 miles from a bike trail.

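*Solution.* `SquareFeet` is measured in thousands of square feet, so 1800 square feet enters the model as 1.800:

$$\widehat{\mathit{Price2014}} = 78.985 + 147.920(1.800) - 15.788(1.2) = 326.295 $$

Since `Price2014` is measured in thousands of dollars, the predicted price is about \\$326,295.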
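
The hand calculation can also be checked in R. The following is a minimal sketch that plugs the predictor values into the fitted coefficients; the names `b` and `y_hat` are chosen here for illustration and are not part of the lesson.

```r
library(Stat2Data)
data(RailsTrails)

fit <- lm(Price2014 ~ SquareFeet + Distance, data = RailsTrails)

# Extract the fitted coefficients (intercept and two slopes)
b <- coef(fit)

# Plug in 1800 square feet (entered as 1.800) and 1.2 miles
y_hat <- b["(Intercept)"] + b["SquareFeet"] * 1.800 + b["Distance"] * 1.2
unname(y_hat)
```

Because this uses the unrounded coefficients, the result may differ slightly in the last digit from a calculation based on the rounded values in the summary table.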

#### b.
Construct and interpret a 90% confidence interval for the __average__ price of all 1800 square foot houses that are 1.2 miles from a bike trail.

In [2]:
# Solution
predict(fit, newdata=data.frame(SquareFeet=1.800, Distance=1.2), 
        interval="confidence", level=0.90)

| | fit | lwr | upr |
|---|---|---|---|
| 1 | 326.2948 | 314.2929 | 338.2967 |


*Solution.* We are 90% confident that the true average price of all 1800 square foot houses that are 1.2 miles from a bike trail is between \\$314,293 and \\$338,297.
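
To connect the formula $\hat{y} \pm t_{\alpha / 2, n - (k + 1)} \mathit{SE}_{\hat{\mu}}$ to the R output, the interval can be rebuilt by hand. The sketch below assumes we ask `predict()` for the standard error with `se.fit = TRUE` and get the critical value from `qt()`; the names `new_home`, `t_crit`, and `manual_ci` are illustrative only.

```r
library(Stat2Data)
data(RailsTrails)

fit <- lm(Price2014 ~ SquareFeet + Distance, data = RailsTrails)
new_home <- data.frame(SquareFeet = 1.800, Distance = 1.2)

# se.fit = TRUE returns the point estimate, its standard error (SE_mu-hat),
# and the residual degrees of freedom n - (k + 1)
pred <- predict(fit, newdata = new_home, se.fit = TRUE)

# Critical value t_{alpha/2, n - (k + 1)} for a 90% interval
alpha <- 0.10
t_crit <- qt(1 - alpha / 2, df = pred$df)

# y-hat plus or minus (critical value) * (standard error)
manual_ci <- c(lwr = unname(pred$fit - t_crit * pred$se.fit),
               upr = unname(pred$fit + t_crit * pred$se.fit))
manual_ci
```

The endpoints should agree with the `lwr` and `upr` columns from `predict(..., interval="confidence", level=0.90)`.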

## Prediction interval for response

- To estimate an __individual response__ for predictor values $x_1^*, x_2^*, \dots, x_k^*$, we use a __prediction interval__ for $y$

- It estimates a _future_ individual response $y$ when $X_1$ has the value $x_1^*$, $X_2$ has the value $x_2^*$, etc.

- Formula:
    $$ \hat{y} \pm t_{\alpha/2, n - (k + 1)} \mathit{SE}_{\hat{y}} $$

- Interpretation:
    > We are <mark>95%</mark> confident that the <mark>response</mark> of a particular <mark>observational unit</mark> with <mark>$x_1^*, \dots, x_k^*$ values</mark> is between <mark>lower endpoint of PI</mark> and <mark>upper endpoint of PI</mark> <mark>units</mark>
    
    > Being <mark>95%</mark> confident means that, with repeated use, the procedure of forming a PI will capture the actual $y$ <mark>95%</mark> of the time.
    
    - Rephrase the highlighted parts so that they match the context of the problem

### Example 2

Continuing with Example 1...

#### a.
Construct and interpret a 90% interval predicting the price of one particular 1800 square foot house that is 1.2 miles from a bike trail.

In [3]:
# Solution
predict(fit, newdata=data.frame(SquareFeet=1.800, Distance=1.2), 
        interval="prediction", level=0.90)

| | fit | lwr | upr |
|---|---|---|---|
| 1 | 326.2948 | 216.8189 | 435.7708 |

*Solution.* We are 90% confident that the price of one particular 1800 square foot house that is 1.2 miles from a bike trail is between \\$216,819 and \\$435,771.


#### b.
Which is wider, the 90% CI or the 90% PI?

*Solution.* The prediction interval is wider.
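
From the outputs above, the CI has width $338.2967 - 314.2929 = 24.0038$ and the PI has width $435.7708 - 216.8189 = 218.9519$ (both in thousands of dollars). A short sketch of the same comparison in R follows; the names `conf_int`, `pred_int`, `ci_width`, and `pi_width` are illustrative only.

```r
library(Stat2Data)
data(RailsTrails)

fit <- lm(Price2014 ~ SquareFeet + Distance, data = RailsTrails)
new_home <- data.frame(SquareFeet = 1.800, Distance = 1.2)

# Both intervals at the same 90% level and the same predictor values
conf_int <- predict(fit, newdata = new_home, interval = "confidence", level = 0.90)
pred_int <- predict(fit, newdata = new_home, interval = "prediction", level = 0.90)

# Width = upper endpoint - lower endpoint
ci_width <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
pi_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])

c(CI = ci_width, PI = pi_width)
pi_width > ci_width
```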

## Notes about confidence intervals vs. prediction intervals for response

- The point estimate anchoring both intervals is the same: $\hat{y}$

- At the same confidence level and predictor values, the prediction interval is always wider than the confidence interval, because the prediction interval uses a larger standard error: $\mathit{SE}_{\hat{y}} > \mathit{SE}_{\hat{\mu}}$

- Intuitively: 
    - The PI captures more uncertainty 
    
    - In addition to uncertainty due to sampling, the PI also captures the inherent uncertainty in the response of an _individual_ data point
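
The lesson leaves the standard error calculations to R, but the intuition above can be made precise with a standard result for linear regression (stated here as background, not taken from the lesson): the prediction standard error adds the estimated error variance to the squared standard error of the mean response, where $\hat{\sigma}_{\epsilon}$ denotes the residual standard error reported by `summary(fit)`.

$$ \mathit{SE}_{\hat{y}} = \sqrt{\mathit{SE}_{\hat{\mu}}^2 + \hat{\sigma}_{\epsilon}^2} $$

As a rough check with the examples above, using $t_{0.05, 101} \approx 1.660$: the CI half-width is about $12.00$, so $\mathit{SE}_{\hat{\mu}} \approx 12.00 / 1.660 \approx 7.23$. With $\hat{\sigma}_{\epsilon} = 65.55$, this gives $\mathit{SE}_{\hat{y}} \approx \sqrt{7.23^2 + 65.55^2} \approx 65.95$, and $1.660 \times 65.95 \approx 109.5$, which matches the PI half-width $(435.7708 - 216.8189)/2 \approx 109.5$. Since $\hat{\sigma}_{\epsilon}^2 > 0$, the PI is always wider than the CI.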