## Goodness of Fit: The $R^2$ Statistic

The $R^2$ ("R-squared") value is perhaps the most widely used quality metric for linear models. We therefore cover it in this course, although you will shortly see that it is often used inappropriately and that other metrics (such as those from our previous lessons) are usually a better choice.

The $R^2$ of a linear model (sometimes also referred to as the "Coefficient of Determination") is calculated using the following formula:

$$
R^2 = 1 - \frac{\left[\mathrm{Variance\ of\ Residuals}\right]}{\left[\mathrm{Variance\ of\ Dependent\ Variable}\right]}
$$

Fundamentally, $R^2$ measures the fraction of the variance in the dependent variable that is explained by our model. A low $R^2$ value does not mean our model is bad, but it may mean that there is room to improve the model so that it accounts for a larger share of the variation in the dependent variable.
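A linear model fitted with an intercept has residuals that average to zero, and both variances carry the same $1/(n-1)$ factor, so the formula above can equivalently be written in terms of sums of squares. The notation is introduced here for illustration: $y_i$ are the observed values, $\hat{y}_i$ the fitted values, and $\overline{y}$ the mean of the observed values.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2}
$$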

Let's revisit our sales vs. temperature data from an earlier lesson:

In [1]:
temp.df <- read.csv("data/ice-cream-sales.csv")
head(temp.df)

| DailyHighTemperatureC | DailySalesContainers |
|---|---|
| `<int>` | `<int>` |
| 21 | 19 |
| 22 | 10 |
| 23 | 24 |
| 24 | 57 |
| 26 | 49 |
| 26 | 77 |


We can examine the $R^2$ of the linear model corresponding to this data:

In [2]:
model <- lm(DailySalesContainers ~ DailyHighTemperatureC, data=temp.df)
summary(model)


Call:
lm(formula = DailySalesContainers ~ DailyHighTemperatureC, data = temp.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.432  -7.481  -0.333   8.209  35.174 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -143.8641    10.1889  -14.12   <2e-16 ***
DailyHighTemperatureC    7.8028     0.3033   25.72   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.84 on 98 degrees of freedom
Multiple R-squared:  0.871,	Adjusted R-squared:  0.8697 
F-statistic: 661.7 on 1 and 98 DF,  p-value: < 2.2e-16


We can see the line `Multiple R-squared:  0.871` printed in the output above. If desired, you can also access the $R^2$ value programmatically through the `r.squared` element of the summary object, i.e. `summary(model)$r.squared`.
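As a minimal sketch of that programmatic access, the following uses a small synthetic data set in place of the ice cream data. The names `demo.df`, `temperature` and `sales` and the simulated values are assumptions made up for illustration.

```r
# Synthetic stand-in for the ice cream data (values are made up for illustration)
set.seed(42)
demo.df <- data.frame(temperature = seq(20, 39, length.out = 100))
demo.df$sales <- -140 + 8 * demo.df$temperature + rnorm(100, sd = 10)

demo.model <- lm(sales ~ temperature, data = demo.df)
demo.summary <- summary(demo.model)

# R-squared and adjusted R-squared are stored as elements of the summary object
r2 <- demo.summary$r.squared
adj.r2 <- demo.summary$adj.r.squared

print(r2)
print(adj.r2)
```

The same pattern should work on the `model` object fitted above, since `summary()` of any `lm` fit returns these elements.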

<span style="color:blue;font-weight:bold">Exercise</span>: Use the formula for $R^2$ given above to reproduce the value reported above manually. Perform the following steps:

1. Estimate the variance of `model$residuals` using the `var` function and store the result in the variable `var.res`
2. Estimate the variance of the dependent variable `temp.df$DailySalesContainers` using the `var` function and store the result in the variable `var.y`
3. Calculate the $R^2$ value using the formula given above and store the result in the variable `r2.manual`

In [7]:
# delete this entire line and replace it with your code

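# Variance of the model residuals
var.res <- var(model$residuals)
# Variance of the dependent variable
var.y <- var(temp.df$DailySalesContainers)

# R^2 = 1 - Var(residuals) / Var(y)
r2.manual <- 1 - var.res / var.y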

In [8]:
check.variable.value("var.res",var(model$residuals))
check.variable.value("var.y",var(temp.df$DailySalesContainers))
check.variable.value("r2.manual", 1 - var.res/var.y)
success()

You should observe the following:

* If our model is perfect, the residuals are all equal to `0`, and thus we will have $R^2 = 1 - 0 = 1$.
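* If our data is pure noise with no linear relationship between $x$ and $y$, the fitted slope will be close to zero and the intercept `b` will be close to the average value of $y$, $\overline{y}$. The model then predicts roughly $\overline{y}$ everywhere, the residual variance is about equal to the variance of $y$, and $R^2 \approx 0$. For a least-squares fit that includes an intercept, $R^2$ cannot be negative, so $0$ is the worst value that you can obtain from such a linear regression analysis.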
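Both extremes can be checked with a short simulation. This is only a sketch: the helper `r2.from.model` and all simulated values are hypothetical and not part of the course code.

```r
set.seed(1)  # fixed seed so the simulated noise is reproducible

# Hypothetical helper: R^2 from the variance formula used in this lesson
r2.from.model <- function(fit, y) {
  1 - var(residuals(fit)) / var(y)
}

# Case 1: y is an exact linear function of x, so all residuals are (numerically) zero
x <- 1:50
y.perfect <- 2 + 3 * x
r2.perfect <- r2.from.model(lm(y.perfect ~ x), y.perfect)

# Case 2: y is pure noise, unrelated to x
x.noise <- rnorm(1000)
y.noise <- rnorm(1000)
noise.fit <- lm(y.noise ~ x.noise)
r2.noise <- r2.from.model(noise.fit, y.noise)

print(r2.perfect)       # very close to 1
print(r2.noise)         # close to 0
print(coef(noise.fit))  # slope near 0, intercept near mean(y.noise)
print(mean(y.noise))
```

The perfect case is computed from the variance formula rather than from `summary()`, because `summary()` warns that its output may be unreliable when the fit is essentially perfect.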

## Why is $R^2$ Useful?

You can use $R^2$ to get a quick sense of whether a model is suitable for making precise predictions of the dependent variable. For example, if your model has an $R^2$ value of `0.1`, then only `10%` of the variance in the dependent variable is explained by the model. The remaining `90%` of the variance is unexplained and is likely to swamp any individual prediction with noise.

However, it is very important to understand that *a low $R^2$ value does not mean that our model is bad!* Indeed, the word "bad" in the previous sentence is not well defined - a model may be "bad" for certain tasks but very useful for others. Our model may, for example, give us a very precise estimate for the slope and intercept values that we are interested in, even though the $R^2$ value is very low. 
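The following sketch illustrates this point with simulated data: a large, very noisy sample gives a low $R^2$ but a narrow confidence interval for the slope. The true slope of `2`, the noise level, and the sample size are all assumptions chosen for illustration.

```r
set.seed(7)  # fixed seed for reproducibility

# Simulated data (made up for illustration): true slope 2, very noisy observations
n <- 10000
x <- runif(n, min = 0, max = 10)
y <- 5 + 2 * x + rnorm(n, sd = 20)

noisy.fit <- lm(y ~ x)

# R^2 is low because the noise variance dwarfs the variance explained by x
r2.noisy <- summary(noisy.fit)$r.squared

# The 95% confidence interval for the slope is nevertheless narrow
slope.ci <- confint(noisy.fit)["x", ]
slope.ci.width <- unname(slope.ci[2] - slope.ci[1])

print(r2.noisy)
print(slope.ci)
```

Such a model would be a poor tool for predicting individual $y$ values, yet it pins down the slope quite tightly.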

## "How High of an $R^2$ Value do I Need?"

This question comes up very frequently - the answer is: "*That is the wrong question!*" In essentially all cases, you should ask a question more specific to your objective, and choose a different corresponding metric:

1. "*Are my predictions accurate enough to serve my purpose?*" - look at the *prediction intervals*.
2. "*Is my parameter estimate precise enough?*" - look at the *parameter confidence intervals*. 
3. "*Is there a real relationship between these two variables?*" - look at the *p-values*.
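Each of the three metrics above can be obtained from a fitted `lm` object using base R. The sketch below uses a synthetic data set; the names `demo.df`, `temperature` and `sales` and the simulated values are assumptions made up for illustration.

```r
# Synthetic stand-in data (made up for illustration)
set.seed(123)
demo.df <- data.frame(temperature = seq(20, 39, length.out = 100))
demo.df$sales <- -140 + 8 * demo.df$temperature + rnorm(100, sd = 10)
demo.model <- lm(sales ~ temperature, data = demo.df)

# 1. Prediction interval for a new observation at 30 degrees
pred <- predict(demo.model,
                newdata = data.frame(temperature = 30),
                interval = "prediction", level = 0.95)

# 2. Confidence intervals for the intercept and slope
param.ci <- confint(demo.model, level = 0.95)

# 3. p-value for the slope (tests the null hypothesis that the slope is zero)
slope.p <- summary(demo.model)$coefficients["temperature", "Pr(>|t|)"]

print(pred)
print(param.ci)
print(slope.p)
```

Whether the resulting intervals are narrow enough depends on your objective, which is why no single $R^2$ threshold can answer these questions.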