# rsquare not robust #37

Closed
opened this issue Dec 22, 2016 · 1 comment

### drsimonj commented Dec 22, 2016

It looks like `rsquare()` is calculated as "SSR / SST". For a more robust solution (explanation below), I suggest using "1 - SSE / SST" instead. For this, the function could be rewritten as:

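```r
# residuals() and response() are assumed to be modelr's internal
# two-argument helpers, not stats::residuals().
rsquare <- function(model, data) {
  1 - stats::var(residuals(model, data), na.rm = TRUE) /
    stats::var(response(model, data), na.rm = TRUE)
}
```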

Happy to submit a PR later, but I've got one live now, so don't want to tangle them up.

## Explanation

"SSR / SST" has a few caveats. Most relevant here (I think), it doesn't hold when a model fitted on one data set (training data) is used to predict new data (test data). Here's a worked example where we should get a negative R-squared (the predicted values are worse than the mean), but we get `Inf`.

```r
library(modelr)

set.seed(12)
# Training data
train <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = .1)
)
# Test data with constant `y`
test <- data.frame(
  x = 1:10,
  y = 5
)

mod <- lm(y ~ x, train)
rsquare(mod, train)
#> [1] 0.9989631

rsquare(mod, test)
#> [1] Inf

# Safer to calculate R-squared using residuals (easy to see in plot)
plot(test, ylim = c(1, 10))
abline(coefficients(mod))
```

### cerbeus commented Aug 9, 2017

I agree. I was about to open the same issue, since this is quite a severe methodological problem. Using "SSR / SST" is a simplification that only holds under certain conditions, and it definitely fails for test sets, which I consider a major use case in model validation. The more general and robust way to calculate R^2, as already pointed out, is to use the original definition: "1 - SSE / SST".
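
To make the difference concrete, here is a minimal base-R sketch of both definitions computed from raw sums of squares. It is not modelr's implementation: the helper names `r2_explained()` and `r2_residual()` are hypothetical, the response column is assumed to be called `y`, and the data are a made-up variant of the example above (deterministic noise, and a test set whose `y` varies slightly so that SST is not zero).

```r
# Hypothetical helpers for illustration; base R only, not modelr's code.

# "SSR / SST": explained sum of squares over total sum of squares
r2_explained <- function(model, data, response = "y") {
  y <- data[[response]]
  pred <- stats::predict(model, newdata = data)
  sum((pred - mean(y))^2) / sum((y - mean(y))^2)
}

# "1 - SSE / SST": one minus residual sum of squares over total sum of squares
r2_residual <- function(model, data, response = "y") {
  y <- data[[response]]
  pred <- stats::predict(model, newdata = data)
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Made-up training data: y follows x closely
train <- data.frame(x = 1:10, y = 1:10 + rep(c(0.1, -0.1), 5))
# Made-up test data: y hovers around 5 and is unrelated to x
test <- data.frame(x = 1:10, y = 5 + rep(c(-0.5, 0.5), 5))

mod <- lm(y ~ x, data = train)

# On the training data the two definitions agree (OLS with an intercept)
r2_explained(mod, train)
r2_residual(mod, train)

# On the test data they diverge: the ratio exceeds 1,
# while the residual-based value is negative
r2_explained(mod, test)
r2_residual(mod, test)
```

The two values coincide on the training data because, for a least-squares fit with an intercept, SST = SSR + SSE on the data used for fitting. That identity is not guaranteed on new data, which is why only the residual-based form can report that predictions are worse than the mean. Note that with a strictly constant test response, as in the original example, SST is zero and the residual-based form returns `-Inf` rather than a finite negative number.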