rsquare not robust #37

Closed
drsimonj opened this issue Dec 22, 2016 · 1 comment

@drsimonj

commented Dec 22, 2016

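It looks like rsquare is currently calculated as "SSR / SST" (regression sum of squares over total sum of squares). For a more robust solution (explanation below), I suggest using "1 - SSE / SST" instead, where SSE is the residual sum of squares. The function could be rewritten as: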

rsquare <- function(model, data) {
  # 1 - Var(residuals) / Var(response); na.rm is set on both terms for consistency
  1 - stats::var(residuals(model, data), na.rm = TRUE) / stats::var(response(model, data), na.rm = TRUE)
}

Happy to submit a PR later, but I've got one live now, so don't want to tangle them up.

Explanation

"SSR / SST" has a few caveats. The most relevant one (I think) is that it doesn't hold when a model fitted on one data set (training data) is used to predict new data (test data). Here's a worked example where we should get a negative R-squared (the predicted values are worse than the mean), but we get Inf instead.

library(modelr)

set.seed(12)
# Training data
train <- data.frame(
  x = 1:10,
  y = 1:10 + rnorm(10, sd = .1)
)
# Test data with constant `y`
test <- data.frame(
  x = 1:10,
  y = 5
)

mod <- lm(y ~ x, train)
rsquare(mod, train)
#> [1] 0.9989631

rsquare(mod, test)
#> [1] Inf

# Safer to calculate R-squared using residuals (easy to see in plot)
plot(test, ylim = c(1, 10))
abline(coefficients(mod))

[Plot: the test data (constant y = 5) with the regression line fitted on the training data]
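
The constant-response test set above makes SST zero, so either ratio degenerates. As a rough sketch of the same point with finite numbers, the snippet below compares the two definitions on an assumed test set whose trend runs opposite to the training data. The helpers r2_explained and r2_residual and the test2 data frame are hypothetical names introduced for illustration; they are not part of modelr, and only base R is used.

```r
# Hypothetical helpers for illustration only (not part of modelr).
# Both assume the response column is called "y" unless told otherwise.
r2_explained <- function(model, data, response = "y") {
  y    <- data[[response]]
  pred <- stats::predict(model, newdata = data)
  sum((pred - mean(y))^2) / sum((y - mean(y))^2)  # SSR / SST
}

r2_residual <- function(model, data, response = "y") {
  y    <- data[[response]]
  pred <- stats::predict(model, newdata = data)
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)    # 1 - SSE / SST
}

set.seed(12)
train <- data.frame(x = 1:10, y = 1:10 + rnorm(10, sd = .1))

# Assumed test set: y decreases while the fitted model predicts an increase
test2 <- data.frame(x = 1:10, y = 10:1)

mod <- lm(y ~ x, train)

# On the training data the two definitions agree (OLS with an intercept)
r2_explained(mod, train)
r2_residual(mod, train)

# On test2, SSR / SST stays close to 1 even though every prediction is
# badly off, while 1 - SSE / SST is clearly negative
r2_explained(mod, test2)
r2_residual(mod, test2)
```

The "SSR / SST" form only compares the spread of the predictions with the spread of the response, so it cannot tell that the predictions point the wrong way; the residual-based form can.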

@cerbeus

commented Aug 9, 2017

I agree. I was about to open the same issue, since this is quite a severe methodological problem. Using "SSR / SST" is a simplification that only holds under certain conditions, and it definitely fails for test sets, which I consider a major use case in model validation. The more general and robust way to calculate R^2, as was already pointed out, is to use the original definition: "1 - SSE / SST".
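
For reference, the "certain conditions" can be spelled out with the standard sum-of-squares decomposition. This is a sketch using textbook notation that does not appear elsewhere in the thread: \hat{y}_i denotes a predicted value and \bar{y} the mean of the observed response in the data being evaluated.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_i (y_i - \bar{y})^2 \\
             &= \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\mathrm{SSR}}
              + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\mathrm{SSE}}
              + 2 \sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
\end{aligned}
```

For a least-squares fit with an intercept, evaluated on the data it was fitted to, the cross term is zero, so SST = SSR + SSE and "SSR / SST" equals "1 - SSE / SST". On new data the cross term is generally non-zero, and only "1 - SSE / SST" still measures how far the predictions are from the observed values.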

@hadley hadley closed this in 44d204a May 10, 2018
