Description
Hello,
The r2_score metric is used by many functions in sklearn, but right now I think it is generally computed as 1 - RSS/SYY, which is the right formula to use if you run a regression with an intercept.
If you instead ran a regression with no intercept, then the r2_score should be equal to (y_pred^2).sum()/(y_true^2).sum(); notice that we do not demean.
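To make the two definitions concrete, here is a rough sketch using only NumPy. The helper names `r2_centered` and `r2_uncentered` and the simulated data are my own illustration, not anything in sklearn. It fits a line through the origin and compares the demeaned and non-demeaned versions.

```python
import numpy as np


def r2_centered(y_true, y_pred):
    """1 - RSS/SYY, with SYY measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    syy = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / syy


def r2_uncentered(y_true, y_pred):
    """(y_pred^2).sum() / (y_true^2).sum(), with no demeaning."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(y_pred ** 2) / np.sum(y_true ** 2)


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)

# Least-squares fit through the origin: beta = sum(x*y) / sum(x^2).
beta = np.sum(x * y) / np.sum(x * x)
y_pred_origin = beta * x

rss = np.sum((y - y_pred_origin) ** 2)
print("1 - RSS/SYY (demeaned):   ", r2_centered(y, y_pred_origin))
print("sum(y_pred^2)/sum(y^2):   ", r2_uncentered(y, y_pred_origin))
print("1 - RSS/sum(y^2):         ", 1.0 - rss / np.sum(y ** 2))
```

For a least-squares fit through the origin, the last two numbers agree, because the residuals are orthogonal to the fitted values. The demeaned version gives a different (smaller) value on the same predictions.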
Moreover, for other models, 1 - RSS/SYY can be negative, so it is not guaranteed to lie between 0 and 1, which again is bad.
For a standard regression with an intercept term, 1 - RSS/SYY = corr(y_pred,y_true)^2, and the squared correlation, no matter what the model is, is between 0 and 1 with 1 as the goal.
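A small check of both claims, again as a sketch with NumPy only. The function names, the simulated data, and the shifted predictions are assumptions I made up for illustration. With an intercept the two quantities coincide, while for predictions that do not come from such a fit, 1 - RSS/SYY can go negative and the squared correlation stays in [0, 1].

```python
import numpy as np


def one_minus_rss_over_syy(y_true, y_pred):
    """The demeaned definition: 1 - RSS/SYY."""
    rss = np.sum((y_true - y_pred) ** 2)
    syy = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / syy


def squared_corr(y_true, y_pred):
    """corr(y_pred, y_true)^2 via the Pearson correlation matrix."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=50)

# Ordinary least squares with an intercept; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_pred_ols = intercept + slope * x

# Predictions that are not an intercept-OLS fit: shift everything by a constant.
y_pred_shifted = y_pred_ols + 100.0

print("OLS:     1 - RSS/SYY =", one_minus_rss_over_syy(y, y_pred_ols))
print("OLS:     corr^2      =", squared_corr(y, y_pred_ols))
print("shifted: 1 - RSS/SYY =", one_minus_rss_over_syy(y, y_pred_shifted))
print("shifted: corr^2      =", squared_corr(y, y_pred_shifted))
```

The shifted predictions are deliberately artificial. They are only there to show that 1 - RSS/SYY is not bounded below by 0 once the predictions stop being an intercept-OLS fit to the same data.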
I think the definition should then either be changed on a per-model basis or be changed to corr(y_pred,y_true)**2. The book "Applied Linear Regression" by S. Weisberg mentions the issue I describe above on page 84 of the third edition. It suggests using corr(y_pred,y_true)^2 for nonlinear models and altering the definition as above for regression through the origin. Finally, with regard to regression, statsmodels does use a different formula for the r2_score depending on whether or not you use an intercept in the regression.
Maybe this is known already, but the codebase does not seem to differentiate between these cases at all, which is why I am raising the issue here.
Thanks!