The r2_score metric is used by many functions in sklearn, and right now I believe it is generally defined as 1 - RSS/SYY, which is the right formula if you run a regression with an intercept.
If, however, you ran a regression with no intercept, then r2_score should instead equal (y_pred**2).sum() / (y_true**2).sum(); notice that we do not demean.
Moreover, for other models, 1 - RSS/SYY might be negative or otherwise fall outside [0, 1], which again is misleading.
For a standard regression with an intercept term, 1 - RSS/SYY = corr(y_pred, y_true)**2, and this number, no matter what the model is, lies between 0 and 1 with 1 as the goal.
I therefore think the definition should either be changed on a per-model basis or changed to corr(y_pred, y_true)**2. The book "Applied Linear Regression" by S. Weisberg discusses the issue I describe above on page 84 of the third edition; it suggests using corr(y_pred, y_true)**2 for nonlinear models and the alternative definition above for regression through the origin. Finally, with regard to regression, statsmodels does use a different formula for R^2 depending on whether or not the regression includes an intercept.
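The claims above can be checked numerically. Here is a quick sketch using plain NumPy, where `wiki_r2` is a stand-in I wrote for the 1 - RSS/SYY formula that `r2_score` implements (names and data are illustrative, not from sklearn itself):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

def wiki_r2(y_true, y_pred):
    """1 - RSS/SYY, the formula sklearn's r2_score uses."""
    rss = ((y_true - y_pred) ** 2).sum()
    syy = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - rss / syy

# OLS with an intercept: fit [intercept, slope] via least squares.
A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
y_pred = b0 + b1 * x
r2 = wiki_r2(y, y_pred)
corr2 = np.corrcoef(y, y_pred)[0, 1] ** 2
assert np.isclose(r2, corr2)  # the identity 1 - RSS/SYY = corr^2 holds here

# OLS through the origin: the uncentred ratio is a different quantity
# from 1 - RSS/SYY, because the denominator is no longer demeaned.
b = (x * y).sum() / (x * x).sum()
y_pred0 = b * x
uncentred = (y_pred0 ** 2).sum() / (y ** 2).sum()
print(wiki_r2(y, y_pred0), uncentred)
```

For through-origin OLS the residuals are orthogonal to the fitted values, so (y_pred0**2).sum() equals (y**2).sum() minus RSS, and the uncentred ratio coincides with 1 - RSS/(y**2).sum(); it only differs from r2_score because the latter demeans the denominator.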
Maybe this is known already, but the codebase does not seem to differentiate at all so that's why I am putting the issue here.
Thanks!
The text was updated successfully, but these errors were encountered:
Sorry for the slow reply. It's true that for non-linear models R^2 need not be positive, but it is always <= 1.
We're basically always using the wikipedia definition without taking the model into account: https://en.wikipedia.org/wiki/Coefficient_of_determination
This is somewhat non-standard, but unfortunately it's hard to change. In particular, when using a test set it's unclear to me what R^2 would mean under the alternative definitions.
I'm not super familiar with stats, but (y_pred^2).sum()/(y_true^2).sum() seems really odd to me. When would that make sense?
I think this can be closed in part just because it's not going to change (although I suppose docs could be improved). But I also think the OP neglects the fact that in a machine learning context, we are usually estimating generalisation error, not goodness of fit alone; hence negative scores seem appropriate.
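To make the last point concrete, here is a minimal sketch (with a hypothetical `wiki_r2` helper mirroring the 1 - RSS/SYY formula, not sklearn's own code) showing that a constant prediction at the target mean scores exactly 0, while anything worse goes negative:

```python
import numpy as np

def wiki_r2(y_true, y_pred):
    """1 - RSS/SYY, matching the standard coefficient-of-determination formula."""
    rss = ((y_true - y_pred) ** 2).sum()
    syy = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - rss / syy

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean everywhere: RSS == SYY, so the score is exactly 0.
mean_score = wiki_r2(y_true, np.full(4, y_true.mean()))

# Predicting worse than the mean (here, the targets reversed): score < 0.
bad_score = wiki_r2(y_true, np.array([4.0, 3.0, 2.0, 1.0]))

print(mean_score, bad_score)
```

On held-out data a model can easily do worse than the test-set mean, so under a generalisation-error reading a negative score is informative rather than broken.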