
r2_score metric incorrect? #5570

@drei34


Hello,

The r2_score metric is used by many functions in sklearn, but right now I think it is always computed as 1 - RSS/SYY, where RSS is the residual sum of squares and SYY is the sum of squares of y_true around its mean. That would be the right formula to use if you run a regression with an intercept.

If you ran a regression with no intercept, however, then the r2_score should be equal to (y_pred**2).sum()/(y_true**2).sum(). Notice that we do not demean.
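To make the no-intercept case concrete, here is a minimal pure-Python sketch. The toy data and the helper names (`fit_through_origin`, `r2_centered`, `r2_uncentered`) are made up for illustration and are not part of sklearn. It fits a line through the origin and compares the demeaned formula with the non-demeaned one.

```python
def fit_through_origin(x, y):
    """Least-squares slope for y ~ b * x with no intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def r2_centered(y_true, y_pred):
    """1 - RSS/SYY, with SYY measured around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    syy = sum((t - mean) ** 2 for t in y_true)
    return 1 - rss / syy


def r2_uncentered(y_true, y_pred):
    """sum(y_pred**2) / sum(y_true**2), with no demeaning."""
    return sum(p * p for p in y_pred) / sum(t * t for t in y_true)


# Toy data (assumed for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 3.9, 5.2, 5.8, 7.1]

b = fit_through_origin(x, y)
y_pred = [b * xi for xi in x]

print("demeaned 1 - RSS/SYY:      ", r2_centered(y, y_pred))
print("no-intercept definition:   ", r2_uncentered(y, y_pred))
```

On this data the two numbers differ noticeably, and the non-demeaned ratio matches 1 - RSS/sum(y_true**2), since for a least-squares fit through the origin sum(y_true**2) splits exactly into sum(y_pred**2) + RSS.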

Moreover, for other models, 1 - RSS/SYY can be negative, i.e. not between 0 and 1, which again is bad.

For a standard regression with an intercept term, 1 - RSS/SYY = corr(y_pred, y_true)**2. The squared correlation, no matter what the model is, is between 0 and 1, with 1 as the goal.
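Both points can be checked with a small pure-Python sketch. The data, the helper names, and the deliberately shifted "bad" predictions are all assumptions made for illustration, not sklearn code. For an ordinary least-squares fit with an intercept the two quantities agree, while for predictions that do not come from such a fit, 1 - RSS/SYY can go negative even though the squared correlation stays in [0, 1].

```python
def ols_with_intercept(x, y):
    """Simple linear regression y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def r2_centered(y_true, y_pred):
    """1 - RSS/SYY, with SYY measured around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    syy = sum((t - mean) ** 2 for t in y_true)
    return 1 - rss / syy


def corr_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab ** 2 / (saa * sbb)


# Toy data (assumed for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 3.9, 5.2, 5.8, 7.1]

intercept, slope = ols_with_intercept(x, y)
y_fit = [intercept + slope * xi for xi in x]

# Hypothetical poor model: same trend as the fit, but shifted far from the data.
y_bad = [p + 10.0 for p in y_fit]

print("OLS fit:   1 - RSS/SYY =", r2_centered(y, y_fit),
      " corr**2 =", corr_squared(y, y_fit))
print("bad model: 1 - RSS/SYY =", r2_centered(y, y_bad),
      " corr**2 =", corr_squared(y, y_bad))
```

Note that the shifted predictions keep the same squared correlation as the fitted ones, because correlation ignores a constant offset. Whether that is desirable in a score is part of the question raised here.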

I think the definition should then be changed on a per-model basis, or it should be changed to corr(y_pred, y_true)**2. The book "Applied Linear Regression" by S. Weisberg mentions the issue I raise above on page 84 of the third edition. It suggests using corr(y_pred, y_true)**2 for nonlinear models and altering the definition as above for regression through the origin. Finally, with regard to regression, statsmodels does use a different formula for R^2 depending on whether or not you include an intercept in the regression.

Maybe this is known already, but the codebase does not seem to differentiate between these cases at all, so that's why I am raising the issue here.

Thanks!
