How to do LeaveOneOut cross validation #15900

Open
qinhanmin2014 opened this issue Dec 16, 2019 · 4 comments

Comments

@qinhanmin2014
Member

Currently, we have two ways to do LOO CV in scikit-learn.
The first one is in GridSearchCV, where we calculate the score of each fold (i.e., each sample) and then take the average.
The second one is in RidgeCV, where we calculate the prediction for each fold (i.e., each sample), put the predictions together, and calculate the score.
I think this inconsistency is annoying.
Another issue is whether we should take sample_weight into account when averaging the scores in the first option and when calculating the score in the second option. We do so in RidgeCV, but not in GridSearchCV.
Related to the RidgeCV issues; ping @glemaitre
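
Below is a minimal sketch contrasting the two strategies on a toy regression problem; it uses cross_val_score / cross_val_predict for illustration rather than the actual internal code paths of GridSearchCV and RidgeCV.

import warnings
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score

warnings.simplefilter("ignore")  # silence "R^2 ... less than two samples" warnings

X, y = make_regression(n_samples=20, n_features=3, noise=1.0, random_state=0)
reg = Ridge(alpha=1.0)
loo = LeaveOneOut()

# Option 1 (GridSearchCV style): score each single-sample fold, then average.
# With r2 every per-fold score is nan, so the mean is nan as well.
fold_scores = cross_val_score(reg, X, y, cv=loo, scoring="r2")
print(np.mean(fold_scores))  # nan

# Option 2 (RidgeCV style): pool the LOO predictions, then score them once.
pooled_pred = cross_val_predict(reg, X, y, cv=loo)
print(r2_score(y, pooled_pred))  # a finite r2 value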

@amueller
Member

The usage in RidgeCV is the inconsistent one here. However, it is not very easy to fix because of #5097: we cannot compute r2_score with LOO because of the way r2_score is defined.

Also see #14886.

We do not use sample_weight when computing the score anywhere; see #4632 and many related issues.
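
As a tiny sketch of where this breaks down (my own illustration): r2 needs at least two samples to define its variance term, so scoring a single-sample LOO test fold is undefined.

from sklearn.metrics import r2_score
# one true value vs. one prediction: r2 is undefined, sklearn warns and returns nan
print(r2_score([3.0], [2.5]))  # nan (with an UndefinedMetricWarning)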

@jnothman
Member

jnothman commented Dec 24, 2019 via email

@qinhanmin2014
Member Author

The usage in RidgeCV is the inconsistent one here. However, it is not very easy to fix because of #5097: we cannot compute r2_score with LOO because of the way r2_score is defined.

I guess this is not a problem, because we'll get nan anyway if we rely on the default r2 when using GridSearchCV on regression problems:

import warnings
warnings.simplefilter("ignore")  # silence the UndefinedMetricWarning raised by r2 on single-sample folds
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut
X, y = load_boston(return_X_y=True)
X, y = X[:10], y[:10]  # keep only 10 samples so LOO stays cheap
reg = RandomForestRegressor(random_state=0)
params = {"n_estimators": [10, 20]}
# default scoring for a regressor is r2, computed per fold and then averaged
grid = GridSearchCV(reg, params, cv=LeaveOneOut())
grid.fit(X, y)
# each LOO test fold contains a single sample, so every per-fold r2 is nan
print(grid.best_score_)
# nan

@qinhanmin2014
Member Author

Though I think it's not good to return nan.

I tried googling things like "leave one out cross validation r2"; it seems that most people calculate the prediction for each fold (i.e., each sample), put the predictions together, and calculate the score.
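
For reference, here is a sketch of that pooled-prediction LOO r2 on the same 10-sample setup as the snippet above; cross_val_predict is just my choice for illustration (RidgeCV computes its LOO predictions analytically rather than by refitting), and load_boston has since been deprecated in newer scikit-learn versions.

import warnings
warnings.simplefilter("ignore")
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = load_boston(return_X_y=True)
X, y = X[:10], y[:10]
reg = RandomForestRegressor(n_estimators=10, random_state=0)
# pool the LOO predictions, then compute a single r2 over all samples
pred = cross_val_predict(reg, X, y, cv=LeaveOneOut())
print(r2_score(y, pred))  # a finite score (possibly negative on 10 samples) instead of nan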
