
GridsearchCV.score with multimetric scoring and callable refit #17058

Open
TimZaragori opened this issue Apr 27, 2020 · 2 comments
TimZaragori commented Apr 27, 2020

Describe the bug

When using GridSearchCV with multimetric scoring and a callable as refit, GridSearchCV.score doesn't work: the line score = self.scorer_[self.refit] if self.multimetric_ else self.scorer_ expects self.refit to be a string key into the scorer dict in the multimetric case.

Steps/Code to Reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
import numpy as np

def get_best_index(cv_results):
    # Candidates: all parameter sets tied for the best roc_auc rank
    best_rank_mask = cv_results['rank_test_roc_auc'] == cv_results['rank_test_roc_auc'].min()
    params_best_score = np.array(cv_results['params'])[best_rank_mask]
    params_name = params_best_score[0].keys()
    classifier_params_names = [name for name in params_name if 'classifier' in name]
    if 'classifier__C' in classifier_params_names or 'classifier__base_estimator__C' in classifier_params_names:
        key = 'classifier__C' if 'classifier__C' in classifier_params_names else 'classifier__base_estimator__C'
        # Break ties by preferring the smallest C (strongest regularization)
        classifier_params = np.array([params[key] for params in params_best_score])
        params_best_score = params_best_score[classifier_params == classifier_params.min()]
    best_params = params_best_score[0]
    best_index = int(np.where(np.array(cv_results['params']) == best_params)[0][0])
    return best_index

breast = load_breast_cancer()
X = breast.data
y = breast.target
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=111)
params_dic = {'C': np.arange(0.1, 1.1, 0.1)}
clf = GridSearchCV(LogisticRegression(penalty='l2', max_iter=100000, solver='saga'),
                   params_dic, scoring=['roc_auc', 'accuracy'], cv=cv,
                   refit=get_best_index, n_jobs=4)
clf.fit(X, y)
clf.score(X, y)

Actual Results

File "C:\Users\Tim\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 447, in score
    score = self.scorer_[self.refit] if self.multimetric_ else self.scorer_
KeyError: <function get_best_index at 0x0000028DD66BDBF8>

Expected Results

Since refit is a callable, I don't see how score could know which metric to use for scoring. However, if I instead pass a string naming the metric I chose, i.e. 'roc_auc', to the refit argument, the best index won't be chosen the way I want. Maybe in the case of multimetric scoring with a callable refit, accept a dictionary instead, like {score: callable}, and use that score in GridSearchCV.score?
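In the meantime, a possible workaround (a sketch, not part of scikit-learn's documented API for this case): the fitted per-metric scorers are still available in clf.scorer_ after fit, so the desired metric can be applied by hand instead of calling clf.score. The refit_first_best callable below is a hypothetical stand-in for the tie-breaking logic above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

def refit_first_best(cv_results):
    # Toy refit callable: index of the first candidate with the best roc_auc rank
    return int(np.argmin(cv_results['rank_test_roc_auc']))

clf = GridSearchCV(
    LogisticRegression(max_iter=5000),
    {'C': [0.1, 1.0]},
    scoring=['roc_auc', 'accuracy'],
    refit=refit_first_best,
    cv=3,
)
clf.fit(X, y)

# clf.score(X, y) raises KeyError here; instead, apply the chosen scorer directly:
roc_auc = clf.scorer_['roc_auc'](clf.best_estimator_, X, y)
```

This sidesteps score entirely, at the cost of naming the metric explicitly at every call site.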

Versions

System:
python: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
executable: C:\Users\Tim\Anaconda3\python.exe
machine: Windows-10-10.0.18362-SP0
Python dependencies:
pip: 20.0.2
setuptools: 39.1.0
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: 0.28.2
pandas: 1.0.0
matplotlib: 2.2.2
joblib: 0.14.1
Built with OpenMP: True

@jnothman
Member

You're right, I can confirm this is a bug. But it seems there's no way for score to work if refit is a callable. I suppose that was under-thought on my part.
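One conceivable direction (purely a sketch, with hypothetical names NamedScoreGridSearchCV and score_metric, not an agreed design): a subclass that carries an explicitly named metric and looks it up in scorer_ when refit is a callable under multimetric scoring.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

class NamedScoreGridSearchCV(GridSearchCV):
    """Sketch: score() with an explicitly named metric when refit is callable."""

    def __init__(self, estimator, param_grid, *, score_metric, **kwargs):
        super().__init__(estimator, param_grid, **kwargs)
        self.score_metric = score_metric

    def score(self, X, y=None):
        # When scoring is multimetric and refit is a callable, self.refit is
        # not a valid key into self.scorer_, so use the named metric instead.
        if self.multimetric_ and callable(self.refit):
            return self.scorer_[self.score_metric](self.best_estimator_, X, y)
        return super().score(X, y)

X, y = load_breast_cancer(return_X_y=True)
clf = NamedScoreGridSearchCV(
    LogisticRegression(max_iter=5000),
    {'C': [0.1, 1.0]},
    score_metric='roc_auc',
    scoring=['roc_auc', 'accuracy'],
    refit=lambda cv_results: int(np.argmin(cv_results['rank_test_roc_auc'])),
    cv=3,
)
clf.fit(X, y)
score = clf.score(X, y)
```

The same idea could be expressed inside GridSearchCV itself as the {score: callable} dict the reporter suggests; the subclass just makes the extra piece of state explicit.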

@TimZaragori
Author

For my personal use (an sklearn wrapper for nested cross-validation) I tried to implement it this way: https://github.com/TimZaragori/Sklearn_NestedCV/blob/master/Statistical_analysis/nested_cv.py#L283, with the scoring function (the score of the model refit on all data, as in GridSearchCV, but after the whole nested cross-validation): https://github.com/TimZaragori/Sklearn_NestedCV/blob/master/Statistical_analysis/nested_cv.py#L368,
and in the inner loops, where I use GridSearchCV directly, I retrieve the scores like this: https://github.com/TimZaragori/Sklearn_NestedCV/blob/master/Statistical_analysis/nested_cv.py#L337

However, I don't know whether this helps, or to what extent it could be implemented in GridSearchCV.
