[MRG+1] Changes default for return_train_score to False #9677
Conversation
@jnothman Kindly suggest a suitable warning message.
train_scores = _score(estimator, X_train, y_train, scorer,
                      is_multimetric)
score_train_time = time.time() - start_time - fit_time - score_time
if (score_train_time >= 0.1*fit_time and
        time.time() - start_time) > 5:
Parentheses in a strange place here...
And we can't compare absolute time to 5 seconds in a single fit. It will often only amount to a substantial amount of time over a large number of fits, as in the issue that prompted this PR.
How about adding a new variable alongside return_train_score in BaseSearchCV, containing the time of the first call to GridSearchCV? We would also need to pass that time variable to the _fit_and_score function.
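Purely as an illustration of the idea (run_search and fit_and_score below are toy stand-ins, not the real BaseSearchCV / _fit_and_score signatures), threading a search-level start time through to each fit could look like this:

import time

# Toy sketch: the search records when it started and passes that timestamp to
# every fit-and-score call, so each call can reason about elapsed time across
# the whole search rather than a single fit.
def run_search(candidates, fit_and_score):
    search_start_time = time.time()
    return [fit_and_score(params, search_start_time=search_start_time)
            for params in candidates]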
@jnothman I added a parameter to the function which keeps track of the actual start time of GridSearchCV. Please see if this is valid or not.
It's not a bad approach, but I suspect it won't receive the consensus to get merged, just because it makes the _fit_and_score interface more messy...
@amueller, what do you think of this approach?
train_scores = _score(estimator, X_train, y_train, scorer,
                      is_multimetric)
score_train_time = time.time() - start_time - fit_time - score_time
if score_train_time >= 0.1*fit_time and time.time() - grid_search_start_time > 5:
This still may mean that the training score calculation took only 0.5 of a second, which I don't think is quite enough. Maybe better would be to (naively) estimate the overall training score time: (time.time() - grid_search_start_time) * score_train_time / (fit_time + score_time + score_train_time) > 5?
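For concreteness, a hedged sketch of that check as a standalone helper (the helper itself is hypothetical; the timing arguments correspond to the values already computed in _fit_and_score, and elapsed_search_time would be time.time() - grid_search_start_time):

import warnings

def _maybe_warn_about_train_scoring(elapsed_search_time, fit_time, score_time,
                                    score_train_time, threshold=5.0):
    # Project the share of wall-clock time spent computing training scores over
    # the whole search so far, and warn only if that projected total exceeds
    # `threshold` seconds.
    total = fit_time + score_time + score_train_time
    projected = elapsed_search_time * score_train_time / total
    if projected > threshold:
        warnings.warn('Computing training scores takes a significant share of '
                      'the search time; set return_train_score=False if you do '
                      'not need them.')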
The only other way to do it without changing the _fit_and_score interface
is to only raise the warning after fitting all estimators. And if we're
going to change _fit_and_score's interface, this is a good solution.
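A minimal sketch of that alternative (the helper name is hypothetical; each _fit_and_score call would report how long it spent on training scores, and the search would call this once after all fits):

import warnings

def _warn_if_train_scoring_was_slow(train_score_times, threshold=5.0):
    # Warn once, at the end of the search, if the accumulated time spent
    # computing training scores across all fits is substantial.
    total = sum(train_score_times)
    if total > threshold:
        warnings.warn('Computing training scores took %.1f seconds in total; '
                      'set return_train_score=False if you do not need them.'
                      % total)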
…On 3 September 2017 at 21:21, Kumar Ashutosh wrote:
@jnothman I also agree that it makes the _fit_and_score method more messy, but I could not find an alternative to keep track of the total time that GridSearchCV would take. You or @amueller may suggest a way which serves the purpose without changing the _fit_and_score interface.
@jnothman Should I continue with changing the function interface, or raise the warning only after completing all fits?
I hope we'll get someone else's opinion soon... I suspect we'll end up with a warning at the end, but I'm not sure.
What about the seemingly simpler solution of changing the default of return_train_score to False?
@lesteve Yes, this is also a simple solution; it will be done once others agree upon this.
I suppose that's an acceptable solution.
@jnothman Kindly review and suggest a suitable FutureWarning message.
Suggestions for a "nicer" warning message. Otherwise, I think this will work. :)
Even after making the required changes (the change to the test is still not made), the following lines of code do not produce a deprecation warning. Is this how DeprecationDict is supposed to behave, or am I missing something?
A warning should be produced when accessing the train score keys of cv_results_:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.cv_results_['split0_train_score']  # this is the line that should produce a warning

Note that you should not get any warning if you use return_train_score=True:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, return_train_score=True)
clf.fit(iris.data, iris.target)
clf.cv_results_['split0_train_score']  # no warning because return_train_score is set to True

Basically, what we want is for users who do not set return_train_score explicitly to get a warning when they access the train scores in cv_results_.
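If the warning does not show up interactively, note that the default warning filters may display a given warning only once per location; one way to check the behaviour explicitly is to record the warnings raised by the key access:

import warnings
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
clf = GridSearchCV(svm.SVC(), {'kernel': ('linear', 'rbf'), 'C': [1, 10]})
clf.fit(iris.data, iris.target)

# Record every warning triggered by accessing the train score key, bypassing
# the default warning filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    clf.cv_results_['split0_train_score']
print([str(w.message) for w in caught])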
I am not sure how DeprecationDict works. I need help on how to implement it.
You need to help us help you! Can you be specific about what your problem is? What have you tried? If there is a failure that you don't understand, can you copy and paste it here?
See my PR at thechargedneutron#2
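For reference, a minimal sketch of what such a DeprecationDict helper can look like; this is an illustration of the idea, not necessarily the exact implementation in that PR:

import warnings

class DeprecationDict(dict):
    """Dict that emits a registered warning when particular keys are accessed."""

    def __init__(self, *args, **kwargs):
        self._deprecations = {}
        super(DeprecationDict, self).__init__(*args, **kwargs)

    def __getitem__(self, key):
        if key in self._deprecations:
            warn_args, warn_kwargs = self._deprecations[key]
            warnings.warn(*warn_args, **warn_kwargs)
        return super(DeprecationDict, self).__getitem__(key)

    def add_warning(self, key, *args, **kwargs):
        """Register the arguments for warnings.warn to use when `key` is read."""
        self._deprecations[key] = (args, kwargs)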
I restarted the failing build in Travis hoping the timeout was just a glitch.
Do you think this approach is decent, @lesteve? Too messy? It implies that a user no longer gets a warning informing them why the search is slower than it needs to be (because training scores are being computed).
I have not looked at the diff of this PR (will do shortly). To be perfectly honest, I think your DeprecationDict suggestion is really neat, and this is the best we can do.
sklearn/model_selection/_search.py
"validation significantly. This is the reason " | ||
"return_train_score will change its default value " | ||
"from True (current behaviour) to False in 0.21. " | ||
"Please set return_train_score explicitly to get " |
I think this message should be changed slightly to say something along the lines of "Looks like you are using training scores, you want to set return_train_score=True because the return_train_score default will change in 0.21"
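To make the intended behaviour concrete, the wiring inside the search could look roughly like this (a sketch only, assuming a DeprecationDict helper like the one sketched earlier; the helper name and exact message are illustrative):

def _wrap_train_score_keys(results):
    # Hypothetical helper: wrap the plain cv_results_ dict so that accessing any
    # *_train_score key under the 'warn' default emits a FutureWarning.
    warned = DeprecationDict(results)
    for key in results:
        if key.endswith('_train_score'):
            warned.add_warning(
                key,
                'You are accessing a training score ({!r}), which will not be '
                'available by default any more in 0.21. If you need training '
                'scores, please set return_train_score=True'.format(key),
                FutureWarning)
    return warned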
Make docstrings more uniform
@jnothman I pushed two main changes; it would be nice to have your opinion on these:
I am going to reset the LGTM count and add a +1 from me.
'which will not be available by default '
'any more in 0.21. If you need training scores, '
'please set return_train_score=True').format(key)
train_score = assert_warns_message(FutureWarning, msg,
Shouldn't we assert that there is no warning for the other values, and that there is no warning for the other keys when return_train_score='warn'?
Otherwise LGTM.
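A sketch of the kind of extra assertions being asked for (assert_no_warnings comes from sklearn.utils.testing; grid_search_true and grid_search_warn are placeholders for whatever fitted search objects the test already builds):

from sklearn.utils.testing import assert_no_warnings

# With return_train_score=True set explicitly, accessing a train score key
# must not warn.
assert_no_warnings(grid_search_true.cv_results_.__getitem__, 'split0_train_score')

# With the 'warn' default, only the *_train_score keys should warn; other keys
# such as the test scores must stay silent.
assert_no_warnings(grid_search_warn.cv_results_.__getitem__, 'split0_test_score')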
Thanks all
Reference Issue
Fixes #9621

What does this implement/fix? Explain your changes.
Changes the default value of return_train_score to 'warn'. Raises a warning if computing the train score takes more time.

Any other comments?
The warning message needs improvement.