
Added fit and score times for learning_curve #13938

Merged (23 commits into scikit-learn:master, Jun 14, 2019)

Conversation

@H4dr1en (Contributor) commented May 24, 2019

What does this implement/fix? Explain your changes.

It would be useful to have access to the fit and score times of the estimators when computing learning curves. This can be done very easily because the _fit_and_score method already has a return_times parameter. I suggest setting it to True when computing learning curves.
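For context, a minimal sketch of the usage this PR enables (the toy dataset and estimator here are chosen purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# With return_times=True, the returned tuple gains fit_times and
# score_times, each of shape (n_train_sizes, n_cv_folds).
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, return_times=True)

print(fit_times.shape)  # one row per training-set size, one column per fold
```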

Any other comments?

Here is how fit and score times can be used to get valuable information:

Using the following setup:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
estimator = RandomForestRegressor()

Having the fit and score times lets us plot the following:

[figure: learning_curve_doc_time_all]

And also plots such as:

[figure: learning_curve_doc_diag]

As you can see, it is easy to determine the best estimator for the considered dataset while taking into account the "scalability" of the model: it tells us which model will perform better if we add more data to our dataset.

@H4dr1en H4dr1en changed the title [MRG] Added fit and score times for learning curves [WIP] Added fit and score times for learning curves May 24, 2019
@H4dr1en H4dr1en changed the title [WIP] Added fit and score times for learning curves [MRG] Added fit and score times for learning curves May 24, 2019
@jnothman (Member) left a comment

Thanks for the pull request. We will not make this kind of backwards-incompatible change. Please add a parameter such as return_times. It may also be worth adding a plot like that to an example.

@H4dr1en (Contributor, Author) commented May 27, 2019

I added backwards compatibility (default behavior as before, i.e. return_times=False).
I also added plot_learning_curves_times.py under examples/model_selection, which generates the following plot:

[figure: generated example plot]

Should I take care of generating the doc for this example? If yes, how?

@jnothman (Member) commented May 27, 2019

doc/modules/learning_curve.rst (resolved review comment)
examples/model_selection/plot_learning_curve.py (resolved review comment)
Review comment on examples/model_selection/plot_learning_curve.py:

    plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

    plt.show()
@jnothman (Member) May 28, 2019

Is there a way to actually integrate this into the example above, rather than having an entirely separate demonstration of this functionality?

@H4dr1en (Contributor, Author) May 28, 2019

I guess we can, I will try!

EDIT: I am wondering whether it is a good idea to merge the examples, as the first one focuses on showing two distinct learning curves, whereas the second one focuses on stacking the fit time curves of multiple estimators to find the best one.

So could you be more specific about what you would expect?

For the moment, I guess you would like to directly retrieve the fit times of the two estimators in the first example (GaussianNB and SVC) and plot the two corresponding fit time curves (the new feature). If so, I can take care of that. Otherwise, if we want to show more curves (and therefore more estimators), we should keep the examples separate (or would you prefer to add other classifiers to the first example?).

@NicolasHug (Member) May 28, 2019

> So could you be more specific about what you would expect?

Why not just add your 2 new plots (n_samples vs. fit_time and fit_time vs. score) for both pre-existing estimators?

That would be a 3 by 2 grid of plots and would avoid having this whole new code section.

@H4dr1en (Contributor, Author) May 29, 2019

OK, so here are the resulting plots with changes in ea0cc71:

[figure: resulting plots]

@NicolasHug (Member) May 29, 2019

If you look at the generated doc (https://59951-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_learning_curve.html), the plots are very small. Maybe try with 3 rows and 2 columns instead.
(You can build the doc locally by following these guidelines: https://scikit-learn.org/dev/developers/contributing.html#building-the-documentation)

@H4dr1en (Contributor, Author) Jun 5, 2019

I am having trouble generating the doc. I am using a miniconda venv, so I enter the following at the command prompt:

    (py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=plot_learning_curve.py
    (py3) path\scikit-learn\doc>make html

But all the examples are then generated (the regex variable is not taken into account), and this takes forever on my computer...
I also tried with set EXAMPLES_PATTERN=*plot_learning_curve.py but got the same result. How would you do it?

@NicolasHug (Member) Jun 5, 2019

The doc says EXAMPLES_PATTERN=your_regex_goes_here make html, in one command.
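For reference, that one-command form relies on the POSIX shell convention of prefixing a command with an environment assignment; Windows cmd.exe has no equivalent prefix syntax, which may explain an "unknown command" error there. A minimal illustration (using a harmless echo instead of the actual doc build):

```shell
# POSIX shells (bash, zsh, sh): a leading VAR=value assignment applies
# only to the command that follows, e.g.:
#   EXAMPLES_PATTERN=plot_learning_curve make html
EXAMPLES_PATTERN=plot_learning_curve sh -c 'echo "pattern is $EXAMPLES_PATTERN"'

# The variable is not left set in the surrounding shell afterwards:
echo "afterwards: '${EXAMPLES_PATTERN}'"
```

On Windows cmd.exe, the variable must instead be set on its own line first (set EXAMPLES_PATTERN=... then make html), as in the comment above.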

@H4dr1en (Contributor, Author) Jun 6, 2019

This gave me an unknown command error, but anyway, make html eventually worked.

@NicolasHug (Member) left a comment

A few comments; looks good in general.

I'm not a huge fan of returning tuples whose size depends on the input, but I guess it's too late to return a Bunch anyway.


examples/plot_kernel_ridge_regression.py (resolved review comment)
sklearn/model_selection/_validation.py (3 resolved review comments)
@jnothman (Member) left a comment

@thomasjpfan wrote on gitter:

This seems like a good candidate for returning a dictionary (or a Bunch). I think going through a deprecation cycle for it is a little rough.

Are you trying to suggest changing the return value without deprecation? I don't think that's a good idea, certainly not unless you have a return value that can unpack as three elements.

Or are you trying to suggest that you want feedback on whether we should consider deprecating and moving to a bunch. I agree that deprecation would be disruptive to users and tutorials of this, so I'm not entirely fond of it.

If we want to change to bunch, then instead of return_times we should add return_bunch and include times in the bunch, make the default value change from False to True in a couple of versions, etc, etc.

@thomasjpfan (Member) commented May 30, 2019

> Or are you trying to suggest that you want feedback on whether we should consider deprecating and moving to a bunch.

I was looking for feedback for such an API change. (I will be more direct next time). Even in the example we have something like this:

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, return_times=True)

Every time I see \ before the function is even called, I think of returning a bunch.
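For illustration, a Bunch-based return could look like the sketch below; learning_curve_bunch is a hypothetical wrapper written for this discussion, not part of scikit-learn's API:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import Bunch

def learning_curve_bunch(estimator, X, y, **kwargs):
    """Hypothetical wrapper returning a Bunch instead of a variable-size tuple."""
    train_sizes, train_scores, test_scores, fit_times, score_times = (
        learning_curve(estimator, X, y, return_times=True, **kwargs))
    return Bunch(train_sizes=train_sizes, train_scores=train_scores,
                 test_scores=test_scores, fit_times=fit_times,
                 score_times=score_times)

# No backslash continuation needed at the call site:
X, y = make_classification(n_samples=90, random_state=0)
res = learning_curve_bunch(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print(res.fit_times.shape)  # attribute access instead of tuple unpacking
```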

@jnothman (Member) commented May 30, 2019

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@NicolasHug (Member) left a comment

Minimal comments. LGTM anyway

examples/model_selection/plot_learning_curve.py (2 resolved review comments)
sklearn/model_selection/tests/test_validation.py (resolved review comment)
@jnothman (Member) left a comment

Otherwise lgtm. We can separately consider a return_bunch option, I suppose.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

sklearn/model_selection/tests/test_validation.py (resolved review comment)
@NicolasHug (Member) commented Jun 11, 2019

@H4dr1en, please look at https://github.com/scikit-learn/scikit-learn/pull/13938/files and revert all the unrelated changes.

@NicolasHug NicolasHug changed the title [MRG] Added fit and score times for learning curves Added fit and score times for learning_curve Jun 14, 2019
@NicolasHug NicolasHug merged commit b28aadf into scikit-learn:master Jun 14, 2019
16 checks passed
@NicolasHug (Member) commented Jun 14, 2019

Thanks @H4dr1en !

@rth (Member) commented Jun 16, 2019

It looks like Circle CI is failing on master after this PR was merged?

@thomasjpfan (Member) commented Jun 16, 2019

Raised an issue regarding this at #14098.
