
Added fit and score times for learning_curve #13938

Merged: 23 commits into scikit-learn:master on Jun 14, 2019

Conversation

H4dr1en (Contributor) commented May 24, 2019

What does this implement/fix? Explain your changes.

It would be useful to have access to the fit and score times of the estimators while computing learning curves. This can be done very easily because the _fit_and_score method already has a return_times parameter; I suggest setting it to True when computing the learning curves.

Any other comments?

Here is how fit and score times can be used to get valuable information:

Using the following setup:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
estimator = RandomForestRegressor()

Having the fit and score times lets us plot the following:

[image: learning_curve_doc_time_all]

And also such plots:

[image: learning_curve_doc_diag]

As you can see, it is easy to determine the best estimator for the considered dataset, taking into account the "scalability" of the model: it helps us tell which model will perform best if we add more data to our dataset.
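The proposed usage can be sketched as follows (a minimal sketch assuming the return_times flag from this PR, which shipped in scikit-learn 0.22; a small DecisionTreeRegressor is used here instead of the RandomForestRegressor above just to keep the run fast):

```python
# Sketch of learning_curve with the new return_times flag
# (assumes scikit-learn >= 0.22).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
estimator = DecisionTreeRegressor(random_state=0)

# With return_times=True, the returned tuple grows from 3 to 5 elements.
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    estimator, X, y, cv=3,
    train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True)

# fit_times and score_times have shape (n_train_sizes, n_cv_folds);
# averaging over folds gives one curve per estimator, as in the plots above.
mean_fit_times = fit_times.mean(axis=1)
```

Plotting mean_fit_times against train_sizes gives the "scalability" curve discussed above.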

@H4dr1en H4dr1en changed the title [MRG] Added fit and score times for learning curves [WIP] Added fit and score times for learning curves May 24, 2019
@H4dr1en H4dr1en changed the title [WIP] Added fit and score times for learning curves [MRG] Added fit and score times for learning curves May 24, 2019
jnothman (Member) left a comment

Thanks for the pull request. We will not make this kind of backwards-incompatible change. Please add a parameter such as return_times. It may also be worth adding a plot like that to an example.

H4dr1en (Contributor, Author) commented May 27, 2019

I added backwards compatibility (the default behavior is as before, i.e. return_times=False).
I also added plot_learning_curves_times.py under examples/model_selection, which generates the following plot:

[image: generated plot]

Should I take care of generating the doc for this example? If yes, how?

jnothman (Member) commented May 27, 2019

Setting it to -1 is likely to make the generation of the documentation crash.

H4dr1en added 5 commits May 27, 2019
Review thread on examples/model_selection/plot_learning_curve.py (outdated, resolved):
@@ -126,3 +126,128 @@ def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

plt.show()

###############################################################################

jnothman (Member), May 28, 2019:
Is there a way to actually integrate this into the example above, rather than having an entirely separate demonstration of this functionality?

H4dr1en (Author), May 28, 2019:
I guess we can, I will try!

EDIT: I am wondering whether it is a good idea to merge the examples, as the first one focuses on showing two distinct learning curves, whereas the second one focuses on stacking the fit-time curves of multiple estimators to find the best one.

So could you be more specific about what you would expect?

For the moment, I guess you would like to directly retrieve the fit times of the two estimators in the first example (GaussianNB and SVC) and plot the two corresponding fit-time curves (the new feature). If so, I can take care of that. Otherwise, if we want to show more curves (and therefore more estimators), we should keep the examples separate (or would you prefer to add other classifiers to the first example?).

NicolasHug (Contributor), May 28, 2019:
So could you be more specific about what you would expect?

Why not just add your two new plots (n_samples vs fit_time, and fit_time vs score) for both pre-existing estimators?

That would be a 3-by-2 grid of plots and would avoid having this whole new code section.
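The suggested layout could be sketched like this (a hypothetical figure skeleton only, not the merged example code; the row and column labels are taken from the discussion, and the axis contents are omitted):

```python
# Hypothetical 3-row, 2-column figure skeleton: one column per estimator,
# one row per plot type (learning curve, scalability, performance).
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 2, figsize=(10, 15))
row_titles = ["n_samples vs score", "n_samples vs fit_time", "fit_time vs score"]
estimators = ["Naive Bayes", "SVM"]  # the two pre-existing estimators

for col, name in enumerate(estimators):
    for row, title in enumerate(row_titles):
        # Each panel would be filled from learning_curve results.
        axes[row, col].set_title("%s (%s)" % (title, name))
```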

H4dr1en (Author), May 29, 2019:
OK, so here are the resulting plots with changes in ea0cc71:

[image: resulting plots]

NicolasHug (Contributor), May 29, 2019:
If you look at the generated doc (https://59951-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_learning_curve.html), the plots are very small. Maybe try 3 rows and 2 columns instead.
(You can build the doc locally by following these guidelines: https://scikit-learn.org/dev/developers/contributing.html#building-the-documentation)

H4dr1en (Author), Jun 5, 2019:
I am having trouble generating the doc. I am using a miniconda venv, so I enter the following at the command prompt:

(py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=plot_learning_curve.py
(py3) path\scikit-learn\doc>make html

But all the examples are then generated (the regex variable is not taken into account), and this takes forever on my computer...
I also tried (py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=*plot_learning_curve.py, but got the same result. How would you do it?

NicolasHug (Contributor), Jun 5, 2019:
The doc says EXAMPLES_PATTERN=your_regex_goes_here make html, in one command.

H4dr1en (Author), Jun 6, 2019:
This gave me an "unknown command" error, but anyway, make html eventually worked.

NicolasHug (Contributor) left a comment

A few comments; looks good in general.

I'm not a huge fan of returning tuples of different sizes depending on the input, but I guess it's too late to return a Bunch anyway.


Review threads (outdated, resolved) on examples/plot_kernel_ridge_regression.py and sklearn/model_selection/_validation.py (three threads).
jnothman (Member) left a comment

@thomasjpfan wrote on gitter:

This seems like a good candidate for returning a dictionary (or a Bunch). I think going through a deprecation cycle for it is a little rough.

Are you trying to suggest changing the return value without deprecation? I don't think that's a good idea, certainly not unless you have a return value that can still unpack as three elements.

Or are you suggesting that you want feedback on whether we should consider deprecating and moving to a Bunch? I agree that deprecation would be disruptive to users and tutorials, so I'm not entirely fond of it.

If we want to change to a Bunch, then instead of return_times we should add return_bunch and include the times in the Bunch, change the default value from False to True over a couple of versions, etc.

thomasjpfan (Member) commented May 30, 2019

Or are you trying to suggest that you want feedback on whether we should consider deprecating and moving to a bunch.

I was looking for feedback on such an API change. (I will be more direct next time.) Even in the example we have something like this:

train_sizes, train_scores, test_scores, fit_times, _ = \
    learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                   train_sizes=train_sizes,
                   return_times=True)

Every time I see a \ before the function is even called, I think of returning a Bunch.
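The Bunch-style return under discussion could look something like this (a hypothetical sketch: the learning_curve_bunch helper is invented here for illustration and is not part of scikit-learn; it wraps the actual learning_curve, assuming scikit-learn >= 0.22 for return_times):

```python
# Hypothetical wrapper returning a Bunch instead of a variable-length tuple.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import Bunch


def learning_curve_bunch(estimator, X, y, **kwargs):
    """Call learning_curve and pack the results into a dict-like Bunch
    with attribute access (invented helper, for illustration only)."""
    kwargs["return_times"] = True
    train_sizes, train_scores, test_scores, fit_times, score_times = \
        learning_curve(estimator, X, y, **kwargs)
    return Bunch(train_sizes=train_sizes, train_scores=train_scores,
                 test_scores=test_scores, fit_times=fit_times,
                 score_times=score_times)


X, y = make_classification(n_samples=150, random_state=0)
curve = learning_curve_bunch(GaussianNB(), X, y, cv=3)
# Callers then write curve.fit_times instead of unpacking a 5-tuple.
```

The upside is that adding new fields later (such as the times here) no longer breaks existing unpacking code.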

jnothman (Member) commented May 30, 2019
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
NicolasHug (Contributor) left a comment

Minimal comments. LGTM anyway

Review threads (outdated, resolved) on examples/model_selection/plot_learning_curve.py (two threads) and sklearn/model_selection/tests/test_validation.py.
jnothman (Member) left a comment

Otherwise LGTM. We can separately consider a return_bunch option, I suppose.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.

Review thread (outdated, resolved) on sklearn/model_selection/tests/test_validation.py.
H4dr1en added 3 commits Jun 11, 2019
NicolasHug (Contributor) commented Jun 11, 2019

@H4dr1en , please look at https://github.com/scikit-learn/scikit-learn/pull/13938/files and revert all the unrelated changes.

@NicolasHug NicolasHug changed the title [MRG] Added fit and score times for learning curves Added fit and score times for learning_curve Jun 14, 2019
@NicolasHug NicolasHug merged commit b28aadf into scikit-learn:master Jun 14, 2019
16 checks passed:
- LGTM analysis: C/C++: no code changes detected
- LGTM analysis: JavaScript: no code changes detected
- LGTM analysis: Python: 1 new alert
- ci/circleci (deploy, doc, doc-min-dependencies, lint): passed
- codecov/patch: 100% of diff hit (target 96.81%)
- codecov/project: 96.81% (+<.01%) compared to 61f6f5b
- scikit-learn.scikit-learn Build #20190611.23 succeeded (Linux py35_conda_openblas, Linux py35_np_atlas, Linux pylatest_conda, Windows py35_32, Windows py37_64, macOS pylatest_conda)
NicolasHug (Contributor) commented Jun 14, 2019

Thanks @H4dr1en !

rth (Member) commented Jun 16, 2019

It looks like CircleCI is failing on master after this PR was merged?

thomasjpfan (Member) commented Jun 16, 2019

Raised an issue regarding this at #14098.

5 participants