
[MRG+3] ENH Restructure grid_scores_ into a dict of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. #6697

Merged
merged 5 commits into scikit-learn:master from raghavrv:multiple_metric_grid_search on Jun 16, 2016

10 participants
@raghavrv
Member

raghavrv commented Apr 22, 2016

Overrides/Fixes #6686, #1034, #1787, #1768

Also related #1742, #1020

Tangentially related #1850, #1837, #2759

Also adds a name to all the scorers

  • _get_scorer_name(checked_scoring_function). Reverted based on consensus.
  • Test _get_scorer_name. Reverted based on consensus.
  • Restructure grid_scores_ into a dict of (masked) numpy arrays, which can be easily imported into pandas (see the sketch after this list).
  • Old tests pass
  • Modify old tests + New tests
  • _get_candidate_params() --> search.candidate_params_ --> search.results_['params']
  • Compute weighted std (for iid data) + test
  • Try vectorizing in bulk using reshape
  • Examples with nice plots? (not for this PR)
  • Doc + Whatsnew
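As a rough sketch of the target structure (the key names here follow the docstring example discussed in the review below and are not necessarily the final ones), each key becomes a column and each row a candidate, so the dict drops straight into pandas:

import numpy as np
import numpy.ma as ma
import pandas as pd

# Illustrative only: key names follow the docstring example in this PR and
# may differ from what is eventually merged.
results = {
    'kernel': ma.masked_array(['poly', 'poly', 'rbf', 'rbf'],
                              mask=[False, False, False, False]),
    'gamma': ma.masked_array([np.nan, np.nan, 0.1, 0.2],
                             mask=[True, True, False, False]),
    'degree': ma.masked_array([2.0, 3.0, np.nan, np.nan],
                              mask=[False, False, True, True]),
    'accuracy_score_split_0': np.array([0.8, 0.7, 0.8, 0.9]),
    'accuracy_score_split_1': np.array([0.82, 0.5, 0.7, 0.78]),
    'accuracy_score_mean': np.array([0.81, 0.60, 0.75, 0.82]),
    'candidate_rank': np.array([2, 4, 3, 1]),
}

df = pd.DataFrame(results)        # masked entries show up as NaN
print(df.sort_values('candidate_rank'))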

Reviews welcome!

The sandbox notebook

@amueller @vene @jnothman @agramfort @GaelVaroquaux @mblondel @hlin117 @ogrisel

@raghavrv raghavrv changed the title from [WIP] ENH Restructure grid_scores_ into more efficient data structure + Simultaneous Multiple Metric Grid / Random Search / cross-validation to [WIP] ENH Restructure grid_scores_ into more efficient data structure Apr 25, 2016

@raghavrv


Member

raghavrv commented Apr 27, 2016

Sorry for the delay! I was trying to attack multiple metric support along with this and later realized the Scorer interface discussion needs to take place before that!

So I've just restructured grid_scores_ into search_results_, which is a dict of numpy masked arrays for the parameters and regular arrays for the scores, means and ranks.

* ``mean_validation_score``, the mean score over the
cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold
search_results_ : dict of numpy (masked) ndarrays


@hlin117

hlin117 Apr 27, 2016

Contributor

By renaming grid_scores_ to search_results_, you're changing the API, right? You probably don't want to do that.


@vene

vene Apr 27, 2016

Member

The model_selection module has not been publicly released, so there's an argument to be made that this is fair game.


@jnothman

jnothman Apr 27, 2016

Member

Rather, this sort of change has been proposed and sought by the core devs for years; the fact that model_selection has not been released maybe makes it the best time to do it. However, I think we must consider this a deprecation case, not just drop grid_scores_ altogether. And that's because users will initially not rewrite all their old code but will change some imports, and will be much less angry if they get a DeprecationWarning at the end of a long fit rather than an AttributeError.


@raghavrv

raghavrv Apr 28, 2016

Member

Humm... We plan on adding multiple metric support subsequently. How would grid_scores_ look for that use case? Do we raise an error when grid_scores_ is accessed and multiple metrics are required? Or do we return the first metric alone? Or all the metrics as a dict of lists (of _CVScoreTuples)?

Also, the old grid_search module is still available for all the angry users who can tolerate a DeprecationWarning. My humble opinion is that trying to support old code here will constrain us, not now but in subsequent PRs.


@vene

vene Apr 28, 2016

Member

I think @jnothman is right, especially because this is likely to be an issue at the end of a long fit. I guess grid_scores_ should raise an exception if (when) multiple metrics are used. They're not "angry users"; if anything, they will be users who embark on the process of updating their projects to keep up with scikit-learn changes. We want to make it smooth, such that they can stop in the middle of the rewriting effort, if they have to, and things will still work, right?

Maybe in such a case we could use a shorter deprecation cycle, if you think that would help keep things cleaner.


@raghavrv

raghavrv Apr 28, 2016

Member

users who embark on the process of updating their projects to keep up with scikit-learn changes.

Okay, with +3 I'm adding grid_scores_ back as a property. Also, I was wondering if we should store them like we did before, as computing them from search_results_ (especially the params) is not very clean.


@jnothman

jnothman Apr 30, 2016

Member

It only has to work for cases formerly supported. Either way of producing them is fine.

* ``mean_validation_score``, the mean score over the
cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold
search_results_ : dict of numpy (masked) ndarrays


@MechCoder


Member

MechCoder commented Apr 27, 2016

Ping again when this is ready for review.

mask = [ True True False False]...),
'degree' : masked_array(data = [2.0 3.0 -- --],
mask = [False False True True]...),
'accuracy_score_split_0' : [0.8, 0.7, 0.8, 0.9],


@jnothman

jnothman Apr 28, 2016

Member

It is annoying that we are pandering to pandas (ha!) to not allow a 2d array here :(


@raghavrv

raghavrv Apr 28, 2016

Member

Indeed! The other option is to store the scores in compact 2D numpy float arrays and have a sorted list of column headers (the dict keys of search_results_). (only the scores, not the parameters).
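A tiny sketch of what that alternative might look like (hypothetical names, not what this PR implements):

import numpy as np

# Hypothetical alternative: one compact 2D float array for the scores plus an
# ordered list of column headers; the parameters would still live elsewhere.
score_columns = ['accuracy_score_split_0', 'accuracy_score_split_1']
scores_2d = np.array([[0.80, 0.82],
                      [0.70, 0.50],
                      [0.80, 0.70],
                      [0.90, 0.78]])

# Column lookup by name:
split_1 = scores_2d[:, score_columns.index('accuracy_score_split_1')]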


@hlin117

hlin117 Apr 30, 2016

Contributor

I think for some functions in networkx, they allow you to return pandas dataframes when possible by flipping a return_pandas parameter to True.


@jnothman

jnothman May 1, 2016

Member

Well, I suppose the most idiomatic Pandas representation is using a MultiIndex. But I don't think it's necessarily our job to produce that.


@jnothman

jnothman May 1, 2016

Member

Not that I'm a proficient enough Pandas user to really say what is idiomatic or useful.


@jnothman

jnothman May 1, 2016

Member

To be practical, what I am a little concerned about is: how easy is it to calculate the standard deviations of scores across k-fold CV? It's easy in the current storage; how easy is it in the proposed? How easy is it in the Pandas rep?


@hlin117

hlin117 May 1, 2016

Contributor

@jnothman Assuming that the folds in question are named accuracy_score_split_0, ..., accuracy_score_split_(k-1), and the pandas dataframe with the scores is called search_results_, it's as easy as

fold_scores = search_results_.select(lambda name: name.startswith("accuracy_score_split_"), axis=1)

# Placing data into the dataframe
search_results_["std"] = fold_scores.std(axis=1)
search_results_["mean"] = fold_scores.mean(axis=1)


@jnothman

jnothman May 1, 2016

Member

I guess that's not terrible...
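For comparison, a sketch of the same computation done directly on the proposed dict of 1D arrays (the 'accuracy_score_split_*' key names are assumed from the docstring example):

import numpy as np

def fold_stats(results, prefix='accuracy_score_split_'):
    # Stack the per-fold columns into an (n_candidates, n_splits) array.
    fold_keys = sorted(k for k in results if k.startswith(prefix))
    fold_scores = np.column_stack([results[k] for k in fold_keys])
    return fold_scores.mean(axis=1), fold_scores.std(axis=1)

means, stds = fold_stats({'accuracy_score_split_0': np.array([0.8, 0.7]),
                          'accuracy_score_split_1': np.array([0.82, 0.5])})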



@raghavrv

raghavrv May 1, 2016

Member

Do you think get_candidates(...) addresses this problem?

search_results_ : dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, that can be
imported into a pandas DataFrame.


@jnothman

jnothman Apr 28, 2016

Member

If scorer names are going to be used as field names, somewhere we will need to specify what the default scorer name is for custom scorers.


@raghavrv

raghavrv Apr 28, 2016

Member

Yes. Thanks!!


@MechCoder

MechCoder May 5, 2016

Member

Also, it would be useful to add documentation on what each column is, so that I can get an idea before looking at the example below.

'accuracy_score_split_0' : [0.8, 0.7, 0.8, 0.9],
'accuracy_score_split_1' : [0.82, 0.5, 0.7, 0.78],
'accuracy_score_mean' : [0.81, 0.60, 0.75, 0.82],
'candidate_rank' : [2, 4, 3, 1],


@jnothman

jnothman Apr 28, 2016

Member

Should this not also be by metric?


@raghavrv

raghavrv Apr 28, 2016

Member

Sorry for the sloppy work. I did the documentation before the implementation, which is why it's not in line with the impl. And yes, I should add tests and clean up the docs.

@@ -722,15 +792,32 @@ class GridSearchCV(BaseSearchCV):
Attributes
----------
grid_scores_ : list of named tuples


@jnothman

jnothman Apr 28, 2016

Member

Please nonetheless define:

@property
def grid_scores_(self):
    warnings.warn(... DeprecationWarning)
    return equivalent to what used to be provided
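A fleshed-out sketch of what such a property could look like (attribute and key names are assumed here, not the final implementation):

import warnings
from collections import namedtuple

import numpy as np

_CVScoreTuple = namedtuple('_CVScoreTuple',
                           ('parameters', 'mean_validation_score',
                            'cv_validation_scores'))

class BaseSearchCVSketch(object):
    # Assumes the estimator stores results_ (dict of 1D arrays), n_splits_,
    # and 'test_split<i>_score' key names; all of these are assumptions.
    @property
    def grid_scores_(self):
        warnings.warn("grid_scores_ is deprecated; use results_ instead.",
                      DeprecationWarning)
        out = []
        for i, params in enumerate(self.results_['params']):
            scores = np.array([self.results_['test_split%d_score' % s][i]
                               for s in range(self.n_splits_)])
            out.append(_CVScoreTuple(params, scores.mean(), scores))
        return out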


@raghavrv

raghavrv Apr 28, 2016

Member

Also if we definitely do want grid_scores_, would we be better off storing them like we did before? Recomputing them from search_results_ (especially extracting the parameters) is a bit more complex as the results are now stored as columns rather than rows (as was done previously?)

'accuracy_score_split_0' : [0.8, 0.7, 0.8, 0.9],
'accuracy_score_split_1' : [0.82, 0.5, 0.7, 0.78],
'accuracy_score_mean' : [0.81, 0.60, 0.75, 0.82],
'candidate_rank' : [2, 4, 3, 1],


@jnothman

jnothman Apr 28, 2016

Member

I think this should be accuracy_score_rank


@raghavrv

raghavrv Apr 28, 2016

Member

Fixed but not documented properly :(

self.search_results_ = search_results
best = sorted_indices[0]


@jnothman

jnothman Apr 28, 2016

Member

I proposed ages ago that we store best_index_ and define properties that calculate best_params_ and best_score_ from it.
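A minimal sketch of that idea (attribute and key names assumed):

class BestAttributesSketch(object):
    # Stores only best_index_ at fit time; the rest is derived on access.
    @property
    def best_params_(self):
        return self.results_['params'][self.best_index_]

    @property
    def best_score_(self):
        return self.results_['test_mean_score'][self.best_index_]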

will be represented by a search_results_ dict of :
{'kernel' : masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],


@jnothman

jnothman Apr 28, 2016

Member

Should parameters be prefixed by param_?


@jnothman

jnothman Apr 28, 2016

Member

Ah. I see you've done this.

score = 0
all_scores = []
for this_score, this_n_test_samples, _, parameters in \
self.n_candidates_ = int(n_fits / n_splits)


@jnothman

jnothman Apr 28, 2016

Member

Please use n_fits // n_splits if this is truly the best way to calculate this number

@jnothman


Member

jnothman commented Apr 28, 2016

I think, @rvraghav93, you would've benefited from writing this in a more test-driven way. Indeed, we would be able to critique your expected format before you write the implementation! Please get that test written up so that I don't need to keep referring to an outdated example and so that I know whether or not the test passes.

@jnothman


Member

jnothman commented Apr 28, 2016

This is otherwise going in the right direction and I look forward to its acceptance!

I think fixing #1020 directly here (visualising and marginalising over parameters) is out of scope, though you're right that enabling Pandas access allows users to groupby and visualise.

I also think #1742, adding training scores/times should probably also remain out of scope.

@raghavrv


Member

raghavrv commented Apr 28, 2016

Thanks heaps @jnothman, @hlin117 and @vene for the review!! I'll clean up a bit and add the tests soon.

@vene


Member

vene commented Apr 28, 2016

Sure, but I was just weighing in. I'll do my best to squeeze in an actual review.

@raghavrv


Member

raghavrv commented May 1, 2016

OK, I've re-added grid_scores_ as a property.

I've also added a get_candidates(<int> / <list of ints> / None (for all candidates)) method that returns a (list of) dict(s) of candidate parameters and scores, to help users perform row-wise operations.

Not sure if this is helpful or superfluous; I just went ahead and implemented it. We could remove it if you feel it is superfluous and confuses search_results_.

Refer to the 2nd cell of the sandbox notebook for sample usage. I'll clean up docs and add tests soon.

@jnothman


Member

jnothman commented May 1, 2016

I don't think get_candidates is helpful. I see it provides:

  • indexing, but this is far from exciting enough to justify multiple access paradigms
  • parameters as dict, but this can be done within search_results_ too.
  • fold scores as array, but this could be done by providing fold_search_results_ or similar with one entry per fold rather than one entry per parameter setting; it does not require an entirely different data structure

Separately, I suggest we don't delimit everything in column names with underscores. We have no constraint that these be valid Python identifiers. Go for broke with colons, for instance.

@jnothman


Member

jnothman commented May 1, 2016

Though I guess Pandas has special attribute accessors for Python identifier names... Maybe I'm wrong about the colons, but I dislike the long underscored names aesthetically and in terms of the potential for naming conflicts...

@jnothman


Member

jnothman commented May 2, 2016

Having said that #1742 is a separate concern, I remember that when I was designing an equivalent solution, I realised that our results need to distinguish between train scores and test scores. Is it overkill to already label columns "test_score" rather than "score"?

"""Generate the metric name given the scoring parameter"""
if callable(scoring):
if scoring.__name__ == "_passthrough_scorer":
return "estimator_default_scorer"


@jnothman

jnothman May 2, 2016

Member

if "_score" is being appended, isn't "estimator" sufficient? i.e. "estimator_score"...?

if scoring.__name__ == "_passthrough_scorer":
    return "estimator_default_scorer"
else:
    return "custom_metric_%s" % (scoring.__name__,)


@jnothman

jnothman May 2, 2016

Member

In the multiple metrics context, there can still be naming conflicts. In any case, using __name__ here is a bit of a hack, and most custom scorers will be _PredictScorer or similar.


@raghavrv

raghavrv May 2, 2016

Member

Indeed. When I started this PR trying to handle multiple metrics and this in one go, I used "custom_metric_0_function_name".

(i.e we will suffix a scorer number for each custom scorer)

Do you have any suggestions in mind?
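One possible scheme along the lines mentioned above (a hypothetical helper, not code from this PR):

def _scorer_names(scoring_list):
    # Hypothetical: suffix a positional index so multiple custom scorers
    # cannot produce clashing column names.
    names = []
    for i, scoring in enumerate(scoring_list):
        if isinstance(scoring, str):
            names.append(scoring)
        elif getattr(scoring, '__name__', None) == '_passthrough_scorer':
            names.append('estimator_default_scorer')
        else:
            names.append('custom_metric_%d_%s'
                         % (i, getattr(scoring, '__name__', 'scorer')))
    return names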

if self.iid:
score /= float(n_test_samples)
mean_score = all_scores.sum() / float(n_test_samples_total)


@jnothman

jnothman May 2, 2016

Member

Can we please avoid float casting and instead ensure we have from __future__ import division up top?
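For illustration, the two idioms being asked for in this review and the one above (plain Python, no assumptions about the PR's code):

from __future__ import division  # '/' is then always true division on Python 2

n_fits, n_splits, n_test_samples, score = 12, 3, 50, 42

mean = score / n_test_samples       # 0.84, no float() cast required
n_candidates = n_fits // n_splits   # '//' where an integer result is intended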

@jnothman


Member

jnothman commented May 2, 2016

It's worth considering what parts of the construction of search_results_ should be factored out for reuse in other scoring-supporting estimators.

@raghavrv


Member

raghavrv commented May 2, 2016

Having said that #1742 is a separate concern, I remember that when I was designing an equivalent solution, I realised that our results need to distinguish between train scores and test scores. Is it overkill to already label columns "test_score" rather than "score"?

Can we do that when we add the train scores (in a separate PR)?

@raghavrv


Member

raghavrv commented May 2, 2016

Separately I suggest we don't delimit everything in column names with underscores. We have no constraint that these be valid Python identifiers. Go for broke with colons, for instance.

I did that so the columns can be accessed as attributes.

@raghavrv


Member

raghavrv commented May 2, 2016

but I dislike the long underscored names aesthetically and in terms of the potential for naming conflicts...

It does look a bit ugly :/ But I don't understand your concern about naming conflicts. Why would that happen?

@jnothman


Member

jnothman commented May 2, 2016

No, as long as things are prefixed there shouldn't really be naming conflicts.

@raghavrv raghavrv changed the title from [MRG+1] ENH Restructure grid_scores_ into a hashmap of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. to [MRG+3] ENH Restructure grid_scores_ into a hashmap of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. Jun 15, 2016

@raghavrv


Member

raghavrv commented Jun 15, 2016

All done.

I think the CircleCI failure is unrelated. Could you confirm, @ogrisel?

If so, merge? @jnothman @MechCoder @agramfort

@jnothman


Member

jnothman commented Jun 15, 2016

+1 for merge.

@raghavrv


Member

raghavrv commented Jun 15, 2016

Don't merge yet

@jnothman


Member

jnothman commented Jun 15, 2016

I meant that as: I will not merge because there may yet be minor issues, but I'm generally satisfied with this.

@raghavrv


Member

raghavrv commented Jun 15, 2016

Sorry for the short uninformative comment. I had pushed a superfluous commit (unstashed from my multiple metric search work) by mistake and commented so to avoid accidental merge.

I just used reflog to fix it. Now it's all good. Thanks heaps for the reviews!!

@raghavrv


Member

raghavrv commented Jun 15, 2016

Interesting... Force-pushing a previous version (restored via reflog) that was already tested here does not trigger a CI rebuild. Cool!

@MechCoder


Member

MechCoder commented Jun 15, 2016

merge?

@raghavrv


Member

raghavrv commented Jun 16, 2016

Yea... All good from my side!!

@jnothman jnothman merged commit afd5d18 into scikit-learn:master Jun 16, 2016

3 of 4 checks passed

ci/circleci: Your tests failed on CircleCI
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.004%) to 94.472%
@jnothman


Member

jnothman commented Jun 16, 2016

Well, well. Congratulations and thank you, @raghavrv!

@MechCoder


Member

MechCoder commented Jun 16, 2016

Yay!! Congratulations and well done 👍 🍷 🍷

Or as they say over here, மகிழ்ச்சி (joy)

@raghavrv raghavrv deleted the raghavrv:multiple_metric_grid_search branch Jun 16, 2016

@TomDLT


Member

TomDLT commented Jun 16, 2016

🍻

@amueller


Member

amueller commented Jun 16, 2016

Awesome! Great job!

for params, mean_score, scores in clf.grid_scores_:
means = clf.results_['test_mean_score']
stds = clf.results_['test_std_score']
for i in range(len(clf.results_['params'])):


@amueller

amueller Jun 16, 2016

Member

why not zip?


@raghavrv

raghavrv Jun 16, 2016

Member

I refactored this from the now-removed _get_candidate_scores. zip would indeed have been a better choice!
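A sketch of the zip-based version (assuming the results_ attribute and key names as they stood at merge time; released versions may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
clf = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=3)
clf.fit(iris.data, iris.target)

# Iterate candidates without indexing; key names assumed from the diff above.
for params, mean, std in zip(clf.results_['params'],
                             clf.results_['test_mean_score'],
                             clf.results_['test_std_score']):
    print("%0.3f (+/-%0.3f) for %r" % (mean, std, params))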


@jnothman

jnothman Jun 16, 2016

Member

Because the reviewers got lazy towards the end :)

@amueller


Member

amueller commented Jun 16, 2016

Sorry for coming late to the party, but why was results_ introduced instead of using grid_scores_? We just moved to a different module anyhow. Why should we have any deprecations in a new module?

@raghavrv


Member

raghavrv commented Jun 16, 2016

Thanks everyone! :)

Why should we have any deprecations in a new module?

#6697 (comment)

The name results_ was chosen instead of grid_scores_ because:

  • It will generically apply to GridSearchCV as well as RandomizedSearchCV and other non-grid-based *SearchCV that may be added in the future.
  • results_ will store more than scores (very soon times / n_test_samples etc. too).
@amueller


Member

amueller commented Jun 16, 2016

Thanks. Then maybe remove grid_scores_ instead of having a deprecated one?

Ok, saw #6697 (comment) by @jnothman.
I guess it's not that much of a pain so let's keep it.

@raghavrv raghavrv changed the title from [MRG+3] ENH Restructure grid_scores_ into a hashmap of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. to [MRG+3] ENH Restructure grid_scores_ into a dict of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. Aug 18, 2016

@raghavrv raghavrv referenced this pull request Aug 18, 2016

Open

`results_` attribute for all EstimatorCV classes? #7206

0 of 13 tasks complete
@amueller


Member

amueller commented Sep 6, 2016

the docstring doesn't render nicely :-/

@jnothman


Member

jnothman commented Sep 6, 2016

@amueller


Member

amueller commented Sep 8, 2016

True, it renders OK. Though the rank is a link, and the thing throws a bunch of warnings when building.
