ENH extensible parameter search results #1787

jnothman · 2013-03-18T03:28:38Z

GridSearch and friends need to be able to return more fields in their results (e.g. #1742, composite score).

More generally, the conceivable results from a parameter search can be classified into:

1. per-parameter setting, per-fold (currently only the test score for each fold)
2. per-parameter setting (currently the parameters and mean test score across folds)
3. per-search (best_params_, best_score_, best_estimator_; however, best_params_ and best_score_ are redundantly available in grid_scores_ as long as the index of the best parameters is known)

Hence this patch changes the output of a parameter search to be (attribute names are open for debate!):

grid_results_ (2.) a structured array (a numpy array with named fields) with one record per set of parameters
fold_results_ (1.) a structured array with one record per fold per set of parameters
best_index_ (3.)
best_estimator_ if refit==True (3.)

The structured arrays can be indexed by field name to produce an array of values; alternatively they can be indexed as an array to produce a single record, akin to the namedtuples introduced in 0c94b55 (not in 0.13.1). In any case it allows numpy vectorised operations, as used here when calculating the mean score for each parameter setting (in _aggregate_scores).
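
To make the two indexing styles concrete, here is a minimal NumPy sketch. The field names (`test_score`, `test_n_samples`) and the values are illustrative assumptions rather than the exact layout used in this patch, and the aggregation only approximates what `_aggregate_scores` is described as doing.

```python
import numpy as np

# Hypothetical per-fold results: 2 parameter settings x 3 folds.
fold_dtype = [("test_score", "f8"), ("test_n_samples", "i8")]
fold_results = np.zeros((2, 3), dtype=fold_dtype)
fold_results["test_score"] = [[0.5, 0.75, 1.0], [0.25, 0.5, 0.75]]
fold_results["test_n_samples"] = [[10, 10, 20], [10, 10, 20]]

# Indexing by field name yields a plain array of values...
scores = fold_results["test_score"]        # shape (2, 3)
# ...while indexing by position yields a single record.
first_record = fold_results[0, 0]

# Vectorised aggregation over the fold axis: a plain mean, and a mean
# weighted by the number of test samples in each fold.
mean_scores = scores.mean(axis=1)
weighted_mean_scores = np.average(
    scores, axis=1, weights=fold_results["test_n_samples"]
)
```

The weighted variant corresponds to the fold-size-weighted ("iid") averaging discussed later in this thread.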

Given this data, the legacy grid_scores_ (already deprecated), best_params_ and best_score_ are calculated as properties.
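
A rough sketch of how such derived properties could look, assuming a results array with `parameters` and `test_score` fields. The class `SearchResultsSketch` is a hypothetical stand-in, not the class touched by this patch, and it assumes that a greater score is better.

```python
import numpy as np


class SearchResultsSketch:
    """Hypothetical stand-in for a fitted *SearchCV object."""

    def __init__(self, grid_results, best_index):
        self.grid_results_ = grid_results
        self.best_index_ = best_index

    @property
    def best_params_(self):
        # Derived on demand from the stored index; nothing is duplicated.
        return self.grid_results_["parameters"][self.best_index_]

    @property
    def best_score_(self):
        return float(self.grid_results_["test_score"][self.best_index_])


grid_results = np.array(
    [({"foo": 5, "bar": "a"}, 1.0), ({"foo": 3, "bar": "a"}, 0.5)],
    dtype=[("parameters", object), ("test_score", "f8")],
)
# Assumption: the best candidate is the one with the highest score.
best_index = int(np.argmax(grid_results["test_score"]))
fitted = SearchResultsSketch(grid_results, best_index)
```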

This approach is extensible to new fields, in particular new fields within fold_results_ records, which are compiled from dicts returned from fit_fold (formerly fit_grid_point).

This PR is cut back from #1768; there you can see this extensibility exemplified to store training scores, training and test times, and precision and recall together with F-score.

amueller · 2013-04-04T20:31:09Z

Could you rebase please?
Maybe parameter_results_ would be better than grid_results_?
I think I am with you on this one now. Not sure if it is easier to merge this one first or #1742.

It would be great if @GaelVaroquaux, @ogrisel and @larsmans could comment, as this is core api stuff :)

amueller · 2013-04-04T20:38:30Z

This looks great! If the tests pass, I think this is good to go (plus the rebase obv).

amueller · 2013-04-04T20:39:20Z

There are probably some examples that need updating and possibly also the docs.

jnothman · 2013-04-06T13:40:46Z

rebased. Any pointers to examples and docs needing updates?

amueller · 2013-04-06T14:18:06Z

well, the grid-search narrative documentation and the examples using grid-search probably.
Also, you can run the test suite and see if there are any deprecation warnings.
Do you know where to look for the docs and examples?

jnothman · 2013-04-06T23:47:07Z

As an aside (perhaps subject to a separate PR), I wonder whether we should return the parameters as a structured array (rather than dicts). So, rather than grid_results_ being something like:

array([({'foo': 5, 'bar': 'a'}, 1.0), ({'foo': 3, 'bar': 'a'}, 0.5)], 
      dtype=[('parameters', '|O4'), ('test_score', '<f4')])

it would be:

array([((5, 'a'), 1.0), ((3, 'a'), 0.5)], 
      dtype=[('parameters', [('foo', '<i4'), ('bar', '|S1')]), ('test_score', '<f4')])

This allows us to easily query the data by parameter value:

>>> grid_results_['parameters']['foo'] > 4
array([ True, False], dtype=bool)

Note this would also apply to randomised searches, helping towards #1020 where a solution like #1034 could not.
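
For illustration, the nested-dtype layout above can be built and queried as follows. This is a sketch with made-up values; it uses a unicode field (`U1`) where the example above shows a byte-string field (`|S1`), so that comparisons against Python 3 `str` values work directly.

```python
import numpy as np

param_dtype = [("foo", "i4"), ("bar", "U1")]
results = np.array(
    [((5, "a"), 1.0), ((3, "a"), 0.5), ((4, "b"), 0.75)],
    dtype=[("parameters", param_dtype), ("test_score", "f4")],
)

params = results["parameters"]

# Boolean masks over parameter values combine like any NumPy masks.
mask = (params["foo"] > 3) & (params["bar"] == "a")
selected = results[mask]

# Mean test score for each distinct value of one parameter.
mean_by_bar = {
    str(value): float(results["test_score"][params["bar"] == value].mean())
    for value in np.unique(params["bar"])
}
```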

This approach, however, doesn't handle grid searches with multiple grids (i.e. passing an array of dicts to ParameterGrid), because there's no assurance that the same fields will be set in each grid (and the opposite is likely with #1769). This could be solved by using a masked record array, in which case it would be sensible to make parameters_ separate from grid_results_.
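
The multiple-grid problem can be sketched with NumPy masked arrays. As a simplification, the sketch below keeps one masked array per parameter name instead of a single masked record array; the grids and the expansion into candidates are illustrative assumptions, not the ParameterGrid implementation.

```python
import itertools

import numpy as np

# Two grids that do not set the same parameters.
grids = [
    {"kernel": ["linear"], "C": [1.0, 10.0]},
    {"kernel": ["rbf"], "C": [1.0], "gamma": [0.1, 0.01]},
]

# Expand each grid into its candidate parameter settings.
candidates = [
    dict(zip(sorted(grid), values))
    for grid in grids
    for values in itertools.product(*(grid[key] for key in sorted(grid)))
]

# One fully masked array per parameter; assigning a value unmasks it.
names = sorted({name for candidate in candidates for name in candidate})
parameters = {
    name: np.ma.masked_all(len(candidates), dtype=object) for name in names
}
for i, candidate in enumerate(candidates):
    for name, value in candidate.items():
        parameters[name][i] = value

# Entries stay masked wherever a grid did not set that parameter.
gamma_is_set = ~np.ma.getmaskarray(parameters["gamma"])
small_gamma = [
    int(i) for i in np.flatnonzero(gamma_is_set) if parameters["gamma"][i] < 0.05
]
```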

WDYT?

jnothman · 2013-04-09T06:21:24Z

This last commit (b18a278) moves cross-validation evaluation code into a single place, the cross_validation module. This means *SearchCV can focus on the parameter search, and that users not interested in a parameter search (or performing one by hand) can take advantage of these extended results.

Provides consistent and enhanced structured-array result style for non-search CV evaluation. So far only regression-tested.

jnothman · 2013-04-16T14:23:01Z

[The exact form of this output structure depends on other issues like #1850; to handle sample_weights (#1574), test_n_samples should instead be stored in a separate searcher attribute, fold_weights_. And my current naming preference for the result attributes is search_results_ and fold_results_.]

amueller · 2013-05-07T21:09:12Z

From your many pull requests, I think this is the one that I really want to merge first. Did you think about my proposal to rename grid_results_ to parameter_results_? I'll try to do a more serious review soon.

amueller · 2013-05-07T21:16:08Z

sklearn/grid_search.py

-        Contains scores for all parameter combinations in param_grid.
-        Each entry corresponds to one parameter setting.
-        Each named tuple has the attributes:
+    `grid_results_` : structured array of shape [# param combinations]


This shows that grid_results_ is not a good name, as it is not a grid here.

amueller · 2013-05-07T21:24:23Z

Refactoring the fit_fold from fit_grid_point and cross_val_score_ seems to be the right thing to me.

What made you add the CVEvaluator class?
Basically this moves some functionality from grid_search to cross_validation, right?
The two things I notice are the iid mean scoring and the iteration over parameter settings.
You didn't add the iid to the cross_val_score function, though.

What I don't like about this is that suddenly the concept of parameters appears in cross_validation.
Also, looking a bit into the future, maybe we want to keep the for-loop in grid_search. If we start doing any smart searches, I think CVEvaluator will not be powerful enough any more - but maybe that is getting ahead of ourselves.

jnothman · 2013-05-07T22:33:49Z

On this being the one to merge first: I agree (in particular because then we can merge in training scores and times), but there are some questions on the record structure and naming conventions to support multiple scores, so we need to think about that at the same time.
On a better name for grid_results_, I am currently swayed to search_results_. But I think that, to create consistency, the documentation needs to define its nomenclature, i.e. define each "point" or "candidate" of the search as representing a set of parameters that is evaluated, and define "fold" as well.
On the refactoring: what this does is move the parameter evaluation to cross_validation; the search space/algorithm, the selection of a best candidate (probably), and any analysis of search results are still to be defined in model_selection. Perhaps cross_validation belongs in the new model_selection package anyway?
Yes, iid should be copied to cross_val_score.
If we want fit_fold to be shared, we need some concept of setting parameters within it anyway. The main reason for allowing a sequence of parameter settings in CVEvaluator is parallelism, and that if we work out ways to handle parameters that can be changed without refitting, that will need to be done on a per-fold basis; i.e. this needs to be optimised with respect to some parameter sequence.
Re smart searches: I actually made this refactoring while considering smarter searches, which come down to a series of searches over known candidates: i.e. evaluate some candidates and then determine a trajectory, and consider another set of candidates, and so on. So this fits well to that purpose (though perhaps not the ideal API, but who's to know?).

jnothman · 2013-05-07T23:27:58Z

I just realised I didn't answer the question "What made you add the CVEvaluator class?"

Clearly it encapsulates the parallel fitting and the formatting of results. I also thought users of cross_val_score in an interactive context might appreciate something a bit more powerful, such that you can manually run an evaluation with one set of parameter settings ("candidate") then try another, etc. So something re-entrant was useful.

But a re-entrant setup was most important for the context of custom search algorithms (not in this PR; see https://github.com/jnothman/scikit-learn/tree/param_search_callback) where the CVEvaluator acts as a closure over its validated arguments and can be repeatedly called to evaluate different candidates. In particular, a more complicated search would inherit from CVEvaluator so that with every evaluation of candidates the results could be stored on the way (or not, given a setting).
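
The re-entrant idea can be sketched in plain Python. Everything here is hypothetical: `CVEvaluatorSketch`, its `fit_and_score` argument and the toy scoring function are stand-ins for illustration and do not reproduce the CVEvaluator in this PR (which also handles parallel fitting; this sketch is serial).

```python
class CVEvaluatorSketch:
    """Hypothetical re-entrant evaluator: it captures its inputs once and
    can then score any number of candidate batches on the same folds."""

    def __init__(self, fit_and_score, folds):
        self.fit_and_score = fit_and_score
        self.folds = list(folds)   # listified so the folds can be reused
        self.history = []          # results stored "on the way"

    def __call__(self, candidates):
        batch = []
        for params in candidates:
            scores = [
                self.fit_and_score(params, train, test)
                for train, test in self.folds
            ]
            batch.append(
                {"parameters": params, "test_score": sum(scores) / len(scores)}
            )
        self.history.extend(batch)
        return batch


def toy_fit_and_score(params, train, test):
    # Stand-in for fitting an estimator on `train` and scoring on `test`.
    return 1.0 - abs(params["C"] - 3.0) / 10.0


folds = [([0, 1], [2]), ([1, 2], [0]), ([0, 2], [1])]
evaluate = CVEvaluatorSketch(toy_fit_and_score, folds)

# A custom search becomes a series of batches, each chosen from the last.
first = evaluate([{"C": 1.0}, {"C": 10.0}])
best = max(first, key=lambda result: result["test_score"])
second = evaluate([{"C": best["parameters"]["C"] + 1.0}])
```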

jnothman · 2013-05-07T23:29:36Z

I also considered making CVEvaluator more general so it would handle the permutations-significance-testing case as well (i.e. parallelising over reorderings of y, not parameters), but I didn't like the result.

amueller · 2013-05-08T06:37:27Z

Thanks a lot for the feedback :) sounds sensible to me.
search_results_ would be fine with me, too.
I guess when we write the docs we will see what sounds most natural.

I'll do a fine-grained review asap ;)

amueller · 2013-05-10T10:21:09Z

sklearn/cross_validation.py

-                     fit_params):
-    """Inner loop for cross validation"""
-    n_samples = X.shape[0] if sp.issparse(X) else len(X)
+def fit_fold(estimator, X, y, train, test, scorer,


If this function is public, it should be in the references (doc/modules/classes.rst)

jnothman · 2013-05-11T11:22:50Z

Sure, I can add things to classes.rst... Alternatively, we could keep CVEvaluator and fit_fold private until we're more happy with them and their APIs?

also, add iid parameter to cross_val_score

amueller · 2013-05-14T19:52:26Z

I think we should always add stuff to classes.rst if it is public. Otherwise people have to look into the code to get help. It might help us get earlier feedback.

Just results_ seems a bit generic. mean_results_ would be ok with me. Why don't you like parameter_results_? The fold_results are indexed by the folds, the parameter_results_ are indexed by the parameters...

amueller · 2013-05-14T19:56:01Z

I think the structure of results_ and fold_results_ should be explained in the narrative in 5.2.1. It is pretty bad that the current dict is not explained there.
Feel free to rename the variables. There are some left-over grids, that probably should go.

@larsmans @GaelVaroquaux I'd really like you to have a look if you find the time.

jnothman · 2013-06-02T11:36:13Z

I wrote about this to the ML in order to weigh up alternatives and potentially get wider consensus. I don't think structured arrays are used elsewhere in scikit-learn, and I worry that while appropriate, they are a little too constraining and unfamiliar.

jnothman · 2013-06-07T02:48:27Z

It's not relevant to the rest of the proposal, but I've decided CVEvaluator should be CVScorer, adopting the Scorer interface (__call__(X, y=None, sample_weight=None, ...)), important for forward compatibility and API similarity. __call__ will call a search method whose arguments are the same except for the first which is an iterable of candidates. Both __call__ and search will return either a dict or a structured array, either way a mapping of names to arrays/values. All other public methods from CVEvaluator will disappear. Finally I intend to propose this refactoring as a separate PR.

ogrisel · 2013-06-07T08:17:43Z

Sorry for the late feedback. I will try to have a look at this PR soon as I am currently working with RandomizedSearchCV.

ogrisel · 2013-06-07T08:25:04Z

I assigned this PR to Milestone 0.14 as the new RandomizedSearchCV will be introduced in that release and we don't want to break the API too much once it's released.

jnothman · 2013-06-19T01:33:58Z

@ogrisel: With regards to your comments on the ML, would we like to see the default storage / presentation of results as:

a list of dicts
a dict of arrays
structure-ambivalent because they will be hidden behind something like my search result interface

?

ogrisel · 2013-06-19T13:52:11Z

I would prefer a list of dicts with:

parameters_id (integer unique for each parameter combination, used for grouping)
fold_id (unique integer for each CV fold, used for grouping when computing mean and std scores)
parameters (the dict of the actual parameters values: cannot be hashed in general hence cannot be used for grouping directly hence the use of a parameters_id field)
train_fold_size (integer might be useful later if we use the same interface to compute learning curves simultaneously)
test_fold_size (useful for computing iid mean score)
validation_error (for the provided scoring, used for model ranking once averaged across collected folds)
training_error (to be able to evaluate the impact of the parameters on the under fitting and over fitting behavior of the model)
training_time (float in seconds)

And later we will let the user compute additional attribute using a callback API, for instance to collect additional complementary scores such as per class precision, recall and f1 score or full confusion matrices.

Then make the search result interface compute the grouped statistics and rank models by mean validation errors by grouping on the parameters_id fields.
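
A minimal sketch of the grouping step, using only a subset of the fields listed above. The values are made up, and the sketch assumes that `validation_error` holds a score for which greater is better; the direction would need to follow the actual scoring.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Raw log: one dict per (parameter setting, fold) evaluation.
results_log = [
    {"parameters_id": 0, "fold_id": 0, "parameters": {"C": 1.0},
     "test_fold_size": 10, "validation_error": 0.5},
    {"parameters_id": 0, "fold_id": 1, "parameters": {"C": 1.0},
     "test_fold_size": 10, "validation_error": 0.75},
    {"parameters_id": 1, "fold_id": 0, "parameters": {"C": 10.0},
     "test_fold_size": 10, "validation_error": 0.75},
    {"parameters_id": 1, "fold_id": 1, "parameters": {"C": 10.0},
     "test_fold_size": 10, "validation_error": 1.0},
]


def summarise(log):
    """Group records by parameters_id, then rank by mean validation value."""
    groups = defaultdict(list)
    for record in log:
        groups[record["parameters_id"]].append(record)
    summary = []
    for parameters_id, records in groups.items():
        values = [record["validation_error"] for record in records]
        summary.append({
            "parameters_id": parameters_id,
            "parameters": records[0]["parameters"],
            "n_folds": len(records),
            "mean": mean(values),
            "std": pstdev(values),
        })
    # Assumption: higher is better, so sort in descending order.
    return sorted(summary, key=lambda row: row["mean"], reverse=True)


ranking = summarise(results_log)
```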

jnothman · 2013-06-20T00:16:18Z

That structure makes a lot of sense in terms of asynchronous parallelisation... I'm still not entirely convinced it's worthwhile having each fold available to the user as a separate record (which is providing the output of map, before reduce). I also don't think train and test fold size necessarily need to be in there if we are using the same folds for every candidate.

I guess what you're trying to say is that this is the nature of our raw data: a series of fold records. And maybe we need to make a distinction between:

the fold records produced by the search
the default in-memory storage
the default API

My suggestion of structured arrays was intended to provide compact in-memory storage with easy, flexible and efficient access, but still required per-fold intermediate records.

Let's say that we could pass some kind of results_manager to the CV-search constructor. Let's say it's a class that accepts a cv generator (or listified form) so that it knows the number and sizes of folds, and that the constructed object is stored on the CV-search estimator as results_. Let's say it has to have an add method and a get_best method. I can think of three primary implementations:

no storage: get_best performs a find-max over average scores (and results_ provides no data).
in-memory storage: don't care what the underlying storage is as long as it can be pickled and produces an interface like my SearchResult object.
off-site storage: dump data to file / kv-store / RDBMS and perform find-max at the same time and/or provide a complete API.

Each of these needs to:

group data from the same candidate for multiple folds, if add is called per-fold.
know how to calculate the best score, including (iid -> fold-weighted) average and greater_is_better logic.

I don't really think that first point should be necessary. If we have an asynchronous processing queue, we will still expect folds for each candidate to be evaluated roughly at the same time, so grouping can happen more efficiently by handling it straight off the queue (storing all the fold results temporarily in memory) rather than in each results_manager implementation. (Perhaps you wouldn't want to store all the folds in memory for LeavePOut, but I don't think that's going to be used for a large dataset / candidate space.)
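
As a sketch of the simplest of the three variants (no storage), assuming that grouping by candidate has already happened upstream. The class name, constructor arguments and method signatures are hypothetical; only the `add` / `get_best` pair comes from the description above.

```python
class NoStorageResultsManager:
    """Hypothetical 'no storage' manager: keeps only the running best."""

    def __init__(self, fold_sizes, iid=True, greater_is_better=True):
        self.fold_sizes = list(fold_sizes)
        self.iid = iid
        self.sign = 1.0 if greater_is_better else -1.0
        self._best = None  # (mean score, parameters)

    def _average(self, fold_scores):
        # iid -> weight each fold by its number of test samples.
        weights = self.fold_sizes if self.iid else [1] * len(fold_scores)
        total = sum(w * s for w, s in zip(weights, fold_scores))
        return total / sum(weights)

    def add(self, parameters, fold_scores):
        """Record one candidate whose fold scores are already grouped."""
        score = self._average(fold_scores)
        if self._best is None or self.sign * score > self.sign * self._best[0]:
            self._best = (score, parameters)

    def get_best(self):
        if self._best is None:
            raise ValueError("no candidates have been added")
        score, parameters = self._best
        return parameters, score


weighted = NoStorageResultsManager(fold_sizes=[10, 30])
unweighted = NoStorageResultsManager(fold_sizes=[10, 30], iid=False)
for manager in (weighted, unweighted):
    manager.add({"C": 1.0}, [1.0, 0.5])
    manager.add({"C": 10.0}, [0.5, 0.75])
```

With these made-up scores the two averaging modes pick different candidates, which is why the averaging and greater_is_better logic has to live wherever the best candidate is chosen.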

jnothman · 2013-06-20T00:22:08Z

In short: I can't think of a use-case where a user wants per-fold data to be in a list. In an iterable coming off a queue, yes. In a relational DB, perhaps. (Grouped by candidate, certainly.)

ogrisel · 2013-06-20T11:00:34Z

> That structure makes a lot of sense in terms of asynchronous parallelisation... I'm still not entirely convinced it's worthwhile having each fold available to the user as a separate record (which is providing the output of map, before reduce). I also don't think train and test fold size necessarily need to be in there if we are using the same folds for every candidate.

It is for failover, in case some parameter sets generate ill-conditioned optimization problems that are not numerically stable across all CV folds. That can happen with SGDClassifier and GBRT models apparently.

Dealing with missing evaluations is very useful, even with the lack of async parallelization.

> If we have an asynchronous processing queue, we will still expect folds for each candidate to be evaluated roughly at the same time

This statement is false if we would like to implement the "warm start with growing number of CV folds" use case.

> In short: I can't think of a use-case where a user wants per-fold data to be in a list. In an iterable coming off a queue, yes. In a relational DB, perhaps. (Grouped by candidate, certainly.)

Implementing fault tolerant grid search is one, iteratively growable CV folds is another (warm restarts with a higher number of CV iterations).

I wasted a couple of grid search runs (lasting 10 min each) precisely because of those 2 missing use cases yesterday. So they are not made-up use cases: as a regular user of the lib I really feel the need for those.

Implementing learning curves with a variable train_fold_size is another use case where the append-only list of atomic evaluation dicts will be easier.

In short: the dumb fold log records data structure is so much simpler and more flexible, allowing the implementation of any additional use cases in the future (e.g. learning curves and warm restarts in any dimension), that I think it should be the base underlying data structure we collect internally, even if we expect the user to rarely need to access it directly but rather through the results_ object.

For instance we could have:

results_log_ : the append only list of dict datastructure to store the raw evaluations
results_summary_ : an object that provides user friendly ways to query the results. This object class could take the raw log as constructor parameter and compute its own aggregates (like iid mean scores for ranking).

The results log can be kept if we implement warm restarts. The results_summary_ will have to be reset and recomputed from the updated log.

The end-user API can still be made simple by providing a results object that can do the aggregation and even output the structured array data structure you propose if it proves really useful from an end-user API standpoint.
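
A sketch of that split between an append-only log and a recomputed summary. The `ResultsSummary` class and the `record` helper are hypothetical, the values are made up, and `validation_error` is again assumed to be a score for which greater is better. One candidate is missing a fold at first (as after a failed fit), and a later warm restart appends it.

```python
class ResultsSummary:
    """Hypothetical summary object built from the raw append-only log."""

    def __init__(self, results_log):
        self.candidates = {}
        for rec in results_log:
            entry = self.candidates.setdefault(
                rec["parameters_id"],
                {"parameters": rec["parameters"], "n_folds": 0,
                 "weighted_sum": 0.0, "n_test_samples": 0},
            )
            entry["n_folds"] += 1
            entry["weighted_sum"] += rec["test_fold_size"] * rec["validation_error"]
            entry["n_test_samples"] += rec["test_fold_size"]

    def iid_mean(self, parameters_id):
        # Mean weighted by test fold size, over the folds that exist.
        entry = self.candidates[parameters_id]
        return entry["weighted_sum"] / entry["n_test_samples"]

    def best_id(self):
        return max(self.candidates, key=self.iid_mean)


def record(parameters_id, fold_id, parameters, size, score):
    return {"parameters_id": parameters_id, "fold_id": fold_id,
            "parameters": parameters, "test_fold_size": size,
            "validation_error": score}


results_log = [
    record(0, 0, {"C": 1.0}, 10, 0.5),
    record(0, 1, {"C": 1.0}, 30, 0.75),
    record(1, 0, {"C": 10.0}, 10, 1.0),  # fold 1 is missing for this one
]
best_before = ResultsSummary(results_log).best_id()

# Warm restart: the missing evaluation is appended, the summary rebuilt.
results_log.append(record(1, 1, {"C": 10.0}, 30, 0.5))
summary = ResultsSummary(results_log)
best_after = summary.best_id()
```

Ranking a candidate on fewer folds than its competitors, as happens before the restart here, is a policy question the sketch does not try to settle.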

ogrisel · 2013-06-20T11:02:10Z

Also I don't think memory efficiency will ever be an issue: even with millions of evaluations the overhead of python dicts and python object reference is pretty manageable in 2013 :)

jnothman · 2013-06-20T11:48:18Z

> Also I don't think memory efficiency will ever be an issue: even with millions of evaluations the overhead of python dicts and python object reference is pretty manageable in 2013 :)

Assuming you're not collecting other data, but in that case you're right, the dict overhead will make little difference, and I'm going on about nothing. For fault tolerance there's still sense in storing some data on-disk, though.

I'll think about how best to transform this PR into something like that.

jnothman · 2013-06-20T11:56:58Z

So from master, the things that IMO should happen are:

the fit_grid_point function should return a dict that will be used directly as a results_log_ entry, which means it needs to be passed the candidate id and fold id where it is not currently.
this implementation should also replace the parallelised function in cross_val_score, forming a CVScorer class to handle the shared parallelisation logic. These first two points form a PR on their own.
following on from that, PRs to store the log and an API for results access.

jnothman · 2013-06-20T11:58:55Z

And again, I should point out that one difficulty with dicts is that our names for fields in them cannot have deprecation warnings, so it's a bit dangerous making them a public API...

ogrisel · 2013-06-20T12:03:28Z

> And again, I should point out that one difficulty with dicts is that our names for fields in them cannot have deprecation warnings, so it's a bit dangerous making them a public API...

That's a valid point I had not thought of.

jnothman · 2013-06-20T12:09:26Z

So we could make them custom objects, but they're less portable. I can't yet think of a nice solution there, except to make the results_log_ an unstable advanced feature...

(And not being concerned by the memory consumption of dicts, your comment on the memory efficiency of namedtuples in the context of _CVScoreTuple is a bit superfluous!)
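
One way the custom-object option could look is a thin dict subclass that keeps answering to a renamed key while warning about it. This is only a sketch: `ResultRecord` and the `parameters_id` to `candidate_id` rename are illustrative assumptions, not a proposed API.

```python
import warnings


class ResultRecord(dict):
    """Hypothetical dict subclass that still answers to renamed keys."""

    _renamed = {"parameters_id": "candidate_id"}  # old name -> new name

    def __getitem__(self, key):
        if key in self._renamed:
            new_key = self._renamed[key]
            warnings.warn(
                "%r is deprecated; use %r instead" % (key, new_key),
                DeprecationWarning,
                stacklevel=2,
            )
            key = new_key
        return super().__getitem__(key)


record = ResultRecord(candidate_id=3, fold_id=0, validation_error=0.25)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = record["parameters_id"]  # warns, then returns the new field
    new_style = record["candidate_id"]   # no warning
```

Note that `dict.get` and `in` do not go through `__getitem__`, so a complete version would need to override those as well.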

ogrisel · 2013-06-20T12:17:42Z

> So from master, the things that IMO should happen are:

> a. the fit_grid_point function should return a dict that will be used directly as a results_log_ entry, which means it needs to be passed the candidate id and fold id where it is not currently.

> b. this implementation should also replace the parallelised function in cross_val_score, forming a CVScorer class to handle the shared parallelisation logic. These first two points form a PR on their own.

> c. following on from that, PRs to store the log and an API for results access.

Sounds good. Also +1 for using candidate_id instead of parameters_id.

I would like to have other people's opinions on our discussion though. Apparently people are pretty busy at the moment. Let's see:

Ping @larsmans @mblondel @amueller @pprett @glouppe @arjoly @vene

I know @GaelVaroquaux is currently traveling at conferences. We might have a look at this during the SciPy sprint next week with him and @jakevdp.

ogrisel · 2013-06-20T12:19:41Z

> (And not being concerned by the memory consumption of dicts, your comment on the memory efficiency of namedtuples in the context of _CVScoreTuple is a bit superfluous!)

Indeed, it's just that I added a __slots__ = () to the existing CVScoreTuple tuple to make it more idiomatic and then people started to ask why. Hence I added the comment.
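
For readers wondering about the same thing, a small sketch of what `__slots__ = ()` changes on a namedtuple subclass. The class and field names below are illustrative only.

```python
from collections import namedtuple

_Base = namedtuple("ScoreTuple", ["parameters", "mean_score", "fold_scores"])


class WithoutSlots(_Base):
    # No __slots__: every instance also carries its own __dict__.
    pass


class WithSlots(_Base):
    # Empty __slots__: instances stay as compact as a plain tuple.
    __slots__ = ()


def can_set_attribute(obj):
    """Return True if arbitrary attributes can be attached to obj."""
    try:
        obj.note = "x"
    except AttributeError:
        return False
    return True


loose = WithoutSlots({"C": 1}, 0.5, [0.5])
compact = WithSlots({"C": 1}, 0.5, [0.5])
loose_accepts = can_set_attribute(loose)      # per-instance dict exists
compact_accepts = can_set_attribute(compact)  # no per-instance dict
```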

jnothman · 2013-06-20T13:13:02Z

> I would like to have other people's opinions on our discussion though.

I think the discussion is a bit hard to navigate and it would be more sensible to present a cut back PR: #2079. I'll close this one as it seems we're unlikely to go with its solution.

amueller mentioned this pull request Mar 18, 2013

Restructure the output attributes of *SearchCV #1768

Closed

jnothman mentioned this pull request Apr 4, 2013

Use cross_validation.cross_val_score with metrics.precision_recall_fscore_support #1837

Closed

ENH extensible parameter search results using structured arrays

9005683

jnothman mentioned this pull request Apr 7, 2013

ENH method to index ParameterGrid points by parameter values #1842

Closed

EXAMPLE remove examples' uses of *SearchCV.cv_scores_

6d96923

REFACTOR/ENH refactor CV scoring from grid_search and cross_validation

b18a278

Provides consistent and enhanced structured-array result style for non-search CV evaluation. So far only regression-tested.

jnothman added 2 commits May 2, 2013 09:26

COSMIT fix pep8 violations

c5da116

COSMIT Remove unused private methods

4a8215e

amueller reviewed May 7, 2013

amueller reviewed May 10, 2013

jnothman added 2 commits May 11, 2013 22:14

DOC Fix comments for cross_val_score and CVEvaluator

51ff593

also, add iid parameter to cross_val_score

Debug messages now print fit_fold in place of CVEvaluator

66a8463

Rename grid_results_ to search_results_

d4c10f6

amueller mentioned this pull request May 13, 2013

ENH: add pre_dispatch parameter to cross_val_score function #1961

Closed

Change CVEvaluator to approximate scorer interface

72f2286

jnothman mentioned this pull request Jun 20, 2013

WIP support future extensiblity of grid search results #2079

Closed

jnothman closed this Jun 20, 2013

jnothman mentioned this pull request Jan 10, 2014

[MRG] Refactor CV and grid search #2736

Closed


This was referenced Apr 20, 2016

[RFC] Better Format for search results in model_selection module. #6686

Closed

[MRG+3] ENH Restructure grid_scores_ into a dict of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. #6697

Merged
