Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

ENH extensible parameter search results #1787

Closed
wants to merge 9 commits into
from

Conversation

Projects
None yet
3 participants
Owner

jnothman commented Mar 18, 2013

GridSearch and friends need to be able to return more fields in their results (e.g. #1742, composite score).

More generally, the conceivable results from a parameter search can be classified into:

  1. per-parameter setting, per-fold (currently only the test score for each fold)
  2. per-parameter setting (currently the parameters and mean test score across folds)
  3. per-search (best_params_, best_score_, best_estimator_; however best_params_ and best_score_ are redundantly available in grid_scores_ as long as the index of the best parameters is known.)

Hence this patch changes the output of a parameter search to be (attribute names are open for debate!):

  • grid_results_ (1.) a structured array (a numpy array with named fields) with one record per set of parameters
  • fold_results_ (2.) a structured array with one record per fold per set of parameters
  • best_index_ (3.)
  • best_estimator_ if refit==True (3.)

The structured arrays can be indexed by field name to produce an array of values; alternatively they can be indexed as an array to produce a single record, akin to the namedtuples introduced in 0c94b55 (not in 0.13.1). In any case it allows numpy vectorised operations, as used here when calculating the mean score for each parameter setting (in _aggregate_scores).

Given this data, the legacy grid_scores_ (already deprecated), best_params_ and best_scores_ are calculated as properties.

This approach is extensible to new fields, in particular new fields within fold_results_ records, which are compiled from dicts returned from fit_fold (formerly fit_grid_point).

This PR is cut back from #1768; there you can see this extensibility exemplified to store training scores, training and test times, and precision and recall together with F-score.

Owner

amueller commented Apr 4, 2013

Could you rebase please?
Maybe parameter_results_ would be better than grid_results_?
I think I am with you on this one now. Not sure if it is easier to merge this one first or #1742.

It would be great if @GaelVaroquaux, @ogrisel and @larsmans could comment, as this is core api stuff :)

Owner

amueller commented Apr 4, 2013

This looks great! If the test pass, I think this is good to go (plus the rebase obv).

Owner

amueller commented Apr 4, 2013

There are probably some examples that need updating and possibly also the docs.

Owner

jnothman commented Apr 6, 2013

rebased. Any pointers to examples and docs needing updates?

Owner

amueller commented Apr 6, 2013

well, the grid-search narrative documentation and the examples using grid-search probably.
Also, you can run the test suite and see if there are any deprecation warnings.
Do you know where to look for the docs and examples?

Owner

jnothman commented Apr 6, 2013

As an aside (perhaps subject to a separate PR), I wonder whether we should return the parameters as a structured array (rather than dicts). So, rather than grid_results_ being something like:

array([({'foo': 5, 'bar': 'a'}, 1.0), ({'foo': 3, 'bar': 'a'}, 0.5)], 
      dtype=[('parameters', '|O4'), ('test_score', '<f4')])

it would be:

array([((5, 'a'), 1.0), ((3, 'a'), 0.5)], 
      dtype=[('parameters', [('foo', '<i4'), ('bar', '|S1')]), ('test_score', '<f4')])

This allows us to easily query the data by parameter value:

>>> grid_results_['parameters']['foo'] > 4
array([ True, False], dtype=bool)

Note this would also apply to randomised searches, helping towards #1020 where a solution like #1034 could not.

This approach, however, doesn't handle grid searches with multiple grids (i.e. passing an array of dicts to ParameterGrid), because there's no assurance that the same fields will be set in each grid (and the opposite is likely with #1769). This could be solved by using a masked record array, in which case it would be sensible to make parameters_ separate from grid_results_.

WDYT?

Owner

jnothman commented Apr 9, 2013

This last commit (b18a278) moves cross-validation evaluation code into a single place, the cross_validation module. This means *SearchCV can focus on the parameter search, and that users not interested in a parameter search (or performing one by hand) can take advantage of these extended results.

REFACTOR/ENH refactor CV scoring from grid_search and cross_validation
Provides consinstent and enhanced structured-array result style for non-search CV evaluation.

So far only regression-tested.

I'm not satisfied with these method names (score_folds, score_means; and in truth, these methods could be removed). I chose something verb-phrase-like. Suggestions are welcome.

Owner

jnothman commented Apr 16, 2013

[The exact form of this output structure depends on other issues like #1850; to handle sample_weights (#1574), test_n_samples should instead be stored in a separate searcher attribute, fold_weights_. And my current naming preference for the result attributes is search_results_ and fold_results_.]

Owner

amueller commented May 7, 2013

From your many pull requests, I think this is the one that I really want to merge first. Did you think about my proposal to rename grid_results_ to parameter_results_? I'll try to do a more serious review soon.

@amueller amueller commented on an outdated diff May 7, 2013

sklearn/grid_search.py
@@ -760,20 +702,27 @@ class RandomizedSearchCV(BaseSearchCV):
Attributes
----------
- `cv_scores_` : list of named tuples
- Contains scores for all parameter combinations in param_grid.
- Each entry corresponds to one parameter setting.
- Each named tuple has the attributes:
+ `grid_results_` : structured array of shape [# param combinations]
@amueller

amueller May 7, 2013

Owner

This shows that grid_results_ is not a good name, as it is not a grid here.

@amueller amueller and 1 other commented on an outdated diff May 7, 2013

sklearn/cross_validation.py
+ results : dict of string to any
+ An extensible storage of fold results including the following keys:
+
+ ``'test_score'`` : float
+ The estimator's score on the test set.
+ ``'test_n_samples'`` : int
+ The number of samples in the test set.
+ """
+ if verbose > 1:
+ start_time = time.time()
+ if est_params is None:
+ msg = ''
+ else:
+ msg = '%s' % (', '.join('%s=%s' % (k, v)
+ for k, v in est_params.items()))
+ print("[CVEvaluator]%s %s" % (msg, (64 - len(msg)) * '.'))
@amueller

amueller May 7, 2013

Owner

I don't think this should print CVEvaluator as this is a public method, isn't it?

@jnothman

jnothman May 7, 2013

Owner

Well, the previous code printed GridSearchCV when it was a public method, and called from RandomizedSearchCV as well.

@amueller

amueller May 14, 2013

Owner

Hum... ok... doesn't make it much better, but fair enough ;)

Owner

amueller commented May 7, 2013

Refactoring the fit_fold from fit_grid_point and cross_val_score_ seems to be the right thing to me.

What made you add the CVEvaluator class?
Basically this moves some functionality from grid_search to cross_validation, right?
The two things I notice are the iid mean scoring and the iteration over parameter settings.
You didn't add the iid to the cross_val_score function, though.

What I don't like about this is that suddenly the concept of parameters appears in cross_validation.
Also, looking a bit into the future, maybe we want to keep the for-loop in grid_search. If we start doing any smart searches, I think CVEvaluator will not be powerful enough any more - but maybe that is getting ahead of ourselves.

Owner

jnothman commented May 7, 2013

  • On this being the one to merge first: I agree (in particularly because then we can merge in training scores and times), but there are some questions on the record structure and naming conventions to support multiple scores, so we need to think about that at the same time.
  • On a better name for grid_results_, I am currently swayed to search_results_. But what I think is to create consistency, the documentation needs to define its nomenclature, i.e. define each "point" or "candidate" of the search as representing a set of parameters that is evaluated, and define "fold" as well.
  • On the refactoring: what this does is moves the parameter evaluation to cross_validation; the search space/algorithm, and the selecting of a best (probably), and any analysis of search results, is still to be defined in model_selection. Perhaps cross_validation belongs in the new model_selection package anyway?
  • Yes, iid should be copied to cross_val_score.
  • If we want fit_fold to be shared, we need some concept of setting parameters within it anyway. The main reason for allowing a sequence of parameter settings in CVEvaluator is parallelism, and that if we work out ways to handle parameters that can be changed without refitting, that will need to be done on a per-fold basis; i.e. this needs to be optimised with respect to some parameter sequence.
  • Re smart searches: I actually made this refactoring while considering smarter searches, which come down to a series of searches over known candidates: i.e. evaluate some candidates and then determine a trajectory, and consider another set of candidates, and so on. So this fits well to that purpose (though perhaps not the ideal API, but who's to know?).
Owner

jnothman commented May 7, 2013

I just realised I didn't answer the question "What made you add the CVEvaluator class?"

Clearly it encapsulates the parallel fitting and the formatting of results. I also thought users of cross_val_score in an interactive context might appreciate something a bit more powerful, such that you can manually run an evaluation with one set of parameter settings ("candidate") then try another, etc. So something re-entrant was useful.

But a re-entrant setup was most important for the context of custom search algorithms (not in this PR; see https://github.com/jnothman/scikit-learn/tree/param_search_callback) where the CVEvaluator acts as a closure over its validated arguments and can be repeatedly called to evaluate different candidates. In particular, a more complicated search would inherit from CVEvaluator so that with every evaluation of candidates the results could be stored on the way (or not, given a setting).

Owner

jnothman commented May 7, 2013

I also considered making CVEvaluator more general so it would handle the permutations-significance-testing case as well (i.e. parallelising over reorderings of y, not parameters), but I didn't like the result.

Owner

amueller commented May 8, 2013

Thanks a lot for the feedback :) sounds sensible to me.
search_results_ would be fine with me, too.
I guess when we write the docs we will see what sounds most natural.

I'll do a fine-grained review asap ;)

@amueller amueller commented on an outdated diff May 10, 2013

sklearn/cross_validation.py
@@ -1038,13 +1041,74 @@ def __len__(self):
##############################################################################
-def _cross_val_score(estimator, X, y, scorer, train, test, verbose,
- fit_params):
- """Inner loop for cross validation"""
- n_samples = X.shape[0] if sp.issparse(X) else len(X)
+def fit_fold(estimator, X, y, train, test, scorer,
@amueller

amueller May 10, 2013

Owner

If this function is public, it should be in the references (doc/modules/classes.rst)

@amueller amueller commented on the diff May 10, 2013

sklearn/cross_validation.py
@@ -1103,50 +1185,221 @@ def cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
See 'Scoring objects' in the model evaluation section of the user guide
@amueller

amueller May 10, 2013

Owner

Is this comment appropriate? I don't think this section says anything about CVEvaluator.
Also CVEvaluator should also be added to the references.

@jnothman

jnothman May 11, 2013

Owner

yes, looks like a cut-paste fail.

@amueller amueller commented on an outdated diff May 10, 2013

sklearn/cross_validation.py
@@ -1103,50 +1185,221 @@ def cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
See 'Scoring objects' in the model evaluation section of the user guide
for details.
- cv : cross-validation generator, optional
- A cross-validation generator. If None, a 3-fold cross
- validation is used or 3-fold stratified cross-validation
- when y is supplied and estimator is a classifier.
+ cv : integer or cross-validation generator, optional
+ A cross-validation generator or number of stratified folds (default 3).
@amueller

amueller May 10, 2013

Owner

I think we can't leave out the detail that the folds are stratified only for classification, as it doesn't make sense otherwise.

Owner

jnothman commented May 11, 2013

Sure, I can add things to classes.rst... Alternatively, we could keep CVEvaluator and fit_fold private until we're more happy with them and their APIs?

Owner

jnothman commented May 11, 2013

I considered changing grid_results_ to search_results_, but if it's found in a search object that seems redundant. Maybe we just want results_ or means_ or mean_results_.

But then I'm not sure about parameters in there. I'd like to see the parameters as a separate attribute, probably as a structured array of their own, but it makes the results somewhat harder to visually inspect without a zip (which is of course what the previous results formats have done).

While we're talking terminology, I think we want to change parameter_iter to candidates and point or similar to candidate. I think this only affects local variables and similar, but the nomenclature should be cleaned up, in the code and in the tutorial.

Owner

amueller commented May 14, 2013

I think we should always add stuff to classes.rst if it is public. Otherwise people have to look into the code to get help. It might help us get earlier feedback.

Just results_ seems a bit generic. mean_results_ would be ok with me. Why don't you like parameter_results_? The fold_results are indexed by the folds, the parameter_results_ are indexed by the parameters...

Owner

amueller commented May 14, 2013

I think the structure of results_ and fold_results_ should be explained in the narrative in 5.2.1. It is pretty bad that the current dict is not explained there.
Feel free to rename the variables. There are some left-over grids, that probably should go.

@larsmans @GaelVaroquaux I'd really like you to have a look if you find the time.

Owner

jnothman commented Jun 2, 2013

I wrote about this to the ML in order to weigh up alternatives and potentially get wider consensus. I don't think structured arrays are used elsewhere in scikit-learn, and I worry that while appropriate, they are a little too constraining and unfamiliar.

Owner

jnothman commented Jun 7, 2013

It's not relevant to the rest of the proposal, but I've decided CVEvaluator should be CVScorer, adopting the Scorer interface (__call__(X, y=None, sample_weight=None, ...)), important for forward compatibility and API similarity. __call__ will call a search method whose arguments are the same except for the first which is an iterable of candidates. Both __call__ and search will return either a dict or a structured array, either way a mapping of names to arrays/values. All other public methods from CVEvaluator will disappear. Finally I intend to propose this refactoring as a separate PR.

Owner

ogrisel commented Jun 7, 2013

Sorry for later feedback. I will try to have a look at this PR soon as I a currently working with RandomizedSearchCV.

Owner

ogrisel commented Jun 7, 2013

I assigned this PR for Milestone 0.14 as the new RandomizedSeachCV will be introduced in that release and we don't want to break the API too much once it's released.

Owner

jnothman commented Jun 19, 2013

@ogrisel: With regards to your comments on the ML, would we like to see the default storage / presentation of results as:

  • a list of dicts
  • a dict of arrays
  • structure-ambivalent because they will be hidden behind something like my search result interface

?

Owner

ogrisel commented Jun 19, 2013

I would prefer a list of dicts with:

  • parameters_id (integer unique for each parameter combinations, used for grouping)
  • fold_id (unique integer for each CV fold, used for grouping computing mean and std scores scores)
  • parameters (the dict of the actual parameters values: cannot be hashed in general hence cannot be used for grouping directly hence the use of a parameters_id field)
  • train_fold_size (integer might be useful later if we use the same interface to compute learning curves simultaneously)
  • test_fold_size (useful for computing iid mean score)
  • validation_error (for the provided scoring, used for model ranking once averaged across collected folds)
  • training_error (to be able to evaluate the impact of the parameters on the under fitting and over fitting behavior of the model)
  • training_time (float in seconds)

And later we will let the user compute additional attribute using a callback API, for instance to collect additional complementary scores such as per class precision, recall and f1 score or full confusion matrices.

Then make the search result interface compute the grouped statistics and rank models by mean validation errors by grouping on the parameters_id fields.

Owner

jnothman commented Jun 20, 2013

That structure makes a lot of sense in terms of asynchronous parallelisation... I'm still not entirely convinced it's worthwhile having each fold available to the user as a separate record (which is providing the output of map, before reduce). I also don't think train and test fold size necessarily need to be in there if we are using the same folds for every candidate.

I guess what you're trying to say is that this is the nature of our raw data: a series of fold records. And maybe we need to make a distinction between:

  • the fold records produced by the search
  • the default in-memory storage
  • the default API

My suggestion of structured arrays was intended to provide compact in-memory storage with easy, flexible and efficient access, but still required per-fold intermediate records.

Let's say that we could pass some kind of results_manager to the CV-search constructor. Let's say it's a class that accepts a cv generator (or listified form) so that it knows the number and sizes of of folds, and that the constructed object is stored on the CV-search estimator as results_. Let's say it has to have an add method and a get_best method. I can think of three primary implementations:

  • no storage: get_best performs a find-max over average scores (and results_ provides no data).
  • in-memory storage: don't care what the underlying storage is as long as it can be pickled and produces an interface like my SearchResult object.
  • off-site storage: dump data to file / kv-store / RDBMS and perform find-max at the same time and/or provide a complete API.

Each of these needs to:

  • group data from the same candidate for multiple folds, if add is called per-fold.
  • know how to calculate the best score, including (iid -> fold-weighted) average and greater_is_better logic.

I don't really think that first point should be necessary. If we have an asynchronous processing queue, we will still expect folds for each candidate to be evaluated roughly at the same time, so grouping can happen more efficiently by handling it straight off the queue (storing all the fold results temporarily in memory) rather than in each results_manager implementation. (Perhaps you wouldn't want to store all the folds in memory for LeavePOut, but I don't think that's going to be used for a large dataset / candidate space.)

Owner

jnothman commented Jun 20, 2013

In short: I can't think of a use-case where a user wants per-fold data to be in a list. In an iterable coming off a queue, yes. In a relational DB, perhaps. (Grouped by candidate, certainly.)

Owner

ogrisel commented Jun 20, 2013

That structure makes a lot of sense in terms of asynchronous parallelisation... I'm still not entirely convinced it's worthwhile having each fold available to the user as a separate record (which is providing the output of map, before reduce). I also don't think train and test fold size necessarily need to be in there if we are using the same folds for every candidate.

It is for fail over if some parameters set will generate ill conditioned optimization problems that are not numerically stable across all CV folds. That can happen with SGDClassifier and GBRT models apparently.

Dealing with missing evaluations is very useful, even with the lack of async parallelization.

If we have an asynchronous processing queue, we will still expect folds for each candidate to be evaluated roughly at the same time

This statement is false if we would like to implement the "warm start with growing number of CV folds" use case.

In short: I can't think of a use-case where a user wants per-fold data to be in a list. In an iterable coming off a queue, yes. In a relational DB, perhaps. (Grouped by candidate, certainly.)

Implementing fault tolerant grid search is one, iteratively growable CV folds is another (warm restarts with a higher number of CV iterations).

I wasted a couple of grid search run (lasting 10min each times) precisely because of those 2 missing use cases yesterday. So they are not made up use cases: as a regular user of the lib I really feel the need for those.

Also implementing learning curves with a variable train_fold_size will also be a usecase where the append-only list of atomic evaluation dicts will be easier.

In short: the dumb fold log records datastructure is so much more simple and flexible to allow the implementation of any additional use cases in the future (e.g. learning curves and warm restarts in any dimension) that I think it should be the base underlying datastructure we collect internally even if we expect the user to rarely have the need to access it directly but rather through the results_ object.

For instance we could have:

  • results_log_ : the append only list of dict datastructure to store the raw evaluations
  • results_summary_ : an object that provides user friendly ways to query the results. This object class could take the raw log as constructor parameter and compute its own aggregates (like iid mean scores for ranking).

The results log can be kept if we implement warm restarts. The results_summary_ will have to be reseted and recomputed from the updated log.

The enduser API can still be made simple by providing a results object that can do the aggregation and even output the structured array datastructure you propose if it prove really useful from an enduser API standpoint.

Owner

ogrisel commented Jun 20, 2013

Also I don't think memory efficiency will never be an issue: even with millions of evaluations the overhead of python dicts and python object reference is pretty manageable in 2013 :)

Owner

jnothman commented Jun 20, 2013

Also I don't think memory efficiency will never be an issue: even with millions of evaluations the overhead of python dicts and python object reference is pretty manageable in 2013 :)

Assuming you're not collecting other data, but in that case you're right, the dict overhead will make little difference, and I'm going on about nothing. For fault tolerance there's still sense in storing some data on-disk, though.

I'll think about how best to transform this PR into something like that.

Owner

jnothman commented Jun 20, 2013

So from master, the things that IMO should happen are:

  • the fit_grid_point function should return a dict that will be used directly as a results_log_ entry, which means it needs to be passed the candidate id and fold id where it is not currently.
  • this implementation should also replace the parallelised function in cross_val_score, forming a CVScorer class to handle the shared parallelisation logic. These first two points form a PR on their own.
  • following on from that, PRs to store the log and an API for results access.
Owner

jnothman commented Jun 20, 2013

And again, I should point out that one difficulty with dicts is that our names for fields in them cannot have deprecation warnings, so it's a bit dangerous making them a public API...

Owner

ogrisel commented Jun 20, 2013

And again, I should point out that one difficulty with dicts is that our names for fields in them cannot have deprecation warnings, so it's a bit dangerous making them a public API...

That's a valid point I had not thought of.

Owner

jnothman commented Jun 20, 2013

So we could make them custom objects, but they're less portable. I can't yet think of a nice solution there, except to make the results_log_ an unstable advanced feature...

(And not being concerned by the memory consumption of dicts, your comment on the memory efficiency of namedtuples in the context of _CVScoreTuple is a bit superfluous!)

Owner

ogrisel commented Jun 20, 2013

So from master, the things that IMO should happen are:

  • a the fit_grid_point function should return a dict that will be used directly as a results_log_ entry, which means it needs to be passed the candidate id and fold id where it is not currently.
  • b this implementation should also replace the parallelised function in cross_val_score, forming a CVScorer class to handle the shared parallelisation logic. These first two points form a PR on their own.
  • c following on from that, PRs to store the log and an API for results access.

Sounds good. Also +1 for using candidate_id instead of parameters_id.

I would like to have other people opinions on our discussion though. Apparently people are pretty busy at the moment. Let see:

Ping @larsmans @mblondel @amueller @pprett @glouppe @arjoly @vene

I know @GaelVaroquaux is currently traveling at conferences. We might have a look at this during the SciPy sprint next week with him and @jakevdp.

Owner

ogrisel commented Jun 20, 2013

(And not being concerned by the memory consumption of dicts, your comment on the memory efficiency of namedtuples in the context of _CVScoreTuple is a bit superfluous!)

Indeed, it's just that I added a __slots__ = () to the existing CVScoreTuple tuple to make it more idiomatic and then people started to ask why. Hence I added the comment.

Owner

jnothman commented Jun 20, 2013

I would like to have other people opinions on our discussion though.

I think the discussion is a bit hard to navigate and it would be more sensible to present a cut back PR: #2079. I'll close this one as it seems we're unlikely to go with its solution.

@jnothman jnothman closed this Jun 20, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment