Restructure the output attributes of *SearchCV #1768

Closed · wants to merge 13 commits · 3 participants
Owner

jnothman commented Mar 13, 2013

  • BaseSearchCV now stores the attributes best_index_, grid_results_ and fold_results_, from which best_params_, best_score_ and grid_scores_ are derived, as well as best_estimator_.
  • grid_results_ and fold_results_ are structured arrays, allowing for flexible further computation (see the sketch after this description).
  • The contents of these are extensible. Currently they may store training scores and times as well as test scores, number of test samples, parameters for each grid point, etc.
  • Multiple scores may be output, such as precision and recall as well as the F1 objective.
  • Some refactoring and additional testing included.

An alternative to #1742
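A rough sketch of the structured-array access pattern this enables (the field names, such as test_score and train_time, and the values are illustrative assumptions, not the patch's exact dtype):

```python
import numpy as np

# Hypothetical layout of grid_results_: one record per parameter
# combination, with fields accessible by name. The dtype here is an
# assumption for illustration only.
grid_results = np.array(
    [({'C': 1.0}, 0.81, 0.95, 0.02),
     ({'C': 10.0}, 0.84, 0.99, 0.03)],
    dtype=[('parameters', object),
           ('test_score', float),
           ('train_score', float),
           ('train_time', float)])

# best_index_, best_params_ and best_score_ fall out of vectorized ops:
best_index = np.argmax(grid_results['test_score'])
print(grid_results['parameters'][best_index],
      grid_results['test_score'][best_index])
```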

amueller and others added some commits Mar 11, 2013

ENH add training score to GridSearchCV.cv_scores_
add docstring for GridSearchCV, RandomizedSearchCV and fit_grid_point. In "fit_grid_point" I used test_score rather than validation_score, as the split is given to the function.
rbf svm grid search example now also shows training scores, which illustrates overfitting for high C, and training/prediction times... which basically serve to illustrate that this is possible. Maybe random forests would be better to evaluate training times?
ENH Enhanced results from cross-validation via Scorer.store
Currently, no tests have been added, and backwards compatibility is eschewed
Merge branch 'cv-enhanced-results' into grid_search_more_info
This includes some refactoring and creation of new Scorer classes to
wrap legacy scorer forms.

Conflicts:
	sklearn/grid_search.py
	sklearn/tests/test_grid_search.py
ENH Reimplement best_params_ and best_score_ as properties
Thus the attributes stored by BaseSearchCV._fit() are no longer redundant.

Also: test for these attributes
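Roughly, the property pattern this commit describes (a minimal sketch assuming the structured dtype shown earlier; not the patch's actual code):

```python
class SearchCVSketch(object):
    """Minimal sketch: derived attributes as properties, not stored copies."""

    def __init__(self, grid_results, best_index):
        self.grid_results_ = grid_results
        self.best_index_ = best_index

    @property
    def best_params_(self):
        # Derived on demand from best_index_ and grid_results_,
        # so _fit() need not store it redundantly.
        return self.grid_results_['parameters'][self.best_index_]

    @property
    def best_score_(self):
        return self.grid_results_['test_score'][self.best_index_]
```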
@@ -170,8 +169,8 @@ def __iter__(self):
yield params
-def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
@jnothman

jnothman Mar 13, 2013

Owner

Should this be called fit_fold for consistency?

@amueller

amueller Mar 17, 2013

Owner

Yes, with deprecation of the current function name.
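The agreed rename-with-deprecation might look like this (a sketch; the real signature and body live in sklearn/grid_search.py and are elided here):

```python
import warnings


def fit_fold(X, y, base_clf, clf_params, train, test, scorer, verbose=0):
    # New name; body elided in this sketch.
    pass


def fit_grid_point(*args, **kwargs):
    # Old name kept as a thin alias that warns before delegating.
    warnings.warn("fit_grid_point is deprecated; use fit_fold instead.",
                  DeprecationWarning)
    return fit_fold(*args, **kwargs)
```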

start_time))
print("[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg))
- return this_score, clf_params, _num_samples(X_test)
+ return clf_params, results
@jnothman

jnothman Mar 13, 2013

Owner

clf_params is not used in BaseSearchCV. Should we just return results?

sklearn/metrics/scorer.py
+ def __init__(self, greater_is_better=True):
+ self.greater_is_better = greater_is_better
+
+ def store(self, result, estimator, X, y=None, prefix=''):
@jnothman

jnothman Mar 13, 2013

Owner

Alternatively, this could be implemented as get_named_scores, returning (name, value) tuples (one name of which must be 'score'). The prefixing and storage would then be left to downstream processes such as grid_search.fit_grid_point.
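That alternative could look something like this (a hypothetical sketch of get_named_scores; names and signatures are assumptions for illustration, not sklearn API):

```python
from sklearn.metrics import f1_score, precision_score, recall_score


class NamedScorerSketch(object):
    def get_named_scores(self, estimator, X, y):
        # One entry must be named 'score': the search objective.
        y_pred = estimator.predict(X)
        return [('score', f1_score(y, y_pred)),
                ('precision', precision_score(y, y_pred)),
                ('recall', recall_score(y, y_pred))]


def store_scores(result, scorer, estimator, X, y, prefix=''):
    # Prefixing and storage are left to the caller, e.g. fit_grid_point.
    for name, value in scorer.get_named_scores(estimator, X, y):
        result[prefix + name] = value
```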

- Contains scores for all parameter combinations in param_grid.
- Each entry corresponds to one parameter setting.
- Each named tuple has the attributes:
+ `grid_results_` : structured array of shape [# param combinations]
@jnothman

jnothman Mar 13, 2013

Owner

Can we come up with a better name, particularly when a grid is not being used (as in RandomizedSearchCV)?

@jnothman

jnothman Mar 13, 2013

Owner

Maybe even search_results_ vs fold_results_

Owner

amueller commented Mar 13, 2013

Thanks a lot for your work. I hope I can incorporate your changes soon.

ENH/FIX/TST Use Scorer.calc_scores instead of store
This replaces Scorer.store()

Also: tests for new Scorer functionality and descendants, and fixes broken WrapScorer
+
+We can observe that the lower right half of the parameters (below the diagonal
+with high C and gamma values) is characteristic of parameters that yields an
+overfitting model: the trainin score is very high but there is a wide gap. The
@jaquesgrobler

jaquesgrobler Mar 14, 2013

Owner

training score?

@jnothman

jnothman Mar 14, 2013

Owner

Hmm. This is inherited from my merge with d884180 of #1742. Do I fix it, or do I remove the file from the changeset?

Owner

amueller commented Mar 17, 2013

I think this PR does too many things / unrelated things at once.
The restructuring of the ParameterSearch results into grid and fold results is independent of the restructuring of the scorer. Both are quite major changes.

Current master basically does a rename of the grid_scores_ attribute of 0.13.1. You changed the structure. Are there major benefits? It does seem a bit more intuitive but changing the structure of objects is annoying to users, so we shouldn't do it lightly.

On the other hand, one could argue that your changes are not going far enough. If you do have a grid of parameters, the reshaping is still annoying to do by hand, right?

I'm all for incremental improvements, but not if they change the API ;)
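For concreteness, the by-hand reshaping referred to here might be (illustrative values; this assumes the flat results follow the grid's iteration order, with gamma varying fastest):

```python
import numpy as np

C_range = [1, 10, 100]
gamma_range = [0.01, 0.1]

# One (made-up) mean test score per (C, gamma) combination, flat:
flat_scores = np.array([0.70, 0.72, 0.80, 0.83, 0.78, 0.75])

# Fold back into a grid, e.g. for a heat map of scores:
score_grid = flat_scores.reshape(len(C_range), len(gamma_range))
print(score_grid)
```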

Owner

amueller commented Mar 17, 2013

Btw, I don't see how this is an alternative to #1742.
There, I add some additional info to the current parameter search results. This PR here changes the structure of the results and adds multiple Scorer classes.

Owner

jnothman commented Mar 18, 2013

All true, @amueller. It is not strictly an alternative to #1742 (which also grew much broader than its title suggests); rather, it incorporates an alternative approach to returning additional statistics from the parameter search. The implementation at #1742 (d884180) actually breaks backwards compatibility: where grid_scores_ returned a list of 3-tuples, there it returns a list of 6-tuples.

This patch doesn't change the structure of grid_scores_. More precisely, it proposes an output that is strictly more expressive, extensible and immediately powerful than the previous representation. So grid_scores_ etc. can easily be calculated as before (in a backwards-compatible way, unlike in #1742's patch), but: scores can also be accessed by name; additional data can be added; data is stored compactly in memory as numpy arrays; and these can be operated on to quickly find, for example, the score (or training time) variance across all parameters considered (for a particular fold, or over means); see the sketch below.
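For instance (a sketch assuming fold_results_ is a structured array of shape [n_param_combinations, n_folds], with illustrative field names and values):

```python
import numpy as np

fold_results = np.array(
    [[(0.80, 0.12), (0.82, 0.11), (0.78, 0.13)],
     [(0.85, 0.30), (0.83, 0.29), (0.86, 0.31)]],
    dtype=[('test_score', float), ('train_time', float)])

# Score variance across parameter settings, for each fold:
print(fold_results['test_score'].var(axis=0))

# Variance of the per-setting mean training time:
print(fold_results['train_time'].mean(axis=1).var())
```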

I am not sure why you are concerned that the output cannot be indexed directly by parameter values. I don't think I ever suggested reshaping the output to be indexed by parameter values: this might indeed be a nice extension (though as far as I can tell would still not be trivial to use), but is certainly separate. Moreover this critique -- and the potential extension -- would apply equally to this patch and to the existing code.

Finally, I included multiple scores in this PR because:

  • it is what motivated me to make this change: it requires an extensible design (rather than one where all field names/types are preset);
  • it depends on this design (or on a hack using __float__ or __{add,iadd,mul}__ as in our ML discussions);
  • it continues to exemplify the power and utility of this design.

If you insist, it can be removed, but then it can only be proposed after the restructure patch is accepted. As this was my first content PR to scikit-learn, would you advise me on the appropriate action? (If I am to split it, do I merely edit this branch, or do I close this PR and reopen?)

Owner

jnothman commented Mar 18, 2013

Okay. I've split it up. Start with #1787

@jnothman jnothman closed this Mar 18, 2013

Owner

amueller commented Mar 18, 2013

Thanks for responding to my comments. I think I agree with you. Thanks for splitting up the PR. I think this is the way to go.
I would really love to look at your contribution in more detail but I am constantly on the road at the moment (London, Berlin, New York ^^). I'll try to make some time to review #1787.
