
data-independent CV iterators #2904


Description

@mblondel

In many situations you don't have a separate test set, so you would like to use cross-validation both for evaluation and for hyper-parameter tuning. This requires nested cross-validation:

for train, test in cv1:
    # Find the best hyper-parameters for this split
    for inner_train, val in cv2:
        [...]
    # Retrain using the best hyper-parameters
    [...]
# Return the best score for each split

This is very difficult to implement generically with our current API because CV iterators are tied to a particular dataset. For example, once you build cv = KFold(n_samples), cv only works with a dataset of that exact size, so the inner CV cannot be constructed up front: it has to be rebuilt for every outer training split.
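For concreteness, here is a minimal sketch of the problem with the current sklearn.cross_validation.KFold (keyword names from memory, so they may differ slightly). Because the inner iterator's indices depend on the size of each outer training split, it must be reconstructed inside the loop, so generic nested-CV code would have to accept a CV factory rather than a CV object:

import numpy as np
from sklearn.cross_validation import KFold  # current, data-tied API

X = np.random.randn(100, 5)
y = np.random.randn(100)

outer_cv = KFold(len(y), n_folds=5)          # indices are fixed to n=100
for train, test in outer_cv:
    # The inner CV must know len(train), which differs from len(y),
    # so a single pre-built iterator cannot be reused here.
    inner_cv = KFold(len(train), n_folds=3)
    for inner_train, val in inner_cv:
        # inner_train and val index into `train`, not into the full data
        X_tr, y_tr = X[train][inner_train], y[train][inner_train]
        X_val, y_val = X[train][val], y[train][val]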

Ideally, we would need something closer to the estimator API: constructor parameters for the data-independent options (n_folds, shuffle, random_state, train/test proportion, etc.) and a run method that takes y as argument (y is needed for stratified schemes). It would look something like this:

# deprecated usage
for train, test in KFold(n, n_folds):
    print(train, test)

# new usage
for train, test in KFold(n_folds).run(y):
    print(train, test)
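To show what this buys us, here is a rough sketch of a generic nested-CV helper written against the proposed run(y) API. The helper name, the scoring, and the way the best parameters are picked are illustrative only, not part of the proposal:

from sklearn.base import clone

def nested_cv_scores(estimator, param_grid, X, y, outer_cv, inner_cv):
    """Return one test score per outer split, tuning parameters on the inner CV."""
    scores = []
    for train, test in outer_cv.run(y):              # proposed data-independent API
        # Find the best hyper-parameters on this outer training split
        best_params, best_score = None, -float("inf")
        for params in param_grid:                    # param_grid: list of dicts
            inner_scores = []
            for inner_train, val in inner_cv.run(y[train]):
                est = clone(estimator).set_params(**params)
                est.fit(X[train][inner_train], y[train][inner_train])
                inner_scores.append(est.score(X[train][val], y[train][val]))
            mean_score = sum(inner_scores) / len(inner_scores)
            if mean_score > best_score:
                best_params, best_score = params, mean_score
        # Retrain on the full outer training split with the best parameters
        best_est = clone(estimator).set_params(**best_params)
        best_est.fit(X[train], y[train])
        scores.append(best_est.score(X[test], y[test]))
    return scores

Because outer_cv and inner_cv carry no dataset size, they can be passed in as plain objects and reused on splits of any size, which is exactly what the current API cannot do.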
