In many situations, you don't have a separate test set, so you would like to use CV for both evaluation and hyper-parameter tuning. Therefore, you need to do nested cross-validation:
for train, test in cv1:
    # Find the best hyper-parameters for this split
    for train, val in cv2:
        [...]
    # Retrain using the best hyper-parameters
    [...]

# Return best scores for each split
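For concreteness, here is a minimal runnable sketch of that loop with the current, data-tied iterators. It uses the pre-0.18 `sklearn.cross_validation` / `sklearn.grid_search` module paths, and `GridSearchCV` only as a stand-in for "find the best hyper-parameters"; the estimator, parameter grid, and data are purely illustrative:

```python
import numpy as np
from sklearn.cross_validation import KFold   # pre-0.18 module path
from sklearn.grid_search import GridSearchCV  # pre-0.18 module path
from sklearn.svm import SVC

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
param_grid = {"C": [0.1, 1.0, 10.0]}

outer_scores = []
for train, test in KFold(len(y), n_folds=5):
    # The inner CV has to be rebuilt for every outer split because the
    # iterator is tied to the size of the training subset.
    inner_cv = KFold(len(train), n_folds=3)
    search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
    search.fit(X[train], y[train])
    # GridSearchCV refits on the full outer training split with the best
    # hyper-parameters (refit=True by default); score the held-out fold.
    outer_scores.append(search.score(X[test], y[test]))
```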
This is very difficult to implement in a generic way with our current API because CV iterators are tied to particular data. For example, when doing `cv = KFold(n_samples)`, `cv` will only work with a dataset of the specified size.
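A small snippet to make that coupling concrete (pre-0.18 module path, arbitrary sizes):

```python
from sklearn.cross_validation import KFold  # pre-0.18 module path

cv = KFold(100, n_folds=5)  # fold indices are fixed for exactly 100 samples
for train, test in cv:
    # len(train) == 80 here, but `cv` still yields indices in range(100),
    # so it cannot be reused to split the 80-sample training subset; a
    # fresh KFold(80, n_folds=...) must be built for every outer split.
    pass
```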
Ideally, we would need something closer to the estimator API: use constructor parameters for data-independent options (n_folds, shuffle, random_state, train / test proportion, etc.) and a `run` method that takes `y` as an argument (the reason to take `y` is to support stratified schemes). This would look something like this:
# deprecated usage
for train, test in KFold(n, n_folds):
    print(train, test)

# new usage
for train, test in KFold(n_folds).run(y):
    print(train, test)
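Under that proposal, the nested loop from the top of this issue could be written generically. The following is only a sketch: `KFold(n_folds).run(y)` is the hypothetical API proposed here and does not exist in scikit-learn, and the estimator and candidate parameters are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
estimator = SVC()
candidate_params = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]

# `KFold` here is the proposed data-independent class, not the current one.
outer = KFold(n_folds=5)   # data-independent construction
inner = KFold(n_folds=3)   # reusable for any training subset

for train, test in outer.run(y):
    best_score, best_params = -np.inf, None
    for params in candidate_params:
        scores = []
        # The same `inner` object works for every outer split because the
        # fold indices are generated from the y passed to run().
        for tr, val in inner.run(y[train]):
            model = clone(estimator).set_params(**params)
            model.fit(X[train][tr], y[train][tr])
            scores.append(model.score(X[train][val], y[train][val]))
        if np.mean(scores) > best_score:
            best_score, best_params = np.mean(scores), params
    # Retrain on the full outer training split with the best hyper-parameters.
    final = clone(estimator).set_params(**best_params).fit(X[train], y[train])
    print(final.score(X[test], y[test]))
```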