[MRG] Added option to return raw predictions from cross_validate #18390
base: main
Conversation
Hi @okz12, thanks for updating! If you think that this PR is ready for review, feel free to change WIP to MRG in the title: this will bring more attention to it.
Just some nitpicks on first glance. It is annoying that the predictions are calculated twice but fixing this would be complex...
```
Return cross-validation predictions for the dataset. 'predict' returns
the predictions whereas 'predict_proba' returns class probabilities.
```
Maybe add what happens when the value is None, and add references to terms:
```diff
-Return cross-validation predictions for the dataset. 'predict' returns
-the predictions whereas 'predict_proba' returns class probabilities.
+What predictions to return. If 'predict', return output of :term:`predict` and if
+'predict_proba', return output of :term:`predict_proba`. The test set indices
+are also returned under a separate dict key.
+If None, do not return predictions or test data indices.
```
I've included the suggested change (minus the reference to test indices, for now).

I'm not entirely clear on when `:term:` is needed; it seems to be for string arguments. It would help if the example here showed and discussed using `:term:`: https://scikit-learn.org/dev/developers/contributing.html#documentation
Should I file a new issue, if there isn't one already?
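(For what it's worth, `:term:` is Sphinx markup that cross-references entries in scikit-learn's Glossary of Common Terms, so it applies to any glossary word, methods like `predict` included, not only string arguments. A hypothetical numpydoc-style parameter description using it, as a sketch only:

```rst
return_predictions : {'predict', 'predict_proba'} or None, default=None
    What predictions to return. If 'predict', return the output of
    :term:`predict`; if 'predict_proba', return the output of
    :term:`predict_proba`. If None, do not return predictions.
```
)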
```
``predictions``
    Cross-validation predictions for the dataset.
    This is available only if ``return_predictions`` parameter
    is set to ``predict`` or ``predict_proba``.
```
Maybe note the key 'test_indices' too?
`'test_indices'` is being passed from `_fit_and_score` to `cross_validate`, but not being returned by `cross_validate`. Do you think it is worth returning `'test_indices'` from `cross_validate`?
On second thought, I do think this could be useful to the user.
It also raises a follow-on question about the return array. Currently the predictions are flattened out to a 1D array for 'predict' (ignoring the extra dimension of 'predict_proba' for now). The sample predictions are reordered back into their dataset positions, from (n_splits, validation_samples_per_split) to (n_train_samples,), which loses the information on which sample was used in which split.

Would it be better to return predictions and indices per split, with shape (n_splits, validation_samples_per_split)?
@okz12 How about a "split_index"/"fold"/... key in the output with the fold index per sample. This would make it somewhat easy to subset the predictions by split. Having the predictions in the right order by default seems preferable?
Could be done by enumerating `cv.split()` in `cross_validate`, passing the index to `_fit_and_score` (perhaps via `split_progress`) and adding it to the results dict, then stacking and ordering by test indices as you do with the predictions. Concretely (see the sketch after this list):

- Change the loop to `for split_idx, (train, test) in enumerate(cv.split(X, y, groups))`.
- Add `split_progress=(split_idx, cv.n_splits)` to the `_fit_and_score` call.
- Add `result["split_indices"] = [split_progress[0]] * _num_samples(X_test)` in `_fit_and_score` under `if return_predictions:`.
- Add `split_indices = np.hstack(results["split_indices"])` in `cross_validate` (below the same call for `test_indices`).
- Add it to the output with `ret['split_indices'] = split_indices[test_indices]`.

Using `split_progress` also triggers the logging messages, though.
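A minimal, self-contained sketch of that bookkeeping; it fakes the model call and uses a plain `KFold` loop in place of the private `_fit_and_score` machinery, and the inverse-permutation reorder at the end anticipates the indexing discussion further down:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

preds, test_idx, fold_idx = [], [], []
for split_idx, (train, test) in enumerate(cv.split(X, y)):
    preds.append(y[test])                           # stand-in for real predictions
    test_idx.append(test)
    fold_idx.append(np.full(len(test), split_idx))  # one fold index per test sample

predictions = np.hstack(preds)
test_indices = np.hstack(test_idx)
split_indices = np.hstack(fold_idx)

# Reorder everything back to dataset order via the inverse permutation.
inv = np.empty(len(test_indices), dtype=int)
inv[test_indices] = np.arange(len(test_indices))
print(predictions[inv])    # [0 1 ... 9]: dataset order restored
print(split_indices[inv])  # the fold each sample was tested in
```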
@okz12 Is this still being worked on? It would be a good feature to finally have! :-)
```python
else:
    predictions = np.vstack(results["predictions"])
    test_indices = np.hstack(results["test_indices"])
    ret['predictions'] = predictions[test_indices]
```
Should this not be indexed by the argsort of `test_indices`? If the predictions and `test_indices` already have the same order, the indices necessary to sort `test_indices` would also sort the predictions.
`cross_val_predict()` uses:

```python
inv_test_indices = np.empty(len(test_indices), dtype=int)
inv_test_indices[test_indices] = np.arange(len(test_indices))
```

which seems to be the same as `test_indices.argsort()` but probably faster.
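A quick NumPy check of that equivalence, assuming `test_indices` is a permutation of `0..n-1` (i.e. each sample lands in exactly one test fold):

```python
import numpy as np

test_indices = np.random.default_rng(0).permutation(8)

inv_test_indices = np.empty(len(test_indices), dtype=int)
inv_test_indices[test_indices] = np.arange(len(test_indices))

# For a permutation, the O(n) scatter above matches the O(n log n) argsort.
assert np.array_equal(inv_test_indices, np.argsort(test_indices))
```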
Fix for #13478
Created as a new PR due to a merge conflict in #15907.
This allows the user to extract predictions or prediction probabilities via the 'return_predictions' argument to cross_validate. The argument takes 'predict' or 'predict_proba' as a string to select which predictions to return.
The approach has been to have `_fit_and_score` run inference on the cross-validation split and return it with the indices. Once all the indices and predictions have been returned to the `cross_validate` function, they are ordered according to the dataset and returned.

@cmarmo could you take a look please?
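For illustration, the usage this PR was aiming at; a sketch only, since `return_predictions` was never merged into scikit-learn, and the result key names follow this branch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# `return_predictions` is the parameter proposed in this PR, not released API.
ret = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                     return_predictions='predict_proba')
proba = ret['predictions']  # shape (n_samples, n_classes), in dataset order
```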