[MRG] Added option to return raw predictions from cross_validate #18390

Open
wants to merge 2 commits into main

Conversation

Contributor

@okz12 okz12 commented Sep 13, 2020

Fix for #13478.
Created as a new PR to replace #15907 due to a merge conflict.

This allows the user to extract predictions or prediction probabilities from cross_validate via the new 'return_predictions' argument. The argument takes 'predict' or 'predict_proba' as a string and returns the corresponding predictions.

The approach is to have _fit_and_score run inference on each cross-validation split and return the predictions together with the test indices. Once all the indices and predictions have been collected in cross_validate, they are reordered to match the dataset and returned.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

n_samples = 25
n_classes = 2
clf = LogisticRegression(random_state=0)
X, y = make_classification(n_samples=n_samples, n_classes=n_classes)

ret = cross_validate(clf, X, y, return_train_score=False,
                     return_estimator=True, return_predictions='predict')

ret['predictions']
array([0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1])
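
For the 'predict_proba' option the call is the same; this is a hypothetical continuation of the snippet above (reusing clf, X and y), and the expected output shape is an assumption based on how the per-split probability arrays are stacked:

ret_proba = cross_validate(clf, X, y, return_estimator=True,
                           return_predictions='predict_proba')
# Expected: class probabilities in dataset order,
# shape (n_samples, n_classes), here (25, 2).
ret_proba['predictions'].shape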

@cmarmo could you take a look please?

@cmarmo
Member

cmarmo commented Sep 16, 2020

Hi @okz12, thanks for updating! If you think that this PR is ready for review, feel free to change WIP to MRG in the title: this will bring more attention to it.
Perhaps @lucyleeow might want to have a look at this one? Thanks!

@okz12 okz12 changed the title [WIP] Added option to return raw predictions from cross_validate [MRG] Added option to return raw predictions from cross_validate Sep 16, 2020
Member

@lucyleeow lucyleeow left a comment


Just some nitpicks on first glance. It is annoying that the predictions are calculated twice, but fixing this would be complex...

Comment on lines 159 to 160
Return cross-validation predictions for the dataset. 'predict' returns
the predictions whereas 'predict_proba' returns class probabilities.
Member


Maybe add what happens when None and add references to terms:

Suggested change
Return cross-validation predictions for the dataset. 'predict' returns
the predictions whereas 'predict_proba' returns class probabilities.
What predictions to return. If 'predict', return output of :term:`predict` and if
'predict_proba', return output of :term:`predict_proba`. The test set indices
are also returned under a separate dict key.
If None, do not return predictions or test data indices.

Contributor Author


I've included the suggested change (minus the reference to test indices, for now).

I'm not entirely clear on when :term: is needed; it seems like it's for string arguments. It would help if the documentation example at https://scikit-learn.org/dev/developers/contributing.html#documentation showed and discussed using :term:.

Should I file a new issue if there isn't one already?

``predictions``
Cross-validation predictions for the dataset.
This is available only if ``return_predictions`` parameter
is set to ``predict`` or ``predict_proba``.

Member


Maybe note the key 'test_indices' too?

Contributor Author


'test_indices' is being passed from _fit_and_score to cross_validate but not being returned by cross_validate. Do you think it is worth returning 'test_indices' from cross_validate?

Contributor Author


On second thought, I do think this could be useful to the user.

It also raises a follow-on question about the return array. Currently the predictions are flattened into a 1D array for 'predict' (ignoring 'predict_proba''s extra dimension for now). The per-sample predictions are reordered from (n_splits, validation_samples_per_split) back into dataset order, i.e. (n_train_samples,), which loses the information about which sample was used in which split.

Would it be better to return the predictions and indices per split, i.e. with shape (n_splits, validation_samples_per_split)?
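
For illustration, a standalone toy example (not code from this PR, and with a made-up fold assignment) contrasting the two layouts:

import numpy as np

# 6 samples, 3 folds with 2 test samples each (hypothetical split).
test_indices_per_split = [np.array([4, 5]), np.array([0, 3]), np.array([1, 2])]
preds_per_split = [np.array([1, 0]), np.array([0, 1]), np.array([1, 1])]

# Per-split layout, shape (n_splits, validation_samples_per_split):
# it is still clear which fold produced each prediction.
per_split = np.vstack(preds_per_split)

# Flattened layout, shape (n_samples,) in dataset order:
# the fold assignment can no longer be recovered from the array alone.
flat = np.empty(6, dtype=int)
flat[np.concatenate(test_indices_per_split)] = np.concatenate(preds_per_split)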



@okz12 How about a "split_index"/"fold"/... key in the output with the fold index per sample? This would make it somewhat easy to subset the predictions by split, while keeping the predictions in dataset order by default, which seems preferable.

This could be done by enumerating cv.split() in cross_validate, passing the index to _fit_and_score (perhaps via split_progress) and adding it to the results dict, then stacking and ordering by test indices as you do with the predictions:

  1. Change loop: for split_idx, (train, test) in enumerate(cv.split(X, y, groups))
  2. Add split_progress=(split_idx, cv.n_splits) to _fit_and_score call.
  3. Add result["split_indices"] = [split_progress[0]] * _num_samples(X_test) in _fit_and_score under if return_predictions:.
  4. Add split_indices = np.hstack(results["split_indices"]) in cross_validate (below the same call for test_indices).
  5. Add to output with ret['split_indices'] = split_indices[test_indices]

Using split_progress would also trigger the progress logging messages, though.
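
A standalone sketch of this bookkeeping (a hypothetical helper, not the actual _fit_and_score/cross_validate internals; the dataset-order reindexing uses the inverse permutation of the test indices):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def predictions_with_split_indices(estimator, X, y, cv=None):
    """Return per-sample predictions and fold indices in dataset order."""
    cv = cv if cv is not None else KFold(n_splits=5)
    preds, test_idx, split_idx = [], [], []
    for fold, (train, test) in enumerate(cv.split(X, y)):
        est = clone(estimator).fit(X[train], y[train])
        preds.append(est.predict(X[test]))
        test_idx.append(test)
        # One fold label per test sample (step 3 above).
        split_idx.append(np.full(len(test), fold))
    predictions = np.concatenate(preds)
    test_indices = np.concatenate(test_idx)
    split_indices = np.concatenate(split_idx)
    # Reorder into dataset order (steps 4 and 5 above).
    inv = np.empty(len(test_indices), dtype=int)
    inv[test_indices] = np.arange(len(test_indices))
    return {"predictions": predictions[inv], "split_indices": split_indices[inv]}

Subsetting by split is then just out['predictions'][out['split_indices'] == k] for out = predictions_with_split_indices(clf, X, y).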

sklearn/model_selection/_validation.py (outdated review thread, resolved)
@glemaitre glemaitre added this to TO REVIEW in Guillaume's pet Oct 5, 2020
Base automatically changed from master to main January 22, 2021 10:53
@LudvigOlsen

@okz12 Is this still being worked on? It would be a good feature to finally have! :-)

else:
    predictions = np.vstack(results["predictions"])
test_indices = np.hstack(results["test_indices"])
ret['predictions'] = predictions[test_indices]


Should this not be indexed by the argsort of test_indices?
If the predictions and test_indices already have the same order, the indices necessary to sort test_indices would also sort the predictions.



cross_val_predict() uses:

    inv_test_indices = np.empty(len(test_indices), dtype=int)
    inv_test_indices[test_indices] = np.arange(len(test_indices))

which seems to be the same as test_indices.argsort() but probably faster.
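
A quick standalone check of both points, that this inverse permutation equals test_indices.argsort() and that reindexing by it restores dataset order:

import numpy as np

test_indices = np.array([3, 0, 4, 1, 2])     # concatenated test folds
predictions = np.array([30, 0, 40, 10, 20])  # prediction for sample test_indices[i]

inv_test_indices = np.empty(len(test_indices), dtype=int)
inv_test_indices[test_indices] = np.arange(len(test_indices))

assert np.array_equal(inv_test_indices, np.argsort(test_indices))
# Dataset order: sample i ends up with prediction 10 * i.
assert np.array_equal(predictions[inv_test_indices], [0, 10, 20, 30, 40])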
