[MRG] Added option to return raw predictions from cross_validate #18390

Open
wants to merge 2 commits into main

Conversation

Contributor

@okz12 okz12 commented Sep 13, 2020

Fix for #13478.
Created as a new PR to replace #15907 due to a merge conflict.

This allows the user to extract predictions or prediction probabilities from cross_validate via the new 'return_predictions' argument. The argument takes 'predict' or 'predict_proba' as a string and returns the corresponding predictions.

The approach is to have _fit_and_score run inference on each cross-validation split and return the predictions together with the test indices. Once all the indices and predictions have been collected in cross_validate, they are reordered to match the dataset and returned.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

n_samples = 25
n_classes = 2
clf = LogisticRegression(random_state=0)
X, y = make_classification(n_samples=n_samples, n_classes=n_classes)

ret = cross_validate(clf, X, y, return_train_score=False,
                     return_estimator=True, return_predictions='predict')

ret['predictions']
array([0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1])
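
For the 'predict_proba' option the call is the same; this is a hypothetical continuation of the snippet above (reusing clf, X and y), and the expected output shape is an assumption based on how the per-split probability arrays are stacked:

ret_proba = cross_validate(clf, X, y, return_estimator=True,
                           return_predictions='predict_proba')
# Expected: class probabilities in dataset order,
# shape (n_samples, n_classes), here (25, 2).
ret_proba['predictions'].shape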

@cmarmo could you take a look please?

@cmarmo
Member

cmarmo commented Sep 16, 2020

Hi @okz12, thanks for updating! If you think that this PR is ready for review, feel free to change WIP to MRG in the title: this will bring more attention to it.
Perhaps @lucyleeow might want to have a look at this one? Thanks!

@okz12 okz12 changed the title [WIP] Added option to return raw predictions from cross_validate [MRG] Added option to return raw predictions from cross_validate Sep 16, 2020
Member

@lucyleeow lucyleeow left a comment


Just some nitpicks on first glance. It is annoying that the predictions are calculated twice, but fixing this would be complex...

Comment on lines 159 to 160
Return cross-validation predictions for the dataset. 'predict' returns
the predictions whereas 'predict_proba' returns class probabilities.
Member


Maybe add what happens when None and add references to terms:

Suggested change
Return cross-validation predictions for the dataset. 'predict' returns
the predictions whereas 'predict_proba' returns class probabilities.
What predictions to return. If 'predict', return output of :term:`predict` and if
'predict_proba', return output of :term:`predict_proba`. The test set indices
are also returned under a separate dict key.
If None, do not return predictions or test data indices.

Contributor Author


I've included the suggested change (minus the reference to test indices, for now).

I'm not entirely clear on when :term: is needed; it seems like it's for string arguments. It would help if the documentation example at https://scikit-learn.org/dev/developers/contributing.html#documentation showed and discussed using :term:.

Should I file a new issue if there isn't one already?

``predictions``
Cross-validation predictions for the dataset.
This is available only if ``return_predictions`` parameter
is set to ``predict`` or ``predict_proba``.

Member


Maybe note the key 'test_indices' too?

Contributor Author


'test_indices' is being passed from _fit_and_score to cross_validate but not being returned by cross_validate. Do you think it is worth returning 'test_indices' from cross_validate?

Contributor Author


On second thought, I do think this could be useful to the user.

It also raises a follow-on question about the return array. Currently the predictions are flattened into a 1D array for 'predict' (ignoring 'predict_proba''s extra dimension for now). The per-sample predictions are reordered from (n_splits, validation_samples_per_split) back into dataset order, i.e. (n_train_samples,), which loses the information about which sample was used in which split.

Would it be better to return the predictions and indices per split, i.e. with shape (n_splits, validation_samples_per_split)?
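
For illustration, a standalone toy example (not code from this PR, and with a made-up fold assignment) contrasting the two layouts:

import numpy as np

# 6 samples, 3 folds with 2 test samples each (hypothetical split).
test_indices_per_split = [np.array([4, 5]), np.array([0, 3]), np.array([1, 2])]
preds_per_split = [np.array([1, 0]), np.array([0, 1]), np.array([1, 1])]

# Per-split layout, shape (n_splits, validation_samples_per_split):
# it is still clear which fold produced each prediction.
per_split = np.vstack(preds_per_split)

# Flattened layout, shape (n_samples,) in dataset order:
# the fold assignment can no longer be recovered from the array alone.
flat = np.empty(6, dtype=int)
flat[np.concatenate(test_indices_per_split)] = np.concatenate(preds_per_split)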



@okz12 How about a "split_index"/"fold"/... key in the output with the fold index per sample? This would make it somewhat easy to subset the predictions by split, while keeping the predictions in dataset order by default, which seems preferable.

This could be done by enumerating cv.split() in cross_validate, passing the index to _fit_and_score (perhaps via split_progress) and adding it to the results dict, then stacking and ordering by test indices as you do with the predictions:

  1. Change loop: for split_idx, (train, test) in enumerate(cv.split(X, y, groups))
  2. Add split_progress=(split_idx, cv.n_splits) to _fit_and_score call.
  3. Add result["split_indices"] = [split_progress[0]] * _num_samples(X_test) in _fit_and_score under if return_predictions:.
  4. Add split_indices = np.hstack(results["split_indices"]) in cross_validate (below the same call for test_indices).
  5. Add to output with ret['split_indices'] = split_indices[test_indices]

Using split_progress would also trigger the progress logging messages, though.
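
A standalone sketch of this bookkeeping (a hypothetical helper, not the actual _fit_and_score/cross_validate internals; the dataset-order reindexing uses the inverse permutation of the test indices):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def predictions_with_split_indices(estimator, X, y, cv=None):
    """Return per-sample predictions and fold indices in dataset order."""
    cv = cv if cv is not None else KFold(n_splits=5)
    preds, test_idx, split_idx = [], [], []
    for fold, (train, test) in enumerate(cv.split(X, y)):
        est = clone(estimator).fit(X[train], y[train])
        preds.append(est.predict(X[test]))
        test_idx.append(test)
        # One fold label per test sample (step 3 above).
        split_idx.append(np.full(len(test), fold))
    predictions = np.concatenate(preds)
    test_indices = np.concatenate(test_idx)
    split_indices = np.concatenate(split_idx)
    # Reorder into dataset order (steps 4 and 5 above).
    inv = np.empty(len(test_indices), dtype=int)
    inv[test_indices] = np.arange(len(test_indices))
    return {"predictions": predictions[inv], "split_indices": split_indices[inv]}

Subsetting by split is then just out['predictions'][out['split_indices'] == k] for out = predictions_with_split_indices(clf, X, y).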

sklearn/model_selection/_validation.py (outdated review thread, resolved)
@glemaitre glemaitre added this to TO REVIEW in Guillaume's pet Oct 5, 2020
Base automatically changed from master to main January 22, 2021 10:53
@LudvigOlsen

@okz12 Is this still being worked on? It would be a good feature to finally have! :-)

else:
    predictions = np.vstack(results["predictions"])
test_indices = np.hstack(results["test_indices"])
ret['predictions'] = predictions[test_indices]


Should this not be indexed by the argsort of test_indices?
If the predictions and test_indices already have the same order, the indices necessary to sort test_indices would also sort the predictions.



cross_val_predict() uses:

    inv_test_indices = np.empty(len(test_indices), dtype=int)
    inv_test_indices[test_indices] = np.arange(len(test_indices))

which seems to be the same as test_indices.argsort() but probably faster.
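
A quick standalone check of both points, that this inverse permutation equals test_indices.argsort() and that reindexing by it restores dataset order:

import numpy as np

test_indices = np.array([3, 0, 4, 1, 2])     # concatenated test folds
predictions = np.array([30, 0, 40, 10, 20])  # prediction for sample test_indices[i]

inv_test_indices = np.empty(len(test_indices), dtype=int)
inv_test_indices[test_indices] = np.arange(len(test_indices))

assert np.array_equal(inv_test_indices, np.argsort(test_indices))
# Dataset order: sample i ends up with prediction 10 * i.
assert np.array_equal(predictions[inv_test_indices], [0, 10, 20, 30, 40])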
