
How big is the calibration data MAPIE uses to compute intervals? #159

Closed
nilslacroix opened this issue Apr 24, 2022 · 3 comments

@nilslacroix

Is your documentation request related to a problem? Please describe.
I found this picture in your docs, which shows the workflow of MAPIE. My question is: how big is this calibration dataset for, e.g., the "cv+" method?

I find this question important because the calibration data is held out from the training data, and its size apparently can't be changed. This could lead to performance issues if the number of samples is small.

[image: MAPIE workflow diagram from the documentation]

@nilslacroix nilslacroix added the documentation Improvements or additions to documentation label Apr 24, 2022
@nilslacroix
Author

Also, wouldn't it make sense, in the case of a regression problem, to sort the training data by the target feature before fitting MAPIE on it? If you have samples sorted from, let's say, 100.000k to 600.000k SalePrice, as in a housing problem, the leave-one-out CV would calculate intervals in a space of values that are similar to each other and thus make more sense. For example, fold1 = 100.000k-120.000k, fold2 = 120.000k-140.000k, and so on...

@gmartinonQM
Contributor

gmartinonQM commented Apr 24, 2022

@vtaquet, I think the picture was specific to classification, at a time when we had not yet implemented cross-validation, only the split-conformal method with the cv="prefit" option. The picture is thus obsolete, and the size of the calibration set is defined by the number of cross-validation folds cv.
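The fold arithmetic behind this can be sketched with scikit-learn's `KFold`, the same kind of splitter MAPIE accepts for `cv`. This is an illustration of the split sizes only, not MAPIE code itself; the sample count is an arbitrary example:

```python
import numpy as np
from sklearn.model_selection import KFold

# With K-fold cross-conformal methods such as CV+, each training sample is
# held out exactly once, so every sample acts as a calibration point and
# each held-out fold has roughly n_samples / K points.
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)

cv = KFold(n_splits=5)
fold_sizes = [len(calib_idx) for _, calib_idx in cv.split(X)]
print(fold_sizes)  # each held-out (calibration) fold has n / K = 20 samples
```

In other words, no fixed fraction of the data is permanently sacrificed to calibration: the calibration set size follows directly from the `cv` value you choose.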

@vtaquet
Member

vtaquet commented May 2, 2022

@gmartinonQM , the picture is indeed obsolete and should be updated in a future PR.

@nilslacroix, sorting the training data before splitting it into folds is up to the user and needs to be done before calling MAPIE. Your cross-validation strategy can be defined by passing the desired scikit-learn BaseCrossValidator object, such as KFold, but keep in mind that the training and calibration sets need to have similar distributions.
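To see why sorting by the target works against that last requirement, here is a small sketch (the SalePrice-like values are synthetic, an assumption for illustration) comparing fold-wise target means for contiguous folds over sorted data versus a shuffled `KFold` of the kind you would pass to MAPIE's `cv`:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic sorted target, e.g. SalePrice between 100k and 600k.
y = np.sort(rng.uniform(100_000, 600_000, size=500))

def fold_means(cv):
    # Mean of the held-out (calibration) fold targets for each split.
    return [y[calib_idx].mean() for _, calib_idx in cv.split(y)]

sorted_means = fold_means(KFold(n_splits=5))                          # contiguous folds on sorted data
shuffled_means = fold_means(KFold(n_splits=5, shuffle=True, random_state=0))

# Spread of the per-fold means: large for sorted folds, small for shuffled ones.
print(np.ptp(sorted_means), np.ptp(shuffled_means))
```

With sorted data each fold covers a narrow, disjoint slice of the target range, so each calibration fold's distribution differs sharply from its training folds; shuffling keeps the fold distributions similar, which is what the cross-conformal guarantees rely on.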
