Is there a function to generate a train set and test set from DataFrame? #1498


Open
cloga opened this issue Mar 22, 2014 · 11 comments · May be fixed by #7384

Comments

@cloga commented Mar 22, 2014

I see that sandbox/tools has cross_validate.py, and it contains many functions from sklearn, including split. Are there any updates?
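For readers landing on this issue: statsmodels has no dedicated helper, but as a minimal sketch with made-up example data (not a statsmodels API), a DataFrame can already be split with plain pandas:

```python
import numpy as np
import pandas as pd

# Illustrative data, not from any statsmodels example
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10, 20)})

train = df.sample(frac=0.7, random_state=0)  # 70% of rows for training
test = df.drop(train.index)                  # the remaining 30%
```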

@jseabold (Member)

In the past, I've just edited the scikit-learn functions to return indices and done the slicing myself.
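A sketch of that index-based approach, using only NumPy/pandas (the helper name `kfold_indices` is made up for illustration): generate positional index arrays yourself and do the slicing with `.iloc`:

```python
import numpy as np
import pandas as pd

def kfold_indices(nobs, k, seed=0):
    """Yield (train_idx, test_idx) positional index arrays, KFold style."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(nobs)
    for test_idx in np.array_split(order, k):
        train_idx = np.setdiff1d(order, test_idx)
        yield train_idx, test_idx

df = pd.DataFrame({"x": np.arange(9), "y": np.arange(9) % 3})
folds = list(kfold_indices(len(df), 3))
train, test = df.iloc[folds[0][0]], df.iloc[folds[0][1]]
```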

@jseabold jseabold added this to the 0.6 milestone Mar 24, 2014
@jseabold
Copy link
Member

Added the 0.6.0 milestone to this. It would be pretty easy to adapt all of the CV tools to be DataFrame aware.

@josef-pkt (Member)

@cloga This part is dormant until we pick up work on more cross-validation and bootstrap support.

My plan was to copy an updated version from scikit-learn, but this time into a separate module, and add a statsmodels module with the extra iterators or generators that we need. Based on the pandas issue, we would also need to add support for pandas DataFrames.

Whenever I needed something similar recently, it was faster to write a few lines than to look for premade functions. Until now we haven't consolidated our usage of these iterators, and they are still spread out over several modules and scripts.
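The "few lines" version might look like this sketch (illustrative data, no library helpers): shuffle the row positions once, then slice with `.iloc`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) ** 2})

rng = np.random.RandomState(123)
perm = rng.permutation(len(df))     # shuffled row positions
cut = int(0.75 * len(df))           # 15 rows for training
train, test = df.iloc[perm[:cut]], df.iloc[perm[cut:]]
```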

@josef-pkt josef-pkt modified the milestones: 0.6, 0.7 Aug 17, 2014
@josef-pkt josef-pkt modified the milestones: 0.8, 0.7 Jul 17, 2015
@Dimkoim commented Jun 19, 2018

Hello! I would like to work on this issue. Is anyone working right now on that?

@josef-pkt (Member)

@Dimkoim AFAIK, nobody is working on this.

I haven't worked on those since we added them, but I ran into some related issues while working on or thinking about other parts:

I just ran into a case where we use the LOOO iterator in outlier_influence. However, I would like to change the LOOO iterator to take an index specifying which observations to leave out, instead of leaving out every observation one at a time.
That is, I only want to look at some LOOO cases rather than all of them, because most of them will be uninteresting (non-influential) in my current application.
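A sketch of that proposed change (the function name and signature are hypothetical, not existing statsmodels API): a leave-one-out iterator restricted to a user-supplied subset of observations:

```python
import numpy as np

def loo_subset(nobs, drop_idx):
    """Leave-one-out splits, but only for the observations in drop_idx."""
    all_idx = np.arange(nobs)
    for i in drop_idx:
        yield all_idx[all_idx != i], i  # (keep positions, left-out position)

# Only 2 refits instead of 5 full leave-one-out refits
splits = list(loo_subset(5, [1, 3]))
```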

Aside 1: When we use the split iterators for the models, we need to watch out for extra arrays/series that were used in model.__init__. There is currently no generic support for subsampling those extra arrays, although it is easier for a specific model where we know which extra arrays are used. #4741

Aside 2:
#4682 I think we need a more general splitter that preserves full rank of the design matrix, e.g. when there are categoricals.
#4680 some thoughts on a random resampler that also returns the out of sample index/mask.
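A sketch of the #4680 idea (all names here are hypothetical, not from the issue): a random resampler that also returns the out-of-sample index/mask:

```python
import numpy as np

def resample_with_oos(nobs, size, seed=0):
    """Draw positions with replacement and also return a boolean mask
    marking observations that were never drawn (the out-of-sample set)."""
    rng = np.random.RandomState(seed)
    sample = rng.randint(0, nobs, size=size)
    oos_mask = np.ones(nobs, dtype=bool)
    oos_mask[sample] = False
    return sample, oos_mask

sample, oos_mask = resample_with_oos(10, 10)
```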

@Dimkoim commented Jun 19, 2018

@josef-pkt Thanks for the reply! I was thinking of a more general split function, like the one scikit-learn has. What's your opinion?

@josef-pkt (Member)

I briefly looked at the scikit-learn code again. I hadn't looked at it in a long time, but it doesn't seem to have changed much since 2014.

I think the plan of copying the splitters and similar code over to statsmodels and then adjusting them is still good.
I'm not sure about the supporting code that is imported in scikit-learn and some of its design choices; e.g. so far we don't use any metaclasses or ABCs. (We will need to switch to the same or similar RandomState handling as scikit-learn eventually.)
Most of the actual cross-validation code that lives in the same module as the splitters will not be compatible with statsmodels.

The stratified splitters seem to preserve full support (non-singular design) if used for categoricals, AFAICS.
It also looks like ShuffleSplit allows subsampling with train_size + test_size < 1.
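For illustration, that ShuffleSplit subsampling behavior can be checked with a small sketch (assuming a current scikit-learn install; the data is made up):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # dummy design matrix with 100 rows

# train_size + test_size = 0.7, so 30% of rows are left out of every split
ss = ShuffleSplit(n_splits=3, train_size=0.5, test_size=0.2, random_state=0)
splits = list(ss.split(X))
```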

We will need some other resamplers, e.g. for time series and the bootstrap, but those can be added separately. In scikit-learn, BaseShuffleSplit and PartitionIterator are separate sets of classes, AFAICS.

@varoonp123

Has this issue been addressed? If not, I can take a look at it.

@bashtage (Member)

It has not and you are welcome to work on it.

@ashton77 commented Jan 20, 2021

Hey @bashtage, my project team would like to work on this issue.

@bashtage (Member)

@ashton77 please do.

@NolanMP NolanMP linked a pull request Mar 17, 2021 that will close this issue
7 participants