Is there a function to generate a train set and test set from DataFrame? #1498
Comments
In the past, I've just edited the scikit-learn functions to return indices and done the slicing myself.
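For anyone landing here, the "return indices and slice yourself" approach can be sketched in a few lines. This is only an illustration, not a pandas, statsmodels, or scikit-learn API: the function name `train_test_split_frame` and its `test_size` and `seed` parameters are made up for this example.

```python
import numpy as np
import pandas as pd


def train_test_split_frame(df, test_size=0.25, seed=None):
    """Hypothetical helper: split a DataFrame into (train, test) by position."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    # Shuffle integer positions, not labels, so any index type works.
    positions = rng.permutation(len(df))
    n_test = int(round(test_size * len(df)))
    # Sort so that each subset keeps the original row order.
    test_pos = np.sort(positions[:n_test])
    train_pos = np.sort(positions[n_test:])
    return df.iloc[train_pos], df.iloc[test_pos]


df = pd.DataFrame(
    {"x": np.arange(10.0), "y": np.arange(10.0) ** 2},
    index=list("abcdefghij"),
)
train, test = train_test_split_frame(df, test_size=0.3, seed=0)
```

Slicing with `iloc` keeps the original index labels on both pieces, so the rows can still be aligned with the full frame afterwards.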
Added the 0.6.0 milestone to this. It would be pretty easy to adapt all of the CV tools to be DataFrame aware.
@cloga This part is asleep until we pick up on more cross-validation and bootstrap support. My plan was to copy an updated version from scikit-learn, but this time in a separate module, and add a statsmodels module with the extra iterators or generators that we need. And based on the pandas issue, we would also need to add support for pandas DataFrames. Whenever I needed something similar recently, it was faster to write a few lines than to look for premade functions. Until now we haven't consolidated our usage of these iterators, and they are still spread out over several modules and scripts.
Hello! I would like to work on this issue. Is anyone working on it right now?
@Dimkoim AFAIK, nobody is working on this. I haven't worked on those since we added them, but I ran into some related issues while working on or thinking about other parts: I just ran into a case where we use the LOOO iterator in outlier_influence. However, I would like to change the LOOO iterator to take an index specifying which observations to leave out, instead of always leaving out every observation, one at a time. Aside 1: When we use the split iterators for the models, then we need to watch out for extra arrays/series that were used in the Aside 2:
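A rough sketch of what a leave-one-out iterator with an optional index could look like. This is not the existing statsmodels iterator; the name `leave_one_out_subset` and the `leave_out` argument are assumptions made for illustration only.

```python
import numpy as np


def leave_one_out_subset(n_obs, leave_out=None):
    """Hypothetical generator yielding (train_idx, test_idx) index arrays.

    If leave_out is None, every observation is left out once (plain
    leave-one-out). Otherwise only the listed positions are left out,
    one at a time.
    """
    if leave_out is None:
        leave_out = range(n_obs)
    all_idx = np.arange(n_obs)
    for i in leave_out:
        if not 0 <= i < n_obs:
            raise IndexError(f"observation {i} is out of range for n_obs={n_obs}")
        # The training set is everything except observation i.
        yield np.delete(all_idx, i), np.array([i])


# Leave out only observations 1 and 3 of a sample with 5 rows.
splits = list(leave_one_out_subset(5, leave_out=[1, 3]))
```

With `leave_out=None` this reduces to the full loop over all observations, so existing callers would not have to change.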
@josef-pkt Thanks for the reply! I was thinking of a more general split function, like the one that
I briefly looked at the scikit-learn code again. I haven't looked at it in a long time, but it doesn't seem to have changed much since 2014. I think the plan of copying the splitters and similar over to statsmodels and then adjusting them is still good. The stratified splitters seem to preserve full support (non-singular design) if used for categoricals, AFAICS. We will need some other resamplers, e.g. for time series and bootstrap, but those can be added separately. In scikit-learn, BaseShuffleSplit and PartitionIterator are separate sets of classes, AFAICS.
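To illustrate the point about stratified splits and categoricals: if the split is done within each level and at least one row per level stays in the training part, every dummy column of that categorical has some support there. A minimal pandas sketch of the idea follows. It is not the scikit-learn implementation, and `stratified_split_frame` with its `by`, `test_size`, and `seed` parameters is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd


def stratified_split_frame(df, by, test_size=0.25, seed=None):
    """Hypothetical helper: split within each level of the column `by`."""
    rng = np.random.default_rng(seed)
    is_test = np.zeros(len(df), dtype=bool)
    positions = pd.Series(np.arange(len(df)))
    # Group row positions by the level of the categorical column.
    for _, group in positions.groupby(df[by].to_numpy()):
        # Always keep at least one row of each level in the training part.
        n_test = min(int(round(test_size * len(group))), len(group) - 1)
        chosen = rng.choice(group.to_numpy(), size=n_test, replace=False)
        is_test[chosen] = True
    return df.iloc[~is_test], df.iloc[is_test]


df = pd.DataFrame({"g": list("aaaaaabb"), "y": np.arange(8.0)})
train, test = stratified_split_frame(df, by="g", test_size=0.5, seed=0)
```

A plain random split of the same frame could put both "b" rows into the test part, which would leave the "b" dummy column all zeros in the training design.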
Has this issue been addressed? If not, I can take a look at it.
It has not and you are welcome to work on it.
Hey @bashtage, my project team would like to work on this issue.
@ashton77 please do.
I see that sandbox>tool has cross_validate.py, which includes many functions from sklearn, including split. Is there any update?