
Bug in train_test_split #22979

Closed
nilslacroix opened this issue Mar 28, 2022 · 10 comments
@nilslacroix

Describe the bug

I am fairly confident I found a bug in train_test_split. When using train_test_split multiple times in a row, it produces wrong indices in X_train and y_train.

Steps/Code to Reproduce

Using this code for the first time in a Jupyter notebook produced the correct result, with two identical indices:

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=seed)

I confirmed this with:

X_train.index.equals(y_train.index)

After loading the dataset into the variable df again and getting the target with target = np.log1p(df['SalePrice']), I could confirm that df and target had the right indices in the right order before I used the function. After the function was done, X_train and y_train had changed indices. So there seems to be a memory-related problem, since the first usage worked just fine, but every attempt after that produces the wrong result. Clearing all output and restarting the Jupyter notebook solves the issue for one more try.
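For anyone trying to verify the expected behavior, a minimal sanity check (the data, column names, and seed below are made up for illustration) that repeated calls to an unpatched train_test_split keep the X and y indices aligned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 42  # any fixed seed
df = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
target = pd.Series(np.random.rand(100), name="SalePrice")

# Call multiple times in a row, as in the report; with stock scikit-learn
# the row indices of X and y splits must stay aligned on every call.
for _ in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        df, target, test_size=0.2, random_state=seed
    )
    assert X_train.index.equals(y_train.index)
    assert X_test.index.equals(y_test.index)

print("indices stay aligned")
```

With plain scikit-learn this prints "indices stay aligned" on every run; a mismatch here would indicate something replaced the function.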

Expected Results

Of course I would expect the right indices.

Actual Results

[screenshot: X_train after the split]
This is the actual result from X_train. If you compare it with the picture below, you can see that the first samples of the table have been altered, but the last ones remain intact.
[screenshot: df before the split]

Versions

scikit-learn 0.24.2, Python 3.9
@nilslacroix added the Bug and Needs Triage labels on Mar 28, 2022
@nilslacroix
Author

I found that when I sort the index and print it, it appears to be the index from 0 to the sample size, with a few indices missing.

[screenshot: sorted index with gaps]

@jeremiedbb
Member

Thanks for the report @nilslacroix. Could you come up with a minimal reproducer that exhibits the issue you're facing (https://scikit-learn.org/dev/developers/minimal_reproducer.html#minimal-reproducer)?

@thomasjpfan added the module:model_selection and Needs Reproducible Code labels and removed Needs Triage on Mar 28, 2022
@nilslacroix
Author

nilslacroix commented Mar 30, 2022

Hi, I tried to reproduce the error with another dataset, but it works just fine with toy examples. I am really baffled here. As you can see in the screenshot, I feed two pandas DataFrames with an Int64Index into the function. Before I do this, the indices are correctly aligned.

But after train_test_split is called, y_train (and y_test as well) are reindexed starting at 0 and randomly shuffled, which makes the index comparison return False.

[screenshot: mismatched indices after the split]

For comparison, I wrote this reproducer, but it yields correct indices. This is even stranger, since the only difference is the DataFrame that is fed into the function. Maybe you have a clue why this could happen?

# load dataset
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import pandas as pd

seed = 0  # any fixed seed
X, y = load_diabetes(return_X_y=True)
# convert to pandas, since I use pandas DataFrames
X = pd.DataFrame(X)
y = pd.DataFrame(y)
# simulate my randomized dataset which is put into train_test_split:
# shuffle both DataFrames with the same seed, so the indices are in the same order
X = X.sample(frac=1, random_state=seed)
y = y.sample(frac=1, random_state=seed)
# print whether the indices are equal before train_test_split
print(X.index.equals(y.index))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
print(X_train.index.equals(y_train.index))

@thomasjpfan
Member

This issue is hard to debug without a reproducer. If you cannot share your dataset, can you write a reproducer that generates a DataFrame around the size of your dataset to reproduce the issue?

@nilslacroix
Author

nilslacroix commented Apr 24, 2022

Sorry, I just realised I wrote the same thing again. To be honest, I cannot reproduce the error with any toy DataFrame. It is also size-independent, because the error occasionally happens even when I sample only about 100 rows from the whole DataFrame. I really think this is memory-related or some strange Jupyter notebook bug, since it works again as soon as I restart and clear the notebook.

I just do not see any other explanation, since I compare the indices right before and after calling the function.

@nilslacroix
Author

I found the culprit. I used the Intel® Extension for Scikit-learn to speed up the computation. It turns out that when I import it with

from sklearnex import patch_sklearn
patch_sklearn()

this bug appears. And others as well: for example, I just saw that the scoring method of SVR with a poly kernel produced different scores a few seconds later with the same X, y sets, without changing the parameters.
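For anyone hitting the same symptom, a sketch of how the patching can be disabled again. unpatch_sklearn is part of the sklearnex public API; the try/except guard is only there so the snippet also runs on machines where sklearnex is not installed:

```python
try:
    from sklearnex import patch_sklearn, unpatch_sklearn

    patch_sklearn()    # replaces selected scikit-learn functions with Intel implementations
    # ... run accelerated code here ...
    unpatch_sklearn()  # restore the stock scikit-learn implementations
except ImportError:
    pass  # sklearnex not installed; stock scikit-learn is used throughout

# Re-import after unpatching so the original function object is picked up.
from sklearn.model_selection import train_test_split
```

Note that modules imported while the patch was active keep references to the patched objects, so re-importing after unpatch_sklearn() matters.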

@thomasjpfan
Member

I am closing this issue because it is specific to sklearnex. Thank you for looking into it further.

@adrinjalali
Member

Should we have a tag for issues that are due to this package rather than sklearn?

@thomasjpfan
Member

> Should we have a tag for issues that are due to this package rather than sklearn?

In recent memory, I think this is the first issue related to sklearnex. To me, this means not to label it.

If we really want to label this issue, there is a more generic class of issues: "upstream library". Depending on the issue, it may or may not be an issue with scikit-learn itself.

@nilslacroix
Author

As an update: this is definitely related to sklearnex. The problem is hardware-related. When using AMD CPUs, the functions and classes that are replaced by sklearnex (for example train_test_split and estimators) produce wrong indices and inconsistent scores instead of giving a warning. I think that is the problematic part: it advertises such a boost in performance without disclosing that it is a hardware-dependent library. This may lead to more cases like this in the future.
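A quick way to tell which implementation you are actually calling is to inspect the function's module path. This diagnostic is not from the thread; on an unpatched install the path stays under sklearn, while after patch_sklearn() the replacement typically reports a different package:

```python
from sklearn.model_selection import train_test_split

# Stock scikit-learn reports a module path under 'sklearn'
# (e.g. 'sklearn.model_selection._split'); a patched function
# typically reports one under the accelerator package instead.
print(train_test_split.__module__)
```

Printing this right before the suspect call makes it easy to confirm whether sklearnex is still active in the current kernel.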
