KFold on large data yields overlapping train and test sets #6063
When using a large number of samples, `train` and `test` in `_fit_and_score` are large, and thus memory-mapped within the `Parallel` loop. I have found out that in this case, `train` and `test` can be overlapping, while they are generated by a `KFold(3, shuffle=False)`. I suspect some collisions in joblib there. I discovered this by seeing that some of my scores were unexpectedly low in the `grid_scores_` past a specific number of samples. I will try to reproduce ASAP.

Comments
There it goes, on master, Python 3.5:

```python
from joblib import Parallel, delayed
from sklearn.model_selection import KFold
import numpy as np


def overlapping(test, train):
    print('test : %s' % test)
    print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    # number of indices shared between the two sets (should always be 0)
    inter = np.intersect1d(test, train)
    return inter.shape[0]


if __name__ == '__main__':
    X = np.zeros(int(10e5))
    y = np.zeros(int(10e5))

    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=0)(
        delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)

    print('No memmapping')
    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=None)(
        delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)
```
I should add that the problem is not met without the memmapping.

Ping @ogrisel @GaelVaroquaux, I reckon this is rather critical.
I was curious about this and ran some more tests, and I can confirm KFold is producing overlapping training and testing sets with memory mapping. The output numbers are not exactly reproducible, but what is clear is that there is overlap. More extensive testing is needed; I am planning to add a unit test based on this to make sure future updates can't slip through with this bug. Here is the code:

```python
import sys

from joblib import Parallel, delayed
from sklearn.model_selection import KFold
import numpy as np


def overlapping(test, train):
    # print('test : %s' % test)
    # print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    inter = np.intersect1d(test, train)
    return inter.shape[0]


if __name__ == '__main__':
    for num_jobs in range(2, 6):
        for num_folds in [3, 5, 7, 10, 15, 20]:
            for exp in range(2, 7):
                num_samples = 10 ** exp
                sys.stdout.write('#jobs: {0:2d}, K = {1:3d}, N = {2:10d} : '
                                 .format(num_jobs, num_folds, num_samples))
                X = np.zeros(num_samples)
                y = np.zeros(num_samples)

                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=0)(
                    delayed(overlapping)(train, test)
                    for train, test in k_fold.split(X, y))
                sys.stdout.write('Intersection, with memmap : %10d, '
                                 % sum(inter))

                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=None)(
                    delayed(overlapping)(train, test)
                    for train, test in k_fold.split(X, y))
                sys.stdout.write('no memmap : %10d' % sum(inter))
                print()
```

Here are the results:
More checks using `sklearn.externals.joblib`:
I am getting a MemoryError ;( What configurations do you guys have? Are you running this in the cloud?
I haven't done exhaustive testing in serial mode (when #jobs = 1) yet, but it seems to me that the bug lies with the memory mapping. This deserves either a test case to make sure it doesn't happen, or at the least an assertion to make sure it's not run in parallel, where it can produce overlapping splits.
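A minimal sketch of what such a non-regression test could look like (hypothetical test name; it reuses this thread's pre-0.18 `n_folds` argument and forces memmapping with `max_nbytes=0`):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold


def n_overlapping(train, test):
    # number of indices present in both the train and test sets
    return np.intersect1d(train, test).shape[0]


def test_kfold_no_overlap_under_memmapping():
    X = np.zeros(int(1e6))
    k_fold = KFold(n_folds=3, shuffle=False)
    # max_nbytes=0 memmaps every argument, the setting that exposed the bug
    overlaps = Parallel(n_jobs=2, max_nbytes=0)(
        delayed(n_overlapping)(train, test)
        for train, test in k_fold.split(X))
    assert sum(overlaps) == 0
```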
Indeed!
I can confirm that KFold in parallel fails as well when using memory mapping (which is triggered automatically for > 1M test and train index sets).
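Until a fix lands, one stop-gap is to disable the automatic memmapping entirely, as the scripts above already do via `max_nbytes`; a minimal sketch:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold

X = np.zeros(int(1e6))
k_fold = KFold(n_folds=3, shuffle=False)

# max_nbytes=None turns off joblib's automatic memmapping of large
# arguments, sidestepping the overlap at the cost of copying the
# index arrays into each worker process.
fold_sizes = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(len)(test) for train, test in k_fold.split(X))
```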
I understand the source of the problem: the automatic memmapping feature of joblib is confused by the reuse of the same id() on recently garbage-collected arrays in the iterator. I am working on a test to reproduce the issue in joblib.
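For illustration, a minimal sketch of the id() reuse described above (nothing joblib-specific; in CPython a new object may be allocated at the address of a just-collected one, so id() alone is not a safe cache key):

```python
import numpy as np


def make_indices():
    # returns a fresh array that becomes garbage as soon as the
    # caller drops its reference
    return np.arange(1000)


a = make_indices()
old_id = id(a)
del a                   # refcount hits zero, the memory is freed
b = make_indices()      # the allocator frequently reuses that address
print(old_id == id(b))  # often True on CPython
```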
Is it closed? I.e., has the joblib update been merged in?
Re-opening: this was closed automatically by GitHub when the PR was merged in the joblib repo. I will do the joblib 0.9.4 release and the sync PR to scikit-learn next.
uh oh this doesn't look good :-/
Is this problem solved? I tried the test code (with some modifications due to the API change of `KFold`).
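(Presumably the modification refers to the constructor rename in scikit-learn 0.18; a sketch of the change the scripts above would need:)

```python
from sklearn.model_selection import KFold

# pre-0.18 (as in this thread):  KFold(n_folds=3, shuffle=False)
# 0.18 and later:
k_fold = KFold(n_splits=3, shuffle=False)
```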
Yes, the underlying issue has been fixed in joblib 0.9.4, which is included in scikit-learn 0.17.1. If someone with the necessary rights can close this one, that'd be great!
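A quick way to check that you are running the fixed versions (assuming the bundled copy of joblib, `sklearn.externals.joblib`, as used earlier in this thread):

```python
import sklearn
from sklearn.externals import joblib  # the bundled copy scikit-learn uses

print(sklearn.__version__)  # fixed as of 0.17.1
print(joblib.__version__)   # fixed as of 0.9.4
```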