KFold on large data yields overlapping train and test sets #6063
When using a large number of samples, `train` and `test` in `_fit_and_score` are large, and thus memory-mapped within the `Parallel` loop. I have found out that in this case, `train` and `test` can be overlapping, while they are generated by a `KFold(3, shuffle=False)`. I suspect some collisions in joblib there. I discovered this by seeing that some of my scores were unexpectedly low in the `grid_scores_` past a specific number of samples. I will try to reproduce ASAP.

Comments
There it goes, on master, Python 3.5:

```python
from joblib import Parallel, delayed
from sklearn.model_selection import KFold
import numpy as np


def overlapping(test, train):
    print('test : %s' % test)
    print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    # number of indices shared between the two sets (should always be 0)
    inter = np.intersect1d(test, train)
    return inter.shape[0]


if __name__ == '__main__':
    X = np.zeros(int(10e5))
    y = np.zeros(int(10e5))

    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=0)(
        delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)

    print('No memmapping')
    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=None)(
        delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)
```
I should add that the problem is not met without the memmapping.

Ping @ogrisel @GaelVaroquaux, I reckon this is rather critical.
I was curious about this and ran some more tests, and I can confirm KFold is producing overlapping training and testing sets with memory mapping. The output numbers are not exactly reproducible, but what is clear is that there is overlap. More extensive testing is needed; I am planning to add a unit test based on this to make sure future updates can't slip through with this bug. Here is the code:

```python
import sys

from joblib import Parallel, delayed
from sklearn.model_selection import KFold
import numpy as np


def overlapping(test, train):
    # print('test : %s' % test)
    # print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    inter = np.intersect1d(test, train)
    return inter.shape[0]


if __name__ == '__main__':
    for num_jobs in range(2, 6):
        for num_folds in [3, 5, 7, 10, 15, 20]:
            for exp in range(2, 7):
                num_samples = 10 ** exp
                sys.stdout.write('#jobs: {0:2d}, K = {1:3d}, N = {2:10d} : '
                                 .format(num_jobs, num_folds, num_samples))
                X = np.zeros(num_samples)
                y = np.zeros(num_samples)

                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=0)(
                    delayed(overlapping)(train, test)
                    for train, test in k_fold.split(X, y))
                sys.stdout.write('Intersection, with memmap : %10d, '
                                 % sum(inter))

                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=None)(
                    delayed(overlapping)(train, test)
                    for train, test in k_fold.split(X, y))
                sys.stdout.write('no memmap : %10d' % sum(inter))
                print()
```

Here are the results:
More checks using `sklearn.externals.joblib`:
I am getting a MemoryError ;( What configurations do you guys have? Are you running this in the cloud?
I haven't done exhaustive testing in serial mode (when #jobs = 1) yet, but it seems to me that the bug lies with the memory mapping. This deserves either a test case to make sure it doesn't happen, or at the least an assertion to make sure it's not run in parallel, where it can produce overlapping splits.
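A minimal sketch of what such a non-regression test could look like (hypothetical test name; it reuses this thread's pre-0.18 `n_folds` argument and forces memmapping with `max_nbytes=0`):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold


def n_overlapping(train, test):
    # number of indices present in both the train and test sets
    return np.intersect1d(train, test).shape[0]


def test_kfold_no_overlap_under_memmapping():
    X = np.zeros(int(1e6))
    k_fold = KFold(n_folds=3, shuffle=False)
    # max_nbytes=0 memmaps every argument, the setting that exposed the bug
    overlaps = Parallel(n_jobs=2, max_nbytes=0)(
        delayed(n_overlapping)(train, test)
        for train, test in k_fold.split(X))
    assert sum(overlaps) == 0
```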
Indeed!
I can confirm that KFold in parallel fails as well when using memory mapping (which is triggered automatically for > 1M test and train index sets).
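Until a fix lands, one stop-gap is to disable the automatic memmapping entirely, as the scripts above already do via `max_nbytes`; a minimal sketch:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold

X = np.zeros(int(1e6))
k_fold = KFold(n_folds=3, shuffle=False)

# max_nbytes=None turns off joblib's automatic memmapping of large
# arguments, sidestepping the overlap at the cost of copying the
# index arrays into each worker process.
fold_sizes = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(len)(test) for train, test in k_fold.split(X))
```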
I understand the source of the problem: the automatic memmapping feature of joblib is confused by the reuse of the same id() on recently garbage-collected arrays in the iterator. I am working on a test to reproduce the issue in joblib.
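For illustration, a minimal sketch of the id() reuse described above (nothing joblib-specific; in CPython a new object may be allocated at the address of a just-collected one, so id() alone is not a safe cache key):

```python
import numpy as np


def make_indices():
    # returns a fresh array that becomes garbage as soon as the
    # caller drops its reference
    return np.arange(1000)


a = make_indices()
old_id = id(a)
del a                   # refcount hits zero, the memory is freed
b = make_indices()      # the allocator frequently reuses that address
print(old_id == id(b))  # often True on CPython
```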
Is it closed? I.e., has the joblib update been merged in?
Re-opening: this was closed automatically by GitHub when the PR was merged in the joblib repo. I will do the joblib 0.9.4 release and the sync PR to scikit-learn next.
uh oh this doesn't look good :-/
Is this problem solved? I tried the test code (with some modifications due to the API change of `KFold`).
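(Presumably the modification refers to the constructor rename in scikit-learn 0.18; a sketch of the change the scripts above would need:)

```python
from sklearn.model_selection import KFold

# pre-0.18 (as in this thread):  KFold(n_folds=3, shuffle=False)
# 0.18 and later:
k_fold = KFold(n_splits=3, shuffle=False)
```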
Yes, the underlying issue has been fixed in joblib 0.9.4, which is included in scikit-learn 0.17.1. If someone with the necessary rights can close this one, that'd be great!
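A quick way to check that you are running the fixed versions (assuming the bundled copy of joblib, `sklearn.externals.joblib`, as used earlier in this thread):

```python
import sklearn
from sklearn.externals import joblib  # the bundled copy scikit-learn uses

print(sklearn.__version__)  # fixed as of 0.17.1
print(joblib.__version__)   # fixed as of 0.9.4
```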