KFold on large data yields overlapping train and test sets #6063

Closed
arthurmensch opened this issue Dec 18, 2015 · 14 comments · Fixed by joblib/joblib#294

Comments

@arthurmensch
Contributor

With a large number of samples, train and test in _fit_and_score are memory-mapped within the Parallel loop. I have found that in this case train and test can overlap, even though they are generated by a KFold(3, shuffle=False). I suspect some collisions in joblib there. I discovered this by seeing that some of my scores in grid_scores_ were unexpectedly low past a specific number of samples.

I will try to reproduce ASAP

@arthurmensch
Contributor Author

There it goes, on master, Python 3.5

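from joblib import Parallel, delayed
from sklearn.model_selection import KFold

import numpy as np


def overlapping(test, train):
    # NOTE: called below as overlapping(train, test), so the two printed
    # labels are swapped; the intersection size is unaffected.
    print('test : %s' % test)
    print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    # Number of indices present in both sets (0 for a valid split).
    inter = np.intersect1d(test, train)
    return inter.shape[0]


if __name__ == '__main__':
    n_samples = int(10e5)  # np.zeros needs an integer size
    X = np.zeros(n_samples)
    y = np.zeros(n_samples)

    # n_folds is the keyword on master at the time (later renamed n_splits).
    # max_nbytes=0 sets the memmapping threshold to zero bytes.
    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=0)(
        delayed(overlapping)(train, test)
        for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)

    print('No memmapping')

    # max_nbytes=None disables memmapping.
    k_fold = KFold(n_folds=3, shuffle=False)
    inter = Parallel(n_jobs=2, verbose=10, max_nbytes=None)(
        delayed(overlapping)(train, test)
        for train, test in k_fold.split(X, y))
    print('Intersection : %s' % inter)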

Out :

test : [333334 333335 333336 ..., 999997 999998 999999]
train : [     0      1      2 ..., 333331 333332 333333]
test : [     0      1      2 ..., 999997 999998 999999]
train : [333334 333335 333336 ..., 666664 666665 666666]
test : [333334 333335 333336 ..., 999997 999998 999999]
train : [666667 666668 666669 ..., 999997 999998 999999]
[Parallel(n_jobs=2)]: Done   4 out of   3 | elapsed:    0.4s remaining:   -0.1s
[Parallel(n_jobs=2)]: Done   4 out of   3 | elapsed:    0.4s remaining:   -0.1s
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.4s finished
Intersection : [0, 0, 333333]
No memmapping
test : [333334 333335 333336 ..., 999997 999998 999999]
train : [     0      1      2 ..., 333331 333332 333333]
[Parallel(n_jobs=2)]: Done   4 out of   3 | elapsed:    0.5s remaining:   -0.1s
[Parallel(n_jobs=2)]: Done   4 out of   3 | elapsed:    0.5s remaining:   -0.1s
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.5s finished
test : [     0      1      2 ..., 999997 999998 999999]
train : [333334 333335 333336 ..., 666664 666665 666666]
test : [     0      1      2 ..., 666664 666665 666666]
train : [666667 666668 666669 ..., 999997 999998 999999]
Intersection : [0, 0, 0]

I should add that the problem does not occur with X.shape[0] < 10e4, but does with X.shape[0] > 10e5.

Ping @ogrisel @GaelVaroquaux I reckon this is rather critical.

@arthurmensch arthurmensch changed the title GridSearchCV with large KFold splits sometimes have overlapping train and test sets GridSearchCV with large KFold splits have overlapping train and test sets Dec 18, 2015
@arthurmensch arthurmensch changed the title GridSearchCV with large KFold splits have overlapping train and test sets KFold on large data yields overlapping train and test sets Dec 18, 2015
@raamana
Contributor

raamana commented Dec 19, 2015

I was curious about this and ran some more tests, and I can confirm that K-fold produces overlapping training and testing sets when memory mapping is used. The output numbers are not exactly reproducible, but the overlap is clear. More extensive testing is needed. I am planning to add a unit test based on this so that future changes cannot reintroduce the bug.

Here is the code:

import sys
from joblib import Parallel, delayed
from sklearn.model_selection import KFold
import numpy as np

def overlapping(test, train):
    # print('test : %s' % test)
    # print('train : %s' % train)
    test = np.array(test)
    train = np.array(train)
    # Number of indices present in both sets (0 for a valid split).
    inter = np.intersect1d(test, train)
    return inter.shape[0]

if __name__ == '__main__':

    for num_jobs in range(2, 6):
        for num_folds in [3, 5, 7, 10, 15, 20]:
            for exp in range(2, 7):
                num_samples = 10**exp

                sys.stdout.write('#jobs: {0:2d}, K = {1:3d}, N = {2:10d} : '.format(num_jobs, num_folds, num_samples))
                X = np.zeros(num_samples)
                y = np.zeros(num_samples)
                # With memmapping: threshold of zero bytes.
                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=0)(
                    delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
                sys.stdout.write('Intersection, with memmap : %10d, ' % sum(inter))

                # Without memmapping: max_nbytes=None disables it.
                k_fold = KFold(n_folds=num_folds, shuffle=False)
                inter = Parallel(n_jobs=num_jobs, verbose=0, max_nbytes=None)(
                    delayed(overlapping)(train, test) for train, test in k_fold.split(X, y))
                sys.stdout.write('no memmap : %10d' % sum(inter))
                print(' ')
            print(' ')

Here are the results:

#jobs:  2, K =   3, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   3, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   3, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   3, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   3, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  2, K =   5, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   5, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   5, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   5, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   5, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  2, K =   7, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   7, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   7, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   7, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =   7, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  2, K =  10, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  10, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  10, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  10, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  10, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  2, K =  15, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  15, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  15, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  15, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  15, N =    1000000 : Intersection, with memmap :     933334, no memmap :          0 

#jobs:  2, K =  20, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  20, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  20, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  20, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  2, K =  20, N =    1000000 : Intersection, with memmap :    1850000, no memmap :          0 

#jobs:  3, K =   3, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   3, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   3, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   3, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   3, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  3, K =   5, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   5, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   5, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   5, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   5, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  3, K =   7, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   7, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   7, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   7, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =   7, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  3, K =  10, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  10, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  10, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  10, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  10, N =    1000000 : Intersection, with memmap :     100000, no memmap :          0 

#jobs:  3, K =  15, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  15, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  15, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  15, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  15, N =    1000000 : Intersection, with memmap :     933334, no memmap :          0 

#jobs:  3, K =  20, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  20, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  20, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  3, K =  20, N =     100000 : Intersection, with memmap :       5000, no memmap :          0 
#jobs:  3, K =  20, N =    1000000 : Intersection, with memmap :    2750000, no memmap :          0 

#jobs:  4, K =   3, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   3, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   3, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   3, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   3, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  4, K =   5, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   5, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   5, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   5, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   5, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  4, K =   7, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   7, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   7, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   7, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =   7, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  4, K =  10, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  10, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  10, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  10, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  10, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  4, K =  15, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  15, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  15, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  15, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  15, N =    1000000 : Intersection, with memmap :     933334, no memmap :          0 

#jobs:  4, K =  20, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  20, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  20, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  20, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  4, K =  20, N =    1000000 : Intersection, with memmap :    1900000, no memmap :          0 

#jobs:  5, K =   3, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   3, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   3, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   3, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   3, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  5, K =   5, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   5, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   5, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   5, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   5, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  5, K =   7, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   7, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   7, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   7, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =   7, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  5, K =  10, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  10, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  10, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  10, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  10, N =    1000000 : Intersection, with memmap :          0, no memmap :          0 

#jobs:  5, K =  15, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  15, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  15, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  15, N =     100000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  15, N =    1000000 : Intersection, with memmap :     866667, no memmap :          0 

#jobs:  5, K =  20, N =        100 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  20, N =       1000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  20, N =      10000 : Intersection, with memmap :          0, no memmap :          0 
#jobs:  5, K =  20, N =     100000 : Intersection, with memmap :      95000, no memmap :          0 
#jobs:  5, K =  20, N =    1000000 : Intersection, with memmap :    1900000, no memmap :          0 

@raamana
Contributor

raamana commented Dec 19, 2015

More checks using sklearn.externals.joblib:
https://gist.github.com/raamana/382927a527c87746b1ef

@raghavrv
Member

I am getting a Memory error ;( What configurations do you guys have? Are you running this on cloud?

@raghavrv
Member

Also, KFold does the splits serially while the delayed(overlapping) function runs in parallel, correct? Which would mean the problem is with joblib's memmapping? (Sorry if I misunderstood.)

@raamana
Contributor

raamana commented Dec 21, 2015

I haven't done exhaustive testing yet in serial mode (when #jobs = 1), but it seems to me that the bug is in memory mapping. This deserves either a test case to make sure it doesn't happen, or at the least an assertion to make sure it's not run in parallel, where it can produce overlapping splits.
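
A minimal sketch of the kind of assertion suggested here, using only the standard library. Both kfold_indices and assert_valid_split are hypothetical names introduced for illustration; kfold_indices is a simplified stand-in for KFold(shuffle=False), not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds.

    Hypothetical stand-in for KFold(shuffle=False): in this sketch the
    first n_samples % n_folds folds get one extra sample.
    """
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def assert_valid_split(train, test, n_samples):
    """Raise AssertionError if train and test overlap or miss samples."""
    train_set, test_set = set(train), set(test)
    overlap = train_set & test_set
    assert not overlap, '%d indices are in both train and test' % len(overlap)
    assert len(train_set) + len(test_set) == n_samples, \
        'split does not cover all samples'


if __name__ == '__main__':
    for train, test in kfold_indices(10, 3):
        assert_valid_split(train, test, 10)
    print('all splits are disjoint')
```

A check like assert_valid_split could be called inside the worker function, on the arrays the worker actually receives, so that a corrupted split fails loudly instead of silently lowering a score.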

@raghavrv
Member

This deserves either a test case, or at the least an assertion to make sure it's not run in parallel, where it can produce overlapping splits.

Indeed!

@arthurmensch
Contributor Author

I can confirm that KFold with parallel fails as well when using memory mapping (which is triggered automatically for > 1M test and train index sets).
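
As a rough check of when that automatic threshold is crossed, the arithmetic can be sketched as follows. It rests on two assumptions that are not stated in this thread: index arrays use 8-byte integers, and the "1M" threshold means 2**20 bytes. The names below are hypothetical, and the sketch only concerns the default threshold, not the max_nbytes=0 setting used in the scripts above.

```python
INDEX_ITEMSIZE = 8      # assumption: 8-byte (int64) indices
MAX_NBYTES = 2 ** 20    # assumption: the "1M" threshold read as 2**20 bytes


def largest_train_nbytes(n_samples, n_folds):
    """Bytes used by the largest train index array of an unshuffled K-fold split."""
    smallest_test = n_samples // n_folds
    return (n_samples - smallest_test) * INDEX_ITEMSIZE


def is_memmapped(n_samples, n_folds):
    """True if the largest train index array exceeds the assumed threshold."""
    return largest_train_nbytes(n_samples, n_folds) > MAX_NBYTES


if __name__ == '__main__':
    for n_samples in (10 ** 5, 10 ** 6):
        print(n_samples,
              largest_train_nbytes(n_samples, 3),
              is_memmapped(n_samples, 3))
```

Under these assumptions, 3 folds over 10**5 samples stay below the threshold while 10**6 samples exceed it, which is consistent with the sample sizes reported earlier in the thread.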

@ogrisel
Member

ogrisel commented Jan 6, 2016

I understand the source of the problem: the automatic memmapping feature of joblib is confused by the reuse of the same id() on recently garbage-collected arrays in the iterator. I am working on a test to reproduce the issue in joblib.
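
The failure mode can be illustrated with a toy model. This is not joblib's code: dump_naive, dump_pinned and count_stale are hypothetical names, plain lists stand in for NumPy arrays, and pinning the object is only one possible remedy, not necessarily what joblib/joblib#294 does. The sketch relies on the fact that id() is only unique among objects alive at the same time.

```python
def dump_naive(cache, arr):
    """Cache a copy of arr under id(arr) without keeping arr alive."""
    key = id(arr)
    if key not in cache:
        cache[key] = list(arr)
    return cache[key]


def dump_pinned(cache, arr):
    """Same cache, but each entry references arr, so its id stays taken."""
    key = id(arr)
    if key not in cache:
        cache[key] = (arr, list(arr))
    return cache[key][1]


def count_stale(dump, n_arrays=50, size=1000):
    """Count lookups that return the contents of an earlier, different array."""
    cache, stale = {}, 0
    for i in range(n_arrays):
        arr = [i] * size    # fresh temporary, like each fold's index array
        stale += int(dump(cache, arr) != arr)
        del arr             # last reference gone, unless the cache pins it
    return stale


if __name__ == '__main__':
    print('stale hits, id-keyed cache  :', count_stale(dump_naive))
    print('stale hits, pinned id cache :', count_stale(dump_pinned))
```

The pinned variant always reports zero stale hits, because a live object never shares its id with another. The count for the naive variant depends on how the interpreter recycles memory, so it can vary between runs and implementations, which matches the non-reproducible numbers reported above.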

@GaelVaroquaux
Member

Closed #6063 via joblib/joblib#294.

Is it closed? I.e., has the joblib update been merged in?

@ogrisel ogrisel reopened this Jan 18, 2016
@ogrisel
Member

ogrisel commented Jan 18, 2016

Re-opening: this was closed automatically by GitHub when the PR was merged in the joblib repo. I will do the joblib 0.9.4 release and a sync PR to scikit-learn next.

@amueller
Member

uh oh this doesn't look good :-/

@zym1010

zym1010 commented Jun 14, 2016

Is this problem solved? I tried the test code from @raamana (with some modifications due to the API change of KFold) and found it to work without any problem, with sklearn 0.17.1 and joblib 0.9.4.

@lesteve
Member

lesteve commented Jun 21, 2016

is this problem solved?

Yes the underlying issue has been fixed in joblib 0.9.4 which is included in scikit-learn 0.17.1. If someone with the necessary rights can close this one, that'd be great!
