PERF : StratifiedShuffleSplit is slow when using large number of classes #5991

Closed
arthurmensch opened this issue Dec 9, 2015 · 12 comments
Labels: Easy (Well-defined and straightforward way to resolve), Enhancement

Comments

@arthurmensch
Contributor

When using a large number of classes (e.g. > 10000, e.g. for recommender systems), StratifiedShuffleSplit is very slow compared to ShuffleSplit. Looking at the code, I believe that the following part:

            for i, class_i in enumerate(classes):
                permutation = rng.permutation(class_counts[i])
                perm_indices_class_i = np.where((y == class_i))[0][permutation]

(l. 1070 in sklearn.model_selection._split) is suboptimal: we should build an index matrix holding the indices of each class in the dataset, which requires only a single pass over the data (possibly along with a bincount of the classes). Indeed, np.where does a full pass over y at each call, leading to O(n_classes * len(y)) complexity, whereas O(len(y)) would suffice.

I obtain a significant performance gain by doing:

        class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
        count = np.zeros(n_classes, dtype='int')
        for i in range(len(y_indices)):
            class_indices[y_indices[i], count[y_indices[i]]] = i
            count[y_indices[i]] += 1

and subsequently replacing

perm_indices_class_i = np.where((y == class_i))[0][permutation]

by

perm_indices_class_i = class_indices[i, :class_counts[i]][permutation]

This is still suboptimal, given that we iterate over the y values within a Python loop. I believe the proper way to do this would be to create a bincount_with_ref Cython function (in arrayfuncs.pyx) that would both count the occurrences of classes and accumulate the class indices in a class_indices array. Memory usage goes up by len(y) * sizeof('int'), which is typically small compared to the size of X.

Would this be useful? I'll have to provide benchmarks!
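
A rough micro-benchmark along these lines (a sketch only; the sizes below are arbitrary) could look like:

    import time
    import numpy as np

    rng = np.random.RandomState(0)
    n_classes, n_samples = 2000, 100000
    y = rng.randint(n_classes, size=n_samples)
    classes, y_indices = np.unique(y, return_inverse=True)
    class_counts = np.bincount(y_indices)

    # Current approach: one np.where pass over y per class -> O(n_classes * len(y)).
    t0 = time.time()
    where_indices = [np.where(y == class_i)[0] for class_i in classes]
    print('repeated np.where:    %.2fs' % (time.time() - t0))

    # Proposed approach: a single pass over y_indices to group positions by class.
    t0 = time.time()
    class_indices = np.zeros((len(classes), class_counts.max()), dtype=int)
    count = np.zeros(len(classes), dtype=int)
    for pos, cls in enumerate(y_indices):
        class_indices[cls, count[cls]] = pos
        count[cls] += 1
    print('single-pass grouping: %.2fs' % (time.time() - t0))

    # Both strategies must expose the same per-class indices.
    for i in range(len(classes)):
        assert np.array_equal(class_indices[i, :class_counts[i]], where_indices[i])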

@arthurmensch changed the title from "StratifiedShuffleSplit is slow when using large number of classes" to "PERF : StratifiedShuffleSplit is slow when using large number of classes" on Dec 9, 2015
@amueller
Member

Can you check if this is still relevant with the new code? I guess it probably is...

@mbrio

mbrio commented Jun 21, 2017

I can certainly state that running StratifiedShuffleSplit on a large number of classes is still incredibly slow. I began a process about 10 hours ago on a dataset with many classes (>1000) that is only 5 GB, and it has not yet completed.

@jnothman
Member

How does it compare to RepeatedStratifiedKFold? Is there a reason to prefer StratifiedShuffleSplit over that?

@jnothman
Member

jnothman commented Jun 21, 2017

I agree with @arthurmensch that the repeated np.where((y == class_i)) is absurd and would welcome a patch.

I also have a much simpler implementation of _approximate_mode (ping @amueller who will, I'm sure, be pleased), which we can throw in if it's worth changing the determinism given a fixed seed:

import numpy as np
from sklearn.utils import check_random_state


def _approximate_mode(class_counts, n_draws, rng):
    """
    This builds on the fact that x = (class_counts / class_counts.sum() * n_draws)
    gives us an approximate count for each class. If we want integers that still sum
    to n_draws, we can use np.diff(np.r_[0, np.round(x.cumsum())]). And if we want
    randomization, all we need to do is permute class_counts so that the fractional
    parts of the cumulative sum go to different classes.
    """
    rng = check_random_state(rng)
    perm = rng.permutation(len(class_counts))
    inv_perm = np.zeros(len(perm), dtype=int)
    inv_perm[perm] = np.arange(len(perm))
    # Fractional cumulative draws per (permuted) class, rounded so the diffs sum to n_draws.
    cumprop = class_counts[perm].cumsum()
    cumprop = cumprop * n_draws / cumprop[-1]
    permuted_counts = np.diff(np.hstack([0, np.round(cumprop)]))
    return permuted_counts.astype(int)[inv_perm]
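
A quick sanity check of the function above (a sketch with made-up counts, using the imports from the snippet):

    class_counts = np.array([50, 30, 15, 5])
    draws = _approximate_mode(class_counts, n_draws=10, rng=0)
    assert draws.sum() == 10                # integer counts that still sum to n_draws
    assert np.all(draws <= class_counts)    # never asks for more than a class holds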

I am, however, still not entirely convinced that we need StratifiedShuffleSplit, really.

@jnothman added the Easy (Well-defined and straightforward way to resolve), Enhancement, and Need Contributor labels on Jun 21, 2017
@jnothman
Member

Btw, the code in the original post:

        class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
        count = np.zeros(n_classes, dtype='int')
        for i in range(len(y_indices)):
            class_indices[y_indices[i], count[y_indices[i]]] = i
            count[y_indices[i]] += 1

I think this can be done as:

class_indices = np.split(np.argsort(y_indices), np.cumsum(class_counts)[:-1])

(Yes, I know it's asymptotically slower by a logarithmic factor.)
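
For what it's worth, a tiny check of the equivalence (a sketch; note that a stable sort, e.g. kind='mergesort', is needed if the within-class index order returned by np.where is to be reproduced exactly):

    import numpy as np

    y_indices = np.array([2, 0, 1, 0, 2, 2, 1])
    class_counts = np.bincount(y_indices)

    # A stable sort keeps indices within each class in their original (ascending) order.
    class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                             np.cumsum(class_counts)[:-1])

    for class_i, idx in enumerate(class_indices):
        assert np.array_equal(idx, np.where(y_indices == class_i)[0])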

@jnothman
Member

If train_size + test_size < 100%, we also have the option to permute n_i[i] + t_i[i] elements rather than class_counts[i].

@mbrio

mbrio commented Jun 21, 2017

@jnothman I was under the impression that the *KFold classes did not allow you to supply values for the percentage of training vs. test data; this was the reason behind my use of StratifiedShuffleSplit. Also, I'm not using the dev release of scikit-learn, so I do not have access to RepeatedStratifiedKFold; that's introduced in 0.19-dev, correct? Finally, since the data I'm using is very large, I'm using a split from the raw data of 0.4 and 0.08 for testing purposes. I'm looking into testing StratifiedKFold right now.

@jnothman
Member

jnothman commented Jun 21, 2017 via email

@vene
Member

vene commented Jun 26, 2017

permute n_i[i] + t_i[i]

Wouldn't this mean that the indices between n_i[i] + t_i[i] + 1 and class_counts[i] are never selected? It seems to me that this would deterministically rule out the same samples (but I'm likely missing something).

@jnothman
Member

No, you're probably right and I wasn't thinking straight. But we could use sample_without_replacement and let it choose which method ... not that that's worked so well in the past!
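
For illustration (a sketch with made-up numbers; sample_without_replacement lives in sklearn.utils.random), drawing n_i[i] + t_i[i] indices for one class without permuting all class_counts[i] of them could look like:

    import numpy as np
    from sklearn.utils.random import sample_without_replacement

    class_count_i = 1000   # class_counts[i], made up
    n_needed = 75          # n_i[i] + t_i[i], made up

    picked = sample_without_replacement(n_population=class_count_i,
                                        n_samples=n_needed,
                                        random_state=0)
    assert len(np.unique(picked)) == n_needed   # distinct indices into the class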

@amueller
Member

@jnothman I'm very happy if we can simplify this, though I didn't look at your implementation in detail. You say it's equivalent but doesn't give identical results because of different randomization?
Did it pass the tests? They are actually pretty strict.

@jnothman
Member

jnothman commented Jun 28, 2017 via email
