PERF : StratifiedShuffleSplit is slow when using large number of classes #5991

Closed
arthurmensch opened this Issue Dec 9, 2015 · 12 comments

arthurmensch commented Dec 9, 2015

When using a large number of classes (e.g. > 10000, as in recommender systems), StratifiedShuffleSplit is very slow compared to ShuffleSplit. Looking at the code, I believe that the following part:

            for i, class_i in enumerate(classes):
                permutation = rng.permutation(class_counts[i])
                perm_indices_class_i = np.where((y == class_i))[0][permutation]

l. 1070 in sklearn.model_selection._split is suboptimal: we should build an index structure holding the indices for each class in the dataset, which requires only a single pass over the data (perhaps together with a bincount over the classes). Indeed, np.where does a full pass over y at each call, leading to O(n_classes * len(y)) complexity, whereas O(len(y)) would suffice.

I obtain a significant performance gain by doing:

        class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
        count = np.zeros(n_classes, dtype='int')
        for i in range(len(y_indices)):
            class_indices[y_indices[i], count[y_indices[i]]] = i
            count[y_indices[i]] += 1

and subsequently replacing

perm_indices_class_i = np.where((y == class_i))[0][permutation]

by

perm_indices_class_i = class_indices[i, :class_counts[i]][permutation]

This is still suboptimal, given that we iterate over the values of y within a Python loop. I believe the proper way to do this would be to create a bincount_with_ref Cython function in arrayfuncs.pyx that would both count the occurrence of each class and accumulate the class indices in a class_indices array. Memory usage goes up by len(y) * sizeof(int), which is typically small compared to the size of X.

Would this be useful? I'll have to provide benchmarks!
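[For illustration, a self-contained sketch of that single-pass grouping, using the variable names from the snippet above with arbitrary toy sizes, and checking that it agrees with the repeated np.where:]

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.randint(50, size=1000)                      # 50 classes, 1000 samples
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = len(classes)
class_counts = np.bincount(y_indices)

# Single pass over y: row i of class_indices collects the sample
# positions belonging to class i (padded with zeros past class_counts[i]).
class_indices = np.zeros((n_classes, class_counts.max()), dtype=int)
count = np.zeros(n_classes, dtype=int)
for i in range(len(y_indices)):
    class_indices[y_indices[i], count[y_indices[i]]] = i
    count[y_indices[i]] += 1

# Agrees with the repeated np.where, at O(len(y)) total work instead of
# O(n_classes * len(y)).
for i, class_i in enumerate(classes):
    assert np.array_equal(class_indices[i, :class_counts[i]],
                          np.where(y == class_i)[0])
```

[The proposed Cython bincount_with_ref would do exactly this loop at C speed.]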

@arthurmensch arthurmensch changed the title from StratifiedShuffleSplit is slow when using large number of classes to PERF : StratifiedShuffleSplit is slow when using large number of classes Dec 9, 2015

amueller commented Sep 14, 2016

can you check if this is still relevant with the new code? I guess probably it is...

mbrio commented Jun 21, 2017

I can certainly state that running StratifiedShuffleSplit on a large number of classes is still incredibly slow. I began a process about 10 hours ago on a dataset with many classes (>1000) that is only 5 GB, and it has not yet completed.

jnothman commented Jun 21, 2017

How does it compare to RepeatedStratifiedKFold? Is there a reason to prefer StratifiedShuffleSplit over that?


jnothman commented Jun 21, 2017

I agree with @arthurmensch that the repeated np.where((y == class_i)) is absurd and would welcome a patch.

I also have a much simpler implementation of _approximate_mode (ping @amueller who will, I'm sure, be pleased), which we can throw in if it's worth changing the determinism given a fixed seed:

def _approximate_mode(class_counts, n_draws, rng):
    """
    This builds on the fact that x = (class_counts / class_counts.sum() * n_draws)
    gives us an approximate count for each class. If we want integers that still sum
    to n_draws, we can use np.diff(np.r_[0, np.round(x.cumsum())]). And if we want
    randomization, all we need to do is permute class_counts so that the fractional
    parts of the cumulative sum go to different classes.
    """
    rng = check_random_state(rng)
    perm = rng.permutation(len(class_counts))
    inv_perm = np.zeros(len(perm), dtype=int)
    inv_perm[perm] = np.arange(len(perm))
    cumprop = class_counts[perm].cumsum()
    cumprop = cumprop * n_draws / cumprop[-1]
    permuted_counts = np.diff(np.hstack([0, np.round(cumprop)]))
    return permuted_counts.astype(int)[inv_perm]

I am, however, still not entirely convinced that we need StratifiedShuffleSplit, really.
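[A self-contained sanity check of the _approximate_mode snippet above, with sklearn's check_random_state replaced by a plain numpy RandomState so it runs standalone; the class counts are arbitrary:]

```python
import numpy as np

def approximate_mode(class_counts, n_draws, rng):
    # Standalone version of the snippet above; check_random_state is
    # replaced by numpy's RandomState for self-containment.
    rng = np.random.RandomState(rng)
    perm = rng.permutation(len(class_counts))
    inv_perm = np.zeros(len(perm), dtype=int)
    inv_perm[perm] = np.arange(len(perm))
    cumprop = class_counts[perm].cumsum()
    cumprop = cumprop * n_draws / cumprop[-1]
    permuted_counts = np.diff(np.hstack([0, np.round(cumprop)]))
    return permuted_counts.astype(int)[inv_perm]

class_counts = np.array([500, 300, 200])
draws = approximate_mode(class_counts, 10, rng=0)
# Rounding the scaled cumulative sum guarantees the integer counts
# still add up to exactly n_draws.
assert draws.sum() == 10
assert np.all(draws >= 0)
```

[Because these proportions divide n_draws exactly, the result here is [5, 3, 2] regardless of the internal permutation; the permutation only matters for distributing fractional remainders.]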


jnothman commented Jun 21, 2017

Btw, the code in the original post:

        class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
        count = np.zeros(n_classes, dtype='int')
        for i in range(len(y_indices)):
            class_indices[y_indices[i], count[y_indices[i]]] = i
            count[y_indices[i]] += 1

I think this can be done as:

class_indices = np.split(np.argsort(y_indices), np.cumsum(class_counts)[:-1])

(Yes, I know it's asymptotically slower by a logarithmic factor.)
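[A quick check of that one-liner on toy data; kind="mergesort" makes the sort stable, so each chunk keeps the original sample order and matches np.where exactly:]

```python
import numpy as np

rng = np.random.RandomState(42)
y_indices = rng.randint(20, size=500)          # 20 classes, 500 samples
class_counts = np.bincount(y_indices)

# One argsort groups every sample index by class; cutting the sorted
# array at the cumulative class boundaries yields one chunk per class.
order = np.argsort(y_indices, kind="mergesort")
class_indices = np.split(order, np.cumsum(class_counts)[:-1])

for class_i, idx in enumerate(class_indices):
    assert np.array_equal(idx, np.where(y_indices == class_i)[0])
```

[With the default (unstable) sort the chunks hold the same index sets but possibly in a different within-class order, which would change results under a fixed seed.]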


jnothman commented Jun 21, 2017

If train_size + test_size < 100%, we also have the option to permute n_i[i] + t_i[i] elements rather than class_counts[i].


mbrio commented Jun 21, 2017

@jnothman I was under the impression that the *KFold classes did not allow you to supply values for the % of training vs. test data; this was the reason behind my use of StratifiedShuffleSplit. Also, I'm not using the dev release of scikit-learn, so I do not have access to RepeatedStratifiedKFold; that was introduced in 0.19-dev, correct? Finally, since the data I'm using is very large, I'm using a split from the raw data of 0.4 and 0.08 for testing purposes. I'm looking into testing StratifiedKFold right now.



vene commented Jun 26, 2017

permute n_i[i] + t_i[i]

wouldn't this mean that the indices between n_i[i] + t_i[i] + 1 and class_counts[i] are never selected? It seems to me that this would deterministically rule out the same samples. (But I'm likely missing something.)
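[A toy illustration of the concern, with arbitrary sizes (10 class members, k = n_i[i] + t_i[i] = 6 draws): permuting only the first k positions can never reach the tail, while permuting everything and truncating can:]

```python
import numpy as np

rng = np.random.RandomState(0)
members = np.arange(10)        # positions of one class's samples
k = 6                          # n_i[i] + t_i[i] draws wanted

# Permuting only k values can never select positions k..9:
reachable = set()
for _ in range(1000):
    reachable.update(rng.permutation(k))
assert reachable == set(range(k))          # tail positions 6..9 excluded

# Permuting all members and truncating can reach any position:
reachable = set()
for _ in range(1000):
    reachable.update(rng.permutation(len(members))[:k])
assert reachable == set(range(len(members)))
```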


jnothman commented Jun 26, 2017

No, you're probably right and I wasn't thinking straight. But we could use sample_without_replacement and let it choose which method ... not that that's worked so well in the past!
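[A numpy-only sketch of the sampling-without-replacement idea, with arbitrary sizes; sklearn's sample_without_replacement additionally selects among several internal strategies, whereas rng.choice with replace=False is the plain equivalent here:]

```python
import numpy as np

rng = np.random.RandomState(0)
class_count_i = 100_000        # members of one large class
k = 10                         # n_i[i] + t_i[i] draws actually needed

# Instead of permuting all class members, draw just k distinct positions.
picked = rng.choice(class_count_i, size=k, replace=False)

assert len(picked) == k
assert len(set(picked.tolist())) == k                  # all distinct
assert picked.min() >= 0 and picked.max() < class_count_i
```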

jmschrei closed this in #9197 Jun 26, 2017


amueller commented Jun 28, 2017

@jnothman I'm very happy if we can simplify this, though I didn't look at your implementation in detail. You say it's equivalent but doesn't give identical results because of different randomization?
Did it pass the tests? They are actually pretty strict.
