PERF: StratifiedShuffleSplit is slow when using a large number of classes #5991
Comments
Can you check if this is still relevant with the new code? I guess it probably is... |
I can certainly state that running StratifiedShuffleSplit on a large number of classes is still incredibly slow. I started a run about 10 hours ago on a dataset with many classes (>1000) that is only 5 GB, and it has not yet completed. |
How does it compare to RepeatedStratifiedKFold? Is there a reason to prefer StratifiedShuffleSplit over that? |
I agree with @arthurmensch that the repeated […]. I also have a much simpler implementation:

    def _approximate_mode(class_counts, n_draws, rng):
        """
        This builds on the fact that x = class_counts / class_counts.sum() * n_draws
        gives us an approximate count for each class. If we want integers that still
        sum to n_draws, we can use np.diff(np.r_[0, np.round(x.cumsum())]). And if we
        want randomization, all we need to do is permute class_counts so that the
        fractional parts of the cumulative sum go to different classes.
        """
        rng = check_random_state(rng)
        perm = rng.permutation(len(class_counts))
        inv_perm = np.zeros(len(perm), dtype=int)
        inv_perm[perm] = np.arange(len(perm))
        cumprop = class_counts[perm].cumsum()
        cumprop = cumprop * n_draws / cumprop[-1]
        permuted_counts = np.diff(np.hstack([0, np.round(cumprop)]))
        return permuted_counts.astype(int)[inv_perm]

I am, however, still not entirely convinced that we need […] |
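The rounding trick above can be sanity-checked in isolation. A minimal numpy-only sketch (`approx_counts` is a renamed illustration of the same idea, with `np.random.RandomState` standing in for sklearn's `check_random_state`):

```python
import numpy as np

def approx_counts(class_counts, n_draws, rng):
    # Permute the classes, take the scaled cumulative sum, round it,
    # and diff: the fractional parts then land on random classes.
    perm = rng.permutation(len(class_counts))
    inv_perm = np.zeros(len(perm), dtype=int)
    inv_perm[perm] = np.arange(len(perm))
    cumprop = class_counts[perm].cumsum()
    cumprop = cumprop * n_draws / cumprop[-1]
    permuted_counts = np.diff(np.hstack([0, np.round(cumprop)]))
    return permuted_counts.astype(int)[inv_perm]

counts = np.array([100, 10, 5, 1])
draws = approx_counts(counts, 25, np.random.RandomState(0))
print(draws.sum())  # prints 25: the rounded cumulative sum ends exactly at n_draws
```

Because the last entry of the scaled cumulative sum is exactly `n_draws`, the per-class counts always sum to the requested number of draws, with the rounding error of at most one sample spread over randomly permuted classes.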
Btw, the code in the original post:

    class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
    count = np.zeros(n_classes, dtype='int')
    for i in range(len(y_indices)):
        class_indices[y_indices[i], count[y_indices[i]]] = i
        count[y_indices[i]] += 1

I think this can be done as:

    class_indices = np.split(np.argsort(y_indices), np.cumsum(class_counts)[:-1])

(Yes, I know it's asymptotically slower by a logarithmic factor.) |
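The two constructions can be checked against each other on a toy example; note that `argsort` needs `kind='stable'` for the within-class order to match the counting loop exactly (an added assumption here, since plain `np.argsort` is not stable):

```python
import numpy as np

y_indices = np.array([2, 0, 1, 0, 2, 2, 1, 0])  # class label of each sample
n_classes = 3
class_counts = np.bincount(y_indices)

# Counting-loop version from the original post.
class_indices = np.zeros((n_classes, class_counts.max()), dtype=int)
count = np.zeros(n_classes, dtype=int)
for i in range(len(y_indices)):
    class_indices[y_indices[i], count[y_indices[i]]] = i
    count[y_indices[i]] += 1

# Single-sort version; a stable sort keeps within-class sample order.
split_indices = np.split(np.argsort(y_indices, kind='stable'),
                         np.cumsum(class_counts)[:-1])

for c in range(n_classes):
    assert np.array_equal(split_indices[c], class_indices[c, :class_counts[c]])
```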
If |
@jnothman I was under the impression that the *KFold classes did not allow you to supply values for the % of training vs test data; this was the reason behind my use of StratifiedShuffleSplit. |
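For reference, StratifiedShuffleSplit does take explicit train/test fractions. A minimal sketch (assuming scikit-learn; the toy data are illustrative, and the 0.4/0.08 fractions are the ones mentioned in this thread):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 100 samples, two imbalanced classes.
X = np.zeros((100, 1))
y = np.array([0] * 70 + [1] * 30)

# Explicit train/test fractions; each split is stratified by y.
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.4, test_size=0.08,
                             random_state=0)
train_idx, test_idx = next(sss.split(X, y))
print(len(train_idx), len(test_idx))  # prints: 40 8
```

The *KFold classes, by contrast, fix the test fraction at 1/n_splits, which is the limitation described above.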
Okay. I get that wanting a smaller portion of the data for training may be appropriate. A stratified sample from StratifiedKFold could potentially be another solution there. Hmm. Perhaps I've been under-thinking being unsettled about StratifiedShuffleSplit. In any case it can be made faster.
…On 21 June 2017 at 23:36, Michael Diolosa ***@***.***> wrote:
@jnothman <https://github.com/jnothman> I was under the impression that
the *KFold classes did not allow you to supply values for the % of training
vs test data; this was the reason behind my use of StratifiedShuffleSplit.
Also, I'm not using the dev release of scikit so I do not have access to
RepeatedStratifiedKFold; that's introduced in 0.19-dev, correct? Finally,
since the data I'm using is very large I'm using a split from the raw data
of 0.4 and 0.08 for testing purposes. I'm looking into testing
StratifiedKFold right now.
|
wouldn't this mean that the indices between |
No, you're probably right and I wasn't thinking straight. But we could use |
@jnothman I'm very happy if we can simplify this, though I didn't look at your implementation in detail. You say it's equivalent but doesn't give identical results because of different randomization? |
IIRC it passed all tests that did not stipulate particular values. But it is really not a priority for 0.19.
…On 28 Jun 2017 1:24 pm, "Andreas Mueller" ***@***.***> wrote:
@jnothman <https://github.com/jnothman> I'm very happy if we can simplify
this, though I didn't look at your implementation in detail. You say it's
equivalent but doesn't give identical results because of different
randomization?
Did it pass the tests? They are actually pretty strict.
|
When using a large number of classes (e.g. > 10000, e.g. for recommender systems), StratifiedShuffleSplit is very slow compared to ShuffleSplit. Looking at the code, I believe that the following part (l. 1070 in sklearn.model_selection._split) is suboptimal: we should build an index matrix holding the indices for each class in the dataset (implying a single pass over the data, maybe along with a bincount(classes)). Indeed, np.where does a pass over y at each call, leading to O(n_classes * len(y)) complexity, whereas it could be O(len(y)) only.

I obtain a significant gain in perf doing:

    class_indices = np.zeros((n_classes, class_counts.max()), dtype='int')
    count = np.zeros(n_classes, dtype='int')
    for i in range(len(y_indices)):
        class_indices[y_indices[i], count[y_indices[i]]] = i
        count[y_indices[i]] += 1

and subsequently replacing […] by […].

This is suboptimal given we iterate over y values within a Python loop. I believe that the proper way to do this would be to create a bincount_with_ref Cython function that would both count the occurrence of classes and accumulate class indices in a class_indices array, in arrayfuncs.pyx. Memory usage goes up by len(y) * sizeof('int'), which is typically small compared to the size of X.

Would this be useful? I'll have to provide benchmarks!
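The complexity argument can be illustrated with a numpy-only sketch (not the actual sklearn code; the single-pass side here uses a stable sort plus bincount rather than the proposed bincount_with_ref Cython helper):

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_samples = 1000, 20000
y = rng.randint(n_classes, size=n_samples)

# O(n_classes * len(y)): np.where scans all of y once per class.
where_groups = [np.where(y == c)[0] for c in range(n_classes)]

# One stable sort over y groups every class at once; the split
# boundaries come from a single bincount pass.
order = np.argsort(y, kind='stable')
boundaries = np.cumsum(np.bincount(y, minlength=n_classes))[:-1]
sorted_groups = np.split(order, boundaries)

# Both approaches yield identical per-class index lists.
assert all(np.array_equal(a, b) for a, b in zip(where_groups, sorted_groups))
```

A counting pass in Cython, as proposed, would also drop the logarithmic sort factor and reach O(len(y)).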