[MRG+2] switch to multinomial composition for mixture sampling #7702
Conversation
This may be a problem, not sure. This reminds me of the issue we had in StratifiedShuffleSplit (#6472), where drawing samples for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid randomly sampling.
It's not a problem: `np.random.multinomial` is fast. On my computer, for weights of shape (1000000,):
%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop
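For reference, a minimal runnable sketch of the check quoted above; the Dirichlet-generated `weights` vector here is just an illustrative stand-in for a fitted mixture's weights:

```python
import numpy as np

rng = np.random.RandomState(0)
# One million mixture weights that sum to 1 (illustrative stand-in).
weights = rng.dirichlet(np.ones(1000000))

# Draw the per-component sample counts in a single multinomial draw.
n_samples_comp = rng.multinomial(100000000, weights)

# Unlike rounding, the counts always add up to the requested total.
assert n_samples_comp.sum() == 100000000
```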
I think this is an acceptable use of
@@ -385,7 +385,7 @@ def sample(self, n_samples=1):
         _, n_features = self.means_.shape
         rng = check_random_state(self.random_state)
-        n_samples_comp = np.round(self.weights_ * n_samples).astype(int)
+        n_samples_comp = rng.multinomial(n_samples, self.weights_).astype(int)
jnothman
Oct 19, 2016
Member
the astype should not be necessary.
Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!
Yes, will do.
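To make the failure mode concrete, here is a small sketch, with `weights` and `n_samples` chosen purely for illustration, contrasting the old rounding approach with the multinomial draw:

```python
import numpy as np

weights = np.ones(3) / 3.0  # three equal mixture weights
n_samples = 10

# Old approach: each count rounds 10/3 down to 3, losing a sample.
rounded = np.round(weights * n_samples).astype(int)
print(rounded, rounded.sum())  # [3 3 3] 9

# New approach: a multinomial draw always sums to n_samples.
drawn = np.random.RandomState(0).multinomial(n_samples, weights)
print(drawn, drawn.sum())  # e.g. [3 4 3] 10
```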
(force-pushed from aa9cae1 to 614dd4a)
I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?
@@ -956,6 +957,13 @@ def test_sample():
                          for k in range(n_features)])
    assert_array_almost_equal(gmm.means_, means_s, decimal=1)

    # Check that sizes that are drawn match what is requested
    assert_equal(X_s.shape, (n_samples, n_components))
    for sample_size in [4, 101, 1004, 5051]:
jnothman
Oct 20, 2016
Member
Is there a particular reason to try with a large number? Would `for sample_size in range(50)` suffice?
ljwolf
Oct 20, 2016
Author
Contributor
I think either would be sufficient. Earlier in the test, the sample size is 20,000. It should be fine to do `range(1, k)`, too. Should I use that instead?
(force-pushed from 42d38eb to f25eacf)
As long as the test runs quite quickly, this LGTM.
Please update what's new.
I just swapped to the
Could you please try updating master and rebasing on it?
Alright, I've rebased onto master & the tests are off to the races.
LGTM
@@ -956,6 +957,13 @@ def test_sample():
                          for k in range(n_features)])
    assert_array_almost_equal(gmm.means_, means_s, decimal=1)

    # Check that sizes that are drawn match what is requested
    assert_equal(X_s.shape, (n_samples, n_components))
    for sample_size in range(1, 50):
lesteve
Oct 20, 2016
Member
This test does not fail on master, so it looks like you are not testing the edge case you discovered.
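For illustration, a hypothetical check in the spirit of this remark; the toy data and parameters are assumptions, but the idea is that three roughly balanced components and a sample size that is not a multiple of three should trip the old rounding code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.rand(300, 2)  # toy data; the weights end up roughly equal

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
X_s, y_s = gmm.sample(n_samples=10)

# On master, the rounded composition can sum to 9 instead of 10.
assert X_s.shape[0] == 10
```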
I used n_components=3 in the test to exercise the actual regression seen in #7701. As part of this, it seems that n_features and n_components were swapped in a few places. @tguillemot, can you quickly check whether what I changed makes sense?
@lesteve Sorry for these mistakes. n_components and n_features were equal, and one was used for the other in some places.
LGTM
@lesteve nope, that's what they're there for!
AppVeyor is taking quite some time; this PR should be merged if it comes back green.
Merged, thanks a lot!
@lesteve for future reference, please use the squash-and-merge feature; that makes cherry-picking much simpler.
Oops, sorry about that. I always try to remember to do squash and merge, but it looks like I missed this one.
Looks like you can allow only squash and merge if you want to. The settings are available from https://github.com/scikit-learn/scikit-learn/settings. Should we do that?
@lesteve yes, I enabled that. Hm, though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.
Whoops, we'd forgotten to tag this for 0.18.1. Done now.
Hm, sorry, this didn't make it into 0.18.1. My bad :(
Reference Issue
fixes #7701
What does this implement/fix? Explain your changes.
This changes the way mixture models construct the composition of new samples. Specifically, any subclass deriving from `BaseMixture.sample` is affected.

Instead of rounding the composition vector from `weights * n_samples`, this draws the composition vector from a multinomial distribution. Thus, the samples returned are guaranteed to have the number of observations requested by the user. However, the composition of the sample is now stochastic.

In addition, this adds tests to ensure that `n_samples` observations are returned when `mixture.sample(n_samples)` is called.

This may affect scaling for mixture models with a very large number of components, since the multinomial composition draw may be slow. But this composition draw only occurs once during sampling.
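As a quick usage sketch of the guarantee described above (the toy data and parameters here are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(42).randn(500, 2)
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

for n in (1, 7, 103):
    X_s, y_s = gmm.sample(n_samples=n)
    # Exactly n observations come back, whatever the component split.
    assert X_s.shape == (n, 2)
    assert len(y_s) == n
```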
Any other comments?
None. Thanks for the great package!