
[MRG+2] switch to multinomial composition for mixture sampling #7702

Merged
merged 3 commits into scikit-learn:master from ljwolf:gaussmix_sampling on Oct 20, 2016

Conversation

@ljwolf (Contributor) commented Oct 19, 2016

Reference Issue

fixes #7701

What does this implement/fix? Explain your changes.

This changes the way mixture models construct the composition of new samples. Specifically, any subclass deriving from BaseMixture is affected through BaseMixture.sample.

Instead of rounding the composition vector from weights * n_samples, this draws the composition vector from a multinomial distribution. Thus, the returned samples are guaranteed to contain exactly the number of observations requested by the user. However, the composition of the sample is now stochastic.
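
For illustration, a minimal sketch of the difference (the three equal weights and n_samples = 7 are made-up values, not taken from this PR):

import numpy as np

weights = np.array([1 / 3, 1 / 3, 1 / 3])
n_samples = 7

# Old behaviour: round each component's expected count independently.
# The rounded counts need not sum to n_samples.
n_old = np.round(weights * n_samples).astype(int)
print(n_old, n_old.sum())    # [2 2 2] 6; one sample short

# New behaviour: a single multinomial draw over the mixture weights.
# The counts are stochastic but always sum to n_samples.
rng = np.random.RandomState(0)
n_new = rng.multinomial(n_samples, weights)
print(n_new, n_new.sum())    # e.g. [3 2 2] 7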

In addition, this adds tests to ensure that exactly n_samples observations are returned when mixture.sample(n_samples) is called.

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow. However, this composition draw only occurs once during sampling.

Any other comments?

None. Thanks for the great package!

@tguillemot (Contributor) commented Oct 19, 2016

Can you add a test with the example you've put in #7701 to the function test_sample in mixture/test_gaussian_mixture.py?
@ljwolf Thanks.

@lesteve (Member) commented Oct 19, 2016

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

This may be a problem, not sure.

This reminds me of the issue we had in StratifiedShuffleSplit (#6472), where the samples drawn for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid sampling randomly.
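
For reference, one deterministic scheme of this kind (a sketch of the general technique only, not necessarily what #6472 actually implemented) floors each group's share and hands the leftover samples to the largest fractional remainders:

import numpy as np

def apportion(n_samples, weights):
    # Floor each group's exact share, then give the remaining samples to
    # the groups with the largest fractional parts. The counts always sum
    # to n_samples, with no random draw involved.
    shares = np.asarray(weights) * n_samples
    counts = np.floor(shares).astype(int)
    remainder = n_samples - counts.sum()
    extra = np.argsort(shares - counts)[::-1][:remainder]
    counts[extra] += 1
    return counts

print(apportion(7, [1 / 3, 1 / 3, 1 / 3]))    # sums to exactly 7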

@tguillemot (Contributor) commented Oct 19, 2016

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

It's not a problem. The function sample is not used during the fitting process.
Moreover, as weights is linked to the number of components and GaussianMixture is not designed to deal with a lot of components (fewer than 1e6), the time is really OK.

On my computer, for weights of shape (1000000,):

%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop
@jnothman (Member) left a comment

I think this is an acceptable use of rng.multinomial. Please add a test.

@@ -385,7 +385,7 @@ def sample(self, n_samples=1):

         _, n_features = self.means_.shape
         rng = check_random_state(self.random_state)
-        n_samples_comp = np.round(self.weights_ * n_samples).astype(int)
+        n_samples_comp = rng.multinomial(n_samples, self.weights_).astype(int)

@jnothman (Member) commented Oct 19, 2016

the astype should not be necessary.
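
For context, here is a simplified sketch of how the composition vector feeds the rest of sample (loosely modelled on the full-covariance path of BaseMixture.sample; the weights, means, and covariances are made-up values). It also shows that rng.multinomial already returns integer counts, which is why the astype is redundant:

import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000
weights_ = np.array([0.2, 0.3, 0.5])        # mixture weights, sum to 1
means_ = np.zeros((3, 2))                   # one mean per component
covariances_ = np.stack([np.eye(2)] * 3)    # one covariance per component

# A single multinomial draw fixes each component's share of the sample.
n_samples_comp = rng.multinomial(n_samples, weights_)
print(n_samples_comp.dtype)                 # already an integer dtype

# Draw each component's share from its Gaussian and stack the results;
# the total row count is exactly n_samples by construction.
X = np.vstack([
    rng.multivariate_normal(mean, cov, size=n)
    for mean, cov, n in zip(means_, covariances_, n_samples_comp)
])
assert X.shape == (n_samples, 2)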

@amueller (Member) commented Oct 19, 2016

Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!

@ljwolf (Contributor, Author) commented Oct 19, 2016

Yes, will do.

@ljwolf ljwolf force-pushed the ljwolf:gaussmix_sampling branch 2 times, most recently from aa9cae1 to 614dd4a Oct 19, 2016
@ljwolf (Contributor, Author) commented Oct 20, 2016

I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?

@@ -956,6 +957,13 @@ def test_sample():
                           for k in range(n_features)])
         assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+        # Check that sizes that are drawn match what is requested
+        assert_equal(X_s.shape, (n_samples, n_components))
+        for sample_size in [4, 101, 1004, 5051]:

@jnothman (Member) commented Oct 20, 2016

Is there a particular reason to try with a large number? Would for sample_size in range(50) suffice?

@ljwolf (Contributor, Author) commented Oct 20, 2016

I think either would be sufficient. Earlier in the test, the sample size is 20,000. It should be fine to do range(1, k), too. Should I use that instead?

@ljwolf ljwolf force-pushed the ljwolf:gaussmix_sampling branch 2 times, most recently from 42d38eb to f25eacf Oct 20, 2016
@jnothman (Member) commented Oct 20, 2016

As long as the test runs quite quickly, this LGTM.

@jnothman (Member) commented Oct 20, 2016

Please update what's new.

@jnothman jnothman changed the title switch to multinomial composition for mixture sampling [MRG+1] switch to multinomial composition for mixture sampling Oct 20, 2016
@ljwolf (Contributor, Author) commented Oct 20, 2016

I just swapped to the range(1, 50) construct and noted the added test in the original statement. That Travis report looks pretty strange... tons of errors that look like build errors?

@jnothman (Member) commented Oct 20, 2016

Could you please try updating master and rebasing on it?

@ljwolf ljwolf force-pushed the ljwolf:gaussmix_sampling branch from f25eacf to 55672f9 Oct 20, 2016
@ljwolf (Contributor, Author) commented Oct 20, 2016

Alright, I've rebased onto master & the tests are off to the races.

@tguillemot (Contributor) commented Oct 20, 2016

LGTM

@@ -956,6 +957,13 @@ def test_sample():
                           for k in range(n_features)])
         assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+        # Check that sizes that are drawn match what is requested
+        assert_equal(X_s.shape, (n_samples, n_components))
+        for sample_size in range(1, 50):

@lesteve (Member) commented Oct 20, 2016

This test does not fail on master, so it looks like you are not testing the edge case you discovered.

@lesteve (Member) left a comment

I used n_components=3 for the test, to exercise the actual regression seen in #7701.

As part of this, it seems that n_features and n_components were swapped in a few places. @tguillemot, can you quickly check whether what I changed makes sense?
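
To make the distinction concrete, a quick sketch (the training data and the sample size of 41 are made-up values): with n_components=3 and n_features=2, the sampled array has one column per feature, not per component, and with this PR's fix its row count is exactly the number requested:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(300, 2)    # n_features = 2
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

X_s, y_s = gmm.sample(41)
assert X_s.shape == (41, 2)    # (n_samples, n_features), not n_components
assert y_s.shape == (41,)      # one component label per sampled point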

@lesteve (Member) commented Oct 20, 2016

I used n_components=3 for the test to test the actual regression seen in #7701.

I forgot to say it explicitly: I used the recent GitHub feature that allows people with enough admin rights to push to PR branches. @ljwolf, I hope you don't mind :-).

n_components and n_features were equal and one was used for the other in some places.
@lesteve lesteve force-pushed the ljwolf:gaussmix_sampling branch from b4913d8 to 4e1c101 Oct 20, 2016
@tguillemot (Contributor) commented Oct 20, 2016

@lesteve Sorry for these mistakes. It's n_components indeed.

@amueller (Member) commented Oct 20, 2016

LGTM

@amueller amueller changed the title [MRG+1] switch to multinomial composition for mixture sampling [MRG+2] switch to multinomial composition for mixture sampling Oct 20, 2016
@ljwolf (Contributor, Author) commented Oct 20, 2016

@lesteve nope, that's what they're there for! :shipit:

@lesteve lesteve dismissed jnothman’s stale review Oct 20, 2016

Test has now been added

@lesteve (Member) commented Oct 20, 2016

AppVeyor is taking quite some time; this PR should be merged if it comes back green.

@lesteve lesteve merged commit ad6f094 into scikit-learn:master Oct 20, 2016
3 checks passed:
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@lesteve (Member) commented Oct 20, 2016

Merged, thanks a lot!

@amueller (Member) commented Oct 25, 2016

@lesteve for future reference, please use the squash and merge feature, which makes cherry-picking much simpler.

@lesteve (Member) commented Oct 25, 2016

Oops sorry for that, I always try to remember to do squash and merge but it looks like I missed this one.

@lesteve (Member) commented Oct 25, 2016

Looks like you can only allow squash and merge if you want to:
https://github.com/blog/2141-squash-your-commits

The settings are available from: https://github.com/scikit-learn/scikit-learn/settings

Should we do that?

@amueller (Member) commented Oct 25, 2016

@lesteve yes, I enabled that. Hm, though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.

@jnothman jnothman added this to the 0.18.1 milestone Nov 5, 2016
@jnothman (Member) commented Nov 5, 2016

Whoops, we'd forgotten to tag this for 0.18.1. Done now.

@amueller (Member) commented Nov 17, 2016

Hm, sorry, this didn't make it into 0.18.1. My bad :(
