[MRG+2] switch to multinomial composition for mixture sampling #7702

Merged
merged 3 commits into scikit-learn:master on Oct 20, 2016

@ljwolf
Contributor

ljwolf commented Oct 19, 2016

Reference Issue

fixes #7701

What does this implement/fix? Explain your changes.

This changes the way mixture models construct the composition of new samples. Specifically, the sample method of any subclass deriving from BaseMixture is affected.

Instead of rounding the composition vector computed from weights * n_samples, this draws the composition vector from a multinomial distribution. Thus, the samples returned are guaranteed to contain the number of observations requested by the user. However, the composition of the sample is now stochastic.
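As a sketch of the difference (the weights here are illustrative, not taken from the PR): rounding the expected per-component counts can drop or add observations, while a multinomial draw always sums to the requested total.

```python
import numpy as np

rng = np.random.RandomState(0)
weights = np.ones(3) / 3  # hypothetical mixture weights summing to 1
n_samples = 10

# Old approach: round the expected count per component.
# Each component gets round(10/3) = 3 observations, so only 9 are returned.
n_samples_comp_rounded = np.round(weights * n_samples).astype(int)
print(n_samples_comp_rounded.sum())  # 9, not the requested 10

# New approach: draw the composition from a multinomial distribution.
# The per-component counts are stochastic, but they always sum to n_samples.
n_samples_comp = rng.multinomial(n_samples, weights)
print(n_samples_comp.sum())  # always 10
```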

In addition, this adds tests to ensure that n_samples observations are returned when mixture.sample(n_samples) is called.
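A minimal version of such a check might look like the following; the training data and parameters are illustrative, not the PR's actual test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(300, 2)  # hypothetical training data

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
for n_samples in (1, 7, 100):
    X_s, y_s = gm.sample(n_samples)
    # Exactly n_samples observations and component labels come back.
    assert X_s.shape == (n_samples, 2)
    assert y_s.shape == (n_samples,)
```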

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow. However, this composition draw occurs only once per call to sample.

Any other comments?

None. Thanks for the great package!

@tguillemot


Contributor

tguillemot commented Oct 19, 2016

Can you add a test, using the example from #7701, to the function test_sample in mixture/test_gaussian_mixture.py?
@ljwolf Thanks.

@lesteve


Member

lesteve commented Oct 19, 2016

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

This may be a problem, not sure.

This reminds me of the issue we had in StratifiedShuffleSplit (#6472), where drawing samples for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid randomly sampling.

@tguillemot

Contributor

tguillemot commented Oct 19, 2016

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

It's not a problem. The function sample is not used during the fitting process.
Moreover, as weights is linked to the number of components and GaussianMixture is not designed to deal with a large number of components (fewer than 1e6), the time is really OK.

On my computer, for weights of shape (1000000,):

%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop
@jnothman

I think this is an acceptable use of rng.multinomial. Please add a test.

(Outdated review comment on sklearn/mixture/base.py)
@amueller

Member

amueller commented Oct 19, 2016

Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!

@ljwolf

Contributor

ljwolf commented Oct 19, 2016

Yes, will do.

@ljwolf

Contributor

ljwolf commented Oct 20, 2016

I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?

@jnothman

Member

jnothman commented Oct 20, 2016

As long as the test runs quite quickly, this LGTM.

@jnothman

Member

jnothman commented Oct 20, 2016

Please update what's new

@jnothman jnothman changed the title from switch to multinomial composition for mixture sampling to [MRG+1] switch to multinomial composition for mixture sampling Oct 20, 2016

@ljwolf

Contributor

ljwolf commented Oct 20, 2016

I just swapped to the range(1, 50) construct and noted the added test in the original PR description. That Travis report looks pretty strange... tons of errors that look like build errors?

@jnothman

Member

jnothman commented Oct 20, 2016

Could you please try updating master and rebasing on it?

@ljwolf

Contributor

ljwolf commented Oct 20, 2016

Alright, I've rebased to master & the tests are off to the races.

@tguillemot

Contributor

tguillemot commented Oct 20, 2016

LGTM

@lesteve

I used n_components=3 for the test to test the actual regression seen in #7701.

As part of this it seems that n_features and n_components were swapped in a few places. @tguillemot can you quickly check whether what I changed makes sense.

@lesteve

Member

lesteve commented Oct 20, 2016

I used n_components=3 for the test to test the actual regression seen in #7701.

I forgot to say it explicitly: I used the recent GitHub feature that allows people with enough admin rights to push into PR branches. @ljwolf I hope you don't mind :-).

Use n_components=3 to test actual regression
n_components and n_features were equal and one was used for the other in
some places.
@tguillemot


Contributor

tguillemot commented Oct 20, 2016

@lesteve Sorry for these mistakes. It's n_components indeed.

@amueller


Member

amueller commented Oct 20, 2016

LGTM

@amueller amueller changed the title from [MRG+1] switch to multinomial composition for mixture sampling to [MRG+2] switch to multinomial composition for mixture sampling Oct 20, 2016

@ljwolf


Contributor

ljwolf commented Oct 20, 2016

@lesteve nope, that's what they're there for! :shipit:

Test has now been added

@lesteve


Member

lesteve commented Oct 20, 2016

AppVeyor is taking quite some time; this PR should be merged if it comes back green.

@lesteve lesteve merged commit ad6f094 into scikit-learn:master Oct 20, 2016

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@lesteve


Member

lesteve commented Oct 20, 2016

Merged, thanks a lot!

@amueller


Member

amueller commented Oct 25, 2016

@lesteve for future reference, please use the squash and merge feature; that makes cherry-picking much simpler.

@lesteve


Member

lesteve commented Oct 25, 2016

Oops, sorry about that. I always try to remember to do squash and merge, but it looks like I missed this one.

@lesteve


Member

lesteve commented Oct 25, 2016

Looks like you can allow only squash and merge if you want to:
https://github.com/blog/2141-squash-your-commits

The settings are available from: https://github.com/scikit-learn/scikit-learn/settings

Should we do that?

@amueller


Member

amueller commented Oct 25, 2016

@lesteve yes, I enabled that. Hm, though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.

@jnothman jnothman added this to the 0.18.1 milestone Nov 5, 2016

@jnothman


Member

jnothman commented Nov 5, 2016

Whoops, we'd forgotten to tag this for 0.18.1. Done now.

@amueller


Member

amueller commented Nov 17, 2016

Hm, sorry, this didn't make it into 0.18.1. My bad :(
