DOCS: Model selection and Model averaging #938

Closed
springcoil opened this Issue Jan 18, 2016 · 35 comments

springcoil (Member) commented Jan 18, 2016

Someone (it'll probably be me) should read this and write up some notes: http://arxiv.org/abs/1503.08650

This will give a good explanation of the strengths and weaknesses of the various model-selection metrics we have, and of what we still need to add to PyMC3.

Adding this here as a placeholder.

This is for you @jonsedar and @twiecki

jonsedar (Contributor) commented Jan 18, 2016

Good find, that looks really useful. If you have time to update the model selection example, go for it - otherwise I'll find time over the next couple of weeks.

springcoil (Member) commented Jan 18, 2016

Do you think it should all be included in the same documentation/notebook? Just a few notes summarising the math, or do we have other things to add?

jonsedar (Contributor) commented Jan 18, 2016

I'd link to the paper, perhaps add in some math where appropriate.

There's also a TODO on crossval, which I'm planning to add. It's made fairly straightforward by using sklearn.cross_validation.ShuffleSplit to select train/test indices and theano.shared variables to easily substitute the train/test sets, then using pm.sample_ppc to make predictions on the test set. I've used it in my own work and can add it soon, but am a little pressed for time this week. If you're looking for something to do, please feel free!
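A minimal sketch of that workflow, assuming the old sklearn.cross_validation API named above (model_selection in newer scikit-learn); the toy data, model, and variable names are illustrative, not the eventual notebook code:

import numpy as np
import pymc3 as pm
import theano
from sklearn.cross_validation import ShuffleSplit

# toy data for a simple linear model
X = np.random.randn(200)
y = 2. * X + np.random.randn(200)

# one random train/test split (equal halves keep the observed shapes identical)
train_idx, test_idx = next(iter(ShuffleSplit(len(X), n_iter=1, test_size=0.5)))

# shared containers let us substitute train/test sets without rebuilding the model
X_shared = theano.shared(X[train_idx])
y_shared = theano.shared(y[train_idx])

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0., sd=10.)
    sd = pm.HalfNormal('sd', sd=1.)
    pm.Normal('obs', mu=beta * X_shared, sd=sd, observed=y_shared)
    trace = pm.sample(1000)

# swap in the test set and draw posterior predictive samples for it
X_shared.set_value(X[test_idx])
y_shared.set_value(y[test_idx])
with model:
    ppc = pm.sample_ppc(trace, samples=500)  # ppc['obs'] holds test-set predictions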

avehtari commented Jan 18, 2016

For cross-validation see also https://github.com/avehtari/PSIS. There's Python reference code for Pareto smoothed importance sampling cross-validation, too.

springcoil (Member) commented Jan 18, 2016

Guys, I'm a bit pressed this week. If I have time I'll have a stab at one of those contributions. :)

springcoil (Member) commented Sep 3, 2016

I've not actually gotten around to doing this, but @fonnesbeck, do we have cross-validation now?

fonnesbeck (Member) commented Sep 3, 2016

We have leave-one-out cross-validation in loo.
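A minimal usage sketch on a toy model (pm.loo and pm.waic live in pymc3.stats; the model here is just for illustration):

import numpy as np
import pymc3 as pm

y = np.random.randn(50)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0., sd=10.)
    pm.Normal('obs', mu=mu, sd=1., observed=y)
    trace = pm.sample(1000)

print(pm.loo(trace, model=model))   # leave-one-out (PSIS-LOO) estimate
print(pm.waic(trace, model=model))  # WAIC, for comparison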

springcoil (Member) commented Mar 13, 2017

We have some of this cross-validation stuff now, and I think it's a lot easier to add examples/documentation.

I've not really got the time or the capacity, so I'll leave this as an exercise for the reader :)

aloctavodia (Member) commented Jul 6, 2017

I just want to mention that we already have three example notebooks related to model selection.

Certainly there is room for improvement (unification?) of these examples.

junpenglao (Member) commented Jul 6, 2017

Thanks @aloctavodia. What I am doing now is:

1. extending the GLM-model-selection notebook (adding bpic and loo),
2. maybe combining model_comparison and model_averaging together, or extending model_comparison with math and explanation from @avehtari's survey paper, or both of the above (input welcome).

aloctavodia (Member) commented Jul 6, 2017

I think adding some theory to model_comparison is something we should do. Not sure about merging it with model_averaging; my only concern is ending up with a notebook that is too long.

Should we remove bpic, given that we have DIC, WAIC and LOO?

junpenglao (Member) commented Jul 6, 2017

I am not too familiar with BPIC, but BPIC and DIC are more or less the same with different weights:

DIC: 2 * mean_deviance - deviance_at_mean
BPIC: 3 * mean_deviance - 2 * deviance_at_mean
avehtari commented Jul 6, 2017

First, it's great to have demos!

I recommend dropping DIC and BPIC, as you can always recommend using WAIC instead, and having too many options just causes confusion.

I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in http://link.springer.com/article/10.1007/s11222-016-9696-4 (preprint https://arxiv.org/abs/1507.04544), but if you insist on having one information criterion, then keep WAIC.

For model averaging see https://arxiv.org/abs/1704.02030. The Bayesian stacking introduced in that paper doesn't require much more computation than information-criterion-based weighting, but is much better.

Aki

twiecki (Member) commented Jul 6, 2017

@avehtari thanks for chiming in. Those suggestions make sense to me.

junpenglao (Member) commented Jul 7, 2017

@avehtari thanks for the great suggestions, I will take that into account.

@aloctavodia Does the weight output from pm.compare support Bayesian stacking as suggested by @avehtari in the comments above?

junpenglao added a commit to junpenglao/pymc3 that referenced this issue Jul 7, 2017

add comments from @avehtari
pymc-devs#938 (comment)

I am keeping the DIC and BPIC in the doc as we need something to introduce these features in the docs.
aloctavodia (Member) commented Jul 7, 2017

@junpenglao The paper suggested by @avehtari is on my to-read list, so the ideas in that paper are not incorporated in pm.compare yet. At this point pm.compare performs a very simple AIC-type weighting (but uses WAIC or LOO instead of AIC). From a quick look at the paper I think we can improve what pm.compare does by doing a Bayesian bootstrap (but I still have to read the details). Additionally, we could have Bayesian stacking as another option; apparently it is a more reliable method, but also a more expensive one.
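For reference, a rough stand-alone sketch of that AIC-type weighting (illustrative only, not the actual pm.compare implementation): exponentiate minus half the IC differences and normalize.

import numpy as np

def ic_weights(ic_values):
    # ic_values: WAIC/LOO values on the deviance scale (lower is better)
    d_ic = np.asarray(ic_values) - np.min(ic_values)
    rel_likelihood = np.exp(-0.5 * d_ic)
    return rel_likelihood / rel_likelihood.sum()

# e.g. the two WAIC values reported further down the thread give weights of roughly [0.62, 0.38]
print(ic_weights([515.267, 516.225]))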

junpenglao (Member) commented Jul 7, 2017

Sounds good to me. But maybe we should have a separate function for model averaging?

aseyboldt (Member) commented Jul 7, 2017

@aloctavodia Something that came up in the Bound doc: currently Bound doesn't normalize the density so that it integrates to 1. Wouldn't this break the model comparison code? It does require the actual logp of the observations, not just something proportional to it, right? If so, that should be stated clearly in the docs.

junpenglao (Member) commented Jul 7, 2017

Hmm, good point @aseyboldt; both WAIC and LOO compute the logp of each element in the model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/stats.py#L127), so it might be problematic with Bound.

aloctavodia (Member) commented Jul 7, 2017

@junpenglao at this point we can use compare to compute weights and then sample_ppc_w, which accepts a set of weights, to compute weighted posterior predictive samples. We could also have a compareplot that can be used together with compare to analyze the result of the comparison.

@aseyboldt I think you are right and that could be problematic. I remember that I was able to reproduce the values of WAIC from the book Statistical Rethinking; I should probably check this again...
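A rough sketch of that compare/sample_ppc_w workflow, assuming two already-fitted models (model1/trace1 and model2/trace2 as in the example further down); the weight handling is illustrative:

# compute WAIC-based weights for the candidate models
comp_df = pm.compare([trace1, trace2], [model1, model2])

# compare sorts its output by WAIC, so re-order the weights to match the model list,
# then draw weighted posterior predictive samples
weights = comp_df.weight.sort_index(ascending=True)
ppc_w = pm.sample_ppc_w([trace1, trace2], samples=1000,
                        models=[model1, model2], weights=weights)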

junpenglao (Member) commented Jul 7, 2017

@aloctavodia cool, in that case we can just add a few more columns to the output of the pm.compare function to have the Bayesian stacking weights, etc.

@junpenglao junpenglao changed the title from DOCS: Model selection to DOCS: Model selection and Model averaging Jul 7, 2017

aloctavodia (Member) commented Jul 7, 2017

@junpenglao yeah, that's right. Or alternatively, we can ask compare to compute the weights one way or the other; I guess the decision will depend on the computational cost of getting those weights.

twiecki added a commit that referenced this issue Jul 7, 2017

add comments from @avehtari
#938 (comment)

I am keeping the DIC and BPIC in the doc as we need something to introduce these features in the docs.
junpenglao (Member) commented Jul 7, 2017

FYI @aloctavodia @aseyboldt

import numpy as np
import pymc3 as pm

# simulated data, with negative values truncated at zero
x = np.random.randn(100, 2) + 1
x[x < 0] = 0

# model 1: unconstrained Normal prior on mu
with pm.Model() as model1:
    mu = pm.Normal('mu', 0., 100., shape=2)
    sd = pm.Gamma('sd', alpha=1., beta=1.)
    obs = pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace1 = pm.sample()

# model 2: same likelihood, but the prior on mu is bounded below at zero
with pm.Model() as model2:
    mu = pm.Bound(pm.Normal, lower=0.)('mu', mu=0., sd=100., shape=2)
    sd = pm.Gamma('sd', alpha=1., beta=1.)
    obs = pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace2 = pm.sample()

pm.compare([trace1, trace2], [model1, model2])
      WAIC    pWAIC     dWAIC    weight       SE        dSE warning
1  515.267  2.38154         0  0.617567  16.4203          0       0
0  516.225  2.90537  0.958468  0.382433  16.5625  0.0030158       0

pm.compare([trace1, trace2], [model1, model2], ic='LOO')
       LOO     pLOO     dLOO    weight       SE         dSE warning
1  515.451  2.47385        0  0.637353  16.4317           0       1
0  516.579  3.08234  1.12778  0.362647  16.5911  0.00699399       1
aloctavodia (Member) commented Jul 7, 2017

Great, @junpenglao! It seems that the differences are due to sample size; increasing the sample size should reduce them.

aseyboldt (Member) commented Jul 7, 2017

@aloctavodia Why sample size? The only difference between the models is how the HalfNormal prior on mu is defined; they should be identical. But the logp for mu in the second version is always half that of the logp in the first model.
I'm not sure how to deal with that other than just putting a couple of fat warnings in the doc and docstring.
I guess proper support for truncated distributions (using the cdf of the distributions) would solve it.
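For what that last point would look like, a small illustrative sketch (using scipy rather than the PyMC3 API): a distribution truncated below at some bound keeps its shape but must be renormalized by the remaining probability mass via the cdf.

import numpy as np
from scipy import stats

lower = 0.
dist = stats.norm(loc=0., scale=100.)

def truncated_logpdf(x, dist, lower):
    # logp of the lower-truncated density: original logp minus the log of the mass above `lower`
    return dist.logpdf(x) - np.log(1. - dist.cdf(lower))

# for a Normal(0, 100) truncated at 0 this adds log(2) to the un-normalized logp
print(truncated_logpdf(1., dist, lower), dist.logpdf(1.))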

aloctavodia (Member) commented Jul 7, 2017

@junpenglao I mean the samples from the posterior. I am not able to check this at the moment, but I think the two traces are not exactly equal, so the differences are explained just by sampling. Could you compare the traces and also increase the number of steps?

aseyboldt (Member) commented Jul 7, 2017

I tried it with 1e5 samples and the second weight is still smaller than the first. Also, the standard error of the difference for both LOO and WAIC is much smaller than the difference itself. I reran it a couple of times and got this result consistently.

aseyboldt (Member) commented Jul 7, 2017

Sorry, I did it again and can't reproduce it now. Maybe I messed up the sorting order once or twice earlier. However, the standard error seems to be off: if I use the same model twice, the dWAIC values are consistently much larger than dSE:

      WAIC    pWAIC      dWAIC    weight       SE          dSE warning
1  46.5785  2.26211          0  0.502283  3.90641            0       1
0  46.5967  2.27682  0.0182618  0.497717  3.90201  0.000133433       1

Or am I interpreting the values incorrectly?

aseyboldt (Member) commented Jul 7, 2017

Is it possible there is a missing square root in the SE and dSE somewhere?

aloctavodia (Member) commented Jul 8, 2017

@aseyboldt you are right! There is a mistake: the square root is misplaced. The line d_se = len(diff) ** 0.5 * np.var(diff) should be d_se = (len(diff) * np.var(diff)) ** 0.5.

junpenglao (Member) commented Jul 8, 2017

@aloctavodia are you sending a PR?

junpenglao (Member) commented Jul 8, 2017

Also, isn't the standard error usually computed as SD/sqrt(n), i.e. np.std(diff) / np.sqrt(len(diff))?
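Note that the two formulas estimate different things (a sketch, with diff standing in for the pointwise WAIC/LOO differences): sd/sqrt(n) is the standard error of a mean, while sqrt(n * var) is the standard error of a sum, and WAIC/LOO as reported above are sums of pointwise terms.

import numpy as np

diff = np.random.randn(100) * 0.1              # stand-in for pointwise WAIC/LOO differences
se_of_mean = np.std(diff) / np.sqrt(len(diff))
se_of_sum = (len(diff) * np.var(diff)) ** 0.5  # equals sqrt(n) * std, i.e. n times se_of_mean
print(se_of_mean, se_of_sum)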

aloctavodia (Member) commented Jul 8, 2017

@junpenglao I will not be able to send a PR anytime soon. Today I am beginning my vacation and I did not bring my notebook with me. If you or someone else wants to work on this, there are a couple of references with related functions you can check: https://github.com/stan-dev/loo and https://github.com/rmcelreath/rethinking

Also, there are some examples of WAIC computations that can be compared to the ones in the Statistical Rethinking book here: https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3

After my vacation I have a couple of things to do and then travel back to Argentina, so I will not be able to work on this until the last days of July. Sorry for that.

junpenglao (Member) commented Jul 8, 2017

@aloctavodia no problem! enjoy your holiday ;-)

junpenglao added a commit to junpenglao/pymc3 that referenced this issue Jul 8, 2017

denadai2 added a commit to denadai2/pymc3 that referenced this issue Aug 9, 2017

add comments from @avehtari
pymc-devs#938 (comment)

I am keeping the DIC and BPIC in the doc as we need something to introduce these features in the docs.

denadai2 added a commit to denadai2/pymc3 that referenced this issue Aug 9, 2017

aloctavodia (Member) commented Aug 19, 2017

I think all this is basically covered now
