Incorporate dask to handle predictions and comparisons with large number of observations #5371
Replies: 1 comment
One could already hack something together, but IMO we should discuss if/how to extend/change the McBackend metadata schema to better integrate with "the other". So the tricky part is really one of understanding the Bayesian workflow and mapping it to a relational schema.
Background
This discussion came up after a conversation between @OriolAbril, @ricardoV94 and @lucianopaz. We were interested in being able to do model comparisons on a dataset that had ~1M observations. The problem is that we usually have ~4K samples in the posterior, and model comparison requires us to compute the pointwise `logp`, which means storing about ~4G 64-bit values in memory (roughly 32 GB of RAM), leading to `MemoryError`.

We started drafting ideas of what we should change in order to rely on dask distributed arrays to balance the memory footprint of model comparisons. We then also decided to look into the changes that would be needed to accommodate prior and posterior predictive samples in distributed arrays.
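To make the numbers concrete, here is a minimal sketch of the memory arithmetic and of how a dask-backed pointwise log-likelihood keeps the footprint bounded. The shapes and chunk sizes are illustrative assumptions, not part of the proposal:

```python
import dask.array as da
import numpy as np

n_samples = 4_000  # posterior draws (chains * draws)
n_obs = 1_000_000  # observations

# Dense float64 array: 4e9 values * 8 bytes ~= 32 GB -> MemoryError on most machines.
dense_bytes = n_samples * n_obs * np.dtype("float64").itemsize
print(f"dense pointwise logp: {dense_bytes / 1e9:.0f} GB")

# A dask array of the same shape only materializes one chunk at a time.
# Chunking along the observation dimension keeps each block at ~320 MB.
pointwise_logp = da.zeros((n_samples, n_obs), dtype="float64", chunks=(n_samples, 10_000))

# Reductions stream over the chunks instead of loading everything into RAM;
# nothing is actually computed until .compute() is called.
lppd = da.log(da.exp(pointwise_logp).mean(axis=0)).sum()
print(lppd.compute())
```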
What needs to be done
PyMC codebase

- Change `sample_posterior_predictive` to use something that is different from `_DefaultTrace`
- Change `sample_prior_predictive` to pass a dask array to `to_inference_data` (see the sketch after this list)
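As a rough illustration of the second item, here is a hand-rolled sketch (assumed variable names, not the proposed PyMC implementation) of packing prior predictive draws into a dask-backed `InferenceData` via xarray, which accepts dask arrays directly:

```python
import arviz as az
import dask.array as da
import xarray as xr

n_draws, n_obs = 1_000, 1_000_000

# Stand-in for prior predictive draws of one observed variable "y";
# a real implementation would fill the chunks from the model instead of noise.
y_prior_pred = da.random.normal(
    0.0, 1.0, size=(1, n_draws, n_obs), chunks=(1, n_draws, 10_000)
)

prior_predictive = xr.Dataset({"y": (("chain", "draw", "obs_id"), y_prior_pred)})

# xarray keeps the dask array lazy, so the InferenceData stays out of core.
idata = az.InferenceData(prior_predictive=prior_predictive)
print(type(idata.prior_predictive["y"].data))  # still a dask array, nothing computed
```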
ArviZ codebase

- Refactor `az.loo`, in addition to rhat, so it can be computed with dask via `arviz.psislw`
- Review the usage of `.values`, `.data` and `.compute()` so that large arrays are not loaded into memory unnecessarily
- Add `dask_kwargs` to both `az.loo` and `az.psislw`
ArviZ should then be able to do the computations for model comparison out of the box, because xarray can interface properly with Dask.
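The `.values` / `.data` distinction in the list above is the crux: `.values` coerces to NumPy and loads the whole array, while `.data` returns the lazy dask array. A generic xarray/dask sketch of the pattern (not ArviZ internals; the `logmeanexp` helper is made up for illustration):

```python
import dask.array as da
import numpy as np
import xarray as xr

log_lik = xr.DataArray(
    da.random.normal(size=(4, 1_000, 100_000), chunks=(4, 1_000, 10_000)),
    dims=("chain", "draw", "obs_id"),
)

lazy = log_lik.data      # dask array: no memory allocated yet
# eager = log_lik.values # would materialize the full ~3.2 GB NumPy array

# apply_ufunc with dask="parallelized" maps a NumPy kernel over the chunks;
# this is the pattern that lets statistics like psislw run out of core.
def logmeanexp(x, axis):
    # Numerically stable log of the mean of exp over the sample axes.
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.mean(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

lppd_i = xr.apply_ufunc(
    logmeanexp,
    log_lik,
    input_core_dims=[["chain", "draw"]],
    kwargs={"axis": (-2, -1)},
    dask="parallelized",
    output_dtypes=[float],
)
print(lppd_i.sum().compute())  # only now does dask execute, chunk by chunk
```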
Ideas going forward: