Incorporate dask to handle predictions and comparisons with large number of observations #5371
Replies: 1 comment
One could already hack something together, but IMO we should discuss if/how to extend/change the McBackend metadata schema to better integrate with "the other". So the tricky part is really one of understanding the Bayesian workflow and mapping it to a relational schema.
Background
This discussion came up after a conversation between @OriolAbril, @ricardoV94 and @lucianopaz. We were interested in being able to do model comparisons on a dataset that had ~1M observations. The problem is that we usually have ~4K samples in the posterior, and model comparison requires us to compute the pointwise `logp`, which means storing about ~4G 64-bit values in memory (roughly 32 GB of RAM), leading to `MemoryError`.

We started drafting ideas of what we should change in order to rely on dask distributed arrays to balance the memory footprint of model comparisons. We then also decided to look into the changes that would be needed to accommodate prior and posterior predictive samples in distributed arrays.
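To make the numbers concrete, here is a minimal sketch of the memory arithmetic and of how a dask-backed pointwise log-likelihood keeps the footprint bounded. The shapes and chunk sizes are illustrative assumptions, not part of the proposal:

```python
import dask.array as da
import numpy as np

n_samples = 4_000  # posterior draws (chains * draws)
n_obs = 1_000_000  # observations

# Dense float64 array: 4e9 values * 8 bytes ~= 32 GB -> MemoryError on most machines.
dense_bytes = n_samples * n_obs * np.dtype("float64").itemsize
print(f"dense pointwise logp: {dense_bytes / 1e9:.0f} GB")

# A dask array of the same shape only materializes one chunk at a time.
# Chunking along the observation dimension keeps each block at ~320 MB.
pointwise_logp = da.zeros((n_samples, n_obs), dtype="float64", chunks=(n_samples, 10_000))

# Reductions stream over the chunks instead of loading everything into RAM;
# nothing is actually computed until .compute() is called.
lppd = da.log(da.exp(pointwise_logp).mean(axis=0)).sum()
print(lppd.compute())
```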
What needs to be done
PyMC codebase

- Change `sample_posterior_predictive` to use something that is different from `_DefaultTrace`
- Change `sample_prior_predictive` to pass a dask array to `to_inference_data` (see the sketch after this list)
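As a rough illustration of the second item, here is a hand-rolled sketch (assumed variable names, not the proposed PyMC implementation) of packing prior predictive draws into a dask-backed `InferenceData` via xarray, which accepts dask arrays directly:

```python
import arviz as az
import dask.array as da
import xarray as xr

n_draws, n_obs = 1_000, 1_000_000

# Stand-in for prior predictive draws of one observed variable "y";
# a real implementation would fill the chunks from the model instead of noise.
y_prior_pred = da.random.normal(
    0.0, 1.0, size=(1, n_draws, n_obs), chunks=(1, n_draws, 10_000)
)

prior_predictive = xr.Dataset({"y": (("chain", "draw", "obs_id"), y_prior_pred)})

# xarray keeps the dask array lazy, so the InferenceData stays out of core.
idata = az.InferenceData(prior_predictive=prior_predictive)
print(type(idata.prior_predictive["y"].data))  # still a dask array, nothing computed
```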
ArviZ codebase

- Refactor `az.loo`, in addition to rhat, so it can be computed with dask via `arviz.psislw`
- Review the usage of `.values`, `.data` and `.compute()` so that large arrays are not loaded into memory unnecessarily
- Add `dask_kwargs` to both `az.loo` and `az.psislw`
ArviZ should then be able to do the computations for model comparison out of the box, because xarray can interface properly with Dask.
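The `.values` / `.data` distinction in the list above is the crux: `.values` coerces to NumPy and loads the whole array, while `.data` returns the lazy dask array. A generic xarray/dask sketch of the pattern (not ArviZ internals; the `logmeanexp` helper is made up for illustration):

```python
import dask.array as da
import numpy as np
import xarray as xr

log_lik = xr.DataArray(
    da.random.normal(size=(4, 1_000, 100_000), chunks=(4, 1_000, 10_000)),
    dims=("chain", "draw", "obs_id"),
)

lazy = log_lik.data      # dask array: no memory allocated yet
# eager = log_lik.values # would materialize the full ~3.2 GB NumPy array

# apply_ufunc with dask="parallelized" maps a NumPy kernel over the chunks;
# this is the pattern that lets statistics like psislw run out of core.
def logmeanexp(x, axis):
    # Numerically stable log of the mean of exp over the sample axes.
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.mean(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

lppd_i = xr.apply_ufunc(
    logmeanexp,
    log_lik,
    input_core_dims=[["chain", "draw"]],
    kwargs={"axis": (-2, -1)},
    dask="parallelized",
    output_dtypes=[float],
)
print(lppd_i.sum().compute())  # only now does dask execute, chunk by chunk
```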
Ideas going forward: