Where do you want to store your MCMC draws? In memory? On disk? Or in a database running in a datacenter?
No matter where you want to put them, or which PPL generates them: McBackend takes care of your MCMC samples.
The mcbackend package consists of three parts:
No matter which programming language your favorite PPL is written in, the ProtocolBuffers from McBackend can be used to generate code in languages like C++, C#, Python and many more to represent commonly used metadata about MCMC runs, chains and model variables.
The definitions in protobufs/meta.proto are designed to maximize compatibility with ArviZ objects, making it easy to transform MCMC draws stored according to the McBackend schema to InferenceData objects for plotting & analysis.
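To illustrate the idea, here is a rough sketch of the kind of run/variable metadata the schema captures. These are hypothetical dataclasses written for this example; the real classes are generated from protobufs/meta.proto by the protobuf compiler and may have different names and fields.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical dataclasses mirroring the kind of metadata meta.proto describes.
# The actual classes are code-generated from protobufs/meta.proto.

@dataclass
class Variable:
    """Metadata about one model variable."""
    name: str
    dtype: str
    shape: List[int] = field(default_factory=list)
    dims: List[str] = field(default_factory=list)

@dataclass
class RunMeta:
    """Metadata about one MCMC run."""
    rid: str
    variables: List[Variable] = field(default_factory=list)

meta = RunMeta(
    rid="example-run",
    variables=[Variable("mu", "float64"), Variable("sigma", "float64")],
)
print(meta.variables[0].name)  # → mu
```

Because this metadata is language-agnostic at the protobuf level, the same run description can be read by tooling in C++, C#, or any other language with protobuf support.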
The draws and stats created by MCMC sampling algorithms at runtime need to be stored somewhere.
This "somewhere" is called the storage backend in PPLs/MCMC frameworks like PyMC or emcee.
Most storage backends must be initialized with metadata about the model variables so they can, for example, pre-allocate memory for the draws and stats they're about to receive.
After receiving thousands of draws and stats, they must provide methods by which the draws and stats can be retrieved.
The mcbackend.core module has classes such as Backend, Run, and Chain to define these interfaces for any storage backend, no matter if it's an in-memory, filesystem, or database storage.
Although this implementation is currently Python-only, the interface signature should be portable to e.g. C++.
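The layering described above can be sketched as plain Python abstract classes, together with a toy in-memory implementation. Names and signatures here are illustrative, written for this example; they are not copied from mcbackend.core.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence

# Hypothetical sketch of layered Backend -> Run -> Chain interfaces.
# Not mcbackend.core's actual API; for illustration only.

class Chain(ABC):
    """Receives draws one by one and serves them back."""

    @abstractmethod
    def append(self, draw: Dict[str, List[float]]) -> None: ...

    @abstractmethod
    def get_draws(self, var_name: str) -> List[List[float]]: ...

class Run(ABC):
    """One MCMC run; creates and lists its chains."""

    @abstractmethod
    def init_chain(self, chain_number: int) -> Chain: ...

    @abstractmethod
    def get_chains(self) -> Sequence[Chain]: ...

class Backend(ABC):
    """Entry point into a storage medium (memory, disk, database)."""

    @abstractmethod
    def init_run(self, run_id: str) -> Run: ...

# A toy in-memory implementation of the same interfaces:
class DictChain(Chain):
    def __init__(self) -> None:
        self._draws: List[Dict[str, List[float]]] = []

    def append(self, draw):
        self._draws.append(draw)

    def get_draws(self, var_name):
        return [d[var_name] for d in self._draws]

class DictRun(Run):
    def __init__(self) -> None:
        self._chains: List[Chain] = []

    def init_chain(self, chain_number):
        chain = DictChain()
        self._chains.append(chain)
        return chain

    def get_chains(self):
        return self._chains

class DictBackend(Backend):
    def __init__(self) -> None:
        self._runs: Dict[str, Run] = {}

    def init_run(self, run_id):
        run = DictRun()
        self._runs[run_id] = run
        return run
```

The point of the layering is that a sampler only ever talks to a Chain, while the Backend decides where the bytes actually go.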
Via mcbackend.backends the McBackend package then provides backend implementations.
Currently you may choose from:
backend = mcbackend.NumPyBackend()
backend = mcbackend.ClickHouseBackend(client=clickhouse_driver.Client("localhost"))
# All that matters:
isinstance(backend, mcbackend.Backend)
# >>> True
Anything that is a Backend can be wrapped by an adapter that makes it compatible with your favorite PPL.
In the example below, a ClickHouseBackend is initialized to store MCMC draws from a PyMC model in a ClickHouse database.
See below for how to run it in Docker.
import clickhouse_driver
import mcbackend
import pymc as pm
# 1. Create _any_ kind of backend
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)
with pm.Model():
    # 2. Create your model
    ...

    # 3. Hit the inference button ™ while passing the backend!
    trace = pm.sample(trace=backend)
In case of PyMC the adapter lives in the PyMC codebase since version 5.1.1, so all you need to do is pass any mcbackend.Backend via the pm.sample(trace=...) parameter!
Instead of using PyMC's built-in NumPy backend, the MCMC draws now end up in ClickHouse.
Continuing the example from above we can now retrieve draws from the backend.
Note that since this example wrote the draws to ClickHouse, we could run the code below on another machine, and even while the above model is still sampling!
backend = mcbackend.ClickHouseBackend(ch_client)
# Fetch the run from the database (downloads just metadata)
run = backend.get_run(trace.run_id)
# Get all draws from a chain
chain = run.get_chains()[0]
chain.get_draws("my favorite variable")
# >>> array([ ... ])
# Convert everything to `InferenceData`
idata = run.to_inferencedata()
print(idata)
# >>> Inference data with groups:
# >>> > posterior
# >>> > sample_stats
# >>> > observed_data
# >>> > constant_data
# >>>
# >>> Warmup iterations saved (warmup_*).
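Conceptually, the conversion to InferenceData stacks the per-chain draws into arrays of shape (chain, draw, *variable_shape), which is the layout ArviZ expects. A minimal NumPy-only sketch of that idea (not mcbackend's actual implementation):

```python
import numpy as np

# Sketch of the array layout behind run.to_inferencedata():
# each chain holds draws of a (3,)-shaped variable,
# and the posterior group stacks them along a leading chain axis.
chain_0 = np.random.normal(size=(1000, 3))  # 1000 draws from chain 0
chain_1 = np.random.normal(size=(1000, 3))  # 1000 draws from chain 1

posterior = np.stack([chain_0, chain_1], axis=0)
print(posterior.shape)  # (2, 1000, 3) → (chain, draw, dim)
```

Because each chain is fetched independently, this layout also makes it natural to inspect partial results while other chains are still sampling.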
McBackend just started and is looking for contributions. For example:
- Schema discussion: Which metadata is needed? (related: PyMC #5160)
- Interface discussion: How should Backend/Run/Chain evolve?
- Python backends for disk storage (HDF5, *.proto, ...)
- C++ Backend/Run/Chain interfaces
- C++ ClickHouse backend (via clickhouse-cpp)
As the schema and API stabilize, a mid-term goal might be to replace PyMC's BaseTrace/MultiTrace entirely and rely on mcbackend.
Getting rid of MultiTrace was a long-term goal behind making pm.sample(return_inferencedata=True) the default.
First clone the repository and install mcbackend locally:
pip install -e .
To run the tests:
pip install -r requirements-dev.txt
pytest -v
Some tests need a ClickHouse database server running locally. To start one in Docker:
docker run --detach --rm --name mcbackend-db -p 9000:9000 --ulimit nofile=262144:262144 clickhouse/clickhouse-server
If you don't already have it, first install the protobuf compiler:
conda install protobuf
pip install --pre "betterproto[compiler]"
To compile the *.proto files for languages other than Python, check the ProtocolBuffers documentation.
The following script compiles them for Python using the betterproto compiler plugin to get nice-looking dataclasses.
It also copies the generated files to the right place in mcbackend.
python protobufs/generate.py