# `bagofholding` introduction

This notebook provides a quick rundown of the key user-facing features of `bagofholding`.

In [1]:
import numpy as np

import bagofholding as boh

`bagofholding` is intended to work with any `pickle`-able Python object, so first let's whip up a custom class to work with.

In [2]:
class MyCustomClass:
    def __init__(self, n: int):
        self.n = n
        self.name = f"my_custom_class_{n}"
        self.data = np.arange(n)

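    def __eq__(self, other):
        # Check the class first so comparing against unrelated types
        # doesn't fail on missing attributes
        if self.__class__ != other.__class__:
            return NotImplemented
        return (
            self.n == other.n
            and self.name == other.name
            and bool(np.all(self.data == other.data))
        )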

my_object = MyCustomClass(10)
my_object.__metadata__ = "Let's add some metadata reminding ourselves this was created for the example notebook"

## Basics

Storage with `bagofholding` differs from `pickle` under the hood, but is intended to be similarly easy to work with.

The underlying analogy is that we have a "bag" that we're putting our Python objects into. Presently, the only back-end we have implemented uses HDF5 with `h5py`, so let's save our object with a bag of that flavour:

In [3]:
filename = "notebook_example.h5"
boh.H5Bag.save(my_object, filename)

Saving is a class-level method, so we never actually need to instantiate a "bag" to save. For loading, we do:

In [4]:
bag = boh.H5Bag(filename)
reloaded = bag.load()

print("The reloaded object is the same as the saved object:", reloaded == my_object)

The reloaded object is the same as the saved object: True


So the basic save-load cycle is extremely straightforward. From here on, we go beyond the power of `pickle`.

We make a "bag" object before loading because, without re-instantiating anything we've saved, we can look at its internal structure! Under the hood, we leverage the same [`__reduce__` workflow that `pickle` uses](https://docs.python.org/3/library/pickle.html#object.__reduce__) in order to decompose arbitrary objects. But instead of lumping everything together in a binary blob, `bagofholding` lets us peek at these different components:

In [5]:
bag.list_paths()

['object',
 'object/args',
 'object/args/i0',
 'object/constructor',
 'object/item_iterator',
 'object/kv_iterator',
 'object/state',
 'object/state/__metadata__',
 'object/state/data',
 'object/state/n',
 'object/state/name']
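
The path names above appear to mirror the pieces of the tuple returned by `__reduce_ex__`. As a rough standard-library sketch (the `Point` class is a hypothetical stand-in, and the mapping from tuple slots to `bagofholding` path names is an assumption based on the names alone), this is the decomposition that `pickle` itself works from:

```python
class Point:
    """Hypothetical example class; not part of bagofholding."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


point = Point(1, 2)

# For pickle protocols >= 2, a plain object reduces to a 5-tuple
constructor, args, state, item_iterator, kv_iterator = point.__reduce_ex__(4)

print(args)           # the arguments handed to the constructor (here, the class)
print(state)          # the instance attributes as a dict
print(item_iterator)  # None here: only used for list-like objects
print(kv_iterator)    # None here: only used for dict-like objects

# Rebuilding by hand follows the same recipe pickle uses on load
rebuilt = constructor(*args)
rebuilt.__dict__.update(state)
```

Each slot of that tuple is something that can be stored, inspected, and reloaded separately, which is what makes browsing and partial loading possible.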

## Metadata

We have additionally scraped metadata from our object at save-time, which can be found using item-access on the bag with the appropriate path:

In [6]:
bag["object"]

Metadata(content_type='bagofholding.h5.content.Reducible', qualname='MyCustomClass', module='__main__', version=None, meta="Let's add some metadata reminding ourselves this was created for the example notebook")

Complex objects like numpy arrays get non-trivial metadata:

In [7]:
bag["object/state/data"]

Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='1.26.4', meta=None)

While for simple Python primitives we don't bother storing anything beyond the content type:

In [8]:
print(bag["object/state/n"])

Metadata(content_type='bagofholding.h5.content.Long', qualname=None, module=None, version=None, meta=None)


And, of course, we store metadata about the bag itself!

In [9]:
import re
re.sub(r"(?<=version=')[^']*", "...", str(bag.get_bag_info()))
# Don't worry about the regex, we're just replacing the version number so the automated test doesn't fail each new commit

"H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='...', libver_str='latest')"

For Jupyter users, we can browse the structure and metadata of the stored object conveniently in a GUI:

In [10]:
widget = bag.browse()
widget

BagTree(multiple_selection=False, nodes=(Node(disabled=True, icon='shopping-bag', name='Bag', nodes=(Node(disa…

## Partial loading

A powerful advantage of `bagofholding` is that objects can be only _partially_ reloaded! Since we track the internal object structure, we can pass a particular internal path within the object to reload just that piece:

In [11]:
bag.load("object/state/n")

10

This is convenient in its own right, but its real power shows when we consider long-term storage.

Suppose your colleague worked with their custom Python code to generate important data... and then left. Now you want to access that data, but don't have a Python environment that includes all of their bespoke code! Let's simulate this by resetting our kernel's knowledge, and losing access to `__main__.MyCustomClass`.

In [12]:
%reset -f

In [13]:
import bagofholding as boh

filename = "notebook_example.h5"

We can still browse the saved object:

In [14]:
bag = boh.H5Bag(filename)
bag.browse()

BagTree(multiple_selection=False, nodes=(Node(disabled=True, icon='shopping-bag', name='Bag', nodes=(Node(disa…

But of course, we are no longer able to simply reload it:

In [15]:
try:
    bag.load()
except AttributeError as e:
    print(e)

module '__main__' has no attribute 'MyCustomClass'
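
For contrast, with a plain `pickle` file this is where the story would end: the blob is all-or-nothing. The sketch below uses only the standard library to reproduce that situation; the module name `departed_colleague_lib` and the `Result` class are hypothetical stand-ins for a colleague's code.

```python
import pickle
import sys
import types

# Hypothetical module standing in for a colleague's bespoke library
module_name = "departed_colleague_lib"
fake_module = types.ModuleType(module_name)


class Result:
    def __init__(self, n):
        self.n = n


# Make the class look like it lives in the fake module so pickle can find it
Result.__module__ = module_name
Result.__qualname__ = "Result"
fake_module.Result = Result
sys.modules[module_name] = fake_module

blob = pickle.dumps(Result(10))

# Simulate a fresh environment where the library is not installed
del sys.modules[module_name]

try:
    pickle.loads(blob)
    load_error = None
except ImportError as e:
    load_error = e

# Even the plain integer `n` inside the blob is unreachable without the class
print(type(load_error).__name__, load_error)
```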


However, if we know where the data we want is stored -- either because we're familiar with the object's library even though we don't have it available right now, or simply by inspecting the object's browsable structure using `bagofholding` -- then we can still reload just that data!

In [16]:
bag.load("object/state/n")

10

In some cases we might want to have _part_ of the original environment available, i.e. the part needed to load the terminal data we're interested in. We can see what that is, right down to the version number:

In [17]:
bag["object/state/data"]

Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='1.26.4', meta=None)

And then load it once we've made that module available to our current Python interpreter:

In [18]:
import numpy as np

bag.load("object/state/data")

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this way, data stored with `bagofholding` is extremely transparent and robust.

## Version control

Another advantage to storing metadata is that we can check against stored versions at load-time to ensure that our current environment will be able to safely recreate the desired objects from the serialized data.

By default, versions are found by looking at the `__version__` attribute of a given object's base module, but since not all modules store their versioning info this way, this can be overridden on a per-module basis using the `version_scraping` argument.
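
As a rough illustration of that default lookup, here is a minimal sketch. It is not the library's implementation: `scrape_version` and the `demo_pkg` module are hypothetical names used only for this example.

```python
import importlib
import sys
import types
from typing import Optional


def scrape_version(obj: object) -> Optional[str]:
    """Look for `__version__` on the base module of the object's class."""
    base_module_name = type(obj).__module__.split(".")[0]
    base_module = importlib.import_module(base_module_name)
    version = getattr(base_module, "__version__", None)
    return None if version is None else str(version)


# Hypothetical versioned package, registered by hand so the sketch is self-contained
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__version__ = "1.2.3"
sys.modules["demo_pkg"] = demo_pkg


class Thing:
    pass


# Pretend the class was defined in a submodule of the package
Thing.__module__ = "demo_pkg.submodule"

print(scrape_version(Thing()))  # version taken from the base module `demo_pkg`
print(scrape_version(42))       # builtins carries no `__version__`
```

A module that keeps its version somewhere else would come back empty here, which is the gap a per-module override is meant to fill.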

By default, `bagofholding` will complain if two versions do not match exactly, but we can relax this with the `version_validator` argument.
For semantically versioned packages, we have granular control over how strictly the versions must match.

For the sake of this notebook, let's use the `version_scraping` dictionary to provide a custom version for `numpy` at read-time, and then explore the possibilities for `version_validator`:

In [19]:
import importlib

def change_numpy_major(module_name: str) -> str:
    """
    Report numpy's real version with the semantic *minor* field swapped for 9999.
    (Despite the function's name, the major field is left untouched.)
    """
    if module_name != "numpy":
        raise ValueError("Hey, this is supposed to be a numpy-based example!")
    numpy = importlib.import_module(module_name)
    numpy_actual_version = numpy.__version__
    semantic_breakdown = numpy_actual_version.split(".")
    semantic_breakdown[1] = "9999"  # Index 1 is the semantic minor version
    return ".".join(semantic_breakdown)

In [20]:
def print_error_without_addresses(e):
    """
    Don't worry about this, it's just so automated tests don't get hung up
    on memory addresses changing in error messages
    """
    import re

    msg = str(e)
    pattern = re.compile(r"<function (\S+) at 0x[0-9a-fA-F]+>")
    clean_message = pattern.sub(r"<function \1 ...>", msg)
    pattern_lambda = re.compile(r"<function <lambda> at 0x[0-9a-fA-F]+>")
    clean_message = pattern_lambda.sub(r"<function <lambda> ...>", clean_message)
    print(clean_message)

When our "current version of numpy" is X.9999.Z, default load behaviour will complain:

In [21]:
try:
    bag.load("object/state/data", version_scraping={"numpy": change_numpy_major})
except boh.EnvironmentMismatchError as e:
    print_error_without_addresses(e)

numpy is stored with version 1.26.4, but the current environment has 1.9999.4. This does not pass validation criterion: <function _versions_are_equal ...>


In fact, either of the choices below will complain that these versions are not compatible for loading:

In [22]:
for validation in ["exact", "semantic-minor"]:
    try:
        bag.load("object/state/data", version_validator=validation, version_scraping={"numpy": change_numpy_major})
    except boh.EnvironmentMismatchError as e:
        print_error_without_addresses(f"Can't load with {validation}: {e}")

Can't load with exact: numpy is stored with version 1.26.4, but the current environment has 1.9999.4. This does not pass validation criterion: <function _versions_are_equal ...>
Can't load with semantic-minor: numpy is stored with version 1.26.4, but the current environment has 1.9999.4. This does not pass validation criterion: <function _versions_match_semantic_minor ...>


But either of these more relaxed flags will let us proceed:

In [23]:
for validation in ["semantic-major", "none"]:
    bag.load("object/state/data", version_validator=validation, version_scraping={"numpy": change_numpy_major})
    print(f"Loaded without complaint with {validation}")

Loaded without complaint with semantic-major
Loaded without complaint with none
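
The four validator names used above can be read as progressively looser comparisons. The following is a rough sketch of that idea, not the library's actual validators; `versions_match` is a hypothetical helper written for this notebook.

```python
def versions_match(stored: str, current: str, level: str) -> bool:
    """Compare two dotted version strings at the requested strictness."""
    if level == "none":
        return True
    if level == "exact":
        return stored == current
    # How many leading fields must agree: major only, or major and minor
    depth = {"semantic-major": 1, "semantic-minor": 2}[level]
    return stored.split(".")[:depth] == current.split(".")[:depth]


for level in ["exact", "semantic-minor", "semantic-major", "none"]:
    print(level, versions_match("1.26.4", "1.9999.4", level))
```

With the stored `1.26.4` and the faked `1.9999.4`, only the major field agrees, which is consistent with the two failures and two successes shown above.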


## Save-time safety

Being able to exploit the above version control to the fullest means your stored object(s) need to come from an importable module with some sort of versioning.
To this end, we provide two save-time flags to ensure better behaviour from saved objects.

First, you can require at save-time that non-standard objects all have a version:

In [24]:
class SomethingLocalAndUnversioned:
    pass

In [25]:
try:
    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, require_versions=True)
except boh.NoVersionError as e:
    print(e)

Could not find a version for __main__. Either disable `require_versions`, use `version_scraping` to find an existing version for this package, or add versioning to the unversioned package.


And second, you can forbid particular modules, e.g. some local library or, more commonly, `__main__`:

In [26]:
try:
    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, forbidden_modules=("__main__",))
except boh.ModuleForbiddenError as e:
    print(e)

Module '__main__' is forbidden as a source of stored objects. Change the `forbidden_modules` or move this object to an allowed module.
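
Conceptually, a check like this only needs to know which module an object's class comes from. Below is a minimal sketch of the idea, handling classes and instances only; `ensure_module_allowed` and `ForbiddenSourceError` are hypothetical names, not part of `bagofholding`.

```python
class ForbiddenSourceError(ValueError):
    """Hypothetical error type for this sketch."""


def ensure_module_allowed(obj: object, forbidden_modules: tuple = ()) -> str:
    """Return the source module name, or raise if it is forbidden."""
    # A class reports its own module; an instance reports its class's module
    cls = obj if isinstance(obj, type) else type(obj)
    if cls.__module__ in forbidden_modules:
        raise ForbiddenSourceError(f"Module '{cls.__module__}' is forbidden")
    return cls.__module__


class LocalThing:
    pass


# Force the module name so the sketch behaves the same wherever it runs
LocalThing.__module__ = "__main__"

print(ensure_module_allowed(42, forbidden_modules=("__main__",)))
try:
    ensure_module_allowed(LocalThing, forbidden_modules=("__main__",))
except ForbiddenSourceError as e:
    print(e)
```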


## (Advanced topic) Customization

Because it is modeled on the `pickle` API, power users can customize the `bagofholding` storage behavior using familiar tools like custom `__reduce__` or `__getstate__` methods on their classes.
E.g., below we see that modifying the state manipulation impacts what is displayed on browsing:

In [27]:
class Customized:
    def __init__(self, x):
        self.x = x

    def __getstate__(self):
        return {"by_default_this_would_just_be_x": self.x}

    def __setstate__(self, state):
        self.x = state["by_default_this_would_just_be_x"]


boh.H5Bag.save(Customized(42), filename)
boh.H5Bag(filename).list_paths()[-1]

'object/state/by_default_this_would_just_be_x'
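
The same effect can be seen without `bagofholding` at all, since `__getstate__` feeds directly into the reduced form of the object. A small standard-library sketch, where the `Renamed` class is a hypothetical example and `copy.copy` stands in for a save-load round trip:

```python
import copy


class Renamed:
    def __init__(self, x):
        self.x = x

    def __getstate__(self):
        # Store the value under a different key than the attribute name
        return {"stored_key": self.x}

    def __setstate__(self, state):
        self.x = state["stored_key"]


original = Renamed(42)

# The third slot of the reduced tuple is whatever __getstate__ returned
state = original.__reduce_ex__(4)[2]
print(state)

# copy.copy goes through the same reduce/setstate machinery as pickle
clone = copy.copy(original)
print(clone.x)
```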

## Limitations

`bagofholding` uses many of the same patterns as `pickle`, and thus is only expected to work for objects which could otherwise be pickled.
Bag objects offer a convenience method to quickly test this:

In [28]:
message = boh.H5Bag.pickle_check(lambda x: x, raise_exceptions=False)
print_error_without_addresses(message)

Can't pickle <function <lambda> ...>: attribute lookup <lambda> on __main__ failed
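
In spirit, such a check amounts to attempting a pickle dump and reporting any failure. A minimal sketch of that idea follows; `why_not_picklable` is a hypothetical helper, and the real method may do more than this.

```python
import pickle
from typing import Optional


def why_not_picklable(obj: object) -> Optional[str]:
    """Return None if `obj` pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError) as e:
        return str(e)
    return None


print(why_not_picklable(42))           # picklable, so nothing to report
print(why_not_picklable(lambda x: x))  # lambdas cannot be looked up by name
```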


Although it exploits the same patterns as `pickle`, `bagofholding` does not actually _execute_ `pickle`.
As a consequence, the highest pickle protocol (5), which adds support for out-of-band data, is not supported:

In [29]:
try:
    boh.H5Bag.save(42, filename, _pickle_protocol=5)
except boh.PickleProtocolError as e:
    print(e)

pickle protocol must be <= 4, got 5


## Notebook cleanup

At the end of the day, let's clean up the files we created.

In [30]:
import contextlib
import os

with contextlib.suppress(FileNotFoundError):
    os.remove(filename)