# Demonstration of `Storage` class and manifests

The `Storage` class supports generic saving and loading from several backends, including direct file system access and zip files. In particular, we use it to support `ZipFileLinearIndex` and SBT databases; SBTs can be loaded from both file systems and zip files, as well as (with a bit more work) IPFS and Redis databases.

Manifests store metadata about the signatures that are available in a particular storage. They are used to support selectors, picklists, and fast/direct loading of specific signatures. The important part for `Storage` use is that each manifest row contains an `internal_location` that can be used to directly load a signature upon request.
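To make the manifest idea concrete, here is a minimal standard-library sketch of selecting rows and then loading by `internal_location`. It does not use sourmash: the CSV layout (apart from the `internal_location` column), the `FAKE_STORAGE` dictionary, and the `select_rows` and `load_selected` helpers are all hypothetical stand-ins for illustration.

```python
import csv
import io

# Hypothetical, pared-down manifest: real sourmash manifests carry more
# columns; only `internal_location` is taken from the text above.
MANIFEST_CSV = """\
internal_location,md5,ksize,name
sigs/aaa.sig.gz,aaa,31,genome A
sigs/bbb.sig.gz,bbb,21,genome B
sigs/ccc.sig.gz,ccc,31,genome C
"""

# Stand-in for a storage backend: maps internal locations to raw bytes.
FAKE_STORAGE = {
    "sigs/aaa.sig.gz": b"payload A",
    "sigs/bbb.sig.gz": b"payload B",
    "sigs/ccc.sig.gz": b"payload C",
}


def select_rows(manifest_text, **criteria):
    """Yield manifest rows whose columns match all given criteria."""
    for row in csv.DictReader(io.StringIO(manifest_text)):
        if all(row[key] == str(value) for key, value in criteria.items()):
            yield row


def load_selected(manifest_text, storage, **criteria):
    """Load only the payloads whose manifest rows match the criteria."""
    return {
        row["name"]: storage[row["internal_location"]]
        for row in select_rows(manifest_text, **criteria)
    }


# Select on metadata first, then touch storage only for the matching rows.
selected = load_selected(MANIFEST_CSV, FAKE_STORAGE, ksize=31)
print(sorted(selected))
```

The point of the design is that selection happens entirely on manifest metadata, so only the matching entries ever need to be read from storage.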

(There are also IPFS and Redis-based storage classes that we don't show below, as they require other packages to be installed as well as some services.)

In [1]:
import os
import shutil
from io import StringIO

import sourmash
from sourmash.sbt_storage import FSStorage, ZipStorage
from sourmash.manifest import CollectionManifest
from sourmash.signature import load_signatures

from sourmash.sourmash_args import SaveSignaturesToLocation

In [2]:
# load some demo signatures
sig1 = sourmash.load_one_signature('akkermansia.fa.sig', ksize=31)
sig2 = sourmash.load_one_signature('shew_os185.fa.sig', ksize=31)
sig3 = sourmash.load_one_signature('shew_os223.fa.sig', ksize=31)

In [3]:
# define some functions to (1) save signatures into storage, and (2) save manifest into storage

# CTB notes:
# * could add a Storage.save_sig just like this function...
# * could also add a manifest getter/setter to load manifest from stock location
# * ZipStorage.save has a 'compress' argument that we should pull up to base class

def save_sig(storage, sig_obj):
    data = sourmash.save_signatures([sig_obj], compression=1)
    filename = f"{sig_obj.md5sum()}.sig.gz"

    internal_path = storage.save(filename, data)
    return internal_path

def save_manifest(storage, manifest_rows):
    manifest = CollectionManifest(manifest_rows)
    manifest_name = "SOURMASH-MANIFEST.csv"

    manifest_fp = StringIO()
    manifest.write_to_csv(manifest_fp, write_header=True)
    manifest_data = manifest_fp.getvalue().encode("utf-8")

    # CTB note: in sourmash, we usually use 'compress=True' when this is a ZipStorage
    storage.save(manifest_name, manifest_data, overwrite=True)
    
    storage.flush()

In [4]:
shutil.rmtree('/tmp/fs-example', ignore_errors=True)
# here, the 'location' is the root directory, 'subdir' is a
# subdirectory with content entirely under FSStorage control.
fs_store = FSStorage(location='/tmp', subdir='fs-example')

# this is used only for presentation purposes to show the final file name.
fullpath = os.path.join(fs_store.location, fs_store.subdir)

#
# for each signature, save into storage and build a manifest row, too.
#

internal_path = save_sig(fs_store, sig1)
manifest_row1 = CollectionManifest.make_manifest_row(sig1, internal_path)
print(f"saved '{str(sig1)[:30]}...' to {fullpath}/{internal_path}")

internal_path = save_sig(fs_store, sig2)
manifest_row2 = CollectionManifest.make_manifest_row(sig2, internal_path)
print(f"saved '{str(sig2)[:30]}...' to {fullpath}/{internal_path}")

internal_path = save_sig(fs_store, sig3)
manifest_row3 = CollectionManifest.make_manifest_row(sig3, internal_path)
print(f"saved '{str(sig3)[:30]}...' to {fullpath}/{internal_path}")

# save manifest:
save_manifest(fs_store, [manifest_row1, manifest_row2, manifest_row3])

saved 'CP001071.1 Akkermansia mucinip...' to /tmp/fs-example/6822e0b7f2b21030699fbb98c698e71c.sig.gz
saved 'NC_009665.1 Shewanella baltica...' to /tmp/fs-example/b47b13ef3781433fc3531fd502f723a4.sig.gz
saved 'NC_011663.1 Shewanella baltica...' to /tmp/fs-example/ae6659f6804482c9d5e739e554a48563.sig.gz


In [5]:
try:
    os.remove('/tmp/zip-store.zip')
except FileNotFoundError:
    pass

zip_store = ZipStorage('/tmp/zip-store.zip')

internal_path = save_sig(zip_store, sig1)
manifest_row1 = CollectionManifest.make_manifest_row(sig1, internal_path)
print(f"saved '{str(sig1)[:30]}...' to {zip_store.path}:{internal_path}")

internal_path = save_sig(zip_store, sig2)
manifest_row2 = CollectionManifest.make_manifest_row(sig2, internal_path)
print(f"saved '{str(sig2)[:30]}...' to {zip_store.path}:{internal_path}")

internal_path = save_sig(zip_store, sig3)
manifest_row3 = CollectionManifest.make_manifest_row(sig3, internal_path)
print(f"saved '{str(sig3)[:30]}...' to {zip_store.path}:{internal_path}")

# save manifest:
save_manifest(zip_store, [manifest_row1, manifest_row2, manifest_row3])

saved 'CP001071.1 Akkermansia mucinip...' to /tmp/zip-store.zip:6822e0b7f2b21030699fbb98c698e71c.sig.gz
saved 'NC_009665.1 Shewanella baltica...' to /tmp/zip-store.zip:b47b13ef3781433fc3531fd502f723a4.sig.gz
saved 'NC_011663.1 Shewanella baltica...' to /tmp/zip-store.zip:ae6659f6804482c9d5e739e554a48563.sig.gz


In [6]:
# now demonstrate loading all of the signatures in the manifest.

def load_sigs(storage):
    # load manifest and all sigs in manifest
    manifest = CollectionManifest.load_from_storage(storage)
    
    for loc in manifest.locations():
        sig_data = storage.load(loc)
        for sig_obj in load_signatures(sig_data):
            yield sig_obj
            
print('loading from zip store:')
for sig_obj in load_sigs(zip_store):
    print(sig_obj.name[:30] + '...')
    
print('---')
print('loading from fs store:')
for sig_obj in load_sigs(fs_store):
    print(sig_obj.name[:30] + '...')

loading from zip store:
CP001071.1 Akkermansia mucinip...
NC_009665.1 Shewanella baltica...
NC_011663.1 Shewanella baltica...
---
loading from fs store:
CP001071.1 Akkermansia mucinip...
NC_009665.1 Shewanella baltica...
NC_011663.1 Shewanella baltica...


## What we achieved here by using `Storage`

Both `Storage` objects have a manifest that can be loaded from a canonical location, `SOURMASH-MANIFEST.csv`.

The manifest contains full signature metadata that can be used to select, iterate over, and otherwise work with all of the signatures.

The `SOURMASH-MANIFEST.csv` contains the `internal_location` where each signature is stored within the `Storage` object, and this path can be used to directly load that signature.

Once each `Storage` class is instantiated, the `save` and `load` methods function identically, independent of the specific class being used.

This means that signature and manifest saving and loading can be done with the same API.
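The "same API, different backend" idea can be sketched with the standard library alone. The two classes below, `ToyDirStorage` and `ToyZipStorage`, are hypothetical illustrations and not sourmash's `FSStorage` or `ZipStorage`; they only mimic the `save`/`load` shape used in this notebook, and the `round_trip` helper is likewise an assumption made for the example.

```python
import os
import tempfile
import zipfile


class ToyDirStorage:
    """Hypothetical directory-backed store (not sourmash's FSStorage)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, path, content):
        with open(os.path.join(self.root, path), "wb") as fp:
            fp.write(content)
        return path

    def load(self, path):
        with open(os.path.join(self.root, path), "rb") as fp:
            return fp.read()


class ToyZipStorage:
    """Hypothetical zip-backed store (not sourmash's ZipStorage)."""

    def __init__(self, zip_path):
        self.zip_path = zip_path

    def save(self, path, content):
        # Mode "a" creates the archive on first use and appends afterwards.
        with zipfile.ZipFile(self.zip_path, "a") as zf:
            zf.writestr(path, content)
        return path

    def load(self, path):
        with zipfile.ZipFile(self.zip_path, "r") as zf:
            return zf.read(path)


def round_trip(storage, items):
    """Save and reload items using only the shared save/load interface."""
    paths = [storage.save(name, data) for name, data in items.items()]
    return {path: storage.load(path) for path in paths}


items = {"a.sig": b"first", "b.sig": b"second"}
with tempfile.TemporaryDirectory() as tmp:
    from_dir = round_trip(ToyDirStorage(os.path.join(tmp, "store")), items)
    from_zip = round_trip(ToyZipStorage(os.path.join(tmp, "store.zip")), items)

print(from_dir == from_zip == items)
```

Because `round_trip` never inspects which class it was given, any backend that offers the same two methods can be swapped in without changing the calling code.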

As fun additional features, note that:
* in this case the manifest files are identical, which doesn't really matter, but is a nice happenstance
* with only a little trickiness around instantiating the `Storage` class, you can unzip the zipfile into its own directory and then use `FSStorage` to access it - all of the manifest paths, etc., will work since they are relative.
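The second point, that relative manifest paths survive unzipping, can be illustrated with a standard-library sketch. Nothing here uses sourmash: the one-column manifest, the payload names, and the `build_toy_zip` and `load_via_directory` helpers are hypothetical, and only the `SOURMASH-MANIFEST.csv` file name and the `internal_location` column come from the text above.

```python
import csv
import io
import os
import tempfile
import zipfile

MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def build_toy_zip(zip_path):
    """Write two stand-in payloads plus a minimal manifest into a zip."""
    payloads = {"sigs/one.sig": b"one", "sigs/two.sig": b"two"}
    manifest = io.StringIO()
    writer = csv.writer(manifest)
    writer.writerow(["internal_location"])
    for location in payloads:
        writer.writerow([location])
    with zipfile.ZipFile(zip_path, "w") as zf:
        for location, data in payloads.items():
            zf.writestr(location, data)
        zf.writestr(MANIFEST_NAME, manifest.getvalue())
    return payloads


def load_via_directory(root):
    """Read the manifest from a directory and load each listed file."""
    with open(os.path.join(root, MANIFEST_NAME), newline="") as fp:
        locations = [row["internal_location"] for row in csv.DictReader(fp)]
    loaded = {}
    for location in locations:
        # The manifest paths are relative, so joining onto the new root works.
        with open(os.path.join(root, location), "rb") as fp:
            loaded[location] = fp.read()
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "store.zip")
    expected = build_toy_zip(zip_path)
    unpacked = os.path.join(tmp, "unpacked")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(unpacked)
    loaded = load_via_directory(unpacked)

print(loaded == expected)
```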

## Using `SaveSignaturesToLocation` instead

For Reasons, we also provide a separate signature saving API in `sourmash_args.SaveSignaturesToLocation`.

This is a somewhat higher-level API that supports a few different output formats, including directories and zip stores as well as .sig files; it's intended to be more user-friendly.

See below for usage.

In [7]:
shutil.rmtree('/tmp/fs-example2', ignore_errors=True)
try:
    os.remove('/tmp/zip-store2.zip')
except FileNotFoundError:
    pass

# note that this does not create a manifest
with SaveSignaturesToLocation('/tmp/fs-example2/') as savesig:
    savesig.add(sig1)
    savesig.add(sig2)
    savesig.add(sig3)
    
    print(f"saved {len(savesig)} signatures to {savesig.location}")
    
# this _does_ create a manifest
with SaveSignaturesToLocation('/tmp/zip-store2.zip') as savesig:
    savesig.add(sig1)
    savesig.add(sig2)
    savesig.add(sig3)
    
    print(f"saved {len(savesig)} signatures to {savesig.location}")

saved 3 signatures to /tmp/fs-example2/
saved 3 signatures to /tmp/zip-store2.zip
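The comments in the cell above note that the zip output carries a manifest while the directory output does not. One way to check this for a given archive is to look for the manifest entry by name; the sketch below does so with the standard library on small stand-in zip files. The `zip_has_manifest` helper is hypothetical, and the assumption that the manifest sits at the top level of the archive under the name `SOURMASH-MANIFEST.csv` is taken from the earlier cells, not verified against every sourmash output format.

```python
import os
import tempfile
import zipfile

MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def zip_has_manifest(zip_path):
    """Return True if the zip archive has a top-level manifest entry."""
    with zipfile.ZipFile(zip_path) as zf:
        return MANIFEST_NAME in zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive that includes a manifest.
    with_manifest = os.path.join(tmp, "with.zip")
    with zipfile.ZipFile(with_manifest, "w") as zf:
        zf.writestr("sigs/x.sig", b"x")
        zf.writestr(MANIFEST_NAME, "internal_location\nsigs/x.sig\n")

    # Stand-in archive that has payloads only.
    without_manifest = os.path.join(tmp, "without.zip")
    with zipfile.ZipFile(without_manifest, "w") as zf:
        zf.writestr("sigs/x.sig", b"x")

    results = (zip_has_manifest(with_manifest), zip_has_manifest(without_manifest))

print(results)
```

The same helper could be pointed at `/tmp/zip-store2.zip` after running the cell above.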


In [8]:
print('loading from zip store 2:')
for sig_obj in sourmash.load_file_as_signatures('/tmp/zip-store2.zip'):
    print(sig_obj.name[:30] + '...')
    
print('---')
print('loading from file system/directory, round 2:')
for sig_obj in sourmash.load_file_as_signatures('/tmp/fs-example2'):
    print(sig_obj.name[:30] + '...')

loading from zip store 2:
CP001071.1 Akkermansia mucinip...
NC_009665.1 Shewanella baltica...
NC_011663.1 Shewanella baltica...
---
loading from file system/directory, round 2:
CP001071.1 Akkermansia mucinip...
NC_011663.1 Shewanella baltica...
NC_009665.1 Shewanella baltica...
