# Tutorial: TileDB-SOMA append mode

As of TileDB-SOMA 1.5.0, we're excited to offer support for append mode.

As of TileDB-SOMA 1.15.0, we're proud to offer a `shape` feature for dataframes and arrays within experiments, which more closely matches user expectations.

Use cases include ingesting H5AD/AnnData from multiple sequencing runs, accumulating the data over time into millions of cells.

First, we'll do the usual package imports:

In [None]:
import scanpy as sc

import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()

Next we'll set up where our data are going:

In [None]:
import datetime

stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri

For this demo, we're writing to `/tmp`, but URIs like the following allow storing data on instance-local storage such as NVMe, on cloud storage such as S3, or on TileDB Cloud:

- `/var/data/mysoma1`
- `s3://mybucket/mysoma2`
- `tiledb://mynamespace/s3://mybucket/mysoma3`

Everything in this notebook below this URI-selection cell is agnostic to the storage backend.
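
As a rough illustration of how the three URI styles above differ, here is a minimal sketch that labels a URI by its scheme. The helper name `storage_backend` is hypothetical and is not part of TileDB-SOMA, which resolves URIs on its own; the sketch only shows that the backend is determined by the URI prefix.

```python
from urllib.parse import urlparse


def storage_backend(uri: str) -> str:
    """Return a rough label for where a URI points (illustrative only)."""
    scheme = urlparse(uri).scheme
    if scheme == "tiledb":
        return "TileDB Cloud"
    if scheme == "s3":
        return "S3 object storage"
    if scheme in ("", "file"):
        return "local filesystem"
    return f"other ({scheme})"


# The three example URIs from the list above.
for uri in [
    "/var/data/mysoma1",
    "s3://mybucket/mysoma2",
    "tiledb://mynamespace/s3://mybucket/mysoma3",
]:
    print(f"{uri} -> {storage_backend(uri)}")
```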

## Create the initial SOMA Experiment

Next we'll prep some input data. To make things easy for this self-contained demo, we'll use Scanpy's `pbmc3k`, with a custom column.

In [None]:
ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1.obs["when"] = ["Monday"] * len(ad1.obs)

Now we're ready to ingest the data into a SOMA experiment. Since SOMA is multimodal, we'll specify the destination modality, or measurement name, to be "RNA".

In [5]:
measurement_name = "RNA"

In [None]:
tiledbsoma.logging.info()
tiledbsoma.io.from_anndata(experiment_uri, ad1, measurement_name=measurement_name)

Now let's read back the data. We'll take a look at `obs`, `var`, and `X`.

**obs**: For this initial ingest, there are obs IDs ending in `-1`, the `when` is `Monday`, and there are 2700 rows. Also note that since TileDB is a columnar database, when we select certain columns, those are the only ones loaded from disk. This positively impacts performance at cloud scale: you get what you asked for, without needing to wait for what you didn't ask for.

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )

**var**: Let's also look at `var`, selecting out the join IDs (which index columns of `X`) as well as the gene names in `var_id` and the Ensembl-format gene IDs in `gene_ids`:

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )

**X**: Lastly let's look at the expression matrix, in COO format. (You can convert to other formats if you like.) Its rows and columns are indexed by the `soma_joinid` of the `obs` and `var` dataframes, respectively.

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.used_shape())
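
To make the COO layout concrete, here is a small sketch in plain Python that turns a handful of (row, column, value) triples into a dense matrix. The triples and the 3x3 size are made up for illustration; in the table printed above, the corresponding columns are `soma_dim_0`, `soma_dim_1`, and `soma_data`.

```python
# Toy COO triples: (soma_dim_0, soma_dim_1, soma_data). Values are invented.
coo = [
    (0, 1, 3.0),
    (1, 0, 1.0),
    (2, 2, 5.0),
]
n_obs, n_var = 3, 3

# Start from an all-zero matrix and fill in only the stored entries.
dense = [[0.0] * n_var for _ in range(n_obs)]
for row, col, value in coo:
    dense[row][col] = value

for line in dense:
    print(line)
```

Only the nonzero entries are stored in COO form, which is why it suits sparse expression matrices.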

While you can ask each dataframe and array in the experiment for its `.domain` or `.shape`, respectively, one at a time, there's also the handy `tiledbsoma.io.show_experiment_shapes`, which traverses the experiment for you.

The dataframe domains and array shapes are soft limits on what values can be read from, or written to, them. Here we look at the `shape` of `X`:

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)
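
The idea of a shape acting as a soft limit can be sketched with a toy class. This is not TileDB-SOMA code: `ToySparseArray` and its methods are hypothetical, and the dimensions are only chosen to resemble this demo. The point is that a write outside the current shape is rejected until the shape is enlarged.

```python
class ToySparseArray:
    """Toy stand-in for a sparse array with a resizable shape."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self.cells = {}

    def write(self, row, col, value):
        # Reject coordinates that fall outside the current shape.
        if not (0 <= row < self.shape[0] and 0 <= col < self.shape[1]):
            raise IndexError(f"({row}, {col}) is outside shape {self.shape}")
        self.cells[(row, col)] = value

    def resize(self, new_shape):
        # This toy model only supports growing the shape.
        if any(new < old for new, old in zip(new_shape, self.shape)):
            raise ValueError("this toy model only grows the shape")
        self.shape = tuple(new_shape)


arr = ToySparseArray(shape=(2700, 32738))
arr.write(2699, 0, 1.0)  # last valid row under the current shape

try:
    arr.write(2700, 0, 1.0)  # one row past the limit
    rejected = False
except IndexError:
    rejected = True

arr.resize((5400, 32738))  # raise the soft limit
arr.write(2700, 0, 1.0)    # the same write now succeeds
print(rejected, arr.shape)
```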

## Appending a new dataset to the SOMA Experiment

Now, let's simulate another day's sequencing run. For simplicity of this demo notebook, we'll mutate the previous dataset, changing the obs IDs to have a `-2` suffix, and also putting `Tuesday` in the `when` column. Also, we'll multiply the `X` values by 10.

In [11]:
ad2 = ad1.copy()
ad2.obs.index = [e.replace("-1", "-2") for e in ad1.obs.index]
ad2.obs["when"] = ["Tuesday"] * len(ad2.obs)

In [12]:
ad2.X *= 10

Now we simply ingest as before -- the only additional step is a black-box registration step which detects which cell IDs are new (here, all of them) and which gene IDs are new (here, none of them).

The registration takes two forms, either of which you can use depending on your use-case: `tiledbsoma.io.register_anndatas` for in-memory AnnData objects, or `tiledbsoma.io.register_h5ads` for on-storage AnnData objects.

In [None]:
rd = tiledbsoma.io.register_anndatas(
    experiment_uri,
    [ad2],
    measurement_name=measurement_name,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
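
Conceptually, registration maps each ID to a join ID, keeping existing assignments and giving new join IDs only to IDs not seen before. The following is a minimal sketch of that idea in plain Python; `register_ids` is a hypothetical helper with made-up IDs, not the library's implementation.

```python
def register_ids(existing, incoming):
    """Return a copy of `existing` (ID -> join ID) extended with unseen IDs."""
    mapping = dict(existing)
    next_joinid = max(mapping.values(), default=-1) + 1
    for id_ in incoming:
        if id_ not in mapping:
            mapping[id_] = next_joinid
            next_joinid += 1
    return mapping


# All incoming cell IDs are new, so each one gets a fresh join ID.
obs_map = {"AAAC-1": 0, "AAAG-1": 1}
obs_map2 = register_ids(obs_map, ["AAAC-2", "AAAG-2"])

# No incoming gene IDs are new, so the mapping is unchanged.
var_map = {"GeneA": 0, "GeneB": 1}
var_map2 = register_ids(var_map, ["GeneA", "GeneB"])

print(obs_map2)
print(var_map2)
```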

As described in the tutorial on the TileDB-SOMA shape feature, the `domain` of dataframes and the `shape` of N-dimensional arrays are soft limits on what values can be read from, or written to, them. In order to ingest more data, we'll need to increase those soft limits.

First let's look at what they currently are:

In [None]:
tiledbsoma.io.show_experiment_shapes(experiment_uri)

Then we apply the resize, and look at the domains and shapes again:

In [None]:
tiledbsoma.io.resize_experiment(
    experiment_uri,
    nobs=rd.get_obs_shape(),
    nvars=rd.get_var_shapes(),
)

In [None]:
tiledbsoma.io.show_experiment_shapes(experiment_uri)

Now we can ingest the new data:

In [None]:
tiledbsoma.io.from_anndata(
    experiment_uri,
    ad2,
    measurement_name=measurement_name,
    registration_mapping=rd,
)

Now let's read back the appended data. There are now obs IDs ending in `-1` as well as `-2`, the `when` includes `Monday` as well as `Tuesday`, and there are 5400 rows.

(For `Wednesday` and onward, it'll simply be the same pattern -- we can grow our data iteratively over time, to arbitrary sizes.)

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )

Let's also look at `var`, as before. Since we had data for more cells but for the same genes, there is nothing new here. The `obs` table grew downward with the new cells, and `X` grew downward with new rows, but `var` stayed the same.

In real-world data, occasionally you will see a gene expressed in subsequent data which wasn't expressed in the initial data. That's fine -- you'll simply see `var` grow just a bit for those newly encountered gene IDs, with corresponding new columns for `X`.

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )

And lastly, the `X` expression matrix which has grown downward with the new cells, while keeping the same width as we didn't introduce new genes:

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)

## Ingesting multiple datasets to a SOMA Experiment

Finally, we'll demonstrate combining multiple AnnData objects into one new experiment.

The flow is pretty similar to the above:

1. One call to `register_anndatas` or `register_h5ads` (passing all input AnnData objects or H5AD files)
2. One call to `from_anndata` or `from_h5ad` *for each input*

Here's a helper function to simulate multiple lab runs. As above, where we used `pbmc3k` to simulate Monday and Tuesday data, here we use `pbmc3k` to simulate multiple AnnData objects.

In [21]:
def make_ad(when, scale, obs_id_suffix):
    ad = ad1.copy()
    ad.obs.index = [e.replace("-1", obs_id_suffix) for e in ad.obs.index]
    ad.obs["when"] = [when] * len(ad.obs)
    ad.X *= scale
    return ad

# One simulated run per day, each with its own X scale factor.
runs = {"Wednesday": 20, "Thursday": 30, "Friday": 40}

ads = [
    make_ad(when, scale, f"-{idx + 3}")
    for idx, (when, scale) in enumerate(runs.items())
]

We'll ingest these AnnData objects, as before, but this time to a fresh/empty `/tmp` location:

In [None]:
stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
exp = None
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri

Here we'll register all the AnnData objects. Note that the SOMA Experiment doesn't exist yet, so we pass `experiment_uri=None` to signify that.

In [None]:
rd2 = tiledbsoma.io.register_anndatas(
    experiment_uri=None,  # new Experiment, from scratch
    adatas=ads,
    measurement_name=measurement_name,
    obs_field_name="obs_id",
    var_field_name="var_id",
)

Now that we've gotten the registrations for all the input AnnData objects, we can ingest them.

Note:

- Here we ingest them sequentially, in the same order as above.
- But we could also ingest them in any shuffled order.
- Or we could have multiple workers ingest them in parallel, one worker per AnnData object, as long as the registration data are passed to each worker.

In [None]:
for ad in ads:
    if tiledbsoma.Experiment.exists(experiment_uri):
        tiledbsoma.io.resize_experiment(
            experiment_uri,
            nobs=rd2.get_obs_shape(),
            nvars=rd2.get_var_shapes()
        )

    tiledbsoma.io.from_anndata(
        experiment_uri,
        ad,
        measurement_name=measurement_name,
        registration_mapping=rd2,
    )
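
Why the ingest order doesn't matter can be sketched in plain Python: once every ID has its join ID assigned up front, each batch writes to its own pre-determined rows regardless of when it runs. The batch contents and the `ingest` helper below are made up for illustration and do not reflect TileDB-SOMA internals.

```python
import random

# Hypothetical batches of obs IDs, one per simulated day.
batches = {
    "Wednesday": ["a-3", "b-3"],
    "Thursday": ["a-4", "b-4"],
    "Friday": ["a-5", "b-5"],
}

# Registration happens once, up front, across all batches.
registration = {}
for ids in batches.values():
    for obs_id in ids:
        registration[obs_id] = len(registration)


def ingest(order):
    """Write batches in the given order; return {join ID: obs ID}."""
    store = {}
    for day in order:
        for obs_id in batches[day]:
            store[registration[obs_id]] = obs_id
    return store


in_order = ingest(list(batches))

shuffled_order = list(batches)
random.shuffle(shuffled_order)
shuffled = ingest(shuffled_order)

# Both orders place every cell at the same join ID.
print(in_order == shuffled)
```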

Reading back the concatenated data, we see 2700 rows for each of the suffixes `-3`, `-4`, and `-5`, for 8100 rows in total:

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )

`var` is the same as in each of the original AnnData objects (since we added more cells with all the same genes):

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )

Finally, the `X` expression matrix contains three times as many entries as the original, but has the same width (since we didn't introduce new genes):

In [None]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)