# Tutorial: TileDB-SOMA append-mode

As of TileDB-SOMA 1.5.0, we're excited to offer support for append mode.

A typical use case is ingesting H5AD/AnnData from multiple sequencing runs as they arrive, accumulating the data in a single experiment that can grow to millions of cells.

First, we'll do the usual package imports:

In [1]:
import scanpy as sc
import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()

tiledbsoma.__version__              1.11.1
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.8.final.0
OS version                          Linux 4.14.336-257.568.amzn2.x86_64


Next we'll set up where our data are going. Note that the URI is the only thing that depends on the storage backend: everything in this notebook below this URI-selection cell stays the same whether you store your data on TileDB Cloud, cloud storage such as S3, instance-local NVMe, or anything else.

In [2]:
import datetime

stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
# experiment_uri = f"tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/append-example-{stamp}"
# Self-contained demo
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri

'/tmp/append-example-20240516-203638'
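
Since the URI is the only backend-specific piece, one way to keep the choice in a single place is a small helper like the sketch below. The function name `make_experiment_uri` and the bucket and namespace names (`my-bucket`, `my-namespace`) are hypothetical placeholders, not part of TileDB-SOMA; the URI shapes follow the two examples in the cell above.

```python
import datetime


def make_experiment_uri(backend: str, stamp: str) -> str:
    """Return a destination URI for the chosen storage backend.

    All bucket, namespace, and path names here are hypothetical placeholders.
    """
    name = f"append-example-{stamp}"
    templates = {
        "local": f"/tmp/{name}",
        "s3": f"s3://my-bucket/scratch/{name}",
        "tiledb-cloud": f"tiledb://my-namespace/s3://my-bucket/scratch/{name}",
    }
    if backend not in templates:
        raise ValueError(f"unknown backend: {backend!r}")
    return templates[backend]


stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
example_uri = make_experiment_uri("local", stamp)
```

Switching backends then means changing only the first argument; the rest of the notebook would use the returned URI unchanged.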

## Create the initial SOMA Experiment

Next we'll prep some input data. To make things easy for this self-contained demo, we'll use Scanpy's `pbmc3k`, with a custom column.

In [3]:
ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1.obs["when"] = ["Monday"] * len(ad1.obs)

Now we're ready to ingest the data into a SOMA experiment. Since SOMA is multimodal, we'll specify the destination modality, or measurement name, to be "RNA".

In [4]:
measurement_name = "RNA"

In [5]:
tiledbsoma.logging.info()
tiledbsoma.io.from_anndata(experiment_uri, ad1, measurement_name=measurement_name)

Registration: registering isolated AnnData object.


Wrote   /tmp/append-example-20240516-203638/obs


Wrote   /tmp/append-example-20240516-203638/ms/RNA/var


Writing /tmp/append-example-20240516-203638/ms/RNA/X/data



Wrote   /tmp/append-example-20240516-203638/ms/RNA/X/data


Wrote   /tmp/append-example-20240516-203638


'/tmp/append-example-20240516-203638'

Now let's read back the data. We'll take a look at `obs`, `var`, and `X`.

**obs**: For this initial ingest, there are obs IDs ending in `-1`, the `when` is `Monday`, and there are 2700 rows. Also note that since TileDB is a columnar database, when we select certain columns, those are the only ones loaded from disk. This reduces I/O, which matters most when reading from cloud storage.

In [6]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )

                obs_id  n_genes_by_counts    when
0     AAACATACAACCAC-1                781  Monday
1     AAACATTGAGCTAC-1               1352  Monday
2     AAACATTGATCAGC-1               1131  Monday
3     AAACCGTGCTTCCG-1                960  Monday
4     AAACCGTGTATGCG-1                522  Monday
...                ...                ...     ...
2695  TTTCGAACTCTCAT-1               1155  Monday
2696  TTTCTACTGAGGCA-1               1227  Monday
2697  TTTCTACTTCCTCG-1                622  Monday
2698  TTTGCATGAGAGGC-1                454  Monday
2699  TTTGCATGCCTCAC-1                724  Monday

[2700 rows x 3 columns]


**var**: Let's also look at `var`, selecting out the join IDs (which index columns of `X`) as well as the gene symbols (`var_id`) and the Ensembl gene IDs (`gene_ids`):

In [7]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )

       soma_joinid        var_id         gene_ids
0                0    MIR1302-10  ENSG00000243485
1                1       FAM138A  ENSG00000237613
2                2         OR4F5  ENSG00000186092
3                3  RP11-34P13.7  ENSG00000238009
4                4  RP11-34P13.8  ENSG00000239945
...            ...           ...              ...
32733        32733    AC145205.1  ENSG00000215635
32734        32734         BAGE5  ENSG00000268590
32735        32735    CU459201.1  ENSG00000251180
32736        32736    AC002321.2  ENSG00000215616
32737        32737    AC002321.1  ENSG00000215611

[32738 rows x 3 columns]


**X**: Lastly, let's look at the expression matrix, in COO format. (You can convert to other formats if you like.) Its rows and columns are indexed by the `soma_joinid` of the `obs` and `var` dataframes, respectively.

In [8]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.used_shape())

         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
2286879        2699       32697        1.0
2286880        2699       32698        7.0
2286881        2699       32702        1.0
2286882        2699       32705        1.0
2286883        2699       32708        3.0

[2286884 rows x 3 columns]

((0, 2699), (0, 32732))
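
To illustrate what converting from COO to another format involves, here is a minimal pure-Python sketch that turns COO triples into CSR arrays. The function name `coo_to_csr` and the `n_rows` parameter are illustrative assumptions; in practice, with SciPy installed, `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=...).tocsr()` does the same job far more efficiently.

```python
def coo_to_csr(rows, cols, vals, n_rows):
    """Convert COO triples to CSR arrays (indptr, indices, data)."""
    # Count the stored entries in each row.
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    # Prefix-sum the counts so that indptr[r] is where row r starts.
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]
    # Scatter each entry into the next free slot of its row.
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    next_slot = indptr[:-1].copy()
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k] = c
        data[k] = v
        next_slot[r] += 1
    return indptr, indices, data


# The first three entries of the output above, plus one hypothetical entry in row 1.
rows = [0, 0, 0, 1]
cols = [70, 166, 178, 5]
vals = [1.0, 1.0, 2.0, 4.0]
indptr, indices, data = coo_to_csr(rows, cols, vals, n_rows=2)
```

Here `indptr[r]` and `indptr[r + 1]` bracket the slice of `indices` and `data` that belongs to row `r`, which is what makes per-cell lookups cheap in CSR.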


## Appending a new dataset to the SOMA Experiment

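Now, let's simulate another day's sequencing run. For simplicity of this demo notebook, we'll mutate the previous dataset, changing the obs IDs to have a `-2` suffix, and also putting `Tuesday` in the `when` column. Also, we'll multiply the `X` values by 10. Note that `ad2 = ad1` binds a second name to the same AnnData object rather than copying it, so `ad1` is modified as well; that's fine here because Monday's data have already been written to the experiment.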

In [9]:
ad2 = ad1
ad2.obs.index = [e.replace("-1", "-2") for e in ad1.obs.index]
ad2.obs["when"] = ["Tuesday"] * len(ad2.obs)

In [10]:
ad2.X *= 10
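
One detail of the renaming above: `str.replace("-1", "-2")` rewrites every occurrence of `-1` in an ID, not just the trailing one. That is harmless for the `pbmc3k` barcodes, but a suffix-only variant is a little more defensive. The helper name `retag_suffix` below is hypothetical, and the second example ID is made up to show the difference.

```python
def retag_suffix(obs_id: str, old: str = "-1", new: str = "-2") -> str:
    """Swap a trailing `old` suffix for `new`; leave other IDs unchanged."""
    if obs_id.endswith(old):
        return obs_id[: len(obs_id) - len(old)] + new
    return obs_id


barcode = retag_suffix("AAACATACAACCAC-1")

# A made-up ID containing "-1" twice: only the trailing suffix changes.
tricky = retag_suffix("SAMPLE-1-1")
naive = "SAMPLE-1-1".replace("-1", "-2")
```

With this helper, the list comprehension in the cell above would become `[retag_suffix(e) for e in ad1.obs.index]`.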

Now we simply ingest as before -- the only additional step is a black-box registration step which detects which cell IDs are new (here, all of them) and which gene IDs are new (here, none of them).

The registration takes two forms, either of which you can use depending on your use-case: `tiledbsoma.io.register_anndatas` for in-memory AnnData objects, or `tiledbsoma.io.register_h5ads` for on-storage AnnData objects.

In [11]:
rd = tiledbsoma.io.register_anndatas(
    experiment_uri,
    [ad2],
    measurement_name=measurement_name,
    obs_field_name="obs_id",
    var_field_name="var_id",
)

tiledbsoma.io.from_anndata(
    experiment_uri,
    ad2,
    measurement_name=measurement_name,
    registration_mapping=rd,
)

Registration: starting with experiment /tmp/append-example-20240516-203638


Registration: found nobs=2700 nvar=32738 from experiment.


Registration: registering AnnData object.


Registration: accumulated to nobs=5400 nvar=32738.


Registration: complete.


Wrote   /tmp/append-example-20240516-203638/obs


Wrote   /tmp/append-example-20240516-203638/ms/RNA/var


Writing /tmp/append-example-20240516-203638/ms/RNA/X/data


Wrote   /tmp/append-example-20240516-203638/ms/RNA/X/data


Wrote   /tmp/append-example-20240516-203638


'/tmp/append-example-20240516-203638'
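
To make the registration step a little less of a black box, here is a conceptual sketch of the bookkeeping it performs: IDs already in the experiment keep their join IDs, and unseen IDs are assigned the next free integers. This is an illustration only, not TileDB-SOMA's actual implementation; the function name `register_ids` and the shortened IDs are hypothetical.

```python
def register_ids(existing: dict[str, int], incoming: list[str]) -> dict[str, int]:
    """Return an ID -> join-ID mapping covering existing and incoming IDs.

    Known IDs keep their join IDs; unseen IDs get the next free integers.
    """
    mapping = dict(existing)
    next_joinid = max(mapping.values(), default=-1) + 1
    for id_ in incoming:
        if id_ not in mapping:
            mapping[id_] = next_joinid
            next_joinid += 1
    return mapping


# obs: every Tuesday cell ID is new, so each one gets a fresh join ID.
monday_obs = {"AAAC-1": 0, "TTTG-1": 1}
obs_map = register_ids(monday_obs, ["AAAC-2", "TTTG-2"])

# var: the genes are the same, so the mapping is unchanged.
monday_var = {"MIR1302-10": 0, "FAM138A": 1}
var_map = register_ids(monday_var, ["MIR1302-10", "FAM138A"])
```

This mirrors the log above: `nobs` accumulates from 2700 to 5400 because all cell IDs are new, while `nvar` stays at 32738 because no gene IDs are new.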

Now let's read back the appended data. There are now obs IDs ending in `-1` as well as `-2`, the `when` includes `Monday` as well as `Tuesday`, and there are 5400 rows.

(For `Wednesday` and onward, it'll simply be the same pattern -- we can grow our data iteratively over time, to arbitrary sizes.)

In [12]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )

                obs_id  n_genes_by_counts     when
0     AAACATACAACCAC-1                781   Monday
1     AAACATTGAGCTAC-1               1352   Monday
2     AAACATTGATCAGC-1               1131   Monday
3     AAACCGTGCTTCCG-1                960   Monday
4     AAACCGTGTATGCG-1                522   Monday
...                ...                ...      ...
5395  TTTCGAACTCTCAT-2               1155  Tuesday
5396  TTTCTACTGAGGCA-2               1227  Tuesday
5397  TTTCTACTTCCTCG-2                622  Tuesday
5398  TTTGCATGAGAGGC-2                454  Tuesday
5399  TTTGCATGCCTCAC-2                724  Tuesday

[5400 rows x 3 columns]


Let's also look at `var`, as before. Since we had data for more cells but for the same genes, there is nothing new here. The `obs` table grew downward with the new cells, and `X` grew downward with new rows, but `var` stayed the same.

In real-world data, occasionally you will see a gene expressed in subsequent data which wasn't expressed in the initial data. That's fine -- you'll simply see `var` grow just a bit for those newly encountered gene IDs, with corresponding new columns for `X`.

In [13]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )

       soma_joinid        var_id         gene_ids
0                0    MIR1302-10  ENSG00000243485
1                1       FAM138A  ENSG00000237613
2                2         OR4F5  ENSG00000186092
3                3  RP11-34P13.7  ENSG00000238009
4                4  RP11-34P13.8  ENSG00000239945
...            ...           ...              ...
32733        32733    AC145205.1  ENSG00000215635
32734        32734         BAGE5  ENSG00000268590
32735        32735    CU459201.1  ENSG00000251180
32736        32736    AC002321.2  ENSG00000215616
32737        32737    AC002321.1  ENSG00000215611

[32738 rows x 3 columns]


And lastly, let's look at the expression matrix. It has grown downward with the new cells while keeping the same width, since we didn't introduce new genes:

In [14]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.used_shape())

         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
4573763        5399       32697       10.0
4573764        5399       32698       70.0
4573765        5399       32702       10.0
4573766        5399       32705       10.0
4573767        5399       32708       30.0

[4573768 rows x 3 columns]

((0, 5399), (0, 32732))
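
For Wednesday and onward, the register-then-ingest pair from the append cell can be wrapped in one helper. The sketch below reuses only the two calls and keyword arguments shown in this notebook, so it assumes the TileDB-SOMA version printed at the top; the wrapper name `append_anndata` is hypothetical.

```python
def append_anndata(experiment_uri, adata, measurement_name="RNA"):
    """Register one in-memory AnnData against an existing experiment, then ingest it."""
    # Imported lazily so this helper can be defined without tiledbsoma installed.
    import tiledbsoma.io

    # Work out which obs and var IDs are new relative to the experiment.
    rd = tiledbsoma.io.register_anndatas(
        experiment_uri,
        [adata],
        measurement_name=measurement_name,
        obs_field_name="obs_id",
        var_field_name="var_id",
    )
    # Ingest using the registration mapping so that rows are appended.
    return tiledbsoma.io.from_anndata(
        experiment_uri,
        adata,
        measurement_name=measurement_name,
        registration_mapping=rd,
    )
```

A call would then look like `append_anndata(experiment_uri, ad3)`, where `ad3` stands for a hypothetical AnnData holding the next day's run.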
