# Quickstart `annbatch`

This notebook will walk you through the following steps:
1. How to convert an existing collection of `anndata` files into a shuffled, zarr-based collection of `anndata` datasets
2. How to load the converted collection using `annbatch`
3. How to extend an existing collection with new `anndata` datasets

In [None]:
# Download two example datasets from CELLxGENE
!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad
!wget https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad

## IMPORTANT: Configure zarrs

This step is required both for converting existing `anndata` files into a performant, shuffled collection of datasets and for mini-batch loading from that collection.

In [None]:
import zarr
import zarrs  # noqa

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

In [None]:
import warnings

# Suppress zarr vlen-utf8 codec warnings
warnings.filterwarnings(
    "ignore",
    message="The codec `vlen-utf8` is currently not part in the Zarr format 3 specification.*",
    category=UserWarning,
    module="zarr.codecs.vlen_utf8",
)

## Converting existing `anndata` files into a shuffled collection

The conversion code will take care of the following things:
* Align (outer join) the gene spaces across all datasets listed in `adata_paths`
  * The gene spaces are outer-joined based on the gene names provided in the `var_names` field of the individual `AnnData` objects.
  * If you want to subset to a specific gene space, you can provide a list of gene names via the `var_subset` parameter.
* Shuffle the cells across all datasets (this also works on larger-than-memory datasets).
  * This is important for block-wise shuffling during data loading.
* Distribute the shuffled cells across multiple output datasets:
  * The size of each individual output dataset can be controlled via the `n_obs_per_dataset` parameter.
  * We recommend choosing a dataset size that comfortably fits into system memory.
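
The steps listed above can be illustrated with a small, pure-Python sketch. It is not the code that `create_anndata_collection` runs; the toy `datasets`, the helper names (`outer_join_genes`, `align`, `shuffle_into_shards`), and the choice to fill genes missing from a dataset with zeros are all assumptions made for illustration.

```python
import random

# Hypothetical toy inputs: each dataset has gene names and one expression row per cell.
datasets = [
    {"var_names": ["GeneA", "GeneB"], "X": [[1, 2], [3, 4]]},
    {"var_names": ["GeneB", "GeneC"], "X": [[5, 6]]},
]


def outer_join_genes(datasets, var_subset=None):
    """Union of all gene names in first-seen order, optionally restricted to var_subset."""
    seen = {}
    for dataset in datasets:
        for gene in dataset["var_names"]:
            seen.setdefault(gene, None)
    genes = list(seen)
    if var_subset is not None:
        keep = set(var_subset)
        genes = [gene for gene in genes if gene in keep]
    return genes


def align(dataset, genes):
    """Re-index every cell onto the joint gene space (assumption: missing genes become 0)."""
    column = {gene: i for i, gene in enumerate(dataset["var_names"])}
    return [[row[column[g]] if g in column else 0 for g in genes] for row in dataset["X"]]


def shuffle_into_shards(rows, n_obs_per_dataset, seed=0):
    """Shuffle all cells, then cut them into shards of at most n_obs_per_dataset cells."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return [rows[i : i + n_obs_per_dataset] for i in range(0, len(rows), n_obs_per_dataset)]


genes = outer_join_genes(datasets)
aligned = [row for dataset in datasets for row in align(dataset, genes)]
shards = shuffle_into_shards(aligned, n_obs_per_dataset=2)
```

Here `genes` is the joint gene space of both toy datasets, every row in `aligned` has one value per joint gene, and `shards` plays the role of the output datasets whose size is capped by `n_obs_per_dataset`.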

In [None]:
from arrayloaders import create_anndata_collection

create_anndata_collection(
    # List all the h5ad files you want to include in the collection
    adata_paths=["866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad", "f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad"],
    # Path to store the output collection
    output_path="annbatch_collection",
    shuffle=True,  # Whether to pre-shuffle the cells of the collection
    n_obs_per_dataset=2_097_152,  # Number of cells per dataset shard
    var_subset=None,  # Optionally subset the collection to a specific gene space
    should_denseify=False,  # Keep `X` sparse instead of converting it to a dense array
)

## Data loading example

In [None]:
from pathlib import Path

COLLECTION_PATH = Path("annbatch_collection/")

In [None]:
import anndata as ad

from arrayloaders import ZarrSparseDataset

ds = ZarrSparseDataset(
    batch_size=4096,  # Total number of obs per yielded batch
    chunk_size=256,  # Number of obs to load from disk contiguously - default settings should work well
    preload_nchunks=32,  # Number of chunks to preload + shuffle - default settings should work well
    # If True, preloaded chunks are moved to GPU memory via `cupy`; this puts more
    # pressure on GPU memory but accelerates loading by ~20%
    preload_to_gpu=False,
)

# Add the datasets that should be used for training
ds.add_anndatas(
    [
        ad.AnnData(
            X=ad.io.sparse_dataset(zarr.open(p)["X"]),
            obs=ad.io.read_elem(zarr.open(p)["obs"]),
        )
        for p in COLLECTION_PATH.glob("*.zarr")
    ],
    obs_keys="cell_type",
)
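
To make the interplay of `batch_size`, `chunk_size`, and `preload_nchunks` more concrete, the following is a minimal sketch of block-wise shuffling over plain integer indices. It is an illustration under assumptions, not the `ZarrSparseDataset` implementation: the function `blockwise_batches` is hypothetical, and the exact order of operations inside the library may differ. With the settings used above, 32 chunks of 256 obs give a preloaded buffer of 8192 obs, which is two batches of 4096.

```python
import random


def blockwise_batches(n_obs, batch_size, chunk_size, preload_nchunks, seed=0):
    """Yield batches of obs indices: shuffle chunks, preload a few, shuffle within them."""
    rng = random.Random(seed)
    # Each chunk is a contiguous range of obs, i.e. one contiguous read from disk.
    chunks = [range(start, min(start + chunk_size, n_obs)) for start in range(0, n_obs, chunk_size)]
    rng.shuffle(chunks)
    for i in range(0, len(chunks), preload_nchunks):
        # "Preload" a group of chunks into one in-memory buffer and shuffle it.
        buffer = [idx for chunk in chunks[i : i + preload_nchunks] for idx in chunk]
        rng.shuffle(buffer)
        # Cut the shuffled buffer into batches.
        for j in range(0, len(buffer), batch_size):
            yield buffer[j : j + batch_size]


batches = list(blockwise_batches(n_obs=16_384, batch_size=4096, chunk_size=256, preload_nchunks=32))
```

Because each batch only mixes obs from the chunks preloaded together, the randomness of a batch depends on the collection having been shuffled on disk beforehand.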

**IMPORTANT:**
* The `ZarrSparseDataset` yields batches of sparse tensors.
* The conversion to dense tensors should be done on the GPU, as shown in the example below.
  * First call `.cuda()` and then `.to_dense()`
  * E.g. `x = x.cuda().to_dense()`
  * This is significantly faster than doing the dense conversion on the CPU.


In [None]:
# Iterate over dataloader
for batch in ds:
    x, obs = batch
    # Important: Convert to dense on GPU
    x = x.cuda().to_dense()
    # Feed data into your model
    ...

## Optional: Extend an existing collection with a new dataset

You might want to extend an existing pre-shuffled collection with a new dataset.
This can be done using the `add_to_collection` function.

This function will take care of shuffling the new dataset into the existing collection without having to re-shuffle the entire collection.
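
One way to picture this is sketched below on plain Python lists. This is only an assumption about the general idea, not a description of how `add_to_collection` works internally; the helper `extend_shards` and the toy cell labels are hypothetical. Each new cell is assigned to a randomly chosen existing shard, and each shard is then shuffled on its own, so no step needs the whole collection in memory at once.

```python
import random


def extend_shards(shards, new_rows, seed=0):
    """Scatter new_rows over the existing shards, then re-shuffle each shard locally."""
    rng = random.Random(seed)
    extended = [list(shard) for shard in shards]
    for row in new_rows:
        # Assumption for illustration: the target shard is picked uniformly at random.
        rng.choice(extended).append(row)
    for shard in extended:
        rng.shuffle(shard)
    return extended


existing = [["a1", "a2", "a3"], ["a4", "a5", "a6"]]
new_cells = ["b1", "b2", "b3", "b4"]
extended = extend_shards(existing, new_cells)
```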

In [None]:
from arrayloaders import add_to_collection

add_to_collection(
    adata_paths=[
        "866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad",
    ],
    output_path="annbatch_collection",
    read_full_anndatas=True,  # This should be set to False if the new datasets DO NOT fit into memory
)