# Streaming of annotated data matrices from the cloud

Loading an entire large dataset into memory is often inefficient, especially when only a subset of it is needed.

Here, we walk through partial streaming: fetching only the queried subset of an annotated data matrix from cloud storage.

In [None]:
import lamindb as ln

In [None]:
ln.track()

## Prepare files

We already have data in the cloud:

In [None]:
pbmc68k_h5ad = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")

In [None]:
pbmc68k_h5ad = ln.add(pbmc68k_h5ad)

Alternatively, the data can be stored as zarr:

In [None]:
pbmc68k_zarr = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr")

In [None]:
pbmc68k_zarr = ln.add(pbmc68k_zarr)

## Caching vs streaming entire files

Load the h5ad file (it is downloaded and cached locally):

In [None]:
pbmc68k_h5ad.load()

Load the zarr store (no caching happens here; the data is streamed):

In [None]:
pbmc68k_zarr.load()

## Stream files partially

We saw that both datasets have a `cell_type` annotation that includes the categories `Dendritic cells` and `CD14+ Monocytes`.

Now let's fetch only the data labeled with these two cell types.

First, we obtain the ingested AnnData file by querying the LaminDB instance.

```{note}

This is merely a database query; it does **not** download the data.

```

In [None]:
file = ln.select(ln.File, run_id=ln.context.run.id, suffix=".h5ad").one()

Next, we prepare the query strings to query the columns of `.obs` for each `AnnData` object. For details see the [pandas docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html).

```{note}

Soon, we'll integrate the within-object queries with the SQL queries.

```
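
To make the query-string syntax concrete, here is a minimal sketch that runs a pandas query on a toy table standing in for `.obs`. The table contents and index labels are made up for illustration; only `DataFrame.query` itself is taken from the pandas docs linked above.

```python
import pandas as pd

# Toy stand-in for the `.obs` table of an AnnData object (values are made up).
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.02, 0.07, 0.01, 0.04],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Inside `DataFrame.query`, `|` has the precedence of `or`, so the two
# comparisons do not need parentheses.
query_string = "cell_type == 'Dendritic cells' | cell_type == 'CD14+ Monocytes'"
selected = obs.query(query_string)

# The same selection written as an explicit boolean mask.
mask = (obs["cell_type"] == "Dendritic cells") | (
    obs["cell_type"] == "CD14+ Monocytes"
)
assert selected.equals(obs[mask])

print(selected.index.tolist())
```

Note that the explicit mask needs parentheses around each comparison, whereas the query string does not.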

### (A) Pandas-style query strings

In [None]:
query_string = "cell_type == 'Dendritic cells' | cell_type == 'CD14+ Monocytes'"

Subset the `AnnData` objects based on the query strings above and load them directly into memory.

```{note}

No caching happens here!

When `ln.subset` is executed, only the `.obs` columns are loaded in full to evaluate the query. For all remaining parts of the object, **only the selected subset** of the data is loaded into memory.

```
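
The two-step pattern described in the note (evaluate the condition on the annotations, then fetch only the matching rows) can be sketched with a memory-mapped NumPy file standing in for an HDF5 or zarr store. This is not how `ln.subset` is implemented; the file layout, variable names, and data below are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up data: 6 observations x 4 variables, plus one label per observation.
rng = np.random.default_rng(0)
X = rng.random((6, 4)).astype(np.float32)
cell_type = np.array(
    [
        "Dendritic cells",
        "CD4+ T cells",
        "CD14+ Monocytes",
        "CD4+ T cells",
        "Dendritic cells",
        "CD4+ T cells",
    ]
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "X.npy")
    np.save(path, X)

    # Step 1: evaluate the condition on the (small) annotations only.
    keep = np.isin(cell_type, ["Dendritic cells", "CD14+ Monocytes"])
    row_idx = np.flatnonzero(keep)

    # Step 2: open the matrix without reading it, then copy only the
    # selected rows into memory.
    X_on_disk = np.load(path, mmap_mode="r")
    X_subset = np.array(X_on_disk[row_idx])
    del X_on_disk  # release the memory map before the directory is removed

print(X_subset.shape)
```

The full matrix is never materialized from disk in step 2; only the rows picked out by `row_idx` are copied.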

In [None]:
ln.subset(file, query_obs=query_string)

```{tip}

Set `use_concat=True` to return a single concatenated AnnData object (runs [`anndata.concat`](https://anndata.readthedocs.io/en/latest/generated/anndata.concat.html) under the hood).

See an example in (B) Lazy query expressions.

```

### (B) Lazy query expressions

Lazy selectors allow convenient subsetting with complicated conditions.

Operators, methods and numpy functions are supported.
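
As a rough guide to what such an expression selects, the sketch below writes a comparable condition as an eager pandas boolean mask on a toy `.obs` table. It assumes that a lazy expression mirrors the semantics of the corresponding pandas and NumPy operations; the table and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `.obs` (values are made up).
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.02, 0.07, 0.01, 0.04],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# A method call (`isin`) combined with a comparison operator via `&`.
mask = obs["cell_type"].isin(("Dendritic cells", "CD14+ Monocytes")) & (
    obs["percent_mito"] <= 0.05
)
subset = obs[mask]

# NumPy functions apply element-wise to a column. `log1p` is monotonic,
# so this threshold selects the same rows as `percent_mito <= 0.05`.
low_mito = np.log1p(obs["percent_mito"]) <= np.log1p(0.05)
assert low_mito.equals(obs["percent_mito"] <= 0.05)

print(subset["cell_type"].value_counts())
```

In this toy table, the CD14+ monocyte is dropped because its `percent_mito` exceeds the threshold, so only the two dendritic cells remain.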

In [None]:
query_string = ln.lazy.cell_type.isin(("Dendritic cells", "CD14+ Monocytes")) & (
    ln.lazy.percent_mito <= 0.05
)

In [None]:
adata = ln.subset(file, query_obs=query_string, use_concat=True)

In [None]:
adata

Let's now check whether the returned AnnData contains only the queried categories:

In [None]:
adata.obs["cell_type"].value_counts()