# Filtering SpatialData elements with Table Queries

## Introduction

The `spatialdata` framework supports both the representation of `SpatialElement`s (images, labels, points, shapes) and of annotations for these elements. As we explored in the [tables](./tables.ipynb) notebook, some types of `SpatialElement`s can contain annotations within themselves, but the general approach we take is to represent `SpatialElement`s and annotations in separate objects using `AnnData` tables.

In this notebook we introduce **table queries** - a filtering mechanism that allows you to subset both the annotations (tables) and their corresponding spatial elements using an expressive query syntax. This functionality is provided by the `filter_by_table_query()` function, which uses the [`annsel`](https://github.com/srivarra/annsel) library for building query expressions. Under the hood, `annsel` uses [`narwhals`](https://narwhals-dev.github.io/narwhals/), an "*extremely lightweight and extensible compatibility layer between dataframe libraries*". This notebook assumes that you have familiarized yourself with the content of the [tables](./tables.ipynb) notebook.

## Setup and Data Loading

Let's start by importing the necessary libraries and loading the example blobs dataset.

In [1]:
from pathlib import Path

import annsel as an
import numpy as np

import spatialdata as sd
from spatialdata.datasets import blobs

blobs_sdata = blobs()
blobs_sdata



SpatialData object
├── Images
│     ├── 'blobs_image': DataArray[cyx] (3, 512, 512)
│     └── 'blobs_multiscale_image': DataTree[cyx] (3, 512, 512), (3, 256, 256), (3, 128, 128)
├── Labels
│     ├── 'blobs_labels': DataArray[yx] (512, 512)
│     └── 'blobs_multiscale_labels': DataTree[yx] (512, 512), (256, 256), (128, 128)
├── Points
│     └── 'blobs_points': DataFrame with shape: (<Delayed>, 4) (2D points)
├── Shapes
│     ├── 'blobs_circles': GeoDataFrame shape: (5, 2) (2D shapes)
│     ├── 'blobs_multipolygons': GeoDataFrame shape: (2, 1) (2D shapes)
│     └── 'blobs_polygons': GeoDataFrame shape: (5, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (26, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_image (Images), blobs_multiscale_image (Images), blobs_labels (Labels), blobs_multiscale_labels (Labels), blobs_points (Points), blobs_circles (Shapes), blobs_multipolygons (Shapes), blobs_polygons (Shapes)

The table in the blobs dataset is rather minimal, so we will artificially add a few columns (`cell_type`, `cell_type_granular` and `area`) to help illustrate the functionality.

In [2]:
rng = np.random.default_rng(123456)

blobs_sdata.tables["table"].obs["cell_type"] = rng.choice(
    ["A", "B", "C", "C", "AA", "BB", "CC"], size=blobs_sdata.tables["table"].n_obs
)
blobs_sdata.tables["table"].obs["cell_type_granular"] = rng.choice(
    ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"], size=blobs_sdata.tables["table"].n_obs
)
blobs_sdata.tables["table"].obs["area"] = rng.choice(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100], size=blobs_sdata.tables["table"].n_obs
)

## Basic Filtering Examples

Now let's explore how to filter our blobs `SpatialData` object using table queries.

The most common use case is to filter based on observations (`obs`):

In [3]:
blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name="table", obs_expr=an.col("cell_type") == "A")
blobs_sdata_filtered



SpatialData object
├── Labels
│     └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
      └── 'table': AnnData (5, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_labels (Labels)

In [4]:
print(
    f"\nObservations reduced from {blobs_sdata.tables['table'].n_obs} to {blobs_sdata_filtered.tables['table'].n_obs}"
)


Observations reduced from 26 to 5


### Breaking Down `an.col("cell_type") == "A"`



**What is `an.col("cell_type")`?**

`an.col("cell_type")` creates a reference to the `cell_type` column. The reference itself does not specify whether the column lives in `obs` or `var`; passing it to the `obs_expr` argument tells the function to filter the `obs` component of the `AnnData` table based on this column. Think of it as saying "I want to work with the `cell_type` column".


**What does `== "A"` do?**

The equality operator `== "A"` applies a comparison operator to that column reference, creating a boolean condition that will be `True` for rows where cell_type equals "A" and `False` everywhere else.

**Why This Syntax Design?**

Expressions like this are evaluated by `narwhals` under the hood. If you have a keen eye, you may notice that the syntax resembles Polars: the Narwhals API follows the ergonomics of Polars as closely as it can.
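To make the idea of a deferred expression concrete, here is a minimal sketch in plain Python. The `Col` and `Predicate` classes are hypothetical toys written for illustration; they are not how `annsel` or `narwhals` are implemented, but they mimic the same pattern: building the expression does no work, and the boolean mask only appears once the expression is evaluated against a table.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Predicate:
    """A deferred row-wise condition (hypothetical toy, not the annsel API)."""

    column: str
    test: Callable[[Any], bool]

    def evaluate(self, table: dict) -> list:
        # One boolean per row, so the mask has the same length as the table.
        return [self.test(value) for value in table[self.column]]


class Col:
    """A column reference that does not know which table it will be applied to."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Predicate:  # type: ignore[override]
        # Overloading `==` returns a Predicate instead of a bool.
        return Predicate(self.name, lambda value: value == other)


# A plain dict of lists stands in for the `obs` dataframe.
obs = {"cell_type": ["A", "B", "AA", "A", "C"]}

expr = Col("cell_type") == "A"  # nothing is computed yet
mask = expr.evaluate(obs)  # the comparison runs here
kept_rows = [i for i, keep in enumerate(mask) if keep]
print(mask, kept_rows)
```

Because the expression is only a description of the condition, the same object could be evaluated against any table that has a `cell_type` column.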


Let's take a look at another example. This time we want to select the observations that belong to the `blobs_labels` region.

In [5]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    obs_expr=an.col("region") == "blobs_labels",
)
blobs_sdata_filtered



SpatialData object
├── Labels
│     └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
      └── 'table': AnnData (26, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_labels (Labels)

Since all the observations in the table belong to the `blobs_labels` element, the query returns a table with all 26 observations. Among the `SpatialElement`s, however, only the `blobs_labels` element is kept.



You can also filter based on numeric values, as you'd expect.

In [6]:
blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name="table", obs_expr=an.col("instance_id") <= 10)
blobs_sdata_filtered



SpatialData object
├── Labels
│     └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
      └── 'table': AnnData (9, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_labels (Labels)

In [7]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata, table_name="table", obs_expr=an.col("instance_id").is_in([1, 3, 5, 8, 13])
)
blobs_sdata_filtered



SpatialData object
├── Labels
│     └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
      └── 'table': AnnData (5, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_labels (Labels)

## Supported Operators and Expressions

- `an.col("column_name")` - reference a column in `obs` or `var`
  - *Note:* Can be multiple columns, `an.col(["column_name1", "column_name2"])`
- Special "columns":
  - `an.obs_names` - reference observation names (row indices, aka `AnnData.obs_names`)
  - `an.var_names` - reference variable names (column names, aka `AnnData.var_names`)
- Comparison operators:
  - `>`, `>=`, `<`, `<=`, `==`, `!=`
- Membership:
  - `.is_in([list])`
- String methods:
  - `.str.contains()`, `.str.starts_with()`, `.str.ends_with()`
- Logical:
  - `&` (and), `|` (or), `~` (not)

Any expression can be used as long as it does not perform an aggregation under the hood or change the length of the column.

For a full list of supported operators and expressions, see the corresponding [narwhals documentation](https://narwhals-dev.github.io/narwhals/api-reference/expr/).
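The length rule can be illustrated without any dataframe library. In this sketch, a plain Python list stands in for an `obs` column and `is_valid_mask` is a helper made up for illustration. A row-wise comparison yields one boolean per observation and can therefore serve as a filter mask, while an aggregation collapses the column to a single value and cannot.

```python
area = [10, 80, 40, 100, 20]

# Row-wise comparison: one boolean per observation, usable as a filter mask.
row_wise_mask = [value > 30 for value in area]

# Aggregation: collapses the whole column to a single number.
mean_area = sum(area) / len(area)


def is_valid_mask(candidate, n_rows):
    """Return True if `candidate` holds exactly one boolean per row."""
    return (
        isinstance(candidate, list)
        and len(candidate) == n_rows
        and all(isinstance(flag, bool) for flag in candidate)
    )


print(is_valid_mask(row_wise_mask, len(area)))  # True
print(is_valid_mask(mean_area, len(area)))  # False
```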

We can also combine multiple expressions per table component (`obs`, `var`, etc.).

Here we will select observations whose cell type starts with `"A"`, or whose `cell_type_granular` is in `["A", "B", "C"]`.

In [8]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    obs_expr=((an.col("cell_type").str.starts_with("A")) | (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered



SpatialData object
├── Labels
│     └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
      └── 'table': AnnData (16, 3)
with coordinate systems:
    ▸ 'global', with elements:
        blobs_labels (Labels)

There are two ways to use "and" operators in table queries:

1. Using `&` operator between two expressions
2. Using a tuple of expressions

In [9]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    obs_expr=((an.col("cell_type").str.starts_with("A")), (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered.tables["table"].obs



Unnamed: 0,instance_id,region,cell_type,cell_type_granular,area
18,18,blobs_labels,AA,C,80
19,19,blobs_labels,A,C,60
26,26,blobs_labels,AA,A,100


In [10]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    obs_expr=((an.col("cell_type").str.starts_with("A")) & (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered.tables["table"].obs



Unnamed: 0,instance_id,region,cell_type,cell_type_granular,area
18,18,blobs_labels,AA,C,80
19,19,blobs_labels,A,C,60
26,26,blobs_labels,AA,A,100
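The identical outputs of the two cells suggest that a tuple of expressions is treated as a logical "and" of its members. Here is a minimal sketch of that equivalence in plain Python; the lists are made-up values standing in for the two `obs` columns, and nothing here calls `annsel`.

```python
cell_type = ["AA", "A", "B", "AA", "C"]
cell_type_granular = ["C", "D", "A", "A", "B"]

# One boolean mask per condition, each with one entry per row.
starts_with_a = [value.startswith("A") for value in cell_type]
in_abc = [value in {"A", "B", "C"} for value in cell_type_granular]

# Form 1: combine the two masks pairwise, as `&` would.
mask_and = [left and right for left, right in zip(starts_with_a, in_abc)]

# Form 2: collect the conditions in a tuple and require all of them per row.
conditions = (starts_with_a, in_abc)
mask_tuple = [all(flags) for flags in zip(*conditions)]

print(mask_and == mask_tuple)  # True
```

The tuple form scales more comfortably when there are many conditions, since no nesting of parentheses is needed.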


In [11]:
blobs_sdata_filtered.tables["table"].var_names

Index(['channel_0_sum', 'channel_1_sum', 'channel_2_sum'], dtype='object')

Suppose that the variable `channel_0_sum` is of interest to you, specifically the observations whose value for it is greater than 125. We can filter on that column of the `X` matrix using `x_expr`.

In [12]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    x_expr=an.col("channel_0_sum") > 125,
)
blobs_sdata_filtered.tables["table"].obs



Unnamed: 0,instance_id,region,cell_type,cell_type_granular,area
1,1,blobs_labels,A,F,20
2,2,blobs_labels,AA,F,10
3,3,blobs_labels,BB,C,80
4,4,blobs_labels,C,E,10
5,5,blobs_labels,C,B,50
6,6,blobs_labels,A,D,60
8,8,blobs_labels,A,G,30
9,9,blobs_labels,CC,H,50
10,10,blobs_labels,B,I,50
13,13,blobs_labels,C,A,40


And of course you can combine different filters across different `AnnData` Table components.

In [13]:
blobs_sdata_filtered = sd.filter_by_table_query(
    blobs_sdata,
    table_name="table",
    obs_expr=an.col("cell_type") == "B",
    x_expr=an.col("channel_0_sum") > 125,
)
blobs_sdata_filtered.tables["table"].obs



Unnamed: 0,instance_id,region,cell_type,cell_type_granular,area
10,10,blobs_labels,B,I,50
16,16,blobs_labels,B,C,90


## Using a Real Dataset

To wrap up the notebook, we'll briefly apply table queries to a real dataset.

Here we'll take a look at querying the [mibitof dataset](https://spatialdata.scverse.org/en/stable/tutorials/notebooks/datasets/README.html). In addition there is a companion notebook 

In [14]:
mibitof_zarr_path = Path("~/Downloads/mibitof.zarr").expanduser()

mibitof_sdata = sd.read_zarr(mibitof_zarr_path)
mibitof_sdata



SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof.zarr
├── Images
│     ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│     ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│     └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_image (Images), point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_image (Images), point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_image (Images), point23_labels (Labels)

Let's also take a brief look at the `obs` component of the `AnnData` table. Here are a few columns of interest:

- `point`: This is the name of the Field of View (FOV) that an observation belongs to (in this case it's cells)
- `cell_size`: The area of a cell
- `donor`: The donor that the cell is from
- `Cluster`: The cluster / cell type that the cell belongs to
- `batch`: The batch that the cell is from (usually with respect to the donor or point / FOV)
- `library_id`: An identifier pointing to which `SpatialElement` the observation belongs to.

In [15]:
mibitof_sdata.tables["table"].obs

Unnamed: 0,row_num,point,cell_id,X1,center_rowcoord,center_colcoord,cell_size,category,donor,Cluster,batch,library_id
9376-1,9479,8,2,65222.0,37.0,6.0,474.0,carcinoma,90de,Epithelial,1,point8_labels
9377-1,9480,8,4,65224.0,314.0,3.0,126.0,carcinoma,90de,Epithelial,1,point8_labels
9378-1,9481,8,5,65225.0,407.0,6.0,398.0,carcinoma,90de,Epithelial,1,point8_labels
9379-1,9482,8,6,65226.0,439.0,20.0,1749.0,carcinoma,90de,Epithelial,1,point8_labels
9380-1,9483,8,7,65227.0,479.0,6.0,407.0,carcinoma,90de,Imm_other,1,point8_labels
...,...,...,...,...,...,...,...,...,...,...,...,...
4270-0,4322,23,1479,61793.0,519.0,1018.0,125.0,carcinoma,21d7,Tcell_CD4,0,point23_labels
4271-0,4323,23,1480,61794.0,929.0,1018.0,190.0,carcinoma,21d7,Imm_other,0,point23_labels
4272-0,4324,23,1481,61795.0,999.0,1019.0,173.0,carcinoma,21d7,Imm_other,0,point23_labels
4273-0,4325,23,1482,61796.0,322.0,1018.0,181.0,carcinoma,21d7,Myeloid_CD11c,0,point23_labels


In this example, we're picking donor "21d7" and keeping `vars` that either start with `"CD"` or are `"ASCT2"` or `"ATP5A"`.

In [16]:
mibitof_sdata_filtered = sd.filter_by_table_query(
    mibitof_sdata,
    # filter_tables=False,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(an.var_names.is_in(["ASCT2", "ATP5A"]) | an.var_names.str.starts_with("CD")),
)
mibitof_sdata_filtered



SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (1241, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)

If your `SpatialData` object has a lot of `SpatialElements` and you only want to apply the filter to a subset of them, you can use the `element_names` parameter to specify which ones to use for the filter.

As a final example, let's take it up a few notches and use most of the features of `filter_by_table_query`. We will also use the method version of the query instead of the function. They behave the same way, except that the method version operates on the `SpatialData` object it is called on.


We'll be subsetting specific `SpatialElements` and applying filters across the `obs`, `var`, and `X` components of the `AnnData` table with a variety of queries.

In [17]:
mibitof_sdata_filtered = mibitof_sdata_filtered.filter_by_table_query(
    table_name="table",
    element_names=["point23_labels", "point8_labels"],
    # Filter observations (obs) based on multiple conditions
    obs_expr=(
        # Note: `&` binds more tightly than `|`, so this reads as
        # (donor AND size AND Epithelial) OR Tcell.
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell"
        | (an.col("Cluster").str.contains("Tcell"))
    ),
    # Filter variables (var) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    how="right",
)
mibitof_sdata_filtered



SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (103, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)
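One detail worth checking in a long `obs_expr` such as the one in this cell is operator precedence: in Python, `&` binds more tightly than `|`, so `a & b & c | d` is parsed as `(a & b & c) | d`. The sketch below uses plain booleans (hypothetical values for a single cell) to show how the grouping changes the result.

```python
# Hypothetical values for one cell: wrong donor, small, not Epithelial, but a T cell.
donor_ok = False
large = False
epithelial = False
tcell = True

# Without extra parentheses, `&` is applied before `|`.
ungrouped = donor_ok & large & epithelial | tcell

# The same expression with the implicit grouping written out.
implicit = (donor_ok & large & epithelial) | tcell

# Grouping the two cluster conditions explicitly gives a different answer.
grouped = donor_ok & large & (epithelial | tcell)

print(ungrouped, implicit, grouped)  # True True False
```

If the intent is "donor AND size AND (Epithelial OR Tcell)", wrapping the two `Cluster` conditions in their own parentheses is the way to say so.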

To wrap up, there are a few things to note:

1. **NOTE:** `SpatialElements` are filtered, but the components within those elements are not.
   1. For example, when we're filtering by the `obs` table and we get a subset of the Label `SpatialElement`, the individual segmentation masks are not modified, they will have the exact same masks as the original Label `SpatialElement`.
2. A layer of a given `AnnData` table can be used by specifying the `layer` parameter in the `filter_by_table_query` function.
3. You can use either the method or the function, they behave exactly the same.
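Regarding the first note: if you also need the label image itself to contain only the selected instances, you have to mask it yourself. The following is a minimal sketch and not part of `spatialdata`: nested lists stand in for a small label array, `keep_ids` plays the role of the `instance_id` values that survived the query, and 0 is assumed to be the background label. With NumPy arrays, the same idea is usually written with `np.isin`.

```python
# A tiny stand-in for a label image; each integer is an instance id, 0 is background.
labels = [
    [0, 1, 1, 0],
    [2, 2, 0, 3],
    [0, 3, 3, 3],
]

# Hypothetical instance ids kept by a table query.
keep_ids = {1, 3}

# Set every pixel whose id was filtered out to the background value.
masked = [[value if value in keep_ids else 0 for value in row] for row in labels]

for row in masked:
    print(row)
```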