# All of Annsel

This notebook tells you just about everything you need to use `annsel`. It's a good starting point to get a feel for the package.

:::{note}
:class: dropdown

You should be familiar with [`AnnData`](https://anndata.readthedocs.io/en/latest/) beforehand.
:::

## Set up Data

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import annsel as an

We will load a leukemic bone marrow cytometry dataset :cite:p:`triana_single-cell_2021`. You can view the dataset on [CellxGene](https://cellxgene.cziscience.com/e/b3a5a10f-b1cb-4e8e-abce-bf345448625b.cxg/).


In [None]:
adata = an.datasets.leukemic_bone_marrow_dataset()

In [None]:
adata = an.datasets.focal_cortical_dysplasia_dataset()

In [None]:
adata

In [None]:
adata.an.filter(obs=an.col(), var=an.col(), x=an.col())

In [None]:
for group in adata.an.group_by(an.obs_col(["Cell_label"])):
    print(group)

Importing `annsel` will automatically register the `AnnData` accessors. You can access them with `anndata.AnnData.an`.

The accessor allows you to perform operations on `AnnData` objects with respect to columns and indices.


You can write expressions over the following:
1. Observation columns: `an.obs_col()`
2. Variable columns: `an.var_col()`
3. Observation names: `an.obs_names()`
4. Variable names in the context of the `var` DataFrame: `an.var_names()`
5. Variable names in the context of the `X` matrix: `an.x()`


These can be combined with the following accessor methods:
1. Filtering: `adata.an.filter()`
2. Selection: `adata.an.select()`

You'll find that there are many familiar expressions which you can use. View the supported [`Narwhals` `Expr`](https://narwhals-dev.github.io/narwhals/api-completeness/expr/) methods for a full list.

## Filter


Suppose we only want to select variables which are protein-coding genes. We can use the `var_col` callable to filter the `AnnData` object to the variables where `"feature_type"` is `"protein_coding"`.


In [None]:
adata.an.filter(
    an.var_col(["feature_type"]) == "protein_coding",
)

:::{note}
:class: dropdown

This is equivalent to:
```python

adata[:, adata.var["feature_type"] == "protein_coding"]
```
:::

If we want multiple feature types, we can use the method [`is_in`](https://narwhals-dev.github.io/narwhals/api-reference/expr/#narwhals.Expr.is_in). We can also combine multiple predicates into a single predicate with boolean operators such as `|` (or) and `&` (and).


In [None]:
adata.an.filter(
    an.var_col(["feature_type"]).is_in(["protein_coding", "lncRNA"])
    | an.var_col(["feature_name"]).is_in(["IGHD", "IGHM", "IGKC"])
)

:::{note}
:class: dropdown

This is equivalent to:
```python

adata[
    :,
    adata.var["feature_type"].isin(["protein_coding", "lncRNA"])
    | adata.var["feature_name"].isin(["IGHD", "IGHM", "IGKC"]),
]
```
:::
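To make the effect of `|` concrete, here is a minimal sketch of the same two predicates on a toy `var` table in plain pandas. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `adata.var`; names and types are illustrative only.
var = pd.DataFrame(
    {
        "feature_name": ["IGHD", "GENE_A", "GENE_B", "IGKC"],
        "feature_type": ["IG_C_gene", "protein_coding", "miRNA", "IG_C_gene"],
    },
    index=["v0", "v1", "v2", "v3"],
)

by_type = var["feature_type"].isin(["protein_coding", "lncRNA"])
by_name = var["feature_name"].isin(["IGHD", "IGHM", "IGKC"])

# `|` keeps a variable when either predicate holds; `&` requires both.
keep_either = by_type | by_name
keep_both = by_type & by_name

kept = var.index[keep_either].tolist()
print(kept)
```

In this toy table no row satisfies both predicates, so swapping `|` for `&` would select nothing.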

Let's filter the dataset by cell label using `obs_col`.



In [None]:
adata.an.filter(an.obs_col(["Cell_label"]) == "Lymphomyeloid prog")

:::{note}
:class: dropdown

This is equivalent to:
```python

adata[adata.obs["Cell_label"] == "Lymphomyeloid prog", :]
```
:::

We can pass multiple predicates to filter by both `obs` and `var` at once. Predicates on the same axis are combined with a logical AND.

In [None]:
adata.an.filter(
    an.obs_col(["Cell_label"]) == "Lymphomyeloid prog",
    an.var_col(["feature_type"]).is_in(["protein_coding"]),
    an.var_col(["vst.mean"]) >= 0.5,
    an.obs_col(["sex"]) == "male",
)

:::{note}
:class: dropdown

This is equivalent to:

```python
adata[
    (adata.obs["Cell_label"] == "Lymphomyeloid prog") & (adata.obs["sex"] == "male"),
    (adata.var["feature_type"].isin(["protein_coding"])) & (adata.var["vst.mean"] >= 0.5),
]
```
:::
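The equivalence above relies on two steps: predicates are grouped by axis, and each group is reduced to one boolean mask. A minimal sketch with toy tables and a small dense matrix, using only pandas and NumPy (all values are hypothetical; this illustrates the indexing, not how `annsel` is implemented):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `adata.obs`, `adata.var`, and `adata.X`.
obs = pd.DataFrame(
    {
        "Cell_label": ["Lymphomyeloid prog", "Lymphomyeloid prog", "Other"],
        "sex": ["male", "female", "male"],
    },
    index=["c0", "c1", "c2"],
)
var = pd.DataFrame(
    {
        "feature_type": ["protein_coding", "protein_coding", "protein_coding", "lncRNA"],
        "vst.mean": [0.7, 0.2, 0.9, 0.8],
    },
    index=["g0", "g1", "g2", "g3"],
)
X = np.arange(12).reshape(3, 4)

# One mask per axis; predicates on the same axis are AND-ed together.
obs_mask = ((obs["Cell_label"] == "Lymphomyeloid prog") & (obs["sex"] == "male")).to_numpy()
var_mask = (var["feature_type"].isin(["protein_coding"]) & (var["vst.mean"] >= 0.5)).to_numpy()

# np.ix_ builds the row-by-column selection from the two masks.
sub = X[np.ix_(obs_mask, var_mask)]
print(sub.shape)
```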


We can also filter by `var_names` and `obs_names`. Passing `copy=True` returns a copy instead of a view of the original `AnnData` object.


In [None]:
adata.an.filter(an.var_names().str.starts_with("ENSG0000018"), an.obs_names().str.ends_with("1"), copy=True)

:::{note}
:class: dropdown

This is equivalent to:

```python
adata[adata.obs_names.str.endswith("1"), adata.var_names.str.startswith("ENSG0000018")].copy()
```
:::
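The name-based predicates reduce to boolean masks over the two indices. A minimal pandas sketch with made-up cell barcodes and Ensembl-style IDs (hypothetical values, chosen only to show which names match):

```python
import pandas as pd

# Hypothetical names standing in for `adata.var_names` and `adata.obs_names`.
var_names = pd.Index(["ENSG00000180001", "ENSG00000181111", "ENSG00000206560"])
obs_names = pd.Index(["AAAC-1", "AAAG-2", "AAAT-1"])

# String methods on an Index return boolean arrays usable as masks.
var_mask = var_names.str.startswith("ENSG0000018")
obs_mask = obs_names.str.endswith("1")

kept_vars = var_names[var_mask].tolist()
kept_obs = obs_names[obs_mask].tolist()
print(kept_vars, kept_obs)
```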


## Select

We can also select columns from `X`, `var`, and `obs`.

In [None]:
adata.an.select(an.obs_col(["Cluster_ID"]), an.var_col(["feature_type"]), an.x(["ENSG00000206560"]))

In [None]:
import itertools
import narwhals as nw
import anndata as ad
from narwhals.typing import IntoExpr, IntoDataFrame


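@nw.narwhalify
def _groupby_obs(df: IntoDataFrame, col: IntoExpr):
    # Returns a Narwhals group-by object (not an `Expr`); iterating it yields (key, frame) pairs.
    return df.group_by(col)


@nw.narwhalify
def _groupby_var(df: IntoDataFrame, col: IntoExpr):
    # Same as `_groupby_obs`, applied to the `var` DataFrame.
    return df.group_by(col)


# Yield one AnnData slice for every (obs group, var group) combination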
def _groupby(adata: ad.AnnData, obs_col: IntoExpr, var_col: IntoExpr):
    obs_groups = _groupby_obs(adata.obs, obs_col)
    var_groups = _groupby_var(adata.var, var_col)

    # Create generator of all combinations
    for (_obs_name, obs_data), (_var_name, var_data) in itertools.product(obs_groups, var_groups):
        # Get the indices and slice the AnnData
        obs_idx = obs_data.to_native().index
        var_idx = var_data.to_native().index
        yield adata[obs_idx, var_idx]


_groupby_obs(adata.obs, nw.col("Cluster_ID"))


# Usage:
# for group in _groupby(adata, obs_col=["Cluster_ID", "Cell_label"], var_col=["feature_type"]):
#     print(group)

In [None]:
an.obs_col(["Cluster_ID", "Cell_label"])

In [None]:
list(adata.an.group_by(an.obs_col(["Cluster_ID", "Cell_label"])))
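The `_groupby` helper in the cells above pairs every observation group with every variable group. A minimal sketch of that cross product on toy tables, using pandas `groupby` in place of Narwhals (the labels are hypothetical, and this is not a description of how `adata.an.group_by` works internally):

```python
import itertools

import pandas as pd

# Toy stand-ins for `adata.obs` and `adata.var`.
obs = pd.DataFrame({"Cluster_ID": ["c1", "c1", "c2"]}, index=["cell_0", "cell_1", "cell_2"])
var = pd.DataFrame(
    {"feature_type": ["protein_coding", "lncRNA", "protein_coding"]},
    index=["g0", "g1", "g2"],
)

# Map each (obs key, var key) pair to the row and column labels it selects.
pairs = {}
for (obs_key, obs_df), (var_key, var_df) in itertools.product(
    obs.groupby("Cluster_ID"), var.groupby("feature_type")
):
    pairs[(obs_key, var_key)] = (list(obs_df.index), list(var_df.index))

print(len(pairs))
```

Each value in `pairs` holds the labels that could be passed to `adata[obs_idx, var_idx]` to slice out that group.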

## Pipe


You can also use the `pipe` method to chain multiple operations together.


In [None]:
import scanpy as sc

adata.an.filter(
    an.obs_col(["Cell_label"]) == "Lymphomyeloid prog",
    an.var_col(["feature_type"]).is_in(["protein_coding"]),
    copy=True,
).an.pipe(sc.pl.pca, color=["Cluster_ID"])
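The call above passes a function and keyword arguments to `pipe`. This matches the convention of pandas `DataFrame.pipe`, where the object is handed to the function as its first argument. Assuming `an.pipe` follows the same convention (an assumption, not confirmed here), the idea can be sketched with pandas alone; `count_label` is a hypothetical helper:

```python
import pandas as pd


def count_label(df, label, column="Cell_label"):
    """Hypothetical helper: count rows whose `column` equals `label`."""
    return int((df[column] == label).sum())


obs = pd.DataFrame({"Cell_label": ["A", "B", "A"]})

# `obs.pipe(f, *args, **kwargs)` is the same as `f(obs, *args, **kwargs)`.
piped = obs.pipe(count_label, "A")
direct = count_label(obs, "A")
print(piped, direct)
```

Writing the call as a `pipe` keeps a chain reading left to right, which is why it pairs well with `filter(...)` in the cell above.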

In [None]:
adata.an.select(an.var_col(["feature_type"]), an.x(["ENSG00000206560"]))

In [None]:
adata.obs