# Ontology mapping

Ontologies are structured and standardized representations of knowledge in a specific domain, defining the concepts, relationships, and properties within that domain. They are essential for perturbation analysis as they provide a common vocabulary and framework for organizing and integrating perturbation data.

pertpy is compatible with [bionty-base](https://github.com/laminlabs/bionty-base), which provides access to public ontologies and functionality to map values against them.

## Setup

If you don't yet have bionty-base installed, install it with `pip install bionty-base`.

In [1]:
import anndata as ad
import numpy as np
import pandas as pd

## Dataset

Create an AnnData object with gene names in Ensembl notation in `var` and cell line annotations in the `obs` slot.

In [2]:
adata = ad.AnnData(
    X=np.random.default_rng().random((3, 3)),
    var=pd.DataFrame(
        index=[
            "ENSG00000148584",
            "ENSG00000121410",
            "ENSGcorrupted",
        ]
    ),
    obs=pd.DataFrame(
        columns=["cell lines"],
        data=[
            "HEK293",
            "JURKAT",
            "THP-1 cell",
        ],
    ),
)
adata



AnnData object with n_obs × n_vars = 3 × 3
    obs: 'cell lines'

In [3]:
adata.obs

,cell lines
0,HEK293
1,JURKAT
2,THP-1 cell


## Introduction to bionty-base

First we import bionty-base.

In [4]:
import bionty_base as bt

❗ You are running 3.11.4
Only python versions 3.8~3.10 are currently tested, use at your own risk.
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!

if you see this message repeatedly, run: bt.reset_sources()


Let's look at all available ontologies.

In [5]:
bt.display_available_sources()

entity,source,organism,version,url,md5,source_name,source_website
Organism,ensembl,vertebrates,release-110,https://ftp.ensembl.org/pub/release-110/specie...,f3faf95648d3a2b50fd3625456739706,Ensembl,https://www.ensembl.org
Organism,ensembl,vertebrates,release-109,https://ftp.ensembl.org/pub/release-109/specie...,7595bb989f5fec07eaca5e2138f67bd4,Ensembl,https://www.ensembl.org
Organism,ensembl,vertebrates,release-108,https://ftp.ensembl.org/pub/release-108/specie...,d97c1ee302e4072f5f5c7850eff0b642,Ensembl,https://www.ensembl.org
Organism,ensembl,bacteria,release-57,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacte...,ee28510ed5586ea7ab4495717c96efc8,Ensembl,https://www.ensembl.org
Organism,ensembl,fungi,release-57,http://ftp.ensemblgenomes.org/pub/fungi/releas...,dbcde58f4396ab8b2480f7fe9f83df8a,Ensembl,https://www.ensembl.org
Organism,ensembl,metazoa,release-57,http://ftp.ensemblgenomes.org/pub/metazoa/rele...,424636a574fec078a61cbdddb05f9132,Ensembl,https://www.ensembl.org
Organism,ensembl,plants,release-57,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plant...,eadaa1f3e527e4c3940c90c7fa5c8bf4,Ensembl,https://www.ensembl.org
Organism,ncbitaxon,all,2023-06-20,s3://bionty-assets/df_all__ncbitaxon__2023-06-...,00d97ba65627f1cd65636d2df22ea76c,NCBItaxon Ontology,https://github.com/obophenotype/ncbitaxon
Gene,ensembl,human,release-110,s3://bionty-assets/df_human__ensembl__release-...,832f3947e83664588d419608a469b528,Ensembl,https://www.ensembl.org
Gene,ensembl,human,release-109,s3://bionty-assets/human_ensembl_release-109_G...,72da9968c74e96d136a489a6102a4546,Ensembl,https://www.ensembl.org


Bionty provides three key functionalities, exposed as methods on the entity objects shown below:

1. `inspect`: Check whether our values (here cell lines and genes) can be mapped against a specified ontology.
2. `standardize`: Convert values, including known synonyms, to the standardized names of the ontology.
3. `validate`: Strictly check values against the ontology to ensure compliance.
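
To make these three operations concrete, here is a minimal sketch in plain Python. It does not call Bionty: `TOY_ONTOLOGY` and the three helper functions are hypothetical stand-ins that only mimic the described behaviour on a small hand-written reference table.

```python
# Hypothetical toy reference table: standardized name -> known synonyms.
# Only the JURKAT synonym is taken from this tutorial; other synonyms are omitted.
TOY_ONTOLOGY = {
    "HEK293": set(),
    "JURKAT cell": {"JURKAT"},
    "THP-1 cell": set(),
}


def strictly_validate(values):
    """Return one boolean per value: True only for exact standardized names."""
    return [value in TOY_ONTOLOGY for value in values]


def map_synonyms_to_names(values):
    """Replace known synonyms with their standardized name; keep other values."""
    synonym_to_name = {
        synonym: name
        for name, synonyms in TOY_ONTOLOGY.items()
        for synonym in synonyms
    }
    return [synonym_to_name.get(value, value) for value in values]


def inspect_values(values):
    """Summarize which values validate and which are fixable via synonyms."""
    flags = strictly_validate(values)
    not_validated = [v for v, ok in zip(values, flags) if not ok]
    fixable = [v for v in not_validated if map_synonyms_to_names([v])[0] != v]
    return {"validated": flags, "not_validated": not_validated, "synonyms": fixable}


values = ["HEK293", "JURKAT", "THP-1 cell"]
report = inspect_values(values)
print(report)
print(map_synonyms_to_names(values))
```

In this toy setting, `JURKAT` fails strict validation but is reported as a known synonym, which is the situation the cell line example later in this tutorial runs into.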

## Mapping against the Cell Line Ontology with Bionty

We will now showcase how to access the [cell line ontology](https://www.ebi.ac.uk/ols4/ontologies/clo) with Bionty. The Cell Line Ontology (CLO) aims to harmonize cell line definitions across the world.

Bionty is centered around entity objects that provide the functionality introduced above. We create a Bionty `CellLine` object with the Cell Line Ontology as our source and pin a specific version for reproducibility.

### Cell lines

In [6]:
cell_line_bt = bt.CellLine(source="clo", version="2022-03-21")
cell_line_bt

PublicOntology
Entity: CellLine
Organism: all
Source: clo, 2022-03-21
#terms: 39037

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

We can access the DataFrame that contains all ontology terms:

In [7]:
cell_line_bt.df()

ontology_id,name,definition,synonyms,parents
CLO:0000000,cell line cell culturing,a maintaining cell culture process that keeps ...,,[]
CLO:0000001,cell line cell,A cultured cell that is part of a cell line - ...,,[]
CLO:0000002,suspension cell line culturing,suspension cell line culturing is a cell line ...,,[CLO:0000000]
CLO:0000003,adherent cell line culturing,adherent cell line culturing is a cell line cu...,,[CLO:0000000]
CLO:0000004,cell line cell modification,a material processing that modifies an existin...,,[]
...,...,...,...,...
CLO:0051617,RCB0187 cell,A immortal medaka cell line cell that has the ...,RCB0187|OLHE-131,[CLO:0009822]
CLO:0051618,RCB2945 cell,A immortal medaka cell line cell that has the ...,RCB2945|DIT29,[CLO:0009822]
CLO:0051619,RCB0184 cell,A immortal medaka cell line cell that has the ...,OLF-136|RCB0184,[CLO:0009822]
CLO:0051620,RCB0188 cell,A immortal medaka cell line cell that has the ...,RCB0188|OLME-104,[CLO:0009822]
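
The `parents` column links each term to its direct parents, so the hierarchy can be walked upwards. The following is a minimal sketch in plain Python, assuming the parent lists have been copied into a hypothetical `PARENTS` dictionary; the `EXAMPLE:0000001` entry is invented purely to show a two-step walk and is not a CLO term.

```python
# Direct parents for a few terms. The CLO rows are copied from the table above.
PARENTS = {
    "CLO:0000000": [],
    "CLO:0000002": ["CLO:0000000"],
    "CLO:0000003": ["CLO:0000000"],
    "EXAMPLE:0000001": ["CLO:0000002"],  # hypothetical term, not part of CLO
}


def ancestors(term_id, parents=PARENTS):
    """Collect all ancestors of a term by following parent links upwards."""
    seen = []
    stack = list(parents.get(term_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another path
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


print(ancestors("CLO:0000002"))
print(ancestors("EXAMPLE:0000001"))
```

With the real table, the same dictionary could be built from the `parents` column of `cell_line_bt.df()`.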


Let's inspect all of our cell lines to learn whether they can be mapped against the ontology using the `name` field:

In [8]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)

✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: JURKAT
   detected 1 terms with synonym: JURKAT
→  standardize terms via .standardize()


,__validated__
HEK293,True
JURKAT,False
THP-1 cell,True


We observe that `JURKAT` is not validated against the `name` field of the Cell Line Ontology, although it is detected as a synonym. Hence, we create a lookup object and try to find JURKAT cells in the ontology with auto-complete.

In [9]:
cell_line_bt_lookup = cell_line_bt.lookup()

In [10]:
cell_line_bt_lookup.jurkat_cell

CellLine(ontology_id='CLO:0007043', name='JURKAT cell', definition='an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell', synonyms='JURKAT', parents=array(['CLO:0000523'], dtype=object), _5='jurkat cell')

In [11]:
cell_line_bt_lookup.jurkat_cell.name

'JURKAT cell'

In [12]:
cell_line_bt_lookup.jurkat_cell.definition

'an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell'

Indeed, we find that the standardized name of the cells is `JURKAT cell`.
Let's rename it.

In [13]:
# Assign the result back; inplace=True on a selected column may not update adata.obs
adata.obs["cell lines"] = adata.obs["cell lines"].replace({"JURKAT": "JURKAT cell"})
adata.obs["cell lines"]

0         HEK293
1    JURKAT cell
2     THP-1 cell
Name: cell lines, dtype: object

In [14]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)

✅ 3 terms (100.00%) are validated for name


,__validated__
HEK293,True
JURKAT cell,True
THP-1 cell,True


Now all terms are validated.

We could have also used the search functionality to find the match for JURKAT cells:

In [15]:
cell_line_bt.search("JURKAT").head()

name,ontology_id,definition,synonyms,parents,__agg__,__ratio__
JURKAT cell,CLO:0007043,an immortalized human T lymphocyte cell that w...,JURKAT,[CLO:0000523],jurkat cell,100.0
RCB0806 cell,CLO:0050978,A immortal human blood cell line cell that has...,RCB0806|Jurkat,[CLO:0000617],rcb0806 cell,100.0
+/+ (A) cell,CLO:0001020,,+/+ (A),[CLO:0000019],+/+ (a) cell,90.0
Jurkat J6 cell,CLO:0007044,,Jurkat J6,[CLO:0000019],jurkat j6 cell,90.0
U cell,CLO:0009449,,U,[CLO:0000466],u cell,90.0
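
The `__ratio__` column is a similarity score between the query and each candidate. As a rough illustration only, the sketch below ranks a few names with Python's `difflib`. This is an assumption made for demonstration, not the scorer Bionty uses, so the scores will differ from those in the table above.

```python
from difflib import SequenceMatcher

# A few names taken from this tutorial; the list itself is a hypothetical stand-in
# for the full ontology table.
CANDIDATES = ["JURKAT cell", "Jurkat J6 cell", "U cell", "HEK293"]


def rank(query, candidates):
    """Sort candidates by case-insensitive similarity to the query (0 to 100)."""
    scored = [
        (name, 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank("JURKAT", CANDIDATES)
for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

Lowercasing both sides mirrors the lowercase `__agg__` column shown above, so `Jurkat` and `JURKAT` are treated alike.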


The same workflow can be applied to genes.

### Genes

In [16]:
gene_bt = bt.Gene()
gene_bt

PublicOntology
Entity: Gene
Organism: human
Source: ensembl, release-110
#terms: 75719

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

In [17]:
gene_bt.inspect(adata.var_names, gene_bt.ensembl_gene_id)

✅ 2 terms (66.70%) are validated for ensembl_gene_id
❗ 1 term (33.30%) is not validated for ensembl_gene_id: ENSGcorrupted


<lamin_utils._inspect.InspectResult at 0x762c61ff7950>

`ENSGcorrupted` is not a valid Ensembl gene ID and should therefore also be corrected.
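
A cheap format check can catch identifiers like `ENSGcorrupted` before they are compared against the ontology. The sketch below assumes the usual shape of human Ensembl gene IDs: `ENSG` followed by 11 digits, optionally with a version suffix. The pattern and the helper `malformed_gene_ids` are illustrative and not part of Bionty. The check only tests the format and does not replace `inspect`, because a well-formed ID may still be absent from a given Ensembl release.

```python
import re

# Assumed pattern for human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version".
ENSEMBL_HUMAN_GENE_ID = re.compile(r"ENSG\d{11}(\.\d+)?")


def malformed_gene_ids(gene_ids):
    """Return the IDs that do not match the assumed Ensembl gene ID format."""
    return [g for g in gene_ids if ENSEMBL_HUMAN_GENE_ID.fullmatch(g) is None]


var_names = ["ENSG00000148584", "ENSG00000121410", "ENSGcorrupted"]
print(malformed_gene_ids(var_names))  # ['ENSGcorrupted']
```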