# Adding custom cell types to the Cell Ontology
This notebook demonstrates how to extend the Cell Ontology with a custom cell type for use in popV.

First, we download cl.json from the Cell Ontology.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Download cl.json from the OBO page.
!mkdir -p new_ontology
!wget http://purl.obolibrary.org/obo/cl/cl.json -O new_ontology/cl.json

--2024-12-15 00:52:50--  http://purl.obolibrary.org/obo/cl/cl.json
Resolving purl.obolibrary.org (purl.obolibrary.org)... 104.18.37.59, 172.64.150.197, 2606:4700:4400::6812:253b, ...
Connecting to purl.obolibrary.org (purl.obolibrary.org)|104.18.37.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json [following]
--2024-12-15 00:52:50--  https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/obophenotype/cell-ontology/releases/download/v2024-09-26/cl.json [following]
--2024-12-15 00:52:50--  https://github.com/obophenotype/cell-ontology/releases/download/v2024-09-26/cl.json
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302

## Edit the ontology file

We first read the ontology JSON file and display the label and definition stored for an existing cell type.

In [6]:
import json

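with open("new_ontology/cl.json") as f:
    # Keep the full document; the ontology graph is cell_ontology["graphs"][0].
    cell_ontology = json.load(f)

In [11]:
popv_dict = {}
# Keep only labeled class nodes, i.e. the actual ontology terms.
popv_dict["nodes"] = [
    entry
    for entry in cell_ontology["graphs"][0]["nodes"]
    if entry["type"] == "CLASS" and entry.get("lbl", False)
]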
popv_dict["lbl_sentence"] = {
    entry["lbl"]: f"{entry['lbl']}: {entry.get('meta', {}).get('definition', {}).get('val', '')}"
    for entry in popv_dict["nodes"]
}
popv_dict["lbl_sentence"]["T cell"]

'T cell: A type of lymphocyte whose defining characteristic is the expression of a T cell receptor complex.'
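It can also help to check how many nodes and edges a graph contains before editing it. The following is a minimal sketch on a toy graph that mimics the layout of cl.json; `summarize_graph` and `toy_graph` are hypothetical names used for illustration and are not part of popV.

```python
def summarize_graph(graph):
    """Return basic counts for one graph dict with 'nodes' and 'edges' lists."""
    nodes = graph.get("nodes", [])
    edges = graph.get("edges", [])
    # Same filter as used for popv_dict: labeled CLASS nodes only.
    labeled_classes = [n for n in nodes if n.get("type") == "CLASS" and n.get("lbl")]
    return {
        "n_nodes": len(nodes),
        "n_edges": len(edges),
        "n_labeled_classes": len(labeled_classes),
    }


# Toy graph for illustration only; the real graph is the first entry of "graphs" in cl.json.
toy_graph = {
    "nodes": [
        {"id": "http://purl.obolibrary.org/obo/CL_0000084", "lbl": "T cell", "type": "CLASS"},
        {"id": "http://purl.obolibrary.org/obo/CL_0000542", "lbl": "lymphocyte", "type": "CLASS"},
        {"id": "http://purl.obolibrary.org/obo/BFO_0000050", "lbl": "part of", "type": "PROPERTY"},
    ],
    "edges": [
        {
            "sub": "http://purl.obolibrary.org/obo/CL_0000084",
            "pred": "is_a",
            "obj": "http://purl.obolibrary.org/obo/CL_0000542",
        },
    ],
}

summary = summarize_graph(toy_graph)
print(summary)
```

On the downloaded file, the same helper would be called on the first element of the "graphs" list.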

Our custom cell type does not exist in the ontology yet.

In [12]:
popv_dict["lbl_sentence"].get("specialized T cell", "No definition found")

'No definition found'

An arbitrary existing node shows how cell types are described and how we need to structure our new entry.

In [24]:
cell_ontology["graphs"][0]["nodes"][1000]

{'id': 'http://purl.obolibrary.org/obo/CL_0000871',
 'lbl': 'splenic macrophage',
 'type': 'CLASS',
 'meta': {'definition': {'val': 'A secondary lymphoid organ macrophage found in the spleen.',
   'xrefs': ['GO_REF:0000031', 'PMID:15771589', 'PMID:16322748']},
  'comments': ['Role or process: immune, clearance of apoptotic and senescent cells.'],
  'xrefs': [{'val': 'FMA:83026'}]}}
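Before appending a new entry, it may be worth checking that it carries the fields this notebook reads: `id`, `lbl`, `type` and `meta.definition.val`. The sketch below assumes these are the relevant fields, based on the filtering shown earlier; `validate_node` is a hypothetical helper, not a popV function.

```python
def validate_node(entry):
    """Return a list of problems in a candidate node entry; an empty list means it looks usable."""
    problems = []
    for key in ("id", "lbl"):
        if not entry.get(key):
            problems.append(f"missing {key}")
    if entry.get("type") != "CLASS":
        problems.append("type is not CLASS")
    # The definition text is what ends up in the label sentence.
    definition = entry.get("meta", {}).get("definition", {}).get("val", "")
    if not definition:
        problems.append("missing meta.definition.val")
    return problems


good_entry = {
    "id": "CL:0200000",
    "lbl": "specialized T cell",
    "type": "CLASS",
    "meta": {"definition": {"val": "A T cell that has a specific function in the immune system."}},
}
bad_entry = {"id": "CL:0200001", "type": "CLASS"}

good_problems = validate_node(good_entry)
bad_problems = validate_node(bad_entry)
print(good_problems)
print(bad_problems)
```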

In [26]:
cell_ontology["graphs"][0]["nodes"].append(
    {
        "id": "CL:0200000",
        "lbl": "specialized T cell",
        "type": "CLASS",
        "meta": {"definition": {"val": "A T cell that has a specific function in the immune system."}},
    }
)  # All other fields are not used in popV.

In [27]:
cell_ontology["graphs"][0]["edges"][1000]

{'sub': 'http://purl.obolibrary.org/obo/CL_0000510',
 'pred': 'is_a',
 'obj': 'http://purl.obolibrary.org/obo/CL_0002563'}

In [49]:
cell_ontology["graphs"][0]["edges"].append(
    {
        "sub": "CL:0200000",  # new specialized T cell
        "pred": "is_a",
        "obj": "http://purl.obolibrary.org/obo/CL_0000084",  # T cell
    }
)

In [53]:
cell_ontology["graphs"][0]["edges"][-1]

{'sub': 'CL:0200000',
 'pred': 'is_a',
 'obj': 'http://purl.obolibrary.org/obo/CL_0000084'}
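The two append steps (one node, one `is_a` edge) can be wrapped in a single function that also guards against duplicates and unknown parents. This is a sketch under the assumption that a node plus one `is_a` edge is all that is required; `add_cell_type` is a hypothetical name and the toy graph stands in for the real one.

```python
def add_cell_type(graph, node_id, label, definition, parent_id):
    """Append a CLASS node and an is_a edge to its parent, refusing duplicates and unknown parents."""
    ids = {node["id"] for node in graph["nodes"]}
    labels = {node.get("lbl") for node in graph["nodes"]}
    if parent_id not in ids:
        raise KeyError(f"parent {parent_id!r} not found in graph")
    if node_id in ids or label in labels:
        raise ValueError(f"{node_id!r} or {label!r} already exists")
    graph["nodes"].append(
        {
            "id": node_id,
            "lbl": label,
            "type": "CLASS",
            "meta": {"definition": {"val": definition}},
        }
    )
    graph["edges"].append({"sub": node_id, "pred": "is_a", "obj": parent_id})
    return graph


T_CELL = "http://purl.obolibrary.org/obo/CL_0000084"
toy_graph = {"nodes": [{"id": T_CELL, "lbl": "T cell", "type": "CLASS"}], "edges": []}

add_cell_type(
    toy_graph,
    node_id="CL:0200000",
    label="specialized T cell",
    definition="A T cell that has a specific function in the immune system.",
    parent_id=T_CELL,
)

# A second call with the same id is rejected instead of creating a duplicate.
try:
    add_cell_type(toy_graph, "CL:0200000", "specialized T cell", "duplicate", T_CELL)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(len(toy_graph["nodes"]), toy_graph["edges"][-1], duplicate_rejected)
```

Rejecting duplicates matters in a notebook, where re-running a cell would otherwise append the same node and edge again.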

In [54]:
with open("new_ontology/cl_modified.json", "w") as f:
    json.dump(cell_ontology, f)
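A quick round-trip check can confirm that the written file contains the new label. The sketch below assumes the saved document keeps the top-level "graphs" list; `save_and_check` is a hypothetical helper, and a temporary directory replaces the new_ontology folder.

```python
import json
import tempfile
from pathlib import Path


def save_and_check(ontology, path, expected_label):
    """Write the ontology as JSON, reload it, and report whether expected_label is present."""
    path = Path(path)
    path.write_text(json.dumps(ontology), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))
    labels = {node.get("lbl") for graph in reloaded["graphs"] for node in graph["nodes"]}
    return expected_label in labels


# Toy document for illustration only.
toy_ontology = {
    "graphs": [
        {
            "nodes": [{"id": "CL:0200000", "lbl": "specialized T cell", "type": "CLASS"}],
            "edges": [],
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "cl_modified.json"
    found = save_and_check(toy_ontology, target, "specialized T cell")
    absent = save_and_check(toy_ontology, target, "no such cell")

print(found, absent)
```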

We now create the ontology resource files that popV needs.

In [58]:
from popv import create_ontology_resources

# Build the popV resources from the modified ontology saved above.
create_ontology_resources("new_ontology/cl_modified.json")

Use pytorch device_name: cuda
Load pretrained SentenceTransformer: all-mpnet-base-v2
Batches: 100%|██████████| 507/507 [00:13<00:00, 38.54it/s]


## Run popV

popV needs additional files, namely a dictionary and sentence embeddings of the Cell Ontology terms. The popV helper function creates these files in the same folder as the cl.json file.

In [2]:
import sys

sys.path.insert(0, "popv")

In [6]:
from popv import create_ontology_resources

create_ontology_resources("resources/ontology/cl.json")

Use pytorch device_name: cuda
Load pretrained SentenceTransformer: all-mpnet-base-v2
Batches: 100%|██████████| 507/507 [00:13<00:00, 38.61it/s] 


In [7]:
import scanpy as sc

In [8]:
query_adata = sc.read_h5ad("resources/dataset/test/lca_subset.h5ad")
ref_adata = sc.read_h5ad("resources/dataset/test/ts_lung_subset.h5ad")

In [9]:
# Add our new cell-type label to the reference dataset.
# ref_adata.obs['cell_ontology_class'] = ref_adata.obs['cell_ontology_class'].replace('CD4-positive, alpha-beta T cell', 'my special tcell')
# The newer Cell Ontology release renames the lung epithelial cell terms; the old names are kept as synonyms.
ref_adata.obs["cell_ontology_class"] = ref_adata.obs["cell_ontology_class"].replace(
    "type II pneumocyte", "pulmonary alveolar type 2 cell"
)
ref_adata.obs["cell_ontology_class"] = ref_adata.obs["cell_ontology_class"].replace(
    "type I pneumocyte", "pulmonary alveolar type 1 cell"
)

ref_adata.obs["cell_ontology_class"].value_counts()

cell_ontology_class
macrophage                                  370
pulmonary alveolar type 2 cell              247
basal cell                                   60
non-classical monocyte                       34
capillary endothelial cell                   33
club cell                                    32
classical monocyte                           27
basophil                                     23
CD4-positive, alpha-beta T cell              20
respiratory goblet cell                      18
lung ciliated cell                           15
vein endothelial cell                        14
lung microvascular endothelial cell          14
CD8-positive, alpha-beta T cell              12
fibroblast                                   11
intermediate monocyte                         9
adventitial cell                              9
endothelial cell of artery                    8
pulmonary alveolar type 1 cell                8
neutrophil                                    7
dendritic cell      
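Every reference label has to match a label in the ontology, so it can be useful to list labels that still do not resolve after renaming. The following is a plain-Python sketch on toy inputs; `unresolved_labels` is a hypothetical helper, and in practice the label lists would come from `ref_adata.obs["cell_ontology_class"]` and the ontology nodes.

```python
def unresolved_labels(reference_labels, ontology_labels, renames=None):
    """Apply renames to the reference labels and return those still absent from the ontology."""
    renames = renames or {}
    resolved = {renames.get(label, label) for label in reference_labels}
    return sorted(resolved - set(ontology_labels))


# Toy inputs for illustration only.
reference_labels = ["type II pneumocyte", "type I pneumocyte", "macrophage", "my special tcell"]
ontology_labels = {
    "pulmonary alveolar type 2 cell",
    "pulmonary alveolar type 1 cell",
    "macrophage",
}
renames = {
    "type II pneumocyte": "pulmonary alveolar type 2 cell",
    "type I pneumocyte": "pulmonary alveolar type 1 cell",
}

before = unresolved_labels(reference_labels, ontology_labels)
after = unresolved_labels(reference_labels, ontology_labels, renames)
print(before)
print(after)
```

Any label left in the second list either needs a rename or a new ontology entry.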

In [10]:
ref_adata.write_h5ad("resources/dataset/test/ts_lung_subset.h5ad")

In [None]:
from popv.preprocessing import Process_Query

adata = Process_Query(
    query_adata,
    ref_adata,
    query_labels_key=None,
    query_batch_key=None,
    ref_labels_key="cell_ontology_class",
    ref_batch_key=None,
    unknown_celltype_label="unknown",
    save_path_trained_models="test",
    # cl_obo_folder="resources/ontology",
    cl_obo_folder=[
        "new_ontology/cl_popv.json",
        "new_ontology/cl.ontology",
        "new_ontology/cl.ontology.nlp.emb",
    ],  # Point to new files.
    prediction_mode="retrain",
    n_samples_per_label=20,
    hvg=1000,
).adata

Sampling 20 cells per label
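Because `cl_obo_folder` is given as a list of three file paths here, a missing file is easy to overlook. The sketch below checks for the files before preprocessing starts; `missing_files` is a hypothetical helper, and a temporary directory stands in for new_ontology.

```python
import tempfile
from pathlib import Path


def missing_files(paths):
    """Return the names of the given paths that do not exist as regular files."""
    return [Path(p).name for p in paths if not Path(p).is_file()]


with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    # Create only two of the three expected files to show a failing check.
    (folder / "cl_popv.json").write_text("{}")
    (folder / "cl.ontology").write_text("")
    expected = [
        folder / "cl_popv.json",
        folder / "cl.ontology",
        folder / "cl.ontology.nlp.emb",
    ]
    missing = missing_files(expected)

print(missing)
```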


In [None]:
from popv.annotation import annotate_data

annotate_data(
    adata,
)

  0%|          | 0/9 [00:00<?, ?it/s]Saving celltypist results to adata.obs["popv_celltypist_prediction"]


🍳 Preparing data before training
🔬 Input data has 334 cells and 1000 genes
⚖️ Scaling input data
🏋️ Training data using logistic regression
✅ Model training done!
🔬 Input data has 2000 cells and 1000 genes
🔗 Matching reference genes in the model
🧬 1000 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
🗳️ Majority voting the predictions
✅ Majority voting done!
 11%|█         | 1/9 [00:00<00:03,  2.47it/s]Integrating data with bbknn




Saving knn on bbknn results to adata.obs["popv_knn_on_bbknn_prediction"]
BBKNN found only 7 neighbors. Reduced neighbors in KNN.
Saving UMAP of bbknn results to adata.obs["X_bbknn_umap_popv"]
Using UMAP instead of RAPIDS as high number of batches leads to OOM.
 22%|██▏       | 2/9 [00:21<01:27, 12.50s/it]Integrating data with harmony


	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).


Saving knn on harmony results to adata.obs["popv_knn_on_harmony_prediction"]


	Completed 3 / 10 iteration(s).
Reach convergence after 3 iteration(s).


Saving UMAP of harmony results to adata.obs["X_umap_harmony_popv"]
 33%|███▎      | 3/9 [00:37<01:26, 14.36s/it]Integrating data with scanorama


Found 1000 genes among all datasets


Saving knn on scanorama results to adata.obs["popv_knn_on_scanorama_prediction"]


[[0.    0.906]
 [0.    0.   ]]
Processing datasets (0, 1)


Saving UMAP of scanorama results to adata.obs["X_umap_scanorma_popv"]
 44%|████▍     | 4/9 [00:42<00:53, 10.68s/it]Integrating data with scvi
Training scvi offline.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Epoch 20/20: 100%|██████████| 20/20 [00:13<00:00,  1.46it/s, v_num=1, train_loss_step=753, train_loss_epoch=1.03e+3]    

`Trainer.fit` stopped: `max_epochs=20` reached.



Saving knn on scvi results to adata.obs["popv_knn_on_scvi_prediction"]





Saving UMAP of scvi results to adata.obs["X_scvi_umap_popv"]
 56%|█████▌    | 5/9 [01:01<00:53, 13.44s/it]Computing Onclass. Storing prediction in adata.obs["popv_onclass_prediction"]
I0000 00:00:1734255148.140749  319561 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20811 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
I0000 00:00:1734255148.165931  319561 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20811 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
I0000 00:00:1734255148.166839  319561 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled


Training cost after epoch 1: loss:14.985920 acc: 0.105 auc: 0.639 auprc: 0.054
Training cost after epoch 2: loss:13.875673 acc: 0.219 auc: 0.799 auprc: 0.143
Training cost after epoch 3: loss:13.127899 acc: 0.338 auc: 0.886 auprc: 0.250
Training cost after epoch 4: loss:12.366190 acc: 0.449 auc: 0.944 auprc: 0.402
Training cost after epoch 5: loss:11.744126 acc: 0.539 auc: 0.972 auprc: 0.619
Training cost after epoch 6: loss:11.237657 acc: 0.611 auc: 0.984 auprc: 0.733
Training cost after epoch 7: loss:10.819752 acc: 0.692 auc: 0.992 auprc: 0.865
Training cost after epoch 8: loss:10.409714 acc: 0.734 auc: 0.997 auprc: 0.921
Training cost after epoch 9: loss:10.058737 acc: 0.781 auc: 0.998 auprc: 0.959
Training cost after epoch 10: loss:9.771173 acc: 0.820 auc: 0.999 auprc: 0.982
Training cost after epoch 11: loss:9.456179 acc: 0.859 auc: 1.000 auprc: 0.995
Training cost after epoch 12: loss:9.184665 acc: 0.877 auc: 1.000 auprc: 1.000
Training cost after epoch 13: loss:8.951223 acc: 0.9

 67%|██████▋   | 6/9 [01:49<01:15, 25.20s/it]Computing random forest classifier. Storing prediction in adata.obs["popv_rf_prediction"]
 78%|███████▊  | 7/9 [01:50<00:34, 17.46s/it]Integrating data with scANVI


INFO     File test/scvi/model.pt already downloaded
INFO     Training for 20 epochs.


GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Epoch 20/20: 100%|██████████| 20/20 [00:37<00:00,  1.88s/it, v_num=1, train_loss_step=702, train_loss_epoch=968]      

`Trainer.fit` stopped: `max_epochs=20` reached.



Saving scanvi label prediction to adata.obs["popv_scanvi_prediction"]
Saving UMAP of scanvi results to adata.obs["X_scanvi_umap_popv"]





 89%|████████▉ | 8/9 [02:30<00:24, 24.59s/it]Computing support vector machine. Storing prediction in adata.obs["popv_svm_prediction"]
100%|██████████| 9/9 [02:31<00:00, 16.78s/it]
Using predictions ['popv_celltypist_prediction', 'popv_knn_on_bbknn_prediction', 'popv_knn_on_harmony_prediction', 'popv_knn_on_scanorama_prediction', 'popv_knn_on_scvi_prediction', 'popv_onclass_prediction', 'popv_rf_prediction', 'popv_scanvi_prediction', 'popv_svm_prediction'] for PopV consensus
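To illustrate what combining the listed prediction columns into a consensus can mean, here is a plain majority vote per cell. It is a simplified sketch and not popV's implementation, whose consensus may differ, for example by taking the ontology hierarchy into account; `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (label, n_votes) per cell from a dict mapping method name to a list of labels."""
    methods = list(predictions)
    n_cells = len(predictions[methods[0]])
    if any(len(predictions[m]) != n_cells for m in methods):
        raise ValueError("all methods must predict the same number of cells")
    results = []
    for i in range(n_cells):
        votes = Counter(predictions[m][i] for m in methods)
        # most_common(1) returns the label with the highest vote count.
        label, count = votes.most_common(1)[0]
        results.append((label, count))
    return results


# Toy predictions for two cells from three methods; column names mirror the log above.
toy_predictions = {
    "popv_rf_prediction": ["T cell", "macrophage"],
    "popv_svm_prediction": ["T cell", "macrophage"],
    "popv_celltypist_prediction": ["B cell", "macrophage"],
}

consensus = majority_vote(toy_predictions)
print(consensus)
```

The vote count gives a rough agreement score: a cell on which all methods agree is a more trustworthy call than one decided by a narrow margin.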
