# PrimeKG Enrichment and Embedding

In this tutorial, we will explain how to perform multimodal enrichment and embedding of PrimeKG nodes.

We will consider the following node types:
1. Drugs (PubChem/DrugBank/CTD) - TEXT and SMILES
2. Proteins (NCBI/Gene) - TEXT and amino-acid sequence
3. Pathways (Reactome) - TEXT
4. Phenotypes (HPO) - TEXT
5. Protein function (GO) - TEXT
6. Disease (MONDO) - TEXT
7. Anatomy (UBERON) - TEXT

Background information on PrimeKG can be found in the following repositories:
- https://github.com/mims-harvard/PrimeKG
- https://github.com/mims-harvard/TDC/

Note that we use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM

At the time of writing, the latest version of PrimeKG (`kg.csv`) is `2.1`.

First, we import the necessary libraries:

In [17]:
# Import necessary libraries
import sys
import torch
import networkx as nx
from tqdm import tqdm
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.uniprot_proteins import EnrichmentWithUniProt
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.ols_terms import EnrichmentWithOLS
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.reactome_pathways import EnrichmentWithReactome
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.pubchem_strings import EnrichmentWithPubChem
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.huggingface import EmbeddingWithHuggingFace
from aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils import external_id2pubchem_cid

### Check device availability

In [2]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cpu'

### Load the BiomedBERT model

In [3]:
# Using Microsoft's BiomedBERT (the variable keeps the name `biobert_model`)
biobert_model = EmbeddingWithHuggingFace(model_name='microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract',
                                     model_cache_dir="../../../../data/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract/",
                                     truncation=False,
                                     device=device)

### Load PrimeKG

The `PrimeKG` dataset class downloads the data from the Harvard Dataverse server if it is not available locally.

Otherwise, the data is loaded from the local directory given by `local_dir`.

In [4]:
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")

To load the node and edge dataframes from PrimeKG, we call `load_data()` and then the two getter methods, as follows.

In [5]:
# Invoke a method to load the data
primekg_data.load_data()

# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()

Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.


### Check PrimeKG Dataframes

As mentioned before, `primekg_nodes` and `primekg_edges` are the dataframes of nodes and edges, respectively.

We can further analyze the dataframes to extract the information we need.

For instance, we can construct a graph from the nodes and edges dataframes using the networkx library.

#### PrimeKG Nodes

`primekg_nodes` is a dataframe of nodes, which has the following columns:
- `node_index`: the index of the node
- `node_name`: the name of the node
- `node_source`: the source database of the node (e.g. NCBI, GO, MONDO)
- `node_id`: the identifier of the node within its source database
- `node_type`: the type of the node

We can check a sample of the primekg nodes to see the list of nodes in the PrimeKG dataset as follows.

In [6]:
# Check a sample of the primekg nodes
primekg_nodes.head()

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
0,0,PHYHIP,NCBI,9796,gene/protein
1,1,GPANK1,NCBI,7918,gene/protein
2,2,ZRSR2,NCBI,8233,gene/protein
3,3,NRF1,NCBI,4899,gene/protein
4,4,PI4KA,NCBI,5297,gene/protein


The current version of PrimeKG has about 130K nodes in total, as the following cell shows.

In [7]:
# Check dimensions of the primekg nodes
primekg_nodes.shape

(129375, 5)

We can break down the primekg nodes by type as follows.

In [8]:
# Show node types and their counts
primekg_nodes['node_type'].value_counts()

node_type
biological_process    28642
gene/protein          27671
disease               17080
effect/phenotype      15311
anatomy               14035
molecular_function    11169
drug                   7957
cellular_component     4176
pathway                2516
exposure                818
Name: count, dtype: int64

PrimeKG was built from various sources, as the counts per node source show.

In [9]:
# Show source of the primekg nodes
primekg_nodes['node_source'].value_counts()

node_source
GO               43987
NCBI             27671
MONDO            15813
HPO              15311
UBERON           14035
DrugBank          7957
REACTOME          2516
MONDO_grouped     1267
CTD                818
Name: count, dtype: int64

In [10]:
primekg_nodes[primekg_nodes['node_source'] == 'CTD']
# primekg_edges.head()

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
61677,61677,1-hydroxyphenanthrene,CTD,C092102,exposure
61678,61678,1-hydroxypyrene,CTD,C033146,exposure
61679,61679,1-naphthol,CTD,C029350,exposure
61680,61680,"2,2',3',4,4',5-hexachlorobiphenyl",CTD,C029790,exposure
61681,61681,"2,2',3,5,5',6-hexachlorobiphenyl",CTD,C066675,exposure
...,...,...,...,...,...
127593,127593,Heptanes,CTD,D006536,exposure
127594,127594,octane,CTD,C026728,exposure
127595,127595,pseudocumene,CTD,C010313,exposure
127596,127596,pentane,CTD,C033353,exposure


In [11]:
test = EnrichmentWithPubChem()

In [12]:
test.enrich_documents(['24667'])

INFO:aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils:Load Hydra configuration for PubChem CID description.


(["Butylated Hydroxyanisole can cause cancer according to The World Health Organization's International Agency for Research on Cancer (IARC)."],
 ['CC(C)(C)C1=C(C=CC(=C1)O)OC.CC(C)(C)C1=C(C=CC(=C1)OC)O'])

### Create a directed graph using the edges

In [13]:
kg = nx.DiGraph()

## Make a KG using the edgelist
G = nx.from_pandas_edgelist(
    primekg_edges,
    source="head_name",
    target="tail_name",
    edge_key="relation",
    # edge_attr=["edge_id", "edge_type", "feature_value", "feature_id"],
    create_using=nx.DiGraph(),
)
kg = nx.compose(G, kg)

### Add additional node attributes (e.g. source, id and type)

In [14]:
# Start by slicing the dataframe to include only the head nodes
df_head_nodes = primekg_edges[['head_name', 'head_source', 'head_id', 'head_type']]
# Rename the columns
df_head_nodes = df_head_nodes.rename(columns={
    'head_name': 'node_name',
    'head_source': 'node_source',
    'head_id': 'node_id',
    'head_type': 'node_type'
})
# Set the node_name as index
df_head_nodes = df_head_nodes.set_index('node_name')
# Add the additional attributes to graph
G.add_nodes_from((n, dict(d)) for n, d in df_head_nodes.iterrows())
# Recompose the graph
kg = nx.compose(G, kg)

# CTD enrichment
We will map CTD IDs to their corresponding PubChem IDs, and extract their descriptions and SMILES representation using EnrichmentWithPubChem.

In [15]:
from dataclasses import dataclass
# Create a dataclass to hold the node attributes
@dataclass
class PubChemAttr:
    """Dataclass to hold the attributes of a node."""
    pubchem_cid: str
    name: str
    # Make description optional
    # If not provided, it will be set to None
    description: str = None
    smiles: str = None
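
The dataclass above keeps the enrichment results for one node together before they are written back to the graph. As a minimal sketch (not part of the original notebook), the variant below spells out the optional fields with `Optional[str]` and converts a record into an attribute dictionary. The names `NodeEnrichment` and `to_node_attrs` are hypothetical and used only for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NodeEnrichment:
    """Hypothetical stand-in for PubChemAttr with explicit Optional hints."""
    pubchem_cid: str
    name: str
    description: Optional[str] = None
    smiles: Optional[str] = None


def to_node_attrs(record: NodeEnrichment) -> dict:
    """Return the filled-in fields as a dict, leaving out the node name."""
    return {
        key: value
        for key, value in asdict(record).items()
        if key != "name" and value is not None
    }


# Toy record: only the description has been fetched so far
record = NodeEnrichment(pubchem_cid="24667", name="example node")
record.description = "Example description"
attrs = to_node_attrs(record)
```

A dictionary like `attrs` has the same shape as the attribute dictionaries that the later cells pass to `G.add_nodes_from`.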


## Iterate over the CTD IDs and fetch their descriptions and SMILES representations

In [23]:
list_pubchem_attrs = []
# For the sake of space and time, we will enrich only the first 2 CTD nodes
# Create the PubChem enrichment object
pubchem_obj = EnrichmentWithPubChem()
pubchem_cids = []
count = 0
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_source') != 'CTD':
        continue
    count += 1
    # Get the node attributes
    node_attr = kg.nodes[node]
    # Convert the CTD ID into a PubChem CID
    pubchem_cid = external_id2pubchem_cid('Comparative Toxicogenomics Database', node_attr.get('node_id'))
    # Save all PubChem CIDs
    pubchem_cids.append(pubchem_cid)
    # Create a PubChemAttr object
    pubchem_attr = PubChemAttr(
        pubchem_cid=pubchem_cid,
        name=node
    )
    list_pubchem_attrs.append(pubchem_attr)
    if count == 2:
        break

# Enrich PubChem attr
for pubchem_attr in list_pubchem_attrs:
    # Fetch descriptions and SMILES representation
    description, smiles = pubchem_obj.enrich_documents([pubchem_attr.pubchem_cid])
    # Add the description and SMILES to the corresponding PubChemAttr object
    pubchem_attr.description = description[0]
    pubchem_attr.smiles = smiles[0]

  0%|          | 0/129262 [00:00<?, ?it/s]INFO:aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils:Load Hydra configuration for PubChem ID conversion.
  9%|▊         | 11282/129262 [00:00<00:04, 24243.44it/s]INFO:aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils:Load Hydra configuration for PubChem ID conversion.
 14%|█▍        | 18354/129262 [00:01<00:06, 18164.34it/s]
INFO:aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils:Load Hydra configuration for PubChem CID description.
INFO:aiagents4pharma.talk2knowledgegraphs.utils.pubchem_utils:Load Hydra configuration for PubChem CID description.
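
The loop above limits the enrichment to a couple of nodes with a manual `count` and `break`. A minimal sketch of the same selection with a generator and `itertools.islice` is shown below. It runs on a plain dictionary that stands in for `kg.nodes` (which behaves like a mapping from node name to attribute dictionary). The helper name `nodes_from_source` and the toy data are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterator, Mapping


def nodes_from_source(node_attrs: Mapping[str, dict], source: str) -> Iterator[str]:
    """Yield the names of nodes whose 'node_source' attribute equals `source`."""
    for name, attrs in node_attrs.items():
        # Nodes without attributes (e.g. tail-only nodes) return None here
        if attrs.get("node_source") == source:
            yield name


# Toy stand-in for kg.nodes
toy_nodes = {
    "PHYHIP": {"node_source": "NCBI", "node_id": "9796"},
    "1-naphthol": {"node_source": "CTD", "node_id": "C029350"},
    "octane": {"node_source": "CTD", "node_id": "C026728"},
    "pentane": {"node_source": "CTD", "node_id": "C033353"},
    "unlabelled tail node": {},
}

# Take only the first two CTD nodes, without a manual counter
first_two_ctd = list(islice(nodes_from_source(toy_nodes, "CTD"), 2))
```

Because the generator is lazy, iteration stops as soon as the requested number of nodes has been collected, which matches the early `break` in the original loop.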


## Add descriptions to the CTD nodes and recompose the graph

In [24]:
for pubchem_attr in list_pubchem_attrs:
    node = pubchem_attr.name
    description = pubchem_attr.description
    # print (f"node: {node}, description: {description}")
    G.add_nodes_from([(node, {'description': description})])

# Recompose the graph
kg = nx.compose(G, kg)

## To generate embeddings of the SMILES representations, please follow the notebook linked below
https://virtualpatientengine.github.io/AIAgents4Pharma/notebooks/talk2knowledgegraphs/tutorial_primekg_smiles_enrich_embed/

## Generate embedding of CTD descriptions

In [25]:
for i, node in tqdm(enumerate(kg.nodes)):
    node_id = kg.nodes[node].get('node_id')    
    if kg.nodes[node].get('description') is None:
        continue
    print (node)
    desc = kg.nodes[node].get('description')
    # print (desc)
    outputs = biobert_model.embed_documents([desc])
    # print (outputs)
    G.add_nodes_from([(node, {'description_embedding': outputs})])
    # torch.cuda.synchronize()
    # torch.cuda.empty_cache()

# Recompose the graph
kg = nx.compose(G, kg)

11282it [00:00, 93365.28it/s]

DDT
Copper


129262it [00:00, 429209.15it/s]


## Display a DF with results

In [26]:
import pandas as pd
dic = {'node':[],
       'node_source':[],
       'node_id':[],
       'description':[],
       'description_embedding':[]}
for node in tqdm(kg.nodes):
    node_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('description') is None:
        continue
    dic['node'].append(node)
    dic['node_source'].append(kg.nodes[node].get('node_source'))
    dic['node_id'].append(kg.nodes[node].get('node_id'))
    dic['description'].append(kg.nodes[node].get('description'))
    dic['description_embedding'].append(kg.nodes[node].get('description_embedding'))
    # print (node, kg.nodes[node].get('description'), kg.nodes[node].get('sequence'), kg.nodes[node].get('description_embedding'))

df = pd.DataFrame(dic)
df

100%|██████████| 129262/129262 [00:00<00:00, 859282.23it/s]


Unnamed: 0,node,node_source,node_id,description,description_embedding
0,DDT,CTD,D003634,DDT (Dichlorodiphenyltrichloroethane) can caus...,"[[tensor(-0.1798), tensor(0.1512), tensor(0.26..."
1,Copper,CTD,D003300,Copper atom is a copper group element atom and...,"[[tensor(-0.2902), tensor(0.0179), tensor(0.33..."


# Reactome pathway enrichment
We will use Reactome API services to extract textual descriptions of pathways using the EnrichmentWithReactome class.

In [15]:
from dataclasses import dataclass
# Create a dataclass to hold the node attributes
@dataclass
class ReactomeAttr:
    """Dataclass to hold the attributes of a node."""
    pathway_id: str
    name: str
    # Make description optional
    # If not provided, it will be set to None
    description: str = None


## Go iteratively over every pathway and fetch its description

In [20]:
list_reactome_attrs = []
# For the sake of space and time, we will enrich only the first 2 Reactome nodes
# Create the Reactome enrichment object
reactome_obj = EnrichmentWithReactome()
count = 0
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_source') != 'REACTOME':
        continue
    count += 1
    # Get the node attributes
    node_attr = kg.nodes[node]
    # Create a ReactomeAttr object
    reactome_attr = ReactomeAttr(
        pathway_id=node_attr.get('node_id'),
        name=node,
        description=node_attr.get('description')
    )
    list_reactome_attrs.append(reactome_attr)
    if count == 2:
        break
for reactome_attr in list_reactome_attrs:
    # Fetch descriptions
    description = reactome_obj.enrich_documents([reactome_attr.pathway_id])
    # Add descriptions to the corresponding Reactome attributes
    reactome_attr.description = description[0]
print (list_reactome_attrs)

 47%|████▋     | 60717/129262 [00:00<00:00, 1510221.05it/s]
INFO:aiagents4pharma.talk2knowledgegraphs.utils.enrichments.reactome_pathways:Load Hydra configuration for reactome enrichment
INFO:aiagents4pharma.talk2knowledgegraphs.utils.enrichments.reactome_pathways:Load Hydra configuration for reactome enrichment


[ReactomeAttr(pathway_id='R-HSA-8877627', name='Vitamin E', description='Vitamins A, D, E and K are lipophilic compounds, the so-called fat-soluble vitamins. Because of their lipophilicity, fat-soluble vitamins are solubilised and transported by intracellular carrier proteins to exert their actions. Alpha-tocopherol, the main form of vitamin E found in the body, is transported by alpha-tocopherol transfer protein (TTPA) in hepatic cells (Kono & Arai 2015, Schmolz et al. 2016).'), ReactomeAttr(pathway_id='R-HSA-5334118', name='DNA methylation', description='Methylation of cytosine is catalyzed by a family of DNA methyltransferases (DNMTs): DNMT1, DNMT3A, and DNMT3B transfer methyl groups from S-adenosylmethionine to cytosine, producing 5-methylcytosine and homocysteine (reviewed in Klose and Bird 2006, Ooi et al. 2009, Jurkowska et al. 2011, Moore et al. 2013). (DNMT2 appears to methylate RNA rather than DNA.) DNMT1, the first enzyme discovered, preferentially methylates hemimethylated 

## Add descriptions to the Reactome nodes and recompose the graph

In [21]:
for reactome_attr in list_reactome_attrs:
    node = reactome_attr.name
    description = reactome_attr.description
    # print (f"node: {node}, description: {description}")
    G.add_nodes_from([(node, {'description': description})])

# Recompose the graph
kg = nx.compose(G, kg)

## Generate embeddings of descriptions of reactome pathways

In [31]:
for i, node in tqdm(enumerate(kg.nodes)):
    node_id = kg.nodes[node].get('node_id')    
    if kg.nodes[node].get('description') is None:
        continue
    print (node)
    desc = kg.nodes[node].get('description')
    # print (desc)
    outputs = biobert_model.embed_documents([desc])
    # print (outputs)
    G.add_nodes_from([(node, {'description_embedding': outputs})])
    # torch.cuda.synchronize()
    # torch.cuda.empty_cache()

# Recompose the graph
kg = nx.compose(G, kg)

93148it [00:00, 456878.51it/s]

Vitamin E
DNA methylation


129262it [00:00, 531214.21it/s]


## Display the results in a DF

In [32]:
import pandas as pd
dic = {'node':[],
       'node_source':[],
       'node_id':[],
       'description':[],
       'description_embedding':[]}
for node in tqdm(kg.nodes):
    node_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('description') is None:
        continue
    dic['node'].append(node)
    dic['node_source'].append(kg.nodes[node].get('node_source'))
    dic['node_id'].append(kg.nodes[node].get('node_id'))
    dic['description'].append(kg.nodes[node].get('description'))
    dic['description_embedding'].append(kg.nodes[node].get('description_embedding'))
    # print (node, kg.nodes[node].get('description'), kg.nodes[node].get('sequence'), kg.nodes[node].get('description_embedding'))

df = pd.DataFrame(dic)
df

100%|██████████| 129262/129262 [00:00<00:00, 995538.19it/s]


Unnamed: 0,node,node_source,node_id,description,description_embedding
0,Vitamin E,REACTOME,R-HSA-8877627,"Vitamins A, D, E and K are lipophilic compound...","[[tensor(-0.4649), tensor(0.2769), tensor(0.73..."
1,DNA methylation,REACTOME,R-HSA-5334118,Methylation of cytosine is catalyzed by a fami...,"[[tensor(-0.5609), tensor(0.4334), tensor(0.49..."


# OLS terms enrichments

OLS is the Ontology Lookup Service by EMBL/EBI. We will use their API services to extract textual descriptions of the following terms using the EnrichmentWithOLS class.
1. GO
2. HPO
3. UBERON
4. MONDO
5. MONDO_grouped

In [14]:
from dataclasses import dataclass
# Create a dataclass to hold the node attributes
@dataclass
class OLSAttr:
    """Dataclass to hold the attributes of a node."""
    term_id: str
    name: str
    label: str = None
    # Make description optional
    # If not provided, it will be set to None
    description: str = None


In [53]:
# Define a dictionary to store DB name and its OLS code
dic_ols = {
    'GO': 'GO',
    'HPO': 'HP',
    'UBERON': 'UBERON',
    'MONDO': 'MONDO',
}

## Iterate over every DB in OLS and store the results in a list

In [54]:
list_ols_attrs = []
term_ids = []
# For the sake of space and time, we will enrich only the first 2 nodes of each DB
# Create the OLS enrichment object
ols_obj = EnrichmentWithOLS()
for source in ['GO', 'MONDO', 'HPO', 'UBERON']:
    count = 0
    for node in tqdm(kg.nodes):
        if kg.nodes[node].get('node_source') != source:
            continue
        count += 1
        # Get the node attributes
        node_attr = kg.nodes[node]
        # Term ID
        # OLS term must contain 7-digit integer code
        # Hence, prefix with 0s such that total number
        # of characters is 7
        term_id = dic_ols[source] + '_' + str("{:07}".format(int(node_attr.get('node_id'))))
        term_ids.append(term_id)
        # Create a OLSAttr object
        ols_attr = OLSAttr(
            term_id=term_id,
            name=node,
            label=node,
            description=node_attr.get('description')
        )
        list_ols_attrs.append(ols_attr)
        if count == 2:
            break
# Fetch descriptions
descriptions = ols_obj.enrich_documents(term_ids)
# Add descriptions to the corresponding OLS attributes
for ols_attr, description in zip(list_ols_attrs, descriptions):
    ols_attr.description = description

 46%|████▌     | 59435/129262 [00:00<00:00, 1410968.24it/s]
 19%|█▉        | 24751/129262 [00:00<00:00, 1464943.46it/s]
 20%|█▉        | 25525/129262 [00:00<00:00, 1565290.51it/s]
 79%|███████▉  | 101813/129262 [00:00<00:00, 1408816.01it/s]
INFO:aiagents4pharma.talk2knowledgegraphs.utils.enrichments.ols_terms:Load Hydra configuration for OLS enrichments.


True
['Any process that stops, prevents, or reduces the frequency, rate or extent of the directed movement of a neurotransmitter into a neuron or glial cell.']
negative regulation of neurotransmitter uptake
True
['Any process that stops, prevents, or reduces the frequency, rate or extent of the directed movement of serotonin into a cell.']
negative regulation of serotonin uptake
True
['Persistently high systemic arterial blood pressure. Based on multiple readings (blood pressure determination), hypertension is currently defined as when systolic pressure is consistently greater than 140 mm Hg or when diastolic pressure is consistently 90 mm Hg or more.']
hypertensive disorder
True
['A condition that occurs while resting or lying in bed; it is characterized by an irresistible urgency to move the legs to obtain relief from a strange and uncomfortable sensation in the legs.']
restless legs syndrome
True
[]
Graves disease
True
[]
Horner syndrome
True
['Nonsynovial joint in which the articul
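
The term IDs sent to OLS above are built by combining the ontology prefix with the PrimeKG `node_id`, zero-padded to seven digits. A minimal sketch of that formatting step as a standalone helper is given below. The function name `make_ols_term_id` is hypothetical, and the prefix table mirrors `dic_ols`.

```python
# Mapping from PrimeKG node source to OLS prefix (mirrors dic_ols)
OLS_PREFIX = {"GO": "GO", "HPO": "HP", "UBERON": "UBERON", "MONDO": "MONDO"}


def make_ols_term_id(source: str, node_id) -> str:
    """Build an OLS term ID such as 'GO_0051581' from a source and a numeric ID."""
    # int() accepts both '51581' and 51581; ':07d' left-pads with zeros to 7 digits
    return f"{OLS_PREFIX[source]}_{int(node_id):07d}"


examples = [
    make_ols_term_id("GO", 51581),
    make_ols_term_id("HPO", "2277"),
    make_ols_term_id("MONDO", 5044),
]
# examples == ['GO_0051581', 'HP_0002277', 'MONDO_0005044']
```

Note that the HPO source uses the prefix `HP`, which is why a lookup table is needed rather than reusing the source name directly.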

## Repeat the same for MONDO_grouped, except concatenate the descriptions of all IDs in a group

In [55]:
# For the sake of space and time, we will enrich only the first 2 MONDO_grouped nodes
# Create the OLS enrichment object
ols_obj = EnrichmentWithOLS()
count = 0
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_source') != 'MONDO_grouped':
        continue
    count += 1
    # Get the node attributes
    node_attr = kg.nodes[node]
    # MONDO_grouped contains multiple codes
    # separated by a '_'
    # OLS term must contain 7-digit integer code
    # Hence, prefix with 0s such that total number
    # of characters is 7
    codes = node_attr.get('node_id')
    codes = codes.split('_')
    # print (codes)
    term_ids = []
    for code in codes:
        term_id = 'MONDO_' + str("{:07}".format(int(code)))
        term_ids.append(term_id)
    # Create a OLSAttr object
    ols_attr = OLSAttr(
        term_id=node_attr.get('node_id'),
        name=node,
        label=node,
        description=node_attr.get('description')
    )
    # Fetch descriptions
    descriptions = ols_obj.enrich_documents(term_ids)
    # Add descriptions to the corresponding OLS attributes
    ols_attr.description = '\n'.join(descriptions)
    list_ols_attrs.append(ols_attr)
    if count == 2:
        break
print (list_ols_attrs)

  0%|          | 0/129262 [00:00<?, ?it/s]INFO:aiagents4pharma.talk2knowledgegraphs.utils.enrichments.ols_terms:Load Hydra configuration for OLS enrichments.


True
['High blood pressure caused by an underlying medical condition.']
secondary hypertension
True
['Hypertension that presents without an identifiable cause.']
essential hypertension
True
["OBSOLETE. An instance of hypertension that is caused by a modification of the individual's genome."]
obsolete genetic hypertension


 19%|█▉        | 24716/129262 [00:00<00:02, 42644.98it/s]INFO:aiagents4pharma.talk2knowledgegraphs.utils.enrichments.ols_terms:Load Hydra configuration for OLS enrichments.


True
['Increased blood pressure in the portal venous system. It is most commonly caused by cirrhosis. Other causes include portal vein thrombosis, Budd-Chiari syndrome, and right heart failure. Complications include ascites, esophageal varices, encephalopathy, and splenomegaly.']
portal hypertension
True
['A severe medical condition which is estimated to appear in 9-18% of hypertensive patients, in which treatment with 3 or more antihypertensive drugs including diuretics are ineffective.']
resistant hypertension
True
['Any Parkinson disease in which the cause of the disease is a mutation in the LRRK2 gene.']
autosomal dominant Parkinson disease 8
True
['Any Parkinson disease in which the cause of the disease is a mutation in the PARK7 gene.']
autosomal recessive early-onset Parkinson disease 7
True
['Any Parkinson disease in which the cause of the disease is a mutation in the VPS35 gene.']
Parkinson disease 17
True
['A Parkinson disease that begins after around the age of 50.']
late-on

 19%|█▉        | 24752/129262 [00:02<00:09, 11270.94it/s]

True
[]
parkinson disease 12
True
[]
parkinson disease 10
True
[]
parkinson disease 16
[OLSAttr(term_id='GO_0051581', name='negative regulation of neurotransmitter uptake', label='negative regulation of neurotransmitter uptake', description='Any process that stops, prevents, or reduces the frequency, rate or extent of the directed movement of a neurotransmitter into a neuron or glial cell.'), OLSAttr(term_id='GO_0051612', name='negative regulation of serotonin uptake', label='negative regulation of serotonin uptake', description='Any process that stops, prevents, or reduces the frequency, rate or extent of the directed movement of serotonin into a cell.'), OLSAttr(term_id='MONDO_0005044', name='hypertensive disorder', label='hypertensive disorder', description='Persistently high systemic arterial blood pressure. Based on multiple readings (blood pressure determination), hypertension is currently defined as when systolic pressure is consistently greater than 140 mm Hg or when diastolic 
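
For `MONDO_grouped` nodes, the `node_id` packs several MONDO codes into one string separated by underscores, and the cell above concatenates the descriptions of all of them. The sketch below isolates the two string operations involved. The helper names are hypothetical, and skipping empty descriptions is an assumption of this sketch rather than something the original cell does.

```python
from typing import Iterable, List, Optional


def grouped_term_ids(grouped_node_id: str, prefix: str = "MONDO") -> List[str]:
    """Split an ID like '1200_1134' into ['MONDO_0001200', 'MONDO_0001134']."""
    return [f"{prefix}_{int(code):07d}" for code in grouped_node_id.split("_")]


def join_descriptions(descriptions: Iterable[Optional[str]]) -> str:
    """Join descriptions with newlines, skipping empty or missing entries."""
    return "\n".join(d for d in descriptions if d)


# The grouped ID of the 'hypertension' node shown in the results table
ids = grouped_term_ids("1200_1134_15512_5080_100078")
# Toy descriptions, two of them missing
merged = join_descriptions(["First description.", "", None, "Second description."])
```

Filtering out empty entries avoids blank lines in the merged text when one of the grouped terms has no description.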




## Add descriptions to the OLS nodes and recompose the graph

In [56]:
for ols_attr in list_ols_attrs:
    node = ols_attr.name
    description = ols_attr.description
    # print (f"node: {node}, description: {description}")
    G.add_nodes_from([(node, {'description': description})])

# Recompose the graph
kg = nx.compose(G, kg)

## Generate embedding for all the nodes with textual descriptions

In [57]:
for i, node in tqdm(enumerate(kg.nodes)):
    node_id = kg.nodes[node].get('node_id')    
    if kg.nodes[node].get('description') is None:
        continue
    print (node)
    desc = kg.nodes[node].get('description')
    outputs = biobert_model.embed_documents([desc])
    G.add_nodes_from([(node, {'description_embedding': outputs})])
    # torch.cuda.synchronize()
    # torch.cuda.empty_cache()

# Recompose the graph
kg = nx.compose(G, kg)

24716it [00:00, 202653.96it/s]

hypertensive disorder
hypertension
restless legs syndrome
Parkinson disease


59436it [00:00, 125677.84it/s]

Graves disease
Horner syndrome
synostosis
negative regulation of neurotransmitter uptake
negative regulation of serotonin uptake


129262it [00:00, 218477.31it/s]


capsule


## Display the results in a DF

In [58]:
import pandas as pd
dic = {'node':[],
       'node_source':[],
       'node_id':[],
       'description':[],
       'description_embedding':[]}
for node in tqdm(kg.nodes):
    node_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('description') is None:
        continue
    dic['node'].append(node)
    dic['node_source'].append(kg.nodes[node].get('node_source'))
    dic['node_id'].append(kg.nodes[node].get('node_id'))
    dic['description'].append(kg.nodes[node].get('description'))
    dic['description_embedding'].append(kg.nodes[node].get('description_embedding'))
    # print (node, kg.nodes[node].get('description'), kg.nodes[node].get('sequence'), kg.nodes[node].get('description_embedding'))

df = pd.DataFrame(dic)
df

100%|██████████| 129262/129262 [00:00<00:00, 953488.63it/s]


Unnamed: 0,node,node_source,node_id,description,description_embedding
0,hypertensive disorder,MONDO,5044,Persistently high systemic arterial blood pres...,"[[tensor(-0.3774), tensor(0.0026), tensor(0.33..."
1,hypertension,MONDO_grouped,1200_1134_15512_5080_100078,High blood pressure caused by an underlying me...,"[[tensor(-0.2972), tensor(0.1923), tensor(0.38..."
2,restless legs syndrome,MONDO,5391,A condition that occurs while resting or lying...,"[[tensor(-0.0092), tensor(0.1624), tensor(0.47..."
3,Parkinson disease,MONDO_grouped,11764_11658_13625_8199_14604_11613_14233_11562...,Any Parkinson disease in which the cause of th...,"[[tensor(-0.0157), tensor(0.0820), tensor(0.17..."
4,Graves disease,HPO,100647,Graves disease,"[[tensor(-0.6299), tensor(0.1351), tensor(0.77..."
5,Horner syndrome,HPO,2277,Horner syndrome,"[[tensor(-0.5213), tensor(-0.1533), tensor(0.7..."
6,synostosis,UBERON,10361,Nonsynovial joint in which the articulating bo...,"[[tensor(0.0200), tensor(-0.2079), tensor(0.38..."
7,negative regulation of neurotransmitter uptake,GO,51581,"Any process that stops, prevents, or reduces t...","[[tensor(0.0589), tensor(0.3241), tensor(0.057..."
8,negative regulation of serotonin uptake,GO,51612,"Any process that stops, prevents, or reduces t...","[[tensor(-0.0227), tensor(0.3225), tensor(-0.0..."
9,capsule,UBERON,3893,A cover or envelope partly or wholly surroundi...,"[[tensor(-0.0116), tensor(0.2860), tensor(0.49..."


# Protein enrichments
Now, we will enrich the protein nodes with their description and sequence.
We will query UniProt via its API to get the description and sequence. For this, we
will first need to get all the node IDs.

In [11]:
from dataclasses import dataclass
# Create a dataclass to hold the node attributes
@dataclass
class GeneAttr:
    """Dataclass to hold the attributes of a gene node."""
    id: str
    name: str
    # Make description optional
    # If not provided, it will be set to None
    description: str = None
    sequence: str = None


### Get node IDs

In [37]:
# Extract all gene IDs from the graph
dic_gene_ids = {}
for n in tqdm(kg.nodes):
    # Skip any node that is not an NCBI gene/protein node
    if kg.nodes[n].get('node_type') != 'gene/protein' or kg.nodes[n].get('node_source') != 'NCBI':
        continue
    # Get the node attributes
    node_attr = kg.nodes[n]
    # Create a GeneAttr object
    gene_attr = GeneAttr(
        id=node_attr.get('node_id'),
        name=n,
        description=node_attr.get('description'),
        sequence=node_attr.get('sequence')
    )
    # Add the gene_attr object to the dictionary
    dic_gene_ids[node_attr.get('node_id')] = gene_attr
# Check the number of gene IDs
len(dic_gene_ids)

  0%|          | 0/129262 [00:00<?, ?it/s]

100%|██████████| 129262/129262 [00:00<00:00, 1131144.04it/s]


27609

### Submit a job to UniProt to map the Gene ID to its description and sequence

Here we show two ways to get the description and sequence of a gene:
1. Most biomedical graphs provide gene names, which can be used to retrieve the sequence and description using the `EnrichmentWithUniProt` class in the T2KG utils
2. Some graphs, like PrimeKG, also provide gene IDs, which can also be used to retrieve the sequence and description using the snippet defined below (adapted from UniProt)
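
The ID-mapping route is asynchronous: a job is submitted first, and its status is then polled until the results are available. The sketch below illustrates that polling pattern in isolation, with an upper bound on the number of attempts. It is not the UniProt client itself: `wait_until_ready`, its parameters, and the toy status values (`"DONE"` in particular is a placeholder) are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    interval: float = 5.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it returns True and return the number of attempts used.

    Raises TimeoutError if check() is still False after max_attempts calls.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        # Wait before asking again, as a polling client would
        sleep(interval)
    raise TimeoutError(f"Not ready after {max_attempts} attempts")


# Toy job that reports two pending statuses before finishing
statuses = iter(["NEW", "RUNNING", "DONE"])
waits = []  # records the requested sleep durations instead of really sleeping
attempts = wait_until_ready(
    lambda: next(statuses) == "DONE",
    interval=5.0,
    sleep=waits.append,
)
```

Passing the `sleep` function as a parameter keeps the helper testable without real delays. With a real service, `check` would wrap the HTTP status request.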

In [15]:
import json
import time
import zlib
import requests
from requests.adapters import HTTPAdapter, Retry
from urllib.parse import urlparse, parse_qs, urlencode

# Define variables to perform UniProt ID mapping
# Adapted from https://www.uniprot.org/help/id_mapping
API_URL = "https://rest.uniprot.org"
POLLING_INTERVAL = 5
retries = Retry(total=5, backoff_factor=0.25, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

def submit_id_mapping(from_db, to_db, ids) -> str:
    """
    Function to submit a job to perform ID mapping.

    Args:
        from_db (str): The source database.
        to_db (str): The target database.
        ids (list): The list of IDs to map.

    Returns:
        str: The job ID.
    """
    request = requests.post(f"{API_URL}/idmapping/run",
                            data={"from": from_db,
                                  "to": to_db,
                                  "ids": ",".join(ids)},)
    try:
        request.raise_for_status()
    except requests.HTTPError:
        print(request.json())
        raise

    return request.json()["jobId"]

def check_id_mapping_results_ready(job_id):
    """
    Function to check if the ID mapping results are ready.

    Args:
        job_id (str): The job ID.

    Returns:
        bool: True if the results are ready, False otherwise.
    """
    while True:
        request = session.get(f"{API_URL}/idmapping/status/{job_id}")

        try:
            request.raise_for_status()
        except requests.HTTPError:
            print(request.json())
            raise

        j = request.json()
        if "jobStatus" in j:
            if j["jobStatus"] in ("NEW", "RUNNING"):
                print(f"Retrying in {POLLING_INTERVAL}s")
                time.sleep(POLLING_INTERVAL)
            else:
                raise Exception(j["jobStatus"])
        else:
            return bool(j["results"] or j["failedIds"])

def get_id_mapping_results_link(job_id):
    """
    Function to get the link to the ID mapping results.

    Args:
        job_id (str): The job ID.

    Returns:
        str: The link to the ID mapping results.
    """
    url = f"{API_URL}/idmapping/details/{job_id}"
    # Reuse the shared session so the retry policy defined above also applies here
    request = session.get(url)

    try:
        request.raise_for_status()
    except requests.HTTPError:
        print(request.json())
        raise

    return request.json()["redirectURL"]

def decode_results(response, file_format, compressed):
    """
    Function to decode the ID mapping results.

    Args:
        response (requests.Response): The response object.
        file_format (str): The file format of the results.
        compressed (bool): Whether the results are compressed.

    Returns:
        str: The ID mapping results
    """

    if compressed:
        decompressed = zlib.decompress(response.content, 16 + zlib.MAX_WBITS)
        if file_format == "json":
            j = json.loads(decompressed.decode("utf-8"))
            return j
        elif file_format == "tsv":
            return [line for line in decompressed.decode("utf-8").split("\n") if line]
        elif file_format == "xlsx":
            return [decompressed]
        elif file_format == "xml":
            return [decompressed.decode("utf-8")]
        else:
            return decompressed.decode("utf-8")
    elif file_format == "json":
        return response.json()
    elif file_format == "tsv":
        return [line for line in response.text.split("\n") if line]
    elif file_format == "xlsx":
        return [response.content]
    elif file_format == "xml":
        return [response.text]
    return response.text

def get_id_mapping_results_stream(url):
    """
    Function to get the ID mapping results from a stream.

    Args:
        url (str): The URL to the ID mapping results.

    Returns:
        str: The ID mapping results.
    """
    if "/stream/" not in url:
        url = url.replace("/results/", "/results/stream/")

    request = session.get(url)

    try:
        request.raise_for_status()
    except requests.HTTPError:
        print(request.json())
        raise

    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    file_format = query["format"][0] if "format" in query else "json"
    compressed = (
        query["compressed"][0].lower() == "true" if "compressed" in query else False
    )
    return decode_results(request, file_format, compressed)
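
The URL handling inside `get_id_mapping_results_stream` can be checked offline. The sketch below repeats only that logic and sends no request. The helper name `stream_params` is hypothetical, and the example URL uses a made-up job ID placeholder, so treat it as an illustration rather than as part of the UniProt client code.

```python
from urllib.parse import urlparse, parse_qs


def stream_params(url):
    """Return (stream_url, file_format, compressed) for a results URL."""
    # Switch a paged results URL to the streaming endpoint
    if "/stream/" not in url:
        url = url.replace("/results/", "/results/stream/")
    query = parse_qs(urlparse(url).query)
    # Fall back to uncompressed JSON when the parameters are absent
    file_format = query["format"][0] if "format" in query else "json"
    compressed = (
        query["compressed"][0].lower() == "true" if "compressed" in query else False
    )
    return url, file_format, compressed


# Placeholder URL for illustration only (JOBID is not a real job)
example = "https://rest.uniprot.org/idmapping/uniprotkb/results/JOBID?format=tsv&compressed=true"
print(stream_params(example))
```

With no `format` or `compressed` parameter in the URL, the same defaults as in the function above apply: JSON output, no decompression.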

To save time, we will use only the **first 5 nodes**.

In [40]:
# Add the first 5 gene IDs to a list
inputs = list(dic_gene_ids.keys())[:5]
# Submit the job to perform ID mapping
job_id = submit_id_mapping(
    from_db="GeneID", to_db="UniProtKB", ids=inputs
)
# Print the job ID
print (f"Job ID: {job_id}")
# Check the status of the job
status = check_id_mapping_results_ready(job_id)
# Print the status of the job
print (f"Job status: {status}")

Job ID: 29s22juOq2
Job status: True
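
`check_id_mapping_results_ready` keeps polling for as long as the job reports `NEW` or `RUNNING`. For unattended runs it may help to cap the number of attempts. The sketch below mirrors the same status logic but takes the status lookup as a callable, so it can be tried without network access. `poll_until_ready`, `max_attempts`, and the simulated replies are assumptions made for illustration, not part of the UniProt example code.

```python
import time


def poll_until_ready(get_status, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll get_status() until the job leaves NEW/RUNNING or attempts run out."""
    for _ in range(max_attempts):
        status = get_status()
        # A payload without "jobStatus" already contains the results
        if "jobStatus" not in status:
            return bool(status.get("results") or status.get("failedIds"))
        # Any status other than NEW/RUNNING is treated as a failure
        if status["jobStatus"] not in ("NEW", "RUNNING"):
            raise RuntimeError(status["jobStatus"])
        sleep(interval)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Simulated replies: two RUNNING responses, then a payload with results
replies = iter([
    {"jobStatus": "RUNNING"},
    {"jobStatus": "RUNNING"},
    {"results": [{"from": "9796"}]},
])
print(poll_until_ready(lambda: next(replies), sleep=lambda seconds: None))
```

In the notebook, `get_status` could wrap `session.get(f"{API_URL}/idmapping/status/{job_id}").json()`.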


Check and get the ID mapping results

In [41]:
if check_id_mapping_results_ready(job_id):
    link = get_id_mapping_results_link(job_id)
    mapping_results = get_id_mapping_results_stream(link)
    print(mapping_results)

{'results': [{'from': '9796', 'to': {'entryType': 'UniProtKB reviewed (Swiss-Prot)', 'primaryAccession': 'Q92561', 'secondaryAccessions': ['D3DSR1', 'Q8N4I9'], 'uniProtkbId': 'PHYIP_HUMAN', 'entryAudit': {'firstPublicDate': '1997-11-01', 'lastAnnotationUpdateDate': '2025-04-09', 'lastSequenceUpdateDate': '1997-02-01', 'entryVersion': 176, 'sequenceVersion': 1}, 'annotationScore': 4.0, 'organism': {'scientificName': 'Homo sapiens', 'commonName': 'Human', 'taxonId': 9606, 'lineage': ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']}, 'proteinExistence': '1: Evidence at protein level', 'proteinDescription': {'recommendedName': {'fullName': {'value': 'Phytanoyl-CoA hydroxylase-interacting protein'}}, 'alternativeNames': [{'fullName': {'value': 'Phytanoyl-CoA hydroxylase-associated protein 1'}, 'shortNames': [{'value': 'PAHX-AP1'}, {'value': 'PAHXAP1'}]}]}, 

### Store the mapping results in a dictionary
Each key is a gene ID and each value is a nested dictionary with the keys `description` and `sequence`.

In [43]:
dic_gene_id_to_descp_seq = {}
for result in mapping_results['results']:
    # Keep only reviewed (Swiss-Prot) entries
    if result['to']['entryType'] != 'UniProtKB reviewed (Swiss-Prot)':
        continue
    # Reset for every entry so that an entry without a FUNCTION comment
    # does not silently reuse the description of the previous entry
    description = None
    for comment in result['to'].get('comments', []):
        if comment['commentType'] == 'FUNCTION':
            for text in comment['texts']:
                description = text['value']
    dic_gene_id_to_descp_seq[result['from']] = {
        'description': description,
        'sequence': result['to']['sequence']['value'],
    }

# Display the contents of the dictionary
for gene_id, descp_seq in dic_gene_id_to_descp_seq.items():
    print(f"Gene ID: {gene_id}")
    print(f"Description: {descp_seq['description']}")
    print(f"Sequence: {descp_seq['sequence']}")
    print()
        

Gene ID: 9796
Description: Its interaction with PHYH suggests a role in the development of the central system
Sequence: MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKENKNSNKFKHRDVPTKLVAKAVPLPMTVRGHWFLSPRTEYSVAVQTAVKQSDGEYLVSGWSETVEFCTGDYAKEHLAQLQEKAEQIAGRMLRFSVFYRNHHKEYFQHARTHCGNMLQPYLKDNSGSHGSPTSGMLHGVFFSCNTEFNTGQPPQDSPYGRWRFQIPAQRLFNPSTNLYFADFYCMYTAYHYAILVLAPKGSLGDRFCRDRLPLLDIACNKFLTCSVEDGELVFRHAQDLILEIIYTEPVDLSLGTLGEISGHQLMSLSTADAKKDPSCKTCNISVGR

Gene ID: 56992
Description: Plus-end directed kinesin-like motor enzyme involved in mitotic spindle assembly
Sequence: MAPGCKTELRSVTNGQSNQPSNEGDAIKVFVRIRPPAERSGSADGEQNLCLSVLSSTSLRLHSNPEPKTFTFDHVADVDTTQESVFATVAKSIVESCMSGYNGTIFAYGQTGSGKTFTMMGPSESDNFSHNLRGVIPRSFEYLFSLIDREKEKAGAGKSFLCKCSFIEIYNEQIYDLLDSASAGLYLREHIKKGVFVVGAVEQVVTSAAEAYQVLSGGWRNRRVASTSMNRESSRSHAVFTITIESMEKSNEIVNIRTSLLNLVDLAGSERQKDTHAEGMRLKEAGNINRSLSCLGQVITALVDVGNGKQRHVCYRDSKLTFLLRDSLGGNAKTAIIANVHPGSRCFGETLSTLNFAQRAKLIKNKAVVNEDTQGNVSQLQAEVKRLKEQLAELASGQTPPESFLTRDKKKTNYMEYFQEAMLFFKKSE
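
The extraction step can be tried on small mock records without calling UniProt. The sketch below assumes entries shaped like the `to` field printed earlier (`comments` → `commentType` / `texts` → `value`). The helper name `extract_function_text` and the mock values are invented for illustration. Unlike the loop above, it joins all FUNCTION texts instead of keeping only the last one.

```python
def extract_function_text(entry):
    """Return the FUNCTION comment text of a UniProtKB-style entry dict, or None."""
    texts = [
        text["value"]
        for comment in entry.get("comments", [])
        if comment.get("commentType") == "FUNCTION"
        for text in comment.get("texts", [])
    ]
    # Join multiple FUNCTION paragraphs; return None when there is none
    return " ".join(texts) if texts else None


# Mock entries; the values are invented
with_function = {
    "comments": [{"commentType": "FUNCTION", "texts": [{"value": "Example function."}]}]
}
without_function = {
    "comments": [{"commentType": "SUBUNIT", "texts": [{"value": "Homodimer."}]}]
}
print(extract_function_text(with_function))     # Example function.
print(extract_function_text(without_function))  # None
```

Returning `None` for entries without a FUNCTION comment fits the later cells, which skip nodes whose `description` is `None`.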

## Most biomedical graphs provide gene names, so you can also query sequences and descriptions by gene name using the `EnrichmentWithUniProt` class

In [47]:
# Create an instance of the EnrichmentWithUniProt class once and reuse it
enrich_uniprot = EnrichmentWithUniProt()
for gene_id in inputs:
    # Get the gene name of the gene ID
    gene_name = dic_gene_ids[gene_id].name
    print (f"Gene name: {gene_name}")
    # Get the sequence and description for the gene name
    description, sequence = enrich_uniprot.enrich_documents([gene_name])
    # Create the entry first in case the ID mapping step did not add this gene ID
    entry = dic_gene_id_to_descp_seq.setdefault(gene_id, {})
    entry['description'] = description
    entry['sequence'] = sequence
    print (f"Gene name: {gene_name}\nDescription: {description}\nSequence: {sequence}")

Gene name: PHYHIP
Gene name: PHYHIP
Description: ['Its interaction with PHYH suggests a role in the development of the central system']
Sequence: ['MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKENKNSNKFKHRDVPTKLVAKAVPLPMTVRGHWFLSPRTEYSVAVQTAVKQSDGEYLVSGWSETVEFCTGDYAKEHLAQLQEKAEQIAGRMLRFSVFYRNHHKEYFQHARTHCGNMLQPYLKDNSGSHGSPTSGMLHGVFFSCNTEFNTGQPPQDSPYGRWRFQIPAQRLFNPSTNLYFADFYCMYTAYHYAILVLAPKGSLGDRFCRDRLPLLDIACNKFLTCSVEDGELVFRHAQDLILEIIYTEPVDLSLGTLGEISGHQLMSLSTADAKKDPSCKTCNISVGR']
Gene name: KIF15
Gene name: KIF15
Description: ['Plus-end directed kinesin-like motor enzyme involved in mitotic spindle assembly']
Sequence: ['MAPGCKTELRSVTNGQSNQPSNEGDAIKVFVRIRPPAERSGSADGEQNLCLSVLSSTSLRLHSNPEPKTFTFDHVADVDTTQESVFATVAKSIVESCMSGYNGTIFAYGQTGSGKTFTMMGPSESDNFSHNLRGVIPRSFEYLFSLIDREKEKAGAGKSFLCKCSFIEIYNEQIYDLLDSASAGLYLREHIKKGVFVVGAVEQVVTSAAEAYQVLSGGWRNRRVASTSMNRESSRSHAVFTITIESMEKSNEIVNIRTSLLNLVDLAGSERQKDTHAEGMRLKEAGNINRSLSCLGQVITALVDVGNGKQRHVCYRDSKLTFLLRDSLGGNAKTAIIANVHPGSRCFGETLSTLNFAQRAKLIKNKAVVNEDTQG
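
As the output shows, `enrich_documents` returns lists even when a single gene name is passed, so the dictionary now holds one-element lists rather than plain strings. If plain strings are preferred downstream, the values could be unwrapped before storing them. This is a minimal sketch. `unwrap_single` is a hypothetical helper, not part of the `EnrichmentWithUniProt` API.

```python
def unwrap_single(values):
    """Return the only element of a one-item list, or None for an empty list."""
    if not values:
        return None
    if len(values) != 1:
        raise ValueError(f"expected one value, got {len(values)}")
    return values[0]


# Value shaped like the description returned for PHYHIP above
description = ["Its interaction with PHYH suggests a role in the development of the central system"]
print(unwrap_single(description))
```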

### Map the description and sequence from the dictionary to their corresponding nodes in the graph

In [48]:
from tqdm import tqdm
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_type') != 'gene/protein':
        continue
    gene_id = kg.nodes[node].get('node_id')
    # Ignore the genes/proteins without description
    if gene_id not in dic_gene_id_to_descp_seq:
        continue
    description = dic_gene_id_to_descp_seq[gene_id]['description']
    sequence = dic_gene_id_to_descp_seq[gene_id]['sequence']
    print (f"node: {node}, gene ID: {gene_id}, description: {description}, sequence: {sequence}")
    G.add_nodes_from([(node, {'description': description, 'sequence': sequence})])

# Recompose the graph
kg = nx.compose(G, kg)

100%|██████████| 129262/129262 [00:00<00:00, 1531104.56it/s]


node: PHYHIP, gene ID: 9796, description: ['Its interaction with PHYH suggests a role in the development of the central system'], sequence: ['MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKENKNSNKFKHRDVPTKLVAKAVPLPMTVRGHWFLSPRTEYSVAVQTAVKQSDGEYLVSGWSETVEFCTGDYAKEHLAQLQEKAEQIAGRMLRFSVFYRNHHKEYFQHARTHCGNMLQPYLKDNSGSHGSPTSGMLHGVFFSCNTEFNTGQPPQDSPYGRWRFQIPAQRLFNPSTNLYFADFYCMYTAYHYAILVLAPKGSLGDRFCRDRLPLLDIACNKFLTCSVEDGELVFRHAQDLILEIIYTEPVDLSLGTLGEISGHQLMSLSTADAKKDPSCKTCNISVGR']
node: KIF15, gene ID: 56992, description: ['Plus-end directed kinesin-like motor enzyme involved in mitotic spindle assembly'], sequence: ['MAPGCKTELRSVTNGQSNQPSNEGDAIKVFVRIRPPAERSGSADGEQNLCLSVLSSTSLRLHSNPEPKTFTFDHVADVDTTQESVFATVAKSIVESCMSGYNGTIFAYGQTGSGKTFTMMGPSESDNFSHNLRGVIPRSFEYLFSLIDREKEKAGAGKSFLCKCSFIEIYNEQIYDLLDSASAGLYLREHIKKGVFVVGAVEQVVTSAAEAYQVLSGGWRNRRVASTSMNRESSRSHAVFTITIESMEKSNEIVNIRTSLLNLVDLAGSERQKDTHAEGMRLKEAGNINRSLSCLGQVITALVDVGNGKQRHVCYRDSKLTFLLRDSLGGNAKTAIIANVHPGSRCFGETLSTLNFAQRAKLIKNKAVVNEDTQGNVSQLQAEVK

### Check device availability

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cuda:0'

### Load the ESM2 model

In [None]:
emb_model = EmbeddingWithHuggingFace(model_name='facebook/esm2_t6_8M_UR50D',
                                     model_cache_dir="../../../../data/facebook/esm2_t6_8M_UR50D/",
                                     truncation=False,
                                     device=device)

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Generate sequence embedding and add it to the graph as new attribute "sequence_embedding"

In [None]:
# Embeddings using 1 sample at a time
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_type') != 'gene/protein':
        continue
    gene_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('sequence') is None:
        continue
    seq = kg.nodes[node].get('sequence')
    # print (node, seq)
    outputs = emb_model.embed_documents([seq])
    G.add_nodes_from([(node, {'sequence_embedding': outputs[0]})])
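    # Release cached GPU memory; skipped on CPU-only machines,
    # where torch.cuda.synchronize() would raise an error
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()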

# Recompose the graph
kg = nx.compose(G, kg)

100%|██████████| 129262/129262 [00:05<00:00, 21864.09it/s]
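
The loop above embeds one sequence per call. When many nodes carry a sequence, grouping them into batches may reduce overhead. The sketch below shows only the batching logic with a stand-in embedder. It assumes that the real `embed_documents` accepts several texts per call and returns one embedding per text in the same order, which this tutorial demonstrates only for single-element lists. `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(pairs, embed_fn, batch_size=8):
    """Embed (node, text) pairs in batches and return {node: embedding}."""
    result = {}
    for batch in batched(pairs, batch_size):
        nodes = [node for node, _ in batch]
        texts = [text for _, text in batch]
        # One call per batch; embeddings are assumed to come back in input order
        embeddings = embed_fn(texts)
        result.update(zip(nodes, embeddings))
    return result


def fake_embed(texts):
    """Stand-in embedder: maps each text to a one-element vector (its length)."""
    return [[float(len(text))] for text in texts]


pairs = [("PHYHIP", "MELLS"), ("KIF15", "MAPGCKT"), ("GENE_X", "MK")]
print(embed_in_batches(pairs, fake_embed, batch_size=2))
```

In the notebook, `embed_fn` would be `emb_model.embed_documents`. Whether sequences of different lengths can share a batch depends on how the model wrapper pads its inputs.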


### Protein embedding
Load the BioBERT model

In [20]:
# Using Microsoft's BiomedBERT (referred to as BioBERT in this tutorial)
emb_model = EmbeddingWithHuggingFace(model_name='microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract',
                                     model_cache_dir="../../../../data/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract/",
                                     truncation=False,
                                     device=device)

### Generate description embedding and add it to the graph as new attribute "description_embedding"

In [21]:
for i, node in tqdm(enumerate(kg.nodes)):
    if kg.nodes[node].get('node_type') != 'gene/protein':
        continue
    gene_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('description') is None:
        continue
    desc = kg.nodes[node].get('description')
    outputs = emb_model.embed_documents([desc])
    G.add_nodes_from([(node, {'description_embedding': outputs})])
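    # Release cached GPU memory; skipped on CPU-only machines,
    # where torch.cuda.synchronize() would raise an error
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()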

# Recompose the graph
kg = nx.compose(G, kg)

129262it [00:00, 372240.20it/s]


### Put together all the results so far in a DataFrame

In [22]:
import pandas as pd
dic = {'gene':[],
       'description':[],
       'sequence':[],
       'description_embedding':[],
       'sequence_embedding':[]}
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_type') != 'gene/protein':
        continue
    gene_id = kg.nodes[node].get('node_id')
    if kg.nodes[node].get('description') is None:
        continue
    dic['gene'].append(node)
    dic['description'].append(kg.nodes[node].get('description'))
    dic['sequence'].append(kg.nodes[node].get('sequence'))
    dic['description_embedding'].append(kg.nodes[node].get('description_embedding'))
    dic['sequence_embedding'].append(kg.nodes[node].get('sequence_embedding'))
    # print (node, kg.nodes[node].get('description'), kg.nodes[node].get('sequence'), kg.nodes[node].get('description_embedding'))

df = pd.DataFrame(dic)
df

100%|██████████| 129262/129262 [00:00<00:00, 1170116.51it/s]


,gene,description,sequence,description_embedding,sequence_embedding
0,PABPC1,(Microbial infection) Positively regulates the...,MNPSAPSYPMASLYVGDLHPDVTEAMLYEKFSPAGPILSIRVCRDM...,"[[tensor(-0.1369), tensor(-0.0554), tensor(0.0...","[tensor(0.0133), tensor(-0.0413), tensor(0.192..."
1,GTPBP3,GTPase component of the GTPBP3-MTO1 complex th...,MWRGLWTLAAQAARGPRRLCTRRSSGAPAPGSGATIFALSSGQGRC...,"[[tensor(-0.0017), tensor(0.2555), tensor(0.45...","[tensor(-0.1717), tensor(-0.2150), tensor(-0.0..."
2,SHOX2,May be a growth regulator and have a role in s...,MEELTAFVSKSFDQKVKEKKEAITYREVLESGPLRGAKEPTGCTEA...,"[[tensor(0.0310), tensor(0.3633), tensor(0.744...","[tensor(-0.0969), tensor(-0.2075), tensor(0.01..."
3,ALDH16A1,May be a growth regulator and have a role in s...,MAATRAGPRAREIFTSLEYGPVPESHACALAWLDTQDRCLGHYVNG...,"[[tensor(0.0310), tensor(0.3633), tensor(0.744...","[tensor(-0.2420), tensor(-0.0727), tensor(0.10..."
4,GMPR,Catalyzes the irreversible NADPH-dependent dea...,MPRIDADLKLDFKDVLLRPKRSSLKSRAEVDLERTFTFRNSKQTYS...,"[[tensor(0.1503), tensor(-0.1371), tensor(0.33...","[tensor(0.0235), tensor(0.0571), tensor(0.0156..."
5,INPP4B,Catalyzes the hydrolysis of the 4-position pho...,MEIKEEGASEEGQHFLPTAQANDPGDCQFTSIQKTPNEPQLEFILA...,"[[tensor(0.1104), tensor(0.2020), tensor(0.234...","[tensor(-0.0336), tensor(-0.1216), tensor(0.02..."
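
In the table above, SHOX2 and ALDH16A1 carry the same description text and the same description embedding, which is worth verifying before the embeddings are used. One possible cause is a value carried over from a previous loop iteration. A quick check for descriptions shared by several genes can be sketched as follows. `find_shared_descriptions` is a hypothetical helper, and the input uses the truncated strings from the table.

```python
from collections import defaultdict


def find_shared_descriptions(descriptions):
    """Map each description that occurs more than once to the genes that share it."""
    by_text = defaultdict(list)
    for gene, text in descriptions.items():
        by_text[text].append(gene)
    return {text: genes for text, genes in by_text.items() if len(genes) > 1}


# Truncated descriptions copied from the table above
rows = {
    "SHOX2": "May be a growth regulator and have a role in s...",
    "ALDH16A1": "May be a growth regulator and have a role in s...",
    "GMPR": "Catalyzes the irreversible NADPH-dependent dea...",
}
print(find_shared_descriptions(rows))
```

With the DataFrame from the previous cell, the input could be built as `dict(zip(df['gene'], df['description']))`.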
