# BioBridge-PrimeKG Loader

In this tutorial, we explain how to load the BioBridge-PrimeKG dataset, a dataset for cross-modality prediction that uses PrimeKG as its knowledge graph.

Background information about the BioBridge-PrimeKG dataset can be found in the following resources:
- https://github.com/RyanWangZf/BioBridge
- https://arxiv.org/pdf/2310.03320

First, we import the necessary libraries.

In [1]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.biobridge_primekg import BioBridgePrimeKG


### Load BioBridge-PrimeKG

The `BioBridgePrimeKG` class downloads the data from the BioBridge GitHub repository if it is not available locally.

Otherwise, the data is loaded from the local directories given by `local_dir` and `primekg_dir`.
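The download-or-reuse behaviour described above can be sketched as follows. This is a minimal illustration of the general pattern, not the actual implementation inside `BioBridgePrimeKG`: the names `fetch_if_missing` and `fake_download` and the placeholder URL are assumptions made for this example only.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_if_missing(url: str, dest: Path, download: Callable[[str, Path], None]) -> bool:
    """Call `download(url, dest)` only when `dest` is missing.

    Returns True when a download was triggered, False when the local file was reused.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, dest)
    return True


def fake_download(url: str, dest: Path) -> None:
    """Hypothetical stand-in for an HTTP download: writes placeholder bytes, no network."""
    dest.write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "embeddings" / "protein.pkl"
    url = "https://example.org/protein.pkl"  # placeholder URL, never contacted
    first_call = fetch_if_missing(url, target, fake_download)   # file missing: download runs
    second_call = fetch_if_missing(url, target, fake_download)  # file present: reused

print(first_call, second_call)
```

Passing the downloader in as a function keeps the caching logic testable without network access.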

In [2]:
# Create the BioBridge-PrimeKG loader, pointing it at the local data directories
biobridge_data = BioBridgePrimeKG(primekg_dir="../../../../data/primekg/",
                                  local_dir="../../../../data/biobridge_primekg/")

To load the BioBridge dataframes and their splits, we only need to call a single method.

In [3]:
# Invoke a method to load the data
biobridge_data.load_data()

Loading PrimeKG dataset...
Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.
Loading data config file of BioBridgePrimeKG...
Downloading data_config.json from https://raw.githubusercontent.com/RyanWangZf/BioBridge/refs/heads/main/data/BindData/data_config.json to ../../../../data/biobridge_primekg/...


1.03kiB [00:00, 3.78MiB/s]                


Building node embeddings...
Downloading protein.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/protein.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 197M/197M [00:13<00:00, 14.7MiB/s] 


Downloading mf.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/mf.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 34.3M/34.3M [00:02<00:00, 12.4MiB/s]


Downloading cc.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/cc.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 12.5M/12.5M [00:01<00:00, 7.26MiB/s]


Downloading bp.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/bp.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 85.9M/85.9M [00:10<00:00, 8.09MiB/s]


Downloading drug.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/drug.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 28.7M/28.7M [00:04<00:00, 6.86MiB/s]


Downloading disease.pkl from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/embeddings/esm2b_unimo_pubmedbert/disease.pkl to ../../../../data/biobridge_primekg/embeddings...


100%|██████████| 53.1M/53.1M [00:02<00:00, 20.2MiB/s]


Building full triplets...
Downloading protein.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/protein.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 11.7M/11.7M [00:00<00:00, 17.2MiB/s]


Downloading mf.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/molecular.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 2.24M/2.24M [00:00<00:00, 13.9MiB/s]


Downloading cc.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/cellular.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 1.05M/1.05M [00:00<00:00, 4.17MiB/s]


Downloading bp.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/biological.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 6.79M/6.79M [00:00<00:00, 38.3MiB/s]


Downloading drug.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/drug.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 9.47M/9.47M [00:00<00:00, 15.6MiB/s]


Downloading disease.csv from https://media.githubusercontent.com/media/RyanWangZf/BioBridge/refs/heads/main/data/Processed/disease.csv to ../../../../data/biobridge_primekg/processed...


100%|██████████| 11.4M/11.4M [00:01<00:00, 7.92MiB/s]


Building train-test split...
Number of 1 nodes in train: 16918
Number of 1 nodes in test: 1879
Number of 6 nodes in train: 6084
Number of 6 nodes in test: 675
Number of 2 nodes in train: 15349
Number of 2 nodes in test: 1705
Number of 0 nodes in train: 24669
Number of 0 nodes in test: 2740
Number of 5 nodes in train: 9856
Number of 5 nodes in test: 1095
Number of 7 nodes in train: 3610
Number of 7 nodes in test: 401


As a result, several processed files are now available in the local directory. The following sections walk through them.

### Check initial PrimeKG Dataframes 

First, we can retrieve the original PrimeKG data by invoking the method `get_primekg()` and then inspect the node and edge dataframes.

In [4]:
# Get the initial data of PrimeKG
primekg_nodes = biobridge_data.get_primekg().get_nodes()
primekg_edges = biobridge_data.get_primekg().get_edges()

In [5]:
# Check PrimeKG nodes
primekg_nodes.head()

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
0,0,PHYHIP,NCBI,9796,gene/protein
1,1,GPANK1,NCBI,7918,gene/protein
2,2,ZRSR2,NCBI,8233,gene/protein
3,3,NRF1,NCBI,4899,gene/protein
4,4,PI4KA,NCBI,5297,gene/protein


In [6]:
# Check the dimensions of the PrimeKG nodes
primekg_nodes.shape

(129375, 5)

In [7]:
# Check PrimeKG edges
primekg_edges.head()

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,gene/protein,8889,KIF15,NCBI,56992,gene/protein,ppi,protein_protein
1,1,GPANK1,NCBI,7918,gene/protein,2798,PNMA1,NCBI,9240,gene/protein,ppi,protein_protein
2,2,ZRSR2,NCBI,8233,gene/protein,5646,TTC33,NCBI,23548,gene/protein,ppi,protein_protein
3,3,NRF1,NCBI,4899,gene/protein,11592,MAN1B1,NCBI,11253,gene/protein,ppi,protein_protein
4,4,PI4KA,NCBI,5297,gene/protein,2122,RGS20,NCBI,8601,gene/protein,ppi,protein_protein


In [8]:
# Check the dimensions of the PrimeKG edges
primekg_edges.shape

(8100498, 12)

### Check BioBridge-PrimeKG Data Config

The `BioBridgePrimeKG` class provides a method to load the data config file, which contains the following information:
- `node_type` : The node type mapping
- `relation_type` : The relation type mapping
- `emb_dim` : The embedding dimension (pre-loaded embeddings from the repository)

In [9]:
biobridge_data.get_data_config().keys()

dict_keys(['node_type', 'relation_type', 'emb_dim'])

In [10]:
# Check the node type within data config of the BioBridge PrimeKG
biobridge_data.get_data_config()['node_type']

{'biological_process': 0,
 'gene/protein': 1,
 'disease': 2,
 'effect/phenotype': 3,
 'anatomy': 4,
 'molecular_function': 5,
 'drug': 6,
 'cellular_component': 7,
 'pathway': 8,
 'exposure': 9}
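The integer ids in this mapping are the ones printed in the train-test split log of `load_data()` earlier in this tutorial. As a small sketch, the snippet below inverts the mapping and decodes that log. The counts are copied from the log above; the variable names are chosen for this example only.

```python
# Node-type mapping copied from the data config output above.
node_type = {
    "biological_process": 0, "gene/protein": 1, "disease": 2,
    "effect/phenotype": 3, "anatomy": 4, "molecular_function": 5,
    "drug": 6, "cellular_component": 7, "pathway": 8, "exposure": 9,
}

# (train, test) node counts per type id, copied from the load_data() log.
split_counts = {
    1: (16918, 1879), 6: (6084, 675), 2: (15349, 1705),
    0: (24669, 2740), 5: (9856, 1095), 7: (3610, 401),
}

# Invert the mapping so that integer ids can be turned back into names.
id_to_name = {idx: name for name, idx in node_type.items()}

test_fraction = {}
for type_id, (n_train, n_test) in split_counts.items():
    name = id_to_name[type_id]
    test_fraction[name] = n_test / (n_train + n_test)
    print(f"{name:<20} train={n_train:>6} test={n_test:>5} "
          f"test_share={test_fraction[name]:.3f}")
```

For the counts in the log, every node type ends up with a test share of about 10%.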

In [11]:
# Check the relation type within data config of the BioBridge PrimeKG
biobridge_data.get_data_config()['relation_type']

{'expression present': 0,
 'synergistic interaction': 1,
 'interacts with': 2,
 'ppi': 3,
 'phenotype present': 4,
 'parent-child': 5,
 'associated with': 6,
 'side effect': 7,
 'contraindication': 8,
 'expression absent': 9,
 'target': 10,
 'indication': 11,
 'enzyme': 12,
 'transporter': 13,
 'off-label use': 14,
 'linked to': 15,
 'phenotype absent': 16,
 'carrier': 17}

In [12]:
# Check the embedding dimension within data config of the BioBridge PrimeKG
# Note that not all node types have embeddings
biobridge_data.get_data_config()['emb_dim']

{'biological_process': 768,
 'cellular_component': 768,
 'disease': 768,
 'drug': 512,
 'molecular_function': 768,
 'gene/protein': 2560}

### Check BioBridge-PrimeKG Node Information

BioBridge provides node information for each modality of PrimeKG.
In particular, it includes information for the following modalities:
- protein (sequence)
- molecular function (texts)
- cellular component (texts)
- biological process (texts)
- drug (SMILES strings)
- disease (texts)

In [13]:
# Get the node information of the BioBridge PrimeKG
biobridge_node_info = biobridge_data.get_node_info_dict()
biobridge_node_info.keys()

dict_keys(['gene/protein', 'molecular_function', 'cellular_component', 'biological_process', 'drug', 'disease'])

In [14]:
# Check a sample node information of gene/protein
biobridge_node_info['gene/protein']

Unnamed: 0,node_index,node_id,node_type,node_name,node_source,sequence
0,0,9796,gene/protein,PHYHIP,NCBI,MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKEN...
1,1,7918,gene/protein,GPANK1,NCBI,MSRPLLITFTPATDPSDLWKDGQQQPQPEKPESTLDGAAARAFYEA...
2,2,8233,gene/protein,ZRSR2,NCBI,MAAPEKMTFPEKPSHKKYRAALKKEKRKKRRQELARLRDSGLSQKE...
3,3,4899,gene/protein,NRF1,NCBI,MEEHGVTQTEHMATIEAHAVAQQVQQVHVATYTEHSMLSADEDSPS...
4,4,5297,gene/protein,PI4KA,NCBI,MAAAPARGGGGGGGGGGGCSGSGSSASRGFYFNTVLSLARSLAVQR...
...,...,...,...,...,...,...
19157,83735,100133251,gene/protein,PRR23D2,NCBI,MYGYRRLRSPRDSQTEPQNDNEGETSLATTQMNPPKRRQVEQGPST...
19158,83735,100133251,gene/protein,PRR23D2,NCBI,MYGYRRLRSPRDSQTEPQNDNEGETSLATTQMNPPKRRQVEQGPST...
19159,83740,389649,gene/protein,C8orf86,NCBI,MRPLGKGLLPAEELIRSNLGVGRSLRDCLSQSGKLAEELGSKRLKP...
19160,83746,343990,gene/protein,CRACDL,NCBI,MISTRVMDIKLREAAEGLGEDSTGKKKSKFKTFKKFFGKKKRKESP...


In [15]:
# Check a sample node information of molecular_function
biobridge_node_info['molecular_function']

Unnamed: 0,node_index,node_id,node_type,node_name,node_source,description
0,53517,8168,molecular_function,methyltransferase activity,GO,Catalysis of the transfer of a methyl group to...
1,53518,140101,molecular_function,"catalytic activity, acting on a tRNA",GO,Catalytic activity that acts to modify a tRNA.
2,53519,140097,molecular_function,"catalytic activity, acting on DNA",GO,Catalytic activity that acts to modify DNA.
3,53520,140096,molecular_function,"catalytic activity, acting on a protein",GO,Catalytic activity that acts to modify a protein.
4,53521,140098,molecular_function,"catalytic activity, acting on RNA",GO,"Catalytic activity that acts to modify RNA, dr..."
...,...,...,...,...,...,...
10961,124216,42880,molecular_function,D-glucuronate transmembrane transporter activity,GO,"Enables the transfer of D-glucuronate, the D-e..."
10962,124217,61922,molecular_function,histone propionyltransferase activity,GO,Catalysis of the reaction: propionyl-CoA + his...
10963,124218,61995,molecular_function,ATP-dependent protein-DNA complex displacement...,GO,An activity that displaces proteins or protein...
10964,124219,51266,molecular_function,sirohydrochlorin ferrochelatase activity,GO,Catalysis of the reaction: siroheme + 2 H+ = F...


In [16]:
# Check a sample node information of cellular_component
biobridge_node_info['cellular_component']

Unnamed: 0,node_index,node_id,node_type,node_name,node_source,description
0,55515,110165,cellular_component,cellular anatomical entity,GO,A part of a cellular organism that is either a...
1,55516,30137,cellular_component,COPI-coated vesicle,GO,A vesicle with a coat formed of the COPI coat ...
2,55517,30133,cellular_component,transport vesicle,GO,Any of the vesicles of the constitutive secret...
3,55520,5777,cellular_component,peroxisome,GO,A small organelle enclosed by a single membran...
4,55524,99512,cellular_component,supramolecular fiber,GO,A polymer consisting of an indefinite number o...
...,...,...,...,...,...,...
4008,127430,44169,cellular_component,host cell rough endoplasmic reticulum membrane,GO,The lipid bilayer surrounding the host cell ro...
4009,127431,98652,cellular_component,collagen type VII anchoring fibril,GO,An antiparallel dimer of two collagen VII trim...
4010,127432,90732,cellular_component,cofilin-actin rod,GO,"A cellular structure consisting of parallel, h..."
4011,127433,779,cellular_component,"condensed chromosome, centromeric region",GO,The region of a condensed chromosome that incl...


In [17]:
# Check a sample node information of biological_process
biobridge_node_info['biological_process']

Unnamed: 0,node_index,node_id,node_type,node_name,node_source,description
0,39898,51581,biological_process,negative regulation of neurotransmitter uptake,GO,"Any process that stops, prevents, or reduces t..."
1,39899,43271,biological_process,negative regulation of ion transport,GO,"Any process that stops, prevents, or reduces t..."
2,39900,51611,biological_process,regulation of serotonin uptake,GO,"Any process that modulates the frequency, rate..."
3,39901,51616,biological_process,regulation of histamine uptake,GO,"Any process that modulates the frequency, rate..."
4,39902,51956,biological_process,negative regulation of amino acid transport,GO,"Any process that stops, prevents, or reduces t..."
...,...,...,...,...,...,...
27473,115046,60654,biological_process,mammary gland cord elongation,GO,The process in which the mammary gland sprout ...
27474,115047,1903696,biological_process,protein localization to horsetail-astral micro...,GO,A process in which a protein is transported to...
27475,115048,3372,biological_process,establishment or maintenance of cytoskeleton p...,GO,Any cellular process that results in the speci...
27476,115049,97624,biological_process,UDP-galactose transmembrane import into Golgi ...,GO,The directed movement of UDP-galactose into th...


In [18]:
# Check a sample node information of drug
biobridge_node_info['drug']

Unnamed: 0,node_index,description,half_life,indication,mechanism_of_action,protein_binding,pharmacodynamics,state,atc_1,atc_2,...,node_type,node_name,node_source,name,smiles,logP ALOGPS,logP ChemAxon,solubility ALOGPS,pKa (strongest acidic),pKa (strongest basic)
0,14014,"Flunisolide (marketed as AeroBid, Nasalide, Na...",The half-life is 1.8 hours,For the maintenance treatment of asthma as a p...,Flunisolide is a glucocorticoid receptor agoni...,Approximately 40% after oral inhalation,Flunisolide is a synthetic corticosteroid. It ...,Flunisolide is a solid.,Flunisolide is anatomically related to respira...,Flunisolide is in the therapeutic group of nas...,...,drug,Flunisolide,DrugBank,Flunisolide,[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@H](F)C5=CC...,2.20,1.56,3.74e-02 g/l,13.73,-2.90
1,14015,Alclometasone is synthetic glucocorticoid ster...,,For the relief of the inflammatory and pruriti...,The mechanism of the anti-inflammatory activit...,,Alclometasone is a synthetic corticosteroid fo...,Alclometasone is a solid.,Alclometasone is anatomically related to derma...,Alclometasone is in the therapeutic group of c...,...,drug,Alclometasone,DrugBank,Alclometasone,[H][C@@]12C[C@@H](C)[C@](O)(C(=O)CO)[C@@]1(C)C...,2.11,1.68,1.37e-01 g/l,12.45,-2.90
2,14016,Medrysone is a corticosteroid used in ophthalm...,,"For the treatment of allergic conjunctivitis, ...",There is no generally accepted explanation for...,,Medrysone is a topical anti-inflammatory corti...,Medrysone is a solid.,Medrysone is anatomically related to sensory o...,Medrysone is in the therapeutic group of ophth...,...,drug,Medrysone,DrugBank,Medrysone,[H][C@@]12CC[C@H](C(C)=O)[C@@]1(C)C[C@H](O)[C@...,3.06,3.13,3.37e-02 g/l,19.14,-0.26
3,14017,"A glucocorticoid employed, usually as eye drop...",,For the ophthalmic treatment of corticosteroid...,There is no generally accepted explanation for...,,Corticosteroids such as fluorometholone inhibi...,Fluorometholone is a solid.,Fluorometholone is anatomically related to der...,Fluorometholone is in the therapeutic group of...,...,drug,Fluorometholone,DrugBank,Fluorometholone,[H][C@@]12CC[C@](O)(C(C)=O)[C@@]1(C)C[C@H](O)[...,2.34,2.42,1.66e-02 g/l,12.65,-3.40
4,14018,Beclomethasone dipropionate is a second-genera...,"Following intravenous administration, the half...",Indicated for oral inhalation use in the maint...,Beclomethasone dipropionate is a corticosteroi...,"Based on the findings of _in vitro_ studies, t...","Inflammatory conditions, including asthma, der...",Beclomethasone dipropionate is a solid.,Beclomethasone dipropionate is anatomically re...,Beclomethasone dipropionate is in the therapeu...,...,drug,Beclomethasone dipropionate,DrugBank,Beclomethasone dipropionate,[H][C@@]12C[C@H](C)[C@](OC(=O)CC)(C(=O)COC(=O)...,3.69,4.43,2.08e-03 g/l,13.85,-3.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6943,21955,,,,,,,,Metabutethamine is anatomically related to ner...,Metabutethamine is in the therapeutic group of...,...,drug,Metabutethamine,DrugBank,,CC(C)CNCCOC(=O)C1=CC=CC(N)=C1,,,,,
6944,21956,,,,,,,,Quinisocaine is anatomically related to dermat...,Quinisocaine is in the therapeutic group of an...,...,drug,Quinisocaine,DrugBank,,CCCCC1=CC2=CC=CC=C2C(OCCN(C)C)=N1,,,,,
6945,39888,Benzyl benzoate is one of the older preparatio...,,Used to kill lice and the mites responsible fo...,Benzyl benzoate exerts toxic effects on the ne...,,Benzyl benzoate is one of the older preparatio...,Benzyl benzoate is a liquid.,Benzyl benzoate is anatomically related to ant...,Benzyl benzoate is in the therapeutic group of...,...,drug,Benzyl benzoate,DrugBank,,O=C(OCC1=CC=CC=C1)C1=CC=CC=C1,,,,,
6946,39894,Sulfur hexafluoride is an ultrasound contrast ...,The terminal half-life of SF6 in blood was app...,Echocardiography: Sulfur hexafluoride is indic...,"Within the blood, the acoustic impedance of Lu...",,Sulfur hexafluoride provides useful echocardio...,,Sulfur hexafluoride is anatomically related to...,Sulfur hexafluoride is in the therapeutic grou...,...,drug,Sulfur hexafluoride,DrugBank,,FS(F)(F)(F)(F)F,,,,,


In [19]:
# Check a sample node information of disease
biobridge_node_info['disease']

Unnamed: 0,node_index,mondo_id,mondo_name,definition
0,27165,8019,mullerian aplasia and hyperandrogenism,"Deficiency of the glycoprotein WNT4, associate..."
1,27165,8019,mullerian aplasia and hyperandrogenism,"Deficiency of the glycoprotein WNT4, associate..."
2,27166,11043,"myelodysplasia, immunodeficiency, facial dysmo...",
3,27168,8878,"bone dysplasia, lethal Holmgren type",Bone dysplasia lethal Holmgren type (BDLH) is ...
4,27169,8905,predisposition to invasive fungal disease due ...,"A rare, genetic primary immunodeficiency chara..."
...,...,...,...,...
44128,99866,44144,panic disorder with agoraphobia,A disorder in which an individual experiences ...
44129,99916,44797,desmoplastic nevus,A benign melanocytic nevus characterized by th...
44130,99916,44800,desmoplastic spitz nevus,A Spitz nevus associated with fibrous stroma f...
44131,99969,100023,self-limited familial and non-familial neonata...,A neonatal/infantile epilepsy sndrome that is ...
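In the outputs above, some `node_index` values appear in more than one row: for example `27165` is repeated with the same `mondo_id`, while `99916` appears with two different `mondo_id` values. The sketch below groups a handful of `(node_index, mondo_id)` pairs copied from the disease output to tell these two cases apart. It uses plain Python on a small sample, and the variable names are chosen for illustration only.

```python
from collections import defaultdict

# (node_index, mondo_id) pairs copied from the disease output above.
sample_rows = [
    (27165, 8019),
    (27165, 8019),    # exact repeat of the previous row
    (27166, 11043),
    (99916, 44797),
    (99916, 44800),   # same node, different MONDO id
]

# Collect the distinct MONDO ids seen for each node index.
mondo_ids_by_node = defaultdict(set)
for node_index, mondo_id in sample_rows:
    mondo_ids_by_node[node_index].add(mondo_id)

# Nodes that map to more than one distinct MONDO id.
multi_mapped = {
    node: sorted(ids) for node, ids in mondo_ids_by_node.items() if len(ids) > 1
}
print(multi_mapped)
```

Whether repeated rows should be dropped or merged depends on the downstream task, so it is worth checking this before joining node information onto triplets.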


### Check BioBridge-PrimeKG Node Embeddings (Pre-Loaded)

BioBridge provides a dictionary of pre-loaded embeddings, which can be obtained as follows.

In [20]:
# Check node embeddings
emb_dict = biobridge_data.get_node_embeddings()
emb_dict[0]

[0.04029838368296623,
 -0.018344514071941376,
 0.02762659639120102,
 -0.026468712836503983,
 0.021834833547472954,
 -0.04956040903925896,
 0.013426685705780983,
 0.04726368933916092,
 -0.025193220004439354,
 -0.004347709938883781,
 -0.09398091584444046,
 -0.02682836912572384,
 0.06272736936807632,
 0.03773018345236778,
 -0.0003949799865949899,
 -0.10644476860761642,
 -0.04382409527897835,
 -0.03279171884059906,
 0.03302460163831711,
 0.0036869393661618233,
 -0.0472925640642643,
 0.015392928384244442,
 0.01283049676567316,
 -0.04233483597636223,
 0.009237916208803654,
 -0.05455828458070755,
 0.024593649432063103,
 -0.09538378566503525,
 -0.0695975124835968,
 -0.010735561139881611,
 0.005214910954236984,
 0.11971891671419144,
 -0.0430755540728569,
 -0.00513798464089632,
 0.04484416916966438,
 0.08294414728879929,
 0.07404263317584991,
 0.022557679563760757,
 -0.046012863516807556,
 0.016955774277448654,
 -0.023771632462739944,
 0.00910295732319355,
 0.008506315760314465,
 0.0026993537321

In [21]:
len(emb_dict)

85466

In [22]:
# Check embedding dimension
len(emb_dict[0])

2560

In [23]:
# Cross-check the node embeddings with the node types
primekg_nodes[primekg_nodes.node_index == 0]

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
0,0,PHYHIP,NCBI,9796,gene/protein
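The same cross-check can be extended to every node by comparing each embedding length with the `emb_dim` entry for its node type. The snippet below is a sketch under two assumptions: that the embedding dictionary is keyed by `node_index` (as the lookup `emb_dict[0]` above suggests) and that a `node_index` to `node_type` lookup is available. The toy vectors are placeholders, not real embeddings, and `find_dim_mismatches` is a hypothetical helper name.

```python
def find_dim_mismatches(embeddings, node_type_of, emb_dim):
    """Return node indices whose embedding length differs from the configured size."""
    mismatches = []
    for node_index, vector in embeddings.items():
        expected = emb_dim[node_type_of[node_index]]
        if len(vector) != expected:
            mismatches.append(node_index)
    return mismatches


# Subset of the `emb_dim` config shown earlier in this tutorial.
emb_dim = {"gene/protein": 2560, "drug": 512}

# Toy data: zero vectors stand in for real embeddings.
toy_embeddings = {
    0: [0.0] * 2560,      # gene/protein with the expected length
    14014: [0.0] * 512,   # drug with the expected length
    14015: [0.0] * 768,   # drug with a deliberately wrong length
}
toy_node_types = {0: "gene/protein", 14014: "drug", 14015: "drug"}

mismatches = find_dim_mismatches(toy_embeddings, toy_node_types, emb_dim)
print(mismatches)  # [14015]
```

With the real objects, the lookup could be built from the node dataframe, for example `dict(zip(primekg_nodes.node_index, primekg_nodes.node_type))`.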


### Check Triplet Splits of BioBridge-PrimeKG

Lastly, the BioBridge-PrimeKG splits contain train and test dataframes for both triplets and nodes.

In [24]:
# Get all triplets of BioBridge PrimeKG
triplets = biobridge_data.get_primekg_triplets()
triplets.head()

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,1,8889,KIF15,NCBI,56992,1,3,protein_protein
1,1,GPANK1,NCBI,7918,1,2798,PNMA1,NCBI,9240,1,3,protein_protein
2,2,ZRSR2,NCBI,8233,1,5646,TTC33,NCBI,23548,1,3,protein_protein
3,3,NRF1,NCBI,4899,1,11592,MAN1B1,NCBI,11253,1,3,protein_protein
4,4,PI4KA,NCBI,5297,1,2122,RGS20,NCBI,8601,1,3,protein_protein


In [25]:
triplets

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,1,8889,KIF15,NCBI,56992,1,3,protein_protein
1,1,GPANK1,NCBI,7918,1,2798,PNMA1,NCBI,9240,1,3,protein_protein
2,2,ZRSR2,NCBI,8233,1,5646,TTC33,NCBI,23548,1,3,protein_protein
3,3,NRF1,NCBI,4899,1,11592,MAN1B1,NCBI,11253,1,3,protein_protein
4,4,PI4KA,NCBI,5297,1,2122,RGS20,NCBI,8601,1,3,protein_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3904605,52855,B cell receptor transport into membrane raft,GO,32597,0,34572,CD24,NCBI,100133941,1,2,bioprocess_protein
3904606,113352,chemokine receptor transport out of membrane raft,GO,32600,0,34572,CD24,NCBI,100133941,1,2,bioprocess_protein
3904607,42264,negative regulation of cytoskeleton organization,GO,51494,0,57675,IQCJ-SCHIP1,NCBI,100505385,1,2,bioprocess_protein
3904608,109904,mesendoderm migration,GO,90133,0,58770,APELA,NCBI,100506013,1,2,bioprocess_protein
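In these triplets, `head_type`, `tail_type`, and `display_relation` are stored as the integer ids from the data config rather than as strings. The sketch below decodes two rows copied from the output above back into readable form. It uses plain dictionaries instead of a dataframe, includes only the subset of the mappings needed for those rows, and `decode_triplet` is a hypothetical helper name.

```python
# Subsets of the `node_type` and `relation_type` mappings from the data config.
node_type = {"biological_process": 0, "gene/protein": 1}
relation_type = {"interacts with": 2, "ppi": 3}

# Invert the mappings: integer id -> readable name.
id_to_node_type = {idx: name for name, idx in node_type.items()}
id_to_relation = {idx: name for name, idx in relation_type.items()}


def decode_triplet(row):
    """Turn the integer-coded fields of one triplet row into readable strings."""
    return {
        "head": f"{row['head_name']} ({id_to_node_type[row['head_type']]})",
        "relation": id_to_relation[row["display_relation"]],
        "tail": f"{row['tail_name']} ({id_to_node_type[row['tail_type']]})",
    }


# First and last rows of the triplets output above, reduced to the fields used here.
first_row = {"head_name": "PHYHIP", "head_type": 1,
             "tail_name": "KIF15", "tail_type": 1, "display_relation": 3}
last_row = {"head_name": "mesendoderm migration", "head_type": 0,
            "tail_name": "APELA", "tail_type": 1, "display_relation": 2}

decoded_first = decode_triplet(first_row)
decoded_last = decode_triplet(last_row)
print(decoded_first)
print(decoded_last)
```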


Finally, we can inspect the contents of each split as follows.

In [26]:
# Get the train-test split of the BioBridge-PrimeKG dataframes
biobridge_splits = biobridge_data.get_train_test_split()

In [27]:
list(biobridge_splits.keys())

['train', 'node_train', 'test', 'node_test']

In [28]:
# Check dataframe of training triplets
biobridge_splits["train"]

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,1,8889,KIF15,NCBI,56992,1,3,protein_protein
1,1,GPANK1,NCBI,7918,1,2798,PNMA1,NCBI,9240,1,3,protein_protein
2,2,ZRSR2,NCBI,8233,1,5646,TTC33,NCBI,23548,1,3,protein_protein
3,3,NRF1,NCBI,4899,1,11592,MAN1B1,NCBI,11253,1,3,protein_protein
4,4,PI4KA,NCBI,5297,1,2122,RGS20,NCBI,8601,1,3,protein_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3768281,124473,longitudinal sarcoplasmic reticulum,GO,14801,7,58744,DHRS7C,NCBI,201140,1,2,cellcomp_protein
3768282,55747,myofilament,GO,36379,7,57367,MYBPHL,NCBI,343263,1,2,cellcomp_protein
3768285,126945,lateral wall of outer hair cell,GO,120249,7,22033,SLC26A5,NCBI,375611,1,2,cellcomp_protein
3768286,125456,Swi5-Swi2 complex,GO,34974,7,57415,SWI5,NCBI,375757,1,2,cellcomp_protein


In [29]:
# Check dataframe of training nodes
biobridge_splits["node_train"]

Unnamed: 0,node_index,node_type
0,0,1
1,1,1
2,2,1
3,3,1
4,4,1
...,...,...
76481,127431,7
76482,127432,7
76483,127433,7
76484,127239,7


In [30]:
# Check dataframe of testing triplets
biobridge_splits["test"]

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
8,8,MT1A,NCBI,4489,1,1785,TP53,NCBI,7157,1,3,protein_protein
12,12,CD7,NCBI,924,1,7681,SFXN5,NCBI,94097,1,3,protein_protein
16,16,SNRPD2,NCBI,6633,1,3235,PRPF4,NCBI,9128,1,3,protein_protein
18,19,VAV3,NCBI,10451,1,3005,ZRANB1,NCBI,54764,1,3,protein_protein
29,16,SNRPD2,NCBI,6633,1,216,NCSTN,NCBI,23385,1,3,protein_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3768273,125342,myosin V complex,GO,31475,7,9639,DYNLL2,NCBI,140735,1,2,cellcomp_protein
3768277,55608,extracellular membrane-bounded organelle,GO,65010,7,57129,PHOSPHO1,NCBI,162466,1,2,cellcomp_protein
3768283,124243,axonemal outer doublet,GO,97545,7,59351,CFAP100,NCBI,348807,1,2,cellcomp_protein
3768284,124243,axonemal outer doublet,GO,97545,7,59352,CFAP73,NCBI,387885,1,2,cellcomp_protein


In [31]:
# Check dataframe of testing nodes
biobridge_splits["node_test"]

Unnamed: 0,node_index,node_type
0,8,1
1,12,1
2,16,1
3,19,1
4,34,1
...,...,...
8490,127404,7
8491,127415,7
8492,127421,7
8493,127425,7
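A quick sanity check on such a split is to confirm that the train and test node sets do not overlap. The sketch below runs this check on the first few `node_index` values shown in the outputs above. It also tests whether the heads of the displayed test triplets fall inside the test node set, which the rows shown here are consistent with but do not prove for the full dataset. The helper name `split_overlap` is an assumption for this example.

```python
def split_overlap(train_nodes, test_nodes):
    """Return the set of node indices that appear in both splits."""
    return set(train_nodes) & set(test_nodes)


# First node_index values of `node_train` and `node_test`, copied from the outputs above.
train_nodes = [0, 1, 2, 3, 4]
test_nodes = [8, 12, 16, 19, 34]

# head_index values of the first test triplets shown above.
test_triplet_heads = [8, 12, 16, 19, 16]

overlap = split_overlap(train_nodes, test_nodes)
heads_in_test = all(head in set(test_nodes) for head in test_triplet_heads)

print(f"overlapping nodes: {sorted(overlap)}")
print(f"all displayed test heads are test nodes: {heads_in_test}")
```

On the full data, the same function could be called with `biobridge_splits["node_train"]["node_index"]` and `biobridge_splits["node_test"]["node_index"]`.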
