# StarkQA-PrimeKG Loader

This tutorial explains how to load the StarkQA-PrimeKG dataset, a dataset for question answering over the PrimeKG knowledge graph.

Background information about the StarkQA-PrimeKG dataset is available from the following sources:
- https://github.com/snap-stanford/stark
- https://stark.stanford.edu/
- https://huggingface.co/datasets/snap-stanford/stark

First, we import the necessary libraries.

In [1]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG






### Load StarkQA-PrimeKG

The `StarkQAPrimeKG` class downloads the data from the HuggingFace Hub if it is not available locally.

Otherwise, the data is loaded from the local directory given by `local_dir`.

In [2]:
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")
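
The load-or-download behaviour described above can be pictured with a minimal sketch. This is not the implementation of `StarkQAPrimeKG`: `ensure_local_file` and the `fetch` callback are hypothetical names used only for illustration, and the fake downloader stands in for a real HuggingFace Hub download.

```python
import os
import tempfile
from typing import Callable


def ensure_local_file(local_dir: str, filename: str,
                      fetch: Callable[[str], None]) -> str:
    """Return the path of `filename` in `local_dir`, fetching it only if missing.

    `fetch` is a hypothetical callback that writes the file to the given path.
    """
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, filename)
    if not os.path.exists(path):
        fetch(path)  # download only when no local copy exists
    return path


# Demonstration with a fake downloader that records how often it is called
calls = []


def fake_fetch(path: str) -> None:
    calls.append(path)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("id,query,answer_ids\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    first = ensure_local_file(tmp_dir, "stark_qa.csv", fake_fetch)
    second = ensure_local_file(tmp_dir, "stark_qa.csv", fake_fetch)

download_count = len(calls)  # the second call reuses the local copy
```

The point of the pattern is that repeated runs of the notebook do not repeat the download once the files exist in `local_dir`.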

To load the StarkQA dataframe, its splits, the node information, and the embeddings, we call `load_data()` and then use the corresponding getter methods.

In [3]:
# Invoke a method to load the data
starkqa_data.load_data()

# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
starkqa_df = starkqa_data.get_starkqa()
starkqa_split_indices = starkqa_data.get_starkqa_split_indicies()
starkqa_node_info = starkqa_data.get_starkqa_node_info()
query_embeddings = starkqa_data.get_query_embeddings()
node_embeddings = starkqa_data.get_node_embeddings()

Loading StarkQAPrimeKG dataset...
Downloading files from snap-stanford/stark




Loading StarkQAPrimeKG embeddings...
Downloading embeddings from text-embedding-ada-002


Downloading...
From (original): https://drive.google.com/uc?id=1MshwJttPZsHEM2cKA5T13SIrsLeBEdyU
From (redirected): https://drive.google.com/uc?id=1MshwJttPZsHEM2cKA5T13SIrsLeBEdyU&confirm=t&uuid=ae35d8f3-b8d6-454d-bb67-f8d73223508d
To: c:\Users\mulyadi\Repo\data\starkqa_primekg_test\text-embedding-ada-002\query\query_emb_dict.pt
100%|██████████| 72.0M/72.0M [00:02<00:00, 31.3MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=16EJvCMbgkVrQ0BuIBvLBp-BYPaye-Edy
From (redirected): https://drive.google.com/uc?id=16EJvCMbgkVrQ0BuIBvLBp-BYPaye-Edy&confirm=t&uuid=a9359c11-415b-40f2-a72e-ab25fcd5372b
To: c:\Users\mulyadi\Repo\data\starkqa_primekg_test\text-embedding-ada-002\doc\candidate_emb_dict.pt
100%|██████████| 832M/832M [00:34<00:00, 24.1MB/s] 


### Check StarkQA-PrimeKG Dataframe

The StarkQA dataframe contains the following columns:
- `id`: Unique identifier of each question-answer pair
- `query`: The synthesized question from the StarkQA dataset
- `answer_ids`: Identifiers of the answers to the question (a question can have multiple answers)

In [5]:
# Check a sample of the starkqa primekg dataframe
starkqa_df.head()

| | id | query | answer_ids |
|---|---|---|---|
| 0 | 0 | Could you identify any skin diseases associate... | [95886] |
| 1 | 1 | What drugs target the CYP3A4 enzyme and are us... | [15450] |
| 2 | 2 | What is the name of the condition characterize... | [98851, 98853] |
| 3 | 3 | What drugs are used to treat epithelioid sarco... | [15698] |
| 4 | 4 | Can you supply a compilation of genes and prot... | [7161, 22045] |
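
Depending on how the CSV was read, `answer_ids` may hold Python lists or their string representations. The sketch below normalizes both cases. `parse_answer_ids` is a hypothetical helper, and whether `get_starkqa()` already returns parsed lists is an assumption worth checking on your own data.

```python
import ast
from typing import List, Union


def parse_answer_ids(value: Union[str, List[int]]) -> List[int]:
    """Return answer IDs as a list of ints, parsing a string like '[1, 2]' if needed."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # safe parsing of a Python literal
    return [int(answer_id) for answer_id in value]


# Values taken from the sample rows of the dataframe
single = parse_answer_ids("[95886]")
multiple = parse_answer_ids("[98851, 98853]")
already_list = parse_answer_ids([7161, 22045])
```

With pandas, the helper could be applied column-wise as `starkqa_df["answer_ids"].apply(parse_answer_ids)`.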


The current version of StarkQA-PrimeKG contains about 11K question-answer pairs.

In [6]:
# Check dimensions of the starkqa primekg dataframe
starkqa_df.shape

(11204, 3)

### Check StarkQA-PrimeKG Node Information

StarkQA provides additional information about each PrimeKG node in the form of a dictionary.

This allows us to further enrich the features of the knowledge graph nodes.

In [7]:
# Check the node information of PrimeKG
starkqa_node_info[0]

{'id': 9796,
 'type': 'gene/protein',
 'name': 'PHYHIP',
 'source': 'NCBI',
 'details': {'query': 'PHYHIP',
  '_id': '9796',
  '_score': 17.934021,
  'alias': ['DYRK1AP3', 'PAHX-AP', 'PAHXAP1'],
  'genomic_pos': {'chr': '8',
   'end': 22232101,
   'ensemblgene': 'ENSG00000168490',
   'start': 22219703,
   'strand': -1},
  'name': 'phytanoyl-CoA 2-hydroxylase interacting protein',
  'summary': 'Enables protein tyrosine kinase binding activity. Involved in protein localization. Located in cytoplasm. [provided by Alliance of Genome Resources, Apr 2022]'}}
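
As one possible way to use this information for feature enrichment, the sketch below turns a node dictionary into a short text description. `describe_node` is a hypothetical helper; it assumes the top-level keys `name`, `type`, and `source` are present and treats the `name` and `summary` entries of `details` as optional, since other node types may carry different fields.

```python
from typing import Any, Dict


def describe_node(info: Dict[str, Any]) -> str:
    """Build a short text description from a node-information dictionary."""
    details = info.get("details")
    if not isinstance(details, dict):
        details = {}  # tolerate missing or non-dictionary details
    parts = [f"{info['name']} ({info['type']}, source: {info['source']})"]
    for key in ("name", "summary"):
        text = details.get(key)
        if text:
            parts.append(text)
    # Join the pieces as sentences, avoiding doubled full stops
    return ". ".join(part.rstrip(".") for part in parts) + "."


# Abbreviated copy of the PHYHIP node from the output above
node = {
    "id": 9796,
    "type": "gene/protein",
    "name": "PHYHIP",
    "source": "NCBI",
    "details": {
        "name": "phytanoyl-CoA 2-hydroxylase interacting protein",
        "summary": "Enables protein tyrosine kinase binding activity.",
    },
}
description = describe_node(node)
```

A description like this could, for example, be passed to a text-embedding model to produce node features.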

### Check StarkQA-PrimeKG Embeddings (Query & Nodes)

StarkQA provides precomputed embeddings for the queries and nodes, generated with the `text-embedding-ada-002` model.

In [8]:
# Check query embeddings
query_embeddings

{3461: tensor([[-0.0211, -0.0149,  0.0020,  ..., -0.0277, -0.0085, -0.0034]]),
 10065: tensor([[-0.0131,  0.0068,  0.0197,  ..., -0.0382, -0.0209, -0.0300]]),
 8931: tensor([[-0.0299, -0.0173,  0.0056,  ..., -0.0206, -0.0150, -0.0108]]),
 6664: tensor([[-0.0368,  0.0045,  0.0026,  ..., -0.0055, -0.0094, -0.0178]]),
 3798: tensor([[ 0.0009, -0.0095,  0.0231,  ..., -0.0197, -0.0106, -0.0288]]),
 9392: tensor([[-0.0092,  0.0213,  0.0097,  ..., -0.0140, -0.0023, -0.0223]]),
 1039: tensor([[-0.0093, -0.0078,  0.0324,  ..., -0.0046, -0.0062, -0.0395]]),
 2116: tensor([[-0.0197, -0.0117,  0.0251,  ..., -0.0249, -0.0022, -0.0175]]),
 930: tensor([[-0.0028, -0.0036,  0.0514,  ..., -0.0220, -0.0057, -0.0295]]),
 7313: tensor([[-0.0350, -0.0214, -0.0035,  ..., -0.0162, -0.0054, -0.0175]]),
 8787: tensor([[-0.0149, -0.0030, -0.0097,  ..., -0.0245,  0.0041, -0.0193]]),
 7271: tensor([[-0.0099, -0.0031, -0.0148,  ..., -0.0317, -0.0035, -0.0182]]),
 7188: tensor([[-0.0228, -0.0016,  0.0180,  ..., -0.

In [14]:
# Check length and dimension of query embeddings
len(query_embeddings), query_embeddings[0].shape

(11204, torch.Size([1, 1536]))

In [10]:
# Check node embeddings
node_embeddings

{0: tensor([[-0.0497, -0.0080, -0.0108,  ..., -0.0098, -0.0167, -0.0184]]),
 1: tensor([[-0.0310,  0.0156,  0.0047,  ..., -0.0299, -0.0185, -0.0211]]),
 2: tensor([[-0.0295,  0.0115, -0.0203,  ..., -0.0397, -0.0181, -0.0204]]),
 3: tensor([[-0.0320, -0.0132, -0.0218,  ..., -0.0329, -0.0123, -0.0272]]),
 4: tensor([[-0.0325, -0.0123, -0.0177,  ..., -0.0282, -0.0045, -0.0315]]),
 5: tensor([[-0.0136,  0.0236,  0.0044,  ..., -0.0408, -0.0187, -0.0272]]),
 6: tensor([[-0.0257, -0.0240, -0.0326,  ..., -0.0291, -0.0035, -0.0589]]),
 7: tensor([[-0.0396,  0.0123, -0.0340,  ..., -0.0194, -0.0149, -0.0235]]),
 8: tensor([[-0.0345, -0.0122, -0.0214,  ..., -0.0268, -0.0206, -0.0376]]),
 9: tensor([[-0.0239,  0.0046, -0.0159,  ..., -0.0240, -0.0084, -0.0385]]),
 10: tensor([[-0.0468,  0.0205, -0.0175,  ..., -0.0329, -0.0131, -0.0329]]),
 11: tensor([[-0.0305, -0.0085, -0.0157,  ..., -0.0219,  0.0019, -0.0292]]),
 12: tensor([[-0.0392,  0.0038, -0.0234,  ..., -0.0421, -0.0152, -0.0192]]),
 13: tens

In [15]:
# Check length and dimension of node embeddings
len(node_embeddings), node_embeddings[0].shape

(129375, torch.Size([1, 1536]))
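
Because queries and nodes share the same 1536-dimensional embedding space, a simple retrieval baseline is to rank nodes by cosine similarity to a query. The sketch below illustrates the idea with plain Python lists and toy 3-dimensional vectors. `cosine_similarity` and `rank_nodes` are hypothetical helpers written for this example, not part of the loader.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_nodes(query_vec: Sequence[float],
               node_vecs: Dict[int, Sequence[float]], k: int) -> List[int]:
    """Return the IDs of the k nodes most similar to the query."""
    scores = {node_id: cosine_similarity(query_vec, vec)
              for node_id, vec in node_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 3-dimensional vectors standing in for the 1536-dimensional embeddings
toy_nodes = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}
toy_query = [0.9, 0.1, 0.0]
top_two = rank_nodes(toy_query, toy_nodes, k=2)
```

For the real data, each tensor of shape `[1, 1536]` could be converted with `tensor.squeeze(0).tolist()`. With 129,375 nodes, a vectorized approach such as `torch.nn.functional.cosine_similarity` would be far faster than this loop.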

### Check StarkQA-PrimeKG Splits

The StarkQA-PrimeKG splits contain train, validation, and test indices for benchmarking QA-driven AI models.

In [12]:
# Check the split indices of the starkqa primekg dataframe
starkqa_split_indices.keys()

dict_keys(['train', 'val', 'test', 'test-0.1'])

Finally, we can check the number of samples in each split.

In [13]:
# Check the number of samples in each split of the starkqa primekg dataframe
for split, idx in starkqa_split_indices.items():
    print(f"{split}: {len(idx)}")

train: 6162
val: 2241
test: 2801
test-0.1: 280
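
The train, validation, and test counts add up to 6162 + 2241 + 2801 = 11204, the number of rows in the dataframe, which suggests that these three splits partition the dataset. The `test-0.1` split, at 280 samples, appears to be a roughly 10% subset of the test split, although that reading is an assumption based on its name and size. The sketch below shows one way to verify the partition property. `check_partition` is a hypothetical helper, and it assumes that each split is an iterable of integer row indices running from 0 to the dataset size minus one.

```python
from typing import Dict, Iterable, Sequence


def check_partition(splits: Dict[str, Iterable[int]], total: int,
                    names: Sequence[str] = ("train", "val", "test")) -> bool:
    """Return True if the named splits are disjoint and together cover range(total)."""
    seen = set()
    count = 0
    for name in names:
        indices = [int(i) for i in splits[name]]
        count += len(indices)
        seen.update(indices)
    # Disjoint: no index counted twice; complete: every row index is present
    return count == len(seen) and seen == set(range(total))


# Toy splits standing in for starkqa_split_indices
toy_splits = {"train": [0, 1, 2, 3], "val": [4, 5],
              "test": [6, 7, 8, 9], "test-0.1": [6]}
is_partition = check_partition(toy_splits, total=10)
has_overlap = check_partition({"train": [0, 1], "val": [1], "test": [2]}, total=3)
```

On the loaded data, the call would look like `check_partition(starkqa_split_indices, total=len(starkqa_df))`.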
