# Tutorial: StarkQA-PrimeKG Loader

In this tutorial, we explain how to load the StarkQA-PrimeKG dataset, a question-answering dataset built on the PrimeKG knowledge graph.

Background information about the StarkQA-PrimeKG dataset can be found in the following resources:
- https://github.com/snap-stanford/stark
- https://stark.stanford.edu/
- https://huggingface.co/datasets/snap-stanford/stark

We first need to import the necessary libraries as follows.

In [1]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG






### Load StarkQA-PrimeKG

The `StarkQAPrimeKG` class downloads the data from the HuggingFace Hub if it is not available locally.

Otherwise, the data is loaded from the local directory given by `local_dir`.
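
The load-or-download behavior described above can be sketched as follows. This is a minimal illustration, not the actual `StarkQAPrimeKG` implementation: the helper `ensure_local` and the `fetch` callback are hypothetical names, and the real class may decide differently what counts as "available locally".

```python
import os
import tempfile


def ensure_local(local_dir, fetch):
    """Return "cached" if local_dir already holds files, else fetch and return "downloaded".

    `fetch` is a hypothetical callback that populates local_dir,
    for example by downloading files from a remote hub.
    """
    if os.path.isdir(local_dir) and os.listdir(local_dir):
        return "cached"
    os.makedirs(local_dir, exist_ok=True)
    fetch(local_dir)
    return "downloaded"


def fake_fetch(target_dir):
    # Stand-in for a real download: writes a single placeholder file.
    with open(os.path.join(target_dir, "stark_qa.csv"), "w") as handle:
        handle.write("id,query,answer_ids\n")


with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "starkqa_primekg_test")
    first_call = ensure_local(data_dir, fake_fetch)   # directory is missing
    second_call = ensure_local(data_dir, fake_fetch)  # directory is now populated

print(first_call, second_call)
```

With this pattern, only the first call pays the download cost; later calls reuse the files already on disk.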

In [2]:
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg_test/")

To load the StarkQA dataframe, its split indices, and the node information, we call `load_data()` and then the corresponding getter methods.

In [4]:
# Invoke a method to load the data
starkqa_data.load_data()

# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
starkqa_df = starkqa_data.get_starkqa()
starkqa_split_indices = starkqa_data.get_starkqa_split_indicies()
starkqa_node_info = starkqa_data.get_starkqa_node_info()

Downloading files from snap-stanford/stark


  0%|          | 0/7 [00:00<?, ?it/s]

qa/prime/split/test-0.1.index:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

 14%|█▍        | 1/7 [00:00<00:01,  3.57it/s]

qa/prime/split/test.index:   0%|          | 0.00/14.0k [00:00<?, ?B/s]

 29%|██▊       | 2/7 [00:00<00:01,  3.47it/s]

qa/prime/split/train.index:   0%|          | 0.00/30.9k [00:00<?, ?B/s]

 43%|████▎     | 3/7 [00:00<00:01,  3.63it/s]

qa/prime/split/val.index:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

 57%|█████▋    | 4/7 [00:01<00:00,  3.58it/s]

qa/prime/stark_qa/stark_qa.csv:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

 71%|███████▏  | 5/7 [00:01<00:00,  3.08it/s]

(…)ark_qa/stark_qa_human_generated_eval.csv:   0%|          | 0.00/14.8k [00:00<?, ?B/s]

 86%|████████▌ | 6/7 [00:01<00:00,  3.28it/s]

processed.zip:   0%|          | 0.00/28.0M [00:00<?, ?B/s]

100%|██████████| 7/7 [00:02<00:00,  2.38it/s]


### Check StarkQA-PrimeKG Dataframes


The StarkQA dataframe contains the following columns:
- `id`: Unique identifier for each question-answer pair
- `query`: The synthesized question from the StarkQA dataset
- `answer_ids`: The identifiers of the answer nodes for the question (multiple answers are possible)

In [6]:
# Check a sample of the starkqa primekg dataframe
starkqa_df.head()

|   | id | query | answer_ids |
|---|----|-------|------------|
| 0 | 0 | Could you identify any skin diseases associate... | [95886] |
| 1 | 1 | What drugs target the CYP3A4 enzyme and are us... | [15450] |
| 2 | 2 | What is the name of the condition characterize... | [98851, 98853] |
| 3 | 3 | What drugs are used to treat epithelioid sarco... | [15698] |
| 4 | 4 | Can you supply a compilation of genes and prot... | [7161, 22045] |
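
Each `answer_ids` entry holds one or more identifiers. This tutorial does not show whether `get_starkqa()` returns them as Python lists or as strings such as `"[98851, 98853]"`. The sketch below assumes the string form may occur and normalizes both cases with `ast.literal_eval`; the helper name `parse_answer_ids` is hypothetical.

```python
import ast


def parse_answer_ids(value):
    """Convert "[98851, 98853]" (or an existing list) into a list of ints."""
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, so it is safer than eval.
        value = ast.literal_eval(value)
    return [int(item) for item in value]


# Two rows copied from the sample output above; the string form is an assumption.
rows = [
    {"id": 0, "answer_ids": "[95886]"},
    {"id": 2, "answer_ids": "[98851, 98853]"},
]

parsed = {row["id"]: parse_answer_ids(row["answer_ids"]) for row in rows}
print(parsed)
```

If the column already contains lists, the helper simply returns a list of integers unchanged in content.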


The current StarkQA-PrimeKG contains 11,204 question-answer pairs.

In [7]:
# Check dimensions of the starkqa primekg dataframe
starkqa_df.shape

(11204, 3)

### Check StarkQA-PrimeKG Node Information

StarkQA provides additional node information for PrimeKG, stored as a dictionary for each node.

This allows us to further enrich the features of the knowledge graph nodes.

In [8]:
# Check the node information of PrimeKG
starkqa_node_info[0]

{'id': 9796,
 'type': 'gene/protein',
 'name': 'PHYHIP',
 'source': 'NCBI',
 'details': {'query': 'PHYHIP',
  '_id': '9796',
  '_score': 17.934021,
  'alias': ['DYRK1AP3', 'PAHX-AP', 'PAHXAP1'],
  'genomic_pos': {'chr': '8',
   'end': 22232101,
   'ensemblgene': 'ENSG00000168490',
   'start': 22219703,
   'strand': -1},
  'name': 'phytanoyl-CoA 2-hydroxylase interacting protein',
  'summary': 'Enables protein tyrosine kinase binding activity. Involved in protein localization. Located in cytoplasm. [provided by Alliance of Genome Resources, Apr 2022]'}}
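
One way to use this information for enriching node features is to flatten each dictionary into a short text description, which could then be passed to a text-embedding model. The sketch below is only an illustration: `node_to_text` is a hypothetical helper, the sample is an abbreviated copy of the PHYHIP entry above, and other node types may carry different keys under `details`, which is why missing fields are handled with fallbacks.

```python
def node_to_text(node):
    """Build a one-line description from a node-info dictionary."""
    details = node.get("details") or {}
    # Fall back to the short name when no longer name is available.
    full_name = details.get("name", node.get("name", ""))
    summary = details.get("summary", "")
    text = f"{node['name']} ({node['type']}): {full_name}"
    if summary:
        text += f". {summary}"
    return text


# Abbreviated copy of the PHYHIP entry printed above.
phyhip = {
    "id": 9796,
    "type": "gene/protein",
    "name": "PHYHIP",
    "source": "NCBI",
    "details": {
        "name": "phytanoyl-CoA 2-hydroxylase interacting protein",
        "summary": "Enables protein tyrosine kinase binding activity.",
    },
}

description = node_to_text(phyhip)
print(description)
```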

### Check StarkQA-PrimeKG Splits

The StarkQA-PrimeKG splits contain train, validation, and test indices for benchmarking QA-driven AI models, plus a smaller `test-0.1` split.

In [9]:
# Check the split indices of the starkqa primekg dataframe
starkqa_split_indices.keys()

dict_keys(['train', 'val', 'test', 'test-0.1'])

Finally, we can check the number of samples in each split as follows.

In [10]:
# Check the number of samples in each split of the starkqa primekg dataframe
for split, idx in starkqa_split_indices.items():
    print(f"{split}: {len(idx)}")

train: 6162
val: 2241
test: 2801
test-0.1: 280
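
The counts above add up: 6162 + 2241 + 2801 = 11204, the number of rows in the dataframe, while `test-0.1` (280) is about 10% of the test split. This suggests that train, val, and test partition the dataset. A sanity check along these lines can be sketched on toy data as follows. The helper `check_splits` is hypothetical, and the sketch assumes that the indices are integer row positions from 0 to `n_rows - 1` and that `test-0.1` is drawn from `test`; neither is confirmed by this tutorial.

```python
def check_splits(splits, n_rows, main=("train", "val", "test")):
    """Report whether the main splits are disjoint and cover all row positions."""
    seen = set()
    total = 0
    for name in main:
        indices = [int(i) for i in splits[name]]
        total += len(indices)
        seen.update(indices)
    return {
        # No index appears twice across (or within) the main splits.
        "disjoint": total == len(seen),
        # Every row position 0..n_rows-1 belongs to some main split.
        "covers_all": seen == set(range(n_rows)),
    }


# Toy stand-in for starkqa_split_indices with 10 rows instead of 11204.
toy_splits = {
    "train": [0, 1, 2, 3, 4, 5],
    "val": [6, 7],
    "test": [8, 9],
    "test-0.1": [9],
}

report = check_splits(toy_splits, n_rows=10)
subset_ok = set(toy_splits["test-0.1"]) <= set(toy_splits["test"])
print(report, subset_ok)
```

On the real data, the same call would take `starkqa_split_indices` and `len(starkqa_df)` as arguments, provided the indices can be converted to integers as assumed here.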
