# Tutorial: StarkQA-PrimeKG Loader

This tutorial explains how to load the StarkQA-PrimeKG dataset, a dataset for question answering over the PrimeKG knowledge graph.

Background information about the StarkQA-PrimeKG dataset can be found in the following resources:
- https://github.com/snap-stanford/stark
- https://stark.stanford.edu/
- https://huggingface.co/datasets/snap-stanford/stark

We first need to import the necessary libraries as follows.

In [1]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.tools.starkqa_primekg_loader import (
    StarkQAPrimeKGData,
    StarkQAPrimeKGLoaderInput,
    StarkQAPrimeKGLoaderTool,
)

### Load StarkQA-PrimeKG

The `StarkQAPrimeKGLoaderTool` downloads the data from the Hugging Face Hub if it is not available locally.

Otherwise, the data is loaded from the local directory defined by the `local_dir` parameter of `StarkQAPrimeKGData`.

In [2]:
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKGData(local_dir="../../../../data/starkqa_primekg_test/")

# Define starkqa primekg loader input by providing the starkqa primekg data
loader_input = StarkQAPrimeKGLoaderInput(data=starkqa_data)

To load the StarkQA dataframe and its split indices, we invoke the tool as follows.

In [3]:
# Define starkqa primekg loader tool and call run method to load the dataframe and its split indices
tool = StarkQAPrimeKGLoaderTool()
starkqa_df, split_idx = tool.call_run(loader_input.data.repo_id,
                                      loader_input.data.local_dir)

../../../../data/starkqa_primekg_test/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.
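The message above reflects a "use the local copy if present, otherwise download" pattern. The following is a minimal sketch of that pattern, not the tool's actual implementation. The names `ensure_local`, `fake_download`, and `counting_download` are hypothetical and exist only for illustration; the relative file path is taken from the log message above.

```python
import os
import tempfile

# Relative path of the QA file, as shown in the log message above.
QA_RELATIVE_PATH = os.path.join("qa", "prime", "stark_qa", "stark_qa.csv")


def ensure_local(local_dir, download):
    """Return the QA file path, calling `download(local_dir)` only if the file is missing.

    Hypothetical helper for illustration; not part of the aiagents4pharma API.
    """
    path = os.path.join(local_dir, QA_RELATIVE_PATH)
    if os.path.exists(path):
        print(f"{path} already exists. Loading the data from the local directory.")
    else:
        download(local_dir)
    return path


def fake_download(local_dir):
    """Stand-in for a real download: writes a header-only CSV at the expected location."""
    path = os.path.join(local_dir, QA_RELATIVE_PATH)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("id,query,answer_ids\n")


calls = []  # records every time a download is actually triggered


def counting_download(local_dir):
    calls.append(local_dir)
    fake_download(local_dir)


with tempfile.TemporaryDirectory() as tmp_dir:
    first = ensure_local(tmp_dir, counting_download)   # file missing: download runs
    second = ensure_local(tmp_dir, counting_download)  # file present: download skipped
    file_exists = os.path.exists(second)
```

In a real setting, the `download` callable could wrap something like `huggingface_hub.snapshot_download(repo_id=..., repo_type="dataset", local_dir=...)`. Whether the loader tool does exactly this internally is an assumption.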


### Check StarkQA-PrimeKG Dataframes


The StarkQA dataframe contains the following columns:
- `id`: Unique identifier for each question and answer pair
- `query`: The synthesized question from the StarkQA dataset
- `answer_ids`: The identifiers of the answers to the question (a question can have multiple answers)

In [4]:
# Check a sample of the starkqa primekg dataframe
starkqa_df.head()

|   | id | query | answer_ids |
|---|----|-------|------------|
| 0 | 0 | Could you identify any skin diseases associate... | [95886] |
| 1 | 1 | What drugs target the CYP3A4 enzyme and are us... | [15450] |
| 2 | 2 | What is the name of the condition characterize... | [98851, 98853] |
| 3 | 3 | What drugs are used to treat epithelioid sarco... | [15698] |
| 4 | 4 | Can you supply a compilation of genes and prot... | [7161, 22045] |
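Because the data is stored as a CSV file, the `answer_ids` values may arrive as strings such as `"[98851, 98853]"` rather than Python lists, depending on how the loader parses the file. That is an assumption and worth checking on the loaded dataframe. The sketch below uses a toy dataframe (not the real data) and a hypothetical helper `to_id_list` to show one way to normalize the column and count answers per question.

```python
import ast

import pandas as pd

# Toy dataframe mimicking the three columns described above.
# answer_ids holds strings here, as it would after a plain CSV read.
toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "query": ["question A", "question B", "question C"],
        "answer_ids": ["[95886]", "[15450]", "[98851, 98853]"],
    }
)


def to_id_list(value):
    """Convert a string such as "[1, 2]" to a list; copy values that are already sequences."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return list(value)


toy_df["answer_ids"] = toy_df["answer_ids"].apply(to_id_list)
toy_df["num_answers"] = toy_df["answer_ids"].apply(len)
print(toy_df[["id", "num_answers"]])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing values read from a file.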


The current StarkQA-PrimeKG dataset has about 11K question-answer pairs.

In [5]:
# Check dimensions of the starkqa primekg dataframe
starkqa_df.shape

(11204, 3)

### Check StarkQA-PrimeKG splits

The StarkQA-PrimeKG splits provide train, validation, and test indices for benchmarking QA models, along with a smaller `test-0.1` set.

In [6]:
# Check the split indices of the starkqa primekg dataframe
split_idx.keys()

dict_keys(['train', 'val', 'test', 'test-0.1'])
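Each key maps to a collection of indices that can be used to slice the dataframe. The sketch below illustrates this on toy data. It assumes the indices are integer row positions (which coincide with `id` when ids run from 0 to N-1); `toy_df` and `toy_split_idx` are illustrative stand-ins, not the real objects.

```python
import pandas as pd

# Toy stand-ins for starkqa_df and split_idx.
toy_df = pd.DataFrame(
    {
        "id": list(range(6)),
        "query": [f"question {i}" for i in range(6)],
    }
)
toy_split_idx = {"train": [0, 1, 2], "val": [3], "test": [4, 5]}

# Build one dataframe per split by positional selection.
split_frames = {
    name: toy_df.iloc[indices].reset_index(drop=True)
    for name, indices in toy_split_idx.items()
}

for name, frame in split_frames.items():
    print(f"{name}: {len(frame)} rows")
```

If the real indices are stored as arrays or tensors rather than lists, converting them to plain integer lists first (for example with `.tolist()`) keeps the selection straightforward.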

Finally, we can check the number of samples in each split as follows.

In [7]:
# Check the number of samples in each split of the starkqa primekg dataframe
for split, idx in split_idx.items():
    print(f"{split}: {len(idx)}")

train: 6162
val: 2241
test: 2801
test-0.1: 280
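As a quick sanity check on the counts above, the following sketch computes the share of each main split and the size of `test-0.1` relative to `test`. The numbers are copied from the output above. Treating `train`, `val`, and `test` as the three main splits is an interpretation based on the key names.

```python
# Split sizes copied from the output above.
split_sizes = {"train": 6162, "val": 2241, "test": 2801, "test-0.1": 280}

main_splits = ("train", "val", "test")
total = sum(split_sizes[name] for name in main_splits)
print(f"total: {total}")

# Share of each main split in the full dataset.
fractions = {name: split_sizes[name] / total for name in main_splits}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")

# Size of the reduced test set relative to the full test split.
test_subset_ratio = split_sizes["test-0.1"] / split_sizes["test"]
print(f"test-0.1 / test: {test_subset_ratio:.1%}")
```

The three main splits add up to 11204, which matches the number of rows in the dataframe, and correspond to roughly 55%, 20%, and 25% of the data. The `test-0.1` set is about 10% the size of the test split.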
