# Tutorial: PrimeKG Loader

In this tutorial, we explain how to load PrimeKG into dataframes that hold the entities (nodes) and relations (edges) of the knowledge graph.

Background information on PrimeKG can be found in the following repositories:
- https://github.com/mims-harvard/PrimeKG
- https://github.com/mims-harvard/TDC/

Note that we use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM

At the time of writing, the latest version of PrimeKG (`kg.csv`) is `2.1`.

First, we import the necessary libraries:

In [None]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.tools.primekg_loader import PrimeKGData, PrimeKGLoaderInput, PrimeKGLoaderTool

### Load PrimeKG

The `PrimeKGLoaderTool` downloads the data from the Harvard Dataverse server if it is not available locally.

Otherwise, the data is loaded from the local directory defined by the `local_dir` parameter of `PrimeKGData`.
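
As a rough illustration of the load-or-download behavior described above, the following sketch uses only the standard library. It is not the implementation of `PrimeKGLoaderTool`: `load_or_download` and `fake_download` are hypothetical names introduced here, and the download is simulated by writing a small file.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_download(local_dir: str, filename: str,
                     download: Callable[[Path], None]) -> Path:
    """Return the path to `filename` in `local_dir`, downloading it only if missing."""
    target = Path(local_dir) / filename
    if not target.exists():
        # Create the directory on first use, then fetch the file.
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)
    return target


# Hypothetical stand-in for a real download: it records the call and writes a tiny CSV.
calls = []


def fake_download(path: Path) -> None:
    calls.append(path.name)
    path.write_text("node_index,node\n0,A\n")


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_download(tmp, "nodes.csv", fake_download)   # file missing: downloads
    second = load_or_download(tmp, "nodes.csv", fake_download)  # file present: reuses it
    downloaded_once = len(calls) == 1 and first == second

print(downloaded_once)
```

The second call finds the file already in place, so the download function runs only once.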

In [None]:
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKGData(local_dir="../../../../data/primekg_test/")

# Define primekg loader input by providing the primekg data
loader_input = PrimeKGLoaderInput(data=primekg_data)

To load the dataframes of nodes and edges from PrimeKG, we just need to invoke the tool as follows.

In [None]:
# Define primekg loader tool and call run method to load the dataframes of nodes and edges
tool = PrimeKGLoaderTool()
primekg_nodes, primekg_edges = tool.call_run(loader_input.data.name,
                                             loader_input.data.server_path,
                                             loader_input.data.file_id,
                                             loader_input.data.local_dir)


### Check PrimeKG Dataframes

As noted above, `primekg_nodes` and `primekg_edges` are the dataframes of nodes and edges, respectively.

We can further analyze the dataframes to extract the information we need.

For instance, we can construct a graph from the node and edge dataframes using the networkx library.
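
A minimal sketch of such a graph construction is shown here on toy dataframes, so that it runs without downloading PrimeKG. The toy values are made up; only the column names follow the ones documented in this tutorial. The choice of a directed multigraph is an assumption made for illustration.

```python
import networkx as nx
import pandas as pd

# Toy stand-ins for primekg_nodes and primekg_edges (values are made up).
nodes = pd.DataFrame({
    "node_index": [0, 1, 2],
    "node": ["gene_a", "drug_b", "disease_c"],
    "node_type": ["gene", "drug", "disease"],
})
edges = pd.DataFrame({
    "node_source_index": [1, 1],
    "node_target_index": [0, 2],
    "edge_type": ["targets", "treats"],
})

# One edge per row; a MultiDiGraph keeps parallel edges of different types.
graph = nx.from_pandas_edgelist(
    edges,
    source="node_source_index",
    target="node_target_index",
    edge_attr="edge_type",
    create_using=nx.MultiDiGraph,
)

# Attach node names and types as node attributes, keyed by node_index.
nx.set_node_attributes(graph, nodes.set_index("node_index").to_dict("index"))

print(graph.number_of_nodes(), graph.number_of_edges())
print(graph.nodes[1])
```

With the real data, `primekg_nodes` and `primekg_edges` would take the place of `nodes` and `edges`. Whether a directed or an undirected graph is appropriate depends on the analysis.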

#### PrimeKG Nodes

`primekg_nodes` is a dataframe of nodes, which has the following columns:
- `node_index`: the index of the node
- `node`: the node name
- `node_id`: the id of the node (currently set to the node name itself, for visualization purposes)
- `node_uid`: the unique identifier of the node (source name + unique id)
- `node_type`: the type of the node

We can inspect the first few rows of `primekg_nodes` to see what the node records look like.

In [None]:
# Check a sample of the primekg nodes
primekg_nodes.head()

The current PrimeKG contains about 130K nodes in total, as the following cell shows.

In [None]:
# Check dimensions of the primekg nodes
primekg_nodes.shape

We can break down the primekg nodes by type as follows.

In [None]:
# Show node types and their counts
primekg_nodes['node_type'].value_counts()

PrimeKG was built from various sources, which we can see by counting the source prefix of each `node_uid` as follows.

In [None]:
# Show source of the primekg nodes
primekg_nodes['node_uid'].apply(lambda x: x.split(':')[0]).value_counts()
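
As a possible alternative to the `apply` call above, the prefix can also be extracted with pandas' vectorized string methods, which leave missing values as missing instead of raising an error. The sketch below uses a made-up series rather than the real `node_uid` column, and it assumes the `source:id` form implied by the split on `:`.

```python
import pandas as pd

# Toy identifiers in the "source:id" form; the values are made up.
node_uid = pd.Series(["NCBI:1", "NCBI:2", "DrugBank:DB1", None])

# Split once on the first colon and keep the prefix; missing values stay missing.
source = node_uid.str.split(":", n=1).str[0]
source_counts = source.value_counts(dropna=False)
print(source_counts)
```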

#### PrimeKG Edges

`primekg_edges` is a dataframe of edges, which has the following columns:
- `node_source_index`: the index of the source node
- `node_source`: the name of the source node
- `node_source_uid`: the unique identifier of the source node
- `node_target_index`: the index of the target node
- `node_target`: the name of the target node
- `node_target_uid`: the unique identifier of the target node
- `edge_type`: the type of the edge
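
The index columns listed above suggest that each edge endpoint refers to a `node_index` in the nodes dataframe. That link is an assumption based on the column descriptions, not something the loader is known to enforce. A small consistency check on toy data (all values made up) could look like this:

```python
import pandas as pd

# Toy stand-ins for the node and edge dataframes.
nodes = pd.DataFrame({"node_index": [0, 1, 2], "node": ["a", "b", "c"]})
edges = pd.DataFrame({
    "node_source_index": [0, 1, 2],
    "node_target_index": [1, 2, 7],  # 7 is deliberately missing from nodes
    "edge_type": ["x", "y", "z"],
})

# Collect edges whose source or target index has no matching node.
known = set(nodes["node_index"])
dangling = edges[
    ~edges["node_source_index"].isin(known)
    | ~edges["node_target_index"].isin(known)
]
print(dangling)
```

An empty `dangling` dataframe would indicate that every edge endpoint can be resolved to a node.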

We can also inspect the first few rows of `primekg_edges` to see how the nodes are connected.

In [None]:
# Check a sample of the primekg edges
primekg_edges.head()

The current PrimeKG contains about 8.1M edges in total, as the following cell shows.

In [None]:
# Check dimensions of the primekg edges
primekg_edges.shape

We can break down the primekg edges by type as follows.

In [None]:
# Show edge types and their counts
primekg_edges['edge_type'].value_counts()