# Strict Knowledge Graph Construction using Schema LLM Path Extractor 

In this notebook, we'll use the `SchemaLLMPathExtractor` to extract triples from PubMed abstracts and visualize the resulting knowledge graphs with Neo4j.

If it's your first time constructing a knowledge graph on Neo4j, I recommend first going through the [Customizing property graph index in LlamaIndex tutorial](https://github.com/tomasonjo/blogs/blob/master/llm/llama_index_neo4j_custom_retriever.ipynb).

I will build upon concepts from that tutorial. Feel free to use any Neo4j instance you have access to.

## Get data from PubMed

In [None]:
import requests
from xml.etree import ElementTree

from llama_index.core import Document

# Step 1: Search for articles
search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
search_params = {"db": "pubmed", "term": "prostatitis plant medicine", "retmax": 10}

response = requests.get(search_url, params=search_params)
root = ElementTree.fromstring(response.content)
pmids = [id_elem.text for id_elem in root.findall(".//Id")]

# Step 2: Fetch article details
fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
fetch_params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}

response = requests.get(fetch_url, params=fetch_params)
root = ElementTree.fromstring(response.content)
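# itertext() also collects text nested inside inline tags such as <i> or <sup>
abstracts = [
    "".join(abstract_elem.itertext())
    for abstract_elem in root.findall(".//AbstractText")
]

# Step 3: Create Document objects, keeping only abstracts longer than 100 characters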
documents = [
    Document(text=abstract)
    for abstract in abstracts
    if abstract is not None and len(abstract) > 100
]

In [None]:
print(f"Number of documents retrieved: {len(documents)}")

# Print the documents to verify
for i, doc in enumerate(documents):
    print(f"Document {i}: {doc.text}")
    if i > 4:
        break

Number of documents: 12
Document 0: Androgen deprivation therapy (ADT) is the primary treatment for advanced prostate cancer (PCa). However, prolonged ADT inevitably results in therapy resistance with the emergence of the castration-resistant PCa phenotype (CRPC). Hence, there is an urgent need to explore new treatment options capable of delaying PCa progression. Hispidin (HPD) is a natural polyketide primarily derived from plants and fungi. HPD has been shown to have a diverse pharmacological profile, exhibiting anti-inflammatory, antiviral, cardiovascular and neuro-protective activities. However, there is currently no research regarding its properties in the context of PCa treatment. This research article seeks to evaluate the anti-cancer effect of HPD and determine the underlying molecular basis in both androgen-sensitive PCa and CRPC cells. Cell growth, migration, and invasion assays were performed via the MTS method, a wound healing assay and the transwell method. To investigate i
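
PubMed abstracts sometimes contain inline markup such as `<i>` or `<sup>`. With `ElementTree`, an element's `.text` only holds the text before its first child element, while `itertext()` walks all nested text. The sketch below shows the difference on a made-up XML fragment (the fragment and its wording are hypothetical, not real PubMed output):

```python
from xml.etree import ElementTree

# Hypothetical fragment imitating a PubMed record with inline markup.
SAMPLE_XML = """
<PubmedArticle>
  <Abstract>
    <AbstractText Label="METHODS">Extract of <i>Serenoa repens</i> was tested.</AbstractText>
    <AbstractText><b>Conclusion:</b> more data needed.</AbstractText>
  </Abstract>
</PubmedArticle>
"""

root = ElementTree.fromstring(SAMPLE_XML)
elems = root.findall(".//AbstractText")

# .text stops at the first child element (and is None when the element starts with one).
text_only = [e.text for e in elems]

# itertext() yields the text of the element and all of its descendants in document order.
full_text = ["".join(e.itertext()).strip() for e in elems]

print(text_only)
print(full_text)
```

Note also that a structured abstract has one `AbstractText` element per section, so a single article can yield several documents.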

## Create a local neo4j docker instance

For a local Docker instance on Windows, run the following from an Anaconda Prompt:

```
docker run ^
    -p 7474:7474 -p 7687:7687 ^
    -v "%CD%/data:/data" -v "%CD%/plugins:/plugins" ^
    --name neo4j-apoc ^
    -e NEO4J_apoc_export_file_enabled=true ^
    -e NEO4J_apoc_import_file_enabled=true ^
    -e NEO4J_apoc_import_file_use__neo4j__config=true ^
    -e NEO4JLABS_PLUGINS="[\"apoc\"]" ^
    neo4j:latest
```

On macOS or Linux, use the following instead:

```
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```

Open your instance at http://localhost:7474/browser/. The default username and password are both 'neo4j'.

You will be asked to change the password on first login; to match the code below, change it to 'password'.
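
If you would rather not hard-code the connection details, they could be read from environment variables with the local defaults as a fallback. This is only a sketch: the variable names `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_URL` and the helper `neo4j_settings` are assumptions made for illustration, not something the Neo4j integration requires.

```python
import os
from typing import Dict, Mapping


def neo4j_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect connection settings, falling back to this notebook's local defaults."""
    return {
        # Environment variable names below are hypothetical; pick your own.
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
        "url": env.get("NEO4J_URL", "bolt://localhost:7687"),
    }


# Passing a plain dict makes the behaviour easy to check without touching os.environ.
settings = neo4j_settings({"NEO4J_PASSWORD": "s3cret"})
print(settings["username"], settings["url"])
```

The resulting dictionary uses the same keyword names as the `Neo4jPGStore(username=..., password=..., url=...)` call in this notebook, so it could be unpacked into that constructor.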

In [None]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.graph_stores.neo4j import Neo4jPGStore

username = "neo4j"
password = "password"
url = "bolt://localhost:7687"

graph_store = Neo4jPGStore(username=username, password=password, url=url)



## Setup

Create a `.env` file in the same directory as this notebook with the following content:

```bash
OPENAI_API_KEY="sk-your-openai-api-key"
```
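
Once the key has been loaded into the environment, it can be useful to confirm that it is present without printing the secret itself. The helper below is a hypothetical sketch (`describe_secret` is not part of any library); it also flags a value that still carries the quote characters from the `.env` file, which is a common cause of authentication errors.

```python
import os
from typing import Mapping


def describe_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Report whether a secret is set, revealing only its first three characters."""
    value = env.get(name, "")
    if not value:
        return f"{name}: missing"
    if value[0] in "\"'":
        # The loader kept the quotes from the .env file as part of the value.
        return f"{name}: set, but starts with a quote character"
    return f"{name}: set ({value[:3]}..., {len(value)} chars)"


print(describe_secret("OPENAI_API_KEY"))
```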

In [None]:
import os


def load_env(env_path=".env"):
    """
    Load environment variables from a .env file.
    """
    if not os.path.exists(env_path):
        print(f"Warning: {env_path} file not found.")
        return

    with open(env_path) as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                # Strip surrounding quotes so KEY="value" is stored without them
                os.environ[key.strip()] = value.strip().strip("\"'")

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

load_env()
llm = OpenAI(model="gpt-4o-mini", temperature=0.0)
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")

## Building with strict=False

In [None]:
from typing import Literal

# Define entities and relations
possible_entities = Literal[
    "DRUG",
    "DISEASE",
    "BIOLOGICAL_PROCESS",
    "MOLECULAR_FUNCTION",
    "CELL_LINE",
    "SIGNALING_PATHWAY",
    "COMPOUND",
    "PLANT",
]

possible_relations = Literal[
    "TREATS", "HAS_EFFECT_ON", "INVOLVES", "EXPRESSED_IN", "PART_OF", "CONTAINS"
]
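
The `Literal` types above are ordinary `typing` objects, so the allowed labels can be listed at runtime with `typing.get_args`. A small sketch, using shortened stand-in label sets (`demo_entities` and `demo_relations` are illustrative names chosen so the notebook's own variables are not overwritten):

```python
from typing import Literal, get_args

# Shortened stand-ins for the notebook's label sets.
demo_entities = Literal["DRUG", "DISEASE", "COMPOUND", "PLANT"]
demo_relations = Literal["TREATS", "CONTAINS"]

# get_args returns the literal values as a tuple, in declaration order.
entity_labels = get_args(demo_entities)
relation_labels = get_args(demo_relations)

print(entity_labels)
print(relation_labels)

# Handy for quick membership checks before writing queries against a label.
print("GENE" in entity_labels)
```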

To reduce processing time, we only extract from the first `N` documents. Here `N` is 5, but you can use any number you want.

In [None]:
N = 5

In [None]:
# building with strict=False
kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=False,
    possible_entities=possible_entities,
    possible_relations=possible_relations,
    num_workers=4,
)

schema_index = PropertyGraphIndex.from_documents(
    documents[:N],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 3/3 [00:00<00:00, 1235.68it/s]
Extracting paths from text with schema: 100%|██████████| 3/3 [00:10<00:00,  3.61s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.27it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00,  2.93it/s]


Go check the graph at http://localhost:7474/browser/ and try running the following or similar Cypher query:

```cypher
MATCH (n:COMPOUND)-[r:HAS_EFFECT_ON]->(m:DISEASE) RETURN n, r, m
```
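
To try the same pattern with other label combinations, the query text can be assembled in Python. The helper below is a hypothetical sketch (`build_match_query` is not part of LlamaIndex or Neo4j). Labels are interpolated into the query text, so it validates them first.

```python
import re

# Accept only upper-case identifiers like the labels used in this notebook.
_IDENTIFIER = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def build_match_query(source: str, relation: str, target: str, limit: int = 25) -> str:
    """Build a read-only Cypher pattern query from three schema labels."""
    for name in (source, relation, target):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected label: {name!r}")
    return (
        f"MATCH (n:{source})-[r:{relation}]->(m:{target}) "
        f"RETURN n, r, m LIMIT {int(limit)}"
    )


print(build_match_query("PLANT", "CONTAINS", "COMPOUND"))
```

The returned string can be pasted into the Neo4j browser or passed to `graph_store.structured_query(...)`.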

In [None]:
# after examining the graph, we can delete it
graph_store.structured_query("MATCH (n) DETACH DELETE n")

[]

## Building with strict=True, without properties

In [None]:
from typing import List, Tuple

Triple = Tuple[str, str, str]
kg_validation_schema: List[Triple] = [
    ("DRUG", "TREATS", "DISEASE"),
    ("DRUG", "HAS_EFFECT_ON", "BIOLOGICAL_PROCESS"),
    ("BIOLOGICAL_PROCESS", "INVOLVES", "MOLECULAR_FUNCTION"),
    ("BIOLOGICAL_PROCESS", "EXPRESSED_IN", "CELL_LINE"),
    ("BIOLOGICAL_PROCESS", "PART_OF", "SIGNALING_PATHWAY"),
    ("PLANT", "CONTAINS", "COMPOUND"),
    ("COMPOUND", "HAS_EFFECT_ON", "BIOLOGICAL_PROCESS"),
    ("COMPOUND", "TREATS", "DISEASE"),
    ("DRUG", "PART_OF", "SIGNALING_PATHWAY"),
    ("DISEASE", "INVOLVES", "SIGNALING_PATHWAY"),
]
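
Because the validation schema is written by hand, a typo in a label is easy to miss. The following sketch checks that every triple in a schema only uses declared entity and relation labels. It uses small stand-in definitions (`DemoEntities`, `DemoRelations`, and `schema_problems` are hypothetical names), and it is a sanity check you could run yourself, not something the extractor is known to do for you.

```python
from typing import List, Literal, Tuple, get_args

DemoEntities = Literal["DRUG", "DISEASE", "PLANT", "COMPOUND"]
DemoRelations = Literal["TREATS", "CONTAINS"]


def schema_problems(
    schema: List[Tuple[str, str, str]], entities, relations
) -> List[str]:
    """Return one message per undeclared label found in the schema."""
    allowed_entities = set(get_args(entities))
    allowed_relations = set(get_args(relations))
    problems = []
    for head, relation, tail in schema:
        if head not in allowed_entities:
            problems.append(f"{head!r} is not a declared entity")
        if relation not in allowed_relations:
            problems.append(f"{relation!r} is not a declared relation")
        if tail not in allowed_entities:
            problems.append(f"{tail!r} is not a declared entity")
    return problems


good_schema = [("DRUG", "TREATS", "DISEASE"), ("PLANT", "CONTAINS", "COMPOUND")]
# The last triple contains two deliberate typos.
bad_schema = good_schema + [("COMPOUND", "TREAT", "DISEASES")]

print(schema_problems(good_schema, DemoEntities, DemoRelations))
print(schema_problems(bad_schema, DemoEntities, DemoRelations))
```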

In [None]:
# building with strict=True, without properties
kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=True,
    possible_entities=possible_entities,
    possible_entity_props=None,
    possible_relations=possible_relations,
    possible_relation_props=None,
    kg_validation_schema=kg_validation_schema,
    num_workers=4,
)

schema_index = PropertyGraphIndex.from_documents(
    documents[:N],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 3/3 [00:00<00:00, 1214.10it/s]
Extracting paths from text with schema: 100%|██████████| 3/3 [00:08<00:00,  2.94s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.18it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.45it/s]


Check the graph at http://localhost:7474/browser/.

With `strict=True`, triples that do not match `kg_validation_schema` are discarded, so the graph is more constrained and fewer entities are extracted. The schema fixes not only which relation types may connect which entity types, but also the direction of each relationship.
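
The effect of directionality can be illustrated in plain Python. This is a conceptual sketch only, with made-up entity names and a hypothetical helper `keep_valid`; it is not how `SchemaLLMPathExtractor` is implemented internally.

```python
from typing import List, Set, Tuple

# An extracted triple with typed endpoints: ((name, label), relation, (name, label)).
Node = Tuple[str, str]
TypedTriple = Tuple[Node, str, Node]


def keep_valid(
    triples: List[TypedTriple], schema: Set[Tuple[str, str, str]]
) -> List[TypedTriple]:
    """Keep triples whose (head label, relation, tail label) appears in the schema."""
    return [t for t in triples if (t[0][1], t[1], t[2][1]) in schema]


demo_schema = {("PLANT", "CONTAINS", "COMPOUND"), ("COMPOUND", "TREATS", "DISEASE")}
candidates: List[TypedTriple] = [
    (("plant A", "PLANT"), "CONTAINS", ("compound B", "COMPOUND")),    # allowed
    (("compound B", "COMPOUND"), "CONTAINS", ("plant A", "PLANT")),    # reversed direction
    (("compound B", "COMPOUND"), "TREATS", ("disease C", "DISEASE")),  # allowed
    (("plant A", "PLANT"), "TREATS", ("disease C", "DISEASE")),        # pair not in schema
]

kept = keep_valid(candidates, demo_schema)
for head, relation, tail in kept:
    print(f"{head[0]} -[{relation}]-> {tail[0]}")
```

Only two of the four candidates survive: the reversed `CONTAINS` triple and the `PLANT`-`TREATS`-`DISEASE` triple are both absent from the schema.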

In [None]:
# after examining the graph, we can delete it
graph_store.structured_query("MATCH (n) DETACH DELETE n")

[]

## Building with strict=True, with properties

In [None]:
# Define entity properties with descriptions
possible_entity_props = [
    ("SYNONYMS", "Other names for the entity"),
    ("SOURCE", "The origin of the entity"),
    ("TOXICITY", "Information on the toxicity of the entity"),
]

# Define relations and their properties with descriptions
possible_relation_props = [
    ("EFFECT_STRENGTH", "The strength of the effect (e.g., potent, moderate, weak)"),
    (
        "EVIDENCE",
        "The type of evidence supporting the relation (e.g., preclinical, clinical, in vitro)",
    ),
    ("DOSAGE", "The dosage required to achieve the effect"),
]

In [None]:
# building with strict=True, with properties
kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=True,
    possible_entities=possible_entities,
    possible_entity_props=possible_entity_props,
    possible_relations=possible_relations,
    possible_relation_props=possible_relation_props,
    kg_validation_schema=kg_validation_schema,
    num_workers=4,
)

schema_index = PropertyGraphIndex.from_documents(
    documents[:N],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 5/5 [00:00<00:00, 1267.77it/s]
Extracting paths from text with schema: 100%|██████████| 5/5 [01:12<00:00, 14.50s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.05it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  3.71it/s]
