# Strict Knowledge Graph Construction using Schema LLM Path Extractor 

In this notebook, we use `SchemaLLMPathExtractor` to extract triples from PubMed abstracts against a predefined schema of entity and relation types, then store and visualize the resulting knowledge graph in Neo4j.

In [None]:
import requests
from xml.etree import ElementTree

from llama_index.core import Document

# Step 1: Search for articles
search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
search_params = {"db": "pubmed", "term": "prostatitis plant medicine", "retmax": 10}

response = requests.get(search_url, params=search_params)
root = ElementTree.fromstring(response.content)
pmids = [id_elem.text for id_elem in root.findall(".//Id")]

# Step 2: Fetch article details
fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
fetch_params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}

response = requests.get(fetch_url, params=fetch_params)
root = ElementTree.fromstring(response.content)
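# Each <AbstractText> element becomes one entry, so a structured abstract yields several.
# Note: .text returns only the text before the first nested tag (e.g. <i>),
# so abstracts containing inline markup are cut short.
abstracts = [abstract_elem.text for abstract_elem in root.findall(".//AbstractText")]

# Step 3: Create Document objects, skipping missing abstracts and those of 100 characters or fewer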
documents = [
    Document(text=abstract)
    for abstract in abstracts
    if abstract is not None and len(abstract) > 100
]
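
The truncated `Document 0` text in the output below is consistent with how `ElementTree` treats inline markup: `.text` stops at the first child tag. The following is a minimal sketch of collecting the full text with `itertext()` instead. The XML fragment is hand-written for illustration and only imitates the shape of an efetch response; `abstract_sections` is a hypothetical helper, not part of the notebook.

```python
from xml.etree import ElementTree

# Hypothetical fragment shaped like an efetch response (not real PubMed data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="BACKGROUND">Extracts of <i>Urtica dioica</i> are widely used.</AbstractText>
      <AbstractText Label="RESULTS">Symptoms improved.</AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>
"""


def abstract_sections(xml_text):
    """Return the full text of every AbstractText element, including nested tags."""
    root = ElementTree.fromstring(xml_text)
    return [
        "".join(elem.itertext()).strip()
        for elem in root.findall(".//AbstractText")
    ]


sections = abstract_sections(SAMPLE_XML)
first = ElementTree.fromstring(SAMPLE_XML).find(".//AbstractText")

print(repr(first.text))  # text before the first nested tag only
print(sections[0])       # full text of the same element
```

Because each `AbstractText` section is returned separately, one article with a structured abstract still produces several entries, which is presumably why ten PMIDs can yield more than ten documents.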

In [None]:
print(f"Number of documents: {len(documents)}")

# Print the documents to verify
for i, doc in enumerate(documents):
    print(f"Document {i}: {doc.text}")
    if i > 4:
        break

Number of documents: 13
Document 0: While conventional medicine has advanced in recent years, there are still concerns about its potential adverse reactions. The ethnopharmacological knowledge established over many centuries and the existence of a variety of metabolites have made medicinal plants, such as the stinging nettle plant, an invaluable resource for treating a wide range of health conditions, considering its minimal adverse effects on human health. The aim of this review is to highlight the therapeutic benefits and biological activities of the edible 
Document 1: Natural products are being developed as possible treatment options due to the rising prevalence of cancer and the harmful side effects of synthetic medications. Arctiin is a naturally occurring lignan found in numerous plants and exhibits different pharmacological activities, along with cancer. To elucidate the anticancer properties and underlying mechanisms of action, a comprehensive search of various electronic data

To start a local Neo4j instance with Docker on Windows, run the following from an Anaconda Prompt:

```
docker run ^
    -p 7474:7474 -p 7687:7687 ^
    -v "%CD%/data:/data" -v "%CD%/plugins:/plugins" ^
    --name neo4j-apoc ^
    -e NEO4J_apoc_export_file_enabled=true ^
    -e NEO4J_apoc_import_file_enabled=true ^
    -e NEO4J_apoc_import_file_use__neo4j__config=true ^
    -e NEO4JLABS_PLUGINS="[\"apoc\"]" ^
    neo4j:latest
```

On macOS or Linux, use:

```
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v "$PWD/data:/data" -v "$PWD/plugins:/plugins" \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```

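Open the Neo4j Browser at http://localhost:7474/browser/. On first login (the default credentials are `neo4j` / `neo4j`) you will be asked to change the password; the next cell assumes it has been set to `password`.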
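
Before connecting from Python, it can help to confirm that the container's ports are reachable. The sketch below uses only the standard library and assumes the port mappings from the `docker run` commands above (7474 for the Browser, 7687 for Bolt); `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


# Ports taken from the docker run commands; adjust if you mapped them differently.
for name, port in [("HTTP (Browser)", 7474), ("Bolt", 7687)]:
    status = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} port {port}: {status}")
```

A reachable port only shows that something is listening; authentication is still checked when the graph store connects.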

In [None]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.graph_stores.neo4j import Neo4jPGStore

username = "neo4j"
password = "password"  # must match the password set in the Neo4j Browser
url = "bolt://localhost:7687"

graph_store = Neo4jPGStore(username=username, password=password, url=url)



In [None]:
import os


def load_env(env_path=".env"):
    """
    Load environment variables from a .env file.
    """
    if not os.path.exists(env_path):
        print(f"Warning: {env_path} file not found.")
        return

    with open(env_path) as file:
        for line in file:
            line = line.strip()
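            # Skip blanks, comments, and lines without "=" (which would break the unpacking below)
            if line and not line.startswith("#") and "=" in line: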
                key, value = line.split("=", 1)
                os.environ[key.strip()] = value.strip()

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

load_env()  # expects OPENAI_API_KEY in .env; the OpenAI clients read it from the environment
llm = OpenAI(model="gpt-4o-mini", temperature=0.0)
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")

In [None]:
from typing import Literal

# Define entities and relations
possible_entities = Literal[
    "DRUG",
    "DISEASE",
    "BIOLOGICAL_PROCESS",
    "MOLECULAR_FUNCTION",
    "CELL_LINE",
    "SIGNALING_PATHWAY",
    "COMPOUND",
    "PLANT",
]
possible_entity_props = ["SYNONYMS", "SOURCE", "TOXICITY"]
possible_relations = Literal[
    "TREATS", "HAS_EFFECT_ON", "INVOLVES", "EXPRESSED_IN", "PART_OF", "CONTAINS"
]
possible_relation_props = ["EFFECT_STRENGTH", "EVIDENCE", "DOSAGE"]

In [None]:
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=False,  # the schema acts as guidance only; set to True to drop triples outside it
    possible_entities=possible_entities,
    possible_entity_props=possible_entity_props,
    possible_relations=possible_relations,
    possible_relation_props=possible_relation_props,
    num_workers=4,
)
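
To make the effect of `strict` more concrete, here is a small, self-contained sketch of what filtering triples against a schema can look like. It is an illustration only, not LlamaIndex's implementation: the `allowed` mapping, the `is_valid` helper, and the candidate triples are all hypothetical, and the labels are a subset of those defined above.

```python
from typing import Literal, get_args

# Subset of the notebook's schema, for illustration.
Entity = Literal["DRUG", "DISEASE", "PLANT", "COMPOUND"]
Relation = Literal["TREATS", "CONTAINS"]

# Hypothetical mapping: which relations each entity type may start.
allowed = {
    "DRUG": {"TREATS"},
    "PLANT": {"TREATS", "CONTAINS"},
    "COMPOUND": {"TREATS"},
}


def is_valid(triple):
    """Check a (head_type, relation, tail_type) triple against the schema."""
    head_type, relation, tail_type = triple
    return (
        head_type in get_args(Entity)
        and tail_type in get_args(Entity)
        and relation in allowed.get(head_type, set())
    )


candidates = [
    ("PLANT", "CONTAINS", "COMPOUND"),  # fits the schema
    ("DISEASE", "TREATS", "DRUG"),      # DISEASE may not start any relation
    ("PLANT", "GROWS_IN", "REGION"),    # relation and tail type are outside the schema
]
kept = [t for t in candidates if is_valid(t)]
print(kept)
```

With strict filtering, only the first triple survives; without it, all three would be kept and the schema would serve as a hint to the LLM. LlamaIndex's documentation describes a `kg_validation_schema` argument on the extractor for this kind of constraint; its exact format should be checked there.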

In [None]:
schema_index = PropertyGraphIndex.from_documents(
    documents[:3],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 3/3 [00:00<00:00, 1016.64it/s]
Extracting paths from text with schema: 100%|██████████| 3/3 [00:08<00:00,  2.74s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.29it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]


In [None]:
!pip show llama_index

Name: llama-index
Version: 0.10.58
Summary: Interface between LLMs and your data
Home-page: https://llamaindex.ai
Author: Jerry Liu
Author-email: jerry@llamaindex.ai
License: MIT
Location: /opt/anaconda3/envs/latest_llama/lib/python3.11/site-packages
Requires: llama-index-agent-openai, llama-index-cli, llama-index-core, llama-index-embeddings-openai, llama-index-indices-managed-llama-cloud, llama-index-legacy, llama-index-llms-openai, llama-index-multi-modal-llms-openai, llama-index-program-openai, llama-index-question-gen-openai, llama-index-readers-file, llama-index-readers-llama-parse
Required-by: 
