# Demo - Continued

This notebook is a continuation of the Demo.ipynb notebook. It demos more advanced features:
- Running a pipeline from a config file
- Running entity resolution in a separate process
- Customizing components

In [1]:
import os

from dotenv import load_dotenv
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

In [2]:
load_dotenv()

True

In [3]:
file_path = "./data/Climate change - Wikipedia long.pdf"

In [4]:
driver = neo4j.GraphDatabase.driver(
    os.getenv("NEO4J_URI", "bolt://localhost:7687"),
    auth=(
        os.getenv("NEO4J_USERNAME", "neo4j"),
        os.getenv("NEO4J_PASSWORD", "neo4j")
    )
)

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "response_format": {"type": "json_object"}
    }
)

embedder = OpenAIEmbeddings(
    model="text-embedding-3-small",
)

In [5]:
# driver.execute_query("MATCH (n) DETACH DELETE n")

In [6]:
from neo4j_graphrag.experimental.components.schema import GraphSchema
new_schema = GraphSchema.from_file("refined_schema.json")

## Config file

The `simple_kg_pipeline_config.yaml`:
- Contains all the information to instantiate a SimpleKGPipeline:
    - Classes to be used as LLM/Embedder
    - Init parameters for each component
    - A YAML representation of the schema
- Can read environment variables for some parameters (e.g. the Neo4j connection)

Note: JSON is also supported
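To make the environment-variable idea concrete, here is a minimal standalone sketch of how a config loader could swap marker entries for values from the environment. The `resolver_`/`var_` key names and the `resolve_env` helper are assumptions made for illustration; check the actual `simple_kg_pipeline_config.yaml` for the exact format the library expects.

```python
import os


def resolve_env(value):
    """Recursively replace {"resolver_": "ENV", "var_": NAME} markers
    with the value of the environment variable NAME."""
    if isinstance(value, dict):
        if value.get("resolver_") == "ENV":
            return os.environ[value["var_"]]
        return {key: resolve_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item) for item in value]
    return value


# Hypothetical config fragment, written as the dict a YAML/JSON parser would return.
config = {
    "neo4j_config": {
        "params_": {
            "uri": {"resolver_": "ENV", "var_": "NEO4J_URI"},
            "user": "neo4j",
        }
    }
}

# Provide a fallback so the sketch runs even without a .env file.
os.environ.setdefault("NEO4J_URI", "bolt://localhost:7687")
resolved = resolve_env(config)
print(resolved["neo4j_config"]["params_"]["uri"])
```

Keeping connection details in the environment rather than in the file means the same config can be shared across machines without exposing credentials.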

Let's try this config, which, unlike the other demos, sets `perform_entity_resolution=False`:

In [6]:
from neo4j_graphrag.experimental.pipeline.config.runner import PipelineRunner

pipeline = PipelineRunner.from_config_file("simple_kg_pipeline_config.yaml")
await pipeline.run({
    "file_path": file_path,
    "document_metadata": {
        "source": "Wikipedia",
    }
});

## Entity Resolution

Looking at the graph, we see it's less connected than in the other examples: nodes like "USA" or "CO2" are duplicated (one per chunk), so we've lost the chunk-to-chunk connections that are the whole point of building a KG. Let's run entity resolution now. The default strategy in SimpleKGPipeline merges nodes on an exact match of label + name property. We're going to test another one, based on semantic similarity (computed using spaCy embeddings).
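The difference between the two strategies can be sketched without a database. The snippet below is a toy illustration, not the library's implementation: the entity vectors are made up, the `0.8` threshold is a hypothetical value, and restricting comparisons to nodes with the same label is an assumption of this sketch.

```python
import math
from collections import defaultdict

# Toy entities; names echo the demo, vectors are invented for illustration.
entities = [
    {"id": 1, "label": "GreenhouseGas", "name": "CO2", "vector": [0.9, 0.1, 0.0]},
    {"id": 2, "label": "GreenhouseGas", "name": "CO2", "vector": [0.9, 0.1, 0.0]},
    {"id": 3, "label": "GreenhouseGas", "name": "carbon dioxide", "vector": [0.8, 0.2, 0.1]},
    {"id": 4, "label": "Country", "name": "USA", "vector": [0.0, 0.1, 0.9]},
]


def exact_match_groups(items):
    """Group ids that share the same (label, name); keep groups of 2 or more."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["label"], item["name"])].append(item["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similar_pairs(items, threshold=0.8):
    """Return id pairs with the same label whose vectors are similar enough."""
    pairs = []
    for i, left in enumerate(items):
        for right in items[i + 1:]:
            same_label = left["label"] == right["label"]
            if same_label and cosine(left["vector"], right["vector"]) >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs


print(exact_match_groups(entities))  # [[1, 2]]
print(similar_pairs(entities))       # [(1, 2), (1, 3), (2, 3)]
```

Exact matching only catches the two identical "CO2" nodes, while the similarity-based pass also links the differently named node, which is the behavior we hope to get from a semantic resolver.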

Create extra nodes for the demo:

```
CREATE (:`__Entity__`:GreenhouseGas {name: "carbonic dioxide"})-[:OBSERVED_IN]->(:`__Entity__`:Country {name: "Greenland"})
```


In [7]:
from neo4j_graphrag.experimental.components.resolver import SpaCySemanticMatchResolver

In [9]:
resolver = SpaCySemanticMatchResolver(
    driver=driver,
)
await resolver.run();

By default, all `__Entity__` nodes are considered candidates and can be merged. Sometimes you may want to prevent merging with pre-existing data. You can do this with the `filter_query` parameter, which every `Resolver` accepts. For instance, to resolve only entities coming from Wikipedia articles:

In [28]:
resolver = SpaCySemanticMatchResolver(
    driver=driver,
    filter_query="""
    MATCH (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->(d:Document)
    WHERE d.source = "Wikipedia"
    """
)
await resolver.run();

## Custom components

If you want to use a custom strategy, you can write your own component. See [the examples in the repo](https://github.com/neo4j/neo4j-graphrag-python/tree/main/examples).

Here we're going to update our splitter to split the text per section. To do this, we also need to update the document loader so that it returns markdown:


In [11]:
import pymupdf4llm
from typing import Optional
from pathlib import Path
from neo4j_graphrag.experimental.components.pdf_loader import DataLoader
from neo4j_graphrag.experimental.components.types import PdfDocument, DocumentInfo


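class LoaderToMarkdown(DataLoader):
    """Load a PDF and return its content as markdown instead of plain text."""

    async def run(
        self, filepath: Path, metadata: Optional[dict[str, str]] = None
    ) -> PdfDocument:
        # str() lets the loader accept both str and Path inputs
        text = pymupdf4llm.to_markdown(str(filepath))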
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(
                path=str(filepath),
                metadata=metadata or {},
            ),
        )


In [12]:
my_loader = LoaderToMarkdown()
new_document = await my_loader.run(file_path)

In [13]:
from neo4j_graphrag.experimental.components.types import TextChunks, TextChunk
from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter


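class SectionSplitter(TextSplitter):
    """Split markdown text into one chunk per section."""

    async def run(self, text: str) -> TextChunks:
        # Splitting on '\n#' starts a new chunk at every line that begins with '#'
        # (headings of any level); the first '#' of each heading is consumed by the split.
        return TextChunks(chunks=[
            TextChunk(text=sec.strip(), index=k)
            for k, sec in enumerate(text.split('\n#'))
        ])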


In [14]:
my_splitter = SectionSplitter()
chunks = await my_splitter.run(text=new_document.text)
print(len(chunks.chunks))

13
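A plain `split('\n#')` drops the first `#` of every heading and also splits on any line that merely starts with `#`. If that matters for your data, one possible variant (a sketch, not part of the library; `split_markdown_sections` is a hypothetical helper) is to split with a regular-expression lookahead so that each section keeps its heading marker:

```python
import re


def split_markdown_sections(text: str) -> list[str]:
    """Split markdown into sections, each starting at a heading line."""
    # Zero-width lookahead: split just before any line that starts with
    # 1 to 6 '#' characters followed by a space, keeping the heading intact.
    parts = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    # Drop empty pieces (e.g. when the text starts with a heading).
    return [part.strip() for part in parts if part.strip()]


sample = "Intro\n# Title\nBody\n## Sub\nMore #not a heading\n"
sections = split_markdown_sections(sample)
print(len(sections))  # 3
```

The same logic could be wrapped in a `TextSplitter` subclass by building one `TextChunk` per returned section.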


In [15]:
pipeline = SimpleKGPipeline(
    driver=driver,
    llm=llm,
    embedder=embedder,
    schema=new_schema,
    text_splitter=my_splitter,
    pdf_loader=my_loader,
)
await pipeline.run_async(
    file_path=file_path,
    document_metadata={
        "source": "Wikipedia",
    },
);