<a href="https://colab.research.google.com/github/tomasonjo/ESRA/blob/main/llm/llama_index_neo4j_custom_retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Defining a Custom Property Graph Retriever



This guide shows you how to define a custom retriever against a property graph.

It is more involved than using our out-of-the-box graph retrievers, but allows you to have granular control over the retrieval process so that it's better tailored for your application.

We show you how to define an advanced retrieval flow by working directly with the property graph store. The retriever first uses an LLM to extract named entities from the question and then uses them to look up context in the graph.

In [37]:
!pip install --quiet llama-index llama-index-graph-stores-neo4j llama-index-program-openai llama-index-llms-openai


## Setup and Build the Property Graph

In [38]:
import nest_asyncio

nest_asyncio.apply()

In [39]:
import os

os.environ["OPENAI_API_KEY"] = "sk-"

#### Load Paul Graham Essay

In [4]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-03 22:57:25--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-06-03 22:57:26 (32.0 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [5]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

#### Define Default LLMs

In [6]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0.3)
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")

In [76]:
from llama_index.graph_stores.neo4j import Neo4jPGStore

# Connection details for the Neo4j instance; replace with your own
username = "neo4j"
password = "capture-debit-blanket"
url = "bolt://44.202.206.163:7687"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
)

#### Build the Property Graph

In [8]:
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting paths from text: 100%|██████████| 22/22 [00:15<00:00,  1.39it/s]
Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 5404.71it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.33it/s]
Generating embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.53it/s]


## Entity Disambiguation

In [86]:
similarity_threshold = 0.9

# Preview candidate duplicates: for each entity, fetch its 5 nearest neighbours
# from the vector index and keep those scoring above the threshold.
data = graph_store.structured_query(
    """
MATCH (e:__Entity__)
CALL db.index.vector.queryNodes('vector', 5, e.embedding)
YIELD node, score
WITH e, node, score
WHERE score > toFloat($cutoff) AND node <> e
WITH e, collect(node) AS nodes
RETURN [e.name] + [n IN nodes | n.name] AS duplicates LIMIT 5
""",
    param_map={"cutoff": similarity_threshold},
)
print(data)

[{'duplicates': ['Lisp', 'Lisp hacker']}, {'duplicates': ['New dialect of lisp', 'New lisp']}, {'duplicates': ['Lisp hacker', 'Lisp']}, {'duplicates': ['70 stores at end of 1996', '500 stores at end of 1997']}, {'duplicates': ['500 stores at end of 1997', '70 stores at end of 1996']}]


In [91]:
# Merge near-duplicate entities. The `id(e) < id(node)` filter excludes the
# entity itself and makes sure each pair is considered only once.
graph_store.structured_query(
    """
MATCH (e:__Entity__)
CALL db.index.vector.queryNodes('vector', 5, e.embedding)
YIELD node, score
WITH e, node, score
WHERE score > toFloat($cutoff) AND id(e) < id(node)
WITH e, collect(node) AS nodes
CALL apoc.refactor.mergeNodes([e] + nodes)
YIELD node
RETURN count(*)
""",
    param_map={"cutoff": similarity_threshold},
)

[{'count(*)': 0}]
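
The two queries above first preview and then merge entities whose embeddings are near-duplicates. The following is a minimal, database-free sketch of that pairing rule in plain Python. The entity names, the toy vectors, and the helper `find_duplicate_pairs` are hypothetical and exist only for illustration. Plain cosine similarity is assumed as the score, so the numbers are not guaranteed to match what the Neo4j vector index returns.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(
    embeddings: Dict[str, List[float]], cutoff: float
) -> List[Tuple[str, str]]:
    """Return each pair of names whose similarity exceeds the cutoff.

    Iterating only over later names mirrors the `id(e) < id(node)` filter:
    a pair is reported once, and an entity is never paired with itself.
    """
    names = list(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1 :]:
            score = cosine_similarity(embeddings[first], embeddings[second])
            if score > cutoff:
                pairs.append((first, second))
    return pairs


# Hypothetical three-dimensional embeddings, chosen by hand for the example
toy_embeddings = {
    "Lisp": [1.0, 0.0, 0.1],
    "Lisp hacker": [0.9, 0.1, 0.1],
    "Viaweb": [0.0, 1.0, 0.0],
}

duplicate_pairs = find_duplicate_pairs(toy_embeddings, cutoff=0.9)
print(duplicate_pairs)
```

Without the ordering constraint, every match shows up twice, once from each side, which is what the preview output above shows for `Lisp` and `Lisp hacker`.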

## Retrieval

In [33]:
from typing import Any, List, Optional

from pydantic import BaseModel

from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.retrievers import CustomPGRetriever, VectorContextRetriever
from llama_index.core.vector_stores.types import VectorStore
from llama_index.program.openai import OpenAIPydanticProgram


class Entities(BaseModel):
    """List of named entities in the text such as names of people, organizations, concepts, and locations"""

    names: Optional[List[str]]


prompt_template_entities = """
Extract all named entities such as names of people, organizations, concepts, and locations
from the following text:
{text}
"""


class MyCustomRetriever(CustomPGRetriever):
    """Custom retriever that extracts entities from the query before retrieval."""

    def init(
        self,
        ## vector context retriever params
        embed_model: Optional[BaseEmbedding] = None,
        vector_store: Optional[VectorStore] = None,
        similarity_top_k: int = 4,
        path_depth: int = 1,
        **kwargs: Any,
    ) -> None:
        """Uses any kwargs passed in from class constructor."""
        # Create a fulltext index over entity names (a no-op if it already exists)
        self.graph_store.structured_query(
            """CREATE FULLTEXT INDEX entities IF NOT EXISTS FOR (e:`__Entity__`) ON EACH [e.name];"""
        )
        # LLM program that extracts named entities from the query text
        self.entity_extraction = OpenAIPydanticProgram.from_defaults(
            output_cls=Entities, prompt_template_str=prompt_template_entities
        )
        # Assumption: graph context is fetched with a VectorContextRetriever;
        # swap in another lookup (for example the fulltext index) as needed.
        self.vector_retriever = VectorContextRetriever(
            self.graph_store,
            include_text=self.include_text,
            embed_model=embed_model,
            vector_store=vector_store,
            similarity_top_k=similarity_top_k,
            path_depth=path_depth,
        )

    def custom_retrieve(self, query_str: str) -> str:
        """Define custom retrieval based on entities extracted from the query.

        Could return `str`, `TextNode`, `NodeWithScore`, or a list of those.
        """
        entities = self.entity_extraction(text=query_str).names
        print(f"Detected entities: {entities}")
        nodes = []
        if entities:
            # Retrieve graph context for every entity mentioned in the query
            for entity in entities:
                nodes.extend(self.vector_retriever.retrieve(entity))
        else:
            # No entities detected: fall back to the full query string
            nodes.extend(self.vector_retriever.retrieve(query_str))

        final_text = "\n\n".join(
            [n.get_content(metadata_mode="llm") for n in nodes]
        )

        return final_text
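
The retriever creates a fulltext index named `entities` and extracts entity names from the question, so one natural next step is to look those names up in that index. The sketch below shows one way this could be done, not the notebook's actual method: it turns an extracted name into a fuzzy Lucene query string. The helper `build_fulltext_query`, the constant names, the edit distance of 2, and the `limit` of 2 are all assumptions made for illustration.

```python
import re
from typing import List

# Characters that have special meaning in Lucene query syntax
LUCENE_SPECIAL_CHARS = r'[+\-&|!(){}\[\]^"~*?:\\/]'

# Cypher for querying the `entities` fulltext index created by the retriever.
# Assumed usage (not executed here):
#   graph_store.structured_query(FULLTEXT_LOOKUP, param_map={"query": q})
FULLTEXT_LOOKUP = """
CALL db.index.fulltext.queryNodes('entities', $query, {limit: 2})
YIELD node, score
RETURN node.name AS name, score
"""


def build_fulltext_query(entity: str, max_edits: int = 2) -> str:
    """Build a Lucene query that tolerates small misspellings.

    Special characters are replaced with spaces, then every remaining word
    gets a fuzzy operator (`~max_edits`) and the words are joined with AND.
    Returns an empty string if nothing searchable is left; callers should
    skip the lookup in that case.
    """
    cleaned = re.sub(LUCENE_SPECIAL_CHARS, " ", entity)
    words: List[str] = cleaned.split()
    return " AND ".join(f"{word}~{max_edits}" for word in words)


example_queries = [
    build_fulltext_query(name) for name in ["Paul Graham", "Viaweb (startup)"]
]
print(example_queries)
```

Fuzzy matching is a design choice: names extracted by an LLM often differ slightly from the names stored in the graph, and an exact match would miss them.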

## Test out the Custom Retriever

Now let's initialize and test out the custom retriever against our data!

To build a full RAG pipeline, we use the `RetrieverQueryEngine` to combine our retriever with the LLM synthesis module. The property graph index also uses it under the hood.

In [34]:
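custom_sub_retriever = MyCustomRetriever(
    index.property_graph_store,
    include_text=True,
    vector_store=index.vector_store,
    # Pass the embedding model that built the index so query vectors are comparable
    embed_model=embed_model,
)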

In [35]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(sub_retrievers=[custom_sub_retriever]), llm=llm
)

### Try out some Queries

In [36]:
response = query_engine.query("Did the author like programming?")
print(str(response))
