# Property Graph Construction with Predefined Schemas


In this notebook, we walk through using Neo4j, Azure OpenAI, and Ollama to build and query a property graph.

Specifically, we will be using the `SchemaLLMPathExtractor`, which lets us specify an exact schema: the possible entity types, the possible relation types, and how they can be connected.

This is useful when you have a specific graph you want to build and want to constrain what the LLM predicts.


In [None]:
%pip install llama-index
%pip install llama-index-llms-ollama
%pip install llama-index-embeddings-huggingface
%pip install llama-index-graph-stores-neo4j
%pip install openai
%pip install llama-index-embeddings-azure-openai
%pip install llama-index-llms-azure-openai

## Load Data

In [None]:
!mkdir -p './azdev/paulgraham'
!curl -o './azdev/paulgraham/paul_graham_essay.txt' 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt'

In [None]:
from llama_index.core import SimpleDirectoryReader

# Read from the same relative directory the essay was downloaded into
documents = SimpleDirectoryReader("./azdev/paulgraham/").load_data()

## Graph Construction

To construct our graph, we are going to use the `SchemaLLMPathExtractor`.

Given a schema for the graph, we can extract entities and relations that follow this schema, rather than letting the LLM choose entity and relation types freely.
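To make the idea of a validation schema concrete, here is a minimal sketch of how an extracted triple could be checked against a mapping from entity types to allowed relations. The `Triple` tuple, the toy `schema`, and the `is_valid_triple` helper are hypothetical names used only for illustration; this is not the library's internal validation logic, which may apply further checks.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """A hypothetical extracted triple: typed subject, relation, typed object."""

    subject_type: str
    relation: str
    object_type: str


# Toy schema: which relations each entity type may take part in
schema: Dict[str, List[str]] = {
    "PERSON": ["WORKED_AT", "WORKED_WITH"],
    "ORGANIZATION": ["PART_OF", "WORKED_WITH"],
}


def is_valid_triple(triple: Triple, schema: Dict[str, List[str]]) -> bool:
    """Accept a triple only if both entity types are known and allow the relation."""
    for entity_type in (triple.subject_type, triple.object_type):
        if triple.relation not in schema.get(entity_type, []):
            return False
    return True


candidates = [
    Triple("PERSON", "WORKED_WITH", "ORGANIZATION"),  # allowed by both types
    Triple("PERSON", "WORKED_AT", "ORGANIZATION"),  # ORGANIZATION does not list WORKED_AT
    Triple("ANIMAL", "WORKED_WITH", "PERSON"),  # unknown entity type
]
kept = [t for t in candidates if is_valid_triple(t, schema)]
print(kept)
```

Only the first candidate survives in this sketch, which is the kind of filtering a strict schema is meant to provide.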

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from typing import Literal
from llama_index.llms.ollama import Ollama
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

# best practice to use upper-case
entities = Literal["PERSON", "PLACE", "ORGANIZATION"]
relations = Literal["HAS", "PART_OF", "WORKED_ON", "WORKED_WITH", "WORKED_AT"]

# define which entities can have which relations
validation_schema = {
    "PERSON": ["HAS", "PART_OF", "WORKED_ON", "WORKED_WITH", "WORKED_AT"],
    "PLACE": ["HAS", "PART_OF", "WORKED_AT"],
    "ORGANIZATION": ["HAS", "PART_OF", "WORKED_WITH"],
}

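kg_extractor = SchemaLLMPathExtractor(
    llm=AzureOpenAI(
        model="gpt-4",
        deployment_name="",  # replace with your chat deployment name
        api_key="",  # replace with your Azure OpenAI API key
        azure_endpoint="",  # replace with your Azure OpenAI endpoint
        api_version="2024-02-01",
    ),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    # if False, values outside the schema are allowed (schema acts as a suggestion)
    strict=True,
)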

To launch Neo4j locally, first ensure you have Docker installed. Then, you can launch the database with the following command:

```bash
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4J_AUTH=neo4j/database4591 \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```

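From here, you can open the db at [http://localhost:7474/](http://localhost:7474/). On this page, you will be asked to sign in. Because the command above sets `NEO4J_AUTH=neo4j/database4591`, sign in with the username `neo4j` and the password `database4591`.

If you start the container without `NEO4J_AUTH`, the default username/password is `neo4j` and `neo4j`, and you will be asked to change the password the first time you log in.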

After this, you are ready to create your first property graph!
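Rather than hard-coding the database password in a notebook cell, one option is to read the connection settings from environment variables. The sketch below assumes the variable names `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_URL`, and the helper `neo4j_settings_from_env` is hypothetical; the dictionary keys mirror the `username`, `password`, and `url` keyword arguments that this notebook passes to `Neo4jPGStore`.

```python
import os
from typing import Dict


def neo4j_settings_from_env() -> Dict[str, str]:
    """Collect Neo4j connection settings from (assumed) environment variable names."""
    return {
        "username": os.environ.get("NEO4J_USERNAME", "neo4j"),
        "password": os.environ.get("NEO4J_PASSWORD", ""),
        "url": os.environ.get("NEO4J_URL", "bolt://localhost:7687"),
    }


settings = neo4j_settings_from_env()
# Print everything except the password to avoid leaking it into notebook output
print({key: value for key, value in settings.items() if key != "password"})
```

The resulting dictionary could then be unpacked into the graph store, for example `Neo4jPGStore(**settings)`.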

In [None]:
from llama_index.graph_stores.neo4j import Neo4jPGStore

graph_store = Neo4jPGStore(
    username="neo4j",
    password="",
    url="bolt://localhost:7687",
)

In [None]:
import os
import asyncio

# Alias the raw OpenAI client so it does not shadow llama_index's AzureOpenAI LLM class
from openai import AzureOpenAI as AzureOpenAIClient
from llama_index.core import PropertyGraphIndex

# Set environment variables (replace with your actual values)
os.environ["AZURE_OPENAI_API_KEY"] = ""
os.environ["AZURE_OPENAI_ENDPOINT"] = ""

# Initialize the Azure OpenAI client
client = AzureOpenAIClient(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

deployment_name = "embedding"  # Replace with your deployment name


# Define a function to get embeddings from Azure OpenAI
def azure_openai_embedding(text):
    # Errors propagate on purpose: returning None here would let failed
    # items be dropped later, misaligning embeddings with their texts
    response = client.embeddings.create(model=deployment_name, input=text)
    return response.data[0].embedding


# Create a custom embedding class to use with PropertyGraphIndex
class AzureOpenAIEmbedding:
    def __init__(self, client):
        self.client = client

    def embed(self, text):
        return azure_openai_embedding(text)

    async def aget_text_embedding_batch(self, texts, **kwargs):
        # Run the blocking API calls in worker threads; gather keeps input order
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(None, azure_openai_embedding, text)
            for text in texts
        ]
        return list(await asyncio.gather(*tasks))


# Initialize the custom embedding model
embed_model = AzureOpenAIEmbedding(client)

# Create the PropertyGraphIndex with the Azure OpenAI embedding model.
# Failures are not swallowed here, so later cells never run without `index`.
index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[kg_extractor],
    embed_model=embed_model,
    property_graph_store=graph_store,
)
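The batch method in the cell above fans blocking embedding calls out to worker threads. Each returned embedding has to line up with its input text, which is why silently dropping failed items is risky. Below is a minimal sketch of the same pattern, using a stand-in `fake_embedding` function (hypothetical, not a real API call) to show that `asyncio.gather` returns results in input order.

```python
import asyncio
from typing import List


def fake_embedding(text: str) -> List[float]:
    """Stand-in for a blocking embedding API call (hypothetical, for illustration)."""
    return [float(len(text)), float(text.count(" "))]


async def embed_batch(texts: List[str]) -> List[List[float]]:
    """Embed texts concurrently in worker threads, preserving input order."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, fake_embedding, text) for text in texts]
    # gather returns results in the order the awaitables were passed in
    return list(await asyncio.gather(*tasks))


texts = ["Paul Graham", "Interleaf", "Y Combinator startup"]
embeddings = asyncio.run(embed_batch(texts))

# One embedding per input text, in the same order
for text, embedding in zip(texts, embeddings):
    print(text, embedding)
```

Inside a notebook, `asyncio.run` relies on `nest_asyncio.apply()` having been called, as is done earlier in this notebook.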


If we inspect the graph created, we can see that it only includes the relations and entity types that we defined!

![local graph](./local_kg.png)

For information on all `kg_extractors`, see [the documentation](../../module_guides/indexing/lpg_index_guide.md#construction).

## Querying

Now that our graph is created, we can query it. 

As is the theme with this notebook, we will be using a lower-level API and constructing all our retrievers ourselves!
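Conceptually, combining several sub-retrievers means running each one and merging their results into a single list without duplicates. The sketch below is a toy illustration of that idea only, not LlamaIndex's implementation: `merge_results` is a hypothetical helper, and the triples and scores are made-up placeholder data.

```python
from typing import Dict, List, Tuple

ScoredText = Tuple[str, float]


def merge_results(*result_lists: List[ScoredText]) -> List[ScoredText]:
    """Merge scored results, keeping the best score per unique text."""
    best: Dict[str, float] = {}
    for results in result_lists:
        for text, score in results:
            if text not in best or score > best[text]:
                best[text] = score
    # Highest-scoring results first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up results from two hypothetical retrievers
synonym_hits = [
    ("Alice -> WORKED_AT -> Acme", 0.9),
    ("Acme -> HAS -> Product", 0.5),
]
vector_hits = [
    ("Acme -> HAS -> Product", 0.8),
    ("Acme -> PART_OF -> Holding", 0.4),
]

merged = merge_results(synonym_hits, vector_hits)
for text, score in merged:
    print(f"{score:.1f}  {text}")
```

The duplicate triple appears once in the merged list, carrying the higher of its two scores.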

In [None]:
from llama_index.core.indices.property_graph import (  
    LLMSynonymRetriever,  
    VectorContextRetriever,  
)  
from llama_index.llms.azure_openai import AzureOpenAI  
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding  
  
# Initialize the Azure OpenAI LLM for synonym retrieval
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="chat",
    api_key="",
    azure_endpoint="",
    api_version="2024-02-01",
)

# Initialize the Azure OpenAI embedding model for context retrieval
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="my-custom-embedding",
    api_key="",
    azure_endpoint="",
    api_version="2024-02-01",
)
  
# Create the LLMSynonymRetriever  
llm_synonym = LLMSynonymRetriever(  
    index.property_graph_store,  
    llm=llm,  
    include_text=False,  
)  
  
# Create the VectorContextRetriever  
vector_context = VectorContextRetriever(  
    index.property_graph_store,  
    embed_model=embed_model,  
    include_text=False,  
)  
  
# Combine the retrievers  
retriever = index.as_retriever(  
    sub_retrievers=[  
        llm_synonym,  
        vector_context,  
    ]  
)  
  
# Retrieve nodes based on the query  
nodes = retriever.retrieve("What happened at Interleaf?")  
for node in nodes:  
    print(node.text)  


We can also create a query engine with similar syntax.

In [None]:
query_engine = index.as_query_engine(
    sub_retrievers=[
        llm_synonym,
        vector_context,
    ],
    llm=Ollama(model="llama3", request_timeout=3600),
)

response = query_engine.query("What happened at Interleaf?")

print(str(response))

For more info on all retrievers, see the [complete guide](../../module_guides/indexing/lpg_index_guide.md#retrieval-and-querying).