In this notebook, we use LLMs in combination with Neo4j, a graph database, to perform ***Retrieval Augmented Generation (RAG)***.

### ***Why use RAG?***
If you want an LLM to answer questions from your own content or knowledge base, you do not have to pass a large context with every prompt. Instead, you can fetch the relevant information from a database and use it to generate the response.

This allows you to:

 - Reduce hallucinations
 - Provide relevant, up-to-date information to your users
 - Leverage your own content/knowledge base
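
The retrieve-then-generate flow described above can be sketched without any external service. The snippet below is a minimal illustration, not the pipeline used later in this notebook: the `KNOWLEDGE_BASE` entries, the word-overlap scoring, and the function names `tokenize`, `retrieve`, and `build_prompt` are all assumptions made for the example.

```python
import re

# Toy knowledge base (hypothetical entries, loosely modeled on the profile data used later).
KNOWLEDGE_BASE = [
    "Sarah Johnson has skills in Machine Learning, Data Analytics, Azure and Python.",
    "David Patel has skills in AWS, Cloud Computing, DevOps and Data Warehousing.",
    "Emily Turner works on the GammaTech Smart Logistics Platform on Azure.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Put only the retrieved context, not the whole knowledge base, in the prompt."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What skills does David Patel have?", KNOWLEDGE_BASE)
print(prompt)
```

A real system would replace the word-overlap score with embedding similarity or a graph query, but the shape stays the same: retrieve first, then prompt with only what was retrieved.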

### ***Why use a graph database?***
If the relationships between your data points matter and you want to query them, a graph database may be a better fit than a traditional relational database.

Graph databases are well suited to:

 - Navigating deep hierarchies
 - Finding hidden connections between items
 - Discovering relationships between items
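
To make "finding hidden connections" concrete, here is a small sketch in plain Python. It only illustrates the kind of traversal a graph database performs natively; the toy `EDGES` data and the helpers `build_adjacency` and `shortest_path` are assumptions for this example.

```python
from collections import deque

# Hypothetical undirected graph: people linked to skills and projects.
EDGES = [
    ("Sarah Johnson", "Azure"),
    ("Emily Turner", "Azure"),
    ("Emily Turner", "GammaTech Smart Logistics Platform"),
    ("Jason Mitchell", "GammaTech Smart Logistics Platform"),
]


def build_adjacency(edges):
    """Turn an edge list into an adjacency mapping (both directions)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph_index = build_adjacency(EDGES)
print(shortest_path(graph_index, "Sarah Johnson", "Jason Mitchell"))
```

Here two people who share no project turn out to be connected through a common skill and a colleague. In a relational schema this kind of question usually requires a chain of joins whose depth is fixed in advance.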

### Use cases

Graph databases are particularly relevant for finding relationships or analysing correlations between data points, for example in a recommendation chatbot or an analysis tool.

### LLM Graph Generation

In [18]:
# Import the libraries and modules needed for data preparation, embedding, and graph querying
from langchain.docstore.document import Document

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from neo4j import GraphDatabase
import openai
import json
from dotenv import load_dotenv

In [19]:
# Load environment variables
load_dotenv()


True

In [21]:
# Read the OpenAI API key and Neo4j credentials from environment variables so they stay out of the notebook

import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Next, we need to define Neo4j credentials

NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_CONNECTION_URL = os.getenv("NEO4J_CONNECTION_URL")
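
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` would only surface later as a connection error. A fail-fast check is one way to catch this early. The helper below is a hypothetical sketch and not part of the original notebook; `require_env` and the example values are assumptions.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each name; raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with an in-memory mapping instead of the real environment.
example_env = {"NEO4J_USER": "neo4j", "NEO4J_PASSWORD": "example-password"}
print(require_env(["NEO4J_USER", "NEO4J_PASSWORD"], environ=example_env))
```

Called without the `environ` argument, the helper reads from the real process environment.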

In [5]:
# Neo4j configuration & constraints

graph = Neo4jGraph(NEO4J_CONNECTION_URL, NEO4J_USER, NEO4J_PASSWORD)

In [22]:
# Data ingestion: load each source file into LangChain documents
# (TextLoader was already imported in the first cell)
loader = TextLoader("Data/Profiles.txt")
profile_documents = loader.load()

In [23]:
loader=TextLoader("Data/Projects.txt")
project_documents=loader.load()

In [24]:
loader = TextLoader("Data/slackmsg.txt")
slackmsg = loader.load()

# Combine all sources into a single list of documents
documents = profile_documents + project_documents + slackmsg

In [25]:
documents[:3]

[Document(page_content='Full Name: Sarah Johnson \nSkills: Machine Learning, Data Analytics, Azure, Python \nProjects: BetaHealth Secure Healthcare Data Analytics Platform on Azure\n\nFull Name: David Patel \nSkills: AWS, Cloud Computing, DevOps, Data Warehousing \nProjects:\n\nFull Name: Amanda Rodriguez \nSkills: Data Security, Compliance, Healthcare Regulations, Azure \nProjects: BetaHealth Secure Healthcare Data Analytics Platform on Azure\n\nFull Name: Jason Mitchell \nSkills: Data Analytics, Machine Learning, Azure, Data Warehousing \nProjects: GammaTech Smart Logistics Platform on Azure\n\nFull Name: Emily Turner \nSkills: IoT, Real-time Data Management, Azure, Python \nProjects: GammaTech Smart Logistics Platform on Azure\n\nFull Name: Michael Clark \nSkills: Data Engineering, Data Warehousing, AWS, Python \nProjects: AlphaCorp AWS-Powered Supply Chain Optimization Platform\n\nFull Name: Jessica White \nSkills: Data Privacy, Security Compliance, Azure Key Vault, Healthcare Regu

The uploaded files are split into smaller chunks using TokenTextSplitter from LangChain. Each chunk contains at most 512 tokens, and consecutive chunks overlap by 24 tokens. This keeps chunks within the input limits of embedding models while preserving continuity of context across chunk boundaries.
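
How chunk size and overlap interact can be illustrated with a simplified sliding window. This sketch splits on whitespace rather than using a real tokenizer, so it only approximates what `TokenTextSplitter` does; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens, where each
    window repeats the last chunk_overlap tokens of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Whitespace "tokens" stand in for real model tokens (an assumption of this sketch).
words = "Full Name: Sarah Johnson Skills: Machine Learning, Data Analytics, Azure, Python".split()
for chunk in sliding_chunks(words, chunk_size=5, chunk_overlap=2):
    print(chunk)
```

Each printed window starts with the last two tokens of the previous one, which is the same idea as the 24-token overlap used below, only at a much smaller scale.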

In [26]:
from langchain.text_splitter import TokenTextSplitter
text_splitter=TokenTextSplitter(chunk_size=512,chunk_overlap=24)
docs=text_splitter.split_documents(documents)
docs[:5]

[Document(page_content='Full Name: Sarah Johnson \nSkills: Machine Learning, Data Analytics, Azure, Python \nProjects: BetaHealth Secure Healthcare Data Analytics Platform on Azure\n\nFull Name: David Patel \nSkills: AWS, Cloud Computing, DevOps, Data Warehousing \nProjects:\n\nFull Name: Amanda Rodriguez \nSkills: Data Security, Compliance, Healthcare Regulations, Azure \nProjects: BetaHealth Secure Healthcare Data Analytics Platform on Azure\n\nFull Name: Jason Mitchell \nSkills: Data Analytics, Machine Learning, Azure, Data Warehousing \nProjects: GammaTech Smart Logistics Platform on Azure\n\nFull Name: Emily Turner \nSkills: IoT, Real-time Data Management, Azure, Python \nProjects: GammaTech Smart Logistics Platform on Azure\n\nFull Name: Michael Clark \nSkills: Data Engineering, Data Warehousing, AWS, Python \nProjects: AlphaCorp AWS-Powered Supply Chain Optimization Platform\n\nFull Name: Jessica White \nSkills: Data Privacy, Security Compliance, Azure Key Vault, Healthcare Regu

The chunks are then stored in Neo4j as nodes of a vector index. Neo4jVector embeds each chunk with OpenAI embeddings and builds the index inside the graph database.

In [27]:
# Instantiate Neo4j vector from documents
neo4j_vector = Neo4jVector.from_documents(
    docs,
    OpenAIEmbeddings(),
    url=os.environ["NEO4J_CONNECTION_URL"],
    username=os.environ["NEO4J_USER"],
    password=os.environ["NEO4J_PASSWORD"]
)
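
Once the chunks are stored with their embeddings, retrieval amounts to ranking them by similarity to the query embedding. The sketch below illustrates that idea with hand-made three-dimensional vectors. Real embeddings come from `OpenAIEmbeddings` and have far more dimensions; the names `cosine_similarity`, `top_k`, and `toy_index`, as well as the vector values, are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in indexed_chunks
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical chunk embeddings; the axes loosely mean (cloud, healthcare, logistics).
toy_index = [
    ("BetaHealth Secure Healthcare Data Analytics Platform on Azure", [0.6, 0.8, 0.0]),
    ("GammaTech Smart Logistics Platform on Azure", [0.6, 0.0, 0.8]),
    ("AlphaCorp AWS-Powered Supply Chain Optimization Platform", [0.8, 0.0, 0.6]),
]

for text, score in top_k([0.0, 1.0, 0.0], toy_index, k=1):
    print(f"{score:.2f}  {text}")
```

With the real index, `neo4j_vector.similarity_search(query, k=...)` plays the role of `top_k`, with the query embedded by the same embedding model.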

### Building the Knowledge Graph

In [28]:
# Initialize the chat model used for graph extraction and question answering

from langchain_openai import ChatOpenAI
llm=ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo")

In [29]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
llm_transformer = LLMGraphTransformer(llm=llm)

In [30]:
graph_documents= llm_transformer.convert_to_graph_documents(docs)


In [31]:
graph_documents

[GraphDocument(nodes=[Node(id='Sarah Johnson', type='Person'), Node(id='David Patel', type='Person'), Node(id='Amanda Rodriguez', type='Person'), Node(id='Jason Mitchell', type='Person'), Node(id='Emily Turner', type='Person'), Node(id='Michael Clark', type='Person'), Node(id='Jessica White', type='Person'), Node(id='Daniel Brown', type='Person'), Node(id='Olivia Martinez', type='Person'), Node(id='William Lee', type='Person'), Node(id='Ella Smith', type='Person'), Node(id='Lucas Taylor', type='Person'), Node(id='Liam Thompson', type='Person')], relationships=[Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Machine Learning', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Data Analytics', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Azure', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Perso

In [33]:
graph.add_graph_documents(graph_documents)
   

In [34]:
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Nodes:[Node(id='Sarah Johnson', type='Person'), Node(id='David Patel', type='Person'), Node(id='Amanda Rodriguez', type='Person'), Node(id='Jason Mitchell', type='Person'), Node(id='Emily Turner', type='Person'), Node(id='Michael Clark', type='Person'), Node(id='Jessica White', type='Person'), Node(id='Daniel Brown', type='Person'), Node(id='Olivia Martinez', type='Person'), Node(id='William Lee', type='Person'), Node(id='Ella Smith', type='Person'), Node(id='Lucas Taylor', type='Person'), Node(id='Liam Thompson', type='Person')]
Relationships:[Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Machine Learning', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Data Analytics', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Azure', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node

In [35]:
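# Restrict extraction to a fixed set of node and relationship types
llm_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["PERSON", "SKILLS", "PROJECTS"],
    allowed_relationships=["PART_OF", "HAS_PEOPLE", "HAS", "IS_A"],
)
graph_documents_filtered = llm_transformer.convert_to_graph_documents(
    documents
)
# Caution: these prints read graph_documents (the unfiltered result from the
# earlier cell), so the output below repeats that cell's output. Print
# graph_documents_filtered[0] instead to inspect the filtered extraction.
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")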

Nodes:[Node(id='Sarah Johnson', type='Person'), Node(id='David Patel', type='Person'), Node(id='Amanda Rodriguez', type='Person'), Node(id='Jason Mitchell', type='Person'), Node(id='Emily Turner', type='Person'), Node(id='Michael Clark', type='Person'), Node(id='Jessica White', type='Person'), Node(id='Daniel Brown', type='Person'), Node(id='Olivia Martinez', type='Person'), Node(id='William Lee', type='Person'), Node(id='Ella Smith', type='Person'), Node(id='Lucas Taylor', type='Person'), Node(id='Liam Thompson', type='Person')]
Relationships:[Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Machine Learning', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Data Analytics', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Azure', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node

In [36]:
graph = Neo4jGraph(NEO4J_CONNECTION_URL, NEO4J_USER, NEO4J_PASSWORD)

GraphCypherQAChain abstracts away the details and returns a natural language response to a natural language question (NLQ). Internally, it uses an LLM to generate a Cypher query from the NLQ, runs that query against the graph database, and then uses an LLM again to turn the query results into the final natural language response.
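
The three internal steps can be sketched with stand-in functions. Nothing below calls LangChain, an LLM, or Neo4j: `fake_generate_cypher`, `fake_run_query`, `fake_generate_answer`, and `answer_question` are hypothetical stubs that return canned values, used only to show how data flows through such a chain.

```python
def fake_generate_cypher(question, schema):
    """Stand-in for the first LLM call: question + schema -> Cypher text."""
    return (
        "MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) "
        "RETURN p.id AS person, COUNT(DISTINCT s) AS skills "
        "ORDER BY skills DESC LIMIT 1"
    )


def fake_run_query(cypher):
    """Stand-in for the database call: Cypher text -> list of result rows."""
    return [{"person": "Sarah Johnson", "skills": 4}]  # canned row, not real data


def fake_generate_answer(question, rows):
    """Stand-in for the second LLM call: question + rows -> natural language."""
    if not rows:
        return "I don't know the answer."
    row = rows[0]
    return f"{row['person']} has the most skills ({row['skills']})."


def answer_question(question, schema):
    """Chain the three steps and keep the intermediate values for inspection."""
    cypher = fake_generate_cypher(question, schema)
    rows = fake_run_query(cypher)
    return {
        "query": question,
        "cypher": cypher,
        "context": rows,
        "result": fake_generate_answer(question, rows),
    }


response = answer_question(
    "Which person has the most skills?",
    schema="(:Person)-[:HAS_SKILL]->(:Skill)",
)
print(response["result"])
```

Keeping the generated Cypher and the raw rows next to the final answer makes it easier to tell whether a wrong answer comes from the query or from the answer-generation step.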

In [37]:
chain = GraphCypherQAChain.from_llm(
    graph=graph, llm=llm, verbose=True, validate_cypher=True
)

In [38]:
chain.invoke({"query": "Which person uses the largest number of different technologies?"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person)-[:HAS_SKILL]->(t:Technology)
WITH p, COUNT(DISTINCT t) AS numTech
RETURN p
ORDER BY numTech DESC
LIMIT 1
Full Context:
[{'p': {'id': 'Sarah Johnson'}}]

> Finished chain.


{'query': 'Which person uses the largest number of different technologies?',
 'result': 'Sarah Johnson uses the largest number of different technologies.'}

In [52]:
chain.invoke({"query": " Which person has more project?"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person)-[:WORKS_ON]->(pr:Project)
WITH p, COUNT(pr) AS numProjects
RETURN p.id, numProjects
ORDER BY numProjects DESC
LIMIT 1
Full Context:
[{'p.id': 'Isabella Harris', 'numProjects': 1}]

> Finished chain.


{'query': ' Which person has more project?',
 'result': 'Isabella Harris has 1 project.'}

In [39]:
chain.invoke({"query": " List all the projects??"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Project)
RETURN p;
Full Context:
[{'p': {'id': 'Alphacorp Aws-Powered Supply Chain Optimization Platform'}}, {'p': {'id': 'Betahealth Secure Healthcare Data Analytics Platform On Azure'}}, {'p': {'id': 'Gammatech Smart Logistics Platform On Azure'}}, {'p': {'id': 'Betahealth Telemedicine Platform On Microsoft Azure'}}, {'p': {'id': 'Alphacorp Aws-Powered Sales Analytics Dashboard'}}, {'p': {'id': 'Deltaedu Virtual Classroom Platform On Aws'}}, {'p': {'id': 'Gammatech Iot-Driven Manufacturing Monitoring System On Azure'}}, {'p': {'id': 'Epsilonfinance Mobile-First Digital Wallet On Google Cloud'}}, {'p': {'id': 'Ai-Powered Student Performance Analytics On Aws'}}, {'p': {'id': 'Timeline & Customer Experience'}}]

> Finished chain.


{'query': ' List all the projects??', 'result': "I don't know the answer."}

In [40]:
graph_documents

[GraphDocument(nodes=[Node(id='Sarah Johnson', type='Person'), Node(id='David Patel', type='Person'), Node(id='Amanda Rodriguez', type='Person'), Node(id='Jason Mitchell', type='Person'), Node(id='Emily Turner', type='Person'), Node(id='Michael Clark', type='Person'), Node(id='Jessica White', type='Person'), Node(id='Daniel Brown', type='Person'), Node(id='Olivia Martinez', type='Person'), Node(id='William Lee', type='Person'), Node(id='Ella Smith', type='Person'), Node(id='Lucas Taylor', type='Person'), Node(id='Liam Thompson', type='Person')], relationships=[Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Machine Learning', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Data Analytics', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Person'), target=Node(id='Azure', type='Skill'), type='HAS_SKILL'), Relationship(source=Node(id='Sarah Johnson', type='Perso

In [168]:
# Load environment variables
load_dotenv()

import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Next, we need to define Neo4j credentials

NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_CONNECTION_URL = os.getenv("NEO4J_CONNECTION_URL")
graph = Neo4jGraph(
    url=os.environ["NEO4J_CONNECTION_URL"], username=os.environ["NEO4J_USER"], password=os.environ["NEO4J_PASSWORD"]
)

In [76]:
# # Prompt for processing project
# project_template = """
# From the Project Brief below, extract the following Entities & relationships described in the mentioned format 
# 0. ALWAYS FINISH THE OUTPUT. Never send partial responses
# 1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
#    `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. Document must be summarized and stored inside Project entity under `summary` property. You will have to generate as many entities as needed as per the types below:
#     nodes Types:
#     label:'Project',id:string,name:string;summary:string //Project mentioned in the brief; `id` property is the full name of the project, in lowercase, with no capital letters, special characters, spaces or hyphens; Contents of original document must be summarized inside 'summary' property
#     label:'Technology',id:string,name:string //Technology Entity; `id` property is the name of the technology, in camel-case. Identify as many of the technologies used as possible
#     label:'Client',id:string,name:string;industry:string //Client that the project was done for; `id` property is the name of the Client, in camel-case; 'industry' is the industry that the client operates in, as mentioned in the project brief.
    
# 2. Next generate each relationships as triples of head, relationship and tail. To refer the head and tail entity, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated. They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
#     Relationship types:
#     project|USES_TECH|technology 
#     project|HAS_CLIENT|client


# 3. The output should look like :

#     "nodes": ["label":"Project","id":string,"name":string,"summary":string],
#     "relationships": ["projectid|USES_TECH|technologyid"]

# Schema:
# {schema}

# QUESTION: {question}

# YOUR ANSWER:""" 


# # Prompt for processing profiles
# profile_template = """From the list of people below, extract the following Entities & relationships described in the mentioned format 
# 0. ALWAYS FINISH THE OUTPUT. Never send partial responses
# 1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
#    `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
#     nodes Entity Types:
#     label:'Person',id:string,name:string //Person that the data is about. `id` property is the name of the person, in camel-case. 'name' is the person's name, as spelled in the text.
#     label:'Project',id:string,name:string;summary:string //Project mentioned in the profile; `id` property is the full lowercase name of the project, with no capital letters, special characters, spaces or hyphens.
#     label:'Technology',id:string,name:string //Technology Entity, as listed in the "skills"-section of every person; `id` property is the name of the technology, in camel-case.
    
# 2. Next generate each relationships as triples of head, relationship and tail. To refer the head and tail entity, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated. They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
#     Relationship types:
#     person|HAS_SKILLS|technology 
#     project|HAS_PEOPLE|person


# The output should look like :

#     "nodes": ["label":"Person","id":string,"name":string],
#     "relationships": ["projectid|HAS_PEOPLE|personid"]

# # Schema:
# {schema}



# QUESTION: {question}

# YOUR ANSWER:""" 


# # Prompt for processing slack messages

# # slackmsg_template = """
# From the list of messages below, extract the following Entities & relationships described in the mentioned format 
# 0. ALWAYS FINISH THE OUTPUT. Never send partial responses
# 1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
#    `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
#     nodes Types:
#     label:'Person',id:string,name:string //Person that sent the message. `id` property is the name of the person, in camel-case; for example, "michaelClark", or "emmaMartinez"; 'name' is the person's name, as spelled in the text.
#     label:'SlackMessage',id:string,text:string //The Slack-Message that was sent; 'id' property should be the message id, as spelled in the reference. 'text' property is the text content of the message, as spelled in the reference
    
# 2. Next generate each relationships as triples of head, relationship and tail. To refer the head and tail entity, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated. They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
#     Relationship types:
#     personid|SENT|slackmessageid

# The output should look like :

#     "nodes": ["label":"SlackMessage","id":string,"text":string],
#     "relationships": ["personid|SENT|messageid"]

# Schema:
# {schema}

# QUESTION: {question}

# YOUR ANSWER:""" 



In [77]:
# combined_template = f"{project_template}\n{profile_template}"

### Cypher generation prompt

In [82]:
cypher_generation_template = """
You are a helpful IT-project and account management expert who extracts information from documents, 
Use only Nodes and relationships mentioned in the schema, 
Please generate a Cypher query to extract the required information from the graph database.
schema: {schema}

Examples:
Question: Which client's projects use most of our people?
Answer: ```MATCH (c:CLIENT)<-[:HAS_CLIENT]-(p:Project)-[:HAS_PEOPLE]->(person:Person)
RETURN c.name AS Client, COUNT(DISTINCT person) AS NumberOfPeople
ORDER BY NumberOfPeople DESC```
Question: Which person uses the largest number of different technologies?
Answer: ```MATCH (person:Person)-[:USES_TECH]->(tech:Technology)
RETURN person.name AS PersonName, COUNT(DISTINCT tech) AS NumberOfTechnologies
ORDER BY NumberOfTechnologies DESC```

Question: {question}
"""

In [83]:
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema","question"], template=cypher_generation_template
)

In [85]:
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0, model='gpt-3.5-turbo'), graph=graph, verbose=True,
    cypher_prompt=CYPHER_GENERATION_PROMPT
)

In [87]:
chain.invoke({"query": "Which person uses the largest number of different technologies?"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (person:Person)-[:HAS_SKILL]->(tech:Technology)
RETURN person.id AS PersonID, COUNT(DISTINCT tech) AS NumberOfTechnologies
ORDER BY NumberOfTechnologies DESC
Full Context:
[{'PersonID': 'Sarah Johnson', 'NumberOfTechnologies': 4}, {'PersonID': 'David Patel', 'NumberOfTechnologies': 4}, {'PersonID': 'Jason Mitchell', 'NumberOfTechnologies': 4}, {'PersonID': 'Sophia Anderson', 'NumberOfTechnologies': 4}, {'PersonID': 'Lucas Taylor', 'NumberOfTechnologies': 4}, {'PersonID': 'Ella Smith', 'NumberOfTechnologies': 4}, {'PersonID': 'Emily Turner', 'NumberOfTechnologies': 3}, {'PersonID': 'Olivia Martinez', 'NumberOfTechnologies': 3}, {'PersonID': 'William Lee', 'NumberOfTechnologies': 3}, {'PersonID': 'Michael Clark', 'NumberOfTechnologies': 3}]

> Finished chain.


{'query': 'Which person uses the largest number of different technologies?',
 'result': 'Sarah Johnson, David Patel, Jason Mitchell, Sophia Anderson, Lucas Taylor, and Ella Smith use the largest number of different technologies, each with 4 technologies.'}

In [88]:
chain.invoke({"query": "Which client's projects use most of our people?"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Client)<-[:CLIENT]-(p:Project)-[:HAS_PEOPLE]->(person:Person)
RETURN c.name AS Client, COUNT(DISTINCT person) AS NumberOfPeople
ORDER BY NumberOfPeople DESC
Full Context:
[]

> Finished chain.


{'query': "Which client's projects use most of our people?",
 'result': "I don't know the answer."}