<a href="https://colab.research.google.com/github/neo4j-contrib/ms-graphrag-neo4j/blob/main/examples/neo4j_weaviate_combined.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

>[Naive RAG vs GraphRAG with Neo4J & Weaviate](#scrollTo=n3QFDMrgAkCo)

>>[Install Dependencies](#scrollTo=n3QFDMrgAkCo)

>>[Write Documents to Weaviate Cloud](#scrollTo=nqwuGr0Xhgtm)

>>[Classic RAG with OpenAI](#scrollTo=-uAAWPQXBUdX)

>>[Graph RAG](#scrollTo=zzBnUF4bBYKG)

>>>[Build a Graph with Neo4J](#scrollTo=zzBnUF4bBYKG)

>>>[Extract Relevant Entities](#scrollTo=FVzpKJViBiJT)

>>>[Summarize Nodes and Communities](#scrollTo=j1wAsUfIBrGc)

>>>[Write the Entities to Weaviate](#scrollTo=n105cc-_B9bN)



# Naive RAG vs GraphRAG with Neo4J & Weaviate

In this recipe, we walk through two ways of doing RAG:

1. Classic RAG, where we do a simple vector search, followed by answer generation based on the retrieved context
2. Graph RAG, which combines vector search with a graph representation of our dataset, including community and node summaries

For this example, we use a generated dataset called "Financial Contracts", which lists (fake) contracts signed between individuals and companies.
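
To make the first approach concrete, here is a minimal, self-contained sketch of the retrieval step in classic RAG: rank documents by cosine similarity to the query embedding and keep the top k. The three-dimensional vectors, the `cosine` and `top_k` helpers, and the document labels are all made up for illustration; in this recipe Weaviate computes the embeddings and performs the search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=2):
    """Return the k (text, vector) pairs most similar to query_vec."""
    return sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:k]

# Toy embeddings (hypothetical values, not produced by a real model)
docs = [
    ("partnership agreement", [0.9, 0.1, 0.0]),
    ("lease agreement", [0.1, 0.9, 0.0]),
    ("loan agreement", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]
retrieved = [text for text, _ in top_k(query_vec, docs, k=2)]
print(retrieved)
```

The retrieved texts would then be placed into the prompt as context for answer generation.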

## Install Dependencies

In [1]:
!pip install --quiet --upgrade git+https://github.com/neo4j-contrib/ms-graphrag-neo4j.git datasets weaviate-client neo4j-graphrag

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Write Documents to Weaviate Cloud

To get started, you can use a free Weaviate Sandbox.

1. Create a cluster
2. Take note of the cluster URL and API key
3. Go to 'Embeddings' and turn it on.

In [2]:
import os
from getpass import getpass

if "WEAVIATE_API_KEY" not in os.environ:
  os.environ["WEAVIATE_API_KEY"] = getpass("Weaviate API Key")
if "WEAVIATE_URL" not in os.environ:
  os.environ["WEAVIATE_URL"] = getpass("Weaviate URL")

Weaviate API Key··········
Weaviate URL··········


In [4]:
import weaviate
from weaviate.auth import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ.get("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(os.environ.get("WEAVIATE_API_KEY")),
)

In [5]:
from weaviate.classes.config import Configure

#client.collections.delete("Financial_contracts")
client.collections.create(
    "Financial_contracts",
    description="A dataset of financial contracts between individuals and/or companies, as well as information on the type of contract and who has authored them.",
    vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),
)

<weaviate.collections.collection.sync.Collection at 0x7a72cd3b8610>

In [6]:
from datasets import load_dataset

financial_dataset = load_dataset("weaviate/agents", "query-agent-financial-contracts", split="train", streaming=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [7]:
financial_collection = client.collections.get("Financial_contracts")

with financial_collection.batch.dynamic() as batch:
    for item in financial_dataset:
        batch.add_object(properties=item["properties"])
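
The import loop above assumes every streamed record carries a `properties` dict. Below is a minimal sketch of a guard that skips malformed records before they reach the batch. `iter_valid_properties` and its `required` argument are hypothetical helpers for illustration, not part of the Weaviate client or the `datasets` library.

```python
def iter_valid_properties(records, required=("contract_text",)):
    """Yield the properties dict of each record that has all required fields."""
    for record in records:
        props = record.get("properties") if isinstance(record, dict) else None
        if isinstance(props, dict) and all(props.get(key) for key in required):
            yield props

# Stand-in records; the real ones come from the streamed dataset
sample = [
    {"properties": {"contract_text": "LOAN AGREEMENT ...", "date": "2023-03-15"}},
    {"properties": {"date": "2023-03-15"}},  # missing contract_text
    {"vector": [0.1, 0.2]},                   # missing properties
]
valid = list(iter_valid_properties(sample))
print(len(valid))
```

In the import loop, this could replace direct iteration: loop over `iter_valid_properties(financial_dataset)` and pass each yielded dict to `batch.add_object(properties=...)`.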

## Classic RAG with OpenAI

In [9]:
os.environ["OPENAI_API_KEY"]= getpass("Openai API Key:")

Openai API Key:··········


In [10]:
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

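async def achat(messages, model="gpt-4o", temperature=0, config=None):
    # `config` carries optional extra keyword arguments for the API call;
    # defaulting to None avoids sharing one mutable dict across calls
    response = await openai_client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
        **(config or {}),
    )
    return response.choices[0].message.content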

In [11]:
async def classic_rag(input: str) -> str:
    context = [str(obj.properties) for obj in financial_collection.query.near_text(query = input, limit=3).objects]
    messages = [
    {
        "role": "user",
        "content": "Based on the given context: {context} \n\n Answer the following question: {question}".format(context=context, question=input)
    },
    ]
    output = await achat(messages, model="gpt-4o")
    return output
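
`classic_rag` passes the Python list of stringified property dicts straight into the prompt. One possible refinement, sketched below, is to number each retrieved snippet so the model can tell them apart. `build_prompt` is a hypothetical helper, not part of the original notebook; it keeps the same prompt wording.

```python
def build_prompt(context_items, question):
    """Number each context snippet and append the question."""
    numbered = "\n".join(
        f"[{i}] {item}" for i, item in enumerate(context_items, start=1)
    )
    return (
        f"Based on the given context:\n{numbered}\n\n"
        f"Answer the following question: {question}"
    )

prompt = build_prompt(["contract A", "contract B"], "What do you know about Weaviate")
print(prompt)
```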

In [12]:
response = await classic_rag("What do you know about Weaviate")
print(response)

Based on the provided context, Weaviate is a corporation involved in multiple partnership agreements with OpenAI. Here are some details about Weaviate from the context:

1. **Location and Legal Organization**:
   - In the first agreement dated March 15, 2022, Weaviate is described as a corporation organized under the laws of the State of Delaware, with its principal place of business at 123 Innovation Drive, Wilmington, DE.
   - In the second agreement dated April 5, 2023, Weaviate is located at 123 Innovation Drive, Tech City.
   - In the third agreement dated November 15, 2023, Weaviate is described as a corporation organized under the laws of the state of California, with its principal office located at 123 Innovation Drive, San Francisco, CA.

2. **Partnerships with OpenAI**:
   - Weaviate has entered into multiple partnership agreements with OpenAI to collaborate on various projects, particularly in the field of artificial intelligence.
   - The agreements outline financial contri

## Graph RAG

### Build a Graph with Neo4J


In [13]:
import os
from getpass import getpass

from ms_graphrag_neo4j import MsGraphRAG
from neo4j import GraphDatabase
import pandas as pd

# Use Neo4j Sandbox - Blank Project https://sandbox.neo4j.com/

os.environ["NEO4J_URI"]="bolt://52.207.220.65:7687"
os.environ["NEO4J_USERNAME"]="neo4j"
os.environ["NEO4J_PASSWORD"]="dives-platform-eligibility"

In [14]:
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    #notifications_min_severity="OFF",
)
ms_graph = MsGraphRAG(driver=driver, model="gpt-4o", max_workers=10)

In [15]:
import pandas as pd

# Login using e.g. `huggingface-cli login` to access this dataset
df = pd.read_parquet("hf://datasets/weaviate/agents/query-agent/financial-contracts/0001.parquet")
df.head()

| | properties | vector |
|---|---|---|
| 0 | {'date': '2023-03-15T14:30:00+00:00', 'contrac... | [-0.02571106, 0.028182983, 0.011955261, -0.014... |
| 1 | {'date': '2023-03-15T10:30:00+00:00', 'contrac... | [0.013244629, 0.015045166, 0.0042381287, -0.04... |
| 2 | {'date': '2022-03-15T09:30:00+00:00', 'contrac... | [-0.041625977, 0.013809204, -0.011222839, 0.03... |
| 3 | {'date': '2023-09-15T14:23:00+00:00', 'contrac... | [-0.006462097, -0.003780365, -0.064453125, -0.... |
| 4 | {'date': '2024-03-15T10:30:00+00:00', 'contrac... | [-0.022064209, 0.04623413, -0.0284729, 0.00141... |


In [16]:
texts = [el['contract_text'] for el in df['properties']]
texts[:2]

['PARTNERSHIP AGREEMENT\n\nThis Partnership Agreement ("Agreement") is made and entered into as of the 15th day of March, 2023, by and between Weaviate, a company registered in the State of California, and OpenAI, a research organization based in San Francisco, California.\n\n1. Purpose\nThe parties agree to establish a partnership to collaborate on artificial intelligence research and development, sharing resources and expertise.\n\n2. Contributions\nWeaviate shall contribute technology resources valued at $112.85 and staff time equivalent to a monetary value of $550.09. OpenAI shall contribute its research expertise and a project management team valued at $98.14.\n\n3. Profit Sharing\nThe net profits generated from joint projects shall be distributed as follows: Weaviate shall receive 60% and OpenAI shall receive 40%.\n\n4. Duration\nThis Agreement shall commence on the date hereof and shall continue in effect for a period of three (3) years, unless terminated earlier in accordance w

### Extract Relevant Entities

Next, we will start extracting relevant entities and relations between these entities that we might be interested in.

In [17]:
allowed_entities = ["Person", "Organization", "Location"]

await ms_graph.extract_nodes_and_rels(texts, allowed_entities)

Extracting nodes & relationships: 100%|██████████| 100/100 [00:34<00:00,  2.87it/s]


'Successfuly extracted and imported 274 relationships'
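
Conceptually, `allowed_entities` restricts which entity types end up as nodes. The sketch below illustrates that idea on hand-written data: entities outside the allowed types are dropped, and so are relationships that touch a dropped entity. This is an illustration only. The tuple shapes, the relationship names, and `filter_graph` are assumptions, not how `ms_graphrag_neo4j` represents or filters its extraction results.

```python
def filter_graph(entities, relationships, allowed_types):
    """Keep entities of an allowed type and relationships between kept entities."""
    kept = {name: etype for name, etype in entities if etype in allowed_types}
    rels = [(s, r, t) for s, r, t in relationships if s in kept and t in kept]
    return kept, rels

# Hand-written candidates (hypothetical, loosely based on the sample contract text)
entities = [
    ("WEAVIATE", "Organization"),
    ("OPENAI", "Organization"),
    ("SAN FRANCISCO", "Location"),
    ("$112.85", "Amount"),  # not an allowed type
]
relationships = [
    ("WEAVIATE", "PARTNERS_WITH", "OPENAI"),
    ("OPENAI", "BASED_IN", "SAN FRANCISCO"),
    ("WEAVIATE", "CONTRIBUTES", "$112.85"),
]
nodes, rels = filter_graph(
    entities, relationships, {"Person", "Organization", "Location"}
)
print(sorted(nodes), len(rels))
```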

### Summarize Nodes and Communities

In [18]:
await ms_graph.summarize_nodes_and_rels()

Summarizing nodes: 100%|██████████| 33/33 [00:13<00:00,  2.42it/s]
Summarizing relationships: 100%|██████████| 33/33 [00:08<00:00,  3.87it/s]


'Successfuly summarized nodes and relationships'

In [21]:
await ms_graph.summarize_communities()

Leiden algorithm identified 1 community levels with 3 communities on the last level.




Summarizing communities: 100%|██████████| 3/3 [00:13<00:00,  4.41s/it]


'Generated 3 community summaries'

In [22]:
entities = ms_graph.query("""
MATCH (e:__Entity__)
RETURN e.name AS entity_id, e.summary AS entity_summary
""")

In [23]:
entities[:2]

[{'entity_id': 'WEAVIATE',
  'entity_summary': "Weaviate is a corporation organized under the laws of both the State of California and the State of Delaware, with its principal place of business primarily located in San Francisco, CA, and additional offices at 123 Innovation Drive, Tech City, CA 90210, and 123 Tech Lane, Silicon Valley, CA 94043. The company is involved in a wide range of activities, including providing consulting, software development, data analysis, cloud storage, technical support, and project management services. Weaviate is actively engaged in partnerships to develop innovative AI solutions and advanced data processing technologies, contributing resources and expertise to these collaborations.\n\nThe organization acts as both a lessor and a lessee in various lease agreements, and it is involved in multiple business relationships under Non-Disclosure Agreements. Weaviate also participates in sales and purchase order agreements, acting as both a buyer and a seller, 

### Write the Entities to Weaviate

In [25]:
from weaviate.classes.config import Configure

#client.collections.delete("Entities")
client.collections.create(
    "Entities",
    description="A dataset of entities appearing in the financial contracts between individuals and/or companies, as well as information on the type of contract and who has authored them.",
    vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),
)

<weaviate.collections.collection.sync.Collection at 0x7a71e4142510>

In [26]:
from datasets import IterableDataset

# Define a simple generator
def list_generator(data):
    for item in data:
        yield item

# Create the IterableDataset
entities_dataset = IterableDataset.from_generator(list_generator, gen_kwargs={"data": entities})

In [28]:
entities_collection = client.collections.get("Entities")

with entities_collection.batch.dynamic() as batch:
    for item in entities_dataset:
        batch.add_object(properties=item)

In [29]:
from neo4j_graphrag.retrievers import WeaviateNeo4jRetriever

retrieval_query = """
    WITH collect(node) as nodes
WITH collect {
    UNWIND nodes as n
    MATCH (n)<-[:MENTIONS]->(c:__Chunk__)
    WITH c, count(distinct n) as freq
    RETURN c.text AS chunkText
    ORDER BY freq DESC
    LIMIT 3
} AS text_mapping,
collect {
    UNWIND nodes as n
    MATCH (n)-[:IN_COMMUNITY*]->(c:__Community__)
    WHERE c.summary IS NOT NULL
    WITH c, c.rating as rank
    RETURN c.summary
    ORDER BY rank DESC
    LIMIT 3
} AS report_mapping,
collect {
    UNWIND nodes as n
    MATCH (n)-[r:SUMMARIZED_RELATIONSHIP]-(m)
    WHERE m IN nodes
    RETURN r.summary AS descriptionText
    LIMIT 3
} as insideRels,
collect {
    UNWIND nodes as n
    RETURN n.summary AS descriptionText
} as entities
RETURN {Chunks: text_mapping, Reports: report_mapping,
       Relationships: insideRels,
       Entities: entities} AS output
    """

retriever = WeaviateNeo4jRetriever(
    driver=driver,
    client=client,
    collection="Entities",
    id_property_external="entity_id",
    id_property_neo4j="name",
    retrieval_query=retrieval_query
)
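
The retrieval query returns a single map with the keys `Chunks`, `Reports`, `Relationships`, and `Entities`. A minimal sketch of turning such a map into labeled prompt sections follows. `format_graph_context` is a hypothetical helper, and the sample values are placeholders rather than real query output.

```python
def format_graph_context(output):
    """Render the retrieval map as labeled sections, skipping empty ones."""
    sections = []
    for key in ("Entities", "Relationships", "Reports", "Chunks"):
        # Drop None or empty values, e.g. entities without a summary
        values = [v for v in output.get(key, []) if v]
        if values:
            body = "\n".join(f"- {v}" for v in values)
            sections.append(f"{key}:\n{body}")
    return "\n\n".join(sections)

sample_output = {
    "Chunks": ["PARTNERSHIP AGREEMENT ..."],
    "Reports": [],
    "Relationships": ["Weaviate and OpenAI are partners ..."],
    "Entities": ["Weaviate is a corporation ...", None],
}
context = format_graph_context(sample_output)
print(context)
```

Ordering the sections from entity summaries down to raw chunks is a design choice of this sketch; any order works as long as each section is labeled.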

In [30]:
async def hybrid_local_search_rag(input: str) -> str:
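    # `search` returns a RetrieverResult; each entry in `.items` holds one returned record in `.content`
    context = [str(item.content) for item in retriever.search(query_text=input, top_k=3).items]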
    messages = [
    {
        "role": "user",
        "content": "Based on the given context: {context} \n\n Answer the following question: {question}".format(context=context, question=input)
    },
    ]
    output = await achat(messages, model="gpt-4o")
    return output

In [31]:
response = await hybrid_local_search_rag(input="What do you know about Weaviate")
print(response)

Weaviate is a corporation organized under the laws of both the State of California and the State of Delaware. Its principal place of business is primarily located in San Francisco, CA, with additional offices at 123 Innovation Drive, Tech City, CA, and 123 Tech Lane, Silicon Valley, CA. The company is involved in a wide range of activities, including consulting, software development, data analysis, cloud storage, technical support, and project management services. Weaviate is actively engaged in partnerships to develop innovative AI solutions and advanced data processing technologies, contributing resources and expertise to these collaborations.

The organization acts as both a lessor and a lessee in various lease agreements and is involved in multiple business relationships under Non-Disclosure Agreements. Weaviate also participates in sales and purchase order agreements, acting as both a buyer and a seller, and is involved in loan agreements as a lender. The company is responsible fo
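
To compare the two approaches side by side, both pipelines can be run on the same question. The sketch below does this concurrently with `asyncio.gather`. `compare` is a hypothetical helper, and the two `fake_*` coroutines are stubs standing in for `classic_rag` and `hybrid_local_search_rag` so the example runs without any external service.

```python
import asyncio

async def compare(question, pipelines):
    """Run every RAG pipeline on the same question concurrently."""
    names = list(pipelines)
    answers = await asyncio.gather(*(pipelines[name](question) for name in names))
    return dict(zip(names, answers))

# Stubs standing in for classic_rag and hybrid_local_search_rag
async def fake_classic(question):
    return f"classic answer to: {question}"

async def fake_graph(question):
    return f"graph answer to: {question}"

results = asyncio.run(
    compare(
        "What do you know about Weaviate",
        {"classic": fake_classic, "graph": fake_graph},
    )
)
for name, answer in results.items():
    print(f"{name}: {answer}")
```

Inside a notebook, where an event loop is already running, `await compare(...)` would be used in place of `asyncio.run(...)`.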