<a href="https://colab.research.google.com/github/rahiakela/genai-research-and-practice/blob/main/essential-graph-rag/01_vector_similarity_search_and_hybrid_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [None]:
!pip install -q pdfplumber langchain-google-genai neo4j

In [2]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings

from google.colab import userdata

from numpy.linalg import norm
import pandas as pd
import numpy as np
import requests
import pdfplumber
import os

In [4]:
# --- Configuration ---
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")

In [5]:
# Initialize the Gemini chat model used for answer generation
# temperature=0 keeps the answers as deterministic as possible
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    streaming=True,
    api_key=userdata.get("GOOGLE_API_KEY")
)

gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [16]:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    userdata.get("NEO4J_URI"),
    auth=(userdata.get("NEO4J_USERNAME"), userdata.get("NEO4J_PASSWORD"))
)

In [7]:
def chunk_text(text, chunk_size, overlap, split_on_whitespace_only=True):
    chunks = []
    index = 0

    while index < len(text):
        if split_on_whitespace_only:
            prev_whitespace = 0
            left_index = index - overlap
            while left_index >= 0:
                if text[left_index] == " ":
                    prev_whitespace = left_index
                    break
                left_index -= 1
            next_whitespace = text.find(" ", index + chunk_size)
            if next_whitespace == -1:
                next_whitespace = len(text)
            chunk = text[prev_whitespace:next_whitespace].strip()
            chunks.append(chunk)
            index = next_whitespace + 1
        else:
            start = max(0, index - overlap + 1)
            end = min(index + chunk_size + overlap, len(text))
            chunk = text[start:end].strip()
            chunks.append(chunk)
            index += chunk_size

    return chunks

## Load data

In [8]:
remote_pdf_url = "https://arxiv.org/pdf/1709.00666.pdf"
pdf_filename = "ch02-downloaded.pdf"

response = requests.get(remote_pdf_url)

if response.status_code == 200:
    with open(pdf_filename, "wb") as pdf_file:
        pdf_file.write(response.content)
else:
    print("Failed to download the PDF. Status code:", response.status_code)

In [9]:
text = ""

with pdfplumber.open(pdf_filename) as pdf:
    for page in pdf.pages:
        # extract_text() can return None in some pdfplumber versions when a page has no text
        text += page.extract_text() or ""

print(text[0:20])

Einstein’s Patents a


## Data Chunking

In [10]:
chunks = chunk_text(text, 500, 40)
print(len(chunks))
print(chunks[0])

89
Einstein’s Patents and Inventions
Asis Kumar Chaudhuri
Variable Energy Cyclotron Centre
1‐AF Bidhan Nagar, Kolkata‐700 064
Abstract: Times magazine selected Albert Einstein, the German born Jewish Scientist as the person of the 20th
century. Undoubtedly, 20th century was the age of science and Einstein’s contributions in unravelling mysteries
of nature was unparalleled. However, few are aware that Einstein was also a great inventor. He and his
collaborators had patented a wide variety of inventions
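
With `split_on_whitespace_only=True`, each chunk starts at a whitespace position at least `overlap` characters before the previous chunk's end, so consecutive chunks share some text. A minimal sketch for checking that overlap follows. The helper `shared_overlap` is hypothetical (not part of the notebook) and is shown on toy strings; in the notebook it could be applied to `chunks[0]` and `chunks[1]`.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""

# Toy strings standing in for two consecutive chunks.
first_chunk = "the quick brown fox"
second_chunk = "brown fox jumps"

overlap_text = shared_overlap(first_chunk, second_chunk)
print(repr(overlap_text), len(overlap_text))
```

An empty result for a pair of real chunks would suggest that the overlap logic is not behaving as intended for that boundary.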


## Embed Data

In [11]:
def embed(text_chunks):
    # Embed each chunk with the Gemini embedding model, one request per chunk
    embeddings_list = []
    for text_chunk in text_chunks:
        embeddings = gemini_embeddings.embed_query(text_chunk)
        embeddings_list.append(embeddings)
    return embeddings_list

embeddings = embed(chunks)

print(embeddings[0][0:3])
print(len(embeddings))
print(len(embeddings[0]))

[0.00421591, -0.042048614, 0.019263804]
89
768
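
Each chunk is now a 768-dimensional vector, and vector search ranks chunks by how similar their vectors are to the question vector. The sketch below shows cosine similarity on toy 3-dimensional vectors. It assumes cosine is the similarity function in use (the notebook does not set one explicitly), and it uses only the standard library rather than the `norm` import from the setup cell.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings.
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]  # 45 degrees away from v1
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

sim_12 = cosine_similarity(v1, v2)
sim_13 = cosine_similarity(v1, v3)
print(sim_12, sim_13)
```

The same function could be called on `embeddings[0]` and `embeddings[1]` to compare two chunks directly, without going through the database.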


In [17]:
# Creating a vector index in Neo4j
driver.execute_query("""CREATE VECTOR INDEX pdf IF NOT EXISTS
FOR (c:Chunk)
ON c.embedding""")

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x7f3e07cb00b0>, keys=[])

In [18]:
# Storing chunks and populating the vector index in Neo4j
cypher_query = '''
// range() is inclusive at both ends, so stop at size - 1 to avoid an extra empty Chunk node
WITH $chunks AS chunks, range(0, size($chunks) - 1) AS index
UNWIND index AS i
WITH i, chunks[i] AS chunk, $embeddings[i] AS embedding
MERGE (c:Chunk {index: i})
SET c.text = chunk, c.embedding = embedding
'''

driver.execute_query(cypher_query, chunks=chunks, embeddings=embeddings)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x7f3e06b21df0>, keys=[])

In [19]:
# Getting data from a chunk node in Neo4j
records, _, _ = driver.execute_query("MATCH (c:Chunk) WHERE c.index = 0 RETURN c.embedding, c.text")

print(records[0]["c.text"][0:30])
print(records[0]["c.embedding"][0:3])

Einstein’s Patents and Inventi
[0.00421591, -0.042048614, 0.019263804]


## Vector search

In [20]:
# Embedding user question
question = "At what time was Einstein really interested in experimental works?"
question_embedding = embed([question])[0]

# Performing vector search in Neo4j
query = '''
CALL db.index.vector.queryNodes('pdf', $k, $question_embedding) YIELD node AS hits, score
RETURN hits.text AS text, score, hits.index AS index
'''
similar_records, _, _ = driver.execute_query(query, question_embedding=question_embedding, k=4)

# Printing results
for record in similar_records:
    print(record["text"])
    print(record["score"], record["index"])
    print("======")

Einstein
left his job at the Patent office and joined the University of Zurich on October 15, 1909. Thereafter, he
continued to rise in ladder. In 1911, he moved to Prague University as a full professor, a year later, he
was appointed as full professor at ETH, Zurich, his alma‐mater. In 1914, he was appointed Director of
the Kaiser Wilhelm Institute for Physics (1914–1932) and a professor at the Humboldt University of
Berlin, with a special clause in his contract that freed him from teaching obligations. In the meantime,
he was working for
0.837451696395874 31
Einstein’s life was rather featureless. He diligently worked at the patent office,
played violin, discussed physics with his friends, write few not so interesting papers. Then in 1905, he
took the academic world by surprise. In the annals of physics, the year 1905 is known as “annus
mirabilis” or the year of miracle. Indeed, a miracle happened. Albert Einstein, barely 26 years old,
sitting in an obscure Swiss patent office, wrote

## Generating Answer

In [21]:
# The LLM context
system_message = "You're an Einstein expert, but can only use the provided documents to respond to the questions."
user_message = f"""
Use the following documents to answer the question that will follow:
{[doc["text"] for doc in similar_records]}

---

The question to answer using information only from the above documents: {question}
"""

print("Question:", question)

# Generating an answer using an LLM
messages=[
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]
# invoke() returns the complete message, not a stream
response = llm.invoke(messages)

print(response.content)

Question: At what time was Einstein really interested in experimental works?
Einstein was genuinely interested in experimental works during his ETH days.
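
The prompt above interpolates a Python list, so the model sees brackets, quotes, and escaped newlines around the chunk texts. A possible alternative is to join the chunks into numbered blocks first. The helpers `format_context` and `build_user_message` below are hypothetical and only sketch that idea; the prompt wording is copied from the cell above.

```python
def format_context(texts):
    """Join retrieved chunk texts into numbered blocks separated by blank lines."""
    return "\n\n".join(
        f"Document {i}:\n{text.strip()}" for i, text in enumerate(texts, start=1)
    )

def build_user_message(texts, question):
    """Assemble the user prompt from the retrieved texts and the question."""
    return (
        "Use the following documents to answer the question that will follow:\n"
        f"{format_context(texts)}\n\n"
        "---\n\n"
        f"The question to answer using information only from the above documents: {question}"
    )

# Toy inputs standing in for the retrieved chunk texts.
sample_texts = ["First retrieved chunk. ", " Second retrieved chunk."]
sample_question = "At what time was Einstein really interested in experimental works?"

print(build_user_message(sample_texts, sample_question))
```

In the notebook, `texts` would be `[doc["text"] for doc in similar_records]`.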


## Hybrid search index

In [22]:
# Creating a full-text index in Neo4j
# IF NOT EXISTS makes the cell safe to re-run without hiding unrelated errors
driver.execute_query("CREATE FULLTEXT INDEX ftPdfChunk IF NOT EXISTS FOR (c:Chunk) ON EACH [c.text]")
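
Neo4j full-text indexes are queried with Lucene query syntax, so characters such as `?`, `*`, or `:` in a raw user question may be interpreted as operators rather than literal text. A minimal sketch of escaping them before the query is sent follows. `escape_lucene` is a hypothetical helper, and the character set is assumed to match the one escaped by Lucene's classic query parser.

```python
import re

# Characters treated as special by the Lucene classic query parser (assumed set).
_LUCENE_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')

def escape_lucene(text):
    """Prefix every Lucene special character with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)

raw_question = "At what time was Einstein really interested in experimental works?"
safe_question = escape_lucene(raw_question)
print(safe_question)
```

The escaped string could then be passed as the `question` parameter of the full-text query instead of the raw one. The embedding should still be computed from the unescaped question.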

## Performing hybrid search

In [23]:
# Performing hybrid search in Neo4j
hybrid_query = '''
CALL {
    // vector index
    CALL db.index.vector.queryNodes('pdf', $k, $question_embedding) YIELD node, score
    WITH collect({node:node, score:score}) AS nodes, max(score) AS max
    UNWIND nodes AS n
    // Min-max normalization with the minimum fixed at 0: divide by the top score
    RETURN n.node AS node, (n.score / max) AS score
    UNION
    // keyword index
    CALL db.index.fulltext.queryNodes('ftPdfChunk', $question, {limit: $k})
    YIELD node, score
    WITH collect({node:node, score:score}) AS nodes, max(score) AS max
    UNWIND nodes AS n
    // Same normalization for the keyword scores, which are not bounded by 1
    RETURN n.node AS node, (n.score / max) AS score
}
// dedup
WITH node, max(score) AS score ORDER BY score DESC LIMIT $k
RETURN node, score
'''
similar_hybrid_records, _, _ = driver.execute_query(hybrid_query, question_embedding=question_embedding, question=question, k=4)

for record in similar_hybrid_records:
    print(record["node"]["text"])
    print(record["score"], record["node"]["index"])
    print("======")



CH‐Switzerland
Considering Einstein’s upbringing, his interest in inventions and patents was not unusual.
Being a manufacturer’s son, Einstein grew upon in an environment of machines and instruments.
When his father’s company obtained the contract to illuminate Munich city during beer festival, he
was actively engaged in execution of the contract. In his ETH days Einstein was genuinely interested
in experimental works. He wrote to his friend, “most of the time I worked in the physical laboratory,
fascinated by the direct contact with observation.” Einstein's
1.0 42
Einstein
left his job at the Patent office and joined the University of Zurich on October 15, 1909. Thereafter, he
continued to rise in ladder. In 1911, he moved to Prague University as a full professor, a year later, he
was appointed as full professor at ETH, Zurich, his alma‐mater. In 1914, he was appointed Director of
the Kaiser Wilhelm Institute for Physics (1914–1932) and a professor at the Humboldt University of
Berlin
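
The hybrid query normalizes each result list by its own top score, takes the union, keeps the best score per chunk, and returns the top `k`. The sketch below mirrors those steps in plain Python on made-up `(index, score)` pairs. The function names are hypothetical and the scores are illustrative, not taken from the run above.

```python
def normalize_by_max(hits):
    """Divide every score by the largest one (the minimum is taken to be 0)."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top == 0:
        # Degenerate case: leave scores unchanged to avoid dividing by zero.
        return list(hits)
    return [(node_id, score / top) for node_id, score in hits]

def merge_hits(vector_hits, keyword_hits, k):
    """Union both normalized lists, keep the best score per node, return the top k."""
    best = {}
    for node_id, score in normalize_by_max(vector_hits) + normalize_by_max(keyword_hits):
        best[node_id] = max(score, best.get(node_id, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up scores: vector scores lie close together, keyword scores are unbounded.
vector_hits = [(31, 0.84), (12, 0.82), (42, 0.80)]
keyword_hits = [(42, 6.0), (7, 3.0)]

merged = merge_hits(vector_hits, keyword_hits, k=3)
for node_id, score in merged:
    print(node_id, round(score, 3))
```

One consequence of dividing by the maximum is that the top hit of each index always receives a score of 1.0, which is why a chunk found only by the keyword index can tie with the best vector match.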

In [24]:
user_message = f"""
Use the following documents to answer the question that will follow:
{[doc["node"]["text"] for doc in similar_hybrid_records]}

---

The question to answer using information only from the above documents: {question}
"""

print("Question:", question)

# Generating an answer using an LLM
messages=[
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]
# invoke() returns the complete message, not a stream
response = llm.invoke(messages)

print(response.content)

Question: At what time was Einstein really interested in experimental works?
Einstein was genuinely interested in experimental works during his ETH days. He wrote to a friend that "most of the time I worked in the physical laboratory, fascinated by the direct contact with observation."
