<a href="https://colab.research.google.com/github/vijaypoluri/AI/blob/main/vijay_Advanced_part_2_of_V4_Module_4_Advanced_LLMs_Maven_Knowledge_Graph_Construction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In the [basic notebook for Module 5](https://colab.research.google.com/drive/1OX3I6GKBlfcJGyecyK8VNWbY5l4MA2dA?usp=sharing) of [Enterprise RAG and Multi-Agent Applications](https://maven.com/boring-bot/advanced-llm), we explored the fundamentals of knowledge graph construction and Graph RAG. We built a hotel reviews knowledge graph, migrated it to Neo4j, and implemented a template-based retriever for answering natural language queries about our graph data.

This advanced notebook builds on those foundations to explore more sophisticated Graph RAG techniques and hybrid approaches. While our basic notebook focused on core concepts and implementation, this notebook delves into methods that significantly enhance both the graph itself and our retrieval capabilities, including the pattern most frequently discussed in modern Graph RAG demonstrations: the LLM-driven extraction of structured triplets from unstructured text.



## Architecture Overview
Our final graph RAG system will have several key components:

[Figure: architecture diagram of the Graph RAG system (`architecture (1).svg`)]

The key enhancements we'll cover in this notebook include:

1. **Graph Enrichment**: Using LLMs to extract entities and relationships from unstructured text fields, expanding our knowledge graph beyond the structured data
   
2. **Vector Indexing**: Adding semantic search capabilities to our graph nodes, enabling similarity-based retrieval alongside structural queries
   
3. **Advanced Retrieval**: Implementing and comparing Text2Cypher, template-based, and vector retrieval
   
4. **Performance Analysis**: Systematically comparing different RAG strategies to understand when graph-based approaches outperform traditional vector RAG


## Prerequisites

* Completion of the basic Knowledge Graph RAG notebook
* Access to the same Neo4j database used in the basic notebook
* OpenAI API key

# Setup

Let's begin by setting up our connections and exploring how to enhance our knowledge graph with entities extracted from unstructured text.

We will use the same Neo4j database instance that ingested our data in the basic notebook. Make sure that you have entered your NEO4J_URI and NEO4J_PASSWORD key-value pairs into your Colab Secrets before continuing.
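
To keep credentials out of the notebook source, a small helper can look a secret up in the environment first and only fall back to an interactive prompt. This is a minimal sketch; `resolve_secret` and its parameters are hypothetical names, not part of Colab or the Neo4j driver.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_secret(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return a secret from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        # Fall back to a prompt so the value is never written into a cell
        value = prompt(f"Enter {name}: ")
    if not value:
        raise ValueError(f"No value provided for {name}")
    return value
```

Passing `env` and `prompt` explicitly makes the helper easy to exercise without touching real secrets.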

In [5]:
%pip install pyvis IPython cchardet datasets langchain==0.1.17 neo4j openai tiktoken langchain-community langchain-experimental json-repair

from getpass import getpass
import os
from google.colab import userdata
import json
import pandas as pd
from typing import Optional, List, Dict, Any
from openai import OpenAI
from neo4j import GraphDatabase
from dataclasses import dataclass

# Configure OpenAI API key
if os.getenv("OPENAI_API_KEY") is None:
  try:
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
  except (userdata.TimeoutException, userdata.SecretNotFoundError):
    if any(['VSCODE' in x for x in os.environ.keys()]):
      print('Please enter password in the VS Code prompt at the top of your VS Code window!')
    os.environ["OPENAI_API_KEY"] = getpass("")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

# Connect to Neo4j (credentials come from Colab Secrets; never hardcode them)
url = userdata.get('NEO4J_URI')
username = "neo4j"
password = userdata.get('NEO4J_PASSWORD')

driver = GraphDatabase.driver(url, auth=(username, password))
driver.verify_connectivity()  # fail fast if the URI or credentials are wrong
print("Connected to Neo4j database")

# Initialize OpenAI client
openai_client = OpenAI()

OpenAI API key configured
Connected to Neo4j database


# Expanding the graph from unstructured text fields with LLMs and pre-defined schema types

Graph RAG really shines when you have large amounts of unstructured text that contain many complex relationships. In fact, most publicly available Graph RAG examples and demos use unstructured text as the primary data source. In our case, with the structured dataset of hotel reviews, we can turn to the hotel descriptions for such data. Let's now look at how to build a pipeline for entity and relationship extraction from unstructured text.

In [6]:
from typing import Optional, List, Dict, Any
from openai import OpenAI
import pandas as pd
import json

class Neo4jGraphExtractor:
    def __init__(self, openai_client: OpenAI, neo4j_driver: GraphDatabase, entity_types: List[str], rel_types: List[str], instruct_notes: List[str] = None, temperature: float = 0 ):
        self.client = openai_client
        self.temperature = temperature
        self.driver = neo4j_driver  # use the injected driver, not the notebook-level global
        self.entity_types = entity_types
        self.rel_types = rel_types
        self.instruct_notes = instruct_notes

    def _create_prompt(self, hotel_name: str, text: str) -> str:
        return f"""
-Goal-
Given a text document that contains the description of a specific hotel and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. There is a known root entity, which is the described hotel. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{', '.join(self.entity_types)}]
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are clearly related to each other.

For each pair of related entities, extract:
- source_entity: name of the source entity
- target_entity: name of the target entity
- relationship_type: One of the following types: [{', '.join(self.rel_types)}]

3. Return output as a single JSON object with two keys: "entities" (a list of the entities from step 1) and "relationships" (a list of the relationships from step 2).

-Real Data-
######################
Root hotel: {hotel_name}
text: {text}
######################
output:
"""

    def _parse_llm_response(self, response: str) -> Dict[str, List[Dict[str, Any]]]:
      try:
          # Clean up markdown formatting if present
          if response.startswith('```'):
              # Extract content between code blocks
              response = response.split('```')[1]
              # Remove json language identifier if present
              if response.startswith('json'):
                  response = response[4:]
              response = response.strip()

          # Parse JSON
          try:
              data = json.loads(response)

              # Handle both direct entity/relationship format and nested format
              if isinstance(data, dict) and 'entities' in data and 'relationships' in data:
                  return {
                      "entities": data['entities'],
                      "relationships": data['relationships']
                  }
              if isinstance(data, list) and len(data) > 0:
                  # Flat list format: split items by the keys requested in the prompt
                  entities = [item for item in data if "entity_type" in item]
                  relationships = [item for item in data if "relationship_type" in item]
                  return {
                      "entities": entities,
                      "relationships": relationships
                  }
              # Fallback if data is not in any expected shape
              return {"entities": [], "relationships": []}

          except json.JSONDecodeError as e:
              # If standard format fails, try the alternative format
              print(f"JSONDecodeError thrown in inner block: {str(e)}")
              if str(e).startswith("Extra data"):
                  try:
                      # Assume the format is: entities_list, relationships_list
                      arr = json.loads(f"[{response}]")
                  except json.JSONDecodeError:
                      raise e
                  if len(arr) == 2:
                      entities_list, relationships_list = arr
                      return {
                          "entities": entities_list,
                          "relationships": relationships_list
                      }
              # If neither format works, re-raise the original error
              raise e
      except json.JSONDecodeError as e:
          print(f"Error parsing JSON response: {str(e)}")
          print("Response was:", response)
          return {"entities": [], "relationships": []}

    def process_text(self, hotel_name: str, text: str) -> Dict[str, List[Dict[str, Any]]]:
        """
        Process a hotel description text and extract entities and relationships.

        Args:
            hotel_name: Name of the hotel (root entity)
            text: Hotel description text to process

        Returns:
            Dictionary containing lists of extracted entities and relationships
        """
        prompt = self._create_prompt(hotel_name, text)

        response = self.client.chat.completions.create(
            model="gpt-4o",  # or your preferred model
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts entities and relationships from text and returns them in JSON format."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}
        )

        # Extract the content from the response
        result = response.choices[0].message.content

        # Parse and return the structured data
        return self._parse_llm_response(result)

    def process_and_save(self, hotel_name: str, text: str) -> Dict[str, List[Dict[str, Any]]]:
        """
        Process hotel description text, extract entities and relationships, and save to Neo4j.

        Args:
            hotel_name: Name of the hotel (root entity)
            text: Hotel description text to process

        Returns:
            Dictionary containing lists of extracted entities and relationships
        """
        try:
            result = self.process_text(hotel_name, text)

            if result == {"entities": [], "relationships": []}:
                print(f"No entities or relationships found for {hotel_name}, skipping upload")
                return result
            print(f"Processing and saving result {result}")

            # Save to Neo4j within a single session
            with self.driver.session() as session:
                # Add entities
                for entity in result['entities']:
                    cypher_query = """
                    MERGE (n:__Entity__ {name: $name})
                    SET n.entity = $type,
                        n.description = $description
                    WITH n
                    CALL apoc.create.addLabels(n, [$type]) YIELD node
                    RETURN distinct 'done' AS result
                    """
                    session.run(
                        cypher_query,
                        name=entity['entity_name'],
                        type=entity['entity_type'].upper(),
                        description=entity['entity_description']
                    )

                # Add relationships
                for rel in result['relationships']:
                    # Format the relationship type directly into the query
                    cypher_query = f"""
                    MATCH (a:__Entity__), (b:__Entity__)
                    WHERE a.name = $source AND b.name = $target
                    MERGE (a)-[r:{rel['relationship_type']}]->(b)
                    RETURN distinct 'done' AS result
                    """
                    session.run(
                        cypher_query,
                        source=rel['source_entity'],
                        target=rel['target_entity']
                    )

            return result

        except Exception as e:
            print(f"Error processing hotel {hotel_name}: {str(e)}")
            return {"entities": [], "relationships": []}


    def process_from_dataframe(self,
                             dataset: pd.DataFrame,
                             name_column: str,
                             description_column: str,
                             batch_size: Optional[int] = None):
        """Process hotels from pandas DataFrame"""
        total = len(dataset)
        # Count rows explicitly: the DataFrame index is not guaranteed to be 0..n-1
        for position, (_, row) in enumerate(dataset.iterrows(), start=1):
            print(f"Processing hotel {position}/{total}: {row[name_column]}")
            self.process_and_save(
                hotel_name=row[name_column],
                text=row[description_column]
            )

            if batch_size and position % batch_size == 0:
                print(f"Completed batch of {batch_size} hotels")

    def process_from_neo4j(self, batch_size: Optional[int] = None):
        """Process hotels from existing Neo4j HOTEL nodes"""
        with self.driver.session() as session:
            # First, count total hotels
            count_query = """
            MATCH (h:HOTEL)
            RETURN count(h) as total
            """
            total = session.run(count_query).single()['total']

            # Then process in batches if specified
            query = """
            MATCH (h:HOTEL)
            RETURN h.name as name, h.description as description
            """
            if batch_size:
                query += f" SKIP $skip LIMIT {batch_size}"

            processed = 0
            while processed < total:
                results = session.run(query, skip=processed)
                for record in results:
                    processed += 1
                    if record['description'] is None:
                        print(f"Skipping hotel {processed}/{total}: {record['name']} (no description)")
                        continue
                    print(f"Processing hotel {processed}/{total}: {record['name']}")
                    self.process_and_save(
                        hotel_name=record['name'],
                        text=record['description']
                    )

                if batch_size:
                    print(f"Completed batch of {batch_size} hotels")

    def process_hotels(self,
                      source: str = 'neo4j',
                      dataset: Optional[pd.DataFrame] = None,
                      name_column: Optional[str] = None,
                      description_column: Optional[str] = None,
                      batch_size: Optional[int] = None):
        """
        Unified interface for processing hotels from either source

        Args:
            source: Either 'neo4j' or 'dataframe'
            dataset: Required if source is 'dataframe'
            name_column: Required if source is 'dataframe'
            description_column: Required if source is 'dataframe'
            batch_size: Optional batch size for processing
        """
        if source == 'neo4j':
            self.process_from_neo4j(batch_size=batch_size)
        elif source == 'dataframe':
            if not all([dataset is not None,
                       name_column is not None,
                       description_column is not None]):
                raise ValueError("Dataset and column names required for DataFrame source")
            self.process_from_dataframe(
                dataset=dataset,
                name_column=name_column,
                description_column=description_column,
                batch_size=batch_size
            )
        else:
            raise ValueError("Source must be either 'neo4j' or 'dataframe'")

In [7]:
openai_client = OpenAI()
extractor = Neo4jGraphExtractor(openai_client=openai_client,
                                neo4j_driver=driver,
                                entity_types = ["HOTEL", "AMENITY", "TOURIST_ATTRACTION"],
                                rel_types=["HAS_AMENITY", "LOCATED_NEARBY"],
                                )

# Process from Neo4j:
extractor.process_hotels(source='neo4j', batch_size=100)



Skipping hotel 1/149: Royal National Hotel (no description)
Processing hotel 2/149: Grant Plaza Hotel
Processing and saving result {'entities': [{'entity_name': 'Grant Plaza Hotel', 'entity_type': 'HOTEL', 'entity_description': "Grant Plaza Hotel is a limited service boutique hotel located in the heart of the city. It is recommended by many travellers as one of the best valued hotels in San Francisco. The hotel is conveniently located at the gateway to Chinatown, 3 blocks from Union Square, and within easy walking distance to many fine restaurants and theaters in this exciting city. It is also only 1 block to the famous San Francisco Cable Car line, the Bank of American Building in the Financial District and the 5 stars Ritz Carlton Hotel in Nob Hill. It caters to both business and pleasure travelers, and is the perfect place for a vacation getaway. The Hotel has just a completed renovation in 2015. The comfortable guestrooms are appointed with contemporary furnishings, and have recent

Our knowledge graph has now been augmented with additional fact triplets about the hotels. This enrichment allows us to capture more complex relationships between hotels, amenities, and tourist attractions.
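
To make the idea of fact triplets concrete, the sketch below flattens an extraction result into `(source, relationship, target)` tuples and drops anything outside the schema. The helper name `to_triplets` and the sample data are illustrative assumptions; the dictionary keys follow the extraction prompt used by `Neo4jGraphExtractor`.

```python
from typing import Any, Dict, List, Tuple


def to_triplets(
    result: Dict[str, List[Dict[str, Any]]],
    rel_types: List[str],
) -> List[Tuple[str, str, str]]:
    """Flatten an extraction result into (source, relationship, target) triplets."""
    known = {e["entity_name"] for e in result.get("entities", [])}
    triplets = []
    for rel in result.get("relationships", []):
        source = rel.get("source_entity")
        target = rel.get("target_entity")
        rel_type = rel.get("relationship_type")
        # Keep only relationships of an allowed type between extracted entities
        if rel_type in rel_types and source in known and target in known:
            triplets.append((source, rel_type, target))
    return triplets


# Hypothetical extraction result, shaped like the extractor's output
sample = {
    "entities": [
        {"entity_name": "Grant Plaza Hotel", "entity_type": "HOTEL"},
        {"entity_name": "Chinatown", "entity_type": "TOURIST_ATTRACTION"},
    ],
    "relationships": [
        {"source_entity": "Grant Plaza Hotel", "target_entity": "Chinatown",
         "relationship_type": "LOCATED_NEARBY"},
        {"source_entity": "Grant Plaza Hotel", "target_entity": "Chinatown",
         "relationship_type": "OWNS"},  # not in the schema, so it is dropped
    ],
}
print(to_triplets(sample, ["HAS_AMENITY", "LOCATED_NEARBY"]))
# → [('Grant Plaza Hotel', 'LOCATED_NEARBY', 'Chinatown')]
```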

## Build database indices for the graph

With the data in place, we can next create a vector index over the text properties, exposing all of our nodes to HNSW similarity search.

Recall that in our basic KG notebook, we crafted our node MERGE Cypher statement to give all new nodes a base label `__Entity__` along with their type label from our hotels data model. In Neo4j, we can only create a vector index for a single node label, but we might not know all the entity types we will need ahead of time. That is why we gave all our nodes a base entity label that covers every potential entity type. Now that we've added new entity and relationship types to our KG, we're ready to create the vector index that will allow us to perform similarity search.
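
For reference, a single-label vector index of this kind can also be written directly in Cypher. The sketch below only builds the statement as a string. It assumes a Neo4j 5.x release that supports `CREATE VECTOR INDEX` and 1536-dimensional embeddings (the size of `text-embedding-ada-002` vectors); the function name and defaults are hypothetical.

```python
def build_vector_index_cypher(
    index_name: str = "reviews",
    label: str = "__Entity__",
    embedding_property: str = "embedding",
    dimensions: int = 1536,
    similarity: str = "cosine",
) -> str:
    """Build a CREATE VECTOR INDEX statement for a single node label."""
    if similarity not in {"cosine", "euclidean"}:
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    # Backticks quote identifiers such as the __Entity__ base label
    return (
        f"CREATE VECTOR INDEX `{index_name}` IF NOT EXISTS\n"
        f"FOR (n:`{label}`) ON (n.`{embedding_property}`)\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


print(build_vector_index_cypher())
```

Because the statement targets exactly one label, the shared `__Entity__` label is what lets one index cover every node type.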

In [8]:
import os
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.embeddings.openai import OpenAIEmbeddings

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    index_name='reviews',
    node_label="__Entity__",
    text_node_properties=['name', 'description', 'text'],
    embedding_node_property='embedding',
)




In [9]:
# Test the vector search capability with a dash of Langchain
response = vector_index.similarity_search(
    "What positive things are said about the Pera Palace Hotel?"
)

print("Top similarity search result:")
print(response[0].page_content)

Top similarity search result:

name: Pera Perfect\$
description: 
text: Gamze went above and beyond to make all our dining experiences superb at the Pera Palace Hotel. She is most attentive to one's needs and a true asset to the hotel. Our room was clean, comfortable, spacious , and yet reminiscent of a bygone era.  Afternoon tea, accompanied by piano music, was delightful. The  Ataturk quarters were historical and the famous elevator, 1 of the 1st of its kind in Europe, are a must see. \$
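
For intuition, the vector index approximates a ranking that could be computed exhaustively: score every stored embedding against the query embedding and keep the best matches. The sketch below does that with cosine similarity in plain Python on toy 2-dimensional vectors. The names and data are illustrative assumptions; a real index uses HNSW to avoid scanning every node, and its reported scores may be scaled differently.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: List[float],
    vectors: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours: score everything, keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([name for name, _ in top_k([1.0, 0.1], toy_vectors)])
# → ['a', 'c']
```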


# Advanced Retrieval

## Text2Cypher: Building a Flexible Retrieval Pattern

In the basic notebook, we explored a template-based approach to graph retrieval, which works well for common query patterns with clear entities. However, template-based systems are limited to predefined patterns and can't handle novel queries or complex structural relationships.

Text2Cypher is a more flexible retrieval pattern that uses an LLM to generate custom Cypher queries on the fly based on:
1. The natural language query
2. The graph schema
3. Best practices for Cypher query construction

Our applications gain several advantages from this approach:
- **Flexibility**: Can handle arbitrary query patterns not covered by templates
- **Structural Understanding**: Maintains awareness of the graph structure
- **Complex Relationships**: Supports multi-hop traversals and complex filtering

The implementation involves:
1. Fetching the graph schema from Neo4j
2. Using the schema to guide the LLM in generating valid Cypher
3. Executing the generated Cypher and processing the results
4. Generating a natural language answer from the structured results
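
The four steps above can be sketched as a small pipeline with each stage injected as a callable. This is only an outline of the control flow under assumed names (`run_text2cypher` and its parameters are hypothetical); the stand-in lambdas replace the real schema lookup, LLM calls, and database, and their return values are made up for illustration.

```python
from typing import Any, Callable, Dict, List


def run_text2cypher(
    question: str,
    fetch_schema: Callable[[], str],
    generate_cypher: Callable[[str, str], str],
    execute_cypher: Callable[[str], List[Dict[str, Any]]],
    summarize: Callable[[str, List[Dict[str, Any]]], str],
) -> Dict[str, Any]:
    schema = fetch_schema()                     # 1. fetch the graph schema
    cypher = generate_cypher(question, schema)  # 2. question + schema -> Cypher
    rows = execute_cypher(cypher)               # 3. run the generated query
    answer = summarize(question, rows)          # 4. phrase the answer
    return {"cypher_query": cypher, "results": rows, "answer": answer}


# Stand-ins for the real schema lookup, LLM calls, and database
out = run_text2cypher(
    "Which amenities does the hotel have?",
    fetch_schema=lambda: "(:HOTEL)-[:HAS_AMENITY]->(:AMENITY)",
    generate_cypher=lambda q, s: (
        "MATCH (h:HOTEL)-[:HAS_AMENITY]->(a:AMENITY) RETURN a.name AS amenity"
    ),
    execute_cypher=lambda c: [{"amenity": "Free WiFi"}],
    summarize=lambda q, rows: f"Found {len(rows)} amenity.",
)
print(out["answer"])
# → Found 1 amenity.
```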

Let's implement a Text2Cypher retriever from scratch:

### Text2Cypher retriever

In [None]:
import os
from typing import List, Dict, Any, Optional, Union
from openai import OpenAI
import json
from neo4j import GraphDatabase

class Text2CypherRetriever:
    """
    A retriever that converts natural language queries to Cypher queries using LLMs.
    """

    def __init__(self, neo4j_driver: GraphDatabase.driver, openai_client: OpenAI, schema: Optional[str] = None):
        """
        Initialize the Text2Cypher retriever.

        Args:
            neo4j_driver: Neo4j database driver
            openai_client: OpenAI client
            schema: Optional schema information to guide Cypher generation.
                   If None, the schema will be fetched from the database.
        """
        self.driver = neo4j_driver
        self.client = openai_client
        self._schema = schema

    @property
    def schema(self) -> str:
        """Get the Neo4j schema information."""
        if self._schema is None:
            self._schema = self._fetch_schema()
        return self._schema

    def _fetch_schema(self) -> str:
        """Fetch the graph schema information from the Neo4j database."""
        with self.driver.session() as session:
            # Get node labels, properties, and relationship types
            result = session.run("""
            CALL apoc.meta.schema()
            YIELD value
            RETURN value
            """)

            schema_data = result.single()["value"]

            # Format the schema in a way that's easy for the LLM to understand
            formatted_schema = "# Node Labels and Properties\n"

            for node_label, node_data in schema_data.items():
                if node_data.get("type") == "node":
                    formatted_schema += f"\n## {node_label}\n"
                    formatted_schema += "Properties:\n"

                    for prop, prop_data in node_data.get("properties", {}).items():
                        if prop != "embedding":  # Skip embedding properties
                            formatted_schema += f"- {prop}: {prop_data.get('type', 'unknown')}\n"

            formatted_schema += "\n# Relationship Types\n"
            rels = set()

            for node_data in schema_data.values():
                if node_data.get("type") == "node":
                    for rel in node_data.get("relationships", {}).values():
                        rel_type = rel.get("type")
                        if rel_type:
                            rels.add(rel_type)

            for rel in sorted(rels):
                formatted_schema += f"- {rel}\n"

            return formatted_schema

    def generate_cypher(self, query: str) -> str:
        """
        Generate a Cypher query from a natural language query.

        Args:
            query: Natural language query

        Returns:
            Generated Cypher query
        """
        prompt = f"""You are a Neo4j Cypher query expert. Your task is to translate natural language questions into Cypher queries.

Below is the schema of the Neo4j graph database:

{self.schema}

Important Guidelines:
1. Generate only the Cypher query, with no explanations, comments, or markdown formatting.
2. Always return nodes with all their properties to provide complete information.
3. Avoid using the 'embedding' property in your queries - it contains vector data and is very large.
4. Use appropriate aggregation functions (count, avg, collect) when grouping data.
5. Limit results to a reasonable number (10-20 max) when returning many nodes.
6. For more complex queries, consider using multiple MATCH clauses.
7. Make sure to use the correct relationship directions in your query.

Question: {query}

Cypher Query:"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": "You are a Neo4j Cypher query generation assistant."},
                {"role": "user", "content": prompt}
            ]
        )

        cypher_query = response.choices[0].message.content.strip()

        # Clean up any markdown formatting if present
        if cypher_query.startswith("```"):
            cypher_query = cypher_query[3:]
            # Drop the optional language tag left over from the opening fence
            if cypher_query.lower().startswith("cypher"):
                cypher_query = cypher_query[6:]
            if cypher_query.endswith("```"):
                cypher_query = cypher_query[:-3]
            cypher_query = cypher_query.strip()

        return cypher_query

    def execute_cypher(self, cypher_query: str) -> List[Dict[str, Any]]:
        """
        Execute a Cypher query against the Neo4j database.

        Args:
            cypher_query: Cypher query to execute

        Returns:
            List of result records as dictionaries
        """
        try:
            with self.driver.session() as session:
                result = session.run(cypher_query)
                records = [dict(record) for record in result]
                return records
        except Exception as e:
            print(f"Error executing Cypher query: {str(e)}")
            return []

    def retrieve(self, query: str) -> Dict[str, Any]:
        """
        Retrieve information from the Neo4j database using a natural language query.

        Args:
            query: Natural language query

        Returns:
            Dictionary containing the results and query information
        """
        # Generate Cypher query
        cypher_query = self.generate_cypher(query)

        # Execute the query
        results = self.execute_cypher(cypher_query)

        return {
            "query": query,
            "cypher_query": cypher_query,
            "results": results
        }

    def generate_answer(self, query: str, results: List[Dict[str, Any]]) -> str:
        """
        Generate a natural language answer based on query results.

        Args:
            query: Original natural language query
            results: Query results

        Returns:
            Natural language answer
        """
        # Format results for the prompt
        if not results:
            results_text = "No results were found for this query."
        else:
            # Convert results to a clean text representation
            items = []
            for i, item in enumerate(results[:10]):  # Limit to 10 items to keep prompt size reasonable
                item_str = f"Result {i+1}:\n"
                for k, v in item.items():
                    if k == "embedding":
                        continue  # Skip embedding vectors
                    if isinstance(v, (list, set)):
                        v_str = ", ".join(str(x) for x in v if x is not None)
                        item_str += f"  {k}: {v_str}\n"
                    else:
                        item_str += f"  {k}: {v}\n"
                items.append(item_str)

            if len(results) > 10:
                items.append(f"... and {len(results) - 10} more results.")

            results_text = "\n".join(items)

        prompt = f"""
The user asked: "{query}"

I retrieved the following information from the graph database:
{results_text}

Based on this information, provide a helpful, conversational response to the user's query.
Make sure to address all aspects of their question if possible.
If the information retrieved doesn't fully answer their query, acknowledge this limitation.
"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.7,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that provides accurate information based on database query results."},
                {"role": "user", "content": prompt}
            ]
        )

        return response.choices[0].message.content

    def query(self, query: str) -> Dict[str, Any]:
        """
        Process a natural language query end-to-end.

        Returns:
            Dictionary with answer, retrieval info, and intermediate steps
        """
        # Retrieve information
        retrieval_result = self.retrieve(query)

        # Generate answer
        answer = self.generate_answer(query, retrieval_result["results"])

        return {
            "query": query,
            "answer": answer,
            "retrieval_info": {
                "cypher_query": retrieval_result["cypher_query"],
                "result_count": len(retrieval_result["results"])
            },
            "results": retrieval_result["results"]
        }



Now let's initialize and test our Text2Cypher retriever:


In [None]:
text2cypher_retriever = Text2CypherRetriever(
    neo4j_driver=driver,
    openai_client=openai_client
)

In [None]:
text2cypher_retriever.query("What tourist attractions are LOCATED NEARBY the Grant Plaza Hotel?")

{'query': 'What tourist attractions are LOCATED NEARBY the Grant Plaza Hotel?',
 'answer': "Certainly! If you’re staying at the Grant Plaza Hotel, there are some great tourist attractions nearby that you can explore:\n\n1. **Chinatown**: Just a short walk away, Chinatown is a vibrant neighborhood in San Francisco known for its colorful streets and numerous shops and restaurants. It’s a fantastic place to immerse yourself in culture and enjoy some delicious cuisine.\n\n2. **San Francisco Cable Car**: The famous San Francisco Cable Car line is just one block from the Grant Plaza Hotel. Riding the cable car is a quintessential San Francisco experience, offering scenic views of the city as it climbs the hills.\n\n3. **Bank of America Building**: Also located just one block away, the Bank of America Building is situated in the Financial District. While it may not be a traditional tourist attraction, it's noteworthy for its architecture and proximity to other attractions.\n\n4. **Union Squar
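
Because Text2Cypher executes whatever the model generates, it can be worth screening queries before running them. The sketch below is a coarse keyword check that rejects Cypher containing write clauses. `is_read_only` is a hypothetical helper, and the keyword list is an assumption that is not exhaustive (for example, it does not catch procedures that write, and it may reject a harmless query whose string literal contains one of the keywords). A database user with read-only privileges is a more robust safeguard.

```python
import re

# Clauses that modify data or schema (assumed list, not exhaustive)
_WRITE_PATTERN = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True if no write keyword appears in the query text."""
    return _WRITE_PATTERN.search(cypher) is None


print(is_read_only("MATCH (h:HOTEL) RETURN h.name LIMIT 10"))  # → True
print(is_read_only("MATCH (n) DETACH DELETE n"))               # → False
```

A check like this could run between query generation and execution, returning an empty result when the query is rejected.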

## Basic Vector Similarity Retriever Implementation

Let's implement a retriever that uses vector similarity to find relevant nodes in the graph:

In [None]:
class VectorSimilarityRetriever:
    def __init__(self, neo4j_driver: GraphDatabase.driver, openai_client: OpenAI, index_name: str = "reviews"):
        """
        Initialize the vector similarity retriever.

        Args:
            neo4j_driver: Neo4j database driver
            openai_client: OpenAI client
            index_name: Name of the vector index in Neo4j
        """
        self.driver = neo4j_driver
        self.client = openai_client
        self.index_name = index_name
        self.embeddings_model = "text-embedding-ada-002"

    def get_embedding(self, text: str) -> List[float]:
        """
        Get an embedding vector for the given text.

        Args:
            text: Text to embed

        Returns:
            Embedding vector
        """
        response = self.client.embeddings.create(
            model=self.embeddings_model,
            input=text
        )
        return response.data[0].embedding

    def retrieve(self, query: str, limit: int = 10, node_labels: Optional[List[str]] = None) -> Dict[str, Any]:
        """
        Retrieve nodes from the graph based on vector similarity.

        Args:
            query: Query text
            limit: Maximum number of results to return
            node_labels: Optional list of node labels to filter by

        Returns:
            Dictionary containing the results and query information
        """
        # Get embedding for the query
        query_embedding = self.get_embedding(query)

        # Construct Neo4j query for vector search with more targeted results
        cypher_query = f"""
        CALL db.index.vector.queryNodes($index_name, $limit, $query_embedding)
        YIELD node, score
        """

        # Add label filter if provided, otherwise prioritize REVIEW nodes for text queries
        if node_labels:
            label_filters = []
            for label in node_labels:
                label_filters.append(f"node:{label}")
            cypher_query += f"WHERE {' OR '.join(label_filters)}\n"
        else:
            # For review-specific queries, prioritize REVIEW nodes
            if "review" in query.lower() or "said" in query.lower() or "sentiment" in query.lower():
                cypher_query += "WHERE node:REVIEW\n"

        cypher_query += """
        RETURN node, score
        ORDER BY score DESC
        """

        # Execute query
        try:
            with self.driver.session() as session:
                result = session.run(
                    cypher_query,
                    index_name=self.index_name,
                    limit=limit,
                    query_embedding=query_embedding
                )

                # Transform the results into a more usable format
                records = []
                for record in result:
                    node = record["node"]
                    score = record["score"]

                    # Extract all node properties
                    props = dict(node)
                    if "embedding" in props:
                        del props["embedding"]  # Skip embedding vectors

                    # Add score and labels
                    props["similarity_score"] = score
                    props["labels"] = list(node.labels)

                    records.append(props)

                return {
                    "query": query,
                    "results": records
                }
        except Exception as e:
            print(f"Error executing vector search: {str(e)}")
            return {
                "query": query,
                "error": str(e),
                "results": []
            }

    def generate_answer(self, query: str, results: List[Dict[str, Any]]) -> str:
        """
        Generate a natural language answer based on query results.

        Args:
            query: Original natural language query
            results: Query results

        Returns:
            Natural language answer
        """
        # Format results for the prompt
        if not results:
            results_text = "No results were found for this query."
        else:
            # Convert results to a clean text representation
            items = []
            for i, item in enumerate(results[:5]):  # Limit to 5 items
                item_str = f"Result {i+1} (Similarity: {item.get('similarity_score', 'N/A')}):\n"
                for k, v in item.items():
                    if k in ['embedding', 'similarity_score']:
                        continue  # Skip embedding vectors and already displayed score
                    if isinstance(v, (list, set)):
                        v_str = ", ".join(str(x) for x in v if x is not None)
                        item_str += f"  {k}: {v_str}\n"
                    else:
                        item_str += f"  {k}: {v}\n"
                items.append(item_str)

            # Only the first 5 results are shown above, so count the remainder from 5
            if len(results) > 5:
                items.append(f"... and {len(results) - 5} more results.")

            results_text = "\n".join(items)

        prompt = f"""
The user asked: "{query}"

I retrieved the following information using semantic similarity search:
{results_text}

Based on this information, provide a helpful, conversational response to the user's query.
Focus on the most relevant information from the results.
If the information retrieved doesn't fully answer their query, acknowledge this limitation.
"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.7,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that provides accurate information based on semantic search results."},
                {"role": "user", "content": prompt}
            ]
        )

        return response.choices[0].message.content

    def query(self, query: str, limit: int = 10, node_labels: Optional[List[str]] = None) -> Dict[str, Any]:
        """
        Process a natural language query end-to-end.

        Args:
            query: Natural language query
            limit: Maximum number of results to return
            node_labels: Optional list of node labels to filter by

        Returns:
            Dictionary with answer, retrieval info, and results
        """
        # Retrieve information
        retrieval_result = self.retrieve(query, limit, node_labels)

        # Generate answer
        answer = self.generate_answer(query, retrieval_result["results"])

        return {
            "query": query,
            "answer": answer,
            "retrieval_info": {
                "retrieval_method": "vector_similarity",
                "result_count": len(retrieval_result["results"])
            },
            "results": retrieval_result["results"]
        }

In [None]:
vector_retriever = VectorSimilarityRetriever(
    neo4j_driver=driver,
    openai_client=openai_client
)

vector_retriever.query("What positive things are said about the Sirdeci Mansion Hotel?")

{'query': 'What positive things are said about the Sirdeci Mansion Hotel?',
 'answer': "The Sirdeci Mansion Hotel has received many positive reviews from guests, highlighting several aspects that make it a great choice for a stay in Istanbul.\n\n1. **Location**: Guests rave about the hotel's prime location in the old area of Istanbul. It's within walking distance to major attractions like the Blue Mosque, Topkapi Palace, and the Grand Bazaar. This makes it very convenient for exploring the city's rich history and culture.\n\n2. **Friendly and Helpful Staff**: Many reviews emphasize the exceptional service provided by the hotel staff. Guests mention how welcoming and accommodating the staff are, going out of their way to help with directions, recommendations, and even booking tables at nearby restaurants. Specific staff members, like Okay and Fatme, received special mentions for their attentiveness and helpfulness.\n\n3. **Comfortable and Unique Accommodations**: Reviewers appreciate th
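
Two details of `retrieve` are worth hedging against. The `node_labels` values are interpolated into the Cypher text rather than passed as parameters like `$limit`, and the label filter runs after the index has already returned its top `limit` hits, so a filtered query can come back with fewer results than requested. The sketch below is one possible mitigation, not part of the original class: `ALLOWED_LABELS`, `build_label_filter`, and `fetch_size` are hypothetical names, and the label set is an assumption based on the labels that appear in this notebook's Cypher examples.

```python
from typing import Iterable, Optional

# Assumption: labels taken from the Cypher examples in this notebook; adjust to your schema.
ALLOWED_LABELS = {"HOTEL", "REVIEW", "AMENITY", "TOURIST_ATTRACTION"}


def build_label_filter(node_labels: Optional[Iterable[str]]) -> str:
    """Return a Cypher WHERE clause for the given labels, or "" if none are given.

    The labels end up in the query text itself, so each one is checked
    against an allow-list before it is interpolated.
    """
    if not node_labels:
        return ""
    labels = list(node_labels)
    unknown = [label for label in labels if label not in ALLOWED_LABELS]
    if unknown:
        raise ValueError(f"Unknown node labels: {unknown}")
    return "WHERE " + " OR ".join(f"node:{label}" for label in labels)


def fetch_size(limit: int, filtered: bool, factor: int = 3) -> int:
    """Ask the index for more candidates when a post-filter will discard some."""
    return limit * factor if filtered else limit


print(build_label_filter(["REVIEW", "HOTEL"]))  # WHERE node:REVIEW OR node:HOTEL
print(fetch_size(10, filtered=True))            # 30
```

If you over-fetch this way, the query also needs a `LIMIT $limit` after `ORDER BY score DESC` so the caller still receives at most `limit` rows.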

## Router Retriever: Your Homework Assignment
Now that we've implemented three different retrieval strategies, we need a router that can intelligently select the most appropriate strategy for each query. This is where your homework assignment comes in!

Below is a skeleton implementation of a `RouterRetriever` class. Your task is to complete the implementation by filling in the missing parts:

In [None]:
class RouterRetriever:
    """
    A router that selects the appropriate retrieval strategy based on query characteristics.

    This is a skeleton implementation for you to complete as part of the homework assignment.
    """

    def __init__(
        self,
        text2cypher_retriever: Text2CypherRetriever,
        template_retriever: SimpleGraphRAG, # copy from basic notebook
        vector_retriever: VectorSimilarityRetriever,
        openai_client: OpenAI
    ):
        """Initialize with all retriever implementations."""
        self.text2cypher_retriever = text2cypher_retriever
        self.template_retriever = template_retriever
        self.vector_retriever = vector_retriever
        self.client = openai_client

    def route_query(self, query: str) -> Dict[str, Any]:
        """
        Analyze the query and determine which retrieval strategy to use.

        Args:
            query: The natural language query

        Returns:
            Dictionary with the selected strategy and reasoning

        TODO: Implement this method to select the most appropriate retrieval strategy.
        """
        # TODO: Implement query analysis and strategy selection

        # Default implementation (replace with your own)
        return {
            "strategy": "text2cypher",  # Default fallback
            "reasoning": "Default strategy - replace with actual reasoning"
        }

    def query(self, query: str) -> Dict[str, Any]:
        """
        Route the query to the appropriate retriever and return results.

        Args:
            query: The natural language query

        Returns:
            Results from the selected retriever with routing information

        TODO: Complete this method to execute the query using the selected strategy.
        """
        # TODO: Implement query routing
        # 1. Call route_query to determine the best strategy
        # 2. Execute the query using the selected retriever
        # 3. Return the results with added routing information

        # Default implementation (replace with your own)
        strategy = "text2cypher"  # Replace with actual routing logic
        result = self.text2cypher_retriever.query(query)

        # TODO: Add routing information to the result

        return result



### Assignment Guidelines:

1. **Strategy Selection Logic:** Implement `route_query` to analyze the query and select the most appropriate retrieval strategy.

   * Document your reasoning process for different query types

2. **Router Implementation:** Complete the `query` method to route queries to the appropriate retriever.

   * Handle errors gracefully if a strategy fails
   * Provide clear metadata about why a particular strategy was chosen
   * Consider implementing fallback strategies

3. **Testing and Evaluation:**

   * Test your router with a variety of queries
   * Compare the results from different strategies
   * Evaluate the accuracy of your strategy selection logic

4. **Extra Credit:**

   * Implement a hybrid approach (potentially even a dedicated hybrid retriever) that combines results from multiple strategies
   * Implement a feedback mechanism to improve strategy selection over time
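
One way to approach the error-handling and fallback guidelines above is to try the selected strategy first and then fall through the remaining ones in a fixed order. The sketch below assumes every retriever exposes a `query(str) -> dict` method, as the classes in this notebook do. The `run_with_fallback` name, the `routing` key, and the decision to treat a non-empty `"error"` field as a failure are illustrative choices, not part of the assignment skeleton.

```python
from typing import Any, Callable, Dict, List


def run_with_fallback(
    query: str,
    strategies: Dict[str, Callable[[str], Dict[str, Any]]],
    order: List[str],
    reasoning: str = "",
) -> Dict[str, Any]:
    """Try each strategy in `order`; return the first result that has no error."""
    attempts: List[Dict[str, Any]] = []
    for name in order:
        try:
            result = strategies[name](query)
            if result.get("error"):
                raise RuntimeError(result["error"])
        except Exception as exc:  # record the failure and move on to the next strategy
            attempts.append({"strategy": name, "error": str(exc)})
            continue
        attempts.append({"strategy": name, "error": None})
        routing = {"strategy": name, "reasoning": reasoning, "attempts": attempts}
        return {**result, "routing": routing}
    # Every strategy failed: return an empty result that still explains what was tried
    routing = {"strategy": None, "reasoning": reasoning, "attempts": attempts}
    return {"query": query, "answer": None, "results": [], "routing": routing}
```

Inside `RouterRetriever.query`, this could be called with something like `{"text2cypher": self.text2cypher_retriever.query, "vector": self.vector_retriever.query}` and an `order` that starts with the strategy chosen by `route_query`.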


### Decision Criteria for Your Router
When implementing your router, consider these factors to determine the best strategy:

1. **Query Structure:**

   * Template patterns: Common patterns with clear entities (template)
   * Complex relationships: Multi-hop traversals (text2cypher or template) or aggregations (text2cypher)
   * Semantic/conceptual: Opinion or concept-based queries (vector)

2. **Entity Presence:**

   * Clear entity mentions: Specific hotel names, locations, etc. (template or text2cypher)
   * Conceptual descriptions: "Luxury", "family-friendly", etc. (vector)

3. **Query Intent:**

   * Fact retrieval: "How many reviews does hotel X have?" (text2cypher)
   * Opinion extraction: "What do guests say about X?" (vector)
   * Common lookup patterns: "Tell me about hotel X" (template)
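
Before reaching for an LLM classifier, the criteria above can be turned into a rule-based baseline to compare your router against. The sketch below is a starting point only: the keyword patterns, their ordering, and the `heuristic_route` name are illustrative assumptions and would need tuning against real queries.

```python
import re
from typing import Dict

# Illustrative patterns only; rules are checked in order and the first match wins.
RULES = [
    ("text2cypher", r"\b(how many|count|average|most|least|compare)\b",
     "aggregation or fact retrieval"),
    ("vector", r"\b(say|said|think|opinion|feel|luxury|family-friendly|positive|negative)\b",
     "opinion or concept-based query"),
    ("template", r"\b(tell me about|near|nearby|amenities)\b",
     "common lookup pattern"),
]


def heuristic_route(query: str) -> Dict[str, str]:
    """Pick a retrieval strategy from keyword rules, defaulting to text2cypher."""
    text = query.lower()
    for strategy, pattern, reasoning in RULES:
        if re.search(pattern, text):
            return {"strategy": strategy, "reasoning": reasoning}
    return {"strategy": "text2cypher", "reasoning": "no rule matched; default fallback"}


print(heuristic_route("How many reviews does hotel X have?")["strategy"])  # text2cypher
print(heuristic_route("What do guests say about X?")["strategy"])          # vector
print(heuristic_route("Tell me about hotel X")["strategy"])                # template
```

Because the first matching rule wins, rule order encodes priority: a query such as "What do most guests say about X?" would be routed to `text2cypher` by the word "most". Cases like this are a good reason to log the `reasoning` field and evaluate the router on a labeled set of queries.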

# Comparing Vector RAG vs Graph RAG: A Practical Analysis

When implementing RAG systems, it's crucial to understand when graph-based approaches offer meaningful advantages over traditional vector RAG. Let's examine some real queries against our hotel reviews knowledge graph to understand these tradeoffs.

## Case Study 1: Multi-Hop Queries
### Query: "What highly-rated hotels near Fisherman's Wharf offer free Wi-Fi and easy access to cable cars?"

#### Vector RAG Response:
> *Two highly-rated hotels near Fisherman's Wharf that offer free Wi-Fi and easy access to cable cars are
San Francisco Marriott Fisherman's Wharf and Hotel Riu Plaza Fisherman's Wharf.*

#### Graph RAG Response:
Using the following Cypher for the retrieval:
```cypher
MATCH (h:HOTEL)-[:LOCATED_NEARBY]->(fw:TOURIST_ATTRACTION {name: 'Fisherman\'s Wharf'}),
      (h)-[:HAS_AMENITY]->(wifi:AMENITY {name: 'Free Wi-Fi'}),
      (h)-[:LOCATED_NEARBY]->(cc:TOURIST_ATTRACTION {name: 'San Francisco Cable Car line'}),
      (h)-[:HAS_REVIEW]->(r:REVIEW)
WHERE r.rating_value >= 4.0
RETURN DISTINCT h.name
```

>*The Hotel Riu Plaza Fisherman’s Wharf, located in San Francisco, offers free Wi-Fi and easy access to the San Francisco Cable Car line.*

#### Analysis:
1. **Accuracy**: Vector RAG incorrectly includes SF Marriott, which doesn't mention free Wi-Fi in its description. Graph RAG correctly identifies only Hotel Riu by following explicit relationships.

2. **Query Processing**:
   - Vector RAG attempts to infer relationships from text proximity
   - Graph RAG follows verified relationship paths
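   - Multi-hop queries strongly favor the graph approach

3. **Confidence**: Graph RAG returns a hotel only when every relationship in the pattern exists in the graph, so its answers are as reliable as the graph itself (including any LLM-extracted edges). Vector RAG has no such check and infers relationships from the retrieved text.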
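
To see why explicit relationships matter for this kind of query, the Cypher pattern above can be read as an intersection: a hotel is returned only if an edge exists for every constraint. The toy sketch below illustrates that reading with made-up hotels and edges. None of it is data from the notebook's graph, and `hotels_matching` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple

# Hypothetical edges: (hotel, relationship, target). Not taken from the real graph.
EDGES: Set[Tuple[str, str, str]] = {
    ("Hotel A", "LOCATED_NEARBY", "Fisherman's Wharf"),
    ("Hotel A", "HAS_AMENITY", "Free Wi-Fi"),
    ("Hotel A", "LOCATED_NEARBY", "San Francisco Cable Car line"),
    ("Hotel B", "LOCATED_NEARBY", "Fisherman's Wharf"),
    ("Hotel B", "LOCATED_NEARBY", "San Francisco Cable Car line"),
}


def hotels_matching(constraints: Set[Tuple[str, str]]) -> Set[str]:
    """Return hotels that have an edge for every (relationship, target) constraint."""
    by_hotel: Dict[str, Set[Tuple[str, str]]] = {}
    for hotel, rel, target in EDGES:
        by_hotel.setdefault(hotel, set()).add((rel, target))
    # A hotel qualifies only if the constraint set is a subset of its edges
    return {hotel for hotel, edges in by_hotel.items() if constraints <= edges}


result = hotels_matching({
    ("LOCATED_NEARBY", "Fisherman's Wharf"),
    ("HAS_AMENITY", "Free Wi-Fi"),
    ("LOCATED_NEARBY", "San Francisco Cable Car line"),
})
print(result)  # {'Hotel A'}
```

"Hotel B" is excluded because it has no `HAS_AMENITY` edge, no matter how similar its description might be to the query text.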

## Case Study 2: Amenity Comparisons
### Query: "Compare amenities and ratings for hotels near Fisherman's Wharf and Pier 39"

#### Vector RAG Response:
```
Based on the information provided:

1. **San Francisco Marriott Fisherman's Wharf**:
   - Amenities: Fitness Center, Handicap Accessible Rooms, Safety Deposit Boxes, Iron, Ironing Board, Hairdryers.
   - Rating: No specific rating mentioned.

2. **Hotel Riu Plaza Fisherman’s Wharf**:
   - Amenities: Free WiFi, Outdoor Pool, Gym, Conference Rooms, Satellite TV, Mini-fridge, Coffee Machine, AC, Buffet Breakfast, Grab & Go Bar.
   - Rating: No specific rating mentioned.

3. **Hotel Zephyr San Francisco**:
   - Amenities: Not specified in the provided information.
   - Rating: No specific rating mentioned.

4. **PIER 39**:
   - Amenities: Shops, Restaurants, Sea lions.
   - Rating: No specific rating mentioned.

Unfortunately, the information provided does not include specific ratings for the hotels near Fisherman's Wharf and Pier 39, so I cannot provide a direct comparison in terms of ratings. If you have a specific rating range in mind or other criteria for comparison, please let me know.
```

#### Graph RAG Response:

```
The Hyatt Centric Fisherman's Wharf San Francisco and Hotel Riu Plaza Fisherman’s Wharf are two hotels near both Fisherman's Wharf and PIER 39.

The Hyatt Centric offers amenities such as Brick & Beam, a Fitness Center, Free Wi-Fi, Meeting Rooms, and a Relaxing Pool. It has an average rating of 4.33.

The Hotel Riu Plaza provides amenities like a Buffet Breakfast Service, a Bar, a Minibar, a TV with Free Movies, and Coffee and Tea Facilities. It has an average rating of 4.0.
```

#### Analysis:
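1. **Structured Data**: Graph RAG excels at comparing discrete attributes. It lists amenities for each hotel, while vector RAG reports "Not specified" for Hotel Zephyr and includes PIER 39, an attraction, among the items being compared.
2. **Relationship Context**: Graph RAG restricts the comparison to hotels that are near both landmarks. Vector RAG has no explicit notion of "near both" and works from whichever passages it retrieved.
3. **Aggregation**: Graph RAG can compute statistics across relationship patterns, such as the average ratings of 4.33 and 4.0 above. Vector RAG found no ratings in its retrieved text.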
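
The aggregation point is worth making concrete, because "highly-rated" can be modeled in more than one way. The Cypher in Case Study 1 keeps a hotel if at least one review has `rating_value >= 4.0`, whereas the Graph RAG answer here reports an average. The toy sketch below shows the difference using invented ratings rather than values from the dataset; both function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical ratings per hotel; not real review data.
RATINGS: Dict[str, List[float]] = {
    "Hotel A": [5.0, 4.0, 4.0],
    "Hotel B": [5.0, 2.0, 2.0],
}


def any_review_at_least(ratings: Dict[str, List[float]], threshold: float) -> List[str]:
    """Hotels with at least one review at or above the threshold."""
    return sorted(h for h, rs in ratings.items() if any(r >= threshold for r in rs))


def average_at_least(ratings: Dict[str, List[float]], threshold: float) -> List[str]:
    """Hotels whose mean rating is at or above the threshold."""
    return sorted(h for h, rs in ratings.items() if mean(rs) >= threshold)


print(any_review_at_least(RATINGS, 4.0))  # ['Hotel A', 'Hotel B']
print(average_at_least(RATINGS, 4.0))     # ['Hotel A']
```

In Cypher, the average-based version would typically aggregate with `avg(r.rating_value)` in a `WITH` clause and filter on the result, which is the kind of computation a graph query can express directly.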

## Key Findings

### When to Use Graph RAG:
1. **Relationship-Critical Queries**
   - Multiple hops required (e.g., "hotels near X with amenity Y")
   - Relationship accuracy matters
   - Complex pattern matching

2. **Structured Comparisons**
   - Comparing entities across relationships
   - Aggregating across relationship patterns
   - Need for verified connections

3. **Hybrid Questions**
   - Combining factual relationships with semantic search
   - Need both structured and unstructured insights

### When Traditional Vector RAG Suffices:
1. **Simple Semantic Queries**
   - Single-entity questions
   - General descriptions or summaries
   - No relationship traversal needed

2. **Fuzzy Matching**
   - When exact relationship matching isn't critical
   - Flexible interpretation acceptable
   - General sentiment or topic analysis

## Implementation Considerations

1. **Data Quality Requirements**
   - Graph RAG requires explicit relationship modeling
   - Higher upfront cost in knowledge graph construction
   - Need for relationship maintenance/updates

2. **Query Complexity**
   - Graph queries can be more complex to construct
   - Need for query optimization
   - Hybrid approaches often optimal

3. **System Architecture**
   - Graph databases add operational complexity
   - Vector indices still valuable for semantic search
   - Consider hybrid architectures for complex applications
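
For the hybrid architectures mentioned above, one common way to merge ranked results from a graph query and a vector search is reciprocal rank fusion (RRF). This is a general ranking technique, not something this notebook implements. The sketch assumes each retriever returns an ordered list of comparable identifiers (for example, hotel names); the function name and the `k = 60` default are conventional choices, and the hit lists are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by score (descending), then by name for a deterministic order on ties
    return sorted(scores, key=lambda item: (-scores[item], item))


graph_hits = ["Hotel A", "Hotel C"]              # hypothetical graph results
vector_hits = ["Hotel B", "Hotel A", "Hotel C"]  # hypothetical vector results
fused = reciprocal_rank_fusion([graph_hits, vector_hits])
print(fused)  # ['Hotel A', 'Hotel C', 'Hotel B']
```

Items that both retrievers return rise to the top, even when one retriever ranks something else first, which is often the behavior you want when combining verified relationships with semantic matches.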


