<a href="https://colab.research.google.com/github/xbwei/data-analysis-with-generative-ai/blob/main/GraphRAG_Social_Media_Neo4j.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 GraphRAG: Retrieval-Augmented Generation with Neo4j Knowledge Graph

**Ask natural-language questions and get graph-powered answers.**

In the previous notebook (`Social_Media_ETL_Neo4j_Python.ipynb`) we built a social media knowledge graph in Neo4j with **Users, Tweets, Hashtags, and Places**. Now we will layer **AI** on top of that graph to create a **GraphRAG** pipeline.

**What is GraphRAG?**

Traditional RAG retrieves relevant text chunks from a vector store. **GraphRAG** goes further by combining:
1. **Vector Search**: find semantically similar tweets using embeddings.
2. **Graph Traversal**: follow relationships (who posted it, where, which hashtags) to enrich context.
3. **LLM Generation**: pass the enriched context to a language model for a grounded answer.
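
The three stages can be illustrated without any database. The following is a toy sketch, not the notebook's pipeline: the tweets, the 2-d vectors, and the `AUTHOR`/`HASHTAGS` lookups are made-up stand-ins for real embeddings and graph relationships, and the "generation" step only builds the prompt an LLM would receive.

```python
import math

# Hypothetical toy data: tweet id -> (text, tiny stand-in embedding)
TWEETS = {
    "t1": ("Graph databases make joins cheap", [0.9, 0.1]),
    "t2": ("Best pizza in town", [0.1, 0.9]),
    "t3": ("Neo4j vector index is handy", [0.8, 0.3]),
}
# Hypothetical "graph": relationships stored as plain dicts
AUTHOR = {"t1": "alice", "t2": "bob", "t3": "alice"}
HASHTAGS = {"t1": ["graphdb"], "t2": ["food"], "t3": ["neo4j", "graphdb"]}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def toy_graph_rag(query_vec, k=2):
    # 1. Vector search: rank tweets by cosine similarity to the query
    ranked = sorted(TWEETS, key=lambda t: cosine(query_vec, TWEETS[t][1]), reverse=True)[:k]
    # 2. Graph traversal: attach author and hashtags to each hit
    context = [
        {"id": t, "text": TWEETS[t][0], "author": AUTHOR[t], "hashtags": HASHTAGS[t]}
        for t in ranked
    ]
    # 3. Generation: here we only assemble the prompt text
    lines = [f"@{c['author']}: {c['text']} (tags: {', '.join(c['hashtags'])})" for c in context]
    return context, "Context:\n" + "\n".join(lines)


context, prompt = toy_graph_rag([1.0, 0.0])
print(prompt)
```

The rest of the notebook replaces each stand-in with the real component: Gemini embeddings, a Neo4j vector index, Cypher traversals, and a Gemini model call.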

![GraphRAG Pipeline](https://github.com/lbsocial/data-analysis-with-generative-ai/blob/1660b372f63accb8e55ea4b439924ba764639182/image/Gemini_Generated_Image_vslfdxvslfdxvslf.png?raw=true)

**What we will build:**
| Step | Action |
|------|--------|
| 1 | Install dependencies |
| 2 | Connect to Neo4j & Gemini |
| 3 | Create vector embeddings for every Tweet |
| 4 | Build a Neo4j Vector Index |
| 5 | Implement pure Vector Search retrieval |
| 6 | Implement GraphRAG retrieval (Vector + Graph Traversal) |
| 7 | Build the full RAG pipeline with Gemini |
| 8 | Compare Vector-only vs GraphRAG answers |
| 9 | Interactive GraphRAG query |
| 10 | Advanced: Cypher-Augmented Generation |
| 11 | Geospatial queries & Geo-Augmented GraphRAG |

---


> **Why no LangChain?** Frameworks like LangChain provide convenient abstractions (`GraphCypherQAChain`, `Neo4jVector`, etc.) that can simplify production code. However, in this tutorial we use the **Neo4j Python driver** and **Google GenAI SDK** directly so you can see exactly how each piece works: embedding, vector search, graph traversal, prompt formatting, and LLM generation. Once you understand these building blocks, adopting a framework becomes much easier.

## 🛠️ Step 1: Install Dependencies

In [None]:
pip install neo4j google-genai -q

## 🔌 Step 2: Connect to Neo4j & Gemini

We need two connections:
- **Neo4j**: our knowledge graph.
- **Google Gemini**: for generating embeddings and LLM answers.

> **Note:** Store your credentials as environment variables or in Colab Secrets.
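
If you also want to run the notebook outside Colab, one option is a small lookup helper that tries Colab Secrets first and falls back to environment variables. This is a sketch under that assumption; `get_secret` is a hypothetical helper name and is not used by the cells below.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing there: use env vars instead
        value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing credential: set {name} as a Colab Secret or environment variable"
        )
    return value
```

With this in place, `get_secret("NEO4J_URI")` could stand in for `userdata.get('NEO4J_URI')`, and a missing credential fails early with a readable message.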

In [None]:
from google.colab import userdata
from neo4j import GraphDatabase
from google import genai

# ── Neo4j Credentials (from Colab Secrets) ──────────────
NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_PASSWORD = userdata.get('NEO4J_PASSWORD')
NEO4J_AUTH = ("neo4j", NEO4J_PASSWORD)

# ── Gemini Client (from Colab Secrets) ──────────────────
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

# ── Verify Neo4j Connection ─────────────────────────────
driver = GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH)
driver.verify_connectivity()
print("✅ Connected to Neo4j")

# ── Verify Gemini Connection ────────────────────────────
test = client.models.get(model="gemini-2.0-flash")
print(f"✅ Connected to Gemini ({test.display_name})")

## 📊 Step 2b: Verify the Graph Data

Let's confirm the graph we built in the ETL notebook is ready.

In [None]:
# Quick health check: count nodes by label
with driver.session(database="neo4j") as session:
    result = session.run("""
        CALL {
            MATCH (t:Tweet) RETURN 'Tweet' AS label, count(t) AS count
            UNION ALL
            MATCH (u:User)  RETURN 'User'  AS label, count(u) AS count
            UNION ALL
            MATCH (h:Hashtag) RETURN 'Hashtag' AS label, count(h) AS count
            UNION ALL
            MATCH (p:Place) RETURN 'Place' AS label, count(p) AS count
        }
        RETURN label, count
    """)
    print("── Graph Node Summary ──")
    for record in result:
        print(f"  {record['label']:>10s}: {record['count']}")

# Sample up to three tweets with all their relationships
with driver.session(database="neo4j") as session:
    result = session.run("""
        MATCH (u:User)-[:POSTED]->(t:Tweet)-[:LOCATED_AT]->(p:Place)
        OPTIONAL MATCH (t)-[:TAGGED_WITH]->(h:Hashtag)
        RETURN u.username AS user, t.text AS text,
               p.name AS place, collect(h.name) AS hashtags
        LIMIT 3
    """)
    print("\n── Sample Tweets ──")
    for record in result:
        print(f"  @{record['user']} from {record['place']}")
        print(f"    \"{record['text'][:80]}...\"")
        print(f"    Tags: {record['hashtags']}\n")

---
## 🧬 Step 3: Generate Vector Embeddings for Tweets

We use Gemini's `gemini-embedding-001` model to convert every tweet's text into a **3072-dimensional vector**. This vector captures the *semantic meaning* of the text.

We then write each embedding back to the `Tweet` node as a property called `embedding`.

In [None]:
EMBEDDING_MODEL = "models/gemini-embedding-001"
EMBEDDING_DIM = 3072


def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get Gemini embeddings for a batch of texts."""
    response = client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=texts,
    )
    return [emb.values for emb in response.embeddings]


# ── Fetch all tweet texts ───────────────────────────────
with driver.session(database="neo4j") as session:
    result = session.run("MATCH (t:Tweet) RETURN t.id AS id, t.text AS text")
    tweets = [(record["id"], record["text"]) for record in result]

print(f"📄 Found {len(tweets)} tweets to embed.")

# ── Batch embed (20 tweets at a time) ───────────────────
BATCH_SIZE = 20
for i in range(0, len(tweets), BATCH_SIZE):
    batch = tweets[i : i + BATCH_SIZE]
    ids = [t[0] for t in batch]
    texts = [t[1] for t in batch]

    embeddings = get_embeddings(texts)

    # Write embeddings back to Neo4j
    with driver.session(database="neo4j") as session:
        session.run(
            """
            UNWIND $data AS row
            MATCH (t:Tweet {id: row.id})
            SET t.embedding = row.embedding
            """,
            data=[{"id": id_, "embedding": emb} for id_, emb in zip(ids, embeddings)],
        )

    print(f"  ✅ Embedded batch {i // BATCH_SIZE + 1}/{(len(tweets) - 1) // BATCH_SIZE + 1}")


print("\n🎉 All tweet embeddings stored in Neo4j!")
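
The batching loop above relies on list slicing and a ceiling-division progress counter. As a standalone sketch of the same arithmetic (`chunked` and `batch_count` are hypothetical helper names, not part of the notebook):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def batch_count(n_items: int, size: int) -> int:
    """Ceiling division: number of batches needed for n_items."""
    return (n_items + size - 1) // size


demo = list(range(45))
batches = list(chunked(demo, 20))
print([len(b) for b in batches], batch_count(len(demo), 20))
```

Only the last batch can be shorter than `size`, which is why the loop needs no special handling for a tweet count that is not a multiple of the batch size.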

---
## 🗂️ Step 4: Create a Neo4j Vector Index

A **vector index** lets Neo4j perform fast approximate-nearest-neighbor (ANN) search on the embedding property. This is the foundation of our retrieval step.
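
To make the index's job concrete, here is a minimal brute-force sketch of the exact search that an ANN index approximates. The `top_k_cosine` helper and the toy vectors are illustrative assumptions: Neo4j's index avoids this linear scan, and the score it reports may be normalized differently from the raw cosine values computed here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cosine(query: list[float], vectors: dict, k: int = 2) -> list[tuple]:
    """Exact nearest neighbours by linear scan: one comparison per stored vector."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k_cosine([1.0, 0.2], toy_vectors, k=2)
print(hits)
```

A linear scan like this costs one comparison per tweet for every query, which is the cost the index is meant to avoid as the graph grows.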

In [None]:
INDEX_NAME = "tweet_embeddings"

with driver.session(database="neo4j") as session:
    # Ensure the index is dropped idempotently
    print(f"Attempting to drop existing index '{INDEX_NAME}' if it exists...")
    session.run(f"DROP INDEX {INDEX_NAME} IF EXISTS")
    print(f"✅ Index '{INDEX_NAME}' dropped (if it existed).")

    # Create the vector index with the correct dimensions
    session.run(f"""
        CREATE VECTOR INDEX {INDEX_NAME}
        FOR (t:Tweet)
        ON (t.embedding)
        OPTIONS {{
            indexConfig: {{
                `vector.dimensions`: {EMBEDDING_DIM},
                `vector.similarity_function`: 'cosine'
            }}
        }}
    """)
    print(f"✅ Vector index '{INDEX_NAME}' created with {EMBEDDING_DIM} dimensions.")

# Verify index details, specifically dimensions
with driver.session(database="neo4j") as session:
    # The index configuration lives in the `options` column, which SHOW only
    # returns when requested via YIELD; `properties` just lists the indexed
    # property names. We filter by name in Python.
    result = session.run("SHOW VECTOR INDEXES YIELD name, state, options")
    found_index = False
    for record in result:
        if record["name"] == INDEX_NAME:
            found_index = True
            print(f"  Index: {record['name']}")
            print(f"  State: {record['state']}")
            # Extract dimensions from the nested indexConfig map
            config = (record["options"] or {}).get("indexConfig", {})
            dimensions = config.get("vector.dimensions")
            if dimensions:
                print(f"  Actual Dimensions: {dimensions}")
            else:
                print("  Dimensions not found in index options.")
            break
    if not found_index:
        print(f"❌ Index '{INDEX_NAME}' not found after creation.")

---
## 🔍 Step 5: Pure Vector Search (Baseline)

First, let's implement a plain **vector similarity search**. We embed the user's question, then find the `k` most similar tweets. This is what a traditional RAG system does: **no graph knowledge** is used.

In [None]:
def vector_search(question: str, k: int = 5) -> list[dict]:
    """
    Plain vector search: returns the top-k most similar tweets.
    No graph traversal.
    """
    q_embedding = get_embeddings([question])[0]

    cypher = """
        CALL db.index.vector.queryNodes($index, $k, $embedding)
        YIELD node AS tweet, score
        RETURN tweet.id    AS id,
               tweet.text  AS text,
               tweet.likes AS likes,
               score
        ORDER BY score DESC
    """
    with driver.session(database="neo4j") as session:
        result = session.run(cypher, index=INDEX_NAME, k=k, embedding=q_embedding)
        return [dict(record) for record in result]


# ── Test it ─────────────────────────────────────────────
question = "What are people saying about graph databases?"
results = vector_search(question)

print(f"🔎 Vector Search Results for: '{question}'\n")
for i, r in enumerate(results, 1):
    print(f"  {i}. [score={r['score']:.4f}] {r['text'][:90]}")

---
## 🕸️ Step 6: GraphRAG Retrieval (Vector + Graph Traversal)

This is the key innovation. After finding similar tweets via vector search, we **traverse the graph** to collect rich context:

| Traversal | What we get |
|-----------|-------------|
| `Tweet ← POSTED ← User` | Who wrote it? How many followers? |
| `Tweet → LOCATED_AT → Place` | Where was it posted? |
| `Tweet → TAGGED_WITH → Hashtag` | What topics is it about? |
| `User → POSTED → OtherTweets` | What else has this user said? (context expansion) |
| `Hashtag ← TAGGED_WITH ← OtherTweets` | What else is tagged with the same topics? |

This gives the LLM **much richer context** than bare text chunks.

In [None]:
def graph_rag_search(question: str, k: int = 5) -> list[dict]:
    """
    GraphRAG retrieval: vector search + graph traversal.
    Returns enriched context for each matching tweet.
    """
    q_embedding = get_embeddings([question])[0]

    cypher = """
        // ── 1. Vector Search: find semantically similar tweets ──
        CALL db.index.vector.queryNodes($index, $k, $embedding)
        YIELD node AS tweet, score

        // ── 2. Graph Traversal: enrich with relationships ──
        // Get the author
        MATCH (author:User)-[:POSTED]->(tweet)

        // Get the location
        OPTIONAL MATCH (tweet)-[:LOCATED_AT]->(place:Place)

        // Get all hashtags on this tweet
        OPTIONAL MATCH (tweet)-[:TAGGED_WITH]->(hashtag:Hashtag)

        // ── 3. Context Expansion: other tweets by same author ──
        OPTIONAL MATCH (author)-[:POSTED]->(other_tweet:Tweet)
        WHERE other_tweet.id <> tweet.id

        // ── 4. Context Expansion: co-occurring tweets via hashtags ──
        OPTIONAL MATCH (hashtag)<-[:TAGGED_WITH]-(related_tweet:Tweet)
        WHERE related_tweet.id <> tweet.id

        RETURN tweet.id                        AS tweet_id,
               tweet.text                      AS text,
               tweet.likes                     AS likes,
               tweet.retweets                  AS retweets,
               score,

               // Author context
               author.username                 AS author,
               author.followers                AS author_followers,

               // Location context
               place.name                      AS location,
               place.country                   AS country,

               // Hashtag context
               collect(DISTINCT hashtag.name)   AS hashtags,

               // Expanded context
               collect(DISTINCT other_tweet.text)[0..3]   AS author_other_tweets,
               collect(DISTINCT related_tweet.text)[0..3] AS related_tweets_via_hashtag

        ORDER BY score DESC
    """
    with driver.session(database="neo4j") as session:
        result = session.run(cypher, index=INDEX_NAME, k=k, embedding=q_embedding)
        return [dict(record) for record in result]


# ── Test it ─────────────────────────────────────────────
question = "What are people saying about graph databases?"
results = graph_rag_search(question)

print(f"🕸️ GraphRAG Results for: '{question}'\n")
for i, r in enumerate(results, 1):
    print(f"  {i}. [score={r['score']:.4f}] @{r['author']} ({r['author_followers']} followers)")
    print(f"     📍 {r['location']}, {r['country']}")
    print(f"     💬 \"{r['text'][:80]}...\"")
    print(f"     🏷️  Tags: {r['hashtags']}")
    if r['related_tweets_via_hashtag']:
        print(f"     🔗 Related: \"{r['related_tweets_via_hashtag'][0][:60]}...\"")
    print()

---
## 🤖 Step 7: Build the Full RAG Pipeline with Gemini

Now we connect everything: **retrieve** enriched context from the graph, **format** it into a prompt, and **generate** an answer with Gemini.

We build two pipelines side-by-side:
- `rag_answer()`: plain vector retrieval + Gemini
- `graph_rag_answer()`: GraphRAG retrieval + Gemini

In [45]:
LLM_MODEL = "gemini-2.0-flash"

SYSTEM_PROMPT = """You are a social media analyst assistant.
Answer the user's question based ONLY on the retrieved context below.
If the context doesn't contain enough information, say so.
Always cite specific tweets, users, or locations when relevant.
Be concise but thorough."""


def format_vector_context(results: list[dict]) -> str:
    """Format plain vector search results into a text block."""
    lines = []
    for i, r in enumerate(results, 1):
        lines.append(f"Tweet {i} (similarity={r['score']:.3f}, likes={r['likes']}):\n  \"{r['text']}\"")
    return "\n\n".join(lines)


def format_graph_context(results: list[dict]) -> str:
    """Format GraphRAG results into a rich context block."""
    lines = []
    for i, r in enumerate(results, 1):
        block = f"""Tweet {i} (similarity={r['score']:.3f}):
  Text: "{r['text']}"
  Author: @{r['author']} ({r['author_followers']} followers)
  Location: {r['location']}, {r['country']}
  Engagement: {r['likes']} likes, {r['retweets']} retweets
  Hashtags: {', '.join(r['hashtags'])}"""
        if r.get('author_other_tweets'):
            block += f"\n  Other tweets by @{r['author']}:"
            for t in r['author_other_tweets']:
                block += f"\n    - \"{t[:80]}...\""
        if r.get('related_tweets_via_hashtag'):
            block += f"\n  Related tweets (same hashtags):"
            for t in r['related_tweets_via_hashtag']:
                block += f"\n    - \"{t[:80]}...\""
        lines.append(block)
    return "\n\n".join(lines)


def rag_answer(question: str, k: int = 5) -> str:
    """Traditional RAG: vector search + LLM."""
    results = vector_search(question, k)
    context = format_vector_context(results)

    response = client.models.generate_content(
        model=LLM_MODEL,
        contents=f"Context:\n{context}\n\nQuestion: {question}",
        config=genai.types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.2,
        ),
    )
    return response.text


def graph_rag_answer(question: str, k: int = 5) -> str:
    """GraphRAG: vector search + graph traversal + LLM."""
    results = graph_rag_search(question, k)
    context = format_graph_context(results)

    response = client.models.generate_content(
        model=LLM_MODEL,
        contents=f"Context:\n{context}\n\nQuestion: {question}",
        config=genai.types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.2,
        ),
    )
    return response.text


print("✅ RAG pipelines ready.")

✅ RAG pipelines ready.


---
## ⚖️ Step 8: Compare Vector-Only RAG vs GraphRAG

Let's ask the same questions and compare the quality of answers.

In [None]:
comparison_questions = [
    "What are people saying about graph databases and who are the most active users talking about them?",
    "Which cities generate the most discussion about AI and cloud computing?",
    "Are there any users who tweet about both Python and Neo4j? What are they saying?",
]

for question in comparison_questions:
    print("=" * 80)
    print(f"❓ Question: {question}")
    print("=" * 80)

    print("\n📄 ── TRADITIONAL RAG (Vector Only) ──")
    print(rag_answer(question))

    print("\n🕸️ ── GRAPHRAG (Vector + Graph Traversal) ──")
    print(graph_rag_answer(question))
    print("\n")

---
## 🧪 Step 9: Interactive GraphRAG Query

Try your own questions below!

In [None]:
# ── Change this question to anything you want ───────────
your_question = "What topics are trending in San Francisco?"

print(f"❓ {your_question}\n")
print(graph_rag_answer(your_question))

---
## 🔬 Step 10 (Advanced): Cypher-Augmented Generation

For structured analytical questions ("How many tweets per city?"), we can let the LLM **generate Cypher queries** directly. This combines the power of graph queries with natural language.

In [None]:
CYPHER_SYSTEM_PROMPT = """You are a Neo4j Cypher expert. Given the user's question, generate a Cypher query to answer it.

The graph schema is:
- (:User {id, username, name, followers, following, tweet_count})
    -[:POSTED]->(:Tweet {id, text, created_at, likes, retweets, replies, location, embedding})
- (:Tweet)-[:LOCATED_AT]->(:Place {name, country, location})
- (:Tweet)-[:TAGGED_WITH]->(:Hashtag {name})

Rules:
- Return ONLY raw Cypher (no markdown, no explanation, no code fences).
- Use LIMIT 10 unless the user asks for more.
- Use meaningful aliases in RETURN clauses.
"""


def cypher_rag_answer(question: str) -> str:
    """Let the LLM generate a Cypher query, run it, then summarize results."""

    # Step 1: Generate Cypher
    cypher_response = client.models.generate_content(
        model=LLM_MODEL,
        contents=question,
        config=genai.types.GenerateContentConfig(
            system_instruction=CYPHER_SYSTEM_PROMPT,
            temperature=0.0,
        ),
    )
    cypher_query = cypher_response.text.strip()
    print(f"🔧 Generated Cypher:\n{cypher_query}\n")

    # Step 2: Execute the Cypher query
    try:
        with driver.session(database="neo4j") as session:
            result = session.run(cypher_query)
            records = [dict(record) for record in result]
    except Exception as e:
        return f"❌ Cypher execution error: {e}"

    if not records:
        return "No results found."

    # Step 3: Summarize with LLM
    import json
    data_str = json.dumps(records, indent=2, default=str)

    summary_response = client.models.generate_content(
        model=LLM_MODEL,
        contents=f"Question: {question}\n\nQuery results:\n{data_str}",
        config=genai.types.GenerateContentConfig(
            system_instruction="Summarize the following database query results in a clear, human-readable way. Use bullet points or a table if appropriate.",
            temperature=0.2,
        ),
    )
    return summary_response.text


print("✅ Cypher-Augmented Generation ready.")
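
Because the generated Cypher is executed as-is, it can help to clean and screen it first. The sketch below is one possible guard, not part of the pipeline above: `clean_cypher`, `is_read_only`, and the keyword blocklist are hypothetical. The blocklist is intentionally conservative, so it can reject a harmless query that merely mentions one of these words in a string, and it does not catch procedures that write data.

```python
import re

# Clauses that modify data or schema (a hypothetical, non-exhaustive blocklist)
WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)


def clean_cypher(raw: str) -> str:
    """Strip an optional Markdown code fence around LLM output."""
    text = raw.strip()
    match = re.match(r"^```[a-zA-Z]*\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1).strip() if match else text


def is_read_only(cypher: str) -> bool:
    """Return True if none of the blocklisted keywords appears in the query."""
    return WRITE_KEYWORDS.search(cypher) is None


fenced = "```cypher\nMATCH (t:Tweet) RETURN count(t) AS tweets\n```"
query = clean_cypher(fenced)
print(query, is_read_only(query))
```

Inside `cypher_rag_answer`, such a check would sit between generating the query and running it. Connecting with a database user that only has read privileges is a stronger safeguard than any keyword filter.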

In [None]:
# ── Analytical questions that benefit from Cypher ───────
analytical_questions = [
    "Which user has the most followers and what do they tweet about?",
    "How many tweets were posted from each city?",
    "Which hashtags are most frequently used together?",
]

for q in analytical_questions:
    print("=" * 80)
    print(f"❓ {q}")
    print("=" * 80)
    print(cypher_rag_answer(q))
    print()

---
## 🌍 Step 11: Geospatial Queries: Find Tweets Near a Location

Every Tweet and Place node in our graph has a `point()` property storing longitude/latitude. Neo4j's built-in `point.distance()` function lets us find tweets **within a radius** of any coordinate, with no external GIS tools needed.

**Use cases:**
- "What are people tweeting about near Times Square?"
- "Which users are active within 50 km of Tokyo?"
- "What topics trend in a geographic cluster?"
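
For geographic points, `point.distance()` returns a distance in metres, which is why the queries below convert kilometres to metres. As a rough plain-Python counterpart, the haversine great-circle formula gives comparable numbers. This is a sketch for intuition only: the Earth radius constant and the two coordinate pairs are assumed values, and the result need not match Neo4j's output exactly.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (an approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, assumed for illustration
sf = (37.7749, -122.4194)
oakland = (37.8044, -122.2712)
d = haversine_m(*sf, *oakland)
print(f"{d / 1000:.1f} km, within 50 km: {d < 50_000}")
```

The radius test `d < 50_000` mirrors the `point.distance(...) < $radius_m` filter used in the Cypher queries in this step.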

In [None]:
def find_tweets_near(city_name: str, radius_km: int = 50, limit: int = 10) -> list[dict]:
    """
    Find tweets posted within `radius_km` of a city center.
    Uses Neo4j's point.distance() for server-side geospatial filtering.
    """
    cypher = """
        // Get the city center point
        MATCH (p:Place {name: $city})
        WITH p.location AS center

        // Find tweets within radius
        MATCH (u:User)-[:POSTED]->(t:Tweet)-[:LOCATED_AT]->(place:Place)
        WHERE point.distance(t.location, center) < $radius_m
        OPTIONAL MATCH (t)-[:TAGGED_WITH]->(h:Hashtag)

        RETURN t.text                          AS text,
               u.username                      AS author,
               place.name                      AS place,
               collect(DISTINCT h.name)        AS hashtags,
               t.likes                         AS likes,
               round(point.distance(t.location, center) / 1000.0, 2) AS distance_km
        ORDER BY distance_km ASC
        LIMIT $limit
    """
    with driver.session(database="neo4j") as session:
        result = session.run(cypher, city=city_name, radius_m=radius_km * 1000, limit=limit)
        return [dict(record) for record in result]


# ── Demo: find tweets near multiple cities ──────────────
geo_demos = [
    ("San Francisco, CA", 50),
    ("Tokyo, JP", 30),
]

for city, radius in geo_demos:
    results = find_tweets_near(city, radius_km=radius)
    print(f"📍 Tweets near {city} (within {radius} km):\n")
    for i, r in enumerate(results, 1):
        print(f"  {i}. @{r['author']} - {r['distance_km']} km away")
        print(f"     \"{r['text'][:80]}...\"")
        print(f"     🏷️ {r['hashtags']}  ❤️ {r['likes']} likes\n")
    print("-" * 60)

### 🗺️ Which cities are nearest to each other?

We can also compute **inter-city distances** directly in the graph to see how our Place nodes relate geographically.

In [None]:
# ── Compute pairwise distances between all cities ───────
with driver.session(database="neo4j") as session:
    result = session.run("""
        MATCH (a:Place), (b:Place)
        WHERE a.name < b.name  // avoid duplicates
        RETURN a.name AS city_a,
               b.name AS city_b,
               round(point.distance(a.location, b.location) / 1000.0, 0) AS distance_km
        ORDER BY distance_km ASC
    """)
    print("── City-to-City Distances ──\n")
    for record in result:
        print(f"  {record['city_a']:>20s}  ↔  {record['city_b']:<20s}  {record['distance_km']:>8.0f} km")

### 🧠 Geo-Augmented GraphRAG

Combine **geospatial filtering + vector search + graph traversal** into a single retrieval function. This answers questions like:
> *"What are people near London saying about AI?"*

In [None]:
def geo_graph_rag_search(question: str, city_name: str, radius_km: int = 100, k: int = 5) -> list[dict]:
    """
    Geo-filtered GraphRAG: vector search restricted to tweets
    within `radius_km` of a city, then enriched via graph traversal.
    """
    q_embedding = get_embeddings([question])[0]

    cypher = """
        // Get city center
        MATCH (p:Place {name: $city})
        WITH p.location AS center

        // Vector search: find semantically similar tweets
        CALL db.index.vector.queryNodes($index, $k_broad, $embedding)
        YIELD node AS tweet, score

        // Geo filter: keep only tweets within radius
        WHERE point.distance(tweet.location, center) < $radius_m

        // Graph traversal: enrich
        MATCH (author:User)-[:POSTED]->(tweet)
        OPTIONAL MATCH (tweet)-[:LOCATED_AT]->(place:Place)
        OPTIONAL MATCH (tweet)-[:TAGGED_WITH]->(hashtag:Hashtag)

        RETURN tweet.text                       AS text,
               score,
               author.username                  AS author,
               author.followers                 AS author_followers,
               place.name                       AS location,
               place.country                    AS country,
               collect(DISTINCT hashtag.name)   AS hashtags,
               tweet.likes                      AS likes,
               tweet.retweets                   AS retweets,
               round(point.distance(tweet.location, center) / 1000.0, 2) AS distance_km
        ORDER BY score DESC
        LIMIT $k
    """
    with driver.session(database="neo4j") as session:
        # Search a broader pool then geo-filter down
        result = session.run(
            cypher,
            city=city_name,
            radius_m=radius_km * 1000,
            index=INDEX_NAME,
            k_broad=k * 10,  # cast a wider net for vector search
            k=k,
            embedding=q_embedding,
        )
        return [dict(record) for record in result]


def geo_graph_rag_answer(question: str, city_name: str, radius_km: int = 100, k: int = 5) -> str:
    """Geo-filtered GraphRAG + Gemini LLM."""
    results = geo_graph_rag_search(question, city_name, radius_km, k)
    if not results:
        return f"No tweets found near {city_name} matching your question."

    context_lines = []
    for i, r in enumerate(results, 1):
        context_lines.append(
            f"Tweet {i} (similarity={r['score']:.3f}, {r['distance_km']} km from {city_name}):\n"
            f"  Text: \"{r['text']}\"\n"
            f"  Author: @{r['author']} ({r['author_followers']} followers)\n"
            f"  Location: {r['location']}, {r['country']}\n"
            f"  Hashtags: {', '.join(r['hashtags'])}\n"
            f"  Engagement: {r['likes']} likes, {r['retweets']} retweets"
        )
    context = "\n\n".join(context_lines)

    response = client.models.generate_content(
        model=LLM_MODEL,
        contents=f"Context:\n{context}\n\nQuestion: {question}",
        config=genai.types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.2,
        ),
    )
    return response.text


print("✅ Geo-augmented GraphRAG ready.")
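
The retrieval function above searches a pool of `k * 10` vector matches and only then applies the geo filter, so it can return fewer than `k` tweets (or none) when nearby tweets rank low on similarity. The toy sketch below illustrates that trade-off in plain Python; `overfetch_then_filter`, its `factor` parameter, and the data are hypothetical.

```python
def overfetch_then_filter(candidates: list[dict], radius_km: float, k: int, factor: int = 10) -> list[dict]:
    """
    candidates: dicts with 'score' and 'distance_km', in any order.
    Mimics the query: take the top k*factor by score, drop those outside
    the radius, then keep at most k.
    """
    pool = sorted(candidates, key=lambda c: c["score"], reverse=True)[: k * factor]
    nearby = [c for c in pool if c["distance_km"] < radius_km]
    return nearby[:k]


# Hypothetical data: 100 tweets, and only the five lowest-scoring ones are near the city
toy = [
    {"id": i, "score": 1.0 - i / 100, "distance_km": 5 if i >= 95 else 900}
    for i in range(100)
]

small_pool = overfetch_then_filter(toy, radius_km=100, k=2)            # pool of 20 candidates
large_pool = overfetch_then_filter(toy, radius_km=100, k=2, factor=50)  # pool covers all 100
print(len(small_pool), len(large_pool))
```

If a city returns too few results, widening the pool (a larger multiplier than 10) or the radius are the two knobs to try.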

In [None]:
# ── Geo-scoped questions across multiple cities ─────────
geo_rag_demos = [
    ("What are people saying about AI and machine learning?", "London, UK", 100),
    ("What do people think about cloud computing and serverless?", "New York, NY", 80),
    ("What do people think about cloud computing and serverless?", "San Francisco, CA", 80),
    ("What do people think about cloud computing and serverless?", "Tokyo, JP", 80),
]

for question, city, radius in geo_rag_demos:
    print("=" * 70)
    print(f"❓ {question}")
    print(f"📍 Within {radius} km of {city}")
    print("=" * 70)
    print(geo_graph_rag_answer(question, city, radius_km=radius))
    print()

---
## 🧹 Cleanup

Close the Neo4j driver when finished.

In [None]:
driver.close()
print("✅ Neo4j connection closed.")

---
## 🏁 Conclusion

We built a complete **GraphRAG** pipeline from the social media knowledge graph:

| Component | What it does |
|-----------|-------------|
| **Vector Embeddings** | Encode tweet text into 3072-d vectors via Gemini |
| **Neo4j Vector Index** | Fast cosine-similarity search over tweet embeddings |
| **Graph Traversal** | Enrich results with author, location, hashtags, and related tweets |
| **LLM Generation** | Produce grounded answers from the enriched context |
| **Cypher Generation** | Let the LLM write graph queries for analytical questions |
| **Geospatial Queries** | Filter tweets by proximity using `point.distance()` |
| **Geo-Augmented GraphRAG** | Combine vector search + geo-filter + graph traversal |

**Key Insight:** GraphRAG can produce *richer, better-grounded* answers than plain vector search because it leverages the **structural relationships** between entities (authors, places, hashtags), not just text similarity. Adding geospatial filtering lets you scope answers to a specific region.

**Next Steps:**
- Add **community detection** to find clusters of related users.
- Implement **hybrid search** (keyword + vector + graph).
- Build a **Streamlit app** for interactive graph exploration.
- Add **temporal analysis**: how do topics evolve over time?