# Neo4j Semantic Search Pipeline

This notebook implements a semantic search pipeline using Neo4j and Sentence Transformers. It covers:
1.  **Data Loading & Embedding:** Constructing semantic text from flight data and generating embeddings using three different models.
2.  **Index Creation:** Creating vector indices in Neo4j.
3.  **Search:** Executing a similarity search.

In [1]:
# Install necessary dependencies
!pip install neo4j sentence-transformers python-dotenv


In [3]:
import os
os.environ['NEO4J_URI'] = 'neo4j+s://dc92f3a9.databases.neo4j.io'
os.environ['NEO4J_USERNAME'] = 'neo4j'
os.environ['NEO4J_PASSWORD'] = '<your-password>'  # placeholder: never commit a real password to a shared notebook
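
Hardcoding credentials in a notebook cell risks leaking them whenever the notebook is shared. A minimal sketch of an alternative, assuming the three variable names used above are already set in the shell or in a `.env` file; `require_env` and `load_neo4j_settings` are hypothetical helpers, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_neo4j_settings() -> dict:
    """Collect the three connection settings this notebook relies on."""
    return {
        "uri": require_env("NEO4J_URI"),
        "username": require_env("NEO4J_USERNAME"),
        "password": require_env("NEO4J_PASSWORD"),
    }
```

Failing fast on a missing variable avoids silently falling back to a default such as `'password'` against the wrong database.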

## Part 1: Load Models & Process Data
This step fetches journey data, creates descriptive text, generates embeddings (MiniLM, MPNet, BGE-M3), and stores them in `JourneyVector` nodes.

In [None]:
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
import os

# ------------------------------
# 1. Embedding Models
# ------------------------------
print("Loading embedding models...")
model_minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model_mpnet = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
model_bge_m3 = SentenceTransformer("BAAI/bge-m3")

# ------------------------------
# 2. Neo4j Connection
# ------------------------------
uri = os.getenv('NEO4J_URI', 'neo4j://localhost:7687')
username = os.getenv('NEO4J_USERNAME', 'neo4j')
password = os.getenv('NEO4J_PASSWORD', 'password')

driver = GraphDatabase.driver(uri, auth=(username, password))

# ------------------------------
# Helper: Semantic Text Builder
# ------------------------------
def build_semantic_text(record):
    """
    Constructs a qualitative narrative using the provided airline metrics.
    Now optimized for 'Route' phrasing and 'Record Lookup'.
    """
    # --- Delay Context ---
    delay = record['arrival_delay_minutes']
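    if delay < 0:
        delay_desc = f"arrived early by {abs(delay)} minutes"
        punctuality = "highly punctual"
    elif delay == 0:
        # Avoid the misleading phrase "arrived early by 0 minutes"
        delay_desc = "arrived exactly on time"
        punctuality = "highly punctual"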
    elif delay <= 15:
        delay_desc = f"was roughly on time ({delay} min delay)"
        punctuality = "punctual"
    elif delay <= 60:
        delay_desc = f"had a moderate delay of {delay} minutes"
        punctuality = "delayed"
    else:
        delay_desc = f"suffered a severe delay of {delay} minutes"
        punctuality = "severely delayed"

    # --- Food Context ---
    score = record['food_satisfaction_score']
    if score <= 2:
        food_desc = "poor dining experience"
    elif score == 3:
        food_desc = "average dining experience"
    else:
        food_desc = "excellent dining experience"

    # --- Distance Context ---
    miles = record['actual_flown_miles']
    if miles < 1000:
        haul = "short-haul"
    elif miles < 4000:
        haul = "medium-haul"
    else:
        haul = "long-haul"

    
    text = (
        f"A {punctuality} {haul} flight operating out of {record['origin']}. "
        f"The flight departs from {record['origin']} and arrives at {record['destination']}. "
        f"The {record['passenger_class']} journey covered {miles} miles on a {record['fleet_type_description']} aircraft. "
        f"It {delay_desc}. "
        f"The passenger (Generation: {record['generation']}, Status: {record['loyalty_program_level']}) "
        f"reported a {food_desc} with a rating of {score}/5. "
        f"Passenger record locator is {record['record_locator']} and Feedback ID is {record['feedback_ID']}."
    )
    return text

# ------------------------------
# 3. Processing Pipeline
# ------------------------------
def process(tx):
    print("Fetching journey data...")
    result = tx.run("""
        MATCH (p:Passenger)-[:TOOK]->(j:Journey)-[:ON]->(f:Flight)
        MATCH (f)-[:DEPARTS_FROM]->(dep:Airport)
        MATCH (f)-[:ARRIVES_AT]->(arr:Airport)
        RETURN
            j.feedback_ID AS feedback_ID,
            p.record_locator AS record_locator,
            p.generation AS generation,
            p.loyalty_program_level AS loyalty_program_level,
            j.food_satisfaction_score AS food_satisfaction_score,
            j.arrival_delay_minutes AS arrival_delay_minutes,
            j.actual_flown_miles AS actual_flown_miles,
            j.passenger_class AS passenger_class,
            f.fleet_type_description AS fleet_type_description,
            dep.station_code AS origin,
            arr.station_code AS destination
    """)
    
    records = list(result)
    print(f"Found {len(records)} journeys to embed.")

    for i, row in enumerate(records):
        # 1. Build rich text with route/ID phrasing
        text = build_semantic_text(row)

        # 2. Generate embeddings
        emb_minilm = model_minilm.encode(text).tolist()
        emb_mpnet = model_mpnet.encode(text).tolist()
        emb_bge_m3 = model_bge_m3.encode(text).tolist()

        # 3. Store in a separate node (:JourneyVector)
        tx.run("""
            MATCH (j:Journey {feedback_ID: $fid})

            MERGE (jv:JourneyVector {id: $fid + '_vec'})
            // Create and match branches set identical properties, so one SET covers both
            SET
                jv.text = $text,
                jv.record_locator = $locator,
                jv.feedback_id = $fid,
                jv.minilm_embedding = $e1,
                jv.mpnet_embedding = $e2,
                jv.bgem3_embedding = $e3

            MERGE (j)-[:HAS_VECTOR]->(jv)
        """,
        fid=row['feedback_ID'],
        locator=row['record_locator'],
        text=text,
        e1=emb_minilm, 
        e2=emb_mpnet, 
        e3=emb_bge_m3)

        if i % 50 == 0:
            print(f"Processed {i}/{len(records)}...")

with driver.session() as session:
    session.execute_write(process)

print("Done! Vectors stored in 'JourneyVector' nodes.")
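
The loop above issues one write query per journey. For larger datasets, a batched variant may reduce round trips. The sketch below is an alternative under stated assumptions, not the notebook's method: `chunked`, `write_batches`, and `BATCH_UPSERT` are hypothetical names, and each row dict is assumed to carry the keys `fid`, `locator`, `text`, `e1`, `e2`, and `e3`.

```python
from typing import Iterable, Iterator, List

# Hypothetical batched form of the per-row upsert used above.
BATCH_UPSERT = """
UNWIND $rows AS row
MATCH (j:Journey {feedback_ID: row.fid})
MERGE (jv:JourneyVector {id: row.fid + '_vec'})
SET jv.text = row.text,
    jv.record_locator = row.locator,
    jv.feedback_id = row.fid,
    jv.minilm_embedding = row.e1,
    jv.mpnet_embedding = row.e2,
    jv.bgem3_embedding = row.e3
MERGE (j)-[:HAS_VECTOR]->(jv)
"""


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(session, rows, size=100):
    """Send rows to Neo4j in batches; `session` is assumed to be an open driver session."""
    for batch in chunked(rows, size):
        session.run(BATCH_UPSERT, rows=batch).consume()
```

The batch size of 100 is an arbitrary starting point; embeddings are large, so smaller batches may be needed to keep each request manageable.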

## Part 2: Create Indices
Creates Vector Indices in Neo4j for the three different embedding models.

In [None]:
from neo4j import GraphDatabase
import os

uri = os.getenv('NEO4J_URI')
username = os.getenv('NEO4J_USERNAME')
password = os.getenv('NEO4J_PASSWORD')

driver = GraphDatabase.driver(uri, auth=(username, password))

def create_indices():
    with driver.session() as session:
        print("Creating indices on :JourneyVector...")

        # 1. MiniLM
        session.run("""
            CREATE VECTOR INDEX minilm_vec_index IF NOT EXISTS
            FOR (n:JourneyVector) ON (n.minilm_embedding)
            OPTIONS {indexConfig: {`vector.dimensions`: 384, `vector.similarity_function`: 'cosine'}}
        """)
        
        # 2. MPNet
        session.run("""
            CREATE VECTOR INDEX mpnet_vec_index IF NOT EXISTS
            FOR (n:JourneyVector) ON (n.mpnet_embedding)
            OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}}
        """)

        # 3. BGE-M3
        session.run("""
            CREATE VECTOR INDEX bgem3_vec_index IF NOT EXISTS
            FOR (n:JourneyVector) ON (n.bgem3_embedding)
            OPTIONS {indexConfig: {`vector.dimensions`: 1024, `vector.similarity_function`: 'cosine'}}
        """)

        print("Indices created successfully.")

if __name__ == "__main__":
    create_indices()
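
Each index above is declared with a fixed `vector.dimensions`, so an embedding of a different length would not be usable through that index. A small pre-write check can catch a mismatched model early. The sketch below reuses the dimensions from the index definitions above; `EXPECTED_DIMS` and `check_dimensions` are hypothetical names.

```python
# Dimensions copied from the CREATE VECTOR INDEX statements above.
EXPECTED_DIMS = {
    "minilm_embedding": 384,
    "mpnet_embedding": 768,
    "bgem3_embedding": 1024,
}


def check_dimensions(embeddings: dict) -> list:
    """Return a list of readable problems; an empty list means every length matches."""
    problems = []
    for prop, expected in EXPECTED_DIMS.items():
        vec = embeddings.get(prop)
        if vec is None:
            problems.append(f"{prop}: missing")
        elif len(vec) != expected:
            problems.append(f"{prop}: expected {expected} values, got {len(vec)}")
    return problems
```

Calling this on the three lists produced by `encode(...).tolist()` before the write would surface a wrong model name before any data reaches the database.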

## Part 3: Semantic Search Test
Performs a test search using the BGE-M3 model and its `bgem3_vec_index`.

In [None]:
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
import os 
from dotenv import load_dotenv

load_dotenv()

# Test with BGE-M3; the query model must match the index used below (bgem3_vec_index)
model = SentenceTransformer("BAAI/bge-m3")

uri = os.getenv('NEO4J_URI', 'neo4j://localhost:7687')
username = os.getenv('NEO4J_USERNAME', 'neo4j')
password = os.getenv('NEO4J_PASSWORD', 'password')
driver = GraphDatabase.driver(uri, auth=(username, password))

def search(query, top_k=3):
    embedding = model.encode(query).tolist()
    
    cypher = """
    CALL db.index.vector.queryNodes('bgem3_vec_index', $k, $vec)
    YIELD node, score
    
    MATCH (j:Journey)-[:HAS_VECTOR]->(node)
    MATCH (p:Passenger)-[:TOOK]->(j)
    
    RETURN 
        score,
        node.text AS semantic_text,
        j.feedback_ID AS feedback_id,
        j.arrival_delay_minutes AS actual_delay,
        j.food_satisfaction_score AS actual_food
    """
    
    with driver.session() as session:
        result = session.run(cypher, k=top_k, vec=embedding)
        return [dict(r) for r in result]

if __name__ == "__main__":
    print("--- Testing Semantic Search ---")
    q = "big delays and bad food"
    print(f"Query: '{q}'")
    
    results = search(q)
    for r in results:
        print(f"\nScore: {r['score']:.4f}")
        print(f"Text: {r['semantic_text']}")
        print(f"DB Check -> Delay: {r['actual_delay']}, Food: {r['actual_food']}")
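
The `score` printed above comes from an index configured with the `cosine` similarity function. As far as the Neo4j documentation describes it, the index reports a value normalized into the range [0, 1], computed as (1 + cos) / 2 rather than the raw cosine; treat that formula as an assumption to verify against your Neo4j version. The sketch below shows the arithmetic in plain Python, with `cosine` and `normalized_score` as hypothetical helper names.

```python
import math


def cosine(u, v):
    """Raw cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normalized_score(u, v):
    """Map raw cosine into [0, 1] (assumed to mirror the index's cosine score)."""
    return (1 + cosine(u, v)) / 2
```

Under this mapping, unrelated (orthogonal) vectors would score 0.5, which is worth keeping in mind when reading scores in the 0.6 to 0.8 range.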

## Comparative Analysis

In [5]:
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
import os
from dotenv import load_dotenv

load_dotenv()

# ---------------------------------------------------------
# 1. SETUP: Load All 3 Models
# ---------------------------------------------------------
print("Loading models... (This might take a minute)")
models = {
    "minilm": {
        "model": SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2"),
        "index": "minilm_vec_index"
    },
    "mpnet": {
        "model": SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2"),
        "index": "mpnet_vec_index"
    },
    "bge-m3": {
        "model": SentenceTransformer("BAAI/bge-m3"),
        "index": "bgem3_vec_index"
    }
}

uri = os.getenv('NEO4J_URI', 'neo4j://localhost:7687')
username = os.getenv('NEO4J_USERNAME', 'neo4j')
password = os.getenv('NEO4J_PASSWORD', 'password')
driver = GraphDatabase.driver(uri, auth=(username, password))


# ---------------------------------------------------------
# 2. TEST QUERIES
# Note: the "Qn:" label is part of each string, so it is embedded with the query.
# ---------------------------------------------------------
QUESTIONS = [
    "Q1: flights with severe delays and terrible food",
    "Q2: excellent dining experience on a short flight",
    "Q3: unhappy passengers on Boeing aircraft",
    "Q4: Millennial generation complaining about delays",
    "Q5: Premier Gold members with poor satisfaction",
    "Q6: The food was great but the flight was late",
    "Q7: long haul flights that arrived early",
    "Q8: Economy class passengers who had a good time",
    "Q9: nightmare journey with huge delay over 3 hours",
    "Q10: smooth trip with no issues",
    "Q11: Show me the delay for a flight out of JNX",
    "Q12: Boomers who had excellent food",
    "Q13: Satisfied Premier Gold members in Economy",
    "Q14: Find all flights from JNX to EWX longer than 2000 miles.",
    "Q15: Gen X passengers flying on the A320-200",
    "Q16: Flights out of SEX",
    "Q17: Flights from SEX to IAX",
    "Q18: Pull up the record locator EPXXW8.",
    "Q19: Pull up the feedback ID F_18."
    
]

# ---------------------------------------------------------
# 3. SEARCH LOGIC
# ---------------------------------------------------------
def search(query, model_key, top_k=3):
    model_obj = models[model_key]["model"]
    index_name = models[model_key]["index"]
    
    embedding = model_obj.encode(query).tolist()
    
    cypher = f"""
    CALL db.index.vector.queryNodes('{index_name}', $k, $vec)
    YIELD node, score
    
    MATCH (j:Journey)-[:HAS_VECTOR]->(node)
    
    RETURN 
        score,
        node.text AS semantic_text,
        j.arrival_delay_minutes AS actual_delay,
        j.food_satisfaction_score AS actual_food
    """
    
    with driver.session() as session:
        result = session.run(cypher, k=top_k, vec=embedding)
        return [dict(r) for r in result]

# ---------------------------------------------------------
# 4. EXECUTION LOOP
# ---------------------------------------------------------
if __name__ == "__main__":
    print("\n=== STARTING 3-MODEL COMPARISON (Top 3 Results) ===\n")

    for q in QUESTIONS:
        print("_" * 80)
        print(f"QUERY: {q}")
        print("_" * 80)
        
        for model_name in ["minilm", "mpnet", "bge-m3"]:
            print(f"\n--- MODEL: {model_name.upper()} ---")
            try:
                results = search(q, model_name, top_k=3)
                
                if not results:
                    print("  No results found.")
                    continue

                for i, r in enumerate(results):
                    print(f"  #{i+1} [Score: {r['score']:.4f}]")
                    # Print the full semantic text (no truncation is applied)
                    clean_text = r['semantic_text']
                    print(f"     Text: \"{clean_text}\"")
                    print(f"     Stats: Delay={r['actual_delay']}min | Food={r['actual_food']}/5")
                
            except Exception as e:
                print(f"  Error: {e}")
        
        print("\n")

Loading models... (This might take a minute)

=== STARTING 3-MODEL COMPARISON (Top 3 Results) ===

________________________________________________________________________________
QUERY: Q1: flights with severe delays and terrible food
________________________________________________________________________________

--- MODEL: MINILM ---
  #1 [Score: 0.7518]
     Text: "A severely delayed medium-haul flight operating out of EWX. The flight departs from EWX and arrives at IAX. The Economy journey covered 1400 miles on a B737-MAX8 aircraft. It suffered a severe delay of 177 minutes. The passenger (Generation: Millennial, Status: premier gold) reported a poor dining experience with a rating of 1/5. Passenger record locator is M4XXTP and Feedback ID is F_3."
     Stats: Delay=177min | Food=1/5
  #2 [Score: 0.7510]
     Text: "A severely delayed short-haul flight operating out of MIX. The flight departs from MIX and arrives at IAX. The Economy journey covered 717 miles on a B737-800 aircraf

#### Deleting vectors and indices to restart with different embeddings

In [None]:
# from neo4j import GraphDatabase
# import os
# from dotenv import load_dotenv

# # Load environment variables
# load_dotenv()

# # Neo4j Connection
# uri = os.getenv('NEO4J_URI', 'neo4j://localhost:7687')
# username = os.getenv('NEO4J_USERNAME', 'neo4j')
# password = os.getenv('NEO4J_PASSWORD', 'password')

# driver = GraphDatabase.driver(uri, auth=(username, password))

# def clean_slate():
#     with driver.session() as session:
#         print("--- Starting Cleanup ---")
        
#         # 1. Drop the Vector Indices
#         # We drop them individually to ensure no conflicts when re-creating
#         indices_to_drop = [
#             "journey_minilm_full_index", # From your old script name
#             "journey_mpnet_full_index",  # From your old script name
#             "journey_bge_m3_index",      # From your old script name
#             "minilm_vec_index",          # From new script name
#             "mpnet_vec_index",           # From new script name
#             "bgem3_vec_index"            # From new script name
#         ]
        
#         for index in indices_to_drop:
#             try:
#                 print(f"Dropping index: {index}...")
#                 session.run(f"DROP INDEX {index} IF EXISTS")
#             except Exception as e:
#                 print(f"Could not drop {index}: {e}")

#         # 2. Delete the Nodes
#         # DETACH DELETE removes the node AND its relationships (e.g., HAS_VECTOR)
#         print("Deleting all :JourneyVector nodes...")
#         result = session.run("""
#             MATCH (n:JourneyVector)
#             DETACH DELETE n
#             RETURN count(n) as count
#         """)
#         count = result.single()["count"]
#         print(f"Deleted {count} JourneyVector nodes.")
        
#         # Optional: Clean up old properties on Journey nodes if you ran the old script
#         print("Cleaning up old vector properties on :Journey nodes (if any)...")
#         session.run("""
#             MATCH (j:Journey)
#             REMOVE j.full_feature_text, 
#                    j.embedding_minilm_full, 
#                    j.embedding_mpnet_full, 
#                    j.embedding_bge_m3_full
#         """)
#         print("Cleanup complete.")

# if __name__ == "__main__":
#     clean_slate()
#     driver.close()

### New report after improvements to the semantic text

### Airline Embedding Models Re-Evaluation Report

### 1. What Improved? (The "Keyword Stuffing" Effect)
The changes made to the text generation strategy, specifically adding *"operating out of [Origin]"*, *"departs from X and arrives at Y"*, and explicit IDs, have yielded **drastic improvements** in retrieval quality compared to the first run.

* **Route accuracy is now near 100% on the tested route queries:**
    * *Previously:* Models struggled to distinguish Origin from Destination.
    * *Now:* In **Q14** ("Flights from JNX to EWX") and **Q16** ("Flights out of SEX"), **ALL three models** successfully retrieved flights matching the specific airport codes. The phrasing *"operating out of SEX"* created a strong semantic hook that even the smaller MiniLM model could catch.
* **Semantic Richness:**
    * The generated text matches user queries much more naturally. When a user asks for "severe delays" or "big delays", the models now find text that explicitly says "suffered a severe delay", leading to higher confidence scores (0.80+ for MPNet/BGE-M3).

### 2. The "ID Lookup" Limitation
*Tests: Q18 (Record EPXXW8) and Q19 (Feedback F_18)*

**Notice:** Despite adding *"Passenger record locator is EPXXW8"* to the text, **none of the models found the exact record in the top 3 results.**
* **Why?** Vector embeddings capture *semantic meaning*, not exact character matching. To a language model, the random string `EPXXW8` looks semantically nearly identical to `PQXXPR` or `NZXXC7`; they are all just "alphanumeric codes."
* **The Takeaway:** This validates the system architecture. **We cannot rely on Vector Search for ID lookups.** We **must** use the Router (Intent Classification) to send ID queries to **Cypher** (exact match), while reserving Vector Search for qualitative queries ("bad food", "delays").

---

### 3. Model Comparisons

#### 🥇 Winner: BGE-M3 (BAAI/bge-m3)
* **Performance:** Consistently the highest confidence scores (often >0.82).
* **Route Handling:** It handled the "Flights out of SEX" query perfectly, returning multiple flights departing from that exact station.
* **Nuance:** It showed the best balance in **Q6** ("Great food but late flight"), finding flights with high delays but 5/5 food scores, whereas others struggled to balance the contradicting sentiments.
* **Verdict:** **The Best Choice.** It behaves most like a "Hybrid" retriever, respecting keywords while understanding sentiment.

#### 🥈 Runner Up: MPNet (paraphrase-mpnet-base-v2)
* **Performance:** Very strong on descriptive queries (Q1, Q9). It understands "Nightmare journey" very well.
* **Improvement:** It significantly improved on Airport Codes compared to the previous run, thanks to the new text templates.
* **Verdict:** A solid backup, but slightly less precise than BGE-M3 on edge-case entity matching.

#### 🥉 Third Place: MiniLM (all-MiniLM-L6-v2)
* **Performance:** Fastest, but lowest confidence scores (~0.60 - 0.75).
* **Surprise Win:** It actually handled the Route queries (Q14, Q16) correctly this time! The "operating out of" phrasing helped this small model bridge the gap.
* **Verdict:** Usable if hardware resources are very tight, but significantly less "smart" than BGE-M3.
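
The rankings above rest on reading printed scores by eye. One way to make the comparison more systematic is to average the top-k scores per model. The sketch below assumes results are collected in the shape returned by `search` (a list of dicts with a `score` key); `mean_scores` is a hypothetical helper. Because each model has its own embedding space, raw scores are only loosely comparable across models, so this is a rough signal rather than a ranking metric.

```python
from statistics import mean


def mean_scores(results_by_model: dict) -> dict:
    """Map each model name to the mean `score` of its results (None if it has none)."""
    summary = {}
    for model_name, rows in results_by_model.items():
        scores = [row["score"] for row in rows]
        summary[model_name] = mean(scores) if scores else None
    return summary


# Only the two MiniLM scores visible in the Q1 output above; "mpnet" is left
# empty to show how a model without results is reported.
example = {
    "minilm": [{"score": 0.7518}, {"score": 0.7510}],
    "mpnet": [],
}
print(mean_scores(example))
```

Pairing this with a manual relevance check of each hit (for example, whether `actual_delay` really is severe) would give a sturdier basis than scores alone.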

---


### Visual Architecture Update
The failure of vector search to find exact IDs validates this flow:



1.  **Input:** "Record EPXXW8"
2.  **Intent Classifier:** Detects `lookup_details`
3.  **Path:** Skips Vector Index $\rightarrow$ Executes Cypher Exact Match.
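
The flow above can be sketched as a small rule-based router. This is an illustration under assumptions, not the project's actual classifier: the ID formats (six uppercase alphanumeric characters for record locators, `F_` followed by digits for feedback IDs) are inferred from the examples in this notebook, and `route_query`, the Cypher constants, and the `semantic_search` intent name are hypothetical.

```python
import re

# Assumed ID formats, inferred from examples such as EPXXW8 and F_18.
# The keyword is matched case-insensitively; the ID itself must be uppercase.
RECORD_LOCATOR = re.compile(r"(?i:record(?:\s+locator)?)\s+([A-Z0-9]{6})\b")
FEEDBACK_ID = re.compile(r"(?i:feedback(?:\s+id)?)\s+(F_\d+)\b")

# Exact-match queries using the labels and properties from the pipeline above.
LOCATOR_CYPHER = """
MATCH (p:Passenger {record_locator: $locator})-[:TOOK]->(j:Journey)
RETURN p.record_locator AS record_locator, j.feedback_ID AS feedback_id
"""
FEEDBACK_CYPHER = """
MATCH (j:Journey {feedback_ID: $fid})
RETURN j.feedback_ID AS feedback_id, j.arrival_delay_minutes AS actual_delay
"""


def route_query(query: str) -> dict:
    """Send ID lookups to Cypher exact match and everything else to vector search."""
    match = RECORD_LOCATOR.search(query)
    if match:
        return {"path": "cypher", "intent": "lookup_details",
                "cypher": LOCATOR_CYPHER, "params": {"locator": match.group(1)}}
    match = FEEDBACK_ID.search(query)
    if match:
        return {"path": "cypher", "intent": "lookup_details",
                "cypher": FEEDBACK_CYPHER, "params": {"fid": match.group(1)}}
    # No ID pattern found: fall back to the vector index.
    return {"path": "vector", "intent": "semantic_search", "query": query}
```

A regex router like this only covers the two ID patterns; a learned intent classifier could replace it without changing the downstream split between Cypher and vector search.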