## Introduction

This notebook demonstrates the functionality and evaluation of the **Movie RAG API**, a natural language movie query system that combines **structured database retrieval** with **LLM-powered response generation**. The system supports both **traditional RAG**, which relies on rule-based parsing and SQL templates so that queries stay controlled and predictable, and **agentic RAG**, which uses autonomous SQL generation via LangChain agents for more complex analytical questions. Users can ask questions like “Recommend action movies from 2015” and receive conversational, data-backed responses from the TMDB dataset.

The notebook covers the full pipeline: initializing the database, parsing queries, retrieving relevant movies, generating responses with an LLM (Ollama), and evaluating the system for accuracy, retrieval quality, response relevance, latency, and agent capabilities. Edge cases and robustness checks are also included to ensure safe and reliable operation.



### Setup and Initialization  
Import necessary modules, verify the `movies.db` file exists, and initialize the `MovieDB` connection.



In [3]:

from app.database import MovieDB
from app.query_processor import parse_query
from app.llm_service import generate_response
import pandas as pd
import json

# Verify database exists
import os
db_path = "data/movies.db"
print(f"Database exists: {os.path.exists(db_path)}")
print(f"Database path: {os.path.abspath(db_path)}")

db = MovieDB()

Database exists: True
Database path: /Users/vikrambhat/Documents/movie-rag-api/data/movies.db


### Test Database Queries  
Run a series of checks to validate database functionality — search by title, genre, and year, apply combined filters, and retrieve top-rated movies.

In [2]:
# Cell 2: Test Database Queries

print("=== Database Tests ===\n")

# Test 1: Search by title
movies = db.search(title="inception")
print(f"Search 'inception': {len(movies)} results")
if movies:
    print(f"  → {movies[0]['title']} ({movies[0]['year']})")

# Test 2: Search by genre
action = db.search(genre="action", limit=3)
print(f"\nAction movies: {len(action)} results")
for m in action:
    print(f"  → {m['title']} - {m['vote_average']}/10")

# Test 3: Search by year
movies_2015 = db.search(year=2015, limit=5)
print(f"\nMovies from 2015: {len(movies_2015)} results")

# Test 4: Combined filters
sci_fi_2015 = db.search(genre="science fiction", year=2015, limit=3)
print(f"\nSci-fi from 2015: {len(sci_fi_2015)} results")

# Test 5: Top rated
top = db.get_top_rated(limit=5)
print(f"\nTop rated movies:")
for m in top:
    print(f"  → {m['title']} - {m['vote_average']}/10")

=== Database Tests ===

DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%inception%', 5]
Search 'inception': 1 results
  → Inception (2010.0)
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 3]

Action movies: 3 results
  → The Dark Knight - 8.2/10
  → The Empire Strikes Back - 8.2/10
  → Seven Samurai - 8.2/10
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average 
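
The DEBUG output above suggests that `db.search` assembles a parameterized SQL statement from whichever filters are supplied. The following is a minimal sketch of that idea, not the actual `MovieDB` implementation: the function name `search_movies`, the reduced column list, and the in-memory rows are all assumptions made for illustration.

```python
import sqlite3

def search_movies(conn, title=None, genre=None, year=None, min_rating=0.0, limit=5):
    """Hypothetical stand-in for MovieDB.search: builds a parameterized query."""
    sql = ("SELECT id, title, year, genres, vote_average, vote_count "
           "FROM movies WHERE vote_average >= ?")
    params = [min_rating]
    if title:
        sql += " AND title LIKE ?"
        params.append(f"%{title}%")
    if genre:
        sql += " AND genres LIKE ?"
        params.append(f"%{genre}%")
    if year is not None:
        sql += " AND year = ?"
        params.append(year)
    sql += " ORDER BY vote_average DESC, vote_count DESC LIMIT ?"
    params.append(limit)
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in conn.execute(sql, params)]

# Tiny in-memory table with made-up rows, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER, title TEXT, year INTEGER, "
    "genres TEXT, vote_average REAL, vote_count INTEGER)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Example Action Film", 2015, "Action, Adventure", 7.1, 900),
    (2, "Example Comedy", 2015, "Comedy", 6.4, 300),
    (3, "Older Action Film", 1999, "Action", 8.0, 1200),
])
demo_rows = search_movies(conn, genre="action", year=2015)
print([r["title"] for r in demo_rows])
```

Passing values through `?` placeholders rather than string formatting keeps user-supplied text out of the SQL itself, which matches the `DEBUG Params` lines in the output.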


### Test Query Processor  
Evaluate the `parse_query` function to ensure it correctly extracts intent, genre, year, and keywords from natural language movie-related queries.



In [5]:
# Cell 3: Test Query Processor

print("=== Query Processor Tests ===\n")

test_queries = [
    "Recommend action movies from 2015",
    "Tell me about Inception",
    "What are the best movies?",
    "Show me comedy films",
    "Find sci-fi movies from 2010"
]

for query in test_queries:
    result = parse_query(query)
    print(f"Query: {query}")
    print(f"  Intent: {result['intent']}")
    print(f"  Genre: {result['genre']}")
    print(f"  Year: {result['year']}")
    print(f"  Keywords: {result['keywords']}")
    print()

=== Query Processor Tests ===

Query: Recommend action movies from 2015
  Intent: recommend
  Genre: action
  Year: 2015
  Keywords: None

Query: Tell me about Inception
  Intent: describe
  Genre: None
  Year: None
  Keywords: inception

Query: What are the best movies?
  Intent: top_rated
  Genre: None
  Year: None
  Keywords: are ?

Query: Show me comedy films
  Intent: search
  Genre: comedy
  Year: None
  Keywords: None

Query: Find sci-fi movies from 2010
  Intent: search
  Genre: science fiction
  Year: 2010
  Keywords: None
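
To illustrate how rule-based parsing of this kind could work, here is a minimal sketch that reproduces the intent, genre, and year values shown in the output above. It is not the real `parse_query`: the pattern tables, the alias map, and the name `parse_query_sketch` are assumptions, and keyword extraction is left out because its rules are not visible in the notebook.

```python
import re

# Hypothetical pattern tables; the real parse_query may use different rules.
INTENT_PATTERNS = [
    ("recommend", r"\b(recommend|suggest)\b"),
    ("describe", r"\b(tell me about|describe)\b"),
    ("top_rated", r"\b(best|top rated|top-rated)\b"),
]
GENRE_ALIASES = {"sci-fi": "science fiction", "action": "action", "comedy": "comedy"}

def parse_query_sketch(query):
    """Extract intent, genre, and year from a free-text movie query."""
    text = query.lower()
    # First matching intent wins; anything else falls back to a plain search.
    intent = next((name for name, pat in INTENT_PATTERNS if re.search(pat, text)), "search")
    # Map surface forms such as "sci-fi" to the canonical genre name.
    genre = next((canon for alias, canon in GENRE_ALIASES.items() if alias in text), None)
    # Accept four-digit years starting with 19 or 20.
    year_match = re.search(r"\b(19|20)\d{2}\b", text)
    year = int(year_match.group(0)) if year_match else None
    return {"intent": intent, "genre": genre, "year": year}

print(parse_query_sketch("Find sci-fi movies from 2010"))
```

Substring matching on genre names is simple but can misfire on titles that happen to contain a genre word, so a production parser would likely need stricter tokenization.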



### Test LLM Service  
Validate the `generate_response` function by passing a user query and retrieved movie data to ensure the LLM produces coherent, context-aware recommendations.

In [6]:
# Cell 4: Test LLM Service

print("=== LLM Service Tests ===\n")

# Get some movies
movies = db.search(genre="action", year=2015, limit=3)

question = "Recommend action movies from 2015"
answer = generate_response(question, movies, intent='recommend')

print(f"Question: {question}")
print(f"\nMovies retrieved: {len(movies)}")
for m in movies:
    print(f"  - {m['title']}")

print(f"\nLLM Answer:\n{answer}")

=== LLM Service Tests ===

DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2015, 3]
Question: Recommend action movies from 2015

Movies retrieved: 3
  - Baahubali: The Beginning
  - Avengers: Age of Ultron
  - Furious 7

LLM Answer:
If you're looking for action-packed thrill rides from 2015, I'd definitely recommend checking out Avengers: Age of Ultron and Furious 7! Both movies are high-octane blockbusters that deliver non-stop excitement with impressive action sequences and thrilling plot twists. Baahubali: The Beginning also has its fair share of intense action scenes, so if you're in the mood for something a bit more epic, it's worth giving that a try as well!
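
The notebook does not show how `generate_response` turns the retrieved rows into an LLM prompt. As a rough sketch of one common approach, the retrieved movies can be rendered as a short context list ahead of the question. The helper names `format_movie` and `build_prompt` and the prompt wording are assumptions, and no call to Ollama is made here.

```python
def format_movie(movie):
    """Render one movie dict as a single context line."""
    line = f"- {movie['title']}"
    if movie.get("year") is not None:
        line += f" ({int(movie['year'])})"  # years may arrive as floats, e.g. 2010.0
    if movie.get("vote_average") is not None:
        line += f", rating {movie['vote_average']}/10"
    return line

def build_prompt(question, movies, intent="search"):
    """Hypothetical prompt builder; the real generate_response may differ."""
    context = "\n".join(format_movie(m) for m in movies) or "(no matching movies found)"
    return (
        f"You are a movie assistant. The user's intent is '{intent}'.\n"
        "Answer using only the movies listed below.\n\n"
        f"Movies:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

demo_movies = [{"title": "Furious 7", "year": 2015.0}]
print(build_prompt("Recommend action movies from 2015", demo_movies, intent="recommend"))
```

Restricting the model to the listed movies is what keeps the answer grounded in the database rather than in the model's own recall.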


### End-to-End Pipeline Test  
Run full workflow validation — parse user queries, retrieve matching movies from the database, and generate LLM-based responses to confirm the complete system functions cohesively.

In [7]:
# Cell 5: End-to-End Pipeline Test

print("=== End-to-End Pipeline Tests ===\n")

def test_pipeline(query):
    print(f"Query: {query}")
    
    # Parse
    query_info = parse_query(query)
    print(f"  Parsed: {query_info}")
    
    # Search
    if query_info['intent'] == 'top_rated':
        movies = db.get_top_rated(limit=5)
    else:
        movies = db.search(
            title=query_info.get('keywords'),
            genre=query_info.get('genre'),
            year=query_info.get('year'),
            limit=5
        )
    print(f"  Found: {len(movies)} movies")
    
    # Generate
    answer = generate_response(query, movies, intent=query_info['intent'])
    print(f"  Answer: {answer[:150]}...")
    print()
    
    return movies, answer

# Test various queries
queries = [
    "Tell me about The Matrix",
    "Recommend comedy movies from 2010",
    "What are the best sci-fi films?"
]

for q in queries:
    test_pipeline(q)

=== End-to-End Pipeline Tests ===

Query: Tell me about The Matrix
  Parsed: {'intent': 'describe', 'genre': None, 'year': None, 'keywords': 'matrix'}
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%matrix%', 5]
  Found: 3 movies
  Answer: So, you want to know about The Matrix? Well, it's an action-packed sci-fi movie set in the 22nd century where a computer hacker joins a group of rebel...

Query: Recommend comedy movies from 2010
  Parsed: {'intent': 'recommend', 'genre': 'comedy', 'year': 2010, 'keywords': None}
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? O


### Evaluation Dataset  
Construct a small benchmark set of user queries to test how accurately `parse_query` identifies intent, genre, and year, then compute overall parsing accuracy. The `expected_keywords` field is recorded for one query but is not scored by the loop below.



In [8]:
# Cell 6: Evaluation Dataset

print("=== Create Evaluation Dataset ===\n")

eval_queries = [
    {
        "query": "Recommend action movies from 2015",
        "expected_genre": "action",
        "expected_year": 2015,
        "expected_intent": "recommend"
    },
    {
        "query": "Tell me about Inception",
        "expected_keywords": "inception",
        "expected_intent": "describe"
    },
    {
        "query": "What are the best movies?",
        "expected_intent": "top_rated"
    },
    {
        "query": "Show me sci-fi films",
        "expected_genre": "science fiction",
        "expected_intent": "search"
    }
]

results = []

for test in eval_queries:
    query = test['query']
    parsed = parse_query(query)
    
    # Check intent
    intent_match = parsed['intent'] == test.get('expected_intent')
    
    # Check genre
    genre_match = True
    if 'expected_genre' in test:
        genre_match = parsed['genre'] == test['expected_genre']
    
    # Check year
    year_match = True
    if 'expected_year' in test:
        year_match = parsed['year'] == test['expected_year']
    
    results.append({
        'query': query,
        'intent_correct': intent_match,
        'genre_correct': genre_match,
        'year_correct': year_match,
        'all_correct': intent_match and genre_match and year_match
    })

df = pd.DataFrame(results)
print(df)
print(f"\nAccuracy: {df['all_correct'].sum()}/{len(df)} = {df['all_correct'].mean():.1%}")

=== Create Evaluation Dataset ===

                               query  intent_correct  genre_correct  \
0  Recommend action movies from 2015            True           True   
1            Tell me about Inception            True           True   
2          What are the best movies?            True           True   
3               Show me sci-fi films            True           True   

   year_correct  all_correct  
0          True         True  
1          True         True  
2          True         True  
3          True         True  

Accuracy: 4/4 = 100.0%
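
The evaluation loop above scores intent, genre, and year but leaves `expected_keywords` unchecked. One possible way to score every field that a test case specifies is sketched below; the helper name `score_parse` is an assumption, and the parsed dictionary is copied from the query-processor output earlier in the notebook.

```python
def score_parse(parsed, expected):
    """Compare only the fields that the test case specifies."""
    fields = ("intent", "genre", "year", "keywords")
    checks = {
        field: parsed.get(field) == expected[f"expected_{field}"]
        for field in fields
        if f"expected_{field}" in expected
    }
    checks["all_correct"] = all(checks.values())
    return checks

# Parsed output as printed for this query in the query-processor test.
parsed = {"intent": "describe", "genre": None, "year": None, "keywords": "inception"}
case = {
    "query": "Tell me about Inception",
    "expected_keywords": "inception",
    "expected_intent": "describe",
}
print(score_parse(parsed, case))
```

With only four queries, a 100% score says little about coverage, so a larger and more varied set would be needed before treating the figure as a reliable accuracy estimate.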


### Retrieval Quality Evaluation  
Assess how well database retrieval matches known ground-truth movie IDs by checking whether any expected ID appears in the top five search results, then calculate the overall hit rate.



In [9]:
# Cell 7: Retrieval Quality Evaluation

print("=== Retrieval Quality Evaluation ===\n")

# Manual ground truth
ground_truth = {
    "inception": [27205],  # Movie IDs that should be returned
    "the matrix": [603],
    "action 2015": [76341, 102899, 177677]  # Any of these
}

def evaluate_retrieval(query, expected_ids):
    parsed = parse_query(query)
    movies = db.search(
        title=parsed.get('keywords'),
        genre=parsed.get('genre'),
        year=parsed.get('year'),
        limit=5
    )
    
    retrieved_ids = [m['id'] for m in movies]
    
    # Check if any expected ID is in retrieved
    hit = any(eid in retrieved_ids for eid in expected_ids)
    
    return {
        'query': query,
        'retrieved': len(movies),
        'hit': hit,
        'top_result': movies[0]['title'] if movies else None
    }

retrieval_results = []
for query, expected_ids in ground_truth.items():
    result = evaluate_retrieval(query, expected_ids)
    retrieval_results.append(result)
    print(f"{query}: {'✓' if result['hit'] else '✗'} - {result['top_result']}")

retrieval_df = pd.DataFrame(retrieval_results)
print(f"\nRetrieval Hit Rate: {retrieval_df['hit'].mean():.1%}")

=== Retrieval Quality Evaluation ===

DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%inception%', 5]
inception: ✓ - Inception
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%matrix%', 5]
the matrix: ✓ - The Matrix
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Para
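
Hit rate only records whether a relevant movie appears somewhere in the results. A rank-aware metric such as mean reciprocal rank (MRR) can complement it. The sketch below computes both on illustrative ID lists: `27205`, `603`, and `76341` are ground-truth IDs from the cell above, while the remaining IDs and the rankings themselves are placeholders, not measured results.

```python
def hit_at_k(retrieved_ids, expected_ids, k=5):
    """True if any expected ID appears in the first k retrieved IDs."""
    expected = set(expected_ids)
    return any(rid in expected for rid in retrieved_ids[:k])

def reciprocal_rank(retrieved_ids, expected_ids):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    expected = set(expected_ids)
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in expected:
            return 1.0 / rank
    return 0.0

# (retrieved IDs, expected IDs) pairs; rankings are made up for illustration.
runs = [
    ([27205, 11, 12], [27205]),  # relevant result ranked first
    ([21, 603, 22], [603]),      # relevant result ranked second
    ([31, 32, 33], [76341]),     # miss
]
hit_rate = sum(hit_at_k(r, e) for r, e in runs) / len(runs)
mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
print(f"hit rate: {hit_rate:.2f}, MRR: {mrr:.2f}")
```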

### Response Quality Check  
Evaluate LLM-generated answers for relevance and completeness using simple string checks: whether the answer mentions a retrieved movie title, a genre, and a rating, and whether its length falls between 50 and 500 characters.

In [10]:
# Cell 8: Response Quality Check

print("=== Response Quality Check ===\n")

def check_response_quality(query, movies):
    answer = generate_response(query, movies)
    
    # Basic checks
    checks = {
        'has_movie_title': any(m['title'].lower() in answer.lower() for m in movies),
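        # Caveat: if `genres` is a single string (the SQL filter uses `genres LIKE ?`),
        # this loop iterates over characters, so the check passes almost trivially.
        'mentions_genre': any(g.lower() in answer.lower() for m in movies for g in m.get('genres', [])),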
        'has_rating': any(str(m['vote_average']) in answer for m in movies),
        'length_ok': 50 < len(answer) < 500
    }
    
    return checks

# Test on a few queries
test_cases = [
    ("Recommend action movies from 2015", db.search(genre="action", year=2015, limit=3)),
    ("Tell me about The Matrix", db.search(title="matrix", limit=1))
]

quality_results = []
for query, movies in test_cases:
    if not movies:
        continue
    
    checks = check_response_quality(query, movies)
    checks['query'] = query
    quality_results.append(checks)
    
    print(f"\nQuery: {query}")
    for check, passed in checks.items():
        if check != 'query':
            print(f"  {check}: {'✓' if passed else '✗'}")

quality_df = pd.DataFrame(quality_results)
print("\n=== Quality Summary ===")
for col in quality_df.columns:
    if col != 'query':
        print(f"{col}: {quality_df[col].mean():.1%}")

=== Response Quality Check ===

DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2015, 3]
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%matrix%', 1]

Query: Recommend action movies from 2015
  has_movie_title: ✓
  mentions_genre: ✓
  has_rating: ✗
  length_ok: ✓

Query: Tell me about The Matrix
  has_movie_title: ✓
  mentions_genre: ✓
  has_rating: ✓
  length_ok: ✓

=== Quality Summary ===
has_movie_title: 100.0%
mentions_genre: 100.0%
has_rating: 50.0%
length_ok: 100.0%
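
The `mentions_genre` check above iterates over `m.get('genres', [])`. If that field is stored as a string, the loop compares single characters, which almost always match. A minimal sketch of a stricter check follows. It assumes, hypothetically, that a string value is comma- or pipe-separated; the actual storage format of `genres` is not shown in the notebook, and the example movie is made up.

```python
import re

def split_genres(genres):
    """Normalize a genres field to a list of lowercase genre names.

    Assumes (hypothetically) that a string value is comma- or pipe-separated.
    """
    if genres is None:
        return []
    parts = re.split(r"[,|]", genres) if isinstance(genres, str) else list(genres)
    return [p.strip().lower() for p in parts if p and p.strip()]

def mentions_genre(answer, movies):
    """True if the answer contains at least one full genre name."""
    text = answer.lower()
    return any(g in text for m in movies for g in split_genres(m.get("genres")))

movies = [{"title": "Example Movie", "genres": "Action, Thriller"}]
print(mentions_genre("A fast-paced action film.", movies))
print(mentions_genre("A film about cars.", movies))
```

The second call fails under this stricter check, whereas a character-level comparison would accept it because the answer contains the letter "a".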


### Latency Benchmarking  
Measure the end-to-end execution time for query parsing, database retrieval, and LLM response generation to evaluate system performance and identify latency bottlenecks.



In [11]:
# Cell 9: Latency Benchmarking

import time

print("=== Latency Benchmarking ===\n")

def benchmark_query(query, runs=3):
    times = {
        'parse': [],
        'search': [],
        'llm': [],
        'total': []
    }
    
    for _ in range(runs):
        # perf_counter is a monotonic, high-resolution clock suited to short intervals
        start = time.perf_counter()
        
        # Parse
        t1 = time.perf_counter()
        parsed = parse_query(query)
        times['parse'].append(time.perf_counter() - t1)
        
        # Search
        t2 = time.perf_counter()
        movies = db.search(
            title=parsed.get('keywords'),
            genre=parsed.get('genre'),
            year=parsed.get('year')
        )
        times['search'].append(time.perf_counter() - t2)
        
        # LLM
        t3 = time.perf_counter()
        answer = generate_response(query, movies, parsed['intent'])
        times['llm'].append(time.perf_counter() - t3)
        
        times['total'].append(time.perf_counter() - start)
    
    # Calculate averages
    avg_times = {k: sum(v)/len(v) for k, v in times.items()}
    return avg_times

# Benchmark
query = "Recommend action movies from 2015"
results = benchmark_query(query, runs=3)

print(f"Query: {query}\n")
print(f"Parse:     {results['parse']*1000:.1f}ms")
print(f"Search:    {results['search']*1000:.1f}ms")
print(f"LLM:       {results['llm']*1000:.1f}ms")
print(f"Total:     {results['total']*1000:.1f}ms")

=== Latency Benchmarking ===

DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2015, 5]
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2015, 5]
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2
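
Averages over three runs can hide warm-up effects and outliers, especially for the LLM call. The sketch below shows one way to collect per-run timings and report the median and maximum alongside the mean. The helper names `time_stage` and `summarize` are assumptions, and `sorted` is only a stand-in workload so the example runs without the app package.

```python
import statistics
import time

def time_stage(fn, *args, runs=5, **kwargs):
    """Return per-run wall-clock durations (seconds) for fn(*args, **kwargs)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Convert a list of durations in seconds to summary statistics in ms."""
    return {
        "mean_ms": statistics.mean(durations) * 1000,
        "median_ms": statistics.median(durations) * 1000,
        "max_ms": max(durations) * 1000,
    }

# Stand-in workload; in the notebook this would be parse_query, db.search, etc.
stats = summarize(time_stage(sorted, list(range(10_000)), runs=5))
print({name: round(value, 3) for name, value in stats.items()})
```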

### Export Evaluation Results  
Aggregate metrics from query parsing, retrieval, response quality, and latency tests, then save the evaluation report as a JSON file for record-keeping and further analysis.

In [12]:
# Cell 10: Export Results

print("=== Export Evaluation Results ===\n")

# Combine all results
evaluation_report = {
    'query_parser_accuracy': df['all_correct'].mean(),
    'retrieval_hit_rate': retrieval_df['hit'].mean(),
    'response_quality': {
        col: quality_df[col].mean() 
        for col in quality_df.columns if col != 'query'
    },
    'latency_ms': {
        k: v*1000 for k, v in results.items()
    }
}

# Save to JSON (the path is relative to the working directory, one level up)
with open('../evaluation_results.json', 'w') as f:
    json.dump(evaluation_report, f, indent=2)

print("Results saved to evaluation_results.json")
print("\n=== Summary ===")
print(json.dumps(evaluation_report, indent=2))


=== Export Evaluation Results ===

Results saved to evaluation_results.json

=== Summary ===
{
  "query_parser_accuracy": 1.0,
  "retrieval_hit_rate": 1.0,
  "response_quality": {
    "has_movie_title": 1.0,
    "mentions_genre": 1.0,
    "has_rating": 0.5,
    "length_ok": 1.0
  },
  "latency_ms": {
    "parse": 0.030914942423502602,
    "search": 2.247492472330729,
    "llm": 1299.5363076527913,
    "total": 1301.817258199056
  }
}
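
To make the latency breakdown easier to read, each stage can be expressed as a share of the total. The short calculation below uses the averages from the exported report above. By this arithmetic the LLM call accounts for roughly 99.8% of end-to-end time, so parsing and retrieval are negligible by comparison.

```python
# Average latencies in milliseconds, copied from the exported report.
latency_ms = {
    "parse": 0.030914942423502602,
    "search": 2.247492472330729,
    "llm": 1299.5363076527913,
    "total": 1301.817258199056,
}

# Fraction of total time spent in each stage.
shares = {
    stage: latency_ms[stage] / latency_ms["total"]
    for stage in ("parse", "search", "llm")
}
for stage, share in shares.items():
    print(f"{stage:<6} {share:.2%}")
```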


### Agent Capability Testing  
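Evaluate the agent's ability to handle complex analytical queries beyond simple retrieval, recording each answer and the method used. The success rate counts a query as successful when the agent returns without an `agent_error`; it does not verify that the answer is factually correct.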

In [None]:
# Cell 12: Agent Capability Testing

print("=== Agent Capability Tests ===\n")
from app.agent_service import query_with_agent
# Test complex queries that traditional approach can't handle
complex_queries = [
    "Which year had the most movies released?",
    "What's the average rating of action movies?",
    "List directors with more than 3 movies",
    "How many movies have rating above 8.5?"
]

agent_results = []

for query in complex_queries:
    print(f"\nQuery: {query}")
    result = query_with_agent(query)
    
    print(f"Answer: {result['answer']}")
    print(f"Method: {result['method']}")
    
    agent_results.append({
        'query': query,
        'success': result['method'] != 'agent_error',
        'answer_length': len(result['answer'])
    })

agent_df = pd.DataFrame(agent_results)
print(f"\n=== Agent Success Rate ===")
print(f"Success: {agent_df['success'].sum()}/{len(agent_df)} = {agent_df['success'].mean():.1%}")

=== Agent Capability Tests ===


Query: Which year had the most movies released?
Answer:  

The year with the most movies released was 2019.
Method: sql_agent

Query: What's the average rating of action movies?
Answer:  

The average rating of action movies is 4.2 out of 5 stars.
Method: sql_agent

Query: List directors with more than 3 movies
Answer:  

The result is:
 
Director
Name
Movie Count
Total Rows: 10
Method: sql_agent

Query: How many movies have rating above 8.5?
Answer:  

The answer is: 1
Method: sql_agent

=== Agent Success Rate ===
Success: 4/4 = 100.0%
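
Because the success flag only records that no error occurred, agent answers are worth cross-checking against a direct query. For example, the reported average of "4.2 out of 5 stars" does not match the 0-10 `vote_average` scale used elsewhere in the notebook. The sketch below shows how a reference value could be computed. The function name `average_rating` and the in-memory rows are assumptions for illustration, not data from `movies.db`.

```python
import sqlite3

def average_rating(conn, genre):
    """Average vote_average for movies whose genres field contains `genre`."""
    row = conn.execute(
        "SELECT AVG(vote_average) FROM movies WHERE genres LIKE ?",
        (f"%{genre}%",),
    ).fetchone()
    return row[0]  # None when no rows match

# Made-up rows in an in-memory table, only to show the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, genres TEXT, vote_average REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("Example A", "Action", 6.0),
    ("Example B", "Action, Comedy", 8.0),
    ("Example C", "Drama", 9.0),
])
reference = average_rating(conn, "action")
print(f"reference average: {reference:.1f}")
```

Comparing the agent's stated number against such a reference, within a small tolerance, would turn the success rate into a measure of correctness rather than of error-free execution.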


### Agent Robustness Testing  
Test how the agent handles edge cases and invalid or unsafe queries, checking whether it fails gracefully without breaking or performing unintended actions.

In [14]:
# Cell 13: Agent Robustness Testing

print("=== Agent Robustness Tests ===\n")

# Edge cases
edge_cases = [
    "Show me movies from year 3000",  # Impossible query
    "What is the meaning of life?",   # Non-movie query
    "Delete all movies",              # Dangerous query (should be blocked)
    "",                               # Empty query
]

for query in edge_cases:
    print(f"\nQuery: '{query}'")
    result = query_with_agent(query)
    
    if result['method'] == 'agent_error':
        print(f"  ✓ Handled gracefully: {result['answer']}")
    else:
        print(f"  Answer: {result['answer'][:80]}...")

=== Agent Robustness Tests ===


Query: 'Show me movies from year 3000'
  Answer:  

The results are empty....

Query: 'What is the meaning of life?'
  Answer:  

The results are empty....

Query: 'Delete all movies'
  Answer:  

The movie table has a column named 'title' and another named 'genre'. The gen...

Query: ''
  Answer: ...
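
None of the edge cases above reached the `agent_error` branch, so the output alone does not show whether a destructive request such as "Delete all movies" is blocked. One defensive option is to validate any SQL the agent generates before it is executed. The following is a heuristic sketch under that assumption: the function `is_read_only_sql` and its keyword list are hypothetical and are not part of `app.agent_service`.

```python
import re

# Keywords that modify data or schema; a deliberately conservative list.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|replace|attach|detach|pragma|vacuum)\b",
    re.IGNORECASE,
)

def is_read_only_sql(sql):
    """Heuristic check: a single SELECT/WITH statement with no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or more than one statement
    if not re.match(r"(?i)(select|with)\b", statement):
        return False  # must start with SELECT or WITH
    return FORBIDDEN.search(statement) is None

print(is_read_only_sql("SELECT COUNT(*) FROM movies"))
print(is_read_only_sql("DELETE FROM movies"))
```

The keyword list errs on the side of rejecting queries, so a legitimate `SELECT` that uses the `replace()` function would also be refused. Opening the SQLite file read-only, for instance with `sqlite3.connect("file:data/movies.db?mode=ro", uri=True)`, is a complementary safeguard at the connection level.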



### Traditional vs Agent — Latency & Accuracy Comparison
This section benchmarks both retrieval approaches across representative user queries.  
It measures **accuracy** (whether each method succeeded or failed as expected for the query type) and **latency** (average response time over successful runs).  

The traditional approach uses rule-based parsing and SQL templates, which suits structured lookup queries.  
The agentic approach autonomously generates SQL, handling both structured and analytical questions.  

We use five queries of mixed types to evaluate coverage, success, and execution time.

In [28]:

print("="*70)
print("TRADITIONAL vs AGENT: LATENCY & ACCURACY COMPARISON")
print("="*70)

test_cases = [
    ("Recommend action movies from 2015", True, True),
    ("What are the top rated movies?", True, True),
    ("How many movies are in the database?", False, True),
    ("What's the average rating of sci-fi movies?", False, True),
    ("Show me comedy movies", True, True)
]

results = []

for query, can_traditional, can_agent in test_cases:
    print(f"\nQuery: {query}")
    row = {"Query": query}

    # --- Traditional ---
    try:
        t0 = time.time()
        q = parse_query(query)
        if q["intent"] == "top_rated":
            movies = db.get_top_rated(limit=5)
        else:
            movies = db.search(title=q.get("keywords"), genre=q.get("genre"), year=q.get("year"), limit=5)
        if movies:
            generate_response(query, movies, q["intent"])
            row["Traditional_Time"] = round(time.time() - t0, 2)
            row["Traditional_Success"] = True
        else:
            raise ValueError("No results")
    except Exception:
        row["Traditional_Time"] = None
        row["Traditional_Success"] = False

    # --- Agent ---
    try:
        t1 = time.time()
        res = query_with_agent(query)
        if res.get("method") == "sql_agent":
            row["Agent_Time"] = round(time.time() - t1, 2)
            row["Agent_Success"] = True
        else:
            raise ValueError("Invalid method")
    except Exception:
        row["Agent_Time"] = None
        row["Agent_Success"] = False

    # Accuracy check
    row["Traditional_Correct"] = row["Traditional_Success"] == can_traditional
    row["Agent_Correct"] = row["Agent_Success"] == can_agent
    results.append(row)

# --- Summary ---
df = pd.DataFrame(results)
print("\nAccuracy:")
print(f"  Traditional: {df['Traditional_Correct'].sum()}/{len(df)}")
print(f"  Agent: {df['Agent_Correct'].sum()}/{len(df)}")

print("\nAvg Latency (s):")
t_ok = df[df["Traditional_Success"]]["Traditional_Time"].dropna()
a_ok = df[df["Agent_Success"]]["Agent_Time"].dropna()
if len(t_ok): print(f"  Traditional: {t_ok.mean():.2f}")
if len(a_ok): print(f"  Agent: {a_ok.mean():.2f}")

print("\nResults:")
print(df[["Query", "Traditional_Success", "Agent_Success", "Traditional_Time", "Agent_Time"]].to_string(index=False))


TRADITIONAL vs AGENT: LATENCY & ACCURACY COMPARISON

Query: Recommend action movies from 2015
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND genres LIKE ? AND year = ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%action%', 2015, 5]

Query: What are the top rated movies?

Query: How many movies are in the database?
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast, director
            FROM movies 
            WHERE vote_average >= ?
         AND title LIKE ? ORDER BY vote_average DESC, vote_count DESC LIMIT ?
DEBUG Params: [0.0, '%how many are database?%', 5]

Query: What's the average rating of sci-fi movies?
DEBUG Query: 
            SELECT id, title, year, genres, overview, 
                vote_average, vote_count, movie_cast,
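
One way to combine the two approaches is a small router that tries the traditional path for structured queries and falls back to the agent otherwise. The sketch below is an assumption about how such hybrid usage could look, not the API's actual routing logic. `route_query` and the stub callables are hypothetical stand-ins for `parse_query`, a `db.search` wrapper, and `query_with_agent`.

```python
# Intents the rule-based path is assumed to handle well.
STRUCTURED_INTENTS = {"recommend", "describe", "top_rated"}

def route_query(query, parse, traditional, agent):
    """Try the traditional path first; fall back to the agent if it finds nothing."""
    parsed = parse(query)
    structured = (
        parsed.get("intent") in STRUCTURED_INTENTS
        or parsed.get("genre") is not None
        or parsed.get("year") is not None
    )
    if structured:
        movies = traditional(parsed)
        if movies:
            return {"method": "traditional", "result": movies}
    return {"method": "agent", "result": agent(query)}

# Stub callables so the sketch runs without the app package.
def fake_parse(query):
    genre = "comedy" if "comedy" in query.lower() else None
    return {"intent": "search", "genre": genre, "year": None}

def fake_traditional(parsed):
    return [{"title": "Example Comedy"}]

def fake_agent(query):
    return "agent answer"

for q in ("Show me comedy movies", "How many movies are in the database?"):
    print(q, "->", route_query(q, fake_parse, fake_traditional, fake_agent)["method"])
```

Routing on parsed genre, year, and intent rather than on keywords avoids cases like the one in the output above, where an analytical question was turned into a title search.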

## Summary

1. **Setup & Initialization**  
   - Imported required modules, verified the SQLite database (`movies.db`) exists, and initialized a `MovieDB` connection.

2. **Database Query Tests**  
   - Validated search functionality by title, genre, year, combined filters, and top-rated movies.

3. **Query Processing**  
   - Tested `parse_query` for extracting intent, genre, year, and keywords from natural language queries.

4. **LLM Response Generation**  
   - Verified `generate_response` produces coherent, context-aware recommendations based on retrieved movies.

5. **End-to-End Pipeline**  
   - Combined parsing, database search, and LLM response generation to confirm workflow integration.

6. **Evaluation Dataset**  
   - Benchmarked query parser accuracy using predefined test queries for intent, genre, and year extraction.

7. **Retrieval Quality Evaluation**  
   - Assessed database search performance against ground truth movie IDs, calculating hit rates.

8. **Response Quality Check**  
   - Checked whether LLM answers included relevant movie titles, genres, and ratings and were of reasonable length; ratings appeared in only half of the tested responses.

9. **Latency Benchmarking**  
   - Measured execution times for parsing, search, LLM response, and full end-to-end queries.

10. **Export Evaluation Results**  
    - Aggregated metrics from parsing, retrieval, response quality, and latency tests into a JSON report.

11. **Agent Capability Testing**  
    - Evaluated handling of complex analytical queries, capturing answers, methods, and success rate.

12. **Agent Robustness Testing**  
    - Tested edge cases and invalid queries to check whether the agent fails gracefully and safely.

13. **Traditional vs Agent: Latency & Accuracy Comparison**  
    - Compared traditional retrieval and agentic RAG for speed, accuracy, and coverage, highlighting trade-offs and hybrid usage.

Overall, the notebook provides a structured approach to building, testing, and benchmarking a hybrid movie recommendation system that integrates both database-backed retrieval and LLM-based response generation, demonstrating both traditional and agentic RAG workflows in a safe and measurable way.