# SePA Web Search Pipeline Demo

This notebook demonstrates the SePA web search system with:
- Multi-query expansion (4 diverse queries)
- Reciprocal Rank Fusion with domain boosting
- GPU-accelerated embeddings and reranking
- Context-aware response generation
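
As a rough illustration of the fusion step listed above, here is a minimal sketch of Reciprocal Rank Fusion with a trusted-domain boost. It is not the code in `src.ranking`, which this notebook does not show. The function name `rrf_fuse`, the host-matching rule, and the choice to multiply the fused score by `boost_factor` are all assumptions made for the example. Only the core RRF score, the sum of `1 / (k + rank)` over every list a result appears in, is standard.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rrf_fuse(result_lists, trusted_domains=(), k=60, boost_factor=2.0):
    """Fuse ranked lists of result dicts (each with a 'link' key) using RRF.

    Hypothetical helper for illustration; the real fusion class may differ.
    """
    scores = defaultdict(float)
    first_seen = {}
    for results in result_lists:
        for rank, result in enumerate(results, start=1):
            url = result["link"]
            # Standard RRF contribution: 1 / (k + rank), summed across lists.
            scores[url] += 1.0 / (k + rank)
            first_seen.setdefault(url, result)

    fused = []
    for url, score in scores.items():
        host = urlparse(url).netloc.lower()
        # Assumption: a result is trusted if its host equals a trusted domain
        # or is a subdomain of one.
        trusted = any(host == d or host.endswith("." + d) for d in trusted_domains)
        if trusted:
            # Assumption: the boost multiplies the fused score.
            score *= boost_factor
        fused.append((first_seen[url], score))

    # Highest fused score first.
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return fused


lists = [
    [{"link": "https://a.example/x"}, {"link": "https://b.example/y"}],
    [{"link": "https://b.example/y"}, {"link": "https://c.example/z"}],
]
fused = rrf_fuse(lists, trusted_domains=["c.example"])
for result, score in fused:
    print(f"{score:.4f}  {result['link']}")
```

In this toy input, `b.example` appears in both lists and ranks first. The boosted `c.example` result overtakes `a.example` even though it appears only once, at rank 2.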

## 1. Setup and Initialization

In [None]:
import sys
import os
from pathlib import Path
import asyncio
import json
import warnings
from dotenv import load_dotenv

# Setup paths and environment
sys.path.append(str(Path().absolute().parent))
load_dotenv(Path().absolute().parent / '.env')

# Import search components
from src.web_search import RAG
from src.response import ResponseGenerator
from src.config import Config
from src.ranking import RecipRocalRankFusion

# Suppress deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning, module="langchain")

config = Config()
print("Configuration:")
print(f"  - Embedding Model: {config.EMBEDDING_MODEL}")
print(f"  - Query Expansion: {config.QUERY_EXPANSION_MODEL}")
print(f"  - RRF: k={config.RRF_CONFIG['k']}, boost={config.RRF_CONFIG['boost_factor']}x")
print(f"  - Trusted Domains: {len(config.TRUSTED_DOMAINS)} configured")

rag = RAG(config=config)
response_gen = ResponseGenerator(config)
print("\nComponents initialized successfully")

Configuration:
  - Embedding Model: BAAI/bge-base-en-v1.5
  - Query Expansion: gpt-4o-mini
  - RRF: k=60, boost=2.0x
  - Trusted Domains: 23 configured

Components initialized successfully


## 2. Complete Pipeline Demonstration with Detailed Logging
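
The cell in this section uses top-level `await`, which works in Jupyter/IPython but is a syntax error in a plain Python script. The sketch below shows one way the same pattern could be run outside a notebook. `fake_search` is a hypothetical stand-in for `rag._search_web` that returns canned data, and `run_searches` uses `asyncio.gather` to run the searches concurrently, which is a design choice for the example rather than what the notebook does (it awaits each search in turn).

```python
import asyncio


async def fake_search(query: str) -> list[dict]:
    # Hypothetical stand-in for rag._search_web; returns canned results.
    await asyncio.sleep(0)
    return [{"title": f"Result for {query}", "link": "https://example.com"}]


async def run_searches(queries: list[str]) -> list[list[dict]]:
    # Run all searches concurrently; results keep the order of `queries`.
    return await asyncio.gather(*(fake_search(q) for q in queries))


def main() -> list[list[dict]]:
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(run_searches(["query one", "query two"]))


if __name__ == "__main__":
    for i, results in enumerate(main(), 1):
        print(f"Query {i}: Retrieved {len(results)} results")
```

Inside a notebook an event loop is already running, so `asyncio.run` would raise an error there; call `await run_searches(...)` directly instead.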

In [None]:
# Define query and context
query = "What are the most effective plyometric exercises to increase vertical jump for basketball?"

user_context = {
    "age": 22,
    "gender": "male", 
    "sport": "basketball",
    "level": "college athlete",
}

conversation_history = [
    {"role": "user", "content": "I've been training for 3 months to increase my vertical jump"},
    {"role": "assistant", "content": "That's great dedication! Improving vertical jump requires a combination of strength training, plyometrics, and proper recovery."},
    {"role": "user", "content": "Mostly squats and box jumps"}
]

print("="*70)
print("SEPA WEB SEARCH PIPELINE DEMONSTRATION")
print("="*70)
print(f"\nQuery: {query}")
print(f"User: {user_context['age']}-year-old {user_context['gender']} {user_context['sport']} athlete")
print(f"Conversation context: {len(conversation_history)} previous messages")
print("\n" + "-"*70)

# 1. Query Expansion
print("\n1. QUERY EXPANSION")
print("-"*30)
expanded_queries = await rag._generate_multi_queries(query, user_context)
for i, q in enumerate(expanded_queries, 1):
    print(f"  Query {i}: {q}")

# 2. Web Search with Multiple Queries
print("\n2. WEB SEARCH (Multiple Queries)")
print("-"*30)
all_search_results = []
for i, exp_query in enumerate(expanded_queries, 1):
    results = await rag._search_web(exp_query)
    all_search_results.append(results)
    print(f"  Query {i}: Retrieved {len(results)} results")

# Count results across all queries (the per-query lists are passed to RRF as-is)
print(f"\n  Total results before RRF: {sum(len(r) for r in all_search_results)}")

# 3. Reciprocal Rank Fusion
print("\n3. RECIPROCAL RANK FUSION (RRF)")
print("-"*30)
rrf = RecipRocalRankFusion(
    k=config.RRF_CONFIG['k'],
    boost_factor=config.RRF_CONFIG['boost_factor']
)

# Apply RRF with domain boosting
fused_results = rrf.fuse(
    result_lists=all_search_results,
    trusted_domains=config.TRUSTED_DOMAINS,
    return_scores=True
)

print(f"  RRF Parameters: k={config.RRF_CONFIG['k']}, boost={config.RRF_CONFIG['boost_factor']}x")
print(f"  Results after fusion: {len(fused_results)}")
print("\n  Top 5 results with RRF scores:")
for i, (result, score) in enumerate(fused_results[:5], 1):
    url = result.get('link', 'No URL')
    is_trusted = any(domain in url for domain in config.TRUSTED_DOMAINS)
    boost_indicator = " [BOOSTED]" if is_trusted else ""
    print(f"    [{i}] Score: {score:.4f}{boost_indicator}")
    print(f"        {result.get('title', 'No title')[:50]}...")
    print(f"        {url}")

# Extract just the results for processing
search_results = [r[0] for r in fused_results[:config.SEARCH_CONFIG['max_fetch_candidates']]]

print(f"\n  Selected top {len(search_results)} results for content fetching")

print("\n4. CONTENT FETCHING & CHUNKING")
print("-"*30)
print("  Fetching content from URLs...")

# List the URLs selected for fetching. The fetching and chunking themselves
# happen inside process_query below; this loop only reports the candidates.
for i, result in enumerate(search_results, 1):
    url = result.get('link', '')
    print(f"\n  [{i}] Fetching: {url[:60]}...")
    print("      Status: Attempting fetch...")

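# Run the full pipeline to build the vector store. process_query receives only
# the query and user context, so it presumably repeats the expansion, search,
# and fusion steps shown above rather than reusing search_results.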
vector_store = await rag.process_query(
    query=query,
    user_context=user_context
)

# Get documents to analyze chunking
retriever = vector_store.as_retriever(search_kwargs={"k": 20})
docs = await retriever.ainvoke(query)

# Analyze chunking results
unique_sources = {}
for doc in docs:
    source = doc.metadata.get('source', '')
    if source not in unique_sources:
        unique_sources[source] = []
    unique_sources[source].append(len(doc.page_content))

print(f"\n  Chunking Results:")
print(f"    Total chunks created: {len(docs)}")
print(f"    From {len(unique_sources)} unique sources")
if docs:
    print(f"    Average chunk size: {sum(len(d.page_content) for d in docs) / len(docs):.0f} chars")

for source, chunk_sizes in list(unique_sources.items())[:3]:
    print(f"\n    Source: {source[:50]}...")
    print(f"      Chunks: {len(chunk_sizes)}, Sizes: {chunk_sizes[:3]}...")

print("\n5. EMBEDDING GENERATION")
print("-"*30)
print(f"  Generating embeddings for {len(docs)} chunks")
print(f"  Using model: {config.EMBEDDING_MODEL}")
print(f"  Embedding dimension: 768")
print(f"  Processing via HuggingFace Inference Endpoint")

print(f"\n  Sample chunk for embedding:")
if docs:
    sample_chunk = docs[0].page_content[:200]
    print(f"    \"{sample_chunk}...\"")
    print(f"    → Embedded to 768-dimensional vector")

print("\n6. SIMILARITY SEARCH")
print("-"*30)
print(f"  Searching vector store with query: \"{query[:50]}...\"")

docs_with_scores = await vector_store.asimilarity_search_with_score(query, k=10)

print(f"\n  Top 5 results by cosine similarity:")
for i, (doc, score) in enumerate(docs_with_scores[:5], 1):
    source = doc.metadata.get('source', 'unknown')
    # Assumes the vector store returns cosine distance, so similarity = 1 - distance
    print(f"    [{i}] Similarity: {1 - score:.4f}")
    print(f"        Source: {source[:60]}...")
    print(f"        Preview: {doc.page_content[:100]}...")

print("\n7. CROSS-ENCODER RERANKING")
print("-"*30)
print(f"  Reranking top {len(docs_with_scores)} documents")
print(f"  Using model: {config.CROSS_ENCODER_MODEL}")

from src.inference import score_with_cross_encoder

pairs = [(query, doc.page_content[:1000]) for doc, _ in docs_with_scores]
rerank_scores = await score_with_cross_encoder(config.CROSS_ENCODER_MODEL, pairs)

reranked_docs = list(zip(docs_with_scores, rerank_scores))
reranked_docs.sort(key=lambda x: x[1], reverse=True)

print(f"\n  Top 5 results after reranking:")
for i, ((doc, _), rerank_score) in enumerate(reranked_docs[:5], 1):
    source = doc.metadata.get('source', 'unknown')
    print(f"    [{i}] Rerank Score: {rerank_score:.4f}")
    print(f"        Source: {source[:60]}...")
    print(f"        Preview: {doc.page_content[:100]}...")

print("\n8. RESPONSE GENERATION")
print("-"*30)

result = await response_gen.rag_generate_response(
    user_input=query,
    user_context=user_context,
    conversation_history=conversation_history,
    verbosity_level="moderate",
    vector_store=vector_store
)

print(f"  Context provided to LLM: Top {config.SEARCH_CONFIG['retrieval_k']} reranked documents")
print(f"  Response model: GPT-4o")
print(f"  Generated {len(result['answer'])} character response")
print(f"  Cited {len(result.get('sources', []))} sources")

print("\n" + "="*70)
print("FINAL RESPONSE")
print("="*70)
print(result['answer'])

if result.get('sources'):
    print("\n" + "="*70)
    print("SOURCES CITED")
    print("="*70)
    for i, source in enumerate(result['sources'], 1):
        print(f"\n[{i}] {source.get('title', 'No title')[:70]}")
        print(f"    URL: {source.get('url', 'No URL')}")
        print(f"    Relevance: {source.get('relevance_score', 0.0):.3f}")

print("\n" + "="*70)
print("PIPELINE SUMMARY")
print("="*70)
print(f"Queries generated:        {len(expanded_queries)}")
print(f"Search results (total):   {sum(len(r) for r in all_search_results)}")
print(f"After RRF fusion:         {len(fused_results)}")
print(f"Documents fetched:        {len(unique_sources)}")
print(f"Chunks created:           {len(docs)}")
print(f"After similarity search:  {len(docs_with_scores)}")
print(f"After reranking:          {len(reranked_docs)}")
print(f"Sources cited:            {len(result.get('sources', []))}")
print(f"Processing success rate:  {len(unique_sources)/max(len(search_results), 1)*100:.1f}%")

SEPA WEB SEARCH PIPELINE DEMONSTRATION

Query: What are the most effective plyometric exercises to increase vertical jump for basketball?
User: 22-year-old male basketball athlete
Conversation context: 3 previous messages

----------------------------------------------------------------------

1. QUERY EXPANSION
------------------------------
  Query 1: What are the most effective plyometric exercises to increase vertical jump for basketball?
  Query 2: best plyometric exercises to improve vertical jump for basketball
  Query 3: scientific studies on plyometrics for enhancing vertical jump performance
  Query 4: effective plyometric workouts for young male basketball athletes

2. WEB SEARCH (Multiple Queries)
------------------------------
  Query 1: Retrieved 8 results
  Query 2: Retrieved 8 results
  Query 3: Retrieved 8 results
  Query 4: Retrieved 8 results

  Total results before RRF: 32

3. RECIPROCAL RANK FUSION (RRF)
------------------------------
  RRF Parameters: k=60, boost=


  Context provided to LLM: Top 5 reranked documents
  Response model: GPT-4o
  Generated 783 character response
  Cited 2 sources

FINAL RESPONSE
To effectively increase your vertical jump for basketball, focusing on plyometric exercises is key. These exercises enhance explosive power by strengthening fast-twitch muscle fibers, crucial for converting strength into speed [2]. Recommended exercises include box jumps, depth jumps, squat jumps, and bounding drills, which are specifically designed to improve explosiveness and tendon strength, thereby reducing injury risk and boosting performance [1][2]. Additionally, incorporating equipment like VertIMax, kettlebells, and squat racks can provide consistent resistance and further aid in explosiveness training [2]. For detailed demonstrations of these exercises, you might find it helpful to watch resources such as "Top 5 Plyometric Exercises To Jump Higher" on YouTube [1].

SOURCES CITED

[1] Top 5 Plyometric Exercises To Jump Higher - YouTub