# NLP CA6 - RAG Pipeline with LangChain, FAISS, TogetherAI, Tavily, and LangGraph

This notebook implements a complete Retrieval-Augmented Generation (RAG) system for Persian/English documents, following the assignment specification.



We will cover:

1. Vector representations and FAISS vector store with caching

2. Retriever(s): FAISS semantic, BM25 lexical, and an Ensemble retriever

3. Router Chain using TogetherAI (Meta Llama 3 70B, temperature=0)

4. Search Engine Chain using Tavily

5. (Optional) Relevancy Check Chain

6. Fallback Chain

7. Generate-with-Context Chain

8. Full agent graph with LangGraph



All explanations are in English; code supports Persian content.

## 0) Environment Setup

We install and import required libraries. Set your API keys in environment variables:



- `TOGETHER_API_KEY` for TogetherAI

- `TAVILY_API_KEY` for Tavily



We use CPU-compatible packages (such as `faiss-cpu`) so the notebook runs without a GPU; switch to GPU builds if one is available.

In [None]:
# Install core dependencies (run once)

%pip -q install -U langchain langchain-community langchain-huggingface langgraph pydantic==2.*

# pypdf is needed by PyPDFLoader and rank_bm25 by BM25Retriever (both used below)

%pip -q install -U faiss-cpu sentence-transformers pypdf rank_bm25

%pip -q install -U langchain-together tavily-python



import os

from pathlib import Path



# Set API keys via environment variables (edit as needed)

# os.environ["TOGETHER_API_KEY"] = "<your_together_api_key>"

# os.environ["TAVILY_API_KEY"] = "<your_tavily_api_key>"



# NOTE: machine-specific absolute path; change it to your own project root before running

BASE_DIR = Path("/Users/tahamajs/Documents/uni/NLP/nlp-assignments-spring-2023/NLP_UT/last/NLP-CA6")

DATA_DIR = BASE_DIR / "data"

DOCS_DIR = DATA_DIR / "docs"

CACHE_DIR = DATA_DIR / "emb_cache"

VSTORE_DIR = DATA_DIR / "faiss_store"



for d in [DATA_DIR, DOCS_DIR, CACHE_DIR, VSTORE_DIR]:

    d.mkdir(parents=True, exist_ok=True)



BASE_DIR, DATA_DIR, DOCS_DIR, CACHE_DIR, VSTORE_DIR

## 1) Data Loading and Chunking



We load documents (e.g., PDF, text files) and chunk them into manageable pieces for embedding and retrieval.



**Key Concepts:**

- **Document Loaders**: LangChain provides loaders for PDFs, text files, etc.

- **Text Splitting**: Break long documents into chunks with overlap to preserve context.

- **Chunk Size**: Balance between context (larger chunks) and precision (smaller chunks).

- **Overlap**: Ensures important information at chunk boundaries isn't lost.



For this assignment, you should place your reference book (NLP textbook) chapters in the `docs/` folder as PDF or text files.
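
The interaction between chunk size and overlap can be sketched in a few lines of plain Python. This is a simplified illustration, not the `RecursiveCharacterTextSplitter` algorithm (which also tries to split on separators such as paragraph breaks); the helper name `chunk_text` and the toy sizes are assumptions made for the example.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy example with small numbers so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is how a sentence that straddles a boundary can still appear whole in at least one chunk.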

In [None]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter



# Load documents from the docs directory

# Adjust loader based on your file types (PDF, txt, etc.)

raw_documents = []

# Try PDFs first, then fall back to plain-text files.
# The fallback also runs when no PDF is found, not only when loading raises an error.
for glob_pattern, loader_cls, label in [
    ("**/*.pdf", PyPDFLoader, "PDF pages"),
    ("**/*.txt", TextLoader, "text documents"),
]:
    try:
        loader = DirectoryLoader(
            str(DOCS_DIR),
            glob=glob_pattern,
            loader_cls=loader_cls,
            show_progress=True,
        )
        raw_documents = loader.load()
    except Exception as e:
        print(f"Error loading {label}: {e}")
        raw_documents = []

    if raw_documents:
        print(f"Loaded {len(raw_documents)} {label}")
        break



# Show sample

if raw_documents:

    print(f"\nSample document (first 300 chars):\n{raw_documents[0].page_content[:300]}...")

else:

    print("\n⚠️ No documents found. Please add PDF or text files to the docs/ directory.")

In [None]:
# Chunk documents into smaller pieces

text_splitter = RecursiveCharacterTextSplitter(

    chunk_size=1000,        # characters per chunk

    chunk_overlap=200,      # overlap to preserve context

    length_function=len,

    separators=["\n\n", "\n", " ", ""]

)



chunked_documents = text_splitter.split_documents(raw_documents)

print(f"\nTotal chunks created: {len(chunked_documents)}")

# Guard against an empty corpus (avoids IndexError and division by zero)
if chunked_documents:
    print(f"Sample chunk (first 200 chars):\n{chunked_documents[0].page_content[:200]}...")

    # Statistics
    chunk_lengths = [len(doc.page_content) for doc in chunked_documents]
    print("\nChunk Statistics:")
    print(f"  Average chunk length: {sum(chunk_lengths)/len(chunk_lengths):.0f} chars")
    print(f"  Min chunk length: {min(chunk_lengths)} chars")
    print(f"  Max chunk length: {max(chunk_lengths)} chars")
else:
    print("No chunks created. Add documents to the docs/ directory and re-run the loading cell.")

## 2) Vector Representations and FAISS Vector Store



### Part A: Embeddings with HuggingFace and FAISS



We use **HuggingFaceEmbeddings** to create vector representations of text chunks, then store them in **FAISS** (Facebook AI Similarity Search) for efficient retrieval.



**Key Concepts:**



#### Why Embeddings?

- Convert text into dense numerical vectors (e.g., 768-dimensional)

- Semantic similarity: similar texts have similar vectors

- Enable vector search (find semantically similar documents)
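
To make "similar texts have similar vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values, not the output of any real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, purely illustrative)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_close:.3f}")
print(f"cat vs car:    {sim_far:.3f}")
```

When embeddings are L2-normalized, the inner product of two vectors equals their cosine similarity, which is one reason normalization is commonly enabled before indexing.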



#### FAISS:

- Developed by Meta (Facebook AI)

- Optimized for fast similarity search in high-dimensional spaces

- Supports billions of vectors with efficient indexing

- CPU and GPU implementations available



#### Cache-Backed Embeddings:

- **Problem**: Re-embedding documents on every run is slow and expensive

- **Solution**: `CacheBackedEmbeddings` stores embeddings on disk

- First run: compute and cache embeddings

- Subsequent runs: load from cache (much faster)

- Trade-off: disk I/O vs GPU computation time



For GPU-heavy workloads with many documents, re-computing on GPU might be faster than disk reads, but for moderate datasets, caching is highly beneficial.
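
The caching idea can be sketched without any library: hash each text to a key, and only call the embedding model when the key is missing. The class `CachedEmbedder` and the function `toy_embed` below are hypothetical stand-ins written for this illustration; they are not how `CacheBackedEmbeddings` is implemented internally, and a plain dict stands in for the on-disk store.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function with a key-value cache keyed by a hash of the text."""

    def __init__(self, embed_fn, store=None):
        self.embed_fn = embed_fn
        self.store = {} if store is None else store  # dict stands in for a disk store
        self.misses = 0  # counts how often the underlying model was actually called

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real model: [length, number of spaces]."""
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(toy_embed)
first = embedder.embed("natural language processing")
second = embedder.embed("natural language processing")  # served from the cache
print(first == second, embedder.misses)
```

The second call never reaches `toy_embed`, which is the saving that matters when the real embedding function is a neural network.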

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

from langchain.storage import LocalFileStore

from langchain.embeddings import CacheBackedEmbeddings

from langchain_community.vectorstores import FAISS



# Initialize the base embedder

# Default model: sentence-transformers/all-MiniLM-L6-v2 (English, 384-dim)

# For Persian/multilingual: consider "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

base_embedder = HuggingFaceEmbeddings(

    model_name="sentence-transformers/all-MiniLM-L6-v2",

    model_kwargs={'device': 'cpu'},  # Change to 'cuda' if GPU available

    encode_kwargs={'normalize_embeddings': True}

)



# Setup cache-backed embeddings

cache_store = LocalFileStore(str(CACHE_DIR))

cached_embedder = CacheBackedEmbeddings.from_bytes_store(

    base_embedder,

    cache_store,

    namespace="huggingface_embeddings"

)



print("Embedder initialized with caching enabled")

print(f"Cache directory: {CACHE_DIR}")

In [None]:
# Create or load FAISS vector store

import time



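# save_local(folder, "faiss_index") writes faiss_index.faiss and faiss_index.pkl inside the folder,
# so check for the .faiss file; a bare "faiss_index" path never exists and forces a rebuild every run

faiss_index_path = VSTORE_DIR / "faiss_index.faiss"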



if faiss_index_path.exists():

    print("Loading existing FAISS index...")

    vectorstore = FAISS.load_local(

        str(VSTORE_DIR),

        cached_embedder,

        "faiss_index",

        allow_dangerous_deserialization=True

    )

    print(f"Loaded {vectorstore.index.ntotal} vectors from disk")

else:

    print("Creating new FAISS index (this may take a few minutes on first run)...")

    start_time = time.time()

    

    vectorstore = FAISS.from_documents(

        chunked_documents,

        cached_embedder

    )

    

    elapsed = time.time() - start_time

    print(f"FAISS index created in {elapsed:.1f} seconds")

    print(f"Total vectors: {vectorstore.index.ntotal}")

    

    # Save to disk

    vectorstore.save_local(str(VSTORE_DIR), "faiss_index")

    print(f"Index saved to {VSTORE_DIR}")

### Part B: Importance of Choosing the Right Embedder



**Why does the choice of embedder matter?**



#### 1. Language-Specific Training

If we use an embedder trained **only on English** to embed **Persian** text:



**Problems:**

- **Out-of-vocabulary tokens**: Persian words may be split into meaningless subwords

- **Poor semantic understanding**: Model hasn't learned Persian grammar, idioms, or word relationships

- **Low-quality embeddings**: Similar Persian sentences may have dissimilar vectors

- **Retrieval failures**: Relevant Persian documents won't be found



**Example:**

```

Query: "سلام دنیا" (Hello world)

English-only model: May treat each Persian character as noise

Result: Cannot find semantically similar Persian documents

```



#### 2. Domain-Specific Models

- **General models** (e.g., all-MiniLM): Good for everyday language

- **Domain models** (e.g., biomedical, legal): Better for specialized text

- **Multilingual models** (e.g., paraphrase-multilingual): Support 50+ languages



#### 3. For Persian NLP:

**Recommended models:**

- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (50+ languages)

- `sentence-transformers/LaBSE` (109 languages, high quality)

- Persian-specific models from HuggingFace (search for "persian" or "farsi")



#### 4. Trade-offs:

- **Size vs Quality**: Larger models (768-dim) often perform better than smaller (384-dim)

- **Speed vs Accuracy**: Faster models may sacrifice some quality

- **Language coverage**: Multilingual models may be weaker per-language than monolingual



**Bottom line**: Always choose an embedder that:

1. Supports your target language(s)

2. Matches your domain if possible

3. Balances quality and computational resources

## 3) Retriever Implementation



### Part A (Basic): FAISS Semantic Retriever



A retriever takes a query and returns the most relevant documents from the vector store.



We'll implement:

1. **Semantic retriever** using FAISS (vector similarity)

2. Test with 3 queries as specified in the assignment

In [None]:
# Create FAISS-based semantic retriever

faiss_retriever = vectorstore.as_retriever(

    search_type="similarity",

    search_kwargs={"k": 5}  # Return top 5 documents

)



# Test queries (as per assignment requirements)

test_queries = [

    "What is natural language processing?",  # NLP-related (in-domain)

    "Explain binary search trees",            # Computer science (out-of-NLP-domain)

    "Who is the president of Bolivia?"        # General knowledge (out-of-scope)

]



print("="*80)

print("TESTING FAISS SEMANTIC RETRIEVER")

print("="*80)



for i, query in enumerate(test_queries, 1):

    print(f"\n{'='*80}")

    print(f"Query {i}: {query}")

    print("="*80)

    

    docs = faiss_retriever.invoke(query)

    

    print(f"\nRetrieved {len(docs)} documents:\n")

    for j, doc in enumerate(docs, 1):

        print(f"[Doc {j}] (first 150 chars)")

        print(f"{doc.page_content[:150]}...")

        print(f"Metadata: {doc.metadata}")

        print()

### Part B (Bonus): Hybrid Retriever with BM25 and Ensemble



**Lexical vs Semantic Retrieval:**



#### Lexical Retrieval (BM25):

- **Based on**: Exact keyword matching

- **Algorithm**: BM25 (Best Matching 25) - a probabilistic ranking function

- **How it works**: 

  - Counts term frequency (TF) in documents

  - Considers document length normalization

  - Uses inverse document frequency (IDF) for term importance

- **Strengths**:

  - Fast and lightweight (no embeddings needed)

  - Excellent for exact keyword matches

  - Works well with technical terms, names, acronyms

  - No training required

- **Weaknesses**:

  - No semantic understanding ("car" ≠ "automobile")

  - Fails on paraphrasing or synonyms

  - Sensitive to exact wording
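
The TF, IDF, and length-normalization ingredients listed above can be combined in a short pure-Python sketch. This uses one common BM25 variant (with a non-negative IDF) and whitespace tokenization; the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy corpus are assumptions for illustration, not the exact behaviour of any particular library.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        length_norm = 1.0 - b + b * len(toks) / avg_len  # penalizes long documents
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # the term does not occur, so it contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "tokenization splits text into tokens",
    "the car drove down the road",
    "an automobile is a road vehicle",
]
scores = bm25_scores("car", corpus)
print(scores)
```

Only the document containing the literal word "car" gets a non-zero score; the "automobile" document scores zero, which is the synonym blind spot described in the weaknesses.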



#### Semantic Retrieval (FAISS):

- **Based on**: Vector similarity in embedding space

- **How it works**:

  - Convert query and documents to dense vectors

  - Find nearest neighbors using cosine similarity or L2 distance

- **Strengths**:

  - Understands semantic meaning

  - Handles synonyms, paraphrasing

  - Cross-lingual retrieval possible

  - Captures context and intent

- **Weaknesses**:

  - Computationally expensive (embeddings + indexing)

  - May miss exact keyword matches

  - Requires good quality embedder



#### Ensemble Retriever:

Combines both approaches:

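- **Weighted rank fusion**: LangChain's `EnsembleRetriever` merges the ranked lists with weighted Reciprocal Rank Fusion, `score(d) = w1 / (c + rank_bm25(d)) + w2 / (c + rank_faiss(d))` (with `c = 60` by default), rather than adding raw BM25 and similarity scores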

- **Best of both worlds**: Keyword precision + semantic understanding

- **Use case**: When you want robustness across different query types



**Importance by use case:**

- **Technical documentation**: Lexical (exact API names, functions)

- **Conversational QA**: Semantic (natural language understanding)

- **Hybrid systems**: Ensemble (maximum coverage)
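
Ensemble retrieval over ranked lists is commonly done with weighted Reciprocal Rank Fusion (RRF), which LangChain's `EnsembleRetriever` is documented to use. The sketch below is a minimal standalone version under that assumption; the function name `weighted_rrf`, the document IDs, and the two rankings are hypothetical.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings returned by a lexical and a semantic retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
faiss_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.3, 0.7])
print(fused)
```

Because only ranks are used, the two retrievers' raw scores never need to be on the same scale, which sidesteps the problem of comparing BM25 scores with cosine similarities.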

In [None]:
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

from langchain.retrievers import EnsembleRetriever



# Create BM25 lexical retriever

bm25_retriever = BM25Retriever.from_documents(chunked_documents)

bm25_retriever.k = 5  # Return top 5



print("BM25 lexical retriever initialized")

print(f"Total documents indexed: {len(chunked_documents)}")

In [None]:
# Experiment with weight combinations

print("="*80)

print("EXPERIMENTING WITH ENSEMBLE WEIGHTS")

print("="*80)



# Test different weight configurations

weight_configs = [

    (1.0, 0.0, "100% Lexical (BM25)"),

    (0.0, 1.0, "100% Semantic (FAISS)"),

    (0.5, 0.5, "50-50 Balanced"),

    (0.3, 0.7, "30% Lexical, 70% Semantic (Recommended for English)"),

]



test_query = "What is natural language processing?"

print(f"\nTest Query: '{test_query}'\n")



for bm25_weight, faiss_weight, description in weight_configs:

    ensemble = EnsembleRetriever(

        retrievers=[bm25_retriever, faiss_retriever],

        weights=[bm25_weight, faiss_weight]

    )

    

    docs = ensemble.invoke(test_query)

    print(f"\n{description}:")

    print(f"  Retrieved {len(docs)} documents")

    print(f"  First doc preview: {docs[0].page_content[:100]}..." if docs else "  No docs")



print("\n" + "="*80)

print("NOTE: compare the previews above before fixing weights; a semantic weight of 0.6-0.7 is a common starting point with strong English embedders.")

print("="*80)

In [None]:
# Final ensemble retriever with chosen weights

ensemble_retriever = EnsembleRetriever(

    retrievers=[bm25_retriever, faiss_retriever],

    weights=[0.3, 0.7]  # 30% lexical, 70% semantic

)



print("="*80)

print("TESTING FINAL ENSEMBLE RETRIEVER")

print("="*80)



for i, query in enumerate(test_queries, 1):

    print(f"\n{'='*80}")

    print(f"Query {i}: {query}")

    print("="*80)

    

    docs = ensemble_retriever.invoke(query)

    

    print(f"\nRetrieved {len(docs)} documents:\n")

    for j, doc in enumerate(docs, 1):

        print(f"[Doc {j}] (first 150 chars)")

        print(f"{doc.page_content[:150]}...")

        print(f"Metadata: {doc.metadata}")

        print()

## 4) Chain Implementations



### Understanding Chains in LangChain



A **chain** is a sequence of components connected in a directed acyclic graph (DAG):

```

Input → Prompt Template → LLM → Output Parser → Result

```



**Three main components:**

1. **Prompt Template**: Format input with placeholders for variables

2. **LLM**: Large language model that generates responses

3. **Output Parser**: Validates and structures the LLM output



**Example:**

```python

chain = prompt | llm | parser

result = chain.invoke({"query": "Hello"})

```



The `|` operator connects components into a pipeline.
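
How a pipe operator can build a pipeline is easy to see in a toy version. The `Step` class below is a hypothetical stand-in written for this illustration (it is not LangChain's `Runnable`), and `fake_llm` simply upper-cases text instead of calling a model.

```python
class Step:
    """Minimal stand-in for a pipeline component with an invoke() method."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b returns a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


prompt = Step(lambda inputs: f"Answer briefly: {inputs['query']}")
fake_llm = Step(lambda text: text.upper())  # stands in for a model call
parser = Step(lambda text: text.strip())

chain = prompt | fake_llm | parser
result = chain.invoke({"query": "Hello"})
print(result)
```

Python evaluates `prompt | fake_llm | parser` left to right through `__or__`, so the composed object runs the three steps in the order they are written.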

### 4.1) Router Chain (TogetherAI + Llama 3 70B)



**Purpose**: Classify user queries into three categories:

1. `VectorStore`: NLP-related questions (use local knowledge base)

2. `SearchEngine`: Computer science but non-NLP (use web search)

3. `None` (Fallback): Out-of-scope questions



**Why Temperature=0?**

- Temperature controls randomness in LLM outputs

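- **Temperature=0**: Greedy decoding; the model always picks the most likely token, so outputs are (near-)deterministic, although hosted APIs can still show small run-to-run variation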

- **Temperature>0**: Adds randomness, more creative but less predictable

- For classification/routing, we want **consistent, deterministic** decisions

- Higher temperature would cause unpredictable routing, which is bad for a decision system!
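
The effect of temperature can be sketched as a softmax over scaled logits. The logit values below are hypothetical scores for the three routing labels, and treating temperature 0 as "all probability on the argmax" is a convention adopted for this sketch.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat T=0 as greedy decoding: all probability mass on the largest logit
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for VectorStore, SearchEngine, None
greedy = softmax_with_temperature(logits, 0.0)
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print(greedy)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

As the temperature rises, probability mass moves from the top label to the alternatives, so sampling would pick a different route more often.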

In [None]:
from langchain_together import ChatTogether

from langchain.prompts import ChatPromptTemplate

from pydantic import BaseModel, Field  # Pydantic v2 is pinned in the install cell

from langchain.output_parsers import PydanticOutputParser

from typing import Literal



# Initialize TogetherAI LLM

router_llm = ChatTogether(

    model="meta-llama/Llama-3-70b-chat-hf",

    temperature=0.0,

    max_tokens=100

)



print("✅ TogetherAI LLM initialized (Llama 3 70B, temp=0)")

In [None]:
# Define Pydantic output schema for router

class RouterOutput(BaseModel):

    """Router decision for query classification."""

    tool: Literal["VectorStore", "SearchEngine", "None"] = Field(

        description="The tool to use: VectorStore for NLP topics, SearchEngine for general CS topics, None for out-of-scope"

    )



router_parser = PydanticOutputParser(pydantic_object=RouterOutput)



# Router prompt

router_prompt = ChatPromptTemplate.from_messages([

    ("system", """You are a query classifier for an NLP chatbot.



Classify the user's query into ONE of these categories:

- VectorStore: Questions about Natural Language Processing (NLP), machine learning for text, transformers, etc.

- SearchEngine: Questions about Computer Science topics OUTSIDE of NLP (e.g., algorithms, data structures, OS)

- None: Questions completely outside the chatbot's domain (e.g., geography, politics, history)



Respond ONLY with a JSON object that follows the format instructions below, with no extra text.



{format_instructions}"""),

    ("human", "{query}")

])



# Build router chain

router_chain = router_prompt | router_llm | router_parser



print("✅ Router chain assembled")

print(f"Output format: {router_parser.get_format_instructions()[:100]}...")

In [None]:
# Test router chain

print("="*80)

print("TESTING ROUTER CHAIN")

print("="*80)



test_routing_queries = [

    "What is tokenization in NLP?",

    "Explain quicksort algorithm",

    "What is the capital of France?"

]



for query in test_routing_queries:

    result = router_chain.invoke({

        "query": query,

        "format_instructions": router_parser.get_format_instructions()

    })

    print(f"\nQuery: {query}")

    print(f"  → Routed to: {result.tool}")

### 4.2) Search Engine Chain (Tavily)



**Tavily** is an API service optimized for LLM-based applications:

- Retrieves high-quality web search results

- Returns content in LLM-friendly format

- Free tier: 1000 searches/month



Our chain will:

1. Query Tavily API

2. Parse results into LangChain `Document` objects

3. Return top 5 results with content + URL metadata

In [None]:
from langchain_community.tools.tavily_search import TavilySearchResults

from langchain.schema import Document

from langchain.schema.runnable import RunnableLambda



# Initialize Tavily search tool

tavily_tool = TavilySearchResults(max_results=5)



# Post-processor to convert Tavily results to LangChain Documents

def tavily_to_documents(results):

    """Convert Tavily results to Document objects."""

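    # On API errors the Tavily tool can return an error string instead of a list of results

    if not isinstance(results, list):

        return []

    documents = []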

    for result in results:

        doc = Document(

            page_content=result.get("content", ""),

            metadata={"url": result.get("url", "")}

        )

        documents.append(doc)

    return documents



# Build search engine chain

search_engine_chain = tavily_tool | RunnableLambda(tavily_to_documents)



print("✅ Search Engine chain assembled")

print("  Input: query string")

print("  Output: List[Document] with page_content and url metadata")

In [None]:
# Test search engine chain

print("="*80)

print("TESTING SEARCH ENGINE CHAIN")

print("="*80)



test_search_query = "What are transformers in deep learning?"

print(f"\nQuery: {test_search_query}\n")



search_results = search_engine_chain.invoke(test_search_query)



print(f"Retrieved {len(search_results)} documents:\n")

for i, doc in enumerate(search_results, 1):

    print(f"[Result {i}]")

    print(f"  Content (first 150 chars): {doc.page_content[:150]}...")

    print(f"  URL: {doc.metadata['url']}")

    print()

### 4.3) Relevancy Check Chain (Bonus/Optional)



**Purpose**: Filter retrieved documents by relevance to the query.



**Why needed?**

- Retrievers may return marginally relevant documents

- Low-quality documents can confuse the LLM during generation

- Better to filter out noise before final answer generation



**Example scenario:**

- Query: "What is BERT?"

- Retrieved doc talks about "Sesame Street's Bert character"

- Relevancy check marks it as `irrelevant`

- Only NLP-related BERT docs proceed to generation



This chain evaluates one document at a time; the `filter_docs` node in the graph runs it on every retrieved document and keeps only those judged relevant.

In [None]:
# Relevancy check output schema

class RelevancyOutput(BaseModel):

    """Relevancy classification result."""

    relevance: Literal["relevant", "irrelevant"] = Field(

        description="Whether the document is relevant to the query"

    )



relevancy_parser = PydanticOutputParser(pydantic_object=RelevancyOutput)



# Relevancy prompt

relevancy_prompt = ChatPromptTemplate.from_messages([

    ("system", """You are a relevancy judge. Given a query and a document, determine if the document is relevant to answering the query.



Respond ONLY with a JSON object that follows the format instructions below; the relevance field must be 'relevant' or 'irrelevant'.



{format_instructions}"""),

    ("human", """Query: {query}



Document: {document}



Is this document relevant to the query?""")

])



# Relevancy check chain (single document)

relevancy_check_chain = relevancy_prompt | router_llm | relevancy_parser



print("✅ Relevancy check chain assembled")

### 4.4) Fallback Chain



When the router determines a query is out-of-scope (`None`), this chain:

1. Takes the query + chat history

2. Politely informs the user the chatbot cannot help

3. Explains the chatbot's domain (NLP topics)



We can use higher temperature here (e.g., 0.3-0.7) for more natural/varied responses.

In [None]:
from langchain.schema.output_parser import StrOutputParser

from langchain.schema.messages import BaseMessage



# Helper to convert chat history to string

def format_chat_history(messages: list[BaseMessage]) -> str:

    """Convert LangChain messages to readable text."""

    if not messages:

        return "No previous conversation."

    

    formatted = []

    for msg in messages:

        role = "User" if msg.type == "human" else "Assistant"

        formatted.append(f"{role}: {msg.content}")

    return "\n".join(formatted)



# Fallback LLM (can use higher temperature)

fallback_llm = ChatTogether(

    model="meta-llama/Llama-3-70b-chat-hf",

    temperature=0.5,

    max_tokens=200

)



# Fallback prompt

fallback_prompt = ChatPromptTemplate.from_messages([

    ("system", """You are an NLP (Natural Language Processing) chatbot assistant.



Your domain: Natural language processing, machine learning for text, transformers, language models, etc.



When asked about topics outside your domain, politely explain you can only help with NLP-related questions.



Chat History:

{chat_history}"""),

    ("human", "{query}")

])



# Fallback chain

fallback_chain = fallback_prompt | fallback_llm | StrOutputParser()



print("✅ Fallback chain assembled")

### 4.5) Generate-with-Context Chain



The final generation chain that:

1. Takes user query + retrieved relevant documents

2. Uses LLM to synthesize an answer based on the context

3. Returns a natural language response



This is the core RAG (Retrieval-Augmented Generation) component.

In [None]:
# Generation LLM

generate_llm = ChatTogether(

    model="meta-llama/Llama-3-70b-chat-hf",

    temperature=0.3,  # Some creativity but still focused

    max_tokens=512

)



# Helper to format documents

def format_documents(docs: list[Document]) -> str:

    """Format documents as numbered context."""

    if not docs:

        return "No relevant documents found."

    

    formatted = []

    for i, doc in enumerate(docs, 1):

        formatted.append(f"[Document {i}]\n{doc.page_content}")

    return "\n\n".join(formatted)



# Generation prompt

generate_prompt = ChatPromptTemplate.from_messages([

    ("system", """You are a helpful NLP assistant. Answer the user's question using ONLY the provided documents as context.



If the documents don't contain enough information to answer, say so honestly.



Context Documents:

{context}"""),

    ("human", "{query}")

])



# Generate-with-context chain

generate_with_context_chain = generate_prompt | generate_llm | StrOutputParser()



print("✅ Generate-with-context chain assembled")

## 5) LangGraph Agent Assembly



Now we connect all chains into a state graph using **LangGraph**.



### Agent State

The state tracks all data flowing through the graph:

```python

- query: str               # User's question

- route: str               # Router decision (VectorStore / SearchEngine / None)

- chat_history: list       # Conversation history

- documents: list[Document] # Retrieved docs

- generation: str          # Final answer

```



### Nodes (Functions):

1. **router_node**: Classifies the query → decides the next tool

2. **vector_store**: Retrieves from the local knowledge base (BM25 + FAISS ensemble retriever)

3. **search_engine**: Retrieves from Tavily web search

4. **filter_docs** (optional): Filters irrelevant documents

5. **fallback**: Handles out-of-scope queries

6. **generate_with_context**: Generates final answer



### Edges (Conditional Routing):

- Router → VectorStore / SearchEngine / Fallback

- VectorStore → FilterDocs → Generate (or direct to Generate if no filtering)

- SearchEngine → FilterDocs → Generate (or direct)

- Fallback → END
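
Before wiring this up in LangGraph, the control flow can be sketched as a plain dispatch loop. Everything in this block is a hypothetical stand-in: `toy_router` uses keyword rules instead of an LLM, the other nodes return canned values, and `run_graph` imitates what a compiled state graph does (run a node, merge its update into the state, follow the edge).

```python
from typing import Callable, Optional


def toy_router(state: dict) -> dict:
    """Keyword rule standing in for the LLM router (illustration only)."""
    query = state["query"].lower()
    if "nlp" in query:
        return {"route": "VectorStore"}
    if "algorithm" in query:
        return {"route": "SearchEngine"}
    return {"route": "None"}


# Each node takes the state and returns a partial update
NODES: dict[str, Callable[[dict], dict]] = {
    "router": toy_router,
    "vector_store": lambda s: {"documents": ["local chunk"]},
    "search_engine": lambda s: {"documents": ["web result"]},
    "filter_docs": lambda s: {"documents": [d for d in s["documents"] if d]},
    "generate": lambda s: {"generation": f"answer from {len(s['documents'])} document(s)"},
    "fallback": lambda s: {"generation": "Sorry, I only cover NLP topics."},
}

ROUTE_TO_NODE = {"VectorStore": "vector_store", "SearchEngine": "search_engine"}
FIXED_EDGES = {
    "vector_store": "filter_docs",
    "search_engine": "filter_docs",
    "filter_docs": "generate",
}


def next_node(current: str, state: dict) -> Optional[str]:
    """Conditional edge after the router, fixed edges elsewhere, None means END."""
    if current == "router":
        return ROUTE_TO_NODE.get(state["route"], "fallback")
    return FIXED_EDGES.get(current)


def run_graph(query: str) -> tuple[dict, list[str]]:
    state = {"query": query, "route": "", "documents": [], "generation": ""}
    visited = []
    current: Optional[str] = "router"
    while current is not None:
        visited.append(current)
        state.update(NODES[current](state))  # merge the node's update into the state
        current = next_node(current, state)
    return state, visited


final_state, path = run_graph("What is tokenization in NLP?")
print(path)
print(final_state["generation"])
```

Note that the router's decision is stored in the state under `route`; the conditional edge can only read it because that key is part of the state.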

In [None]:
from typing import TypedDict

from langgraph.graph import StateGraph, END



# Define agent state

class AgentState(TypedDict):

    """State dictionary tracking data through the graph."""

    query: str

    route: str  # router decision; must be declared so route_query can read it

    chat_history: list[BaseMessage]

    generation: str

    documents: list[Document]



print("✅ AgentState defined")

In [None]:
# Define node functions



def router_node(state: AgentState) -> AgentState:

    """Route query to appropriate tool."""

    result = router_chain.invoke({

        "query": state["query"],

        "format_instructions": router_parser.get_format_instructions()

    })

    state["route"] = result.tool

    return state



def vector_store_node(state: AgentState) -> AgentState:

    """Retrieve documents from local vector store."""

    docs = ensemble_retriever.invoke(state["query"])

    state["documents"] = docs

    return state



def search_engine_node(state: AgentState) -> AgentState:

    """Retrieve documents from web search."""

    docs = search_engine_chain.invoke(state["query"])

    state["documents"] = docs

    return state



def filter_docs_node(state: AgentState) -> AgentState:

    """Filter documents by relevancy (optional bonus)."""

    query = state["query"]

    docs = state.get("documents", [])

    

    filtered = []

    for doc in docs:

        try:

            result = relevancy_check_chain.invoke({

                "query": query,

                "document": doc.page_content[:500],  # Limit length for efficiency

                "format_instructions": relevancy_parser.get_format_instructions()

            })

            if result.relevance == "relevant":

                filtered.append(doc)

        except Exception:

            # If relevancy check fails, keep the document

            filtered.append(doc)

    

    state["documents"] = filtered

    return state



def fallback_node(state: AgentState) -> AgentState:

    """Handle out-of-scope queries."""

    response = fallback_chain.invoke({

        "query": state["query"],

        "chat_history": format_chat_history(state.get("chat_history", []))

    })

    state["generation"] = response

    return state



def generate_with_context_node(state: AgentState) -> AgentState:

    """Generate answer using retrieved documents."""

    docs = state.get("documents", [])

    response = generate_with_context_chain.invoke({

        "query": state["query"],

        "context": format_documents(docs)

    })

    state["generation"] = response

    return state



print("✅ All node functions defined")

In [None]:
# Build the graph

workflow = StateGraph(AgentState)



# Add nodes

workflow.add_node("router", router_node)

workflow.add_node("vector_store", vector_store_node)

workflow.add_node("search_engine", search_engine_node)

workflow.add_node("filter_docs", filter_docs_node)  # Optional bonus node

workflow.add_node("fallback", fallback_node)

workflow.add_node("generate", generate_with_context_node)



# Set entry point

workflow.set_entry_point("router")



# Define conditional routing from router

def route_query(state: AgentState) -> str:

    """Determine which node to call based on router decision."""

    route = state.get("route", "None")

    if route == "VectorStore":

        return "vector_store"

    elif route == "SearchEngine":

        return "search_engine"

    else:  # None or fallback

        return "fallback"



# Add conditional edges from router

workflow.add_conditional_edges(

    "router",

    route_query,

    {

        "vector_store": "vector_store",

        "search_engine": "search_engine",

        "fallback": "fallback"

    }

)



# WITH bonus filter_docs node:

workflow.add_edge("vector_store", "filter_docs")

workflow.add_edge("search_engine", "filter_docs")

workflow.add_edge("filter_docs", "generate")



# WITHOUT bonus (direct to generate) - comment out above 3 lines and uncomment these:

# workflow.add_edge("vector_store", "generate")

# workflow.add_edge("search_engine", "generate")



# Fallback and generate both end the workflow

workflow.add_edge("fallback", END)

workflow.add_edge("generate", END)



# Compile the graph

app = workflow.compile()



print("✅ LangGraph compiled successfully!")

print("\nGraph structure:")

print("  router → [vector_store | search_engine | fallback]")

print("  vector_store → filter_docs → generate → END")

print("  search_engine → filter_docs → generate → END")

print("  fallback → END")

## 6) Testing the RAG Pipeline



Now let's test our complete RAG system with three types of queries:



### Test Query Types



1. **NLP-Related Query**: Should route to `VectorStore` and retrieve relevant documents from our corpus

2. **Computer Science (Non-NLP) Query**: Should route to `SearchEngine` for web search results

3. **Out-of-Scope Query**: Should be classified as `None` and routed to the fallback node for a polite decline



### Execution Flow



For each query, we'll:

- Display the router's classification decision

- Show retrieved/searched documents (if applicable)

- Present the final generated answer

- Analyze the routing correctness

### Test 1: NLP-Related Query



This query should be routed to **VectorStore** since it's related to natural language processing.

In [None]:
# Test 1: NLP-Related Query

query_nlp = "What are the main applications of natural language processing?"



print("▶️ Running NLP query...")

print(f"Query: {query_nlp}\n")



result_nlp = app.invoke({"query": query_nlp, "chat_history": []})



print(f"\n✅ Routing Decision: {result_nlp.get('route', 'N/A')}")

print(f"\n✅ Number of Retrieved Documents: {len(result_nlp.get('documents', []))}")



if result_nlp.get('documents'):

    print("\n✅ Retrieved Documents (top 3):")

    for i, doc in enumerate(result_nlp['documents'][:3], 1):

        content_preview = doc.page_content[:200].replace('\n', ' ')

        print(f"\n  [{i}] {content_preview}...")

        if hasattr(doc, 'metadata') and doc.metadata:

            print(f"      Source: {doc.metadata.get('source', 'Unknown')}")



print("\n\n✅ Generated Answer:\n")

print(result_nlp.get('generation', 'No generation available'))

### Test 2: Computer Science (Non-NLP) Query



This query is about computer science but not NLP-specific, so it should route to **SearchEngine** for web search.

In [None]:
# Test 2: CS but non-NLP Query

query_cs = "Explain how binary search trees work and their time complexity"



print("▶️ Running CS (non-NLP) query...")

print(f"Query: {query_cs}\n")



result_cs = app.invoke({"query": query_cs, "chat_history": []})



print(f"\n✅ Routing Decision: {result_cs.get('route', 'N/A')}")

print(f"\n✅ Number of Retrieved Documents: {len(result_cs.get('documents', []))}")



if result_cs.get('documents'):

    print("\n✅ Retrieved Documents (top 3):")

    for i, doc in enumerate(result_cs['documents'][:3], 1):

        content_preview = doc.page_content[:200].replace('\n', ' ')

        print(f"\n  [{i}] {content_preview}...")

        if hasattr(doc, 'metadata') and doc.metadata:

            source = doc.metadata.get('source', 'Unknown')

            # For search results, show URL if available

            if 'url' in doc.metadata:

                print(f"      URL: {doc.metadata['url']}")

            else:

                print(f"      Source: {source}")



print("\n\n✅ Generated Answer:\n")

print(result_cs.get('generation', 'No generation available'))

### Test 3: Out-of-Scope Query



This query is completely unrelated to computer science or NLP, so it should route to **Fallback** for a polite decline response.

In [None]:
# Test 3: Out-of-Scope Query

query_oos = "Who is the current president of Bolivia?"



print("▶️ Running out-of-scope query...")

print(f"Query: {query_oos}\n")



result_oos = app.invoke({"query": query_oos, "chat_history": []})



print(f"\n✅ Routing Decision: {result_oos.get('route', 'N/A')}")

print(f"\n✅ Number of Retrieved Documents: {len(result_oos.get('documents', []))}")



if result_oos.get('documents'):

    print("\n✅ Retrieved Documents (top 3):")

    for i, doc in enumerate(result_oos['documents'][:3], 1):

        content_preview = doc.page_content[:200].replace('\n', ' ')

        print(f"\n  [{i}] {content_preview}...")

        if hasattr(doc, 'metadata') and doc.metadata:

            print(f"      Source: {doc.metadata.get('source', 'Unknown')}")



print("\n\n✅ Generated Answer:\n")

print(result_oos.get('generation', 'No generation available'))

## 7️⃣ Results Analysis and Evaluation



### Routing Accuracy



Let's analyze if the router correctly classified each query type:



| Query Type | Expected Route | Actual Route | Correct? |
|------------|----------------|--------------|----------|
| NLP-Related | VectorStore | (see above) | ✅/❌ |
| CS (Non-NLP) | SearchEngine | (see above) | ✅/❌ |
| Out-of-Scope | Fallback | (see above) | ✅/❌ |
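
The "Actual Route" and "Correct?" columns above can be filled in programmatically rather than by eye. The following is a minimal sketch, assuming each result dict exposes the decision under the `route` key as in the test cells. `routing_report` is a hypothetical helper, and the `example_actual` values are made-up placeholders for illustration, not results from this notebook.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, bool]


def routing_report(expected: Dict[str, str], actual: Dict[str, str]) -> Tuple[List[Row], float]:
    """Compare expected vs. actual routes; return per-query rows and overall accuracy."""
    rows: List[Row] = []
    for name, want in expected.items():
        got = actual.get(name, "N/A")
        # Compare case-insensitively in case the router's casing or spacing differs.
        rows.append((name, want, got, got.strip().lower() == want.lower()))
    accuracy = sum(1 for row in rows if row[3]) / len(rows) if rows else 0.0
    return rows, accuracy


expected_routes = {
    "NLP-Related": "VectorStore",
    "CS (Non-NLP)": "SearchEngine",
    "Out-of-Scope": "Fallback",
}

# Made-up placeholder values. In the notebook these would come from
# result_nlp.get("route"), result_cs.get("route"), and result_oos.get("route").
example_actual = {
    "NLP-Related": "VectorStore",
    "CS (Non-NLP)": "SearchEngine",
    "Out-of-Scope": "SearchEngine",
}

rows, accuracy = routing_report(expected_routes, example_actual)
for name, want, got, ok in rows:
    print(f"{name:<14} expected={want:<13} actual={got:<13} {'OK' if ok else 'MISMATCH'}")
print(f"Routing accuracy: {accuracy:.0%}")
```

Three queries are too few to estimate router quality; the same function works unchanged with a larger labelled query set.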



### Pipeline Performance



**Strengths:**

- 📚 **Hybrid Retrieval**: The ensemble retriever combines semantic similarity (FAISS) with keyword matching (BM25)

- 🧠 **Smart Routing**: Temperature=0 makes the router's classification as repeatable as possible across runs

- ✅ **Relevancy Filtering**: The optional `filter_docs` node removes irrelevant documents before generation

- 🔍 **Web Search Route**: Tavily integration provides up-to-date information for queries the corpus does not cover

- 💾 **Efficient Caching**: Embeddings are cached to speed up repeated runs



**Limitations:**

- 🌐 **Language Bias**: The embedder (`all-MiniLM-L6-v2`) is trained mainly on English text, so it may underperform on Persian and other non-English corpora

- 📊 **Corpus Dependency**: Vector store quality depends on document coverage and chunking strategy

- 🔑 **API Requirements**: Requires TogetherAI and Tavily API keys

- ⏱️ **Latency**: Multiple LLM calls (router, relevancy check, generation) add response time



### Possible Improvements



1. **Adaptive Retrieval**: Dynamically adjust ensemble weights based on query type

2. **Multi-lingual Support**: Use language-agnostic or multilingual embedders (e.g., `paraphrase-multilingual-mpnet-base-v2`)

3. **Re-ranking**: Add a re-ranker after retrieval to improve document ordering

4. **Conversational Memory**: Implement chat history tracking for multi-turn conversations

5. **Evaluation Metrics**: Add automated evaluation with RAGAS (Retrieval Augmented Generation Assessment)
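
Item 3 (re-ranking) could be prototyped with a pluggable scoring function before committing to a specific model. The sketch below assumes the retrieved documents are available as plain strings. `rerank` and `token_overlap_score` are hypothetical names, and the token-overlap score is only a toy stand-in for a real re-ranker such as a cross-encoder.

```python
from typing import Callable, List, Sequence


def token_overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that also appear in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 3,
) -> List[str]:
    """Reorder retrieved texts by score, highest first, and keep the top_k."""
    # sorted() is stable, so documents with equal scores keep the retriever's order.
    ranked = sorted(docs, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_k]
```

Because `score_fn` is a parameter, a model-based scorer can replace the toy function later without changing the calling code.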

## 8️⃣ Conclusion



This notebook implemented a complete **Retrieval-Augmented Generation (RAG)** pipeline using:



### Core Technologies

- **LangChain**: Framework for building LLM applications with chains and agents

- **LangGraph**: State machine for orchestrating multi-step agent workflows

- **FAISS**: Efficient vector similarity search for semantic retrieval

- **TogetherAI**: LLM provider (Llama 3 70B) for routing, filtering, and generation

- **Tavily**: Web search API for real-time information retrieval

- **HuggingFace**: Sentence transformers for document embeddings



### Implementation Highlights

1. **Document Processing**: PDF/text loading, recursive chunking (1000 chars, 200 overlap)

2. **Embeddings**: Cached HuggingFace embeddings (all-MiniLM-L6-v2) with LocalFileStore

3. **Hybrid Retrieval**: Ensemble combining BM25 (30%) and FAISS (70%)

4. **Intelligent Routing**: Router chain classifies queries → VectorStore/SearchEngine/Fallback

5. **Optional Filtering**: Relevancy check chain removes low-quality documents

6. **Context Generation**: LLM generates answers grounded in retrieved documents

7. **Stateful Workflow**: LangGraph manages state transitions and conditional routing
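
Highlight 3 merges two ranked lists with weights 0.3 (BM25) and 0.7 (FAISS). Assuming the ensemble is LangChain's `EnsembleRetriever`, whose documentation describes weighted Reciprocal Rank Fusion, the idea can be sketched in plain Python as follows. This is a standalone illustration, not the library's code. The document IDs, the function name `weighted_rrf`, and the smoothing constant `c = 60` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[Sequence[str]], weights: Sequence[float], c: int = 60
) -> List[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("Need exactly one weight per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (rank + c) for every document it returns.
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties are broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical results
faiss_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic results
fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.3, 0.7])
print(fused)
```

A document returned by both retrievers accumulates score from each list, which is why agreement between the lexical and semantic rankings tends to push a chunk toward the top.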



### Assignment Completion

- ✅ **Required**: Basic FAISS retriever, router chain, search engine, fallback, generate chain

- ✅ **Bonus**: Hybrid retriever (BM25 + FAISS), relevancy check chain

- ✅ **Testing**: Three query types (NLP, CS non-NLP, out-of-scope)

- ✅ **Documentation**: Comprehensive English explanations for each component



### Key Takeaways

- **RAG bridges LLMs and external knowledge**: Reduces hallucinations by grounding responses in documents

- **Hybrid search is powerful**: Combining lexical and semantic retrieval improves recall

- **Temperature matters**: Use lower temperatures (0-0.3) for tasks that need repeatable output, such as routing, and higher ones (0.5+) for creative generation

- **Caching saves time**: Persistent embeddings cache avoids recomputation

- **Modular design enables flexibility**: Each chain/node can be tested and improved independently

---



## 📦 Appendix: Setup Instructions



### Prerequisites

1. Python 3.8+ installed

2. API keys for TogetherAI and Tavily



### Installation

```bash

pip install langchain langchain-community langchain-huggingface langgraph \

            faiss-cpu sentence-transformers langchain-together tavily-python \

            pypdf pydantic

```



### Environment Setup

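Set the API keys as environment variables in the shell that launches Jupyter (if you prefer a `.env` file, load it yourself, for example with `python-dotenv`):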

```bash

export TOGETHER_API_KEY="your-together-api-key"

export TAVILY_API_KEY="your-tavily-api-key"

```
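
Before running the chains, it can help to confirm that both keys are visible to the Python process. A minimal sketch using only the standard library follows. `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for this check.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_KEYS = ("TOGETHER_API_KEY", "TAVILY_API_KEY")


def missing_keys(
    env: Mapping[str, str] = os.environ, required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```

The check only verifies that the variables are non-empty; it does not validate the keys against either service.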



### Data Preparation

1. Create the directory structure (folder names match `DOCS_DIR`, `CACHE_DIR`, and `VSTORE_DIR` defined in section 0):

   ```
   NLP-CA6/
   ├── answer/
   │   └── code.ipynb
   └── data/
       ├── docs/         # Place your PDFs/text files here
       ├── emb_cache/    # Auto-created for embeddings
       └── faiss_store/  # Auto-created for FAISS index
   ```

2. Add documents to `data/docs/` (PDFs or .txt files)



### Running the Notebook

1. Open `code.ipynb` in Jupyter Lab/Notebook

2. Run cells sequentially from top to bottom

3. First run will:

   - Generate and cache embeddings (~2-5 min depending on corpus size)

   - Create FAISS vector store

   - Initialize BM25 index

4. Subsequent runs will load from cache (much faster)



### Troubleshooting

- **Import errors**: Re-run the install cell in section 0, then restart the kernel so the newly installed packages are picked up

- **API key errors**: Verify environment variables are set correctly

- **Memory issues**: Reduce `chunk_size` or process fewer documents

- **Slow embeddings**: First run is slow; subsequent runs use cache

- **Empty retrievals**: Check document paths and ensure PDFs contain extractable text
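
For the "empty retrievals" case, a quick first check is whether the documents folder actually contains loadable files. The sketch below uses only `pathlib`. `list_corpus_files` is a hypothetical helper, and searching subdirectories recursively is an assumption about how the corpus is organized.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def list_corpus_files(
    docs_dir: Path, suffixes: Tuple[str, ...] = (".pdf", ".txt")
) -> Dict[str, List[Path]]:
    """Group files under docs_dir by suffix so an empty corpus is easy to spot."""
    found: Dict[str, List[Path]] = {suffix: [] for suffix in suffixes}
    if not docs_dir.is_dir():
        return found
    for path in sorted(docs_dir.rglob("*")):
        # Match suffixes case-insensitively so "report.PDF" is counted as a PDF.
        if path.is_file() and path.suffix.lower() in found:
            found[path.suffix.lower()].append(path)
    return found
```

Calling `list_corpus_files(DOCS_DIR)` in the notebook and printing the counts per suffix shows whether the loader has anything to read. A PDF that is listed but yields no chunks is likely a scanned document without extractable text.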