# 04. Building Intelligent RAG Systems

## Install dependencies

In [6]:
%uv pip install langchain~=1.0 langchain-core~=1.0

Audited 2 packages in 3ms
Note: you may need to restart the kernel to use updated packages.


In [4]:
%uv pip install langchain-chroma~=1.0 langchain-community~=0.4 langchain-openai~=1.0

Audited 3 packages in 7ms
Note: you may need to restart the kernel to use updated packages.


In [None]:
%uv pip install jq~=1.10 python-dotenv~=1.1 scann~=1.4 transformers~=4.56

Audited 4 packages in 4ms
Note: you may need to restart the kernel to use updated packages.


In [7]:
# Install the CPU build of PyTorch so that later packages depending on PyTorch do not pull in the much larger GPU build
%uv pip install torch~=2.9 torchvision~=0.24 torchaudio~=2.9 --index-url https://download.pytorch.org/whl/cpu
%uv pip install langchain-huggingface~=1.2 sentence-transformers~=5.2

Audited 3 packages in 3ms
Note: you may need to restart the kernel to use updated packages.
Audited 2 packages in 11ms
Note: you may need to restart the kernel to use updated packages.


In [1]:
import dotenv


dotenv.load_dotenv()

True

Utility classes

In [2]:
import os


from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.embeddings import Embeddings

class Config:
    def __init__(self):
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is not set")

        base_url = os.getenv("OPENAI_API_BASE_URL")
        if not base_url:
            raise ValueError("OPENAI_API_BASE_URL is not set")

        model = os.getenv("OPENAI_MODEL")
        if not model:
            raise ValueError("OPENAI_MODEL is not set")

        vl_model = os.getenv("OPENAI_VL_MODEL")
        embeddings_model = os.getenv("OPENAI_EMBEDDINGS_MODEL")
        hf_pretrained_embeddings_model = os.getenv("HF_PRETRAINED_EMBEDDINGS_MODEL")

        self.api_key = api_key
        self.base_url = base_url
        self.model = model
        self.vl_model = vl_model
        self.embeddings_model = embeddings_model
        self.hf_pretrained_embeddings_model = (
            hf_pretrained_embeddings_model
            if hf_pretrained_embeddings_model
            else "Qwen/Qwen3-Embedding-8B"
        )

    def new_openai_like(self, **kwargs) -> ChatOpenAI:
        # Reference: https://bailian.console.aliyun.com/?tab=api#/api/?type=model&url=2587654
        # Reference: https://help.aliyun.com/zh/model-studio/models
        # ChatOpenAI docs: https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html#langchain_openai.chat_models.base.ChatOpenAI
        return ChatOpenAI(
            api_key=self.api_key, base_url=self.base_url, model=self.model, **kwargs
        )

    def new_openai_like_embeddings(self, **kwargs) -> OpenAIEmbeddings:
        if not self.embeddings_model:
            raise ValueError("OPENAI_EMBEDDINGS_MODEL is not set")

        # Reference: https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings
        return OpenAIEmbeddings(
            api_key=self.api_key,
            base_url=self.base_url,
            model=self.embeddings_model,
            # https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings.tiktoken_enabled
            # For implementations other than the official OpenAI one, set this parameter to False.
            # This falls back to AutoTokenizer from the Hugging Face transformers library for tokenization.
            tiktoken_enabled=False,
            # https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings.model
            # According to Yuanbao, Jina's embedding model https://huggingface.co/jinaai/jina-embeddings-v4 is the closest to
            # text-embedding-ada-002.
            # Personal preference: Qwen/Qwen3-Embedding-8B is used here.
            # tiktoken_model_name='Qwen/Qwen3-Embedding-8B',
            tiktoken_model_name=self.hf_pretrained_embeddings_model,
            **kwargs,
        )
    
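def new_hf_embeddings(**kwargs) -> Embeddings:
    # ref: https://reference.langchain.com/python/integrations/langchain_huggingface/#langchain_huggingface.HuggingFaceEmbeddings
    # Read HF_EMBEDDINGS_MODEL only when the caller does not pass model_name.
    model_name = kwargs.pop("model_name", None) or os.environ["HF_EMBEDDINGS_MODEL"]
    # Defaults: run on CPU and keep embeddings unnormalized; callers may override both.
    kwargs.setdefault("model_kwargs", {"device": "cpu"})
    kwargs.setdefault("encode_kwargs", {"normalize_embeddings": False})
    # Forward the remaining keyword arguments instead of silently dropping them.
    return HuggingFaceEmbeddings(model_name=model_name, **kwargs)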




## From indexes to intelligent retrieval

- A fundamental limitation of traditional retrieval systems lies in their lexical approach to document retrieval.
- The breakthrough came with advances in neural network models that could capture the meaning
of words and documents as dense vector representations—known as embeddings.
- As “closed-book” generative systems, LLMs faced limitations: hallucination
risks, knowledge frozen at the training-data cutoff, an inability to cite sources, and difficulty with
complex reasoning.

## Components of a RAG system
- RAG enables language models to ground their outputs in external knowledge, providing an elegant
solution to the limitations that plague pure LLMs: hallucinations, outdated information, and
restricted context windows.
- Main components:
    - Knowledge base: The storage layer for external information
    - Retriever: The knowledge access layer that finds relevant information
    - Augmenter: The integration layer that prepares retrieved content
    - Generator: The response layer that produces the final output
- RAG operates through two interconnected pipelines:
    - An indexing pipeline that processes, chunks, and stores documents in the knowledge base
    - A query pipeline that retrieves relevant information and generates responses using that information
- This architecture and workflow offer several advantages for production systems:
    1. modularity allows components to be developed independently;
    1. scalability enables resources to be allocated based on specific needs;
    1. maintainability is improved through the clear separation of concerns;
    1. flexibility permits different implementation strategies to be swapped in as requirements evolve.
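The two pipelines and four components above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a LangChain implementation: the function names, the word-overlap retriever, and the stubbed generator are all hypothetical stand-ins.

```python
def index_documents(documents, chunk_size=60):
    """Indexing pipeline: chunk each document and store the chunks in the knowledge base."""
    knowledge_base = []
    for doc_id, text in documents.items():
        for start in range(0, len(text), chunk_size):
            knowledge_base.append({"source": doc_id, "text": text[start:start + chunk_size]})
    return knowledge_base


def retrieve(knowledge_base, question, k=1):
    """Retriever: rank chunks by the number of lowercase words shared with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda chunk: -len(words & set(chunk["text"].lower().split())),
    )
    return ranked[:k]


def augment(question, chunks):
    """Augmenter: merge the retrieved chunks and the question into one prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def generate(prompt):
    """Generator stub: a real system would call an LLM here."""
    return f"(LLM response to a {len(prompt)}-character prompt)"


# Indexing pipeline (runs ahead of time)
documents = {
    "faq": "Refunds are issued within 14 days.",
    "policy": "Shipping takes 3 to 5 business days.",
}
knowledge_base = index_documents(documents)

# Query pipeline (runs per request)
question = "how long does shipping take"
top_chunks = retrieve(knowledge_base, question)
prompt = augment(question, top_chunks)
print(prompt)
print(generate(prompt))
```

Because each stage is a separate function, any one of them (for example the retriever) can be swapped without touching the others, which is the modularity argument made above.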

<img src="static/rag-arch-and-workflow.png" alt="RAG architecture and workflow" style="width: 30%;" />

### When to implement RAG

- Significant implementation considerations
    1. The system requires efficient indexing and retrieval mechanisms to maintain reasonable response times.
    1. Knowledge bases need regular updates and maintenance to remain valuable.
    1. Infrastructure must be designed to handle errors and edge cases gracefully, especially where different components interact.
    1. Development teams must be prepared to manage these ongoing operational requirements.
- Experience from Chelsea AI Ventures
    1. Clients in regulated industries particularly benefit from RAG's verifiability.
    1. Creative applications often perform adequately with pure LLMs.
- Development teams should consider RAG when their applications require:
    1. Access to current information not available in LLM training data
    1. Domain-specific knowledge integration
    1. Verifiable responses with source attribution
    1. Processing of specialized data formats
    1. High precision in regulated industries

## From embeddings to search

The core components of a RAG system:
1. Vector embeddings
1. Vector stores
1. Indexing strategies to optimize retrieval

### Embeddings

Embeddings are numerical representations of text that capture semantic meaning.

In [17]:
# Initialize the embeddings model
# embeddings_model = Config().new_openai_like_embeddings()
embeddings_model = new_hf_embeddings()

# Create embeddings for example sentences
text1 = "The cat sat on the mat"
text2 = "A feline rested on the carpet"
text3 = "Python is a programming language"

# Get embeddings using LangChain
embeddings = embeddings_model.embed_documents([text1, text2, text3])

# These similar sentences will have similar embeddings
embedding1 = embeddings[0]  # Embedding for "The cat sat on the mat"
embedding2 = embeddings[1]  # Embedding for "A feline rested on the carpet"
embedding3 = embeddings[2]  # Embedding for "Python is a programming language"

# Output shows number of documents and embedding dimensions
print(f"Number of documents: {len(embeddings)}")
print(f"Dimensions per embedding: {len(embeddings[0])}")
# OpenAI's text-embedding-ada-002 returns 1536 dimensions; the Hugging Face model configured here returns 384, as the output below shows

Number of documents: 3
Dimensions per embedding: 384
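
To make "similar sentences have similar embeddings" concrete, vectors are usually compared with cosine similarity. The sketch below uses small hand-made 3-dimensional vectors as hypothetical stand-ins for real embeddings; with a real model you would pass vectors such as `embedding1` and `embedding2` instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT real model output.
cat_vec = [0.9, 0.1, 0.0]      # stands in for "The cat sat on the mat"
feline_vec = [0.8, 0.2, 0.1]   # stands in for "A feline rested on the carpet"
python_vec = [0.0, 0.1, 0.9]   # stands in for "Python is a programming language"

sim_cat_feline = cosine_similarity(cat_vec, feline_vec)
sim_cat_python = cosine_similarity(cat_vec, python_vec)
print(f"cat vs feline: {sim_cat_feline:.3f}")
print(f"cat vs python: {sim_cat_python:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0.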


### Vector stores

Vector stores are specialized databases designed to store, manage, and efficiently search vector
embeddings.

The vector database operates as an independent system that can be:
- Scaled independently of the RAG components
- Maintained and optimized separately
- Potentially shared across multiple RAG applications
- Hosted as a dedicated service

When working with embeddings, several challenges arise:
- **Scale**: Applications often need to store millions of embeddings
- **Dimensionality**: Each embedding might have hundreds or thousands of dimensions
- **Search performance**: Finding similar vectors quickly becomes computationally intensive
- **Associated data**: We need to maintain connections between vectors and their source documents

Vector stores combine two essential components:
- **Vector storage**: The actual database that persists vectors and metadata
- **Vector index**: A specialized data structure that enables efficient similarity search

The curse of dimensionality: as vector dimensions increase, computing similarities becomes increasingly expensive,
requiring `O(dN)` operations for `d` dimensions and `N` vectors.


Traditional database:
- Uses exact matching (equality, ranges)
- Optimized for structured data (for example, “find all customers with age > 30”)
- Usually utilizes B-trees or hash-based indexes

Vector store search:
- Uses similarity metrics (cosine similarity, Euclidean distance)
- Optimized for high-dimensional vector spaces
- Employs Approximate Nearest Neighbor (ANN) algorithms

#### Vector stores comparison

Database | Deployment options | License | Notable features
---------|--------------------|---------|----------------------------
Pinecone | Cloud-only | Commercial | Auto-scaling, enterprise security, monitoring
Milvus | Cloud, Self-hosted | Apache 2.0 | HNSW/IVF indexing, multi-modal support, CRUD operations
Weaviate | Cloud, Self-hosted | BSD 3-Clause | Graph-like structure, multi-modal support
Qdrant | Cloud, Self-hosted | Apache 2.0 | HNSW indexing, filtering optimization, JSON metadata
ChromaDB | Cloud, Self-hosted | Apache 2.0 | Lightweight, easy setup
AnalyticDB-V | Cloud-only | Commercial | OLAP integration, SQL support, enterprise features
pgvector | Cloud, Self-hosted | OSS (PostgreSQL License) | SQL support, PostgreSQL integration
Vertex Vector Search | Cloud-only | Commercial | Easy setup, low latency, high scalability


Several search patterns:
1. Exact search: Returns precise nearest neighbors but becomes computationally prohibitive with large vector collections
1. Approximate search: Trades accuracy for speed using techniques like LSH, HNSW, or quantization; measured by recall 
  (the percentage of true nearest neighbors retrieved)
1. Hybrid search: Combines vector similarity with text-based search (like keyword matching or BM25) in a single query
1. Filtered vector search: Applies traditional database filters (for example, metadata constraints) alongside vector
  similarity search

Different types of embeddings:
1. Dense vector search: Uses continuous embeddings where most dimensions have non-zero values, typically from neural
  models (like BERT, OpenAI embeddings)
1. Sparse vector search: Uses high-dimensional vectors where most values are zero, resembling traditional TF-IDF or
  BM25 representations
1. Sparse-dense hybrid: Combines both approaches to leverage semantic similarity (dense) and keyword precision (sparse)


Choice of multiple similarity measures, for example:
1. Inner product: Useful for comparing semantic directions
1. Cosine similarity: Normalizes for vector magnitude
1. Euclidean distance: Measures the L2 distance in vector space (note: with normalized embeddings, ranking by this
  distance gives the same order as ranking by dot product or cosine similarity)
1. Hamming distance: For binary vector representations
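
The note on normalized embeddings follows from the identity `||a - b||^2 = 2 - 2 * (a . b)` for unit vectors: a larger dot product always means a smaller distance. A minimal check in plain Python, using made-up vectors:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors for illustration only.
query_vec = normalize([1.0, 2.0, 0.5])
candidates = [normalize(v) for v in ([1.0, 1.8, 0.7], [0.2, 0.1, 3.0], [2.0, 0.5, 0.5])]

# For unit vectors the squared L2 distance is fully determined by the dot product.
for c in candidates:
    assert abs(squared_l2(query_vec, c) - (2 - 2 * dot(query_vec, c))) < 1e-9

# Ranking by highest dot product equals ranking by lowest distance.
by_dot = sorted(range(len(candidates)), key=lambda i: -dot(query_vec, candidates[i]))
by_l2 = sorted(range(len(candidates)), key=lambda i: squared_l2(query_vec, candidates[i]))
print(by_dot, by_l2)
```

Without normalization the two rankings can differ, because the dot product also rewards large vector magnitudes.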

#### Hardware considerations for vector stores

1. Memory requirements
2. CPU vs. GPU
3. Storage speed
4. Network bandwidth

#### Vector store interface in LangChain

In [18]:
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Initialize with an embedding model
# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()

# Create some sample documents, each carrying a custom ID in its metadata
docs = [
    Document(page_content="Content about language models", metadata={"id": "doc_1"}),
    Document(
        page_content="Information about vector databases", metadata={"id": "doc_2"}
    ),
    Document(page_content="Details about retrieval systems", metadata={"id": "doc_3"}),
]

# Create the vector store
vector_store = Chroma(embedding_function=embeddings)

# Add documents; no `ids` argument is passed, so the store assigns a UUID to each document
vector_store.add_documents(docs)

# Similarity Search with appropriate k value
results = vector_store.similarity_search("How do language models work?", k=2)
print(results)

[Document(id='c75eb16d-b896-4e1c-bf35-029220937607', metadata={'id': 'doc_1'}, page_content='Content about language models'), Document(id='b7e1ccfb-d60a-4a06-8ee7-6ea6cce982a9', metadata={'id': 'doc_3'}, page_content='Details about retrieval systems')]


In [19]:
# For maximal marginal relevance (MMR) search, adjust the parameters based on the available documents
# MMR finds relevant BUT diverse documents (it reduces redundancy)
results = vector_store.max_marginal_relevance_search(
    "How does LangChain work?",
    k=3,
    fetch_k=10,
    lambda_mult=0.5,  # Controls diversity (0=max diversity, 1=max relevance)
)
print(results)

[Document(id='c75eb16d-b896-4e1c-bf35-029220937607', metadata={'id': 'doc_1'}, page_content='Content about language models'), Document(id='a5ed0e49-0714-4061-82f6-ad9dd47128ae', metadata={'id': 'doc_2'}, page_content='Information about vector databases'), Document(id='b7e1ccfb-d60a-4a06-8ee7-6ea6cce982a9', metadata={'id': 'doc_3'}, page_content='Details about retrieval systems')]
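
The idea behind maximal marginal relevance can be sketched as a greedy loop: each step picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The code below illustrates that general idea with made-up 2-dimensional vectors; it is an assumption-laden sketch, not the vector store's actual implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of the selected documents, in selection order."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

pure_relevance = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
print(pure_relevance, diverse)
```

With `lambda_mult=1.0` the near-duplicate is returned second; lowering it makes the second pick the dissimilar document instead.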


### Vector indexing strategies

Some common indexing approaches include:
- **Tree-based structures** that hierarchically divide the vector space
- **Graph-based methods** like **Hierarchical Navigable Small World (HNSW)** that create navigable networks of connected vectors
- **Hashing techniques** that map similar vectors to the same “buckets”

Trade-offs between:
- Search speed
- Accuracy of results
- Memory usage
- Update efficiency (how quickly new vectors can be added)


Takeaway: proper indexing transforms vector search from an `O(n)` operation (where `n` is the number of vectors) to
something much more efficient (often closer to `O(log n)`), making it possible to search through millions of vectors in
milliseconds rather than seconds or minutes.

Comparison of vector indexing strategies by core algorithm, complexity, memory usage, and typical use cases (`D` is the vector dimension, `N` the number of vectors):

| Strategy | Core algorithm | Complexity | Memory usage | Best for | Notes |
|---|---|---|---|---|---|
| Exact Search (Brute Force) | Compares query vector with every vector in database | Search: O(DN); Build: O(1) | Low - only stores raw vectors | Small datasets; When 100% recall needed; Testing/baseline | Easiest to implement; Good baseline for testing |
| HNSW (Hierarchical Navigable Small World) | Creates layered graph with decreasing connectivity from bottom to top | Search: O(log N); Build: O(N log N) | High - stores graph connections plus vectors | Production systems; When high accuracy needed; Large-scale search | Industry standard; Requires careful tuning of M (connections) and ef (search depth) |
| LSH (Locality Sensitive Hashing) | Uses hash functions that map similar vectors to the same buckets | Search: O(N^ρ), ρ < 1; Build: O(N) | Medium - stores multiple hash tables | Streaming data; When updates frequent; Approximate search OK | Good for dynamic data; Tunable accuracy vs speed |
| IVF (Inverted File Index) | Clusters vectors and searches within relevant clusters | Search: O(DN/k); Build: O(kN) | Low - stores cluster assignments | Limited memory; Balance of speed/accuracy; Simple implementation | k = number of clusters; Often combined with other methods |
| Product Quantization (PQ) | Compresses vectors by splitting into subspaces and quantizing | Search: varies; Build: O(N) | Very Low - compressed vectors | Memory-constrained systems; Massive datasets | Often combined with IVF; Requires training codebooks; Complex implementation |
| Tree-Based (KD-Tree, Ball Tree) | Recursively partitions space into regions | Search: O(D log N) best case; Build: O(N log N) | Medium - tree structure | Low dimensional data; Static datasets | Works well for D < 100; Expensive updates |

In [25]:
%uv pip install faiss-cpu~=1.13

Audited 1 package in 2ms


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [26]:
import numpy as np
import faiss
import time

# Create sample data - 10,000 vectors with 128 dimensions
dimension = 128
num_vectors = 10000
vectors = np.random.random((num_vectors, dimension)).astype('float32')
query = np.random.random((1, dimension)).astype('float32')

# Exact search index
exact_index = faiss.IndexFlatL2(dimension)
exact_index.add(vectors)

# HNSW index (approximate but faster)
hnsw_index = faiss.IndexHNSWFlat(dimension, 32) # 32 connections per node
hnsw_index.add(vectors)

# Compare search times
start_time = time.time()
exact_D, exact_I = exact_index.search(query, k=10) # Search for 10 nearest neighbors
exact_time = time.time() - start_time

start_time = time.time()
hnsw_D, hnsw_I = hnsw_index.search(query, k=10)
hnsw_time = time.time() - start_time

# Calculate overlap (how many of the same results were found)
overlap = len(set(exact_I[0]).intersection(set(hnsw_I[0])))
overlap_percentage = overlap * 100 / 10

print(f"Exact search time: {exact_time:.6f} seconds")
print(f"HNSW search time: {hnsw_time:.6f} seconds")
print(f"Speed improvement: {exact_time/hnsw_time:.2f}x faster")
print(f"Result overlap: {overlap_percentage:.1f}%")

Exact search time: 0.000583 seconds
HNSW search time: 0.000168 seconds
Speed improvement: 3.48x faster
Result overlap: 80.0%


Guide for choosing an indexing strategy

<img src="static/choosing-an-indexing-strategy.png" alt="Choosing an indexing strategy" style="width: 30%;" />


Vector libraries provide functionality for working with vector data. Examples include:
1. Faiss by Meta: PQ, LSH, and HNSW
1. Annoy by Spotify, implemented in C++
1. hnswlib, implemented in C++: HNSW
1. Non-Metric Space Library (nmslib): HNSW, SW-graph, and SPTAG
1. SPTAG by Microsoft, a distributed ANN implementation: SPTAG-KDT and SPTAG-BKT

## Breaking down the RAG pipeline

4 steps:
1. Document processing – like preparing books for a library
1. Vector indexing – creating the card catalog
1. Vector stores – the organized shelves
1. Retrieval – finding the right books

In [31]:
# 1. Load documents
from langchain_community.document_loaders import JSONLoader

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

In [29]:
# 2. Make embedding model
# embedder = Config().new_openai_like_embeddings()
embedder = new_hf_embeddings()

In [30]:
# 3. Store in vector database
from langchain_community.vectorstores import FAISS

vector_db = FAISS.from_documents(documents, embedder)

In [33]:
# 4. Retrieve similar docs
query = "What is GPT?"

vector_db.similarity_search(query)

[Document(id='05ee00d3-56ab-4424-a565-2bcf68ca89b9', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 3}, page_content='GPT (Generative Pre-trained Transformer) models are autoregressive language models that use transformer-based neural networks. Unlike BERT, which is bidirectional, GPT models are unidirectional and predict the next token based on previous tokens. The original GPT was introduced by OpenAI in 2018, followed by GPT-2 in 2019 and GPT-3 in 2020, each significantly larger than its predecessor.'),
 Document(id='3b62a1ac-f5aa-458a-b4ca-7976c7c99a9c', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. Th

### Document processing

A document loader is a component in LangChain that transforms various data sources into a standardized document format that can be used throughout the LangChain ecosystem.

In [11]:
from langchain_community.document_loaders import JSONLoader

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

print(documents)

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."), Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 2}, page_content='BERT (Bidirectional Encoder Representations from Transformers) was developed by Google AI Language team in 2018. It is pre-trained using masked language modeling and next sentence prediction tasks. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.'), Document(metadata={'source':

Document loaders in LangChain

| Category | Description | Notable Examples | Common Use Cases |
| :--- | :--- | :--- | :--- |
| **File Systems** | Load from local files | TextLoader, CSVLoader, PDFLoader | Processing local documents, data files |
| **Web Content** | Extract from online sources | WebBaseLoader, RecursiveUrlLoader, SitemapLoader | Web scraping, content aggregation |
| **Cloud Storage** | Access cloud-hosted files | S3DirectoryLoader, GCSFileLoader, DropboxLoader | Enterprise data integration |
| **Databases** | Load from structured data stores | MongoDBLoader, SnowflakeLoader, BigQueryLoader | Business intelligence, data analysis |
| **Social Media** | Import social platform content | TwitterTweetLoader, RedditPostsLoader, DiscordChatLoader | Social media analysis |
| **Productivity Tools** | Access workspace documents | NotionDirectoryLoader, SlackDirectoryLoader, TrelloLoader | Knowledge base creation |
| **Scientific Sources** | Load academic content | ArxivLoader, PubMedLoader | Research applications |

#### Chunking strategies

The way you chunk documents affects:
1. Retrieval accuracy: Well-formed chunks maintain semantic coherence, making them easier to match with relevant queries
1. Context preservation: Poor chunking can split related information, causing knowledge gaps
1. Response quality: When the LLM receives fragmented or irrelevant chunks, it generates less accurate responses

##### Fixed-size chunking

In [1]:
%uv pip install langchain-text-splitters~=1.0

Audited 1 package in 71ms
Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",  # Split on spaces to avoid breaking words
    chunk_size=200,
    chunk_overlap=20,
)

chunks = text_splitter.split_documents(documents)
print(f"Generated {len(chunks)} chunks from document")

Generated 13 chunks from document
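
To show what `chunk_size` and `chunk_overlap` mean, here is a minimal character-level sliding window in plain Python. It is a simplified sketch: unlike `CharacterTextSplitter`, it ignores separators and may cut words in half, and the function name is made up for this illustration.

```python
def fixed_size_chunks(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk starts with the last three characters of the previous one, so information that straddles a boundary still appears intact in at least one chunk.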


##### Recursive character chunking

**The recommended default strategy for most applications.**

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""], chunk_size=150, chunk_overlap=20
)

document = """# Introduction to RAG
Retrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models.

It helps address hallucinations by grounding responses in retrieved information.

## Key Components
RAG consists of several components:
1. Document processing
2. Vector embedding
3. Retrieval
4. Augmentation
5. Generation

### Document Processing
This step involves loading and chunking documents appropriately.
"""

text_splitter.split_text(document)

['# Introduction to RAG\nRetrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models.',
 'It helps address hallucinations by grounding responses in retrieved information.',
 '## Key Components\nRAG consists of several components:\n1. Document processing\n2. Vector embedding\n3. Retrieval\n4. Augmentation\n5. Generation',
 '### Document Processing\nThis step involves loading and chunking documents appropriately.']

##### Document-specific chunking

Different document types call for different specialized splitters. A simple implementation dispatches on the document type with `if` statements.
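
A minimal sketch of such a dispatch, using only the standard library. The three splitter functions and the file-extension rule are hypothetical choices for illustration; in practice each branch would call a dedicated splitter for that format.

```python
import re
from pathlib import Path


def split_markdown(text):
    """Start a new chunk at every Markdown heading line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def split_python(text):
    """Start a new chunk at every top-level def or class."""
    parts = re.split(r"(?m)^(?=(?:def|class) )", text)
    return [p.strip() for p in parts if p.strip()]


def split_plain(text):
    """Fall back to blank-line-separated paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_by_type(filename, text):
    """Pick a splitter based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".md", ".markdown"}:
        return split_markdown(text)
    if suffix == ".py":
        return split_python(text)
    return split_plain(text)


md_chunks = chunk_by_type("notes.md", "# Intro\nRAG basics.\n## Parts\nRetriever and generator.\n")
py_chunks = chunk_by_type("tool.py", "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n")
print(md_chunks)
print(py_chunks)
```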

##### Semantic chunking

Workflow:
1. Splits text into sentences
2. Creates embeddings for groups of sentences (determined by buffer_size)
3. Measures semantic similarity between adjacent groups
4. Identifies natural breakpoints where topics or concepts change
5. Creates chunks that preserve semantic coherence
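
The workflow above can be sketched without any model by substituting a toy embedding. The assumptions here are loud ones: `embed` is a bag-of-words counter rather than a neural model, each sentence is compared on its own (no sentence grouping), and the breakpoint rule is a fixed similarity threshold, whereas a real semantic chunker typically derives its threshold from the distribution of distances.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    # 1. Split text into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    # 2-4. Embed adjacent sentences, compare them, and break where similarity drops.
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])  # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)
    # 5. Join each group of sentences into one chunk.
    return [" ".join(group) for group in chunks]


text = (
    "Cats sleep on the warm mat. Cats also sleep on the sofa. "
    "Python functions take arguments. Python functions return values."
)
result = semantic_chunks(text)
print(result)
```

The two cat sentences stay together and the two Python sentences stay together, because the only low-similarity boundary is the one between the topics.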

In [7]:
%uv pip install langchain-experimental~=0.4

Audited 1 package in 4ms
Note: you may need to restart the kernel to use updated packages.


In [11]:
from langchain_experimental.text_splitter import SemanticChunker

# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()
text_splitter = SemanticChunker(
    embeddings=embeddings, add_start_index=True  # Include position metadata
)

text_splitter.split_text(document)

['# Introduction to RAG\nRetrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models. It helps address hallucinations by grounding responses in retrieved information. ## Key Components\nRAG consists of several components:\n1. Document processing\n2. Vector embedding\n3. Retrieval\n4.',
 'Augmentation\n5. Generation\n\n### Document Processing\nThis step involves loading and chunking documents appropriately. ']

##### Agent-based chunking
Uses LLMs to intelligently divide text based on semantic analysis and content understanding.

Workflow:
1. Analyze the document’s structure and content
2. Identify natural breakpoints based on topic shifts
3. Determine optimal chunk boundaries that preserve meaning
4. Return a list of starting positions for creating chunks

Useful when
1. Documents contain intricate logical flows that need to be preserved
1. Content requires domain-specific understanding to chunk appropriately
1. Maximum retrieval accuracy justifies the additional expense of LLM-based processing
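
The last workflow step, turning a list of starting positions into chunks, is plain slicing once the positions are validated. The sketch below assumes the LLM call has already happened; `hypothetical_llm_starts` is made-up output, deliberately including a duplicate and an out-of-range offset to show why sanitizing matters.

```python
def chunks_from_starts(text, starts):
    """Slice text at the given start offsets, such as those an LLM might return."""
    if not text:
        return []
    # Sanitize: keep in-range offsets, drop duplicates, and always begin at 0.
    offsets = sorted({0, *(s for s in starts if 0 < s < len(text))})
    bounds = offsets + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


text = "Intro paragraph. Method details. Final results."
hypothetical_llm_starts = [17, 33, 999, 17]  # made-up model output
agent_chunks = chunks_from_starts(text, hypothetical_llm_starts)
print(agent_chunks)
```

Because the slices are contiguous, joining the chunks reproduces the original text exactly, so no content is lost even if the model returns odd positions.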

##### Multi-modal chunking

Handle documents mixing text, tables, images and code.

##### Choosing the right chunking strategy

| Factor | Condition | Recommended Strategy |
| :--- | :--- | :--- |
| **Document Characteristics** | Highly structured documents (markdown, code) | Document-specific chunking |
|  | Complex technical content | Semantic chunking |
|  | Mixed media | Multi-modal approaches |
| **Retrieval Needs** | Fact-based QA | Smaller chunks (100-300 tokens) |
|  | Complex reasoning | Larger chunks (500-1000 tokens) |
|                               | Context-heavy answers | Sliding window with significant overlap |
| **Computational Resources**   | Limited API budget    | Basic recursive chunking                |
|                               | Performance-critical  | Pre-computed semantic chunks            |

#### Retrieval

A retriever is fundamentally an interface that accepts natural language queries and returns relevant documents.

Workflow
1. **Input**: Takes a query as a string
1. **Processing**: Applies retrieval logic specific to the implementation
1. **Output**: Returns a list of document objects, each containing:
    - `page_content`: The actual document content
    - `metadata`: Associated information like document ID or source
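
The interface is small enough to sketch in plain Python. `Doc` and `KeywordRetriever` below are hypothetical stand-ins (not LangChain classes) that mirror the string-in, documents-out contract with a deliberately naive word-overlap ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document object with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they contain."""

    def __init__(self, docs, k=2):
        self.docs = docs
        self.k = k

    def invoke(self, query):
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(d.page_content.lower().split())), d) for d in self.docs
        ]
        # Drop documents with no overlap, then sort by descending score.
        ranked = sorted((pair for pair in scored if pair[0] > 0), key=lambda pair: -pair[0])
        return [d for _, d in ranked[: self.k]]


docs = [
    Doc("vector stores index embeddings", {"id": "doc_1"}),
    Doc("retrievers return relevant documents", {"id": "doc_2"}),
    Doc("chunking splits long documents", {"id": "doc_3"}),
]
retriever_sketch = KeywordRetriever(docs)
hits = retriever_sketch.invoke("which retrievers return documents")
print([d.metadata["id"] for d in hits])
```

Any retrieval logic (vector similarity, BM25, an external search API) can sit behind the same `invoke` method, which is what lets retrievers be swapped freely.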

##### LangChain retrievers

A few key groups

Group | Example
------|-----------
Core infrastructure | Self-hosted options like ElasticsearchRetriever; cloud-based options from Amazon, Google, and Microsoft
External knowledge | ArxivRetriever, WikipediaRetriever, and TavilySearchAPIRetriever
Algorithmic | BM25 for keyword precision, TF-IDF for document classification, and kNN for similarity matching
Advanced/Specialized | NeuralDB for CPU-optimized retrieval; LLMLingua for document compression
Integration | Connectors for popular platforms and services

##### Vector store retrievers

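Any vector store can become a retriever through the `as_retriever()` method. The cell below takes a different route: `KNNRetriever` builds an in-memory nearest-neighbor retriever directly from documents and an embedding model, without a separate vector store.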

In [12]:
from langchain_community.retrievers import KNNRetriever

# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()

retriever = KNNRetriever.from_documents(documents, embeddings)
retriever.invoke("query")

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 4}, page_content='Retrieval-Augmented Generation (RAG) combines a retrieval system with a text generator. The retriever fetches relevant documents from a knowledge base, and these documents are then provided as context to the generator. RAG models can be fine-tuned end-to-end and leverage large pre-trained models like BART or T5 for generation. This approach helps ground the generated text in factual information.'),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 5}, page_content='Vector databases store high-dimensional vectors and efficiently perform similarity searches. Popular vector databases include Pinecone, Milvus, and FAISS. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to enable fast approximate nearest neighb

Retrievers most relevant for RAG systems
1. Search API retrievers: These retrievers interface with external search services without storing documents locally.
1. Database retrievers: These connect to structured data sources, translating natural language queries into database queries.
1. Lexical search retrievers: These implement traditional text-matching algorithms:
    - BM25 for probabilistic ranking
    - TF-IDF for term frequency analysis
    - Elasticsearch integration for scalable text search

In [5]:
%uv pip install xmltodict~=1.0

[2mAudited [1m1 package[0m [2min 4ms[0m[0m
Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_community.retrievers import PubMedRetriever

retriever = PubMedRetriever(email="xiangminli@outlook.com")
# Note: accessing PubMed from mainland China raises an error, presumably because PubMed restricts access from there
results = retriever.invoke("chatgpt")

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [8]:
%uv pip install arxiv~=2.2

[2mAudited [1m1 package[0m [2min 2ms[0m[0m
Note: you may need to restart the kernel to use updated packages.


In [9]:
from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(
    load_max_docs=2,
    # get_full_documents=True,
)

retriever.invoke("chat-gpt")

[Document(metadata={'Entry ID': 'http://arxiv.org/abs/2405.09300v1', 'Published': datetime.date(2024, 5, 15), 'Title': 'Comparing the Efficacy of GPT-4 and Chat-GPT in Mental Health Care: A Blind Assessment of Large Language Models for Psychological Support', 'Authors': 'Birger Moell'}, page_content="Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges.\n  Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings.\n  Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts 

Modern retrieval systems often combine multiple approaches for better results:
1. Hybrid search
    - Vector similarity for semantic understanding
    - Keyword matching for precise terminology
    - Weighted combinations for optimal results
1. Maximal Marginal Relevance (MMR): Optimizes for both relevance and diversity by:
    - Selecting documents similar to the query
    - Ensuring retrieved documents are distinct from each other
    - Balancing exploration and exploitation
1. Custom retrieval logic: LangChain allows the creation of specialized retrievers by implementing the `BaseRetriever` class.

### Advanced RAG techniques

A standard vector search has several limitations:
- It might miss contextually relevant documents that use different terminology
- It can’t distinguish between authoritative and less reliable sources
- It might return redundant or contradictory information
- It has no way to verify if generated responses accurately reflect the source material

#### Hybrid retrieval: Combining semantic and keyword search

In [10]:
%uv pip install rank-bm25~=0.2

[2K[2mResolved [1m2 packages[0m [2min 1.84s[0m[0m                                         [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/1)                                                   
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/1)--------------[0m[0m     0 B/8.38 KiB            [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/1)----------[2m[0m[0m 8.38 KiB/8.38 KiB           [1A
[2K[2mPrepared [1m1 package[0m [2min 97ms[0m[0m                                                   [1A
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
[2K[2mInstalled [1m1 package[0m [2min 13ms[0m[0m                                 [0m
 [32m+[39m [1mrank-bm25[0m[2m==0.2.2[0m
Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS

# Initialize with an embedding model
# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()

# Create some sample documents with explicit IDs
docs = [
    Document(page_content="Content about language models", metadata={"id": "doc_1"}),
    Document(
        page_content="Information about vector databases", metadata={"id": "doc_2"}
    ),
    Document(page_content="Details about retrieval systems", metadata={"id": "doc_3"}),
]

# Create the vector store
vector_store = FAISS.from_documents(docs, embeddings)

In [13]:
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
%uv pip install langchain-classic~=1.0

[2mAudited [1m1 package[0m [2min 16ms[0m[0m


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_classic.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Setup semantic retriever
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 2})
# print(vector_retriever.invoke("climate change impacts"))

# Setup lexical retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2
# print(bm25_retriever.invoke("climate change impacts"))

# Combine retrievers
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3],  # Weight semantic search higher than keyword search
)

hybrid_retriever.invoke("climate change impacts")

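# Note: three results are returned, because the ensemble merges the top 2 from each retriever and removes duplicates.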

[Document(id='fdca971b-a214-4b72-87cb-0f6f5136167e', metadata={'id': 'doc_3'}, page_content='Details about retrieval systems'),
 Document(id='64cad61f-ea1e-4699-bdd2-ed982ba26802', metadata={'id': 'doc_1'}, page_content='Content about language models'),
 Document(metadata={'id': 'doc_2'}, page_content='Information about vector databases')]
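The way two ranked lists collapse into one can be sketched with weighted Reciprocal Rank Fusion, the scheme `EnsembleRetriever` is documented to use. The sketch below is a simplified stand-in, not the library's code: the function name `weighted_rrf`, the constant `c=60`, and the two input rankings are assumptions picked for illustration (the rankings are hypothetical, not captured from the run above).

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds weight / (rank + c) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; a document found by both retrievers appears once
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 rankings from each retriever
vector_hits = ["doc_3", "doc_1"]
bm25_hits = ["doc_3", "doc_2"]

fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.7, 0.3])
print(fused)  # ['doc_3', 'doc_1', 'doc_2']
```

Two lists of two documents with one overlap yield three unique results, and the document ranked first by both retrievers accumulates the highest fused score.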

#### Re-ranking

Re-ranking is a post-processing step that can follow any retrieval method.

Workflow
1. Retrieve a larger set of candidate documents
1. Apply a more sophisticated model to re-score documents
1. Reorder based on these more precise relevance scores

3 main paradigms:
- **Pointwise rerankers**: Score each document independently (for example, on a scale of 1-10) and sort the resulting array of documents accordingly
- **Pairwise rerankers**: Compare document pairs to determine preferences, then construct a final ordering by ranking documents based on their win/loss record across all comparisons
- **Listwise rerankers**: The re-ranking model processes the entire list of documents (and the original query) holistically to determine optimal order by optimizing NDCG or MAP


LangChain's implementations
1. Cohere rerank: Commercial API-based solution with excellent quality
1. RankLLM: Library supporting open-source LLMs fine-tuned specifically for re-ranking
1. LLM-based custom rerankers: Using any LLM to score document relevance

Cross-encoder re-ranking typically improves ranking metrics such as NDCG and MAP by 10-20% over initial retrieval, especially for the top positions.
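The retrieve-then-rescore workflow can be sketched as a pointwise reranker. This is a minimal illustration under stated assumptions: `rerank` and `overlap_score` are hypothetical names, and the term-overlap scorer is only a toy stand-in for a real relevance model such as a cross-encoder or an LLM judge.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Pointwise re-ranking: score each candidate independently, then sort by score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms present in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


# Pretend these came back from a first-stage retriever with a generous k
candidates = [
    "Vector databases store high-dimensional vectors",
    "Transformer models rely on self-attention",
    "BERT is a transformer model developed by Google",
]
top = rerank("transformer models", candidates, overlap_score, top_n=2)
print(top)
```

Because `score_fn` is just a callable, swapping in a stronger scorer does not change the surrounding pipeline.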

#### Query transformation: Improving retrieval through better queries

Query expansion generates multiple variations of the original query to capture different aspects or phrasings.

It is particularly useful for ambiguous queries, questions formulated by non-experts, and situations where terminology mismatches between queries and documents are common.

In [3]:
# For why langchain-classic is used here instead of langchain-core, see
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
from langchain_classic.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

expansion_template = """Given the user question: {question}
Generate three alternative versions that express the same information need but with different wording:
1."""

expansion_prompt = PromptTemplate(
    input_variables=["question"], template=expansion_template
)

llm = Config().new_openai_like(temperature=0.7)
expansion_chain = expansion_prompt | llm | StrOutputParser()

# Generate expanded queries
original_query = "What are the effects of climate change?"
reply = expansion_chain.invoke(original_query)
print(reply)

1. How is climate change impacting the environment and society?
2. What are the consequences of global warming?
3. In what ways is the changing climate affecting the planet?
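To use the expanded queries for retrieval, the numbered reply first has to be split into individual strings. The helper below is a small sketch of that step; `parse_numbered_list` is a hypothetical name, and it assumes the model keeps the `1.` / `2.` / `3.` line format shown above (lines without a numeric prefix are skipped).

```python
import re


def parse_numbered_list(reply: str) -> list[str]:
    """Extract the text of each numbered line ("1. ..." or "2) ...") from an LLM reply."""
    queries = []
    for line in reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            queries.append(match.group(1).strip())
    return queries


reply = """1. How is climate change impacting the environment and society?
2. What are the consequences of global warming?
3. In what ways is the changing climate affecting the planet?"""

original_query = "What are the effects of climate change?"
# Keep the original query alongside its rephrasings
all_queries = [original_query] + parse_numbered_list(reply)
print(len(all_queries))  # 4
```

Each entry in `all_queries` can then be sent to a retriever, with the results merged and de-duplicated before generation.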


##### Hypothetical Document Embeddings (HyDE)

HyDE uses an LLM to generate a hypothetical answer document based on the query, and then uses that document’s embedding for retrieval.

In [4]:
from langchain_community.document_loaders import JSONLoader
from langchain_community.vectorstores import FAISS

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

# embedder = Config().new_openai_like_embeddings()
embedder = new_hf_embeddings()

vector_db = FAISS.from_documents(documents, embedder)

In [6]:
# For why langchain-classic is used here instead of langchain-core, see
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
from langchain_classic.prompts import PromptTemplate


# Create prompt for generating hypothetical document
hyde_template = """Based on the question: {question}
Write a passage that could contain the answer to this question:"""

hyde_prompt = PromptTemplate(input_variables=["question"], template=hyde_template)
llm = Config().new_openai_like(temperature=0.2)
hyde_chain = hyde_prompt | llm | StrOutputParser()

# Generate hypothetical document
query = "What dietary changes can reduce carbon footprint?"
hypothetical_doc = hyde_chain.invoke(query)

In [7]:
# Use the hypothetical document for retrieval
# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()
embedded_query = embeddings.embed_query(hypothetical_doc)
vector_db.similarity_search_by_vector(embedded_query, k=3)

[Document(id='06789117-fad5-496b-82ec-c7cca664fb53', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 4}, page_content='Retrieval-Augmented Generation (RAG) combines a retrieval system with a text generator. The retriever fetches relevant documents from a knowledge base, and these documents are then provided as context to the generator. RAG models can be fine-tuned end-to-end and leverage large pre-trained models like BART or T5 for generation. This approach helps ground the generated text in factual information.'),
 Document(id='f3eeab14-faf7-49d7-9b6b-96ddb448c5d4', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural netw

#### Context processing: maximizing retrieved information value

Context processing techniques are especially valuable when dealing with lengthy documents where only portions are relevant, or when comprehensive coverage of a topic requires diverse viewpoints. They help reduce noise in the generator’s input and ensure that the most valuable information is prioritized.

##### Contextual compression

Extracts only the most relevant parts of retrieved documents, removing irrelevant content that might distract the generator.

In [10]:
from langchain_classic.retrievers.document_compressors import LLMChainExtractor
from langchain_classic.retrievers import ContextualCompressionRetriever


llm = Config().new_openai_like(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# Create a basic retriever from the vector store
base_retriever = vector_db.as_retriever(search_kwargs={"k": 2})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

compression_retriever.invoke("How do transformers work?")

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Extracted relevant parts:\n>>>\nTransformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text.\n>>>"),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 2}, page_content='BERT (Bidirectional Encoder Representations from Transformers) was developed by Google AI Language team in 2018. It is pre-trained using masked language modeling and next sentence prediction tasks. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all l

##### Maximal marginal relevance

Balances document relevance with diversity.
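The trade-off can be seen in a small pure-Python sketch of the greedy MMR selection loop. It is an illustration under assumptions, not the vector store's implementation: the names `cosine` and `mmr` and the two-dimensional toy vectors are made up for the example. Each step picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2, lambda_mult: float = 0.5) -> list[int]:
    """Greedily select k document indices, trading relevance against redundancy."""
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],   # highly relevant
    [1.0, 0.12],  # near-duplicate of the first document
    [0.8, -0.6],  # less relevant, but covers a different direction
]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct document; with `lambda_mult=1.0` the selection reduces to plain relevance ranking.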

In [13]:
from langchain_community.vectorstores import FAISS

# Initialize with an embedding model
embeddings = Config().new_openai_like_embeddings()

# Create the vector store
vector_store = FAISS.from_documents(documents, embeddings)

# For maximum marginal relevance search, adjust the parameters based on available documents
# Find relevant BUT diverse documents (reduce redundancy)
vector_store.max_marginal_relevance_search(
    query="What are transformer models?",
    k=3,  # Number of documents to return
    fetch_k=20,  # Number of documents to initially fetch
    lambda_mult=0.5,  # Diversity parameter (0 = max diversity, 1 = max relevance)
)

[Document(id='bf4e8670-55c6-40f9-96ca-70643d34d806', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."),
 Document(id='20ae0319-45dd-45c5-9d73-d693e8ab347f', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 5}, page_content='Vector databases store high-dimensional vectors and efficiently perform similarity searches. Popular vector databases include Pinecone, Milvus, and FAISS. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to enable fast approximate

#### Response enhancement: Improving generator output

In [14]:
from langchain_core.documents import Document

# Example documents
documents = [
    Document(
        page_content="The transformer architecture was introduced in the paper 'Attention is All You Need' by Vaswani et al. in 2017.",
        metadata={"source": "Neural Network Review 2021", "page": 42},
    ),
    Document(
        page_content="BERT uses bidirectional training of the Transformer, masked language modeling, and next sentence prediction tasks.",
        metadata={"source": "Introduction to NLP", "page": 137},
    ),
    Document(
        page_content="GPT models are autoregressive transformers that predict the next token based on previous tokens.",
        metadata={"source": "Large Language Models Survey", "page": 89},
    ),
]

##### Source attribution

Explicitly connects generated information to the retrieved sources, helping users verify facts and understand where information comes from.

Example implementation
1. Retrieving relevant documents for a query
1. Formatting each document with a citation number
1. Using a prompt that explicitly requests citations for each fact
1. Generating a response that includes inline citations ([1], [2], etc.)
1. Adding a references section that links each citation to its source

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS


# Create a vector store and retriever
# embeddings = Config().new_openai_like_embeddings()
embeddings = new_hf_embeddings()
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Source attribution prompt template
attribution_prompt = ChatPromptTemplate.from_template(
    """
You are a precise AI assistant that provides well-sourced information.
Answer the following question based ONLY on the provided sources. For each fact or claim in your answer,
include a citation using [1], [2], etc. that refers to the source. Include a numbered reference list at the end.

Question: {question}

Sources:
{sources}

Your answer:
"""
)

In [16]:
from langchain_core.output_parsers import StrOutputParser


# Create a source-formatted string from documents
def format_sources_with_citations(docs):
    formatted_sources = []
    for i, doc in enumerate(docs, 1):
        source_info = f"[{i}] {doc.metadata.get('source', 'Unknown source')}"
        if doc.metadata.get("page"):
            source_info += f", page {doc.metadata['page']}"
        formatted_sources.append(f"{source_info}\n{doc.page_content}")
    return "\n\n".join(formatted_sources)


# Build the RAG chain with source attribution
def generate_attributed_response(question):
    # Retrieve relevant documents
    retrieved_docs = retriever.invoke(question)

    # Format sources with citation numbers
    sources_formatted = format_sources_with_citations(retrieved_docs)
    # print(sources_formatted)

    # Create the attribution chain using LCEL
    attribution_chain = (
        attribution_prompt | Config().new_openai_like(temperature=0) | StrOutputParser()
    )

    # Generate the response with citations
    response = attribution_chain.invoke(
        {"question": question, "sources": sources_formatted}
    )

    return response

In [17]:
# Example usage
question = "How do transformer models work and what are some examples?"
attributed_answer = generate_attributed_response(question)
print(attributed_answer)

Based solely on the provided sources:

Transformer models are a neural network architecture that was introduced in a 2017 paper titled "Attention is All You Need" [3]. They work using an attention mechanism.

Two specific examples of transformer models are:
1.  **GPT models**, which are described as autoregressive transformers. They function by predicting the next token in a sequence based on the previous tokens [1].
2.  **BERT**, which uses a bidirectional training approach based on the Transformer architecture. Its training involves masked language modeling and next sentence prediction tasks [2].

**References:**
[1] Large Language Models Survey, page 89
[2] Introduction to NLP, page 137
[3] Neural Network Review 2021, page 42
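A cheap sanity check on attributed answers is to confirm that every `[n]` marker points at a source that was actually supplied. The sketch below is one way to do that; `check_citations` is a hypothetical helper, and the sample answer is hand-written for the example rather than taken from a model.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict[str, list[int]]:
    """Compare the [n] markers used in `answer` with the sources that were supplied."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = set(range(1, num_sources + 1))
    return {
        "cited": cited,
        "invalid": [n for n in cited if n not in valid],  # markers with no matching source
        "unused": sorted(valid - set(cited)),  # sources the answer never cites
    }


answer = "Transformers were introduced in 2017 [3]. GPT models are autoregressive [1]. See also [5]."
report = check_citations(answer, num_sources=3)
print(report)  # {'cited': [1, 3, 5], 'invalid': [5], 'unused': [2]}
```

When the answer ends with its own reference list, that list should be stripped before the check, since it would otherwise count every source as cited.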


##### Self-consistency checking: ensuring factual accuracy

In [41]:
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def verify_response_accuracy(
    retrieved_docs: list[Document], generated_answer: str, llm: ChatOpenAI | None = None
) -> str:
    """
    Verify if a generated answer is fully supported by the retrieved documents.
    Args:
        retrieved_docs: List of documents used to generate the answer
        generated_answer: The answer produced by the RAG system
        llm: Language model to use for verification
    Returns:
        JSON-formatted string containing verification results and any identified issues
    """
    if llm is None:
        llm = Config().new_openai_like(temperature=0)

    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Define verification prompt; literal JSON braces are doubled ({{ }}) so the template does not parse them as variables
    verification_prompt = ChatPromptTemplate.from_template(
        """
    As a fact-checking assistant, verify whether the following answer is fully supported
    by the provided context. Identify any statements that are not supported or contradict the context.
    
    Context:
    {context}
    
    Answer to verify:
    {answer}
    
    Perform a detailed analysis with the following structure:
    1. List any factual claims in the answer
    2. For each claim, indicate whether it is:
       - Fully supported (provide the supporting text from context)
       - Partially supported (explain what parts lack support)
       - Contradicted (identify the contradiction)
       - Not mentioned in context
    3. Overall assessment: Is the answer fully grounded in the context?
    
    Return your analysis in JSON format with the following structure:
    {{
      "claims": [
        {{
          "claim": "The factual claim",
          "status": "fully_supported|partially_supported|contradicted|not_mentioned",
          "evidence": "Supporting or contradicting text from context",
          "explanation": "Your explanation"
        }}
      ],
      "fully_grounded": true|false,
      "issues_identified": ["List any specific issues"]
    }}
    """
    )

    # Create verification chain using LCEL
    verification_chain = verification_prompt | llm | StrOutputParser()

    # Run verification
    result = verification_chain.invoke({"context": context, "answer": generated_answer})

    return result

In [42]:
# Example usage
retrieved_docs = [
    Document(
        page_content="The transformer architecture was introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. It relies on self-attention mechanisms instead of recurrent or convolutional neural networks."
    ),
    Document(
        page_content="BERT is a transformer-based model developed by Google that uses masked language modeling and next sentence prediction as pre-training objectives."
    ),
]

generated_answer = "The transformer architecture was introduced by OpenAI in 2018 and uses recurrent neural networks. BERT is a transformer model developed by Google."

verification_result = verify_response_accuracy(retrieved_docs, generated_answer)
print(verification_result)

{
  "claims": [
    {
      "claim": "The transformer architecture was introduced by OpenAI in 2018",
      "status": "contradicted",
      "evidence": "The transformer architecture was introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017.",
      "explanation": "The context clearly states that the transformer architecture was introduced in 2017 by Vaswani et al., not by OpenAI in 2018. This directly contradicts the claim."
    },
    {
      "claim": "The transformer architecture ... uses recurrent neural networks",
      "status": "contradicted",
      "evidence": "It relies on self-attention mechanisms instead of recurrent or convolutional neural networks.",
      "explanation": "The context explicitly states that transformers use self-attention mechanisms instead of recurrent neural networks, making this claim directly false."
    },
    {
      "claim": "BERT is a transformer model developed by Google",
      "status": "fully_supported",
      "evidence": 

The verification can be further enhanced by:
1. Granular claim extraction: Breaking down complex responses into atomic factual claims
1. Evidence linking: Explicitly connecting each claim to specific supporting text
1. Confidence scoring: Assigning numerical confidence scores to different parts of the response
1. Selective regeneration: Regenerating only the unsupported portions of responses
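As a first step toward selective regeneration, the JSON report can be parsed to isolate the claims that need attention. This is a minimal sketch: `unsupported_claims` is a hypothetical helper, and `sample` is a hand-written payload that follows the schema requested in the verification prompt, not real model output.

```python
import json


def unsupported_claims(verification_json: str) -> list[dict]:
    """Return the claims whose status is anything other than "fully_supported"."""
    report = json.loads(verification_json)
    return [c for c in report.get("claims", []) if c.get("status") != "fully_supported"]


# Hand-written example following the prompt's schema
sample = json.dumps(
    {
        "claims": [
            {"claim": "The transformer architecture was introduced by OpenAI in 2018", "status": "contradicted"},
            {"claim": "BERT is a transformer model developed by Google", "status": "fully_supported"},
        ],
        "fully_grounded": False,
        "issues_identified": ["Wrong origin and year for the transformer architecture"],
    }
)

for claim in unsupported_claims(sample):
    print(claim["status"], "-", claim["claim"])
```

`json.loads` raises an error if the model wraps its reply in extra text or a code fence, so production code would need to handle that case before parsing.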

#### Corrective RAG

In real-world applications, retrieval systems often return irrelevant, insufficient, or even misleading content.

Corrective Retrieval-Augmented Generation (CRAG) directly addresses this challenge by introducing explicit evaluation and correction mechanisms into the RAG pipeline.

Workflow
1. **Initial retrieval**: Standard document retrieval from the vector store based on the query.
1. **Retrieval evaluation**: A retrieval evaluator component assesses each document’s relevance and quality.
1. **Conditional correction**:
    1. **Relevant documents**: Pass high-quality documents directly to the generator.
    1. **Irrelevant documents**: Filter out low-quality documents to prevent noise.
    1. **Insufficient/ambiguous results**: Trigger alternative information-seeking strategies (like web search) when internal knowledge is inadequate.
1. **Generation**: Produce the final response using the filtered or augmented context.

<img src="static/corrective-rag.png" style="width: 80%;" alt="Corrective RAG workflow" />
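The retrieve, evaluate, and correct steps can be sketched as a single routing function. Everything here is a simplified assumption: `corrective_retrieve`, the keyword-based `grade`, and `fallback_search` are hypothetical stand-ins for a real retriever, an LLM-based relevance evaluator, and a web search tool.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],
    fallback_search: Callable[[str], list[str]],
    min_relevant: int = 1,
) -> list[str]:
    """Keep documents the evaluator accepts; fall back to another source if too few remain."""
    docs = retrieve(query)
    relevant = [doc for doc in docs if grade(query, doc)]
    if len(relevant) < min_relevant:
        # Internal knowledge is insufficient: supplement with an alternative strategy
        relevant = relevant + fallback_search(query)
    return relevant


# Toy stand-ins for the three components
knowledge_base = ["Transformers rely on self-attention", "BERT was developed by Google"]


def retrieve(query: str) -> list[str]:
    return knowledge_base


def grade(query: str, doc: str) -> bool:
    return any(term in doc.lower() for term in query.lower().split())


def fallback_search(query: str) -> list[str]:
    return [f"(web result for: {query})"]


print(corrective_retrieve("bert pretraining", retrieve, grade, fallback_search))
print(corrective_retrieve("stock prices", retrieve, grade, fallback_search))
```

The first query keeps only the document that passes grading, while the second finds nothing relevant and is routed to the fallback; the returned context would then be handed to the generator.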

#### Agentic RAG

Agentic RAG uses agents to:
1. Analyze queries and decompose complex questions into manageable sub-questions
1. Plan information-gathering strategies based on the specific task requirements
1. Select appropriate tools (retrievers, web search, calculators, APIs, etc.)
1. Execute multi-step processes, potentially involving multiple rounds of retrieval and reasoning
1. Reflect on intermediate results and adapt strategies accordingly

CRAG primarily enhances data quality through evaluation and correction, while agentic RAG focuses on process intelligence through autonomous planning and orchestration.

Agentic RAG is particularly valuable for complex use cases that require:
1. Multi-step reasoning across multiple information sources
1. Dynamic tool selection based on query analysis
1. Persistent task execution with intermediate reflection
1. Integration with various external systems and APIs

The trade-offs are significant implementation complexity, potentially higher latency from multiple reasoning steps, and increased computational cost from the extra LLM calls needed for planning and reflection.
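A stripped-down sketch of the decompose, select-tool, and execute loop is shown below. It is not an agent framework: `run_agent`, `decompose`, `choose_tool`, and the two entries in `tools` are hypothetical, rule-based stand-ins for decisions a real agent would delegate to an LLM, and the reflection step is omitted for brevity.

```python
from typing import Callable


def run_agent(
    question: str,
    decompose: Callable[[str], list[str]],
    choose_tool: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> list[tuple[str, str, str]]:
    """Answer each sub-question with the tool chosen for it and collect the observations."""
    notes = []
    for sub_question in decompose(question)[:max_steps]:
        tool_name = choose_tool(sub_question)
        observation = tools[tool_name](sub_question)
        notes.append((sub_question, tool_name, observation))
    return notes


# Hypothetical stand-ins: a real agent would ask an LLM to plan and route
def decompose(question: str) -> list[str]:
    return [part.strip() for part in question.split(" and ")]


def choose_tool(sub_question: str) -> str:
    return "calculator" if any(ch.isdigit() for ch in sub_question) else "retriever"


tools = {
    "retriever": lambda q: f"(documents about: {q})",
    "calculator": lambda q: "(numeric result)",
}

notes = run_agent("Who developed BERT and what is 12 * 7", decompose, choose_tool, tools)
for sub_question, tool_name, observation in notes:
    print(f"{tool_name}: {sub_question} -> {observation}")
```

The `max_steps` cap is one simple way to bound the extra latency and cost that multi-step execution introduces.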

#### Choosing the right techniques

| RAG Approach | Chapter Section | Core Mechanism | Key Strengths | Key Weaknesses | Primary Use Cases | Relative Complexity |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Naive RAG | Breaking down the RAG pipeline | Basic index → retrieve → generate workflow with single retrieval step | • Simple implementation<br>• Low initial resource usage<br>• Straightforward debugging | • Limited retrieval quality<br>• Vulnerability to hallucinations<br>• No handling of retrieval failures | • Simple Q&A systems<br>• Basic document lookup<br>• Prototyping | Low |
| Hybrid Retrieval | Advanced RAG techniques - hybrid retrieval | Combines sparse (BM25) and dense (vector) retrieval methods | • Balances keyword precision with semantic understanding<br>• Handles vocabulary mismatch<br>• Improves recall without sacrificing precision | • Increased system complexity<br>• Challenge in optimizing fusion weights<br>• Higher computational overhead | • Technical documentation<br>• Content with specialized terminology<br>• Multi-domain knowledge bases | Medium |
| Re-ranking | Advanced RAG techniques - re-ranking | Post-processes initial retrieval results with more sophisticated relevance models | • Improves result ordering<br>• Captures nuanced relevance signals<br>• Can be applied to any retrieval method | • Additional computation layer<br>• May create bottlenecks for large result sets<br>• Requires training or configuring re-rankers | • When retrieval quality is critical<br>• For handling ambiguous queries<br>• High-value information needs | Medium |
| Query Transformation (HyDE) | Advanced RAG techniques - query transformation | Generates hypothetical document from query for improved retrieval | • Bridges query-document semantic gap<br>• Improves retrieval for complex queries<br>• Handles implicit information needs | • Additional LLM generation step<br>• Depends on hypothetical document quality<br>• Potential for query drift | • Complex or ambiguous queries<br>• Users with unclear information needs | Medium |
| Context Processing | Advanced RAG techniques - context processing | Optimizes retrieved documents before sending to the generator (compression, MMR) | • Maximizes context window utilization<br>• Reduces redundancy<br>• Focuses on most relevant information | • Risk of removing important context<br>• Processing adds latency<br>• May lose document coherence | • Large documents<br>• When context window is limited<br>• Redundant information sources | Medium |
| Response Enhancement | Advanced RAG techniques - response enhancement | Improves generated output with source attribution and consistency checking | • Increases output trustworthiness<br>• Provides verification mechanisms<br>• Enhances user confidence | • May reduce fluency or conciseness<br>• Additional post-processing overhead<br>• Complex implementation logic | • Educational or research content<br>• Legal or medical information<br>• When attribution is required | Medium-High |
| Corrective RAG (CRAG) | Advanced RAG techniques - corrective RAG | Evaluates retrieved documents and takes corrective actions (filtering, web search) | • Explicitly handles poor retrieval results<br>• Improves robustness<br>• Can dynamically supplement knowledge | • Increased latency from evaluation<br>• Depends on evaluator accuracy<br>• More complex conditional logic | • High-reliability requirements<br>• Systems needing factual accuracy<br>• Applications with potential knowledge gaps | High |
| Agentic RAG | Advanced RAG techniques - agentic RAG | Uses autonomous AI agents to orchestrate information gathering and synthesis | • Highly adaptable to complex tasks<br>• Can use diverse tools beyond retrieval<br>• Multi-step reasoning capabilities | • Significant implementation complexity<br>• Higher cost and latency<br>• Challenging to debug and control | • Complex multi-step information tasks<br>• Research applications<br>• Systems integrating multiple data sources | Very High |

## Developing a corporate documentation chatbot

For the source code, see the src/chapter04/developing-a-corporate-documentation-chatbot directory.

In [43]:
%uv pip install langgraph~=1.0 streamlit~=1.50

[2K[37m⠙[0m [2mResolving dependencies...                                                     [0m

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[2K[2mResolved [1m60 packages[0m [2min 549ms[0m[0m                                        [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/2)                                                   
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m     0 B/153.08 KiB          [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m     0 B/153.08 KiB          [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m 16.00 KiB/153.08 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m 32.00 KiB/153.08 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m 48.00 KiB/153.08 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m 60.25 KiB/153.08 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/2)--------------[0m[0m 60.25 KiB/153.08 KiB        [1A
[2mlanggraph-prebuilt  [

In [44]:
%uv run src/chapter04/developing-a-corporate-documentation-chatbot/rag.py

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
  _warn_about_sha1_encoder()
[]
INFO:httpx:HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"
The square root of 10 is approximately 3.1623.

None of the provided corporate document snippets are relevant to this mathematical question.
INFO:httpx:HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"
no issues detected
The square root of 10 is approximately 3.1623.

None of the provided corporate document snippets are relevant to this mathematical question.
Note: you may need to restart the kernel to use updated packages.


In [23]:
!.venv/bin/streamlit run src/chapter04/developing-a-corporate-documentation-chatbot/streamlit_app.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.17.0.4:8501
  External URL: http://43.132.141.4:8501

  Stopping...
^C


## Evaluation and performance considerations

Improvements for the RAG pipeline:
1. Integrate a robust retrieval system such as FAISS, Pinecone, or Elasticsearch to fetch real-time sources.
1. Add scoring mechanisms such as precision, recall, and mean reciprocal rank (MRR) to evaluate retrieval quality.
1. Assess answer accuracy by comparing generated responses against ground-truth data or curated references, and incorporate human-in-the-loop validation to ensure the outputs are both correct and useful.
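
The retrieval metrics named in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the chapter's implementation: the function names and the convention of representing documents by string IDs are assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant document IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

To use it, collect the ranked document IDs returned by the retriever for each test query, pair them with a hand-labeled set of relevant IDs, and pass the pairs to `mean_reciprocal_rank`.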

Other considerations:
1. Add error handling throughout the pipeline.
1. Build observability into the pipeline by logging API calls, node execution times, and retrieval performance; this is essential for scaling up and maintaining reliability in production.
1. Optimize API use by leveraging local models when possible, caching common queries, and managing memory efficiently when handling large-scale embeddings; this supports cost optimization and scalability.
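
As a rough sketch of two of the points above (logging node execution times and caching common queries), the standard library alone is enough to get started. Everything named here is hypothetical: `timed`, `cached_answer`, `answer`, and `call_count` are illustrative names, and the function body is a placeholder where a real pipeline would call the retriever and the LLM.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag.pipeline")


def timed(node_name: str):
    """Decorator that logs how long a pipeline node takes, even if it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("node=%s elapsed_ms=%.1f", node_name, elapsed_ms)
        return wrapper
    return decorator


# Counts real (uncached) generations so the cache's effect is observable.
call_count = {"generate": 0}


@functools.lru_cache(maxsize=256)
@timed("generate")
def cached_answer(normalized_query: str) -> str:
    """Placeholder for the expensive retrieve-and-generate step (assumption)."""
    call_count["generate"] += 1
    return f"answer for: {normalized_query}"


def answer(query: str) -> str:
    """Normalize case and whitespace so trivially different queries share a cache entry."""
    return cached_answer(" ".join(query.lower().split()))
```

Because `lru_cache` wraps the timed function, a timing line is logged only on cache misses, which keeps the log focused on calls that actually cost money. An in-process cache like this is lost on restart; a shared store would be needed across multiple workers.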

## Troubleshooting RAG systems

Robust design and continuous system calibration:
1. Foundational setup: Ensure comprehensive and high-quality document collections, clear prompt formulations, and effective retrieval techniques that enhance precision and relevance.
1. Continuous calibration: Regular monitoring, user feedback, and updates to the knowledge base help identify emerging issues during operation.

A few common failure points and their remedies are as follows:
1. **Missing content**: Prevent this by validating content during ingestion and adding domain-specific resources. Use explicit signals to indicate when information is unavailable.
1. **Missed top-ranked documents**: Improve this with advanced embedding models, hybrid semantic-lexical searches, and sentence-level retrieval.
1. **Context window limitations**: Mitigate this by optimizing document chunking and extracting only the most relevant sentences.
1. **Information extraction failure**: The LLM fails to synthesize the available context properly. Resolve this by refining prompt design: explicit instructions and contrastive examples improve extraction accuracy.
1. **Format compliance issues**: Enforce structured output with parsers, precise format examples, and post-processing validation.
1. **Specificity mismatch**: The output may be too general or too detailed. Address this by using query expansion techniques and tailoring prompts based on the user’s expertise level.
1. **Incomplete information**: Increase retrieval diversity (e.g., using maximum marginal relevance) and refine query transformation methods to cover all aspects of the query.
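
To make the last remedy concrete, here is a minimal pure-Python sketch of maximum marginal relevance (MMR) re-ranking. It assumes documents and the query are already embedded as plain lists of floats; the function names and the `lambda_mult` parameter name are choices made for this example. Each step picks the candidate that best balances similarity to the query against similarity to documents already selected.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen by maximum marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate documents and one distinct document, a balanced `lambda_mult` tends to return one of the duplicates plus the distinct document, whereas `lambda_mult=1.0` returns both duplicates. In practice, LangChain vector stores such as Chroma expose a `max_marginal_relevance_search` method, so a hand-rolled version like this is mainly useful for understanding the trade-off.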