# 04. Building Intelligent RAG Systems

## Install dependencies

In [16]:
%uv pip install langchain-core~=1.0

Resolved 25 packages in 14ms
Uninstalled 1 package in 5ms
Installed 1 package in 20ms
 - langchain-core==0.3.79
 + langchain-core==1.0.2

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Note: you may need to restart the kernel to use updated packages.


In [17]:
%uv pip install langchain-chroma~=1.0 langchain-community==1.0.0a1 langchain-openai~=1.0

Resolved 115 packages in 29ms
Uninstalled 2 packages in 18ms
Installed 2 packages in 44ms
 - langchain-community==0.3.31
 + langchain-community==1.0.0a1
 - langchain-text-splitters==0.3.11
 + langchain-text-splitters==1.0.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
%uv pip install jq~=1.10 python-dotenv~=1.1 scann~=1.4 transformers~=4.56

Resolved 22 packages in 62ms
Uninstalled 1 package in 1ms
Installed 5 packages in 113ms
 - huggingface-hub==1.0.1
 + huggingface-hub==0.36.0
 + jq==1.10.0
 + safetensors==0.6.2
 + scann==1.4.2
 + transformers==4.57.1
Note: you may need to restart the kernel to use updated packages.


Utility class

In [4]:
import os

import dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


class Config:
    def __init__(self):
        # load_dotenv does not override existing environment variables by default.
        # find_dotenv(usecwd=True) starts at the current working directory and walks up
        # through parent directories until it finds a .env file.
        dotenv_path = dotenv.find_dotenv(usecwd=True)
        if not dotenv_path:
            raise ValueError("No .env file found")
        dotenv.load_dotenv(dotenv_path=dotenv_path)

        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is not set")

        base_url = os.getenv("OPENAI_API_BASE_URL")
        if not base_url:
            raise ValueError("OPENAI_API_BASE_URL is not set")

        model = os.getenv("OPENAI_MODEL")
        if not model:
            raise ValueError("OPENAI_MODEL is not set")

        vl_model = os.getenv("OPENAI_VL_MODEL")
        embeddings_model = os.getenv("OPENAI_EMBEDDINGS_MODEL")
        hf_pretrained_embeddings_model = os.getenv("HF_PRETRAINED_EMBEDDINGS_MODEL")

        self.api_key = api_key
        self.base_url = base_url
        self.model = model
        self.vl_model = vl_model
        self.embeddings_model = embeddings_model
        self.hf_pretrained_embeddings_model = (
            hf_pretrained_embeddings_model
            if hf_pretrained_embeddings_model
            else "Qwen/Qwen3-Embedding-8B"
        )

    def new_openai_like(self, **kwargs) -> ChatOpenAI:
        # Reference: https://bailian.console.aliyun.com/?tab=api#/api/?type=model&url=2587654
        # Reference: https://help.aliyun.com/zh/model-studio/models
        # ChatOpenAI documentation: https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html#langchain_openai.chat_models.base.ChatOpenAI
        return ChatOpenAI(
            api_key=self.api_key, base_url=self.base_url, model=self.model, **kwargs
        )

    def new_openai_like_embeddings(self, **kwargs) -> OpenAIEmbeddings:
        if not self.embeddings_model:
            raise ValueError("OPENAI_EMBEDDINGS_MODEL is not set")

        # Reference: https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings
        return OpenAIEmbeddings(
            api_key=self.api_key,
            base_url=self.base_url,
            model=self.embeddings_model,
            # https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings.tiktoken_enabled
            # For non-official (OpenAI-compatible) implementations, set this parameter to False.
            # This falls back to the Hugging Face transformers AutoTokenizer for tokenization.
            tiktoken_enabled=False,
            # https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain_openai.embeddings.base.OpenAIEmbeddings.model
            # According to Yuanbao, Jina's embedding model https://huggingface.co/jinaai/jina-embeddings-v4
            # is the closest match to text-embedding-ada-002.
            # Personal preference: Qwen/Qwen3-Embedding-8B is used here instead.
            # tiktoken_model_name='Qwen/Qwen3-Embedding-8B',
            tiktoken_model_name=self.hf_pretrained_embeddings_model,
            **kwargs,
        )

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## From indexes to intelligent retrieval

## Components of a RAG system

### When to implement RAG

## From embeddings to search

### Embeddings

In [5]:
# Initialize the embeddings model
embeddings_model = Config().new_openai_like_embeddings()

# Create embeddings for example sentences
text1 = "The cat sat on the mat"
text2 = "A feline rested on the carpet"
text3 = "Python is a programming language"

# Get embeddings using LangChain
embeddings = embeddings_model.embed_documents([text1, text2, text3])

# These similar sentences will have similar embeddings
embedding1 = embeddings[0]  # Embedding for "The cat sat on the mat"
embedding2 = embeddings[1]  # Embedding for "A feline rested on the carpet"
embedding3 = embeddings[2]  # Embedding for "Python is a programming language"

# Output shows number of documents and embedding dimensions
print(f"Number of documents: {len(embeddings)}")
print(f"Dimensions per embedding: {len(embeddings[0])}")
# OpenAI's text-embedding-ada-002 returns 1536 dimensions; the model configured
# here returns 1024, as the output below shows

Number of documents: 3
Dimensions per embedding: 1024
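
The claim that similar sentences get similar embeddings is usually checked with cosine similarity. The sketch below is a minimal illustration: the three short vectors are made-up stand-ins for `embedding1`, `embedding2`, and `embedding3` (real embeddings have 1024 dimensions here), and `cosine_similarity` is a helper written for this example, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional stand-ins for the three sentence embeddings.
toy_cat = [0.9, 0.1, 0.0]
toy_feline = [0.8, 0.2, 0.1]
toy_python = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(toy_cat, toy_feline)
sim_unrelated = cosine_similarity(toy_cat, toy_python)
print(f"cat vs feline: {sim_related:.3f}")
print(f"cat vs python: {sim_unrelated:.3f}")
```

With real embeddings, the same function can be applied directly to `embedding1`, `embedding2`, and `embedding3` from the cell above.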


### Vector stores

The vector database operates as an independent system that can be:
- Scaled independently of the RAG components
- Maintained and optimized separately
- Potentially shared across multiple RAG applications
- Hosted as a dedicated service

When working with embeddings, several challenges arise:
- **Scale**: Applications often need to store millions of embeddings
- **Dimensionality**: Each embedding might have hundreds or thousands of dimensions
- **Search performance**: Finding similar vectors quickly becomes computationally intensive
- **Associated data**: We need to maintain connections between vectors and their source documents

Vector stores combine two essential components:
- **Vector storage**: The actual database that persists vectors and metadata
- **Vector index**: A specialized data structure that enables efficient similarity search

The curse of dimensionality: as vector dimensions increase, computing similarities becomes increasingly expensive. An exact (brute-force) search compares the query against every stored vector, which takes `O(dN)` operations per query for `d` dimensions and `N` vectors.
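
To make the `O(dN)` cost concrete, here is a minimal sketch of an exact search in plain Python. The function name `brute_force_search` and the sample vectors are illustrative assumptions; a real vector store would avoid this full scan by using an index.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Inner product: d multiplications for d-dimensional vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query: list[float], vectors: list[list[float]], k: int = 1) -> list[int]:
    """Score all N stored vectors against the query and return the top-k indices."""
    scored = [(dot(query, vector), index) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # N = 3 vectors, d = 2 dimensions
query = [1.0, 0.1]

top = brute_force_search(query, stored, k=2)
multiplications = len(stored) * len(query)  # d * N work for a single query
print(top, multiplications)
```

Doubling either the number of stored vectors or their dimensionality doubles the work per query, which is why approximate indexes matter at scale.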


Traditional database:
- Uses exact matching (equality, ranges)
- Optimized for structured data (for example, “find all customers with age > 30”)
- Usually utilizes B-trees or hash-based indexes

Vector store search:
- Uses similarity metrics (cosine similarity, Euclidean distance)
- Optimized for high-dimensional vector spaces
- Employs Approximate Nearest Neighbor (ANN) algorithms

#### Vector stores comparison

#### Hardware considerations for vector stores

#### Vector store interface in LangChain

In [6]:
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Initialize with an embedding model
embeddings = Config().new_openai_like_embeddings()

# Create some sample documents, each carrying an ID in its metadata
docs = [
    Document(page_content="Content about language models", metadata={"id": "doc_1"}),
    Document(
        page_content="Information about vector databases", metadata={"id": "doc_2"}
    ),
    Document(page_content="Details about retrieval systems", metadata={"id": "doc_3"}),
]

# Create the vector store
vector_store = Chroma(embedding_function=embeddings)

# Add documents (the IDs above live only in metadata; Chroma assigns its own UUIDs, as the output shows)
vector_store.add_documents(docs)

# Similarity Search with appropriate k value
results = vector_store.similarity_search("How do language models work?", k=2)

# For maximum marginal relevance search, adjust the parameters based on available documents
# Find relevant BUT diverse documents (reduce redundancy)
results = vector_store.max_marginal_relevance_search(
    "How does LangChain work?",
    k=3,
    fetch_k=10,
    lambda_mult=0.5,  # Controls diversity (0=max diversity, 1=max relevance)
)
print(results)

[Document(id='cdb87f66-2c1a-4f8f-b40b-ac1cc23bb874', metadata={'id': 'doc_1'}, page_content='Content about language models'), Document(id='c9fc393a-d6a1-454c-948f-108b4bd775e2', metadata={'id': 'doc_3'}, page_content='Details about retrieval systems'), Document(id='268a82a4-c387-4acd-ab79-54b180b3f95f', metadata={'id': 'doc_2'}, page_content='Information about vector databases')]
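
The effect of `lambda_mult` is easier to see in a standalone version of the selection rule. The following is a sketch of the textbook maximal marginal relevance procedure, not Chroma's or LangChain's internal code; `mmr_select`, the query vector, and the document vectors are all made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick documents that are relevant to the query but unlike earlier picks."""
    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        best_index, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected


query_vec = [1.0, 0.2]
doc_vecs = [[1.0, 0.0], [1.0, 0.05], [0.5, 0.8]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # relevance and diversity
```

With `lambda_mult=1.0` the two near-duplicate vectors are returned together; with `lambda_mult=0.5` the second pick switches to the less similar document.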


### Vector indexing strategies

Some common indexing approaches include:
- **Tree-based structures** that hierarchically divide the vector space
- **Graph-based methods** like **Hierarchical Navigable Small World (HNSW)** that create navigable networks of connected vectors
- **Hashing techniques** that map similar vectors to the same “buckets”

The faiss library does not support Python 3.12, and I could not find API documentation for Google's ScaNN library.
TODO: reproduce the book's code with ScaNN.

## Breaking down the RAG pipeline

1. Load documents

In [7]:
from langchain_community.document_loaders import JSONLoader

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

print(documents)

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."), Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 2}, page_content='BERT (Bidirectional Encoder Representations from Transformers) was developed by Google AI Language team in 2018. It is pre-trained using masked language modeling and next sentence prediction tasks. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.'), Document(metadata={'source':

2. Make embedding model

In [8]:
embedder = Config().new_openai_like_embeddings()

3. Store in vector database

In [9]:
from langchain_community.vectorstores import ScaNN

vector_db = ScaNN.from_documents(documents, embedder)

4. Retrieve similar docs

In [10]:
query = "What are the effects of climate change?"

vector_db.similarity_search(query)

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 2}, page_content='BERT (Bidirectional Encoder Representations from Transformers) was developed by Google AI Language team in 2018. It is pre-trained using masked language modeling and next sentence prediction tasks. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.'),
 Document(metadata={'source

### Document processing

A document loader is a component in LangChain that transforms various data sources into a standardized document format that can be used throughout the LangChain ecosystem.

In [11]:
from langchain_community.document_loaders import JSONLoader

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

print(documents)

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."), Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 2}, page_content='BERT (Bidirectional Encoder Representations from Transformers) was developed by Google AI Language team in 2018. It is pre-trained using masked language modeling and next sentence prediction tasks. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.'), Document(metadata={'source':

#### Chunking strategies

##### Fixed-size chunking

In [12]:
%uv pip install langchain-text-splitters~=1.0

Audited 1 package in 2ms
Note: you may need to restart the kernel to use updated packages.


In [13]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",  # Split on spaces to avoid breaking words
    chunk_size=200,
    chunk_overlap=20,
)

chunks = text_splitter.split_documents(documents)
print(f"Generated {len(chunks)} chunks from document")

Generated 13 chunks from document
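
The interaction between `chunk_size` and `chunk_overlap` can be seen in a stripped-down sliding window. This sketch cuts purely by character position, so unlike `CharacterTextSplitter` it ignores separators and may break words; `fixed_size_chunks` is a hypothetical helper written for this example.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which keeps some context available on both sides of a cut.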


##### Recursive character chunking

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""], chunk_size=150, chunk_overlap=20
)

document = """# Introduction to RAG
Retrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models.

It helps address hallucinations by grounding responses in retrieved information.

## Key Components
RAG consists of several components:
1. Document processing
2. Vector embedding
3. Retrieval
4. Augmentation
5. Generation

### Document Processing
This step involves loading and chunking documents appropriately.
"""

text_splitter.split_text(document)

['# Introduction to RAG\nRetrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models.',
 'It helps address hallucinations by grounding responses in retrieved information.',
 '## Key Components\nRAG consists of several components:\n1. Document processing\n2. Vector embedding\n3. Retrieval\n4. Augmentation\n5. Generation',
 '### Document Processing\nThis step involves loading and chunking documents appropriately.']
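
The idea behind the ordered `separators` list can be sketched in a few lines: split on the coarsest separator first, merge small neighbours, and recurse with finer separators only for pieces that are still too long. This is a simplified illustration without overlap handling, not the `RecursiveCharacterTextSplitter` implementation; `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, ["\n\n", " ", ""], chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The paragraph break is honoured first, and only the second paragraph, which exceeds the limit, is split further on spaces.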

##### Document-specific chunking

##### Semantic chunking

TODO: the latest langchain-experimental release (v0.3) depends on langchain v0.3. Fix the program below once langchain-experimental is upgraded.

In [15]:
%uv pip install langchain-experimental~=0.3

Resolved 47 packages in 551ms
Uninstalled 3 packages in 22ms
Installed 5 packages in 80ms
 + langchain==0.3.27
 - langchain-community==1.0.0a1
 + langchain-community==0.3.31
 - langchain-core==1.0.2
 + langchain-core==0.3.79
 + langchain-experimental==0.3.4
 - langchain-text-splitters==1.0.0
 + langchain-text-splitters==0.3.11
Note: you may need to restart the kernel to use updated packages.


In [15]:
from langchain_experimental.text_splitter import SemanticChunker

embeddings = Config().new_openai_like_embeddings()
text_splitter = SemanticChunker(
    embeddings=embeddings, add_start_index=True  # Include position metadata
)

text_splitter.split_text(document)

['# Introduction to RAG\nRetrieval-Augmented Generation (RAG) combines retrieval systems with generative AI models. It helps address hallucinations by grounding responses in retrieved information. ## Key Components\nRAG consists of several components:\n1. Document processing\n2. Vector embedding\n3. Retrieval\n4.',
 'Augmentation\n5. Generation\n\n### Document Processing\nThis step involves loading and chunking documents appropriately. ']
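
Semantic chunking places breakpoints where adjacent sentences differ most in meaning. The sketch below shows that idea with a deliberately crude stand-in for an embedding model (word counts) and a fixed distance threshold; `toy_embed`, `semantic_chunks`, and the threshold value are assumptions made for illustration and do not mirror `SemanticChunker` internals.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], threshold: float) -> list[str]:
    """Start a new chunk whenever adjacent sentences are further apart than threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(sentence) for sentence in sentences]
    chunks: list[str] = []
    current = [sentences[0]]
    for previous, vector, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(previous, vector) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats chase mice.",
    "Cats chase birds.",
    "Python runs scripts.",
    "Python runs tests.",
]
print(semantic_chunks(sentences, threshold=0.5))
```

Here the only breakpoint falls between the second and third sentences, where the vocabulary, and so the toy embedding, changes completely.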

##### Agent-based chunking
Agent-based chunking uses LLMs to divide text based on semantic analysis and content understanding.

##### Multi-modal chunking

##### Choosing the right chunking strategy

#### Retrieval
Retrieval integrates a vector store with other LangChain components for simplified querying and compatibility.

A retriever in LangChain follows a common pattern:
- Input: Takes a query as a string
- Processing: Applies retrieval logic specific to the implementation
- Output: Returns a list of document objects, each containing:
  - `page_content`: The actual document content
  - `metadata`: Associated information like document ID or source
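
The input, processing, and output steps above can be mimicked without any library. In this sketch, `SimpleDocument` and `KeywordOverlapRetriever` are hypothetical classes that only imitate the shape of the pattern (a string in, a ranked list of documents out); they are not LangChain types, and the word-overlap scoring is a placeholder for real retrieval logic.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Minimal stand-in for a document: content plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordOverlapRetriever:
    """Ranks documents by how many query words they share."""

    def __init__(self, documents: list[SimpleDocument], k: int = 2):
        self.documents = documents
        self.k = k

    def invoke(self, query: str) -> list[SimpleDocument]:
        query_words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap:
                scored.append((overlap, doc))
        # sort() is stable, so ties keep their original document order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


sample_docs = [
    SimpleDocument("Content about language models", {"id": "doc_1"}),
    SimpleDocument("Information about vector databases", {"id": "doc_2"}),
    SimpleDocument("Details about retrieval systems", {"id": "doc_3"}),
]
toy_retriever = KeywordOverlapRetriever(sample_docs, k=2)
for doc in toy_retriever.invoke("about retrieval"):
    print(doc.metadata["id"], doc.page_content)
```

Because callers only depend on `invoke`, the scoring logic can be swapped for vector search or an external API without changing the code that consumes the results.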

##### LangChain retrievers

##### Vector store retrievers

In [18]:
from langchain_community.retrievers import KNNRetriever

embeddings = Config().new_openai_like_embeddings()

retriever = KNNRetriever.from_documents(documents, embeddings)
retriever.invoke("query")

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 5}, page_content='Vector databases store high-dimensional vectors and efficiently perform similarity searches. Popular vector databases include Pinecone, Milvus, and FAISS. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to enable fast approximate nearest neighbor search. These databases are essential for scaling embedding-based retrieval systems to large document collections.'),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 4}, page_content='Retrieval-Augmented Generation (RAG) combines a retrieval system with a text generator. The retriever fetches relevant documents from a knowledge base, and these documents are then provided as context to the generator. RAG models can be fine-tuned end-to-end and leverage large pre

In [19]:
%uv pip install xmltodict~=1.0

Resolved 1 package in 44ms
Installed 1 package in 13ms
 + xmltodict==1.0.2
Note: you may need to restart the kernel to use updated packages.


In [20]:
from langchain_community.retrievers import PubMedRetriever

retriever = PubMedRetriever(email="xiangminli@outlook.com")
# FIXME: this call does not run successfully yet
results = retriever.invoke("chatgpt")

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [21]:
%uv pip install arxiv~=2.2

Resolved 8 packages in 282ms
Installed 3 packages in 15ms
 + arxiv==2.2.0
 + feedparser==6.0.12
 + sgmllib3k==1.0.0
Note: you may need to restart the kernel to use updated packages.


In [22]:
from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(
    load_max_docs=2,
    # get_full_documents=True,
)

retriever.invoke("chat-gpt")

[Document(metadata={'Entry ID': 'http://arxiv.org/abs/2503.04758v2', 'Published': datetime.date(2025, 3, 10), 'Title': 'Chat-GPT: An AI Based Educational Revolution', 'Authors': 'Sasa Maric, Sonja Maric, Lana Maric'}, page_content='The AI revolution is gathering momentum at an unprecedented rate. Over the\npast decade, we have witnessed a seemingly inevitable integration of AI in\nevery facet of our lives. Much has been written about the potential\nrevolutionary impact of AI in education. AI has the potential to completely\nrevolutionise the educational landscape as we could see entire courses and\ndegrees developed by programs such as ChatGPT. AI has the potential to develop\ncourses, set assignments, grade and provide feedback to students much faster\nthan a team of teachers. In addition, because of its dynamic nature, it has the\npotential to continuously improve its content. In certain fields such as\ncomputer science, where technology is continuously evolving, AI based\napplicatio

### Advanced RAG techniques

A standard vector search has several limitations:
- It might miss contextually relevant documents that use different terminology
- It can’t distinguish between authoritative and less reliable sources
- It might return redundant or contradictory information
- It has no way to verify if generated responses accurately reflect the source material

#### Hybrid retrieval: Combining semantic and keyword search

In [23]:
%uv pip install rank-bm25~=0.2

Resolved 2 packages in 51ms
Installed 1 package in 14ms
 + rank-bm25==0.2.2
Note: you may need to restart the kernel to use updated packages.


In [24]:
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Initialize with an embedding model
embeddings = Config().new_openai_like_embeddings()

# Create some sample documents, each carrying an ID in its metadata
docs = [
    Document(page_content="Content about language models", metadata={"id": "doc_1"}),
    Document(
        page_content="Information about vector databases", metadata={"id": "doc_2"}
    ),
    Document(page_content="Details about retrieval systems", metadata={"id": "doc_3"}),
]

# Create the vector store
vector_store = Chroma(embedding_function=embeddings)

In [26]:
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
%uv pip install langchain-classic~=1.0

Audited 1 package in 3ms
Note: you may need to restart the kernel to use updated packages.


In [29]:
from langchain_classic.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Setup semantic retriever
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Setup lexical retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Combine retrievers
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3],  # Weight semantic search higher than keyword search
)

hybrid_retriever.invoke("climate change impacts")

[Document(id='cdb87f66-2c1a-4f8f-b40b-ac1cc23bb874', metadata={'id': 'doc_1'}, page_content='Content about language models'),
 Document(id='268a82a4-c387-4acd-ab79-54b180b3f95f', metadata={'id': 'doc_2'}, page_content='Information about vector databases'),
 Document(id='c9fc393a-d6a1-454c-948f-108b4bd775e2', metadata={'id': 'doc_3'}, page_content='Details about retrieval systems'),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 5}, page_content='Vector databases store high-dimensional vectors and efficiently perform similarity searches. Popular vector databases include Pinecone, Milvus, and FAISS. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to enable fast approximate nearest neighbor search. These databases are essential for scaling embedding-based retrieval systems to large document collections.'),
 Document(metadata={'source': '/github.com/sammyn
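
One common way to merge ranked lists from several retrievers is weighted Reciprocal Rank Fusion, and to my understanding `EnsembleRetriever` follows this approach. The sketch below shows the scoring rule in isolation; `weighted_rrf`, the document IDs, and the smoothing constant `c=60` are assumptions chosen for illustration.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds weight / (rank + c) to a document's score."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_ranking = ["a", "b", "c"]  # hypothetical vector-search order
keyword_ranking = ["c", "a", "d"]   # hypothetical BM25 order

fused = weighted_rrf([semantic_ranking, keyword_ranking], weights=[0.7, 0.3])
print(fused)
# ['a', 'c', 'b', 'd']
```

Documents that appear in both lists rise to the top, and the weights decide how much each retriever's ordering counts when the lists disagree.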

#### Re-ranking

#### Query transformation: Improving retrieval through better queries

Query expansion generates multiple variations of the original query to capture different aspects or phrasings.

In [None]:
# For why langchain-classic is used here instead of langchain-core, see
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
from langchain_classic.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

expansion_template = """Given the user question: {question}
Generate three alternative versions that express the same information need but with different wording:
1."""

expansion_prompt = PromptTemplate(
    input_variables=["question"], template=expansion_template
)

llm = Config().new_openai_like(temperature=0.7)
expansion_chain = expansion_prompt | llm | StrOutputParser()

# Generate expanded queries
original_query = "What are the effects of climate change?"
reply = expansion_chain.invoke(original_query)
print(reply)

1. How is climate change impacting the environment and human societies?  
2. What are the consequences of global warming and changing climate patterns?  
3. In what ways has climate change altered ecosystems, weather, and daily life?
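
The chain returns the variations as one numbered string, so they have to be separated before each can be sent to a retriever. A minimal sketch, assuming the model keeps the `1.`, `2.`, `3.` format requested by the prompt (`parse_numbered_queries` is a hypothetical helper):

```python
import re


def parse_numbered_queries(reply: str) -> list[str]:
    """Extract the text after each leading '1.' / '2)' style marker, one query per line."""
    queries = []
    for line in reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            queries.append(match.group(1).strip())
    return queries


sample_reply = (
    "1. How is climate change impacting the environment and human societies?  \n"
    "2. What are the consequences of global warming and changing climate patterns?  \n"
    "3. In what ways has climate change altered ecosystems, weather, and daily life?"
)
expanded_queries = parse_numbered_queries(sample_reply)

# Search with the original query plus every variation.
all_queries = ["What are the effects of climate change?"] + expanded_queries
for q in all_queries:
    print(q)
```

Each entry in `all_queries` can then be passed to a retriever, with the results de-duplicated before they reach the generator.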


##### Hypothetical Document Embeddings (HyDE)

HyDE uses an LLM to generate a hypothetical answer document based on the query, and then uses that document’s embedding for retrieval.

In [32]:
from langchain_community.document_loaders import JSONLoader
from langchain_community.vectorstores import ScaNN

# Load a json file
loader = JSONLoader(
    file_path="static/knowledge_base.json",
    jq_schema=".[].content",  # This extracts the content field from each array item
    text_content=True,
)
documents = loader.load()

embedder = Config().new_openai_like_embeddings()

vector_db = ScaNN.from_documents(documents, embedder)

In [34]:
# For why langchain-classic is used here instead of langchain-core, see
# https://docs.langchain.com/oss/python/migrate/langchain-v1#langchain-classic
from langchain_classic.prompts import PromptTemplate


# Create prompt for generating hypothetical document
hyde_template = """Based on the question: {question}
Write a passage that could contain the answer to this question:"""

hyde_prompt = PromptTemplate(input_variables=["question"], template=hyde_template)
llm = Config().new_openai_like(temperature=0.2)
hyde_chain = hyde_prompt | llm | StrOutputParser()

# Generate hypothetical document
query = "What dietary changes can reduce carbon footprint?"
hypothetical_doc = hyde_chain.invoke(query)

# Use the hypothetical document for retrieval
embeddings = Config().new_openai_like_embeddings()
embedded_query = embeddings.embed_query(hypothetical_doc)
vector_db.similarity_search_by_vector(embedded_query, k=3)

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 4}, page_content='Retrieval-Augmented Generation (RAG) combines a retrieval system with a text generator. The retriever fetches relevant documents from a knowledge base, and these documents are then provided as context to the generator. RAG models can be fine-tuned end-to-end and leverage large pre-trained models like BART or T5 for generation. This approach helps ground the gen

#### Context processing: maximizing retrieved information value

##### Contextual compression

Contextual compression extracts only the most relevant parts of each retrieved document and drops irrelevant content that might distract the generator.

In [7]:
# With LangChain v1 the legacy retrievers live in langchain-classic
from langchain_classic.retrievers.document_compressors import LLMChainExtractor
from langchain_classic.retrievers import ContextualCompressionRetriever


llm = Config().new_openai_like(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# Create a basic retriever from the vector store
base_retriever = vector_db.as_retriever(search_kwargs={"k": 3})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

compression_retriever.invoke("How do transformers work?")

[Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."),
 Document(metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 3}, page_content='GPT (Generative Pre-trained Transformer) models are autoregressive language models that use transformer-based neural networks.')]
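`LLMChainExtractor` delegates the extraction step to an LLM. To make the idea concrete without a model call, here is a rough stand-in that keeps only the sentences sharing at least one term with the query. The helper name `compress_document` and the keyword-overlap rule are assumptions for illustration only; this is not how `LLMChainExtractor` works internally, and it does no stop-word handling.

```python
import re


def compress_document(text: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences that share at least `min_overlap` terms with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        s
        for s in sentences
        if len(query_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


text = "GPT models are autoregressive transformers. The weather is nice today."
print(compress_document(text, "transformers"))
```

An LLM-based compressor can judge relevance semantically, which a keyword filter like this one cannot.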

##### Maximum marginal relevance

FAISS does not support Python 3.12, and ScaNN does not implement `max_marginal_relevance_search`, so this example uses Chroma.

In [35]:
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Initialize with an embedding model
embeddings = Config().new_openai_like_embeddings()

# Create the vector store
vector_store = Chroma.from_documents(documents, embeddings)

# For maximum marginal relevance search, adjust the parameters based on available documents
# Find relevant BUT diverse documents (reduce redundancy)
vector_store.max_marginal_relevance_search(
    query="What are transformer models?",
    k=5,  # Number of documents to return
    fetch_k=20,  # Number of documents to initially fetch
    lambda_mult=0.5,  # Diversity parameter (0 = max diversity, 1 = max relevance)
)

[Document(id='8de3bfa9-f133-4053-951b-0bcdaaea757f', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 1}, page_content="Transformer models were introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. The architecture relies on self-attention mechanisms rather than recurrent or convolutional neural networks. This design allows for more parallelization during training and better handling of long-range dependencies in text."),
 Document(id='52b4e8fa-49aa-4270-ac89-8b27e4227af3', metadata={'source': '/github.com/sammyne/generative-ai-with-lang-chain-2ed/chapter04/static/knowledge_base.json', 'seq_num': 3}, page_content='GPT (Generative Pre-trained Transformer) models are autoregressive language models that use transformer-based neural networks. Unlike BERT, which is bidirectional, GPT models are unidirectional and predict the next token based on previous tokens. The original GPT was introduce
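To see what `lambda_mult` trades off, here is a minimal pure-Python sketch of greedy MMR selection over toy 2-D vectors. The helpers `cosine` and `mmr_select` are hypothetical names, and the scoring rule (`lambda_mult * relevance - (1 - lambda_mult) * redundancy`) is the common textbook form; treat it as an illustration rather than Chroma's exact implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5) -> list[int]:
    """Greedily pick k document indices, balancing relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # → [0, 1] (pure relevance)
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # → [0, 2] (duplicate skipped)
```

With `lambda_mult=1.0` the near-duplicate is returned because only relevance counts; at `0.5` the less similar but more diverse document takes its place.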

#### Response enhancement: improving generator output

In [36]:
from langchain_core.documents import Document

# Example documents
documents = [
    Document(
        page_content="The transformer architecture was introduced in the paper 'Attention is All You Need' by Vaswani et al. in 2017.",
        metadata={"source": "Neural Network Review 2021", "page": 42},
    ),
    Document(
        page_content="BERT uses bidirectional training of the Transformer, masked language modeling, and next sentence prediction tasks.",
        metadata={"source": "Introduction to NLP", "page": 137},
    ),
    Document(
        page_content="GPT models are autoregressive transformers that predict the next token based on previous tokens.",
        metadata={"source": "Large Language Models Survey", "page": 89},
    ),
]

##### Source attribution

In [37]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import ScaNN


# Create a vector store and retriever
embeddings = Config().new_openai_like_embeddings()
vector_store = ScaNN.from_documents(documents, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Source attribution prompt template
attribution_prompt = ChatPromptTemplate.from_template(
    """
You are a precise AI assistant that provides well-sourced information.
Answer the following question based ONLY on the provided sources. For each fact or claim in your answer,
include a citation using [1], [2], etc. that refers to the source. Include a numbered reference list at the end.

Question: {question}

Sources:
{sources}

Your answer:
"""
)

In [38]:
from langchain_core.output_parsers import StrOutputParser


# Create a source-formatted string from documents
def format_sources_with_citations(docs):
    formatted_sources = []
    for i, doc in enumerate(docs, 1):
        source_info = f"[{i}] {doc.metadata.get('source', 'Unknown source')}"
        if doc.metadata.get("page"):
            source_info += f", page {doc.metadata['page']}"
        formatted_sources.append(f"{source_info}\n{doc.page_content}")
    return "\n\n".join(formatted_sources)


# Build the RAG chain with source attribution
def generate_attributed_response(question):
    # Retrieve relevant documents
    retrieved_docs = retriever.invoke(question)

    # Format sources with citation numbers
    sources_formatted = format_sources_with_citations(retrieved_docs)
    # print(sources_formatted)

    # Create the attribution chain using LCEL
    attribution_chain = (
        attribution_prompt | Config().new_openai_like(temperature=0) | StrOutputParser()
    )

    # Generate the response with citations
    response = attribution_chain.invoke(
        {"question": question, "sources": sources_formatted}
    )

    return response

In [40]:
# Example usage
question = "How do transformer models work and what are some examples?"
attributed_answer = generate_attributed_response(question)
print(attributed_answer)

Transformer models are a type of neural network architecture introduced in 2017 by Vaswani et al. in the paper "Attention is All You Need" [2]. They rely primarily on attention mechanisms to process input sequences, allowing them to weigh the importance of different words in a sequence when making predictions, rather than relying on sequential processing like recurrent networks [2]. 

One key characteristic of transformer models is their ability to handle long-range dependencies efficiently. GPT models are examples of autoregressive transformers that predict the next token based on previous tokens [1]. In contrast, BERT is a transformer-based model that uses bidirectional training, meaning it considers context from both left and right sides of a word simultaneously, and is trained using masked language modeling and next sentence prediction tasks [3].

Examples of transformer models include GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Tra
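Because the citation numbers are produced by the model, it can be worth checking that every `[n]` in the answer points at a source that was actually supplied. A minimal sketch, assuming the `[1]`, `[2]` citation format requested in the prompt above; `find_invalid_citations` is a hypothetical helper, not part of LangChain.

```python
import re


def find_invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited numbers that fall outside the range 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


sample = "Transformers were introduced in 2017 [1]. GPT is autoregressive [3][4]."
print(find_invalid_citations(sample, num_sources=3))  # → [4]
```

This only validates the numbering. It does not confirm that the cited source supports the claim.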

##### Self-consistency checking: ensuring factual accuracy

In [41]:
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def verify_response_accuracy(
    retrieved_docs: list[Document], generated_answer: str, llm: ChatOpenAI | None = None
) -> str:
    """
    Verify whether a generated answer is fully supported by the retrieved documents.

    Args:
        retrieved_docs: Documents used to generate the answer.
        generated_answer: The answer produced by the RAG system.
        llm: Language model used for verification; a default model is created when omitted.

    Returns:
        The model's analysis as a JSON-formatted string (not a parsed dictionary),
        listing each claim, its support status, and any identified issues.
    """
    if llm is None:
        llm = Config().new_openai_like(temperature=0)

    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Define the verification prompt; literal JSON braces are doubled ({{ }}) so the template does not treat them as variables
    verification_prompt = ChatPromptTemplate.from_template(
        """
    As a fact-checking assistant, verify whether the following answer is fully supported
    by the provided context. Identify any statements that are not supported or contradict the context.
    
    Context:
    {context}
    
    Answer to verify:
    {answer}
    
    Perform a detailed analysis with the following structure:
    1. List any factual claims in the answer
    2. For each claim, indicate whether it is:
       - Fully supported (provide the supporting text from context)
       - Partially supported (explain what parts lack support)
       - Contradicted (identify the contradiction)
       - Not mentioned in context
    3. Overall assessment: Is the answer fully grounded in the context?
    
    Return your analysis in JSON format with the following structure:
    {{
      "claims": [
        {{
          "claim": "The factual claim",
          "status": "fully_supported|partially_supported|contradicted|not_mentioned",
          "evidence": "Supporting or contradicting text from context",
          "explanation": "Your explanation"
        }}
      ],
      "fully_grounded": true|false,
      "issues_identified": ["List any specific issues"]
    }}
    """
    )

    # Create verification chain using LCEL
    verification_chain = verification_prompt | llm | StrOutputParser()

    # Run verification
    result = verification_chain.invoke({"context": context, "answer": generated_answer})

    return result
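`verify_response_accuracy` returns the model's raw text, so a caller that wants to branch on `fully_grounded` has to parse it first. A minimal sketch, assuming the model returns either bare JSON or JSON wrapped in a markdown code fence; `parse_verification_result` is a hypothetical helper, and it raises `json.JSONDecodeError` on anything else.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker that some models wrap JSON in


def parse_verification_result(raw: str) -> dict:
    """Parse the verifier's JSON output, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


raw = FENCE + 'json\n{"claims": [], "fully_grounded": false, "issues_identified": []}\n' + FENCE
report = parse_verification_result(raw)
print(report["fully_grounded"])  # → False
```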

In [42]:
# Example usage
retrieved_docs = [
    Document(
        page_content="The transformer architecture was introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017. It relies on self-attention mechanisms instead of recurrent or convolutional neural networks."
    ),
    Document(
        page_content="BERT is a transformer-based model developed by Google that uses masked language modeling and next sentence prediction as pre-training objectives."
    ),
]

generated_answer = "The transformer architecture was introduced by OpenAI in 2018 and uses recurrent neural networks. BERT is a transformer model developed by Google."

verification_result = verify_response_accuracy(retrieved_docs, generated_answer)
print(verification_result)

{
  "claims": [
    {
      "claim": "The transformer architecture was introduced by OpenAI in 2018",
      "status": "contradicted",
      "evidence": "The transformer architecture was introduced in the paper 'Attention Is All You Need' by Vaswani et al. in 2017.",
      "explanation": "The context clearly states that the transformer architecture was introduced in 2017 by Vaswani et al., not by OpenAI in 2018. This directly contradicts the claim."
    },
    {
      "claim": "The transformer architecture ... uses recurrent neural networks",
      "status": "contradicted",
      "evidence": "It relies on self-attention mechanisms instead of recurrent or convolutional neural networks.",
      "explanation": "The context explicitly states that transformers use self-attention mechanisms instead of recurrent neural networks, making this claim directly false."
    },
    {
      "claim": "BERT is a transformer model developed by Google",
      "status": "fully_supported",
      "evidence": 

#### Corrective RAG

In real-world applications, retrieval systems often return irrelevant, insufficient, or even misleading content.

Corrective Retrieval-Augmented Generation (CRAG) directly addresses this challenge by introducing explicit evaluation and correction mechanisms into the RAG pipeline.

A complete code example is missing.
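The evaluate-then-correct control flow can still be sketched in a few lines. Everything here is an assumption for illustration: the function names, the `upper`/`lower` thresholds, the toy word-overlap grader, and the stub retriever and fallback. The three-way outcome loosely mirrors the correct / incorrect / ambiguous actions described for CRAG. In practice the grader would be an LLM or a trained evaluator, and the fallback something like a web search.

```python
from typing import Callable


def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], float],
    fallback: Callable[[str], list[str]],
    upper: float = 0.7,
    lower: float = 0.3,
) -> tuple[str, list[str]]:
    """Grade retrieved documents and correct the context when they look weak."""
    docs = retrieve(question)
    scores = [grade(question, d) for d in docs]
    kept = [d for d, s in zip(docs, scores) if s >= lower]
    if scores and max(scores) >= upper:
        return "correct", kept  # retrieval is trusted; drop low-scoring docs
    if not kept:
        return "incorrect", fallback(question)  # nothing usable; replace the context
    return "ambiguous", kept + fallback(question)  # unsure; combine both sources


def overlap_grade(question: str, doc: str) -> float:
    """Toy grader: fraction of question words that also appear in the document."""
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def toy_retrieve(question: str) -> list[str]:
    return ["transformers use self-attention", "cats sleep a lot"]


def toy_fallback(question: str) -> list[str]:
    return [f"external search result for: {question}"]


print(corrective_retrieve("how do transformers use self-attention",
                          toy_retrieve, overlap_grade, toy_fallback))
print(corrective_retrieve("weather tomorrow",
                          toy_retrieve, overlap_grade, toy_fallback))
```

The first call lands in the ambiguous branch (best score 0.6, below `upper`), and the second falls back entirely because no document reaches `lower`.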

#### Agentic RAG

CRAG primarily enhances data quality through evaluation and correction, while agentic RAG focuses on process intelligence through autonomous planning and orchestration.

#### Choosing the right techniques

## Developing a corporate documentation chatbot

See the `src/chapter04/developing-a-corporate-documentation-chatbot` directory for the source code.

In [43]:
%uv pip install langgraph~=1.0 streamlit~=1.50


In [44]:
%uv run src/chapter04/developing-a-corporate-documentation-chatbot/rag.py

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
  _warn_about_sha1_encoder()
[]
INFO:httpx:HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"
The square root of 10 is approximately 3.1623.

None of the provided corporate document snippets are relevant to this mathematical question.
INFO:httpx:HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"
no issues detected
The square root of 10 is approximately 3.1623.

None of the provided corporate document snippets are relevant to this mathematical question.
Note: you may need to restart the kernel to use updated packages.


In [23]:
!.venv/bin/streamlit run src/chapter04/developing-a-corporate-documentation-chatbot/streamlit_app.py

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.17.0.4:8501
  External URL: http://43.132.141.4:8501

  Stopping...
^C


## Troubleshooting RAG systems