# Retrieval Module: Hands-On Code

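Document loaders

Load documents from many different sources. LangChain provides over 100 document loaders, as well as integrations with other major providers such as AirByte and Unstructured. We provide integrations to load all types of documents (HTML, PDF, code) from all kinds of locations (private S3 buckets, public websites).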

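Document transformers

A key part of retrieval is fetching only the relevant parts of a document. This involves several transformation steps that prepare the document for retrieval. One of the main steps is splitting (or chunking) a large document into smaller pieces. LangChain provides several algorithms for this, along with logic optimized for specific document types (code, Markdown, and so on).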

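Text embedding models

Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of text, which lets you quickly and efficiently find other text that is similar. LangChain integrates with over 25 embedding providers and methods, from open source to proprietary APIs, so you can choose the one that best fits your needs. LangChain provides a standard interface, which makes it easy to swap between models.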

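Vector stores

With the rise of embeddings came the need for databases that support efficient storage and search of those embeddings. LangChain integrates with over 50 vector stores, from open-source local stores to cloud-hosted proprietary ones, so you can choose the one that best fits your needs. LangChain provides a standard interface, which makes it easy to swap between vector stores.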

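Retrievers

Once the data is in the database, you still need to retrieve it. LangChain supports many different retrieval algorithms, and this is one of the areas where we add the most value. We support basic methods that are easy to get started with, such as simple semantic search. We have also added a collection of algorithms that improve performance. These include:

Parent Document Retriever: lets you create multiple embeddings per parent document, so that you can look up smaller chunks but return larger context.

Self-Query Retriever: user questions often refer to something that is not purely semantic, but instead expresses logic that is best represented as a metadata filter. Self-querying lets you separate the semantic part of a query from the metadata filters contained in it.

Ensemble Retriever: sometimes you want to retrieve documents from several sources or with several algorithms. The Ensemble Retriever makes this easy.

And more!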
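
The parent document idea can be sketched without any library. This is a minimal illustration rather than LangChain's implementation: the `parents` dictionary, the sentence-level chunking, and the substring match (standing in for embedding similarity) are all assumptions made for the example.

```python
# Hypothetical parent documents, keyed by an id.
parents = {
    "doc-1": "Cats are small domesticated carnivores. They purr when content.",
    "doc-2": "Dogs are loyal companions. They bark to alert their owners.",
}

# Index small chunks (here: sentences); each chunk remembers its parent id.
child_index = []
for parent_id, text in parents.items():
    for sentence in text.split(". "):
        child_index.append((sentence.strip(". "), parent_id))


def retrieve_parent(query_word: str) -> list[str]:
    """Match against small chunks, but return the full parent documents."""
    hits = []
    for chunk, parent_id in child_index:
        # Substring match is a stand-in for embedding similarity.
        if query_word.lower() in chunk.lower() and parent_id not in hits:
            hits.append(parent_id)
    return [parents[pid] for pid in hits]


result = retrieve_parent("purr")
print(result)
```

The match happens on a single sentence, yet the caller receives the whole parent text, which is the trade-off the Parent Document Retriever is described as making.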

## 1. Document loaders

### 1.1 CSVLoader

In [1]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="data/data1.csv", encoding="utf-8")

data = loader.load()

print(data)

[Document(page_content='Los Angeles: New York City\n34°03′N: 40°42′46″N\n118°15′W: 74°00′21″W', metadata={'source': 'data/data1.csv', 'row': 0}), Document(page_content='Los Angeles: Paris\n34°03′N: 48°51′24″N\n118°15′W: 2°21′03″E', metadata={'source': 'data/data1.csv', 'row': 1})]
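
The keys in the output above (`Los Angeles`, `34°03′N`, `118°15′W`) suggest that the first row of `data1.csv` was read as the header. The sketch below reproduces that row-to-document shape with the standard library only. It is an approximation of the format seen in the output, not the loader's actual code, and the CSV content and the `rows_to_documents` helper are hypothetical.

```python
import csv
import io

# Hypothetical CSV text; the first row supplies the column names.
csv_text = "city,country\nParis,France\nTokyo,Japan\n"


def rows_to_documents(text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" line per field, joined by newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


documents = rows_to_documents(csv_text, source="example.csv")
print(documents[0])
```

If a file has no header row, the first data row ends up as the keys, which matches what the output above looks like.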


### 1.2 TextLoader

In [2]:
from langchain_community.document_loaders import TextLoader

docs = TextLoader('消失的她.txt', encoding="utf-8").load()
docs

[Document(page_content='《消失的她》是一部充满悬疑和心理刺激的电影，讲述了丈夫何非的妻子李木子在结婚周年旅行中神秘失踪的故事。随后，出现了一个陌生女人冒充李木子，引发了一系列扑朔迷离的事件。以下是对该电影的详细解读：\n\n\n剧情介绍与发展：\n\n电影一开始，丈夫何非报警称妻子李木子在国外旅行中突然失踪。然而，当警方介入调查后，情况变得异常复杂。一方面，陌生女人自称是李木子，并提供了身份证明和其他证据。另一方面，何非坚持认为这个女人并非自己的妻子。情节中呈现了众多的矛盾和谜团，让观众猜测真相。\n\n主要角色及其设定：\n\n何非（朱一龙饰）：失踪的妻子李木子的丈夫，他在失去妻子后竭尽全力寻找她。但他的神秘过去和心理问题让人产生怀疑，观众对他的真实性格和动机产生质疑。\n\n李木子（文咏珊饰）：消失的女主角，她在故事初期被认为是何非的妻子。然而，随着故事发展，她的真实身份和行为逐渐显露，让观众对她产生疑虑。\n\n陈麦（倪妮饰）：一名金牌律师，卷入了这起失踪案件中。她对案件产生浓厚兴趣，主动接手此案，并不断解密真相。她的出现为故事注入了更多的悬疑元素。\n\n林梅（黄子琪饰）：陈麦安排的替身女人，她是受害者之一。她与陈麦合作，试图揭示真相并向何非复仇。\n\n悬疑和心理刺激：\n\n电影的悬疑元素令人着迷。观众在故事中跟随何非、陈麦和李木子之间错综复杂的关系，推理和猜测真相。每个角色都有着不为人知的秘密和动机，令观众陷入扑朔迷离的氛围。随着剧情的发展，一层层真相被揭示，每一个细节都引发观众的疑虑和惊讶。\n\n同时，电影还探讨了人性、道德和正义等主题。何非的精神问题、李木子的复杂人格、陈麦的正义感，以及林梅等受害者的复仇心理，都使得故事更加丰富和引人深思。\n\n主题和意义：\n\n《消失的她》除了讲述一起失踪案件的悬疑故事，也强调了女性之间的团结与互助。陈麦和林梅是受害者的代表，她们联合其他受害者，通过巧妙的计划揭露何非的罪行，最终让他受到应有的惩罚。这展现了女性的力量和智慧，体现了女性在面对困境时的勇气和决心。\n\n同时，电影也警示人们珍惜生命中的幸福和美好时刻，对自己的行为负责，并追求自己的梦想。正义和尊严是电影所强调的核心价值观。\n\n综上所述，《消失的她》是一部充满悬疑和反转的电影，通过精心构建的剧情、角色设定和悬念设置，吸

In [3]:
with open('消失的她.txt', encoding="utf-8") as f:
    document = f.read()

In [4]:
print(document)

《消失的她》是一部充满悬疑和心理刺激的电影，讲述了丈夫何非的妻子李木子在结婚周年旅行中神秘失踪的故事。随后，出现了一个陌生女人冒充李木子，引发了一系列扑朔迷离的事件。以下是对该电影的详细解读：


剧情介绍与发展：

电影一开始，丈夫何非报警称妻子李木子在国外旅行中突然失踪。然而，当警方介入调查后，情况变得异常复杂。一方面，陌生女人自称是李木子，并提供了身份证明和其他证据。另一方面，何非坚持认为这个女人并非自己的妻子。情节中呈现了众多的矛盾和谜团，让观众猜测真相。

主要角色及其设定：

何非（朱一龙饰）：失踪的妻子李木子的丈夫，他在失去妻子后竭尽全力寻找她。但他的神秘过去和心理问题让人产生怀疑，观众对他的真实性格和动机产生质疑。

李木子（文咏珊饰）：消失的女主角，她在故事初期被认为是何非的妻子。然而，随着故事发展，她的真实身份和行为逐渐显露，让观众对她产生疑虑。

陈麦（倪妮饰）：一名金牌律师，卷入了这起失踪案件中。她对案件产生浓厚兴趣，主动接手此案，并不断解密真相。她的出现为故事注入了更多的悬疑元素。

林梅（黄子琪饰）：陈麦安排的替身女人，她是受害者之一。她与陈麦合作，试图揭示真相并向何非复仇。

悬疑和心理刺激：

电影的悬疑元素令人着迷。观众在故事中跟随何非、陈麦和李木子之间错综复杂的关系，推理和猜测真相。每个角色都有着不为人知的秘密和动机，令观众陷入扑朔迷离的氛围。随着剧情的发展，一层层真相被揭示，每一个细节都引发观众的疑虑和惊讶。

同时，电影还探讨了人性、道德和正义等主题。何非的精神问题、李木子的复杂人格、陈麦的正义感，以及林梅等受害者的复仇心理，都使得故事更加丰富和引人深思。

主题和意义：

《消失的她》除了讲述一起失踪案件的悬疑故事，也强调了女性之间的团结与互助。陈麦和林梅是受害者的代表，她们联合其他受害者，通过巧妙的计划揭露何非的罪行，最终让他受到应有的惩罚。这展现了女性的力量和智慧，体现了女性在面对困境时的勇气和决心。

同时，电影也警示人们珍惜生命中的幸福和美好时刻，对自己的行为负责，并追求自己的梦想。正义和尊严是电影所强调的核心价值观。

综上所述，《消失的她》是一部充满悬疑和反转的电影，通过精心构建的剧情、角色设定和悬念设置，吸引观众的注意力并让他们对真相充满好奇。影片的主题也深刻探讨了女性的力量、正义和珍惜生命的价值观。这些元素共

### 1.3 MarkdownLoader

# !pip install markdown unstructured

In [5]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader
markdown_path = "data/README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()

In [11]:
data

[Document(page_content="✨ Navigate at cookbook.openai.com\n\nExample code and guides for accomplishing common tasks with the OpenAI API. To run these examples, you'll need an OpenAI account and associated API key (create a free account here). Set an environment variable called OPENAI_API_KEY with your API key. Alternatively, in most IDEs such as Visual Studio Code, you can create an .env file at the root of your repo containing OPENAI_API_KEY=<your API key>, which will be picked up by the notebooks.\n\nMost code examples are written in Python, though the concepts can be applied in any language.\n\nFor other useful tools, guides and courses, check out these related resources from around the web.\n\nContributing\n\nThe OpenAI Cookbook is a community-driven resource. Whether you're submitting an idea, fixing a typo, adding a new guide, or improving an existing one, your contributions are greatly appreciated!\n\nBefore contributing, read through the existing issues and pull requests to see

### 1.4 PDFLoader

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("data/test.pdf")

In [None]:
data = loader.load()

In [None]:
data

## 2. Document transformers

### 2.1 Recursively split by character

In [48]:
with open("data/test.txt") as f:
    testTXT = f.read()

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,         # chunk size: maximum number of characters per chunk
    chunk_overlap=10,      # chunk overlap: characters shared by two adjacent chunks
    length_function=len,   # function used to measure text length
    add_start_index=True,  # record each chunk's start index in its metadata
)

In [49]:
texts = text_splitter.create_documents([testTXT])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam' metadata={'start_index': 0}
page_content='Madam Vice' metadata={'start_index': 15}


In [50]:
print(texts)

[Document(page_content='Madam Speaker, Madam', metadata={'start_index': 0}), Document(page_content='Madam Vice', metadata={'start_index': 15}), Document(page_content='Vice President,our', metadata={'start_index': 21}), Document(page_content='First Lady and', metadata={'start_index': 40}), Document(page_content='Lady and Second', metadata={'start_index': 46}), Document(page_content='Second Gentleman.', metadata={'start_index': 55}), Document(page_content='Members of Congress', metadata={'start_index': 74}), Document(page_content='Congress and the', metadata={'start_index': 85}), Document(page_content='and the Cabinet.', metadata={'start_index': 94}), Document(page_content='Justices of the', metadata={'start_index': 112}), Document(page_content='of the Supreme', metadata={'start_index': 121}), Document(page_content='Supreme Court. My', metadata={'start_index': 128}), Document(page_content='Court. My fellow', metadata={'start_index': 136}), Document(page_content='fellow Americans.', metad
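
To make `chunk_size` and `chunk_overlap` concrete, here is a naive fixed-window splitter. It is a simplified sketch and not how `RecursiveCharacterTextSplitter` works: the real splitter prefers to break on separators such as spaces and newlines (which is why the chunks above end on word boundaries), whereas this version cuts at exact character offsets. The function name `chunk_text` is made up for the example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.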

### 2.2 Split Code

In [9]:
from langchain_text_splitters import Language

# Full list of supported programming languages
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol',
 'c',
 'lua',
 'perl']

In [10]:
python_text = """

 def print_multiplication_table():
    for i in range(1, 10):
        for j in range(1, i+1):
            print(f'{j} * {i} = {i*j}\t', end='')
        print()

print_multiplication_table()

"""

In [11]:
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_text])
python_docs

[Document(page_content='def print_multiplication_table():'),
 Document(page_content='for i in range(1, 10):\n        for j in range(1, i+1):'),
 Document(page_content="print(f'{j} * {i} = {i*j}\t', end='')"),
 Document(page_content='print()'),
 Document(page_content='print_multiplication_table()')]

### 2.3 Split by character

In [36]:
with open("data/test.txt") as f:
    state_of_the_union = f.read()

In [37]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
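    separator="\n\n",          # split on two consecutive newlines
    chunk_size=20,             # target maximum size of each chunk: 20 characters
    chunk_overlap=10,          # overlap between adjacent chunks: 10 characters
    length_function=len,       # measure length with the built-in len
    is_separator_regex=False,  # treat the separator as a literal string, not a regex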
)

In [38]:
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

Created a chunk of size 72, which is longer than the specified 20
Created a chunk of size 36, which is longer than the specified 20
Created a chunk of size 51, which is longer than the specified 20


page_content='Madam Speaker, Madam Vice President,our First Lady and Second Gentleman.'
page_content='Members of Congress and the Cabinet.'


## 3. Text embedding models

In [13]:
from langchain_openai import OpenAIEmbeddings
#from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()


In [14]:
# Convert a list of texts into embedding vectors and store them in `embeddings`
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

(5, 1536)

In [15]:
# Use embed_query to convert the query text into an embedding vector, stored in `embedded_query`
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[0.005384807424727807,
 -0.0005522561790177147,
 0.03896066510130955,
 -0.002939867294003909,
 -0.008987877434176603]
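
Embedding vectors are usually compared with a similarity measure such as cosine similarity. The sketch below computes it by hand on two-dimensional toy vectors. The vectors are made up for illustration and are not real model output, which has 1536 dimensions in the run above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embeddings.
query_vec = [1.0, 0.0]
doc_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)
```

Vector length does not affect the score, only direction: the "same direction" vector scores 1 even though it is twice as long as the query.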

## 4. Vector store

# pip install chromadb

In [16]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the long text
raw_documents = TextLoader('消失的她.txt', encoding="utf-8").load()

In [17]:
# Instantiate the text splitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)

In [18]:
# Split the text
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 125, which is longer than the specified 100
Created a chunk of size 112, which is longer than the specified 100
Created a chunk of size 118, which is longer than the specified 100


In [19]:
# Embed the split text with the OpenAI embedding model and store the vectors in Chroma
db = Chroma.from_documents(documents, OpenAIEmbeddings())

In [20]:
query = "消失的她这部电影的角色有哪些？"
docs = db.similarity_search(query)
print(docs[0].page_content)

《消失的她》是一部充满悬疑和心理刺激的电影，讲述了丈夫何非的妻子李木子在结婚周年旅行中神秘失踪的故事。随后，出现了一个陌生女人冒充李木子，引发了一系列扑朔迷离的事件。以下是对该电影的详细解读：


In [21]:

embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

《消失的她》是一部充满悬疑和心理刺激的电影，讲述了丈夫何非的妻子李木子在结婚周年旅行中神秘失踪的故事。随后，出现了一个陌生女人冒充李木子，引发了一系列扑朔迷离的事件。以下是对该电影的详细解读：
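
Conceptually, `similarity_search_by_vector` ranks stored vectors by their distance to the query vector. The toy class below does this by brute force with squared Euclidean distance. It is a sketch under stated assumptions: `TinyVectorStore`, the hand-written vectors, and the choice of distance metric are illustrative only, and real vector stores typically use specialised indexes rather than a full scan.

```python
class TinyVectorStore:
    """A brute-force, in-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, vector))

    def similarity_search_by_vector(self, query: list[float], k: int = 4) -> list[str]:
        def squared_distance(vector: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, vector))

        # Smallest distance first, then keep the k nearest texts.
        ranked = sorted(self._items, key=lambda item: squared_distance(item[1]))
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
store.add("plot summary", [0.9, 0.1])
store.add("cast list", [0.1, 0.9])
store.add("themes", [0.5, 0.5])

top = store.similarity_search_by_vector([0.2, 0.8], k=2)
print(top)
# ['cast list', 'themes']
```

Searching by text, as in `similarity_search(query)`, amounts to embedding the query first and then running this vector search, which is why both cells above return the same document.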


## 5. Retriever

### 5.1 Ensemble Retriever

In [None]:
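# The Ensemble Retriever takes a list of retrievers as input, combines the results of their get_relevant_documents() methods, and reranks them with the Reciprocal Rank Fusion algorithm. By drawing on the strengths of different algorithms, it can perform better than any single one.

# The most common pattern is to combine a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary. This is also known as "hybrid search". Sparse retrievers are good at finding relevant documents by keyword, while dense retrievers are good at finding them by semantic similarity.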

In [22]:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


In [24]:
doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
# Sparse retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

# Dense retriever
embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(doc_list, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)



In [25]:
# Retrieve documents
docs = ensemble_retriever.get_relevant_documents("apples")
docs

[Document(page_content='I like apples'),
 Document(page_content='Apples and oranges are fruits')]
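
Reciprocal Rank Fusion can be written in a few lines. The sketch below is a generic version of the algorithm, not a copy of `EnsembleRetriever`'s internals: each document receives `weight / (rank + c)` from every ranking it appears in, and the totals decide the final order. The constant `c = 60` is a commonly used default and is an assumption here, as is the dense ranking, which is invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights, c=60):
    """Fuse ranked lists: each hit contributes weight / (rank + c)."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings for the query "apples".
bm25_ranking = ["I like apples", "Apples and oranges are fruits"]
dense_ranking = ["I like apples", "I like oranges", "Apples and oranges are fruits"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
# ['I like apples', 'Apples and oranges are fruits', 'I like oranges']
```

A document ranked moderately well by both retrievers can overtake one ranked highly by only one of them, which is the effect hybrid search relies on.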

### 5.2 Self-querying

In [1]:
# Install dependencies: pip install --upgrade lark chromadb
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [2]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

In [3]:
retriever.invoke("I want to watch a movie rated higher than 8.5")

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]

In [4]:
retriever.invoke("Has Greta Gerwig directed any movies about women")

[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]

In [None]:
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

In [None]:
### Using the self-query retriever to specify k, the number of documents to fetch

In [5]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
)

# This example only specifies a relevant query
retriever.invoke("What are two movies about dinosaurs")

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]
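
Behind the scenes, a self-query retriever has the LLM turn the question into a structured query: a semantic part, an optional metadata filter, and (with `enable_limit=True`) an optional limit. The sketch below shows only the filter-and-limit step applied to plain dictionaries. The `apply_filter` helper, the comparator names, and the shortened summaries are assumptions for illustration, not the library's internal representation.

```python
import operator

# Hypothetical comparator names mapped to Python operators.
COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
}


def apply_filter(docs, attribute, comparator, value, limit=None):
    """Keep documents whose metadata satisfies the comparison, up to `limit`."""
    compare = COMPARATORS[comparator]
    matches = [
        d for d in docs
        if attribute in d["metadata"] and compare(d["metadata"][attribute], value)
    ]
    return matches if limit is None else matches[:limit]


docs = [
    {"page_content": "Dinosaurs return", "metadata": {"year": 1993, "rating": 7.7}},
    {"page_content": "Dreams within dreams", "metadata": {"year": 2006, "rating": 8.6}},
    {"page_content": "Toys come alive", "metadata": {"year": 1995}},
    {"page_content": "Three men and the Zone", "metadata": {"year": 1979, "rating": 9.9}},
]

# "rated higher than 8.5" expressed as a metadata filter.
high_rated = apply_filter(docs, "rating", "gt", 8.5)
print([d["page_content"] for d in high_rated])
# ['Dreams within dreams', 'Three men and the Zone']
```

Documents without the filtered attribute, such as the one with no `rating`, are simply excluded in this sketch.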

### 5.3 MultiQueryRetriever

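MultiQueryRetriever automates prompt tuning by using an LLM to generate multiple queries, from different perspectives, for a given user input query. For each query it retrieves a set of relevant documents, then takes the unique union across all queries to obtain a larger set of potentially relevant documents.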
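
The "unique union" step can be illustrated on its own. In this sketch the per-query results are hard-coded strings standing in for retrieved documents, and `unique_union` is a hypothetical helper name. It merges the lists while dropping duplicates and keeping first-seen order.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping each document once in first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical results for three rephrasings of the same question.
results_per_query = [
    ["doc A", "doc B"],
    ["doc B", "doc C"],
    ["doc A", "doc D"],
]

merged = unique_union(results_per_query)
print(merged)
# ['doc A', 'doc B', 'doc C', 'doc D']
```

Three queries returned six hits, but only four distinct documents survive, which is why `unique_docs` can be shorter than the sum of the individual result sets.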

In [52]:
# Build a sample vectorDB
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)


from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

# Set logging for the queries (needed to see the "Generated queries" line in the output)
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
question = "What are the approaches to Task Decomposition?"
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
#len(unique_docs)
unique_docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be achieved through different methods?', '2. What strategies are commonly used for breaking down tasks into smaller components?', '3. What are the various techniques employed for Task Decomposition in practice?']


[Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log"}),
 Document(page_content='Fig. 1. Overview of a LLM-powered auto

In [None]:
### Custom prompt and output parser

In [56]:
from typing import List

from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from a vector
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search.
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

# Chain
llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

# Other inputs
question = "What are the approaches to Task Decomposition?"

In [57]:
# Run
retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

# Results
unique_docs = retriever.get_relevant_documents(query=question)
len(unique_docs)

OutputParserException: Failed to parse LineList from completion 1. Got: 1 validation error for LineList
  Input should be a valid dictionary or instance of LineList [type=model_type, input_value=1, input_type=int]
    For further information visit https://errors.pydantic.dev/2.6/v/model_type
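
The message shows the parser trying to validate the value `1` as a `LineList`. One plausible reading, inferred from the traceback rather than confirmed by the source, is that the completion (which starts with `1. ...`) went through JSON parsing instead of the overridden `parse` method. The line-splitting logic itself is simple, and the sketch below shows it as a plain function that also drops blank lines. `parse_lines` and the sample completion are hypothetical. One option, depending on the installed LangChain version, is to put this logic in a parser that does no JSON parsing, for example a subclass of `BaseOutputParser` from `langchain_core.output_parsers`.

```python
def parse_lines(text: str) -> list[str]:
    """Split an LLM completion into non-empty, stripped lines."""
    return [line.strip() for line in text.strip().split("\n") if line.strip()]


# Hypothetical completion in the numbered style seen in the log output above.
completion = "1. How can tasks be decomposed?\n\n2. Which strategies split tasks into steps?\n"

queries = parse_lines(completion)
print(queries)
# ['1. How can tasks be decomposed?', '2. Which strategies split tasks into steps?']
```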