### 1. Load the data

In [1]:
!pip install chromadb langchain-community langchain-text-splitters beautifulsoup4



In [2]:
import chromadb

In [3]:
from langchain_community.document_loaders import WebBaseLoader

In [4]:
import bs4

In [5]:
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

In [41]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)

In [42]:
docs = loader.load()

### 2. Split the data

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

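What are chunk_size and chunk_overlap?
When preparing text for a large language model such as GPT-3, chunk_size and chunk_overlap are the parameters that control how a long text is split:

chunk_size: the maximum length of each chunk. Large language models usually cap the input length; the original GPT-3, for example, accepts at most 2048 tokens. Longer text has to be split into several chunks, and chunk_size bounds the length of each one.
chunk_overlap: the amount of text shared by two adjacent chunks. Overlapping adjacent chunks helps preserve context across chunk boundaries, and chunk_overlap sets the size of that shared region.
Note that RecursiveCharacterTextSplitter measures both values in characters by default, not tokens, and it prefers to cut at separators such as paragraph and line breaks, so real chunks are usually somewhat shorter than chunk_size. The token-based example here uses fixed-size windows only to illustrate the idea.
For example, with chunk_size set to 1024 and chunk_overlap set to 128, a sequence of 2560 tokens is split into 3 chunks:

chunk 1: tokens 1-1024

chunk 2: tokens 897-1920 (overlaps chunk 1 by 128)

chunk 3: tokens 1793-2560 (overlaps chunk 2 by 128)

Splitting this way respects the maximum length while keeping adjacent chunks semantically connected. A well-chosen chunk size and overlap help a large language model handle long text more coherently.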
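The window arithmetic in the example above can be checked with a short sketch. It assumes plain fixed-size sliding windows, which is a simplification: RecursiveCharacterTextSplitter cuts at separators and will generally produce different boundaries. The helper name `window_bounds` is hypothetical and is not part of LangChain.

```python
def window_bounds(total_len, chunk_size, chunk_overlap):
    """Return 1-based inclusive (start, end) bounds of overlapping fixed-size windows.

    Illustrative only: a plain sliding window, not LangChain's splitting logic.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if total_len <= 0:
        return []

    step = chunk_size - chunk_overlap  # how far each new window advances
    bounds = []
    start = 0  # 0-based offset of the current window
    while True:
        end = min(start + chunk_size, total_len)  # exclusive end, clipped to the text
        bounds.append((start + 1, end))  # convert to 1-based inclusive
        if end == total_len:
            break
        start += step
    return bounds


for number, (first, last) in enumerate(window_bounds(2560, 1024, 128), start=1):
    print(f"chunk {number}: {first}-{last}")
```

Each window advances by chunk_size - chunk_overlap = 896 positions, which is why the second chunk starts at position 897 and the third at 1793.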

In [44]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

In [45]:
all_splits = text_splitter.split_documents(docs)

In [46]:
all_splits

[]

In [12]:
len(all_splits)

66

In [13]:
len(all_splits[0].page_content)

969

In [14]:
all_splits[0].page_content[:50]

'LLM Powered Autonomous Agents\n    \nDate: June 23, '

In [15]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}

In [16]:
all_splits[0].__dict__

{'page_content': 'LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemo

### 3. Vector store

In [17]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

In [21]:
embeddings = OllamaEmbeddings(model='nomic-embed-text')

In [22]:
vectorstore = Chroma.from_documents(all_splits, embeddings, collection_name="serverless_guide")

In [23]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x2208d62f910>

In [49]:
retrieved_docs = vectorstore.similarity_search('Tree of Thoughts')
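
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it (LangChain's `similarity_search` returns 4 results by default). The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. Everything in it is an assumption for illustration: the toy vectors, the `top_k` helper, and the choice of cosine similarity. The metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the texts of the k stored items most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs; real embeddings have hundreds of dimensions.
toy_store = [
    ("chunk about Tree of Thoughts", [0.9, 0.1, 0.0]),
    ("chunk about MIPS algorithms", [0.1, 0.9, 0.0]),
    ("chunk about tool use", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded query

results = top_k(query_vec, toy_store, k=2)
print(results)
```

A vector store does the same ranking at scale, typically with an approximate nearest-neighbor index instead of a full sort.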

In [50]:
for doc in retrieved_docs:
    print('*' * 30)
    print(doc.page_content)

******************************
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
******************************
Fig. 9. Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020)
Check more MIPS algorithms and performance comparison in ann-benchmarks.com.
Component Three: Tool Use#
Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify