# I. Text Embedding Models

## 1. Overview of Embedding Models

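- Text Embedding Models encode text as vectors, that is, they vectorize documents.
     Both documents being written to the store and user queries are embedded (vectorized) before they are matched against each other.
- LangChain provides integrations with more than 25 embedding providers and methods, ranging from open-source models to proprietary APIs.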
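- Open-source communities such as Hugging Face offer text embedding models (for example, BGE) that can outperform closed-source embedding models served through an API. These models also have relatively few parameters, so they can run on a CPU.
- For that reason, open-source text embedding models are recommended here when developing RAG applications. An open-source model can also be fine-tuned on data collected from the target application scenario to improve its quality.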
- LangChain's wrapper for embedding models exposes two interfaces: one for embedding documents (embed_documents) and one for embedding a single sentence (embed_query).
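
To make the shape of these two interfaces concrete, here is a minimal, offline sketch. `ToyHashEmbeddings` is a hypothetical stand-in written only for illustration; it is not a LangChain class and its hashed vectors carry no semantic meaning. It only mirrors the calling convention: embed_documents takes a list of strings and returns one vector per string, while embed_query takes one string and returns one vector.

```python
import hashlib
import math


class ToyHashEmbeddings:
    """Hypothetical stand-in that mirrors the two-method embedding interface."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        # Hash each whitespace-separated token into one of `dim` buckets.
        vector = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[digest[0] % self.dim] += 1.0
        # L2-normalise so that all non-empty texts map to unit-length vectors.
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else vector

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts; returns one vector per input string."""
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        """Embed a single text; returns one vector."""
        return self._embed(text)


toy_model = ToyHashEmbeddings(dim=8)
doc_vectors = toy_model.embed_documents(["Hi there!", "Oh, hello!"])
query_vector = toy_model.embed_query("Hi there!")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))  # → 2 8 8
```

In this toy version both methods share the same encoder, so the same text gets the same vector either way. A real provider is free to treat queries and documents differently, which is one reason the interface keeps the two methods separate.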

## 2. Sentence and Document Embedding

In [4]:
from langchain_openai import OpenAIEmbeddings
import dotenv

# load_dotenv() reads the .env file and puts its variables
# (here OPENAI_API_KEY and OPENAI_BASE_URL) into os.environ,
# so they do not need to be re-assigned by hand.
dotenv.load_dotenv()

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a single sentence with embed_query
text = "Text Embedding Models:文档嵌入模型，提供将文本编码为向量的能力,即文档向量化,文档写入和 用户查询匹配 前都会先执行文档嵌入编码,即向量化"
text_vector = embedding_model.embed_query(text)

print(len(text_vector))  # dimensionality of the embedding vector
print(f"text_vector: {text_vector[:100]}")

# Texts to embed with embed_documents, which takes a list of strings.
texts = [
    "Hi there!",
    "Oh, hello!",
    "What hobbies do you have? ",
    "I like swimming, programming, and writing!"
]
# Generate the embedding vectors
docs_vector = embedding_model.embed_documents(texts)

# Print each text next to the first 5 components of its vector
for doc_text, vector in zip(texts, docs_vector):
    print(f"{doc_text} => {vector[:5]}", end="\n\n")

1536
text_vector: [-0.029042674228549004, 0.06226688623428345, -0.005592805799096823, -0.025355318561196327, 0.007365207187831402, -0.00924689881503582, 0.01829422451555729, 0.018246706575155258, -0.003281080862507224, -0.023796746507287025, 0.034155555069446564, 0.03459271416068077, -0.021762998774647713, -0.0022891538683325052, 0.004088877700269222, 0.016583595424890518, 0.03512490913271904, -0.0029532103799283504, 0.0008398711797781289, 0.013342903926968575, 0.001646480173803866, 0.025317305698990822, -0.014245736412703991, 0.002846296178176999, 0.04447634890675545, -0.017353378236293793, -0.005169900134205818, 0.05987200513482094, 0.03767184540629387, -0.08401087671518326, -0.04006672650575638, -0.0108339823782444, 0.008196762762963772, -0.018911950290203094, 0.009845619089901447, 0.022732354700565338, -0.010643912479281425, 0.031038407236337662, -0.0025611913297325373, -0.0027797718066722155, -0.00574486143887043, 0.05090070888400078, -0.01094802375882864, 0.03812801465392113, 0.0
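
Vectors like these become useful once a query vector is compared with document vectors. The source does not name a similarity measure, so the sketch below assumes cosine similarity, a common choice. The helpers `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are made up to stand in for the results of embed_query and embed_documents.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], candidates: list[list[float]]
) -> list[tuple[int, float]]:
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # → [1, 2, 0]
```

Cosine similarity ignores vector length, which is why the candidate `[2.0, 0.0, 0.0]` scores a perfect match against the query even though its magnitude differs.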

- Load a CSV file and split it into text chunks:

In [7]:
from langchain_community.document_loaders import CSVLoader

# Case 1:
loader = CSVLoader("./document/load/03-load.csv", encoding="utf-8")
# Load and split the documents
docs = loader.load_and_split()
# print(len(docs))
# Holds the embedding of each chunk
embeded_docs = embedding_model.embed_documents([doc.page_content for doc in docs])
print(len(embeded_docs))
# Dimensionality of each chunk's embedding
print(len(embeded_docs[0]))

# Print each chunk next to the first 10 components of its vector
for doc, vector in zip(docs, embeded_docs):
    print(f"{doc} => {vector[:10]}", end="\n\n")


4
1536
page_content='id: 1
title: Introduction to Python
content: Python is a popular programming language.
author: Guido van Rossum' metadata={'source': './document/load/03-load.csv', 'row': 0} => [0.02165742963552475, -0.044537920504808426, 0.022788185626268387, -0.002097090007737279, -0.007367218378931284, -0.054230112582445145, -0.029330413788557053, 0.05589162930846214, -0.014088290743529797, 0.04375331476330757]

page_content='id: 2
title: Data Science Basics
content: Data science involves statistics and machine learning.
author: Jane Smith' metadata={'source': './document/load/03-load.csv', 'row': 1} => [0.01213858276605606, 0.0008625349146313965, 0.0406014658510685, -0.03944423794746399, 0.01269257441163063, 0.006567884236574173, -0.017186066135764122, 0.02560674585402012, -0.010735135525465012, 0.03511079028248787]

page_content='id: 3
title: Web Development
content: HTML, CSS and JavaScript are core web technologies.
author: Mike Johnson' metadata={'source': './document/load/