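transformers, developed by Hugging Face, is a model library that makes it easy to load a wide range of pretrained models and to fine-tune them.
Embedding models are among the models it supports, so they can be loaded and fine-tuned in the same way.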

In [None]:
pip install -U sentence-transformers

In [None]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

In [3]:
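input_texts = [
    "中国的首都是哪里",  # "What is the capital of China?"
    "你喜欢去哪里旅游",  # "Where do you like to travel?"
    "北京",  # "Beijing"
    "今天中午吃什么"  # "What's for lunch today?"
]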

In [5]:
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

In [6]:
batch_tokens = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=30,
    return_tensors="pt"
)

In [None]:
batch_tokens[0].tokens

In [None]:
batch_tokens.input_ids[0]

In [9]:
outputs = model(**batch_tokens)

In [None]:
outputs

In [None]:
outputs.last_hidden_state.shape
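
`last_hidden_state` has shape `(batch, sequence_length, hidden_size)`: one vector per token, not yet one vector per sentence. The sketch below shows one way to turn token vectors into comparable sentence vectors, assuming first-token ([CLS]) pooling followed by L2 normalization. The pooling choice is an assumption, the 3-dimensional numbers are made up, and the helpers `cls_pool`, `l2_normalize`, and `cosine` are hypothetical names used only for illustration.

```python
import math

def cls_pool(last_hidden_state):
    """Keep the first token's vector of every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(a, b):
    """For unit-length vectors, the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Toy batch: 2 sequences, 2 tokens each, hidden size 3 (made-up numbers).
toy_hidden = [
    [[3.0, 4.0, 0.0], [9.0, 9.0, 9.0]],
    [[6.0, 8.0, 0.0], [1.0, 1.0, 1.0]],
]
sentence_vectors = [l2_normalize(v) for v in cls_pool(toy_hidden)]
similarity = cosine(sentence_vectors[0], sentence_vectors[1])
print(round(similarity, 6))
```

With the real tensors, the corresponding steps would be `outputs.last_hidden_state[:, 0]` and `F.normalize(..., p=2, dim=1)`, which is where the `F` import above would come in.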

In [None]:
texts = ["苹果", "香蕉", "橙子"]  # "apple", "banana", "orange"

Chroma vector database

In [12]:
import chromadb
chroma_client = chromadb.HttpClient(host="localhost", port=8000)

In [None]:
from chromadb.utils import embedding_functions
model_path = "thenlper/gte-large-zh"
# Create an embedding function that converts text into vectors
# (SentenceTransformerEmbeddingFunction wraps a sentence-transformers model).
em_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_path)

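##### `create_collection`: like setting up a shelf in the Chroma warehouse. You give the shelf a name and say what kind of items it will hold; once it is set up, you can start putting things on it.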

In [15]:
collection = chroma_client.create_collection(
    name="rag_db",
    embedding_function=em_fn,
    metadata={"hnsw:space": "cosine"}
)

In [16]:
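documents = [
    # "In vector search, many indexing methods and vector-processing techniques let us trade off recall, latency, and memory use."
    "在向量搜索领域，我们拥有多种索引方法和向量处理技术,"
    "它们使我们能够在召回率、响应时间和内存使用之间做出权衡。",
    # "Although a single technique such as IVF, PQ, or HNSW usually gives satisfactory results"
    "虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）"
    "或分层导航小世界（HNSW）通常能够带来满意的结果",
    # "GraphRAG is essentially RAG, except that its retrieval path adds a knowledge graph."
    "GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱"]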

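`collection.add`: put items on the shelf you just set up.
- What goes on the shelf: vectors (for example, the numbers that a picture of a cat is converted into). In the cell below the documents are passed as text, and the collection's embedding function converts them to vectors.
- A "label" for each item (its ID, used for lookup)
- An optional "description" (metadata)

Once the items are in place, Chroma organizes them automatically so that similar items are easy to find later.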
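
To make the shelf analogy concrete, here is a toy in-memory sketch of what adding and querying involve: embed each document, store it under its ID with its metadata, and rank by cosine distance (1 minus cosine similarity) at query time. Everything in it is hypothetical: `ToyShelf` is not Chroma's implementation, `toy_embed` is a stand-in for a real embedding model, and the duplicate-ID check is a choice made for this sketch rather than a statement about Chroma's behavior.

```python
import math

def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model: bucket character codes."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

class ToyShelf:
    """In-memory sketch of a collection: id -> (document, metadata, vector)."""

    def __init__(self):
        self.items = {}

    def add(self, documents, ids, metadatas=None):
        metadatas = metadatas or [{} for _ in documents]
        if not (len(documents) == len(ids) == len(metadatas)):
            raise ValueError("documents, ids and metadatas must have equal length")
        for doc, item_id, meta in zip(documents, ids, metadatas):
            if item_id in self.items:
                raise ValueError(f"duplicate id: {item_id}")
            self.items[item_id] = (doc, meta, toy_embed(doc))

    def query(self, text, n_results=1):
        """Return the ids of the n_results stored items closest to the query."""
        q = toy_embed(text)
        ranked = sorted(self.items, key=lambda i: cosine_distance(q, self.items[i][2]))
        return ranked[:n_results]

shelf = ToyShelf()
shelf.add(documents=["aaa bbb", "xyz"], ids=["id1", "id2"],
          metadatas=[{"chapter": 3}, {"chapter": 4}])
print(shelf.query("aaa", n_results=1))
```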


In [17]:
collection.add(
    documents=documents,
    ids=["id1", "id2", "id3"],
    metadatas=[{"chapter": 3, "verse": 16}, 
               {"chapter": 4, "verse": 5}, 
               {"chapter": 12, "verse": 5}]
)
    

In [None]:
collection.count()

In [None]:
collection.peek(limit=1)

In [21]:
get_collection = chroma_client.get_collection(name="rag_db", embedding_function=em_fn)


In [22]:
id_result = get_collection.get(ids=["id2"], include=["documents", "metadatas", "embeddings"])

In [None]:
id_result["documents"]

In [None]:
id_result["embeddings"]

In [None]:
import numpy as np
np.array(id_result["embeddings"]).shape

In [26]:
query = "索引技术有哪些？"  # "What indexing techniques are there?"

In [None]:
get_collection.query(query_texts=query, 
                     n_results=2, 
                     include=["documents", "metadatas"])

In [None]:
get_collection.query(query_texts=query, 
                     n_results=2, 
                     include=["documents", "metadatas"],
                     where={"verse": 5})

In [None]:
get_collection.query(query_texts=query, 
                     n_results=2, 
                     where={"chapter": {"$lt": 10}})

In [None]:
get_collection.query(query_texts=query, 
                     n_results=2, 
                     where={"$and": [
                         {"chapter": {"$lt": 10}},
                         {"verse": {"$eq": 5}}
                     ]})
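
The `where` arguments above filter candidates by metadata before the results are returned. As a rough illustration of how such a filter can be evaluated against a metadata dict, here is a minimal sketch that covers only the forms used above (plain equality, `$eq`, `$lt`, and `$and`). The function `matches` is hypothetical and is not Chroma's implementation.

```python
import operator

# Comparison operators supported by this sketch only.
OPS = {"$eq": operator.eq, "$lt": operator.lt}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a Chroma-style `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must match.
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            for op, expected in condition.items():
                if op not in OPS:
                    raise ValueError(f"unsupported operator in this sketch: {op}")
                if key not in metadata or not OPS[op](metadata[key], expected):
                    return False
        elif metadata.get(key) != condition:
            # A bare value means equality, as in {"verse": 5}.
            return False
    return True

metadatas = {
    "id1": {"chapter": 3, "verse": 16},
    "id2": {"chapter": 4, "verse": 5},
    "id3": {"chapter": 12, "verse": 5},
}
where = {"$and": [{"chapter": {"$lt": 10}}, {"verse": {"$eq": 5}}]}
print([i for i, m in metadatas.items() if matches(m, where)])
```

Applied to the three metadata dicts added earlier, the combined filter keeps only `id2`: `id1` fails the verse condition and `id3` fails the chapter condition.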

In [None]:
pip install pymilvus

In [3]:
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection
)

In [17]:
connections.connect(host='127.0.0.1', port="19530")

Declare the fields and build the collection

In [18]:
fields = [
    FieldSchema(name="pk", 
                dtype=DataType.VARCHAR, 
                auto_id=False, 
                max_length=100, 
                description="primary key", 
                is_primary=True),
    FieldSchema(name="documents", 
                dtype=DataType.VARCHAR, 
                max_length=512),
    # dim must match the embedding size; 1024 is used here for gte-large-zh vectors
    FieldSchema(name="embeddings", 
                dtype=DataType.FLOAT_VECTOR, 
                dim=1024),
    FieldSchema(name="verse", dtype=DataType.INT64)
]

In [19]:
rag_db = Collection("rag_db_test", CollectionSchema(fields), consistency_level="Strong")

In [20]:
# The same three Chinese sample documents as in the Chroma example.
documents = [
    "在向量搜索领域，我们拥有多种索引方法和向量处理技术,"
    "它们使我们能够在召回率、响应时间和内存使用之间做出权衡。",
    "虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）"
    "或分层导航小世界（HNSW）通常能够带来满意的结果",
    "GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱"]

In [21]:
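from langchain_community.embeddings import HuggingFaceEmbeddings

# Named embedding_model so it does not overwrite the AutoModel instance `model` loaded earlier.
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-large-zh")

embeddings = embedding_model.embed_documents(documents)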

In [22]:
entities = [
 [str(i) for i in range(len(documents))],
 documents,
 np.array(embeddings),
 [16,5,5],
]
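
The `entities` list above is column-oriented: one list per field, in the same order as the schema, and all of equal length. If the data starts out as one record per document, a sketch like the following can build and sanity-check those columns first. The helper `rows_to_columns` is hypothetical, and the 4-dimensional toy vectors stand in for the real 1024-dimensional embeddings.

```python
def rows_to_columns(rows, field_order, vector_field, dim):
    """Turn a list of row dicts into one list per field, in schema order."""
    for row in rows:
        missing = [name for name in field_order if name not in row]
        if missing:
            raise ValueError(f"row {row.get('pk')!r} is missing fields: {missing}")
        if len(row[vector_field]) != dim:
            raise ValueError(f"row {row['pk']!r}: expected a {dim}-dim vector")
    return [[row[name] for row in rows] for name in field_order]

# Field order mirrors the schema declared above.
field_order = ["pk", "documents", "embeddings", "verse"]
rows = [
    {"pk": "0", "documents": "doc a", "embeddings": [0.1, 0.2, 0.3, 0.4], "verse": 16},
    {"pk": "1", "documents": "doc b", "embeddings": [0.5, 0.6, 0.7, 0.8], "verse": 5},
]
columns = rows_to_columns(rows, field_order, vector_field="embeddings", dim=4)
print([len(column) for column in columns])
```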

In [23]:
insert_result = rag_db.insert(entities)
rag_db.flush()

In [None]:
rag_db.num_entities

Create the index

In [25]:
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128}
}

In [None]:
rag_db.create_index("embeddings", index)
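
`IVF_FLAT` partitions the vectors into `nlist` clusters and, at query time, scans only the clusters whose centroids are closest to the query. The toy sketch below illustrates that idea under L2 distance. It uses two hand-picked centroids instead of the clustering step a real index performs, and `build_ivf`, `ivf_search`, and the `nprobe` argument are illustrative names, not pymilvus calls.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(centroids, vector, k=1):
    """Indices of the k centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda i: l2(centroids[i], vector))
    return order[:k]

def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest(centroids, vec)[0]].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, limit=1):
    """Scan only the `nprobe` closest buckets, then rank those candidates exactly."""
    candidates = [v for b in nearest(centroids, query, nprobe) for v in buckets[b]]
    return sorted(candidates, key=lambda v: l2(vectors[v], query))[:limit]

vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [4.9, 5.0]}
centroids = [[0.0, 0.0], [5.0, 5.0]]  # plays the role of nlist = 2
buckets = build_ivf(vectors, centroids)
print(buckets)
print(ivf_search([4.8, 5.0], vectors, centroids, buckets, nprobe=1, limit=2))
```

Scanning fewer buckets is faster but can miss a true neighbor that landed in an unscanned bucket, which is the recall versus latency trade-off the sample documents mention.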

Search

Hands-on: document parsing and chunking

In [27]:
from langchain_core.documents import Document

In [28]:
document = Document(page_content="Hello world", metadata={"source": "test meta data"})

In [None]:
# document.page_content
document.metadata

In [None]:
# Parse an HTML page
from langchain_community.document_loaders import WebBaseLoader

In [33]:
loader = WebBaseLoader("https://flask.palletsprojects.com/en/3.0.x/tutorial/layout/")
docs = loader.load()

In [None]:
for doc in docs:
    print(doc.page_content, doc.metadata)

Building the knowledge base

Document parsing --> chunking --> embedding --> vector database

In [None]:
pip install datrie

In [None]:
pip install strenum

In [None]:
pip install hanziconv

In [None]:
pip install pycryptodome

In [None]:
pip install ruamel.yaml

In [None]:
pip install word2number

In [None]:
pip install cn2an

In [None]:
pip install xgboost

In [None]:
pip install roman_numbers

In [2]:
from model import RagEmbedding, RagLLM

The PyPI package named `docx` is a legacy package and is not the one that `import docx` needs here. Remove it if it was installed, and install `python-docx` instead.

In [None]:
pip uninstall -y docx

In [None]:
pip install python-docx

In [None]:
pip install html_text

In [16]:
import os
import sys

import nltk

# Use only an nltk_data folder under the current working directory.
nltk_data_dir = os.path.join(os.getcwd(), 'nltk_data')
os.makedirs(nltk_data_dir, exist_ok=True)
nltk.data.path = [nltk_data_dir]
print(f"NLTK data path set to: {nltk_data_dir}")

# Each package lives under its own category folder inside nltk_data,
# so tokenizers and corpora must be looked up under different prefixes.
resources = {
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',
    'stopwords': 'corpora/stopwords',
}

print("Downloading NLTK packages into the current working directory...")
for package in resources:
    nltk.download(package, download_dir=nltk_data_dir)

print("Verifying packages...")
for package, resource_path in resources.items():
    try:
        nltk.data.find(resource_path)
        print(f"{package}: found")
    except LookupError as e:
        print(f"{package}: not found: {e}")

# Smoke-test tokenization with a tokenizer that does not depend on punkt_tab.
# Named treebank_tokenizer so it does not overwrite the Hugging Face `tokenizer` above.
try:
    from nltk.tokenize import TreebankWordTokenizer
    treebank_tokenizer = TreebankWordTokenizer()
    print(f"TreebankWordTokenizer test: {treebank_tokenizer.tokenize('这是测试句子')}")
except Exception as e:
    print(f"TreebankWordTokenizer test failed: {e}")

# Drop any already-imported docx module so that a later import loads it afresh.
if 'docx' in sys.modules:
    print("Removing the already-loaded docx module")
    del sys.modules['docx']

In [17]:

from doc_parse import chunk, read_and_process_excel, logger

In [18]:
import pandas as pd
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
import chromadb

In [19]:
pdf_files = ["./data/zhidu_employee.pdf", "./data/zhidu_travel.pdf"]
excel_files = ["./data/zhidu_detail.xlsx"]

In [20]:
r_spliter = RecursiveCharacterTextSplitter(
    chunk_size=128,
    chunk_overlap=30,
    separators=["\n\n", 
                "\n", 
                ".", 
                "\uff0e",  # fullwidth full stop
                "\u3002",  # ideographic full stop
                ",",
                "\uff0c",  # fullwidth comma
                "\u3001"   # ideographic comma
                ])
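
The separator list is tried in order, from coarse (blank lines) to fine (commas). As a rough illustration of that idea, here is a simplified sketch: split on the first separator that occurs, recurse into pieces that are still too long, then greedily pack neighboring pieces back together up to `chunk_size`. It is not LangChain's implementation, it leaves out `chunk_overlap` entirely, and `split_recursive` and `merge_pieces` are hypothetical names.

```python
def split_recursive(text, separators, chunk_size):
    """Split on the first separator that occurs; recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for index, sep in enumerate(separators):
        if sep in text:
            parts = text.split(sep)
            # Re-attach the separator to the piece it terminated.
            parts = [p + sep for p in parts[:-1]] + parts[-1:]
            pieces = []
            for part in parts:
                pieces.extend(split_recursive(part, separators[index + 1:], chunk_size))
            return pieces
    # No separator left: fall back to hard cuts every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def merge_pieces(pieces, chunk_size):
    """Greedily pack consecutive pieces into chunks of at most chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks

text = "aa.bb.cccccccc,dddd.ee"
pieces = split_recursive(text, ["\n\n", ".", ","], chunk_size=10)
chunks = merge_pieces(pieces, chunk_size=10)
print(chunks)
```

Because separators stay attached to the piece before them, joining the chunks reproduces the original text, which is a convenient property to check when experimenting with separator lists.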

In [35]:
doc_data = []
for pdf_file_name in pdf_files:
    res = chunk(pdf_file_name, callback=logger)
    for data in res:
        content = data["content_with_weight"]
        if '<table>' not in content and len(content) > 200:
            doc_data = doc_data + r_spliter.split_text(content)
        else:
            doc_data.append(content)

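TypeError: 'str' object is not callable

This error appears when the name `chunk` no longer refers to the function imported from `doc_parse`, for example after a loop such as `for chunk in doc_data:` has run earlier in the session and rebound the name to a string. The execution counts suggest that is what happened here. Re-running `from doc_parse import chunk` before this cell restores the function.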
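
In a notebook, a loop variable that reuses the name of an imported function rebinds that name for the rest of the session. A minimal reproduction, in which the `chunk` function is a hypothetical stand-in rather than the one from `doc_parse`:

```python
def chunk(path):
    """Hypothetical stand-in for an imported `chunk` function."""
    return [f"parsed:{path}"]

first = chunk("a.pdf")  # works: `chunk` is still the function

for chunk in ["piece one", "piece two"]:  # rebinds the name `chunk` to a str
    pass

try:
    chunk("b.pdf")  # `chunk` is now the string "piece two"
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

Giving the loop variable a different name, or importing the function under an alias, avoids the collision.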

In [24]:
for i in doc_data:
    print(len(i), "="*10, i)

In [27]:
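for excel_file_name in excel_files:
    data = read_and_process_excel(excel_file_name)
    # Row index 7 is used as the column headers; the data rows start at index 8.
    df = pd.DataFrame(data[8:], columns=data[7])
    # Drop the columns at positions 11 through 16.
    data_excel = df.drop(columns=df.columns[11:17])
    # Store the whole sheet as one Markdown table with all spaces removed.
    doc_data.append(data_excel.to_markdown(index=False).replace(' ', ""))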

In [28]:
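from langchain_core.documents import Document
documents = []
# The loop variable is named chunk_text so it does not rebind the imported chunk() function.
for chunk_text in doc_data:
    document = Document(page_content=chunk_text, metadata={"source": "test"})
    documents.append(document)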

In [29]:
from model import RagEmbedding

In [30]:
embedding_cls = RagEmbedding()

In [32]:
chroma_client = chromadb.HttpClient(
    host="localhost", 
    port=8000
)

In [33]:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-large-zh")

In [34]:
embedding_db = Chroma.from_documents(
    documents,
    embeddings, # pass the embedding object itself, not the result of calling one of its methods
    client=chroma_client,
    collection_name="zhidu_db"
)

AttributeError: 'NoneType' object has no attribute 'get'

In [None]:
query = "迟到有什么规定？"  # "What are the rules on being late?"