## Vector databases (embedding DB)

- Installation and deployment
- Connecting to the database
- Ingesting data
- Retrieval
- Index operations

### 1. Chroma
- https://docs.trychroma.com/guides

- Installation and deployment

``` shell
pip install chromadb
```

``` shell
# Server deployment
chroma run --path ./data

 Usage: chroma run [OPTIONS]
 Run a chroma server

 Options:
   --path      TEXT     The path to the file or directory. [default: ./chroma_data]
   --host      TEXT     The host to listen to. Default: localhost [default: localhost]
   --log-path  TEXT     The path to the log file. [default: chroma.log]
   --port      INTEGER  The port to run the server on. [default: 8000]
   --help               Show this message and exit.
```
``` python
# Client usage (connects to the server started above)
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
```

``` python
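# Direct (in-process) use, no server required
import chromadb
# Option A: in-memory client; data is lost when the process exits
client = chromadb.Client()
# Option B: persistent client; data is stored on disk under ./data
client = chromadb.PersistentClient(path="./data")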
```




In [6]:
import chromadb

In [7]:
chroma_client = chromadb.HttpClient(host="localhost", port=8000)

In [8]:
from chromadb.utils import embedding_functions

In [9]:
model_path = './data/llm_app/embedding_models/gte-large-zh/'

In [10]:
em_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_path)

In [11]:
collection = chroma_client.create_collection(name='rag_db',
                                            embedding_function=em_fn,
                                            metadata={"hnsw:space": "cosine"})
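
The `hnsw:space` setting above selects cosine as the distance for this collection. As a rough illustration of what that score means, cosine distance is commonly defined as one minus cosine similarity, so 0 means the same direction, 1 means orthogonal, and 2 means opposite. The helper `cosine_distance` below is a hypothetical name for this sketch and is not part of chromadb.

``` python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (numerically) zero.
same = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite directions.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Because the score ignores vector length, only the direction of an embedding affects ranking under this metric.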

In [13]:
documents=["在向量搜索领域，我们拥有多种索引方法和向量处理技术，\
    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。", 
               "虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）\
               或分层导航小世界（HNSW）通常能够带来满意的结果",
               "GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱"]

In [14]:
collection.add(documents=documents,
              ids=["id1", "id2", "id3"],
              metadatas=[{"chapter": 3, "verse": 16}, 
               {"chapter": 4, "verse": 5}, 
               {"chapter": 12, "verse": 5}])

In [15]:
collection.count()

3

In [20]:
collection.peek(limit=1)

{'ids': ['id1'],
 'embeddings': [[0.014445115812122822,
   -0.011570136994123459,
   0.022304952144622803,
   ...
(output truncated; each embedding has 1024 dimensions)

#### Retrieval

In [21]:
get_collection = chroma_client.get_collection(name='rag_db',
                                             embedding_function=em_fn)

In [22]:
id_result = get_collection.get(ids=['id2'],
                              include=["documents", "embeddings", "metadatas"])

In [23]:
id_result['documents']

['虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）               或分层导航小世界（HNSW）通常能够带来满意的结果']

In [24]:
id_result['metadatas']

[{'chapter': 4, 'verse': 5}]

In [26]:
import numpy as np

In [27]:
np.array(id_result['embeddings']).shape

(1, 1024)

In [28]:
query = '索引技术有哪些？'

In [29]:
get_collection.query(query_texts=query,
                    n_results=2,
                    include=["documents", 'metadatas'])

{'ids': [['id1', 'id3']],
 'distances': None,
 'embeddings': None,
 'metadatas': [[{'chapter': 3, 'verse': 16}, {'chapter': 12, 'verse': 5}]],
 'documents': [['在向量搜索领域，我们拥有多种索引方法和向量处理技术，    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。',
   'GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱']],
 'uris': None,
 'data': None,
 'included': ['documents', 'metadatas']}

In [30]:
get_collection.query(query_texts=query,
                    n_results=2,
                    include=["documents", 'metadatas'],
                    where={"verse": 5})

{'ids': [['id3', 'id2']],
 'distances': None,
 'embeddings': None,
 'metadatas': [[{'chapter': 12, 'verse': 5}, {'chapter': 4, 'verse': 5}]],
 'documents': [['GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱',
   '虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）               或分层导航小世界（HNSW）通常能够带来满意的结果']],
 'uris': None,
 'data': None,
 'included': ['documents', 'metadatas']}

#### Operators supported for hybrid retrieval

- `$eq`: equal to (string, int, float)
- `$ne`: not equal to (string, int, float)
- `$gt`: greater than (int, float)
- `$gte`: greater than or equal to (int, float)
- `$lt`: less than (int, float)
- `$lte`: less than or equal to (int, float)
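
To make the semantics of the operators listed above concrete, here is a minimal pure-Python sketch that evaluates a `where`-style filter against a metadata dict. The function names `matches` and `select` are hypothetical and exist only for this illustration; this is not how Chroma implements filtering, and only the operators above plus `$and` and the bare-value shorthand used in this notebook are handled.

``` python
import operator

# Comparison operators from the list above, mapped to Python equivalents.
OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, cond in where.items():
        if key == "$and":
            # Every sub-filter must hold.
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # {"field": {"$op": value}} form.
            if key not in metadata:
                return False
            for op, value in cond.items():
                if not OPS[op](metadata[key], value):
                    return False
        elif metadata.get(key) != cond:
            # A bare value, e.g. {"verse": 5}, is treated as $eq.
            return False
    return True

# The same metadata that was added to the collection earlier.
records = {
    "id1": {"chapter": 3, "verse": 16},
    "id2": {"chapter": 4, "verse": 5},
    "id3": {"chapter": 12, "verse": 5},
}

def select(where):
    """Ids of all records whose metadata passes the filter."""
    return [rid for rid, meta in records.items() if matches(meta, where)]

lt10 = select({"chapter": {"$lt": 10}})
both = select({"$and": [{"chapter": {"$lt": 10}}, {"verse": {"$eq": 5}}]})
shorthand = select({"verse": 5})
```

The filter only narrows the candidate set; ranking among the surviving records is still done by vector distance.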

In [31]:
get_collection.query(
    query_texts=["索引技术有哪些？"],
    n_results=2,
    where={"chapter": {"$lt": 10}},
)

{'ids': [['id1', 'id2']],
 'distances': [[0.48257795562384265, 0.7083043108414732]],
 'embeddings': None,
 'metadatas': [[{'chapter': 3, 'verse': 16}, {'chapter': 4, 'verse': 5}]],
 'documents': [['在向量搜索领域，我们拥有多种索引方法和向量处理技术，    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。',
   '虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）               或分层导航小世界（HNSW）通常能够带来满意的结果']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [32]:
get_collection.query(
    query_texts=["索引技术有哪些？"],
    n_results=2,
    where={"$and": [{"chapter": {"$lt": 10}}, 
                    {"verse": {"$eq": 5}}
                   ]}
)

{'ids': [['id2']],
 'distances': [[0.7083043108414732]],
 'embeddings': None,
 'metadatas': [[{'chapter': 4, 'verse': 5}]],
 'documents': [['虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）               或分层导航小世界（HNSW）通常能够带来满意的结果']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [33]:
get_collection.query(
    query_texts=["索引技术有哪些？"],
    n_results=2,
    where_document={"$contains":"索引"}
)

{'ids': [['id1']],
 'distances': [[0.48257795562384265]],
 'embeddings': None,
 'metadatas': [[{'chapter': 3, 'verse': 16}]],
 'documents': [['在向量搜索领域，我们拥有多种索引方法和向量处理技术，    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

### 2. Milvus
- https://milvus.io/api-reference/pymilvus/v2.2.x/About.md

Installation and deployment
- https://milvus.io/docs/install-overview.md

- Docker deployment
``` shell

# Option 1: standalone script
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

bash standalone_embed.sh start   # start the Milvus container
bash standalone_embed.sh stop    # stop it
bash standalone_embed.sh delete  # remove the container and its data


```

``` shell
# Option 2: Docker Compose
mkdir milvus_compose
cd milvus_compose
wget https://github.com/milvus-io/milvus/releases/download/v2.2.8/milvus-standalone-docker-compose.yml -O docker-compose.yml

sudo systemctl daemon-reload
sudo systemctl restart docker

# Start the service
docker-compose up -d

# Install the Python client library
pip install pymilvus

```

In [35]:
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

In [36]:
connections.connect(host='127.0.0.1', port="19530")


#### Declare fields and build the collection

In [37]:
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR,
                is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="documents", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="verse", dtype=DataType.INT64),
]

In [40]:
rag_db = Collection("rag_db",
                    CollectionSchema(fields),
                    consistency_level="Strong")

#### Insert data

In [41]:
documents=["在向量搜索领域，我们拥有多种索引方法和向量处理技术，\
    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。", 
               "虽然单独使用特定技术如倒排文件（IVF）、乘积量化（PQ）\
               或分层导航小世界（HNSW）通常能够带来满意的结果",
               "GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱"]

In [42]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
model_path = './data/llm_app/embedding_models/gte-large-zh'
model = HuggingFaceEmbeddings(model_name=model_path,
                                   model_kwargs={'device': "cpu"})
embeddings = model.embed_documents(documents)


In [43]:
entities = [
    [str(i) for i in range(len(documents))],
    documents,
    np.array(embeddings),
    [16,5,5],
]

In [44]:
insert_result = rag_db.insert(entities)

In [47]:
rag_db.flush()

In [49]:
rag_db.num_entities

3

#### Create an index

In [51]:
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

In [52]:
rag_db.create_index("embeddings", index)

Status(code=0, message=)

#### Retrieval

In [53]:

get_collection = Collection("rag_db")

In [54]:
get_collection.load()

In [55]:
query = "索引技术有哪些？"

In [56]:
query_emb = model.embed_documents([query])

In [61]:
result = get_collection.search(query_emb, 
                       "embeddings", 
                       param={"metric_type": "L2"},
                       limit=2, 
                       output_fields=["documents", "verse"])

In [62]:
for hits in result:
    for hit in hits:
        print(f"hit: {hit}, documents field: {hit.entity.get('documents')}")

hit: (distance: 0.9651559591293335, id: 0), documents field: 在向量搜索领域，我们拥有多种索引方法和向量处理技术，    它们使我们能够在召回率、响应时间和内存使用之间做出权衡。
hit: (distance: 1.1978662014007568, id: 2), documents field: GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱
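
These distances are on a different scale from the Chroma results because the metric differs (`L2` here, cosine there). One plausible reading, assuming the embeddings are unit-normalized and that Milvus reports the squared Euclidean distance for `L2` (both are assumptions this notebook does not confirm): for unit vectors, ||a - b||^2 = 2 - 2(a·b) = 2(1 - cos), so the squared L2 distance is twice the cosine distance. That would be consistent with 0.9652 here versus 0.4826 for the same document in the Chroma query. The sketch below checks the identity on random unit vectors; all helper names are hypothetical.

``` python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

rng = random.Random(0)
a = normalize([rng.gauss(0, 1) for _ in range(1024)])
b = normalize([rng.gauss(0, 1) for _ in range(1024)])

# For unit vectors the two quantities differ only by a factor of 2.
gap = abs(squared_l2(a, b) - 2 * cosine_distance(a, b))
```

Under those assumptions the two metrics rank documents identically, so the choice between them affects only the reported scores.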


In [64]:
result2 = get_collection.search(query_emb, 
                       "embeddings", 
                       param={"metric_type": "L2"},
                       expr="verse < 10",
                       limit=1, 
                       output_fields=["documents", "verse"])

In [65]:
for hits in result2:
    for hit in hits:
        print(f"hit: {hit}, documents field: {hit.entity.get('documents')}")

hit: (distance: 1.1978662014007568, id: 2), documents field: GraphRAG 本质上就是 RAG，只不过与一般 RAG 相比，其检索路径上多了一个知识图谱


### 3. Index optimization

In [66]:
import numpy as np
from scipy.cluster.vq import kmeans2

In [67]:
query = np.random.normal(size=(128,))
dataset = np.random.normal(size=(1000, 128))

#### flat index

In [68]:
np.argmin(np.linalg.norm(query - dataset, axis=1))

187
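
The one-liner above returns only the single nearest row. A flat index generalizes to top-k by scoring every vector and keeping the k smallest distances. The sketch below uses only the standard library so it runs without numpy; `flat_search` is a hypothetical helper name, and the small random dataset is made up for illustration.

``` python
import heapq
import math
import random

def flat_search(query, dataset, k=1):
    """Exhaustively score every vector and return the k nearest row ids."""
    scored = ((math.dist(query, vec), row) for row, vec in enumerate(dataset))
    return [row for _, row in heapq.nsmallest(k, scored)]

rng = random.Random(0)
dataset = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]

# Use a stored vector as the query, so the nearest hit is that row itself.
query = dataset[42]
top3 = flat_search(query, dataset, k=3)
```

The cost is one distance computation per stored vector for every query, which is what the IVF approach in the next section tries to reduce.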

#### IVF

In [2]:
num_part = 100
(centroids, assignments) = kmeans2(dataset, num_part, iter=10000)

In [71]:
centroids.shape

(100, 128)

In [72]:
assignments[:10]

array([28, 95, 92, 79, 51, 82, 15, 53, 51, 89], dtype=int32)

In [84]:
index = [[] for _ in range(num_part)]
for n, k in enumerate(assignments):
     index[k].append(n)

In [75]:
index[0]

[14, 413]

In [74]:
index[1]

[26, 71, 104, 120, 200, 324, 540, 732, 877]

In [76]:
cluster_id = np.argmin(np.linalg.norm(query - centroids, axis=1))

In [77]:
cluster_id

30

In [78]:
np.argmin(np.linalg.norm(query - dataset[index[30]], axis=1))

9

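In [102]:
# 9 is a position within cluster 30's member list, not a dataset row id;
# look it up in that list to get the actual row.
index[30][9]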

In [85]:
cluster_ids = np.argsort(np.linalg.norm(query - centroids, axis=1))[: 3]

In [80]:
cluster_ids

array([30, 73, 15])

In [86]:
top3_index = []
for c in cluster_ids:
    top3_index += index[c]

In [90]:
np.argmin(np.linalg.norm(query - dataset[top3_index], axis=1))

66

In [91]:
top3_index[66]

187
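
The IVF steps above (cluster the data, build one inverted list per centroid, probe the closest clusters, then scan only their members) can be collected into two functions. This is a dependency-free sketch, not the notebook's code: it swaps scipy's `kmeans2` for a plain Lloyd k-means, and the names `build_ivf`, `ivf_search`, and the `nprobe` parameter are assumptions chosen for illustration.

``` python
import math
import random

def nearest(point, candidates):
    """Index of the candidate closest to point (Euclidean)."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(point, candidates[i]))

def assign(dataset, centroids):
    """Inverted lists: for each centroid, the row ids assigned to it."""
    lists = [[] for _ in centroids]
    for row, vec in enumerate(dataset):
        lists[nearest(vec, centroids)].append(row)
    return lists

def build_ivf(dataset, num_part, iters=5, seed=0):
    """Plain Lloyd k-means followed by a final assignment pass."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(dataset, num_part)]
    dim = len(dataset[0])
    for _ in range(iters):
        for c, rows in enumerate(assign(dataset, centroids)):
            if rows:  # keep the old centroid if its cluster is empty
                centroids[c] = [sum(dataset[r][d] for r in rows) / len(rows)
                                for d in range(dim)]
    return centroids, assign(dataset, centroids)

def ivf_search(query, dataset, centroids, lists, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [row for c in order[:nprobe] for row in lists[c]]
    # Returns a dataset row id directly, or None if nothing was scanned.
    return min(candidates, key=lambda row: math.dist(query, dataset[row]),
               default=None)

rng = random.Random(1)
dataset = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
query = [rng.gauss(0, 1) for _ in range(8)]
num_part = 10
centroids, lists = build_ivf(dataset, num_part)

exact = nearest(query, dataset)                                   # flat search
approx = ivf_search(query, dataset, centroids, lists, nprobe=1)   # one cluster
full = ivf_search(query, dataset, centroids, lists, nprobe=num_part)
```

Probing every cluster always reproduces the flat result, while a small `nprobe` scans fewer vectors but may miss the true nearest neighbor. Returning dataset row ids from `ivf_search` also avoids the extra step of mapping a position in the candidate list back to a row.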