# 1. embedding_function

```python
collection = client.get_or_create_collection(name="my_collection", embedding_function=emb_fn)
```

BERT: the name stands for "Bidirectional Encoder Representations from Transformers", that is, bidirectional encoder representations built from multiple Transformer layers.

## 1.1 transformer
![alt text](./image.png)

## 1.2 Embedding options

1. SentenceTransformers models
2. Embedding services from third-party providers, wrapped by Chroma (paid)
3. Self-hosted embedding services
4. Custom embedding functions

```python
from chromadb.utils import embedding_functions
```

### 1.2.1 SentenceTransformers

By default, Chroma uses all-MiniLM-L6-v2, one of the models supported by the SentenceTransformers library. [Official documentation](https://www.sbert.net/)

[List of supported models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)

By default, the model is downloaded from Hugging Face and cached locally:
- Default location on macOS: `~/.cache/huggingface/hub`
- Default location on Windows: `C:\Users\<YourUsername>\.cache\huggingface\hub`
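
If a model does not show up in the expected folder, the cache location may have been redirected through environment variables. The following is a minimal sketch of how the hub cache directory is typically resolved, assuming the `huggingface_hub` conventions (`HF_HUB_CACHE` first, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache`). The helper name `resolve_hf_hub_cache` is hypothetical and not part of any library.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_hub_cache(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the directory where Hugging Face Hub downloads are cached.

    Hypothetical helper: it mirrors the lookup order assumed above and
    does not call into huggingface_hub itself.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # An explicit hub cache setting wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()

    # Otherwise the hub cache is the "hub" folder under the HF home directory.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"

    # The HF home itself defaults to <cache root>/huggingface.
    if env.get("XDG_CACHE_HOME"):
        cache_root = Path(env["XDG_CACHE_HOME"])
    else:
        cache_root = home / ".cache"
    return cache_root / "huggingface" / "hub"


print(resolve_hf_hub_cache())
```

With none of these variables set, the result matches the default locations listed above.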

```python
from chromadb.utils import embedding_functions
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
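    model_name="all-MiniLM-L6-v2",  # Chroma's default; replace with any supported model name
    cache_folder="./models",  # example download directory (placeholder; pick your own)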
)
```

### 1.2.2 Embedding services from third-party providers (paid)
These services cannot be reached from mainland China.

1. OpenAI
2. Google
3. Cohere
4. HuggingFace

```python
import chromadb.utils.embedding_functions as embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="text-embedding-3-small",
)
```

### 1.2.3 Self-hosted embedding services

1. Ollama: [documentation](https://docs.trychroma.com/integrations/embedding-models/ollama) (Note: the example in that documentation has problems.)
2. Hugging Face server: runs the Docker image of Hugging Face's text embedding service locally. [documentation](https://docs.trychroma.com/integrations/embedding-models/hugging-face-server)

```python
from chromadb.utils import embedding_functions
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)
```

```python
from chromadb.utils import embedding_functions

ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)
# Embed two documents; one vector is returned per document.
results = ollama_ef(["文档0001", "文档0002"])
len(results)
```

Output:

```text
2
```

### 1.2.4 Custom embedding functions

```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents somehow
        return embeddings
```
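
To make the contract of `__call__` concrete without a model or a network call, here is a dependency-free sketch: one fixed-length vector per input document, in input order. `HashEmbeddingFunction` and its `dim` parameter are hypothetical names, and the class deliberately does not subclass chromadb's `EmbeddingFunction` so that it runs on its own. The hash-derived vectors carry no semantic meaning and are only useful for wiring tests.

```python
import hashlib
import math
from typing import List


class HashEmbeddingFunction:
    """Toy embedding function: deterministic, unit-length, not semantic."""

    def __init__(self, dim: int = 8):
        # A single SHA-256 digest provides at most 32 bytes.
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        embeddings = []
        for doc in input:
            digest = hashlib.sha256(doc.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in [-1, 1].
            raw = [(b - 127.5) / 127.5 for b in digest[: self.dim]]
            # Normalize to unit length so vectors can be compared by dot product.
            norm = math.sqrt(sum(x * x for x in raw))
            embeddings.append([x / norm for x in raw])
        return embeddings


hash_ef = HashEmbeddingFunction(dim=8)
vectors = hash_ef(["文档0001", "文档0002"])
print(len(vectors), len(vectors[0]))  # 2 8
```

A real implementation would replace the hashing step with a model or API call while keeping the same input and output shape.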

```shell
pip install openai
```

In the original run, this installed openai 1.63.2 together with distro 1.9.0 and jiter 0.8.2.


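```python
from typing import Optional

from chromadb import Documents, EmbeddingFunction, Embeddings
from openai import OpenAI


class MyEmbeddingFunction(EmbeddingFunction):
    """Embedding function backed by an OpenAI-compatible embeddings endpoint."""

    def __init__(self, model_name: str, api_key: Optional[str] = None, base_url: Optional[str] = None):
        self.model_name = model_name
        # None lets the OpenAI client fall back to its environment variables and default URL;
        # an empty string would be used literally and fail.
        self.api_key = api_key
        self.base_url = base_url

    def __call__(self, input: Documents) -> Embeddings:
        client = OpenAI(api_key=self.api_key, base_url=self.base_url)
        response = client.embeddings.create(input=input, model=self.model_name)
        # Collect one embedding per item in the response.
        return [item.embedding for item in response.data]
```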

```python
import getpass

# Prompt for the API key so that it never appears in the notebook.
api_key = getpass.getpass()

my_embedding_fn = MyEmbeddingFunction(
    model_name="embedding-3",
    api_key=api_key,
    base_url="https://open.bigmodel.cn/api/paas/v4/",
)
results = my_embedding_fn(["文档0001", "文档0002"])
results
```

Output:

```text
[array([-2.8312918e-02,  2.2436652e-02,  4.8403348e-05, ...,
        -7.2233942e-03, -1.5576170e-03,  1.2797718e-02], dtype=float32),
 array([-0.02656638,  0.01375132,  0.008637  , ..., -0.00215925,
        -0.00118349,  0.01302572], dtype=float32)]
```

```python
# Endpoint: https://yunwu.ai/v1/embeddings
import getpass

# Never hard-code a real API key in a notebook; prompt for it instead.
api_key = getpass.getpass()

my_embedding_fn = MyEmbeddingFunction(
    # model_name="embedding-3",
    model_name="text-embedding-ada-002",
    api_key=api_key,
    base_url="https://yunwu.ai/v1",
)
results = my_embedding_fn(["文档0001", "文档0002"])
results
```

Output:

```text
[array([-0.00863519, -0.00956835, -0.01586368, ..., -0.00673057,
        -0.00711359, -0.01958238], dtype=float32),
 array([-0.01486283, -0.00533767, -0.0116189 , ...,  0.00653262,
        -0.02319649,  0.00046317], dtype=float32)]
```