### Introduction to text-to-vector embedding


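#### How a BERT model turns text into a vector
1. The sentence vector is mainly computed from the `hidden_state` values and the `attention_mask` (which marks the real, non-padding tokens).
2. For a reference implementation, see https://github.com/UKPLab/sentence-transformers/blob/06f5c4e9857f013da2657d43a77d9f5f0bf50a61/sentence_transformers/models/Pooling.py#L128
3. The most common, and most convenient, approach is to take the hidden state at the `[CLS]` position. Other pooling strategies exist, but they are not covered here.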
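
The two pooling strategies mentioned above can be sketched on a toy array. This is a minimal illustration using NumPy rather than the linked sentence-transformers code; the function names `cls_pooling` and `mean_pooling` and the toy values are assumptions made for this example.

```python
import numpy as np


def cls_pooling(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take the hidden state of the first token ([CLS]) of every sequence."""
    return last_hidden_state[:, 0]


def mean_pooling(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the hidden states of real tokens, skipping padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size.
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for a sequence with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # the last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

cls_vectors = cls_pooling(hidden)          # shape (2, 2)
mean_vectors = mean_pooling(hidden, mask)  # shape (2, 2), padding ignored
```

CLS pooling ignores the mask entirely, which is why it is the simplest option; mean pooling needs the mask so that padded positions do not distort the average.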


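#### Normalization
1. Why do some examples use cosine similarity, while others use a plain matrix multiplication?
2. The two are equivalent once the vectors are L2-normalized:
- 2.1 If the model output has already been normalized, a plain matrix multiplication is enough.
- 2.2 If the model output has not been normalized, use cosine similarity.

3. Normalization simply applies the $A / ||A||$ step ahead of time.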
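
A small numeric check of this equivalence, as a sketch with NumPy. The helper names `l2_normalize` and `cosine_similarity_matrix` and the toy vectors are assumptions made for this example.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Divide every row by its L2 norm, i.e. A / ||A||."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T) / (a_norm * b_norm.T)


a = np.array([[3.0, 4.0], [1.0, 0.0]])
b = np.array([[4.0, 3.0], [0.0, 2.0]])

# Path 1: cosine similarity on the raw vectors.
cos = cosine_similarity_matrix(a, b)
# Path 2: normalize first, then use a plain matrix multiplication.
dot_after_norm = l2_normalize(a) @ l2_normalize(b).T
```

Both paths give the same matrix, so a model that normalizes its output lets the caller skip the division at query time.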

![](images/cos.png)


## How to use a text embedding model
### Using transformers
1. This approach is recommended over using `sentence-transformers` directly.

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

model_name_or_path = "model/bge-base-zh-v1.5"

In [2]:
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path)
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)

Sentence embeddings: tensor([[-0.0157, -0.0291,  0.0919,  ..., -0.0066,  0.0219,  0.0269],
        [-0.0168, -0.0304,  0.1028,  ..., -0.0277,  0.0120,  0.0091]])


## Converting to ONNX for inference

### Required environment for ONNX
1. Reference: http://www.xavierdupre.fr/app/onnxcustom/helpsphinx/api/onnxruntime_python/helpers.html

```BASH
pip install "optimum[onnxruntime]" # CPU version
pip install "optimum[onnxruntime-gpu]" # GPU version

```

In [3]:
import pprint
import onnxruntime

pprint.pprint(onnxruntime.get_available_providers())  # 'CUDAExecutionProvider' must appear in this list for GPU inference

['TensorrtExecutionProvider',
 'CUDAExecutionProvider',
 'AzureExecutionProvider',
 'CPUExecutionProvider']
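
If the same notebook also has to run on a machine without a GPU, the provider can be chosen from that list instead of being hard-coded. The sketch below is one possible approach; `pick_provider` and `PREFERRED_PROVIDERS` are hypothetical names introduced here, and the list is copied from the output above so the snippet runs without onnxruntime installed.

```python
from typing import Sequence

# Hypothetical preference order: GPU first, CPU as a fallback.
PREFERRED_PROVIDERS = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_provider(available: Sequence[str],
                  preferred: Sequence[str] = PREFERRED_PROVIDERS) -> str:
    """Return the first preferred provider that onnxruntime reports as available."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")


# In a real session this would be: available = onnxruntime.get_available_providers()
available = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "AzureExecutionProvider",
    "CPUExecutionProvider",
]
chosen = pick_provider(available)
```

The returned name can then be passed as the `provider` argument when the ONNX model is loaded.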


In [4]:
model_output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [9]:
model_output.last_hidden_state[:,0].shape

torch.Size([2, 768])

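### Converting the model
1. sbert = bert + pool: an SBERT (sentence-transformers) model is a BERT encoder followed by a pooling step.
2. Optimum (as used here) has no ready-made SBERT model, so the two parts are handled separately:
- 2.1 use Optimum to export the BERT encoder
- 2.2 apply the pooling step yourself on the encoder output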

In [2]:
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.onnxruntime import ORTOptimizer, ORTModelForFeatureExtraction

model_id = "model/bge-base-zh-v1.5"
onnx_path = "model/bge_base_zh_v1_5_onnx"

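# `from_transformers=True` exports the PyTorch checkpoint to ONNX; it is deprecated in favor of `export` (see the warning in the output)
model = ORTModelForFeatureExtraction.from_pretrained(model_id=model_id, from_transformers=True)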
optimizer = ORTOptimizer.from_pretrained(model)

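optimizer_config = OptimizationConfig(
    optimization_level=2,   # basic and extended graph optimizations
    optimize_for_gpu=True,  # optimize for GPU execution
    fp16=True,              # convert to half precision; expect small numeric differences
)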
optimizer.optimize(save_dir=onnx_path, optimization_config=optimizer_config)

The argument `from_transformers` is deprecated, and will be removed in optimum 2.0.  Use `export` instead
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.0+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
Optimizing model...
2024-03-08 07:18:20.809915600 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-08 07:18:20.809946916 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Configuration saved in model/bge_base_zh_v1_5_onnx/ort_config.json
Optimized model saved at: model/bge_base_zh_v1_5_onnx (exte

PosixPath('model/bge_base_zh_v1_5_onnx')

### Running ONNX inference (testing accuracy and speed)

In [5]:
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer,AutoModel
import torch.nn.functional as F
from tqdm import tqdm
from pathlib import Path
from typing import List
import time
import numpy as np 
import torch

In [57]:
model_id = "model/bge-base-zh-v1.5"
onnx_path = "model/bge_base_zh_v1_5_onnx"


def load_model_raw(model_id):
    model_raw = AutoModel.from_pretrained(model_id, device_map="cuda:0")
    # model_raw = model_raw.to(torch.float16)
    model_raw.eval()
    # model_raw.half()
    return model_raw


def load_model_ort(model_path):
    model = ORTModelForFeatureExtraction.from_pretrained(
        model_id=model_path,
        file_name="model_optimized.onnx",
        provider="CUDAExecutionProvider",
    )
    return model


model_raw = load_model_raw(model_id=model_id)
model_ort = load_model_ort(model_path=onnx_path)
tokenizer = AutoTokenizer.from_pretrained(model_id)

2024-03-08 07:52:08.739857954 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-08 07:52:08.739890976 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.


In [58]:
sentence1 = ["啊哈哈哈哈，我爱良睦路程序员"]

tokenizer_output = tokenizer(sentence1, padding=True, truncation=True, return_tensors="pt")
# tokenizer_output

for k in tokenizer_output.keys():
    tokenizer_output[k] = tokenizer_output[k].cuda()

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    raw_o1 = model_raw(**tokenizer_output)
    ort_o1 = model_ort(**tokenizer_output)




In [59]:
raw_o1.keys(), ort_o1.keys()


(odict_keys(['last_hidden_state', 'pooler_output']),
 odict_keys(['last_hidden_state']))

In [60]:
ort_o1["last_hidden_state"].detach().cpu().numpy()[0, 0, :4]

array([ 0.35131836,  0.00605774,  0.12158203, -0.6694336 ], dtype=float32)

In [61]:
raw_o1["last_hidden_state"].detach().cpu().numpy()[0, 0, :4]

array([ 0.35138747,  0.00591746,  0.12244577, -0.66722745], dtype=float32)

In [71]:
# raw_o1['last_hidden_state'].detach()
ort_o1["last_hidden_state"].detach()

(
    np.allclose(
        raw_o1["last_hidden_state"].detach().cpu().numpy(),
        ort_o1["last_hidden_state"].detach().cpu().numpy(),
        atol=1e-3,
    ),
    np.allclose(
        raw_o1["last_hidden_state"].detach().cpu().numpy(),
        ort_o1["last_hidden_state"].detach().cpu().numpy(),
        atol=1e-1,
    ),
) # the error is somewhat large, but acceptable: the ONNX model was optimized with fp16=True

(False, True)
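
Two `allclose` calls only bracket the error between `1e-3` and `1e-1`. A small report of the actual differences is often more informative. The sketch below is a hypothetical helper (`compare_outputs` is not part of NumPy or Optimum), applied to the four leading values printed in the cells above.

```python
import numpy as np


def compare_outputs(reference: np.ndarray, candidate: np.ndarray) -> dict:
    """Summarize how far `candidate` is from `reference`."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    diff = np.abs(ref - cand)
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "cosine": cosine,
    }


# First four [CLS] values of the PyTorch and ONNX outputs, copied from the cells above.
raw_head = np.array([0.35138747, 0.00591746, 0.12244577, -0.66722745])
ort_head = np.array([0.35131836, 0.00605774, 0.12158203, -0.6694336])

report = compare_outputs(raw_head, ort_head)
```

On these four values the largest absolute difference is about `2.2e-3`, which is consistent with the check failing at `atol=1e-3` and passing at `atol=1e-1`.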

#### Speed test

In [65]:
s1 = time.time()
for i in tqdm(range(100), desc="raw infer"):
    raw_o1 = model_raw(**tokenizer_output)

s_raw = time.time() - s1


s1 = time.time()
for i in tqdm(range(100), desc="ort infer"):
    ort_o1 = model_ort(**tokenizer_output)

s_ort = time.time() - s1

(s_raw, s_ort)

raw infer: 100%|██████████| 100/100 [00:00<00:00, 115.30it/s]
ort infer: 100%|██████████| 100/100 [00:00<00:00, 633.66it/s]


(0.8699741363525391, 0.15927767753601074)
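
The totals above correspond to a speedup of roughly 5.5x for the ONNX model. The loop is a rough measurement, though: it has no warm-up phase and does not wait for pending GPU work before reading the clock. A slightly more careful timing helper could look like the sketch below; `benchmark` and its parameters are hypothetical names, and the workload here is a stand-in so the snippet runs without a GPU.

```python
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object],
              n_iter: int = 100,
              n_warmup: int = 10,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock seconds per call of `fn`, after a warm-up phase."""
    # Warm-up calls absorb one-off costs (lazy initialization, caches) and are not timed.
    for _ in range(n_warmup):
        fn()
    if sync is not None:
        sync()  # wait for asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(n_iter):
        fn()
    if sync is not None:
        sync()  # wait again so the timed work has really finished
    return (time.perf_counter() - start) / n_iter


# Stand-in workload; in the notebook this would be a call to model_raw or model_ort.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), n_iter=50, n_warmup=5)
```

With PyTorch on CUDA, `torch.cuda.synchronize` is a natural candidate for `sync`, and the PyTorch call can be wrapped in `torch.no_grad()` so that both models are timed in inference mode.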

#### Accuracy test

In [66]:
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]


def embd_func(model, tokenizer, inputs: List[str], normalize_embedding: bool = True):
    """Embed `inputs` with CLS pooling and return the vectors as a NumPy array."""
    # Tokenize sentences
    encoded_input = tokenizer(
        inputs, padding=True, truncation=True, return_tensors="pt"
    )
    # For s2p (short query to long passage) retrieval, add an instruction to the queries (not to the passages)
    # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
    # Move every input tensor to the GPU
    for k in encoded_input.keys():
        encoded_input[k] = encoded_input[k].cuda()
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
        # Perform pooling. In this case, CLS pooling.
        sentence_embeddings = model_output[0][:, 0]
    if normalize_embedding:
        # L2-normalize the embeddings
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

    return sentence_embeddings.detach().cpu().numpy()

In [67]:
sentence_embedding_raw = embd_func(model_raw,tokenizer, sentences,normalize_embedding=True)
sentence_embedding_ort = embd_func(model_ort, tokenizer, sentences,normalize_embedding=True)

sentence_embedding_raw[:4], sentence_embedding_ort[:4]

(array([[-0.01574466, -0.0290632 ,  0.09189218, ..., -0.00664417,
          0.02189073,  0.02688614],
        [-0.01677734, -0.03039891,  0.10282401, ..., -0.02765604,
          0.01199661,  0.00911522]], dtype=float32),
 array([[-0.01573375, -0.029145  ,  0.09198891, ..., -0.00649501,
          0.02190429,  0.02698189],
        [-0.01686918, -0.03031984,  0.10277913, ..., -0.02763865,
          0.01200952,  0.00904345]], dtype=float32))

In [70]:
np.allclose(sentence_embedding_raw, sentence_embedding_ort, atol=1e-3)

True

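### Summary
1. sbert = bert + pool
2. Optimum does not handle the full SBERT model.
3. Optimum handles the BERT encoder, so pooling (and normalization) is applied separately on its output.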
