# ch3: Using embedding models

## 1. This hands-on section calls an embedding model in several ways and covers basic embedding operations


### Contents
- <font size=5>Three ways to call an embedding model</font>: transformers, sentence_transformers, and langchain

``` shell
pip install transformers
pip install sentence_transformers
pip install langchain
```

- <font size=5>Two basic embedding operations</font>: similarity computation and clustering

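## 2. The transformers approach
- Transformers is a very popular Python library developed by Hugging Face for natural language processing (NLP) tasks; it is best known for its implementations of transformer-architecture models
- Embedding models are also built on the transformer architecture
- So they can be loaded and called through the transformers library as well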

In [33]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers.util import cos_sim
import torch.nn.functional as F

In [34]:
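input_texts = [
    "中国的首都是哪里",  # "Where is the capital of China?"
    "你喜欢去哪里旅游",  # "Where do you like to travel?"
    "北京",              # "Beijing"
    "今天中午吃什么"     # "What are we having for lunch today?"
]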

In [35]:
model_path = './data/llm_app/embedding_models/gte-large-zh/'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, device_map='cpu')

In [47]:
batch_tokens = tokenizer(input_texts,
                        max_length=30,
                        padding=True,
                        truncation=True,
                        return_tensors='pt')

In [48]:
batch_tokens[0].tokens

['[CLS]', '中', '国', '的', '首', '都', '是', '哪', '里', '[SEP]']

In [50]:
print(batch_tokens[2].tokens)

['[CLS]', '北', '京', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


In [52]:
batch_tokens.input_ids[0]

tensor([ 101,  704, 1744, 4638, 7674, 6963, 3221, 1525, 7027,  102])

In [53]:
batch_tokens.input_ids[2]

tensor([ 101, 1266,  776,  102,    0,    0,    0,    0,    0,    0])
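
The two `input_ids` rows above show what `padding=True` does: the shorter sequence is filled with the `[PAD]` id (0 in this output) up to the longest length in the batch. The tokenizer also returns a matching `attention_mask` that marks which positions are real tokens. The sketch below reproduces both in plain Python; `pad_batch` is a hypothetical helper written for illustration, not part of the transformers API.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Token ids copied from the tokenizer output above.
question = [101, 704, 1744, 4638, 7674, 6963, 3221, 1525, 7027, 102]
beijing = [101, 1266, 776, 102]

ids, mask = pad_batch([question, beijing])
print(ids[1])   # [101, 1266, 776, 102, 0, 0, 0, 0, 0, 0]
print(mask[1])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```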

In [54]:
outputs = model(**batch_tokens)

In [58]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 2.0181,  0.4087,  0.1180,  ...,  1.0200,  0.4325, -0.5421],
         [ 1.5130,  0.2715, -0.0208,  ...,  0.5583,  0.4469, -0.1568],
         [ 2.0624, -0.1686,  0.0687,  ...,  0.9885,  0.5314, -0.1303],
         ...,
         [ 1.4790,  0.3036, -0.0824,  ...,  0.8273,  0.6185, -0.1685],
         [ 1.6145,  0.5669,  0.0285,  ...,  1.0715,  0.6865, -0.3383],
         [ 2.0181,  0.4085,  0.1180,  ...,  1.0200,  0.4326, -0.5420]],

        [[ 0.4519, -1.1412,  0.0964,  ...,  0.2385,  0.5646, -0.9434],
         [ 0.0185, -0.7171,  0.1632,  ...,  0.7637, -0.1800, -0.2161],
         [-0.3061, -0.8769,  0.2718,  ...,  0.6523,  0.3814, -0.8347],
         ...,
         [ 0.2691, -1.1795,  0.2144,  ...,  0.2852,  0.1001, -0.7169],
         [ 0.2553, -1.0392,  0.0997,  ...,  0.6434,  0.3948, -0.6787],
         [ 0.4520, -1.1413,  0.0967,  ...,  0.2389,  0.5644, -0.9432]],

        [[ 0.6939,  0.8812, -0.5522,  ...,  1.1094, ... (output truncated)

In [59]:
outputs.last_hidden_state.shape

torch.Size([4, 10, 1024])

In [60]:
outputs.pooler_output.shape

torch.Size([4, 1024])

In [61]:
embeddings = outputs.last_hidden_state[:, 0]

In [63]:
embeddings.shape

torch.Size([4, 1024])
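
The slice `last_hidden_state[:, 0]` keeps, for every sentence, only the hidden vector at position 0 (the `[CLS]` token), which is why a `[4, 10, 1024]` tensor becomes `[4, 1024]`. A toy illustration with nested lists instead of tensors; the sizes 2, 3, and 4 are made up for readability:

```python
# A toy "last_hidden_state" with shape [batch=2, tokens=3, hidden=4].
last_hidden_state = [
    [[1.0, 2.0, 3.0, 4.0], [0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2]],
    [[5.0, 6.0, 7.0, 8.0], [0.3, 0.3, 0.3, 0.3], [0.4, 0.4, 0.4, 0.4]],
]

# Equivalent of tensor[:, 0]: take the first token's vector from every sentence.
cls_embeddings = [sentence[0] for sentence in last_hidden_state]

# Resulting shape is [batch, hidden].
print(len(cls_embeddings), len(cls_embeddings[0]))  # 2 4
```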

In [64]:
embeddings = F.normalize(embeddings, p=2, dim=1)
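
`F.normalize(..., p=2, dim=1)` rescales every row to unit L2 length. After that, the dot product of two rows equals their cosine similarity. A minimal pure-Python sketch of the same idea on 2-D toy vectors; `l2_normalize` and `dot` are illustrative helpers, and the input is assumed to be non-zero:

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector so that its Euclidean (L2) norm is 1."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(a)          # [0.6, 0.8]
print(dot(a, a))  # squared length of a unit vector, approximately 1
print(dot(a, b))  # cosine similarity of the two directions, approximately 0.96
```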

In [65]:
for i in range(1, 4):
    print(input_texts[0], input_texts[i], cos_sim(embeddings[0], embeddings[i]))

中国的首都是哪里 你喜欢去哪里旅游 tensor([[0.3295]], grad_fn=<MmBackward0>)
中国的首都是哪里 北京 tensor([[0.6354]], grad_fn=<MmBackward0>)
中国的首都是哪里 今天中午吃什么 tensor([[0.3248]], grad_fn=<MmBackward0>)
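
The scores show that "北京" (Beijing) is by far the closest text to the question about China's capital. Picking the highest-scoring candidate is the basic step behind embedding-based retrieval. A small sketch using the three scores printed above (rounded as displayed):

```python
# Similarity of each candidate to the query "中国的首都是哪里", copied from the output above.
scores = {
    "你喜欢去哪里旅游": 0.3295,
    "北京": 0.6354,
    "今天中午吃什么": 0.3248,
}

# Sort candidates from most to least similar and take the top one.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_text, best_score = ranked[0]
print(best_text, best_score)  # 北京 0.6354
```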


## 3. The sentence_transformers approach
- Sentence-Transformers is a Python library built on PyTorch and Transformers, designed for computing embeddings of sentences, text, and images

In [66]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

In [68]:
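input_texts = [
    "中国的首都是哪里",  # "Where is the capital of China?"
    "你喜欢去哪里旅游",  # "Where do you like to travel?"
    "北京",              # "Beijing"
    "今天中午吃什么"     # "What are we having for lunch today?"
]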

In [69]:
model_path = './data/llm_app/embedding_models/gte-large-zh/'

In [70]:
model = SentenceTransformer(model_path)

In [71]:
embeddings = model.encode(input_texts)

In [72]:
embeddings.shape

(4, 1024)

In [73]:
for i in range(1, 4):
    print(input_texts[0], input_texts[i], cos_sim(embeddings[0], embeddings[i]))

中国的首都是哪里 你喜欢去哪里旅游 tensor([[0.3295]])
中国的首都是哪里 北京 tensor([[0.6354]])
中国的首都是哪里 今天中午吃什么 tensor([[0.3248]])


## 4. The langchain approach
- <font size=5>A wrapper around SentenceTransformer</font>

In [74]:
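# Import path used by the LangChain version this notebook ran on; newer releases
# expose the class as langchain_community.embeddings.HuggingFaceEmbeddings.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings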
from sentence_transformers.util import cos_sim

In [75]:
model_path = './data/llm_app/embedding_models/gte-large-zh/'

In [78]:
model = HuggingFaceEmbeddings(model_name=model_path,
                             model_kwargs={"device": "cpu"})

In [79]:
embeddings = model.embed_documents(input_texts)

In [81]:
import numpy as np

embeddings = np.array(embeddings)
print(embeddings.shape)

(4, 1024)


In [82]:
for i in range(1, 4):
    print(input_texts[0], input_texts[i], cos_sim(embeddings[0], embeddings[i]))

中国的首都是哪里 你喜欢去哪里旅游 tensor([[0.3295]], dtype=torch.float64)
中国的首都是哪里 北京 tensor([[0.6354]], dtype=torch.float64)
中国的首都是哪里 今天中午吃什么 tensor([[0.3248]], dtype=torch.float64)


## 5. Embedding operations

### 5.1 Distance computation

- <font size=5>Cosine similarity</font>

![](./data/cos.png)
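
Written out, the standard definition of cosine similarity for two $n$-dimensional vectors $a$ and $b$ is given below. The figure is assumed to show this same formula, since the image content is not reproduced in the text.

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_{i=1}^{n} a_i b_i}
                  {\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```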

In [83]:
a = embeddings[0]
b = embeddings[2]

In [84]:
from numpy import dot
from numpy.linalg import norm

In [85]:
cos_a_b = dot(a, b) / (norm(a) * norm(b))

In [86]:
print(cos_a_b, cos_sim(a, b))

0.6353580110345967 tensor([[0.6354]], dtype=torch.float64)
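
The `dot(a, b) / (norm(a) * norm(b))` expression depends only on the angle between the vectors, not on their lengths. A pure-Python sketch on toy vectors makes the three reference values visible; `cosine_similarity` here is an illustrative re-implementation, not the `cos_sim` function from sentence_transformers:

```python
import math

def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|) for two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # pointing the other way

print(same_direction, orthogonal, opposite)  # approximately 1, 0, -1
```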


- <font size=5>Euclidean distance</font>

![](./data/l2.png)
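
For reference, the standard definition of Euclidean (L2) distance between two $n$-dimensional vectors is given below. The figure is assumed to show this same formula.

```latex
d(a, b) = \lVert a - b \rVert_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```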


In [87]:
norm(a - b)

0.853981260361025
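
For unit-length vectors the two measures carry the same information, because $\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,a \cdot b = 2 - 2\cos(a, b)$. The check below plugs in the two numbers printed above. It assumes the embeddings are (approximately) unit length, which these values are consistent with.

```python
import math

cos_ab = 0.6353580110345967      # cosine similarity printed above
reported_l2 = 0.853981260361025  # norm(a - b) printed above

# For unit vectors: ||a - b|| = sqrt(2 - 2 * cos(a, b))
l2_from_cos = math.sqrt(2 - 2 * cos_ab)

print(l2_from_cos, reported_l2)
print(abs(l2_from_cos - reported_l2) < 1e-6)  # True
```

One consequence is that, on normalized embeddings, ranking by highest cosine similarity and ranking by lowest Euclidean distance give the same order.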

### 5.2 Clustering

In [88]:
# apple, pineapple, watermelon, zebra, elephant, mouse
texts = ['苹果', '菠萝', '西瓜', '斑马', '大象', '老鼠']

In [89]:
output_embeddings = model.embed_documents(texts)

In [90]:
from sklearn.cluster import KMeans

In [91]:
kmeans = KMeans(n_clusters=2)

In [93]:
kmeans.fit(output_embeddings)
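
To show what `fit` does conceptually, here is a minimal pure-Python sketch of the k-means loop (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its members. It is not scikit-learn's implementation. In particular it uses a simple deterministic farthest-point initialization rather than scikit-learn's default k-means++, and the 2-D points are hypothetical stand-ins for the 1024-dimensional embeddings.

```python
import math

def kmeans(points, k, n_iter=20):
    """Minimal Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-point initialization: start from the first point, then repeatedly
    # add the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D "embeddings": two tight groups that are far apart.
points = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1]]
labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```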

In [94]:
label = kmeans.labels_

In [96]:
for i in range(len(texts)):
    print(f"cls({texts[i]}) = {label[i]}")

cls(苹果) = 1
cls(菠萝) = 1
cls(西瓜) = 1
cls(斑马) = 0
cls(大象) = 0
cls(老鼠) = 0
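
The three fruits (apple, pineapple, watermelon) share one label and the three animals (zebra, elephant, mouse) share the other. The label numbers themselves are arbitrary cluster ids, so only the grouping matters. A short sketch that collects the texts by label, using the labels printed above:

```python
from collections import defaultdict

texts = ['苹果', '菠萝', '西瓜', '斑马', '大象', '老鼠']
labels = [1, 1, 1, 0, 0, 0]  # copied from the output above

# Group the texts by their cluster label.
clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)

for label in sorted(clusters):
    print(label, clusters[label])
```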
