# Embedding Models with _sentence_transformers_

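In this notebook, we explore how to use the popular [sentence_transformers library](https://sbert.net/index.html) to encode text made of multiple words/tokens into embedding vectors. We will look at the following:

* [OpenAI embeddings](#openai-embedding)
* [Open source encoder: input embeddings](#open-source-encoder---input-embeddings)
* [Open source encoder: output embeddings (with context)](#open-source-encoder---output-embedding-with-context)
* [Improved encoder for queries and documents (bi-encoder)](#improved-encoder-for-queries-and-documents-bi-encoder)

Define a Rich theme for nicer object printing.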

In [None]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)


In [None]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## OpenAI Embedding <a id='openai-embedding'></a>

A common choice is to use the embeddings offered by the same provider as the generative model.

In [None]:
first_sentence = "I have no interest in politics"

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
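# Import the OpenAI client used to talk to the OpenAI API
from openai import OpenAI

# Create a client. With no arguments it reads the key from the OPENAI_API_KEY
# environment variable (loaded above by load_dotenv()). Never hard-code API keys
# in a notebook. For an OpenAI-compatible provider, pass api_key=... and
# base_url=... here, with the key read from the environment.
client = OpenAI()

# Create a text embedding: convert the input text into a vector representation
response = client.embeddings.create(
    # The input text, usually a string
    input=first_sentence,
    # The embedding model to use; change it to a model your provider offers
    model="text-embedding-3-small",
)

# Print the API response, which contains the embedding vector and usage details
console.print(response)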

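## Open Source Encoder - Input Embeddings

We will start with a popular encoder from the _sentence_transformers_ library. This lets us explore its architecture and flow, and later optimize it for our use case.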

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

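# This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.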

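### The Model's Tokenizer

We will use the model's default tokenizer. Each word or sub-word is converted to a token with a fixed ID. For example, in the following two sentences the word `interest` is tokenized to the same ID (`3037`).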
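
To make the "fixed ID per word or sub-word" idea concrete, here is a minimal sketch of greedy longest-match sub-word splitting in the style of WordPiece, the scheme used by BERT-family tokenizers. The vocabulary `TOY_VOCAB` and its IDs are invented for illustration and are not the model's real vocabulary; the real tokenizer also handles normalization and punctuation, which this sketch skips.

```python
# Toy vocabulary (hypothetical IDs, not the model's real ones)
TOY_VOCAB = {"[UNK]": 0, "i": 1, "have": 2, "no": 3, "interest": 4,
             "in": 5, "politics": 6, "##ing": 7, "##s": 8, "rate": 9}


def wordpiece(word: str, vocab: dict) -> list:
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens


def encode(sentence: str, vocab: dict) -> list:
    """Lowercase, split on whitespace, and map every sub-word token to its ID."""
    ids = []
    for word in sentence.lower().split():
        ids.extend(vocab[token] for token in wordpiece(word, vocab))
    return ids


first_ids = encode("I have no interest in politics", TOY_VOCAB)
second_ids = encode("interesting rates", TOY_VOCAB)
print(first_ids)
print(second_ids)
```

In this toy setup `interest` maps to the same ID whether it appears as a whole word or as the first piece of `interesting`, which mirrors what the real tokenizer does for the two sentences in this notebook.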

In [None]:
first_sentence = "I have no interest in politics"
second_sentence = "The bank's interest rate rises"


In [None]:
tokenized_first_sentence = model.tokenize([first_sentence])
console.rule(f"{first_sentence}")
console.print(tokenized_first_sentence)

In [None]:
tokenized_second_sentence = model.tokenize([second_sentence])
console.rule(f"{second_sentence}")
console.print(tokenized_second_sentence)

Token IDs can be converted back to readable text:

In [None]:
sentence_tokens = (
    model
    .tokenizer
    .convert_ids_to_tokens(
        tokenized_second_sentence["input_ids"]
        [0]
    )
)

console.print(sentence_tokens)

### Model Vocabulary

In [None]:
vocabulary = (
    model
    ._first_module()
    .tokenizer
    .get_vocab()
    .items()
)

console.print("[bold]Vocabulary size[/bold]:", len(vocabulary))
console.print(dict(list(vocabulary)[:20]))

Let's look at part of the tokenizer vocabulary. We will search for the `interest` token and view its neighboring tokens.

In [None]:


sorted_vocabulary = sorted(
    vocabulary, 
    key=lambda x: x[1],  # uses the value of the dictionary entry
)
sorted_tokens = [token for token, _ in sorted_vocabulary]

focused_token = 'interest'
# Find the index of the 'interest' token
focused_index = sorted_tokens.index(focused_token)

# Get 20 tokens around the focused token
start_index = max(0, focused_index - 10)
end_index = min(len(sorted_tokens), focused_index + 11)
tokens_around_focused_index = sorted_tokens[start_index:end_index]

from rich.table import Table

table = Table(title=f"Tokens around '{focused_token}':")
table.add_column("id", justify="right", style="cyan", no_wrap=True)
table.add_column("token", style="bright_green")

for i, token in enumerate(tokens_around_focused_index, start=start_index):
    if token == focused_token:
        table.add_row(f"[bold][black on yellow]{i}[/black on yellow][/bold]", f"[bold][black on yellow]{token}[/black on yellow][/bold]")
    else:
        table.add_row(str(i), token)

console.print(table)

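### The Embedding Transformer Model

A Transformer is built from several stacked modules. The tokens are the input to the first module. Let's look at the model's modules.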

In [None]:
console.print(model)

In [None]:
first_module = model._first_module()
console.print(first_module.auto_model)

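In natural language processing (NLP), embeddings map the tokens of the vocabulary into a high-dimensional vector space. These vectors capture semantic and syntactic relationships between tokens and are the foundation for how the model understands and generates text. Let's focus on the `embeddings` part: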

In [None]:
embeddings = first_module.auto_model.embeddings
console.print(embeddings)
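
Conceptually, the word-embedding layer is a lookup table with one row per token ID. The following is a minimal sketch of that idea using plain Python lists; the sizes, the random values, and the token IDs are toy assumptions, not the model's real weights.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 10, 4  # toy sizes; the real table is far larger

# One row of DIM floats per token ID
embedding_table = [
    [random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]


def lookup(token_ids):
    """Return the embedding row for every token ID, in order."""
    return [embedding_table[i] for i in token_ids]


# Hypothetical IDs: 4 plays the role of "interest" in both sentences
first = lookup([1, 2, 3, 4, 5, 6])
second = lookup([7, 8, 4, 9])

print(len(first), len(first[0]))  # tokens x embedding dimension
print(first[3] == second[2])      # same ID, same vector, regardless of context
```

Because the lookup depends only on the token ID, the input embedding of a token is identical in every sentence. Context only enters later, in the attention layers.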

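### Embedding the Input Token IDs

We will pass the two sentences above to the Transformer model's embedding layer and examine the similarity between the **input** token embeddings.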

In [None]:
import torch

device = model.device  # Use the device the SentenceTransformer model is on, so inputs and weights match

with torch.no_grad():
    # Tokenize both texts
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])
    
    # Get the corresponding embeddings
    first_embeddings = embeddings.word_embeddings(
        first_tokens["input_ids"].to(device)
    )
    second_embeddings = embeddings.word_embeddings(
        second_tokens["input_ids"].to(device)
    )

console.print(first_embeddings.shape, second_embeddings.shape)

In [None]:
from rich.table import Table

table = Table(title="Embeddings Shape Explanation")

table.add_column("Text", style="cyan", no_wrap=True)
table.add_column("Batch Size", style="white")
table.add_column("Tokens Number", style="white")
table.add_column("Embedding Dimension", style="white")

table.add_row(
    first_sentence,
    str(first_embeddings.shape[0]),
    str(first_embeddings.shape[1]),
    str(first_embeddings.shape[2]),
)
table.add_row(
    second_sentence,
    str(second_embeddings.shape[0]),
    str(second_embeddings.shape[1]),
    str(second_embeddings.shape[2]),
)

console.print(table)

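### Comparing the Input Embeddings of Tokens

In natural language processing (NLP), input embeddings map each token in the vocabulary into a high-dimensional vector space. These embedding vectors capture semantic and syntactic information and are the basis on which the model understands and generates text. By comparing the input embeddings of different tokens, we can see how they are represented in the model and how they relate to one another.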
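
The comparison used in this notebook is cosine similarity, which `util.cos_sim` computes for every pair of rows from two sets of vectors. As a reference for what that number means, here is a minimal pure-Python sketch with toy 2D vectors; the helper names are illustrative and not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """Pairwise cosine similarity: entry [i][j] compares rows[i] with cols[j]."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy "token embeddings" for two short sentences
first = [[1.0, 0.0], [0.0, 2.0]]
second = [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

matrix = similarity_matrix(first, second)
for row in matrix:
    print([round(value, 2) for value in row])
```

Values near 1 mean the vectors point in the same direction, values near 0 mean they are unrelated (orthogonal), and values near -1 mean they point in opposite directions. Vector length does not matter, which is why `[1.0, 0.0]` and `[3.0, 0.0]` score 1.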

In [None]:
from sentence_transformers import util
import altair as alt
import pandas as pd

# Calculate cosine similarity
distances = util.cos_sim(
    first_embeddings.squeeze(), 
    second_embeddings.squeeze()
).cpu().numpy()

# Get token labels
x_labels = model.tokenizer.convert_ids_to_tokens(second_tokens["input_ids"][0])
y_labels = model.tokenizer.convert_ids_to_tokens(first_tokens["input_ids"][0])

# Create a DataFrame for Altair
data = pd.DataFrame(
    [(x, y, distances[i, j]) for i, y in enumerate(y_labels) for j, x in enumerate(x_labels)],
    columns=['x', 'y', 'similarity']
)

# Create heatmap using Altair
chart = alt.Chart(data).mark_rect().encode(
    x=alt.X('x:O', title='Second Sentence Tokens', axis=alt.Axis(labelAngle=-45), sort=x_labels),
    y=alt.Y('y:O', title='First Sentence Tokens', sort=y_labels),
    color=alt.Color('similarity:Q', scale=alt.Scale(scheme='yellowgreenblue')),
    tooltip=['x', 'y', alt.Tooltip('similarity:Q', format='.2f')]
).properties(
    width=500,
    height=400,
    title='Input Token Similarity Heatmap'
)

# Add text labels
text = chart.mark_text(baseline='middle').encode(
    text=alt.Text('similarity:Q', format='.2f'),
    color=alt.condition(
        alt.datum.similarity > 0.5,
        alt.value('white'),
        alt.value('black')
    )
)

# Combine chart and text
final_chart = (chart + text).configure_title(fontSize=16)

# Display the chart
final_chart

### Vocabulary Embeddings

As we saw, the vocabulary has 30,522 tokens, and each one is embedded as a vector of size 384.

In [None]:
token_embeddings = first_module.auto_model \
    .embeddings \
    .word_embeddings \
    .weight \
    .detach() \
    .cpu() \
    .numpy()

console.print(token_embeddings.shape)

### Reduce the embedding vectors to 2D for visualization

We will use t-SNE (the `TSNE` class from scikit-learn) to create a 2D visualization of the token embeddings, so that we can see which tokens lie close to one another.

This can take a minute or two, depending on your CPU.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, metric="cosine", random_state=42)
tsne_embeddings_2d = tsne.fit_transform(token_embeddings)
console.print(tsne_embeddings_2d.shape)

### Token Embedding Visualization

Once the 384 dimensions are reduced to 2D, we can plot the tokens to explore them.

In [None]:
token_colors = []
for token in sorted_tokens:
    if token[0] == "[" and token[-1] == "]": # Control Tokens
        token_colors.append("red")
    elif token.startswith("##"):            # Suffix Tokens
        token_colors.append("blue")
    else:
        token_colors.append("green")        # All Word Tokens

In [None]:
import altair as alt
import pandas as pd

# Enable VegaFusion data transformer to handle larger datasets
alt.data_transformers.enable("vegafusion")

# Create a DataFrame from the data
df = pd.DataFrame({
    'x': tsne_embeddings_2d[:, 0],
    'y': tsne_embeddings_2d[:, 1],
    'token': sorted_tokens,
    'color': token_colors
})

# Create the Altair chart
chart = alt.Chart(df).mark_circle(size=30).encode(
    x='x:Q',
    y='y:Q',
    color=alt.Color('color:N', scale=None),
    tooltip=['token:N']
).properties(
    width=600,
    height=900,
    title='Token Embeddings'
).interactive()

# Display the chart
chart

## Open Source Encoder - Output Embedding (with Context)

Now let's look at the token embeddings at the output side of the Transformer embedding model.

In [None]:
output_embedding = model.encode([first_sentence])
console.print(output_embedding.shape)

In [None]:
output_token_embeddings = model.encode(
    [first_sentence], 
    output_value="token_embeddings"
)
console.print(output_token_embeddings[0].shape)
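
The two cells above return one vector per sentence and one vector per token. The step that connects them is pooling: the model printed earlier should list a `Pooling` module followed by `Normalize`, and for this model the pooling mode is the mean over tokens. The sketch below shows mean pooling with an attention mask followed by L2 normalization on toy numbers; it is an illustration of the idea, not the library's implementation.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy token embeddings: three real tokens and one padding position
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embedding = l2_normalize(pooled)

print(pooled)
print(sentence_embedding)
```

The padded position is excluded from the average, so padding a batch to a common length does not change a sentence's embedding.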

In [None]:
with torch.no_grad():
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])
    
    first_output_embeddings = model.encode(
        [first_sentence], 
        output_value="token_embeddings"
    )
    second_output_embeddings = model.encode(
        [second_sentence], 
        output_value="token_embeddings"
    )

# Calculate cosine similarity and move it to NumPy so it can be placed in a DataFrame
distances = util.cos_sim(
    first_output_embeddings[0], 
    second_output_embeddings[0]
).cpu().numpy()

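### Visualizing **Output** Token Similarity

Just as we visualized the similarity of **input** tokens from the embedding lookup table, we will now visualize the similarity of the same tokens at the **output**, after the Transformer model has applied positional encodings and attention layers.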
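
To see why the same token can end up with different output vectors, consider a heavily simplified sketch of self-attention. It uses a single head with identity projections and omits positional encodings, learned weights, and feed-forward layers, so it is a toy model of the mechanism rather than what the encoder actually computes. The 2D vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Toy single-head attention: each output is a weighted mix of all inputs."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs


# Hypothetical input embeddings
interest = [1.0, 0.0]
politics = [0.0, 1.0]
bank = [1.0, 1.0]

context_a = self_attention([interest, politics])
context_b = self_attention([interest, bank])

print(context_a[0])  # "interest" next to "politics"
print(context_b[0])  # "interest" next to "bank"
```

The input vector for `interest` is identical in both calls, but its output differs because each output is a mixture of the surrounding tokens.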

In [None]:


# Get token labels
x_labels = model.tokenizer.convert_ids_to_tokens(second_tokens["input_ids"][0])
y_labels = model.tokenizer.convert_ids_to_tokens(first_tokens["input_ids"][0])

# Create a DataFrame for Altair
data = pd.DataFrame(
    [(x, y, distances[i, j]) for i, y in enumerate(y_labels) for j, x in enumerate(x_labels)],
    columns=['x', 'y', 'similarity']
)

# Create heatmap using Altair
chart = alt.Chart(data).mark_rect().encode(
    x=alt.X('x:O', title='Second Sentence Tokens', axis=alt.Axis(labelAngle=-45), sort=x_labels),
    y=alt.Y('y:O', title='First Sentence Tokens', sort=y_labels),
    color=alt.Color('similarity:Q', scale=alt.Scale(scheme='yellowgreenblue', domain=[0, 1])),
    tooltip=['x', 'y', alt.Tooltip('similarity:Q', format='.2f')]
).properties(
    width=500,
    height=400,
    title='Output Token Similarity Heatmap'
)

# Add text labels
text = chart.mark_text(baseline='middle').encode(
    text=alt.Text('similarity:Q', format='.2f'),
    color=alt.condition(
        alt.datum.similarity > 0.5,
        alt.value('white'),
        alt.value('black')
    )
)

# Combine chart and text
final_chart = (chart + text).configure_title(fontSize=16)

# Display the chart
final_chart


In [None]:
# Calculate cosine distance between output embeddings
from sklearn.metrics.pairwise import cosine_distances
from rich.panel import Panel
from rich.table import Table

def calculate_sentence_similarity(first_sentence, second_sentence):

    first_embeddings = model.encode([first_sentence])
    second_embeddings = model.encode([second_sentence])

    # Reshape the embeddings to 2D arrays
    first_embedding_2d = first_embeddings.reshape(1, -1)
    second_embedding_2d = second_embeddings.reshape(1, -1)

    # Calculate cosine distance
    cosine_distance = cosine_distances(first_embedding_2d, second_embedding_2d)[0][0]

    # Note: Cosine distance is 1 - cosine similarity
    cosine_similarity = 1 - cosine_distance

    console.print(
        Panel(
            f"[cyan bold]First Sentence:[/cyan bold] {first_sentence}\n"
            f"[cyan bold]Second Sentence:[/cyan bold] {second_sentence}",
            title="[green bold]Similarity Calculation[/green bold]",
            expand=False,
            border_style="dim white"
        )
    )

    results = Table(title="Results")
    results.add_column("Metric", style="bold")
    results.add_column("Value", style="bold")
    results.add_row("Cosine Distance", f"{cosine_distance:.4f}", style="cyan")
    results.add_row("Cosine Similarity", f"{cosine_similarity:.4f}", style="bright_yellow")

    console.print(results)


In [None]:
calculate_sentence_similarity(first_sentence, second_sentence)

In [None]:
third_sentence = "Chase increased its lending fees"

calculate_sentence_similarity(second_sentence, third_sentence)

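## Improved Encoder for Queries and Documents (Bi-Encoder)

We will use [Contextual Document Embeddings (CDE)](https://huggingface.co/jxm/cde-small-v1), one of the popular models on the Hugging Face model hub. The code below loads the `jxm/cde-small-v2` checkpoint.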

In [2]:
import transformers
import sys
print(sys.executable)
# improved_model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
# tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
print("transformers :", transformers.__version__)
improved_model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

e:\MySpeace\advanced-rag\.venv\Scripts\python.exe
transformers : 4.49.0


Disabled 23 dropout modules from model type <class 'transformers_modules.jxm.cde-small-v2.287bf0ea6ebfecf2339762d0ef28fb846959a8f2.model.BiEncoder'>
Disabled 46 dropout modules from model type <class 'transformers_modules.jxm.cde-small-v2.287bf0ea6ebfecf2339762d0ef28fb846959a8f2.model.ContextualDocumentEmbeddingTransformer'>


In [None]:
console.print(improved_model)

In [None]:
from datasets import load_dataset

corpus = load_dataset("BeIR/fiqa", "corpus")["corpus"]
queries = load_dataset("BeIR/fiqa", "queries")["queries"]

### Dataset Samples

Let's look at some example documents and queries from the [Financial Opinion Mining and Question Answering (FiQA) dataset](https://huggingface.co/datasets/BeIR/fiqa).

In [None]:
import pandas as pd
from tabulate import tabulate

console.rule("Corpus Sample")
print(tabulate( 
    corpus
    .to_pandas()
    .head(10)
    .assign(text_start=lambda x: x['text'].str[:100])
    .drop(columns=['text','title'])
    ,headers='keys', 
    tablefmt='github', 
    showindex=False
))


In [None]:
console.rule("Queries Sample")
print(tabulate( 
    queries
    .to_pandas()
    .head(10)
    .assign(text_start=lambda x: x['text'].str[:100])
    .drop(columns=['text','title'])
    ,headers='keys', 
    tablefmt='github',
    showindex=False
))

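## Stage 1: Gather Dataset Embeddings

CDE works by first computing a set of embeddings from corpus documents, meant to represent the corpus as a whole. We first sample a number of documents from the corpus (the model was trained with 512 documents per context) and get their embeddings from the first-stage model.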

In [None]:
query_prefix = "search_query: "
document_prefix = "search_document: "

In [None]:
import random

def process_ex_document(ex: dict) -> dict:
  ex["text"] = f"{ex['title']} {ex['text']}"
  return ex

corpus_size = improved_model.config.transductive_corpus_size
console.print(f"Choosing {corpus_size} out of {len(corpus)} documents")
minicorpus_docs = corpus.select(random.choices(list(range(len(corpus))), k=corpus_size))
minicorpus_docs = minicorpus_docs.map(process_ex_document)["text"]
minicorpus_docs = tokenizer(
    [document_prefix + doc for doc in minicorpus_docs],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)
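
One detail of the sampling cell worth knowing: `random.choices` draws with replacement, so the mini-corpus may contain the same document more than once. The sketch below contrasts it with `random.sample`, which draws without replacement. The corpus here is just a list of toy integer IDs, and whether duplicates matter for CDE is not something this notebook establishes, so treat `random.sample` as an option rather than a required change.

```python
import random

random.seed(42)

corpus_ids = list(range(1000))  # stand-in for document indices
k = 512                         # toy sample size

# With replacement: the same index may be drawn several times
with_replacement = random.choices(corpus_ids, k=k)

# Without replacement: every index is distinct (requires k <= len(corpus_ids))
without_replacement = random.sample(corpus_ids, k=k)

print("unique with replacement:   ", len(set(with_replacement)))
print("unique without replacement:", len(set(without_replacement)))
```

Note that `random.sample` raises `ValueError` when `k` is larger than the population, while `random.choices` works for any `k`.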

In [None]:
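import torch
# Prefer CUDA, then Apple MPS, then fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
improved_model.to(device)  # the CDE model must be on the same device as its inputs
minicorpus_docs = minicorpus_docs.to(device)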

In [None]:
import torch
from tqdm.autonotebook import tqdm

batch_size = 32

dataset_embeddings = []
for i in tqdm(range(0, len(minicorpus_docs["input_ids"]), batch_size)):
    minicorpus_docs_batch = {k: v[i:i+batch_size] for k,v in minicorpus_docs.items()}
    with torch.no_grad():
        dataset_embeddings.append(
            improved_model.first_stage_model(**minicorpus_docs_batch)
        )

dataset_embeddings = torch.cat(dataset_embeddings)

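## Stage 2: Embed in Context

Now that we have the dataset embeddings, we can embed queries and documents as usual. We only need to pass one extra argument (called `dataset_embeddings` in the CDE code).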

In [None]:
sample_docs = corpus.select(range(16)).map(process_ex_document)["text"]

docs_tokens = tokenizer(
    [document_prefix + doc for doc in sample_docs],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
).to(device)

with torch.no_grad():
  doc_embeddings = improved_model.second_stage_model(
      input_ids=docs_tokens["input_ids"],
      attention_mask=docs_tokens["attention_mask"],
      dataset_embeddings=dataset_embeddings,
  )
doc_embeddings /= doc_embeddings.norm(p=2, dim=1, keepdim=True)

In [None]:
queries_sample = queries.select(range(16))["text"]
queries_tokens = tokenizer(
    [query_prefix + query for query in queries_sample],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
).to(device)

with torch.no_grad():
  query_embeddings = improved_model.second_stage_model(
      input_ids=queries_tokens["input_ids"],
      attention_mask=queries_tokens["attention_mask"],
      dataset_embeddings=dataset_embeddings,
  )
query_embeddings /= query_embeddings.norm(p=2, dim=1, keepdim=True)
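
Both embedding sets are divided by their L2 norm, so a plain dot product between a query and a document equals their cosine similarity. A minimal sketch of how such scores can be used to rank documents for a query follows; the unit-length 2D vectors and the `rank_documents` helper are illustrative assumptions, not part of the CDE code.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query, docs, top_k=2):
    """Return (document index, score) pairs, best match first."""
    scores = [(i, dot(query, doc)) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy unit-length embeddings
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.2f}")
```

With the real tensors, the full score matrix is `doc_embeddings @ query_embeddings.T`, which is what the comparison heatmap at the end of this notebook plots.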

### Model Comparison

Let's compare the two models (the basic model and the improved, context-aware model) on the sample of documents and queries.

In [None]:
with torch.no_grad():
  doc_basic_embeddings = model.encode(sample_docs)


In [None]:
with torch.no_grad():
  queries_basic_embeddings = model.encode(queries_sample)

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Heatmap for improved model
sns.heatmap((doc_embeddings @ query_embeddings.T).cpu(), cmap="jet", ax=ax1, vmin=0, vmax=1)
ax1.set_title("Improved Model", fontsize=16)

# Heatmap for basic model
sns.heatmap((doc_basic_embeddings @ queries_basic_embeddings.T), cmap="jet", ax=ax2 ,vmin=0, vmax=1)
ax2.set_title("Basic Model", fontsize=16)

plt.tight_layout()
console.rule("Embedding Model Comparison")
plt.show()