## Text Embedding Models

Head to [Integrations](https://python.langchain.com/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.

* The `Embeddings` class is designed for interfacing with text embedding models.

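The class exposes two methods:

1. `embed_documents`: takes multiple texts as input.

2. `embed_query`: takes a single text as input.

The two are kept separate because many embedding providers use different methods for embedding documents (the texts to be searched over) and queries (the search query itself). Having both in the base class gives every provider a uniform interface.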
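
To make the two-method shape concrete, here is a minimal sketch of a class exposing both methods. `ToyEmbeddings` and its hash-based vectors are hypothetical stand-ins for illustration, not a LangChain class, and the vectors carry no semantic meaning.

```python
import hashlib
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in that mimics the two-method embedding interface."""

    def __init__(self, size: int = 8) -> None:
        self.size = size  # vector length; real models are much larger

    def _embed(self, text: str) -> List[float]:
        # Derive deterministic pseudo-values from a SHA-256 digest of the text.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i] / 255.0 for i in range(self.size)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out. A real provider may use a
        # different method here than for documents.
        return self._embed(text)


toy = ToyEmbeddings()
doc_vectors = toy.embed_documents(["Hi there!", "Hello World!"])
query_vector = toy.embed_query("Hello World!")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```

In this toy version both methods share one code path; the point is only that callers can rely on the same two entry points regardless of provider.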

In [5]:
from langchain.embeddings import OpenAIEmbeddings

In [18]:
embedding_model = OpenAIEmbeddings()

### embed_documents

In [19]:
embeddings = embedding_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

In [20]:
len(embeddings), len(embeddings[0])

(5, 1536)

In [21]:
len(embeddings[1])

1536

> `OpenAIEmbeddings` encodes each text as a vector of length 1536.

### embed_query

In [22]:
embedded_query = embedding_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[0.0053546813655943075,
 -0.0005715346531097275,
 0.038875909934336914,
 -0.0029596003572924623,
 -0.008966285328704282]

In [23]:
len(embedded_query)

1536

> The query vector also has length 1536.
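
Query and document vectors are typically compared with cosine similarity. The sketch below uses short made-up vectors rather than real 1536-dimensional embeddings, and `cosine_similarity` is a helper defined here, not a LangChain function.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_documents = [
    [1.0, 0.0, 0.9],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
]

scores = [cosine_similarity(toy_query, doc) for doc in toy_documents]
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, scores)
```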

## Caching

> Embeddings can be cached using a `CacheBackedEmbeddings`. The cache-backed embedder is a wrapper around an embedder that caches embeddings in a key-value store.

**The text is hashed, and the hash is used as the key in the key-value store.**
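
The wrapper idea can be sketched in a few lines. This is a simplified illustration, not the `CacheBackedEmbeddings` implementation: `DictCachedEmbedder` and `CountingEmbedder` are hypothetical names, the store is a plain `dict`, and SHA-1 is an assumed choice of hash.

```python
import hashlib
from typing import Dict, List


def _key(text: str) -> str:
    # Assumed hash choice for this sketch.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class CountingEmbedder:
    """Hypothetical underlying embedder that counts how many texts it embeds."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += len(texts)
        return [[float(len(text))] for text in texts]


class DictCachedEmbedder:
    """Looks up each text by its hash and only embeds the cache misses."""

    def __init__(self, underlying: CountingEmbedder) -> None:
        self.underlying = underlying
        self.store: Dict[str, List[float]] = {}

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # dict.fromkeys drops repeated texts while keeping their order.
        missing = [t for t in dict.fromkeys(texts) if _key(t) not in self.store]
        if missing:
            vectors = self.underlying.embed_documents(missing)
            for text, vector in zip(missing, vectors):
                self.store[_key(text)] = vector
        return [self.store[_key(text)] for text in texts]


base = CountingEmbedder()
cached = DictCachedEmbedder(base)
first = cached.embed_documents(["hello", "goodbye"])
second = cached.embed_documents(["hello", "goodbye"])
print(base.calls, first == second)
```

The second call never reaches the underlying embedder, which is the behaviour the timing cells later in this section rely on.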

The main supported way to initialize a `CacheBackedEmbeddings` is `from_bytes_store`. It takes the following parameters:

* underlying_embedder: The embedder to use for embedding.

* document_embedding_cache: The cache to use for storing document embeddings.

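* namespace: (optional, defaults to ""): The namespace to use for the document cache. It prevents collisions between different caches in the same store.
> Within the same store, <font color=blue>if the same text is embedded with different embedding models, each model must use a different namespace; otherwise the cache entries collide</font>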
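
The cache keys listed later in this section look like the namespace followed by a version-5 UUID. A sketch of how such a key could be built is shown here. `make_cache_key` and `ASSUMED_UUID_NAMESPACE` are hypothetical, and the exact hashing scheme is an assumption, so the keys produced below are not expected to match the real cache keys.

```python
import hashlib
import uuid

# Assumption: any fixed UUID works as the uuid5 namespace for this sketch.
ASSUMED_UUID_NAMESPACE = uuid.UUID(int=0)


def make_cache_key(namespace: str, text: str) -> str:
    """Prefix a deterministic hash of the text with the cache namespace."""
    text_hash = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return namespace + str(uuid.uuid5(ASSUMED_UUID_NAMESPACE, text_hash))


key_a = make_cache_key("model-a", "hello")
key_b = make_cache_key("model-b", "hello")
key_empty_1 = make_cache_key("", "hello")
key_empty_2 = make_cache_key("", "hello")

print(key_a != key_b)              # different namespaces give different keys
print(key_empty_1 == key_empty_2)  # same namespace and text give the same key
```

With the default empty namespace, two different models embedding `"hello"` would read and write the same key, which is the collision the note above warns about.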

In [1]:
from langchain.storage import InMemoryStore, LocalFileStore, RedisStore, UpstashRedisStore
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings

### Vector Store

First, let's see an example that uses the local file system for storing embeddings and the FAISS vector store for retrieval.

<font color=blue>Are storage and retrieval separate?</font>

In [2]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [4]:
underlying_embeddings = OpenAIEmbeddings()

In [5]:
fs = LocalFileStore("./cache/")

In [6]:
cached_embedder = CacheBackedEmbeddings.from_bytes_store(underlying_embeddings, fs, namespace=underlying_embeddings.model)

In [7]:
underlying_embeddings.model

'text-embedding-ada-002'

In [8]:
# The cache is empty prior to embedding:
list(fs.yield_keys())

[]

> Load the document, split it into chunks, embed each chunk and load it into the vector store.

In [9]:
raw_documents = TextLoader('./input/state_of_the_union.txt').load()

In [10]:
len(raw_documents)

1

In [11]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

In [12]:
documents = text_splitter.split_documents(raw_documents)

In [13]:
len(documents)

42
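
As a rough picture of what the splitting step does, the sketch below splits on blank lines and greedily merges pieces up to `chunk_size` characters. It is a simplified approximation under stated assumptions (blank-line separator, no overlap handling), not the `CharacterTextSplitter` implementation, and `split_on_blank_lines` is a hypothetical helper.

```python
from typing import List


def split_on_blank_lines(text: str, chunk_size: int) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks."""
    separator = "\n\n"
    chunks: List[str] = []
    current = ""
    for paragraph in text.split(separator):
        if not paragraph:
            continue
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            # Still fits, or a single oversized paragraph that is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample_text = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
sample_chunks = split_on_blank_lines(sample_text, chunk_size=10)
print(sample_chunks)
```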

In [14]:
print(documents[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.


**Create the vector store:**

In [15]:
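%%time
db = FAISS.from_documents(documents, cached_embedder)

> The timing originally recorded for this cell (`Wall time: 6.2 µs`) came from `%time` on a line by itself, which times an empty statement rather than the call. `%%time` times the whole cell.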


> If we try to create the vector store again, it'll be much faster since it does not need to re-compute any embeddings.

In [17]:
%%time
db2 = FAISS.from_documents(documents, cached_embedder)

> As recorded, `%time` stood on its own line and timed an empty statement (`Wall time: 5.25 µs`); `%%time` measures the call itself.


In [18]:
list(fs.yield_keys())[:5]

['text-embedding-ada-00220c8f906-bea3-5e8c-b01a-e5ecfa990007',
 'text-embedding-ada-002464862c8-03d2-5854-b32c-65a075e612a2',
 'text-embedding-ada-0025ba09d7e-6a58-5c76-b038-5d8636e5ea25',
 'text-embedding-ada-002812fdf9a-5fca-504e-9890-b93dd6a8b22c',
 'text-embedding-ada-002305efb5c-3f01-5657-bcf2-2b92fb1747ca']

In [19]:
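import json

with open('./cache/text-embedding-ada-002464862c8-03d2-5854-b32c-65a075e612a2') as f:
    # The cached value parses as a JSON list of floats; json.loads avoids calling eval on file contents.
    arr = json.loads(f.read())
    print('embedding size is ', len(arr))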

embedding size is  1536


> The cached vector also has length 1536, as expected for `text-embedding-ada-002`.

### In Memory

> This type of cache is primarily useful for unit tests or prototyping. Do not use this cache if you need to actually store the embeddings.

In [20]:
store = InMemoryStore()

In [21]:
underlying_embeddings = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

In [22]:
embeddings = embedder.embed_documents(["hello", "goodbye"])

In [23]:
%%time
embeddings_from_cache = embedder.embed_documents(["hello", "goodbye"])

> As recorded, `%time` stood on its own line and timed an empty statement (`Wall time: 6.91 µs`); `%%time` measures the call itself.


In [24]:
embeddings == embeddings_from_cache

True

### File system

> This section covers how to use a file system store.

In [25]:
fs = LocalFileStore("./test_cache/")

In [26]:
embedder2 = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, fs, namespace=underlying_embeddings.model
)

In [27]:
embeddings = embedder2.embed_documents(["hello", "goodbye"])

In [28]:
embeddings = embedder2.embed_documents(["hello", "goodbye"])

In [29]:
list(fs.yield_keys())

['text-embedding-ada-002e885db5b-c0bd-5fbc-88b1-4d1da6020aa5',
 'text-embedding-ada-0026ba52e44-59c9-5cc9-a084-284061b13c80']

### Upstash Redis Store

> A serverless Redis store.

In [30]:
from langchain.storage.upstash_redis import UpstashRedisStore

```python
from upstash_redis import Redis
# Requires signing up for the service and obtaining a token
URL = "<UPSTASH_REDIS_REST_URL>"
TOKEN = "<UPSTASH_REDIS_REST_TOKEN>"

redis_client = Redis(url=URL, token=TOKEN)
store = UpstashRedisStore(client=redis_client, ttl=None, namespace="test-ns")

underlying_embeddings = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)


embeddings = embedder.embed_documents(["welcome", "goodbye"])

embeddings = embedder.embed_documents(["welcome", "goodbye"])

list(store.yield_keys())

list(store.client.scan(0))
```
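
Rather than pasting credentials into the notebook, they can be read from environment variables. The variable names below mirror the placeholders above but are otherwise an assumption. `load_upstash_credentials` is a hypothetical helper and does not contact Upstash.

```python
import os
from typing import Mapping, Tuple


def load_upstash_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (url, token) from the environment, failing loudly if either is missing."""
    try:
        return env["UPSTASH_REDIS_REST_URL"], env["UPSTASH_REDIS_REST_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"Set {missing.args[0]} before creating the Redis client") from None


# Example with an explicit mapping so the sketch runs without real credentials.
example_env = {
    "UPSTASH_REDIS_REST_URL": "https://example.invalid",
    "UPSTASH_REDIS_REST_TOKEN": "not-a-real-token",
}
URL, TOKEN = load_upstash_credentials(example_env)
print(URL)
```

Calling `load_upstash_credentials()` with no argument reads from the real process environment instead.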

### [Redis Store](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings)