# Timescale Vector (Postgres)

>[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) 是专为 AI 应用设计的 `PostgreSQL++` 向量数据库。

本指南将展示如何使用 Postgres 向量数据库 `Timescale Vector`。你将学习如何使用 TimescaleVector 进行 (1) 语义搜索，(2) 基于时间的向量搜索，(3) 自我查询，以及 (4) 如何创建索引以加快查询速度。

## Timescale Vector 是什么？

`Timescale Vector` 使你能够在 `PostgreSQL` 中高效地存储和查询数百万个向量嵌入。
- 通过基于 `DiskANN` 的索引算法，增强 `pgvector`，实现对 1 亿以上向量的更快、更准确的相似性搜索。
- 通过自动化的时间分区和索引，实现快速的基于时间的向量搜索。
- 提供熟悉的 SQL 接口，用于查询向量嵌入和关系数据。

`Timescale Vector` 是面向 AI 的云 `PostgreSQL`，可伴随你从概念验证（POC）扩展到生产环境：
- 通过允许你在单个数据库中存储关系元数据、向量嵌入和时间序列数据，简化了操作。
- 受益于坚如磐石的 PostgreSQL 基础，具备流式备份和复制、高可用性以及行级安全性等企业级功能。
- 通过企业级安全和合规性，带来无忧的使用体验。

## 如何访问 Timescale Vector

`Timescale Vector` 可在云 PostgreSQL 平台 [Timescale](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) 上获得。（目前没有自托管版本。）

LangChain 用户可获得 90 天 Timescale Vector 免费试用。
- 要开始使用，请在 [Timescale 注册](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral)并创建一个新数据库，然后按照本指南操作！
- 有关更多详细信息和性能基准测试，请参阅 [Timescale Vector 解释博客](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral)。
- 有关在 Python 中使用 Timescale Vector 的更多详细信息，请参阅 [安装说明](https://github.com/timescale/python-vector)。

## 设置

请遵循以下步骤准备开始本教程。

In [None]:
# Pip install necessary packages
%pip install --upgrade --quiet  timescale-vector
%pip install --upgrade --quiet  langchain-openai langchain-community
%pip install --upgrade --quiet  tiktoken

在此示例中，我们将使用 `OpenAIEmbeddings`，因此请加载您的 OpenAI API 密钥。

In [1]:
import os

# Run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...
# Get openAI api key by reading local .env file
from dotenv import find_dotenv, load_dotenv

_ = load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

In [None]:
# Get the API key and save it as an environment variable
# import os
# import getpass
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")


In [2]:
from typing import Tuple

接下来，我们将导入所需的 Python 库以及 LangChain 中的库。请注意，我们导入了 `timescale-vector` 库以及 TimescaleVector LangChain 向量存储。

In [3]:
from datetime import datetime, timedelta

from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders.json_loader import JSONLoader
from langchain_community.vectorstores.timescalevector import TimescaleVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

## 1. 欧几里得距离（默认）相似性搜索

首先，我们将通过一个对《国情咨文》进行相似性搜索的示例，来查找与给定查询句最相似的句子。我们将使用 [欧几里得距离](https://en.wikipedia.org/wiki/Euclidean_distance) 作为我们的相似性度量。

In [4]:
# Load the text and split it into chunks
loader = TextLoader("../../../extras/modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

接下来，我们将加载 Timescale 数据库的服务 URL。

如果还没有，请先[注册 Timescale](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) 并创建一个新数据库。

然后，要连接到 PostgreSQL 数据库，您需要服务 URI。该 URI 可以在创建新数据库后下载的备忘单或 `.env` 文件中找到。

URI 的格式大致如下：`postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`。

In [5]:
# Timescale Vector needs the service url to your cloud database. You can see this as soon as you create the
# service in the cloud UI or in your credentials.sql file
SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

# Specify directly if testing
# SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"

# # You can get also it from an environment variables. We suggest using a .env file.
# import os
# SERVICE_URL = os.environ.get("TIMESCALE_SERVICE_URL", "")

接下来，我们创建一个 TimescaleVector 向量存储。我们指定一个 collection name，这将是我们数据存储的表的名称。

注意：在创建 TimescaleVector 的新实例时，TimescaleVector Module 会尝试创建一个与 collection 名称同名的表。因此，请确保 collection name 是唯一的（即它尚未存在）。

In [6]:
# The TimescaleVector Module will create a table with the name of the collection.
COLLECTION_NAME = "state_of_the_union_test"

# Create a Timescale Vector instance from the collection of documents
db = TimescaleVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
)

现在我们已经加载了数据，可以执行相似性搜索了。

In [7]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query)

In [8]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.18443380687035138
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
----------------------

### 将 Timescale Vector 用作检索器
初始化 TimescaleVector 存储后，可以将其用作[检索器](/docs/how_to#retrievers)。

In [9]:
# Use TimescaleVector as a retriever
retriever = db.as_retriever()

In [10]:
print(retriever)

tags=['TimescaleVector', 'OpenAIEmbeddings'] metadata=None vectorstore=<langchain_community.vectorstores.timescalevector.TimescaleVector object at 0x10fc8d070> search_type='similarity' search_kwargs={}


让我们看一个在 RetrievalQA 链和 stuff documents 链中使用 Timescale Vector 作为检索器的示例。

在这个示例中，我们将问与上面相同的问题，但这次我们将把从 Timescale Vector 返回的相关文档传递给 LLM，作为回答问题的上下文。

首先，我们将创建我们的 stuff 链：

In [11]:
# Initialize GPT3.5 model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")

# Initialize a RetrievalQA class from a stuff chain
from langchain.chains import RetrievalQA

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True,
)

In [12]:
query = "What did the president say about Ketanji Brown Jackson?"
response = qa_stuff.run(query)



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


In [13]:
print(response)

The President said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, who is one of our nation's top legal minds and will continue Justice Breyer's legacy of excellence. He also mentioned that since her nomination, she has received a broad range of support from various groups, including the Fraternal Order of Police and former judges appointed by Democrats and Republicans.


## 2. 基于时间的相似性搜索

Timescale Vector 的一个关键用例是高效的基于时间的向量搜索。Timescale Vector 通过自动按时间对向量（及相关元数据）进行分区来实现这一点。这使您能够高效地通过与查询向量的相似性和时间来查询向量。

基于时间的向量搜索功能对于以下应用非常有用：
- 存储和检索 LLM 响应历史（例如，聊天机器人）
- 查找与查询向量相似的最新嵌入（例如，最近的新闻）。
- 将相似性搜索限制在相关的时间范围内（例如，就某个知识库提出基于时间的问题）

为了说明如何使用 TimescaleVector 的基于时间的向量搜索功能，我们将询问有关 TimescaleDB 的 git 日志历史的问题。我们将说明如何添加带有基于时间的 uuid 的文档，以及如何运行带有时间范围过滤器的相似性搜索。

### 从 git log JSON 中提取内容和元数据
首先，我们将 git log 数据加载到 PostgreSQL 数据库的一个名为 `timescale_commits` 的新集合中。

我们将定义一个辅助函数，根据文档的时间戳为其创建 uuid 以及关联的向量嵌入。我们将使用此函数为每个 git log 条目创建 uuid。

重要提示：如果您正在处理文档，并且希望将当前日期和时间与向量关联以进行基于时间的搜索，则可以跳过此步骤。默认情况下，在摄取文档时会自动生成 uuid。

In [15]:
from timescale_vector import client


# Function to take in a date string in the past and return a uuid v1
def create_uuid(date_string: str):
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = client.uuid_from_time(datetime_obj)
    return str(uuid)

接下来，我们将定义一个 metadata 函数，用于从 JSON 记录中提取相关的元数据。我们将把这个函数传递给 JSONLoader。有关更多详细信息，请参阅 [JSON 文档加载器文档](/docs/how_to/document_loader_json)。

In [16]:
# Helper function to split name and email given an author string consisting of Name Lastname <email>
def split_name(input_string: str) -> Tuple[str, str]:
    if input_string is None:
        return None, None
    start = input_string.find("<")
    end = input_string.find(">")
    name = input_string[:start].strip()
    email = input_string[start + 1 : end].strip()
    return name, email


# Helper function to transform a date string into a timestamp_tz string
def create_date(input_string: str) -> datetime:
    if input_string is None:
        return None
    # Define a dictionary to map month abbreviations to their numerical equivalents
    month_dict = {
        "Jan": "01",
        "Feb": "02",
        "Mar": "03",
        "Apr": "04",
        "May": "05",
        "Jun": "06",
        "Jul": "07",
        "Aug": "08",
        "Sep": "09",
        "Oct": "10",
        "Nov": "11",
        "Dec": "12",
    }

    # Split the input string into its components
    components = input_string.split()
    # Extract relevant information
    day = components[2]
    month = month_dict[components[1]]
    year = components[4]
    time = components[3]
    timezone_offset_minutes = int(components[5])  # Convert the offset to minutes
    timezone_hours = timezone_offset_minutes // 60  # Calculate the hours
    timezone_minutes = timezone_offset_minutes % 60  # Calculate the remaining minutes
    # Create a formatted string for the timestamptz in PostgreSQL format
    timestamp_tz_str = (
        f"{year}-{month}-{day} {time}+{timezone_hours:02}{timezone_minutes:02}"
    )
    return timestamp_tz_str


# Metadata extraction function to extract metadata from a JSON record
def extract_metadata(record: dict, metadata: dict) -> dict:
    record_name, record_email = split_name(record["author"])
    metadata["id"] = create_uuid(record["date"])
    metadata["date"] = create_date(record["date"])
    metadata["author_name"] = record_name
    metadata["author_email"] = record_email
    metadata["commit_hash"] = record["commit"]
    return metadata

接下来，您需要[下载示例数据集](https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json)，并将其放置在与此 Notebook 相同的目录中。

您可以使用以下命令：

In [None]:
# Download the file using curl and save it as commit_history.csv
# Note: Execute this command in your terminal, in the same directory as the notebook
!curl -O https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json

最后，我们可以初始化 JSON loader 来解析 JSON 记录。为了简化起见，我们还移除了空记录。

In [17]:
# Define path to the JSON file relative to this notebook
# Change this to the path to your JSON file
FILE_PATH = "../../../../../ts_git_log.json"

# Load data from JSON file and extract metadata
loader = JSONLoader(
    file_path=FILE_PATH,
    jq_schema=".commit_history[]",
    text_content=False,
    metadata_func=extract_metadata,
)
documents = loader.load()

# Remove documents with None dates
documents = [doc for doc in documents if doc.metadata["date"] is not None]

In [18]:
print(documents[0])

page_content='{"commit": "44e41c12ab25e36c202f58e068ced262eadc8d16", "author": "Lakshmi Narayanan Sreethar<lakshmi@timescale.com>", "date": "Tue Sep 5 21:03:21 2023 +0530", "change summary": "Fix segfault in set_integer_now_func", "change details": "When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache.  Fixes #6037 "}' metadata={'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/ts_git_log.json', 'seq_num': 1, 'id': '8b407680-4c01-11ee-96a6-b82284ddccc6', 'date': '2023-09-5 21:03:21+0850', 'author_name': 'Lakshmi Narayanan Sreethar', 'author_email': 'lakshmi@timescale.com', 'commit_hash': '44e41c12ab25e36c202f58e068ced262eadc8d16'}


### 将文档和元数据加载到 TimescaleVector 向量数据库

既然我们已经准备好了文档，现在就将它们与向量嵌入表示一起处理并加载到我们的 TimescaleVector 向量数据库中。

由于这是一个演示，我们只加载前 500 条记录。在实际应用中，您可以加载任意数量的记录。

In [19]:
NUM_RECORDS = 500
documents = documents[:NUM_RECORDS]

然后我们使用 `CharacterTextSplitter` 将文档分割成更小的块（如果需要），以便于嵌入。请注意，此分割过程会保留每个文档的元数据。

In [20]:
# Split the documents into chunks for embedding
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
docs = text_splitter.split_documents(documents)

接下来，我们将从先前完成预处理的文档集合中创建一个 Timescale Vector 实例。

首先，我们将定义一个集合名称，它将是我们 PostgreSQL 数据库中的表名。

我们还将定义一个时间间隔（time delta），并将其传递给 `time_partition_interval` 参数，该参数将用于按时间对数据进行分区。每个分区将包含指定时长的时段数据。为简单起见，我们将使用 7 天，但您可以根据您的用例选择任何有意义的值——例如，如果您频繁查询近期向量，您可能希望使用更小的时间间隔，如 1 天；如果您查询跨度长达十年的向量，您可能希望使用更大的时间间隔，如 6 个月或 1 年。

最后，我们将创建 TimescaleVector 实例。我们将 `ids` 参数指定为我们在上述预处理步骤中创建的元数据中的 `uuid` 字段。我们这样做是因为我们希望我们的 UUID 的时间部分反映过去（即提交发生的时间）。但是，如果我们希望当前日期和时间与我们的文档相关联，我们可以删除 `id` 参数，UUID 将自动创建为当前日期和时间。

In [21]:
# Define collection name
COLLECTION_NAME = "timescale_commits"
embeddings = OpenAIEmbeddings()

# Create a Timescale Vector instance from the collection of documents
db = TimescaleVector.from_documents(
    embedding=embeddings,
    ids=[doc.metadata["id"] for doc in docs],
    documents=docs,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    time_partition_interval=timedelta(days=7),
)

### 按时间和相似度查询向量

现在我们已将文档加载到 TimescaleVector 中，可以按时间和相似度进行查询。

TimescaleVector 提供了多种通过相似度搜索并进行基于时间过滤来查询向量的方法。

下面我们来看看每种方法：

In [31]:
# Time filter variables
start_dt = datetime(2023, 8, 1, 22, 10, 35)  # Start date = 1 August 2023, 22:10:35
end_dt = datetime(2023, 8, 30, 22, 10, 35)  # End date = 30 August 2023, 22:10:35
td = timedelta(days=7)  # Time delta = 7 days

query = "What's new with TimescaleDB functions?"

方法 1：在提供的开始日期和结束日期内进行筛选。

In [32]:
# Method 1: Query for vectors between start_date and end_date
docs_with_score = db.similarity_search_with_score(
    query, start_date=start_dt, end_date=end_dt
)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.17488396167755127
Date:  2023-08-29 18:13:24+0320
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Aug 29 18:13:24 2023 +0200", "change summary": "Add compatibility layer for _timescaledb_internal functions", "change details": "With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.18102192878723145
Date:  2023-08-20 22:47:10+0320
{"commit": " 0a66bdb8d36a1879246bd652e4c28500c4b951ab", "author": "Sven Klemm<sven@timescale.

请注意，此查询仅返回指定日期范围内的结果。

方法二：在提供的开始日期和之后的时间差内进行筛选。

In [33]:
# Method 2: Query for vectors between start_dt and a time delta td later
# Most relevant vectors between 1 August and 7 days later
docs_with_score = db.similarity_search_with_score(
    query, start_date=start_dt, time_delta=td
)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.18458807468414307
Date:  2023-08-3 14:30:23+0500
{"commit": " 7aeed663b9c0f337b530fd6cad47704a51a9b2ec", "author": "Dmitry Simonenko<dmitry@timescale.com>", "date": "Thu Aug 3 14:30:23 2023 +0300", "change summary": "Feature flags for TimescaleDB features", "change details": "This PR adds several GUCs which allow to enable/disable major timescaledb features:  - enable_hypertable_create - enable_hypertable_compression - enable_cagg_create - enable_policy_create "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.20492422580718994
Date:  2023-08-7 18:31:40+0320
{"commit": " 07762ea4cedefc88497f0d1f8712d1515cdc5b6e", "author": "Sven Klemm<sven@timescale.com>", "date": "Mon Aug 7 18:31:40 2023 +0200", "change summary": "Test timescaledb debian 12 packages in CI", "change details"

再次注意我们得到了指定时间筛选范围内的结果，这与之前的查询不同。

方法 3：在提供的结束日期和提前一定时间段内进行筛选。

In [34]:
# Method 3: Query for vectors between end_dt and a time delta td earlier
# Most relevant vectors between 30 August and 7 days earlier
docs_with_score = db.similarity_search_with_score(query, end_date=end_dt, time_delta=td)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.17488396167755127
Date:  2023-08-29 18:13:24+0320
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Aug 29 18:13:24 2023 +0200", "change summary": "Add compatibility layer for _timescaledb_internal functions", "change details": "With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.18496227264404297
Date:  2023-08-29 10:49:47+0320
{"commit": " a9751ccd5eb030026d7b975d22753f5964972389", "author": "Sven Klemm<sven@timescale.

方法 4：我们也可以通过在查询中仅指定一个开始日期来过滤给定日期之后的所有向量。

方法 5：同样，我们也可以通过在查询中仅指定一个结束日期来过滤给定日期之前的所有向量。

In [35]:
# Method 4: Query all vectors after start_date
docs_with_score = db.similarity_search_with_score(query, start_date=start_dt)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.17488396167755127
Date:  2023-08-29 18:13:24+0320
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Aug 29 18:13:24 2023 +0200", "change summary": "Add compatibility layer for _timescaledb_internal functions", "change details": "With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.18102192878723145
Date:  2023-08-20 22:47:10+0320
{"commit": " 0a66bdb8d36a1879246bd652e4c28500c4b951ab", "author": "Sven Klemm<sven@timescale.

In [36]:
# Method 5: Query all vectors before end_date
docs_with_score = db.similarity_search_with_score(query, end_date=end_dt)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.16723191738128662
Date:  2023-04-11 22:01:14+0320
{"commit": " 0595ff0888f2ffb8d313acb0bda9642578a9ade3", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Apr 11 22:01:14 2023 +0200", "change summary": "Move type support functions into _timescaledb_functions schema", "change details": ""}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.1706540584564209
Date:  2023-04-6 13:00:00+0320
{"commit": " 04f43335dea11e9c467ee558ad8edfc00c1a45ed", "author": "Sven Klemm<sven@timescale.com>", "date": "Thu Apr 6 13:00:00 2023 +0200", "change summary": "Move aggregate support function into _timescaledb_functions", "change details": "This patch moves the support functions for histogram, first and last into the _timescaledb_functions schema. Since we alter the schema of the existing

主要收获是，在上面得到的每个结果中，只返回了指定时间范围内的向量。这些查询非常高效，因为它们只需要搜索相关的分区。

我们也可以将此功能用于问答，在这种情况下，我们希望在指定的时间范围内找到最相关的向量，以便将它们作为上下文来回答问题。下面我们来看一个使用 Timescale Vector 作为检索器的示例：

In [39]:
# Set timescale vector as a retriever and specify start and end dates via kwargs
retriever = db.as_retriever(search_kwargs={"start_date": start_dt, "end_date": end_dt})

In [42]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")

from langchain.chains import RetrievalQA

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True,
)

query = (
    "What's new with the timescaledb functions? Tell me when these changes were made."
)
response = qa_stuff.run(query)
print(response)



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
The following changes were made to the timescaledb functions:

1. "Add compatibility layer for _timescaledb_internal functions" - This change was made on Tue Aug 29 18:13:24 2023 +0200.
2. "Move functions to _timescaledb_functions schema" - This change was made on Sun Aug 20 22:47:10 2023 +0200.
3. "Move utility functions to _timescaledb_functions schema" - This change was made on Tue Aug 22 12:01:19 2023 +0200.
4. "Move partitioning functions to _timescaledb_functions schema" - This change was made on Tue Aug 29 10:49:47 2023 +0200.


请注意，LLM 用来撰写答案的上下文仅来自于指定日期范围内检索到的文档。

这展示了如何使用 Timescale Vector 通过检索与您的查询相关的日期范围内的文档来增强检索增强生成。

## 3. 使用 ANN 搜索索引加速查询

可以通过在嵌入列上创建索引来加速相似性查询。建议在摄取了大部分数据后再执行此操作。

Timescale Vector 支持以下索引：
- timescale_vector 索引 (tsv)：一种受 disk-ann 启发的图形索引，用于快速相似性搜索（默认）。
- pgvector 的 HNSW 索引：一种分层可导航小世界图索引，用于快速相似性搜索。
- pgvector 的 IVFFLAT 索引：一种倒排文件索引，用于快速相似性搜索。

重要提示：在 PostgreSQL 中，每个表只能在一列上有一个索引。因此，如果您想测试不同索引类型的性能，可以通过以下几种方式实现：(1) 创建具有不同索引的多个表，(2) 在同一表中创建多个向量列，并在每个列上创建不同的索引，或者 (3) 在同一列上删除并重新创建索引并比较结果。

In [43]:
# Initialize an existing TimescaleVector store
COLLECTION_NAME = "timescale_commits"
embeddings = OpenAIEmbeddings()
db = TimescaleVector(
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    embedding_function=embeddings,
)

使用 `create_index()` 函数而不带额外参数将默认创建 `timescale_vector_index`，并使用默认参数。

In [44]:
# create an index
# by default this will create a Timescale Vector (DiskANN) index
db.create_index()

你还可以为索引指定参数。有关不同参数及其对性能影响的详细讨论，请参阅 Timescale Vector 文档。

注意：你不需要指定参数，因为我们已经设置了智能默认值。但如果你想为特定数据集进行实验以榨取更多性能，也可以随时指定自己的参数。

In [45]:
# drop the old index
db.drop_index()

# create an index
# Note: You don't need to specify m and ef_construction parameters as we set smart defaults.
db.create_index(index_type="tsv", max_alpha=1.0, num_neighbors=50)

Timescale Vector 还支持 HNSW ANN 索引算法以及 ivfflat ANN 索引算法。只需在 `index_type` 参数中指定您想要创建的索引，并可选择性地指定索引的参数。

In [46]:
# drop the old index
db.drop_index()

# Create an HNSW index
# Note: You don't need to specify m and ef_construction parameters as we set smart defaults.
db.create_index(index_type="hnsw", m=16, ef_construction=64)

In [47]:
# drop the old index
db.drop_index()

# Create an IVFFLAT index
# Note: You don't need to specify num_lists and num_records parameters as we set smart defaults.
db.create_index(index_type="ivfflat", num_lists=20, num_records=1000)

通常，我们建议使用默认的时间尺度向量索引，或 HNSW 索引。

In [48]:
# drop the old index
db.drop_index()
# Create a new timescale vector index
db.create_index()

## 4. 自我查询检索器与 Timescale Vector

Timescale Vector 也支持自我查询检索器功能，这使其能够查询自身。给定一个包含查询语句和筛选器（单个或复合）的自然语言查询，检索器会使用 LLM 查询构造链来编写 SQL 查询，然后将其应用于 Timescale Vector 向量存储中的底层 PostgreSQL 数据库。

关于自我查询的更多信息，请参阅[文档](/docs/how_to/self_query)。

为了说明使用 Timescale Vector 进行自查询，我们将使用与第 3 部分相同的 gitlog 数据集。

In [49]:
COLLECTION_NAME = "timescale_commits"
vectorstore = TimescaleVector(
    embedding_function=OpenAIEmbeddings(),
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
)

接下来，我们将创建自查询检索器。为此，我们需要预先提供有关文档支持的元数据字段以及文档内容简短描述的一些信息。

In [50]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI

# Give LLM info about the metadata fields
metadata_field_info = [
    AttributeInfo(
        name="id",
        description="A UUID v1 generated from the date of the commit",
        type="uuid",
    ),
    AttributeInfo(
        name="date",
        description="The date of the commit in timestamptz format",
        type="timestamptz",
    ),
    AttributeInfo(
        name="author_name",
        description="The name of the author of the commit",
        type="string",
    ),
    AttributeInfo(
        name="author_email",
        description="The email address of the author of the commit",
        type="string",
    ),
]
document_content_description = "The git log commit summary containing the commit hash, author, date of commit, change summary and change details"

# Instantiate the self-query retriever from an LLM
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

现在让我们在 gitlog 数据集上测试自查询检索器。

运行下面的查询，并注意你可以用自然语言指定单个查询、带过滤器的查询以及带复合过滤器的查询（带 AND、OR 的过滤器），自查询检索器会将该查询翻译成 SQL 并对 Timescale Vector PostgreSQL 向量存储执行搜索。

这说明了自查询检索器的强大之处。你可以使用它对向量存储执行复杂搜索，而你或你的用户无需直接编写任何 SQL！

In [51]:
# This example specifies a relevant query
retriever.invoke("What are improvements made to continuous aggregates?")



query='improvements to continuous aggregates' filter=None limit=None


[Document(page_content='{"commit": " 35c91204987ccb0161d745af1a39b7eb91bc65a5", "author": "Fabr\\u00edzio de Royes Mello<fabriziomello@gmail.com>", "date": "Thu Nov 24 13:19:36 2022 -0300", "change summary": "Add Hierarchical Continuous Aggregates validations", "change details": "Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. "}', metadata={'id': 'c98d1c00-6c13-11ed-9bbe-23925ce74d13', 'date

In [52]:
# This example specifies a filter
retriever.invoke("What commits did Sven Klemm add?")

query=' ' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author_name', value='Sven Klemm') limit=None


[Document(page_content='{"commit": " e2e7ae304521b74ac6b3f157a207da047d44ab06", "author": "Sven Klemm<sven@timescale.com>", "date": "Fri Mar 3 11:22:06 2023 +0100", "change summary": "Don\'t run sanitizer test on individual PRs", "change details": "Sanitizer tests take a long time to run so we don\'t want to run them on individual PRs but instead run them nightly and on commits to master. "}', metadata={'id': '3f401b00-b9ad-11ed-b5ea-a3fd40b9ac16', 'date': '2023-03-3 11:22:06+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 295, 'author_name': 'Sven Klemm', 'commit_hash': ' e2e7ae304521b74ac6b3f157a207da047d44ab06', 'author_email': 'sven@timescale.com'}),
 Document(page_content='{"commit": " d8f19e57a04d17593df5f2c694eae8775faddbc7", "author": "Sven Klemm<sven@timescale.com>", "date": "Wed Feb 1 08:34:20 2023 +0100", "change summary": "Bump version of setup-wsl github action", "change details": 

In [53]:
# This example specifies a query and filter
retriever.invoke("What commits about timescaledb_functions did Sven Klemm add?")

query='timescaledb_functions' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author_name', value='Sven Klemm') limit=None


[Document(page_content='{"commit": " 04f43335dea11e9c467ee558ad8edfc00c1a45ed", "author": "Sven Klemm<sven@timescale.com>", "date": "Thu Apr 6 13:00:00 2023 +0200", "change summary": "Move aggregate support function into _timescaledb_functions", "change details": "This patch moves the support functions for histogram, first and last into the _timescaledb_functions schema. Since we alter the schema of the existing functions in upgrade scripts and do not change the aggregates this should work completely transparently for any user objects using those aggregates. "}', metadata={'id': '2cb47800-d46a-11ed-8f0e-2b624245c561', 'date': '2023-04-6 13:00:00+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 233, 'author_name': 'Sven Klemm', 'commit_hash': ' 04f43335dea11e9c467ee558ad8edfc00c1a45ed', 'author_email': 'sven@timescale.com'}),
 Document(page_content='{"commit": " feef9206facc5c5f506661de4a81d96ef0

In [54]:
# This example specifies a time-based filter
retriever.invoke("What commits were added in July 2023?")

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='date', value='2023-07-01T00:00:00Z'), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='date', value='2023-07-31T23:59:59Z')]) limit=None


[Document(page_content='{"commit": " 5cf354e2469ee7e43248bed382a4b49fc7ccfecd", "author": "Markus Engel<engel@sero-systems.de>", "date": "Mon Jul 31 11:28:25 2023 +0200", "change summary": "Fix quoting owners in sql scripts.", "change details": "When referring to a role from a string type, it must be properly quoted using pg_catalog.quote_ident before it can be casted to regrole. Fixed this, especially in update scripts. "}', metadata={'id': '99590280-2f84-11ee-915b-5715b2447de4', 'date': '2023-07-31 11:28:25+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 76, 'author_name': 'Markus Engel', 'commit_hash': ' 5cf354e2469ee7e43248bed382a4b49fc7ccfecd', 'author_email': 'engel@sero-systems.de'}),
 Document(page_content='{"commit": " 88aaf23ae37fe7f47252b87325eb570aa417c607", "author": "noctarius aka Christoph Engelbert<me@noctarius.com>", "date": "Wed Jul 12 14:53:40 2023 +0200", "change summary": "

In [55]:
# This example specifies a query and a LIMIT value
retriever.invoke("What are two commits about hierarchical continuous aggregates?")

query='hierarchical continuous aggregates' filter=None limit=2


[Document(page_content='{"commit": " 35c91204987ccb0161d745af1a39b7eb91bc65a5", "author": "Fabr\\u00edzio de Royes Mello<fabriziomello@gmail.com>", "date": "Thu Nov 24 13:19:36 2022 -0300", "change summary": "Add Hierarchical Continuous Aggregates validations", "change details": "Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. "}', metadata={'id': 'c98d1c00-6c13-11ed-9bbe-23925ce74d13', 'date

## 5. 使用现有的 TimescaleVector 向量库

在上面的示例中，我们从文档集合中创建了一个向量库。然而，我们通常希望处理（插入和查询）现有向量库中的数据。下面我们将展示如何初始化、向其中添加文档以及查询 TimescaleVector 向量库中现有的文档集合。

要使用现有的 Timescale Vector 向量库，我们需要知道要查询的表的名称（`COLLECTION_NAME`）以及云 PostgreSQL 数据库的 URL（`SERVICE_URL`）。

In [56]:
# Initialize the existing
COLLECTION_NAME = "timescale_commits"
embeddings = OpenAIEmbeddings()
vectorstore = TimescaleVector(
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    embedding_function=embeddings,
)

要将新数据加载到表中，我们使用 `add_document()` 函数。此函数接受一个文档列表和一个元数据列表。元数据必须包含每个文档的唯一 id。

如果您希望文档与当前日期和时间关联，则无需创建 id 列表。系统将为每个文档自动生成一个 uuid。

如果您希望文档与过去的日期和时间关联，可以使用 `timescale-vector` Python 库中的 `uuid_from_time` 函数创建 id 列表，如上面第 2 节所示。此函数接受一个 datetime 对象，并返回一个 uuid，其中日期和时间已编码在 uuid 中。

In [58]:
# Add documents to a collection in TimescaleVector
ids = vectorstore.add_documents([Document(page_content="foo")])
ids

['a34f2b8a-53d7-11ee-8cc3-de1e4b2a0118']

In [59]:
# Query the vectorstore for similar documents
docs_with_score = vectorstore.similarity_search_with_score("foo")

In [60]:
docs_with_score[0]

(Document(page_content='foo', metadata={}), 5.006789860928507e-06)

In [61]:
docs_with_score[1]

(Document(page_content='{"commit": " 00b566dfe478c11134bcf1e7bcf38943e7fafe8f", "author": "Fabr\\u00edzio de Royes Mello<fabriziomello@gmail.com>", "date": "Mon Mar 6 15:51:03 2023 -0300", "change summary": "Remove unused functions", "change details": "We don\'t use `ts_catalog_delete[_only]` functions anywhere and instead we rely on `ts_catalog_delete_tid[_only]` functions so removing it from our code base. "}', metadata={'id': 'd7f5c580-bc4f-11ed-9712-ffa0126a201a', 'date': '2023-03-6 15:51:03+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 285, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 00b566dfe478c11134bcf1e7bcf38943e7fafe8f', 'author_email': 'fabriziomello@gmail.com'}),
 0.23607668446580354)

### 删除数据

您可以通过 uuid 或按元数据过滤来删除数据。

In [64]:
ids = vectorstore.add_documents([Document(page_content="Bar")])

vectorstore.delete(ids)

True

使用元数据删除在以下情况下尤其有用：您想定期更新从特定来源抓取的信息，或特定日期或其他元数据属性。

In [65]:
vectorstore.add_documents(
    [Document(page_content="Hello World", metadata={"source": "www.example.com/hello"})]
)
vectorstore.add_documents(
    [Document(page_content="Adios", metadata={"source": "www.example.com/adios"})]
)

vectorstore.delete_by_metadata({"source": "www.example.com/adios"})

vectorstore.add_documents(
    [
        Document(
            page_content="Adios, but newer!",
            metadata={"source": "www.example.com/adios"},
        )
    ]
)

['c6367004-53d7-11ee-8cc3-de1e4b2a0118']

### 重写向量存储

如果您有一个现有的集合，可以通过执行 `from_documents` 并设置 `pre_delete_collection` = True 来重写它

In [None]:
db = TimescaleVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    pre_delete_collection=True,
)

In [None]:
docs_with_score = db.similarity_search_with_score("foo")

In [None]:
docs_with_score[0]