# 在云器Lakehouse的同一张表中进行向量和标量检索

## 方案概要介绍
- 参考并已完成在云器Lakehouse的同一张表中进行向量和标量存储

![image.png](./image/scala_vector_search_in_one_table.png)

## 部署本地向量模型和召回模型服务

### 文本的向量化表示
bge-m3为1024维，bge-base-zh-v1.5为768维
本文通过私有部署xinference，提供向量embedding模型服务和rerank召回模型服务
xinference运行在X86 CPU上, xinference的安装和使用请参考这里的[文档](https://inference.readthedocs.io/en/latest/)。

In [2]:
from xinference.client import Client as Xinference_Client  # 添加别名

def get_embedding_xin(
    input_text: str,
    base_url: str = "http://localhost:9998",
    model_name: str = "bge-m3"
) -> list:
    """
    获取文本的嵌入向量
    
    参数:
    input_text (str): 要生成嵌入向量的文本
    base_url (str): Xinference服务器地址，默认为本地服务
    model_name (str): 要使用的模型名称，默认为bge-m3
    
    返回:
    list: 文本的嵌入向量
    """
    # 使用别名创建客户端连接
    client = Xinference_Client(base_url)  # 修改类名调用
    
    # 获取指定模型
    model = client.get_model(model_name)
    embedding = model.create_embedding(input_text)
    # 生成并返回嵌入向量
    return embedding['data'][0]['embedding']

# 使用示例保持不变
if __name__ == "__main__":
    embedding = get_embedding_xin("What is the capital of China?")
    print(f"生成的嵌入向量维度：{len(embedding)}")


生成的嵌入向量维度：1024


### 用Rerank模型对检索出的文本进行精排

In [3]:
import requests

def rerank_documents(query, documents, model_name: str ="bge-reranker-large"):
    url = 'http://localhost:9998/v1/rerank'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json'
    }
    data = {
        "model": model_name,
        "query": query,
        "documents": documents
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()
# Example usage
if __name__ == "__main__":
    query = "A man is eating pasta."
    documents = [
        "A woman is playing violin.",
        "A man is eating food.",
        "A man is eating a piece of bread.",
        "The girl is carrying a baby.",
        "A man is riding a horse."
    ]
    
    rerank_list = rerank_documents(query, documents)
    # Print the results
    for result in rerank_list['results']:
        print(result)

{'index': 1, 'relevance_score': 0.9999258518218994, 'document': None}
{'index': 2, 'relevance_score': 0.048283617943525314, 'document': None}
{'index': 3, 'relevance_score': 7.636440568603575e-05, 'document': None}
{'index': 0, 'relevance_score': 7.636331429239362e-05, 'document': None}
{'index': 4, 'relevance_score': 7.617334631504491e-05, 'document': None}


### 也可以使用ollama提供的embedding模型服务
文本的向量化表示，1024维embedding函数
ollama运行在Mac M1 ARM上, ollama的安装和使用请参考这里的[文档](https://ollama.com/)。

In [7]:
from ollama import Client

def get_embedding_ollama(text: str, 
                 model: str = 'bge-m3',  # 默认模型
                 host: str = 'http://192.168.6.167:11434') -> list[float]:
    """
    获取文本的向量化表示
    
    参数：
    text (str): 需要向量化的文本内容
    model (str): 使用的embedding模型名称，默认为深度求索的embedding模型
    host (str): Ollama服务器地址，格式为http://IP:PORT
    
    返回：
    list[float]: 文本的向量表示（浮点数列表）
    """
    try:
        client = Client(host=host)
        response = client.embed(
            model=model,
            input=text.strip()  # 去除首尾空白字符
        )
        return response['embeddings'][0]
    except Exception as e:
        print(f"获取embedding失败: {str(e)}")
        return None

# 使用示例
if __name__ == "__main__":
    embedding = get_embedding_ollama(
        text="为什么天空是蓝色的？"
    )
    
    if embedding is not None:
        print(f"向量维度: {len(embedding)}")

向量维度: 1024


### 安装云器Zettapark
如果已安装请跳过

In [8]:
# !pip install -U clickzetta-zettapark-python

In [9]:
# Get a nicer output from .show()
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## 通过ZettaPark连接到云器Lakehouse

In [10]:
import time
import pandas as pd
from clickzetta.zettapark.session import Session
import clickzetta.zettapark.functions as f
from clickzetta.zettapark import Session, DataFrame
from clickzetta.zettapark.functions import udf, col
from clickzetta.zettapark.types import IntegerType

### 云器Lakehouse所在Region
- 阿里云上海

### 查询使用的VC：
- 分析型（ANALYTICS），2CRU。
- 配置preload table，预加载数据
- 5分钟自动停止
- 有查询请求自动启动

### 连接配置信息
-  [config.json](config.json)

In [11]:
import json
from clickzetta.zettapark.session import Session

import logging
logging.getLogger("clickzetta.zettapark").setLevel(logging.ERROR)

# 从配置文件中读取参数
with open('config-vector.json', 'r') as config_file:
    config = json.load(config_file)

print("正在连接到云器Lakehouse.....\n")

# 创建会话
session = Session.builder.configs(config).create()

print("连接成功！...\n")

# session.sql("SELECT current_instance_id(), current_workspace(),current_workspace_id(), current_schema(), current_user(),current_user_id(), current_vcluster()").show()


正在连接到云器Lakehouse.....

连接成功！...



In [12]:
session.get_current_schema()

'PUBLIC'

In [13]:
## 查看计算资源详细情况
session.sql(f"desc vcluster extended {config['vcluster']}").to_pandas()

Unnamed: 0,info_name,info_value
0,name,DEFAULT
1,creator,
2,created_time,2024-04-12 11:59:40.79
3,last_modified_time,2025-03-21 12:08:12.681
4,comment,Auto-created virtual cluster
5,properties,()
6,vcluster_size,1
7,vcluster_type,GENERAL
8,state,SUSPENDED
9,scaling_policy,


##  将Table里的数据预加载到VCluster上，进行查询加速

In [9]:
table_name = "github_event_issuesevent_embedding"

In [10]:
session.sql(f"""ALTER VCLUSTER {config['vcluster']} SET PRELOAD_TABLES='{config['schema']}.{table_name}'""").collect()

[Row(result_message='OPERATION SUCCEED')]

In [50]:
session.sql(f"""SHOW VCLUSTER {config['vcluster']} PRELOAD CACHED STATUS""").collect()

[Row(table='123.gharchive.public.github_event_issuesevent_embedding', partition='', vc_name='default_ap', replica_id='301', preload_ratio='3.5%', preload_size='3787586738')]

## 查看表结构和索引定义

- 向量索引，对Vector类型字段issue_body_embedding建立向量索引，加速向量检索issue_body_embedding是文本字段issue_body的向量化表示。
  
    INDEX index_issue_body_vector (issue_body_embedding) Vector PROPERTIES('scalar.type'='f32','distance.function'='l2_distance')

- 倒排索引，对文本字段issue_body建立倒排索引，加速文本检索。
  
    INDEX index_issue_body_inverted (issue_body) Inverted PROPERTIES('analyzer'='unicode'))

- Bloomfilter索引，加速按照row_id查找
  
    INDEX index_rowid_bf (row_id) BloomFilter

- 分区键，按partition_date进行分区，查找时按照分区裁剪，避免每次查询对大表的全表扫描，提高查找性能
- 
    PARTITIONED BY(partition_date)

In [12]:
from IPython.display import Markdown, display
import re

def pretty_show_ddl(sql_str):
    formatted_sql = (
        re.sub(r'\\n', '\n', sql_str)  # 处理换行转义
        .replace('`', '')             # 移除反引号（此处去掉了行尾的注释）
        .replace('\\t', '    ')       # 替换制表符
    )
    display(Markdown(f"```sql\n{formatted_sql}\n```"))

# 使用示例
ddl = session.sql(f"SHOW CREATE TABLE {table_name}").to_pandas().iloc[0,0]
pretty_show_ddl(ddl)


```sql
CREATE TABLE gharchive.public.github_event_issuesevent_embedding(
  row_id bigint,
  event_id bigint,
  event_type varchar(255),
  created_at timestamp,
  partition_date date,
  public boolean,
  repo_id varchar(255),
  repo_name varchar(255),
  repo_url varchar(255),
  actor_id varchar(255),
  actor_login varchar(255),
  actor_gravatar_id varchar(255),
  actor_avatar_url varchar(255),
  actor_url varchar(255),
  org_id varchar(255),
  org_login varchar(255),
  org_gravatar_id varchar(255),
  org_avatar_url varchar(255),
  org_url varchar(255),
  issue_body string,
  issue_body_embedding vector(float,1024),
  INDEX index_issue_body_vector (issue_body_embedding) Vector PROPERTIES('scalar.type'='f32','distance.function'='l2_distance'),
  INDEX index_rowid_bf (row_id) BloomFilter,
  INDEX index_issue_body_inverted (issue_body) Inverted PROPERTIES('analyzer'='unicode'))
PARTITIONED BY(partition_date)
USING PARQUET
TBLPROPERTIES(
  'cz.compaction.server.enable.archive'='fasle');
```

## 统计全表的行数(全表1.96亿行数据)

In [13]:
session.table(table_name).count()

196213194

## 查看表的详细信息

In [58]:
session.sql(f"""DESC EXTENDED {table_name}""").to_pandas()

Unnamed: 0,column_name,data_type,comment
0,row_id,bigint,
1,event_id,bigint,
2,event_type,varchar(255),
3,created_at,timestamp_ltz,
4,partition_date,date,
5,public,boolean,
6,repo_id,varchar(255),
7,repo_name,varchar(255),
8,repo_url,varchar(255),
9,actor_id,varchar(255),


## 统计指定分区内的行数，后续数据检索都在此分区范围内
结合分区裁剪，减少数据扫描量

- issue_body字段，github上提交issue的说明文字
- issue_body_embedding字段，issue_body的1024维向量化表示。issue_body的向量化还没有完成，部分的issue_body_embedding值为空

In [59]:
# 这个时间段的记录数为56万行，比较典型的向量检索场景的数据量
begin_date = "2024-06-01"
end_date = "2024-06-08"

In [60]:
def count_filtered_record(
                  session: Session = session,
                  table_name: str = "github_event_issuesevent_embedding",
                  where_condition:str = f"partition_date >= '{begin_date}' and partition_date <= '{end_date}'",
                 ) :
    query_stmt = f"""
        SELECT 
            COUNT(CASE WHEN length(issue_body)>0 THEN 1 END) AS count_text_is_not_null_record_of_{table_name},
            COUNT(CASE WHEN issue_body_embedding IS NOT NULL THEN 1 END) AS count_embedded_record_of_{table_name},
            COUNT(CASE WHEN issue_body_embedding IS NULL THEN 1 END) AS count_not_embedded_record_of_{table_name}
        FROM {table_name}
        WHERE {where_condition}

        """
    search_result = session.sql(query_stmt).to_pandas()
    return search_result

In [205]:
count_filtered_record()

Unnamed: 0,count_text_is_not_null_record_of_github_event_issuesevent_embedding,count_embedded_record_of_github_event_issuesevent_embedding,count_not_embedded_record_of_github_event_issuesevent_embedding
0,566039,111133,454906


## 预览指定分区内的数据

In [62]:
session.sql(f"select * from {table_name} where partition_date >= '{begin_date}' and partition_date <= '{end_date}' and issue_body_embedding is not NULL limit 10").to_pandas()

Unnamed: 0,row_id,event_id,event_type,created_at,partition_date,public,repo_id,repo_name,repo_url,actor_id,...,actor_gravatar_id,actor_avatar_url,actor_url,org_id,org_login,org_gravatar_id,org_avatar_url,org_url,issue_body,issue_body_embedding
0,31478810,38962632210,IssuesEvent,2024-06-04 06:31:46+00:00,2024-06-04,True,776493969,ruslkhay/ReaBoo,https://api.github.com/repos/ruslkhay/ReaBoo,102460836,...,,https://avatars.githubusercontent.com/u/102460...,https://api.github.com/users/ruslkhay,,,,,,Studing progress should be tracked.,"[-0.03296089,-0.0046119564,-0.05780587,-0.0082..."
1,31473612,38962623284,IssuesEvent,2024-06-04 06:31:18+00:00,2024-06-04,True,810016258,camiladz/learn-github,https://api.github.com/repos/camiladz/learn-gi...,85312719,...,,https://avatars.githubusercontent.com/u/85312719?,https://api.github.com/users/camiladz,,,,,,Necessidade de criar um readme,"[-0.029348638,0.02384073,-0.03198142,0.0101071..."
2,31473588,38962596635,IssuesEvent,2024-06-04 06:30:33+00:00,2024-06-04,True,810013016,Anacss24/learn-github,https://api.github.com/repos/Anacss24/learn-gi...,101267392,...,,https://avatars.githubusercontent.com/u/101267...,https://api.github.com/users/Anacss24,,,,,,Necessidade de criar um readme,"[-0.029348638,0.02384073,-0.03198142,0.0101071..."
3,31473610,38962621284,IssuesEvent,2024-06-04 06:31:11+00:00,2024-06-04,True,795768303,sergiorbf/kenko-coffee,https://api.github.com/repos/sergiorbf/kenko-c...,90218907,...,,https://avatars.githubusercontent.com/u/90218907?,https://api.github.com/users/sergiorbf,,,,,,fix checkout submit component,"[-0.039279714,-0.0056430707,-0.010742575,-0.02..."
4,31478822,38962638075,IssuesEvent,2024-06-04 06:32:06+00:00,2024-06-04,True,764024204,HichTala/marseille,https://api.github.com/repos/HichTala/marseille,98521878,...,,https://avatars.githubusercontent.com/u/98521878?,https://api.github.com/users/HichTala,,,,,,Corriger le probleme de déconnection,"[0.018347852,0.010826592,-0.04977021,-0.010925..."
5,31473620,38962625664,IssuesEvent,2024-06-04 06:31:25+00:00,2024-06-04,True,559591126,mokeyish/smartdns-rs,https://api.github.com/repos/mokeyish/smartdns-rs,16131917,...,,https://avatars.githubusercontent.com/u/16131917?,https://api.github.com/users/mokeyish,,,,,,What is the difference between GNU and MSVC?\r...,"[-0.070999354,-0.0312622,-0.020586753,0.016268..."
6,31478806,38962630598,IssuesEvent,2024-06-04 06:31:41+00:00,2024-06-04,True,764024204,HichTala/marseille,https://api.github.com/repos/HichTala/marseille,98521878,...,,https://avatars.githubusercontent.com/u/98521878?,https://api.github.com/users/HichTala,,,,,,Rendre plus joli le filtrage en normalisant le...,"[-0.03164603,-0.015199391,-0.044382773,0.00385..."
7,31473605,38962603963,IssuesEvent,2024-06-04 06:30:54+00:00,2024-06-04,True,764024204,HichTala/marseille,https://api.github.com/repos/HichTala/marseille,98521878,...,,https://avatars.githubusercontent.com/u/98521878?,https://api.github.com/users/HichTala,,,,,,Afficher uniquement les 4 prochaines missions ...,"[-0.039396387,0.0061013484,-0.03005435,0.00258..."
8,31473587,38962596381,IssuesEvent,2024-06-04 06:30:31+00:00,2024-06-04,True,30492678,facetoe/zenpy,https://api.github.com/repos/facetoe/zenpy,1507221,...,,https://avatars.githubusercontent.com/u/1507221?,https://api.github.com/users/sglebs,,,,,,I am getting this error:\r\n\r\nrequests.excep...,"[-0.05882265,-0.0074409284,-0.027671786,-0.003..."
9,31478814,38962634995,IssuesEvent,2024-06-04 06:31:55+00:00,2024-06-04,True,157486749,kwsch/pkNX,https://api.github.com/repos/kwsch/pkNX,6393368,...,,https://avatars.githubusercontent.com/u/6393368?,https://api.github.com/users/kwsch,,,,,,![Screenshot 2024-06-03 181612](https://github...,"[-0.014927758,0.007211781,-0.057648428,-0.0397..."


## 定义从云器Lakehouse进行信息检索的函数

In [181]:
def vector_search(question:str,
                  session: Session = session,
                  table_name: str = "github_event_issuesevent_embedding", 
                  other_column_names = "row_id, partition_date, issue_body",
                  embedding_column_name: str = "issue_body_embedding",
                  where_condition:str = f"partition_date >= '{begin_date}' and partition_date <= '{end_date}' and issue_body_embedding IS NOT NULL",
                  order_condition:str = "4 ASC",
                  limit_num: str = "5",
                  max_distance:str = "0.8" )-> pd :
    embedded_question = get_embedding_xin(question)
    query_stmt = f"""
        SELECT  {other_column_names},L2_DISTANCE({embedding_column_name}, CAST({embedded_question} as VECTOR(1024))) AS distance, "vector_search_l2" as search_method
        FROM {table_name}
        WHERE {where_condition} AND (L2_DISTANCE({embedding_column_name}, CAST({embedded_question} as VECTOR(1024))) <= {max_distance})
        ORDER BY {order_condition}
        LIMIT {limit_num}
        """
    search_result = session.sql(query_stmt).to_pandas()
    return search_result

In [171]:
def match_all_search(question:str,
                  session: Session = session,
                  table_name: str = "github_event_issuesevent_embedding", 
                  other_column_names = "row_id, partition_date, issue_body",
                  text_column_name: str = "issue_body",
                  where_condition:str = f"partition_date >= '{begin_date}' and partition_date <= '{end_date}'",
                  order_condition:str = "4 ASC",
                  limit_num: str = "2" )-> pd :
    query_stmt = f"""
        SELECT  {other_column_names}, 0 AS distance, "match_all_search" as search_method
        FROM {table_name}
        WHERE {where_condition} AND (MATCH_ALL({text_column_name}, '{question}' ))
        ORDER BY {order_condition}
        LIMIT {limit_num}
        """
    search_result = session.sql(query_stmt).to_pandas()
    return search_result

In [172]:
def match_any_search(question:str,
                  session: Session = session,
                  table_name: str = "github_event_issuesevent_embedding", 
                  other_column_names = "row_id, partition_date, issue_body",
                  text_column_name: str = "issue_body",
                  where_condition:str = f"partition_date >= '{begin_date}' and partition_date <= '{end_date}'",
                  order_condition:str = "4 ASC",
                  limit_num: str = "3",
                  distance:str = "1.0")-> pd :
    query_stmt = f"""
        SELECT  {other_column_names}, {distance} AS distance, "match_any_search" as search_method
        FROM {table_name}
        WHERE {where_condition} AND (MATCH_ANY({text_column_name}, '{question}' ))
        ORDER BY {order_condition}
        LIMIT {limit_num}
        """
    search_result = session.sql(query_stmt).to_pandas()
    return search_result

In [173]:
question = "Imo, sulfur probably? "
# question = "Database Migration"
# question = "Octopus Deploy"

In [182]:
vector_search_answer = vector_search(question, max_distance = '1.0')
match_all_answer = match_all_search(question)
match_any_answer = match_any_search(question)

In [183]:
answer_merged_df = pd.concat([vector_search_answer, match_all_answer, match_any_answer], ignore_index=True)
answer_merged_df = answer_merged_df.sort_values(by='distance', ascending=True)
answer_merged_df = answer_merged_df.drop_duplicates(subset="row_id", keep="first")
answer_merged_df

Unnamed: 0,row_id,partition_date,issue_body,distance,search_method
5,59087051,2024-06-06,"Imo, sulfur probably? Its found in a region wi...",0.0,match_all_search
1,287191552,2024-06-05,maybe it's something to do with stuff higher u...,0.981065,vector_search_l2
2,188420163,2024-06-05,I think that Physic Material might be missing ...,0.991636,vector_search_l2
3,240764228,2024-06-05,The documentation vaguely implies that Simba (...,0.991705,vector_search_l2
4,43673287,2024-06-05,"For every container, i guess.",0.992787,vector_search_l2
6,32886878,2024-06-04,Domain: Encyclopedia (Wikia)/various (Fandom w...,1.0,match_any_search
7,32887522,2024-06-04,Currently uploads in flight stop when you leav...,1.0,match_any_search
8,32881424,2024-06-04,<p>This issue has been put in the moderation q...,1.0,match_any_search


In [184]:
issue_body_list = answer_merged_df['issue_body'].tolist()
rerank_list = rerank_documents(question, issue_body_list)
score_map = {result["index"]: result["relevance_score"] for result in rerank_list["results"]}
answer_merged_df["relevance_score"] = answer_merged_df.index.map(score_map)

In [185]:
answer_merged_df.sort_values(by="relevance_score", ascending=False)

Unnamed: 0,row_id,partition_date,issue_body,distance,search_method,relevance_score
2,188420163,2024-06-05,I think that Physic Material might be missing ...,0.991636,vector_search_l2,0.001848
6,32886878,2024-06-04,Domain: Encyclopedia (Wikia)/various (Fandom w...,1.0,match_any_search,0.000824
1,287191552,2024-06-05,maybe it's something to do with stuff higher u...,0.981065,vector_search_l2,0.000221
4,43673287,2024-06-05,"For every container, i guess.",0.992787,vector_search_l2,0.000102
5,59087051,2024-06-06,"Imo, sulfur probably? Its found in a region wi...",0.0,match_all_search,9.4e-05
7,32887522,2024-06-04,Currently uploads in flight stop when you leav...,1.0,match_any_search,8e-05
3,240764228,2024-06-05,The documentation vaguely implies that Simba (...,0.991705,vector_search_l2,7.6e-05
8,32881424,2024-06-04,<p>This issue has been put in the moderation q...,1.0,match_any_search,


## 查看单次查询需要的时间

In [None]:
import time

# 定义计时装饰器 ‌:ml-citation{ref="3" data="citationList"}
def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"[{func.__name__}] 执行耗时: {end - start:.3f}s")
        return result, end - start
    return wrapper

# 封装搜索函数 ‌:ml-citation{ref="1" data="citationList"}
@timeit
def execute_vector_search(question):
    return vector_search(question, max_distance='1.0')

@timeit
def execute_match_all(question):
    return match_all_search(question)

@timeit
def execute_match_any(question):
    return match_any_search(question)

# 主执行逻辑
if __name__ == "__main__":
    total_start = time.perf_counter()
    
    vector_result, vector_time = execute_vector_search(question)
    match_all_result, match_all_time = execute_match_all(question)
    match_any_result, match_any_time = execute_match_any(question)
    
    # 计算总耗时
    total_time = time.perf_counter() - total_start
    
    # 打印结果
    print("\n===== E2E耗时统计 =====")
    print(f"• 向量搜索耗时(包括一次问题文本的向量化过程): {vector_time:.3f}s")
    print(f"• 全匹配搜索耗时: {match_all_time:.3f}s") 
    print(f"• 任意匹配搜索耗时: {match_any_time:.3f}s")
    print(f"• 总执行时间: {total_time:.3f}s")


In [187]:
onetime_query = """
                SELECT  row_id, issue_body,L2_DISTANCE(issue_body_embedding, CAST([-0.038588822,0.013613229,-0.032117527,-0.022368649,-0.026103461,-0.014429993,-0.008706978,0.052240074,-0.032955725,0.01976541,0.028083079,-0.018595401,-9.328244e-05,0.020900456,0.008664492,0.012073942,0.028870013,0.02973391,0.024368819,-0.028582772,-0.0021799358,0.003920943,0.025884438,0.013413902,-0.0035814024,0.0121239275,-0.011165111,-0.021781359,-0.0020752025,0.04929254,-0.0020518422,-0.022365028,0.024208471,-0.039694667,-0.014269119,-0.045559775,0.007610542,-0.01239021,-0.047636855,0.0050605806,-0.026767632,0.029241577,-0.007394314,-0.021243945,0.04085532,-0.033113554,0.02567797,-0.021553347,0.009249847,-0.011706535,-0.011076223,0.017547367,0.062697075,0.0032560087,0.018568687,-0.032352347,0.02385997,-0.006825313,-0.047669467,0.006683766,-0.04243342,-0.005511082,-0.0029530204,-0.019131899,-0.0032982067,0.049897935,0.03080964,0.03667401,-0.037257228,-0.073476575,0.009286169,0.024104036,-0.0002485741,-0.03625055,-0.040646542,0.028124234,0.043852717,0.017463474,0.00889522,0.017119154,0.042427693,-0.01735837,-0.043517485,-0.019903604,-0.004114573,-0.018323982,-0.023804162,0.011030757,0.02834939,0.008606075,0.05418132,-0.05720728,-0.026532916,0.0148015,-0.019549666,0.009425403,-0.0018266725,-0.008534378,0.061947748,-0.0077301995,-0.0014201081,-0.017104432,0.0074564964,-0.058603708,0.0037512449,0.017906016,0.031225612,-0.05832421,0.0015376123,-0.0038673885,0.014415894,0.04898108,0.0240008,0.047618713,0.0054231253,-0.015318564,-0.005380431,0.0077468175,0.021052,-0.019382583,0.0039735367,-0.00987389,0.054058794,-0.02959981,-0.006275549,0.0060486575,-0.054179054,0.02465131,0.0394837,0.07126506,-0.025283137,0.04730835,0.007979147,0.026918177,-0.035642017,-0.009828851,-0.041394096,-0.004230115,-0.0057935244,-0.040367704,0.01934699,0.04284675,-0.051411737,-0.018625671,0.033104513,0.002610617,0.027978461,0.01565346,-0.025824461,-0.020814061,-0.0025741076,-0.028871644,0.018380307,-0.0072354362,-0.023655176,-0.012983362,-0.025746785,-0.008451677,-0.024523562,-0.0059690746,-0.059199534,0.00956326,-0.05090314,0.04070047,0.020235742,-0.0074386206,0.015800167,0.01592449,-0.020797089,-0.02455123,0.027970226,-0.022356672,-0.0141429445,-0.016052935,-0.018227024,0.029982159,0.098887786,0.01844724,0.007272229,-0.035777424,-0.03207416,-0.026518147,0.049201638,-0.016517775,-0.036950834,-0.0023616417,0.024112182,0.029576374,-0.012687433,0.012933538,-0.012397113,-0.002409625,0.03518608,-0.018902734,0.0025148906,-0.0036851245,-0.011390372,-0.007754619,-0.006919649,-0.017673342,-0.017100416,0.027893018,0.013328257,-0.002411273,0.0023435696,0.02834309,-0.012515213,-0.018323429,0.024909317,-0.04272079,-0.004667657,0.010462556,-0.034347937,0.051729508,-0.060068116,-0.0002400899,0.006477311,0.032598276,0.0025078936,-0.008526018,-0.034888104,0.0702491,-0.018894384,-0.059065856,0.02567653,0.02186525,0.0026201762,0.047364812,-0.028982555,-0.039577156,-0.044007897,-0.056955118,-0.000506457,0.0056865104,0.09987736,-0.03357016,-0.0176079,-0.0022506781,-0.0048388136,-0.0054666484,-0.0145285735,-0.047769457,0.00041414256,0.007752966,-0.0229387,0.014642996,0.01587616,0.020659806,-0.027700594,-0.019629197,0.054361,0.014634502,0.0015574285,-0.017768014,0.03347474,-0.009231269,-0.0089231,0.001554388,0.015673533,-0.018824223,0.04061011,-0.033671834,-0.017727438,0.007201781,0.030414766,0.0122459885,-0.017428674,-0.006575266,0.01621539,0.0012540368,-0.024400078,0.0062513403,-0.021589717,0.007232865,-0.033444572,0.0023275588,-0.018746505,-0.008649477,0.041597907,0.04992287,-0.015230548,0.024582598,0.024891108,0.03923758,0.0017907444,0.006064147,0.026066452,-0.0069895927,-0.0058321776,-0.036779903,-0.0348853,-0.0062519955,0.05709141,0.0061473837,0.016716521,0.013143841,0.03160076,-0.16282156,-0.030049562,-0.0132082,0.07057519,0.0029185456,0.017504657,-0.030352725,0.018821238,-0.044966586,0.022904819,-0.008447027,-0.053658947,0.003597539,-0.02731822,-0.009098128,-0.022176871,-0.030150387,0.010350826,-0.004336436,-0.032780465,-0.044058826,-0.0004373785,0.027021175,0.0059637935,-0.00021651229,-0.007857218,-0.0035309372,-0.023825722,-0.038277637,-0.03377965,-0.035845347,-0.029361945,-0.013896237,0.005341577,0.082996644,0.04988103,0.013001914,-0.039204508,0.0148277925,0.013433599,0.04263385,-0.004254592,-0.006374136,0.0068774,0.023932831,-0.028110415,-0.019918278,0.011164886,0.008267399,-0.017736658,0.0063877827,-0.020068584,0.018653136,-0.010846642,-0.025656866,-0.020347651,-0.004887696,0.024611268,-0.012843132,-0.015923431,-0.008826659,-0.051180977,0.031053234,-0.03712587,-0.043910146,-0.007909312,-0.019465104,-0.034685165,0.06159699,-0.06027535,0.03499583,0.0044660266,0.0460187,-0.03217527,-0.007931442,0.018822167,0.023553982,-0.039261203,0.0043006195,-0.13480943,-0.043726355,0.00302948,0.001985053,0.0339887,-0.040201016,0.040517613,0.0149442395,0.0020687538,0.054553907,0.21967354,0.023545424,-0.018339004,-0.0019084474,0.048179165,-0.059194654,0.026249455,0.03623838,0.02616977,-0.033159424,0.0010378407,-0.033397533,-0.025355509,0.0038711058,-0.0005628489,0.008170404,-0.041000936,0.009530209,0.07613246,-0.018814225,0.008954255,0.003828509,0.027177393,-0.032090325,-0.045836926,-0.012889759,0.028229307,0.060389817,-0.034046732,0.01407476,-0.032968044,0.016483396,0.007783416,-0.010019729,-0.011474604,0.00978828,0.029749513,-0.027471567,-0.0067048124,0.0013202013,-0.008752336,-0.028934198,-0.0002413265,0.015210264,-0.020298319,-0.053711,-0.0068444367,-0.029368367,-0.011025884,-0.01287464,-0.03234157,-0.020245897,-0.04747086,0.031350538,-0.02757074,-0.027527295,-0.01845017,-0.023679968,0.002198275,0.011187652,-0.01409879,-0.0051698782,-0.03694385,0.018390851,0.022449248,0.044313855,-0.0149339195,-0.03928103,0.039341807,0.040122017,-0.030632062,-0.024952736,0.012724452,0.034704506,0.007607646,-0.023654338,-0.016281983,0.089847706,-0.0928803,0.033460706,0.0253034,-0.02181927,-0.039337736,0.024087884,0.027398445,-0.010131777,0.00978873,0.060752705,-0.053817406,-0.028424883,-0.017653083,0.0048667607,-0.032547686,0.03921153,-0.007589497,0.031506434,-0.008509918,-0.0027552233,-0.014812834,-0.0046581365,-0.03398614,-0.012996118,0.0019263355,-0.018849436,0.016230162,0.022911416,0.0031538059,0.034024898,-0.0071294974,0.032576293,-0.032698188,0.017563274,0.00023868517,-0.03113934,-0.010499359,0.068754464,-0.0044702063,0.0592204,0.050541177,0.019004134,-0.028469605,-0.017080784,-0.026565135,0.007864695,0.02310103,-0.03254814,-0.012962417,-0.030348008,-0.0011692562,0.04774039,0.04515706,-0.030175287,-0.015368211,0.0038790824,0.033674672,0.044499245,-0.047261186,0.044988763,0.0034576268,0.065176696,-0.024302268,0.00074241194,-0.0054908274,-0.025261702,-0.03083438,0.01523369,-0.03586423,-0.011769807,-0.022784343,0.023110338,0.0643679,0.024998436,0.010275806,-0.002172185,-0.022853272,-0.037064288,-0.029726798,-0.04017762,-0.043497078,-0.0283345,-0.022657467,-0.0105899405,-0.028942527,0.053049106,-0.015829299,0.033336647,-0.021478232,0.017135795,0.013154925,0.004889004,-2.7582914e-07,0.008184078,-0.024299564,0.029097736,-0.015596457,-0.054337926,-0.033307876,-0.028687876,-0.019092211,0.021113805,0.023096284,-0.014946661,-0.031619225,0.009959406,0.061434727,-0.002097585,-0.023753809,-0.044339377,-0.006524966,-0.005045401,0.024871122,0.11418441,0.034246895,-0.034440458,0.019647626,0.024835195,0.021860896,0.031926468,-0.006215582,0.018847115,-0.03595409,-0.052492533,0.0091853,0.006122102,-0.044608094,-0.05266439,0.05907514,0.029411793,-0.043077376,0.037430372,0.014981488,-0.02817645,-0.038666718,-0.04519742,0.020501286,-0.015271245,0.051756844,0.052076366,0.023114279,-0.022064706,-0.031635486,0.015336762,0.026278263,-0.043818988,0.011094511,0.044800848,0.015171537,-0.038254265,-0.022329293,0.027449418,-0.0044986047,-0.017847354,0.02706862,-0.0072754594,-0.01519281,0.022344323,0.045372255,0.039401706,-0.0079067815,0.009511223,-0.0072937007,-0.02332018,-0.0020931782,0.014239463,-0.00094053446,0.0586727,-0.01829711,-0.03068342,-0.0028247095,0.022851199,-0.031411957,0.018937139,3.043857e-05,-0.018573204,0.03179877,0.05827801,-0.02756613,0.033691786,0.007860031,-0.05921093,-0.01958434,0.03893231,0.027990898,0.013335433,-0.0014291211,-0.061177198,-0.0053935386,-0.003179317,-0.0374259,-0.024400093,0.033838592,0.0126542915,-0.051926695,-0.013451446,0.01730691,0.020166328,-0.0036130045,-0.004563011,-0.00067251286,0.0016332538,-0.051939018,0.02633797,0.026295582,0.03222161,-0.0026472185,0.043459024,0.040823217,-0.045042504,0.0065227672,-0.050841946,0.01950129,-0.0019450163,-0.06432601,0.049384106,-0.053718477,-0.009978018,0.035591148,-0.031038042,-0.036933064,0.06250413,0.009909603,0.019392278,0.020710465,-0.012107074,-0.054951336,-0.0099177025,-0.016361378,0.015448971,0.003978096,0.02801674,-0.011205609,-0.04521375,0.004979566,0.017422464,0.03986951,0.019696167,0.03077849,0.009481704,-0.012249596,0.02208747,-0.025294617,-0.008141892,-0.023875065,0.050939575,-0.016328242,0.008709088,-0.022100126,-0.008358129,-0.017230945,-0.027768157,-0.040712964,-0.022482987,-0.03652925,-0.028150953,-0.005789198,0.013696129,-0.0036569396,-0.09364957,-9.866125e-05,-0.0012082519,0.024274586,0.040349822,0.00400591,-0.018919582,0.039156497,0.011952573,0.012720891,-0.004828589,-0.04284371,0.014611118,-0.016344484,-0.0055980454,-0.0404279,-0.015277441,0.021183185,0.050680894,0.019031508,0.02792838,0.001906678,0.020749463,-0.033562705,0.039877158,0.014056945,0.007007187,-0.00012629155,-0.036080576,-0.06637786,0.030135298,0.026462128,0.010610585,-0.020333042,0.03440006,-0.011010459,-0.018737538,-0.005784681,0.060906816,-0.02008242,0.016424641,0.042583305,0.0075262603,-0.00404679,0.007963539,0.0009393453,-0.0250206,0.002319116,-0.038954332,-0.01870824,0.013942696,0.06517896,0.051036824,-0.021225922,-0.0071451636,-0.019209348,0.020653494,-0.13757604,0.04258736,-0.02278369,0.025329195,-0.012794614,0.010610259,-0.035129424,-0.022685736,0.026669523,-0.06376148,0.012456675,-0.008061456,-0.011342298,-0.024580486,0.0038983314,-0.00064386387,-0.048159536,-0.012775248,0.0034516635,0.030173276,2.3875557e-05,-0.011723299,0.03285668,0.009848239,-0.031188643,-0.02357622,0.028773751,-0.00513577,-0.050513532,-0.007965329,-0.011149051,-0.03297629,0.0287876,0.03956153,0.011321373,-0.02955734,-0.010250664,-0.033227865,-0.064955644,0.019986793,-0.018153133,0.039969534,0.0017091299,0.0142736565,0.029545506,0.05129804,-0.015793653,-0.016438665,-0.009803832,0.029447492,-0.0047071893,0.050106015,0.012471416,0.049238667,0.029914835,0.0075991224,-0.049972787,-0.016296022,-0.012712005,0.022159897,-0.018776502,0.004951495,-0.06551404,-0.07393065,0.025968637,0.008510159,-0.02377095,0.030109175,-0.015651062,0.030680757,-0.03894479,-0.0033102515,0.043109693,0.017044248,-0.0063439244,-0.0144035155,-0.005276942,-0.00082273176,-0.021981938,-0.020129098,-0.029557835,0.00504195,-0.017127998,0.04098444,0.07790255,0.0126888575,0.007502555,-0.057754412,-0.031653702,-0.02071157,-0.025789423,-0.0019831876,0.026144164,-0.002432299,0.0027808198,0.007946114,-0.029933574,0.010093641,0.06143649,0.0026681798,-0.028511552,0.004880205,-0.004085376,-0.017317876,0.0136468,0.008981977,0.04271422,0.012572116,0.010176383,-0.00056185276,-0.08278849,0.012592364,0.01215919,0.009803353,-0.029197501,-0.03304318,0.009712782,-0.031854615,0.0021533542,-0.046964936,-0.026111599,-0.034000747,-0.023225538,-0.029268578,0.010007051,0.032097302,0.06325986,0.03881379,-0.0018985497,-0.009029036,-0.009139551,-0.0124626,0.021971518,0.039626013,0.031327132,-0.008032233,-0.019604705,0.033366073,-0.026865326,-0.031870328,-0.002057602,0.030467171,0.04201059,-0.049723577,-0.033834368,0.0055480413,0.05916902,-0.017980559,0.041296046,0.038471993,-0.012031731,-0.010915341,-0.0026132728,0.07255861,-0.00014143762,-0.0096360445,-0.016832197,-0.025261384,-0.019555645,0.021978857,-0.00960337,-0.020916466,-0.0513544,-0.028999664,-0.0059528733,0.010852928,-0.013853867,-0.02040292,-0.0047721686,0.0075757327,-0.021391712,0.04260639,-0.033387963,0.02570652,0.016892103,0.008521211,0.015505475,-0.005445467,0.05498388,0.024176579,0.048282895,0.0080413325,-0.007574308,0.025973443,-0.02473009,-0.0028047515,-0.007602927,0.012159283,0.014793604,0.048253413,-0.0065901317,0.012784184,-0.042883895,-0.013301774,0.03517462,-0.013037468,0.0025428,0.061964475,-0.0035750496,0.06559971,-0.009489027,-0.051768,0.011360379,0.017449144,-0.001779978,-0.0067086928,-0.020254528,-0.02597501,-0.024420982,0.03551557,0.009517773,0.024091614,-0.009113567,-0.036984526,-0.009246098,-0.05353137,0.019907026,-0.015295892,0.02280224,0.02781255,0.030702613,0.024561608,0.0067303027,0.009961315,-0.049611557,-0.0059286687,-0.01576469,0.009506654,-0.020913525,0.026078394,-0.043661594,0.011714504,-0.008150973,0.0034151261,0.024689585,-0.003916488,-0.05002923,0.012951453,-0.0042923354,-0.024774965,0.03701858,-0.008111729,0.05212536,-0.025109122] AS VECTOR(1024))) AS l2_distance from github_event_issuesevent_embedding WHERE partition_date >= '2024-06-01' and partition_date <= '2024-06-08' and issue_body_embedding IS NOT NULL ORDER BY 3 ASC LIMIT 10;
"""

In [188]:
start = time.perf_counter()
session.sql("show schemas").collect()
end = time.perf_counter()
print(f"执行耗时: {end - start:.3f}s")

执行耗时: 0.115s


In [None]:
start = time.perf_counter()
session.sql(onetime_query).collect()
end = time.perf_counter()
print(f"执行耗时: {end - start:.3f}s")

## 查看并发情况的查询响应时间

In [None]:
# 主执行逻辑
if __name__ == "__main__":
    import time

    from concurrent.futures import ThreadPoolExecutor

    import warnings
    warnings.filterwarnings('ignore')

    # 执行参数配置
    NUM_QUERIES = 200  # 总执行次数
    CONCURRENT_WORKERS = 20  # 并发线程数

    total_start = time.perf_counter()

    # 封装执行逻辑（添加异常处理）
    def execute_query(_):
        start_time = time.perf_counter()

        result = session.sql(onetime_query).collect()  # 预清理查询语句

        return time.perf_counter() - start_time

    # 执行并发测试
    with ThreadPoolExecutor(max_workers=CONCURRENT_WORKERS) as executor:
        # 提交多轮任务（总次数=CONCURRENT_WORKERS * 轮数）
        futures = [executor.submit(execute_query, i) for i in range(NUM_QUERIES)]
        elapsed_times = [f.result() for f in futures if f.result() > 0]  # 过滤失败请求

    # 统计计算
    total_time = time.perf_counter() - total_start
    valid_count = len(elapsed_times)
    
    # 打印精简报告
    print("\n===== 高并发压力测试报告 =====")
    print(f"• 尝试次数: {NUM_QUERIES}次")
    print(f"• 成功次数: {valid_count}次")
    print(f"• 平均耗时: {sum(elapsed_times)/valid_count:.3f}s" if valid_count else "• 无成功执行记录")
    print(f"• 峰值耗时: {max(elapsed_times):.3f}s" if valid_count else "")
    print(f"• 总用时: {total_time:.3f}s")
    print(f"• 系统吞吐: {valid_count/total_time:.1f} qps" if valid_count else "")


## 运维优化

### 查看集群里的慢SQL

In [108]:
slow_jobs = session.sql(f"""
            SHOW JOBS IN VCLUSTER {config['vcluster']} 
            where execution_time >interval 3 second 
            limit 10""").to_pandas()
slow_jobs

Unnamed: 0,job_id,status,creator,priority,start_time,end_time,execution_time,vcluster_name,job_text,query_tag
0,2025031917304140922147964,SUCCEED,liangmo,9,2025-03-19 09:30:41.646000+00:00,2025-03-19 09:30:47.507000+00:00,"<DateOffset: days=0, microseconds=5861000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
1,2025031917304118250950813,SUCCEED,liangmo,9,2025-03-19 09:30:41.594000+00:00,2025-03-19 09:30:46.816000+00:00,"<DateOffset: days=0, microseconds=5222000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
2,2025031917304114660218252,SUCCEED,liangmo,9,2025-03-19 09:30:41.579000+00:00,2025-03-19 09:30:46.500000+00:00,"<DateOffset: days=0, microseconds=4921000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
3,2025031917304116341840160,SUCCEED,liangmo,9,2025-03-19 09:30:41.573000+00:00,2025-03-19 09:30:44.717000+00:00,"<DateOffset: days=0, microseconds=3144000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
4,2025031917304109743689899,SUCCEED,liangmo,9,2025-03-19 09:30:41.565000+00:00,2025-03-19 09:30:45.312000+00:00,"<DateOffset: days=0, microseconds=3747000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
5,2025031917304110336658840,SUCCEED,liangmo,9,2025-03-19 09:30:41.537000+00:00,2025-03-19 09:30:45.923000+00:00,"<DateOffset: days=0, microseconds=4386000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
6,2025031917304110864466538,SUCCEED,liangmo,9,2025-03-19 09:30:41.532000+00:00,2025-03-19 09:30:45.921000+00:00,"<DateOffset: days=0, microseconds=4389000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
7,2025031917304106287053654,SUCCEED,liangmo,9,2025-03-19 09:30:41.527000+00:00,2025-03-19 09:30:44.725000+00:00,"<DateOffset: days=0, microseconds=3198000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
8,2025031917304103215754417,SUCCEED,liangmo,9,2025-03-19 09:30:41.521000+00:00,2025-03-19 09:30:44.726000+00:00,"<DateOffset: days=0, microseconds=3205000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector
9,2025031917304107927665092,SUCCEED,liangmo,9,2025-03-19 09:30:41.514000+00:00,2025-03-19 09:30:44.868000+00:00,"<DateOffset: days=0, microseconds=3354000, mon...",default_ap,"\n SELECT row_id, issue_body,L...",test_zettapark_vector


### 查看慢查询的执行计划

In [None]:
# 查看第一个慢SQL的计划
slow_job_plan = session.sql("explain " + slow_jobs.iloc[0,8].replace("\n","")).to_pandas().iloc[0,1]
pretty_show_ddl(slow_job_plan)

### 小文件合并优化
由于一直在持续的对issue_body进行embedding，update issue_body_embedding的值，通过对分区里小文件的手工合并，达到查询性能优化的目的。

In [190]:
show_partitions_query = f"""
                        SHOW PARTITIONS EXTENDED {table_name} 
                        where partitions >= 'partition_date={begin_date}' 
                        and partitions <= 'partition_date={end_date}' 
                        and total_files>1;
"""

In [204]:
partitions_with_small_files = session.sql(show_partitions_query).to_pandas()
partitions_with_small_files

Unnamed: 0,partitions,total_rows,bytes,total_files,created_time,last_modified_time
0,partition_date=2024-06-01,64630,39724308,2,2025-03-12 11:07:46.509000+00:00,2025-03-12 11:07:46.509000+00:00
1,partition_date=2024-06-02,46110,30641432,2,2025-03-12 11:07:46.509000+00:00,2025-03-14 03:38:15.522000+00:00
2,partition_date=2024-06-04,78359,273105009,82,2025-03-19 21:09:11.094000+00:00,2025-03-20 02:39:07.772000+00:00
3,partition_date=2024-06-05,82353,673040145,15,2025-03-19 08:27:10.988000+00:00,2025-03-19 08:27:10.988000+00:00
4,partition_date=2024-06-06,94275,53527833,2,2025-03-12 11:07:46.509000+00:00,2025-03-14 03:26:49.891000+00:00


In [82]:
session.sql("SHOW VCLUSTERS").show()

+----------+-------------+-------------+---------------+---------+-------+--------------------+------------+------------+-----------------+-----------------+---------------------+--------------------+----------------+----------------------+-----------------------+------------+-----------+-------------+
|      name|vcluster_size|vcluster_type|max_concurrency|    state|creator|         create_time|min_replicas|max_replicas|min_vcluster_size|max_vcluster_size|current_vcluster_size|      preload_tables|current_replicas|auto_suspend_in_second|auto_scale_in_in_second|running_jobs|queued_jobs|error_message|
+----------+-------------+-------------+---------------+---------+-------+--------------------+------------+------------+-----------------+-----------------+---------------------+--------------------+----------------+----------------------+-----------------------+------------+-----------+-------------+
|    AD_HOC|            1|      GENERAL|              0|SUSPENDED|    N/A|2024-04-17 15:

In [197]:
# optimize命令必须运行在GENERAL类型的VCLUSTERS上，从上述VCLUSTERS列表里选择一个GENERAL类型。
# 从partitions_with_small_files找到小文件多的分区（total_files多，bytes/total_files多又很小，比如几MB甚至更小）
compact_small_files_sql = f"""
                        USE VCLUSTER default;
                        optimize  {table_name} where partition_date='2024-06-04';
                        """

In [198]:
try:
    result = session.sql(compact_small_files_sql).collect()
except Exception as e:
    print(f"optimize命令是异步执行，这个异常可以忽略: {str(e)}")



[get_arrow_result] Error while converting from arrow to result: Expected to read 962868537 metadata bytes, but only read 32


optimize命令是异步执行，这个异常可以忽略: [get_arrow_result] Error while converting from arrow to result: Expected to read 962868537 metadata bytes, but only read 32


In [199]:
session.sql(show_partitions_query).to_pandas()

Unnamed: 0,partitions,total_rows,bytes,total_files,created_time,last_modified_time
0,partition_date=2024-06-01,64630,39724308,2,2025-03-12 11:07:46.509000+00:00,2025-03-12 11:07:46.509000+00:00
1,partition_date=2024-06-02,46110,30641432,2,2025-03-12 11:07:46.509000+00:00,2025-03-14 03:38:15.522000+00:00
2,partition_date=2024-06-04,78359,93152316,5,2025-03-14 03:32:27.626000+00:00,2025-03-19 12:20:07.276000+00:00
3,partition_date=2024-06-05,82353,673040145,15,2025-03-19 08:09:06.123000+00:00,2025-03-19 08:27:10.988000+00:00
4,partition_date=2024-06-06,94275,53527833,2,2025-03-12 11:07:46.509000+00:00,2025-03-14 03:26:49.891000+00:00


### 检查和没有做向量化存储、向量索引、倒排索引和BF索引的分区数据量变化
github_event_issuesevent是原始底表，没有做向量化存储、向量索引、倒排索引和BF索引

对比查询到的bytes,考虑带来的存储成本的变化

In [157]:
show_partitions_without_embedding_query = f"""
                        SHOW PARTITIONS EXTENDED github_event_issuesevent 
                        where partitions = 'partition_date=2024-06-05';
"""

In [158]:
session.sql(show_partitions_without_embedding_query).to_pandas()

Unnamed: 0,partitions,total_rows,bytes,total_files,created_time,last_modified_time
0,partition_date=2024-06-05,82353,31918095,1,2025-03-12 09:04:23.916000+00:00,2025-03-12 09:04:23.916000+00:00


### 查看索引的详细情况

In [167]:
# 查看向量索引的详细情况
session.sql("desc index extended index_issue_body_vector;").to_pandas()

Unnamed: 0,info_name,info_value
0,name,index_issue_body_vector
1,creator,liangmo
2,created_time,2025-03-12 17:07:39.444
3,last_modified_time,2025-03-12 17:07:39.444
4,comment,
5,properties,"((""scalar.type"",""f32""),(""distance.function"",""l..."
6,index_type,vector
7,table_name,github_event_issuesevent_embedding
8,table_column,issue_body_embedding
9,index_size_in_data_file,433197584


In [168]:
# 查看倒排索引的详细情况
session.sql("desc index extended index_issue_body_inverted;").to_pandas()

Unnamed: 0,info_name,info_value
0,name,index_issue_body_inverted
1,creator,liangmo
2,created_time,2025-03-12 17:07:39.446
3,last_modified_time,2025-03-12 17:07:39.446
4,comment,
5,properties,"((""analyzer"",""unicode""))"
6,index_type,inverted
7,table_name,github_event_issuesevent_embedding
8,table_column,issue_body
9,index_size_in_data_file,31060178287


In [169]:
# 查看bloom_filter的详细情况
session.sql("desc index extended index_rowid_bf;").to_pandas()

Unnamed: 0,info_name,info_value
0,name,index_rowid_bf
1,creator,liangmo
2,created_time,2025-03-13 10:41:22.362
3,last_modified_time,2025-03-13 10:41:22.362
4,comment,
5,properties,()
6,index_type,bloom_filter
7,table_name,github_event_issuesevent_embedding
8,table_column,row_id


In [None]:
# session.close()