### Load env variables

In this example, we load the environment variables, including all secrets, from a local `.env` file. The `.env` file includes the following variables:

cz_username: Username for connecting to the Lakehouse service 

cz_password: Password for connecting to the Lakehouse service

cz_service: Name of the Lakehouse service to connect to

cz_instance: Instance name of the Lakehouse service to connect to

cz_workspace: Workspace name of the Lakehouse service to connect to

cz_schema: Schema name of the Lakehouse service to connect to

cz_vcluster: Virtual cluster name of the Lakehouse service to connect to

AWS_KEY: Key for connecting to AWS services

AWS_SECRET: Secret key for connecting to AWS services

AWS_S3_NAME: Bucket name for connecting to AWS S3 service

UNSTRUCTURED_API_KEY: API key for connecting to the UNSTRUCTURED API

UNSTRUCTURED_URL: URL for connecting to the UNSTRUCTURED API
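
Before running the rest of the notebook, it can help to confirm that every variable listed above is actually set, since an empty value only shows up later as an authentication error. The following is a minimal sketch; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative and not part of the original notebook.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "cz_username", "cz_password", "cz_service", "cz_instance",
    "cz_workspace", "cz_schema", "cz_vcluster",
    "AWS_KEY", "AWS_SECRET", "AWS_S3_NAME",
    "UNSTRUCTURED_API_KEY", "UNSTRUCTURED_URL",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {name: "x" for name in REQUIRED_VARS}
example_env["AWS_KEY"] = ""  # simulate a variable that is present but empty
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a second argument checks the real process environment after `dotenv.load_dotenv` has run.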


In [12]:
import os
import dotenv

dotenv.load_dotenv('./.env') # replace with the path to your .env file

True

In [13]:
!pip install pyiceberg boto3 pandas

Looking in indexes: https://pypi.org/simple/


In [15]:
import boto3
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, IntegerType
import pandas as pd

In [16]:
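# 1. Create the S3 bucket and configure the Iceberg catalog
# S3 bucket names may only contain lowercase letters, digits, hyphens, and dots.
BUCKET_NAME = os.getenv("AWS_S3_NAME", "") + "-iceberg-table"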
key=os.getenv("AWS_KEY")
secret=os.getenv("AWS_SECRET")
TABLE_NAME = "demo_iceberg_table"
REGION = "us-east-1"



In [17]:
key

''

In [None]:
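# Use explicit credentials
s3 = boto3.client(
    "s3",
    region_name=REGION,
    aws_access_key_id=key,
    aws_secret_access_key=secret
)

# us-east-1 is the default region and must not be passed as a LocationConstraint
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(Bucket=BUCKET_NAME, CreateBucketConfiguration={"LocationConstraint": REGION})

# NOTE: load_catalog normally also needs a catalog type or URI; adjust these properties for your setup.
catalog = load_catalog("s3", {"s3.endpoint": f"s3://{BUCKET_NAME}"})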

In [None]:
# 2. Define the Iceberg table schema
schema = Schema(
    NestedField.required(1, "id", IntegerType()),
    NestedField.optional(2, "name", StringType()),
)

# 3. Create the Iceberg table
try:
    catalog.create_table(
        identifier=TABLE_NAME,
        schema=schema,
        partition_spec=None,
        properties={"format-version": "2"}
    )
    print(f"Table {TABLE_NAME} created successfully!")
except Exception as e:
    print(f"Table creation failed: {e}")

# 4. Insert data into the Iceberg table
# Note: this writes a Parquet file to the bucket; it does not append to the Iceberg table itself.
data = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
data_file = f"s3://{BUCKET_NAME}/data.parquet"
data.to_parquet(data_file)

print("Data inserted successfully!")

# 5. Query the data
try:
    table = catalog.load_table(TABLE_NAME)
    snapshots = table.snapshots()
    print("Current data snapshots:")
    for snapshot in snapshots:
        print(snapshot)
except Exception as e:
    print(f"Data query failed: {e}")

# 6. Demonstrate Iceberg features (such as snapshots and transactions)
try:
    new_data = pd.DataFrame({"id": [4], "name": ["Diana"]})
    new_data_file = f"s3://{BUCKET_NAME}/new_data.parquet"
    new_data.to_parquet(new_data_file)
    print("New data added successfully; transactional operations are supported!")
except Exception as e:
    print(f"Operation failed: {e}")


In [None]:
# Create the connection to Singdata Lakehouse.
conn = get_connection(password=_password, username=_username, service=_service, instance=_instance, workspace=_workspace, schema=_schema, vcluster=_vcluster)

In [None]:
# Function to execute SQL statements
def excute_sql(conn,sql_statement: str):
    with conn.cursor() as cur:

        stmt = sql_statement

        cur.execute(stmt)

        results = cur.fetchall()

    return results

In [None]:
if drop_tables:
    excute_sql(conn,f"DROP TABLE IF EXISTS {_schema}.{raw_table_name}")
    excute_sql(conn,f"DROP TABLE IF EXISTS {_schema}.{silver_table_name}")

In [None]:
# Create Table in Singdata Lakehouse
excute_sql(conn, raw_table_ddl)
excute_sql(conn, silver_table_ddl)

[['OPERATION SUCCEED']]

Creating a database may take a few seconds. Let's check the status. We want to make sure that it says `healthy` before we begin writing into it.

![Image Alt Text](./image/unstructured_tables.png)


### PDFs/Images/Emails ingestion and preprocessing pipeline

The Unstructured ingestion and transformation pipeline is assembled from a number of required configs. They do not have to be passed in this exact order.

* `ProcessorConfig`: defines general processing behavior.

* `S3IndexerConfig`, `S3DownloaderConfig`, `S3ConnectionConfig`: control data ingestion from S3, including the source location and authentication options.

* `PartitionerConfig`: describes partitioning behavior. Here we only set up authentication for the Unstructured API, but you can also control [partitioning parameters](https://docs.unstructured.io/api-reference/ingest/ingest-configuration/partition-configuration) such as the partitioning strategy through this config. We're going with the defaults.

* `ChunkerConfig`: defines the chunking strategy and chunk sizes.

* `EmbedderConfig`: sets up the connection to an embedding model provider to generate embeddings for data chunks.

* `ClickzettaConnectionConfig`, `ClickzettaUploadStagerConfig`, `ClickzettaUploaderConfig`: control the final step of the pipeline, which loads the data into the Singdata Lakehouse RAW table.

In [None]:
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.connectors.fsspec.s3 import (
    S3ConnectionConfig,
    S3DownloaderConfig,
    S3IndexerConfig,
    S3AccessConfig,
)
from unstructured_ingest.v2.processes.embedder import EmbedderConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

from unstructured_ingest.v2.processes.connectors.sql.clickzetta import (
    ClickzettaConnectionConfig,
    ClickzettaAccessConfig,
    ClickzettaUploadStagerConfig,
    ClickzettaUploaderConfig
)


In [None]:
pipeline = Pipeline.from_configs(

    context=ProcessorConfig(
        verbose=True,
        tqdm=True,
        num_processes=20,
    ),

    indexer_config=S3IndexerConfig(remote_url=os.getenv("AWS_S3_NAME")),
    downloader_config=S3DownloaderConfig(),
    source_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key=os.getenv("AWS_KEY"),
            secret=os.getenv("AWS_SECRET"))
    ),

    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
    ),

    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=512,
        chunk_combine_text_under_n_chars=200,
    ),

    embedder_config=EmbedderConfig(
        embedding_provider="huggingface", # "langchain-huggingface" for ingest v<0.23
        embedding_model_name="BAAI/bge-base-en-v1.5",
    ),

    destination_connection_config=ClickzettaConnectionConfig(
        access_config=ClickzettaAccessConfig(password=_password),
        username=_username,
        service=_service,
        instance=_instance,
        workspace=_workspace,
        schema=_schema,
        vcluster=_vcluster,
    ),
    stager_config=ClickzettaUploadStagerConfig(),
    uploader_config=ClickzettaUploaderConfig(table_name=raw_table_name),
)

pipeline.run()
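
The `ChunkerConfig` settings above (`chunk_max_characters=512`, `chunk_combine_text_under_n_chars=200`) control how small sections are merged into larger chunks. The sketch below is a simplified illustration of that idea only; it is not Unstructured's actual `by_title` implementation, it does not split oversized sections, and the function name `combine_small_sections` is hypothetical.

```python
def combine_small_sections(sections, max_chars=512, combine_under=200):
    """Greedily merge a section into the previous chunk while that chunk is
    shorter than `combine_under` and the merged text fits in `max_chars`."""
    chunks = []
    for text in sections:
        fits = chunks and len(chunks[-1]) + 1 + len(text) <= max_chars
        if fits and len(chunks[-1]) < combine_under:
            chunks[-1] = chunks[-1] + " " + text
        else:
            chunks.append(text)
    return chunks


# Hypothetical section texts of different lengths.
sections = ["Intro", "A" * 300, "Short note", "Another short note", "B" * 500]
for chunk in combine_small_sections(sections):
    print(len(chunk))
```

Smaller values of `combine_under` keep more sections as separate chunks, while a larger `max_chars` allows more text per embedding.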

### Clean and transform the RAW table and insert into the Silver table

In [None]:
# You could execute more SQL statements to clean and transform the data before inserting it into the Silver table.
excute_sql(conn, clean_transformation_data_sql)

[['OPERATION SUCCEED']]

### Check the RAG-ready output

Let's query the Singdata Lakehouse to check the result. In the logs of the pipeline run above, you can see how many elements were uploaded during the upload step for each document.

In [None]:

def get_rag_ready_data(conn,  num_results: int = 5):
    with conn.cursor() as cur:

        stmt = f"""
            SELECT
                *
            FROM {silver_table_name}
            LIMIT {num_results}
        """

        cur.execute(stmt)

        results = cur.fetchall()
        columns = [desc[0] for desc in cur.description]  # Get column names from cursor description
        rag_ready_data_df = pd.DataFrame(results, columns=columns)
    return rag_ready_data_df

In [None]:
rag_ready_data_df = get_rag_ready_data(conn)
rag_ready_data_df

Unnamed: 0,id,record_locator,type,record_id,element_id,filetype,file_directory,filename,last_modified,languages,...,sent_to,subject,url,version,date_created,date_modified,date_processed,text_as_html,emphasized_text_contents,emphasized_text_tags
0,97e783aa-0e9e-5880-b675-50653b540ebc,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,d23ad16a-5c65-5bc3-bcd8-534bb13cced1,8c0a032c9cb87b17f493b290f05614b9,message/rfc822,,Register Now Complimentary Gartner webinars.eml,,"[""eng""]",...,"[""QILIANG@CLICKZETTA.COM""]",Register Now: Complimentary Gartner webinars,s3://unstructured-io/Register Now Complimentar...,284f556327e4a0d3014b8830884a6e60,2025-02-19 18:51:19+08:00,2025-02-19 18:51:19+08:00,2025-02-20 04:12:28.829406+08:00,<table><tr><td>Gartner 2025 Leadership Vision ...,,
1,e243b827-227d-59ea-9022-130b84d085e4,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,a83718dc-2b6b-554d-861f-16889038e34e,d8cf5ddb47377c8810c275d977500dc0,application/pdf,,Building an ETL Pipeline using PySpark.ipynb -...,,"[""eng""]",...,,,s3://unstructured-io/Building an ETL Pipeline ...,6d8aa51304c550b26dda27e1137121fc,2025-02-19 04:14:35+08:00,2025-02-19 04:14:35+08:00,2025-02-20 04:12:28.482226+08:00,,,
2,52f1cafa-825f-55db-b567-8f59524c741f,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,a83718dc-2b6b-554d-861f-16889038e34e,928937e57dfa74001a87cd3b9eb94dc7,application/pdf,,Building an ETL Pipeline using PySpark.ipynb -...,,"[""eng""]",...,,,s3://unstructured-io/Building an ETL Pipeline ...,6d8aa51304c550b26dda27e1137121fc,2025-02-19 04:14:35+08:00,2025-02-19 04:14:35+08:00,2025-02-20 04:12:28.482226+08:00,,,
3,c0e411e4-fd7c-5c9e-8fe2-12ebf06046c2,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,d23ad16a-5c65-5bc3-bcd8-534bb13cced1,1f6df8c317560b2a22211e5a80a04d78,message/rfc822,,Register Now Complimentary Gartner webinars.eml,,"[""eng""]",...,"[""QILIANG@CLICKZETTA.COM""]",Register Now: Complimentary Gartner webinars,s3://unstructured-io/Register Now Complimentar...,284f556327e4a0d3014b8830884a6e60,2025-02-19 18:51:19+08:00,2025-02-19 18:51:19+08:00,2025-02-20 04:12:28.829406+08:00,<table><tr><td>Trending Now The Art of the 1-P...,,
4,6ba3b1ee-f29d-54f2-a9bb-036acad78918,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,a83718dc-2b6b-554d-861f-16889038e34e,dda187b40afb515d4e5e948dcb6c0b68,application/pdf,,Building an ETL Pipeline using PySpark.ipynb -...,,"[""eng""]",...,,,s3://unstructured-io/Building an ETL Pipeline ...,6d8aa51304c550b26dda27e1137121fc,2025-02-19 04:14:35+08:00,2025-02-19 04:14:35+08:00,2025-02-20 04:12:28.482226+08:00,,,
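
The `cur.description` pattern used in `get_rag_ready_data` is part of the Python DB-API, so it can be tried out without a Lakehouse connection. The following is a minimal sketch using the standard-library `sqlite3` module as a stand-in; the table `demo_silver` and the helper `fetch_rows_as_dicts` are made up for illustration.

```python
import sqlite3
from contextlib import closing


def fetch_rows_as_dicts(conn, sql):
    """Run `sql` and return each row as a {column_name: value} dict."""
    with closing(conn.cursor()) as cur:
        cur.execute(sql)
        # The first item of each description entry is the column name.
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]


# In-memory database standing in for the silver table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_silver (id TEXT, type TEXT)")
demo_conn.execute("INSERT INTO demo_silver VALUES ('a1', 'Table')")
rows = fetch_rows_as_dicts(demo_conn, "SELECT * FROM demo_silver LIMIT 5")
print(rows)
```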


Alternatively, you can check the data in Singdata Lakehouse Studio.


![Image Alt Text](./image/unstructured_table_data.png)

### Retrieve relevant documents from Singdata Lakehouse


In [None]:
import json
from sentence_transformers import SentenceTransformer


def get_embedding(query):
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    return model.encode(query, normalize_embeddings=True)

def retrieve_documents(conn, query: str, num_results: int = 5):

    embedding = get_embedding(query)
    embedding_list = embedding.tolist()
    embedding_json = json.dumps(embedding_list)

    with conn.cursor() as cur:

        stmt = f"""
            WITH 
            vector_embedding_result AS (
            SELECT
                "vector_embedding" as retrieve_method,
                record_locator,
                type,
                filename,
                text,
                orig_elements,
                cosine_distance(embeddings, cast({embedding_list} as vector({embeddings_dimensions}))) AS score
            FROM {silver_table_name}
            ORDER BY score ASC
            LIMIT {num_results} 
            )
            SELECT    *  FROM      vector_embedding_result
           
            ORDER by score ASC;
        """

        cur.execute(stmt)

        results = cur.fetchall()
        columns = [desc[0] for desc in cur.description]  # Get column names from cursor description
        df = pd.DataFrame(results, columns=columns)
    return df



In [None]:
# query_text = "Harmon, Dave Scott, Bill Schmidt, Chris Teumer • Gain an action plan to hiring top IT talent • Understand how to best position yourself in the market to gain top talent • Learn why CIOs need to pay attention to hiring IT talent Register The Gartner 2025 Technology Adoption Roadmap for Infrastructure & Operations (I&O) Wednesday, February 19, 2025 EST: 10:00 a.m. | GMT: 15:00 Presented by: Ajeeta Malhotra and Amol Nadkarni • Discover why 66% of surveyed technologies are"
query_text = "What is gartner leadership vision for digital tech?"
retrieve_documents_df = retrieve_documents(conn, query_text)
retrieve_documents_df

Unnamed: 0,retrieve_method,record_locator,type,filename,text,orig_elements,score
0,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,Gartner 2025 Leadership Vision for Digital Tec...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.272893
1,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"Harmon, Dave Scott, Bill Schmidt, Chris Teumer...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.307342
2,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,a.m. | GMT: 15:00 Presented by: Christie Struc...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.318675
3,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.328295
4,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,expectations for your organization’s GenAI jou...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.344925
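
The query above orders by `score ASC`, so a lower score means a closer match. The sketch below shows why, assuming the Lakehouse `cosine_distance` function follows the common definition of one minus cosine similarity (this is an assumption, not something the notebook confirms). The helper is plain Python and only illustrative.

```python
import math


def cosine_distance(a, b):
    """Cosine distance under the common definition: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
print(cosine_distance(query_vec, [2.0, 0.0]))   # same direction: smallest distance
print(cosine_distance(query_vec, [0.0, 1.0]))   # orthogonal
print(cosine_distance(query_vec, [-1.0, 0.0]))  # opposite direction: largest distance
```

Because `get_embedding` passes `normalize_embeddings=True`, the query vector already has unit length, so the distance depends only on the angle between the vectors.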


In [None]:
def match_all_documents(conn, query: str, num_results: int = 1):
    with conn.cursor() as cur:

        stmt = f"""
            WITH 
            scalar_match_all_result AS (
            SELECT
                "scalar_match_all" as retrieve_method,
                record_locator,
                type,
                filename,
                text,
                orig_elements,
                -100 AS score
            FROM {silver_table_name}
            WHERE match_all(
                    text,
                    "{query}",
                    map("analyzer", "unicode")
                    )
            ORDER BY score ASC
            LIMIT {num_results} 
            )
            SELECT    *  FROM      scalar_match_all_result
            ORDER by score ASC;
        """

        cur.execute(stmt)

        results = cur.fetchall()
        columns = [desc[0] for desc in cur.description]  # Get column names from cursor description
        df = pd.DataFrame(results, columns=columns)
    return df

In [None]:
match_all_documents_df = match_all_documents(conn,query_text)
match_all_documents_df

Unnamed: 0,retrieve_method,record_locator,type,filename,text,orig_elements,score


In [None]:
def match_any_documents(conn, query: str, num_results: int = 5):
    with conn.cursor() as cur:

        stmt = f"""
            WITH 
            scalar_match_any_result AS (
            SELECT
                "scalar_match_any" as retrieve_method,
                record_locator,
                type,
                filename,
                text,
                orig_elements,
                0 AS score
            FROM {silver_table_name}
            WHERE match_any(
                    text,
                    "{query}",
                    map("analyzer", "unicode")
                    )
            ORDER BY score ASC
            LIMIT {num_results} 
            )
            SELECT    *  FROM      scalar_match_any_result
            ORDER by score ASC;
        """

        cur.execute(stmt)

        results = cur.fetchall()
        columns = [desc[0] for desc in cur.description]  # Get column names from cursor description
        df = pd.DataFrame(results, columns=columns)
    return df

In [None]:
match_any_documents_df = match_any_documents(conn,query_text)
match_any_documents_df

Unnamed: 0,retrieve_method,record_locator,type,filename,text,orig_elements,score
0,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,Key Steps in the Pipeline:\n\n1-Extract:\n\nTh...,eJztVNtu1DAQ/RUrTyC1u3GuTh8pRUIgQNoVL1W1msSTXY...,0
1,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,from pyspark.sql.functions import expr # resha...,eJztVN9P2zAQ/les8NBWKyXOzxZpL0NMQprEJNgDoqhy7H...,0
2,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"within the C-suite Register CIOs, Take a Blend...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0
3,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,deemed medium or high value • Understand how c...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0
4,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0
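
The empty `match_all` result and the broad `match_any` result can be understood with a small token-matching sketch. It assumes that `match_all` requires every query term to appear in the text and that `match_any` requires at least one, which is what the function names suggest. The tokenizer below is a simple stand-in and not the Lakehouse `unicode` analyzer.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def matches_all(text, query):
    """True if every query token appears in the text."""
    return tokenize(query) <= tokenize(text)


def matches_any(text, query):
    """True if at least one query token appears in the text."""
    return bool(tokenize(query) & tokenize(text))


query = "What is gartner leadership vision for digital tech?"
chunk = "Gartner 2025 Leadership Vision for Digital Technology and Business Services"
print(matches_all(chunk, query), matches_any(chunk, query))
```

Under these assumptions, a natural-language question rarely passes an all-terms match because words such as "what" or "tech" are missing from the chunk, while an any-term match also accepts chunks that only share common words such as "is" or "for".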


In [None]:
merged_df = pd.concat([retrieve_documents_df, match_all_documents_df, match_any_documents_df], ignore_index=True)
merged_df = merged_df.sort_values(by='score', ascending=True)
merged_df




Unnamed: 0,retrieve_method,record_locator,type,filename,text,orig_elements,score
5,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,Key Steps in the Pipeline:\n\n1-Extract:\n\nTh...,eJztVNtu1DAQ/RUrTyC1u3GuTh8pRUIgQNoVL1W1msSTXY...,0.0
6,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,from pyspark.sql.functions import expr # resha...,eJztVN9P2zAQ/les8NBWKyXOzxZpL0NMQprEJNgDoqhy7H...,0.0
7,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"within the C-suite Register CIOs, Take a Blend...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0
8,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,deemed medium or high value • Understand how c...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0
9,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0
0,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,Gartner 2025 Leadership Vision for Digital Tec...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.272893
1,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"Harmon, Dave Scott, Bill Schmidt, Chris Teumer...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.307342
2,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,a.m. | GMT: 15:00 Presented by: Christie Struc...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.318675
3,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.328295
4,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,expectations for your organization’s GenAI jou...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.344925
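
The merged result can contain the same chunk more than once when several retrieval methods return it (the rows at index 3 and 9 above appear to be the same chunk). A minimal sketch of removing such duplicates before reranking follows; it uses plain dictionaries, and the helper name `dedupe_by_text` is illustrative.

```python
def dedupe_by_text(rows):
    """Keep the first row seen for each distinct `text` value."""
    seen = set()
    unique = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique


# Hypothetical merged rows from two retrieval methods.
merged_rows = [
    {"retrieve_method": "scalar_match_any", "text": "common use cases for GenAI"},
    {"retrieve_method": "vector_embedding", "text": "Gartner 2025 Leadership Vision"},
    {"retrieve_method": "vector_embedding", "text": "common use cases for GenAI"},
]
for row in dedupe_by_text(merged_rows):
    print(row["retrieve_method"], "|", row["text"])
```

With a DataFrame, `merged_df.drop_duplicates(subset="text")` has the same effect.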


In [None]:
import pandas as pd
import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer


# Define the rerank function
def rerank_texts(query, texts, model_name="BAAI/bge-reranker-v2-m3", normalize=True):
    """
    Rerank a list of texts based on their relevance to a given query using the specified reranker model.

    Parameters:
    - query: The query string.
    - texts: List of texts to be reranked.
    - model_name: The name of the reranker model to use.
    - normalize: Whether to normalize the scores to the [0, 1] range using the sigmoid function.

    Returns:
    - A list of reranked texts.
    - A list of corresponding scores.
    """
    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    # Prepare input pairs [query, text]
    pairs = [[query, text] for text in texts]
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Get relevance scores
    with torch.no_grad():
        outputs = model(**inputs)
        scores = outputs.logits.view(-1).cpu().numpy()

    # Normalize scores to [0, 1] if required
    if normalize:
        scores = 1 / (1 + np.exp(-scores))

    # Combine texts with scores and sort by score in descending order
    scored_texts = list(zip(texts, scores))
    scored_texts.sort(key=lambda x: x[1], reverse=True)

    # Separate the sorted texts and scores
    sorted_texts, sorted_scores = zip(*scored_texts)

    return list(sorted_texts), list(sorted_scores)

In [None]:
# Example usage
# query = "Which session is presented by Ajeeta Malhotra and Amol Nadkarni?"
query = "What is gartner leadership vision for digital tech?"
sorted_texts, sorted_scores = rerank_texts(query, merged_df["text"].tolist())

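# Add the reranked texts and scores as new columns.
# Note: both lists are sorted by rerank score, so they are not aligned with the other columns of each row.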
merged_df["reranked_text"] = sorted_texts
merged_df["rerank_score"] = sorted_scores

In [None]:
merged_df

Unnamed: 0,retrieve_method,record_locator,type,filename,text,orig_elements,score,reranked_text,rerank_score
5,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,Key Steps in the Pipeline:\n\n1-Extract:\n\nTh...,eJztVNtu1DAQ/RUrTyC1u3GuTh8pRUIgQNoVL1W1msSTXY...,0.0,Gartner 2025 Leadership Vision for Digital Tec...,0.725993
6,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",CompositeElement,Building an ETL Pipeline using PySpark.ipynb -...,from pyspark.sql.functions import expr # resha...,eJztVN9P2zAQ/les8NBWKyXOzxZpL0NMQprEJNgDoqhy7H...,0.0,common use cases for GenAI • Learn best practi...,0.603706
7,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"within the C-suite Register CIOs, Take a Blend...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0,common use cases for GenAI • Learn best practi...,0.603706
8,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,deemed medium or high value • Understand how c...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0,a.m. | GMT: 15:00 Presented by: Christie Struc...,0.561251
9,scalar_match_any,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.0,expectations for your organization’s GenAI jou...,0.018261
0,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,Gartner 2025 Leadership Vision for Digital Tec...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.272893,"Harmon, Dave Scott, Bill Schmidt, Chris Teumer...",0.00454
1,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,"Harmon, Dave Scott, Bill Schmidt, Chris Teumer...",eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.307342,deemed medium or high value • Understand how c...,0.000239
2,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,a.m. | GMT: 15:00 Presented by: Christie Struc...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.318675,"within the C-suite Register CIOs, Take a Blend...",7.9e-05
3,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,common use cases for GenAI • Learn best practi...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.328295,from pyspark.sql.functions import expr # resha...,1.7e-05
4,vector_embedding,"{""protocol"": ""s3"", ""remote_file_path"": ""s3://u...",Table,Register Now Complimentary Gartner webinars.eml,expectations for your organization’s GenAI jou...,eJztXA1zI7eR/SsopZJLqiQtKVFfG5fLWkora3e10kl0tn...,0.344925,Key Steps in the Pipeline:\n\n1-Extract:\n\nTh...,1.6e-05
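
In the table above, `reranked_text` and `rerank_score` are sorted by rerank score on their own, so they do not line up with the `filename` or `text` of the same row. If the metadata should stay attached to each reranked chunk, one option is to score the rows in place and then sort whole rows. The sketch below shows this with plain dictionaries and hypothetical raw scores; `attach_rerank_scores` is an illustrative helper and not part of the notebook.

```python
import math


def sigmoid(x):
    """Map a raw reranker score to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def attach_rerank_scores(rows, raw_scores):
    """Return copies of `rows` with a `rerank_score` each, sorted best first."""
    scored = [dict(row, rerank_score=sigmoid(s)) for row, s in zip(rows, raw_scores)]
    return sorted(scored, key=lambda r: r["rerank_score"], reverse=True)


rows = [
    {"filename": "a.pdf", "text": "ETL steps"},
    {"filename": "b.eml", "text": "Leadership vision"},
]
raw_scores = [-4.0, 1.0]  # hypothetical reranker logits, one per row
ranked = attach_rerank_scores(rows, raw_scores)
print(ranked[0]["filename"], round(ranked[0]["rerank_score"], 3))
```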


In [None]:
# Get the first row of the DataFrame, which holds the text with the highest rerank_score
first_row_reranked_text = merged_df.iloc[0]['reranked_text']
print(first_row_reranked_text)

Gartner 2025 Leadership Vision for Digital Technology and Business Services Wednesday, February 19, 2025 EST: 11:00 a.m. | GMT: 16:00 Presented by: Chrissy Healey, Scott Frederick and Jennifer Barry • Revert back to growth by defining and delivering transformative impact • Resolve the asset and AI-first dilemma in delivery • Decode demand in your top accounts Register How U.S. Government Executives Can Navigate Upcoming Workforce Changes Friday, February 21, 2025 EDT: 10:00


### Summary: Benefits for RAG Application Development

![Image Alt Text](./image/UnstructuredDataPipelineBenifits.png)


**Enhanced Search Efficiency:** 
- By supporting both inverted and vector searches, this table allows RAG applications to efficiently retrieve relevant information based on both text content and semantic similarity. This enhances the model's ability to find and generate contextually relevant responses.

**Improved Accuracy:** 
- The combination of full-text and similarity searches ensures that RAG applications can access a broader range of relevant data, improving the accuracy and relevance of generated content.

**Scalability:** 
- With optimized indexes, the table can handle large volumes of data and perform searches quickly, supporting the scalability needs of RAG applications.

**Simplified Architecture:** 
- Combining inverted text and vector search capabilities in a single table eliminates the need for separate text and vector search databases. This simplifies maintenance, reduces operational overhead, and improves development efficiency.

**Data Consistency:** 
- Reducing the number of data replicas from three to one enhances data consistency, minimizes data duplication, and reduces the need for data synchronization and movement.

Overall, this Singdata Lakehouse architecture reduces operational complexity, enhances data consistency, and improves development efficiency, making it ideal for effective RAG application development.