# Hybrid Search with LlamaIndex & KDB.ai

## Install dependencies

In [1]:
# !pip install llama-index llama-index-embeddings-huggingface llama-index-llms-openai llama-index-readers-file llama-index-vector-stores-kdbai
# !pip install kdbai_client langchain-text-splitters pandas

## Downloading data

**Libraries**

In [2]:
import os
import urllib.request

**Data directories and paths**

In [3]:
# Root path
root_path = os.path.abspath(os.getcwd())

# Data directory and path
data_dir = "data"
data_path = os.path.join(root_path, data_dir)
# Create the data directory if it is missing (no error if it already exists)
os.makedirs(data_path, exist_ok=True)

**Downloading text**

In [4]:
text_url = "https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/hybrid_search/data/inflation.txt"
with urllib.request.urlopen(text_url) as response:
    text_content = response.read().decode("utf-8")

text_file_name = text_url.split('/')[-1]
text_path = os.path.join(data_path, text_file_name)
if not os.path.exists(text_path):
    with open(text_path, 'w', encoding="utf-8") as text_file:
        text_file.write(text_content)

metadata = {
    f"{data_dir}/{text_file_name}": {
        "title": text_file_name,
        "file_path": text_path
    }
}

**Show text data**

In [5]:
def show_text(text_path):
    with open(text_path, 'r', encoding="utf-8") as text_file:
        contents = text_file.read()
    print(contents[:500])
    print("="*80)

In [6]:
show_text(text_path)

 At last year's Jackson Hole symposium, I delivered a brief, direct message. My remarks this year will be a bit longer, but the message is the same: It is the Fed's job to bring inflation down to our 2 percent goal, and we will do so. We have tightened policy significantly over the past year. Although inflation has moved down from its peak—a welcome development—it remains too high. We are prepared to raise rates further if appropriate, and intend to hold policy at a restrictive level until we ar


## KDB.ai Vector Database - session and table

**Libraries**

In [7]:
import kdbai_client as kdbai 

**KDB.ai session**

In [8]:
KDBAI_ENDPOINT = "http://localhost:8085"
session = kdbai.Session(endpoint=KDBAI_ENDPOINT)

**KDB.ai table**

In [9]:
# Table - name & schema
table_name = "hs_docs"
table_schema = {
    "columns": [
        dict(name="document_id", pytype="bytes"),
        dict(name="text", pytype="bytes"),
        dict(
            name="embedding",
            pytype="float32",
            vectorIndex=dict(type="flat", metric="L2", dims=768)
        ),
        dict(
            name="sparseVectors",
            pytype="dict",
            sparseIndex=dict(k=1.25, b=0.75)
        ),
        dict(name="title", pytype="str"),
        dict(name="file_path", pytype="str")
    ]
}
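
The `k` and `b` values on the `sparseVectors` column look like the usual BM25 parameters: `k` controls how quickly repeated terms stop adding to the score, and `b` controls how strongly long documents are penalised. The sketch below assumes the classic Okapi BM25 formula. `bm25_score` and the toy corpus are hypothetical names used for illustration only, not KDB.ai internals.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k=1.25, b=0.75):
    """Okapi BM25 score of one tokenised document for a query (illustrative)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # number of docs containing term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation (k) and length normalisation (b)
        norm = tf[term] + k * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k + 1) / norm
    return score

# Toy corpus, made up for this example
corpus = [
    "inflation on a 12-month basis".split(),
    "total hours worked has been flat".split(),
    "housing services inflation lagged".split(),
]
query_terms = "12-month basis".split()
scores = [bm25_score(query_terms, doc, corpus) for doc in corpus]
print(scores)
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score. This is the exact-term behaviour that the sparse side of hybrid search contributes.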

In [10]:
# Drop table if exists
if table_name in session.list():
    session.table(table_name).drop()

In [11]:
# Texts table
table = session.create_table(table_name, table_schema)

In [12]:
# Table schema
table.schema()

{'columns': [{'name': 'document_id', 'qtype': 'string', 'pytype': 'bytes'},
  {'name': 'text', 'qtype': 'string', 'pytype': 'bytes'},
  {'name': 'embedding',
   'vectorIndex': {'type': 'flat', 'metric': 'L2', 'dims': 768},
   'qtype': 'reals',
   'pytype': 'float32'},
  {'name': 'sparseVectors',
   'sparseIndex': {'k': 1.25, 'b': 0.75},
   'qtype': '',
   'pytype': 'dict'},
  {'name': 'title', 'qtype': 'symbol', 'pytype': 'str'},
  {'name': 'file_path', 'qtype': 'symbol', 'pytype': 'str'}]}
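
The schema above declares a `flat` index with the `L2` metric. A flat index is generally understood as exhaustive search: the query vector is compared with every stored vector and the closest ones are returned. The sketch below shows that idea in plain Python on 2-dimensional toy vectors. `l2_distance` and `flat_search` are hypothetical helpers, not part of `kdbai_client`.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(query_vec, vectors, top_k=2):
    """Exhaustive search: score every stored vector, keep the closest top_k."""
    ranked = sorted((l2_distance(query_vec, v), i) for i, v in enumerate(vectors))
    return [(i, dist) for dist, i in ranked[:top_k]]

# Toy vectors, made up for this example (the real table uses 768 dimensions)
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = flat_search([0.9, 1.1], vectors, top_k=2)
print(hits)
```

Exhaustive search is exact but scales linearly with the number of rows, which is acceptable for a single document split into a few dozen chunks.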

## Loading data

**Libraries**

In [13]:
# Use LangChain's recursive character text splitter to generate text chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter
from llama_index.core import Document

**Loading data: texts and metadata**

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)
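
To give a feel for what recursive character splitting does, here is a simplified sketch: split on the coarsest separator first, pack pieces greedily up to `chunk_size`, and fall back to finer separators only for pieces that are still too long. It is not LangChain's implementation, ignores overlap (the notebook sets `chunk_overlap=0`), and `split_recursive` is a hypothetical name.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Simplified recursive splitter (illustrative, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: split it with the finer separators
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph here.\n\nSecond paragraph is a little longer than the first."
chunks = split_recursive(sample, chunk_size=30)
print(chunks)
```

The first paragraph fits in one chunk, while the second is broken at word boundaries, so no chunk exceeds the size limit and no word is cut in half.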

In [15]:
docs = []
for fpath, fmetadata in metadata.items():
    with open(fpath, 'r', encoding="utf-8") as f:
        fcontent = f.read()

    texts = text_splitter.create_documents([fcontent])

    for text in texts: 
        doc = Document(
            text=text.page_content,
            metadata={
                "title": fmetadata['title'],
                "file_path": fmetadata['file_path']
            }
        )
        docs.append(doc)

## Creating Vector Store Index for data

**Text embeddings model**

In [16]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [17]:
EMBEDDING = "sentence-transformers/all-mpnet-base-v2"
embeddings_model = HuggingFaceEmbedding(model_name=EMBEDDING)



**Create the vector store, storage context, and index for retrieval and querying**

In [18]:
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.core.indices import VectorStoreIndex

In [19]:
%%time

# Vector Store
text_store = KDBAIVectorStore(table=table, hybrid_search=True)

# Storage context
storage_context = StorageContext.from_defaults(
    vector_store=text_store, 
)

# Settings
Settings.embed_model = embeddings_model
Settings.llm = None

# Vector Store Index
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
)

LLM is explicitly disabled. Using MockLLM.
CPU times: user 35.7 s, sys: 509 ms, total: 36.3 s
Wall time: 24.7 s


## Retrieval from query using Hybrid Search 

**Query**

In [20]:
query = '12-month basis'

**Helper function: To display search results**

In [21]:
import pandas as pd

In [22]:
def display_search_results(nodes):
    # One row per retrieved node: its score and its text
    rows = [(node.score, node.text) for node in nodes]
    return pd.DataFrame(rows, columns=['score', 'text'])

**Hybrid Search: Giving equal priority to both sparse and dense vector search ($\alpha=0.5$)**

In [23]:
%%time

retriever = index.as_retriever(similarity_top_k=5, vector_store_query_mode="hybrid")

CPU times: user 35 μs, sys: 1 μs, total: 36 μs
Wall time: 39.6 μs


In [24]:
equal_priority_nodes = retriever.retrieve(query)
display_search_results(equal_priority_nodes)

| | score | text |
|---|---|---|
| 0 | 0.25 | Total hours worked has been flat over the past... |
| 1 | 0.166667 | coming quarters. Twelve-month core inflation i... |
| 2 | 0.125 | Today I will review our progress so far and di... |
| 3 | 0.1 | The final category, nonhousing services, accou... |
| 4 | 0.083333 | Measured housing services inflation lagged the... |


**Hybrid Search: Giving more priority to sparse vector search ($\alpha=0.1$)**

In [25]:
%%time

retriever = index.as_retriever(similarity_top_k=5, vector_store_query_mode="hybrid", alpha=0.1)

CPU times: user 33 μs, sys: 1 μs, total: 34 μs
Wall time: 37 μs


In [26]:
sparse_priority_nodes = retriever.retrieve(query)
display_search_results(sparse_priority_nodes)

| | score | text |
|---|---|---|
| 0 | 0.05 | Total hours worked has been flat over the past... |
| 1 | 0.033333 | coming quarters. Twelve-month core inflation i... |
| 2 | 0.025 | Today I will review our progress so far and di... |
| 3 | 0.02 | The final category, nonhousing services, accou... |
| 4 | 0.016667 | Measured housing services inflation lagged the... |


**Hybrid Search: Giving more priority to dense vector search ($\alpha=0.9$)**

In [27]:
%%time

retriever = index.as_retriever(similarity_top_k=5, vector_store_query_mode="hybrid", alpha=0.90)

CPU times: user 33 μs, sys: 1e+03 ns, total: 34 μs
Wall time: 37 μs


In [28]:
dense_priority_nodes = retriever.retrieve(query)
display_search_results(dense_priority_nodes)

| | score | text |
|---|---|---|
| 0 | 0.45 | Total hours worked has been flat over the past... |
| 1 | 0.3 | coming quarters. Twelve-month core inflation i... |
| 2 | 0.225 | Today I will review our progress so far and di... |
| 3 | 0.18 | The final category, nonhousing services, accou... |
| 4 | 0.15 | Measured housing services inflation lagged the... |


**Conclusion**
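- When hybrid search is biased toward sparse search ($\alpha=0.1$), ranking favors chunks that contain the query terms themselves, here "12-month basis", over chunks that are only similar in meaning.
- When hybrid search is biased toward dense search ($\alpha=0.9$), ranking favors the chunks that are semantically closest to the query, even if they do not contain its exact terms.
- In the run shown above, all three settings return the same five chunks in the same order, and only the scores change in proportion to $\alpha$.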
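
As a rough mental model of what $\alpha$ does, the sketch below blends two sets of per-chunk scores as `alpha * dense + (1 - alpha) * sparse`. This assumes a simple weighted sum over already-normalised scores. The notebook does not show how KDB.ai actually normalises or fuses the two result lists, so `fuse_scores` and the toy scores are hypothetical.

```python
def fuse_scores(dense, sparse, alpha=0.5):
    """Blend per-chunk scores: alpha weights dense, (1 - alpha) weights sparse."""
    doc_ids = set(dense) | set(sparse)
    fused = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Toy, already-normalised scores (made up for illustration)
dense = {"chunk_a": 0.9, "chunk_b": 0.4}
sparse = {"chunk_b": 1.0, "chunk_c": 0.6}

sparse_biased = fuse_scores(dense, sparse, alpha=0.1)
dense_biased = fuse_scores(dense, sparse, alpha=0.9)
print(sparse_biased[0][0], dense_biased[0][0])
```

With these toy inputs, the sparse-biased setting puts the exact-match chunk first and the dense-biased setting puts the semantically closest chunk first, which is the trade-off that `alpha` is meant to control.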