# Databricks Vector Search

Databricks Vector Search is a vector database built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. Full documentation: https://docs.databricks.com/en/generative-ai/vector-search.html

Install llama-index, llama-index-vector-stores-databricks, and databricks-vectorsearch. Outside of a Databricks runtime, you can authenticate the VectorSearchClient by passing a workspace URL and a personal access token.

More details here: https://api-docs.databricks.com/python/vector-search/databricks.vector_search.html#databricks.vector_search.client.VectorSearchClient

In [None]:
%pip install llama-index llama-index-vector-stores-databricks
%pip install -U databricks-vectorsearch

Import Databricks dependencies

In [None]:
from databricks.vector_search.client import (
    VectorSearchIndex,
    VectorSearchClient,
)

Import LlamaIndex dependencies

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)

from llama_index.vector_stores.databricks import DatabricksVectorSearch

Load example data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Read the data

In [None]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Create a Databricks Vector Search endpoint which will serve the index

In [None]:
# Create a vector search endpoint
# NOTE: Outside a Databricks runtime, pass workspace_url and personal_access_token
# to VectorSearchClient to authenticate
client = VectorSearchClient()
client.create_endpoint_and_wait(
    name="llamaindex_dbx_vector_store_test_endpoint", endpoint_type="STANDARD"
)
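
The note in the cell above mentions authenticating with a workspace URL and token. A minimal sketch of one way to supply them is shown here, assuming the credentials live in environment variables; the variable names DATABRICKS_HOST and DATABRICKS_TOKEN and the helper `vector_search_client_kwargs` are illustrative choices, not something this notebook requires.

```python
import os
from typing import Mapping, Optional


def vector_search_client_kwargs(
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build keyword arguments for VectorSearchClient from environment variables.

    DATABRICKS_HOST and DATABRICKS_TOKEN are assumed names for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("DATABRICKS_HOST")
    token = env.get("DATABRICKS_TOKEN")
    if not host and not token:
        # Nothing set: fall back to the client's default authentication,
        # which is what the bare VectorSearchClient() call relies on.
        return {}
    if not host or not token:
        raise ValueError(
            "Set both DATABRICKS_HOST and DATABRICKS_TOKEN, or neither."
        )
    return {
        "workspace_url": host.rstrip("/"),
        "personal_access_token": token,
    }


# Usage (requires databricks-vectorsearch to be installed):
# client = VectorSearchClient(**vector_search_client_kwargs())
```

Reading the token from the environment keeps it out of the notebook source, which matters if the notebook is shared or committed.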

## Basic Usage 

Create the Databricks Vector Search index, and build it from the documents

In [None]:
# Create a vector search index
# it must be placed inside a Unity Catalog-enabled schema

# We'll use self-managed embeddings (i.e. managed by LlamaIndex) rather than a Databricks-managed index
databricks_index = client.create_direct_access_index(
    endpoint_name="llamaindex_dbx_vector_store_test_endpoint",
    index_name="my_catalog.my_schema.my_test_table",
    primary_key="my_primary_key_name",
    embedding_dimension=1536,  # match the embeddings model dimension you're going to use
    embedding_vector_column="my_embedding_vector_column_name",  # you can name this anything you want; the LlamaIndex class picks it up
    schema={
        "my_primary_key_name": "string",
        "my_embedding_vector_column_name": "array<double>",
        "text": "string",  # one column must match the text_column in the DatabricksVectorSearch instance created below; this will hold the raw node text
        "doc_id": "string",  # one column must contain the reference document ID (this will be populated by LlamaIndex automatically)
        # add any other metadata you may have in your nodes (Databricks Vector Search supports metadata filtering)
        # NOTE THAT THESE FIELDS MUST BE ADDED EXPLICITLY TO BE USED FOR METADATA FILTERING
    },
)

databricks_vector_store = DatabricksVectorSearch(
    index=databricks_index,
    text_column="text",  # text_column is required for self-managed embeddings
    columns=None,  # YOU MUST ALSO RECORD YOUR METADATA FIELD NAMES HERE
)
storage_context = StorageContext.from_defaults(
    vector_store=databricks_vector_store
)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
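
The comments in the cell above say that metadata fields must be declared twice: once in the index schema and once in the columns argument. The sketch below shows one way to keep the two in sync by deriving both from a single dictionary. The helper `build_schema_and_columns` and the metadata fields `file_name` and `page` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def build_schema_and_columns(
    metadata_fields: Dict[str, str],
    primary_key: str = "my_primary_key_name",
    embedding_column: str = "my_embedding_vector_column_name",
    text_column: str = "text",
) -> Tuple[Dict[str, str], List[str]]:
    """Return (schema, columns) so each metadata field is declared in both."""
    reserved = {primary_key, embedding_column, text_column, "doc_id"}
    clash = reserved & set(metadata_fields)
    if clash:
        raise ValueError(
            f"Metadata fields clash with reserved columns: {sorted(clash)}"
        )
    schema = {
        primary_key: "string",
        embedding_column: "array<double>",
        text_column: "string",
        "doc_id": "string",
        **metadata_fields,  # extra fields, usable for metadata filtering
    }
    return schema, list(metadata_fields)


# Hypothetical metadata fields, for illustration only.
schema, columns = build_schema_and_columns(
    {"file_name": "string", "page": "int"}
)
print(schema)
print(columns)
```

The resulting `schema` could be passed to `create_direct_access_index` and `columns` to `DatabricksVectorSearch`, so a field added in one place cannot be forgotten in the other.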

Query the index

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")

print(response.response)

## Advanced Usage (Auto Merging Retriever)

Because the index stores node relationships, advanced features that use the hierarchical structure of nodes can improve search results.

In this demo, we use the Auto Merging Retriever, which replaces retrieved child nodes with their parent node when the number of retrieved children exceeds a threshold.

For more on Auto Merging Retriever, please refer to https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/
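
To make the merging rule concrete, here is a simplified, single-pass sketch in plain Python. It is not the library's implementation: the function `auto_merge`, the `ratio_threshold` parameter, and the choice to express the threshold as a fraction of a parent's children are all assumptions made for illustration.

```python
from typing import Dict, List


def auto_merge(
    retrieved: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    ratio_threshold: float = 0.5,
) -> List[str]:
    """Replace retrieved children with their parent when enough siblings were hit."""
    retrieved_set = set(retrieved)
    merged: List[str] = []
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        if parent is not None:
            siblings = children_of[parent]
            hits = sum(1 for s in siblings if s in retrieved_set)
            if hits / len(siblings) > ratio_threshold:
                node_id = parent  # enough children retrieved: use the parent
        if node_id not in merged:  # keep order, drop duplicates
            merged.append(node_id)
    return merged


children_of = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f"]}
parent_of = {c: p for p, cs in children_of.items() for c in cs}
print(auto_merge(["a", "b", "d"], parent_of, children_of))  # ['p1', 'd']
```

Two of the three children of `p1` were retrieved, so they collapse into `p1`; only one child of `p2` was retrieved, so `d` stays as it is.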

In [None]:
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

In [None]:
# Initialize the node parser
node_parser = HierarchicalNodeParser.from_defaults()

# Split documents into nodes and save leaf nodes
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
print(len(leaf_nodes))
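
As a rough mental model of what the cell above produces, the sketch below builds a parent/child hierarchy by splitting text into progressively smaller chunks and then selects the nodes without children. It splits on raw character counts and uses the hypothetical helpers `build_hierarchy` and `leaf_nodes_of`, so treat it as an approximation of the idea rather than the parser's actual behavior.

```python
from typing import Dict, List, Optional


def build_hierarchy(text: str, chunk_sizes: List[int]) -> List[Dict]:
    """Split text into nested character chunks, one level per chunk size."""
    nodes: List[Dict] = []

    def split(chunk: str, level: int, parent: Optional[int]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(chunk), size):
            piece = chunk[start : start + size]
            node_id = len(nodes)
            nodes.append(
                {"id": node_id, "parent": parent, "text": piece, "children": []}
            )
            if parent is not None:
                nodes[parent]["children"].append(node_id)
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)  # recurse into smaller chunks

    split(text, 0, None)
    return nodes


def leaf_nodes_of(nodes: List[Dict]) -> List[Dict]:
    """Leaves are the nodes that have no children."""
    return [n for n in nodes if not n["children"]]


demo_nodes = build_hierarchy("x" * 100, chunk_sizes=[64, 16])
print(len(demo_nodes), len(leaf_nodes_of(demo_nodes)))  # 9 7
```

Only the leaves are embedded and indexed in the following cells; the larger parent chunks live in the docstore so the retriever can swap them in later.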

In [None]:
# Create a vector search index
# it must be placed inside a Unity Catalog-enabled schema

# We'll use self-managed embeddings (i.e. managed by LlamaIndex) rather than a Databricks-managed index
databricks_index = client.create_direct_access_index(
    endpoint_name="llamaindex_dbx_vector_store_test_endpoint",
    index_name="my_catalog.my_schema.my_advanced_test_table",
    primary_key="my_primary_key_name",
    embedding_dimension=1536,  # match the embeddings model dimension you're going to use
    embedding_vector_column="my_embedding_vector_column_name",  # you can name this anything you want; the LlamaIndex class picks it up
    schema={
        "my_primary_key_name": "string",
        "my_embedding_vector_column_name": "array<double>",
        "text": "string",  # one column must match the text_column in the DatabricksVectorSearch instance created below; this will hold the raw node text
        "doc_id": "string",  # one column must contain the reference document ID (this will be populated by LlamaIndex automatically)
        "node_info": "string",  # this is REQUIRED to store the node information generated by llama-index (field name must be "node_info")
        # add any other metadata you may have in your nodes (Databricks Vector Search supports metadata filtering)
        # NOTE THAT THESE FIELDS MUST BE ADDED EXPLICITLY TO BE USED FOR METADATA FILTERING
    },
)

# insert nodes into docstore
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

databricks_vector_store = DatabricksVectorSearch(
    index=databricks_index,
    text_column="text",  # text_column is required for self-managed embeddings
    # YOU MUST ALSO RECORD YOUR METADATA FIELD NAMES HERE
    columns=[
        "node_info"  # NOTE: node_info is REQUIRED for the node relationships to be retrieved from the index
    ],
)

storage_context = StorageContext.from_defaults(
    docstore=docstore, vector_store=databricks_vector_store
)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
base_retriever = index.as_retriever(similarity_top_k=10)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

Query the index with the Auto Merging Retriever

In [None]:
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("Why did the author choose to work on AI?")

print(response.response)