
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>



# LAB - Create Managed Vector Search Index

The objective of this lab is to demonstrate the process of creating a **managed** Vector Search index for retrieval-augmented generation (RAG) applications. This involves configuring Databricks Vector Search to ingest data from a Delta table containing text embeddings and metadata.



**Lab Outline:**

In this lab, you will complete the following tasks:

* **Task 1:** Create a Vector Search endpoint to serve the index.

* **Task 2:** Connect a Delta table to the Vector Search endpoint by creating a managed index.

* **Task 3:** Test the Vector Search index.

* **Task 4:** Re-rank the search results.

**📝 Your task:** Complete the **`<FILL_IN>`** sections in the code blocks and follow the other steps as instructed.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.
   
   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **17.3.x-cpu-ml-scala2.13**

**🚨 Important: This lab relies on the resources created in the previous Lab. Please ensure you have completed the prior lab before starting this lab.**


## Classroom Setup

Before starting the lab, run the provided classroom setup script. This script will define configuration variables necessary for the lab. Execute the following cell:

In [0]:
%pip install -U -qqqq databricks-vectorsearch 'mlflow-skinny[databricks]==3.4.0' PyPDF2==3.0.0 databricks-sdk flashrank 
%restart_python

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.3.21 requires langchain-core<1.0.0,>=0.3.45, but you have langchain-core 1.0.2 which is incompatible.
langchain 0.3.21 requires langsmith<0.4,>=0.1.17, but you have langsmith 0.4.39 which is incompatible.
langchain-text-splitters 0.3.8 requires langchain-core<1.0.0,>=0.3.51, but you have langchain-core 1.0.2 which is incompatible.
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
%run ../Includes/Classroom-Setup-03


The examples and models presented in this course are intended solely for demonstration and educational purposes.
 Please note that the models and prompt examples may sometimes contain offensive, inaccurate, biased, or harmful content.


**Other Conventions:**

Throughout this lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser12209929_1761968096@vocareum.com
Catalog Name:      dbacademy
Schema Name:       labuser12209929_1761968096
Working Directory: /Volumes/dbacademy/ops/labuser12209929_1761968096@vocareum_com
Dataset Location:  NestedNamespace (arxiv='/Volumes/dbacademy_arxiv/v01', dais='/Volumes/dbacademy_dais/v01', news='/Volumes/dbacademy_news/v01', docs='/Volumes/dbacademy_docs/v01')


## Task 1: Create a Vector Search Endpoint

To start, you need to create a Vector Search endpoint to serve the index.

**🚨 IMPORTANT: A Vector Search endpoint must exist before you run the rest of the lab. In the Databricks lab environment, the endpoint is already created for you. If you run this notebook in another environment, see the instructions in the demo notebook.**

**💡 Instructions:**

1. Define the endpoint that you will use if you don't have endpoint creation permissions. 
1. [Optional] Create a new endpoint: check whether the Vector Search endpoint exists and, if it does not, create it.
1. Wait for the endpoint to be ready.


### Step-by-Step Instructions:


**Vector Search Endpoint**: The first step for creating a Vector Search index is to create a compute endpoint. This endpoint is already created in this lab environment.

**Wait for Endpoint to be Ready**: After defining the endpoint name, check the status of the endpoint using the provided function `wait_for_vs_endpoint_to_be_ready`.

Additionally, you can check the endpoint status in the Databricks workspace [Vector Search Endpoints in Compute section](#/setting/clusters/vector-search).
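The `wait_for_vs_endpoint_to_be_ready` helper comes from the classroom setup script, and its implementation is not shown in this notebook. As a rough illustration of what such a helper typically does, the sketch below polls a state until it reports `ONLINE`. The function name `wait_until_online`, the injected `get_state` callable, the failure state names, and the response shape in the comment are all assumptions made for this example, not the lab's actual code.

```python
import time
from typing import Callable


def wait_until_online(
    get_state: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns ONLINE; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state().upper()
        if state == "ONLINE":
            return state
        # Assumed failure markers; check the states your workspace reports.
        if "FAIL" in state or "OFFLINE" in state:
            raise RuntimeError(f"Endpoint entered state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still {state} after {timeout_s}s")
        sleep(poll_interval_s)


# In a notebook, get_state might wrap the real client, for example:
#   lambda: vsc.get_endpoint(vs_endpoint_name)["endpoint_status"]["state"]
# (this response shape is an assumption; verify it against your SDK version).

# Local demonstration with a fake endpoint that comes online on the third poll.
fake_states = iter(["PROVISIONING", "PROVISIONING", "ONLINE"])
final_state = wait_until_online(lambda: next(fake_states), sleep=lambda s: None)
print(final_state)
```

Injecting `get_state` and `sleep` keeps the polling logic independent of any client, so it can be exercised locally before it is pointed at a real endpoint.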

In [0]:
## assign a Vector Search endpoint based on the username
vs_endpoint_prefix = "vs_endpoint_"
vs_endpoint_name = vs_endpoint_prefix + str(get_fixed_integer(DA.unique_name("_")))
print(f"Assigned Vector Search endpoint name: {vs_endpoint_name}.")

Assigned Vector Search endpoint name: vs_endpoint_2.


In [0]:
import databricks.sdk.service.catalog as c
from databricks.vector_search.client import VectorSearchClient
from databricks.sdk import WorkspaceClient

vsc = VectorSearchClient(disable_notice=True)

## check the status of the endpoint.
wait_for_vs_endpoint_to_be_ready(vsc, vs_endpoint_name)
print(f"Endpoint named {vs_endpoint_name} is ready.")

Endpoint named vs_endpoint_2 is ready.


## Task 2: Create a Managed Vector Search Index

Now, connect the Delta table containing text and metadata with the Vector Search endpoint. In this lab, you will create a **managed** index, which means you don't need to create the embeddings manually. For API details, check the [documentation page](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html#create-index-using-the-python-sdk).


**📌 Note 1: You will use the embeddings table that you created in the previous lab. If you haven't completed that lab, stop here and complete it first.**

**📌 Note 2:** The source table already contains a precomputed embedding column, but you will not use it here. The goal is to test the managed Vector Search capability, which computes embeddings automatically during data ingestion and at query time.

**💡 Instructions:**

1. Define the source Delta table containing the text to be indexed.

1. Create a Vector Search index with these parameters: `content` as the embedding source column and `databricks-gte-large-en` as the embedding model endpoint. Set the sync pipeline type to `TRIGGERED` so that syncs are started manually.

1. Create or synchronize the Vector Search index based on the source Delta table.


In [0]:
## the Delta table containing the text embeddings and metadata.
source_table_fullname = f"{DA.catalog_name}.{DA.schema_name}.lab_pdf_text_embeddings"

## the Delta table to store the Vector Search index.
vs_index_fullname = f"{DA.catalog_name}.{DA.schema_name}.lab_pdf_text_managed_vs_index"

## create or sync the index.
if not index_exists(vsc, vs_endpoint_name, vs_index_fullname):
  print(f"Creating index {vs_index_fullname} on endpoint {vs_endpoint_name}...")
  
  vsc.create_delta_sync_index(
    endpoint_name=vs_endpoint_name,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="databricks-gte-large-en"
  )
else:
  ## trigger a sync to update our vs content with the new data saved in the table.
  vsc.get_index(vs_endpoint_name, vs_index_fullname).sync()

## let's wait for the index to be ready and all our embeddings to be created and indexed.
wait_for_index_to_be_ready(vsc, vs_endpoint_name, vs_index_fullname)

Creating index dbacademy.labuser12209929_1761968096.lab_pdf_text_managed_vs_index on endpoint vs_endpoint_2...
Waiting for index to be ready, this can take a few min... {'detailed_state': 'PROVISIONING_INDEX', 'message': 'Delta sync Index creation is pending. Check latest status: https://dbc-6a912028-eced.cloud.databricks.com/explore/data/dbacademy/labuser12209929_1761968096/lab_pdf_text_managed_vs_index', 'indexed_row_count': 0, 'ready': False, 'index_url': 'dbc-6a912028-eced.cloud.databricks.com/api/2.0/vector-search/indexes/dbacademy.labuser12209929_1761968096.lab_pdf_text_managed_vs_index'} - pipeline url:dbc-6a912028-eced.cloud.databricks.com/api/2.0/vector-search/indexes/dbacademy.labuser12209929_1761968096.lab_pdf_text_managed_vs_index
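The cell above relies on `index_exists`, another helper from the classroom setup whose source is not shown here. One plausible way to write such a check is to ask the client for the index and treat a "does not exist" error as `False`. The sketch below uses stand-in classes (`FakeClient`, `FakeIndex`) so it runs without a workspace; the helper name `index_exists_sketch` and the `RESOURCE_DOES_NOT_EXIST` error text are assumptions for illustration.

```python
class FakeIndex:
    """Stand-in for an index handle, for local demonstration only."""

    def describe(self):
        return {"status": {"ready": True}}


class FakeClient:
    """Stand-in for a Vector Search client, for local demonstration only."""

    def __init__(self, existing_indexes):
        self._existing = set(existing_indexes)

    def get_index(self, endpoint_name, index_name):
        if index_name not in self._existing:
            # Assumed error text; a real client may word this differently.
            raise Exception("RESOURCE_DOES_NOT_EXIST: index not found")
        return FakeIndex()


def index_exists_sketch(client, endpoint_name, index_name):
    """Return True if the index can be described, False if it is missing."""
    try:
        client.get_index(endpoint_name, index_name).describe()
        return True
    except Exception as exc:
        if "RESOURCE_DOES_NOT_EXIST" in str(exc):
            return False
        # Anything else (permissions, network) should surface to the caller.
        raise


client = FakeClient(existing_indexes={"catalog.schema.existing_index"})
found = index_exists_sketch(client, "vs_endpoint", "catalog.schema.existing_index")
missing = index_exists_sketch(client, "vs_endpoint", "catalog.schema.new_index")
print(found, missing)
```

Re-raising unexpected errors matters here: swallowing every exception would make a permissions problem look like a missing index and trigger an unwanted create call.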


## Task 3: Search Documents Similar to the Query

Test the Vector Search index by searching for similar content based on a sample query.

**💡 Instructions:**

1. Get the index instance that we created.

1. Define a sample question as plain text. 🚨 Note: Because you created a managed index, you pass plain text through the `query_text` parameter instead of computing a query embedding yourself.

1. Search the Vector Search index for content similar to the question. The managed index embeds the query text for you.

In [0]:
## get VS index
index = vsc.get_index(vs_endpoint_name, vs_index_fullname)

question = "What are the security and privacy concerns when training generative models?"

## search for similar documents  
results = index.similarity_search(
    query_text = question,
    columns=["pdf_name", "content"],
    num_results=4
    )

## show the results
docs = results.get("result", {}).get("data_array", [])

pprint(docs)

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True.
[['dbfs:/Volumes/dbacademy_arxiv/v01/arxiv-articles/2302.09419.pdf',
  'Some defense approaches have been pro-\n'
  'posed to defend against such attacks. [268] designs an auxiliary anomaly detection classiﬁer '
  'and uses a\n'
  'multi-task learning procedure to defend against adversarial samples. On the other hand, some '
  'defects in the\n'
  'PFM may be inherited by the custom models in transfer learning, such as the adversarial '
  'vulnerabilities and\n'
  'backdoors mentioned above. To mitigate this issue, [269] proposes a relevant model slicing '
  'technique to\n'
  'reduce defect inheritance during transfer learning while retaining useful knowledge from the '
  'PFM.\n'
  'Data Privacy in PFMs LLMs and other PFMs have been trained on private datasets [270]. The re-\n'
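Each row in `data_array` is a plain list, which is why later cells index it by position (`doc[0]`, `doc[1]`). If you prefer named fields, the rows can be zipped with the column names. The sketch below assumes the response also carries a `manifest` entry listing the returned columns, with the similarity score appended last; that shape, the helper name `rows_as_dicts`, and the sample values are assumptions for illustration, so inspect your own `results` before relying on it.

```python
def rows_as_dicts(response):
    """Pair each data_array row with the column names from the manifest."""
    columns = response.get("manifest", {}).get("columns", [])
    names = [col["name"] for col in columns]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(names, row)) for row in rows]


# Made-up response in the assumed shape; values are placeholders.
sample_response = {
    "manifest": {
        "columns": [{"name": "pdf_name"}, {"name": "content"}, {"name": "score"}]
    },
    "result": {
        "data_array": [
            ["example-a.pdf", "first passage text", 0.61],
            ["example-b.pdf", "second passage text", 0.58],
        ]
    },
}

records = rows_as_dicts(sample_response)
print(records[0]["pdf_name"], records[0]["score"])
```

If the manifest is missing, `names` is empty and every record comes back as an empty dict, so a quick length check on `names` is a reasonable guard in real use.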

## Task 4: Re-rank Search Results

You have retrieved documents that are similar to the query text. However, vector similarity alone does not settle which of them are the most relevant to the question. Use the `flashrank` library to re-rank the results and show the top 3 most relevant documents.

**💡 Instructions:**

1. Define a `flashrank` ranker with the **`rank-T5-flan`** model.

1. Re-rank the search results.

1. Show the most relevant **top 3** documents.


In [0]:
from flashrank import Ranker, RerankRequest

## define the ranker.
cache_dir = f"{DA.paths.working_dir}/opt"

ranker = Ranker(model_name="rank-T5-flan", cache_dir=cache_dir)

## format the result to align with reranker library format. 
passages = []
for doc in docs:
    new_doc = {"file": doc[0], "text": doc[1]}
    passages.append(new_doc)

## rerank the passages.
rerankrequest = RerankRequest(query=question, passages=passages)
ranked_passages = ranker.rerank(rerankrequest)

## show the top 3 results.
print(*ranked_passages[:3], sep="\n\n")

INFO:flashrank.Ranker:Downloading rank-T5-flan...

{'file': 'dbfs:/Volumes/dbacademy_arxiv/v01/arxiv-articles/2302.09419.pdf', 'text': 'Knowledge distillation refers to the transfer of knowledge from the larger teacher model to the smaller\nstudent model through the use of a soft label, etc. DistilBERT [261], for example, uses the knowledge dis-\ntillation method to compress BERT, reducing the size of the BERT model by 40% while retaining 97% of\nits language comprehension.\n7.3 Security and Privacy\nThe security risks, social bias, and data privacy in PFMs become an important research topic. Qiu et al. [5]\nrecognize that deep neural networks can be attacked by adversarial samples, which mislead the model to\nproduce false predictions. Due to the excellent portability of pretraining models, they have been widely used\nin NLP, CV , and GL. However, it has been found that the pretraining model is susceptible to the inﬂuence of\nadversarial samples. A tiny interference of the original input may mislead the pretraining model to produce\ns
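To make the re-ranking step itself easier to follow, here is a toy version of the same score, sort, and truncate pattern. It replaces the `rank-T5-flan` model with a simple word-overlap score, which is far cruder than a learned re-ranker and is used only so the example runs without downloading a model. The function names and the toy passages are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage_text):
    """Fraction of query tokens that also appear in the passage."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(passage_text)) / len(query_tokens)


def rerank_top_k(query, passages, k=3):
    """Score each passage, sort by descending score, and keep the top k."""
    scored = [{**p, "score": overlap_score(query, p["text"])} for p in passages]
    scored.sort(key=lambda p: p["score"], reverse=True)
    return scored[:k]


# Toy passages in the same {"file", "text"} shape used in the cell above.
toy_passages = [
    {"file": "a.pdf", "text": "Knowledge distillation compresses large models."},
    {"file": "b.pdf", "text": "Security and privacy concerns arise when training generative models."},
    {"file": "c.pdf", "text": "Privacy of training data is a concern."},
    {"file": "d.pdf", "text": "Benchmarks for image classification."},
]
toy_query = "security and privacy concerns when training generative models"

top3 = rerank_top_k(toy_query, toy_passages, k=3)
print([p["file"] for p in top3])
```

The structure mirrors the lab cell: retrieval supplies candidate passages, a second scorer assigns each one a relevance score against the question, and only the best few are kept for the downstream prompt.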


## Clean up Classroom

**🚨 Warning:** Please don't delete the catalog and tables created in this lab, because later labs depend on these resources. To clean up the classroom assets, run the classroom clean-up script in the last lab.


## Conclusion

In this lab, you learned how to set up a Vector Search index using Databricks Vector Search for retrieval-augmented generation (RAG) applications. By following the tasks, you successfully created a Vector Search endpoint, connected a Delta table containing text embeddings, and tested the search functionality. Furthermore, using a re-ranking library, you re-ordered the search results from the most relevant to least relevant documents. This lab provided hands-on experience in configuring and utilizing Vector Search, empowering you to enhance content retrieval and recommendation systems in your projects.

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>