# 🏪 Indexing and querying restaurant data with Milvus (dense embeddings)

This notebook demonstrates how to index and search restaurant data using Milvus and dense vector embeddings produced by a sentence transformer model (LaBSE). It's focused on semantic search using dense vectors derived from restaurant text fields.

### What you will learn
- Data preparation and basic preprocessing for restaurant records
- Creating an embedding function (LaBSE) and encoding documents/queries
- Defining a Milvus collection schema with dense vector fields
- Inserting embedded data and building vector indexes
- Running semantic searches and formatting results

### Requirements
- `pymilvus[model]`
- `sentence-transformers`
- `pandas`

Run notes: Run the notebook top-to-bottom. Edit connection settings in the "Connect to Milvus" cell when using a cloud endpoint.

Reference: For a step-by-step blog post, see [wiphoo.dev](https://go.wiphoo.dev/nhL42L).

In [1]:
%pip install -q --upgrade "pymilvus[model]" sentence-transformers pandas

Note: you may need to restart the kernel to use updated packages.


🔌 Step 1 — Connect to Milvus

Establish a connection to your Milvus instance (local or managed) and update the URI and credentials as needed. This notebook uses the ORM-style `connections.connect(uri=...)`; `MilvusClient(uri=...)` is an alternative client interface if you prefer it.

In [2]:
# connect to Milvus, either a local instance or Zilliz Cloud
from pymilvus import connections

# local Milvus
connections.connect(uri='http://localhost:19530')

# # Zilliz cloud
# connections.connect(uri="https://YOUR_URI.cloud.zilliz.com", 
#                     token='YOUR_TOKEN',
#                     )
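
If you switch between a local instance and a cloud endpoint, reading the settings from the environment avoids editing the cell each time. The following is a minimal sketch, not part of the original notebook; the variable names `MILVUS_URI` and `MILVUS_TOKEN` and the helper `milvus_connection_kwargs` are hypothetical.

```python
import os


def milvus_connection_kwargs(env=None):
    """Build keyword arguments for `connections.connect` from environment variables.

    MILVUS_URI and MILVUS_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {"uri": env.get("MILVUS_URI", "http://localhost:19530")}
    token = env.get("MILVUS_TOKEN")
    if token:  # a token is only needed for managed endpoints
        kwargs["token"] = token
    return kwargs


# With an empty environment the local default URI is used.
print(milvus_connection_kwargs({}))

# Usage (requires a reachable Milvus instance):
# from pymilvus import connections
# connections.connect(**milvus_connection_kwargs())
```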

🧠 Step 2 — Embedding function (LaBSE)

We use LaBSE (a sentence transformer) to produce dense, multilingual embeddings suitable for semantic search. Configure `batch_size` and `device` to match your environment. For cosine similarity, consider normalizing embeddings.

In [3]:
from pymilvus.model.dense import SentenceTransformerEmbeddingFunction

embedding_func = SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/LaBSE",
    batch_size=32,
    device="cpu",
    normalize_embeddings=True,  # recommended when using COSINE
)
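
To see why normalization pairs well with cosine similarity, the sketch below uses plain Python and two toy vectors (not LaBSE output) to show that the dot product of L2-normalized vectors equals the cosine similarity of the originals. The helper names are illustrative only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Both values agree up to floating-point rounding.
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```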



🗂️ Step 3 — Define collection schema

Define fields for the primary key, the title, location metadata (latitude/longitude), the dense vector, and an `h3_r8` field that serves as the partition key. Use clear field names to simplify queries and outputs.

In [4]:
from pymilvus import (
    FieldSchema,
    DataType,
)


# define fields
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=128),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="latitude", dtype=DataType.FLOAT),
    FieldSchema(name="longitude", dtype=DataType.FLOAT),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=embedding_func.dim),  # LaBSE embedding
    FieldSchema(name="h3_r8", dtype=DataType.VARCHAR, max_length=32, is_partition_key=True),
]
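
Because the `VARCHAR` fields have a `max_length`, it can help to check records before inserting them. The sketch below assumes the limit is measured in UTF-8 bytes (this is my understanding of how Milvus counts `VARCHAR` length, so verify it against the Milvus documentation), which matters for Thai titles where each character takes several bytes. The helper `find_oversized` and the sample rows are hypothetical.

```python
# Limits copied from the schema above.
VARCHAR_LIMITS = {"id": 128, "title": 512, "h3_r8": 32}


def find_oversized(rows, limits=VARCHAR_LIMITS):
    """Return (row_index, field, byte_length) for every value that exceeds its limit."""
    problems = []
    for i, row in enumerate(rows):
        for field, limit in limits.items():
            size = len(str(row.get(field, "")).encode("utf-8"))
            if size > limit:
                problems.append((i, field, size))
    return problems


# Hypothetical rows: the second title is short in characters but long in bytes.
rows = [
    {"id": "a1", "title": "Sushi Sora", "h3_r8": "8864a4b14dfffff"},
    {"id": "a2", "title": "ซูชิ" * 50, "h3_r8": "8864a4b14dfffff"},
]
print(find_oversized(rows))
```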


📄 Step 3.1 — Create the schema

Wrap fields into a `CollectionSchema` and set description/metadata. This schema is used to create the Milvus collection.

In [5]:
from pymilvus import CollectionSchema

schema = CollectionSchema(fields=fields, description="Schema for restaurant data")

#### 📄 Step 3.2: Create the collection

In [6]:
from pymilvus import Collection, utility

collection_name = "restaurants"
if utility.has_collection(collection_name):
    Collection(collection_name).drop()
collection = Collection(collection_name, schema)

📄 Step 3.3 — Create the index

Create an index on the dense vector field, for example `AUTOINDEX` or `HNSW` with the `COSINE` metric. The collection must be loaded into memory before it can be searched.

In [7]:
dense_index = {"index_type": "AUTOINDEX", "metric_type": "COSINE"}  # let Milvus choose a suitable index structure (it may use HNSW under the hood)
collection.create_index(field_name="dense_vector", index_params=dense_index)
collection.load()  # the collection must be loaded into memory before searching
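
A vector index such as HNSW returns approximate nearest neighbours. As a reference point for what it approximates, here is a minimal sketch of exact (brute-force) cosine search in plain Python over toy 2-D vectors. The function names and data are illustrative and do not reflect how Milvus is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Toy data standing in for stored embeddings.
vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(brute_force_top_k([1.0, 0.1], vectors, k=2))
```

Exact search scans every vector, so its cost grows linearly with the collection size; an index trades a little recall for much faster lookups.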

🍽️ Step 4 — Prepare, embed, and ingest restaurant data

Download/load the sample restaurant data, combine relevant text fields for embedding, and use the embedding function to encode documents prior to insertion.

In [8]:
# download sample restaurant data
!wget https://go.wiphoo.dev/gYNwXN -O './sample_restaurants.csv'

--2025-10-02 21:55:07--  https://go.wiphoo.dev/gYNwXN
Resolving go.wiphoo.dev (go.wiphoo.dev)... 91.197.243.143, 207.174.61.1
Connecting to go.wiphoo.dev (go.wiphoo.dev)|91.197.243.143|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/wiphoo/Website_Resources/ed5ef8e5ca8aab2afef63a588da126e393c53b61/data/2025/restaurants/2025-05-31_sample_restaurants.csv?clid=eyJpIjoiUWNKTXkxWE5nLW5DWTdXaEJSWlQ4IiwiaCI6IiIsInAiOiIvZ1lOd1hOIiwidCI6MTc1OTQxNjkwN30.Wz3eBJvjpfI8-6w515GTttapdVa2KKoFLWgxfB9gmhk [following]
--2025-10-02 21:55:08--  https://raw.githubusercontent.com/wiphoo/Website_Resources/ed5ef8e5ca8aab2afef63a588da126e393c53b61/data/2025/restaurants/2025-05-31_sample_restaurants.csv?clid=eyJpIjoiUWNKTXkxWE5nLW5DWTdXaEJSWlQ4IiwiaCI6IiIsInAiOiIvZ1lOd1hOIiwidCI6MTc1OTQxNjkwN3

In [9]:
import pandas as pd

# read sample restaurant data
df = pd.read_csv("./sample_restaurants.csv")

In [10]:
# combine multiple text fields and create embeddings
df["combined_text"] = df[["title", "types", "type_ids"]].agg(" ".join, axis=1)
embedded_text = embedding_func.encode_documents(df["combined_text"].tolist())
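
The join above expects every selected column to contain strings. If your own data has missing values, a per-record helper along these lines may be safer. This is a sketch under that assumption; `combine_fields` and the sample record (including the `japanese_restaurant` value) are hypothetical.

```python
import math


def combine_fields(record, keys=("title", "types", "type_ids")):
    """Join the selected fields with spaces, skipping None, NaN, and empty values."""
    parts = []
    for key in keys:
        value = record.get(key)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


# Hypothetical record with a missing "types" value.
record = {"title": "Sushi Sora", "types": float("nan"), "type_ids": "japanese_restaurant"}
print(combine_fields(record))
```

With pandas, the same helper could be applied row by row, for example via `df.apply(lambda row: combine_fields(row.to_dict()), axis=1)`.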

In [11]:
entities = [
    df["place_id"].tolist(),
    df["title"].tolist(),
    df["latitude"].tolist(),
    df["longitude"].tolist(),
    embedded_text,
    df["h3_r8"].tolist(),
]

collection.insert(entities)
collection.flush()

print(f"Inserted {len(df)} records with embeddings.")

Inserted 268 records with embeddings.
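
A single insert call is fine for a few hundred rows. For larger datasets you may want to insert in batches. The sketch below splits column-based data (the same layout as `entities`) into fixed-size chunks; `iter_column_batches` and the batch size are assumptions for illustration.

```python
def iter_column_batches(columns, batch_size):
    """Yield column-based batches: each batch is a list of equally sliced columns."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not columns:
        return
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    for start in range(0, n_rows, batch_size):
        yield [col[start:start + batch_size] for col in columns]


# Toy columns standing in for ids and latitudes.
columns = [["a", "b", "c", "d", "e"], [1, 2, 3, 4, 5]]
for batch in iter_column_batches(columns, batch_size=2):
    print(batch)

# Usage with the notebook's data (requires a live collection):
# for batch in iter_column_batches(entities, batch_size=1000):
#     collection.insert(batch)
# collection.flush()
```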


⚙️ Step 5 — Helper functions

Create utility helpers to convert Milvus search results into a pandas DataFrame and to perform query encoding and searching.

In [12]:
def milvus_result_to_dataframe(results):
    """
    Convert Milvus search results to a pandas DataFrame.

    Parameters:
        results (list): Milvus search results in the format:
            [
                [  # query 1 result
                    Hit(id=..., distance=..., entity=...), 
                    ...
                ],
                ...
            ]

    Returns:
        pd.DataFrame: Flattened DataFrame with distance and entity fields.
    """
    flat_results = []

    for query_results in results:
        for match in query_results:
            entity = match.get("entity", {})
            flat_result = {
                "id": match.get("id"),
                "distance": match.get("distance"),
                **{k: v for k, v in entity.items()}
            }
            flat_results.append(flat_result)

    return pd.DataFrame(flat_results)[['id', 'distance', 'title', 'latitude', 'longitude', 'h3_r8']]
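
The flattening step can be tried without a running Milvus instance. The sketch below uses plain dictionaries as stand-ins for hits, assuming (as the helper above does) that each hit exposes `id`, `distance`, and `entity` through `.get`. The function name `flatten_hits` and the `r1` ID are made up for the example.

```python
def flatten_hits(results):
    """Flatten nested search results into one dict per hit."""
    rows = []
    for hits in results:  # one list of hits per query
        for hit in hits:
            row = {"id": hit.get("id"), "distance": hit.get("distance")}
            row.update(hit.get("entity", {}))  # merge the output fields
            rows.append(row)
    return rows


# Stand-in for a one-query result containing a single hit.
fake_results = [[
    {"id": "r1", "distance": 0.5, "entity": {"title": "Sushi Sora", "h3_r8": "8864a4b14dfffff"}},
]]
print(flatten_hits(fake_results))
```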

In [13]:
def search(query):
    """
    Perform a semantic search on the collection using a given text query.

    Args:
        query (str): The text query to search for.

    Returns:
        Milvus search results, one list of hits per query. Each hit carries the
        output fields 'id', 'title', 'latitude', 'longitude', and 'h3_r8'.
    """
    # convert the query to an embedding vector using the provided embedding function
    query_vector = embedding_func.encode_queries(queries=[query])

    # execute the search on the vector database
    results = collection.search(
        data=query_vector,
        anns_field="dense_vector",
        param={
            "metric_type": "COSINE",
            # nprobe is an IVF-family search parameter; with AUTOINDEX it may have no effect
            "params": {"nprobe": 15}
        },
        output_fields=["id", "title", "latitude", "longitude", "h3_r8"],
        limit=10,
    )

    return results
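
Since `h3_r8` is stored with every record, a search can also be restricted to specific cells with a boolean filter expression. The sketch below only builds the expression string; `h3_filter_expr` is a hypothetical helper, and passing its result through the `expr` argument of `collection.search` is shown as a commented assumption because it needs a live collection.

```python
import json


def h3_filter_expr(cells, field="h3_r8"):
    """Build a filter such as: h3_r8 in ["cell_a", "cell_b"]."""
    if not cells:
        raise ValueError("at least one cell is required")
    quoted = ", ".join(json.dumps(cell) for cell in cells)  # double-quoted string literals
    return f"{field} in [{quoted}]"


expr = h3_filter_expr(["8864a4b14dfffff", "8864a4b327fffff"])
print(expr)

# Usage (requires a live collection):
# collection.search(data=query_vector, anns_field="dense_vector",
#                   param={"metric_type": "COSINE"}, limit=10, expr=expr)
```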

🔍 Step 6 — Test queries

Run a few sample queries to validate retrieval quality and inspect formatted search outputs.

In [14]:
query = 'ซูซิ'  # Thai misspelling of "sushi" (ซูชิ)
results = search(query)
milvus_result_to_dataframe(results)

,id,distance,title,latitude,longitude,h3_r8
0,ChIJG5cMlYmf4jARjI14Tvhzj5I,0.504464,Sushi Sora,13.726238,100.543182,8864a4b14dfffff
1,ChIJC_IG4WGZ4jARJil71QyJnJQ,0.445954,Min Sushi by Sushi Cottage ずしコテージ,13.740359,100.525108,8864a4b10dfffff
2,ChIJ89mnRNGj4jARXFw7F8kdlu8,0.428978,ไข่หวานบ้านซูชิ สาขาประชาอุทิศ,13.660226,100.501335,8864a4b223fffff
3,ChIJFepMlimf4jARW2MqZCN7GMQ,0.424719,sushimai ซูชิมั้ย ศรีบำเพ็ญ,13.721445,100.54673,8864a4b327fffff
4,ChIJ-eRZ7dif4jARwl4RGXfAbuI,0.353874,OJI Omakase at Sathorn,13.722355,100.546768,8864a4b327fffff
5,ChIJT-MmY-mj4jARBxgvjap0hf0,0.339274,Suki Teenoi Susco Phuttha Bucha,13.65134,100.488991,8864a4b231fffff
6,ChIJQRMf7wOZ4jARQz7dlVrlE48,0.328779,Cozii Steak and Restaurant โคซี่ สเต๊ก,13.721587,100.516533,8864a4b15dfffff
7,ChIJcdXwlUif4jAR2xee0t6QEs0,0.317556,Sindosegi Thailand (ซินโดเซกิ),13.74421,100.53511,8864a4b16bfffff
8,ChIJfR02toKZ4jARBdP-FsD5LHw,0.311415,Yuzu Curry Siam Square Soi.9,13.744181,100.533394,8864a4b16bfffff
9,ChIJRffK4yuf4jARIcEK2GMqhEc,0.303736,Xin Tian Di (ซิน เทียน ตี้),13.729627,100.535095,8864a4b141fffff


In [15]:
query = 'ซูสิ'  # another Thai misspelling of "sushi" (ซูชิ)
results = search(query)
milvus_result_to_dataframe(results)

,id,distance,title,latitude,longitude,h3_r8
0,ChIJG5cMlYmf4jARjI14Tvhzj5I,0.49179,Sushi Sora,13.726238,100.543182,8864a4b14dfffff
1,ChIJC_IG4WGZ4jARJil71QyJnJQ,0.437201,Min Sushi by Sushi Cottage ずしコテージ,13.740359,100.525108,8864a4b10dfffff
2,ChIJ89mnRNGj4jARXFw7F8kdlu8,0.41677,ไข่หวานบ้านซูชิ สาขาประชาอุทิศ,13.660226,100.501335,8864a4b223fffff
3,ChIJFepMlimf4jARW2MqZCN7GMQ,0.401952,sushimai ซูชิมั้ย ศรีบำเพ็ญ,13.721445,100.54673,8864a4b327fffff
4,ChIJ-eRZ7dif4jARwl4RGXfAbuI,0.346747,OJI Omakase at Sathorn,13.722355,100.546768,8864a4b327fffff
5,ChIJT-MmY-mj4jARBxgvjap0hf0,0.322483,Suki Teenoi Susco Phuttha Bucha,13.65134,100.488991,8864a4b231fffff
6,ChIJQRMf7wOZ4jARQz7dlVrlE48,0.313915,Cozii Steak and Restaurant โคซี่ สเต๊ก,13.721587,100.516533,8864a4b15dfffff
7,ChIJcdXwlUif4jAR2xee0t6QEs0,0.296838,Sindosegi Thailand (ซินโดเซกิ),13.74421,100.53511,8864a4b16bfffff
8,ChIJJVKaiCyZ4jARNtm2xPOh4zA,0.29353,Sasa Restaurant,13.741302,100.527039,8864a4b10dfffff
9,ChIJfR02toKZ4jARBdP-FsD5LHw,0.292948,Yuzu Curry Siam Square Soi.9,13.744181,100.533394,8864a4b16bfffff


In [16]:
query = 'sush'  # truncated English spelling of "sushi"
results = search(query)
milvus_result_to_dataframe(results)

,id,distance,title,latitude,longitude,h3_r8
0,ChIJG5cMlYmf4jARjI14Tvhzj5I,0.369419,Sushi Sora,13.726238,100.543182,8864a4b14dfffff
1,ChIJ89mnRNGj4jARXFw7F8kdlu8,0.341008,ไข่หวานบ้านซูชิ สาขาประชาอุทิศ,13.660226,100.501335,8864a4b223fffff
2,ChIJFepMlimf4jARW2MqZCN7GMQ,0.33789,sushimai ซูชิมั้ย ศรีบำเพ็ญ,13.721445,100.54673,8864a4b327fffff
3,ChIJZ2mDRy6j4jAR3lgsc2hfQUw,0.310574,สเต็กปากมันส์,13.642928,100.49353,8864a4b239fffff
4,ChIJC_IG4WGZ4jARJil71QyJnJQ,0.307294,Min Sushi by Sushi Cottage ずしコテージ,13.740359,100.525108,8864a4b10dfffff
5,ChIJd-0f4IaZ4jARzU5kkvVX-io,0.289997,หมูจิ้มเปรี้ยว (โรงอาหารเซนต์โยฯ),13.725265,100.530655,8864a4b141fffff
6,ChIJT-MmY-mj4jARBxgvjap0hf0,0.280936,Suki Teenoi Susco Phuttha Bucha,13.65134,100.488991,8864a4b231fffff
7,ChIJm5ZRHxqj4jARVgsfB1L8ziY,0.280684,ครัวกันเอง,13.651203,100.484299,8864a4b233fffff
8,ChIJ2zWuDkWj4jARW4erWTVj2mk,0.280051,After You Dessert Cafe - Susco Phutthabucha,13.65135,100.488884,8864a4b231fffff
9,ChIJJVKaiCyZ4jARNtm2xPOh4zA,0.268876,Sasa Restaurant,13.741302,100.527039,8864a4b10dfffff
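
One simple way to compare how stable the results are across query variants is the Jaccard overlap of the returned IDs. This is a sketch with made-up IDs, not a measurement from the tables above; `jaccard_overlap` is a hypothetical helper.

```python
def jaccard_overlap(ids_a, ids_b):
    """Share of IDs common to both result lists (1.0 means identical sets)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up IDs standing in for the "id" column of two result frames.
first = ["r1", "r2", "r3", "r4"]
second = ["r1", "r2", "r3", "r5"]
print(jaccard_overlap(first, second))
```

With the notebook's helpers, the inputs could come from `milvus_result_to_dataframe(search(q))["id"]` for each query `q`.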


🧹 Step 7 — Cleanup

Disconnect from the Milvus instance and free resources when finished.

In [17]:
# disconnect Milvus connection
connections.disconnect('default')