# 🍜 Hybrid Vector Search for Restaurants with Milvus

This notebook demonstrates how to build a hybrid search system for restaurant data using Milvus. It combines dense embeddings (LaBSE, BGE-M3), BGE-M3 sparse embeddings, and BM25 full-text search over text tokenized with PyThaiNLP.

**You will learn how to:**
- Prepare restaurant data for hybrid search
- Use BGE-M3 and LaBSE embedding models
- Create a Milvus collection with dense, sparse, and text fields
- Insert data with multiple embedding types
- Perform and compare hybrid, dense, and sparse searches with clear examples

#### 🛠 Requirements
Make sure you have the following Python libraries installed:
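* `pymilvus` (installed with the model extras, e.g. `pymilvus[model]`, for the embedding functions)
* `pythainlp`
* `pandas`
* `numpy`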

You can use either:
* A **local Milvus** instance (e.g. via Docker)
* Or a **managed Milvus** service such as [Zilliz Cloud](https://cloud.zilliz.com)

📖 For more context, see the full blog post at: [wiphoo.dev](https://go.wiphoo.dev/IOifZo)


## Table of Contents

- 1. Connect to Milvus
- 2. Create Embedding Functions (LaBSE, BGE-M3)
- 3. Define Collection Schema
- 4. Add BM25 Function (PyThaiNLP)
- 5. Create Collection & Indexes
- 6. Preprocess Data (Tokenize)
- 7. Generate Embeddings (Dense + Sparse)
- 8. Convert Sparse to Milvus Dict
- 9. Insert Data
- 10. Collection Stats
- 11. Prepare Query Embeddings
- 12. Helpers (to_dataframe, search functions)
- 13. Search Helpers (search_labse, search_fulltext, search_bge_dense, search_bge_sparse)
- 14. Search Examples (LaBSE, Full-text, BGE-M3 Dense & Sparse)
- 15. Comparison & Visualization

> Run the notebook top-to-bottom. Use the `search_*` helper functions for concise queries and easy comparison.

---

## 1. Connect to Milvus

Establish a connection to your Milvus instance (local or cloud). Use `MilvusClient(uri=...)` or your cloud credentials.

In [1]:
# Step 1: Connect to Milvus
from pymilvus import MilvusClient

# Connect to local Milvus server
client = MilvusClient(uri="http://localhost:19530")

# For Zilliz Cloud, uncomment and fill in your credentials:
# client = MilvusClient(
#     uri="https://<your-endpoint>",
#     token="<your-token>"
# )



Connect to Milvus server (local or cloud).

---

## 2. Create Embedding Functions (LaBSE, BGE-M3)

Set up embedding functions for both LaBSE (dense) and BGE-M3 (hybrid: dense + sparse).

### 2.1 LaBSE (Dense) — Multilingual Dense Embeddings

In [2]:
# Step 2: Create Embedding Functions
from pymilvus.model.dense import SentenceTransformerEmbeddingFunction

# LaBSE: Multilingual dense embedding (good for Thai/English)
labse_embedding_func = SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/LaBSE",
    batch_size=32,
    device="cpu",
    normalize_embeddings=True,  # Recommended for COSINE metric
)



Set up LaBSE embedding (multilingual dense vectors).
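
The `normalize_embeddings=True` setting matters because, for unit-length vectors, the inner product and the cosine similarity are the same number. The following is a minimal sketch of that relationship using only the standard library; the 3-dimensional vectors are toy values chosen for illustration (the LaBSE vectors in this notebook have 768 dimensions).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy vectors (hypothetical values, not real embeddings)
doc = [3.0, 4.0, 0.0]
qry = [1.0, 2.0, 2.0]

cos_raw = cosine(doc, qry)
ip_normalized = inner_product(l2_normalize(doc), l2_normalize(qry))

# Both values agree up to floating-point rounding
print(cos_raw, ip_normalized)
```

With normalized vectors, ranking by `IP` and ranking by `COSINE` therefore produce the same order.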

### 2.2 BGE-M3 (Hybrid) — Dense + Sparse Embeddings

BGE-M3 provides both a dense vector and a sparse representation; we store both to support hybrid ranking.

In [3]:
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

bge_m3_embedding_func = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3',
    device='cpu',
    use_fp16=False,
)



Set up BGE-M3 embedding (dense + sparse for hybrid search).

In [4]:
from pymilvus import DataType

# Step 3: Define Collection Schema
collection_name = "restaurants"
if collection_name in client.list_collections():
    client.drop_collection(collection_name)

schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=False,
)

# Main fields
schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=128)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="latitude", datatype=DataType.FLOAT)
schema.add_field(field_name="longitude", datatype=DataType.FLOAT)

# Text for full-text search
schema.add_field(field_name="text_tokenize", datatype=DataType.VARCHAR, max_length=4096,
                enable_analyzer=True, analyzer_params={"tokenizer": "whitespace"})

# Embedding fields
schema.add_field(field_name="labse_dense_vector", datatype=DataType.FLOAT_VECTOR, dim=labse_embedding_func.dim)
schema.add_field(field_name="pythainlp_sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="bge_m3_dense_vector", datatype=DataType.FLOAT_VECTOR, dim=bge_m3_embedding_func.dim["dense"])
schema.add_field(field_name="bge_m3_sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)

# Partition key for geo search
schema.add_field(field_name="h3_r8", datatype=DataType.VARCHAR, max_length=32, is_partition_key=True)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 128}, 'is_primary': True, 'auto_id': False}, {'name': 'title', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}}, {'name': 'latitude', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'longitude', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'text_tokenize', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 4096, 'enable_analyzer': True, 'analyzer_params': '{"tokenizer":"whitespace"}'}}, {'name': 'labse_dense_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'pythainlp_sparse_vector', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}, {'name': 'bge_m3_dense_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1024}}, {'name': 'bge_m3_sparse_vector', 'description': '', 'type'

Define Milvus collection schema for all fields.

---

## 3. Define Collection Schema

Drop the old collection (if it exists) and define fields for dense vectors, sparse vectors, and tokenized text.

In [5]:
# 4. Add BM25 Function (PyThaiNLP)
from pymilvus import Function, FunctionType

# Add a BM25 function: Milvus derives the sparse vector from text_tokenize,
# which holds text pre-tokenized by PyThaiNLP and joined with whitespace
schema.add_function(Function(
    name="bm25_pythainlp",
    function_type=FunctionType.BM25,
    input_field_names=["text_tokenize"],
    output_field_names=["pythainlp_sparse_vector"],
))

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 128}, 'is_primary': True, 'auto_id': False}, {'name': 'title', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}}, {'name': 'latitude', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'longitude', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'text_tokenize', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 4096, 'enable_analyzer': True, 'analyzer_params': '{"tokenizer":"whitespace"}'}}, {'name': 'labse_dense_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'pythainlp_sparse_vector', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}, {'name': 'bge_m3_dense_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1024}}, {'name': 'bge_m3_sparse_vector

Add BM25 function for full-text search (PyThaiNLP).

In [6]:
# Step 5: Create the collection
client.create_collection(
    collection_name=collection_name,
    schema=schema,
)

Create the collection in Milvus.

In [7]:
# Step 5: Create indexes for all fields
index_params = client.prepare_index_params()

index_params.add_index(
    field_name="labse_dense_vector",
    index_type="AUTOINDEX",
    metric_type="COSINE",
)
index_params.add_index(
    field_name="pythainlp_sparse_vector",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25",
    params={
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75,
    },
)
index_params.add_index(
    field_name="bge_m3_dense_vector",
    index_type="AUTOINDEX",
    metric_type="IP",
)
index_params.add_index(
    field_name="bge_m3_sparse_vector",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)
index_params.add_index(
    field_name="h3_r8",
    index_type="AUTOINDEX",
)

Create indexes for all search fields.

---

## 5. Create Collection & Indexes

Create the Milvus collection and build indexes for dense and sparse fields. Index params include metric type and optional BM25 params for sparse fields.

In [8]:
# Build all indexes
client.create_index(collection_name, index_params)

Build all indexes.
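
To make the `bm25_k1` and `bm25_b` index parameters more concrete, here is a small sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration under assumptions: the function `bm25_score`, the toy corpus, and the Lucene-style IDF are not taken from Milvus, whose internal scoring may differ in detail.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query with the textbook BM25 formula.

    k1 controls term-frequency saturation; b controls how strongly
    long documents are penalized (b=0 disables length normalization).
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
    return score

# Toy pre-tokenized corpus (hypothetical documents)
corpus = [
    ["sushi", "bar", "japanese"],
    ["pizza", "restaurant"],
    ["sushi", "sushi", "takeaway", "japanese", "restaurant", "bangkok"],
]

scores = [bm25_score(["sushi"], doc, corpus) for doc in corpus]
no_length_penalty = bm25_score(["sushi"], corpus[2], corpus, b=0.0)
print(scores, no_length_penalty)
```

The second document never mentions the query term and scores zero. The third document repeats the term, so it scores higher than the first, and it scores higher still once the length penalty is switched off with `b=0.0`.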

In [9]:
# Step 6: Tokenize text for full-text search (using PyThaiNLP)
from pythainlp.tokenize import word_tokenize

stopwords = ['ร้าน', 'อาหาร', 'สาขา']

def tokenize_and_filter(text: str) -> str:
    tokens = word_tokenize(text, engine="newmm")
    return " ".join([t for t in tokens if t not in stopwords and t.strip() != ""])

Tokenize text for BM25 search (remove stopwords).
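
The filtering and joining step can be tried without PyThaiNLP by starting from an already split token list. In this sketch the `tokens` list is a hypothetical tokenizer output for the first sample title (the real `word_tokenize` result may differ), and `filter_and_join` is an illustrative name for the second half of `tokenize_and_filter`.

```python
# Same stopwords as in the notebook cell above
STOPWORDS = {"ร้าน", "อาหาร", "สาขา"}

def filter_and_join(tokens):
    """Drop stopwords and whitespace-only tokens, then join with single spaces."""
    return " ".join(t for t in tokens if t not in STOPWORDS and t.strip() != "")

# Hypothetical tokenizer output for "Catory Pizza สาขาประชาอุทิศ"
tokens = ["Catory", " ", "Pizza", " ", "สาขา", "ประชา", "อุทิศ"]

joined = filter_and_join(tokens)
print(joined)          # Catory Pizza ประชา อุทิศ
print(joined.split())  # the terms a whitespace tokenizer would recover
```

Joining with single spaces is what lets the `whitespace` analyzer on `text_tokenize` recover the same terms on the Milvus side.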

---

## 6. Preprocess Data (Tokenize)

Tokenize and clean restaurant text prior to embedding or BM25 indexing. We use PyThaiNLP for Thai tokenization.

In [10]:
# Step 6 (continued): Load and preprocess restaurant data
import pandas as pd

df = pd.read_csv("../../data/2025/restaurants/sample_restaurants.csv")
df["combined_text"] = df[["title", "type_ids"]].agg(" ".join, axis=1).str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.replace("\"", "", regex=False)
df["tokenized_separate_by_whitespace"] = df["combined_text"].astype(str).apply(tokenize_and_filter)
df.head()

| | title | place_id | latitude | longitude | rating | reviews | types | type_ids | h3_r6 | h3_r8 | h3_r10 | h3_r12 | combined_text | tokenized_separate_by_whitespace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Catory Pizza สาขาประชาอุทิศ | ChIJP5SViq6j4jAR6C9tANeGbcM | 13.652919 | 100.496844 | 4.9 | 936.0 | ["Pizza restaurant"] | ["pizza_restaurant"] | 8664a4b27ffffff | 8864a4b23dfffff | 8a64a4b23c57fff | 8c64a4b23cec9ff | Catory Pizza สาขาประชาอุทิศ pizza_restaurant | Catory Pizza ประชา อุทิศ pizza _restaurant |
| 1 | ERA VALLEY | ChIJS_B0uzKj4jARyoRkKkXMRo4 | 13.657025 | 100.46872 | 4.6 | 53.0 | ["Restaurant"] | ["restaurant"] | 8664a4b2fffffff | 8864a4b2e5fffff | 8a64a4b2e4cffff | 8c64a4b2e4c13ff | ERA VALLEY restaurant | ERA VALLEY restaurant |
| 2 | HUANG หวง เกี๊ยวจีน 黄餃子館 | ChIJGyEeYxSZ4jAREqdmdTa_mQ4 | 13.658768 | 100.468625 | 4.9 | 83.0 | ["Chinese restaurant","Chinese takeaway"] | ["chinese_restaurant","chinese_takeaway"] | 8664a4b2fffffff | 8864a4b05bfffff | 8a64a4b2e4dffff | 8c64a4b2e4db5ff | HUANG หวง เกี๊ยวจีน 黄餃子館 chinese_restaurant,ch... | HUANG หวง เกี๊ยว จีน 黄餃子館 chinese _restaurant,... |
| 3 | เนื้อเทพ NueaThep (สาขาประชาอุทิศ) | ChIJX-tIU6aj4jARBc4SLkbXKM8 | 13.651996 | 100.497408 | 4.7 | 29.0 | ["Restaurant","Noodle shop"] | ["restaurant","noodle_shop"] | 8664a4b27ffffff | 8864a4b23dfffff | 8a64a4b23c57fff | 8c64a4b23c541ff | เนื้อเทพ NueaThep (สาขาประชาอุทิศ) restaurant,... | เนื้อ เทพ NueaThep ( ประชา อุทิศ ) restaurant ... |
| 4 | Indian Barbeque Nation 57 | ChIJBUSnKQqf4jAR_9kF8Wjqwu8 | 13.645877 | 100.498482 | 4.6 | 43.0 | ["Restaurant"] | ["restaurant"] | 8664a4b27ffffff | 8864a4b239fffff | 8a64a4b23997fff | 8c64a4b239959ff | Indian Barbeque Nation 57 restaurant | Indian Barbeque Nation 57 restaurant |


Load and preprocess restaurant data.

In [11]:
# Step 7: Generate embeddings for all documents
bge_m3_embedded_text = bge_m3_embedding_func.encode_documents(df["combined_text"].tolist())
labse_embedded_text = labse_embedding_func.encode_documents(df["combined_text"].tolist())



Generate dense embeddings for all records.

---

## 7. Generate Embeddings (Dense + Sparse)

Use LaBSE and BGE-M3 embedding functions to produce dense vectors and BGE-M3's sparse outputs.

In [12]:
# 8. Convert Sparse to Milvus Dict
import numpy as np
bge_m3_embedded_sparse_csr = bge_m3_embedded_text["sparse"].tocsr()
bge_m3_embedded_sparse_csr.sum_duplicates()
bge_m3_embedded_sparse_csr.sort_indices()

def bge_m3_embedded_sparse_row_to_dict(row):
    s, e = bge_m3_embedded_sparse_csr.indptr[row], bge_m3_embedded_sparse_csr.indptr[row + 1]
    idx = bge_m3_embedded_sparse_csr.indices[s:e]
    val = bge_m3_embedded_sparse_csr.data[s:e].astype(np.float32, copy=False)
    return {int(i): float(v) for i, v in zip(idx, val)}

Convert BGE-M3 sparse output to Milvus format.
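
The row conversion relies on the CSR layout: `indptr[row]` and `indptr[row + 1]` delimit the slice of `indices` and `data` that belongs to one row. The sketch below reproduces that logic on plain Python lists so it can be run without SciPy; the token ids and weights are hypothetical.

```python
def csr_row_to_dict(indptr, indices, data, row):
    """Return {column_index: value} for one row of a CSR-encoded matrix."""
    start, end = indptr[row], indptr[row + 1]
    return {int(i): float(v) for i, v in zip(indices[start:end], data[start:end])}

# Toy CSR matrix with two rows (hypothetical token ids and weights):
#   row 0 holds columns 5 and 42, row 1 holds column 7
indptr = [0, 2, 3]
indices = [5, 42, 7]
data = [0.25, 0.5, 1.0]

rows = [csr_row_to_dict(indptr, indices, data, r) for r in range(len(indptr) - 1)]
print(rows)  # [{5: 0.25, 42: 0.5}, {7: 1.0}]
```

Each resulting dictionary has the `{index: weight}` shape that the insert cell passes to the `bge_m3_sparse_vector` field.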

In [13]:
# 9. Insert Data
entities = [
    {
        "id": str(row["place_id"]),
        "title": str(row["title"]),
        "latitude": float(row['latitude']),
        "longitude": float(row['longitude']),
        "text_tokenize": str(row["tokenized_separate_by_whitespace"]),
        "labse_dense_vector": labse_embedded_text[idx],
        "bge_m3_dense_vector": bge_m3_embedded_text["dense"][idx].tolist(),
        "bge_m3_sparse_vector": bge_m3_embedded_sparse_row_to_dict(idx),
        "h3_r8": str(row["h3_r8"]),
    }
    for idx, row in df.iterrows()
]

client.insert(collection_name, entities)
client.flush(collection_name)
client.load_collection(collection_name)
print(f"Inserted {len(entities)} records with embeddings.")

Inserted 268 records with embeddings.


Insert all data and embeddings into Milvus.

In [14]:
# Step 10: Check collection stats
row_count = client.get_collection_stats(collection_name)["row_count"]
print(f"{row_count} records in collection {collection_name}.")

268 records in collection restaurants.


Show number of records in the collection.

---

## 10. Collection Stats

Check collection statistics and ensure data is loaded for search.

In [15]:
# Step 11: Prepare query embeddings for search
query_text = "ร้านอาหารญี่ปุ่น"
query_labse_embedding = labse_embedding_func.encode_queries([query_text])
query_bge_m3_embedding = bge_m3_embedding_func.encode_queries([query_text])

Encode query for search.

---

## 11. Prepare Query Embeddings

Encode your search query using the same embedding functions (LaBSE, BGE-M3) before running search examples.

---

## 12. Helpers (to_dataframe, search functions)

Utility helpers for result formatting and search (e.g., `to_dataframe`).

In [16]:
# Helper: Convert Milvus results to DataFrame for easy viewing
from itertools import chain
import pandas as pd

def to_dataframe(data):
    """
    Accept either a list of hit dicts or a list of lists of hit dicts
    (one inner list per query, as returned by `client.search`) and return
    a flattened DataFrame in which the `entity` fields become top-level columns.
    """
    if data and isinstance(data[0], list):
        records = list(chain.from_iterable(data))
    else:
        records = data
    flat = []
    for item in records:
        base = {k: v for k, v in item.items() if k != "entity"}
        entity = item.get("entity") or {}
        base.update(entity)
        flat.append(base)
    df = pd.DataFrame(flat)
    preferred = ["id", "distance", "title", "latitude", "longitude", "h3_r8"]
    df = df.reindex(columns=[c for c in preferred if c in df.columns] + [c for c in df.columns if c not in preferred])
    return df

Convert Milvus results to DataFrame.
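
To see what the flattening does without pandas or a running Milvus instance, the sketch below applies the same logic to a hand-written result. The nested shape (`id`, `distance`, `entity`) follows what `to_dataframe` expects; the sample values are copied from a recorded search result in this notebook, and `flatten_hits` is an illustrative name.

```python
from itertools import chain

def flatten_hits(results):
    """Flatten [[hit, ...], ...] into a list of flat dicts, lifting `entity` fields."""
    rows = []
    for hit in chain.from_iterable(results):
        row = {k: v for k, v in hit.items() if k != "entity"}
        row.update(hit.get("entity") or {})
        rows.append(row)
    return rows

# One query, one hit (values taken from a recorded result row)
results = [[
    {
        "id": "ChIJG5cMlYmf4jARjI14Tvhzj5I",
        "distance": 0.539514,
        "entity": {"title": "Sushi Sora", "h3_r8": "8864a4b14dfffff"},
    }
]]

flat = flatten_hits(results)
print(flat)
```

Each flat dictionary then maps directly onto one DataFrame row.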

In [17]:
# 13. Search Helpers (search_labse, search_fulltext, search_bge_dense, search_bge_sparse)
from IPython.display import display

def search_labse(query: str, limit: int = 5):
    """Encode query with LaBSE and search the labse_dense_vector field."""
    q_emb = labse_embedding_func.encode_queries([query])
    res = client.search(
        collection_name=collection_name,
        data=q_emb,
        anns_field="labse_dense_vector",
        limit=limit,
        search_params={"metric_type": "COSINE", "params": {"nprobe": 10}},
        output_fields=["title", "latitude", "longitude", "h3_r8"],
    )
    return to_dataframe(res)

def search_fulltext(query: str, limit: int = 10):
    """Tokenize query and search the PyThaiNLP BM25 sparse vector field."""
    # Full-text search takes one raw text string per query. Passing a list of
    # tokens would run a separate search for every token, so join them back
    # into a single whitespace-separated string.
    query_tokenized = tokenize_and_filter(query)
    res = client.search(
        collection_name=collection_name,
        data=[query_tokenized],
        anns_field="pythainlp_sparse_vector",
        limit=limit,
        search_params={"metric_type": "BM25", "topk": limit},
        output_fields=["id", "title", "latitude", "longitude", "h3_r8"],
    )
    return to_dataframe(res)

def search_bge_dense(query: str, limit: int = 5):
    """Encode query with BGE-M3 and search its dense field."""
    q_emb = bge_m3_embedding_func.encode_queries([query])["dense"]
    res = client.search(
        collection_name=collection_name,
        data=q_emb,
        anns_field="bge_m3_dense_vector",
        limit=limit,
        search_params={"metric_type": "IP", "params": {"nprobe": 10}},
        output_fields=["id", "title", "latitude", "longitude", "h3_r8"],
    )
    return to_dataframe(res)

def search_bge_sparse(query: str, limit: int = 5):
    """Encode query with BGE-M3 and search its sparse field."""
    q_emb = bge_m3_embedding_func.encode_queries([query])["sparse"]
    res = client.search(
        collection_name=collection_name,
        data=q_emb,
        anns_field="bge_m3_sparse_vector",
        limit=limit,
        search_params={"metric_type": "IP", "params": {"nprobe": 10}},
        output_fields=["id", "title", "latitude", "longitude", "h3_r8"],
    )
    return to_dataframe(res)

Helper functions for all search types.

---

## 15. Comparison & Visualization

Run the same query across LaBSE (dense), full-text (BM25), BGE-M3 dense, and BGE-M3 sparse search, then compare the results side by side. The `search_*` helpers defined above collect and display the results for quick inspection.

#### 15.1 LaBSE Dense Search Example

In [18]:
# LaBSE dense search example
query = "ซูชิ"
display(search_labse(query, limit=5))

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJG5cMlYmf4jARjI14Tvhzj5I | 0.539514 | Sushi Sora | 13.726238 | 100.543182 | 8864a4b14dfffff |
| 1 | ChIJFepMlimf4jARW2MqZCN7GMQ | 0.521155 | sushimai ซูชิมั้ย ศรีบำเพ็ญ | 13.721445 | 100.54673 | 8864a4b327fffff |
| 2 | ChIJC_IG4WGZ4jARJil71QyJnJQ | 0.475555 | Min Sushi by Sushi Cottage ずしコテージ | 13.740359 | 100.525108 | 8864a4b10dfffff |
| 3 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 0.463266 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ | 13.660226 | 100.501335 | 8864a4b223fffff |
| 4 | ChIJT-MmY-mj4jARBxgvjap0hf0 | 0.424365 | Suki Teenoi Susco Phuttha Bucha | 13.65134 | 100.488991 | 8864a4b231fffff |


LaBSE dense search example.

#### 15.2 Full-text Search Example

In [19]:
# 15.2 Full-text Search Example
query_text_tokenized = tokenize_and_filter(query).split()
query_text_tokenized

['ซูชิ']

Tokenize query for BM25 search.

In [20]:
display(search_fulltext(query, limit=10))

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 4.527886 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ | 13.660226 | 100.501335 | 8864a4b223fffff |
| 1 | ChIJFepMlimf4jARW2MqZCN7GMQ | 4.527886 | sushimai ซูชิมั้ย ศรีบำเพ็ญ | 13.721445 | 100.54673 | 8864a4b327fffff |


BM25 full-text search example.

#### 15.3 BGE-M3 Dense Search Example

In [21]:
search_bge_dense(query, limit=3)

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 0.609167 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ | 13.660226 | 100.501335 | 8864a4b223fffff |
| 1 | ChIJFepMlimf4jARW2MqZCN7GMQ | 0.597811 | sushimai ซูชิมั้ย ศรีบำเพ็ญ | 13.721445 | 100.54673 | 8864a4b327fffff |
| 2 | ChIJG5cMlYmf4jARjI14Tvhzj5I | 0.567272 | Sushi Sora | 13.726238 | 100.543182 | 8864a4b14dfffff |


BGE-M3 dense search example.

#### 15.4 BGE-M3 Sparse Search Example

In [22]:
search_bge_sparse(query, limit=3)

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJFepMlimf4jARW2MqZCN7GMQ | 0.095425 | sushimai ซูชิมั้ย ศรีบำเพ็ญ | 13.721445 | 100.54673 | 8864a4b327fffff |
| 1 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 0.060696 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ | 13.660226 | 100.501335 | 8864a4b223fffff |
| 2 | ChIJ7crcpPaj4jAR3QlRUKokiWI | 0.000433 | KFC @Susco square | 13.650912 | 100.488785 | 8864a4b231fffff |


BGE-M3 sparse search example.
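
One simple way to compare the four methods is to measure how much their result sets overlap. The sketch below does this for the query "ซูชิ" using the ids from the recorded outputs of the four searches above; the `jaccard` helper and the dictionary layout are illustrative choices. Because the searches used different `limit` values (5, 10, 3, 3), the numbers are only a rough indication.

```python
# Result ids copied from the recorded outputs of the four searches
results = {
    "labse_dense": [
        "ChIJG5cMlYmf4jARjI14Tvhzj5I",
        "ChIJFepMlimf4jARW2MqZCN7GMQ",
        "ChIJC_IG4WGZ4jARJil71QyJnJQ",
        "ChIJ89mnRNGj4jARXFw7F8kdlu8",
        "ChIJT-MmY-mj4jARBxgvjap0hf0",
    ],
    "bm25_fulltext": [
        "ChIJ89mnRNGj4jARXFw7F8kdlu8",
        "ChIJFepMlimf4jARW2MqZCN7GMQ",
    ],
    "bge_m3_dense": [
        "ChIJ89mnRNGj4jARXFw7F8kdlu8",
        "ChIJFepMlimf4jARW2MqZCN7GMQ",
        "ChIJG5cMlYmf4jARjI14Tvhzj5I",
    ],
    "bge_m3_sparse": [
        "ChIJFepMlimf4jARW2MqZCN7GMQ",
        "ChIJ89mnRNGj4jARXFw7F8kdlu8",
        "ChIJ7crcpPaj4jAR3QlRUKokiWI",
    ],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two id lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

methods = list(results)
pairwise = {
    (m1, m2): jaccard(results[m1], results[m2])
    for i, m1 in enumerate(methods)
    for m2 in methods[i + 1:]
}
common = set.intersection(*(set(ids) for ids in results.values()))

for pair, score in pairwise.items():
    print(pair, round(score, 2))
print("returned by every method:", sorted(common))
```

Only the two restaurants whose titles contain the Thai word "ซูชิ" are returned by every method, which is the kind of agreement a hybrid ranker can reward.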

#### 15.5 Hybrid Search Example: LaBSE + Full-text BM25

Combine LaBSE dense vector search and BM25 full-text search. `client.hybrid_search` runs both requests, and `WeightedRanker` normalizes the two score types and ranks the merged hits by their weighted sum.

In [23]:
# Hybrid search: combine LaBSE dense and BM25 full-text using client.hybrid_search
from pymilvus import AnnSearchRequest, WeightedRanker

# Prepare LaBSE dense query (should be a list of list of float)
labse_query = labse_embedding_func.encode_queries([query])  # returns [[...]]
labse_request = AnnSearchRequest(
    data=labse_query,  # [[float, float, ...]]
    anns_field="labse_dense_vector",
    param={
        "metric_type": "COSINE",
        "params": {"nprobe": 10}
    },
    limit=10
)

# Prepare BM25 full-text query (data is a list of raw text strings, one per query).
# A single joined string keeps the number of queries equal to the LaBSE request.
fulltext_query = [tokenize_and_filter(query)]
fulltext_request = AnnSearchRequest(
    data=fulltext_query,  # [str]
    anns_field="pythainlp_sparse_vector",
    param={
        "metric_type": "BM25"
    },
    limit=10
)

# Weighted ranker: weights must match number of requests
ranker = WeightedRanker(0.5, 0.5)

# Run hybrid search
hybrid_res = client.hybrid_search(
    collection_name=collection_name,
    reqs=[labse_request, fulltext_request],
    ranker=ranker,
    limit=5,
    output_fields=["id", "title", "latitude", "longitude", "h3_r8"]
)

display(to_dataframe(hybrid_res))

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJFepMlimf4jARW2MqZCN7GMQ | 0.811099 | sushimai ซูชิมั้ย ศรีบำเพ็ญ | 13.721445 | 100.54673 | 8864a4b327fffff |
| 1 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 0.796627 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ | 13.660226 | 100.501335 | 8864a4b223fffff |
| 2 | ChIJG5cMlYmf4jARjI14Tvhzj5I | 0.384878 | Sushi Sora | 13.726238 | 100.543182 | 8864a4b14dfffff |
| 3 | ChIJC_IG4WGZ4jARJil71QyJnJQ | 0.368889 | Min Sushi by Sushi Cottage ずしコテージ | 13.740359 | 100.525108 | 8864a4b10dfffff |
| 4 | ChIJT-MmY-mj4jARBxgvjap0hf0 | 0.356091 | Suki Teenoi Susco Phuttha Bucha | 13.65134 | 100.488991 | 8864a4b231fffff |
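
The hybrid scores above can be approximated by hand, which helps when tuning the weights. The sketch below assumes that the ranker maps a COSINE score `s` to `(1 + s) / 2` and a BM25 score to `2 * atan(s) / pi` before taking the weighted sum; these mappings are an assumption about `WeightedRanker`, not something this notebook confirms, although they agree with the recorded hybrid scores to about four decimals. The function names are illustrative.

```python
import math

def normalize_cosine(score):
    """Map a cosine similarity from [-1, 1] to [0, 1] (assumed normalization)."""
    return (1.0 + score) / 2.0

def normalize_bm25(score):
    """Map a non-negative BM25 score to [0, 1) with arctan (assumed normalization)."""
    return 2.0 * math.atan(score) / math.pi

def weighted_score(cosine=None, bm25=None, w_dense=0.5, w_bm25=0.5):
    """Weighted sum of normalized scores; a missing hit contributes nothing."""
    total = 0.0
    if cosine is not None:
        total += w_dense * normalize_cosine(cosine)
    if bm25 is not None:
        total += w_bm25 * normalize_bm25(bm25)
    return total

# Input scores taken from the recorded LaBSE and BM25 results for "ซูชิ"
sushimai = weighted_score(cosine=0.521155, bm25=4.527886)
khai_wan = weighted_score(cosine=0.463266, bm25=4.527886)
sushi_sora = weighted_score(cosine=0.539514)  # no BM25 hit for this title

print(sushimai, khai_wan, sushi_sora)
```

The two restaurants matched by both requests end up well ahead of Sushi Sora, even though Sushi Sora has the highest LaBSE score on its own, because it receives no BM25 contribution.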


#### 15.6 Hybrid Search Example: BGE-M3 Dense + Sparse

Combine BGE-M3 dense and sparse vectors for hybrid search. This uses the same embedding model's dense and sparse representations, ranked by WeightedRanker.

In [24]:
# Hybrid search: combine BGE-M3 dense and sparse using client.hybrid_search
from pymilvus import AnnSearchRequest, WeightedRanker

# Prepare BGE-M3 dense query.
# Note: this reuses query_bge_m3_embedding from step 11, which encodes
# query_text ("ร้านอาหารญี่ปุ่น"), not the `query` ("ซูชิ") used in 15.1-15.5.
bge_dense_query = query_bge_m3_embedding["dense"]  # one dense vector per query
bge_dense_request = AnnSearchRequest(
    data=bge_dense_query,  # dense query vector(s)
    anns_field="bge_m3_dense_vector",
    param={
        "metric_type": "IP",
        "params": {"nprobe": 10}
    },
    limit=10
)

# Prepare BGE-M3 sparse query
bge_sparse_query = query_bge_m3_embedding["sparse"]  # SciPy sparse matrix, one row per query
bge_sparse_request = AnnSearchRequest(
    data=bge_sparse_query,  # sparse query vector(s)
    anns_field="bge_m3_sparse_vector",
    param={
        "metric_type": "IP"
    },
    limit=10
)

# Weighted ranker: adjust weights (e.g., favor dense more)
ranker = WeightedRanker(0.7, 0.3)  # 70% dense, 30% sparse

# Run hybrid search
hybrid_res = client.hybrid_search(
    collection_name=collection_name,
    reqs=[bge_dense_request, bge_sparse_request],
    ranker=ranker,
    limit=5,
    output_fields=["id", "title", "latitude", "longitude", "h3_r8"]
)

display(to_dataframe(hybrid_res))

| | id | distance | title | latitude | longitude | h3_r8 |
|---|---|---|---|---|---|---|
| 0 | ChIJE4Fo9SWf4jARBjP_LjHfd2c | 0.489091 | Fuji Restaurant | 13.727249 | 100.540962 | 8864a4b14dfffff |
| 1 | ChIJNcxj5ZGZ4jAR1QmlhMQwj5g | 0.48816 | Nijiki | 13.73406 | 100.52636 | 8864a4b147fffff |
| 2 | ChIJT1SX7Q-f4jARPs72c600vVQ | 0.484106 | Ko' Edo | 13.730033 | 100.534241 | 8864a4b141fffff |
| 3 | ChIJ6bsdFyOf4jARwkK6xhgKl1A | 0.482674 | ROUX'S | 13.723981 | 100.534882 | 8864a4b141fffff |
| 4 | ChIJ_53rJw6f4jARqQOQOs-5dkk | 0.480409 | Maji Curry | 13.746146 | 100.532379 | 8864a4b16bfffff |
