### 🏪 Series II: Indexing and Querying Restaurant Data with Milvus using Full-Text Search (BM25)

This notebook demonstrates how to index and search restaurant data using the Milvus Vector Database with sparse vector embeddings for full-text search.

In this example, we'll use Milvus's built-in BM25 function to convert a restaurant text field (the `title`) into sparse vectors, then run full-text search over those vectors.
> 🧠 This approach enables keyword-based search: restaurants are matched on tokenized text and ranked by BM25 score.

You'll learn how to:

* Prepare restaurant data with `title` fields.
* Tokenize and preprocess text for BM25.
* Insert the processed data into Milvus.
* Perform full-text similarity searches to find related restaurants.

#### 🛠 Requirements
Make sure you have the following Python libraries installed:
* `pymilvus`
* `pythainlp`
* `pandas`

You can use either:
* A **local Milvus** instance (e.g. via Docker)
* Or a **managed Milvus** service such as [Zilliz Cloud](https://cloud.zilliz.com)

📖 For more context, see the full blog post at: [wiphoo.dev](https://go.wiphoo.dev/Z9R1Wy)

### 🔗 Connect to Milvus (Local or Managed Cloud)

In [1]:
# Connect to Milvus: either a local instance or Zilliz Cloud
from pymilvus import connections

# local Milvus
connections.connect(uri='http://localhost:19530')

# # Zilliz cloud
# connections.connect(uri="https://YOUR_URI.cloud.zilliz.com", 
#                     token='YOUR_TOKEN',
#                     )
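
To switch between the two targets without editing the cell, the connection arguments could be read from environment variables. The sketch below is one way to do that; the variable names `MILVUS_URI` and `MILVUS_TOKEN` and the helper `milvus_connect_kwargs` are hypothetical and not part of pymilvus.

```python
import os

def milvus_connect_kwargs(env=None):
    """Build keyword arguments for connections.connect() from environment variables.

    MILVUS_URI and MILVUS_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    # Fall back to the local instance used in this notebook.
    kwargs = {"uri": env.get("MILVUS_URI", "http://localhost:19530")}
    token = env.get("MILVUS_TOKEN")
    if token:  # a token is typically needed for a managed service such as Zilliz Cloud
        kwargs["token"] = token
    return kwargs

# Usage (requires pymilvus and a reachable server):
# connections.connect(**milvus_connect_kwargs())
print(milvus_connect_kwargs({}))
```

With no variables set, the helper falls back to the local URI, so the notebook behaves as before.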

#### 🔤 Tokenize and Preprocess Text for BM25

We use PyThaiNLP to tokenize Thai text and remove common stopwords. This helps BM25 focus on meaningful tokens.

In [2]:
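from pythainlp.tokenize import word_tokenize

# Domain-specific stopwords: "shop/restaurant", "food", "branch"
stopwords = ['ร้าน', 'อาหาร', 'สาขา']

def tokenize_and_filter(text):
    # Segment Thai text with the newmm engine, then drop stopwords and whitespace-only tokens
    tokens = word_tokenize(text, engine="newmm")
    return " ".join([t for t in tokens if t not in stopwords and t.strip() != ""])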
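
To see what the filtering step does without installing PyThaiNLP, the sketch below applies the same stopword rule to a hand-written token list. The list stands in for `word_tokenize` output and is an assumption: the real `newmm` segmentation of a given title may differ. `filter_tokens` is a hypothetical helper that mirrors the second half of `tokenize_and_filter`.

```python
# Same stopword list as the notebook cell above.
STOPWORDS = {"ร้าน", "อาหาร", "สาขา"}

def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords and whitespace-only tokens, then join with single spaces."""
    return " ".join(t for t in tokens if t not in stopwords and t.strip() != "")

# Hand-written tokens standing in for word_tokenize() output (an assumption;
# the real newmm segmentation may differ).
tokens = ["ร้าน", "อาหาร", "บ้าน", "ซูชิ", " ", "สาขา", "ประชาอุทิศ"]
filtered = filter_tokens(tokens)
print(filtered)  # บ้าน ซูชิ ประชาอุทิศ
```

The result is a single space-separated string, which is the form the text fields in the schema receive.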


#### 🗂️ Define a Schema and Create a Collection with Indexing

##### 📄 Step 1: Define the Schema for Restaurant Data

We’ll define a schema that includes key information for each restaurant:
* `id`: Unique identifier (string)
* `title`: Restaurant name (string)
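* `text_standard`, `text_whitespace`: The same pre-tokenized text, stored twice so it can be analyzed with the `standard` and `whitespace` tokenizers respectively
* `sparse_vector_standard`, `sparse_vector_whitespace`: Sparse vector fields that the BM25 functions populate from the two text fields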

In [3]:
from pymilvus import FieldSchema, DataType
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=128),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="text_standard", dtype=DataType.VARCHAR, max_length=1000, enable_match=True, enable_analyzer=True, analyzer_params={"tokenizer": "standard"}),
    FieldSchema(name="text_whitespace", dtype=DataType.VARCHAR, max_length=1000, enable_match=True, enable_analyzer=True, analyzer_params={"tokenizer": "whitespace"}),
    FieldSchema(name="sparse_vector_standard", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="sparse_vector_whitespace", dtype=DataType.SPARSE_FLOAT_VECTOR),
]
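
The reason for joining PyThaiNLP tokens with spaces becomes clearer with a rough model of a whitespace tokenizer. The sketch below approximates it with Python's `str.split()`; this is an illustration only and not Milvus's analyzer implementation, and the sample strings are hypothetical `tokenize_and_filter` inputs and outputs.

```python
def whitespace_terms(text):
    """Approximate a whitespace tokenizer: split on runs of whitespace."""
    return text.split()

raw_title = "ไข่หวานบ้านซูชิ"            # unsegmented Thai: no spaces between words
pre_tokenized = "ไข่หวาน บ้าน ซูชิ"      # hypothetical tokenize_and_filter() output

raw_terms = whitespace_terms(raw_title)
segmented_terms = whitespace_terms(pre_tokenized)

# The query term only appears as its own term after segmentation.
print("บ้าน" in raw_terms)
print("บ้าน" in segmented_terms)
```

Because Thai is written without spaces between words, segmenting before indexing is what lets a single-word query line up with an indexed term under this model.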


##### 📄 Step 2: Define BM25 Functions for Sparse Vectorization

In [4]:
from pymilvus import Function, FunctionType
functions = [
    Function(name="bm25_standard", function_type=FunctionType.BM25, input_field_names=["text_standard"], output_field_names="sparse_vector_standard"),
    Function(name="bm25_whitespace", function_type=FunctionType.BM25, input_field_names=["text_whitespace"], output_field_names="sparse_vector_whitespace"),
]


##### 📄 Step 3: Create the Collection and Indices

In [5]:
from pymilvus import CollectionSchema, Collection, utility
schema = CollectionSchema(fields=fields, functions=functions, description="Schema for restaurant data, for full-text search")
collection_name = "full_text_search_restaurants"
if utility.has_collection(collection_name):
    Collection(collection_name).drop()
collection = Collection(collection_name, schema)
# Create the indexes
sparse_index = {
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "BM25",
    "params":{
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75
    }
}
collection.create_index(field_name="sparse_vector_standard", index_params=sparse_index)
collection.create_index(field_name="sparse_vector_whitespace", index_params=sparse_index)
collection.load()
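
The `bm25_k1` and `bm25_b` values in the index parameters are the two usual BM25 tuning knobs: `k1` controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly longer documents are penalized. The sketch below scores a toy corpus with the textbook Okapi BM25 formula in plain Python to make that concrete. The corpus and the `bm25_score` helper are invented for illustration, and Milvus's internal scoring may differ in details such as the exact IDF variant.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with textbook Okapi BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_terms) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Toy corpus of already-tokenized titles (invented for this sketch).
corpus = [
    ["บ้าน", "ซูชิ"],
    ["ก๋วยเตี๋ยว", "เรือ"],
    ["บ้าน", "กาแฟ", "และ", "ขนม", "หวาน"],
]
scores = [bm25_score(["บ้าน"], doc, corpus) for doc in corpus]
for doc, s in zip(corpus, scores):
    print(doc, round(s, 4))
```

Both documents containing the query term get a positive score, and the shorter one ranks higher because of the length normalization that `b` controls.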


#### 🍽️ Step 4: Prepare, Tokenize, and Ingest Restaurant Data

In [6]:
import pandas as pd
df = pd.read_csv("../../data/2025/restaurants/sample_restaurants.csv")
df["tokenized_separate_by_whitespace"] = df[["title"]].agg(" ".join, axis=1).apply(tokenize_and_filter)
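# Column-based insert: one list per non-function field, in schema order
entities = [
    df["place_id"].tolist(),                          # id
    df["title"].tolist(),                             # title
    df["tokenized_separate_by_whitespace"].tolist(),  # text_standard
    df["tokenized_separate_by_whitespace"].tolist(),  # text_whitespace
]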
collection.insert(entities)
collection.flush()
print(f"Inserted {len(df)} records with embeddings.")


Inserted 268 records with embeddings.
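
The column-based list passed to `insert` relies on position: each inner list has to line up with the schema's fields in order, leaving out the two sparse-vector fields that the BM25 functions fill in. A row-oriented layout makes that mapping explicit. The sketch below builds such rows in plain Python; `build_rows` is a hypothetical helper, the tokenized text is a made-up example, and whether `collection.insert` accepts a list of dicts depends on the installed pymilvus version, so treat that part as an assumption.

```python
def build_rows(place_ids, titles, tokenized_texts):
    """Pair each record's values with the schema field names used in this notebook."""
    if not (len(place_ids) == len(titles) == len(tokenized_texts)):
        raise ValueError("all columns must have the same length")
    return [
        # The same tokenized text feeds both analyzer fields.
        {"id": pid, "title": title, "text_standard": text, "text_whitespace": text}
        for pid, title, text in zip(place_ids, titles, tokenized_texts)
    ]

rows = build_rows(
    ["ChIJ89mnRNGj4jARXFw7F8kdlu8"],
    ["ไข่หวานบ้านซูชิ สาขาประชาอุทิศ"],
    ["ไข่หวาน บ้าน ซูชิ ประชาอุทิศ"],  # hypothetical tokenizer output
)
print(rows[0])
```

The length check catches a misaligned column before anything is sent to the server.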


#### ⚙️ Step 5: Create Helper Function to Display Results

In [7]:
def display_results(results):
    flat_results = []
    for query_results in results:
        for match in query_results:
            entity = match.get("entity", {})
            flat_result = {
                "id": match.get("id"),
                "distance": match.get("distance"),
                **{k: v for k, v in entity.items()}
            }
            flat_results.append(flat_result)
    return pd.DataFrame(flat_results).sort_values(by=['distance', 'title', 'id'])
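
Note that `display_results` sorts by `distance` in ascending order, while BM25 scores are usually read as higher-is-more-relevant. The sketch below shows the same flattening in plain Python with a best-first ordering instead. `flatten_hits` is a hypothetical helper, and the mock hits are plain dicts shaped like what `display_results` expects, with ids, scores, and titles copied from the first test query's output in Step 6.

```python
def flatten_hits(results):
    """Flatten nested search results into a list of dicts, highest score first."""
    flat = []
    for query_hits in results:
        for hit in query_hits:
            row = {"id": hit.get("id"), "distance": hit.get("distance")}
            row.update(hit.get("entity", {}))  # merge returned output fields
            flat.append(row)
    return sorted(flat, key=lambda r: r["distance"], reverse=True)

# Mock data: one query with two hits, as plain dicts.
mock_results = [[
    {"id": "ChIJXx6MOwCj4jARGMX5QqIoacQ", "distance": 3.362928,
     "entity": {"title": "ต้มเลือดหมูท่านเปา ประชาอุทิศ 62"}},
    {"id": "ChIJ89mnRNGj4jARXFw7F8kdlu8", "distance": 8.132664,
     "entity": {"title": "ไข่หวานบ้านซูชิ สาขาประชาอุทิศ"}},
]]
ranked = flatten_hits(mock_results)
for row in ranked:
    print(row["distance"], row["title"])
```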


#### 🔍 Step 6: Test Queries (BM25 Search)

In [8]:
results_standard = collection.search(
    data=['บ้าน'],
    anns_field="sparse_vector_standard",
    param={"metric_type": "BM25", "topk": 10},
    output_fields=["id", "title"],
    limit=10,
)
display_results(results_standard)


|   | id | distance | title |
|---|----|----------|-------|
| 1 | ChIJXx6MOwCj4jARGMX5QqIoacQ | 3.362928 | ต้มเลือดหมูท่านเปา ประชาอุทิศ 62 |
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 8.132664 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ |


In [9]:
results_whitespace = collection.search(
    data=['บ้าน'],
    anns_field="sparse_vector_whitespace",
    param={"metric_type": "BM25", "topk": 10},
    output_fields=["id", "title"],
    limit=10,
)
display_results(results_whitespace)


|   | id | distance | title |
|---|----|----------|-------|
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 5.057873 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ |


#### Step 7: Cleanup - Disconnect Milvus Connection

In [10]:
# disconnect Milvus connection
connections.disconnect('default')