# 🔎 Building a full-text (BM25) search with Milvus and PyThaiNLP

This notebook shows how to build a full-text search pipeline using PyThaiNLP tokenization and Milvus sparse/BM25 inverted indexes. It demonstrates tokenization, BM25 function setup, sparse vector indexing, and example BM25 queries against restaurant data.

### What you will learn
- Tokenize and preprocess Thai text for BM25
- Create BM25 function mappings and sparse fields in Milvus
- Build sparse inverted indexes and tune BM25 parameters
- Insert tokenized records and run BM25 queries

### Requirements
- `pymilvus[model]`
- `pythainlp`
- `pandas`

Run order: execute cells top-to-bottom. Update the connection cell before running cloud-based operations.

Reference: For a step-by-step blog post, see [wiphoo.dev](https://go.wiphoo.dev/Z9R1Wy).

In [1]:
%pip install -q --upgrade "pymilvus[model]" pythainlp pandas

Note: you may need to restart the kernel to use updated packages.


## 🔌 Step 1: Connect to Milvus

Establish a connection to Milvus (local or cloud). Update the `connections.connect(...)` URI or cloud credentials before running indexing or search steps.

In [2]:
# create a connection to Milvus either local or Zilliz cloud
from pymilvus import connections

# local Milvus
connections.connect(uri='http://localhost:19530')

# # Zilliz cloud
# connections.connect(uri="https://YOUR_URI.cloud.zilliz.com", 
#                     token='YOUR_TOKEN',
#                     )
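
To avoid hard-coding cloud credentials in the notebook, the connection arguments can be assembled from environment variables. The sketch below is one way to do that; the variable names `MILVUS_URI` and `MILVUS_TOKEN` and the helper `build_connect_kwargs` are assumptions made for illustration, not part of pymilvus.

```python
import os


def build_connect_kwargs(env=None):
    """Return keyword arguments for connections.connect().

    MILVUS_URI and MILVUS_TOKEN are hypothetical variable names.
    """
    env = os.environ if env is None else env
    # Fall back to the local URI used in the cell above.
    kwargs = {"uri": env.get("MILVUS_URI", "http://localhost:19530")}
    token = env.get("MILVUS_TOKEN")
    if token:  # the token is left out when the variable is unset or empty
        kwargs["token"] = token
    return kwargs


# Demonstrate both cases with explicit mappings instead of the real environment.
local_kwargs = build_connect_kwargs({})
cloud_kwargs = build_connect_kwargs(
    {"MILVUS_URI": "https://YOUR_URI.cloud.zilliz.com", "MILVUS_TOKEN": "YOUR_TOKEN"}
)
print(local_kwargs)
print(sorted(cloud_kwargs))
```

The result can then be passed on as `connections.connect(**build_connect_kwargs())`.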

## 🔤 Step 2: Tokenize and Preprocess Text for BM25

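Thai is written without spaces between words, so we use PyThaiNLP to segment each string into tokens and then drop a short list of domain-specific stopwords (ร้าน "shop", อาหาร "food", สาขา "branch") that appear in most restaurant names. Cleaner tokens let BM25 rank on the terms that actually distinguish one record from another.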

In [3]:
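from pythainlp.tokenize import word_tokenize

# Domain-specific stopwords ("shop", "food", "branch") common to most restaurant names
stopwords = ['ร้าน', 'อาหาร', 'สาขา']

def tokenize_and_filter(text):
    # Segment Thai text with the dictionary-based "newmm" engine
    tokens = word_tokenize(text, engine="newmm")
    # Drop stopwords and whitespace-only tokens, then join with single spaces
    return " ".join([t for t in tokens if t not in stopwords and t.strip() != ""])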
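
The filtering half of this step can be tried without PyThaiNLP installed. The sketch below applies the same stopword and blank-token rule to a hand-written token list; that list stands in for `word_tokenize` output and is an assumption, as is the helper name `filter_tokens`.

```python
STOPWORDS = {"ร้าน", "อาหาร", "สาขา"}


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords and whitespace-only tokens, then join with single spaces."""
    return " ".join(t for t in tokens if t not in stopwords and t.strip() != "")


# Hand-written token list standing in for word_tokenize() output (an assumption).
tokens = ["ไข่หวาน", "บ้าน", "ซูชิ", " ", "สาขา", "ประชาอุทิศ"]
filtered = filter_tokens(tokens)
print(filtered)
```

The space-only token and the stopword สาขา are removed, and the remaining tokens are joined by single spaces so that a whitespace analyzer can split them again later.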


## 🗂️ Step 3: Define Schema and BM25 Functions

Define the schema fields (IDs, titles, tokenized text fields, and sparse vector fields). Then declare BM25 Function(s) to convert tokenized text into sparse vectors for indexing.

In [4]:
from pymilvus import FieldSchema, DataType
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=128),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="text_standard", dtype=DataType.VARCHAR, max_length=1000, enable_match=True, enable_analyzer=True, analyzer_params={"tokenizer": "standard"}),
    FieldSchema(name="text_whitespace", dtype=DataType.VARCHAR, max_length=1000, enable_match=True, enable_analyzer=True, analyzer_params={"tokenizer": "whitespace"}),
    FieldSchema(name="sparse_vector_standard", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="sparse_vector_whitespace", dtype=DataType.SPARSE_FLOAT_VECTOR),
]


### 🧠 Step 3.1: BM25 Functions for Sparse Vectorization

Create BM25 function mappings that convert tokenized text fields into sparse vector fields to be indexed with BM25.

In [5]:
from pymilvus import Function, FunctionType
functions = [
    Function(name="bm25_standard", function_type=FunctionType.BM25, input_field_names=["text_standard"], output_field_names="sparse_vector_standard"),
    Function(name="bm25_whitespace", function_type=FunctionType.BM25, input_field_names=["text_whitespace"], output_field_names="sparse_vector_whitespace"),
]


### 🗂️ Step 3.2: Create the Collection and Indexes

Create the collection with the schema and BM25 functions, then build sparse inverted indexes with BM25 parameters. Load the collection to prepare for search.

In [6]:
from pymilvus import CollectionSchema, Collection, utility
schema = CollectionSchema(fields=fields, functions=functions, description="Schema for restaurant data for full-text search")
collection_name = "full_text_search_restaurants"
if utility.has_collection(collection_name):
    Collection(collection_name).drop()
collection = Collection(collection_name, schema)
# Create the sparse indexes
sparse_index = {
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "BM25",
    "params":{
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75
    }
}
collection.create_index(field_name="sparse_vector_standard", index_params=sparse_index)
collection.create_index(field_name="sparse_vector_whitespace", index_params=sparse_index)
collection.load()
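
To give a feel for what `bm25_k1` and `bm25_b` control, here is a small pure-Python version of the textbook Okapi BM25 formula. It is a sketch for intuition only: the function name `bm25_scores` and the toy documents are made up, and Milvus's internal scoring may differ in details such as the exact IDF variant.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each token occurs.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy, hand-tokenized documents (assumptions, not rows from the restaurant CSV).
docs = [
    ["บ้าน", "ซูชิ"],
    ["ต้ม", "เลือด", "หมู"],
    ["บ้าน", "บ้าน", "กาแฟ", "สด"],
]
scores_default = bm25_scores(["บ้าน"], docs)            # k1=1.2, b=0.75
scores_no_len_norm = bm25_scores(["บ้าน"], docs, b=0.0)  # length ignored
print(scores_default)
print(scores_no_len_norm)
```

With `b=0.75` the short first document scores higher than it does with `b=0.0`, and the longer third document scores lower, which is the length normalization at work. Raising `k1` would let the repeated term in the third document count for more before its contribution levels off.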


## 🍽️ Step 4: Prepare, Tokenize, and Ingest Restaurant Data

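Load your restaurant CSV, tokenize the `title` column, and insert the resulting space-separated string into both text fields (`text_standard` and `text_whitespace`). The sparse vector fields are omitted from the insert because the BM25 functions generate them on the server. Flush to persist the data.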

In [7]:
import pandas as pd
df = pd.read_csv("../../data/2025/restaurants/sample_restaurants.csv")
df["tokenized_separate_by_whitespace"] = df[["title"]].agg(" ".join, axis=1).apply(tokenize_and_filter)
entities = [
    df["place_id"].tolist(),
    df["title"].tolist(),
    df["tokenized_separate_by_whitespace"].tolist(),
    df["tokenized_separate_by_whitespace"].tolist(),
]
collection.insert(entities)
collection.flush()
print(f"Inserted {len(df)} records with embeddings.")


Inserted 268 records with embeddings.


## ⚙️ Step 5: Helper Functions

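Add a utility function that flattens Milvus search results into a pandas DataFrame for analysis. The helper sorts by `distance` in ascending order; with the BM25 metric a higher score means a better match, so the strongest hit appears last in the table.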

In [8]:
def display_results(results):
    flat_results = []
    for query_results in results:
        for match in query_results:
            entity = match.get("entity", {})
            flat_result = {
                "id": match.get("id"),
                "distance": match.get("distance"),
                **{k: v for k, v in entity.items()}
            }
            flat_results.append(flat_result)
    return pd.DataFrame(flat_results).sort_values(by=['distance', 'title', 'id'])
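
If you would rather see the best match first, the same flattening can be done with a descending sort. The sketch below is a dependency-free variant that returns a list of dicts instead of a DataFrame; `flatten_hits` is a hypothetical name, and the mock hits are plain dicts shaped like the fields the helper above reads (`id`, `distance`, `entity`), not real search output.

```python
def flatten_hits(results):
    """Flatten nested search results and sort by BM25 score, best first."""
    rows = []
    for hits in results:          # one list of hits per query
        for hit in hits:
            row = {"id": hit.get("id"), "distance": hit.get("distance")}
            row.update(hit.get("entity", {}))  # merge the requested output fields
            rows.append(row)
    return sorted(rows, key=lambda r: r["distance"], reverse=True)


# Mock hits with made-up ids, scores, and titles.
mock_results = [[
    {"id": "a", "distance": 3.36, "entity": {"title": "first"}},
    {"id": "b", "distance": 8.13, "entity": {"title": "second"}},
]]
rows = flatten_hits(mock_results)
print([r["id"] for r in rows])
```

The equivalent change in the DataFrame helper would be passing `ascending=False` for the `distance` key in `sort_values`.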


## 🔍 Step 6: Test Queries (BM25 Search)

Run example BM25 queries against different sparse fields (standard vs whitespace analyzers) and compare returned results.

In [9]:
results_standard = collection.search(
    data=['บ้าน'],
    anns_field="sparse_vector_standard",
    param={"metric_type": "BM25", "topk": 10},
    output_fields=["id", "title"],
    limit=10,
)
display_results(results_standard)


| | id | distance | title |
|---|---|---|---|
| 1 | ChIJXx6MOwCj4jARGMX5QqIoacQ | 3.362928 | ต้มเลือดหมูท่านเปา ประชาอุทิศ 62 |
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 8.132664 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ |


In [10]:
results_whitespace = collection.search(
    data=['บ้าน'],
    anns_field="sparse_vector_whitespace",
    param={"metric_type": "BM25", "topk": 10},
    output_fields=["id", "title"],
    limit=10,
)
display_results(results_whitespace)


| | id | distance | title |
|---|---|---|---|
| 0 | ChIJ89mnRNGj4jARXFw7F8kdlu8 | 5.057873 | ไข่หวานบ้านซูชิ สาขาประชาอุทิศ |
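
One reason the pre-tokenization in Step 2 matters for the whitespace field: a whitespace analyzer can only match a query term that stands alone between spaces. The sketch below models that analyzer as a plain `str.split()`, which is a simplification, and the segmented string is hand-written for illustration rather than actual `word_tokenize` output.

```python
def whitespace_tokens(text):
    # Simplified stand-in for a whitespace analyzer: split on runs of whitespace.
    return text.split()


query = "บ้าน"
# Title as it appears in the data: the query term is embedded in a longer run.
raw_title = "ไข่หวานบ้านซูชิ สาขาประชาอุทิศ"
# Hypothetical segmentation of the same title with the stopword removed.
segmented = "ไข่หวาน บ้าน ซูชิ ประชาอุทิศ"

found_in_raw = query in whitespace_tokens(raw_title)
found_in_segmented = query in whitespace_tokens(segmented)
print(found_in_raw, found_in_segmented)  # prints: False True
```

Without segmentation the term is buried inside a longer token and cannot match exactly; once the text is segmented it becomes a token of its own.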


## 🧹 Step 7: Cleanup

Disconnect the Milvus connection and free resources when finished.

In [11]:
# disconnect Milvus connection
connections.disconnect('default')