# Metadata Filtering in Vector Search: Milvus Demo
#
This notebook demonstrates metadata filtering with Milvus, following the
"Metadata Filtering in Vector Search: A Comprehensive Guide for Engineering Leaders".
We use our standard synthetic product dataset and Milvus Lite for easy local development.
#
**Key Milvus Concepts Covered:**
- Setting up Milvus Lite.
- Defining a Collection schema with scalar and vector fields.
- Inserting data, including vectors and metadata.
- Creating indexes on vector fields and scalar fields.
- Using boolean expressions (`expr`) for filtering, including the `LIKE` operator.
- Discussing limitations of certain index types (e.g., Bitmap).
- Combining vector search with metadata filters.
#
*For a full discussion on Milvus's architecture, scaling, and specific index types, please refer to our main guide.*

In [2]:
# 1. Setup
# !pip install pymilvus==2.4.3 numpy pandas  # pinned to a 2.4.x release that supports LIKE
# Milvus Lite ships with pymilvus >= 2.4.2

# MilvusClient drives every step below; DataType is needed for the schema definition.
from pymilvus import DataType, MilvusClient

import numpy as np
import pandas as pd
import time

## IMPORTANT: Setting up Milvus Lite
#
Milvus Lite runs Milvus locally inside your Python environment and stores its data in a local file.
PyMilvus 2.4.x releases generally bundle a Milvus Lite core that supports
features like the `LIKE` operator discussed in our article.

In [3]:
from pymilvus import MilvusClient

# Passing a local file path makes MilvusClient use Milvus Lite and keep its data in that file.
client = MilvusClient("./milvus_demo.db")
print("Successfully connected to Milvus using MilvusClient")

Successfully connected to Milvus using MilvusClient


In [4]:
# 2. Data Preparation
# Same synthetic product data
data = [
    {"id": "P001", "product_name": "Smartwatch Series X", "category": "electronics", "brand": "AlphaTech", "price": 299.99, "rating": 4.5, "in_stock": True, "release_year": 2023},
    {"id": "P002", "product_name": "Organic Green Tea", "category": "groceries", "brand": "NaturePure", "price": 15.50, "rating": 4.8, "in_stock": True, "release_year": 2022},
    {"id": "P003", "product_name": "Running Shoes Pro", "category": "apparel", "brand": "FitStride", "price": 120.00, "rating": 4.3, "in_stock": False, "release_year": 2023},
    {"id": "P004", "product_name": "Wireless Headphones", "category": "electronics", "brand": "AudioMax", "price": 199.50, "rating": 4.7, "in_stock": True, "release_year": 2022},
    {"id": "P005", "product_name": "Advanced Yoga Mat", "category": "sports", "brand": "ZenFlow", "price": 45.00, "rating": 4.9, "in_stock": True, "release_year": 2024},
    {"id": "P006", "product_name": "Smartphone Model Z", "category": "electronics", "brand": "AlphaTech", "price": 799.00, "rating": 4.2, "in_stock": True, "release_year": 2023},
]
df = pd.DataFrame(data)

vector_dim = 128
df['vector'] = [np.random.rand(vector_dim).tolist() for _ in range(len(df))]
df.head(2)

|   | id | product_name | category | brand | price | rating | in_stock | release_year | vector |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P001 | Smartwatch Series X | electronics | AlphaTech | 299.99 | 4.5 | True | 2023 | [0.7473844001701807, 0.2301659131005277, 0.974... |
| 1 | P002 | Organic Green Tea | groceries | NaturePure | 15.5 | 4.8 | True | 2022 | [0.2395778541273892, 0.2744254339694293, 0.698... |


In [5]:
# 3. Collection Creation (Schema Definition)
collection_name = "ProductCatalogMilvusV2"

# Check if collection exists and drop if needed
if client.has_collection(collection_name):
    print(f"Collection '{collection_name}' found. Dropping it.")
    client.drop_collection(collection_name)
    time.sleep(1)

# Build the schema object field by field with create_schema()
schema = client.create_schema()
schema.add_field(field_name="product_pk_id", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="product_name", datatype=DataType.VARCHAR, max_length=256)
schema.add_field(field_name="category", datatype=DataType.VARCHAR, max_length=100)
schema.add_field(field_name="brand", datatype=DataType.VARCHAR, max_length=100)
schema.add_field(field_name="price", datatype=DataType.FLOAT)
schema.add_field(field_name="rating", datatype=DataType.FLOAT)
schema.add_field(field_name="in_stock", datatype=DataType.BOOL)
schema.add_field(field_name="release_year", datatype=DataType.INT64)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=vector_dim)

# Create the collection
client.create_collection(collection_name=collection_name, schema=schema)
print(f"Collection '{collection_name}' created.")

Collection 'ProductCatalogMilvusV2' found. Dropping it.
Collection 'ProductCatalogMilvusV2' created.
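
Because the schema fixes each field's type and length, it can help to check rows on the client before inserting them. The sketch below is a hypothetical helper: `FIELD_SPECS` and `validate_row` are illustrative names, not pymilvus APIs, and the specs simply mirror the scalar fields defined above. It assumes `max_length` is counted in UTF-8 bytes. Milvus runs its own validation at insert time, so treat this as an optional early check.

```python
# Hypothetical client-side check; FIELD_SPECS and validate_row are illustrative
# names, not pymilvus APIs. The specs mirror the scalar fields in the schema above.
FIELD_SPECS = {
    "product_pk_id": (str, 100),
    "product_name": (str, 256),
    "category": (str, 100),
    "brand": (str, 100),
    "price": (float, None),
    "rating": (float, None),
    "in_stock": (bool, None),
    "release_year": (int, None),
}

def validate_row(row, vector_dim=128):
    """Return a list of problems found in one row dict (an empty list means OK)."""
    problems = []
    for name, (py_type, max_length) in FIELD_SPECS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly for INT64 fields.
        if py_type is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, py_type):
            problems.append(f"{name}: expected {py_type.__name__}, got {type(value).__name__}")
        # Assumption: max_length is measured in UTF-8 bytes rather than characters.
        elif max_length is not None and len(value.encode("utf-8")) > max_length:
            problems.append(f"{name}: longer than max_length={max_length}")
    if len(row.get("embedding", [])) != vector_dim:
        problems.append(f"embedding: expected {vector_dim} dimensions")
    return problems

sample_row = {
    "product_pk_id": "P001", "product_name": "Smartwatch Series X",
    "category": "electronics", "brand": "AlphaTech", "price": 299.99,
    "rating": 4.5, "in_stock": True, "release_year": 2023,
    "embedding": [0.0] * 128,
}
print(validate_row(sample_row))
```

A strict `float` check like this one is the reason the insert cell casts `price` and `rating` with `float(...)` before building each row.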


In [6]:
# 4. Create Index for Vector Field
# FLAT is an exact brute-force index with no build parameters, which suits this tiny dataset.
# (M and efConstruction apply only to index_type="HNSW".)
vector_index_params = client.prepare_index_params()
vector_index_params.add_index(field_name="embedding", index_type="FLAT", metric_type="L2", index_name="vector_flat_idx")
client.create_index(collection_name=collection_name, index_params=vector_index_params)
print("Vector index definition created for 'embedding' field.")

Vector index definition created for 'embedding' field.


In [7]:
# 5. Inserting Data
data_to_insert = []
for _, row in df.iterrows():
    data_to_insert.append({
        "product_pk_id": row["id"],
        "product_name": row["product_name"],
        "category": row["category"],
        "brand": row["brand"],
        "price": float(row["price"]),
        "rating": float(row["rating"]),
        "in_stock": bool(row["in_stock"]),
        "release_year": int(row["release_year"]),
        "embedding": row["vector"]
    })
insert_result = client.insert(collection_name=collection_name, data=data_to_insert)
client.flush(collection_name=collection_name)
print(f"Inserted {insert_result} entities. Collection count: {client.get_collection_stats(collection_name=collection_name)}")

Inserted {'insert_count': 6, 'ids': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006']} entities. Collection count: {'row_count': 6}


In [8]:
# 6. Create Indexes on Scalar Fields for Filtering Performance
print("Creating scalar indexes...")

# Prepare an empty IndexParams object
index_params = client.prepare_index_params()

# Add indexes for each field
index_params.add_index(
    field_name="product_name",
    index_name="product_name_scalar_idx"
    # Type is omitted for auto-indexing
)

index_params.add_index(
    field_name="category",
    index_name="category_scalar_idx"
    # Could be BITMAP if cardinality is low
)

index_params.add_index(
    field_name="price",
    index_name="price_scalar_idx"
)

# Create all indexes in one call
client.create_index(
    collection_name=collection_name,
    index_params=index_params
)

print("Scalar indexes created for 'product_name', 'category', and 'price'.")
print("Note: The 'product_name' index supports LIKE if Milvus version >= 2.4.")

Creating scalar indexes...
Scalar indexes created for 'product_name', 'category', and 'price'.
Note: The 'product_name' index supports LIKE if Milvus version >= 2.4.


### Bitmap Index Considerations (Note)

Our main guide discusses Milvus's **Bitmap indexes**, which are highly efficient for filtering on **low-cardinality** scalar fields (typically fewer than 500 distinct values).

**Key Limitations of Bitmap Indexes (from the article & Milvus docs):**
- **Not for high-cardinality fields.**
- **Not compatible with floating-point types (FLOAT, DOUBLE) or JSON data types.** They work well with `BOOL`, `INT` types, and `VARCHAR` fields that have few unique values.
- Supported for scalar fields that are not primary keys.
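
To make the cardinality point concrete, here is a minimal conceptual sketch of how a bitmap index can answer equality filters. It is not Milvus's internal implementation: the function names are illustrative and Python integers stand in for bit arrays.

```python
# Conceptual sketch of a bitmap index; not Milvus's internal implementation.
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def rows_from_bitmap(bitmap):
    """Expand a bitmap back into the list of matching row positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Column values taken from the six sample products, in row order.
categories = ["electronics", "groceries", "apparel", "electronics", "sports", "electronics"]
in_stock = [True, True, False, True, True, True]

category_index = build_bitmap_index(categories)
stock_index = build_bitmap_index(in_stock)

# category == "electronics" and in_stock == true becomes a single bitwise AND.
matches = category_index["electronics"] & stock_index[True]
print(rows_from_bitmap(matches))  # [0, 3, 5]
```

One bitmap is stored per distinct value, so the structure stays compact only while the number of distinct values is small, which is the intuition behind the low-cardinality guidance above.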

If our `category` field had very few unique values (e.g., <10 across millions of records), explicitly creating it as a BITMAP index could be beneficial using the newer MilvusClient approach:

```python
# Using MilvusClient to create a BITMAP index (if 'category' is low cardinality)
# try:
#     # Create index parameters
#     index_params = client.prepare_index_params()
#     
#     # Add BITMAP index for the low-cardinality category field
#     index_params.add_index(
#         field_name="category",
#         index_type="BITMAP",
#         index_name="category_bitmap_idx"
#     )
#     
#     # Create the index
#     client.create_index(
#         collection_name=collection_name,
#         index_params=index_params
#     )
#     print("BITMAP index created for 'category'.")
# except Exception as e:
#     print(f"Note on BITMAP: {e}")
```

For most general VARCHAR fields or numerical fields like `price`, Milvus typically uses other scalar index types (like INVERTED or STL_SORT) by default. When using the newer `MilvusClient` API, you can often omit the `index_type` parameter to let Milvus choose the appropriate index type automatically.
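
The rules of thumb in this section can be written down as a small decision helper. This is a sketch under the assumptions stated above (a threshold of roughly 500 distinct values, no floating-point or JSON fields, no primary keys); `suggest_scalar_index` is a hypothetical function, not a Milvus API.

```python
# Hypothetical helper encoding this section's rules of thumb; the function name
# and the max_cardinality parameter are illustrative, not part of Milvus.
BITMAP_INCOMPATIBLE_TYPES = {"FLOAT", "DOUBLE", "JSON"}

def suggest_scalar_index(data_type, distinct_values, is_primary=False, max_cardinality=500):
    """Return "BITMAP" when the criteria above hold, else None (omit index_type)."""
    if is_primary or data_type in BITMAP_INCOMPATIBLE_TYPES:
        return None
    if distinct_values < max_cardinality:
        return "BITMAP"
    return None

print(suggest_scalar_index("VARCHAR", distinct_values=4))   # BITMAP
print(suggest_scalar_index("FLOAT", distinct_values=6))     # None
```

A `None` result corresponds to leaving `index_type` out of `add_index(...)`, as the scalar index cell does for `product_name`, `category`, and `price`.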

In [9]:
# Load the collection into memory for searching
print("Loading collection into memory...")
client.load_collection(collection_name=collection_name, replica_number=1)
print("Collection loaded.")

Loading collection into memory...
Collection loaded.


## 7. Metadata Filtering Examples (`expr`)
#
Milvus uses SQL-like boolean expressions (`expr`) for filtering; `MilvusClient` methods such as `search()` accept them through the `filter` argument.
*Ensure your Milvus instance (or Milvus Lite via PyMilvus) is version 2.4 or newer for `LIKE` operator support.*
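
Filter expressions are plain strings, so it is easy to assemble them from application values. The helper below is a minimal sketch: `to_literal` and `build_filter` are hypothetical names, and it assumes lowercase `true`/`false` boolean literals and JSON-style escaping inside double-quoted strings. Check both assumptions against the expression rules of your Milvus version.

```python
import json

def to_literal(value):
    """Render a Python value as a literal for a filter expression."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, str):
        # Assumption: double-quoted strings with backslash escapes are accepted.
        return json.dumps(value, ensure_ascii=False)
    return repr(value)

def build_filter(*conditions):
    """Join (field, operator, value) triples with `and`."""
    return " and ".join(f"{field} {op} {to_literal(value)}" for field, op, value in conditions)

expr = build_filter(
    ("category", "==", "electronics"),
    ("price", "<", 300),
    ("in_stock", "==", True),
)
print(expr)  # category == "electronics" and price < 300 and in_stock == true
```

Building the string through one function keeps quoting consistent and avoids hand-concatenating user input into an expression.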

In [10]:
query_vector = np.random.rand(vector_dim).tolist()
TOP_K = 3
OUTPUT_FIELDS = ["product_pk_id", "product_name", "category", "brand", "price", "rating", "in_stock", "release_year"]
# `ef` is an HNSW search parameter; the FLAT index used here scans every vector regardless.
search_params_hnsw = {"metric_type": "L2", "params": {"ef": 20}}

def print_milvus_results(results, query_desc=""):
    print(f"\n--- {query_desc} ---")
    if not results or not results[0]:
        print("No results found.")
        return
    for hits in results:
        for hit in hits:
            # Dict-style access works whether hits are plain dicts (pymilvus 2.4.x) or Hit objects.
            entity = hit["entity"]
            print(f"  ID: {entity.get('product_pk_id')}, Dist: {hit['distance']:.4f}, Name: {entity.get('product_name')}, Cat: {entity.get('category')}")
            # print(f"  Full Entity: {entity}")  # For all fields

# Example: restrict the vector search to a single category
expr_cat = 'category == "electronics"'

# MilvusClient.search() takes `search_params` and `filter`
# (the ORM-style Collection.search() calls them `param` and `expr`).
results_cat = client.search(
    collection_name=collection_name,
    data=[query_vector],
    anns_field="embedding",
    search_params=search_params_hnsw,
    filter=expr_cat,
    limit=TOP_K,
    output_fields=OUTPUT_FIELDS
)
print_milvus_results(results_cat, f"Filtering with expr: {expr_cat}")


--- Filtering with expr: category == "electronics" ---
  ID: P004, Dist: 19.1212, Name: Wireless Headphones, Cat: electronics
  ID: P006, Dist: 20.1757, Name: Smartphone Model Z, Cat: electronics
  ID: P001, Dist: 20.3946, Name: Smartwatch Series X, Cat: electronics
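
Conceptually, a filtered search on a FLAT index amounts to "keep the rows that pass the filter, then rank them by distance". The pure-Python sketch below illustrates that idea on random data; it is not how Milvus executes the query internally, and `filtered_search` is an illustrative name. It uses squared Euclidean distance, which appears consistent with the distances near 20 reported above: for 128-dimensional uniform random vectors the expected squared distance is 128/6, about 21.3.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance: the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filtered_search(rows, query, predicate, limit):
    """Keep rows passing the metadata predicate, then rank them by distance to the query."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: squared_l2(row["embedding"], query))
    return candidates[:limit]

random.seed(0)
dim = 128
sample_categories = ["electronics", "groceries", "apparel", "electronics", "sports", "electronics"]
rows = [
    {"id": f"P{i:03d}", "category": cat, "embedding": [random.random() for _ in range(dim)]}
    for i, cat in enumerate(sample_categories, start=1)
]
query = [random.random() for _ in range(dim)]

hits = filtered_search(rows, query, lambda row: row["category"] == "electronics", limit=3)
for row in hits:
    print(row["id"], round(squared_l2(row["embedding"], query), 4))
```

With only three electronics products in the data, the filter alone decides which rows come back, and the vector distance only decides their order.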


In [11]:
# ### Example 7.6: `LIKE` operator for partial string matches
#
# As noted in our article, the `LIKE` operator was officially added in Milvus 2.4.
# It requires a Milvus version >= 2.4 and the target VARCHAR field (e.g., `product_name`)
# should have a scalar index created on it.
# The "%" symbol is used as a wildcard.


expr_like = 'product_name like "Smartwatch%"'
print(f"\nAttempting LIKE query (ensure Milvus version >= 2.4): {expr_like}")
try:
    results_like = client.search(
        collection_name=collection_name,
        data=[query_vector],
        anns_field="embedding",
        search_params=search_params_hnsw,
        limit=TOP_K,
        filter=expr_like,  # prefix pattern: names starting with "Smartwatch"
        output_fields=OUTPUT_FIELDS
    )
    print_milvus_results(results_like, f"Filtering with expr: {expr_like}")
except Exception as e:
    print(f"Error during 'LIKE' query: {e}")
    print("This could be due to Milvus version < 2.4, issues with scalar index on 'product_name', or specific characters in the query string needing escape (e.g. literal '%').")

expr_like_infix = 'product_name like "%Green Tea%"'
print(f"\nAttempting infix LIKE query: {expr_like_infix}")
try:
    results_like_infix = client.search(
        collection_name=collection_name,
        data=[query_vector],
        anns_field="embedding",
        search_params=search_params_hnsw,
        limit=TOP_K,
        filter=expr_like_infix,  # infix pattern: names containing "Green Tea"
        output_fields=OUTPUT_FIELDS
    )
    print_milvus_results(results_like_infix, f"Filtering with expr: {expr_like_infix}")
except Exception as e:
    print(f"Error during infix 'LIKE' query: {e}")


Attempting LIKE query (ensure Milvus version >= 2.4): product_name like "Smartwatch%"

--- Filtering with expr: product_name like "Smartwatch%" ---
  ID: P001, Dist: 20.3946, Name: Smartwatch Series X, Cat: electronics

Attempting infix LIKE query: product_name like "%Green Tea%"

--- Filtering with expr: product_name like "%Green Tea%" ---
  ID: P002, Dist: 22.0829, Name: Organic Green Tea, Cat: groceries
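
To check which products a pattern should match before running it against Milvus, the `%` wildcard can be emulated locally. This is a rough sketch that handles only `%` (any run of characters) and assumes case-sensitive matching; `like` is a hypothetical helper, not part of pymilvus.

```python
import re

def like(value, pattern):
    """Emulate a LIKE pattern in which % matches any run of characters."""
    # Escape the literal pieces and join them with ".*" where each % appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

names = [
    "Smartwatch Series X", "Organic Green Tea", "Running Shoes Pro",
    "Wireless Headphones", "Advanced Yoga Mat", "Smartphone Model Z",
]

print([n for n in names if like(n, "Smartwatch%")])   # ['Smartwatch Series X']
print([n for n in names if like(n, "%Green Tea%")])   # ['Organic Green Tea']
print([n for n in names if like(n, "Smart%")])        # ['Smartwatch Series X', 'Smartphone Model Z']
```

The first two patterns reproduce the single hits returned by the searches above, while the shorter `Smart%` prefix would also admit the smartphone.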


## 8. Discussion
#
- **Schema Enforcement:** Milvus's schema-first approach is good for data integrity, which is foundational for reliable filtering.
- **`expr` Syntax:** The SQL-like `expr` offers powerful and familiar filtering. With version 2.4+, `LIKE` enhances string matching.
- **Scalar Indexing:** Essential for filter performance. Understanding types like Bitmap (for low-cardinality) vs. others (e.g., INVERTED for general text) is key, as discussed in our main guide.
- **Milvus Lite:** Greatly simplifies local development and testing of these features.
#
*Our comprehensive guide provides a deeper comparison of Milvus against other vector databases and its optimal use cases.*

In [None]:
# 9. Cleanup
print(f"\nCleaning up Milvus resources...")

# No explicit client.release_collection() call is made here; the collection is either dropped below or left in place when the client closes.

user_confirmation = input(f"Do you want to drop the collection '{collection_name}'? (yes/no): ")
if user_confirmation.lower() == 'yes':
    if client.has_collection(collection_name):
        print(f"Dropping collection '{collection_name}'...")
        client.drop_collection(collection_name)
        print("Collection dropped.")
else:
    print(f"Collection '{collection_name}' was not deleted.")

print("Closing Milvus client connection...")
# Close the MilvusClient (this properly cleans up resources)
client.close()

print("\nMilvus demo finished.")


Cleaning up Milvus resources...


## Conclusion
#
This notebook demonstrated Milvus's metadata filtering via `expr`, including the `LIKE` operator for string patterns (in Milvus >= 2.4).
Effective scalar indexing, and an understanding of index-specific behavior such as Bitmap limitations, are crucial for keeping filtered search fast at scale.
#
*Consult our main guide for comprehensive selection criteria and advanced feature discussions.*