# Metadata Filtering in Vector Search: Weaviate Demo
#
This notebook demonstrates how to implement metadata filtering with Weaviate,
building upon the concepts from the "Metadata Filtering in Vector Search: A Comprehensive Guide for Engineering Leaders".
We will use the same synthetic product dataset to illustrate various filtering techniques
and explore Weaviate's keyword search (BM25) capabilities alongside metadata filters.
#
**Key Concepts Covered:**
- Defining a collection schema with typed properties.
- Upserting data objects with vectors and metadata in batches.
- Filtering with the v4 client's `Filter` class (the equivalent of the GraphQL `where` filter).
- Performing BM25 keyword searches and combining them with `where`-style filters.
- Combining vector search (`near_vector`) with metadata filters.
- Notes on advanced index tuning and performance optimizations in recent Weaviate versions.
#
*For a detailed discussion of Weaviate's features, including its GraphQL interface and hybrid search modules, please refer to our main guide.*

In [1]:
# 1. Setup
# !pip install -U "weaviate-client>=4" numpy pandas
# This notebook uses the v4 Python client API (collections, `weaviate.classes`),
# so install a recent 4.x release of weaviate-client.
# The v3 client (weaviate-client~=3.26) has a different API style
# (e.g., `with_where`, `with_near_vector`); if you are on v3, adapt the client calls.

import weaviate
import numpy as np
import pandas as pd
import json # For printing results nicely
import time
import uuid # For generating UUIDs if needed

## IMPORTANT: Starting Weaviate
#
Before running this notebook, you need a running Weaviate instance. You can set up a free [Weaviate Cloud Cluster](https://weaviate.io/developers/weaviate/quickstart),
which is what the connection cell below assumes. Alternatively, run Weaviate locally with Docker. We recommend version 1.25 or newer
for the filter performance optimizations mentioned in our guide.
#
```bash
docker run -d --name weaviate-demo-v125 \
    -p 8080:8080 \
    -p 50051:50051 \
    -e "AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true" \
    -e "PERSISTENCE_DATA_PATH=/var/lib/weaviate" \
    -e "DEFAULT_VECTORIZER_MODULE=none" \
    -e "ENABLE_MODULES=" \
    semitechnologies/weaviate:1.25.3
```
This command starts Weaviate `1.25.3` with anonymous access and no default vectorizer.
If you have an existing Weaviate instance, ensure it's accessible and preferably >= v1.25.
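
The connection cell below targets Weaviate Cloud. If you started the Docker container above instead, a local connection could look like the following minimal sketch. It assumes the v4 Python client and the port mappings from the `docker run` command (8080 for HTTP, 50051 for gRPC); `connect_local_demo` is a hypothetical helper name, not part of the client.

```python
def connect_local_demo(host="localhost", port=8080, grpc_port=50051):
    """Return a v4 client connected to a locally running Weaviate instance."""
    # Imported lazily so the helper can be defined before the client is installed
    import weaviate

    # connect_to_local is the v4 client's helper for self-hosted instances
    return weaviate.connect_to_local(host=host, port=port, grpc_port=grpc_port)


# Usage (requires the container to be running):
# client = connect_local_demo()
# print(client.is_ready())
```

With this helper, the rest of the notebook should work unchanged, since it only relies on the `client` object.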

In [2]:
# Initialize Weaviate Client
from weaviate.classes.init import Auth
import os
import getpass

# Get credentials securely
weaviate_url = os.environ.get("WEAVIATE_URL")
if not weaviate_url:
    weaviate_url = input("Enter your Weaviate Cloud URL: ")
    
weaviate_api_key = os.environ.get("WEAVIATE_API_KEY")
if not weaviate_api_key:
    weaviate_api_key = getpass.getpass("Enter your Weaviate API key: ")

try:
    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=weaviate_url,
        auth_credentials=Auth.api_key(weaviate_api_key),
    )

    if client.is_ready():
        print(f"Weaviate client connected successfully to {weaviate_url}.")
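        server_version = client.get_meta()['version']
        print(f"Weaviate server version: {server_version}")
        # Compare numerically: plain string comparison would rank "1.9" above "1.25"
        major, minor = (int(part) for part in server_version.split(".")[:2])
        if (major, minor) < (1, 25):
            print(f"WARNING: Your Weaviate version {server_version} is older than 1.25. Consider upgrading for optimal filter performance as discussed in our guide.")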
    else:
        print(f"Weaviate client not ready. Ensure Weaviate is running at {weaviate_url}.")
except Exception as e:
    print(f"Error connecting to Weaviate: {e}")
    print("Please ensure Weaviate is running and the credentials are correct.")

Weaviate client connected successfully to 09wgxvzmt5mub25le5yb5w.c0.us-west3.gcp.weaviate.cloud.
Weaviate server version: 1.30.1


In [3]:
# 2. Data Preparation
# Using the same synthetic product data
data = [
    {"id": "P001", "product_name": "Smartwatch Series X", "description": "Latest smartwatch with advanced health tracking and long battery life.", "category": "electronics", "brand": "AlphaTech", "price": 299.99, "rating": 4.5, "in_stock": True, "release_year": 2023},
    {"id": "P002", "product_name": "Organic Green Tea", "description": "Premium organic green tea, rich in antioxidants. Soothing and refreshing.", "category": "groceries", "brand": "NaturePure", "price": 15.50, "rating": 4.8, "in_stock": True, "release_year": 2022},
    {"id": "P003", "product_name": "Running Shoes Pro", "description": "Professional running shoes for marathon runners. Excellent cushioning.", "category": "apparel", "brand": "FitStride", "price": 120.00, "rating": 4.3, "in_stock": False, "release_year": 2023},
    {"id": "P004", "product_name": "Wireless Noise-Cancelling Headphones", "description": "Immersive sound experience with these wireless noise-cancelling headphones.", "category": "electronics", "brand": "AudioMax", "price": 199.50, "rating": 4.7, "in_stock": True, "release_year": 2022},
    {"id": "P005", "product_name": "Advanced Yoga Mat", "description": "Non-slip advanced yoga mat for all types of yoga practice. Eco-friendly.", "category": "sports", "brand": "ZenFlow", "price": 45.00, "rating": 4.9, "in_stock": True, "release_year": 2024},
    {"id": "P006", "product_name": "Smartphone Model Z", "description": "Flagship smartphone with stunning display and pro-grade camera system.", "category": "electronics", "brand": "AlphaTech", "price": 799.00, "rating": 4.2, "in_stock": True, "release_year": 2023},
]
df = pd.DataFrame(data)

vector_dim = 128
df['vector'] = [np.random.rand(vector_dim).tolist() for _ in range(len(df))]
df.head(2)

| | id | product_name | description | category | brand | price | rating | in_stock | release_year | vector |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P001 | Smartwatch Series X | Latest smartwatch with advanced health trackin... | electronics | AlphaTech | 299.99 | 4.5 | True | 2023 | [0.15409087681554223, 0.04027049403866623, 0.0... |
| 1 | P002 | Organic Green Tea | Premium organic green tea, rich in antioxidant... | groceries | NaturePure | 15.5 | 4.8 | True | 2022 | [0.9784289190080834, 0.6745252668040118, 0.704... |


In [8]:
# 3. Schema Definition (Collection Creation)
import weaviate.classes as wvc
from weaviate.classes.config import Property, DataType, Configure

class_name = "ProductDemoWeaviate"

# Delete collection if it exists from a previous run
if client.collections.exists(class_name):
    print(f"Collection '{class_name}' already exists. Deleting it.")
    client.collections.delete(class_name)
    time.sleep(1)

# Create collection with properties
collection = client.collections.create(
    name=class_name,
    description="A collection of products with metadata for demo",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    properties=[
        wvc.config.Property(name="product_id_prop", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="product_name", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.WORD),
        wvc.config.Property(name="description", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.WORD),
        wvc.config.Property(name="category", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="brand", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="price", data_type=wvc.config.DataType.NUMBER),
        wvc.config.Property(name="rating", data_type=wvc.config.DataType.NUMBER),
        wvc.config.Property(name="in_stock", data_type=wvc.config.DataType.BOOL),
        wvc.config.Property(name="release_year", data_type=wvc.config.DataType.INT)
    ],
    vector_index_config=wvc.config.Configure.VectorIndex.hnsw(distance_metric=wvc.config.VectorDistances.COSINE)
)
print(f"Collection '{class_name}' created successfully.")



Collection 'ProductDemoWeaviate' created successfully.


### Advanced Index Tuning & Performance (Note)
#
Our main guide discusses Weaviate's capabilities for tuning its inverted index and BM25 search, including:
- **`invertedIndexConfig`**: For managing stopwords (language presets like "en", custom additions/removals).
- **BM25 Parameters**: `k1` (term frequency saturation, default: 1.2) and `b` (document length normalization, default: 0.75) can be customized in the class schema to fine-tune keyword search relevance.
#
```json
// Example snippet for class_obj to include invertedIndexConfig
"invertedIndexConfig": {
  "stopwords": {
    "preset": "en",
    "additions": ["extra", "words"],
    "removals": ["a", "the"]
  },
  "bm25": {
    "k1": 1.25,
    "b": 0.8
  }
}
```
Additionally, as highlighted in the article, **Weaviate versions 1.25 and newer offer significant filter performance improvements** due to inverted-index acceleration. It's highly recommended to use these versions for production workloads involving heavy filtering.
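
To build intuition for what `k1` and `b` control, here is a small sketch of the textbook Okapi BM25 score for a single query term. It is an illustration only: the function name and the fixed `idf` value are assumptions made for the example, and Weaviate's internal scoring may differ in its details.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    # b scales how strongly documents longer than average are penalized
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term stop adding score
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same term frequency, average-length vs. twice-average-length document
print(bm25_term_score(tf=2, doc_len=10, avg_doc_len=10))
print(bm25_term_score(tf=2, doc_len=20, avg_doc_len=10))
# With b=0, document length no longer affects the score
print(bm25_term_score(tf=2, doc_len=20, avg_doc_len=10, b=0.0))
```

Raising `k1` lets repeated terms keep adding score for longer before saturating, while raising `b` penalizes long descriptions more heavily.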

In [10]:
# 4. Upserting Data with Metadata
print("Upserting data objects...")

# Create a batch context (v4 client); dynamic() chooses the batch size automatically
with client.batch.dynamic() as batch:
    
    # Add objects to batch
    for _, row in df.iterrows():
        properties = {
            "product_id_prop": row["id"],
            "product_name": row["product_name"],
            "description": row["description"],
            "category": row["category"],
            "brand": row["brand"],
            "price": float(row["price"]),
            "rating": float(row["rating"]),
            "in_stock": bool(row["in_stock"]),
            "release_year": int(row["release_year"])
        }
        
        # Add object to batch with vector
        batch.add_object(
            collection=class_name,
            properties=properties,
            vector=row["vector"]
        )
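        # If you need deterministic UUIDs:
        # obj_uuid = weaviate.util.generate_uuid5(row["id"], class_name)  # not `uuid`, which would shadow the module
        # batch.add_object(..., uuid=obj_uuid)

# Batch errors are collected rather than raised, so check them explicitly
failed = client.batch.failed_objects
if failed:
    print(f"{len(failed)} of {len(df)} objects failed to import. First failure: {failed[0]}")
else:
    print(f"Successfully upserted {len(df)} data objects.")
time.sleep(2) # Give Weaviate a moment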

Upserting data objects...
Successfully upserted 6 data objects.
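
The commented-out lines in the upsert cell hint at deterministic UUIDs. The idea is to derive each object's UUID from a stable key (here the product ID), so that re-importing the same product addresses the same object instead of creating a new one. The sketch below shows the principle with Python's standard `uuid` module; `product_uuid` and `DEMO_NAMESPACE` are hypothetical names, and the client's own `generate_uuid5` helper may produce different values.

```python
import uuid

# Hypothetical namespace derived from the collection name used in this notebook
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ProductDemoWeaviate")


def product_uuid(product_id: str) -> str:
    """Derive a stable UUID string from a product ID (name-based, version 5)."""
    return str(uuid.uuid5(DEMO_NAMESPACE, product_id))


# The same product ID always maps to the same UUID
print(product_uuid("P001") == product_uuid("P001"))  # True
print(product_uuid("P001") == product_uuid("P002"))  # False
```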


## 5. Metadata Filtering Examples with Vector Search
#
We will perform vector searches (`collection.query.near_vector`) combined with `Filter` objects passed through the `filters` argument.

In [41]:
# Initialize your query_vector and TOP_K variables as before
query_vector = np.random.rand(vector_dim).tolist()
TOP_K = 3

def print_weaviate_results(result, query_desc=""):
    print(f"\n--- {query_desc} ---")
    if not result.objects:
        print("No results found.")
        return
    for obj in result.objects:
        print(f"  Name: {obj.properties.get('product_name', 'N/A')}, Category: {obj.properties.get('category', 'N/A')}, Price: {obj.properties.get('price', 'N/A')}")
        if hasattr(obj, 'metadata'):
            if hasattr(obj.metadata, 'score') and obj.metadata.score is not None:
                print(f"    Score/Distance: {obj.metadata.score}")
            else:
                print(f"    Score/Distance: {obj.metadata.distance}")


In [42]:
# ### Example 5.1: Exact Match on `category` with Vector Search
from weaviate.classes.query import Filter
from weaviate.classes.query import MetadataQuery

# Get the collection
collection = client.collections.get(class_name)

# Example 5.1: Exact Match on `category` with Vector Search
category_filter = Filter.by_property("category").equal("electronics")

result_cat_vec = collection.query.near_vector(
    near_vector=query_vector,
    limit=TOP_K,
    filters=category_filter,
    return_metadata=MetadataQuery(distance=True)
)

print_weaviate_results(result_cat_vec, "Vector Search: Filtering for 'electronics' category")

# Example 5.2: Range Query on `price` with Vector Search
price_filter = Filter.by_property("price").less_than(100.00)

result_price_lt_vec = collection.query.near_vector(
    near_vector=query_vector,
    limit=TOP_K,
    filters=price_filter,
    return_metadata=MetadataQuery(distance=True)
)

print_weaviate_results(result_price_lt_vec, "Vector Search: Filtering for price less than $100")



--- Vector Search: Filtering for 'electronics' category ---
  Name: Smartphone Model Z, Category: electronics, Price: 799.0
    Score/Distance: 0.2378298044204712
  Name: Wireless Noise-Cancelling Headphones, Category: electronics, Price: 199.5
    Score/Distance: 0.2784872055053711
  Name: Smartwatch Series X, Category: electronics, Price: 299.99
    Score/Distance: 0.28156906366348267

--- Vector Search: Filtering for price less than $100 ---
  Name: Organic Green Tea, Category: groceries, Price: 15.5
    Score/Distance: 0.242254376411438
  Name: Advanced Yoga Mat, Category: sports, Price: 45.0
    Score/Distance: 0.26119357347488403


## 6. BM25 Keyword Search with Metadata Filtering
#
As discussed in our main article, Weaviate supports BM25 keyword search, which can be combined with metadata filters.
This is powerful for hybrid search scenarios.

In [43]:
# ### Example 6.1: BM25 Search for "smartwatch" in 'product_name' or 'description'
bm25_query = "smartwatch"
result_bm25_only = collection.query.bm25(
    query=bm25_query,
    query_properties=["product_name", "description"],
    limit=TOP_K,
    return_metadata=MetadataQuery(score=True)
)

print_weaviate_results(result_bm25_only, f"BM25 Search for '{bm25_query}'")


--- BM25 Search for 'smartwatch' ---
  Name: Smartwatch Series X, Category: electronics, Price: 299.99
    Score/Distance: 1.471821904182434


In [44]:
# ### Example 6.2: BM25 Search for "headphones" COMBINED with metadata filter (category: "electronics")
bm25_query_headphones = "headphones"
category_filter = Filter.by_property("category").equal("electronics")

result_bm25_filtered = collection.query.bm25(
    query=bm25_query_headphones, 
    query_properties=["product_name", "description"],
    limit=TOP_K,
    filters=category_filter,
    return_metadata=MetadataQuery(score=True)
)

print_weaviate_results(result_bm25_filtered, f"BM25 Search for '{bm25_query_headphones}' AND category 'electronics'")


--- BM25 Search for 'headphones' AND category 'electronics' ---
  Name: Wireless Noise-Cancelling Headphones, Category: electronics, Price: 199.5
    Score/Distance: 1.4359540939331055


### Combining BM25, Vector Search, and Metadata Filters (Hybrid Search)
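Weaviate's hybrid query (`collection.query.hybrid` in the v4 client; not run in this basic demo for brevity)
combines keyword (BM25) and semantic (vector) search, and accepts the same `filters` argument as the other queries.
The `alpha` parameter weights the two: `alpha=0` is pure keyword search and `alpha=1` is pure vector search.
This enables sophisticated hybrid search strategies, as detailed in our main guide.
Example conceptual structure:
```python
response = collection.query.hybrid(
    query="food",
    vector=query_vector,
    alpha=0.25,
    filters=Filter.by_property("in_stock").equal(True),  # optional metadata filter
    limit=3,
)
```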
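
To make the role of `alpha` concrete, the following simplified sketch fuses keyword and vector scores by min-max normalizing each set and taking an `alpha`-weighted sum. It illustrates the weighting idea only and is not Weaviate's exact fusion algorithm; the function names and score values are made up, and vector scores are treated as similarities (higher is better) rather than distances.

```python
def min_max_normalize(scores):
    """Rescale a {id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse_scores(keyword_scores, vector_scores, alpha):
    """Weighted sum: alpha weights the vector side, (1 - alpha) the keyword side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    return {
        obj_id: alpha * vec.get(obj_id, 0.0) + (1 - alpha) * kw.get(obj_id, 0.0)
        for obj_id in set(kw) | set(vec)
    }


# Made-up scores for three products
keyword_scores = {"P001": 1.5, "P004": 0.5}
vector_scores = {"P001": 0.2, "P004": 0.9, "P006": 0.6}

for alpha in (0.25, 1.0):
    fused = fuse_scores(keyword_scores, vector_scores, alpha)
    best = max(fused, key=fused.get)
    print(f"alpha={alpha}: top result {best}")
```

With these made-up numbers, the keyword-heavy setting (`alpha=0.25`) ranks `P001` first, while the vector-only setting (`alpha=1.0`) ranks `P004` first.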

## 7. Discussion
#
- **Schema Enforcement:** Weaviate's schema ensures data consistency, vital for reliable filtering.
- **Rich Filtering & BM25:** The `where` filter is highly expressive. Combining it with BM25 keyword search and vector search provides powerful hybrid retrieval.
- **Performance:** Using Weaviate v1.25+ is recommended for filter performance benefits. Advanced tuning of `invertedIndexConfig` and BM25 parameters can further optimize search.
- **GraphQL:** Remember Weaviate also offers a rich GraphQL API for complex queries, which is discussed in our main guide.
#
*Our comprehensive guide offers more on Weaviate's pros, cons, and ideal use cases.*

In [None]:
# 8. Cleanup

if client.collections.exists(class_name):
    print(f"Deleting collection '{class_name}'...")
    client.collections.delete(class_name)
    print(f"Collection '{class_name}' deleted.")
else:
    print(f"Collection '{class_name}' not found for deletion.")


# Close the client connection
client.close()

## Conclusion
#
This notebook demonstrated Weaviate's metadata filtering using `where`-style filters,
and showed how they can be combined with both vector search and BM25 keyword search.
Weaviate's schema-first approach and rich query capabilities make it suitable for diverse applications.
#
*Refer to our main guide for selecting the best vector database for your specific needs and for more details on Weaviate's advanced features.*