# Metadata Filtering in Vector Search: Pinecone Demo

This notebook demonstrates how to implement metadata filtering with Pinecone,
as discussed in our guide, "Metadata Filtering in Vector Search: A Comprehensive Guide for Engineering Leaders".
We use a synthetic product dataset to illustrate several filtering techniques and
touch on newer features such as sparse-dense vectors.

**Key Concepts Covered:**
- Upserting vectors with JSON metadata.
- Filtering with a MongoDB-style query language.
- Combining vector search with metadata filters.
- Using namespaces for coarse-grained filtering.
- Conceptual demonstration of sparse-dense vectors for hybrid search.
- Important considerations for Pinecone serverless metadata.

*For a full discussion of these concepts and their business impact, please refer to our main guide.*

In [4]:
# 1. Setup
#!pip install pinecone numpy pandas

import os
import time
import numpy as np
import pandas as pd
import getpass
from pinecone import Pinecone, ServerlessSpec, PodSpec  # PodSpec is only needed for pod-based indexes

**IMPORTANT:**
The next cell prompts for your Pinecone API key.
You can get one from [https://app.pinecone.io/](https://app.pinecone.io/).

This demo focuses on a serverless index. Some features, such as single-query hybrid search with an `alpha` weighting, are more straightforward on pod-based indexes that use the `dotproduct` metric, while serverless setups often query separate sparse and dense indexes. We will illustrate the data structures used for sparse-dense vectors.

In [5]:
# Initialize Pinecone connection
try:
    api_key = getpass.getpass("Enter your Pinecone API key: ")  # Securely prompt for API key
    if not api_key:
        raise ValueError("No Pinecone API key was entered.")

    pc = Pinecone(api_key=api_key)
    print("Pinecone initialized successfully.")
except Exception as e:
    print(f"Error initializing Pinecone: {e}")
    print("Please ensure your API key is correct and you have internet access.")

Pinecone initialized successfully.


In [6]:
# 2. Data Preparation
# Synthetic product data
data = [
    {"id": "P001", "product_name": "Smartwatch Series X", "category": "electronics", "brand": "AlphaTech", "price": 299.99, "rating": 4.5, "in_stock": True, "release_year": 2023, "keywords_indices": [1, 5, 10], "keywords_values": [0.8, 0.7, 0.9]},
    {"id": "P002", "product_name": "Organic Green Tea", "category": "groceries", "brand": "NaturePure", "price": 15.50, "rating": 4.8, "in_stock": True, "release_year": 2022, "keywords_indices": [2, 7, 12], "keywords_values": [0.9, 0.6, 0.8]},
    {"id": "P003", "product_name": "Running Shoes Pro", "category": "apparel", "brand": "FitStride", "price": 120.00, "rating": 4.3, "in_stock": False, "release_year": 2023, "keywords_indices": [3, 6, 11], "keywords_values": [0.7, 0.8, 0.7]},
    {"id": "P004", "product_name": "Wireless Headphones", "category": "electronics", "brand": "AudioMax", "price": 199.50, "rating": 4.7, "in_stock": True, "release_year": 2022, "keywords_indices": [1, 8, 15], "keywords_values": [0.9, 0.9, 0.6]},
    {"id": "P005", "product_name": "Advanced Yoga Mat", "category": "sports", "brand": "ZenFlow", "price": 45.00, "rating": 4.9, "in_stock": True, "release_year": 2024, "keywords_indices": [4, 9, 13], "keywords_values": [0.6, 0.8, 0.9]},
    {"id": "P006", "product_name": "Smartphone Model Z", "category": "electronics", "brand": "AlphaTech", "price": 799.00, "rating": 4.2, "in_stock": True, "release_year": 2023, "keywords_indices": [1, 5, 16], "keywords_values": [0.8, 0.8, 0.7]},
]
df = pd.DataFrame(data)

# Generate mock dense embeddings (e.g., 128 dimensions)
vector_dim = 128
df['dense_vector'] = [np.random.rand(vector_dim).tolist() for _ in range(len(df))]

# Prepare sparse vectors from keywords_indices and keywords_values
df['sparse_vector_data'] = df.apply(lambda row: {"indices": row['keywords_indices'], "values": row['keywords_values']}, axis=1)

print(f"Prepared {len(df)} items with mock dense and sparse embeddings.")
print(df[['id', 'product_name', 'dense_vector', 'sparse_vector_data']].head(2))

Prepared 6 items with mock dense and sparse embeddings.
     id         product_name  \
0  P001  Smartwatch Series X   
1  P002    Organic Green Tea   

                                        dense_vector  \
0  [0.31560442051483995, 0.9741320292337782, 0.70...   
1  [0.4428882542546977, 0.9739330601386337, 0.629...   

                                  sparse_vector_data  
0  {'indices': [1, 5, 10], 'values': [0.8, 0.7, 0...  
1  {'indices': [2, 7, 12], 'values': [0.9, 0.6, 0...  


In [9]:
# 3. Index Creation
# We'll use a serverless index for this main demo.
# For sparse-dense queries, 'dotproduct' is required if doing direct hybrid query on a single index (typically pod-based).
# Serverless sparse-dense often involves separate sparse and dense indexes.
index_name_serverless = "product-catalog-serverless-demo"
cloud_provider = "aws"
region = "us-east-1" # choose a region
print(pc.list_indexes().names())  # names() must be called to get the list of index names
if index_name_serverless not in pc.list_indexes().names():
    print(f"Creating new serverless index: {index_name_serverless}")
    pc.create_index(
        name=index_name_serverless,
        dimension=vector_dim, # For dense vectors
        metric="cosine", # Common for dense vectors
        spec=ServerlessSpec(
            cloud=cloud_provider,
            region=region
        )
    )
    # Wait for index to be ready
    while not pc.describe_index(index_name_serverless).status['ready']:
        print(f"Waiting for serverless index '{index_name_serverless}' to be ready...")
        time.sleep(5)
else:
    print(f"Serverless index '{index_name_serverless}' already exists.")

index = pc.Index(index_name_serverless)
print(index.describe_index_stats())

['hybrid-furniture-search']
Creating new serverless index: product-catalog-serverless-demo
{'dimension': 128,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
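
The readiness loop in the cell above polls with no upper bound. A minimal sketch of a bounded wait is shown below. `wait_until` is a hypothetical helper written for this notebook, not part of the Pinecone SDK; it accepts any zero-argument callable, so it could be given `lambda: pc.describe_index(index_name_serverless).status['ready']`, the same check the cell already uses. The timeout and polling interval are illustrative defaults.

```python
import time

def wait_until(is_ready, timeout_s=120.0, poll_s=5.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met and False on timeout.
    `sleep` is injectable so the helper can be exercised without real delays.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Returning `False` instead of raising leaves it to the caller to decide whether a slow index is fatal or merely worth a warning.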


## 4. Upserting Data with Metadata (and Conceptual Sparse Vectors)

Pinecone accepts schemaless JSON metadata as flat key-value pairs.
For sparse-dense, you'd include `sparse_values` alongside `values` when upserting to a compatible index.
Our serverless index above holds dense vectors only, so we'll show how the data *would* be structured.
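
Before upserting, it can help to check that every metadata value has a type the index will accept. The sketch below assumes the supported value types are strings, numbers, booleans, and lists of strings, with no nested objects and no nulls; that matches our understanding of Pinecone's metadata rules, but please confirm against the current documentation. `find_invalid_metadata` is a hypothetical helper, not an SDK function.

```python
def find_invalid_metadata(metadata):
    """Return (key, reason) pairs for values outside the assumed supported types.

    Assumed supported: str, int/float, bool, and lists containing only strings.
    """
    problems = []
    for key, value in metadata.items():
        if isinstance(value, (str, bool, int, float)):
            continue
        if isinstance(value, list):
            if not all(isinstance(item, str) for item in value):
                problems.append((key, "lists must contain only strings"))
            continue
        problems.append((key, f"unsupported type {type(value).__name__}"))
    return problems


# A nested dict, a list of numbers, and a null are all flagged.
print(find_invalid_metadata({"dims": {"w": 1}, "scores": [1, 2], "note": None}))
```

NumPy scalars would also be flagged by this check, which is one reason the upsert cell casts values with `float(...)`, `bool(...)`, and `int(...)` when building each metadata dict.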

In [10]:
vectors_to_upsert = []
for _, row in df.iterrows():
    metadata = {
        "product_name": row["product_name"],
        "category": row["category"],
        "brand": row["brand"],
        "price": float(row["price"]),
        "rating": float(row["rating"]),
        "in_stock": bool(row["in_stock"]),
        "release_year": int(row["release_year"])
        # Note: For a true sparse-dense index, the sparse data isn't typically in metadata
        # but passed directly as `sparse_values` in the upsert operation.
    }
    # For a dense-only index (like our serverless example here):
    vectors_to_upsert.append({
        "id": row["id"],
        "values": row["dense_vector"],
        "metadata": metadata
    })
    # If this were a pod-based index supporting sparse-dense vectors directly:
    # vectors_to_upsert_hybrid.append({
    #     "id": row["id"],
    #     "values": row["dense_vector"],
    #     "sparse_values": row["sparse_vector_data"], # Correct field name is sparse_values
    #     "metadata": metadata
    # })


# Upsert to our serverless (dense) index
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i+batch_size]
    index.upsert(vectors=batch)
    print(f"Upserted batch {i//batch_size + 1} to '{index_name_serverless}'")

print("Waiting for vectors to be indexed...")
time.sleep(10)
print(index.describe_index_stats())

Upserted batch 1 to 'product-catalog-serverless-demo'
Waiting for vectors to be indexed...
{'dimension': 128,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 6}},
 'total_vector_count': 6,
 'vector_type': 'dense'}


### Serverless Metadata Considerations (Important Note)

As discussed in our main guide ("*Important Note on Serverless Metadata*"), Pinecone serverless indexes have specific behaviors regarding metadata:
1.  They **do not use** the `metadata_config` parameter (which pod-based indexes can use for selective indexing of metadata fields).
2.  **Large string fields in metadata that are not explicitly whitelisted by Pinecone as filterable might not be usable in filters, or filtering on them could be slow.** This is a crucial consideration if your application relies on filtering over extensive textual metadata. Always test with representative data.

The metadata fields in this demo are small and use common types, so this is generally not an issue here.
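
One way to "test with representative data" is to measure metadata before it is upserted. The sketch below approximates the serialized size of a record's metadata and lists unusually large string fields. The 40 KB per-record ceiling and the 1 KB per-field budget are assumptions made for illustration (check the limits that apply to your index), and both helper names are hypothetical.

```python
import json

ASSUMED_METADATA_LIMIT_BYTES = 40 * 1024  # assumption: verify against current Pinecone limits

def metadata_size_bytes(metadata):
    """Approximate the size of a metadata dict as compact UTF-8 encoded JSON."""
    encoded = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))

def oversized_fields(metadata, per_field_budget=1024):
    """List (key, size_in_bytes) for string fields above the budget, largest first."""
    sizes = [(k, len(v.encode("utf-8"))) for k, v in metadata.items() if isinstance(v, str)]
    return sorted([s for s in sizes if s[1] > per_field_budget], key=lambda s: -s[1])


sample = {"category": "electronics", "description": "x" * 5000}
print(metadata_size_bytes(sample) < ASSUMED_METADATA_LIMIT_BYTES)
print(oversized_fields(sample))
```

Fields flagged this way are candidates for storing outside the index, keeping only a short key or summary in metadata.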

## 5. Metadata Filtering Examples (on Dense Index)

We will now perform vector searches combined with various metadata filters on our serverless (dense) index.
We'll use a random query vector for demonstration, so the similarity scores carry no meaning; only the effect of each filter does.

In [11]:
query_dense_vector = np.random.rand(vector_dim).tolist()
TOP_K = 3 # Reduced for brevity

In [12]:
# ### Example 5.1: Exact Match on `category`
print("\n--- Filtering for 'electronics' category ---")
results = index.query(
    vector=query_dense_vector,
    top_k=TOP_K,
    filter={"category": {"$eq": "electronics"}},
    include_metadata=True
)
for match in results['matches']:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}, Metadata: {match['metadata']['product_name']}, Cat: {match['metadata']['category']}")

# ### Example 5.2: Range Query on `price`
print("\n--- Filtering for price less than $100 (Example 5.2) ---")
results = index.query(
    vector=query_dense_vector,
    top_k=TOP_K,
    filter={"price": {"$lt": 100.00}},
    include_metadata=True
)
for match in results['matches']:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}, Metadata: {match['metadata']['product_name']}, Price: {match['metadata']['price']}")


--- Filtering for 'electronics' category ---
ID: P006, Score: 0.7832, Metadata: Smartphone Model Z, Cat: electronics
ID: P001, Score: 0.7711, Metadata: Smartwatch Series X, Cat: electronics
ID: P004, Score: 0.7361, Metadata: Wireless Headphones, Cat: electronics

--- Filtering for price less than $100 (Example 5.2) ---
ID: P002, Score: 0.7636, Metadata: Organic Green Tea, Price: 15.5
ID: P005, Score: 0.7276, Metadata: Advanced Yoga Mat, Price: 45.0
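
The remaining filter shapes (boolean match, combined conditions, list membership, and OR logic) follow the same pattern. The sketch below writes them as plain dicts and checks them locally against a subset of the synthetic catalog. The `matches` function is a small local approximation written for this notebook; it is not how Pinecone evaluates filters, and it covers only the operators listed in `_OPS` plus `$and` and `$or`.

```python
import operator

# Subset of the notebook's synthetic catalog, reduced to the fields used below.
PRODUCTS = [
    {"id": "P001", "category": "electronics", "brand": "AlphaTech", "price": 299.99, "rating": 4.5, "in_stock": True},
    {"id": "P002", "category": "groceries", "brand": "NaturePure", "price": 15.50, "rating": 4.8, "in_stock": True},
    {"id": "P003", "category": "apparel", "brand": "FitStride", "price": 120.00, "rating": 4.3, "in_stock": False},
    {"id": "P004", "category": "electronics", "brand": "AudioMax", "price": 199.50, "rating": 4.7, "in_stock": True},
    {"id": "P005", "category": "sports", "brand": "ZenFlow", "price": 45.00, "rating": 4.9, "in_stock": True},
    {"id": "P006", "category": "electronics", "brand": "AlphaTech", "price": 799.00, "rating": 4.2, "in_stock": True},
]

FILTERS = {
    "5.3 boolean": {"in_stock": {"$eq": True}},
    "5.4 combined": {"$and": [
        {"category": {"$eq": "electronics"}},
        {"price": {"$lt": 300}},
        {"rating": {"$gte": 4.5}},
    ]},
    "5.5 list membership": {"brand": {"$in": ["AlphaTech", "ZenFlow"]}},
    "5.6 OR": {"$or": [
        {"category": {"$eq": "groceries"}},
        {"price": {"$gt": 500}},
    ]},
}

_OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, flt):
    """Evaluate a MongoDB-style filter dict against one metadata dict (local approximation)."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            # Simplification: a missing field never matches in this local check.
            if key not in metadata:
                return False
            for op, operand in cond.items():
                if not _OPS[op](metadata[key], operand):
                    return False
    return True


for name, flt in FILTERS.items():
    print(name, [p["id"] for p in PRODUCTS if matches(p, flt)])
```

Each dict in `FILTERS` can be passed as the `filter=` argument to `index.query(...)` in the same way as Example 5.1. Checking filters locally first is a cheap way to confirm that a filter selects the records you expect before spending queries on it.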


## 6. Conceptual Demonstration: Sparse-Dense Hybrid Search

As mentioned in our article, Pinecone supports sparse-dense hybrid search, which combines semantic (dense) and keyword (sparse) relevance. A single-index hybrid query requires an index that uses the `dotproduct` metric; this works most directly on pod-based indexes, while serverless setups often use separate sparse and dense indexes.

Let's illustrate the components of a hybrid query.

In [13]:
# Mock query components for hybrid search
query_dense_hybrid = np.random.rand(vector_dim).tolist()
query_sparse_hybrid = {"indices": [1, 10, 15], "values": [0.9, 0.8, 0.7]} # Example query sparse vector

print(f"Dense query component (sample): {query_dense_hybrid[:5]}")
print(f"Sparse query component: {query_sparse_hybrid}")

Dense query component (sample): [0.40056334342792965, 0.584911631501071, 0.43694316355124185, 0.6782474814390708, 0.19054042958830109]
Sparse query component: {'indices': [1, 10, 15], 'values': [0.9, 0.8, 0.7]}


**Querying a Hybrid Index (Conceptual for Pod-Based with `dotproduct`):**
If you had a pod-based index `hybrid_index` supporting sparse-dense vectors directly:
```python
hybrid_index = pc.Index("my-hybrid-pod-index")  # Assuming it exists and uses dotproduct
response_hybrid = hybrid_index.query(
    vector=query_dense_hybrid,
    sparse_vector=query_sparse_hybrid,
    top_k=TOP_K,
    include_metadata=True,
)
# Pinecone combines the dense and sparse scores. For explicit weighting (alpha), you'd typically
# apply it client-side to the dense and sparse query vectors before sending, or manage results
# from separate queries.
for match in response_hybrid['matches']:
    print(f"Hybrid Match ID: {match['id']}, Score: {match['score']:.4f}, Metadata: {match['metadata']}")
```
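
The client-side weighting mentioned in the comment can be sketched as a convex combination: scale the dense query by `alpha` and the sparse query by `1 - alpha` before sending. This assumes the hybrid score is the sum of the dense and sparse dot products, in which case scaling the query vectors scales each contribution linearly. `scale_hybrid_query` is a hypothetical helper name chosen for this notebook.

```python
def scale_hybrid_query(dense, sparse, alpha):
    """Weight a dense/sparse query pair: alpha=1.0 is dense-only, alpha=0.0 is sparse-only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),  # indices are unchanged, only weights are scaled
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


dense_q, sparse_q = scale_hybrid_query(
    [2.0, 4.0], {"indices": [1, 10], "values": [2.0, 6.0]}, alpha=0.5
)
print(dense_q, sparse_q)
```

The scaled pair would then be passed as `vector=` and `sparse_vector=` in the query shown above.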

**Hybrid Search with Serverless (Often Separate Queries):**
For serverless, as our article notes, hybrid search often involves querying a dense index and a sparse index separately, then combining results in your application. This gives more control over the `alpha` weighting.

```python
# Assuming `dense_index` (like our `index` variable) and a separate `sparse_index` already exist
dense_results = dense_index.query(vector=query_dense_hybrid, top_k=10, include_metadata=True)
sparse_results = sparse_index.query(sparse_vector=query_sparse_hybrid, top_k=10, include_metadata=True)

# --- Application-level result fusion would happen here (e.g., RRF or weighted sum) ---
alpha = 0.75  # Example weight for dense
# combine_sparse_dense_results is a placeholder for your own fusion function
final_results = combine_sparse_dense_results(dense_results, sparse_results, alpha)
```
This conceptual section highlights the data and query structures involved. For full implementation details, refer to the latest Pinecone documentation on hybrid search for your specific index type.
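
As one possible shape for the fusion step, the sketch below implements `combine_sparse_dense_results` as a weighted sum of min-max normalized scores. This is an illustrative choice, not a method prescribed by Pinecone or by our guide; reciprocal rank fusion would be an equally reasonable alternative. It assumes each result set exposes a `matches` list whose entries have `id` and `score` keys, as the query responses in this notebook do, and it treats an ID missing from one result set as scoring 0 there.

```python
def _normalized_scores(results):
    """Map id -> min-max normalized score in [0, 1] for one result set."""
    matches = results["matches"]
    if not matches:
        return {}
    scores = [m["score"] for m in matches]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is equal, treat all matches as equally (fully) relevant.
    return {m["id"]: ((m["score"] - lo) / span if span else 1.0) for m in matches}

def combine_sparse_dense_results(dense_results, sparse_results, alpha=0.75):
    """Fuse two result sets; returns (id, fused_score) pairs, best first."""
    dense = _normalized_scores(dense_results)
    sparse = _normalized_scores(sparse_results)
    fused = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1.0 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }
    # Sort by score descending, then by id so ties are ordered deterministically.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Toy result sets with made-up scores, for illustration only.
dense_demo = {"matches": [{"id": "A", "score": 4.0}, {"id": "B", "score": 2.0}, {"id": "C", "score": 0.0}]}
sparse_demo = {"matches": [{"id": "B", "score": 10.0}, {"id": "D", "score": 0.0}]}
print(combine_sparse_dense_results(dense_demo, sparse_demo, alpha=0.5))
```

Normalizing each list first matters because dense and sparse scores usually live on different scales; without it, whichever index produces larger raw numbers would dominate regardless of `alpha`.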

## 7. Namespaces
Namespaces act as a coarse-grained filter, partitioning your index: a query only searches the namespace it targets.
We reuse the serverless dense index (`index`) for this demo.

In [15]:
namespace_name = "user_specific_data"
index.upsert(
    vectors=[{
        "id": "P_NS001",
        "values": np.random.rand(vector_dim).tolist(), # Dense vector
        "metadata": {"product_name": "Special Edition Watch", "category": "luxury", "price": 1500.00}
    }],
    namespace=namespace_name
)
time.sleep(5)
print(f"Stats after upserting to namespace '{namespace_name}': {index.describe_index_stats()}")

# Querying within the namespace
print(f"\n--- Querying within namespace '{namespace_name}' for 'luxury' ---")
results_ns = index.query(
    vector=query_dense_vector, # Querying with a dense vector
    top_k=TOP_K,
    filter={"category": {"$eq": "luxury"}},
    namespace=namespace_name,
    include_metadata=True
)
for match in results_ns['matches']:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}, Metadata: {match['metadata']['product_name']}")

Stats after upserting to namespace 'user_specific_data': {'dimension': 128,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 6},
                'user_specific_data': {'vector_count': 1}},
 'total_vector_count': 7,
 'vector_type': 'dense'}

--- Querying within namespace 'user_specific_data' for 'luxury' ---
ID: P_NS001, Score: 0.7512, Metadata: Special Edition Watch
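
When the partition key already lives in each record (a tenant or user ID, for example), records can be grouped by namespace before upserting. The sketch below is a minimal illustration; `group_by_namespace` is a hypothetical helper, and using `category` as the partition key is only for demonstration.

```python
from collections import defaultdict

def group_by_namespace(records, key, default_namespace=""):
    """Group upsert records by the value of metadata[key]; returns {namespace: [records]}."""
    groups = defaultdict(list)
    for record in records:
        namespace = record.get("metadata", {}).get(key, default_namespace)
        groups[str(namespace)].append(record)
    return dict(groups)


# Toy records with 2-dimensional vectors, for illustration only.
demo_records = [
    {"id": "P001", "values": [0.1, 0.2], "metadata": {"category": "electronics"}},
    {"id": "P002", "values": [0.3, 0.4], "metadata": {"category": "groceries"}},
    {"id": "P004", "values": [0.5, 0.6], "metadata": {"category": "electronics"}},
]
for name, batch in group_by_namespace(demo_records, "category").items():
    print(name, [r["id"] for r in batch])
```

Each group could then be written with `index.upsert(vectors=batch, namespace=name)`, the same call used in the cell above.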


## 8. Discussion

- **Ease of Use & Schema Flexibility:** Pinecone is simple to use for basic dense vector search with schemaless JSON metadata.
- **Hybrid Search:** Pinecone is evolving its hybrid search capabilities. The approach can differ between serverless and pod-based indexes, with serverless often favoring separate sparse and dense index queries for more flexible `alpha` blending.
- **Serverless Considerations:** Be mindful of metadata limitations on serverless, especially for large, un-whitelisted string fields.
- **Filter Expressiveness:** A good range of MongoDB-style operators covers common use cases.

*Refer to our main guide for a detailed comparison of Pinecone's pros, cons, and suitability for different organization sizes.*

In [16]:
# 9. Cleanup
user_confirmation = input(f"Do you want to delete the serverless index '{index_name_serverless}'? (yes/no): ")
if user_confirmation.lower() == 'yes':
    if index_name_serverless in pc.list_indexes().names():
        print(f"Deleting index '{index_name_serverless}'...")
        pc.delete_index(index_name_serverless)
        print("Index deleted.")
    else:
        print(f"Index '{index_name_serverless}' not found for deletion.")
else:
    print(f"Index '{index_name_serverless}' was not deleted.")

Index 'product-catalog-serverless-demo' was not deleted.


## Conclusion

This notebook demonstrated Pinecone's core metadata filtering and touched on its hybrid search capabilities and serverless-specific considerations.
Pinecone's managed nature makes it attractive, but understanding the nuances of its index types and features is key to using it well.

*Our comprehensive guide provides further insights into selecting the right vector database and optimizing your metadata filtering strategy.*