### Query Workload Construction

To evaluate our baseline and hybrid retrieval methods under realistic and diverse conditions, we construct a structured query workload that spans a wide range of metadata selectivities. Each query consists of two components: (1) a query track ID, stored as the track's row index in the dataset and used for vector similarity, and (2) a metadata filter specifying structured constraints such as year ranges, explicitness, tempo ranges, or categorical values.

The workload intentionally includes high-selectivity, low-selectivity, and mixed-selectivity cases. High-selectivity queries match only a small subset of tracks (e.g., a narrow year range combined with an explicitness flag), which tests scenarios where filtering early should be advantageous. Low-selectivity queries use broad numeric ranges or minimal filtering, highlighting conditions where post-filtering or approximate methods may be more efficient. Mixed-selectivity queries combine narrow and broad constraints, testing whether the retrieval strategy prioritizes the more restrictive filters.
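To make "selectivity" concrete, the sketch below measures the fraction of tracks that satisfy a filter. It is an illustration only, not part of the notebook's pipeline: the helper names `matches` and `selectivity` are hypothetical, and it assumes that a two-element list denotes an inclusive `[low, high]` range, that a scalar denotes an equality test, and that a `"none"` key means "no filtering".

```python
from typing import Any, Dict, Iterable


def matches(track: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if a track record satisfies every constraint in `filters`."""
    for field, constraint in filters.items():
        if field == "none":
            # Assumed sentinel for "no metadata filtering": nothing to check.
            continue
        value = track[field]
        if isinstance(constraint, (list, tuple)):
            low, high = constraint  # assumed inclusive on both ends
            if not (low <= value <= high):
                return False
        elif value != constraint:
            return False
    return True


def selectivity(tracks: Iterable[Dict[str, Any]], filters: Dict[str, Any]) -> float:
    """Fraction of tracks that pass the filter (1.0 means nothing is filtered out)."""
    tracks = list(tracks)
    if not tracks:
        return 0.0
    return sum(matches(t, filters) for t in tracks) / len(tracks)


# Toy records standing in for dataset rows.
toy_tracks = [
    {"year": 1999, "explicit": 0},
    {"year": 2000, "explicit": 1},
    {"year": 2001, "explicit": 0},
    {"year": 2015, "explicit": 0},
]
narrow = selectivity(toy_tracks, {"year": [1999, 2001], "explicit": 0})
unfiltered = selectivity(toy_tracks, {"none": True})
```

On the real dataset the same functions could be fed `spotify.to_dict("records")`, although a vectorized boolean mask would likely be faster for large tables.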

By generating 150 queries (50 per selectivity class) covering multiple metadata dimensions, we aim to make evaluations of pre-filter, post-filter, and hybrid methods consistent, repeatable, and reflective of plausible usage patterns. The workload is saved once to `data/query_workload.json` and shared across all experiments, enabling fair comparison of latency, recall, and selectivity-dependent performance.
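As a rough illustration of why selectivity matters for the methods being compared, the sketch below contrasts a brute-force pre-filter search with a brute-force post-filter search. It is not the implementation evaluated in this project: the function names, the `overfetch` parameter, and the use of exact cosine similarity are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pre_filter_search(vectors, keep: List[bool], query, k: int) -> List[int]:
    """Apply the metadata filter first, then rank only the surviving rows."""
    candidates = [i for i in range(len(vectors)) if keep[i]]
    candidates.sort(key=lambda i: cosine(vectors[i], query), reverse=True)
    return candidates[:k]


def post_filter_search(vectors, keep: List[bool], query, k: int, overfetch: int = 2) -> List[int]:
    """Rank all rows, take the top k * overfetch, then drop rows failing the filter."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(vectors[i], query), reverse=True)
    return [i for i in ranked[: k * overfetch] if keep[i]][:k]


# Toy case: only the least similar row passes a very restrictive filter.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
keep = [False, False, False, True]
query = [1.0, 0.0]

pre_hits = pre_filter_search(vectors, keep, query, k=1)
post_hits = post_filter_search(vectors, keep, query, k=1, overfetch=2)
```

In this toy case the pre-filter search returns the single matching row, while the post-filter search returns nothing because no matching row falls within the over-fetched candidates. With a permissive filter the two approaches would agree, which is the regime the low-selectivity queries are meant to probe.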


In [1]:
import json
import random
import numpy as np
import pandas as pd

In [2]:
# Load dataset
spotify = pd.read_parquet("data/spotify_clean.parquet")
N = len(spotify)

In [3]:
# -------------------------------------------------------------
# Helper functions to generate filter ranges
# -------------------------------------------------------------

def narrow_year_range():
    """2–3 year window: high-selectivity."""
    start = random.randint(1960, 2018)
    return [start, min(start + random.randint(1, 2), 2020)]

def broad_year_range():
    """Span of 5, 10, or 20 years, capped at 2020 (so it may end up shorter)."""
    start = random.randint(1960, 2010)
    return [start, min(start + random.choice([5, 10, 20]), 2020)]

def tempo_range():
    """Typical broad tempo range (80–150 BPM)."""
    return [80, 150]

def duration_range():
    """3–5 minutes."""
    return [180000, 300000]  # ms

def danceability_range():
    """Common mixed query constraint."""
    return [0.6, 0.8]

def energy_range():
    """Another mixed-query example."""
    return [0.7, 0.9]

In [4]:
# -------------------------------------------------------------
# Build query workload with variety
# -------------------------------------------------------------

workload = []

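# Seed the generator so the workload can be regenerated exactly
# (42 is an arbitrary choice, not a value from the original run)
random.seed(42)

# Select 150 distinct query tracks by row index (requires N >= 150)
query_ids = random.sample(range(N), 150)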

for i, qid in enumerate(query_ids):

    if i < 50:
        # ---------------------------------------------------------
        # 1. HIGH-SELECTIVITY QUERIES (Narrow Filters)
        # ---------------------------------------------------------
        filters = {
            "year": narrow_year_range(),
            "explicit": random.choice([0, 1])
        }
        # occasional narrow key filter
        if random.random() < 0.3:
            filters["key"] = random.randint(0, 11)

    elif i < 100:
        # ---------------------------------------------------------
        # 2. LOW-SELECTIVITY QUERIES (Broad Filters)
        # ---------------------------------------------------------
        filters = {}

        if random.random() < 0.6:
            filters["tempo"] = tempo_range()

        if random.random() < 0.5:
            filters["popularity"] = [0.4, 1.0]

        if random.random() < 0.4:
            filters["year"] = broad_year_range()

        # If no filter was drawn, store a "none" sentinel meaning "no metadata
        # filtering"; downstream code must not treat it as a column name.
        if len(filters) == 0:
            filters["none"] = True

    else:
        # ---------------------------------------------------------
        # 3. MIXED QUERIES (Narrow + Broad Filters)
        # ---------------------------------------------------------
        filters = {}

        # narrow metadata constraints
        if random.random() < 0.5:
            filters["explicit"] = random.choice([0, 1])
        if random.random() < 0.3:
            filters["key"] = random.randint(0, 11)

        # broad numerical ranges
        if random.random() < 0.7:
            filters["duration_ms"] = duration_range()
        if random.random() < 0.5:
            filters["danceability"] = danceability_range()
        if random.random() < 0.3:
            filters["energy"] = energy_range()

        # fallback
        if len(filters) == 0:
            filters["year"] = broad_year_range()

    # Save final query record
    workload.append({
        "query_id": qid,
        "filters": filters
    })

In [6]:
# -------------------------------------------------------------
# Save workload
# -------------------------------------------------------------

with open("data/query_workload.json", "w") as f:
    json.dump(workload, f, indent=4)

print("Generated 150-query workload -> data/query_workload.json")

Generated 150-query workload -> data/query_workload.json
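
The saved records do not carry an explicit class label; the class is implied by position (records 0 to 49 are high-selectivity, 50 to 99 low-selectivity, 100 to 149 mixed). A minimal sketch of how the file might be sanity-checked after loading is shown below. The helpers `selectivity_class` and `summarize_workload` are hypothetical names introduced for this example, and the toy records stand in for the real file contents.

```python
import json
from collections import Counter


def selectivity_class(position: int) -> str:
    """Map a record's position in the workload to its class (50 queries each)."""
    if position < 50:
        return "high"
    if position < 100:
        return "low"
    return "mixed"


def summarize_workload(records):
    """Count filter fields per class, checking each record's structure on the way."""
    counts = {"high": Counter(), "low": Counter(), "mixed": Counter()}
    for position, record in enumerate(records):
        if set(record) != {"query_id", "filters"}:
            raise ValueError(f"record {position} has unexpected keys: {sorted(record)}")
        counts[selectivity_class(position)].update(record["filters"].keys())
    return counts


# Toy workload with the same layout as the generated one (50 records per class).
toy = (
    [{"query_id": 12, "filters": {"year": [1990, 1992], "explicit": 1}}] * 50
    + [{"query_id": 7, "filters": {"none": True}}] * 50
    + [{"query_id": 3, "filters": {"duration_ms": [180000, 300000]}}] * 50
)

# A JSON round trip mimics what json.load would return from the saved file:
#   with open("data/query_workload.json") as f:
#       records = json.load(f)
records = json.loads(json.dumps(toy))
summary = summarize_workload(records)
```

Because the class is positional, any experiment that shuffles or subsamples the workload would need to record the class alongside each query first.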
