# Cross-Database ID Lookup Benchmark

Compare lookup performance across PostgreSQL, MySQL, and Redis for sequential, UUIDv4, UUIDv7, and Snowflake identifiers. Run the cells in order and validate each output before moving on.
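
For orientation, the sketch below shows one way the four identifier strategies could be generated in plain Python. It is not how `bench_utils` produces IDs (that module is not shown in this notebook). `uuid7` and `snowflake` are hypothetical helpers: `uuid7` follows the RFC 9562 bit layout, and the Snowflake layout used here (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence, Twitter's epoch) is an assumption about which variant is meant.

```python
import itertools
import os
import time
import uuid

TWITTER_EPOCH_MS = 1288834974657  # assumption: the classic Snowflake epoch


def uuid7(unix_ms=None):
    """Hypothetical helper: build a UUIDv7 using the RFC 9562 bit layout."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (unix_ms & ((1 << 48) - 1)) << 80     # 48-bit Unix timestamp (ms)
    value |= 0x7 << 76                            # version nibble = 7
    value |= (rand >> 68) << 64                   # rand_a: top 12 random bits
    value |= 0b10 << 62                           # RFC variant bits
    value |= rand & ((1 << 62) - 1)               # rand_b: low 62 random bits
    return uuid.UUID(int=value)


def snowflake(unix_ms, worker_id, sequence):
    """Hypothetical helper: pack timestamp, worker, and sequence into 64 bits."""
    elapsed = unix_ms - TWITTER_EPOCH_MS
    return (elapsed << 22) | ((worker_id & 0x3FF) << 12) | (sequence & 0xFFF)


sequential_ids = itertools.count(1)
print(next(sequential_ids))                  # sequential
print(uuid.uuid4())                          # UUIDv4: fully random
print(uuid7())                               # UUIDv7: time-ordered prefix
print(snowflake(int(time.time() * 1000), worker_id=1, sequence=0))
```

UUIDv7 and Snowflake values sort by creation time, whereas UUIDv4 values are random. That difference in insertion order is what the index-related trade-offs later in the notebook refer to.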

## Runbook overview
- Provision services with Docker Compose.
- Seed 1M rows per ID strategy.
- Execute 10k random lookups per dataset.
- Capture metrics, visualize, and save to `results.csv`.
- Repeat UUID-only workloads on PostgreSQL 18 to compare UUIDv4 vs UUIDv7.

## Environment checklist
1. From this directory, create the environment with `uv venv .venv`.
2. Activate it via `source .venv/bin/activate`.
3. Install dependencies using `uv pip install -r requirements.txt`.

In [None]:
from pathlib import Path
import sys

# Resolve both paths so a symlinked .venv still compares equal to sys.prefix.
venv_path = Path('.venv').resolve()
prefix_path = Path(sys.prefix).resolve()
print(f'Python executable: {sys.executable}')
if venv_path.exists() and (prefix_path == venv_path or venv_path in prefix_path.parents):
    print('Environment check: running inside .venv ✅')
else:
    print('Environment check: please activate the uv-managed .venv before continuing ⚠️')

In [None]:
import os
import random
import time
from pathlib import Path

import matplotlib.pyplot as plt
import mysql.connector
import numpy as np
import pandas as pd
import psycopg2
import redis
import seaborn as sns
import sqlalchemy as sa
from tqdm.auto import tqdm

from bench_utils import (
    BenchmarkConfig,
    build_connections,
    bootstrap_mysql,
    bootstrap_postgres,
    fetch_function,
    fetch_mysql,
    fetch_redis,
    measure_operation,
    postgres_uuid_workload,
    results_to_frame,
    seed_mysql,
    seed_postgres,
    seed_redis,
)

sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Provision databases
Ensure Docker Desktop is running. The next cell uses `compose.yaml` to bring up two PostgreSQL instances (one for the mixed-ID tables and a PostgreSQL 18 instance for the UUID-only workload), MySQL, and Redis.
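
Containers can take a few seconds to accept connections after `docker compose up -d` returns. The following is a minimal sketch of a readiness probe, assuming the services are published on localhost at their default ports; the actual port mappings live in `compose.yaml`, which is not shown here. `wait_for_port` is a hypothetical helper, not part of `bench_utils`.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Return True once a TCP connection succeeds, or False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)  # service not listening yet; retry
    return False


# Assumed default ports; adjust to match the mappings in compose.yaml.
SERVICES = {'postgres': 5432, 'mysql': 3306, 'redis': 6379}


def report_readiness(services, host='localhost', timeout_s=30.0):
    """Probe each service once and return a name -> bool mapping."""
    return {name: wait_for_port(host, port, timeout_s) for name, port in services.items()}
```

Calling `report_readiness(SERVICES)` after the compose cell gives a quick pass/fail view. An open TCP port only shows that the process is listening; a database may still be initializing, so a failed connection in `build_connections()` can simply mean it needs a few more seconds.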

In [None]:
!docker compose up -d --remove-orphans

## Benchmark configuration
Tweak counts through environment variables (`RECORD_COUNT`, `LOOKUP_ITERATIONS`, `BATCH_SIZE`, `UUID_WORKLOAD_ROWS`) if you need smaller dry runs.
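
As an illustration of a dry-run override, the sketch below reads an integer setting from the environment and rejects non-positive values. `env_int` is a hypothetical helper; the configuration cell that follows uses plain `int(os.getenv(...))` instead.

```python
import os


def env_int(name, default, minimum=1):
    """Hypothetical helper: read an integer setting, falling back to default."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == '':
        return default
    value = int(raw)  # int() also accepts underscore separators such as '1_000'
    if value < minimum:
        raise ValueError(f'{name} must be >= {minimum}, got {value}')
    return value


# Example dry-run override, set before the configuration cell runs.
os.environ['RECORD_COUNT'] = '10000'
print(env_int('RECORD_COUNT', 1_000_000))  # prints 10000
```

Setting the variables in the shell before launching Jupyter (for example `RECORD_COUNT=10000 jupyter lab`) has the same effect without touching the notebook.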

In [None]:
RECORD_COUNT = int(os.getenv('RECORD_COUNT', '1000000'))
LOOKUP_ITERATIONS = int(os.getenv('LOOKUP_ITERATIONS', '10000'))
BATCH_SIZE = int(os.getenv('BATCH_SIZE', '20000'))
UUID_WORKLOAD_ROWS = int(os.getenv('UUID_WORKLOAD_ROWS', '200000'))

config = BenchmarkConfig(batch_size=BATCH_SIZE, lookup_iterations=LOOKUP_ITERATIONS, seed=42)
print(f'Records per table: {RECORD_COUNT:,}')
print(f'Lookup iterations: {config.lookup_iterations:,}')
print(f'Insert batch size: {config.batch_size:,}')
print(f'UUID secondary workload rows: {UUID_WORKLOAD_ROWS:,}')

In [None]:
connections = build_connections()
print('Connections established for PostgreSQL (mixed + UUID), MySQL, and Redis.')

In [None]:
bootstrap_postgres(connections.pg_mixed)
bootstrap_postgres(connections.pg_uuid)
bootstrap_mysql(connections.mysql)
print('Schemas ensured across PostgreSQL and MySQL instances.')

### Seed PostgreSQL (mixed ID strategies)
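`BATCH_SIZE` controls how many rows go into each insert. The internals of `seed_postgres` are not shown in this notebook, so the following is only a sketch of the usual batching pattern, with `batched` as a hypothetical helper.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# 1,000,000 rows at the default batch size of 20,000 gives 50 insert batches.
batch_count = sum(1 for _ in batched(range(1_000_000), 20_000))
print(batch_count)  # prints 50
```

Larger batches mean fewer round trips but more memory per statement, which is why the batch size is worth lowering together with `RECORD_COUNT` on small dry runs.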

In [None]:
pg_mixed_summaries = seed_postgres(connections.pg_mixed, RECORD_COUNT, config)
pg_mixed_summaries

### Seed PostgreSQL 18 (UUID-only focus)

In [None]:
pg_uuid_summaries = seed_postgres(
    connections.pg_uuid,
    RECORD_COUNT,
    config,
    include_tables=['uuid_v4_test', 'uuid_v7_test']
)
pg_uuid_summaries

### Seed MySQL

In [None]:
mysql_summaries = seed_mysql(connections.mysql, RECORD_COUNT, config)
mysql_summaries

In [None]:
seed_df = pd.DataFrame(
    [
        {
            'database': summary.dataset.database,
            'id_type': summary.dataset.id_type,
            'rows_inserted': summary.rows_inserted,
            'duration_s': summary.duration_s,
            'rows_per_second': summary.rows_inserted / max(summary.duration_s, 1e-9),
        }
        for summary in (pg_mixed_summaries + pg_uuid_summaries + mysql_summaries)
    ]
)
seed_df.sort_values(['database', 'id_type']).reset_index(drop=True)

### Prepare Redis keyspace (mirrors PostgreSQL IDs)

In [None]:
redis_datasets = seed_redis(connections.redis, [s.dataset for s in pg_mixed_summaries], config)
redis_datasets

## Lookup benchmarks
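Each job below is a `(label, fetcher, samples)` tuple, with labels of the form `database::id_type::operation`. To make the reported metrics concrete, here is a minimal sketch of how such a job could be timed. It is a hypothetical stand-in for `measure_operation`, whose implementation is not shown; only the column names `avg_ms` and `p95_ms` are taken from the plotting cells, and splitting the label on `::` is an assumption.

```python
import time


def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile over pre-sorted data (percent in 1..100)."""
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def time_lookups(fetch, samples, label):
    """Hypothetical stand-in for measure_operation: time one call per sample."""
    if not samples:
        raise ValueError('samples must not be empty')
    database, id_type, operation = label.split('::')
    latencies_ms = []
    for key in samples:
        start = time.perf_counter()
        fetch(key)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    return {
        'database': database,
        'id_type': id_type,
        'operation': operation,
        'count': len(latencies_ms),
        'avg_ms': sum(latencies_ms) / len(latencies_ms),
        'p95_ms': nearest_rank(latencies_ms, 95),
    }


# Usage with an in-memory dict standing in for a database fetcher.
store = {n: f'row-{n}' for n in range(1000)}
print(time_lookups(store.get, list(range(1000)), 'dict::sequential::lookup'))
```

Timing each call individually with `time.perf_counter` keeps per-lookup latencies available for percentiles, at the cost of a small fixed overhead per measurement.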

In [None]:
lookup_jobs = []

for summary in pg_mixed_summaries:
    dataset = summary.dataset
    fetcher = fetch_function(connections.pg_mixed, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'postgres_mixed::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

for summary in pg_uuid_summaries:
    dataset = summary.dataset
    fetcher = fetch_function(connections.pg_uuid, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'postgres_uuid18::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

for summary in mysql_summaries:
    dataset = summary.dataset
    fetcher = fetch_mysql(connections.mysql, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'mysql::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

redis_fetcher = fetch_redis(connections.redis)
for dataset in redis_datasets:
    lookup_jobs.append(
        (
            f'redis::{dataset.id_type}::lookup',
            redis_fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

len(lookup_jobs)

In [None]:
results = []
for label, func, samples in lookup_jobs:
    results.append(measure_operation(func, samples, label, config))

results_df = results_to_frame(results)
lookup_df = results_df[results_df['operation'] == 'lookup'].sort_values(['database', 'id_type'])
lookup_df.reset_index(drop=True)

In [None]:
fig, ax = plt.subplots()
sns.barplot(data=lookup_df, x='id_type', y='avg_ms', hue='database', ax=ax)
ax.set_title('Average lookup latency (ms) by ID strategy')
ax.set_ylabel('Average latency (ms)')
ax.set_xlabel('ID type')
ax.legend(title='Database')
plt.show()

In [None]:
fig, ax = plt.subplots()
sns.barplot(data=lookup_df, x='id_type', y='p95_ms', hue='database', ax=ax)
ax.set_title('95th percentile lookup latency (ms)')
ax.set_ylabel('p95 latency (ms)')
ax.set_xlabel('ID type')
ax.legend(title='Database')
plt.show()

### Findings
Record key observations here once the tables and plots above are populated. Focus on the fastest ID strategy per database, any UUIDv7 gains on PostgreSQL 18, and trade-offs (index size, randomness, write amplification).
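
One way to pull the per-database winners out of the results is sketched below. `fastest_per_database` is a hypothetical helper that works on plain dictionaries; in the notebook it could be fed `lookup_df.to_dict('records')`. The numbers in the example are illustrative only, not measured results.

```python
def fastest_per_database(records, metric='avg_ms'):
    """Return the row with the lowest metric value for each database."""
    best = {}
    for row in records:
        current = best.get(row['database'])
        if current is None or row[metric] < current[metric]:
            best[row['database']] = row
    return best


# Illustrative numbers only, not measured results.
example = [
    {'database': 'postgres_mixed', 'id_type': 'uuid_v4', 'avg_ms': 0.40},
    {'database': 'postgres_mixed', 'id_type': 'uuid_v7', 'avg_ms': 0.30},
    {'database': 'mysql', 'id_type': 'uuid_v4', 'avg_ms': 0.50},
    {'database': 'mysql', 'id_type': 'sequential', 'avg_ms': 0.45},
]
for database, row in sorted(fastest_per_database(example).items()):
    print(f"{database}: {row['id_type']} ({row['avg_ms']:.2f} ms)")
```

Passing `metric='p95_ms'` ranks by tail latency instead of the mean, which can produce a different winner.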

In [None]:
results_df.to_csv('results.csv', index=False)
print('Benchmark results saved to results.csv')

## PostgreSQL UUID performance comparison

In [None]:
uuid_v4_metrics = postgres_uuid_workload(
    connections.pg_uuid,
    'uuid_perf_v4',
    'uuid_v4',
    UUID_WORKLOAD_ROWS,
    config,
)
uuid_v7_metrics = postgres_uuid_workload(
    connections.pg_uuid,
    'uuid_perf_v7',
    'uuid_v7',
    UUID_WORKLOAD_ROWS,
    config,
)
secondary_df = results_to_frame(list(uuid_v4_metrics.values()) + list(uuid_v7_metrics.values()))
secondary_df.sort_values(['operation', 'id_type']).reset_index(drop=True)

In [None]:
pivot_df = secondary_df.pivot_table(
    index='operation',
    columns='id_type',
    values='avg_ms',
)
pivot_df.plot(kind='bar', rot=0, ylabel='Average latency (ms)', title='UUIDv4 vs UUIDv7 on PostgreSQL 18')
plt.show()

### Secondary findings
Document whether UUIDv7 consistently outperforms UUIDv4 for inserts, point selects, and ordered scans. Report any throughput multiplier you observe and note how it compares with the ~3x target where applicable.
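
The multiplier can be computed as the ratio of average latencies per operation, as in the sketch below. `speedup` is a hypothetical helper; in the notebook it could be fed `pivot_df.to_dict('index')`. The column labels `uuid_v4` and `uuid_v7` are assumed to match the `id_type` values that `postgres_uuid_workload` reports, and the numbers are illustrative only.

```python
def speedup(avg_ms_by_operation, baseline='uuid_v4', candidate='uuid_v7'):
    """Return baseline/candidate latency per operation; above 1 favors candidate."""
    return {
        operation: values[baseline] / values[candidate]
        for operation, values in avg_ms_by_operation.items()
    }


# Illustrative numbers only, not measured results.
example = {
    'insert': {'uuid_v4': 3.0, 'uuid_v7': 1.5},
    'select': {'uuid_v4': 0.5, 'uuid_v7': 0.5},
}
for operation, ratio in speedup(example).items():
    print(f'{operation}: {ratio:.2f}x')
```

For operations issued one after another, throughput is the inverse of average latency, so this latency ratio doubles as the throughput multiplier.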

In [None]:
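# Writes a copy of id_benchmark.ipynb with its current outputs; nbconvert only re-runs cells when --execute is passed.
!jupyter nbconvert --to notebook --output uuid_benchmark_report.ipynb id_benchmark.ipynb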