# Cross-Database ID Lookup Benchmark

Compare lookup performance across PostgreSQL, MySQL, and Redis for sequential, UUIDv4, UUIDv7, and Snowflake identifiers. Run each cell sequentially, validating the output before moving on.

## Runbook overview
- Provision services with Docker Compose.
- Seed 1M rows per ID strategy.
- Execute 10k random lookups per dataset.
- Capture metrics, visualize, and save to `results.csv`.
- Repeat UUID-only workloads on PostgreSQL 18 to compare UUIDv4 vs UUIDv7.

## Environment checklist
1. From this directory, create the environment with `uv venv .venv`.
2. Activate it via `source .venv/bin/activate`.
3. Install dependencies using `uv pip install -r requirements.txt`.

In [1]:
from pathlib import Path
import sys

venv_path = Path('.venv').resolve()
print(f'Python executable: {sys.executable}')
if venv_path.exists() and (Path(sys.prefix) == venv_path or venv_path in Path(sys.prefix).parents):
    print('Environment check: running inside .venv ✅')
else:
    print('Environment check: please activate the uv-managed .venv before continuing ⚠️')

Python executable: /Users/codefox/workspace/practice_infra_arch/pg_uuid_benchmark/.venv/bin/python
Environment check: running inside .venv ✅


In [2]:
import os
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from bench_utils import (
    BenchmarkConfig,
    build_connections,
    bootstrap_mysql,
    bootstrap_postgres,
    fetch_function,
    fetch_mysql,
    fetch_redis,
    measure_operation,
    postgres_uuid_workload,
    results_to_frame,
    seed_mysql,
    seed_postgres,
    seed_redis,
)

sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)



In [3]:
# DRY-RUN helper: override heavy defaults for interactive iteration
# Set DRY_RUN = False when you want to execute the full benchmark (1M records etc.)
DRY_RUN = False


if DRY_RUN:
    RECORD_COUNT = int(os.getenv('RECORD_COUNT', '10000'))
    LOOKUP_ITERATIONS = int(os.getenv('LOOKUP_ITERATIONS', '1000'))
    BATCH_SIZE = int(os.getenv('BATCH_SIZE', '2000'))
    UUID_WORKLOAD_ROWS = int(os.getenv('UUID_WORKLOAD_ROWS', '20000'))
else:
    RECORD_COUNT = int(os.getenv('RECORD_COUNT', '1000000'))
    LOOKUP_ITERATIONS = int(os.getenv('LOOKUP_ITERATIONS', '10000'))
    BATCH_SIZE = int(os.getenv('BATCH_SIZE', '20000'))
    UUID_WORKLOAD_ROWS = int(os.getenv('UUID_WORKLOAD_ROWS', '200000'))

# Config object used across the notebook
config = BenchmarkConfig(batch_size=BATCH_SIZE, lookup_iterations=LOOKUP_ITERATIONS, seed=42)
print(f'DRY_RUN={DRY_RUN} | Records per table: {RECORD_COUNT:,} | Lookups: {config.lookup_iterations:,} | Batch size: {config.batch_size:,} | UUID rows: {UUID_WORKLOAD_ROWS:,}')

DRY_RUN=False | Records per table: 1,000,000 | Lookups: 10,000 | Batch size: 20,000 | UUID rows: 200,000


## Provision databases
Ensure Docker Desktop is running. The next cell brings up PostgreSQL (x2), MySQL, and Redis using `compose.yaml`.
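
Because `docker compose up -d` returns as soon as the containers are started, the databases may not yet accept connections when the connection cell runs. A minimal sketch of a readiness wait follows. It assumes the services are published on the default host ports listed in `SERVICES` (hypothetical values; check `compose.yaml` for the real mappings). `wait_for_port` and `check_services` are illustrative helpers, not part of `bench_utils`.

```python
import socket
import time
from typing import Dict


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves the port is open.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


def check_services(services: Dict[str, int], host: str = 'localhost', timeout_s: float = 30.0) -> Dict[str, bool]:
    """Map each service name to whether its port became reachable in time."""
    return {name: wait_for_port(host, port, timeout_s) for name, port in services.items()}


# Assumed host ports (not taken from compose.yaml); adjust to the real mappings.
SERVICES = {'postgres': 5432, 'mysql': 3306, 'redis': 6379}

# Example call: check_services(SERVICES)
```

An open port does not guarantee that the server has finished initializing, so the first query may still need a retry.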

In [4]:
!docker compose up -d --build --remove-orphans

[+] Running 4/4
 ✔ Container redis-bench     Running    0.0s
 ✔ Container mysql-bench     Running    0.0s
 ✔ Container pg-bench-mixed  Running    0.0s
 ✔ Container pg-bench-uuid   Running    0.0s

## Benchmark configuration
Override the counts through environment variables (`RECORD_COUNT`, `LOOKUP_ITERATIONS`, `BATCH_SIZE`, `UUID_WORKLOAD_ROWS`), or set `DRY_RUN = True` in the earlier cell for a smaller run.

In [5]:
# Counts and `config` were already resolved in the DRY_RUN cell above.
# Re-reading the environment here with full-size defaults would silently
# undo a dry run, so this cell only reports the active values.
print(f'Records per table: {RECORD_COUNT:,}')
print(f'Lookup iterations: {config.lookup_iterations:,}')
print(f'Insert batch size: {config.batch_size:,}')
print(f'UUID secondary workload rows: {UUID_WORKLOAD_ROWS:,}')

Records per table: 1,000,000
Lookup iterations: 10,000
Insert batch size: 20,000
UUID secondary workload rows: 200,000


In [6]:
connections = build_connections()
print('Connections established for PostgreSQL (mixed + UUID), MySQL, and Redis.')

Connections established for PostgreSQL (mixed + UUID), MySQL, and Redis.


In [7]:
bootstrap_postgres(connections.pg_mixed)
bootstrap_postgres(connections.pg_uuid)
bootstrap_mysql(connections.mysql)
print('Schemas ensured across PostgreSQL and MySQL instances.')

Schemas ensured across PostgreSQL and MySQL instances.


### Seed PostgreSQL (mixed ID strategies)

In [8]:
pg_mixed_summaries = seed_postgres(connections.pg_mixed, RECORD_COUNT, config)
pg_mixed_summaries

Postgres seed seq_id_test: 100%|██████████| 1000000/1000000 [00:25<00:00, 38873.17it/s]
Postgres seed uuid_v4_test: 100%|██████████| 1000000/1000000 [00:38<00:00, 25929.94it/s]
Postgres seed uuid_v7_test: 100%|██████████| 1000000/1000000 [00:32<00:00, 31219.21it/s]
Postgres seed snowflake_test: 100%|██████████| 1000000/1000000 [00:34<00:00, 28980.71it/s]



[InsertSummary(dataset=DatasetInfo(database='postgres', id_type='seq_id', table='seq_id_test', id_column='id', samples=[670488, 116740, 26226, 777573, 288390, 256788, 234054, 146317, 772247, 107474, 709571, 776647, 935519, 571859, 91162, 619177, 442418, 33327, 31245, 98247, 229259, 243963, 529904, 631263, 27825, 588509, 208497, 750801, 681454, 735393, 571413, 439899, 231149, 471030, 617890, 291705, 848750, 911528, 6815, 795668, 844963, 167415, 732053, 443144, 356779, 291370, 163033, 225773, 800582, 352945, 107176, 97252, 398383, 101415, 376418, 888663, 360664, 633053, 277371, 846336, 45562, 765180, 481742, 562276, 130890, 967097, 396923, 82628, 578857, 307420, 869694, 659177, 648565, 928464, 903566, 379202, 605398, 201630, 738798, 72934, 48051, 693385, 238969, 810621, 303446, 83668, 896866, 244099, 908574, 105908, 398592, 291477, 475436, 666564, 874629, 382555, 170556, 388163, 372529, 219685, 702730, 279947, 735912, 982154, 716752, 679515, 74871, 638721, 665823, 179452, 560087, 764545,

### Seed PostgreSQL 18 (UUID-only focus)

**Note:** This cell seeds only the UUID tables on the PostgreSQL 18 instance. To include `seq_id` and `snowflake` in the comparison, drop the `include_tables` argument so that every table is seeded.

In [9]:
pg_uuid_summaries = seed_postgres(
    connections.pg_uuid,
    RECORD_COUNT,
    config,
    include_tables=['uuid_v4_test', 'uuid_v7_test']
)
pg_uuid_summaries

Postgres seed uuid_v4_test: 100%|██████████| 1000000/1000000 [00:37<00:00, 26576.41it/s]
Postgres seed uuid_v7_test: 100%|██████████| 1000000/1000000 [00:32<00:00, 30779.29it/s]



[InsertSummary(dataset=DatasetInfo(database='postgres', id_type='uuid_v4', table='uuid_v4_test', id_column='id', samples=['4a7f93b7-9e50-4497-9792-384933978b20', '535ec611-91b9-4ece-926a-f296e04b839c', 'f53ac157-8040-404e-8d3c-3ea1929bcb2c', '0aee4e55-7b1f-4e60-98c4-e12ff38a44ea', 'b789d890-60a9-4cc9-b5c4-fb53a4f81035', '92206bd1-a93c-48a6-b9ec-20616aa9d11b', 'd5725035-a56a-4c84-8a85-c8ba3738e1a7', '4489c5b3-208d-4bbb-a1cb-569fdb7a4a0a', 'dc1f2e02-031f-4fbe-b36f-3c6c9ea9b770', '523b54f9-f43e-4f48-b08b-ced8029ce927', 'f9350b5c-568a-4272-878e-3d47a8ddec14', 'a960e823-a23d-4f07-99de-2c6d3fa5280a', 'a8c09f30-b227-463d-83c6-308519004ce5', 'e3df8db2-81de-40f6-985a-7ddcceb1c202', 'b05e6a96-3e5b-4e30-9c30-45b5f7cdc900', '3ed5548f-45de-426c-a945-3f759d50e5f7', '45dbd27d-ce70-4afa-b56f-2e7b071d62b3', '3e876de9-49d9-4cab-97cb-c41e282642ab', 'ed7c247d-1397-4859-b7d2-65b5e1f4c2a8', '8d29c667-daf6-466b-b7f5-01bdbcf607de', '947aa0a8-ebab-46b6-b70e-815bff1460d3', '11551e21-5ba1-401c-98d0-05eef90fa873'

### Seed MySQL

In [10]:
mysql_summaries = seed_mysql(connections.mysql, RECORD_COUNT, config)
mysql_summaries

MySQL seed seq_id_test: 100%|██████████| 1000000/1000000 [00:06<00:00, 152061.65it/s]
MySQL seed uuid_v4_test: 100%|██████████| 1000000/1000000 [01:02<00:00, 16046.98it/s]
MySQL seed uuid_v7_test: 100%|██████████| 1000000/1000000 [00:15<00:00, 62729.28it/s]
MySQL seed snowflake_test: 100%|██████████| 1000000/1000000 [00:08<00:00, 124168.74it/s]


[InsertSummary(dataset=DatasetInfo(database='mysql', id_type='seq_id', table='seq_id_test', id_column='id', samples=[670488, 116740, 26226, 777573, 288390, 256788, 234054, 146317, 772247, 107474, 709571, 776647, 935519, 571859, 91162, 619177, 442418, 33327, 31245, 98247, 229259, 243963, 529904, 631263, 27825, 588509, 208497, 750801, 681454, 735393, 571413, 439899, 231149, 471030, 617890, 291705, 848750, 911528, 6815, 795668, 844963, 167415, 732053, 443144, 356779, 291370, 163033, 225773, 800582, 352945, 107176, 97252, 398383, 101415, 376418, 888663, 360664, 633053, 277371, 846336, 45562, 765180, 481742, 562276, 130890, 967097, 396923, 82628, 578857, 307420, 869694, 659177, 648565, 928464, 903566, 379202, 605398, 201630, 738798, 72934, 48051, 693385, 238969, 810621, 303446, 83668, 896866, 244099, 908574, 105908, 398592, 291477, 475436, 666564, 874629, 382555, 170556, 388163, 372529, 219685, 702730, 279947, 735912, 982154, 716752, 679515, 74871, 638721, 665823, 179452, 560087, 764545, 25

In [11]:
seed_df = pd.DataFrame(
    [
        {
            'database': summary.dataset.database,
            'id_type': summary.dataset.id_type,
            'rows_inserted': summary.rows_inserted,
            'duration_s': summary.duration_s,
            'rows_per_second': summary.rows_inserted / max(summary.duration_s, 1e-9),
        }
        for summary in (pg_mixed_summaries + pg_uuid_summaries + mysql_summaries)
    ]
)
seed_df.sort_values(['database', 'id_type']).reset_index(drop=True)

|   | database | id_type | rows_inserted | duration_s | rows_per_second |
|---|----------|-----------|---------------|------------|-----------------|
| 0 | mysql | seq_id | 1000000 | 6.578646 | 152006.956388 |
| 1 | mysql | snowflake | 1000000 | 8.054897 | 124148.07328 |
| 2 | mysql | uuid_v4 | 1000000 | 62.388348 | 16028.634119 |
| 3 | mysql | uuid_v7 | 1000000 | 15.944551 | 62717.34959 |
| 4 | postgres | seq_id | 1000000 | 25.737244 | 38854.198808 |
| 5 | postgres | snowflake | 1000000 | 34.507465 | 28979.236847 |
| 6 | postgres | uuid_v4 | 1000000 | 38.566142 | 25929.479605 |
| 7 | postgres | uuid_v4 | 1000000 | 37.628568 | 26575.55318 |
| 8 | postgres | uuid_v7 | 1000000 | 32.033224 | 31217.588135 |
| 9 | postgres | uuid_v7 | 1000000 | 32.490631 | 30778.103703 |


### Prepare Redis keyspace (mirrors PostgreSQL IDs)

In [12]:
redis_datasets = seed_redis(connections.redis, [s.dataset for s in pg_mixed_summaries], config)
redis_datasets

Redis load seq_id: 100%|██████████| 10000/10000 [00:00<00:00, 606867.49it/s]
Redis load uuid_v4: 100%|██████████| 10000/10000 [00:00<00:00, 482414.43it/s]
Redis load uuid_v7: 100%|██████████| 10000/10000 [00:00<00:00, 878038.90it/s]
Redis load snowflake: 100%|██████████| 10000/10000 [00:00<00:00, 908762.84it/s]



[DatasetInfo(database='redis', id_type='seq_id', table='', id_column='key', samples=['seq_id:670488', 'seq_id:116740', 'seq_id:26226', 'seq_id:777573', 'seq_id:288390', 'seq_id:256788', 'seq_id:234054', 'seq_id:146317', 'seq_id:772247', 'seq_id:107474', 'seq_id:709571', 'seq_id:776647', 'seq_id:935519', 'seq_id:571859', 'seq_id:91162', 'seq_id:619177', 'seq_id:442418', 'seq_id:33327', 'seq_id:31245', 'seq_id:98247', 'seq_id:229259', 'seq_id:243963', 'seq_id:529904', 'seq_id:631263', 'seq_id:27825', 'seq_id:588509', 'seq_id:208497', 'seq_id:750801', 'seq_id:681454', 'seq_id:735393', 'seq_id:571413', 'seq_id:439899', 'seq_id:231149', 'seq_id:471030', 'seq_id:617890', 'seq_id:291705', 'seq_id:848750', 'seq_id:911528', 'seq_id:6815', 'seq_id:795668', 'seq_id:844963', 'seq_id:167415', 'seq_id:732053', 'seq_id:443144', 'seq_id:356779', 'seq_id:291370', 'seq_id:163033', 'seq_id:225773', 'seq_id:800582', 'seq_id:352945', 'seq_id:107176', 'seq_id:97252', 'seq_id:398383', 'seq_id:101415', 'seq_i

## Lookup benchmarks

In [13]:
lookup_jobs = []

for summary in pg_mixed_summaries:
    dataset = summary.dataset
    fetcher = fetch_function(connections.pg_mixed, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'postgres_mixed::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

for summary in pg_uuid_summaries:
    dataset = summary.dataset
    fetcher = fetch_function(connections.pg_uuid, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'postgres_uuid18::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

for summary in mysql_summaries:
    dataset = summary.dataset
    fetcher = fetch_mysql(connections.mysql, dataset.table, dataset.id_column)
    lookup_jobs.append(
        (
            f'mysql::{dataset.id_type}::lookup',
            fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

redis_fetcher = fetch_redis(connections.redis)
for dataset in redis_datasets:
    lookup_jobs.append(
        (
            f'redis::{dataset.id_type}::lookup',
            redis_fetcher,
            list(dataset.samples)[: config.lookup_iterations],
        )
    )

len(lookup_jobs)

14

### Execute lookup benchmarks
Run all prepared lookup jobs and collect timing metrics.
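
The implementation of `measure_operation` is not shown in this notebook. As rough orientation, a timing helper of this kind might look like the sketch below, which times each call with `time.perf_counter` and summarizes the `avg_ms` and `p95_ms` figures plotted later. The name `time_lookups`, its return shape, and the nearest-rank percentile are assumptions for illustration, not the `bench_utils` code.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, Iterable


def time_lookups(func: Callable[[Any], Any], inputs: Iterable[Any]) -> Dict[str, float]:
    """Call func once per input and summarize per-call latency in milliseconds."""
    latencies_ms = []
    for value in inputs:
        start = time.perf_counter()
        func(value)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    if not latencies_ms:
        raise ValueError('inputs must contain at least one value')

    ordered = sorted(latencies_ms)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        'count': float(len(ordered)),
        'avg_ms': statistics.fmean(ordered),
        'p95_ms': ordered[p95_index],
        'max_ms': ordered[-1],
    }


# Stand-in for a database fetcher: an in-memory dict lookup.
store = {i: f'row-{i}' for i in range(1000)}
stats = time_lookups(store.get, range(100))
print(sorted(stats))
```

Reporting a tail percentile next to the mean matters here because a handful of slow lookups (cold pages, network hiccups) can sit well above the average without moving it much.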

In [None]:
results = []
for label, fetcher, sample_ids in lookup_jobs:
    result_row = measure_operation(
        func=fetcher,
        inputs=sample_ids,
        label=label,
        config=config,
        show_progress=True
    )
    results.append(result_row)

results_df = results_to_frame(results)
results_df.sort_values(['database', 'id_type']).reset_index(drop=True)


### Lookup results summary
View the aggregated lookup benchmark results across all databases and ID types.

In [None]:
lookup_df = results_df[results_df['operation'] == 'lookup'].copy()
lookup_df.sort_values(['database', 'id_type']).reset_index(drop=True)

In [None]:
from plot_utils import bar_latency

bar_latency(lookup_df, metric='avg_ms', title='Average lookup latency (ms) by ID strategy')
plt.show()

In [None]:
from plot_utils import bar_latency

bar_latency(lookup_df, metric='p95_ms', title='95th percentile lookup latency (ms)')
plt.show()

### Findings
Capture key observations here once the plots and tables populate. Focus on per-database winners, UUIDv7 gains in PostgreSQL 18, and trade-offs (index size, randomness, write amplification).
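
The randomness trade-off comes from the ID layout. A UUIDv7 (RFC 9562) starts with a 48-bit Unix millisecond timestamp, followed by the version, 12 random bits, the variant, and 62 more random bits, so values created later sort later and land near each other in a B-tree index. A UUIDv4 is random apart from its version and variant bits. The sketch below builds a UUIDv7-shaped value by hand to show that layout; `uuid7_sketch` is a hypothetical helper and not necessarily how the benchmark generates its IDs.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID with the UUIDv7 bit layout from RFC 9562."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), 'big')  # 80 random bits
    rand_a = rand >> 68                # top 12 random bits
    rand_b = rand & ((1 << 62) - 1)    # low 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                          # bits 79..76: version 7
    value |= rand_a << 64                       # bits 75..64: rand_a
    value |= 0b10 << 62                         # bits 63..62: RFC variant
    value |= rand_b                             # bits 61..0: rand_b
    return uuid.UUID(int=value)


# Fixed timestamps make the ordering property easy to see.
earlier = uuid7_sketch(unix_ms=1_000)
later = uuid7_sketch(unix_ms=2_000)
print(earlier, later, earlier < later)
```

Two values generated within the same millisecond are ordered only by their random bits in this sketch; production generators often add a counter to keep them monotonic.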

### Cross-database ID strategy comparison
Compare the same ID strategies across different databases to identify performance patterns.

In [None]:
# Pivot table to compare ID strategies across databases
comparison_df = lookup_df.pivot_table(
    index='id_type',
    columns='database',
    values='avg_ms',
    aggfunc='mean'
)

# Calculate relative performance (normalized to PostgreSQL mixed).
# Iterate over a snapshot of the columns because the loop adds new ones.
if 'postgres_mixed' in comparison_df.columns:
    for col in list(comparison_df.columns):
        comparison_df[f'{col}_relative'] = comparison_df[col] / comparison_df['postgres_mixed']

comparison_df

### UUIDv4 vs UUIDv7 speedup analysis
Quantify the performance improvement of UUIDv7 over UUIDv4 across databases.

In [None]:
uuid_comparison = lookup_df[lookup_df['id_type'].isin(['uuid_v4', 'uuid_v7'])].copy()
uuid_pivot = uuid_comparison.pivot_table(
    index='database',
    columns='id_type',
    values='avg_ms'
)

# Calculate speedup: UUIDv4 / UUIDv7 (higher = UUIDv7 is faster)
uuid_pivot['speedup_v7_over_v4'] = uuid_pivot['uuid_v4'] / uuid_pivot['uuid_v7']
uuid_pivot['improvement_pct'] = (uuid_pivot['speedup_v7_over_v4'] - 1) * 100

print("UUIDv7 Performance Improvement over UUIDv4:")
print("=" * 60)
uuid_pivot[['uuid_v4', 'uuid_v7', 'speedup_v7_over_v4', 'improvement_pct']]
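
The two derived columns reduce to simple arithmetic on average latencies. A worked sketch with made-up numbers (illustrative only, not benchmark results) shows how the ratio and the percentage relate; `speedup` is a hypothetical helper that mirrors the column formulas above.

```python
from typing import Tuple


def speedup(baseline_ms: float, candidate_ms: float) -> Tuple[float, float]:
    """Return (ratio, improvement_pct) of a candidate latency against a baseline."""
    if candidate_ms <= 0:
        raise ValueError('candidate_ms must be positive')
    ratio = baseline_ms / candidate_ms
    return ratio, (ratio - 1.0) * 100.0


# Illustrative latencies: baseline (UUIDv4) 3.0 ms, candidate (UUIDv7) 2.0 ms.
ratio, pct = speedup(baseline_ms=3.0, candidate_ms=2.0)
print(ratio, pct)  # 1.5 50.0

# A ratio below 1 means the candidate is slower, so the percentage is negative.
slower_ratio, slower_pct = speedup(baseline_ms=2.0, candidate_ms=4.0)
print(slower_ratio, slower_pct)  # 0.5 -50.0
```

Note that `improvement_pct` expresses how much more work fits in the same time, not how much latency was removed: a ratio of 1.5 is a 50% improvement by this definition, while latency itself fell by a third.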

In [None]:
results_df.to_csv('results.csv', index=False)
print('Benchmark results saved to results.csv')

## PostgreSQL UUID Performance Comparison

In [None]:
uuid_v4_metrics = postgres_uuid_workload(
    connections.pg_uuid,
    'uuid_perf_v4',
    'uuid_v4',
    UUID_WORKLOAD_ROWS,
    config,
)
uuid_v7_metrics = postgres_uuid_workload(
    connections.pg_uuid,
    'uuid_perf_v7',
    'uuid_v7',
    UUID_WORKLOAD_ROWS,
    config,
)
secondary_df = results_to_frame(list(uuid_v4_metrics.values()) + list(uuid_v7_metrics.values()))
secondary_df.sort_values(['operation', 'id_type']).reset_index(drop=True)

In [None]:
pivot_df = secondary_df.pivot_table(
    index='operation',
    columns='id_type',
    values='avg_ms',
)
pivot_df.plot(kind='bar', rot=0, ylabel='Average latency (ms)', title='UUIDv4 vs UUIDv7 on PostgreSQL 18')
plt.show()

### Secondary findings
Document whether UUIDv7 consistently outperforms UUIDv4 for inserts, point selects, and ordered scans. Highlight any throughput multiplier observed (targeting ~3x where applicable).