# Canonical Identifier Analysis

This notebook explores the properties of the canonical identifier scheme used for public catalogue IDs.

## ID Generation Rules

Public catalogue identifiers follow specific rules to ensure they are:
- **Stable**: Once assigned, identifiers never change
- **Human-readable**: Easy to type and communicate
- **Unique**: No collisions in practical use

The scheme is defined as:
- **Length**: 8 characters
- **First character**: Letter only (`a-z`, excluding `o`, `i`, `l`)
- **Remaining characters**: Letters and digits (`a-z` excluding `o`, `i`, `l`; digits `2-9`, so neither `0` nor `1` appears)

Easily confused characters are excluded to avoid transcription errors (e.g., `0` vs `o`, `1` vs `l`).
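
The rules above can be sketched in a few lines. This is only an illustration of the scheme as described here, not the actual `id_minter` implementation: the names `sketch_canonical_id` and `looks_canonical` are hypothetical, and the use of `secrets` as the random source is an assumption.

```python
import re
import secrets
import string

# Assumption: alphabets rebuilt from the rules listed above, not taken from id_minter.
FORBIDDEN = set("oil1")
LETTERS = sorted(set(string.ascii_lowercase) - FORBIDDEN)                         # 23 letters
ALPHABET = sorted((set(string.ascii_lowercase) | set("123456789")) - FORBIDDEN)  # 31 characters

# One allowed letter, then 7 characters from the full alphabet.
# [a-hjkmnp-z] is a-z without i, l and o; [2-9] leaves out 0 and 1.
ID_PATTERN = re.compile(r"[a-hjkmnp-z][a-hjkmnp-z2-9]{7}")


def sketch_canonical_id() -> str:
    """Return a random 8-character string that follows the scheme (hypothetical helper)."""
    first = secrets.choice(LETTERS)
    rest = "".join(secrets.choice(ALPHABET) for _ in range(7))
    return first + rest


def looks_canonical(candidate: str) -> bool:
    """Check whether a string matches the shape of a canonical ID (hypothetical helper)."""
    return ID_PATTERN.fullmatch(candidate) is not None


print(sketch_canonical_id(), looks_canonical("cj9g72ne"))
```

A validator like this only checks the shape of an ID; it says nothing about whether the ID has actually been assigned.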

In [23]:
from id_minter import generate_canonical_id

# Generate some example IDs
print("Example canonical IDs:")
for _ in range(5):
    print(f"  {generate_canonical_id()}")

Example canonical IDs:
  cj9g72ne
  bfxtp8kv
  d8ddtdzj
  fahpr5dk
  jx82u9f6


## ID Space Size

How many unique IDs can we generate with this scheme?

- **First character**: 23 options (26 letters - 3 forbidden)
- **Remaining 7 characters**: 31 options each (26 letters + 9 digits `1-9`, minus 4 forbidden; `0` is never a candidate)

Total: $23 \times 31^7 \approx 633$ billion unique identifiers

In [24]:
import string

def calculate_id_space(length: int = 8) -> int:
    """Calculate the total number of possible unique IDs of the given length."""
    forbidden = {'o', 'i', 'l', '1'}
    numbers = {str(n) for n in range(1, 10)}  # digits 1-9; 0 is never a candidate
    letters = set(string.ascii_lowercase)

    allowed_chars = (numbers | letters) - forbidden  # 31 characters
    first_chars = letters - forbidden                # 23 letters

    # First char options × (remaining char options)^(length-1)
    return len(first_chars) * (len(allowed_chars) ** (length - 1))

id_space = calculate_id_space()

print(f"Total ID space: {id_space:,}")
print(f"First character options: {26 - 3} (letters minus o, i, l)")
print(f"Other character options: {35 - 4} (digits 2-9 + letters minus forbidden)")
print(f"Formula: 23 × 31^7 = {23 * 31**7:,}")

Total ID space: 632,790,124,553
First character options: 23 (letters minus o, i, l)
Other character options: 31 (digits 2-9 + letters minus forbidden)
Formula: 23 × 31^7 = 632,790,124,553


## Collision Probability

A **collision** occurs when a newly generated random ID matches an ID that already exists in the database.

Assuming IDs are drawn uniformly at random from the whole ID space, the probability that a single new ID collides with one of the existing IDs is:
$$P(\text{collision}) = \frac{\text{existing IDs}}{\text{total ID space}}$$

In [25]:
id_space = calculate_id_space()

def single_id_collision_probability(existing_ids: int, id_space: int) -> float:
    """Probability that ONE new random ID collides with any existing ID."""
    return existing_ids / id_space

# Current catalogue size (as of Feb 2026)
existing = 52_000_000

collision_prob = single_id_collision_probability(existing, id_space)
print(f"With {existing:,} existing IDs in a space of {id_space:,}:\n")
print(f"  Collision probability: {collision_prob:.2e} ({collision_prob*100:.6f}%)")
print(f"  That's about 1 in {int(1/collision_prob):,}")

# Show how this scales with catalogue growth
print("\nCollision probability at different catalogue sizes:\n")
for n in [10_000_000, 52_000_000, 100_000_000, 500_000_000, 1_000_000_000]:
    prob = single_id_collision_probability(n, id_space)
    print(f"  {n:>15,} IDs: 1 in {int(1/prob):>10,} ({prob*100:.4f}%)")

With 52,000,000 existing IDs in a space of 632,790,124,553:

  Collision probability: 8.22e-05 (0.008218%)
  That's about 1 in 12,169

Collision probability at different catalogue sizes:

       10,000,000 IDs: 1 in     63,279 (0.0016%)
       52,000,000 IDs: 1 in     12,169 (0.0082%)
      100,000,000 IDs: 1 in      6,327 (0.0158%)
      500,000,000 IDs: 1 in      1,265 (0.0790%)
    1,000,000,000 IDs: 1 in        632 (0.1580%)
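
The formula $P(\text{collision}) = \text{existing IDs} / \text{total ID space}$ can be sanity-checked by simulation. The full 8-character space is far too large to sample meaningfully, so the sketch below shrinks the ID length to 3 characters (a space of $23 \times 31^2 = 22{,}103$) and compares the observed collision rate with the predicted one. The function name `simulate_collision_rate`, the reduced length, the seed, and the use of `random.Random` are all assumptions made for illustration; the real minter may draw IDs differently.

```python
import random
import string


def simulate_collision_rate(existing_count: int, trials: int,
                            length: int = 3, seed: int = 0) -> tuple[float, float]:
    """Return (observed, predicted) collision rates in a reduced ID space.

    Hypothetical helper: uses a short ID length so the space is small enough to sample.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    forbidden = set("oil1")
    letters = sorted(set(string.ascii_lowercase) - forbidden)
    allowed = sorted((set(string.ascii_lowercase) | set("123456789")) - forbidden)

    def mint() -> str:
        # First character is a letter; the rest come from the full alphabet.
        return rng.choice(letters) + "".join(rng.choice(allowed) for _ in range(length - 1))

    # Build the pool of distinct "existing" IDs.
    existing: set[str] = set()
    while len(existing) < existing_count:
        existing.add(mint())

    # Draw fresh IDs and count how many land on an existing one.
    hits = sum(mint() in existing for _ in range(trials))

    space = len(letters) * len(allowed) ** (length - 1)
    return hits / trials, existing_count / space


observed, predicted = simulate_collision_rate(existing_count=2_000, trials=50_000)
print(f"observed:  {observed:.4f}")
print(f"predicted: {predicted:.4f}")
```

With enough trials the two values should agree to within sampling noise, which supports using the simple ratio for the full-size space.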
