# ID Minter Migration Demo

This notebook demonstrates the complete migration process for the stable identifiers proposal, including:

1. **Current schema** - The existing 1:1 mapping between source identifiers and canonical IDs
2. **Sample data** - Loading data representing the current state of the catalogue pipeline
3. **Data migration** - Steps to migrate to the new schema with separate `canonical_ids` and `identifiers` tables
4. **ID pre-generation** - How to maintain a pool of free IDs for efficient minting
5. **Batch operations** - Efficient batch lookup and combined lookup/mint workflow
6. **Predecessor inheritance** - How new records inherit canonical IDs from predecessors
7. **Alias discovery** - How to identify which source identifiers share a canonical ID
8. **New record minting** - Minting brand new records from the pre-generated pool

## Prerequisites

- Docker installed
- [uv](https://docs.astral.sh/uv/) installed

## Setup

```bash
cd /Users/kennyr/workspace/docs/rfcs/XXX-stable_identifiers

# Install dependencies and create virtual environment
uv sync

# Start the MySQL container
docker-compose up -d
```

Then select the `.venv` Python interpreter for this notebook.

In [78]:
import pymysql
import random
from datetime import datetime, timedelta
from typing import Optional, Tuple
import time

from id_minter import generate_canonical_id

# Database connection settings
DB_CONFIG = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': 'rootpassword',
    'database': 'id_minter'
}

def get_connection():
    """Get a database connection."""
    return pymysql.connect(**DB_CONFIG, cursorclass=pymysql.cursors.DictCursor)

def execute_query(query: str, params: Optional[tuple] = None, fetch: bool = False):
    """Execute a query and optionally fetch results."""
    conn = get_connection()
    try:
        with conn.cursor() as cursor:
            cursor.execute(query, params)
            if fetch:
                return cursor.fetchall()
            conn.commit()
            return cursor.rowcount
    finally:
        conn.close()

# Test connection
try:
    conn = get_connection()
    conn.close()
    print("✓ Connected to MySQL successfully")
except Exception as e:
    print(f"✗ Connection failed: {e}")
    print("\nMake sure docker-compose is running:")
    print("  cd /Users/kennyr/workspace/docs/rfcs/XXX-stable_identifiers")
    print("  docker-compose up -d")

✓ Connected to MySQL successfully


## Step 1: Create the Current Schema

This is the existing schema with a 1:1 relationship between canonical IDs and source identifiers. The primary key on `CanonicalId` allows only one row per canonical ID, and the `UniqueFromSource` key allows only one row per source identifier.

In [79]:
# Drop existing tables if they exist (for demo reset)
execute_query("DROP TABLE IF EXISTS identifiers")
execute_query("DROP TABLE IF EXISTS identifiers_old")
execute_query("DROP TABLE IF EXISTS canonical_ids")

# Create the current (legacy) schema
current_schema = """
CREATE TABLE identifiers (
    CanonicalId VARCHAR(255) NOT NULL,
    OntologyType VARCHAR(255) NOT NULL,
    SourceId VARCHAR(255) NOT NULL,
    SourceSystem VARCHAR(255) NOT NULL,
    PRIMARY KEY (CanonicalId),
    UNIQUE KEY UniqueFromSource (OntologyType, SourceSystem, SourceId)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
"""

execute_query(current_schema)
print("✓ Current schema created")

# Verify the schema
result = execute_query("DESCRIBE identifiers", fetch=True)
print("\nCurrent 'identifiers' table structure:")
for row in result:
    print(f"  {row['Field']:20} {row['Type']:20} {row['Key']}")

✓ Current schema created

Current 'identifiers' table structure:
  CanonicalId          varchar(255)         PRI
  OntologyType         varchar(255)         MUL
  SourceId             varchar(255)         
  SourceSystem         varchar(255)         


## Step 2: Load Sample Data

Load the sample identifiers from the CSV file representing the current state of the catalogue pipeline.

In [80]:
import csv

# Load sample data from CSV
csv_path = '/Users/kennyr/workspace/docs/rfcs/XXX-stable_identifiers/identifiers_sample.csv'

sample_data = []
with open(csv_path, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        sample_data.append({
            'canonical_id': row['CanonicalId'],
            'ontology_type': row['OntologyType'],
            'source_system': row['SourceSystem'],
            'source_id': row['SourceId']
        })

print(f"✓ Loaded {len(sample_data)} records from CSV")

# Count by source system
source_counts = {}
for record in sample_data:
    system = record['source_system']
    source_counts[system] = source_counts.get(system, 0) + 1

print("\nRecords by source system:")
for system, count in sorted(source_counts.items(), key=lambda x: -x[1]):
    print(f"  {system:30} {count:6} records")

print(f"\nSample records:")
for record in sample_data[:3]:
    print(f"  {record['canonical_id']} <- {record['source_system']}/{record['source_id']}")

✓ Loaded 10000 records from CSV

Records by source system:
  mets-image                       7781 records
  sierra-system-number             1613 records
  miro-image-number                 136 records
  label-derived                     103 records
  calm-record-id                     80 records
  mets                               67 records
  lc-names                           66 records
  ebsco-alt-lookup                   50 records
  calm-ref-no                        48 records
  library-of-congress-names          20 records
  lc-subjects                        17 records
  nlm-mesh                           12 records
  medical-subject-headings            3 records
  tei-manuscript-id                   2 records
  library-of-congress-subject-headings      2 records

Sample records:
  gbum7y2b <- sierra-system-number/1890040
  be99823c <- mets-image/b28065037/FILE_0175_OBJECTS
  thhp9d2t <- mets-image/b22372210/FILE_0045_OBJECTS


In [81]:
# Insert sample data into current schema
conn = get_connection()
cursor = conn.cursor()

insert_query = """
INSERT INTO identifiers (CanonicalId, OntologyType, SourceSystem, SourceId)
VALUES (%s, %s, %s, %s)
"""

for record in sample_data:
    cursor.execute(insert_query, (
        record['canonical_id'],
        record['ontology_type'],
        record['source_system'],
        record['source_id']
    ))

conn.commit()
cursor.close()
conn.close()

print(f"✓ Inserted {len(sample_data)} records into current schema")

# Verify counts
result = execute_query("""
    SELECT SourceSystem, COUNT(*) as count 
    FROM identifiers 
    GROUP BY SourceSystem
    ORDER BY count DESC
""", fetch=True)

print("\nRecords by source system:")
for row in result:
    print(f"  {row['SourceSystem']:30} {row['count']:6} records")

✓ Inserted 10000 records into current schema

Records by source system:
  mets-image                       7781 records
  sierra-system-number             1613 records
  miro-image-number                 136 records
  label-derived                     103 records
  calm-record-id                     80 records
  mets                               67 records
  lc-names                           66 records
  ebsco-alt-lookup                   50 records
  calm-ref-no                        48 records
  library-of-congress-names          20 records
  lc-subjects                        17 records
  nlm-mesh                           12 records
  medical-subject-headings            3 records
  library-of-congress-subject-headings      2 records
  tei-manuscript-id                   2 records


## Step 3: Run the Migration

Migrate to the new schema with:
- `canonical_ids` table for ID registry and pre-generation
- `identifiers` table allowing multiple source IDs per canonical ID

In [82]:
# Step 3a: Create the new canonical_ids table
print("Creating canonical_ids table...")

canonical_ids_schema = """
CREATE TABLE canonical_ids (
    CanonicalId VARCHAR(8) NOT NULL PRIMARY KEY,
    Status ENUM('free', 'assigned') NOT NULL DEFAULT 'free',
    CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_free (Status, CanonicalId)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
"""

execute_query(canonical_ids_schema)
print("✓ canonical_ids table created")

# Verify
result = execute_query("DESCRIBE canonical_ids", fetch=True)
print("\n'canonical_ids' table structure:")
for row in result:
    print(f"  {row['Field']:15} {row['Type']:30} {row['Key']}")

Creating canonical_ids table...
✓ canonical_ids table created

'canonical_ids' table structure:
  CanonicalId     varchar(8)                     PRI
  Status          enum('free','assigned')        MUL
  CreatedAt       timestamp                      


In [83]:
# Step 3b: Populate canonical_ids from existing identifiers
print("Populating canonical_ids from existing data...")

populate_query = """
INSERT INTO canonical_ids (CanonicalId, Status)
SELECT DISTINCT CanonicalId, 'assigned' FROM identifiers
"""

rows_inserted = execute_query(populate_query)
print(f"✓ Inserted {rows_inserted} canonical IDs")

# Verify
result = execute_query("SELECT COUNT(*) as count FROM canonical_ids", fetch=True)
print(f"\nTotal canonical IDs: {result[0]['count']}")

result = execute_query("SELECT * FROM canonical_ids LIMIT 5", fetch=True)
print("\nSample canonical_ids records:")
for row in result:
    print(f"  {row['CanonicalId']}  {row['Status']:10}  {row['CreatedAt']}")

Populating canonical_ids from existing data...
✓ Inserted 10000 canonical IDs

Total canonical IDs: 10000

Sample canonical_ids records:
  a22kzjax  assigned    2026-02-04 15:43:05
  a22yfu6v  assigned    2026-02-04 15:43:05
  a23v5jrt  assigned    2026-02-04 15:43:05
  a24b67g9  assigned    2026-02-04 15:43:05
  a26ap7qq  assigned    2026-02-04 15:43:05


In [84]:
# Step 3c: Create new identifiers table with updated schema
print("Creating new identifiers table...")

new_identifiers_schema = """
CREATE TABLE identifiers_new (
    OntologyType VARCHAR(255) NOT NULL,
    SourceSystem VARCHAR(255) NOT NULL,
    SourceId VARCHAR(255) NOT NULL,
    CanonicalId VARCHAR(8) NOT NULL,
    CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (OntologyType, SourceSystem, SourceId),
    FOREIGN KEY (CanonicalId) REFERENCES canonical_ids(CanonicalId),
    INDEX idx_canonical (CanonicalId)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
"""

execute_query(new_identifiers_schema)
print("✓ identifiers_new table created")

# Verify
result = execute_query("DESCRIBE identifiers_new", fetch=True)
print("\n'identifiers_new' table structure:")
for row in result:
    print(f"  {row['Field']:15} {row['Type']:30} {row['Key']}")

Creating new identifiers table...
✓ identifiers_new table created

'identifiers_new' table structure:
  OntologyType    varchar(255)                   PRI
  SourceSystem    varchar(255)                   PRI
  SourceId        varchar(255)                   PRI
  CanonicalId     varchar(8)                     MUL
  CreatedAt       timestamp                      


In [85]:
# Step 3d: Copy data to new identifiers table
print("Copying data to new identifiers table...")

copy_query = """
INSERT INTO identifiers_new (OntologyType, SourceSystem, SourceId, CanonicalId)
SELECT OntologyType, SourceSystem, SourceId, CanonicalId FROM identifiers
"""

rows_copied = execute_query(copy_query)
print(f"✓ Copied {rows_copied} records")

# Step 3e: Swap tables (atomic rename)
print("\nSwapping tables...")
execute_query("RENAME TABLE identifiers TO identifiers_old, identifiers_new TO identifiers")
print("✓ Tables swapped")

# Verify final state
result = execute_query("SELECT COUNT(*) as count FROM identifiers", fetch=True)
print(f"\nRecords in new 'identifiers' table: {result[0]['count']}")

result = execute_query("SHOW TABLES", fetch=True)
print("\nTables in database:")
for row in result:
    print(f"  {list(row.values())[0]}")

Copying data to new identifiers table...
✓ Copied 10000 records

Swapping tables...
✓ Tables swapped

Records in new 'identifiers' table: 10000

Tables in database:
  canonical_ids
  identifiers
  identifiers_old
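
To see what the restructuring changes in practice, here is a minimal sketch that inserts two source identifiers sharing one canonical ID into both table shapes. SQLite stands in for MySQL only so the snippet runs without the Docker container; the foreign key and `CreatedAt` columns are omitted, and `AC-123` is a hypothetical Axiell identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Legacy shape: primary key on CanonicalId (one source identifier per canonical ID)
conn.execute("""
CREATE TABLE identifiers_old (
    CanonicalId  TEXT NOT NULL PRIMARY KEY,
    OntologyType TEXT NOT NULL,
    SourceSystem TEXT NOT NULL,
    SourceId     TEXT NOT NULL,
    UNIQUE (OntologyType, SourceSystem, SourceId)
)""")

# New shape: primary key on the source identifier, CanonicalId may repeat
conn.execute("""
CREATE TABLE identifiers (
    OntologyType TEXT NOT NULL,
    SourceSystem TEXT NOT NULL,
    SourceId     TEXT NOT NULL,
    CanonicalId  TEXT NOT NULL,
    PRIMARY KEY (OntologyType, SourceSystem, SourceId)
)""")

def try_insert(table: str, row: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            f"INSERT INTO {table} (CanonicalId, OntologyType, SourceSystem, SourceId) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        return True
    except sqlite3.IntegrityError:
        return False

sierra = ("nvkdnjxp", "Work", "sierra-system-number", "1001768")
axiell = ("nvkdnjxp", "Work", "axiell-collections-id", "AC-123")  # hypothetical alias

old_results = [try_insert("identifiers_old", r) for r in (sierra, axiell)]
new_results = [try_insert("identifiers", r) for r in (sierra, axiell)]
print(old_results)  # [True, False]
print(new_results)  # [True, True]
```

The legacy table rejects the second row because `CanonicalId` must be unique, while the new table accepts it, which is what predecessor inheritance relies on.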


## Step 4: Pre-generate Free IDs

Maintain a pool of pre-generated IDs to eliminate collision checking during minting.

In [86]:
def pre_generate_ids(count: int) -> int:
    """Pre-generate a batch of free IDs and return how many were inserted."""
    generated = 0
    attempts = 0
    max_attempts = count * 2  # Allow for some collisions

    conn = get_connection()
    try:
        with conn.cursor() as cursor:
            while generated < count and attempts < max_attempts:
                new_id = generate_canonical_id()
                # INSERT IGNORE skips IDs that already exist, so a collision
                # shows up as rowcount == 0 rather than as an exception.
                cursor.execute(
                    "INSERT IGNORE INTO canonical_ids (CanonicalId, Status) VALUES (%s, 'free')",
                    (new_id,)
                )
                if cursor.rowcount > 0:
                    generated += 1
                attempts += 1
        conn.commit()
    finally:
        # Always release the connection, even if an insert fails
        conn.close()

    return generated

# Pre-generate 100 free IDs
print("Pre-generating free IDs...")
generated = pre_generate_ids(100)
print(f"✓ Generated {generated} new free IDs")

# Check pool status
result = execute_query("""
    SELECT Status, COUNT(*) as count 
    FROM canonical_ids 
    GROUP BY Status
""", fetch=True)

print("\nCanonical ID pool status:")
for row in result:
    print(f"  {row['Status']:10} {row['count']:5} IDs")

Pre-generating free IDs...
✓ Generated 100 new free IDs

Canonical ID pool status:
  free         100 IDs
  assigned   10000 IDs
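
`generate_canonical_id()` is imported from `id_minter` and its implementation is not shown in this notebook. As a rough illustration only, here is a minimal sketch of what such a generator might look like. The alphabet (lowercase letters and digits without the look-alikes `0`, `1`, `i`, `l`, `o`) and the letter-first rule are assumptions inferred from the sample IDs above, and `sketch_canonical_id` is a hypothetical name.

```python
import secrets
import string

# Assumption: look-alike characters are excluded from canonical IDs
_AMBIGUOUS = set("01ilo")
_LETTERS = [c for c in string.ascii_lowercase if c not in _AMBIGUOUS]
_ALPHANUM = _LETTERS + [c for c in string.digits if c not in _AMBIGUOUS]

def sketch_canonical_id(length: int = 8) -> str:
    """Return a random candidate ID: one letter followed by letters or digits."""
    first = secrets.choice(_LETTERS)  # assumption: IDs start with a letter
    rest = "".join(secrets.choice(_ALPHANUM) for _ in range(length - 1))
    return first + rest

print(sketch_canonical_id())
```

A generator like this can still produce duplicates, which is why the pre-generation step relies on `INSERT IGNORE` and the primary key on `canonical_ids` to discard them.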


## Step 5a: Test Batch Lookup

The ID Minter supports batch lookup for efficient processing. A single query can look up many source identifiers across **mixed ontology types** (Works, Images, etc.) — returning only those that exist.

This is the optimized hot path, because most records being processed already have canonical IDs. Any identifiers that are missing are then processed individually via `mint_id()`, which handles predecessor inheritance and pool claiming.

In [87]:
# Reload the module to pick up changes
import importlib
import id_minter
importlib.reload(id_minter)
from id_minter import IDMinter

# Instantiate the minter with our connection factory
minter = IDMinter(get_connection)
print("✓ ID Minter initialized")

# Test batch lookup with a mix of existing and non-existing IDs
print("\n" + "="*60)
print("Testing batch lookup (mixed ontology types)...")
print("="*60)

# Get some existing Sierra IDs to test with
existing_sierra = execute_query("""
    SELECT SourceId FROM identifiers 
    WHERE SourceSystem = 'sierra-system-number' AND OntologyType = 'Work'
    LIMIT 3
""", fetch=True)

# Create test batch: mix of ontology types (Work + Image) and existing + non-existing
test_batch = [
    ('Work', 'sierra-system-number', existing_sierra[0]['SourceId']),
    ('Work', 'sierra-system-number', existing_sierra[1]['SourceId']),
    ('Image', 'sierra-system-number', existing_sierra[2]['SourceId']),  # Different ontology type (won't exist)
    ('Work', 'axiell-collections-id', 'DOES-NOT-EXIST'),  # Non-existent source ID
]

print(f"\nLooking up {len(test_batch)} source IDs (mixed ontology types)...")
found = minter.lookup_ids(test_batch)

print(f"Found {len(found)} existing canonical IDs:")
for (ontology, system, sid), cid in found.items():
    print(f"  {ontology}/{system}/{sid} -> {cid}")

missing = [key for key in test_batch if key not in found]
print(f"\nNot found (would need minting): {len(missing)}")
for ontology, system, sid in missing:
    print(f"  {ontology}/{system}/{sid}")

✓ ID Minter initialized

Testing batch lookup (mixed ontology types)...

Looking up 4 source IDs (mixed ontology types)...
Found 2 existing canonical IDs:
  Work/sierra-system-number/1001768 -> nvkdnjxp
  Work/sierra-system-number/1007167 -> swbrj79k

Not found (would need minting): 2
  Image/sierra-system-number/1007828
  Work/axiell-collections-id/DOES-NOT-EXIST
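
The internals of `lookup_ids()` are not shown here. One way a single-query lookup across mixed ontology types could be written is with a MySQL row-constructor `IN` clause over the composite primary key. The helper below, `build_lookup_query`, is a hypothetical name and only a sketch of that idea; it builds the SQL and parameters without touching the database.

```python
from typing import Iterable, List, Tuple

# (ontology_type, source_system, source_id)
SourceKey = Tuple[str, str, str]

def build_lookup_query(keys: Iterable[SourceKey]) -> Tuple[str, List[str]]:
    """Build one SELECT that matches any of the given source identifiers."""
    keys = list(keys)
    if not keys:
        raise ValueError("at least one source identifier is required")
    # One "(%s, %s, %s)" group per key; values are passed separately as parameters
    placeholders = ", ".join(["(%s, %s, %s)"] * len(keys))
    sql = (
        "SELECT OntologyType, SourceSystem, SourceId, CanonicalId "
        "FROM identifiers "
        f"WHERE (OntologyType, SourceSystem, SourceId) IN ({placeholders})"
    )
    params = [value for key in keys for value in key]
    return sql, params

sql, params = build_lookup_query([
    ("Work", "sierra-system-number", "1001768"),
    ("Image", "sierra-system-number", "1007828"),
])
print(sql)
print(params)
```

The result could be passed to `cursor.execute(sql, params)`. Rows that do not exist are simply absent from the result set, which matches the "returning only those that exist" behaviour described above.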


## Step 5b: Test Batch Mint

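The `mint_ids()` method combines batch lookup with individual minting. It takes a list of `(source_id, predecessor_or_none)` tuples, where both the source identifier and the optional predecessor are themselves `(ontology_type, source_system, source_id)` tuples: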

```python
results = minter.mint_ids([
    (('Work', 'axiell', 'AC-123'), ('Work', 'sierra', 'b1234')),  # with predecessor
    (('Image', 'mets', 'xyz'), None),  # no predecessor
])
```

This optimizes the common case (most IDs already exist) while handling predecessor relationships correctly — including predecessors with different ontology types.

In [95]:
# Reload the module to pick up changes
importlib.reload(id_minter)
from id_minter import IDMinter
minter = IDMinter(get_connection)

# Test mint_ids() - batch lookup + individual minting for missing IDs
print("="*60)
print("Testing mint_ids() with new signature...")
print("="*60)

# Get some existing Sierra IDs
existing_works = execute_query("""
    SELECT SourceId, CanonicalId FROM identifiers 
    WHERE SourceSystem = 'sierra-system-number' AND OntologyType = 'Work'
    LIMIT 2
""", fetch=True)

# Create requests as (source_id, predecessor_or_none) tuples:
# - 2 existing Work IDs (should be found via batch lookup)
# - 1 new Axiell ID with predecessor (should inherit canonical ID)
# - 1 brand new ID (should claim from pool)
requests = [
    (('Work', 'sierra-system-number', existing_works[0]['SourceId']), None),  # Exists
    (('Work', 'sierra-system-number', existing_works[1]['SourceId']), None),  # Exists
    (('Work', 'axiell-collections-id', f'AC-BATCH-{random.randint(100000, 999999)}'), 
     ('Work', 'sierra-system-number', existing_works[0]['SourceId'])),  # New with predecessor
    (('Work', 'axiell-collections-id', f'AC-BRAND-NEW-{random.randint(100000, 999999)}'), None),  # Brand new
]

print(f"\nProcessing {len(requests)} requests:")
for i, (source_id, predecessor) in enumerate(requests):
    ont, sys, sid = source_id
    pred_str = f" <- {predecessor[0]}/{predecessor[1]}/{predecessor[2]}" if predecessor else ""
    print(f"  {i+1}. {ont}/{sys}/{sid}{pred_str}")

# Call mint_ids with new signature
results = minter.mint_ids(requests)

print(f"\nResults:")
for (ont, sys, sid), cid in results.items():
    # Check if this was an existing ID or newly minted
    was_existing = any(
        sys == 'sierra-system-number' and sid == w['SourceId'] 
        for w in existing_works
    )
    status = "✓ found" if was_existing else "✓ minted"
    
    # Check predecessor inheritance
    pred = next((r[1] for r in requests if r[0] == (ont, sys, sid) and r[1]), None)
    if pred:
        expected_cid = existing_works[0]['CanonicalId']
        if cid == expected_cid:
            status = "✓ inherited"
        else:
            status = "✗ wrong ID!"
    
    print(f"  {ont}/{sys}/{sid} -> {cid} ({status})")

Testing mint_ids() with new signature...

Processing 4 requests:
  1. Work/sierra-system-number/1001768
  2. Work/sierra-system-number/1007167
  3. Work/axiell-collections-id/AC-BATCH-399846 <- Work/sierra-system-number/1001768
  4. Work/axiell-collections-id/AC-BRAND-NEW-687806

Results:
  Work/sierra-system-number/1001768 -> nvkdnjxp (✓ found)
  Work/sierra-system-number/1007167 -> swbrj79k (✓ found)
  Work/axiell-collections-id/AC-BATCH-399846 -> nvkdnjxp (✓ inherited)
  Work/axiell-collections-id/AC-BRAND-NEW-687806 -> bawutxdn (✓ minted)
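
The control flow described in this step can be modelled in memory. The sketch below is not the `IDMinter` implementation: a dict stands in for the `identifiers` table, a list stands in for the free pool, and `mint_ids_sketch` is a hypothetical name. Falling back to the pool when a predecessor is unknown is an assumption, not confirmed behaviour.

```python
from typing import Dict, List, Optional, Tuple

SourceKey = Tuple[str, str, str]  # (ontology_type, source_system, source_id)
Request = Tuple[SourceKey, Optional[SourceKey]]

def mint_ids_sketch(
    requests: List[Request],
    identifiers: Dict[SourceKey, str],
    free_pool: List[str],
) -> Dict[SourceKey, str]:
    # Batch step: resolve every identifier that already exists
    results = {key: identifiers[key] for key, _ in requests if key in identifiers}

    # Individual step: handle only the identifiers the batch lookup missed
    for key, predecessor in requests:
        if key in results:
            continue
        if predecessor is not None and predecessor in identifiers:
            canonical_id = identifiers[predecessor]  # inherit from predecessor
        else:
            # Assumption: no (known) predecessor means claiming from the pool
            canonical_id = free_pool.pop()
        identifiers[key] = canonical_id
        results[key] = canonical_id
    return results

identifiers = {("Work", "sierra-system-number", "1001768"): "nvkdnjxp"}
free_pool = ["bawutxdn"]
results = mint_ids_sketch(
    [
        (("Work", "sierra-system-number", "1001768"), None),
        (("Work", "axiell-collections-id", "AC-123"),
         ("Work", "sierra-system-number", "1001768")),
        (("Work", "axiell-collections-id", "AC-456"), None),
    ],
    identifiers,
    free_pool,
)
for key, canonical_id in results.items():
    print("/".join(key), "->", canonical_id)
```

Here the first request is found, the second inherits `nvkdnjxp` from its Sierra predecessor, and the third takes the only ID left in the pool.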


## Step 6: Demonstrate Predecessor Inheritance

Simulate the migration by creating new Axiell Collections records that inherit canonical IDs from existing Sierra records.

In [94]:
# Get some existing Sierra Work records to use as predecessors
sierra_records = execute_query("""
    SELECT SourceId, CanonicalId FROM identifiers 
    WHERE SourceSystem = 'sierra-system-number' AND OntologyType = 'Work'
    LIMIT 5
""", fetch=True)

print("Existing Sierra Work records that will be migrated:\n")
for rec in sierra_records:
    print(f"  sierra-system-number/{rec['SourceId']} -> {rec['CanonicalId']}")

print("\n" + "="*60)
print("Simulating migration to Axiell Collections...")
print("="*60 + "\n")

# Simulate Axiell Collections records referencing Sierra predecessors
for i, sierra_rec in enumerate(sierra_records):
    axiell_id = f"AC-{random.randint(100000, 999999)}"
    
    print(f"\nMigrating record {i+1}:")
    print(f"  New Axiell ID: axiell-collections-id/{axiell_id}")
    print(f"  Predecessor: Work/sierra-system-number/{sierra_rec['SourceId']}")
    
    canonical_id = minter.mint_id(
        ontology_type='Work',
        source_system='axiell-collections-id',
        source_id=axiell_id,
        predecessor=('Work', 'sierra-system-number', sierra_rec['SourceId'])  # Full tuple
    )
    
    # Verify it inherited the same canonical ID
    assert canonical_id == sierra_rec['CanonicalId'], "Should inherit predecessor's canonical ID!"
    print(f"  Result: Same canonical ID preserved! ✓")

Existing Sierra Work records that will be migrated:

  sierra-system-number/1001768 -> nvkdnjxp
  sierra-system-number/1007167 -> swbrj79k
  sierra-system-number/1007828 -> uxma4wqh
  sierra-system-number/1009992 -> w67tu2mw
  sierra-system-number/1011469 -> cz3nmzg2

Simulating migration to Axiell Collections...


Migrating record 1:
  New Axiell ID: axiell-collections-id/AC-561973
  Predecessor: Work/sierra-system-number/1001768
  Result: Same canonical ID preserved! ✓

Migrating record 2:
  New Axiell ID: axiell-collections-id/AC-102107
  Predecessor: Work/sierra-system-number/1007167
  Result: Same canonical ID preserved! ✓

Migrating record 3:
  New Axiell ID: axiell-collections-id/AC-470629
  Predecessor: Work/sierra-system-number/1007828
  Result: Same canonical ID preserved! ✓

Migrating record 4:
  New Axiell ID: axiell-collections-id/AC-501598
  Predecessor: Work/sierra-system-number/1009992
  Result: Same canonical ID preserved! ✓

Migrating record 5:
  New Axiell ID: axiell

## Step 7: Discover Aliases

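This query finds canonical IDs that are shared by more than one source identifier, then labels each source identifier by creation timestamp: the row with the earliest `CreatedAt` is the original, and later rows are aliases.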

In [89]:
# Find canonical IDs with multiple source identifiers (aliases)
alias_query = """
SELECT 
    CanonicalId, 
    COUNT(*) as source_count
FROM identifiers 
GROUP BY CanonicalId 
HAVING COUNT(*) > 1
ORDER BY source_count DESC
"""

aliased_ids = execute_query(alias_query, fetch=True)

print(f"Found {len(aliased_ids)} canonical IDs with multiple source identifiers:\n")

# Show details for each aliased canonical ID
for item in aliased_ids[:5]:  # Show first 5
    canonical_id = item['CanonicalId']
    
    # Get all source identifiers for this canonical ID with alias status
    details = execute_query("""
        SELECT 
            i.*,
            CASE WHEN i.CreatedAt = earliest.MinCreatedAt THEN 'Original' ELSE 'Alias' END AS Status
        FROM identifiers i
        JOIN (
            SELECT CanonicalId, MIN(CreatedAt) AS MinCreatedAt 
            FROM identifiers 
            GROUP BY CanonicalId
        ) earliest ON i.CanonicalId = earliest.CanonicalId
        WHERE i.CanonicalId = %s
        ORDER BY i.CreatedAt
    """, (canonical_id,), fetch=True)
    
    print(f"Canonical ID: {canonical_id}")
    print("-" * 60)
    for d in details:
        print(f"  [{d['Status']:8}] {d['SourceSystem']}/{d['SourceId']}")
        print(f"           Created: {d['CreatedAt']}")
    print()

Found 5 canonical IDs with multiple source identifiers:

Canonical ID: cz3nmzg2
------------------------------------------------------------
  [Original] sierra-system-number/1011469
           Created: 2026-02-04 15:43:15
  [Alias   ] axiell-collections-id/AC-877528
           Created: 2026-02-04 15:43:28

Canonical ID: nvkdnjxp
------------------------------------------------------------
  [Original] sierra-system-number/1001768
           Created: 2026-02-04 15:43:15
  [Alias   ] axiell-collections-id/AC-886937
           Created: 2026-02-04 15:43:28

Canonical ID: swbrj79k
------------------------------------------------------------
  [Original] sierra-system-number/1007167
           Created: 2026-02-04 15:43:15
  [Alias   ] axiell-collections-id/AC-366490
           Created: 2026-02-04 15:43:28

Canonical ID: uxma4wqh
------------------------------------------------------------
  [Original] sierra-system-number/1007828
           Created: 2026-02-04 15:43:15
  [Alias   ] axiell-c

## Step 8: Mint New IDs (No Predecessor)

Demonstrate minting brand new records that don't have predecessors - these claim IDs from the pre-generated pool.

In [90]:
# Check pool status before
pool_before = execute_query("""
    SELECT Status, COUNT(*) as count 
    FROM canonical_ids 
    GROUP BY Status
""", fetch=True)

print("Pool status before minting:\n")
for row in pool_before:
    print(f"  {row['Status']:10} {row['count']:5} IDs")

print("\n" + "="*60)
print("Minting 5 brand new Axiell Collections records (no predecessor)")
print("="*60 + "\n")

new_records = []
for i in range(5):
    axiell_id = f"AC-NEW-{random.randint(100000, 999999)}"
    
    canonical_id = minter.mint_id(
        ontology_type='Work',
        source_system='axiell-collections-id',
        source_id=axiell_id
        # No predecessor - this is a brand new record
    )
    
    new_records.append({
        'source_id': axiell_id,
        'canonical_id': canonical_id
    })

# Check pool status after
pool_after = execute_query("""
    SELECT Status, COUNT(*) as count 
    FROM canonical_ids 
    GROUP BY Status
""", fetch=True)

print("\nPool status after minting:\n")
for row in pool_after:
    print(f"  {row['Status']:10} {row['count']:5} IDs")

# The free count should have decreased by 5
free_before = next(r['count'] for r in pool_before if r['Status'] == 'free')
free_after = next(r['count'] for r in pool_after if r['Status'] == 'free')
print(f"\n✓ Free IDs consumed: {free_before - free_after}")

Pool status before minting:

  free         100 IDs
  assigned   10000 IDs

Minting 5 brand new Axiell Collections records (no predecessor)


Pool status after minting:

  free          95 IDs
  assigned   10005 IDs

✓ Free IDs consumed: 5
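
How `mint_id()` claims an ID from the pool is not shown in this notebook. A common pattern for this kind of claim, sketched below under the assumption of MySQL 8.0 or later and a PyMySQL connection using `DictCursor` with autocommit off, is to lock one free row with `FOR UPDATE SKIP LOCKED` so that concurrent minters pick different rows instead of waiting on each other. The function name `claim_free_id` is hypothetical.

```python
def claim_free_id(conn, ontology_type: str, source_system: str, source_id: str) -> str:
    """Claim one free canonical ID and bind it to a source identifier."""
    with conn.cursor() as cursor:
        # Lock a single free row; rows locked by other transactions are skipped
        cursor.execute(
            "SELECT CanonicalId FROM canonical_ids "
            "WHERE Status = 'free' LIMIT 1 FOR UPDATE SKIP LOCKED"
        )
        row = cursor.fetchone()
        if row is None:
            conn.rollback()
            raise RuntimeError("free ID pool is empty; pre-generate more IDs")
        canonical_id = row["CanonicalId"]

        # Mark the ID as used and record the mapping in the same transaction
        cursor.execute(
            "UPDATE canonical_ids SET Status = 'assigned' WHERE CanonicalId = %s",
            (canonical_id,),
        )
        cursor.execute(
            "INSERT INTO identifiers (OntologyType, SourceSystem, SourceId, CanonicalId) "
            "VALUES (%s, %s, %s, %s)",
            (ontology_type, source_system, source_id, canonical_id),
        )
    conn.commit()
    return canonical_id
```

Because the select, update, and insert share one transaction, a failure before `commit()` leaves the ID free for another attempt.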


## Summary

This notebook demonstrated the complete migration process:

1. **Current schema** - 1:1 mapping with PK on `CanonicalId`
2. **Data loading** - Sample data from the catalogue pipeline
3. **Migration** - Create `canonical_ids` table, restructure `identifiers` table
4. **Pre-generation** - Pool of free IDs for efficient minting
5. **Batch operations**:
   - `lookup_ids()` - Single query to fetch multiple existing canonical IDs
   - `mint_ids()` - Combined batch lookup + individual mint for missing IDs
6. **Predecessor inheritance** - Migrated records inherit canonical IDs
7. **Alias discovery** - Query to find linked source identifiers by `CreatedAt`
8. **New record minting** - Claims from pre-generated pool

### Key Benefits Demonstrated:
- ✓ Stable canonical IDs across source system migrations
- ✓ Batch lookup for efficient processing of existing records
- ✓ Combined batch lookup + individual mint via `mint_ids()`
- ✓ Single-record minting keeps predecessor handling simple
- ✓ No collision retry loops (pre-generated IDs)  
- ✓ Full provenance via `CreatedAt` timestamps
- ✓ Support for future migrations (alias chaining)
- ✓ `SKIP LOCKED` enables concurrent minting without contention

## Cleanup

Run this to stop and remove the Docker container:

In [33]:
# Close connection and stop Docker container
if 'conn' in dir() and conn and conn.open:
    conn.close()
    print("✓ Database connection closed")

if 'minter' in dir() and minter.conn and minter.conn.open:
    minter.conn.close()
    print("✓ Minter connection closed")

# Stop and remove the container; the -v flag also removes the volume
!docker compose down -v
print("✓ Docker container stopped")

✓ Minter connection closed
[+] down 0/1
 Container id-minter-mysql Stopping