# Document ID Migration - Option 4 (Hybrid System)

**Purpose:** Migrate document IDs from the old format to the new format defined by the Hybrid System.

**New format:** `{type_code}-{number}/{year}#{hash_short}`

**Examples:**
- `bidding_untitled` → `FORM-Bidding/2025#bee720`
- `circular_untitled` → `TT-Circular/2025#3be8b6`
- `decree_untitled` → `ND-Decree/2025#95b863`

**Benefits:**
- ✅ Human-readable and machine-friendly
- ✅ Compatible with Vietnamese legal document numbering conventions
- ✅ Uniqueness backed by a short hash suffix
- ✅ Easy to query and maintain
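
For illustration, an ID in this format can be split back into its parts with a regular expression. This is a minimal sketch: the `parse_document_id` helper and the pattern are assumptions inferred from the examples above (uppercase type code, four-digit year, six hex digits), not part of the migration code.

```python
import re
from typing import Optional

# Hypothetical pattern inferred from the examples above:
# {type_code}-{number}/{year}#{hash_short}, with a 6-character hex hash.
DOCUMENT_ID_PATTERN = re.compile(
    r"^(?P<type_code>[A-Z]+)-(?P<number>[^/#]+)/(?P<year>\d{4})#(?P<hash_short>[0-9a-f]{6})$"
)


def parse_document_id(document_id: str) -> Optional[dict]:
    """Return the ID's components, or None if it does not match the format."""
    match = DOCUMENT_ID_PATTERN.match(document_id)
    return match.groupdict() if match else None


print(parse_document_id("ND-Decree/2025#95b863"))
print(parse_document_id("decree_untitled"))  # old-format IDs do not match
```

A check like this could also serve as a stricter alternative to a `LIKE '%#%'` test when deciding whether an ID is already in the new format.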

---

## Migration Process

1. **Setup & Connect** - Connect to the database
2. **Preview** - Preview the changes
3. **Backup** - Back up the metadata
4. **Execute** - Run the migration
5. **Verify** - Check the results
6. **Test API** - Test against the API endpoints

## Step 1: Setup & Import Libraries

In [1]:
import psycopg
import json
import hashlib
from datetime import datetime
from pathlib import Path
import sys

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.config.models import settings

print("✅ Libraries imported successfully")
print(f"📁 Project root: {project_root}")
print(f"🔗 Database: {settings.database_url.split('@')[1] if '@' in settings.database_url else 'configured'}")

✅ Libraries imported successfully
📁 Project root: /home/sakana/Code/RAG-bidding
🔗 Database: localhost:5432/rag_bidding_v2


## Step 2: Define Migration Functions

In [5]:
def extract_metadata_from_old_id(old_id: str, metadata: dict) -> dict:
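    """Extract or infer metadata from old_id and cmetadata."""
    
    # Get document_type from metadata, or infer it from old_id.
    # Note: "decision" is not among the substrings checked below, so an ID such as
    # decision_untitled falls through to "document" (type code DOC).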
    doc_type = metadata.get("document_type")
    if not doc_type and old_id:
        if "bidding" in old_id.lower():
            doc_type = "bidding"
        elif "circular" in old_id.lower():
            doc_type = "circular"
        elif "decree" in old_id.lower():
            doc_type = "decree"
        elif "law" in old_id.lower():
            doc_type = "law"
        elif "exam" in old_id.lower():
            doc_type = "exam"
        elif "report" in old_id.lower():
            doc_type = "report"
        else:
            doc_type = "document"
    
    # Get year from metadata, or from the processing timestamp
    year = metadata.get("year")
    if not year and metadata.get("processing_metadata"):
        processed_at = metadata["processing_metadata"].get("last_processed_at", "")
        if processed_at:
            year = processed_at[:4]
    if not year:
        year = "2025"
    
    # Get number; fall back to the type name instead of "Unknown"
    number = metadata.get("number")
    if not number:
        import re
        match = re.search(r'(\d+)', old_id)
        if match:
            number = match.group(1)
        else:
            number = doc_type.title()
    
    return {
        "type": doc_type or "doc",
        "year": year,
        "number": number
    }


def generate_new_document_id(old_id: str, metadata: dict) -> str:
    """
    Generate a new document_id according to the Hybrid System.
    Format: {type_code}-{number}/{year}#{hash_short}
    """
    
    extracted = extract_metadata_from_old_id(old_id, metadata)
    doc_type = extracted["type"]
    year = extracted["year"]
    number = extracted["number"]
    
    # Type code mapping
    type_code_map = {
        "law": "LAW",
        "decree": "ND",
        "circular": "TT",
        "decision": "QD",
        "bidding": "FORM",
        "report": "RPT",
        "exam": "EXAM",
        "document": "DOC"
    }
    
    type_code = type_code_map.get(doc_type, "DOC")
    
    # Hash old_id so the suffix is deterministic (idempotent) and helps keep IDs unique
    hash_obj = hashlib.md5(old_id.encode())
    hash_short = hash_obj.hexdigest()[:6]
    
    new_id = f"{type_code}-{number}/{year}#{hash_short}"
    
    return new_id


print("✅ Migration functions defined")

✅ Migration functions defined
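
To illustrate why the suffix is derived from `old_id`, the sketch below isolates the hashing step. The `short_hash` helper is a hypothetical name introduced here; it mirrors the MD5-and-truncate construction used in `generate_new_document_id`.

```python
import hashlib


def short_hash(old_id: str, length: int = 6) -> str:
    """First `length` hex digits of the MD5 digest of old_id."""
    return hashlib.md5(old_id.encode()).hexdigest()[:length]


# Deterministic: hashing the same old ID twice gives the same suffix,
# so re-running the preview produces the same new IDs.
first = short_hash("bidding_untitled")
second = short_hash("bidding_untitled")
print(first == second)

# A 6-digit hex suffix has 16**6 possible values, so collisions are
# unlikely for a handful of documents but not impossible in general.
print(16 ** 6)  # 16777216
```

Because the suffix is only 24 bits long, it is worth checking generated IDs for duplicates before writing them to the database.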


## Step 3: Connect to Database & Preview Current Data

In [6]:
# Connect to database
# Use database_url from settings (convert SQLAlchemy format to psycopg format)
conn_str = settings.database_url.replace("postgresql+psycopg://", "postgresql://")

print("🔌 Connecting to database...")
print(f"🔗 Connection: {conn_str.split('@')[1] if '@' in conn_str else 'configured'}")
conn = psycopg.connect(conn_str)
cursor = conn.cursor()

# Query all unique document_ids with their chunk counts
query = """
SELECT 
    cmetadata->>'document_id' as document_id,
    COUNT(*) as chunk_count,
    MIN(cmetadata->>'doc_type') as doc_type,
    MIN(cmetadata->>'processing_metadata') as processing_metadata_sample
FROM langchain_pg_embedding
WHERE cmetadata->>'document_id' IS NOT NULL
GROUP BY cmetadata->>'document_id'
ORDER BY document_id;
"""

cursor.execute(query)
results = cursor.fetchall()

print(f"\n📊 Found {len(results)} documents with {sum(r[1] for r in results)} total chunks\n")
print("=" * 100)
print(f"{'OLD DOCUMENT_ID':<30} {'NEW DOCUMENT_ID':<30} {'TYPE':<8} {'CHUNKS':<8}")
print("=" * 100)

migration_preview = []
for old_id, chunk_count, doc_type, processing_metadata in results:
    # Parse processing_metadata if it exists (json is already imported in Step 1)
    metadata = {}
    if processing_metadata:
        try:
            metadata = json.loads(processing_metadata)
        except (json.JSONDecodeError, TypeError):
            metadata = {}
    
    # Generate new document_id
    new_id = generate_new_document_id(old_id, metadata)
    
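    # Determine doc type for display.
    # NOTE: extract_metadata_from_old_id returns the key "type", not "doc_type",
    # so this lookup always falls back to 'unknown'.
    inferred_meta = extract_metadata_from_old_id(old_id, metadata)
    display_type = inferred_meta.get('doc_type', 'unknown')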
    
    print(f"{old_id:<30} {new_id:<30} {display_type:<8} {chunk_count:<8}")
    
    migration_preview.append({
        'old_id': old_id,
        'new_id': new_id,
        'doc_type': display_type,
        'chunk_count': chunk_count
    })

print("=" * 100)
print(f"\n✅ Preview complete: {len(migration_preview)} documents ready for migration")

# Close connection (we'll reopen for migration)
cursor.close()
conn.close()
print("🔌 Database connection closed")

🔌 Connecting to database...
🔗 Connection: localhost:5432/rag_bidding_v2

📊 Found 5 documents with 4708 total chunks

OLD DOCUMENT_ID                NEW DOCUMENT_ID                TYPE     CHUNKS  
bidding_untitled               FORM-Bidding/2025#bee720       unknown  2831    
circular_untitled              TT-Circular/2025#3be8b6        unknown  123     
decision_untitled              DOC-Document/2025#787999       unknown  5       
decree_untitled                ND-Decree/2025#95b863          unknown  595     
law_untitled                   LAW-Law/2025#cd5116            unknown  1154    

✅ Preview complete: 5 documents ready for migration
🔌 Database connection closed
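
Before executing, it can help to confirm that no two old IDs map to the same new ID, since the update in Step 5 would otherwise merge their chunks under one ID. A minimal sketch, assuming a list shaped like `migration_preview`; the `find_duplicate_new_ids` helper is hypothetical:

```python
from collections import Counter


def find_duplicate_new_ids(preview: list) -> list:
    """Return new IDs that more than one old ID would be mapped to."""
    counts = Counter(item["new_id"] for item in preview)
    return sorted(new_id for new_id, n in counts.items() if n > 1)


# Sample rows copied from the preview output above.
sample_preview = [
    {"old_id": "bidding_untitled", "new_id": "FORM-Bidding/2025#bee720", "chunk_count": 2831},
    {"old_id": "circular_untitled", "new_id": "TT-Circular/2025#3be8b6", "chunk_count": 123},
    {"old_id": "decree_untitled", "new_id": "ND-Decree/2025#95b863", "chunk_count": 595},
]

duplicates = find_duplicate_new_ids(sample_preview)
print(duplicates)  # [] for these rows, since all three new IDs differ
```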


## Step 4: Backup Current Data

**⚠️ IMPORTANT:** Before migrating, we back up all current `cmetadata` so that we can roll back if needed.

In [8]:
# Create backup directory (use Path.cwd() instead of __file__ inside a notebook)
backup_dir = Path.cwd() / "backups"
backup_dir.mkdir(exist_ok=True)

# Generate backup filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_file = backup_dir / f"document_id_backup_{timestamp}.json"

print(f"💾 Creating backup at: {backup_file}")

# Connect and fetch all metadata
conn = psycopg.connect(conn_str)
cursor = conn.cursor()

backup_query = """
SELECT 
    id,
    cmetadata->>'document_id' as document_id,
    cmetadata
FROM langchain_pg_embedding
WHERE cmetadata->>'document_id' IS NOT NULL
ORDER BY id;
"""

cursor.execute(backup_query)
backup_data = cursor.fetchall()

# Save to JSON
backup_records = []
for row_id, doc_id, cmetadata in backup_data:
    backup_records.append({
        'id': row_id,
        'document_id': doc_id,
        'cmetadata': cmetadata
    })

with open(backup_file, 'w', encoding='utf-8') as f:
    json.dump({
        'timestamp': timestamp,
        'total_chunks': len(backup_records),
        'records': backup_records
    }, f, indent=2, ensure_ascii=False)

cursor.close()
conn.close()

print(f"✅ Backup created: {len(backup_records)} chunks saved")
print(f"📁 Backup location: {backup_file}")
print("\n⚠️  To rollback, run this SQL:")
print(f"-- Restore from backup: {backup_file.name}")
print("-- UPDATE langchain_pg_embedding SET cmetadata = backup_cmetadata WHERE id = backup_id")

💾 Creating backup at: /home/sakana/Code/RAG-bidding/notebooks/backups/document_id_backup_20251109_154532.json
✅ Backup created: 4708 chunks saved
📁 Backup location: /home/sakana/Code/RAG-bidding/notebooks/backups/document_id_backup_20251109_154532.json

⚠️  To rollback, run this SQL:
-- Restore from backup: document_id_backup_20251109_154532.json
-- UPDATE langchain_pg_embedding SET cmetadata = backup_cmetadata WHERE id = backup_id
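
A backup is only useful if it can be read back. The sketch below reloads a backup file and checks its internal consistency. It assumes the JSON layout written by the cell above (`timestamp`, `total_chunks`, `records`); the `validate_backup` helper and the demo file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def validate_backup(path: Path) -> int:
    """Reload a backup file and check that its record count matches total_chunks."""
    with open(path, "r", encoding="utf-8") as f:
        backup = json.load(f)
    records = backup["records"]
    if len(records) != backup["total_chunks"]:
        raise ValueError("record count does not match total_chunks")
    for record in records:
        missing = {"id", "document_id", "cmetadata"} - record.keys()
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
    return len(records)


# Demo with a tiny made-up backup written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "document_id_backup_demo.json"
    demo = {
        "timestamp": "demo",
        "total_chunks": 1,
        "records": [
            {"id": "row-1", "document_id": "law_untitled",
             "cmetadata": {"document_id": "law_untitled"}}
        ],
    }
    demo_file.write_text(json.dumps(demo), encoding="utf-8")
    validated = validate_backup(demo_file)
    print(validated)  # 1
```

In the notebook, the same function could be called with `backup_file` right after the backup cell.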


## Step 5: Execute Migration

**⚠️ CRITICAL:** This step modifies the real database. Before running it, confirm that:
- ✅ You have reviewed the preview in Step 3
- ✅ The backup from Step 4 exists
- ✅ You are ready to run the migration

**To run the migration, set `CONFIRM_MIGRATION = True` in the cell below.**

In [10]:
# ⚠️  SET THIS TO True TO CONFIRM MIGRATION
CONFIRM_MIGRATION = True

if not CONFIRM_MIGRATION:
    print("❌ Migration NOT confirmed. Set CONFIRM_MIGRATION = True to proceed.")
else:
    print("🚀 Starting migration...\n")
    
    # Connect to database
    conn = psycopg.connect(conn_str)
    cursor = conn.cursor()
    
    try:
        total_updated = 0
        
        for item in migration_preview:
            old_id = item['old_id']
            new_id = item['new_id']
            chunk_count = item['chunk_count']
            
            print(f"📝 Migrating: {old_id} → {new_id} ({chunk_count} chunks)...")
            
            # Update document_id in cmetadata
            update_query = """
            UPDATE langchain_pg_embedding
            SET cmetadata = jsonb_set(
                cmetadata, 
                '{document_id}', 
                to_jsonb(%s::text)
            )
            WHERE cmetadata->>'document_id' = %s;
            """
            
            cursor.execute(update_query, (new_id, old_id))
            updated_count = cursor.rowcount
            total_updated += updated_count
            
            print(f"   ✅ Updated {updated_count} chunks")
        
        # Commit all changes
        conn.commit()
        print(f"\n🎉 Migration complete! Total chunks updated: {total_updated}")
        print("✅ All changes committed to database")
        
    except Exception as e:
        conn.rollback()
        print(f"\n❌ Error during migration: {e}")
        print("🔄 All changes rolled back")
        raise
    
    finally:
        cursor.close()
        conn.close()
        print("🔌 Database connection closed")

🚀 Starting migration...

📝 Migrating: bidding_untitled → FORM-Bidding/2025#bee720 (2831 chunks)...
   ✅ Updated 2831 chunks
📝 Migrating: circular_untitled → TT-Circular/2025#3be8b6 (123 chunks)...
   ✅ Updated 123 chunks
📝 Migrating: decision_untitled → DOC-Document/2025#787999 (5 chunks)...
   ✅ Updated 5 chunks
📝 Migrating: decree_untitled → ND-Decree/2025#95b863 (595 chunks)...
   ✅ Updated 595 chunks
📝 Migrating: law_untitled → LAW-Law/2025#cd5116 (1154 chunks)...
   ✅ Updated 1154 chunks

🎉 Migration complete! Total chunks updated: 4708
✅ All changes committed to database
🔌 Database connection closed
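
The output above shows each update touching exactly the number of chunks reported in the preview. One way to enforce that is to compare the two before committing and raise on any difference, which would trigger the rollback branch. A minimal sketch; the `find_count_mismatches` helper and the `updated` dictionary are hypothetical and not part of the cell above:

```python
def find_count_mismatches(preview: list, updated: dict) -> list:
    """Return old IDs whose updated row count differs from the previewed chunk count."""
    return [
        item["old_id"]
        for item in preview
        if updated.get(item["old_id"], 0) != item["chunk_count"]
    ]


# Counts copied from the preview and migration output above.
preview = [
    {"old_id": "bidding_untitled", "chunk_count": 2831},
    {"old_id": "law_untitled", "chunk_count": 1154},
]
updated = {"bidding_untitled": 2831, "law_untitled": 1154}

mismatches = find_count_mismatches(preview, updated)
print(mismatches)  # [] because both counts match
```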

## Step 6: Verify Migration Results

Check whether the migration succeeded by querying the database.

In [11]:
# Connect and verify
conn = psycopg.connect(conn_str)
cursor = conn.cursor()

print("🔍 Verifying migration results...\n")

# Check all document_ids now in database
verify_query = """
SELECT 
    cmetadata->>'document_id' as document_id,
    COUNT(*) as chunk_count,
    MIN(cmetadata->>'doc_type') as doc_type
FROM langchain_pg_embedding
WHERE cmetadata->>'document_id' IS NOT NULL
GROUP BY cmetadata->>'document_id'
ORDER BY document_id;
"""

cursor.execute(verify_query)
results = cursor.fetchall()

print(f"📊 Current database state: {len(results)} documents\n")
print("=" * 80)
print(f"{'DOCUMENT_ID':<35} {'TYPE':<10} {'CHUNKS':<10}")
print("=" * 80)

total_chunks = 0
for doc_id, chunk_count, doc_type in results:
    print(f"{doc_id:<35} {doc_type or 'unknown':<10} {chunk_count:<10}")
    total_chunks += chunk_count

print("=" * 80)
print(f"Total: {total_chunks} chunks across {len(results)} documents\n")

# Check for any old-format IDs remaining
check_old_format = """
SELECT COUNT(*) 
FROM langchain_pg_embedding
WHERE cmetadata->>'document_id' LIKE '%_untitled'
   OR cmetadata->>'document_id' NOT LIKE '%#%';
"""

cursor.execute(check_old_format)
old_format_count = cursor.fetchone()[0]

if old_format_count > 0:
    print(f"⚠️  Warning: {old_format_count} chunks still have old-format document_ids")
else:
    print("✅ All document_ids successfully migrated to new format!")

# Sample some chunks to verify metadata integrity
sample_query = """
SELECT 
    id,
    cmetadata->>'document_id' as document_id,
    cmetadata->>'doc_type' as doc_type,
    cmetadata->>'chunk_index' as chunk_index
FROM langchain_pg_embedding
LIMIT 5;
"""

cursor.execute(sample_query)
samples = cursor.fetchall()

print("\n📋 Sample chunks (first 5):")
print("=" * 80)
for row_id, doc_id, doc_type, chunk_idx in samples:
    print(f"ID: {row_id} | Doc: {doc_id} | Type: {doc_type} | Chunk: {chunk_idx}")
print("=" * 80)

cursor.close()
conn.close()

print("\n✅ Verification complete!")

🔍 Verifying migration results...

📊 Current database state: 5 documents

DOCUMENT_ID                         TYPE       CHUNKS    
DOC-Document/2025#787999            unknown    5         
FORM-Bidding/2025#bee720            unknown    2831      
LAW-Law/2025#cd5116                 unknown    1154      
ND-Decree/2025#95b863               unknown    595       
TT-Circular/2025#3be8b6             unknown    123       
Total: 4708 chunks across 5 documents

✅ All document_ids successfully migrated to new format!

📋 Sample chunks (first 5):
ID: 4af1c105-4ce2-43a6-b304-244a6d16ee15 | Doc: FORM-Bidding/2025#bee720 | Type: None | Chunk: 102
ID: 82d0d255-4401-470d-9805-02ffe2a677a9 | Doc: FORM-Bidding/2025#bee720 | Type: None | Chunk: 66
ID: 748f9e32-7127-43c4-b575-71cea7d40e78 | Doc: FORM-Bidding/2025#bee720 | Type: None | Chunk: 46
ID: 8e742f0e-8a04-4cfe-b7a7-2548a93179fa | Doc: FORM-Bidding/2025#bee720 | Type: None | Chunk: 1
ID: ca5d8bad-ed7f-475b-ab69-44a317a611b1 | Doc: FORM-

## Step 7: Test API with New Document IDs

Check that the API endpoints work correctly with the new document_id format.

**Note:** Document IDs contain the `#` character, which must be URL-encoded when calling the API:
- `ND-43/2022#a7f3c9` → `ND-43/2022%23a7f3c9`
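
As a quick illustration of the encoding rule, the standard library's `urllib.parse` can produce both forms. This sketch uses only the example ID from the note above; whether `/` must also be encoded depends on where the ID is placed in the URL.

```python
from urllib.parse import quote, unquote, urlencode

doc_id = "ND-43/2022#a7f3c9"

# By default quote() leaves '/' untouched (safe='/'), so only '#' is encoded.
encoded_keep_slash = quote(doc_id)
print(encoded_keep_slash)  # ND-43/2022%23a7f3c9

# With safe='' the slash is encoded too, which matters when the ID
# has to fit into a single path segment.
encoded_all = quote(doc_id, safe="")
print(encoded_all)  # ND-43%2F2022%23a7f3c9

# Building a query string: urlencode() percent-encodes the value.
query = urlencode({"document_id": doc_id})
print(query)  # document_id=ND-43%2F2022%23a7f3c9

# Decoding restores the original ID.
print(unquote(encoded_all) == doc_id)  # True
```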

In [13]:
import urllib.parse
import requests

# Assume API is running at localhost:8000
API_BASE_URL = "http://localhost:8000/api"

print("🧪 Testing API with new document IDs...\n")

# Get a sample new document_id from migration_preview
if migration_preview:
    sample_doc = migration_preview[0]
    test_doc_id = sample_doc['new_id']
    
    print(f"📝 Testing with document: {test_doc_id}")
    print(f"   (Originally: {sample_doc['old_id']})")
    print(f"   Type: {sample_doc['doc_type']}, Chunks: {sample_doc['chunk_count']}\n")
    
    # URL encode the document_id (important for # and / symbols)
    encoded_doc_id = urllib.parse.quote(test_doc_id, safe='')
    
    print(f"🔗 URL-encoded: {encoded_doc_id}\n")
    
    # Test 1: GET document status (using query parameter)
    print("=" * 80)
    print("TEST 1: GET /api/document-status?document_id={document_id}")
    print("=" * 80)
    
    try:
        response = requests.get(
            f"{API_BASE_URL}/document-status",
            params={"document_id": test_doc_id}  # Use query param, requests will encode
        )
        print(f"Status Code: {response.status_code}")
        
        if response.status_code == 200:
            data = response.json()
            print("✅ GET request successful!")
            print(f"Response: {json.dumps(data, indent=2, ensure_ascii=False)}")
        else:
            print(f"❌ GET request failed: {response.text}")
    except Exception as e:
        print(f"❌ Error calling API: {e}")
        print("⚠️  Make sure the API server is running: uvicorn src.api.main:app --reload")
    
    print("\n")
    
    # Test 2: POST update document status
    print("=" * 80)
    print("TEST 2: POST /api/document-status/update")
    print("=" * 80)
    
    try:
        update_data = {
            "document_id": test_doc_id,  # Don't encode in JSON body
            "new_status": "active",
            "reason": "Testing new document_id format after migration",
            "notes": "Migration successful - testing API compatibility"
        }
        
        response = requests.post(
            f"{API_BASE_URL}/document-status/update",
            json=update_data
        )
        
        print(f"Status Code: {response.status_code}")
        
        if response.status_code == 200:
            data = response.json()
            print("✅ POST request successful!")
            print(f"Response: {json.dumps(data, indent=2, ensure_ascii=False)}")
        else:
            print(f"❌ POST request failed: {response.text}")
    except Exception as e:
        print(f"❌ Error calling API: {e}")
        print("⚠️  Make sure the API server is running: uvicorn src.api.main:app --reload")
    
    print("=" * 80)
    print("\n✅ API testing complete!")
    
else:
    print("❌ No migration preview data available. Run Step 3 first.")

🧪 Testing API with new document IDs...

📝 Testing with document: FORM-Bidding/2025#bee720
   (Originally: bidding_untitled)
   Type: unknown, Chunks: 2831

🔗 URL-encoded: FORM-Bidding%2F2025%23bee720

TEST 1: GET /api/document-status?document_id={document_id}
Status Code: 200
✅ GET request successful!
Response: {
  "document_id": "FORM-Bidding/2025#bee720",
  "current_status": "active",
  "chunk_count": 2831,
  "last_updated": "2025-11-09T15:46:40.002664",
  "superseded_by": "bidding_new_2024"
}


TEST 2: POST /api/document-status/update
Status Code: 200
✅ POST request successful!
Response: {
  "success": true,
  "message": "Đã cập nhật status cho 2831 chunks",
  "document_id": "FORM-Bidding/2025#bee720",
  "chunks_updated": 2831,
  "old_status": "active",
  "new_status": "active"
}

✅ API testing complete!

## 🎉 Migration Complete!

### Process summary:

1. ✅ **Setup & Import** - Loaded the libraries and the database settings
2. ✅ **Define Functions** - Defined the migration logic functions
3. ✅ **Preview** - Previewed the changes (5 documents, 4708 chunks)
4. ✅ **Backup** - Backed up all current metadata
5. ✅ **Execute** - Ran the migration (if `CONFIRM_MIGRATION = True`)
6. ✅ **Verify** - Checked the migration results
7. ✅ **Test API** - Tested the API with the new document_id values

### New document ID format:

```
{TYPE_CODE}-{Number}/{Year}#{Hash}
```

**Examples:**
- `ND-43/2022#a7f3c9` - Decree (Nghị định) 43/2022
- `TT-15/2023#65aabb` - Circular (Thông tư) 15/2023
- `LAW-Law/2025#cd5116` - Law (Luật), with no specific number

### Next steps:

1. **Update the preprocessing pipeline** to use `DocumentIDGenerator` for new documents
2. **Update the documentation** to describe the new document_id format
3. **Notify the team** about this change
4. **Monitor the API** to make sure URL encoding of `#` causes no issues

### Rollback (if needed):

To revert to the old document_id values, use the backup file in `notebooks/backups/` and run:

```python
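import json
import psycopg
from psycopg.types.json import Jsonb

# Load backup file (replace TIMESTAMP with the timestamp of the backup to restore)
with open('backups/document_id_backup_TIMESTAMP.json', 'r', encoding='utf-8') as f:
    backup = json.load(f)

# Restore each record in one transaction; conn_str is defined in Step 3.
# Leaving the connection block without an error commits the changes.
with psycopg.connect(conn_str) as conn:
    with conn.cursor() as cursor:
        for record in backup['records']:
            cursor.execute("""
                UPDATE langchain_pg_embedding
                SET cmetadata = %s
                WHERE id = %s
            """, (Jsonb(record['cmetadata']), record['id']))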
```