# 🔧 Fix Migration Links & Inconsistencies

**Purpose**: Fix the issues found after the migration:
1. **bidding_untitled** (767 chunks) - old-format ID that needs updating
2. **FORM-Bidding/2025#bee720** (3 chunks) - missing from the documents table
3. **4 exam documents** - present in the documents table but with 0 chunks

**Mode**: Run one cell at a time to debug; roll back if something goes wrong.

## 📦 Setup

In [12]:
import sys
import os
from pathlib import Path

# Add project root
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

print(f"📁 Project: {project_root}")

📁 Project: /home/sakana/Code/RAG-bidding


In [13]:
import psycopg2
import json
import pandas as pd
from datetime import datetime
from typing import Dict, List, Any
import warnings
warnings.filterwarnings('ignore')

# Database config
DB_CONFIG = {
    'host': 'localhost',
    'database': 'rag_bidding_v2',
    'user': 'sakana',
    'password': 'sakana123'
}

print("✅ Imports successful")

✅ Imports successful
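
The `DB_CONFIG` above hard-codes the credentials in the notebook. A minimal sketch of one alternative is to read them from environment variables. The variable names (`RAG_DB_HOST`, `RAG_DB_NAME`, `RAG_DB_USER`, `RAG_DB_PASSWORD`) and the helper `load_db_config` are assumptions made for illustration, not part of this project.

```python
import os


def load_db_config(env=None):
    """Build psycopg2 connection kwargs, preferring environment variables.

    The RAG_DB_* variable names are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('RAG_DB_HOST', 'localhost'),
        'database': env.get('RAG_DB_NAME', 'rag_bidding_v2'),
        'user': env.get('RAG_DB_USER', 'sakana'),
        # No default on purpose: a missing password raises KeyError
        # instead of silently falling back to a value stored in the notebook.
        'password': env['RAG_DB_PASSWORD'],
    }
```

With this in place, `DB_CONFIG = load_db_config()` could replace the literal dict, provided `RAG_DB_PASSWORD` is exported before the kernel starts.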


In [14]:
# Helper functions

def get_connection():
    """Get database connection."""
    return psycopg2.connect(**DB_CONFIG)

def run_query(query: str, params: tuple = None) -> pd.DataFrame:
    """Run query and return DataFrame."""
    conn = get_connection()
    try:
        df = pd.read_sql_query(query, conn, params=params)
        return df
    finally:
        conn.close()

def execute_query(query: str, params: tuple = None) -> int:
    """Execute query and return affected rows."""
    conn = get_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(query, params)
        affected = cursor.rowcount
        conn.commit()
        return affected
    except Exception:
        conn.rollback()  # Undo the partial transaction before re-raising
        raise
    finally:
        conn.close()

def print_section(title: str):
    """Print formatted section header."""
    print("\n" + "="*80)
    print(f"📊 {title}")
    print("="*80 + "\n")

def determine_category(doc_type: str) -> str:
    """Determine document category from type."""
    mapping = {
        'law': 'Luật chính',          # primary law
        'decree': 'Nghị định',        # decree
        'circular': 'Thông tư',       # circular
        'decision': 'Quyết định',     # decision
        'bidding': 'Hồ sơ mời thầu',  # bidding documents
        'template': 'Mẫu báo cáo',    # report template
        'exam': 'Câu hỏi thi'         # exam questions
    }
    return mapping.get(doc_type, 'Khác')  # 'Khác' = other

print("✅ Helper functions loaded")

✅ Helper functions loaded
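
Since the notebook is meant to be run cell by cell with the option to roll back, a dry-run helper can report how many rows a statement would touch without committing it. The sketch below is hypothetical (`dry_run` is not one of the notebook's helpers). It relies only on the DB-API `cursor()`, `execute()`, `rowcount`, and `rollback()` calls, and is demonstrated on an in-memory SQLite database so it runs without PostgreSQL.

```python
import sqlite3


def dry_run(conn, query, params=()):
    """Execute a statement, return its rowcount, then always roll back."""
    cursor = conn.cursor()
    try:
        cursor.execute(query, params)
        return cursor.rowcount
    finally:
        # Discard the change whether or not the statement succeeded.
        conn.rollback()


# Demo on in-memory SQLite; the table layout is a simplified stand-in.
demo = sqlite3.connect(':memory:')
demo.execute(
    "CREATE TABLE documents (document_id TEXT, total_chunks INTEGER, status TEXT)"
)
demo.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [('EXAM-a', 0, 'active'), ('LUA-b', 12, 'active')],
)
demo.commit()

affected = dry_run(
    demo, "UPDATE documents SET status = 'pending' WHERE total_chunks = 0"
)
statuses = [
    row[0]
    for row in demo.execute("SELECT status FROM documents ORDER BY document_id")
]
print(affected, statuses)  # 1 ['active', 'active']
```

Against PostgreSQL the same function should work with a connection from `get_connection()`, keeping in mind that psycopg2 uses `%s` placeholders where SQLite uses `?`.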


---

## 🔍 Issue 1: Analyze Missing Documents

Documents that exist in the vector DB but NOT in the documents table.

In [15]:
print_section("Missing Documents Analysis")

query = """
WITH vector_docs AS (
    SELECT 
        cmetadata->>'document_id' as document_id,
        cmetadata->>'document_type' as document_type,
        cmetadata->>'title' as title,
        cmetadata->>'source_file' as source_file,
        COUNT(*) as total_chunks
    FROM langchain_pg_embedding
    GROUP BY 
        cmetadata->>'document_id',
        cmetadata->>'document_type',
        cmetadata->>'title',
        cmetadata->>'source_file'
)
SELECT v.*
FROM vector_docs v
LEFT JOIN documents d ON v.document_id = d.document_id
WHERE d.document_id IS NULL
ORDER BY total_chunks DESC;
"""

missing_docs = run_query(query)

if missing_docs.empty:
    print("✅ All documents in vector DB exist in documents table")
else:
    print(f"⚠️ Found {len(missing_docs)} documents in vector DB but NOT in documents table:\n")
    
    for idx, row in missing_docs.iterrows():
        print(f"[{idx+1}] {row['document_id']}")
        print(f"    Type: {row['document_type'] or 'unknown'}")
        print(f"    Chunks: {row['total_chunks']}")
        print(f"    Title: {row['title'] or 'N/A'}")
        print(f"    Source: {Path(row['source_file']).name if row['source_file'] else 'N/A'}")
        print()

# Store for later use
print(f"\n💾 Stored in variable: missing_docs (DataFrame with {len(missing_docs)} rows)")


📊 Missing Documents Analysis

✅ All documents in vector DB exist in documents table

💾 Stored in variable: missing_docs (DataFrame with 0 rows)
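
The query in this cell is an anti-join: it keeps the vector-DB rows for which the `LEFT JOIN` finds no match in `documents`. The same logic can be sketched in plain Python, which may help when checking the result by hand on a small sample. The function name and the sample data are hypothetical. Unlike the SQL, this sketch groups by `document_id` only, not by title, type, and source file.

```python
from collections import Counter


def find_missing_documents(chunk_metadata, known_ids):
    """Return (document_id, chunk_count) pairs absent from known_ids.

    chunk_metadata: iterable of dicts shaped like the cmetadata JSON.
    known_ids: set of document_id values present in the documents table.
    """
    counts = Counter(meta.get('document_id') for meta in chunk_metadata)
    missing = [
        (doc_id, n) for doc_id, n in counts.items() if doc_id not in known_ids
    ]
    # Largest documents first, mirroring ORDER BY total_chunks DESC.
    return sorted(missing, key=lambda item: item[1], reverse=True)


# Made-up sample: three chunks across two documents, one of them unregistered.
sample_chunks = [
    {'document_id': 'LUA-demo'},
    {'document_id': 'FORM-demo'},
    {'document_id': 'FORM-demo'},
]
print(find_missing_documents(sample_chunks, {'LUA-demo'}))
# [('FORM-demo', 2)]
```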


### 1.1: Inspect bidding_untitled Details

Inspect the details to decide how to handle it.

In [16]:
if not missing_docs.empty and 'bidding_untitled' in missing_docs['document_id'].values:
    print_section("bidding_untitled Detailed Analysis")
    
    # Get sample chunks
    query = """
    SELECT 
        cmetadata->>'title' as title,
        cmetadata->>'source_file' as source_file,
        cmetadata->>'document_type' as doc_type,
        cmetadata->>'chunk_id' as chunk_id,
        cmetadata->>'chunk_index' as chunk_index,
        LEFT(document, 150) as content_preview
    FROM langchain_pg_embedding
    WHERE cmetadata->>'document_id' = 'bidding_untitled'
    ORDER BY (cmetadata->>'chunk_index')::int
    LIMIT 10;
    """
    
    samples = run_query(query)
    
    print(f"📋 Sample chunks (showing 10 of 767):\n")
    
    for idx, row in samples.iterrows():
        print(f"Chunk {idx+1}:")
        print(f"   chunk_id: {row['chunk_id']}")
        print(f"   chunk_index: {row['chunk_index']}")
        print(f"   type: {row['doc_type']}")
        print(f"   title: {row['title']}")
        print(f"   source: {Path(row['source_file']).name if row['source_file'] else 'N/A'}")
        print(f"   content: {row['content_preview']}...")
        print()
    
    print("\n💡 Decision Options:")
    print("   A. Update to proper FORM-* format (recommend: FORM-Bidding-HSMT)")
    print("   B. Delete if this is duplicate/invalid data")
    print("   C. Leave as-is and only backfill documents table")
else:
    print("ℹ️ bidding_untitled not found in missing documents")

ℹ️ bidding_untitled not found in missing documents
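
Before choosing among the options printed by this cell, it can help to check which IDs count as old format. The sketch below reuses the two patterns from the final consistency check in this notebook (the `untitled` substring and the `^(LUA|ND|TT|QD|FORM|TEMPLATE|EXAM)-` prefix). The function name and the `'other'` bucket are assumptions. The SQL counts the two patterns independently, whereas here `untitled` takes precedence.

```python
import re

# Prefix pattern copied from the final consistency check query.
NEW_FORMAT = re.compile(r'^(LUA|ND|TT|QD|FORM|TEMPLATE|EXAM)-')


def classify_document_id(document_id):
    """Label an ID as 'old', 'new', or 'other' (matches neither pattern)."""
    if 'untitled' in document_id:
        return 'old'
    if NEW_FORMAT.match(document_id):
        return 'new'
    return 'other'


for doc_id in ('bidding_untitled', 'FORM-Bidding/2025#bee720', 'misc_doc'):
    print(doc_id, '->', classify_document_id(doc_id))
# bidding_untitled -> old
# FORM-Bidding/2025#bee720 -> new
# misc_doc -> other
```

Any ID that lands in `'other'` would be counted by neither the old-format nor the new-format subquery, so it is worth listing separately.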


### 1.2: Backfill Missing Documents

Insert the missing documents into the documents table.

In [17]:
print_section("Backfilling Missing Documents")

if missing_docs.empty:
    print("✅ No missing documents to backfill")
else:
    conn = get_connection()
    cursor = conn.cursor()
    
    insert_query = """
    INSERT INTO documents (
        document_id,
        document_name,
        document_type,
        category,
        file_name,
        source_file,
        total_chunks,
        status,
        created_at,
        updated_at
    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, NOW(), NOW())
    ON CONFLICT (document_id) DO UPDATE SET
        total_chunks = EXCLUDED.total_chunks,
        updated_at = NOW();
    """
    
    try:
        inserted = 0
        for idx, row in missing_docs.iterrows():
            doc_id = row['document_id']
            doc_type = row['document_type'] or 'unknown'
            title = row['title'] or doc_id
            source_file = row['source_file']
            # Cast to a plain int; psycopg2 does not adapt NumPy integers by default
            chunks = int(row['total_chunks'])
            category = determine_category(doc_type)
            
            # Extract file_name from source_file
            if source_file and source_file.strip():
                file_name = Path(source_file).name
                source_path = source_file
            else:
                # Fallback: use document_id as file_name
                file_name = f"{doc_id}.docx"
                source_path = f"/unknown/{file_name}"  # Placeholder path
            
            cursor.execute(insert_query, (
                doc_id,
                title[:200],  # Truncate if too long
                doc_type,
                category,
                file_name,
                source_path,
                chunks,
                'active'
            ))
            inserted += cursor.rowcount
            print(f"✅ Inserted: {doc_id}")
            print(f"   - File: {file_name}")
            print(f"   - Source: {source_path}")
            print(f"   - Chunks: {chunks}")
        
        conn.commit()
        print(f"\n✅ Successfully backfilled {inserted} documents")

    except Exception as e:
        conn.rollback()
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
    
    finally:
        conn.close()


📊 Backfilling Missing Documents

✅ No missing documents to backfill


---

## 🔍 Issue 2: Zero-Chunk Documents

Documents that exist in the documents table but have no chunks.

In [18]:
print_section("Zero-Chunk Documents Analysis")

query = """
SELECT 
    document_id,
    document_name,
    document_type,
    category,
    status,
    source_file,
    created_at
FROM documents
WHERE total_chunks = 0
ORDER BY document_id;
"""

zero_chunk_docs = run_query(query)

if zero_chunk_docs.empty:
    print("✅ All documents have chunks")
else:
    print(f"⚠️ Found {len(zero_chunk_docs)} documents with 0 chunks:\n")
    
    for idx, row in zero_chunk_docs.iterrows():
        print(f"[{idx+1}] {row['document_id']}")
        print(f"    Name: {row['document_name']}")
        print(f"    Type: {row['document_type']} | Category: {row['category']}")
        print(f"    Status: {row['status']}")
        print(f"    Source: {row['source_file'] or 'N/A'}")
        print()

print(f"\n💾 Stored in variable: zero_chunk_docs (DataFrame with {len(zero_chunk_docs)} rows)")
print("\n💡 Options: Mark as 'pending' OR Delete")


📊 Zero-Chunk Documents Analysis

⚠️ Found 4 documents with 0 chunks:

[1] EXAM-Ngân-hàng-câu-hỏi-CC
    Name: Ngân hàng câu hỏi CCDT đợt 2
    Type: exam_question | Category: Câu hỏi thi
    Status: pending
    Source: data/raw/Cau hoi thi/Ngân hàng câu hỏi CCDT đợt 2.pdf

[2] EXAM-Ngân-hàng-câu-hỏi-th
    Name: Ngân hàng câu hỏi thi CCDT đợt 1
    Type: exam_question | Category: Câu hỏi thi
    Status: pending
    Source: data/raw/Cau hoi thi/Ngân hàng câu hỏi thi CCDT đợt 1.pdf

[3] EXAM-NHCH_2692025_dot-2
    Name: NHCH 26.9.2025 dot 2- bổ sung
    Type: exam_question | Category: Câu hỏi thi
    Status: pending
    Source: data/raw/Cau hoi thi/NHCH_26.9.2025_dot 2- bổ sung.pdf

[4] EXAM-NHCH_30925_bo_sung
    Name: NHCH 30.9.25 bo sung theo TB1952 qldt (1)
    Type: exam_question | Category: Câu hỏi thi
    Status: pending
    Source: data/raw/Cau hoi thi/NHCH_30.9.25_bo_sung_theo_TB1952_qldt (1).pdf


💾 Stored in 

### 2.1: Mark Zero-Chunk Documents as Pending

In [19]:
# Note: this cell is live. Running it updates every row with total_chunks = 0.

print_section("Marking Zero-Chunk Documents as Pending")

if zero_chunk_docs.empty:
    print("ℹ️ No documents to mark")
else:
    query = """
    UPDATE documents
    SET status = 'pending', updated_at = NOW()
    WHERE total_chunks = 0;
    """
    
    try:
        affected = execute_query(query)
        print(f"✅ Marked {affected} documents as 'pending'")
    except Exception as e:
        print(f"❌ Error: {e}")


📊 Marking Zero-Chunk Documents as Pending

✅ Marked 4 documents as 'pending'


---

## ✅ Final Verification

In [21]:
print_section("Final Consistency Check")

query = """
SELECT 
    (SELECT COUNT(DISTINCT cmetadata->>'document_id') FROM langchain_pg_embedding) as vector_db_docs,
    (SELECT COUNT(*) FROM documents) as documents_table_docs,
    (SELECT COUNT(*) FROM documents WHERE total_chunks = 0) as zero_chunk_docs,
    (SELECT COUNT(*) FROM langchain_pg_embedding WHERE cmetadata->>'document_id' LIKE '%untitled%') as old_format_chunks,
    (SELECT COUNT(*) FROM langchain_pg_embedding WHERE cmetadata->>'document_id' ~ '^(LUA|ND|TT|QD|FORM|TEMPLATE|EXAM)-') as new_format_chunks,
    (SELECT COUNT(*) FROM langchain_pg_embedding) as total_chunks;
"""

stats = run_query(query)
s = stats.iloc[0]

print(f"""📊 Final State:

Vector DB:
   - Unique documents: {s['vector_db_docs']}
   - Total chunks: {s['total_chunks']}
   - New format: {s['new_format_chunks']} ({s['new_format_chunks']/s['total_chunks']*100:.1f}%)
   - Old format: {s['old_format_chunks']} ({s['old_format_chunks']/s['total_chunks']*100:.1f}%)

Documents Table:
   - Total: {s['documents_table_docs']}
   - With chunks: {s['documents_table_docs'] - s['zero_chunk_docs']}
   - Zero chunks: {s['zero_chunk_docs']}

Consistency: {'✅ GOOD' if s['vector_db_docs'] == (s['documents_table_docs'] - s['zero_chunk_docs']) else '⚠️ NEEDS ATTENTION'}
""")

if s['old_format_chunks'] > 0:
    print(f"\n⚠️ Still have {s['old_format_chunks']} old format chunks!")
    print("   These are the 'bidding_untitled' chunks that need manual decision.")


📊 Final Consistency Check

📊 Final State:

Vector DB:
   - Unique documents: 57
   - Total chunks: 6242
   - New format: 5475 (87.7%)
   - Old format: 767 (12.3%)

Documents Table:
   - Total: 61
   - With chunks: 57
   - Zero chunks: 4

Consistency: ✅ GOOD


⚠️ Still have 767 old format chunks!
   These are the 'bidding_untitled' chunks that need manual decision.
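
The consistency flag above compares document counts only, so it would also report GOOD if the two sides held the same number of documents under different IDs. A stricter check compares the ID sets in both directions. The sketch below is hypothetical: the function name and sample IDs are made up, and in practice the two inputs would come from queries against `langchain_pg_embedding` and `documents`.

```python
def compare_document_ids(vector_ids, table_ids_with_chunks):
    """Return the IDs that appear on only one side of the comparison."""
    vector_ids = set(vector_ids)
    table_ids = set(table_ids_with_chunks)
    return {
        'only_in_vector_db': sorted(vector_ids - table_ids),
        'only_in_documents_table': sorted(table_ids - vector_ids),
    }


# Made-up example: both sides hold two IDs, so a count check would pass,
# yet one ID on each side has no counterpart.
diff = compare_document_ids({'LUA-a', 'FORM-b'}, {'LUA-a', 'FORM-c'})
print(diff)
# {'only_in_vector_db': ['FORM-b'], 'only_in_documents_table': ['FORM-c']}
```

Both lists being empty is the condition under which the count-based check and the ID-based check agree.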
