# Thêm Metadata vào Documents đã Embedded

## Mục tiêu
Thêm `status` (active/expired) và `valid_until` vào metadata của documents **KHÔNG CẦN re-embedding**.

## ⚠️ LƯU Ý QUAN TRỌNG
- **Chạy từng cell một** và kiểm tra kết quả
- **KHÔNG chạy cell bulk update** cho đến khi verify xong
- Backup database trước khi update (optional nhưng recommended)

## Workflow
1. ✅ Kết nối database và kiểm tra schema
2. ✅ Xem sample metadata hiện tại
3. ✅ Test logic xác định status trên 1 document
4. ✅ Dry-run: Xem sẽ update những gì (KHÔNG thực sự update)
5. ⏸️ **PAUSE** - Review kết quả dry-run
6. ✅ Bulk update (sau khi confirm)
7. ✅ Verify kết quả

## Bước 1: Import Libraries và Setup

In [2]:
import psycopg
import json
import re
from datetime import datetime
from typing import Dict, Tuple
import pandas as pd

# Load config
import sys
sys.path.append('/home/sakana/Code/RAG-bidding')
from config.models import settings

print("✅ Libraries imported successfully")
print(f"📊 Database: {settings.database_url.split('@')[1] if '@' in settings.database_url else 'hidden'}")
print(f"📦 Collection: {settings.collection}")

✅ Libraries imported successfully
📊 Database: localhost:5432/ragdb
📦 Collection: docs


## Bước 2: Kết nối Database và Kiểm tra Schema

In [3]:
# Kết nối database
dsn = settings.database_url.replace("postgresql+psycopg", "postgresql")

try:
    conn = psycopg.connect(dsn)
    print("✅ Database connected successfully!")
    
    # Kiểm tra schema
    with conn.cursor() as cur:
        # 1. Kiểm tra tables
        cur.execute("""
            SELECT table_name 
            FROM information_schema.tables 
            WHERE table_schema = 'public' 
            AND table_name LIKE '%langchain%'
        """)
        tables = cur.fetchall()
        print(f"\n📊 PGVector Tables:")
        for table in tables:
            print(f"   - {table[0]}")
        
        # 2. Kiểm tra columns trong langchain_pg_embedding
        cur.execute("""
            SELECT column_name, data_type 
            FROM information_schema.columns 
            WHERE table_name = 'langchain_pg_embedding'
            ORDER BY ordinal_position
        """)
        columns = cur.fetchall()
        print(f"\n📋 Columns in langchain_pg_embedding:")
        for col_name, col_type in columns:
            print(f"   - {col_name}: {col_type}")
            
except Exception as e:
    print(f"❌ Connection error: {e}")
    conn = None

✅ Database connected successfully!

📊 PGVector Tables:
   - langchain_pg_collection
   - langchain_pg_embedding

📋 Columns in langchain_pg_embedding:
   - id: character varying
   - collection_id: uuid
   - embedding: USER-DEFINED
   - document: character varying
   - cmetadata: jsonb


## Bước 3: Lấy Collection UUID và Đếm Documents

In [4]:
with conn.cursor() as cur:
    # Get collection UUID
    cur.execute(
        "SELECT uuid FROM langchain_pg_collection WHERE name = %s",
        (settings.collection,)
    )
    result = cur.fetchone()
    
    if not result:
        print(f"❌ Collection '{settings.collection}' not found!")
        collection_uuid = None
    else:
        collection_uuid = result[0]
        print(f"✅ Collection found: {settings.collection}")
        print(f"   UUID: {collection_uuid}")
        
        # Count documents
        cur.execute(
            "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
            (collection_uuid,)
        )
        doc_count = cur.fetchone()[0]
        print(f"\n📊 Total documents: {doc_count}")

✅ Collection found: docs
   UUID: b625f353-708f-41a4-8580-6e7523325aba

📊 Total documents: 2103


## Bước 4: Xem Sample Metadata Hiện Tại

## Bước 4a: Phân tích chi tiết Metadata Structure

In [6]:
# Phân tích chi tiết metadata structure
print("🔍 ANALYZING METADATA STRUCTURE\n")

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 20
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    # Collect all metadata fields
    all_fields = set()
    source_examples = []
    title_examples = []
    
    for (metadata,) in samples:
        if metadata:
            all_fields.update(metadata.keys())
            if 'source' in metadata:
                source_examples.append(metadata['source'])
            if 'title' in metadata:
                title_examples.append(metadata['title'])
    
    print("📋 All metadata fields found:")
    for field in sorted(all_fields):
        print(f"   - {field}")
    
    print("\n📂 Sample 'source' field (first 10):")
    for i, src in enumerate(source_examples[:10], 1):
        print(f"   {i}. {src}")
    
    print("\n📄 Sample 'title' field (first 10):")
    for i, title in enumerate(title_examples[:10], 1):
        print(f"   {i}. {title}")
    
    print("\n" + "="*60)
    print("💡 OBSERVATION:")
    print("="*60)
    print("- Data comes from PDF files, NOT web-crawled documents")
    print("- 'source' contains file paths")
    print("- 'title' contains document titles")
    print("- NO 'url' field available")
    print("="*60)

🔍 ANALYZING METADATA STRUCTURE

📋 All metadata fields found:
   - author
   - char_count
   - chunk_id
   - chunk_level
   - chunking_strategy
   - chuong
   - cleaned
   - crawled_at
   - creationdate
   - creator
   - dieu
   - file_size
   - file_type
   - has_diem
   - has_khoan
   - hierarchy
   - is_within_token_limit
   - moddate
   - page
   - page_label
   - producer
   - quality_flags
   - readability_score
   - section
   - semantic_tags
   - source
   - source_file
   - status
   - structure_score
   - title
   - token_count
   - token_ratio
   - total_pages
   - url
   - valid_until

📂 Sample 'source' field (first 10):
   1. /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   2. /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf
   3. /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   4. /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf
   5. /home/sakana/Code/RAG-bidding/ap

In [12]:
# Lấy 5 sample documents
with conn.cursor() as cur:
    cur.execute("""
        SELECT 
            id, 
            LEFT(document, 100) as doc_preview,
            cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    print("📄 Sample Documents và Metadata:\n")
    for i, (doc_id, doc_preview, metadata) in enumerate(samples, 1):
        print(f"--- Document {i} ---")
        print(f"ID: {doc_id}")
        print(f"Preview: {doc_preview}...")
        print(f"Metadata keys: {list(metadata.keys()) if metadata else 'None'}")
        
        # Check if already has status
        if metadata and 'status' in metadata:
            print(f"⚠️  ALREADY HAS STATUS: {metadata['status']}")
        else:
            print(f"✅ No status field (need to add)")
        
        # Show URL for verification
        if metadata and 'url' in metadata:
            print(f"URL: {metadata['url']}")
        
        print()

📄 Sample Documents và Metadata:

--- Document 1 ---
ID: 8b4824ec-a6fa-4535-8a57-3f5f20b8650d
Preview: HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
------------------- 
 
KHOA CƠ BẢN 
 
 
BÀI GIẢNG 
TƯ TƯ...
Metadata keys: ['page', 'title', 'author', 'source', 'cleaned', 'creator', 'moddate', 'producer', 'file_size', 'file_type', 'page_label', 'total_pages', 'creationdate']
✅ No status field (need to add)

--- Document 2 ---
ID: 664eb9bc-a2d6-4bd1-b2c8-441c679af8e0
Preview: 1 
MỤC LỤC 
 
LỜI NÓI ĐẦU... 
Chƣơng mở đầu: ĐỐI TƢỢNG, PHƢƠNG PHÁP NGHIÊN CỨU VÀ Ý NGHĨA HỌC 
TẬP M...
Metadata keys: ['page', 'title', 'author', 'source', 'cleaned', 'creator', 'moddate', 'producer', 'file_size', 'file_type', 'page_label', 'total_pages', 'creationdate']
✅ No status field (need to add)

--- Document 3 ---
ID: 4b94d8ee-2cca-4a5d-ab4c-308a7f026380
Preview: TRIỂN TƢ TƢỞNG HỒ CHÍ MINH 
 I. Cơ sở hình thành tƣ tƣởng Hồ Chí Minh ... 
 1.Cơ sở khách quan 
 2.N...
Metadata keys: ['page', 'title', 'author', '

## Bước 5: Define Logic Xác định Status

In [30]:
def parse_year_from_source_or_title(metadata: dict) -> int | None:
    """
    Extract year from various sources:
    1. URL (for web-scraped legal docs): 'Luat-Dau-thau-2023-22-2023-QH15'
    2. Source filename (for PDFs): 'Tư-tưởng-2016.pdf' or 'CNPM 2020 final.pdf'
    3. Title field
    4. Creation date
    """
    # Priority 1: URL field (for legal documents)
    url = metadata.get("url", "")
    if url:
        # Pattern: Luat-Dau-thau-2023-22-2023-QH15
        match = re.search(r'Luat.*-(\d{4})-', url)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern: Nghi-dinh-214-2025-ND-CP
        match = re.search(r'-(\d{4})-', url)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
    
    # Priority 2: Source field (filename for PDFs)
    source = metadata.get("source", "")
    if source:
        # Pattern 1: -YYYY.pdf or _YYYY.pdf
        match = re.search(r'[-_](\d{4})\.pdf', source)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern 2: space+YYYY+space/end (e.g., "CNPM 2020 final.pdf")
        match = re.search(r'\s(\d{4})[\s\.]', source)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern 3: any 4-digit year in filename
        match = re.search(r'\b(19\d{2}|20[0-3]\d)\b', source)
        if match:
            year = int(match.group(1))
            return year
    
    # Priority 3: Title field
    title = metadata.get("title", "")
    if title:
        match = re.search(r'\b(19\d{2}|20[0-3]\d)\b', title)
        if match:
            return int(match.group(1))
    
    # Priority 4: Creation date (D:20160101120000)
    creationdate = metadata.get("creationdate", "")
    if creationdate:
        match = re.search(r'D:(\d{4})', creationdate)
        if match:
            return int(match.group(1))
    
    return None


def determine_status_and_validity(metadata: dict) -> tuple[str, str]:
    """
    Determine document status and valid_until date.
    
    Rules:
    - Legal documents (has 'url' field):
      * Luật (Law): 5 years validity
      * Nghị định (Decree): 2 years validity
      * Thông tư (Circular): 2 years validity
    - Educational materials (PDF files): 5 years validity
    
    Returns:
        (status, valid_until_iso_string)
    """
    year = parse_year_from_source_or_title(metadata)
    
    if not year:
        # No year found: mark as expired
        return "expired", "2024-01-01"
    
    current_year = datetime.now().year
    url = metadata.get("url", "")
    title = metadata.get("title", "")
    
    # Determine document type and validity period
    if url and ("Luat" in url or "Luật" in title):
        # Law: 5 years validity
        valid_until = datetime(year + 5, 12, 31)
    elif url and ("Nghi-dinh" in url or "Nghị định" in title):
        # Decree: 2 years validity
        valid_until = datetime(year + 2, 12, 31)
    elif url and ("Thong-tu" in url or "Thông tư" in title):
        # Circular: 2 years validity
        valid_until = datetime(year + 2, 12, 31)
    else:
        # Educational materials or unknown: 5 years default
        valid_until = datetime(year + 5, 12, 31)
    
    # Check if still valid
    status = "active" if valid_until.year >= current_year else "expired"
    
    return status, valid_until.strftime("%Y-%m-%d")

print("✅ Functions defined successfully")
print("📚 Supports both:")
print("   - Legal documents (Luật: 5yr, Nghị định/Thông tư: 2yr)")
print("   - Educational materials (5yr)")
print("🔍 Parses year from URL, filename, title, or creation date")

✅ Functions defined successfully
📚 Supports both:
   - Legal documents (Luật: 5yr, Nghị định/Thông tư: 2yr)
   - Educational materials (5yr)
🔍 Parses year from URL, filename, title, or creation date


## Bước 6: Test Logic trên Sample Documents

In [18]:
# Test trên 20 sample documents để cover different files
print("🧪 Testing status detection logic:\n")

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 20
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    for i, (doc_id, metadata) in enumerate(samples, 1):
        if not metadata:
            print(f"{i}. No metadata")
            continue
        
        source = metadata.get('source', 'N/A')
        title = metadata.get('title', 'N/A')
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        # Extract year
        year = parse_year_from_source_or_title(metadata)
        
        # Extract filename from source
        filename = source.split('/')[-1] if source != 'N/A' else 'N/A'
        
        print(f"{i}. Year: {year} → Status: {status}, Valid until: {valid_until}")
        print(f"   File: {filename}")
        print()

🧪 Testing status detection logic:

1. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

2. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

3. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

4. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

5. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

6. Year: None → Status: expired, Valid until: 2024-01-01
   File: BG HP TTTN 2 CNPM 2020 final.pdf

7. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

8. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

9. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

10. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2

## Bước 7: DRY RUN - Xem sẽ update như thế nào (KHÔNG thực sự update)

⚠️ **Cell này KHÔNG update database**, chỉ SHOW kết quả

In [32]:
# DRY RUN: Simulate update without actually updating
active_count = 0
expired_count = 0
already_has_status_count = 0
no_metadata_count = 0

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, cmetadata 
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    all_docs = cur.fetchall()
    
    print(f"📊 DRY RUN - Analyzing {len(all_docs)} documents...\n")
    
    for doc_id, metadata in all_docs:
        if not metadata:
            no_metadata_count += 1
            continue
        
        # Skip if already has status
        if 'status' in metadata:
            already_has_status_count += 1
            continue
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        if status == "active":
            active_count += 1
        else:
            expired_count += 1

print("=" * 60)
print("📊 DRY RUN RESULTS:")
print("=" * 60)
print(f"Total documents: {len(all_docs)}")
print(f"✅ Will mark as ACTIVE: {active_count}")
print(f"❌ Will mark as EXPIRED: {expired_count}")
print(f"⏭️  Already has status (skip): {already_has_status_count}")
print(f"⚠️  No metadata (skip): {no_metadata_count}")
print("=" * 60)
print(f"\n⚠️  Documents to update: {active_count + expired_count}")
print(f"💡 No changes made yet. Review results before proceeding to Bước 8.")

📊 DRY RUN - Analyzing 2103 documents...

📊 DRY RUN RESULTS:
Total documents: 2103
✅ Will mark as ACTIVE: 1327
❌ Will mark as EXPIRED: 776
⏭️  Already has status (skip): 0
⚠️  No metadata (skip): 0

⚠️  Documents to update: 2103
💡 No changes made yet. Review results before proceeding to Bước 8.


## Bước 7a: Phân tích Year Distribution

In [31]:
# Phân tích distribution of years
from collections import Counter

year_distribution = Counter()
files_by_year = {}

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    all_docs = cur.fetchall()
    
    for (metadata,) in all_docs:
        if not metadata:
            continue
        
        year = parse_year_from_source_or_title(metadata)
        year_distribution[year] += 1
        
        # Track unique filenames per year
        source = metadata.get('source', '')
        if source:
            filename = source.split('/')[-1]
            if year not in files_by_year:
                files_by_year[year] = set()
            files_by_year[year].add(filename)

print("=" * 60)
print("📊 YEAR DISTRIBUTION:")
print("=" * 60)

# Sort with None at the end
sorted_years = sorted([y for y in year_distribution.keys() if y is not None]) + [None]

for year in sorted_years:
    count = year_distribution[year]
    year_label = year if year else "No year"
    files = len(files_by_year.get(year, set()))
    validity_end = year + 5 if year else None
    status = "✅ ACTIVE" if validity_end and validity_end >= 2025 else "❌ EXPIRED"
    
    print(f"{year_label}: {count} docs ({files} files) - Valid until {validity_end} - {status}")

print("=" * 60)
print(f"\n💡 Current year: 2025")
print(f"📚 Materials published in 2020 or later are ACTIVE (5-year validity)")
print(f"📚 Materials published before 2020 are EXPIRED")

📊 YEAR DISTRIBUTION:
2016: 776 docs (1 files) - Valid until 2021 - ❌ EXPIRED
2020: 482 docs (1 files) - Valid until 2025 - ✅ ACTIVE
2023: 272 docs (1 files) - Valid until 2028 - ✅ ACTIVE
2024: 88 docs (1 files) - Valid until 2029 - ✅ ACTIVE
2025: 485 docs (1 files) - Valid until 2030 - ✅ ACTIVE
No year: 0 docs (0 files) - Valid until None - ❌ EXPIRED

💡 Current year: 2025
📚 Materials published in 2020 or later are ACTIVE (5-year validity)
📚 Materials published before 2020 are EXPIRED


In [26]:
# Inspect files with "No year"
print("\n🔍 FILES WITHOUT DETECTED YEAR:")
print("=" * 60)
for filename in sorted(files_by_year.get(None, set())):
    print(f"   - {filename}")

print("\n💡 Observation:")
print("File 'BG HP TTTN 2 CNPM 2020 final.pdf' has '2020' in name")
print("but our regex pattern requires '-YYYY.pdf' format")
print("\nLet's improve the regex pattern...")


🔍 FILES WITHOUT DETECTED YEAR:
   - thuvienphapluat.vn

💡 Observation:
File 'BG HP TTTN 2 CNPM 2020 final.pdf' has '2020' in name
but our regex pattern requires '-YYYY.pdf' format

Let's improve the regex pattern...


In [28]:
# Inspect thuvienphapluat.vn documents
print("\n🔍 INSPECTING 'thuvienphapluat.vn' DOCUMENTS:")
print("=" * 60)

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' LIKE %s
        LIMIT 5
    """, (collection_uuid, '%thuvienphapluat.vn%'))
    
    samples = cur.fetchall()
    
    for i, (metadata,) in enumerate(samples, 1):
        print(f"\n--- Document {i} ---")
        print(f"Source: {metadata.get('source', 'N/A')}")
        print(f"Title: {metadata.get('title', 'N/A')[:100]}...")
        
        # Check all fields
        print(f"All fields: {list(metadata.keys())}")
        
        # Check for any date-related fields
        for key, value in metadata.items():
            if any(date_keyword in key.lower() for date_keyword in ['date', 'time', 'year', 'created', 'modified']):
                print(f"  {key}: {value}")


🔍 INSPECTING 'thuvienphapluat.vn' DOCUMENTS:

--- Document 1 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...
All fields: ['url', 'dieu', 'title', 'chuong', 'source', 'section', 'chunk_id', 'has_diem', 'has_khoan', 'hierarchy', 'char_count', 'crawled_at', 'chunk_level', 'source_file', 'token_count', 'token_ratio', 'quality_flags', 'semantic_tags', 'structure_score', 'chunking_strategy', 'readability_score', 'is_within_token_limit']

--- Document 2 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...
All fields: ['url', 'dieu', 'khoan', 'title', 'chuong', 'source', 'section', 'chunk_id', 'has_diem', 'has_khoan', 'hierarchy', 'char_count', 'crawled_at', 'chunk_level', 'parent_dieu', 'source_file', 'token_count', 'token_ratio', 'quality_flags', 'semantic_tags', 'structure_score', 'chunking_strategy', 'readability_score', 'is_within_token_limit']

--- Document 3 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...
All fields: ['

In [29]:
# Check URL examples for thuvienphapluat.vn docs
print("\n📋 SAMPLE URLs:")
print("=" * 60)

with conn.cursor() as cur:
    cur.execute("""
        SELECT DISTINCT cmetadata->>'url' as url
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 10
    """, (collection_uuid,))
    
    urls = cur.fetchall()
    
    for i, (url,) in enumerate(urls, 1):
        print(f"{i}. {url}")


📋 SAMPLE URLs:
1. https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805.aspx
2. https://thuvienphapluat.vn/van-ban/Dau-tu/Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157.aspx
3. https://thuvienphapluat.vn/van-ban/Dau-tu/Thong-tu-22-2024-TT-BKHDT-cung-cap-thong-tin-ve-lua-chon-nha-thau-tren-He-thong-mang-dau-thau-quoc-gia-619403.aspx


## ⏸️ PAUSE - Review Dry Run Results

**Trước khi chạy Bước 8:**
1. ✅ Check dry run numbers có hợp lý không
2. ✅ Verify active/expired ratio
3. ✅ Confirm không có documents nào bị miss

**Nếu OK → Chạy Bước 8**

## Bước 8: 🚨 BULK UPDATE - Thực sự update database

⚠️ **CRITICAL: Cell này sẽ UPDATE database!**

Chỉ chạy sau khi:
- ✅ Review dry run results
- ✅ Confirm active/expired numbers
- ✅ Ready to commit changes

In [34]:
# 🚨 BULK UPDATE - Uncomment and run to execute
# ⚠️  Remove the triple quotes to enable this cell

from psycopg.types.json import Json

updated_count = 0
active_count = 0
expired_count = 0

with conn.cursor() as cur:
    # Get all documents
    cur.execute(
        "SELECT id, cmetadata FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    
    all_docs = cur.fetchall()
    total_docs = len(all_docs)
    
    print(f"🚀 Starting bulk update of {total_docs} documents...")
    
    for i, (doc_id, metadata) in enumerate(all_docs, 1):
        # Skip if no metadata or already has status
        if not metadata or 'status' in metadata:
            continue
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        # Update metadata
        metadata["status"] = status
        metadata["valid_until"] = valid_until
        
        # Update database (wrap dict in Json())
        cur.execute(
            "UPDATE langchain_pg_embedding SET cmetadata = %s WHERE id = %s",
            (Json(metadata), doc_id)
        )
        
        updated_count += 1
        if status == "active":
            active_count += 1
        else:
            expired_count += 1
        
        # Progress
        if updated_count % 100 == 0:
            print(f"⏳ Updated {updated_count}/{total_docs} documents...")
    
    # COMMIT changes
    conn.commit()
    
    print("\n" + "=" * 60)
    print("✅ BULK UPDATE COMPLETE!")
    print("=" * 60)
    print(f"📝 Total updated: {updated_count}")
    print(f"✅ Active documents: {active_count}")
    print(f"❌ Expired documents: {expired_count}")
    print("=" * 60)


print("⚠️  Cell is commented out for safety.")
print("💡 Remove triple quotes to enable update.")

🚀 Starting bulk update of 2103 documents...
⏳ Updated 100/2103 documents...
⏳ Updated 200/2103 documents...
⏳ Updated 300/2103 documents...
⏳ Updated 400/2103 documents...
⏳ Updated 500/2103 documents...
⏳ Updated 600/2103 documents...
⏳ Updated 700/2103 documents...
⏳ Updated 800/2103 documents...
⏳ Updated 900/2103 documents...
⏳ Updated 1000/2103 documents...
⏳ Updated 1100/2103 documents...
⏳ Updated 1200/2103 documents...
⏳ Updated 1300/2103 documents...
⏳ Updated 1400/2103 documents...
⏳ Updated 1500/2103 documents...
⏳ Updated 1600/2103 documents...
⏳ Updated 1700/2103 documents...
⏳ Updated 1800/2103 documents...
⏳ Updated 1900/2103 documents...
⏳ Updated 2000/2103 documents...
⏳ Updated 2100/2103 documents...

✅ BULK UPDATE COMPLETE!
📝 Total updated: 2103
✅ Active documents: 1327
❌ Expired documents: 776
⚠️  Cell is commented out for safety.
💡 Remove triple quotes to enable update.

✅ BULK UPDATE COMPLETE!
📝 Total updated: 2103
✅ Active documents: 1327
❌ Expired documents: 776

## Bước 9: Verify Update - Kiểm tra kết quả

In [36]:
# Verify update results
with conn.cursor() as cur:
    # Count by status
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("=" * 60)
    print("📊 STATUS BREAKDOWN:")
    print("=" * 60)
    for status, count in results:
        emoji = "✅" if status == "active" else "❌" if status == "expired" else "❓"
        print(f"{emoji} {status or 'NULL'}: {count} documents")
    print("=" * 60)
    
    # Sample documents with new metadata
    print("\n📄 Sample documents with metadata:")
    cur.execute("""
        SELECT 
            LEFT(document, 60) as doc_preview,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until,
            cmetadata->>'url' as url,
            cmetadata->>'source' as source
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, status, valid_until, url, source) in enumerate(samples, 1):
        print(f"\n{i}. {doc}...")
        print(f"   Status: {status}, Valid until: {valid_until}")
        if url:
            print(f"   URL: {url[:60]}...")
        else:
            print(f"   Source: {source}")

📊 STATUS BREAKDOWN:
❌ expired: 776 documents
✅ active: 1327 documents

📄 Sample documents with metadata:

1. HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
----------------...
   Status: expired, Valid until: 2021-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf

2. Chương 4. Thiết kế
• Các lớp thực thể liên quan.
102
Hình 4....
   Status: active, Valid until: 2025-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf

3. vị cao nhất là dân, vì dân là chủ”74. 
 
 
86 Hồ C hí Minh, ...
   Status: expired, Valid until: 2021-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf

4. required /></td>
</tr>
<tr>
<td>Mật khẩu:</td>
<td><input ty...
   Status: active, Valid until: 2025-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf

5. String sqlDiem = "{call DiemcuaDK(?)}";// su dung stored pro...
   Status: active, Valid until: 2025-12

## Bước 9a: Verify Legal Documents (thuvienphapluat.vn)

In [37]:
# Check legal documents status
print("🔍 LEGAL DOCUMENTS STATUS:")
print("=" * 60)

with conn.cursor() as cur:
    # Get status breakdown for legal docs
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("Legal documents status:")
    for status, count in results:
        emoji = "✅" if status == "active" else "❌"
        print(f"  {emoji} {status}: {count} documents")
    
    # Sample legal documents
    print("\n📄 Sample legal documents:")
    cur.execute("""
        SELECT 
            cmetadata->>'url' as url,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (url, status, valid_until) in enumerate(samples, 1):
        # Extract doc type and year from URL
        doc_name = url.split('/')[-1].split('.')[0] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n{i}. {emoji} {doc_name}")
        print(f"   Status: {status}, Valid until: {valid_until}")
        print(f"   URL: {url[:80]}...")

🔍 LEGAL DOCUMENTS STATUS:
Legal documents status:
  ✅ active: 845 documents

📄 Sample legal documents:

1. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

2. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

3. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

4. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

5. ✅ Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157
   Status: active, Valid until: 2030-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/N

## Bước 10: 🗑️ XÓA TÀI LIỆU KHÔNG LIÊN QUAN ĐẾN ĐẤU THẦU

⚠️ **CRITICAL**: Các cells dưới đây sẽ XÓA VĨNH VIỄN tài liệu từ database!

**Tài liệu sẽ bị xóa:**
- Educational PDFs (textbooks, course materials) - 1,258 documents
- Ví dụ: "Tư tưởng Hồ Chí Minh 2016.pdf", "BG HP TTTN 2 CNPM 2020.pdf"

**Tài liệu được GIỮ LẠI:**
- Legal documents từ thuvienphapluat.vn - 845 documents
- Ví dụ: Luật Đấu thầu 2023, Nghị định 214-2025, Thông tư

**Workflow:**
1. ✅ Phân tích documents để xóa
2. ✅ DRY RUN - Xem sẽ xóa những gì (KHÔNG thực sự xóa)
3. ⏸️ **PAUSE** - Review trước khi xóa
4. 🚨 BULK DELETE - Thực sự xóa (cần confirm)

## Bước 10a: Phân tích Documents theo Loại

In [5]:
# Phân tích documents theo nguồn
print("🔍 PHÂN TÍCH DOCUMENTS THEO NGUỒN:")
print("=" * 60)

with conn.cursor() as cur:
    # Count by source type
    cur.execute("""
        SELECT 
            cmetadata->>'source' as source,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'source'
        ORDER BY count DESC
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    legal_count = 0
    pdf_count = 0
    pdf_files = set()
    
    for source, count in results:
        if source == 'thuvienphapluat.vn':
            legal_count = count
            print(f"📜 Legal documents (thuvienphapluat.vn): {count} documents")
        else:
            pdf_count += count
            pdf_files.add(source)
    
    print(f"📚 Educational PDFs: {pdf_count} documents from {len(pdf_files)} files")
    
    print("\n" + "=" * 60)
    print(f"Total: {legal_count + pdf_count} documents")
    print("=" * 60)
    
    # Show sample PDF filenames
    print("\n📂 Sample PDF files to be deleted:")
    cur.execute("""
        SELECT DISTINCT cmetadata->>'source' as source
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
        LIMIT 10
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (source,) in enumerate(samples, 1):
        filename = source.split('/')[-1]
        print(f"   {i}. {filename}")

🔍 PHÂN TÍCH DOCUMENTS THEO NGUỒN:
📜 Legal documents (thuvienphapluat.vn): 845 documents
📚 Educational PDFs: 1258 documents from 2 files

Total: 2103 documents

📂 Sample PDF files to be deleted:
   1. Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   2. BG HP TTTN 2 CNPM 2020 final.pdf


## Bước 10b: DRY RUN - Xem sẽ xóa gì (KHÔNG thực sự xóa)

In [6]:
# DRY RUN: Xem documents sẽ bị xóa
print("🔍 DRY RUN - Analyzing documents to delete...")
print("=" * 60)

with conn.cursor() as cur:
    # Count documents to delete (PDFs)
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    to_delete_count = cur.fetchone()[0]
    
    # Count documents to keep (legal)
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    to_keep_count = cur.fetchone()[0]
    
    # Get total
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    total_count = cur.fetchone()[0]
    
    print(f"📊 DELETION PLAN:")
    print(f"   Total documents: {total_count}")
    print(f"   🗑️  Will DELETE (Educational PDFs): {to_delete_count}")
    print(f"   ✅ Will KEEP (Legal documents): {to_keep_count}")
    print("=" * 60)
    
    # Show sample documents to delete
    print("\n📄 Sample documents that will be DELETED:")
    cur.execute("""
        SELECT 
            LEFT(document, 80) as doc_preview,
            cmetadata->>'source' as source,
            cmetadata->>'title' as title
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, source, title) in enumerate(samples, 1):
        filename = source.split('/')[-1] if source else 'N/A'
        print(f"\n{i}. File: {filename}")
        print(f"   Title: {title[:60] if title else 'N/A'}...")
        print(f"   Preview: {doc}...")
    
    # Show sample documents to keep
    print("\n" + "=" * 60)
    print("✅ Sample documents that will be KEPT:")
    cur.execute("""
        SELECT 
            LEFT(document, 80) as doc_preview,
            cmetadata->>'url' as url,
            cmetadata->>'status' as status
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 3
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, url, status) in enumerate(samples, 1):
        doc_name = url.split('/')[-1].split('.')[0][:40] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n{i}. {emoji} {doc_name}")
        print(f"   Preview: {doc}...")

print("\n" + "=" * 60)
print("⚠️  No changes made yet. Review carefully before proceeding!")
print("=" * 60)

🔍 DRY RUN - Analyzing documents to delete...
📊 DELETION PLAN:
   Total documents: 2103
   🗑️  Will DELETE (Educational PDFs): 1258
   ✅ Will KEEP (Legal documents): 845

📄 Sample documents that will be DELETED:

1. File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   Title: §Ò thi chÝnh trÞ cuèi khãa khèi ®¹i häc n¨m häc 2006 – 2007...
   Preview: HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
------------------- 
 
KHOA CƠ BẢN 
...

2. File: BG HP TTTN 2 CNPM 2020 final.pdf
   Title: N/A...
   Preview: Chương 4. Thiết kế
• Các lớp thực thể liên quan.
102
Hình 4.7: Thiết kế giao diệ...

3. File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   Title: §Ò thi chÝnh trÞ cuèi khãa khèi ®¹i häc n¨m häc 2006 – 2007...
   Preview: vị cao nhất là dân, vì dân là chủ”74. 
 
 
86 Hồ C hí Minh, toàn tập, nxb Chính ...

4. File: BG HP TTTN 2 CNPM 2020 final.pdf
   Title: N/A...
   Preview: required /></td>
</tr>
<tr>
<td>Mật khẩu:</td>
<td><input type="password" name="...

5. File: BG HP TTTN 2 CNPM 2020 final.pdf
   Title: N/A...

## ⏸️ PAUSE - Review Deletion Plan

**Trước khi chạy cell xóa:**
1. ✅ Verify số lượng documents to delete/keep
2. ✅ Check sample documents to ensure không xóa nhầm legal docs
3. ✅ Backup database nếu cần (recommended)
4. ✅ Confirm sẵn sàng xóa vĩnh viễn

**⚠️ LƯU Ý:**
- Documents bị xóa KHÔNG THỂ KHÔI PHỤC (trừ khi có backup)
- Embeddings cũng sẽ bị xóa theo
- Nên backup trước: `pg_dump ragdb > backup_before_delete.sql`

**Nếu OK → Chạy Bước 10c**

## Bước 10c: 🚨 BULK DELETE - Xóa tài liệu không liên quan

⚠️ **CRITICAL: Cell này sẽ XÓA VĨNH VIỄN documents từ database!**

Chỉ chạy sau khi:
- ✅ Review dry run results
- ✅ Confirm numbers are correct
- ✅ Backup database (recommended)
- ✅ Ready to permanently delete educational PDFs

In [7]:
# 🚨 BULK DELETE - Uncomment and run to execute
# ⚠️  Remove the triple quotes to enable this cell


deleted_count = 0

with conn.cursor() as cur:
    # Count before deletion
    cur.execute(
        "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    before_count = cur.fetchone()[0]
    
    print(f"🗑️  Starting deletion process...")
    print(f"📊 Documents before: {before_count}")
    
    # DELETE all educational PDFs (source != 'thuvienphapluat.vn')
    cur.execute("""
        DELETE FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    deleted_count = cur.rowcount
    
    # Count after deletion
    cur.execute(
        "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    after_count = cur.fetchone()[0]
    
    # COMMIT changes
    conn.commit()
    
    print("\n" + "=" * 60)
    print("✅ BULK DELETE COMPLETE!")
    print("=" * 60)
    print(f"📊 Documents before: {before_count}")
    print(f"🗑️  Documents deleted: {deleted_count}")
    print(f"✅ Documents remaining: {after_count}")
    print("=" * 60)
    print(f"\n💡 Kept only legal documents from thuvienphapluat.vn")
    print(f"🎯 Database now contains ONLY bidding law-related documents")


print("⚠️  Cell is commented out for safety.")
print("💡 Remove triple quotes to enable deletion.")
print("⚠️  Make sure you've reviewed the dry-run results first!")

🗑️  Starting deletion process...
📊 Documents before: 2103

✅ BULK DELETE COMPLETE!
📊 Documents before: 2103
🗑️  Documents deleted: 1258
✅ Documents remaining: 845

💡 Kept only legal documents from thuvienphapluat.vn
🎯 Database now contains ONLY bidding law-related documents
⚠️  Cell is commented out for safety.
💡 Remove triple quotes to enable deletion.
⚠️  Make sure you've reviewed the dry-run results first!


## Bước 10d: Verify Deletion - Kiểm tra sau khi xóa

In [8]:
# Verify deletion results
print("🔍 VERIFICATION AFTER DELETION:")
print("=" * 60)

with conn.cursor() as cur:
    # Total count
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    total_count = cur.fetchone()[0]
    print(f"📊 Total documents remaining: {total_count}")
    
    # Count by source (should only have thuvienphapluat.vn)
    cur.execute("""
        SELECT 
            cmetadata->>'source' as source,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'source'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("\n📋 Documents by source:")
    for source, count in results:
        print(f"   - {source}: {count} documents")
    
    # Status breakdown
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("\n📊 Status breakdown:")
    for status, count in results:
        emoji = "✅" if status == "active" else "❌"
        print(f"   {emoji} {status}: {count} documents")
    
    # Sample remaining documents
    print("\n📄 Sample remaining documents:")
    cur.execute("""
        SELECT 
            cmetadata->>'url' as url,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (url, status, valid_until) in enumerate(samples, 1):
        doc_name = url.split('/')[-1].split('.')[0][:50] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n   {i}. {emoji} {doc_name}")
        print(f"      Status: {status}, Valid until: {valid_until}")

print("\n" + "=" * 60)
print("✅ Verification complete!")
print("🎯 Database now contains ONLY legal documents from thuvienphapluat.vn")
print("=" * 60)

🔍 VERIFICATION AFTER DELETION:
📊 Total documents remaining: 845

📋 Documents by source:
   - thuvienphapluat.vn: 845 documents

📊 Status breakdown:
   ✅ active: 845 documents

📄 Sample remaining documents:

   1. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   2. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   3. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   4. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   5. ✅ Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-v
      Status: active, Valid until: 2030-12-31

✅ Verification complete!
🎯 Database now contains ONLY legal documents from thuvienphapluat.vn
