# Adding Metadata to Already-Embedded Documents

## Goal
Add `status` (active/expired) and `valid_until` to each document's metadata **WITHOUT re-embedding**.

## ⚠️ IMPORTANT NOTES
- **Run one cell at a time** and check the result of each
- **Do NOT run the bulk update cell** until verification is complete
- Back up the database before updating (optional but recommended)
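
The backup recommended above does not have to be a full dump. A minimal sketch, assuming it is enough to snapshot only the `id` and `cmetadata` columns into a side table; the helper name `backup_sql` and the backup table naming scheme are hypothetical and not part of this notebook:

```python
import re
from datetime import date
from typing import Optional


def backup_sql(table: str = "langchain_pg_embedding", on: Optional[date] = None) -> str:
    """Build a CREATE TABLE ... AS SELECT statement that copies id + cmetadata."""
    # Table names cannot be passed as query parameters, so validate before formatting.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    stamp = (on or date.today()).strftime("%Y%m%d")
    return (
        f"CREATE TABLE {table}_meta_backup_{stamp} AS "
        f"SELECT id, cmetadata FROM {table}"
    )


print(backup_sql(on=date(2025, 1, 2)))
```

The resulting statement could be run through an open cursor and committed before the bulk update step.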

## Workflow
1. ✅ Connect to the database and check the schema
2. ✅ Inspect a sample of the current metadata
3. ✅ Test the status-detection logic on a single document
4. ✅ Dry run: see what would be updated (NO actual update)
5. ⏸️ **PAUSE** - Review the dry-run results
6. ✅ Bulk update (after confirmation)
7. ✅ Verify the results

## Step 1: Import Libraries and Setup

In [1]:
import psycopg
import json
import re
from datetime import datetime
from typing import Dict, Tuple
import pandas as pd

# Load config
import sys
sys.path.append('/home/sakana/Code/RAG-bidding')
from src.config.models import settings

print("✅ Libraries imported successfully")
print(f"📊 Database: {settings.database_url.split('@')[1] if '@' in settings.database_url else 'hidden'}")
print(f"📦 Collection: {settings.collection}")

✅ Libraries imported successfully
📊 Database: localhost:5432/rag_bidding_v2
📦 Collection: docs


## Step 2: Connect to the Database and Check the Schema

In [3]:
# Connect to the database
dsn = settings.database_url.replace("postgresql+psycopg", "postgresql")

try:
    conn = psycopg.connect(dsn)
    print("✅ Database connected successfully!")
    
    # Check the schema
    with conn.cursor() as cur:
        # 1. Check the tables
        cur.execute("""
            SELECT table_name 
            FROM information_schema.tables 
            WHERE table_schema = 'public' 
            AND table_name LIKE '%langchain%'
        """)
        tables = cur.fetchall()
        print(f"\n📊 PGVector Tables:")
        for table in tables:
            print(f"   - {table[0]}")
        
        # 2. Check the columns of langchain_pg_embedding
        cur.execute("""
            SELECT column_name, data_type 
            FROM information_schema.columns 
            WHERE table_name = 'langchain_pg_embedding'
            ORDER BY ordinal_position
        """)
        columns = cur.fetchall()
        print(f"\n📋 Columns in langchain_pg_embedding:")
        for col_name, col_type in columns:
            print(f"   - {col_name}: {col_type}")
            
except Exception as e:
    print(f"❌ Connection error: {e}")
    conn = None

✅ Database connected successfully!

📊 PGVector Tables:
   - langchain_pg_collection
   - langchain_pg_embedding

📋 Columns in langchain_pg_embedding:
   - id: character varying
   - collection_id: uuid
   - embedding: USER-DEFINED
   - document: character varying
   - cmetadata: jsonb


## Step 3: Get the Collection UUID and Count Documents

In [4]:
with conn.cursor() as cur:
    # Get collection UUID
    cur.execute(
        "SELECT uuid FROM langchain_pg_collection WHERE name = %s",
        (settings.collection,)
    )
    result = cur.fetchone()
    
    if not result:
        print(f"❌ Collection '{settings.collection}' not found!")
        collection_uuid = None
    else:
        collection_uuid = result[0]
        print(f"✅ Collection found: {settings.collection}")
        print(f"   UUID: {collection_uuid}")
        
        # Count documents
        cur.execute(
            "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
            (collection_uuid,)
        )
        doc_count = cur.fetchone()[0]
        print(f"\n📊 Total documents: {doc_count}")

✅ Collection found: docs
   UUID: b625f353-708f-41a4-8580-6e7523325aba

📊 Total documents: 845


## Step 4: Inspect a Sample of the Current Metadata

## Step 4a: Detailed Analysis of the Metadata Structure

In [5]:
# Detailed analysis of the metadata structure
print("🔍 ANALYZING METADATA STRUCTURE\n")

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 20
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    # Collect all metadata fields
    all_fields = set()
    source_examples = []
    title_examples = []
    
    for (metadata,) in samples:
        if metadata:
            all_fields.update(metadata.keys())
            if 'source' in metadata:
                source_examples.append(metadata['source'])
            if 'title' in metadata:
                title_examples.append(metadata['title'])
    
    print("📋 All metadata fields found:")
    for field in sorted(all_fields):
        print(f"   - {field}")
    
    print("\n📂 Sample 'source' field (first 10):")
    for i, src in enumerate(source_examples[:10], 1):
        print(f"   {i}. {src}")
    
    print("\n📄 Sample 'title' field (first 10):")
    for i, title in enumerate(title_examples[:10], 1):
        print(f"   {i}. {title}")
    
    # Derive the observations from the sample instead of hard-coding them
    print("\n" + "="*60)
    print("💡 OBSERVATION:")
    print("="*60)
    print(f"- Distinct 'source' values in this sample: {sorted(set(source_examples))}")
    print(f"- Distinct 'title' values in this sample: {len(set(title_examples))}")
    print(f"- 'url' field present in this sample: {'url' in all_fields}")
    print("="*60)

🔍 ANALYZING METADATA STRUCTURE

📋 All metadata fields found:
   - char_count
   - chunk_id
   - chunk_level
   - chunking_strategy
   - chuong
   - crawled_at
   - dieu
   - has_diem
   - has_khoan
   - hierarchy
   - is_within_token_limit
   - khoan
   - parent_dieu
   - quality_flags
   - readability_score
   - section
   - semantic_tags
   - source
   - source_file
   - status
   - structure_score
   - title
   - token_count
   - token_ratio
   - url
   - valid_until

📂 Sample 'source' field (first 10):
   1. thuvienphapluat.vn
   2. thuvienphapluat.vn
   3. thuvienphapluat.vn
   4. thuvienphapluat.vn
   5. thuvienphapluat.vn
   6. thuvienphapluat.vn
   7. thuvienphapluat.vn
   8. thuvienphapluat.vn
   9. thuvienphapluat.vn
   10. thuvienphapluat.vn

📄 Sample 'title' field (first 10):
   1. Nội dung từ thuvienphapluat.vn
   2. Nội dung từ thuvienphapluat.vn
   3. Nội dung từ thuvienphapluat.vn
   4. Nội dung từ thuvienphapluat.vn
   5. Nội dung từ thu

In [12]:
# Fetch 5 sample documents
with conn.cursor() as cur:
    cur.execute("""
        SELECT 
            id, 
            LEFT(document, 100) as doc_preview,
            cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    print("📄 Sample Documents and Metadata:\n")
    for i, (doc_id, doc_preview, metadata) in enumerate(samples, 1):
        print(f"--- Document {i} ---")
        print(f"ID: {doc_id}")
        print(f"Preview: {doc_preview}...")
        print(f"Metadata keys: {list(metadata.keys()) if metadata else 'None'}")
        
        # Check if already has status
        if metadata and 'status' in metadata:
            print(f"⚠️  ALREADY HAS STATUS: {metadata['status']}")
        else:
            print(f"✅ No status field (need to add)")
        
        # Show URL for verification
        if metadata and 'url' in metadata:
            print(f"URL: {metadata['url']}")
        
        print()

📄 Sample Documents and Metadata:

--- Document 1 ---
ID: 8b4824ec-a6fa-4535-8a57-3f5f20b8650d
Preview: HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
------------------- 
 
KHOA CƠ BẢN 
 
 
BÀI GIẢNG 
TƯ TƯ...
Metadata keys: ['page', 'title', 'author', 'source', 'cleaned', 'creator', 'moddate', 'producer', 'file_size', 'file_type', 'page_label', 'total_pages', 'creationdate']
✅ No status field (need to add)

--- Document 2 ---
ID: 664eb9bc-a2d6-4bd1-b2c8-441c679af8e0
Preview: 1 
MỤC LỤC 
 
LỜI NÓI ĐẦU... 
Chƣơng mở đầu: ĐỐI TƢỢNG, PHƢƠNG PHÁP NGHIÊN CỨU VÀ Ý NGHĨA HỌC 
TẬP M...
Metadata keys: ['page', 'title', 'author', 'source', 'cleaned', 'creator', 'moddate', 'producer', 'file_size', 'file_type', 'page_label', 'total_pages', 'creationdate']
✅ No status field (need to add)

--- Document 3 ---
ID: 4b94d8ee-2cca-4a5d-ab4c-308a7f026380
Preview: TRIỂN TƢ TƢỞNG HỒ CHÍ MINH 
 I. Cơ sở hình thành tƣ tƣởng H
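
The two samples above expose different key sets (crawled legal chunks in one, PDF pages in the other), and a `LIMIT` query only sees whichever rows come back first. A small sketch, assuming the `cmetadata` dicts have already been fetched into a list, that counts how many documents carry each key; `field_coverage` and the sample dicts are illustrative only:

```python
from collections import Counter
from typing import Iterable, Optional


def field_coverage(metadatas: Iterable[Optional[dict]]) -> Counter:
    """Count, per metadata key, how many documents contain it."""
    coverage = Counter()
    for metadata in metadatas:
        if metadata:  # skip NULL / empty metadata
            coverage.update(metadata.keys())
    return coverage


# Made-up rows that mimic the two shapes seen above
sample = [
    {"source": "thuvienphapluat.vn", "url": "https://example.invalid/a", "title": "t"},
    {"source": "/data/raw/book.pdf", "page": 1, "title": "t"},
    None,
]
for field, count in sorted(field_coverage(sample).items()):
    print(f"{field}: {count}")
```

A key whose count is lower than the document total marks a field that only part of the collection has, which is worth knowing before writing logic that depends on it.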

## Step 5: Define the Status-Detection Logic

In [30]:
def parse_year_from_source_or_title(metadata: dict) -> int | None:
    """
    Extract year from various sources:
    1. URL (for web-scraped legal docs): 'Luat-Dau-thau-2023-22-2023-QH15'
    2. Source filename (for PDFs): 'Tư-tưởng-2016.pdf' or 'CNPM 2020 final.pdf'
    3. Title field
    4. Creation date
    """
    # Priority 1: URL field (for legal documents)
    url = metadata.get("url", "")
    if url:
        # Pattern: Luat-Dau-thau-2023-22-2023-QH15
        match = re.search(r'Luat.*-(\d{4})-', url)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern: Nghi-dinh-214-2025-ND-CP
        match = re.search(r'-(\d{4})-', url)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
    
    # Priority 2: Source field (filename for PDFs)
    source = metadata.get("source", "")
    if source:
        # Pattern 1: -YYYY.pdf or _YYYY.pdf
        match = re.search(r'[-_](\d{4})\.pdf', source)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern 2: space+YYYY+space/end (e.g., "CNPM 2020 final.pdf")
        match = re.search(r'\s(\d{4})[\s\.]', source)
        if match:
            year = int(match.group(1))
            if 1900 <= year <= 2030:
                return year
        
        # Pattern 3: any 4-digit year in filename
        match = re.search(r'\b(19\d{2}|20[0-3]\d)\b', source)
        if match:
            year = int(match.group(1))
            return year
    
    # Priority 3: Title field
    title = metadata.get("title", "")
    if title:
        match = re.search(r'\b(19\d{2}|20[0-3]\d)\b', title)
        if match:
            return int(match.group(1))
    
    # Priority 4: Creation date (D:20160101120000)
    creationdate = metadata.get("creationdate", "")
    if creationdate:
        match = re.search(r'D:(\d{4})', creationdate)
        if match:
            return int(match.group(1))
    
    return None


def determine_status_and_validity(metadata: dict) -> tuple[str, str]:
    """
    Determine document status and valid_until date.
    
    Rules:
    - Legal documents (has 'url' field):
      * Luật (Law): 5 years validity
      * Nghị định (Decree): 2 years validity
      * Thông tư (Circular): 2 years validity
    - Educational materials (PDF files): 5 years validity
    
    Returns:
        (status, valid_until_iso_string)
    """
    year = parse_year_from_source_or_title(metadata)
    
    if not year:
        # No year found: mark as expired
        return "expired", "2024-01-01"
    
    current_year = datetime.now().year
    url = metadata.get("url", "")
    title = metadata.get("title", "")
    
    # Determine document type and validity period.
    # NOTE: these are substring checks, so a decree URL that also mentions
    # "Luat" (e.g. "Nghi-dinh-...-huong-dan-Luat-Dau-thau-...") takes the Law branch.
    if url and ("Luat" in url or "Luật" in title):
        # Law: 5 years validity
        valid_until = datetime(year + 5, 12, 31)
    elif url and ("Nghi-dinh" in url or "Nghị định" in title):
        # Decree: 2 years validity
        valid_until = datetime(year + 2, 12, 31)
    elif url and ("Thong-tu" in url or "Thông tư" in title):
        # Circular: 2 years validity
        valid_until = datetime(year + 2, 12, 31)
    else:
        # Educational materials or unknown: 5 years default
        valid_until = datetime(year + 5, 12, 31)
    
    # Check if still valid
    status = "active" if valid_until.year >= current_year else "expired"
    
    return status, valid_until.strftime("%Y-%m-%d")

print("✅ Functions defined successfully")
print("📚 Supports both:")
print("   - Legal documents (Luật: 5yr, Nghị định/Thông tư: 2yr)")
print("   - Educational materials (5yr)")
print("🔍 Parses year from URL, filename, title, or creation date")

✅ Functions defined successfully
📚 Supports both:
   - Legal documents (Luật: 5yr, Nghị định/Thông tư: 2yr)
   - Educational materials (5yr)
🔍 Parses year from URL, filename, title, or creation date


## Step 6: Test the Logic on Sample Documents

In [18]:
# Test on 20 sample documents to cover different files
print("🧪 Testing status detection logic:\n")

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 20
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    
    for i, (doc_id, metadata) in enumerate(samples, 1):
        if not metadata:
            print(f"{i}. No metadata")
            continue
        
        source = metadata.get('source', 'N/A')
        title = metadata.get('title', 'N/A')
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        # Extract year
        year = parse_year_from_source_or_title(metadata)
        
        # Extract filename from source
        filename = source.split('/')[-1] if source != 'N/A' else 'N/A'
        
        print(f"{i}. Year: {year} → Status: {status}, Valid until: {valid_until}")
        print(f"   File: {filename}")
        print()

🧪 Testing status detection logic:

1. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

2. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

3. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

4. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

5. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

6. Year: None → Status: expired, Valid until: 2024-01-01
   File: BG HP TTTN 2 CNPM 2020 final.pdf

7. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

8. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

9. Year: 2016 → Status: expired, Valid until: 2021-12-31
   File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf

10. Year: 20
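
The statuses above follow the rule from Step 5: a document is valid through 31 December of `year + validity`, and it stays active as long as that year has not passed. A compact restatement with the reference date passed in explicitly, which makes the rule easy to test; `status_for` is a hypothetical helper, not the notebook's function:

```python
from datetime import date
from typing import Tuple


def status_for(year: int, validity_years: int, today: date) -> Tuple[str, str]:
    """Return (status, valid_until) for a document published in `year`."""
    valid_until = date(year + validity_years, 12, 31)
    status = "active" if valid_until >= today else "expired"
    return status, valid_until.isoformat()


today = date(2025, 6, 1)  # fixed reference date for the example
print(status_for(2016, 5, today))
print(status_for(2020, 5, today))
```

Because `valid_until` is always 31 December, comparing full dates gives the same answer as the year-only comparison used in `determine_status_and_validity`.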

## Step 7: DRY RUN - Preview the Update (NO actual update)

⚠️ **This cell does NOT update the database**; it only SHOWS the results

In [32]:
# DRY RUN: Simulate update without actually updating
active_count = 0
expired_count = 0
already_has_status_count = 0
no_metadata_count = 0

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, cmetadata 
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    all_docs = cur.fetchall()
    
    print(f"📊 DRY RUN - Analyzing {len(all_docs)} documents...\n")
    
    for doc_id, metadata in all_docs:
        if not metadata:
            no_metadata_count += 1
            continue
        
        # Skip if already has status
        if 'status' in metadata:
            already_has_status_count += 1
            continue
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        if status == "active":
            active_count += 1
        else:
            expired_count += 1

print("=" * 60)
print("📊 DRY RUN RESULTS:")
print("=" * 60)
print(f"Total documents: {len(all_docs)}")
print(f"✅ Will mark as ACTIVE: {active_count}")
print(f"❌ Will mark as EXPIRED: {expired_count}")
print(f"⏭️  Already has status (skip): {already_has_status_count}")
print(f"⚠️  No metadata (skip): {no_metadata_count}")
print("=" * 60)
print(f"\n⚠️  Documents to update: {active_count + expired_count}")
print(f"💡 No changes made yet. Review results before proceeding to Step 8.")

📊 DRY RUN - Analyzing 2103 documents...

📊 DRY RUN RESULTS:
Total documents: 2103
✅ Will mark as ACTIVE: 1327
❌ Will mark as EXPIRED: 776
⏭️  Already has status (skip): 0
⚠️  No metadata (skip): 0

⚠️  Documents to update: 2103
💡 No changes made yet. Review results before proceeding to Step 8.


## Step 7a: Analyze the Year Distribution

In [31]:
# Analyze the distribution of years
from collections import Counter

year_distribution = Counter()
files_by_year = {}

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    all_docs = cur.fetchall()
    
    for (metadata,) in all_docs:
        if not metadata:
            continue
        
        year = parse_year_from_source_or_title(metadata)
        year_distribution[year] += 1
        
        # Track unique filenames per year
        source = metadata.get('source', '')
        if source:
            filename = source.split('/')[-1]
            if year not in files_by_year:
                files_by_year[year] = set()
            files_by_year[year].add(filename)

print("=" * 60)
print("📊 YEAR DISTRIBUTION:")
print("=" * 60)

# Sort with None at the end
sorted_years = sorted([y for y in year_distribution.keys() if y is not None]) + [None]

for year in sorted_years:
    count = year_distribution[year]
    year_label = year if year else "No year"
    files = len(files_by_year.get(year, set()))
    validity_end = year + 5 if year else None
    status = "✅ ACTIVE" if validity_end and validity_end >= 2025 else "❌ EXPIRED"
    
    print(f"{year_label}: {count} docs ({files} files) - Valid until {validity_end} - {status}")

print("=" * 60)
print(f"\n💡 Current year: 2025")
print(f"📚 Materials published in 2020 or later are ACTIVE (5-year validity)")
print(f"📚 Materials published before 2020 are EXPIRED")

📊 YEAR DISTRIBUTION:
2016: 776 docs (1 files) - Valid until 2021 - ❌ EXPIRED
2020: 482 docs (1 files) - Valid until 2025 - ✅ ACTIVE
2023: 272 docs (1 files) - Valid until 2028 - ✅ ACTIVE
2024: 88 docs (1 files) - Valid until 2029 - ✅ ACTIVE
2025: 485 docs (1 files) - Valid until 2030 - ✅ ACTIVE
No year: 0 docs (0 files) - Valid until None - ❌ EXPIRED

💡 Current year: 2025
📚 Materials published in 2020 or later are ACTIVE (5-year validity)
📚 Materials published before 2020 are EXPIRED


In [26]:
# Inspect files with "No year"
print("\n🔍 FILES WITHOUT DETECTED YEAR:")
print("=" * 60)
for filename in sorted(files_by_year.get(None, set())):
    print(f"   - {filename}")

print("\n💡 Observation:")
print("File 'BG HP TTTN 2 CNPM 2020 final.pdf' has '2020' in name")
print("but our regex pattern requires '-YYYY.pdf' format")
print("\nLet's improve the regex pattern...")


🔍 FILES WITHOUT DETECTED YEAR:
   - thuvienphapluat.vn

💡 Observation:
File 'BG HP TTTN 2 CNPM 2020 final.pdf' has '2020' in name
but our regex pattern requires '-YYYY.pdf' format

Let's improve the regex pattern...
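
The file name `BG HP TTTN 2 CNPM 2020 final.pdf` has no `-YYYY.pdf` suffix, which is why the filename patterns in `parse_year_from_source_or_title` (Step 5) also try a whitespace-delimited year. The sketch below replays those three patterns and reports which one matched. `year_from_source` and the pattern labels are illustrative stand-ins, and unlike the original this version applies the 1900-2030 range check to all three patterns:

```python
import re
from typing import Optional, Tuple

# (label, regex) pairs in the same priority order as Step 5
SOURCE_YEAR_PATTERNS = (
    ("suffix", r"[-_](\d{4})\.pdf"),          # ...-2016.pdf or ..._2016.pdf
    ("spaced", r"\s(\d{4})[\s\.]"),           # "... 2020 final.pdf"
    ("any", r"\b(19\d{2}|20[0-3]\d)\b"),      # any standalone 4-digit year
)


def year_from_source(source: str) -> Optional[Tuple[int, str]]:
    """Return (year, pattern_label) for the first pattern that yields a plausible year."""
    for label, pattern in SOURCE_YEAR_PATTERNS:
        match = re.search(pattern, source)
        if match and 1900 <= int(match.group(1)) <= 2030:
            return int(match.group(1)), label
    return None


for name in ("Tư-tưởng-Hồ-Chí-Minh-2016.pdf", "BG HP TTTN 2 CNPM 2020 final.pdf", "thuvienphapluat.vn"):
    print(name, "->", year_from_source(name))
```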


In [28]:
# Inspect thuvienphapluat.vn documents
print("\n🔍 INSPECTING 'thuvienphapluat.vn' DOCUMENTS:")
print("=" * 60)

with conn.cursor() as cur:
    cur.execute("""
        SELECT cmetadata
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' LIKE %s
        LIMIT 5
    """, (collection_uuid, '%thuvienphapluat.vn%'))
    
    samples = cur.fetchall()
    
    for i, (metadata,) in enumerate(samples, 1):
        print(f"\n--- Document {i} ---")
        print(f"Source: {metadata.get('source', 'N/A')}")
        print(f"Title: {metadata.get('title', 'N/A')[:100]}...")
        
        # Check all fields
        print(f"All fields: {list(metadata.keys())}")
        
        # Check for any date-related fields
        for key, value in metadata.items():
            if any(date_keyword in key.lower() for date_keyword in ['date', 'time', 'year', 'created', 'modified']):
                print(f"  {key}: {value}")


🔍 INSPECTING 'thuvienphapluat.vn' DOCUMENTS:

--- Document 1 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...
All fields: ['url', 'dieu', 'title', 'chuong', 'source', 'section', 'chunk_id', 'has_diem', 'has_khoan', 'hierarchy', 'char_count', 'crawled_at', 'chunk_level', 'source_file', 'token_count', 'token_ratio', 'quality_flags', 'semantic_tags', 'structure_score', 'chunking_strategy', 'readability_score', 'is_within_token_limit']

--- Document 2 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...
All fields: ['url', 'dieu', 'khoan', 'title', 'chuong', 'source', 'section', 'chunk_id', 'has_diem', 'has_khoan', 'hierarchy', 'char_count', 'crawled_at', 'chunk_level', 'parent_dieu', 'source_file', 'token_count', 'token_ratio', 'quality_flags', 'semantic_tags', 'structure_score', 'chunking_strategy', 'readability_score', 'is_within_token_limit']

--- Document 3 ---
Source: thuvienphapluat.vn
Title: Nội dung từ thuvienphapluat.vn...

In [29]:
# Check URL examples for thuvienphapluat.vn docs
print("\n📋 SAMPLE URLs:")
print("=" * 60)

with conn.cursor() as cur:
    cur.execute("""
        SELECT DISTINCT cmetadata->>'url' as url
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 10
    """, (collection_uuid,))
    
    urls = cur.fetchall()
    
    for i, (url,) in enumerate(urls, 1):
        print(f"{i}. {url}")


📋 SAMPLE URLs:
1. https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805.aspx
2. https://thuvienphapluat.vn/van-ban/Dau-tu/Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157.aspx
3. https://thuvienphapluat.vn/van-ban/Dau-tu/Thong-tu-22-2024-TT-BKHDT-cung-cap-thong-tin-ve-lua-chon-nha-thau-tren-He-thong-mang-dau-thau-quoc-gia-619403.aspx
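
The three URLs above can be run through the URL year patterns from Step 5 to see which year each one yields. The sketch re-implements only the URL branch of `parse_year_from_source_or_title`; the name `year_from_url` is illustrative:

```python
import re
from typing import Optional

# Same two patterns, in the same order, as the URL branch in Step 5
URL_YEAR_PATTERNS = (r"Luat.*-(\d{4})-", r"-(\d{4})-")


def year_from_url(url: str) -> Optional[int]:
    """Return the first plausible year found by the URL patterns, else None."""
    for pattern in URL_YEAR_PATTERNS:
        match = re.search(pattern, url)
        if match and 1900 <= int(match.group(1)) <= 2030:
            return int(match.group(1))
    return None


base = "https://thuvienphapluat.vn/van-ban/Dau-tu/"
urls = [
    base + "Luat-Dau-thau-2023-22-2023-QH15-518805.aspx",
    base + "Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157.aspx",
    base + "Thong-tu-22-2024-TT-BKHDT-cung-cap-thong-tin-ve-lua-chon-nha-thau-tren-He-thong-mang-dau-thau-quoc-gia-619403.aspx",
]
for url in urls:
    print(year_from_url(url), url.rsplit("/", 1)[-1][:30])
```

Document numbers such as `214` or `22` are skipped because the patterns require exactly four digits between hyphens, and trailing IDs such as `668157` are followed by `.aspx` rather than a hyphen.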


## ⏸️ PAUSE - Review Dry Run Results

**Before running Step 8:**
1. ✅ Check that the dry-run numbers are plausible
2. ✅ Verify the active/expired ratio
3. ✅ Confirm that no documents were missed

**If everything looks OK → run Step 8**

## Step 8: 🚨 BULK UPDATE - Actually Update the Database

⚠️ **CRITICAL: This cell WILL UPDATE the database!**

Run it only after you have:
- ✅ Reviewed the dry-run results
- ✅ Confirmed the active/expired numbers
- ✅ Decided you are ready to commit the changes

In [34]:
# 🚨 BULK UPDATE: this cell writes to the database
# ⚠️  The triple quotes that kept it disabled have been removed here

from psycopg.types.json import Json

updated_count = 0
active_count = 0
expired_count = 0

with conn.cursor() as cur:
    # Get all documents
    cur.execute(
        "SELECT id, cmetadata FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    
    all_docs = cur.fetchall()
    total_docs = len(all_docs)
    
    print(f"🚀 Starting bulk update of {total_docs} documents...")
    
    for i, (doc_id, metadata) in enumerate(all_docs, 1):
        # Skip if no metadata or already has status
        if not metadata or 'status' in metadata:
            continue
        
        # Determine status
        status, valid_until = determine_status_and_validity(metadata)
        
        # Update metadata
        metadata["status"] = status
        metadata["valid_until"] = valid_until
        
        # Update database (wrap dict in Json())
        cur.execute(
            "UPDATE langchain_pg_embedding SET cmetadata = %s WHERE id = %s",
            (Json(metadata), doc_id)
        )
        
        updated_count += 1
        if status == "active":
            active_count += 1
        else:
            expired_count += 1
        
        # Progress
        if updated_count % 100 == 0:
            print(f"⏳ Updated {updated_count}/{total_docs} documents...")
    
    # COMMIT changes
    conn.commit()
    
    print("\n" + "=" * 60)
    print("✅ BULK UPDATE COMPLETE!")
    print("=" * 60)
    print(f"📝 Total updated: {updated_count}")
    print(f"✅ Active documents: {active_count}")
    print(f"❌ Expired documents: {expired_count}")
    print("=" * 60)


print("⚠️  Cell is commented out for safety.")
print("💡 Remove triple quotes to enable update.")

🚀 Starting bulk update of 2103 documents...
⏳ Updated 100/2103 documents...
⏳ Updated 200/2103 documents...
⏳ Updated 300/2103 documents...
⏳ Updated 400/2103 documents...
⏳ Updated 500/2103 documents...
⏳ Updated 600/2103 documents...
⏳ Updated 700/2103 documents...
⏳ Updated 800/2103 documents...
⏳ Updated 900/2103 documents...
⏳ Updated 1000/2103 documents...
⏳ Updated 1100/2103 documents...
⏳ Updated 1200/2103 documents...
⏳ Updated 1300/2103 documents...
⏳ Updated 1400/2103 documents...
⏳ Updated 1500/2103 documents...
⏳ Updated 1600/2103 documents...
⏳ Updated 1700/2103 documents...
⏳ Updated 1800/2103 documents...
⏳ Updated 1900/2103 documents...
⏳ Updated 2000/2103 documents...
⏳ Updated 2100/2103 documents...

✅ BULK UPDATE COMPLETE!
📝 Total updated: 2103
✅ Active documents: 1327
❌ Expired documents: 776
⚠️  Cell is commented out for safety.
💡 Remove triple quotes to enable update.


## Step 9: Verify the Update

In [36]:
# Verify update results
with conn.cursor() as cur:
    # Count by status
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("=" * 60)
    print("📊 STATUS BREAKDOWN:")
    print("=" * 60)
    for status, count in results:
        emoji = "✅" if status == "active" else "❌" if status == "expired" else "❓"
        print(f"{emoji} {status or 'NULL'}: {count} documents")
    print("=" * 60)
    
    # Sample documents with new metadata
    print("\n📄 Sample documents with metadata:")
    cur.execute("""
        SELECT 
            LEFT(document, 60) as doc_preview,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until,
            cmetadata->>'url' as url,
            cmetadata->>'source' as source
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, status, valid_until, url, source) in enumerate(samples, 1):
        print(f"\n{i}. {doc}...")
        print(f"   Status: {status}, Valid until: {valid_until}")
        if url:
            print(f"   URL: {url[:60]}...")
        else:
            print(f"   Source: {source}")

📊 STATUS BREAKDOWN:
❌ expired: 776 documents
✅ active: 1327 documents

📄 Sample documents with metadata:

1. HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
----------------...
   Status: expired, Valid until: 2021-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf

2. Chương 4. Thiết kế
• Các lớp thực thể liên quan.
102
Hình 4....
   Status: active, Valid until: 2025-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf

3. vị cao nhất là dân, vì dân là chủ”74. 
 
 
86 Hồ C hí Minh, ...
   Status: expired, Valid until: 2021-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/Tư-tưởng-Hồ-Chí-Minh-2016.pdf

4. required /></td>
</tr>
<tr>
<td>Mật khẩu:</td>
<td><input ty...
   Status: active, Valid until: 2025-12-31
   Source: /home/sakana/Code/RAG-bidding/app/data/raw/BG HP TTTN 2 CNPM 2020 final.pdf

5. String sqlDiem = "{call 
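
Step 8 writes the whole `cmetadata` value back for every row. An alternative worth considering is to let PostgreSQL merge only the two new keys with the `jsonb` concatenation operator `||`, so the other keys are never round-tripped through Python. The sketch below only builds the statement and its parameters; `MERGE_SQL` and `merge_params` are hypothetical names, and scoping the `WHERE` clause by `collection_id` is an added assumption rather than something the notebook does:

```python
import json
from typing import Tuple

# %s placeholders are filled by the driver; the first one is cast to jsonb server-side
MERGE_SQL = (
    "UPDATE langchain_pg_embedding "
    "SET cmetadata = cmetadata || %s::jsonb "
    "WHERE id = %s AND collection_id = %s"
)


def merge_params(doc_id: str, collection_uuid: str, status: str, valid_until: str) -> Tuple[str, str, str]:
    """Return the parameter tuple for MERGE_SQL: (json patch, row id, collection id)."""
    patch = json.dumps({"status": status, "valid_until": valid_until})
    return (patch, doc_id, collection_uuid)


print(MERGE_SQL)
print(merge_params("8b4824ec", "b625f353", "expired", "2021-12-31"))
```

With an open cursor this would be executed as `cur.execute(MERGE_SQL, merge_params(...))`. Since `NULL || jsonb` evaluates to `NULL`, rows without metadata stay untouched, which matches the skip in Step 8.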

## Step 9a: Verify Legal Documents (thuvienphapluat.vn)

In [37]:
# Check legal documents status
print("🔍 LEGAL DOCUMENTS STATUS:")
print("=" * 60)

with conn.cursor() as cur:
    # Get status breakdown for legal docs
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("Legal documents status:")
    for status, count in results:
        emoji = "✅" if status == "active" else "❌"
        print(f"  {emoji} {status}: {count} documents")
    
    # Sample legal documents
    print("\n📄 Sample legal documents:")
    cur.execute("""
        SELECT 
            cmetadata->>'url' as url,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (url, status, valid_until) in enumerate(samples, 1):
        # Extract doc type and year from URL
        doc_name = url.split('/')[-1].split('.')[0] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n{i}. {emoji} {doc_name}")
        print(f"   Status: {status}, Valid until: {valid_until}")
        # Guard against a missing URL, as the doc_name line above already does
        print(f"   URL: {url[:80] if url else 'N/A'}...")

🔍 LEGAL DOCUMENTS STATUS:
Legal documents status:
  ✅ active: 845 documents

📄 Sample legal documents:

1. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

2. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

3. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

4. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
   Status: active, Valid until: 2028-12-31
   URL: https://thuvienphapluat.vn/van-ban/Dau-tu/Luat-Dau-thau-2023-22-2023-QH15-518805...

5. ✅ Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157
   Status: active, Valid until: 2030-12-31
   URL: https://thuvienphapluat.v
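
In the output above, decree 214/2025 gets `valid_until` 2030-12-31, which is the five-year law period rather than the two years the Step 5 rules assign to decrees. Its URL also contains `Luat` (`...huong-dan-Luat-Dau-thau...`), so the substring test in `determine_status_and_validity` reaches the law branch first. One possible tightening, offered as a sketch rather than the notebook's method, is to classify by the leading prefix of the last URL path segment; `validity_years` and its lookup table are illustrative names:

```python
# Validity periods from the Step 5 rules, keyed by the slug prefix
VALIDITY_YEARS = {"Luat": 5, "Nghi-dinh": 2, "Thong-tu": 2}
DEFAULT_VALIDITY_YEARS = 5  # educational materials or unknown type


def validity_years(url: str) -> int:
    """Pick the validity period from how the last path segment starts."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    for prefix, years in VALIDITY_YEARS.items():
        if slug.startswith(prefix + "-"):
            return years
    return DEFAULT_VALIDITY_YEARS


base = "https://thuvienphapluat.vn/van-ban/Dau-tu/"
print(validity_years(base + "Luat-Dau-thau-2023-22-2023-QH15-518805.aspx"))
print(validity_years(base + "Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-ve-lua-chon-nha-thau-668157.aspx"))
```

Under this variant the decree would be valid through 2027-12-31 instead of 2030-12-31; whether that is the intended policy is a decision for the notebook's author.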

## Step 10: 🗑️ DELETE DOCUMENTS UNRELATED TO BIDDING

⚠️ **CRITICAL**: The cells below PERMANENTLY DELETE documents from the database!

**Documents to be deleted:**
- Educational PDFs (textbooks, course materials) - 1,258 documents
- Examples: "Tư tưởng Hồ Chí Minh 2016.pdf", "BG HP TTTN 2 CNPM 2020.pdf"

**Documents to KEEP:**
- Legal documents from thuvienphapluat.vn - 845 documents
- Examples: Luật Đấu thầu 2023 (law), Nghị định 214-2025 (decree), Thông tư (circulars)

**Workflow:**
1. ✅ Analyze the documents to delete
2. ✅ DRY RUN - See what would be deleted (NO actual deletion)
3. ⏸️ **PAUSE** - Review before deleting
4. 🚨 BULK DELETE - Actually delete (requires confirmation)

## Step 10a: Analyze Documents by Type

In [5]:
# Analyze documents by source
print("🔍 DOCUMENTS BY SOURCE:")
print("=" * 60)

with conn.cursor() as cur:
    # Count by source type
    cur.execute("""
        SELECT 
            cmetadata->>'source' as source,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'source'
        ORDER BY count DESC
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    legal_count = 0
    pdf_count = 0
    pdf_files = set()
    
    for source, count in results:
        if source == 'thuvienphapluat.vn':
            legal_count = count
            print(f"📜 Legal documents (thuvienphapluat.vn): {count} documents")
        else:
            pdf_count += count
            pdf_files.add(source)
    
    print(f"📚 Educational PDFs: {pdf_count} documents from {len(pdf_files)} files")
    
    print("\n" + "=" * 60)
    print(f"Total: {legal_count + pdf_count} documents")
    print("=" * 60)
    
    # Show sample PDF filenames
    print("\n📂 Sample PDF files to be deleted:")
    cur.execute("""
        SELECT DISTINCT cmetadata->>'source' as source
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
        LIMIT 10
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (source,) in enumerate(samples, 1):
        filename = source.split('/')[-1]
        print(f"   {i}. {filename}")

🔍 DOCUMENT ANALYSIS BY SOURCE:
📜 Legal documents (thuvienphapluat.vn): 845 documents
📚 Educational PDFs: 1258 documents from 2 files

Total: 2103 documents

📂 Sample PDF files to be deleted:
   1. Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   2. BG HP TTTN 2 CNPM 2020 final.pdf
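
The analysis cell counts every source other than `thuvienphapluat.vn` as an educational PDF, including rows whose `source` key is missing (these arrive in Python as `None`). A minimal sketch of a stricter tally, assuming the same `(source, count)` row shape returned by the `GROUP BY` query; `summarize_sources` and `LEGAL_SOURCE` are hypothetical names, not part of the notebook:

```python
from typing import Iterable, Optional, Tuple

LEGAL_SOURCE = "thuvienphapluat.vn"  # source value used by the notebook's queries


def summarize_sources(rows: Iterable[Tuple[Optional[str], int]]) -> dict:
    """Tally (source, count) rows into legal, other, and missing-source buckets."""
    summary = {"legal": 0, "other": 0, "missing": 0, "other_files": set()}
    for source, count in rows:
        if source is None:
            # No 'source' key in cmetadata: keep these apart from the PDFs.
            summary["missing"] += count
        elif source == LEGAL_SOURCE:
            summary["legal"] += count
        else:
            summary["other"] += count
            # Keep only the file name, as the notebook does when printing samples.
            summary["other_files"].add(source.split("/")[-1])
    return summary


# Illustrative rows only (made-up counts, not the notebook's data).
example = summarize_sources([
    (LEGAL_SOURCE, 3),
    ("data/a.pdf", 2),
    ("data/b.pdf", 1),
    (None, 4),
])
print(example["legal"], example["other"], example["missing"], sorted(example["other_files"]))
```

A non-zero `missing` count would matter because a `!=` comparison against NULL is not true in SQL, so a filter such as `cmetadata->>'source' != 'thuvienphapluat.vn'` leaves those rows in place.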


## Step 10b: DRY RUN - See what would be deleted (nothing is actually deleted)

In [6]:
# DRY RUN: inspect the documents that would be deleted
print("🔍 DRY RUN - Analyzing documents to delete...")
print("=" * 60)

with conn.cursor() as cur:
    # Count documents to delete (PDFs)
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    to_delete_count = cur.fetchone()[0]
    
    # Count documents to keep (legal)
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    to_keep_count = cur.fetchone()[0]
    
    # Get total
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    total_count = cur.fetchone()[0]
    
    print("📊 DELETION PLAN:")
    print(f"   Total documents: {total_count}")
    print(f"   🗑️  Will DELETE (Educational PDFs): {to_delete_count}")
    print(f"   ✅ Will KEEP (Legal documents): {to_keep_count}")
    print("=" * 60)
    
    # Show sample documents to delete
    print("\n📄 Sample documents that will be DELETED:")
    cur.execute("""
        SELECT 
            LEFT(document, 80) as doc_preview,
            cmetadata->>'source' as source,
            cmetadata->>'title' as title
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, source, title) in enumerate(samples, 1):
        filename = source.split('/')[-1] if source else 'N/A'
        print(f"\n{i}. File: {filename}")
        print(f"   Title: {title[:60] if title else 'N/A'}...")
        print(f"   Preview: {doc}...")
    
    # Show sample documents to keep
    print("\n" + "=" * 60)
    print("✅ Sample documents that will be KEPT:")
    cur.execute("""
        SELECT 
            LEFT(document, 80) as doc_preview,
            cmetadata->>'url' as url,
            cmetadata->>'status' as status
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' = 'thuvienphapluat.vn'
        LIMIT 3
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (doc, url, status) in enumerate(samples, 1):
        doc_name = url.split('/')[-1].split('.')[0][:40] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n{i}. {emoji} {doc_name}")
        print(f"   Preview: {doc}...")

print("\n" + "=" * 60)
print("⚠️  No changes made yet. Review carefully before proceeding!")
print("=" * 60)

🔍 DRY RUN - Analyzing documents to delete...
📊 DELETION PLAN:
   Total documents: 2103
   🗑️  Will DELETE (Educational PDFs): 1258
   ✅ Will KEEP (Legal documents): 845

📄 Sample documents that will be DELETED:

1. File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   Title: §Ò thi chÝnh trÞ cuèi khãa khèi ®¹i häc n¨m häc 2006 – 2007...
   Preview: HỌC VIỆN CÔNG NGHỆ BƯU CHÍNH VIỄN THÔNG 
------------------- 
 
KHOA CƠ BẢN 
...

2. File: BG HP TTTN 2 CNPM 2020 final.pdf
   Title: N/A...
   Preview: Chương 4. Thiết kế
• Các lớp thực thể liên quan.
102
Hình 4.7: Thiết kế giao diệ...

3. File: Tư-tưởng-Hồ-Chí-Minh-2016.pdf
   Title: §Ò thi chÝnh trÞ cuèi khãa khèi ®¹i häc n¨m häc 2006 – 2007...
   Preview: vị cao nhất là dân, vì dân là chủ”74. 
 
 
86 Hồ C hí Minh, toàn tập, nxb Chính ...

4. File: BG HP TTTN 2 CNPM 2020 final.pdf
   Title: N/A...
   Preview: required /></td>
</tr

## ⏸️ PAUSE - Review Deletion Plan

**Before running the delete cell:**
1. ✅ Verify the number of documents to delete/keep
2. ✅ Check the sample documents to make sure no legal docs are deleted by mistake
3. ✅ Back up the database if needed (recommended)
4. ✅ Confirm you are ready to delete permanently

**⚠️ NOTE:**
- Deleted documents CANNOT BE RECOVERED (unless a backup exists)
- Their embeddings are deleted along with them
- Back up first: `pg_dump rag_bidding_v2 > backup_before_delete.sql`

**If OK → Run Step 10c**
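
The count check in item 1 can be made mechanical. A minimal sketch, assuming the three counts come from the dry-run cell (`total_count`, `to_delete_count`, `to_keep_count`); `check_deletion_plan` is a hypothetical helper. It relies on PostgreSQL evaluating both `= '...'` and `!= '...'` to NULL when `cmetadata->>'source'` is NULL, so rows without a `source` key fall into neither count:

```python
def check_deletion_plan(total: int, to_delete: int, to_keep: int) -> int:
    """Return the number of rows matched by neither the delete nor the keep filter."""
    if min(total, to_delete, to_keep) < 0:
        raise ValueError("counts must be non-negative")
    unclassified = total - to_delete - to_keep
    if unclassified < 0:
        raise ValueError("delete and keep counts exceed the total")
    return unclassified


# Counts reported by the dry run above.
unclassified = check_deletion_plan(total=2103, to_delete=1258, to_keep=845)
print(f"Rows matched by neither filter: {unclassified}")
```

A result of zero means every row is accounted for. Any positive value points to rows that the delete would silently skip and that deserve a look before proceeding.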

## Step 10c: 🚨 BULK DELETE - Delete unrelated documents

⚠️ **CRITICAL: This cell PERMANENTLY DELETES documents from the database!**

Run it only after:
- ✅ Reviewing the dry-run results
- ✅ Confirming the numbers are correct
- ✅ Backing up the database (recommended)
- ✅ Being ready to permanently delete the educational PDFs

In [7]:
# 🚨 BULK DELETE - this cell is live: running it deletes rows and commits
# ⚠️  It is not wrapped in triple quotes, so there is no safety guard


deleted_count = 0

with conn.cursor() as cur:
    # Count before deletion
    cur.execute(
        "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    before_count = cur.fetchone()[0]
    
    print("🗑️  Starting deletion process...")
    print(f"📊 Documents before: {before_count}")
    
    # DELETE all educational PDFs (source != 'thuvienphapluat.vn')
    cur.execute("""
        DELETE FROM langchain_pg_embedding 
        WHERE collection_id = %s
        AND cmetadata->>'source' != 'thuvienphapluat.vn'
    """, (collection_uuid,))
    
    deleted_count = cur.rowcount
    
    # Count after deletion
    cur.execute(
        "SELECT COUNT(*) FROM langchain_pg_embedding WHERE collection_id = %s",
        (collection_uuid,)
    )
    after_count = cur.fetchone()[0]
    
    # COMMIT changes
    conn.commit()
    
    print("\n" + "=" * 60)
    print("✅ BULK DELETE COMPLETE!")
    print("=" * 60)
    print(f"📊 Documents before: {before_count}")
    print(f"🗑️  Documents deleted: {deleted_count}")
    print(f"✅ Documents remaining: {after_count}")
    print("=" * 60)
    print("\n💡 Kept only legal documents from thuvienphapluat.vn")
    print("🎯 Database now contains ONLY bidding law-related documents")


# NOTE: the delete above is not commented out, so these messages print
# after the deletion has already run.
print("⚠️  Cell is commented out for safety.")
print("💡 Remove triple quotes to enable deletion.")
print("⚠️  Make sure you've reviewed the dry-run results first!")

🗑️  Starting deletion process...
📊 Documents before: 2103

✅ BULK DELETE COMPLETE!
📊 Documents before: 2103
🗑️  Documents deleted: 1258
✅ Documents remaining: 845

💡 Kept only legal documents from thuvienphapluat.vn
🎯 Database now contains ONLY bidding law-related documents
⚠️  Cell is commented out for safety.
💡 Remove triple quotes to enable deletion.
⚠️  Make sure you've reviewed the dry-run results first!
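
The delete cell commits whatever the `DELETE` removed. A more defensive variant could compare `cur.rowcount` with the dry-run count and roll back on a mismatch. The following is a sketch under assumptions, not the notebook's code: `guarded_delete`, `expected_deleted`, and `confirm` are hypothetical names, and `conn` is assumed to be a psycopg connection like the one opened in Step 2 (with autocommit off).

```python
DELETE_SQL = """
    DELETE FROM langchain_pg_embedding
    WHERE collection_id = %s
    AND cmetadata->>'source' != 'thuvienphapluat.vn'
"""


def guarded_delete(conn, collection_uuid, expected_deleted: int, confirm: bool = False) -> int:
    """Delete non-legal rows, committing only if the row count matches the dry run."""
    if not confirm:
        # Explicit opt-in instead of commenting the cell in and out.
        print("Skipped: pass confirm=True to delete.")
        return 0
    try:
        with conn.cursor() as cur:
            cur.execute(DELETE_SQL, (collection_uuid,))
            deleted = cur.rowcount
        if deleted != expected_deleted:
            raise RuntimeError(
                f"Deleted {deleted} rows but the dry run predicted {expected_deleted}"
            )
        conn.commit()
        return deleted
    except Exception:
        # Undo the delete if anything went wrong before the commit.
        conn.rollback()
        raise
```

Called as `guarded_delete(conn, collection_uuid, expected_deleted=to_delete_count, confirm=True)`, this would commit only when the number of removed rows equals the dry-run figure. Otherwise the transaction is rolled back and the table is left unchanged.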


## Step 10d: Verify Deletion - Check the results after deleting

In [8]:
# Verify deletion results
print("🔍 VERIFICATION AFTER DELETION:")
print("=" * 60)

with conn.cursor() as cur:
    # Total count
    cur.execute("""
        SELECT COUNT(*)
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
    """, (collection_uuid,))
    
    total_count = cur.fetchone()[0]
    print(f"📊 Total documents remaining: {total_count}")
    
    # Count by source (should only have thuvienphapluat.vn)
    cur.execute("""
        SELECT 
            cmetadata->>'source' as source,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'source'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("\n📋 Documents by source:")
    for source, count in results:
        print(f"   - {source}: {count} documents")
    
    # Status breakdown
    cur.execute("""
        SELECT 
            cmetadata->>'status' as status,
            COUNT(*) as count
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        GROUP BY cmetadata->>'status'
    """, (collection_uuid,))
    
    results = cur.fetchall()
    
    print("\n📊 Status breakdown:")
    for status, count in results:
        emoji = "✅" if status == "active" else "❌"
        print(f"   {emoji} {status}: {count} documents")
    
    # Sample remaining documents
    print("\n📄 Sample remaining documents:")
    cur.execute("""
        SELECT 
            cmetadata->>'url' as url,
            cmetadata->>'status' as status,
            cmetadata->>'valid_until' as valid_until
        FROM langchain_pg_embedding 
        WHERE collection_id = %s
        LIMIT 5
    """, (collection_uuid,))
    
    samples = cur.fetchall()
    for i, (url, status, valid_until) in enumerate(samples, 1):
        doc_name = url.split('/')[-1].split('.')[0][:50] if url else 'N/A'
        emoji = "✅" if status == "active" else "❌"
        print(f"\n   {i}. {emoji} {doc_name}")
        print(f"      Status: {status}, Valid until: {valid_until}")

print("\n" + "=" * 60)
print("✅ Verification complete!")
print("🎯 Database now contains ONLY legal documents from thuvienphapluat.vn")
print("=" * 60)

🔍 VERIFICATION AFTER DELETION:
📊 Total documents remaining: 845

📋 Documents by source:
   - thuvienphapluat.vn: 845 documents

📊 Status breakdown:
   ✅ active: 845 documents

📄 Sample remaining documents:

   1. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   2. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   3. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   4. ✅ Luat-Dau-thau-2023-22-2023-QH15-518805
      Status: active, Valid until: 2028-12-31

   5. ✅ Nghi-dinh-214-2025-ND-CP-huong-dan-Luat-Dau-thau-v
      Status: active, Valid until: 2030-12-31

✅ Verification complete!
🎯 Database now contains ONLY legal documents from thuvienphapluat.vn

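
The verification cell prints the per-source breakdown for a visual check. A small assertion helper could turn that into a hard failure when something unexpected remains. This is a sketch assuming the `(source, count)` rows from the per-source query; `assert_only_source` is a hypothetical name:

```python
from typing import Iterable, Optional, Tuple


def assert_only_source(
    rows: Iterable[Tuple[Optional[str], int]],
    allowed: str = "thuvienphapluat.vn",
    expected_total: Optional[int] = None,
) -> int:
    """Raise if any (source, count) row has a source other than `allowed`."""
    total = 0
    for source, count in rows:
        if source != allowed:
            # Also triggers for None, i.e. rows without a 'source' key.
            raise AssertionError(f"Unexpected source {source!r} with {count} rows")
        total += count
    if expected_total is not None and total != expected_total:
        raise AssertionError(f"Expected {expected_total} rows, found {total}")
    return total


# Row shape matches the per-source query in the verification cell.
remaining = assert_only_source([("thuvienphapluat.vn", 845)], expected_total=845)
print(f"Verified rows: {remaining}")
```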