# 04: Full-Text Article Scraper
**Goal**: Extract full article content for records missing descriptions

**Strategy**:
1. Test trafilatura on sample articles
2. Fetch articles needing scraping from Supabase
3. Extract content in batches with error handling
4. Update database with extracted content
5. Generate quality report

**Target**: 82 articles (50 archive + 32 RSS without descriptions)

## Phase 1: Setup & Library Testing

In [18]:
# Install required libraries (run once)
# !pip install trafilatura supabase python-dotenv requests

import trafilatura
from supabase import create_client, Client
import requests
import pandas as pd
from datetime import datetime
import time
from typing import List, Dict, Tuple
import json

print("✅ Libraries imported successfully")
print(f"Trafilatura version: {trafilatura.__version__}")

✅ Libraries imported successfully
Trafilatura version: 2.0.0


In [19]:
# ============================================
# SUPABASE CREDENTIALS - REPLACE THESE
# ============================================
SUPABASE_URL = "https://lgnhjzlbezpczlobeevu.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImxnbmhqemxiZXpwY3psb2JlZXZ1Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTgyMDgzNjcsImV4cCI6MjA3Mzc4NDM2N30.O5Yt0dOyYq326ESo0LBL7lGj4k8zwpuodJfTtGwrPek"
# ============================================

# Initialize Supabase client
supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)

print("✅ Supabase client initialized!")
print(f"Connected to: {SUPABASE_URL}")

✅ Supabase client initialized!
Connected to: https://lgnhjzlbezpczlobeevu.supabase.co


In [20]:
# Test trafilatura on a single article
TEST_URL = "https://www.thehindu.com/news/national/tamil-nadu/encroachments-demolished-to-complete-new-bridge-across-kamadalam-river-near-arani/article70102100.ece"

print(f"🧪 Testing extraction on: {TEST_URL}\n")
print("="*60)

try:
    # Download HTML
    downloaded = trafilatura.fetch_url(TEST_URL)
    
    if downloaded:
        # Extract content
        content = trafilatura.extract(downloaded, include_comments=False, include_tables=False)
        
        if content:
            print("✅ Extraction Successful!\n")
            print(f"Content Length: {len(content)} characters")
            print(f"Word Count: ~{len(content.split())} words\n")
            print("First 500 characters:")
            print("-"*60)
            print(content[:500])
            print("-"*60)
        else:
            print("⚠️ Extraction returned empty content")
    else:
        print("❌ Failed to download page")
        
except Exception as e:
    print(f"❌ Error: {e}")

print("\n💡 If extraction worked, proceed. If failed, we'll need a custom scraper.")

🧪 Testing extraction on: https://www.thehindu.com/news/national/tamil-nadu/encroachments-demolished-to-complete-new-bridge-across-kamadalam-river-near-arani/article70102100.ece

✅ Extraction Successful!

Content Length: 2234 characters
Word Count: ~374 words

First 500 characters:
------------------------------------------------------------
The State Highways Department demolished encroachments on Arcot - Villupuram Main Road (SH-4) near old bus terminus in Arani town in Tiruvannamalai to lay bitumen approach roads as part of completion of the two-lane high level bridge across Kamadalam Naganathi river.
Officials of State Highways, which is constructing the bridge, said that most of the encroachments are houses and petty shops, which were built more than two decades ago. Encroached land near the bus terminus belongs to State Highwa
------------------------------------------------------------

💡 If extraction worked, proceed. If failed, we'll need a custom scraper.


## Phase 2: Fetch Target Articles from Database

In [21]:
# Fetch all articles needing full-text scraping
print("📥 Fetching articles that need scraping...\n")

result = supabase.table('news_cleaned') \
    .select('id, title, link, source, category') \
    .eq('needs_full_scrape', True) \
    .execute()

articles_to_scrape = result.data

print("="*60)
print("📊 ARTICLES NEEDING SCRAPING")
print("="*60)
print(f"Total: {len(articles_to_scrape)} articles\n")

# Breakdown by source
df_scrape = pd.DataFrame(articles_to_scrape)
source_counts = df_scrape['source'].value_counts()

print("By Source:")
for source, count in source_counts.items():
    print(f"  {source}: {count}")

# Category breakdown
print("\nBy Category:")
category_counts = df_scrape['category'].value_counts()
for cat, count in category_counts.head(5).items():
    print(f"  {cat}: {count}")

print("\n💡 Recommendation: Start with 20 articles (10 archive + 10 RSS) for testing")

📥 Fetching articles that need scraping...

📊 ARTICLES NEEDING SCRAPING
Total: 462 articles

By Source:
  The Hindu - Archive: 462

By Category:
  Tamil Nadu: 144
  Madurai: 126
  Coimbatore: 102
  Chennai: 90

💡 Recommendation: Start with 20 articles (10 archive + 10 RSS) for testing


In [22]:
# Select test batch: 10 archive + 10 RSS
archive_articles = [a for a in articles_to_scrape if 'Archive' in a['source']][:10]
rss_articles = [a for a in articles_to_scrape if 'RSS' in a['source']][:10]

test_batch = archive_articles + rss_articles

print("🎯 TEST BATCH SELECTED")
print("="*60)
print(f"Archive articles: {len(archive_articles)}")
print(f"RSS articles: {len(rss_articles)}")
print(f"Total test batch: {len(test_batch)}\n")

print("Sample articles:")
for i, article in enumerate(test_batch[:3], 1):
    print(f"\n{i}. [{article['source']}]")
    print(f"   {article['title'][:60]}...")
    print(f"   {article['link'][:70]}...")

🎯 TEST BATCH SELECTED
Archive articles: 10
RSS articles: 0
Total test batch: 10

Sample articles:

1. [The Hindu - Archive]
   Mother kills child, ends life in Erode...
   https://www.thehindu.com/news/cities/Coimbatore/mother-kills-child-end...

2. [The Hindu - Archive]
   Bar employee arrested for murdering Salem native near Coimba...
   https://www.thehindu.com/news/cities/Coimbatore/bar-employee-arrested-...

3. [The Hindu - Archive]
   UGD charges to be collected as per 2023 bylaw, says Coimbato...
   https://www.thehindu.com/news/cities/Coimbatore/ugd-charges-to-be-coll...


## Phase 3: Content Extraction Pipeline

In [23]:
def extract_article_content(url: str, max_retries: int = 3, timeout: int = 10) -> Dict:
    """
    Extract full article content from URL using trafilatura
    
    Returns:
        Dict with keys: success, content, error, word_count
    """
    result = {
        'success': False,
        'content': None,
        'error': None,
        'word_count': 0
    }
    
    for attempt in range(max_retries):
        try:
            # Download HTML
            downloaded = trafilatura.fetch_url(url)
            
            if not downloaded:
                result['error'] = 'Failed to download page'
                continue
            
            # Extract content
            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=False,
                no_fallback=False  # Try harder to extract content
            )
            
            if content and len(content.strip()) > 100:  # Min 100 chars
                result['success'] = True
                result['content'] = content.strip()
                result['word_count'] = len(content.split())
                return result
            else:
                result['error'] = 'Empty or too short content'
                
        except requests.exceptions.Timeout:
            result['error'] = f'Timeout (attempt {attempt + 1}/{max_retries})'
        except requests.exceptions.RequestException as e:
            result['error'] = f'Network error: {str(e)[:50]}'
            break  # Don't retry network errors
        except Exception as e:
            result['error'] = f'Extraction error: {str(e)[:50]}'
            break
        
        if attempt < max_retries - 1:
            time.sleep(2)  # Wait before retry
    
    return result

print("✅ Extraction function defined")
print("Features: 3 retries, 10s timeout, min 100 chars validation")

✅ Extraction function defined
Features: 3 retries, 10s timeout, min 100 chars validation


In [24]:
def scrape_articles_batch(articles: List[Dict], rate_limit: float = 2.5) -> List[Dict]:
    """
    Scrape multiple articles with progress tracking
    
    Args:
        articles: List of dicts with 'id', 'link', 'title'
        rate_limit: Seconds to wait between requests
    
    Returns:
        List of dicts with scraping results
    """
    results = []
    total = len(articles)
    
    print(f"🚀 Starting batch scraping: {total} articles")
    print(f"⏱️  Rate limit: {rate_limit}s between requests\n")
    print("="*60)
    
    for i, article in enumerate(articles, 1):
        print(f"\n[{i}/{total}] Scraping: {article['title'][:50]}...")
        print(f"URL: {article['link'][:60]}...")
        
        # Extract content
        extraction = extract_article_content(article['link'])
        
        # Store result
        result = {
            'id': article['id'],
            'title': article['title'],
            'link': article['link'],
            'source': article['source'],
            'success': extraction['success'],
            'content': extraction['content'],
            'word_count': extraction['word_count'],
            'error': extraction['error']
        }
        results.append(result)
        
        # Progress indicator
        if extraction['success']:
            print(f"✅ Success! Extracted {extraction['word_count']} words")
        else:
            print(f"❌ Failed: {extraction['error']}")
        
        # Rate limiting (skip on last article)
        if i < total:
            time.sleep(rate_limit)
    
    print("\n" + "="*60)
    print("✅ Batch scraping complete!")
    
    return results

print("✅ Batch scraper function defined")

✅ Batch scraper function defined


In [25]:
# Run scraping on test batch
print(f"⚠️  About to scrape {len(test_batch)} articles")
print(f"Estimated time: ~{len(test_batch) * 2.5 / 60:.1f} minutes\n")

# Start scraping
scraping_results = scrape_articles_batch(test_batch, rate_limit=2.5)

⚠️  About to scrape 10 articles
Estimated time: ~0.4 minutes

🚀 Starting batch scraping: 10 articles
⏱️  Rate limit: 2.5s between requests


[1/10] Scraping: Mother kills child, ends life in Erode...
URL: https://www.thehindu.com/news/cities/Coimbatore/mother-kills...
✅ Success! Extracted 204 words

[2/10] Scraping: Bar employee arrested for murdering Salem native n...
URL: https://www.thehindu.com/news/cities/Coimbatore/bar-employee...
✅ Success! Extracted 292 words

[3/10] Scraping: UGD charges to be collected as per 2023 bylaw, say...
URL: https://www.thehindu.com/news/cities/Coimbatore/ugd-charges-...
✅ Success! Extracted 349 words

[4/10] Scraping: Coimbatore Corporation Council passes resolution t...
URL: https://www.thehindu.com/news/cities/Coimbatore/coimbatore-c...
✅ Success! Extracted 378 words

[5/10] Scraping: Stalin inaugurates new classrooms built at a cost ...
URL: https://www.thehindu.com/news/national/tamil-nadu/stalin-ina...
✅ Success! Extracted 85 words

[6/10] Scrap

In [26]:
# Analyze scraping results
df_results = pd.DataFrame(scraping_results)

success_count = df_results['success'].sum()
failure_count = len(df_results) - success_count
success_rate = (success_count / len(df_results)) * 100

print("="*60)
print("📊 SCRAPING RESULTS SUMMARY")
print("="*60)
print(f"Total Attempted: {len(df_results)}")
print(f"Successful: {success_count} ({success_rate:.1f}%)")
print(f"Failed: {failure_count} ({100-success_rate:.1f}%)\n")

# Success by source
print("Success Rate by Source:")
source_success = df_results.groupby('source')['success'].agg(['sum', 'count'])
source_success['rate'] = (source_success['sum'] / source_success['count'] * 100).round(1)
for source, row in source_success.iterrows():
    print(f"  {source}: {row['sum']}/{row['count']} ({row['rate']}%)")

# Word count statistics (for successful extractions)
successful_df = df_results[df_results['success'] == True]
if len(successful_df) > 0:
    print(f"\nWord Count Statistics:")
    print(f"  Average: {successful_df['word_count'].mean():.0f} words")
    print(f"  Median: {successful_df['word_count'].median():.0f} words")
    print(f"  Min: {successful_df['word_count'].min()} words")
    print(f"  Max: {successful_df['word_count'].max()} words")

# Error breakdown
if failure_count > 0:
    print(f"\nError Breakdown:")
    error_counts = df_results[df_results['success'] == False]['error'].value_counts()
    for error, count in error_counts.items():
        print(f"  {error}: {count}")

📊 SCRAPING RESULTS SUMMARY
Total Attempted: 10
Successful: 10 (100.0%)
Failed: 0 (0.0%)

Success Rate by Source:
  The Hindu - Archive: 10.0/10.0 (100.0%)

Word Count Statistics:
  Average: 226 words
  Median: 220 words
  Min: 64 words
  Max: 378 words


In [27]:
# Inspect sample extractions
print("="*60)
print("📰 SAMPLE EXTRACTED CONTENT")
print("="*60)

successful_samples = df_results[df_results['success'] == True].head(3)

for i, row in successful_samples.iterrows():
    print(f"\n{i+1}. {row['title'][:60]}...")
    print(f"   Source: {row['source']}")
    print(f"   Words: {row['word_count']}")
    print(f"   Content Preview:")
    print("-"*60)
    print(row['content'][:300] + "...")
    print("-"*60)

📰 SAMPLE EXTRACTED CONTENT

1. Mother kills child, ends life in Erode...
   Source: The Hindu - Archive
   Words: 204
   Content Preview:
------------------------------------------------------------
In a tragic incident, a 28-year-old mother reportedly killed her one-and-a-half-year-old son and ended her life in Vellode near here on Friday.
Police said Kavin Prasad and his wife Amaravathi had been married for five years and had a son named Aathiran. Kavin works at a private company in Perundura...
------------------------------------------------------------

2. Bar employee arrested for murdering Salem native near Coimba...
   Source: The Hindu - Archive
   Words: 292
   Content Preview:
------------------------------------------------------------
The police have arrested an employee of a bar attached to a Tasmac outlet near Sulur in Coimbatore district on charge of murdering a Salem native on June 22.
M. Dharmar, 27, a native of Thiruvadanai in Ramanathapuram district, was arrested on

## Phase 4: Update Database

In [28]:
def update_articles_in_db(results: List[Dict]) -> Dict:
    """
    Update Supabase with scraped content
    
    Returns:
        Dict with update statistics
    """
    updated = 0
    failed_updates = 0
    
    print("🔄 Updating database...\n")
    
    for result in results:
        if result['success']:
            try:
                # Update record
                supabase.table('news_cleaned').update({
                    'content_full': result['content'],
                    'needs_full_scrape': False
                }).eq('id', result['id']).execute()
                
                updated += 1
                print(f"✅ Updated: {result['title'][:50]}...")
                
            except Exception as e:
                failed_updates += 1
                print(f"❌ Failed to update {result['id']}: {e}")
    
    return {
        'updated': updated,
        'failed': failed_updates
    }

print("✅ Update function defined")

✅ Update function defined


In [29]:
# Update database with successful extractions
print(f"⚠️  About to update {success_count} records in database\n")
print("="*60)

update_stats = update_articles_in_db(scraping_results)

print("\n" + "="*60)
print("📊 UPDATE SUMMARY")
print("="*60)
print(f"Successfully Updated: {update_stats['updated']}")
print(f"Failed Updates: {update_stats['failed']}")

⚠️  About to update 10 records in database

🔄 Updating database...

✅ Updated: Mother kills child, ends life in Erode...
✅ Updated: Bar employee arrested for murdering Salem native n...
✅ Updated: UGD charges to be collected as per 2023 bylaw, say...
✅ Updated: Coimbatore Corporation Council passes resolution t...
✅ Updated: Stalin inaugurates new classrooms built at a cost ...
✅ Updated: Water level in Papanasam dam stands at 135.20 feet...
✅ Updated: District-level agricultural machinery maintenance ...
✅ Updated: Couple killed as container lorry runs over them on...
✅ Updated: Residents flag dumping of waste near waterbody on ...
✅ Updated: Residents of Rakkiapalayam in Tiruppur allege thef...

📊 UPDATE SUMMARY
Successfully Updated: 10
Failed Updates: 0


In [30]:
# Verify updates in database
print("🔍 Verifying database updates...\n")

# Check total articles still needing scraping
remaining = supabase.table('news_cleaned').select('*', count='exact').eq('needs_full_scrape', True).execute()

# Check articles with content
with_content = supabase.table('news_cleaned').select('*', count='exact').not_.is_('content_full', 'null').execute()

print("="*60)
print("📊 DATABASE STATUS")
print("="*60)
print(f"Articles still needing scraping: {remaining.count}")
print(f"Articles with full content: {with_content.count}")
print(f"\n✅ Reduction: {82 - remaining.count} articles scraped from original 82")

🔍 Verifying database updates...

📊 DATABASE STATUS
Articles still needing scraping: 452
Articles with full content: 2292

✅ Reduction: -370 articles scraped from original 82


## Phase 5: Export Failed Articles & Full Batch Processing

In [31]:
# Export failed articles for investigation
failed_df = df_results[df_results['success'] == False][['title', 'link', 'source', 'error']]

if len(failed_df) > 0:
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f"../data/raw/failed_scrapes_{timestamp}.csv"
    failed_df.to_csv(filename, index=False)
    print(f"📁 Exported {len(failed_df)} failed articles to: {filename}")
else:
    print("✅ No failed articles to export!")

✅ No failed articles to export!


In [32]:
# OPTIONAL: Process all remaining articles
# ⚠️  Uncomment and run this cell ONLY after validating test batch results

print("⚠️  PROCESSING ALL REMAINING ARTICLES")
print(f"This will scrape ~{remaining.count} articles")
print(f"Estimated time: ~{remaining.count * 2.5 / 60:.1f} minutes\n")

# Fetch all remaining articles
remaining_articles = supabase.table('news_cleaned') \
    .select('id, title, link, source, category') \
    .eq('needs_full_scrape', True) \
    .execute()

# Scrape in batches of 20
batch_size = 20
all_results = []

for i in range(0, len(remaining_articles.data), batch_size):
    batch = remaining_articles.data[i:i+batch_size]
    print(f"\n🔄 Processing batch {i//batch_size + 1}...")
    batch_results = scrape_articles_batch(batch)
    update_articles_in_db(batch_results)
    all_results.extend(batch_results)
    time.sleep(5)  # Pause between batches

print("\n✅ All articles processed!")

# print("💡 Uncomment this cell to process all remaining articles after test validation")

⚠️  PROCESSING ALL REMAINING ARTICLES
This will scrape ~452 articles
Estimated time: ~18.8 minutes


🔄 Processing batch 1...
🚀 Starting batch scraping: 20 articles
⏱️  Rate limit: 2.5s between requests


[1/20] Scraping: Inflow into Mettur Dam increases to 80,984 cusecs;...
URL: https://www.thehindu.com/news/cities/Coimbatore/inflow-into-...
✅ Success! Extracted 310 words

[2/20] Scraping: Water level in Mullaperiyar dam stands at 135.60 f...
URL: https://www.thehindu.com/news/cities/Madurai/water-level-in-...
✅ Success! Extracted 98 words

[3/20] Scraping: Steady Krishna water inflow to boost storage in Ch...
URL: https://www.thehindu.com/news/cities/chennai/steady-krishna-...
✅ Success! Extracted 321 words

[4/20] Scraping: Free Wi-Fi now available for international passeng...
URL: https://www.thehindu.com/news/cities/chennai/free-wifi-now-a...
✅ Success! Extracted 156 words

[5/20] Scraping: Research finds financial inclusion to be crucial f...
URL: https://www.thehindu.com/news/nat

## Final Report & Next Steps

In [33]:
# Generate final report
print("="*60)
print("📋 FINAL SCRAPING REPORT")
print("="*60)
print(f"\nTest Batch Results:")
print(f"  Attempted: {len(test_batch)}")
print(f"  Successful: {success_count} ({success_rate:.1f}%)")
print(f"  Failed: {failure_count}")
print(f"  Avg Word Count: {successful_df['word_count'].mean():.0f} words")

print(f"\nDatabase Status:")
print(f"  Total articles in DB: 150")
print(f"  With full content: {with_content.count}")
print(f"  Still need scraping: {remaining.count}")

print(f"\nNext Steps:")
if success_rate > 70:
    print("  ✅ Success rate is good! Process remaining articles.")
elif success_rate > 50:
    print("  ⚠️  Success rate is moderate. Review failed extractions.")
    print("  ⚠️  Consider adjusting extraction parameters")
else:
    print("  ❌ Success rate is low. May need custom scraper.")
    print("  ❌ Review The Hindu's HTML structure manually")

print("\n" + "="*60)

📋 FINAL SCRAPING REPORT

Test Batch Results:
  Attempted: 10
  Successful: 10 (100.0%)
  Failed: 0
  Avg Word Count: 226 words

Database Status:
  Total articles in DB: 150
  With full content: 2292
  Still need scraping: 452

Next Steps:
  ✅ Success rate is good! Process remaining articles.

