# Extract Parish Data

This notebook extracts detailed parish information from discovered parish directory pages.

**Prerequisites**: 
1. Run `00_Colab_Setup.ipynb` first
2. Run `01_Build_Dioceses_Database.ipynb` to populate dioceses
3. Run `02_Find_Parish_Directories.ipynb` to discover directory URLs

**What this does**:
- Detects website patterns and selects optimal extraction strategies
- Extracts parish data such as names, addresses, phone numbers, and websites
- Handles multiple website platforms (eCatholic, Squarespace, WordPress, etc.)
- Saves extracted parish data to Supabase database
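
The pattern-detection step can be pictured as a lookup from page markers to a platform label. The sketch below is a simplified, hypothetical stand-in, not the repository's `detect_site_type` (its implementation is not shown in this notebook). The function name `guess_platform`, the marker strings, and the returned labels are all assumptions chosen for illustration.

```python
# Hypothetical sketch: guess a website platform from markers in its HTML.
# The marker strings and labels are illustrative assumptions, not the
# rules used by the repository's detect_site_type.
PLATFORM_MARKERS = {
    "ecatholic": ("ecatholic.com",),
    "squarespace": ("squarespace.com", "squarespace-cdn.com"),
    "wordpress": ("wp-content", "wp-includes"),
}


def guess_platform(html: str) -> str:
    """Return the first platform whose marker appears in the HTML, else 'generic'."""
    lowered = html.lower()
    for platform, markers in PLATFORM_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return platform
    return "generic"
```

A label like this would then be used to pick an extractor, much as Cell 2 passes `site_type.value` to `get_extractor`.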

In [None]:
# Cell 1: Setup Environment and Imports
import os
import sys
import time
from datetime import datetime

# Ensure we're in the correct directory and set up Python path
repo_path = '/content/usccb-parish-extraction'

if not os.path.exists(repo_path):
    print("❌ Repository not found!")
    print("Please run 00_Colab_Setup.ipynb first to clone the repository.")
    raise FileNotFoundError("Repository not found")

# Change to repository directory and add to Python path
os.chdir(repo_path)
if repo_path not in sys.path:
    sys.path.append(repo_path)

print(f"📂 Working directory: {os.getcwd()}")
print("🐍 Python path configured")

# Import required modules
try:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import json
    
    from config.settings import get_config
    from src.pipeline import ParishExtractionPipeline
    from src.models import Diocese, Parish, ExtractionResult, SiteType
    from src.utils.webdriver import setup_driver
    from src.utils.database import save_parishes_to_database, get_dioceses_to_process
    from src.extractors import get_extractor
    from src.utils.ai_analysis import detect_site_type
    
    print("✅ All modules imported successfully")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("\n🔧 Troubleshooting:")
    print("1. Make sure you've run 00_Colab_Setup.ipynb completely")
    print("2. If you restarted the runtime, re-run the setup notebook")
    print("3. Check that all required packages are installed")
    raise

# Get configuration
try:
    config = get_config()
    print("✅ Configuration loaded successfully")
    print(f"📊 Database: {'Connected' if config.supabase else 'Not connected'}")
    print(f"🤖 AI: {'Enabled' if config.genai_enabled else 'Mock mode'}")
except RuntimeError as e:
    print(f"❌ Configuration error: {e}")
    print("\n🔧 Please run 00_Colab_Setup.ipynb first to configure your environment.")
    raise

In [None]:
# Cell 2: Parish Extraction Functions

def get_dioceses_with_directories(limit=None):
    """Get dioceses with a non-empty parish directory URL, randomly sampled down to `limit` if given."""
    if not config.supabase:
        print("❌ No database connection")
        return []
    
    try:
        # Get dioceses with parish directory URLs
        response = config.supabase.table('DiocesesParishDirectory').select(
            'diocese_url, parish_directory_url'
        ).not_.is_('parish_directory_url', 'null').neq('parish_directory_url', '').execute()
        
        diocese_directories = response.data or []
        
        # Get diocese names
        if diocese_directories:
            diocese_urls = [item['diocese_url'] for item in diocese_directories]
            
            names_response = config.supabase.table('Dioceses').select(
                'Website, Name'
            ).in_('Website', diocese_urls).execute()
            
            url_to_name = {item['Website']: item['Name'] for item in (names_response.data or [])}
            
            # Combine data
            dioceses_to_process = []
            for item in diocese_directories:
                diocese_url = item['diocese_url']
                diocese_name = url_to_name.get(diocese_url, 'Unknown Diocese')
                
                dioceses_to_process.append({
                    'name': diocese_name,
                    'url': diocese_url,
                    'parish_directory_url': item['parish_directory_url']
                })
            
            # Note: sampling is random, so each run may pick different dioceses
            if limit and len(dioceses_to_process) > limit:
                import random
                dioceses_to_process = random.sample(dioceses_to_process, limit)
            
            return dioceses_to_process
        
        return []
        
    except Exception as e:
        print(f"❌ Error fetching dioceses with directories: {e}")
        return []

def extract_parishes_from_directory(diocese_info, driver):
    """Extract parishes from a single diocese directory page."""
    diocese_name = diocese_info['name']
    diocese_url = diocese_info['url']
    directory_url = diocese_info['parish_directory_url']
    
    print(f"\n🏛️ Extracting parishes from: {diocese_name}")
    print(f"  📍 Diocese URL: {diocese_url}")
    print(f"  📂 Directory URL: {directory_url}")
    
    try:
        # Load the parish directory page
        print(f"  📥 Loading directory page...")
        driver.get(directory_url)
        time.sleep(3)  # Give time for JS to load
        
        html_content = driver.page_source
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Detect site type
        print(f"  🔍 Detecting website pattern...")
        site_type = detect_site_type(soup, directory_url)
        print(f"    📊 Detected type: {site_type.value}")
        
        # Get appropriate extractor
        extractor = get_extractor(site_type.value)
        print(f"    🔧 Using extractor: {extractor.name}")
        
        # Extract parishes
        print(f"  ⚙️ Extracting parish data...")
        parishes = extractor.extract(soup, directory_url, driver)
        
        print(f"  ✅ Extracted {len(parishes)} parishes")
        
        # Create extraction result
        result = ExtractionResult(
            diocese_name=diocese_name,
            diocese_url=diocese_url,
            directory_url=directory_url,
            parishes=parishes,
            site_type=site_type,
            success=len(parishes) > 0
        )
        
        # Save to database
        if parishes:
            print(f"  💾 Saving parishes to database...")
            saved_count = save_parishes_to_database(
                parishes, diocese_url, directory_url, site_type.value
            )
            result.saved_count = saved_count
            print(f"    📊 Saved {saved_count} parishes")
        
        return result
        
    except Exception as e:
        error_msg = str(e)[:100]
        print(f"  ❌ Error extracting from {diocese_name}: {error_msg}")
        
        return ExtractionResult(
            diocese_name=diocese_name,
            diocese_url=diocese_url,
            directory_url=directory_url,
            parishes=[],
            site_type=SiteType.GENERIC,
            success=False,
            errors=[error_msg]
        )

print("✅ Parish extraction functions loaded")

In [None]:
# Cell 3: Main Extraction Process

# Set processing limit (you can change this)
MAX_DIOCESES_TO_PROCESS = 3  # Process 3 dioceses as a test

print(f"🚀 Starting parish data extraction...")
print(f"📊 Will process up to {MAX_DIOCESES_TO_PROCESS} dioceses")

# Get dioceses with directory URLs
dioceses_to_process = get_dioceses_with_directories(limit=MAX_DIOCESES_TO_PROCESS)

if not dioceses_to_process:
    print("❌ No dioceses with parish directory URLs found")
    print("\n🔧 Make sure you've run 02_Find_Parish_Directories.ipynb first")
else:
    print(f"📋 Found {len(dioceses_to_process)} dioceses with directory URLs")
    
    # Show what we'll process
    print(f"\n📋 Dioceses to process:")
    for i, diocese in enumerate(dioceses_to_process, 1):
        print(f"  {i}. {diocese['name']}")
        print(f"     Directory: {diocese['parish_directory_url']}")
    
    # Setup WebDriver
    driver = setup_driver()
    
    if not driver:
        print("❌ Failed to setup WebDriver")
    else:
        results = []
        
        try:
            for i, diocese_info in enumerate(dioceses_to_process, 1):
                print(f"\n{'='*70}")
                print(f"Processing diocese {i}/{len(dioceses_to_process)}")
                
                result = extract_parishes_from_directory(diocese_info, driver)
                results.append(result)
                
                # Be respectful - pause between requests
                if i < len(dioceses_to_process):
                    print(f"  ⏱️ Waiting {config.request_delay} seconds...")
                    time.sleep(config.request_delay)
        
        finally:
            driver.quit()
            print("\n🧹 WebDriver closed")
        
        # Print comprehensive summary
        print(f"\n{'='*70}")
        print(f"📊 EXTRACTION SUMMARY")
        print(f"{'='*70}")
        
        total_parishes = sum(len(r.parishes) for r in results)
        successful_extractions = sum(1 for r in results if r.success)
        total_saved = sum(r.saved_count for r in results)
        
        print(f"Total dioceses processed: {len(results)}")
        print(f"Successful extractions: {successful_extractions}")
        print(f"Total parishes found: {total_parishes}")
        print(f"Total parishes saved: {total_saved}")
        
        if successful_extractions > 0:
            print(f"Average parishes per diocese: {total_parishes/successful_extractions:.1f}")
            print(f"Success rate: {successful_extractions/len(results)*100:.1f}%")
        
        # Show site types detected
        site_types = {}
        for result in results:
            if result.success:
                site_type = result.site_type.value
                site_types[site_type] = site_types.get(site_type, 0) + 1
        
        if site_types:
            print(f"\n🔍 Website Types Detected:")
            for site_type, count in site_types.items():
                print(f"  {site_type.replace('_', ' ').title()}: {count} dioceses")
        
        # Show detailed results
        print(f"\n📋 Detailed Results:")
        for result in results:
            status = "✅" if result.success else "❌"
            parishes_info = f"{len(result.parishes)} parishes" if result.success else "Failed"
            saved_info = f" ({result.saved_count} saved)" if result.saved_count > 0 else ""
            
            print(f"  {status} {result.diocese_name}: {parishes_info}{saved_info}")
            print(f"      Site Type: {result.site_type.value}")
            print(f"      Directory: {result.directory_url}")
            
            if result.errors:
                for error in result.errors:
                    print(f"      Error: {error}")
            
        # Show sample parishes
        all_parishes = []
        for result in results:
            all_parishes.extend(result.parishes)
        
        if all_parishes:
            print(f"\n🏛️ Sample Parishes Extracted:")
            for i, parish in enumerate(all_parishes[:10], 1):
                print(f"  {i}. {parish.name}")
                if parish.city:
                    print(f"     📍 {parish.city}")
                if parish.phone:
                    print(f"     📞 {parish.phone}")
                if parish.website:
                    print(f"     🌐 {parish.website}")
            
            if len(all_parishes) > 10:
                print(f"     ... and {len(all_parishes) - 10} more parishes")
        
        # Save detailed results to file
        try:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f'parish_extraction_results_{timestamp}.json'
            
            # Convert results to serializable format
            serializable_results = []
            for result in results:
                serializable_results.append({
                    'diocese_name': result.diocese_name,
                    'diocese_url': result.diocese_url,
                    'directory_url': result.directory_url,
                    'parish_count': len(result.parishes),
                    'site_type': result.site_type.value,
                    'success': result.success,
                    'saved_count': result.saved_count,
                    'errors': result.errors,
                    'parishes': [
                        {
                            'name': p.name,
                            'city': p.city,
                            'address': p.address,
                            'phone': p.phone,
                            'website': p.website,
                            'confidence': p.confidence
                        }
                        for p in result.parishes
                    ]
                })
            
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(serializable_results, f, indent=2, ensure_ascii=False)
            
            print(f"💾 Detailed results saved to: {filename}")
            
            # Download file in Colab
            try:
                from google.colab import files
                files.download(filename)
                print(f"⬇️ Results file downloaded")
            except ImportError:
                print(f"📁 Results saved locally")
                
        except Exception as e:
            print(f"❌ Error saving results: {e}")

## 🎉 Parish Data Extraction Complete!

If you see successful extractions above, you now have detailed parish data in your database!

### ✅ What You've Accomplished:
- ✅ Detected website patterns and selected optimal extraction strategies
- ✅ Extracted comprehensive parish data including names, addresses, and contact info
- ✅ Handled multiple website platforms automatically
- ✅ Saved parish data to your Supabase database
- ✅ Generated detailed extraction statistics and quality metrics

### 📊 Your Data Now Includes:
- **Parish Names and Locations**: Complete identification information
- **Contact Information**: Phone numbers and websites where available
- **Geographic Data**: Addresses and, where the extractor captures them, coordinates for mapping
- **Quality Metrics**: Confidence scores and extraction methods
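
As a rough illustration of how the confidence scores might be used, the sketch below keeps only higher-confidence records. The field names mirror those written to the results JSON in Cell 3. `ParishRecord` and `filter_by_confidence` are hypothetical names, and the sketch assumes `confidence` is a number between 0 and 1 where higher means more reliable, which this notebook does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParishRecord:
    """Hypothetical stand-in mirroring the fields exported in Cell 3."""
    name: str
    city: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    confidence: float = 0.0  # assumed range: 0.0 to 1.0, higher is more reliable


def filter_by_confidence(parishes: List[ParishRecord], minimum: float = 0.8) -> List[ParishRecord]:
    """Keep parishes whose confidence is at least `minimum`, highest first."""
    kept = [p for p in parishes if p.confidence >= minimum]
    return sorted(kept, key=lambda p: p.confidence, reverse=True)
```

The default threshold of 0.8 is arbitrary; a sensible cutoff depends on how the extractors assign scores.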

### 🚀 Next Steps:

1. **Scale Up Processing**
   - Increase `MAX_DIOCESES_TO_PROCESS` to extract from more dioceses
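   - Note that `get_dioceses_with_directories` samples at random from every diocese with a directory URL, so a larger limit may revisit dioceses processed in earlier runs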

2. **Analyze Your Data**
   - Query your Supabase `Parishes` table to explore the extracted data
   - Use the quality metrics to identify the most reliable extractions

3. **Advanced Features**
   - Run the other specialized notebooks for specific data extraction
   - Explore the detailed extraction results JSON file
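
To explore the results JSON file, a small summary helper may be enough to start with. This is a minimal sketch that assumes the file has the structure written by Cell 3 (a list of per-diocese entries with `success`, `saved_count`, `site_type`, and `parishes` keys); `summarize_results` is a hypothetical name, not part of the repository.

```python
import json
from collections import Counter


def summarize_results(path):
    """Summarize a results file written by Cell 3 (a JSON list of diocese entries)."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)

    succeeded = [r for r in results if r["success"]]
    return {
        "dioceses": len(results),
        "successful": len(succeeded),
        "parishes_found": sum(len(r["parishes"]) for r in results),
        "parishes_saved": sum(r["saved_count"] for r in results),
        # Count site types only for successful extractions, as the notebook summary does
        "site_types": dict(Counter(r["site_type"] for r in succeeded)),
    }
```

Call it with the path of a `parish_extraction_results_<timestamp>.json` file produced by an earlier run.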

### 🛠️ Troubleshooting:

**If no parishes were extracted:**
- Check that you've run `02_Find_Parish_Directories.ipynb` first
- Verify your Supabase database has data in the `DiocesesParishDirectory` table
- Some diocese websites may have changed or be temporarily unavailable

**If extraction failed for specific dioceses:**
- The system handles different website types, but some may require custom handling
- Check the error messages for specific issues
- Re-run the notebook to try again; because dioceses are sampled at random, a failed diocese is not guaranteed to be selected on the next run
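
To retry specific failures deliberately, the failed entries in a saved results file can be turned back into inputs for `extract_parishes_from_directory`. The sketch below assumes the entries are dictionaries loaded from the Cell 3 JSON; `dioceses_to_retry` is a hypothetical helper, not part of the repository.

```python
def dioceses_to_retry(results):
    """Build retry inputs from the failed entries of a loaded results JSON list."""
    retry = []
    for entry in results:
        if not entry["success"]:
            # Same dictionary shape that get_dioceses_with_directories returns
            retry.append({
                "name": entry["diocese_name"],
                "url": entry["diocese_url"],
                "parish_directory_url": entry["directory_url"],
            })
    return retry
```

Each returned dictionary can be passed as `diocese_info` to `extract_parishes_from_directory`, together with a driver from `setup_driver()`.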

---

💡 **Tip**: The extraction system is designed to be re-runnable. You can safely run this notebook multiple times to process additional dioceses or retry failed extractions!
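
One way to keep repeated runs from redoing work is to filter the candidate list against earlier results before extraction. This is a sketch under assumptions: `processed_urls` and `skip_processed` are hypothetical helpers, and the set of finished dioceses is taken from a results list loaded from a Cell 3 JSON file rather than from the database.

```python
def processed_urls(results):
    """Diocese URLs that an earlier run extracted successfully."""
    return {entry["diocese_url"] for entry in results if entry["success"]}


def skip_processed(dioceses, already_done):
    """Drop dioceses whose 'url' appears in the already_done set."""
    return [d for d in dioceses if d["url"] not in already_done]
```

Applied to the list returned by `get_dioceses_with_directories`, this leaves only dioceses without a successful earlier extraction.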