# Build Dioceses Database

This notebook scrapes the USCCB website to build the initial dioceses database.

**Prerequisites**: 
1. Run `00_Colab_Setup.ipynb` first
2. Ensure your API keys are configured

**What this does**:
- Scrapes diocese information from the USCCB website
- Extracts name, address, and website for each diocese
- Saves the data to your Supabase database
- Provides a downloadable CSV backup

In [None]:
# Cell 1: Setup Environment and Imports
import os
import sys

# Ensure we're in the correct directory and set up Python path
repo_path = '/content/usccb-parish-extraction'

if not os.path.exists(repo_path):
    print("❌ Repository not found!")
    print("Please run 00_Colab_Setup.ipynb first to clone the repository.")
    raise FileNotFoundError("Repository not found")

# Change to repository directory and add to Python path
os.chdir(repo_path)
if repo_path not in sys.path:
    sys.path.append(repo_path)

print(f"📂 Working directory: {os.getcwd()}")
print("🐍 Python path configured")

# Now import the required modules
try:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    from datetime import datetime

    from config.settings import get_config
    from src.utils.webdriver import setup_driver, load_page, clean_text
    
    print("✅ All modules imported successfully")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("\n🔧 Troubleshooting:")
    print("1. Make sure you've run 00_Colab_Setup.ipynb completely")
    print("2. If you restarted the runtime, re-run the setup notebook")
    print("3. Check that all required packages are installed")
    raise

# Get configuration
try:
    config = get_config()
    print("✅ Configuration loaded successfully")
    print(f"📊 Database: {'Connected' if config.supabase else 'Not connected'}")
    print(f"🤖 AI: {'Enabled' if config.genai_enabled else 'Mock mode'}")
except RuntimeError as e:
    print(f"❌ Configuration error: {e}")
    print("\n🔧 Please run 00_Colab_Setup.ipynb first to configure your environment.")
    raise

In [None]:
# Cell 2: Scrape USCCB Dioceses Page
def scrape_dioceses_from_usccb():
    """Scrape dioceses information from USCCB website"""
    url = "https://www.usccb.org/about/bishops-and-dioceses/all-dioceses"
    print(f"🔍 Scraping dioceses from: {url}")
    
    driver = setup_driver()
    try:
        print("⏳ Loading page (this may take a moment)...")
        soup = load_page(driver, url)
        print("✅ Page loaded successfully")
        
        # Find diocese containers
        diocese_containers = soup.find_all('div', class_='views-row')
        print(f"📋 Found {len(diocese_containers)} potential diocese containers")
        
        dioceses = []
        
        for container in diocese_containers:
            diocese_data = extract_diocese_info(container)
            if diocese_data:
                dioceses.append(diocese_data)
                if len(dioceses) % 10 == 0:
                    print(f"   📊 Processed {len(dioceses)} dioceses...")
        
        print(f"\n✅ Successfully extracted {len(dioceses)} dioceses")
        return dioceses
        
    except Exception as e:
        print(f"❌ Error during scraping: {e}")
        raise
    finally:
        driver.quit()
        print("🔧 Browser closed")

def extract_diocese_info(container):
    """Extract diocese information from a container element"""
    try:
        da_wrap = container.find('div', class_='da-wrap')
        if not da_wrap:
            return None
        
        # Extract name
        name_div = da_wrap.find('div', class_='da-title')
        if not name_div:
            return None
        name = clean_text(name_div.get_text())
        
        # Extract address
        address_div = da_wrap.find('div', class_='da-address')
        address_parts = []
        if address_div:
            for div in address_div.find_all('div', recursive=False):
                text = clean_text(div.get_text())
                if text and text.strip():
                    address_parts.append(text)
        
        address = ", ".join(address_parts) if address_parts else None
        
        # Extract website
        website_div = da_wrap.find('div', class_='site')
        website = None
        if website_div:
            link = website_div.find('a')
            if link and link.get('href'):
                website = link.get('href')
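                # Clean up the URL: add a scheme to protocol-relative or bare-domain links
                if website.startswith('//'):
                    website = f"https:{website}"
                elif not website.startswith('http'):
                    website = f"https://{website}"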
        
        # Only return if we have a valid name
        if name and len(name.strip()) > 2:
            return {
                'Name': name,
                'Address': address,
                'Website': website,
                'extracted_at': datetime.now().isoformat()
            }
    
    except Exception as e:
        print(f"⚠️ Error extracting diocese info: {e}")
    
    return None

# Run the scraping
print("🚀 Starting USCCB diocese extraction...\n")
dioceses_data = scrape_dioceses_from_usccb()
print(f"\n🎉 Extraction complete! Found {len(dioceses_data)} dioceses.")
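
**Note on website links:** The cleanup in Cell 2 covers absolute URLs and bare domains. Site-relative links (such as `/contact`) and non-web links (such as `mailto:`) need more care. The following is a minimal sketch of a stricter normalizer, assuming relative links should resolve against the listing page URL. `normalize_website` is a hypothetical helper, not part of the repository.

```python
from urllib.parse import urljoin

# Assumption: relative links resolve against the page that was scraped.
LISTING_URL = "https://www.usccb.org/about/bishops-and-dioceses/all-dioceses"

def normalize_website(href, base_url=LISTING_URL):
    """Hypothetical helper: return an absolute http(s) URL, or None if unusable."""
    if not href or not href.strip():
        return None
    href = href.strip()
    lowered = href.lower()
    if lowered.startswith(("http://", "https://")):
        return href
    if lowered.startswith(("mailto:", "tel:", "javascript:", "#")):
        return None  # not a website link
    if href.startswith("//"):
        return "https:" + href  # protocol-relative link
    if href.startswith("/"):
        return urljoin(base_url, href)  # site-relative link
    return "https://" + href  # bare domain such as www.example.org

print(normalize_website("www.example.org"))
print(normalize_website("/contact"))
```

If this fits your data, it could replace the inline URL cleanup inside `extract_diocese_info`.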

In [None]:
# Cell 3: Analyze and Display Results
if dioceses_data:
    # Create DataFrame
    df = pd.DataFrame(dioceses_data)
    
    print(f"📊 DIOCESE EXTRACTION ANALYSIS")
    print(f"{'='*50}")
    print(f"Total dioceses extracted: {len(df)}")
    print(f"Columns: {list(df.columns)}")
    
    # Statistics
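    missing_websites = df['Website'].isna().sum()
    missing_addresses = df['Address'].isna().sum()
    # A record is complete only when both the address and the website are present
    complete_records = df[['Address', 'Website']].notna().all(axis=1).sum()
    
    print(f"\n📈 Data Quality:")
    print(f"   ✅ Complete records: {complete_records}")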
    print(f"   🌐 With websites: {len(df) - missing_websites} ({(len(df) - missing_websites)/len(df)*100:.1f}%)")
    print(f"   📍 With addresses: {len(df) - missing_addresses} ({(len(df) - missing_addresses)/len(df)*100:.1f}%)")
    print(f"   ❌ Missing websites: {missing_websites}")
    print(f"   ❌ Missing addresses: {missing_addresses}")
    
    # Show sample data
    print(f"\n📋 Sample Data (first 5 dioceses):")
    print("=" * 50)
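    for i, (_, row) in enumerate(df.head().iterrows(), start=1):
        print(f"{i}. {row['Name']}")
        # pd.notna handles both None and NaN (NaN is truthy, so a plain `if` would print "nan")
        if pd.notna(row['Address']):
            print(f"   📍 {row['Address']}")
        if pd.notna(row['Website']):
            print(f"   🌐 {row['Website']}")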
        print()
    
    if len(df) > 5:
        print(f"... and {len(df) - 5} more dioceses")
    
    # Check for duplicates
    duplicates = df.duplicated(subset=['Name']).sum()
    if duplicates > 0:
        print(f"\n⚠️ Found {duplicates} potential duplicate dioceses")
        print("   Review these before insertion: a duplicate may cause a batch to be rejected")
    else:
        print(f"\n✅ No duplicate dioceses found")

else:
    print("❌ No dioceses data was extracted")
    print("\n🔧 Troubleshooting:")
    print("   • Check your internet connection")
    print("   • The USCCB website might be temporarily unavailable")
    print("   • Try running the scraping cell again")
    df = pd.DataFrame()
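
**Note on duplicates:** Cell 3 only counts exact duplicate names. Below is a minimal sketch of dropping duplicates before insertion, assuming that names differing only in case or whitespace refer to the same diocese. `dedupe_by_name` is a hypothetical helper and the sample records are made up.

```python
def dedupe_by_name(records):
    """Keep the first record per name, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for record in records:
        # Collapse runs of whitespace, then compare case-insensitively.
        key = " ".join(record["Name"].split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Made-up sample records for illustration only.
sample = [
    {"Name": "Diocese of Example", "Website": "https://example.org"},
    {"Name": "diocese of  example", "Website": None},
    {"Name": "Diocese of Sample", "Website": None},
]
unique = dedupe_by_name(sample)
print(len(unique))
```

In this notebook the same helper could be applied as `df = pd.DataFrame(dedupe_by_name(dioceses_data))` before saving.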

In [None]:
# Cell 4: Save to Database
if not df.empty and config.supabase:
    print("💾 Saving dioceses to Supabase database...\n")
    
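    # Convert DataFrame to list of dictionaries, mapping any NaN to None
    # so that every value can be serialized as JSON
    records = df.astype(object).where(df.notna(), None).to_dict('records')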
    
    try:
        # Insert data in batches to avoid timeouts
        batch_size = 20
        total_inserted = 0
        errors = 0
        
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            batch_num = i//batch_size + 1
            
            print(f"📤 Inserting batch {batch_num}: {len(batch)} dioceses...")
            
            try:
                response = config.supabase.table('Dioceses').insert(batch).execute()
                
                if hasattr(response, 'error') and response.error:
                    print(f"   ❌ Database error: {response.error}")
                    errors += len(batch)
                else:
                    total_inserted += len(batch)
                    print(f"   ✅ Successfully inserted {len(batch)} dioceses")
            
            except Exception as e:
                error_msg = str(e).lower()
                if 'duplicate' in error_msg or 'unique' in error_msg:
                    # A bulk insert is a single statement, so one conflicting
                    # row rejects the whole batch; nothing from it was written.
                    print(f"   ⚠️ Batch skipped: at least one diocese already exists")
                else:
                    print(f"   ❌ Error inserting batch: {e}")
                errors += len(batch)
            
            # Small delay between batches
            if i + batch_size < len(records):
                time.sleep(0.5)
        
        # Final results
        print(f"\n{'='*50}")
        print(f"📊 DATABASE INSERTION RESULTS")
        print(f"{'='*50}")
        print(f"Total dioceses processed: {len(df)}")
        print(f"Successfully saved: {total_inserted}")
        print(f"Errors/Skipped: {errors}")
        print(f"Success rate: {total_inserted/len(df)*100:.1f}%")
        
        if total_inserted > 0:
            print(f"\n🎉 Dioceses database built successfully!")
            print(f"✅ You can now run parish extraction notebooks")
        else:
            print(f"\n❌ No dioceses were saved to the database")
            print(f"🔧 Check your database connection and try again")
        
    except Exception as e:
        print(f"❌ Database operation failed: {e}")
        print(f"\n🔧 Troubleshooting:")
        print(f"   • Check your Supabase connection")
        print(f"   • Verify the 'Dioceses' table exists")
        print(f"   • Check your API key permissions")

elif df.empty:
    print("❌ No data to save - extraction may have failed")
    
else:
    print("⚠️ Database not configured - data not saved to cloud")
    print("\n💡 But don't worry! Your data is still available in this session.")
    print("   You can export it to CSV in the next cell.")
    print("\n🔧 To enable database saving:")
    print("   • Add your Supabase credentials to Colab Secrets")
    print("   • Re-run the setup notebook")
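
**Note on batch failures:** A bulk insert typically succeeds or fails as a unit, so one conflicting row can sink an otherwise valid batch. Here is a minimal sketch of a fallback that retries a failed batch row by row. It assumes `insert_rows` is a hypothetical callable that wraps something like `config.supabase.table('Dioceses').insert(rows).execute()` and raises on failure. The in-memory `fake_insert` below only stands in for the database.

```python
def insert_with_fallback(records, insert_rows, batch_size=20):
    """Insert in batches; when a batch fails, retry its rows one at a time.

    Returns (inserted, failed), where `failed` is a list of (record, error) pairs.
    """
    inserted = 0
    failed = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        try:
            insert_rows(batch)
            inserted += len(batch)
        except Exception:
            # Isolate the offending rows instead of discarding the whole batch.
            for record in batch:
                try:
                    insert_rows([record])
                    inserted += 1
                except Exception as exc:
                    failed.append((record, str(exc)))
    return inserted, failed

# Stand-in for the database: rejects any batch containing an already-stored name.
stored = {"Diocese of Example"}

def fake_insert(rows):
    names = [row["Name"] for row in rows]
    if any(name in stored for name in names):
        raise ValueError("duplicate key value violates unique constraint")
    stored.update(names)

records = [{"Name": "Diocese of Example"}, {"Name": "Diocese of Sample"}]
inserted, failed = insert_with_fallback(records, fake_insert)
print(inserted, len(failed))
```

The trade-off is extra requests when a batch fails, in exchange for accurate per-row counts.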

In [None]:
# Cell 5: Export to CSV (Always useful as backup)
if not df.empty:
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f'usccb_dioceses_extracted_{timestamp}.csv'
    
    try:
        # Save to CSV
        df.to_csv(filename, index=False)
        print(f"📁 Data exported to: {filename}")
        print(f"📊 Exported {len(df)} dioceses")
        
        # Show file size
        file_size = os.path.getsize(filename) / 1024  # KB
        print(f"📦 File size: {file_size:.1f} KB")
        
        # Download file in Colab
        try:
            from google.colab import files
            files.download(filename)
            print(f"⬇️ File downloaded to your computer")
            print(f"\n💡 Tip: Keep this CSV as a backup of your dioceses data")
        except ImportError:
            # Not in Colab environment
            print(f"📁 File saved locally: {filename}")
    
    except Exception as e:
        print(f"❌ Export failed: {e}")
        
else:
    print("❌ No data to export")
    print("\n🔧 The scraping may have failed. Try:")
    print("   • Re-running Cell 2 (the scraping cell)")
    print("   • Checking your internet connection")
    print("   • Waiting a moment and trying again")

## 🎉 Dioceses Database Build Complete!

If you see "Dioceses database built successfully" above, you're ready for the next step!

### ✅ What You've Accomplished:
- ✅ Scraped diocese data from the official USCCB website
- ✅ Extracted names, addresses, and websites for the listed dioceses
- ✅ Saved the data to your Supabase database (if configured)
- ✅ Created a CSV backup of the data

### 🚀 Next Steps:

1. **🎯 Quick Parish Extraction Demo**
   - Open and run [`99_Simple_Demo.ipynb`](99_Simple_Demo.ipynb)
   - This will extract parishes from a few dioceses

2. **🔍 Find Parish Directories** 
   - Run [`02_Find_Parish_Directories.ipynb`](02_Find_Parish_Directories.ipynb)
   - This finds parish directory URLs for all dioceses

3. **📥 Extract All Parish Data**
   - Run [`03_Extract_Parish_Data.ipynb`](03_Extract_Parish_Data.ipynb)
   - This extracts detailed parish information

### 🛠️ Troubleshooting:

**If the scraping failed:**
- The USCCB website might be temporarily unavailable
- Try re-running Cell 2 after a few minutes
- Check your internet connection
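
Rather than re-running Cell 2 by hand, the retry can be automated. This is a minimal sketch of retrying with exponential backoff. `retry` is a hypothetical helper, and the attempt count and delays are illustrative assumptions.

```python
import time

def retry(func, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# In this notebook the call might look like:
# dioceses_data = retry(scrape_dioceses_from_usccb, attempts=3, base_delay=60)
```

The `sleep` parameter exists only so the delay can be replaced in tests.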

**If database saving failed:**
- Check your Supabase credentials in the setup notebook
- Verify your Supabase project is active
- The CSV export still gives you the data
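
Once the database issue is fixed, the CSV backup can be loaded back into records for another insertion attempt. The following is a minimal sketch using only the standard library. `read_backup` and `batches` are hypothetical helpers, and the sample row is made up.

```python
import csv
import io

def read_backup(handle):
    """Read CSV rows into dicts, turning empty cells back into None."""
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]

def batches(records, size=20):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Made-up sample standing in for a real backup file.
sample = io.StringIO(
    "Name,Address,Website,extracted_at\n"
    'Diocese of Example,"1 Main St, Springfield",,2024-01-01T00:00:00\n'
)
records = read_backup(sample)
print(records[0]["Address"], records[0]["Website"])
```

With a real backup, pass `open(filename, newline="", encoding="utf-8")` instead of the `StringIO` sample, then feed each batch to the same insert call used in Cell 4.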

---

💡 **Remember**: The CSV file you downloaded is a complete backup of the extracted diocese data!