<a href="https://colab.research.google.com/github/tomknightatl/USCCB/blob/main/Find_Parish_Directory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Configuration Parameters

This notebook is designed to find parish directory URLs on diocesan websites. The first code cell below, labeled "User-configurable parameters," contains all the necessary settings to control its behavior, including API key configurations, GitHub integration, and mocking controls.

Please review and configure these parameters before running the notebook.

## API Keys & Secrets

For features that require external services (like Generative AI or Google Search), you'll need to provide API keys. These should be stored securely using Colab Secrets.

### `GENAI_API_KEY`
- **Purpose**: This API key is for Google Generative AI (e.g., Gemini models). It's used to analyze web page content and search result snippets to identify potential parish directory links.
- **Configuration**:
    1. Obtain your GenAI API key from Google AI Studio.
    2. In Google Colab, go to "Secrets" (key icon in the left sidebar) and add a new secret named `GENAI_API_KEY_USCCB`. Paste your API key as the value.
    3. In the "User-configurable parameters" cell, uncomment the line `# GENAI_API_KEY = GENAI_API_KEY_FROM_USERDATA` to use the key from Colab Secrets. Alternatively, you can directly assign your key string to `GENAI_API_KEY` in the cell, but using secrets is recommended.
- **Default**: `None`. If no key is provided, GenAI-powered analysis will be disabled, and the system will rely on mock/basic analysis.
- **Dependencies**: Required if you want to use live GenAI analysis. You'll also need to set `use_mock_genai_direct_page = False` and/or `use_mock_genai_snippet = False`.

### `SEARCH_API_KEY` and `SEARCH_CX`
- **Purpose**: These are for the Google Custom Search API. This API is used as a fallback mechanism to find parish directory links if direct analysis of the diocesan website doesn't yield clear results. `SEARCH_API_KEY` is your API key, and `SEARCH_CX` is your Programmable Search Engine ID.
- **Configuration**:
    1. Create a Programmable Search Engine on the Google Control Panel, configured to search diocesan websites. Note the Search Engine ID (`SEARCH_CX`).
    2. Obtain your Google Cloud API Key enabled for the Custom Search API.
    3. In Colab Secrets, add `SEARCH_API_KEY_USCCB` (with your API key) and `SEARCH_CX_USCCB` (with your Search Engine ID).
    4. In the "User-configurable parameters" cell, uncomment the lines that assign these secrets to `SEARCH_API_KEY` and `SEARCH_CX`.
- **Default**: `None` for both. If not set, the Google Custom Search fallback will be disabled, and the system will rely on mock/basic search.
- **Dependencies**: Required for the search engine fallback feature. You'll also need to set `use_mock_search_engine = False`.

## GitHub Integration

These parameters are for integrating with GitHub, allowing the notebook to clone/pull a repository and potentially push changes (like updated data files).

### `GITHUB_USERNAME`
- **Purpose**: Your GitHub username. Used for authenticating when cloning or pushing to a private repository.
- **Configuration**:
    1. In Colab Secrets, add `GitHubUserforUSCCB` with your GitHub username.
    2. In the "User-configurable parameters" cell, uncomment the line `# GITHUB_USERNAME = GITHUB_USERNAME_FROM_USERDATA`.
- **Default**: `None`. If not set (along with `GITHUB_PAT`), cloning/pulling will attempt to use public access, which might fail for private repositories. Pushing changes will likely fail.
- **Dependencies**: Works in conjunction with `GITHUB_PAT`.

### `GITHUB_PAT`
- **Purpose**: Your GitHub Personal Access Token (PAT). Used for authenticating API requests, such as cloning private repositories or pushing changes.
- **Configuration**:
    1. Generate a PAT from your GitHub developer settings with appropriate permissions (e.g., `repo` scope for private repository access).
    2. In Colab Secrets, add `GitHubPATforUSCCB` with your PAT.
    3. In the "User-configurable parameters" cell, uncomment the line `# GITHUB_PAT = GITHUB_PAT_FROM_USERDATA`.
- **Default**: `None`. If not set (along with `GITHUB_USERNAME`), cloning/pulling will attempt to use public access. Pushing changes will likely fail.
- **Dependencies**: Works in conjunction with `GITHUB_USERNAME`.

### `GITHUB_REPO`
- **Purpose**: The name of the GitHub repository to clone and interact with (e.g., 'USCCB').
- **Configuration**: Set the string value directly in the "User-configurable parameters" cell.
- **Default**: `'USCCB'`.
- **Dependencies**: None.

## Mocking Controls

These boolean flags allow you to run the notebook with mocked (simulated) API responses, which is useful for testing or when API keys are unavailable. Set them to `False` to use live APIs (requires corresponding API keys to be configured).

### `use_mock_genai_direct_page`
- **Purpose**: Controls whether GenAI analysis of links found directly on a webpage uses live API calls or mocked responses.
- **Configuration**: Set to `True` for mock, `False` for live.
- **Default**: `True`. GenAI analysis for direct page links will be mocked.
- **Dependencies**: If set to `False`, a valid `GENAI_API_KEY` must be configured.

### `use_mock_genai_snippet`
- **Purpose**: Controls whether GenAI analysis of search result snippets (from Google Custom Search) uses live API calls or mocked responses.
- **Configuration**: Set to `True` for mock, `False` for live.
- **Default**: `True`. GenAI analysis for search snippets will be mocked.
- **Dependencies**: If set to `False`, a valid `GENAI_API_KEY` must be configured.

### `use_mock_search_engine`
- **Purpose**: Controls whether the Google Custom Search fallback uses live API calls or mocked search results.
- **Configuration**: Set to `True` for mock, `False` for live.
- **Default**: `True`. Google Custom Search calls will be mocked.
- **Dependencies**: If set to `False`, valid `SEARCH_API_KEY` and `SEARCH_CX` must be configured.

## Advanced Settings

### `chrome_options`
- **Purpose**: These are advanced settings for the Selenium WebDriver (Chrome). They control browser behavior like running headless (without a visible UI).
- **Configuration**: Modified directly in the "User-configurable parameters" cell.
- **Default**: Pre-configured for headless operation, no-sandbox, and other common settings for server environments. It's generally not necessary to change these unless you have specific WebDriver requirements.
- **Dependencies**: None.

In [61]:
# Cell 1
# Chrome Installation for Google Colab

import os
import subprocess

def ensure_chrome_installed():
    """Ensures Chrome is installed in the Colab environment."""
    try:
        # Check if Chrome is already available
        result = subprocess.run(['which', 'google-chrome'],
                              capture_output=True, text=True)
        if result.returncode == 0:
            print("✅ Chrome is already installed and available.")
            return True

        print("🔧 Chrome not found. Installing Chrome for Selenium...")

        # Install Chrome
        os.system('apt-get update > /dev/null 2>&1')
        os.system('wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - > /dev/null 2>&1')
        os.system('echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list')
        os.system('apt-get update > /dev/null 2>&1')
        os.system('apt-get install -y google-chrome-stable > /dev/null 2>&1')

        # Verify installation
        result = subprocess.run(['google-chrome', '--version'],
                              capture_output=True, text=True)
        if result.returncode == 0:
            print(f"✅ Chrome installed successfully: {result.stdout.strip()}")
            return True
        else:
            print("❌ Chrome installation may have failed.")
            return False

    except Exception as e:
        print(f"❌ Error during Chrome installation: {e}")
        return False

# Run the installation check
chrome_ready = ensure_chrome_installed()
if chrome_ready:
    print("🚀 Ready to proceed with Selenium operations!")
else:
    print("⚠️  You may need to restart the runtime if Chrome installation failed.")

✅ Chrome is already installed and available.
🚀 Ready to proceed with Selenium operations!


In [62]:
# Cell 2
# Install necessary libraries & Setup API Keys

# This cell installs all required Python packages for the notebook.
!pip install selenium webdriver-manager google-generativeai google-api-python-client tenacity

# Standard library imports
import sqlite3
import re
import os
import time

# Third-party library imports
import requests # For simple HTTP requests (though less used now with Selenium)
from bs4 import BeautifulSoup # For parsing HTML
# from google.colab import userdata # Moved to User-configurable parameters cell

# Selenium imports for web automation and dynamic content loading
from selenium import webdriver
# from selenium.webdriver.chrome.options import Options # Moved to User-configurable parameters cell
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException, WebDriverException

# Google GenAI imports (for Gemini model)
# import google.generativeai as genai # Moved to User-configurable parameters cell
from google.api_core.exceptions import DeadlineExceeded, ServiceUnavailable, ResourceExhausted, InternalServerError, GoogleAPIError

# Google API Client imports (for Custom Search API)
from googleapiclient.errors import HttpError
# To use the live Google Custom Search API, uncomment the following import in this cell
# AND in Cell 4.6 where `build` is called.
# from googleapiclient.discovery import build

# Tenacity library for robust retry mechanisms
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, RetryError

print("--- API Key Status (defined in User-configurable parameters cell) ---")
# GENAI_API_KEY, SEARCH_API_KEY, SEARCH_CX are now defined in the first code cell (User-configurable parameters).
# The genai.configure() call is also in that cell.
# This section now just confirms their status based on the first cell's execution.
if 'GENAI_API_KEY' in globals() and GENAI_API_KEY:
    print("GenAI API Key is configured and available.")
else:
    print("GenAI API Key is NOT configured or available. Mocking will be used for GenAI features.")

if 'SEARCH_API_KEY' in globals() and SEARCH_API_KEY and 'SEARCH_CX' in globals() and SEARCH_CX:
    print("Search API Key and CX are configured and available.")
else:
    print("Search API Key and/or CX are NOT configured or available. Mocking will be used for search engine features.")
print("--- End API Key Status Check ---")

# --- Selenium WebDriver Setup ---
# chrome_options is now defined in the first code cell (User-configurable parameters).
# Ensure it's available in the global scope if defined in the first cell.
if 'chrome_options' not in globals():
    print("Error: chrome_options not found. It should be defined in the User-configurable parameters cell.")
    # Fallback to basic options if not found, though this indicates an issue with notebook structure
    from selenium.webdriver.chrome.options import Options
    chrome_options = Options()
    chrome_options.add_argument("--headless")

driver = None # Global WebDriver instance

def setup_driver():
    """Initializes and returns the Selenium WebDriver instance."""
    global driver
    if driver is None:
        try:
            print("Setting up Chrome WebDriver...")
            # ChromeDriver is automatically managed by webdriver_manager
            driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
            print("WebDriver setup successfully.")
        except Exception as e:
            print(f"Error setting up WebDriver: {e}")
            print("Ensure Chrome is installed if not using a pre-built environment like Colab.")
            driver = None
    return driver

def close_driver():
    """Closes the Selenium WebDriver instance if it's active."""
    global driver
    if driver:
        print("Closing WebDriver...")
        driver.quit()
        driver = None
        print("WebDriver closed.")

--- API Key Status (defined in User-configurable parameters cell) ---
GenAI API Key is configured and available.
Search API Key and/or CX are NOT configured or available. Mocking will be used for search engine features.
--- End API Key Status Check ---


In [63]:
# Cell 3
# User-configurable-parameters
from google.colab import userdata
import google.generativeai as genai
from selenium.webdriver.chrome.options import Options
import os # For os.path.exists in GITHUB_REPO logic if adapted

print("--- User Configurable Parameters Cell Initializing ---")

# --- Processing Limit Configuration ---
# Set the maximum number of random dioceses to process (None = process all)
# This limits processing to help with testing or API quota management
MAX_DIOCESES_TO_PROCESS = 1  # Change this number or set to None to process all dioceses

if MAX_DIOCESES_TO_PROCESS:
    print(f"Processing will be limited to {MAX_DIOCESES_TO_PROCESS} randomly selected dioceses.")
else:
    print("Processing will include all dioceses that lack parish directory URLs.")

# --- GitHub Repository Configuration ---
# To clone/pull from a private repository, ensure GitHubUserforUSCCB and GitHubPATforUSCCB
# are stored in Colab Secrets and uncomment the lines that assign them.
GITHUB_USERNAME_FROM_USERDATA = userdata.get('GitHubUserforUSCCB')
GITHUB_PAT_FROM_USERDATA = userdata.get('GitHubPATforUSCCB')

GITHUB_USERNAME = None # Default: No username. Provide if using a private repo.
GITHUB_PAT = None      # Default: No PAT. Provide if using a private repo.
# if GITHUB_USERNAME_FROM_USERDATA:
#     GITHUB_USERNAME = GITHUB_USERNAME_FROM_USERDATA # Uncomment to use username from Colab Secrets
# if GITHUB_PAT_FROM_USERDATA:
#     GITHUB_PAT = GITHUB_PAT_FROM_USERDATA          # Uncomment to use PAT from Colab Secrets

GITHUB_REPO = 'USCCB' # Name of the repository

if GITHUB_USERNAME and GITHUB_PAT:
    print(f"GitHub credentials loaded for user {GITHUB_USERNAME} for repository {GITHUB_REPO}.")
else:
    print(f"GitHub credentials not fully set. Using public access for repository {GITHUB_REPO}.")

# --- GenAI API Key Setup ---
# To use live GenAI calls:
# 1. Ensure your GENAI_API_KEY_USCCB is stored in Colab Secrets.
# 2. EITHER: Uncomment the line below that assigns GENAI_API_KEY_FROM_USERDATA to GENAI_API_KEY
#    OR: Directly assign your key string to GENAI_API_KEY.
# 3. Set the use_mock_genai_direct_page and use_mock_genai_snippet flags (defined below) to False.
GENAI_API_KEY_FROM_USERDATA = userdata.get('GENAI_API_KEY_USCCB')
GENAI_API_KEY = None # Default: No API key, forces mock.

# CHANGE 1: Uncomment this line to use your API key from Colab Secrets
if GENAI_API_KEY_FROM_USERDATA and GENAI_API_KEY_FROM_USERDATA not in ["YOUR_API_KEY_PLACEHOLDER", "SET_YOUR_KEY_HERE"]:
    GENAI_API_KEY = GENAI_API_KEY_FROM_USERDATA # Now using key from Colab Secrets

if GENAI_API_KEY:
    try:
        genai.configure(api_key=GENAI_API_KEY)
        print("GenAI configured successfully for LIVE calls if relevant mock flags are False.")
    except Exception as e:
        print(f"Error configuring GenAI with key: {e}. GenAI features will be mocked.")
        GENAI_API_KEY = None # Ensure mock if configuration fails
else:
    print("GenAI API Key is not set. GenAI features will be mocked globally.")

# --- Search Engine API Key Setup ---
# To use live Google Custom Search API calls:
# 1. Ensure your SEARCH_API_KEY_USCCB and SEARCH_CX_USCCB are in Colab Secrets.
# 2. EITHER: Uncomment the lines below that assign _FROM_USERDATA to SEARCH_API_KEY and SEARCH_CX
#    OR: Directly assign your key strings.
# 3. Set the use_mock_search_engine flag (defined below) to False.
SEARCH_API_KEY_FROM_USERDATA = userdata.get('SEARCH_API_KEY_USCCB')
SEARCH_CX_FROM_USERDATA = userdata.get('SEARCH_CX_USCCB')

SEARCH_API_KEY = None # Default: No API key, forces mock.
SEARCH_CX = None      # Default: No CX, forces mock.
# CHANGE 2: Uncomment these lines to use your keys from Colab Secrets
if SEARCH_API_KEY_FROM_USERDATA and SEARCH_API_KEY_FROM_USERDATA not in ["YOUR_API_KEY_PLACEHOLDER", "SET_YOUR_KEY_HERE"]:
    SEARCH_API_KEY = SEARCH_API_KEY_FROM_USERDATA # Now using key from Colab Secrets
if SEARCH_CX_FROM_USERDATA and SEARCH_CX_FROM_USERDATA not in ["YOUR_CX_PLACEHOLDER", "SET_YOUR_CX_HERE"]:
    SEARCH_CX = SEARCH_CX_FROM_USERDATA            # Now using CX from Colab Secrets

if SEARCH_API_KEY and SEARCH_CX:
    print("Google Custom Search API Key and CX loaded. Ready for LIVE calls if use_mock_search_engine is False.")
else:
    print("Google Custom Search API Key and/or CX are NOT configured or available. Search engine calls will be mocked.")

# --- Mocking Controls ---
# These flags determine whether to use live API calls or mocked responses.
# CHANGE 3: Set these to False to attempt LIVE GenAI calls
global use_mock_genai_direct_page
use_mock_genai_direct_page = False  # Changed from True to False
# Set to False to attempt LIVE GenAI calls for search snippet analysis (requires valid GENAI_API_KEY)
global use_mock_genai_snippet
use_mock_genai_snippet = False  # Changed from True to False
# CHANGE 4: Set to False to attempt LIVE Google Custom Search calls
global use_mock_search_engine
use_mock_search_engine = False  # Changed from True to False

print(f"Mocking settings: Direct Page GenAI={use_mock_genai_direct_page}, Snippet GenAI={use_mock_genai_snippet}, Search Engine={use_mock_search_engine}")

# --- Selenium WebDriver Options ---
global chrome_options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("window-size=1920,1080")
print("Chrome options configured.")

print("--- End User Configurable Parameters Cell ---")

--- User Configurable Parameters Cell Initializing ---
Processing will be limited to 1 randomly selected dioceses.
GitHub credentials not fully set. Using public access for repository USCCB.
GenAI configured successfully for LIVE calls if relevant mock flags are False.
Google Custom Search API Key and CX loaded. Ready for LIVE calls if use_mock_search_engine is False.
Mocking settings: Direct Page GenAI=False, Snippet GenAI=False, Search Engine=False
Chrome options configured.
--- End User Configurable Parameters Cell ---


In [64]:
# Cell 4
# Clone GitHub repository and configure Git

# This cell clones the GitHub repository if it doesn't exist,
# or pulls the latest changes if it does. It also configures Git user info.

# GITHUB_REPO, GITHUB_USERNAME, GITHUB_PAT are now defined in the first code cell (User-configurable parameters).
# import os # Already imported in the first cell if needed there, or here if only used here.
if 'os' not in globals(): import os # Ensure os is imported if not already by the first cell

if 'GITHUB_REPO' not in globals():
    print("Error: GITHUB_REPO not found. It should be defined in the User-configurable parameters cell.")
    GITHUB_REPO = 'USCCB_fallback' # Fallback repo name

# Construct the repository URL with credentials for private repositories (if applicable)
if GITHUB_USERNAME and GITHUB_PAT: # Check if credentials are set in the first cell
    REPO_URL = f"https://{GITHUB_USERNAME}:{GITHUB_PAT}@github.com/{GITHUB_USERNAME}/{GITHUB_REPO}.git"
    print(f"Using authenticated URL for {GITHUB_REPO}: {REPO_URL}")
else:
    # Fallback to a generic public URL structure if credentials are not provided.
    # Using a placeholder for the username part of the public URL if GITHUB_USERNAME is not set.
    # You might want to adjust 'default_public_username' or handle this case differently.
    default_public_username = 'tomknightatl' # As seen in original notebook, consider parameterizing this too
    public_username_to_use = GITHUB_USERNAME if GITHUB_USERNAME else default_public_username
    REPO_URL = f"https://github.com/{public_username_to_use}/{GITHUB_REPO}.git"
    print(f"Using public URL for {GITHUB_REPO}: {REPO_URL}")

if not os.path.exists(GITHUB_REPO):
    print(f"Cloning repository {GITHUB_REPO} from {REPO_URL}...")
    # Ensure GITHUB_REPO is part of the clone command if it's not the default name from URL
    !git clone {REPO_URL} {GITHUB_REPO}
    os.chdir(GITHUB_REPO) # Change current directory to the repository root
else:
    print(f"Repository {GITHUB_REPO} already exists. Updating...")
    os.chdir(GITHUB_REPO)
    # Ensure you are in the correct directory before pulling
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == GITHUB_REPO:
        !git pull origin main # Pull the latest changes from the main branch
    else:
        print(f"Error: Not in the {GITHUB_REPO} directory. Current directory: {current_dir}")

# Configure Git local settings for this environment (optional, but good practice for commits)
!git config --global user.email "colab@example.com" # Replace with your email if desired
!git config --global user.name "Colab User"      # Replace with your name if desired

Using public URL for USCCB: https://github.com/tomknightatl/USCCB.git
Cloning repository USCCB from https://github.com/tomknightatl/USCCB.git...
Cloning into 'USCCB'...
remote: Enumerating objects: 268, done.[K
remote: Counting objects: 100% (132/132), done.[K
remote: Compressing objects: 100% (107/107), done.[K
remote: Total 268 (delta 83), reused 44 (delta 25), pack-reused 136 (from 1)[K
Receiving objects: 100% (268/268), 211.14 KiB | 6.81 MiB/s, done.
Resolving deltas: 100% (161/161), done.


In [65]:
# Cell 5
# Fetch Dioceses Info from SQLite database

# This cell connects to the SQLite database (data.db) and fetches a list of dioceses
# that do not yet have a parish directory URL recorded.

import sqlite3
import random

dioceses_to_scan = [] # Initialize an empty list to store diocese info
try:
    # Check if the database file exists before attempting to connect
    if not os.path.exists('data.db'):
        print("WARNING: data.db not found. No dioceses will be fetched for scanning.")
        # In a real scenario, data.db should be populated by other notebooks or processes.
    else:
        conn_db = sqlite3.connect('data.db')
        cursor_db = conn_db.cursor()

        # SQL query to select diocesan websites and names where a parish directory URL is missing.
        query = """
        SELECT d.Website, d.Name
        FROM Dioceses d
        LEFT JOIN DiocesesParishDirectory dpd ON d.Website = dpd.diocese_url
        WHERE dpd.parish_directory_url IS NULL OR dpd.parish_directory_url = ''
        """
        cursor_db.execute(query)
        # Store results as a list of dictionaries for easier access to URL and name
        all_dioceses = [{'url': row[0], 'name': row[1]} for row in cursor_db.fetchall()]

        # Apply limit if MAX_DIOCESES_TO_PROCESS is set
        if 'MAX_DIOCESES_TO_PROCESS' in globals() and MAX_DIOCESES_TO_PROCESS is not None:
            if len(all_dioceses) > MAX_DIOCESES_TO_PROCESS:
                dioceses_to_scan = random.sample(all_dioceses, MAX_DIOCESES_TO_PROCESS)
                print(f"Randomly selected {len(dioceses_to_scan)} dioceses out of {len(all_dioceses)} total dioceses lacking parish directory URLs.")
            else:
                dioceses_to_scan = all_dioceses
                print(f"All {len(dioceses_to_scan)} dioceses will be processed (fewer than the limit of {MAX_DIOCESES_TO_PROCESS}).")
        else:
            dioceses_to_scan = all_dioceses
            print(f"Fetched all {len(dioceses_to_scan)} dioceses from the database for scanning.")

except sqlite3.Error as e:
    print(f"Database error in Cell 3: {e}")
finally:
    if 'conn_db' in locals() and conn_db: # Ensure connection was opened before trying to close
        conn_db.close()

Randomly selected 1 dioceses out of 631 total dioceses lacking parish directory URLs.


In [66]:
# Cell 6
# Function to find candidate parish listing URLs from page content

from urllib.parse import urljoin, urlparse # For handling relative and absolute URLs
import re # For regular expression matching in URL paths

def normalize_url_join(base_url, relative_url):
    """Properly joins URLs while avoiding double slashes."""
    # Remove trailing slash from base_url if relative_url starts with slash
    if base_url.endswith('/') and relative_url.startswith('/'):
        base_url = base_url.rstrip('/')
    return urljoin(base_url, relative_url)

def get_surrounding_text(element, max_length=200):
    """Extracts text from the parent element of a given link, limited in length.
    This provides context for the link.
    """
    if element and element.parent:
        parent_text = element.parent.get_text(separator=' ', strip=True)
        # Truncate if too long to keep prompts for GenAI concise
        return parent_text[:max_length] + ('...' if len(parent_text) > max_length else '')
    return ''

def find_candidate_urls(soup, base_url):
    """Scans a BeautifulSoup soup object for potential parish directory links.
    It uses a combination of keyword matching in link text/surrounding text
    and regex patterns for URL paths.
    Returns a list of candidate link dictionaries.
    """
    candidate_links = []
    processed_hrefs = set() # To avoid adding duplicate URLs

    # Keywords likely to appear in link text or surrounding text for parish directories
    parish_link_keywords = [
        'Churches', 'Directory of Parishes', 'Parishes', 'parishfinder', 'Parish Finder',
        'Find a Parish', 'Locations', 'Our Parishes', 'Parish Listings', 'Find a Church',
        'Church Directory', 'Faith Communities', 'Find Mass Times', 'Our Churches',
        'Search Parishes', 'Parish Map', 'Mass Schedule', 'Sacraments', 'Worship'
    ]
    # Regex patterns for URL paths that often indicate a parish directory
    url_patterns = [
        r'parishes', r'directory', r'locations', r'churches',
        r'parish-finder', r'findachurch', r'parishsearch', r'parishdirectory',
        r'find-a-church', r'church-directory', r'parish-listings', r'parish-map',
        r'mass-times', r'sacraments', r'search', r'worship', r'finder'
    ]

    all_links_tags = soup.find_all('a', href=True) # Find all <a> tags with an href attribute

    for link_tag in all_links_tags:
        href = link_tag['href']
        # Skip empty, anchor, JavaScript, or mailto links
        if not href or href.startswith('#') or href.lower().startswith('javascript:') or href.lower().startswith('mailto:'):
            continue

        abs_href = normalize_url_join(base_url, href) # Resolve relative URLs to absolute with fixed joining
        if not abs_href.startswith('http'): # Ensure it's a web link
            continue
        if abs_href in processed_hrefs: # Avoid re-processing the same URL
            continue

        link_text = link_tag.get_text(strip=True)
        surrounding_text = get_surrounding_text(link_tag)
        parsed_href_path = urlparse(abs_href).path.lower() # Get the path component of the URL

        # Check for matches based on keywords in text or URL patterns
        text_match = any(keyword.lower() in link_text.lower() or keyword.lower() in surrounding_text.lower() for keyword in parish_link_keywords)
        pattern_match = any(re.search(pattern, parsed_href_path, re.IGNORECASE) for pattern in url_patterns)

        if text_match or pattern_match:
            candidate_links.append({
                'text': link_text,
                'href': abs_href,
                'surrounding_text': surrounding_text
            })
            processed_hrefs.add(abs_href)

    return candidate_links

In [67]:
# Cell 7
# GenAI Powered Link Analyzer (for direct page content)

# Define exceptions on which GenAI calls should be retried
RETRYABLE_GENAI_EXCEPTIONS = (
    DeadlineExceeded, ServiceUnavailable, ResourceExhausted,
    InternalServerError, GoogleAPIError
)

@retry(
    stop=stop_after_attempt(3), # Retry up to 3 times
    wait=wait_exponential(multiplier=1, min=2, max=10), # Exponential backoff: 2s, 4s, 8s...
    retry=retry_if_exception_type(RETRYABLE_GENAI_EXCEPTIONS),
    reraise=True # Reraise the last exception if all retries fail
)
def _invoke_genai_model_with_retry(prompt):
    """Internal helper to invoke the GenAI model with retry logic."""
    # print("    Attempting GenAI call...") # Uncomment for debugging retries
    # GENAI_API_KEY is configured in the first cell. If None, this will fail if not mocked.
    # Ensure genai is available if first cell wasn't run, or handle error
    if 'genai' not in globals():
        raise NameError("genai module not available. Ensure User-configurable parameters cell is run.")
    model = genai.GenerativeModel('gemini-1.5-flash') # Or your preferred model
    return model.generate_content(prompt)

def analyze_links_with_genai(candidate_links, diocese_name=None):
    """Analyzes candidate links using GenAI (or mock) to find the best parish directory URL."""
    best_link_found = None
    highest_score = -1

    # --- Mock vs. Live Control for GenAI (Direct Page Analysis) ---
    # Control for this is `use_mock_genai_direct_page` from the User-configurable parameters cell.
    # GENAI_API_KEY is also defined there.
    # Ensure mock if key is not configured, overriding user setting for safety.
    current_use_mock_direct = use_mock_genai_direct_page if ('GENAI_API_KEY' in globals() and GENAI_API_KEY) else True

    if not current_use_mock_direct:
        print(f"Attempting LIVE GenAI analysis for {len(candidate_links)} direct page links for {diocese_name or 'Unknown Diocese'}.")
    # else:
        # print(f"Using MOCKED GenAI analysis for {len(candidate_links)} direct page links for {diocese_name or 'Unknown Diocese'}.")
    # ---

    if current_use_mock_direct:
        mock_keywords = ['parish', 'church', 'directory', 'location', 'finder', 'search', 'map', 'listing', 'sacrament', 'mass', 'worship']
        for link_info in candidate_links:
            current_score = 0
            text_to_check = (link_info['text'] + ' ' + link_info['href'] + ' ' + link_info['surrounding_text']).lower()
            for kw in mock_keywords:
                if kw in text_to_check: current_score += 3
            if diocese_name and diocese_name.lower() in text_to_check: current_score +=1
            current_score = min(current_score, 10) # Cap score at 10
            if current_score >= 7 and current_score > highest_score: # Threshold of 7
                highest_score = current_score
                best_link_found = link_info['href']
        return best_link_found

    # --- Actual GenAI API Call Logic (executes if use_mock is False) ---
    for link_info in candidate_links:
        prompt = f"""Given the following information about a link from the {diocese_name or 'a diocesan'} website:
        Link Text: "{link_info['text']}"
        Link URL: "{link_info['href']}"
        Surrounding Text: "{link_info['surrounding_text']}"
        Does this link likely lead to a parish directory, a list of churches, or a way to find parishes?
        Respond with a confidence score from 0 (not likely) to 10 (very likely) and a brief justification.
        Format as: Score: [score], Justification: [text]"""
        try:
            response = _invoke_genai_model_with_retry(prompt)
            response_text = response.text
            # print(f"    GenAI Raw Response (Direct Link): {response_text}") # For debugging
            score_match = re.search(r"Score: (\d+)", response_text, re.IGNORECASE)
            if score_match:
                score = int(score_match.group(1))
                if score >= 7 and score > highest_score:
                    highest_score = score
                    best_link_found = link_info['href']
            # else: print(f"    Could not parse score from GenAI (Direct Link) for {link_info['href']}: {response_text}")
        except RetryError as e:
            print(f"    GenAI API call (Direct Link) failed after multiple retries for {link_info['href']}: {e}")
        except Exception as e:
            print(f"    Error calling GenAI (Direct Link) for {link_info['href']}: {e}. No score assigned.")
    return best_link_found

In [68]:
# Cell 8
# Search Engine Fallback Functions & GenAI Snippet Analysis

# Ensure 'build' is imported if using live search. It's commented in Cell 1 by default.
from googleapiclient.discovery import build

def is_retryable_http_error(exception):
    """Custom retry condition for HttpError: only retry on 5xx or 429 (rate limit)."""
    if isinstance(exception, HttpError):
        return exception.resp.status >= 500 or exception.resp.status == 429
    return False

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(is_retryable_http_error), # Use custom condition for HttpError
    reraise=True
)
def _invoke_search_api_with_retry(service, query, cx_id):
    """Internal helper to invoke the Google Custom Search API with retry logic."""
    # print(f"    Attempting Search API call for query: {query}") # Uncomment for debugging retries
    return service.cse().list(q=query, cx=cx_id, num=3).execute() # Fetch top 3 results per query

def normalize_mock_url(base_url, path):
    """Properly constructs URLs for mock data, avoiding double slashes."""
    # Ensure base_url doesn't end with slash and path starts with slash
    base_clean = base_url.rstrip('/')
    path_clean = path if path.startswith('/') else '/' + path
    return base_clean + path_clean

def analyze_search_snippet_with_genai(search_results, diocese_name):
    """Analyzes search result snippets using GenAI (or mock) to find the best parish directory URL."""
    best_link_from_snippet = None
    highest_score = -1

    # --- Mock vs. Live Control for GenAI (Snippet Analysis) ---
    # Control for this is `use_mock_genai_snippet` from the User-configurable parameters cell.
    # GENAI_API_KEY is also defined there.
    # Ensure mock if key is not configured, overriding user setting for safety.
    current_use_mock_snippet = use_mock_genai_snippet if ('GENAI_API_KEY' in globals() and GENAI_API_KEY) else True

    if not current_use_mock_snippet:
        print(f"Attempting LIVE GenAI analysis for {len(search_results)} snippets for {diocese_name}.")
    # else:
        # print(f"Using MOCKED GenAI analysis for {len(search_results)} snippets for {diocese_name}.")
    # ---

    if current_use_mock_snippet:
        mock_keywords = ['parish', 'church', 'directory', 'location', 'finder', 'search', 'map', 'listing', 'mass times']
        for result in search_results:
            current_score = 0
            text_to_check = (result.get('title', '') + ' ' + result.get('snippet', '') + ' ' + result.get('link', '')).lower()
            for kw in mock_keywords:
                if kw in text_to_check: current_score += 3
            if diocese_name and diocese_name.lower() in text_to_check: current_score += 1
            current_score = min(current_score, 10)
            if current_score >= 7 and current_score > highest_score: # Threshold of 7
                highest_score = current_score
                best_link_from_snippet = result.get('link')
        return best_link_from_snippet

    # --- Actual GenAI API Call Logic for Snippets (executes if use_mock_genai_for_snippet is False) ---
    for result in search_results:
        title = result.get('title', '')
        snippet = result.get('snippet', '')
        link = result.get('link', '')
        prompt = f"""Given the following search result from {diocese_name}'s website:
        Title: "{title}"
        Snippet: "{snippet}"
        URL: "{link}"
        Does this link likely lead to a parish directory, church locator, or list of churches?
        Respond with a confidence score from 0 (not likely) to 10 (very likely) and a brief justification.
        Format as: Score: [score], Justification: [text]"""
        try:
            # Uses the same _invoke_genai_model_with_retry as direct page analysis
            response = _invoke_genai_model_with_retry(prompt)
            response_text = response.text
            # print(f"    GenAI Raw Response (Snippet): {response_text}") # For debugging
            score_match = re.search(r"Score: (\d+)", response_text, re.IGNORECASE)
            if score_match:
                score = int(score_match.group(1))
                if score >= 7 and score > highest_score:
                    highest_score = score
                    best_link_from_snippet = link
            # else: print(f"    Could not parse score from GenAI (Snippet) for {link}: {response_text}")
        except RetryError as e:
            print(f"    GenAI API call (Snippet) for {link} failed after multiple retries: {e}")
        except Exception as e:
            print(f"    Error calling GenAI for snippet analysis of {link}: {e}")
    return best_link_from_snippet

def search_for_directory_link(diocese_name, diocese_website_url):
    """Uses Google Custom Search (or mock) to find potential directory links, then analyzes snippets."""
    # print(f"Executing search engine fallback for {diocese_name} ({diocese_website_url})") # Verbose

    # --- Mock vs. Live Control for Search Engine ---
    # Control for this is `use_mock_search_engine` from the User-configurable parameters cell.
    # SEARCH_API_KEY and SEARCH_CX are also defined there.
    # Ensure mock if keys are not configured, overriding user setting for safety.
    current_use_mock_search = use_mock_search_engine if ('SEARCH_API_KEY' in globals() and SEARCH_API_KEY and 'SEARCH_CX' in globals() and SEARCH_CX) else True

    if not current_use_mock_search:
        print(f"Attempting LIVE Google Custom Search for {diocese_name}.")
    # else:
        # print(f"Using MOCKED Google Custom Search for {diocese_name}.")
    # ---

    if current_use_mock_search:
        mock_results = [
            {'link': normalize_mock_url(diocese_website_url, '/parishes'), 'title': f"Parishes - {diocese_name}", 'snippet': f"List of parishes in the Diocese of {diocese_name}. Find a parish near you."},
            {'link': normalize_mock_url(diocese_website_url, '/directory'), 'title': f"Directory - {diocese_name}", 'snippet': f"Official directory of churches and schools for {diocese_name}."},
            {'link': normalize_mock_url(diocese_website_url, '/find-a-church'), 'title': f"Find a Church - {diocese_name}", 'snippet': f"Search for a Catholic church in {diocese_name}. Mass times and locations."}
        ]
        # Simulate `site:` search by filtering mock results to the diocese's website
        filtered_mock_results = [res for res in mock_results if res['link'].startswith(diocese_website_url.rstrip('/'))]
        return analyze_search_snippet_with_genai(filtered_mock_results, diocese_name)

    # --- Actual Google Custom Search API Call Logic (executes if use_mock_search is False) ---
    try:
        # `build` is imported at the top of this cell for clarity when live calls are made.
        service = build("customsearch", "v1", developerKey=SEARCH_API_KEY)
        # Construct multiple queries to increase chances of finding the directory
        queries = [
            f"parish directory site:{diocese_website_url}",
            f"list of churches site:{diocese_website_url}",
            f"find a parish site:{diocese_website_url}",
            f"{diocese_name} parish directory" # Broader query without site restriction as a last resort
        ]
        search_results_items = []
        unique_links = set() # To avoid duplicate results from different queries

        for q in queries:
            if len(search_results_items) >= 5: break # Limit total API calls/results
            print(f"    Executing search query: {q}")
            # Use the retry-enabled helper for the API call
            res_items = _invoke_search_api_with_retry(service, q, SEARCH_CX).get('items', [])
            for item in res_items:
                link = item.get('link')
                if link and link not in unique_links:
                    search_results_items.append(item)
                    unique_links.add(link)
            time.sleep(0.2) # Brief pause between queries to be polite to the API

        if not search_results_items:
            print(f"    Search engine returned no results for {diocese_name}.")
            return None

        # Format results for the snippet analyzer
        formatted_results = [{'link': item.get('link'), 'title': item.get('title'), 'snippet': item.get('snippet')} for item in search_results_items]
        return analyze_search_snippet_with_genai(formatted_results, diocese_name)
    except RetryError as e:
        print(f"    Search API call failed after multiple retries for {diocese_name}: {e}")
        return None
    except Exception as e:
        print(f"    Error during search engine call for {diocese_name}: {e}")
        return None

In [69]:
# Cell 9
# Process URLs, Apply Analysis Stages, and Write Results to Database

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((TimeoutException, WebDriverException)),
    reraise=True
)
def get_page_with_retry(driver_instance, url):
    """Wraps driver.get() with retry logic."""
    # print(f"    Attempting to load page: {url}") # Uncomment for debugging retries
    driver_instance.get(url)

if 'dioceses_to_scan' in locals() and dioceses_to_scan:
    conn_db = sqlite3.connect('data.db')
    cursor_db = conn_db.cursor()

    # Define table schema with PRIMARY KEY on diocese_url for INSERT OR REPLACE behavior. Method used: e.g., genai_direct, search_engine_genai
    cursor_db.execute('''CREATE TABLE IF NOT EXISTS DiocesesParishDirectory
                      (diocese_url TEXT PRIMARY KEY,
                       parish_directory_url TEXT,
                       found TEXT,  -- Status: Success, Not Found, Error details
                       found_method TEXT)''')
    conn_db.commit()

    # Check if the 'found_method' column exists and add it if not (for robustness against schema changes)
    try:
        cursor_db.execute("SELECT found_method FROM DiocesesParishDirectory LIMIT 1")
    except sqlite3.OperationalError:
        print("Adding 'found_method' column to DiocesesParishDirectory table.")
        cursor_db.execute("ALTER TABLE DiocesesParishDirectory ADD COLUMN found_method TEXT")
        conn_db.commit()


    driver_instance = setup_driver() # Initialize the WebDriver
    if driver_instance:
        print(f"Processing {len(dioceses_to_scan)} dioceses with Selenium...")
        for diocese_info in dioceses_to_scan:
            current_url = diocese_info['url']
            diocese_name = diocese_info['name']
            print(f"--- Processing: {current_url} ({diocese_name}) ---")

            parish_dir_url_found = None
            status_text = "Not Found" # Default status if no URL is found
            method = "not_found_all_stages" # Default method if all stages fail

            try:
                # Stage 1: Load page with Selenium (with retries)
                get_page_with_retry(driver_instance, current_url)
                time.sleep(0.5) # Brief pause for any JS rendering after page load
                page_source = driver_instance.page_source
                soup = BeautifulSoup(page_source, 'html.parser')

                # Stage 2: Find candidate links from direct page content
                candidate_links = find_candidate_urls(soup, current_url)

                if candidate_links:
                    # Stage 3: Analyze direct page candidates with GenAI (or mock)
                    print(f"    Found {len(candidate_links)} candidates from direct page. Analyzing...") # Verbose
                    parish_dir_url_found = analyze_links_with_genai(candidate_links, diocese_name)
                    if parish_dir_url_found:
                        method = "genai_direct_page_analysis"
                        status_text = "Success"
                    else: print(f"    GenAI (direct page) did not find a suitable URL for {current_url}.") # Verbose
                else: print(f"    No candidate links found by direct page scan for {current_url}.") # Verbose

                # Stage 4: If not found, try search engine fallback
                if not parish_dir_url_found:
                    print(f"    Direct page analysis failed for {current_url}. Trying search engine fallback...") # Verbose
                    parish_dir_url_found = search_for_directory_link(diocese_name, current_url)
                    if parish_dir_url_found:
                        method = "search_engine_snippet_genai"
                        status_text = "Success"
                    else: print(f"    Search engine fallback also failed for {current_url}.") # Verbose

                # Log final result for this diocese
                if parish_dir_url_found:
                     print(f"    Result: Parish Directory URL for {current_url}: {parish_dir_url_found} (Method: {method})")
                else:
                     # Method will be 'not_found_all_stages' if it reached here without finding a URL
                     print(f"    Result: No Parish Directory URL definitively found for {current_url} (Final method: {method})")

                cursor_db.execute("INSERT OR REPLACE INTO DiocesesParishDirectory VALUES (?, ?, ?, ?)",
                               (current_url, parish_dir_url_found, status_text, method))

            except RetryError as e: # Catch retry errors specifically for page load from get_page_with_retry
                error_message = str(e).replace('"', "''")
                print(f"    Result: Page load failed after multiple retries for {current_url}: {error_message[:100]}")
                status_text = f"Error: Page load failed - {error_message[:60]}" # Truncate for DB
                method = "error_page_load_failed"
                cursor_db.execute("INSERT OR REPLACE INTO DiocesesParishDirectory VALUES (?, ?, ?, ?)",
                               (current_url, None, status_text, method))
            except Exception as e: # Catch any other exceptions during processing of a diocese
                error_message = str(e).replace('"', "''")
                print(f"    Result: General error processing {current_url}: {error_message[:100]}")
                status_text = f"Error: {error_message[:100]}" # Truncate for DB
                method = "error_processing_general"
                cursor_db.execute("INSERT OR REPLACE INTO DiocesesParishDirectory VALUES (?, ?, ?, ?)",
                               (current_url, None, status_text, method))
            conn_db.commit() # Commit result for each diocese

        close_driver() # Close WebDriver after processing all dioceses
    else:
        print("Selenium WebDriver not available. Skipping URL processing.")

    if 'conn_db' in locals() and conn_db: # Ensure connection is closed
        conn_db.close()
        print("\nDatabase connection closed after processing.")
else:
    print("No dioceses to scan (dioceses_to_scan is empty or not defined). Ensure Cell 3 ran correctly and data.db is populated.")

Setting up Chrome WebDriver...
WebDriver setup successfully.
Processing 1 dioceses with Selenium...
--- Processing: http://www.catholicdos.org/ (Diocese of Superior) ---
    No candidate links found by direct page scan for http://www.catholicdos.org/.
    Direct page analysis failed for http://www.catholicdos.org/. Trying search engine fallback...
Attempting LIVE Google Custom Search for Diocese of Superior.
    Executing search query: parish directory site:http://www.catholicdos.org/




    Error during search engine call for Diocese of Superior: isinstance() arg 2 must be a type, a tuple of types, or a union
    Search engine fallback also failed for http://www.catholicdos.org/.
    Result: No Parish Directory URL definitively found for http://www.catholicdos.org/ (Final method: not_found_all_stages)
Closing WebDriver...
WebDriver closed.

Database connection closed after processing.


In [70]:
# Cell 10
# Verify the data in the SQLite database

print("--- Verification Cell Output ---")
try:
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    print("\nDisplaying first 5 rows from DiocesesParishDirectory (if any):")
    cursor.execute("SELECT * FROM DiocesesParishDirectory LIMIT 5")
    rows = cursor.fetchall()
    if rows:
        for row in rows:
            print(row)
    else:
        print("No data found in DiocesesParishDirectory table.")

    print("\nDisplaying counts by found_method:")
    cursor.execute("SELECT found_method, COUNT(*) FROM DiocesesParishDirectory GROUP BY found_method")
    rows_count = cursor.fetchall()
    if rows_count:
        for row_count in rows_count:
            print(row_count)
    else:
        print("No data to aggregate by found_method (DiocesesParishDirectory table might be empty).")
except sqlite3.Error as e:
    print(f"Database error during verification: {e}")
finally:
    if 'conn' in locals() and conn:
        conn.close()
    print("\nDatabase connection for verification closed")

--- Verification Cell Output ---

Displaying first 5 rows from DiocesesParishDirectory (if any):
('https://mobarch.org/', None, 'No parish directory URL found for https://mobarch.org/', None)
('http://www.bhmdiocese.org/', None, 'No parish directory URL found for http://www.bhmdiocese.org/', None)
('http://www.aoaj.org', None, 'No parish directory URL found for http://www.aoaj.org', None)
('http://www.cbna.info/', None, 'No parish directory URL found for http://www.cbna.info/', None)
('http://www.eparchyofphoenix.org/', 'http://www.eparchyofphoenix.org/directory-of-parishes', 'Success', None)

Displaying counts by found_method:
(None, 439)
('error_processing_general', 1)
('not_found_all_stages', 4)
('search_engine_snippet_genai', 6)

Database connection for verification closed


In [71]:
# Cell 11
# Commit changes and push to GitHub

# This cell commits the notebook and data.db (if changed) to the GitHub repository.
# It uses proper shell escaping and handles authentication gracefully.

import subprocess
import os

def safe_git_commit_and_push():
    """Safely commit and push changes with proper error handling."""
    try:
        # Get GitHub credentials from Colab secrets (same way as in Cell 2)
        from google.colab import userdata

        try:
            github_username = userdata.get('GitHubUserforUSCCB')
            github_pat = userdata.get('GitHubPATforUSCCB')
        except Exception as e:
            print(f"⚠️  Could not retrieve GitHub credentials from secrets: {e}")
            github_username = None
            github_pat = None

        # Check if we're in a git repository
        result = subprocess.run(['git', 'status'], capture_output=True, text=True)
        if result.returncode != 0:
            print("❌ Not in a git repository or git not available")
            return False

        # Add files
        print("📁 Adding files to git...")
        subprocess.run(['git', 'add', 'data.db', 'Find_Parish_Directory.ipynb'],
                      capture_output=True, text=True, check=True)

        # Create a proper commit message
        commit_msg = """Autorun from Cell 7"""

        # Commit changes
        print("💾 Committing changes...")
        result = subprocess.run(['git', 'commit', '-m', commit_msg],
                              capture_output=True, text=True)

        if result.returncode == 0:
            print("✅ Successfully committed changes")

            # Check if GitHub credentials are available from secrets
            if github_username and github_pat:
                print("🔐 GitHub credentials found in secrets. Attempting to push...")

                # Configure remote URL with credentials
                remote_url = f"https://{github_username}:{github_pat}@github.com/tomknightatl/USCCB.git"
                subprocess.run(['git', 'remote', 'set-url', 'origin', remote_url],
                             capture_output=True, text=True)

                # Push to remote
                push_result = subprocess.run(['git', 'push', 'origin', 'main'],
                                           capture_output=True, text=True)

                if push_result.returncode == 0:
                    print("✅ Successfully pushed to GitHub")
                    return True
                else:
                    print(f"❌ Failed to push to GitHub: {push_result.stderr}")
                    return False
            else:
                print("⚠️  No GitHub credentials found in Colab secrets. Changes committed locally only.")
                print("   To push to GitHub:")
                print("   1. Ensure 'GitHubUserforUSCCB' and 'GitHubPATforUSCCB' are set in Colab Secrets")
                print("   2. Re-run this cell")
                return True
        else:
            if "nothing to commit" in result.stdout:
                print("ℹ️  No changes to commit")
                return True
            else:
                print(f"❌ Failed to commit: {result.stderr}")
                return False

    except subprocess.CalledProcessError as e:
        print(f"❌ Git command failed: {e}")
        return False
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return False

# Execute the safe commit and push
print("🚀 Starting git commit and push process...")
success = safe_git_commit_and_push()

if success:
    print("🎉 Git operations completed successfully!")
else:
    print("⚠️  Git operations completed with issues. Check messages above.")

# Verify final git status
try:
    result = subprocess.run(['git', 'status', '--porcelain'], capture_output=True, text=True)
    if result.stdout.strip():
        print("\n📋 Remaining untracked/modified files:")
        print(result.stdout)
    else:
        print("\n✨ Working directory is clean")
except:
    print("\n❓ Could not check git status")

🚀 Starting git commit and push process...
📁 Adding files to git...
💾 Committing changes...
✅ Successfully committed changes
🔐 GitHub credentials found in secrets. Attempting to push...
✅ Successfully pushed to GitHub
🎉 Git operations completed successfully!

✨ Working directory is clean
