# Bullhorn Automation Scraper
This notebook automates the extraction of Bullhorn (Herefish) automation details using Selenium.

#### **Features**
* Hibernation Check: Automatically scans your "Hibernated" table first to flag inactive automations in the final report.
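* Progress Backups: Saves results to Excel at a regular interval to prevent data loss. The interval is set by the modulus in Cell 5's periodic-save check (2 in the cell as shown; 50 is a more practical value for long runs).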
* Fault Tolerance:
    * InvalidSessionId Management: Specifically catches session crashes to prevent the notebook from hanging.
    * Auto-Recovery: If a session expires or drops, the script automatically logs back in and resumes exactly where it left off.
    * Anti-Blank Loading: If no content appears after 15 seconds, the script automatically reloads the page until the data is visible.
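
The Anti-Blank Loading behaviour can be sketched independently of Selenium as a poll-then-reload loop. This is a minimal illustration, not the notebook's actual code: `wait_for_state`, `check`, `reload`, and `max_reloads` are hypothetical names, and unlike the notebook (which retries until the data is visible) this sketch gives up after a fixed number of reloads.

```python
import time

def wait_for_state(check, reload, timeout=15, interval=1.0, max_reloads=3,
                   sleep=time.sleep):
    """Poll check() until it returns a non-None state; reload on timeout.

    check, reload, and sleep are injected callables (hypothetical names),
    so the sketch can be exercised without a browser.
    """
    for attempt in range(max_reloads + 1):
        # Poll roughly once per interval until the timeout is used up
        for _ in range(int(timeout / interval)):
            state = check()
            if state is not None:
                return state
            sleep(interval)
        if attempt < max_reloads:
            reload()  # e.g. driver.refresh() in a Selenium session
    return None  # Still blank after max_reloads reloads
```

In a Selenium session, `check` would look for the "not found" popup or the content element, and `reload` would refresh the page. Capping the reloads is a design choice of this sketch to avoid an endless loop on a page that never renders.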

#### **Setup Requirements**
* Credentials: Enter your login details in the USERNAME and PASSWORD variables in Cell 2.
* Browser: Google Chrome must be installed on your computer.
* Speed: Extraction takes approximately 25 minutes per 1,000 automations.
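
The speed figure above translates into a simple run-time estimate. The sketch below assumes throughput scales linearly with the number of IDs; `estimate_run` and `MINUTES_PER_1000` are hypothetical names, and the start time is an arbitrary example value.

```python
from datetime import datetime, timedelta

MINUTES_PER_1000 = 25  # Throughput figure quoted in the setup notes

def estimate_run(num_ids, start=None):
    """Return (duration, expected_end) for a run over num_ids automation IDs."""
    start = start or datetime.now()
    duration = timedelta(minutes=num_ids / 1000 * MINUTES_PER_1000)
    return duration, start + duration

# Example: a short run of 31 IDs starting at an arbitrary time
example_start = datetime(2025, 1, 1, 9, 0, 0)
duration, finish = estimate_run(31, start=example_start)
print(f"Estimated duration: {int(duration.total_seconds())} seconds")
print(f"Expected to finish at: {finish:%I:%M %p on %B %d, %Y}")
```

Actual timing depends on page load speed and on how many IDs turn out not to exist, so the result is only a rough guide.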

#### **Final Output**
The scraper generates an Excel file with two sheets:
* Automations: Detailed data for the requested ID range, including a TRUE/FALSE column for hibernation status.
* Hibernated Automations: A full list of all currently hibernated automations.
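
The TRUE/FALSE hibernation column comes down to a set membership test on the automation name. A minimal sketch with made-up sample records follows; `flag_hibernated` is a hypothetical helper, and stripping whitespace before comparing is an assumption added here, not something the notebook does.

```python
def flag_hibernated(records, hibernated_names):
    """Return copies of records with an Is_Hibernated boolean added."""
    # A set gives O(1) membership checks; strip() guards against stray spaces
    lookup = {name.strip() for name in hibernated_names}
    return [
        {**rec, "Is_Hibernated": rec["Name"].strip() in lookup}
        for rec in records
    ]

# Hypothetical sample data, for illustration only
records = [
    {"Automation_ID": 1, "Name": "Welcome email"},
    {"Automation_ID": 2, "Name": "Re-engagement"},
]
flagged = flag_hibernated(records, ["Re-engagement "])
for row in flagged:
    print(row["Automation_ID"], row["Is_Hibernated"])
```

Because the match is by name rather than by ID, two automations that share a name would both be flagged.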

In [6]:
import pandas as pd
import time
from datetime import datetime, timedelta
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, InvalidSessionIdException, NoSuchElementException

In [None]:
# Credentials and configuration
USERNAME = "YOUR_USERNAME_HERE"
PASSWORD = "YOUR_PASSWORD_HERE"

# XPATHs to find elements
## Automation Credentials XPATHs
EMAIL_XPATH = "/html/body/div/div/div[1]/div/form/div[2]/input"
PASS_XPATH = "/html/body/div/div/div[1]/div/form/div[3]/input"
BTN_XPATH = "/html/body/div/div/div[1]/div/form/div[5]/button"
## Hibernate scraper XPATHs
HIBERNATED_XPATH = "/html/body/div[2]/div[1]/div/div[3]/div/div[3]/div[1]/span[1]"
HIBERNATED_TABLE_XPATH = "/html/body/div[2]/div[1]/div/div[3]/div/div[3]/div[2]/div[1]/table"



In [8]:
# Cell 3: The Login Function
# This function allows us to trigger a login at the start or whenever the session drops.

def login_to_herefish(driver):
    print("Initiating login sequence...")
    driver.get("https://app.herefish.com/")
    time.sleep(5) 
    
    try:
        # Wait for the login form, then enter credentials.
        # If the form never appears, the wait times out and the function returns False.
        email_input = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, EMAIL_XPATH)))
        email_input.send_keys(USERNAME)
        
        password_input = driver.find_element(By.XPATH, PASS_XPATH)
        password_input.send_keys(PASSWORD)
        
        login_button = driver.find_element(By.XPATH, BTN_XPATH)
        login_button.click()
        
        time.sleep(5)  # Allow time for login to process

        # Open the Automations page and confirm the browser has left the bare login URL
        driver.get("https://app.herefish.com/Automations/Automations")
        WebDriverWait(driver, 15).until(EC.url_changes("https://app.herefish.com/"))
        print("Login successful.")
        return True
    except Exception as e:
        print(f"Login failed: {e}")
        return False
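
Since the login function reports success with a boolean, a caller can wrap it in a bounded retry instead of ignoring the result. The following is a sketch under that assumption: `login_with_retries`, `attempts`, and `delay` are hypothetical names, and `login_fn` stands in for a function such as `login_to_herefish`.

```python
import time

def login_with_retries(login_fn, driver, attempts=3, delay=5, sleep=time.sleep):
    """Call login_fn(driver) until it returns True or attempts run out."""
    for attempt in range(1, attempts + 1):
        if login_fn(driver):
            return True
        print(f"Login attempt {attempt}/{attempts} failed.")
        if attempt < attempts:
            sleep(delay)  # Brief pause before trying again
    return False
```

A call such as `login_with_retries(login_to_herefish, driver)` would then let the caller stop early when all attempts fail, rather than scraping with an unauthenticated session.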

In [9]:
# Cell 4: Start a new session, log in, open the Hibernated tab, and extract the Hibernated table

# 1. Initialize WebDriver
driver = webdriver.Chrome()
driver.maximize_window()

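# 2. Use the existing login function to authenticate; stop early if it fails
if not login_to_herefish(driver):
    driver.quit()
    raise RuntimeError("Login failed; check USERNAME and PASSWORD in Cell 2.")

# 3. After login, open the Hibernated tab on the Automations page
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, HIBERNATED_XPATH))).click()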

# Locate the table using the XPATH and extract its HTML
time.sleep(2)                                                           # Allow time for table to fully render
table_element = driver.find_element(By.XPATH, HIBERNATED_TABLE_XPATH)

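# 4. Use pandas to read the HTML table into a DataFrame.
#    Wrapping the markup in StringIO avoids the FutureWarning pandas raises for literal HTML strings.
from io import StringIO
hibernated_df = pd.read_html(StringIO(table_element.get_attribute('outerHTML')))[0]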

driver.quit()

# 5. Build a set from the 'Automation Name' column for O(1) membership checks in the main loop.
#    Note: despite the variable name, the set holds automation names, not IDs.
hibernated_ids_set = set(hibernated_df['Automation Name'].astype(str).tolist())

print("Hibernated Automations extracted successfully.")

Initiating login sequence...
Login successful.


Hibernated Automations extracted successfully.


In [10]:
# Cell 5: The Main Scraper Loop with Automatic Retries
# This version will not skip an ID if it hits a timeout; it will retry until success.

# 1. Initialize WebDriver
driver = webdriver.Chrome()
driver.maximize_window()

# 2. Initial Login
login_to_herefish(driver)

# --- CONFIGURATION ---
POPUP_XPATH = "//div[contains(@class, 'modal-body')]//div[contains(text(), 'not found')]"       # Popup when there isn't any automation
CONTENT_XPATH = "/html/body/div[2]/div[1]/div/div[6]"                                           # If there is content, it appears here


wait = WebDriverWait(driver, 15)
automation_ids = list(range(7841, 7872)) 
scraped_data = []
today = datetime.now().strftime("%Y-%m-%d")

# --- RUN-TIME ESTIMATE ---
# Estimated duration in minutes, based on roughly 25 minutes per 1,000 IDs
num_ids = len(automation_ids)
est_total_minutes = (num_ids / 1000) * 25

# Expected end time
now = datetime.now()
expected_end_time = now + timedelta(minutes=est_total_minutes)

# Split the duration into hours and minutes
hours, minutes = divmod(int(est_total_minutes), 60)

# Print the estimate
print("--- Scraper Estimation ---")
print(f"Total IDs: {num_ids}")

# Show hours and minutes for long runs, minutes and seconds for short ones
if hours > 0:
    print(f"Estimated duration: {hours} hours, {minutes} minutes")
else:
    seconds = int((est_total_minutes % 1) * 60)
    print(f"Estimated duration: {minutes} minutes, {seconds} seconds")

# Expected completion time and date
print(f"Expected to finish at: {expected_end_time.strftime('%I:%M %p on %B %d, %Y')}")
print("--------------------------")


# --- MAIN LOOP WITH RETRIES ---
index = 0
while index < len(automation_ids):
    automation_id = automation_ids[index]
    url = f"https://app.herefish.com/Automations/Automation/{automation_id}"
    
    # --- REDIRECT/SESSION CHECK ---
    if driver.current_url in ["https://app.herefish.com/", "https://app.herefish.com/Login", "https://app.herefish.com/Account/Login"]:
        print(f"Session lost at ID {automation_id}. Re-logging...")
        login_to_herefish(driver)
        # We don't increment index, so it retries this ID after login

    driver.get(url)
    
    time.sleep(1)  # Initial wait for page load

    automation_name = None
    status = None
    content_text = None
    error_note = None
    success = False # Track if we should move to the next ID

    try:
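        # Check for redirect immediately after navigation.
        # Log in again here: the check at the top of the loop only matches exact login URLs,
        # so a login URL with extra parameters would otherwise be retried forever.
        if "Login" in driver.current_url or driver.current_url == "https://app.herefish.com/":
            print(f"ID {automation_id}: Redirected to login. Logging in again and retrying...")
            login_to_herefish(driver)
            continue 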

        # Wait for the URL to reflect the correct ID
        wait.until(EC.url_contains(str(automation_id)))

        found_state = None
        # Poll for either the 'Not Found' popup or the actual content
        for _ in range(15): 
            popups = driver.find_elements(By.XPATH, POPUP_XPATH)
            if popups and popups[0].is_displayed():
                found_state = "popup"
                break
            
            content_elements = driver.find_elements(By.XPATH, CONTENT_XPATH)
            if content_elements and len(content_elements[0].text.strip()) > 0:
                found_state = "content"
                break
            
            time.sleep(1)       # Wait a bit before re-checking

        if found_state == "popup":
            print(f"ID {automation_id}: Skipped: Automation not found or deleted")
            # We found a definitive answer (it doesn't exist), so move to next ID
            success = True 

        elif found_state == "content":
            title_container = driver.find_element(By.CLASS_NAME, "title")
            automation_name = driver.execute_script(
                "return arguments[0].childNodes[0].textContent.trim();", 
                title_container
            )
            status_labels = title_container.find_elements(By.CSS_SELECTOR, "span.label:not(.ng-hide)")
            status = status_labels[0].text.strip() if status_labels else "N/A"
            content_text = driver.find_element(By.XPATH, CONTENT_XPATH).text
            
            print(f"ID {automation_id}: Scraped successfully.")
            success = True # Scrape finished, move to next ID

        else:
            # This is the Timeout / Blank page scenario
            print(f"ID {automation_id}: Timeout/Blank. Retrying same ID...")
            time.sleep(2) # Short breather before retry
            continue # This restarts the loop WITHOUT incrementing index

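    except InvalidSessionIdException:
        # The browser session died: start a fresh driver and log in again
        print("Session crashed. Attempting to restart driver...")
        try:
            driver.quit()
        except Exception:
            pass  # The old session may already be gone
        driver = webdriver.Chrome()
        driver.maximize_window()
        wait = WebDriverWait(driver, 15)  # Rebind the wait to the new driver
        login_to_herefish(driver)
        continue

    except Exception as e:
        # Any other error: retry the same ID without incrementing index
        print(f"ID {automation_id}: Error {type(e).__name__}. Retrying...")
        continue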

    # Only save data and move to the next ID if we actually got a result (Content or Popup)
    if success:
        if found_state == "content":
            # Check if the scraped name exists in our hibernated set
            is_hibernated = automation_name in hibernated_ids_set
            
            scraped_data.append({
                "Automation_ID": automation_id,
                "Name": automation_name,
                "Status": status,
                "Is_Hibernated": is_hibernated,  # New TRUE/FALSE column
                "Content": content_text,
                "Extraction_Notes": "Scraped"
            })

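        # Periodic save after every 2nd scraped record.
        # "Not found" IDs add no rows, so they do not trigger a redundant rewrite of the file.
        if found_state == "content" and len(scraped_data) % 2 == 0: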
            file_path = f"automation_results_{today}.xlsx"
            with pd.ExcelWriter(file_path) as writer:
                # Save current progress to 'Automations' sheet
                pd.DataFrame(scraped_data).to_excel(writer, sheet_name='Automations', index=False)
                
                # Save the hibernated list (from Cell 4) to its own sheet
                hibernated_df.to_excel(writer, sheet_name='Hibernated Automations', index=False)
            

        
        index += 1 # Move to the next ID in the list

# --- FINAL SAVE ---
df = pd.DataFrame(scraped_data)

# Define the filename once
file_path = f"automation_results_{today}.xlsx"

# Write both sheets through a single ExcelWriter so they end up in the same workbook
with pd.ExcelWriter(file_path) as writer:
    # Write the main results
    df.to_excel(writer, sheet_name='Automations', index=False)
    # Write the hibernated list
    hibernated_df.to_excel(writer, sheet_name='Hibernated Automations', index=False)


print(f"\n--- Scraping Complete. Kept {len(df)} records ---")
driver.quit()

Initiating login sequence...
Login successful.
--- Scraper Estimation ---
Total IDs: 31
Estimated duration: 0 minutes, 46 seconds
Expected to finish at: 11:15 AM on December 24, 2025
--------------------------
ID 7841: Skipped: Automation not found or deleted
ID 7842: Scraped successfully.
ID 7843: Skipped: Automation not found or deleted
ID 7844: Skipped: Automation not found or deleted
ID 7845: Scraped successfully.
ID 7846: Scraped successfully.
ID 7847: Scraped successfully.
ID 7848: Skipped: Automation not found or deleted
ID 7849: Skipped: Automation not found or deleted
ID 7850: Skipped: Automation not found or deleted
ID 7851: Skipped: Automation not found or deleted
ID 7852: Skipped: Automation not found or deleted
ID 7853: Skipped: Automation not found or deleted
ID 7854: Skipped: Automation not found or deleted
ID 7855: Skipped: Automation not found or deleted
ID 7856: Skipped: Automation not found or deleted
ID 7857: Skipped: Automation not found or deleted
ID 7858: Skipped