# **Scraping the Expo Hall Demos**
I found [a page with more information about the games being demoed in the expo hall](https://east.paxsite.com/en-us/expo-hall/expo-hall-demos.html). In this notebook, I'll scrape that page too!

# Setup
The cells below set up the rest of the notebook.

I'll start by configuring the kernel that's running this notebook:

In [1]:
# Change the cwd
%cd ..

# Enable the autoreload module
%load_ext autoreload
%autoreload 2

# Load the environment variables
from dotenv import load_dotenv
load_dotenv(override=True)

/Users/thubbard/Documents/personal/programming/pax-pal-2025/experiments


True

Next, I'm going to import the necessary modules:

In [7]:
# General imports
import re
import asyncio

# Third-party imports
import pandas as pd
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
from tqdm.notebook import tqdm

# Project-specific imports
from utils.miscellaneous import get_consistent_hash

# Grabbing Expo Hall Games

In [3]:
async def scrape_expo_hall_demos():
    """
    Scrapes the PAX East Expo Hall Demos page to get a list of all games being showcased.
    Returns a list of dictionaries containing game information.

    If the live page scrape fails, falls back to parsing the local HTML file at data/full-demo-list-html.txt.
    """
    games = []
    content = None
    used_fallback = False

    try:
        async with async_playwright() as p:
            # Launch browser
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()

            # Navigate to the expo hall demos page
            print("Loading the Expo Hall Demos page...")
            await page.goto(
                "https://east.paxsite.com/en-us/expo-hall/expo-hall-demos.html",
                timeout=60000,
            )

            # Wait for the games to load
            print("Waiting for games to load...")
            await page.wait_for_selector(
                ".specials-list.gtartists-art-grid", timeout=60000
            )

            # Give it a little extra time to make sure all games are loaded
            await page.wait_for_timeout(2000)

            # Get the HTML content
            content = await page.content()

            # Close the browser
            await browser.close()
    except Exception as e:
        print(f"Failed to scrape live page: {e}")
        print("Falling back to local HTML file: data/full-demo-list-html.txt")
        with open("data/full-demo-list-html.txt", "r", encoding="utf-8") as f:
            content = f.read()
        used_fallback = True

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(content, "html.parser")

    # Find all game entries
    game_entries = soup.select(".mix.gt-entry")
    print(
        f"Found {len(game_entries)} games{' (from fallback file)' if used_fallback else ''}"
    )

    # Extract information for each game
    for entry in game_entries:
        link = entry.find("a", class_="artist-modal-link")
        if not link:
            continue

        game_id = entry.get("id")
        game_name = entry.get("data-name", "").replace("-", " ")

        # Get the actual title from the element
        title_elem = entry.select_one(".gtSpecial-title")
        title = title_elem.text if title_elem else game_name

        # Get company/exhibitor info
        company_elem = entry.select_one(".gtSpecial-company-booth")
        company = company_elem.text if company_elem else ""

        # Get image URL
        image_div = entry.select_one(".gtSpecial-image")
        image_url = ""
        if image_div:
            style = image_div.get("style", "")
            url_match = re.search(r"url\((.*?)\)", style)
            if url_match:
                # Strip optional quotes around the URL, e.g. url('...')
                image_url = url_match.group(1).strip("'\" ")

        # Get exhibitor ID
        exhibitor_id = link.get("data-exhibitor-id", "")

        games.append(
            {
                "id": game_id,
                "name": title,
                "company": company,
                "image_url": image_url,
                "exhibitor_id": exhibitor_id,
            }
        )

    return games


# Run the scraping function
games_data = await scrape_expo_hall_demos()

# Make the DataFrame
games_data_df = pd.DataFrame(games_data)

# Add a consistent hash of each game's name
games_data_df["paxpal_id"] = games_data_df["name"].apply(get_consistent_hash)

Loading the Expo Hall Demos page...
Waiting for games to load...
Found 151 games
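
The `get_consistent_hash` helper lives in `utils.miscellaneous` and isn't shown in this notebook. As a rough sketch of the idea, assuming it behaves like a deterministic digest of the game name, a stand-in could look like the block below. The function name `stable_name_hash`, the SHA-256 choice, the lowercasing, and the 16-character truncation are all my assumptions for illustration, not the project's actual implementation. The reason for a helper like this is that Python's built-in `hash()` is salted per process for strings, so it can't produce IDs that match across runs.

```python
import hashlib


def stable_name_hash(name: str, length: int = 16) -> str:
    """Hypothetical stand-in for get_consistent_hash (not the project's helper).

    Returns the first `length` hex characters of a SHA-256 digest of the
    normalized name, so the same name maps to the same ID on every run.
    """
    # Assumption: trimming and lowercasing make the ID robust to minor
    # formatting differences between data sources.
    normalized = name.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]
```

With something like this, two differently formatted copies of the same title end up with the same ID, which is what the `paxpal_id` comparisons later in the notebook rely on.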


I'll save it below:

In [9]:
games_data_df.to_json(
    "data/expo_hall_demos.json",
    orient="records",
    indent=4,
)

To reload it later, I can run the cell below:

In [10]:
# Reload the data
games_data_df = pd.read_json("data/expo_hall_demos.json")
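
One thing to watch when reloading: `pd.read_json` infers dtypes by default, so an `id` column that was saved as digit strings may come back as integers. Here's a small sketch of pinning the column type with the `dtype` argument. The toy rows and the temporary file name are made up for illustration; only the `orient="records"` layout mirrors the cells above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the scraped games (values are illustrative)
toy_df = pd.DataFrame(
    [
        {"id": "101", "name": "Example Game A"},
        {"id": "102", "name": "Example Game B"},
    ]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "toy_demos.json"

    # Save in the same records layout used above
    toy_df.to_json(json_path, orient="records", indent=4)

    # Reload, asking pandas to keep the id column as strings
    reloaded_df = pd.read_json(json_path, orient="records", dtype={"id": str})
```

For this notebook the difference is mostly cosmetic, since the ID is interpolated into a selector string either way, but it would matter if an ID ever had leading zeros.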

### **OPTIONAL: Removing Pre-Scraped Games**

In [14]:
remove_prescraped_games = True
if remove_prescraped_games:
    prescraped_games_df = pd.read_json("data/final_enriched_games_data.json")
    games_data_df = games_data_df[
        ~games_data_df["paxpal_id"].isin(prescraped_games_df["id"].unique())
    ]
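
The filter above is an anti-join: it keeps only the rows whose `paxpal_id` does not appear in the already enriched data. A minimal sketch on toy data, with made-up names and IDs, shows the pattern in isolation:

```python
import pandas as pd

# Toy stand-ins for the scraped games and the already enriched IDs
scraped_df = pd.DataFrame(
    {
        "name": ["Example Game A", "Example Game B", "Example Game C"],
        "paxpal_id": ["a1", "b2", "c3"],
    }
)
already_enriched_ids = pd.Series(["b2"])

# isin() marks rows that were already processed; ~ inverts the mask
remaining_df = scraped_df[~scraped_df["paxpal_id"].isin(already_enriched_ids)]
```

Here `remaining_df` holds only the games that still need to be scraped.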



# Grabbing More Details
Next, I'll get more details about each game by opening its modal on the demos page.

In [16]:
async def scrape_single_game_details(game):
    """
    Scrapes detailed information for a single game.

    Args:
        game: Series containing a single game's information

    Returns:
        Dictionary containing the game data with modal information
    """
    game_data = game.to_dict()
    game_id = game["id"]

    try:
        async with async_playwright() as p:
            # Launch browser
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()

            # Navigate to the expo hall demos page
            await page.goto(
                "https://east.paxsite.com/en-us/expo-hall/expo-hall-demos.html"
            )

            # Wait for the games to load
            await page.wait_for_selector(
                ".specials-list.gtartists-art-grid", timeout=10_000
            )

            # Give it extra time to make sure all games are loaded
            await page.wait_for_timeout(10_000)

            # Use attribute selector instead of ID selector to handle numeric IDs
            selector = f"[id='{game_id}'] .artist-modal-link"

            # Wait for the element to be visible and click it
            await page.wait_for_selector(selector, state="visible", timeout=5_000)
            await page.click(selector)

            # Wait for the modal to appear and load content
            await page.wait_for_selector(
                ".modal__container", state="visible", timeout=5_000
            )
            await page.wait_for_timeout(5_000)  # Give modal time to load

            # Extract information from the modal
            modal_content = await page.evaluate(
                """() => {
                const modal = document.querySelector('.modal__container');
                if (!modal) return null;
                
                const title = modal.querySelector('.gtModal-title')?.textContent || '';
                const company = modal.querySelector('.gtModal-company')?.textContent || '';
                const description = modal.querySelector('.gtModal-desc')?.textContent || '';
                
                return {
                    modal_title: title,
                    modal_company: company,
                    modal_description: description
                };
            }"""
            )

            if modal_content:
                # Add the modal information to the game data
                game_data.update(modal_content)

            # Close the browser
            await browser.close()

    except Exception as e:
        print(f"Error processing game {game_id}: {str(e)[:100]}...")
        # Return the game data without modal information

    return game_data


async def scrape_game_details(games_df):
    """
    Scrapes detailed information from the modal for each game in the DataFrame.

    Args:
        games_df: DataFrame containing game information, including each game's element id

    Returns:
        DataFrame with additional details from the modals
    """

    # Process each game individually and collect results
    detailed_games = []

    # Use tqdm for progress tracking
    for _, game in tqdm(
        games_df.iterrows(), total=len(games_df), desc="Scraping game details"
    ):
        game_data = await scrape_single_game_details(game)
        detailed_games.append(game_data)

        # Wait between requests to avoid overloading the server
        await asyncio.sleep(5)

    return pd.DataFrame(detailed_games)

With these functions in hand, I'll run them below:

In [17]:
# Scrape detailed information for each game
detailed_games_df = await scrape_game_details(games_data_df)

Scraping game details:   0%|          | 0/16 [00:00<?, ?it/s]
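
Since `scrape_single_game_details` swallows errors and returns the game without modal fields, a failed scrape just shows up as a row with no `modal_description`. One possible way to handle flaky page loads is a small retry wrapper. This is only a sketch: `scrape_with_retries`, its parameters, and the "retry when `modal_description` is missing" rule are my assumptions, not something the notebook currently does.

```python
import asyncio


async def scrape_with_retries(scrape_one, game, attempts=3, delay_seconds=5.0):
    """Hypothetical helper: retry a scrape until modal details are present.

    scrape_one is any coroutine function that takes a game and returns a
    dict, such as scrape_single_game_details.
    """
    result = {}
    for attempt_number in range(1, attempts + 1):
        result = await scrape_one(game)

        # Treat the scrape as successful once the modal text was captured
        if "modal_description" in result:
            break

        # Pause before the next attempt, but not after the last one
        if attempt_number < attempts:
            await asyncio.sleep(delay_seconds)

    return result
```

Inside the loop in `scrape_game_details`, the call would become something like `await scrape_with_retries(scrape_single_game_details, game)`. The trade-off is a longer total run time when many games fail.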

If prescraped games were removed above, I'll add their previously saved details back in below:

In [24]:
if remove_prescraped_games:
    # Load the details saved by an earlier run and recompute their IDs
    other_detailed_games_df = pd.read_json("data/expo_hall_demos_detailed.json")
    other_detailed_games_df["paxpal_id"] = other_detailed_games_df["name"].apply(
        get_consistent_hash
    )

    # Drop any freshly scraped games that the earlier run already covers
    detailed_games_df = detailed_games_df[
        ~detailed_games_df["paxpal_id"].isin(other_detailed_games_df["paxpal_id"])
    ]

    # Combine the new and previously saved details
    detailed_games_df = pd.concat(
        [detailed_games_df, other_detailed_games_df], ignore_index=True
    )

Below, I'll save the data:

In [26]:
detailed_games_df.dropna(subset=["modal_description"]).to_json(
    "data/expo_hall_demos_detailed.json",
    orient="records",
    indent=4,
)
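
One caveat with the `dropna` call: the modal extraction returns an empty string when a description element is missing, and `dropna` only removes missing values, not empty strings. If I wanted to drop those rows too, a sketch like the one below could work. The toy rows and the `has_text` name are made up for illustration.

```python
import pandas as pd

# Toy data: one real description, one empty string, one missing value
toy_detailed_df = pd.DataFrame(
    {
        "name": ["Example Game A", "Example Game B", "Example Game C"],
        "modal_description": ["A cozy roguelike.", "", None],
    }
)

# Treat missing values and whitespace-only strings the same way
has_text = toy_detailed_df["modal_description"].fillna("").str.strip().ne("")
cleaned_df = toy_detailed_df[has_text]
```

Whether empty descriptions should be dropped or kept depends on how the downstream data is used, so this is an option rather than a fix.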