# **Scraping the Exhibitor List**
PAX just dropped [the 2025 exhibitor list](https://east.paxsite.com/en-us/expo-hall.html); in this notebook, I'll scrape it!

# Setup
The cells below set up the rest of the notebook.

I'll start by configuring the kernel that's running this notebook:

In [1]:
# Change the cwd
%cd ..

# Enable the autoreload module
%load_ext autoreload
%autoreload 2

# Load the environment variables
from dotenv import load_dotenv
load_dotenv(override=True)

/Users/thubbard/Documents/personal/programming/pax-pal-2025/experiments


True

Next, I'm going to import the necessary modules:

In [2]:
# General imports
import random
import re
from time import sleep

# Third-party imports
from tqdm import tqdm
import pandas as pd
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

# Scraping the Exhibitor List
The exhibitor list is loaded dynamically via JavaScript after the initial page load. To work around this, I used the Chrome inspector to copy the HTML *after* the list had loaded and saved it to a file; I'll scrape from that saved copy.

In [3]:
# Load the HTML from the saved file
with open("data/full-exhibitor-list-html.txt", "r", encoding="utf-8") as file:
    html_content = file.read()


# Define a function to scrape the exhibitors
def scrape_exhibitors(html_content):
    """
    Extract exhibitor information from the PAX East exhibitor list HTML.

    Args:
        html_content (str): The HTML content of the exhibitor list page

    Returns:
        exhibitor_list (List[dict]): A list of dictionaries containing exhibitor information
    """
    soup = BeautifulSoup(html_content, "html.parser")
    exhibitors = []

    # Find all exhibitor entries
    exhibitor_entries = soup.find_all("div", class_="exhibitor-entry")

    for entry in exhibitor_entries:
        exhibitor = {}

        # Get image URL
        img_tag = entry.find("img")
        exhibitor["image_url"] = img_tag["src"] if img_tag else None

        # Get name (guard against a missing name container)
        name_div = entry.find("div", class_="exhibitor-name")
        name_tag = name_div.find("a") if name_div else None
        exhibitor["name"] = name_tag.text.strip() if name_tag else None

        # Get description
        desc_tag = entry.find("div", class_="exhibitor-description")
        exhibitor["description_excerpt"] = desc_tag.text.strip() if desc_tag else None

        # Get booths
        location_tag = entry.find("div", class_="exhibitor-location")
        booth_text = location_tag.text if location_tag else ""
        # Extract booth numbers using regex
        booth_numbers = re.findall(r"\d+", booth_text)
        exhibitor["booths"] = booth_numbers

        # Get details page link
        link_tag = entry.find("a", class_="gtExhibitorLink")
        if link_tag:
            # Prepend the domain when the href doesn't already include it
            href = link_tag["href"]
            if href.startswith("https://east.paxsite.com"):
                exhibitor["details_url"] = href
            else:
                exhibitor["details_url"] = f"https://east.paxsite.com{href}"
        else:
            exhibitor["details_url"] = None

        exhibitors.append(exhibitor)

    return exhibitors


# Call the function to scrape exhibitors
exhibitors = scrape_exhibitors(html_content)

# Convert the list of exhibitors to a DataFrame
exhibitor_df = pd.DataFrame(exhibitors)

# Add a column for the number of booths
exhibitor_df["n_booths"] = exhibitor_df["booths"].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)
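
The booth and URL handling inside `scrape_exhibitors` can be tried out on their own. The sketch below is a minimal, standard-library illustration: the helper names `parse_booths` and `absolutize` and the sample strings are hypothetical, and `urljoin` is swapped in as an alternative to the `startswith` check used above.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://east.paxsite.com"


def parse_booths(booth_text):
    """Return every run of digits in the booth text as a string."""
    return re.findall(r"\d+", booth_text)


def absolutize(href, base_url=BASE_URL):
    """Resolve a relative href against the site root; absolute URLs pass through."""
    return urljoin(base_url, href)


# Hypothetical sample inputs, not taken from the real page
print(parse_booths("Booth: 19087, 20086"))  # ['19087', '20086']
print(absolutize("/en-us/expo-hall/showroom.html"))
print(absolutize("https://east.paxsite.com/en-us/expo-hall.html"))
```

One difference from the original check: `urljoin` also handles an href without a leading slash, which plain string concatenation would glue directly onto the domain.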

What does this scraped data look like?

In [4]:
exhibitor_df.head(5)

| | image_url | name | description_excerpt | booths | details_url | n_booths |
|---|---|---|---|---|---|---|
| 0 | https://conv-prod-app.s3.amazonaws.com/media/s... | 9th Level Games | 9th Level Games is an indie tabletop roleplayi... | [10105] | https://east.paxsite.com/en-us/expo-hall/showr... | 1 |
| 1 | https://conv-prod-app.s3.amazonaws.com/media/s... | Alienware | | [12019] | https://east.paxsite.com/en-us/expo-hall/showr... | 1 |
| 2 | https://conv-prod-app.s3.amazonaws.com/media/s... | Bandai Namco | ELDEN RING NIGHTREIGN is coming to PAX EAST! J... | [16055] | https://east.paxsite.com/en-us/expo-hall/showr... | 1 |
| 3 | https://conv-prod-app.s3.amazonaws.com/media/s... | Crimson Desert | Experience the intense and brutal combat of Cr... | [16019] | https://east.paxsite.com/en-us/expo-hall/showr... | 1 |
| 4 | https://conv-prod-app.s3.amazonaws.com/media/s... | Devolver Digital | Purveyors of fine digital entertainment wares ... | [18084] | https://east.paxsite.com/en-us/expo-hall/showr... | 1 |


How many exhibitors are there?

In [5]:
len(exhibitor_df)

342

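I already have details saved for some exhibitors, so I'll load that file and drop any exhibitor whose name already appears in it:

In [6]:
# Load the previously saved exhibitor details
old_exhibitor_details_df = pd.read_json("data/exhibitor_details.json")

In [9]:
# Keep only the exhibitors that don't have saved details yet
exhibitor_df = exhibitor_df[
    ~exhibitor_df["name"].isin(old_exhibitor_details_df["name"].unique())
]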

# Scraping Exhibitor Details
Next up: I'll scrape additional details for each exhibitor on the list.

The function below uses a headless Playwright browser to scrape an exhibitor's details page:

In [10]:
async def scrape_exhibitor_details(url):
    from tenacity import (
        retry,
        stop_after_attempt,
        wait_exponential,
        retry_if_exception_type,
    )

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        # Retry on any exception (timeouts included)
        retry=retry_if_exception_type(Exception),
        reraise=True,
    )
    async def _scrape_with_retry(page, url):
        # Navigate to the URL, allowing up to 30 seconds for the network to go idle
        await page.goto(url, timeout=30000, wait_until="networkidle")

        # Wait up to 20 seconds for the main content to appear
        await page.wait_for_selector(".innovation__base", timeout=20000)

        # Pause for 1-2 seconds to give JavaScript time to finish executing
        await page.wait_for_timeout(random.uniform(1000, 2000))

        # Create a result dictionary with default values
        result = {
            "details_url": url,
            "booth": None,
            "name": None,
            "description": None,
            "website_url": None,
            "shop_url": None,
            "playable_games": [],
        }

        # Extracting the booth number
        booth_element = await page.query_selector("h3.showroomLocation")
        if booth_element:
            booth_text = await booth_element.inner_text()
            # Extract just the number part after "Booth:"
            if ":" in booth_text:
                result["booth"] = booth_text.split(":", 1)[1].strip()

        # Extracting the name from the logo alt text
        logo_element = await page.query_selector("#showroomLogo img")
        if logo_element:
            result["name"] = await logo_element.get_attribute("alt")
        # Fallback to data attribute if available
        if not result["name"]:
            logo_link = await page.query_selector(".showroom-logo-modal")
            if logo_link:
                result["name"] = await logo_link.get_attribute("data-exhib-full-name")

        # Extracting the description
        desc_element = await page.query_selector("div#showroomDesc p")
        if desc_element:
            result["description"] = await desc_element.inner_text()

        # Extracting the website URL
        website_element = await page.query_selector("li.website a")
        if website_element:
            result["website_url"] = await website_element.get_attribute("href")

        # Extracting the shop URL
        shop_element = await page.query_selector("li.store a")
        if shop_element:
            result["shop_url"] = await shop_element.get_attribute("href")

        # Extracting playable games
        game_elements = await page.query_selector_all(
            "ul.singleSpecialList li.singleSpecial"
        )
        for game_element in game_elements:
            game_data = {}

            title_element = await game_element.query_selector(
                "div.gtSlideSpecial-title"
            )
            if title_element:
                game_data["name"] = await title_element.inner_text()
            else:
                game_data["name"] = None

            image_element = await game_element.query_selector(
                "div.gtSlideSpecial-image"
            )
            if image_element:
                image_style = await image_element.get_attribute("style")
                if image_style and "url(" in image_style:
                    url_part = image_style.split("url(")[1].split(")")[0]
                    game_data["image_url"] = url_part.strip("'\"")
                else:
                    game_data["image_url"] = None
            else:
                game_data["image_url"] = None

            link_element = await game_element.query_selector("a")
            if link_element:
                game_data["merch_link"] = await link_element.get_attribute(
                    "data-backupurl"
                )
            else:
                game_data["merch_link"] = None

            result["playable_games"].append(game_data)

        return result

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        try:
            result = await _scrape_with_retry(page, url)
        except Exception:
            # If all retries fail, return empty result with URL
            result = {
                "details_url": url,
                "booth": None,
                "name": None,
                "description": None,
                "website_url": None,
                "shop_url": None,
                "playable_games": [],
            }

        await browser.close()
        return result
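
To make the retry behavior easier to picture, here is a minimal standard-library sketch of retrying an async call with exponential backoff. It is not tenacity's implementation: `retry_async`, `flaky`, and the delay values are hypothetical, chosen only to illustrate the idea behind `stop_after_attempt`, `wait_exponential`, and `reraise=True`.

```python
import asyncio


async def retry_async(make_call, attempts=3, base_delay=0.01, max_delay=0.1):
    """Await make_call() up to `attempts` times, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            # Out of attempts: re-raise the last error
            if attempt == attempts:
                raise
            # Wait before the next attempt, capped at max_delay
            await asyncio.sleep(min(delay, max_delay))
            delay *= 2


calls = {"count": 0}


async def flaky():
    """Hypothetical stand-in for a page load that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated timeout")
    return "ok"


result = asyncio.run(retry_async(flaky))
print(result, calls["count"])
```

Inside a notebook, where an event loop is already running, `asyncio.run(...)` would need to be replaced with a plain `await retry_async(flaky)`.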

For each of the exhibitors I've found, I'll scrape their details:

In [11]:
import asyncio

exhibitor_details_list = []
for row in tqdm(
    iterable=list(exhibitor_df.itertuples()),
    desc="Scraping exhibitor details",
):
    # Skip if the URL is None
    if row.details_url is None:
        continue

    # Otherwise, scrape the details
    exhibitor_details_list.append(await scrape_exhibitor_details(row.details_url))

    # Pause for a random 3-8 seconds between requests without blocking the event loop
    await asyncio.sleep(random.uniform(3, 8))

# Make a DataFrame from the list of dictionaries
exhibitor_details_df = pd.DataFrame(exhibitor_details_list)

# Add a column for the number of playable games
exhibitor_details_df["n_playable_games"] = exhibitor_details_df["playable_games"].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

Scraping exhibitor details: 100%|██████████| 21/21 [11:21<00:00, 32.47s/it]


What do the scraped details look like?

In [12]:
exhibitor_details_df.sample(5)

| | details_url | booth | name | description | website_url | shop_url | playable_games | n_playable_games |
|---|---|---|---|---|---|---|---|---|
| 6 | https://east.paxsite.com/en-us/expo-hall/showr... | 19087 | Lisa Bach & Charisse Ann de Leon from SwissGames | Taking inspiration from the classic Tamagotchi... | https://obleak.games/ | https://obleakgames.itch.io/in-full-bloom | [{'name': 'In Full Bloom', 'image_url': 'https... | 1 |
| 16 | https://east.paxsite.com/en-us/expo-hall/showr... | 19087, 20086 | SwissGames | SwissGames : exporting innovation and fun Worl... | https://ch.linkedin.com/showcase/pro-helvetia-... | | [] | 0 |
| 14 | https://east.paxsite.com/en-us/expo-hall/showr... | 16087 | | | | | [] | 0 |
| 18 | https://east.paxsite.com/en-us/expo-hall/showr... | 22090 | | The Shark Logic Foundation is a non-profit foc... | https://pencil-xr.com/ | https://www.meta.com/en-gb/experiences/pencil-... | [] | 0 |
| 17 | https://east.paxsite.com/en-us/expo-hall/showr... | 22021 | The Gaming Nexus Studios Ltd | TGN Studios was founded in 2022, and since the... | http://www.summonersnexus.com | | [] | 0 |


# Saving Data
Finally, I'll save the scraped data as JSON for now.

In [23]:
import json

# Save the exhibitor list to a JSON file
with open("data/exhibitor_list.json", "w", encoding="utf-8") as f:
    json_str = json.dumps(
        exhibitor_df.to_dict(orient="records"), indent=2, ensure_ascii=False
    )
    # json.dumps leaves forward slashes unescaped, so no post-processing is needed
    f.write(json_str)

# Save the exhibitor details to a JSON file
with open("data/exhibitor_details.json", "w", encoding="utf-8") as f:
    json_str = json.dumps(
        exhibitor_details_df.to_dict(orient="records"), indent=2, ensure_ascii=False
    )
    # json.dumps leaves forward slashes unescaped, so no post-processing is needed
    f.write(json_str)
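
One thing to watch: `data/exhibitor_details.json` is also the file read into `old_exhibitor_details_df` earlier, so, as the cells are shown here, writing only the newly scraped rows would replace the earlier ones. If that isn't intended, the two sets of records could be merged before saving. The sketch below is one hedged way to do that in plain Python; `merge_details` and the sample records are hypothetical, and keying on `details_url` is an assumption about what uniquely identifies an exhibitor.

```python
def merge_details(old_records, new_records, key="details_url"):
    """Combine two lists of record dicts, letting newer records win on key clashes."""
    merged = {record[key]: record for record in old_records}
    for record in new_records:
        merged[record[key]] = record
    return list(merged.values())


# Hypothetical sample records
old = [
    {"details_url": "https://example.com/a", "name": "A (old)"},
    {"details_url": "https://example.com/b", "name": "B"},
]
new = [
    {"details_url": "https://example.com/a", "name": "A (new)"},
    {"details_url": "https://example.com/c", "name": "C"},
]

combined = merge_details(old, new)
print([record["name"] for record in combined])  # ['A (new)', 'B', 'C']
```

With the notebook's DataFrames, the inputs would be `old_exhibitor_details_df.to_dict(orient="records")` and `exhibitor_details_df.to_dict(orient="records")`.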