# **Enriching Data**

After the _very_ messy Notebook 4, I've got a list of games. In this notebook, I'll use the new web search tool in OpenAI's Responses API to gather additional information about each one.


# Setup

The cells below will help to set up the rest of the notebook.

I'll start by configuring the kernel that's running this notebook:


In [1]:
# Change the cwd
%cd ..

# Enable the autoreload module
%load_ext autoreload
%autoreload 2

# Load the environment variables
from dotenv import load_dotenv
load_dotenv(override=True)

/Users/thubbard/Documents/personal/programming/pax-pal-2025


True

Next, I'm going to import the necessary modules:


In [50]:
# General imports
import time
from typing import List, Optional

# Third-party imports
import requests
import pandas as pd
from pydantic import BaseModel, Field
from openai import OpenAI
from tqdm import tqdm
from bs4 import BeautifulSoup
from Levenshtein import ratio as lev_ratio

# Set up the OpenAI client
openai_client = OpenAI()

# Loading Data

Below, I'm going to load in all of the data:


In [3]:
# Read in the data
playable_games_df = pd.read_json("data/unified_games_data.json")

# Drop all of the games that're missing a booth_number and/or exhibitor_name
playable_games_df = playable_games_df[
    playable_games_df["booth_number"].notna()
    & playable_games_df["exhibitor_name"].notna()
]

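# Add an ID to each game by hashing its title down to an (at most) 8-digit integer.
# NOTE: this is not a UUID, and Python salts str hashes per process, so these IDs
# only stay the same within one kernel session unless PYTHONHASHSEED is fixed.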
playable_games_df["game_id"] = playable_games_df["title"].apply(
    lambda x: hash(x) % (10**8)
)
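
Since the built-in `hash()` gives different values for the same title after a kernel restart, here's a minimal sketch of a deterministic alternative. It assumes SHA-256 from `hashlib` is acceptable for this purpose; `stable_game_id` is a hypothetical helper, not something defined elsewhere in this notebook, and it would produce different IDs from the ones shown in the outputs below.

```python
import hashlib


def stable_game_id(title: str, digits: int = 8) -> int:
    """Map a title to an integer ID that is the same in every Python process."""
    # SHA-256 is not salted per process, unlike the built-in hash() for str
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    # Reduce the 256-bit digest to at most `digits` decimal digits
    return int(digest, 16) % (10**digits)
```

Swapping it in would look like `playable_games_df["title"].apply(stable_game_id)`. Collisions are still possible with only eight digits, so a uniqueness check on the resulting column is worth adding.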

What does this look like?


In [4]:
playable_games_df.sample(3)

| | title | booth_number | description_texts | genres | developer | header_image_url | exhibitor_name | exhibitor_description | game_id |
|---|---|---|---|---|---|---|---|---|---|
| 101 | Cornucopia | 18096.0 | [{'source': 'pax_app', 'text': 'Cornucopia is ... | [RPG, Simulation, Adventure, Indie] | Subconscious Games | https://s3.amazonaws.com/app.growtix.com/media... | Subconscious Games | Cornucopia: A nostalgic 2.5D farm sim with 74 ... | 95492714 |
| 38 | As I Began to Dream | 21097.0 | [{'source': 'pax_app', 'text': 'As I Began to ... | [Platform, Puzzle, Adventure, Indie] | Soft Source | https://s3.amazonaws.com/app.growtix.com/media... | Soft Source | Soft Source | 22237346 |
| 74 | The Transylvania Adventure of Simon Quest | 15031.0 | [{'source': 'pax_app', 'text': 'The Transylvan... | [Platform, Adventure, Indie] | Retroware | https://s3.amazonaws.com/app.growtix.com/media... | Retroware | Retroware | 59156606 |


How many are there?


In [5]:
len(playable_games_df)

163

# Searching the Web with OpenAI

Next up: I'm going to try and use OpenAI's Responses API to search the web for more information about each game.

I'll start by defining the developer prompt and output formats:


In [6]:
developer_prompt = """# Role
You're a digital assistant helping to identify information about games. 

# Task
Search the Internet for information about games provided by users. Synthesize the game's name, description, genres, release date, platforms, and Steam page link.

Return information about the search process, including visited URLs and a summary of your findings.

# Guidelines
- You MUST use the web search tool to find information about the game. 
- ONLY identify information about the specific game provided. 
- Use authoritative sources (e.g., Wikipedia, Steam, the game's official website) to gather information; include reviews / preview articles to understand more about the game.
- Return None for the `game_info` field if you can't identify the specific game (or if you're unsure about the identification).
- Some of these games may be in-development, so information could be limited; try your best!
"""


class GameInfo(BaseModel):
    game_name: str = Field(..., description="The name of the game.")
    released: bool = Field(
        ..., description="Whether the game has been released (True) or not (False)."
    )
    release_time: Optional[str] = Field(
        ...,
        description="A string describing the release date, with as much precision as is available.",
    )
    description: str = Field(
        ...,
        description="A paragraph-long description summarizing gameplay, story, aesthetics, and unique features, written in Wikipedia style.",
    )
    genres: Optional[List[str]] = Field(..., description="A list of genres")
    snappy_summary: Optional[str] = Field(
        None,
        description="A short, tagline-like summary (max 10 words) highlighting genre and unique appeal.",
    )
    platforms: Optional[List[str]] = Field(
        None,
        description="Platforms available: PlayStation, Xbox, Nintendo Switch, PC, Mobile, or Tabletop",
    )
    steam_link: Optional[str] = Field(
        None, description="Direct URL to the game's Steam page."
    )


class GameSearchResults(BaseModel):
    web_search_summary: str = Field(
        ..., description="Summary of whether platform and Steam info were found."
    )
    web_search_results: List[List[str]] = Field(
        ...,
        description="A list of all of the websites visited, where each tuple contains a webpage title and URL.",
    )
    correctly_identified_game: bool = Field(
        ..., description="Whether the game was correctly identified."
    )
    game_info: Optional[GameInfo] = Field(
        None,
        description="Found game info, if `correctly_identified_game` is True. Otherwise, None.",
    )

Next, I'll iterate through each row and ask the model to search for information about that game:


In [7]:
game_search_results_list = []
for row in tqdm(
    iterable=list(playable_games_df.itertuples()),
    desc="Searching for game info",
):

    # Grab the proper row description
    game_descriptions = row.description_texts
    game_descriptions_dict = {
        desc.get("source"): desc.get("text") for desc in game_descriptions
    }
    game_description = (
        game_descriptions_dict.get("pax_website")
        if "pax_website" in game_descriptions_dict
        else (
            game_descriptions_dict.get("pax_app")
            if "pax_app" in game_descriptions_dict
            else "Game Description Not Found"
        )
    )

    user_prompt = f"""# **{row.title}**
*{game_description}*

**Developer:** {row.developer if pd.notna(row.developer) else "Unknown"}

**Exhibitor:** {row.exhibitor_name if pd.notna(row.exhibitor_name) else "Unknown"}

**Exhibitor Description:** {row.exhibitor_description if pd.notna(row.exhibitor_description) else "Unknown"}
"""

    try:
        response = openai_client.responses.parse(
            model="gpt-4.1",
            tools=[{"type": "web_search_preview", "search_context_size": "high"}],
            input=[
                {
                    "role": "developer",
                    "content": developer_prompt,
                },
                {"role": "user", "content": user_prompt},
            ],
            text_format=GameSearchResults,
            tool_choice={"type": "web_search_preview"},
        )

        # Add a response
        cur_row_dict = {
            "game_id": row.game_id,
            "web_search_summary": response.output_parsed.web_search_summary,
            "web_search_results": response.output_parsed.web_search_results,
            "correctly_identified_game": response.output_parsed.correctly_identified_game,
        } | (
            response.output_parsed.game_info.model_dump()
            if response.output_parsed.game_info
            else {}
        )

        game_search_results_list.append(cur_row_dict)

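    except Exception as e:

        print(e)

        # Record a placeholder so that failed games still appear in the results
        cur_row_dict = {
            "game_id": row.game_id,
            "web_search_summary": None,
            "web_search_results": None,
            "correctly_identified_game": False,
        }
        game_search_results_list.append(cur_row_dict)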

# Make a DataFrame from the results
game_search_results_df = pd.DataFrame.from_records(game_search_results_list)

Searching for game info:   3%|▎         | 5/163 [00:43<25:33,  9.70s/it]

1 validation error for GameSearchResults
  Invalid JSON: invalid escape at line 1 column 3660 [type=json_invalid, input_value='{"web_search_summary":"I...1f\\u001f\\u001f\\u001 ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid


Searching for game info:  14%|█▍        | 23/163 [02:50<20:27,  8.77s/it]

1 validation error for GameSearchResults
  Invalid JSON: invalid escape at line 1 column 4741 [type=json_invalid, input_value='{"web_search_summary": "...u000123\\u000123\\u000 ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid


Searching for game info:  26%|██▌       | 42/163 [05:05<18:15,  9.05s/it]

1 validation error for GameSearchResults
  Invalid JSON: EOF while parsing a string at line 1 column 3597 [type=json_invalid, input_value='{"web_search_summary": "...u007f\\u007f\\u007f\\u ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid


Searching for game info:  74%|███████▍  | 121/163 [14:00<05:58,  8.53s/it]

1 validation error for GameSearchResults
  Invalid JSON: EOF while parsing a string at line 1 column 3840 [type=json_invalid, input_value='{"web_search_summary": "...f\\u007f\\u007f\\u007f ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid


Searching for game info: 100%|██████████| 163/163 [18:31<00:00,  6.82s/it]
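
All four failures above are malformed or truncated JSON in the model's output rather than network errors. One option I could try is simply retrying the call. Below is a minimal sketch of that idea; `call_with_retries` is a hypothetical helper, and whether a retry actually produces valid JSON the second time is an assumption I haven't tested.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0
) -> Optional[T]:
    """Call fn until it succeeds or max_attempts is used up; return None on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose, mirroring the loop above
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return None
```

In the loop above, the API call could be wrapped as `call_with_retries(lambda: openai_client.responses.parse(...))`, treating a `None` result the same way as the `except` branch does.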


Next, I'll save this data:


In [8]:
# Save the game_search_results_df to a JSON file
game_search_results_df.to_json(
    "data/game_search_results.json", orient="records", lines=True
)
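
Because the file is written with `orient="records", lines=True`, it ends up as JSON Lines: one JSON object per line rather than a single JSON array. Reading it back with pandas therefore needs `pd.read_json(path, lines=True)`. For reading it without pandas, here's a small sketch; `parse_json_lines` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_json_lines(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at the end of the file
            yield json.loads(line)


# Usage with the file saved above (any open text file works as the iterable):
# with open("data/game_search_results.json", encoding="utf-8") as handle:
#     records = list(parse_json_lines(handle))
```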

# Verifying Steam Links

The OpenAI web search seems to be working well enough, but it's not perfect. Below, I'll verify each of the Steam links that were identified, and scrape more information from them.


In [13]:
def extract_steam_game_info(html_content: str) -> dict:
    """
    Extracts game information from Steam game page HTML.

    Args:
        html_content: The HTML content of the Steam game page as a string.

    Returns:
        A dictionary containing the extracted information:
        - name: Name of the game (str)
        - developer: Name of the game's developer (str)
        - image_address: URL of the header image (str)
        - description: Game description (str)
        - tags: The first three popular user-defined tags (List[str])
        - video_link: URL of the first game trailer video (str)
                       or None if no video is found.
        - image_links: List of screenshot image URLs (List[str])
    """
    soup = BeautifulSoup(html_content, "html.parser")

    game_info = {
        "name": None,
        "developer": None,
        "image_address": None,
        "description": None,
        "tags": [],
        "video_link": None,
        "image_links": [],
    }

    # --- Extract Name ---
    name_tag = soup.find("div", class_="apphub_AppName")
    if name_tag:
        game_info["name"] = name_tag.get_text(strip=True)
    else:
        # Fallback for potential structure changes
        name_title_tag = soup.find("title")
        if name_title_tag:
            # Extract from title like "Super Cucumber on Steam"
            title_text = name_title_tag.get_text(strip=True)
            if " on Steam" in title_text:
                game_info["name"] = title_text.replace(" on Steam", "").strip()

    # --- Extract Developer ---
    developer_div = soup.find("div", id="developers_list")
    if developer_div:
        developer_link = developer_div.find("a")
        if developer_link:
            game_info["developer"] = developer_link.get_text(strip=True)
    else:
        # Fallback: Look for the grid structure
        dev_label = soup.find("div", class_="grid_label", string="Developer")
        if dev_label:
            dev_content = dev_label.find_next_sibling("div", class_="grid_content")
            if dev_content:
                dev_link = dev_content.find("a")
                if dev_link:
                    game_info["developer"] = dev_link.get_text(strip=True)

    # --- Extract Image Address ---
    # Look for the main header image first
    img_tag = soup.find("img", class_="game_header_image_full")
    if img_tag and img_tag.get("src"):
        game_info["image_address"] = img_tag["src"]
    else:
        # Fallback: Look for Open Graph image meta tag
        og_image_tag = soup.find("meta", property="og:image")
        if og_image_tag and og_image_tag.get("content"):
            game_info["image_address"] = og_image_tag["content"]

    # --- Extract Description ---
    # Look for the meta description tag
    desc_meta_tag = soup.find("meta", {"name": "Description"})
    if desc_meta_tag and desc_meta_tag.get("content"):
        game_info["description"] = desc_meta_tag["content"].strip()
    else:
        # Fallback: Look for Open Graph description meta tag
        og_desc_tag = soup.find("meta", property="og:description")
        if og_desc_tag and og_desc_tag.get("content"):
            game_info["description"] = og_desc_tag["content"].strip()
        else:
            # Fallback: Look for the short glance description snippet
            snippet_tag = soup.find("div", class_="game_description_snippet")
            if snippet_tag:
                game_info["description"] = snippet_tag.get_text(strip=True)

    # --- Extract Tags ---
    tags_container = soup.find("div", class_="glance_tags popular_tags")
    if tags_container:
        tag_elements = tags_container.find_all("a", class_="app_tag")
        # Keep only the first three tags
        game_info["tags"] = [tag.get_text(strip=True) for tag in tag_elements[:3]]

    # --- Extract Video Link ---
    # Find the first movie element
    video_element = soup.find("div", class_="highlight_movie")
    if video_element:
        # Prefer webm, then mp4. Prefer HD if available.
        # Check for webm HD first
        webm_hd_src = video_element.get("data-webm-hd-source")
        if webm_hd_src:
            game_info["video_link"] = webm_hd_src
        else:
            # Then check for mp4 HD
            mp4_hd_src = video_element.get("data-mp4-hd-source")
            if mp4_hd_src:
                game_info["video_link"] = mp4_hd_src
            else:
                # Then check for regular webm
                webm_src = video_element.get("data-webm-source")
                if webm_src:
                    game_info["video_link"] = webm_src
                else:
                    # Finally check for regular mp4
                    mp4_src = video_element.get("data-mp4-source")
                    if mp4_src:
                        game_info["video_link"] = mp4_src

    # --- Extract Image Links ---
    screenshot_elements = soup.find_all("a", class_="highlight_screenshot_link")
    for element in screenshot_elements:
        if element and element.get("href"):
            game_info["image_links"].append(element["href"])

    return game_info
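
The nested `if`/`else` chain for the video link is really just an ordered preference list: HD WebM, then HD MP4, then regular WebM, then regular MP4. Here's a sketch of the same selection written as a loop. `pick_video_source` and `VIDEO_SOURCE_ATTRS` are hypothetical names, and the helper accepts anything with a `.get` method (a plain dict here, or the BeautifulSoup tag used above).

```python
from typing import Optional

# Attribute names in order of preference, matching the chain in the function above
VIDEO_SOURCE_ATTRS = (
    "data-webm-hd-source",
    "data-mp4-hd-source",
    "data-webm-source",
    "data-mp4-source",
)


def pick_video_source(element) -> Optional[str]:
    """Return the first non-empty video source attribute, or None if there is none."""
    for attr_name in VIDEO_SOURCE_ATTRS:
        source = element.get(attr_name)
        if source:
            return source
    return None
```

With this, the video section of the function would reduce to a single call on `video_element` when it exists.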

With this function in hand, I can iterate through each game, fetch its Steam page, and extract the details:


In [None]:
steam_details_df_records = []
for row in tqdm(
    iterable=list(game_search_results_df.itertuples()),
    desc="Searching for Steam game info",
):
    if pd.isna(row.steam_link):
        steam_details_df_records.append(
            {
                "game_id": row.game_id,
            }
        )
        continue

    try:
        # Make a request to the Steam page
        response = requests.get(row.steam_link, timeout=10)
        if response.status_code == 200:
            # Extract game information
            game_info = extract_steam_game_info(response.text)
            game_info["game_id"] = row.game_id
            steam_details_df_records.append(game_info)

            # Sleep for a bit
            time.sleep(3)
        else:
            print(
                f"Failed to fetch Steam page for game ID {row.game_id}: {response.status_code}"
            )
            steam_details_df_records.append(
                {
                    "game_id": row.game_id,
                }
            )
    except requests.exceptions.RequestException as e:
        print(f"Request failed for game ID {row.game_id}: {e}")
        steam_details_df_records.append(
            {
                "game_id": row.game_id,
            }
        )

# Make a DataFrame from the results
steam_details_df = pd.DataFrame.from_records(steam_details_df_records)

Searching for Steam game info: 100%|██████████| 159/159 [06:08<00:00,  2.32s/it]


In [54]:
# For each of the steam_details_df records, we'll add a steam_url column if the name scraped from Steam has a
# Levenshtein similarity ratio above 0.8 with the game's title in playable_games_df
verified_steam_details_df_records = []
for row in steam_details_df.itertuples():

    # Make a dictionary of the steam details
    cur_row_dict = row._asdict()

    # Grab the game ID
    game_id = row.game_id

    # Look up the matching rows in playable_games_df and game_search_results_df
    game_row = playable_games_df[playable_games_df["game_id"] == game_id].iloc[0]
    game_name = game_row.title
    search_row = game_search_results_df[
        game_search_results_df["game_id"] == game_id
    ].iloc[0]
    steam_link = search_row.steam_link

    # Keep the Steam details only if the scraped name is similar enough to the PAX title
    # (rows without a scraped name, i.e. no usable Steam page, are skipped)
    if isinstance(row.name, str) and lev_ratio(row.name, game_name) > 0.8:

        # Add the steam URL to the record
        cur_row_dict["steam_url"] = steam_link

        verified_steam_details_df_records.append(cur_row_dict)

# Make a DataFrame from the results
verified_steam_details_df = pd.DataFrame.from_records(verified_steam_details_df_records)
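
To make the threshold check easier to reason about, here's a small standalone sketch of the same idea. Note that it swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for `Levenshtein.ratio`, so the scores are similar in spirit but not guaranteed to be identical. `titles_match` is a hypothetical helper, and the case and whitespace normalization is my own addition rather than something the cell above does.

```python
from difflib import SequenceMatcher


def titles_match(scraped_name, expected_title, threshold: float = 0.8) -> bool:
    """Return True if two titles are similar enough to count as the same game."""
    # Missing values (None, NaN) can never match
    if not isinstance(scraped_name, str) or not isinstance(expected_title, str):
        return False
    # Normalize case and surrounding whitespace before comparing (an assumption)
    left = scraped_name.casefold().strip()
    right = expected_title.casefold().strip()
    return SequenceMatcher(None, left, right).ratio() > threshold
```

A strict threshold like this will reject Steam pages whose names carry suffixes such as "Demo", so borderline cases are worth spot-checking by hand.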

What does this data look like?


In [56]:
verified_steam_details_df.sample(3)

| | Index | name | image_address | description | tags | video_link | image_links | game_id | steam_url |
|---|---|---|---|---|---|---|---|---|---|
| 38 | 60 | Whisper Mountain Outbreak | https://shared.cloudflare.steamstatic.com/stor... | Escape room meets co-op survival horror. The y... | [Survival Horror, Online Co-Op, Isometric] | https://video.cloudflare.steamstatic.com/store... | [https://shared.cloudflare.steamstatic.com/sto... | 46445845 | https://store.steampowered.com/app/1953230/Whi... |
| 15 | 27 | Windswept | https://shared.cloudflare.steamstatic.com/stor... | Marbles 🦆 and Checkers 🐢 were swept away by a ... | [Precision Platformer, 2D Platformer, Platformer] | https://video.cloudflare.steamstatic.com/store... | [https://shared.cloudflare.steamstatic.com/sto... | 35417644 | https://store.steampowered.com/app/1660960/Win... |
| 66 | 97 | Electro Bop Boxing League | https://shared.cloudflare.steamstatic.com/stor... | A new and unconventional mix of Auto-Battler +... | [Auto Battler, Rhythm, Robots] | https://video.cloudflare.steamstatic.com/store... | [https://shared.cloudflare.steamstatic.com/sto... | 3993242 | https://store.steampowered.com/app/3211280/Ele... |


I'll save this data below:


In [57]:
verified_steam_details_df.to_json(
    "data/steam_details.json", orient="records", lines=True
)

# Compiling Final Data

Now that I've created all of this information, I'm going to attempt to compile it all together.


In [101]:
# Create some dictionaries mapping the game IDs to data
game_search_results_dict = {
    row.game_id: row._asdict()
    for row in game_search_results_df.itertuples()
    if row.correctly_identified_game
}
steam_details_dict = {
    row.game_id: row._asdict() for row in verified_steam_details_df.itertuples()
}

# Create a final_enriched_games_df DataFrame
final_enriched_games_df_records = []
for row in playable_games_df.itertuples():

    # Convert the row into a dictionary
    cur_row_data = row._asdict()
    game_id = cur_row_data.get("game_id", None)

    # Grab the game's developer
    developer = cur_row_data.get("developer", None)
    if pd.isna(developer):
        developer = steam_details_dict.get(game_id, {}).get("developer", None)

    # Add any AI-generated text to the descriptions_text
    # Copy the list so that appending below doesn't mutate playable_games_df in place
    descriptions = list(cur_row_data.get("description_texts", []))
    ai_search_summary = game_search_results_dict.get(game_id, {}).get(
        "description", None
    )
    if pd.notna(ai_search_summary):
        descriptions.append(
            {
                "source": "ai_search_summary",
                "text": ai_search_summary,
            }
        )

    # Grab the proper header image
    header_image = cur_row_data.get("header_image_url", None)
    if pd.isna(header_image):
        header_image = steam_details_dict.get(game_id, {}).get("image_address", None)

    # Determine the list of genres and tags
    genres_and_tags = []
    if isinstance(cur_row_data.get("genres", None), list):
        genres_and_tags.extend(cur_row_data.get("genres", []))
    if isinstance(steam_details_dict.get(game_id, {}).get("tags", None), list):
        genres_and_tags.extend(steam_details_dict.get(game_id, {}).get("tags", []))
    if isinstance(game_search_results_dict.get(game_id, {}).get("genres", None), list):
        genres_and_tags.extend(
            game_search_results_dict.get(game_id, {}).get("genres", [])
        )
    # Remove duplicates while preserving first-seen order
    genres_and_tags = list(dict.fromkeys(genres_and_tags))

    # Create the list of media
    media = []
    if isinstance(steam_details_dict.get(game_id, {}).get("image_links", None), list):
        for img_link in steam_details_dict.get(game_id, {}).get("image_links", []):
            media.append({"type": "image", "source": "steam", "url": img_link})
    if isinstance(steam_details_dict.get(game_id, {}).get("video_link", None), str):
        media.append(
            {
                "type": "video",
                "source": "steam",
                "url": steam_details_dict.get(game_id, {}).get("video_link", None),
            }
        )

    # Determine any links associated with the game
    links = [
        {
            "title": "Google Search for Game",
            "url": f"https://www.google.com/search?q={cur_row_data.get('title', '')} {developer or ''}",
        }
    ]
    if isinstance(
        game_search_results_dict.get(game_id, {}).get("web_search_results", None), list
    ):
        for search_result in game_search_results_dict.get(game_id, {}).get(
            "web_search_results", []
        ):
            links.append(
                {
                    "title": search_result[0],
                    "url": search_result[1],
                }
            )

    # Deduplicate the descriptions by source
    final_desc = []
    seen_sources = set()
    for desc in descriptions:
        if desc["source"] not in seen_sources:
            seen_sources.add(desc["source"])
            final_desc.append(desc)
    descriptions = final_desc

    # Start the data dictionary
    cur_row_data = {
        "id": cur_row_data.get("game_id", None),
        "name": cur_row_data.get("title", None),
        "snappy_summary": game_search_results_dict.get(game_id, {}).get(
            "snappy_summary", None
        ),
        "description_texts": descriptions,
        "platforms": game_search_results_dict.get(game_id, {}).get("platforms", None),
        "developer": developer,
        "exhibitor": cur_row_data.get("exhibitor_name", None),
        "booth_number": cur_row_data.get("booth_number", None),
        "header_image_url": header_image,
        "steam_link": steam_details_dict.get(game_id, {}).get("steam_url", None),
        "genres_and_tags": genres_and_tags,
        "media": media,
        "released": game_search_results_dict.get(game_id, {}).get("released", None),
        "release_time": game_search_results_dict.get(game_id, {}).get(
            "release_time", None
        ),
        "links": links,
    }

    # Append the data dictionary
    final_enriched_games_df_records.append(cur_row_data)

# Make the DataFrame
final_enriched_games_df = pd.DataFrame.from_records(final_enriched_games_df_records)
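
One thing the cell above doesn't do is URL-encode the Google search link, so titles containing spaces, `&`, or `#` end up in the query string as-is. Here's a sketch of how the link could be built with the standard library instead; `google_search_url` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional
from urllib.parse import urlencode


def google_search_url(title: str, developer: Optional[str] = None) -> str:
    """Build a Google search URL with the query string properly encoded."""
    # Join only the parts that are present, so a missing developer adds no stray space
    query = " ".join(part for part in (title, developer) if part)
    # urlencode escapes spaces, ampersands, and other reserved characters
    return "https://www.google.com/search?" + urlencode({"q": query})
```

The `"url"` entry in `links` could then be `google_search_url(cur_row_data.get("title", ""), developer)`.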

Finally, I'm going to save this:

In [102]:
final_enriched_games_df.to_json(
    "data/final_enriched_games_data.json", orient="records", indent=4
)

If I want to reload it below, I can:

In [103]:
# Reload the final_enriched_games_df
final_enriched_games_df = pd.read_json("data/final_enriched_games_data.json")

