# Buenos Aires Cultural Events Dataset

## Business challenge

Build a product that gives a quick view of current cultural events in Buenos Aires, classified by event type and venue. It should collect events and their related links from several venues across the city, so that picking what to do this week is easy.

In [186]:
import os
print("cwd:", os.getcwd())
print("src exists:", os.path.exists(os.path.abspath("../src")))
print("scraper exists:", os.path.exists(os.path.abspath("../src/scraper.py")))

cwd: /Users/victoriayuzova/Data-Science-Projects/ba-events-recommender/notebooks
src exists: True
scraper exists: True


In [193]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt
import os
import sys
import json
import importlib

import pandas as pd  # needed later for pd.concat / pd.DataFrame
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI

# Force the notebook to import the *local* ../src/scraper.py (not the pip package named `scraper`)
src_path = os.path.abspath("../src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)
if "scraper" in sys.modules:
    del sys.modules["scraper"]

scraper = importlib.import_module("scraper")
fetch_website_links = scraper.fetch_website_links
fetch_website_contents = scraper.fetch_website_contents
extract_links_from_homepages = scraper.extract_links_from_homepages
build_classified_events_dataset = scraper.build_classified_events_dataset

In [188]:
import json
from pathlib import Path

p = Path.cwd().parent / "src" / "homepage_urls.json"  # because cwd is notebooks/
with p.open("r", encoding="utf-8") as f:
    homepage_urls = json.load(f)["homepage_urls"]

homepage_urls

['https://complejoteatral.gob.ar/',
 'https://malba.org.ar/',
 'https://www.teatrocervantes.gob.ar/',
 'https://turismo.buenosaires.gob.ar/es/article/que-hacer-esta-semana',
 'https://www.bellasartes.gob.ar/agenda/']

In [189]:
df_links = extract_links_from_homepages(
    homepage_urls=homepage_urls,
    fetch_website_links_fn=fetch_website_links,
    keep_same_domain=False,  # important for ticket domains
    out_dir="../data/raw",   # relative path is cleaner
    filename="01_all_links.csv"
)

df_links.head()

Saved 293 rows to: ../data/raw/2026-02-26/01_all_links.csv


| | run_date | scraped_at | page_url | event_url_raw | event_url_abs | link_id |
|---|---|---|---|---|---|---|
| 0 | 2026-02-26 | 2026-02-26T23:32:24 | https://complejoteatral.gob.ar/ | http://buenosaires.gob.ar/ | http://buenosaires.gob.ar/ | 3cd0a6fbe45556781bd85be6f1ab1d74 |
| 1 | 2026-02-26 | 2026-02-26T23:32:24 | https://complejoteatral.gob.ar/ | https://complejoteatral.gob.ar/agenda?fecha=26-02-2026 | https://complejoteatral.gob.ar/agenda?fecha=26-02-2026 | f22e677f02d10cef5a9655c039ba6f74 |
| 2 | 2026-02-26 | 2026-02-26T23:32:24 | https://complejoteatral.gob.ar/ | https://complejoteatral.gob.ar/pdf/temporada2026.pdf | https://complejoteatral.gob.ar/pdf/temporada2026.pdf | bb6f5d43ebd6fb57389f315068d57fee |
| 3 | 2026-02-26 | 2026-02-26T23:32:24 | https://complejoteatral.gob.ar/ | https://complejoteatral.gob.ar/ver/visitas_guiadas_al_teatro_san_martín | https://complejoteatral.gob.ar/ver/visitas_guiadas_al_teatro_san_martín | 490e1ba143103982811544dfc0542a9b |
| 4 | 2026-02-26 | 2026-02-26T23:32:24 | https://complejoteatral.gob.ar/ | https://complejoteatral.gob.ar/paginas/teatros-accesibles | https://complejoteatral.gob.ar/paginas/teatros-accesibles | 8393f13dbdf20717b5095e801cf12ac3 |


## Step 1. Use LLM to pick relevant links

### Call an LLM to keep only the event-related links

We use one-shot prompting: the prompt includes an example of how the model should respond.

This is a good use case for an LLM because it requires nuanced understanding; hard-coding rules for every site layout would take considerable time.
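As a minimal sketch of what one-shot prompting looks like at the message level: the request carries exactly one worked example (a sample input and the ideal response) ahead of the real input. The helper name `build_one_shot_messages` and the `example.org` URLs are hypothetical and not part of this notebook; the cells in this notebook instead embed the expected schema directly in the system prompt.

```python
import json

def build_one_shot_messages(system_prompt: str, payload: dict) -> list[dict]:
    """Return chat messages containing one worked example before the real input."""
    # Hypothetical example pair: one event link to keep, one privacy link to drop
    example_input = {
        "homepage_url": "https://example.org/",
        "institution": None,
        "links": ["https://example.org/agenda", "https://example.org/privacy"],
    }
    example_output = {
        "homepage_url": "https://example.org/",
        "institution": None,
        "links": [{"url": "https://example.org/agenda"}],
    }
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(example_input, ensure_ascii=False)},
        {"role": "assistant", "content": json.dumps(example_output, ensure_ascii=False)},
        {"role": "user", "content": json.dumps(payload, ensure_ascii=False)},
    ]

messages = build_one_shot_messages(
    "Select event-related links. Return JSON only.",
    {"homepage_url": "https://malba.org.ar/", "institution": None, "links": []},
)
print([m["role"] for m in messages])
```

The single user/assistant pair shows the model the expected output shape, which is often enough to stabilize the response format.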

In [190]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Default model you can use elsewhere
MODEL = 'gpt-5-nano'

# Use a steadier model for link selection (avoid long hangs/timeouts)
LINK_MODEL = 'gpt-4.1-mini'

# Add a client-side timeout so a single slow request doesn't hang the notebook
openai = OpenAI(timeout=180)

In [191]:
# 1) Group candidate links per homepage.
#    Use the absolute URLs, because the system prompt promises absolute URLs.
grouped = df_links.groupby("page_url")["event_url_abs"].apply(list).to_dict()

payloads = [
    {"homepage_url": page_url, "institution": None, "links": links}
    for page_url, links in grouped.items()
]

link_system_prompt = """
You are selecting event-related links for a Buenos Aires cultural events scraper.

Input JSON keys:
- homepage_url: string
- institution: string or null
- links: array of absolute URLs (strings)

Pick ONLY links useful to find current/upcoming cultural events:
- event listing pages (agenda / programacion / calendario)
- event detail pages
- ticket purchase pages
- (optional) program PDFs (ONLY if they are clearly about programming/season/events)

Exclude:
- terms/privacy, about, contact, donations, newsletters, login, accessibility pages, generic navigation
- social media

Rules:
- Do NOT invent URLs. Every URL must be exactly one of input.links
- Return at most 30 links
- Return ONLY valid JSON (no markdown/no prose) with schema:

{
  "homepage_url": "<string>",
  "institution": null,
  "links": [{"url": "<string>"}]
}
""".strip()

def select_relevant_links(payload: dict) -> dict:
    # Serialize the payload: the message content must be a string, not a dict
    payload_text = json.dumps(payload, ensure_ascii=False)

    print(f"Selecting relevant links for {payload['homepage_url']} by calling {LINK_MODEL}")

    response = openai.chat.completions.create(
        model=LINK_MODEL,  # steadier model for link selection
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": payload_text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(response.choices[0].message.content)

    # Safety: keep only URLs that were in the input (no hallucinations)
    input_set = set(payload["links"])
    cleaned = [x for x in result.get("links", []) if x.get("url") in input_set]

    result["homepage_url"] = payload["homepage_url"]
    result["institution"] = payload.get("institution")
    result["links"] = cleaned[:30]

    print(f"Found {len(result['links'])} relevant links")
    return result

outs = [select_relevant_links(p) for p in payloads]

df_selected = pd.concat(
    [
        pd.DataFrame(o["links"]).assign(
            homepage_url=o.get("homepage_url"),
            institution=o.get("institution")
        )
        for o in outs
    ],
    ignore_index=True
)

df_selected.head()

Selecting relevant links for https://complejoteatral.gob.ar/ by calling gpt-4.1-mini
Found 30 relevant links
Selecting relevant links for https://malba.org.ar/ by calling gpt-4.1-mini


KeyboardInterrupt: 

In [None]:
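from datetime import datetime, timezone
import os

# datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")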

save_folder = f"../data/raw/{run_date}"
os.makedirs(save_folder, exist_ok=True)

df_selected.to_csv(f"{save_folder}/02_relevant_links.csv", index=False)

In [None]:
link_system_prompt = """
You are selecting event-related links for a Buenos Aires cultural events scraper.

You will receive a JSON object as input with keys:
- homepage_url: string
- institution: string or null
- links: array of absolute URLs (strings)

Your job:
- Pick ONLY links that are relevant for finding current/upcoming cultural events (agenda/listings, event detail pages, ticket purchase pages, program PDFs, calendars).
- Exclude terms/privacy, contact/about, donations/sponsors, newsletters, login, generic navigation, accessibility pages, and any social media.
- Do NOT invent new URLs. Every returned URL must come from input.links.
- Return at most 30 links.

Return ONLY valid JSON (no markdown, no prose) with this schema:

{
  "homepage_url": "<string>",
  "institution": "<string or null>",
  "links": [
    {
      "url": "<string>"
    }
  ]
}
"""

In [None]:
def get_links_user_prompt(payload_json_links):
    user_prompt = f"""
Here is the list of links on the website in JSON format: {payload_json_links}
Please decide which of these are relevant web links for a brochure listing current cultural
events in Buenos Aires.
Do not include Terms of Service, Privacy, email, social media links, or general descriptions of the theater that are not related to any event.

Links (some might be relative links):

"""
    links = fetch_website_links(payload_json_links)
    user_prompt += "\n".join(links)
    return user_prompt

In [None]:
def select_relevant_links(payload_json_links):
    print(f"Selecting relevant links for {payload_json_links} by calling {MODEL}")
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": payload_json_links}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    links = json.loads(result)
    print(f"Found {len(links['links'])} relevant links")
    return links
    

In [None]:
for i, p in enumerate(payloads):
    if not isinstance(p, str):
        print(i, type(p))
        break


0 <class 'dict'>


In [None]:
outs = [select_relevant_links(json.dumps(p, ensure_ascii=False)) for p in payloads]

df_selected = pd.concat(
    [pd.DataFrame(o["links"]).assign(homepage_url=o.get("homepage_url"), institution=o.get("institution"))
     for o in outs],
    ignore_index=True
)

df_selected

Selecting relevant links for {"homepage_url": "https://complejoteatral.gob.ar/", "institution": null, "links": ["https://complejoteatral.gob.ar/agenda?fecha=25-02-2026", "https://complejoteatral.gob.ar/pdf/temporada2026.pdf", "https://complejoteatral.gob.ar/ver/visitas_guiadas_al_teatro_san_martín", "https://complejoteatral.gob.ar/ver/la_gaviota", "https://entradasba.buenosaires.gob.ar/evento/d90f82ed-ec8f-46cf-b8b2-48e665a36fc3", "https://complejoteatral.gob.ar/ver/los-pilares-de-la-sociedad", "https://entradasba.buenosaires.gob.ar/evento/42523311-973e-4a52-b2ee-63a59c48a2b7", "https://complejoteatral.gob.ar/ver/baco-polaco", "https://entradasba.buenosaires.gob.ar/evento/f94c1c9a-151e-49a8-8ed8-c9aa9d0464c3", "https://complejoteatral.gob.ar/ver/invasiones-1", "https://entradasba.buenosaires.gob.ar/evento/e1281314-a634-47a0-a2fa-89f0c3c88b0b", "https://complejoteatral.gob.ar/ver/buenas-palabras", "https://entradasba.buenosaires.gob.ar/evento/f163b37c-bb5f-48b0-a250-5d4293030f3f", "http

| | url | homepage_url | institution |
|---|---|---|---|
| 0 | https://complejoteatral.gob.ar/agenda?fecha=25-02-2026 | https://complejoteatral.gob.ar/ | |
| 1 | https://complejoteatral.gob.ar/pdf/temporada2026.pdf | https://complejoteatral.gob.ar/ | |
| 2 | https://complejoteatral.gob.ar/ver/visitas_guiadas_al_teatro_san_martín | https://complejoteatral.gob.ar/ | |
| 3 | https://complejoteatral.gob.ar/ver/la_gaviota | https://complejoteatral.gob.ar/ | |
| 4 | https://entradasba.buenosaires.gob.ar/evento/d90f82ed-ec8f-46cf-b8b2-48e665a36fc3 | https://complejoteatral.gob.ar/ | |
| 5 | https://complejoteatral.gob.ar/ver/los-pilares-de-la-sociedad | https://complejoteatral.gob.ar/ | |
| 6 | https://entradasba.buenosaires.gob.ar/evento/42523311-973e-4a52-b2ee-63a59c48a2b7 | https://complejoteatral.gob.ar/ | |
| 7 | https://complejoteatral.gob.ar/ver/baco-polaco | https://complejoteatral.gob.ar/ | |
| 8 | https://entradasba.buenosaires.gob.ar/evento/f94c1c9a-151e-49a8-8ed8-c9aa9d0464c3 | https://complejoteatral.gob.ar/ | |
| 9 | https://complejoteatral.gob.ar/ver/invasiones-1 | https://complejoteatral.gob.ar/ | |


In [None]:
df_selected.to_csv("relevant_links.csv", index=False)

## Step 2. Classify the selected links

Assemble the details of each selected link into another prompt. The cell below passes `gpt-4.1-mini` as the model.
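The internals of `build_classified_events_dataset` live in `../src/scraper.py` and are not shown in this notebook. As a rough sketch of the kind of post-processing a classification step benefits from, the snippet below normalizes a model's JSON answer against a fixed label set. The label set `EVENT_TYPES`, the field names `event_type` and `venue`, and the helper `normalize_classification` are all assumptions for illustration, not the project's actual schema.

```python
import json

# Hypothetical label set; the real categories would be defined in scraper.py
EVENT_TYPES = {"theater", "dance", "music", "exhibition", "cinema", "other"}

def normalize_classification(raw_json: str) -> dict:
    """Parse a model response and coerce unknown or missing labels to 'other'."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    event_type = str(data.get("event_type", "")).strip().lower()
    if event_type not in EVENT_TYPES:
        event_type = "other"
    venue = data.get("venue")
    return {"event_type": event_type, "venue": venue if isinstance(venue, str) else None}

print(normalize_classification('{"event_type": "Theater", "venue": "Teatro San Martín"}'))
print(normalize_classification("not json"))
```

Coercing unexpected labels to a catch-all value keeps the resulting dataset groupable by type even when the model drifts from the requested vocabulary.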

In [195]:
df_events = build_classified_events_dataset(
    df_relevant=df_selected,  # df_selected already has url + homepage_url
    fetch_website_contents_fn=fetch_website_contents,
    openai_client=openai,
    model="gpt-4.1-mini",
    out_dir="../data/raw",
    filename="03_events.csv",
    limit=30,
)
df_events.head()

NameError: name 'timezone' is not defined
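A `NameError` for `timezone` generally means the name was used without being imported from the standard-library `datetime` module. The traceback is not shown here, so where exactly this happens in `../src/scraper.py` is an assumption. A minimal sketch of a timezone-aware timestamp helper (the name `utc_timestamp` is hypothetical) looks like this:

```python
from datetime import datetime, timezone  # `timezone` must be imported explicitly

def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string with second precision."""
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat()

print(utc_timestamp())
```

Because the imports cell removes `scraper` from `sys.modules` before importing it, re-running that cell after editing the module should pick up the change.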

In [113]:
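def fetch_page_and_all_relevant_links(url):
    # Fetch the landing page text and build the JSON payload the link selector expects
    contents = fetch_website_contents(url)
    payload = {"homepage_url": url, "institution": None, "links": fetch_website_links(url)}
    # The string-based select_relevant_links defined above takes a JSON string, not a bare URL
    relevant_links = select_relevant_links(json.dumps(payload, ensure_ascii=False))
    result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"
    for link in relevant_links["links"]:
        # The response schema only defines a "url" key, so label each section by URL
        result += f"\n\n### Link: {link['url']}\n"
        result += fetch_website_contents(link["url"])
    return result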

In [114]:
print(fetch_page_and_all_relevant_links("https://complejoteatral.gob.ar/"))

Selecting relevant links for https://complejoteatral.gob.ar/ by calling gpt-5-nano
Found 0 relevant links
## Landing Page:

Complejo Teatral de Buenos Aires

Programación
Agenda
Temporada 2026
Teatros
Teatro San Martín
Teatro Regio
Teatro de la Ribera
Teatro Sarmiento
Teatro Alvear
Cine Teatro El Plata
Géneros
teatro
para grandes y chicos
danza
música
Ciclo Hall Abierto
títeres
opera rock
fotogalería
cine
temporada internacional
Teatro musical
actividades especiales
danza / teatro
artes plásticas
actividades pedagógicas
Cíclos
MÚSICA EN EL HALL
BAILEMOS EN EL HALL
HADES EN DEMORA
CICLO PRODANZA EN EL HALL
PREMIO BANCO CIUDAD A LAS ARTES ESCÉNICAS 2020-2021
PREMIO CTBA A LA CREACIÓN Y PRODUCCIÓN DE ARTES ESCÉNICAS
2 de abril 40 años
CICLO HALL ABIERTO
Kantor
Vacaciones de Invierno en el Teatro
Próximos estrenos
Programación accesible
Visitas Guiadas
Buscar
Filtrar búsqueda por:
menú
Programación
Descargar Temporada 2026
Agenda > Hoy en el CTBA
Noticias
Teatros Accesibles
Cuerpos estable

In [115]:
brochure_system_prompt = """
You are an assistant that analyzes the contents of several relevant pages from a cultural institution website
and creates a short brochure about the events that are happening in Buenos Aires.
Respond in markdown without code blocks.
Include event name, type, short description, date and time of the event, link to the event if available.
"""

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# brochure_system_prompt = """
# You are an assistant that analyzes the contents of several relevant pages from a company website
# and creates a short, humorous, entertaining, witty brochure about the company for prospective customers, investors and recruits.
# Respond in markdown without code blocks.
# Include details of company culture, customers and careers/jobs if you have the information.
# """


In [116]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"""
You are looking at a cultural institution called: {company_name}
Here are the contents of its pages that contain information about current cultural events in Buenos Aires;
use this information to build a short brochure about the events that are happening in Buenos Aires.\n\n
"""
    user_prompt += fetch_page_and_all_relevant_links(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt

In [117]:
get_brochure_user_prompt("Teatro San Martín", "https://complejoteatral.gob.ar/")

Selecting relevant links for https://complejoteatral.gob.ar/ by calling gpt-5-nano
Found 0 relevant links


'\nYou are looking at a cultural institution called: Teatro San Martín\nHere are the contents of its pages that contain information about current cultural events in Buenos Aires;\nuse this information to build a short brochure about the events that are happening in Buenos Aires.\n\n\n## Landing Page:\n\nComplejo Teatral de Buenos Aires\n\nProgramación\nAgenda\nTemporada 2026\nTeatros\nTeatro San Martín\nTeatro Regio\nTeatro de la Ribera\nTeatro Sarmiento\nTeatro Alvear\nCine Teatro El Plata\nGéneros\nteatro\npara grandes y chicos\ndanza\nmúsica\nCiclo Hall Abierto\ntíteres\nopera rock\nfotogalería\ncine\ntemporada internacional\nTeatro musical\nactividades especiales\ndanza / teatro\nartes plásticas\nactividades pedagógicas\nCíclos\nMÚSICA EN EL HALL\nBAILEMOS EN EL HALL\nHADES EN DEMORA\nCICLO PRODANZA EN EL HALL\nPREMIO BANCO CIUDAD A LAS ARTES ESCÉNICAS 2020-2021\nPREMIO CTBA A LA CREACIÓN Y PRODUCCIÓN DE ARTES ESCÉNICAS\n2 de abril 40 años\nCICLO HALL ABIERTO\nKantor\nVacaciones de

In [118]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [119]:
create_brochure("Teatro San Martín", "https://complejoteatral.gob.ar/")

Selecting relevant links for https://complejoteatral.gob.ar/ by calling gpt-5-nano
Found 0 relevant links


# Buenos Aires Cultural Events at Teatro San Martín

Discover a vibrant selection of theatrical and cultural experiences happening now in Buenos Aires at the Teatro San Martín and other venues in the Complejo Teatral de Buenos Aires.

---

### LA GAVIOTA  
**Type:** Theater  
**Description:** A classic play by Anton Chekhov directed by Rubén Szuchmacher, "La Gaviota" (The Seagull) presents a profound exploration of art, love, and the tragic complexities of life.  
**Venue:** Teatro San Martín  
**Date & Time:** Check Teatro San Martín schedule for current showtimes  
[More Info and Tickets](#)

---

### LOS PILARES DE LA SOCIEDAD  
**Type:** Theater  
**Description:** A play by Henrik Ibsen directed by Jorge Suárez. This production delves into social issues and personal conflicts impacting society, offering critical reflection and emotional depth.  
**Venue:** Teatro Presidente Alvear  
**Date & Time:** Check Teatro Presidente Alvear schedule for dates and times  
[More Info and Tickets](#)

---

### BACO POLACO  
**Type:** Theater  
**Description:** Written by Mauricio Kartun, "Baco Polaco" is staged at Teatro Sarmiento. It is a contemporary production blending humor, drama, and social commentary.  
**Venue:** Teatro Sarmiento  
**Date & Time:** Refer to Teatro Sarmiento for current showtimes  
[More Info and Tickets](#)

---

### Additional Cultural Activities at Complejo Teatral de Buenos Aires:  
- **Dance and Ballet Contemporary Performances**  
- **Music in the Hall: Live musical sessions**  
- **Theatre for Children and Families**  
- **Workshops and Pedagogical Activities, including Dance and Puppetry Workshops**  
- **Accessible Programming and Guided Tours for Visitors**  

---

### Visit and Explore  
- Guided tours available for Teatro San Martín to experience the architectural and artistic heritage of this grand theater.  
- Explore a broad range of genres including puppet shows, opera rock, and international seasonal presentations.

---

For full program details and ticket purchases:  
[Complejo Teatral de Buenos Aires - Programación Temporada 2026](#)

---

Enjoy the artistic richness of Buenos Aires at the Teatro San Martín and its associated venues!  
Stay updated with schedules, special events, and new premieres to enrich your cultural experience.

## Finally - a minor improvement

With a small adjustment, the results stream back from OpenAI with the familiar typewriter animation.

In [None]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)},
        ],
        stream=True,
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        # delta.content can be None, so fall back to an empty string
        response += chunk.choices[0].delta.content or ""
        update_display(Markdown(response), display_id=display_handle.display_id)
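To see the accumulation logic in isolation, without an API call, the sketch below feeds a list of hand-made text deltas through the same pattern. The `accumulate_deltas` helper and the sample deltas are hypothetical; `None` stands in for chunks that carry no text, since `delta.content` can be `None`.

```python
from typing import Iterable, Iterator, Optional

def accumulate_deltas(deltas: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield the growing response text after each streamed delta."""
    response = ""
    for delta in deltas:
        response += delta or ""  # treat missing content as an empty string
        yield response

# Hand-made deltas standing in for a real stream
fake_deltas = ["# Eventos", None, "\n\n", "La Gaviota"]
snapshots = list(accumulate_deltas(fake_deltas))
print(snapshots[-1])
```

Each snapshot is what would be re-rendered as Markdown on every update, which is what produces the typewriter effect.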