# TA Session: Web Scraping for Difference-in-Differences Analysis
## Scraping Hotel Prices from Booking.com

**Course:** Introduction to Text Mining and Natural Language Processing  
**Session:** TA - Project Design and Getting the Text      
**Date:** January 2026

---

#### *Disclaimers*

- Easiest way for this to run properly: [install Astral's UV](https://docs.astral.sh/uv/getting-started/installation/).
- Then just run `uv sync` to install all dependencies.
- Make sure you have Google Chrome installed as well, using the default installation method.

<br>

---
---

## Section 1: Introduction





---
## 🎯 Session Goal

By the end of this session, you will understand basic Selenium concepts and use them to scrape hotel data from Booking.com, analyzing how large events affect hotel prices. This dataset will be reused in later sessions for:
- Difference-in-Differences (DiD) econometric analysis
- Text mining to identify market segments
- Heterogeneous treatment effect analysis

**Your task:** Collect hotel prices for an event city and a control city, across event and non-event weeks.

---
## 📊 Target Data Structure

You need to produce a DataFrame with **exactly these columns**:

| Column | Type | Description |
|--------|------|-------------|
| `city` | string | City name (e.g., "Barcelona", "Valencia") |
| `hotel` | string | Hotel name |
| `date` | datetime | Check-in date |
| `price` | float | Total price in EUR for 7 nights, 2 adults, 1 room |
| `treatCity` | bool/int | 1 if event city, 0 if control city |
| `treatPeriod` | bool/int | 1 if event week, 0 if control week |
| `text` | string | Hotel description text |

### Example of what your final data should look like:

```
| city      | hotel              | date       | price | treatCity | treatPeriod | text                              |
|-----------|--------------------| -----------|-------|-----------|-------------|-----------------------------------|
| Barcelona | Hotel Arts         | 2026-06-04 | 2450  | 1         | 1           | "Luxury beachfront hotel with..." |
| Barcelona | Hotel Arts         | 2026-06-11 | 1280  | 1         | 0           | "Luxury beachfront hotel with..." |
| Valencia  | Hotel Balneario    | 2026-06-04 | 580   | 0         | 1           | "Historic spa hotel located..."   |
| Valencia  | Hotel Balneario    | 2026-06-11 | 560   | 0         | 0           | "Historic spa hotel located..."   |
```

**Notice:** The same hotel appears multiple times (once per check-in date). The price changes, but the description stays the same.
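
A quick way to check scraped rows against the table above before building the DataFrame. This is only a minimal sketch that assumes each row is a plain `dict`; `REQUIRED_COLUMNS` and `validate_row` are hypothetical names, not part of the assignment.

```python
from datetime import date

# Column names copied from the target table; the order is the expected column order.
REQUIRED_COLUMNS = ["city", "hotel", "date", "price", "treatCity", "treatPeriod", "text"]


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one scraped row (empty list = OK)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    extra = [c for c in row if c not in REQUIRED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not isinstance(row.get("date"), date):
        problems.append("date must be a date/datetime object")
    price = row.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        problems.append("price must be a positive number")
    for flag in ("treatCity", "treatPeriod"):
        if row.get(flag) not in (0, 1):
            problems.append(f"{flag} must be 0 or 1")
    return problems


example = {
    "city": "Barcelona", "hotel": "Hotel Arts", "date": date(2026, 6, 4),
    "price": 2450.0, "treatCity": 1, "treatPeriod": 1,
    "text": "Luxury beachfront hotel with...",
}
print(validate_row(example))  # an empty list means the row matches the schema
```

Once every row passes, `pd.DataFrame(rows, columns=REQUIRED_COLUMNS)` gives the columns in the required order.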

---
## 📋 Group Assignments

Each group will be assigned randomly:
- **One event** in Spain
- **One event city** (Treatment city, `treatCity=1`)
- **One control city** (Comparison city, `treatCity=0`)
- **Check-in dates** to scrape (one is the event week, others are control weeks)

### 🔴 IMPORTANT: Terminology
- **Treatment city** = city WHERE the event happens (gets "treated" by the event)
- **Control city** = <u>similar city</u> WITHOUT the event (for comparison)
- **Treatment period** = the week WHEN the event happens
- **Control period** = weeks without the event
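
The terminology maps directly onto the two indicator columns. The sketch below shows one possible way to derive them; `label_row` is a hypothetical helper, and the city and date come from the example table earlier in this notebook.

```python
from datetime import date


def label_row(city: str, checkin: date, event_city: str, event_checkin: date) -> dict:
    """Return the two indicator columns for one observation."""
    return {
        "treatCity": int(city == event_city),          # 1 only for the event city
        "treatPeriod": int(checkin == event_checkin),  # 1 for the event week, in BOTH cities
    }


event_city, event_checkin = "Barcelona", date(2026, 6, 4)
print(label_row("Barcelona", date(2026, 6, 4), event_city, event_checkin))
print(label_row("Valencia", date(2026, 6, 4), event_city, event_checkin))
print(label_row("Barcelona", date(2026, 6, 11), event_city, event_checkin))
```

Note that the control city also gets `treatPeriod=1` during the event week; only the combination `treatCity=1` and `treatPeriod=1` is actually exposed to the event.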

### Group Assignment Table

| Group | Event | Event City | Control City | Event Dates | Treatment Check-in | Example Control Check-ins |
|-------|-------|------------|--------------|-------------|-------------------|-------------------|
| 1 | Las Fallas | Valencia | Alicante | Mar 15-19 | 2026-03-14 | 2026-02-28, 2026-03-07, 2026-03-21, 2026-03-28 |
| 2 | Feria de Abril | Seville | Granada | Apr 21-26 | 2026-04-20 | 2026-04-06, 2026-04-13, 2026-04-27, 2026-05-04 |
| 3 | Primavera Sound | Barcelona | Madrid | Jun 4-6 | 2026-06-02 | 2026-05-19, 2026-05-26, 2026-06-09, 2026-06-16 |
| 4 | San Fermín | Pamplona | Bilbao | Jul 6-14 | 2026-07-07 | 2026-06-23, 2026-06-30, 2026-07-14, 2026-07-21 |
| 5 | Feria de Málaga | Málaga | Cádiz | Aug 15-22 | 2026-08-15 | 2026-08-01, 2026-08-08, 2026-08-22, 2026-08-29 |
| 6 | Fiestas del Pilar | Zaragoza | Valladolid | Oct 10-18 | 2026-10-10 | 2026-09-26, 2026-10-03, 2026-10-17, 2026-10-24 |

**⚠️ Notes:**  
- Event dates are approximate. Verify your specific event dates before scraping! Some days might be more "important" than others.
- Exact dates for scraping and final rows extracted are up to the group, but they must cover at least 2 weeks before and 2 weeks after the event, as in the examples above.
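
The check-in dates in the table are spaced one week apart around the treatment check-in. A minimal sketch of how such a list could be generated; `checkin_dates` is a hypothetical helper, and the two-week window on each side is the minimum stated above.

```python
from datetime import date, timedelta


def checkin_dates(treatment_checkin: date, weeks_before: int = 2, weeks_after: int = 2) -> list[date]:
    """Weekly check-in dates centred on the treatment check-in (inclusive)."""
    return [
        treatment_checkin + timedelta(weeks=offset)
        for offset in range(-weeks_before, weeks_after + 1)
    ]


# Group 1 (Las Fallas): treatment check-in 2026-03-14
for d in checkin_dates(date(2026, 3, 14)):
    print(d.isoformat())
```

For Group 1 this reproduces the treatment check-in plus the four example control check-ins listed in the table.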

### Fixed Search Parameters (ALL GROUPS):
- **Currency:** EUR
- **Adults:** 2
- **Rooms:** 1  
- **Nights:** 7 
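
One way to keep these parameters fixed is to encode them in the URL instead of clicking through the form. The sketch below is an assumption-laden illustration: the parameter names `checkin`, `checkout`, `group_adults`, `no_rooms` and `group_children` appear in the hotel URL printed further down in this notebook, and `ss` is the name of the search box, but the `searchresults.html` endpoint and the `selected_currency` parameter are assumptions that should be checked against the URL your browser shows after a manual search. `build_search_url` is a hypothetical helper.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# ASSUMPTION: this endpoint and `selected_currency` are not confirmed by this notebook;
# compare with the address bar after running a search by hand.
SEARCH_ENDPOINT = "https://www.booking.com/searchresults.html"


def build_search_url(city: str, checkin: date, nights: int = 7,
                     adults: int = 2, rooms: int = 1, currency: str = "EUR") -> str:
    """Build a search URL that encodes the fixed parameters shared by all groups."""
    params = {
        "ss": city,
        "checkin": checkin.isoformat(),
        "checkout": (checkin + timedelta(days=nights)).isoformat(),
        "group_adults": adults,
        "no_rooms": rooms,
        "group_children": 0,
        "selected_currency": currency,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url("Valencia", date(2026, 3, 14)))
```

Whichever route you take, check on the results page that the currency, guests and dates actually shown match what you requested.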

---
## 🛠️ Setup and Imports

In [2]:
# Core imports
import pandas as pd
import numpy as np
import requests
import time
import random
import re
from datetime import datetime, timedelta

# Selenium imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import (
    TimeoutException, 
    NoSuchElementException,
    StaleElementReferenceException
)

# For automatic chromedriver management
from webdriver_manager.chrome import ChromeDriverManager

# For parsing HTML
from bs4 import BeautifulSoup

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ All imports successful!")

✅ All imports successful!


---
## 🔰 Selenium Basics - A Gentle Introduction

Before we build the full scraper, let's learn the **fundamental Selenium commands**. This section is for you to experiment and understand how browser automation works.

### What is Selenium?
Selenium is a tool that lets Python **control a web browser**. It can:
- Open websites
- Click buttons
- Fill in forms
- Extract text from pages
- Take screenshots

### Why not just use `requests`?
Many modern websites (including Booking.com) load content **dynamically with JavaScript**. A simple HTTP request only gets the initial HTML - not the content that loads after. Selenium runs a real browser, so it sees everything a human would see.

In [3]:
# ============================================================
# SELENIUM BASICS - Let's start simple!
# ============================================================

# First, let's create a browser driver
# We'll use headless=False (default) so you can SEE what's happening

# Create a simple driver (WITH visible browser window)
options = Options()
# options.add_argument("--headless=new")  # Commented out so you see the browser!
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--incognito")

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

print("✅ Browser opened! You should see a Chrome window.")

✅ Browser opened! You should see a Chrome window.


In [4]:
# ============================================================
# BASIC COMMAND 1: Navigate to a webpage
# ============================================================

# Let's go to a simple website first (not Booking.com yet)
driver.get("https://example.com")

print("✅ Navigated to example.com")
print(f"Current URL: {driver.current_url}")
print(f"Page title: {driver.title}")

✅ Navigated to example.com
Current URL: https://example.com/
Page title: Example Domain


In [5]:
# ============================================================
# BASIC COMMAND 2: Find elements on the page
# ============================================================

# Selenium can find elements using different "locators":
# - By.ID            --> find by id attribute
# - By.CLASS_NAME    --> find by class attribute
# - By.TAG_NAME      --> find by HTML tag (h1, p, div, etc.)
# - By.CSS_SELECTOR  --> find using CSS selectors
# - By.XPATH         --> find using XPath expressions
# - By.NAME          --> find by name attribute (same as the CSS selector [name='...'])

# Let's find the <h1> heading on example.com
heading = driver.find_element(By.TAG_NAME, "h1")
print(f"Found heading: {heading.text}")

# Find all paragraphs
paragraphs = driver.find_elements(By.TAG_NAME, "p")  # Note: find_elementS (plural)
print(f"\nFound {len(paragraphs)} paragraph(s):")
for i, p in enumerate(paragraphs):
    print(f"  {i+1}. {p.text[:50]}..." if len(p.text) > 50 else f"  {i+1}. {p.text}")

Found heading: Example Domain

Found 2 paragraph(s):
  1. This domain is for use in documentation examples w...
  2. Learn more


In [6]:
# ============================================================
# BASIC COMMAND 3: Get the page source (HTML)
# ============================================================

# Sometimes you want the raw HTML to parse with BeautifulSoup
html = driver.page_source

print("First 500 characters of HTML:")
print(html[:500])

First 500 characters of HTML:
<html lang="en"><head><title>Example Domain</title><meta name="viewport" content="width=device-width, initial-scale=1"><style>body{background:#eee;width:60vw;margin:15vh auto;font-family:system-ui,sans-serif}h1{font-size:1.5em}div{opacity:0.8}a:link,a:visited{color:#348}</style></head><body><div><h1>Example Domain</h1><p>This domain is for use in documentation examples without needing permission. Avoid use in operations.</p><p><a href="https://iana.org/domains/example">Learn more</a></p></div>
<


In [7]:
# Now let's go to Booking.com, a real JavaScript-heavy site
booking_url = "https://www.booking.com"

driver.get(booking_url)
time.sleep(3)  # Wait for page to load

print("✅ Navigated to the Booking.com homepage")
print(f"Page title: {driver.title}")

✅ Navigated to the Booking.com homepage
Page title: Booking.com | Official site | The best hotels, flights, car rentals & accommodations


In [8]:
# Extra tip: Set window size for better visibility
driver.set_window_size(1400, 900)

In [9]:
# Let's compare the page source from Selenium with what a plain requests.get() call returns
response = requests.get(booking_url)

print(f"\nLength of HTML via Selenium: {len(driver.page_source)} characters")
print(f"Length of HTML via requests: {len(response.text)} characters")


Length of HTML via Selenium: 853882 characters
Length of HTML via requests: 3962 characters


In [10]:
# ============================================================
# BASIC COMMAND 4: Click buttons/links, write on forms
# ============================================================

accept_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
accept_button.click()  # Example: Accept cookies button

In [11]:
close_signin = driver.find_element(By.CSS_SELECTOR, "button[aria-label='Dismiss sign in information.']")
close_signin.click()  # Example: Close sign-in popup
time.sleep(2)  # Wait a bit

In [12]:
city_input = driver.find_element(By.NAME, "ss")
city_input.send_keys("Barcelona")
time.sleep(2)  # Wait a bit

#### Extra tip: find the location of an element
Open the Console in Chrome's Developer Tools and follow these steps:

1. First
    ```js
    document.addEventListener("mousemove", e => {
      const box = document.getElementById("__coords__") || (() => {
        const d = document.createElement("div");
        d.id = "__coords__";
        d.style.cssText = `
          position:fixed;top:0;left:0;z-index:999999;
          background:black;color:lime;
          font:12px monospace;padding:4px;
          pointer-events:none;
        `;
        document.body.appendChild(d);
        return d;
      })();

      box.textContent = `x:${e.clientX} y:${e.clientY}`;
    });
    ``` 

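2. If Chrome refuses the paste with a warning, type `allow pasting` in the Console and press Enter.
3. Paste the snippet from step 1 again. A small overlay in the top-left corner now shows the cursor's `x`/`y` coordinates as you move the mouse.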

In [13]:
# ============================================================
# BASIC COMMAND 5: Click with coordinates and JavaScript
# ============================================================

x = 600
y = 350
driver.execute_script(f"document.elementFromPoint({x}, {y}).click();")
time.sleep(2)  # Wait a bit to see the click effect


#### How do we get the correct CSS selectors for dates?
CSS selectors are not limited to class names. A selector of the form `tagName[attribute='value']` matches an element that looks like this in the HTML:

```HTML
<tag attribute="value">
```

For the dates, there's a `data-date` attribute on the `<span>` elements for each date.
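
Since the selector only depends on the date string, it can be built from a `date` object rather than typed by hand. A small sketch; `date_selector` is a hypothetical helper, and the 7-night stay follows the fixed search parameters of the assignment.

```python
from datetime import date, timedelta


def date_selector(day: date) -> str:
    """CSS selector for the calendar cell of a given day."""
    return f"span[data-date='{day.isoformat()}']"


checkin = date(2026, 2, 4)
checkout = checkin + timedelta(days=7)  # 7 nights, as required for the assignment

print(date_selector(checkin))
print(date_selector(checkout))
```

The resulting strings can be passed to `driver.find_element(By.CSS_SELECTOR, ...)`. The lookup only succeeds if the cell for that day is present in the calendar months currently rendered on the page.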

In [14]:
# Select start and end dates

start_date = driver.find_element(By.CSS_SELECTOR, "span[data-date='2026-02-04']")
start_date.click()

time.sleep(2)  # Wait a bit to see the click effect

end_date = driver.find_element(By.CSS_SELECTOR, "span[data-date='2026-02-07']")
end_date.click()

In [15]:
# Find the "Search" button and click it
# In this case it's a "span" with text "Search"
# We can find it by looking for all spans and filtering by text

all_spans = driver.find_elements(By.TAG_NAME, "span")
for span in all_spans:
    if span.text.strip().lower() == "search":
        span.click()
        break

In [16]:
# ============================================================
# BASIC COMMAND 6: Scroll the page
# ============================================================

# Booking.com loads more hotels as you scroll (lazy loading)
# We can scroll using JavaScript

# Scroll down 1000 pixels
driver.execute_script("window.scrollBy(0, 1000);")
print("Scrolled down 1000px")
time.sleep(1)

Scrolled down 1000px


In [17]:
# Get current site height
site_height = driver.execute_script("return document.body.scrollHeight;")
print(f"Current site height: {site_height}px")

Current site height: 8985px


In [18]:
# Scroll to bottom of page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print("Scrolled to bottom")
time.sleep(1)

Scrolled to bottom


In [19]:
# Scroll back to top
driver.execute_script("window.scrollTo(0, 0);")
print("Scrolled back to top")

Scrolled back to top
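
The scroll commands above can be combined into a loop that keeps scrolling until the page height stops growing, which is one common way to trigger lazy loading. This is a sketch, not a guaranteed recipe: `scroll_until_stable`, the pause length and the round limit are all assumptions to tune.

```python
import time


def scroll_until_stable(driver, pause: float = 1.5, max_rounds: int = 20) -> int:
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is a Selenium WebDriver; only `execute_script` is used.
    Returns the final page height in pixels.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded in this round
        last_height = new_height
    return last_height
```

Usage would be `final_height = scroll_until_stable(driver)` on a results page. If further results are only revealed by clicking a button, scrolling alone will not load them and the button has to be clicked as well.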


In [20]:
# ============================================================
# BASIC COMMAND 7: Find hotel cards on Booking.com
# ============================================================

# Let's try to find hotel listings
# Booking.com uses data-testid attributes which are helpful

hotel_cards = driver.find_elements(By.CSS_SELECTOR, "div[data-testid='property-card']")

# find_elements returns an empty list (it does not raise) when nothing matches
if hotel_cards:
    print(f"Found {len(hotel_cards)} hotel cards using data-testid")
else:
    print("Could not find cards with data-testid")


Found 50 hotel cards using data-testid


In [21]:
# ============================================================
# BASIC COMMAND 8: Extract text from an element
# ============================================================

# Let's get the name of the first hotel
if len(hotel_cards) > 0:
    first_card = hotel_cards[0]
    
    # Try to find the title within this card
    try:
        title_elem = first_card.find_element(By.CSS_SELECTOR, "div[data-testid='title']")
        print(f"First hotel name: {title_elem.text}")
    except NoSuchElementException:
        # Fallback: get all text from the card
        print(f"Card text (first 200 chars): {first_card.text[:200]}...")
else:
    print("No hotel cards found to extract from")

First hotel name: Mayerling Schumann Barcelona


In [22]:
# ============================================================
# BASIC COMMAND 9: Navigate to a page
# ============================================================

# Let's navigate to a specific hotel's page
if len(hotel_cards) > 0:
    first_card = hotel_cards[0]
    link_elem = first_card.find_element(By.TAG_NAME, "a")
    hotel_url = link_elem.get_attribute("href")
    
    print(f"Navigating to hotel page: {hotel_url}")
    driver.get(hotel_url)
    time.sleep(3)  # Wait for page to load
    
    print(f"On hotel page: {driver.title}")

#### ‼️ Doing this is better than clicking because it avoids popups/overlays ‼️ ####

Navigating to hotel page: https://www.booking.com/hotel/es/chic-basic-bruc-apartamentos.en-gb.html?label=gen173nr-10CAEoggI46AdIM1gEaEaIAQGYATO4AQfIAQzYAQPoAQH4AQGIAgGoAgG4AsO3qcsGwAIB0gIkZTY2MWQ4ZGEtY2JjZi00YjBiLTg4MTctNjU1OWRjMmUwNWE12AIB4AIB&aid=304142&ucfs=1&arphpl=1&checkin=2026-02-04&checkout=2026-02-07&dest_id=-372490&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=1&hapos=1&sr_order=popularity&srpvid=d2b96e2758910c44&srepoch=1768578004&all_sr_blocks=1875601_243442175_2_0_0_308580&highlighted_blocks=1875601_243442175_2_0_0_308580&matching_block_id=1875601_243442175_2_0_0_308580&sr_pri_blocks=1875601_243442175_2_0_0_308580_88500&from=searchresults
On hotel page: Mayerling Schumann Barcelona, Barcelona (updated prices 2026)


In [23]:
# ============================================================
# BASIC COMMAND 10: Take a screenshot (useful for debugging!)
# ============================================================

driver.save_screenshot("booking_screenshot.png")
print("✅ Screenshot saved as 'booking_screenshot.png'")
print("   Check your working directory to see what the page looks like!")

✅ Screenshot saved as 'booking_screenshot.png'
   Check your working directory to see what the page looks like!


In [24]:
# ============================================================
# BASIC COMMAND 11: Close the browser
# ============================================================

# IMPORTANT: Always close the driver when you're done!
# Otherwise you'll have zombie Chrome processes eating memory

driver.quit()
print("✅ Browser closed!")

✅ Browser closed!


### 🧪 Try it yourself!

Before moving on, experiment:
1. Change the city typed into the search box from "Barcelona" to your assigned city
2. Try finding different elements on the page
3. Open a hotel's page and extract the data that you need.
4. Play around and get comfortable with Selenium. Explore other ways to interact with Booking.com.

---
## 🔧 Troubleshooting Guide

### Common Issues and Solutions:

**1. "No hotels extracted" / Empty results**
- Booking.com may have changed their HTML structure
- Run with a visible browser window (keep the `--headless=new` option commented out) to see what's happening
- Manually inspect the page and update selectors

**2. "Page load timeout"**
- Your internet connection may be slow
- Booking.com may be blocking you (wait 30 min and try again)
- Try increasing timeout: `driver.set_page_load_timeout(60)`

**3. CAPTCHA appearing**
- You've made too many requests. Make sure your behavior is "human-like".
    - *Scraping is not a race!*
- Wait 30-60 minutes before trying again
- Consider splitting work among group members and/or using mobile hotspot (different IPs)
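
A simple way to make the scraper less machine-like is to pause for a random amount of time between page loads. A minimal sketch; `human_pause` and its default bounds are assumptions, not values known to avoid blocking.

```python
import random
import time


def human_pause(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep for a random duration between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling `human_pause()` after each `driver.get(...)` or click replaces the fixed `time.sleep(2)` calls used earlier with irregular waits.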

**4. Prices showing as "None"**
- Some hotels don't show prices on search results
- The price selector may need updating
- These rows can be dropped: `df = df.dropna(subset=['price'])`
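
Price labels arrive as text, so they need converting to numbers before `dropna` is useful. The sketch below assumes labels that look like `€ 1,280`, that is, whole euros with `,` or `.` used only as a thousands separator; check this against what your page actually shows. `parse_price` is a hypothetical helper.

```python
import re
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Convert a price label to a float; return None when no digits are found.

    ASSUMPTION: whole-euro prices, with ',' or '.' only as thousands separators.
    Adjust this if your labels contain decimals.
    """
    if not raw:
        return None
    digits = re.sub(r"[^\d]", "", raw)  # drop currency symbol, spaces, separators
    return float(digits) if digits else None


print(parse_price("€ 1,280"))
print(parse_price("Price unavailable"))
```

Rows where `parse_price` returns `None` end up as missing values in the `price` column and can then be removed with the `dropna` call above.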

---
## 🏁 Summary: What You Need to Deliver

### 1. **A CSV file** named `groupX_EVENTNAME__booking_scraping_22DM014.csv` containing:
   - Hotel data for your event city and control city
   - Data for event week and control dates
   - All 7 required columns properly populated
   - Replace `X` and `EVENTNAME` with your group number and event name
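
A sketch of how the file name and the CSV could be produced with the standard library. `deliverable_name` and `write_rows` are hypothetical helpers, and writing the event name without spaces (e.g. `LasFallas`) is an assumption about the expected format.

```python
import csv

REQUIRED_COLUMNS = ["city", "hotel", "date", "price", "treatCity", "treatPeriod", "text"]


def deliverable_name(group: int, event: str, ext: str = "csv") -> str:
    """Fill in the file name pattern from the assignment."""
    return f"group{group}_{event}__booking_scraping_22DM014.{ext}"


def write_rows(rows: list[dict], path: str) -> None:
    """Write rows to CSV with exactly the required columns, in order."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # DictWriter raises ValueError if a row contains a column not listed here
        writer = csv.DictWriter(f, fieldnames=REQUIRED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


print(deliverable_name(1, "LasFallas"))
```

With a pandas DataFrame, `df[REQUIRED_COLUMNS].to_csv(path, index=False)` does the same job.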

### 2. **Reproducible notebook** of the scraping process with the same filename but `.ipynb`
   - Using Selenium
      - Or another, more advanced or creative solution
   - Include thought process in markdown cells

<br>

> **Given the widespread use of LLMs, solutions that demonstrably rely on minimal AI assistance for the thought process will carry greater value.**

<br>

---

**Good luck!**