# TA Session: Web Scraping for Difference-in-Differences Analysis
## Scraping Hotel Prices from Booking.com

**Course:** Introduction to Text Mining and Natural Language Processing  
**Session:** TA: Project Design and Getting the Text     
**Date:** January 2026

---

#### *Disclaimers*

- The easiest way to run this notebook is to [install Astral's uv](https://docs.astral.sh/uv/getting-started/installation/).
- Then run `uv sync` to install all dependencies.
- Make sure you have Google Chrome installed as well, using the default installation method.

<br>

---
---

## Section 2: Robust example

---
## 🛠️ Setup and Imports

In [None]:
# Core imports
import pandas as pd
import numpy as np
import time
import random
import re
from datetime import datetime, timedelta

# Selenium imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import (
    TimeoutException, 
    NoSuchElementException,
    StaleElementReferenceException
)

# For automatic chromedriver management
from webdriver_manager.chrome import ChromeDriverManager

# For parsing HTML
from bs4 import BeautifulSoup

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ All imports successful!")

---
## 🌐 Understanding Booking.com URL Structure

Before we scrape, let's understand how Booking.com URLs work.

A typical search URL looks like:
```
https://www.booking.com/searchresults.html?ss=Barcelona&checkin=2026-06-04&checkout=2026-06-11&group_adults=2&no_rooms=1&selected_currency=EUR
```

Key parameters:
- `ss` = Search string (city name)
- `checkin` = Check-in date (YYYY-MM-DD)
- `checkout` = Check-out date (YYYY-MM-DD)
- `group_adults` = Number of adults
- `no_rooms` = Number of rooms
- `selected_currency` = Currency code
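
To check your reading of these parameters, you can parse the example URL back into its parts with the standard library. This is a minimal sketch; the variable names `url` and `params` are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

url = (
    "https://www.booking.com/searchresults.html"
    "?ss=Barcelona&checkin=2026-06-04&checkout=2026-06-11"
    "&group_adults=2&no_rooms=1&selected_currency=EUR"
)

# parse_qs maps each parameter name to a list of values; keep the first one
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key:>18} = {value}")
```

All values come back as strings, so `group_adults` is `"2"`, not `2`.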

In [None]:
def build_booking_url(city, checkin_date, nights=7, adults=2, rooms=1, currency="EUR"):
    """
    Build a Booking.com search URL.
    
    Parameters:
    -----------
    city : str
        City name to search
    checkin_date : str
        Check-in date in YYYY-MM-DD format
    nights : int
        Number of nights (default: 7)
    adults : int
        Number of adults (default: 2)
    rooms : int
        Number of rooms (default: 1)
    currency : str
        Currency code (default: EUR)
    
    Returns:
    --------
    str : Complete Booking.com search URL
    """
    # Calculate checkout date
    checkin = datetime.strptime(checkin_date, "%Y-%m-%d")
    checkout = checkin + timedelta(days=nights)
    checkout_date = checkout.strftime("%Y-%m-%d")
    
    # Build URL
    base_url = "https://www.booking.com/searchresults.html"
    params = {
        "ss": city,
        "checkin": checkin_date,
        "checkout": checkout_date,
        "group_adults": adults,
        "no_rooms": rooms,
        "selected_currency": currency
    }
    
    # Construct query string; urlencode escapes spaces and accents in city names
    from urllib.parse import urlencode
    query_string = urlencode(params)
    full_url = f"{base_url}?{query_string}"
    
    return full_url

# Test the function
test_url = build_booking_url("Barcelona", "2026-06-04")
print("Example URL:")
print(test_url)

---
## 🚗 Setting Up Selenium WebDriver

We use Selenium because Booking.com loads content dynamically with JavaScript. A simple HTTP request won't get us all the hotel listings.

### Important Configuration Notes:
- We run Chrome in **headless mode** (no visible browser window) for speed
- We add options to avoid detection as a bot
- We use random delays to be respectful to the server

In [None]:
def create_driver(headless=True):
    """
    Create and configure a Selenium Chrome WebDriver.
    
    Parameters:
    -----------
    headless : bool
        If True, run browser without GUI (faster, recommended for scraping)
        If False, show browser window (useful for debugging)
    
    Returns:
    --------
    webdriver.Chrome : Configured Chrome driver
    """
    chrome_options = Options()
    
    if headless:
        chrome_options.add_argument("--headless=new")  # New headless mode
    
    # Essential options to avoid detection and ensure stability
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1400,900")
    chrome_options.add_argument("--incognito")
    
    # Pretend to be a real browser
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Set a realistic user agent
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    chrome_options.add_argument(f"user-agent={user_agent}")
    
    # Request English-language pages so text-based selectors (e.g. 'Accept') match
    chrome_options.add_argument("--lang=en-US")
    
    # Create driver with automatic chromedriver management
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # Set page load timeout
    driver.set_page_load_timeout(30)
    
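    # Mask the webdriver flag on every new page. A plain execute_script call
    # would only patch the current blank document and be lost on driver.get().
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
    )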
    
    return driver

print("✅ Driver creation function defined!")
print("⚠️  We'll create the actual driver when we start scraping.")

---
## 🔍 Scraping Functions

Now we define the core scraping logic. This is the **most important part** of the notebook.

### Strategy:
1. Load the search results page
2. Handle the cookie consent popup (if it appears)
3. Scroll down to load all hotels (Booking.com uses infinite scroll)
4. Extract hotel information from each listing
5. Return structured data
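
Step 4 includes turning a displayed price such as `€ 1,234` into a number, which is easy to get subtly wrong. Below is a minimal sketch of that conversion that can be tested offline, without a browser. `parse_price` is a hypothetical helper name, and the sketch assumes search-result prices are whole amounts, so both `,` and `.` are treated as thousands separators.

```python
import re

def parse_price(price_text):
    """Return the first whole-number price in a display string, or None.

    Assumption: prices have no decimal part, so ',' and '.' are both
    thousands separators (e.g. '€ 1,234' and '€ 1.234' are both 1234).
    """
    match = re.search(r"\d[\d.,]*", price_text)
    if match is None:
        return None
    digits = re.sub(r"[.,]", "", match.group())
    return float(digits)

print(parse_price("€ 1,234"))   # 1234.0
print(parse_price("€ 1.234"))   # 1234.0
print(parse_price("Sold out"))  # None
```

If the site ever shows decimal prices, this assumption breaks and the separators need to be handled per locale.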

In [None]:
def random_delay(min_sec=1, max_sec=3):
    """
    Wait for a random amount of time to avoid detection.
    Being a good citizen: don't hammer the server!
    """
    delay = random.uniform(min_sec, max_sec)
    time.sleep(delay)
    return delay


def handle_cookie_popup(driver):
    """
    Try to dismiss the cookie consent popup if it appears.
    Booking.com shows this on first visit.
    """
    try:
        # Wait a bit for popup to appear
        time.sleep(2)
        
        # Try different possible button selectors
        cookie_button_selectors = [
            "button#onetrust-accept-btn-handler",
            "button[id*='accept']",
            "button[class*='cookie-accept']",
            "//button[contains(text(), 'Accept')]",
            "//button[contains(text(), 'Aceptar')]"
        ]
        
        for selector in cookie_button_selectors:
            try:
                if selector.startswith("//"):
                    # XPath selector
                    button = driver.find_element(By.XPATH, selector)
                else:
                    # CSS selector
                    button = driver.find_element(By.CSS_SELECTOR, selector)
                
                button.click()
                print("    ✅ Cookie popup dismissed")
                time.sleep(1)
                return True
            except NoSuchElementException:
                continue
        
        print("    ℹ️  No cookie popup found (or already dismissed)")
        return False
        
    except Exception as e:
        print(f"    ⚠️  Cookie handling error (non-critical): {e}")
        return False


def scroll_to_load_hotels(driver, max_scrolls=10, scroll_pause=2):
    """
    Scroll down the page to load more hotels.
    Booking.com uses lazy loading / infinite scroll.
    
    Parameters:
    -----------
    driver : webdriver
        Selenium WebDriver instance
    max_scrolls : int
        Maximum number of scroll actions
    scroll_pause : float
        Seconds to wait after each scroll for content to load
    """
    print(f"    📜 Scrolling to load hotels (max {max_scrolls} scrolls)...")
    
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    for i in range(max_scrolls):
        # Scroll down
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        
        # Wait for new content to load
        time.sleep(scroll_pause)
        
        # Check if we've reached the bottom
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print(f"    ✅ Reached bottom after {i+1} scrolls")
            break
        last_height = new_height
    
    # Scroll back to top
    driver.execute_script("window.scrollTo(0, 0);")
    time.sleep(1)


print("✅ Helper functions defined!")

In [None]:
def extract_hotel_data(driver):
    """
    Extract hotel information from the current search results page.
    
    Returns:
    --------
    list of dict : Each dict contains hotel name, price, and description
    
    Note: Booking.com's HTML structure changes frequently!
    If this function stops working, you may need to inspect the page
    and update the selectors.
    """
    hotels = []
    
    # Get page source and parse with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'lxml')
    
    # Find all hotel property cards
    # These selectors may need updating if Booking.com changes their HTML
    property_cards = soup.find_all('div', {'data-testid': 'property-card'})
    
    if not property_cards:
        # Try alternative selector
        property_cards = soup.find_all('div', {'class': re.compile(r'property-card')})
    
    if not property_cards:
        # Another fallback - look for hotel listing containers
        property_cards = soup.find_all('div', {'class': re.compile(r'sr_property_block')})
    
    print(f"    📊 Found {len(property_cards)} hotel cards")
    
    for card in property_cards:
        try:
            hotel_data = {
                'hotel': None,
                'price': None,
                'text': None
            }
            
            # ----- Extract Hotel Name -----
            # Try multiple selectors for hotel name
            name_element = card.find('div', {'data-testid': 'title'})
            if not name_element:
                name_element = card.find('h3')
            if not name_element:
                name_element = card.find('span', {'class': re.compile(r'hotel-name')})
            if not name_element:
                name_element = card.find('a', {'class': re.compile(r'hotel_name_link')})
            
            if name_element:
                hotel_data['hotel'] = name_element.get_text(strip=True)
            
            # ----- Extract Price -----
            # Price is usually in a span with specific data-testid or class
            price_element = card.find('span', {'data-testid': 'price-and-discounted-price'})
            if not price_element:
                price_element = card.find('span', {'class': re.compile(r'price')})
            if not price_element:
                # Sometimes price is in a div
                price_element = card.find('div', {'data-testid': re.compile(r'price')})
            
            if price_element:
                price_text = price_element.get_text(strip=True)
                # Extract the first number from the price string (e.g., "€ 1,234" -> 1234).
                # Assumes whole-number prices, so both ',' and '.' are treated as
                # thousands separators and stripped. Anchoring on a digit avoids
                # matching a stray '.' or ',' that appears before the price.
                price_match = re.search(r'\d[\d.,]*', price_text)
                if price_match:
                    try:
                        hotel_data['price'] = float(re.sub(r'[.,]', '', price_match.group()))
                    except ValueError:
                        hotel_data['price'] = None
            
            # ----- Extract Description/Text -----
            # This is typically a short preview or the hotel type
            # We collect multiple text elements and combine them
            text_parts = []
            
            # Hotel type (e.g., "Hotel", "Apartment", "Hostel")
            type_elem = card.find('span', {'data-testid': 'accommodation-type'})
            if type_elem:
                text_parts.append(type_elem.get_text(strip=True))
            
            # Location info
            location_elem = card.find('span', {'data-testid': 'address'})
            if location_elem:
                text_parts.append(location_elem.get_text(strip=True))
            
            # Distance from center
            distance_elem = card.find('span', {'data-testid': 'distance'})
            if distance_elem:
                text_parts.append(distance_elem.get_text(strip=True))
            
            # Review summary or highlights
            review_elem = card.find('div', {'class': re.compile(r'review')})
            if review_elem:
                text_parts.append(review_elem.get_text(strip=True))
            
            # Any other descriptive text in the card
            for p_tag in card.find_all('p'):
                p_text = p_tag.get_text(strip=True)
                if len(p_text) > 20:  # Only meaningful text
                    text_parts.append(p_text)
            
            # Combine all text parts
            hotel_data['text'] = ' | '.join(text_parts) if text_parts else None
            
            # ----- Validate and Add -----
            # Only add if we got at least a name
            if hotel_data['hotel']:
                hotels.append(hotel_data)
                
        except Exception as e:
            print(f"    ⚠️  Error parsing one hotel card: {e}")
            continue
    
    return hotels


print("✅ Hotel extraction function defined!")

In [None]:
def scrape_booking_search(driver, city, checkin_date, max_scrolls=5):
    """
    Main function to scrape hotel data for a single city and date.
    
    Parameters:
    -----------
    driver : webdriver
        Selenium WebDriver instance (already created)
    city : str
        City name to search
    checkin_date : str
        Check-in date in YYYY-MM-DD format
    max_scrolls : int
        Maximum scroll attempts to load more hotels
    
    Returns:
    --------
    list of dict : Hotel data with keys 'hotel', 'price', 'text'
    """
    print(f"\n🔍 Scraping: {city} | Check-in: {checkin_date}")
    print("-" * 50)
    
    # Build URL
    url = build_booking_url(city, checkin_date)
    print(f"    🌐 URL: {url[:80]}...")
    
    # Load page
    try:
        driver.get(url)
        print("    ✅ Page loaded")
    except TimeoutException:
        print("    ❌ Page load timeout!")
        return []
    
    # Handle cookie popup
    handle_cookie_popup(driver)
    
    # Random delay to be polite
    random_delay(2, 4)
    
    # Scroll to load more hotels
    scroll_to_load_hotels(driver, max_scrolls=max_scrolls)
    
    # Extract hotel data
    hotels = extract_hotel_data(driver)
    
    print(f"    ✅ Extracted {len(hotels)} hotels")
    
    # Add city and date to each record
    for hotel in hotels:
        hotel['city'] = city
        hotel['date'] = checkin_date
    
    return hotels


print("✅ Main scraping function defined!")

---
## 🚀 DEMO - Scraping One Hotel Search

Let's run a live demonstration. We'll scrape hotels in **Barcelona** for **one check-in date**.

### ⚠️ Important Notes:
1. **Run this cell carefully** - it launches a Chrome process, which uses memory and CPU even in headless mode
2. **Be patient** - scraping takes time (30-60 seconds per search)
3. **Don't run too many times** - Booking.com may block you if you make too many requests
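
When you move from one search to several, notes 2 and 3 suggest pausing between requests and backing off when a search comes back empty. The following is a rough sketch under stated assumptions, not a tested recipe for Booking.com: `scrape_with_backoff` and `fake_scrape` are hypothetical names, the second check-in date is made up for illustration, and treating an empty result as a possible block is a guess.

```python
import random
import time

def scrape_with_backoff(scrape_fn, searches, base_delay=5.0, max_retries=3, sleep=time.sleep):
    """Call scrape_fn(city, checkin) for each search, pausing after every request.

    An empty result is treated as a possible block (an assumption), and the
    pause doubles on each retry: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    results = {}
    for city, checkin in searches:
        hotels = []
        for attempt in range(max_retries):
            hotels = scrape_fn(city, checkin)
            # Jitter keeps the request timing from being perfectly regular
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
            if hotels:
                break
        results[(city, checkin)] = hotels
    return results


# Stand-in for the real scraper so the sketch runs without a browser
def fake_scrape(city, checkin):
    return [{"hotel": f"Demo Hotel {city}", "price": 100.0, "date": checkin}]


searches = [("Barcelona", "2026-06-02"), ("Barcelona", "2026-06-09")]
demo = scrape_with_backoff(fake_scrape, searches, base_delay=0.01)
print(f"Searches completed: {len(demo)}")
```

With a real driver, `scrape_fn` could be `lambda city, checkin: scrape_booking_search(driver, city, checkin)`.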

In [None]:
# ===========================================
# DEMO: Scrape ONE search (Barcelona, one date)
# ===========================================

# Demo parameters
demo_city = "Barcelona"
demo_checkin = "2026-06-02"  # Primavera Sound week

print("=" * 60)
print("🎬 LIVE DEMO: Scraping Booking.com")
print("=" * 60)
print(f"City: {demo_city}")
print(f"Check-in: {demo_checkin}")
print(f"Nights: 7")
print("=" * 60)

# Create driver
# Set headless=False if you want to SEE the browser (useful for debugging)
print("\n🚗 Creating WebDriver...")
driver = create_driver(headless=True)  # Change to False to see the browser
print("✅ Driver created!")

try:
    # Run the scraping
    demo_hotels = scrape_booking_search(driver, demo_city, demo_checkin, max_scrolls=3)
    
    # Show results
    print("\n" + "=" * 60)
    print("📊 DEMO RESULTS")
    print("=" * 60)
    
    if demo_hotels:
        # Convert to DataFrame for nice display
        demo_df = pd.DataFrame(demo_hotels)
        print(f"\nTotal hotels scraped: {len(demo_df)}")
        print("\nFirst 5 hotels:")
        print(demo_df[['hotel', 'price', 'city', 'date']].head())
        
        print("\nSample hotel with description:")
        print("-" * 40)
        sample = demo_hotels[0]
        print(f"Hotel: {sample['hotel']}")
        print(f"Price: €{sample['price']}")
        print(f"Text: {sample['text'][:200] if sample['text'] else 'N/A'}...")
    else:
        print("❌ No hotels extracted. The HTML structure may have changed.")
        print("   You may need to inspect the page and update selectors.")

finally:
    # Always close the driver!
    driver.quit()
    print("\n🚗 Driver closed.")

---
<br>

> #### Try to use code that's at least slightly different from this for your final product ;)

---