Web Scraping Pipeline: Letterboxd Movie Data

This notebook automates crawling, parsing, and aggregating movie metadata from Letterboxd:
1. Setup & Imports  
    Standard libraries (`os`, `re`, `time`, `glob`)  
    Data handling with `pandas`  
    HTML parsing via `BeautifulSoup`  
    Browser automation with Selenium (`webdriver`, `Options`, `WebDriverWait`, etc.)

2. Helper Functions
    `create_output_dir(output_dir)`  
        Ensures the CSV output directory exists.  
    `setup_driver(chrome_driver_path, headless=True)`  
        Configures a Chrome WebDriver (headless by default) with a custom user-agent and a 20-second page-load timeout.  
    `get_movie_links(driver)`  
        Waits for film posters to load, then scrapes each tile’s URL and title.

3. Detail Page Parser
    `parse_movie_detail(driver, movie_url)`  
        Opens each movie in a new tab, waits for page load, and extracts:  
            Rating, Description, Poster URL, Release Year  
            Director, Genres, Cast, Producers, Studios, Duration  
    Falls back to `"N/A"` for any missing element, and returns an all-`"N/A"` record when the page fails to load.

4. CSV Saving
    `save_to_csv(data, output_dir, file_index)`  
        Saves a batch of scraped movie dicts to `movies_{file_index}.csv` and logs progress.

5. Main Scraping Loop
    `scrape_letterboxd_movies(driver, output_dir, max_movies=…)`  
        Iteratively visits the “Popular this week” paginated listing, collecting up to `max_movies`.  
        After every 100 entries (or at the end), calls `save_to_csv` to flush results.

6. Execution Entry Point
    `main()`  
        Defines `output_dir` & `chrome_driver_path`, initializes the driver, and invokes the scraper.  
        Ensures clean shutdown of the WebDriver.

7. Post-Processing
    Uses `glob` to load all `movies_*.csv` files from the dataset directory.  
    Concatenates them into a single DataFrame and writes the consolidated `movies.csv`. 


In [151]:
import os
import re
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [152]:
def create_output_dir(output_dir):
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(output_dir, exist_ok=True)

In [153]:
def setup_driver(chrome_driver_path, headless=True):
    options = Options()
    if headless:
        options.add_argument("--headless")
    options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")
    service = Service(chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=options)
    driver.set_page_load_timeout(20)
    return driver

In [154]:
def get_movie_links(driver):
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.film-poster"))
        )
    except Exception as e:
        print("Error waiting for movie elements:", e)
        return []
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    movie_elements = soup.find_all('div', class_='film-poster')
    links = []
    for elem in movie_elements:
        a_tag = elem.find('a')
        if a_tag and a_tag.has_attr('href'):
            movie_url = "https://letterboxd.com" + a_tag['href']
            
            title_tag = elem.find('img')
            # Guard against a missing <img> or a missing alt attribute.
            title = title_tag['alt'].strip() if title_tag and title_tag.has_attr('alt') else "N/A"

            links.append((movie_url, title))
    return links
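
The link-building step above concatenates the site root with each `href`. A minimal sketch of the same step using `urllib.parse.urljoin` from the standard library, which also leaves an already-absolute href untouched. `BASE_URL` and `absolute_url` are hypothetical names used only for illustration; they are not part of the notebook.

```python
from urllib.parse import urljoin

# Hypothetical constant; the notebook hard-codes this string inline.
BASE_URL = "https://letterboxd.com"

def absolute_url(href):
    """Resolve a tile href against the site root (illustrative helper)."""
    return urljoin(BASE_URL, href)

# A relative href is resolved against the root.
relative = absolute_url("/film/mickey-17/")
# An absolute href is returned unchanged instead of being prefixed twice.
already_absolute = absolute_url("https://letterboxd.com/film/mickey-17/")

print(relative)
print(already_absolute)
```

Both calls print the same URL, whereas plain string concatenation would duplicate the host if a tile ever carried an absolute link.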

In [155]:
def parse_movie_detail(driver, movie_url):
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[-1])
    
    try:
        driver.get(movie_url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "img.image"))
        )
    except Exception as e:
        print(f"Error loading detail page {movie_url}: {e}")
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        return {
            "Studios": "N/A",
            "Year": "N/A",
            "Genre": "N/A",
            "Director": "N/A",
            "Producers": "N/A",
            "Cast": "N/A",
            "AvgRating": "N/A",
            "Duration": "N/A",
            "Description": "N/A",
            "Poster URL": "N/A",
            "Page URL": movie_url
        }
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    rating_tag = soup.find('span', class_='average-rating')
    rating = rating_tag.text.strip() if rating_tag else "N/A"
    
    description_tag = soup.find('div', class_='truncate')
    description = description_tag.text.strip() if description_tag else "N/A"
    
    poster_img = soup.find('img', class_='image')
    poster_url = poster_img['src'] if poster_img and poster_img.has_attr('src') else "N/A"
    
    parent = soup.find('div', class_='details')
    metablock = parent.find('div', class_='metablock') if parent else None
    year_div = metablock.find('div', class_='releaseyear') if metablock else None
    year_tag = year_div.find('a') if year_div else None
    year_from_detail = year_tag.text.strip() if year_tag else "N/A"
    
    director_spans = soup.select('.credits .directorlist .prettify')
    directors = [span.get_text(strip=True) for span in director_spans]
    director = ", ".join(directors) if directors else "N/A"

    genres_div = soup.find('div', id='tab-genres')
    if genres_div:
        sluglist = genres_div.find('div', class_='text-sluglist capitalize')
        if sluglist:
            genre_links = sluglist.find_all('a', class_='text-slug')
            genres = [link.get_text(strip=True) for link in genre_links]
            genres_str = ", ".join(genres) if genres else "N/A"
        else:
            genres_str = "N/A"
    else:
        genres_str = "N/A"
        
    cast_div = soup.find('div', id='tab-cast')
    if cast_div:
        cast_list = cast_div.find('div', class_='cast-list text-sluglist')
        if cast_list:
            cast_links = cast_list.find_all('a', class_='text-slug tooltip')
            cast_names = [link.get_text(strip=True) for link in cast_links]
            cast_str = ", ".join(cast_names) if cast_names else "N/A"
        else:
            cast_str = "N/A"
    else:
        cast_str = "N/A"

    crew_section = soup.find('div', id='tab-crew')
    producer_str = "N/A"

    if crew_section:
        sluglist_blocks = crew_section.find_all('div', class_='text-sluglist')
        producers = []
        for block in sluglist_blocks:
            links = block.find_all('a', class_='text-slug')
            for link in links:
                href = link.get('href', '')
                if '/producer/' in href.lower():
                    producers.append(link.get_text(strip=True))
        if producers:
            producer_str = ", ".join(producers)
            
    studio_str = "N/A"
    tab_details = soup.find('div', id='tab-details')
    if tab_details:
        sluglist_blocks = tab_details.find_all('div', class_='text-sluglist')
        studios = []
        for block in sluglist_blocks:
            links = block.find_all('a', class_='text-slug')
            for link in links:
                href = link.get('href', '').lower()
                if 'studio' in href:
                    studios.append(link.get_text(strip=True))
        if studios:
            studio_str = ", ".join(studios)
            
    duration_str = "N/A"
    footer_p = soup.find('p', class_='text-link text-footer')
    if footer_p:
        raw_text = footer_p.get_text(strip=True)
        match = re.search(r'(\d+)\s*mins', raw_text)
        if match:
            duration_str = match.group(1)
    
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    
    return {
        "Studios": studio_str,
        "Year": year_from_detail,
        "Genre": genres_str,
        "Director": director,
        "Producers": producer_str,
        "Cast": cast_str,
        "AvgRating": rating,
        "Duration": duration_str,
        "Description": description,
        "Poster URL": poster_url,
        "Page URL": movie_url
    }
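
The duration step can be tried in isolation, without a browser. The sketch below wraps the same regular-expression idea in a hypothetical `extract_duration` helper. The optional `s` in the pattern is an assumption added here so that a singular "1 min" would also match, and the sample footer strings are made up, so real page text may differ.

```python
import re

def extract_duration(footer_text):
    """Return the runtime in minutes as a string, or "N/A" when none is found."""
    # Digits, optional whitespace (including non-breaking spaces), then "min" or "mins".
    match = re.search(r'(\d+)\s*mins?', footer_text)
    return match.group(1) if match else "N/A"

# Hypothetical footer strings, shaped like get_text(strip=True) output.
with_runtime = extract_duration("138\xa0minsMore atIMDbTMDb")
without_runtime = extract_duration("More atIMDbTMDb")

print(with_runtime)
print(without_runtime)
```

The helper returns a string rather than an integer so that the `"N/A"` fallback and the matched value share one type, as in the parser above.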

In [156]:
def save_to_csv(data, output_dir, file_index):
    filename = os.path.join(output_dir, f"movies_{file_index}.csv")
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    for movie in data:
        print(f"Added movie '{movie.get('Title', 'Unknown')}' to CSV: {filename}")
    print(f"Saved {len(data)} movies to CSV: {filename}")

In [157]:
def scrape_letterboxd_movies(driver, output_dir, max_movies=20):
    all_movies = []
    csv_count = 1
    page = 1

    while len(all_movies) < max_movies:
        url = f"https://letterboxd.com/films/popular/this/week/page/{page}/"
        print(f"Scraping page: {url}")
        driver.get(url)
        
        time.sleep(3)
        
        movie_links = get_movie_links(driver)
        if not movie_links:
            print(f"No movies found on page {page}.")
            break
        
        for movie_url, title in movie_links:
            if len(all_movies) >= max_movies:
                break
            print(f"Processing movie: {title}")
            details = parse_movie_detail(driver, movie_url)
            movie_data = {
                "Title": title,
                "Studios": details["Studios"],
                "Year": details["Year"],
                "Genre": details["Genre"],
                "Director": details["Director"],
                "Producers": details["Producers"],
                "Cast": details["Cast"],
                "AvgRating": details["AvgRating"],
                "Duration": details["Duration"],
                "Description": details["Description"],
                "Poster URL": details["Poster URL"],
                "Page URL": details["Page URL"]
            }

            all_movies.append(movie_data)
            
            if len(all_movies) % 100 == 0:
                save_to_csv(all_movies[-100:], output_dir, csv_count)
                csv_count += 1
        
        page += 1
        
    remainder = len(all_movies) % 100
    if remainder:
        save_to_csv(all_movies[-remainder:], output_dir, csv_count)

    print(f"Total movies scraped: {len(all_movies)}")
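
The flush logic in the loop above (write each full batch of 100, then write the remainder once at the end) can be checked without any scraping. This is a minimal sketch: `flush_in_batches` is a hypothetical generator that yields the `(file_index, batch)` pairs that would be passed to `save_to_csv`, and plain integers stand in for movie dicts.

```python
def flush_in_batches(items, batch_size=100):
    """Yield (file_index, batch) pairs in the order the scraping loop would flush them."""
    collected = []
    file_index = 1
    for item in items:
        collected.append(item)
        # Flush as soon as the running total reaches a multiple of batch_size.
        if len(collected) % batch_size == 0:
            yield file_index, collected[-batch_size:]
            file_index += 1
    # Flush whatever is left after the last full batch.
    remainder = len(collected) % batch_size
    if remainder:
        yield file_index, collected[-remainder:]

batches = list(flush_in_batches(range(250), batch_size=100))
print([(index, len(batch)) for index, batch in batches])
```

With 250 items this yields three batches of 100, 100, and 50 items, numbered 1 to 3, which matches the `movies_1.csv`, `movies_2.csv`, `movies_3.csv` naming.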

In [None]:
def main():
    output_dir = "/Users/tolubai/Desktop/csci_final_project/datasets"
    chrome_driver_path = '/Users/tolubai/Desktop/csci_final_project/chromedriver'
    
    create_output_dir(output_dir)
    driver = setup_driver(chrome_driver_path, headless=True)
    
    try:
        scrape_letterboxd_movies(driver, output_dir, max_movies=10000)
    finally:
        driver.quit()

In [159]:
main()

Scraping page: https://letterboxd.com/films/popular/this/week/page/1/
Processing movie: Adolescence
Processing movie: Mickey 17
Error loading detail page https://letterboxd.com/film/mickey-17/: Message: timeout: Timed out receiving message from renderer: 19.806
  (Session info: chrome=134.0.6998.166)
Stacktrace:
0   chromedriver                        0x000000010313b6c8 cxxbridge1$str$ptr + 2791212
1   chromedriver                        0x0000000103133c9c cxxbridge1$str$ptr + 2759936
2   chromedriver                        0x0000000102c85e30 cxxbridge1$string$len + 92928
3   chromedriver                        0x0000000102c7035c cxxbridge1$string$len + 4140
4   chromedriver                        0x0000000102c700d4 cxxbridge1$string$len + 3492
5   chromedriver                        0x0000000102c6ddec chromedriver + 187884
6   chromedriver                        0x0000000102c6ea44 chromedriver + 191044
7   chromedriver                        0x0000000102c7c618 cxxbridge1$string$len + 

In [165]:
import glob
csv_files = glob.glob('/Users/tolubai/Desktop/csci_final_project/datasets/movies_*.csv')

In [166]:
dfs = [pd.read_csv(file) for file in csv_files]

In [167]:
dfs

[                                     Title  \
 0                         We Were Soldiers   
 1             20,000 Leagues Under the Sea   
 2   Wise Guy: David Chase and The Sopranos   
 3               Deuce Bigalow: Male Gigolo   
 4                                    BRATS   
 ..                                     ...   
 95                               Mona Lisa   
 96         Butcher, Baker, Nightmare Maker   
 97                      A Shop for Killers   
 98                                   Udaan   
 99                               The Abyss   
 
                                               Studios    Year  \
 0   Wheelhouse Entertainment, Icon Entertainment I...  2002.0   
 1                             Walt Disney Productions  1954.0   
 2           Jigsaw Productions, HBO Documentary Films  2024.0   
 3   Quinta Communications, Touchstone Pictures, Ha...  1999.0   
 4       Network Entertainment, ABC News Studios, NEON  2024.0   
 ..                                   

In [168]:
combined_df = pd.concat(dfs, ignore_index=True)
combined_df.to_csv('movies.csv', index=False)

In [171]:
combined_df.to_csv('/Users/tolubai/Desktop/csci_final_project/datasets/movies.csv', index=False)
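
`glob.glob` returns paths in arbitrary order, so the rows of the combined file do not necessarily follow scrape order. If that order matters (an assumption; the notebook does not say), the batch files can be sorted by their numeric suffix before loading. The sketch below uses a hypothetical `batch_index` helper and empty placeholder files in a temporary directory.

```python
import glob
import os
import re
import tempfile

def batch_index(path):
    """Extract N from a movies_N.csv filename (illustrative helper)."""
    match = re.search(r'movies_(\d+)\.csv$', os.path.basename(path))
    return int(match.group(1)) if match else -1

# Create empty placeholder batch files; no real data is involved.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        open(os.path.join(tmp, f"movies_{n}.csv"), "w").close()
    # Sort numerically so movies_10.csv comes after movies_2.csv.
    files = sorted(glob.glob(os.path.join(tmp, "movies_*.csv")), key=batch_index)
    ordered_names = [os.path.basename(f) for f in files]

print(ordered_names)
```

A plain `sorted()` without the key would order the names lexicographically and place `movies_10.csv` before `movies_2.csv`.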

In [170]:
df = pd.read_csv("movies.csv")
df

,Title,Studios,Year,Genre,Director,Producers,Cast,AvgRating,Duration,Description,Poster URL,Page URL
0,We Were Soldiers,"Wheelhouse Entertainment, Icon Entertainment I...",2002.0,"Action, History, War",Randall Wallace,"Randall Wallace, Bruce Davey, Stephen McEveety...","Mel Gibson, Madeleine Stowe, Greg Kinnear, Sam...",3.4,138.0,The story of the first major battle of the Ame...,https://a.ltrbxd.com/resized/film-poster/4/6/0...,https://letterboxd.com/film/we-were-soldiers/
1,"20,000 Leagues Under the Sea",Walt Disney Productions,1954.0,"Family, Science Fiction, Adventure",Richard Fleischer,Walt Disney,"Kirk Douglas, James Mason, Paul Lukas, Peter L...",3.5,127.0,A ship sent to investigate a wave of mysteriou...,https://a.ltrbxd.com/resized/film-poster/5/1/8...,https://letterboxd.com/film/20000-leagues-unde...
2,Wise Guy: David Chase and The Sopranos,"Jigsaw Productions, HBO Documentary Films",2024.0,Documentary,Alex Gibney,"Ophelia Harutyunyan, Alex Gibney, Bethany Dett...","David Chase, Alex Gibney, Lorraine Bracco, Dre...",3.9,157.0,A portrait of celebrated filmmaker David Chase...,https://a.ltrbxd.com/resized/film-poster/1/2/3...,https://letterboxd.com/film/wise-guy-david-cha...
3,Deuce Bigalow: Male Gigolo,"Quinta Communications, Touchstone Pictures, Ha...",1999.0,"Comedy, Romance",Mike Mitchell,"Sidney Ganis, Barry Bernardi, Alex Siskin, Har...","Rob Schneider, William Forsythe, Eddie Griffin...",2.4,88.0,"Deuce Bigalow is a less than attractive, down ...",https://a.ltrbxd.com/resized/film-poster/4/6/2...,https://letterboxd.com/film/deuce-bigalow-male...
4,BRATS,"Network Entertainment, ABC News Studios, NEON",2024.0,Documentary,Andrew McCarthy,"Adrian Buitenhuis, Derik Murray","Andrew McCarthy, Emilio Estevez, Ally Sheedy, ...",2.7,92.0,"In the 1980s, Andrew McCarthy was part of a yo...",https://a.ltrbxd.com/resized/film-poster/9/0/0...,https://letterboxd.com/film/brats-2024/
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Tangled,"Walt Disney Animation Studios, Walt Disney Pic...",2010.0,"Family, Adventure, Animation","Nathan Greno, Byron Howard","Roy Conli, Aimee Scribner","Mandy Moore, Zachary Levi, Donna Murphy, Ron P...",4.0,100.0,"Feisty teenager Rapunzel, who has long and mag...",https://a.ltrbxd.com/resized/sm/upload/0w/h1/l...,https://letterboxd.com/film/tangled-2010/
9996,I Saw the TV Glow,"A24, Fruit Tree, Smudge Films, Hypnic Jerk, Ac...",2024.0,"Drama, Horror",Jane Schoenbrun,"Luca Intili, Emma Stone, Ali Herting, Dave McC...","Justice Smith, Jack Haven, Ian Foreman, Helena...",3.5,100.0,Teenager Owen is just trying to make it throug...,https://a.ltrbxd.com/resized/film-poster/7/7/2...,https://letterboxd.com/film/i-saw-the-tv-glow/
9997,The Apprentice,"Gidden Media, Head Gear Films, Metrol Technolo...",2024.0,"History, Drama",Ali Abbasi,"Ali Abbasi, Daniel Bekerman, Kristina Börjeson...","Sebastian Stan, Jeremy Strong, Maria Bakalova,...",3.6,122.0,"A young Donald Trump, eager to make his name a...",https://a.ltrbxd.com/resized/sm/upload/yj/8m/d...,https://letterboxd.com/film/the-apprentice-2024/
9998,The Shining,"Warner Bros. Pictures, Peregrine, Hawk Films, ...",1980.0,"Horror, Thriller",Stanley Kubrick,"Stanley Kubrick, Robert Fryer, Mary Lea Johnso...","Jack Nicholson, Shelley Duvall, Danny Lloyd, S...",4.2,144.0,Jack Torrance accepts a caretaker job at the O...,https://a.ltrbxd.com/resized/sm/upload/7s/m2/b...,https://letterboxd.com/film/the-shining/
