# Oscars Best Picture Stats

This scraper retrieves five types of ratings (TMDB, Rotten Tomatoes, Letterboxd, IMDB, Metacritic) from four reputable sites for the 97 films awarded an Oscar for Best Picture. 

It uses nine functions to collect URLs, scrape film details into lists, and write those lists to CSV files.

In [34]:
from bs4 import BeautifulSoup
import requests
import time
import csv

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'}

## `get_tmdb_deets(main_url, list_num)`

This function grabs film details (title, release year, TMDB score) from one list on a TMDB page and returns them in a list. (No per-film URL is needed.) A list on TMDB can display only 50 items at a time. By default, the page displays the films in descending order by release year. By altering the URL of the page, the list can also be viewed in ascending order.

The function is called twice: once to grab the first 50 items in descending order and once to grab the first 47 items in ascending order, which together cover all 97 films.

In [61]:
def get_tmdb_deets(main_url, list_num):
    page = requests.get(main_url, headers=hdr)
    soup = BeautifulSoup(page.text, 'html.parser')
    time.sleep(2)

    list_area = soup.find(class_="list_items")
    headings = list_area.find_all(class_=['text-md', 'font-bold', 'text-black', 'cursor-pointer', 'hover:underline', 'underline-offset-2'])

    score_divs = soup.find_all(class_="user_score_chart")

    release_dates = soup.find_all(class_="bg-date-tile")

    titles = []
    years = []
    scores = []
    
    # Get titles from 'headings' (ignoring non-titles) and scores from 'score_divs'
    # Every other item in 'headings' is the actual title. All others are numbers or non-titles
    for i in range(len(headings)):
        if i % 2 != 0 and headings[i].text.strip() != "Load More" and len(headings[i].text) != 2:
            titles.append(headings[i].text)

    for release_date in release_dates:
        date = release_date.text.strip()
        comma = date.find(",")
        year = date[comma+2:]
        years.append(year)

    for score_div in score_divs:
        score = score_div.attrs['data-percent']
        scores.append(score)

    # Combine each film's title, year, and score into one list
    # Append that list to the final list to be returned
    deet_list = []

    # TMDB shows at most 50 items at a time; list_num limits how many of them
    # are kept when fewer than 50 must be scraped
    for i in range(list_num):
        trio = [titles[i], years[i], scores[i]]
        deet_list.append(trio)

    return deet_list
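
The year extraction inside the function depends on the date text containing a comma followed by the year. A minimal sketch of that step as a standalone helper is shown here, assuming the date tile text looks something like `Oct 18, 2024`. The helper name `extract_year` and the sample strings are illustrative assumptions, not values taken from TMDB.

```python
def extract_year(date_text):
    """Return the text after the last comma in a date string (the year).

    Hypothetical helper; assumes text such as "Oct 18, 2024".
    Falls back to the whole stripped string if there is no comma.
    """
    date_text = date_text.strip()
    # rpartition splits on the last comma: (before, ",", after)
    _, comma, after = date_text.rpartition(",")
    return after.strip() if comma else date_text


# Made-up sample strings, including one without a comma
sample_dates = ["  Oct 18, 2024 ", "Jul 21, 2023", "1927"]
years = [extract_year(d) for d in sample_dates]
print(years)
```

Unlike slicing at `comma+2`, this version does not return a truncated string when the comma is missing, which may make unexpected date formats easier to spot.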

In [62]:
# Scrape first 50 items
url = 'https://www.themoviedb.org/list/8526090-oscars-best-pictures-2025?view=list&sort_by=primary_release_date.desc'
top_half_tmdb_list = get_tmdb_deets(url, 50)
# URL that reverse sorts TMDB list to show remaining 47 items first
url = 'https://www.themoviedb.org/list/8526090-oscars-best-pictures-2025?view=list&sort_by=primary_release_date.asc'
bottom_half_tmdb_list = get_tmdb_deets(url, 47)

# Glue top_half and bottom_half lists into one list
# The bottom half was scraped in ascending order, so it is reversed to keep
# the combined list in descending order by release year
tmdb_list = top_half_tmdb_list + bottom_half_tmdb_list[::-1]

[['Anora', '2024', '71'], ['Oppenheimer', '2023', '81'], ['Everything Everywhere All at Once', '2022', '78'], ['CODA', '2021', '79'], ['Nomadland', '2021', '72'], ['Parasite', '2019', '85'], ['Green Book', '2018', '82'], ['The Shape of Water', '2017', '72'], ['Moonlight', '2016', '74'], ['Spotlight', '2015', '78'], ['Birdman or (The Unexpected Virtue of Ignorance)', '2014', '75'], ['12 Years a Slave', '2013', '79'], ['Argo', '2012', '73'], ['The Artist', '2012', '75'], ["The King's Speech", '2010', '77'], ['Slumdog Millionaire', '2008', '77'], ['The Hurt Locker', '2009', '73'], ['No Country for Old Men', '2007', '79'], ['The Departed', '2006', '82'], ['Crash', '2005', '72'], ['Million Dollar Baby', '2004', '80'], ['The Lord of the Rings: The Return of the King', '2003', '85'], ['Chicago', '2002', '71'], ['A Beautiful Mind', '2001', '79'], ['Gladiator', '2000', '82'], ['American Beauty', '1999', '80'], ['Shakespeare in Love', '1998', '69'], ['Titanic', '1997', '79'], ['The English Patie
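
Because the second pass is scraped in ascending order, it has to be reversed before it is appended to the first pass. A small sketch of that merge with made-up placeholder rows follows; the helper name `merge_halves` and the sample data are assumptions for illustration only.

```python
def merge_halves(top_half, bottom_half_ascending):
    """Combine a descending first half with an ascending second half.

    Returns a new list in descending order without modifying the inputs.
    """
    return list(top_half) + list(reversed(bottom_half_ascending))


# Placeholder rows in the same shape as the scraper's output: [title, year, score]
top = [["Film E", "2005", "80"], ["Film D", "2004", "70"], ["Film C", "2003", "75"]]
bottom = [["Film A", "2001", "60"], ["Film B", "2002", "65"]]

merged = merge_halves(top, bottom)
print([row[0] for row in merged])
```

Building a new list rather than appending to the first half leaves both scraped halves untouched, which can help when re-running a notebook cell.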

## `write_tmdb_csv(my_list)`

This function writes the list returned from `get_tmdb_deets()` into a CSV file. The CSV file is to be joined with the three other CSV files that this scraper produces.

In [64]:
def write_tmdb_csv(my_list):
    # The with block closes the file automatically, even if a write fails
    with open('tmdb.csv', 'w', newline='', encoding='utf-8') as csvfile:
        c = csv.writer(csvfile)

        # Write column headings in a row
        c.writerow(['title', 'year', 'tmdb_rating'])

        # Write one row per film
        for item in my_list:
            c.writerow(item)

# Run function to create CSV file
write_tmdb_csv(tmdb_list)

## List of Best Picture URLs on Rotten Tomatoes
Rotten Tomatoes has no main page from which to scrape the film URLs, so the URLs were manually put into a Python list instead.

In [74]:
rt_urls = ['https://www.rottentomatoes.com/m/anora', 'https://www.rottentomatoes.com/m/oppenheimer_2023', 
           'https://www.rottentomatoes.com/m/everything_everywhere_all_at_once', 'https://www.rottentomatoes.com/m/coda_2021',
           'https://www.rottentomatoes.com/m/nomadland', 'https://www.rottentomatoes.com/m/parasite_2019', 
           'https://www.rottentomatoes.com/m/green_book', 'https://www.rottentomatoes.com/m/the_shape_of_water_2017',
           'https://www.rottentomatoes.com/m/moonlight_2016', 'https://www.rottentomatoes.com/m/spotlight_2015',
           'https://www.rottentomatoes.com/m/birdman_2014', 'https://www.rottentomatoes.com/m/12_years_a_slave', 
           'https://www.rottentomatoes.com/m/argo_2012', 'https://www.rottentomatoes.com/m/the_artist', 'https://www.rottentomatoes.com/m/the_kings_speech',
           'https://www.rottentomatoes.com/m/the_hurt_locker', 'https://www.rottentomatoes.com/m/slumdog_millionaire',
           'https://www.rottentomatoes.com/m/no_country_for_old_men', 'https://www.rottentomatoes.com/m/departed',
           'https://www.rottentomatoes.com/m/million_dollar_baby', 'https://www.rottentomatoes.com/m/1144992-crash',
           'https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king', 'https://www.rottentomatoes.com/m/chicago',
           'https://www.rottentomatoes.com/m/beautiful_mind', 'https://www.rottentomatoes.com/m/gladiator',
           'https://www.rottentomatoes.com/m/american_beauty', 'https://www.rottentomatoes.com/m/shakespeare_in_love',
           'https://www.rottentomatoes.com/m/titanic', 'https://www.rottentomatoes.com/m/english_patient', 
           'https://www.rottentomatoes.com/m/1065684-braveheart', 'https://www.rottentomatoes.com/m/forrest_gump',
           'https://www.rottentomatoes.com/m/schindlers_list', 'https://www.rottentomatoes.com/m/1041911-unforgiven',
           'https://www.rottentomatoes.com/m/silence_of_the_lambs', 'https://www.rottentomatoes.com/m/dances_with_wolves',
           'https://www.rottentomatoes.com/m/driving_miss_daisy', 'https://www.rottentomatoes.com/m/rain_man',
           'https://www.rottentomatoes.com/m/last_emperor', 'https://www.rottentomatoes.com/m/platoon',
           'https://www.rottentomatoes.com/m/out_of_africa', 'https://www.rottentomatoes.com/m/amadeus',
           'https://www.rottentomatoes.com/m/terms_of_endearment', 'https://www.rottentomatoes.com/m/gandhi',
           'https://www.rottentomatoes.com/m/chariots_of_fire', 'https://www.rottentomatoes.com/m/ordinary_people',
           'https://www.rottentomatoes.com/m/kramer_vs_kramer', 'https://www.rottentomatoes.com/m/the_deer_hunter',
           'https://www.rottentomatoes.com/m/annie_hall', 'https://www.rottentomatoes.com/m/1017776-rocky',
           'https://www.rottentomatoes.com/m/one_flew_over_the_cuckoos_nest', 'https://www.rottentomatoes.com/m/godfather_part_ii',
           'https://www.rottentomatoes.com/m/1020130-sting', 'https://www.rottentomatoes.com/m/the_godfather',
           'https://www.rottentomatoes.com/m/french_connection', 'https://www.rottentomatoes.com/m/patton',
           'https://www.rottentomatoes.com/m/midnight_cowboy', 'https://www.rottentomatoes.com/m/oliver',
           'https://www.rottentomatoes.com/m/in_the_heat_of_the_night', 'https://www.rottentomatoes.com/m/1013162-man_for_all_seasons',
           'https://www.rottentomatoes.com/m/sound_of_music', 'https://www.rottentomatoes.com/m/my_fair_lady',
           'https://www.rottentomatoes.com/m/tom_jones', 'https://www.rottentomatoes.com/m/lawrence_of_arabia',
           'https://www.rottentomatoes.com/m/west_side_story', 'https://www.rottentomatoes.com/m/1001115-apartment',
           'https://www.rottentomatoes.com/m/benhur', 'https://www.rottentomatoes.com/m/gigi',
           'https://www.rottentomatoes.com/m/bridge_on_the_river_kwai', 'https://www.rottentomatoes.com/m/1001193-around_the_world_in_80_days',
           'https://www.rottentomatoes.com/m/1013427-marty', 'https://www.rottentomatoes.com/m/on_the_waterfront',
           'https://www.rottentomatoes.com/m/1007931-from_here_to_eternity', 'https://www.rottentomatoes.com/m/greatest_show_on_earth',
           'https://www.rottentomatoes.com/m/american_in_paris','https://www.rottentomatoes.com/m/1000626-all_about_eve',
           'https://www.rottentomatoes.com/m/1000654-all_the_kings_men', 'https://www.rottentomatoes.com/m/1009123-hamlet',
           'https://www.rottentomatoes.com/m/gentlemans_agreement', 'https://www.rottentomatoes.com/m/best_years_of_our_lives',
           'https://www.rottentomatoes.com/m/lost_weekend', 'https://www.rottentomatoes.com/m/going_my_way',
           'https://www.rottentomatoes.com/m/1003707-casablanca', 'https://www.rottentomatoes.com/m/mrs_miniver',
           'https://www.rottentomatoes.com/m/how_green_was_my_valley', 'https://www.rottentomatoes.com/m/1017293-rebecca',
           'https://www.rottentomatoes.com/m/gone_with_the_wind', 'https://www.rottentomatoes.com/m/you_cant_take_it_with_you_1938',
           'https://www.rottentomatoes.com/m/life_of_emile_zola', 'https://www.rottentomatoes.com/m/the_great_ziegfeld',
           'https://www.rottentomatoes.com/m/1014481-mutiny_on_the_bounty', 'https://www.rottentomatoes.com/m/it_happened_one_night',
           'https://www.rottentomatoes.com/m/cavalcade', 'https://www.rottentomatoes.com/m/grand_hotel', 
           'https://www.rottentomatoes.com/m/1004177-cimarron', 'https://www.rottentomatoes.com/m/all_quiet_on_the_western_front',
           'https://www.rottentomatoes.com/m/broadway_melody', 'https://www.rottentomatoes.com/m/wings']           

## `get_rt_deets(movie_url)`

This function gets the title and "Tomatometer" (Rotten Tomatoes rating) from a film page and returns them, along with the URL of the film page itself, in a list. The function is called in a loop over all 97 URLs (film pages); each returned list is appended to a larger list that is later written into a CSV file.

In [65]:
def get_rt_deets(movie_url):
    page = requests.get(movie_url, headers=hdr)
    time.sleep(2)
    soup = BeautifulSoup(page.text, 'html.parser')

    score_div = soup.find(class_="media-scorecard")
    tomatometer = score_div.find('rt-text').text

    # Get title for tracking purposes
    heading = soup.find(id="hero-wrap")
    title = heading.find('rt-button').attrs['data-title']

    deet_list = [title, tomatometer, movie_url]

    return deet_list

In [66]:
# Create list containing sub-lists for each film and its rating
rt_list = []
for url in rt_urls:
    deets = get_rt_deets(url)
    rt_list.append(deets)
    time.sleep(1)
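
A single page whose markup differs from the rest would raise an exception and stop the loop above partway through the 97 URLs. One possible way to keep going is to catch the error per URL and record which pages failed. The sketch below uses a stand-in `fake_fetch` function instead of `get_rt_deets()` so that it runs without network access; `scrape_all`, `fake_fetch`, and the `example.com` URLs are hypothetical.

```python
def scrape_all(urls, fetch):
    """Call fetch(url) for every URL, collecting results and failures separately."""
    results = []
    failed = []
    for url in urls:
        try:
            results.append(fetch(url))
        except (AttributeError, KeyError) as err:
            # AttributeError: a find() call returned None
            # KeyError: an expected HTML attribute was missing
            failed.append((url, repr(err)))
    return results, failed


def fake_fetch(url):
    """Stand-in for a real scraping function; fails for one URL on purpose."""
    if url.endswith("/broken"):
        raise AttributeError("'NoneType' object has no attribute 'find'")
    return [url.rsplit("/", 1)[-1], "90%", url]


urls = ["https://example.com/m/film_a", "https://example.com/m/broken"]
results, failed = scrape_all(urls, fake_fetch)
print(results)
print([url for url, _ in failed])
```

The failed URLs can then be inspected or retried by hand instead of restarting the whole run.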

## `write_rt_csv(my_list)`

This function writes the list built by calling `get_rt_deets()` on every Rotten Tomatoes film page into a CSV file. The CSV file is to be joined with the three other CSV files that this scraper produces.

In [67]:
def write_rt_csv(my_list):
    # The with block closes the file automatically, even if a write fails
    with open('tomatometers.csv', 'w', newline='', encoding='utf-8') as csvfile:
        c = csv.writer(csvfile)

        # Write column headings in a row
        c.writerow(['title', 'tomatometer', 'rt_url'])

        # Write one row per film
        for item in my_list:
            c.writerow(item)

# Run function to create CSV file
write_rt_csv(rt_list)

## `get_letterboxd_urls(main_url)`

This function grabs the URLs from a page displaying a list of all Oscar Best Pictures' Letterboxd pages and then returns them in a Python list.

In [68]:
def get_letterboxd_urls(main_url):
    page = requests.get(main_url, headers=hdr)
    soup = BeautifulSoup(page.text, 'html.parser')
    url_list = []

    # Get posters
    # No delay is needed in this loop: the page was already downloaded above
    posters = soup.find_all(class_="poster-container")
    for poster in posters:
        poster_div = poster.find('div')
        partial_url = poster_div.attrs['data-target-link']
        if partial_url != "":
            url = "https://letterboxd.com" + partial_url
            url_list.append(url)

    return url_list
        
url = "https://letterboxd.com/oscars/list/oscar-winning-films-best-picture/"
letterboxd_urls = get_letterboxd_urls(url)
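
The function builds absolute URLs by concatenating the site root with each relative link and skipping empty links. The same step could be sketched with the standard library's `urllib.parse.urljoin`, as below. The helper name `build_film_urls` and the sample paths are placeholders, not links taken from Letterboxd.

```python
from urllib.parse import urljoin

BASE_URL = "https://letterboxd.com"


def build_film_urls(partial_urls, base_url=BASE_URL):
    """Turn site-relative paths into absolute URLs, skipping empty values."""
    return [urljoin(base_url, partial) for partial in partial_urls if partial]


# Placeholder paths in the shape the function above appears to expect
sample_paths = ["/film/example-film/", "", "/film/another-film/"]
film_urls = build_film_urls(sample_paths)
print(film_urls)
```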

## `get_lb_rating(movie_url)`

This function gets the title and Letterboxd rating from a film page and returns them, along with the URL of the film page itself, in a list. The function is called in a loop over all 97 URLs (film pages); each returned list is appended to a larger list that is later written into a CSV file.

In [69]:
def get_lb_rating(movie_url):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(movie_url)
    time.sleep(1)
    page = driver.page_source
    soup = BeautifulSoup(page, "html.parser")
    trio = []

    poster = soup.find(class_="film-poster")
    title = poster.attrs['data-film-name']
    trio.append(title)
    rating = soup.find("span", class_="average-rating").text.strip()
    trio.append(rating)
    trio.append(movie_url)

    # Quit the driver
    driver.quit()
    
    return trio

In [70]:
# Get title and rating for each of all 97 films
rating_list = []
for letterboxd_url in letterboxd_urls:
    rating = get_lb_rating(letterboxd_url)
    rating_list.append(rating)

## `write_lb_csv(my_list)`

This function writes the list built by calling `get_lb_rating()` on every Letterboxd film page into a CSV file. The CSV file is to be joined with the three other CSV files that this scraper produces.

In [72]:
def write_lb_csv(my_list):
    # The with block closes the file automatically, even if a write fails
    with open('letterboxd.csv', 'w', newline='', encoding='utf-8') as csvfile:
        c = csv.writer(csvfile)

        # Write column headings in a row
        c.writerow(['title', 'letterboxd_rating', 'lb_url'])

        # Write one row per film
        for item in my_list:
            c.writerow(item)

# Run function to create CSV file
write_lb_csv(rating_list)

## `get_imdb_deets(my_url)`

This function gets the title, directors, IMDB rating, and Metascore (Metacritic rating) for each film from one list on an IMDB page. Each film's details are put into a list that is then appended into one large list. This function returns the large list. 

In [10]:
def get_imdb_deets(my_url):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(my_url)
    time.sleep(2)
    
    # IMDB loads only 25 items at a time. Must scroll to bottom to get all 97 items
    # Find footer (find_element(): https://selenium-python.readthedocs.io/locating-elements.html)
    footer = driver.find_element(By.CLASS_NAME, "imdb-footer")
    # Scroll to footer (tutorial: https://selenium-python.readthedocs.io/api.html#selenium.webdriver.common.action_chains.ActionChains)
    actions = ActionChains(driver)
    actions.scroll_to_element(footer)
    actions.perform()
    
    # Now get the page source
    page = driver.page_source
    soup = BeautifulSoup(page, 'html.parser')

    # Get movie section
    movie_section = soup.find(class_="ipc-metadata-list--base")

    # Get movie summaries
    summaries = movie_section.find_all(class_="ipc-metadata-list-summary-item")

    # Create a list that is to be returned where each item contains title, director(s), rating, and Metascore from each movie summary
    detail_list = []
    for summary in summaries:
        # Get and clean title (drop everything up to the first period and the space after it)
        unclean_title = summary.find(class_="ipc-title__text")
        period = unclean_title.text.find(".")
        title = unclean_title.text[period+2:]

        # Get director(s), joined with commas however many there are
        directors = summary.find_all(class_="sttd-director-item")
        dir_string = ", ".join(director.text for director in directors)

        # Get IMDB user rating
        rating = summary.find(class_="ipc-rating-star--rating").text

        # Get Metascore. Use "N/A" if no Metascore is available
        # (find() returns None in that case, so .text raises AttributeError)
        try:
            metascore = summary.find(class_="metacritic-score-box").text
        except AttributeError:
            metascore = "N/A"

        mini_list = [title, dir_string, rating, metascore]
        detail_list.append(mini_list)

    # Quit the driver
    driver.quit()
    
    return detail_list
    

url = "https://www.imdb.com/list/ls596337274/?sort=release_date%2Cdesc"
imdb_list = get_imdb_deets(url)

In [11]:
print(imdb_list)

[['Anora', 'Sean Baker', '7.5', '91'], ['Oppenheimer', 'Christopher Nolan', '8.3', '90'], ['Everything Everywhere All at Once', 'Daniel Kwan, Daniel Scheinert', '7.8', '81'], ['CODA', 'Sian Heder', '8.0', '72'], ['Nomadland', 'Chloé Zhao', '7.3', '87'], ['Parasite', 'Bong Joon Ho', '8.5', '97'], ['Green Book', 'Peter Farrelly', '8.2', '69'], ['The Shape of Water', 'Guillermo del Toro', '7.3', '87'], ['Moonlight', 'Barry Jenkins', '7.4', '99'], ['Spotlight', 'Tom McCarthy', '8.1', '93'], ['Birdman or (The Unexpected Virtue of Ignorance)', 'Alejandro G. Iñárritu', '7.7', '87'], ['12 Years a Slave', 'Steve McQueen', '8.1', '96'], ['Argo', 'Ben Affleck', '7.7', '86'], ['The Artist', 'Michel Hazanavicius', '7.8', '89'], ["The King's Speech", 'Tom Hooper', '8.0', '88'], ['The Hurt Locker', 'Kathryn Bigelow', '7.5', '95'], ['Slumdog Millionaire', 'Danny Boyle, Loveleen Tandan', '8.0', '84'], ['No Country for Old Men', 'Ethan Coen, Joel Coen', '8.2', '92'], ['The Departed', 'Martin Scorsese', 
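
The title cleaning and director joining done inside `get_imdb_deets()` can be sketched as two small helpers that are easy to check in isolation. This assumes list titles are prefixed with a number, a period, and a space (such as `1. Anora`), which is what the slicing in the function implies. `clean_title` and `join_directors` are hypothetical names.

```python
def clean_title(numbered_title):
    """Strip a leading list number such as "12. " from a title."""
    number, sep, rest = numbered_title.partition(". ")
    # Only strip when the part before ". " really is a number
    return rest if sep and number.isdigit() else numbered_title


def join_directors(names):
    """Join any number of director names; "N/A" when none were found."""
    return ", ".join(names) if names else "N/A"


print(clean_title("1. Anora"))
print(clean_title("Mr. Example Goes to Town"))  # made-up title with no list number
print(join_directors(["Ethan Coen", "Joel Coen"]))
print(join_directors([]))
```

Checking that the prefix is numeric avoids cutting into a title that happens to contain a period of its own.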

## `write_imdb_csv(my_list)`

This function writes the list returned from `get_imdb_deets()` into a CSV file. The CSV file is to be joined with the three other CSV files that this scraper produces.

In [14]:
def write_imdb_csv(my_list):
    # The with block closes the file automatically, even if a write fails
    with open('imdb_metacritic.csv', 'w', newline='', encoding='utf-8') as csvfile:
        c = csv.writer(csvfile)

        # Write column headings in a row
        c.writerow(['title', 'director', 'imdb_rating', 'metascore'])

        # Write one row per film
        for item in my_list:
            c.writerow(item)

# Run function to create CSV file
write_imdb_csv(imdb_list)

# After Using the Scraper

The goal of this scraper is to produce four CSV files, one per film site, each containing that site's ratings. Titles vary slightly across the sites, so they are standardized before the files are joined. Each film row in the joined CSV is also given a unique ID, which is then used for creating a Flask app.
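
The notebook does not show how the titles were standardized or how the files were joined. The following is only one possible sketch of that step, assuming a join key made by lowercasing each title and dropping punctuation and spaces. The function names `normalize_title` and `join_on_title`, and the sample CSV text, are made up for illustration.

```python
import csv
import io


def normalize_title(title):
    """Hypothetical join key: lowercase, keep only letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def join_on_title(left_rows, right_rows):
    """Join two lists of dict rows on the normalized title and add an ID."""
    right_by_key = {normalize_title(r["title"]): r for r in right_rows}
    joined = []
    for film_id, left in enumerate(left_rows, start=1):
        match = right_by_key.get(normalize_title(left["title"]), {})
        row = {"id": film_id, **left}
        # Copy every column from the matching row except its own title
        row.update({k: v for k, v in match.items() if k != "title"})
        joined.append(row)
    return joined


# Placeholder CSV text standing in for two of the scraper's output files
tmdb_csv = "title,year,tmdb_rating\nThe Example's Tale,2001,70\nSAMPLE,2002,80\n"
rt_csv = "title,tomatometer\nSample,91%\nThe Examples Tale,88%\n"

tmdb_rows = list(csv.DictReader(io.StringIO(tmdb_csv)))
rt_rows = list(csv.DictReader(io.StringIO(rt_csv)))

joined_rows = join_on_title(tmdb_rows, rt_rows)
for row in joined_rows:
    print(row)
```

Two different films that share a title would collide under this key, so including the release year in the key is likely a safer choice for the real data.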