# Web Data Scraping

[Spring 2023 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

These notebooks are adapted from a five-session mini-course at the University of Colorado. They have been adapted for relevant content and for integration with Docker so that we all share the same environment. Professor Keegan suggests using a recent version of Python, which our Docker container provides.

This notebook is adapted from excellent notebooks in Dr. [Cody Buntain](http://cody.bunta.in/)'s seminar on [Social Media and Crisis Informatics](http://cody.bunta.in/teaching/2018_winter_umd_inst728e/) as well as the [PRAW documentation](https://praw.readthedocs.io/en/latest/).


## Sharing accomplishments and challenges

* Using the inspect tool
* Counting numbers of members of U.S. House with XML
* Parsing information out from Twitter's JSON payload

## Parsing HTML data into tabular data

The overall goal we have as researchers in scraping data from the web is converting data from one structured format (HTML's tree-like structures) into another structured format (probably a tabular structure with rows and columns). 
This could involve simply reading tables out of a webpage all the way up to taking irregularly-structured HTML elements into a tabular format. 

We are going to make some use of the [`pandas`](https://pandas.pydata.org/) library ("**pan**el **da**ta", not the cute animal), which is Python's implementation of a data frame concept. This is a very powerful and complex library that I typically spend more than 12 hours of lecture teaching in intermediate programming classes. I hope to convey some important elements as we work through material, but it is far beyond the scope of this class to be able to cover all the fundamentals and syntax. 

Let's begin by importing the libraries we'll need in this notebook: `requests`, `BeautifulSoup`, and `pandas`.

In [1]:
# Most straightforward way to import a library in Python
import requests

# BeautifulSoup is a class inside the "bs4" package; we import only that class
from bs4 import BeautifulSoup

# We import pandas but give the library a shortcut alias "pd" since we will call its functions so much
import pandas as pd

### Reading an HTML table into Python

[The Numbers](http://www.the-numbers.com) is a popular source of data about movies' box office revenue numbers. Their daily domestic charts are HTML tables with the top-grossing movies for each day of the year, going back for several years. This [table](https://www.the-numbers.com/box-office-chart/daily/2024/12/25) for Christmas Day in 2024 has columns for the current ranking, previous ranking, name of movie, distributor, gross, change over the previous day (`%YD`), change over the previous week (`%LW`), number of theaters, revenue per theater, total gross, and number of days since release. This looks like a fairly straightforward table that could be read directly into a data frame-like structure.

Using the Inspect tool, we can see the table exists as a `<table border="0" ... align="CENTER">` element with child tags like `<tbody>` and `<tr>` (table row). Each `<tr>` contains `<td>` tags, which define the individual cells and their content. For more on how HTML defines tables, check out [this tutorial](https://www.w3schools.com/html/html_tables.asp).

Using `requests` and `BeautifulSoup` we would get this webpage's HTML, turn it into soup, and then find the table (`<table>`) or the table rows (`<tr>`) and pull out their content.

In [2]:
# Make the request
xmas_bo_raw = requests.get('https://www.the-numbers.com/box-office-chart/daily/2024/12/25').text

# Turn into soup, specify the HTML parser
xmas_bo_soup = BeautifulSoup(xmas_bo_raw,'html.parser')

# Use .find_all to retrieve all the tables in the page
xmas_bo_tables = xmas_bo_soup.find_all('table')

It turns out there are two tables on the page. The first is a baby table consisting of the "Previous Chart", "Chart Index", and "Next Chart" links at the top. We want the second table with all the data: `xmas_bo_tables[1]` returns the second table (remember that Python is 0-indexed, so the first table is at `xmas_bo_tables[0]`). With this table identified, we can do a second `find_all` to get the table rows inside it, which we save as `xmas_bo_trs`.

In [3]:
xmas_bo_trs = xmas_bo_tables[1].find_all('tr')
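
Picking the table by position works here, but the index could shift if the page layout ever changes. As a hedged alternative, the sketch below picks the first table whose header cells include a given label. The helper name `find_table_with_header` and the tiny stand-in HTML are made up for illustration; they are not part of The Numbers' page or of `BeautifulSoup` itself.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page (hypothetical): a navigation table followed by a data table
sample_html = """
<table><tr><td>Previous Chart</td><td>Chart Index</td><td>Next Chart</td></tr></table>
<table>
<tr><th>Movie Title</th><th>Gross</th></tr>
<tr><td>Mufasa: The Lion King</td><td>$14,722,175</td></tr>
</table>
"""

def find_table_with_header(soup, header_text):
    """Return the first <table> with a <th> whose text equals header_text, else None."""
    for table in soup.find_all('table'):
        headers = [th.get_text(strip=True) for th in table.find_all('th')]
        if header_text in headers:
            return table
    return None

sample_soup = BeautifulSoup(sample_html, 'html.parser')
data_table = find_table_with_header(sample_soup, 'Movie Title')

# Count the rows in the matched table (one header row plus one data row)
print(len(data_table.find_all('tr')))
```

On the real page, the same idea would be `find_table_with_header(xmas_bo_soup, 'Movie Title')`, assuming the header label stays the same.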

Let's inspect a few of these rows. The first row in our list of rows under `xmas_bo_trs` should be the header with the names of the columns.

In [4]:
xmas_bo_trs[0]

<tr><th> </th><th> </th><th>Movie Title</th><th>Distributor</th><th>Gross</th><th>%YD</th><th>%LW</th><th>Theaters</th><th>Per<br/>Theater</th><th>Total<br/>Gross</th><th>Days In<br/>Release</th></tr>

The next table row should be for *Mufasa: The Lion King*.

In [5]:
xmas_bo_trs[1]

<tr>
<td class="data chart_up" data-sort="1"><b>1</b></td>
<td class="data" data-sort="2">(2)</td>
<td><b><a href="/movie/Mufasa-The-Lion-King-(2024)#tab=box-office">Mufasa: The Lion King</a></b></td>
<td><a href="/market/distributor/Walt-Disney">Walt Disney</a></td>
<td class="data">$14,722,175</td>
<td class="data chart_up" data-sort="106">+106%</td>
<td class="data" data-sort="0"> </td>
<td class="data" data-sort="4100">4,100</td>
<td class="data chart_grey" data-sort="3591">$3,591</td>
<td class="data" data-sort="64376120">$64,376,120</td>
<td class="data">6</td>
</tr>

If we wanted to access the contents of this table row, we could use the `.contents` attribute to get a list of each of the `<td>` table cells, which (frustratingly) intersperses newline characters.

In [6]:
xmas_bo_trs[1].contents

['\n',
 <td class="data chart_up" data-sort="1"><b>1</b></td>,
 '\n',
 <td class="data" data-sort="2">(2)</td>,
 '\n',
 <td><b><a href="/movie/Mufasa-The-Lion-King-(2024)#tab=box-office">Mufasa: The Lion King</a></b></td>,
 '\n',
 <td><a href="/market/distributor/Walt-Disney">Walt Disney</a></td>,
 '\n',
 <td class="data">$14,722,175</td>,
 '\n',
 <td class="data chart_up" data-sort="106">+106%</td>,
 '\n',
 <td class="data" data-sort="0"> </td>,
 '\n',
 <td class="data" data-sort="4100">4,100</td>,
 '\n',
 <td class="data chart_grey" data-sort="3591">$3,591</td>,
 '\n',
 <td class="data" data-sort="64376120">$64,376,120</td>,
 '\n',
 <td class="data">6</td>,
 '\n']

Another alternative is to use the `.text` attribute to get the text content of all the cells in this row.

In [7]:
xmas_bo_trs[1].text

'\n1\n(2)\nMufasa: The Lion King\nWalt Disney\n$14,722,175\n+106%\n\xa0\n4,100\n$3,591\n$64,376,120\n6\n'

The `\n` characters re-appear here, but if we `print` out this statement, we see their newline functionality.

In [8]:
print(xmas_bo_trs[1].text)


1
(2)
Mufasa: The Lion King
Walt Disney
$14,722,175
+106%
 
4,100
$3,591
$64,376,120
6



We could use string processing to take this text string and convert it into a simple list of data. `.split('\n')` will split the string on the newline characters and return a list of what exists in between.

In [9]:
xmas_bo_trs[1].text.split('\n')

['',
 '1',
 '(2)',
 'Mufasa: The Lion King',
 'Walt Disney',
 '$14,722,175',
 '+106%',
 '\xa0',
 '4,100',
 '$3,591',
 '$64,376,120',
 '6',
 '']
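
Splitting on newlines leaves empty strings at both ends and a non-breaking space (`\xa0`) for the blank cell. One possible alternative, sketched below, is to ask for the `<td>` cells directly and strip each one. The row is copied from the output above, with the blank cell written as `&nbsp;`, which is an assumption about how the page encodes it.

```python
from bs4 import BeautifulSoup

# One row copied from the chart above; the blank cell is written as &nbsp; (assumed)
row_html = """<tr>
<td class="data chart_up" data-sort="1"><b>1</b></td>
<td class="data" data-sort="2">(2)</td>
<td><b><a href="/movie/Mufasa-The-Lion-King-(2024)#tab=box-office">Mufasa: The Lion King</a></b></td>
<td><a href="/market/distributor/Walt-Disney">Walt Disney</a></td>
<td class="data">$14,722,175</td>
<td class="data chart_up" data-sort="106">+106%</td>
<td class="data" data-sort="0">&nbsp;</td>
<td class="data" data-sort="4100">4,100</td>
<td class="data chart_grey" data-sort="3591">$3,591</td>
<td class="data" data-sort="64376120">$64,376,120</td>
<td class="data">6</td>
</tr>"""

row = BeautifulSoup(row_html, 'html.parser').find('tr')

# One entry per <td>; strip=True removes surrounding whitespace, including \xa0
cells = [td.get_text(strip=True) for td in row.find_all('td')]
print(cells)
```

This yields exactly one value per cell, so there are no empty first and last columns to drop later.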

We'll write a `for` loop to go through all the table rows in `xmas_bo_trs`, get the list of data from the row, and add it back to a list of all the rows.

In [10]:
cleaned_xmas_bo_rows = []

# Loop through all the non-header (first row) table rows
for row in xmas_bo_trs[1:]:
    
    # Get the text of the row and split on the newlines (like above)
    cleaned_row = row.text.split('\n')
    
    # Add this cleaned row back to the external list of row data
    cleaned_xmas_bo_rows.append(cleaned_row)
    
# Inspect the first few rows of data
cleaned_xmas_bo_rows[:2]

[['',
  '1',
  '(2)',
  'Mufasa: The Lion King',
  'Walt Disney',
  '$14,722,175',
  '+106%',
  '\xa0',
  '4,100',
  '$3,591',
  '$64,376,120',
  '6',
  ''],
 ['',
  '2',
  'N',
  'Nosferatu',
  'Focus Features',
  '$11,552,825',
  '\xa0',
  '\xa0',
  '2,911',
  '$3,969',
  '$11,552,825',
  '1',
  '']]

Now we can pass this list of lists in `cleaned_xmas_bo_rows` to pandas's `DataFrame` function and hopefully get a nice table out.

In [11]:
xmas_bo_df = pd.DataFrame(cleaned_xmas_bo_rows)

# Inspect
xmas_bo_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,,1,(2),Mufasa: The Lion King,Walt Disney,"$14,722,175",+106%,,4100,"$3,591","$64,376,120",6,
1,,2,N,Nosferatu,Focus Features,"$11,552,825",,,2911,"$3,969","$11,552,825",1,
2,,3,(1),Sonic the Hedgehog 3,Paramount Pi…,"$10,351,772",+38%,,3769,"$2,747","$88,003,028",6,
3,,4,N,A Complete Unknown,Searchlight …,"$7,201,242",,,2835,"$2,540","$7,201,242",1,
4,,5,(3),Wicked,Universal,"$5,399,625",+66%,+96%,3177,"$1,700","$397,890,620",34,


We need to do a bit of cleanup on this data:

* Columns 0 and 12 are all empty
* Add column names

In [12]:
# Drop columns 0 and 12 and overwrite the xmas_bo_df variable
xmas_bo_df = xmas_bo_df.drop(columns=[0,12])

# Rename the columns
xmas_bo_df.columns = ['Rank','Last rank','Movie','Distributor','Gross',
                      'Change','Percent last week','Theaters','Per theater','Total gross',
                      'Days']

# Write to disk
# xmas_bo_df.to_csv('christmas_2024_box_office.csv',encoding='utf8')

# Inspect
xmas_bo_df.head()

Unnamed: 0,Rank,Last rank,Movie,Distributor,Gross,Change,Percent last week,Theaters,Per theater,Total gross,Days
0,1,(2),Mufasa: The Lion King,Walt Disney,"$14,722,175",+106%,,4100,"$3,591","$64,376,120",6
1,2,N,Nosferatu,Focus Features,"$11,552,825",,,2911,"$3,969","$11,552,825",1
2,3,(1),Sonic the Hedgehog 3,Paramount Pi…,"$10,351,772",+38%,,3769,"$2,747","$88,003,028",6
3,4,N,A Complete Unknown,Searchlight …,"$7,201,242",,,2835,"$2,540","$7,201,242",1
4,5,(3),Wicked,Universal,"$5,399,625",+66%,+96%,3177,"$1,700","$397,890,620",34
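
The dollar amounts, theater counts, and percentages in this table are still strings. A minimal sketch of one way to convert them to numbers follows, using a two-row stand-in DataFrame with the same column names. The exact cleaning rules are an assumption about what the analysis needs, not something the original notebook does.

```python
import pandas as pd

# Two rows copied from the table above; '\xa0' stands in for a blank cell
sample = pd.DataFrame({
    'Movie': ['Mufasa: The Lion King', 'Nosferatu'],
    'Gross': ['$14,722,175', '$11,552,825'],
    'Change': ['+106%', '\xa0'],
    'Theaters': ['4,100', '2,911'],
})

# Remove dollar signs and thousands separators, then convert to integers
sample['Gross'] = sample['Gross'].str.replace(r'[$,]', '', regex=True).astype(int)
sample['Theaters'] = sample['Theaters'].str.replace(',', '', regex=False).astype(int)

# Remove the sign and percent characters; blank cells become NaN
change_text = sample['Change'].str.replace(r'[+%]', '', regex=True)
sample['Change'] = pd.to_numeric(change_text, errors='coerce')

print(sample.dtypes)
```

With numeric columns, operations like sorting by `Gross` or summing across movies behave as expected rather than comparing text.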


### `pandas`'s `read_html`
That was a good amount of work just to get this simple HTML table into Python. But it was important to cover how table elements moved from a string in `requests`, into a soup object from `BeautifulSoup`, into a list of data, and finally into `pandas`.

`pandas` also has powerful functionality for reading tables directly from HTML. If we convert the soup of the second table (`xmas_bo_tables[1]`) back into a string, `pandas` can read it directly into a table.

There are a few idiosyncrasies here. The result is a list of DataFrames, even if there's only a single table, so we need to take the first (and only) element of this list. This is why there's a `[0]` at the end; the `.head()` is just to show the first five rows.

In [13]:
from io import StringIO

xmas_bo_table_as_string = str(xmas_bo_tables[1])

# Recent versions of pandas expect a file-like object rather than a raw HTML string
pd.read_html(StringIO(xmas_bo_table_as_string))[0].head()


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Movie Title,Distributor,Gross,%YD,%LW,Theaters,Per Theater,Total Gross,Days In Release
0,1,(2),Mufasa: The Lion King,Walt Disney,"$14,722,175",+106%,,4100.0,"$3,591","$64,376,120",6.0
1,2,N,Nosferatu,Focus Features,"$11,552,825",,,2911.0,"$3,969","$11,552,825",1.0
2,3,(1),Sonic the Hedgehog 3,Paramount Pi…,"$10,351,772",+38%,,3769.0,"$2,747","$88,003,028",6.0
3,4,N,A Complete Unknown,Searchlight …,"$7,201,242",,,2835.0,"$2,540","$7,201,242",1.0
4,5,(3),Wicked,Universal,"$5,399,625",+66%,+96%,3177.0,"$1,700","$397,890,620",34.0


Finally, you can point `read_html` at a URL without any `requests` or `BeautifulSoup` and get all the tables on the page as a list of DataFrames; `pandas` handles the fetching and parsing internally. Interestingly, I have gotten an [HTTP 403](https://en.wikipedia.org/wiki/HTTP_403) error with this approach, indicating the server (The Numbers) is forbidding the client (us) from accessing their data this way. We will discuss next week whether and how to handle situations where web servers refuse connections from non-human clients. If you hit this error, you cannot use the off-the-shelf `read_html` approach and would need to revert to the `requests`+`BeautifulSoup` approach above.

In [14]:
simple_tables = pd.read_html('https://www.the-numbers.com/box-office-chart/daily/2024/12/25')
simple_tables

[                  0            1             2
 0  ← Previous Chart  Chart Index  Next Chart →,
    Unnamed: 0 Unnamed: 1                 Movie Title     Distributor  \
 0           1        (2)       Mufasa: The Lion King     Walt Disney   
 1           2          N                   Nosferatu  Focus Features   
 2           3        (1)        Sonic the Hedgehog 3   Paramount Pi…   
 3           4          N          A Complete Unknown   Searchlight …   
 4           5        (3)                      Wicked       Universal   
 5           6        (4)                     Moana 2     Walt Disney   
 6           7          N             The Fire Inside   Amazon MGM S…   
 7           8          N                    Babygirl             A24   
 8           9        (5)                Gladiator II   Paramount Pi…   
 9          10        (6)                   Homestead   Angel Studios   
 10          -        (7)           Kraven the Hunter   Sony Pictures   
 11          -        (8)  
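
If the server does answer `read_html` with a 403, one possible middle path is to fetch the page with `requests` and a browser-like `User-Agent`, then hand the HTML text to `read_html`. The helper name `read_tables_with_headers` below is hypothetical, and whether a given server accepts the request is not guaranteed; this is a sketch, not a sanctioned workaround.

```python
from io import StringIO

import pandas as pd
import requests

def read_tables_with_headers(url, user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)'):
    """Fetch a page with requests, then parse every <table> with pandas.read_html."""
    response = requests.get(url, headers={'User-Agent': user_agent}, timeout=30)

    # Raise an exception on 4xx/5xx responses (such as a 403) instead of parsing an error page
    response.raise_for_status()

    # Wrap the text in StringIO so pandas treats it as a file-like object
    return pd.read_html(StringIO(response.text))

# Usage (needs network access, so it is left commented out here):
# tables = read_tables_with_headers('https://www.the-numbers.com/box-office-chart/daily/2024/12/25')
# tables[1].head()
```

Calling `raise_for_status()` makes a refused request fail loudly, which is easier to debug than an empty or misleading list of tables.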

If we point it at Wikipedia's [2024 in film](https://en.wikipedia.org/wiki/2024_in_film), it will pull all of the tables present on the page.

In [15]:
simple_tables = pd.read_html('https://en.wikipedia.org/wiki/2024_in_film')

The first two correspond to the "Year in film" navigation box on the side and are poorly-formatted by default.

In [16]:
simple_tables[0]

Unnamed: 0,List of years in film,Unnamed: 1,Unnamed: 2
0,,List of years in film,
1,… 2014 2015 2016 2017 2018 2019 2020 2021 2022...,,
2,Art Archaeology Architecture Literature Music ...,,
3,,vte,


The third table in the `simple_tables` list (`simple_tables[2]`) that we got from parsing the Wikipedia page with `read_html` is the table under the "Highest-grossing films" section.

In [17]:
simple_tables[2]

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Inside Out 2,Disney,"$1,698,863,816"
1,2,Deadpool & Wolverine,Disney,"$1,338,073,645"
2,3,Moana 2,Disney,"$1,059,269,477"
3,4,Despicable Me 4,Universal,"$971,105,208"
4,5,Wicked,Universal,"$753,700,072"
5,6,Mufasa: The Lion King,Disney,"$722,339,600"
6,7,Dune: Part Two,Warner Bros.,"$714,644,358"
7,8,Godzilla x Kong: The New Empire,Warner Bros.,"$571,850,016"
8,9,Kung Fu Panda 4,Universal,"$547,689,492"
9,10,Sonic the Hedgehog 3,Paramount,"$491,946,657"


Then all together.

In [18]:
wiki_top_grossing_t = pd.read_html('https://en.wikipedia.org/wiki/2024_in_film')[2]
wiki_top_grossing_t

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Inside Out 2,Disney,"$1,698,863,816"
1,2,Deadpool & Wolverine,Disney,"$1,338,073,645"
2,3,Moana 2,Disney,"$1,059,269,477"
3,4,Despicable Me 4,Universal,"$971,105,208"
4,5,Wicked,Universal,"$753,700,072"
5,6,Mufasa: The Lion King,Disney,"$722,339,600"
6,7,Dune: Part Two,Warner Bros.,"$714,644,358"
7,8,Godzilla x Kong: The New Empire,Warner Bros.,"$571,850,016"
8,9,Kung Fu Panda 4,Universal,"$547,689,492"
9,10,Sonic the Hedgehog 3,Paramount,"$491,946,657"


## Writing your own parser

We will return to the historical Oscars data. Even though data as prominent as this is likely to already exist in tabular format somewhere, we will maintain the illusion that we are the first to both scrape it and parse it into a tabular format. Our goal here is to write a parser that will (ideally) work across multiple pages; in this case, each of the award years.

One of the first things we should do before writing any code is come up with a model of what we want our data to look like at the end of this. This is an intuitive and "tidy" format, but you might come up with alternatives based on your analysis and modeling needs.

| *Year* | *Category* | *Nominee* | *Movie* | *Won* |
| --- | --- | --- | --- | --- |
| 2025 | Actor in a leading role | Adrien Brody | The Brutalist | Yes |
| 2025 | Actor in a leading role | Timothée Chalamet | A Complete Unknown | No |
| 2025 | Actor in a leading role | Colman Domingo | Sing Sing | No |
| 2025 | Actor in a leading role | Ralph Fiennes | Conclave | No |
| 2025 | Actor in a leading role | Sebastian Stan | The Apprentice | No |

We will begin by writing a parser for a (hopefully!) representative year, then scrape the data for all the years, then apply the parser to each of those years, and finally combine all the years' data together into a large data set.

Our representative year will be 2025.

Start off with using `requests` to get the data and then use `BeautifulSoup` to turn it into soup we can parse through.

In [19]:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Pretend to be a web browser and make a get request of a webpage
oscars2025_raw = requests.get('https://www.oscars.org/oscars/ceremonies/2025',headers=headers).text


# Turn into soup, specify the HTML parser
oscars2025_soup = BeautifulSoup(oscars2025_raw, 'html.parser')

In [21]:
#oscars2025_soup

Using the Inspect tool, the `<div class="paragraph--type--award-category">` seems to be the most promising tag for us to extract. Use `.find_all('div',{'class':'paragraph--type--award-category'})` to (hopefully!) get all of these award groups. Inspect the first and last ones to make sure they look coherent.

In [57]:

#oscars2025_groups = oscars2025_soup.find_all('div',{'class':"field field--name-field-award-honorees field--type-entity-reference-revisions field--label-hidden field__items"})

oscars2025_groups = oscars2025_soup.find_all('div',{'class':"paragraph--type--award-category"})

# Inspect the first one
oscars2025_groups[0]



<div class="paragraph paragraph--type--award-category paragraph--view-mode--default" data-term-id="1731">
<div class="field field--name-field-award-category-oscars field--type-entity-reference field--label-hidden field__item">Actor in a Leading Role</div>
<div class="field field--name-field-award-honorees field--type-entity-reference-revisions field--label-hidden field__items">
<div class="field__item">
<div class="paragraph paragraph--type--award-honoree paragraph--view-mode--oscars">
<div class="field field--name-field-honoree-type winner">Winner</div>
<div class="field">
<div class="field field--name-field-award-entities field--type-entity-reference field--label-hidden field__items">
<div class="field__item">Adrien Brody</div>
</div>
</div>
<div class="field">
<div class="field field--name-field-award-film field--type-entity-reference field--label-hidden field__item">The Brutalist</div>
</div>
</div>
</div>
<div class="field__item">
<div class="paragraph paragraph--type--award-honor

In [58]:
# Inspect the last one
oscars2025_groups[-1]

<div class="paragraph paragraph--type--award-category paragraph--view-mode--default" data-term-id="2296">
<div class="field field--name-field-award-category-oscars field--type-entity-reference field--label-hidden field__item">Writing (Original Screenplay)</div>
<div class="field field--name-field-award-honorees field--type-entity-reference-revisions field--label-hidden field__items">
<div class="field__item">
<div class="paragraph paragraph--type--award-honoree paragraph--view-mode--oscars">
<div class="field field--name-field-honoree-type winner">Winner</div>
<div class="field">
<div class="field field--name-field-award-film field--type-entity-reference field--label-hidden field__item">Anora</div>
</div>
<div class="field">
<div class="field field--name-field-award-entities field--type-entity-reference field--label-hidden field__items">
<div class="field__item">Written by Sean Baker</div>
</div>
</div>
</div>
</div>
<div class="field__item">
<div class="paragraph paragraph--type--awar

In [45]:
len(oscars2025_groups)

23

### Navigating the HTML tree to find more specific elements
We've parsed the HTML soup to get all the award categories. Now let's pull more specific elements out of each group: the category name, the honorees, the film, and the honoree type.

In [60]:
category = oscars2025_groups[0].find('div', class_='field--name-field-award-category-oscars').get_text(strip=True)
category

'Actor in a Leading Role'

In [53]:
# Get all honorees (winner + nominees)
honorees = oscars2025_groups[0].find_all('div', class_='paragraph--type--award-honoree')
honorees

[<div class="paragraph paragraph--type--award-honoree paragraph--view-mode--oscars">
 <div class="field field--name-field-honoree-type winner">Winner</div>
 <div class="field">
 <div class="field field--name-field-award-entities field--type-entity-reference field--label-hidden field__items">
 <div class="field__item">Adrien Brody</div>
 </div>
 </div>
 <div class="field">
 <div class="field field--name-field-award-film field--type-entity-reference field--label-hidden field__item">The Brutalist</div>
 </div>
 </div>,
 <div class="paragraph paragraph--type--award-honoree paragraph--view-mode--oscars">
 <div class="field field--name-field-honoree-type nominee">Nominees</div>
 <div class="field">
 <div class="field field--name-field-award-entities field--type-entity-reference field--label-hidden field__items">
 <div class="field__item">Timothée Chalamet</div>
 </div>
 </div>
 <div class="field">
 <div class="field field--name-field-award-film field--type-entity-reference field--label-hidde

In [None]:
honorees[0].find('div', class_='field--name-field-honoree-type').get_text(strip=True)

In [None]:
# Save the honoree type of the first honoree to a variable
honoree_type = honorees[0].find('div', class_='field--name-field-honoree-type').get_text(strip=True)

### Put it all together to create a single-year scraper


In [66]:
# Prepare a list to collect all award entries
data = []

# Find the section containing award categories
categories = oscars2025_soup.find_all('div', class_='paragraph--type--award-category')

for category in categories:
    # Get the award category name
    category_name = category.find('div', class_='field--name-field-award-category-oscars').get_text(strip=True)
    
    # Get all honorees (winner + nominees)
    honorees = category.find_all('div', class_='paragraph--type--award-honoree')
    
    for honoree in honorees:
        # Type: Winner or Nominee
        honoree_type = honoree.find('div', class_='field--name-field-honoree-type').get_text(strip=True)
        
        # Honoree name(s)
        honoree_entities = honoree.find('div', class_='field--name-field-award-entities')
        honoree_name = honoree_entities.get_text(strip=True) if honoree_entities else ""
        
        # Film
        film = honoree.find('div', class_='field--name-field-award-film')
        film_name = film.get_text(strip=True) if film else ""
        
        # Append to the list
        data.append({
            'Award Category': category_name,
            'Honoree': honoree_name,
            'Film': film_name,
            'Type': honoree_type,
            'Year': 2025  # let's add the year as well
        })

# Turn into a DataFrame for a nice table
df = pd.DataFrame(data)

print(df)

                    Award Category  \
0          Actor in a Leading Role   
1          Actor in a Leading Role   
2          Actor in a Leading Role   
3          Actor in a Leading Role   
4          Actor in a Leading Role   
..                             ...   
115  Writing (Original Screenplay)   
116  Writing (Original Screenplay)   
117  Writing (Original Screenplay)   
118  Writing (Original Screenplay)   
119  Writing (Original Screenplay)   

                                               Honoree                Film  \
0                                         Adrien Brody       The Brutalist   
1                                    Timothée Chalamet  A Complete Unknown   
2                                       Colman Domingo           Sing Sing   
3                                        Ralph Fiennes            Conclave   
4                                       Sebastian Stan      The Apprentice   
..                                                 ...                 ... 

In [67]:
df

Unnamed: 0,Award Category,Honoree,Film,Type,Year
0,Actor in a Leading Role,Adrien Brody,The Brutalist,Winner,2025
1,Actor in a Leading Role,Timothée Chalamet,A Complete Unknown,Nominees,2025
2,Actor in a Leading Role,Colman Domingo,Sing Sing,Nominees,2025
3,Actor in a Leading Role,Ralph Fiennes,Conclave,Nominees,2025
4,Actor in a Leading Role,Sebastian Stan,The Apprentice,Nominees,2025
...,...,...,...,...,...
115,Writing (Original Screenplay),Written by Sean Baker,Anora,Winner,2025
116,Writing (Original Screenplay),"Written by Brady Corbet, Mona Fastvold",The Brutalist,Nominees,2025
117,Writing (Original Screenplay),Written by Jesse Eisenberg,A Real Pain,Nominees,2025
118,Writing (Original Screenplay),"Written by Moritz Binder, Tim Fehlbaum; Co-Wri...",September 5,Nominees,2025


There are 120 honorees (not necessarily all unique) across 23 award categories, in a DataFrame with 5 columns.

In [70]:
df['Award Category'].value_counts()

Award Category
Best Picture                     10
Actor in a Leading Role           5
International Feature Film        5
Writing (Adapted Screenplay)      5
Visual Effects                    5
Sound                             5
Live Action Short Film            5
Production Design                 5
Music (Original Song)             5
Music (Original Score)            5
Makeup and Hairstyling            5
Film Editing                      5
Actor in a Supporting Role        5
Documentary Short Film            5
Documentary Feature Film          5
Directing                         5
Costume Design                    5
Cinematography                    5
Animated Short Film               5
Animated Feature Film             5
Actress in a Supporting Role      5
Actress in a Leading Role         5
Writing (Original Screenplay)     5
Name: count, dtype: int64
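
The scraped columns do not yet match the tidy model sketched at the start of this section (Year, Category, Nominee, Movie, Won). One way to get there, assuming the column names and `Type` values shown in the output above, is to rename the columns and derive a boolean `Won` column. The two-row `sample` DataFrame is a stand-in for `df`.

```python
import pandas as pd

# Two rows copied from the scraped output above, standing in for the full df
sample = pd.DataFrame([
    {'Award Category': 'Actor in a Leading Role', 'Honoree': 'Adrien Brody',
     'Film': 'The Brutalist', 'Type': 'Winner', 'Year': 2025},
    {'Award Category': 'Actor in a Leading Role', 'Honoree': 'Timothée Chalamet',
     'Film': 'A Complete Unknown', 'Type': 'Nominees', 'Year': 2025},
])

# Rename the scraped columns to the names used in the target model
tidy = sample.rename(columns={'Award Category': 'Category',
                              'Honoree': 'Nominee',
                              'Film': 'Movie'})

# True where the honoree type is "Winner", False otherwise
tidy['Won'] = tidy['Type'].eq('Winner')

# Keep only the model's columns, in the model's order
tidy = tidy[['Year', 'Category', 'Nominee', 'Movie', 'Won']]
print(tidy)
```

A boolean `Won` column makes counting winners a matter of `tidy['Won'].sum()` instead of matching on text labels.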

## Iterating vs. parsing to retrieve data

Often the data you are interested in is spread across multiple web pages. In an ideal world, the naming conventions would let you retrieve the data from these pages systematically. In the case of the Oscars, the URLs appear to be consistently formatted: `https://www.oscars.org/oscars/ceremonies/2025` suggests that we could change the 2025 to any other year going back to the start of the Oscars and get that year as well: `https://www.oscars.org/oscars/ceremonies/2015` should get us the page for 2015, and so on. Let's demonstrate each of these strategies with the Oscars data: iterating from 2025 back to 1929 in the URL versus parsing the list of links from the header.

### Iterating strategies for retrieving data

The fundamental assumption with this strategy is that the data are stored at URLs in a consistent way that we can access sequentially. In the case of the Oscars, we *should* be able to simply pass each year to the URL in requests. Here we want to practice responsible data scraping by including a sleep between each request so that we do not overwhelm the Oscars server with requests. We can use the `sleep` function within `time`.

In [63]:
from time import sleep

The `sleep(3)` below prevents any more code from progressing for 3 seconds.

In [64]:
print("The start of something.")
sleep(3)
print("The end of something.")

The start of something.
The end of something.


The core part of the iterating strategy is simply using Python's [`range`](https://docs.python.org/3.7/library/functions.html#func-range) function to generate a sequence of values. Here, we can use `range` to print out a sequence of URLs that should correspond to awards pages from 2015 through 2024; note that `range(2015,2025)` stops before 2025, so the stop value itself is excluded. We can also incorporate the `sleep` functionality and wait a second between each `print` statement, so it should now take 10 seconds for this code to finish printing. This simulates how we can use `sleep` to slow down and spread out requests so that we do not overwhelm the servers whose data we are trying to scrape.

In [65]:
for year in range(2015,2025):
    sleep(1)
    print('https://www.oscars.org/oscars/ceremonies/{0}'.format(year))

https://www.oscars.org/oscars/ceremonies/2015
https://www.oscars.org/oscars/ceremonies/2016
https://www.oscars.org/oscars/ceremonies/2017
https://www.oscars.org/oscars/ceremonies/2018
https://www.oscars.org/oscars/ceremonies/2019
https://www.oscars.org/oscars/ceremonies/2020
https://www.oscars.org/oscars/ceremonies/2021
https://www.oscars.org/oscars/ceremonies/2022
https://www.oscars.org/oscars/ceremonies/2023
https://www.oscars.org/oscars/ceremonies/2024


Let's define a function `get_nominees` that gets the nominees for a given year.

In [79]:
def get_nominees(year):
    # Prepare a list to collect all award entries
    data = []

    # Pause for a second between each request
    sleep(1)

    # Get the raw HTML
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    year_raw_html = requests.get('https://www.oscars.org/oscars/ceremonies/{0}'.format(year), headers=headers).text

    # Soup-ify
    year_souped_html = BeautifulSoup(year_raw_html, 'html.parser')

    # Find the section containing award categories
    categories = year_souped_html.find_all('div', class_='paragraph--type--award-category')

    for category in categories:
        # Get the award category name
        category_name = category.find('div', class_='field--name-field-award-category-oscars').get_text(strip=True)
    
        # Get all honorees (winner + nominees)
        honorees = category.find_all('div', class_='paragraph--type--award-honoree')
    
        for honoree in honorees:
            # Type: Winner or Nominee
            honoree_type = honoree.find('div', class_='field--name-field-honoree-type').get_text(strip=True)
        
            # Honoree name(s)
            honoree_entities = honoree.find('div', class_='field--name-field-award-entities')
            honoree_name = honoree_entities.get_text(strip=True) if honoree_entities else ""
        
            # Film
            film = honoree.find('div', class_='field--name-field-award-film')
            film_name = film.get_text(strip=True) if film else ""
        
            # Append to the list
            data.append({
                'Award Category': category_name,
                'Honoree': honoree_name,
                'Film': film_name,
                'Type': honoree_type,
                'Year': year  # let's add the year as well
            })

    # Turn into a DataFrame for a nice table
    df = pd.DataFrame(data)
    return(df)
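
One design variation worth considering is separating the parsing from the fetching, so the parser can be tried on saved or hand-written HTML without hitting the Oscars server. The sketch below is an assumption-laden illustration: `parse_nominees` is a hypothetical helper, and the HTML snippet is a trimmed-down imitation of the structure inspected earlier, not a copy of the live page.

```python
from bs4 import BeautifulSoup

def parse_nominees(html, year):
    """Parse one ceremony page's HTML into a list of row dictionaries."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for category in soup.find_all('div', class_='paragraph--type--award-category'):
        category_name = category.find(
            'div', class_='field--name-field-award-category-oscars').get_text(strip=True)
        for honoree in category.find_all('div', class_='paragraph--type--award-honoree'):
            # Each lookup may come back as None, so fall back to an empty string
            kind = honoree.find('div', class_='field--name-field-honoree-type')
            entities = honoree.find('div', class_='field--name-field-award-entities')
            film = honoree.find('div', class_='field--name-field-award-film')
            rows.append({
                'Award Category': category_name,
                'Honoree': entities.get_text(strip=True) if entities else "",
                'Film': film.get_text(strip=True) if film else "",
                'Type': kind.get_text(strip=True) if kind else "",
                'Year': year,
            })
    return rows

# Trimmed-down imitation of one award category with a single honoree
sample_html = """
<div class="paragraph paragraph--type--award-category">
<div class="field field--name-field-award-category-oscars">Actor in a Leading Role</div>
<div class="field field--name-field-award-honorees">
<div class="paragraph paragraph--type--award-honoree">
<div class="field field--name-field-honoree-type winner">Winner</div>
<div class="field field--name-field-award-entities"><div class="field__item">Adrien Brody</div></div>
<div class="field field--name-field-award-film">The Brutalist</div>
</div>
</div>
</div>
"""

sample_rows = parse_nominees(sample_html, 2025)
print(sample_rows)
```

With this split, `get_nominees` could shrink to fetching the page and returning `pd.DataFrame(parse_nominees(year_raw_html, year))`.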

Call `get_nominees` for each year, store the resulting DataFrames in `all_years_nominees`, and combine them into a giant DataFrame of all the nominees from 2015 through 2024.

In [81]:
# Create an empty dictionary to store the DataFrame we get for each year
all_years_nominees = dict()

# For each year from 2015 through 2024 (range excludes the stop value of 2025)
for year in range(2015,2025):
    
    # Get this year's nominees as a DataFrame and store it under its year
    all_years_nominees[year] = get_nominees(year)

all_years_nominees_df = pd.concat(all_years_nominees)
all_years_nominees_df.reset_index(drop=True).head(10)

Unnamed: 0,Award Category,Honoree,Film,Type,Year
0,Actor in a Leading Role,Eddie Redmayne,The Theory of Everything,Winner,2015
1,Actor in a Leading Role,Steve Carell,Foxcatcher,Nominees,2015
2,Actor in a Leading Role,Bradley Cooper,American Sniper,Nominees,2015
3,Actor in a Leading Role,Benedict Cumberbatch,The Imitation Game,Nominees,2015
4,Actor in a Leading Role,Michael Keaton,Birdman or (The Unexpected Virtue of Ignorance),Nominees,2015
5,Actor in a Supporting Role,J.K. Simmons,Whiplash,Winner,2015
6,Actor in a Supporting Role,Robert Duvall,The Judge,Nominees,2015
7,Actor in a Supporting Role,Ethan Hawke,Boyhood,Nominees,2015
8,Actor in a Supporting Role,Edward Norton,Birdman or (The Unexpected Virtue of Ignorance),Nominees,2015
9,Actor in a Supporting Role,Mark Ruffalo,Foxcatcher,Nominees,2015


We now have a general scraper to get all the nominees, films, categories, and winners for any year from the Oscars. From 2015 through 2024, there were 1,209 honorees, of whom 236 were winners.

In [84]:
all_years_nominees_df.shape

(1209, 5)

In [85]:
all_years_nominees_df.Type.value_counts()

Type
Nominees    973
Winner      236
Name: count, dtype: int64