# Data Acquisition via Web Scraping (Rotten Tomatoes)

---

## Phase 1: Setup and Request Preparation

### Import Modules

In [1]:
# Import all required Python modules for scraping and data handling
import numpy as np
import pandas as pd
import requests             # For submitting HTTP requests
import json
from bs4 import BeautifulSoup # The primary tool for parsing HTML

print("Modules imported.")

Modules imported.


### Obtain a User-agent String
We retrieve the User-agent string that our client reports, which identifies the client to the server. Scrapers should send this header explicitly, as many sites block requests that lack it.

In [2]:
# Use httpbin.org to get a standard User-agent string
r = requests.get('https://httpbin.org/user-agent')
useragent = json.loads(r.text)['user-agent']

print(f"Obtained User-agent: {useragent}")

Obtained User-agent: python-requests/2.32.4
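
For reference, the endpoint's body is a one-key JSON object. The sketch below parses a hard-coded sample body (an assumption that mirrors the output above, so it runs without network access). With a live response, `r.json()` should give the same dictionary as `json.loads(r.text)`.

```python
import json

# Sample body, hard-coded as an assumption so the sketch runs offline.
sample_body = '{"user-agent": "python-requests/2.32.4"}'

# json.loads turns the JSON text into a Python dict.
parsed = json.loads(sample_body)
useragent_example = parsed['user-agent']

print(useragent_example)
```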


### Define Request Headers
We compile the User-agent and an identifying "From" email into the headers dictionary. It's polite scraping practice to include contact info.

In [5]:
# Define the headers dictionary
headers = {'User-agent': useragent,
          'From': 'hbv6pz@virginia.edu'} # Example 'atr8e@virginia.edu'

print("Headers dictionary defined.")

Headers dictionary defined.
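
To check which headers would actually go out, one option is to build the request without sending it. The sketch below uses `requests.Request(...).prepare()` for that; the URL and email address are placeholders chosen for illustration, not values from this notebook.

```python
import requests

# Placeholder values for illustration only; substitute your own.
example_headers = {'User-agent': 'python-requests/2.32.4',
                   'From': 'you@example.com'}

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request('GET', 'https://example.com/',
                            headers=example_headers).prepare()

# Header lookups are case-insensitive in requests.
print(prepared.headers['user-agent'])
print(prepared.headers['From'])
```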


### Define the URL and Submit the GET Request
This is the Extraction (E) phase of the web scrape. We submit the GET request to the target page.

In [6]:
# Define the target URL (Movies in Theaters page on Rotten Tomatoes)
url = 'https://www.rottentomatoes.com/browse/movies_in_theaters/sort:top_box_office?page=5'

# Submit the request with the custom headers
r = requests.get(url, headers=headers)

# A <Response [200]> means the request was successful
print(f"Request submitted. Status: {r}")

Request submitted. Status: <Response [200]>
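
Printing the response object shows only the numeric code. As a small sketch, the helper below (`describe_status` is a hypothetical name, not part of requests) turns a status code into a readable line using the standard library. With a live response you could pass `r.status_code` to it, or call `r.raise_for_status()` to raise an exception on 4xx and 5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    phrase = HTTPStatus(code).phrase
    outcome = 'success' if 200 <= code < 300 else 'not successful'
    return f"{code} {phrase} ({outcome})"

print(describe_status(200))
print(describe_status(403))
```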


In [7]:
# Uncomment and run r.text below this if you want to see the raw output!
# Recomment and rerun to get rid of the raw output, it's way too much!
#r.text

### Commentary on the Request
The response object 'r' now holds the HTML source returned by the server, available as r.text.

The next step in the scraping process is to parse this raw text so we can easily search for the movie data.

---

## Phase 2: HTML Parsing and Isolation

### Parse the HTML with BeautifulSoup
This is the start of the Transformation (T) phase. BeautifulSoup turns the raw HTML text into a navigable Python object.

In [8]:
# Parse the raw HTML text (r.text) using the built-in 'html.parser'
mysoup = BeautifulSoup(r.text, 'html.parser')

print("HTML content successfully parsed into a BeautifulSoup object.")

HTML content successfully parsed into a BeautifulSoup object.
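
To make "navigable" concrete, here is a minimal sketch on a tiny hand-written page (the HTML and the name `demo_soup` are made up for illustration). It shows attribute-style navigation, `find()`, and `find_all()`.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page, used only for illustration.
html = ("<html><head><title>Demo</title></head>"
        "<body><p class='intro'>Hello</p><p>World</p></body></html>")
demo_soup = BeautifulSoup(html, 'html.parser')

# Tags can be reached as attributes of the soup object.
print(demo_soup.title.string)

# find() returns the first match; 'class' comes back as a list of class names.
print(demo_soup.find('p')['class'])

# find_all() returns every match.
print(len(demo_soup.find_all('p')))
```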


In [11]:
# Display the parsed HTML.
# The output is very long, so comment out the line below and rerun to hide it.
mysoup

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="fb: http://www.facebook.com/2008/fbml og: http://opengraphprotocol.org/schema/" xmlns="http://www.w3.org/1999/xhtml">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script charset="UTF-8" crossorigin="anonymous" data-domain-script="7e979733-6841-4fce-9182-515fac69187f" integrity="sha384-TKdmlzVmoD70HzftTw4WtOzIBL5mNx8mXSRzEvwrWjpIJ7FZ/EuX758yMDWXtRUN" src="https://cdn.cookielaw.org/consent/7e979733-6841-4fce-9182-515fac69187f/otSDKStub.js" type="text/javascript">
</script>
<script type="text/javascript">
                    function OptanonWrapper() {
                        if (OnetrustActiveGroups.includes('7')) {
                            document.querySelector('search-results-nav-manager')?.setAlgoliaInsightUserToken?.();
                        }
                    }
                </script>
<script ccpa-opt-out-geo="US" ccpa-opt-out-ids="USP" ccpa-opt-out-lspa="false" charset=

### Isolate All Movie Tiles
We use BeautifulSoup's find_all() method to locate the custom HTML tag used by Rotten Tomatoes to wrap each movie listing.

In [10]:
# Find all elements corresponding to a single movie tile
# The tag 'tile-dynamic' appears to wrap all the necessary data for one movie.
mylist = mysoup.find_all('tile-dynamic')

print(f"Isolated {len(mylist)} movie tiles found on the page.")

Isolated 127 movie tiles found on the page.
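
Custom element names are matched like any other tag name. The sketch below runs `find_all()` on simplified markup modeled on the tiles in this notebook (the HTML is illustrative, not the real page) and also shows filtering on an attribute.

```python
from bs4 import BeautifulSoup

# Simplified markup modeled on the tiles in this notebook (illustrative only).
html = """
<div>
  <tile-dynamic data-qa="tile"><span class="p--small">Movie A</span></tile-dynamic>
  <tile-dynamic><span class="p--small">Movie B</span></tile-dynamic>
  <p>Not a tile</p>
</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Match every element with the custom tag name.
demo_tiles = demo_soup.find_all('tile-dynamic')
print(len(demo_tiles))

# Narrow the match to elements carrying a specific attribute value.
qa_tiles = demo_soup.find_all('tile-dynamic', attrs={'data-qa': 'tile'})
print(len(qa_tiles))
```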


In [12]:
# Display the selected tiles.
# The output is very long, so comment out the line below and rerun to hide it.
mylist

[<tile-dynamic data-qa="tile">
 <rt-img alt="One Battle After Another poster image" loading="lazy" slot="image" src="https://resizing.flixster.com/J4-IRAM-JCRt1httKD5HlGu4kF8=/206x305/v2/https://resizing.flixster.com/RsxXU0Hem_FDFjs0xcKd_kOwEfg=/fit-in/180x240/v2/https://resizing.flixster.com/yepsxuaCu3f7igZnTo2huAOVtCU=/ems.cHJkLWVtcy1hc3NldHMvbW92aWVzLzNhNmQzNDA0LWFhYjQtNDA5Mi04OTMyLTA2Y2U2OWI5ZjVmYS5qcGc="></rt-img>
 <div data-track="scores" slot="caption">
 <div class="score-wrap">
 <score-icon-critics certified="" sentiment="positive" size="1"></score-icon-critics>
 <rt-text class="critics-score" context="label" size="1">96%</rt-text>
 </div>
 <span class="p--small">One Battle After Another</span>
 <span class="sr-only">Link to One Battle After Another</span>
 </div>
 </tile-dynamic>,
 <tile-dynamic data-qa="tile">
 <rt-img alt="The Smashing Machine poster image" loading="lazy" slot="image" src="https://resizing.flixster.com/x-QgCe91SXRqBPaEx9wyPdT9VZQ=/206x305/v2/https://resizing

### Isolate a Single Movie for Inspection
To determine how to extract the data, we must first inspect the structure of a single movie tile. We grab the first element ([0]) from our list.

In [13]:
# Assign the first movie tile element to the variable 'm' for inspection
m = mylist[0]

print("Isolated the first movie tile (m) for inspection.")
# Display the element 'm' (comment out the line below to hide the output)
m

Isolated the first movie tile (m) for inspection.


<tile-dynamic data-qa="tile">
<rt-img alt="One Battle After Another poster image" loading="lazy" slot="image" src="https://resizing.flixster.com/J4-IRAM-JCRt1httKD5HlGu4kF8=/206x305/v2/https://resizing.flixster.com/RsxXU0Hem_FDFjs0xcKd_kOwEfg=/fit-in/180x240/v2/https://resizing.flixster.com/yepsxuaCu3f7igZnTo2huAOVtCU=/ems.cHJkLWVtcy1hc3NldHMvbW92aWVzLzNhNmQzNDA0LWFhYjQtNDA5Mi04OTMyLTA2Y2U2OWI5ZjVmYS5qcGc="></rt-img>
<div data-track="scores" slot="caption">
<div class="score-wrap">
<score-icon-critics certified="" sentiment="positive" size="1"></score-icon-critics>
<rt-text class="critics-score" context="label" size="1">96%</rt-text>
</div>
<span class="p--small">One Battle After Another</span>
<span class="sr-only">Link to One Battle After Another</span>
</div>
</tile-dynamic>

### Extract the Movie Title
We search within the single tile (m) for the specific HTML element containing the title, using the tag (span) and its class name (p--small).

In [14]:
# Search within 'm' for the title: it's inside a <span> with the class 'p--small'
# We grab the first result [0] and its string content
title_example = m.find_all('span', 'p--small')[0].string

print(f"Example Title Extracted: {title_example}")

Example Title Extracted: One Battle After Another
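
Indexing `find_all(...)[0]` raises an IndexError when a tile has no matching span. One possible guard, sketched below, uses `find()` and checks for None; `tile_title` is a hypothetical helper name and the HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

def tile_title(tile):
    """Return the stripped title text, or None if the tile has no title span."""
    span = tile.find('span', class_='p--small')
    return span.get_text(strip=True) if span is not None else None

# Illustrative markup: one tile with a title span, one without.
html = """
<tile-dynamic><span class="p--small"> One Battle After Another </span></tile-dynamic>
<tile-dynamic><span class="sr-only">No title span here</span></tile-dynamic>
"""
tiles = BeautifulSoup(html, 'html.parser').find_all('tile-dynamic')
example_titles = [tile_title(t) for t in tiles]
print(example_titles)
```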


### Cleaning the Data (Trimming the List)
The first tiles on the page are not always part of the main listing; they can be placeholders or promotional content. Here we drop the first 10 tiles so that the list starts with the main movie listings.

In [15]:
# Trim the first 10 elements which may be placeholder/ad content
mylist = mylist[10:]

print(f"Trimmed movie list to {len(mylist)} tiles.")

Trimmed movie list to 117 tiles.
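
Slicing by a fixed count depends on the page layout at the time of the scrape. A possible alternative, which is not this notebook's method, is to filter tiles by attribute. The sketch below assumes the tiles to drop are exactly those marked data-qa="tile", as the first tiles printed earlier are; that assumption would need checking against the live page.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one tile of each kind seen in this notebook's output.
html = """
<tile-dynamic data-qa="tile"><span class="p--small">Top tile</span></tile-dynamic>
<tile-dynamic isvideo="true" skeleton="panel"><span class="p--small">Listing tile</span></tile-dynamic>
"""
all_tiles = BeautifulSoup(html, 'html.parser').find_all('tile-dynamic')

# Assumption: tiles to drop are marked data-qa="tile"; verify on the real page.
kept = [t for t in all_tiles if t.get('data-qa') != 'tile']
print(len(all_tiles), len(kept))
```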


In [16]:
# Display the trimmed list.
# The output is very long, so comment out the line below and rerun to hide it.
mylist

[<tile-dynamic isvideo="true" skeleton="panel">
 <rt-img alt="HIM" class="posterImage" loading="lazy" slot="image" src="https://resizing.flixster.com/rsOzN-h_vtInSpz4fVPtEo5BBU0=/206x305/v2/https://resizing.flixster.com/p3353cv-SS1XX2iMINFj9ZMUtVM=/ems.cHJkLWVtcy1hc3NldHMvbW92aWVzLzA5Y2QxMGRmLWRmNjUtNGVlOC1iM2VjLTRjNGIzMWMzMGQ1NS5qcGc="></rt-img>
 <button class="transparent" data-ems-id="15787e2f-4718-483b-9e45-3596ad9dfd5c" data-mpx-id="2443233347990" data-public-id="ACOpLuZr5I5V" data-type="Movie" data-videoplayeroverlaymanager="btnVideo:click" slot="imageAction">
 <span class="sr-only">Watch the trailer for HIM</span>
 </button>
 <a data-qa="discovery-media-list-item-caption" data-track="scores" href="/m/him_2025_2" slot="caption">
 <score-pairs-deprecated>
 <score-icon-critics certified="false" sentiment="negative" size="1" slot="criticsScoreIcon"></score-icon-critics>
 <rt-text context="label" size="1" slot="criticsScore"> 31%</rt-text>
 <score-icon-audience certified="false" sent

---

## Phase 3: Data Extraction

### Extract All Movie Titles
Now that we know the path to the title, we use a list comprehension to efficiently extract the title from every element in our cleaned mylist.

In [17]:
# List comprehension to extract and clean the title from every movie tile
# .text is more robust than .string, and .strip() removes leading/trailing whitespace
titles = [m.find_all('span', 'p--small')[0].text.strip() for m in mylist]

print(f"Extracted {len(titles)} movie titles.")
print("Example Titles:")
print(titles[:5]) # Display the first 5 titles

Extracted 117 movie titles.
Example Titles:
['HIM', 'The Conjuring: Last Rites', 'Downton Abbey: The Grand Finale', 'The Long Walk', 'A Big Bold Beautiful Journey']


### Extract Score Attributes (Critics Example)
Some score details are stored in HTML attributes rather than in tag text. We examine one tile to find the attribute (sentiment) that stores the score category.

In [18]:
# Isolate a specific tile for attribute checking
example_tile = mylist[0]

# The 'score-icon-critics' tag holds the sentiment in an attribute named 'sentiment'
critic_sentiment_example = example_tile.find_all('score-icon-critics')[0]['sentiment']

print(f"Example Critic Sentiment (Attribute): {critic_sentiment_example}")

Example Critic Sentiment (Attribute): negative
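
A short sketch of the ways to read attributes from a tag, using a single element copied in simplified form from the tile output (the name `icon` is just for illustration). Square-bracket access raises a KeyError when the attribute is absent, while `.get()` returns None.

```python
from bs4 import BeautifulSoup

html = '<score-icon-critics certified="false" sentiment="negative" size="1"></score-icon-critics>'
icon = BeautifulSoup(html, 'html.parser').find('score-icon-critics')

# Dictionary-style access; raises KeyError if the attribute is missing.
print(icon['sentiment'])

# .get() returns None instead of raising when the attribute is missing.
print(icon.get('missing'))

# .attrs exposes all attributes as a dictionary.
print(icon.attrs)
```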


### Extract All Score Components
We repeat the list comprehension process for all required score components (Certified status, Score value, and Sentiment) for both Critics and Audience.

In [19]:
# Audience Data Extraction
audiencecertified = [m.find_all('score-icon-audience')[0]['certified'] for m in mylist]
audiencescore = [m.find_all('rt-text')[1].text.strip() for m in mylist]
audiencesentiment = [m.find_all('score-icon-audience')[0]['sentiment'] for m in mylist]

# Critics Data Extraction
criticscertified = [m.find_all('score-icon-critics')[0]['certified'] for m in mylist]
criticsscore = [m.find_all('rt-text')[0].text.strip() for m in mylist]
criticssentiment = [m.find_all('score-icon-critics')[0]['sentiment'] for m in mylist]

print("All six data lists (Scores, Sentiment, Certified status) successfully extracted.")

All six data lists (Scores, Sentiment, Certified status) successfully extracted.
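
Six separate comprehensions only stay aligned if every tile contains every element. One alternative, sketched here for the critics fields only, is to build one dictionary per tile so that a missing element becomes None instead of raising an error. `critics_record` is a hypothetical helper, the HTML is illustrative, and selecting the score by slot="criticsScore" relies on the attribute visible in the tile output above.

```python
from bs4 import BeautifulSoup

def critics_record(tile):
    """Collect critics fields from one tile; missing pieces become None."""
    icon = tile.find('score-icon-critics')
    score = tile.find('rt-text', attrs={'slot': 'criticsScore'})
    return {
        'critics_certified': icon.get('certified') if icon is not None else None,
        'critics_sentiment': icon.get('sentiment') if icon is not None else None,
        'critics_score': score.get_text(strip=True) if score is not None else None,
    }

# Illustrative markup: one complete tile and one empty tile.
html = """
<tile-dynamic>
  <score-icon-critics certified="false" sentiment="negative" slot="criticsScoreIcon"></score-icon-critics>
  <rt-text slot="criticsScore"> 31%</rt-text>
</tile-dynamic>
<tile-dynamic></tile-dynamic>
"""
tiles = BeautifulSoup(html, 'html.parser').find_all('tile-dynamic')
records = [critics_record(t) for t in tiles]
print(records)
```

A list of such dictionaries can be passed directly to `pd.DataFrame(records)`.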


---

## Phase 4: Data Consolidation and Export

### Create the Final DataFrame
We combine all the parallel lists into a single Pandas DataFrame.

In [20]:
rt_data = pd.DataFrame({
    'title': titles,
    'audience_certified': audiencecertified,
    'audience_score': audiencescore,
    'audience_sentiment': audiencesentiment,
    'critics_certified': criticscertified,
    'critics_score': criticsscore,
    'critics_sentiment': criticssentiment
})

print(f"DataFrame created with {len(rt_data)} rows.")

DataFrame created with 117 rows.


### View the Final DataFrame

In [21]:
# Display the resulting DataFrame
display(rt_data)

| | title | audience_certified | audience_score | audience_sentiment | critics_certified | critics_score | critics_sentiment |
|---|---|---|---|---|---|---|---|
| 0 | HIM | false | 58% | negative | false | 31% | negative |
| 1 | The Conjuring: Last Rites | false | 78% | positive | false | 59% | negative |
| 2 | Downton Abbey: The Grand Finale | true | 96% | positive | true | 92% | positive |
| 3 | The Long Walk | false | 85% | positive | true | 88% | positive |
| 4 | A Big Bold Beautiful Journey | false | 59% | negative | false | 37% | negative |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | Trade Secret | | | | false | | |
| 113 | Viva Verdi! | | | | false | | |
| 114 | Norita | | | | false | | |
| 115 | Peas and Carrots | | | | false | | |
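
The scraped values are all strings, including the scores and the true/false flags. If numeric analysis is needed later, a conversion along the following lines may help. This is a sketch on a two-row sample; the frame `demo` and the new column names are made up for illustration.

```python
import pandas as pd

# Two-row sample in the same shape as the scraped data (illustrative only).
demo = pd.DataFrame({
    'title': ['HIM', 'Trade Secret'],
    'critics_certified': ['false', 'false'],
    'critics_score': ['31%', ''],
})

# Strip the percent sign and convert; values that cannot be parsed become NaN.
demo['critics_score_num'] = pd.to_numeric(
    demo['critics_score'].str.rstrip('%'), errors='coerce')

# Map the 'true'/'false' strings to booleans.
demo['critics_certified_bool'] = demo['critics_certified'].map(
    {'true': True, 'false': False})

print(demo[['title', 'critics_score_num', 'critics_certified_bool']])
```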


### Extract Movie URLs
As an extra step, we can extract each movie's relative URL, which sits in a separate <a> tag. Collecting links this way is the starting point for a web crawler that follows URLs to traverse a whole site.

In [22]:
# Find the <a> tag which contains the link, identified by its 'data-qa' attribute
mylist_links = mysoup.find_all('a', attrs = {'data-qa':"discovery-media-list-item-caption"})

In [23]:
# Extract the 'href' attribute and prepend the base URL
urls = ['https://www.rottentomatoes.com' + m['href'] for m in mylist_links]
urls

['https://www.rottentomatoes.com/m/him_2025_2',
 'https://www.rottentomatoes.com/m/the_conjuring_last_rites',
 'https://www.rottentomatoes.com/m/downton_abbey_the_grand_finale',
 'https://www.rottentomatoes.com/m/the_long_walk_2025',
 'https://www.rottentomatoes.com/m/a_big_bold_beautiful_journey',
 'https://www.rottentomatoes.com/m/toy_story',
 'https://www.rottentomatoes.com/m/weapons',
 'https://www.rottentomatoes.com/m/freakier_friday',
 'https://www.rottentomatoes.com/m/the_bad_guys_2',
 'https://www.rottentomatoes.com/m/hamilton_2020',
 'https://www.rottentomatoes.com/m/apollo_13',
 'https://www.rottentomatoes.com/m/the_fantastic_four_first_steps',
 'https://www.rottentomatoes.com/m/caught_stealing',
 'https://www.rottentomatoes.com/m/the_history_of_sound',
 'https://www.rottentomatoes.com/m/the_roses',
 'https://www.rottentomatoes.com/m/jurassic_world_rebirth',
 'https://www.rottentomatoes.com/m/superman_2025',
 'https://www.rottentomatoes.com/m/the_baltimorons',
 'https://www.r
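
String concatenation works here because every href starts with a slash. As a slightly more general sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against a base URL; the two relative links below are taken from the output above.

```python
from urllib.parse import urljoin

base = 'https://www.rottentomatoes.com'
relative_links = ['/m/him_2025_2', '/m/the_long_walk_2025']

# urljoin resolves each relative path against the base URL.
full_urls = [urljoin(base, href) for href in relative_links]
print(full_urls)
```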

### Export to CSV
This is the final Load (L) phase.

In [24]:
# Define the output filename
output_filename = 'rotten_tomatoes_movies_in_theaters.csv'

# Export the DataFrame to a CSV file without the pandas index
rt_data.to_csv(output_filename, index=False)

print("--- Load Phase Complete ---")
print(f"Clean, scraped movie data successfully exported to: {output_filename}")

--- Load Phase Complete ---
Clean, scraped movie data successfully exported to: rotten_tomatoes_movies_in_theaters.csv
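
A quick way to confirm an export is to read the file back and compare its shape and columns. The sketch below does this on a small sample frame written to a temporary directory, so it does not touch the real output file; the names `demo` and `reloaded` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Small sample frame standing in for the scraped data (illustrative only).
demo = pd.DataFrame({'title': ['HIM', 'The Long Walk'],
                     'critics_score': ['31%', '88%']})

# Write to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.csv')
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```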
