# Web Scraping Popular Movies using BeautifulSoup
 
![](https://imgur.com/JAYEfY3.png)

The **project idea** is to use web scraping to curate a list of popular movies to watch. The data comes from the TMdb popular movies page: https://www.themoviedb.org/movie

**Web Scraping** is the process of gathering useful information from the web and drawing meaningful insights from it. In a way, web scraping automates the process of data collection.

**Note: Web scraping code depends on the structure of the web page. If the structure changes, your code needs to be updated too!**


**Python** offers a variety of libraries to scrape the web. If you are starting out with web scraping, Beautiful Soup is an easy place to begin.

We’ll be using the packages:
* **Requests** — for downloading the HTML code from the TMdb URL
* **BeautifulSoup4** — for extracting data from the HTML string
* **Pandas** — to gather data into a dataframe for further processing



Let's see an outline of the steps we'll follow:
1. Load the TMdb movie web page https://www.themoviedb.org/movie using `Requests`.
2. Parse the HTML web page using BeautifulSoup.
3. Extract the list of movies from the landing page. For each movie, we'll get the movie name, user rating and the movie page URL.
4. For each movie, we'll then grab the release date, genres, duration and director.
5. Compile the extracted movie details into Python lists and dictionaries.
6. Extend the above logic to scrape multiple pages.
7. Finally, save all the movie information into a CSV file.

The CSV file will have the following format:

```
name,rating,genre,release_date,runtime,director,url
Mortal Kombat,80,"Fantasy, Action, Adventure, Science Fiction, Thriller",04/23/2021,1h 50m,Lewis Tan,https://www.themoviedb.org/movie/460465
Godzilla vs. Kong,82.0,"Science Fiction, Action",03/31/2021,1h 53m,Alexander Skarsgård,https://www.themoviedb.org/movie/399566
Nobody,85.0,"Action, Thriller, Crime",03/26/2021,1h 32m,Bob Odenkirk,https://www.themoviedb.org/movie/615457
Zack Snyder's Justice League,85.0,"Action, Adventure, Fantasy, Science Fiction",03/18/2021,4h 2m,Ben Affleck,https://www.themoviedb.org/movie/791373
```
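
Because the `genre` field itself contains commas, it is wrapped in double quotes in the file. As a minimal sketch (using Python's built-in `csv` module rather than pandas, and an in-memory string standing in for the real file), this is how one such row parses back into seven fields:

```python
import csv
import io

# Sample text in the format described above (held in memory, not read from disk)
sample = (
    'name,rating,genre,release_date,runtime,director,url\n'
    'Nobody,85.0,"Action, Thriller, Crime",03/26/2021,1h 32m,Bob Odenkirk,'
    'https://www.themoviedb.org/movie/615457\n'
)

# DictReader uses the header row as keys and respects the quoted genre field
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]['genre'])   # The genre list stays together as one field
print(len(rows[0]))       # Number of columns in the row
```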

### Installing the Libraries
Let’s start by installing the required packages.

In [1]:
# Install the beautifulsoup4 package, which provides the bs4 module
!pip install beautifulsoup4==4.9.3 --upgrade --quiet

Let's import the necessary packages

In [2]:
# Let's import necessary packages 
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Load the Webpage using Requests

The landing page of the TMdb movies section lists popular movies. We can click on each movie item to navigate to the individual movie page and get more details.

Each page contains 20 movies. From the landing page, we will parse the list of movies, user ratings, and movie URLs. In the browser, further pages are loaded with the ‘Load More’ button; in code, we will request them through the `page` query parameter of the URL.

In [27]:
# TMdb movie URL 
tmdb_movies_url = 'https://www.themoviedb.org/movie'

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
        'referer':'https://www.themoviedb.org/movie'}

In [4]:
# The movie page is downloaded using `requests`
response = requests.get(tmdb_movies_url, headers=headers)

In [5]:
# Check if the request was successful 
response.status_code

200

The status code `200` confirms that the request was successful.

In [7]:
page_contents = response.text
page_contents[:200]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta http-e'

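The output above shows the first 200 characters of the HTML code of the TMdb web page. Let's now write `page_contents` into a file.

In [8]:
# Write the file as UTF-8 so non-ASCII characters in the page are saved correctly
with open('tmdb_movie.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)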

In [9]:
doc = BeautifulSoup(page_contents, 'html.parser')

The HTML page content is extracted using BeautifulSoup into `doc`.
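
To get a feel for the parsed object, here is a small sketch that parses a hand-written HTML string (modeled on the first few characters of the page shown earlier, not the full TMdb page) and reads a tag and an attribute from it. The name `sample_doc` is only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page, not the real TMdb HTML
sample_html = (
    '<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n'
    '    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n'
    '  </head>\n  <body></body>\n</html>'
)

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# `find` returns the first matching tag (or None if nothing matches)
title_tag = sample_doc.find('title')
print(title_tag.text)

# Attributes of a tag are read like dictionary keys
print(sample_doc.find('html')['lang'])
```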

Let us create a function to perform the above.

In [10]:
def get_movies_page(movies_url):
    """Download a web page using `requests` and parse it with BeautifulSoup.

    Raises an exception if the status code shows the call was unsuccessful.
    """
    # Access the webpage using `requests`
    response = requests.get(movies_url, headers=headers)

    # Check if the request was successful
    if response.status_code != 200:
        raise Exception(f'Failed to load page {movies_url}')

    # Parse the response text using BeautifulSoup
    movies_doc = BeautifulSoup(response.text, 'html.parser')

    return movies_doc
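
A single failed request stops the whole scrape with the function above. As a hedged sketch of one way to soften that, the helper below retries a few times with a pause in between. `fetch_with_retries` and its parameters are hypothetical names, and `session` is assumed to be something like `requests.Session()`; the helper does not catch network exceptions, only non-200 responses.

```python
import time

def fetch_with_retries(session, url, headers=None, retries=3, delay=1.0, timeout=10):
    """Return the page text, retrying a few times on a non-200 response.

    `session` is assumed to behave like `requests.Session()`: it needs a
    `get(url, headers=..., timeout=...)` method whose result has
    `status_code` and `text` attributes.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
        if attempt < retries:
            time.sleep(delay)  # Pause before trying again
    raise Exception(f'Failed to load page {url} (last status: {last_status})')
```

With the objects defined earlier, a call could look like `fetch_with_retries(requests.Session(), tmdb_movies_url, headers=headers)`, and the returned text can be passed to `BeautifulSoup` as before.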

### Inspect the Web page

Chrome users can right-click on the page and choose “Inspect” to examine the HTML code behind it. A panel will appear at the bottom or right side of the page (depending on the settings) with a long list of nested HTML tags. To find the tag associated with a piece of information, select it on the page (e.g. a movie name), right-click and choose “Inspect” again; the matching element is then highlighted in blue. You can now click through the HTML tags to find the one that holds the item of interest, here the movie name.

As we see in the image below, the movie names are embedded in the `h2` tags.

![](https://imgur.com/XzQ6OYC.png)

We can use `h2.a.text.strip()` to retrieve the name of a movie; the cells below read the same name from the link's `title` attribute. Note that we need to skip the first four `h2` tags, as those do not contain movie names.

In [12]:
print(doc.find_all('h2')[4:])

[<h2><a href="/movie/848278" title="Jurassic Hunt">Jurassic Hunt</a></h2>, <h2><a href="/movie/619778" title="Malignant">Malignant</a></h2>, <h2><a href="/movie/566525" title="Shang-Chi and the Legend of the Ten Rings">Shang-Chi and the Legend of the Ten Rings</a></h2>, <h2><a href="/movie/436969" title="The Suicide Squad">The Suicide Squad</a></h2>, <h2><a href="/movie/588228" title="The Tomorrow War">The Tomorrow War</a></h2>, <h2><a href="/movie/585216" title="Escape Room: Tournament of Champions">Escape Room: Tournament of Champions</a></h2>, <h2><a href="/movie/595743" title="SAS: Red Notice">SAS: Red Notice</a></h2>, <h2><a href="/movie/482373" title="Don't Breathe 2">Don't Breathe 2</a></h2>, <h2><a href="/movie/451048" title="Jungle Cruise">Jungle Cruise</a></h2>, <h2><a href="/movie/497698" title="Black Widow">Black Widow</a></h2>, <h2><a href="/movie/675445" title="PAW Patrol: The Movie">PAW Patrol: The Movie</a></h2>, <h2><a href="/movie/619297" title="Sweet Girl">Sweet Girl

In [13]:
movies_names_tags=doc.find_all('h2')[4:]
names=[]
for h2 in movies_names_tags:
    names.append(h2.a['title'])
print(names)

['Jurassic Hunt', 'Malignant', 'Shang-Chi and the Legend of the Ten Rings', 'The Suicide Squad', 'The Tomorrow War', 'Escape Room: Tournament of Champions', 'SAS: Red Notice', "Don't Breathe 2", 'Jungle Cruise', 'Black Widow', 'PAW Patrol: The Movie', 'Sweet Girl', 'Infinite', 'Space Jam: A New Legacy', 'Cinderella', 'The Boss Baby: Family Business', 'Luca', 'Breathless', 'F9', 'After We Fell']
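
Skipping the first four `h2` tags by position breaks as soon as the page gains or loses a heading. One alternative sketch is to keep only the `h2` tags whose link points at a `/movie/...` path. The HTML below is a hand-written stand-in with the same `h2 > a` shape as the output above; the plain `<h2>` without a link is an assumption about what the skipped headings look like, and `movie_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample, not the real page
sample_html = """
<h2>Popular Movies</h2>
<h2><a href="/movie/848278" title="Jurassic Hunt">Jurassic Hunt</a></h2>
<h2><a href="/movie/619778" title="Malignant">Malignant</a></h2>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

def movie_links(doc):
    """Return the <a> tags inside <h2> tags whose href starts with /movie/."""
    links = []
    for h2 in doc.find_all('h2'):
        a = h2.find('a', href=True)
        if a is not None and a['href'].startswith('/movie/'):
            links.append(a)
    return links

sample_names = [a['title'] for a in movie_links(sample_doc)]
print(sample_names)
```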


Similarly, we can extract the movie links.

In [14]:
urls=[]
for h2 in movies_names_tags:
    urls.append(h2.a['href'])
print(urls)

['/movie/848278', '/movie/619778', '/movie/566525', '/movie/436969', '/movie/588228', '/movie/585216', '/movie/595743', '/movie/482373', '/movie/451048', '/movie/497698', '/movie/675445', '/movie/619297', '/movie/581726', '/movie/379686', '/movie/593910', '/movie/459151', '/movie/508943', '/movie/860425', '/movie/385128', '/movie/744275']


Let's create functions to extract the movies names and movie URLs.

In [16]:
def get_movies_names(doc):
    """Extract the movie names from the parsed HTML using BeautifulSoup."""
    movies_names = []
    # The first four `h2` tags are not movies, so skip them
    movies_names_tags = doc.find_all('h2')[4:]
    # Loop through the tags and collect each movie name
    for h2 in movies_names_tags:
        movies_names.append(h2.a['title'])
    return movies_names

`get_movies_names` can be used to get the list of popular movie names.

In [17]:
# Get the popular movie list from the webpage using the BeautifulSoup object `doc`. 
get_movies_names(doc)

['Jurassic Hunt',
 'Malignant',
 'Shang-Chi and the Legend of the Ten Rings',
 'The Suicide Squad',
 'The Tomorrow War',
 'Escape Room: Tournament of Champions',
 'SAS: Red Notice',
 "Don't Breathe 2",
 'Jungle Cruise',
 'Black Widow',
 'PAW Patrol: The Movie',
 'Sweet Girl',
 'Infinite',
 'Space Jam: A New Legacy',
 'Cinderella',
 'The Boss Baby: Family Business',
 'Luca',
 'Breathless',
 'F9',
 'After We Fell']

The above shows the list of movies on the landing page of the TMdb movie web page.

Similarly, let's define functions for movie user ratings and URLs.

The user ratings are stored in the `data-percent` attribute of the `div` tags with the `user_score_chart` class, as shown below.

![](https://imgur.com/WqCIgES.png)

In [19]:
print(doc.find_all('div', {'class': 'user_score_chart'}))

[<div class="user_score_chart 60e49c9c6bdec300460a32fe" data-bar-color="#d2d531" data-percent="48.0" data-track-color="#423d0f">
<div class="percent">
<span class="icon icon-r48"></span>
</div>
</div>, <div class="user_score_chart 5d424c8395c0af0014d8113a" data-bar-color="#21d07a" data-percent="73.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r73"></span>
</div>
</div>, <div class="user_score_chart 5c05f27cc3a3685c370d0750" data-bar-color="#21d07a" data-percent="79.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r79"></span>
</div>
</div>, <div class="user_score_chart 5886bfcd925141197d001333" data-bar-color="#21d07a" data-percent="79.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r79"></span>
</div>
</div>, <div class="user_score_chart 5c8abc659251415249c0791b" data-bar-color="#21d07a" data-percent="81.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r81"></span>
</div>
</

In [20]:
def get_movies_rating(doc):
    """Extract the movie user ratings from the parsed HTML using BeautifulSoup."""
    movies_rating = []
    # Loop through the webpage to get the ratings of all the movies in the page
    movies_rating_tags = doc.find_all('div', {'class': 'user_score_chart'})
    for tag in movies_rating_tags:
        movies_rating.append(tag['data-percent'])
    return movies_rating

In [21]:
# Get the rating of each movie in the webpage using the BeautifulSoup object `doc`.
get_movies_rating(doc)

['48.0',
 '73.0',
 '79.0',
 '79.0',
 '81.0',
 '71.0',
 '59.0',
 '77.0',
 '79.0',
 '78.0',
 '80',
 '69.0',
 '75.0',
 '74.0',
 '68.0',
 '78.0',
 '81.0',
 '59.0',
 '75.0',
 '84.0']

The above shows the user ratings for movies in the landing page of the TMdb movie web page. 
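
Note that the ratings come back as strings, since HTML attribute values are text. If numeric ratings are needed later (for sorting or averaging, say), a small conversion step helps. This is a minimal sketch; `to_number` is a hypothetical helper, and returning `None` for non-numeric values is one possible choice.

```python
def to_number(rating):
    """Convert a rating string such as '48.0' or '80' to a float, or None if it is not numeric."""
    try:
        return float(rating)
    except (TypeError, ValueError):
        return None

# A few of the values from the output above
sample_ratings = ['48.0', '73.0', '80']
numeric_ratings = [to_number(r) for r in sample_ratings]
print(numeric_ratings)
```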

Each movie URL can be built by prepending the base URL https://www.themoviedb.org to the relative link in `.a['href']`.

![](https://imgur.com/8D8DYAq.png)

In [75]:
def get_movies_urls(doc):
    """Extract the movie links from the parsed HTML using BeautifulSoup."""
    movies_urls = []
    base_url = 'https://www.themoviedb.org'
    # The first four `h2` tags are not movies, so skip them
    movies_names_tags = doc.find_all('h2')[4:]
    # Loop through the tags to build the full URL of each movie
    for h2 in movies_names_tags:
        movies_urls.append(base_url + h2.a['href'])
    return movies_urls

In [76]:
# Get the URL of each movie in the webpage using the BeautifulSoup object `doc`.
get_movies_urls(doc)

['https://www.themoviedb.org/movie/848278',
 'https://www.themoviedb.org/movie/619778',
 'https://www.themoviedb.org/movie/566525',
 'https://www.themoviedb.org/movie/436969',
 'https://www.themoviedb.org/movie/588228',
 'https://www.themoviedb.org/movie/585216',
 'https://www.themoviedb.org/movie/595743',
 'https://www.themoviedb.org/movie/482373',
 'https://www.themoviedb.org/movie/451048',
 'https://www.themoviedb.org/movie/497698',
 'https://www.themoviedb.org/movie/675445',
 'https://www.themoviedb.org/movie/619297',
 'https://www.themoviedb.org/movie/581726',
 'https://www.themoviedb.org/movie/379686',
 'https://www.themoviedb.org/movie/593910',
 'https://www.themoviedb.org/movie/459151',
 'https://www.themoviedb.org/movie/508943',
 'https://www.themoviedb.org/movie/860425',
 'https://www.themoviedb.org/movie/385128',
 'https://www.themoviedb.org/movie/744275']
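
Plain string concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` resolves a link against a base URL and also copes with links that are already absolute:

```python
from urllib.parse import urljoin

base_url = 'https://www.themoviedb.org'

# A relative href from the page is resolved against the base URL
relative_result = urljoin(base_url, '/movie/848278')
print(relative_result)

# An href that is already absolute is returned unchanged
absolute_result = urljoin(base_url, 'https://www.themoviedb.org/movie/619778')
print(absolute_result)
```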

By now we have movie names, user rating, and the movie URLs for the first page.

Let’s first consider a sample movie web page: Godzilla vs. Kong and see how we parse HTML tags to get additional information like release date, genre, runtime, and director of each of the movies.

![](https://imgur.com/N1wxPw8.png)

To read additional movie information, let's create a function that can accept a movie url. 

In [77]:
# Let's read a movie page
def get_detailed_movie_page(movies_url):
    # Access the webpage using `requests`
    response = requests.get(movies_url, headers=headers)

    # Check if the request was successful
    if response.status_code != 200:
        raise Exception(f'Failed to load page {movies_url}')

    # Parse the response text using BeautifulSoup
    movies_doc = BeautifulSoup(response.text, 'html.parser')

    return movies_doc

In [78]:
doc1 = get_detailed_movie_page('https://www.themoviedb.org/movie/399566')

We have the HTML source code in the BeautifulSoup object `doc1`.

In [40]:
print(doc1.find_all('div', {'class':'facts'}))

[<div class="facts">
<span class="certification">
        PG-13
      </span>
<span class="release">
        03/31/2021 (US)
      </span>
<span class="genres">
<a href="/genre/28-action/movie">Action</a>, <a href="/genre/12-adventure/movie">Adventure</a>, <a href="/genre/14-fantasy/movie">Fantasy</a>
</span>
<span class="runtime">
      
        1h 53m
      
    </span>
</div>]


In [46]:
add_info=doc1.find('div', {'class':'facts'})
print(add_info.text.split()[:])

['PG-13', '03/31/2021', '(US)', 'Action,', 'Adventure,', 'Fantasy', '1h', '53m']


In [47]:
# Find the `div` tag with the `facts` class to get the release date, genre and runtime
add_info = doc1.find('div', {'class':'facts'})
facts = add_info.text.split()  # Split the text on whitespace
release_date = facts[1]        # Second item, right after the certification
genre = facts[3:-2]            # Everything between the country code and the runtime
runtime = facts[-2:]           # Last two items, e.g. '1h' and '53m'
# Print and validate the result is correct
print(release_date, genre, runtime)

03/31/2021 ['Action,', 'Adventure,', 'Fantasy'] ['1h', '53m']


In [50]:
# Find the `li` tags with the `profile` class, which list the people credited on the page
d_tags = doc1.find_all('li',{'class':'profile'})

# The first line of the first profile's text is taken as the director; print and validate the result
print (d_tags[0].text.strip().partition("\n")[0])

Adam Wingard


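The `div` tag with class `facts` contains the release date, genre and runtime details, and the first `li` tag with class `profile` gives the director. Let's combine both into one function.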

In [79]:
def get_movies_info(doc):
    """
    Function to get the movie information -
    release date, genre, runtime and director.
    """
    add_info = doc.find('div', {'class':'facts'})
    facts = add_info.text.split()
    release_date = facts[1]
    genre = facts[3:-2]
    runtime = facts[-2:]
    d_tags = doc.find_all('li',{'class':'profile'})

    # The first line of the first profile's text is taken as the director
    director = d_tags[0].text.strip().partition("\n")[0]
    return release_date, genre, runtime, director

In [64]:
# Call the `get_movies_info` for movie `Godzilla vs. Kong`.
get_movies_info(doc1) 

('03/31/2021',
 ['Action,', 'Adventure,', 'Fantasy'],
 ['1h', '53m'],
 'Adam Wingard')
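
Splitting the whole `facts` text on whitespace relies on fixed positions: if a movie has no certification, for example, every index shifts by one. A hedged alternative sketch is to read each `span` by its class, using the class names visible in the HTML printed above. The sample HTML is copied from that output (with whitespace trimmed), and `parse_facts` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Sample copied from the `facts` div shown earlier, not fetched live
facts_html = """
<div class="facts">
<span class="certification">PG-13</span>
<span class="release">
        03/31/2021 (US)
      </span>
<span class="genres">
<a href="/genre/28-action/movie">Action</a>, <a href="/genre/12-adventure/movie">Adventure</a>, <a href="/genre/14-fantasy/movie">Fantasy</a>
</span>
<span class="runtime">
        1h 53m
    </span>
</div>
"""

def parse_facts(doc):
    """Read release date, genres and runtime from the spans inside the `facts` div."""
    facts = doc.find('div', class_='facts')
    release = facts.find('span', class_='release')
    genres = facts.find('span', class_='genres')
    runtime = facts.find('span', class_='runtime')
    return {
        # Keep only the date, dropping the country code such as '(US)'
        'release_date': release.get_text().split()[0] if release is not None else None,
        # One entry per genre link, so multi-word genres stay in one piece
        'genres': [a.get_text(strip=True) for a in genres.find_all('a')] if genres is not None else [],
        # Collapse the surrounding whitespace into single spaces
        'runtime': ' '.join(runtime.get_text().split()) if runtime is not None else None,
    }

sample_facts = parse_facts(BeautifulSoup(facts_html, 'html.parser'))
print(sample_facts)
```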

The above logic can be extended to get the release dates, genres, runtimes, and directors for all the URLs we have from the landing page.


In [80]:
def get_all_movies_details(urls):
    """Function to get the movie information as lists for all the given movie URLs."""
    genres = []
    release_dates = []
    runtimes = []
    directors = []

    # Loop through all the URLs of the movies
    for url in urls:
        movie_doc = get_detailed_movie_page(url)

        # get_movies_info returns release_date, genre, runtime, director.
        release_date, genre, runtime, director = get_movies_info(movie_doc)
        # Convert the genre and runtime lists to space-separated strings
        genres.append(" ".join(genre))
        release_dates.append(release_date)
        runtimes.append(" ".join(runtime))
        directors.append(director)

    return genres, release_dates, runtimes, directors
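
The runtimes are collected as strings such as `1h 53m`, which are easy to read but hard to sort or compare. As an optional sketch for later processing, the hypothetical helper below converts such a string to a number of minutes. The accepted formats (`1h 53m`, `2h`, `45m`) are an assumption based on the values seen in this notebook.

```python
import re

def runtime_to_minutes(runtime):
    """Convert strings like '1h 53m', '2h' or '45m' to minutes; return None if unrecognized."""
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*', runtime)
    # Reject text that does not fit the pattern or has neither hours nor minutes
    if match is None or not any(match.groups()):
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)

print(runtime_to_minutes('1h 53m'))
```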

We now have all the details that we wanted to retrieve from the TMdb web page: names, ratings, genres, release dates, runtimes, directors and URLs.

### Putting all the Pieces Together

We’ve got all the information in different pieces of our BeautifulSoup scraper. We need to assemble them into a single function and make it as reusable as possible.

I’ve used Python `Dictionary` to store the key-value pairs of the movie information. Later, I've copied the dictionary to `pandas DataFrame` to store the tabular movie information into rows and columns.

In [68]:
# Quick check: `+=` on a list appends the items of another list in place
a=[2]
a+=[3]
a

[2, 3]

In [81]:
def scrape_movies():
    """
    Function to scrape the popular movie listing pages of TMdb
    and return the movie details as a pandas DataFrame.
    """
    # Let's get the popular movies listing from the TMdb website
    page_count = 1 # Initializing the movie page count to 1
    # Define lists for all the movie attributes
    all_names = []
    all_ratings = []
    all_genres = []
    all_release_dates = []
    all_runtimes = []
    all_directors = []
    all_urls = []

    # `page_count < 2` scrapes only the first page;
    # raise the bound (e.g. `page_count < 8` for seven pages) to scrape more
    while page_count < 2:
        movies_url = f"https://www.themoviedb.org/movie?page={page_count}"
        # Download and parse the listing page (raises if the request failed)
        doc = get_movies_page(movies_url)

        urls = get_movies_urls(doc)
        print(urls)
        genres, release_dates, runtimes, directors = get_all_movies_details(urls)

        # Append each movie attribute to respective lists
        all_names += get_movies_names(doc)
        all_ratings += get_movies_rating(doc)
        all_genres += genres
        all_release_dates += release_dates
        all_runtimes += runtimes
        all_directors += directors
        all_urls += urls
        page_count += 1

    # Defining a dictionary to store the movie information
    movies_dict = {
        'name': all_names,
        'rating': all_ratings,
        'genre': all_genres,
        'release_date': all_release_dates,
        'runtime': all_runtimes,
        'director': all_directors,
        'url': all_urls
    }
    return pd.DataFrame(movies_dict)
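
Building a DataFrame from a dictionary of lists only works when every list has the same length; pandas raises an error otherwise, but without saying which list is short. As a hedged sketch, the hypothetical helper below checks the lengths first and then turns the columns into one dictionary per movie. The sample values are taken from the outputs shown earlier.

```python
def build_records(columns):
    """Turn a dict of equal-length lists into a list of per-movie dicts.

    Raises ValueError naming the column lengths if the lists differ in
    length, which would otherwise misalign movies and their details.
    """
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    keys = list(columns)
    # zip(*...) walks the columns in parallel, producing one row per movie
    return [dict(zip(keys, row)) for row in zip(*columns.values())]

sample_columns = {
    'name': ['Jurassic Hunt', 'Malignant'],
    'rating': ['48.0', '73.0'],
    'director': ['Hank Braxtan', 'James Wan'],
}
records = build_records(sample_columns)
print(records[1])
```

A list of dictionaries like `records` can also be passed directly to `pd.DataFrame`.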

With the loop bound raised to `page_count < 8`, the scraper covers seven pages; since each page lists 20 movies, the output dataset then has 140 rows. The more movie listings you want, the more web pages you need to scrape.

Let's run the scraper and save the resulting dataframe to a `.csv` file.

In [82]:
# Invoke the scrape_movies function, which already returns a DataFrame
movies_df = scrape_movies()
movies_df.head() # View the first few rows of the output

['https://www.themoviedb.org/movie/848278', 'https://www.themoviedb.org/movie/619778', 'https://www.themoviedb.org/movie/566525', 'https://www.themoviedb.org/movie/436969', 'https://www.themoviedb.org/movie/588228', 'https://www.themoviedb.org/movie/585216', 'https://www.themoviedb.org/movie/595743', 'https://www.themoviedb.org/movie/482373', 'https://www.themoviedb.org/movie/451048', 'https://www.themoviedb.org/movie/497698', 'https://www.themoviedb.org/movie/675445', 'https://www.themoviedb.org/movie/619297', 'https://www.themoviedb.org/movie/581726', 'https://www.themoviedb.org/movie/379686', 'https://www.themoviedb.org/movie/593910', 'https://www.themoviedb.org/movie/459151', 'https://www.themoviedb.org/movie/508943', 'https://www.themoviedb.org/movie/860425', 'https://www.themoviedb.org/movie/385128', 'https://www.themoviedb.org/movie/744275']


| | name | rating | genre | release_date | runtime | director | url |
|---|---|---|---|---|---|---|---|
| 0 | Jurassic Hunt | 48.0 | Action, Science Fiction, Thriller | 09/01/2021 | 1h 23m | Hank Braxtan | https://www.themoviedb.org/movie/848278 |
| 1 | Malignant | 73.0 | Horror, Thriller, Mystery, Crime | 09/10/2021 | 1h 51m | James Wan | https://www.themoviedb.org/movie/619778 |
| 2 | Shang-Chi and the Legend of the Ten Rings | 79.0 | Action, Adventure, Fantasy | 09/03/2021 | 2h 12m | Destin Daniel Cretton | https://www.themoviedb.org/movie/566525 |
| 3 | The Suicide Squad | 79.0 | Action, Adventure, Fantasy, Comedy | 08/06/2021 | 2h 12m | James Gunn | https://www.themoviedb.org/movie/436969 |
| 4 | The Tomorrow War | 81.0 | Action, Science Fiction, Adventure | 07/02/2021 | 2h 18m | Chris McKay | https://www.themoviedb.org/movie/588228 |


In [83]:
# Save the dataset to `.csv` format, without writing the row index as a column
movies_df.to_csv('movies.csv', index=False)

We can check that the CSV was created properly by reading the csv file using `pandas`.

In [84]:
df = pd.read_csv('movies.csv')
df.head()

| | name | rating | genre | release_date | runtime | director | url |
|---|---|---|---|---|---|---|---|
| 0 | Jurassic Hunt | 48.0 | Action, Science Fiction, Thriller | 09/01/2021 | 1h 23m | Hank Braxtan | https://www.themoviedb.org/movie/848278 |
| 1 | Malignant | 73.0 | Horror, Thriller, Mystery, Crime | 09/10/2021 | 1h 51m | James Wan | https://www.themoviedb.org/movie/619778 |
| 2 | Shang-Chi and the Legend of the Ten Rings | 79.0 | Action, Adventure, Fantasy | 09/03/2021 | 2h 12m | Destin Daniel Cretton | https://www.themoviedb.org/movie/566525 |
| 3 | The Suicide Squad | 79.0 | Action, Adventure, Fantasy, Comedy | 08/06/2021 | 2h 12m | James Gunn | https://www.themoviedb.org/movie/436969 |
| 4 | The Tomorrow War | 81.0 | Action, Science Fiction, Adventure | 07/02/2021 | 2h 18m | Chris McKay | https://www.themoviedb.org/movie/588228 |


In [None]:
df.shape

### Summary

1. Downloaded the TMdb movie web page using `Requests`.
2. Parsed the HTML and extracted the movie details using BeautifulSoup (bs4).
3. Extracted all the movie information: movie name, user rating, release date, genre, duration, director and URL.
4. Compiled the movie information into Python lists and a pandas DataFrame.
5. Extended the extraction to multiple pages.
6. Saved the dataset in `.csv` format.