In [2]:
# check python version
import sys

print(sys.version)

3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]


In [3]:
# check your library list
!pip list

Package                       Version
----------------------------- --------------------
alabaster                     0.7.12
anaconda-client               1.11.0
anaconda-navigator            2.4.0
anaconda-project              0.11.1
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.2
astroid                       2.11.7
astropy                       5.1
async-generator               1.10
atomicwrites                  1.4.0
attrs                         21.4.0
Automat                       20.2.0
autopep8                      1.6.0
Babel                         2.9.1
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile            1.0
backports.weakref             1.0.post1
bcrypt                        3.2.0
beautifulsoup4                4.11.1
binaryornot                   0.4.4
bitarray                      2.5.1
bkc

In [3]:
# install selenium
# pip install selenium # uncomment and run to install

In [4]:
# install beautifulsoup4
# pip install beautifulsoup4 # uncomment and run to install

In [4]:
'''
Open https://www.imdb.com/chart/top/, right-click the first title, and choose Inspect (or Inspect Element).
Press CTRL+F and search the HTML structure: you will see that there is only one <table>
tag on the page. This is useful because it tells us how we can access the data.

A CSS selector that gives us all of the titles on the page is table tbody tr td.titleColumn a.
That’s because every title is in an anchor inside a table cell with the class “titleColumn”.

Using this CSS selector and reading the innerText of each anchor gives us the titles that we need.
You can try this in the browser console of the developer tools panel you just opened: click the
Console tab and paste the JavaScript line
document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText

Now that we have this selector, we can start writing our Python code and extracting the information we need.

How to Use BeautifulSoup to Extract Statically Loaded Content
The movie titles in our list are static content. If you look at the page source
(CTRL+U on the page, or right-click and choose View Page Source), you will see that
the titles are already there.

Static content is usually easier to scrape because it doesn’t require JavaScript rendering.
To extract the first ten titles on the list, we will use BeautifulSoup to parse the content
and then print the titles in the output of our scraper.

The code below uses the selector we found in the first step to extract the movie title anchors from the page.
It then loops through the first ten and prints the text of each.
'''

import requests # import the library we need
from bs4 import BeautifulSoup
 
page = requests.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
soup = BeautifulSoup(page.content, 'html.parser') # Parsing content using beautifulsoup
 
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor

The Shawshank Redemption
The Godfather
The Dark Knight
The Godfather Part II
12 Angry Men
Schindler's List
The Lord of the Rings: The Return of the King
Pulp Fiction
The Lord of the Rings: The Fellowship of the Ring
Il buono, il brutto, il cattivo
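
To see what the selector table tbody tr td.titleColumn a picks out without sending a request, here is a minimal offline sketch. It is not the notebook's method: it swaps BeautifulSoup for the standard library's html.parser so it runs anywhere, and SAMPLE_HTML, TitleAnchorParser, and the movie names in it are made up to mimic the structure described above.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written HTML that mimics the table structure described above.
SAMPLE_HTML = """
<table><tbody>
  <tr>
    <td class="posterColumn"><a href="/title/tt0000001/">poster</a></td>
    <td class="titleColumn">1. <a href="/title/tt0000001/">First Movie</a> (1994)</td>
  </tr>
  <tr>
    <td class="titleColumn">2. <a href="/title/tt0000002/">Second Movie</a> (1972)</td>
  </tr>
</tbody></table>
"""


class TitleAnchorParser(HTMLParser):
    """Collect the text and href of every <a> inside a <td class="titleColumn">."""

    def __init__(self):
        super().__init__()
        self.in_title_cell = False  # True while inside td.titleColumn
        self.in_anchor = False      # True while inside an <a> within that cell
        self.titles = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            classes = (attrs.get("class") or "").split()
            self.in_title_cell = "titleColumn" in classes
        elif tag == "a" and self.in_title_cell:
            self.in_anchor = True
            self.hrefs.append(attrs.get("href"))
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        elif tag == "td":
            self.in_title_cell = False

    def handle_data(self, data):
        # Only text that sits directly inside a matching anchor is kept.
        if self.in_anchor:
            self.titles[-1] += data


parser = TitleAnchorParser()
parser.feed(SAMPLE_HTML)
titles = parser.titles
hrefs = parser.hrefs
print(titles)
print(hrefs)
```

The anchor in the posterColumn cell is skipped, which is why the selector includes the titleColumn class rather than matching every anchor in the table.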


In [5]:
'''
How to Extract Dynamically Loaded Content
As technology advanced, websites started to load their content dynamically. This improves the page’s
performance and the user’s experience, but it also adds an extra barrier for scrapers.

The HTML retrieved from a simple request will not contain the dynamic content. Fortunately, with
Selenium, we can load the page in a real browser and wait for the dynamic content to be displayed.

How to Use Selenium for Requests
You will need to know the location of your chromedriver. The following code does the same thing as the
code in the previous step, but this time we use Selenium to make the request. We will still
parse the page’s content using BeautifulSoup, as we did before.

Don’t forget to replace the chromedriver path in the code with the location where you extracted the chromedriver.
Also notice that when we create the BeautifulSoup object we now pass driver.page_source instead of
page.content; driver.page_source provides the HTML content of the page as the browser sees it.
'''
# import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# define options
option = webdriver.ChromeOptions()

# Of the three options below, I recommend using at least --headless
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')

# Replace the path below with your chromedriver location: locate chromedriver.exe, right-click it, and choose "Copy as path"
# Check your Chrome version under Settings > About Chrome, then download the matching chromedriver

# Selenium 4 takes the driver path through a Service object; passing the path positionally is deprecated
service = Service(r'C:\Users\wibow\OneDrive\Desktop\Web Scraping\chromedriver_win32\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=option)

driver.get('https://www.imdb.com/chart/top/') # Load the page in the headless browser
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parse the content with BeautifulSoup. Notice driver.page_source instead of page.content
 
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor


The Shawshank Redemption
The Godfather
The Dark Knight
The Godfather Part II
12 Angry Men
Schindler's List
The Lord of the Rings: The Return of the King
Pulp Fiction
The Lord of the Rings: The Fellowship of the Ring
Il buono, il brutto, il cattivo
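
A quick way to tell static content from dynamic content is the page-source check described earlier: search the unrendered HTML for a marker and see whether it is already there. The sketch below shows that check on two made-up strings; RAW_HTML, RENDERED_HTML, the "dynamic-section" marker, and is_static are all hypothetical stand-ins, not real IMDb markup.

```python
# Hypothetical HTML as a plain HTTP request would return it (before JavaScript runs).
RAW_HTML = (
    '<table><tbody><tr><td class="titleColumn">'
    '<a href="/title/tt0111161/">The Shawshank Redemption</a>'
    '</td></tr></tbody></table>'
    '<div id="lists"></div>'
)

# Hypothetical HTML after the browser has run the page's JavaScript,
# which fills in the empty placeholder <div>.
RENDERED_HTML = RAW_HTML.replace(
    '<div id="lists"></div>',
    '<div id="lists"><section data-testid="dynamic-section">Lists</section></div>',
)


def is_static(marker, html):
    """Return True when the marker text already appears in the given HTML."""
    return marker in html


title_is_static = is_static("The Shawshank Redemption", RAW_HTML)
section_in_raw = is_static("dynamic-section", RAW_HTML)
section_in_rendered = is_static("dynamic-section", RENDERED_HTML)

print(title_is_static)      # the title is present without rendering
print(section_in_raw)       # the dynamic section is missing from the raw HTML
print(section_in_rendered)  # and present once the page has been rendered
```

When a marker shows up only in the rendered HTML, a plain request is not enough and a browser-driven tool such as Selenium is the simpler route.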


In [6]:
'''
How to Extract Statically Loaded Content Using Selenium
Using the code from above, we can now access each movie page by calling the click method on each of the 
anchors.
'''

from selenium.webdriver.common.by import By

first_link = driver.find_elements(By.CSS_SELECTOR,'table tbody tr td.titleColumn a')[0]
first_link.click()

In [7]:
'''
This will simulate a click on the first movie’s link. However, in this case, I recommend that you continue
using driver.get instead. Once you navigate to a different page you can no longer call click() on the
remaining anchors, because the new page doesn’t contain the links to the other nine movies.

As a result, after clicking on the first title from the list, you’d need to go back to the first page, then
click on the second, and so on. This wastes time and performance. Instead, we will just use the
extracted links and access them one by one.

For “The Shawshank Redemption”, the movie page is https://www.imdb.com/title/tt0111161/. We will extract
the movie’s year and duration from the page, but this time we will use Selenium’s functions instead of
BeautifulSoup as an example. In practice, you can use either one, so pick your favorite.

To retrieve the movie’s year and duration, repeat the inspection from the first step on the
movie’s page.

You will notice that all of the information is in the second element with the class ipc-inline-list
(the ".ipc-inline-list" selector, index 1 in the code below) and that every item of that list has the
attribute role with the value presentation (the [role='presentation'] selector).
'''

# import library
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# define option
option = webdriver.ChromeOptions()

# add options here if needed; none are required this time

# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
service = Service(r'C:\Users\wibow\OneDrive\Desktop\Web Scraping\chromedriver_win32\chromedriver.exe')

# pass service object
driver = webdriver.Chrome(service=service, options=option)
 
driver.get('https://www.imdb.com/chart/top/') # Load the page in the browser (driver.get returns None, so there is nothing to assign)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parse the rendered HTML with BeautifulSoup
 
totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href']) # Access the movie’s page (the href already starts with "/")
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[1] # The second element with class 'ipc-inline-list' holds the year, rating, and duration; confirm by inspecting one of the title pages
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # All items with role='presentation' inside that second 'ipc-inline-list' element
    scrapedInfo = {
        "title": anchor.text, # title comes from the anchor text
        "year": informations[0].text, # year is the first item with the presentation role
        "duration": informations[2].text, # duration is the third item (the second is the rating)
    } # Save all the scraped information in a dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInfo list

print(totalScrapedInfo) # Display the list with all the information we scraped

driver.quit() # Close the WebDriver


[{'title': 'The Shawshank Redemption', 'year': '1994', 'duration': '2h 22m'}, {'title': 'The Godfather', 'year': '1972', 'duration': '2h 55m'}, {'title': 'The Dark Knight', 'year': '2008', 'duration': '2h 32m'}, {'title': 'The Godfather Part II', 'year': '1974', 'duration': '3h 22m'}, {'title': '12 Angry Men', 'year': '1957', 'duration': '1h 36m'}, {'title': "Schindler's List", 'year': '1993', 'duration': '3h 15m'}, {'title': 'The Lord of the Rings: The Return of the King', 'year': '2003', 'duration': '3h 21m'}, {'title': 'Pulp Fiction', 'year': '1994', 'duration': '2h 34m'}, {'title': 'The Lord of the Rings: The Fellowship of the Ring', 'year': '2001', 'duration': '2h 58m'}, {'title': 'The Good, the Bad and the Ugly', 'year': '1966', 'duration': '2h 58m'}]
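
The loop above builds each movie URL by joining the site root and the anchor's href as strings. A small sketch of an alternative, assuming the href values look like /title/tt0111161/ (an assumption about the page, not something the notebook prints): urllib.parse.urljoin resolves the href against the page it came from, so leading slashes and already-absolute links are handled for us. CHART_URL and the sample href values are illustrative.

```python
from urllib.parse import urljoin

CHART_URL = "https://www.imdb.com/chart/top/"  # page the anchors were scraped from
href = "/title/tt0111161/"                     # assumed shape of anchor['href']

# Plain concatenation: a base ending in "/" plus an href starting with "/"
# leaves a double slash in the path.
naive = "https://www.imdb.com/" + href

# urljoin resolves a root-relative href against the page URL.
joined = urljoin(CHART_URL, href)

# An href that is already absolute is returned unchanged.
absolute = urljoin(CHART_URL, "https://www.imdb.com/title/tt0068646/")

print(naive)     # https://www.imdb.com//title/tt0111161/
print(joined)    # https://www.imdb.com/title/tt0111161/
print(absolute)  # https://www.imdb.com/title/tt0068646/
```

In the scraper this would replace the concatenation inside driver.get; the rest of the loop stays the same.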


In [8]:
'''
How to Extract Dynamically Loaded Content Using Selenium
The next big step in web scraping is extracting content that is loaded dynamically. You can find such content on
each movie’s page (such as https://www.imdb.com/title/tt0111161/) in the Editorial Lists section.

If you inspect the page, you will find the section as an element with the attribute
data-testid set to firstListCardGroup-editorial. But if you look in the page source, you will not find this
attribute value anywhere. That’s because IMDb loads the Editorial Lists section dynamically.

In the following example, we will scrape the editorial lists of each movie and add them to the
information we have already scraped.

To do that, we will import a few more packages that make it possible to wait for our dynamic content to load.
'''

# import library
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
# define option
option = webdriver.ChromeOptions()

# add options here if needed; none are required this time

# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
service = Service(r'C:\Users\wibow\OneDrive\Desktop\Web Scraping\chromedriver_win32\chromedriver.exe')

# pass service object
driver = webdriver.Chrome(service=service, options=option)
 
driver.get('https://www.imdb.com/chart/top/') # Load the page in the browser (driver.get returns None, so there is nothing to assign)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parse the rendered HTML with BeautifulSoup
 
totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href']) # Access the movie’s page (the href already starts with "/")
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[1] # The second element with class 'ipc-inline-list' holds the year, rating, and duration; confirm by inspecting one of the title pages
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # All items with role='presentation' inside that second 'ipc-inline-list' element
    scrapedInfo = {
        "title": anchor.text, # title comes from the anchor text
        "year": informations[0].text, # year is the first item with the presentation role
        "duration": informations[2].text, # duration is the third item (the second is the rating)
    } # Save all the scraped information in a dictionary
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Scroll to the bottom of the page so the remaining content loads; otherwise the wait below times out because the element never appears
    WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']"))) # Wait up to 60 seconds for the element whose data-testid is 'firstListCardGroup-editorial'
    listElements = driver.find_elements(By.CSS_SELECTOR,"[data-testid='firstListCardGroup-editorial'] .sc-77f04bb0-1.hEkxEA.listName") # Extract the editorial list elements (generated class names like these may change when the page is updated)
    listNames = [el.text for el in listElements] # Keep only the text of each element
    scrapedInfo['editorial-list'] = listNames # Add the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInfo list
    
print(totalScrapedInfo) # Display the list with all the information we scraped

driver.quit() # Close the WebDriver

[{'title': 'The Shawshank Redemption', 'year': '1994', 'duration': '2h 22m', 'editorial-list': ["What's New on HBO and HBO Max in January 2022", "What's New on Netflix in March 2022", 'Everything Coming to HBO and HBO Max in August 2021']}, {'title': 'The Godfather', 'year': '1972', 'duration': '2h 55m', 'editorial-list': ['10 films that inspired director George Tillman Jr.', 'All Oscar Best Picture winners, ranked by IMDb rating', 'New on Netflix India This Aug 2020']}, {'title': 'The Dark Knight', 'year': '2008', 'duration': '2h 32m', 'editorial-list': ['The Billion-Dollar Film Club: 50+ Movies to Reach $1 Billion Worldwide', 'The billion-dollar superhero club: All the superhero movies to reach $1 billion worldwide', "What's new on Hulu in December 2022"]}, {'title': 'The Godfather Part II', 'year': '1974', 'duration': '3h 22m', 'editorial-list': ['All Oscar Best Picture winners, ranked by IMDb rating', 'Top 100 Movies Bucket List', 'Top 100 Movies as Rated by Women on IMDb in 2016']
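
The explicit wait used above comes down to one idea: keep checking a condition until it returns something truthy or a timeout runs out. The sketch below shows that idea in plain Python so it can run without a browser. It is not Selenium's implementation; wait_until, fake_element_lookup, and the "editorial-lists" value are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated dynamic content: the "element" only appears on the third lookup,
# the way a section might appear once the page's JavaScript has finished.
attempts = {"count": 0}


def fake_element_lookup():
    attempts["count"] += 1
    return "editorial-lists" if attempts["count"] >= 3 else None


found = wait_until(fake_element_lookup, timeout=2.0, poll_interval=0.01)
print(found, "after", attempts["count"], "lookups")
```

A long timeout such as the 60 seconds used above only costs time when the element never shows up; when it appears early, the wait returns right away.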

In [10]:
'''
How to Save the Scraped Content
Now that we have all the data we want, we can save it as a .json or a .csv file so it is easier to read and reuse.

To do that, we will use Python's built-in json and csv modules and write our content to new files:
'''

'''
# remove the surrounding triple quotes to run the code below
import csv
import json
        
# Save scraped data to JSON file
with open('scraped_data.json', 'w') as json_file: # the json file will be named scraped_data
    json.dump(totalScrapedInfo, json_file)

# Save scraped data to CSV file
csv_file_path = 'scraped_data.csv'
fieldnames = totalScrapedInfo[0].keys()

with open(csv_file_path, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(totalScrapedInfo)
'''
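
Because each editorial-list value is a Python list, csv.DictWriter would write it as the list's string form. Here is a self-contained sketch of one way to handle that, using two records copied from the output above. The "; " separator, the temporary output folder, and the variable names are choices made for this example rather than part of the original code.

```python
import csv
import json
import tempfile
from pathlib import Path

# Two records copied from the scraper output shown earlier.
records = [
    {"title": "The Shawshank Redemption", "year": "1994", "duration": "2h 22m",
     "editorial-list": ["What's New on HBO and HBO Max in January 2022",
                        "What's New on Netflix in March 2022",
                        "Everything Coming to HBO and HBO Max in August 2021"]},
    {"title": "The Godfather", "year": "1972", "duration": "2h 55m",
     "editorial-list": ["10 films that inspired director George Tillman Jr.",
                        "All Oscar Best Picture winners, ranked by IMDb rating",
                        "New on Netflix India This Aug 2020"]},
]

# Write into a temporary folder so the example leaves the working directory alone.
out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "scraped_data.json"
csv_path = out_dir / "scraped_data.csv"

# JSON keeps nested lists as they are.
with open(json_path, "w", encoding="utf-8") as json_file:
    json.dump(records, json_file, ensure_ascii=False, indent=2)

# CSV cells are flat text, so join each list into one string first.
fieldnames = list(records[0].keys())
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        row = dict(record)  # copy, so the original record keeps its list
        row["editorial-list"] = "; ".join(record["editorial-list"])
        writer.writerow(row)

# Read both files back to confirm what was stored.
with open(json_path, encoding="utf-8") as json_file:
    reloaded = json.load(json_file)
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    csv_rows = list(csv.DictReader(csv_file))

print(reloaded == records)
print(csv_rows[0]["editorial-list"])
```

Passing encoding="utf-8" explicitly avoids platform-dependent defaults, which matters for titles that contain non-ASCII characters.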