## Advanced scraping for detailed movie data

This notebook contains the code to scrape detailed movie information from Rotten Tomatoes. Instead of gathering information only from the main page, it collects the link to every movie listed there and then loops through those links, scraping each movie's own page. This gives us access to movie genres, scores and complete release dates. The data gathered by this script will be used later in the data cleaning tutorial.

**Step 1: import**

In [81]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# This chunk of imports is useful to catch exceptions and clock our script
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


import urllib.request
import requests

import pandas as pd

**Step 2: start Chrome webdriver session**

In [82]:
# Start Chrome
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")

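# Return from driver.get() right away instead of waiting for the full page load;
# the explicit waits further down decide when a page is ready.
options.page_load_strategy = "none"

# Selenium 4 style: the driver path is passed through a Service object
# (the executable_path and desired_capabilities arguments are deprecated in Selenium 4).
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)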


driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")
# print(driver.page_source)
print('done')



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/arran/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache
done


**Step 3: Get movies**

In [83]:
#Get page with movies
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")
pageExpandLimit = 0
links = []

# Click the "show more" button repeatedly to load more movies, then collect the URLs.
while True:
    print('clicked on "show more" button {0} times.'.format(pageExpandLimit))
    pageExpandLimit = pageExpandLimit + 1
    # The screenshots help with debugging; remove this line if they take up too much space.
    driver.save_screenshot("screenshots/screenshot{0}.png".format(pageExpandLimit))

    if pageExpandLimit < 10:
        # Give the page time to load before clicking
        time.sleep(5)
        driver.find_element(By.CLASS_NAME, 'mb-load-btn').click()
        moreMovies = driver.find_elements(By.CLASS_NAME, 'mb-movie')
    else:
        # Keep only the link of each movie tile
        for movie in moreMovies:
            link = movie.find_element(By.TAG_NAME, "a")
            href = link.get_attribute("href")
            links.append(href)

        break

clicked on "show more" button 0 times.
clicked on "show more" button 1 times.
clicked on "show more" button 2 times.
clicked on "show more" button 3 times.
clicked on "show more" button 4 times.
clicked on "show more" button 5 times.
clicked on "show more" button 6 times.
clicked on "show more" button 7 times.
clicked on "show more" button 8 times.
clicked on "show more" button 9 times.


In [85]:
# Check how many links were collected
len(links)

311

In [86]:
links

['https://www.rottentomatoes.com/m/rumble_2021',
 'https://www.rottentomatoes.com/m/hurt_2021',
 'https://www.rottentomatoes.com/m/mosley',
 'https://www.rottentomatoes.com/m/benedetta',
 'https://www.rottentomatoes.com/m/last_words',
 'https://www.rottentomatoes.com/m/saint_narcisse',
 'https://www.rottentomatoes.com/m/try_harder',
 'https://www.rottentomatoes.com/m/back_to_the_outback',
 'https://www.rottentomatoes.com/m/the_scary_of_sixty_first',
 'https://www.rottentomatoes.com/m/sensation_2021',
 'https://www.rottentomatoes.com/m/red_snow_2021',
 'https://www.rottentomatoes.com/m/the_novice_2021',
 'https://www.rottentomatoes.com/m/even_mice_belong_in_heaven',
 'https://www.rottentomatoes.com/m/the_lost_daughter',
 'https://www.rottentomatoes.com/m/the_hand_of_god',
 'https://www.rottentomatoes.com/m/encounter_2021',
 'https://www.rottentomatoes.com/m/null',
 'https://www.rottentomatoes.com/m/the_united_states_of_insanity',
 'https://www.rottentomatoes.com/m/the_unforgivable',
 'h
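
The list printed above contains an entry ending in `/m/null`, which looks like a placeholder rather than a real movie page. Before looping through the links it may be worth filtering such entries out and dropping duplicates. The helper below is only a sketch: the name `clean_links` is hypothetical, and treating `/m/null` as a placeholder is an assumption based on the output above, not something the site documents.

```python
def clean_links(raw_links):
    """Return links in their original order, without duplicates or placeholders."""
    seen = set()
    cleaned = []
    for href in raw_links:
        # Skip missing hrefs and the placeholder slug ".../m/null" (assumption)
        if not href or href.rstrip("/").endswith("/m/null"):
            continue
        # Skip links we have already kept
        if href in seen:
            continue
        seen.add(href)
        cleaned.append(href)
    return cleaned


# Small sample built from URLs that appear in the output above
sample = [
    "https://www.rottentomatoes.com/m/encounter_2021",
    "https://www.rottentomatoes.com/m/null",
    "https://www.rottentomatoes.com/m/the_unforgivable",
    "https://www.rottentomatoes.com/m/encounter_2021",
]
print(clean_links(sample))
```

In the notebook this could be applied as `links = clean_links(links)` before Step 4.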

**Step 4: loop through movies and build a list of lists containing our data**

Scraping can be really painful. Depending on how the website is structured, you will face a variety of problems and hiccups: 404 pages, lazy-loading images and throttling are only some of them. It is up to you to refine the script so that it catches and handles every problem that could break your scraping run. Unfortunately there is no default recipe, and you will often have to change scripts and strategy, even within the same website. Scrapers also become obsolete quickly: if you want to scrape at regular intervals over a long period, you will have to keep checking the website's source to be sure no major change has happened, since such a change could break your script.

![information we are interested in](scraping_elements.png)

In this particular case we are interested in the title (1), the scores (2, 3), the release date (4) and the cover image (5).
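
One common way to soften the throttling and dropped-connection problems mentioned above is to retry a failing step a few times, waiting a little longer after each failure. The helper below is a minimal sketch of that idea, not part of the original script: the name `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the example can run without real waiting.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The delay doubles after every failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(attempts):
        try:
            return action()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a function that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated connection reset")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, attempts=4, base_delay=0.5,
                            exceptions=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

In the scraping loop, the same helper could wrap a page load, for example `retry_with_backoff(lambda: driver.get(link), exceptions=(TimeoutException,))`. Which exceptions are worth retrying depends on how the site misbehaves, so treat that tuple as something to tune.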

In [87]:
# Helper function to catch "404 - Not Found" and "Internal Server Error" (500) pages
def catchSourceErrors(currentElement):
    # currentElement is the text of the page's <h1> heading
    return currentElement in ("404 - Not Found", "Internal Server Error")

In [88]:
moviesData = []
count = 0
# Explicit wait: give up on an element after 10 seconds
wait = WebDriverWait(driver, 10)


for link in links:
    
# -- Initially load page -------------------------

    driver.get(link)
    print('currently on', link)
    count = count + 1
    # Try to scrape only if the main container is visible    
    try:    
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#main_container')))
        if not catchSourceErrors(driver.find_element(By.TAG_NAME, "h1").text):
            try:
                
# ------------------ Waiting and retrieving data -------------------------

                print('start scraping number', count)
                # Wait for title, then get title and scores
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.scoreboard__title')))
                title = driver.find_element(By.CLASS_NAME, "scoreboard__title").text
                scores = driver.find_elements(By.TAG_NAME, "score-board")

                
                criticNumber = scores[0].get_attribute('tomatometerscore')
                audienceNumber = scores[0].get_attribute('audiencescore')

                # Pause in case images are not properly loaded
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.posterImage')))

                # Download images
                image = driver.find_element(By.CLASS_NAME, 'posterImage')
                source = image.get_attribute("src")
                imgdestination = "data/img/poster-{0}.jpg".format(count)
                urllib.request.urlretrieve(source, imgdestination)

                # Get release date and genre from movie_info box
                releaseDate = driver.find_element(By.TAG_NAME, 'time').text
                # By.CLASS_NAME accepts a single class only, so match both classes with a CSS selector
                movieGenre = driver.find_element(By.CSS_SELECTOR, '.meta-value.genre').text

                # Append to list
                print('data', title, criticNumber, audienceNumber, releaseDate, movieGenre)
                moviesData.append([
                    title, 
                    criticNumber, 
                    audienceNumber, 
                    movieGenre, 
                    imgdestination, 
                    releaseDate
                ])
                # Stop any resource that is still loading before moving on to the next page
                driver.execute_script("window.stop();")
                print(title, 'appended to list!')
                print('----------')
                
# -------------- Exceptions and errors handling! -------------------------

            #Catch error in case the movie page is not properly loaded
            except (NoSuchElementException, TimeoutException):
                print('no element')
                continue

        # If the page doesn't exist, skip it and continue
        else:
            continue
            
    # If the main container never shows up (or anything else goes wrong),
    # request the page once more and move on to the next link.
    except Exception:
        driver.get(link)
        continue

# Quit from driver and display data        
driver.quit()
print(moviesData)

currently on https://www.rottentomatoes.com/m/rumble_2021
start scraping number 1
currently on https://www.rottentomatoes.com/m/hurt_2021
start scraping number 2
currently on https://www.rottentomatoes.com/m/mosley
start scraping number 3
currently on https://www.rottentomatoes.com/m/benedetta

ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
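
The `ProtocolError` above stopped the run before `moviesData` was ever printed or saved, so everything held in memory depends on the kernel staying alive. One possible safeguard is to append each row to a CSV file as soon as it is scraped. The sketch below uses only the standard library; `append_row` and the file name are hypothetical, the column names are taken from the dataframe built in Step 5, and the two demo rows are made-up values rather than scraped data.

```python
import csv
import os
import tempfile

COLUMNS = ["title", "critic rating", "audience rating", "genres", "img", "available"]


def append_row(path, row, columns=COLUMNS):
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(columns)
        writer.writerow(row)


# Demonstration in a temporary folder with made-up rows
demo_path = os.path.join(tempfile.mkdtemp(), "partial.csv")
append_row(demo_path, ["Example Movie", "90", "85", "Drama", "data/img/poster-1.jpg", "Jan 1, 2022"])
append_row(demo_path, ["Another Movie", "70", "65", "Comedy", "data/img/poster-2.jpg", "Feb 2, 2022"])

with open(demo_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(len(rows))
```

Inside the loop, a call such as `append_row('data/partial.csv', [title, criticNumber, audienceNumber, movieGenre, imgdestination, releaseDate])` placed next to `moviesData.append(...)` would keep a copy on disk of whatever was collected before a crash.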

**Step 5: Move list of lists content to a pandas dataframe**

In [76]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "critic rating", "audience rating", "genres", "img", "available"])

In [50]:
# Preview the first 50 rows of the dataframe
AllStreamingTitles.head(50)

Unnamed: 0,title,critic rating,audience rating,genres,img,available


**Step 6: Export dataframe to csv**

In [23]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/movieDetailsDataRaw.csv', sep=',')