## Advanced scraping for detailed movie data

This notebook contains the code to scrape detailed movie information from Rotten Tomatoes. Instead of gathering information only from the main browse page, it first collects the link to every listed movie and then visits each link in turn. This gives access to movie genres, critic and audience scores, and complete release dates. The data gathered by this script is used later in the data cleaning tutorial.

**Step 1: Import**

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
import time

import urllib.request
import requests

import pandas as pd

**Step 2: Start Chrome webdriver session**

In [37]:
# Start Chrome
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=options)
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")

# print(driver.page_source)
print('done')

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [/Users/francescamorini/.wdm/drivers/chromedriver/mac64/87.0.4280.88/chromedriver] found in cache
done


**Step 3: Get movies**

In [38]:
# Open the browse page that lists the movies
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")
pageExpandLimit = 0
links = []

# Repeatedly click the "load more" button to expand the list, then collect the URLs.
while True:
    print(pageExpandLimit)
    pageExpandLimit = pageExpandLimit + 1
    # The screenshot can be removed if it ends up taking too much space.
    driver.save_screenshot("screenshots/screenshot{0}.png".format(pageExpandLimit))

    if pageExpandLimit < 20:
        # Pause so the newly loaded movies have time to render
        time.sleep(5)
        driver.find_element_by_class_name('mb-load-btn').click()
        moreMovies = driver.find_elements_by_class_name('mb-movie')
    else:
        # After the last expansion, append each movie's link to the list
        for movie in moreMovies:
            link = movie.find_element_by_tag_name("a")
            href = link.get_attribute("href")
            links.append(href)

        break

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
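
The loop above waits a fixed five seconds before every click. One possible alternative is to poll until the page is ready and continue as soon as it is. The sketch below shows that idea in plain Python. The names `wait_until` and `movies_loaded` are hypothetical, and the condition is a stand-in rather than a real page check. Selenium ships its own `WebDriverWait` class for this purpose, which is probably the better fit inside the notebook itself.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for "are the new movies on the page yet?" (hypothetical):
# it becomes true on the third check.
checks = {"count": 0}


def movies_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_until(movies_loaded, timeout=2.0, interval=0.01)
print(loaded, checks["count"])
```

With a helper like this, a slow page costs at most `timeout` seconds and a fast page costs almost nothing, whereas a fixed sleep always costs the full pause.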


In [40]:
# Check how many links were collected
len(links)

591
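
A count alone does not show whether any link is empty or repeated. As a minimal sketch, assuming the listing could contain the same movie twice or an anchor without an `href` (in which case `get_attribute` returns `None`), the list could be cleaned before looping over it. The function name `unique_links` and the sample URLs are hypothetical.

```python
def unique_links(hrefs):
    """Drop empty values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if h))


# Hypothetical sample: one missing href and one repeated link
sample = [
    "https://www.rottentomatoes.com/m/example_one",
    "https://www.rottentomatoes.com/m/example_two",
    None,
    "https://www.rottentomatoes.com/m/example_one",
]

cleaned = unique_links(sample)
print(len(sample), "->", len(cleaned))
```

Comparing `len(links)` before and after such a step would show whether the scrape in Step 4 is about to visit any page twice.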

**Step 4: Loop through movies and build a list of lists containing our data**

In [42]:
moviesData = []
count = 0

for link in links:
    driver.get(link)
    count = count + 1
    try:
        # Get title and scores
        title = driver.find_element_by_class_name("mop-ratings-wrap__title").text
        scores = driver.find_elements_by_class_name("mop-ratings-wrap__percentage")
        
        # Guard against missing scores:
        # if only one score is present, treat it as the critic score and set the audience score to 0;
        # if no score is present, set both to 0
        if len(scores):
            criticScore = scores[0].text
            criticNumber = criticScore.replace('%', '')
            if len(scores) > 1:
                audienceScore = scores[1].text
                audienceNumber = audienceScore.replace('%', '')
            else:
                audienceNumber = 0
        else: 
            criticNumber = 0
            audienceNumber = 0
            
        # Pause in case images are not properly loaded
        time.sleep(2)
        # Download images
        image = driver.find_element_by_class_name('posterImage')
        source = image.get_attribute("src")
        imgdestination = "data/img/poster-{0}.jpg".format(count)
        urllib.request.urlretrieve(source, imgdestination)
        
        # Get release date and genre from movie_info box
        releaseDate = driver.find_element_by_tag_name('time').text
        movieGenre = driver.find_element_by_class_name('genre').text
        print(movieGenre)
        
        # Append to list
        moviesData.append([title, criticNumber, audienceNumber, movieGenre, imgdestination, releaseDate])
    
    # Skip the movie if an expected element is missing, e.g. when the page did not load properly (500 or 404)
    except NoSuchElementException:
        print('no element')

# Quit the driver and display the data
driver.quit()
print(moviesData)

Documentary
Comedy, Mystery And Thriller
Drama, Music
Drama
Comedy, Animation, Adventure, Kids And Family
Documentary
Documentary
Drama, Horror
Drama
Documentary
Romance, Drama, Music
Romance
Drama
no element
War, Action, Horror
Mystery And Thriller, Horror
no element
Horror, Mystery And Thriller
Sci Fi, Drama
Horror, Comedy
Comedy
Adventure, Sci Fi, Fantasy, Action
Mystery And Thriller
Mystery And Thriller, Action
Comedy, Stand Up
Mystery And Thriller, Drama
Documentary
Romance
Drama
Drama
Documentary, Music
Comedy, Romance
Comedy, Horror
Drama
Music, Documentary
Comedy
no element
Romance, Drama
Drama
Comedy, Horror
Drama
Horror, Action, Mystery And Thriller
Documentary
Biography, Drama, Music
Comedy
Documentary
Romance, Drama
Mystery And Thriller
Documentary
Mystery And Thriller, Sci Fi
Drama
Music, Documentary
Other, Fantasy, Mystery And Thriller, Comedy
Mystery And Thriller, Action
Romance
Mystery And Thriller
Action, Mystery And Thriller
no element
Other, Documentary
Other, Romanc

Comedy, Horror
Drama
Documentary
Comedy
Drama
History, Drama, War
Mystery And Thriller, Horror
Mystery And Thriller, Horror
Comedy, Drama
Drama
no element
Horror
Drama, Crime, Biography
Mystery And Thriller, Horror
Documentary
Drama, Romance
Romance
Drama
War, History, Drama
Mystery And Thriller, Action
Drama
Romance
no element
no element
Drama
Documentary
Drama, Documentary
Mystery And Thriller, Crime, Drama
Mystery And Thriller
Fantasy, Adventure, Kids And Family, Comedy, Musical, Animation
Mystery And Thriller, Action
Mystery And Thriller, Sci Fi, Horror
Documentary
Crime, Mystery And Thriller, Drama
Comedy, Drama
no element
Mystery And Thriller, Horror
no element
Comedy, Drama
Music, Documentary
Biography, Documentary
Mystery And Thriller, Crime, Drama
Biography, Documentary
Mystery And Thriller, Sci Fi, Action, Horror
Drama
Mystery And Thriller, Horror
Documentary
Mystery And Thriller, Horror
Comedy, Mystery And Thriller
Biography, Documentary
Drama
Comedy
Biography, Documentary
A
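
The loop in Step 4 stores each score as a string with the `%` sign stripped and falls back to the integer `0` when a score is missing, so a missing score cannot be told apart from a real 0% rating. The sketch below shows one possible alternative, not the notebook's method: a hypothetical helper `parse_scores` that returns integers and uses `None` for anything missing.

```python
def parse_scores(score_texts):
    """Turn a list like ['91%', '87%'] into (critic, audience) integers.

    Missing or non-numeric scores become None rather than 0.
    """
    def to_int(text):
        digits = text.strip().rstrip('%')
        return int(digits) if digits.isdigit() else None

    critic = to_int(score_texts[0]) if len(score_texts) > 0 else None
    audience = to_int(score_texts[1]) if len(score_texts) > 1 else None
    return critic, audience


print(parse_scores(['91%', '87%']))  # both scores present
print(parse_scores(['100%']))        # critic score only
print(parse_scores([]))              # no scores on the page
```

In the loop, the helper would receive the `.text` of each element in `scores`. Whether `None` or `0` is the better placeholder depends on how the later cleaning step treats missing values.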

In [43]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "critic rating", "audience rating", "genres", "img", "available"])

In [39]:
moviesData

[['THE EMOJI STORY',
  'NaN',
  'NaN',
  '91%',
  'data/img/poster-1.jpg',
  'Dec 22, 2020'],
 ['PROMISING YOUNG WOMAN',
  '91%',
  '87%',
  134.5,
  'data/img/poster-2.jpg',
  'Dec 25, 2020'],
 ['YELLOW ROSE', '86%', '82%', 127.0, 'data/img/poster-3.jpg', 'Oct 9, 2020'],
 ['AMERICAN SKIN',
  '31%',
  '97%',
  79.5,
  'data/img/poster-4.jpg',
  'Jan 15, 2021'],
 ['SOUL', '96%', '88%', 140.0, 'data/img/poster-5.jpg', 'Dec 25, 2020'],
 ['SING ME A SONG',
  'NaN',
  'NaN',
  '88%',
  'data/img/poster-6.jpg',
  'Dec 4, 2020'],
 ['BORN TO BE', 'NaN', 'NaN', '100%', 'data/img/poster-7.jpg', 'Nov 18, 2020'],
 ['THE DELIVERED',
  'NaN',
  'NaN',
  '93%',
  'data/img/poster-8.jpg',
  'Jan 15, 2021'],
 ['PIECES OF A WOMAN',
  '76%',
  '89%',
  120.5,
  'data/img/poster-9.jpg',
  'Dec 30, 2020'],
 ['ASSASSINS',
  '100%',
  '100%',
  150.0,
  'data/img/poster-10.jpg',
  'Dec 11, 2020'],
 ["SYLVIE'S LOVE",
  '92%',
  '79%',
  131.5,
  'data/img/poster-11.jpg',
  'Dec 23, 2020'],
 ['LOVE SARAH', '55

**Step 5: Move list of lists content to a pandas dataframe**

In [115]:
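# Each row in moviesData has six fields, so six column names are needed
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "critic rating", "audience rating", "genres", "img", "available"])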

In [44]:
# Preview 5 random rows from the dataframe
AllStreamingTitles.sample(5)

| Unnamed: 0 | title | critic rating | audience rating | genres | img | available |
|---|---|---|---|---|---|---|
| 285 | GAME OF DEATH | 57 | 24 | Mystery And Thriller, Horror | data/img/poster-311.jpg | Jul 3, 2020 |
| 370 | YOU SHOULD HAVE LEFT | 40 | 23 | Mystery And Thriller, Horror | data/img/poster-401.jpg | Jun 19, 2020 |
| 37 | BLACK BEAR | 87 | 60 | Drama | data/img/poster-41.jpg | Dec 4, 2020 |
| 401 | URSULA VON RYDINGSVARD: INTO HER OWN | 100 | 0 | Documentary | data/img/poster-435.jpg | May 29, 2020 |
| 208 | SOMETIMES ALWAYS NEVER | 82 | 68 | Comedy, Drama | data/img/poster-227.jpg | Mar 6, 2020 |
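
The `available` and `genres` columns in this preview are still plain strings. As a rough sketch of how they could be turned into structured values, the helpers below parse one date and one genre string. The names `parse_release_date` and `split_genres` are hypothetical, and the code assumes every date follows the `Mon D, YYYY` pattern seen in the sample and that month abbreviations are in English.

```python
from datetime import datetime


def parse_release_date(text):
    """Parse a date such as 'Dec 4, 2020' into a datetime.date.

    %b matches abbreviated month names in the current locale (English assumed here).
    """
    return datetime.strptime(text, "%b %d, %Y").date()


def split_genres(text):
    """Split 'Comedy, Drama' into ['Comedy', 'Drama']."""
    return [genre.strip() for genre in text.split(",") if genre.strip()]


print(parse_release_date("Dec 4, 2020"))
print(split_genres("Mystery And Thriller, Horror"))
```

Parsed dates sort chronologically and genre lists can be counted per genre, which the raw strings do not allow.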


**Step 6: Export dataframe to csv**

In [45]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/movieDetailsDataRaw.csv', sep=',')