## Advanced scraping for detailed movie data

This notebook contains the code to scrape detailed movie information from Rotten Tomatoes. Instead of gathering information only from the main page, it first collects the links to the individual movie pages and then loops through them. This gives us access to movie genres, scores and complete release dates. The data gathered by this script is used later in the data cleaning tutorial.

**Step 1: import**

In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# This chunk of imports is useful to catch exceptions and clock our script
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


import urllib.request
import requests

import pandas as pd

**Step 2: start Chrome webdriver session**

In [5]:
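# Start Chrome
from selenium.webdriver.chrome.service import Service

options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
# Do not block on full page loads; explicit waits are used later instead
options.page_load_strategy = "none"

# Selenium 4 style: pass the driver path through a Service object
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)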


driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")
# print(driver.page_source)
print('done')



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/arran/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


done


**Step 3: Get movies**

In [70]:
#Get page with movies
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")
pageExpandLimit = 0
links = []

#Start while loop to fake click interaction and collect urls.
while True:
    print('clicked on "show more" button {0} times.'.format(pageExpandLimit))
    pageExpandLimit = pageExpandLimit + 1 
    # This can be removed if the screenshots end up taking too much space.
    driver.save_screenshot("screenshots/screenshot{0}.png".format(pageExpandLimit))
    
    if pageExpandLimit < 10:
        
        time.sleep(5)
        driver.find_element(By.CLASS_NAME, 'mb-load-btn').click()
        moreMovies = driver.find_elements(By.CLASS_NAME, 'mb-movie')
        
    if pageExpandLimit >= 10:
        # Appending only links to an empty list
        for movie in moreMovies:
            link = movie.find_element(By.TAG_NAME, "a")
            href = link.get_attribute("href")
            links.append(href)
            
        break

clicked on "show more" button 0 times.
clicked on "show more" button 1 times.
clicked on "show more" button 2 times.
clicked on "show more" button 3 times.
clicked on "show more" button 4 times.
clicked on "show more" button 5 times.
clicked on "show more" button 6 times.
clicked on "show more" button 7 times.
clicked on "show more" button 8 times.
clicked on "show more" button 9 times.


In [71]:
# Check how many links were collected
len(links)

314
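
Before looping over the links it can be worth checking that none of them is repeated, since every duplicate costs one more page load. The following is a minimal sketch, not part of the original notebook: `unique_links` is a hypothetical helper, and the sample list only reuses two URLs from the output to make the example self-contained.

```python
def unique_links(urls):
    """Return the URLs in their original order with repeats removed."""
    # dict keys keep insertion order (Python 3.7+), so this is an
    # order-preserving de-duplication
    return list(dict.fromkeys(urls))


# Small stand-in for the real `links` list
sample_links = [
    "https://www.rottentomatoes.com/m/promising_young_woman",
    "https://www.rottentomatoes.com/m/yellow_rose",
    "https://www.rottentomatoes.com/m/promising_young_woman",
]
deduped = unique_links(sample_links)
print(len(sample_links), "->", len(deduped))
```

In the notebook this would be applied as `links = unique_links(links)`, assuming the order of the movies should be kept.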

In [72]:
links

['https://www.rottentomatoes.com/m/promising_young_woman',
 'https://www.rottentomatoes.com/m/yellow_rose',
 'https://www.rottentomatoes.com/m/american_skin',
 'https://www.rottentomatoes.com/m/sing_me_a_song',
 'https://www.rottentomatoes.com/m/the_changin_times_of_ike_white',
 'https://www.rottentomatoes.com/m/born_to_be',
 'https://www.rottentomatoes.com/m/savage_2019_2',
 'https://www.rottentomatoes.com/m/the_delivered',
 'https://www.rottentomatoes.com/m/pieces_of_a_woman',
 'https://www.rottentomatoes.com/m/assassins_2020',
 'https://www.rottentomatoes.com/m/love_sarah',
 'https://www.rottentomatoes.com/m/one_night_in_miami',
 'https://www.rottentomatoes.com/m/notturno_2021',
 'https://www.rottentomatoes.com/m/mlk_fbi',
 'https://www.rottentomatoes.com/m/shadow_in_the_cloud',
 'https://www.rottentomatoes.com/m/ten_minutes_to_midnight',
 'https://www.rottentomatoes.com/m/i_am_lisa',
 'https://www.rottentomatoes.com/m/pg_psycho_goreman',
 'https://www.rottentomatoes.com/m/baby_done

**Step 4: loop through movies and build a list of lists containing our data**

Scraping can be really painful. Depending on how the website is structured, you will face many different problems and hiccups: 404 pages, lazy-loading images and throttling are only some of them. It is up to you to refine the script so that it catches and handles every problem that could threaten your scraping. Unfortunately there is no default recipe, and you will often have to change scripts and strategy, even on the same website. Scrapers also become obsolete quickly: if you want to scrape at regular intervals over a long period, check the website source each time to make sure no major change has happened, because such a change can break your script.
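
One common way to soften throttling and flaky page loads is to retry a failed step after a growing pause. The sketch below is only an illustration under stated assumptions: `retry_with_backoff` and `flaky_load` are hypothetical names that do not appear in this notebook, and the demo uses a stand-in function instead of a real browser so that it runs on its own.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The pause doubles after every failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in for a flaky page load (no browser needed)
calls = {"n": 0}
waits = []  # records the pauses instead of actually sleeping


def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page did not load")
    return "loaded"


result = retry_with_backoff(flaky_load, attempts=4, base_delay=2.0, sleep=waits.append)
print(result, waits)
```

With a real driver, `action` could be something like `lambda: driver.get(link)`. Whether the delays chosen here suit a given website is an assumption that has to be checked case by case.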

![information we are interested in](scraping_elements.png)

In this particular case we are interested in the title (1), the scores (2, 3), the release date (4) and the cover image (5). The script also collects the genre from the movie info box.
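
Because the cover images are written to `data/img/` and the screenshots from step 3 to `screenshots/`, both folders need to exist before the cells run; `urllib.request.urlretrieve` does not create missing directories. The helper below is a small sketch for that, with `ensure_output_dirs` being a hypothetical name introduced here. The demo runs inside a temporary directory so it leaves no files behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(base=".", subdirs=("screenshots", "data/img")):
    """Create the output folders under `base` if they are missing."""
    created = []
    for sub in subdirs:
        path = Path(base) / sub
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_output_dirs(tmp)
    all_exist = all(d.is_dir() for d in dirs)
print(all_exist)
```

Calling `ensure_output_dirs()` without arguments would create the folders next to the notebook, assuming the notebook is run from its own directory.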

In [73]:
# Helper function to catch 404 and 500 error pages
def catchSourceErrors(currentElement):
    # currentElement is the text of the page's <h1>; error pages use these headings
    if currentElement == "404 - Not Found":
        return True
    if currentElement == "Internal Server Error":
        return True
    return False

In [92]:
moviesData = []
count = 0
# Passing to our driver a waiting time in seconds
wait = WebDriverWait(driver, 10)


for link in links:
    
# -- Initially load page -------------------------

    driver.get(link)
    print('currently on', link)
    count = count + 1
    # Try to scrape only if the main container is visible    
    try:    
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#main_container')))
        if catchSourceErrors(driver.find_element(By.TAG_NAME, "h1").text) is False:
            try:
                
# ------------------ Waiting and retrieving data -------------------------

                print('start scraping number', count)
                # Wait for title, then get title and scores
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.scoreboard__title')))
                title = driver.find_element(By.CLASS_NAME, "scoreboard__title").text
                scores = driver.find_elements(By.TAG_NAME, "score-board")

                criticNumber = scores[0].get_attribute('tomatometerscore')
                audienceNumber = scores[0].get_attribute('audiencescore')

                # Pause in case images are not properly loaded
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.posterImage')))

                # Download images
                image = driver.find_element(By.CLASS_NAME, 'posterImage')
                source = image.get_attribute("src")
                imgdestination = "data/img/poster-{0}.jpg".format(count)
                urllib.request.urlretrieve(source, imgdestination)

                # Get release date and genre from movie_info box
                releaseDate = driver.find_element(By.TAG_NAME, 'time').text
                movieGenre = driver.find_element(By.CLASS_NAME, 'genre').text

                # Append to list
                print('data', title, criticNumber, audienceNumber, releaseDate, movieGenre)
                moviesData.append([
                    title, 
                    criticNumber, 
                    audienceNumber, 
                    movieGenre, 
                    imgdestination, 
                    releaseDate
                ])
                # Stop any resources that are still loading before moving on
                driver.execute_script("window.stop();")
                print(title, 'appended to list!')
                print('----------')
                
# -------------- Exceptions and errors handling! -------------------------

            #Catch error in case the movie page is not properly loaded
            except (NoSuchElementException, TimeoutException):
                print('no element')
                continue

        # If the page is a 404 or server error page, skip it and continue
        else:
            continue
            
    # If the main container never shows up within the wait time (or anything
    # else fails), reload the link once and move on to the next movie.
    except Exception:
        driver.get(link)
        continue

# Quit from driver and display data        
driver.quit()
print(moviesData)

currently on https://www.rottentomatoes.com/m/promising_young_woman
start scraping number 1
data PROMISING YOUNG WOMAN 91 87 Dec 25, 2020 Mystery And Thriller, Comedy
PROMISING YOUNG WOMAN appended to list!
----------
currently on https://www.rottentomatoes.com/m/yellow_rose
start scraping number 2
data YELLOW ROSE 87 82 Oct 9, 2020 Music, Drama
YELLOW ROSE appended to list!
----------
currently on https://www.rottentomatoes.com/m/american_skin
start scraping number 3
data AMERICAN SKIN 29 96 Jan 15, 2021 Drama
AMERICAN SKIN appended to list!
----------
currently on https://www.rottentomatoes.com/m/sing_me_a_song
start scraping number 4
data SING ME A SONG 89  Dec 4, 2020 Documentary
SING ME A SONG appended to list!
----------
currently on https://www.rottentomatoes.com/m/the_changin_times_of_ike_white
start scraping number 5
data THE CHANGIN' TIMES OF IKE WHITE 100  Dec 4, 2020 Documentary, Music
THE CHANGIN' TIMES OF IKE WHITE appended to list!
----------
currently on https://www.rot

start scraping number 46
data SURVIVAL SKILLS 86 62 Nov 13, 2020 Comedy
SURVIVAL SKILLS appended to list!
----------
currently on https://www.rottentomatoes.com/m/the_walrus_and_the_whistleblower
start scraping number 47
data THE WALRUS AND THE WHISTLEBLOWER 40  Oct 9, 2020 Documentary
THE WALRUS AND THE WHISTLEBLOWER appended to list!
----------
currently on https://www.rottentomatoes.com/m/ammonite
start scraping number 48
data AMMONITE 67 85 Nov 13, 2020 Drama, Romance, Gay And Lesbian
AMMONITE appended to list!
----------
currently on https://www.rottentomatoes.com/m/a_chefs_voyage
start scraping number 49
data A CHEF'S VOYAGE 67  Sep 18, 2020 Documentary
A CHEF'S VOYAGE appended to list!
----------
currently on https://www.rottentomatoes.com/m/minor_premise
start scraping number 50
data MINOR PREMISE 93 84 Dec 4, 2020 Sci Fi, Mystery And Thriller
MINOR PREMISE appended to list!
----------
currently on https://www.rottentomatoes.com/m/crock_of_gold_a_few_rounds_with_shane_macgowan


start scraping number 90
data WHERE SHE LIES 100  Nov 10, 2020 Documentary
WHERE SHE LIES appended to list!
----------
currently on https://www.rottentomatoes.com/m/the_orange_years_the_nickelodeon_story
start scraping number 91
data THE ORANGE YEARS: THE NICKELODEON STORY 88 92 Nov 17, 2020 Documentary
THE ORANGE YEARS: THE NICKELODEON STORY appended to list!
----------
currently on https://www.rottentomatoes.com/m/my_summer_as_a_goth
start scraping number 92
data MY SUMMER AS A GOTH 20  Nov 11, 2020 Drama
MY SUMMER AS A GOTH appended to list!
----------
currently on https://www.rottentomatoes.com/m/sleepless_beauty
start scraping number 93
data SLEEPLESS BEAUTY 40  Nov 10, 2020 Horror
SLEEPLESS BEAUTY appended to list!
----------
currently on https://www.rottentomatoes.com/m/truth_is_the_only_client
start scraping number 94
data TRUTH IS THE ONLY CLIENT: THE OFFICIAL INVESTIGATION OF THE MURDER OF JOHN F. KENNEDY 100  Nov 17, 2020 Documentary
TRUTH IS THE ONLY CLIENT: THE OFFICIAL IN

start scraping number 133
data AGGIE 75  Oct 7, 2020 Documentary
AGGIE appended to list!
----------
currently on https://www.rottentomatoes.com/m/save_yourselves
start scraping number 134
data SAVE YOURSELVES! 88 60 Oct 2, 2020 Comedy, Sci Fi
SAVE YOURSELVES! appended to list!
----------
currently on https://www.rottentomatoes.com/m/a_call_to_spy
start scraping number 135
data A CALL TO SPY 72 84 Oct 2, 2020 Drama, History
A CALL TO SPY appended to list!
----------
currently on https://www.rottentomatoes.com/m/tapeworm
start scraping number 136
data TAPEWORM 75  Oct 13, 2020 Drama, Comedy
TAPEWORM appended to list!
----------
currently on https://www.rottentomatoes.com/m/12_hour_shift
start scraping number 137
data 12 HOUR SHIFT 77 42 Oct 2, 2020 Horror, Mystery And Thriller, Comedy
12 HOUR SHIFT appended to list!
----------
currently on https://www.rottentomatoes.com/m/broil
start scraping number 138
data BROIL 40  Oct 13, 2020 Horror, Mystery And Thriller
BROIL appended to list!
----

start scraping number 178
data OTTOLENGHI AND THE CAKES OF VERSAILLES 71  Sep 25, 2020 Documentary
OTTOLENGHI AND THE CAKES OF VERSAILLES appended to list!
----------
currently on https://www.rottentomatoes.com/m/no_escape_2020
start scraping number 179
data NO ESCAPE 27 49 Sep 18, 2020 Horror, Mystery And Thriller
NO ESCAPE appended to list!
----------
currently on https://www.rottentomatoes.com/m/all_in_the_fight_for_democracy
start scraping number 180
data ALL IN: THE FIGHT FOR DEMOCRACY 100 72 Sep 9, 2020 Documentary
ALL IN: THE FIGHT FOR DEMOCRACY appended to list!
----------
currently on https://www.rottentomatoes.com/m/sno_babies
start scraping number 181
data SNO BABIES 25 99 Sep 29, 2020 Drama
SNO BABIES appended to list!
----------
currently on https://www.rottentomatoes.com/m/murder_in_the_woods
start scraping number 182
data MURDER IN THE WOODS 82 65 Aug 14, 2020 Horror, Mystery And Thriller
MURDER IN THE WOODS appended to list!
----------
currently on https://www.rottentom

start scraping number 224
no element
currently on https://www.rottentomatoes.com/m/the_prey_2020
start scraping number 225
data THE PREY 73  Aug 21, 2020 Action, Adventure
THE PREY appended to list!
----------
currently on https://www.rottentomatoes.com/m/sometimes_always_never
start scraping number 226
data SOMETIMES ALWAYS NEVER 82 69 Mar 6, 2020 Drama, Comedy
SOMETIMES ALWAYS NEVER appended to list!
----------
currently on https://www.rottentomatoes.com/m/benjamin
start scraping number 227
data BENJAMIN 89 73 Aug 25, 2020 Romance, Gay And Lesbian
BENJAMIN appended to list!
----------
currently on https://www.rottentomatoes.com/m/cane_river
start scraping number 228
data CANE RIVER 100 38 Jul 6, 2020 Romance
CANE RIVER appended to list!
----------
currently on https://www.rottentomatoes.com/m/pariah_dog
start scraping number 229
data PARIAH DOG 100 100 Aug 18, 2020 Documentary
PARIAH DOG appended to list!
----------
currently on https://www.rottentomatoes.com/m/desert_one
start scrap

start scraping number 270
data PSYCHOMAGIC, A HEALING ART 77  Aug 7, 2020 Documentary
PSYCHOMAGIC, A HEALING ART appended to list!
----------
currently on https://www.rottentomatoes.com/m/max_reload_and_the_nether_blasters
start scraping number 271
data MAX RELOAD AND THE NETHER BLASTERS 82 61 Aug 7, 2020 Adventure, Fantasy
MAX RELOAD AND THE NETHER BLASTERS appended to list!
----------
currently on https://www.rottentomatoes.com/m/endless
start scraping number 272
data ENDLESS 19  Aug 14, 2020 Drama, Romance
ENDLESS appended to list!
----------
currently on https://www.rottentomatoes.com/m/paydirt
start scraping number 273
data PAYDIRT 22 22 Aug 7, 2020 Mystery And Thriller, Action, Drama, Crime
PAYDIRT appended to list!
----------
currently on https://www.rottentomatoes.com/m/irl_2019
start scraping number 274
data IRL 100  Aug 11, 2020 Drama
IRL appended to list!
----------
currently on https://www.rottentomatoes.com/m/uncle_peckerhead
start scraping number 275
data UNCLE PECKERHEAD

start scraping number 313
data STARS AND STRIFE 90  Aug 4, 2020 Documentary
STARS AND STRIFE appended to list!
----------
currently on https://www.rottentomatoes.com/m/busmans_holiday_2020
start scraping number 314
data BUSMAN'S HOLIDAY 86  Jul 25, 2020 Drama
BUSMAN'S HOLIDAY appended to list!
----------
[['PROMISING YOUNG WOMAN', '91', '87', 'Mystery And Thriller, Comedy', 'data/img/poster-1.jpg', 'Dec 25, 2020'], ['YELLOW ROSE', '87', '82', 'Music, Drama', 'data/img/poster-2.jpg', 'Oct 9, 2020'], ['AMERICAN SKIN', '29', '96', 'Drama', 'data/img/poster-3.jpg', 'Jan 15, 2021'], ['SING ME A SONG', '89', '', 'Documentary', 'data/img/poster-4.jpg', 'Dec 4, 2020'], ["THE CHANGIN' TIMES OF IKE WHITE", '100', '', 'Documentary, Music', 'data/img/poster-5.jpg', 'Dec 4, 2020'], ['BORN TO BE', '100', '', 'Documentary', 'data/img/poster-6.jpg', 'Nov 18, 2020'], ['SAVAGE', '75', '68', 'Crime, Drama', 'data/img/poster-7.jpg', 'Jan 29, 2020'], ['THE DELIVERED', '93', '', 'Horror, Drama', 'data/img/p
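
The log above shows that some movies (for example SING ME A SONG) come back with an empty audience score, and that all scores are stored as strings. The snippet below is a minimal sketch of how such values could be turned into numbers; `parse_score` is a hypothetical helper, and treating anything that is not a plain number as missing is an assumption made here.

```python
def parse_score(raw):
    """Convert a scraped score string to an int, or None when it is missing."""
    if raw is None:  # get_attribute returns None when the attribute is absent
        return None
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None


# One row copied from the output above
row = ['SING ME A SONG', '89', '', 'Documentary', 'data/img/poster-4.jpg', 'Dec 4, 2020']
critic, audience = parse_score(row[1]), parse_score(row[2])
print(critic, audience)
```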

**Step 5: Move list of lists content to a pandas dataframe**

In [93]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "critic rating", "audience rating", "genres", "img", "available"])

In [94]:
# Preview the first 50 rows of the dataframe
AllStreamingTitles.head(50)

|  | title | critic rating | audience rating | genres | img | available |
|---|---|---|---|---|---|---|
| 0 | PROMISING YOUNG WOMAN | 91 | 87.0 | Mystery And Thriller, Comedy | data/img/poster-1.jpg | Dec 25, 2020 |
| 1 | YELLOW ROSE | 87 | 82.0 | Music, Drama | data/img/poster-2.jpg | Oct 9, 2020 |
| 2 | AMERICAN SKIN | 29 | 96.0 | Drama | data/img/poster-3.jpg | Jan 15, 2021 |
| 3 | SING ME A SONG | 89 |  | Documentary | data/img/poster-4.jpg | Dec 4, 2020 |
| 4 | THE CHANGIN' TIMES OF IKE WHITE | 100 |  | Documentary, Music | data/img/poster-5.jpg | Dec 4, 2020 |
| 5 | BORN TO BE | 100 |  | Documentary | data/img/poster-6.jpg | Nov 18, 2020 |
| 6 | SAVAGE | 75 | 68.0 | Crime, Drama | data/img/poster-7.jpg | Jan 29, 2020 |
| 7 | THE DELIVERED | 93 |  | Horror, Drama | data/img/poster-8.jpg | Jan 15, 2021 |
| 8 | PIECES OF A WOMAN | 76 | 88.0 | Drama | data/img/poster-9.jpg | Dec 30, 2020 |
| 9 | ASSASSINS | 97 | 100.0 | Documentary | data/img/poster-10.jpg | Dec 11, 2020 |


**Step 6: Export dataframe to csv**

In [95]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/movieDetailsDataRaw.csv', sep=',')
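
A quick way to sanity-check the export is to read it back and count the rows. The sketch below uses only the standard `csv` module; `read_rows` is a hypothetical helper, and the sample text is a shortened stand-in for the real file. Its first header cell is empty because `to_csv` writes the dataframe index as an unnamed first column by default.

```python
import csv
import io


def read_rows(fileobj):
    """Read a CSV export into a list of dicts keyed by the header row."""
    return list(csv.DictReader(fileobj))


# Stand-in for open('data/movieDetailsDataRaw.csv', newline='')
sample = io.StringIO(
    ',title,critic rating,audience rating,genres,img,available\n'
    '0,PROMISING YOUNG WOMAN,91,87,"Mystery And Thriller, Comedy",'
    'data/img/poster-1.jpg,"Dec 25, 2020"\n'
    '3,SING ME A SONG,89,,Documentary,data/img/poster-4.jpg,"Dec 4, 2020"\n'
)
rows = read_rows(sample)
missing_audience = [r["title"] for r in rows if r["audience rating"] == ""]
print(len(rows), missing_audience)
```

On the real file, the row count should match `len(moviesData)`, and the titles listed as missing an audience rating are the ones to look at during cleaning.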