## Dynamic Scraping with Selenium and Chrome WebDriver

#### Approx execution time: 1h 30min

This is a simple notebook that uses Selenium and Chrome WebDriver to scrape a dynamic website. For static websites we can use [Beautiful Soup](https://pypi.org/project/beautifulsoup4/), but dynamic ones need Selenium. The key aspect of [Selenium](https://pypi.org/project/selenium/) is that it drives a real browser, so the page's JavaScript runs when a page is loaded; this makes dynamic websites much easier to scrape. For this exercise we will use [Chrome WebDriver](https://chromedriver.chromium.org/), a tool for automated testing of web apps. We will also use [pandas](https://pypi.org/project/pandas/), the data analysis library, to create a dataframe and export our data to our local machine.

**Tutorial sections:**
- **Start a Headless session**: Start a WebDriver session with Selenium. It will be headless since we won't open a visual browser window and our script will "drive" it across the pages we need.
- **One page simple scraping**: Retrieve all the content from one page and create a dataframe.
- **Scraping across pages**: Continuously load new content, then create a dataframe.

**Step 1: import**

In [2]:
# Will make scraping possible by importing our webdriver and relevant options. This import section can easily be
# customized based on your needs (for example to add particular options to your browser or to catch common errors)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

# Will allow to request, access and write urls of images and other links
import urllib.request
import requests

# Will allow us to move the data we scrape within tables and then export them to csv
import pandas as pd

Initialize Selenium and driver instance using Chrome WebDriver. To properly install ChromeDriver you can manually set the path following [this guide](https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/) or use `pip install webdriver-manager` to manage it.

**Step 2: start Chrome webdriver session**

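For the purpose of this exercise we will get some movies from an [IMDb search page](https://www.imdb.com/search/title/?title_type=feature&genres=drama&sort=num_votes,desc&count=250) in the first part and from [Rotten Tomatoes](https://www.rottentomatoes.com/browse/dvd-streaming-all) in the second part. We are interested in creating a dataset of titles with their runtime, release date and ratings.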

In [3]:
# We start a headless Chrome session. (Headless means that we don't open a visual browser window)
options = Options()
# If you comment out the option below you will be able to see Chrome opening and performing operations!
options.add_argument("--headless")
options.add_argument("--window-size=1920,1200")
options.add_argument("--disable-gpu")

# Instead of setting the path manually (still possible but a bit annoying) we use ChromeDriverManager.
# Selenium 4 expects the driver path to be wrapped in a Service object.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
There is no [win32] chromedriver for browser  in cache
Trying to download new driver from https://chromedriver.storage.googleapis.com/96.0.4664.45/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Seb\.wdm\drivers\chromedriver\win32\96.0.4664.45]


In [4]:
driver.get("https://www.imdb.com/search/title/?title_type=feature&genres=drama&sort=num_votes,desc&count=250")

# print(driver.page_source)
# remove the comment above to print the result of our driver.get()
print('done')

done
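
The URL passed to `driver.get()` above packs several settings into its query string. As a minimal sketch, it can be assembled from its parts with the standard library. The helper name `build_search_url` is hypothetical, and the comments on what each parameter does are my reading of the parameter names, not documented behavior.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(genre="drama", count=250):
    """Assemble the search URL from its query parameters (hypothetical helper)."""
    params = {
        "title_type": "feature",   # presumably: feature films only
        "genres": genre,
        "sort": "num_votes,desc",  # presumably: most-voted titles first
        "count": count,            # presumably: results per page
    }
    # safe="," keeps the comma in the sort value unescaped, as in the URL above
    return BASE_URL + "?" + urlencode(params, safe=",")


search_url = build_search_url()
print(search_url)
```

With this in place, `driver.get(build_search_url())` would load the same page, and changing `genre` or `count` becomes a one-argument edit.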


### One time simple scraping

The concept behind scraping is quite easy to grasp. We are creating a script that behaves like a human: by using the correct methods we can fake human interactions on a webpage. While scripts usually "see" different things than we do, Selenium, Beautiful Soup and WebDrivers are built to see pages as we would. Scripts are obviously faster, they never need a toilet break and they do everything very enthusiastically.

**Step 3: Get movies**

Now we will select the specific elements we are interested in. To do so we instruct our script to look for them on the page. Selenium provides a series of methods for this: we can select elements by class name, HTML tag, CSS selector, attribute and so on. Open the page loaded in the previous cell in your own browser, right-click on any part of the screen and select "Inspect Element" to activate the source inspector. Here you can check the page structure and find the elements you want to obtain. On the search page used in the code below, each movie entry is wrapped in a container with class `lister-item-content`. To get all the movies we tell Selenium to collect only elements with this specific class name.

In [13]:
# Only getting first page results
movies = driver.find_elements(By.CLASS_NAME, 'lister-item-content')
# Check how many elements we obtained. It should be 250, matching the count=250 parameter in the URL.
print(len(movies))

250


**Step 4: loop through movies and build a list of lists containing our data**

Now that we have a list of elements we can loop over it and replicate our pattern in a more structured way. We will ask Selenium to look for the text fields of each movie. Again we have a wide range of options: we can use the `.text` property to obtain the content of an HTML tag, or we can read attributes such as the target of a link. It really depends on the website structure and the information we are interested in.

In [16]:
# This empty list will later contain my precious data
moviesData = []
# I am creating a counter, this will print the parser status and generate filenames for images
count = 0
for movie in movies:
    
    count = count + 1
    
    # we can get precise elements by using css selectors, classes, tag names and xpaths
    title = movie.find_element(By.CLASS_NAME, 'lister-item-header').find_element(By.TAG_NAME, "a").text
    length = movie.find_element(By.CLASS_NAME, 'runtime').text
    rating = movie.find_element(By.CLASS_NAME, 'ratings-imdb-rating').find_element(By.TAG_NAME, "strong").text
    metascore = movie.find_element(By.CLASS_NAME, 'metascore').text
    available = movie.find_element(By.CLASS_NAME, 'lister-item-year').text
    print(count, '/', title, length, available)
    
    # It is also possible to read attributes from collected elements with get_attribute(),
    # for example the src of a poster image. This cell collects text fields only.
    
    # We append a shorter list to the initial list
    moviesData.append([title, length, available, rating, metascore])

# Safety print to check everything is fine!
print(moviesData)

1 / The Shawshank Redemption 142 min (1994)
2 / The Dark Knight 152 min (2008)
3 / Fight Club 139 min (1999)
4 / Forrest Gump 142 min (1994)
5 / Pulp Fiction 154 min (1994)
6 / The Lord of the Rings: The Fellowship of the Ring 178 min (2001)
7 / The Lord of the Rings: The Return of the King 201 min (2003)
8 / The Godfather 175 min (1972)
9 / Interstellar 169 min (2014)
10 / The Dark Knight Rises 164 min (2012)
11 / The Lord of the Rings: The Two Towers 179 min (2002)
12 / Seven 127 min (1995)
13 / Django Unchained 165 min (2012)
14 / Gladiator 155 min (2000)
15 / Batman Begins 140 min (2005)
16 / Inglourious Basterds 153 min (2009)
17 / The Silence of the Lambs 118 min (1991)
18 / Saving Private Ryan 169 min (1998)
19 / The Wolf of Wall Street 180 min (2013)
20 / Schindler's List 195 min (1993)
21 / The Prestige 130 min (2006)
22 / The Departed 151 min (2006)
23 / The Green Mile 189 min (1999)
24 / The Godfather: Part II 202 min (1974)
25 / Joker 122 min (2019)
26 / American Beauty 122

231 / 10 Things I Hate About You 97 min (1999)
232 / Click 107 min (2006)
233 / Hugo 126 min (2011)
234 / Top Gun 110 min (1986)
235 / Real Steel 127 min (2011)
236 / The Blind Side 129 min (2009)
237 / Chinatown 130 min (1974)
238 / The Proposal 108 min (I) (2009)
239 / The Hunt 115 min (2012)
240 / 10 Cloverfield Lane 103 min (2016)
241 / Fifty Shades of Grey 125 min (2015)
242 / Interview with the Vampire: The Vampire Chronicles 123 min (1994)
243 / Lucky Number Slevin 110 min (2006)
244 / The Book of Eli 118 min (2010)
245 / To Kill a Mockingbird 129 min (1962)
246 / The Secret Life of Walter Mitty 114 min (2013)
247 / Lord of War 122 min (2005)
248 / Gone with the Wind 238 min (1939)
249 / Annihilation 115 min (I) (2018)
250 / Magnolia 188 min (1999)
[['The Shawshank Redemption', '142 min', '(1994)', '9.3', '80'], ['The Dark Knight', '152 min', '(2008)', '9.0', '84'], ['Fight Club', '139 min', '(1999)', '8.8', '66'], ['Forrest Gump', '142 min', '(1994)', '8.8', '82'], ['Pulp Ficti
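
The loop above stops with an exception as soon as one movie lacks a field (a title without a metascore, for instance). A minimal sketch of a more tolerant lookup follows. The helper `text_or_default` and the `FakeElement` stand-in are hypothetical and exist only so the example runs without a browser. With Selenium you would pass a real element, `By.CLASS_NAME` and `missing=NoSuchElementException`.

```python
def text_or_default(parent, by, value, default=None, missing=Exception):
    """Return the text of a child element, or `default` if the lookup fails."""
    try:
        return parent.find_element(by, value).text
    except missing:
        return default


class FakeElement:
    """Stand-in for a Selenium element (assumption: only .text and find_element matter here)."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_element(self, by, value):
        try:
            return self._children[(by, value)]
        except KeyError:
            raise LookupError("no element for {0}={1}".format(by, value)) from None


# A movie entry that has a runtime but no metascore
movie = FakeElement(children={("class name", "runtime"): FakeElement("142 min")})

runtime = text_or_default(movie, "class name", "runtime", missing=LookupError)
metascore = text_or_default(movie, "class name", "metascore", default="", missing=LookupError)
print(repr(runtime), repr(metascore))
```

Returning a default keeps every row the same length, which matters later when the list of lists is turned into a dataframe.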

**Step 5: Move list of lists content to a pandas dataframe**

We use pandas to create a dataframe holding the information we gathered from IMDb. This is the gateway to data analysis and data cleaning. From pandas we can export a `.csv` file that can later be cleaned, loaded in various applications and finally used to code our prototype.

In [17]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["Title", "Runtime", "Release Date", "IMDB Rating", "Critics Score"])

In [22]:
#Previewing 5 random rows from my dataframe
AllStreamingTitles.sample(5)

| | Title | Runtime | Release Date | IMDB Rating | Critics Score |
|---|---|---|---|---|---|
| 238 | The Hunt | 115 min | (2012) | 8.3 | 77 |
| 82 | The Hunger Games: Catching Fire | 146 min | (2013) | 7.5 | 76 |
| 201 | Man on Fire | 146 min | (2004) | 7.7 | 47 |
| 198 | Jojo Rabbit | 108 min | (2019) | 7.9 | 58 |
| 243 | The Book of Eli | 118 min | (2010) | 6.9 | 53 |
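
The scraped values are still strings such as `142 min`, `(1994)` or `(I) (2009)`. As a small sketch of the cleaning step mentioned above, two hypothetical helpers (`parse_runtime` and `parse_year`, not part of the original notebook) can turn them into integers with regular expressions.

```python
import re


def parse_runtime(text):
    """Extract the minutes from a string like '142 min'; None if no number is present."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_year(text):
    """Extract the four-digit year from '(1994)' or '(I) (2009)'; None if absent."""
    match = re.search(r"\((\d{4})\)", text)
    return int(match.group(1)) if match else None


print(parse_runtime("142 min"), parse_year("(1994)"), parse_year("(I) (2009)"))
```

On the dataframe these could be applied column by column, for example `AllStreamingTitles["Runtime"].map(parse_runtime)`.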


**Step 6: Export dataframe to csv**

In [23]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/movie_length_to_rating.csv', sep=',')
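
The export above, like the screenshot and poster paths used later in this notebook, writes into subfolders and fails if they do not exist yet. A minimal sketch for creating them up front follows. The helper `ensure_dirs` is hypothetical, and the demo uses a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_dirs(*folders):
    """Create each folder (and any missing parents); return them as Path objects."""
    paths = [Path(folder) for folder in folders]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demo inside a throwaway directory; in the notebook you would pass the real folder names
with tempfile.TemporaryDirectory() as root:
    created = ensure_dirs(
        Path(root) / "data",
        Path(root) / "data" / "img_basic",
        Path(root) / "screenshots",
    )
    all_exist = all(path.is_dir() for path in created)

print(all_exist)
```

In the notebook itself, calling `ensure_dirs("data", "data/img_basic", "screenshots")` once before scraping would be enough.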

### Iterating over page elements

What if we need to scrape more than what is visible on the first page? So far we are getting titles only from the initial screen. Instead we want to keep going: our script should proceed down the page and do something every time new titles need to be loaded. To approach this problem we go back to the [Rotten Tomatoes](https://www.rottentomatoes.com/browse/dvd-streaming-all) page. At the bottom of the page there is a yellow "Show More" button; if we click on it, new movies are loaded.

**Positive aspects**
- We have a distinct element, the yellow button with fairly easy interaction
- We have a stable pattern and everything is happening on one single page, we don't need to touch the page URL
- We can adopt a similar procedure to the previous one.

**Negative aspects**
- The page keeps loading new content indefinitely: the button never changes or disappears, so we have to decide ourselves when to stop

This calls for a fairly simple strategy. We will modify only one step of our procedure: step 3. Instead of creating a variable with the first page results, we will create a while loop that keeps loading content up to a certain point and then collects all the movies. This particular case does not require that we access individual URLs; for an example of cross-page scraping see the Advanced Dynamic Scraper notebook.
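
Since the button never disappears, the loop needs its own stopping rule. The sketch below shows one possible shape for it: stop after a fixed number of clicks, or earlier if a particular title has shown up. The function `expand_until` and the fake page are hypothetical. With Selenium, `load_more` would click the button and `get_titles` would read the titles currently on the page.

```python
def expand_until(load_more, get_titles, stop_title=None, max_clicks=10):
    """Call load_more() until stop_title is visible or max_clicks is reached.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        if stop_title is not None and stop_title in get_titles():
            break
        load_more()
        clicks += 1
    return clicks


# Fake page: two titles visible at first, each "click" reveals the next batch
remaining = iter([["Mosley", "Benedetta"], ["Last Words", "Sensation"]])
visible = ["Rumble", "Hurt"]


def fake_load_more():
    visible.extend(next(remaining, []))


clicks = expand_until(fake_load_more, lambda: visible, stop_title="Benedetta")
capped = expand_until(fake_load_more, lambda: visible, max_clicks=3)
print(clicks, capped, len(visible))
```

Keeping the stopping rule in one function makes it easy to switch between "click N times" and "stop at this title" without touching the collection code.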

**Step 1 and 2 stay the same!**

**Step 3: automate clicking of the "Show More" button**

Websites generally don't like to be scraped (on some websites it's explicitly forbidden, so be careful). If we interact with the page too quickly we might get blocked or throttled (slowed down). We will fake interactions with the "Show More" button every 5 seconds to avoid throttling.
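
A fixed 5-second wait works; a small variation is to add a random extra delay so the clicks are not perfectly regular. This is my own addition rather than part of the original notebook. The helper `polite_pause` is hypothetical, and its `sleep` and `rng` parameters exist only so the demo can run instantly.

```python
import random
import time


def polite_pause(base=5.0, jitter=2.0, sleep=time.sleep, rng=random):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + rng.uniform(0, jitter)
    sleep(delay)
    return delay


# Demo with a no-op sleep so the example returns immediately
delay = polite_pause(base=5.0, jitter=2.0, sleep=lambda seconds: None)
print(5.0 <= delay <= 7.0)
```

In the loop below, `polite_pause()` could stand in for `time.sleep(5)`.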

In [48]:
# I am creating a variable holding the number of times I want to click on the 'Show More' button. 
# Alternatively I could code my script to stop when a certain point in data is reached, 
#for example a particular title
pageExpandLimit = 0
# I am also creating a variable to store movie pictures correctly
count = 0
# An empty list to store my webelements
moreMovies = []
# And an empty list that will be filled with my data.
dataList = []

while True:
    # Report progress, then add 1 to the counter so we can artificially stop the loop
    print('I clicked "show more" {0} times'.format(pageExpandLimit))
    pageExpandLimit = pageExpandLimit + 1 
    
    # Taking a screenshot every time we go through the loop (to check where my browser's at)
    # This can be removed if it ends up taking too much space.
    driver.save_screenshot("screenshots/screenshot{0}.png".format(pageExpandLimit))
    
    if pageExpandLimit < 10:
        
        # Generic sleep time to avoid throttling
        time.sleep(5)
        # Find the button element and click on it to keep loading movies
        driver.find_element(By.CLASS_NAME, 'mb-load-btn').click()  
        # Dump all the elements in one list
        moreMovies = driver.find_elements(By.CLASS_NAME, 'mb-movie')
        
    # Once the counter is greater than 10 we collect the data and break the while loop
    if pageExpandLimit > 10:
        
        # Before breaking the loop I will go through the data as I did in the example above
        print('started movies collection')
        for movie in moreMovies:
            count = count + 1
            try:
                title = movie.find_element(By.CLASS_NAME, 'movieTitle').text
                rating = movie.find_element(By.CLASS_NAME, 'tMeterIcon').text
                available = movie.find_element(By.CLASS_NAME, 'release-date').text
                image = movie.find_element(By.TAG_NAME, 'img')
            
                print(count, '/', title)

                source = image.get_attribute("src")

                imgdestination = "data/img_basic/poster-{0}.jpg".format(count)
                urllib.request.urlretrieve(source, imgdestination)

                # Appending data
                dataList.append([title, rating, available, imgdestination])
            except Exception:
                print('an exception occurred for', count)
            
        # Breaking and exiting the loop
        break

I clicked "show more" 0 times
I clicked "show more" 1 times
I clicked "show more" 2 times
I clicked "show more" 3 times
I clicked "show more" 4 times
I clicked "show more" 5 times
I clicked "show more" 6 times
I clicked "show more" 7 times
I clicked "show more" 8 times
I clicked "show more" 9 times
I clicked "show more" 10 times
started movies collection
1 / Rumble




2 / Hurt
3 / Mosley
4 / Benedetta
5 / Last Words
6 / Saint-Narcisse
7 / Try Harder!
8 / Back To The Outback
9 / The Scary Of Sixty-First
10 / Sensation
11 / Red Snow
12 / The Novice
13 / Even Mice Belong In Heaven
14 / The Lost Daughter
15 / The Hand Of God (È Stata La Mano Di Dio)
16 / Encounter
17 / Don'T Look Up
18 / The United States Of Insanity
19 / The Unforgivable
20 / Being The Ricardos
21 / The Hating Game
22 / Portal Runner
23 / Swan Song
24 / Mother/Android
25 / Project Space 13
26 / Joy Womack: The White Swan
27 / The Last Son
28 / Boxing Day
29 / Margrete - Queen Of The North
30 / Minnal Murali
31 / Harry Potter 20th Anniversary: Return To Hogwarts
32 / Beanie Mania
33 / Clifford The Big Red Dog
34 / Bugs
35 / They Say Nothing Stays The Same (Aru Sendo No Hanashi)
36 / Broken Law
37 / Violet
38 / Freeland
39 / Gustav Stickley: American Craftsman
40 / She Paradise
41 / A Gift From Bob
42 / Karen Dalton: In My Own Time
43 / Night Raiders
44 / Captains Of Za'Atari
45 / Ma Bel

In [53]:
print(dataList)



In [54]:
MoreStreamingTitles = pd.DataFrame(dataList, columns=["title", "rating", "available", "img"])

In [55]:
MoreStreamingTitles.sample(5)

| | title | rating | available | img |
|---|---|---|---|---|
| 83 | Mothers Of The Revolution | 100% | Available Oct 19 | data/img_basic/poster-84.jpg |
| 17 | The United States Of Insanity | 100% | Available Dec 10 | data/img_basic/poster-18.jpg |
| 151 | Yakuza Princess | 35% | Available Sep 3 | data/img_basic/poster-152.jpg |
| 185 | Karen | 17% | Available Sep 3 | data/img_basic/poster-186.jpg |
| 53 | Boiling Point | 98% | Available Nov 23 | data/img_basic/poster-54.jpg |


In [56]:
MoreStreamingTitles.to_csv('data/titlesDataRaw_extended.csv', sep=',')

**Steps 4, 5 and 6 are identical to the "one page simple scraping" example**

In [38]:
# Finally, once we have everything we were interested in, we quit our WebDriver to avoid sessions piling up
driver.quit()
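
If a cell fails halfway through, `driver.quit()` may never run and the browser process stays alive. One way to guard against that, sketched here under the assumption that the driver is created by a factory function, is a small context manager that always quits. `browser_session` and `FakeDriver` are hypothetical names, and the fake driver only stands in for Selenium so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver and make sure quit() is called, even if scraping fails."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a WebDriver that only records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with browser_session(lambda: fake) as session_driver:
        raise RuntimeError("scrape failed")  # simulate an error in the middle of scraping
except RuntimeError:
    pass

print(fake.closed)
```

With Selenium, `make_driver` would be a function that returns the configured `webdriver.Chrome(...)` instance from step 2.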

### Conclusion

Each website is different, so scraping is often a tailored activity that requires some coding and strategizing effort. This tutorial presented some simple features and a couple of useful patterns for scraping a relatively contained amount of data. There are more complex and articulated tasks that require more savviness, time and computing effort. One example is social media. For special or recurring cases there's a useful collection of tools that can be found [here](https://wiki.digitalmethods.net/Dmi/ToolDatabase).

### References

1. Guide to Selenium basic functions for scraping: https://www.scrapingbee.com/blog/selenium-python/
2. Including WebDriver location in MacOS System Path: https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/
3. Data Science Skills Set: https://francesco-ai.medium.com/data-science-skills-list-9f38863adab5
4. DMI Tools Database: https://wiki.digitalmethods.net/Dmi/ToolDatabase