## Dynamic Scraping with Selenium and Chrome webdriver

This is a simple notebook that uses Selenium and Chrome WebDriver to scrape a dynamic website. For static websites we can use [Beautiful Soup](https://pypi.org/project/beautifulsoup4/), but for dynamic ones we need Selenium. The key aspect of [Selenium](https://pypi.org/project/selenium/) is that it drives a real browser, which executes the page's JavaScript while loading it; this makes dynamically rendered content available for scraping. To complete our exercise we will also use [pandas](https://pypi.org/project/pandas/), the data science and data analysis library, to create a dataframe and export our data to our local machine.

Tutorial sections:
- **Headless browsers**: Start a WebDriver session with Selenium
- **One page simple scraping**: retrieve all the content from one page and create a dataframe
- **Scraping across pages**: Continuously load new content, then create a dataframe

**Step 1: import**

In [11]:
# Will make scraping possible
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
import time

# Will let us request, access and save images and other linked files
import urllib.request
import requests

# Will provide structure to data
import pandas as pd

Initialize Selenium and a driver instance using Chrome WebDriver. To install ChromeDriver you can either set its path manually by following [this guide](https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/) or run `pip install webdriver-manager` and let that package manage the driver for you.

**Step 2: start Chrome webdriver session**

For the purpose of this exercise we will get some movies from [Rotten Tomatoes](https://www.rottentomatoes.com/browse/dvd-streaming-all). We are interested in creating a dataset of titles with their rating, availability date and cover image.

In [47]:
# We start a headless Chrome session (headless means that no visible browser window is opened)
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# Instead of setting the path manually (still possible but a bit annoying) we use ChromeDriverManager
driver = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=options)
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")

# print(driver.page_source)
print('done')

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [/Users/francescamorini/.wdm/drivers/chromedriver/mac64/87.0.4280.88/chromedriver] found in cache
done
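
The cell above targets the Selenium 3 API that was current when this notebook was written. In Selenium 4 the `executable_path` argument and the `options.headless` attribute were deprecated and later removed, as were the `find_element_by_*` methods used in the following steps (replaced by `find_element(By.CLASS_NAME, ...)` and similar). A minimal sketch of the same session start on Selenium 4 might look like this. The helper names `build_chrome_args` and `start_headless_chrome` are illustrative and not part of the original notebook, and the `--headless=new` flag assumes a recent Chrome release.

```python
# Sketch for Selenium 4; both functions are hypothetical helpers, not notebook code.

def build_chrome_args(width=1920, height=1200):
    """Return the command-line flags for a headless Chrome window."""
    return ["--headless=new", f"--window-size={width},{height}"]


def start_headless_chrome(url):
    """Start headless Chrome, open `url`, and return the driver."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in build_chrome_args():
        options.add_argument(arg)

    # Selenium 4 receives the driver path through a Service object.
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)
    return driver


# Usage (needs Chrome, Selenium 4 and webdriver-manager):
# driver = start_headless_chrome("https://www.rottentomatoes.com/browse/dvd-streaming-all")
```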


### One page simple scraping

The concept behind scraping is quite easy to grasp: we write a script that behaves like a human. By using the correct methods we can fake human interactions on a webpage. While scripts usually "see" different things than we do, Selenium, Beautiful Soup and WebDrivers are built to see pages as we would. Scripts are obviously faster: they don't need toilet breaks and they do everything very enthusiastically.

**Step 3: Get movies**

In [49]:
# Only getting first page results
movies = driver.find_elements_by_class_name('mb-movie')

**Step 4: loop through movies and build a list of lists containing our data**

In [71]:
import os

# This empty list will later contain my precious data
moviesData = []
# Make sure the destination folder exists before saving any poster
os.makedirs("data/img", exist_ok=True)

# The counter prints the parser status and generates filenames for the images
for count, movie in enumerate(movies, start=1):
    print(count)
    # We can get precise elements by using CSS selectors, classes, tag names and XPaths
    title = movie.find_element_by_class_name('movieTitle').text
    rating = movie.find_element_by_class_name('tMeterIcon').text
    available = movie.find_element_by_class_name('release-date').text
    image = movie.find_element_by_tag_name('img')

    # It is also possible to get accessory information from collected elements;
    # in this case we are interested in the src attribute
    source = image.get_attribute("src")

    # We use urllib to read the poster images and save them locally in a dedicated folder.
    # To avoid overwriting files, we put the counter in each filename with str.format
    imgdestination = "data/img/poster-{0}.jpg".format(count)
    urllib.request.urlretrieve(source, imgdestination)

    # We append a shorter list to the initial list
    moviesData.append([title, rating, available, imgdestination])

# Safety print to check everything is fine!
print(moviesData)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
[['Parallel', ' 80%', 'Available Dec 11', 'data/img/poster-1.jpg'], ['The Emoji Story (Picture Character)', ' 91%', 'Available Dec 22', 'data/img/poster-2.jpg'], ['Yellow Rose', ' 86%', 'Available Jan 5', 'data/img/poster-3.jpg'], ['Soul', ' 96%', 'Available Dec 25', 'data/img/poster-4.jpg'], ['Sing Me A Song', ' 88%', 'Available Jan 1', 'data/img/poster-5.jpg'], ['Pieces Of A Woman', ' 77%', 'Available Jan 7', 'data/img/poster-6.jpg'], ['Lupin III: The First', ' 94%', 'Available Dec 15', 'data/img/poster-7.jpg'], ["Sylvie'S Love", ' 92%', 'Available Dec 23', 'data/img/poster-8.jpg'], ['Beasts Clawing At Straws', ' 96%', 'Available Dec 15', 'data/img/poster-9.jpg'], ['Shadow In The Cloud', ' 76%', 'Available Jan 1', 'data/img/poster-10.jpg'], ['I Am Lisa', ' 91%', 'Available Jan 5', 'data/img/poster-11.jpg
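
The loop in Step 4 assumes that every movie tile contains all four elements. If one is missing, Selenium raises `NoSuchElementException` (imported in Step 1) and the whole loop stops. A hedged sketch of a more tolerant lookup follows; `text_or_default` and `FakeElement` are hypothetical names introduced here for illustration, and the fallback definitions only exist so the sketch also runs without Selenium installed.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    CLASS_NAME = By.CLASS_NAME
except ImportError:
    # Fallbacks so the sketch can run without Selenium installed.
    class NoSuchElementException(Exception):
        pass
    CLASS_NAME = "class name"  # the string value behind By.CLASS_NAME


def text_or_default(parent, class_name, default=""):
    """Return the text of the first child with `class_name`, or `default`."""
    try:
        return parent.find_element(CLASS_NAME, class_name).text
    except NoSuchElementException:
        return default


class FakeElement:
    """Stand-in for a Selenium WebElement, used only to demonstrate the helper."""

    def __init__(self, children, text=""):
        self._children = children  # maps class name -> text of that child
        self.text = text

    def find_element(self, by, value):
        if value not in self._children:
            raise NoSuchElementException(value)
        return FakeElement({}, text=self._children[value])


tile = FakeElement({"movieTitle": "Soul"})
print(text_or_default(tile, "movieTitle"))         # title is present
print(text_or_default(tile, "tMeterIcon", "n/a"))  # rating is missing
```

In the notebook, `movie` would take the place of `tile`, so a tile without a rating yields a placeholder instead of an exception.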

**Step 5: Move list of lists content to a pandas dataframe**

We use pandas to create a dataframe holding the information we gathered from Rotten Tomatoes. This is the gateway to data analysis and data cleaning. From pandas we can export a `.csv` file that can later be cleaned, loaded in various applications and finally used to code our prototype.

In [5]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "rating", "available", "img"])

In [68]:
#Previewing 5 random rows from my dataframe
AllStreamingTitles.sample(5)

| Unnamed: 0 | title | rating | available | img |
|---|---|---|---|---|
| 9 | Shadow In The Cloud | 76% | Available Jan 1 | data/img/poster-10.jpg |
| 20 | Sister Of The Groom | 47% | Available Dec 18 | data/img/poster-21.jpg |
| 10 | I Am Lisa | 91% | Available Jan 5 | data/img/poster-11.jpg |
| 16 | Hunter Hunter | 93% | Available Dec 18 | data/img/poster-17.jpg |
| 0 | Parallel | 80% | Available Dec 11 | data/img/poster-1.jpg |
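
The scraped values are still raw strings: ratings carry a percent sign (and, in the list printed in Step 4, a leading space), and every date starts with the word "Available". As a first taste of the cleaning mentioned above, the rows could be normalised before or after building the dataframe. The sketch below works on plain lists in the `[title, rating, available, img]` layout used in Step 4; `clean_row` is a hypothetical helper, not part of the original notebook.

```python
def clean_row(row):
    """Turn one [title, rating, available, img] row into tidier values."""
    title, rating, available, img = row

    # ' 80%' -> 80; None when the rating is missing or not numeric
    digits = rating.strip().rstrip("%")
    rating_value = int(digits) if digits.isdigit() else None

    # 'Available Dec 11' -> 'Dec 11'
    prefix = "Available "
    release = available[len(prefix):] if available.startswith(prefix) else available

    return [title.strip(), rating_value, release, img]


# Two rows copied from the Step 4 output
raw = [
    ["Parallel", " 80%", "Available Dec 11", "data/img/poster-1.jpg"],
    ["Soul", " 96%", "Available Dec 25", "data/img/poster-4.jpg"],
]
cleaned = [clean_row(row) for row in raw]
print(cleaned)
```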


**Step 6: Export dataframe to csv**

In [48]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/reviewsDataRaw.csv', sep=',')
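
By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is later read with `pd.read_csv`. If that column is not wanted, passing `index=False` leaves it out. The small sketch below shows the effect using an in-memory buffer instead of a file on disk; the two rows are copied from the Step 4 output and the variable names are illustrative.

```python
import io

import pandas as pd

rows = [
    ["Parallel", " 80%", "Available Dec 11", "data/img/poster-1.jpg"],
    ["Soul", " 96%", "Available Dec 25", "data/img/poster-4.jpg"],
]
frame = pd.DataFrame(rows, columns=["title", "rating", "available", "img"])

# Write without the index column, then inspect the header line
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)

# Reading the data back gives exactly the four original columns
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```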

### Scraping across multiple pages 

What if we need to scrape more than the first page? Currently we are getting titles only from the initial screen.
Instead we want to keep going: our script should be able to proceed down the page and do something every time new titles need to be loaded. To approach this problem we need to go back to the source website. At the bottom of the page there is a yellow "Show More" button; clicking it loads new movies.

**Positive aspects**
- We have a distinct element, the yellow button, that is fairly easy to interact with
- We have a stable pattern and everything happens on a single page, so we don't need to touch the page URL
- We can adopt a similar procedure to the previous one.

**Negative aspects**
- The page loads new content indefinitely: the button never changes or disappears, so we have to decide ourselves when to stop

This calls for a fairly simple strategy. We will modify only one step of our procedure: step 3. Instead of creating a variable with the first page results, we will create a while loop that keeps loading content up to a certain point and then collects all the movies.

**Step 1 and 2 stay the same!**

**Step 3: automate clicking of the "Show More" button**

Websites generally don't like to be scraped (on some websites it's explicitly forbidden, so be careful). If we request too many pages or interact with a page too quickly, we might get blocked or throttled (slowed down). We will fake interactions with the "Show More" button every 10 seconds to avoid throttling.

In [67]:
counter = 0
while True:
    # Generic sleep time to avoid throttling
    time.sleep(10)
    # Add 1 to the counter so we can artificially stop the loop
    counter = counter + 1
    # Stop once the counter reaches 5, i.e. after four clicks
    if counter >= 5:
        break
    # Find the button element and click on it to keep loading movies
    driver.find_element_by_class_name('mb-load-btn').click()

# Dump all the loaded elements in one list
moreMovies = driver.find_elements_by_class_name('mb-movie')
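
Since the button never disappears, a fixed number of clicks is the simplest stopping rule. A slightly more defensive variant, sketched below under stated assumptions, also stops as soon as a click no longer adds new tiles. `load_more_movies`, `FakeButton` and `FakePage` are hypothetical names; the fake page (32 tiles per click, capped at 96) is invented purely so the logic can be exercised without a browser. The string `"class name"` is the value behind Selenium's `By.CLASS_NAME`.

```python
import time


def load_more_movies(driver, max_clicks=4, pause=10, sleep=time.sleep):
    """Click "Show More" up to `max_clicks` times and return all movie tiles.

    Stops early when a click does not add any new tiles.
    """
    tiles = driver.find_elements("class name", "mb-movie")
    for _ in range(max_clicks):
        driver.find_element("class name", "mb-load-btn").click()
        sleep(pause)  # let the new content load and avoid being throttled
        updated = driver.find_elements("class name", "mb-movie")
        if len(updated) <= len(tiles):
            break  # nothing new was loaded, so stop clicking
        tiles = updated
    return tiles


class FakeButton:
    """Stand-in for the "Show More" button."""

    def __init__(self, page):
        self.page = page

    def click(self):
        self.page.click_count += 1


class FakePage:
    """Stand-in driver: each click reveals 32 more tiles, up to 96 (made-up numbers)."""

    def __init__(self):
        self.click_count = 0

    def find_element(self, by, value):
        return FakeButton(self)

    def find_elements(self, by, value):
        return ["tile"] * min(32 + 32 * self.click_count, 96)


page = FakePage()
tiles = load_more_movies(page, max_clicks=5, pause=0, sleep=lambda seconds: None)
print(len(tiles), page.click_count)
```

With the real `driver`, the call would simply be `moreMovies = load_more_movies(driver)`, keeping the 10 second pause from the cell above.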

**Steps 4 and 5 are identical to the "one page simple scraping" example, looping over `moreMovies` instead of `movies`**

In [72]:
# Finally, once we have everything we are interested in, we quit our WebDriver to avoid sessions piling up
driver.quit()
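
If a cell raises an error before `driver.quit()` is reached, the Chrome session stays open in the background. One way to guard against that in a script is to wrap the work in `try`/`finally`. This is a minimal sketch: `run_with_driver` and `FakeDriver` are hypothetical names used only for illustration, and the fake driver stands in for a real WebDriver so the pattern can be tried without a browser.

```python
def run_with_driver(make_driver, task):
    """Create a driver, run `task(driver)`, and always quit the driver."""
    driver = make_driver()
    try:
        return task(driver)
    finally:
        driver.quit()  # runs even if `task` raises


class FakeDriver:
    """Stand-in for a WebDriver that only records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    # The task fails on purpose to show that quit() is still called
    run_with_driver(lambda: fake, lambda driver: 1 / 0)
except ZeroDivisionError:
    pass
print(fake.closed)
```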

### Conclusion

Each website is different, so scraping is often a tailored activity that requires some coding and strategizing effort. This tutorial presented some simple features and a couple of useful patterns for scraping a relatively contained amount of data. There are more complex tasks that require more savviness, time and computing effort; social media platforms are one example. For special or recurring cases there's a useful collection of tools that can be found [here](https://wiki.digitalmethods.net/Dmi/ToolDatabase).

### References

1. Guide to Selenium basic functions for scraping: https://www.scrapingbee.com/blog/selenium-python/
2. Including WebDriver location in MacOS System Path: https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/
3. Data Science Skills Set: https://francesco-ai.medium.com/data-science-skills-list-9f38863adab5
4. DMI Tools Database: https://wiki.digitalmethods.net/Dmi/ToolDatabase