## Dynamic Scraping with Selenium and Chrome WebDriver

This notebook is a work in progress. The idea is to create a basic tutorial on using Selenium and Chrome WebDriver to scrape a dynamic website. I am also planning a tutorial based on BeautifulSoup for static websites. The useful thing about Selenium is that it drives a real browser, which executes JavaScript while loading a page; this allows dynamic websites to be scraped easily.

Tutorial steps:
- Importing Selenium and basic rules of webdrivers
- First scraping: get all elements on one page
- Create a timed click function that keeps scrolling down and repeats the scraping

In [64]:
# Will make scraping possible
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Will let us request images and other links by URL and save them locally
import urllib.request
import requests

# Will provide structure to data
import pandas as pd

Initialize Selenium and a driver instance using Chrome WebDriver. To install ChromeDriver properly you can either set the path manually, following [this guide](https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/), or run `pip install webdriver-manager` and let it manage the driver for you.

In [21]:
# We start a headless Chrome session. (Headless means that we don't open a visual browser window)
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# Instead of setting the path manually (still possible but a bit annoying) we use ChromeDriverManager
driver = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=options)
driver.get("https://www.rottentomatoes.com/browse/dvd-streaming-all")

# print(driver.page_source)
print('done')

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [/Users/francescamorini/.wdm/drivers/chromedriver/mac64/87.0.4280.88/chromedriver] found in cache
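
The cell above targets the Selenium release that was current when this notebook was written. More recent Selenium 4 releases no longer accept the `executable_path` argument and expect the driver location to be passed through a `Service` object instead. The following is only a sketch of the same setup under that assumption; `chrome_arguments` and `make_driver` are hypothetical helper names, not part of Selenium or of this notebook.

```python
def chrome_arguments(headless=True, width=1920, height=1200):
    """Return the Chrome command-line switches used for the session."""
    args = ["--window-size={0},{1}".format(width, height)]
    if headless:
        # Passing the switch directly avoids relying on `options.headless`
        args.append("--headless")
    return args


def make_driver(headless=True):
    """Start Chrome the Selenium 4 way (assumes selenium>=4 and webdriver-manager)."""
    # Imported here so chrome_arguments() stays usable without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)

    # Selenium 4 receives the driver location through a Service object
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

With this variant, `driver = make_driver()` followed by `driver.get(...)` would take the place of the cell above. Note that the same Selenium releases also dropped the `find_elements_by_*` helpers in favour of `driver.find_elements(By.CLASS_NAME, 'mb-movie')`, with `By` imported from `selenium.webdriver.common.by`.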

### One time simple scraping

The concept behind scraping is quite easy to grasp: we create a script that behaves like a human. By using the correct methods we can fake human interactions with a webpage. Scripts usually "see" different things than we do, but tools such as Selenium, Beautiful Soup and WebDrivers are built to let a script get at a page much as we would with our eyes. Scripts are obviously faster: they don't need to go to the toilet and they do everything very enthusiastically.

In [14]:
# Only getting first page results
movies = driver.find_elements_by_class_name('mb-movie')

In [47]:
# This empty list will later contain my precious data
moviesData = []
# I am creating a counter: it is used to print the parser status and to generate filenames for images
count = 0
for movie in movies:
    
    count = count + 1
    print(count)
    # we can get precise elements by using css selectors, classes, tag names and xpaths
    title = movie.find_element_by_class_name('movieTitle').text
    rating = movie.find_element_by_class_name('tMeterIcon').text
    available = movie.find_element_by_class_name('release-date').text
    image = movie.find_element_by_tag_name('img')
    
    # it is also possible to get accessory information from collected elements, 
    # in this case we are interested in the src attribute
    source = image.get_attribute("src")
    
    # We use urllib to download the poster images and save them locally in the data/img folder.
    # To avoid overwriting files, the counter is inserted into each filename with str.format
    imgdestination = "data/img/poster-{0}.jpg".format(count)
    urllib.request.urlretrieve(source, imgdestination)
    
    # We append this movie's row (a short list) to the main list
    moviesData.append([title, rating, available, imgdestination])

# Safety print to check everything is fine!
print(moviesData)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
[['Parallel', ' 80%', 'Available Dec 11', 'data/img/poster-1.jpg'], ['The Emoji Story (Picture Character)', ' 91%', 'Available Dec 22', 'data/img/poster-2.jpg'], ['Soul', ' 96%', 'Available Dec 25', 'data/img/poster-3.jpg'], ['Sing Me A Song', ' 88%', 'Available Jan 1', 'data/img/poster-4.jpg'], ['Pieces Of A Woman', ' 76%', 'Available Jan 7', 'data/img/poster-5.jpg'], ["I'M Your Woman", ' 81%', 'Available Dec 11', 'data/img/poster-6.jpg'], ['Lupin III: The First', ' 93%', 'Available Dec 15', 'data/img/poster-7.jpg'], ['Wander Darkly', ' 75%', 'Available Dec 11', 'data/img/poster-8.jpg'], ["Sylvie'S Love", ' 92%', 'Available Dec 23', 'data/img/poster-9.jpg'], ['Safety', ' 82%', 'Available Dec 11', 'data/img/poster-10.jpg'], ['Beasts Clawing At Straws', ' 96%', 'Available Dec 15', 'data/img/poster-11.jpg'], ['Wolfwalkers', ' 99%', 'Available Dec 11', 'data/img/poster-12.jpg'], ['Shadow In The Cloud', 
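
Two things can interrupt the loop above: `find_element_by_class_name` raises an exception when a movie lacks one of the elements, and `urlretrieve` cannot write into a folder that does not exist yet. The helpers below are a minimal sketch of how both cases could be handled. `text_or_default` and `poster_path` are hypothetical names introduced for illustration, and the string `"class name"` is the value behind Selenium's `By.CLASS_NAME`.

```python
from pathlib import Path


def text_or_default(parent, class_name, default=""):
    """Return the text of the first matching child, or `default` if none exists."""
    # find_elements returns an empty list when nothing matches,
    # whereas find_element raises NoSuchElementException.
    matches = parent.find_elements("class name", class_name)
    return matches[0].text if matches else default


def poster_path(index, folder="data/img"):
    """Build the poster filename for `index`, creating the folder if needed."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    return str(target / "poster-{0}.jpg".format(index))
```

Inside the loop, `rating = text_or_default(movie, 'tMeterIcon')` and `imgdestination = poster_path(count)` would be drop-in replacements for the corresponding lines.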

We use pandas to create a dataframe holding the information we gathered from Rotten Tomatoes. This is the gateway to data analysis and data cleaning. From pandas we can export a `.csv` file that can later be cleaned, loaded in various applications and finally used to code our prototype.

In [45]:
AllStreamingTitles = pd.DataFrame(moviesData, columns=["title", "rating", "available", "img"])

In [46]:
AllStreamingTitles.sample(5)

|    | title | rating | available | img |
|----|-------|--------|-----------|-----|
| 29 | The One You Feed | 40% | Available Dec 29 | data/img/poster-30.jpg |
| 30 | If Not Now, When? | 44% | Available Jan 8 | data/img/poster-31.jpg |
| 26 | We Can Be Heroes | 69% | Available Jan 1 | data/img/poster-27.jpg |
| 13 | I Am Lisa | 91% | Available Jan 5 | data/img/poster-14.jpg |
| 4 | Pieces Of A Woman | 76% | Available Jan 7 | data/img/poster-5.jpg |


In [48]:
# Check the 'data' subfolder after executing this cell to see the file.
AllStreamingTitles.to_csv('data/reviewsDataRaw.csv', sep=',')
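
The raw values still carry scraping residue, such as the leading space in `' 80%'` and the `Available` prefix on the dates. As an illustration of the kind of cleaning mentioned above, here is a small sketch in plain Python; `clean_row` is a hypothetical helper, and the sample rows are copied from the printed output earlier in this notebook.

```python
def clean_row(row):
    """Turn one scraped row into [title, rating as int or None, release date, image path]."""
    title, rating, available, img = row
    digits = rating.strip().rstrip("%")
    # Movies without a score would yield an empty string, so fall back to None
    rating_value = int(digits) if digits.isdigit() else None
    # Drop only the leading "Available" label and keep the date text
    release = available.replace("Available", "", 1).strip()
    return [title, rating_value, release, img]


sample_rows = [
    ['Parallel', ' 80%', 'Available Dec 11', 'data/img/poster-1.jpg'],
    ['Soul', ' 96%', 'Available Dec 25', 'data/img/poster-3.jpg'],
]
cleaned = [clean_row(row) for row in sample_rows]
print(cleaned)
```

The same function could be applied to `moviesData` before building the DataFrame, so that `rating` ends up as a number that can be sorted and averaged.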

### Continuous scraping across pages 

What if we need to scrape more than the first page? Currently we are getting titles only from this initial screen. ![initial screen of rotten tomatoes with movies preview images, we see a limited amount of titles](screenshot.png)

Instead we want to keep going: our script should be able to proceed down the page and, every time new titles stop loading, do something to load more information.
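
One possible shape for such a loop is sketched below. This is not the finished function for this notebook: `show_more` is a hypothetical helper, the button selector has to be looked up on the actual page, and the pause length is a guess. The string `"css selector"` is the value behind Selenium's `By.CSS_SELECTOR`.

```python
import time


def show_more(driver, button_selector, pause=2.0, max_clicks=10):
    """Click a 'show more' button until it disappears or max_clicks is reached.

    Returns the number of clicks that were performed.
    """
    clicks = 0
    while clicks < max_clicks:
        buttons = driver.find_elements("css selector", button_selector)
        if not buttons:
            # No button left: assume everything has been loaded
            break
        # Scroll to the bottom so the button is in view before clicking
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        buttons[0].click()
        clicks += 1
        # Give the page time to load the new titles
        time.sleep(pause)
    return clicks
```

After a call such as `show_more(driver, 'button.hypothetical-show-more')`, collecting the `mb-movie` elements again should return the titles from every expanded batch. The `max_clicks` limit is there so the loop cannot run forever if the button never disappears.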

In [62]:
# TODO: a while loop with some kind of timer to pace the clicks and expand the results?


In [65]:

moreMovies = showMore()

HTTPConnectionPool(host='127.0.0.1', port=56558): Max retries exceeded with url: /session/89d7045f827516940d122578e543205b/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd570ad31f0>: Failed to establish a new connection: [Errno 61] Connection refused'))


In [18]:
# Getting all elements again
movies = driver.find_elements_by_class_name('mb-movie')

for movie in movies:
    print(movie.text)