# Selenium Webscraping Tutorial

#### Table of Contents

1. <a href='##Introduction'>The Basics of Webscraping With Selenium</a> <br>
> 1.a <a href='##package'>The Selenium package</a> <br>
2. <a href='##example'>Webscraping Example</a> <br>
> 2.a <a href='##browser'>Selenium headless browser</a> <br>
> 2.b <a href='##find_elements'>Find elements using the Selenium Library</a> <br>
> 2.c <a href='##click_scroll'>Clicking, scrolling, going back</a> <br>
3. <a href='##scrolling_example'>Selenium scrolling example</a> <br>
4. <a href='##links'>Iterating through all movie links</a> <br>

<a id='#Introduction'></a>

### The Basics of Webscraping with Selenium

Selenium automates web browsing. Routine point-and-click tasks on websites can (and should) be handled programmatically, and Selenium gives us this capability. There are many use cases for Selenium that BeautifulSoup cannot handle. Programs that require scrolling to the bottom of a webpage, pointing and clicking on a button, or passing information into a search box are all perfect use cases for Selenium. Selenium has its own HTML parsing functionality (which we'll examine), but it can also be used **in conjunction** with BeautifulSoup: Selenium loads the webpage, and BeautifulSoup is then called to parse it. We will examine all of these use cases and more in the following demo.

We'll be working in Python 3 for this tutorial, although Beautiful Soup also runs on Python 2 with some simple adjustments to your code. Let's take a look at an example and see if we can get you up and running with parsing your own web pages!

(1) https://www.seleniumhq.org/

<a id='#package'></a>

### The Selenium package

The Selenium WebDriver allows us to fetch webpages, which we can then control remotely (via the Python code we'll write). The first steps in running Selenium are to:

* **Install Selenium (in your environment)**
* **Install the Selenium WebDriver for your browser**

(http://selenium-python.readthedocs.io/installation.html)

Try following the directions at the link above, and make sure that your web browser matches a supported version listed in the webdriver download links. The latest version should be clearly denoted there.

<a id='#example'></a>
### Let's Start a Webscraping Example

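The first thing we need to do is import the relevant packages for our project. These packages are the following:

* **re** - the regular expressions package for matching patterns in strings
* **time** - used to pause between browser actions while pages load
* **pandas** - a fundamental package for formatting and evaluating data
* **webdriver** - the Selenium module that launches and controls the browser which will receive our commands and load webpages
* **Keys** - the keyboard keys (such as PAGE_DOWN) that Selenium can send to the browser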

In [1]:
import re
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

<a id='#browser'></a>
First let's call the **webdriver** (Firefox in my case) and instantiate the web browser

In [10]:
website = 'https://www.rottentomatoes.com/'
driver = webdriver.Firefox()
driver.get(website)
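The browser started above opens a visible window. If you would rather run it headless (with no window at all), Firefox can be launched with its `-headless` command-line flag. The following is a minimal sketch, assuming a Selenium release whose `webdriver.Firefox` accepts an `options` keyword; the function name `make_firefox` is our own and not part of Selenium.

```python
def make_firefox(headless=True):
    """Start Firefox through Selenium, optionally without a visible window."""
    # Imported here so that defining the function does not itself require Selenium
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        # "-headless" is Firefox's own flag for running without a window
        options.add_argument("-headless")
    return webdriver.Firefox(options=options)

# Usage (assumes Firefox and its driver are installed):
# driver = make_firefox()
# driver.get('https://www.rottentomatoes.com/')
# driver.quit()
```

While you are learning, the visible window is often the better choice, because you can watch each click and scroll as it happens.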

We can now tab to the browser we created and see that we are on the Rotten Tomatoes website! The goal of this project is to finish something we were unable to do with Beautiful Soup:

1. Follow links for each sub-section and click on the **"view all"** button to load the remainder of the movies
2. Scroll down the page to allow the HTML to load so that we can grab all of the movies on the page

<a id='#find_elements'></a>

So to start, let's grab each of the **"view all"** buttons on the webpage. Selenium has a number of options for us to locate elements. Just a few of these are:

For grabbing an individual item on a page (returns the first matching item in the HTML tree):

* *find_element_by_id*
* *find_element_by_name*
* *find_element_by_xpath*
* *find_element_by_link_text*
* *find_element_by_partial_link_text*
* *find_element_by_tag_name*
* *find_element_by_class_name*
* *find_element_by_css_selector*

For grabbing multiple items on a page (returns a list of the items selected):

* *find_elements_by_id*
* *find_elements_by_name*
* *find_elements_by_xpath*
* *find_elements_by_link_text*
* *find_elements_by_partial_link_text*
* *find_elements_by_tag_name*
* *find_elements_by_class_name*
* *find_elements_by_css_selector*

Ref: https://selenium-python.readthedocs.io/locating-elements.html 
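The helper methods listed above belong to the Selenium 3 API used throughout this tutorial. Selenium 4 deprecated them and later removed them in favour of a single `find_element(by, value)` / `find_elements(by, value)` pair. If you are on a newer release, the sketch below shows one way to translate between the two. The strategy strings are the values of Selenium's `By` constants (for example, `By.CLASS_NAME` is `"class name"`); `LEGACY_TO_BY`, `find_first` and `find_all` are names made up for this illustration.

```python
# Legacy helper suffix -> locator strategy string (the values of Selenium's By constants)
LEGACY_TO_BY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}

def find_first(driver, legacy_suffix, value):
    """Stand-in for driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_TO_BY[legacy_suffix], value)

def find_all(driver, legacy_suffix, value):
    """Stand-in for driver.find_elements_by_<legacy_suffix>(value); returns a list."""
    return driver.find_elements(LEGACY_TO_BY[legacy_suffix], value)

# Usage: find_first(driver, "id", "Top-Box-Office-view-all")
```

In real Selenium 4 code you would normally import `By` from `selenium.webdriver.common.by` and write `driver.find_element(By.ID, ...)` directly; the table is only meant to make the correspondence explicit.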

In [11]:
# We can find the first "view all" button by searching for its element id
print(driver.find_element_by_id('Top-Box-Office-view-all'))
# We can also find it by searching with XPath
print(driver.find_element_by_xpath("//a[@id='Top-Box-Office-view-all']"))
# Keep a reference to the button so that we can click it next
button = driver.find_element_by_id('Top-Box-Office-view-all')

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c3924412-7c76-8247-beff-cbc79755f475", element="f7b8c2dc-3030-3a44-b982-b764bd2192ff")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c3924412-7c76-8247-beff-cbc79755f475", element="f7b8c2dc-3030-3a44-b982-b764bd2192ff")>


<a id='#click_scroll'></a>
Now we can **click** our button to follow the link to the next page and examine all the data:

In [12]:
button.click()

Now if we look at the webpage, we see that not all the movies are loaded yet. We need to scroll down for all the movies to load. Fortunately for us, Selenium can run JavaScript inside the page, which lets us handle this situation:

Ref: https://stackoverflow.com/questions/42982950/how-to-scroll-down-the-page-till-bottomend-page-in-the-selenium-webdriver/42983332 <br>
Ref: https://stackoverflow.com/questions/22702277/crawl-site-that-has-infinite-scrolling-using-python

In [13]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
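A single `scrollTo` call is enough when one scroll reveals everything. On pages that keep loading content as you go, a common pattern is to scroll repeatedly until the page height stops growing. Here is a minimal sketch of that idea; the function name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are assumptions for illustration, and the right pause length depends on the site.

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the bottom
        last_height = new_height
    return rounds

# Usage: scroll_to_bottom(driver)
```

The `max_rounds` cap is there so that a truly endless feed cannot keep the loop running forever.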

Now that we are on the webpage and have scrolled to the bottom of the page (to make sure all our data is loaded), we will grab the data for all the movies on the page and store it in a **CSV** file

In [14]:
movies = driver.find_elements_by_class_name('mb-movie')

Now let's save our movie data to a CSV file so we can review it

In [15]:
titles = []
dates = []
scores = []
for movie in movies:
    # Look up each field once, then both print and store it
    title = movie.find_element_by_class_name('movieTitle').text
    date = movie.find_element_by_class_name('release-date').text
    print(title)
    print(date)
    titles.append(title)
    dates.append(date)
    try:
        score = movie.find_element_by_class_name('tMeterScore').text
        print(score)
        scores.append(score)
    except Exception:
        # Some movies have no score element, so record a missing value
        scores.append(None)
# Name the CSV file after the page heading so we know which list it holds
page_name = driver.find_element_by_class_name('main-column-item').text
pd.DataFrame({'Title': titles, 'Dates': dates, 'Score': scores}).to_csv(page_name + '.csv')

The House With A Clock In Its Walls
In Theaters Sep 21
67%
A Simple Favor
In Theaters Sep 14
84%
The Nun
In Theaters Sep 7
27%
The Predator
In Theaters Sep 14
34%
Crazy Rich Asians
In Theaters Aug 15
93%
White Boy Rick
In Theaters Sep 14
59%
Peppermint
In Theaters Sep 7
11%
Fahrenheit 11/9
In Theaters Sep 21
77%
The Meg
In Theaters Aug 10
46%
Searching
In Theaters Aug 31
93%
Life Itself
In Theaters Sep 21
11%
Unbroken: Path To Redemption
In Theaters Sep 14
26%
Mission: Impossible - Fallout
In Theaters Jul 27
97%
Christopher Robin
In Theaters Aug 3
71%
Assassination Nation
In Theaters Sep 21
67%
The Wife
In Theaters Aug 17
85%
BlacKkKlansman
In Theaters Aug 10
95%
Hotel Transylvania 3: Summer Vacation
In Theaters Jul 13
59%
Incredibles 2
In Theaters Jun 15
94%
Alpha
In Theaters Aug 17
81%
Operation Finale
In Theaters Aug 29
59%
Mamma Mia! Here We Go Again
In Theaters Jul 20
80%
Ant-Man And The Wasp
In Theaters Jul 6
88%
Lizzie
In Theaters Sep 14
67%
Jurassic World: Fallen Kingdom
In The
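The loop above relies on `try`/`except` to cope with movies that have no score. An alternative is to use the plural `find_elements` call, which returns an empty list rather than raising when nothing matches. The sketch below takes that approach; `text_or_none` and `movie_row` are hypothetical helper names, and `"class name"` is the string value of Selenium's `By.CLASS_NAME` locator strategy.

```python
def text_or_none(parent, class_name):
    """Return the text of the first child with this class, or None if absent."""
    # find_elements returns an empty list instead of raising when nothing matches
    matches = parent.find_elements("class name", class_name)
    return matches[0].text if matches else None

def movie_row(movie):
    """Collect one movie's fields into a dict, one key per CSV column."""
    return {
        "Title": text_or_none(movie, "movieTitle"),
        "Dates": text_or_none(movie, "release-date"),
        "Score": text_or_none(movie, "tMeterScore"),
    }

# Usage: rows = [movie_row(movie) for movie in movies]
#        pd.DataFrame(rows).to_csv('movies.csv')
```

Building one dict per movie keeps the three columns aligned by construction, so a missing field can never shift the lists out of step.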

Finally, let's go back to the previous webpage so that we can scrape the rest of the data from the Rotten Tomatoes website

In [17]:
driver.back()

<a id='#scrolling_example'></a>

### Selenium Scrolling

Most webpages that we need to gather data from are not created to assist us in our web crawling tasks. As such, properly scrolling a website is frequently needed functionality. Here we have **example** code that demonstrates how one might scroll to the bottom of a webpage that loads dynamically.

We'll demonstrate on a supermarket website with digital coupons. Take a look at the website in your browser and inspect the elements. There are several aspects of the website that need to be accounted for:

1. The "Load More" button
2. Scrolling to the bottom of the page

We'll handle each of these...

In [21]:
website_scroll = 'https://www.publix.com/savings/coupons/digital-coupons'
driver_scroll = webdriver.Firefox()
driver_scroll.get(website_scroll)

Did the entire page load? If not, how many objects loaded?

As we can see from the number of objects collected, only a subset of the coupons is currently loaded. In order to get all the coupon info, we'll need to scroll down and click the **"Load More"** button, and then continue to scroll until the entire page is loaded.

In [22]:
len(driver_scroll.find_elements_by_xpath("//div[@class='dc-card']"))

30

First let's define the **"Load More"** object. We locate it using XPath, and then we'll execute a JavaScript command to scroll down the page until that element is in view.

In [23]:
element = driver_scroll.find_element_by_xpath("//button[@class='btn btn-large js-btnLoadCoupons']")
driver_scroll.execute_script("arguments[0].scrollIntoView();", element)

Next click the **element** to load the rest of the objects on the page

In [24]:
element.click()

Now that we've clicked **Load More**, let's see how many objects there are. Lots more than 30! Now we'll scroll to the bottom of the page and collect the data for all of these coupons.

In [25]:
len(driver_scroll.find_elements_by_xpath("//div[@class='dc-card']"))

153
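The count only jumps once the page has finished loading the new batch of coupons, so checking it immediately after the click can still return the old number. One way to handle this is to poll until the count increases or a timeout expires. The helper below is a small standard-library sketch of that idea; `wait_for_more` and its parameters are names invented here. Selenium also ships its own `WebDriverWait` class for this kind of waiting.

```python
import time

def wait_for_more(count_fn, previous, timeout=10.0, poll=0.5):
    """Poll count_fn() until it exceeds `previous` or `timeout` seconds pass.

    Returns the last count observed.
    """
    deadline = time.monotonic() + timeout
    count = count_fn()
    while count <= previous and time.monotonic() < deadline:
        time.sleep(poll)
        count = count_fn()
    return count

# Usage with the objects defined in this section:
# count_cards = lambda: len(driver_scroll.find_elements_by_xpath("//div[@class='dc-card']"))
# before = count_cards()
# element.click()
# after = wait_for_more(count_cards, before)
```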

The **Keys** class in Selenium provides keyboard keys such as RETURN, F1, ALT, etc.

With this code, I am setting the new object to the next sibling of the old object (a coupon on the page). While there is still a new object to be set, I call the element's **"send_keys"** method and pass it **Keys.PAGE_DOWN**.

The result is that I step through each coupon object on the page, scrolling as I go, and stop once there are no more objects.

Ref: https://selenium-python.readthedocs.io/getting-started.html 

In [26]:
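# Start from the first coupon card on the page
dc_card = driver_scroll.find_element_by_xpath("//div[@class='dc-card']")

while True:
    try:
        print(dc_card)
        # Set the new dc_card to the sibling of the previous dc-card.
        # The XPath must say exactly which tag and class the sibling should have.
        dc_card = dc_card.find_element_by_xpath(".//following-sibling::div[@class='dc-card']")
        # Page down from the new card so that the following coupons can load
        dc_card.send_keys(Keys.PAGE_DOWN)
        time.sleep(3)
    except Exception:
        # No further sibling card was found (or the card could not receive keys), so stop
        break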

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="385e4b5c-c598-514f-87f5-5b60e657ca7e")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="c30b70a1-2629-b448-bbd5-105b1da68381")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="65c9931e-78e7-f543-83ba-ead4e97f286d")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="1aad7dba-9cf0-0a4e-94ed-6f7e99ad8ef4")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="95749297-3d7b-5444-9e28-71b27421ed31")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="56e4fc60-73f8-354e-9324-52bd129acfa2")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="0307b124-5f21-ab4f-b3ff-d2cb228eaf18")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="d8cfeb26-648d-9347-b31c-7a067dec9fad")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="2aac2f27-a6f9-f94f-9c09-5364b33af1a9")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="3e35bfaa-a5e7-704d-a24f-265b06d2c7ca")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="03d35c9c-b2eb-2242-bdf8-bd5cda0806c6")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="71d62e0c-6f6f-0841-85e8-488ce3174817")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="887f0737-1637-7f4c-82a3-f1800616bebf")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="393fa428-317a-6049-bdee-ee68af0cbb44")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="b4265ba9-638b-024b-94af-4ad748a323f8")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="9ea427e9-6a76-4140-8bf9-0c4d98db0089")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="1098615d-d5ac-9141-8dcd-3f4f6d018e46")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61241d9-86d3-2a43-8cae-1b207f1339a7", element="5e1e8bd8-b38c-c444-902d-e25ec4c8b020")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e61
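The sibling walk can also be written without using an exception to end the loop. The generator below is a sketch of that variant: it asks for the nearest following sibling with the plural `find_elements` call, which returns an empty list when there is none. The name `iter_cards` and the `max_cards` cap are assumptions for illustration, and like the loop above it assumes that all coupon cards are siblings under the same parent.

```python
def iter_cards(first_card, max_cards=1000):
    """Yield first_card and then each following sibling card until none is left."""
    # [1] selects only the nearest following sibling, so we advance one card at a time
    next_xpath = "./following-sibling::div[@class='dc-card'][1]"
    card = first_card
    for _ in range(max_cards):  # hard cap so the walk can never run forever
        yield card
        following = card.find_elements("xpath", next_xpath)
        if not following:
            return  # find_elements gives [] when no sibling matches
        card = following[0]

# Usage: for card in iter_cards(dc_card): ... scroll to or read each card here ...
```

Separating the walking from the scrolling makes it easy to swap in a different action per card, such as reading its text.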

<a id='#links'></a>

### Iterate Through All Links

Now that we've demonstrated how to **point**, **click** and **scroll** down a webpage, we'll put all these pieces together: we'll iterate through each subsection of the Rotten Tomatoes website, then point, click, scroll and collect the data on each of these pages.

Let's take it from the top!

#### Call the website and create a driver instance like we did previously...

In [2]:
website = 'https://www.rottentomatoes.com/'
driver = webdriver.Firefox()
driver.get(website)

#### Now let's scroll down the page to make sure all our links are loaded

In [3]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#### Grab all the "View All" objects

In [6]:
buttons = driver.find_elements_by_class_name('clickForMore')

Each of the **"View All"** objects has a different **id** attribute (1). However, if we look at the parent **div** node (2), we see that they all share the **class name "clickForMore"**. So what we'll do is create a list of all the **"clickForMore"** objects, and then select the first child node of each.

![title](ViewAll.png)

* Once we've selected the correct child nodes, we'll extract the **href** attribute, which gives us the link to follow for our data analysis.

* Using a **try-except** statement, we'll search for the **"clickForMore"** links whose pages our scraper can parse, and then export that data to CSV files named after the page
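The link-extraction step described above can be sketched on its own, before any pages are opened. A home page may link to the same section more than once, and a container's first child may not carry a link at all, so this version skips empty values and duplicates. `collect_links` is a hypothetical helper name, and `"xpath"` is the string value of Selenium's `By.XPATH` locator strategy.

```python
def collect_links(buttons):
    """Return the unique, non-empty href values of each button's first child element."""
    links = []
    for button in buttons:
        children = button.find_elements("xpath", "./*")
        if not children:
            continue  # nothing inside this container
        href = children[0].get_attribute("href")
        # Skip children without a link and links we have already seen
        if href and href not in links:
            links.append(href)
    return links

# Usage: for website in collect_links(buttons): ... open and parse each page ...
```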

In [9]:
import os
os.getcwd()

'/Users/zacharyescalante/Desktop/Github/Zach-Escalante-Code/WebScrapingTutorial'

In [10]:
for button in buttons:
    # Extract the "href" attribute from the first child of the "clickForMore" element
    website = button.find_element_by_xpath(".//*").get_attribute("href")
    print(website)
    # Open the link in a separate browser so the home page driver (and its buttons) stays valid
    page_driver = webdriver.Firefox()
    page_driver.get(website)
    time.sleep(5)
    # If the page has a 'main-column-item' heading, parse its movie list
    try:
        page_name = page_driver.find_element_by_class_name('main-column-item').text
        print(page_name)
        time.sleep(5)
        titles = []
        dates = []
        scores = []
        movies = page_driver.find_elements_by_class_name('mb-movie')
        for movie in movies:
            titles.append(movie.find_element_by_class_name('movieTitle').text)
            dates.append(movie.find_element_by_class_name('release-date').text)
            try:
                scores.append(movie.find_element_by_class_name('tMeterScore').text)
            except Exception:
                # No score available for this movie
                scores.append(None)
        # The heading text is used as the CSV file name so that we know what data we're looking at
        pd.DataFrame({'Title': titles, 'Dates': dates, 'Score': scores}).to_csv(page_name + '.csv')
    # If the page has no 'main-column-item', or its layout differs, skip it
    except Exception:
        print("pass")
    page_driver.quit()
    time.sleep(5)

https://editorial.rottentomatoes.com/total-recall/
pass
https://editorial.rottentomatoes.com/rt-hubs/
pass
https://editorial.rottentomatoes.com/news/
pass
https://www.rottentomatoes.com/browse/opening/
OPENING THIS WEEK
https://www.rottentomatoes.com/browse/in-theaters/
TOP BOX OFFICE
WEEKEND BOX OFFICE EARNINGS
https://www.rottentomatoes.com/browse/upcoming/
COMING SOON
https://www.rottentomatoes.com/browse/tv-list-1
NEW TV TONIGHT
pass
https://www.rottentomatoes.com/browse/tv-list-2
MOST POPULAR TV ON RT
pass
https://www.rottentomatoes.com/browse/top-dvd-streaming/
TOP DVD & STREAMING
https://www.rottentomatoes.com/browse/cf-in-theaters/
CERTIFIED FRESH MOVIES
WEEKEND BOX OFFICE EARNINGS
https://editorial.rottentomatoes.com/video-interviews/
pass
https://editorial.rottentomatoes.com/news/
pass
https://editorial.rottentomatoes.com/publications/
pass
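Some of the headings printed above span two lines (for example "TOP BOX OFFICE" followed by "WEEKEND BOX OFFICE EARNINGS") or contain characters such as "&", which makes them awkward as file names. A small sanitizing step is one way to tidy this up. The sketch below is only a suggestion; `safe_filename` is a name invented here, and the choice to keep just the first line of the heading is an assumption about what is most useful.

```python
import re

def safe_filename(heading, extension=".csv"):
    """Turn a page heading into a single-line, filesystem-friendly file name."""
    stripped = heading.strip()
    # Keep only the first line of a multi-line heading
    first_line = stripped.splitlines()[0] if stripped else "untitled"
    # Replace any run of characters other than letters, digits, dash or underscore with "_"
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", first_line).strip("_")
    return (cleaned or "untitled") + extension

# Usage: pd.DataFrame(...).to_csv(safe_filename(page_name))
```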


#### Congratulations! 

You've covered some important concepts for webscraping with **Selenium**, including:

* How to install/run headless browsers to automate point-click tasks
* Extracting web elements using Selenium
* Scrolling and clicking through a webpage
* Handling non-trivial websites and scrolling

All of these concepts will be very useful. Our next lecture will tie these web-scraping building blocks together with a **faster, cleaner framework (Scrapy)** to run these processes in **parallel**, and provide us the autonomy to run scrapers using all of these tools (plus many more) at **break-neck speed**!