# Selenium Webscraping Tutorial

#### Table of Contents

1. <a href='##Introduction'>The Basics of Webscraping With Selenium</a> <br>
> 1.a <a href='##package'>The Selenium package</a> <br>
2. <a href='##example'>Webscraping Example</a> <br>
> 2.a <a href='##browser'>Selenium headless browser</a> <br>
> 2.b <a href='##find_elements'>Find elements using the Selenium Library</a> <br>
> 2.c <a href='##click_scroll'>Clicking, scrolling, going back</a> <br>
3. <a href='##scrolling_example'>Selenium scrolling example</a> <br>
4. <a href='##links'>Iterating through all movie links</a> <br>
> 4.a <a href='##bs4_html'>Creating the right HTML objects to parse</a> <br>
> 4.b <a href='##bs4_find_html'>Creating nested HTML objects using the **find( )** and **find_all( )** methods</a> <br>
> 4.c <a href='##pulling_tags'>Parsing the individual movies and pulling the data</a> <br>
> 4.d <a href='##storing_data'>Storing the data and exporting to CSV</a> <br>

<a id='#Introduction'></a>

### The Basics of Webscraping with Selenium

Selenium automates web browsing. Routine point-and-click tasks on websites can (and should) be handled programmatically, and Selenium provides us this capability. There are many use cases for Selenium that BeautifulSoup cannot handle. Programs that require scrolling to the bottom of a webpage, pointing and clicking on a button, or passing information into a search box are all perfect use cases for Selenium. Selenium has its own HTML parsing functionality (which we'll examine), but it can also be used **in conjunction** with BeautifulSoup: Selenium loads the webpage, and BeautifulSoup is then called to parse it. We will examine all of these use cases and more in the following demo.

We'll be working in Python 3 for this tutorial, although Beautiful Soup also runs on Python 2 with some simple adjustments to your code. Let's take a look at an example and see if we can get you up and running with parsing your own web pages!

(1) https://www.seleniumhq.org/

<a id='#package'></a>

### The Selenium package

The Selenium Webdriver allows us to fetch webpages which we can control through remote operations (via the Python code which we'll write). The first steps in running Selenium are to

* **Install Selenium (in your environment)**
* **Install Selenium Web Driver**

(http://selenium-python.readthedocs.io/installation.html)

Follow the directions at the link above and make sure that your web browser corresponds to the supported version listed in the WebDriver download links. The latest version should be clearly denoted there.

<a id='#example'></a>
### Let's Start a Webscraping Example

The first thing we need to do is import the relevant packages for our project. These packages are the following: 

* **re** - the regular expressions package for evaluating strings 
* **pandas** - this is a fundamental package for formatting and evaluating data 
* **webdriver** - this is the Selenium module that launches and controls the browser which will receive our commands and load webpages

In [257]:
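import re
import time  # used for the pauses between page loads later on
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # used for PAGE_DOWN later on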

<a id='#browser'></a>
First let's call the **webdriver** (Firefox in my case) and instantiate the web browser

In [264]:
website = 'https://www.rottentomatoes.com/'
driver = webdriver.Firefox()
driver.get(website)
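Each call to `webdriver.Firefox()` opens a real browser process that stays open until `quit()` is called on the driver. In a notebook that is convenient, because the browser survives between cells. If the same steps are moved into a script, one option is to wrap the driver in a context manager so that the browser is closed even when an error occurs. This is a minimal sketch, not part of the original notebook; `managed_driver` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver by calling `factory()` and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises an exception
        driver.quit()


# Usage (assumes Selenium and a Firefox driver are installed):
# from selenium import webdriver
# with managed_driver(webdriver.Firefox) as driver:
#     driver.get('https://www.rottentomatoes.com/')
```

The helper only relies on the driver having a `quit()` method, so it works the same way for other browsers.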

We can now tab to the browser we created and see that we are on the Rotten Tomatoes website! The goal of this project is going to be to finish something we were unable to do with Beautiful Soup:

1. Follow links for each sub-section and click on the **"view all"** button to load the remainder of the movies
2. Scroll down the page to allow the HTML to load so that we can grab all of the movies on the page

<a id='#find_elements'></a>

So to start, let's grab each of the **"view all"** buttons on the webpage. Selenium has a number of options for us to locate elements. Just a few of these are:

For grabbing individual items on a page (the first item in the HTML tree):

* *find_element_by_id*
* *find_element_by_name*
* *find_element_by_xpath*
* *find_element_by_link_text*
* *find_element_by_partial_link_text*
* *find_element_by_tag_name*
* *find_element_by_class_name*
* *find_element_by_css_selector*

For grabbing multiple items on a page (returns a list of the items selected):

* *find_elements_by_id*
* *find_elements_by_name*
* *find_elements_by_xpath*
* *find_elements_by_link_text*
* *find_elements_by_partial_link_text*
* *find_elements_by_tag_name*
* *find_elements_by_class_name*
* *find_elements_by_css_selector*

Ref: https://selenium-python.readthedocs.io/locating-elements.html 
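The `find_element_by_*` helpers listed above are the Selenium 3 names used throughout this tutorial. Selenium 4 deprecated them, and later 4.x releases removed them in favor of `find_element(by, value)` and `find_elements(by, value)`, where `by` is normally a constant from `selenium.webdriver.common.by.By`. The sketch below is not part of the original notebook. It shows a hypothetical helper, `find_first`, that works with either generation of the API. To stay free of imports it uses the plain strings that the `By` constants stand for (for example, `By.CLASS_NAME` is the string `'class name'`).

```python
# Map each old helper suffix to the locator string expected by find_element().
# These strings are the values of the corresponding selenium `By` constants.
LOCATOR_STRINGS = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find_first(node, strategy, value):
    """Return the first match under `node` (a driver or an element).

    `strategy` is the suffix of an old helper name, e.g. 'id' or 'class_name'.
    """
    if strategy not in LOCATOR_STRINGS:
        raise ValueError("unknown locator strategy: " + strategy)
    old_helper = getattr(node, "find_element_by_" + strategy, None)
    if old_helper is not None:
        # Selenium 3 style, as used in this tutorial
        return old_helper(value)
    # Selenium 4 style: find_element(by, value)
    return node.find_element(LOCATOR_STRINGS[strategy], value)


# Usage with a live driver (assumes the page from the cells above is loaded):
# button = find_first(driver, "id", "Top-Box-Office-view-all")
```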

In [265]:
# We can find the first "view all" button by searching for its element id
print(driver.find_element_by_id('Top-Box-Office-view-all'))
# We can also find the same element with an XPath expression
print(driver.find_element_by_xpath("//a[@id='Top-Box-Office-view-all']"))
# Keep a reference to the button so that we can click it in the next cell
button = driver.find_element_by_id('Top-Box-Office-view-all')

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="07c16727-9eba-dc47-8357-dbf79053282c", element="01e53c78-0453-084b-808a-b601f41a9d16")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="07c16727-9eba-dc47-8357-dbf79053282c", element="09d12382-a635-1741-99c4-6247a665156e")>


<a id='#click_scroll'></a>
Now we can **click** our button to follow the link to the next page and examine all the data:

In [266]:
button.click()

Now if we look at the webpage, we see that not all the movies are loaded yet. We need to scroll down for all the movies to load. Fortunately for us, Selenium can run JavaScript in the page through **execute_script**, which lets us scroll programmatically:

Ref: https://stackoverflow.com/questions/42982950/how-to-scroll-down-the-page-till-bottomend-page-in-the-selenium-webdriver/42983332 <br>
Ref: https://stackoverflow.com/questions/22702277/crawl-site-that-has-infinite-scrolling-using-python

In [267]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
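A single `scrollTo` call only reaches the bottom of whatever has loaded so far. On pages that keep appending content, a common pattern is to repeat the scroll until `document.body.scrollHeight` stops growing. The following is a minimal sketch of that loop, not code from the original notebook. The helper name `scroll_to_bottom` and the `pause` and `max_rounds` defaults are illustrative assumptions.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops changing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended, so we are at the real bottom
            return rounds
        last_height = new_height
    # Give up after max_rounds so an endless feed cannot loop forever
    return max_rounds


# Usage with a live driver:
# scroll_to_bottom(driver)
```

The `max_rounds` cap is a design choice: without it, a page with an endless feed would keep the loop running indefinitely.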

Now that we are on the webpage and have scrolled to the bottom of the page (to make sure all our data is loaded), we will grab the data for all the movies on the page and store it in a **CSV** file

In [268]:
movies = driver.find_elements_by_class_name('mb-movie')

Now let's save our movie data to a CSV file so we can review it

In [269]:
titles = []
dates = []
scores = []
for movie in movies:
    # Look each field up once and reuse the text for printing and storing
    title = movie.find_element_by_class_name('movieTitle').text
    date = movie.find_element_by_class_name('release-date').text
    print(title)
    print(date)
    titles.append(title)
    dates.append(date)
    try:
        score = movie.find_element_by_class_name('tMeterScore').text
        print(score)
    except Exception:
        # Some movies have no score element yet
        score = None
    scores.append(score)
# Name the CSV file after the heading of the page
page_title = driver.find_element_by_class_name('main-column-item').text
pd.DataFrame({'Title': titles, 'Dates': dates, 'Score': scores}).to_csv(page_title + '.csv')

Crazy Rich Asians
In Theaters Aug 15
93%
The Meg
In Theaters Aug 10
47%
Mile 22
In Theaters Aug 17
21%
Mission: Impossible - Fallout
In Theaters Jul 27
97%
Alpha
In Theaters Aug 17
82%
Christopher Robin
In Theaters Aug 3
70%
BlacKkKlansman
In Theaters Aug 10
96%
Slender Man
In Theaters Aug 10
8%
Hotel Transylvania 3: Summer Vacation
In Theaters Jul 13
60%
Mamma Mia! Here We Go Again
In Theaters Jul 20
80%
Equalizer 2
In Theaters Jul 20
50%
Ant-Man And The Wasp
In Theaters Jul 6
87%
The Spy Who Dumped Me
In Theaters Aug 3
50%
Incredibles 2
In Theaters Jun 15
93%
Jurassic World: Fallen Kingdom
In Theaters Jun 22
50%
Dog Days
In Theaters Aug 8
60%
Teen Titans Go! To The Movies
In Theaters Jul 27
91%
Eighth Grade
In Theaters Aug 3
98%
Three Identical Strangers
In Theaters Jun 29
96%
Skyscraper
In Theaters Jul 13
47%
Death Of A Nation
In Theaters Aug 3
0%
The Darkest Minds
In Theaters Aug 3
18%
Sorry To Bother You
In Theaters Jul 13
94%
Won't You Be My Neighbor?
In Theaters Jun 8
99%
Puzzle
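
The cell above uses the page heading directly as the file name and stores scores as strings such as `93%`. Page headings can contain spaces, ampersands, or line breaks, which are awkward in file names, and percent strings cannot be sorted or averaged as numbers. The sketch below shows one way to clean both values with the `re` package. `safe_filename` and `parse_score` are hypothetical helpers, not part of the original notebook.

```python
import re


def safe_filename(heading, extension=".csv"):
    """Turn a page heading into a lowercase, underscore-separated file name."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_").lower()
    return (stem or "untitled") + extension


def parse_score(text):
    """Convert a score such as '93%' to the integer 93; return None if absent."""
    if text is None:
        return None
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None


print(safe_filename("TOP DVD & STREAMING"))  # top_dvd_streaming.csv
print(parse_score("93%"))                    # 93
```

With these helpers, the scores could be passed through `parse_score` before building the DataFrame, and the heading through `safe_filename` before calling `to_csv`.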

Finally, let's go back to the previous webpage so that we can scrape the rest of the data from the Rotten Tomatoes website

In [None]:
driver.back()

<a id='#scrolling_example'></a>

### Selenium Scrolling

Most webpages that we need to gather data from are not created to assist us in our web crawling tasks. As such, properly scrolling a website is frequently needed functionality. Here we have **example** code that demonstrates how one might scroll to the bottom of a webpage that loads dynamically.

We'll demonstrate on a supermarket website with digital coupons. Take a look at the website in your browser and inspect the elements. There are several aspects of the website that need to be accounted for:

1. The "Load More" button
2. Scrolling to the bottom of the page

We'll handle each of these...

In [91]:
website_scroll = 'https://www.publix.com/savings/coupons/digital-coupons'
driver_scroll = webdriver.Firefox()
driver_scroll.get(website_scroll)

Did the entire page load? If not, how many objects loaded?

As the count below shows, only a subset of all the coupons is currently loaded. In order to get all the coupon info, we'll need to scroll down and click the **"Load More"** button, and then continue to scroll until the entire page is loaded.

In [93]:
len(driver_scroll.find_elements_by_xpath("//div[@class='dc-card']"))

30

First let's define the **"Load More"** object. We locate it with an XPath expression, and then we'll execute a JavaScript command to scroll down the page until that element is in view.

In [94]:
element = driver_scroll.find_element_by_xpath("//button[@class='btn btn-large js-btnLoadCoupons']")
driver_scroll.execute_script("arguments[0].scrollIntoView();", element)

Next click the **element** to load the rest of the objects on the page

In [95]:
element.click()

Now that we've clicked **Load More**, let's see how many objects there are. Lots more than 30! Now we'll scroll to the bottom of the page and collect the data for all of these coupons.

In [None]:
len(driver_scroll.find_elements_by_xpath("//div[@class='dc-card']"))
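
Clicking **Load More** starts a request that takes some time to finish, so counting the cards immediately afterwards can still return the old number. Selenium ships `WebDriverWait` for this kind of waiting. The sketch below is a simplified standard-library stand-in, not Selenium's own API, that polls a condition until it holds. The helper name `wait_until` and the `timeout` and `poll` defaults are illustrative assumptions.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Usage with the notebook's objects (assumes `driver_scroll` and `element`
# from the cells above, with `before` counted prior to the click):
# cards_xpath = "//div[@class='dc-card']"
# before = len(driver_scroll.find_elements_by_xpath(cards_xpath))
# element.click()
# wait_until(lambda: len(driver_scroll.find_elements_by_xpath(cards_xpath)) > before)
```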

The **Keys** class in Selenium provides constants for keyboard keys such as RETURN, F1, and ALT.

With this code, I set the new object to the next coupon after the old object on the page. While there is a new object to be set, I call that element's **send_keys** method and pass it **Keys.PAGE_DOWN** as the argument.

The result will be that while I step through each coupon object on the page, I only scroll until there are no more objects, at which point I stop. 

Ref: https://selenium-python.readthedocs.io/getting-started.html 

In [83]:
dc_card = driver_scroll.find_element_by_xpath("//div[@class='dc-card']")

while True:
    try:
        print(dc_card)
        # Move on to the next sibling of the current dc-card.
        # The tag and class of the sibling must be spelled out in the XPath.
        dc_card = dc_card.find_element_by_xpath(".//following-sibling::div[@class='dc-card']")
        dc_card.send_keys(Keys.PAGE_DOWN)
        time.sleep(3)
    except Exception:
        # No further sibling (or the card cannot receive keys): stop scrolling
        break

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="21350090-0d02-464c-a7be-e327517f3961")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="ddc50f7e-5158-3a43-bd38-83c4c42acb85")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="5ff6c546-78cc-a74a-9bd8-5e3c5eb8c7cc")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="b392a82f-e2a1-9f42-95ff-6f4bd1d4b8dd")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="d8a39bce-1ed2-bd41-b756-e9b8c70df001")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="73394f28-b6c1-094b-a1f4-f2715c86a6cb")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="788a9fc2-d135-854e-8731-436df18e9e58")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="eee8ca92-7f52-4645-b029-d17cc56d9d86")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="d0506278-5ec2-e14d-bd12-3d68ae1e359e")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="421a9edb-3fba-f449-a813-585c7d3a0b70")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="adf25d94-7731-ea46-b794-2c615ef4dca0")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="dcbc1744-1cd8-f340-abe3-3a33a51569f1")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="17cfb784-0564-ce45-bcd8-418feee2d195")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="a8aba360-6588-7044-89e5-d915613f5de9")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="83253214-3701-3041-bc5d-72d002d2adac")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="22c02551-892c-394a-b0ef-f199dc92a38a")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="47b98cf0-bdec-ef46-a033-afe27a7e3e2b")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="26f400ac-1172-aa4d-83af-525b59f191f5")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="490fac95-68e1-7c47-ab36-5f36f3277900")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="f80488a7-9ade-d945-acc3-95a95ba25e3e")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="dd7cb23d-6137-244c-89a3-0ec8fd05ea10")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="89e95813-5682-2f40-b238-34d802c49c6b")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="093f9cff-a9e6-cf48-9cd1-b9ef5b80f80e")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e2aa76-f4a3-0b4c-9edc-40fa62d3b55c", element="e0dc748d-02bc-a548-a69e-06d15afe637a")>
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a0e

<a id='#links'></a>

### Iterate Through All Links

Now that we've demonstrated how to **point**, **click** and **scroll** down a webpage, we will put all these pieces together to iterate through each subsection of the Rotten Tomatoes website: point, click, scroll, and collect the data on each of these pages.

Let's take it from the top!

#### Call the website and create a driver instance like we did previously...

In [297]:
website = 'https://www.rottentomatoes.com/'
driver = webdriver.Firefox()
driver.get(website)

#### Now let's scroll down the page to make sure all our links are loaded

In [298]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#### Grab all the "View All" objects

In [299]:
buttons = driver.find_elements_by_class_name('clickForMore')

Each of the "View All" objects has a different **id**. However, if we look at the parent **div** node, we see that all of them share the **class name "clickForMore"**. So what we'll do is create a list of all the **"clickForMore"** objects, and then select the first child node of each.

In [301]:
# Collect every link target first, while the home page is still loaded in `driver`
links = [button.find_element_by_xpath(".//*").get_attribute("href") for button in buttons]
for website in links:
    print(website)
    # Open each sub-page in its own browser so the home-page driver is not overwritten
    page_driver = webdriver.Firefox()
    page_driver.get(website)
    time.sleep(5)
    try:
        page_title = page_driver.find_element_by_class_name('main-column-item').text
        print(page_title)
        time.sleep(5)
        titles = []
        dates = []
        scores = []
        movies = page_driver.find_elements_by_class_name('mb-movie')
        for movie in movies:
            titles.append(movie.find_element_by_class_name('movieTitle').text)
            dates.append(movie.find_element_by_class_name('release-date').text)
            try:
                scores.append(movie.find_element_by_class_name('tMeterScore').text)
            except Exception:
                scores.append(None)
        pd.DataFrame({'Title': titles, 'Dates': dates, 'Score': scores}).to_csv(page_title + '.csv')
    except Exception:
        # Pages without the expected movie markup end up here
        print("pass")
    page_driver.quit()
    time.sleep(5)

https://www.rottentomatoes.com/about
pass
https://editorial.rottentomatoes.com/total-recall/
pass
https://editorial.rottentomatoes.com/rt-hubs/
pass
https://editorial.rottentomatoes.com/news/
pass
https://www.rottentomatoes.com/browse/opening/
OPENING THIS WEEK
https://www.rottentomatoes.com/browse/in-theaters/
TOP BOX OFFICE
WEEKEND BOX OFFICE EARNINGS
https://www.rottentomatoes.com/browse/upcoming/
COMING SOON
https://www.rottentomatoes.com/browse/tv-list-1
CERTIFIED FRESH TV
pass
https://www.rottentomatoes.com/browse/tv-list-2
MOST POPULAR TV ON RT
pass
https://www.rottentomatoes.com/browse/top-dvd-streaming/
TOP DVD & STREAMING
https://www.rottentomatoes.com/browse/cf-in-theaters/
CERTIFIED FRESH MOVIES
WEEKEND BOX OFFICE EARNINGS
https://editorial.rottentomatoes.com/video-interviews/
pass
https://editorial.rottentomatoes.com/news/
pass
https://editorial.rottentomatoes.com/publications/
pass
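
In the output above, only the `www.rottentomatoes.com/browse/` pages produced a page heading. The `about` and `editorial` links fell straight through to `pass`, after a browser had been opened and closed for each of them. One way to avoid those wasted page loads is to filter the collected links first. The sketch below assumes, based on this output only, that the movie and TV lists live under the `/browse/` path. `is_browse_link` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_browse_link(url, host="www.rottentomatoes.com", prefix="/browse/"):
    """Return True for links on `host` whose path starts with `prefix`."""
    if not url:
        return False  # get_attribute("href") can return None
    parts = urlparse(url)
    return parts.netloc == host and parts.path.startswith(prefix)


# A few of the links printed in the output above
links = [
    "https://www.rottentomatoes.com/about",
    "https://editorial.rottentomatoes.com/news/",
    "https://www.rottentomatoes.com/browse/opening/",
    "https://www.rottentomatoes.com/browse/tv-list-1",
]
browse_links = [link for link in links if is_browse_link(link)]
print(browse_links)
```

Only the two `/browse/` URLs remain in `browse_links`, so the loop would open a browser for those pages alone.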
