<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping OpenTable With Selenium: Guided Lab

_Authors: Joseph Nelson (DC)_

---

> *Note: This is intended to be an instructor-guided lab.*


In today's code-along lab, we'll build a scraper using urllib and Beautiful Soup. We'll also remedy some of the pitfalls of automated scraping by using Selenium, a tool that automates a real web browser.

You'll be scraping OpenTable's Washington, D.C. listings. We're interested in knowing the restaurants' **name, location, price, and how many people booked it that day.**

OpenTable provides all of this information on this page: http://www.opentable.com/washington-dc-restaurant-listings.

### 1) Inspect the elements of this page to confirm we can find all of the information we're interested in.

### 2) Use `urllib` and `BeautifulSoup` to read the contents of the HTML.

In [1]:
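from bs4 import BeautifulSoup
# urlopen lives in different modules in Python 2 and Python 3.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

In [2]:
# Set the URL we want to visit.
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# Visit the URL and grab the HTML of the page.
html = urlopen(url).read()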

### 3) Print out a fraction of the HTML. What's in it?

In [3]:
len(html)

675305

In [4]:
html[0:1000]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Washington, D.C. Area Restaurants List | OpenTable</title>  <meta  name="description" content="Find Washington, D.C. Area restaurants. Search by location, cuisine, or price to refine restaurant results in the Washington, D.C. Area area." > </meta>  <meta  name="robots" content="noindex" > </meta><link  rel="canonical" href="https://www.opentable.com/washington-dc-restaurant-listings" > </link>      <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-48.pn

In [5]:
# This is the raw HTML from the page.

### 4) Use Beautiful Soup to convert the raw HTML into a soup object.

In [6]:
# We need to convert this into a soup object.
soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

### 5) Extract the name of each restaurant.

First, let's find each restaurant name listed on the page we've loaded. How do we find where each restaurant's name sits on the page?

> *Hint: We need to know where the restaurant element is housed in the **HTML**.*

**5.A) See if you can find the restaurant name. Keep in mind that there are many restaurants loaded on the page.**

In [7]:
# Print the restaurant names.
for n in soup.find_all('span', {'class': 'rest-row-name-text'})[0:20]:
    print(n)

<span class="rest-row-name-text">Ruffino's - Arlington</span>
<span class="rest-row-name-text">Joe's Place Pizza and Pasta</span>
<span class="rest-row-name-text">Farmers Fishers Bakers</span>
<span class="rest-row-name-text">Filomena Ristorante</span>
<span class="rest-row-name-text">Ambar - Arlington</span>
<span class="rest-row-name-text">Rasika West End</span>
<span class="rest-row-name-text">Blue Duck Tavern</span>
<span class="rest-row-name-text">Tupelo Honey - Arlington</span>
<span class="rest-row-name-text">BlackSalt</span>
<span class="rest-row-name-text">Il Canale</span>
<span class="rest-row-name-text">Kapnos Taverna Arlington</span>
<span class="rest-row-name-text">Sequoia</span>
<span class="rest-row-name-text">Green Pig Bistro</span>
<span class="rest-row-name-text">Bistro Aracosia</span>
<span class="rest-row-name-text">Et Voila</span>
<span class="rest-row-name-text">Lyon Hall</span>
<span class="rest-row-name-text">The Liberty Tavern</span>
<span class="rest-row-name-

**5.B) Create a list of _only_ the restaurant names (no tags).**


In [8]:
r_names = []
# For each element you find, append the restaurant name to the list.
for entry in soup.find_all('span', {'class': 'rest-row-name-text'}):
    r_names.append(entry.renderContents())

In [9]:
r_names[0:20]

["Ruffino's - Arlington",
 "Joe's Place Pizza and Pasta",
 'Farmers Fishers Bakers',
 'Filomena Ristorante',
 'Ambar - Arlington',
 'Rasika West End',
 'Blue Duck Tavern',
 'Tupelo Honey - Arlington',
 'BlackSalt',
 'Il Canale',
 'Kapnos Taverna Arlington',
 'Sequoia',
 'Green Pig Bistro',
 'Bistro Aracosia',
 'Et Voila',
 'Lyon Hall',
 'The Liberty Tavern',
 'Chez Billy Sud',
 'Nobu DC',
 'Caf\xc3\xa9 Milano']
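The last entry above, `'Caf\xc3\xa9 Milano'`, is not corrupted. `renderContents()` returns encoded bytes, and `\xc3\xa9` is the UTF-8 encoding of "é". A minimal sketch of recovering readable text, using a hard-coded byte string in place of the scraped value:

```python
# Stand-in for one value returned by renderContents(); hard-coded for illustration.
raw_name = b'Caf\xc3\xa9 Milano'

# Decode the UTF-8 bytes into a text string.
name = raw_name.decode('utf-8')
print(name)
```

In Beautiful Soup 4, reading `entry.text` instead of calling `renderContents()` gives already-decoded text, which avoids this step.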

### 6) Repeat this process for location.

For example, barmini by Jose Andres is located in "Penn Quarter," as listed in our search results.

In [10]:
# First, see if you can identify the location for all elements — print it out.
print(soup.find_all('span', {'class': 'rest-row-meta--location rest-row-meta-text'})[0:5])

[<span class="rest-row-meta--location rest-row-meta-text">Arlington</span>, <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>, <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>, <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>, <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>]


In [11]:
r_loc = []
for entry in soup.find_all('span', {'class': 'rest-row-meta--location rest-row-meta-text'}):
    r_loc.append(entry.renderContents())
    
r_loc[0:10]

['Arlington',
 'Arlington',
 'Georgetown',
 'Georgetown',
 'Arlington',
 'West End',
 'West End',
 'Arlington',
 'Palisades Northwest',
 'Georgetown']

### 7) Get the price for each restaurant.

The price is the number of dollar signs on a scale of one to four for each restaurant. We'll follow the same process we used for restaurant name and location.

In [12]:
# Print out all of the prices.
print(soup.find_all('div', {'class': 'rest-row-pricing'})[0:5])

[<div class="rest-row-pricing"> <i>  $    $      </i>   $    $      </div>, <div class="rest-row-pricing"> <i>  $    $      </i>   $    $      </div>, <div class="rest-row-pricing"> <i>  $    $      </i>   $    $      </div>, <div class="rest-row-pricing"> <i>  $    $    $    </i>   $        </div>, <div class="rest-row-pricing"> <i>  $    $      </i>   $    $      </div>]


In [13]:
r_dollars = []
# Get the number of dollar signs for each restaurant.
# It's trickier to eliminate the HTML in this one. Hint: Try a nested find.
for entry in soup.find_all('div', {'class': 'rest-row-pricing'}):
    r_dollars.append(entry.find('i').renderContents())
    
r_dollars[0:10]

['  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ']

**7.B) Convert the dollar sign strings to a count of the number of dollar signs.**

Can you figure out a way to print out the number of dollar signs per restaurant listed?

In [14]:
r_dollar_count = []

for entry in soup.find_all('div', {'class': 'rest-row-pricing'}):
    # Read the <i> tag as text so that count('$') works on Python 2 and Python 3.
    price = entry.find('i').text
    r_dollar_count.append(price.count('$'))
    
r_dollar_count[0:10]

[2, 2, 2, 3, 2, 3, 3, 2, 3, 2]
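The nested `find('i')` matters because, in the markup printed above, every pricing `<div>` contains four dollar signs in total. Only the ones inside the `<i>` tag reflect the restaurant's price; the rest are presumably the unfilled part of the scale. A small sketch with hard-coded strings standing in for the two pieces of text:

```python
# Hypothetical text values copied from the shape of the output above:
# the first is what the <i> tag holds, the second is the whole <div>'s text.
inner_text = '  $    $      '
full_text = '  $    $         $    $      '

price = inner_text.count('$')
total_signs = full_text.count('$')

print(price)        # 2
print(total_signs)  # 4
```

Counting on the whole `<div>` would report 4 for every restaurant, which is why the count is taken on the inner tag only.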

### 8) Can you find the number of times a restaurant was booked?

In the next cell, print out a sample of objects that contain the number of times a restaurant was booked.

> *Note: If you can't, why do you think this happens?*

In [15]:
# Print out all of the objects that contain the number of times the restaurant was booked.
print(soup.find_all('div', {'class': 'booking'})[0:20])

[]


That's weird: an empty list. Did we find the wrong element? What's going on here? Let's discuss.

How can we debug this? Any ideas?

In [16]:
# Let's try printing every "div" whose text mentions a booking.
for entry in soup.find_all('div'):
    if 'ooked' in entry.text:
        print(entry)

In [17]:
# We still can't find the booking count anywhere in the HTML we downloaded.
# The page fills it in with JavaScript after it loads.
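One quick way to test the JavaScript hypothesis is to search the raw page source as a string, before any parsing. The sketch below uses two short hypothetical strings in place of the real page source; `mentions_bookings` is an illustrative helper name, not part of any library.

```python
# Hypothetical snippets standing in for the real page source.
static_html = '<span class="rest-row-name-text">Sequoia</span>'
rendered_html = static_html + '<div class="booking">Booked 12 times today</div>'

def mentions_bookings(page_source):
    """Return True if the booking text appears anywhere in the source."""
    return 'Booked' in page_source

print(mentions_bookings(static_html))    # False
print(mentions_bookings(rendered_html))  # True
```

If the same check fails on the text fetched with urllib, the booking element never arrived in the response, so no choice of tag or class in `find_all` can locate it.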

## Enter: Selenium

---

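Selenium is a browser automation tool. It drives a real browser (which can optionally run headless, meaning without a visible window), so it lets us mimic human browsing behavior, including letting the page's JavaScript run and fill in its elements.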

If you don't already have Selenium installed, you can do so via pip. Simply run `pip install selenium`.

In [18]:
# Import:
from selenium import webdriver

To run, Selenium needs to know which browser to drive. We're going to use Chrome (through ChromeDriver), but Firefox is also a very common choice.

http://selenium-python.readthedocs.io/faq.html
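The cells in this lab pass `executable_path` directly to `webdriver.Chrome`, which matches the Selenium release the lab was written against. As a hedged sketch, assuming Selenium 4 or later, the driver path is instead wrapped in a `Service` object, and headless mode is an option set on the browser. `make_chrome_driver` is a hypothetical helper name, not part of Selenium.

```python
def make_chrome_driver(driver_path, headless=False):
    """Create a Chrome driver, assuming Selenium 4 or later is installed."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless")
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

With this helper, `driver = make_chrome_driver("../chromedriver/chromedriver")` would open a visible window, and passing `headless=True` would keep it hidden.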

### 9) What's going to happen when we run the next cell?

The ChromeDriver has been provided in the "chromedriver" folder, so don't worry about downloading another one.

In [55]:
import os
from selenium import webdriver

chromedriver = "/Users/edoardo/github_dsi4/classes/week-06/labs/python-webscraping_opentable-lab-master/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
# driver = webdriver.Chrome(chromedriver)

In [64]:
# Create a driver called "driver."
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")

This should have opened up a new browser window. If you didn't see it pop up automatically, check all of your desktop displays.

Pretty crazy, right? Now let's close that driver.

In [65]:
# Close it.
driver.close()

### 10) Use the driver to visit `www.python.org`.

In [27]:
# Let's boot it up and visit a URL of our choice.
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
driver.get("http://www.python.org")

In [30]:
driver.close()

Awesome! Now we're getting somewhere — we're programmatically controlling our browser like a human would.

### 11) Visit the OpenTable page using the driver.

Let's return to the problem at hand. We need to visit the OpenTable listings for DC. Once there, we need the HTML to load. 

In the next cell, prove you can programmatically visit the page.

In [31]:
# Visit our OpenTable page.
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# It's always good to check that we have the page we think we do.
assert "OpenTable" in driver.title

In [32]:
driver.close()

### 12) Resolve the JavaScript issue using the driver and find the bookings.

What we can do in this case is:

1) Request that the page load.
2) Wait one second.
3) Grab the source HTML from the page.

Because Selenium drives a real browser, the page's JavaScript runs just as it would for a human visitor, so the elements it creates should become part of the page source. We can then grab that source.
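A fixed one-second pause is a guess: it may be too short on a slow connection and wasted time on a fast one. A common alternative is to poll until a condition holds or a timeout expires. Selenium ships its own `WebDriverWait` for this; the sketch below shows only the underlying idea, with a hypothetical helper called `wait_until` and a stand-in condition rather than a real browser.

```python
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or None on timeout. Hypothetical helper.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for "has the booking element rendered yet?": succeeds on the third check.
checks = {'count': 0}

def page_ready():
    checks['count'] += 1
    return checks['count'] >= 3

ready = wait_until(page_ready, timeout=2.0, interval=0.01)
print(ready)  # True
```

With a real driver, the condition could be something like `lambda: 'class="booking"' in driver.page_source`, so the scrape proceeds as soon as the bookings appear.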

**Once you have the HTML with the JavaScript rendered, repeat the processes above to find the bookings.**

In [33]:
# Import sleep:
from time import sleep

In [34]:
# Visit our relevant page.
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# Wait one second.
sleep(1)
# Grab the page source.
html = driver.page_source

In [35]:
# Beautiful Soup it!
html = BeautifulSoup(html, 'lxml')

In [36]:
# Now, let's return to our earlier problem: How do we locate bookings on the page?

In [37]:
# Print out the number of bookings for all of the restaurants.
print(html.find_all('div', {'class':'booking'})[0:10])

[<div class="booking"><span class="tadpole"></span>Booked 451 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 249 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 98 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 154 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 125 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 82 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 59 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 90 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 36 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 95 times today</div>]


In [38]:
r_bookings = []
for booking in html.find_all('div', {'class':'booking'}):
    r_bookings.append(booking.text)
    
r_bookings[0:15]

[u'Booked 451 times today',
 u'Booked 249 times today',
 u'Booked 98 times today',
 u'Booked 154 times today',
 u'Booked 125 times today',
 u'Booked 82 times today',
 u'Booked 59 times today',
 u'Booked 90 times today',
 u'Booked 36 times today',
 u'Booked 95 times today',
 u'Booked 44 times today',
 u'Booked 39 times today',
 u'Booked 36 times today',
 u'Booked 41 times today',
 u'Booked 36 times today']

In [31]:
# We've succeeded!

# But we can still clean this up a bit. 
# We're going to use regular expressions (regex) to grab only the digits that are available in the text.

# The best way to get good at using regex is to keep trying and testing: http://pythex.org/.

In [39]:
# Import regex.
import re

In [40]:
# Because we haven't covered regex, here's a guide for how to use the search function to match a run of digits.

In [41]:
r_bookings_num = []
# For each entry, grab the text.
for booking in html.find_all('div', {'class':'booking'}):
    # Match the first run of one or more digits.
    match = re.search(r'\d+', booking.text)
    # Append if found.
    if match:
        r_bookings_num.append(int(match.group()))
    # Otherwise, it's 0.
    else:
        r_bookings_num.append(0)
        
r_bookings_num[0:15]

[451, 249, 98, 154, 125, 82, 59, 90, 36, 95, 44, 39, 36, 41, 36]
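To see the pattern in isolation: `\d` matches a single digit, `+` means "one or more," and `re.search` returns the first match anywhere in the string, or `None` if there is none. A minimal sketch on plain strings, where `booking_count` is an illustrative helper name:

```python
import re

def booking_count(text):
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0

print(booking_count('Booked 451 times today'))  # 451
print(booking_count('No bookings yet'))         # 0
```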

### 13) Can we get all of the items we want from the page in a single `find_all`?

To be as efficient as possible, we only want to do a single loop for each entry on the page. That means we want to find the element all of our other elements (name, location, price, and bookings) are housed within. Where is each entry located on the page?

In [42]:
# Collect all entries.
entries = html.find_all('div', {'class':'result content-section-list-row cf with-times'})

### 14) Does every entry have all of the elements we want?

In [43]:
# Not every entry has a number of recent bookings.
# That's probably why OpenTable fills this in with JavaScript: it wants to keep
# the booking counts updated with the most current values.

In [44]:
# What happens when a booking isn't available?
# Print out some booking entries using the identification code we wrote above.
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'})[0:50]:
    print(entry.find('div', {'class':'booking'}))

None
None
<div class="booking"><span class="tadpole"></span>Booked 451 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 249 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 98 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 154 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 125 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 82 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 59 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 90 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 36 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 95 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 44 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 39 times today</div>
<div class="booking"><span class="tadpole"></span>

### 15) Use Python exceptions to handle cases when bookings aren't found.

When a booking isn't found, store `'ZERO'`.

In [45]:
# If we find the element we want, we store its text. Otherwise, we store 'ZERO'.
entries = []
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    try:
        entries.append(entry.find('div', {'class':'booking'}).text)
    except AttributeError:
        # find() returned None, so there is no .text to read.
        entries.append('ZERO')
        
print(entries.count('ZERO'))

3
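The exception in play here is specific: `find` returns `None` when nothing matches, and reading `.text` on `None` raises `AttributeError`. The sketch below isolates that behavior with a hypothetical stand-in class, `FakeTag`, so it runs without a page or a parser.

```python
class FakeTag(object):
    """Hypothetical stand-in for a Beautiful Soup tag; it only has a .text attribute."""
    def __init__(self, text):
        self.text = text

def booking_text(tag, default='ZERO'):
    """Return tag.text, or default when tag is None (what find() returns on no match)."""
    try:
        return tag.text
    except AttributeError:
        return default

found = booking_text(FakeTag('Booked 36 times today'))
missing = booking_text(None)
print(found)    # Booked 36 times today
print(missing)  # ZERO
```

Catching `AttributeError` rather than using a bare `except` means unrelated problems, such as a typo in a variable name, still surface as errors instead of being silently recorded as `'ZERO'`.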


### 16) Putting it all together in a DataFrame.

**Loop through the entries. For each:**

1) Grab the relevant information we want (name, location, price, and bookings). 
2) Produce a DataFrame with the columns "name," "location," "price," and "bookings" that contains the 100 entries we'd like.

In [46]:
# First, create an empty DataFrame.
import pandas as pd
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

In [47]:
# Loop through each entry.
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # Grab the name.
    name = entry.find('span', {'class': 'rest-row-name-text'}).text
    # Grab the location.
    location = entry.find('span', {'class': 'rest-row-meta--location rest-row-meta-text'}).text
    # Grab the price.
    price = entry.find('div', {'class': 'rest-row-pricing'}).find('i').text.count('$')
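    # Try to find the number of bookings; default to 'NA' when it's absent.
    # Resetting the default here keeps a previous entry's value from carrying over.
    bookings = 'NA'
    try:
        temp = entry.find('div', {'class':'booking'}).text
        match = re.search(r'\d+', temp)
        if match:
            bookings = match.group()
    except AttributeError:
        # entry.find() returned None: this entry has no booking element.
        pass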
    # Add to the DataFrame.
    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]

In [48]:
# Check out our work.
dc_eats.head()

| | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | |
| 1 | Joe's Place Pizza and Pasta | Arlington | 2 | |
| 2 | Farmers Fishers Bakers | Georgetown | 2 | 451.0 |
| 3 | Filomena Ristorante | Georgetown | 3 | 249.0 |
| 4 | Ambar - Arlington | Arlington | 2 | 98.0 |
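Appending to a DataFrame one row at a time with `.loc` works, but a common alternative is to collect plain dictionaries in a list and build the DataFrame once at the end with `pd.DataFrame(rows)`. The sketch below shows only the collection step and deliberately leaves pandas out; the input tuples are values taken from the outputs above, with `None` marking an entry that has no booking element.

```python
import re

# Values taken from the outputs above; None marks a missing booking element.
scraped = [
    ("Ruffino's - Arlington", 'Arlington', 2, None),
    ('Farmers Fishers Bakers', 'Georgetown', 2, 'Booked 451 times today'),
]

rows = []
for name, location, price, booking_text in scraped:
    match = re.search(r'\d+', booking_text) if booking_text else None
    rows.append({
        'name': name,
        'location': location,
        'price': price,
        # Store an int when a count was found, otherwise None.
        'bookings': int(match.group()) if match else None,
    })

print(rows[1]['bookings'])  # 451
```

Storing `None` rather than the string `'NA'` keeps the column numeric once it is loaded into pandas, where missing values are typically shown as `NaN`.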


### 17) [Bonus] Sending keys over the driver.

We can send keys to the page using the driver. Below is a demonstration of how to search the page using the Selenium driver.

In [49]:
# We can send keys as well. Import:
from selenium.webdriver.common.keys import Keys

In [50]:
# Open the driver.
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
# Visit Python.
driver.get("http://www.python.org")
# Verify we're in the right place.
assert "Python" in driver.title

In [51]:
# Find the search box.
elem = driver.find_element_by_name("q")
# Clear it.
elem.clear()
# Type in "pycon."
elem.send_keys("pycon")

In [52]:
# Press Return to submit the search.
elem.send_keys(Keys.RETURN)
# Confirm that the search returned results.
assert "No results found." not in driver.page_source

In [53]:
# Close the driver.
driver.close()

In [54]:
# All at once:
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

## Additional Resources

---

The example above (and many others) is available in the [Selenium docs](http://selenium-python.readthedocs.io/getting-started.html).

It's especially worth exploring features such as [locating elements](http://selenium-python.readthedocs.io/locating-elements.html#locating-elements).

Review Selenium's [FAQs](http://selenium-python.readthedocs.io/faq.html).