<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

_Author: Dave Yerrington (SF)_

---

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss the limitations of the simple `requests` and `urllib` libraries
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)
- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using requests and Beautiful Soup. We will also explore how to use Selenium, a browser automation tool that can drive a headless browser.

We'll begin by scraping OpenTable's DC listings. For each restaurant, we're interested in knowing its **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each piece of information we're interested in.

---

<a id="building-scraper"></a>
## Building a web scraper

Now, let's build a web scraper for OpenTable using requests and Beautiful Soup:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url)

At this point, what is in html?

In [3]:
# .text returns the response body decoded to a Unicode string
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex" > </meta><link  rel="canonical" href="https://www.opentable.com/washington-dc-restaurant-listings" > </link>      <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="/'
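
Note that `html` itself is a `requests.Response` object, and `.text` is its decoded body. Below is a minimal sketch of a slightly more defensive fetch. The helper name `fetch_page`, its `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the lesson's code.

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Return the decoded HTML body of `url`, raising on HTTP errors.

    `fetch_page` is a hypothetical helper, not a function from requests.
    """
    # Reuse a caller-supplied session if given; otherwise create one.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_page(url)` with the `url` defined earlier should return the same kind of string as `html.text`, while failing loudly if the server answers with an error status.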

We will need to convert this HTML text into a soup object so we can parse it using Python and Beautiful Soup (bs4).

In [4]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')
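
To get a feel for what a soup object offers, here is a small sketch that parses a shortened inline copy of the HTML string instead of the live page. The `<p>Hello</p>` body is invented for illustration.

```python
from bs4 import BeautifulSoup

# A shortened version of the HTML string shown earlier; the <p> is made up.
sample_html = (
    '<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/>'
    "<title>Restaurant Reservation Availability</title></head>"
    "<body><p>Hello</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name as attributes of the soup object.
page_title = sample_soup.title.get_text()
# find() returns the first matching Tag, or None when nothing matches.
first_paragraph = sample_soup.find("p")
missing = sample_soup.find("table")

print(page_title)
print(first_paragraph, missing)
```

The `None` result for a tag that is not present is worth remembering: calling `.text` on it would raise an `AttributeError`.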

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Let's first find each restaurant name listed on the page we've loaded. How do we find where the restaurant name sits on the page? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/washington-dc-restaurant-listings

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [5]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">486 Von</span>,
 <span class="rest-row-name-text">Dolorem Jakubowski</span>,
 <span class="rest-row-name-text">O'Hara</span>,
 <span class="rest-row-name-text">Alize Ports</span>,
 <span class="rest-row-name-text">Isabella Howell</span>,
 <span class="rest-row-name-text">Rerum</span>,
 <span class="rest-row-name-text">Mills</span>,
 <span class="rest-row-name-text">Roob</span>,
 <span class="rest-row-name-text">Valleys</span>,
 <span class="rest-row-name-text">Hellens</span>,
 <span class="rest-row-name-text">Tremblay</span>,
 <span class="rest-row-name-text">Myrl Drives</span>,
 <span class="rest-row-name-text">Rafaela Sauer</span>,
 <span class="rest-row-name-text">Numquam Anderson</span>,
 <span class="rest-row-name-text">Shields</span>,
 <span class="rest-row-name-text">Est</span>,
 <span class="rest-row-name-text">Rollins</span>,
 <span class="rest-row-name-text">Aniyah Zemlak</span>,
 <span class="rest-row-name-text">Levi Crossroad</span>,
 <span

It is important to always keep in mind the data types that were returned. Note this is a `list`, and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, note the elements of the list are `Tag` objects, not strings. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, being an object, it has many attributes and methods that we can use. For example, next we will use the `.text` attribute to return the tag's text as a Python string. (The `encode_contents()` method, by contrast, returns the tag's contents as bytes.)
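
The types described above can be checked directly. This is a small sketch on an inline sample that imitates the markup in the output, not on the live page.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Small inline sample that imitates the markup shown above.
sample_html = """
<span class="rest-row-name-text">486 Von</span>
<span class="rest-row-name-text">O'Hara</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
matches = sample_soup.find_all(name="span", attrs={"class": "rest-row-name-text"})

# find_all returns a ResultSet, which is a subclass of list.
is_list = isinstance(matches, list)
is_result_set = isinstance(matches, ResultSet)

# Each element is a Tag object, not a string.
first = matches[0]
is_tag = isinstance(first, Tag)

# .text gives the tag's text as a str; encode_contents() gives bytes.
name_as_str = first.text
contents_as_bytes = first.encode_contents()

print(type(matches).__name__, type(first).__name__)
print(name_as_str, contents_as_bytes)
```

Because the result behaves like a list, the usual tools apply: `len(matches)`, indexing, slicing, and `for` loops.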

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Now that we've found a list of tags containing the restaurant names, let's think about how we can loop through them one by one. In the following cell, we'll print out the name (and **only** the clean name, not the rest of the HTML) of each restaurant.

In [6]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

486 Von
Dolorem Jakubowski
O'Hara
Alize Ports
Isabella Howell
Rerum
Mills
Roob
Valleys
Hellens
Tremblay
Myrl Drives
Rafaela Sauer
Numquam Anderson
Shields
Est
Rollins
Aniyah Zemlak
Levi Crossroad
Et
192 Jacobson
Non
Wilfrid Lueilwitz
Quam Lane
Corbins
Turner
Mina Miller
Est Well
Explicabo Erdman
Fahey
1088 Ortiz
Ut
Ortiz Square
Eius
Necessitatibus Wintheiser
Officia
Alek Hoeger
Sed
Mandy Roob
Provident
Domenick Zboncak
Kattie Circle
Course
Sequi
Jerod Burgs
Barrys
Hansen Lock
Distinctio
Voluptas
Vel Gottlieb
Izabellas
Nicholauss
Soluta Weber
104 Mann
Ari Terry
Minima Schinner
Aut
Omnis
Iure Squares
Kraig Gottlieb
Bruen
Verlie McKenzie
Ratione Bypass
Agloe Bar & Grill
O'Kon
Autem
Maudies
Eius
Gladyce Brook
Stephen Grimes
Expedita Mission
Aniyah Alley
Kris
Kims
Farrell
Grant
Dolor
Shanon Osinski
Jazmyn Brook
Dannie Mohr
Aliquid
West
In Divide
Numquam Dale
Thiel
Hegmann Keys
Sit
Ludwig Stoltenberg
Incidunt Valley
Schultz
Nihil Bergnaum
Ad
Yasmins
Helen Prohaska
Adams
Sunt Tunnel
Ipsum
Han

Great!
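
Printing is fine for a quick look, but we will usually want to keep the names. Here is a minimal sketch that collects them into a list, again on an inline sample rather than the live page. The `other` class and the padding spaces are made up to show the filtering and stripping.

```python
from bs4 import BeautifulSoup

sample_html = """
<span class="rest-row-name-text">Alize Ports</span>
<span class="rest-row-name-text">  Myrl Drives  </span>
<span class="other">not a restaurant</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
names = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all("span", attrs={"class": "rest-row-name-text"})
]
print(names)
```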

<a id="retrieving-locations"></a>
#### Challenge: Retrieving the restaurant locations

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [7]:
# first, see if you can identify the location for all elements -- print it out

In [8]:
# now print out EACH location for the restaurants

<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [16]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      

In [17]:
# print out EACH number of dollar signs per restaurant
# stripping the HTML is trickier here. Hint: try a nested find
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $      
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $     

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?
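
One possible approach, sketched on two rows copied from the pricing output above: take the text of the nested `<i>` tag and count its `$` characters with `str.count`. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline sample copied from the pricing output above (two restaurants).
sample_html = """
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

dollar_counts = []
for entry in sample_soup.find_all("div", {"class": "rest-row-pricing"}):
    price_tag = entry.find("i")
    # str.count tallies the "$" characters and ignores the padding spaces.
    dollar_counts.append(price_tag.text.count("$"))

print(dollar_counts)
```

For the two sample rows this prints `[2, 4]`. Counting only inside the `<i>` tag matters, because the sibling `<span class="pricing--not-the-price">` also contains dollar signs.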

In [18]:
# print the number of dollar signs per restaurant

That's weird -- an empty set. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?
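
One possible starting point, offered as a sketch: check whether the class name we searched for appears anywhere in the raw HTML that `requests` downloaded. If it does not, the element may be added later by JavaScript in the browser, which `requests` does not run. The helper name `diagnose_selector` and the `late-widget` class are hypothetical.

```python
from bs4 import BeautifulSoup


def diagnose_selector(html_text, tag_name, class_name):
    """Report how many tags match and whether the class name appears at all.

    `diagnose_selector` is a hypothetical helper written for this sketch.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    matches = soup.find_all(tag_name, attrs={"class": class_name})
    return {
        "match_count": len(matches),
        # A plain substring test on the raw text, independent of parsing.
        "class_in_raw_html": class_name in html_text,
    }


# Made-up static HTML: in this scenario a script would add the
# "late-widget" div in the browser, so it is absent from the download.
static_html = (
    '<div class="rest-row-pricing"><i>$ $</i></div>'
    '<script src="app.js"></script>'
)

present = diagnose_selector(static_html, "div", "rest-row-pricing")
absent = diagnose_selector(static_html, "div", "late-widget")
print(present)
print(absent)
```

A zero match count together with a class name that is missing from the raw text suggests the problem lies in what was downloaded, not in the `find_all` call.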

## Class Exercise
- For all CL housing listings on the front page, scrape the price and title and any other attributes into a dataframe
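
As a starting shape for this exercise, here is a hedged sketch of turning scraped rows into a pandas DataFrame. The markup and class names (`listing`, `title`, `price`) are invented for illustration and will not match the real site. Use Inspect to find the actual tags.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical listing markup: these class names are invented for the sketch.
sample_html = """
<li class="listing"><span class="title">Sunny studio</span><span class="price">$1,500</span></li>
<li class="listing"><span class="title">Two-bedroom flat</span><span class="price">$2,750</span></li>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in sample_soup.find_all("li", attrs={"class": "listing"}):
    title = row.find("span", attrs={"class": "title"}).get_text(strip=True)
    price_text = row.find("span", attrs={"class": "price"}).get_text(strip=True)
    # Convert a string such as "$1,500" to the integer 1500.
    price = int(price_text.replace("$", "").replace(",", ""))
    records.append({"title": title, "price": price})

# One dict per listing becomes one row; the dict keys become the columns.
listings = pd.DataFrame(records)
print(listings)
```

Building a list of dictionaries first keeps the scraping loop separate from the DataFrame construction, which makes it easier to add more attributes later.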

<a id="summary"></a>
## Summary

In this lesson, we used the Beautiful Soup library to locate elements on a website and then scrape their text. We also used Selenium to drive a headless browser, which runs the page's JavaScript before we retrieve its contents.