# Webscraping 2.0

Joseph Nelson, DC

In today's codealong, I'll walk through how to build a scraper using urllib and BeautifulSoup. We'll run into the problems we discussed in the lesson readme, and we'll remedy them using Selenium, a browser automation tool that can drive a real (or headless) browser.

For starters, we're going to be scraping OpenTable's DC listings. We're interested in knowing each restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each of the bits of information in which we're interested.
>Class proceeds to do exactly that ^

We'll then build our scraper:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import urllib

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = urllib.urlopen(url).read()

At this point, what is in html?

In [3]:
html

'          <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Washington, D.C. Area Restaurants List | OpenTable</title>  <meta  name="description" content="Find Washington, D.C. Area restaurants. Search by location, cuisine, or price to refine restaurant results in the Washington, D.C. Area area." > </meta>  <meta  name="robots" content="noindex" > </meta>    <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/fa

In [4]:
# we need to convert this into a soup object
soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

Let's first find each restaurant name listed on the page we've loaded. How do we find the page location of the restaurant? (Psst: we need to know where in the **html** the restaurant element is housed)

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.
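As a minimal sketch of this kind of lookup, here is `find_all` run against a tiny hand-written snippet rather than the live page. The markup is an assumption modeled on the class name visible in the inspector; the real page has far more nesting.

```python
from bs4 import BeautifulSoup

# hypothetical markup, modeled on the class name seen in the inspector
sample_html = """
<div class="rest-row-header">
  <a class="rest-row-name"><span class="rest-row-name-text">Rasika</span></a>
  <a class="rest-row-name"><span class="rest-row-name-text">Ambar</span></a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; get_text() drops the surrounding html
sample_names = [
    tag.get_text()
    for tag in sample_soup.find_all("span", class_="rest-row-name-text")
]
print(sample_names)
```

The same two steps (match the tags, then pull the text out of each) apply to the full page.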

In [5]:
# print the restaurant names
soup.find_all('span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Rasika</span>,
 <span class="rest-row-name-text">Harold Black</span>,
 <span class="rest-row-name-text">Ambar</span>,
 <span class="rest-row-name-text">Chez Billy Sud</span>,
 <span class="rest-row-name-text">Centrolina</span>,
 <span class="rest-row-name-text">Blue Duck Tavern</span>,
 <span class="rest-row-name-text">Old Ebbitt Grill</span>,
 <span class="rest-row-name-text">Rasika West End</span>,
 <span class="rest-row-name-text">Farmers Fishers Bakers</span>,
 <span class="rest-row-name-text">Captain Gregory's</span>,
 <span class="rest-row-name-text">Le Diplomate</span>,
 <span class="rest-row-name-text">Mokomandy</span>,
 <span class="rest-row-name-text">Ambar - Arlington</span>,
 <span class="rest-row-name-text">Lupo Verde</span>,
 <span class="rest-row-name-text">CIRCA at Foggy Bottom</span>,
 <span class="rest-row-name-text">Ghibellina</span>,
 <span class="rest-row-name-text">Ted's Bulletin - 14th Street</span>,
 <span class="rest-row-name-t

In [6]:
# print the restaurant names
var = soup.find_all('span', class_='rest-row-name-text')

[x.text for x in var]

[u'Rasika',
 u'Harold Black',
 u'Ambar',
 u'Chez Billy Sud',
 u'Centrolina',
 u'Blue Duck Tavern',
 u'Old Ebbitt Grill',
 u'Rasika West End',
 u'Farmers Fishers Bakers',
 u"Captain Gregory's",
 u'Le Diplomate',
 u'Mokomandy',
 u'Ambar - Arlington',
 u'Lupo Verde',
 u'CIRCA at Foggy Bottom',
 u'Ghibellina',
 u"Ted's Bulletin - 14th Street",
 u'Medium Rare - Cleveland Park',
 u'Founding Farmers - DC',
 u'Floriana',
 u'Tortino',
 u'Hazel',
 u'Oyamel',
 u"Clyde's of Georgetown",
 u'The Hamilton',
 u'Chaplin',
 u'CIRCA at Clarendon',
 u'Peacock Caf\xe9',
 u'Daikaya Izakaya (2F)',
 u"Hank's Oyster Bar - Dupont",
 u'Commissary DC',
 u'RPM Italian - DC',
 u'Momofuku CCDC',
 u'The Bird',
 u'Joselito Casa de Comidas',
 u'The Wine Kitchen on the Creek',
 u'Succotash Restaurant',
 u'Osteria Morini DC',
 u'DBGB DC',
 u'Mirabelle',
 u'Mission',
 u'SEI restaurant & lounge',
 u'Virtue Feed & Grain',
 u'Filomena Ristorante',
 u'Acqua Al 2',
 u'Farmers & Distillers',
 u'Sakerum',
 u'The Majestic',
 u'Ca

In [7]:
for i in soup.find_all('span', attrs={'class':'rest-row-name-text'}):
    print i.renderContents()

Rasika
Harold Black
Ambar
Chez Billy Sud
Centrolina
Blue Duck Tavern
Old Ebbitt Grill
Rasika West End
Farmers Fishers Bakers
Captain Gregory's
Le Diplomate
Mokomandy
Ambar - Arlington
Lupo Verde
CIRCA at Foggy Bottom
Ghibellina
Ted's Bulletin - 14th Street
Medium Rare - Cleveland Park
Founding Farmers - DC
Floriana
Tortino
Hazel
Oyamel
Clyde's of Georgetown
The Hamilton
Chaplin
CIRCA at Clarendon
Peacock Café
Daikaya Izakaya (2F)
Hank's Oyster Bar - Dupont
Commissary DC
RPM Italian - DC
Momofuku CCDC
The Bird
Joselito Casa de Comidas
The Wine Kitchen on the Creek
Succotash Restaurant
Osteria Morini DC
DBGB DC
Mirabelle
Mission
SEI restaurant &amp; lounge
Virtue Feed &amp; Grain
Filomena Ristorante
Acqua Al 2
Farmers &amp; Distillers
Sakerum
The Majestic
Cava Mezze - DC
Kinship
The Arsenal at Bluejacket
Proof Restaurant
CIRCA at Dupont
Founding Farmers - Tysons
The Liberty Tavern
Logan Tavern
Tico
Zaytinya
The Pig
Jacques' Brasserie at L'Auberge Chez Francois
Denson Liquor Bar
District 

Now that we can find each element, let's think how we can loop through them all one-by-one. In the following cell, print out the name (and **only** the clean name, not the rest of the html) of each restaurant.

In [8]:
# for each element you find, print out the restaurant name


Great!

Can you repeat that process for finding the location? 

In [9]:
# first, see if you can identify the location for all elements -- print it out


In [10]:
# now print out EACH location for the restaurants

[x.renderContents() for x in soup(class_='rest-row-meta--location rest-row-meta-text')]


['Penn Quarter',
 'Capitol Hill',
 'Capitol Hill',
 'Georgetown',
 'Downtown',
 'West End',
 'Downtown',
 'West End',
 'Georgetown',
 'Alexandria',
 'Logan Circle',
 'Sterling',
 'Arlington',
 'Logan Circle',
 'Foggy Bottom',
 'Logan Circle',
 'Logan Circle',
 'Cleveland Park',
 'Foggy Bottom',
 'Dupont Circle',
 'Downtown',
 'Shaw',
 'Penn Quarter',
 'Georgetown',
 'Downtown',
 'Shaw',
 'Arlington',
 'Georgetown',
 'Penn Quarter',
 'Dupont Circle',
 'Logan Circle',
 'Mt. Vernon Square',
 'Mount Vernon',
 'Logan Circle',
 'Capitol Hill',
 'Frederick',
 'National Harbor',
 'Navy Yard',
 'Penn Quarter',
 'Downtown',
 'Dupont Circle',
 'Penn Quarter',
 'Alexandria',
 'Georgetown',
 'Capitol Hill',
 'Mt. Vernon Square',
 'U Street Corridor',
 'Alexandria',
 'Capitol Hill',
 'Mt. Vernon Square',
 'Navy Yard',
 'Penn Quarter',
 'Dupont Circle',
 'Tysons Corner / McLean',
 'Arlington',
 'Logan Circle',
 'U Street Corridor',
 'Penn Quarter',
 'Logan Circle',
 'Great Falls',
 'Penn Quarter',
 '

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [11]:
# print out all prices
for i in soup.find_all('div', class_='rest-row-pricing'):
    print i.renderContents()

 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $

In [12]:
# print out EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', class_='rest-row-pricing'):
    print entry.renderContents()

 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $    $    </i>   $        
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $      </i>   $    $      
 <i>  $    $

In [30]:
# Steph's answer:

pricing = [x.find('i').text.encode() for x in soup.find_all('div', attrs={'class':'rest-row-pricing'})]
pricing

['  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $      ',
 '  $    $    $    $  ',
 '  $    $    

In [34]:
for entry in soup.find_all('div', class_='rest-row-pricing'):
    print entry.find('i').renderContents()

  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
 

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [37]:
# print the number of dollars signs per restaurant
for entry in soup.find_all('div', class_='rest-row-pricing'):
    price = entry.find('i').renderContents()
    print price.count('$')

3
2
2
2
2
3
2
3
2
2
3
2
2
2
2
2
2
2
2
2
3
2
2
2
2
2
2
2
2
2
2
3
3
2
2
2
2
2
2
2
2
3
2
2
3
2
3
2
2
4
2
2
2
2
2
2
3
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
2
3
2
2
2
2
2
2
2
2
3
2
2
2
3
2
2
2
2
2
2
3
3


Phew, nice work. 
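To see why the nested find matters, here is a small sketch on a hand-written pricing snippet. The markup is an assumption based on the output above, where the `<i>` tag appears to wrap the dollar signs that count and the remaining ones sit outside it.

```python
from bs4 import BeautifulSoup

# hypothetical pricing markup: some dollar signs inside <i>, the rest outside
sample_html = '<div class="rest-row-pricing"> <i>  $    $    </i>   $    $    </div>'
sample_div = BeautifulSoup(sample_html, "html.parser").find(
    "div", class_="rest-row-pricing"
)

# counting on the whole div picks up every sign, inside <i> or not
all_signs = sample_div.get_text().count("$")
# counting on the nested <i> keeps only the signs it wraps
active_signs = sample_div.find("i").get_text().count("$")

print(all_signs)
print(active_signs)
```

Counting on the outer `div` would report the same number for every restaurant, which is why the loop above counts inside the `<i>` tag.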

One more, right? We only need to find the number of times a restaurant was booked. In the next cell, print out all objects that contain the number of times the restaurant was booked.

In [36]:
# print out all objects that contain the number of times the restaurant was booked
for entry in soup.find_all('span', class_='tadpole'):
    print entry

That's weird -- nothing printed, so `find_all` returned an empty list. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?

In [None]:
# let's first try printing out all 'span' class objects
for entry in soup.find_all('span'):
    print entry

I still don't see it. Let's search our entire soup object:

In [None]:
# print out soup, do command+f for "booked "
soup

What do you notice? Why is this happening?
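One way to reason about it is with a toy page whose booking text only exists inside a script. This is a sketch under assumptions: the markup and the `booking-slot` class are made up for illustration and are not OpenTable's actual code.

```python
from bs4 import BeautifulSoup

# hypothetical page: the booking div is only created by JavaScript,
# which a browser would run but urllib never does
raw_html = """
<div class="result">
  <span class="rest-row-name-text">Rasika</span>
  <div class="booking-slot"></div>
  <script>
    document.querySelector('.booking-slot').innerHTML =
      '<div class="booking">Booked 101 times today</div>';
  </script>
</div>
"""

static_soup = BeautifulSoup(raw_html, "html.parser")

# the text is present in the raw string, but only as script source,
# so there is no parsed <div class="booking"> tag to find
found = static_soup.find_all("div", class_="booking")
print(len(found))
```

If the real page fills in bookings this way, the html we downloaded will never contain the element, no matter which selector we try.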

## Enter Selenium

Selenium is a browser automation tool: it drives a real browser from code, optionally headless (without a visible window). That means it enables us to mimic human browsing behavior -- even waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can do so via pip. Simply: `pip install selenium`

In [38]:
# import
from selenium import webdriver

Selenium requires us to choose a browser to drive. I'm going to opt for Chrome (via chromedriver), but Firefox is also a very common choice. http://selenium-python.readthedocs.io/faq.html

In [None]:
# STOP
# what is going to happen when I run the next cell?

In [41]:
# create a driver
driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")

Pretty crazy, right? Let's close that driver.

In [42]:
# close it
driver.close()

In [43]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")
driver.get("http://www.python.org")


Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

In [45]:
#driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")
driver.get("http://www.python.org")

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. In the next cell, prove you can programmatically visit the page.

In [46]:
# visit our OpenTable page
#driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

Now, to resolve our JavaScript problem, there are a few things we can do. In this case, I'll request that the page load, wait one second, and then grab the source html from the page. Because the page sees a visit from a live browser client, the JavaScript runs and its results become part of the page source. I can then grab the page source.

In [47]:
# import sleep
from time import sleep

In [48]:
# visit our relevant page
# driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# wait one second
sleep(1)
#grab the page source
html = driver.page_source

Pop quiz: what do we need to do with this html?

In [49]:
# BeautifulSoup it!
# name the parser explicitly; otherwise BeautifulSoup guesses one and warns
html = BeautifulSoup(html, "lxml")

Now, let's return to our earlier problem: how do we locate bookings on the page?

In [51]:
# print out the number bookings for all restaurants
for element in html.find_all('div', {'class':'booking'}):
    print element

<div class="booking"><span class="tadpole"></span>Booked 101 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 24 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 67 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 69 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 41 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 115 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 313 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 119 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 290 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 16 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 243 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 16 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5

In [None]:
# now print out each booking for the listings using a loop
for booking in html.find_all('div', {'class':'booking'}):
    print booking

Let's grab just the text of each of these entries.

In [54]:
# do the same as above, but grabbing only the text content
for booking in html.find_all('div', {'class':'booking'}):
    print booking.text

Booked 101 times today
Booked 24 times today
Booked 67 times today
Booked 69 times today
Booked 41 times today
Booked 115 times today
Booked 313 times today
Booked 119 times today
Booked 290 times today
Booked 16 times today
Booked 243 times today
Booked 16 times today
Booked 54 times today
Booked 48 times today
Booked 63 times today
Booked 57 times today
Booked 23 times today
Booked 42 times today
Booked 496 times today
Booked 53 times today
Booked 134 times today
Booked 48 times today
Booked 12 times today
Booked 23 times today
Booked 44 times today
Booked 26 times today
Booked 48 times today
Booked 174 times today
Booked 44 times today
Booked 36 times today
Booked 147 times today
Booked 82 times today
Booked 41 times today
Booked 15 times today
Booked 21 times today
Booked 13 times today
Booked 60 times today
Booked 27 times today
Booked 61 times today
Booked 20 times today
Booked 49 times today
Booked 20 times today
Booked 184 times today
Booked 65 times today
Booked 53 times today

We've succeeded!

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits in each piece of text.

The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [55]:
# import regex
import re

Given we haven't covered regex, I'll show you how to use the search function to match a run of digits.
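As a standalone sketch of what `re.search` does here: the pattern `\d+` matches the first run of one or more digits, and `search` returns `None` when there is no match. The helper name `booking_count` is an assumption for illustration.

```python
import re


def booking_count(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(booking_count("Booked 101 times today"))
print(booking_count("No bookings yet"))
```

Converting to `int` is optional; the cells below keep the matched digits as strings.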

In [67]:
# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # match all digits
    match = re.search(r'\d+', booking.text)
    # print if found
    if match:
        print match.group()
    # otherwise pass
    else:
        pass

101
24
67
69
41
115
313
119
290
16
243
16
54
48
63
57
23
42
496
53
134
48
12
23
44
26
48
174
44
36
147
82
41
15
21
13
60
27
61
20
49
20
184
65
53
180
24
18
49
21
29
21
23
256
14
274
30
14
18
63
28
35
45
27
53
7
42
18
8
19
8
25
12
17
21
59
8
49
10
20
15
20
16
36
7
28
51
65
23
9
43
31
48
21
33
31
11
114
32


Before we demonstrate all the other amazing things about browser automation, let's finish up collecting the data we want from this current example. Do you suppose the html parsing we wrote above will still work on the page source we've grabbed from our Selenium-driven browser?

To be most efficient, we want to only do a single loop for each entry on the page. That means we want to find the element that all of the other elements (name, location, price, bookings) are housed within. Where on the page is each entry located?

In [68]:
# print out all entries
print html.find_all('div', {'class':'result content-section-list-row cf with-times'})

[<div class="result content-section-list-row cf with-times" data-id="0" data-index="1" data-lat="38.8950000" data-lon="-77.0200000" data-offers="" data-rid="5674"><div class="rest-row with-image"> <div class="rest-row-image"> <a href="/rasika" target="_blank"><img alt="photo of rasika restaurant" class="lazy rest-image loaded" data-src="//resizer.otstatic.com/v2/profiles/legacy/5674.jpg" src="//resizer.otstatic.com/v2/profiles/legacy/5674.jpg"/></a></div> <div class="rest-row-info"><div class="rest-row-header"> <a class="rest-row-name rest-name " href="/rasika" target="_blank"> <span class="rest-row-name-text">Rasika</span> </a> </div> <div class="flex-row-justify"> <div class="rest-row-review"> <div class="star-rating review-container"><div class="star-wrapper small"><div class="all-stars"></div><div class="all-stars filled" style="width: 96%;"></div></div> <a class="review-link" href="/rasika#reviews" target="_blank"><span class="star-rating-text">(7254)</span><span class="star-ratin

Look over the page. Does every single entry have each element we're seeking?
> I did this previously. I know for a fact that not every entry has a number of recent bookings. That's probably exactly why OpenTable houses this in JavaScript: they want to continuously update the number of bookings with the most recent value.

In [69]:
# what happens when a booking is not available?
# print out each booking entry, using the identification code we wrote above
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print entry.find('div', {'class':'booking'})

<div class="booking"><span class="tadpole"></span>Booked 101 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 24 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 67 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 69 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 41 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 115 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 313 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 119 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 290 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 16 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 243 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 16 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5

In [70]:
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print entry.find('div', {'class':'booking'}).text

Booked 101 times today
Booked 24 times today
Booked 67 times today
Booked 69 times today
Booked 41 times today
Booked 115 times today
Booked 313 times today
Booked 119 times today
Booked 290 times today
Booked 16 times today
Booked 243 times today
Booked 16 times today
Booked 54 times today
Booked 48 times today
Booked 63 times today
Booked 57 times today
Booked 23 times today


AttributeError: 'NoneType' object has no attribute 'text'

What takes the place of the booking element when one is not found?
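A quick sketch on a hand-written entry (the markup is an assumption) shows the behavior in isolation: `find` returns `None` when nothing matches, and asking `None` for `.text` raises the `AttributeError` we just saw.

```python
from bs4 import BeautifulSoup

# hypothetical entry with no booking div inside it
entry = BeautifulSoup(
    '<div class="result"><span>Some Restaurant</span></div>', "html.parser"
)

missing = entry.find("div", class_="booking")
print(missing is None)

try:
    missing.text
except AttributeError as err:
    # None has no .text attribute, hence the error
    print(err)
```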

Thus, we will use exceptions. Here's a demo:

In [71]:
# if we find the element we want, we store its text. Otherwise, we store 'ZERO'
values = []
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    try:
        values.append(entry.find('div', {'class':'booking'}).text)
    # find() returns None when there is no booking, so .text raises AttributeError
    except AttributeError:
        values.append('ZERO')

In [73]:
for i in values:
    print i

Booked 101 times today
Booked 24 times today
Booked 67 times today
Booked 69 times today
Booked 41 times today
Booked 115 times today
Booked 313 times today
Booked 119 times today
Booked 290 times today
Booked 16 times today
Booked 243 times today
Booked 16 times today
Booked 54 times today
Booked 48 times today
Booked 63 times today
Booked 57 times today
Booked 23 times today
ZERO
Booked 42 times today
Booked 496 times today
Booked 53 times today
Booked 134 times today
Booked 48 times today
Booked 12 times today
Booked 23 times today
Booked 44 times today
Booked 26 times today
Booked 48 times today
Booked 174 times today
Booked 44 times today
Booked 36 times today
Booked 147 times today
Booked 82 times today
Booked 41 times today
Booked 15 times today
Booked 21 times today
Booked 13 times today
Booked 60 times today
Booked 27 times today
Booked 61 times today
Booked 20 times today
Booked 49 times today
Booked 20 times today
Booked 184 times today
Booked 65 times today
Booked 53 times 

Having completed this before, I know all the other elements WILL be returned. That means we do not have to handle exceptions for them.

However, the onus is on you to now put all the pieces together.

Loop through each entry. For each entry, grab the relevant information we want (name, location, price, bookings). Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [74]:
# I'm going to create my empty df first
import pandas as pd
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

**Check:** What is my for-loop doing?

In [75]:
# loop through each entry
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
    name = entry.find('span', {'class': 'rest-row-name-text'}).text
    # grab the location
    location = entry.find('span', {'class': 'rest-row-meta--location rest-row-meta-text'}).renderContents()
    # grab the price
    price = entry.find('div', {'class': 'rest-row-pricing'}).find('i').renderContents().count('$')
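    # find the number of bookings; default to 'NA' so a missing or
    # unmatched booking never reuses the previous restaurant's value
    bookings = 'NA'
    booking = entry.find('div', {'class':'booking'})
    if booking is not None:
        match = re.search(r'\d+', booking.text)
        if match:
            bookings = match.group()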
    # add to df
    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]

In [76]:
# check out our work
dc_eats.head()

| | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Rasika | Penn Quarter | 3 | 101 |
| 1 | Harold Black | Capitol Hill | 2 | 24 |
| 2 | Chez Billy Sud | Georgetown | 2 | 67 |
| 3 | Ambar | Capitol Hill | 2 | 69 |
| 4 | Centrolina | Downtown | 2 | 41 |


Awesome! We succeeded.
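Growing a dataframe one row at a time works, but it gets slow on large scrapes. A common alternative is to collect plain dicts and build the frame once at the end. The sketch below does that on a hand-written two-entry page; the simplified markup and the name `sample_eats` are assumptions for illustration.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# hypothetical two-entry page; the second entry has no booking div
sample_html = """
<div class="result">
  <span class="rest-row-name-text">Rasika</span>
  <span class="rest-row-meta--location">Penn Quarter</span>
  <div class="rest-row-pricing"><i> $ $ $ </i> $ </div>
  <div class="booking"><span class="tadpole"></span>Booked 101 times today</div>
</div>
<div class="result">
  <span class="rest-row-name-text">Ambar</span>
  <span class="rest-row-meta--location">Capitol Hill</span>
  <div class="rest-row-pricing"><i> $ $ </i> $ $ </div>
</div>
"""

rows = []
sample_soup = BeautifulSoup(sample_html, "html.parser")
for entry in sample_soup.find_all("div", class_="result"):
    # a missing booking div means find() returns None
    booking = entry.find("div", class_="booking")
    match = re.search(r"\d+", booking.get_text()) if booking is not None else None
    pricing = entry.find("div", class_="rest-row-pricing")
    rows.append({
        "name": entry.find("span", class_="rest-row-name-text").get_text(),
        "location": entry.find("span", class_="rest-row-meta--location").get_text(),
        "price": pricing.find("i").get_text().count("$"),
        "bookings": match.group() if match else "NA",
    })

# build the frame once, after the loop, instead of growing it row by row
sample_eats = pd.DataFrame(rows, columns=["name", "location", "price", "bookings"])
print(sample_eats)
```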

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [77]:
# we can send keys as well
# import
from selenium.webdriver.common.keys import Keys

In [None]:
# open Chrome
driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")

In [78]:
# visit Python
driver.get("http://www.python.org")
# verify we're in the right place
assert "Python" in driver.title

In [79]:
# find the search position
elem = driver.find_element_by_name("q")
# clear it
elem.clear()
# type in pycon
elem.send_keys("pycon")


In [80]:
# send those keys
elem.send_keys(Keys.RETURN)
# check that the search returned results
assert "No results found." not in driver.page_source

In [81]:
# close
driver.close()

In [None]:
# all at once:
driver = webdriver.Chrome(executable_path="/Users/mjspeck/Downloads/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html