<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

_Author: Dave Yerrington (SF)_

---

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss limitations associated with the simple requests and urllib libraries
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)


- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using requests and Beautiful Soup. We will also explore how to use Selenium, a tool that automates a real browser (optionally in headless mode).

We'll begin by scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each piece of information we're interested in.

---

<a id="building-scraper"></a>
## Building a web scraper

Now, let's build a web scraper for OpenTable using requests and Beautiful Soup:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url)

At this point, what is in html?

In [3]:
# .text returns the response body decoded to a Unicode string
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

We will need to convert this HTML text into a soup object so we can parse it using Python and Beautiful Soup (bs4).

In [4]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')

In [5]:
soup.prettify()
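
To get a feel for what a soup object offers without depending on the live page, here is a minimal sketch that parses a small hand-written HTML string. The markup and the names `sample_html` and `sample_soup` are invented for illustration and are not taken from OpenTable.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only
sample_html = """
<html><head><title>Sample Listings</title></head>
<body><div class="result"><span class="example-name">Example Cafe</span></div></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Navigate by tag name: the first <title> tag and the string inside it
page_title = sample_soup.title.string
print(page_title)

# prettify() returns a str with the tags indented by nesting depth
pretty = sample_soup.prettify()
print(pretty)
```

Wrapping the call in `print()` is usually easier to read in a notebook than the raw string that `soup.prettify()` displays on its own.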



<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Let's first find each restaurant name listed on the page we've loaded. How do we locate the restaurant name within the page? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/washington-dc-restaurant-listings

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.
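
Once Inspect has shown you a tag name and a class, that pair translates directly into a Beautiful Soup query. The sketch below runs on a made-up snippet (the class `example-name` is a placeholder, not a class from the OpenTable page) and shows two equivalent ways to express the same lookup: `find_all` with an `attrs` dictionary, and `select` with a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; 'example-name' stands in for whatever class Inspect shows you
html_snippet = (
    '<div>'
    '<span class="example-name">Example Cafe</span>'
    '<span class="other">not a name</span>'
    '</div>'
)
snippet_soup = BeautifulSoup(html_snippet, 'html.parser')

# Option 1: tag name plus an attribute filter
by_find_all = snippet_soup.find_all('span', attrs={'class': 'example-name'})

# Option 2: the same lookup written as a CSS selector (tag.class)
by_select = snippet_soup.select('span.example-name')

names_a = [tag.text for tag in by_find_all]
names_b = [tag.text for tag in by_select]
print(names_a, names_b)
```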

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [6]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Mozell Hermiston</span>,
 <span class="rest-row-name-text">1474 Feest</span>,
 <span class="rest-row-name-text">Summit</span>,
 <span class="rest-row-name-text">Georgianna Bosco</span>,
 <span class="rest-row-name-text">71 McGlynn</span>,
 <span class="rest-row-name-text">Buddys</span>,
 <span class="rest-row-name-text">Autem Forges</span>,
 <span class="rest-row-name-text">Sint Underpass</span>,
 <span class="rest-row-name-text">Kiara Grant</span>,
 <span class="rest-row-name-text">Delectus Vista</span>,
 <span class="rest-row-name-text">Angelinas</span>,
 <span class="rest-row-name-text">772 Mertz</span>,
 <span class="rest-row-name-text">Drives</span>,
 <span class="rest-row-name-text">Andreane Estates</span>,
 <span class="rest-row-name-text">Giles Cliff</span>,
 <span class="rest-row-name-text">653 Kiehn</span>,
 <span class="rest-row-name-text">Forrest Ford</span>,
 <span class="rest-row-name-text">Senger</span>,
 <span class="rest-row-name-text"

It is important to always keep in mind the data types that were returned. Note this is a `list`, and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, note the elements of the list are `Tag` objects, not strings. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, being an object, it has many attributes and methods that we can use. For example, next we will use the `.text` attribute to return the tag's text content as a Python string.
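
The points about return types can be checked directly. This is a small sketch on an inline snippet modeled on the output above; the restaurant names in it are placeholders.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline snippet modeled on the output above; the names are placeholders
html_snippet = (
    '<span class="rest-row-name-text">Example Cafe</span>'
    '<span class="rest-row-name-text"> Sample Bistro </span>'
)
snippet_soup = BeautifulSoup(html_snippet, 'html.parser')
tags = snippet_soup.find_all(name='span', attrs={'class': 'rest-row-name-text'})

# find_all returns a ResultSet, which is a subclass of list
print(isinstance(tags, list))

# Each element is a Tag object, not a str
print(isinstance(tags[0], Tag))

# .text gives the text inside the tag; get_text(strip=True) also trims whitespace
first_text = tags[0].text
second_text = tags[1].get_text(strip=True)

# Attributes are read like dictionary keys; class is multi-valued, so it comes back as a list
first_classes = tags[0]['class']
print(first_text, second_text, first_classes)
```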

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Now that we found a list of tags containing the restaurant names, let's think how we can loop through them all one-by-one. In the following cell, we'll print out the name (and **only** the clean name, not the rest of the html) of each restaurant.

In [7]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

Mozell Hermiston
1474 Feest
Summit
Georgianna Bosco
71 McGlynn
Buddys
Autem Forges
Sint Underpass
Kiara Grant
Delectus Vista
Angelinas
772 Mertz
Drives
Andreane Estates
Giles Cliff
653 Kiehn
Forrest Ford
Senger
Agloe Bar & Grill
Occaecati River
Konopelski
Loyals
Billys
Possimus Kuphal
Eveniet Roob
McDermott Station
Sammys
Fletas
Hilpert
Voluptas Carroll
Viviennes
Beatae Dietrich
744 Beatty
Batz
Et Court
288 Langworth
Quia
Vivienne Weissnat
Iure Schumm
858 Welch
Spring
Heathcote
Voluptas Terrace
Jacobi Freeway
Nikko Predovic
Island
590 Zieme
Vel Russel
1405 Jaskolski
Arlo Morar
Mollitia Pfannerstill
Islands
Lillie Villages
Sawayn
Bins Junction
Unde Shields
1500 Rosenbaum
Ex
Sint Extensions
Luettgen
Skylar Treutel
Hic O'Connell
Glen Lakes
Kaceys
Vel Springs
553 Legros
In
Cupiditate
Pouros
Quis Parks
Toy Plain
Voluptatem Circles
Gerhold
Common
Lolita Harvey
Schumm Neck
Est Pollich
Totam Feest
Voluptatum
Et
Sunt Gateway
Nicholas Hilll
Willms
Pariatur Unions
Carols
Quia Rue
Maxime
Delectus 

Great!

<a id="retrieving-locations"></a>
#### Challenge: Retrieving the restaurant locations

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [7]:
# first, see if you can identify the location for all elements -- print it out


In [8]:
# now print out EACH location for the restaurants


<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [8]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span>

In [9]:
# print out EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?
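
One possible approach, sketched here on an inline snippet modeled on the output above rather than on the live page, is to count the `$` characters in the text of the nested `<i>` tag with `str.count`. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Two rows modeled on the pricing markup shown above
html_snippet = """
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $    </span></div>
"""
snippet_soup = BeautifulSoup(html_snippet, 'html.parser')

price_counts = []
for entry in snippet_soup.find_all('div', {'class': 'rest-row-pricing'}):
    # Only the <i> tag holds the real price; the <span> holds the greyed-out remainder
    price_tag = entry.find('i')
    price_counts.append(price_tag.text.count('$'))

print(price_counts)
```

Counting characters sidesteps the irregular whitespace between the dollar signs, so no cleanup of the string is needed first.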

In [18]:
# print the number of dollar signs per restaurant


That's weird -- an empty set. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?
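
One way to start debugging, offered as a sketch rather than the definitive answer, is to check whether the class you saw in the browser's Inspect panel exists at all in the raw HTML that `requests` downloaded. If it does not, the element was probably added later by JavaScript, which `requests` never runs. The helper `diagnose_empty_result` and the class `example-bookings` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

def diagnose_empty_result(html_text, tag_name, class_name):
    """Summarize whether a tag/class pair is present in downloaded HTML."""
    soup = BeautifulSoup(html_text, 'html.parser')
    matches = soup.find_all(tag_name, attrs={'class': class_name})
    return {
        'match_count': len(matches),                       # what find_all actually found
        'class_in_raw_html': class_name in html_text,      # does the class appear anywhere?
        'script_tag_count': len(soup.find_all('script')),  # scripts hint at client-side rendering
    }

# Hypothetical page whose content is filled in by JavaScript after loading
static_html = '<div id="app"></div><script>/* renders the page in the browser */</script>'
missing_report = diagnose_empty_result(static_html, 'div', 'example-bookings')

# Hypothetical page where the same element is already in the HTML
rendered_html = '<div class="example-bookings">Booked 5 times today</div>'
present_report = diagnose_empty_result(rendered_html, 'div', 'example-bookings')

print(missing_report)
print(present_report)
```

A result with zero matches and no occurrence of the class in the raw text suggests the selector is fine and the content simply is not in the response.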

### My Web Scraper

In [None]:
import requests
import warnings
from bs4 import BeautifulSoup
from operator import itemgetter

In [None]:
### GRAB RELEVANT URLS

# Mask warnings
warnings.filterwarnings("ignore", category = UserWarning, module = 'bs4')

# Initialize the page number and set the search query
page = 0
query = '[your query]'
query = query.replace(' ', '+')

# Set desktop header for Google search
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0'
USER_AGENT_URL = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'

# Make web requests with a header
headers = {'user-agent' : USER_AGENT}
headers_url = {'user-agent' : USER_AGENT_URL}

# Parse desktop anchor links, article titles and save to results
url_results = []
### REMEMBER TO CHANGE RANGE ACCORDING TO RETURNED GOOGLE PAGES
for p in range(1,26):
    # 'start' is the result offset; assuming 10 results per page, page p begins at (p - 1) * 10
    page = (p - 1) * 10
    URL = 'https://www.google.com/search?q={}&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(query,page)
    resp = requests.get(URL, headers = headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        heading = g.find('h3')
        # Skip result blocks that lack either a link or a title
        if anchors and heading:
            link = anchors[0]['href']
            title = heading.text
            item = {"title": title, "link": link}
            url_results.append(item)

# Get the urls from the list of {title, link} dictionaries
urls = list(map(itemgetter('link'), url_results))
print(urls) 
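
The search URL above is assembled with `str.format` and a manual `replace(' ', '+')`. As an alternative sketch, the standard library's `urlencode` can build the query string and handle the escaping. The helper `build_search_url` is a hypothetical name, and the code assumes that `start` is a result offset that advances in steps of 10; the extra tracking parameters from the URL above are left out.

```python
from urllib.parse import urlencode

def build_search_url(query, start):
    """Build a search URL from a plain-text query and a result offset."""
    # urlencode escapes the values; spaces in the query become '+'
    params = {'q': query, 'start': start}
    return 'https://www.google.com/search?' + urlencode(params)

# Assumed paging scheme: offsets 0, 10, 20 for the first three pages
search_urls = [build_search_url('web scraping', start) for start in range(0, 30, 10)]
for search_url in search_urls:
    print(search_url)
```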

In [None]:
### PARSE CORPORA SUBHEADINGS AND TEXT FROM URLS
content_results = []

# Loop through urls to grab all subheading and paragraph content
for url in urls:
    try:
        response = requests.get(url, headers = headers_url)
        soup_maker = BeautifulSoup(response.text, 'html.parser')
        # Note: find('div') returns only the first <div> on the page
        first_div = soup_maker.find('div')
        article_sub = first_div.find_all('h2')
        article = first_div.find_all('p')
        for subheading in article_sub:
            content_results.append('\n' + ''.join(subheading.find_all(string = True)))
        for paragraph in article:
            content_results.append('\n' + ''.join(paragraph.find_all(string = True)))
    # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the loop
    except Exception:
        print('There was an error trying to scrape this link: ' + url)

# Write the results to a file; the with block closes it automatically
with open('[file name].txt', 'w+') as myfile:
    myfile.write(str(content_results))
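
The cell above saves `str(content_results)`, which is the list's printed form, including the brackets, quotes, and escaped `\n` sequences. If plain readable text is preferred, one option is to join the pieces before writing. This is a sketch with placeholder data and a temporary output path, both invented for illustration.

```python
import os
import tempfile

# Placeholder data in the same shape as content_results (each piece starts with '\n')
content_results = ['\nFirst subheading', '\nFirst paragraph of text.']

# Temporary location used only for this illustration
out_path = os.path.join(tempfile.mkdtemp(), 'scraped_text.txt')

# Join the pieces into one string so the file holds plain text, not a list repr
with open(out_path, 'w', encoding='utf-8') as outfile:
    outfile.write(''.join(content_results))

with open(out_path, encoding='utf-8') as infile:
    saved_text = infile.read()

print(saved_text)
```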

#### Challenge: Retrieving content from all urls of a searched query

Can you repeat the process for a query you want to search?


In [None]:
# Grab the relevant URLs on each page of Google


In [None]:
# Parse the URLs for text content


### Summary

In this lesson, we used the Beautiful Soup library to locate elements on a website and then scrape their text. We also looked at Selenium, a browser automation tool that can run a page's JavaScript before the page contents are retrieved.