# Web Scraping with Beautiful Soup

We've covered APIs, but what happens when the information we want from the internet isn't already provided through one? We can use web scraping to collect it directly from the HTML code of each web page.


## What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)


## What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.facebook.com/robots.txt
- Any other sites we should check out?
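
As a minimal sketch of how a scraper can respect robots.txt, Python's standard library includes `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent name, and the example.com URLs below are all made up for illustration; a real script would load the site's actual file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# 'my-scraper' is a hypothetical user agent name
print(parser.can_fetch("my-scraper", "http://example.com/index.html"))
print(parser.can_fetch("my-scraper", "http://example.com/private/data.html"))

# Seconds the site asks us to wait between requests (None if not specified)
print(parser.crawl_delay("my-scraper"))
```

If a crawl delay is given, calling `time.sleep()` for that many seconds between requests is one simple way to avoid scraping too fast.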



## What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at some example HTML: Twitter
- Introduce Google Chrome's Inspect tool
- What can you tell me about the language?


### HTML code can get complicated:

In [None]:
from IPython.display import Image
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')


### Fortunately, Chrome has some great developer tools that let us look at the structure of pages without needing to read or understand HTML

## How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's try it out on twitter.com

## How do I webscrape?

We will be using two new libraries for our web scraping:
- **requests** - lets us download the HTML code in Python, like a web browser would
- **BeautifulSoup** - lets us search and navigate the HTML efficiently and easily

Web scraping with Python can be broken down into a few simple steps. First, we need to access and 'download' the page that we want to scrape.

We want to scrape [this example site](http://econpy.pythonanywhere.com/ex/001.html).


In [1]:
import requests 

In [2]:
html = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
#Can someone explain what this is doing? What does the html variable now contain?
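
`requests.get` returns a `Response` object rather than a plain string: the status code is in `.status_code` and the page source is in `.text`. A slightly more defensive version of the download step might look like the sketch below. The `fetch_html` helper name and the 10-second timeout are our own choices for illustration, not part of the lesson code.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# page_source = fetch_html('http://econpy.pythonanywhere.com/ex/001.html')
# print(page_source[:200])
```

Failing loudly on an error status means we don't accidentally parse an error page as if it were the data we wanted.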

In [None]:
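# convert the HTML text into a structured Soup object
from bs4 import BeautifulSoup
# naming the parser explicitly avoids a warning and keeps results consistent across machines
b = BeautifulSoup(html.text, 'html.parser')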

In [None]:
# what does b look like?
print(b)

In [None]:
# we can make it prettier:

print(b.prettify())

# This looks slightly better, but it's still pretty hard to interpret. How could I find the item I'm looking for?

## Finding items in the site

Now that the page's HTML is stored in a BeautifulSoup object, we can use its methods to find things on the page for us.

In [None]:
# the 'find' method returns the first matching Tag (and everything inside of it)
print(b.find(name='body'))

In [None]:
# .text returns the text without the surrounding tags
print(b.find(name='body').text)
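
To see the difference between a Tag and its text without hitting the network, here is a small sketch on a made-up HTML snippet (the snippet and the `soup` and `body` names are invented for illustration). It also shows that `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this example
snippet = "<html><body><h1>Sales</h1><p>Total: <b>3</b> items</p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

body = soup.find("body")    # first matching Tag
print(type(body).__name__)  # the class name of the returned object
print(body)                 # the tag with all of its nested markup
print(body.text)            # only the text, with the tags stripped out

# find returns None when no tag matches, so check before using .text
print(soup.find("table"))
```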

`find_all` returns a list of all matching tags.

In [None]:
print(b.find_all('div'))

In [None]:
b.find_all('div', title='buyer-name')
# keyword arguments filter on tag attributes: here, only div tags whose title is 'buyer-name'

In [3]:
# We can use a for loop to select just the text
for i in b.find_all('div', title='buyer-name'):
    print(i.text)
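
The same loop can collect the names into a list instead of printing them. This is a sketch on made-up HTML that assumes the page marks names with `div` tags titled `buyer-name`, as the cells above do; the sample names and the `buyer-info` div are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example
sample_html = """
<div title="buyer-name">Ada Lovelace</div>
<span class="item-price">$10.00</span>
<div title="buyer-info">Repeat customer</div>
<div title="buyer-name">Alan Turing</div>
<span class="item-price">$25.50</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Only div tags whose title attribute equals 'buyer-name' are returned
name_tags = soup.find_all("div", title="buyer-name")

# A list comprehension pulls the text out of each Tag
names = [tag.text for tag in name_tags]
print(names)
```

The filter can also be written as `soup.find_all("div", attrs={"title": "buyer-name"})`, which is handy when an attribute name is not a valid Python keyword argument.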


In [None]:
# Now I have a list of all names on the page
# What if I wanted a list of the prices?
for i in b.find_all('span'):
    print(i.text)
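
Selecting every `span` works only if prices are the only spans on the page. A safer sketch narrows the search by class and converts the text to numbers. The `item-price` class name and the sample HTML are assumptions made for illustration, so check the real page with the Inspect tool first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example
sample_html = """
<span class="page-note">Prices in USD</span>
<span class="item-price">$10.00</span>
<span class="item-price">$25.50</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class is a reserved word in Python, so BeautifulSoup uses class_ instead
price_tags = soup.find_all("span", class_="item-price")

# Strip whitespace and the leading dollar sign, then convert to float
prices = [float(tag.text.strip().lstrip("$")) for tag in price_tags]
print(prices)
```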

In [None]:
# Pair practice (group work):
# Build a function that uses the statements above to create a dictionary pairing each name with its price:



