# Webscraping with Python

## Introduction
This session will introduce web scraping with Python. Along the way we will discuss how web pages are structured and how information is passed from a web browser to a web site. 

## Part 1
We'll begin by loading some essential web scraping modules into our Python environment:
- [Requests](http://docs.python-requests.org/en/master/)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Re](https://docs.python.org/3/library/re.html)
- [Time](https://docs.python.org/3.6/library/time.html#module-time)



In [2]:
# import our modules - if you're using Anaconda, these should all be bundled
import requests
from bs4 import BeautifulSoup
import re
import time


### 1.1 Basics
This first example uses the [AFL-CIO Legislative Alerts](http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts) site, which contains a chronological inventory of official communications sent by the organization to legislators.

In [4]:
# make our first GET request
req = requests.get("http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts")


In [5]:
type(req)
src = req.text
src

'<!DOCTYPE html>\n<html xmlns:fb="http://ogp.me/ns/fb#" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US" class="no-js"> \n<head>    \n  \n\n\t \n\t                             \n    <title>\tLegislative Alerts\t</title>\n\n    \n    \n                <meta name="Content-Type" content="text/html; charset=utf-8" />\n\n            <meta name="Content-language" content="en-US" />\n\n                                                      \n                <meta name="author" content="" />\n    \n                <meta name="copyright" content="" />\n    \n                <meta name="description" content="" />\n    \n                <meta name="keywords" content="" />\n    \n        <meta name="MSSmartTagsPreventParsing" content="TRUE" />\n    <meta name="generator" content="eZ Publish" />\n\t\n\t\n\t\t\t<meta property="og:title" content="Legislative Alerts" />\n<meta property="og:url" content="http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts" />\n<meta prope

In [6]:
type(src)

str
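
Before parsing a response, it is usually worth confirming that the request succeeded. The sketch below wraps the GET request in a small helper. The function name `fetch_html`, the `timeout` value, and the swappable `get` argument are assumptions made for illustration; they are not part of the original session.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the page source for `url` (hypothetical helper, not from the session)."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text

# Usage (makes a real HTTP request, so it is left commented out):
# src = fetch_html("http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts")
```

Passing the `get` function as an argument is only a convenience for trying the helper without touching the network.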

In order to make sense of this data we'll need to learn more about how web pages are structured. 

### 1.2 HTML structure


![HTML <>](https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/HTML5_de_Erick_Dimas.jpg/800px-HTML5_de_Erick_Dimas.jpg)

It's impossible to scrape web sites without knowing something about HTML markup. Because we have limited time, this tutorial will focus on only a small subset of the language. Here's some basic terminology I'll use repeatedly.
- __Elements__
- __Tags__
- __Attributes__
- __Content__

__*Elements*__ are defined by a starting *tag* and a matching closing *tag*; the text between them is the element's __*content*__.

Examples:
```html
<p> This is a paragraph </p>
<h1> This is a heading </h1>
<div> This is a division </div>
```
__*Elements*__ can also be nested
```html
<p> This is a paragraph with <b>bold</b> text</p>
```

__*Attributes*__ provide additional information about an element.

```html
<div class="intro">This division contains the introduction</div>
```

- *div* is the tag name.
- *class* is the attribute name.
- *intro* is the attribute value.
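
As a preview of Part 2, here is a minimal sketch of how these three pieces map onto a Beautiful Soup object. It assumes the built-in `html.parser` rather than the `lxml` parser used later, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div class="intro">This division contains the introduction</div>'
# "html.parser" ships with Python, so no extra parser needs to be installed
snippet = BeautifulSoup(html, "html.parser")
div = snippet.find("div")

print(div.name)      # the tag name
print(div["class"])  # the attribute value; class may hold several values, so it is returned as a list
print(div.text)      # the content
```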

We'll dig a little more deeply into HTML after looking again at the Legislative Alerts page.











### 1.3 Exploring page structure with Chrome
Let's look again at the [Legislative Alerts](http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts) page and explore its HTML elements.
- Right-click (or Ctrl-click on a Mac) on the text "Election 2016" near the top left of the page and choose *Inspect*.
- Look at the panel that opens on the right side (or possibly bottom) of your screen and answer these questions.
1. What tag is the text contained in?
2. In what tag is that one embedded?
3. What other tags are embedded with it?

A powerful feature of Chrome's inspect tool is the ability to identify *sets* of elements. Shortly, we'll learn how important this is to effective scraping. One way to do this is to hover your mouse over individual items in the panel and see what gets highlighted on the page. Another, more powerful way is to assign a new style to a tag. Let's try that with the *div* tag and add a background color.
- Inspect any legislative alert title in the main body of the page.
- Select one of the sets of ```<div>``` elements in the main inspect panel.
- Click the '+' sign in the lower left panel
- A new ```<div>``` item will appear in the lower left panel.
- Click within the curly braces then enter "background-color" or select it from the list that appears and hit enter.
- Choose a color from the list and hit enter again.

Now let's use this information to scrape some content from the page.


## Part 2
### 2.1 Making soup
Beautiful Soup is a popular Python module for navigating and searching through scraped web content. Here's a simple example.

In [9]:
# Getting started with Beautiful Soup
soup = BeautifulSoup(src, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
 <head>
  <title>
   Legislative Alerts
  </title>
  <meta content="text/html; charset=utf-8" name="Content-Type"/>
  <meta content="en-US" name="Content-language"/>
  <meta content="" name="author"/>
  <meta content="" name="copyright"/>
  <meta content="" name="description"/>
  <meta content="" name="keywords"/>
  <meta content="TRUE" name="MSSmartTagsPreventParsing"/>
  <meta content="eZ Publish" name="generator"/>
  <meta content="Legislative Alerts" property="og:title"/>
  <meta content="http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts" property="og:url"/>
  <meta content="AFL-CIO" property="og:site_name"/>
  <meta content="http://www.aflcio.org/extension/aflcio/design/aflcio_user/images/facebook_aflcio_200x200.jpg" property="og:image"/>
  <meta content="non_profit" property="og:type"/>
  <meta content="288636237825618" property="

### 2.2 Find page elements
Beautiful Soup provides functions that allow searching a page by tags and attributes. Let's look again at the elements in the Legislative Alerts page and determine how to identify the list of alerts. We can then use the find_all function to retrieve them.

In [29]:
# Find elements - calling soup(...) directly is shorthand for soup.find_all(...)
alerts = soup("div", class_="ec_statements")
print(type(alerts[0]))
first_alert = alerts[0]
print(first_alert)

<class 'bs4.element.Tag'>
<div class="ec_statements">
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act">Letter to Senators in Support of the Water Resources Development Act</a></div>
<div id="legalert_date">September 14, 2016</div>
</div>
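
The cell above calls the soup object directly, while the text mentions `find_all`. The sketch below, run on invented markup that only mimics the structure of the alerts page, suggests the two spellings return the same matches. The variable names and the sample links are assumptions.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure of the alerts page
html = """
<div class="ec_statements">
  <div id="legalert_title"><a href="/alerts/first">First letter</a></div>
  <div id="legalert_date">September 14, 2016</div>
</div>
<div class="ec_statements">
  <div id="legalert_title"><a href="/alerts/second">Second letter</a></div>
  <div id="legalert_date">July 15, 2016</div>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

by_call = sample("div", class_="ec_statements")
by_find_all = sample.find_all("div", class_="ec_statements")

print(len(by_find_all))                    # number of matching <div> elements
print(list(by_call) == list(by_find_all))  # compare the two result lists
```

Note that the inner `<div>` elements are not matched, because the search is restricted to the `ec_statements` class.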


In [30]:
print(first_alert.text)


Letter to Senators in Support of the Water Resources Development Act
September 14, 2016



Because first_alert is a Tag element, it has a text property.

In [37]:
# Select the title element within the first alert
first_alert.select('div#legalert_title')

[<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act">Letter to Senators in Support of the Water Resources Development Act</a></div>]

This text is obviously much cleaner than what we started with, but we still need to do some more work to parse it out and make it more actionable.

### 2.3 Using selectors
Beautiful Soup structures page content in a tree structure and provides functions that allow searching based on relationships between tags. 
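
To make the tree idea concrete, here is a small sketch that moves between related tags. The markup is invented to resemble a single alert, and the names `tree`, `title_div`, and `date_div` are illustrative.

```python
from bs4 import BeautifulSoup

# Invented markup that resembles one alert on the page
html = """
<div class="ec_statements">
<div id="legalert_title"><a href="/alerts/example-letter">Example letter</a></div>
<div id="legalert_date">September 14, 2016</div>
</div>
"""
tree = BeautifulSoup(html, "html.parser")

link = tree.find("a")
title_div = link.parent                         # the tag that contains the link
date_div = title_div.find_next_sibling("div")   # the next <div> at the same level
container = title_div.parent                    # one more level up the tree

print(title_div["id"])
print(date_div.text)
print(container["class"])
```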

In [None]:
# Review

In [45]:
first_title = first_alert.select('div#legalert_title a')
type(first_title)
len(first_title)
type(first_title[0])
first_title[0].text

'Letter to Senators in Support of the Water Resources Development Act'

These results are closer to what we need, but still need more cleanup. The solution lies in using more specific selectors.

In [57]:
# Selector examples
alert = alerts[0]
title = alert.select('div#legalert_title a')
print(type(title))
print(title[0]['href'])
for link in title:
    print(link['href'])

<class 'list'>
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act


In [53]:
date = alert.select('div#legalert_date')
print(date[0].text)

September 14, 2016


Now that we understand the structure, we can start thinking about how to access the data in our code. It's also critical to remember the following:
- *Tag* objects have a text property. This is how to retrieve *clean* content.
- When the tag has an attribute, the attribute value can be retrieved using the bracket notation.
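
The two points above can be sketched on a one-line snippet. This example uses `select_one`, `get_text`, and `.get`, which are Beautiful Soup calls not used elsewhere in this session. The markup is invented.

```python
from bs4 import BeautifulSoup

html = '<div id="legalert_title"><a href="/alerts/example-letter"> Example letter </a></div>'
sample = BeautifulSoup(html, "html.parser")

link = sample.select_one("div#legalert_title a")  # first match, or None if nothing matches
print(link.get_text(strip=True))  # text with surrounding whitespace removed
print(link["href"])               # bracket notation raises KeyError if the attribute is missing
print(link.get("title"))          # .get() returns None instead of raising
```

Using `.get()` can be a safer choice when not every tag on a page is guaranteed to carry the attribute.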

In [None]:
# Figure out what objects are returned and how to use them

Note that the link is only the path portion of the URL. In order to use it, we'll need to combine it with the site's base URL.

In [58]:
# Fix the urls
print("http://www.aflcio.org/"+title[0]['href'])

http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act
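
The concatenated address above contains a doubled slash, because the base ends with `/` and the link starts with one. One alternative is `urljoin` from the standard library, which resolves a link against a base URL. This is a sketch of that option, not the approach used in the rest of the session; the name `site_root` is illustrative.

```python
from urllib.parse import urljoin

site_root = "http://www.aflcio.org/"
href = "/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act"

# urljoin resolves the link relative to the base, so no slash is duplicated
full_url = urljoin(site_root, href)
print(full_url)
```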


### 2.4 Iteration
Typically, scraping involves pulling multiple pieces of content from a page. Because BeautifulSoup creates iterable structures out of page contents, it's easy to use standard Python list constructs to extract data.

In [59]:
# Extract all the titles and links and dates from all the alerts (start by copying and pasting the code above)
# Use a dict structure to capture the different elements
base_url = "http://www.aflcio.org/"
alerts = soup.find_all('div', class_ = "ec_statements")
results = []
for alert in alerts:
    item = {}
    title = alert.select('div#legalert_title a')
    date = alert.select('div#legalert_date')
    item['title'] = title[0].text
    item['url'] = base_url + title[0]['href']
    item['date'] = date[0].text
    results.append(item)

In [62]:
results

[{'date': 'September 14, 2016',
  'title': 'Letter to Senators in Support of the Water Resources Development Act',
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-in-Support-of-the-Water-Resources-Development-Act'},
 {'date': 'July 15, 2016',
  'title': 'Letter to Representatives in opposition to legislation that would rob millions of workers of overtime pay protection.',
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-in-opposition-to-legislation-that-would-rob-millions-of-workers-of-overtime-pay-protection'},
 {'date': 'June 21, 2016',
  'title': "Letter in Support of Barack Obama's Veto of DOL Retirement Rule Override",
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-in-Support-of-Barack-Obama-s-Veto-of-DOL-Retirement-Rule-Override'},
 {'date': 'June 13, 2016',
  'title': 'Myths v Facts Response - ILRWG H-2B Approps FY 2017',
  'url': 'http://www
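
The dates in these results are plain strings. If they need to be sorted or compared, one option is to parse them with `datetime.strptime`. The sketch below uses three entries copied from the output above and assumes English month names; `sample_results` and `parsed_date` are names introduced here for illustration.

```python
from datetime import datetime

# A few entries copied from the results above (urls omitted for brevity)
sample_results = [
    {'date': 'September 14, 2016',
     'title': 'Letter to Senators in Support of the Water Resources Development Act'},
    {'date': 'June 21, 2016',
     'title': "Letter in Support of Barack Obama's Veto of DOL Retirement Rule Override"},
    {'date': 'June 13, 2016',
     'title': 'Myths v Facts Response - ILRWG H-2B Approps FY 2017'},
]

for item in sample_results:
    # %B = full month name, %d = day of the month, %Y = four-digit year
    item['parsed_date'] = datetime.strptime(item['date'], '%B %d, %Y').date()

oldest_first = sorted(sample_results, key=lambda item: item['parsed_date'])
print([item['date'] for item in oldest_first])
```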

## Part 3
### 3.1 Multi-step scraping
What if we want to download the actual text of, say, the first 10 letters on the list? We'll need to do the following.
1. Find the particular list of alerts we want to scrape.
2. Gather the links to the individual letter pages.
3. Use those links to fetch the contents of each letter page.
4. Select the element that holds the text we need.
5. Save the text to a file.

### 3.2 Using functions
Functions are a useful approach to organizing the different parsing stages of a multi-step scrape. We'll write two functions for this example. One will retrieve a list of links for a given *year*. The other will retrieve the text of the letter.

In [70]:
def get_alerts(year):
    ## you'll need to look at how the site url changes to figure out how to handle the year parameter

    alerts_url = "http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts/(y)/" + str(year)
    req = requests.get(alerts_url)
    src = req.text
    # name the parser explicitly; omitting it makes Beautiful Soup emit a warning
    soup = BeautifulSoup(src, 'lxml')

    base_url = "http://www.aflcio.org/"
    alerts = soup.find_all('div', class_="ec_statements")
    results = []
    for alert in alerts:
        item = {}
        title = alert.select('div#legalert_title a')
        date = alert.select('div#legalert_date')
        item['title'] = title[0].text
        item['url'] = base_url + title[0]['href']
        item['date'] = date[0].text
        results.append(item)
    return results

        


In [73]:
get_alerts(2015)





[{'date': 'December 17, 2015',
  'title': 'Letter to Representatives expressing views on the Consolidated Appropriations Act of 2016',
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-expressing-views-on-the-Consolidated-Appropriations-Act-of-2016'},
 {'date': 'December 03, 2015',
  'title': "Letter to Representatives urging them to support the Fixing America's Surface Transportation (FAST) Act (H.R. 22)",
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-urging-them-to-support-the-Fixing-America-s-Surface-Transportation-FAST-Act-H.R.-22'},
 {'date': 'November 30, 2015',
  'title': 'Letter to Sens. Grassley and Durbin in support of the H-1B and L-1 Visa Reform Act of 2015 (S. 2266)',
  'url': 'http://www.aflcio.org//Legislation-and-Politics/Legislative-Alerts/Letter-to-Sens.-Grassley-and-Durbin-in-support-of-the-H-1B-and-L-1-Visa-Reform-Act-of-2015-S.-2266'},
 {'date': 'November 1

In [74]:
links = get_alerts(2015)





In [78]:
for link in links:
    print(link['title'])

Letter to Representatives expressing views on the Consolidated Appropriations Act of 2016
Letter to Representatives urging them to support the Fixing America's Surface Transportation (FAST) Act (H.R. 22)
Letter to Sens. Grassley and Durbin in support of the H-1B and L-1 Visa Reform Act of 2015 (S. 2266)
Letter to Representatives urging them to vote against the American SAFE Act of 2015 (H.R. 4038)
Letter to Representatives supporting the Surface Transportation Reauthorization and Reform Act
Letter to Representatives urging them to oppose the "Protecting Local Business Opportunity Act"
Letter to Representatives urging them to reauthorize the Export-Import Bank
Letter to Representatives urging opposition to the Restoring Americans' Healthcare Freedom Reconciliation Act
Letter to Representatives urging opposition to the misnamed Retail Investor Protection Act
Letter to House Transportation Committee commending its efforts in crafting the Surface Transportation Reauthorization and Reform A

In [94]:
# Retrieve the text of a single letter

def get_letter(url):
    # fetch the letter page and parse it
    req = requests.get(url)
    src = req.text
    soup = BeautifulSoup(src, 'lxml')
    # the letter text sits in the div with class "attribute-body"
    body = soup('div', class_="attribute-body")
    text = body[0].text

    return text


In [95]:
# Try this
links = get_alerts("2015")
print(get_letter(links[5]['url']))


Dear Representative:On behalf of the AFL-CIO, a federation of 56 national and international unions representing more than 12 million working men and women across the United States, I am writing to urge you to oppose H.R. 3459, the “Protecting Local Business Opportunity Act.”H.R. 3459 overturns the National Labor Relations Board’s (NLRB) recent decision in Browning-Ferris Industries (August 27, 2015) and substitutes a confusing and restrictive test for finding two employers to be “joint employers” under the National Labor Relations Act. The legislation is misguided and will undermine the ability of workers to speak up together for higher wages and better working conditions.In Browning-Ferris¸ the NLRB decided that its previous decisions had taken an overly narrow view of the joint employer issue, resulting in situations where workers could not join together for improved employment conditions. Browning-Ferris contracted with a staffing agency—Leadpoint Business Solutions—to supply worke





### 3.3 Tying it all together
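
One way to combine the pieces is sketched below: loop over the links, fetch each letter, pause between requests, and write the text to a file. The helper names `slugify` and `save_letters`, the output folder, the file-naming scheme, and the one-second delay are all assumptions made for illustration. The fetching function is passed in as an argument so that `get_letter` from above can be plugged in.

```python
import re
import time
from pathlib import Path

def slugify(title, max_length=60):
    """Turn a letter title into a safe file name (hypothetical helper)."""
    slug = re.sub(r'[^A-Za-z0-9]+', '-', title).strip('-').lower()
    return slug[:max_length]

def save_letters(links, fetch_letter, out_dir, limit=10, delay=1.0):
    """Fetch up to `limit` letters and write each one to its own text file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for item in links[:limit]:
        text = fetch_letter(item['url'])
        target = out_path / (slugify(item['title']) + '.txt')
        target.write_text(text, encoding='utf-8')
        saved.append(target)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return saved

# Usage with the functions defined above (makes real HTTP requests):
# links = get_alerts(2015)
# save_letters(links, get_letter, 'letters', limit=10)
```

Titles that differ only after the first 60 characters would map to the same file name under this scheme, so a real script might add the date or an index to the name.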

## Part 4
We've examined relatively straightforward scraping examples. Here are a few pointers on some more advanced scraping scenarios you are likely to eventually encounter.
### 4.1 Advanced text cleaning
Regular expressions often play an important role in web scraping. We don't have enough time to fully cover the topic, so let's look at one practical application. Imagine we need to capture numeric data embedded in list entries, e.g. a download count that follows each title. We can use a regular expression to parse out the clean title and the count. Here's an example based on entries from another website, Project Gutenberg's Top 100 page: https://www.gutenberg.org/browse/scores/top

In [102]:
# Split a list entry into its title and its download count
full_title = "Pride and Prejudice by Jane Austen (1340)"
m = re.search(r'(.+)\s\((\d+)\)', full_title)
details = {}
if m:
    details['all'] = m.group(0)
    details['title'] = m.group(1)
    details['count'] = m.group(2)
print(details)
print(details['all'])
print(details['title'])
print(details['count'])

{'title': 'Pride and Prejudice by Jane Austen', 'all': 'Pride and Prejudice by Jane Austen (1340)', 'count': '1340'}
Pride and Prejudice by Jane Austen (1340)
Pride and Prejudice by Jane Austen
1340
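
The same idea can be wrapped in a reusable function. The sketch below is a variation on the cell above, not a replacement for it: it uses named groups, converts the count to an integer, and returns `None` when an entry has no count. The names `ENTRY_PATTERN` and `parse_entry` are illustrative.

```python
import re

# title, whitespace, then a count in parentheses at the very end of the string
ENTRY_PATTERN = re.compile(r'^(?P<title>.+)\s\((?P<count>\d+)\)$')

def parse_entry(entry):
    """Return {'title': str, 'count': int}, or None if the entry has no count."""
    m = ENTRY_PATTERN.match(entry)
    if m is None:
        return None
    return {'title': m.group('title'), 'count': int(m.group('count'))}

print(parse_entry("Pride and Prejudice by Jane Austen (1340)"))
print(parse_entry("A title without a count"))
```

Checking for `None` before using the result avoids the error that would otherwise occur when an entry does not match the pattern.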


### 4.2 Authentication

The requests module can handle some forms of authentication, i.e. logins, but it requires more coding than we can get into in this workshop. I refer you to the [documentation](http://docs.python-requests.org/en/master/user/authentication/) for guidance on this issue.
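
As a small taste, here is a sketch of the simplest case, HTTP Basic authentication. The URL and the credentials are hypothetical, and the request is only prepared, not sent, so that the resulting header can be inspected. Many sites use login forms and cookies instead, which this sketch does not cover.

```python
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical URL and credentials, for illustration only
request = requests.Request(
    "GET",
    "https://example.com/members-only",
    auth=HTTPBasicAuth("user", "pass"),
)
prepared = request.prepare()  # builds the request without sending it

print(prepared.method)
print(prepared.headers["Authorization"])  # the encoded credentials travel in this header
```

When actually sending such a request, `requests.get(url, auth=("user", "pass"))` is the usual shorthand for Basic authentication.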

### 4.3 JavaScript-heavy sites

Some sites generate content based on user actions such as menu selections or other types of "clicks". The Python [selenium](http://selenium-python.readthedocs.io/) module can help with these types of scraping applications.