# Web Scraping Demo

# Overview

This notebook demonstrates some basic use cases for web scraping in Python. Although these examples are fairly simple, you can extend the same techniques to hairier web scraping scenarios!

The packages we will be using are:

* **`urllib2`** - a built-in Python 2 package to make HTTP requests and pull requested web data into Python (in Python 3, the same functionality lives in `urllib.request`).
* **`bs4` (Beautiful Soup 4)** - a very powerful Python package for processing text downloaded from the web.
* **`pandas`** - a package for formatting and analyzing data in Python using "data frames".
* **`tweepy`** - a Python package that wraps the Twitter API.

We will be demonstrating 3 examples:

1. **Scraping online voter records.**
2. **Scraping Wikipedia.**
3. **Scraping Twitter.**

Let's get coding!

## 1. Scraping online voter records

The state of **Oregon** has an amazing open data portal:

    https://data.oregon.gov
    
Amongst other things, we can find the records of registered voter counts by county and party... sounds like we could use that for something. Let's get scraping!

**Import relevant libraries** 

In [None]:
import urllib2   # HTTP requests
import bs4       # Parsing our web data
import pandas    # Data formatting and analysis

**Reading in data**

We first want to download our web page data, which we do using the function **`urlopen()`** from the **`urllib2`** package.

We can then feed in this raw data and create our key object - the `soup`, which allows us to intelligently search our huge XML text blob.

In [None]:
url = "https://data.oregon.gov/views/c5a8-vfhd/rows.xml"

In [None]:
response = urllib2.urlopen(url)

In [None]:
xml_text = response.read()
soup = bs4.BeautifulSoup(xml_text, "lxml")
print soup.prettify()[:200]
print "..."
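`urllib2` only exists in Python 2. For readers on Python 3, a minimal sketch of the same download step might look like the following. The helper name `fetch_bytes` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import urllib.request  # Python 3 home of urlopen()


def fetch_bytes(url, timeout=10):
    """Download `url` and return the raw response body as bytes."""
    # `fetch_bytes` and the default timeout are hypothetical choices for this sketch.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

The result could then be handed to `bs4.BeautifulSoup(...)` in the same way as `xml_text` above, for example `xml_text = fetch_bytes(url)`.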

**Parsing our web data**

The key methods available on any object that holds parsed HTML/XML tags are:
* **`myTagsObj.findAll(<id>)`** - returns a list of all XML tags contained in **`myTagsObj`** whose tag name matches `<id>`. (`findAll` is the older spelling of `find_all`, which is used later in this notebook.)
* **`myTagsObj.findChildren()`** - returns all XML tags contained in **`myTagsObj`** (all levels down by default, or just one level down with `recursive=False`).
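To see how these two methods behave, here is a small sketch on a toy XML string. The XML is invented for illustration, and the outer `<row>` wrapper is an assumption about the feed's layout rather than something the notebook states. The sketch uses the `find_all` spelling and the built-in `"html.parser"` so that it runs without `lxml`.

```python
import bs4

# Toy XML invented for this sketch; the outer <row> wrapper is an assumption
# about the feed's layout, not something the notebook states.
sample_xml = """
<response>
  <row>
    <row><county>Baker</county><party>Democrat</party></row>
    <row><county>Benton</county><party>Republican</party></row>
  </row>
</response>
"""

soup = bs4.BeautifulSoup(sample_xml, "html.parser")

# find_all() is the current spelling of findAll(); it searches every level,
# so it returns the wrapper as well as the two records inside it.
rows = soup.find_all("row")

# Skip the wrapper and look at the first real record.
record = rows[1]

# With no arguments, find_all() returns every tag nested inside `record`.
fields = [(tag.name, tag.text) for tag in record.find_all()]

# recursive=False restricts the search to direct children only.
wrapper_children = rows[0].find_all(recursive=False)
```

Here `fields` holds one `(name, text)` pair per entry, which is the same information the loop over `voter_entries` prints.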

Let's use these methods to extract the title (`.name`) and value (`.text`) of each entry for a given row in this XML file:

In [None]:
rows = soup.findAll("row", recursive=True)
# rows[0] is left out of the count; the data rows start at rows[1]
print "Number of rows:", len(rows) - 1

In [None]:
example_row = rows[1]
voter_entries = example_row.findChildren()
for entry in voter_entries:
    print entry.name + ": " + entry.text

In [None]:
entry_names = []
for entry in voter_entries:
    entry_names.append(entry.name)
print entry_names

**Saving and analyzing**

OK, great! Now that we have our data, we want to store it in a nice format, either for immediate analysis or to save as a CSV file for later work.

Turns out that the **`pandas`** library is perfect for this, letting us nicely save our data into a data frame:

In [None]:
import pandas as pd

# rows[0] is skipped; the voter records start at rows[1]
voters = rows[1:]

# Collect the text of every entry, one list per voter row
all_values = []
for voter in voters:
    voter_entries = voter.findChildren()
    entry_values = [entry.text for entry in voter_entries]
    all_values.append(entry_values)

# Our data frame (df), built in one step instead of appending row by row
df = pd.DataFrame(all_values, columns=entry_names)

In [None]:
from IPython.display import display
display(df)  # display() renders the frame itself and returns None, so no print is needed

... And there we have it! We can now save this to a CSV file for later analysis.

In [None]:
df.to_csv("myOregonVoters.csv")
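Because every value was taken from `.text`, the columns of `df` hold strings, so any numeric column needs converting before it can be summed or compared. The sketch below shows the idea on toy data. The column names and numbers are hypothetical and do not come from the Oregon feed.

```python
import pandas as pd

# Toy rows with hypothetical column names; the real feed's columns may differ.
toy = pd.DataFrame(
    [
        ["Baker", "Democrat", "10"],
        ["Baker", "Republican", "25"],
        ["Benton", "Democrat", "40"],
    ],
    columns=["county", "party", "voters"],
)

# Scraped values arrive as strings, so convert before doing arithmetic.
toy["voters"] = pd.to_numeric(toy["voters"])

# Total voters per party, then the party with the largest total.
totals = toy.groupby("party")["voters"].sum()
largest_party = totals.idxmax()
```

The same pattern (convert, group, aggregate) is one possible starting point for the exercise below.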

**Advanced Exercise:** In 2016, which party in Oregon had the most registered voters? Use pandas to calculate this!

See docs: http://pandas.pydata.org/

## 2. Scraping Wikipedia

Alright, now let's say there's a Wikipedia page that contains some juicy data that you want to parse. Perhaps it's the page of your home state with a nice statistics table. Or perhaps the page of a U.S. Senator you'd like to add to your personal Congress database.

Let's take the page of our favorite Senator. We want to keep an eye on them, so we're going to scrape some basic background information from the summary on their Wikipedia page:

In [None]:
import bs4
import urllib2

In [None]:
d = urllib2.urlopen("https://en.wikipedia.org/wiki/Chris_Coons")
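Wikimedia's User-Agent policy asks automated clients to identify themselves, so it can be worth sending a descriptive `User-Agent` header with the request. A minimal sketch for Python 3 follows. The agent string is a placeholder, and the download itself is left commented out so the block runs without network access.

```python
import urllib.request

# Placeholder identity string; replace it with your own project name and contact.
USER_AGENT = "web-scraping-demo/0.1 (contact: you@example.com)"

request = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Chris_Coons",
    headers={"User-Agent": USER_AGENT},
)

# Uncomment to perform the download (requires network access):
# d = urllib.request.urlopen(request)
```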

In [None]:
soup = bs4.BeautifulSoup(d.read(), "html.parser")
info_table = soup.find_all(class_="infobox vcard")

In [None]:
tbl_rows = info_table[0].find_all("tr")
for row in tbl_rows:
    cells = row.find_all("td")  # table cells
    header = row.find("th")     # row title, if the row has one
    if header is not None:
        print "\nTitle: " + header.text
    for cell in cells:
        cell_text = cell.text.strip()
        if cell_text:  # skip empty cells
            print "Desc: " + cell_text

Woohoo! With some careful navigation of the HTML structure, we can extract the text fields that we want.
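The same navigation can collect the fields into a list instead of printing them, which is a convenient shape for later processing. The sketch below runs on made-up HTML standing in for an infobox. The helper name `infobox_records` and the `"title"`/`"desc"` keys are hypothetical choices.

```python
import bs4

# Made-up HTML standing in for a Wikipedia infobox; the values are placeholders.
sample_html = """
<table class="infobox vcard">
  <tr><th>Born</th><td>1 January 1970</td></tr>
  <tr><th>Occupation</th><td>Example</td></tr>
  <tr><td>   </td></tr>
</table>
"""


def infobox_records(table):
    """Return one {"title", "desc"} dict per non-empty table cell."""
    records = []
    for row in table.find_all("tr"):
        header = row.find("th")
        # Rows without a <th> get an empty title.
        title = header.text.strip() if header is not None else ""
        for cell in row.find_all("td"):
            desc = cell.text.strip()
            if desc:  # skip cells that contain only whitespace
                records.append({"title": title, "desc": desc})
    return records


soup = bs4.BeautifulSoup(sample_html, "html.parser")
records = infobox_records(soup.find(class_="infobox vcard"))
```

A list of dictionaries like `records` can be passed straight to `pandas.DataFrame(records)`.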

**Advanced Exercise:** Try putting this data into a Pandas data frame and saving it!

## 3. Scraping Twitter

So far we've done a lot of heavy lifting ourselves - we've had to find our web page of interest, pin down where in the source code our data of interest is, and then write some complex `BeautifulSoup` code to extract the data.

Are there any data sources that provide their *own* API, with a Python package to match, so that we don't have to do as much work?

Yes! Check out **`tweepy`**, a Python package for the Twitter API.

http://www.tweepy.org/

**Access**

While `tweepy` provides a very nice API for interfacing with Twitter, you will first need to create a developer account in order to actually access it:
https://dev.twitter.com/

We can then fill in the credentials that the Tweepy authorization object asks for:

In [1]:
import tweepy

# Replace the placeholders with your own credentials; never publish real keys.
ckey    = "YOUR_CONSUMER_KEY"     # consumer key
csecret = "YOUR_CONSUMER_SECRET"  # consumer secret
atoken  = "YOUR_ACCESS_TOKEN"     # access token
asecret = "YOUR_ACCESS_SECRET"    # access secret
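Hard-coding credentials in a shared notebook exposes them to every reader. One common alternative is to read them at run time, sketched here under the assumption that the four values have been exported as environment variables. The variable names and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical variable names; use whatever names you exported in your shell.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(environ=None):
    """Return (ckey, csecret, atoken, asecret) read from environment variables."""
    if environ is None:
        environ = os.environ
    # Fail early, naming every variable that is unset or empty.
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
```

With the variables set, `ckey, csecret, atoken, asecret = load_credentials()` would supply the four names used in the cells that follow.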
    

... and access the actual API:

In [2]:
# Instantiate OAuthHandler and initialize it with your credentials
# (i.e. secure "handshake" with the API)
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
 
# Instantiate the tweepy API object.
api = tweepy.API(auth)

We're ready to grab some tweets now! Let's pick someone of interest ...

In [3]:

example_user = "realDonaldTrump"
tweeter = api.get_user(screen_name=example_user)  # look the user up by screen name

print "======================================="
print "@" + example_user
print "\nTweeter follower count: " + str(tweeter.followers_count)
print "Tweeter description: " + tweeter.description
print "======================================="

@realDonaldTrump

Tweeter follower count: 28195415
Tweeter description: 45th President of the United States of America


Now let's grab their last 100 tweets:

In [4]:
tweets = api.user_timeline(user_id=tweeter.id, count=100)

for tweet in tweets:
    print "Tweet: " + tweet.text
    print "\n"

Tweet: 'Presidential Executive Order on Identifying and Reducing Tax Regulatory Burdens' 
Executive Order:… https://t.co/dpE6hDzlAt


Tweet: RT @Scavino45: .@POTUS @realDonaldTrump, @IvankaTrump, Jared Kushner, &amp; Dina Powell in the Oval Office today w/ Aya &amp; her brother Basel.
#W…


Tweet: WELCOME HOME, AYA!
#GodBlessTheUSA🇺🇸 https://t.co/CR4I8dvunc


Tweet: China is very much the economic lifeline to North Korea so, while nothing is easy, if they want to solve the North Korean problem, they will


Tweet: No matter how much I accomplish during the ridiculous standard of the first 100 days, &amp; it has been a lot (including S.C.), media will kill!


Tweet: Another terrorist attack in Paris. The people of France will not take much more of this. Will have a big effect on presidential election!


Tweet: RT @foxandfriends: NYT editor apologizes for misleading tweet about New England Patriots' visit to the White House (via @FoxFriendsFirst) h…


Tweet: A great honor to host PM Paolo
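The `&amp;` sequences in the output above are HTML-escaped ampersands. Assuming Python 3, the standard library's `html.unescape` can decode them before the text is stored or analyzed. This is a small sketch on a fragment of one of the tweets shown above.

```python
import html

# Fragment copied from the printed output above.
raw_text = "w/ Aya &amp; her brother Basel."

# Decode HTML entities such as &amp; back into plain characters.
clean_text = html.unescape(raw_text)
```

Applying the same call to `tweet.text` before saving would keep the escaped entities out of the data frame.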

Again, let's save this to a dataframe and see what we have:

In [None]:
columns = ["user", "tweet", "date"]

In [None]:
# Collect one row per tweet, then build the data frame in a single step
tweet_rows = [[example_user, tweet.text, tweet.created_at] for tweet in tweets]
df = pd.DataFrame(tweet_rows, columns=columns)

In [None]:
from IPython.display import display
display(df)  # display() renders the frame itself and returns None, so no print is needed

**Advanced Exercise:** Find a user that has *geocoded* tweets and create a data frame of Tweets with their tweeting locations (use the Tweepy docs to help you out).

# Conclusion

Web scraping is tough. Some of it is tedious manual inspection; a lot of it is referring to documentation for highly specific functions.

However, no need to fear! You have just successfully scraped three different data sources and worked with some complex Python packages in the process. With your knowledge and this notebook as a reference, you now have a basic toolkit to web scrape any and all corners of the world wide web!