Web Scraping
======
Adapted from a page created by [John Beieler](https://gist.github.com/johnb30/4743272)

> Even with the best of websites, I don’t think I’ve ever encountered a scraping job that couldn’t be described as *“A small and simple general model with heaps upon piles of annoying little exceptions”*

>> \- Swizec Teller [http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039](http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039)

## What is it?

A large portion of the data that we as social scientists are interested in resides on the web in some manner. Web scraping is a method for pulling data from the structured (or not so structured!) HTML that makes up a web page. Python has numerous libraries for approaching this type of problem, many of which are incredibly powerful. If there is something you want to do, there's usually a way to accomplish it. Perhaps not easily, but it can be done.


## How is it accomplished?

In general, there are three problems that you might face when undertaking a scraping task:

1. You have a single page, or a set of pages, that you know of and you want to scrape.
2. You have a source that generates links, e.g., [RSS feeds](http://rss.nytimes.com/services/xml/rss/nyt/World.xml), to various pages with the same structure.
3. You have a website that contains many pages of interest that are scattered across its file system, and you only have general rules for reaching these pages.

The key is that you must identify which type of problem you have. After this, you must look at the HTML structure of a webpage and construct a script that will select the parts of the page that are of interest to you.

## There's a library for that!

As mentioned previously, Python has various libraries for scraping tasks. The ones I have found the most useful are:

- [pattern](http://www.clips.ua.ac.be/pages/pattern)
- [lxml](http://lxml.de/)
- [requests](https://3.python-requests.org/)
- [Scrapy](http://doc.scrapy.org/en/0.16/)
- [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)


In [9]:
# Directions for installing these packages and for printing out their versions

# !pip install pattern
import pattern3
print("package versions")
print("--------------------------")
print(f"pattern        : {pattern3.__version__}")

# You might face an error due to a missing MySQL configuration.
# On a Mac:
# !brew install mysql
# !export PATH=$PATH:/usr/local/mysql/bin
# !pip install pattern

# !pip install lxml
# `import lxml` alone does not load the etree submodule, so import it explicitly
from lxml import etree
print(f"lxml           : {etree.__version__}")

# !pip install requests
import requests
print(f"requests       : {requests.__version__}")

# !pip install scrapy
import scrapy
print(f"scrapy         : {scrapy.__version__}")

# !pip install beautifulsoup4
import bs4
print(f"beautifulsoup4 : {bs4.__version__}")

package versions
--------------------------
pattern        : 2.6

### Inspecting a web page
Let's inspect a [mongabay web page](https://names.mongabay.com/). Open the page in a separate tab; it should look something like the image below. Mongabay is a site that has a nice collection of statistics about first and last names.

<img src="images/page.png" alt="web page" height="420" width="600">

The source code language for web pages is HTML (HyperText Markup Language). HTML is a hierarchical description of the visual content of a page. You can view the raw source in Chrome by right-clicking the page and choosing `View page source`. However, commercial web pages are very complex and inspecting them requires a more powerful method.

[mongabay web page](http://names.mongabay.com/data/1000.html)

The Chrome browser comes with such a tool built in. You can open this tool by choosing `More Tools/Developer Tools`.

If you are using Firefox, its built-in Developer Tools give similar functionality (the older [Firebug](https://getfirebug.com/) plugin has been discontinued).

The bottom half of the browser page will now have a sophisticated development environment for all things web (HTML, CSS, JavaScript).
In particular, it allows you to click on a visual element in the page and find out where it is in the source and what tags it is associated with.

**Note that the image below might look different in newer versions of the Chrome browser.**
<img src="images/InspectingHTML.png" alt="web page" height="420" width="600">

Note that the element SMITH is surrounded by `<td>` and `</td>`, and that this element and all of the elements in that row, `SMITH  1  2376206 ...`, are surrounded by `<tr> ..... </tr>`. In HTML-speak we say that each element in the table is of type 'td' (a table data cell) and that the whole row is an element of type 'tr' (which stands for "table row").

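To make the `<tr>`/`<td>` nesting concrete, here is a minimal sketch that walks a tiny hand-written table. Only the SMITH values come from the page above; the header names (`Surname`, `Rank`, `Count`) are my own placeholders. I use the standard library's `xml.etree.ElementTree` so the snippet runs without installing anything, which only works because this snippet is well-formed; real pages usually are not, which is why a forgiving HTML parser is normally used instead.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written table for illustration.
# Only the SMITH row is taken from the page; the header names are placeholders.
snippet = """
<table>
  <tr><td>Surname</td><td>Rank</td><td>Count</td></tr>
  <tr><td>SMITH</td><td>1</td><td>2376206</td></tr>
</table>
""".strip()

table = ET.fromstring(snippet)

# Each <tr> is one row; each <td> child of a <tr> is one cell of that row.
rows = [[cell.text for cell in tr.findall('td')] for tr in table.findall('tr')]
for row in rows:
    print(row)
```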
We will now see how to inspect these elements from the command line using `scrapy shell`

### Scrapy

* [Official Scrapy Tutorial](http://doc.scrapy.org/en/0.24/intro/tutorial.html)
* Other important sections in the Scrapy documentation are **selector** and **Scrapy shell**


### Using scrapy shell
*Everything here is done from the terminal window, not inside a notebook*

Install Scrapy. Use one of the following commands, or something similar:
> `pip install scrapy`

> `sudo pip install scrapy`

> `conda install scrapy`

Start Scrapy Shell from a terminal window:
> `scrapy shell`

Fetch a page: 
> `fetch('http://names.mongabay.com/data/1000.html')`

View the page in a browser:
> `view(response)`

Get the HTML text of the response:
> `response.body`

Get just the headers:
> `response.headers`

Get all href links to other web pages `<a href="http://web.site/file"> link text </a>`:
> `response.xpath('//a')`

Filter out of the href links the ones that match a particular regular expression:
> `response.xpath('//a').re(r'data/.+\.html?')`

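If you want to see what that regular expression keeps without starting the shell, the following is a rough sketch using only Python's `re` module. The anchor strings are made up and merely stand in for what `response.xpath('//a')` would select; this approximates the filtering step and is not how Scrapy implements `.re()` internally.

```python
import re

# Hypothetical anchor tags, standing in for the elements selected by '//a'.
anchors = [
    '<a href="data/1000.html">Top 1000 surnames</a>',
    '<a href="about.htm">About</a>',
    '<a href="data/female.htm">Female first names</a>',
]

# Same expression as above: 'data/', then anything, then '.htm' or '.html'.
link_re = re.compile(r'data/.+\.html?')

# Keep the matching part of each anchor; anchors without a match contribute nothing.
matches = [m.group(0) for a in anchors for m in link_re.finditer(a)]
print(matches)
```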
Scrapy contains much more than the shell. You can use Scrapy as a library to perform web operations. You can also write a **Spider** or **Crawler**, i.e., a program that will follow links to extract and process all of the pages of a particular type from a web site. Later in this notebook there is an example of using Scrapy for crawling.

For now, we switch from scrapy to the libraries `requests` and `lxml.html` which are somewhat easier to use for web page parsing.

### Scraping a page that you know

The easiest approach to web scraping is getting the content from a page that you know in advance. I'll go ahead and keep using the page we looked at earlier. There are three basic steps to scraping a single page:

1. Get (request) the page
2. Parse the page content
3. Select the content of interest using an XPath selector

The following code executes these three steps and prints the result. 

In [None]:
import requests
import lxml.html as lh

url = 'http://names.mongabay.com/data/1000.html'
page = requests.get(url)
doc = lh.fromstring(page.content)
page.content[:500]

In [None]:
# the tag tr (table row) is used in many places, 
# among them the table of interest to us.
# we can identify those rows by the fact that 
# the table contains 11 columns.
tr_elements = doc.xpath('//tr')
for T in tr_elements[:20]:
    for e in T.iterchildren():
        print(e.text_content())

In [None]:
type(tr_elements[0]), len(tr_elements)

In [None]:
element=tr_elements[7]
#element.  # uncomment this line and hit <tab> after the dot to see the methods and properties of an HTML element

In [None]:
col=[]  # collect column names into col
T=tr_elements[0]
print(type(T))
i=0
print(len(T))
for t in T.iterchildren():
    i+=1
    name=t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

print('the columns are:',col)

In [None]:
len(tr_elements)

In [None]:
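for j in range(1, len(tr_elements)):
    T = tr_elements[j]
    # stop at the first row whose number of cells differs from the header row,
    # i.e. the first row that is not part of the table of interest
    if len(T) != len(col):
        break
    for i, t in enumerate(T.iterchildren()):
        data = t.text_content()
        if i > 0:
            # every column except the first (the name) is numeric;
            # drop a trailing '%' before converting
            try:
                data = float(data.split('%')[0])
            except ValueError:
                print(data, 'cannot be converted to float, row,col=', j, i)
                data = None
        col[i][1].append(data)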

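The cell-to-number conversion in the loop above can also be pulled out into a small helper, which makes it easier to test on its own. This is only a sketch: the name `to_number` is hypothetical, and stripping thousands separators is an extra assumption of mine that the loop above does not make. The sample inputs are made up.

```python
def to_number(text):
    """Convert a table cell such as '1', '0.88' or '70.9%' to a float.

    Returns None when the cell cannot be read as a number.
    """
    # Assumption: commas are thousands separators and can be dropped.
    cleaned = text.strip().rstrip('%').replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        return None

print(to_number('70.9%'))      # percentage cell
print(to_number('2,376,206'))  # count with thousands separators
print(to_number('(S)'))        # made-up example of a cell that is not a number
```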

In [None]:
[len(C) for (title,C) in col]

In [None]:
min_len=min([len(C) for (title,C) in col])
min_len

In [None]:
Dict={title:column for (title,column) in col}

In [None]:
Dict

In [None]:
import pandas as pd
df=pd.DataFrame(Dict)
df.head()

So we now have our lovely output. This output can be manipulated in various ways, or written to an output file.

### Scraping generated links

Let's say you want to get a stream of news stories in an easy manner. You could visit the homepage of the NYT and work from there, or you can use an [RSS feed](http://rss.nytimes.com/services/xml/rss/nyt/World.xml). RSS stands for Really Simple Syndication and is, at its heart, an XML document, which makes it easy to parse. The fantastic library `pattern` allows for easy parsing of RSS feeds. Using the `Newsfeed` class in `pattern.web`, it is possible to parse a feed and obtain attributes of the XML document. Its `search()` method returns an iterable composed of the individual stories. Each result has a variety of attributes such as `.url`, `.title`, `.description`, and more. The following code demonstrates these methods.

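Since an RSS feed is just XML, it may help to see how little is needed to pull the same three fields out by hand. The sketch below uses the standard library's `xml.etree.ElementTree` on a hand-written, entirely hypothetical feed (the titles and `example.com` URLs are invented). It is not what `pattern` does internally, only an illustration of the document structure it reads.

```python
import xml.etree.ElementTree as ET

# A hand-written, hypothetical feed; a real one would be downloaded from the feed URL.
feed_xml = """<rss version="2.0">
  <channel>
    <title>Example World News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/story-1</link>
      <description>&lt;p&gt;Summary of the first story.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/story-2</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed_xml)

# Every story in the feed is an <item> with <title>, <link> and <description> children.
stories = [
    {
        'title': item.findtext('title'),
        'url': item.findtext('link'),
        'description': item.findtext('description'),
    }
    for item in root.iter('item')
]

for story in stories:
    print(story['title'], '->', story['url'])
```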
In [None]:
import pattern.web

num = 5
url = 'http://rss.nytimes.com/services/xml/rss/nyt/World.xml'
results = pattern.web.Newsfeed().search(url, count=num)

print('The current top headlines from the NY Times are:')
# iterate over the results themselves, in case the feed returns fewer than num stories
for i, result in enumerate(results):
    print("%d\t%s" % (i, result.title))

print('\n\nURL: %s \n\n Header\n%s \n\nDescription\n %s \n' % (results[0].url, results[0].title,
                                                                results[0].description))

That looks pretty good, but the description looks nastier than we would generally prefer. Luckily, `pattern` provides functions to get rid of the HTML in a string.

In [None]:
print('%s \n\n %s \n\n %s \n\n' % (results[0].url, results[0].title, pattern.web.plaintext(results[0].description)))

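If `pattern` is not available, the same idea (keep the text, drop the tags) can be approximated with the standard library. This is a minimal sketch, not a reimplementation of `pattern.web.plaintext`; the class name `TextOnly`, the function name `strip_tags`, and the sample string are all my own.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every run of text between tags
        self.parts.append(data)


def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # collapse the runs of whitespace that removed tags can leave behind
    return ' '.join(''.join(parser.parts).split())


print(strip_tags('<p>Summary of the <b>first</b> story.</p>'))
```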
While it's all well and good to have the title and description of a story, this is often insufficient (some descriptions are just the title, which isn't particularly helpful). To get further information on the story, it is possible to combine the single-page scraping discussed previously with the results from the RSS scrape. The following code implements a function to scrape the NYT article pages, which can be done easily since the NYT is wonderfully consistent in their HTML, and then iterates over the results, applying the `scrape` function to each result.

In [None]:
import codecs
import requests
import lxml.html as lh

def scrape(url):
    """Download an article page and return the text of its summary paragraphs."""
    page = requests.get(url)
    doc = lh.fromstring(page.content)
    # The NYT markup changes over time; if this selector matches nothing,
    # inspect an article page and adjust it.
    text = doc.xpath('//p[@id="article-summary"]')
    finalText = ''.join(par.text_content() for par in text)
    return finalText + '\n\n\n'

# the with-block closes the file even if one of the requests fails
with codecs.open('tutorialOutput.txt', encoding='utf-8', mode='w') as outputFile:
    for result in results:
        outputFile.write(scrape(result.url))

In [None]:
!head -4 tutorialOutput.txt

### Scraping arbitrary websites

The final approach is for a website that contains information you want, where the pages are spread around in a fairly consistent manner but there is no simple, straightforward scheme by which the pages are named.

I'll offer a brief aside here to mention that it is often possible to make slight modifications to the URL of a website and obtain many different pages. For example, a website that contains Indian parliament speeches has the URL `http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=` with differing values appended after the `=`. Thus, using a `for` loop allows for the programmatic creation of different URLs. Some sample code is below.

In [None]:
url = 'http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl='

for i in range(5175,5973):
    newUrl = url + str(i)
    print('Scraping: %s' % newUrl)

Getting back on topic, it is often more difficult than the above to iterate over numerous webpages within a site. This is where the `Scrapy` library comes in. `Scrapy` allows for the creation of web spiders that crawl over a website, following any links that they find. This is often far more difficult to implement than a simple scraper since it requires the identification of rules for link following. The [State Department](http://www.state.gov/r/pa/prs/dpb/2012/index.htm) offers a good example. I don't really have time to go into the depths of writing a `Scrapy` spider, but I thought I would put up some code to illustrate what it looks like.

In [None]:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item
from bs4 import BeautifulSoup
import re
import codecs

class MySpider(CrawlSpider):
    name = 'statespider'  # unique name that identifies this spider
    # the URL that the spider should start on; adjust the year as needed
    start_urls = ['http://www.state.gov/r/pa/prs/dpb/2010/index.htm']

    # defines the rules for the spider
    rules = (
        # follow only links within the navigation panel that have /2010/ in them
        Rule(LinkExtractor(allow=('/2010/',), restrict_xpaths=('//*[@id="local-nav"]',))),
        # follow links within the calendar on the index page for the individual years,
        # denying any links with /video/ in them, and pass each page to parse_item
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="dpb-calendar"]',), deny=('/video/',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
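        # prints the response.url out in the terminal to help with debugging
        self.log('Hi, this is an item page! %s' % response.url)

        # Insert code to scrape page content here. As a placeholder (an assumption,
        # not the original selector), keep the text of every paragraph on the page.
        texts = '\n'.join(response.xpath('//p//text()').extract())

        # placeholder file name derived from the last part of the URL
        filename = response.url.rstrip('/').split('/')[-1] + '.txt'

        # opens the file and writes 'texts' using utf-8
        with codecs.open(filename, 'w', encoding='utf-8') as output:
            output.write(texts)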


### Exercise
Google news provides RSS feeds that can be filtered, at the source, according to the country, the language and a query term. See [this description](http://thinktostart.com/creating-custom-rss-feeds-with-google/).

Create a feed for all Spanish-language news about San Diego, and print the latest 5 headlines and descriptions.


## The Pitfalls of Web Scraping

Web scraping is much, much, *much* more of an art than a science. It is often non-trivial to identify the XPath selector that will get you what you want. Also, some web programmers can't seem to decide how they want to structure the pages they write, so they just change the HTML every few pages. Notice that in the NYT example, if the `id` that the selector looks for, `article-summary`, gets changed to `article-summary1`, everything breaks. There are ways around this that are often convoluted, messy, and hackish. Usually, however, where there is a will there is a way.

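One of those hackish workarounds is to keep a list of candidate selectors and use the first one that matches anything. The sketch below shows the idea on a made-up page. The candidate list and the name `first_match` are hypothetical, and I use the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) so that it runs without extra installs; with `lxml` the same loop would call `doc.xpath(...)` instead of `doc.findall(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate selectors, ordered from most to least preferred.
CANDIDATES = [
    ".//p[@id='article-summary']",
    ".//p[@itemprop='articleBody']",
    ".//p",
]


def first_match(doc, candidates=CANDIDATES):
    """Return (selector, texts) for the first selector that matches anything."""
    for selector in candidates:
        nodes = doc.findall(selector)
        if nodes:
            return selector, [''.join(n.itertext()) for n in nodes]
    # nothing matched: the page layout has changed beyond what we anticipated
    return None, []


# A made-up page on which the preferred selector no longer matches.
page = ET.fromstring(
    "<html><body><p itemprop='articleBody'>Body text.</p></body></html>"
)
print(first_match(page))
```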
...brief pause to demonstrate the lengths this can go to.

## PITF Human Atrocities

As a wrap-up, I show the workflow I have been using to perform real-time scraping from various news sites of stories pertaining to human atrocities. This is illustrative both of web scraping and of the issues that can accompany programming.

The general flow of the scraper is:

RSS feed -> identify relevant stories -> scrape story -> place results in MongoDB -> repeat every hour

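To give a feel for the shape of such a loop, here is a deliberately simplified sketch of one pass through it. Everything in it is an assumption for illustration: the keyword test is a stand-in for whatever relevance criterion the real scraper uses, the function names are hypothetical, a plain list stands in for MongoDB, and the stories and `example.com` URLs are invented. Running it every hour would be left to an external scheduler.

```python
def is_relevant(story, keywords):
    """True if any keyword occurs in the story's title or description (case-insensitive)."""
    text = (story['title'] + ' ' + story['description']).lower()
    return any(k.lower() in text for k in keywords)


def run_once(feed_stories, scrape, store, keywords):
    """One pass of the loop: filter the feed, scrape each relevant story, store the result."""
    for story in feed_stories:
        if is_relevant(story, keywords):
            store.append({'url': story['url'], 'text': scrape(story['url'])})
    return store


# Hypothetical stand-ins for the real feed, scraper and database.
stories = [
    {'title': 'Village attacked', 'description': 'Dozens of civilians killed.',
     'url': 'http://example.com/1'},
    {'title': 'Markets rally', 'description': 'Stocks rose on Tuesday.',
     'url': 'http://example.com/2'},
]


def fake_scrape(url):
    # a real scraper would download and parse the page here
    return 'full text of ' + url


results = run_once(stories, fake_scrape, [], keywords=['killed', 'massacre'])
print(results)
```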