# Webscraping for Data Collection

I ~love~ using webscraping to grab data from content on websites. In my day job this is something I end up using a lot, getting data from different websites and putting it back together, and it's also something I get a lot of questions about. Webscraping is particularly useful when 1) the service is not intended to provide data or 2) there is no good public API. This simple tutorial will build into my "Women's Representation in History" series as I show how to webscrape data from webpages at different levels of difficulty.

I'm not an expert in this! There is always a better way to solve a problem, but here are some pretty quick methods that should get you going and put you on the path to scraping your own data from webpages.

I plan to do at least 3 tutorials with different levels of capabilities. 

1. Very simple (this one!) basic request and parsing of HTML
2. How to follow a link, and how to recursively scrape
3. Creating a spider, doing crawls, sending scraped data back to a database
4. Handling JavaScript-rendered pages (and looking for hidden APIs)

### Basic page request and html parsing
Most of the time I use `lxml` and `requests` for webscraping. There are other libraries, but I find that being conversant in these two is usually enough to solve more than 90% of my webscraping needs (you can see how some of my data collection from pt1 of my Women's Representation in History was done using these libraries [here](https://github.com/sjfwagner/Blog-code-snippets-and-notebooks/blob/master/SYMIH_Scrapes.ipynb)). Preferences may vary! [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is very popular, but I prefer to do all of my data extraction through xpath (more on that later). There *are* other HTTP request libraries, like urllib2, but I have never seen an HTTP library recommended over requests.

In [2]:
#the two imports that we'll need
import requests
from lxml import html

For this simple tutorial we'll be scraping a Wikipedia page that lists historians.
The page is linked below, and I also recommend that you view its source, as that will give you a good idea of the structure of the page we'll be scraping.

We'll start [here](https://en.wikipedia.org/wiki/List_of_historians), with a list of historians identified by Wikipedia, and then use the same techniques to move on to [here](https://en.wikipedia.org/wiki/List_of_historians_by_area_of_study), a list of historians identified by Wikipedia for each area of study. There will be overlap, but I believe each list may have some different names.

In [34]:
url = 'https://en.wikipedia.org/wiki/List_of_historians'
response = requests.get(url)
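
Before parsing anything, it can be worth checking that the request actually succeeded. Here's a rough sketch of one way to do that. The helper name `fetch_page` and the 10-second timeout are my own assumptions for illustration, not something the rest of this tutorial relies on.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the raw bytes of a page, raising an error if the request failed.

    `fetch_page` and the default timeout are hypothetical choices for this sketch.
    """
    # timeout keeps the call from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on a 4xx or 5xx status code
    response.raise_for_status()
    # .content holds the undecoded response body as bytes
    return response.content
```

Calling `fetch_page(url)` with the `url` defined above should hand back the same bytes as `response.content`, but it fails loudly instead of quietly returning an error page.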


Generically, HTML is just a structured document whose tags tell the browser how to display information. CSS and JavaScript modify this, but most basic webpages are HTML.

We can use the structure of the document, the tags, to pull features out of it. xpath (there's a great [tutorial](http://www.w3schools.com/xsl/xpath_intro.asp) at W3Schools) gives us a structured language for querying the document for the information that we want.
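
To make that a bit more concrete, here is a minimal sketch of xpath queries run against a tiny hand-written document. The `snippet` below is made up for illustration and only loosely imitates the shape of a real list page.

```python
from lxml import html

# A small hand-written document standing in for a real page (hypothetical content)
snippet = """
<html><body>
  <ul>
    <li><a href="/wiki/Herodotus" title="Herodotus">Herodotus</a></li>
    <li><a href="/wiki/Thucydides" title="Thucydides">Thucydides</a></li>
    <li><a href="#notes">Notes</a></li>
  </ul>
</body></html>
"""

tree = html.fromstring(snippet)

# //li/a    -> every <a> that is a direct child of an <li>, anywhere in the document
# [@title]  -> keep only the <a> elements that carry a title attribute
links = tree.xpath('//li/a[@title]')
names = [a.text for a in links]

# Ending the path with /@href returns the attribute values rather than the elements
hrefs = tree.xpath('//li/a[@title]/@href')

print(names)  # ['Herodotus', 'Thucydides']
print(hrefs)  # ['/wiki/Herodotus', '/wiki/Thucydides']
```

The "Notes" link is dropped because it has no `title` attribute, which is the kind of filtering the predicate in square brackets is for.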

It's super useful at this point to load the source (right-click, view source) to see the structure of the page you're querying.

In [35]:
wiki_page = html.fromstring(response.content)

In [38]:
historians = {}
for i in wiki_page.xpath('//li/a[@title]'): #this xpath grabs every link that has a title and sits inside a list item
    print(i.text) #print the link text (usually the historian's name)
    print(i.xpath('@href')) #print the link
    historians[i.text] = "https://en.wikipedia.org"+i.xpath('@href')[0] #store the full link

Herodotus
['/wiki/Herodotus']
historiography
['/wiki/Historiography']
Thucydides
['/wiki/Thucydides']
Xenophon
['/wiki/Xenophon']
Ctesias
['/wiki/Ctesias']
Theopompus
['/wiki/Theopompus']
Eudemus of Rhodes
['/wiki/Eudemus_of_Rhodes']
Berossus
['/wiki/Berossus']
Ptolemy I Soter
['/wiki/Ptolemy_I_Soter']
Duris of Samos
['/wiki/Duris_of_Samos']
Manetho
['/wiki/Manetho']
Timaeus of Tauromenium
['/wiki/Timaeus_(historian)']
Quintus Fabius Pictor
['/wiki/Quintus_Fabius_Pictor']
Artapanus of Alexandria
['/wiki/Artapanus_of_Alexandria']
Ptolemaic Egypt
['/wiki/Ptolemaic_Egypt']
Cato the Elder
['/wiki/Cato_the_Elder']
Gaius Acilius
['/wiki/Gaius_Acilius']
fl.
['/wiki/Floruit']
Polybius
['/wiki/Polybius']
Sempronius Asellio
['/wiki/Sempronius_Asellio']
Sima Tan
['/wiki/Sima_Tan']
Sima Qian
['/wiki/Sima_Qian']
Chinese historiography
['/wiki/Chinese_historiography']
Agatharchides
['/wiki/Agatharchides']
Posidonius
['/wiki/Posidonius']
Julius Caesar
['/wiki/Julius_Caesar']
Diodorus of Sicily
['/wik
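
The loop above builds each full link by gluing the domain onto the `href`, and it uses `i.text` as the key, which is `None` whenever the link text is wrapped in another tag. Here's a slightly more defensive sketch of the same idea, written for Python 3. The function name `collect_links`, the `BASE_URL` constant, and the sample markup are all my own assumptions for illustration.

```python
from urllib.parse import urljoin

from lxml import html

# Hypothetical constant: the page the relative links were scraped from
BASE_URL = "https://en.wikipedia.org/wiki/List_of_historians"


def collect_links(page, base_url=BASE_URL):
    """Map link text to an absolute URL for every titled link inside a list item."""
    links = {}
    for a in page.xpath('//li/a[@title]'):
        # text_content() also picks up text nested inside child tags such as <i>
        name = a.text_content().strip()
        href = a.get('href')
        if not name or not href:
            continue  # skip links with no visible text or no target
        # urljoin resolves a relative href against the page it came from
        links[name] = urljoin(base_url, href)
    return links


# Made-up markup, just to exercise the function without hitting the network
sample = html.fromstring(
    '<ul>'
    '<li><a href="/wiki/Herodotus" title="Herodotus">Herodotus</a></li>'
    '<li><a href="/wiki/Sima_Qian" title="Sima Qian"><i>Sima Qian</i></a></li>'
    '</ul>'
)
print(collect_links(sample))
```

On the real page you would pass in `wiki_page` instead of `sample`. This still keeps non-historian links such as "historiography", so some cleanup is needed either way.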

In [39]:
historians

{'Frank Barlow': 'https://en.wikipedia.org/wiki/Frank_Barlow_(historian)',
 'Einhard': 'https://en.wikipedia.org/wiki/Einhard',
 'T. C. Smout': 'https://en.wikipedia.org/wiki/Christopher_Smout',
 'Isaiah Berlin': 'https://en.wikipedia.org/wiki/Isaiah_Berlin',
 'D. W. Meinig': 'https://en.wikipedia.org/wiki/D._W._Meinig',
 'Paul Cartledge': 'https://en.wikipedia.org/wiki/Paul_Cartledge',
 'John Lukacs': 'https://en.wikipedia.org/wiki/John_Lukacs',
 'Asser': 'https://en.wikipedia.org/wiki/Asser',
 'Mary Bonaventure Browne': 'https://en.wikipedia.org/wiki/Mary_Bonaventure_Browne',
 'Georges Lefebvre': 'https://en.wikipedia.org/wiki/Georges_Lefebvre',
 'Quentin Skinner': 'https://en.wikipedia.org/wiki/Quentin_Skinner',
 'Robert Darnton': 'https://en.wikipedia.org/wiki/Robert_Darnton',
 'Simon Janashia': 'https://en.wikipedia.org/wiki/Simon_Janashia',
 'Italian Wars': 'https://en.wikipedia.org/wiki/Italian_Wars',
 'David Hackett Fischer': 'https://en.wikipedia.org/wiki/David_Hackett_Fischer