# Web Scraping in Python: A Crash Course

Scott Chamberlain has put together [a version of this workflow in R](https://github.com/sckott/talks/blob/gh-pages/oddpdx/code/demo1.Rmd). The basic p

## Our Goal: Get the complete list of species in Hoyt Arboretum

The Hoyt Arboretum maintains a database of the various species of tree, along with their locations in the garden, provenance information, etc. Unfortunately, this database is designed for use by humans, not computers. The list of species is broken out by letter of the alphabet...

![Hoyt Index Page](hoyt_index.png)


... And each letter has its own index:

![Hoyt Letter Index Page](hoyt_a.png)


We could manually visit each letter's index page, copy-and-paste the list into a text editor, and go from there... but that would get very old very fast! Let's write a script to automate the process.

## Scraping: The general process

We will actually do this in three steps. First, we will get a list of all of the letter-specific index pages. Then, we will write a function to visit a letter index page, extract all of the species names, and return them as a Python list. Finally, we will put it all together in a script and write the output as a CSV file.

## Libraries we will use:

Web scraping in Python is made easier by a couple of libraries:

`requests`: To download the contents of a web page

`lxml`: To allow us to process the contents of a webpage

One more library that you will need to have installed (but do not need to explicitly import) is `cssselect`: lxml's `cssselect()` method, which we use below, depends on it.

In [1]:
from lxml import html
import requests

## Part 1: Getting the list of letter-index page URLs

### First: Download the main index page and turn it into something useful

In [2]:
main_index_page = requests.get("https://hoytarboretum.gardenexplorer.org/taxalist.aspx")

In [3]:
main_index_page

<Response [200]>

In [4]:
main_index_tree = html.fromstring(main_index_page.content)
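
The `<Response [200]>` shown above means the request succeeded, but a script should not take that for granted. Below is a minimal sketch of a fetch helper that fails loudly on an HTTP error. The name `fetch_page`, the `BASE_URL` constant, and the 10-second timeout are assumptions made for illustration; `session` is expected to be a `requests.Session()` (or anything with a compatible `get` method).

```python
BASE_URL = "https://hoytarboretum.gardenexplorer.org/"  # site root used in this tutorial


def fetch_page(session, path, timeout=10):
    """Download BASE_URL + path and return the raw bytes of the response body.

    Hypothetical helper, not part of the original notebook; `session` is
    assumed to behave like a requests.Session.
    """
    response = session.get(BASE_URL + path, timeout=timeout)
    # raise_for_status() raises an exception for 4xx and 5xx responses, so a
    # missing page or a server error stops the script instead of being parsed
    response.raise_for_status()
    return response.content
```

With that in place, `html.fromstring(fetch_page(requests.Session(), "taxalist.aspx"))` would build the same kind of tree as the cell above.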

### Second, figure out which HTML elements we care about:

Firefox, Chrome, and Safari all have a special mode for viewing the source code to a web page that you are looking at, and this mode has useful tools for analyzing that source code. One particularly helpful tool lets you click on an element in the page and see exactly what code produced it. This is very helpful for reverse-engineering a web page for scraping. On Safari, you bring this mode up by pressing Command-Option-I.

![Hoyt Index Page web inspector](hoyt_index_inspector.png)

Looking at this, we can see that the letter buttons are all links (`a` tags) inside a div with the "content" class.

We can write a "CSS Selector" to grab those tags for further processing:

In [7]:
letter_index_links = main_index_tree.cssselect("div.content a") # "a" elements that are descendants of a div of class "content"

Let's see if that got what we wanted:

In [9]:
len(letter_index_links)

26

Looks good! This was a nice and simple one; often, writing the right CSS selector is a bit trickier. Some styles of writing HTML result in easier-to-scrape pages than others. Trial-and-error is usually a part of the process!

What have we actually gotten, though? We have a list of `Element` objects, each of which represents one of the `a` tags in our list.

In [49]:
letter_index_links[0]

<Element a at 0x1127a2ea8>

In [55]:
letter_index_links[0].attrib

{'id': 'ctl00_ContentPlaceHolder1_Repeater1_ctl00_HyperLink1', 'title': ' A ', 'href': 'taxalist-A.aspx'}


Once we've got the right selector set up, and have the elements we care about, it's time to pull out of them whatever information we are after.

### Third, actually extract information

So, what have we captured? We have captured a list of objects representing HTML elements. In our case, they are all links ("anchor" tags), and we want to grab their destinations, which are stored in their `href` attributes.

In [14]:
letter_index_urls = [a.get("href") for a in letter_index_links]

In [15]:
letter_index_urls

['taxalist-A.aspx',
 'taxalist-B.aspx',
 'taxalist-C.aspx',
 'taxalist-D.aspx',
 'taxalist-E.aspx',
 'taxalist-F.aspx',
 'taxalist-G.aspx',
 'taxalist-H.aspx',
 'taxalist-I.aspx',
 'taxalist-J.aspx',
 'taxalist-K.aspx',
 'taxalist-L.aspx',
 'taxalist-M.aspx',
 'taxalist-N.aspx',
 'taxalist-O.aspx',
 'taxalist-P.aspx',
 'taxalist-Q.aspx',
 'taxalist-R.aspx',
 'taxalist-S.aspx',
 'taxalist-T.aspx',
 'taxalist-U.aspx',
 'taxalist-V.aspx',
 'taxalist-W.aspx',
 'taxalist-X.aspx',
 'taxalist-Y.aspx',
 'taxalist-Z.aspx']
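
Note that these are relative URLs: they only make sense in relation to the page we found them on. The functions later in this tutorial handle that by pasting the site's address in front of each one. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve them against the index page's URL (the variable names `index_url`, `relative_urls` and `absolute_urls` are just for illustration):

```python
from urllib.parse import urljoin

index_url = "https://hoytarboretum.gardenexplorer.org/taxalist.aspx"
relative_urls = ["taxalist-A.aspx", "taxalist-B.aspx"]  # first two entries of the list above

# urljoin resolves each relative href against the page it was found on
absolute_urls = [urljoin(index_url, href) for href in relative_urls]
print(absolute_urls[0])  # https://hoytarboretum.gardenexplorer.org/taxalist-A.aspx
```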

## Part 2: Processing a letter-index page

Now we're ready to look at the letter-index pages. We will follow the same procedure for these as we did for the main index: look at the HTML, figure out a selector that works, and go from there.

![Hoyt Index Letter Page web inspector](hoyt_a_inspector.png)

We can see that the information we want is in an unordered list (`ul`) with the class of "taxalist", that each species has its own list item (`li`), and that each entry in the list contains a link with the species name, along with some other information: the common name, an icon showing whether a photo is available, etc. The species name is in boldface (`b`). So, our selector should be something like:

    ul.taxalist li span a b
   
Note that this is not the most parsimonious selector that would work. There are often multiple valid ways to design a selector...

Also, note that it is not _always_ true that the Latin name is in boldface: there are several plants on the "A" page that, for whatever reason, are italicized instead. So we will have to be a bit more creative...

In [119]:
def process_letter_index(url):
    page = requests.get("https://hoytarboretum.gardenexplorer.org/{}".format(url))
    # nb: we should be doing some error checking here, to make sure that the request was successful, etc.
    tree = html.fromstring(page.content) 
    for a in tree.cssselect("ul.taxalist li span a"):
        # whatever the child of this a tag is (whether b or i), let's get it and return its text:
        yield a.getchildren()[0].text.strip()



In [158]:
list(process_letter_index("taxalist-A.aspx"))[:15]

['Abies alba',
 "Abies alba 'Argau'",
 "Abies alba 'Badenweiler'",
 "Abies alba 'Green Spiral'",
 "Abies alba 'Pendula'",
 'Abies amabilis',
 'Abies balsamea',
 "Abies balsamea 'Nana'",
 'Abies balsamea var. phanerolepis',
 'Abies bracteata',
 'Abies cephalonica',
 "Abies cephalonica 'Meyer's Dwarf'",
 'Abies concolor',
 "Abies concolor 'Conica'",
 "Abies concolor 'Winter Gold'"]

This is looking good! But what if we wanted not just the names of the species, but also the common names and the links to each one's page? That would be a little bit more complicated, because of how the HTML of the page is set up. Each link looks like this:

    <li class="taxalist" id="Taxon-312">
        <span dir="ltr">
            <a id="ctl00_ContentPlaceHolder1_Repeater2_ctl01_HyperLink2" href="taxon-312.aspx">
                <b>Abies alba</b>
            </a>
        </span>	
        <span id="ctl00_ContentPlaceHolder1_Repeater2_ctl01_TaxonDetails" class="textmedium colorlabel">
            Pinaceae • European Silver Fir
        </span>
        <img id="ctl00_ContentPlaceHolder1_Repeater2_ctl01_hasImage" title="Has image" src="Images/photo.png" alt="Has image" style="height:10px;width:12px;border-width:0px;">
    </li>
    
We would have to do some additional processing on each `<li>` element: first grabbing the link and extracting its `href`, then grabbing its child `b` node. After that, we'd need a second step to process the span containing the common name and parse out _its_ contents. It can get a little tedious...

In [123]:
def process_letter_index_detail(url):
    page = requests.get("https://hoytarboretum.gardenexplorer.org/{}".format(url))
    # nb: we should be doing some error checking here, to make sure that the request was successful, etc.
    tree = html.fromstring(page.content) 
    species = tree.cssselect("ul.taxalist li")
    for s in species:
        detail_link = s.cssselect("a")[0] # there's always only one
        link_dest = detail_link.get("href")
        
        latin_name_a = s.cssselect("span a")[0]
        latin_name = latin_name_a.getchildren()[0].text.strip() # ditto: the link has exactly one child (b or i)
            
        # exercise for the reader: extend this script to also grab family and common names
        
        yield {'latin_name': latin_name, 'url': link_dest}
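
As a starting point for that exercise, here is a sketch of the parsing half only. It assumes the details span always reads `Family • Common Name`, as in the sample entry shown earlier; we have only looked at one entry, so treat that format (and the `parse_taxon_details` name) as an assumption.

```python
def parse_taxon_details(text):
    """Split a string like 'Pinaceae • European Silver Fir' into its parts."""
    # partition() returns (before, separator, after); separator is '' if not found
    family, separator, common_name = text.strip().partition("•")
    return {
        "family": family.strip(),
        # entries without a bullet are assumed to have no common name
        "common_name": common_name.strip() if separator else None,
    }


print(parse_taxon_details("  Pinaceae • European Silver Fir  "))
# {'family': 'Pinaceae', 'common_name': 'European Silver Fir'}
```

Inside the loop, the text to feed it could come from something like `s.cssselect("span.colorlabel")[0].text_content()`, again assuming the class names in the sample entry are stable.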


In [124]:
list(process_letter_index_detail("taxalist-Z.aspx"))

[{'latin_name': 'Zanthoxylum aff. diacanthoides', 'url': 'taxon-1345.aspx'},
 {'latin_name': 'Zanthoxylum americanum', 'url': 'taxon-977.aspx'},
 {'latin_name': 'Zanthoxylum nepalense', 'url': 'taxon-1742.aspx'},
 {'latin_name': 'Zanthoxylum piperitum', 'url': 'taxon-1263.aspx'},
 {'latin_name': 'Zanthoxylum planispinum', 'url': 'taxon-978.aspx'},
 {'latin_name': 'Zanthoxylum schinifolium', 'url': 'taxon-1743.aspx'},
 {'latin_name': 'Zanthoxylum simulans', 'url': 'taxon-1149.aspx'},
 {'latin_name': "Zauschneria californica 'Calistoga'",
  'url': 'taxon-1921.aspx'},
 {'latin_name': "Zauschneria californica 'Carmen's Gray'",
  'url': 'taxon-1922.aspx'},
 {'latin_name': "Zauschneria californica 'Dublin'", 'url': 'taxon-1923.aspx'},
 {'latin_name': "Zauschneria californica 'Silver Select'",
  'url': 'taxon-1925.aspx'},
 {'latin_name': "Zauschneria californica 'Solidarity Pink'",
  'url': 'taxon-1444.aspx'},
 {'latin_name': "Zauschneria californica 'Wayne's Silver'",
  'url': 'taxon-1434.as

You may notice that this code is quite brittle, and makes many assumptions about the format of the HTML that it is processing. If the authors of the Hoyt Arboretum site change anything, it will almost certainly break our script. This is (sadly) a very common state of affairs. In fact, it is extremely rare for a scraping script to be reusable across multiple sites. 

## Part 3: Putting it all together...

In [125]:
import csv

with open('species_names.csv', 'w', newline='') as outfile:  # newline='' is recommended by the csv module docs
    fieldnames = ['latin_name', 'url']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    
    for url in letter_index_urls:
        for plant in process_letter_index_detail(url):
            writer.writerow(plant)

In [126]:
! head species_names.csv

latin_name,url
Abies alba,taxon-312.aspx
Abies alba 'Argau',taxon-3693.aspx
Abies alba 'Badenweiler',taxon-3699.aspx
Abies alba 'Green Spiral',taxon-3698.aspx
Abies alba 'Pendula',taxon-945.aspx
Abies amabilis,taxon-316.aspx
Abies balsamea,taxon-726.aspx
Abies balsamea 'Nana',taxon-810.aspx
Abies balsamea var. phanerolepis,taxon-1738.aspx


In [127]:
! tail species_names.csv

Zauschneria californica 'Carmen's Gray',taxon-1922.aspx
Zauschneria californica 'Dublin',taxon-1923.aspx
Zauschneria californica 'Silver Select',taxon-1925.aspx
Zauschneria californica 'Solidarity Pink',taxon-1444.aspx
Zauschneria californica 'Wayne's Silver',taxon-1434.aspx
Zelkova serrata,taxon-229.aspx
Zelkova serrata 'Goshiri',taxon-1966.aspx
Zelkova serrata 'Musashino',taxon-2302.aspx
Zelkova serrata 'Variegata',taxon-83.aspx
Zelkova sinica,taxon-82.aspx


Et voilà! You can easily imagine how you might write a function to scrape data from the taxon subpages, and so forth. 

## In closing...

While the specific CSS selectors that you construct will vary widely from page to page, the basic steps we've outlined here are remarkably consistent across scraping projects. Most programming languages and libraries have similar functionality to the ones found in Python's `lxml` and `requests` libraries, and CSS selectors are the same across languages. There are other ways to process HTML beyond CSS selectors; you may encounter scraping/parsing libraries that prefer XPath to CSS, for example. But again, the basic steps will be very similar to what we've outlined here.
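
For comparison, lxml elements also have an `xpath()` method. The sketch below expresses roughly the same query as our `div.content a` selector in XPath, run against a small hand-written sample (the footer link is made up, to show that it is not matched):

```python
from lxml import html

# Hand-written sample; only the two letter links come from the real index page
SAMPLE = """
<html><body>
  <div class="content">
    <a href="taxalist-A.aspx"> A </a>
    <a href="taxalist-B.aspx"> B </a>
  </div>
  <div class="footer"><a href="about.aspx">About</a></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Roughly equivalent to the CSS selector "div.content a", but note that
# @class="content" only matches when "content" is the element's sole class.
hrefs = tree.xpath('//div[@class="content"]//a/@href')
print(hrefs)  # ['taxalist-A.aspx', 'taxalist-B.aspx']
```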

One thing to keep in mind: when scraping a large site (i.e., making hundreds or thousands of requests), it is considered polite to build some delay into your script, so as to avoid overloading the website. So, for example, in our loop above over each sub-page, we could have put a small random delay into each iteration, just to give the Hoyt Arboretum's server a break. And, of course, if a site's terms of use forbid automated scraping (as is the case with many scientific publishers' websites, for example), you should probably honor their requests. The case of Aaron Swartz and JSTOR is illustrative of how seriously some publishers take such matters.
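
One way to add such a delay is sketched below, using only the standard library. The `polite` name and the one-to-three-second default range are arbitrary choices for illustration, not a recommendation from the Arboretum or anyone else.

```python
import random
import time


def polite(items, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval between them.

    `sleep` is a parameter only so that the pause can be swapped out
    (for example, when testing); by default it is time.sleep.
    """
    for index, item in enumerate(items):
        if index > 0:
            # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        yield item
```

In the loop from Part 3, this would mean writing `for url in polite(letter_index_urls):` in place of `for url in letter_index_urls:`.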

With that said, web scraping is a useful part of your programming and "data science" toolbox. Happy scraping!

## Special Bonus Content: Saving your data as JSON

Above, we demonstrated how to save our scraped data as a CSV file. While CSV is a very important and useful format, depending on what your data look like and what you plan on doing with them downstream, you may wish to use something different. For example, if your data are anything other than simple tabular data, storing them as serialized JSON objects may well be a better choice. Similarly, if all you're going to be doing is reading your data back into Python (or some other similar language) for further analysis, using JSON over CSV _may_ save you some coding, depending on what your data look like.

What is JSON? JSON stands for "JavaScript Object Notation", and it is a simple format for representing structured data (a list of strings, a dictionary of key-value pairs, etc.) as text. There are many ways to do this, but JSON is particularly popular and useful for several reasons:

1. It is human-readable, meaning that it is possible to examine JSON-encoded data with a text editor.
2. Its syntax is very simple, so writing a program to read or write JSON-encoded data is (relatively) straightforward.
3. Its syntax looks a lot like JavaScript, Python, Ruby, etc., meaning that it is very easy for programmers who are used to those languages to read (in fact, its syntax is almost-but-not-exactly identical to JavaScript: nearly all valid JSON objects are also valid JavaScript objects).
4. Using JSON can _often_ be more reliable than CSV when working with non-Latin characters in your data (e.g. ∆, π, and 😁).

Most languages have built-in functionality for working with JSON, both reading and writing. We can save our scraped Arboretum data in a couple of ways, but the simplest way is to just write it out to a file, one line per object. This style/method is sometimes called ["JSON Lines"](http://jsonlines.org), and is very convenient to work with.

It is worth noting that, in the present example (i.e., Hoyt Arboretum's list of species), it is probably overkill to be using JSON. The main reason for doing so would be if we had some piece of software that we were planning on using downstream that specifically required/preferred JSON. Of course, if our data were more complex (for example, if we were to extend our scraper to visit each species' sub-page and pull out its geolocation), JSON would be a very natural way to serialize our results.
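
To make that concrete, here is a sketch of what such a richer record might look like. The `locations` field and its coordinates are made up for illustration; we have not scraped any geolocation data.

```python
import json

# Hypothetical record: one species with a variable number of locations
record = {
    "latin_name": "Abies alba",
    "url": "taxon-312.aspx",
    "locations": [
        {"lat": 45.5, "lon": -122.75},  # made-up coordinates
        {"lat": 45.25, "lon": -122.5},
    ],
}

# A nested structure serializes to a single line of JSON...
line = json.dumps(record)

# ...and comes back as the same nested Python structure
restored = json.loads(line)
print(restored == record)  # True
print(len(restored["locations"]))  # 2
```

Representing the same record in CSV would mean either repeating the species on several rows or inventing our own encoding for the list of locations.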

In [145]:
import json

with open("species_list.json","w") as outfile:
    for url in letter_index_urls:
        for plant in process_letter_index_detail(url):
            json.dump(plant, outfile, ensure_ascii=False) # in case of Unicode characters
            outfile.write("\n")
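
The `ensure_ascii=False` argument deserves a brief illustration. By default, `json.dumps` and `json.dump` escape every non-ASCII character; with `ensure_ascii=False` they write the character itself. Both forms decode to the same string, so the choice mostly affects how readable the file is in a text editor. (The sample dictionary here is made up.)

```python
import json

sample = {"symbol": "π"}

escaped = json.dumps(sample)                      # default: ensure_ascii=True
literal = json.dumps(sample, ensure_ascii=False)  # keep the character as-is

print(escaped)  # {"symbol": "\u03c0"}
print(literal)  # {"symbol": "π"}

# Either way, reading the text back gives the original data
print(json.loads(escaped) == json.loads(literal) == sample)  # True
```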


Reading in JSON-encoded data is straightforward:

In [150]:
with open("species_list.json") as infile:  # close the file once we are done reading
    names = [json.loads(line) for line in infile]
names[0]

{'latin_name': 'Abies alba', 'url': 'taxon-312.aspx'}

In [151]:
species_detail_urls = [s['url'] for s in names]
species_detail_urls[0]

'taxon-312.aspx'

Another way some people prefer to work with JSON-serialized data is to write it out as one giant object:

In [152]:
import json

plants = {'species': [] }

for url in letter_index_urls:
    for this_species in process_letter_index_detail(url):
        plants['species'].append(this_species)

with open("species_list2.json","w") as outfile:
    json.dump(plants, outfile)


In [154]:
with open("species_list2.json") as infile:
    plants_from_json = json.load(infile)
len(plants_from_json['species'])

2015

And there you are! We have successfully round-tripped data from Python, out to JSON, and then back into Python. Most programming languages have similar JSON-handling functionality; though the exact functions you use might be different, the overall pattern will be similar.