# Lab 13: Scraping and Parsing Data 

In this lab, you will get exposure to some basic tools for manipulating hierarchical data structures pulled from the web.

In [4]:
# Run this cell to set up your notebook
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import display, Latex, Markdown, HTML, Javascript
from client.api.notebook import Notebook
ok = Notebook('lab13.ok')

Assignment: Lab 13
OK, version v1.13.9



In [6]:
# Log into OkPy.
# You might need to change this to ok.auth(force=True) if you get an error
ok.auth()

Successfully logged in as yining.jiang@berkeley.edu


## Question 1

The standard-library module `xml.etree.ElementTree` supports a subset of XPath selector syntax. See the [documentation](https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax) for a refresher on the syntax. In this question, we work with a string containing mock XML data to practice writing XPath expressions.

In [7]:
import xml.etree.ElementTree as ET

# string containing XML data
plantData = '''
<CATALOG YEAR="2017">
    <PLANT>
        <COMMON>Bloodroot</COMMON>
        <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
        <ZONE>4</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE CURRENCY="dollar">2.44</PRICE>
        <AVAILABILITY>031599</AVAILABILITY>
    </PLANT>
    <PLANT>
        <COMMON>Columbine</COMMON>
        <BOTANICAL>Aquilegia canadensis</BOTANICAL>
        <ZONE>3</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE CURRENCY="dollar">9.37</PRICE>
        <AVAILABILITY>030699</AVAILABILITY>
    </PLANT>
    <PLANT>
        <COMMON>Goatsbeard</COMMON>
        <BOTANICAL>Tragopogon porrifolius</BOTANICAL>
        <ZONE>4</ZONE>
        <LIGHT>Full Shade</LIGHT>
        <PRICE CURRENCY="euro">6.31</PRICE>
        <AVAILABILITY>080399</AVAILABILITY>
    </PLANT>
</CATALOG>'''

# parse the data into an ElementTree
tree1 = ET.fromstring(plantData)

# find the common name of the first plant 
# in the catalog needing mostly shade
commonName = tree1.findall("./PLANT[LIGHT='Mostly Shady']/COMMON")[0]
commonName.text

'Bloodroot'
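
The predicate above filters on the text of a child element. As a minimal sketch of the other predicate form in the supported subset, the cell below filters on an attribute value instead. The `SHELF` document and every name in it are hypothetical and are not part of the lab data.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document used only for illustration
doc = ET.fromstring('''
<SHELF>
    <BOOK LANG="en"><TITLE>Dune</TITLE><PAGES>412</PAGES></BOOK>
    <BOOK LANG="fr"><TITLE>Candide</TITLE><PAGES>129</PAGES></BOOK>
    <BOOK LANG="en"><TITLE>Emma</TITLE><PAGES>474</PAGES></BOOK>
</SHELF>''')

# [@attrib='value'] keeps only the elements whose attribute matches
english_titles = [t.text for t in doc.findall("./BOOK[@LANG='en']/TITLE")]

# Element text is always a string, so convert explicitly when numbers are needed
english_pages = [int(p.text) for p in doc.findall("./BOOK[@LANG='en']/PAGES")]

# Attribute values are read with .get() on the element itself
languages = [b.get("LANG") for b in doc.findall("./BOOK")]

print(english_titles)
print(english_pages)
print(languages)
```

Note that `findall` returns elements rather than strings, which is why each result is unpacked with `.text` or `.get(...)`.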

### Question 1a

Find the botanical name of every plant in the catalog.

In [23]:
# collect the text of every BOTANICAL element with a single query
botanicalNames = [b.text for b in tree1.findall("./PLANT/BOTANICAL")]
botanicalNames

['Sanguinaria canadensis', 'Aquilegia canadensis', 'Tragopogon porrifolius']

In [24]:
_ = ok.grade('q01a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab13.ipynb'.
Backup... 100% complete
Backup successful for user: yining.jiang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab13/backups/qj2WJG
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Question 1b

Find the common name of every plant in zone 4.

In [25]:
# collect the common name of every plant whose ZONE text is '4'
commonNamesZone4 = [c.text for c in tree1.findall("./PLANT[ZONE='4']/COMMON")]

In [26]:
tree1.findall("./PLANT[ZONE='4']/COMMON")[0].text

'Bloodroot'

In [27]:
_ = ok.grade('q01b')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab13.ipynb'.
Backup... 100% complete
Backup successful for user: yining.jiang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab13/backups/wjk6PM
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Question 1c

Find the price of every plant that is listed in dollars (your answer should be a list of *floats*).

In [None]:
priceInDollars = ...

In [None]:
_ = ok.grade('q01c')
_ = ok.backup()

## Question 2

We can use the `lxml` module for richer and more efficient processing of XML. The syntax for XPath queries is comparable to that of `xml.etree.ElementTree`, except that we use `tree.xpath` instead of `tree.findall`. For example, here is how we recover the `commonName` from Question 1:

In [28]:
from lxml import etree

# parse the data string
tree_example = etree.fromstring(plantData)

# find the common name of the first plant 
# in the catalog needing mostly shade
commonName = tree_example.xpath("./PLANT[LIGHT='Mostly Shady']/COMMON/text()")[0]
commonName

'Bloodroot'
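
To illustrate what "richer" means here, the sketch below uses a few XPath forms that `lxml` accepts and that, to our knowledge, fall outside the standard-library subset: selecting text nodes and attribute values directly, comparing numerically inside a predicate, and calling an XPath function. The `SHELF` document is hypothetical.

```python
from lxml import etree

# Hypothetical mini-document used only for illustration
doc = etree.fromstring('''
<SHELF>
    <BOOK LANG="en"><TITLE>Dune</TITLE><PAGES>412</PAGES></BOOK>
    <BOOK LANG="fr"><TITLE>Candide</TITLE><PAGES>129</PAGES></BOOK>
    <BOOK LANG="en"><TITLE>Emma</TITLE><PAGES>474</PAGES></BOOK>
</SHELF>''')

# text() returns the strings themselves instead of elements
titles = doc.xpath("./BOOK/TITLE/text()")

# @LANG at the end of the path returns the attribute values
langs = doc.xpath("./BOOK/@LANG")

# predicates may compare element content numerically
long_books = doc.xpath("./BOOK[PAGES > 400]/TITLE/text()")

# XPath functions are available; count() yields a float
n_books = doc.xpath("count(./BOOK)")

print(titles, langs, long_books, n_books)
```

Because `text()` and `@LANG` already return string-like values, no `.text` or `.get(...)` step is needed afterwards.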

### Question 2a

XPath expressions for parsing HTML become complicated quickly. Below we collect the HTML for the Wikipedia page [List of lists of lists](https://en.wikipedia.org/wiki/List_of_lists_of_lists). The [Chrome web browser](https://www.google.com/chrome/browser/desktop/index.html) lets you inspect any HTML element on the page (*N.B. other browsers usually let you inspect elements as well, but you may have to change your preferences or look up how to do it*). A sidebar opens on the right side of the browser, where you can right-click the element of interest to copy its path. The image below walks through the two steps:



<img src="copypathfull.png" alt="getting XPath expression from browser"> 



Write or get an XPath expression `expr` that will return the *Lists of unsolved problems* element. Note that we can then get the corresponding link with `tree2.xpath(expr)[0].get("href")`.
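
As a small offline illustration of the `.get("href")` step, the sketch below parses a hypothetical HTML fragment (not the Wikipedia page) and follows a browser-style path to one link. The fragment, its `id`, and the name `expr_demo` are assumptions made for this example.

```python
from lxml import html

# Hypothetical HTML fragment standing in for a downloaded page
snippet = '''
<div id="content">
  <ul>
    <li><a href="/wiki/First" title="First">First list</a></li>
    <li><a href="/wiki/Second" title="Second">Second list</a></li>
  </ul>
</div>'''.strip()

frag = html.fromstring(snippet)

# XPath positions are 1-based, so li[2] is the second list item
expr_demo = '//*[@id="content"]/ul[1]/li[2]/a'
link = frag.xpath(expr_demo)[0]

print(link.text_content())  # visible link text
print(link.get("href"))     # value of the href attribute
```

Paths copied from a browser depend on the exact page layout, so an expression that works today may need adjusting if the page is edited.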

In [37]:
import requests
from lxml import html

page = requests.get('https://en.wikipedia.org/wiki/List_of_lists_of_lists')
tree2 = html.fromstring(page.content)
# XPath copied from the browser's "inspect element" sidebar
expr = '//*[@id="mw-content-text"]/div/ul[1]/li[3]/a'

In [38]:
_ = ok.grade('q02a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab13.ipynb'.
Backup... 100% complete
Backup successful for user: yining.jiang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab13/backups/X6MBmA
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Question 2b

Below we collect the HTML for the Wikipedia page [List of female Olympic gymnastics medalists](https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_gymnastics_%28women%29). Write an XPath expression `expr` such that `tree2.xpath(expr)` returns a list of all the female Olympic medalists in gymnastics. **Hint:** You can get an expression that is very close by using "inspect element" as in Question 2a.

In [None]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_gymnastics_(women)')
tree2 = html.fromstring(page.content)

# write XPath expression here
expr = ''
medalists = tree2.xpath(expr)

### Question 2c

Create a dictionary or series called `medalCounts` counting the number of Olympic medals for each gymnast. How many Olympic medals does Simone Biles have? Does your finding agree with her [Wikipedia page](https://en.wikipedia.org/wiki/Simone_Biles)?
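
One possible way to tally repeated names is sketched below with `collections.Counter`, which is a dictionary subclass. The list `names` is a hypothetical stand-in for scraped results in which a name appears once per medal; it is not real medal data.

```python
from collections import Counter

# Hypothetical scraped names: each appearance represents one medal
names = ["A. Example", "B. Sample", "A. Example", "C. Placeholder", "A. Example"]

# Counter maps each distinct name to the number of times it occurs
counts = Counter(names)

print(counts["A. Example"])   # medals for one name
print(counts.most_common(1))  # the most frequent name with its count
```

If a series is preferred, `pd.Series(names).value_counts()` produces the analogous tally in pandas.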

In [None]:
medalCounts = ...

In [None]:
_ = ok.grade('q02c')
_ = ok.backup()

## Question 3

We will use the `lxml` module to read exchange rates (against the euro) from the European Central Bank and create a time series plot showing how the rates for four currencies have changed over time: the British pound (GBP), the US dollar (USD), the Canadian dollar (CAD), and the Japanese yen (JPY).

Before jumping to the code portion, visit this URL: [European Central Bank XML Format](http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?167971e0f5d2192a5dc29404b0261986).

This URL provides an example of the structure of the XML document with the exchange rates. Where do you see the currency and rate? What about the time? How deep is the tree?


### Question 3a

The above link provides the daily conversion rates. Below we provide the `url` of the recorded history, which covers over 4,800 days. Read in the XML file using `etree`.

In [None]:
# backup link: http://www.stat.berkeley.edu/~nolan/data/ECB2016.xml
url  = 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.xml'
tree3 = ...

print(tree3.tag)

In [None]:
_ = ok.grade('q03a')
_ = ok.backup()

### Question 3b

Notice that the `Envelope` tag comes with a link in braces `{...}`. If you look back at the URL at the start of this question, the `Envelope` tag has an attribute called `xmlns`. This refers to the *namespace*, a unique identifier for the element. The namespace in this case is a reference to `gesmes`, an international standard for the exchange of time series information. Here is how we specify the namespace of an element in XPath:

In [None]:
rates = tree3.xpath('.//x:Cube[@currency = "GBP"]/@rate', 
                   namespaces = {'x':'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'})

The result, `rates`, is a list of the daily exchange rates for British pounds against the euro. Write an XPath query to get the list of dates. It should return a list of the same length as `rates`.
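
To see how the namespace mapping behaves on a small scale, the sketch below parses a hypothetical two-day document shaped like the ECB feed. The structure is an assumption based on the daily file linked above, and the rate values are made up. It shows the braces in `root.tag`, a prefixed query that matches, and an unprefixed query that finds nothing.

```python
from lxml import etree

# Hypothetical sample shaped like the ECB feed; the rates are made up
sample = b'''<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2017-11-02">
      <Cube currency="USD" rate="1.10"/>
      <Cube currency="GBP" rate="0.90"/>
    </Cube>
    <Cube time="2017-11-01">
      <Cube currency="USD" rate="1.20"/>
      <Cube currency="GBP" rate="0.80"/>
    </Cube>
  </Cube>
</gesmes:Envelope>'''

root = etree.fromstring(sample)

# The tag is reported as the namespace URI in braces followed by the local name
print(root.tag)

# The prefix 'x' is arbitrary; only the URI it maps to has to match the document
ns = {'x': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}
usd = root.xpath('.//x:Cube[@currency = "USD"]/@rate', namespaces=ns)

# Without a prefix, the query looks for Cube in no namespace and matches nothing
unprefixed = root.xpath('.//Cube[@currency = "USD"]/@rate')

# Attribute values come back as strings, so convert before doing arithmetic
usd_floats = [float(r) for r in usd]
print(usd, unprefixed, usd_floats)
```

The empty result from the unprefixed query is the usual symptom of a forgotten namespace mapping.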

In [None]:
dates = ...

In [None]:
_ = ok.grade('q03b')
_ = ok.backup()

### Question 3c

Plot the exchange rate over time for three of these currencies: GBP, USD, and CAD.
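
One possible approach, sketched under assumptions, is to put the date strings and rate strings into a pandas `DataFrame` with a datetime index and let pandas draw one line per column. The values and the currency codes `AAA` and `BBB` below are made-up placeholders, not real exchange rates, and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values standing in for XPath query results (all strings)
date_strings = ["2017-11-03", "2017-11-02", "2017-11-01"]
rate_strings = {"AAA": ["1.10", "1.20", "1.30"],
                "BBB": ["0.90", "0.80", "0.70"]}

# Parse the dates, convert the rates to floats, and sort chronologically
index = pd.to_datetime(date_strings)
frame = pd.DataFrame(rate_strings, index=index).astype(float).sort_index()

# One line per column, with the datetime index on the x-axis
ax = frame.plot()
ax.set_xlabel("Date")
ax.set_ylabel("Units of currency per euro")
ax.legend(title="Currency")
```

Sorting the index matters when the source lists the most recent day first, since the lines would otherwise be drawn right to left.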

In [None]:
# make plot here!

## Submission

Run the cell below to run all the OkPy tests at once:

In [None]:
import os
print("Running all tests...")
_ = ok.grade_all()

Now, run the cell below to submit your assignment to OkPy. The autograder should email you shortly with your autograded score. The autograder will only run once every 30 minutes.

**If you're failing tests on the autograder but pass them locally**, you should simulate the autograder by doing the following:

1. In the top menu, click Kernel -> Restart and Run all.
2. Run the cell above to run each OkPy test.

**You must make sure that you pass all the tests when running steps 1 and 2 in order.** If you are still failing autograder tests, you should double check your results.

In [None]:
_ = ok.submit()

Now, run this cell to create a PDF to upload to Gradescope.

In [None]:
!pip install -U gs100
from gs100 import convert
# If your output font size is small, increase the zoom argument. Setting zoom=2
# makes everything twice as big.
convert('lab13.ipynb', zoom=1)

Make sure to upload your PDF now. Otherwise, your written questions won't be graded.