# Parsing structured data

XML and JSON are formats you'll use often when dealing with textual data. Many texts are encoded in XML or one of its specific flavors, such as TEI. XML is the older of the two, and by far the more powerful. It has a long history in print and electronic publishing, and is still commonly used in the production of edited texts, complex websites, and certain computational projects. It's a descendant of SGML and a close relative of HTML, though not a superset of it.

JSON, on the other hand, is simpler, lighter weight, newer, and generally easier to compute with. It's the de facto format for data exchange on the web.

This exercise asks you to ingest, parse, and work with structured data in both XML and JSON formats.

### XML

We'll begin with XML, the more difficult of the two. XML looks a lot like HTML, if you've seen that. The main difference is that, where HTML consists of a fixed set of allowable tags, XML tags can be defined arbitrarily according to a spec of the user's choosing. In practice, we won't do much with this, but you'll often find yourself dealing with other projects' arbitrary XML use. For our purposes, all that matters is that you can figure out -- mostly by human examination -- which tags are used to encode what information.

Here's an example of some XML:

```
<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xmlns="http://www.tei-c.org/ns/1.0"
     xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://www.dlib.indiana.edu/lib/xml/tei/p5/general.xsd"
     xml:id="VAC5615">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Growler's Income Tax </title>
            <author>Arthur, T. S. (1809-1885)</author>
         </titleStmt>
         ...
      </fileDesc>
   </teiHeader>
</TEI>
```

Exciting, no? This is the beginning of the file you'll use in this exercise. You can see that it begins with information about the XML version and schema used (it's TEI, just so you know). Then there's a header, which contains a title statement, which contains title and author info, etc. This is how XML, HTML, and other structured markup languages work: elements are nested inside one another. These are often referred to as parent and child elements. Every child has exactly one parent, but parents may have multiple children and may be children themselves. In the above example, `<title>` is the child of `<titleStmt>`, which is in turn the child of `<fileDesc>`. `<title>` and `<author>` are siblings.
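
To make the parent/child vocabulary concrete, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` rather than Beautiful Soup, and it works on a simplified, namespace-free copy of the header above. The `parent_of` dictionary is a helper built purely for illustration, since ElementTree elements don't keep a reference to their parent.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free version of the TEI header shown above
snippet = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Growler's Income Tax </title>
      <author>Arthur, T. S. (1809-1885)</author>
    </titleStmt>
  </fileDesc>
</teiHeader>
"""

root = ET.fromstring(snippet.strip())

# Map every child element to its parent (illustrative helper, not part of the exercise)
parent_of = {child: parent for parent in root.iter() for child in parent}

title = root.find('.//title')            # first <title> anywhere below the root
title_stmt = parent_of[title]            # its parent element
siblings = [el.tag for el in title_stmt]  # all children of that parent

print(title_stmt.tag)             # parent of <title>
print(parent_of[title_stmt].tag)  # grandparent of <title>
print(siblings)                   # <title> and its sibling(s)
```

Each element appears exactly once as a key in `parent_of`, which reflects the rule that every child has exactly one parent.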

#### Beautiful Soup

Parsing XML in the general case is a pain. You absolutely *do not* want to build a general-purpose XML parser. But you don't need to; other people have done it. One of the most widely used XML parsers for Python is [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). (OK, technically Beautiful Soup is a wrapper for other parsers, but whatever.)

Review two sources of information about BeautifulSoup: the [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) and a [tutorial](http://programminghistorian.org/lessons/intro-to-beautiful-soup) from the Programming Historian. Note that both of these sources cover installation, which is unnecessary if you're using Anaconda (because BeautifulSoup is included therein). Everything you need to know to complete this portion of the exercise is contained in the "Quick Start" section of the BS docs.

Grab a copy of *Growler's Income Tax* by T.S. Arthur in TEI-XML format from the same GitHub directory as this exercise. Save this file in the same directory in which your IPython notebook is saved.

There are a lot of things you can do with BeautifulSoup, all of which depend on parsing an XML or HTML document and then searching or modifying the parsed results. In the present case, you'll need to extract four pieces of information from the XML file in question:

1. The full, library-style author record, in the form "LastName, FirstName (born-died)".
2. The number of paragraphs in the body of the text.
3. The number of words in the body of the text.
4. The full, plain-text (no XML tags) content of the embedded epistle.

To do this, you'll need to know how the XML file is structured. Spend some time looking at it, whether in the GitHub display or in a text editor (but probably not Word, which tends to mess with markup). Each piece of information above can be extracted with one line of code, once you've imported and parsed the XML, though you may certainly use more than one line for each if you prefer. You'll use the `find`, `find_all`, and `get_text` methods of BeautifulSoup, plus other things you've learned as necessary. Potential gotcha: BeautifulSoup's HTML parsers convert all tag names to lowercase, whereas the `"xml"` parser used below preserves their original case (for example, `teiHeader` and `titleStmt`), so match the capitalization to the parser you're using.

Enter your code in the cell or cells below. I've given you the code to import BeautifulSoup and load the XML file, though you may need to change the file name and location to match whatever you've used. Your code should print the four pieces of information listed above.

```python
from bs4 import BeautifulSoup
with open('growler.xml', 'r') as f:
    soup = BeautifulSoup(f, "xml")
```

Uncomment the lines below to print XML and plain-text versions of the text. No need to do this, but it can be useful to verify that the import worked correctly, as well as to see how you go about extracting information from the parsed data.

```python
## Pretty-print the XML, to see that import worked
#print(soup.prettify())

## Print the full plain-text version
print(soup.body.get_text())
```



Growler's income tax

GROWLER'S INCOME TAX.
BY T.S. ARTHUR.
My neighbor Growler, an excitable man by the way, is particularly excited over his Income Tax, or, as he called it, his "War Tax." He had never liked the war—thought it unnecessary and wicked; the work of politicians. The fighting of brother against brother was a terrible thing in his eyes. If you asked him who begun the war?—who struck at the nation's life?—if self defence were not a duty?—he would reply with vague generalities, made up of partisan tricky sentences, which he had learned without comprehending their just significance.
Growler came in upon me the other day flourishing a square piece of blue writing paper, quite moved from his equanimity.
"There it is! Just so much robbery! Stand and deliver, is the word. Pistols and bayonets! Your money or your life!"
I took the piece of paper from his hand and read:





"Philadelphia, Sept., 1863.

"RICHARD GROWLER, ESQ.,
"To JOHN M. RILEY. Dr.
"Collector Internal Revenue fo

Now extract the information required in the problem spec ...

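```python
# Extract the required information
author = soup.author.get_text()               # First <author> element in the document
wc = len(soup.body.get_text().split())        # Whitespace-delimited words in <body>
pars = len(soup.body.find_all('p'))           # Every <p> element nested inside <body>
letter = soup.find(type="letter").get_text()  # First element whose type attribute is "letter"
print("Author:", author)
print("Wordcount:", wc)
print("Body paragraphs:", pars)
print("Letter content:", letter)
```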

```
Author: Arthur, T. S. (1809-1885)
Wordcount: 1670
Body paragraphs: 40
Letter content: 

"Philadelphia, Sept., 1863.

"RICHARD GROWLER, ESQ.,
"To JOHN M. RILEY. Dr.
"Collector Internal Revenue for the 4th District of 
Pennsylvania. Office 427 Chestnut St.


"For Tax on Income, for the year 1862 as per return made to the Assessor of the District, $43,21.

"Rec'd payment,
"JOHN M. RILEY, Collector."
```




### JSON

Here we'll use the HathiTrust Research Center's [extracted features dataset](https://sharc.hathitrust.org/features), which contains page-level, part-of-speech-tagged word counts from 4.8 million public domain volumes held by the HathiTrust digital library. We'll work with a single book, which happens to be volume 4 of Bret Harte's *Collected Works*. You can see (and search) the [full-text scanned copy](http://babel.hathitrust.org/cgi/pt?id=mdp.39076000600655) via HT's reading interface.

Your task is to determine the most frequently occurring noun (singular or plural, common or proper) on the 116th page of this text. For reference, you may want to consult [the relevant page image](http://babel.hathitrust.org/cgi/pt?id=mdp.39076000600655;view=1up;seq=116) in the HT reader. Note that we're interested in the page with sequence number 116, which isn't the same thing as the one with printed page number 116.

This is tricky. Or, more accurately, it's conceptually tedious. The JSON loader reads in the JSON data as a multiply nested dictionary of dictionaries, plus a list of dictionaries, one for each page in the volume. What you need to do is walk that list of page-level dictionaries looking for page 116, then iterate over the tokens on that page, select the nouns, and keep track of which one occurs most often.

For reference, the structure of the data is as follows:

    features
        pages
            header
            footer
            [some info about the page, including the 'seq' key for page sequence number]
            body
                tokenPosCount
                    [actual token, i.e., an individual word form]
                        [PoS tag, e.g., 'NNP' for proper noun]
                            [count, e.g., 3]
                            
Each one of those levels is a key in a dictionary whose value is another dictionary containing the keys for the level below it, and so on. (The one exception is `pages`, whose value is a list with one dictionary per page.) So, when you read in the JSON data, you can address it the way you would any other dictionary, using the relevant keys one after another, like so:
    
    data['features']['pages']
    
... which will yield a list of dictionaries, each containing the feature counts for one page of the volume.
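
As a small illustration of that chained lookup, the sketch below parses a toy JSON string that mimics the layout above. The tokens, counts, and page entries are invented for the example and are not taken from the Harte volume.

```python
import json

# Toy data in the same shape as the HTRC file; all values are made up
toy = json.loads("""
{
  "features": {
    "pages": [
      {"seq": "00000001",
       "body": {"tokenPosCount": {"river": {"NN": 2}, "ran": {"VBD": 1}}}},
      {"seq": "00000002",
       "body": {"tokenPosCount": {"Camp": {"NNP": 3}, "camp": {"NN": 1}}}}
    ]
  }
}
""")

pages = toy['features']['pages']  # a list of page-level dictionaries
second = pages[1]                 # list items are addressed by position, not by key
camp_count = second['body']['tokenPosCount']['Camp']['NNP']

print(len(pages), second['seq'], camp_count)
```

Note that `seq` comes back as a string, and that `Camp` and `camp` are distinct tokens because the keys are case-sensitive.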

Here's the algorithm, in English: Use a `for` loop to iterate over the page-level entries, looking for one with key `'seq'` and value `'00000116'` (note that the value is a string, not an integer). Once you've found page 116, iterate over the `'tokenPosCount'` dictionary within the `'body'` dictionary for that page. For each token in that dictionary, examine each associated part-of-speech tag. If it's a noun tag (that is, one of `NN`, `NNS`, `NNP`, or `NNPS`), look up the value associated with that PoS key, which is the count of occurrences of that word with that PoS tag on that page. If it's the largest count yet seen, record both the word and the count. When you've finished iterating over the tokens on that page, print the word and the count.

You can download the [HTRC JSON data](https://raw.githubusercontent.com/wilkens/course-exercises-f15/master/harte.json) from GitHub. It's too long for GitHub to display prettily, so that link takes you directly to the raw JSON. Save the file to your IPython notebook working directory as above.

I've given you some code to get started. FYI, my answer requires about 10 additional lines of code.

```python
import json

NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')

with open('harte.json', 'r') as g:
    data = json.load(g)  # Parse the JSON input

max_count = 0  # Keep track of largest seen noun count
answer = ''    # Keep track of most frequently occurring noun
for page in data['features']['pages']:  # page is a dictionary of page-level data
    if page['seq'] == '00000116':
        # tokenPosCount maps each token (a string) to a {PoS tag: count} dictionary
        for token, pos_counts in page['body']['tokenPosCount'].items():
            for tag, count in pos_counts.items():
                if tag in NOUN_TAGS:
                    count = int(count)
                    if count > max_count:
                        max_count = count
                        answer = token
                        print('#', token, tag, count)  # Print each successive maximum, FWIW
        break  # No need to keep iterating over pages past 116
print(f"Most common noun on page 116: '{answer}' occurs {max_count} times")  # Print the answer
```

```
# Brace NNP 1
# discovery NN 2
# paper NN 4
Most common noun on page 116: 'paper' occurs 4 times
```
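
Note that the loop above compares counts for each (token, PoS tag) pair separately. If a word could appear under more than one noun tag on the same page, you might prefer to sum its noun counts before comparing. Here is a minimal sketch of that variant using `collections.Counter`. The function name `top_noun` and the toy page are hypothetical and not part of the exercise data.

```python
from collections import Counter

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def top_noun(token_pos_count):
    """Return (token, total) for the token with the highest summed noun count."""
    totals = Counter()
    for token, pos_counts in token_pos_count.items():
        for tag, count in pos_counts.items():
            if tag in NOUN_TAGS:
                totals[token] += int(count)
    # most_common(1) returns a one-item list, or an empty list if no nouns were found
    best = totals.most_common(1)
    return best[0] if best else ('', 0)

# Toy page, invented for illustration (same shape as a 'tokenPosCount' dictionary)
toy_page = {
    'paper': {'NN': 4},
    'Brace': {'NNP': 1, 'NN': 4},
    'read': {'VBD': 6},
}
print(top_noun(toy_page))
```

To try it on the real data, you would pass in the `'tokenPosCount'` dictionary for the page with sequence number 116. Whether the result differs from the per-tag answer depends on the page.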
