# Natural Language Parsing

# Stanford Parser Setup

In this lab, we'll explore acquisition of semi-structured data from a web service, browsing the data and then parsing free-text content. 

The data come from Wikipedia, which has a very simple REST API. While the focus is on natural language parsing, we'll also explore JSON and Wikipedia's own markup format.

We'll be using network access for this assignment, so MAKE SURE YOUR LAPTOP NETWORK IS UP before starting the VM. 

Once your VM is up and running, you can download this notebook file by clicking on the icon at the top right of this page. Create a directory ~/labs/lab4 to hold it.

First we need to install some parsing tools. Download the Stanford parser from <a href="https://ufl.instructure.com/files/25797988/download?download_frd=1">here</a>. If your network is not working, download from a browser on your host machine and then use drag-and-drop.

Either way, you can put the parser in your "Downloads" directory. Unpack it with
<pre>
tar xvzf stanfordparser.tar.gz
</pre>

and then move it to the /opt directory with 
<pre>
sudo mv StanfordParser /opt
</pre>

It will be helpful to have links to the parser scripts from your bin directory. If you haven't already, create a directory ~/bin. Then
<pre>
cd ~/bin
ln -s /opt/StanfordParser/lexparser.sh lexparser.sh
ln -s /opt/StanfordParser/lexparser-gui.sh lexparser-gui.sh
ln -s /opt/StanfordParser/dependencyviewer/dependencyviewer.sh dependencyviewer.sh
</pre>

These files will be in your path the next time you log in. You can log out from the start button at the top right of the VM window. Then log back in again.

### Mediawiki Parser Setup

The mediawiki parser is memorably named "mwparserfromhell". To install it with a working network, all you need to do is
<pre>
sudo pip install mwparserfromhell
</pre>

If your network is not working, copy the package source from <a href="https://bcourses.berkeley.edu/courses/1267848/files/51008623/download?wrap=1">here</a>. Untar it, which gives a directory tree starting at "usr". Traverse the directories until you find the "mwparserfromhell" directory. Copy that directory to your Python modules directory with:
<pre>
sudo cp -r mwparserfromhell /usr/local/lib/python2.7/dist-packages
</pre>

### Accessing the Wikipedia Web API

Start by looking over the <a href="http://www.mediawiki.org/wiki/API:Main_page">Mediawiki API documentation</a> which describes Wikipedia's RESTful API. 

The code below implements an API call with options:
* format=json to receive JSON data
* action=query to query Wikipedia content
* titles=string to specify a list of page titles to search for
* prop=revisions to return the revisions of the page
* rvprop=content to return the full page content
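
To see how the options above end up in a single request URL, here is a minimal sketch using only the standard library. The names `options`, `base` and `url` are ours, chosen for illustration, and the sketch uses an https address as an assumption.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2.7, as on the lab VM

# The options from the list above, as (name, value) pairs in a fixed order.
options = [
    ('format', 'json'),
    ('action', 'query'),
    ('titles', 'parsing'),
    ('prop', 'revisions'),
    ('rvprop', 'content'),
]

# Assumed base address of the API endpoint (https variant).
base = 'https://en.wikipedia.org/w/api.php'

# urlencode joins the pairs with '&' and escapes characters that are
# not allowed in a URL, e.g. the space in a title like 'Barack Obama'.
url = base + '?' + urlencode(options)
print(url)
```

Encoding the options this way matters once a title contains spaces or punctuation, which plain string concatenation would pass through unescaped.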

We'll start with the title search string 'parsing' to retrieve the page about parsing.

In [None]:
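import requests
title = 'parsing'
# Let requests build and URL-encode the query string from a dict of options
params = {'format': 'json', 'action': 'query', 'titles': title,
          'prop': 'revisions', 'rvprop': 'content'}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response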

The response object is an HTTP GET response. The requests package includes a JSON decoder, which we can invoke as:

In [None]:
jsondata = response.json()

If you don't have a working network, copy the file <a href="https://bcourses.berkeley.edu/courses/1267848/files/51028371/download?wrap=1">parsing.json</a> into your ~/labs/lab4 directory. Then you can load it with:

In [None]:
import json
# with open('/home/datascience/labs/lab4/parsing.json', 'r') as fp:
#     jsondata = json.load(fp)

The JSON object is hierarchically structured. To view it, it's helpful to define a couple of helper routines:

In [None]:
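import json

def pretty(jdata):
    # Serialize with sorted keys and 4-space indentation for easier reading.
    # decode('string_escape') is Python 2 only: it turns escapes such as \n
    # inside string values back into real characters.
    text = json.dumps(jdata, sort_keys=True, indent=4).decode('string_escape')
    return text

def saveas(sdata, fname):
    # Write the string sdata to the file fname, closing it afterwards.
    with open(fname, 'w') as f:
        f.write(sdata)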

The first routine converts the JSON to a neatly formatted string. The second writes a string to a file. We can use them together to save the JSON data in a more readable format.

In [None]:
saveas(pretty(jsondata), '/home/datascience/labs/lab4/'+title+'.json')
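
To see what the `sort_keys` and `indent` options do on their own, here is a small sketch on made-up data (the `sample` record is invented for illustration and is not real Wikipedia output).

```python
import json

# A small made-up record, shaped loosely like a nested API reply.
sample = {'query': {'pages': {'42': {'title': 'Parsing', 'ns': 0}}}}

# Without indent everything is on one line; with indent=4 each nesting
# level gets its own line and four more spaces of indentation.
compact = json.dumps(sample, sort_keys=True)
indented = json.dumps(sample, sort_keys=True, indent=4)

print(compact)
print(indented)

# Formatting changes only the whitespace: both strings decode to the same data.
print(json.loads(compact) == json.loads(indented))
```

Sorting the keys makes the saved file stable from run to run, which helps when comparing two downloads by eye.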

Now open the file '/home/datascience/labs/lab4/parsing.json' by right-clicking on it and using "open-with" with emacs or gvim. Note the structure.

The JSON parser converts JSON data into nodes and lists of nodes. The nodes are represented as Python dict objects, and the lists are Python lists. Each dict maps the names of the node's children to their values. We can query the type of each node using the type() function, and for each dict we can enumerate the keys using the keys() method. In this way we can explore the JSON tree (although it's much quicker to eyeball it in the JSON file we just saved). We can browse with:

In [None]:
type(jsondata)

In [None]:
jsondata.keys()

which is a list of just one string (a unicode string, hence the "u" prefix). We can then extract that node with

In [None]:
jsondata['query']

and continue exploring:

In [None]:
type(jsondata['query'])

In [None]:
jsondata['query'].keys()
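
Stepping through the tree one cell at a time gets tedious. As a sketch, a small recursive helper can list keys and value types in one go. The function name `outline` and the `sample` data are ours, not part of the lab code.

```python
def outline(node, depth=0, max_depth=3):
    """Return indented lines listing the keys and value types of nested dicts and lists."""
    lines = []
    if depth > max_depth:
        return lines
    pad = '  ' * depth
    if isinstance(node, dict):
        for key in sorted(node):
            lines.append('%s%s: %s' % (pad, key, type(node[key]).__name__))
            lines.extend(outline(node[key], depth + 1, max_depth))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            lines.append('%s[%d]: %s' % (pad, index, type(item).__name__))
            lines.extend(outline(item, depth + 1, max_depth))
    return lines

# Made-up data with the same general shape as a parsed JSON reply.
sample = {'query': {'pages': {'42': {'title': 'Parsing', 'tags': ['a', 'b']}}}}
print('\n'.join(outline(sample)))
```

The `max_depth` limit keeps the listing short; raise it to look deeper into the tree.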

From the pretty-printed file, we know we are looking for the 'pages' child, which has a page id number. We don't know what this number is, so we can't use it as a key. Instead, we can use the values() method on the dictionary to get a list of all the nodes below it. We only need one page, so we take the first of those.

In [None]:
jsondata['query']['pages'].values()[0]
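
The cell above indexes the result of values() directly, which relies on Python 2.7 returning a list. The sketch below shows the same idea on a made-up dictionary (the page id and contents are invented) in a form that should also run under Python 3.

```python
# Made-up stand-in for the 'pages' dictionary: one entry under an id
# that we do not know in advance.
pages = {'30042': {'title': 'Parsing', 'ns': 0}}

# list(...) makes the values indexable under both Python 2.7 and Python 3,
# where dict.values() returns a view rather than a list.
first_page = list(pages.values())[0]

# items() also yields the unknown key, in case it is needed later.
page_id, same_page = list(pages.items())[0]

print('%s %s' % (page_id, first_page['title']))
```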

Continue down the tree, next to the "revisions" node. This time, take the *last* revision in the list.

In [None]:
# content = 

In [None]:
content

The content is now a text string in Mediawiki's own format. To make sense of it we can use mwparserfromhell (MWPH for short).

In [None]:
import mwparserfromhell as mwph
wikicode = mwph.parse(content)

MWPH supports a variety of methods to explore mediawiki content. The main class is the Wikicode class, which is the type returned by mwph.parse(). For example, try

In [None]:
wikicode.filter_comments()

In [None]:
wikicode.filter_headings()

In [None]:
wikicode.filter_wikilinks()

Since we want to parse the English text of the article, we want to ignore all of this markup. MWPH has a method to do this:

In [None]:
text = wikicode.strip_code()

In [None]:
text
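
To get a feel for what stripping markup involves, here is a deliberately rough sketch that handles only wikilinks and quote-style emphasis with regular expressions. It is not how MWPH works and is no substitute for strip_code(); the function name `strip_simple_markup` and the sample sentence are ours.

```python
import re

def strip_simple_markup(wikitext):
    """Rough illustration only: handles [[links]] and ''quote'' emphasis, nothing else."""
    # [[target|label]] -> label
    text = re.sub(r"\[\[[^\[\]|]*\|([^\[\]]*)\]\]", r"\1", wikitext)
    # [[target]] -> target
    text = re.sub(r"\[\[([^\[\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' markers are dropped, keeping the text between them
    text = re.sub(r"'{2,}", "", text)
    return text

# Made-up sentence in wiki markup.
sample = "'''Parsing''' is the analysis of a [[string (computer science)|string]] of [[symbol]]s."
print(strip_simple_markup(sample))
```

Templates, tables, references and nested markup are all ignored here, which is exactly why a real parser is the better tool for whole articles.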

This data is clean enough now that we can save it for parsing:

In [None]:
saveas(pretty(text), '/home/datascience/labs/lab4/'+title+'.txt')
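
The cell above routes the text through pretty(), so the saved file holds a JSON-quoted string. As an alternative sketch, text can also be written directly as UTF-8. The helper `save_text` is hypothetical and not part of the lab code, and whether your parser setup expects UTF-8 input is an assumption to check.

```python
import io
import os
import tempfile

def save_text(text, fname):
    """Write a text string to fname as UTF-8 (hypothetical helper)."""
    with io.open(fname, 'w', encoding='utf-8') as f:
        f.write(text)

# Made-up sample containing a non-ASCII character.
sample = u'A caf\u00e9 is not a cat.\nCats are animals.\n'

# Write into a fresh temporary directory so nothing in lab4 is touched.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
save_text(sample, path)

# Read the file back with the same encoding to confirm it round-trips.
with io.open(path, encoding='utf-8') as f:
    restored = f.read()
print(restored == sample)
```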

### Running the Stanford Parser

From a terminal window, type
<pre>
lexparser-gui.sh
</pre>

This brings up a graphical interface to the Stanford parser. To use it, click on "Load Parser", which brings up a file selection dialog. Navigate to

<pre>
/opt/StanfordParser/stanford-parser-3.4.1-models.jar
</pre>

and open it.

Then you will see a list of parsers to use. Select

<pre>
englishPCFG.ser.gz
</pre>

You're now ready to parse some text!

Click on the "Load File" button, and browse to the lab4 directory and load the parsing.txt file. Click on "Parse" to parse the current sentence (highlighted in yellow). 

### Content Analysis

Now let's try to analyze some content from Wikipedia. To make our lives simpler, we'll use the Simple English version of Wikipedia. Change the URL in the first code box in this file to:
<pre>
simple.wikipedia.org
</pre>
and change the query title to 'cat'. Rerun all the cells above. This should produce a file "cat.txt" in the lab4 directory. Load that file into the parser, and parse some of the sentences. 

If you can't access the network, download the file <a href="https://ufl.instructure.com/files/25797985/download?download_frd=1">cat.json</a> into your lab4 directory, and repeat the commands used earlier to load the parsing.json file.

We'll now convert the parser output to XML, so we can process it further. Find the script
<pre> 
/opt/StanfordParser/lexparser.sh
</pre>
and edit it so that its outputFormat is:
<pre>
-outputFormat "xmlTree"
</pre>
and add a new option:
<pre>
-outputFormatOptions "xml"
</pre>
Save the new script as
<pre>
parsetoxml.sh
</pre>
and create a symbolic link to it in your ~/bin directory, as you did for the other scripts. Now, from your lab4 directory, run
<pre>
parsetoxml.sh cat.txt > cat.xml
</pre>
You're now ready to analyze the cat data. We'll use the ElementTree-style parser from the lxml package.

In [None]:
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse('/home/datascience/labs/lab4/cat.xml',parser)

We can examine the root of this tree:

In [None]:
root=tree.getroot()
root.tag

In [None]:
len(root)

In [None]:
root[0].tag

That is, we have found the first sentence. The xmlTree representation is a little tricky, however, because POS tags are stored as attributes of nodes rather than as node tags. To get to the actual root node, we need to dig a little deeper (we'll use the second sentence, which is a bit more conventional):


In [None]:
root[1][0][0].attrib['value']



Going down one more level gets us to the actual sentence node (the cell below does this for the sentence at index 6):

In [None]:
s=root[6][0][0][0]
s.attrib['value']

and to get its children we can do:

In [None]:
s[:]

This is not too helpful, because the node types are hidden in the value attributes of these nodes. To see them, we can use a Python anonymous function and map it over the list.

In [None]:
list(map(lambda x: x.attrib['value'], s))
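
If you don't have cat.xml at hand, the navigation steps above can be tried on a tiny hand-written document. This is a sketch using the standard library's ElementTree rather than lxml. Only the `node` and `leaf` elements and the `value` attribute are taken from the queries in this lab; the element names `corpus`, `s` and `tree` are assumptions about the file layout.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the parser's XML output. Only 'node', 'leaf' and
# 'value' come from the lab text; 'corpus', 's' and 'tree' are assumed names.
SAMPLE = """<corpus>
  <s id="1">
    <tree>
      <node value="ROOT">
        <node value="S">
          <node value="NP">
            <node value="DT"><leaf value="The"/></node>
            <node value="NN"><leaf value="cat"/></node>
          </node>
          <node value="VP">
            <node value="VBZ"><leaf value="sleeps"/></node>
          </node>
        </node>
      </node>
    </tree>
  </s>
</corpus>"""

sample_root = ET.fromstring(SAMPLE)
top = sample_root[0][0][0]   # sentence -> tree wrapper -> top-level node
clause = top[0]              # one level further down: the sentence node

# POS labels live in the 'value' attribute; the tag itself is just 'node'.
labels = [child.attrib['value'] for child in clause]
tags = [child.tag for child in clause]
print(top.attrib['value'])
print(clause.attrib['value'])
print(labels)
print(tags)
```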

Now let's see if we can find sentences starting with noun phrases containing a given noun. The findall function supports a flexible syntax (similar to XPath) for locating elements of a given type or with given attributes. A slash "/" is like a directory separator and selects a child node. A double slash "//" selects *any* descendant: child, grandchild, great-grandchild, etc. The "node[@value='NP']" specifies a node element whose value attribute is 'NP'.

In [None]:
agent = s.findall("./node[@value='NP']//node[@value='NN']//leaf[@value='cat']")
agent

This finds every 'cat' leaf that lies below an 'NN' node inside an 'NP' child of s.

We can similarly look for a verb in a verb phrase under the root node:

In [None]:
verb = s.findall("./node[@value='VP']//node[@value='VBZ']//leaf[@value='is']")
verb
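
The difference between "/" and "//" is easiest to see side by side. The sketch below uses the standard library's ElementTree, which accepts the same path expressions as the queries above, on a hand-written fragment (not real parser output).

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the node/leaf style queried above.
fragment = ET.fromstring(
    '<node value="S">'
    '<node value="NP">'
    '<node value="DT"><leaf value="The"/></node>'
    '<node value="NN"><leaf value="cat"/></node>'
    '</node>'
    '<node value="VP">'
    '<node value="VBZ"><leaf value="is"/></node>'
    '<node value="NP"><node value="NN"><leaf value="pet"/></node></node>'
    '</node>'
    '</node>'
)

# "./" looks only at direct children; ".//" looks at every depth below.
direct_nps = fragment.findall("./node[@value='NP']")
all_nps = fragment.findall(".//node[@value='NP']")

# Nouns inside the NP that is a direct child, i.e. the subject position.
subject_nouns = fragment.findall("./node[@value='NP']//node[@value='NN']/leaf")

print('%d direct NP, %d NP in total' % (len(direct_nps), len(all_nps)))
print([leaf.attrib['value'] for leaf in subject_nouns])
```

The second noun phrase ('pet') sits inside the verb phrase, so only the descendant query finds it.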

Putting these together, we can discover sentences containing a given (agent, action) pair:

In [None]:

def printnode(node):
    # Print the words (leaf values) of the subtree on one line.
    print(" ".join(leaf.attrib['value'] for leaf in node.findall(".//leaf")))

def testnode(node, agent, action):
    # Match: an NP child containing the noun `agent` and a VP child containing `action`.
    aa = node.findall("./node[@value='NP']//node[@value='NN']//leaf[@value='"+agent+"']")
    bb = node.findall("./node[@value='VP']//leaf[@value='"+action+"']")
    if len(aa) > 0 and len(bb) > 0:
        printnode(node)

def agentact(node, agent, action):
    # Test the node itself, then every S clause nested anywhere below it.
    testnode(node, agent, action)
    for snode in node.findall(".//node[@value='S']"):
        testnode(snode, agent, action)


In [None]:
agentact(s, title, 'is')
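
Printing is fine for browsing, but for further processing it can be handier to collect the matches. The sketch below is a variant that returns the matching clauses as strings. The names `words` and `find_facts` are ours, the sample tree is hand-written, and the standard library's ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

def words(node):
    """Join the leaf values below node into one string."""
    return ' '.join(leaf.attrib['value'] for leaf in node.findall('.//leaf'))

def find_facts(node, agent, action):
    """Return the text of node and of each nested S clause matching (agent, action)."""
    np_path = "./node[@value='NP']//node[@value='NN']//leaf[@value='%s']" % agent
    vp_path = "./node[@value='VP']//leaf[@value='%s']" % action
    clauses = [node] + node.findall(".//node[@value='S']")
    return [words(c) for c in clauses if c.findall(np_path) and c.findall(vp_path)]

# Hand-written tree for "I think the cat is a pet" (not real parser output).
sample = ET.fromstring(
    '<node value="S">'
    '<node value="NP"><node value="PRP"><leaf value="I"/></node></node>'
    '<node value="VP">'
    '<node value="VBP"><leaf value="think"/></node>'
    '<node value="SBAR"><node value="S">'
    '<node value="NP">'
    '<node value="DT"><leaf value="the"/></node>'
    '<node value="NN"><leaf value="cat"/></node>'
    '</node>'
    '<node value="VP">'
    '<node value="VBZ"><leaf value="is"/></node>'
    '<node value="NP">'
    '<node value="DT"><leaf value="a"/></node>'
    '<node value="NN"><leaf value="pet"/></node>'
    '</node>'
    '</node>'
    '</node></node>'
    '</node>'
    '</node>'
)

print(find_facts(sample, 'cat', 'is'))
```

Here the outer sentence does not match, because its subject is "I", but the embedded clause does. That is why the nested S nodes are searched as well as the top-level one.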

Next we can apply the agentact function to all the sentences in the Wikipedia entry:

In [None]:
for nn in root:
    agentact(nn[0][0][0], title, 'is')

## Your Turn!

1. Write code to extract the actual content of the current version of a Wikipedia page.
2. Load the first sentence of the “Parse” Wikipedia article using the Stanford parser GUI. Did it parse correctly? Explain.
3. Modify the given testnode function so that other facts about cats can be extracted. Use the agentact2 function below to test it.
4. Extract facts from these people’s Wikipedia pages:
   - Jim Parsons
   - Barack Obama

Challenge Question

5. Can you write code to automatically extract the following types of facts from a given person’s Wikipedia page?
   - Place of birth
   - Spouse
   - Schools attended

Test your code using Barack Obama’s Wikipedia page.

Hint: you can write a different function for each relation.


In [None]:
title = 'cats'

def agentact2(node, agent, action):
    # testnode2 is the modified testnode that you write for question 3.
    testnode2(node, agent, action)
    for snode in node.findall(".//node[@value='S']"):
        testnode2(snode, agent, action)

for nn in root:
    agentact2(nn[0][0][0], title, 'are')

## Lab Responses

Remember to fill out the lab responses <a href="https://ufl.instructure.com/courses/320501/quizzes/464185">here</a>.