CSCI 4580-5580: Data Science
Lab 4: NLP Tools

NOTE click near here to select this cell, esc-Enter will get you into cell edit mode, shift-Enter gets you back


Name: Sharvita Paithankar

Student ID: 108172438

In this lab we'll explore NLP with the Stanford Parsing suite.

## NOTE:
The Stanford Parser requires the VM you set up in Lab 1. Please revisit that lab (specifically the prelab document) if you run into any issues regarding VM setup. This lab will also be much easier if you have a shared folder set up for your VM. This was an optional step in the Lab 1 Prelab, but it might be worth taking a couple of minutes to revisit the document and set up a shared folder before starting this lab.

# Natural Language Analysis of Content

Here we're going to use a parser to extract some "facts" from natural language. The text is from the simplified wikipedia site: http://simple.wikipedia.org. It has been filtered to find sentences about cats. Download the <b>cat.txt</b> file from Canvas into your lab4 directory. 

## Stanford Parser Setup

Download the Stanford parser from Canvas. If you have already downloaded it, you could use the shared folder to transfer it to Ubuntu.

Unpack it with

<pre>tar xvzf stanfordparser.tar.gz</pre>

and then move it to the /opt directory with

<pre>sudo mv StanfordParser /opt</pre>

It will be helpful to have links to the parser scripts from your bin directory. **If you haven't already, create a directory ~/bin and add it to your path with ```echo 'export PATH=~/bin:$PATH' >> ~/.bashrc``` **
Then
<pre>
cd ~/bin
ln -s /opt/StanfordParser/lexparser.sh lexparser.sh
ln -s /opt/StanfordParser/lexparser-gui.sh lexparser-gui.sh
ln -s /opt/StanfordParser/dependencyviewer/dependencyviewer.sh dependencyviewer.sh
</pre>

These files will be in your path the next time you log in. You can log out from the start button at the top right of the VM window, then log back in again.

## Running the Parser

From a terminal window, type

<pre>lexparser-gui.sh</pre> 
or alternatively 
<pre>~/bin/lexparser-gui.sh</pre>
 **NOTE: if java is not already installed, you can install it with:**
  <pre>sudo apt install default-jre</pre>
 

This brings up a graphical interface to the Stanford parser. To use it, click on "Load Parser", which brings up a file selection dialog. Navigate to

<pre>/opt/StanfordParser/stanford-parser-3.4.1-models.jar</pre>

and open it.

Then you will see a list of parsers to use. Select

<pre>englishPCFG.ser.gz</pre>

You're now ready to parse some text!

Click on the "Load File" button, and browse to the lab4 directory and load the cat.txt file. Click on "Parse" to parse the current sentence (highlighted in yellow).

### NOTE:
The tags used by the parser are explained in more detail [here](https://gist.github.com/nlothian/9240750). The important parts of speech will be noun, verb, and subject. 


> Q1) Generate two parse tree visualizations for any pair of sentences from cat.txt. The tree should show up in the bottom panel of the Stanford Parser when you click Parse. Screenshot the trees and insert the images below ([see Stack Overflow post on adding image to Jupyter notebook](https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o)). Briefly reflect on the similarity/difference in structure between the two parse trees (for example: how are the parts of speech ordered, is one tree deeper/wider than the other, do the sentences seem like they should have similar/different trees but don't, and why?) Make sure to submit the image files along with your notebook when you turn it in!



![title](one.png)


![title](two.png)

#### Add Q1 answer here


The first sentence is very simple, so its tree is smaller in both height and width than the second tree, whose sentence is more complicated. Both trees preserve word order: reading the leaves from left to right reproduces the sentence. The trees also differ because the first has fewer words and no "LRB" node, whereas the second has one.



## Parsing to XML

We'll parse the cat sentence file to XML. To do this, we'll make a customized version of the parser script. Copy the file:

<pre>/opt/StanfordParser/lexparser.sh</pre>

and save it as:

<pre>/opt/StanfordParser/parsetoxml.sh</pre>

Edit it so that its outputFormat is:

<pre>-outputFormat "xmlTree"</pre>

and add a new option:

<pre>-outputFormatOptions "xml"</pre>

and create a symbolic link to parsetoxml.sh in your ~/bin directory.
<pre>cd ~/bin</pre>
<pre>ln -s /opt/StanfordParser/parsetoxml.sh parsetoxml.sh</pre>

Now run from your lab4 directory

<pre>parsetoxml.sh cat.txt > cat.xml</pre>

You're now ready to analyze the cat data. We'll use the ElementTree-style `etree` parser from the lxml package.

## Working with the XML

You can now copy the cat.xml file out of the VM and into the same directory as this notebook (or a different directory if you prefer, **just be sure to change the path to the xml file below!**)

In [14]:
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse('/Users/sharvitapaithankar/Desktop/Senior Year/Data Science/Lab 4/cat.xml',parser) # fix this path if you put the file somewhere else

We can examine the root of this tree:

In [15]:
root=tree.getroot()
root.tag

'corpus'

In [16]:
len(root)

213

In [17]:
root[0].tag

's'

That is, we have found the first sentence. The xmlTree representation is a little tricky, however, because POS tags are stored as attributes of nodes rather than as node tags. To get to the actual root node, we need to dig a little deeper (and we'll use the second sentence, which is a bit more conventional):

In [18]:
root[1][0][0].attrib['value']

'ROOT'

Going down one level gets us to the actual sentence node (this time in the seventh sentence, `root[6]`):

In [19]:
s=root[6][0][0][0]
s.attrib['value']

'S'

and to get its children we can do:

In [20]:
s[:]

[<Element node at 0x109d02e40>,
 <Element node at 0x109d02600>,
 <Element node at 0x109d02880>]

This is not too helpful, because the node types are hidden in the value attributes of these nodes. To see them, we can use a Python anonymous function and map it over the list.

In [21]:
list(map(lambda x: x.attrib['value'], s[:]))
# or if you prefer list comprehensions: nodes = [x.attrib['value'] for x in s[:]]

['NP', 'VP', '.']
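
The indexing steps above can be reproduced on a small hand-written tree. This is only a sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra installs, the sentence is made up, and the `<tree>` wrapper tag is a hypothetical name (the notebook only shows that one element sits between `s` and the ROOT node).

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one parsed sentence. The <tree> wrapper tag is a
# hypothetical name; only the nesting depth matters for the indexing below.
TOY_XML = """
<corpus>
  <s>
    <tree>
      <node value="ROOT">
        <node value="S">
          <node value="NP">
            <node value="DT"><leaf value="A"/></node>
            <node value="NN"><leaf value="cat"/></node>
          </node>
          <node value="VP">
            <node value="VBZ"><leaf value="sleeps"/></node>
          </node>
          <node value="."><leaf value="."/></node>
        </node>
      </node>
    </tree>
  </s>
</corpus>
"""

toy_root = ET.fromstring(TOY_XML)

# corpus -> first <s> -> wrapper -> ROOT node
top = toy_root[0][0][0]

# ROOT -> S (the actual sentence node)
sentence = top[0]

# The grammatical labels live in the "value" attribute, not in the tag name.
child_tags = [child.tag for child in sentence]
child_labels = [child.attrib["value"] for child in sentence]

print(child_tags)    # ['node', 'node', 'node']
print(child_labels)  # ['NP', 'VP', '.']
```

Every internal element has the tag `node`, which is why printing the elements themselves is uninformative and the `value` attribute has to be read instead.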

Now let's see if we can find sentences starting with noun phrases containing a given noun. The findall function supports a flexible syntax (similar to XPath) for locating elements of a given type or with given attributes. A slash "/" is like a directory separator and selects a child node. A double slash "//" selects any descendant: child, grandchild, great-grandchild, etc. The "node[@value='NP']" selects a node with the given attribute value.

In [22]:
agent = s.findall("./node[@value='NP']//node[@value='NN']//leaf[@value='cat']")
agent[0].attrib

{'value': 'cat'}

This finds all the leaves with value 'cat' that sit under an 'NN' node inside an 'NP' child of s.

We can similarly look for a verb in a verb phrase under the root node:


In [23]:
verb = s.findall("./node[@value='VP']//node[@value='VBZ']//leaf[@value='is']")
verb[0].attrib

{'value': 'is'}
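
The difference between "/" and "//" in the path syntax described above can be seen on a toy clause. This is a minimal sketch, assuming the standard library's `xml.etree.ElementTree` (which accepts the same path expressions used here) and a made-up sentence, "The cat chases a mouse", that is not taken from cat.txt.

```python
import xml.etree.ElementTree as ET

# Toy clause: S -> NP(The cat) VP(chases NP(a mouse)). Invented for illustration.
clause = ET.fromstring(
    '<node value="S">'
    '<node value="NP">'
    '<node value="DT"><leaf value="The"/></node>'
    '<node value="NN"><leaf value="cat"/></node>'
    '</node>'
    '<node value="VP">'
    '<node value="VBZ"><leaf value="chases"/></node>'
    '<node value="NP">'
    '<node value="DT"><leaf value="a"/></node>'
    '<node value="NN"><leaf value="mouse"/></node>'
    '</node>'
    '</node>'
    '</node>'
)

# "./node[...]" looks only at direct children of the clause.
direct_nps = clause.findall("./node[@value='NP']")

# ".//node[...]" looks at all descendants, so the object NP inside the VP counts too.
all_nps = clause.findall(".//node[@value='NP']")

# Nouns under the subject NP only, versus nouns anywhere in the clause.
subject_nouns = [leaf.attrib["value"]
                 for leaf in clause.findall("./node[@value='NP']//node[@value='NN']/leaf")]
all_nouns = [leaf.attrib["value"]
             for leaf in clause.findall(".//node[@value='NN']/leaf")]

print(len(direct_nps), len(all_nps))  # 1 2
print(subject_nouns, all_nouns)       # ['cat'] ['cat', 'mouse']
```

Starting the path with "./node[@value='NP']" is what restricts the agent search to the subject position; a leading ".//" would also match nouns in the object of the verb.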

Putting these together, we can discover sentences containing a given (agent, action) pair:

In [24]:
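def printnode(node):
    # Print every word (leaf) under this node, one token per line.
    for i in node.findall(".//leaf"):
        print(" " + i.attrib['value'])
    print('')

def testnode(node, agent, action):
    # Agent: an NN leaf somewhere inside an NP that is a direct child of this node.
    aa = node.findall("./node[@value='NP']//node[@value='NN']//leaf[@value='"+agent+"']")
    # Action: any leaf somewhere inside a VP that is a direct child of this node.
    bb = node.findall("./node[@value='VP']//leaf[@value='"+action+"']")
    if (len(aa) > 0 and len(bb) > 0):
        printnode(node)

def agentact(node, agent, action):
    # Test the node itself, then every clause (S node) nested beneath it.
    testnode(node, agent, action)
    snodes = node.findall(".//node[@value='S']")
    for snode in snodes:
        testnode(snode, agent, action)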

In [25]:
title = 'cat'
agentact(s, title, 'is')

 A
 young
 cat
 is
 called
 a
 kitten
 .



Next we can apply the agentact function to all the sentences in the Wikipedia entry

In [None]:
[agentact(nn[0][0][0], title, 'is') for nn in root]
[] # hide the return value
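
As a possible variant of the cells above, matches can be collected as strings instead of being printed one token per line. This is a sketch, not the lab's reference solution: the helper names `sentence_text`, `matches`, and `find_agent_actions` are hypothetical, it uses the standard library's `xml.etree.ElementTree`, and the two-sentence corpus (with a hypothetical `<tree>` wrapper tag) is invented for illustration.

```python
import xml.etree.ElementTree as ET

def sentence_text(node):
    """Join all words (leaf values) under a node into one string."""
    return " ".join(leaf.attrib["value"] for leaf in node.iter("leaf"))

def matches(node, agent, action):
    """Same test as testnode: agent noun in the subject NP, action word in the VP."""
    agents = node.findall(
        "./node[@value='NP']//node[@value='NN']//leaf[@value='%s']" % agent)
    actions = node.findall("./node[@value='VP']//leaf[@value='%s']" % action)
    return bool(agents) and bool(actions)

def find_agent_actions(corpus, agent, action):
    """Return the text of every top-level or nested clause that matches."""
    found = []
    for sent in corpus:
        top = sent[0][0][0]  # s -> wrapper -> ROOT -> S, as in the notebook
        for clause in [top] + top.findall(".//node[@value='S']"):
            if matches(clause, agent, action):
                found.append(sentence_text(clause))
    return found

# Invented two-sentence corpus; <tree> is a hypothetical wrapper tag.
toy_corpus = ET.fromstring(
    '<corpus>'
    '<s><tree><node value="ROOT"><node value="S">'
    '<node value="NP"><node value="DT"><leaf value="A"/></node>'
    '<node value="NN"><leaf value="cat"/></node></node>'
    '<node value="VP"><node value="VBZ"><leaf value="is"/></node>'
    '<node value="ADJP"><node value="JJ"><leaf value="small"/></node></node></node>'
    '<node value="."><leaf value="."/></node>'
    '</node></node></tree></s>'
    '<s><tree><node value="ROOT"><node value="S">'
    '<node value="NP"><node value="NNS"><leaf value="Dogs"/></node></node>'
    '<node value="VP"><node value="VBP"><leaf value="bark"/></node></node>'
    '<node value="."><leaf value="."/></node>'
    '</node></node></tree></s>'
    '</corpus>'
)

results = find_agent_actions(toy_corpus, "cat", "is")
print(results)  # ['A cat is small .']
```

Returning a list avoids building a throwaway list of `None` values just to trigger printing, and it makes the matches easy to count or compare across verbs.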

> Q2) Copy the code from the previous cell to the next cell and change the verb to something other than "is" that returns a few sentences. Can you find any sentences that share similar meaning based on their verb alone? Or completely different meaning? Write a brief sentence in a comment about what this could mean for an NLP model and the importance of having enough data.

Answer: The sentences share a similar meaning because they all describe what a cat can do. Because the sentences are simple, the NLP model handles them well. When the sentences are more complicated, some of the results do not make much sense.

> Q3) Finish the testnode2 function in the cell below so that it returns sentences in which the given adjective (JJ) appears. You will need to check for plural nouns (NNS) in addition to singular nouns, which requires a new search with a leaf node of "cats" instead of "cat". Try a few different adjectives (e.g., wild, domestic, brown). Not all adjectives will return results, and you can always inspect the parse tree in the Stanford parser to check for available adjective-noun pairs. Do the sentences you see make sense? Now try the adjective "dry". Is cat/cats still the subject of the sentences you see returned? If not, what is the subject of the sentence? Does this suggest anything to you about the nuances of languages and how they should be modeled? Write 2-3 sentences in a comment about your observations.

Answer: The sentences make sense when the word "cats" is used. When the word "brown" is used, the sentences do not make sense. With "dry", the sentences do not make sense either, but they are still on the topic of cats. Because English can have complicated sentences, the model does not work very well on them and so does not serve its purpose. For simple sentences, this model can work well.


In [37]:
# Q2 code here
[agentact(nn[0][0][0], title, 'can') for nn in root]
[] # hide the return value

 The
 cat
 creeps
 towards
 a
 chosen
 victim
 ,
 keeping
 its
 body
 flat
 and
 near
 to
 the
 ground
 so
 that
 it
 can
 not
 be
 seen
 easily
 ,
 until
 it
 is
 close
 enough
 for
 a
 rapid
 dash
 or
 pounce
 .

 The
 cat
 's
 tongue
 can
 act
 as
 a
 hairbrush
 and
 can
 clean
 and
 untangle
 a
 cat
 's
 fur
 .

 a
 cat
 gets
 fleas
 because
 fleas
 can
 make
 cats
 uncomfortable



[]

In [36]:
def testnode2(node, agent, modifier):
    # Q3 code here:
    # Note: every search below only matches NN leaves nested under VP -> NP -> NP.
    # No JJ or NNS nodes are queried, and cc reads the global `title`, not a parameter.

    cc = node.findall("./node[@value='VP']//node[@value ='NP']//node[@value ='NP']//node[@value ='NN']//leaf[@value='"+title+"']")
    dd = node.findall("./node[@value='VP']//node[@value ='NP']//node[@value ='NP']//node[@value ='NN']//leaf[@value='"+modifier+"']")
    ee = node.findall("./node[@value='VP']//node[@value ='NP']//node[@value ='NP']//node[@value ='NN']//leaf[@value='"+agent+"']")
    if(len(cc) > 0 or len(dd) > 0) and len(ee) > 0:
        printnode(node)
    
def agentact2(node, agent, modifier):
    testnode2(node, agent, modifier)
    snodes = node.findall(".//node[@value='S']")
    for snode in snodes:
        testnode2(snode, agent, modifier)
        
list(map(lambda nn: agentact2(nn[0][0][0], title, 'dry'), root))
[]

 an
 entire
 male
 cat
 is
 a
 tom.Entire
 means
 a
 female
 cat
 that
 is
 not
 spayed
 ,
 and
 a
 male
 cat
 that
 is
 not
 neutered
 ,
 leaving
 either
 able
 to
 reproduce

 means
 a
 female
 cat
 that
 is
 not
 spayed
 ,
 and
 a
 male
 cat
 that
 is
 not
 neutered
 ,
 leaving
 either
 able
 to
 reproduce

 This
 helps
 to
 explain
 the
 cat
 's
 spinal
 mobility
 and
 flexibility
 .

 to
 explain
 the
 cat
 's
 spinal
 mobility
 and
 flexibility

 Behaviour
 thumb
 |
 right
 |
 200px
 |
 The
 cat
 on
 the
 right
 is
 fed
 up
 with
 the
 cat
 on
 the
 left

 They
 are
 used
 between
 a
 mother
 cat
 and
 her
 kittens
 .

 This
 is
 because
 a
 male
 cat
 's
 penis
 has
 a
 band
 of
 about
 120-150
 backwards-pointing
 spines
 ,
 which
 are
 about
 one
 millimeter
 long
 ;
 upon
 withdrawal
 of
 the
 penis
 ,
 the
 spines
 rake
 the
 walls
 of
 the
 female
 's
 vagina
 ,
 which
 is
 a
 triggerTrigger
 :
 in
 the
 sense
 of
 an
 event
 which
 starts
 other
 events
 .

 Your
 next
 jo

[]
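
One possible way to approach the adjective search that Q3 describes is to look for a noun phrase that directly contains both the adjective (JJ) and the noun in singular (NN) or plural (NNS) form. The sketch below is an illustration under stated assumptions, not the expected lab answer: the function name `noun_with_adjective` is hypothetical, the plural is formed naively by appending "s", it uses the standard library's `xml.etree.ElementTree`, and both toy clauses are invented rather than taken from cat.txt.

```python
import xml.etree.ElementTree as ET

def noun_with_adjective(node, noun, adjective):
    """True if some NP under `node` directly contains JJ=adjective and the noun.

    The noun may appear as NN (singular) or as NNS with a naive "+s" plural.
    """
    plural = noun + "s"  # naive pluralization; an assumption that fits "cat"
    for phrase in node.iter("node"):
        if phrase.attrib.get("value") != "NP":
            continue
        has_adj = phrase.findall("./node[@value='JJ']/leaf[@value='%s']" % adjective)
        has_noun = (
            phrase.findall("./node[@value='NN']/leaf[@value='%s']" % noun)
            or phrase.findall("./node[@value='NNS']/leaf[@value='%s']" % plural)
        )
        if has_adj and has_noun:
            return True
    return False

# Toy clause 1: "wild cats hunt" (the adjective modifies the noun of interest).
wild_cats = ET.fromstring(
    '<node value="S">'
    '<node value="NP"><node value="JJ"><leaf value="wild"/></node>'
    '<node value="NNS"><leaf value="cats"/></node></node>'
    '<node value="VP"><node value="VBP"><leaf value="hunt"/></node></node>'
    '</node>'
)

# Toy clause 2: "cats eat dry food" (the adjective modifies a different noun).
dry_food = ET.fromstring(
    '<node value="S">'
    '<node value="NP"><node value="NNS"><leaf value="cats"/></node></node>'
    '<node value="VP"><node value="VBP"><leaf value="eat"/></node>'
    '<node value="NP"><node value="JJ"><leaf value="dry"/></node>'
    '<node value="NN"><leaf value="food"/></node></node></node>'
    '</node>'
)

print(noun_with_adjective(wild_cats, "cat", "wild"))  # True
print(noun_with_adjective(dry_food, "cat", "dry"))    # False
print(noun_with_adjective(dry_food, "food", "dry"))   # True
```

Requiring the adjective and the noun to share the same NP is what separates "wild cats" from a sentence in which "dry" merely co-occurs with "cats" while describing something else.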

# Lab Responses

Upload your ipython notebook on Canvas under Lab4 on Thursday, 10/1/2020 by 11:59pm.