# spaCy for Syntactic Dependency Parsing

Class,

We'll now surf the surface of **syntactic dependency parsing** from the Computational Linguistics (CL) area. 

While NLP focuses on the tokens/tags as predictors in machine learning models, CL digs into the **relationships** and links among parts of speech.

Hence, CL looks into token organization and inter-related contexts within sentences using word-to-word grammar relationships which are also known as **dependencies**. 

Dependency is the notion that syntactic units (words) are connected to each other by **directed links**, each of which describes the relationship between the two words it connects (the table below lists some of the most common link types).

<img src="https://i.ibb.co/wCX1g0Z/dep-Parsing.png" title="Dependency Parsing Table" />

Enough theory, let's get concrete with an example. Recall the dependency parse tree we made for our Donald Trump sentence? Let's restart from where we left off. Behold.

In [1]:
# setup chunk
import spacy
from spacy import displacy
import pandas as pd
import en_core_web_sm
nlp = en_core_web_sm.load()

In [2]:
# Repeating the tree parsing routine
from nltk import Tree
def to_nltk_tree(node):
    # a node with children becomes a subtree; a leaf is returned as plain text
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

# print tree for seq of sents if needed
sent0 = "Donald Trump is a controversial American President."
doc = nlp(sent0)
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

            is                       
  __________|________                 
 |  Trump        President           
 |    |      ________|__________      
 .  Donald  a  controversial American



[None]

Note again the classic tree structure above with the root node being the ROOT verb.

### Interpreting the Parse Tree

The ROOT node is the sentence's lynchpin. Without it, no grammatically correct sentence can form.

After the top (ROOT) node, the next level of nodes are the next most important in sentence formation. Then the next level and so on.

A token node above another in the parse tree is the *ancestor* node to the latter's *descendant* node.

So in the above sentence, going up the tree from the leaf nodes, 'Trump' and 'is' are ancestor nodes to 'Donald' whereas 'controversial' is a child node to 'President'.

Below I define a function to parse the dependency tree and yield ancestor or *head nodes* and descendant or *child nodes*. Behold.

In [3]:
## define func for depcy parsing display as DF. Input is a sentence
def depcy_attrib(sent0):

    doc = nlp(sent0)

    # define empty lists to populate
    text = []; pos = []; dep = []; headText = []
    headPos = []; childTokens = []

    # loop over each token & populate attribs into lists
    for token in doc:
        text.append(token.text)
        pos.append(token.pos_)
        dep.append(token.dep_)
        headText.append(token.head.text)
        headPos.append(token.head.pos_)
        childTokens.append(list(token.children))

    # store output as pandas DF
    test_df = pd.DataFrame({'text': text, 'pos': pos, 'dep': dep,
                            'headText': headText, 'headPos': headPos,
                            'childTokens': childTokens})
    return test_df

# test-drive above func on a sample sentence
sent0 = "Donald Trump is a controversial American President."    
depcy_df = depcy_attrib(sent0)
depcy_df

|   | text | pos | dep | headText | headPos | childTokens |
|---|---|---|---|---|---|---|
| 0 | Donald | PROPN | compound | Trump | PROPN | [] |
| 1 | Trump | PROPN | nsubj | is | AUX | [Donald] |
| 2 | is | AUX | ROOT | is | AUX | [Trump, President, .] |
| 3 | a | DET | det | President | PROPN | [] |
| 4 | controversial | ADJ | amod | President | PROPN | [] |
| 5 | American | ADJ | amod | President | PROPN | [] |
| 6 | President | PROPN | attr | is | AUX | [a, controversial, American] |
| 7 | . | PUNCT | punct | is | AUX | [] |


Can you reverse-map the DF table above to the tree? In terms of parent-child nodes? 
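
One way to check your answer: here's a minimal sketch in plain Python (no spaCy needed) that inverts the `headText` column back into parent-child links. The `rows` list is hand-transcribed from the table above, and the helper names (`head_of`, `children_of`, `ancestors`) are made up for illustration. Matching heads by text is an assumption that holds only because every word in this sentence is unique.

```python
# Rows hand-transcribed from the table above: (text, dep, headText)
rows = [
    ("Donald", "compound", "Trump"),
    ("Trump", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("a", "det", "President"),
    ("controversial", "amod", "President"),
    ("American", "amod", "President"),
    ("President", "attr", "is"),
    (".", "punct", "is"),
]

# child -> head lookup (assumes unique words, true for this sentence)
head_of = {text: head for text, dep, head in rows}

# invert the child -> head links to get head -> children
children_of = {text: [] for text, dep, head in rows}
for text, dep, head in rows:
    if dep != "ROOT":  # the ROOT is its own head, so skip it
        children_of[head].append(text)

def ancestors(text):
    """Walk up the head links until we hit the ROOT (its own head)."""
    chain = []
    while head_of[text] != text:
        text = head_of[text]
        chain.append(text)
    return chain

print(children_of["is"])         # ['Trump', 'President', '.']
print(children_of["President"])  # ['a', 'controversial', 'American']
print(ancestors("Donald"))       # ['Trump', 'is']
```

The `children_of` lists reproduce the `childTokens` column, and `ancestors("Donald")` retraces the path up the tree drawn earlier.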

Below, another quick example, for variety's sake.

In [4]:
# another example, from the spaCy vignette
doc0 = "Credit and mortgage account holders must submit their requests"
doc = nlp(doc0)  # annotate sentence
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]  # draw parse tree

     submit                          
  _____|________________________      
 |          holders             |    
 |             |                |     
 |           Credit             |    
 |      _______|_______         |     
 |     |            account  requests
 |     |               |        |     
must  and           mortgage  their  



[None]

If we can *traverse* the tree left-to-right and up-and-down at will, then much can be done, such as:

- detecting phrases (NP and VP)
- detecting the sentence's SUBJECT
- detecting the sentence's OBJECTs
- etc.
    
Let me demo a small example. But first, the depcy_df below.

In [5]:
depcy_df = depcy_attrib(doc0)
depcy_df

|   | text | pos | dep | headText | headPos | childTokens |
|---|---|---|---|---|---|---|
| 0 | Credit | NOUN | nmod | holders | NOUN | [and, account] |
| 1 | and | CCONJ | cc | Credit | NOUN | [] |
| 2 | mortgage | NOUN | compound | account | NOUN | [] |
| 3 | account | NOUN | conj | Credit | NOUN | [mortgage] |
| 4 | holders | NOUN | nsubj | submit | VERB | [Credit] |
| 5 | must | AUX | aux | submit | VERB | [] |
| 6 | submit | VERB | ROOT | submit | VERB | [holders, must, requests] |
| 7 | their | PRON | poss | requests | NOUN | [] |
| 8 | requests | NOUN | dobj | submit | VERB | [their] |


Here's the plan. 

First we ID the SUBJECT in the sentence above. 

Think of the classic SVO (Subject-Verb-Object) framework that works fairly well in English. The SUBJECT typically comes to the *left* of the verb. P.S. Head to the slides a tad quickly to see this in action.

### Using SVO in parse trees

We ID the SUBJECT by looking at the ROOT's immediate descendants (children) to the left of the root verb in the parse tree which are also NOUNs (since the SUBJECT typically tends to be a noun).

Then, once we have the subject pigeonholed, we will try to extract the SVO subset of the parse tree...

Behold.

In [6]:
# first ID the ROOT verb (the ROOT token is its own head)
root = [token for token in doc if token.head == token][0]
print("root is: ", root)  # 'submit'

# now list candidate SUBJECTs: the ROOT's immediate children to its left
subject = list(root.lefts)
print("subjects are: ", subject)  # [holders, must]

# keep only the candidates with pos == NOUN
noun_subjects = [token for token in subject if token.pos_ == 'NOUN']

# take the last match, if any (None when no NOUN sits left of the ROOT)
subject1 = noun_subjects[-1] if noun_subjects else None
print("focal subj is: ", subject1)  # 'holders'

root is:  submit
subjects are:  [holders, must]
focal subj is:  holders
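
The same left-of-the-verb logic extends to the OBJECT on the right. Below is a minimal sketch in plain Python (no spaCy) that pulls out a rough SVO triple. The `rows` are hand-transcribed from the depcy_df for this sentence, and `head_index` is a hypothetical column I added for illustration: the row number of each token's `headText`. Treating "NOUN to the left" as the subject and "dobj to the right" as the object is a simplifying assumption, not a general rule.

```python
# Rows hand-transcribed from depcy_df: (text, pos, dep, head_index)
# head_index is a made-up helper column: the row number of the token's head.
rows = [
    ("Credit", "NOUN", "nmod", 4),
    ("and", "CCONJ", "cc", 0),
    ("mortgage", "NOUN", "compound", 3),
    ("account", "NOUN", "conj", 0),
    ("holders", "NOUN", "nsubj", 6),
    ("must", "AUX", "aux", 6),
    ("submit", "VERB", "ROOT", 6),
    ("their", "PRON", "poss", 8),
    ("requests", "NOUN", "dobj", 6),
]

# the ROOT is the only row whose head is itself
root_i = next(i for i, row in enumerate(rows) if row[3] == i)

# immediate children of the ROOT, split by which side of the verb they sit on
lefts = [i for i, row in enumerate(rows) if row[3] == root_i and i < root_i]
rights = [i for i, row in enumerate(rows) if row[3] == root_i and i > root_i]

# SUBJECT: a NOUN left of the verb; OBJECT: a 'dobj' right of the verb
subj = [rows[i][0] for i in lefts if rows[i][1] == "NOUN"]
obj = [rows[i][0] for i in rights if rows[i][2] == "dobj"]

print(subj, rows[root_i][0], obj)  # ['holders'] submit ['requests']
```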


In [7]:
# now isolate and extract subtree for that focal term only
outp_list = [(descendant.text, descendant.dep_) for descendant in subject1.subtree] 
outp_df = pd.DataFrame(outp_list, columns = ['desc_text', 'desc_depcy'])
outp_df

|   | desc_text | desc_depcy |
|---|---|---|
| 0 | Credit | nmod |
| 1 | and | cc |
| 2 | mortgage | compound |
| 3 | account | conj |
| 4 | holders | nsubj |
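
To see what a subtree walk does conceptually, here's a minimal sketch in plain Python. This is my own illustration of the idea, not spaCy's actual implementation. The `tokens` list pairs each word with the row number of its head (hand-derived from the depcy_df for this sentence), and the `subtree` function here is a hypothetical stand-in that collects a token plus all its descendants in sentence order.

```python
# (text, head_index) pairs; head_index is the row number of the token's head,
# hand-derived from depcy_df for illustration.
tokens = [("Credit", 4), ("and", 0), ("mortgage", 3), ("account", 0),
          ("holders", 6), ("must", 6), ("submit", 6), ("their", 8),
          ("requests", 6)]

def subtree(i):
    """Return index i plus the indices of all its descendants, in sentence order."""
    found = [i]
    for j, (_, head) in enumerate(tokens):
        if head == i and j != i:  # j is a direct child of i
            found.extend(subtree(j))
    return sorted(found)

# subtree rooted at 'holders' (row 4)
print([tokens[j][0] for j in subtree(4)])
# ['Credit', 'and', 'mortgage', 'account', 'holders']
```

The result lines up with the `desc_text` column above: the full noun phrase headed by the subject.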


spaCy thus provides numerous tools to parse syntactic dependency trees, iterate around them and home in on particular features of interest at a sentence level.

P.S. For reference only: going through the documentation for more features and options is a good idea. https://spacy.io/usage/linguistic-features#tokenization

## Visualizing Syntactic Dependencies

Let me arrive at the last few examples from this section next.

Consider the following sentence and its syntactic dependencies.

In [8]:
doc = nlp('I shot an elephant in my pajamas')

# print text, lemma, POS tag and dependency label for each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# jupyter=True renders the parse inline in the notebook
# (displacy.serve would instead start a local web server to open in a browser)
displacy.render(doc, style = 'dep', jupyter = True)

I -PRON- PRON nsubj
shot shoot VERB ROOT
an an DET det
elephant elephant NOUN dobj
in in ADP prep
my -PRON- PRON poss
pajamas pajama NOUN pobj


<img src="https://i.ibb.co/89PdLyJ/spacy-dep-tree.png" title="depTree" />

Neat, eh? Note the directed links (arrows) spanning out from the ROOT (verb) to different syntactic units such as nsubj and dobj.

Can we customize the plot above and prettify it if we wanted to?

Yes, we can. See below.

P.S. https://spacy.io/usage/visualizers

In [9]:
# customizing displacy output
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}

displacy.render(doc, style="dep", jupyter=True, options=options)

In [10]:
# Now trying the 'correct' version of the same sentence below:
doc1 = nlp('I in my pajamas, shot an elephant'); doc1
displacy.render(doc1, style='dep')

Here's the new dependency parse tree. Notice the sorta subtle but powerful difference?

<img src="https://i.ibb.co/9TdQGyy/spacy-dep-tree1.png" title="Title text" />

Well, dassit from me for now. Back to the slides.

Voleti