<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/32_more_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dependency Grammar in spaCy

In this notebook, we'll take a closer look at the way that spaCy represents the syntax of a text. The specific method spaCy uses is a [Dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar), which is a bit different from other ways of representing syntax that you might find in a linguistics course. Instead of using phrase structure rules, all of the words in a sentence are linked in a hierarchical manner, leading back to a head or ROOT verb of each sentence.

Let's explore this a bit by importing spaCy and creating a document.

In [None]:
# import spacy and save the parser to a variable
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# parse a text to create a spaCy doc
parsed_text = nlp('The sea was angry that day my friends. Like an old man sending back soup in a deli.')

## The `.dep_` attribute

Just like each token in a spaCy doc has attributes for pos tags and other information, we can access the dependency relationships of tokens using the `.dep_` attribute.

Consider the output below looking at the dependency of each word in our parsed text:

In [None]:
# print the dependency labels for each token.
for token in parsed_text:
  print(token, '|', token.dep_)

Examine the dependency tags — they might look similar or at least understandable because they have some overlap with part of speech tags (to some degree). For example, `det` is being used for determiners, `prep` is being used for prepositions, and `punct` is being used for punctuation. But we have a lot of other tags that aren't very clear.

Unlike part of speech tags, the dependency tags indicate how each word in the sentence is *related* to another word. The tags thus represent a variety of syntactic features, such as `subject` and `object`, but will also indicate relative clauses, negation, agents, passivization, and a whole lot more. It is beyond the scope of this course / notebook to explain the syntax in great detail, but we can work through some of these relationships below.

spaCy also has a built-in dependency visualizer to look at how these relationships work.

Let's use a smaller example to work with this for now.

In [None]:
# create a smaller example
smaller_example = nlp("Soda found a rabbit.")

Inspect the output below. On the bottom we have each word and its basic part of speech tag. We also have a series of arrows connecting the words to one another. These arrows reflect the syntactic dependencies of the words in the sentence. From this perspective, words "depend" on other words in order to determine both their syntactic *and* their semantic meaning.

In the example below, we can see that the verb `found` is the root source for all of the arrows. One arrow goes left, leading to `Soda`, a noun. The label for the arrow is `nsubj`, which means `Soda` is the nominal subject of the verb. One arrow goes right and points to `rabbit` with the label `dobj`, which means `rabbit` is the direct object of the verb. Then, an arrow goes from `rabbit` to `a`, demonstrating that `a` "depends" on `rabbit`.

In [None]:
# import displacy, the spacy visualizer
from spacy import displacy

# render a parsed Doc
# the 'distance' argument controls the size of the plot
displacy.render(smaller_example, style= 'dep', jupyter = True, options={'distance': 100})

As was lamented in the [top answer to this stackoverflow post](https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean), spaCy previously made it very difficult to understand what the different pos and dep tags actually mean. There is now a nice way to look up their meaning using `spacy.explain()`.



In [None]:
spacy.explain('PROPN')

In [None]:
spacy.explain('AUX')

You can find all the information you need about [each dependency tag in this paper](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf), or even better [use this website.](https://universaldependencies.org/u/dep/index.html) So it's up to us to look up the tags that don't make sense, which may require a bit of remembering what you might think of as "grammar". Sentence subjects, objects, coordinating conjunctions, relative clauses, adverbial clauses, and plenty of other types of syntactic dependencies will spring up with these tags.
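As a quick illustration, `spacy.explain()` works for dependency labels as well as pos tags. The sketch below looks up a handful of the labels mentioned so far. The list of labels is hand-picked for this example rather than pulled from a parsed text, and `descriptions` is just a name chosen here.

```python
import spacy

# a hand-picked list of dependency labels that have come up so far
labels = ["nsubj", "dobj", "det", "prep", "punct"]

# spacy.explain() returns a short description (or None if it doesn't know the label)
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, '|', description)
```

You could swap the hand-picked list for `{token.dep_ for token in parsed_text}` to explain only the labels that actually occur in a text.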

We will stick with simple examples. I want to show you how the parse changes with different word information, just to further belabor the point that word meaning and use depends on other words. Look at the tag and dependencies for `comb` as I go through these three phrases:

1. the comb
2. comb my hair
3. to comb my hair

In [None]:
# THE belongs to COMB
displacy.render(nlp('the comb'), style= 'dep', jupyter = True, options={'distance': 100})

In [None]:
# MY belongs to HAIR, HAIR belongs to COMB
displacy.render(nlp('comb my hair'), style= 'dep', jupyter = True, options={'distance': 100})

In [None]:
# TO belongs to COMB
# MY belongs to HAIR, HAIR belongs to COMB
displacy.render(nlp('to comb my hair'), style= 'dep', jupyter = True, options={'distance': 100})

## Your Turn

Parse some sentences / texts using the dependency display function.

## Finding the `.head` of a word

Note how the arrows link words to other words as we add them. These arrows represent additional information aside from the dependency tag. The tokens also contain attributes which reflect the direction of these arrows, which are described as a token's `head` and its `children`. This means that for any target word, we can find the word the target word *depends* on, as well as any words that depend on the target word.

We can access these attributes using the `.head` and `.children` attributes of a token.

For example, in the output below, we can see that `found` is the head word for all of the other words in the sentence except for `a`, which has `rabbit` as its headword. This matches the arrows that we saw in the displacy figure, which I will repeat here again for your convenience.

Also please take note of the `ROOT` tag. This represents the epicenter of the sentence, suggesting that all words ultimately depend upon this main verb. This is one way of thinking about word relations, but not the only way. However, knowing that all roads lead to a root verb can be useful for processing sentences later on. In the displacy graphics, we don't see ROOT but must infer it from the single word that all arrows lead to.

In [None]:
# print each token, its dependency tag, and its head word
for token in nlp('Soda found a rabbit.'):
  print(token, token.dep_, token.head)

In [None]:
# compare to a visualization of the same thing
displacy.render(nlp('Soda found a rabbit'), style= 'dep', jupyter = True, options = {'distance': 100})
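One way to find the root in code, rather than by eye, is to rely on the fact that in spaCy the root token is its own head. The sketch below is a minimal example under some assumptions: `find_roots` is a hypothetical helper name, and the `Doc` is annotated by hand (spaCy v3 lets you pass `heads` and `deps` to the `Doc` constructor) so the result doesn't depend on which model is installed. The hand-written labels mirror the parse described above.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def find_roots(doc):
    """return every token that is its own head (one per sentence)"""
    return [token for token in doc if token.head.i == token.i]

# a hand-annotated version of our example sentence (an assumption, not model output)
# heads are given as token positions: 'found' (position 1) is its own head
toy_doc = Doc(
    Vocab(),
    words=["Soda", "found", "a", "rabbit", "."],
    heads=[1, 1, 3, 1, 1],
    deps=["nsubj", "ROOT", "det", "dobj", "punct"],
)

for root in find_roots(toy_doc):
    print(root, root.dep_)
```

The same function should work on a parsed text, e.g. `find_roots(nlp('Soda found a rabbit.'))`, and a text with several sentences should give you one root per sentence.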

## Finding the `.children` of a word

The `.children` attribute will return a generator object including all of the words which rely on any one token (i.e., its children). Notice how some words have no children - this makes sense because these are the words that do not have any arrows originating from them in the displacy figure above. This means that we know not only the root of the sentence (the verb) but also the roots of all the subphrases - such as `rabbit` being the head of the noun phrase `a rabbit`.

In [None]:
# print the token and then all the children of the token
for token in nlp('Soda found a rabbit.'):
  # need to use a list comprehension to unpack the generator
  print(token, [child for child in token.children])

## Using `.lefts` and `.rights` to traverse children

We can also specify the direction of the children we want to investigate. By this, I mean whether we want children which occur to the left or the right of the target word. This information is helpful when combined with knowledge of typical English word order (subject, verb, object), in addition to the knowledge that English is written from left-to-right. In other words, we can assume that in a canonical active English sentence, the subject of a verb will be to its left, and objects to the right.

We can find these words using the `.lefts` and `.rights` attributes. These attributes are also generator objects and thus also need to be looped or wrapped in `list()` to examine.

The main difference that we will see below is that we get `Soda` for the verb's `lefts` and `rabbit` for the verb's `rights`.

We also get `a` for `rabbit`'s lefts, because `a` occurs on the left side of `rabbit`, but there are no children to the right of rabbit (because `rabbit` occurs at the end of the sentence!). This makes further sense because in English, determiners and adjectives will occur to the *left* of a noun. All this seems perhaps relatively straightforward, but can be useful for example if you want to check negation, count words, modals, and so on.

In [None]:
# for each word, first print its children to the left, then to the right
for token in nlp('Soda found a rabbit.'):
  print(token, 'lefts:', list(token.lefts))
  print(token, 'rights:', list(token.rights))

## The `.subtree` attribute

Finally, there is the `.subtree` attribute, which will include all of a token's "descendants", which means it will include not just the immediate children of the token, but also its grandchildren, great grandchildren, etc.

When we used `.children` above, we returned `Soda`, `rabbit`, and the full stop all as children of `found`.

When we use `.subtree` below, we still get `Soda`, `rabbit`, and the full stop, but we also get the child of `rabbit` (`a`) and the target token itself (`found`), because a token's subtree is defined to include the token.

In [None]:
# subtree is more comprehensive than children
# note that because FOUND is the root, it shows all the words
for token in nlp('Soda found a rabbit.'):
  print(token, list(token.subtree))

## **Using syntactic dependants as a measure of syntactic complexity**

There are a number of ways you could use this information to process and analyse text. Let's consider how we could obtain a measure of syntactic complexity. Syntactic complexity could be one way to analyse the overall formality and/or difficulty of a text.

For example, compare the two sentences:

`Soda found a rabbit`

`Soda, who is a labradoodle, found a rabbit`

Not only is the second sentence longer, it also has more complex syntax because there is a relative clause (`who is a labradoodle`) which must be correctly parsed and incorporated into the sentence.

We see the complexity with visualization as well.




In [None]:
displacy.render(nlp('Soda found a rabbit'), style= 'dep', jupyter = True, options={'distance': 100})

In [None]:
displacy.render(nlp('Soda, who is a labradoodle, found a rabbit'), style= 'dep', jupyter = True, options={'distance': 100})

So maybe we could define syntactic complexity as the total number of syntactic dependents in any one sentence? One question would be whether or not this is just some form of additional measure of text length. Let's investigate.


In [None]:
# create two sentences with the same length but different syntactic relations
s1 = 'This sentence has eleven words and thus is eleven words long' # a sentence without a relative clause
s2 = 'This sentence, which is eleven words long, is thus more difficult' # a sentence with a relative clause
print(len(s1.split()), len(s2.split()))

In [None]:
# visualize s1
displacy.render(nlp(s1), style= 'dep', jupyter = True, options={'distance': 100})

In [None]:
# visualize s2
displacy.render(nlp(s2), style= 'dep', jupyter = True, options={'distance': 100})

How can we count the number of dependents? One way could be to count how many words have other words relying on them, and how many words in total rely on them (i.e., their total number of dependents). We could then get a measure of average number of dependents, with the argument that more dependents = more complexity.
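Here is one rough way that counting idea could look in code. This is only a sketch under stated assumptions: `avg_dependents` is a hypothetical name, "total dependents" is taken to mean the size of a token's `.subtree` minus the token itself, and the example `Doc` is annotated by hand (spaCy v3 accepts `heads` and `deps` in the `Doc` constructor) so the numbers don't depend on a model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def avg_dependents(doc):
    """average number of descendants, over the tokens that have at least one"""
    counts = []
    for token in doc:
        # .subtree includes the token itself, so subtract 1
        n_descendants = len(list(token.subtree)) - 1
        if n_descendants > 0:
            counts.append(n_descendants)
    if not counts:
        return 0.0
    return sum(counts) / len(counts)

# hand-annotated example (an assumption, not model output)
toy_doc = Doc(
    Vocab(),
    words=["Soda", "found", "a", "rabbit"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "det", "dobj"],
)

# 'found' has 3 descendants and 'rabbit' has 1, so the average is (3 + 1) / 2
print(avg_dependents(toy_doc))
```

To try it on the two eleven-word sentences, you could call `avg_dependents(nlp(s1))` and `avg_dependents(nlp(s2))` and compare.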

We can examine these relationships using the `subtree`, `ancestors`, and `children` attributes for tokens. It might be unclear how these differ from one another, so let's create a function to examine them all, side-by-side.





In [None]:
# create a convenience function to explore the dependency relationships in a sentence

def examine_dependencies(text_input):
  """return different syntactic dependents for a text input"""
  # create spacy doc
  t = nlp(text_input)

  for token in t:
    # create a list of each type of dependency relations
    ancestors = [a for a in token.ancestors]
    children = [c for c in token.children]
    subtree = [s for s in token.subtree]
    # unpack the list in a print statement
    print(f'Token: {token, token.dep_} \n anc: {ancestors} \n child: {children} \n subtree: {subtree}')

In [None]:
# repeated again for convenience
displacy.render(nlp('Soda found a rabbit'), style= 'dep', jupyter = True, options={'distance': 100})

In [None]:
examine_dependencies('Soda found a rabbit')

Going by the output, we can obtain a better understanding of these different relationships by comparing them to the arrows in the displacy figure. Words that *rely* on other words need those other words to make meaning / syntactic phrases. Each phrase and sentence will have a *head* or a *root*. Let's unpack this in light of our sentence output.

**ancestors** provides the words the current token depends upon.
- `Soda` relies upon `found`, because `found` is the root verb.
- `found` doesn't rely on anything, because it is the root verb.
- `a` relies upon `rabbit` because `rabbit` is the head of the NP `a rabbit`,  and `a` also relies upon `found` because `found` is the root verb.
- `rabbit` relies upon `found` because `found` is the root verb.

**children** provides the *immediate* syntactic dependents of a token.
- `Soda` does not have any children because no words immediately rely upon it - it acts as a single word noun phrase.
- `found` has two children: `Soda` and `rabbit` because they are both the *immediate* children of the root verb (they have arrows directly connected to `found`). We see that `a` does *not* show up here because `a` is not an immediate child of found (although `a` does ultimately rely upon `found`).
- `a` does not have any children because no other words rely on `a`
- `rabbit` has one child: `a`, because `rabbit` is the head noun of the noun phrase `a rabbit`

**subtree** returns *all* of the syntactic dependents, including the token itself
- `Soda` only returns itself
- `found` returns the entire sentence because `found` is the root verb, and all arrows can eventually be traced back to `found`
- `a` only returns itself
- `rabbit` returns itself and `a`, because `a` is its syntactic child
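To see that these three views really come from the same information, here is a plain-Python sketch that rebuilds them from nothing more than each word's head. It does not use spaCy at all: the `heads` list is transcribed by hand from the arrows in the figure, and the three function names simply echo the spaCy attributes.

```python
# each word's head, written as a position in the sentence
# (the root points to itself, which is also how spaCy stores it)
words = ["Soda", "found", "a", "rabbit"]
heads = [1, 1, 3, 1]

def ancestors(i):
    """positions of every word that word i depends on, nearest first"""
    found = []
    while heads[i] != i:  # stop once we reach the root
        i = heads[i]
        found.append(i)
    return found

def children(i):
    """positions of the words whose head is word i"""
    return [j for j, h in enumerate(heads) if h == i and j != i]

def subtree(i):
    """word i plus all of its descendants, in sentence order"""
    found = [i]
    for child in children(i):
        found.extend(subtree(child))
    return sorted(found)

for i, word in enumerate(words):
    print(word,
          '| anc:', [words[j] for j in ancestors(i)],
          '| child:', [words[j] for j in children(i)],
          '| subtree:', [words[j] for j in subtree(i)])
```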

Phew, that was a lot, but I wanted to make sure we can wrap our heads around that output. You may have noticed that it's really just different ways of looking at the same thing. Let's try out our function on a few more examples, see if you can understand the output in the same way I've described the sentence above.

In [None]:
examine_dependencies('These pretzels are making me thirsty!')

In [None]:
# check out what happens when we unpack a compound noun
examine_dependencies('Victoria University of Wellington')

All right cool. Now we can continue thinking about what this information might tell us for measuring complexity. If you recall, one of the things I measured in the *Please, please* paper was standard deviation of *noun dependents*. That means I took a measure of variation in how large/small the average number of noun dependants was per text. How could we approximate that measure here?

1. First, we want to find all the nouns which are the head of a phrase
2. Then we'd measure how many dependents they have, using `.children` (because `.children` returns all of the *immediate* dependents). Compare the NPs "the big brown dog" and "the rabbit" below: the children of `dog` are "the big brown", and the children of `rabbit` are "the". So the NP headed by `dog` has more syntactic dependents. That's basically what I measured in the *Please, please* paper, easy!

In [None]:
examine_dependencies('The big brown dog ate the rabbit')

Okay, let's actually write the function to measure noun dependents :) Because we are only concerned with noun phrases, we will use the generic `.pos_` attribute for our tokens to check whether `token.pos_ == 'NOUN'`. This isn't fully accurate because we also want to check if the NOUN is a head, but let's leave that for now and just use this as an example.

If the token is a noun, we will append the length of their children to a list (which will be the number of children for that noun). At the end of the loop, we will return the average of the list by summing the individual lengths and dividing by the total number of all nouns. If a noun has no children, it will append a 0 to the list.

In [None]:
# create function to calculate average noun dependents

def avg_noun_deps(text_input):
  """return average number of dependents per noun in a text"""
  # create spacy doc
  tokens = nlp(text_input)

  # list to store number of children per noun
  n_deps = []

  for token in tokens:
    # use simple pos tag to find the nouns
    if token.pos_ == "NOUN":
      n_childs = [c for c in token.children]
      n_deps.append(len(n_childs)) # add the total number of dependents to the list

  # safety first (don't proceed if the list is empty)
  if n_deps:
    # print(n_deps) # in case you want to check what's happening with the numbers
    return sum(n_deps) / len(n_deps) # return the average
  else:
    print('Sorry, no nouns found')


Test our function out on some Seinfeld quotes.

In [None]:
avg_noun_deps('The sea was angry that day my friends. Like an old man sending back soup in a deli.')

In [None]:
avg_noun_deps("It's not a lie if you believe it")

Let's further test out some good old `state_union` texts :)

In [None]:
# read in state union texts
import nltk
nltk.download('state_union')
from nltk.corpus import state_union

In [None]:
# compare Clinton's last state union with GWB's 2002
clinton = state_union.raw('2000-Clinton.txt')
bush = state_union.raw('2002-GWBush.txt')

In [None]:
avg_noun_deps(clinton)

In [None]:
avg_noun_deps(bush)

As you can see, the averages aren't very different - why do you think that might be? We could then start examining larger sets of text to unpack whether this is a speaker difference or something related to the genre of state of the union speeches. We would also want to run some inferential statistics to see if this seemingly small difference is nonetheless statistically meaningful, but we won't do that here.

We could update our function to return the standard deviation as well as the average. The standard deviation is what I used in the *Please, please* paper.

A higher standard deviation means higher variance in individual values. To calculate the SD, I import the `statistics` module and use the built-in `stdev()` function. I'll go ahead and use the `mean()` function from the package as well (mean is the same as average).
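To make the difference between the mean and the SD concrete, here is a small worked example using made-up counts of noun dependents (the two lists are invented for illustration, not taken from any text). Both lists have the same mean, but only the second one varies.

```python
import statistics

# invented counts of dependents per noun for two imaginary texts
steady = [2, 2, 2, 2]   # every noun has exactly two dependents
uneven = [0, 4, 0, 4]   # some nouns have none, some have four

for name, counts in [('steady', steady), ('uneven', uneven)]:
    avg = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample standard deviation
    print(name, '| mean:', avg, '| sd:', round(sd, 2))
```

Note that `stdev()` needs at least two values, so it will raise an error on a text with only one noun.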


In [None]:
# an updated version of our function
import statistics

def avg_sd_noun_deps(text_input):
  """return average and sd of dependents per head noun in a text"""
  # create spacy doc
  tokens = nlp(text_input)

  # list to store noun children
  n_deps = []

  for token in tokens:
    # use simple pos tag to find the nouns
    if token.pos_ == "NOUN":
      n_childs = [c for c in token.children]
      n_deps.append(len(n_childs))

  # safety first: stdev() needs at least two values
  if len(n_deps) > 1:
    # in case you want to check what's happening with the numbers
    # print(n_deps)
    avg = statistics.mean(n_deps)
    sd = statistics.stdev(n_deps)
    return avg, sd # first number is the average, second is the standard deviation
  else:
    print('Sorry, fewer than two nouns found')


In [None]:
avg_sd_noun_deps(clinton)

In [None]:
avg_sd_noun_deps(bush)

There you have it, Clinton has a higher mean number of syntactic dependents per noun phrase, but also more variation. Bush, on the other hand, has a lower number of deps per noun phrase but is also more consistent. Whether these differences are meaningful, interesting, or worthwhile to explore is a different question!

Now, if we *really* wanted to be precise, we could start looking at the dependency tags themselves and counting specific tags and clauses. For example, look at the category [nominal dependents](https://universaldependencies.org/u/dep/index.html) on the list of universal dependency tags. As you can see, there are several different types of noun modifiers, and we might want to score them based on their complexity. For example, [determiners](https://universaldependencies.org/u/dep/det.html) might be low on the complexity scale, whereas `acl` [might be higher on the scale](https://universaldependencies.org/u/dep/acl.html) because it is, well, more complex!
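As a rough sketch of what scoring by tag could look like, the example below adds up a weight for each dependency label attached to a noun. Everything specific here is an assumption made for illustration: the weights are invented, `TAG_WEIGHTS` and `complexity_score` are hypothetical names, and the labels are typed in by hand rather than read from a parse.

```python
# invented weights: higher means "more complex" (purely illustrative)
TAG_WEIGHTS = {"det": 1, "amod": 2, "nmod": 2, "acl": 3}
DEFAULT_WEIGHT = 1  # used for any label we haven't assigned a weight to

def complexity_score(child_labels):
    """sum the weights of the dependency labels attached to one noun"""
    return sum(TAG_WEIGHTS.get(label, DEFAULT_WEIGHT) for label in child_labels)

# 'the rabbit': one determiner
print(complexity_score(["det"]))

# 'the big brown dog': one determiner and two adjectival modifiers
print(complexity_score(["det", "amod", "amod"]))
```

With a parsed text, the labels for a noun could come from `[child.dep_ for child in token.children]`. Keep in mind that the labels a spaCy model produces may not match the universal dependency list exactly, so check the output before settling on weights.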

Well, I think I'll leave it here. As I've said there are a lot of other things you can do in spaCy, such as named entity recognition, measures of text similarity, measures of sentiment, etc. It's a great library to learn for many NLP tasks. Since you've learned about NLTK, you likely have a better appreciation for spaCy.

## Your Turn

1. Can you apply this approach to calculate the complexity of other texts?
2. What modifications might you want to make to calculate more fine-grained measures of syntactic complexity?
3. Is there anything else you want to know about the dependency grammar? Any ideas for using this for your final projects?

# **Negation**

The ability to find negated words is pretty useful, especially if we recall the way negation can influence sentiment and other lexicon-based approaches. Consider the output below - could this be used in some way to improve on sentiment analysis of adjectives? How?

In [None]:
for token in nlp('I am not happy right now.'):
  print(token,'\t',  token.dep_, '\t', token.pos_)

In [None]:
for token in nlp('I am happy right now.'):
  print(token,'\t',  token.dep_, '\t', token.pos_)

In [None]:
spacy.explain('acomp')
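Here is one possible direction, offered as a sketch rather than the answer. The idea is to find each token labelled `neg`, go up to its head, and collect any adjectival complements (`acomp`) hanging off that same head. The helper name `negated_adjectives` is hypothetical, and the two example `Doc`s are annotated by hand (spaCy v3 accepts `heads` and `deps` in the `Doc` constructor), so the labels are assumptions about what a parser would produce.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def negated_adjectives(doc):
    """return adjectival complements that share a head with a negation"""
    found = []
    for token in doc:
        if token.dep_ == "neg":
            # look at the other words attached to the same head as the negation
            for sibling in token.head.children:
                if sibling.dep_ == "acomp":
                    found.append(sibling.text)
    return found

# hand-annotated examples (assumptions, not model output)
not_happy = Doc(
    Vocab(),
    words=["I", "am", "not", "happy"],
    heads=[1, 1, 1, 1],
    deps=["nsubj", "ROOT", "neg", "acomp"],
)
happy = Doc(
    Vocab(),
    words=["I", "am", "happy"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "acomp"],
)

print(negated_adjectives(not_happy))
print(negated_adjectives(happy))
```

A sentiment function could then flip or discount the score of any adjective this returns. Try it on real parses, e.g. `negated_adjectives(nlp('I am not happy right now.'))`, and see where it breaks down.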

# **Noun chunks**

You can also ask spaCy for all the noun chunks in a text. This is similar to the syntactic dependencies, and yields all of the noun and pronoun chunks. A chunk is the entire phrase, such as `the work`, rather than just the noun `work`. Could this output be used as a way to improve upon the approach above which looked for the number of noun dependents?


In [None]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/grasshopper.txt'
grasshopper = open('grasshopper.txt').read()

In [None]:
[chunk for chunk in nlp(grasshopper).noun_chunks]

## **Your Turn**

* Can you figure out a way to use spaCy's attributes to tweak the noun chunks output to only provide you with chunks that are not pronouns?
* Do you think this procedure could be combined with other analyses, such as lexical resources and/or sentiment analysis? What would be the potential benefit of doing so?

# **Named Entities**

Enough with the syntax already - let's look at something else. One NLP task which is always in demand is the ability to recognise `named entities` in texts. These are different people and organisations. Discovering the entities can be useful to know whether a text is relevant to a topic, as part of information extraction, or doing some sort of coreference resolution, where pronouns such as `she`, `they`, and `he` are replaced with their actual noun referents.

Extracting the entities from a spaCy doc is dead easy - you just need to access the `.ents` attribute, and are rewarded with a tuple of the entities in a text.

In [None]:
nlp('Soda is a dog').ents

In [None]:
nlp("Bill Gates and Steve Jobs are both people that did stuff.").ents

How well does it perform on a longer text?


In [None]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/conversation_satire.txt'
co = open('conversation_satire.txt').read()

In [None]:
nlp(co).ents

## **Your Turn**

- Conceptualise a way that we could use NER to replace pronouns in a text with their referents. What would the program need to do? You can try to attempt making a program on a small text, which would convert:

> 'Dr. Smith went to France on Friday, then she flew home.'

to

> 'Dr. Smith went to France on Friday, then Dr. Smith flew home.'