Text Annotation with spaCy
======================

This chapter introduces workshop participants to the general field of natural language processing, or NLP. While NLP is often used interchangeably with text mining/analytics in introductory settings, the former differs in important ways from many of the core methods in the latter. We will highlight a few such differences over the course of this session, and then more generally throughout the workshop series as a whole.

```{admonition} Learning objectives
By the end of this chapter, you will be able to:
+ Explain how document annotation differs from other representations of text data
+ Have a general sense of how `spaCy` models and their respective pipelines work
+ Extract linguistic information about text using `spaCy`
+ Describe key terms in NLP, like part-of-speech tagging, dependency parsing, etc.
+ Know how/where to look for more information about the linguistic data `spaCy` makes available
```

NLP vs. Text Mining: In Brief
---------------------------------

The short space of this reader, as well as that of our series, necessarily limits a conversation about everything that characterizes NLP from text mining/analytics. That there are differences at all is in itself worth noting and merits further exploration. But for the purposes of this series's focus, there are two such instances of these differences that are especially worth calling out.

### Data structures

At the outset, one way of distinguishing NLP from text mining has to do with NLP's underlying **data structure**. Generally speaking, NLP methods are **maximally preservative** when it comes to representing textual information in a way computers can read. Unlike text mining's atomizing focus on **bags of words**, in NLP we often use literal transcriptions of the input text and run our analyses directly on that. This is because much of the information NLP methods provide is context-sensitive: we need to know, for example, the subject of a sentence in order to do dependency parsing; part-of-speech taggers are most effective when they have surrounding tokens to consider. Accordingly, our workflow needs to retain as much information about our documents as possible, for as long as possible. In fact, many NLP methods _build on each other_, so data about our documents will grow over the course of processing them (rather than getting pared down, as with text mining). The dominant paradigm, then, for thinking about how text data is represented in NLP is **annotation**: NLP tends to add, associate, or tag documents with extra information.

### Model-driven methods

The other key difference between text mining and NLP – which goes hand-in-hand with the idea of annotation – lies in the fact that the latter tends to be more **model-driven**. NLP methods often rely on statistical models to create the above information, and ultimately these models have a lot of assumptions baked into them. Such assumptions range from philosophy of language (how do we know we're analyzing meaning?) to the kind of training data on which they're trained (what does the model represent, and what biases might thereby be involved?). Of course, it's possible to build your own models, and indeed a later chapter will show you how to do so, but you'll often find yourself using other researchers' models when doing NLP work. It's thus very important to know how researchers have built their models so you can do your own work responsibly.

```{admonition} Keep in mind
Throughout this series, we will be using NLP methods in the context of text-based data, but NLP applies more widely to speech data as well.
```

spaCy Language Models
-----------------------------

```{margin} Want to know more?
Explosion, the company behind `spaCy`, has a series of useful videos introducing the framework. [This one] is a good place to start.

[This one]: https://www.youtube.com/watch?v=9k_EfV7Cns0&t=1365s
```

Much of this workshop series will use language models from `spaCy`, one of the most popular NLP libraries in Python. `spaCy` is both a framework and a model resource. It offers access to models through a unique set of coding workflows, which we'll discuss below (you can also train your own models with the library). Learning about these workflows will help us add annotate documents with extra information that will, in turn, enable us to perform a number of different NLP tasks.

### `spaCy` pipelines

In essence, a `spaCy` model is a collection of sub-models arranged into a **pipeline**. The idea here is that you send a document through this pipeline, and the model does the work of annotating your document. Once it has finished, you can access these annotations to perform whatever analysis you'd like to do.

![](img/spacy_pipeline.png)

Every component, or **pipe**, in a `spaCy` pipeline performs a different task, from tokenization to part-of-speeching tagging and named-entity recognition. Each model comes with a specific ordering of these tasks, but you can mix and match them after the fact, adding or removing pipes as you see fit. The result is a wide set of options; the present workshop series only samples a few core aspects of the library's overall capabilities.

### Downloading a model

The specific model we'll be using is `spaCy`'s medium-sized English model: [en_core_web_md]. It's been trained on the [OntoNotes] corpus and it features several useful pipes, which we'll discuss below.

If you haven't used `spaCy` before, you'll need to download this model. You can do so by running the following in a command line interface:

```sh
python -m spacy download en_core_web_md
```

Just be sure you run this while working in the Python environment you'd like to use!

Once this downloads, you can load the model with the code below. Note that it's conventional to assign the model to a variable called `nlp`.

[en_core_web_md]: https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.3.0
[OntoNotes]: https://catalog.ldc.upenn.edu/LDC2013T19

In [1]:
import spacy

nlp = spacy.load('en_core_web_md')

Annotations
--------------

With the model loaded, we can send a document through the pipeline, which will in turn produce our text annotations. To annotate a document with the `spaCy` model, simply run it through the core function, `nlp()`. We'll do so with a short poem by Gertrude Stein.

In [2]:
with open('data/session_one/stein_carafe.txt', 'r') as f:
    stein_poem = f.read()
    
carafe = nlp(stein_poem)

With this done, we can inspect the result...

In [3]:
carafe

A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing. All this and not ordinary, not unordered in not resembling. The difference is spreading.

...which seems to be no different from a string representation! This output is a bit misleading, however. Our `carafe` object actually has a ton of extra information associated with it, even though, on the surface, it appears to be a plain old string.

If you'd like, you can inspect all these attributes and methods with:

In [4]:
attributes = [i for i in dir(carafe) if i.startswith("_") is False]

We won't show them all here, but suffice it to say, there are a lot!

In [5]:
print("Number of attributes in a SpaCy doc:", len(attributes))

Number of attributes in a SpaCy doc: 51


This high number of attributes indicates an important point to keep in mind when working with `spaCy` and NLP generally: as we mentioned before, the primary data model for NLP aims to **maximally preserve information** about your document. It keeps documents intact and in fact adds much more information about them than Python's base string methods have. In this sense, we might say that `spaCy` is additive in nature, whereas text mining methods are subtractive, or reductive.

### Document Annotations

So, while the base representation of `carafe` looks like a string, under the surface there are all sorts of annotations about it. To access them, we use the attributes counted above. For example, `spaCy` adds extra segmentation information about a document, like which parts of it belong to different sentences. We can check to see whether this information has been attached to our text with the `.has_annotation()` method.

In [6]:
carafe.has_annotation('SENT_START')

True

We can use the same method to check for a few other annotations:

In [7]:
annotation_types = {'Dependencies': 'DEP', 'Entities': 'ENT_IOB', 'Tags': 'TAG'}
for a, t in annotation_types.items():
    print(
        f"{a:>12}: {carafe.has_annotation(t)}"
    )

Dependencies: True
    Entities: True
        Tags: True


Let's look at sentences. We can access them with `.sents`.

In [8]:
carafe.sents

<generator at 0x1255c5728>

...but you can see that there's a small complication here: `.sents` returns a generator, not a list. The reason has to do with memory efficiency. Because `spaCy` adds so much extra information about your document, this information could slow down your code or overwhelm your computer if the library didn't store it in an efficient manner. Of course this isn't a problem with our small poem, but you can imagine how it could become one with a big corpus.

To access the actual sentences in `carafe`, we'll need to convert the generator to a list.

```{margin} Want to learn more about generators?
The DataLab has a workshop about them. See [this link].

[this link]: https://datalab.ucdavis.edu/eventscalendar/intermediate-python-iterator-generator-crash-course/
```

In [9]:
import textwrap

sentences = list(carafe.sents)
for s in sentences:
    s = textwrap.shorten(s.text, width=100)
    print(s)

A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an [...]
All this and not ordinary, not unordered in not resembling.
The difference is spreading.


One very useful attribute is `.noun_chunks`. It returns nouns and compound nouns in a document.

In [10]:
noun_chunks = list(carafe.noun_chunks)

for noun in noun_chunks:
    print(noun)

A kind
glass
a cousin
a spectacle
nothing
a single hurt color
an arrangement
a system
The difference


See how this picks up not only nouns, but articles and compound information? Articles could be helpful if you wanted to track singular/plural relationships, while compound nouns might tell you something about the way a document refers to the entities therein. The latter could have repeating patterns, and you might imagine how you could use noun chunks to create and count n-gram tokens and feed that into a classifier.

Consider this example from _The Odyssey_. Homer used many epithets and repeating phrases throughout his epic. According to some theories, these act as mnemonic devices, helping a performer keep everything in their head during an oral performance (the poem wasn't written down in Homer's day). Using `.noun_chunks` in conjunction with a Python `Counter`, we may be able to identify these in Homer's text. Below, we'll do so with _The Odyssey_ Book XI.

First, let's load and model the text.

In [11]:
with open('data/session_one/odyssey_book_11.txt', 'r') as f:
    book_eleven = f.read()
    
odyssey = nlp(book_eleven)

Now we'll import a `Counter` and initialize it. Then we'll get the noun chunks from the document and populate them in the count dictionary with a list comprehension line. Be sure to only grab the text from each token. We'll explain why in a little while.

In [12]:
from collections import Counter

noun_counts = Counter([chunk.text for chunk in odyssey.noun_chunks])

With that done, let's look for repeating noun chunks with three or more words.

```{margin} What we're doing here...
For every noun chunk in the counter:

1. Split the chunk
2. Check if the length of the chunk is more than two and the count is more than one
3. If so, join the chunk back together and print it along with the chunk
```

In [13]:
import pandas as pd

chunks = []
for chunk, count in noun_counts.items():
    chunk = chunk.split()
    if (len(chunk) > 2) and (count > 1):
        joined = ' '.join(chunk)
        chunks.append({
            'PHRASE': joined,
            'COUNT': count
        })
        
chunks = pd.DataFrame(chunks).set_index('PHRASE')
chunks

Unnamed: 0_level_0,COUNT
PHRASE,Unnamed: 1_level_1
the sea shore,2
a fair wind,2
the poor feckless ghosts,2
the same time,2
the other side,2
his golden sceptre,2
your own house,2
her own son,2
the Achaean land,2
her own husband,2


Excellent! Looks like we turned up a few: "the poor feckless ghosts," "my wicked wife," and "all the Danaans" are likely the kind of repeating phrases scholars think of in Homer's text.

Another way to look at entities in a text is with `.ents`. `spaCy` uses **named-entity recognition** to extract significant objects, or entities, in a document. In general, anything that has a proper name associated with it is considered an entity, but things like expressions of time and geographic location are also often tagged. Here are the first five from Book XI above.

In [14]:
entities = list(odyssey.ents)

count = 0
while count < 5:
    print(entities[count])
    count += 1

Circe
Oceanus
Cimmerians
Circe
Perimedes


You can select particular entities using the `.label_` attribute. Here are all the temporal entities in Book XI.

In [15]:
[e.text for e in odyssey.ents if e.label_ == 'TIME']

['all night', 'to-morrow morning', 'the light of day']

And here is a unique listing of all the people.

```{margin} How many labels are there?
This will depend on the model. Here's the [label scheme] for the one we're using.

[label scheme]: https://spacy.io/models/en#en_core_web_md-labels
```

In [16]:
set(e.text for e in odyssey.ents if e.label_ == 'PERSON')

{'Achilles',
 'Aeson',
 'Alcinous',
 'Arete',
 'Ariadne',
 'Cassandra',
 'Chloris',
 'Circe',
 'Clytemnestra',
 'Diana',
 'Echeneus',
 'Epicaste',
 'Eriphyle',
 'Eurylochus',
 'Helen',
 'Iphicles',
 'Iphimedeia',
 'Jove',
 'Leda',
 'Leto',
 'Maera',
 'Megara',
 'Memnon',
 'Minerva',
 'Neleus',
 'Neoptolemus',
 'Nestor',
 'OEdipodes',
 'Orestes',
 'Ossa',
 'Periclymenus',
 'Perimedes',
 'Pero',
 'Pollux',
 'Priam',
 'Proserpine',
 'Pylos',
 'Pytho',
 'Queen',
 'Scyros',
 'Sisyphus',
 'Teiresias',
 'Telemachus',
 'Theban Teiresias',
 'Ulysses'}

Don't see an entity that you know to be in your document? You can add more to the `spaCy` model. Doing so is beyond the scope of our workshop session, but the library's `EntityRuler()` [documentation] will show you how.

[documentation]: https://spacy.io/api/entityruler

### Token Annotations

In addition to storing all of this information about texts, `spaCy` creates a substantial amount of annotations for each of the tokens in that document. The same logic as above applies to accessing this information.

Let's return to the Stein poem. Indexing `carafe` will return individual tokens:

In [17]:
carafe[3]

glass

Like `carafe`, each one has several attributes:

In [18]:
token_attributes = [i for i in dir(carafe[3]) if i.startswith("_") is False]

print("Number of token attributes:", len(token_attributes))

Number of token attributes: 94


That's a lot!

These attributes range from simple booleans, like whether a token is an alphabetic character:

In [19]:
carafe[3].is_alpha

True

...or whether it is a stop word:

In [20]:
carafe[3].is_stop

False

...to more complex pieces of information, like tracking back to the sentence this token is part of:

In [21]:
carafe[3].sent

A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.

...sentiment scores:

In [22]:
carafe[3].sentiment

0.0

...and even vector space representations (more about these on day three!):

In [23]:
carafe[3].vector

array([-1.4859e-01, -1.7940e-01,  4.3666e-02,  1.5748e-01,  1.3568e-01,
       -9.3666e-01, -6.8430e-01,  4.7692e-01, -4.1391e-01,  9.3575e-01,
       -1.6360e-01,  6.7553e-02, -2.7843e-01, -5.6125e-01,  1.3088e-01,
       -1.0006e-01,  7.0374e-03,  2.6217e+00,  5.4600e-02, -5.8931e-01,
        2.5739e-04, -2.6791e-01,  4.6093e-01, -5.9145e-02, -1.0330e-01,
       -3.7589e-01, -2.5343e-01,  1.4790e-02, -4.8031e-01, -4.4314e-01,
        2.4685e-01, -8.6519e-04, -1.2361e-01,  9.1683e-02, -1.5880e-01,
       -4.5974e-01,  3.3017e-01, -4.4124e-01,  3.3604e-01, -3.0438e-01,
        4.4664e-01,  2.2697e-01,  2.9327e-02, -2.7025e-01,  3.1813e-01,
       -1.5890e-01, -4.1371e-01, -9.0721e-01, -2.0866e-01,  3.6400e-01,
        5.6862e-02, -2.6824e-01, -2.9722e-01,  6.2107e-02, -4.7908e-01,
       -5.8164e-01, -1.4302e-01,  7.0109e-02, -1.2735e-01,  3.6194e-02,
       -1.6634e-01, -2.2135e-01, -5.0446e-02,  4.3839e-01, -5.5363e-01,
       -4.4219e-01, -1.3657e-01, -2.8472e-01, -5.0637e-01,  7.99

Here's a listing of some attributes you might want to know about when text mining.

In [24]:
sample_attributes = []
for token in carafe:
    sample_attributes.append({
        'INDEX': token.i,
        'TEXT': token.text,
        'LOWERCASE': token.lower_,
        'ALPHABETIC': token.is_alpha,
        'DIGIT': token.is_digit,
        'PUNCTUATION': token.is_punct,
        'STARTS SENTENCE': token.is_sent_start,
        'LIKE URL': token.like_url
    })

sample_attributes = pd.DataFrame(sample_attributes).set_index('INDEX')
sample_attributes.head(10)

Unnamed: 0_level_0,TEXT,LOWERCASE,ALPHABETIC,DIGIT,PUNCTUATION,STARTS SENTENCE,LIKE URL
INDEX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,A,a,True,False,False,True,False
1,kind,kind,True,False,False,False,False
2,in,in,True,False,False,False,False
3,glass,glass,True,False,False,False,False
4,and,and,True,False,False,False,False
5,a,a,True,False,False,False,False
6,cousin,cousin,True,False,False,False,False
7,",",",",False,False,True,False,False
8,a,a,True,False,False,False,False
9,spectacle,spectacle,True,False,False,False,False


We'll discuss some of the more complex annotations later on, both in this session and others. For now, let's collect some simple information about each of the tokens in our document. We'll use list comprehension to do so. We'll also use the `.text` attribute for each token, since we only want the text representation. Otherwise, we'd be creating a list of generators, where each generator has all those attribute for every token! (This is why we made sure to only use `.text` in our work with _The Odyssey_ above.)

In [25]:
words = ' '.join([token.text for token in carafe if token.is_alpha])
punctuation = ' '.join([token.text for token in carafe if token.is_punct])

print(
    f"Words\n-----\n{textwrap.shorten(words, width=100)}",
    f"\n\nPunctuation\n-----------\n{punctuation}"
)

Words
-----
A kind in glass and a cousin a spectacle and nothing strange a single hurt color and an [...] 

Punctuation
-----------
, . , . .


Want some linguistic information? We can get that too. For example, here are prefixes and suffixes:

```{margin} You might be wondering about those underscores...

The syntax conventions of `spaCy` use a trailing underscore to access the actual attribute information for a token. Using an attribute without the underscore will return an id, which the library uses internally to piece together output.
```

In [26]:
prefix_suffix = []
for token in carafe:
    if token.is_alpha:
        prefix_suffix.append({
            'TOKEN': token.text,
            'PREFIX': token.prefix_,
            'SUFFIX': token.suffix_
        })

prefix_suffix = pd.DataFrame(prefix_suffix).set_index('TOKEN')
prefix_suffix.head(10)

Unnamed: 0_level_0,PREFIX,SUFFIX
TOKEN,Unnamed: 1_level_1,Unnamed: 2_level_1
A,A,A
kind,k,ind
in,i,in
glass,g,ass
and,a,and
a,a,a
cousin,c,sin
a,a,a
spectacle,s,cle
and,a,and


And here are lemmas:

In [27]:
lemmas = []
for token in carafe:
    if token.is_alpha:
        lemmas.append({
            'TOKEN': token.text,
            'LEMMA': token.lemma_
        })

lemmas = pd.DataFrame(lemmas).set_index('TOKEN')
lemmas[24:]

Unnamed: 0_level_0,LEMMA
TOKEN,Unnamed: 1_level_1
All,all
this,this
and,and
not,not
ordinary,ordinary
not,not
unordered,unordere
in,in
not,not
resembling,resemble


With such attributes at your disposal, you might imagine how you could work `spaCy` into a text mining pipeline. Instead of using separate functions to clean your corpus, those steps could all be accomplished by accessing attributes.

Before you do this, however, you should consider two things: 1) whether the increased computational/memory overhead is worthwhile for your project; and 2) whether `spaCy`'s base models will work for the kind of text you're using. This second point is especially important. While `spaCy`'s base models are incredibly powerful, they are built for general purpose applications and may struggle with domain-specific language. Medical text and early modern print are two such examples of where the base models interpret your documents in unexpected ways, thereby complicating, maybe even ruining, parts of a text mining pipeline that relies on them. Sometimes, in other words, it's just best to stick with a text mining pipeline that you know to be effective.

That all said, there are ways to train your own `spaCy` model on a specific domain. This can be an extensive process, one which exceeds the limits of our short workshop, but if you want to learn more about doing so, you can visit [this page]. There are also [third party models] available, which you might find useful, though your milage may vary.

[this page]: https://spacy.io/usage/training
[third party models]: https://spacy.io/universe/category/models

Part-of-Speech Tagging
----------------------------

One of the most common tasks in NLP involves assigning **part-of-speech, or POS, tags** to each token in a document. As we saw in the text mining series, these tags are a necessary step for certain text cleaning process, like lemmatization; you might also use them to identify subsets of your data, which you could separate out and model. Beyond text cleaning, POS tags can be useful for tasks like **word sense disambiguation**, where you try to determine which particular facet of meaning a given token represents.

Regardless of the task, the process of getting POS tags from `spaCy` will be the same. Each token in a document has an associated tag, which is accessible as an attribute.

In [28]:
pos = []
for token in carafe:
    pos.append({
        'TOKEN': token.text,
        'POS_TAG': token.pos_
    })

pos = pd.DataFrame(pos).set_index('TOKEN')
pos

Unnamed: 0_level_0,POS_TAG
TOKEN,Unnamed: 1_level_1
A,DET
kind,NOUN
in,ADP
glass,NOUN
and,CCONJ
a,DET
cousin,NOUN
",",PUNCT
a,DET
spectacle,NOUN


If you don't know what a tag means, you can use `spacy.explain()`.

In [29]:
spacy.explain('CCONJ')

'coordinating conjunction'

`spaCy` actually has two types of POS tags. The ones accessible with the `.pos_` attribute are the basic tags, whereas those under `.tag_` are more detailed (these come from the [Penn Treebank project]). We'll print them out below, along with information about what they mean.

[Penn Treebank project]: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [30]:
detailed_tags = []
for token in carafe:
    detailed_tags.append({
        'TOKEN': token.text,
        'POS_TAG': token.tag_,
        'EXPLANATION': spacy.explain(token.tag_)
    })

detailed_tags = pd.DataFrame(detailed_tags).set_index('TOKEN')
detailed_tags

Unnamed: 0_level_0,POS_TAG,EXPLANATION
TOKEN,Unnamed: 1_level_1,Unnamed: 2_level_1
A,DT,determiner
kind,NN,"noun, singular or mass"
in,IN,"conjunction, subordinating or preposition"
glass,NN,"noun, singular or mass"
and,CC,"conjunction, coordinating"
a,DT,determiner
cousin,NN,"noun, singular or mass"
",",",","punctuation mark, comma"
a,DT,determiner
spectacle,NN,"noun, singular or mass"


### Use case: word sense disambiguation

This is all well and good in the abstract, but the power of POS tags lies in how they support other kinds of analysis. We'll do a quick word sense disambiguation task here but will return to do something more complex in a little while.

Between the two strings:

1. "I am not going to bank on that happening."
2. "I went down to the river bank."

How can we tell which sense of the word "bank" is being used? Well, we can model each with `spaCy` and see whether the POS tags for these two tokens match. If they don't match, this will indicate that the tokens represent two different senses of the word "bank."

All this can be accomplished with a `for` loop and `nlp.pipe()`. The latter function enables you to process different documents with the `spaCy` model all at once. This can be great for working with a large corpus, though note that, because `.pipe()` is meant to work on text at scale, it will return a generator.

In [31]:
banks = ["I am not going to bank on that happening.", "I went down to the river bank."]

nlp.pipe(banks)

<generator object Language.pipe at 0x12da33318>

In [32]:
for doc in nlp.pipe(banks):
    for token in doc:
        if token.text == 'bank':
            print(
                f"{doc.text}\n+ {token.text}: "
                f"{token.tag_} ({spacy.explain(token.tag_)})\n"
            )

I am not going to bank on that happening.
+ bank: VB (verb, base form)

I went down to the river bank.
+ bank: NN (noun, singular or mass)



See how the tags differ between the two instances of "bank"? This indicates a difference in usage and, by proxy, a difference in meaning.

Dependency Parsing
------------------------

Another tool that can help with tasks like disambiguating word sense is dependency parsing. We've actually used it already: it allowed us to extract those noun chunks above. Dependency parsing involves analyzing the grammatical structure of text (usually sentences) to identify relationships between the words therein. The basic idea is that every word in a linguistic unit (eg. a sentence) is linked to at least one other word via a tree structure, and these linkages are hierarchical in nature, with various modifications occuring across the levels of sentences, clauses, phrases, and even compound nouns. Dependency parsing can tell you information about:

1. The primary **subject** of a linguistic unit (and whether it is an **active** or **passive** subject)
2. Various **heads**, which determine the syntatic categories of a phrase; these are often nouns and verbs, and you can think of them as the local subjects of subunits
3. Various **dependents**, which modify, either directly or indirectly, their heads (think adjectives, adverbs, etc.)
4. The **root** of the unit, which is often ([but not always!]) the primary verb

Linguists have developed a number of different methods to parse dependencies, which we won't discuss here. Take note though that most popular one in NLP is the [Universal Dependencies] framework; `spaCy`, like most NLP models, uses this. The library also has some functionality for visualizing dependencies, which will help clarify what it is they are in the first place. Below, we visualize a sentence from the Stein poem.

[but not always!]: https://universaldependencies.org/u/dep/root.html
[Universal Dependencies]: https://universaldependencies.org

In [33]:
from spacy import displacy

to_render = list(carafe.sents)[2]
displacy.render(to_render, style='dep')

See how the arcs have arrows? Arrows point to the dependents within a linguistic unit, that is, they point to modifying relationships between words. Arrows arc out from a segment's head, and the relationships they indicate are all specified with labels. As with the POS tags, you can use `spacy.explain()` on the dependency labels, which we'll do below. The whole list of them is also available in this [table of typologies]. Finally, somewhere in the tree you'll find a word with no arrows pointing to it (here, "spreading"). This is the root. One of its dependents is the subject of the sentence (here, "difference").

[table of typologies]: https://universaldependencies.org/u/dep/all.html

Seeing these relationships are quite useful in and of themselves, but the real power of dependency parsing comes in all the extra data it can provide about a token. Using this technique, you can link tokens back to their heads, or find local groupings of tokens that all refer to the same head.

Here's how you could formalize that with a dataframe. Given this sentence:

In [34]:
sentence = odyssey[2246:2260]
sentence.text

"Then I tried to find some way of embracing my mother's ghost."

We can construct a `for` loop, which rolls through each token and retrieves its dependency info.

In [35]:
dependencies = []
for token in sentence:
    dependencies.append({
        'INDEX': token.i,
        'TOKEN': token.text,
        'DEPENDENCY_SHORTCODE': token.dep_,
        'DEPENDENCY': spacy.explain(token.dep_),
        'HEAD_INDEX': token.head.i,
        'HEAD': token.head
    })
    
dependencies = pd.DataFrame(dependencies).set_index('INDEX')
dependencies

Unnamed: 0_level_0,TOKEN,DEPENDENCY_SHORTCODE,DEPENDENCY,HEAD_INDEX,HEAD
INDEX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2246,Then,advmod,adverbial modifier,2248,tried
2247,I,nsubj,nominal subject,2248,tried
2248,tried,ROOT,,2248,tried
2249,to,aux,auxiliary,2250,find
2250,find,xcomp,open clausal complement,2248,tried
2251,some,det,determiner,2252,way
2252,way,dobj,direct object,2250,find
2253,of,prep,prepositional modifier,2252,way
2254,embracing,pcomp,complement of preposition,2253,of
2255,my,poss,possession modifier,2256,mother


How many tokens are associated with each head?

In [36]:
dependencies.groupby('HEAD').size()

HEAD
tried        5
find         2
way          2
of           1
embracing    1
mother       2
ghost        1
dtype: int64

Which tokens are in each of these groups?

In [37]:
groups = []
for group in dependencies.groupby('HEAD'):
    head, tokens = group[0].text, group[1]['TOKEN'].tolist()
    groups.append({
        'HEAD': head,
        'GROUP': tokens
    })
    
groups = pd.DataFrame(groups).set_index('HEAD')
groups

Unnamed: 0_level_0,GROUP
HEAD,Unnamed: 1_level_1
tried,"[Then, I, tried, find, .]"
find,"[to, way]"
way,"[some, of]"
of,[embracing]
embracing,[ghost]
mother,"[my, 's]"
ghost,[mother]


`spaCy` also has a special `.subtree` attribute for each token, which will also produce a similar set of local groupings. Note however that `.subtree` captures all tokens that hold a dependent relationship with the one in question, meaning that when you find the subtree of the root, you're going to print out the entire sentence.

As you might expect by now, `.subtree` returns a generator, so convert it to a list or use list comprehension to extract the tokens. We'll do this in a separate function. Within this function, we're going to use the `.text_with_ws` attribute of each token in the subtree to return an exact, string-like representation of the tree (this will include any whitespace characters that are attached to a token).

In [38]:
def subtree_to_text(subtree):
    subtree = ''.join([token.text_with_ws for token in token.subtree])
    subtree = subtree.strip()
    return subtree

sentence_trees = []
for token in sentence:
    subtree = subtree_to_text(token.subtree)
    sentence_trees.append({
        'TOKEN': token.text,
        'DEPENDENCY': token.dep_,
        'SUBTREE': subtree
    })

sentence_trees = pd.DataFrame(sentence_trees).set_index('TOKEN')
sentence_trees

Unnamed: 0_level_0,DEPENDENCY,SUBTREE
TOKEN,Unnamed: 1_level_1,Unnamed: 2_level_1
Then,advmod,Then
I,nsubj,I
tried,ROOT,"""Then I tried to find some way of embracing my..."
to,aux,to
find,xcomp,to find some way of embracing my mother's ghost
some,det,some
way,dobj,some way of embracing my mother's ghost
of,prep,of embracing my mother's ghost
embracing,pcomp,embracing my mother's ghost
my,poss,my


Putting Everything Together
---------------------------------

Now that we've walked through all these options (which are really only a small sliver of what you can do with `spaCy`!), let's put them into action. Below, we'll construct two short examples of how you might combine different aspects of token attributes to analyze a text. Both of them are essentially **information retrieval** tasks, and you might imagine doing something similar to extract and analyze particular words in your corpus, or to find different grammatical patterns that could be of significance (as we'll discuss in the next session).

### Finding lemmas

In the first, we'll use the `.lemma_` attribute to search through Book XI of _The Odyssey_ and match its tokens to a few key words. If you've read _The Odyssey_, you'll know that Book XI is where Odysseus and his fellow sailors have to travel down to the underworld Hades, where they speak with the dead. We already saw one example of this: Odysseus attempts to embrace his dead mother after communing with her. The whole trip to Hades is an emotionally tumultuous experience for the travelers, and peppered throughout Book XI are expressions of grief.

With `.lemma_`, we can search for these expressions. We'll roll through the text and determine whether a token lemma matches one of a selected set. When we find a match, we'll get the subtree of this token's _head_. That is, we'll find the head upon which this token depends, and then we'll use that to reconstruct the local context for the token.

In [39]:
sorrowful_lemmas = []
for token in odyssey:
    if token.lemma_ in ('cry', 'grief', 'grieve', 'sad', 'sorrow', 'tear', 'weep'):
        subtree = subtree_to_text(token.head.subtree)
        sorrowful_lemmas.append({
            'TOKEN': token.text,
            'SUBTREE': subtree
        })

sorrowful_lemmas = pd.DataFrame(sorrowful_lemmas).set_index('TOKEN')
sorrowful_lemmas

Unnamed: 0_level_0,SUBTREE
TOKEN,Unnamed: 1_level_1
weeping,weeping and in great distress of mind
cried,cried when I saw him: 'Elpenor
sad,sad
tears,tears
sorrow,all my sorrow
sad,sad
tears,tears
grieves,He grieves continually about your never having...
sad,sad
sorrows,our sorrows


### Verb-subject relations

For this next example, we'll use dependency tags to find the subject sentences in Book XI. As before, we'll go through each token in the document, this time checking to see whether it has the `nsubj` or `nsubjpass` tag for its `.dep_` attribute, which denote the subjects of the sentence's root. We'll also check to see whether a token is a noun (otherwise we'd get a lot of articles like "who," "them," etc.). If a token matches these two conditions, we'll find its head verb as well as the token's subtree. Note that this time, the subtree will refer directly to the token in question, not to the head. This will let us capture some descriptive information about each sentence subject.

In [40]:
nsubj = []
for token in odyssey:
    if token.dep_ in ('nsubj', 'nsubjpass') and token.pos_ in ('NOUN', 'PROPN'):
        nsubj.append({
            'SUBJECT': token.text,
            'HEAD': token.head.text,
            'HEAD_LEMMA': token.head.lemma_,
            'SUBTREE': subtree_to_text(token.subtree)
        })

nsubj_df = pd.DataFrame(nsubj).set_index('SUBJECT')
nsubj_df

Unnamed: 0_level_0,HEAD,HEAD_LEMMA,SUBTREE
SUBJECT,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Circe,sent,send,"Circe, that great and cunning goddess"
sails,were,be,her sails
sun,went,go,the sun
darkness,was,be,darkness
rays,pierce,pierce,the rays of the sun
...,...,...,...
Theseus,came,come,Theseus and Pirithous glorious children of the...
thousands,came,come,so many thousands of ghosts
Proserpine,send,send,Proserpine
ship,went,go,the ship


Let's look at a few subtrees. Note how sometimes they are simple noun chunks, while in other cases they expand to whole phrases.

In [41]:
for chunk in nsubj_df['SUBTREE'].sample(10):
    print(f"+ {chunk}")

+ the gods
+ the ghosts
+ your lands
+ Theseus
+ The rest of her children
+ "'Ulysses
+ much hardship
+ the wind
+ The ghosts of other dead men
+ My mother


Time to zoom out. How many time do each of our selected subjects appear?

In [42]:
nsubj_df.groupby('SUBJECT').size().sort_values(ascending=False).head(25)

SUBJECT
ghost           8
Ulysses         5
ghosts          4
heaven          4
wife            4
Proserpine      4
man             4
one             4
Alcinous        3
people          3
Neleus          2
ship            2
judgement       2
Theseus         2
Teiresias       2
gods            2
mother          2
life            2
wind            2
Circe           2
Clytemnestra    2
Hercules        2
Jove            2
creature        2
prisoners       1
dtype: int64

What heads are associated with each subject? (Note that we're using the lemmatized form of the verbs.)

In [43]:
nsubj_df.groupby(['SUBJECT', 'HEAD_LEMMA']).size().sort_values(ascending=False).head(25)

SUBJECT     HEAD_LEMMA
ghost       come          3
Ulysses     answer        2
Proserpine  send          2
one         invite        1
man         do            1
            kill          1
mother      answer        1
            come          1
one         be            1
            get           1
Aegisthus   be            1
man         be            1
one         tell          1
others      fall          1
people      be            1
            bless         1
            hear          1
man         cross         1
limbs       fail          1
property    be            1
heaven      take          1
ground      reek          1
guest       be            1
guests      sit           1
hardship    reach         1
dtype: int64

Such information provides another way of looking at something like topicality. Rather than using, say, a bag of words approach to build a topic model, you could instead segment your text into chunks like the above and start tallying up token distributions. Such distributions might help you identify the primary subject in a passage of text, whether that be a character or something like a concept. Or, you could leverage them to investigate how different subjects are talked about, say by throwing POS tags into the mix to further nuance relationships across entities.

Our next session will demonstrate what such investigations look like in action. For now however, the main takeway is that the above annotation structures provide you with a host of different ways to segment and facet your text data. You are by no means limited to single token counts when working computationally analyzing text. Indeed, sometimes the most compelling ways to expore a corpus lie in the broader, fuzzier relationships that NLP annotations help us identify.