Text Annotation
===============

spaCy Language Models
-----------------------------

In [1]:
import spacy

nlp = spacy.load('en_core_web_md')

Text Annotations
--------------------

To annotate text with the `spaCy` model, simply run it through the core function, `nlp()`. We'll do so with a short poem by Gertrude Stein.

In [2]:
with open('data/session_one/stein_carafe.txt', 'r') as f:
    carafe = f.read()
    
doc = nlp(carafe)

With this done, we can inspect the result...

In [3]:
doc

A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing. All this and not ordinary, not unordered in not resembling. The difference is spreading.

...which seems to be no different from a string representation! This output is a bit misleading, however. Our `doc` object actually has a ton of extra information associated with it, even though, on the surface, it appears to be a plain old string.

If you'd like, you can inspect all these attributes and methods with:

In [4]:
attributes = [i for i in dir(doc) if not i.startswith("_")]

We won't show them all here, but suffice it to say, there are a lot!

In [5]:
print("Number of attributes in a spaCy doc:", len(attributes))

Number of attributes in a spaCy doc: 51


This high number of attributes indicates an important point to keep in mind when working with `spaCy`: as opposed to other forms of text preprocessing, like changing the case of tokens, removing punctuation, or stemming a corpus, `spaCy` aims to **maximally preserve information** about your text. It keeps texts intact and in fact attaches far more information to them than Python's built-in string methods can provide. In this sense, we might say that `spaCy` is additive in nature, whereas text mining methods are subtractive, or reductive.

### Document

So, while the base representation of `doc` looks like a string, under the surface there are all sorts of annotations about it. To access them, we use the attributes counted above. For example, `spaCy` adds extra segmentation information about a text, like which parts of it belong to different sentences. We can check to see whether this information has been attached to our text with the `has_annotation()` method.

In [6]:
doc.has_annotation('SENT_START')

True

We can use the same method to check for a few other annotations:

In [7]:
annotation_types = {'Dependencies': 'DEP', 'Entities': 'ENT_IOB', 'Tags': 'TAG'}
for a, t in annotation_types.items():
    print(
        f"{a:>12}: {doc.has_annotation(t)}"
    )

Dependencies: True
    Entities: True
        Tags: True


Let's look at sentences. We can access them with `sents`.

In [8]:
doc.sents

<generator at 0x119bedf48>

...but you can see that there's a small complication here: `sents` returns a generator, not a list. The reason has to do with memory efficiency. Because `spaCy` adds so much extra information about your text, this information could slow down your code or overwhelm your computer if the library didn't store it in an efficient manner. Of course this isn't a problem with our small poem, but you can imagine how it could become one with a big corpus.

To access the actual sentences in `doc`, we'll need to convert the generator to a list.

```{margin} Want to learn more about generators?
The DataLab has a workshop on them.
```

In [9]:
sentences = list(doc.sents)
for s in sentences:
    print(s)

A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.
All this and not ordinary, not unordered in not resembling.
The difference is spreading.
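
For readers new to generators, here is a minimal sketch of the behavior in plain Python. The function `iter_sentences` is a hypothetical stand-in for `doc.sents`: it splits on periods only, which is far cruder than what `spaCy` does, but it shows why items have to be requested (or collected with `list()`) before you can see them.

```python
from itertools import islice

def iter_sentences(text):
    """Lazily yield sentence-like pieces of `text`, one at a time.

    A toy stand-in for `doc.sents`; it only splits on periods.
    """
    for piece in text.split("."):
        piece = piece.strip()
        if piece:
            yield piece + "."

text = "All this and not ordinary. The difference is spreading."
sents = iter_sentences(text)

# Nothing has been computed yet; each item is produced on demand.
first = next(sents)

# A generator is consumed as it is read, so list() collects what is left.
rest = list(sents)

# islice() takes the first n items without building the whole list.
first_only = list(islice(iter_sentences(text), 1))

print(first)
print(rest)
```

If you only need the first few items of a large generator, `islice()` avoids converting everything to a list first.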


One very useful attribute is `noun_chunks`. It returns the base noun phrases in a text: each noun together with the words that describe it, such as articles and adjectives.

In [10]:
noun_chunks = list(doc.noun_chunks)

for noun in noun_chunks:
    print(noun)

A kind
glass
a cousin
a spectacle
nothing
a single hurt color
an arrangement
a system
The difference


See how this picks up not only nouns, but articles and compound information? Articles could be helpful if you wanted to track singular/plural relationships, whereas compound nouns might tell you something about the way a text refers to the entities therein. The latter might have repeating patterns, for example, and you might imagine how you could use noun chunks to create and count n-gram tokens and feed that into a classifier.
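
One way to picture the n-gram idea is sketched below. It is only an illustration: the chunk strings are copied from the output above rather than read from a live `doc`, and `chunk_ngrams` is a hypothetical helper name, not something `spaCy` provides.

```python
from collections import Counter

# Noun chunk texts copied from the output above; with a live doc these
# would come from [chunk.text for chunk in doc.noun_chunks].
chunk_texts = [
    "A kind", "glass", "a cousin", "a spectacle", "nothing",
    "a single hurt color", "an arrangement", "a system", "The difference",
]

def chunk_ngrams(chunk_text, n=2):
    """Return the word n-grams inside one chunk, lowercased."""
    words = chunk_text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Count every bigram that occurs inside a noun chunk.
bigram_counts = Counter()
for text in chunk_texts:
    bigram_counts.update(chunk_ngrams(text))

print(bigram_counts.most_common(3))
```

Counts like these could then serve as features for a classifier; single-word chunks such as "glass" simply contribute no bigrams.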

Consider this example from _The Odyssey_. Homer used many epithets and repeating phrases throughout his epic. According to some theories, these act as mnemonic devices, helping a performer keep everything in their head during an oral performance. Using `noun_chunks` in conjunction with a Python `Counter`, we might be able to identify these in Homer's text. Below, we'll do so with Book XI.

First, let's load and model the text.

In [11]:
with open('data/session_one/odyssey_book_11.txt', 'r') as f:
    book_eleven = f.read()
    
odyssey = nlp(book_eleven)

Now we'll import a `Counter` and initialize it. Then we'll get the noun chunks from the text and add them to the counter dictionary. Be sure to only grab the text! We'll explain why in a little while.

In [12]:
from collections import Counter

c = Counter()
odyssey_noun_chunks = list(odyssey.noun_chunks)

for chunk in odyssey_noun_chunks:
    c[chunk.text] += 1

With that done, let's look for repeating noun chunks with three or more words.

```{margin} What we're doing here...
For every noun chunk in the counter:

1. Split the chunk
2. Check if the length of the chunk is more than two and the count is more than one
3. If so, join the chunk back together and print it along with its count
```

In [13]:
print("Phrase".ljust(23), "| Count\n-------------------------------")
for chunk, count in c.items():
    chunk = chunk.split()
    if (len(chunk) > 2) and (count > 1):
        joined = ' '.join(chunk)
        print(f"{joined:<24} {count:>4}")

Phrase                  | Count
-------------------------------
the sea shore               2
a fair wind                 2
the poor feckless ghosts    2
the same time               2
the other side              2
his golden sceptre          2
your own house              2
her own son                 2
the Achaean land            2
her own husband             2
my wicked wife              2
all the Danaans             2
the poor creature           2


Excellent! Looks like we turned up a few: "the poor feckless ghosts," "my wicked wife," and "all the Danaans" are likely the kind of repeating phrases scholars think of in Homer's text.
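
If you expect to reuse this filter, it could be wrapped in a small function. The sketch below is one possible shape for it, not the workshop's own code: `repeated_phrases` and its parameters are hypothetical names, and the sample counter mixes three counts from the table above with two made-up entries that show what gets discarded.

```python
from collections import Counter

def repeated_phrases(counts, min_words=3, min_count=2):
    """Return (phrase, count) pairs that are long and frequent enough.

    Results are sorted by descending count, then alphabetically.
    """
    hits = [
        (phrase, count)
        for phrase, count in counts.items()
        if len(phrase.split()) >= min_words and count >= min_count
    ]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# The first three entries repeat counts from the table above; the last
# two are invented here purely to exercise the filter.
sample = Counter({
    "the sea shore": 2,
    "my wicked wife": 2,
    "all the Danaans": 2,
    "the ship": 4,             # hypothetical: frequent but only two words
    "a great black cloud": 1,  # hypothetical: long but seen only once
})

for phrase, count in repeated_phrases(sample):
    print(f"{phrase:<24} {count:>4}")
```

With a real counter, you would call `repeated_phrases(c)` and adjust `min_words` or `min_count` to widen or narrow the search.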

Another way to look at entities in a text is with `ents`. `spaCy` uses **named-entity recognition** to extract significant objects, or entities, in a text. In general, anything that has a proper name associated with it is considered an entity. Here are the first five from Book XI above.

In [14]:
entities = list(odyssey.ents)

# Print the first five entities
for ent in entities[:5]:
    print(ent)

Circe
Oceanus
Cimmerians
Circe
Perimedes


### Tokens

In addition to storing all of this information about texts, `spaCy` creates a substantial number of annotations for each of the tokens in that text. The same logic as above applies to accessing this information.

Let's return to the Stein poem. Indexing `doc` will return individual tokens:

In [15]:
doc[3]

glass

Like `doc`, each one has several attributes:

In [16]:
token_attributes = [i for i in dir(doc[3]) if not i.startswith("_")]

print("Number of token attributes:", len(token_attributes))

Number of token attributes: 94


Wow! That's a lot.

These attributes range from simple booleans, like whether a token consists of alphabetic characters:

In [17]:
doc[3].is_alpha

True

...or whether it is a stop word:

In [18]:
doc[3].is_stop

False

...to more complex pieces of information, like part of speech tags:

In [19]:
doc[3].pos_

'NOUN'

...sentiment scores:

In [20]:
doc[3].sentiment

0.0

...and even vector space representations:

In [21]:
doc[3].vector

array([-1.4859e-01, -1.7940e-01,  4.3666e-02,  1.5748e-01,  1.3568e-01,
       -9.3666e-01, -6.8430e-01,  4.7692e-01, -4.1391e-01,  9.3575e-01,
       -1.6360e-01,  6.7553e-02, -2.7843e-01, -5.6125e-01,  1.3088e-01,
       -1.0006e-01,  7.0374e-03,  2.6217e+00,  5.4600e-02, -5.8931e-01,
        2.5739e-04, -2.6791e-01,  4.6093e-01, -5.9145e-02, -1.0330e-01,
       -3.7589e-01, -2.5343e-01,  1.4790e-02, -4.8031e-01, -4.4314e-01,
        2.4685e-01, -8.6519e-04, -1.2361e-01,  9.1683e-02, -1.5880e-01,
       -4.5974e-01,  3.3017e-01, -4.4124e-01,  3.3604e-01, -3.0438e-01,
        4.4664e-01,  2.2697e-01,  2.9327e-02, -2.7025e-01,  3.1813e-01,
       -1.5890e-01, -4.1371e-01, -9.0721e-01, -2.0866e-01,  3.6400e-01,
        5.6862e-02, -2.6824e-01, -2.9722e-01,  6.2107e-02, -4.7908e-01,
       -5.8164e-01, -1.4302e-01,  7.0109e-02, -1.2735e-01,  3.6194e-02,
       -1.6634e-01, -2.2135e-01, -5.0446e-02,  4.3839e-01, -5.5363e-01,
       -4.4219e-01, -1.3657e-01, -2.8472e-01, -5.0637e-01,  7.99

We'll discuss some of these more complex annotations later on, both in this session and others. For now, let's collect some simple information about each of the tokens in our text. We'll use list comprehensions to do so. We'll also use the `text` attribute for each token, since we only want the text representation (otherwise, we'd be creating a list of token objects, each of which carries all those attributes!).

In [22]:
words = [t.text for t in doc if t.is_alpha]
punctuation = [t.text for t in doc if t.is_punct]

print(
    f"Words\n-----\n{' '.join(words)}",
    f"\n\nPunctuation\n-----------\n{' '.join(punctuation)}"
)

Words
-----
A kind in glass and a cousin a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing All this and not ordinary not unordered in not resembling The difference is spreading 

Punctuation
-----------
, . , . .


Want some linguistic information? We can get that too. For example, here are prefixes and suffixes:

```{margin} You might be wondering about those underscores...

The syntax conventions of `spaCy` use a trailing underscore to access the human-readable string for a token attribute. Using an attribute without the underscore returns an integer ID, which the library uses internally to piece together output.
```

In [23]:
prefix_suffix = [(t.text, t.prefix_, t.suffix_) for t in doc if t.is_alpha]

for text, prefix, suffix in prefix_suffix:
    print(
        f"Text: {text}\nPrefix: {prefix}\nSuffix: {suffix}\n"
    )

Text: A
Prefix: A
Suffix: A

Text: kind
Prefix: k
Suffix: ind

Text: in
Prefix: i
Suffix: in

Text: glass
Prefix: g
Suffix: ass

Text: and
Prefix: a
Suffix: and

Text: a
Prefix: a
Suffix: a

Text: cousin
Prefix: c
Suffix: sin

Text: a
Prefix: a
Suffix: a

Text: spectacle
Prefix: s
Suffix: cle

Text: and
Prefix: a
Suffix: and

Text: nothing
Prefix: n
Suffix: ing

Text: strange
Prefix: s
Suffix: nge

Text: a
Prefix: a
Suffix: a

Text: single
Prefix: s
Suffix: gle

Text: hurt
Prefix: h
Suffix: urt

Text: color
Prefix: c
Suffix: lor

Text: and
Prefix: a
Suffix: and

Text: an
Prefix: a
Suffix: an

Text: arrangement
Prefix: a
Suffix: ent

Text: in
Prefix: i
Suffix: in

Text: a
Prefix: a
Suffix: a

Text: system
Prefix: s
Suffix: tem

Text: to
Prefix: t
Suffix: to

Text: pointing
Prefix: p
Suffix: ing

Text: All
Prefix: A
Suffix: All

Text: this
Prefix: t
Suffix: his

Text: and
Prefix: a
Suffix: and

Text: not
Prefix: n
Suffix: not

Text: ordinary
Prefix: o
Suffix: ary

Text: not
Prefix: n
Suf
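
The output above hints at how these two attributes are defined: by default, `prefix_` is the first character of a token and `suffix_` is its last three characters, rather than a morphological prefix or suffix. A minimal sketch of the same slicing in plain Python follows; `prefix_suffix_of` is a hypothetical helper and not part of `spaCy`.

```python
def prefix_suffix_of(word, prefix_len=1, suffix_len=3):
    """Mimic the default prefix/suffix slices with plain string slicing."""
    return word[:prefix_len], word[-suffix_len:]

# A few words from the poem; short words are returned whole as the suffix.
for word in ["A", "kind", "in", "glass", "spectacle"]:
    prefix, suffix = prefix_suffix_of(word)
    print(f"Text: {word}\nPrefix: {prefix}\nSuffix: {suffix}\n")
```

Seen this way, these attributes work more like cheap character features than linguistic analysis, which is worth keeping in mind before relying on them.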

And here are lemmas:

In [24]:
lemmas = [(t.text, t.lemma_) for t in doc if t.is_alpha]

for text, lemma in lemmas:
    print(
        f"Text: {text}\nLemma: {lemma}\n"
    )

Text: A
Lemma: a

Text: kind
Lemma: kind

Text: in
Lemma: in

Text: glass
Lemma: glass

Text: and
Lemma: and

Text: a
Lemma: a

Text: cousin
Lemma: cousin

Text: a
Lemma: a

Text: spectacle
Lemma: spectacle

Text: and
Lemma: and

Text: nothing
Lemma: nothing

Text: strange
Lemma: strange

Text: a
Lemma: a

Text: single
Lemma: single

Text: hurt
Lemma: hurt

Text: color
Lemma: color

Text: and
Lemma: and

Text: an
Lemma: an

Text: arrangement
Lemma: arrangement

Text: in
Lemma: in

Text: a
Lemma: a

Text: system
Lemma: system

Text: to
Lemma: to

Text: pointing
Lemma: point

Text: All
Lemma: all

Text: this
Lemma: this

Text: and
Lemma: and

Text: not
Lemma: not

Text: ordinary
Lemma: ordinary

Text: not
Lemma: not

Text: unordered
Lemma: unordere

Text: in
Lemma: in

Text: not
Lemma: not

Text: resembling
Lemma: resemble

Text: The
Lemma: the

Text: difference
Lemma: difference

Text: is
Lemma: be

Text: spreading
Lemma: spread



With such attributes at your disposal, you might imagine how you could work `spaCy` into a text mining pipeline. Instead of using separate functions to clean your text, those steps could all be accomplished by accessing attributes.
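
As a rough sketch of what such a cleaning step might look like, the function below keeps the lowercased lemma of every alphabetic token that is not a stop word. To keep the example runnable without a model, it uses a hypothetical `FakeToken` stand-in with hand-written annotations; the attribute names match the ones used on real tokens above, so the same function should also accept a `doc`.

```python
from collections import namedtuple

# A hypothetical stand-in for spaCy tokens so the sketch runs without a
# model; the field names mirror the token attributes used above.
FakeToken = namedtuple("FakeToken", ["text", "lemma_", "is_alpha", "is_stop"])

def clean_tokens(tokens):
    """Keep lowercased lemmas of alphabetic, non-stop-word tokens."""
    return [
        t.lemma_.lower()
        for t in tokens
        if t.is_alpha and not t.is_stop
    ]

# Annotations written by hand to mirror the last line of the poem.
tokens = [
    FakeToken("The", "the", True, True),
    FakeToken("difference", "difference", True, False),
    FakeToken("is", "be", True, True),
    FakeToken("spreading", "spread", True, False),
    FakeToken(".", ".", False, False),
]

print(clean_tokens(tokens))
```

One comprehension here replaces separate lowercasing, punctuation removal, stop word filtering, and lemmatization steps.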

Before you do this, however, you should consider two things: 1) whether the increased computational/memory overhead is worthwhile for your project; and 2) whether `spaCy`'s base models will work for the kind of text you're using. This second point is especially important. While `spaCy`'s base models are incredibly powerful, they are built for general purpose applications and may struggle with domain-specific language. Medical texts and early modern print are two examples of material that the base models may interpret in unexpected ways, thereby complicating, if not ruining, parts of a text mining pipeline that relies on them. Sometimes, in other words, it's best to stick with a text mining pipeline that you know is effective.

That all said, there are ways to train your own `spaCy` model on a specific domain. This can be an extensive process, one which exceeds the limits of our short workshop, but if you want to learn more about doing so, you can visit [this page]. There are also [third party models] available, which you might find useful, though your mileage may vary.

[this page]: https://spacy.io/usage/training
[third party models]: https://spacy.io/universe/category/models

Dependency Parsing
------------------------