<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `Introduction to spaCy` `1`

with MacCall edits

This lesson is `1` of 3 in the educational series on `Natural Language Processing (NLP) with spaCy`. This notebook is intended `to teach the basics of NLP and the spaCy library`.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial`

`This tutorial is a constructed example that takes the user by the hand through a series of steps to learn how a process works. Tutorials often use "toy" (or at least carefully constrained) examples that give reliable, accurate, and repeatable results every time.`

**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`

`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`

`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:**
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand what NLP is generally
2. Understand the basics of spaCy
3. Understand Containers
4. Understand the Doc object
5. Understand the Token and its Attributes
```
**Research Pipeline:**
```
N/A
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* [spaCy](https://spacy.io/) for performing [Natural Language Processing (NLP)](https://docs.constellate.org/key-terms/#nlp).

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!python -m spacy download en_core_web_sm





Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 11.7 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




In [None]:
### Import Libraries ###
import spacy

# Introduction to the Notebooks

In these notebooks, students will receive an introduction to natural language processing, or NLP, and the Python library spaCy. spaCy allows users to perform NLP tasks via Python. In academia, the Natural Language Toolkit, or NLTK, has remained quite a popular Python library; spaCy, however, is an industry-oriented alternative to NLTK.

I enjoy using spaCy in all my NLP workflows for a few reasons. First, spaCy's syntax, or the way in which you write code to do things with the spaCy library, is fairly straightforward. Second, it allows you to construct robust pipelines for processing texts and extracting relevant information from them. Third, spaCy has powerful built-in components that allow you to use both heuristics, or rules, and machine learning. Fourth, out-of-the-box, spaCy offers pipelines and machine learning components for many modern languages. Each month their team and the community add more support for existing languages and support new languages. Fifth, training new machine learning models via spaCy is relatively simple and easy to replicate. Sixth, spaCy has a strong community and forum; if you have a question, someone is usually there to help. Seventh, spaCy scales well, meaning you can process many texts efficiently. Eighth, it is comprehensive enough to add new languages into its framework.

In these three notebooks, students will learn the basics of spaCy and how to use it to solve a real-world problem in a digital humanities setting, namely information extraction via rules-based named entity recognition. Because these notebooks are designed as a primer for students new to NLP and spaCy, there are many subjects left out of these notebooks, mainly machine learning and how to train custom machine learning spaCy models.


# Part One: The Basics of NLP and spaCy

In this notebook, we will not be working with spaCy in code, rather in concept. This entire JupyterBook is designed around approaching spaCy top-down. By this I mean approaching the things that spaCy does and can do and then exploring how to implement that in code. I think this is necessary so that as you explore the smaller components of spaCy, such as the Lemmatizer, you will understand how it fits into the larger architecture of the spaCy framework.

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" width=500></center>

## What is spaCy?

A good way to begin is by exploring the question, "What is spaCy?" spaCy (yes, spelled with a lowercase "s" and uppercase "C") is a natural language processing framework. **Natural language processing**, or NLP, is a branch of linguistics that seeks to parse human language in a computer system. This field is generally referred to as computational linguistics, though it has far-reaching applications beyond academic linguistic research.

NLP is used in every sector of industry, from academics who leverage it to aid in research to financial analysts who try to predict the stock market. Lawyers use NLP to help analyze thousands of legal documents in seconds to target their research, and medical doctors use it to parse patient charts. NLP has been around for decades, but with the increased promise of deep learning, a subfield of machine learning, NLP has rapidly expanded. This is because, as we shall learn all too well throughout this book, language is inherently ambiguous. By this, I mean that language does not always make perfect sense. In some cases, it is entirely illogical. The double-negative in English is a good example of this. In some contexts, it can be an emphatic positive, as in, "I cannot stress this enough, I do not like pasta." This is, of course, a lie. I love pasta, but you get my point. In other cases, the double negative can be an emphatic negative, as in, "I ain't not doing that!"

As humans, especially native speakers of a language, we can parse these complex illogical statements with ease, especially with enough context. For computers, this is not always easy.

Because NLP is such a complex problem for computers, it requires a complex solution. The answer has been found in artificial neural networks, or ANNs or neural nets for short. These are the primary areas of research for deep learning practitioners. As the field of deep learning (and machine learning in general) expands and advances, so too does NLP. New methods for training, such as transformer models, push the field further.

## How to Install spaCy

In order to install spaCy, I recommend visiting their website, here: https://spacy.io/usage. They have a nice user-friendly interface. Input your device settings, e.g. Mac or Windows or Linux, and your language, e.g. English, French, or German. The web-app will automatically populate the commands that you need to execute to get started. Since this is a JupyterBook, we can install these with a "!" before the command in a cell to indicate that we want to run a terminal command. I will be installing spaCy and the small English model, en_core_web_sm.

In [None]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

As with all Python libraries, the first thing we need to do is import spaCy. If you have followed the installation steps above and downloaded the small English model, `import spacy` should run without errors; we do this later in the notebook, just before loading the model.

## spaCy's Quickstart Page

On spaCy's page, you can find a helpful [quickstart guide](https://spacy.io/models#quickstart). This is the best reference available with up-to-date information for how to get started with each language officially supported by spaCy.

![spaCy's Quickstart page](https://github.com/stevenmaccall/tap-2023-spacy-01/blob/main/images/spacy_quickstart.JPG?raw=1)

On this page, you will notice that you have a few choices to select. The dropdown menu lets you select language. The current supported languages are:

    - Catalan
    - Chinese
    - Croatian
    - Danish
    - Dutch
    - English
    - Finnish
    - French
    - German
    - Greek
    - Italian
    - Japanese
    - Korean
    - Lithuanian
    - Macedonian
    - Multi-language
    - Norwegian Bokmål
    - Polish
    - Portuguese
    - Romanian
    - Russian
    - Slovenian
    - Spanish
    - Swedish
    - Ukrainian


Next, you can select "Loading style". This is how you intend to load a model in a Python script. A packaged spaCy pipeline functions like a Python library, or module. This means you can load it like any other Python package. While this is useful for some applications, most documentation you will see uses `spacy.load()`, where we import spaCy into our script and load a pipeline with `load()`.

This page then asks you to select for either efficiency or accuracy. Efficiency refers to computational efficiency, meaning a smaller pipeline that can run faster. Accuracy refers to a model that is substantially larger, slower, and far more accurate. By default, the efficiency pipeline is the small pipeline and the accuracy pipeline is the transformer pipeline (discussed below).

Finally, you have a checkbox for `show text example`. This will provide you with a text example. On some browsers, this checkbox does not do anything and it is selected by default.


## Naming Conventions for spaCy Pipelines

Official spaCy pipelines have standard naming conventions. They look like this: `en_core_web_sm` or `en_core_web_trf`. Let's break down what this means.

- `en`: Language. This indicates the language of the model. These are the standard 2-character representation for languages.
- `core`: Type. This indicates the type of model or its capabilities. Core indicates that this is a general pipeline that can recognize all required spaCy components, such as tagger, parser, lemmatizer, and named entity recognition.
- `web`: Genre. This indicates the type of data that was used for training. You will typically see `web` and `news` for most languages.
- `sm`: Size. There are four sizes for spaCy pipelines: small (sm), medium (md), large (lg), and transformer (trf).
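
To make the four parts concrete, here is a minimal sketch that splits a pipeline name on underscores. The helper `parse_pipeline_name` and the `PipelineName` tuple are hypothetical names made up for this illustration; they are not part of spaCy, and the sketch assumes the name follows the four-part convention described above.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    language: str       # e.g. "en"
    pipeline_type: str  # e.g. "core"
    genre: str          # e.g. "web" or "news"
    size: str           # "sm", "md", "lg", or "trf"


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a pipeline name into language, type, genre, and size.

    Hypothetical helper for illustration only; not a spaCy function.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return PipelineName(*parts)


print(parse_pipeline_name("en_core_web_sm"))
print(parse_pipeline_name("en_core_web_trf").size)
```

Reading the name this way is only a convenience for humans; spaCy itself just needs the full string.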

Each model from spaCy is also versioned, meaning it is meant to work alongside a specific version of spaCy. As spaCy updates its core library, it also retrains its models. You will receive error messages if you use older versions of the models, indicating that they may not be fully supported. This does not mean that you have to update every time spaCy puts out a new version (about once every other month). Most updates are minor and do not affect the core languages significantly. However, you should know about this as some updates do create changes that can break workflows.

A key version issue to be aware of is the change from spaCy 2x (version 2) to 3x (version 3). This was a substantial overhaul, especially in how models were trained. We will learn about this in Week 3 of this course. The key thing to be aware of now is that the documentation that you are using is for spaCy 3x and the model you are using is for 3x. If you ever want to make sure of a model's current version, you can use pip.

In [None]:
!pip show en_core_web_sm

Name: en-core-web-sm
Version: 3.4.1
Summary: English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
Home-page: https://explosion.ai
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: c:\users\wma22\anaconda3\lib\site-packages
Requires: spacy
Required-by: 
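
If you would rather check versions from inside Python than through pip, the standard library's `importlib.metadata` offers one way to do it. The sketch below is an assumption-laden alternative, not the method used in this notebook: `installed_version` is a hypothetical helper, and it simply returns `None` when a package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Compare the library version with the model version.
for package in ("spacy", "en_core_web_sm"):
    print(package, installed_version(package))
```

For a 3.x setup, both lines should report a version beginning with 3.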




## What's a Pipeline?

You may hear the words spaCy model or spaCy pipeline used quite often. Let's jump into what these words mean. A model refers to a machine learning model, or a statistical model that is trained to do a specific task. A pipeline is a sequence of components (models, rules, or listeners), each of which can leverage the work already done by prior components in the pipeline.

On spaCy's main page, they use the following image to represent a standard non-transformer pipeline (small, medium, and large pipelines).

![spaCy pipeline](https://spacy.io/images/pipeline-design.svg)

Imagine a string (a text) going through this pipeline beginning at the left. As it moves downstream through each component to the right, the text is mutated. Each component in this image has a specific role and changes the text in different ways. Throughout the next few weeks, we will learn a lot more about these components and how to build some of our own, but for now, let's just stick with the basics.

When the text goes into the model, it is first converted into a vector representation of a document. We will learn a lot more about word vectors and document vectors in week 3 when we dive into machine learning. For now, think of a vector representation of a text as something that allows for the spaCy pipeline to numerically understand the meaning of a text. These are complex multi-dimensional numbers.

Later components are able to use these vectors, meaning they can be trained to perform individual tasks. These are the listener components that are trained to recognize things like parts-of-speech and lemmas of words so that when a text that was not used in the training data is given to the spaCy pipeline it can make accurate (hopefully!) predictions.

It is important to keep this image in your mind as we work through spaCy because it is important to understand that sequence is absolutely essential. If you are designing a custom spaCy component that needs to use the lemmas of a word to do a specific task, then you need to make sure that component sits after the lemmatizer.
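
The point about sequence can be sketched without spaCy at all. The toy pipeline below is not how spaCy is implemented: the "doc" is a plain dictionary, and `tokenizer`, `lemmatizer`, `lemma_counter`, `run_pipeline`, and the tiny `LEMMAS` table are all hypothetical names invented for this illustration. It only shows what happens when a component runs before the annotation it depends on exists.

```python
# Tiny hand-written lemma table, just enough for the example sentence.
LEMMAS = {"peas": "pea"}


def tokenizer(doc):
    doc["tokens"] = doc["text"].split()
    return doc


def lemmatizer(doc):
    doc["lemmas"] = [LEMMAS.get(t.lower(), t.lower()) for t in doc["tokens"]]
    return doc


def lemma_counter(doc):
    # Needs doc["lemmas"], so it must run after the lemmatizer.
    counts = {}
    for lemma in doc["lemmas"]:
        counts[lemma] = counts.get(lemma, 0) + 1
    doc["lemma_counts"] = counts
    return doc


def run_pipeline(text, components):
    """Pass a toy doc through each component from left to right."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


good = run_pipeline("peas and more peas", [tokenizer, lemmatizer, lemma_counter])
print(good["lemma_counts"])

try:
    run_pipeline("peas and more peas", [tokenizer, lemma_counter, lemmatizer])
except KeyError as error:
    print("Wrong order, missing annotation:", error)
```

In spaCy itself, placement is controlled when a component is added; `nlp.add_pipe` accepts `before` and `after` arguments for this purpose.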


In [None]:
import spacy



With spaCy imported, we can now create our nlp object. This is the standard Pythonic way to create your model in a Python script. Unless you are working with multiple models in a script, try to always name your model, nlp. It will make your script much easier to read. To do this, we will use spacy.load(). This command tells spaCy to load up a model. In order to know which model to load, it needs a string argument that corresponds to the model name. Since we will be working with the small English model, we will use "en_core_web_sm". This function can take keyword arguments to identify which parts of the model you want to load, but we will get to that later. For now, we want to import the whole thing.

In [None]:
nlp = spacy.load("en_core_web_sm")

## Why use spaCy? A Quick Example (IN-CLASS LESSON)

In [None]:
from spacy import displacy

In [None]:
with open ("../data/fo_jefferson.txt", "r", encoding="utf-8") as f:
    data = f.read()
print(data)

“To Thomas Jefferson from George Wythe, 9 March 1770,” Founders Online, National Archives, https://founders.archives.gov/documents/Jefferson/01-01-02-0027. [Original source: The Papers of Thomas Jefferson, vol. 1, 1760–1776, ed. Julian P. Boyd. Princeton: Princeton University Press, 1950, p. 38.]
I send you some nectarine and apricot graffs and grapevines, the best I had; and have directed your messenger to call upon Major Taliaferro for some of his. You will also receive two of Foulis’s catalogues. Mrs. Wythe will send you some garden peas.
You bear your misfortune so becomingly, that, as I am convinced you will surmount the difficulties it has plunged you into, so I foresee you will hereafter reap advantages from it several ways. Durate, et vosmet rebus servate secundis.


In [None]:
text = data.splitlines()[1:]
text = "\n".join(text)
print(text)

I send you some nectarine and apricot graffs and grapevines, the best I had; and have directed your messenger to call upon Major Taliaferro for some of his. You will also receive two of Foulis’s catalogues. Mrs. Wythe will send you some garden peas.
You bear your misfortune so becomingly, that, as I am convinced you will surmount the difficulties it has plunged you into, so I foresee you will hereafter reap advantages from it several ways. Durate, et vosmet rebus servate secundis.


In [None]:
doc = nlp(text)
displacy.render(doc, style="ent")

## Why Master spaCy? (IN-CLASS LESSON)

In [None]:
import re

import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")

# A title (Mrs or Major, with an optional period) followed by one or more capitalized names.
salutation_names_pattern = r"(Mrs|Major)(\.)* [A-Z]\w+( [A-Z]\w+)*"

@Language.component("salutation_person")
def salutation_person(doc):
    # Start from any entities already set on the Doc.
    ents = list(doc.ents)
    for match in re.finditer(salutation_names_pattern, doc.text):
        start, end = match.span()
        # char_span returns None when the match does not align with token boundaries.
        span = doc.char_span(start, end, label="PERSON")
        if span is not None:
            ents.append(span)
    # Drop overlapping spans before assigning them to doc.ents.
    doc.ents = filter_spans(ents)
    return doc

# Place the rule-based component before the statistical NER component.
nlp.add_pipe("salutation_person", before="ner")
doc = nlp(text)
displacy.render(doc, style="ent")
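
To see what the regular expression in this component matches on its own, independent of spaCy, here is a small sketch using only Python's `re` module. The `sample` string is abridged from the letter above; everything else comes from the cell.

```python
import re

salutation_names_pattern = r"(Mrs|Major)(\.)* [A-Z]\w+( [A-Z]\w+)*"

# Abridged from the Wythe letter used above.
sample = (
    "I have directed your messenger to call upon Major Taliaferro "
    "for some of his. Mrs. Wythe will send you some garden peas."
)

# group(0) is the full matched string; span() gives its character offsets.
matches = [m.group(0) for m in re.finditer(salutation_names_pattern, sample)]
print(matches)
# → ['Major Taliaferro', 'Mrs. Wythe']
```

The component then converts each character-offset match into a token-aligned span and labels it as a person.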

# Part Two: Getting Started with spaCy and its Linguistic Annotations

In this part of the notebook, we will start working with spaCy directly. The goals of this chapter are twofold. First, it is my hope that you understand the basic spaCy syntax for creating a Doc container and how to call specific attributes of that container. Second, it is my hope that you leave this chapter with a basic understanding of the vast linguistic annotations available in spaCy. While we will not explore all attributes, we will deal with many of the most important ones, such as lemmas, parts-of-speech, and named entities. By the time you are finished with this chapter, you should have enough of a basic understanding of spaCy to begin applying it to your own texts.

## Containers

The first thing new spaCy students need to understand is the hierarchy of spaCy data objects. In spaCy, this means beginning to interact with and understand containers. **`Containers are spaCy objects that contain a large quantity of data about a text.`** When we analyze texts with the spaCy framework, we create different container objects to do that. Here is a full list of all spaCy containers. We will be focusing on three (emboldened): Doc, Span, and Token.

* <b>Doc</b>
* DocBin
* Example
* Language
* Lexeme
* <b>Span</b>
* SpanGroup
* <b>Token</b>

I created the image below to show how I visualize spaCy containers in my mind. At the top, we have a Doc container. This is the basis for all spaCy. It is the main object that we create. Within the Doc container are many different attributes and subcontainers. One attribute is the Doc.sents, which contains all the sentences in the Doc container. The doc container (and each sentence generator) is made up of a set of token containers. These are things like words, punctuation, etc.

Span containers are kind of like tokens, in that they are a piece of a Doc container. Spans have one thing that makes them unique. They can cross multiple tokens.

We can give spans a bit more specificity by classifying them into different groups. These are known as SpanGroup containers.


<center><img src="http://spacy.pythonhumanities.com/_images/spacy_containers.png" width=500></center>

In [None]:
with open ("../data/wiki_us.txt", "r") as f:
    us_text = f.read()

Now, let's see what this text looks like. It can be a bit difficult to read in a JupyterBook, but notice the horizontal slider below. You don't need to read this in its entirety.

In [None]:
print(us_text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

## Creating a Doc Container

With the data loaded in, it's time to make our first Doc container. Unless you are working with multiple Doc containers, it is best practice to always call this object "doc", all lowercase. To create a doc container, we will usually just call our nlp object and pass our text to it as a single argument.

In [None]:
doc = nlp(us_text)

Great! Let's see what this looks like.

In [None]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [None]:
print(len(doc))
print(len(us_text))

654
3521


Hmm... What's going on here? Same text, but different length. Why does this occur? To answer that, let's explore it more deeply and try and print off each item in each object.

In [None]:
for token in us_text[:10]:
    print (token)

T
h
e
 
U
n
i
t
e
d


As we would expect. We have printed off each character, including white spaces. Let's try and do the same with the Doc container.

In [None]:
for token in doc[:10]:
    print (token)

The
United
States
of
America
(
U.S.A.
or
USA
)


And now we see the magical difference. While on the surface it may seem that the Doc container's length is dependent on the quantity of words, look more closely. You should notice that the open and close parentheses are also considered items in the container. These are all known as tokens. **Tokens** are a fundamental building block of spaCy or any NLP framework. They can be words or punctuation marks. A token is a self-contained unit that has a syntactic purpose in a sentence. A good example of this is the contraction "don't" in English. When it is tokenized (tokenization being the process of converting text into tokens), we get two tokens, "do" and "n't", because the contraction represents two words, "do" and "not".

On the surface, this may not seem exceptional. But it is. You may be thinking to yourself that you could easily use the split method in Python to split by whitespace and have the same result. But you'd be wrong. Let's see why.

In [None]:
for token in us_text.split()[:10]:
    print (token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


Notice that the parentheses are not removed or handled individually. To see this more clearly, let's print off all tokens from index 5 to 8 in both the text and doc objects.

In [None]:
words = us_text.split()[:10]

In [None]:
# enumerate keeps the index in step with the slice, starting at 5.
for i, token in enumerate(doc[5:8], start=5):
    print (f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),




We can see clearly now how the spaCy Doc container does much more with its tokenization than a simple split method. We could, surely, write complex rules for a language to achieve the same results, but why bother? spaCy does it exceptionally well for the languages it supports. In my entire time using spaCy, I have never seen the tokenizer make a mistake. I am sure that mistakes may occur, but these are probably rare exceptions.
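
To give a sense of what writing such rules by hand involves, here is a deliberately small regex tokenizer. It is a toy sketch, not spaCy's tokenizer: `TOKEN_PATTERN` and `toy_tokenize` are hypothetical names, and the only rules it knows are "split off punctuation" and "split off n't".

```python
import re

# Order matters: try the contraction suffix first, then a word that is
# followed by that suffix, then any other word, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")


def toy_tokenize(text):
    """Return a list of token strings using the toy rules above."""
    return TOKEN_PATTERN.findall(text)


sentence = "I don't like (plain) pasta."
print(sentence.split())
# → ['I', "don't", 'like', '(plain)', 'pasta.']
print(toy_tokenize(sentence))
# → ['I', 'do', "n't", 'like', '(', 'plain', ')', 'pasta', '.']

# The same rules break an abbreviation that spaCy kept as one token above.
print(toy_tokenize("U.S.A."))
# → ['U', '.', 'S', '.', 'A', '.']
```

Two rules already handle the first sentence, but the abbreviation shows how quickly exceptions pile up, which is the work spaCy's tokenizer saves us.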

Let's see what else this Doc Container holds.

## Sentence Boundary Detection (SBD)

In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text. Again, this may seem fairly easy to do with rules. One could use split("."), but in English we use the period to also denote abbreviation. You could, again, write rules to look for periods not preceded by a lowercase word, but again, I ask the question, "why bother?". We can use spaCy and in seconds have all sentences fully separated through SBD.
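
A quick sketch in plain Python shows the abbreviation problem. The sample below joins two sentences taken from the text we loaded; it contains two real sentences, but splitting on the period produces more pieces than that.

```python
text = (
    "The national capital is Washington, D.C., and the most populous "
    "city is New York. Paleo-Indians migrated from Siberia."
)

# Split on every period, then drop empty strings and surrounding spaces.
pieces = [piece.strip() for piece in text.split(".") if piece.strip()]
for piece in pieces:
    print(repr(piece))
print(len(pieces))
# → 4
```

The abbreviation "D.C." is cut into fragments, so two sentences become four pieces.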

To access the sentences in the Doc container, we can use the attribute sents, like so:

In [None]:
for sent in doc.sents:
    print (sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

Let's move forward with just one of these sentences. Let's try and grab index 0 in this attribute.

In [None]:
sentence1 = doc.sents[0]
print (sentence1)

TypeError: 'generator' object is not subscriptable

Uh oh! We got an error. That is because the sents attribute is a generator. It is beyond the scope of this notebook to explain what generators are or how they work. Instead, let's convert our generator into a list so that we can work with it by each index.

In [None]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
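
The same behaviour can be reproduced with an ordinary Python generator, which may help if the error above felt mysterious. In this sketch, `sentences` is a hypothetical stand-in for `doc.sents`; it yields two made-up strings rather than real spaCy sentences.

```python
def sentences():
    """A stand-in generator that yields items one at a time."""
    yield "First sentence."
    yield "Second sentence."


gen = sentences()
try:
    gen[0]  # generators cannot be indexed
except TypeError as error:
    print("TypeError:", error)

# Two common workarounds: take only the first item, or build a full list.
first = next(sentences())
as_list = list(sentences())
print(first)
print(as_list[1])
```

Converting to a list is the simplest option when you need several sentences by index; `next` avoids building the whole list when you only need the first one.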


Now we have the first sentence. Now that we have a smaller text, let's explore spaCy's other building block, the token.

## Token Attributes

The token object contains a lot of different attributes that are vital to performing NLP in spaCy. We will be working with a few of them, such as:

* .text
* .head
* .left_edge
* .right_edge
* .ent_type_
* .ent_iob_
* .lemma_
* .morph
* .pos_
* .dep_
* .lang_

I will briefly describe these here and show you how to grab each one and what they look like. We will be exploring each of these attributes more deeply in this chapter and future chapters. To demonstrate each of these attributes, we will use one token, "States" which is part of a sequence of tokens that make up "The United States of America"

In [None]:
token2 = sentence1[2]
print(token2)

States


### Text

```Verbatim text content.``` -spaCy docs

In [None]:
token2.text

'States'

### Head

```The syntactic parent, or “governor”, of this token.``` -spaCy docs

In [None]:
token2.head

is

This tells us which word governs the token: in this case, the primary verb, "is", because "States" is part of the noun subject.

### Left Edge

``` The leftmost token of this token’s syntactic descendants.``` -spaCy docs

In [None]:
token2.left_edge

The

If part of a sequence of tokens that are collectively meaningful, known as **multi-word tokens**, this will tell us where the multi-word token begins.

### Right Edge

``` The rightmost token of this token’s syntactic descendants.``` -spaCy docs

In [None]:
token2.right_edge

,

This will tell us where the multi-word token ends.
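
The phrase "syntactic descendants" in the two definitions above can be illustrated with a toy dependency tree. This is a simplified sketch, not spaCy code: the `heads` list is hand-written for a shortened version of the sentence (each entry is the index of that token's head, and the root points to itself), and `subtree_indices` is a hypothetical helper.

```python
tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Hand-written head indices, assumed for this shortened sentence.
heads = [2, 2, 5, 2, 3, 5, 7, 5]


def subtree_indices(index, heads):
    """Return the sorted indices of a token and all of its descendants."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)


subtree = subtree_indices(2, heads)  # index 2 is "States"
print([tokens[i] for i in subtree])
print("left edge:", tokens[subtree[0]])
print("right edge:", tokens[subtree[-1]])
```

In this shortened sentence the right edge is "America". In the full sentence, spaCy reported a comma, presumably because the subtree of "States" there extends further to the right.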

### Entity Type

``` Named entity type.``` -spaCy docs

In [None]:
token2.ent_type

384

Note the absence of the _ at the end of the attribute. Without it, the attribute returns an integer that corresponds to an entity type, whereas adding the _ gives you the string equivalent, as shown below.

In [None]:
token2.ent_type_

'GPE'

We will learn all about types of entities in our chapter on named entity recognition, or NER. For now, simply understand that GPE is geopolitical entity and is correct.

### Ent IOB

```IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.```

In [None]:
token2.ent_iob_

'I'

IOB is a method of annotating a text. In this case, we see "I" because "States" is inside an entity, that is to say that it is part of "The United States of America".
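
As a rough illustration of the scheme, the sketch below assigns IOB codes to a list of tokens given an entity span. It is plain Python, not spaCy's implementation: `iob_tags` is a hypothetical helper, and the span covers the first five tokens, which is the entity "The United States of America" in our text.

```python
def iob_tags(n_tokens, entities):
    """Return one (iob, entity_type) pair per token position.

    entities is a list of (start, end, label) tuples with end exclusive.
    """
    tags = [("O", "")] * n_tokens
    for start, end, label in entities:
        tags[start] = ("B", label)       # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)       # remaining tokens are inside it
    return tags


tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
entities = [(0, 5, "GPE")]
tags = iob_tags(len(tokens), entities)
for token, (iob, ent_type) in zip(tokens, tags):
    print(token, iob, ent_type)
```

Here "The" receives "B", "States" receives "I" (matching the output above), and "is" receives "O".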

### Lemma

```Base form of the token, with no inflectional suffixes.``` -spaCy docs

In [None]:
token2.lemma_

'States'

In [None]:
sentence1[12].lemma_

'know'

### Morph

```Morphological analysis``` -spaCy docs

In [None]:
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part

### Part of Speech

```Coarse-grained part-of-speech from the Universal POS tag set.``` -spaCy docs

In [None]:
token2.pos_

'PROPN'

### Syntactic Dependency

```Syntactic dependency relation.``` -spaCy docs

In [None]:
token2.dep_

'nsubj'

### Language

```Language of the parent document’s vocabulary.``` -spaCy docs

In [None]:
token2.lang_

'en'

## Part of Speech Tagging (POS)

In the field of computational linguistics, understanding parts-of-speech is essential. SpaCy offers an easy way to parse a text and identify its parts of speech. Below, we will iterate across each token (word or punctuation) in the text and identify its part of speech.

In [None]:
for token in sentence1:
    print(token.text, token.pos_, token.dep_)

The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
primarily ADV advmod
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct


Here, we can see three pieces of information for each token: the string, its part-of-speech (pos), and its syntactic dependency label (dep). For a complete list of the pos labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging). Most of these, however, should be apparent, e.g. PROPN is proper noun, AUX is an auxiliary verb, ADJ is adjective, etc. We can visualize this sentence with a diagram through spaCy's displaCy Notebook feature.

In [None]:
from spacy import displacy

In [None]:
displacy.render(sentence1, style="dep")

## Named Entity Recognition

Another essential task of NLP is named entity recognition, or NER. I spoke about NER in the last notebook. Here, I’d like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For our purposes right now, I simply want to print off each entity’s text (the string itself) and its corresponding label (note the _ after label). I will be explaining this process in much greater detail in the next two notebooks.

In [None]:
for ent in doc.ents:
    print (ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
third- or fourth DATE
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam 

Sometimes it can be difficult to read this output as raw data. In this case, we can again leverage spaCy's displaCy feature. Notice that this time we are altering the keyword argument, style, with the string "ent". This tells displaCy to display the text as NER annotations.

In [None]:
displacy.render(doc, style="ent")

# Exercises

`I know we covered a lot in this notebook and the best way to understand its contents in depth is to apply it to your own domain, or area of expertise. I encourage you to select a text (or texts) that you use in your own research and try to apply the methods covered in this notebook to those particular texts. I would highly encourage you to do this before moving on to the next notebook.`