# spaCy Linguistic Annotations

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/01_02_linguistic_annotations.html

## 1. Doc Container

### 1.1 Concept
In the spaCy library, the **Doc** container represents a processed document in natural language processing tasks. It is a fundamental data structure provided by spaCy to hold the linguistic annotations and properties of a text.

A **Doc** object in spaCy consists of a sequence of **Token** objects, where each token represents a word or another unit of text, such as a punctuation mark. The **Doc** container stores various linguistic annotations for each token, including part-of-speech tags, named-entity labels, syntactic dependencies, lemmas, and more. These annotations are computed as the text passes through the spaCy processing pipeline.

The **Doc** container provides efficient access to the linguistic information of the text and supports various operations and methods for working with the tokens and their annotations. You can access individual tokens, iterate over them, retrieve their properties, and perform operations such as merging, splitting, or extracting sub-spans of the document.

The **Doc** container is an essential component in spaCy's processing pipeline, allowing us to perform advanced NLP tasks, such as entity recognition, dependency parsing, or text classification, by leveraging the annotated information stored within the **Doc** object.
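
As a minimal sketch of the access patterns described above, the snippet below indexes and slices a **Doc**. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a short made-up sentence, so it runs without downloading a trained model; no annotations beyond tokenization are available in that setup.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), chosen so that
# no trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")
doc = nlp("The United States of America (U.S.A. or USA) is a country.")

first = doc[0]   # indexing a Doc returns a Token
span = doc[2:5]  # slicing a Doc returns a Span, a view into the Doc

print(len(doc))              # number of tokens, not characters
print(first.text, first.i)   # token text and its position in the Doc
print(span.text)             # text covered by the span
print([t.text for t in doc[5:10]])  # brackets become separate tokens
```

Indexing yields a `Token` and slicing yields a `Span`; both refer back to the same underlying **Doc** rather than copying the text.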

### 1.2 Code

In [6]:
import spacy

In [7]:
# loading model
nlp = spacy.load("en_core_web_sm")

In [8]:
with open("data/wiki_us.txt", "r") as f:
    text = f.read()

In [9]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [10]:
doc = nlp(text)

In [11]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [12]:
print(len(text))
print(len(doc))

3521
654


In [13]:
# iterate over the raw string with a simple for loop: each item is a single character
for token in text[:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [14]:
# iterate over the doc with a simple for loop: each item is a Token
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [15]:
# str.split() leaves punctuation attached to words, which is why spaCy's tokenizer is worth using
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [16]:
# side-by-side comparison: spaCy tokens vs. whitespace-split words at the same positions
words = text.split()[:10]
i = 5
for token in doc[i:8]:
    print(f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")
    i += 1

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),




## 2. Sentence Boundary Detection (SBD)

### 2.1 Concept

**SBD**, also known as sentence segmentation or sentence tokenization, is the task of identifying the boundaries between sentences in a given text. It is a fundamental step in NLP and text analysis.

The goal of **SBD** is to correctly identify the end of one sentence and the beginning of the next, even in the absence of explicit punctuation marks such as periods. This task can be challenging because sentence boundaries can be ambiguous, especially in texts with complex sentence structures, abbreviations, or other linguistic variations.

Various techniques and algorithms are used for **SBD**, ranging from simple rule-based approaches to more sophisticated machine learning methods. Rule-based approaches rely on patterns and heuristics to determine sentence boundaries based on punctuation marks, capitalization, and other language-specific cues. Machine learning approaches involve training models on annotated data to predict sentence boundaries based on contextual and structural features of the text.

Accurate sentence boundary detection is crucial for many NLP tasks, such as machine translation, information extraction, text summarization, and sentiment analysis. It helps to properly segment the text into meaningful units, enabling downstream analysis and processing at the sentence level.
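
To make the rule-based approach concrete, here is a deliberately naive sketch in plain Python. The single rule (a sentence ends at `.`, `!` or `?` followed by whitespace and an uppercase letter), the function name `naive_split`, and both sample strings are assumptions made for illustration; this is not how spaCy segments sentences.

```python
import re

# Naive rule: a boundary is whitespace that follows ., ! or ?
# and precedes an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text):
    """Split text into sentences using the single rule above."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

# Works when punctuation and capitalization line up with real boundaries.
sample = "It consists of 50 states. The capital is Washington, D.C., and the largest city is New York."
print(naive_split(sample))

# Fails on abbreviations: "Dr." is wrongly treated as a sentence end.
tricky = "Dr. Smith lives in the U.S. He works in Boston."
print(naive_split(tricky))
```

The second call returns three pieces for a two-sentence text, which is the kind of ambiguity that motivates longer rule lists or trained models.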

### 2.2 Code

In [17]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [18]:
# doc.sents is a generator, so it cannot be indexed directly
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

In [19]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
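
Converting `doc.sents` to a list works, but it materializes every sentence. When only the first sentence is needed, `next()` is enough. The sketch below assumes a blank English pipeline with spaCy's rule-based `sentencizer` component and a made-up two-sentence text, so that it runs without a trained model; with `en_core_web_sm` the boundaries come from the pipeline's own components instead.

```python
import spacy

# Assumption: blank English pipeline plus the rule-based "sentencizer",
# used only so this sketch runs without a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is the first sentence. This is the second one.")

# next() pulls only the first Span out of the generator.
first_sentence = next(iter(doc.sents))
print(first_sentence.text)

# list() is still the right tool when random access is needed.
all_sentences = list(doc.sents)
print(len(all_sentences))
```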


## 3. Token Attributes

More about attributes: [spaCy token attributes](https://spacy.io/api/token#attributes)

### 3.1 Concept

Token attributes refer to the specific characteristics or properties associated with individual tokens in NLP tasks. These attributes provide valuable information about each token, enabling various analyses and operations on text data.

spaCy exposes these properties directly on each **Token** object. Some commonly used token attributes:
1. **text**: The actual text of the token.
2. **head**: The syntactic head of the token. It refers to the token that governs the current token in the dependency parse tree.
3. **left_edge**: The leftmost token of the subtree formed by the token and its syntactic descendants.
4. **right_edge**: The rightmost token of the subtree formed by the token and its syntactic descendants.
5. **lemma_**: The base or canonical form of the token.
6. **morph**: The morphological analysis of the token, providing information about its inflectional features such as tense, number, gender, etc.
7. **pos_**: The part-of-speech tag of the token.
8. **tag_**: The detailed part-of-speech tag of the token.
9. **dep_**: The syntactic dependency relation of the token.
10. **lang_**: The language of the parent document's vocabulary (for example, "en"). It is not detected per token.
11. **ent_iob_**: The IOB (Inside-Outside-Beginning) code of the token's named entity tag: "B" if the token begins an entity, "I" if it is inside an entity, and "O" if it is outside any entity.
12. **is_stop**: A boolean value indicating whether the token is a stop word.
13. **is_alpha**: A boolean value indicating whether the token consists of alphabetic characters.
14. **is_digit**: A boolean value indicating whether the token consists of digits.
15. **is_punct**: A boolean value indicating whether the token is a punctuation symbol.
16. **is_space**: A boolean value indicating whether the token is a whitespace.
17. **is_upper**: A boolean value indicating whether the token is in uppercase.
18. **is_lower**: A boolean value indicating whether the token is in lowercase.
19. **is_title**: A boolean value indicating whether the token is in titlecase.
20. **ent_type_**: The named entity type of the token, if it is part of a named entity.
21. **is_sent_start**: A boolean value indicating whether the token starts a sentence (None if unknown).

In [21]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [22]:
token2 = sentence1[2]
print(token2)

States


### 3.2 Text

Verbatim text content. Type: str

In [24]:
# actual text of the token
token2.text

'States'

### 3.3 Head

The syntactic parent, or "governor", of this token. Type: Token

In [50]:
token2.head

is

### 3.4 left_edge

The leftmost token of this token's syntactic descendants. Type: Token

In [25]:
token2.left_edge

The

### 3.5 right_edge

The rightmost token of this token's syntactic descendants. Type: Token

In [51]:
token2.right_edge

America
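
To illustrate how **head**, **left_edge**, and **right_edge** relate, the sketch below recomputes the two edges from a list of head indices in plain Python. The head indices are hand-written assumptions for the opening words of the sample sentence, not parser output; they were chosen to agree with the results shown above for "States". The helper `subtree` is likewise hypothetical.

```python
# Opening words of the sample sentence, simplified for illustration.
tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]

# Assumption: hand-written head indices, not parser output. heads[i] is the
# index of token i's syntactic parent; the root ("is") points to itself.
heads = [2, 2, 5, 2, 3, 5, 7, 5]

def subtree(i):
    """Return the sorted indices of token i and all of its descendants."""
    members = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return sorted(members)

states = tokens.index("States")
indices = subtree(states)
left_edge = tokens[indices[0]]    # leftmost token of the subtree
right_edge = tokens[indices[-1]]  # rightmost token of the subtree
print(tokens[heads[states]], left_edge, right_edge)
```

Under these assumed heads, the subtree of "States" spans "The United States of America", so its edges are "The" and "America" and its head is "is".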

### 3.6 lemma_

Base form of the token, with no inflectional suffixes. Type: str

In [52]:
token2.lemma_

'States'

In [68]:
print(f"Word: {sentence1[12]}, lemma: {sentence1[12].lemma_}")

Word: known, lemma: know


### 3.7 morph

Morphological analysis. Type: MorphAnalysis

In [53]:
token2.morph

Number=Sing

In [69]:
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part
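
The printed value follows the Universal Dependencies feature notation: `Feature=Value` pairs joined by `|`. spaCy's `MorphAnalysis` offers `to_dict()` and `get()` for structured access; the plain-Python sketch below only parses the printed string by hand to show the format. The helper name `parse_feats` is hypothetical.

```python
def parse_feats(feats):
    """Parse a 'Feature=Value|Feature=Value' string into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Strings copied from the two outputs above.
known = parse_feats("Aspect=Perf|Tense=Past|VerbForm=Part")
states = parse_feats("Number=Sing")

print(known["Tense"])
print(states)
```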

### 3.8 pos_

Coarse-grained part-of-speech from [the Universal POS tag set](https://universaldependencies.org/u/pos/). Type: str

In [32]:
token2.pos_

'PROPN'

### 3.9 tag_

Fine-grained part-of-speech. Type: str

In [54]:
token2.tag_

'NNP'

### 3.10 dep_

Syntactic dependency relation. Type: str

In [55]:
token2.dep_

'nsubj'
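
The labels returned by **pos_**, **tag_**, and **dep_** are terse. As a small sketch, `spacy.explain` looks up a short description for a label in spaCy's built-in glossary and needs no loaded model. The labels below are the ones printed above for "States".

```python
import spacy

# Labels copied from the outputs above: coarse POS, fine-grained tag, dependency.
labels = ["PROPN", "NNP", "nsubj"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label}: {description}")
```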

### 3.11 lang_

Language of the parent document's vocabulary. Type: str

In [56]:
token2.lang_

'en'

### 3.12 ent_iob_

IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. Type: str

In [57]:
token2.ent_iob_

'I'
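
To show how IOB codes follow from entity spans, the plain-Python sketch below derives one code per token from `(start, end, label)` triples. The entity span is an assumption made for illustration (tokens 0 to 4 as a single GPE entity); the span actually predicted by the model may differ. The helper `iob_codes` is hypothetical.

```python
def iob_codes(n_tokens, entities):
    """Return one IOB code per token for (start, end, label) spans; end is exclusive."""
    codes = ["O"] * n_tokens
    for start, end, _label in entities:
        codes[start] = "B"               # first token of the entity
        for i in range(start + 1, end):  # remaining tokens of the entity
            codes[i] = "I"
    return codes

tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Assumption: one GPE entity covering tokens 0-4, chosen for illustration.
entities = [(0, 5, "GPE")]

codes = iob_codes(len(tokens), entities)
for token, code in zip(tokens, codes):
    print(f"{token:<8} {code}")
```

Under this assumed span, "States" sits inside the entity and receives "I", consistent with the output above.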

### 3.13 is_stop

Is the token part of a "stop list"? Type: bool

In [58]:
token2.is_stop

False

### 3.14 is_alpha

Does the token consist of alphabetic characters? Equivalent to token.text.isalpha(). Type: bool

In [59]:
token2.is_alpha

True

### 3.15 is_digit

Does the token consist of digits? Equivalent to token.text.isdigit(). Type: bool

In [60]:
token2.is_digit

False

### 3.16 is_punct

Is the token punctuation? Type: bool

In [61]:
token2.is_punct

False

### 3.17 is_space

Does the token consist of whitespace characters? Equivalent to token.text.isspace(). Type: bool

In [62]:
token2.is_space

False

### 3.18 is_upper

Is the token in uppercase? Equivalent to token.text.isupper(). Type: bool

In [63]:
token2.is_upper

False

### 3.19 is_lower

Is the token in lowercase? Equivalent to token.text.islower(). Type: bool

In [64]:
token2.is_lower

False

### 3.20 is_title

Is the token in titlecase? Equivalent to token.text.istitle(). Type: bool

In [65]:
token2.is_title

True
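
Several of the boolean flags in this section are described as equivalent to Python string methods on `token.text`. The plain-Python sketch below applies those string methods to a few token texts, so the flags can be explored without loading a model. The sample strings are assumptions picked to trigger different flags; **is_punct** and **is_stop** have no string-method equivalent and are left out.

```python
# Token texts chosen for illustration; "States" is the token inspected above.
samples = ["States", "USA", "known", "50", "("]

results = {}
for text in samples:
    results[text] = {
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        "is_upper": text.isupper(),
        "is_lower": text.islower(),
        "is_title": text.istitle(),
    }
    # Print only the flags that are True for this text.
    print(text, [name for name, value in results[text].items() if value])
```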

### 3.21 ent_type_

Named entity type. Type: str

In [66]:
token2.ent_type_

'GPE'

### 3.22 is_sent_start

Does the token start a sentence? Defaults to True for the first token in the Doc. Type: bool, or None if unknown

In [67]:
token2.is_sent_start

False

## 4. Part-of-Speech (POS) Tagging

POS tagging is an NLP process that assigns a grammatical category, or part-of-speech tag, to each word in a given sentence or text. The tags represent the syntactic role and grammatical category of words in a sentence, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, and more.

POS tagging is essential for many NLP tasks and applications because it provides insights into the structure and meaning of sentences, allowing for more accurate analysis and understanding of text. It helps in disambiguating word meanings, resolving grammatical ambiguities, and facilitating further linguistic analysis.

The process of POS tagging involves training machine learning models on annotated corpora, where human linguists or annotators manually assign appropriate part-of-speech tags to each word. These annotated datasets serve as training data for supervised learning algorithms to learn patterns and statistical associations between words and their corresponding part-of-speech tags.
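
As a minimal sketch of learning from annotated data, the snippet below trains a most-frequent-tag baseline: for each word it remembers the tag seen most often in the training sentences. The four-sentence corpus, its tags, and the function names `train` and `tag` are invented for illustration; real taggers use far larger corpora and take context into account.

```python
from collections import Counter, defaultdict

# Assumption: a tiny hand-annotated corpus of (word, tag) pairs.
annotated = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("was", "AUX"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB"), ("fast", "ADV")],
]

def train(corpus):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag, or a default for unseen words."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train(annotated)
tagged = tag(["The", "dogs", "run"], model)
print(tagged)

# Context is ignored: "run" is tagged VERB even after a determiner.
print(tag(["the", "run"], model))
```

The second call shows the limit of this baseline: "run" after "the" should be a noun, which is why practical taggers condition on surrounding words.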

There are different approaches to POS tagging, including:
1. **Rule-Based Tagging**: In this approach, a set of predefined rules and patterns are created to assign part-of-speech tags based on word morphology, context, and syntactic rules. For example, if a word ends with "-ing", it is likely a verb.
2. **Probabilistic Tagging**: This approach involves using statistical models, such as Hidden Markov Models (HMMs) or Maximum Entropy Markov Models (MEMMs), to assign part-of-speech tags based on the probability of a word belonging to a particular category given its context and surrounding words.
3. **Neural Network Tagging**: With the advancements in deep learning, neural network-based approaches, such as Recurrent Neural Networks (RNNs) or Transformer models, have been successfully applied to POS tagging. These models learn the sequential dependencies and contextual information in a sentence to predict the part-of-speech tags.
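
The rule-based approach in item 1 can be sketched with a few regular expressions. The rule list, the tag names, the function `rule_tag`, and the sample words are all assumptions for illustration; a usable rule-based tagger needs many more rules plus exception lists.

```python
import re

# Assumption: a handful of illustrative rules, checked in order; first match wins.
RULES = [
    (re.compile(r"^(the|a|an)$", re.IGNORECASE), "DET"),
    (re.compile(r"^\d+([.,]\d+)*$"), "NUM"),
    (re.compile(r"^(is|are|was|were)$", re.IGNORECASE), "AUX"),
    (re.compile(r".+ing$"), "VERB"),
    (re.compile(r".+ly$"), "ADV"),
    (re.compile(r".+(ous|ful|able)$"), "ADJ"),
]

def rule_tag(word, default="NOUN"):
    """Return the tag of the first matching rule, or a default tag."""
    for pattern, tag_label in RULES:
        if pattern.search(word):
            return tag_label
    return default

words = ["The", "populous", "country", "is", "rapidly", "expanding", "during", "spring"]
tags = [rule_tag(w) for w in words]
for word, tag_label in zip(words, tags):
    print(f"{word:<10} {tag_label}")
```

The "-ing" rule tags "expanding" correctly but also mislabels "during" and "spring" as verbs, which shows why surface patterns alone are unreliable and why probabilistic and neural taggers use context.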

POS tagging is a fundamental step in many NLP applications, including:
+ **Syntax Parsing**: POS tags provide information about the grammatical structure of a sentence, which is crucial for syntactic parsing and analyzing the relationships between words.
+ **Named Entity Recognition**: POS tags can assist in identifying named entities by providing contextual cues. For example, proper nouns (tagged PROPN) frequently belong to named entities.
+ **Sentiment Analysis**: POS tags can be used as features for sentiment analysis tasks as different parts of speech may convey different sentiments.
+ **Machine Translation**: POS tags help in disambiguating words during the translation process by providing information about the word's role in the sentence.

POS tagging is a foundational task in NLP that plays a vital role in understanding the syntactic structure of text and facilitating various higher-level language processing tasks.