# spaCy Linguistic Annotations

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/01_02_linguistic_annotations.html

## Doc Container

### Concept
In the spaCy library, the **Doc** container represents a processed document in natural language processing tasks. It is a fundamental data structure provided by spaCy to hold the linguistic annotations and properties of a text.

A **Doc** object in spaCy consists of a sequence of **Token** objects, where each token represents a word, a punctuation mark, or another unit of text. The **Doc** container stores linguistic annotations for each token, including part-of-speech tags, named entity labels, syntactic dependencies, lemmas, and more. These annotations are computed as the text passes through the spaCy processing pipeline.

The **Doc** container provides efficient access to the linguistic information of the text and supports various operations and methods for working with the tokens and their annotations. You can access individual tokens, iterate over them, retrieve their properties, and perform operations such as merging, splitting, or extracting sub-spans of the document.

The **Doc** container is an essential component in spaCy's processing pipeline, allowing us to perform advanced NLP tasks, such as entity recognition, dependency parsing, or text classification, by leveraging the annotated information stored within the **Doc** object.
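
As a minimal sketch of the access patterns described above, the snippet below indexes a **Doc** to get a **Token** and slices it to get a **Span**. It assumes a tokenizer-only pipeline created with `spacy.blank("en")` (so no trained model is required), and the sample sentence and variable names are chosen only for illustration.

```python
import spacy

# Tokenizer-only English pipeline (assumption: no trained model is needed here).
# Tags, lemmas, and entities stay empty without a trained pipeline.
nlp = spacy.blank("en")
doc = nlp("The United States of America (U.S.A. or USA)")

first_token = doc[0]   # indexing a Doc returns a Token
country = doc[1:5]     # slicing a Doc returns a Span (a view into the Doc)

print(type(first_token).__name__, repr(first_token.text))
print(type(country).__name__, repr(country.text))

# Each token knows its position in the Doc (i) and its character offset (idx).
for token in country:
    print(token.i, token.idx, token.text)
```

Because a **Span** is a view rather than a copy, its tokens keep the indices they have in the parent **Doc**.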

In [6]:
import spacy

In [7]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

In [8]:
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [9]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [10]:
doc = nlp(text)

In [11]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [12]:
print(len(text))
print(len(doc))

3521
654


In [13]:
# Iterate over the raw string with a simple for loop: yields single characters
for token in text[:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [14]:
# Iterate over the Doc with a simple for loop: yields tokens
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [15]:
# Why use spaCy: plain whitespace splitting leaves punctuation attached to words
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [16]:
# Side-by-side comparison of spaCy tokens and whitespace-split words
words = text.split()[:10]
i = 5
for token in doc[i:8]:
    print(f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")
    i += 1

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),




## Sentence Boundary Detection (SBD)

### Concept

**SBD**, also known as sentence segmentation or sentence tokenization, is the task of identifying the boundaries between sentences in a given text. It is a fundamental step in NLP and text analysis.

The goal of **SBD** is to correctly identify the end of one sentence and the beginning of the next, even in the absence of explicit punctuation marks such as periods. This task can be challenging because sentence boundaries can be ambiguous, especially in texts with complex sentence structures, abbreviations, or other linguistic variations.

Various techniques and algorithms are used for **SBD**, ranging from simple rule-based approaches to more sophisticated machine learning methods. Rule-based approaches rely on patterns and heuristics to determine sentence boundaries based on punctuation marks, capitalization, and other language-specific cues. Machine learning approaches involve training models on annotated data to predict sentence boundaries based on contextual and structural features of the text.

Accurate sentence boundary detection is crucial for many NLP tasks, such as machine translation, information extraction, text summarization, and sentiment analysis. It helps to properly segment the text into meaningful units, enabling downstream analysis and processing at the sentence level.
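
To make the rule-based idea concrete, here is a rough sketch in plain Python (it is not how spaCy segments sentences). It splits after `.`, `!`, or `?` when a capitalized word follows, then rejoins pieces that end in a known abbreviation. The function names, the sample sentence, and the tiny `ABBREVIATIONS` set are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"U.S.", "U.S.A.", "D.C.", "Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}


def naive_split(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def rule_based_split(text):
    """Like naive_split, but rejoin pieces that end in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        if buffer.split()[-1] in ABBREVIATIONS:
            continue  # probably not a real boundary; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


sample = "Dr. Smith moved to Washington, D.C. last year. He likes it there."
print(naive_split(sample))
print(rule_based_split(sample))
```

The heuristic still fails when a sentence really does end in an abbreviation (for example, "I live in the U.S. It is big."), which is the kind of ambiguity that motivates statistical approaches.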

In [17]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [18]:
# doc.sents is a generator, so it cannot be indexed directly
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

In [19]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
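
Converting to a list works, but it materializes every sentence even when only the first is needed. The sketch below shows two lazy alternatives on a plain Python generator; `fake_sents` is a hypothetical stand-in for `doc.sents`, used so the snippet runs without a loaded pipeline.

```python
from itertools import islice


def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "It consists of 50 states."
    yield "The national capital is Washington, D.C."
    yield "The most populous city is New York."


# next() pulls only the first item, without building a full list.
first = next(fake_sents())

# islice() takes the first n items lazily.
first_two = list(islice(fake_sents(), 2))

print(first)
print(first_two)
```

With a real **Doc**, the same pattern should apply as `next(doc.sents)`, since the error above shows that `doc.sents` is an ordinary generator.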


## Token Attributes

### Concept

Token attributes refer to the specific characteristics or properties associated with individual tokens in NLP tasks. These attributes provide valuable information about each token, enabling various analyses and operations on text data.

In spaCy, these attributes are exposed directly on each **Token** object. Attributes whose names end in an underscore, such as **lemma_**, return human-readable strings, while the variants without the underscore return integer IDs. Some commonly used token attributes in spaCy:
1. **text**: The actual text of the token.
2. **lemma_**: The base or canonical form of the token.
3. **pos_**: The coarse-grained part-of-speech tag of the token.
4. **tag_**: The fine-grained (detailed) part-of-speech tag of the token.
5. **dep_**: The syntactic dependency relation of the token.
6. **is_stop**: A boolean value indicating whether the token is a stop word.
7. **is_alpha**: A boolean value indicating whether the token consists of alphabetic characters.
8. **is_digit**: A boolean value indicating whether the token consists of digits.
9. **is_punct**: A boolean value indicating whether the token is a punctuation symbol.
10. **is_space**: A boolean value indicating whether the token consists of whitespace.
11. **is_upper**: A boolean value indicating whether the token is in uppercase.
12. **is_lower**: A boolean value indicating whether the token is in lowercase.
13. **is_title**: A boolean value indicating whether the token is in titlecase.
14. **ent_type_**: The named entity type of the token, if it is part of a named entity.
15. **is_sent_start**: A boolean value indicating whether the token is the start of a sentence.
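
A short sketch of reading some of the attributes listed above. It assumes a tokenizer-only pipeline from `spacy.blank("en")`, which is enough for the lexical flags (`is_alpha`, `is_stop`, and so on); model-dependent attributes such as **pos_** and **ent_type_** stay empty unless a trained pipeline like `en_core_web_sm` is loaded instead. The sample sentence and the `true_flags` name are illustrative only.

```python
import spacy

# Tokenizer-only pipeline (assumption: no trained model is installed).
nlp = spacy.blank("en")
doc = nlp("Apple bought 3 startups in the USA!")

flags = ("is_alpha", "is_digit", "is_punct", "is_stop", "is_title", "is_upper")

# Map each token's text to the subset of flags that are True for it.
true_flags = {
    token.text: [name for name in flags if getattr(token, name)]
    for token in doc
}

for text, names in true_flags.items():
    print(f"{text:10} {names}")

# Model-dependent annotations are empty strings without a trained pipeline.
print(repr(doc[0].pos_), repr(doc[0].ent_type_))
```

Swapping `spacy.blank("en")` for `spacy.load("en_core_web_sm")` would fill in the part-of-speech, dependency, lemma, and entity attributes as well.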