<a href="https://colab.research.google.com/github/soujanya-vattikolla/NLP-with-spaCy/blob/main/BasicsofSpacyanditsLinguistic_Annotations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install spacy
! pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy

In [None]:
# Confirm that the model downloaded successfully by loading it.

nlp = spacy.load("en_core_web_sm")

In [None]:
# read the text file (explicit encoding avoids platform-dependent defaults)
with open("wiki.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [None]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

### Creating a Doc Container

In [None]:
doc = nlp(text)

In [None]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

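### Unlike the plain text string, the Doc container carries linguistic metadata (attributes) for every item it holds. Comparing lengths shows the difference: the string counts characters, while the Doc counts tokens.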

In [None]:
print(len(text))
print(len(doc))

3525
652


### Let’s iterate over each object and print its first ten items.

In [None]:
for tokentext in text[0:10]:
    print(tokentext)

T
h
e
 
U
n
i
t
e
d


### Iterating over the string prints each character, including whitespace. Let’s do the same with the Doc container.

In [None]:
for tokendoc in doc[:10]:
    print(tokendoc)

The
United
States
of
America
(
U.S.A.
or
USA
)


### The opening and closing parentheses are each a separate item in the container. These items are known as tokens. Tokens are a fundamental building block of spaCy and of most NLP frameworks. For comparison, here is what a plain whitespace split of the string gives.

In [None]:
for tokentext in text.split()[:10]:
    print(tokentext)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


### With a whitespace split, the parentheses and commas stay attached to the neighbouring words instead of becoming separate tokens.
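The contrast between a whitespace split and spaCy's tokenizer can be reproduced without the trained model. The following is a minimal sketch, assuming spaCy v3 is installed; `spacy.blank("en")` builds a pipeline that contains only the English tokenizer, and the names `nlp_blank` and `sample` are illustrative, not part of the notebook.

```python
import spacy

# A blank English pipeline holds only the rule-based tokenizer (no trained model needed).
nlp_blank = spacy.blank("en")

# Opening words of the text used above.
sample = "The United States of America (U.S.A. or USA), commonly known"

# Plain whitespace split: punctuation stays glued to the words.
whitespace_items = sample.split()

# spaCy tokenizer: punctuation such as "(" becomes its own token.
spacy_tokens = [token.text for token in nlp_blank(sample)]

print(whitespace_items)
print(spacy_tokens)
```

The tokenizer produces more items than the split because it separates punctuation from the words it touches.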

## Sentence Boundary Detection (SBD)

### To access the sentences in the Doc container, we can use the sents attribute:

In [None]:
for sentdoc in doc.sents:
    print(sentdoc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es
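The sentences above come from the trained pipeline. As a point of comparison, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below assumes spaCy v3 and uses a short made-up input rather than the notebook's text; it is not the method used in the cells above.

```python
import spacy

# Blank pipeline plus the built-in rule-based sentence splitter.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(
    "It consists of 50 states. It is the third most populous country in the world."
)

# Each sentence is a Span; .text gives its string form.
sentences = [sent.text for sent in doc_rules.sents]
for sentence in sentences:
    print(sentence)
```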

In [None]:
sentencedoc = doc.sents[0]
print(sentencedoc)

TypeError: 'generator' object is not subscriptable

#### The sents attribute is a generator, so it can be iterated over but not indexed. To access a sentence by position, convert it to a list first.

In [None]:
sentencedoc = list(doc.sents)[0]
print(sentencedoc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
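Converting to a list materialises every sentence. When only the first few items are needed, a generator can also be consumed lazily. The sketch below uses a stand-in generator, `sentence_stream` (a hypothetical helper, not a spaCy function), so that it runs on its own; with a real Doc the same pattern would be `next(iter(doc.sents))`.

```python
from itertools import islice


def sentence_stream():
    """Stand-in generator that yields items lazily, the way doc.sents does."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


# next() pulls only the first item and leaves the rest unconsumed.
first = next(sentence_stream())

# islice() takes a bounded prefix without materialising everything.
first_two = list(islice(sentence_stream(), 2))

# list() materialises all items, after which indexing works.
all_items = list(sentence_stream())
third = all_items[2]

print(first)
print(first_two)
print(third)
```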


### Token Attributes

In [None]:
tokensent = sentencedoc[2]
print(tokensent)

States


### Text

In [None]:
tokensent.text

'States'

### Head

In [None]:
tokensent.head

is

The head is the token that governs this one in the dependency parse. Here it is the main verb, “is”, because “States” is the nominal subject of that verb.
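Following `head` repeatedly leads from any token up to the root of the sentence. The sketch below assumes spaCy v3 and builds a small Doc by hand so that it runs without the trained model. The dependency labels are taken from the parse printed later in this notebook, while the head indices are written by hand to match it. `path_to_root` is a hypothetical helper.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Absolute index of each token's head; the root ("is") points to itself.
heads = [2, 2, 5, 2, 3, 5, 7, 5]
deps = ["det", "compound", "nsubj", "prep", "pobj", "ROOT", "det", "attr"]
doc_manual = Doc(Vocab(), words=words, heads=heads, deps=deps)


def path_to_root(token):
    """Collect the token and each successive head until the root is reached."""
    path = [token.text]
    while token.head.i != token.i:  # in spaCy the root is its own head
        token = token.head
        path.append(token.text)
    return path


head_path = path_to_root(doc_manual[4])
print(head_path)
```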

### Left Edge

In [None]:
tokensent.left_edge

The

The left edge is the leftmost token of this token's syntactic subtree. For “States”, the subtree is the phrase “The United States of America”, so the left edge is “The”.

### Right Edge

In [None]:
tokensent.right_edge

America

The right edge is the rightmost token of this token's syntactic subtree, here “America”.
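The two edges together delimit the whole phrase headed by a token. The sketch below assumes spaCy v3 and uses a hand-annotated Doc (head indices written by hand, labels matching the parse printed in this notebook), so the values come from that manual annotation rather than from the trained model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Absolute index of each token's head; the root ("is") points to itself.
heads = [2, 2, 5, 2, 3, 5, 7, 5]
deps = ["det", "compound", "nsubj", "prep", "pobj", "ROOT", "det", "attr"]
doc_edges = Doc(Vocab(), words=words, heads=heads, deps=deps)

token = doc_edges[2]  # "States"

# Slice the Doc from the left edge to the right edge (inclusive).
phrase = doc_edges[token.left_edge.i : token.right_edge.i + 1]
phrase_words = [t.text for t in phrase]

# token.subtree yields the same tokens in document order.
subtree_words = [t.text for t in token.subtree]

print(phrase_words)
```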

### Entity Type

In [None]:
tokensent.ent_type

384

This returns an integer ID that corresponds to the entity type.

In [None]:
tokensent.ent_type_

'GPE'

The attribute with a trailing underscore gives the string equivalent. GPE stands for geopolitical entity, which is the correct label here.
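The integer and the string are two views of the same entry in spaCy's string store. As a small sketch, assuming spaCy v3, a standalone `StringStore` shows the round trip. The exact integer depends on the spaCy version, so it is printed rather than stated here.

```python
from spacy.strings import StringStore

store = StringStore()

# add() registers the string and returns its integer ID.
label_id = store.add("GPE")

# Looking up the ID returns the string; looking up the string returns the ID.
label_text = store[label_id]

print(label_id, label_text)
```

Within a loaded pipeline the same lookup is available through `nlp.vocab.strings`.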

### Ent IOB

In [None]:
tokensent.ent_iob_

'I'

The IOB code of the named entity tag: “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.
IOB is a scheme for annotating entity spans token by token. Here we see “I” because “States” is inside an entity: it is part of “The United States of America”.
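To make the scheme concrete, the sketch below rebuilds entity spans from per-token IOB codes in plain Python. The tag lists are written by hand to match the first ten tokens and the entities reported in this notebook, and `iob_to_entities` is a hypothetical helper, not a spaCy function.

```python
def iob_to_entities(tokens, iob_tags, labels):
    """Group tokens into (text, label) entities from per-token IOB codes."""
    entities = []
    current_tokens, current_label = [], None
    for token, iob, label in zip(tokens, iob_tags, labels):
        if iob == "B":
            # A new entity begins: close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I"): close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tokens = ["The", "United", "States", "of", "America", "(", "U.S.A.", "or", "USA", ")"]
iob_tags = ["B", "I", "I", "I", "I", "O", "B", "O", "B", "O"]
labels = ["GPE", "GPE", "GPE", "GPE", "GPE", "", "GPE", "", "GPE", ""]

entities = iob_to_entities(tokens, iob_tags, labels)
print(entities)
```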

### Lemma

In [None]:
tokensent.lemma_

'States'

The base form of the token. The proper noun “States” is left unchanged.

### Morph

In [None]:
sentencedoc[2].morph

NounType=Prop|Number=Sing

The morphological analysis of the token: here, a proper noun in the singular.
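Individual features can be read from the morphological analysis instead of parsing the printed string. The sketch below assumes spaCy v3 and sets the feature string by hand on a one-token Doc, so it runs without the trained model. With the notebook's objects the equivalent call would be `sentencedoc[2].morph.get("Number")`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# One-token Doc with the morphology shown above, assigned manually.
doc_morph = Doc(Vocab(), words=["States"], morphs=["NounType=Prop|Number=Sing"])
morph = doc_morph[0].morph

# get() returns the list of values for one feature.
number = morph.get("Number")

# to_dict() returns all features as a feature-to-value mapping.
features = morph.to_dict()

print(number)
print(features)
```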

### Part of Speech

In [None]:
tokensent.pos_

'PROPN'

The coarse-grained part-of-speech tag from the Universal POS tag set. PROPN means proper noun.

### Syntactic Dependency

In [None]:
tokensent.dep_

'nsubj'

The syntactic dependency relation between the token and its head. nsubj means nominal subject.
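The abbreviations used for tags, dependency labels, and entity types can be looked up with `spacy.explain`, which returns a short description from spaCy's glossary (or `None` for an unknown label). A minimal sketch, assuming spaCy is installed:

```python
import spacy

# Look up a POS tag, a dependency label, and an entity label seen above.
explanations = {label: spacy.explain(label) for label in ["PROPN", "nsubj", "GPE"]}

for label, description in explanations.items():
    print(label, "->", description)
```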

### Language

In [None]:
tokensent.lang_

'en'

Language of the parent document’s vocabulary.

### Part of Speech Tagging (POS)

In [None]:
for postoken in sentencedoc:
    print(postoken.text, postoken.pos_, postoken.dep_)

The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
primarily ADV advmod
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct


In [None]:
from spacy import displacy

In [None]:
displacy.render(sentencedoc, style="dep")

In [None]:
text = "Mike enjoys playing football"
documnt = nlp(text)
print(documnt)

Mike enjoys playing football


In [None]:
for testtoken in documnt:
    print(testtoken.text, testtoken.pos_, testtoken.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


In [None]:
displacy.render(documnt, style="dep")

### Named Entity Recognition

In [None]:
for entity in sentencedoc.ents:
    print(entity.text, entity.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC


This lists each named entity span found in the sentence together with its label.

In [None]:
for entity in documnt.ents:
    print(entity.text, entity.label_)

Mike PERSON
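Each entity is a span of the Doc, so besides its text and label it also exposes character offsets into the original string. The sketch below assumes spaCy v3 and assigns the `PERSON` entity by hand on a manually built Doc, so it runs without the trained model. With the notebook's objects the same attributes are available on each item of `documnt.ents`.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

doc_ner = Doc(Vocab(), words=["Mike", "enjoys", "playing", "football"])

# Manually mark token 0 as a PERSON entity (token indices, end exclusive).
doc_ner.ents = [Span(doc_ner, 0, 1, label="PERSON")]

# Collect text, label, and character offsets for every entity.
rows = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc_ner.ents]
print(rows)
```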


In [None]:
displacy.render(sentencedoc, style='ent')

In [None]:
displacy.render(documnt,style='ent')

Setting style to 'ent' tells displaCy to render the text with its named entities highlighted, instead of drawing the dependency tree.
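Outside a notebook, the same rendering can be captured as an HTML string. The sketch below assumes spaCy v3 and uses a hand-annotated Doc in place of the model output. Passing `jupyter=False` asks `displacy.render` to return the markup rather than display it.

```python
from spacy import displacy
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

doc_viz = Doc(Vocab(), words=["Mike", "enjoys", "playing", "football"])

# Entity assigned by hand for illustration.
doc_viz.ents = [Span(doc_viz, 0, 1, label="PERSON")]

# jupyter=False returns the markup as a string; page=True wraps it in a full HTML page.
html = displacy.render(doc_viz, style="ent", jupyter=False, page=True)

print(html[:80])
```

The returned string can then be written to a file and opened in a browser.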