# Part of Speech tagging

Import SpaCy and load its English language model. The `load` function returns a `Language` object (named `nlp` here), which runs the whole processing pipeline on a text.

Note: You may first need to install SpaCy and the English model before you can process English text. To do that, run the following commands at the Anaconda prompt:

> conda install spacy

> python -m spacy download en_core_web_sm

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

To process a document, call the `nlp` object on the string. The call runs the pipeline and returns a `Doc` object.

In [2]:
sen = nlp(u"A black cat sat on a mat and ate a fat rat.")

We can obtain the original text of the document by using `sen` in a string context, e.g., by printing it or evaluating it in a notebook cell:

In [3]:
sen

A black cat sat on a mat and ate a fat rat.

Indexing or iterating over the `Doc` yields `Token` objects:

In [7]:
type(sen[0])

spacy.tokens.token.Token

We can access the output of all the NLP steps applied to this document. Each token of the document is stored inside the `Doc` object, and the linguistic information about each token is available as attributes on the token: `text` (the original word form of the token), `lemma_` (its base form), `pos_` (a simplified part-of-speech tag), and `tag_` (a tag from an extended set of part-of-speech tags, which includes, e.g., separate tags for past-tense verbs and for singular and plural nouns). SpaCy also provides `spacy.explain`, which returns a short description of what each tag means (the last column of the output).

In [4]:
for word in sen:
    print(f'{word.text:{10}} {word.lemma_:{10}} {word.pos_:{8}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

A          a          DET      DT       determiner
black      black      ADJ      JJ       adjective
cat        cat        NOUN     NN       noun, singular or mass
sat        sit        VERB     VBD      verb, past tense
on         on         ADP      IN       conjunction, subordinating or preposition
a          a          DET      DT       determiner
mat        mat        NOUN     NN       noun, singular or mass
and        and        CCONJ    CC       conjunction, coordinating
ate        eat        VERB     VBD      verb, past tense
a          a          DET      DT       determiner
fat        fat        ADJ      JJ       adjective
rat        rat        NOUN     NN       noun, singular or mass
.          .          PUNCT    .        punctuation mark, sentence closer


A complete list of the PoS tags used by SpaCy is available [here](https://spacy.io/api/annotation#pos-tagging).

Note that SpaCy can guess the PoS of previously unseen words, e.g., names of people.
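
The coarse `pos_` tags are also a convenient way to separate content words from function words. The sketch below is a minimal illustration rather than part of the original notebook: it hard-codes the `(text, lemma, pos)` triples from the tagger output above so that it runs without a model, `content_lemmas` is a hypothetical helper name, and treating nouns, verbs, and adjectives as the content words is an assumption. In a live session, the same triples could be built with `[(w.text, w.lemma_, w.pos_) for w in sen]`.

```python
from collections import Counter

# (text, lemma, coarse POS) triples copied from the tagger output above.
TAGGED = [
    ("A", "a", "DET"), ("black", "black", "ADJ"), ("cat", "cat", "NOUN"),
    ("sat", "sit", "VERB"), ("on", "on", "ADP"), ("a", "a", "DET"),
    ("mat", "mat", "NOUN"), ("and", "and", "CCONJ"), ("ate", "eat", "VERB"),
    ("a", "a", "DET"), ("fat", "fat", "ADJ"), ("rat", "rat", "NOUN"),
    (".", ".", "PUNCT"),
]

# Assumption: nouns, verbs and adjectives count as content words here.
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def content_lemmas(tagged, keep=CONTENT_POS):
    """Return the lemmas of tokens whose coarse POS tag is in `keep`."""
    return [lemma for _text, lemma, pos in tagged if pos in keep]


# How often each coarse tag occurs in the sentence.
pos_counts = Counter(pos for _text, _lemma, pos in TAGGED)
print(pos_counts)
print(content_lemmas(TAGGED))
```

Working with lemmas rather than surface forms means that "sat" and "ate" are reported as "sit" and "eat", which is usually what is wanted when counting vocabulary.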

# Dependency parsing

The parse of a sentence includes syntactic roles and dependency links between words. Syntactic roles can be subject, (direct) object, prepositional object, adjective modifier, etc. Dependency links are grammatical relations between words, e.g. subject-verb ("the cat sat"), verb-direct object ("ate a rat"), modifier-noun ("black cat").

The syntactic roles are available in SpaCy in the `dep_` attribute of the token; the dependency links are in the `head` attribute, which points to the token that this token depends on. The root of the sentence is its own head.

Below is an example of a parsed sentence.

In [19]:
for word in sen:
    print(f'{word.i:<3} {word.text:{12}} {word.lemma_:{12}} {word.pos_:{10}} {word.dep_:{8}} {word.head.i:{8}}')

0   A            a            DET        det             2
1   black        black        ADJ        amod            2
2   cat          cat          NOUN       nsubj           3
3   sat          sit          VERB       ROOT            3
4   on           on           ADP        prep            3
5   a            a            DET        det             6
6   mat          mat          NOUN       pobj            4
7   and          and          CCONJ      cc              3
8   ate          eat          VERB       conj            3
9   a            a            DET        det            11
10  fat          fat          ADJ        amod           11
11  rat          rat          NOUN       dobj            8
12  .            .            PUNCT      punct           3

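Because every token stores the index of its head, the output above fully determines the parse tree. As a minimal sketch (not part of the original notebook), the code below hard-codes the `(i, text, dep, head)` rows from that output, finds the root, and collects the dependents of each token; `build_children` and `subtree_text` are hypothetical helper names. With a loaded model, the rows could instead be built as `[(w.i, w.text, w.dep_, w.head.i) for w in sen]`.

```python
from collections import defaultdict

# (index, text, dependency label, head index) rows copied from the output above.
# Assumption: row k describes token k, so a row can be looked up by its index.
ROWS = [
    (0, "A", "det", 2), (1, "black", "amod", 2), (2, "cat", "nsubj", 3),
    (3, "sat", "ROOT", 3), (4, "on", "prep", 3), (5, "a", "det", 6),
    (6, "mat", "pobj", 4), (7, "and", "cc", 3), (8, "ate", "conj", 3),
    (9, "a", "det", 11), (10, "fat", "amod", 11), (11, "rat", "dobj", 8),
    (12, ".", "punct", 3),
]


def build_children(rows):
    """Map each head index to the indices of its direct dependents."""
    children = defaultdict(list)
    for i, _text, _dep, head in rows:
        if head != i:  # the root is its own head, so skip the self-link
            children[head].append(i)
    return children


def subtree_text(index, rows, children):
    """Return the words dominated by token `index`, in sentence order."""
    indices, stack = [], [index]
    while stack:
        node = stack.pop()
        indices.append(node)
        stack.extend(children.get(node, []))
    return " ".join(rows[i][1] for i in sorted(indices))


children = build_children(ROWS)
# The root is the only token whose head index equals its own index.
root = next(i for i, _text, _dep, head in ROWS if head == i)
print(ROWS[root][1], children[root])
print(subtree_text(8, ROWS, children))
```

Collecting a subtree this way recovers whole phrases from the links alone: the subtree of "ate" is the second verb phrase of the sentence.
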

SpaCy also has a tool to create a visualization of the parse of the sentence:

In [11]:
from spacy import displacy

displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

# Multiword phrases

Another useful output of SpaCy is noun chunks: phrases in which the head word is a noun and the other words are syntactically dependent on it. For example:

In [37]:
for chunk in sen.noun_chunks:
    print(chunk.text)

A black cat
a mat
a fat rat


Noun chunks that are frequent in a document are likely to refer to important topical concepts discussed in the document.
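
A minimal sketch of that idea is to normalise the chunk texts and count them. The code below is an illustration rather than part of the original notebook: `normalise` and `top_chunks` are hypothetical helpers, stripping a leading article is an assumed normalisation step, and the input list is made-up sample data standing in for `[chunk.text for chunk in doc.noun_chunks]` on a longer document.

```python
from collections import Counter

# Assumption: leading articles carry no topical information and are dropped.
ARTICLES = {"a", "an", "the"}


def normalise(chunk_text):
    """Lowercase a chunk and strip one leading article, if present."""
    words = chunk_text.lower().split()
    if words and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


def top_chunks(chunk_texts, n=3):
    """Return the `n` most frequent normalised chunks with their counts."""
    counts = Counter(normalise(text) for text in chunk_texts)
    counts.pop("", None)  # discard chunks that consisted only of an article
    return counts.most_common(n)


# Hypothetical chunk texts (made-up sample data, not model output).
sample = ["A black cat", "a mat", "the black cat", "a fat rat", "The mat", "black cat"]
print(top_chunks(sample, n=2))
```

Without the normalisation step, "A black cat" and "the black cat" would be counted as different phrases, which understates how often the concept is mentioned.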

# Named Entities

Named entities are proper nouns, each provided with a label describing the semantic class of the noun, such as Person, Organization, or Location. General-purpose models also label some other expressions, such as dates. A specific business application may use its own set of named entity labels, e.g., in a legal document, named entities may be Seller, Customer, etc.

Below is an example of generic named entities found by SpaCy in a Wikipedia article about Aston University:

In [22]:
sen2 = nlp(u"""Aston University is a public research university situated in the city centre of Birmingham, England. 
Aston began as the Birmingham Municipal Technical School in 1895, evolving into the UK's first College of Advanced 
Technology in 1956. Aston University received its royal charter from Queen Elizabeth II on 22 April 1966.""")

print(sen2.ents)

(Aston University, Birmingham, England, Aston, the Birmingham Municipal Technical School, 1895, UK, first College of Advanced Technology, 1956, Aston University, Queen Elizabeth II, 22 April 1966)


In [24]:
for entity in sen2.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Aston University - ORG - Companies, agencies, institutions, etc.
Birmingham - GPE - Countries, cities, states
England - GPE - Countries, cities, states
Aston - ORG - Companies, agencies, institutions, etc.
the Birmingham Municipal Technical School - ORG - Companies, agencies, institutions, etc.
1895 - DATE - Absolute or relative dates or periods
UK - GPE - Countries, cities, states
first College of Advanced Technology - ORG - Companies, agencies, institutions, etc.
1956 - DATE - Absolute or relative dates or periods
Aston University - ORG - Companies, agencies, institutions, etc.
Queen Elizabeth II - PERSON - People, including fictional
22 April 1966 - DATE - Absolute or relative dates or periods

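Downstream code often needs the entities grouped by label, for example all organizations mentioned in a text. As a minimal sketch (not part of the original notebook), the code below hard-codes the `(text, label)` pairs from the output above and groups them; `group_by_label` is a hypothetical helper name. With a loaded model, the pairs could be built as `[(e.text, e.label_) for e in sen2.ents]`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
ENTITIES = [
    ("Aston University", "ORG"), ("Birmingham", "GPE"), ("England", "GPE"),
    ("Aston", "ORG"), ("the Birmingham Municipal Technical School", "ORG"),
    ("1895", "DATE"), ("UK", "GPE"),
    ("first College of Advanced Technology", "ORG"), ("1956", "DATE"),
    ("Aston University", "ORG"), ("Queen Elizabeth II", "PERSON"),
    ("22 April 1966", "DATE"),
]


def group_by_label(pairs):
    """Group entity texts by label, dropping repeats but keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(ENTITIES)
for label, texts in by_label.items():
    print(f"{label}: {', '.join(texts)}")
```

Note that "Aston University" and "Aston" stay separate entries here: deciding that two mentions refer to the same entity is a separate task that this grouping does not attempt.
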

There is also a tool to visualize entities and their labels in the input text:

In [25]:
displacy.render(sen2, style='ent', jupyter=True)

# Citing this notebook

If you use this notebook in your work, please cite it as follows:
    
Pekar, V. (2022). Big Data for Decision Making. Lecture examples and exercises. (Version 1.0.0). URL: https://github.com/vpekar/bd4dm
