# Part-of-Speech Tagging

**Part of speech** (POS, syntactic category, word class): a category of words that have similar grammatical properties and display similar syntactic behavior (they play similar roles within the grammatical structure of sentences) and sometimes similar morphology.

**Less official description**

 Words that somehow ‘behave’ alike:
 
- Appear in similar contexts 
- Perform similar functions in sentences 
- Undergo similar transformations 

Wikipedia:
- **Noun** (names): a word or lexical item denoting any abstract (abstract noun: e.g. home) or concrete entity (concrete noun: e.g. house); a person (police officer, Michael), place (coastline, London), thing (necktie, television), idea (happiness), or quality (bravery). Nouns can also be classified as count nouns or non-count nouns; some can belong to either category. The most common part of speech; they are called naming words.

**Important:** **proper nouns** (names of specific persons or entities; in English usually not preceded by articles, e.g. Anne, IBM, London) vs. common nouns.
- **Pronoun** (replaces or places again): a substitute for a noun or noun phrase (them, he). Pronouns make sentences shorter and clearer since they replace nouns.
- **Adjective** (describes, limits): a modifier of a noun or pronoun (big, brave). Adjectives make the meaning of another word (noun) more precise.
- **Verb** (states action or being): a word denoting an action (walk), occurrence (happen), or state of being (be). Without a verb a group of words cannot be a clause or sentence.
- **Adverb** (describes, limits): a modifier of an adjective, verb, or another adverb (very, quite). Adverbs make language more precise.
- **Preposition** (relates): a word that relates words to each other in a phrase or sentence and aids in syntactic context (in, of). Prepositions show the relationship between a noun or a pronoun with another word in the sentence.
- **Conjunction** (connects): a syntactic connector; links words, phrases, or clauses (and, but). Conjunctions connect words or groups of words.
- **Interjection** (expresses feelings and emotions): an emotional greeting or exclamation (Huzzah, Alas). Interjections express strong feelings and emotions.
- **Article** (describes, limits): a grammatical marker of definiteness (the) or indefiniteness (a, an). The article is not always listed among the parts of speech. It is considered by some grammarians to be a type of adjective or sometimes the term 'determiner' (a broader class) is used.

Why is it **useful (in NLP)** to know what POS a word belongs to? 

- **parsing** (a word's POS constrains which words can appear next to it: e.g. in English, nouns are often preceded by determiners such as articles)
- **named entity recognition** in information extraction (e.g. find people or organisation names)
- **coreference resolution**
- **speech recognition/synthesis** (e.g. CONtent or conTENT: pronunciation depends on the part of speech!)

### Two main classes [3]

- **Closed class**

    - Prepositions: of, in, by, ...
    - Auxiliaries: may, can, will, had, been, ...
    - Pronouns: I, you, she, mine, his, them, ... 
    - Usually function words (short common words which play a role in grammar)

- **Open class (English has 4 of them)**
    - Nouns
    - Verbs
    - Adjectives
    - Adverbs



In NLP, parts of speech are usually defined not by the semantics of the class (e.g. nouns denote people and things and verbs denote actions, at least normally) but rather by their morphological and syntactic properties.

- **Morphology**: the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes.
- **Syntax**: the set of rules, principles, and processes that govern the structure of sentences (sentence structure) in a given language

# POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part of speech tag to each word.


## NLTK POS tagger

In [None]:
!pip install nltk
import nltk
from nltk.tokenize import word_tokenize
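nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS tagger model
nltk.download('tagsets')                     # tag documentation used by nltk.help
# Assumption: newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if a LookupError says so.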

In [None]:
text = word_tokenize("Göran Hansson, secretary general of the Royal Swedish Academy of Sciences, tells Science that some of his colleagues have been hounded for expressing doubt about the country’s lax pandemic policies")
nltk.pos_tag(text)

How can we find out what the tags mean?
The tags come from the Penn Treebank tagset (https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf),
which can be queried like this:

In [None]:
nltk.help.upenn_tagset('NNPS')

**OBSERVE AND REFLECT** Find out what some of the other words' tags mean by replacing the tag in the previous code line. Have you come across any questionable cases? Why, in your opinion, did they get these particular tags?

**SAMPLE ANSWER**: 

"General" is tagged as NN (in "secretary general"), but "general" may be a noun or an adjective. Here it is rather an adjective. An example where it is unambiguously a noun would be "general of the army".


"Pandemic" is tagged here as an adjective (JJ). A dictionary defines this word as a) an adjective, meaning "(of a disease) prevalent over a whole country or the world", and b) a noun, meaning "an outbreak of a pandemic disease" (as in "the results may have been skewed by an influenza pandemic"). In which sense is the word used in our sentence? It is not an adjective, because it describes policies rather than a disease; "pandemic policies" is therefore a noun phrase consisting of two nouns, where one noun modifies the other.

In other words, in "lax pandemic policies" the word "pandemic" is a noun used in the role of an adjective (it modifies another noun, "policies"). The dictionary lists the adjective reading only for diseases ("a pandemic disease"). The tagger, however, relies on patterns learned from its training data and decides that "pandemic" is an adjective here.

________

Tagging is a **disambiguation** task (aiming at removing lexical ambiguity). 


**Homographs** (words that have the same spelling but more than one meaning) present a particular challenge.

In [None]:
homograph_1 = word_tokenize("When shot at, the dove dove into the bushes.")
nltk.pos_tag(homograph_1)

What part of speech is "dove" in the first case, and what is it in the second? Both occurrences are tagged "NN", although the second one is a verb (the past tense of "dive"). Problem!

Another issue - important for speech recognition and generation - is when word meaning defines the **stressed syllable**. See below:

In [None]:
homonym = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(homonym)

"Refuse" and "permit" each appear twice: once as a verb (refuse/VBP, permit/VB) and once as a noun (NN). In the two cases these words have different stress positions. We need to know what part of speech the word is to pronounce it correctly.
Similar cases: https://www.english-at-home.com/pronunciation/noun-and-verb-syllable-stress
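
To make the link between tag and pronunciation concrete, here is a minimal sketch of how a speech synthesiser could pick a stress pattern once the tags are known. The tagged tokens are written out by hand so the cell runs without NLTK, and the `STRESS` lexicon, `coarse_pos` and `stress_marks` are hypothetical names invented for this illustration, not part of any library.

```python
# Tagged tokens as (word, Penn Treebank tag) pairs, written out by hand
# so the example runs without NLTK.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Hypothetical mini pronunciation lexicon: (word, coarse POS) -> stress pattern.
STRESS = {
    ("refuse", "VERB"): "re-FUSE",
    ("refuse", "NOUN"): "REF-use",
    ("permit", "VERB"): "per-MIT",
    ("permit", "NOUN"): "PER-mit",
}

def coarse_pos(tag):
    """Collapse a Penn Treebank tag into NOUN, VERB or OTHER."""
    if tag.startswith("VB"):
        return "VERB"
    if tag.startswith("NN"):
        return "NOUN"
    return "OTHER"

def stress_marks(tagged_tokens):
    """Return the stress pattern for every token found in the lexicon."""
    return [
        STRESS[(word.lower(), coarse_pos(tag))]
        for word, tag in tagged_tokens
        if (word.lower(), coarse_pos(tag)) in STRESS
    ]

print(stress_marks(tagged))
```

The same spelling maps to two different entries, and only the tag tells them apart.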




---


**CODE IT** Using the NLTK library, tag the parts of speech in the following phrases and identify how the word **back** has been tagged in each one:


phrase_1 = The **back** door (Modifier/ADJ/JJ)

phrase_2 = On my **back** (NN)

phrase_3 = Win the voters **back**  (RB)
 
phrase_4 = Promise to **back** the bill (V)

In [1]:
#insert your code here

## Conventions

Convention: in NLTK **a tagged token is represented by a tuple (token, tag)**:

Useful function str2tuple(): 


In [None]:
tagged_token = nltk.tag.str2tuple('Göran/NNP')
print(tagged_token)     # the (token, tag) tuple
print(tagged_token[0])  # the token
print(tagged_token[1])  # the tag

Some text corpora are POS-tagged, and we may need to extract a list of tagged tokens from them:
1) tokenize the string to access the individual word/tag strings
2) convert each of these into a tuple with str2tuple()

In [None]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
# Split on whitespace, then turn every word/tag string into a (token, tag) tuple
print([nltk.tag.str2tuple(t) for t in sent.split()])
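
To see what the two steps amount to, the following plain-Python sketch mimics the conversion without NLTK. `split_tagged` is a hypothetical helper written for this illustration; it assumes the tag follows the last `/` and upper-cases it, which is the behaviour the NLTK function appears to have, so treat it as an approximation and not as the library's implementation.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator; the tag is upper-cased."""
    word, found, tag = token.rpartition(sep)
    if not found:  # no separator at all: keep the word, tag unknown
        return (token, None)
    return (word, tag.upper())

sample = "The/AT grand/JJ jury/NN ,/, Fulton/NP-tl ''/'' ./."
pairs = [split_tagged(t) for t in sample.split()]
print(pairs)
```

Splitting at the last separator matters for tokens such as `,/,` or `./.`, where the word itself is punctuation.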

The Brown corpus (a million words of samples from 500 written texts of different genres, published in the
United States in 1961) comes with POS tags.

In [None]:
nltk.download('brown')  # download the Brown corpus from NLTK
nltk.corpus.brown.tagged_words()[:40]  # first 40 (token, tag) pairs

## Practical applications of POS tagging

As POS tagging is a disambiguation task, it is widely used in **machine translation**.

- ENG: "I fish a fish"
- FR (correct): "Je pêche un poisson". 

To a computer the two occurrences of "fish" are the same string. To make them distinct we need to POS-tag them; only then can each one be translated correctly.
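
A toy sketch of the idea, assuming the sentence has already been tagged: the lookup key is the pair (word, tag), not the word alone. The `LEXICON` entries and the `translate_tagged` helper are hypothetical and only cover this one sentence; real translation systems do not work word by word.

```python
# Hypothetical toy lexicon keyed by (word, tag); entries are illustrative only.
LEXICON = {
    ("i", "PRP"): "je",
    ("fish", "VBP"): "pêche",    # "fish" as a verb
    ("fish", "NN"): "poisson",   # "fish" as a noun
    ("a", "DT"): "un",
}

def translate_tagged(tagged_tokens):
    """Translate word by word, using the tag to pick the right sense."""
    return " ".join(
        LEXICON.get((word.lower(), tag), word) for word, tag in tagged_tokens
    )

tagged = [("I", "PRP"), ("fish", "VBP"), ("a", "DT"), ("fish", "NN")]
print(translate_tagged(tagged))
```

Without the tag in the key, both occurrences of "fish" would have to share a single translation.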



**Google Translator API for Python:** the [``googletrans``](https://pypi.org/project/googletrans/) library is an unofficial Python client for Google Translate; see an example in the following cells.

In [None]:
!pip install googletrans

In [None]:
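from googletrans import Translator

translator = Translator()

sentence = "I fish a fish"

# Translate from English to French and keep only the translated text.
# Assumption: translate() is synchronous in the installed googletrans release;
# if it returns a coroutine, await it instead.
translated_sentence = translator.translate(sentence, src='en', dest='fr').text
print(translated_sentence)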

## A couple of words on how POS taggers are created [1]


- **Lexical Based Methods**: assign each word the POS tag that occurs with it most frequently in the training corpus.
- **Rule-Based Methods**: assign POS tags based on rules. For example, a rule might say that words ending in “ed” or “ing” are tagged as verbs. Rule-based techniques can be combined with lexical approaches to tag words that appear in the test data but not in the training corpus.
- **Probabilistic Methods**: assign POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to POS tagging.
- **Deep Learning Methods**: Recurrent Neural Networks can also be used for POS tagging.
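
The first two approaches are simple enough to sketch in a few lines. The example below is a minimal illustration under stated assumptions: the training data is a tiny hand-made list, the suffix rules are deliberately crude, and the names `train_lexical`, `rule_based_guess` and `tag` are hypothetical helpers, not functions from NLTK or any other library.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus of (word, tag) pairs; purely illustrative.
TRAIN = [
    ("the", "DT"), ("dog", "NN"), ("walks", "VBZ"),
    ("the", "DT"), ("walk", "NN"), ("was", "VBD"), ("long", "JJ"),
    ("dogs", "NNS"), ("walk", "VBP"), ("a", "DT"), ("walk", "NN"),
]

def train_lexical(tagged_words):
    """Lexical method: map each word to its most frequent training tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def rule_based_guess(word):
    """Rule-based fallback for unseen words, using suffixes only."""
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("s"):
        return "NNS"
    return "NN"  # default guess

def tag(words, lexicon):
    """Use the lexicon when the word is known, the rules otherwise."""
    return [
        (w, lexicon.get(w.lower()) or rule_based_guess(w.lower()))
        for w in words
    ]

lexicon = train_lexical(TRAIN)
print(tag(["The", "dog", "walked", "a", "walk"], lexicon))
```

Here "walk" is always tagged NN because that tag is more frequent in the training list, which is exactly the kind of ambiguity that probabilistic sequence models try to resolve by looking at neighbouring tags.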

---

## spaCy: a Python NLP library [2]


spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

Using spaCy:


1. Install spaCy using: !pip install spacy
2. Choose a model from https://spacy.io/models
3. Download the model using: !python -m spacy download {model_name}
4. Run the spaCy pipeline with your model and get the tags you need.


### 1. Install spaCy

In [None]:
!pip install spacy

### 2. Choose and download the spaCy model

In [None]:
!python -m spacy download en_core_web_sm #downloading the English model

!python -m spacy download fr_core_news_sm #downloading the French model


In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Colorless green ideas sleep furiously.")



for token in doc:
    print(token.text, token.pos_)

**Code it** Think of a sentence in a language you speak in which one word may be tagged with two (or more) different POS tags (e.g. fish (verb) and fish (noun)),
and translate it to another language using the Google Translator API.

Then use the spaCy pipeline to tag both the original sentence and the translation with POS tags (as in the "I fish a fish" example).

**Note:** Try to choose a language that both spaCy and Google Translator support.

In [None]:

from googletrans import Translator
import en_core_web_sm
import fr_core_news_sm


sentence = "I went back in through the back door."
nlp = en_core_web_sm.load()
doc = nlp(sentence)

for token in doc:
    print(token.text, token.pos_)

translator = Translator()

translated_sentence  = translator.translate(sentence, src='en',dest='fr').text
nlp_fr = fr_core_news_sm.load()
fr_doc = nlp_fr(translated_sentence)

print(translated_sentence)
for token in fr_doc:
    print(token.text, token.pos_)

**If you fancy** The sentence "Colorless green ideas sleep furiously" was introduced by Noam Chomsky as an example of a sentence that is syntactically correct but semantically makes no sense.

Noam Chomsky is a highly respected contemporary linguist who is known for the controversial theory of [universal grammar](https://www.youtube.com/watch?v=517XJ3eOIzg), [generative grammar](https://www.youtube.com/watch?v=jc2bL1z9Wh4) and the minimalist program.


Watch [this 1989 interview](https://www.youtube.com/watch?v=hdUbIlwHRkY&t=534s) in which Chomsky talks about language change.


## Stanza: yet another Python NLP package [4]

Stanza is a recently released Python library for Natural Language Processing from Stanford University.

It covers more than 60 human languages.

You can run a demo of Stanza [here](http://stanza.run/)

To use Stanza for POS tagging, follow the same steps as we did with **spaCy**:

1. Download and install library using pip

2. Choose and download model using stanza.download('lang') (replace lang with the abbreviation for your chosen language)

3. Run Pipeline


**NOTE:** Stanza models are large: the English model alone is 428MB.

If you want to use Stanza for another language, change the language in the `download` and `Pipeline` commands. You can find a list of languages supported by Stanza [here](https://stanfordnlp.github.io/stanza/available_models.html)

In [None]:
!pip install stanza

import stanza

Using Stanza to POS-tag a sentence in English

In [None]:

stanza.download('en') # download English model

nlp = stanza.Pipeline('en') # initialize English neural pipeline

doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence

print(*[f'word: {word.text}\tupos: {word.upos}' for sent in doc.sentences for word in sent.words], sep='\n')

Using Stanza to POS-tag a sentence in Hebrew

In [None]:
stanza.download('he') # download Hebrew model

nlp = stanza.Pipeline('he') # initialize Hebrew neural pipeline

doc = nlp("אני בקורס המעשי כרגע מנסה את החבילה החדשה הזו.") # run annotation over a sentence

print(*[f'word: {word.text}\tupos: {word.upos}' for sent in doc.sentences for word in sent.words], sep='\n')

---

**Exercise** Create a word cloud of proper nouns from a text of your own choosing that is longer than 10,000 words.


In [None]:
import urllib.request  # part of the standard library, no pip install needed
import en_core_web_sm

url = "https://www.gutenberg.org/files/98/98-0.txt"
file = urllib.request.urlopen(url)
nlp = en_core_web_sm.load()
proper_nouns = []
# Tag the text line by line and keep only the tokens tagged as proper nouns
for line in file:
    decoded_line = line.decode("utf-8")
    doc = nlp(decoded_line)
    for token in doc:
        if token.pos_ == "PROPN":
            proper_nouns.append(token.text)

!pip install wordcloud==1.8.0

from wordcloud import WordCloud
import matplotlib.pyplot as plt
 
proper_nouns_word_cloud = WordCloud().generate(" ".join(proper_nouns))
 
# Display the generated image:
fig = plt.figure()


plt.imshow(proper_nouns_word_cloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=5, y=5)
plt.show()
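
Before or after drawing the cloud it can help to look at the raw counts, since the cloud only shows relative sizes. The sketch below uses a short placeholder list so that it runs on its own; in the notebook you would pass the `proper_nouns` list built above instead. `sample_proper_nouns` is a made-up stand-in, not real output.

```python
from collections import Counter

# Placeholder list; in the notebook, use the proper_nouns list built above.
sample_proper_nouns = ["London", "Paris", "London", "Dover", "Paris", "London"]

# Count how often each proper noun occurs and show the most frequent ones
counts = Counter(sample_proper_nouns)
for name, freq in counts.most_common(2):
    print(name, freq)
```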

## References

**\[1\]** https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31

**\[2\]** https://spacy.io/

**\[3\]** http://www.cs.columbia.edu/~kathy/NLP/2017/ClassSlides/Class5-POS/pos_F17.pdf

**\[4\]** https://stanfordnlp.github.io/stanza/