<h1><center>Natural Language Processing</center></h1>

### Parts-of-Speech Tagging

#### Task-1:
Consider the following spoof newspaper headlines:

- British Left Waffles on Falkland Islands
- Juvenile Court to Try Shooting Defendant

Manually tag these headlines to see if knowledge of the [part-of-speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) removes the ambiguity.
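
One way to make the ambiguity visible is to write each reading as a list of (word, tag) pairs and compare them position by position. The sketch below uses Penn Treebank tags. The tag assignments are one plausible manual reading, not a gold annotation, and the helper name `differing_tags` is hypothetical.

```python
# Two readings per headline as (word, tag) pairs with Penn Treebank tags.
# The tag choices are illustrative assumptions, not a gold-standard annotation.
headline_1_a = [("British", "JJ"), ("Left", "NN"), ("Waffles", "VBZ"),
                ("on", "IN"), ("Falkland", "NNP"), ("Islands", "NNPS")]
headline_1_b = [("British", "NNPS"), ("Left", "VBD"), ("Waffles", "NNS"),
                ("on", "IN"), ("Falkland", "NNP"), ("Islands", "NNPS")]

headline_2_a = [("Juvenile", "JJ"), ("Court", "NN"), ("to", "TO"),
                ("Try", "VB"), ("Shooting", "NN"), ("Defendant", "NN")]
headline_2_b = [("Juvenile", "JJ"), ("Court", "NN"), ("to", "TO"),
                ("Try", "VB"), ("Shooting", "VBG"), ("Defendant", "NN")]


def differing_tags(reading_a, reading_b):
    """Return (word, tag_a, tag_b) for every position where the tags differ."""
    return [(word, tag_a, tag_b)
            for (word, tag_a), (_, tag_b) in zip(reading_a, reading_b)
            if tag_a != tag_b]


print(differing_tags(headline_1_a, headline_1_b))
print(differing_tags(headline_2_a, headline_2_b))
```

Once the tags are fixed, each headline has a single reading: the ambiguity sits entirely in the words whose tags differ between the two lists.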

#### Task-2:

Tokenize and tag the sentence below using the [NLTK POS tagger](https://www.nltk.org/api/nltk.tag.pos_tag.html) and [spaCy](https://spacy.io/usage/linguistic-features):  

"They wind back the clock, while we chase after the wind".  

What different pronunciations and parts of speech are involved?

In [2]:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
# nltk.download('averaged_perceptron_tagger')
# nltk.download('universal_tagset')

sent = "They wind back the clock, while we chase after the wind."
pos_tag(word_tokenize(sent), tagset='universal')

[('They', 'PRON'),
 ('wind', 'VERB'),
 ('back', 'ADV'),
 ('the', 'DET'),
 ('clock', 'NOUN'),
 (',', '.'),
 ('while', 'ADP'),
 ('we', 'PRON'),
 ('chase', 'VERB'),
 ('after', 'ADP'),
 ('the', 'DET'),
 ('wind', 'NOUN'),
 ('.', '.')]

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sent)

for token in doc:
    print(token, token.pos_)
#     print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
#             token.shape_, token.is_alpha, token.is_stop)


They PRON
wind VERB
back ADP
the DET
clock NOUN
, PUNCT
while SCONJ
we PRON
chase VERB
after ADP
the DET
wind NOUN
. PUNCT
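
Both taggers label the first "wind" as a verb (to wind a clock, pronounced /waɪnd/) and the second as a noun (moving air, pronounced /wɪnd/). The sketch below copies the two outputs shown above into plain lists and reports where the taggers disagree. The variable names are hypothetical. Part of the disagreement comes from the tagsets themselves: NLTK's universal tagset writes punctuation as `.`, while spaCy uses `PUNCT` and has a separate `SCONJ` label.

```python
# Tags copied from the two outputs above (NLTK universal tagset vs spaCy pos_).
nltk_tags = [('They', 'PRON'), ('wind', 'VERB'), ('back', 'ADV'), ('the', 'DET'),
             ('clock', 'NOUN'), (',', '.'), ('while', 'ADP'), ('we', 'PRON'),
             ('chase', 'VERB'), ('after', 'ADP'), ('the', 'DET'),
             ('wind', 'NOUN'), ('.', '.')]
spacy_tags = [('They', 'PRON'), ('wind', 'VERB'), ('back', 'ADP'), ('the', 'DET'),
              ('clock', 'NOUN'), (',', 'PUNCT'), ('while', 'SCONJ'), ('we', 'PRON'),
              ('chase', 'VERB'), ('after', 'ADP'), ('the', 'DET'),
              ('wind', 'NOUN'), ('.', 'PUNCT')]

# Positions where the two taggers assign different labels.
disagreements = [(word, nltk_tag, spacy_tag)
                 for (word, nltk_tag), (_, spacy_tag) in zip(nltk_tags, spacy_tags)
                 if nltk_tag != spacy_tag]
for word, nltk_tag, spacy_tag in disagreements:
    print(f"{word!r}: NLTK={nltk_tag}, spaCy={spacy_tag}")

# The two occurrences of "wind" and the tag each one received from NLTK.
wind_tags = [tag for word, tag in nltk_tags if word == 'wind']
print(wind_tags)
```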


#### Task-3:

Write programs to read and process the [NLTK Brown Corpus](https://www.nltk.org/howto/corpus.html) and perform the following tasks:

- Get and display a list of all categories of the brown corpus.
- Get and display the tokenized and tagged version of the "news" category.
- Get and display the sentence segmented, tokenized, and tagged version of the "news" category.
- Print the first 5 words, and the first 2 sentences.
- Get and display the set of all the tags in the brown corpus.
- Get and display the set of all the tags in the universal tagset.

In [4]:
from nltk.corpus import brown
# nltk.download('brown')
print(brown.words())
print(brown.tagged_words(categories='news')[:20])
print(brown.tagged_sents(categories='news', tagset='universal')[:2])
print(dir(brown))
tagset = set()
for x in brown.tagged_words(tagset='universal'):
    tagset.add(x[-1])
print(tagset)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]
[[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')], [('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'),
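
The cell above collects only the universal tags and does not list the corpus categories. A minimal sketch of a tag-inventory helper follows, tested on a toy sample copied from the tagged output above. The helper name `tag_inventory` is hypothetical. The commented lines show how it could be applied with NLTK, assuming the Brown corpus has been downloaded.

```python
def tag_inventory(tagged_words):
    """Return the sorted list of distinct tags in an iterable of (word, tag) pairs."""
    return sorted({tag for _, tag in tagged_words})


# Toy input: the first few pairs of the Brown "news" output shown above.
sample = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
          ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR')]
print(tag_inventory(sample))

# With NLTK and the corpus available, the same helper could be used as:
#   from nltk.corpus import brown
#   print(brown.categories())                                     # all categories
#   print(brown.words(categories='news')[:5])                     # first 5 words
#   print(tag_inventory(brown.tagged_words()))                    # native Brown tags
#   print(tag_inventory(brown.tagged_words(tagset='universal')))  # universal tags
```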

#### Task-4:

Write programs to process the [Brown Corpus](https://www.nltk.org/howto/corpus.html) and find answers to the following questions:
- Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)
- Which word has the greatest number of distinct tags? What are they, and what do they represent?
- List tags in order of decreasing frequency. What do the 20 most frequent tags represent?
- Which tags are nouns most commonly found after? What do these tags represent?

In [5]:
# Lower-cased nouns from the whole Brown corpus (universal tagset)
nouns = [word[0].lower() for word in brown.tagged_words(tagset='universal') if word[1] == "NOUN"]

def countNouns(nouns):
    """Count how often each noun form occurs."""
    nounsDict = {}
    for noun in nouns:
        nounsDict[noun] = nounsDict.get(noun, 0) + 1
    return nounsDict

# Heuristic: a noun ending in -s is treated as a plural form
sinNouns = [noun for noun in nouns if not noun.endswith('s')]
pluNouns = [noun for noun in nouns if noun.endswith('s')]
sinNounsDict = countNouns(sinNouns)
pluNounsDict = countNouns(pluNouns)

# Keep each singular whose "+s" form is more frequent than the singular itself
pluCommonList = []
for key, value in sinNounsDict.items():
    if value < pluNounsDict.get(key + 's', 0):
        pluCommonList.append(key)
print(pluCommonList)

['term', 'year', 'candidate', 'official', 'bond', 'item', 'fund', 'legislator', 'event', 'keynote', 'revision', 'requirement', 'fee', 'appointee', 'poll', 'texan', 'disclosure', 'minute', 'worker', 'employer', 'student', 'dollar', 'appropriation', 'leader', 'thing', 'rule', 'hour', 'element', 'step', 'diplomat', 'affair', 'lao', 'observer', 'complication', 'factor', 'nation', 'vehicle', 'eye', 'expert', 'voter', 'member', 'product', 'remark', 'passenger', 'lip', 'freeholder', 'representative', 'wave', 'visitor', 'scientist', 'object', 'locale', 'norm', 'democrat', 'month', 'owner', 'american', 'banker', 'panel', 'commercial', 'native', 'headache', 'towel', 'headline', 'acre', 'planner', 'bomb', 'critic', 'greeting', 'emerald', 'denomination', 'immigrant', 'defendant', 'oriole', 'plate', 'error', 'hit', 'spectator', 'ditmar', 'yankee', 'bird', 'wing', 'condition', 'win', 'allowance', 'knee', 'swelling', 'longhorn', 'leg', 'yard', 'star', 'tactic', 'characteristic', 'arm', 'organ', 'pop'
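
The suffix test above cannot tell a plural from a singular that happens to end in -s. An alternative sketch, assuming the native Brown tags `NN` (singular common noun) and `NNS` (plural common noun), lets the corpus annotation decide number and uses the -s suffix only to pair each plural with its singular. The function name `plural_dominant` and the toy data are hypothetical.

```python
from collections import Counter


def plural_dominant(tagged_words):
    """Return singular forms whose regular -s plural is more frequent.

    Expects (word, tag) pairs with Brown tags: NN = singular, NNS = plural.
    Only singulars attested at least once are considered.
    """
    singular, plural = Counter(), Counter()
    for word, tag in tagged_words:
        if tag == 'NN':
            singular[word.lower()] += 1
        elif tag == 'NNS':
            plural[word.lower()] += 1
    return sorted(form[:-1] for form, count in plural.items()
                  if form.endswith('s') and 0 < singular[form[:-1]] < count)


# Toy data: "years" outnumbers "year"; "glass" is singular despite its ending.
toy = [('years', 'NNS'), ('Years', 'NNS'), ('year', 'NN'), ('glass', 'NN'),
       ('election', 'NN'), ('election', 'NN'), ('elections', 'NNS')]
print(plural_dominant(toy))
```

With NLTK available, the same function could be called on `brown.tagged_words()` (no `tagset` argument, so the native tags are kept).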

In [23]:
brown_news_tagged = brown.tagged_words(categories='news',tagset='universal')
cfd1 = nltk.ConditionalFreqDist(brown_news_tagged)
cfd = {key: len(cfd1[key]) for key in cfd1.conditions()}
max_value = max(cfd.values())
k = [key for key, value in cfd.items() if value == max_value]
for key in k:
    print(key, cfd1[key].most_common())

that [('ADP', 545), ('PRON', 128), ('DET', 124), ('ADV', 5)]
present [('ADJ', 21), ('ADV', 7), ('NOUN', 1), ('VERB', 1)]
close [('ADV', 6), ('ADJ', 3), ('VERB', 2), ('NOUN', 1)]
open [('ADJ', 13), ('VERB', 11), ('NOUN', 2), ('ADV', 1)]


In [8]:
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common(20)

[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]
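
The universal tagset has only 12 tags, so `most_common(20)` returns all of them. The sketch below attaches a short gloss to each tag and prints its share of the tokens. The counts are copied from the output above; the dictionary name `UNIVERSAL_TAG_GLOSS` is hypothetical.

```python
# Counts copied from the frequency output above (Brown "news", universal tagset).
tag_counts = [('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928),
              ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717),
              ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]

# Short descriptions of the 12 universal tags.
UNIVERSAL_TAG_GLOSS = {
    'NOUN': 'noun', 'VERB': 'verb', 'ADP': 'adposition',
    '.': 'punctuation', 'DET': 'determiner, article', 'ADJ': 'adjective',
    'ADV': 'adverb', 'CONJ': 'conjunction', 'PRON': 'pronoun',
    'PRT': 'particle', 'NUM': 'numeral', 'X': 'other',
}

total = sum(count for _, count in tag_counts)
for tag, count in tag_counts:
    print(f"{tag:5} {UNIVERSAL_TAG_GLOSS[tag]:22} {count:6d} {count / total:6.1%}")
```

To inspect the 20 most frequent fine-grained tags instead, the same frequency distribution can be built from `brown.tagged_words(categories='news')` without the `tagset` argument.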

In [9]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
[tag for (tag, _) in fdist.most_common()]

['NOUN',
 'DET',
 'ADJ',
 'ADP',
 '.',
 'VERB',
 'CONJ',
 'NUM',
 'ADV',
 'PRT',
 'PRON',
 'X']

#### Task-5:

Train a unigram tagger on the [Brown Corpus news category](https://www.nltk.org/howto/corpus.html) and run it on the test text below:  

test_text = ['hello', 'world', 'natural', 'language', 'processing']  

Observe that some words are not assigned a tag. Why not?

In [33]:
from nltk.tag import UnigramTagger
test_text = ['hello', 'world', 'natural', 'language', 'processing']
unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news'))
print(unigram_tagger.tag(test_text))

[('hello', None), ('world', 'NN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN')]


In [38]:
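# Check whether "hello" occurs anywhere in the news category.
# The loop prints nothing, so the word is absent from the training data
# and the unigram tagger has no tag to assign to it.
for word in brown.tagged_words(categories='news', tagset='universal'):
    if word[0].lower() == "hello":
        print(1)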
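
A unigram tagger is essentially a lookup table from each training word to its most frequent tag, so a word that never occurs in the training data gets `None`. The plain-Python sketch below illustrates the idea on a toy training set. The function names and the toy sentences are hypothetical, and this is not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, words, default=None):
    """Look each word up in the model; unseen words receive `default`."""
    return [(word, model.get(word, default)) for word in words]


# Toy training data with Brown-style tags; "hello" is deliberately missing.
toy_train = [[('the', 'AT'), ('world', 'NN'), ('is', 'BEZ'), ('natural', 'JJ')],
             [('language', 'NN'), ('processing', 'NN')]]
model = train_unigram(toy_train)

test_text = ['hello', 'world', 'natural', 'language', 'processing']
print(tag_unigram(model, test_text))                # unseen word -> None
print(tag_unigram(model, test_text, default='NN'))  # unseen word -> fallback tag
```

In NLTK the fallback is usually expressed with the `backoff` parameter, for example `UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))`.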

#### Task-6:

Using the [NLTK default tagger](https://www.nltk.org/book/ch05.html#:~:text=4.1-,The%20Default%20Tagger,-The%20simplest%20possible), tag the following sentence with the most frequent tag in the corpus:  

"the quick brown fox jumped over the lazy dog"

In [40]:
import nltk
tags = [tag for (word, tag) in brown.tagged_words()]
raw = "the quick brown fox jumped over the lazy dog"
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger(nltk.FreqDist(tags).max())
print(default_tagger.tag(tokens))

[('the', 'NN'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


#### Task-7:

Using the [Brown Corpus news category](https://www.nltk.org/howto/corpus.html), train and evaluate the following built-in taggers:
- [Unigram Tagger](https://www.nltk.org/book/ch05.html)
- [Bigram Tagger](https://www.nltk.org/book/ch05.html)
- [Trigram Tagger](https://www.nltk.org/book/ch05.html)

In [41]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)

0.9349006503968017

In [43]:
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.evaluate(brown_tagged_sents)

0.7892972929967977

In [44]:
trigram_tagger = nltk.TrigramTagger(brown_tagged_sents)
trigram_tagger.evaluate(brown_tagged_sents)

0.8198679316586113
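
The three scores above are measured on the same sentences the taggers were trained on, so they overestimate performance on unseen text. A minimal sketch of a held-out evaluation follows, in plain Python with a toy lookup tagger. The names `split_corpus`, `accuracy`, and `lookup_tagger` and the toy sentences are hypothetical.

```python
def split_corpus(tagged_sents, train_fraction=0.9):
    """Split tagged sentences into a training part and a held-out test part."""
    cut = int(len(tagged_sents) * train_fraction)
    return tagged_sents[:cut], tagged_sents[cut:]


def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag_fn([word for word, _ in sent])
        for (_, gold), (_, guess) in zip(sent, predicted):
            correct += (gold == guess)
            total += 1
    return correct / total if total else 0.0


toy = [[('the', 'AT'), ('dog', 'NN')],
       [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')],
       [('a', 'AT'), ('dog', 'NN')],
       [('the', 'AT'), ('fox', 'NN')]]
train_sents, test_sents = split_corpus(toy, train_fraction=0.75)

# Toy lookup tagger built from the training part only.
lookup = {word: tag for sent in train_sents for word, tag in sent}


def lookup_tagger(words):
    return [(word, lookup.get(word)) for word in words]


print(accuracy(lookup_tagger, train_sents))  # scored on the training data
print(accuracy(lookup_tagger, test_sents))   # scored on held-out data
```

With NLTK, the same split could be applied to `brown_tagged_sents`, and the taggers can be chained through the `backoff` parameter, for example `nltk.BigramTagger(train_sents, backoff=nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN')))`, so that unseen contexts fall back to a simpler model instead of returning `None`.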