<h1>Tokenization - POS Tagging</h1>

- Part-of-speech (POS) tagging assigns each word in a text a grammatical category, such as noun, pronoun, verb, adjective, or adverb.

- POS tagging approaches
  - Rule-based tagging: <pre>PTB</pre>
  - Transformation-based tagging
  - Statistical POS tagging:
  <pre>
  Conditional Random Fields (CRF)
  Hidden Markov Model (HMM)</pre>

- <b>Penn Treebank</b> (PTB) - The English PTB corpus, and in particular the section containing Wall Street Journal (WSJ) articles, is one of the best-known and most widely used corpora for evaluating sequence labelling models. The task consists of annotating each word with its part-of-speech tag. In the most common split of this corpus, sections 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections 19 to 21 for validation (5 527 sentences, 131 768 tokens), and sections 22 to 24 for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level language modelling.
<br>
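To make the Hidden Markov Model approach listed among the statistical taggers more concrete, here is a minimal sketch of HMM tagging with Viterbi decoding in plain Python. The function names (`train_hmm`, `log_prob`, `viterbi`), the three toy sentences, and the add-one smoothing are all assumptions made for illustration; this is not how NLTK or any particular library implements it.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sents):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in tagged_sents:
        prev = "<s>"  # start-of-sentence marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability, so unseen events are not impossible."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for words as (word, tag) pairs."""
    if not words:
        return []
    tags = list(emit)
    n_tags = len(tags)
    # Vocabulary size plus one slot reserved for unknown words
    n_words = len({w for c in emit.values() for w in c}) + 1
    # Each column maps tag -> (best log score, previous tag on the best path)
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))

# Toy training data (hypothetical, hand-tagged with PTB-style tags)
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
trans, emit = train_hmm(toy_train)
print(viterbi(["a", "cat", "barks"], trans, emit))
print(viterbi(["the", "bird", "sleeps"], trans, emit))  # "bird" is unseen
```

The second call includes a word that never occurs in the toy training data; the smoothed emission scores are then equal across tags, so the transition counts decide its tag.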

<h2> Tagger</h2>

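- Default tagger (assigns one fixed tag, such as the PTB tag NN, to every token; useful as a baseline)
- Unigram tagger (learns the most frequent tag for each word from a tagged training set)
- Bigram tagger (also uses the previous token's tag as context; usually combined with a backoff tagger for unseen contexts)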
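The three taggers above are often chained so that each one falls back to a simpler one. The sketch below shows that idea in plain Python, independent of NLTK; the helper names (`train_unigram`, `train_bigram`, `tag_with_backoff`) and the toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(words, bigram, unigram, default="NN"):
    """Try the bigram context first, then the unigram table, then the default tag."""
    tagged, prev = [], None
    for word in words:
        tag = bigram.get((prev, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev = tag
    return tagged

# Toy training data (hypothetical, hand-tagged)
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("can", "MD"), ("run", "VB")],
    [("open", "VB"), ("the", "DT"), ("can", "NN")],
    [("I", "PRP"), ("can", "MD"), ("swim", "VB")],
]
unigram = train_unigram(toy_train)
bigram = train_bigram(toy_train)

words = ["the", "can", "can", "rust"]
with_context = tag_with_backoff(words, bigram, unigram)
unigram_only = tag_with_backoff(words, {}, unigram)
print(with_context)
print(unigram_only)
```

With the bigram table, the first "can" (after a determiner) is tagged NN and the second (after a noun) is tagged MD, while the unigram-only run gives both the majority tag MD. The unseen word "rust" falls through to the default tag in both runs.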

In [None]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')
nltk.download("treebank")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [None]:
# SAMPLE POS TAGGING
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords

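def simple_pos_tagger(text):
  # Split the raw text into word and punctuation tokens
  tokens = word_tokenize(text)
  # Note: dropping stop words before tagging removes context the tagger relies on,
  # so some tags may differ from those assigned to the full sentence
  stop_words = set(stopwords.words("english"))
  filtered = [token for token in tokens if token.lower() not in stop_words]

  # pos_tag returns a list of (token, PTB tag) pairs
  pos_tags = pos_tag(filtered)
  return pos_tags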

txt = "Obama does Natural Language Processing is fascinating and not so challenging."
tags = simple_pos_tagger(txt)
print(tags)

[('Obama', 'NNP'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('fascinating', 'VBG'), ('challenging', 'NN'), ('.', '.')]


In [None]:
#COMPARISON OF TAGGERS
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def compare_taggers(text):
  tokens = word_tokenize(text)
  stop_words = set(stopwords.words("english"))
  filtered = [token for token in tokens if token.lower() not in stop_words]

  # Hold out the last 20% of the tokens so the test set does not overlap the training set
  train_size = int(0.8*len(filtered))
  train, test = filtered[:train_size], filtered[train_size:]
  # Toy labels: every token is labelled 'NN', so this compares tagger mechanics, not real accuracy
  tagged_sent = [(token, 'NN') for token in train]
  print("Tagged:", tagged_sent)

  def_tag = DefaultTagger("NN")
  uni_tag = UnigramTagger(model = {word: tag for word,tag in tagged_sent})
  bi_tag = BigramTagger(train=[tagged_sent], backoff=uni_tag)

  gold_standard = [(token, 'NN') for token in test]

  # tag_sents() expects sentences of plain tokens, not (token, tag) pairs
  test_sents = [test]

  tag_sent_def = def_tag.tag_sents(test_sents)
  tag_sent_uni = uni_tag.tag_sents(test_sents)
  tag_sent_bi = bi_tag.tag_sents(test_sents)

  tag_def = [tag for (word, tag) in tag_sent_def[0]]
  tag_uni = [tag for (word, tag) in tag_sent_uni[0]]
  tag_bi = [tag for (word, tag) in tag_sent_bi[0]]

  # Held-out words are unseen, so the unigram and bigram taggers (no default backoff) predict None for them
  accuracy_def = sum(1 for gold, pred in zip(gold_standard, tag_def) if gold[1] == pred)/len(gold_standard)
  accuracy_uni = sum(1 for gold, pred in zip(gold_standard, tag_uni) if gold[1] == pred)/len(gold_standard)
  accuracy_bi = sum(1 for gold, pred in zip(gold_standard, tag_bi) if gold[1] == pred)/len(gold_standard)


  return accuracy_def, accuracy_uni, accuracy_bi

acc = compare_taggers(txt)

print(f"Default: {acc[0]}\nUnigram: {acc[1]}\nBigram: {acc[2]}")




Tagged: [('Obama', 'NN'), ('Natural', 'NN'), ('Language', 'NN'), ('Processing', 'NN'), ('fascinating', 'NN')]
Default: 1.0
Unigram: 0.0
Bigram: 0.0
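
A tagger that has no entry for a token predicts `None`, and `None` never matches a gold tag, which is how a score of 0.0 can appear next to a default tagger's 1.0. The sketch below shows what token-level accuracy computes in that situation. The helper `tagging_accuracy` and the toy data are hypothetical and written only for illustration; they are not NLTK's implementation.

```python
def tagging_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += gold_tag == pred_tag  # a None prediction never matches
            total += 1
    return correct / total if total else 0.0

# Toy data (hypothetical): the tagger had no entry for "barks"
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
predicted = [[("the", "DT"), ("dog", "NN"), ("barks", None)]]

acc = tagging_accuracy(gold, predicted)
print(acc)
```

Giving the last tagger in a backoff chain a default tagger is one common way to avoid `None` predictions.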


In [None]:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

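train = treebank.tagged_sents()[:2000]
test = treebank.tagged_sents()[2000:]

# The unigram tagger is trained on the single tagged sample sentence ('tags' from the earlier cell),
# and both taggers are scored on that same sentence, so the 1.0 below does not measure generalization.
# For a held-out score, train on 'train' and call accuracy(test) instead; 'test' is unused here.
uni_tag = UnigramTagger([tags])
bi_tag = BigramTagger(train, backoff=uni_tag)

# tag() takes plain tokens, so strip the tags first
uni_tag.tag([word for word, _ in tags])
# accuracy() replaces the deprecated evaluate()
print(uni_tag.accuracy([tags]))
print(bi_tag.accuracy([tags]))

1.0
1.0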
