# I. Rule-based and Statistical Approaches for Part-of-Speech Tagging

Part-of-speech (POS) tagging is the process of assigning a grammatical tag to each word in a sentence. Each tag represents the word's syntactic category, or part of speech, such as noun, verb, adjective, or adverb. POS tagging is an essential step in many Natural Language Processing (NLP) tasks, including parsing, machine translation, and information retrieval.

POS tagging can be approached using different techniques, including rule-based approaches, statistical approaches, and hybrid approaches that combine both. In statistical approaches, Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) are commonly used.
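
To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over a three-tag toy model. All probabilities in `START`, `TRANS`, and `EMIT` are invented for illustration (they are assumptions, not estimates from a corpus), and the `FLOOR` smoothing constant and the `viterbi` function name are likewise hypothetical. The sketch assumes a non-empty list of tokens.

```python
import math

# Toy HMM parameters, invented for illustration only (not estimated from a corpus).
TAGS = ["DT", "NN", "VBZ"]
START = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}           # P(first tag)
TRANS = {                                             # P(next tag | current tag)
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VBZ": 0.7},
    "VBZ": {"DT": 0.6, "NN": 0.3, "VBZ": 0.1},
}
EMIT = {                                              # P(word | tag)
    "DT": {"the": 1.0},
    "NN": {"cat": 0.5, "mat": 0.4, "sits": 0.1},
    "VBZ": {"sits": 0.9, "cat": 0.1},
}
FLOOR = 1e-6  # small probability assigned to unseen (tag, word) pairs


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    def emit(tag, word):
        return math.log(EMIT[tag].get(word.lower(), FLOOR))

    # best[i][tag] = (log-probability of the best path ending in tag at i, previous tag)
    best = [{t: (math.log(START[t]) + emit(t, words[0]), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            # Pick the predecessor tag that maximizes the path score into t
            prev = max(TAGS, key=lambda p: best[-1][p][0] + math.log(TRANS[p][t]))
            score = best[-1][prev][0] + math.log(TRANS[prev][t]) + emit(t, word)
            column[t] = (score, prev)
        best.append(column)

    # Follow the back-pointers from the best final tag to recover the path
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))


print(viterbi(["The", "cat", "sits"]))
# [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ')]
```

Working in log space avoids numeric underflow on longer sentences. A real HMM tagger would estimate the three tables from a tagged corpus rather than hard-coding them.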

Implement a rule-based part-of-speech (POS) tagger:
* a. Write a set of rules to assign POS tags to words based on their context.
* b. Apply the rules to a sample text and evaluate the accuracy of the tagger.
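
As a rough illustration of these two steps, the sketch below tags pre-tokenized words with a small, ordered list of regular-expression rules and scores the result against hand-assigned gold tags. The rule set, the `rule_tag` and `accuracy` helper names, and the gold tags are all assumptions made for this example. It takes a token list as input so that it runs without any corpus downloads.

```python
import re

# Hypothetical rule set: (pattern, tag) pairs tried in order; the first full match wins.
RULES = [
    (re.compile(r"the|a|an", re.IGNORECASE), "DT"),
    (re.compile(r"is|was|has", re.IGNORECASE), "VBZ"),  # must precede the plural rule
    (re.compile(r"on|in|over|under", re.IGNORECASE), "IN"),
    (re.compile(r"\w+ing"), "VBG"),   # gerunds and present participles
    (re.compile(r"\w+ly"), "RB"),     # many adverbs
    (re.compile(r"\w+ed"), "VBD"),    # simple past verbs
    (re.compile(r"\w+s"), "NNS"),     # plural nouns
    (re.compile(r"[.!?]"), "."),      # sentence-final punctuation
    (re.compile(r"\w+"), "NN"),       # fallback: treat any other word as a noun
]


def rule_tag(tokens):
    """Tag each token with the first rule whose pattern matches the whole token."""
    tagged = []
    for token in tokens:
        tag = next((t for pattern, t in RULES if pattern.fullmatch(token)), "UNKNOWN")
        tagged.append((token, tag))
    return tagged


def accuracy(predicted, gold):
    """Fraction of (token, tag) pairs that agree with the gold annotation."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


tokens = ["The", "cat", "is", "sitting", "on", "the", "mat", "."]
gold = [("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sitting", "VBG"),
        ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

predicted = rule_tag(tokens)
print(predicted)
print(f"Accuracy: {accuracy(predicted, gold):.2f}")
```

A perfect score on the sentence the rules were written for says little about generalization. Suffix rules misfire easily: with this rule set, `rule_tag(["bus"])` yields `NNS`. Evaluating on held-out text gives a more honest picture.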



Implement a statistical POS tagger using a pre-trained model:


* a. Train a statistical POS tagger on a labeled corpus using a machine learning algorithm such as Naive Bayes or Maximum Entropy.
* b. Apply the trained model to tag a sample text and evaluate its accuracy.
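
The train-then-evaluate loop can be sketched without any library. Note that this is not Naive Bayes or Maximum Entropy: it substitutes the simpler most-frequent-tag (unigram) model with a default-tag fallback, which is easier to follow in a few lines. The tiny `TRAIN` and `TEST` corpora and the helper names `train_unigram`, `unigram_tag`, and `accuracy` are made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled corpora, invented for illustration; a real tagger needs far more data.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ"), (".", ".")],
]
TEST = [
    [("a", "DT"), ("cat", "NN"), ("barks", "VBZ"), (".", ".")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ"), (".", ".")],
]


def train_unigram(tagged_sents):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens, default="NN"):
    """Tag known words from the model; fall back to a default tag for unseen words."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


def accuracy(model, tagged_sents):
    """Token-level accuracy of the model against gold-tagged sentences."""
    total = correct = 0
    for sent in tagged_sents:
        tokens = [word for word, _ in sent]
        for (_, gold), (_, pred) in zip(sent, unigram_tag(model, tokens)):
            total += 1
            correct += gold == pred
    return correct / total


model = train_unigram(TRAIN)
print(unigram_tag(model, ["The", "cat", "sings", "."]))
print(f"Accuracy: {accuracy(model, TEST):.3f}")
```

The unseen word "sings" falls back to `NN`, which is the single error on the test sentences. This mirrors the role a default tagger plays at the end of a backoff chain.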





In [7]:
import nltk
import re
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # two pre-trained English named entity chunkers, trained on an ACE corpus (perhaps the ACE 2004 Multilingual Training Corpus)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zsj13\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\zsj13\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\zsj13\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\zsj13\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [8]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger #important for POS tagging
from sklearn.model_selection import train_test_split


# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
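    # Hand-written rules: each (pattern, tag) pair targets one word of the sample sentence
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VB'),       # coarse verb tag; the Penn Treebank tag is VBZ
        (re.compile(r'\bsitting\b'), 'VB'),  # coarse verb tag; the Penn Treebank tag is VBG
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        # Try the rules in order; the first pattern that matches decides the tag
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # The for-else branch runs only when no rule matched (no break occurred)
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence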

# Statistical POS Tagger
def statistical_pos_tagger(sentence):
    # Use the first 3,000 tagged sentences of the Penn Treebank sample as labeled data
    tagged_sents = treebank.tagged_sents()[:3000]

    # Split into training (first 80%) and testing (last 20%) sets
    train_size = int(len(tagged_sents) * 0.8)
    train_set = tagged_sents[:train_size]
    test_set = tagged_sents[train_size:]

    # Build a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)  # most frequent tag per word
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)  # also conditions on the previous tag

    # Evaluate on the held-out test set
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence
    tagged_sentence = bigram_tagger.tag(word_tokenize(sentence))
    return tagged_sentence


In [9]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)


Rule-based POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


Additionally, NLTK has a built-in function called `pos_tag`, which uses a pre-trained tagger.
See the example below.

In [10]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


### Exercise

Update the rule-based tagger with patterns using regex. An example could be:

```
(r'\b\w+s\b|\b\w+es\b', 'NN'),     # Nouns ending
```

From here, provide an updated rule-based tagger and a statistical tagger that can each assign part-of-speech tags to the following complex sentence:

```
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
```

In [11]:
##### START CODE ####

import re
import nltk
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def Exercise_rule(sentence):
    # Define my rules
    rules = [
          (re.compile(r'\bThe\b'), 'DT'),
          (re.compile(r'\bquick\b'), 'JJ'),
          (re.compile(r'\bbrown\b'), 'JJ'),
          (re.compile(r'\bfox\b'), 'NN'),
          (re.compile(r'\bjumps\b'), 'VBZ'),
          (re.compile(r'\bover\b'), 'IN'),
          (re.compile(r'\bthe\b'), 'DT'),
          (re.compile(r'\blazy\b'), 'JJ'),
          (re.compile(r'\bdog\b'), 'NN'),
          (re.compile(r'\bwhile\b'), 'IN'),
          # Note: word_tokenize splits "it's" into 'it' and "'s", so this pattern never matches a token
          (re.compile(r"\bit's\b"), 'VBZ'),
          (re.compile(r'\braining\b'), 'VBG'),
          (re.compile(r'\bheavily\b'), 'RB')
      ]
    tagged_sentence_exercise = []
    words_exercise = word_tokenize(sentence)
    for word_exercise in words_exercise:
        for pattern, tag in rules:
            if pattern.match(word_exercise):
                tagged_sentence_exercise.append((word_exercise, tag))
                break
        else:
            tagged_sentence_exercise.append((word_exercise, 'UNKNOWN'))
    return tagged_sentence_exercise


# Statistical POS Tagger
def Exercise_statistical(sentence):
    # Get the dataset
    dataset = treebank.tagged_sents()

    # Split dataset
    training_data, testing_data = train_test_split(dataset, test_size=0.2, random_state=42)

    # Construct chain
    tagger_chain = Tagger_chain(training_data)

    # Test
    print(f'Accuracy: {test(tagger_chain, testing_data)}')

    # Tokenize
    tokens = word_tokenize(sentence)

    # Tag input sentence
    return tagger_chain.tag(tokens)

def Tagger_chain(training_data):
    tagger1 = DefaultTagger('NN')
    tagger2 = UnigramTagger(training_data, backoff=tagger1)
    tagger3 = BigramTagger(training_data, backoff=tagger2)
    final_tagger = TrigramTagger(training_data, backoff=tagger3)
    return final_tagger

def test(tagger, testing_data):
    # accuracy() is the current name for this metric; NLTK has deprecated evaluate()
    return tagger.accuracy(testing_data)

#### END CODE ####


In [12]:
#show printed output below
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."

# Rule-based POS Tagging
Exercise_rule_tags = Exercise_rule(sentence)
print("Rule-based POS Tags:")
print(Exercise_rule_tags)

# Statistical POS Tagging
statistical_tags = Exercise_statistical(sentence)
print("Statistical POS Tags:")
print(statistical_tags)

Rule-based POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('while', 'IN'), ('it', 'UNKNOWN'), ("'s", 'UNKNOWN'), ('raining', 'VBG'), ('heavily', 'RB'), ('.', 'UNKNOWN')]
Accuracy: 0.9037909387318118
Statistical POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'NN'), ('heavily', 'RB'), ('.', '.')]


