# I. Rule-based and Statistical Approaches for Part-of-Speech Tagging

Part-of-Speech tagging, also known as POS tagging, is the process of assigning grammatical tags or labels to words in a sentence. The tags represent the syntactic category or part of speech of each word, such as noun, verb, adjective, adverb, etc. POS tagging is an essential step in many Natural Language Processing (NLP) tasks, including parsing, machine translation, and information retrieval.

POS tagging can be approached using different techniques, including rule-based approaches, statistical approaches, and hybrid approaches that combine both. In statistical approaches, Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) are commonly used.
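
To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over a toy model. The three-tag set, every probability, the `floor` value for unseen events, and the function name `viterbi` are all assumptions made up for illustration; a real tagger would estimate these quantities from a labeled corpus.

```python
import math

# Toy HMM parameters (invented for illustration, not estimated from data)
TAGS = ["DT", "NN", "VBZ"]
START_P = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
TRANS_P = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.6, "NN": 0.3, "VBZ": 0.1},
}
EMIT_P = {
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"cat": 0.45, "mat": 0.45, "sleeps": 0.1},
    "VBZ": {"sleeps": 0.7, "is": 0.3},
}


def viterbi(tokens, tags=TAGS, start_p=START_P, trans_p=TRANS_P,
            emit_p=EMIT_P, floor=1e-6):
    """Return the most probable (token, tag) sequence under the toy HMM."""
    if not tokens:
        return []
    # scores[i][t]: log-probability of the best path ending in tag t at position i
    # back[i][t]:   previous tag on that best path
    scores = [{t: math.log(start_p.get(t, floor))
                  + math.log(emit_p[t].get(tokens[0], floor)) for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        scores.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximizes path score + transition
            best_prev = max(
                tags,
                key=lambda p: scores[i - 1][p] + math.log(trans_p[p].get(t, floor)),
            )
            scores[i][t] = (scores[i - 1][best_prev]
                            + math.log(trans_p[best_prev].get(t, floor))
                            + math.log(emit_p[t].get(tokens[i], floor)))
            back[i][t] = best_prev
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: scores[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return list(zip(tokens, path))


print(viterbi(["the", "cat", "sleeps"]))
```

Working in log space avoids numeric underflow on long sentences. The ambiguous word "sleeps" can be emitted by both `NN` and `VBZ` here, and the transition from `NN` is what tips the decision towards `VBZ`.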

Implement a rule-based part-of-speech (POS) tagger:
* a. Write a set of rules to assign POS tags to words based on their context.
* b. Apply the rules to a sample text and evaluate the accuracy of the tagger.



Implement a statistical POS tagger:
* a. Train a statistical POS tagger on a labeled corpus using a machine learning algorithm such as Naive Bayes or Maximum Entropy.
* b. Apply the trained model to tag a sample text and evaluate its accuracy.

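As a rough picture of what training a Naive Bayes tagger involves, the following pure-Python sketch counts feature values per tag and classifies each token independently. The three features, the add-one smoothing, the three-sentence corpus, and the names `features`, `train_nb`, `classify`, and `tag` are assumptions chosen for illustration; they are not NLTK's implementation.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Describe token i by its form, its last two letters, and the previous word."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "suffix-2": word[-2:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }


def train_nb(tagged_sents):
    tag_counts = Counter()
    feat_counts = defaultdict(Counter)  # (tag, feature name) -> value counts
    vocab = defaultdict(set)            # feature name -> values seen in training
    for sent in tagged_sents:
        tokens = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            tag_counts[tag] += 1
            for name, value in features(tokens, i).items():
                feat_counts[(tag, name)][value] += 1
                vocab[name].add(value)
    return tag_counts, feat_counts, vocab


def classify(model, feats):
    tag_counts, feat_counts, vocab = model
    total = sum(tag_counts.values())

    def score(tag):
        # log P(tag) + sum of log P(feature value | tag), with add-one smoothing;
        # the extra +1 in the denominator reserves mass for unseen values
        s = math.log(tag_counts[tag] / total)
        for name, value in feats.items():
            seen = feat_counts[(tag, name)][value]
            s += math.log((seen + 1) / (tag_counts[tag] + len(vocab[name]) + 1))
        return s

    return max(tag_counts, key=score)


def tag(model, tokens):
    return [(tok, classify(model, features(tokens, i))) for i, tok in enumerate(tokens)]


# Hand-made training data (an assumption; far too small for real use)
TRAIN = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]
model = train_nb(TRAIN)
print(tag(model, ["the", "fox", "jumps"]))
```

Neither "fox" nor "jumps" occurs in the training data, so the decision rests on the context features: "fox" follows "the" (typical for nouns here), and "jumps" shares the suffix "ps" with "sleeps".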




In [102]:
import nltk
import re
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # Two pre-trained English named-entity chunkers trained on an ACE corpus (possibly the ACE 2004 Multilingual Training Corpus)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [103]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger #important for POS tagging

# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
    # Word-specific rules, tried in order; the first pattern that matches a token wins.
    # 'VB' serves as a coarse verb tag here; the Penn Treebank tagset would
    # distinguish 'VBZ' (is) from 'VBG' (sitting).
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VB'),
        (re.compile(r'\bsitting\b'), 'VB'),
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # for/else: reached only when no rule matched this token
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence

# Statistical POS Tagger
# Note: the model is retrained on every call, which is fine for a demo but slow in practice.
def statistical_pos_tagger(sentence):
    # Use the first 3000 tagged sentences of the Treebank sample as labeled data
    tagged_data = treebank.tagged_sents()[:3000]

    # Split data into training (80%) and testing (20%) sets
    train_size = int(len(tagged_data) * 0.8)
    train_set = tagged_data[:train_size]
    test_set = tagged_data[train_size:]

    # Build a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # Assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)  # Most frequent tag per word
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)  # Uses the previous tag as context

    # Evaluate on test set
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence
    tagged_sentence = bigram_tagger.tag(word_tokenize(sentence))
    return tagged_sentence

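The backoff chain used in `statistical_pos_tagger` can be pictured with a small pure-Python sketch: look up the (previous tag, word) context first, fall back to the word alone, and finally to a default tag. The helper names `train_unigram`, `train_bigram`, and `tag_with_backoff`, as well as the toy corpus, are assumptions for illustration and only approximate what NLTK's taggers do internally.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """Most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_with_backoff(tokens, bigram, unigram, default="NN"):
    tagged, prev_tag = [], None
    for word in tokens:
        # Try the most specific model first, then fall back step by step
        tag = bigram.get((prev_tag, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Tiny hand-made corpus (an assumption for illustration)
TRAIN = [
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("cats", "NNS"), ("run", "VBP")],
]
unigram = train_unigram(TRAIN)
bigram = train_bigram(TRAIN)

print(tag_with_backoff(["the", "run", "ends"], bigram, unigram))
print(tag_with_backoff(["birds", "run"], bigram, unigram))
```

In the first sentence the bigram context (`DT`, "run") selects `NN`, although "run" is more often `VBP` overall. In the second, "birds" is unseen and receives the default tag, and "run" falls back to its unigram tag because the context (`NN`, "run") never occurred in training.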
In [104]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)


Rule-based POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


Additionally, NLTK has a built-in function, `pos_tag`, which applies a pre-trained tagger (the `averaged_perceptron_tagger` model downloaded above).
See the example below:

In [105]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
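
The tag names in these outputs come from the Penn Treebank tagset. As a reading aid, the sketch below maps the tags that appear in this notebook to short descriptions; the dictionary name `PENN_TAGS` and the helper `describe` are hypothetical, and the list is deliberately limited to tags seen here rather than the full tagset.

```python
# Penn Treebank tags that appear in this notebook's outputs
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "POS": "possessive ending",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
    ".": "sentence-final punctuation",
}


def describe(tagged_sentence):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unlisted tag"))
            for word, tag in tagged_sentence]


tagged = [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'),
          ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
for word, tag, meaning in describe(tagged):
    print(f"{word:<10}{tag:<6}{meaning}")
```

Note that `UNKNOWN`, as emitted by the rule-based tagger, is not a Penn Treebank tag and is reported as unlisted.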


### Exercise

Update the rule-based tagger with regex patterns. An example could be:

```
(r'\b\w+s\b|\b\w+es\b', 'NN'),     # Nouns ending in -s or -es
```

From here, provide an updated rule-based tagger and a statistical tagger that can assign part-of-speech tags to the following complex sentence:

```
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
```

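One possible shape for such pattern rules is sketched here, under a few assumptions: the tagger receives a list of tokens instead of a raw string, rules are ordered from specific to general, and `fullmatch` is used so that a pattern must cover the whole token. The rule set, the tag choices, and the name `tag_tokens` are illustrative and not a reference solution.

```python
import re

# Ordered from specific to general; the first full match wins.
RULES = [
    (re.compile(r"(the|a|an)", re.IGNORECASE), "DT"),  # articles
    (re.compile(r"\d+(\.\d+)?"), "CD"),                # numbers
    (re.compile(r"\w+ing"), "VBG"),                    # gerunds / participles
    (re.compile(r"\w+ed"), "VBD"),                     # simple past
    (re.compile(r"\w+ly"), "RB"),                      # adverbs
    (re.compile(r"\w+s"), "NNS"),                      # plural nouns (over-generalizes)
    (re.compile(r"[.!?]"), "."),                       # sentence-final punctuation
    (re.compile(r"\w+"), "NN"),                        # catch-all for remaining words
]


def tag_tokens(tokens, rules=RULES):
    tagged = []
    for token in tokens:
        for pattern, tag in rules:
            if pattern.fullmatch(token):
                tagged.append((token, tag))
                break
        else:
            # No rule covered the whole token
            tagged.append((token, "UNKNOWN"))
    return tagged


print(tag_tokens(["The", "dogs", "barked", "loudly", "."]))
print(tag_tokens(["jumps"]))
```

Suffix rules are cheap but blunt: "jumps" is tagged `NNS` here, although it is a verb in the exercise sentence. The choice of `fullmatch` also matters, because `re.match` only anchors at the start of the string, so a pattern such as `\bit\b` would accept an unsplit token like "it's".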
In [106]:
##### START CODE ####
from nltk.tag import ClassifierBasedPOSTagger

# Updated Rule-based POS Tagger
def updated_rule_based_pos_tagger(sentence):
    # Rules are tried in order and the first match wins, so the word-specific
    # rules come before the general suffix fallback.
    rules = [
        (re.compile(r'\bthe\b', re.IGNORECASE), 'DT'),
        (re.compile(r'\bquick\b'), 'JJ'),
        (re.compile(r'\bbrown\b'), 'JJ'),
        (re.compile(r'\bfox\b'), 'NN'),
        (re.compile(r'\bjumps\b'), 'VBZ'),
        (re.compile(r'\bover\b'), 'IN'),
        (re.compile(r'\blazy\b'), 'JJ'),
        (re.compile(r'\bdog\b'), 'NN'),
        (re.compile(r'\bwhile\b'), 'IN'),
        (re.compile(r'\bit\b'), 'PRP'),
        (re.compile(r'\bis\b'), 'VBZ'),
        (re.compile(r"'s"), 'VBZ'),                 # Clitic in "it's" (= "is"); also matches possessive 's
        (re.compile(r'\braining\b'), 'VBG'),
        (re.compile(r'\bheavily\b'), 'RB'),
        (re.compile(r'\b\w+s\b|\b\w+es\b'), 'NN'),  # Fallback: words ending in -s or -es
        (re.compile(r'[.!?]'), '.'),                # Sentence-final punctuation
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence

# Statistical POS Tagger

# Function to define features for the tagger
def word_features(words, index):
    return {
        'word': words[index],
        'is_first': index == 0,
        'is_last': index == len(words) - 1,
        'prefix-1': words[index][0],
        'prefix-2': words[index][:2],
        'prefix-3': words[index][:3],
        'suffix-1': words[index][-1],
        'suffix-2': words[index][-2:],
        'suffix-3': words[index][-3:],
        'prev_word': '' if index == 0 else words[index - 1],
        'next_word': '' if index == len(words) - 1 else words[index + 1],
        'has_hyphen': '-' in words[index],
        'is_numeric': words[index].isdigit(),
        'capitals_inside': words[index][1:].lower() != words[index][1:]
    }

# Function to train a Naive Bayes POS tagger
def train_naive_bayes_pos_tagger():
    # Load the Treebank sample and hold out the last 20% of sentences for testing
    tagged_sentences = treebank.tagged_sents()
    train_size = int(len(tagged_sentences) * 0.8)
    train_data = tagged_sentences[:train_size]
    test_data = tagged_sentences[train_size:]

    def featurize(tagged_sents):
        # Build one (feature dict, gold tag) pair per token
        featuresets = []
        for sentence in tagged_sents:
            words, tags = zip(*sentence)
            for i in range(len(words)):
                featuresets.append((word_features(words, i), tags[i]))
        return featuresets

    # Train the Naive Bayes classifier on word_features() feature dicts
    classifier = nltk.NaiveBayesClassifier.train(featurize(train_data))

    # Wrap the classifier in a tagger.
    # Caveat: ClassifierBasedPOSTagger extracts its own built-in features when
    # tagging, not word_features(), so the accuracy printed here (measured with
    # word_features) does not describe the tags this tagger produces.
    # To tag with the training features, wrap the classifier with
    # nltk.tag.ClassifierBasedTagger(
    #     feature_detector=lambda tokens, i, history: word_features(tokens, i),
    #     classifier=classifier)
    tagger = ClassifierBasedPOSTagger(classifier=classifier)

    # Per-token accuracy of the classifier itself on the held-out data
    accuracy = nltk.classify.accuracy(classifier, featurize(test_data))
    print("Accuracy of Naive Bayes classifier based POS tagger :", accuracy)

    return tagger
#### END CODE ####


In [107]:
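# Function to evaluate rule-based POS tagger
def evaluate_rule_based_pos_tagger(test_sents):
    total = 0
    correct = 0
    for sent in test_sents:
        words = [word for word, tag in sent]
        true_tags = [tag for word, tag in sent]
        # Caveat: the tagger re-tokenizes the joined string, so the number of
        # predicted tokens can differ from the gold tokens; zip() then stops at
        # the shorter list and tags may be compared out of alignment.
        predicted_tags = [tag for word, tag in updated_rule_based_pos_tagger(' '.join(words))]
        for true_tag, predicted_tag in zip(true_tags, predicted_tags):
            total += 1
            if true_tag == predicted_tag:
                correct += 1
    # Guard against an empty test set
    return correct / total if total else 0.0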

# Load the test data
test_sents = treebank.tagged_sents()[3000:3200]
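
When predictions come from re-tokenized text, gold and predicted tags can drift out of alignment. The sketch below shows one way to avoid that, assuming the tagger can accept a list of tokens directly. The names `token_accuracy` and `toy_tagger` and the one-sentence gold set are hypothetical stand-ins, not part of the notebook's code.

```python
def token_accuracy(gold_sents, tag_fn):
    """Per-token accuracy of tag_fn, which maps a token list to (token, tag) pairs."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        gold = [tag for _, tag in sent]
        predicted = [tag for _, tag in tag_fn(words)]
        if len(predicted) != len(gold):
            # Fail loudly instead of silently comparing misaligned tags
            raise ValueError("tagger changed the number of tokens")
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return correct / total if total else 0.0


def toy_tagger(tokens):
    # Hypothetical lookup tagger used only to exercise token_accuracy
    lookup = {"the": "DT", "cat": "NN", ".": "."}
    return [(tok, lookup.get(tok.lower(), "NN")) for tok in tokens]


GOLD = [[("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")]]
print(token_accuracy(GOLD, toy_tagger))
```

Because the gold tokens are passed through unchanged, every predicted tag is compared with the gold tag of the same token, and a length mismatch surfaces as an error instead of a distorted score.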


In [108]:
# Show printed output below
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."

# Apply rule-based POS tagger
rule_based_tags = updated_rule_based_pos_tagger(sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Evaluate the rule-based POS tagger
accuracy = evaluate_rule_based_pos_tagger(test_sents)
print("Rule-based POS Tagger Accuracy:", accuracy)

# Training the Naive Bayes POS tagger
tagger = train_naive_bayes_pos_tagger()
tagged_text = tagger.tag(nltk.word_tokenize(sentence))
print("Naive Bayes POS Tags:")
print(tagged_text)

# Apply NLTK's built-in POS tagger
nltk_tags = nltk.pos_tag(word_tokenize(sentence))
print("NLTK Built-in POS Tags:")
print(nltk_tags)

Rule-based POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]
Rule-based POS Tagger Accuracy: 0.06599298824499897
Accuracy of Naive Bayes classifier based POS tagger : 0.9299865262737661
Naive Bayes POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NN'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'POS'), ('raining', 'NN'), ('heavily', 'RB'), ('.', '.')]
NLTK Built-in POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]
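
To see where the three taggers differ, the tag sequences printed above can be compared token by token. The sketch below copies the tags from those outputs; the helper name `disagreements` is hypothetical.

```python
TOKENS = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
          "dog", "while", "it", "'s", "raining", "heavily", "."]

# Tag sequences copied from the printed outputs above
RULE_BASED = ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ",
              "NN", "IN", "PRP", "VBZ", "VBG", "RB", "."]
NAIVE_BAYES = ["DT", "JJ", "NN", "NN", "NN", "IN", "DT", "NN",
               "NN", "IN", "PRP", "POS", "NN", "RB", "."]
NLTK_BUILTIN = ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ",
                "NN", "IN", "PRP", "VBZ", "VBG", "RB", "."]


def disagreements(tokens, tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]


print("Rule-based vs. built-in: ", disagreements(TOKENS, RULE_BASED, NLTK_BUILTIN))
print("Naive Bayes vs. built-in:", disagreements(TOKENS, NAIVE_BAYES, NLTK_BUILTIN))
```

On this sentence the hand-written rules differ from the built-in tagger only on "brown", while the Naive Bayes tagger differs on "jumps", "lazy", "'s", and "raining". The rule-based result reflects rules written for exactly this sentence, which is consistent with its low accuracy on the Treebank test sentences.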
