
# I. Rule-based and Statistical Approaches for Part-of-Speech Tagging

Part-of-Speech tagging, also known as POS tagging, is the process of assigning grammatical tags or labels to words in a sentence. The tags represent the syntactic category or part of speech of each word, such as noun, verb, adjective, adverb, etc. POS tagging is an essential step in many Natural Language Processing (NLP) tasks, including parsing, machine translation, and information retrieval.

POS tagging can be approached using different techniques, including rule-based approaches, statistical approaches, and hybrid approaches that combine both. In statistical approaches, Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) are commonly used.
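
To make the HMM idea concrete: an HMM tagger scores a tag sequence by multiplying tag-to-tag transition probabilities with tag-to-word emission probabilities, and the Viterbi algorithm finds the highest-scoring sequence. The following is a minimal sketch in plain Python. The two-tag set (`N`, `V`), all probabilities, and the function name `viterbi` are invented for illustration and are not estimated from any corpus; the sketch also assumes a non-empty token list.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most probable (token, tag) sequence under a toy HMM."""
    # best[i][s]: probability of the best tag path that ends in state s at position i
    best = [{s: start_p[s] * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back = [{}]  # back[i][s]: previous state on that best path
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score * emit_p[s].get(tokens[i], 0.0)
            back[i][s] = prev
    # Start from the best final state and follow the back-pointers.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return list(zip(tokens, path))

# Toy model (all numbers are assumptions made for this example).
STATES = ["N", "V"]
START_P = {"N": 0.8, "V": 0.2}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT_P = {
    "N": {"dogs": 0.6, "cats": 0.3, "bark": 0.1},
    "V": {"dogs": 0.05, "cats": 0.05, "bark": 0.9},
}

print(viterbi(["dogs", "bark"], STATES, START_P, TRANS_P, EMIT_P))
# [('dogs', 'N'), ('bark', 'V')]
```

In a real tagger the three probability tables would be estimated from counts in a labeled corpus, and log-probabilities would usually replace the raw products to avoid underflow on long sentences.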

Implement a rule-based part-of-speech (POS) tagger:
* a. Write a set of rules to assign POS tags to words based on their context.
* b. Apply the rules to a sample text and evaluate the accuracy of the tagger.
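
One possible way to make rules depend on context, as item (a) asks, is a two-pass scheme: look each word up in a small lexicon, then resolve the remaining words from the tag of the preceding word. The sketch below uses only plain Python. The lexicon, the two context rules, and the function name `tag_with_context` are assumptions made for this example, not part of the assignment.

```python
# Hypothetical two-pass rule-based tagger: lexicon lookup, then context rules.
LEXICON = {"the": "DT", "a": "DT", "on": "IN", "is": "VBZ"}  # toy lexicon (assumption)

def tag_with_context(tokens):
    # Pass 1: look each token up in the lexicon; unknown words get None.
    tags = [LEXICON.get(tok.lower()) for tok in tokens]

    # Pass 2: resolve unknown words from the tag of the previous token.
    for i, tok in enumerate(tokens):
        if tags[i] is not None:
            continue
        prev = tags[i - 1] if i > 0 else None
        if prev == "DT":
            tags[i] = "NN"   # a word right after a determiner is treated as a noun
        elif prev == "VBZ" and tok.endswith("ing"):
            tags[i] = "VBG"  # "is" followed by an -ing form: present participle
        else:
            tags[i] = "NN"   # fallback, similar to a default tagger
    return list(zip(tokens, tags))

print(tag_with_context(["The", "cat", "is", "sitting", "on", "the", "mat"]))
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'),
#  ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
```

The determiner rule is deliberately crude: it would mislabel the adjective in "the lazy dog", which is the kind of case a longer rule list has to handle.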



Implement a statistical POS tagger trained on a labeled corpus:

* a. Train a statistical POS tagger on a labeled corpus using a machine learning algorithm such as Naive Bayes or Maximum Entropy.
* b. Apply the trained model to tag a sample text and evaluate its accuracy.
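
As a point of reference for item (a), the simplest statistical tagger memorizes the most frequent tag of each word in the training data and backs off to a default tag for unseen words. The following is a minimal sketch of that idea in plain Python. The toy corpus and the names `train_unigram` and `tag_unigram` are invented for illustration; a real experiment would train on a labeled corpus such as the Penn Treebank sample.

```python
from collections import Counter, defaultdict

# Toy labeled corpus (assumption): each sentence is a list of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def train_unigram(tagged_sents):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_unigram(model, tokens, default="NN"):
    """Tag each token with its most frequent training tag, else the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_unigram(TOY_CORPUS)
print(tag_unigram(model, ["The", "cat", "barks", "loudly"]))
# [('The', 'DT'), ('cat', 'NN'), ('barks', 'VBZ'), ('loudly', 'NN')]
```

The unseen word "loudly" falls back to `NN`, which is wrong here; this is the gap that context-aware models such as bigram taggers, HMMs, or MEMMs try to close.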





In [None]:
import nltk
import re
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # Contains two pre-trained English named-entity chunkers trained on an ACE corpus (possibly the ACE 2004 Multilingual Training Corpus)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [None]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger #important for POS tagging


# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
    # Ordered (pattern, tag) rules; the first pattern that matches a token wins.
    # 'VB' is used here as a coarse verb tag; the Penn Treebank tags would be
    # 'VBZ' for "is" and 'VBG' for "sitting".
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VB'),
        (re.compile(r'\bsitting\b'), 'VB'),
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # No rule matched (the inner loop finished without a break)
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence

# Statistical POS Tagger
def statistical_pos_tagger(sentence):
    # Use the first 3000 tagged sentences of the treebank corpus as labeled data
    tagged_sents = treebank.tagged_sents()[:3000]

    # Split the data 80/20 into training and testing sets
    train_size = int(len(tagged_sents) * 0.8)
    train_set = tagged_sents[:train_size]
    test_set = tagged_sents[train_size:]

    # Build a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # Assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)  # Most frequent tag per word
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)  # Also conditions on the previous tag

    # Evaluate on the held-out test set
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence.
    # Note: the model is retrained on every call to this function.
    return bigram_tagger.tag(word_tokenize(sentence))

In [None]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)


Rule-based POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
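
Item (b) of the rule-based task asks for an accuracy figure, but the run above reports accuracy only for the statistical tagger. One rough way to get a number is to compare the two outputs token by token. The sketch below treats the statistical tags as the reference, which is an assumption: they are model predictions, not gold annotations. The helper name `tag_agreement` is also invented for this example.

```python
# Outputs copied from the run above.
rule_based = [('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'),
              ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
statistical = [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'),
               ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]

def tag_agreement(predicted, reference):
    """Return the fraction of tokens whose tags match, plus the mismatches."""
    if len(predicted) != len(reference):
        raise ValueError("both taggings must cover the same tokens")
    mismatches = [(w, p, r) for (w, p), (_, r) in zip(predicted, reference) if p != r]
    return 1 - len(mismatches) / len(predicted), mismatches

score, mismatches = tag_agreement(rule_based, statistical)
print(f"Agreement: {score:.3f}")  # Agreement: 0.625 (5 of 8 tokens match)
for word, predicted_tag, reference_tag in mismatches:
    print(word, predicted_tag, reference_tag)
```

The three mismatches ("is", "sitting", and ".") show where the rule list is too coarse: it has a single verb tag and no rule for punctuation.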


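Additionally, NLTK has a built-in function called `pos_tag`, which tags tokens with a pre-trained model (the `averaged_perceptron_tagger` downloaded earlier). See the example below.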

In [None]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


### Exercise

Update the rule-based tagger with regex patterns. An example could be:

`(r'\b\w+s\b|\b\w+es\b', 'NN'),  # Nouns ending in -s or -es`
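
To illustrate how suffix patterns like this behave, here is a minimal sketch of an ordered regex rule list in which the first full match wins. The specific patterns, their order, and the name `regex_tag` are assumptions made for this example. Unlike the pattern above, it tags words ending in -s as `NNS`, the Penn Treebank tag for plural nouns.

```python
import re

# Ordered (pattern, tag) rules; the first pattern that fully matches a token wins.
SUFFIX_RULES = [
    (re.compile(r"the|a|an", re.IGNORECASE), "DT"),  # determiners
    (re.compile(r".*ing"), "VBG"),  # gerunds / present participles
    (re.compile(r".*ly"), "RB"),    # many adverbs
    (re.compile(r".*ed"), "VBD"),   # simple past
    (re.compile(r".*s"), "NNS"),    # plural nouns (also catches verbs such as "jumps")
    (re.compile(r".*"), "NN"),      # default: singular noun
]

def regex_tag(tokens):
    """Tag each token with the tag of the first rule that matches it entirely."""
    tagged = []
    for tok in tokens:
        for pattern, tag in SUFFIX_RULES:
            if pattern.fullmatch(tok):
                tagged.append((tok, tag))
                break
    return tagged

print(regex_tag(["The", "dogs", "barked", "loudly", "while", "running"]))
# [('The', 'DT'), ('dogs', 'NNS'), ('barked', 'VBD'), ('loudly', 'RB'),
#  ('while', 'NN'), ('running', 'VBG')]
```

Rule order matters: a token is claimed by the first pattern it matches, so specific rules should come before general ones. Suffixes alone are also ambiguous, as "while" and any third-person verb ending in -s show.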

From here, provide an updated rule-based tagger and a statistical tagger that can assign part-of-speech tags to the following complex sentence:

```
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
```

In [None]:
##### START CODE ####

""" Enter code here"""

#### END CODE ####


' Enter code here'

In [None]:
#show printed output below
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
"""your POS tagging function"""

'your POS tagging function'