<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Module-3-Assignment/blob/main/M3_Part_II_Assignment_COPY.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I. Rule-based and Statistical Approaches for Part-of-Speech Tagging

Part-of-Speech tagging, also known as POS tagging, is the process of assigning grammatical tags or labels to words in a sentence. The tags represent the syntactic category or part of speech of each word, such as noun, verb, adjective, adverb, etc. POS tagging is an essential step in many Natural Language Processing (NLP) tasks, including parsing, machine translation, and information retrieval.

POS tagging can be approached using different techniques, including rule-based approaches, statistical approaches, and hybrid approaches that combine both. In statistical approaches, Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) are commonly used.

Implement a rule-based part-of-speech (POS) tagger:
* a. Write a set of rules to assign POS tags to words based on their context
* b. Apply the rules to a sample text and evaluate the accuracy of the tagger.
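
As a rough illustration of tasks (a) and (b), the sketch below combines a few lexical and suffix rules with one contextual rule, then scores the result against a hand-labeled sentence. It is a minimal sketch, not a reference solution: the rule set, the helper names `tag_with_rules` and `accuracy`, the whitespace tokenization, and the gold tags are all assumptions made for illustration.

```python
import re

# Hypothetical lexical and suffix rules, tried in order; the first match wins.
RULES = [
    (re.compile(r'^(the|a|an)$', re.IGNORECASE), 'DT'),          # determiners
    (re.compile(r'^(on|in|over|under)$', re.IGNORECASE), 'IN'),  # a few prepositions
    (re.compile(r'^(is|has|does)$', re.IGNORECASE), 'VBZ'),      # a few 3rd-person verbs
    (re.compile(r'^\w+ing$'), 'VBG'),                            # present participles
    (re.compile(r'^\w+ly$'), 'RB'),                              # adverbs
    (re.compile(r'^[.!?]$'), '.'),                               # sentence-final punctuation
]

def tag_with_rules(tokens):
    # Pass 1: context-free rules, with NN as the fallback tag.
    tags = []
    for token in tokens:
        for pattern, tag in RULES:
            if pattern.match(token):
                tags.append(tag)
                break
        else:
            tags.append('NN')
    # Pass 2: one contextual rule. A fallback noun directly followed by
    # another fallback noun is re-tagged as an adjective ("lazy cat").
    final = list(tags)
    for i in range(len(tags) - 1):
        if tags[i] == 'NN' and tags[i + 1] == 'NN':
            final[i] = 'JJ'
    return list(zip(tokens, final))

def accuracy(predicted, gold):
    # Fraction of tokens whose predicted tag equals the gold tag.
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)

# Hand-labeled example (assumed gold tags, Penn Treebank style).
tokens = "The lazy cat is sitting on the mat .".split()
gold = [('The', 'DT'), ('lazy', 'JJ'), ('cat', 'NN'), ('is', 'VBZ'),
        ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]

predicted = tag_with_rules(tokens)
print(predicted)
print("Accuracy:", accuracy(predicted, gold))
```

Because these rules were written with this sentence in mind, the score here says little about how they generalize; a fairer evaluation would use sentences the rules have never seen.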



Implement a statistical POS tagger:

* a. Train a statistical POS tagger on a labeled corpus using a machine learning algorithm such as Naive Bayes or Maximum Entropy.
* b. Apply the trained model to tag a sample text and evaluate its accuracy.
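
To make the idea of a trained tagger concrete, here is a minimal sketch of a most-frequent-tag baseline, which is roughly the idea behind a unigram tagger with a default-tag backoff. The tiny training corpus, the held-out sentence, the lowercasing step, and the helper names `train_unigram`, `tag_unigram`, and `evaluate` are all assumptions for illustration; this is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Tiny hand-made labeled corpus (assumed data, for illustration only).
TRAIN = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('a', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
]
HELD_OUT = [
    [('a', 'DT'), ('cat', 'NN'), ('runs', 'VBZ')],
]

def train_unigram(tagged_sents):
    # Count how often each (lowercased) word carries each tag,
    # then keep only the most frequent tag per word.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens, default='NN'):
    # Unseen words back off to the default tag.
    return [(token, model.get(token.lower(), default)) for token in tokens]

def evaluate(model, gold_sents):
    # Token-level accuracy over a list of gold-tagged sentences.
    correct = total = 0
    for sent in gold_sents:
        predicted = tag_unigram(model, [word for word, _ in sent])
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total

model = train_unigram(TRAIN)
print(tag_unigram(model, ['The', 'bird', 'sleeps']))
print("Held-out accuracy:", evaluate(model, HELD_OUT))
```

The held-out sentence contains a word the model never saw ("runs"), so it falls back to the default tag and is counted as an error. This is the gap that larger corpora and context-aware models are meant to close.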

In [None]:
import nltk
import re
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # two pre-trained English named entity chunkers trained on an ACE corpus (possibly the ACE 2004 Multilingual Training Corpus)
nltk.download('words')  # word list that ne_chunk relies on

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [None]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger #important for POS tagging


# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
    # Define your rules here.
    # Each rule maps a regex to a tag; the first rule that matches a token wins.
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VB'),       # coarse tag; the Penn Treebank tag is VBZ
        (re.compile(r'\bsitting\b'), 'VB'),  # coarse tag; the Penn Treebank tag is VBG
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # for/else: reached only when no rule matched this token
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence

# Statistical POS Tagger
def statistical_pos_tagger(sentence):
    # Labeled corpus: the first 3000 tagged sentences of the treebank sample.
    # Note that the model is retrained on every call to this function.
    train_data = treebank.tagged_sents()[:3000]

    # Split data into training (80%) and testing (20%) sets
    train_size = int(len(train_data) * 0.8)
    train_set = train_data[:train_size]
    test_set = train_data[train_size:]

    # Create a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)  # most frequent tag per word
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)  # also conditions on the previous tag

    # Evaluate on the held-out test set
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence
    tagged_sentence = bigram_tagger.tag(word_tokenize(sentence))
    return tagged_sentence


In [None]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)


Rule-based POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


Additionally, NLTK has a built-in function called `pos_tag`, which uses a pre-trained tagger.
See the example below.

In [None]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


### Exercise

Update the rule-based tagger with regex patterns. An example could be:

```
(r'\b\w+s\b|\b\w+es\b', 'NNS'),     # Plural nouns ending in -s or -es
```

From here, provide an updated rule-based tagger and a statistical tagger that can assign a part-of-speech tag to each word of the following complex sentence:

```
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
```

In [None]:
##### START CODE ####

""" Enter code here"""

#### END CODE ####


In [None]:
#show printed output below
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
"""your POS tagging function"""

# II. Hidden Markov Models (HMM) for POS Tagging
A Hidden Markov Model (HMM) is a statistical model that is widely used in various applications, including Natural Language Processing (NLP). An HMM is a type of generative model that incorporates hidden states and observable outputs. It assumes that there is an underlying sequence of hidden states that generates a sequence of observable outputs.

In the context of NLP, HMMs are often used for tasks such as part-of-speech tagging, where the hidden states represent the part-of-speech tags and the observable outputs are the words in a sentence. HMMs are particularly suitable for such sequential data modeling tasks.

The book "Speech and Language Processing" by Jurafsky and Martin provides a comprehensive explanation of HMMs in Chapter 8, including the underlying mathematical concepts, the training algorithms, and applications in NLP. The chapter also covers various extensions and improvements to the basic HMM model, such as the use of different probability distributions and more sophisticated inference techniques.

Let's break down the components of an HMM:

* Hidden States: The hidden states in an HMM represent the latent variables that generate the observable outputs. For example, in part-of-speech tagging, the hidden states correspond to the different parts of speech (nouns, verbs, adjectives, etc.). The set of possible hidden states is denoted as S.

* Observable Outputs: The observable outputs, also known as emissions, are the data or measurements that are observed. In part-of-speech tagging, the observable outputs correspond to the words in a sentence. The set of possible observable outputs is denoted as V.

* Transition Probabilities: The transition probabilities model the probabilities of transitioning from one hidden state to another. These probabilities capture the underlying dynamics of the system. In part-of-speech tagging, the transition probabilities represent the likelihood of transitioning from one part of speech to another. The transition probabilities are denoted as A, where A(i, j) represents the probability of transitioning from state i to state j.

* Emission Probabilities: The emission probabilities represent the probabilities of observing a specific output given a hidden state. In part-of-speech tagging, the emission probabilities capture the likelihood of observing a word given a particular part of speech. The emission probabilities are denoted as B, where B(j, k) represents the probability of emitting output k from state j.
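
The components above can be written down directly as lookup tables. The sketch below is a toy model with made-up numbers, and it also assumes an initial state distribution (called `PI` here), which the list above does not mention but which is needed to score the first tag. The function `joint_probability` is a hypothetical helper that multiplies the transition and emission probabilities along one tagged sentence.

```python
# Toy HMM parameters; every number below is invented for illustration.
S = ['DT', 'NN', 'VBZ']          # hidden states (tags)
V = ['the', 'dog', 'barks']      # observable outputs (words)

# Assumed initial distribution over the first tag (not part of the list above).
PI = {'DT': 0.8, 'NN': 0.1, 'VBZ': 0.1}

# A[i][j]: probability of moving from tag i to tag j.
A = {
    'DT':  {'DT': 0.0, 'NN': 0.9, 'VBZ': 0.1},
    'NN':  {'DT': 0.1, 'NN': 0.2, 'VBZ': 0.7},
    'VBZ': {'DT': 0.5, 'NN': 0.4, 'VBZ': 0.1},
}

# B[j][k]: probability of emitting word k from tag j.
B = {
    'DT':  {'the': 1.0, 'dog': 0.0, 'barks': 0.0},
    'NN':  {'the': 0.0, 'dog': 0.8, 'barks': 0.2},
    'VBZ': {'the': 0.0, 'dog': 0.1, 'barks': 0.9},
}

def joint_probability(words, tags):
    # P(words, tags) = PI(t1) * B(t1, w1) * product of A(t_prev, t) * B(t, w)
    prob = PI[tags[0]] * B[tags[0]][words[0]]
    for i in range(1, len(words)):
        prob *= A[tags[i - 1]][tags[i]] * B[tags[i]][words[i]]
    return prob

words = ['the', 'dog', 'barks']
print(joint_probability(words, ['DT', 'NN', 'VBZ']))  # "barks" read as a verb
print(joint_probability(words, ['DT', 'NN', 'NN']))   # "barks" read as a noun
```

Under these toy numbers the verb reading of "barks" scores higher than the noun reading, because the NN to VBZ transition and the VBZ emission are both more likely.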


Implement an HMM-based POS tagger:

* a. Train an HMM model on a labeled corpus to learn the transition and emission probabilities.
* b. Use the Viterbi algorithm to decode the most probable sequence of tags for a given sentence.
* c. Apply the HMM tagger to a sample text and evaluate its accuracy.
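
For step (b), the following is a minimal sketch of Viterbi decoding on a toy model. All probabilities are invented, the initial distribution `pi` is an assumption, and the function name `viterbi` and its signature are hypothetical. There is no smoothing, so a word missing from the emission table gets probability zero under every tag; a tagger trained on a real corpus would need to handle that case.

```python
def viterbi(words, states, pi, A, B):
    # best[t][s]: probability of the most likely tag path that ends in
    # state s after emitting words[0..t]; back[t][s]: its previous state.
    best = [{s: pi[s] * B[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prev, score = max(
                ((p, best[t - 1][p] * A[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * B[s].get(words[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return list(zip(words, path)), best[-1][last]

# Toy parameters (made-up numbers, for illustration only).
states = ['DT', 'NN', 'VBZ']
pi = {'DT': 0.8, 'NN': 0.1, 'VBZ': 0.1}
A = {
    'DT':  {'DT': 0.0, 'NN': 0.9, 'VBZ': 0.1},
    'NN':  {'DT': 0.1, 'NN': 0.2, 'VBZ': 0.7},
    'VBZ': {'DT': 0.5, 'NN': 0.4, 'VBZ': 0.1},
}
B = {
    'DT':  {'the': 1.0},
    'NN':  {'dog': 0.8, 'barks': 0.2},
    'VBZ': {'dog': 0.1, 'barks': 0.9},
}

tagged, prob = viterbi(['the', 'dog', 'barks'], states, pi, A, B)
print(tagged)
print(prob)
```

In practice the products are usually replaced by sums of log probabilities to avoid numerical underflow on long sentences; plain probabilities are used here only to keep the sketch readable.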

In [None]:

# Part 2: Hidden Markov Models (HMM) for POS Tagging

# HMM POS Tagger
def hmm_pos_tagger(sentence):
    # Labeled corpus: tagged sentences keep the sentence boundaries that are
    # needed to estimate tag-to-tag transition probabilities
    tagged_sents = treebank.tagged_sents()
    # Train your HMM model here (transition and emission probabilities)
    words = word_tokenize(sentence)
    # Apply the Viterbi algorithm to get the most probable sequence of tags
    # tagged_sentence = ...
    tagged_sentence = []  # placeholder until the steps above are implemented
    return tagged_sentence

In [None]:
# HMM POS Tagging
hmm_tags = hmm_pos_tagger(sample_sentence)
print("HMM POS Tags:")
print(hmm_tags)


# III. Named Entity Recognition (NER) Methods and Evaluation

Implement a rule-based NER system:
* a. Define a set of rules to identify named entities (e.g., person names, locations, organizations) in a text.
* b. Apply the rules to a sample text and evaluate the precision, recall, and F1-score of the NER system.
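
One possible shape for such rules is sketched below: capitalized word sequences are candidate entities, and small word lists plus the preceding word decide the label. This is a minimal sketch under stated assumptions; the gazetteer contents, the cue words, the labels, and the function name `rule_based_entities` are all hypothetical, and real rule sets are far larger.

```python
import re

# Hypothetical word lists; a real system would use much larger ones.
KNOWN_LOCATIONS = {'Hawaii', 'Chicago', 'Kenya'}
LOCATION_CUES = {'in', 'at', 'from'}
PERSON_TITLES = {'Mr.', 'Ms.', 'Dr.'}

# One or more consecutive capitalized words, e.g. "Barack Obama".
CAPITALIZED_SPAN = re.compile(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*')

def rule_based_entities(text):
    entities = []
    for match in CAPITALIZED_SPAN.finditer(text):
        span = match.group()
        # The word right before the span acts as a contextual cue.
        before = text[:match.start()].split()
        previous = before[-1] if before else ''
        if span in KNOWN_LOCATIONS or previous in LOCATION_CUES:
            label = 'LOCATION'
        elif previous in PERSON_TITLES or len(span.split()) >= 2:
            label = 'PERSON'
        else:
            # A lone capitalized word with no cue (often just a
            # sentence-initial word) is not treated as an entity.
            continue
        entities.append((span, label))
    return entities

print(rule_based_entities("Barack Obama was born in Hawaii."))
print(rule_based_entities("Dr. Smith moved from Chicago."))
```

Rules like these are easy to read and debug, but they miss anything outside the lists and cues, which is one of the weaknesses the comparison with a statistical system is meant to expose.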

Implement a statistical NER system using a pre-trained model:
* a. Train a statistical NER model on a labeled corpus using machine learning techniques like Conditional Random Fields (CRF) or Support Vector Machines (SVM).
* b. Apply the trained model to identify named entities in a sample text and evaluate its performance using precision, recall, and F1-score.

Compare the performance of rule-based and statistical NER systems:
* a. Apply both the rule-based and statistical NER systems to the same sample text.
* b. Compare their precision, recall, and F1-score to analyze their strengths and weaknesses.

Note: For evaluating the taggers and NER systems, you can use labeled datasets available in NLTK or other sources. You should report metrics such as accuracy, precision, recall, and F1-score for a comprehensive analysis of the implemented approaches.
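
The metrics named in the note can be computed with a few lines of code. The sketch below assumes entity-level, exact-match scoring, where each entity is a `(text, label)` pair and a prediction counts as correct only if both parts match a gold entity. The function name `precision_recall_f1` and the example predictions are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    # Exact-match scoring over sets of (text, label) pairs.
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed gold annotation and an imperfect, made-up system output.
gold = [('Barack Obama', 'PERSON'), ('Hawaii', 'LOCATION')]
predicted = [('Barack Obama', 'PERSON'), ('Hawaii', 'ORGANIZATION'), ('Born', 'PERSON')]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

Here only one of the three predictions matches the gold annotation exactly, so precision is 1/3, while one of the two gold entities is found, so recall is 1/2. Running the same function on the output of each system over the same text gives directly comparable numbers.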

In [None]:


# Part 3: Named Entity Recognition (NER) Methods and Evaluation

# Rule-based NER
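def rule_based_ner(sentence):
    # Apply your rule-based NER here.
    # Note: ne_chunk is NLTK's pre-trained named entity chunker, not a
    # hand-written rule set; it serves here as a working placeholder.
    tagged_sentence = ne_chunk(pos_tag(word_tokenize(sentence)))
    return tagged_sentence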

# Statistical NER
def statistical_ner(sentence):
    # Train your statistical NER model on a labeled corpus
    tagged_sentences = treebank.tagged_sents()
    # Train your statistical NER model here
    # Apply the trained model to identify named entities in the sentence
    tagged_sentence = ...
    return tagged_sentence

In [None]:
sample_sentence = "Barack Obama was born in Hawaii."

# Rule-based NER
rule_based_ner_tags = rule_based_ner(sample_sentence)
print("Rule-based NER Tags:")
print(rule_based_ner_tags)

# Statistical NER
statistical_ner_tags = statistical_ner(sample_sentence)
print("Statistical NER Tags:")
print(statistical_ner_tags)
