**Name** : Bodhisatya Ghosh \
**Class** : CSE DS \
**UID** : 2021700026 \
**Subject** : NLP \
**Experiment number** : 5 \
\
**Aim**:
Print emission & transition matrix \
Calculate POS tags for a given sentence

### Questions

1. Ways for tagging parts of speech:

   a. Rule-based tagging: Utilizes predefined grammatical rules to assign parts of speech based on word patterns and context.
   
   b. Dictionary-based tagging: Matches words against a pre-built dictionary that includes information about the part of speech of each word.
   
   c. Probabilistic (stochastic) tagging: Uses statistical models such as Hidden Markov Models (HMMs) to estimate, from an annotated training corpus, how likely each word is to carry each part of speech.

   d. Machine learning-based tagging: Trains a classifier or sequence model, such as a Conditional Random Field (CRF) or a neural network, on annotated corpora to predict parts of speech.
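
The first two approaches can be combined in a few lines. The sketch below is only an illustration: the lexicon, the regular-expression rules, and the `NOUN` default are hypothetical choices, not part of the experiment's code.

```python
import re

# Hypothetical toy lexicon (dictionary-based tagging): word -> tag.
LEXICON = {"the": "DET", "a": "DET", "will": "VERB", "join": "VERB", "board": "NOUN"}

# Hypothetical fallback rules (rule-based tagging), tried in order.
RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "NUM"),   # digits, optionally with a decimal part
    (re.compile(r".+ly$"), "ADV"),           # words ending in -ly
    (re.compile(r".+(ing|ed)$"), "VERB"),    # words ending in -ing or -ed
]

def tag_word(word, default="NOUN"):
    """Look the word up in the lexicon, then try the rules, then use the default."""
    lowered = word.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    for pattern, tag in RULES:
        if pattern.match(lowered):
            return tag
    return default

def tag_sentence(words):
    """Tag every word independently of its neighbours."""
    return [(word, tag_word(word)) for word in words]

print(tag_sentence(["Vinken", "will", "join", "the", "board"]))
```

Because each word is tagged in isolation, a tagger like this cannot resolve words whose tag depends on context, which is the gap the probabilistic approaches address.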

2. Finding the most probable sequence of POS tags:

   a. Hidden Markov Models (HMMs): Use the Viterbi algorithm to find the most probable sequence of POS tags based on the emission and transition probabilities.
   
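   b. Conditional Random Fields (CRFs): Model the conditional probability of the whole tag sequence given the sentence, using features of both local and wider context. For linear-chain CRFs, the most probable sequence is likewise found with Viterbi-style dynamic programming.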
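
As a rough sketch of the HMM route, the following estimates emission and transition counts from the four toy sentences used later in this notebook and decodes with the Viterbi algorithm. The `<s>` start-of-sentence state and the helper names `train`, `prob`, and `viterbi` are assumptions made for this illustration, and no smoothing is applied, so a word never seen in training gives every path a probability of zero.

```python
from collections import Counter, defaultdict

# Toy corpus taken from the notebook's second cell (tag names kept as written there).
SENTENCES = [
    [("Mary", "NOUN"), ("Jane", "NOUN"), ("can", "MODEL"), ("see", "VERB"), ("Will", "NOUN")],
    [("Spot", "NOUN"), ("will", "MODEL"), ("see", "VERB"), ("Mary", "NOUN")],
    [("Will", "MODEL"), ("Jane", "NOUN"), ("spot", "VERB"), ("Mary", "NOUN"), ("?", ".")],
    [("Mary", "NOUN"), ("will", "MODEL"), ("pat", "VERB"), ("Spot", "NOUN")],
]

def train(sentences):
    """Count emissions and transitions; "<s>" is an assumed start-of-sentence state."""
    emit, trans = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans

def prob(counter, key):
    """Relative frequency of `key` in `counter` (0.0 if the counter is empty)."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

def viterbi(words, emit, trans):
    tags = list(emit)
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (prob(trans["<s>"], t) * prob(emit[t], words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] * prob(trans[p], t), p) for p in tags)
            column[t] = (score * prob(emit[t], word), prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path, p_best = [last], best[-1][last][0]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path)), p_best

emit, trans = train(SENTENCES)
print(viterbi(["Will", "can", "see", "Mary"], emit, trans))
```

Unlike picking the best tag for each word separately, the dynamic program scores whole tag sequences, so the choice for "Will" is influenced by the tags that can plausibly follow it.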

3. Markov chain vs. Markov model:

   - Markov Chain: A mathematical model representing a sequence of events where the probability of transitioning to any particular state depends solely on the current state. It has discrete states and transition probabilities.
   
   - Markov Model: A broader term that covers any model built on the Markov property, including Markov Chains. Markov Models can have discrete or continuous states, and they may carry parameters beyond transition probabilities. A Hidden Markov Model, for example, adds emission probabilities because its states are not observed directly.
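
The difference can be seen by sampling from both. In this sketch the weather states, the activities, and all probabilities are hypothetical values chosen for illustration: the chain outputs its states directly, while the hidden Markov model shows only what each hidden state emits.

```python
import random

# Hypothetical weather chain: each row gives P(next state | current state).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
# Extra parameters that turn the chain into a hidden Markov model:
# P(observation | hidden state).
EMISSIONS = {
    "sunny": {"walk": 0.7, "read": 0.3},
    "rainy": {"walk": 0.1, "read": 0.9},
}

def draw(distribution, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def sample_chain(start, steps, rng):
    """Markov chain: the states themselves are the output."""
    states = [start]
    for _ in range(steps - 1):
        states.append(draw(TRANSITIONS[states[-1]], rng))
    return states

def sample_hmm(start, steps, rng):
    """Hidden Markov model: states evolve as a chain, but only emissions are observed."""
    hidden = sample_chain(start, steps, rng)
    observed = [draw(EMISSIONS[state], rng) for state in hidden]
    return hidden, observed

rng = random.Random(0)
print(sample_chain("sunny", 5, rng))
print(sample_hmm("sunny", 5, rng))
```

In POS tagging the tags play the role of the hidden states and the words are the observations.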

4. Identifying whether a system follows a Markov Process:

   - If a system exhibits the Markov property, meaning the future state depends only on the present state and not on the sequence of events leading to the present state, it can be considered a Markov Process. This property can be assessed by analyzing the conditional probability distribution of future states given the current state.
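
One informal way to probe this on an observed sequence is to compare the estimated distribution of the next symbol given the current symbol with the same distribution given the two most recent symbols. The sketch below is a heuristic under stated assumptions, not a statistical test: `conditional` and `markov_gap` are hypothetical helper names, and with limited data a non-zero gap may simply be sampling noise.

```python
from collections import Counter, defaultdict

def conditional(sequence, order):
    """Estimate P(next | previous `order` symbols) from relative frequencies."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        context = tuple(sequence[i - order:i])
        counts[context][sequence[i]] += 1
    return {
        context: {sym: n / sum(nexts.values()) for sym, n in nexts.items()}
        for context, nexts in counts.items()
    }

def markov_gap(sequence):
    """Largest change in P(next | current) when one more past symbol is added."""
    first = conditional(sequence, 1)
    second = conditional(sequence, 2)
    gap = 0.0
    for context, dist in second.items():
        baseline = first[context[-1:]]  # distribution that ignores the older symbol
        for sym in set(dist) | set(baseline):
            gap = max(gap, abs(dist.get(sym, 0.0) - baseline.get(sym, 0.0)))
    return gap

print(markov_gap("AB" * 10))   # the current symbol alone determines the next one
print(markov_gap("AAB" * 10))  # the symbol before the current one also matters
```

A gap near zero is consistent with a first-order Markov process, while a large gap suggests the future depends on more of the history than the present state.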

5. Use of Markov Chains in text generation algorithms:

   - Markov Chains can model the transition probabilities between words or characters in a text. By analyzing a training corpus, the probabilities of transitioning from one word to another can be learned.
   
   - In text generation, a Markov Chain predicts the next word by sampling from the transition probabilities of the current state (the previous word or words). This makes Markov Chains a simple tool for producing text that is locally plausible, although coherence over longer spans is limited by how few previous words the state contains.
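
A minimal word-level generator along these lines might look as follows. The training text is a hypothetical string that reuses the vocabulary of this notebook's toy sentences, and `build_chain` and `generate` are illustrative names.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the training text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain from `start`, sampling each next word by observed frequency."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        followers = chain.get(output[-1])
        if not followers:  # dead end: the word was never followed by anything
            break
        output.append(rng.choice(followers))
    return " ".join(output)

# Hypothetical training text using the vocabulary of the notebook's toy sentences.
corpus = "Mary will see Spot . Spot will see Mary . Mary can pat Spot ."
chain = build_chain(corpus)
print(generate(chain, "Mary", 8))
```

Storing followers as a list with repeats means `rng.choice` samples in proportion to how often each transition occurred, so no explicit probability table is needed.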

In [4]:
import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tabulate import tabulate  

nltk.download('treebank')

nltk.download('universal_tagset')

nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))

print(nltk_data[:2])

for sent in nltk_data[:2]:
    for tagged_word in sent:  # avoid shadowing the built-in `tuple`
        print(tagged_word)

train_set, test_set = train_test_split(nltk_data, train_size=0.30, test_size=0.70, random_state=101)

train_tagged_words = [tup for sent in train_set for tup in sent]
test_tagged_words = [tup for sent in test_set for tup in sent]

print(len(train_tagged_words))
print(len(test_tagged_words))

tags = {tag for word, tag in train_tagged_words}
print(len(tags))
print(tags)

vocab = {word for word, tag in train_tagged_words}

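# emission counts: returns (count of `word` tagged `tag`, count of `tag`);
# their ratio is the emission probability P(word | tag)
def word_given_tag(word, tag, train_bag=train_tagged_words):
    tag_list = [pair for pair in train_bag if pair[1] == tag]
    count_tag = len(tag_list)
    w_given_tag_list = [pair[0] for pair in tag_list if pair[0] == word]
    count_w_given_tag = len(w_given_tag_list)

    return count_w_given_tag, count_tag

# transition counts: returns (count of `t1` followed by `t2`, count of `t1`);
# their ratio is the transition probability P(t2 | t1)
# note: tags are read from the flattened word list, so the last tag of one sentence
# followed by the first tag of the next sentence is also counted as a transition
def t2_given_t1(t2, t1, train_bag=train_tagged_words):
    tag_sequence = [pair[1] for pair in train_bag]
    count_t1 = len([t for t in tag_sequence if t == t1])
    count_t2_t1 = sum(1 for i in range(len(tag_sequence) - 1) if tag_sequence[i] == t1 and tag_sequence[i + 1] == t2)

    return count_t2_t1, count_t1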

tags_matrix = pd.DataFrame(columns=list(tags), index=list(vocab))

for tag in tags:
    for word in vocab:
        # call the counting helper once per cell instead of twice
        count_w_given_tag, count_tag = word_given_tag(word, tag)
        tags_matrix.at[word, tag] = count_w_given_tag / count_tag

tags_transition_matrix = pd.DataFrame(columns=list(tags), index=list(tags))

for t1 in tags:
    for t2 in tags:
        count_t2_t1, count_t1 = t2_given_t1(t2, t1)
        tags_transition_matrix.at[t1, t2] = count_t2_t1 / count_t1

print("\nEmission Probabilities:")
print(tabulate(tags_matrix, headers='keys', tablefmt='fancy_grid'))

print("\nTransition Probabilities:")
print(tabulate(tags_transition_matrix, headers='keys', tablefmt='fancy_grid'))

# test sentence
test_sentence = "Vinken will join the board"

# tokenize the test sentence (word_tokenize relies on NLTK's Punkt tokenizer data being installed)
words = nltk.word_tokenize(test_sentence)

# tag each word with its highest-emission tag (a greedy baseline, not Viterbi decoding);
# a word missing from the training vocabulary raises a KeyError in tags_matrix.at
tagged_sentence = [(word, max(tags, key=lambda tag: tags_matrix.at[word, tag])) for word in words]

# probability of this tagging: the product of every word's emission probability
# and the transition probability between each pair of consecutive chosen tags
sentence_probability = 1.0
previous_tag = None
for word, tag in tagged_sentence:
    if previous_tag is not None:
        sentence_probability *= tags_transition_matrix.at[previous_tag, tag]
    sentence_probability *= tags_matrix.at[word, tag]
    previous_tag = tag

print("\nOverall Probability of Tagged Sentence:", sentence_probability)

print("\nTagged Sentence with Final Probabilities:")
print(tabulate(tagged_sentence, headers=['Word', 'Tag'], tablefmt='fancy_grid'))

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Rommel\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Rommel\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')], [('Mr.', 'NOUN'), ('Vinken', 'NOUN'), ('is', 'VERB'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Elsevier', 'NOUN'), ('N.V.', 'NOUN'), (',', '.'), ('the', 'DET'), ('Dutch', 'NOUN'), ('publishing', 'VERB'), ('group', 'NOUN'), ('.', '.')]]
('Pierre', 'NOUN')
('Vinken', 'NOUN')
(',', '.')
('61', 'NUM')
('years', 'NOUN')
('old', 'ADJ')
(',', '.')
('will', 'VERB')
('join', 'VERB')
('the', 'DET')
('board', 'NOUN')
('as', 'ADP')
('a', 'DET')
('nonexecutive', 'ADJ')
('director', 'NOUN')
('Nov.', 'NOUN')
('29', 'NUM')
('.', '.')
('Mr.', 'NOUN')
('Vinken', 'NOUN')
('is', 'VERB')
('chairman', 'NOUN')
('of', 'ADP')
('Elsevier', 'NOUN')
('N.V.', 'NOUN')
(',', '.')
('the', 'DET')
('Dutch', 'NOUN')
('

In [5]:
import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tabulate import tabulate  

nltk.download('treebank')

nltk.download('universal_tagset')

# note: 'MODEL' is the label this toy data uses for modal verbs (can, will)
training_data = [
    [('Mary', 'NOUN'), ('Jane', 'NOUN'), ('can', 'MODEL'), ('see', 'VERB'), ('Will', 'NOUN')],
    [('Spot', 'NOUN'), ('will', 'MODEL'), ('see', 'VERB'), ('Mary', 'NOUN')],
    [('Will', 'MODEL'), ('Jane', 'NOUN'), ('spot', 'VERB'), ('Mary', 'NOUN'), ('?', '.')],
    [('Mary', 'NOUN'), ('will', 'MODEL'), ('pat', 'VERB'), ('Spot', 'NOUN')]
]

train_set, test_set = train_test_split(training_data, train_size=0.80, test_size=0.20, random_state=101)

train_tagged_words = [tup for sent in train_set for tup in sent]
test_tagged_words = [tup for sent in test_set for tup in sent]

tags = {tag for word, tag in train_tagged_words}

vocab = {word for word, tag in train_tagged_words}

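# emission counts: returns (count of `word` tagged `tag`, count of `tag`);
# their ratio is the emission probability P(word | tag)
def word_given_tag(word, tag, train_bag=train_tagged_words):
    tag_list = [pair for pair in train_bag if pair[1] == tag]
    count_tag = len(tag_list)
    w_given_tag_list = [pair[0] for pair in tag_list if pair[0] == word]
    count_w_given_tag = len(w_given_tag_list)

    return count_w_given_tag, count_tag

# transition counts: returns (count of `t1` followed by `t2`, count of `t1`);
# their ratio is the transition probability P(t2 | t1)
# note: tags are read from the flattened word list, so the last tag of one sentence
# followed by the first tag of the next sentence is also counted as a transition
def t2_given_t1(t2, t1, train_bag=train_tagged_words):
    tag_sequence = [pair[1] for pair in train_bag]
    count_t1 = len([t for t in tag_sequence if t == t1])
    count_t2_t1 = sum(1 for i in range(len(tag_sequence) - 1) if tag_sequence[i] == t1 and tag_sequence[i + 1] == t2)

    return count_t2_t1, count_t1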

# create a DataFrame to store emission probabilities
tags_matrix = pd.DataFrame(columns=list(tags), index=list(vocab))

for tag in tags:
    for word in vocab:
        # every tag comes from the training words, so count_tag is never zero
        count_w_given_tag, count_tag = word_given_tag(word, tag)
        tags_matrix.at[word, tag] = count_w_given_tag / count_tag


tags_transition_matrix = pd.DataFrame(columns=list(tags), index=list(tags))

for t1 in tags:
    for t2 in tags:
        count_t2_t1, count_t1 = t2_given_t1(t2, t1)
        tags_transition_matrix.at[t1, t2] = count_t2_t1 / count_t1

print("\nEmission Probabilities:")
print(tabulate(tags_matrix, headers='keys', tablefmt='fancy_grid'))

print("\nTransition Probabilities:")
print(tabulate(tags_transition_matrix, headers='keys', tablefmt='fancy_grid'))

# test sentence
test_sentence = "Will can see Mary"

# tokenize the test sentence
words = nltk.word_tokenize(test_sentence)

# tag each word with its highest-emission tag (a greedy baseline, not Viterbi decoding);
# words missing from the training vocabulary fall back to 'NOUN'
def most_likely_tag(word):
    if word not in tags_matrix.index:
        print(f"Word '{word}' not found in training data. Using default tag.")
        return 'NOUN'
    return max(tags, key=lambda tag: tags_matrix.at[word, tag])

tagged_sentence = [(word, most_likely_tag(word)) for word in words]

# probability of this tagging: the product of every word's emission probability
# and the transition probability between each pair of consecutive chosen tags
sentence_probability = 1.0
previous_tag = None
for word, tag in tagged_sentence:
    if previous_tag is not None:
        sentence_probability *= tags_transition_matrix.at[previous_tag, tag]
    # unseen words get a small default emission probability instead of a lookup
    sentence_probability *= tags_matrix.at[word, tag] if word in tags_matrix.index else 0.0001
    previous_tag = tag

# print the overall probability of the tagged sentence
print("\nOverall Probability of Tagged Sentence:", sentence_probability)

# print the tagged sentence with final probabilities
print("\nTagged Sentence with Final Probabilities:")
print(tabulate(tagged_sentence, headers=['Word', 'Tag'], tablefmt='fancy_grid'))



Emission Probabilities:
╒══════╤══════════╤══════════╤══════════╕
│      │    MODEL │     NOUN │     VERB │
╞══════╪══════════╪══════════╪══════════╡
│ Spot │ 0        │ 0.285714 │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ pat  │ 0        │ 0        │ 0.333333 │
├──────┼──────────┼──────────┼──────────┤
│ Will │ 0        │ 0.142857 │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ Jane │ 0        │ 0.142857 │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ can  │ 0.333333 │ 0        │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ will │ 0.666667 │ 0        │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ Mary │ 0        │ 0.428571 │ 0        │
├──────┼──────────┼──────────┼──────────┤
│ see  │ 0        │ 0        │ 0.666667 │
╘══════╧══════════╧══════════╧══════════╛

Transition Probabilities:
╒═══════╤══════════╤══════════╤════════╕
│       │    MODEL │     NOUN │   VERB │
╞═══════╪══════════╪══════════╪════════╡
│ MODEL │ 0        │ 0     

