## CS5803 NLP
### Assignment 2
#### Tanmay Garg, Tanmay Goyal, Tanay Yadav
#### Roll no: CS20BTECH11063, AI20BTECH11021, AI20BTECH11026

##### Link to dataset: https://www.kaggle.com/datasets/moxxis/harry-potter-lstm

In [3]:
# importing necessary packages
import re
from collections import defaultdict
from math import log, exp  
import numpy as np
import random
import nltk
import nltk.lm  # explicit import of the language-model submodule used below
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tanmaygoyal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tanmaygoyal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

##### **Q1. Preprocess and tokenize the dataset using NLTK**

In [4]:
# reading the original data
with open('Harry_Potter_all_char_separated.txt', 'r', encoding='utf-8') as file:
    harry_potter_data = file.read()

def preprocess_text(text):
    '''
    Function to preprocess the text by removing punctuation and converting to lower case
    '''
    # [^\w\s] -> ^ means "anything except"; \w matches letters, digits and underscore; \s matches whitespace
    text = re.sub(r'[^\w\s]', '', text).lower()
    return text

def tokenize(text):
    '''
    Function to tokenize the text
    '''
    tokens = nltk.tokenize.word_tokenize(text)
    # stop_words = set(nltk.corpus.stopwords.words('english'))
    # tokens = [token for token in tokens if token not in stop_words]
    return tokens

# preprocessing the text
harry_potter_text = preprocess_text(harry_potter_data)
# using the first 10000 words
harry_potter_tokens = tokenize(harry_potter_text)[:10000]
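
The effect of the punctuation filter is easiest to see on a single sentence. The sketch below applies the same regular expression to a made-up sample string (`sample` and `strip_punctuation` are illustrative names, not part of the assignment code). Note that apostrophes are removed as well, so contractions collapse into single tokens; this is why words such as "hed" and "cant" show up in the generated text later on.

```python
import re

def strip_punctuation(text):
    # Same pattern as preprocess_text: drop every character that is
    # neither a word character (\w) nor whitespace (\s), then lower-case
    return re.sub(r'[^\w\s]', '', text).lower()

# Hypothetical sample sentence, not taken from the dataset
sample = "Harry didn't know; he'd never asked!"
cleaned = strip_punctuation(sample)

print(cleaned)          # harry didnt know hed never asked
print(cleaned.split())  # ['harry', 'didnt', 'know', 'hed', 'never', 'asked']
```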

##### **Q2. Fit two bigram language models on the text: MLE and Kneser-Ney discounting**

In [5]:
# fit two bigram language models on the text: MLE and Kneser-Ney discounting using the nltk library

def MLE_bigram(n , n_gram , vocab):
    '''
    Function to fit a bigram language model using MLE
    '''
    model = nltk.lm.MLE(n)
    model.fit([n_gram], vocabulary_text = vocab)
    return model

def KN_bigram(n , n_gram , vocab):
    '''
    Function to fit a bigram language model using Kneser-Ney discounting
    '''
    model = nltk.lm.KneserNeyInterpolated(n)
    model.fit([n_gram], vocabulary_text = vocab)
    return model

harry_potter_bigrams = nltk.ngrams(harry_potter_tokens, 2)

# converting the bigram generator to a list
bigrams = list(harry_potter_bigrams)

bigram_mle = MLE_bigram(2 , bigrams , harry_potter_tokens)
bigram_kn = KN_bigram(2 , bigrams , harry_potter_tokens)
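
For intuition, the MLE bigram estimate is a relative frequency: the number of times a word follows a context, divided by the number of times anything follows that context. The sketch below computes it by hand on a small made-up token list; `toy_tokens`, `follow_counts` and `mle_bigram_prob` are hypothetical names, and the code is independent of the NLTK models fitted above.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from the dataset
toy_tokens = "harry had a wand harry had an owl harry was late".split()

# Count how often each word follows each one-word context
follow_counts = defaultdict(Counter)
for prev, word in zip(toy_tokens, toy_tokens[1:]):
    follow_counts[prev][word] += 1

def mle_bigram_prob(word, prev):
    # Relative frequency of `word` among all words seen after `prev`
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][word] / total if total else 0.0

print(mle_bigram_prob("had", "harry"))  # 2 of the 3 words after "harry"
print(mle_bigram_prob("was", "harry"))  # 1 of the 3 words after "harry"
print(mle_bigram_prob("owl", "harry"))  # never observed, so 0.0
```

Any bigram that never occurs in the training text gets probability zero under MLE. Smoothing methods such as Kneser-Ney discount the observed counts and give part of that mass to unseen continuations.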

##### **Q3. Use the beginning words 1. "Harry Potter" and 2. "Dumbledore" to generate text using both the language models. Keep maximum text length as 20**

In [6]:
def generate_prediction(model , num_words , text_seed , random_seed = None):
    '''
    Function to generate the predictions given text_seed and num_words.
    It joins them together in a sentence and returns the sentence
    '''

    # preprocessing the text_seed
    text_seed = preprocess_text(text_seed)
    text_seed = tokenize(text_seed)

    # generating the prediction
    pred = model.generate(num_words = num_words , text_seed = text_seed , random_seed = random_seed)

    # generate returns a single string (not a list) when num_words == 1
    if isinstance(pred , str):
        pred = [pred]

    return ' '.join(text_seed + list(pred))


print("==== MLE Model ====")
print(generate_prediction(bigram_mle , num_words = 20 , text_seed = "Harry Potter" , random_seed = 123))
print(generate_prediction(bigram_mle , num_words = 20 , text_seed = "Dumbledore" , random_seed = 123)) 

print("==== KneserNey Model ====")
print(generate_prediction(bigram_kn , num_words = 20 , text_seed = "Harry Potter" , random_seed = 123))
print(generate_prediction(bigram_kn , num_words = 20 , text_seed = "Dumbledore" , random_seed = 123)) 

==== MLE Model ====
harry potter are are less like us all right of them angrily it he does tend to be nice if hed just
dumbledore and bacon as dudley was a man had swapped at his favorite program had expected mrs dursleys bought dudley he
==== KneserNey Model ====
harry potter are are less like us all right place you cant ive finished dialing his hair and piers and had four
dumbledore and bacon as dudley was a mad old things gray tuesday our heads down old clothes of arms and james


Beam search is a tree-based search strategy similar to breadth-first search (BFS). In BFS, we expand every child node; in beam search, we expand only the top k most probable children of each node. The generated text is the sequence with the highest probability.
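
As a point of comparison, the sketch below shows the conventional formulation of beam search, which keeps only the `beam_width` best partial sentences after every step. It is a minimal illustration on a hand-written probability table (`toy_probs` is hypothetical, not derived from the dataset) and is not the implementation used in this assignment, which expands the top k children of every node instead.

```python
# Hypothetical toy bigram table: toy_probs[prev][next] = P(next | prev)
toy_probs = {
    "harry": {"had": 0.5, "was": 0.3, "and": 0.2},
    "had":   {"a": 0.6, "been": 0.4},
    "was":   {"a": 0.7, "late": 0.3},
    "and":   {"ron": 1.0},
    "a":     {"wand": 0.5, "large": 0.5},
    "been":  {"late": 1.0},
}

def beam_search(start, beam_width, depth):
    # Each beam entry is (list of words, probability of the sequence)
    beam = [([start], 1.0)]
    for _ in range(depth):
        candidates = []
        for sentence, prob in beam:
            # Extend every kept sentence by each possible next word
            for word, p in toy_probs.get(sentence[-1], {}).items():
                candidates.append((sentence + [word], prob * p))
        if not candidates:
            break
        # Keep only the beam_width most probable extensions
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_width]
    return beam

result = beam_search("harry", beam_width=2, depth=2)
for sentence, prob in result:
    print(' '.join(sentence), prob)
```

With `beam_width=2` and `depth=2`, the two surviving sentences are "harry had a" and "harry was a"; "harry had been" is pruned at the second step even though its first word was the most probable one.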

##### **Q4. To implement beam search, implement a function to find the top k most probable words**

In [7]:
def k_top_probable(model , k , text_seed):
    '''
    returns the top-k most probable next words based on the model given,
    padded with None when fewer than k words have non-zero probability
    '''

    # lower-casing the context words
    text_seed = [w.lower() for w in text_seed]

    # we store the non-zero probabilities, scoring each vocabulary word once
    non_zero_prob = {}
    for w in model.vocab:
        score = model.score(w , text_seed)
        if score > 0:
            non_zero_prob[w] = score

    # sorting the words by probability, highest first, and keeping at most k
    top = sorted(non_zero_prob , key = non_zero_prob.get , reverse = True)[:k]

    return top + [None]*(k - len(top))
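
The selection and padding logic can be checked in isolation on a hand-written distribution. The sketch below is one possible variant that uses `heapq.nlargest` instead of a full sort and pads with `None` in the same way as the function above; `top_k_words` and `toy_scores` are hypothetical names, and the scores are made up.

```python
import heapq

def top_k_words(scores, k):
    # Keep only the words with non-zero probability
    non_zero = {w: p for w, p in scores.items() if p > 0}
    # nlargest returns up to k keys, ordered from most to least probable
    best = heapq.nlargest(k, non_zero, key=non_zero.get)
    # Pad with None so the caller always receives exactly k entries
    return best + [None] * (k - len(best))

# Hypothetical next-word scores for a single context
toy_scores = {"had": 0.5, "was": 0.3, "and": 0.2, "owl": 0.0}

print(top_k_words(toy_scores, 2))  # ['had', 'was']
print(top_k_words(toy_scores, 5))  # ['had', 'was', 'and', None, None]
```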

##### **Q5. Implement the Beam search using the previously trained MLE model.**

In [109]:
class BeamSearchNode():

    def __init__(self , current_sentence , probability = 1 , depth = 0):
        self.current_sentence = current_sentence
        self.children = []
        self.probability = probability
        self.depth = depth

    def add_child(self,child):
        self.children.append(child)
        return


class BeamSearch:

    '''
    class for implementing BeamSearch to generate a sentence using the Ngram Language Modelling

    input parameters:
        model = Ngram Language Model
        topKfunction = Function to get the top-k probable words
        k = Top k-probable words to consider
        context = The starting word for the sentence (a single token)
        ngram_N = The value of n for n-grams, example n=1 for unigrams
        max_depth = The maximum depth of the tree
    '''


    def __init__(self , model , topKfunction , k , context , ngram_N , max_depth):
        self.model = model
        self.k = k
        self.context = [context]
        self.topKfunction = topKfunction
        self.n = ngram_N
        self.max_depth = max_depth
        self.generated_sentences = {} 
        self.queue = []
        self.root = BeamSearchNode(self.context)
        self.queue.append(self.root)
        self.tree = self.generate_tree(self.root)

    def generate_tree(self , node):
        '''
        Function to generate the tree breadth-first, starting from node.
        Written as a loop, since one recursive call per node can exceed
        Python's recursion limit for larger k or max_depth. Returns the root.
        '''
        self.queue = [node]

        while len(self.queue) != 0:
            current = self.queue.pop(0)

            # a node at max_depth is a finished sentence
            if current.depth == self.max_depth:
                self.generated_sentences[tuple(current.current_sentence)] = current.probability
                continue

            # the last n-1 words form the context (empty for unigrams)
            start = max(0 , len(current.current_sentence) - (self.n-1))
            context_to_generate_words = tuple(current.current_sentence[start:])

            top_prob_words = self.topKfunction(self.model , self.k , context_to_generate_words)

            for word in top_prob_words:
                if word is not None:
                    new_node = BeamSearchNode(current.current_sentence + [word] , current.probability * self.model.score(word , context_to_generate_words) , depth = current.depth+1)
                    current.add_child(new_node)
                    self.queue.append(new_node)

        return node

    def print_sentences(self , num_sentences):
        
        sorted_sentences = sorted(self.generated_sentences.items() , key = lambda item: item[1] , reverse = True)
        
        num_sentences = min(num_sentences , len(sorted_sentences))

        for s in range(num_sentences):
            current_sentence_tuple = sorted_sentences[s][0]

            current_sentence = ' '.join(current_sentence_tuple)

            print("{}. {}".format(s+1 , current_sentence))
            print("Probability = {}".format(sorted_sentences[s][1]))

        return



##### **Q6. Implement Beam search for k = 2 and depth = 10. Find the 5 generated texts with the highest probability.**

In [110]:
model = bigram_mle
topKfunction = k_top_probable
k = 2
n = 2
max_depth = 10

In [111]:
b = BeamSearch(model , topKfunction , k , 'harry' , n , max_depth)
b.print_sentences(5)

1. harry had a large pink beach ball wearing an emerald green
Probability = 4.1452323609999957e-07
2. harry was a large pink beach ball wearing an emerald green
Probability = 2.6321621231072855e-07
3. harry had a large pink beach ball wearing an emerald one
Probability = 2.0726161804999978e-07
4. harry was a large pink beach ball wearing an emerald one
Probability = 1.3160810615536428e-07
5. harry had been a large pink beach ball wearing an emerald
Probability = 9.420982638636354e-08


In [112]:
b = BeamSearch(model , topKfunction , k , 'dumbledore' , n , max_depth)
b.print_sentences(5)

1. dumbledore and a large pink beach ball wearing an emerald green
Probability = 1.4618533099840802e-07
2. dumbledore and a large pink beach ball wearing an emerald one
Probability = 7.309266549920401e-08
3. dumbledore you know who was a large pink beach ball wearing
Probability = 4.470384040603406e-08
4. dumbledore and dudley had a large pink beach ball wearing a
Probability = 4.111462434330226e-08
5. dumbledore you cant blame her sister marge who lived chapter the
Probability = 2.422621470240518e-08
