<a href="https://colab.research.google.com/github/thenerdyouknow/AML_Final_Project/blob/master/UIP_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This Python notebook contains code for building n-gram language models from corpora of texts. To run it, follow these steps:

1. Download your corpora (for example, mine are books from Project Gutenberg).
2. Zip them and upload the zip to your Google Drive.
3. Mount your drive on Google Colaboratory.
4. Copy and paste the paths of the files into FILE_1 and FILE_2 (you can create more global variables if you'd like).
5. Run all the cells in order.

The files I used are available as a zip. If you upload that zip as is to Google Drive (the unzip cell below expects it at My Drive/Datasets.zip) and run the code on Google Colaboratory, it should run without errors.
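
Before running the remaining cells, it can help to confirm that the unzipped corpora are where the path variables expect them. The following is a minimal sketch, assuming the paths are plain local paths; `missing_files` is a hypothetical helper and not part of this notebook.

```python
import os

def missing_files(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [path for path in paths if not os.path.isfile(path)]

# Example paths, copied from the FILE_1 and FILE_2 values used in this notebook.
corpus_paths = [
    "/content/Datasets/TaleOfTwoCities.txt",
    "/content/Datasets/WarAndPeace.txt",
]

# Report every configured corpus that cannot be found.
for path in missing_files(corpus_paths):
    print("Not found:", path)
```

If nothing is printed, every listed file exists and the cleaning cell should be able to open it.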

In [0]:
#Importing all the necessary tools. Since the idea was to build a language model from scratch, minimal libraries have been used.
import string
import re
from collections import Counter
from itertools import islice

In [0]:
#Mounting Google Drive files and unzipping the datasets' zip.
from google.colab import drive
drive.mount('/content/drive')
!unzip "/content/drive/My Drive/Datasets.zip"

In [0]:
#Global variables for the files
FILE_1 = '/content/Datasets/TaleOfTwoCities.txt'
FILE_2 = '/content/Datasets/WarAndPeace.txt'

# Section 1 (Cleaning and N-Gram Creation):

This section contains functions to clean the dataset, create the n-grams, and store them in an appropriate data structure so we can use them for our language models.

Contains:

1.   open_and_read()
2.   generate_ngrams()
3.   split_ngrams()
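
To make the data structure concrete, here is a minimal sketch of how a cleaned string turns into n-grams and then into a dictionary that maps each context to the words that follow it. `ngrams_of` and `next_word_counts` are hypothetical names used only for this illustration, and the sample sentence is made up; the notebook's own functions are defined in the cell below.

```python
from collections import Counter, defaultdict

def ngrams_of(text, n):
    """Return every run of n consecutive tokens in text as a tuple."""
    tokens = text.split()
    # zip over n shifted views of the token list; it stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

def next_word_counts(ngrams):
    """Map each context (all words but the last) to a Counter of next words."""
    counts = defaultdict(Counter)
    for *context, last_word in ngrams:
        counts[" ".join(context)][last_word] += 1
    return counts

trigrams = ngrams_of("<s> you are the best </s> <s> you are the worst </s>", 3)
model = next_word_counts(trigrams)
print(dict(model["are the"]))  # {'best': 1, 'worst': 1}
```

Storing counts per context means a prediction only needs one dictionary lookup followed by picking the most frequent entry.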



In [0]:
def open_and_read(filepath):
  '''
  def open_and_read():
  Input: the path of the file
  Output: A cleaned string (punctuation and other problematic elements removed,
          start- and end-of-sentence markers added, etc.)
  
  In this function, the following happens (in order):
  1. File contents are read line by line.
  2. A loop runs over all lines and does the following:
       a. Converts all the characters to lower case.
       b. Removes the abbreviations "st.", "mrs." and "mr.", since their periods would
          otherwise be mistaken for the end of a sentence.
       c. Converts every remaining period to "</s> <s>", where </s> marks the end of a
          sentence and <s> the start of the next one.
       d. If the line is longer than 4 characters and ends with a closing quotation mark
          (not preceded by a comma) right before the newline, adds "</s> <s>" as well,
          because examination of the corpus showed this was usually the end of a sentence.
       e. Replaces the newline characters with spaces.
       f. Removes all punctuation using str.translate.
       g. Appends the cleaned string to a new list.
  3. Joins the cleaned list into a single string and returns it.
  '''
  final_list = []
  with open(filepath,'rt') as open_book:
    file_contents = open_book.readlines()
  for each_string in file_contents:
    each_string = each_string.lower()
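    # Drop the titles "st.", "mrs." and "mr." as whole words only. A plain
    # str.replace("st.", "") would also cut the end off words like "first."
    # and lose the sentence boundary that follows them.
    final_string = re.sub(r"\b(?:st|mrs|mr)\.", "", each_string)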
    final_string = final_string.replace("."," </s> <s> ")
    if(len(each_string)>4):
      if(each_string[-2]=="”" and each_string[-3]!=","):
        final_string = final_string + " </s> <s> "
    final_string = final_string.replace("\n", " ")
    final_string = final_string.translate(str.maketrans('','',punctuation))
    final_list.append(final_string)
  return ''.join(final_list)

def generate_ngrams(s, n):
  '''
  def generate_ngrams():
  Input: The string to split into n-grams, and the length of the n-grams that need to be formed.
  Output: List of all n-grams in the string
  
  This is a clever function. It does the following:
  1. Tokenizes the string on spaces.
  2. Builds n copies of the token list, each starting one token later than the
     previous one, and uses zip to combine the i-th element of every copy.
     
     zip stops at the shortest copy, so the extra elements of the longer copies
     are discarded.
     For example, 'i had a little lamb haha' with n = 2 gives the copies
     ['i','had','a','little','lamb','haha'] and ['had','a','little','lamb','haha'],
     which zip pairs up as:
     [('i','had'),('had','a'),('a','little'),('little','lamb'),('lamb','haha')]
  3. Joins each tuple with a space, puts the strings in a list and returns the list.
  '''
  tokens = [token for token in s.split(" ") if token != ""]
  ngrams = zip(*[tokens[i:] for i in range(n)])
  return [" ".join(ngram) for ngram in ngrams]
  
def split_ngrams(ngrams):
  '''
  def split_ngrams():
  Input: The n-gram list
  Output: A dictionary whose keys are all the words of an n-gram except
          the last one, and whose values are dictionaries counting the words that
          come after the words in the key.
          For example: if the n-grams are 'you are the best', 'you are the worst'
          The dict will look like this : {'you are the':{'best':1,'worst':1}}
  
  Another clever function. It does the following:
  1. Loops over all the n-grams.
  2. For each n-gram it does the following:
        a. Checks if there is already a dictionary entry for the context, i.e. all
           the words of the n-gram except the last one.
           If yes, appends the last word of the n-gram to that entry's list.
        b. If not, creates a list containing the last word of the n-gram.
  3. Then, for each context, it does the following:
        a. Counts how often each word occurs in the list, sorts the words by
           descending frequency, and converts the result to a dict.
        b. Sets that dictionary in place of the list for the context.
  4. Returns the dictionary of all the n-grams.
  '''
  ngrams_dict = {}
  
  for each_ngram in ngrams:
    # Split once into the context (all words but the last) and the last word.
    context, _, last_word = each_ngram.rpartition(" ")
    ngrams_dict.setdefault(context, []).append(last_word)

  for ngram, value in ngrams_dict.items():
    # most_common() sorts by descending count, so the first key of each
    # inner dict is the most frequent next word.
    ngrams_dict[ngram] = dict(Counter(value).most_common())
  
  return ngrams_dict
  
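# Characters stripped during cleaning. '<', '>' and '/' are not listed, which keeps
# the <s> and </s> markers intact. The backslash is escaped to avoid an invalid
# escape sequence warning.
punctuation = "“”!\"#$%&'()*+,-.:;=?@[\\]^_`{|}~"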

file_1 = open_and_read(FILE_1)
file_2 = open_and_read(FILE_2)

# Section 2 (Language Models):

Contains the functions that build the language models and predict how a sentence continues, given the sentence beginnings stored in global variables. The cleaned text of each file is also stored in a global mapping.

Contains:

1.   take()
2.   one_gram_prediction()
3.   all_gram_prediction()
4.   final_predictions()
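
The prediction step boils down to repeatedly looking up the last n-1 words and appending the most frequent continuation. The following is a minimal sketch of that idea under simplifying assumptions: the model is a tiny hand-written dictionary with made-up counts, `greedy_complete` is a hypothetical function, and it stops early when the context is unknown, which differs from the functions in the cell below.

```python
def greedy_complete(start, model, context_size, max_words):
    """Extend start by always appending the most frequent next word."""
    words = start.split()
    while len(words) < max_words:
        context = " ".join(words[-context_size:])
        candidates = model.get(context)
        if not candidates:
            break  # unknown context: nothing to predict
        # Pick the continuation with the highest count.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

# Toy bi-gram model: context word -> {next word: count}. Counts are made up.
toy_model = {
    "i": {"suppose": 3, "had": 1},
    "suppose": {"so": 2, "not": 1},
    "so": {"</s>": 4},
}

print(greedy_complete("<s> i", toy_model, context_size=1, max_words=6))
# <s> i suppose so </s>
```

Because the choice is always the single most frequent word, the same beginning always produces the same continuation for a given corpus.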


In [0]:
#Global variables: the target length of the predicted sentences, the sentence beginnings, and the cleaned text of each file.
SENTENCE_LENGTH = 20
list_of_sentences = ['i suppose','and having got','not two minutes']
files_mapping = {1:file_1,2:file_2}


def take(n, iterable_object):
  '''
  def take():
  Input: Value of n and an iterable object.
  Output: A list of the first n items of the iterable object.

  Standard function. Used to cut the dictionary down to just the
  words needed to complete the sentence for the mono-gram function.
  '''
  return list(islice(iterable_object, n))

def one_gram_prediction(sentence,ngrams):
  '''
  def one_gram_prediction():
  Input: The sentence and the n-grams dictionary
  Output: The predicted sentence using a mono-gram language model.
  
  The function does the following:
  1. Cuts the dictionary to just the words needed to finish the sentence
     and make it at least SENTENCE_LENGTH words long (two extra entries are
     taken because <s> and </s> may be among them and are skipped).
  2. Loops through the shortened list and appends each word to the end of the sentence.
     This works because the dict elements were
     inserted in descending order of frequency, so the first few words are the ones that
     occur the most in the dataset.
  3. Returns predicted sentence.
  '''
  n_ngrams = take((SENTENCE_LENGTH+2) - len(sentence.split(" ")), ngrams.items())
  for keys in n_ngrams:
    if(keys[0] != '</s>' and keys[0] != '<s>'):
      sentence = sentence + ' ' + keys[0]
  return sentence


def all_gram_prediction(sentence,ngrams,ngram_value):
  '''
  def all_gram_prediction():
  Input: The beginning sentence, the n-grams, and the n value in the n-grams.
  Output: The predicted sentence
  
  Another clever function (I clearly think all my functions are clever). It does the following:
  1. Splits the beginning sentence on space.
  2. Runs a while loop as long as the length of the predicted sentence does not exceed the desired length. It does this in the loop:
        a. Splits the sentence again, so that the word added in the previous pass is included.
        b. Checks if ngram_value has a value of 1, if yes, then redirects to 
           the mono-gram function to find a predicted sentence using monograms
           and returns that sentence.
        c. Checks if the ngram_value has a value of 2, if yes, then it just
           takes the previous word for context as that's the protocol for a
           bi-gram model.
        d. Checks if the ngram_value has a value of 3, if yes, then it takes
           the last two words for context as that's the protocol for a tri-gram
           model.
        e. Similarly, for ngram_value of 4, it takes the last three words as 
           context because 4 indicates quad-gram model.
        f. If the last word of the context is the start token(<s>) then it 
           deletes the end token value from the n-grams, as this will
           ensure the predicted sentence is not stuck in a loop of <s></s><s></s>...
        g. Tries to find the word most used after the word(s) that are the 
           context, if it doesn't find anything then it doesn't exist in the 
           n-grams so it just initializes the word to a blank.
        h. Adds the new word to the sentence and returns to the start of the
           loop with the new sentence
   3. Returns the predicted sentence.
  '''
  split_sentence = sentence.split(" ")
  
  while(len(split_sentence)<=SENTENCE_LENGTH):
    
    split_sentence = sentence.split(" ")
    
    if(ngram_value == 1):
      final_sentence = one_gram_prediction(sentence,ngrams)
      return final_sentence
    
    elif(ngram_value == 2):
      words_needed = split_sentence[-1]
    
    elif(ngram_value == 3):
      words_needed = split_sentence[-2] + ' '+ split_sentence[-1]
    
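    elif(ngram_value == 4):
      words_needed = split_sentence[-3] + ' ' + split_sentence[-2] + ' '+ split_sentence[-1]
      
    else:
      # Only mono- to quad-gram models are handled here, so fail loudly
      # instead of silently predicting blanks.
      raise ValueError("ngram_value must be between 1 and 4")
      
    if(split_sentence[-1] == '<s>'):
      # Remove '</s>' as a continuation of this context to avoid a <s> </s> loop.
      # pop() with a default does nothing if the context or the token is missing.
      ngrams.get(words_needed, {}).pop('</s>', None)
      
    try:
      # The inner dict is sorted by descending count, so its first key is the
      # most frequent next word.
      new_word = list(ngrams[words_needed])[0]
    except (KeyError, IndexError):
      # Unknown context, or no continuation left after removing '</s>'.
      new_word = ''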
      
    sentence = sentence + ' ' + new_word
    
  return sentence


def final_predictions(ngram_value):
  '''
  def final_predictions():
  Input: value of the n in the n-grams needed.
  Output: All the predicted sentences as found in the list of global variables, using the length given in the global variables.
  
  Driver function just written for ease of use when predicting words. It does the following:
  1. Loops through all the files in the global file mapping dictionary.
  2. Generates n-grams for the file.
  3. If the n-gram value is 1, then generates frequency from the n-grams for
     mono-gram prediction.
  4. Else, just uses the split_ngrams function to get the n-gram dictionary.
  5. Loops through all the beginning sentences, calls all_gram_prediction to get
     the predicted sentences and appends them to a list.
  6. Returns the list of all predicted sentences
  '''
  sentence_list = []
  # Iterate over every file in the mapping instead of hard-coding two files.
  for corpus_text in files_mapping.values():
    ngrams = generate_ngrams(corpus_text,ngram_value)
    
    if(ngram_value == 1):
      ngrams = dict(Counter(ngrams).most_common())
    else:  
      ngrams = split_ngrams(ngrams)
      
    for each_sentence in list_of_sentences:
      each_sentence = '<s> ' + each_sentence
      sentence = all_gram_prediction(each_sentence,ngrams,ngram_value)
      sentence_list.append(sentence)
      
  return sentence_list


final_sentences = final_predictions(4)#Example for quad-gram.
print(final_sentences)
