<b>Importing the libraries</b>

In [None]:
import pandas as pd
import random
from tqdm import tqdm



In [None]:
from nltk import ngrams, trigrams
from collections import Counter, defaultdict
from nltk.tokenize import sent_tokenize,word_tokenize
import nltk
nltk.download('punkt')



<b>Pre-Processing</b>

Here I use a corpus that was already cleaned in the preprocessing notebook. The cleaning steps removed citations, email addresses, URLs, parentheses, numbers, and special characters, which I expect to give better results for a language model.

In [None]:
#Loading the preprocessed Corpus
corpus = pd.read_csv('corpus.csv')

## 1. Trigram Language Model

Here I use a dictionary to keep the count of every trigram in the corpus and then turn those counts into probabilities. For a trigram (w1, w2, w3), <b>P</b>(w3 | w1, w2) is the count of (w1, w2, w3) divided by the total count of all trigrams that start with (w1, w2).
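As a small worked example of that formula, here is a sketch on a three-sentence toy corpus. The sentences and the helper name `trigram_prob` are invented for illustration, and padding is left out to keep the arithmetic easy to follow.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
sentences = [
    "the virus spreads fast".split(),
    "the virus spreads slowly".split(),
    "the virus mutates".split(),
]

# counts[(w1, w2)][w3] = number of times w3 follows the pair (w1, w2)
counts = defaultdict(Counter)
for words in sentences:
    # Slide a window of three tokens over the sentence (no padding here).
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1


def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2) = count(w1, w2, w3) / count of all trigrams starting (w1, w2)."""
    total = sum(counts[(w1, w2)].values())
    return counts[(w1, w2)][w3] / total if total else 0.0


# ("the", "virus") is followed by "spreads" twice and "mutates" once,
# so the two probabilities are 2/3 and 1/3.
print(trigram_prob("the", "virus", "spreads"))
print(trigram_prob("the", "virus", "mutates"))
```

An unseen context returns 0.0 here instead of raising a division error; that is a choice made for this sketch, not something the notebook's model does.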

In [None]:
# Create a placeholder dictionary for the model: model[(w1, w2)][w3] holds counts first, probabilities later
model = defaultdict(lambda: defaultdict(lambda: 0))

In [None]:
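# .items() replaces Series.iteritems(), which was removed in pandas 2.0
for i,j in corpus['processed'].items():
  if isinstance(j, str):  # skip rows that are not text (e.g. NaN)
    sent = sent_tokenize(j)
    for k in sent:
      words = word_tokenize(k)
      # padding adds None markers so sentence starts and ends are counted too
      for w1, w2, w3 in trigrams(words, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1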

 
# transforming counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

<b>2. Generating Sentences using trigram model</b>

Using the model built above, I generate sentences from two starting words. Since 'the' is the most common term in the corpus, each sentence starts with 'the' followed by a commonly occurring word. I reuse the same starting words with the 4-gram model to make a comparison. At each step I draw a random number between 0 and 1, walk through the candidate next words for the previous two words while accumulating their probabilities, and pick the first word at which the running total reaches the random number, so each word is chosen in proportion to its probability. Generation stops once the end-of-sentence padding has been produced. Here are the results that I obtained.
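The word-selection step can be looked at on its own. The sketch below isolates it in a hypothetical helper, `sample_next`, and runs it on an invented probability table rather than on the real model; the seeded generator is only there to make the demo repeatable.

```python
import random
from collections import Counter


def sample_next(distribution, rng=random):
    """Draw one key from a {word: probability} dict using a cumulative sum."""
    r = rng.random()  # uniform in [0, 1)
    accumulator = 0.0
    last = None
    for word, prob in distribution.items():
        accumulator += prob
        last = word
        # The first word whose cumulative probability reaches r is selected.
        if accumulator >= r:
            return word
    # Rounding can leave the total slightly below 1; fall back to the last
    # word seen (None when the distribution is empty).
    return last


# Invented probabilities, for illustration only.
toy = {"role": 0.5, "study": 0.3, "virus": 0.2}
rng = random.Random(0)
draws = Counter(sample_next(toy, rng) for _ in range(10_000))
print(draws)  # relative frequencies should be close to 0.5 / 0.3 / 0.2
```

The standard library's `random.choices(population, weights=...)` performs the same kind of weighted draw and could replace the explicit loop.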

In [None]:
# starting words
start = [["the","study"],["the","role"],["the","virus"]]

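for text in start:
  sentence_finished = False

  while not sentence_finished:
    # draw a uniform random number in [0, 1) used to sample the next word
    r = random.random()
    accumulator = .0

    # .get() avoids adding empty entries to the defaultdict for unseen contexts
    candidates = model.get(tuple(text[-2:]), {})
    if not candidates:
      break  # unseen context: stop instead of looping forever

    for word, prob in candidates.items():
      accumulator += prob
      # pick the first word whose cumulative probability reaches r
      if accumulator >= r:
          text.append(word)
          break

    # two consecutive None pads mark the end of the sentence
    if text[-2:] == [None, None]:
      sentence_finished = True

  print(' '.join([t for t in text if t]))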
 


the study authors were able to rescue stop codon resulting in a regulatory role in the range of possible directions which can be evaluated with caution is advised in using digital inline holography i.e .
the role of the nasal washes among previously healthy 60yearold man from human activity in progressive destruction of the hyperechogenic posterior wall of the population in punjab are compared with monotherapy or rimantadinenebulized zanamivir combination therapy also occurred since hospitalization and others of which are manned by a human monoclonal igm antibody but the risk of increasing rates of older adults .
the virus tended to adhere to this disease spread .


## 3. 4-gram Language Model

As with the trigram model, I use a dictionary to keep the count of every 4-gram in the corpus and then turn those counts into probabilities. For a 4-gram (w1, w2, w3, w4), <b>P</b>(w4 | w1, w2, w3) is the count of (w1, w2, w3, w4) divided by the total count of all 4-grams that start with (w1, w2, w3).
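The trigram and 4-gram models differ only in n, so both could in principle come from one helper. Below is a minimal sketch, assuming the sentences are already tokenized. `padded_ngrams` and `build_ngram_model` are hypothetical names, and the hand-written padding is meant to mimic NLTK's `pad_left=True, pad_right=True` behaviour (n-1 `None` markers on each side) without depending on NLTK.

```python
from collections import Counter, defaultdict


def padded_ngrams(tokens, n):
    """Yield n-grams over tokens with n-1 None pads on each side."""
    padded = [None] * (n - 1) + list(tokens) + [None] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


def build_ngram_model(tokenized_sentences, n):
    """Return {context: {next_word: probability}} for contexts of length n-1."""
    counts = defaultdict(Counter)
    for tokens in tokenized_sentences:
        for gram in padded_ngrams(tokens, n):
            counts[gram[:-1]][gram[-1]] += 1

    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {w: c / total for w, c in followers.items()}
    return model


# Toy sentences, invented for illustration only.
toy_sentences = [
    ["the", "role", "of", "lipids"],
    ["the", "role", "of", "age"],
]
toy_model4 = build_ngram_model(toy_sentences, 4)
print(toy_model4[("the", "role", "of")])  # each continuation gets probability 0.5
```

Calling the same helper with `n=3` would give the trigram version, which keeps the two models directly comparable.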

In [None]:
model4 = defaultdict(lambda: defaultdict(lambda: 0))

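# .items() replaces Series.iteritems(), which was removed in pandas 2.0
for i,j in corpus['processed'].items():
  if isinstance(j, str):  # skip rows that are not text (e.g. NaN)
    sent = sent_tokenize(j)
    for k in sent:
      words = word_tokenize(k)
      # padding adds None markers so sentence starts and ends are counted too
      for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True, pad_left=True):
        model4[(w1, w2, w3)][w4] += 1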

 
# transforming counts to probabilities
for w1_w2_w3 in model4:
    total_count = float(sum(model4[w1_w2_w3].values()))
    for w4 in model4[w1_w2_w3]:
        model4[w1_w2_w3][w4] /= total_count

<b>4. Generating Sentences using 4-gram model</b>

I have again used the same starting words as in the trigram model to make a comparison. Since this model requires three starting words, I take the third word from the sentences generated by the trigram model. Here are the results that I obtained.

In [None]:
# starting words
start = [["the","study","authors"],["the","role","of"]]

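for text in start:
  sentence_finished = False

  while not sentence_finished:
    # draw a uniform random number in [0, 1) used to sample the next word
    r = random.random()
    accumulator = .0

    # .get() avoids adding empty entries to the defaultdict for unseen contexts
    candidates = model4.get(tuple(text[-3:]), {})
    if not candidates:
      break  # unseen context: stop instead of looping forever

    for word, prob in candidates.items():
      accumulator += prob
      # pick the first word whose cumulative probability reaches r
      if accumulator >= r:
          text.append(word)
          break

    # two consecutive None pads mark the end of the sentence
    if text[-2:] == [None, None]:
      sentence_finished = True

  print(' '.join([t for t in text if t]))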

the study authors found no published studies comparing the cbc indices cardiac and coagulation parameters electrolyte factors renal and liver damage .
the role of lipid pathways in covid19 in more narrowly constructed samples .


An apparent problem with this n-gram model is data sparseness. Although the model is stored in a dictionary that does not keep entries for n-grams with zero count, it can be optimized further by using thresholds to filter out less frequent terms and by maintaining smaller dictionaries, which can also take advantage of parallelization.
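One way the threshold idea could look is sketched below. This is not part of the notebook: `prune_and_normalize` and the cutoff `min_count=2` are hypothetical, the counts are invented, and the function has to run on raw counts, that is, before the counts are converted to probabilities.

```python
from collections import Counter


def prune_and_normalize(counts, min_count=2):
    """Drop continuations seen fewer than min_count times, then normalize."""
    model = {}
    for context, followers in counts.items():
        kept = {w: c for w, c in followers.items() if c >= min_count}
        total = sum(kept.values())
        if total:  # contexts with no surviving continuation are dropped entirely
            model[context] = {w: c / total for w, c in kept.items()}
    return model


# Invented raw counts, for illustration only.
toy_counts = {
    ("the", "virus"): Counter({"spreads": 3, "mutates": 2, "sings": 1}),
    ("a", "rare"): Counter({"event": 1}),
}
pruned = prune_and_normalize(toy_counts, min_count=2)
print(pruned)  # only ("the", "virus") survives, with "sings" removed
```

The dictionary gets smaller, but rare continuations lose all probability mass and some contexts disappear, so sentence generation can hit a dead end more often. The right cutoff would need to be checked against the actual corpus.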