# n-Gram Language Models

Your task is to train n-gram language models. [Ref: SLP, Chapter 3]

- Task 1: Train unigram, bigram, and trigram models on the given training files. Then score the given test files with each model, generate sentences from the trained models, and compute perplexity.
- Task 2: Create training data for n > 3 and repeat the steps above, starting from model training.
<h6>Part-A = (55 Points) </h6>

In [None]:
'''
Your imports go here
You are encouraged to implement your own functions rather than rely on library functions.
'''
import re

import sys
from collections import Counter
import numpy as np

In [None]:
# Constants that define the pseudo-word tokens
# (access them via UNK, for instance).
# For this assignment we follow the SLP book: <s> marks the beginning of a sentence and
# </s> marks the end of a sentence. Check the sample training files for reference.
UNK = "<UNK>"
SENT_BEGIN = "<s>"
SENT_END = "</s>"

We need to initialise the global variables for the model.

In [None]:
"""Initializes Parameters:
  n_gram (int): the n-gram order.
  is_laplace_smoothing (bool): whether or not to use Laplace smoothing
  threshold (int): words with frequency at or below threshold will be converted to the <UNK> token
"""
# Initializing different object attributes
n_gram = 3
is_laplace_smoothing = True
vocab = [] 
n_gram_counts = {}
n_minus_1_gram_counts = {}
threshold = 1



### Implement training function (10 points)

In [None]:
def make_ngrams(tokens, n):
    """Returns the list of n-grams (as tuples) found in a list of tokens."""
    # Zip n shifted views of the token list to form a sliding window of size n
    return list(zip(*[tokens[i:] for i in range(n)]))
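
To make the sliding-window behaviour of `make_ngrams` concrete, here is a small sketch on a toy sentence. The sentence is illustrative only and is not taken from the training files.

```python
def make_ngrams(tokens, n):
    # Same sliding-window idea as above: zip n shifted views of the token list
    return list(zip(*[tokens[i:] for i in range(n)]))

# Toy sentence (illustrative, not from the training data)
tokens = "<s> <s> i want food </s> </s>".split()
trigrams = make_ngrams(tokens, 3)
for gram in trigrams:
    print(gram)

# A sequence shorter than n produces no n-grams at all
too_short = make_ngrams(["food"], 3)
```

A list of 7 tokens yields 7 - 3 + 1 = 5 trigrams, and any input shorter than n yields an empty list.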

In [None]:
def train(training_file_path):
    """
    Trains the language model on the given data. The input file is expected to
    have white-space separated tokens, one sentence per line, with each
    sentence beginning with <s> and ending with </s>.
    Parameters:
      training_file_path (str): the location of the training data to read

    Returns:
      tuple: n-gram counts (dict), vocabulary (list), (n-1)-gram counts (dict),
      and the total number of tokens (int)
    """


    with open(training_file_path, 'r') as fh:
        content = fh.read().split() # Read and split data to get list of words
       
    # Get the count of each word
    counts = Counter(content)

    # Replace words whose frequency is at or below the threshold with <UNK>
    for i in range(len(content)):
      if counts[content[i]] <= threshold:
        content[i] = UNK

    # Build the n-grams with the make_ngrams function

    grams = make_ngrams(content, n_gram)
    n_gram_counts = {}
    for i in grams:
      if i in n_gram_counts:
        n_gram_counts[i] += 1
      else: 
         n_gram_counts[i] = 1

    # Get the training data vocabulary

    vocab = list(set(content))
    n_minus_1_gram_counts = {}       
    # For n>1 grams compute n-1 gram counts to compute probability
    if n_gram > 1:
      grams_1 = make_ngrams(content, n_gram-1)
      for i in grams_1:
        if i in n_minus_1_gram_counts:
          n_minus_1_gram_counts[i] += 1
        else: 
          n_minus_1_gram_counts[i] = 1


    return n_gram_counts, vocab, n_minus_1_gram_counts, len(content)
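
The rare-word replacement step in `train` can be seen in isolation in the sketch below. `replace_rare` is a hypothetical helper written for illustration, and the toy token list is made up. It assumes the same rule as the training code: a token whose count is at or below the threshold becomes `<UNK>`.

```python
from collections import Counter

UNK = "<UNK>"

def replace_rare(tokens, threshold=1):
    """Replace tokens whose count is at or below threshold with UNK (hypothetical helper)."""
    counts = Counter(tokens)
    return [tok if counts[tok] > threshold else UNK for tok in tokens]

# Toy corpus: "thai", "food" and "lunch" each occur once, everything else twice
toy = "<s> i want thai food </s> <s> i want lunch </s>".split()
replaced = replace_rare(toy)
print(replaced)
```

With `threshold = 1`, every word seen only once is mapped to `<UNK>`, so the model has counts for unknown words at test time.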

Output your Trained Data Parameters:

In [None]:
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("/berp-training-tri.txt")
print(n_gram_counts)
print(vocab)
print(total_words)

{('<s>', '<s>', "let's"): 196, ('<s>', "let's", 'start'): 163, ("let's", 'start', 'over'): 136, ('start', 'over', '</s>'): 366, ('over', '</s>', '</s>'): 367, ('</s>', '</s>', '<s>'): 6755, ('</s>', '<s>', '<s>'): 6755, ('<s>', '<s>', 'my'): 6, ('<s>', 'my', 'mother'): 1, ('my', 'mother', 'is'): 1, ('mother', 'is', 'coming'): 1, ('is', 'coming', 'to'): 1, ('coming', 'to', 'visit'): 1, ('to', 'visit', 'and'): 1, ('visit', 'and', "i'd"): 1, ('and', "i'd", 'like'): 5, ("i'd", 'like', 'to'): 409, ('like', 'to', 'take'): 8, ('to', 'take', '<UNK>'): 1, ('take', '<UNK>', 'to'): 1, ('<UNK>', 'to', 'dinner'): 1, ('to', 'dinner', '</s>'): 7, ('dinner', '</s>', '</s>'): 224, ('<s>', '<s>', 'new'): 1, ('<s>', 'new', 'query'): 1, ('new', 'query', '</s>'): 1, ('query', '</s>', '</s>'): 2, ('<s>', '<s>', 'now'): 7, ('<s>', 'now', "i'm"): 3, ('now', "i'm", 'interested'): 6, ("i'm", 'interested', 'in'): 33, ('interested', 'in', 'some'): 2, ('in', 'some', 'middle'): 1, ('some', 'middle', 'eastern'): 5, 

### Scoring function (5 points)
Implement a score function that takes a string representing a single sentence and returns its probability under the trained model.

In [None]:
import math
def score(sentence):
    """Calculates the probability score for a given string representing a single sentence.
    Parameters:
      sentence (str): a sentence with tokens separated by whitespace to calculate the score of
      
    Returns:
      float: the probability value of the given string for this model
    """
    # Split the input sentence and replace out-of-vocabulary tokens with <UNK>
    vocab_set = set(vocab)
    clean = [token if token in vocab_set else UNK for token in sentence.split(" ")]

    # Sum the log probability of each n-gram to get the sentence log probability
    log_probability = 0
    for gram in make_ngrams(clean, n_gram):
        numerator = n_gram_counts.get(gram, 0)
        # The context count is the total word count for unigrams,
        # otherwise the count of the preceding (n-1)-gram
        if n_gram == 1:
            denominator = total_words
        else:
            denominator = n_minus_1_gram_counts.get(gram[:-1], 0)

        if is_laplace_smoothing:
            # Add-one smoothing: add 1 to the count and |V| to the context count
            numerator += 1
            denominator += len(vocab)
        elif numerator == 0 or denominator == 0:
            # Without smoothing, an unseen n-gram has zero probability
            return 0.0

        log_probability += np.log(numerator) - np.log(denominator)

    return math.exp(log_probability)
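
The add-one (Laplace) estimate used in `score` is (count + 1) / (context count + |V|). The sketch below works through it with hypothetical counts chosen only to make the arithmetic easy to follow. `laplace_prob` is an illustrative helper, not part of the assignment code.

```python
def laplace_prob(ngram_count, context_count, vocab_size):
    """Add-one estimate: (count + 1) / (context count + |V|)."""
    return (ngram_count + 1) / (context_count + vocab_size)

# Hypothetical counts, chosen only for illustration
seen = laplace_prob(ngram_count=3, context_count=10, vocab_size=90)    # 4 / 100
unseen = laplace_prob(ngram_count=0, context_count=10, vocab_size=90)  # 1 / 100
print(seen, unseen)
```

An unseen n-gram still receives a small non-zero probability, which is what keeps the log in `score` finite.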



In [None]:
with open("/hw2-test-tri.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []


# of test sentences:  102


In [None]:
# print probabilities/score of sentences in test content
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)

In [None]:
print(mean)
print(std_dev)

0.019614142211209613
0.1386475066426789
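
Because the test file is split on `"\n"`, trailing blank lines become empty sentences. With `n_gram = 3` an empty string yields no trigrams, so `score` returns exp(0) = 1.0 for it, and a few such lines can dominate the mean (two of them among 102 sentences would alone contribute 2/102 ≈ 0.0196). Below is a minimal sketch of a reader that skips blank lines, assuming that is the desired behaviour. `read_sentences` is a hypothetical helper, not part of the assignment.

```python
import os
import tempfile

def read_sentences(path):
    """Return the non-blank lines of a file, stripped of surrounding whitespace (hypothetical helper)."""
    with open(path, "r") as fh:
        return [line.strip() for line in fh if line.strip()]

# Demo on a temporary file that ends with blank lines
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("<s> <s> start over </s> </s>\n\n\n")
demo = read_sentences(tmp.name)
os.remove(tmp.name)
print(len(demo))
```

Whether blank lines should be dropped or scored depends on how the expected values for the assignment were computed, so treat this as one option rather than a required change.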


### Sentence generation (10 points)
Generate sentences from the trained model above.
- To sample the next word from a set of candidate n-grams and their probabilities, see the documentation for `numpy.random.choice`:
https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html
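
As a rough illustration of count-weighted sampling, the sketch below normalizes a set of hypothetical counts into probabilities and draws one word. It uses the standard library's `random.choices`, which is also what the notebook's generation code calls. `numpy.random.choice` takes the same normalized probabilities through its `p` argument. The counts and the seed are assumptions made for the demo.

```python
import random

# Hypothetical counts of the words that follow some fixed context
candidate_counts = {"to": 6, "some": 3, "a": 1}

words = list(candidate_counts)
total = sum(candidate_counts.values())
probs = [count / total for count in candidate_counts.values()]  # normalize counts

random.seed(0)  # seeded only so the demo is repeatable
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)
```

`random.choices` accepts unnormalized weights too, so passing the raw counts gives the same distribution.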

In [None]:
import random
def generate_sentence():
  """Generates a single sentence from a trained language model using the Shannon technique.

  Returns:
    str: the generated sentence
  """
  # Start with n-1 sentence begin markers (a single one for unigram and bigram models)
  sentence = [SENT_BEGIN] * max(n_gram - 1, 1)
  prev_word = SENT_BEGIN
  if n_gram > 1:
    prev_token = [SENT_BEGIN] * (n_gram - 1)
    while prev_word != SENT_END:
      # Collect every n-gram whose first n-1 tokens match the current context,
      # together with its count; the counts serve as sampling weights
      candidate_tokens = []
      candidate_tokens_occurrence = []
      for token in n_gram_counts:
        if list(token[:n_gram-1]) == prev_token:
          candidate_tokens.append(token)
          candidate_tokens_occurrence.append(n_gram_counts[token])
      picked_token = random.choices(candidate_tokens, weights=candidate_tokens_occurrence, k=1)[0]
      picked_word = picked_token[-1]
      # If <s> is generated, ignore it and generate another word
      if picked_word != SENT_BEGIN:
        sentence.append(picked_word)
        prev_word = picked_word
        prev_token = prev_token[1:]
        prev_token.append(picked_word)

  else:
    # A unigram model has no context, so every word in the vocabulary is a candidate,
    # weighted by its unigram count
    words = [gram[0] for gram in n_gram_counts]
    weights = list(n_gram_counts.values())
    while prev_word != SENT_END:
      picked_word = random.choices(words, weights=weights, k=1)[0]
      # If <s> is generated, ignore it and generate another word
      if picked_word != SENT_BEGIN:
        sentence.append(picked_word)
        prev_word = picked_word

  # Append the remaining sentence end markers for n > 2
  if n_gram > 2:
    for _ in range(n_gram - 2):
      sentence.append(SENT_END)
  return " ".join(sentence)

In [None]:
def generate(n):
    """Generates n sentences from a trained language model using the Shannon technique.
    Parameters:
      n (int): the number of sentences to generate
      
    Returns:
      list: a list containing strings, one per generated sentence
    """
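    # Generate sentences one by one and store them
    sentences = []
    count = 0
    while count < n:
        try:
            sentences.append(generate_sentence())
            count += 1
        except (IndexError, ValueError):
            # random.choices fails when a context has no candidates; retry in that case
            pass
    return sentences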

In [None]:
sentences = generate(50)
print("Sentences:")
for sentence in sentences:
  print(sentence)

Sentences:
<s> <s> is that right </s> </s>
<s> <s> to make a reservation for uh sunday dinner time or even breakfast </s> </s>
<s> <s> i'd like to spend about fifty bucks </s> </s>
<s> <s> about two or three miles of icsi </s> </s>
<s> <s> african restaurants </s> </s>
<s> <s> just a couple of dollars </s> </s>
<s> <s> a saturday </s> </s>
<s> <s> i want to eat on tuesday what about chinese restaurants in oakland not american </s> </s>
<s> <s> show me other italian restaurant in your list oh no no no no the distance does not matter </s> </s>
<s> <s> so now we should change to dinner </s> </s>
<s> <s> which of these two restaurants open after midnight </s> </s>
<s> <s> i would like to eat some spicy meal </s> </s>
<s> <s> let's start again </s> </s>
<s> <s> it has to be between uh fifteen minutes away from icsi </s> </s>
<s> <s> i want some spanish food </s> </s>
<s> <s> let's start over </s> </s>
<s> <s> start over again </s> </s>
<s> <s> do they serve rib-eye steaks </s> </s>
<s> <s> 

### Evaluate model perplexity (5 points)
Measure the perplexity of a test sequence with your trained model.
You may assume that this sequence consists of many sentences "glued together".

The perplexity of a sequence is the inverse probability of the test set, normalized by the number of words.
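
The definition above is equivalent to exp(-(1/N) · Σ log p_i), which avoids multiplying many small probabilities together. The sketch below assumes one log probability per counted token. The per-token probabilities are hypothetical and `perplexity_from_logprobs` is an illustrative helper, not the function required by the assignment.

```python
import math

def perplexity_from_logprobs(log_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) for N per-token log probabilities."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Hypothetical per-token probabilities
token_probs = [0.1, 0.1, 0.1, 0.1]
pp = perplexity_from_logprobs([math.log(p) for p in token_probs])

# Same quantity computed directly as (1 / P) ** (1 / N)
direct = (1 / math.prod(token_probs)) ** (1 / len(token_probs))
print(pp, direct)
```

When every token has probability 0.1, both forms give a perplexity of about 10: the model is as uncertain as a uniform choice among ten words.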


In [None]:
# Since this sequence will cross many sentence boundaries, we need to include
# the begin- and end-sentence markers <s> and </s> in the probability computation.
# We also need to include the end-of-sentence marker </s>
# (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

def perplexity(test_sequence):
    """Measures the perplexity of the given test sequence under the trained model.
    Parameters:
      test_sequence (string): a sequence of space-separated tokens to measure the perplexity of

    Returns:
      float: the perplexity of the given sequence
    """

    # Replace out-of-vocabulary words with <UNK> (score does this as well)
    test_list = [token if token in vocab else UNK for token in test_sequence.split()]
    test = " ".join(test_list)

    # Count the tokens, excluding sentence begin markers, to get N
    N = 0
    for i in test_list:
        if i != SENT_BEGIN:
            N += 1

    # Get the probability of the whole sequence
    probability = score(test)

    # Perplexity is the inverse probability normalized by the number of tokens
    return (1/probability)**(1/N)

In [None]:
print(perplexity(" ".join(ten_sentences_1)))
print(perplexity(" ".join(sentences[0:10])))

132.46094460783397
152.25095173309435


In [None]:
# For Uni gram
n_gram = 1
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("/berp-training_uni.txt")
with open("/hw2-test_uni.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the unigram model =", mean)
print("Standard Deviation of the unigram model =",std_dev)
print("Perplexity of the unigram model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))


# of test sentences:  100
Mean of the unigram model = 2.4727713630715013e-06
Standard Deviation of the unigram model = 1.4448856183018081e-05
Perplexity of the unigram model(Test Sentences) =  252.5220190326356


In [None]:
# For bi gram
n_gram = 2
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("/berp-training_bi.txt")
with open("/hw2-test_bi.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the bigram model =", mean)
print("Standard Deviation of the bigram model =",std_dev)
sentences = generate(50)
print("Sentences:")
for sentence in sentences:
  print(sentence)
print("Perplexity of the bigram model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the bigram model(Generated Sentences) = ", perplexity(" ".join(sentences[:10])))


# of test sentences:  100
Mean of the bigram model = 4.943948710689297e-05
Standard Deviation of the bigram model = 0.000285326303522883
Sentences:
<s> i want to go any between five miles away </s>
<s> start over </s>
<s> okay how about the list </s>
<s> i want to eat on nakapan </s>
<s> any price </s>
<s> i would like uh i would like to eat russian food </s>
<s> i want to have the me more than ten dollars </s>
<s> do i want it doesn't have </s>
<s> i go back to be going to go to eat tonight round trip </s>
<s> well i'm looking for lunch </s>
<s> i wanna change to to eat on telegraph avenue </s>
<s> i would like to walk to eat malaysian food is the price </s>
<s> i would like to fifty dollars tonight </s>
<s> any australian </s>
<s> can i want to take reservations </s>
<s> mediterranean meal up to go to get a buffet lunch </s>
<s> i would like a really slow no excuse me more about some thai food for a reservation </s>
<s> i'm <UNK> it needs to travel any place within twenty minutes i w

In [None]:
# For tri gram
n_gram = 3
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("/berp-training-tri.txt")
with open("/hw2-test-tri.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the trigram model =", mean)
print("Standard Deviation of the trigram model =",std_dev)
sentences = generate(50)
print("Sentences:")
for sentence in sentences:
  print(sentence)
print("Perplexity of the trigram model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the trigram model(Generated Sentences) = ", perplexity(" ".join(sentences[:10])))


# of test sentences:  102
Mean of the trigram model = 0.019614142211209613
Standard Deviation of the trigram model = 0.1386475066426789
Sentences:
<s> <s> i would like to eat sometime this evening </s> </s>
<s> <s> near solano avenue </s> </s>
<s> <s> start over oops is it <UNK> and i wanna spend not very much money </s> </s>
<s> <s> i want to eat either on saturday </s> </s>
<s> <s> okay tell me about viva taqueria westside bakery </s> </s>
<s> <s> i don't care about the <UNK> <UNK> </s> </s>
<s> <s> um i'm looking for a mexican restaurant </s> </s>
<s> <s> i could travel three hundred kilometers </s> </s>
<s> <s> i want to eat mexican food </s> </s>
<s> <s> it could be any distance from icsi um somewhere to take a friend </s> </s>
<s> <s> i did not say thai i said two miles of icsi </s> </s>
<s> <s> start over </s> </s>
<s> <s> i'd like to eat on a saturday </s> </s>
<s> <s> um i'm willing to travel five miles away </s> </s>
<s> <s> uh what <UNK> restaurants californian style </s> </

In [None]:
# Generate sentences with the trigram model and use them to train the models for n = 4, 5, 6, and 7
sentences = generate(500)
# Training file for ngram = 4
train_4 = []
for sentence in sentences:
  s = '<s> ' + sentence + ' </s>'
  train_4.append(s)
# Note: a leading backslash in 'training...' would be read as the tab escape \t
with open('training_4_file.txt', 'w') as f:
    f.write('\n'.join(train_4))

In [None]:
# For n_gram = 4
n_gram = 4
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("training_4_file.txt")
with open("/hw2-test-four.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the ngram =4 model:", mean)
print("Standard Deviation of the ngram=4 model:",std_dev)
sentences_1 = generate(10)
print("Sentences:")
for sentence in sentences_1:
  print(sentence)
print("Perplexity of the ngram=4 model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the ngram=4 model(Generated Sentences) = ", perplexity(" ".join(sentences_1)))

# of test sentences:  100
Mean of the ngram =4 model: 2.327461500947876e-08
Standard Deviation of the ngram=4 model: 1.4187140836906923e-07
Sentences:
<s> <s> <s> it can cost <UNK> more </s> </s> </s>
<s> <s> <s> where do you have any <UNK> restaurants on shattuck </s> </s> </s>
<s> <s> <s> how about anywhere in berkeley </s> </s> </s>
<s> <s> <s> how about lunch uh to walk less than one mile </s> </s> </s>
<s> <s> <s> i would like to know what type of food kosher food </s> </s> </s>
<s> <s> <s> do you have african food </s> </s> </s>
<s> <s> <s> where is it <UNK> understand that </s> </s> </s>
<s> <s> <s> could be <UNK> expensive </s> </s> </s>
<s> <s> <s> it should be within ten minutes </s> </s> </s>
<s> <s> <s> no more than ten dollars </s> </s> </s>
Perplexity of the ngram=4 model(Test Sentences) =  173.14586505680919
Perplexity of the ngram=4 model(Generated Sentences) =  94.688139795237


In [None]:
# Training file for ngram = 5
train_5 = []
for sentence in sentences:
  s = '<s> <s> ' + sentence + ' </s> </s>'
  train_5.append(s)
# Note: a leading backslash in 'training...' would be read as the tab escape \t
with open('training_5_file.txt', 'w') as f:
    f.write('\n'.join(train_5))
# For n_gram = 5
n_gram = 5
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("training_5_file.txt")
with open("/hw2-test_five.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the ngram =5 model:", mean)
print("Standard Deviation of the ngram=5 model:",std_dev)
sentences_1 = generate(10)
print("Sentences:")
for sentence in sentences_1:
  print(sentence)
print("Perplexity of the ngram=5 model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the ngram=5 model(Generated Sentences) = ", perplexity(" ".join(sentences_1)))

# of test sentences:  100
Mean of the ngram =5 model: 4.1127515686319835e-10
Standard Deviation of the ngram=5 model: 2.2229936294186784e-09
Sentences:
<s> <s> <s> <s> is there a good vegetarian chinese lunch </s> </s> </s> </s>
<s> <s> <s> <s> i'd like to eat <UNK> or uh <UNK> places </s> </s> </s> </s>
<s> <s> <s> <s> i want to be inexpensive </s> </s> </s> </s>
<s> <s> <s> <s> tell me something about it </s> </s> </s> </s>
<s> <s> <s> <s> how much it costs </s> </s> </s> </s>
<s> <s> <s> <s> i'd like to eat some american breakfast </s> </s> </s> </s>
<s> <s> <s> <s> at eight o'clock in the neighborhood of berkeley </s> </s> </s> </s>
<s> <s> <s> <s> tell me about spats </s> </s> </s> </s>
<s> <s> <s> <s> is jupiter in your <UNK> </s> </s> </s> </s>
<s> <s> <s> <s> more than fifteen dollars </s> </s> </s> </s>
Perplexity of the ngram=5 model(Test Sentences) =  190.10456033662436
Perplexity of the ngram=5 model(Generated Sentences) =  92.1851835675835


In [None]:
# Training file for ngram = 6
train_6 = []
for sentence in sentences:
  s = '<s> <s> <s> ' + sentence + ' </s> </s> </s>'
  train_6.append(s)
# Note: a leading backslash in 'training...' would be read as the tab escape \t
with open('training_6_file.txt', 'w') as f:
    f.write('\n'.join(train_6))
# For n_gram = 6
n_gram = 6
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("training_6_file.txt")
with open("/hw2-test_six.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the ngram =6 model:", mean)
print("Standard Deviation of the ngram=6 model:",std_dev)
sentences_1 = generate(10)
print("Sentences:")
for sentence in sentences_1:
  print(sentence)
print("Perplexity of the ngram=6 model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the ngram=6 model(Generated Sentences) = ", perplexity(" ".join(sentences_1)))

# of test sentences:  100
Mean of the ngram =6 model: 9.33735459865477e-12
Standard Deviation of the ngram=6 model: 4.9158164291548074e-11
Sentences:
<s> <s> <s> <s> <s> i would like to eat some place <UNK> for lunch </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> the money doesn't matter </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i want to spend </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> is there a <UNK> restaurant </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i want to eat <UNK> </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i'd like a place for thai food today </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i would like to eat dinner </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> tell me about jupiter </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> <UNK> food </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> <UNK> <UNK> </s> </s> </s> </s> </s>
Perplexity of the ngram=6 model(Test Sentences) =  202.60182616369553
Perplexity of the ngram=6 model(Generated Sentences) =  64.23994681036501


In [None]:
# Training file for ngram = 7
train_7 = []
for sentence in sentences:
  s = '<s> <s> <s> <s> ' + sentence + ' </s> </s> </s> </s>'
  train_7.append(s)
# Note: a leading backslash in 'training...' would be read as the tab escape \t
with open('training_7_file.txt', 'w') as f:
    f.write('\n'.join(train_7))
# For n_gram = 7
n_gram = 7
n_gram_counts, vocab, n_minus_1_gram_counts, total_words = train("training_7_file.txt")
with open("/hw2-test_seven.txt", 'r') as fh:
    test_content = fh.read().split("\n")
num_sentences_1 = len(test_content)
ten_sentences_1 = test_content[:10]
print("# of test sentences: ", num_sentences_1)
probablities = []
for sentence in test_content:
  probablities.append(score(sentence))
probablities = np.array(probablities)
mean = np.mean(probablities)
std_dev = np.std(probablities)
print("Mean of the ngram =7 model:", mean)
print("Standard Deviation of the ngram=7 model:",std_dev)
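# Note: generate(10) is not called in this cell, so sentences_1 still holds the
# sentences generated by the n_gram = 6 model above.
print("Sentences:")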
for sentence in sentences_1:
  print(sentence)
print("Perplexity of the ngram=7 model(Test Sentences) = ", perplexity(" ".join(ten_sentences_1)))
print("Perplexity of the ngram=7 model(Generated Sentences) = ", perplexity(" ".join(sentences_1)))

# of test sentences:  100
Mean of the ngram =7 model: 2.8432328878286093e-13
Standard Deviation of the ngram=7 model: 1.740433885026636e-12
Sentences:
<s> <s> <s> <s> <s> i would like to eat some place <UNK> for lunch </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> the money doesn't matter </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i want to spend </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> is there a <UNK> restaurant </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i want to eat <UNK> </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i'd like a place for thai food today </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> i would like to eat dinner </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> tell me about jupiter </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> <UNK> food </s> </s> </s> </s> </s>
<s> <s> <s> <s> <s> <UNK> <UNK> </s> </s> </s> </s> </s>
Perplexity of the ngram=7 model(Test Sentences) =  213.62935860771307
Perplexity of the ngram=7 model(Generated Sentences) =  166.13182671604775


The perplexities for n_gram = 1, 2, and 3 match the expected values. For n_gram = 4, 5, 6, and 7, the models are trained on sentences generated by the trigram model, so their perplexities differ from the expected values.
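
The higher-order training files are built by wrapping each trigram-generated sentence in extra `<s>` and `</s>` markers, so that a model of order n sees n-1 markers on each side. The sketch below shows a generic version of that padding step. `pad_for_order` is a hypothetical helper, and using a single marker for n = 1 is an assumption.

```python
SENT_BEGIN = "<s>"
SENT_END = "</s>"

def pad_for_order(sentence, n):
    """Re-pad a marked-up sentence with n-1 begin and end markers (hypothetical helper)."""
    # Drop whatever markers the sentence already carries, then add the right number back
    words = [tok for tok in sentence.split() if tok not in (SENT_BEGIN, SENT_END)]
    pad = max(n - 1, 1)  # assumption: a unigram model still gets one marker per side
    return " ".join([SENT_BEGIN] * pad + words + [SENT_END] * pad)

trigram_sentence = "<s> <s> start over </s> </s>"
padded_4 = pad_for_order(trigram_sentence, 4)
print(padded_4)
```

For n = 4 this produces three markers on each side, which matches the format of the 4-gram sentences printed above.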

### **Explore and explain: (5 points)**
* Experiment with n-gram models for values of n of your choice in [1, 2, 3, ..., 7]. Explain which choice of n generates the most meaningful sentences.


We can see that the unigram model has the highest test perplexity (about 252.5). Perplexity drops for the trigram model (about 132.5) and then rises gradually as n increases, from about 173.1 at n_gram = 4 to about 213.6 at n_gram = 7. In my case, the sentences generated with n_gram = 3 were longer and more meaningful than those generated with the other n_gram values.
