In this notebook, we first implement a simple bi-gram model and train it on a tiny corpus. We then use the model to generate text. This minimal implementation should make two things clear:
* How an n-gram model performs language modeling, i.e. how it computes P(next_word | prev_grams).
* How the P(next_word | prev_grams) values are used to generate text.
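
For the bi-gram case, the probability in the first bullet is commonly estimated by maximum likelihood as a ratio of counts. As a brief sketch, writing $C(a, b)$ for the number of times the bi-gram $(a, b)$ occurs in the training corpus (this notation is introduced here for illustration only):

```latex
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{\sum_{w} C(w_{n-1}, w)}
```

For example, if the start token is followed by "I" in 2 of 4 training sentences, then P(I | sos) = 2 / 4 = 0.5.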

We then examine text generated by bi-gram and 4-gram models trained on the IMDB movie review dataset. How well do our n-gram models write movie reviews?

In [None]:
import bisect
import collections
import numpy as np


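# Pad each sentence with `sos` and `eos` to mark its beginning and end.
# Returns new lists; the input sentences are never modified in place.
def _pad_sos_eos(corpus: list[list[str]], sos: str, eos: str):
  padded_corpus = []
  for sentence in corpus:
    # Copy first, so that appending `eos` below cannot mutate the caller's list.
    padded_sentence = list(sentence)
    if not padded_sentence or padded_sentence[0] != sos:
      padded_sentence = [sos] + padded_sentence
    if padded_sentence[-1] != eos:
      padded_sentence.append(eos)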
    print(f'input sentence = {sentence}')
    print(f'padded sentence = {padded_sentence}')
    padded_corpus.append(padded_sentence)
  return padded_corpus


# Go through the sentences in the corpus and return the next bigram to caller.
def _yield_bigrams(corpus: list[list[str]]):
  for sentence in corpus:
    for i in range(0, len(sentence) - 1):
      yield (sentence[i], sentence[i + 1])


# Implements a simple bi-gram model.
class SimpleBigram:
  def __init__(self):
    self.vocab = []
    self.counts = {}
    self._sos = '<sos>'
    self._eos = '<eos>'

  # Create the model's vocabulary from the training corpus.
  # We first use a set to obtain unique words in the corpus. We then convert
  # the set into a sorted list of words, from which we can deterministically
  # retrieve a word: e.g. "get the 3rd word in the list".
  def build_vocab(self, corpus: list[list[str]]):
    vocab = set()
    for sentence in corpus:
      vocab.update(sentence)
    self.vocab = sorted(list(vocab))
    print(f'Built vocabulary = {self.vocab}')

  # Count bigrams in the training corpus.
  # We go through the corpus and increment the self.counts entries.
  # self.counts[prev_word][next_word] holds the number of times the bi-gram
  # (prev_word, next_word) occurs in the corpus. These counts are what we need
  # to compute P(next_word | prev_word). For example:
  #   self.counts["I"]: counts of all bi-grams that start with "I".
  #   self.counts["I"]["mango"]: count of the bi-gram ("I", "mango").
  def count_bigrams(self, corpus: list[list[str]]):
    for word in self.vocab:
      self.counts[word] = collections.Counter()

    for prev_word, next_word in _yield_bigrams(corpus):
      self.counts[prev_word][next_word] += 1

  # Fit the bi-gram model to the training corpus. The corpus is a list of
  # tokenized sentences, e.g.
  # [["the", "sun", "is", "red"],
  #  ["hello", "world"]]
  def fit(self, corpus: list[list[str]]):
    padded_corpus = _pad_sos_eos(corpus, self._sos, self._eos)
    self.build_vocab(padded_corpus)
    self.count_bigrams(padded_corpus)

  # Compute P(next_word | prev_word).
  def p_next_word(self, next_word: str, prev_word: str) -> float:
    if prev_word not in self.counts:
      raise ValueError(f'{prev_word = } is oov')

    n_total = sum(self.counts[prev_word].values())
    p_next_word = self.counts[prev_word][next_word] / n_total
    return p_next_word

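  # Generate the next word, given prev_word.
  # For each word in the vocabulary, we compute P(<word> | prev_word).
  # We then "roll a die" to draw the next word according to the computed
  # next-token probabilities.
  # Notes:
  # * The values P(word | context) are deterministic (fixed by the counts);
  #   only the draw is random.
  # * The drawn word extends the context for the next step:
  #   context := context + ("apple" or "orange").
  # * For a training sentence "I like apple", fitting maximizes P(I | sos),
  #   P(like | I) and P(apple | I like); a bi-gram model only conditions on
  #   the last word, i.e. P(apple | like). The ground-truth targets are
  #   P(I | sos) = 1, P(like | I) = 1, ...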
  def generate_next_word(self, prev_word: str) -> str:
    next_word_probs = []
    for next_word in self.vocab:
      next_word_probs.append(self.p_next_word(next_word, prev_word))
    print(f'Counts of bi-grams starting with {prev_word}: {self.counts[prev_word]}')
    print(f'{prev_word = }. next word probalities = {dict((word, prob) for word, prob in zip(self.vocab, next_word_probs))}')
    print()

    next_word_id = np.nonzero(np.random.multinomial(1, next_word_probs))[0][0]
    return self.vocab[next_word_id]

  # Generate text, given a seed word. When seed_word is None, we use
  # seed_word = sos.
  # We start from the seed word and generate a next_word, as defined by
  # `generate_next_word`. We then use next_word as the new prev_word to
  # generate the word after it. We continue this process until an eos token
  # is generated, at which point generation terminates.
  def generate_text(self, seed_word: str | None = None) -> list[str]:
    seed_word = self._sos if seed_word is None else seed_word
    result = [seed_word]
    while result[-1] != self._eos:
      next_word = self.generate_next_word(prev_word=result[-1])
      result.append(next_word)
    return result

In [None]:
corpus = [['mango', 'is', 'yummy'],
          ['I', 'like', 'orange'],
          ['apple', 'sucks'],
          ['I', 'really', 'like', 'mango']]

simple_bigram = SimpleBigram()
simple_bigram.fit(corpus)

input sentence = ['mango', 'is', 'yummy']
padded sentence = ['<sos>', 'mango', 'is', 'yummy', '<eos>']
input sentence = ['I', 'like', 'orange']
padded sentence = ['<sos>', 'I', 'like', 'orange', '<eos>']
input sentence = ['apple', 'sucks']
padded sentence = ['<sos>', 'apple', 'sucks', '<eos>']
input sentence = ['I', 'really', 'like', 'mango']
padded sentence = ['<sos>', 'I', 'really', 'like', 'mango', '<eos>']
Built vocabulary = ['<eos>', '<sos>', 'I', 'apple', 'is', 'like', 'mango', 'orange', 'really', 'sucks', 'yummy']


In [None]:
simple_bigram.generate_text()

Counts of bi-grams starting with <sos>: Counter({'I': 2, 'mango': 1, 'apple': 1})
prev_word = '<sos>'. next word probalities = {'<eos>': 0.0, '<sos>': 0.0, 'I': 0.5, 'apple': 0.25, 'is': 0.0, 'like': 0.0, 'mango': 0.25, 'orange': 0.0, 'really': 0.0, 'sucks': 0.0, 'yummy': 0.0}

Counts of bi-grams starting with I: Counter({'like': 1, 'really': 1})
prev_word = 'I'. next word probalities = {'<eos>': 0.0, '<sos>': 0.0, 'I': 0.0, 'apple': 0.0, 'is': 0.0, 'like': 0.5, 'mango': 0.0, 'orange': 0.0, 'really': 0.5, 'sucks': 0.0, 'yummy': 0.0}

Counts of bi-grams starting with like: Counter({'orange': 1, 'mango': 1})
prev_word = 'like'. next word probalities = {'<eos>': 0.0, '<sos>': 0.0, 'I': 0.0, 'apple': 0.0, 'is': 0.0, 'like': 0.0, 'mango': 0.5, 'orange': 0.5, 'really': 0.0, 'sucks': 0.0, 'yummy': 0.0}

Counts of bi-grams starting with mango: Counter({'is': 1, '<eos>': 1})
prev_word = 'mango'. next word probalities = {'<eos>': 0.5, '<sos>': 0.0, 'I': 0.0, 'apple': 0.0, 'is': 0.5, 'like': 0.0,

['<sos>', 'I', 'like', 'mango', '<eos>']
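
The probabilities printed above can be checked by hand. The following is a minimal, standalone sketch (it rebuilds the counts from the toy corpus rather than reusing `SimpleBigram`; the helper names `bigram_prob` and `sequence_prob` are hypothetical and introduced only for illustration) that multiplies the bi-gram probabilities along the sampled sentence:

```python
import collections

# Same toy corpus as above, with <sos>/<eos> padding written out by hand.
toy_corpus = [
  ['<sos>', 'mango', 'is', 'yummy', '<eos>'],
  ['<sos>', 'I', 'like', 'orange', '<eos>'],
  ['<sos>', 'apple', 'sucks', '<eos>'],
  ['<sos>', 'I', 'really', 'like', 'mango', '<eos>'],
]

# bigram_counts[prev_word][next_word] = number of times the bi-gram occurs.
bigram_counts = collections.defaultdict(collections.Counter)
for sentence in toy_corpus:
  for prev_word, next_word in zip(sentence, sentence[1:]):
    bigram_counts[prev_word][next_word] += 1


def bigram_prob(next_word, prev_word):
  # Count-ratio estimate; 0.0 if prev_word has no observed successors.
  total = sum(bigram_counts[prev_word].values())
  return bigram_counts[prev_word][next_word] / total if total else 0.0


def sequence_prob(words):
  # Product of P(next | prev) over consecutive word pairs.
  prob = 1.0
  for prev_word, next_word in zip(words, words[1:]):
    prob *= bigram_prob(next_word, prev_word)
  return prob


sampled = ['<sos>', 'I', 'like', 'mango', '<eos>']
print(sequence_prob(sampled))  # 0.5 * 0.5 * 0.5 * 0.5 = 0.0625
```

Note that "I like mango" never appears as a sentence in the training corpus, yet it receives a non-zero probability because every one of its bi-grams was observed.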

In [None]:
import os

# Note: To load the files correctly, add the "Module 6 : Deep Dive Into LLMs"
# folder as a shortcut under "MyDrive".
from google.colab import drive
drive.mount('/content/drive')
assets_dir = '/content/drive/MyDrive/Module 6 : Deep Dive into LLMs - V2/Assignment and MCQs/datasets/'

Mounted at /content/drive


In [None]:
#@title Read IMDB movie review dataset.
import pandas as pd

df_reviews = pd.read_csv(os.path.join(assets_dir, 'IMDB_Dataset.csv'))
df_reviews.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df_reviews.iloc[0].review

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

### Train a bi-gram and a 4-gram model using the IMDB movie review corpus

In this section, we train n-gram models on the IMDB movie review dataset and use the trained models to write movie reviews. We use the nltk library, which provides n-gram model training and data preprocessing functionality. Read more in the [nltk documentation](https://www.nltk.org/index.html).

As the generated samples show, the n-gram models are not great movie review writers. They only track the counts of n-grams and have no understanding of grammar or word meaning. They also have no intrinsic knowledge of any movie. The models also struggle with coherence: each word is generated from only the previous few words, so different parts of a review often discuss unrelated movies or topics.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Training data is constructed as follows:


1.   Take the training corpus. For each text sequence (i.e. review), tokenize the sequence into words, and add start and end tokens.
2.   Flatten all padded reviews into a single stream of tokens, from which the vocabulary is built.
3.   Create a stream of n-grams from each padded review. nltk uses its `everygrams` function to create a generator that returns all unigrams, bi-grams, ..., k-grams (k is an argument).
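
To make steps 1 and 3 concrete, here is a small pure-Python sketch. It is not nltk's implementation: `pad_both_ends_sketch` and `everygrams_sketch` are hypothetical helpers written for illustration, and the order in which they emit n-grams is a choice made here, not a guarantee about nltk. The sketch does assume the nltk convention of padding `n - 1` boundary tokens on each side.

```python
def pad_both_ends_sketch(tokens, n, sos='<s>', eos='</s>'):
  # Pad with n - 1 boundary symbols per side, so that every real token has a
  # full (n - 1)-word left context.
  return [sos] * (n - 1) + list(tokens) + [eos] * (n - 1)


def everygrams_sketch(tokens, max_len):
  # Yield every contiguous n-gram with 1 <= n <= max_len.
  for start in range(len(tokens)):
    for length in range(1, max_len + 1):
      if start + length <= len(tokens):
        yield tuple(tokens[start:start + length])


padded = pad_both_ends_sketch(['good', 'movie'], n=2)
print(padded)
print(list(everygrams_sketch(padded, max_len=2)))
```

With `n=2`, the padded review is `['<s>', 'good', 'movie', '</s>']`, and the stream contains four unigrams and three bi-grams.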

In [None]:
# Create training data and vocabulary from IMDB movie reviews.
# Note that `padded_everygram_pipeline` returns generators. We need to call
# this function each time we train a model, since training exhausts the
# generators.
from nltk.lm.preprocessing import padded_everygram_pipeline

def get_train_data_and_vocab():
  all_reviews = df_reviews.review.to_list()
  # Each review is broken into a list of tokens:
  # ["I like dog", "good movie"] => [["I", "like", "dog"], ["good", "movie"]]
  all_reviews_tokens = [nltk.tokenize.word_tokenize(review, language='english', preserve_line=False)
                        for review in all_reviews]
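  # Pad start and end tokens onto each review. With order 4, nltk pads
  # 3 (= order - 1) tokens on each side; one per side is shown for brevity:
  # [["I", "like", "dog"], ["good", "movie"]] => ["<s>", "I", "like", "dog", "</s>", "<s>", "good", "movie", "</s>"] (unigram stream)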
  train_grams, vocab = padded_everygram_pipeline(4, all_reviews_tokens)
  return train_grams, vocab

In [None]:
from nltk.lm import MLE

# Train an n-gram model using IMDB movie review dataset.
def train_n_gram(n):
  train_grams, vocab = get_train_data_and_vocab()
  model = MLE(n)
  model.fit(train_grams, vocab)
  return model

In [None]:
review_4_gram = train_n_gram(4)
review_2_gram = train_n_gram(2)

### Generate movie reviews using the bi-gram and 4-gram models.

We trained a bi-gram model and a 4-gram model above and now use them to generate movie reviews. For each model, we seed the generation with "I" and generate 3 different reviews.

The bi-gram model has notably worse lucidity. Additionally, since both models are based purely on n-gram counts, the generated reviews lack coherence: different parts of the review tend to discuss different movies.


In [None]:
for _ in range(3):
  # `text_seed` is a sequence of tokens, so we pass the seed word in a list.
  new_review = review_4_gram.generate(num_words=100, text_seed=['I'])
  print(' '.join(new_review))

knew someone who saw Revenge of The Dorks . '' Perhaps the film could drag on forever and the scares are very subtle , not the Wolfman , or Lenny '' Of Mice and Men '' - an altogether different story. < br / > Spielberg is always put down for the money spent on them the numbers are well staged and electrifying . Prince is so tight that you are special in the same semester . We get lovable old grumpus Jonathan the prospector , his young brother is a bad movie but could n't make it seem boring
'm not really one i 've seen in awhile ! Hopefully , I 'll give him a first name , but this stuff is happening right before your eyes , please do , you will not want to see an independent film . This looks like a high-end hotel room - probably because they are freely displayed . What we got was as good as the baddest of the bad guys are cardboard cut-outs who all sound like a very foolish organization . In reality , that 's right they never have anything approaching a normal conversation in all of


In [None]:
for _ in range(3):
  # `text_seed` is a sequence of tokens, so we pass the seed word in a list.
  new_review = review_2_gram.generate(num_words=100, text_seed=['I'])
  print(' '.join(new_review))

did it 's teeth . If you busy trying not much better movies so beautiful woman who are for book . As the amount of us guessing for their parts feature is quite brilliant about the songs as a charming southerner gave a better trimmed otherwise it should be argued that target and crosses over to experience but he joins the Robocop cyborg , it on this cruddy low-brow magician turned out of America would 've seen . Director Sasaki ) , not very boring movies ' Ford Coppola 's parts : a single email , are the need .
like Verhoeven look at the 80 's the continuity of the trailer as one of the philosophy into a wash my reviews of just about a motorcycle race against the movie I 'd be it 's car , while its over-long one of the daughter ( a comedy , as , it . The plot does sleazy exploitation films , folks could hit the word THEY looked like people who demands . At least terrifying if you 're guaranteed commercial values as far the Russian film is . Sheridan and suspense low : from Le Blanc has 