# **1. N-Gram Language Models.**

N-gram models are simple language models that estimate the probability of the next word from the previous n - 1 words. Like every language model, from small statistical models to large transformer-based ones, they predict how likely a word is given the sequence of words that precedes it.

N-gram models are named after the size of the word window they use. A bigram model looks one word into the past to predict the probability of the next word, and a trigram model looks two words into the past. In general, an n-gram model conditions on the previous n - 1 words to predict the next word.
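
To make the naming concrete, here is a minimal sketch of how the word windows themselves can be extracted from a tokenized sentence. The helper name `extract_ngrams` and the example sentence are illustrative assumptions, not part of the models defined later in this notebook.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization, the same simplification used elsewhere in this notebook
tokens = "I am reading a book".split()

bigrams = extract_ngrams(tokens, 2)   # windows of two words
trigrams = extract_ngrams(tokens, 3)  # windows of three words

print(bigrams)
print(trigrams)
```

A five-word sentence yields four bigrams and three trigrams; a window longer than the sentence yields an empty list.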

We do not condition on the whole preceding sequence for two reasons: it is computationally expensive, and it does not lead to better predictions, since long word histories are rarely seen more than once in the training data. According to the **Markov assumption**, we do not need the entire history to estimate the next word; the last few words are enough:

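$$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1:k-1})$$

Here $n$ is the length of the word sequence and $N$ is the order of the n-gram model (2 for bigrams, 3 for trigrams), so each word is conditioned only on the $N - 1$ words before it.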
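
As a worked instance of this approximation, assume a bigram model (each word depends only on the word before it) and a four-word sequence. The exact chain rule keeps the full history of every word:

$$P(w_{1:4}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_{1:2})\, P(w_4 \mid w_{1:3})$$

Under the Markov assumption, each factor keeps only the single previous word:

$$P(w_{1:4}) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$$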





## **1.2 Bigram Language Model**

A bigram language model predicts the probability of the next word given only the single previous word; this two-word window is what gives the model its name. For example, given the word 'I', the word 'am' is much more likely to come next than 'is' or 'has'.

N-gram models, including bigram models, estimate these probabilities through **Maximum Likelihood Estimation (MLE)**. MLE is the technique of choosing the parameters of a statistical model so that the observed data becomes as probable as possible; that is, we fit the parameters so that the probabilities the model produces match the observed frequencies as closely as possible.

For a bigram model, the joint probability of a sequence is approximated as:

$$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$

The MLE can be written as:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}$$

The probability of word $w_n$ given the previous word $w_{n-1}$ equals the count of the bigram $C(w_{n-1} w_n)$ divided by the total count of all bigrams that share the same first word $w_{n-1}$.

This equation is intuitive: if the pair $w_{n-1} w_n$ occurs more often than pairs of $w_{n-1}$ with other words, the probability is high; otherwise it is low.
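
Before training on a real corpus, the counting formula can be checked by hand on a tiny example. The sketch below is only an illustration: the toy corpus and the names `toy_counts` and `bigram_probability` are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration), tokenized on whitespace
toy_words = "I am happy . I am here . I was late .".split()

# toy_counts[w1][w2] = number of times w2 directly follows w1
toy_counts = defaultdict(Counter)
for prev_word, next_word in zip(toy_words, toy_words[1:]):
    toy_counts[prev_word][next_word] += 1

def bigram_probability(prev_word, next_word):
    """MLE estimate: C(prev_word next_word) / sum over w of C(prev_word w)."""
    continuations = toy_counts.get(prev_word, Counter())
    total = sum(continuations.values())
    return continuations[next_word] / total if total else 0.0

# "I" starts three bigrams: "I am" twice and "I was" once
print(bigram_probability("I", "am"))   # 2 out of 3
print(bigram_probability("I", "was"))  # 1 out of 3
print(bigram_probability("I", "is"))   # never observed, so 0.0
```

The last line shows a limitation of plain MLE: any pair that never appears in the training data gets a probability of exactly zero.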

In [None]:
import random
from collections import defaultdict, Counter

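class BigramModel:
    """Bigram language model estimated by maximum likelihood from raw counts."""

    def __init__(self):
        # bigram_counts[w1][w2] = number of times w2 directly follows w1
        self.bigram_counts = defaultdict(Counter)
        # total_counts[w1] = number of bigrams that start with w1
        self.total_counts = Counter()

    def train(self, corpus):
        # Whitespace tokenization: punctuation stays attached to the words
        words = corpus.split()

        for i in range(len(words) - 1):
            current_word = words[i]
            next_word = words[i + 1]

            self.bigram_counts[current_word][next_word] += 1
            self.total_counts[current_word] += 1

    def predict(self, word):
        # A word never seen as the first half of a bigram has no known continuation
        if word not in self.bigram_counts:
            return None

        bigram_distribution = self.bigram_counts[word]
        total_count = self.total_counts[word]
        # MLE: P(next_word | word) = C(word next_word) / C(word as first word)
        probabilities = {next_word: count / total_count for next_word, count in bigram_distribution.items()}
        # Sample the next word in proportion to its probability
        next_word = random.choices(list(probabilities.keys()), list(probabilities.values()))[0]

        return next_word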

# Example usage:
with open('/content/drive/MyDrive/Natural-Language-Processing/internet_archive_scifi_v3.txt', 'r') as f:
    corpus = f.read()

bigram_model = BigramModel()
bigram_model.train(corpus)

current_word = "The"
predicted_next_word = bigram_model.predict(current_word)
print(f"Given the word '{current_word}', the predicted next word is '{predicted_next_word}'.")

Given the word 'The', the predicted next word is 'Mousterian'.


### **Generating Text with Bigram LM**

In [None]:
def generate_bigram_text(model, start_word, num_words):
    generated_text = [start_word]
    current_word = start_word

    for _ in range(num_words - 1):
        next_word = model.predict(current_word)
        if next_word is None:
            break

        generated_text.append(next_word)
        current_word = next_word

    return ' '.join(generated_text)

In [None]:
generate_bigram_text(bigram_model, "Iam", 70)

'Iam eager. "If I couldn\'t hold off the new as new streetlamps coming of what could predict a wonder -- " "So our regulations. Correct?" "Is that some bad accent, affected if I\'ve had been twenty thousand files I\'ll show it would." I recall," he had given clearance from wherever his own, almost impossible. Their notable for such a poor, beat-up, bedraggled each serving. Work Estimating g acceleration and her'

## **1.3 Trigram Language Model**

A trigram language model predicts the probability of the next word given the two previous words, hence the name. It is computed the same way as the bigram model; the only difference is the number of past words the model conditions on, which is two for a trigram and one for a bigram.

Joint probability in Trigram:

$$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2}, w_{k-1})$$

As with bigrams, we use Maximum Likelihood Estimation:

$$P(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2} w_{n-1} w_n)}{\sum_{w} C(w_{n-2} w_{n-1} w)}$$
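
The trigram estimate can be verified on a toy example in the same spirit. This is a minimal sketch under assumed inputs: the toy sentence and the names `toy_trigram_counts` and `trigram_probability` are illustrative only.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration), tokenized on whitespace
toy_tokens = "the cat sat on the mat . the cat ran on the road .".split()

# toy_trigram_counts[(w1, w2)][w3] = number of times w3 follows the pair (w1, w2)
toy_trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_trigram_counts[(w1, w2)][w3] += 1

def trigram_probability(w1, w2, w3):
    """MLE estimate: C(w1 w2 w3) / sum over w of C(w1 w2 w)."""
    continuations = toy_trigram_counts.get((w1, w2), Counter())
    total = sum(continuations.values())
    return continuations[w3] / total if total else 0.0

# The context ("the", "cat") is followed once by "sat" and once by "ran"
print(trigram_probability("the", "cat", "sat"))  # 1 out of 2
# The context ("the", "mat") is only ever followed by "."
print(trigram_probability("the", "mat", "."))    # 1 out of 1
```

The only structural change from the bigram case is that the context key is a pair of words rather than a single word.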


In [None]:
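class TrigramModel:
    """Trigram language model estimated by maximum likelihood from raw counts."""

    def __init__(self):
        # trigram_counts[(w1, w2)][w3] = number of times w3 follows the pair (w1, w2)
        self.trigram_counts = defaultdict(Counter)
        # total_counts[w1][w2] = number of trigrams that start with the pair (w1, w2)
        self.total_counts = defaultdict(Counter)

    def train(self, corpus):
        # Whitespace tokenization: punctuation stays attached to the words
        words = corpus.split()
        for i in range(len(words) - 2):
            prev_word1 = words[i]
            prev_word2 = words[i + 1]
            next_word = words[i + 2]

            context = (prev_word1, prev_word2)

            self.trigram_counts[context][next_word] += 1
            self.total_counts[prev_word1][prev_word2] += 1

    def predict(self, context):
        # context is a tuple holding the two previous words
        if context not in self.trigram_counts:
            return None

        trigram_distribution = self.trigram_counts[context]
        total_count = sum(trigram_distribution.values())

        # MLE: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2 as a trigram prefix)
        probabilities = {next_word: count / total_count for next_word, count in trigram_distribution.items()}
        # Sample the next word in proportion to its probability
        next_word = random.choices(list(probabilities.keys()), list(probabilities.values()))[0]

        return next_word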


trigram_model = TrigramModel()
trigram_model.train(corpus)

### **Generating text using trigram LM**

In [None]:
def generate_trigram_text(model, start_words, num_words):
    if len(start_words) < 2:
        raise ValueError("Provide at least two words to start")

    generated_text = start_words.copy()
    # The trigram context is always the last two words, even if more start words are given
    context = tuple(start_words[-2:])

    for _ in range(num_words):
        next_word = model.predict(context)
        if next_word is None:
            break

        generated_text.append(next_word)
        context = (context[1], next_word)

    return ' '.join(generated_text)


generate_trigram_text(trigram_model, ["Oh", "his"], 100)

'Oh his left hand. "You\'ll see a better story. And the gyros the moment and decided to outshine them." "Don\'t have to behave like solids most of those guns, Winnington, I command you to let him know, in carefully filtered screen, they seemed to have more on Palash. I returned good-naturedly. "Heard about it just came, what would have used. We cannot risk my head?" "They\'re the best all-round science-fiction writer comes along a streetcar on Earth, and here, also, the transition of power. Lon drove it, each connected by an intolerable nuisance." "There\'s no call to mind of an old man,'

## **2. Perplexity For Language Model Evaluation**

Perplexity is one of the metrics used to evaluate language models, including large language models. It measures how well a model predicts the next word given the previous words in a sample of text, and so indicates how uncertain the model is about the next word. A low perplexity means the model performs well: it assigns high probability to the tokens that actually appear in the test set. A high perplexity means the model is not confident in its predictions for the sequence of words.

Perplexity is closely related to cross-entropy: it is the exponential of the cross-entropy, that is, of the average negative log-probability the model assigns to each word given the words before it.

Here is how we calculate the cross-entropy loss (the negative log-likelihood) of a sequence, where each word is scored given the previous words:

$$\text{Loss} = -\sum_{i=1}^{n} \log P(w_i \mid w_{1:i-1})$$

Here we have no labels; instead, we compute the probability of each word given the previous ones, using probabilities estimated from the training set and words taken from the test set. When evaluating on a test set, the model assigns a probability to the actual next word in each context. If the model gives high probability to the words that actually occur in the test set, it is performing well. Essentially, the model's performance is measured by how closely its predicted probabilities match the actual occurrences of words in the test set.

Raw cross-entropy is hard to interpret when evaluating language models, so we convert it into perplexity by taking the exponential of the average cross-entropy loss:


$$\text{Perplexity} = \exp\left(\text{Loss} / n\right)$$



Here:

- $\text{Loss} = -\sum_{i=1}^{n} \log P(w_i \mid w_{1:i-1})$ is the cross-entropy loss summed over the sequence, and
- $n$ is the number of words in the sequence.
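
To see how the formula behaves, the following sketch computes perplexity directly from a list of per-word probabilities. The function name `perplexity_from_probabilities` and the probability values are made up for illustration; they do not come from the trained models in this notebook.

```python
import math

def perplexity_from_probabilities(probabilities):
    """exp(Loss / n), where Loss is the summed negative log-probability.

    Expects a non-empty list of probabilities in (0, 1].
    """
    n = len(probabilities)
    loss = -sum(math.log(p) for p in probabilities)
    return math.exp(loss / n)

# A model that always spreads its belief evenly over four candidate words
uniform = perplexity_from_probabilities([0.25] * 8)

# A model that gives high probability to each word that actually occurred
confident = perplexity_from_probabilities([0.9, 0.8, 0.95, 0.85])

print(uniform)    # approximately 4: as uncertain as a choice among four words
print(confident)  # close to 1: the model is rarely surprised
```

This gives perplexity a concrete reading: a value of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among four words at each step.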





In [None]:
import math

def calculate_perplexity(model, test_corpus, smoothing=1e-6):
    words = test_corpus.split()
    N = len(words) - 2  # Number of trigrams in the test corpus

    if N <= 0:
        return None  # Not enough data in test_corpus to calculate perplexity

    cross_entropy_loss = 0.0
    scored_trigrams = 0  # Trigrams whose context was seen during training

    for i in range(N):
        prev_word1 = words[i]
        prev_word2 = words[i + 1]
        actual_next_word = words[i + 2]

        context = (prev_word1, prev_word2)

        # A context never seen in training has no distribution to score against, so skip it
        trigram_distribution = model.trigram_counts.get(context)
        if not trigram_distribution:
            continue

        total_count = sum(trigram_distribution.values())
        actual_word_count = trigram_distribution.get(actual_next_word, 0)

        # Add smoothing value to avoid zero probability for unseen continuations
        probability = (actual_word_count + smoothing) / (total_count + smoothing * len(trigram_distribution))

        # Accumulate the negative log likelihood
        cross_entropy_loss -= math.log(probability)
        scored_trigrams += 1

    if scored_trigrams == 0:
        return None  # No test trigram had a context seen in training

    # Average over the trigrams actually scored rather than all N,
    # so that skipped trigrams do not artificially lower the perplexity
    cross_entropy_loss /= scored_trigrams

    # Calculate perplexity
    perplexity = math.exp(cross_entropy_loss)

    return perplexity

calculate_perplexity(trigram_model, corpus[:100000])

15.040316634418183