# Lab: N-gram Language Models

## Objectives
In this lab, you will learn about:
- N-gram language models and probability calculations
- Various smoothing techniques
- Sentence generation using n-gram models
- Model evaluation using perplexity
- Working with Arabic texts


# Exercise 1: Exploring N-gram Language Models

In this exercise, we will delve into the fundamentals of n-gram language models, a crucial component in natural language processing. N-grams are sequences of words (or tokens) that provide insight into the structure and patterns of language. By analyzing n-grams, we can gain valuable statistics about word occurrences and relationships, which are essential for various applications, including text generation, sentiment analysis, and machine translation.

## Understanding N-grams and Context

An **n-gram** is defined as a contiguous sequence of $n$ items (typically words) from a given text.

**Context** refers to the preceding words that provide information about the likelihood of a given word occurring after them.

For example, in a bigram $ (w_{i-1}, w_i) $, the context is $ w_{i-1} $. The probability of $ w_i $ given the context $ w_{i-1} $ can be expressed as:

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
$$

where:
- $ C(w_{i-1}, w_i) $ is the count of the bigram $ (w_{i-1}, w_i) $.
- $ C(w_{i-1}) $ is the count of the context word $ w_{i-1} $.


For trigrams $ (w_{i-2}, w_{i-1}, w_i) $, the context is $ (w_{i-2}, w_{i-1}) $, and the probability can be expressed as:

$$
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}
$$

where:
- $ C(w_{i-2}, w_{i-1}, w_i) $ is the count of the trigram $ (w_{i-2}, w_{i-1}, w_i) $.
- $ C(w_{i-2}, w_{i-1}) $ is the count of the context bigram $ (w_{i-2}, w_{i-1}) $.



This formula provides a way to calculate the conditional probability of a word occurring based on its preceding context, which is critical for modeling language.
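
As a quick illustration of the bigram formula, the sketch below counts bigrams and their contexts with plain Python and evaluates two conditional probabilities. The three-sentence corpus and the names `toy_corpus` and `bigram_probability` are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus, used only for this illustration
toy_corpus = ["i love cake", "i love cookies", "i enjoy cake"]

toy_bigram_counts = Counter()
toy_context_counts = Counter()
for sentence in toy_corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        toy_bigram_counts[(prev, word)] += 1
        toy_context_counts[prev] += 1

def bigram_probability(prev, word):
    """Maximum-likelihood estimate C(prev, word) / C(prev); 0.0 if prev is unseen."""
    if toy_context_counts[prev] == 0:
        return 0.0
    return toy_bigram_counts[(prev, word)] / toy_context_counts[prev]

# "i" occurs 3 times as a context, 2 of them followed by "love"
print(bigram_probability("i", "love"))
# An unseen bigram gets an estimate of zero
print(bigram_probability("i", "hate"))
```

The zero for the unseen bigram is the problem that the later questions and the smoothing exercise address.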

## Questions

### Question 1
Write a Python function that takes a set of sentences as input and calculates the statistics of the n-grams (both bigrams and trigrams) along with their respective context counts.

### Question 2
Using the function you wrote in Question 1, compare the probabilities of the following sentences using both bigram and trigram models:
1. "I enjoy chocolate cake"
2. "I hate chocolate cake"
3. "He told me about it"

Calculate the probability of each sentence under both models. What observations can you make about the resulting probabilities? Are there any problems?

### Question 3
For unseen n-grams, assign a small probability of **0.0001** and repeat your calculations. What observations can you make about the resulting probabilities?

### Question 4
Instead of calculating the probabilities directly, use the log probabilities, applying the relation:

$$
\log(P_1 \times P_2 \times P_3) = \log(P_1) + \log(P_2) + \log(P_3)
$$

Recalculate the probabilities of the sentences in log space. What do you notice?
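
To see why the log form matters numerically, the sketch below multiplies a long run of small probabilities and compares the result with the sum of their logs. The 400 probabilities of 0.01 are invented purely to force the effect.

```python
import math

# 400 hypothetical n-gram probabilities of 0.01 each (a long sequence)
probs = [0.01] * 400

# The true product is 1e-800, far below what a 64-bit float can represent
product = math.prod(probs)
# The sum of logs stays a finite number that can still be compared across sentences
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The product collapses to zero, while the log sum remains usable for ranking sentences.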







In [1]:
!pip install nltk



In [13]:
import nltk
from nltk import bigrams, trigrams
from collections import Counter
from nltk.tokenize import word_tokenize
import math

# Download required NLTK data
nltk.download('punkt')

# Define the corpus (a collection of sentences)
corpus = [
    "I love chocolate cake",
    "I love chocolate ice cream",
    "I love cookies",
    "I enjoy vanilla cake",
    "I enjoy vanilla ice cream",
    "I enjoy cookies",
    "Chocolate is my favorite",
    "Vanilla cookies are great",
    "Chocolate cookies are also great"
]

# ---- Question 1: Calculate n-gram statistics ----

def prepare_sentences(sentences, n):
    """
    Prepare sentences by adding start and end tokens <s> and </s>.

    Input:
    - sentences: A list of sentences (strings).
    - n: The n-gram size (2 for bigrams, 3 for trigrams).

    Output:
    - A list of tokenized sentences with added start and end tokens.
    """
    processed = []

    #your code here
    for sentence in sentences:
        tokens = word_tokenize(sentence.lower())
        if n==2:
            processed.append(['<s>']+ tokens + ['</s>'])
        elif n== 3:
            processed.append(['<s>', '<s>']+ tokens + ['</s>'])

    return processed

def calculate_ngram_probabilities(sentences, n):
    """
    Calculate n-gram probabilities from sentences.

    Input:
    - sentences: A list of sentences (strings).
    - n: The n-gram size (2 for bigrams, 3 for trigrams).

    Output:
      1. n-gram counts (Counter object).
      2. (n-1)-gram context counts (Counter object).
    """
    processed_sentences = prepare_sentences(sentences, n)

    # Count n-grams and (n-1)-grams
    ngram_counts = Counter()
    context_counts = Counter()

    #your code here
    for sentence in processed_sentences:
        if n==2:
            ngrams = list(bigrams(sentence))
            contexts = [ng[0] for ng in ngrams]

        elif n==3:
            ngrams = list(trigrams(sentence))
            contexts = [(ng[0], ng[1]) for ng in ngrams]

        ngram_counts.update(ngrams)
        context_counts.update(contexts)

    print(ngram_counts)
    print(context_counts)
    ngram_probabilities = {}
    for ng in ngram_counts:
        context = ng[:-1]
        if isinstance(context, tuple) and len(context) == 1:
            context = context[0]
        ngram_probabilities[ng] = ngram_counts[ng] / context_counts[context]

    return ngram_counts, context_counts

def get_ngram_probability(ngram, context_counts, ngram_counts):
    """ Calculate the probability of a given n-gram. """
    context = ngram[:-1]
    # Bigram contexts are stored as plain strings, so unwrap 1-tuples
    if len(context) == 1:
        context = context[0]
    ngram_count = ngram_counts.get(ngram, 0)
    context_count = context_counts.get(context, 1)  # Default context_count to 1 for safe division

    # If n-gram is missing, use probability of 0.0001
    return ngram_count / context_count if ngram_count > 0 else 0.0001

# ---- Question 2: Calculate sentence probability ----
def calculate_sentence_probability(sentence, ngram_counts, context_counts, n):
    """
    Calculate the probability of a sentence using an n-gram model without smoothing.

    Input:
    - sentence: A sentence (string).
    - ngram_counts: Counts of n-grams (Counter object).
    - context_counts: Counts of (n-1)-gram contexts (Counter object).
    - n: The n-gram size (2 for bigrams, 3 for trigrams).

    Output:
    - Probability of the sentence (float).
    """
    tokens = word_tokenize(sentence.lower())

    # Add appropriate start and end tokens based on n
    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        ngrams = list(bigrams(tokens))
    elif n == 3:
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        ngrams = list(trigrams(tokens))

    sentence_probability = 1.0  # Initialize sentence probability as 1

    # your code here
    for ng in ngrams:
        context = ng[:-1]
        if isinstance(context, tuple) and len(context) == 1:
            context = context[0]

        ngram_count = ngram_counts.get(ng, 0)
        context_count = context_counts.get(context, 0)

        if ngram_count > 0:
            ngram_prob = ngram_count / context_count
        else:
            ngram_prob = 0

        sentence_probability *= ngram_prob

    return sentence_probability

# Example usage
bigram_counts, bigram_contexts = calculate_ngram_probabilities(corpus, 2)
trigram_counts, trigram_contexts = calculate_ngram_probabilities(corpus, 3)

# Example sentences
test_sentences = [
    "I enjoy chocolate cake",
    "I hate chocolate cake",
    "He told me about it"
]

for sent in test_sentences:
    print("\t\t---------------------------\t\t")
    # Calculate the sentence probability using the bigram model
    bigram_sentence_prob = calculate_sentence_probability(sent, bigram_counts, bigram_contexts, 2)
    print(f"Bigram model probability of '{sent}': {bigram_sentence_prob}")
print("----------------------------------------------------------------")

Counter({('<s>', 'i'): 6, ('i', 'love'): 3, ('i', 'enjoy'): 3, ('love', 'chocolate'): 2, ('cake', '</s>'): 2, ('ice', 'cream'): 2, ('cream', '</s>'): 2, ('cookies', '</s>'): 2, ('enjoy', 'vanilla'): 2, ('<s>', 'chocolate'): 2, ('cookies', 'are'): 2, ('great', '</s>'): 2, ('chocolate', 'cake'): 1, ('chocolate', 'ice'): 1, ('love', 'cookies'): 1, ('vanilla', 'cake'): 1, ('vanilla', 'ice'): 1, ('enjoy', 'cookies'): 1, ('chocolate', 'is'): 1, ('is', 'my'): 1, ('my', 'favorite'): 1, ('favorite', '</s>'): 1, ('<s>', 'vanilla'): 1, ('vanilla', 'cookies'): 1, ('are', 'great'): 1, ('chocolate', 'cookies'): 1, ('are', 'also'): 1, ('also', 'great'): 1})
Counter({'<s>': 9, 'i': 6, 'chocolate': 4, 'cookies': 4, 'love': 3, 'enjoy': 3, 'vanilla': 3, 'cake': 2, 'ice': 2, 'cream': 2, 'are': 2, 'great': 2, 'is': 1, 'my': 1, 'favorite': 1, 'also': 1})
Counter({('<s>', '<s>', 'i'): 6, ('<s>', 'i', 'love'): 3, ('<s>', 'i', 'enjoy'): 3, ('i', 'love', 'chocolate'): 2, ('ice', 'cream', '</s>'): 2, ('i', 'enjo

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
# ---- Question 3: Assign a small probability to unseen n-grams ----
def calculate_sentence_probability(sentence, ngram_counts, context_counts, n):
    """
    Calculate the probability of a sentence using an n-gram model, assigning 0.0001 to unseen n-grams.

    Input:
    - sentence: A sentence (string).
    - ngram_counts: Counts of n-grams (Counter object).
    - context_counts: Counts of (n-1)-gram contexts (Counter object).
    - n: The n-gram size (2 for bigrams, 3 for trigrams).

    Output:
    - Probability of the sentence (float).
    """
    tokens = word_tokenize(sentence.lower())

    # Add appropriate start and end tokens based on n
    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        ngrams = list(bigrams(tokens))
    elif n == 3:
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        ngrams = list(trigrams(tokens))

    sentence_probability = 1.0  # Initialize sentence probability as 1

    # your code here
    for ng in ngrams:
        context = ng[:-1]
        if isinstance(context, tuple) and len(context) == 1:
            context = context[0]

        ngram_count = ngram_counts.get(ng, 0)
        context_count = context_counts.get(context, 0)

        if ngram_count > 0:
            ngram_prob = ngram_count / context_count
        else:
            ngram_prob = 0.0001

        sentence_probability *= ngram_prob

    return sentence_probability

for sent in test_sentences:
    print("\t\t---------------------------\t\t")
    # Calculate the sentence probability using the bigram model
    bigram_sentence_prob = calculate_sentence_probability(sent, bigram_counts, bigram_contexts, 2)
    print(f"Bigram model probability of '{sent}': {bigram_sentence_prob}")
print("----------------------------------------------------------------")

		---------------------------		
Bigram model probability of 'I enjoy chocolate cake': 8.333333333333334e-06
		---------------------------		
Bigram model probability of 'I hate chocolate cake': 1.666666666666667e-09
		---------------------------		
Bigram model probability of 'He told me about it': 1.0000000000000001e-24
----------------------------------------------------------------


In [15]:
# ---- Question 4: Calculate sentence probability using log----
def calculate_log_sentence_probability(sentence, ngram_counts, context_counts, n):
    """
    Calculate the log probability of a sentence using an n-gram model, assigning 0.0001 to unseen n-grams.

    Input:
    - sentence: A sentence (string).
    - ngram_counts: Counts of n-grams (Counter object).
    - context_counts: Counts of (n-1)-gram contexts (Counter object).
    - n: The n-gram size (2 for bigrams, 3 for trigrams).

    Output:
    - Log probability of the sentence (float).
    """
    tokens = word_tokenize(sentence.lower())

    # Add appropriate start and end tokens based on n
    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        ngrams = list(bigrams(tokens))
    elif n == 3:
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        ngrams = list(trigrams(tokens))

    log_sentence_probability = 0.0  # Initialize log probability as 0

    #your code here
    for ng in ngrams:
        context = ng[:-1]
        if isinstance(context, tuple) and len(context) == 1:
            context = context[0]
        ngram_count = ngram_counts.get(ng, 0)
        context_count = context_counts.get(context, 0)
        if ngram_count > 0:
            ngram_prob = ngram_count / context_count
        else:
            ngram_prob = 0.0001

        log_sentence_probability += math.log(ngram_prob)

    return log_sentence_probability

for sent in test_sentences:
    print("\t\t---------------------------\t\t")
    # Log probability using the bigram model
    log_bigram_sentence_prob = calculate_log_sentence_probability(sent, bigram_counts, bigram_contexts, 2)
    # Output for bigram model
    print(f"Log bigram model probability of '{sent}': {log_bigram_sentence_prob}")
print("----------------------------------------------------------------")

		---------------------------		
Log bigram model probability of 'I enjoy chocolate cake': -11.695247021764182
		---------------------------		
Log bigram model probability of 'I hate chocolate cake': -20.212440213180418
		---------------------------		
Log bigram model probability of 'He told me about it': -55.26204223185709
----------------------------------------------------------------


# Exercise 2: Smoothing Techniques

In this exercise, we will explore various smoothing techniques used in language models to enhance the accuracy of probability estimates. Smoothing is essential in natural language processing, particularly when dealing with sparse data, where certain n-grams may not appear in the training set, leading to zero probabilities. By applying smoothing methods, we can adjust these probabilities to account for unseen data.

## Purpose of Smoothing in Language Models

Smoothing techniques aim to prevent zero probabilities for n-grams that do not occur in the training corpus. This is particularly important for language models, as a zero probability would imply that a particular sequence of words is impossible, which is not realistic in natural language. Smoothing helps create a more robust model that can generalize better to new, unseen text.

## Smoothing Methods

### 1. Add-One Smoothing
Also known as Laplace smoothing, this method adds one to the count of each n-gram and the vocabulary size to the denominator. The formula for calculating the probability in the case of trigrams is:

$$
P(w_i | w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i) + 1}{C(w_{i-2}, w_{i-1}) + V}
$$

where:
- $ C(w_{i-2}, w_{i-1}, w_i) $ is the count of the trigram $ (w_{i-2}, w_{i-1}, w_i) $.
- $ C(w_{i-2}, w_{i-1}) $ is the count of the context bigram $ (w_{i-2}, w_{i-1}) $.
- $ V $ is the vocabulary size.

### 2. Add-k Smoothing
A generalization of add-one smoothing, this method adds a constant $k$ (where $k > 0$) to the counts of all n-grams. The probability in the case of trigrams is computed as follows:

$$
P(w_i | w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i) + k}{C(w_{i-2}, w_{i-1}) + k \cdot V}
$$

where $k$ is a small constant (e.g., $k = 0.5$).
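
The add-one and add-k formulas can be checked on a tiny example. The sketch below is only illustrative: the function name `add_k_probability`, the four-word vocabulary, and the counts for the context `("i", "love")` are all assumed for the demonstration.

```python
from collections import Counter

def add_k_probability(trigram, trigram_counts, context_counts, vocab, k=1.0):
    """Add-k estimate for a trigram; k=1 gives add-one (Laplace) smoothing."""
    context = trigram[:-1]
    return (trigram_counts[trigram] + k) / (context_counts[context] + k * len(vocab))

# Hypothetical vocabulary and counts for the context ("i", "love")
toy_vocab = {"chocolate", "cookies", "cake", "</s>"}
toy_trigram_counts = Counter({("i", "love", "chocolate"): 2, ("i", "love", "cookies"): 1})
toy_context_counts = Counter({("i", "love"): 3})

for k in (1.0, 0.5):
    seen = add_k_probability(("i", "love", "chocolate"),
                             toy_trigram_counts, toy_context_counts, toy_vocab, k)
    unseen = add_k_probability(("i", "love", "cake"),
                               toy_trigram_counts, toy_context_counts, toy_vocab, k)
    # Summing over every word in the vocabulary should give 1 for any k > 0
    total = sum(add_k_probability(("i", "love", w),
                                  toy_trigram_counts, toy_context_counts, toy_vocab, k)
                for w in toy_vocab)
    print(f"k={k}: seen={seen:.3f}, unseen={unseen:.3f}, sum over vocabulary={total:.3f}")
```

A smaller $k$ moves less probability mass from seen to unseen n-grams, which is why $k$ is usually treated as a tunable parameter.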

### 3. Interpolation
This method combines probabilities from different n-gram models (e.g., unigrams, bigrams, trigrams) using a weighted average. The interpolated probability in the case of trigrams can be expressed as:

$$
P_{\text{interp}}(w_i | w_{i-2}, w_{i-1}) = \lambda_1 P_{1}(w_i) + \lambda_2 P_{2}(w_i | w_{i-1}) + \lambda_3 P_{3}(w_i | w_{i-2}, w_{i-1})
$$

where $ \lambda_1, \lambda_2, \lambda_3 $ are non-negative weights that sum to 1, and $ P_{1}, P_{2}, P_{3} $ are the probabilities from the unigram, bigram, and trigram models, respectively.

### 4. Stupid Backoff
Stupid backoff uses a simple heuristic to handle zero probabilities. If a trigram has a zero probability, it backs off to the lower-order n-grams with discounted probabilities. The probability for trigrams can be expressed as:

$$
P_{\text{backoff}}(w_i | w_{i-2}, w_{i-1}) =
\begin{cases}
\frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} & \text{if } C(w_{i-2}, w_{i-1}, w_i) > 0 \\
\alpha \cdot P_{\text{backoff}}(w_i | w_{i-1}) & \text{otherwise}
\end{cases}
$$

If the bigram probability $ P_{\text{backoff}}(w_i | w_{i-1}) $ is also zero, the model will further back off to the unigram probability:

$$
P_{\text{backoff}}(w_i | w_{i-1}) =
\begin{cases}
\frac{C(w_{i-1}, w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\
\alpha \cdot P_{\text{backoff}}(w_i) & \text{otherwise}
\end{cases}
$$

where $ \alpha $ is a discount factor (typically set to $ 0.4 $ or similar), and $ P_{\text{backoff}}(w_i) $ is the probability of the unigram.


## Questions
1. Implement the four aforementioned smoothing methods for both bigram and trigram models.

   **Note:** Return log probabilities to avoid numerical underflow with long sequences.

2. Using the smoothing methods described above, apply each technique to recalculate the probabilities of the sentences from Exercise 1:

   * "I enjoy chocolate cake"
   * "I hate chocolate cake"
   * "He told me about it"

3. Which smoothing method seems to work best for the test sentences used in Exercise 1, and why? Discuss the strengths and weaknesses of each method in the context of the sentences analyzed.





In [None]:
import nltk
from nltk import bigrams, trigrams
from collections import Counter
from nltk.tokenize import word_tokenize
import math

def get_vocabulary_size(sentences):
    """
    Helper function to compute the vocabulary size from a list of sentences.

    Parameters:
    - sentences (list): List of strings, each string being a sentence

    Returns:
    - int: Number of unique words in the vocabulary
    """
    vocab = set()

    # your code here
    for sentence in sentences:
      tokens = word_tokenize(sentence.lower())
      vocab.update(tokens)

    return len(vocab)


def add_one_smoothing(sentence, ngram_counts, context_counts, vocab_size, n):
    """
    Implements Add-One (Laplace) smoothing for n-gram language models.

    Mathematical equation:
    P(w_i|w_{i-n+1}^{i-1}) = [C(w_{i-n+1}^i) + 1] / [C(w_{i-n+1}^{i-1}) + |V|]

    Parameters:
    - sentence (str): Input sentence to calculate probability for
    - ngram_counts (Counter): Dictionary of n-gram counts from training data
    - context_counts (Counter): Dictionary of (n-1)-gram counts from training data
    - vocab_size (int): Size of vocabulary in training data
    - n (int): Order of n-gram model (2 for bigram, 3 for trigram)

    Returns:
    - float: Log probability of the sentence under the smoothed model
    """
    tokens = word_tokenize(sentence.lower())
    tokens = ['<s>'] * (n - 1) + tokens + ['</s>']


    sentence_probability = 0.0

    # your code here
    for i in range(n-1, len(tokens)):
      ngram = tuple(tokens[i - n + 1: i+1])
      context = tuple(tokens[i-n + 1: i])

      ngram_count = ngram_counts[ngram]
      context_count = context_counts[context]

      prob = (ngram_count + 1) / (context_count + vocab_size)
      sentence_probability += math.log(prob)

    return sentence_probability

def add_k_smoothing(sentence, ngram_counts, context_counts, vocab_size, n, k=0.5):
    """
    Implements Add-k smoothing (also known as Lidstone smoothing) for n-gram language models.

    Mathematical equation:
    P(w_i|w_{i-n+1}^{i-1}) = [C(w_{i-n+1}^i) + k] / [C(w_{i-n+1}^{i-1}) + k|V|]
    where:
    - k is the smoothing parameter (typically 0 < k < 1)
    - C(w_{i-n+1}^i) is the count of the n-gram
    - C(w_{i-n+1}^{i-1}) is the count of the context
    - |V| is the vocabulary size

    Parameters:
    - sentence (str): Input sentence to calculate probability for
    - ngram_counts (Counter): Dictionary of n-gram counts from training data
    - context_counts (Counter): Dictionary of (n-1)-gram counts from training data
    - vocab_size (int): Size of vocabulary in training data
    - n (int): Order of n-gram model (2 for bigram, 3 for trigram)
    - k (float): Smoothing parameter, defaults to 0.5

    Returns:
    - float: Log probability of the sentence under the smoothed model
    """
    tokens = word_tokenize(sentence.lower())

    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        ngrams = list(bigrams(tokens))
    elif n == 3:
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        ngrams = list(trigrams(tokens))

    sentence_probability = 0.0

    # your code here
    for ngram in ngrams:
        context = ngram[:-1]

        ngram_count = ngram_counts[ngram]
        context_count = context_counts[context]

        prob = (ngram_count + k) / (context_count + k * vocab_size)
        sentence_probability += math.log(prob)

    return sentence_probability

def interpolation_smoothing(sentence, unigram_counts, ngram_counts, context_counts,
                          total_words, n, lambdas=None):
    """
    Implements linear interpolation smoothing for n-gram models (n=2 or n=3).

    Mathematical equation for bigrams (n=2):
    P(w_i|w_{i-1}) = λ₁P(w_i) + λ₂P(w_i|w_{i-1})

    For trigrams (n=3):
    P(w_i|w_{i-2}w_{i-1}) = λ₁P(w_i) + λ₂P(w_i|w_{i-1}) + λ₃P(w_i|w_{i-2}w_{i-1})

    # We handle the zero probability case (interpolated proba) by using a very small value of 0.0001

    Parameters:
    - sentence (str): Input sentence
    - unigram_counts (Counter): Dictionary of unigram counts
    - ngram_counts (dict): N-gram Counters keyed by order (2: bigrams, 3: trigrams)
    - context_counts (dict): Context Counters keyed by context length (1 or 2)
    - total_words (int): Total word count in training data
    - n (int): Order of n-gram model (2 or 3)
    - lambdas (list): Interpolation weights. For bigrams: [λ₁, λ₂], for trigrams: [λ₁, λ₂, λ₃]

    Returns:
    - float: Log probability of the sentence under the interpolated model
    """
    tokens = word_tokenize(sentence.lower())

    # Set default lambdas if not provided
    if lambdas is None:
        if n == 2:
            lambdas = [0.3, 0.7]  # [unigram, bigram]
        else:
            lambdas = [0.1, 0.3, 0.6]  # [unigram, bigram, trigram]

    # Add appropriate start/end tokens
    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        start_index = 1
    else:  # n == 3
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        start_index = 2

    sentence_probability = 0.0

    def mle(ngram, order):
        # Maximum-likelihood estimate; 0.0 when the context was never seen
        context_count = context_counts[order - 1][ngram[:-1]]
        return ngram_counts[order][ngram] / context_count if context_count else 0.0

    # your code here
    for i in range(start_index, len(tokens)):
        word = tokens[i]

        # Unigram and bigram components are used for both n=2 and n=3
        prob = lambdas[0] * unigram_counts[word] / total_words
        prob += lambdas[1] * mle(tuple(tokens[i - 1:i + 1]), 2)
        # The trigram component (two words of context) only applies when n=3
        if n == 3:
            prob += lambdas[2] * mle(tuple(tokens[i - 2:i + 1]), 3)

        # Fall back to a small constant when every component is zero
        if prob == 0:
            prob = 0.0001
        sentence_probability += math.log(prob)

    return sentence_probability

def stupid_backoff(sentence, unigram_counts, ngram_counts, total_words,
                  context_counts, n, alpha=0.4):
    """
    Implements Stupid Backoff smoothing for n-gram models (n=2 or n=3).

    Mathematical equation for bigrams (n=2):
    S(w_i|w_{i-1}) =
        count(w_{i-1},w_i) / count(w_{i-1})     if count(w_{i-1},w_i) > 0
        α * count(w_i) / N                       otherwise

    For trigrams (n=3):
    S(w_i|w_{i-2},w_{i-1}) =
        count(w_{i-2},w_{i-1},w_i) / count(w_{i-2},w_{i-1})   if trigram exists
        α * S(w_i|w_{i-1})                                     otherwise

    # We handle the zero probability case (when unigram count is zero) by using a very small value of 0.0001

    Parameters:
    - sentence (str): Input sentence
    - unigram_counts (Counter): Dictionary of unigram counts
    - ngram_counts (dict): Dictionary of n-gram counts for different orders
    - total_words (int): Total word count in training data
    - context_counts (dict): Dictionary of context counts for different n-gram orders
    - n (int): Order of n-gram model (2 or 3)
    - alpha (float): Backoff penalty parameter

    Returns:
    - float: Log score of the sentence
    """
    tokens = word_tokenize(sentence.lower())

    if n == 2:
        tokens = ['<s>'] + tokens + ['</s>']
        start_index = 1
    else:  # n == 3
        tokens = ['<s>', '<s>'] + tokens + ['</s>']
        start_index = 2

    sentence_probability = 0.0

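    # your code here
    for i in range(start_index, len(tokens)):
        word = tokens[i]
        score = None
        penalty = 1.0

        # Try the highest order first, then back off to shorter contexts
        for order in range(n, 1, -1):
            ngram = tuple(tokens[i - order + 1:i + 1])
            if ngram_counts[order][ngram] > 0:
                score = penalty * ngram_counts[order][ngram] / context_counts[order - 1][ngram[:-1]]
                break
            penalty *= alpha  # each backoff step multiplies the score by alpha

        if score is None:
            # Unigram level; unseen words get the small constant 0.0001
            unigram_prob = unigram_counts[word] / total_words
            score = penalty * unigram_prob if unigram_prob > 0 else 0.0001

        sentence_probability += math.log(score)

    return sentence_probability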



"""
Main function demonstrating the usage of different smoothing methods with both
bigram and trigram models, as well as calculating and printing perplexities.
"""
nltk.download('punkt')

corpus = [
    "I love chocolate cake",
    "I love chocolate ice cream",
    "I love cookies",
    "I enjoy vanilla cake",
    "I enjoy vanilla ice cream",
    "I enjoy cookies",
    "Chocolate is my favorite",
    "Vanilla cookies are great",
    "Chocolate cookies are also great"
]

# Initialize counts
vocab_size = get_vocabulary_size(corpus)
unigram_counts = Counter()

# Create separate counters for different n-gram orders
ngram_counts = {1: Counter(), 2: Counter(), 3: Counter()}  # 1:unigram, 2:bigram, 3:trigram
context_counts = {1: Counter(), 2: Counter()}  # 1:unigram context, 2:bigram context

# Process corpus
for sentence in corpus:
    tokens = word_tokenize(sentence.lower())
    unigram_counts.update(tokens)

    # Add appropriate padding for different n-grams
    bigram_tokens = ['<s>'] + tokens + ['</s>']
    trigram_tokens = ['<s>', '<s>'] + tokens + ['</s>']

    # Update counts
    ngram_counts[1].update(tokens)  # unigrams
    for bigram in bigrams(bigram_tokens):
        ngram_counts[2][bigram] += 1
        context_counts[1][bigram[:-1]] += 1
    for trigram in trigrams(trigram_tokens):
        ngram_counts[3][trigram] += 1
        context_counts[2][trigram[:-1]] += 1

total_words = sum(unigram_counts.values())

# Test sentences
test_sentences = [
    "I enjoy chocolate cake",
    "I hate chocolate cake",
    "He told me about it"
]

# Test each smoothing method with both bigrams and trigrams
for sentence in test_sentences:
    print(f"\nAnalyzing sentence: '{sentence}'")
    sentence_length = len(word_tokenize(sentence))

    print("\nBigram Models (n=2):")
    for method, func in [("Add-one smoothing", add_one_smoothing)]:
        log_prob = func(sentence, ngram_counts[2], context_counts[1], vocab_size, 2)
        print(f"{method}: Log probability = {log_prob}")

    print("\nTrigram Models (n=3):")
    for method, func in [("Add-one smoothing", add_one_smoothing)]:
        log_prob = func(sentence, ngram_counts[3], context_counts[2], vocab_size, 3)
        print(f"{method}: Log probability = {log_prob}")


Analyzing sentence: 'I enjoy chocolate cake'

Bigram Models (n=2):
Add-one smoothing: Log probability = 0.0

Trigram Models (n=3):
Add-one smoothing: Log probability = 0.0

Analyzing sentence: 'I hate chocolate cake'

Bigram Models (n=2):
Add-one smoothing: Log probability = 0.0

Trigram Models (n=3):
Add-one smoothing: Log probability = 0.0

Analyzing sentence: 'He told me about it'

Bigram Models (n=2):
Add-one smoothing: Log probability = 0.0

Trigram Models (n=3):
Add-one smoothing: Log probability = 0.0


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Exercise 3: Perplexity Evaluation

In this exercise, we will evaluate the performance of your n-gram language model using perplexity. This metric is widely used to assess how well a probability distribution predicts a sample.

## Understanding Perplexity

**Perplexity** is defined as the exponentiation of the average negative log-likelihood of a sequence. It provides insight into how effectively a language model can predict a given sequence of words. A lower perplexity score indicates that the model is more confident in its predictions, while a higher score suggests greater uncertainty.

The formula for calculating perplexity for a test set is:

$$
\text{Perplexity} = e^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_{i-1})}
$$

where:
- $N$ is the total number of words in the test set.
- $P(w_i | w_{i-1})$ is the probability of the word $w_i$ given its preceding word $w_{i-1}$ as predicted by your n-gram model.

### Significance of Perplexity

Perplexity is useful for comparing different language models; a model with lower perplexity generally performs better. It effectively captures how well the model generalizes to unseen data.
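
A small numerical check can make the definition concrete. The sketch below computes perplexity directly from a list of per-token probabilities; the helper name `perplexity_from_probs` and the probability values are invented for illustration and are not tied to the models built above.

```python
import math

def perplexity_from_probs(token_probs):
    """Exponential of the average negative log probability of the predicted tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A model that spreads probability uniformly over 8 words has perplexity 8
uniform = perplexity_from_probs([1 / 8] * 5)
# More confident predictions give a lower perplexity
confident = perplexity_from_probs([0.5, 0.25, 0.5, 0.25])

print(uniform)
print(confident)
```

Read this way, perplexity behaves like an effective branching factor: the number of equally likely choices the model is hesitating between at each step.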

## Implementation
- Use the test data to evaluate the two language models built previously, computing perplexity under each of the different smoothing methods.




In [None]:
def calculate_perplexity(log_probability, sentence_length):
    """
    Calculate the perplexity of a sentence.

    Parameters:
    - log_probability (float): Total log probability of the sentence
    - sentence_length (int): Length of the sentence in words

    Returns:
    - float: Perplexity of the sentence
    """
    return math.exp(-log_probability / sentence_length)

# your code here

# Exercise 4: Sentence Generation

In this exercise, we will focus on generating sentences using a trained n-gram language model. We will implement a random generation method that selects the next word from the top five most probable words based on a specific context. This will help us understand how context influences word choice in language generation.

## Task: Generate Sentences Using N-gram Models

You will implement a function to generate sentences based on bigram, trigram, and four-gram models. Additionally, you will utilize the Add-One smoothing method to calculate and print the probabilities of the generated sentences.

### Text Corpus

For this exercise, you will use the **Gutenberg corpus** from the NLTK library. You can access this corpus by importing the required module and selecting a suitable text, such as "Alice's Adventures in Wonderland" or any other available text.

### Steps to Complete:

1. **Sampling Sentences**:
   - Implement a function that generates sentences by:
     - Starting with a seed word or phrase.
     - Using the bigram model to select the next word based on the current context.
     - Repeating the process using the trigram model and then the four-gram model.
     - For each generated word, select from the top five most probable candidates.

2. **Calculate Probabilities**:
   - Use different smoothing methods to compute and print the probability of each generated sentence.
   - Analyze which smoothing method yields the best results.
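
The sampling step can be sketched for the bigram case as follows. This is a minimal illustration under several assumptions: a three-sentence toy corpus replaces the Gutenberg text, the helper names `build_bigram_table` and `generate` are hypothetical, and the choice among the top five candidates is weighted by their counts (a uniform choice would be an equally valid reading of the task).

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(sentences):
    """Map each context word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            table[prev][nxt] += 1
    return table

def generate(table, seed="<s>", top_k=5, max_len=15, rng=None):
    """Extend the seed word by sampling among the top_k most frequent successors."""
    rng = rng or random.Random(0)
    words = [seed]
    while len(words) < max_len:
        candidates = table[words[-1]].most_common(top_k)
        if not candidates:
            break  # the current word never appeared as a context
        choices, weights = zip(*candidates)
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == "</s>":
            break  # the model chose to end the sentence
        words.append(nxt)
    return [w for w in words if w != "<s>"]

# Hypothetical toy corpus standing in for a tokenized Gutenberg text
toy_sentences = [s.split() for s in [
    "i love chocolate cake",
    "i enjoy vanilla cake",
    "chocolate cookies are great",
]]
toy_table = build_bigram_table(toy_sentences)
print(" ".join(generate(toy_table, seed="i", rng=random.Random(1))))
```

Extending this to trigram and four-gram models mainly means keying the table on tuples of the last two or three words instead of a single word.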




# Exercise 5: Simplified Bigram EM Algorithm

## Background
The EM (Expectation-Maximization) algorithm is often used in NLP to optimize mixture weights for combining different language models, such as unigram and bigram models. This exercise explores a simplified version of this approach, focusing on calculating and updating mixture weights for unigram and bigram probabilities using a small corpus.

## Corpus with Sentence Boundaries
We are given a small corpus with three sentences and sentence boundaries, as follows:

1. `a b a`
2. `b c b`
3. `a b c`

## Steps of the EM Algorithm
1. **Initialization**: Start with initial values for the mixture weights: λ₁ = 0.5 and λ₂ = 0.5.
   
2. **E-Step (Expectation Step)**:
   - For each bigram $(w_{i-1}, w_i)$ in the corpus, calculate the combined probability (total_prob) using the current λ values:

$$
\mathrm{total\_prob}(w_{i-1}, w_i) = \lambda_1 \cdot P_{\text{unigram}}(w_i) + \lambda_2 \cdot P_{\text{bigram}}(w_i \mid w_{i-1})
$$
     
   - Compute the unigram and bigram contributions for each bigram:
     - **Unigram contribution**: $ \frac{\lambda_1 \cdot P_{\text{unigram}}(w_i)}{\mathrm{total\_prob}(w_{i-1}, w_i)} $
     - **Bigram contribution**: $ \frac{\lambda_2 \cdot P_{\text{bigram}}(w_i \mid w_{i-1})}{\mathrm{total\_prob}(w_{i-1}, w_i)} $

3. **M-Step (Maximization Step)**:
   - Sum up the unigram and bigram contributions across all bigrams in the corpus.
   - Update the λ values using the total unigram and bigram contributions:
     $$
     \lambda_1 = \frac{\text{Total Unigram Contributions}}{\text{Total Contributions (Unigram + Bigram)}}
     $$
     $$
     \lambda_2 = \frac{\text{Total Bigram Contributions}}{\text{Total Contributions (Unigram + Bigram)}}
     $$

4. **Iteration and Convergence**:
   - Repeat the E-step and M-step until the changes in λ values are below a specified threshold (e.g., 0.001).
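
The four steps above can be sketched in a few lines of Python. This is one possible implementation under stated assumptions, not the required solution: the function name `em_mixture_weights`, the `max_iter` safety cap, and the choice to test convergence on λ₁ only (λ₂ is always 1 − λ₁) are choices made for this example. Unigram and bigram probabilities are estimated from the same boundary-padded corpus.

```python
from collections import Counter

def em_mixture_weights(sentences, lam1=0.5, lam2=0.5, tol=1e-3, max_iter=1000):
    """Return the list of (lambda1, lambda2) pairs, starting with the initial values."""
    padded = [["<s>"] + s.split() + ["</s>"] for s in sentences]
    unigram_counts = Counter(w for sent in padded for w in sent)
    bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))
    total_tokens = sum(unigram_counts.values())

    history = [(lam1, lam2)]
    for _ in range(max_iter):
        uni_total = bi_total = 0.0
        # E-step: split each bigram occurrence between the two models
        for (prev, word), count in bigram_counts.items():
            p_uni = unigram_counts[word] / total_tokens
            p_bi = count / unigram_counts[prev]
            total_prob = lam1 * p_uni + lam2 * p_bi
            uni_total += count * lam1 * p_uni / total_prob
            bi_total += count * lam2 * p_bi / total_prob
        # M-step: renormalize the summed contributions
        new_lam1 = uni_total / (uni_total + bi_total)
        new_lam2 = bi_total / (uni_total + bi_total)
        history.append((new_lam1, new_lam2))
        converged = abs(new_lam1 - lam1) < tol
        lam1, lam2 = new_lam1, new_lam2
        if converged:
            break
    return history

history = em_mixture_weights(["a b a", "b c b", "a b c"])
for step, (l1, l2) in enumerate(history):
    print(f"step {step}: lambda1={l1:.4f}, lambda2={l2:.4f}")
```

Because the weights are re-estimated on the same data used to compute the bigram probabilities, the bigram model explains every training bigram at least as well as the unigram model, so λ₂ grows at each iteration. In practice the weights are usually tuned on held-out data for this reason.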

## Questions

1. What are the updated values of λ₁ and λ₂ after the first EM step?
2. What are the final values of λ₁ and λ₂ after the algorithm has converged?
3. How many iterations did it take for the algorithm to converge?
4. How would the results change if λ₁ and λ₂ were initialized with different values?


# Exercise 6: Training and Evaluating an n-gram Language Model using KenLM

## Objective
Learn to train an n-gram language model using KenLM, apply different smoothing techniques, and evaluate the model’s performance through perplexity. You’ll also print the model vocabulary, calculate sentence probabilities, and generate random sentences.

## Instructions

### 1. Setup
- Load the provided corpus, **"TED2013.ar-en.en"**, and use it to train a KenLM trigram language model.

### 2. Sentence Probability Calculation
- Use the trained model to calculate the probability of a given sentence.

### 3. Sentence Smoothness Evaluation
- Use the model to rank sentences based on their linguistic smoothness.

### 4. Arabic Usage
- Perform the same tasks using the provided Arabic corpus, **"TED2013.ar-en.ar"**.


In [None]:
# Step 1: Install KenLM if not already installed
# `pip install kenlm` provides the Python module used to query models;
# the clone-and-build step produces the `lmplz` binary used for training.
!pip install kenlm
!git clone https://github.com/kpu/kenlm.git
!cd kenlm && mkdir build && cd build && cmake .. && make -j4

In [None]:
!kenlm/build/bin/lmplz -h

In [None]:
import kenlm
import random
import os
from collections import defaultdict

def train_ngram_model(corpus_path, model_path, order=3):
    """
    Goal:
        Train an n-gram model using KenLM on a specified corpus and save the model.

    Input:
        - corpus_path (str): Path to the corpus text file for training the model.
        - model_path (str): Path to save the trained ARPA model file.
        - order (int): Order of the n-gram model (e.g., 3 for a trigram model).

    Output:
        - Saves the trained model as an ARPA file at the specified model_path.
    """
    from subprocess import run

    command = [
        'kenlm/build/bin/lmplz',  # Full path to lmplz in Colab
        '--order', str(order),
        '--text', corpus_path,
        '--arpa', model_path,
        '--temp_prefix', '/tmp'
    ]

    print(f"Training a {order}-gram model...")
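    result = run(command, capture_output=True, text=True)

    # lmplz typically writes its progress log to stderr even on success,
    # so use the exit code (not the presence of stderr) to detect failure.
    print(result.stdout)
    if result.stderr:
        print("lmplz log:", result.stderr)
    if result.returncode != 0:
        raise RuntimeError(f"lmplz exited with code {result.returncode}")

    print(f"Model saved to {model_path}")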

def load_ngram_probabilities(model_path):
    """
    Goal:
        Load n-gram probabilities from an ARPA file into a nested dictionary structure.

    Input:
        - model_path (str): Path to the ARPA model file to load.

    Output:
        - dict: A nested dictionary where each key is a context (tuple of words)
                and values are dictionaries mapping next words to their
                log10 probabilities, as stored in the ARPA file.
    """
    ngram_probs = defaultdict(lambda: defaultdict(float))
    current_order = 0

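    # Read as UTF-8 so that Arabic tokens decode consistently across platforms
    # (assumes the training corpus was UTF-8).
    with open(model_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue

            # Section headers look like "\data\", "\3-grams:" or "\end\"
            if line.startswith('\\'):
                if 'grams:' in line:
                    current_order = int(line[1:].split('-')[0])
                continue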

            parts = line.split('\t')
            if len(parts) >= 2:
                prob = float(parts[0])
                ngram = tuple(parts[1].split())
                if len(ngram) == current_order:
                    context = ngram[:-1]
                    next_word = ngram[-1]
                    ngram_probs[context][next_word] = prob

    return ngram_probs

def calculate_sentence_probability(model, sentence):
    """
    Calculate the probability of a given sentence using the KenLM model.

    Parameters:
        model (kenlm.Model): The trained KenLM model.
        sentence (str): The sentence whose probability needs to be calculated.

    Returns:
        float: The probability of the sentence.
    """
    # KenLM returns the total log10 probability of the sentence,
    # so the probability is 10 raised to that score.
    log_prob = model.score(sentence, bos=True, eos=True)
    probability = 10 ** log_prob
    return probability

def sort_sentences_by_smoothness(model, test_sentences):
    # add your code here
    pass


# Example usage
def main():
    model_path = 'ngram_model.arpa'

    # Load the KenLM model
    model = kenlm.Model(model_path)

    # Define test sentences
    test_sentences1 = [
        "The ocean is vast and mysterious.",
        "ocean The mysterious is vast.",
        "The vast ocean is mysterious.",
        "The mysterious vast ocean is and.",
    ]

    test_sentences2 = [
        "The sun rises in the east and sets in the west.",
        "The rises sun in the east and sets in the west.",
        "The sun rises in the west and sets in the east.",
        "The sun west rises in east the  and sets in the."
    ]

    # your code here

if __name__ == "__main__":
    main()
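
One possible way to approach the smoothness ranking is to sort sentences by per-token perplexity, so that longer sentences are not penalized just for their length. The sketch below is an illustration under assumptions, not the expected solution: `rank_by_perplexity` is a hypothetical helper, and `ToyModel` is a made-up stand-in that only mimics the `score(sentence, bos=True, eos=True)` call of `kenlm.Model` so the example runs without a trained model.

```python
def rank_by_perplexity(model, sentences):
    """Return (sentence, perplexity) pairs sorted from most to least fluent."""
    scored = []
    for sentence in sentences:
        log10_prob = model.score(sentence, bos=True, eos=True)
        n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
        perplexity = 10 ** (-log10_prob / n_tokens)
        scored.append((sentence, perplexity))
    return sorted(scored, key=lambda item: item[1])

class ToyModel:
    """Hypothetical stand-in with the same score() signature as kenlm.Model."""
    def __init__(self, known_bigrams):
        self.known = set(known_bigrams)

    def score(self, sentence, bos=True, eos=True):
        tokens = (["<s>"] if bos else []) + sentence.split() + (["</s>"] if eos else [])
        # Made-up log10 scores: seen bigrams are cheap, unseen ones are costly
        return sum(-0.5 if pair in self.known else -3.0
                   for pair in zip(tokens, tokens[1:]))

toy = ToyModel([("<s>", "the"), ("the", "sun"), ("sun", "rises"), ("rises", "</s>")])
ranked = rank_by_perplexity(toy, ["rises sun the", "the sun rises"])
for sentence, ppl in ranked:
    print(f"{ppl:10.2f}  {sentence}")
```

With a real model, `toy` would be replaced by the `kenlm.Model` loaded in `main()`.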


# **Appendix: Detailed Calculations for the First EM Step (Exercise 5)**

## Initial Setup
- Initial λ₁ = 0.5, λ₂ = 0.5

## Corpus with Sentence Boundaries
Sentences:
- `['<s>', 'a', 'b', 'a', '</s>']`
- `['<s>', 'b', 'c', 'b', '</s>']`
- `['<s>', 'a', 'b', 'c', '</s>']`

## Unigram Counts
- Count of "\<s>": 3
- Count of "a": 3
- Count of "b": 4
- Count of "c": 2
- Count of "\</s>": 3
- **Total** = 15

## Unigram Probabilities
$$
P_{\text{unigram}} =
\begin{cases}
    P(\text{"<s>"}) = \frac{3}{15} = 0.2000 \\
    P(\text{"a"}) = \frac{3}{15} = 0.2000 \\
    P(\text{"b"}) = \frac{4}{15} \approx 0.2667 \\
    P(\text{"c"}) = \frac{2}{15} \approx 0.1333 \\
    P(\text{"</s>"}) = \frac{3}{15} = 0.2000
\end{cases}
$$

## Bigram Counts
- Count of ("\<s>", "a"): 2
- Count of ("\<s>", "b"): 1
- Count of ("a", "b"): 2
- Count of ("b", "a"): 1
- Count of ("b", "c"): 2
- Count of ("c", "b"): 1
- Count of ("a", "\</s>"): 1
- Count of ("b", "\</s>"): 1
- Count of ("c", "\</s>"): 1

## Bigram Probabilities
$$
P_{\text{bigram}} =
\begin{cases}
    P(\text{"<s>", "a"}) = \frac{2}{3} \approx 0.6667 \\
    P(\text{"<s>", "b"}) = \frac{1}{3} \approx 0.3333 \\
    P(\text{"a", "b"}) = \frac{2}{3} \approx 0.6667 \\
    P(\text{"b", "a"}) = \frac{1}{4} = 0.25 \\
    P(\text{"b", "c"}) = \frac{2}{4} = 0.5 \\
    P(\text{"c", "b"}) = \frac{1}{2} = 0.5 \\
    P(\text{"a", "</s>"}) = \frac{1}{3} \approx 0.3333 \\
    P(\text{"b", "</s>"}) = \frac{1}{4} = 0.25 \\
    P(\text{"c", "</s>"}) = \frac{1}{2} = 0.5
\end{cases}
$$
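
The counts and probabilities above can be checked with a short script. This is a verification sketch only; the variable names are chosen for this example, and exact fractions are used so the values can be compared directly with the tables.

```python
from collections import Counter
from fractions import Fraction

corpus = ["a b a", "b c b", "a b c"]
padded = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigram_counts = Counter(w for sent in padded for w in sent)
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())

# Unigram probability: count of the word over the total number of tokens
p_unigram = {w: Fraction(c, total_tokens) for w, c in unigram_counts.items()}

# Bigram probability: count of the pair over the count of its first word
p_bigram = {
    (prev, w): Fraction(c, unigram_counts[prev])
    for (prev, w), c in bigram_counts.items()
}

for pair, p in sorted(p_bigram.items()):
    print(pair, p, round(float(p), 4))
```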

# Step 1: E-Step Calculation for Each Sentence

The combined probability $\text{total\_prob}$ for each bigram $(w_{i-1}, w_i)$ mixes the unigram and bigram probabilities using the initial values of $\lambda_1$ and $\lambda_2$:

$$
\text{total\_prob}(w_{i-1}, w_i) = \lambda_1 \cdot P_{\text{unigram}}(w_i) + \lambda_2 \cdot P_{\text{bigram}}(w_{i-1}, w_i)
$$

### First Sentence: ["\<s>", "a", "b", "a", "\</s>"]

1. **Bigram ("\<s>", "a")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.6667 = 0.4334 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.4334} = 0.2307 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.6667)}{0.4334} = 0.7693 $

2. **Bigram ("a", "b")**
   - $ \text{total_prob} = 0.5 \times 0.2667 + 0.5 \times 0.6667 = 0.4667 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2667)}{0.4667} = 0.2857 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.6667)}{0.4667} = 0.7143 $

3. **Bigram ("b", "a")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.2500 = 0.2250 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.2250} = 0.4444 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.2500)}{0.2250} = 0.5556 $

4. **Bigram ("a", "\</s>")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.3333 = 0.2667 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.2667} = 0.3750 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.3333)}{0.2667} = 0.6250 $

**Total for first sentence:**
- **Unigram contributions**: $0.2307 + 0.2857 + 0.4444 + 0.3750 = 1.3358$
- **Bigram contributions**: $0.7693 + 0.7143 + 0.5556 + 0.6250 = 2.6642$

### Second Sentence: ["\<s>", "b", "c", "b", "\</s>"]

1. **Bigram ("\<s>", "b")**
   - $ \text{total_prob} = 0.5 \times 0.2667 + 0.5 \times 0.3333 = 0.3000 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2667)}{0.3000} = 0.4445 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.3333)}{0.3000} = 0.5555 $

2. **Bigram ("b", "c")**
   - $ \text{total_prob} = 0.5 \times 0.1333 + 0.5 \times 0.5000 = 0.3167 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.1333)}{0.3167} = 0.2105 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.5000)}{0.3167} = 0.7895 $

3. **Bigram ("c", "b")**
   - $ \text{total_prob} = 0.5 \times 0.2667 + 0.5 \times 0.5000 = 0.3834 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2667)}{0.3834} = 0.3478 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.5000)}{0.3834} = 0.6522 $

4. **Bigram ("b", "\</s>")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.2500 = 0.2250 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.2250} = 0.4444 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.2500)}{0.2250} = 0.5556 $

**Total for second sentence:**
- **Unigram contributions**: $0.4445 + 0.2105 + 0.3478 + 0.4444 = 1.4472$
- **Bigram contributions**: $0.5555 + 0.7895 + 0.6522 + 0.5556 = 2.5528$

### Third Sentence: ["\<s>", "a", "b", "c", "\</s>"]

1. **Bigram ("\<s>", "a")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.6667 = 0.4334 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.4334} = 0.2307 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.6667)}{0.4334} = 0.7693 $

2. **Bigram ("a", "b")**
   - $ \text{total_prob} = 0.5 \times 0.2667 + 0.5 \times 0.6667 = 0.4667 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2667)}{0.4667} = 0.2857 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.6667)}{0.4667} = 0.7143 $

3. **Bigram ("b", "c")**
   - $ \text{total_prob} = 0.5 \times 0.1333 + 0.5 \times 0.5000 = 0.3167 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.1333)}{0.3167} = 0.2105 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.5000)}{0.3167} = 0.7895 $

4. **Bigram ("c", "\</s>")**
   - $ \text{total_prob} = 0.5 \times 0.2000 + 0.5 \times 0.5000 = 0.3500 $
   - $ \text{unigram_contribution} = \frac{(0.5 \times 0.2000)}{0.3500} = 0.2857 $
   - $ \text{bigram_contribution} = \frac{(0.5 \times 0.5000)}{0.3500} = 0.7143 $

**Total for third sentence:**
- **Unigram contributions**: $0.2307 + 0.2857 + 0.2105 + 0.2857 = 1.0126$
- **Bigram contributions**: $0.7693 + 0.7143 + 0.7895 + 0.7143 = 2.9874$
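
The per-sentence totals can be reproduced with the sketch below, which recomputes the E-step at λ₁ = λ₂ = 0.5. The function name `sentence_contributions` is chosen for this example. Because the script works at full floating-point precision while the hand calculation rounds each intermediate value to four decimals, the results may differ from the figures above in the fourth decimal place.

```python
from collections import Counter

corpus = ["a b a", "b c b", "a b c"]
padded = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigram_counts = Counter(w for sent in padded for w in sent)
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())

lam1 = lam2 = 0.5  # initial mixture weights

def sentence_contributions(sent):
    """Sum the unigram and bigram contributions over one padded sentence."""
    uni = bi = 0.0
    for prev, word in zip(sent, sent[1:]):
        p_uni = unigram_counts[word] / total_tokens
        p_bi = bigram_counts[(prev, word)] / unigram_counts[prev]
        total_prob = lam1 * p_uni + lam2 * p_bi
        uni += lam1 * p_uni / total_prob
        bi += lam2 * p_bi / total_prob
    return uni, bi

totals = [sentence_contributions(sent) for sent in padded]
for sent, (uni, bi) in zip(padded, totals):
    print(" ".join(sent), f"unigram={uni:.4f}", f"bigram={bi:.4f}")
```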

# Step 2: M-Step - Calculate Updated Lambda Values
Using the total expected counts to update λ values:
- **Total unigram contributions**: $1.3358 + 1.4472 + 1.0126 = 3.7956$
- **Total bigram contributions**: $2.6642 + 2.5528 + 2.9874 = 8.2044$

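Updated λ values (the denominator is $3.7956 + 8.2044 = 12$, the number of bigrams in the corpus, because the two contributions of each bigram sum to 1):
$$
\lambda_1 = \frac{3.7956}{12.0} = 0.3163
$$
$$
\lambda_2 = \frac{8.2044}{12.0} = 0.6837
$$

# Lambda Values After the First EM Step
- $\lambda_1 = 0.3163$
- $\lambda_2 = 0.6837$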
