## **N-Gram Language Model Tutorial**
### This Jupyter Notebook will guide you through the basics of N-Gram language models using Python.

In [None]:
# Importing Required Libraries
import re
from collections import Counter, defaultdict
import random
import math
import requests
import os

#### **1. N-Gram**

#### What is an N-Gram?

An N-Gram is a contiguous sequence of N items (words, characters, etc.) from a given text. The code in this notebook works at the character level.

For example, in the sentence "I love machine learning", the word-level 2-grams (bigrams) are:

["I love", "love machine", "machine learning"]

#### N-Gram Probability Equation
The probability of a word \( w_n \) given its history \( h \) in an n-gram model is:
$$
P(w_n \mid h) = \frac{C(h, w_n)}{\sum_{w'} C(h, w')}
$$
Where:
- \( C(h, w_n) \): Count of occurrences of \( h \) followed by \( w_n \)
- \( \sum_{w'} C(h, w') \): Count of occurrences of \( h \) followed by any item \( w' \)
- \( h \): Context (history of n-1 words)
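
As a small worked instance of the equation, the sketch below estimates character bigram probabilities for the toy string `"abac"`. The string, the variable names, and the helper `mle_prob` are illustrative assumptions, and no `#` padding is used here.

```python
from collections import Counter

text = "abac"  # toy corpus (assumption for illustration)

# C(h, w): count each (context, next character) pair
pair_counts = Counter(zip(text, text[1:]))   # ('a','b'): 1, ('b','a'): 1, ('a','c'): 1
# Sum over w' of C(h, w'): how often each context is followed by anything
context_counts = Counter(text[:-1])          # 'a': 2, 'b': 1

def mle_prob(context, char):
    """P(char | context) = C(context, char) / sum over w' of C(context, w')."""
    return pair_counts[(context, char)] / context_counts[context]

print(mle_prob("a", "b"))  # 0.5
print(mle_prob("a", "c"))  # 0.5
print(mle_prob("b", "a"))  # 1.0
```

Calling `mle_prob` with a context that never occurs would divide by zero, which is one motivation for the smoothing discussed later.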

In [None]:

def generate_ngrams(text, n):
    """
    Generate n-grams (character-level) from a given text.

    Parameters:
    text (str): Input text
    n (int): Size of the n-grams

    Returns:
    list: A list of n-grams as tuples
    """
    # Pad the start with n-1 '#' characters so the first real character has a full context
    padded_text = '#' * (n-1) + text
    ngrams = []
    for i in range(len(padded_text) - n + 1):
        ngram = tuple(padded_text[i:i+n])
        ngrams.append(ngram)
    return ngrams

In [None]:
# Example Text
text = "hello world"

# Generate and display bigrams (2-grams)
bigrams = generate_ngrams(text, 2)
print("Character-Level Bigrams:", bigrams)

Character-Level Bigrams: [('#', 'h'), ('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o'), ('o', ' '), (' ', 'w'), ('w', 'o'), ('o', 'r'), ('r', 'l'), ('l', 'd')]


In [None]:
def build_ngram_model(corpus, n):
    """
    Build an n-gram language model from the corpus.

    Parameters:
    corpus (str): Text corpus for building the model
    n (int): Size of the n-grams

    Returns:
    dict: A probability distribution for each context
    """
    # Initialize the model
    model = defaultdict(Counter)

    # Generate n-grams
    ngrams = generate_ngrams(corpus, n)

    # Build the model
    for ngram in ngrams:
        context = ngram[:-1]  # all but the last character
        char = ngram[-1]      # the last character
        model[context][char] += 1

    # Convert counts to probabilities
    for context in model:
        total_count = sum(model[context].values())
        for char in model[context]:
            model[context][char] = model[context][char] / total_count

    return model

#### **2. Smoothing**

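Without smoothing, any n-gram that never appears in the training corpus gets probability zero, which makes the perplexity of text containing it infinite. Additive smoothing avoids this by adding a small constant to every count, so unseen n-grams receive a small non-zero probability.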

#### Smoothing Equation
With smoothing, the probability becomes:
$$
P(w_n \mid h) = \frac{C(h, w_n) + \alpha}{\sum_{w'} C(h, w') + \alpha \times |V|}
$$
Where:
- \( \alpha \): Smoothing parameter (default is 1)
- \( |V| \): Vocabulary size
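
To make the formula concrete, here is a small numeric sketch. The counts, the three-character vocabulary, and the variable names are made up for illustration.

```python
from collections import Counter

counts_after_a = Counter({"b": 1, "c": 1})  # illustrative counts C('a', w)
vocabulary = ["a", "b", "c"]                # assumed vocabulary, |V| = 3
alpha = 1.0

# Denominator: total count for the context plus alpha * |V|  ->  2 + 1 * 3 = 5
denominator = sum(counts_after_a.values()) + alpha * len(vocabulary)

# Every character in the vocabulary gets (count + alpha) / denominator
smoothed = {ch: (counts_after_a[ch] + alpha) / denominator for ch in vocabulary}
print(smoothed)  # {'a': 0.2, 'b': 0.4, 'c': 0.4}
```

The unseen character `'a'` now has probability 0.2 instead of 0, and the three probabilities still sum to 1.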


In [None]:
def add_smoothing(model, vocabulary, alpha=1.0):
    """
    Apply additive smoothing to an n-gram model.

    Parameters:
    model (defaultdict): N-gram model holding raw counts (not probabilities) for each context.
    vocabulary (iterable): All unique characters in the vocabulary.
    alpha (float): Smoothing parameter (default is 1.0).

    Returns:
    defaultdict: Smoothed n-gram model with a probability for every character in the vocabulary.
    """
    vocabulary = set(vocabulary)
    smoothed_model = defaultdict(Counter)
    for prefix, char_counts in model.items():
        total_count = sum(char_counts.values()) + alpha * len(vocabulary)
        # Iterate over the characters themselves (not integer indices) so that
        # unseen characters also receive alpha / total_count
        for char in vocabulary:
            smoothed_model[prefix][char] = (char_counts.get(char, 0) + alpha) / total_count
    return smoothed_model

#### **3. Generating Text Using the N-Gram Model**

To generate text, we repeatedly sample the next character from the model's probability distribution for the current context and append it to the text.
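
The sampling step can be sketched in isolation with `random.choices`, which draws items in proportion to their weights. The distribution below is made up for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

next_char_dist = {"l": 0.6, "e": 0.3, "o": 0.1}  # illustrative P(next char | context)
chars, probs = zip(*next_char_dist.items())

random.seed(0)  # fixed seed for reproducibility (assumption, not required)
samples = random.choices(chars, weights=probs, k=10_000)

# Empirical frequencies should be close to the specified probabilities
frequencies = {ch: count / len(samples) for ch, count in Counter(samples).items()}
print(frequencies)
```

With enough draws, the observed frequencies approach the weights, which is why high-probability continuations dominate generated text.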

In [None]:
def generate_text(model, n, start_text, length=100):
    """
    Generate text using the n-gram model.

    Parameters:
    model (dict): Trained n-gram model
    n (int): Size of the n-grams
    start_text (str): Initial text to start generation
    length (int): Number of characters to generate

    Returns:
    str: Generated text
    """
    # Initialize with start text
    current_text = list(start_text)

    # Generate characters
    for _ in range(length):
        # Get the current context: the last n-1 characters, left-padded with '#'
        # when the text is shorter than n-1 (also yields () for a unigram model)
        padded = ['#'] * (n - 1) + current_text
        context = tuple(padded[len(padded) - (n - 1):])

        # If context not in model, break
        if context not in model:
            break

        # Get probability distribution for next character
        char_dist = model[context]

        # Sample next character
        chars, probs = zip(*char_dist.items())
        next_char = random.choices(chars, weights=probs)[0]

        # Append to generated text
        current_text.append(next_char)

    return ''.join(current_text)

In [None]:
# Sample text
text = "hello world this is a sample text for testing the n-gram model"

# Build a bigram model
bigram_model = build_ngram_model(text, 2)

# Generate text
generated = generate_text(bigram_model, 2, "he", 10)
print(f"Generated text: {generated}")

Generated text: helo ingramo


#### **4. Evaluating the Language Model**

#### **Perplexity**

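Perplexity is a common metric for evaluating language models. It can be read as the average number of equally likely choices the model faces at each step, so lower perplexity on held-out text indicates a better model.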

#### Perplexity Equation
Perplexity measures how well a language model predicts a test dataset:
$$
PP(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid h_i)}
$$
Where:
- \( W \): Sequence of items being evaluated (words in general; characters in this notebook)
- \( N \): Total number of items in the sequence
- \( P(w_i \mid h_i) \): Probability of item \( w_i \) given its history \( h_i \)
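
A short numeric sketch of the equation follows. The probability values are invented for illustration and do not come from any model in this notebook.

```python
import math

probs = [0.5, 0.25, 0.25, 0.5]  # illustrative P(w_i | h_i) values

# Average log2-probability: (-1 - 2 - 2 - 1) / 4 = -1.5
avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** (-avg_log2)
print(round(perplexity, 3))  # 2.828

# A model that is uniform over 4 choices at every step has perplexity 4
uniform = [0.25] * 8
print(2 ** (-sum(math.log2(p) for p in uniform) / len(uniform)))  # 4.0
```

The second case illustrates the "number of equally likely choices" reading of perplexity.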

In [None]:
def calculate_perplexity(model, n, test_text):
    """
    Calculate perplexity of the model on test text.

    Parameters:
    model (dict): Trained n-gram model
    n (int): Size of the n-grams
    test_text (str): Text to evaluate on

    Returns:
    float: Perplexity score
    """
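    ngrams = generate_ngrams(test_text, n)
    if not ngrams:
        # Avoid dividing by zero below when there is nothing to evaluate
        raise ValueError("test_text must contain at least one character")
    log_prob = 0
    total_ngrams = len(ngrams)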

    for ngram in ngrams:
        context = ngram[:-1]
        char = ngram[-1]

        if context in model and char in model[context]:
            prob = model[context][char]
            log_prob += -1 * math.log2(prob)
        else:
            return float('inf')  # Return infinity for unseen n-grams

    return 2 ** (log_prob / total_ngrams)


In [None]:
# First, let's create a more substantial training corpus
training_corpus = """
The quick brown fox jumps over the lazy dog.
She sells seashells by the seashore.
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
To be or not to be, that is the question.
All that glitters is not gold.
A journey of a thousand miles begins with a single step.
Actions speak louder than words.
Beauty is in the eye of the beholder.
Every cloud has a silver lining.
Fortune favors the bold and brave.
Life is like a box of chocolates.
The early bird catches the worm.
Where there's smoke, there's fire.
Time heals all wounds and teaches all things.
Knowledge is power, and power corrupts.
Practice makes perfect, but nobody's perfect.
The pen is mightier than the sword.
When in Rome, do as the Romans do.
A picture is worth a thousand words.
Better late than never, but never late is better.
Experience is the best teacher of all things.
Laughter is the best medicine for the soul.
Music soothes the savage beast within us.
Nothing ventured, nothing gained in life.
The grass is always greener on the other side.
"""

# Clean the corpus: lowercase everything and keep only letters, digits, and whitespace (newlines are kept)
training_corpus = ''.join(c.lower() for c in training_corpus if c.isalnum() or c.isspace())

In [None]:
training_corpus

'\nthe quick brown fox jumps over the lazy dog\nshe sells seashells by the seashore\nhow much wood would a woodchuck chuck if a woodchuck could chuck wood\nto be or not to be that is the question\nall that glitters is not gold\na journey of a thousand miles begins with a single step\nactions speak louder than words\nbeauty is in the eye of the beholder\nevery cloud has a silver lining\nfortune favors the bold and brave\nlife is like a box of chocolates\nthe early bird catches the worm\nwhere theres smoke theres fire\ntime heals all wounds and teaches all things\nknowledge is power and power corrupts\npractice makes perfect but nobodys perfect\nthe pen is mightier than the sword\nwhen in rome do as the romans do\na picture is worth a thousand words\nbetter late than never but never late is better\nexperience is the best teacher of all things\nlaughter is the best medicine for the soul\nmusic soothes the savage beast within us\nnothing ventured nothing gained in life\nthe grass is always

In [None]:
# Build models of different orders
def build_models(corpus):
    models = {}
    for n in [2, 3, 4]:  # bigram, trigram, and 4-gram models
        models[n] = build_ngram_model(corpus, n)
    return models

# Build the models
models = build_models(training_corpus)

In [None]:
# Generate samples and calculate perplexity
def evaluate_samples(models, num_samples=10, sample_length=40):
    """
    Evaluates multiple n-gram language models by generating text samples and calculating their perplexity scores.

    This function:
    1. Takes different n-gram models (e.g., bigram, trigram, 4-gram)
    2. For each model:
       - Generates multiple text samples
       - Calculates perplexity for each sample
       - Computes average perplexity across all samples
    3. Stores and returns all results for comparison

    Parameters:
    -----------
    models : dict
        Dictionary where keys are n-gram sizes (e.g., 2 for bigram)
        and values are the trained n-gram models

    num_samples : int, optional (default=10)
        Number of text samples to generate for each model

    sample_length : int, optional (default=40)
        Length of each generated text sample in characters

    Returns:
    --------
    dict
        A dictionary where:
        - Keys are n-gram sizes (2, 3, 4, etc.)
        - Values are lists of dictionaries containing:
          {'text': generated_text, 'perplexity': perplexity_score}

    Example:
    --------
    # Example output structure:
    {
        2: [  # Results for bigram model
            {'text': 'hello world', 'perplexity': 10.5},
            {'text': 'sample text', 'perplexity': 12.3},
            # ... more samples
        ],
        3: [  # Results for trigram model
            {'text': 'another example', 'perplexity': 8.7},
            # ... more samples
        ]
    }
    """
    results = defaultdict(list)

    for n, model in models.items():
        print(f"\n=== {n}-gram Model Evaluation ===")

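        # Generate multiple samples, each seeded with the first n-1 characters
        # of the global training_corpus. Perplexity is then computed on the model's
        # own output, so it shows how confident the model is in its samples,
        # not how well it predicts held-out text.
        start_text = training_corpus[:n-1]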
        for i in range(num_samples):
            # Generate sample
            generated = generate_text(model, n, start_text, sample_length)

            # Calculate perplexity
            perplexity = calculate_perplexity(model, n, generated)

            print(f"\nSample {i+1}:")
            print(f"Text: {generated}")
            print(f"Perplexity: {perplexity:.2f}")

            results[n].append({
                'text': generated,
                'perplexity': perplexity
            })

        # Calculate average perplexity for this n-gram model
        avg_perplexity = sum(sample['perplexity'] for sample in results[n]) / len(results[n])
        print(f"\nAverage Perplexity for {n}-gram model: {avg_perplexity:.2f}")

    return results

In [None]:
# Evaluate samples
results = evaluate_samples(models)

# Calculate statistics for each model
print("\n=== Overall Statistics ===")
for n in models.keys():
    perplexities = [sample['perplexity'] for sample in results[n]]
    min_perp = min(perplexities)
    max_perp = max(perplexities)
    avg_perp = sum(perplexities) / len(perplexities)

    print(f"\n{n}-gram Model Statistics:")
    print(f"Minimum Perplexity: {min_perp:.2f}")
    print(f"Maximum Perplexity: {max_perp:.2f}")
    print(f"Average Perplexity: {avg_perp:.2f}")


=== 2-gram Model Evaluation ===

Sample 1:
Text: 
athteter azysoke be pobe ack tinorterdoo
Perplexity: 7.43

Sample 2:
Text: 
sex couckever be d wofactoonora wher pon
Perplexity: 8.10

Sample 3:
Text: 
ak lds
k is bisthe oolas whifoullacty th
Perplexity: 6.95

Sample 4:
Text: 
houicoudea ak seane ls ove saite theeso 
Perplexity: 9.45

Sample 5:
Text: 
this ire wiochoucot bingld teves ththe
a
Perplexity: 6.56

Sample 6:
Text: 
l insine sike ald us
berthande tur jods 
Perplexity: 7.54

Sample 7:
Text: 
k thecothoferrs als s is qucans imot nou
Perplexity: 7.38

Sample 8:
Text: 
wotththe allwoof s
thes ouisotis pt ckes
Perplexity: 6.91

Sample 9:
Text: 
ne woxperea a s anor s
angs ome bullas a
Perplexity: 6.48

Sample 10:
Text: 
the was thinerauchikeys sin ck ain there
Perplexity: 5.93

Average Perplexity for 2-gram model: 7.27

=== 3-gram Model Evaluation ===

Sample 1:
Text: 
to buty beach ands
knothinglike eacturedg
Perplexity: 2.62

Sample 2:
Text: 
thers a thene row man the speas is 