<a href="https://colab.research.google.com/github/steliosg23/TextAnalytics-DS-2025/blob/main/TA_Assignment_1_N_grams_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 3

(i) Implement a bigram and a trigram language model for sentences, using Laplace smoothing or optionally Kneser-Ney smoothing.

In practice, n-gram language models
compute the sum of the logarithms of the n-gram probabilities of each sequence, instead of
their product and you should do the same.

Assume that each sentence starts with the
pseudo-token *start* (or two pseudo-tokens *start1*, *start2* for the trigram model) and
ends with the pseudo-token *end*.

Train your models on a training subset of a corpus. Include in the vocabulary
only words that occur, e.g., at least 10 times in the training subset. Use the same vocabulary
in the bigram and trigram models. Replace all out-of-vocabulary (OOV) words (in the
training, development, test subsets) by a special token *UNK*.

Alternatively, use BPEs instead of words (obtaining the BPE vocabulary from your training subset) to
avoid unknown words.

In [1]:
!pip install -U nltk



In [2]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Downloading the Reuters corpus from the NLTK library

In [3]:
nltk.download('reuters')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

## Loading the corpus



In [4]:


import re
import math
import random
import unicodedata
import string
from collections import Counter
# numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading and Processing the Reuters Corpus

### 1. Importing the Reuters Corpus
We begin by importing the Reuters corpus from the `nltk` library:



In [5]:
from nltk.corpus import reuters

# Load the Reuters corpus
reuters_corpus = reuters.fileids()

# Print the first 10 files
print(reuters_corpus[:10])


['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']


### 2. Constructing the Final Corpus
We then iterate through each file in the corpus, convert its text to lowercase, and append it to a string that will contain the entire corpus.

In [6]:
# Build the final corpus by lowercasing the raw text of every file
# and joining the pieces once (cheaper than repeated string +=)
final_corpus = ''.join(
    reuters.raw(fileids=file_id).lower()  # raw text of one file, lowercased
    for file_id in reuters_corpus
)



### 3. Counting the Number of Words
Finally, we count the whitespace-separated tokens in the final corpus as a rough measure of its size in words.

In [7]:
# Print the total number of words in the final corpus
print(len(final_corpus.split()))


1378305


## Text Cleaning Function

The `text_cleaning` function cleans the input text by replacing unwanted characters with spaces. It keeps only letters, sentence-ending punctuation marks (periods, question marks, exclamation marks), and apostrophes, and it collapses runs of spaces and newline characters into a single space.

In [8]:
import re

def text_cleaning(text):

    # Remove all characters except letters (both uppercase and lowercase),
    # sentence-ending characters (periods, question marks, exclamation marks),
    # and apostrophes. Replace unwanted characters with a space.
    corpus = re.sub(r'[^a-zA-Z.?!\']', ' ', text)

    # Remove all left square brackets '[' from the text
    corpus = corpus.replace('[', '')

    # Replace all right square brackets ']' with a period '.'
    corpus = corpus.replace(']', '.')

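    # Remove specific special characters (like $, @, ^, &, *, (, ), €, :, etc.)
    # by replacing them with a space. The square brackets are escaped so that the
    # whole pattern is a single character class. After the first substitution none
    # of these characters remain, so this step acts only as a safeguard.
    corpus = re.sub(r'[\[\]/$@^&*()€:΄]', ' ', corpus)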

    # Replace multiple consecutive spaces with a single space
    corpus = re.sub(' +', ' ', corpus)

    # Replace newline characters '\n' with a space to maintain continuous text
    corpus = corpus.replace('\n', ' ')

    # Return the cleaned text
    return corpus


In [9]:
# Print the first 150 characters of the corpus before cleaning
print(f'Before Cleaning: {final_corpus[:150]}...\n')

# Clean the corpus
final_corpus = text_cleaning(final_corpus)

# Print the first 150 characters of the corpus after cleaning
print(f'After Cleaning: {final_corpus[:150]}...')


Before Cleaning: asian exporters fear damage from u.s.-japan rift
  mounting trade friction between the
  u.s. and japan has raised fears among many of asia's exportin...

After Cleaning: asian exporters fear damage from u.s. japan rift mounting trade friction between the u.s. and japan has raised fears among many of asia's exporting na...




In [10]:
import nltk

def sentence_tokenization(text):
    try:
        # Tokenizes the input text into a list of sentences using NLTK's sent_tokenize
        return nltk.sent_tokenize(text)
    except Exception as e:
        # Handle any errors that occur during sentence tokenization
        print(f"Error in tokenizing text: {e}")
        return []


In [11]:
sentence_list = sentence_tokenization(final_corpus)

# Print the first ten sentences of the corpus
for i in range(10):
    print(f"Sentence {i+1}: {sentence_list[i]}")


Sentence 1: asian exporters fear damage from u.s. japan rift mounting trade friction between the u.s. and japan has raised fears among many of asia's exporting nations that the row could inflict far reaching economic damage businessmen and officials said.
Sentence 2: they told reuter correspondents in asian capitals a u.s. move against japan might boost protectionist sentiment in the u.s. and lead to curbs on american imports of their products.
Sentence 3: but some exporters said that while the conflict would hurt them in the long run in the short term tokyo's loss might be their gain.
Sentence 4: the u.s. has said it will impose mln dlrs of tariffs on imports of japanese electronics goods on april in retaliation for japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost.
Sentence 5: unofficial japanese estimates put the impact of the tariffs at billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of

## Word Tokenization Function

The `word_tokenization` function takes an input text and tokenizes it into a list of words using the NLTK library's `word_tokenize` function. It also handles any exceptions that may occur during the tokenization process.

In [12]:
import nltk

def word_tokenization(text):
    try:
        # Tokenizes the input text into a list of words using NLTK's word_tokenize
        return nltk.word_tokenize(text)
    except Exception as e:
        # Handle any errors that occur during word tokenization
        print(f"Error in tokenizing text: {e}")
        return []


In [13]:
# Apply word_tokenization for each sentence in the sentence list
words_in_sentences = [word_tokenization(sentence) for sentence in sentence_list]

# Print the words of the first ten sentences in the corpus
for i in range(10):
    print(f"Sentence {i+1}: {words_in_sentences[i]}")


Sentence 1: ['asian', 'exporters', 'fear', 'damage', 'from', 'u.s.', 'japan', 'rift', 'mounting', 'trade', 'friction', 'between', 'the', 'u.s.', 'and', 'japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'asia', "'s", 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', 'reaching', 'economic', 'damage', 'businessmen', 'and', 'officials', 'said', '.']
Sentence 2: ['they', 'told', 'reuter', 'correspondents', 'in', 'asian', 'capitals', 'a', 'u.s.', 'move', 'against', 'japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'u.s.', 'and', 'lead', 'to', 'curbs', 'on', 'american', 'imports', 'of', 'their', 'products', '.']
Sentence 3: ['but', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', 'run', 'in', 'the', 'short', 'term', 'tokyo', "'s", 'loss', 'might', 'be', 'their', 'gain', '.']
Sentence 4: ['the', 'u.s.', 'has', 'said', 'it', 'will', 'impose', 'mln', 'dlrs', 'of', 'tariffs', 'on', 'im

## Data Splitting Process

In this step, the tokenized sentences are split into three subsets: `train_corpus`, `dev_corpus`, and `test_corpus`. This is common practice in machine learning: one subset for training, one for development (tuning), and one for final testing.

### Steps:
1. **Set Random Seed for Reproducibility:**
   The random seed is set to ensure that the splitting process is reproducible. This means that every time the code runs, the same splits will occur, providing consistency in results.

2. **Initial Split into Train and Temporary Corpus:**
   The sentences are first split into two sets: `train_corpus`, which keeps 60% of the data for estimating the model's n-gram counts, and `temp_corpus`, which holds the remaining 40%.

3. **Further Split the Temporary Corpus into Dev and Test Sets:**
   The `temp_corpus` (which is 40% of the original corpus) is further split evenly into two parts: the `dev_corpus` (development set) and the `test_corpus` (test set). Each of these subsets will contain 20% of the original corpus. The dev set is used for model tuning, and the test set is kept for evaluating the model's final performance.

4. **Resulting Subsets:**
   After the splits, you have:
   - `train_corpus` containing 60% of the data used for training.
   - `dev_corpus` containing 20% of the data used for validation and hyperparameter tuning.
   - `test_corpus` containing 20% of the data used for final testing.

Keeping the three subsets separate means the model is trained, tuned, and tested on different data, so overfitting to the training set can be detected and the final evaluation is not biased by tuning decisions.
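The 60/20/20 proportions described above can be checked on a toy list. The sketch below uses only the standard library instead of `train_test_split`; the helper name `split_corpus` and the toy data are hypothetical and are not part of the notebook.

```python
import random

def split_corpus(sentences, train_frac=0.6, seed=2025):
    """Shuffle a copy of the sentences and split it into train/dev/test
    (hypothetical helper; dev and test share the remainder evenly)."""
    rng = random.Random(seed)            # local RNG, global state is untouched
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

toy_sentences = [[f"w{i}"] for i in range(10)]
train, dev, test = split_corpus(toy_sentences)
print(len(train), len(dev), len(test))   # 6 2 2
```

Because the generator is seeded, calling `split_corpus` again with the same input returns the same three subsets.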


In [14]:
from sklearn.model_selection import train_test_split

# Seed Python's own RNG as well; the splits below are made reproducible
# by the random_state argument of train_test_split
random.seed(2025)

# Shuffle and split the corpus into train, dev, and test sets
train_corpus, temp_corpus = train_test_split(words_in_sentences, test_size=0.4, random_state=2025)

# Further split temp_corpus into dev and test sets
dev_corpus, test_corpus = train_test_split(temp_corpus, test_size=0.5, random_state=2025)

# Now train_corpus, dev_corpus, and test_corpus are split


In [15]:
# Print a sample of each set
print("Sample from Training Set:")
print(train_corpus[:3])  # Printing the first 3 samples from the training set

print("\nSample from Development Set:")
print(dev_corpus[:3])  # Printing the first 3 samples from the development set

print("\nSample from Test Set:")
print(test_corpus[:3])  # Printing the first 3 samples from the test set


Sample from Training Set:
[['mln', 'vs', '.'], ['although', 'these', 'had', 'made', 'minor', 'contribution', 'to', 'profits', 'the', 'real', 'benefits', 'would', 'come', 'in', 'and', 'beyond', '.'], ['we', 'think', 'the', 'stock', 'will', 'do', 'moderately', 'better', 'than', 'the', 'market', 'he', 'said', '.']]

Sample from Development Set:
[['lt', 'atco', 'ltd', 'sees', 'gain', 'from', 'sale', 'atco', 'ltd', 'said', 'its', 'atco', 'development', 'unit', 'agreed', 'to', 'sell', 'the', 'canadian', 'utilities', 'center', 'in', 'edmonton', 'alberta', 'and', 'the', 'canadian', 'western', 'center', 'in', 'calgary', '.'], ['what', 'i', "'m", 'really', 'saying', 'is', 'that', 'they', 'should', 'not', 'expect', 'us', 'to', 'simply', 'sit', 'back', 'here', 'and', 'accept', 'increased', 'tightening', 'on', 'their', 'part', 'on', 'the', 'assumption', 'that', 'somehow', 'we', 'are', 'going', 'to', 'follow', 'them', 'he', 'added', '.'], ['mln', 'marks', 'from', '.']]

Sample from Test Set:
[['in',

## N-gram Calculation with Kneser-Ney Smoothing

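The following functions count n-grams (unigrams, bigrams, trigrams, etc.) in a given corpus. For bigrams and higher-order n-grams, a count discount in the spirit of Kneser-Ney smoothing is applied: 1 is subtracted from each n-gram count before dividing by the count of its context. This is only the discounted term of Kneser-Ney. The interpolation with a lower-order continuation distribution, which is what gives unseen n-grams a non-zero probability, is not included.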

### Functions:

1. **`calc_ngrams(corpus, n)`**:
   - **Purpose:** This function counts n-grams (where `n` is the number of items in the n-gram) and, for `n > 1`, applies the discounting step described above.
   - **How It Works:**
     - It creates padded `n`-grams for each sentence in the corpus and updates a counter that tracks the frequency of each n-gram.
     - It also tracks how often each (n-1)-gram context occurs.
     - For bigrams and higher orders, each count is replaced by `max(count - 1, 0)` divided by the count of the n-gram's context.
   - **Return:** A `Counter` object mapping each n-gram to its discounted relative frequency (or to its raw count when `n = 1`).

2. **`calc_unigrams(corpus)`**:
   - **Purpose:** This function calculates unigrams (single words) from the corpus.
   - **How It Works:** It creates unigrams for each sentence in the corpus and counts their occurrences.
   - **Return:** A `Counter` object containing the raw unigram counts. No discounting is applied here, since the discount is only used for bigrams and higher orders.

3. **`calc_bigrams(corpus)`**:
   - **Purpose:** This function calculates bigrams (pairs of consecutive words) from the corpus, using the `calc_ngrams` function with `n=2`.
   - **Return:** A `Counter` object containing the smoothed bigrams.

4. **`calc_trigrams(corpus)`**:
   - **Purpose:** This function calculates trigrams (triplets of consecutive words) from the corpus, using the `calc_ngrams` function with `n=3`.
   - **Return:** A `Counter` object containing the smoothed trigrams.

### Explanation of Kneser-Ney Smoothing:
- **Kneser-Ney smoothing** reserves probability mass for n-grams that were never seen in the training corpus. It subtracts a fixed discount from the count of every seen n-gram and redistributes the freed mass through a lower-order "continuation" distribution, which scores a word by the number of distinct contexts it follows rather than by its raw frequency.
- The function here implements the discounting step only: it counts the n-grams, subtracts 1 from each count (floored at 0), and divides by the count of the n-gram's context. As a consequence, n-grams seen exactly once and unseen n-grams both receive a value of 0, so an interpolation or back-off term is still needed before taking logarithms.

These functions help in building a language model by counting the frequency of n-grams and applying smoothing to handle unseen combinations of words.
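For comparison, here is a minimal sketch of interpolated Kneser-Ney for bigrams, including the continuation term. It is an illustration under stated assumptions, not the notebook's implementation: the function name `train_kn_bigram`, the discount of 0.75, and the toy corpus are all hypothetical, and OOV words are assumed to have been replaced by `UNK` already.

```python
from collections import Counter, defaultdict

def train_kn_bigram(corpus, discount=0.75):
    """Return a function prob(word, prev) with interpolated Kneser-Ney
    bigram probabilities (hypothetical helper, discount is an assumption)."""
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = ['<s>'] + sentence + ['<e>']
        bigram_counts.update(zip(tokens, tokens[1:]))

    context_counts = Counter()           # c(prev)
    followers = defaultdict(set)         # distinct words seen after prev
    predecessors = defaultdict(set)      # distinct contexts seen before word
    for (prev, word), count in bigram_counts.items():
        context_counts[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    n_bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: share of bigram types that end in `word`
        p_cont = len(predecessors.get(word, ())) / n_bigram_types
        if context_counts[prev] == 0:
            return p_cont                # unseen context: continuation term only
        discounted = max(bigram_counts[(prev, word)] - discount, 0) / context_counts[prev]
        # Mass freed by discounting, spread according to p_cont
        backoff_weight = discount * len(followers[prev]) / context_counts[prev]
        return discounted + backoff_weight * p_cont

    return prob

toy_corpus = [['a', 'b'], ['a', 'c'], ['b', 'a']]
kn_prob = train_kn_bigram(toy_corpus)
kn_vocab = ['a', 'b', 'c', '<e>']        # every token that can be predicted
total = sum(kn_prob(word, 'a') for word in kn_vocab)
print(total)                             # should be very close to 1
print(kn_prob('c', 'b') > 0)             # unseen bigram still gets some mass
```

Unlike the discount-only values, these probabilities sum to 1 over the predictable tokens for a seen context, and unseen bigrams receive a non-zero probability, so their logarithms are defined.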


In [16]:
from collections import Counter
from nltk import ngrams

def calc_ngrams(corpus, n):
    """Return a Counter of padded n-grams.

    For n == 1 the values are raw counts. For n > 1 each value is the
    discounted relative frequency max(count - 1, 0) / count(context),
    i.e. only the discounted term of Kneser-Ney smoothing.
    """
    ngram_counter = Counter()
    context_counter = Counter()

    for sentence in corpus:
        # Generate padded n-grams and update the counter
        sentence_ngrams = list(ngrams(sentence, n, pad_left=True, pad_right=True,
                                      left_pad_symbol='<s>', right_pad_symbol='<e>'))
        ngram_counter.update(sentence_ngrams)

        # Count each (n-1)-gram context from the same padded n-grams.
        # Counting separately padded (n-1)-grams would miss contexts such as
        # ('<s>',) or ('<s>', '<s>') and lead to a division by zero below.
        if n > 1:
            context_counter.update(gram[:-1] for gram in sentence_ngrams)

    # Apply the discount for bigrams and higher orders
    if n > 1:
        for gram, count in list(ngram_counter.items()):
            # Subtract 1 from the count (floored at 0) and normalize by the context count
            ngram_counter[gram] = max(count - 1, 0) / context_counter[gram[:-1]]

    # For unigrams, the raw counts are returned unchanged
    return ngram_counter

def calc_unigrams(corpus):
    # Returns a Counter of raw unigram counts (keys are 1-tuples); no discounting is applied.
    unigram_counter = Counter()
    for sentence in corpus:
        # With n=1 no padding symbols are added, so only real tokens are counted
        unigram_counter.update(ngrams(sentence, 1, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='<e>'))
    return unigram_counter

def calc_bigrams(corpus):
    # Returns a Counter of discounted bigram relative frequencies.
    return calc_ngrams(corpus, 2)

def calc_trigrams(corpus):
    # Returns a Counter of discounted trigram relative frequencies.
    return calc_ngrams(corpus, 3)


## Replacing Out-of-Vocabulary (OOV) Words in Training Corpus

The `replace_oov_words_train` function identifies and replaces out-of-vocabulary (OOV) words in a training corpus. It also creates a vocabulary by excluding the OOV words and returning them separately.

### Steps:

1. **Calculate Unigram Counts:**
   - The function starts by calculating the unigram (single word) counts for the entire corpus using the `calc_unigrams` function. This gives the frequency of each word in the corpus.

2. **Identify OOV Words:**
   - OOV words are those whose frequency in the corpus is less than a predefined threshold (in this case, less than 10 occurrences).
   - The function creates a dictionary where OOV words are mapped to the placeholder `'UNK'`.

3. **Replace OOV Words in the Corpus:**
   - The function processes each sentence in the corpus and replaces any OOV word with `'UNK'`. Words that are not OOV are left unchanged.

4. **Create Vocabulary:**
   - The vocabulary is created by including only those words from the unigram counts that are not considered OOV. The resulting vocabulary excludes the rare words that have been replaced with `'UNK'`.

5. **Return Results:**
   - The function returns:
     - A dictionary of OOV words, where each OOV word maps to `'UNK'`.
     - The cleaned corpus where OOV words have been replaced with `'UNK'`.
     - The vocabulary, which contains all the words that are not considered OOV.

### Purpose:
The function prepares a training corpus for language modeling by replacing rare words with a special token (`'UNK'`). Because every rare or unseen word is mapped to the same token, the model can still assign a probability to words it did not see often enough in training. The resulting vocabulary is also smaller, which keeps the n-gram tables compact.
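The steps above can be illustrated on a toy corpus. This is a minimal sketch with hypothetical helper names (`build_vocabulary`, `replace_oov`) and a threshold of 2 instead of 10 so that the effect is visible on two sentences.

```python
from collections import Counter

def build_vocabulary(sentences, min_count=10):
    """Keep the words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

def replace_oov(sentences, vocabulary, unk_token="UNK"):
    """Map every word outside the vocabulary to unk_token (hypothetical helper)."""
    return [[word if word in vocabulary else unk_token for word in sentence]
            for sentence in sentences]

toy_train = [["the", "market", "rose"], ["the", "market", "fell"]]
toy_vocab = build_vocabulary(toy_train, min_count=2)

print(replace_oov(toy_train, toy_vocab))
# [['the', 'market', 'UNK'], ['the', 'market', 'UNK']]

# The same training vocabulary is applied to unseen (dev/test) sentences
print(replace_oov([["the", "index", "rose"]], toy_vocab))
# [['the', 'UNK', 'UNK']]
```

The vocabulary is built from the training subset only and then reused for the other subsets, which mirrors the requirement that the same vocabulary be used throughout.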


In [20]:
def replace_oov_words_train(corpus):

    # Calculate unigram frequencies for the corpus
    unigram_counter = calc_unigrams(corpus)

    # Create a dictionary for OOV words (those that appear less than 10 times)
    OOV_words = {k[0]: "UNK" for k, v in unigram_counter.items() if v < 10}

    # Replace OOV words in the corpus (using list comprehension for each sentence)
    clean_corpus = [
        [OOV_words.get(word, word) for word in sentence]
        for sentence in corpus
    ]

    # Create vocabulary (set of unique words not in OOV_words)
    vocabulary = [f[0] for f in unigram_counter.keys() if f[0] not in OOV_words]
    vocabulary = set(vocabulary)  # Set for unique words

    return OOV_words, clean_corpus, vocabulary

# Example usage:
oov_words, clean_corpus, vocabulary = replace_oov_words_train(train_corpus)


In [21]:
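def replace_oov_words_dev_test(corpus, vocabulary, oov_words):
    # Replace every word that is outside the training vocabulary with 'UNK'.
    # The oov_words check is kept as a safeguard; those words are already
    # excluded from the vocabulary.
    clean_corpus = []
    for sentence in corpus:
        updated_sentence = ['UNK' if ((word not in vocabulary) or (word in oov_words)) else word for word in sentence]
        clean_corpus.append(updated_sentence)
    return clean_corpus

# Apply the training vocabulary to the development and test subsets
final_dev_corpus = replace_oov_words_dev_test(dev_corpus, vocabulary, oov_words)
final_test_corpus = replace_oov_words_dev_test(test_corpus, vocabulary, oov_words)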


In [23]:
# Function to print samples from the final corpus
def print_samples(corpus, num_samples=5):
    """ Print a few sample sentences from the corpus """
    for i, sentence in enumerate(corpus[:num_samples]):
        print(f"Sample {i+1}: {sentence}")

# Print samples from final_dev_corpus and final_test_corpus
print("Samples from final_dev_corpus:")
print_samples(final_dev_corpus)

print("\nSamples from final_test_corpus:")
print_samples(final_test_corpus)


Samples from final_dev_corpus:
Sample 1: ['lt', 'UNK', 'ltd', 'sees', 'gain', 'from', 'sale', 'UNK', 'ltd', 'said', 'its', 'UNK', 'development', 'unit', 'agreed', 'to', 'sell', 'the', 'canadian', 'utilities', 'center', 'in', 'edmonton', 'alberta', 'and', 'the', 'canadian', 'western', 'center', 'in', 'UNK', '.']
Sample 2: ['what', 'i', "'m", 'really', 'saying', 'is', 'that', 'they', 'should', 'not', 'expect', 'us', 'to', 'simply', 'UNK', 'back', 'here', 'and', 'accept', 'increased', 'tightening', 'on', 'their', 'part', 'on', 'the', 'assumption', 'that', 'UNK', 'we', 'are', 'going', 'to', 'follow', 'them', 'he', 'added', '.']
Sample 3: ['mln', 'marks', 'from', '.']
Sample 4: ['the', 'minister', 'for', 'economic', 'affairs', 'would', 'need', 'to', 'be', 'informed', 'in', 'advance', 'of', 'deals', 'under', 'which', 'foreign', 'interests', 'planned', 'to', 'buy', 'a', 'new', 'stake', 'of', 'more', 'than', 'ten', 'pct', 'of', 'the', 'voting', 'shares', 'in', 'a', 'large', 'belgian', 'company

## Why Do We Compute the Sum of Logarithms of N-gram Probabilities?

In n-gram language models, probabilities are often very small, especially for longer sequences. Instead of directly multiplying these probabilities, we compute the **sum of logarithms** of the probabilities. Here's why this approach is used and why it is important.

### 1. **Small Probabilities**

In typical n-gram models, the probability of a sequence of words is calculated by multiplying the conditional probabilities of its individual n-grams. For example, the probability of a sequence like "the market is rising" is the product of factors such as:

$$
P(\text{the}) = 0.1, \quad P(\text{market} \mid \text{the}) = 0.05, \quad \dots
$$

Multiplying many such small probabilities yields an extremely small number, which can cause **numerical underflow** (i.e., the result becomes too small to be represented as a floating-point number and is rounded to zero).

### 2. **Logarithms for Numerical Stability**

To avoid underflow and to handle very small numbers efficiently, we use **logarithms**. Logarithms have the following useful property:

$$
\log(P(w_1, w_2, \dots, w_n)) = \log(P(w_1)) + \log(P(w_2 | w_1)) + \dots + \log(P(w_n | w_1, \dots, w_{n-1}))
$$

By applying the logarithm, we transform the **product of probabilities** into a **sum of logarithms**. This makes it easier to work with small values and avoids the numerical problems associated with multiplying many small numbers.
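A small experiment makes the underflow concrete. The 200 probabilities of 0.01 are made-up values chosen only to push the product below the smallest positive double-precision number.

```python
import math

probabilities = [0.01] * 200          # 200 made-up n-gram probabilities

product = 1.0
for p in probabilities:
    product *= p                      # the true value 1e-400 is not representable

log_sum = sum(math.log(p) for p in probabilities)

print(product)                        # 0.0
print(log_sum)                        # about -921.03
```

The product collapses to zero, so two different sequences of this length could no longer be compared, while the sum of logarithms remains an ordinary finite number.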

### 3. **Simplification of Calculations**

- **Logarithms** simplify the model's calculations by turning a **multiplicative** process into an **additive** one.
- The logarithm of a probability is never **positive** (probabilities lie between 0 and 1, and $\log 1 = 0$), but sums of these logarithms stay within a range that floating-point numbers represent without difficulty.

This transformation allows us to compute the **log-likelihood** of a sequence efficiently. For example, the log-likelihood of a sequence $( w_1, w_2, \dots, w_n )$ is given by:

$$
\text{log-likelihood} = \sum_{i=1}^{n} \log P(w_i | w_{i-1}, \dots, w_1)
$$

Because the logarithm is strictly increasing, maximizing the log-likelihood is equivalent to maximizing the likelihood, while being numerically much better behaved.
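As an illustration of the formula, the sketch below scores a sentence with a bigram model that uses Laplace (add-one) smoothing, the other smoothing option named in the exercise. The function names and the toy corpus are hypothetical, and the sentences are assumed to have had their OOV words replaced by `UNK` beforehand.

```python
import math
from collections import Counter

def train_laplace_bigram(corpus):
    """Collect bigram counts, context counts and the set of predictable tokens."""
    bigram_counts = Counter()
    context_counts = Counter()
    vocabulary = set()
    for sentence in corpus:
        tokens = ['<s>'] + sentence + ['<e>']
        vocabulary.update(tokens[1:])            # includes '<e>', excludes '<s>'
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts, vocabulary

def sentence_log_prob(sentence, bigram_counts, context_counts, vocabulary):
    """Sum of log P(w_i | w_{i-1}) with add-one smoothing."""
    tokens = ['<s>'] + sentence + ['<e>']
    vocab_size = len(vocabulary)
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)                  # add logs instead of multiplying
    return log_prob

toy_corpus = [['the', 'market', 'rose'], ['the', 'market', 'fell']]
model = train_laplace_bigram(toy_corpus)
seen = sentence_log_prob(['the', 'market', 'rose'], *model)
unseen = sentence_log_prob(['market', 'the', 'fell'], *model)
print(seen > unseen)                             # True
```

Adding 1 to every count keeps each probability strictly positive, so the logarithm is always defined, even for bigrams that never occurred in training.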

### 4. **Log-Likelihood Maximization**

Training a language model by **maximum likelihood estimation (MLE)** means choosing the parameters that maximize the likelihood of the training data. Because the logarithm turns the product of probabilities into a sum, the objective becomes easier to **optimize** and stays numerically stable.

## Why We Do the Same

- **Numerical Stability**: Multiplying many small probabilities can cause underflow. Taking the log of probabilities ensures that we avoid this issue.
- **Simplification**: Adding logarithms is computationally simpler and more stable than multiplying probabilities.
- **Optimization**: The log-likelihood function is easier to maximize compared to the product of probabilities.
- **Consistency**: Using log-probabilities is the standard approach in machine learning and natural language processing, ensuring consistency with other models and libraries.

In summary, computing the sum of the logarithms of the n-gram probabilities instead of directly multiplying the probabilities is a widely adopted practice for maintaining numerical stability and simplifying the optimization process in language models.

