*Author: Lucy Wu*

*Date: 03/12/2025*

# Bigram WORD Language Identification Model 2

This notebook implements a bigram WORD language identification model. A separate model is trained on text in each of three languages (English, French, and Italian), and each sentence in a validation set is assigned to the language whose model gives it the highest probability. The models use Good-Turing smoothing.

### Workflow

1. **Preprocessing Function**:
    - `preprocess(text)`: Lowercases the text, replaces newlines with spaces, and splits it into whitespace-separated word tokens.

2. **Training Bigram Model**:
    - `train_bigram_model(file_path)`: Trains a bigram model for a given language by counting occurrences of bigrams and unigrams, and then converting these counts to probabilities using Good-Turing smoothing.

3. **Train Models for Each Language**:
    - We train bigram models for English, French, and Italian using the respective training files.

4. **Compute Sentence Probability**:
    - `compute_sentence_probability(sentence, bigram_model)`: Computes the log probability of a sentence under a given bigram model.

5. **Classify Validation Data**:
    - We read sentences from the validation file and compute their probabilities under each language model.
    - The language with the highest probability is chosen as the predicted language for each sentence.

6. **Output Results**:
    - The classification results are written to an output file.

7. **Compare Output with Solution**:
    - `compare_files(output_file, solution_file)`: Compares the output file with a solution file to identify differences.
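
The smoothing and decision rule in the steps above can be summarized as follows. The notation is introduced here for illustration and does not appear in the notebook: $N_c$ is the number of distinct bigrams seen exactly $c$ times in training, and $N$ is the total number of bigram tokens. This reflects what the training code computes, not a full Good-Turing derivation.

$$
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}, \qquad
P(w_{i-1}, w_i) = \frac{c^{*}}{N}, \qquad
P_{\text{unseen}} = \frac{N_{1}}{N}
$$

When no bigram has count $c + 1$ (so $N_{c+1} = 0$), the code keeps the raw count, $c^{*} = c$. A sentence $w_1 \dots w_n$ is then assigned to the language $\ell$ whose model gives the highest summed log probability:

$$
\hat{\ell} = \arg\max_{\ell} \sum_{i=2}^{n} \log P_{\ell}(w_{i-1}, w_i)
$$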

### Variables

- `TRAINING_FILES`: Dictionary mapping each language to its corresponding training file.
- `TRAINING_PATH`: Path to the directory containing training data.
- `VALIDATION_FILE`: Path to the validation file.
- `LANGUAGES`: List of languages being considered.
- `bigram_models`: Dictionary containing trained bigram models for each language.
- `OUTPUT_FILE`: Path to the output file where classification results are stored.

This model classifies each sentence by its bigram log probability under the three language models. On the validation set it misclassifies a single sentence (line 126, a French sentence predicted as Italian).

### Project Set Up

In [53]:
import os
import math
from collections import defaultdict

# Define paths for training data and validation file
TRAINING_PATH = "../Data/Input/"
VALIDATION_FILE = "../Data/Validation/LangId.test"

# Dictionary containing file names for each language
TRAINING_FILES = {
    "English": "LangId.train.English",
    "French": "LangId.train.French",
    "Italian": "LangId.train.Italian"
}

# List of languages
LANGUAGES = ["English", "French", "Italian"]

### Preprocessing Function

In [54]:
# Function to preprocess text: lowercase it and split it into whitespace-separated word tokens
def preprocess(text):
    return text.lower().replace("\n", " ").split()
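
As a quick illustration of what this tokenization produces, here is a small sketch on a made-up sentence. The `preprocess` body is copied from the cell above so the snippet runs on its own; the sample text and the `bigrams` variable are assumptions for the example.

```python
def preprocess(text):
    # Same one-liner as the notebook: lowercase, newline -> space, split on whitespace
    return text.lower().replace("\n", " ").split()

tokens = preprocess("The cat\nsat on the Mat.")
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat.']

# Adjacent token pairs are the word bigrams the model counts
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams[:2])  # [('the', 'cat'), ('cat', 'sat')]
```

Punctuation is not stripped, so `mat.` and `mat` would be counted as different words.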

### Creating & Training Models (with Good-Turing smoothing)
A different implementation from the previous block: it applies Good-Turing smoothing to handle unseen WORD bigrams.

***Re-run the `Classify Validation Data` section to see different output results.***

In [55]:
# Function to train a bigram model with Good-Turing smoothing
def train_bigram_model(file_path):
    # Dictionaries to store bigram and unigram counts
    bigram_counts = defaultdict(lambda: defaultdict(int))
    unigram_counts = defaultdict(int)
    frequency_of_counts = defaultdict(int)  # Tracks how many bigrams have each count
    vocabulary = set()

    # Open and preprocess the training file
    with open(file_path, "r", encoding="utf-8") as f:
        words = preprocess(f.read())
        vocabulary.update(words)
        
        # Count occurrences of bigrams and unigrams
        for i in range(len(words) - 1):
            bigram = (words[i], words[i + 1])
            bigram_counts[bigram[0]][bigram[1]] += 1
            unigram_counts[bigram[0]] += 1

    # Compute frequency of counts for Good-Turing smoothing
    for first_word, following_words in bigram_counts.items():
        for second_word, count in following_words.items():
            frequency_of_counts[count] += 1
    
    # Compute Good-Turing smoothed probabilities
    bigram_probs = {}
    vocab_size = len(vocabulary)

    # Total number of bigrams
    total_bigrams = sum(sum(following_words.values()) for following_words in bigram_counts.values())

    for first_word, following_words in bigram_counts.items():
        bigram_probs[first_word] = {}
        
        for second_word, count in following_words.items():
            # Apply Good-Turing estimation
            if count + 1 in frequency_of_counts and frequency_of_counts[count] > 0:
                adjusted_count = (count + 1) * (frequency_of_counts[count + 1] / frequency_of_counts[count])
            else:
                adjusted_count = count  # Default to normal count if no higher count exists

            bigram_probs[first_word][second_word] = adjusted_count / total_bigrams

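        # Give every unseen (first_word, word) pair the probability N1 / N, where N1 is the
        # number of bigrams seen exactly once; each pair gets this full value, so rows are not normalized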
        unseen_prob = frequency_of_counts[1] / total_bigrams if frequency_of_counts[1] > 0 else 1 / total_bigrams
        for word in vocabulary:
            if word not in bigram_probs[first_word]:
                bigram_probs[first_word][word] = unseen_prob

    return bigram_probs

# Train bigram models for each language
bigram_models = {}
for lang, filename in TRAINING_FILES.items():
    file_path = os.path.join(TRAINING_PATH, filename)
    bigram_models[lang] = train_bigram_model(file_path)

# Function to compute log probability of a sentence under a given bigram model
def compute_sentence_probability(sentence, bigram_model):
    words = preprocess(sentence)  # Preprocess the sentence
    log_prob = 0  # Initialize log probability
    
    # Iterate over word bigrams in the sentence
    for i in range(len(words) - 1):
        first, second = words[i], words[i + 1]

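        # Look up the probabilities of words following `first` (empty if `first` was never seen in training)
        first_word_probs = bigram_model.get(first, {})
        total_prob_mass = sum(first_word_probs.values())

        if total_prob_mass > 0:
            # train_bigram_model stores no "__unseen__" key, so a `second` word outside
            # the training vocabulary falls back to 1 / total_prob_mass
            prob = first_word_probs.get(second, first_word_probs.get("__unseen__", 1 / total_prob_mass))
        else:
            prob = 1e-10  # Small fallback probability when `first` is unseen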

        log_prob += math.log(prob)

    return log_prob
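
To make the count adjustment concrete, the following is a minimal sketch of the same Good-Turing step on a toy token sequence. It is a simplified stand-in, not the notebook's `train_bigram_model`: the function name `good_turing_adjusted_counts` and the toy tokens are assumptions, and it uses a flat `Counter` of bigrams instead of nested dictionaries.

```python
from collections import Counter

def good_turing_adjusted_counts(tokens):
    """Return (adjusted_counts, unseen_prob) for the word bigrams in tokens."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = sum(bigram_counts.values())
    # N_c: number of distinct bigrams observed exactly c times
    freq_of_counts = Counter(bigram_counts.values())

    adjusted = {}
    for bigram, c in bigram_counts.items():
        if freq_of_counts[c + 1] > 0:
            # c* = (c + 1) * N_{c+1} / N_c
            adjusted[bigram] = (c + 1) * freq_of_counts[c + 1] / freq_of_counts[c]
        else:
            # No bigram has count c + 1: keep the raw count, as the notebook does
            adjusted[bigram] = float(c)

    n1 = freq_of_counts[1]
    unseen_prob = (n1 if n1 > 0 else 1) / total
    return adjusted, unseen_prob

tokens = "a b a b a c a d".split()
adjusted, unseen_prob = good_turing_adjusted_counts(tokens)
print(adjusted[("a", "c")])  # seen once: 2 * N2 / N1 = 2 * 2 / 3
print(adjusted[("a", "b")])  # seen twice, N3 = 0: stays 2.0
print(unseen_prob)           # N1 / N = 3 / 7
```

In this toy sequence three bigrams occur once and two occur twice, so singletons are discounted from 1 to about 1.33 only because there are relatively many doubletons; on real data the adjusted count for rare bigrams is usually below the raw count.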

### Classify Validation Data

In [56]:
# Load and classify validation data
with open(VALIDATION_FILE, "r", encoding="utf-8") as f:
    test_sentences = f.readlines()  # Read test sentences line by line

results = []

# Compute probabilities for each sentence in the test set
for idx, sentence in enumerate(test_sentences, start=1):
    # Calculate the probability of the sentence under each language model
    scores = {lang: compute_sentence_probability(sentence, bigram_models[lang]) for lang in LANGUAGES}
    
    # Determine the language with the highest probability
    predicted_language = max(scores, key=scores.get)
    
    # Store the result as "[line_number] [predicted_language]"
    results.append(f"{idx} {predicted_language}")

# Write the classification results
OUTPUT_FILE = "../Data/Output/wordLangId2.out"
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    f.write("\n".join(results))
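
The scores compared above are sums of log probabilities, not products of probabilities. The sketch below, using made-up numbers, shows why that matters: multiplying many small probabilities underflows to zero in floating point, while the sum of their logs stays finite and comparable. The `scores` values are illustrative, not results from this notebook.

```python
import math

probs = [1e-10] * 40  # 40 bigrams, each with a very small probability

product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0: the true value 1e-400 is below the smallest positive float
print(log_sum)  # a finite negative number, roughly -921

# Picking the best language from log scores, as max(scores, key=scores.get) does
scores = {"English": -120.5, "French": -98.2, "Italian": -101.7}  # made-up values
print(max(scores, key=scores.get))  # French
```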

### Compare Output with Solution

In [57]:
def compare_files(output_file, solution_file):
    with open(output_file, 'r') as f1, open(solution_file, 'r') as f2:
        output_lines = f1.readlines()
        solution_lines = f2.readlines()
    
    # Warn if the files have different lengths; only the overlapping lines are compared
    if len(output_lines) != len(solution_lines):
        print("Files have different number of lines.")

    # Counter for the number of differences
    diff_count = 0

    # Compare line by line, ignoring surrounding whitespace such as a trailing newline
    for i, (output_line, solution_line) in enumerate(zip(output_lines, solution_lines), start=1):
        if output_line.strip() != solution_line.strip():
            diff_count += 1
            print(f"Line {i} is different:")
            print(f"Output: {output_line.strip()}")
            print(f"Solution: {solution_line.strip()}")
            print()

    # Print the total number of differences and the accuracy over the solution lines
    total = len(solution_lines)
    print(f"Total number of wrong predictions: {diff_count}")
    print(f"Model Accuracy: {100 * (total - diff_count) / total}%")
    
# File paths
output_file = '../Data/Output/wordLangId2.out'
solution_file = '../Data/Validation/labels.sol'

# Call the function
compare_files(output_file, solution_file)

Line 126 is different:
Output: 126 Italian
Solution: 126 French

Total number of wrong predictions: 1
Model Accuracy: 99.66666666666667%
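
Beyond a single accuracy figure, it can help to see which language pairs get confused. The following is a small sketch that tallies (solution, predicted) label pairs from lines in the `<line_number> <language>` format used above. The helper name `confusion_counts` and the sample lines are hypothetical, not part of the notebook.

```python
from collections import Counter

def confusion_counts(predicted_lines, gold_lines):
    """Count (gold_label, predicted_label) pairs for lines of the form '<n> <label>'."""
    pairs = Counter()
    for pred, gold in zip(predicted_lines, gold_lines):
        if not pred.strip() or not gold.strip():
            continue  # skip blank lines
        pred_label = pred.split()[-1]
        gold_label = gold.split()[-1]
        pairs[(gold_label, pred_label)] += 1
    return pairs

predicted = ["1 English", "2 French", "3 Italian", "4 Italian"]
gold = ["1 English", "2 French", "3 Italian", "4 French"]

counts = confusion_counts(predicted, gold)
errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
print(errors)  # {('French', 'Italian'): 1}
```

Applied to the output and solution files, a tally like this would show the line 126 error as one French sentence labeled Italian.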
