*Author: Lucy Wu*

*Date: 03/12/2025*

# Bigram LETTER Language Identification Model

This notebook implements a bigram LETTER language identification model. One model per language is trained on text data from three languages: English, French, and Italian. The goal is to assign each sentence in a validation set to the language whose model gives it the highest log probability. The models are trained both without smoothing and with Add-One smoothing, which handles unseen bigrams.

### Workflow

1. **Preprocessing Function**:
    - `preprocess(text)`: Converts text to lowercase and removes spaces and newlines.

2. **Training Bigram Model**:
    - `train_bigram_model(file_path)`: Trains a bigram model for a given language by counting occurrences of bigrams and unigrams, then converting these counts to probabilities. Two versions are provided: one without smoothing and one with Add-One smoothing.

3. **Train Models for Each Language**:
    - We train bigram models for English, French, and Italian using the respective training files.

4. **Compute Sentence Probability**:
    - `compute_sentence_probability(sentence, bigram_model)`: Computes the log probability of a sentence under a given bigram model.

5. **Classify Validation Data**:
    - We read sentences from the validation file and compute their probabilities under each language model.
    - The language with the highest probability is chosen as the predicted language for each sentence.

6. **Output Results**:
    - The classification results are written to an output file.

7. **Compare Output with Solution**:
    - `compare_files(output_file, solution_file)`: Compares the output file with a solution file to identify differences.

### Variables

- `TRAINING_FILES`: Dictionary mapping each language to its corresponding training file.
- `TRAINING_PATH`: Path to the directory containing training data.
- `VALIDATION_FILE`: Path to the validation file.
- `LANGUAGES`: List of languages being considered.
- `bigram_models`: Dictionary containing trained bigram models for each language.
- `OUTPUT_FILE`: Path to the output file where classification results are stored.

This model uses letter-bigram probabilities to classify each sentence into one of the three languages.
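
As a rough illustration of the workflow above, the following self-contained sketch runs the same steps on tiny in-memory strings instead of the training files. The sample texts and the helpers `toy_train`, `toy_score`, and `toy_classify` are made up for this example and are not part of the notebook. The sketch assumes Add-One smoothing with a vocabulary built from all toy texts.

```python
import math
from collections import Counter

def preprocess(text):
    # Same idea as the notebook's preprocess: lowercase, drop spaces and newlines
    return text.lower().replace("\n", "").replace(" ", "")

def toy_train(text):
    # Count letter bigrams and the contexts (first letters) they start from
    text = preprocess(text)
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts

def toy_score(sentence, model, vocab_size):
    # Add-One smoothed log probability: sum of log((c(a,b) + 1) / (c(a) + V))
    bigrams, contexts = model
    s = preprocess(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(s, s[1:])
    )

# Hypothetical miniature "training corpora"
toy_corpora = {
    "English": "the quick brown fox jumps over the lazy dog and then the other dog",
    "French": "le renard brun saute par dessus le chien paresseux et puis l'autre chien",
    "Italian": "la volpe marrone salta sopra il cane pigro e poi l'altro cane",
}

# V = number of distinct characters across all toy corpora (an assumption of this sketch)
vocab_size = len(set(preprocess("".join(toy_corpora.values()))))
toy_models = {lang: toy_train(text) for lang, text in toy_corpora.items()}

def toy_classify(sentence):
    # Score the sentence under every model and keep the best-scoring language
    scores = {lang: toy_score(sentence, m, vocab_size) for lang, m in toy_models.items()}
    return max(scores, key=scores.get)

print(toy_classify("the lazy dog"))   # English
print(toy_classify("il cane pigro"))  # Italian
```

With corpora this small the result only shows the mechanics; the notebook's models are trained on the full training files.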

### Project Setup

In [13]:
import os
import math
from collections import defaultdict

# Define paths for training data and validation file
TRAINING_PATH = "../Data/Input/"
VALIDATION_FILE = "../Data/Validation/LangId.test"

# Dictionary containing file names for each language
TRAINING_FILES = {
    "English": "LangId.train.English",
    "French": "LangId.train.French",
    "Italian": "LangId.train.Italian"
}

# List of languages
LANGUAGES = ["English", "French", "Italian"]

### Preprocessing Function

In [14]:
# Function to preprocess text by converting to lowercase and removing spaces/newlines
def preprocess(text):
    return text.lower().replace("\n", " ").replace(" ", "")
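
To make the effect of preprocessing concrete, here is a small sketch on a made-up input string (the sample text is an assumption, not taken from the data). Because spaces and newlines are removed before counting, bigrams span word boundaries, and punctuation or digits are kept as ordinary characters.

```python
def preprocess(text):
    # Copy of the notebook's function so the example runs on its own
    return text.lower().replace("\n", " ").replace(" ", "")

sample = "Hello World\nBonjour le monde"  # hypothetical input
cleaned = preprocess(sample)
print(cleaned)  # helloworldbonjourlemonde

# Letter bigrams are adjacent character pairs of the cleaned string
bigrams = list(zip(cleaned, cleaned[1:]))
print(bigrams[:5])  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o'), ('o', 'w')]
```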

### Creating & Training Models (without smoothing)
These models estimate probabilities from observed counts only, with no adjustment for unseen LETTER bigrams.

Instead of smoothing, a sentence that contains a bigram never observed in a language's training data receives a log probability of negative infinity under that language's model, so the models rely strictly on observed data.
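
In symbols, the unsmoothed estimate and the score of a preprocessed sentence $s = s_1 s_2 \dots s_n$ can be written as follows. Here $\text{count}(c_1)$ is the number of bigrams that start with $c_1$, which is what `unigram_counts` holds in the code below.

$$
P(c_2 \mid c_1) = \frac{\text{count}(c_1 c_2)}{\text{count}(c_1)}, \qquad
\log P(s) = \sum_{i=1}^{n-1} \log P(s_{i+1} \mid s_i)
$$

If any $\text{count}(s_i s_{i+1})$ is zero, the corresponding factor is zero and $\log P(s) = -\infty$. Note that the score covers only transitions between letters; the probability of the first letter $s_1$ is not included.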

***Skip the next block and jump to `Classify Validation Data` section to see different output results.***

In [19]:
# Function to train a bigram model from a given training file
def train_bigram_model(file_path):
    # Dictionaries to store bigram counts and unigram counts
    bigram_counts = defaultdict(lambda: defaultdict(int))
    unigram_counts = defaultdict(int)
    
    # Open the training file and preprocess the text
    with open(file_path, "r", encoding="utf-8") as f:
        text = preprocess(f.read())
        
        # Count occurrences of bigrams and unigrams
        for i in range(len(text) - 1):
            bigram = (text[i], text[i + 1])
            bigram_counts[bigram[0]][bigram[1]] += 1
            unigram_counts[bigram[0]] += 1
    
    # Convert bigram counts to probabilities (without smoothing)
    bigram_probs = {}
    for first_char, following_chars in bigram_counts.items():
        total = unigram_counts[first_char]  # No smoothing
        bigram_probs[first_char] = {char: count / total for char, count in following_chars.items()}
    
    return bigram_probs

# Train bigram models for each language
bigram_models = {}
for lang, filename in TRAINING_FILES.items():
    file_path = os.path.join(TRAINING_PATH, filename)
    bigram_models[lang] = train_bigram_model(file_path)

# Function to compute log probability of a sentence under a given bigram model
def compute_sentence_probability(sentence, bigram_model):
    sentence = preprocess(sentence)  # Preprocess the sentence
    log_prob = 0  # Initialize log probability to 0
    
    # Iterate over character bigrams in the sentence
    for i in range(len(sentence) - 1):
        first, second = sentence[i], sentence[i + 1]
        
        # Retrieve bigram probability, defaulting to 0 if the bigram was never observed
        prob = bigram_model.get(first, {}).get(second, 0)
        
        if prob > 0:
            log_prob += math.log(prob)
        else:
            return float('-inf')  # Assign negative infinity if an unseen bigram is encountered
    
    return log_prob

### Creating & Training Models (with Add-One smoothing)
This block replaces the previous implementation with one that applies Add-One smoothing to handle unseen LETTER bigrams.
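
For reference, the standard Add-One (Laplace) estimate adds one to every bigram count and adds the vocabulary size $V$ to the denominator:

$$
P_{\text{add-1}}(c_2 \mid c_1) = \frac{\text{count}(c_1 c_2) + 1}{\text{count}(c_1) + V}
$$

A bigram with zero count therefore receives probability $1 / (\text{count}(c_1) + V)$ rather than zero. In the code below, $V$ is taken to be the number of distinct characters that occur as the first letter of a bigram in the training text.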

***Re-run the `Classify Validation Data` section to see different output results.***

In [16]:
# Function to train a bigram model from a given training file
def train_bigram_model(file_path):
    # Dictionaries to store bigram counts and unigram counts
    bigram_counts = defaultdict(lambda: defaultdict(int))
    unigram_counts = defaultdict(int)
    
    # Open the training file and preprocess the text
    with open(file_path, "r", encoding="utf-8") as f:
        text = preprocess(f.read())
        
        # Count occurrences of bigrams and unigrams
        for i in range(len(text) - 1):
            bigram = (text[i], text[i + 1])
            bigram_counts[bigram[0]][bigram[1]] += 1
            unigram_counts[bigram[0]] += 1
    
    # Convert bigram counts to probabilities using Add-One smoothing
    vocab_size = len(bigram_counts)  # Vocabulary size: distinct characters seen as the first letter of a bigram
    bigram_probs = {}
    unseen_probs = {}
    for first_char, following_chars in bigram_counts.items():
        total = unigram_counts[first_char] + vocab_size  # Add vocabulary size for smoothing
        bigram_probs[first_char] = {char: (count + 1) / total for char, count in following_chars.items()}
        unseen_probs[first_char] = 1 / total  # Smoothed probability of any bigram with zero count
    
    return {"seen": bigram_probs, "unseen": unseen_probs, "vocab_size": vocab_size}

# Train bigram models for each language
bigram_models = {}
for lang, filename in TRAINING_FILES.items():
    file_path = os.path.join(TRAINING_PATH, filename)
    bigram_models[lang] = train_bigram_model(file_path)

# Function to compute log probability of a sentence under a given bigram model
def compute_sentence_probability(sentence, bigram_model):
    sentence = preprocess(sentence)  # Preprocess the sentence
    log_prob = 0  # Initialize log probability to 0
    
    # Iterate over character bigrams in the sentence
    for i in range(len(sentence) - 1):
        first, second = sentence[i], sentence[i + 1]
        
        # Add-One probability of an unseen bigram: 1 / (count(first) + V),
        # or 1 / V if `first` never occurred in the training data
        unseen_prob = bigram_model["unseen"].get(first, 1 / bigram_model["vocab_size"])
        
        # Retrieve the smoothed bigram probability, falling back to the unseen-bigram value
        prob = bigram_model["seen"].get(first, {}).get(second, unseen_prob)
        
        # Add the log probability (log avoids numerical underflow)
        log_prob += math.log(prob)
    
    return log_prob
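
As a small worked example of Add-One smoothing, the sketch below applies the formula to made-up counts for the letters following `q`. The counts and the three-letter vocabulary are assumptions chosen to keep the arithmetic easy to follow. Exact fractions are used so the normalization can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical counts of letters following 'q' in a tiny corpus
following_q = {"u": 3, "a": 1}
vocabulary = ["a", "q", "u"]               # assumed full set of possible next letters
context_count = sum(following_q.values())  # count('q') = 4
V = len(vocabulary)

# (count + 1) / (count('q') + V) for every possible next letter, seen or unseen
smoothed = {
    ch: Fraction(following_q.get(ch, 0) + 1, context_count + V)
    for ch in vocabulary
}
for ch, p in smoothed.items():
    print(ch, p)  # a 2/7, q 1/7, u 4/7

# The smoothed distribution still sums to 1 because V covers every possible next letter
print(sum(smoothed.values()))  # 1
```

The unseen bigram `qq` ends up with probability 1/7, which is small but non-zero, so a single unseen bigram no longer rules out a language.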

### Classify Validation Data

In [20]:
# Load and classify validation data
with open(VALIDATION_FILE, "r", encoding="utf-8") as f:
    test_sentences = f.readlines()  # Read test sentences line by line

results = []

# Compute probabilities for each sentence in the test set
for idx, sentence in enumerate(test_sentences, start=1):
    # Calculate the probability of the sentence under each language model
    scores = {lang: compute_sentence_probability(sentence, bigram_models[lang]) for lang in LANGUAGES}
    
    # Determine the language with the highest probability
    predicted_language = max(scores, key=scores.get)
    
    # Store the result as "[line_number] [predicted_language]"
    results.append(f"{idx} {predicted_language}")

# Write the classification results
OUTPUT_FILE = "../Data/Output/letterLangId.out"
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    f.write("\n".join(results))
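
One detail of `max(scores, key=scores.get)` is worth keeping in mind: when several languages tie, Python returns the first key in iteration order. With the unsmoothed models, a sentence that contains an unseen bigram under every language scores `-inf` three times, so it is labeled `English` only because that key comes first. The sketch below shows this behavior and one possible way to surface ties. The helper `predict_or_unknown` and the `"Unknown"` label are hypothetical and not part of the notebook.

```python
NEG_INF = float("-inf")

# All three models rejected the sentence (unsmoothed case)
tied_scores = {"English": NEG_INF, "French": NEG_INF, "Italian": NEG_INF}
print(max(tied_scores, key=tied_scores.get))  # English

def predict_or_unknown(scores):
    # Hypothetical guard: return a language only if it is the unique best scorer
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "Unknown"

print(predict_or_unknown(tied_scores))  # Unknown
print(predict_or_unknown({"English": -12.5, "French": -9.0, "Italian": NEG_INF}))  # French
```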

### Compare Output with Solution

In [21]:
def compare_files(output_file, solution_file):
    with open(output_file, 'r', encoding='utf-8') as f1, open(solution_file, 'r', encoding='utf-8') as f2:
        # Strip whitespace so a missing trailing newline is not counted as a difference
        output_lines = [line.strip() for line in f1]
        solution_lines = [line.strip() for line in f2]
    
    # Warn if the files differ in length; only the overlapping lines are compared
    if len(output_lines) != len(solution_lines):
        print("Files have different number of lines.")

    # Counter for the number of differences
    diff_count = 0
    compared = min(len(output_lines), len(solution_lines))

    for i in range(compared):
        if output_lines[i] != solution_lines[i]:
            diff_count += 1
            print(f"Line {i + 1} is different:")
            print(f"Output: {output_lines[i]}")
            print(f"Solution: {solution_lines[i]}")
            print()

    # Print the total number of differences and the accuracy over the compared lines
    print(f"Total number of wrong predictions: {diff_count}")
    if compared:
        print(f"Model Accuracy: {(compared - diff_count) / compared * 100:.2f}%")
    
# File paths
output_file = '../Data/Output/letterLangId.out'
solution_file = '../Data/Validation/labels.sol'

# Call the function
compare_files(output_file, solution_file)

Line 22 is different:
Output: 22 Italian
Solution: 22 French

Line 24 is different:
Output: 24 English
Solution: 24 French

Line 25 is different:
Output: 25 English
Solution: 25 Italian

Line 43 is different:
Output: 43 English
Solution: 43 Italian

Line 44 is different:
Output: 44 English
Solution: 44 Italian

Line 60 is different:
Output: 60 English
Solution: 60 French

Line 66 is different:
Output: 66 English
Solution: 66 French

Line 83 is different:
Output: 83 English
Solution: 83 Italian

Line 91 is different:
Output: 91 English
Solution: 91 Italian

Line 111 is different:
Output: 111 English
Solution: 111 French

Line 169 is different:
Output: 169 English
Solution: 169 Italian

Line 185 is different:
Output: 185 English
Solution: 185 French

Line 190 is different:
Output: 190 English
Solution: 190 Italian

Line 221 is different:
Output: 221 English
Solution: 221 French

Line 223 is different:
Output: 223 English
Solution: 223 Italian

Line 232 is different:
Output: 232 French
So