# Vietnamese Accent Prediction Model

This notebook contains the complete implementation of a Vietnamese accent prediction model. It allows you to train, evaluate, and use a model that predicts accents in Vietnamese text.

## Overview

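Vietnamese text written without accents (diacritics) is ambiguous: a single unaccented syllable can correspond to several accented words with different meanings (for example, `hoa` may stand for `hoa`, `hòa`, or `hóa`). This model uses n-gram language modeling with beam search to predict the most likely accented version of unaccented Vietnamese text.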

**Key Components:**
* Utility functions for text processing
* Data loading and corpus preparation
* N-gram model training
* Accent prediction using beam search
* Model evaluation metrics
* Interactive prediction
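
To make the ambiguity concrete, the following minimal sketch groups a few accented syllables by their unaccented form. The helper `strip_diacritics` is a hypothetical name used only here; it relies on Unicode decomposition rather than the regex-based approach used later in this notebook, and the word list is chosen purely for illustration.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove Vietnamese diacritics via Unicode decomposition (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop the combining marks (tone marks, breve, circumflex, horn) exposed by NFD.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ" has no canonical decomposition, so it is mapped explicitly.
    return base.replace("đ", "d")


# A handful of Vietnamese syllables, chosen only for illustration.
words = ["hoa", "hòa", "hóa", "họa", "ma", "má", "mà", "mã"]

groups = defaultdict(list)
for w in words:
    groups[strip_diacritics(w)].append(w)

for unaccented, accented_forms in groups.items():
    print(f"{unaccented!r} <- {accented_forms}")
```

Each unaccented key maps to several candidates, which is why the model needs surrounding context to choose among them.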

## 1. Setup and Dependencies

This section installs all required dependencies and sets up the environment for the Vietnamese Accent Prediction model to work in Google Colab.

In [None]:
# Install required packages
!pip install nltk==3.5 scikit-learn==1.2.2 dill~=0.3.7 pandas~=2.0.3 matplotlib~=3.7.5 seaborn~=0.13.2 tqdm~=4.66.0 requests~=2.31.0

# Import common libraries
import os
import sys
import re
import pickle
import string
import json
import random
import nltk
import numpy as np
import pandas as pd
import multiprocessing as mp
from tqdm import tqdm
from collections import defaultdict
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.lm import KneserNeyInterpolated, Laplace, WittenBellInterpolated

# Download NLTK data
nltk.download('punkt')

# Create necessary directories
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "data")
MODEL_DIR = os.path.join(BASE_DIR, "models")
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

# Constants
TRAIN_EXTRACT_PATH = os.path.join(DATA_DIR, "Train_Full")
SYLLABLES_NAME = "vn_syllables.txt"
SYLLABLES_PATH = os.path.join(DATA_DIR, SYLLABLES_NAME)
DEFAULT_MODEL_FILENAME = "kneserney_trigram_model.pkl"
N_GRAM_ORDER = 3

## 2. Utility Functions

This section defines core utility functions for text processing, including tokenization and accent removal. These functions form the foundation of the model's ability to process Vietnamese text.
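
One caveat worth illustrating: regex character classes that list precomposed letters only match text in Unicode NFC form. The sketch below is an assumption about how input might arrive (some keyboards and files produce decomposed NFD text), not a description of the training data. `strip_tone_on_a` is a hypothetical, cut-down version of a single substitution step.

```python
import re
import unicodedata


def strip_tone_on_a(word: str) -> str:
    """Toy substitution step: map precomposed toned forms of 'a' to plain 'a'."""
    # The class lists the precomposed code points for: á à ả ã ạ
    return re.sub("[\u00e1\u00e0\u1ea3\u00e3\u1ea1]", "a", word)


composed = unicodedata.normalize("NFC", "hoa\u0300ng")   # "hoàng" with a precomposed "à"
decomposed = unicodedata.normalize("NFD", composed)      # "a" followed by a combining grave

print(len(composed), len(decomposed))

# The precomposed form is matched by the character class.
print(strip_tone_on_a(composed))

# The decomposed form is not: the combining mark is left in place.
print(strip_tone_on_a(decomposed) == "hoang")

# Normalizing to NFC first makes the substitution work again.
print(strip_tone_on_a(unicodedata.normalize("NFC", decomposed)))
```

If decomposed input is a possibility, normalizing to NFC before calling the accent-removal function is one way to keep the substitutions reliable.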

In [None]:
# Utility functions for Vietnamese text processing (from utils.py)

def tokenize(doc: str) -> list[str]:
    """Tokenize a document into words."""
    tokens = nltk.word_tokenize(doc.lower())
    # Allow underscore, remove other punctuation
    table = str.maketrans('', '', string.punctuation.replace("_", ""))
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word]  # Remove empty strings
    return tokens

def remove_vn_accent(word: str) -> str:
    """Remove Vietnamese accents from a word."""
    word = word.lower()
    word = re.sub(r'[áàảãạăắằẳẵặâấầẩẫậ]', 'a', word)
    word = re.sub(r'[éèẻẽẹêếềểễệ]', 'e', word)
    word = re.sub(r'[óòỏõọôốồổỗộơớờởỡợ]', 'o', word)
    word = re.sub(r'[íìỉĩị]', 'i', word)
    word = re.sub(r'[úùủũụưứừửữự]', 'u', word)
    word = re.sub(r'[ýỳỷỹỵ]', 'y', word)
    word = re.sub(r'đ', 'd', word)
    return word

def gen_accents_word(word: str, syllables_path: str = SYLLABLES_PATH) -> set[str]:
    """Generate all possible accented forms of a word using a syllables file."""
    normalized_input_word = word.lower()
    word_no_accent = remove_vn_accent(normalized_input_word)
    all_accent_word = {normalized_input_word}  # Start with the input word (normalized)

    if not os.path.exists(syllables_path):
        print(f"Warning: Syllables file not found at {syllables_path}. "
              f"Accent generation will be limited to the input word: '{word}'.")
        if word_no_accent != normalized_input_word:
            all_accent_word.add(word_no_accent)
        return all_accent_word

    try:
        with open(syllables_path, 'r', encoding='utf-8') as f:
            for w_line in f.read().splitlines():
                w_line_lower = w_line.lower()  # Normalize file content
                if remove_vn_accent(w_line_lower) == word_no_accent:
                    all_accent_word.add(w_line_lower)
    except Exception as e:
        print(f"Error reading or processing syllables file {syllables_path}: {e}")
    
    return all_accent_word

# Test the utility functions
print("Tokenize example:")
print(tokenize("Đây_là một câu, ví dụ."))

print("\nRemove accent example:")
print(remove_vn_accent("hoàng"))
print(remove_vn_accent("Hoàng"))

# Create a small sample syllables file for this demo. It is written to its own path
# so that it does not prevent the full syllables list from being downloaded to
# SYLLABLES_PATH in the next section.
SAMPLE_SYLLABLES_PATH = os.path.join(DATA_DIR, "sample_syllables.txt")
print(f"\nCreating sample syllables file at {SAMPLE_SYLLABLES_PATH} for testing...")
try:
    with open(SAMPLE_SYLLABLES_PATH, "w", encoding="utf-8") as f:
        f.write("hoàng\nhoang\nhọang\nhởang\nhờang\nhoa\nhòa\nviệt\n")  # Add some test data
    print("Created sample syllables file with test data.")
except Exception as e:
    print(f"Error creating sample syllables file: {e}")

print("\nGenerate accents example:")
print(f"'hoang': {gen_accents_word('hoang', SAMPLE_SYLLABLES_PATH)}")
print(f"'hoa': {gen_accents_word('hoa', SAMPLE_SYLLABLES_PATH)}")
print(f"'viet': {gen_accents_word('viet', SAMPLE_SYLLABLES_PATH)}")
print(f"'HoÀnG': {gen_accents_word('HoÀnG', SAMPLE_SYLLABLES_PATH)}")


## 3. Data Loader Module

This section implements the data loading functionality from `data_loader.py`. It includes functions for checking data existence, loading the corpus, and splitting data for training and testing.
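
The seeded shuffle-and-split step can be sketched in isolation as follows. This is a minimal illustration under assumptions: `split_pairs` is a hypothetical helper, the pairs are made up, and it uses a private `random.Random` instance instead of seeding the global generator.

```python
import random


def split_pairs(pairs, test_size=0.2, seed=42):
    """Shuffle with a private seeded RNG and split into (train, test)."""
    rng = random.Random(seed)   # local RNG: leaves the global random state untouched
    shuffled = list(pairs)      # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    split_index = int(len(shuffled) * (1 - test_size))
    return shuffled[:split_index], shuffled[split_index:]


# Toy (unaccented_string, tokenized_accented_sentence) pairs, made up for illustration.
pairs = [(f"cau {i}", ["câu", str(i)]) for i in range(10)]

train, test = split_pairs(pairs)
print(len(train), len(test))
```

Because the seed is fixed, repeated calls return the same partition, which keeps evaluation numbers comparable between runs.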

In [None]:
# Data loader functions (from data_loader.py)

# Constants for downloading data if needed
TRAIN_ZIP_URL = "https://github.com/hoanganhpham1006/Vietnamese_Language_Model/raw/master/Train_Full.zip"
TRAIN_ZIP_NAME = "Train_Full.zip"
TRAIN_ZIP_PATH = os.path.join(DATA_DIR, TRAIN_ZIP_NAME)

SYLLABLES_URL = "https://gist.githubusercontent.com/hieuthi/0f5adb7d3f79e7fb67e0e499004bf558/raw/135a4d9716e49a981624474156d6f247b9b46f6a/all-vietnamese-syllables.txt"

def download_and_prepare_data():
    """Download and extract training data and syllables file."""
    import requests
    import zipfile
    
    # Create data directory if it doesn't exist
    os.makedirs(DATA_DIR, exist_ok=True)
    
    # Download training data zip file if it doesn't exist
    if not os.path.exists(TRAIN_EXTRACT_PATH) or not os.listdir(TRAIN_EXTRACT_PATH):
        print(f"Downloading training data from {TRAIN_ZIP_URL}...")
        if not os.path.exists(TRAIN_ZIP_PATH):
            try:
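                response = requests.get(TRAIN_ZIP_URL, stream=True, timeout=60)
                response.raise_for_status()  # Abort on HTTP errors instead of saving an error page as the zip
                total_size = int(response.headers.get('content-length', 0))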
                block_size = 8192
                with open(TRAIN_ZIP_PATH, 'wb') as f, tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
                    for data in response.iter_content(block_size):
                        f.write(data)
                        pbar.update(len(data))
                print(f"Downloaded training data to {TRAIN_ZIP_PATH}")
            except Exception as e:
                print(f"Error downloading training data: {e}")
                return False
        
        # Extract training data zip file
        print(f"Extracting training data to {TRAIN_EXTRACT_PATH}...")
        try:
            os.makedirs(TRAIN_EXTRACT_PATH, exist_ok=True)
            with zipfile.ZipFile(TRAIN_ZIP_PATH, 'r') as zip_ref:
                zip_ref.extractall(DATA_DIR)
            print(f"Extracted training data to {TRAIN_EXTRACT_PATH}")
        except Exception as e:
            print(f"Error extracting training data: {e}")
            return False
    
    # Download syllables file if it doesn't exist
    if not os.path.exists(SYLLABLES_PATH):
        print(f"Downloading Vietnamese syllables from {SYLLABLES_URL}...")
        try:
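            response = requests.get(SYLLABLES_URL, timeout=60)
            response.raise_for_status()  # Abort on HTTP errors instead of saving an error page
            response.encoding = 'utf-8'  # Assumes the syllables list is UTF-8; avoids header-based guessing
            with open(SYLLABLES_PATH, 'w', encoding='utf-8') as f:
                f.write(response.text)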
            print(f"Downloaded Vietnamese syllables to {SYLLABLES_PATH}")
        except Exception as e:
            print(f"Error downloading Vietnamese syllables: {e}")
            return False
    
    return True

def check_data_exists():
    """Check if required data exists."""
    os.makedirs(DATA_DIR, exist_ok=True)
    print(f"Data directory configured at: {DATA_DIR}")
    all_data_exists = True
    if not os.path.exists(TRAIN_EXTRACT_PATH) or not os.listdir(TRAIN_EXTRACT_PATH):
        print(f"Warning: Training data directory {TRAIN_EXTRACT_PATH} is missing or empty.")
        print("Please ensure you have downloaded and extracted 'Train_Full.zip' into the 'data/Train_Full' directory.")
        all_data_exists = False
    else:
        print(f"Training data found at: {TRAIN_EXTRACT_PATH}")

    if not os.path.exists(SYLLABLES_PATH):
        print(f"Warning: Vietnamese syllables file {SYLLABLES_PATH} is missing.")
        print("Please ensure you have 'vn_syllables.txt' in the 'data' directory.")
        all_data_exists = False
    else:
        print(f"Vietnamese syllables file found at: {SYLLABLES_PATH}")
    return all_data_exists

def load_corpus(data_extract_path: str = TRAIN_EXTRACT_PATH) -> list[list[str]]:
    """Load corpus from extracted training data."""
    if not os.path.exists(data_extract_path):
        print(f"Error: Training data path not found: {data_extract_path}")
        print("Please run download_and_prepare_data() first or ensure data is correctly placed.")
        return []

    full_text_content = []
    print(f"Loading corpus from: {data_extract_path}")

    for dirname, _, filenames in os.walk(data_extract_path):
        for filename in tqdm(filenames, desc=f"Reading files in {os.path.basename(dirname)}"):
            if filename.endswith(".txt"): 
                try:
                    with open(os.path.join(dirname, filename), 'r', encoding='UTF-16') as f:
                        full_text_content.append(f.read())
                except Exception as e:
                    # Try with different encodings if UTF-16 fails
                    try:
                        with open(os.path.join(dirname, filename), 'r', encoding='utf-8') as f:
                            full_text_content.append(f.read())
                    except Exception as e2:
                        print(f"Could not read file {os.path.join(dirname, filename)}: {e2}")
    
    if not full_text_content:
        print("No text files found or loaded from the training data path.")
        return []

    print(f"Loaded {len(full_text_content)} documents.")
    full_data_string = ". ".join(full_text_content)
    full_data_string = full_data_string.replace("\n", ". ")
    
    corpus = []
    raw_sents = re.split(r'[.?!]\s+', full_data_string) 
    print(f"Processing {len(raw_sents)} raw sentences...")
    for sent in tqdm(raw_sents, desc="Tokenizing sentences"):
        if sent.strip(): # Ensure sentence is not just whitespace
            corpus.append(tokenize(sent))
    
    print(f"Corpus created with {len(corpus)} tokenized sentences.")
    return corpus

def _process_single_sentence_for_splitting(sent_accented: str):
    """Process a single sentence for parallel corpus splitting."""
    if sent_accented.strip():
        # Tokenize once and derive the unaccented input from the same tokens
        tokenized_accented_sentence = tokenize(sent_accented)
        unaccented_words = [remove_vn_accent(word) for word in tokenized_accented_sentence]
        unaccented_sentence_str = " ".join(unaccented_words)
        if unaccented_sentence_str and tokenized_accented_sentence:
            return (unaccented_sentence_str, tokenized_accented_sentence)
    return None

def load_and_split_corpus(data_extract_path: str = TRAIN_EXTRACT_PATH, test_size: float = 0.2, random_seed: int = 42):
    """Load and split corpus into training and testing sets."""
    if not os.path.exists(data_extract_path):
        print(f"Error: Training data path not found: {data_extract_path}")
        return [], []

    full_text_content = []
    print(f"Loading corpus from: {data_extract_path}")
    for dirname, _, filenames in os.walk(data_extract_path):
        for filename in tqdm(filenames, desc=f"Reading files in {os.path.basename(dirname)}"):
            if filename.endswith(".txt"):
                try:
                    with open(os.path.join(dirname, filename), 'r', encoding='UTF-16') as f:
                        full_text_content.append(f.read())
                except Exception as e:
                    # Try with different encodings if UTF-16 fails
                    try:
                        with open(os.path.join(dirname, filename), 'r', encoding='utf-8') as f:
                            full_text_content.append(f.read())
                    except Exception as e2:
                        print(f"Could not read file {os.path.join(dirname, filename)}: {e2}")
    
    if not full_text_content:
        print("No text files found or loaded from the training data path.")
        return [], []

    print(f"Loaded {len(full_text_content)} documents.")
    full_data_string = ". ".join(full_text_content)
    full_data_string = full_data_string.replace("\n", ". ")
    
    raw_sentences_with_accent = re.split(r'[.?!]\s+', full_data_string)

    # Limit number of sentences to avoid memory errors
    MAX_PROCESS_SENTENCES = 1000
    if len(raw_sentences_with_accent) > MAX_PROCESS_SENTENCES:
        print(f"WARNING: Limiting processing to {MAX_PROCESS_SENTENCES} sentences out of {len(raw_sentences_with_accent)} to save memory.")
        raw_sentences_with_accent = raw_sentences_with_accent[:MAX_PROCESS_SENTENCES]
    
    print(f"Processing {len(raw_sentences_with_accent)} raw sentences for splitting...")
    
    # Process sentences in parallel if possible
    processed_sentences = []
    if mp.cpu_count() > 1:
        with mp.Pool() as pool:
            num_cores = os.cpu_count() or 1
            chunk_size = max(1, len(raw_sentences_with_accent) // (num_cores * 4))
            
            results_iterator = pool.imap(_process_single_sentence_for_splitting, 
                                        tqdm(raw_sentences_with_accent, desc="Generating unaccented and tokenizing (parallel)"),
                                        chunksize=chunk_size)
            
            # Filter out None results
            processed_sentences = [res for res in results_iterator if res is not None]
    else:
        # Sequential processing if multiprocessing is not available
        for sent in tqdm(raw_sentences_with_accent, desc="Generating unaccented and tokenizing"):
            result = _process_single_sentence_for_splitting(sent)
            if result is not None:
                processed_sentences.append(result)

    if not processed_sentences:
        print("No sentences could be processed. Check data and tokenization.")
        return [], []
        
    random.seed(random_seed)
    random.shuffle(processed_sentences)
    
    split_index = int(len(processed_sentences) * (1 - test_size))
    train_data_pairs = processed_sentences[:split_index]
    test_data_pairs = processed_sentences[split_index:]
    
    train_corpus = [pair[1] for pair in train_data_pairs] # Only take tokenized accented sentences for training
    test_set = test_data_pairs # Keep full pairs (unaccented_string, tokenized_accented_sentence) for testing
    
    print(f"Data split: {len(train_corpus)} training sentences, {len(test_set)} test sentences.")
    return train_corpus, test_set

# Check if data exists and download if necessary
print("Checking if required data exists...")
if not check_data_exists():
    print("Required data not found. Attempting to download...")
    download_and_prepare_data()
    if not check_data_exists():
        print("Warning: Data preparation failed. You may need to create sample data for testing.")
        # Create minimal sample data for testing in Google Colab
        os.makedirs(TRAIN_EXTRACT_PATH, exist_ok=True)
        sample_text = "Đây là một câu tiếng Việt có dấu để kiểm tra. Xin chào Việt Nam. Tiếng Việt rất hay."
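        # Write as UTF-16 (with BOM) to match the encoding the loader tries first;
        # UTF-8 bytes can be mis-decoded without an error by the UTF-16 reader.
        with open(os.path.join(TRAIN_EXTRACT_PATH, "sample.txt"), "w", encoding="utf-16") as f: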
            f.write(sample_text)
        print("Created minimal sample data for testing.")
else:
    print("Required data exists.")

# Load a small corpus for demonstration
print("\nLoading a small corpus for demonstration...")
train_corpus, test_set = load_and_split_corpus(test_size=0.2, random_seed=42)
print(f"Loaded {len(train_corpus)} training sentences and {len(test_set)} test pairs.")

# Show sample data
if train_corpus:
    print("\nSample training sentences (tokenized):\n", train_corpus[:2])
if test_set:
    print("\nSample test pairs (unaccented, tokenized_accented):\n", test_set[:2])


## 4. Model Trainer Module

This section implements the model training functionality from `model_trainer.py`. It includes functions for training n-gram models and saving trained models.
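
As a worked illustration of the counting and smoothing involved, here is a minimal sketch on a two-sentence toy corpus. The helper names `count_ngrams` and `add_k_logprob` are hypothetical, the corpus is made up, and the smoothing constant `k=0.1` is an illustrative default.

```python
import math
from collections import Counter


def count_ngrams(corpus, n):
    """Count n-grams and their (n-1)-token contexts with <s>/</s> padding."""
    ngrams, contexts = Counter(), Counter()
    for tokens in corpus:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def add_k_logprob(word, context, ngrams, contexts, vocab_size, k=0.1):
    """log P(word | context) under add-k smoothing."""
    numerator = ngrams[context + (word,)] + k
    denominator = contexts[context] + k * vocab_size
    return math.log(numerator / denominator)


corpus = [["xin", "chào"], ["xin", "lỗi"]]
vocab = {tok for sent in corpus for tok in sent}
ngrams, contexts = count_ngrams(corpus, n=2)

# "xin chào" was seen once out of two "xin" contexts: (1 + 0.1) / (2 + 0.1 * 3)
seen = add_k_logprob("chào", ("xin",), ngrams, contexts, len(vocab))
# "xin xin" was never seen: (0 + 0.1) / (2 + 0.1 * 3)
unseen = add_k_logprob("xin", ("xin",), ngrams, contexts, len(vocab))
print(math.exp(seen), math.exp(unseen))
```

Unseen continuations receive a small but non-zero probability, so a single unknown n-gram does not force a hypothesis score to negative infinity.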

In [None]:
# Model trainer functions (from model_trainer.py)

def train_ngram_model(corpus: list[list[str]], n: int = N_GRAM_ORDER):
    """Train an n-gram language model."""
    if not corpus:
        print("Corpus is empty. Cannot train model.")
        return None

    print(f"Preparing data for custom {n}-gram model...")

    start_symbol = "<s>"
    end_symbol = "</s>"

    ngram_counts = defaultdict(int)
    context_counts = defaultdict(int) # For (n-1)-grams
    vocab = set()

    for sentence_tokens in corpus:
        # Pad sentence: (n-1) start symbols, 1 end symbol.
        # e.g., for n=3 (trigram): ['<s>', '<s>'] + sentence_tokens + ['</s>']
        # e.g., for n=1 (unigram): [] + sentence_tokens + ['</s>']
        current_padded_sentence = ([start_symbol] * (n - 1)) + sentence_tokens + [end_symbol]

        for token in sentence_tokens: # Original sentence tokens for vocab
            vocab.add(token)

        # Generate n-grams and (n-1)-gram contexts
        # Iterate up to the point where the last n-gram can be formed
        for i in range(len(current_padded_sentence) - n + 1):
            ngram = tuple(current_padded_sentence[i : i + n])
            ngram_counts[ngram] += 1

            if n > 1:
                # The context is the first (n-1) tokens of the n-gram
                context = tuple(current_padded_sentence[i : i + n - 1])
                context_counts[context] += 1
            # For n=1 (unigrams), context_counts will be handled later if needed for P(w) = count(w)/TotalWords

    # For unigram model (n=1), if we define P(w) = count(w) / total_tokens,
    # context_counts can store the total number of tokens.
    if n == 1:
        total_word_occurrences = sum(ngram_counts.values()) # Sum of counts of all unigrams
        if total_word_occurrences > 0:
            context_counts[()] = total_word_occurrences # Global context for unigrams

    print(f"Custom model training complete. Vocabulary size: {len(vocab)}")
    print(f"Number of unique {n}-grams: {len(ngram_counts)}")
    if n > 1 or (n == 1 and context_counts):
        print(f"Number of unique contexts: {len(context_counts)}")

    # CustomNGramModel is defined at module level (below) so that instances can be pickled;
    # pickle cannot serialize instances of a class defined inside a function.
    return CustomNGramModel(n, vocab, dict(ngram_counts), dict(context_counts))


import math


class CustomNGramModel:
    """Count-based n-gram model with add-k smoothing, exposing logscore() for the predictor."""

    def __init__(self, n, vocab, ngram_counts, context_counts):
        self.order = n  # 'order' attribute is required for compatibility with predictor.py
        self.n = n
        self.vocab = set(vocab)
        self.ngram_counts = ngram_counts
        self.context_counts = context_counts

    def logscore(self, word, context_tuple=()):
        """Calculate log probability of word given context."""
        context_tuple = tuple(context_tuple)
        if self.n == 1:
            # Unigram model: all counts are stored under the empty context
            context_tuple = ()
        elif len(context_tuple) < self.n - 1:
            # Left-pad short (or empty) contexts with start symbols
            context_tuple = ("<s>",) * ((self.n - 1) - len(context_tuple)) + context_tuple
        else:
            # Keep only the last (n-1) tokens of a longer context
            context_tuple = context_tuple[-(self.n - 1):]

        ngram = context_tuple + (word,)
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context_tuple, 0)

        # Add-k smoothing with k = 0.1 (not add-1). This is a simple scheme and
        # could be replaced by a stronger one for better results.
        smooth_value = 0.1
        denominator = context_count + smooth_value * len(self.vocab)
        if denominator <= 0:
            return -float('inf')  # Empty vocabulary and unseen context
        prob = (ngram_count + smooth_value) / denominator
        return math.log(prob)

def save_model(model, model_dir: str = MODEL_DIR, filename: str = DEFAULT_MODEL_FILENAME):
    """Save trained model to file."""
    if model is None:
        print("Model is None. Nothing to save.")
        return False

    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, filename)
    print(f"Saving model to {model_path}...")
    try:
        with open(model_path, 'wb') as fout:
            pickle.dump(model, fout)
        print(f"Model successfully saved to {model_path}")
        return True
    except Exception as e:
        print(f"Error saving model to {model_path}: {e}")
        return False

def load_model(model_path: str):
    """Load trained model from file."""
    if not os.path.exists(model_path):
        print(f"ERROR: Model file not found at {model_path}")
        print(f"Please train and save the model first or provide a valid path.")
        return None
    
    print(f"Attempting to load model from {model_path}...")
    try:
        with open(model_path, 'rb') as fin:
            model_loaded = pickle.load(fin)
        # Check if the loaded model has the necessary attributes
        if hasattr(model_loaded, 'vocab') and hasattr(model_loaded, 'order'):
            print(f"Model loaded successfully from {model_path}.")
            if hasattr(model_loaded, 'vocab') and isinstance(model_loaded.vocab, set):
                print(f"Vocabulary size: {len(model_loaded.vocab)}")
        else:
            print(f"Model loaded from {model_path}, but it does not have the expected attributes.")
        return model_loaded
    except Exception as e:
        print(f"CRITICAL ERROR: An error occurred while loading the model from {model_path}. Details: {e}")
    return None


## 5. Predictor Module

This section implements the accent prediction functionality from `predictor.py`. It includes the beam search algorithm for finding the most likely accented versions of Vietnamese text without accents.
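
The core idea of beam search can be sketched on a two-word toy input. Everything in this example is an assumption made for illustration: the helper names, the candidate lists, and the bigram probabilities are invented and are not taken from a trained model.

```python
import math


def beam_search(candidates, score, k=2):
    """Keep the k best partial sequences after each position."""
    beams = [((), 0.0)]  # (sequence so far, cumulative log score)
    for options in candidates:
        expanded = [
            (seq + (word,), total + score(seq[-1] if seq else "<s>", word))
            for seq, total in beams
            for word in options
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:k]  # prune to the beam width
    return beams


# Made-up bigram probabilities; unlisted pairs fall back to a small default.
BIGRAM_P = {
    ("<s>", "tiếng"): 0.6,
    ("<s>", "tiêng"): 0.1,
    ("tiếng", "việt"): 0.7,
    ("tiếng", "viết"): 0.2,
}


def toy_score(prev_word, word):
    return math.log(BIGRAM_P.get((prev_word, word), 0.01))


best = beam_search([["tiếng", "tiêng"], ["việt", "viết"]], toy_score, k=2)
for seq, log_score in best:
    print(" ".join(seq), round(log_score, 4))
```

With a beam width of `k`, only `k` hypotheses are expanded at each position, so the cost grows linearly with sentence length rather than with the product of all candidate counts.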

In [None]:
# Predictor functions (from predictor.py)

_detokenizer = TreebankWordDetokenizer()

def beam_search_predict_accents(text_no_accents: str, model, k: int = 3, 
                              syllables_file: str = SYLLABLES_PATH, 
                              detokenizer=_detokenizer) -> list[tuple[str, float]]:
    """Predict Vietnamese accents using beam search algorithm."""
    words = text_no_accents.lower().split()
    if not words:
        return []

    # Each beam entry is (word_sequence, cumulative_log_score); start from one empty hypothesis
    # so that the first word is also scored by the model instead of receiving a fixed 0.0.
    sequences = [([], 0.0)]

    for word_no_accent in words:
        possible_accented_words = gen_accents_word(word_no_accent, syllables_path=syllables_file)
        if not possible_accented_words:
            possible_accented_words = {word_no_accent}

        all_new_sequences = []
        for seq_words, seq_score in sequences:
            # Use the last (order - 1) words as context; empty for a unigram model or the first word
            context = seq_words[-(model.order - 1):] if model.order > 1 else []
            for next_accented_word in possible_accented_words:
                try:
                    score_addition = model.logscore(next_accented_word, tuple(context))
                except Exception:
                    # Treat a scoring failure as an impossible continuation
                    score_addition = -float('inf')
                all_new_sequences.append((seq_words + [next_accented_word], seq_score + score_addition))

        # Keep only the k best hypotheses
        all_new_sequences.sort(key=lambda x: x[1], reverse=True)
        sequences = all_new_sequences[:k]

    results = [(detokenizer.detokenize(seq_words), score) for seq_words, score in sequences]
    return results

# Test prediction with a simple test model if no model is available yet
class TestModel:
    def __init__(self):
        self.order = 2
        self.vocab = {"test", "tiếng", "việt", "rất", "hay"}
    
    def logscore(self, word, context_tuple=()):
        import random
        return random.random() * -10  # Random negative score between 0 and -10

# Try prediction with test data
print("Testing accent prediction with a simple test model:")
test_model = TestModel()
test_input = "tieng viet rat hay"
print(f"Input: '{test_input}'")
predictions = beam_search_predict_accents(test_input, test_model, k=3)
if predictions:
    print("Top predictions:")
    for i, (sent, score) in enumerate(predictions):
        print(f"{i+1}. '{sent}' (Score: {score:.4f})")
else:
    print("No predictions returned.")


## 6. Evaluation Module

This section implements model evaluation functionality from `evaluate_model.py`. It includes metrics for evaluating the quality of accent predictions compared to ground truth.
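
As a worked example of the metrics, the sketch below computes word accuracy and character error rate for a single made-up sentence pair. It uses a small self-contained Levenshtein implementation (a hypothetical `levenshtein` helper) so that it runs without NLTK. The sentences are illustrative and are not model output.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def nfc(text: str) -> str:
    # Normalize so that each accented letter counts as one character.
    return unicodedata.normalize("NFC", text)


true_sent = nfc("tiếng việt rất hay")
pred_sent = nfc("tiếng viết rất hay")  # one wrong accent: "viết" instead of "việt"

true_words, pred_words = true_sent.split(), pred_sent.split()
correct = sum(t == p for t, p in zip(true_words, pred_words))
word_accuracy = 100 * correct / len(true_words)

cer = 100 * levenshtein(true_sent, pred_sent) / len(true_sent)

print(f"Word accuracy: {word_accuracy:.2f}%")
print(f"CER: {cer:.2f}%")
```

A single wrong accent costs a whole word in word accuracy but only one character in CER, which is why the two metrics can differ substantially on the same predictions.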

In [None]:
# Evaluation functions (from evaluate_model.py)

import matplotlib.pyplot as plt
from nltk.metrics.distance import edit_distance

def calculate_sentence_accuracy(results: list[dict]) -> float:
    """Calculate percentage of exactly correct sentences."""
    if not results:
        return 0.0
    correct_sentences = 0
    for res in results:
        if res["true_accented"].strip() == res["predicted_accented"].strip():
            correct_sentences += 1
    return (correct_sentences / len(results)) * 100

def calculate_word_accuracy(results: list[dict]) -> float:
    """Calculate percentage of correctly predicted words."""
    if not results:
        return 0.0
    total_words = 0
    correct_words = 0
    for res in results:
        true_words = res["true_accented"].strip().split()
        predicted_words = res["predicted_accented"].strip().split()
        len_min = min(len(true_words), len(predicted_words))
        for i in range(len_min):
            if true_words[i] == predicted_words[i]:
                correct_words += 1
        total_words += len(true_words)
    if total_words == 0: return 0.0
    return (correct_words / total_words) * 100

def calculate_cer(results: list[dict]) -> float:
    """Calculate Character Error Rate (CER)."""
    if not results:
        return 0.0
    total_edit_distance = 0
    total_true_chars = 0
    for res in results:
        true_str = res["true_accented"].strip()
        pred_str = res["predicted_accented"].strip()
        if not true_str and not pred_str:
            dist = 0
        elif not true_str:
            dist = len(pred_str)
        elif not pred_str:
            dist = len(true_str)
        else:
            dist = edit_distance(true_str, pred_str)
        total_edit_distance += dist
        total_true_chars += len(true_str)
    if total_true_chars == 0:
        # Keep the return value on the same percentage scale as the normal case
        return 100.0 if total_edit_distance > 0 else 0.0
    return (total_edit_distance / total_true_chars) * 100

def display_sample_results(results: list[dict], num_samples: int = 5):
    """Display sample prediction results."""
    print("\n--- Sample Predictions ---")
    for i, res in enumerate(results[:num_samples]):
        print(f"Sample {i+1}:")
        print(f"  Input:     '{res['input_unaccented']}'")
        print(f"  True:      '{res['true_accented']}'")
        print(f"  Predicted: '{res['predicted_accented']}'")
        print("---")

def plot_metrics(metrics: dict):
    """Plot evaluation metrics."""
    names = list(metrics.keys())
    values = list(metrics.values())

    plt.figure(figsize=(10, 5))
    plt.bar(names, values)
    plt.ylabel('Percentage (%)')
    plt.title('Model Evaluation Metrics')

    # Add values on top of bars
    for i, value in enumerate(values):
        plt.text(i, value + 1, f"{value:.2f}%", ha='center')

    plt.show()

def evaluate_model(model, test_set, k=3):
    """Evaluate a model on test data."""
    if not model or not test_set:
        print("Model or test set is empty. Cannot evaluate.")
        return []
    
    print(f"Evaluating model on {len(test_set)} test sentences...")
    evaluation_results = []
    
    for i, (unaccented_input_str, tokenized_true_accented_sent) in enumerate(tqdm(test_set, desc="Evaluating")):
        true_accented_str = _detokenizer.detokenize(tokenized_true_accented_sent)
        
        # Get top prediction
        predictions = beam_search_predict_accents(
            text_no_accents=unaccented_input_str,
            model=model,
            k=k,
            syllables_file=SYLLABLES_PATH
        )
        
        predicted_accented_str = predictions[0][0] if predictions else ""
        
        evaluation_results.append({
            "input_unaccented": unaccented_input_str,
            "true_accented": true_accented_str,
            "predicted_accented": predicted_accented_str
        })
    
    print(f"Evaluation complete. Processed {len(evaluation_results)} test items.")
    return evaluation_results


## 7. Train Model

This section demonstrates how to train a new Vietnamese accent prediction model using the loaded training corpus.

In [None]:
# Training a new model

# Create a smaller sample corpus for quick demonstration if the corpus is large
SAMPLE_SIZE_LIMIT = 5000  # Limit sample size to avoid long training times in this notebook
if train_corpus and len(train_corpus) > SAMPLE_SIZE_LIMIT:
    print(f"Using a sample of {SAMPLE_SIZE_LIMIT} sentences from the {len(train_corpus)} available for faster training.")
    random.seed(42)  # For reproducibility
    train_sample = random.sample(train_corpus, SAMPLE_SIZE_LIMIT)
else:
    train_sample = train_corpus

print(f"Training with {len(train_sample)} sentences...")

# Train the model
trained_model = train_ngram_model(train_sample, n=N_GRAM_ORDER)

if trained_model:
    print("Model training successful.")
    
    # Save the model
    model_saved = save_model(trained_model, model_dir=MODEL_DIR, filename=DEFAULT_MODEL_FILENAME)
    
    if model_saved:
        print(f"Model successfully saved to {os.path.join(MODEL_DIR, DEFAULT_MODEL_FILENAME)}")
    else:
        print("Failed to save model.")
else:
    print("Model training failed.")


## 8. Evaluate Model

This section demonstrates how to evaluate the trained Vietnamese accent prediction model on the test data.

In [None]:
# Evaluate the trained model

# Load the model if not already in memory
model_path = os.path.join(MODEL_DIR, DEFAULT_MODEL_FILENAME)
if not trained_model and os.path.exists(model_path):
    print(f"Loading model from {model_path}...")
    trained_model = load_model(model_path)

if trained_model and test_set:
    # Limit test set size for quicker evaluation
    MAX_TEST_SAMPLES = 100
    if len(test_set) > MAX_TEST_SAMPLES:
        print(f"Using {MAX_TEST_SAMPLES} random samples from test set for quick evaluation.")
        random.seed(42)  # For reproducibility
        test_sample = random.sample(test_set, MAX_TEST_SAMPLES)
    else:
        test_sample = test_set
    
    print(f"Evaluating model on {len(test_sample)} test samples...")
    
    # Run evaluation
    evaluation_results = evaluate_model(trained_model, test_sample, k=3)
    
    if evaluation_results:
        # Calculate metrics
        sent_accuracy = calculate_sentence_accuracy(evaluation_results)
        word_acc = calculate_word_accuracy(evaluation_results)
        char_error_rate = calculate_cer(evaluation_results)
        # CER can exceed 100% when predictions are longer than the references, so clamp at 0
        char_accuracy = max(0.0, 100.0 - char_error_rate)
        
        # Display metrics
        print("\n--- Evaluation Metrics ---")
        print(f"Sentence Accuracy: {sent_accuracy:.2f}%")
        print(f"Word Accuracy: {word_acc:.2f}%")
        print(f"Character Error Rate (CER): {char_error_rate:.2f}%")
        print(f"Character Accuracy: {char_accuracy:.2f}%")
        
        # Plot metrics
        metrics_to_plot = {
            "Sentence Accuracy": sent_accuracy,
            "Word Accuracy": word_acc,
            "Character Accuracy": char_accuracy
        }
        plot_metrics(metrics_to_plot)
        
        # Show sample results
        display_sample_results(evaluation_results, num_samples=5)
    else:
        print("No evaluation results available.")
else:
    if not trained_model:
        print("No trained model available for evaluation.")
    if not test_set:
        print("No test data available for evaluation.")


## 9. Interactive Accent Prediction

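This section provides an interactive `ipywidgets` interface for predicting accents in Vietnamese text. Enter text without accents, choose a beam width `k` between 1 and 10, and click **Predict Accents** to see the ranked predictions with their scores.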
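
To give a feel for what the beam width controls, the toy sketch below runs a beam search over hand-written candidates and bigram scores. It is not the notebook's `beam_search_predict_accents`. The tables `CANDIDATES` and `BIGRAM_LOGP`, the `FLOOR` value, and the function `toy_beam_search` are all made up for illustration, and a bigram score stands in for the trigram model.

```python
import math

# Hypothetical candidate accented forms for each unaccented syllable
CANDIDATES = {
    "toi": ["tôi", "tối", "tội"],
    "la": ["là", "lá", "la"],
}

# Hypothetical bigram log-probabilities; unseen pairs fall back to FLOOR
BIGRAM_LOGP = {
    ("<s>", "tôi"): math.log(0.6),
    ("<s>", "tối"): math.log(0.3),
    ("tôi", "là"): math.log(0.7),
    ("tối", "la"): math.log(0.1),
}
FLOOR = math.log(1e-4)


def toy_beam_search(text, k=3):
    """Return up to k (sentence, log-score) pairs, best first."""
    beams = [(["<s>"], 0.0)]
    for syllable in text.split():
        # Syllables without known candidates are passed through unchanged
        options = CANDIDATES.get(syllable, [syllable])
        expanded = [
            (words + [opt], score + BIGRAM_LOGP.get((words[-1], opt), FLOOR))
            for words, score in beams
            for opt in options
        ]
        # Keep only the k highest-scoring partial sentences
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:k]
    return [(" ".join(words[1:]), score) for words, score in beams]


for sentence, score in toy_beam_search("toi la", k=3):
    print(f"'{sentence}' (Score: {score:.4f})")
```

With `k=1` the search is greedy and keeps a single hypothesis per step. Larger values keep more alternatives alive, at the cost of scoring more candidates per syllable.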


In [None]:
# Interactive prediction
from IPython.display import HTML, display
import ipywidgets as widgets

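# Load the model from disk if it is not already in memory.
# globals().get avoids a NameError when Section 7 was not run in this session.
trained_model = globals().get("trained_model")
model_path = os.path.join(MODEL_DIR, DEFAULT_MODEL_FILENAME)
if not trained_model and os.path.exists(model_path):
    print(f"Loading model from {model_path}...")
    trained_model = load_model(model_path)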

if not trained_model:
    print("No model available for prediction. Please train a model first.")
else:
    # Create widgets for interactive prediction
    input_text = widgets.Textarea(
        value='',
        placeholder='Nhập câu tiếng Việt không dấu ở đây...',
        description='Input:',
        disabled=False,
        layout={'width': '100%', 'height': '80px'}
    )

    beam_width = widgets.IntSlider(
        value=3,
        min=1,
        max=10,
        step=1,
        description='Beam Width (k):',
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='d'
    )

    output_area = widgets.Output()

    def on_predict_button_clicked(b):
        with output_area:
            output_area.clear_output()
            if not input_text.value.strip():
                print("Please enter some text.")
                return
            
            text_no_accents = input_text.value.strip()
            print(f"Input: '{text_no_accents}'")
            
            predictions = beam_search_predict_accents(
                text_no_accents=text_no_accents,
                model=trained_model,
                k=beam_width.value,
                syllables_file=SYLLABLES_PATH
            )
            
            if predictions:
                print("\nPredictions:")
                for rank, (sent, score) in enumerate(predictions, start=1):
                    print(f"{rank}. '{sent}' (Score: {score:.4f})")
            else:
                print("No predictions returned.")

    predict_button = widgets.Button(
        description='Predict Accents',
        disabled=False,
        button_style='primary',
        tooltip='Click to predict accents',
        icon='check'
    )
    predict_button.on_click(on_predict_button_clicked)

    # Display the interactive elements
    display(widgets.HTML('<h3>Vietnamese Accent Prediction</h3>'))
    display(input_text)
    display(beam_width)
    display(predict_button)
    display(output_area)

    # Add some example sentences that users can try
    display(widgets.HTML('<h4>Example Sentences to Try:</h4>'))
    examples = [
        "toi la nguoi viet nam",
        "chuc mung nam moi",
        "hom nay troi dep qua",
        "tieng viet rat hay",
        "ban dang o dau"
    ]
    for example in examples:
        display(widgets.HTML(f"<code>{example}</code>"))


## 10. Conclusion and Next Steps

This notebook has implemented a complete Vietnamese accent prediction model that can:

1. Process Vietnamese text and remove/add accents
2. Load and prepare corpus data
3. Train n-gram language models
4. Predict accents using beam search
5. Evaluate model performance with multiple metrics
6. Provide an interactive interface for predictions

### Possible Improvements

- **Data Augmentation**: Expand training data with more diverse Vietnamese text
- **Advanced Models**: Implement neural models (LSTM, Transformer) for better prediction
- **Smoothing Techniques**: Compare alternative smoothing methods for the n-gram model, such as the Laplace and Witten-Bell estimators that the setup cell imports alongside Kneser-Ney
- **User Interface**: Build a web application or mobile app for easier access
- **Contextual Analysis**: Consider broader context beyond n-grams for better disambiguation

### References

- [NLTK Documentation](https://www.nltk.org/)
- [Vietnamese Language Resources](https://github.com/undertheseanlp/resources)
- [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf)