# Vietnamese Accent Model - Colab Notebook

This notebook contains the entire Vietnamese Accent Model project code, adapted to run on Google Colab. It includes all functionality from the original project files:

- Data loading and preprocessing
- Model training
- Accent prediction
- Model evaluation

Original source files:
- `utils/model_trainer.py`
- `utils/data_loader.py`
- `utils/utils.py`
- `utils/predictor.py`
- `main.py`
- `train_model.py`
- `evaluate_model.py`

## 0. Setup Environment

First, let's install the necessary dependencies:

In [2]:
# Install required packages
!pip install nltk==3.5 scikit-learn==1.2.2 dill~=0.3.7 pandas~=2.0.3 matplotlib~=3.7.5 seaborn~=0.13.2 tqdm~=4.66.0 requests~=2.31.0

# Download NLTK punkt tokenizer
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tontide1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Directory Structure Setup

Next, let's set up the directory structure for our project:

In [2]:
import os

# Create project directories
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "data")
MODEL_DIR = os.path.join(BASE_DIR, "models")
PLOTS_DIR = os.path.join(BASE_DIR, "plots")

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(PLOTS_DIR, exist_ok=True)

print("Project directories created:")
print(f"- Data dir: {DATA_DIR}")
print(f"- Model dir: {MODEL_DIR}")
print(f"- Plots dir: {PLOTS_DIR}")

Project directories created:
- Data dir: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data
- Model dir: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\models
- Plots dir: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\plots


## 2. Define Utility Functions (utils.py)

In [3]:
# Porting utils.py
import re
import string
from nltk import word_tokenize

VN_SYLLABLES_FILE_PATH = os.path.join(DATA_DIR, "vn_syllables.txt")

def tokenize(doc: str) -> list[str]:
    tokens = word_tokenize(doc.lower())
    # Allow underscore, remove other punctuation
    table = str.maketrans('', '', string.punctuation.replace("_", ""))
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word]  # Remove empty strings
    return tokens

def remove_vn_accent(word: str) -> str:
    word = word.lower()
    word = re.sub(r'[áàảãạăắằẳẵặâấầẩẫậ]', 'a', word)
    word = re.sub(r'[éèẻẽẹêếềểễệ]', 'e', word)
    word = re.sub(r'[óòỏõọôốồổỗộơớờởỡợ]', 'o', word)
    word = re.sub(r'[íìỉĩị]', 'i', word)
    word = re.sub(r'[úùủũụưứừửữự]', 'u', word)
    word = re.sub(r'[ýỳỷỹỵ]', 'y', word)
    word = re.sub(r'đ', 'd', word)
    return word

def gen_accents_word(word: str, syllables_path: str = VN_SYLLABLES_FILE_PATH) -> set[str]:
    """
    Generate the candidate accented forms of a word using the vn_syllables.txt file.
    """
    normalized_input_word = word.lower()
    word_no_accent = remove_vn_accent(normalized_input_word)
    all_accent_word = {normalized_input_word}  # Start with the input word (normalized)

    if not os.path.exists(syllables_path):
        print(f"Warning: Syllables file not found at {syllables_path}. "
              f"Accent generation will be limited to the input word: '{word}'.")
        if word_no_accent != normalized_input_word:
            all_accent_word.add(word_no_accent)
        return all_accent_word

    try:
        with open(syllables_path, 'r', encoding='utf-8') as f:
            for w_line in f.read().splitlines():
                w_line_lower = w_line.lower()  # Normalize file content
                if remove_vn_accent(w_line_lower) == word_no_accent:
                    all_accent_word.add(w_line_lower)
    except Exception as e:
        print(f"Error reading or processing syllables file {syllables_path}: {e}")
    
    return all_accent_word

# Test the utility functions
print("Tokenize example:", tokenize("Đây_là một câu, ví dụ."))
print("Remove accent example:", remove_vn_accent("hoàng"))

Tokenize example: ['đây_là', 'một', 'câu', 'ví', 'dụ']
Remove accent example: hoang
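
`gen_accents_word` re-reads the syllables file on every call. A minimal sketch of one alternative follows, assuming the syllable list fits in memory: build a lookup from each unaccented key to its accented forms once, then query it per word. The names `strip_accents_nfd`, `build_accent_index`, and `sample_syllables` are hypothetical and not part of the original project, and the accent stripping here uses Unicode decomposition rather than the regex approach above.

```python
import unicodedata
from collections import defaultdict

def strip_accents_nfd(word: str) -> str:
    """Lowercase a word and drop Vietnamese diacritics via Unicode decomposition."""
    # 'đ' has no canonical decomposition, so map it by hand first.
    word = word.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_accent_index(syllables):
    """Map each unaccented key to the set of accented syllables that share it."""
    index = defaultdict(set)
    for syllable in syllables:
        syllable = syllable.strip().lower()
        if syllable:
            index[strip_accents_nfd(syllable)].add(syllable)
    return index

# Hypothetical in-memory sample standing in for the lines of vn_syllables.txt.
sample_syllables = ["hoang", "hoàng", "hoảng", "đường", "dương", "duong"]
accent_index = build_accent_index(sample_syllables)

print(sorted(accent_index["hoang"]))
print(sorted(accent_index["duong"]))
```

With an index like this, each lookup is a single dictionary access instead of a full pass over the file, which may matter when beam search requests candidates for every word of every sentence.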


## 3. Define Data Loader (data_loader.py)

In [4]:
# Porting data_loader.py
import requests
import zipfile
from tqdm import tqdm
import random
import multiprocessing as mp

# Data paths configuration
TRAIN_ZIP_URL = "https://github.com/hoanganhpham1006/Vietnamese_Language_Model/raw/master/Train_Full.zip"
TRAIN_ZIP_NAME = "Train_Full.zip"
TRAIN_ZIP_PATH = os.path.join(DATA_DIR, TRAIN_ZIP_NAME)
TRAIN_EXTRACT_PATH = os.path.join(DATA_DIR, "Train_Full")

SYLLABLES_URL = "https://gist.githubusercontent.com/hieuthi/0f5adb7d3f79e7fb67e0e499004bf558/raw/135a4d9716e49a981624474156d6f247b9b46f6a/all-vietnamese-syllables.txt"
SYLLABLES_NAME = "vn_syllables.txt"
SYLLABLES_PATH = os.path.join(DATA_DIR, SYLLABLES_NAME)

def download_file(url, local_path, desc="Downloading file"):
    """
    Download a file from the given URL to the specified local path.
    """
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total_size = int(r.headers.get('content-length', 0))
        chunk_size = 8192
        
        with open(local_path, 'wb') as f, tqdm(
            desc=desc,
            total=total_size,
            unit='B', unit_scale=True, unit_divisor=1024,
        ) as bar:
            for chunk in r.iter_content(chunk_size=chunk_size):
                size = f.write(chunk)
                bar.update(size)
    return local_path

def download_and_prepare_data():
    """
    Download and extract the training data and syllables file.
    """
    os.makedirs(DATA_DIR, exist_ok=True)
    
    # Download syllables file
    if not os.path.exists(SYLLABLES_PATH):
        print(f"Downloading Vietnamese syllables file to {SYLLABLES_PATH}...")
        download_file(SYLLABLES_URL, SYLLABLES_PATH, desc="Downloading syllables file")
    else:
        print(f"Syllables file already exists at {SYLLABLES_PATH}")
    
    # Download training data
    if not os.path.exists(TRAIN_EXTRACT_PATH) or not os.listdir(TRAIN_EXTRACT_PATH):
        if not os.path.exists(TRAIN_ZIP_PATH):
            print(f"Downloading training data to {TRAIN_ZIP_PATH}...")
            download_file(TRAIN_ZIP_URL, TRAIN_ZIP_PATH, desc="Downloading training data")
        
        print(f"Extracting training data to {TRAIN_EXTRACT_PATH}...")
        os.makedirs(TRAIN_EXTRACT_PATH, exist_ok=True)
        with zipfile.ZipFile(TRAIN_ZIP_PATH, 'r') as zip_ref:
            zip_ref.extractall(DATA_DIR)
    else:
        print(f"Training data already exists at {TRAIN_EXTRACT_PATH}")
    
    return check_data_exists()

def check_data_exists():
    """Check whether the required data is already present."""
    os.makedirs(DATA_DIR, exist_ok=True)
    print(f"Data directory configured at: {DATA_DIR}")
    all_data_exists = True
    if not os.path.exists(TRAIN_EXTRACT_PATH) or not os.listdir(TRAIN_EXTRACT_PATH):
        print(f"Warning: Training data directory {TRAIN_EXTRACT_PATH} is missing or empty.")
        print("Please ensure you have downloaded and extracted 'Train_Full.zip' into the 'data/Train_Full' directory.")
        all_data_exists = False
    else:
        print(f"Training data found at: {TRAIN_EXTRACT_PATH}")

    if not os.path.exists(SYLLABLES_PATH):
        print(f"Warning: Vietnamese syllables file {SYLLABLES_PATH} is missing.")
        print("Please ensure you have 'vn_syllables.txt' in the 'data' directory.")
        all_data_exists = False
    else:
        print(f"Vietnamese syllables file found at: {SYLLABLES_PATH}")
    return all_data_exists

def load_corpus(data_extract_path: str = TRAIN_EXTRACT_PATH) -> list[list[str]]:
    """
    Load the corpus from the extracted data directory.
    """
    if not os.path.exists(data_extract_path):
        print(f"Error: Training data path not found: {data_extract_path}")
        print("Please run download_and_prepare_data() first or ensure data is correctly placed.")
        return []

    full_text_content = []
    print(f"Loading corpus from: {data_extract_path}")

    for dirname, _, filenames in os.walk(data_extract_path):
        for filename in tqdm(filenames, desc=f"Reading files in {os.path.basename(dirname)}"):
            if filename.endswith(".txt"): 
                try:
                    with open(os.path.join(dirname, filename), 'r', encoding='UTF-16') as f:
                        full_text_content.append(f.read())
                except Exception as e:
                    print(f"Could not read file {os.path.join(dirname, filename)}: {e}")
    
    if not full_text_content:
        print("No text files found or loaded from the training data path.")
        return []

    print(f"Loaded {len(full_text_content)} documents.")
    full_data_string = ". ".join(full_text_content)
    full_data_string = full_data_string.replace("\n", ". ")
    
    corpus = []
    raw_sents = re.split(r'[.?!]\s+', full_data_string) 
    print(f"Processing {len(raw_sents)} raw sentences...")
    for sent in tqdm(raw_sents, desc="Tokenizing sentences"):
        if sent.strip(): # Ensure sentence is not just whitespace
            corpus.append(tokenize(sent))
    
    print(f"Corpus created with {len(corpus)} tokenized sentences.")
    return corpus

# Helper for multiprocessing: processes one sentence per call so that
# heavy work such as tokenization can be spread across worker processes.
def _process_single_sentence_for_splitting(sent_accented: str):
    if sent_accented.strip():
        # Tokenize once and reuse the tokens for both outputs
        tokenized_accented_sentence = tokenize(sent_accented)
        unaccented_words = [remove_vn_accent(word) for word in tokenized_accented_sentence]
        unaccented_sentence_str = " ".join(unaccented_words)
        if unaccented_sentence_str and tokenized_accented_sentence:
            return (unaccented_sentence_str, tokenized_accented_sentence)
    return None

def load_and_split_corpus(data_extract_path: str = TRAIN_EXTRACT_PATH, test_size: float = 0.2, random_seed: int = 42):
    """
    Load the data, create an unaccented version, tokenize the accented sentences,
    and split into training and test sets.
    """
    if not os.path.exists(data_extract_path):
        print(f"Error: Training data path not found: {data_extract_path}")
        print("Please run download_and_prepare_data() first or ensure data is correctly placed.")
        return [], []

    full_text_content = []
    print(f"Loading corpus from: {data_extract_path}")
    for dirname, _, filenames in os.walk(data_extract_path):
        for filename in tqdm(filenames, desc=f"Reading files in {os.path.basename(dirname)}"):
            if filename.endswith(".txt"):
                try:
                    with open(os.path.join(dirname, filename), 'r', encoding='UTF-16') as f:
                        full_text_content.append(f.read())
                except Exception as e:
                    print(f"Could not read file {os.path.join(dirname, filename)}: {e}")
    
    if not full_text_content:
        print("No text files found or loaded from the training data path.")
        return [], []

    print(f"Loaded {len(full_text_content)} documents.")
    full_data_string = ". ".join(full_text_content)
    full_data_string = full_data_string.replace("\n", ". ")
    
    raw_sentences_with_accent = re.split(r'[.?!]\s+', full_data_string)

    # LIMIT THE NUMBER OF SENTENCES TO AVOID MemoryError
    MAX_PROCESS_SENTENCES = 100
    if len(raw_sentences_with_accent) > MAX_PROCESS_SENTENCES:
        print(f"WARNING: Processing only the first {MAX_PROCESS_SENTENCES} of {len(raw_sentences_with_accent)} sentences to save memory.")
        raw_sentences_with_accent = raw_sentences_with_accent[:MAX_PROCESS_SENTENCES]
    
    print(f"Processing {len(raw_sentences_with_accent)} raw sentences for splitting using multiprocessing...")
    
    with mp.Pool() as pool:
        num_cores = os.cpu_count() or 1 # Ensure num_cores is at least 1
        chunk_size = max(1, len(raw_sentences_with_accent) // (num_cores * 4))
        
        results_iterator = pool.imap(_process_single_sentence_for_splitting, 
                                     tqdm(raw_sentences_with_accent, desc="Generating unaccented and tokenizing (parallel)"),
                                     chunksize=chunk_size)
        
        # Filter out None results (empty sentences or sentences that could not be processed)
        processed_sentences = [res for res in results_iterator if res is not None]

    if not processed_sentences:
        print("No sentences could be processed. Check data and tokenization.")
        return [], []
        
    random.seed(random_seed)
    random.shuffle(processed_sentences)
    
    split_index = int(len(processed_sentences) * (1 - test_size))
    train_data_pairs = processed_sentences[:split_index]
    test_data_pairs = processed_sentences[split_index:]
    
    train_corpus = [pair[1] for pair in train_data_pairs] # Keep only the tokenized accented sentences for training
    test_set = test_data_pairs # Keep (unaccented sentence, tokenized accented sentence) pairs for testing
    
    print(f"Data split: {len(train_corpus)} training sentences, {len(test_set)} test sentences.")
    return train_corpus, test_set
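
The pairing and splitting logic in `load_and_split_corpus` can be tried on a handful of in-memory sentences without downloading anything. The sketch below is a simplified stand-in under stated assumptions: a whitespace split replaces `tokenize`, Unicode decomposition replaces `remove_vn_accent`, and `make_pairs_and_split` and `sample_sentences` are hypothetical names introduced only for illustration.

```python
import random
import unicodedata

def strip_accents(word: str) -> str:
    """Simplified stand-in for remove_vn_accent, based on Unicode decomposition."""
    word = word.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_pairs_and_split(sentences, test_size=0.2, seed=42):
    """Build (unaccented string, accented tokens) pairs, shuffle, then split."""
    pairs = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # whitespace split stands in for tokenize()
        if tokens:
            pairs.append((" ".join(strip_accents(t) for t in tokens), tokens))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(pairs)
    split_index = int(len(pairs) * (1 - test_size))
    train_corpus = [tokens for _, tokens in pairs[:split_index]]
    return train_corpus, pairs[split_index:]

# Hypothetical sample sentences, not taken from the training corpus.
sample_sentences = [
    "Tôi yêu tiếng Việt",
    "Chúc mừng năm mới",
    "Học sinh đến trường",
    "Câu lạc bộ âm nhạc",
]
train_corpus, test_set = make_pairs_and_split(sample_sentences, test_size=0.25)
print(len(train_corpus), "training sentences;", len(test_set), "test pairs")
print(test_set)
```

As in the notebook's version, the training side keeps only accented token lists, while the test side keeps the unaccented input next to its accented reference so predictions can be scored later.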

## 4. Define Model Trainer (model_trainer.py)

In [5]:
# Porting model_trainer.py
import pickle
from collections import defaultdict

DEFAULT_MODEL_FILENAME = "kneserney_trigram_model.pkl"
N_GRAM_ORDER = 3

def train_ngram_model(corpus: list[list[str]],
                      n: int = N_GRAM_ORDER):
    if not corpus:
        print("Corpus is empty. Cannot train model.")
        return None

    print(f"Preparing data for custom {n}-gram model...")

    start_symbol = "<s>"
    end_symbol = "</s>"

    ngram_counts = defaultdict(int)
    context_counts = defaultdict(int) # For (n-1)-grams
    vocab = set()

    for sentence_tokens in corpus:
        # Pad sentence: (n-1) start symbols, 1 end symbol.
        # e.g., for n=3 (trigram): ['<s>', '<s>'] + sentence_tokens + ['</s>']
        # e.g., for n=1 (unigram): [] + sentence_tokens + ['</s>']
        current_padded_sentence = ([start_symbol] * (n - 1)) + sentence_tokens + [end_symbol]

        for token in sentence_tokens: # Original sentence tokens for vocab
            vocab.add(token)

        # Generate n-grams and (n-1)-gram contexts
        # Iterate up to the point where the last n-gram can be formed
        for i in range(len(current_padded_sentence) - n + 1):
            ngram = tuple(current_padded_sentence[i : i + n])
            ngram_counts[ngram] += 1

            if n > 1:
                # The context is the first (n-1) tokens of the n-gram
                context = tuple(current_padded_sentence[i : i + n - 1])
                context_counts[context] += 1
            # For n=1 (unigrams), context_counts will be handled later if needed for P(w) = count(w)/TotalWords

    # For unigram model (n=1), if we define P(w) = count(w) / total_tokens,
    # context_counts can store the total number of tokens.
    if n == 1:
        total_word_occurrences = sum(ngram_counts.values()) # Sum of counts of all unigrams (word occurrences)
        if total_word_occurrences > 0:
            context_counts[()] = total_word_occurrences # Global context for unigrams is the total count

    print(f"Custom model training complete. Vocabulary size: {len(vocab)}")
    print(f"Number of unique {n}-grams: {len(ngram_counts)}")
    if n > 1 or (n == 1 and context_counts):
        print(f"Number of unique contexts: {len(context_counts)}")

    # The "model" is now a dictionary of these counts, vocab, and n
    custom_model = {
        "n": n,
        "vocab": list(vocab), # Store as list for potential JSON needs, pickle handles sets
        "ngram_counts": dict(ngram_counts), # Convert defaultdict to dict
        "context_counts": dict(context_counts), # Convert defaultdict to dict
    }
    return custom_model

def save_model(model, 
               model_dir: str = MODEL_DIR, 
               filename: str = DEFAULT_MODEL_FILENAME):
    if model is None:
        print("Model is None. Nothing to save.")
        return False

    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, filename)
    print(f"Saving model to {model_path}...")
    try:
        with open(model_path, 'wb') as fout:
            pickle.dump(model, fout)
        print(f"Model successfully saved to {model_path}")
        return True
    except Exception as e:
        print(f"Error saving model to {model_path}: {e}")
        return False
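
To see what `train_ngram_model` stores, it may help to run the same counting scheme on a toy corpus. The sketch below is a compact re-statement of that idea using `collections.Counter`; `count_ngrams` and `toy_corpus` are hypothetical names, and the two sentences are invented for illustration.

```python
from collections import Counter

def count_ngrams(corpus, n=3):
    """Count n-grams and their (n-1)-token contexts over padded sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for tokens in corpus:
        # (n - 1) start symbols in front, one end symbol behind.
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            ngram = tuple(padded[i:i + n])
            ngram_counts[ngram] += 1
            # The context is the n-gram without its last token;
            # for n = 1 this is the empty tuple, which then holds the total count.
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

toy_corpus = [["tôi", "yêu", "tiếng", "việt"], ["tôi", "yêu", "hà", "nội"]]
ngrams, contexts = count_ngrams(toy_corpus)

# Relative frequency of each continuation after the context ("tôi", "yêu").
for ngram, count in ngrams.items():
    if ngram[:-1] == ("tôi", "yêu"):
        print(ngram[-1], count / contexts[ngram[:-1]])
```

Dividing an n-gram count by its context count gives the unsmoothed conditional probability that the predictor later adjusts with a small additive constant.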

## 5. Define Predictor (predictor.py)

In [6]:
# Porting predictor.py
from nltk.tokenize.treebank import TreebankWordDetokenizer

DEFAULT_MODEL_PATH = os.path.join(MODEL_DIR, DEFAULT_MODEL_FILENAME)

def load_model(model_path: str = DEFAULT_MODEL_PATH):
    if not os.path.exists(model_path):
        print(f"ERROR: Model file not found at {model_path}")
        print(f"Please train and save the model first or provide a valid path.")
        return None
    
    print(f"Attempting to load model from {model_path}...")
    try:
        with open(model_path, 'rb') as fin:
            model_loaded = pickle.load(fin)
        
        # Adjusted check so that it also handles the dictionary-based model
        if isinstance(model_loaded, dict) and "vocab" in model_loaded:
            print(f"Model loaded successfully from {model_path}. Vocabulary size: {len(model_loaded['vocab'])}")
        elif hasattr(model_loaded, 'vocab'):
            print(f"Model loaded successfully from {model_path}. Vocabulary size: {len(model_loaded.vocab)}")
        else:
            print(f"Model loaded from {model_path}, but it does not seem to have a 'vocab' attribute or key. Type: {type(model_loaded)}")
        
        return model_loaded
    except Exception as e:
        print(f"CRITICAL ERROR: Could not load model from {model_path}. Details: {e}")
        return None

_detokenizer = TreebankWordDetokenizer()

# NLTK-like Language Model class implementation for our custom model
class CustomNGramModel:
    def __init__(self, model_dict):
        self.order = model_dict["n"]
        self.vocab = set(model_dict["vocab"])
        self.ngram_counts = model_dict["ngram_counts"]
        self.context_counts = model_dict["context_counts"]
        
    def logscore(self, word, context=None):
        """Return the log probability of the word given the context"""
        import math
        if context is None:
            context = tuple()
        
        # Normalize the context to a tuple of exactly (order - 1) tokens
        context = tuple(context)
        if self.order == 1:
            context = ()
        elif len(context) > self.order - 1:
            context = context[-(self.order - 1):]
        elif len(context) < self.order - 1:
            # Left-pad short contexts with start symbols
            context = ("<s>",) * ((self.order - 1) - len(context)) + context
        
        # Look up the n-gram count and the count of its context
        ngram = context + (word,)
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)
        
        # Unseen context: fall back to a fixed, very small probability
        if context_count == 0:
            return math.log(1e-10)
        
        # Additive smoothing with a tiny epsilon (close to MLE); a proper
        # scheme such as Kneser-Ney would handle unseen n-grams better
        prob = (ngram_count + 1e-10) / (context_count + 1e-10 * len(self.vocab))
        return math.log(prob)

# Beam search implementation
def beam_search_predict_accents(text_no_accents: str, model, k: int = 3, 
                                syllables_file: str = SYLLABLES_PATH, 
                                detokenizer=_detokenizer) -> list[tuple[str, float]]:
    # Wrap dictionary model in CustomNGramModel if needed
    if isinstance(model, dict) and "n" in model and "vocab" in model:
        model = CustomNGramModel(model)
    
    words = text_no_accents.lower().split()
    sequences = [] # Stores list of ([word_sequence], score)

    for idx, word_no_accent in enumerate(words):
        possible_accented_words = gen_accents_word(word_no_accent, syllables_path=syllables_file)
        if not possible_accented_words:
            possible_accented_words = {word_no_accent} 

        if idx == 0:
            sequences = [([x], 0.0) for x in possible_accented_words]
        else:
            all_new_sequences = []
            for seq_words, seq_score in sequences:
                for next_accented_word in possible_accented_words:
                    context = seq_words[-(model.order - 1):] if model.order > 1 else [] 
                    try:
                        score_addition = model.logscore(next_accented_word, tuple(context))
                    except Exception as e: 
                        # print(f"Logscore error for '{next_accented_word}' with context {context}: {e}. Assigning low score.")
                        score_addition = -float('inf') 
                        
                    new_seq_words = seq_words + [next_accented_word]
                    all_new_sequences.append((new_seq_words, seq_score + score_addition))
            
            all_new_sequences = sorted(all_new_sequences, key=lambda x: x[1], reverse=True)
            sequences = all_new_sequences[:k]
            if not sequences: 
                if all_new_sequences:
                    sequences = [(all_new_sequences[0][0][:-1] + [word_no_accent], all_new_sequences[0][1] - 1000)] 
                else:
                    return []

    results = [(detokenizer.detokenize(seq_words), score) for seq_words, score in sequences]
    return results
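
The beam search above can be hard to follow with the real model attached, so here is a minimal sketch of the same idea on a toy problem. Everything in it is an assumption made for illustration: the probabilities in `TOY_BIGRAM_PROBS` are invented, `toy_logscore` is a bigram lookup rather than the trigram `logscore`, and, unlike the notebook's version, the first word is also scored against the start symbol.

```python
import math

# Hypothetical bigram probabilities, invented for illustration only.
TOY_BIGRAM_PROBS = {
    ("<s>", "tôi"): 0.6,
    ("<s>", "tối"): 0.4,
    ("tôi", "yêu"): 0.9,
    ("tối", "yếu"): 0.5,
}

def toy_logscore(word, context):
    """Log-probability of `word` given the previous word (bigram lookup)."""
    previous = context[-1] if context else "<s>"
    # Unseen pairs get a small floor probability instead of zero.
    return math.log(TOY_BIGRAM_PROBS.get((previous, word), 1e-6))

def beam_search(candidates_per_position, score_fn, k=3):
    """Keep the k best partial sequences after each position."""
    beams = [([], 0.0)]
    for candidates in candidates_per_position:
        # Extend every surviving sequence with every candidate word.
        expanded = [
            (seq + [word], score + score_fn(word, tuple(seq)))
            for seq, score in beams
            for word in candidates
        ]
        # Higher (less negative) log-scores are better.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:k]
    return beams

# Accented candidates for the unaccented input "toi yeu".
candidates = [["tôi", "tối"], ["yêu", "yếu"]]
for words, score in beam_search(candidates, toy_logscore, k=2):
    print(" ".join(words), round(score, 4))
```

Pruning to `k` sequences per step keeps the work proportional to `k` times the number of candidates per word, at the cost of possibly discarding a sequence that would have scored best overall.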

## 6. Define Evaluation Functions

In [7]:
# Porting evaluator code
import json
from nltk.metrics.distance import edit_distance

RESULTS_FILE = "evaluation_results.json"
TEST_SET_SIZE = 0.2
RANDOM_SEED = 42
BEAM_K = 3

def _predict_item(item_data: tuple):
    unaccented_input_str, tokenized_true_accented_sent = item_data
    true_accented_str = _detokenizer.detokenize(tokenized_true_accented_sent)

    # Load model from the path
    model = load_model(DEFAULT_MODEL_PATH)
    if model is None:
        return {
            "input_unaccented": unaccented_input_str,
            "true_accented": true_accented_str,
            "predicted_accented": "MODEL_LOAD_ERROR"
        }

    # Wrap the dictionary model in CustomNGramModel
    if isinstance(model, dict) and "n" in model and "vocab" in model:
        model = CustomNGramModel(model)

    predictions = beam_search_predict_accents(
        text_no_accents=unaccented_input_str,
        model=model,
        k=BEAM_K,
        syllables_file=SYLLABLES_PATH
    )

    predicted_accented_str = predictions[0][0] if predictions else ""

    return {
        "input_unaccented": unaccented_input_str,
        "true_accented": true_accented_str,
        "predicted_accented": predicted_accented_str
    }

def calculate_sentence_accuracy(results: list[dict]) -> float:
    if not results:
        return 0.0
    correct_sentences = 0
    for res in results:
        if res["true_accented"].strip() == res["predicted_accented"].strip():
            correct_sentences += 1
    return (correct_sentences / len(results)) * 100

def calculate_word_accuracy(results: list[dict]) -> float:
    if not results:
        return 0.0
    total_words = 0
    correct_words = 0
    for res in results:
        true_words = res["true_accented"].strip().split()
        predicted_words = res["predicted_accented"].strip().split()
        len_min = min(len(true_words), len(predicted_words))
        for i in range(len_min):
            if true_words[i] == predicted_words[i]:
                correct_words += 1
        total_words += len(true_words)
    if total_words == 0: return 0.0
    return (correct_words / total_words) * 100

def calculate_cer(results: list[dict]) -> float:
    if not results:
        return 0.0
    total_edit_distance = 0
    total_true_chars = 0
    for res in results:
        true_str = res["true_accented"].strip()
        pred_str = res["predicted_accented"].strip()
        if not true_str and not pred_str:
            dist = 0
        elif not true_str:
            dist = len(pred_str)
        elif not pred_str:
            dist = len(true_str)
        else:
            dist = edit_distance(pred_str, true_str)
        total_edit_distance += dist
        total_true_chars += len(true_str)
    if total_true_chars == 0:
        # Keep the percentage scale used by the normal return path
        return 100.0 if total_edit_distance > 0 else 0.0
    return (total_edit_distance / total_true_chars) * 100

def display_sample_results(results: list[dict], num_samples: int = 5):
    print("\n--- Sample Predictions ---")
    for i, res in enumerate(results[:num_samples]):
        print(f"Sample {i+1}:")
        print(f"  Input:     '{res['input_unaccented']}'")
        print(f"  True:      '{res['true_accented']}'")
        print(f"  Predicted: '{res['predicted_accented']}'")
        print("---")

def plot_metrics(metrics: dict):
    """Plots metrics using matplotlib."""
    try:
        import matplotlib.pyplot as plt
        names = list(metrics.keys())
        values = list(metrics.values())

        plt.figure(figsize=(10, 5))
        plt.bar(names, values)
        plt.ylabel('Percentage (%)')
        plt.title('Model Evaluation Metrics')

        # Add the value above each bar
        for i, value in enumerate(values):
            plt.text(i, value + 0.5, f"{value:.2f}%", ha = 'center')

        plt.savefig(os.path.join(PLOTS_DIR, "evaluation_metrics.png"))
        plt.show()
    except ImportError:
        print("\nMatplotlib not found. Skipping plot generation. Please install it: pip install matplotlib")
    except Exception as e:
        print(f"\nError plotting metrics: {e}")
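
For readers who want to check the character error rate by hand, the sketch below computes it without NLTK. It assumes unit costs for insertion, deletion, and substitution, which matches the defaults of `edit_distance`; `levenshtein`, `cer_percent`, and `sample_pairs` are hypothetical names, and the sample predictions are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete from a
                current_row[j - 1] + 1,                   # insert into a
                previous_row[j - 1] + substitution_cost,  # substitute or match
            ))
        previous_row = current_row
    return previous_row[-1]

def cer_percent(pairs):
    """Total edit distance over total reference characters, as a percentage."""
    total_distance = sum(levenshtein(pred, true) for true, pred in pairs)
    total_chars = sum(len(true) for true, _ in pairs)
    return 100.0 * total_distance / total_chars if total_chars else 0.0

# Hypothetical (reference, prediction) pairs.
sample_pairs = [("tôi yêu", "toi yêu"), ("việt nam", "việt nam")]
print(f"CER: {cer_percent(sample_pairs):.2f}%")
```

Because the distance is counted per code point, reference and prediction should use the same Unicode normalization form; otherwise a visually identical accented letter can register as an error.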

## 7. Data Preparation and Download

In [9]:
# First, download and prepare the data
print("Downloading and preparing the Vietnamese language data...")
download_and_prepare_data()

# Check if data exists
print("\nVerifying data availability...")
check_data_exists()

Downloading and preparing the Vietnamese language data...
Downloading Vietnamese syllables file to c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\vn_syllables.txt...


Downloading syllables file: 100%|██████████| 114k/114k [00:00<00:00, 1.45MB/s]



Downloading training data to c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full.zip...


Downloading training data: 100%|██████████| 64.5M/64.5M [00:14<00:00, 4.74MB/s]



Extracting training data to c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full...
Data directory configured at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data
Training data found at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full
Vietnamese syllables file found at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\vn_syllables.txt

Verifying data availability...
Data directory configured at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data
Training data found at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full
Vietnamese syllables file found at: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\vn_syllables.txt

True

## 8. Train the N-gram Model

In [10]:
# Load the corpus
print("Loading corpus for training...")
corpus = load_corpus(data_extract_path=TRAIN_EXTRACT_PATH)

if corpus:
    print(f"Corpus loaded with {len(corpus)} sentences.")
    
    # Train the model
    print("\nTraining the N-gram model...")
    trained_model = train_ngram_model(corpus, n=N_GRAM_ORDER)
    
    if trained_model:
        # Save the model
        print("\nSaving the trained model...")
        save_model(trained_model, model_dir=MODEL_DIR, filename=DEFAULT_MODEL_FILENAME)
    else:
        print("Model training failed.")
else:
    print("Corpus could not be loaded. Skipping training.")

Loading corpus for training...
Loading corpus from: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full


Reading files in Train_Full: 0it [00:00, ?it/s]
Reading files in Chinh tri Xa hoi: 100%|██████████| 6567/6567 [00:00<00:00, 12812.86it/s]
Reading files in Doi song: 100%|██████████| 4195/4195 [00:00<00:00, 12691.70it/s]
Reading files in Kinh doanh: 100%|██████████| 4276/4276 [00:00<00:00, 13488.81it/s]
Reading files in Phap luat: 100%|██████████| 6656/6656 [00:00<00:00, 14235.83it/s]
Reading files in Suc khoe: 100%|██████████| 4417/4417 [00:00<00:00, 13996.20it/s]
Reading files in The gioi: 100%|██████████| 5716/5716 [00:00<00

Loaded 42744 documents.
Processing 1208363 raw sentences...


Tokenizing sentences: 100%|██████████| 1208363/1208363 [02:30<00:00, 8035.38it/s]



Corpus created with 1001625 tokenized sentences.
Corpus loaded with 1001625 sentences.

Training the N-gram model...
Preparing data for custom 3-gram model...
Custom model training complete. Vocabulary size: 104810
Number of unique 3-grams: 8052391
Number of unique contexts: 1973514

Saving the trained model...
Saving model to c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\models\kneserney_trigram_model.pkl...
Model successfully saved to c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\models\kneserney_trigram_model.pkl
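
The training log reports a vocabulary size, a number of unique 3-grams, and a number of unique contexts. As a rough illustration of what those three figures measure, the sketch below collects them from a toy tokenized corpus. The function name `count_ngrams` and the `<s>` / `</s>` padding symbols are assumptions made for this example; `train_ngram_model` may pad, filter, or store its counts differently.

```python
from collections import Counter

def count_ngrams(corpus, n=3):
    """Count n-grams and their (n-1)-token contexts in a tokenized corpus."""
    ngram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in corpus:
        vocab.update(sentence)
        # Pad so that the first real token also has a full context (assumed convention).
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(len(padded) - n + 1):
            ngram = tuple(padded[i:i + n])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts, vocab

toy_corpus = [["tôi", "yêu", "tiếng", "việt"], ["tôi", "yêu", "âm", "nhạc"]]
ngram_counts, context_counts, vocab = count_ngrams(toy_corpus, n=3)
print(f"Vocabulary size: {len(vocab)}")
print(f"Number of unique 3-grams: {len(ngram_counts)}")
print(f"Number of unique contexts: {len(context_counts)}")
```

The number of unique contexts is always at most the number of unique n-grams, because several n-grams can share the same leading tokens.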


## 9. Test the Accent Prediction

In [8]:
# Load the trained model
print("Loading the trained model for prediction...")
model = load_model(DEFAULT_MODEL_PATH)

if model:
    # Test some sample sentences
    test_sentences = [
        "ngay hom qua la ngay bau cu tong thong my",
        "chuc mung nam moi",
        "toi yeu tieng viet",
        "hoc sinh truong thpt",
        "cau lac bo am nhac"
    ]
    
    print("\nTesting accent prediction on sample sentences:")
    for i, sentence in enumerate(test_sentences):
        print(f"\nSample {i+1}: '{sentence}'")
        predictions = beam_search_predict_accents(sentence, model, k=3, syllables_file=SYLLABLES_PATH)
        
        if predictions:
            print("Top predictions:")
            for j, (sent, score) in enumerate(predictions):
                print(f"{j+1}. '{sent}' (Score: {score:.4f})")
        else:
            print("No predictions returned.")
else:
    print("Model could not be loaded. Cannot run predictions.")

Loading the trained model for prediction...
Attempting to load model from c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\models\kneserney_trigram_model.pkl...
Model loaded successfully from c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\models\kneserney_trigram_model.pkl. Vocabulary size: 104810

Testing accent prediction on sample sentences:

Sample 1: 'ngay hom qua la ngay bau cu tong thong my'
Top predictions:
1. 'ngày hôm qua là ngày bẩu cự tộng thống mý' (Score: -134.2986)
2. 'ngày hôm qua là ngày bẩu cự tộng thống mỹ' (Score: -134.2986)
3. 'ngày hôm qua là ngày bẩu cự tộng thống my' (Score: -134.2986)

Sample 2: 'chuc mung nam moi'
Top predictions:
1. 'ngày hôm qua là ngày bẩu cự tộng thống mý' (Score: -134.2986)
2. 'ngày hôm qua là n
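
To make the idea behind the beam search call easier to follow, here is a minimal, self-contained sketch of beam search over accented candidates. It is not the notebook's `beam_search_predict_accents`: the candidate table, the bigram log-probabilities, and the name `toy_beam_search` are all hypothetical stand-ins for the syllables file and the trained model.

```python
import math

# Hypothetical candidate table: unaccented syllable -> possible accented forms.
CANDIDATES = {
    "toi": ["tôi", "tối", "tội"],
    "yeu": ["yêu", "yếu"],
}

# Hypothetical bigram log-probabilities standing in for the trained model.
BIGRAM_LOGPROB = {
    ("<s>", "tôi"): math.log(0.5),
    ("<s>", "tối"): math.log(0.3),
    ("<s>", "tội"): math.log(0.2),
    ("tôi", "yêu"): math.log(0.6),
    ("tôi", "yếu"): math.log(0.1),
    ("tối", "yêu"): math.log(0.05),
}
UNSEEN_LOGPROB = math.log(1e-4)  # flat penalty for pairs missing from the table

def toy_beam_search(text, k=3):
    """Return the k best accented sentences with their log-probability scores."""
    beams = [(["<s>"], 0.0)]
    for syllable in text.lower().split():
        # Syllables without known candidates are passed through unchanged.
        options = CANDIDATES.get(syllable, [syllable])
        expanded = []
        for tokens, score in beams:
            for option in options:
                step = BIGRAM_LOGPROB.get((tokens[-1], option), UNSEEN_LOGPROB)
                expanded.append((tokens + [option], score + step))
        # Keep only the k highest-scoring partial sentences.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:k]
    return [(" ".join(tokens[1:]), score) for tokens, score in beams]

results = toy_beam_search("toi yeu", k=3)
for rank, (sentence, score) in enumerate(results, start=1):
    print(f"{rank}. '{sentence}' (Score: {score:.4f})")
```

Scores are sums of log-probabilities, which is why they are negative and why longer sentences tend to score lower.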

## 10. Evaluate the Model

In [None]:
print("Preparing for model evaluation...")

# Prepare evaluation data
print("\nLoading and splitting corpus for evaluation...")
_, test_set = load_and_split_corpus(data_extract_path=TRAIN_EXTRACT_PATH, test_size=TEST_SET_SIZE, random_seed=RANDOM_SEED)

if test_set:
    print(f"Loaded {len(test_set)} test items for evaluation.")
    
    # Limit test set size for faster evaluation
    MAX_EVAL_ITEMS = 50
    if len(test_set) > MAX_EVAL_ITEMS:
        print(f"Limiting evaluation to first {MAX_EVAL_ITEMS} items to save time...")
        test_set = test_set[:MAX_EVAL_ITEMS]
    
    # Evaluate each test item
    print("\nRunning predictions on test items...")
    evaluation_results = []
    
    from tqdm.notebook import tqdm
    for item in tqdm(test_set, desc="Evaluating"):
        result = _predict_item(item)
        evaluation_results.append(result)
    
    # Calculate metrics
    print("\nCalculating evaluation metrics...")
    sent_accuracy = calculate_sentence_accuracy(evaluation_results)
    word_accuracy = calculate_word_accuracy(evaluation_results)
    char_error_rate = calculate_cer(evaluation_results)
    char_accuracy = 100.0 - char_error_rate
    
    # Display metrics
    print(f"\n--- Evaluation Metrics ---")
    print(f"Sentence Accuracy: {sent_accuracy:.2f}%")
    print(f"Word Accuracy: {word_accuracy:.2f}%")
    print(f"Character Error Rate (CER): {char_error_rate:.2f}%")
    print(f"Character Accuracy: {char_accuracy:.2f}%")
    
    # Save results
    results_file_path = os.path.join(BASE_DIR, RESULTS_FILE)
    try:
        with open(results_file_path, 'w', encoding='utf-8') as f:
            json.dump(evaluation_results, f, ensure_ascii=False, indent=4)
        print(f"\nResults saved to {results_file_path}")
    except IOError as e:
        print(f"Error saving results: {e}")
    
    # Display sample results
    display_sample_results(evaluation_results, num_samples=5)
    
    # Plot metrics
    metrics_to_plot = {
        "Sentence Accuracy": sent_accuracy,
        "Word Accuracy": word_accuracy,
        "Character Accuracy": char_accuracy
    }
    plot_metrics(metrics_to_plot)
else:
    print("Test set could not be loaded. Skipping evaluation.")

Preparing for model evaluation...

Loading and splitting corpus for evaluation...
Loading corpus from: c:\Users\tontide1\Desktop\Vietnamese-Accent-Model\data\Train_Full


Reading files in Train_Full: 0it [00:00, ?it/s]
Reading files in Chinh tri Xa hoi: 100%|██████████| 6567/6567 [00:00<00:00, 14496.13it/s]
Reading files in Doi song: 100%|██████████| 4195/4195 [00:00<00:00, 13551.81it/s]
Reading files in Kinh doanh: 100%|██████████| 4276/4276 [00:00<00:00, 13391.61it/s]
Reading files in Phap luat: 100%|██████████| 6656/6656 [00:00<00:00, 14706.79it/s]
Reading files in Suc khoe: 100%|██████████| 4417/4417 [00:00<00:00, 14434.82it/s]
Reading files in The gioi: 100%|██

Loaded 42744 documents.
WARNING: Limiting processing to the first 100 of 1208363 sentences to save memory.
Processing 100 raw sentences for splitting using multiprocessing...


Generating unaccented and tokenizing (parallel): 100%|██████████| 100/100 [00:00<00:00, 131.27it/s]
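
The evaluation cell reports a Character Error Rate (CER) and derives character accuracy as `100.0 - CER`. A common way to define CER is the Levenshtein edit distance between prediction and reference, divided by the reference length. The sketch below follows that definition; it is an assumption about how such a metric can be computed, and the notebook's `calculate_cer` may differ in details such as normalization or how results are stored. The names `edit_distance` and `character_error_rate` are hypothetical.

```python
import unicodedata

def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def character_error_rate(pairs):
    """CER in percent over (predicted, reference) string pairs."""
    total_edits = 0
    total_chars = 0
    for predicted, reference in pairs:
        # Normalize to NFC so each accented letter counts as one character.
        predicted = unicodedata.normalize("NFC", predicted)
        reference = unicodedata.normalize("NFC", reference)
        total_edits += edit_distance(predicted, reference)
        total_chars += len(reference)
    return 100.0 * total_edits / total_chars if total_chars else 0.0

pairs = [
    ("tôi yếu tiếng việt", "tôi yêu tiếng việt"),  # one wrong accent
    ("chúc mừng năm mới", "chúc mừng năm mới"),    # exact match
]
cer = character_error_rate(pairs)
print(f"Character Error Rate (CER): {cer:.2f}%")
print(f"Character Accuracy: {100.0 - cer:.2f}%")
```

Because accent restoration never changes the number of syllables, most errors show up as single-character substitutions, so CER is usually much lower than the sentence-level error.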



## 11. Interactive Prediction Demo

In [None]:
def run_prediction_demo():
    # Load the model
    model = load_model(DEFAULT_MODEL_PATH)
    if not model:
        print("Model could not be loaded. Cannot run prediction demo.")
        return
    
    print("--- Vietnamese Accent Predictor Demo ---")
    print("Enter Vietnamese text without accents to predict the accented version.")
    print("Type 'exit' to quit.")
    
    while True:
        text_input = input("\nEnter a Vietnamese sentence without accents: ")
        if text_input.lower() == 'exit':
            break
        if not text_input.strip():
            print("Please enter a sentence.")
            continue
            
        print(f"Predicting accents for: '{text_input}'")
        predictions = beam_search_predict_accents(
            text_input, 
            model, 
            k=3, 
            syllables_file=SYLLABLES_PATH
        )
        
        if predictions:
            print("\nTop predictions:")
            for i, (sent, score) in enumerate(predictions):
                print(f"{i+1}. '{sent}' (Score: {score:.4f})")
        else:
            print("No predictions returned.")
    
    print("\n--- Demo Finished ---")

# Uncomment to run the interactive demo
# run_prediction_demo()

## 12. Implement the Interactive Widget (Google Colab specific)

In [None]:
try:
    from IPython.display import display
    import ipywidgets as widgets
    
    # Load the model outside the function to avoid reloading it for each prediction
    print("Loading model for interactive widget...")
    interactive_model = load_model(DEFAULT_MODEL_PATH)
    
    if interactive_model:
        print("Model loaded successfully.")
        
        def predict_button_clicked(b):
            input_text = input_widget.value
            if not input_text.strip():
                output_widget.value = "Please enter a sentence."
                return
                
            output_widget.value = f"Predicting accents for: '{input_text}'\n\n"
            predictions = beam_search_predict_accents(
                input_text, 
                interactive_model, 
                k=3, 
                syllables_file=SYLLABLES_PATH
            )
            
            if predictions:
                output_widget.value += "Top predictions:\n"
                for i, (sent, score) in enumerate(predictions):
                    output_widget.value += f"{i+1}. '{sent}' (Score: {score:.4f})\n"
            else:
                output_widget.value += "No predictions returned."
        
        # Create widgets
        input_widget = widgets.Text(
            description='Input:',
            placeholder='Enter Vietnamese text without accents',
            layout=widgets.Layout(width='80%')
        )
        
        predict_button = widgets.Button(
            description='Predict Accents',
            button_style='primary',
            tooltip='Click to predict accents'
        )
        predict_button.on_click(predict_button_clicked)
        
        output_widget = widgets.Textarea(
            placeholder='Predictions will appear here',
            layout=widgets.Layout(width='80%', height='200px')
        )
        
        # Display the interactive interface
        print("\n--- Vietnamese Accent Predictor ---")
        display(widgets.VBox([input_widget, predict_button, output_widget]))
    else:
        print("Model could not be loaded. Cannot initialize interactive widget.")
        
except ImportError:
    print("Could not import ipywidgets. Running in a non-interactive environment or ipywidgets not installed.")
    print("To use the interactive widget, please install ipywidgets: !pip install ipywidgets")
    print("You can still use the text-based demo by uncommenting and running the call to run_prediction_demo().")

## 13. Conclusion and Additional Resources

This notebook contains the complete Vietnamese Accent Model code, adapted to run on Google Colab. The model is an N-gram language model that, combined with beam search, restores diacritics (accent marks) to Vietnamese text written without them.

### Summary of Components:

1. **Utilities**: Basic text processing functions for tokenization and accent manipulation
2. **Data Loader**: Functions to download, prepare, and load the Vietnamese text corpus
3. **Model Trainer**: N-gram model training implementation 
4. **Predictor**: Beam search algorithm for predicting accents in Vietnamese text
5. **Evaluation**: Metrics calculation and visualization for model performance
6. **Interactive Demo**: User interface for testing accent prediction
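
As a small illustration of the accent manipulation mentioned under Utilities, the sketch below strips diacritics from Vietnamese text using only the standard library. This is one possible approach, not necessarily the one used in `utils.py`, and the name `strip_vietnamese_accents` is hypothetical.

```python
import unicodedata

def strip_vietnamese_accents(text):
    """Return text with Vietnamese diacritics removed."""
    # đ/Đ are separate letters rather than a base letter plus a combining mark,
    # so Unicode decomposition does not touch them; map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter and combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and recompose.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_vietnamese_accents("Chúc mừng năm mới"))
print(strip_vietnamese_accents("Đường phố Hà Nội"))
```

A function like this is useful for building evaluation inputs: strip the accents from a reference sentence, run the predictor on the result, and compare against the original.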

### Further Improvements:

- Implement more sophisticated smoothing techniques for the language model
- Incorporate character-level features for better accent prediction
- Optimize the beam search algorithm for better performance
- Experiment with neural models for accent prediction
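
For anyone experimenting with the smoothing item above, linear interpolation is a simple baseline to compare against: it mixes trigram, bigram, and unigram estimates so that unseen trigrams still receive some probability. The sketch below is only a reference point under stated assumptions. It does not describe the smoothing the notebook's model currently uses, the function names are hypothetical, and the weights `(0.6, 0.3, 0.1)` are arbitrary placeholders that would normally be tuned on held-out data.

```python
from collections import Counter

def collect_counts(corpus):
    """Count trigrams, bigrams, unigrams and their contexts in tokenized sentences."""
    counts = {name: Counter() for name in ("tri", "tri_ctx", "bi", "bi_ctx", "uni")}
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            counts["tri"][(u, v, w)] += 1
            counts["tri_ctx"][(u, v)] += 1
            counts["bi"][(v, w)] += 1
            counts["bi_ctx"][v] += 1
            counts["uni"][w] += 1
    return counts

def interpolated_prob(w, u, v, counts, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) as a weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    tri_ctx = counts["tri_ctx"][(u, v)]
    bi_ctx = counts["bi_ctx"][v]
    total = sum(counts["uni"].values())
    p_tri = counts["tri"][(u, v, w)] / tri_ctx if tri_ctx else 0.0
    p_bi = counts["bi"][(v, w)] / bi_ctx if bi_ctx else 0.0
    p_uni = counts["uni"][w] / total if total else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

toy_corpus = [["tôi", "yêu", "tiếng", "việt"], ["tôi", "yêu", "âm", "nhạc"]]
counts = collect_counts(toy_corpus)
p_seen = interpolated_prob("tiếng", "tôi", "yêu", counts)
p_unseen = interpolated_prob("nhạc", "tôi", "yêu", counts)
print(f"P(tiếng | tôi yêu) = {p_seen:.4f}")
print(f"P(nhạc | tôi yêu) = {p_unseen:.4f}")
```

The trigram "tôi yêu nhạc" never occurs in the toy corpus, yet it still gets a small non-zero probability from the unigram term, which is the behaviour smoothing is meant to provide.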

Feel free to modify and extend this code for your own research or applications!