# Biting The Bytes: Transformers For Diacritic Restoration

<b> Istanbul Technical University - Natural Language Processing - Spring 2024 </b>

Emircan Erol - 150200324 <br>
erole20@itu.edu.tr

Muhammed Rüşen Birben - 150220755 <br>
birben20@itu.edu.tr

The notebook has been run many times to get the results. We've iteratively selected checkpoint models (saved the best ones), modified the hyperparameters, and changed the dataset. The base model is [`google-byt5-small`](https://huggingface.co/google/byt5-small).

The resources used in this projects can be found in the following drive folder:

[` Google Drive Link 🗂️`](https://drive.google.com/drive/folders/1nfAvnj_-EB4FMa83mUCtGNel9DkdC0PL?usp=sharing)

## Imports and Preparation

In [1]:
import torch.nn.functional as F
import torch.optim as optim
from random import shuffle
from tqdm import tqdm
import pandas as pd
import torch
import wandb
import json
import os
from os.path import exists as is_path_exists

LR = 2 ** (-12)
CHECKPOINT = None # 'models/peft32-train32-256-base-trtw1e-train-512-1024-base-trw-6+kelimeler-2-6200+6300tr3600-train128tr128-1000.pt'
MIN_LEN = 0
MAX_LEN = 32
model_id = 'google/byt5-small' # needed if training from scratch
BATCH_SIZE = 64
enc = 'utf-8'

In [2]:
if CHECKPOINT:
    assert is_path_exists(CHECKPOINT)
    print(f'Checkpoint {CHECKPOINT} exists. Loading model from checkpoint.')
    model = torch.load(CHECKPOINT)
    # turn on peft
    #model = prepare_model_for_kbit_training(model)
    #model = get_peft_model(model, peft_config)
else:
    # peft is used to redce the trainable parameter size of the base model
    from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
    from turbot5 import T5ForTokenClassification
    from transformers import BitsAndBytesConfig
    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    LORA_ALPHA = 64
    LORA_DROPOUT = 0.125
    R = 32

    peft_config = LoraConfig(
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        r=R,
        bias="none",
        task_type='TOKEN_CLS',
        target_modules = ['k', 'q', 'v', 'o']
    )
    print(f'Checkpoint {CHECKPOINT} does not exist. Training from scratch.')
    
    model = T5ForTokenClassification.from_pretrained(model_id,
                                                    num_labels=2,
                                                    torch_dtype=torch.bfloat16,
                                                    quantization_config=bnb_config,
                                                    device_map="auto",
                                                    attention_type = 'flash')
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)

model.print_trainable_parameters()

Checkpoint None does not exist. Training from scratch.


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at google/byt5-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 2,657,282 || all params: 417,363,844 || trainable%: 0.6366823667648605


## Preparing the data

Notice that the conversion of main training data to .json is not provided, it is very similar with the conversion of the text, just substitute the diactritic characters with their non-diactritic counterparts. and convert to .json.

Something like this:

```python

def process_data(input_file, output_file, answer_file):
    # Read input file (CSV or JSONL)
    if input_file.endswith('.csv'):
        df = pd.read_csv(input_file)
    elif input_file.endswith('.jsonl'):
        df = pd.read_json(input_file, lines=True, encoding='utf-8-sig')
    else:
        raise ValueError("Unsupported input file format. Use CSV or JSONL.")

    # Shuffle the dataframe
    df = df.sample(frac=1).reset_index(drop=True)

    # Rename columns
    df = df.rename(columns={'Sentence': 'output'})


    # Add an input non-diacritic version of the output
    # This is the input for the model
    df['input'] = df['output'].apply(asciify_turkish_chars)    

    # Reorder columns
    df = df[['ID', 'input', 'output']]

    # Save the processed data to the output file (JSONL)
    df.to_json(output_file, orient='records', lines=True)

    # Save the answer data to the answer file (CSV)
    df[['ID', 'output']].to_csv(answer_file, index=False)

    

In [None]:
def asciify_turkish_chars(text):
    """
    Removes diacritics from Turkish text.
    Args:
        text (str): Turkish text.
    Returns:
        text: Turkish text without diacritics.
    """
    turkish_chars = {
        'ç': 'c',
        'ğ': 'g',
        'ı': 'i',
        'ö': 'o',
        'ş': 's',
        'ü': 'u',
        'Ç': 'C',
        'Ğ': 'G',
        'İ': 'I',
        'Ö': 'O',
        'Ş': 'S',
        'Ü': 'U'
    }
    
    for char, replacement in turkish_chars.items():
        text = text.replace(char, replacement)
    
    return text

def txt_to_input_output(fp, skip=500_000, split='all'):
    """
    Reads a text file and writes it to a jsonl file. 
    So that it can be used as a dataset.
    Args:
        fp (str): File path of the text file.
        skip (int): Number of lines to skip.
        split (int): Number of lines to read.
    """
    sentences = []
    dict_data = []
    with open(fp, 'r') as f:
        for line in f:
            if skip:
                skip -= 1
                continue
            
            elif line == '\n': continue

            elif split == 'all':
                sentences.append(line.strip('\n'))

            elif MIN_LEN <= len(line) <= MAX_LEN:
                sentences.append(line.strip('\n'))
                split -= 1
                if split == 0: break

        dict_data = [{'input': asciify_turkish_chars(i), 'output': i} for i in sentences]

    with open('data.jsonl', mode='w', encoding='utf-8') as f:
        for w_dict in dict_data:
            json.dump(w_dict, f)
            f.write('\n')

# running the function if we need to create dataset from scratch
# txt_to_input_output('turkce_kelime_listesi.txt', skip=0, split='all')

In [4]:
# Read data with a minimum and maximum length
path = 'data.jsonl' # 'data.jsonl'

data = []
with open(path, 'r', encoding=enc) as f:
    for jline in f.readlines():
        line = json.loads(jline)
        n = len(line['input'])
        if MIN_LEN <= n <= MAX_LEN:
            data.append(line)

print(f'Number of samples: {len(data)}')

Number of samples: 76186


In [5]:
shuffle(data)

In [6]:
# print some samples
for i in range(5):
    print(f"Input: {data[i]['input']}\nOutput: {data[i]['output']}")

Input: ovunmak
Output: ovunmak
Input: mujdecilik
Output: müjdecilik
Input: Ardesen 
Output: Ardeşen 
Input: bir alay
Output: bir alay
Input: ihtiyat akcesi
Output: ihtiyat akçesi


In [7]:
def mask_label(data, batch_size=BATCH_SIZE):
    """
    Masks the padded tokens in the input and creates batches.
    Args:
        data (list): List of dictionaries.
        batch_size (int): Batch size.
    Returns:
        batches (list): List of dictionaries.
        The dictionaries contain the following keys: 'input_ids', 'labels', 'attention_mask'
    """
    batches = list()
    
    dataset = list()
    for idx, sample in enumerate(data):
        inp = list(sample['input'])
        gold = list(sample['output'])
        
        new_sample = dict()

        label = []
        
        # create labels
        for i, j in zip(inp, gold):
            n = len(i.encode('utf-8'))
            label.extend([i != j] * n)

        new_sample['labels'] = label # labels
        new_sample['input_ids'] = [i + 3 for i in sample['input'].encode('utf-8')] # input_ids
        
        assert len(label) == len(new_sample['input_ids'])
        new_sample['input_ids'].append(0) # eos token

        dataset.append(new_sample)
        if idx and not (idx % batch_size):
            batch = dict()
            max_size = 0
            for i in dataset[idx-batch_size: idx]:
                length = len(i['input_ids'])
                if length > max_size: max_size = length

            input_ids = list()
            labels = list()

            # padding
            for i in dataset[idx-batch_size: idx]:
                i['labels'].extend([False] * (max_size - len(i['labels'])))
                i['input_ids'].extend([0] * (max_size - len(i['input_ids'])))
                input_ids.append(i['input_ids'])
                labels.append(i['labels'])

            batch['labels'] = torch.tensor(labels, dtype=torch.int64)
            batch['input_ids'] = torch.tensor(input_ids, dtype=torch.int64)

            # attention mask: 1 for real tokens, 0 for padding
            batch['attention_mask'] = torch.ones_like(batch['input_ids'], dtype=torch.int64)
            batch['attention_mask'][batch['input_ids'] == 0] = 0
            batches.append(batch)

    return batches

ds = mask_label(data)
print(f'Number of batches: {len(ds)}')
assert len(ds) > 0
del data

Number of batches: 1190


In [8]:
def test_mask(data):
    """
    Masks the padded tokens in the input.
    Args:
        data (list): List of strings.
    Returns:
        dataset (list): List of dictionaries.
    """

    dataset = list()
    for sample in data:        
        new_sample = dict()

        input_tokens = [i + 3 for i in sample.encode('utf-8')]
        input_tokens.append(0) # eos token
        new_sample['input_ids'] = torch.tensor([input_tokens], dtype=torch.int64)
        
        # Create attention mask
        attention_mask = [1] * len(input_tokens)  # Attend to all tokens
        new_sample['attention_mask'] = torch.tensor([attention_mask], dtype=torch.int64)
        
        dataset.append(new_sample)
        
    return dataset

def rewrite(model, data):
    """
    Rewrites the input text with the model.
    Args:
        model (torch.nn.Module): Model.
        data (dict): Dictionary containing 'input_ids' and 'attention_mask'.
    Returns:
        output (str): Rewritten text.
    """

    with torch.no_grad():
        for k, v in data.items():
            data[k] = data[k].to(model.device)
        pred = torch.argmax(model(**data).logits, dim=2)
        
    output = list() # save the indices of the characters as list of integers
    
    # Conversion table for Turkish characters {100: [300, 350], ...}
    en2tr = {en: tr for tr, en in zip(list(map(list, map(str.encode, list('ÜİĞŞÇÖüığşçö')))), list(map(ord, list('UIGSCOuigsco'))))}

    for inp, lab in zip((data['input_ids'] - 3)[0].tolist(), pred[0].tolist()):
        if lab and inp in en2tr:
            # if the model predicts a diacritic, replace it with the corresponding Turkish character
            output.extend(en2tr[inp])
        elif inp >= 0: output.append(inp)
    return bytes(output).decode()

def submission(model=model, split=None, path='test.jsonl', output_path='submission.csv'):
    """
    Creates a submission file.
    Args:
        model (torch.nn.Module): Model.
        path (str): Path of the test data.
        output_path (str): Path of the submission file.
    Returns:
        df (pd.DataFrame): Dataframe containing the submission.
    """
    
    predictions = list()
    with open(path, 'r', encoding=enc) as f:
        data = [json.loads(jline) for jline in f.readlines()[:split]]
        ds = test_mask([i['input'] for i in data])
        for sample in tqdm(ds):
            predictions.append([rewrite(model, sample)])
    
    df = pd.DataFrame(predictions, columns=['output'])

    # take ids from data
    df['ID'] = [i['ID'] for i in data]
    df.rename(columns={'output': 'Sentence'}, inplace=True)
    df = df[['ID', 'Sentence']]
    df.to_csv(output_path, index=False, encoding=enc)
    
    return df


def acc_overall(test_result, testgold, verbose=True):
    """
    Calculates the overall accuracy. Using the formula used in the competition.
    Args:
        test_result (list): List of strings.
        testgold (list): List of strings.
        verbose (bool): If True, prints the incorrect words.
    Returns:
        float: Overall accuracy.
    """
    correct = 0
    total = 0
    
    # count number of correctly diacritized words
    for i in range(len(testgold)):
        golds = testgold[i].split()
        results = test_result[i].split()
        for m in range(len(golds)):
            if results[m] == golds[m]:
                correct += 1
            elif verbose:
                print(results[m], golds[m]) # print the incorrect words, handy to see the errors
            total += 1
    
    return correct / total

*setting weights and biases service for tracking*

In [14]:
os.environ["WANDB__SERVICE_WAIT"] = "300"
os.environ["WANDB_PROJECT"]="NLP Project"
os.environ["WANDB_LOG_MODEL"]="true"
os.environ["WANDB_WATCH"]="true"
os.environ["WANDB_NOTEBOOK_NAME"]="token_classification.ipynb"
run = wandb.init()

## Training

In [15]:
optimizer = optim.AdamW(model.parameters(), lr=LR)
epoch = 1
val_steps = 250
for i in range(epoch):
    for counter, batch in tqdm(enumerate(ds)):

        for k, v in batch.items():
            batch[k] = batch[k].to('cuda')

        outputs = model(**batch)
        logits = outputs.logits
        
        # Create attention mask
        attention_mask = batch['attention_mask']
        
        # Compute categorical cross-entropy loss
        active_loss = attention_mask.view(-1) == 1 # select only non-padded tokens (attended tokens)
        active_logits = logits.view(-1, logits.shape[-1]) # get logits for all tokens
        active_labels = torch.where(
            active_loss, batch['labels'].view(-1), torch.tensor(model.config.pad_token_id).type_as(batch['labels'])
        ) 
        loss = F.cross_entropy(active_logits, active_labels) # compute loss
        
        # Compute metrics
        predictions = torch.argmax(logits, dim=2)
        tp = ((batch['labels'] == predictions) & (batch['labels'] != model.config.pad_token_id)).sum()
        tn = ((batch['labels'] != predictions) & (batch['labels'] != model.config.pad_token_id) & (predictions != model.config.pad_token_id)).sum()
        total_labels = (batch['labels'] != model.config.pad_token_id).sum()
        recall = tp / total_labels
        acc = (tp + tn) / total_labels
        precision = tp / (predictions != model.config.pad_token_id).sum()
        f1 = (2 * recall * precision) / (recall + precision)

        del batch # free memory

        # Step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Log metrics
        run.log({"loss": loss.item(), "batch": counter+1, 'recall': recall, 'accuracy': acc, 'precision': precision, 'f1': f1})
        print(f'Epoch: [{i+1}/{epoch}], Batch [{counter+1}/{len(ds)}], Loss: {loss.item():.4f}, F1: {f1:.4f}, Recall: {recall:.4f}, Precision: {precision:.4f}, Accuracy: {acc:.4f}')

        if not (counter % val_steps): # validation
            torch.save(model.state_dict(), f'base-{counter}.pt')
            submission_df = submission(path='data/valid.jsonl', output_path='submission.csv')
            answers = pd.read_csv('submission.csv')
            val_score = acc_overall(submission_df.Sentence, answers.Sentence)
            run.log({"validation": wandb.Table(dataframe=submission_df), 'validation score': val_score})
wandb.finish()

VBox(children=(Label(value='0.133 MB of 0.133 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
accuracy,▄▃▆▄▂▄▅▂▅▇▂▃▅▄▆▆▅▆▇▅▁▅▅▆▄█▆▆▂▇▄▄▆▃▇▆▅▃▃▅
batch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
f1,▂▃▄▁▂▅▄▂▆▆▂▄▁▄▅▅▆▇▆▄▁▃▄▄▂▆▅▇▃█▆▆▆▃▇▆▇▄▃▇
loss,▇▃▆▆█▃▅▅▂▄▅▆▅▆▅▆▂▄▂▆▇▅▇▄▅▆▅▄▅▁▂▃▂▄▁▃▂▂▄▁
precision,▃▆▃▂▆▇▅▇▆▄▅▆▁▆▅▄▆▆▄▄▇▃▄▃▃▃▅▇▇▆██▅▅▆▆▇▇▅▇
recall,▄▃▆▄▂▄▅▂▅▇▂▃▅▄▆▆▅▆▇▅▁▅▅▆▄█▆▆▂▇▄▄▆▃▇▆▅▃▃▅
validation score,▁▁▁▁

0,1
accuracy,0.85366
batch,940.0
f1,0.875
loss,0.04966
precision,0.89744
recall,0.85366
validation score,1.0


*save the model*

In [None]:
torch.save(model, f'base_kelimeler-2.pt') # PEFT save function had a bug

## Testing the model on dev set

In [12]:
val_df = submission(path='data/valid.jsonl', output_path='submission.csv')
val_df

100%|██████████| 100/100 [00:02<00:00, 47.75it/s]


Unnamed: 0,ID,Sentence
0,1001147,limeler ince ünlü bulunduran bir fiille birlik...
1,1002343,bu yıl türkiye olarak dördüncü kez katılacağım...
2,1005460,bizim görüşümüz ve uygulamalarımızda nusret f...
3,1002100,ankaralılar mısıra popcorn demez ve onu bir f...
4,1005445,daha sonraları iç içe bir yaşam sürdüreceğim ...
...,...,...
95,1007513,sizce bir faydası olur mu ?
96,1004524,yalçınkaya : bu yatırım sepeti kasım ayında 6...
97,1005141,çok yüksek : filmlerin romanların müthiş bilg...
98,1003252,yenilenen stili ve kullanılan yeni teknolojile...


In [13]:
val_df.Sentence[0] # check the first sentence

'limeler ince ünlü bulunduran bir fiille birlikte kullanılıyorsa bir ulama sonucunda son sesteki bu kalın k sesi de ince okunabilmekte ve hatta gereği yokken birleşik bile yazilabilmektedir . '

In [14]:
answers = pd.read_csv('data/valid_answers.csv')
val = acc_overall(val_df.Sentence, answers.Sentence)
val

yazilabilmektedir yazılabilmektedir
ayinda ayında
duzenlenecek düzenlenecek
turkiye türkiye
odul ödül
cifte çifte
hakkinda hakkında
yazin yazın
beg'de beğ'de
in ın
bilgisayar’dir bilgisayar’dır
gecti geçti
dun dün
tum tüm
uzmanlari uzmanları
yapildi yapıldı
nu nü
in ın
olmasınki olmasınkı
dönüklük donukluk
kubali kübalı
onlarin onların
tercumanliğinı tercümanlığını
yapmis yapmış
yaşın yasin
geçmiste geçmişte
barişmakta barışmakta
gormemistir görmemiştir
siçak sıcak
adı adi
dur. dür.
ornekti örnekti
ağasi ağası
asıretinin aşiretinin
sımdiki şimdiki
adiyaman adıyaman
in ın
adiyaman adıyaman
in ın
yalin yalın
bıcimlerin biçimlerin
degil değil
iô ıô
degil değil
taklıt taklit
yazilmasi yazılması
eşas esas
alinmistir alınmıştır
hayatin hayatın
sacmaliklarla saçmalıklarla
oldugünu olduğunu
duygularin duyguların
unutuldugünu unutulduğunu
m3’un m3’ün
uzerindeki üzerindeki
icin için
(hukumet (hükümet
olmadi olmadı
dogulu doğulu
cakkidı çakkıdı
ğibi gibi
sarkiyla şarkıyla
ortaligı ortalığı
yikar 

0.9202762084118016

## Submission on test set - Real World

In [15]:
submission_df = submission(path='data/test.jsonl', output_path='submission.csv')
submission_df

100%|██████████| 1157/1157 [00:21<00:00, 52.92it/s]


Unnamed: 0,ID,Sentence
0,0,tr ekonomi ve politika haberleri türkiye nin ...
1,1,üye girişi
2,2,son güncelleme 12:12
3,3,İmralı Mit görüşmesi ihtiyaç duyuldukça oluyor
4,4,Suriye deki silahlı selefi muhalifler yeni ku...
...,...,...
1152,1152,Yüreğir Adana ilimize ait şirin bir ilçedir
1153,1153,yüze gülücülüğün at oynattığı bir aydınlar ort...
1154,1154,zavallı adamı oracıkta astılar ve hiç kimse se...
1155,1155,zengin çocuklarına arız münasebetsizlikler fak...


In [None]:
def try_it(text, model=model):
    sample = test_mask([text])
    return rewrite(model, sample[0])

In [None]:
try_it('nasilsin bu aksam, bir seyler icmek ister misin')

'nasılsın bu akşam, bir şeyler içmek ister misin'