# Fine-tuning a model for Transliteration of Polish Names and Toponyms

## Introduction
Transliteration is a common task in natural language processing (NLP) and machine translation. Here, we will use Polish, a Slavic language. Many of its words come from Proto-Slavic, the ancestor of all Slavic languages. In many Slavic languages (Serbian, Bulgarian, etc.), these words are written in Cyrillic; however, speakers of Polish and some other languages use the Latin script. This creates unusual spellings: for one, the word *świekier* (English: *father-in-law*) comes from `*svekrъ`, which is also the common ancestor of the Russian *свёкр* (pronounced as *svjokr*), Serbian *свѐкар* (pronounced as *svekar*), and Slovak *svokor*.

The same is true for names and toponyms. For example, there is a Polish name, *Bogumił*. You may think it can simply be transliterated to English as *Bogomil*, but that is not the whole story. Polish *Bogumił* consists of two old Slavic roots: *Bog* (God) and *mił* (beloved, adored). The name means *beloved by God* and is usually rendered in English as *Theodore*, a name with Ancient Greek roots that means *gift of God*.

The original Ancient Greek name *Θεόδωρος* (Theódoros) entered the modern European languages in two ways:

- By transliteration: Spanish *Teodoro*, French *Théodore*, Italian *Teodoro*, and Serbian *Теодор* (*Teodor*)
- By translation: Spanish *Diosdado*, French *Dieudonné*, Italian *Donato*, and Serbian *Божидар* (*Božidar*)

Many European languages have both sibling names (transliterated and translated), whereas English has only one form: *Theodore*. Thus, rendering Polish *Bogumił* as *Theodore* is a translation with a slight semantic shift: instead of *beloved by God*, it means *gift of God*. These and many other nuances make Polish a good language for fine-tuning a transliteration model.

## Table of Contents
1. [Installation](#installation)
2. [Preparing a Corpus](#preparing-a-corpus)
3. [Tokenization](#tokenization)
4. [Fine-tuning the Model](#fine-tuning)
5. [Transliteration](#transliteration)

## Installation
Before we begin, we need to install the necessary libraries.

In [None]:
!pip install sentencepiece
!pip install pandas
!pip install lxml
!pip install datasets
!pip install transformers
!pip install scikit-learn
!pip install huggingface_hub
!pip install torch
!pip install evaluate
!pip install rouge_score
!pip install nltk
!pip install numpy
!pip install accelerate
!pip install wandb # for logging and monitoring the training process

## Preparing a corpus
Fine-tuning lets you adapt a pre-trained language model to a specific domain or task, improving relevance, reducing inference cost, and injecting proprietary knowledge. The first step is to find a suitable dataset. Here, we'll build our own transliteration corpus of Polish names and toponyms from open sources:

- A Wiktionary page: [Appendix:Polish_given_names](https://en.wiktionary.org/wiki/Appendix:Polish_given_names). It has both feminine and masculine names.
- A PDF file with multiple tables: [Toponymic Guidelines of Poland](https://www.gov.pl/web/ksng-en/toponymic-guidelines-of-poland). You will need only one of them: the country names table on pages 45-51.

You can download the files for the datasets from here:
- [polish_names.csv](./datasets/polish_names.csv)
- [toponyms_of_poland.csv](./datasets/toponyms_of_poland.csv).

Next, you will need to combine them into one dataset, not forgetting to post-process it too:

- When an explanation of the meaning is given instead of the equivalent in the `English` column, replace the value with `None`. You can easily locate these cases. They will include the `=` sign, for instance, `dob = + sław = 'fame, glory, renown'`.
- Each cell must contain just one name. If there are two or more in the `English` column, select the first one. For example, if we have two equivalent names *(Janet, Jeanette)*, leave *Janet*.
- There should be no extra symbols in the cell. Some country names have the following form: `Zjednoczone Emiraty\rArabskie`. Replace `\r` with a single space, so that the cell contains only *Zjednoczone Emiraty Arabskie*.
- Don't forget to create a new, default index for the combined DataFrame (via `ignore_index` or `reset_index`).
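
The clean-up rules above could be sketched as two small helpers. This is a minimal sketch, not the project's reference solution: the function names `clean_polish` and `clean_english` are hypothetical, and it assumes that alternative names are separated by commas and that the stray `\r` may appear either as a real carriage return or as a literal backslash followed by `r`.

```python
def _normalize_spaces(text):
    """Turn stray carriage returns (real or literal '\\r') into single spaces."""
    return " ".join(text.replace("\\r", " ").split())


def clean_polish(value):
    """Clean a cell of the `Polish` column."""
    return _normalize_spaces(str(value))


def clean_english(value):
    """Clean a cell of the `English` column; return None when no equivalent exists."""
    # Missing cells: None, or NaN (the only value that is not equal to itself)
    if value is None or value != value:
        return None
    text = _normalize_spaces(str(value))
    # Cells containing '=' explain the meaning instead of giving an equivalent
    if not text or "=" in text:
        return None
    # Assumption: alternatives are comma-separated; keep only the first one
    return text.split(",")[0].strip()


print(clean_english("Janet, Jeanette"))
print(clean_english("dob = + sław = 'fame, glory, renown'"))
print(clean_polish("Zjednoczone Emiraty\rArabskie"))
```

With pandas, the helpers might be applied after combining the two frames, for example `pd.concat([names, toponyms], ignore_index=True)` followed by `df['English'].map(clean_english)` and `df['Polish'].map(clean_polish)`; `names` and `toponyms` are placeholder variable names for the two loaded CSV files.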

This is what your dataset should look like at this point:
```text
                                Polish                 English
0                                 Adam                    Adam
1                               Adrian                  Adrian
2                               Albert                  Albert
3                                Albin                   Albin
4                           Aleksander               Alexander
..                                 ...                     ...
631                     Wyspy Salomona         Solomon Islands
632  Wyspy Świętego Tomasza i Książęca   São Tomé and Príncipe
633                             Zambia                  Zambia
634                           Zimbabwe                Zimbabwe
635       Zjednoczone Emiraty Arabskie    United Arab Emirates

[636 rows x 2 columns]
```
This dataset is relatively small and may not be enough for effective fine-tuning; good results would require a much larger one. For educational purposes, however, it is sufficient to demonstrate the fine-tuning process.

In [None]:
# write your code here


## Tokenization

Once you've created the dataset, the next step is to convert it to the Hugging Face dataset format. But first, split it into *train* and *test* sets with `train_test_split` from `sklearn`, using `test_size=0.078` and `random_state=42`.
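
A split with these settings might look like the sketch below. The DataFrame here is a synthetic stand-in with the same number of rows as the corpus (the placeholder values are made up); in the notebook you would pass your own combined DataFrame instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined corpus (636 rows, same two columns)
df = pd.DataFrame({
    "Polish": [f"polish_{i}" for i in range(636)],
    "English": [f"english_{i}" for i in range(636)],
})

# Hold out 7.8% of the rows; a fixed random_state makes the split reproducible
train, test = train_test_split(df, test_size=0.078, random_state=42)

print(len(train), len(test))
```

The split keeps the original row labels, so `Dataset.from_pandas` stores them in an extra `__index_level_0__` column unless you reset the index first.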

Next, transform the two sets into a Hugging Face dataset. You can use the following method:

```python
from datasets import DatasetDict, Dataset


dataset = DatasetDict({'train': Dataset.from_pandas(train),
                       'test': Dataset.from_pandas(test)})

```
After this, you may push the dataset to your personal Hugging Face Hub account. Now, let's turn to tokenization, often the trickiest part of fine-tuning a Transformer. For both tokenization and fine-tuning, you will use the [T5-base](https://huggingface.co/google-t5/t5-base) (`t5-base`) model. When loading its tokenizer, you may need to set `model_max_length=512`.

Next, we need to define a custom `preprocess_function()` for tokenization. Here is an example:

```python
max_input_length = 5  # Some toponyms contain 2-5 tokens
max_target_length = 5


def preprocess_function(data):
    # Tokenize the Polish source strings
    model_inputs = tokenizer(data['Polish'], max_length=max_input_length, truncation=True, padding=True)
    # Tokenize the English targets; their token IDs become the labels
    labels = tokenizer(data['English'], max_length=max_target_length, truncation=True, padding=True)
    model_inputs['labels'] = labels['input_ids']

    return model_inputs

```
At this point, ensure that `data['Polish']` and `data['English']` contain only string values. The `None` values should be represented as strings too.
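
One common refinement, which is not part of the example above, is to keep padding out of the loss. Because the labels are padded with the tokenizer's pad token (ID 0 for T5), those positions would otherwise be treated as targets; replacing them with -100 makes the loss function ignore them. The helper name `mask_label_padding` is hypothetical.

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding token IDs in each label sequence with `ignore_index`."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Toy label batch: 0 is the pad ID and 1 the end-of-sequence ID in T5's vocabulary
batch = [[71, 1, 0, 0], [5, 6, 7, 1]]
print(mask_label_padding(batch, pad_token_id=0))
```

Inside `preprocess_function()`, this could be applied as `model_inputs['labels'] = mask_label_padding(labels['input_ids'], tokenizer.pad_token_id)`.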

Finally, map the defined function both to the train and test sets and print their structure. While mapping, use `batched=True`.

Here is an example of the structure of the train and test sets that you should get at the end of this stage (figures may vary):

```python
Dataset({
    features: ['Polish', 'English', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 600
})
Dataset({
    features: ['Polish', 'English', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 35
})
```

In [None]:
# write your code here


## Fine-tuning

Now, you will finally fine-tune your model! First, load the model itself with the `AutoModelForSeq2SeqLM` class. Apart from the training set, you will need to set up three more classes: `Seq2SeqTrainingArguments`, `DataCollatorForSeq2Seq`, and `Seq2SeqTrainer`.

The first positional argument of `Seq2SeqTrainingArguments` is `output_dir`, the directory where checkpoints are saved; by default, it also gives your future model its name on the Hub. Then, define the following arguments:
- `eval_strategy="epoch"`: evaluate the model after each epoch
- `learning_rate=3e-6`: learning rate for the optimizer
- `num_train_epochs=3`: total number of training epochs (you can modify this as needed)
- `weight_decay=0.01`: amount of weight decay (regularization) applied to the model's weights
- `predict_with_generate=True`: use `generate()` during evaluation so that text metrics such as ROUGE can be computed
- `fp16=True`: use 16-bit floating point (mixed) precision for training
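
Collected in one place, these arguments might look like the sketch below. The helper `build_training_kwargs` and the directory name `t5-base-pl-names` are hypothetical; the dictionary is meant to be unpacked into `Seq2SeqTrainingArguments`.

```python
def build_training_kwargs(output_dir, num_train_epochs=3, use_fp16=True):
    """Return keyword arguments for `Seq2SeqTrainingArguments`."""
    return {
        "output_dir": output_dir,          # where checkpoints are written
        "eval_strategy": "epoch",          # evaluate after each epoch
        "learning_rate": 3e-6,
        "num_train_epochs": num_train_epochs,
        "weight_decay": 0.01,
        "predict_with_generate": True,     # needed for ROUGE during evaluation
        "fp16": use_fp16,                  # set to False when training without a GPU
    }


training_kwargs = build_training_kwargs("t5-base-pl-names", num_train_epochs=3)
# In the notebook: training_args = Seq2SeqTrainingArguments(**training_kwargs)
print(training_kwargs)
```

Older releases of `transformers` name the first option `evaluation_strategy` instead of `eval_strategy`, so the key may need adjusting to match your installed version.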

> Fine-tuning can take a lot of time and compute resources. To achieve good results with the model we use here and the relatively small dataset, we need to train for at least 50 epochs. For educational purposes, in this project, you can try with around 3-5 epochs (you can use the T4 GPU option for faster training — ~10 minutes).

You will also need to compute the ROUGE score while training. For this, define two Python functions: `postprocess_text()` and `compute_metrics()`. In `compute_metrics()`, you need to extract the following metrics: ROUGE1, ROUGE2, ROUGEL, ROUGELSUM, and GEN_LEN (the mean length of the text generated on the validation set). These metrics will help you evaluate the quality of the model's output during training.

The `postprocess_text()` function only handles post-processing. The model's predictions arrive as token IDs, not as words or sentences, so they must first be decoded into text. The most important part is the `compute_metrics()` function; it allows us to apply the ROUGE metric (or any other metric) to our validation set.
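
One possible shape for these two functions is sketched below. It is not the reference solution: the factory `make_compute_metrics` is a hypothetical helper that receives the tokenizer and the metric explicitly, the metric is assumed to be the object returned by `evaluate.load("rouge")`, and `postprocess_text()` only strips whitespace, which is assumed to be enough for single names.

```python
import numpy as np


def postprocess_text(preds, labels):
    """Strip surrounding whitespace from decoded predictions and references."""
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    return preds, labels


def make_compute_metrics(tokenizer, metric):
    """Build a `compute_metrics` function bound to a tokenizer and a ROUGE metric."""

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        # -100 marks ignored positions; swap it for the pad ID before decoding
        preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

        # Expected keys from the ROUGE metric: rouge1, rouge2, rougeL, rougeLsum
        result = metric.compute(
            predictions=decoded_preds, references=decoded_labels, use_stemmer=True
        )
        result = {key: round(float(value) * 100, 4) for key, value in result.items()}

        # Mean number of non-padding tokens in the generated sequences
        lengths = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
        result["gen_len"] = float(np.mean(lengths))
        return result

    return compute_metrics


# In the notebook (assumption):
# compute_metrics = make_compute_metrics(tokenizer, evaluate.load("rouge"))
```

Passing the tokenizer and metric in as arguments keeps the function free of global state, which also makes it easy to try out with small stand-in objects before starting a long training run.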

As an additional step, use [Weights and Biases](https://wandb.ai/site) (`wandb`) for monitoring. All you need to do is create an account there, get the API key, and set the key and project in your notebook:
```python
from google.colab import userdata
import os
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY') # set in secrets
os.environ["WANDB_PROJECT"] = "my-project" # use any name
```
Then, in your training arguments, set `report_to="wandb"`. It will then track:
- training/validation loss
- learning rate
- system metrics (GPU usage, RAM, etc.)

Now, define the `Seq2SeqTrainer` and start training. Checkpoints are written to the `output_dir` set in `Seq2SeqTrainingArguments`, according to the save strategy. To make sure the final weights are stored once fine-tuning is complete, call the trainer's `save_model()` method. You'll be able to use this model later for inference or further fine-tuning without the need to re-download it.

To avoid losing the files, you can push the model to Hugging Face Hub. For that, you need to create an account on Hugging Face and set your `HF_TOKEN`:
```python
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
```
Then, set `push_to_hub=True` in the `Seq2SeqTrainingArguments` or use the `push_to_hub()` method of the `Seq2SeqTrainer` class. Your token will be used to authenticate you with Hugging Face Hub, allowing you to push the model to your personal repository.

In [None]:
# write your code here


## Transliteration
In the final stage of this project, you will test how well your fine-tuned model anglicizes Polish names, using the test set you created earlier. The `pipeline()` function from the `transformers` library loads both the model and the tokenizer:

```python
from transformers import pipeline

model_name = "/content/model/"  # the output dir or your model in HF hub if you pushed it there
model = pipeline('text2text-generation', model=model_name, tokenizer=model_name)

```
Print the anglicized names and toponyms. Here's an example of what you'd expect to see (the output may vary):

```text
Original: Tobiasz, Anglicized: Tobias
Original: Lubomir, Anglicized: Lubomir
Original: Holandia, Anglicized: Holand
Original: Łotwa, Anglicized: otwa
Original: Celestyn, Anglicized: Celestin
Original: Roland, Anglicized: Roland
```
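
Lines in this format could be produced with a small loop like the sketch below. The helper names `anglicize` and `format_results` are hypothetical; `generator` is assumed to be a `text2text-generation` pipeline (or any callable with the same interface), and the `max_new_tokens` value is an arbitrary choice.

```python
def anglicize(names, generator, max_new_tokens=16):
    """Run a text2text-generation pipeline (or a compatible callable) over `names`."""
    names = list(names)
    outputs = generator(names, max_new_tokens=max_new_tokens)
    results = []
    for name, output in zip(names, outputs):
        # Some configurations return a list of candidates per input; take the first
        if isinstance(output, list):
            output = output[0]
        results.append((name, output["generated_text"]))
    return results


def format_results(results):
    """Format (original, anglicized) pairs as printable lines."""
    return [
        f"Original: {original}, Anglicized: {anglicized}"
        for original, anglicized in results
    ]
```

In the notebook, this might be called as `print("\n".join(format_results(anglicize(dataset['test']['Polish'], model))))`, where `model` is the pipeline created earlier.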


In [None]:
# write your code here


As noted earlier, the model may not always produce the expected results. For example, it may anglicize *Łotwa* as *otwa* instead of *Latvia*. This is because the model was trained on a small dataset and may not have learned the correct mapping for all names and toponyms. However, it should still be able to produce reasonable results compared to the base T5 model.

If you want to improve the model's performance, you can try fine-tuning it on a larger dataset or using a different model architecture. You can also experiment with different hyperparameters, such as the learning rate and batch size, to see if they affect the model's performance.

Finally, you can also try other models from the Hugging Face Hub that are fine-tuned for similar tasks. Models trained on larger datasets are likely to produce better results than the one you just fine-tuned. The fine-tuning process for these (and other) models is similar to what we did here, with a few adjustments to the dataset and hyperparameters.

Try using the [sdadas/mt5-base-translator-pl-en](https://huggingface.co/sdadas/mt5-base-translator-pl-en) model, which is a fine-tuned version of the mT5 model for Polish to English translation tasks. You can load it in the same way as shown above and test it on your dataset.