# Train end-to-end

This notebook trains the competition models end-to-end. Notebooks for each individual step are also available and describe the steps in much greater detail. Please refer to the `README` for information about the training process.

> **Note about signing in to the Hugging Face Hub:** This notebook requires you to be signed in to the Hugging Face Hub and authenticated to download the `uvci/Koumankan_mt_dyu_fr` dataset. If you are not already signed in, run the cells up to and including the login cell one by one and use the `notebook_login` functionality to sign in. Once you are signed in, you can run all cells.

## Setup environment

Make sure that you are connected to a GPU instance. If you opened the notebook in Colab, connect to a T4 GPU instance. In Kaggle, select the P100 GPU accelerator.

If you opened this notebook in a platform such as Colab or Kaggle, make sure that the following source files from the repo are in your working directory:
 - `data.py`
 - `evaluation.py`
 - `modeling.py`
 - `tokenizers_mod.py`
 - `translation.py`
 - `utils.py`

In Colab, the easiest way to get the files into your working directory is to simply upload them from your local machine.

In Kaggle, to upload files, you have to create a `dataset`. From the notebook editor, click on `Upload`, select `New Dataset`, and upload the files. Once your `dataset` has been created, you must copy the files to your working directory. One way to do that is to add a new cell to the notebook and to run `!cp /kaggle/input/<your-dataset-name>/*py .`

Alternatively, the files can be downloaded from an AWS S3 bucket to your working directory. To do that, uncomment and execute the following cell.

In [None]:
# !wget -O code.zip --no-check-certificate --no-proxy 'https://pfaof7krtww4e.s3.af-south-1.amazonaws.com/code.zip' && unzip code.zip
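
Before going further, it can be worth confirming that the six source files listed above are actually in the working directory. The check below is a minimal sketch; `missing_source_files` is a hypothetical helper and is not part of the repo.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "data.py",
    "evaluation.py",
    "modeling.py",
    "tokenizers_mod.py",
    "translation.py",
    "utils.py",
]


def missing_source_files(workdir="."):
    """Return the required source files that are not present in `workdir`."""
    root = Path(workdir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_source_files(".")
if missing:
    print("Missing source files:", ", ".join(missing))
else:
    print("All source files found.")
```

If anything is reported missing, upload or download the files as described above before running the import cell.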

If you have not installed the dependencies in the `requirements.txt` file in a terminal, you can uncomment the following cell and install the dependencies directly from the notebook. That can be convenient in platforms such as Colab or Kaggle. Make sure that the `requirements.txt` file is in your working directory (see steps above) before running the install here.

> **Restart the kernel after you have installed packages with `pip install` in the notebook cell below.**

In [None]:
# !pip install -q -r requirements.txt

In [2]:
import os
from pathlib import Path
import json
from collections import Counter
import pandas as pd
from tqdm.auto import tqdm
from functools import partial

from datasets import load_dataset
from tokenizers import SentencePieceBPETokenizer
from huggingface_hub import notebook_login
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

import torch
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
from fastai.callback.training import GradientAccumulation

from tokenizers_mod import (
    limit_tokenizer_vocab,
    train_new_tokenizer
)
from utils import preproc, set_seed, cleanup
from data import (
    load_from_json,
    save_data_locally,
    ds2df,
    TranslationDataset,
    create_dataloaders
)
from modeling import (
    load_model,
    instantiate_small_model,
    prune,
    create_learner,
    create_distillation_learner
)
from translation import translate, back_translate
from evaluation import calculate_bleu

## Options

In [3]:
DOWNLOAD_DATA = True
TRAIN_TOKENIZERS = True
TRAIN_DYU_FRA_LARGE = True
TRAIN_FRA_DYU_LARGE = True
CREATE_BACK_TRANSLATIONS = True
TRAIN_DYU_FRA_SMALL = True
TRAIN_DYU_FRA_DISTILLED = False

RANDOM_SEED = 7  # Set `RANDOM_SEED = None` to run without a seed

if RANDOM_SEED is not None:
    set_seed(RANDOM_SEED, reproducible=True)
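
The flags above are not fully independent: later stages read files that earlier stages write (for example, the back-translation cell loads `saved_models/fra-dyu-600M` and `saved_models/dyu-fra-600M`). When switching a stage off, a quick check along the following lines may help. It is only a sketch: `missing_prerequisites` is a hypothetical helper, and the dependency rules are inferred from the paths used in the cells of this notebook.

```python
from pathlib import Path


def missing_prerequisites(flags, root="."):
    """List paths that an enabled stage needs but no enabled earlier stage creates."""
    needed = []
    if not flags["DOWNLOAD_DATA"]:
        needed += [f"data/dataset_{s}.json" for s in ("train", "validation", "test")]
    if not flags["TRAIN_TOKENIZERS"]:
        if flags["TRAIN_DYU_FRA_LARGE"] or flags["TRAIN_FRA_DYU_LARGE"]:
            needed.append("tokenizers/tokenizer_freq1")
        if flags["TRAIN_DYU_FRA_DISTILLED"]:
            needed.append("tokenizers/tokenizer_freq5")
        if flags["TRAIN_DYU_FRA_SMALL"]:
            needed.append("tokenizers/tokenizer_2k")
    if flags["CREATE_BACK_TRANSLATIONS"]:
        # Back-translation loads both fine-tuned 600M models from disk.
        if not flags["TRAIN_FRA_DYU_LARGE"]:
            needed.append("saved_models/fra-dyu-600M")
        if not flags["TRAIN_DYU_FRA_LARGE"]:
            needed.append("saved_models/dyu-fra-600M")
    elif flags["TRAIN_DYU_FRA_SMALL"] or flags["TRAIN_DYU_FRA_DISTILLED"]:
        # The small and distilled models train on back-translated data.
        needed += ["data/dataset_bt_train.json", "data/dataset_bt_test.json"]
    return [p for p in needed if not (Path(root) / p).exists()]


flags = dict(
    DOWNLOAD_DATA=True,
    TRAIN_TOKENIZERS=True,
    TRAIN_DYU_FRA_LARGE=True,
    TRAIN_FRA_DYU_LARGE=True,
    CREATE_BACK_TRANSLATIONS=True,
    TRAIN_DYU_FRA_SMALL=True,
    TRAIN_DYU_FRA_DISTILLED=False,
)
print(missing_prerequisites(flags))  # [] because every producing stage is enabled
```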

## Download data

Do not run any cells beyond the login cell until you are signed in to the Hugging Face Hub.

In [4]:
if DOWNLOAD_DATA:
    notebook_login(new_session=False)

User is already logged in.


In [5]:
if DOWNLOAD_DATA:
    ds = load_dataset("uvci/Koumankan_mt_dyu_fr")
    save_data_locally(ds, save_dir="./data")

README.md:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/530k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/102k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/55.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8065 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1471 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1393 [00:00<?, ? examples/s]

## Train tokenizers

### Tokenizer 1

Pretrained `NllbTokenizer` with vocabulary limited to tokens appearing in the training data

In [6]:
%%time
if TRAIN_TOKENIZERS and (TRAIN_DYU_FRA_LARGE or TRAIN_FRA_DYU_LARGE):
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    limit_tokenizer_vocab(
        model_id="facebook/nllb-200-distilled-600M",
        df=df,
        save_dir="tokenizers/tokenizer_freq1",
        min_token_freq=1
    )

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]



Number of tokens used: 10802
New vocabulary size: 10806
Updated index: 10802: <mask>
Updated index: 10804: dyu_Latn
Updated index: 10805: fra_Latn
Updated index: 10804: dyu_Latn
Updated index: 10805: fra_Latn
Updated index: 10802: <mask>
CPU times: user 1min 27s, sys: 292 ms, total: 1min 27s
Wall time: 1min 27s


### Tokenizer 2

Pretrained `NllbTokenizer` with vocabulary limited to tokens appearing in the training data with a minimum frequency of 5

In [7]:
%%time
if TRAIN_TOKENIZERS and TRAIN_DYU_FRA_DISTILLED:
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    limit_tokenizer_vocab(
        model_id="facebook/nllb-200-distilled-600M",
        df=df,
        save_dir="tokenizers/tokenizer_freq5",
        min_token_freq=5
    )



Number of tokens used: 3705
New vocabulary size: 3709
Updated index: 3705: <mask>
Updated index: 3707: dyu_Latn
Updated index: 3708: fra_Latn
Updated index: 3707: dyu_Latn
Updated index: 3708: fra_Latn
Updated index: 3705: <mask>
CPU times: user 1min 23s, sys: 159 ms, total: 1min 23s
Wall time: 1min 23s
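
Raising the minimum frequency shrinks the vocabulary considerably, as the two outputs above show (10,802 tokens at a minimum frequency of 1 versus 3,705 at 5). The idea of frequency-based filtering can be sketched with a plain `Counter`. This is an illustration only, not the implementation of `limit_tokenizer_vocab`: `select_tokens` is a hypothetical helper and the toy token lists stand in for real tokenizer output.

```python
from collections import Counter


def select_tokens(tokenized_texts, min_token_freq=1):
    """Return the tokens that occur at least `min_token_freq` times in the corpus."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    return sorted(tok for tok, n in counts.items() if n >= min_token_freq)


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
print(select_tokens(corpus, min_token_freq=1))  # ['a', 'b', 'c']
print(select_tokens(corpus, min_token_freq=2))  # ['a', 'b']
```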


### Tokenizer 3

Train a `NllbTokenizer` with a vocabulary of 2,000 tokens from scratch on the training data

In [8]:
%%time
if TRAIN_TOKENIZERS and TRAIN_DYU_FRA_SMALL:
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    train_new_tokenizer(
        model_id="facebook/nllb-200-distilled-600M",
        df=df,
        save_dir="tokenizers/tokenizer_2k",
        vocab_size=2000
    )






CPU times: user 3min, sys: 344 ms, total: 3min
Wall time: 3min


## Fine-tune NLLB-200 600M for Dyula to French translation

The purpose of this step is to train a high-quality model that creates back-translations and serves as the teacher model in knowledge distillation training.

In [9]:
%%time
if TRAIN_DYU_FRA_LARGE:
    SRC_LANG = "dyu"
    TGT_LANG = "fr"

    MODEL_ID = "facebook/nllb-200-distilled-600M"
    TOKENIZER_ID = "tokenizers/tokenizer_freq1"
    MODEL_SAVE_PATH = 'saved_models/dyu-fra-600M'

    src_lang_code = "dyu_Latn" if SRC_LANG == "dyu" else "fra_Latn"
    tgt_lang_code = "dyu_Latn" if TGT_LANG == "dyu" else "fra_Latn"
    print(f"Translation from {src_lang_code} -> {tgt_lang_code}")

    # Load data
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    # Load model
    model, tokenizer = load_model(
        MODEL_ID, tokenizer_id=TOKENIZER_ID, load_tokenizer=True, remap_embeddings=True,
        init_embeds_for_new_tokens=True, src_language=src_lang_code, tgt_language=tgt_lang_code,
    )
    print(f"Memory footprint: {model.get_memory_footprint() / 1024**3 :.2f}GB")

    # Validation BLEU before fine-tuning
    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang=src_lang_code, tgt_lang=tgt_lang_code
    )
    print(
        "Validation BLEU before fine-tuning:",
        calculate_bleu(
            model, df_valid[[SRC_LANG, TGT_LANG]], translate_func, src_lang=src_lang_code, tgt_lang=tgt_lang_code,
            preproc_func=preproc
        )
    )

    # Create dataloaders and fastai learner
    bs, validation_bs = 16, 128
    dls = create_dataloaders(
        df_train[[SRC_LANG, TGT_LANG]], df_valid[[SRC_LANG, TGT_LANG]], tokenizer, bs=bs,
        src_lang=src_lang_code, tgt_lang=tgt_lang_code, preproc_func=preproc,
        max_length=128, validation_bs=validation_bs
    )
    learn = create_learner(
        dls, model, tokenizer, src_lang=src_lang_code, tgt_lang=tgt_lang_code
    )
    accum_steps = max(1, 32//bs)
    learn.fit_one_cycle(10, 5e-5, cbs=GradientAccumulation(accum_steps))

    # Validation BLEU after fine-tuning
    print(
        "Validation BLEU after fine-tuning:",
        calculate_bleu(
            model, df_valid, translate_func, src_lang=src_lang_code, tgt_lang=tgt_lang_code,
            preproc_func=preproc
        )
    )

    model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)
    cleanup()

Translation from dyu_Latn -> fra_Latn


config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]


generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Number of tokens w/o embeddings: 1
model.shared.weight torch.Size([256206, 1024])
model.encoder.embed_tokens.weight torch.Size([256206, 1024])
model.decoder.embed_tokens.weight torch.Size([256206, 1024])
lm_head.weight torch.Size([256206, 1024])
Memory footprint: 1.36GB
Memory footprint: 1.36GB


  0%|          | 0/1471 [00:00<?, ?it/s]

Validation BLEU before fine-tuning: 4.440538477725872


epoch,train_loss,valid_loss,BLEU,time
0,2.906576,2.704631,7.013531,01:13
1,2.391741,2.412951,8.10499,01:13
2,1.980386,2.288856,10.148932,01:13
3,1.704136,2.270944,10.677092,01:14
4,1.484883,2.29423,11.592265,01:13
5,1.373568,2.329874,11.383374,01:14
6,1.223823,2.363985,11.852718,01:13
7,1.161909,2.379085,11.604426,01:13
8,1.132784,2.40102,11.479462,01:14
9,1.147457,2.402054,11.493653,01:13


  0%|          | 0/1471 [00:00<?, ?it/s]

Validation BLEU before fine-tuning: 11.452626482241318
[2024-09-17 15:37:03,219] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Non-default generation parameters: {'max_length': 200}


CPU times: user 11min 14s, sys: 1min 23s, total: 12min 38s
Wall time: 12min 43s
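
The fine-tuning cell above sets `accum_steps = max(1, 32 // bs)`, which reads as aiming for roughly 32 examples per optimizer update when the per-step batch is smaller than that. The arithmetic can be sketched as follows; `accumulation_steps` and `target_bs` are illustrative names, and how fastai's `GradientAccumulation` interprets its argument is worth checking against its documentation rather than inferring from this sketch.

```python
def accumulation_steps(bs, target_bs=32):
    """Number of batches of size `bs` needed to reach about `target_bs` examples."""
    return max(1, target_bs // bs)


for bs in (8, 16, 32, 64):
    steps = accumulation_steps(bs)
    # Intended examples per update if each accumulation step is one full batch.
    print(f"bs={bs:>2}  steps={steps}  examples per update={bs * steps}")
```

With `bs = 16`, as in the cell above, this gives two steps. Batch sizes above the target simply fall back to a single step.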


## Fine-tune NLLB-200 600M for French to Dyula translation

The purpose of this step is to train a high-quality model that creates back-translations.

In [10]:
%%time
if TRAIN_FRA_DYU_LARGE:
    SRC_LANG = "fr"
    TGT_LANG = "dyu"

    MODEL_ID = "facebook/nllb-200-distilled-600M"
    TOKENIZER_ID = "tokenizers/tokenizer_freq1"
    MODEL_SAVE_PATH = 'saved_models/fra-dyu-600M'

    src_lang_code = "dyu_Latn" if SRC_LANG == "dyu" else "fra_Latn"
    tgt_lang_code = "dyu_Latn" if TGT_LANG == "dyu" else "fra_Latn"
    print(f"Translation from {src_lang_code} -> {tgt_lang_code}")

    # Load data
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    # Load model
    model, tokenizer = load_model(
        MODEL_ID, tokenizer_id=TOKENIZER_ID, load_tokenizer=True, remap_embeddings=True,
        init_embeds_for_new_tokens=True, src_language=src_lang_code, tgt_language=tgt_lang_code,
    )
    print(f"Memory footprint: {model.get_memory_footprint() / 1024**3 :.2f}GB")

    # Validation BLEU before fine-tuning
    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang=src_lang_code, tgt_lang=tgt_lang_code
    )
    print(
        "Validation BLEU before fine-tuning:",
        calculate_bleu(
            model, df_valid[[SRC_LANG, TGT_LANG]], translate_func, src_lang=src_lang_code, tgt_lang=tgt_lang_code,
            preproc_func=preproc
        )
    )

    # Create dataloaders and fastai learner
    bs, validation_bs = 16, 128
    dls = create_dataloaders(
        df_train[[SRC_LANG, TGT_LANG]], df_valid[[SRC_LANG, TGT_LANG]], tokenizer, bs=bs,
        src_lang=src_lang_code, tgt_lang=tgt_lang_code, preproc_func=preproc,
        max_length=128, validation_bs=validation_bs
    )
    learn = create_learner(
        dls, model, tokenizer, src_lang=src_lang_code, tgt_lang=tgt_lang_code
    )
    accum_steps = max(1, 32//bs)
    learn.fit_one_cycle(3, 1e-4, cbs=GradientAccumulation(accum_steps))

    # Validation BLEU after fine-tuning
    print(
        "Validation BLEU after fine-tuning:",
        calculate_bleu(
            model, df_valid, translate_func, src_lang=src_lang_code, tgt_lang=tgt_lang_code,
            preproc_func=preproc
        )
    )

    model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)
    cleanup()

Translation from fra_Latn -> dyu_Latn




Number of tokens w/o embeddings: 1
model.shared.weight torch.Size([256206, 1024])
model.encoder.embed_tokens.weight torch.Size([256206, 1024])
model.decoder.embed_tokens.weight torch.Size([256206, 1024])
lm_head.weight torch.Size([256206, 1024])
Memory footprint: 1.36GB
Memory footprint: 1.36GB


  0%|          | 0/1471 [00:00<?, ?it/s]

Validation BLEU before fine-tuning: 2.641426643203745


epoch,train_loss,valid_loss,BLEU,time
0,2.960861,3.052158,6.702611,01:13
1,2.428807,2.920789,7.301053,01:13
2,2.222335,2.909772,7.935876,01:13


  0%|          | 0/1471 [00:00<?, ?it/s]

Non-default generation parameters: {'max_length': 200}


Validation BLEU before fine-tuning: 7.885343659481106
CPU times: user 3min 34s, sys: 23.3 s, total: 3min 58s
Wall time: 3min 58s


## Create back-translations

In [11]:
%%time
if CREATE_BACK_TRANSLATIONS:

    def _save(savedir="./data", df=None, json_data=None, split=""):
        """Write the back-translations for `split` to `savedir` as pipe-separated text and/or JSON."""
        if df is None and json_data is None:
            print("Nothing to save")
            return None

        os.makedirs(savedir, exist_ok=True)

        if df is not None:
            df.to_csv(os.path.join(savedir, f"dataset_bt_{split}.txt"), sep="|", index=False)

        if json_data is not None:
            with open(os.path.join(savedir, f"dataset_bt_{split}.json"), "w") as f:
                json.dump(json_data, f, indent=4)

    # Load data
    ds = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="ds"
    )
    
    # ------------------------------------------------
    # Create back-translations from the training data
    # ------------------------------------------------
    MODEL_ID = './saved_models/fra-dyu-600M'
    DATA_SPLIT = "train"
    SRC_LANG = "fra_Latn"
    TGT_LANG = "dyu_Latn"
    SRC2SRC_SAMPLING = True
    SRC2SRC_MODEL_ID = "facebook/nllb-200-distilled-1.3B"

    tokenizer = NllbTokenizer.from_pretrained(MODEL_ID, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID, device_map="cuda", torch_dtype=torch.bfloat16)
    print(f"Memory footprint: {model.get_memory_footprint() / 1024**3 :.2f}GB")

    if SRC2SRC_SAMPLING and SRC2SRC_MODEL_ID is not None:
        src2src_tokenizer = NllbTokenizer.from_pretrained(SRC2SRC_MODEL_ID, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
        src2src_model = AutoModelForSeq2SeqLM.from_pretrained(SRC2SRC_MODEL_ID, device_map="cuda", torch_dtype=torch.bfloat16)
        print(f"Memory footprint: {src2src_model.get_memory_footprint() / 1024**3 :.2f}GB")

    back_translations = []
    src_translations = []
    for _ in range(15):
        dyu, fra = back_translate(
            ds, model, tokenizer, split=DATA_SPLIT, batch_size=64, src_lang=SRC_LANG,
            tgt_lang=TGT_LANG, sample_src=SRC2SRC_SAMPLING, src2src_model=src2src_model,
            src2src_tokenizer=src2src_tokenizer
        )
        back_translations += dyu
        src_translations += fra
    assert len(back_translations) == len(src_translations)

    df_bt = pd.DataFrame({"dyu": back_translations, "fr": src_translations}).drop_duplicates()
    back_translations_json = {
        "split": DATA_SPLIT,
        "data": [{"ID": 0, "translation": {"dyu": row["dyu"], "fr": row["fr"]}} for _, row in df_bt.iterrows()]
    }
    _save(savedir="./data", df=df_bt, json_data=back_translations_json, split=DATA_SPLIT)
    cleanup()

    # ------------------------------------------------
    # Create back-translations from the validation data
    # ------------------------------------------------
    DATA_SPLIT = "validation"

    back_translations = []
    src_translations = []
    for _ in range(10):
        dyu, fra = back_translate(
            ds, model, tokenizer, split=DATA_SPLIT, batch_size=64, src_lang=SRC_LANG,
            tgt_lang=TGT_LANG, sample_src=SRC2SRC_SAMPLING, src2src_model=src2src_model,
            src2src_tokenizer=src2src_tokenizer
        )
        back_translations += dyu
        src_translations += fra
    assert len(back_translations) == len(src_translations)

    df_bt = pd.DataFrame({"dyu": back_translations, "fr": src_translations}).drop_duplicates()
    back_translations_json = {
        "split": DATA_SPLIT,
        "data": [{"ID": 0, "translation": {"dyu": row["dyu"], "fr": row["fr"]}} for _, row in df_bt.iterrows()]
    }
    _save(savedir="./data", df=df_bt, json_data=back_translations_json, split=DATA_SPLIT)
    cleanup()

    # ------------------------------------------------
    # Create back-translations from the test data
    # ------------------------------------------------
    MODEL_ID = './saved_models/dyu-fra-600M'
    DATA_SPLIT = "test"
    SRC_LANG = "dyu_Latn"
    TGT_LANG = "fra_Latn"
    SRC2SRC_SAMPLING = False
    SRC2SRC_MODEL_ID = None

    tokenizer = NllbTokenizer.from_pretrained(MODEL_ID, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID, device_map="cuda", torch_dtype=torch.bfloat16)
    print(f"Memory footprint: {model.get_memory_footprint() / 1024**3 :.2f}GB")

    back_translations = []
    src_translations = []
    for _ in range(5):
        dyu, fra = back_translate(
            ds, model, tokenizer, split=DATA_SPLIT, batch_size=64, src_lang=SRC_LANG,
            tgt_lang=TGT_LANG, sample_src=SRC2SRC_SAMPLING, src2src_model=src2src_model,
            src2src_tokenizer=src2src_tokenizer
        )
        back_translations += dyu
        src_translations += fra
    assert len(back_translations) == len(src_translations)

    df_bt = pd.DataFrame({"dyu": src_translations, "fr": back_translations}).drop_duplicates()
    back_translations_json = {
        "split": DATA_SPLIT,
        "data": [{"ID": 0, "translation": {"dyu": row["dyu"], "fr": row["fr"]}} for _, row in df_bt.iterrows()]
    }
    _save(savedir="./data", df=df_bt, json_data=back_translations_json, split=DATA_SPLIT)
    cleanup()

Memory footprint: 0.68GB


tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]



config.json:   0%|          | 0.00/808 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.48G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Memory footprint: 2.56GB


  0%|          | 0/8065 [00:00<?, ?it/s]


  0%|          | 0/1471 [00:00<?, ?it/s]

Memory footprint: 0.68GB


  0%|          | 0/1393 [00:00<?, ?it/s]

CPU times: user 26min 59s, sys: 43.5 s, total: 27min 42s
Wall time: 30min 13s
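
The three back-translation blocks above share the same post-processing: pair each Dyula sentence with its French counterpart, drop duplicate pairs, and wrap the result in the `{"split": ..., "data": [...]}` layout. A dependency-free sketch of that step follows. `build_bt_records` is a hypothetical helper, the sentences are placeholders, and the running `ID` is an assumption (the cell above writes `"ID": 0` for every record).

```python
import json


def build_bt_records(dyu_sentences, fr_sentences, split):
    """Pair sentences, drop duplicate pairs (keeping first occurrences), build the JSON layout."""
    assert len(dyu_sentences) == len(fr_sentences)
    # dict.fromkeys removes duplicates while preserving insertion order.
    pairs = list(dict.fromkeys(zip(dyu_sentences, fr_sentences)))
    return {
        "split": split,
        "data": [
            {"ID": i, "translation": {"dyu": dyu, "fr": fr}}
            for i, (dyu, fr) in enumerate(pairs)
        ],
    }


records = build_bt_records(
    ["dyu sentence A", "dyu sentence B", "dyu sentence A"],
    ["fr sentence A", "fr sentence B", "fr sentence A"],
    split="train",
)
print(json.dumps(records, indent=4))
```

Repeated sampling passes can produce the same pair more than once, which is why deduplication happens after all passes have been concatenated.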


## Train small Dyula to French translation model

In [12]:
%%time
if TRAIN_DYU_FRA_SMALL:
    BASE_MODEL_ID = "facebook/nllb-200-distilled-600M"
    TOKENIZER_ID = "tokenizers/tokenizer_2k"

    # Load data
    df = load_from_json(
        train_files=[
            "data/dataset_train.json",
            "data/dataset_bt_train.json",
            "data/dataset_bt_test.json",
            # "data/dataset_validation.json", "data/dataset_bt_validation.json"
        ],
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    # Load model
    tokenizer = NllbTokenizer.from_pretrained(TOKENIZER_ID)
    model = instantiate_small_model(
        BASE_MODEL_ID, dim_factor=4, layer_factor=4, vocab_size=tokenizer.vocab_size,
        encoder_ffn_dim=None, encoder_layers=None, decoder_ffn_dim=None, decoder_layers=None,
        num_hidden_layers=None, d_model=None, max_position_embeddings=None
    )

    # Create dataloaders
    dls = create_dataloaders(
        df_train[["dyu", "fr"]], df_valid[["dyu", "fr"]], tokenizer, bs=128,
        src_lang="dyu_Latn", tgt_lang="fra_Latn", preproc_func=preproc,
        max_length=128
    )

    # Train on back-translated data
    learn = create_learner(dls, model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn")
    learn.fit_sgdr(5, 1, cycle_mult=2, lr_max=1e-3)
    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU after initial training",
        calculate_bleu(
            model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    model.save_pretrained('./tmp')
    tokenizer.save_pretrained('./tmp')

    # Fine-tune on the original training data
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    dls = create_dataloaders(
        df_train[["dyu", "fr"]], df_valid[["dyu", "fr"]], tokenizer, bs=256,
        src_lang="dyu_Latn", tgt_lang="fra_Latn", preproc_func=preproc,
        max_length=128
    )
    learn = create_learner(dls, model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn")
    learn.fit_flat_cos(10, 1e-4)

    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU after further fine-tuning",
        calculate_bleu(
            model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )

    print("Converting to bf16...")
    model.bfloat16()
    print(f"Memory footprint: {model.get_memory_footprint() / 1024**3 :.2f}GB")

    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU for bf16 model",
        calculate_bleu(
            model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )

    MODEL_SAVE_PATH = 'saved_models/nllb-dyu-fr-10MB'
    model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)
    print("Model saved to", MODEL_SAVE_PATH)
    cleanup()

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Memory footprint: 0.02GB


epoch,train_loss,valid_loss,BLEU,time
0,3.178445,3.616487,2.359799,00:37
1,2.299672,3.30413,3.938671,00:37
2,1.958335,3.212257,4.733772,00:38
3,1.831303,3.357568,4.603349,00:38
4,1.513231,3.317283,4.383167,00:38
5,1.318357,3.272649,5.153862,00:40
6,1.263743,3.261696,5.452816,00:39
7,1.459558,3.439089,4.734892,00:38
8,1.313965,3.439996,4.862956,00:38
9,1.203292,3.454323,4.932612,00:38


  0%|          | 0/1471 [00:00<?, ?it/s]

Non-default generation parameters: {'max_length': 200}


Validation BLEU after initial training 6.5007973468296845


epoch,train_loss,valid_loss,BLEU,time
0,0.680522,3.751438,7.336107,00:04
1,0.649479,3.752925,7.561769,00:04
2,0.622414,3.748319,7.756585,00:04
3,0.595922,3.755126,7.731476,00:04
4,0.57451,3.759532,7.677297,00:04
5,0.548271,3.768267,7.9477,00:04
6,0.528486,3.770392,7.475109,00:04
7,0.510657,3.779884,7.660944,00:04
8,0.491862,3.784925,7.821357,00:04
9,0.476568,3.786598,7.834554,00:04


  0%|          | 0/1471 [00:00<?, ?it/s]

Validation BLEU after further fine-tuning 7.71826143445664
Converting to bf16...
Memory footprint: 0.01GB


  0%|          | 0/1471 [00:00<?, ?it/s]

Non-default generation parameters: {'max_length': 200}


Validation BLEU for bf16 model 7.670348861677615
Model saved to saved_models/nllb-dyu-fr-10MB
CPU times: user 20min 50s, sys: 14.5 s, total: 21min 5s
Wall time: 21min 8s
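
The drop in reported memory footprint from 0.02GB to 0.01GB after `model.bfloat16()` is consistent with halving the storage per parameter: 4 bytes in float32 versus 2 bytes in bfloat16. A rough estimate that ignores buffers can be sketched as below. The parameter count is hypothetical, since the notebook does not print the actual number, and `footprint_mib` is an illustrative helper.

```python
# Storage per parameter for the two dtypes involved.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def footprint_mib(n_params, dtype):
    """Approximate parameter storage in MiB for `n_params` values of `dtype`."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**2


n_params = 5_000_000  # hypothetical parameter count, for illustration only
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {footprint_mib(n_params, dtype):.2f} MiB")
```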


## Train a small Dyula to French translation model through knowledge distillation

This is a three-step process:
 1. Fine-tune NLLB-200-600M with a modified tokenizer that has a smaller vocabulary.
 2. Use the model from step 1 as the teacher. The student is a pruned version of the teacher in which half of the encoder and decoder layers and half of the neurons in the linear layers are dropped.
 3. Repeat step 2, using the student from step 2 as the new teacher.
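
The pruning calls in the cell below pass `dim_strategy="alternate"` and `layer_strategy="alternate"` with a factor of 2. One plausible reading is "keep every second layer and every second neuron". The sketch below illustrates only that index selection. It is an assumption about what "alternate" means, not the implementation of `prune`; `keep_alternate` is a hypothetical helper, and the layer count of 12 is used purely as an example.

```python
def keep_alternate(n_items, factor=2):
    """Indices retained when keeping every `factor`-th item, starting at index 0."""
    return list(range(0, n_items, factor))


# Example: 12 layers reduced by a factor of 2, then by 2 again in the repeated step.
first_pass = keep_alternate(12)
second_pass = keep_alternate(len(first_pass))
print(first_pass)   # [0, 2, 4, 6, 8, 10]
print(second_pass)  # [0, 2, 4] (positions within the already-pruned model)

# Example: a hidden dimension of 1024 halved to 512.
print(len(keep_alternate(1024)))  # 512
```

Applied twice, as in steps 2 and 3, a factor of 2 leaves a quarter of the original layers and a quarter of the original width.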

In [14]:
%%time
if TRAIN_DYU_FRA_DISTILLED:
    # ------------------------------------------------
    # STEP 1 - Fine-tune NLLB-200-600M with modified tokenizer
    # ------------------------------------------------
    MODEL_ID = "facebook/nllb-200-distilled-600M"
    TOKENIZER_ID = "tokenizers/tokenizer_freq5"

    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    # Load model
    model, tokenizer = load_model(
        MODEL_ID, load_tokenizer=True, tokenizer_id=TOKENIZER_ID, remap_embeddings=True, init_embeds_for_new_tokens=True,
        old_tokenizer_id=MODEL_ID, src_language="dyu_Latn", tgt_language="fra_Latn"
    )

    # Create dataloaders and train
    dls = create_dataloaders(
        df_train[["dyu", "fr"]], df_valid[["dyu", "fr"]], tokenizer, bs=32,
        src_lang="dyu_Latn", tgt_lang="fra_Latn", preproc_func=preproc,
        max_length=128
    )
    learn = create_learner(dls, model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn")#, wd=1e-3)
    learn.fit_flat_cos(10, lr=1e-4, div_final=100_000.0, pct_start=0.75)

    translate_func = partial(
        translate, model=model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU after step 1 fine-tune",
        calculate_bleu(
            model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    MODEL_SAVE_PATH = 'saved_models/dyu-fra-vocab-freq5'
    model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)
    del model
    cleanup()

    # ------------------------------------------------
    # STEP 2 - Down-scale and train again
    # ------------------------------------------------
    df = load_from_json(
        train_files=[
            "data/dataset_train.json",
            "data/dataset_bt_train.json",
            "data/dataset_bt_test.json",
            # "data/dataset_validation.json", "data/dataset_bt_validation.json"
        ],
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"]== "validation"].copy()
    df_test = df[df["split"]== "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    # Load the teacher model:
    BASE_MODEL_ID = "saved_models/dyu-fra-vocab-freq5"
    tokenizer = NllbTokenizer.from_pretrained(BASE_MODEL_ID)
    teacher_model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_ID, device_map="cuda")
    print(f"Memory footprint: {teacher_model.get_memory_footprint() / 1024**3 :.2f}GB")

    # Create the student model as a pruned version of the teacher:
    student_model = prune(
        teacher_model, size_factor=2, layer_size_factor=2, dim_strategy="alternate",
        layer_strategy="alternate"
    )

    # Create dataloaders and train
    dls = create_dataloaders(
        df_train[["dyu", "fr"]], df_valid[["dyu", "fr"]], tokenizer, bs=256,
        src_lang="dyu_Latn", tgt_lang="fra_Latn", preproc_func=preproc,
        max_length=128
    )
    learn = create_distillation_learner(
        dls, student_model, teacher_model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    # SGDR with warm restarts: 5 cycles of 1, 2, 4, 8 and 16 epochs (31 epochs in total).
    learn.fit_sgdr(5, 1, cycle_mult=2, lr_max=1e-3)

    translate_func = partial(
        translate, model=student_model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU for step 2 student model",
        calculate_bleu(
            student_model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    # Save the step 2 student to a temporary directory; it is reloaded as the teacher in step 3.
    student_model.save_pretrained('./tmp')
    tokenizer.save_pretrained('./tmp')
    del teacher_model, student_model
    cleanup()

    # ------------------------------------------------
    # STEP 3 - Down-scale for the last time
    # ------------------------------------------------
    # Load the step 2 student as the new teacher and prune it again to create a smaller student:
    BASE_MODEL_ID = "./tmp"
    tokenizer = NllbTokenizer.from_pretrained(BASE_MODEL_ID)
    teacher_model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_ID, device_map="cuda")
    print(f"Memory footprint: {teacher_model.get_memory_footprint() / 1024**3 :.2f}GB")
    student_model = prune(
        teacher_model, size_factor=2, layer_size_factor=2, dim_strategy="alternate",
        layer_strategy="alternate"
    )

    # Create `Learner` and train:
    learn = create_distillation_learner(
        dls, student_model, teacher_model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    # Same SGDR schedule as in step 2 (31 epochs in total), reusing the step 2 dataloaders.
    learn.fit_sgdr(5, 1, cycle_mult=2, lr_max=1e-3)

    translate_func = partial(
        translate, model=student_model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU for step 3 student model",
        calculate_bleu(
            student_model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    MODEL_SAVE_PATH = 'saved_models/nllb-dyu-fr-distilled'
    student_model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)

    # Finally, fine-tune on the original training data only:
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
    df_train = df[df["split"] == "train"].copy()
    df_valid = df[df["split"] == "validation"].copy()
    df_test = df[df["split"] == "test"].copy()
    assert len(df_train) + len(df_valid) + len(df_test) == len(df)

    dls = create_dataloaders(
        df_train[["dyu", "fr"]], df_valid[["dyu", "fr"]], tokenizer, bs=128,
        src_lang="dyu_Latn", tgt_lang="fra_Latn", preproc_func=preproc,
        max_length=128
    )
    learn = create_learner(dls, student_model, tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn")
    learn.fit_flat_cos(10, 1e-5)

    translate_func = partial(
        translate, model=student_model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU after further fine-tuning",
        calculate_bleu(
            student_model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    MODEL_SAVE_PATH = 'saved_models/nllb-dyu-fr-distilled-final'
    student_model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)

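    # Convert the weights to bfloat16 in place. If they are currently stored as fp32,
    # this should roughly halve the memory footprint printed below.
    student_model.bfloat16()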
    print(f"Memory footprint: {student_model.get_memory_footprint() / 1024**3 :.2f}GB")
    translate_func = partial(
        translate, model=student_model, tokenizer=tokenizer, src_lang="dyu_Latn", tgt_lang="fra_Latn"
    )
    print(
        "Validation BLEU after converting to bf16",
        calculate_bleu(
            student_model, df_valid, translate_func, src_lang="dyu_Latn", tgt_lang="fra_Latn",
            preproc_func=preproc
        )
    )
    MODEL_SAVE_PATH = 'saved_models/nllb-dyu-fr-distilled-final-bf16'
    student_model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs
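
The cell above relies on two learning-rate schedules, `fit_flat_cos` and `fit_sgdr`. The following is a minimal pure-Python sketch of the arithmetic behind them, assuming that `learn` is a fastai `Learner` (the method names suggest this, but `create_learner` is defined in the repo and is not shown here). The helper names `sgdr_cycle_lengths` and `flat_cos_lr` are hypothetical and exist only for illustration; they are not part of the repo or of fastai.

```python
import math


def sgdr_cycle_lengths(n_cycles, cycle_len, cycle_mult=2):
    """Epochs per cycle when each cycle is `cycle_mult` times longer than the previous one."""
    return [cycle_len * cycle_mult**i for i in range(n_cycles)]


def flat_cos_lr(progress, lr, div_final=100_000.0, pct_start=0.75):
    """Learning rate at `progress` in [0, 1]: flat at `lr`, then cosine-annealed to lr / div_final."""
    if progress < pct_start or pct_start >= 1:
        return lr
    # Position within the annealing phase, rescaled to [0, 1].
    pos = (progress - pct_start) / (1 - pct_start)
    lr_end = lr / div_final
    return lr_end + (lr - lr_end) * (1 + math.cos(math.pi * pos)) / 2


# Arguments used in steps 2 and 3: fit_sgdr(5, 1, cycle_mult=2, ...)
cycles = sgdr_cycle_lengths(5, 1, cycle_mult=2)
print(cycles, sum(cycles))

# Arguments used in step 1: fit_flat_cos(10, lr=1e-4, div_final=100_000.0, pct_start=0.75)
for p in (0.0, 0.5, 0.75, 0.875, 1.0):
    print(f"progress={p:.3f} lr={flat_cos_lr(p, lr=1e-4):.3e}")
```

Under these assumptions, each distillation step trains for 1 + 2 + 4 + 8 + 16 = 31 epochs, and the step 1 fine-tune keeps the learning rate at `1e-4` for the first 7.5 of its 10 epochs before annealing it towards `1e-9`.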
