<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/wip/fine_tune_nllb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TODO: use an actual translation model if fine-tuning NLLB. The current data is good for fine-tuning a CLM model.

# Fine-Tuning Facebook's NLLB Model with a New Language using Hugging Face Transformers

In this IPython notebook, we demonstrate how to fine-tune Facebook's NLLB (No Language Left Behind) model with a new language using the Hugging Face Transformers library. NLLB is a multilingual sequence-to-sequence model built for machine translation; the `nllb-200` checkpoints cover 200 languages. By fine-tuning it on data in a new language, we aim to extend its translation capabilities to that language.

The notebook covers the following steps:

- Set up the Colab environment and install necessary packages.
- Import required libraries.
- Clone the biblical machine learning repo and access the dataset.
- Preprocess the data.
- Fine-tune the NLLB model using Hugging Face Transformers.
- Evaluate the fine-tuned model.
- Create a translation pipeline instance and test it with sample text.

By following these steps, you will learn how to adapt the NLLB model to a new language. Feel free to modify the notebook to suit your needs or to experiment with other datasets and model configurations. We hope that it serves as a useful starting point for your own projects.

## Get Started

To get started, run the code cells in this notebook in order. We encourage you to experiment, ask questions, and share your results with the community.

To set up the Colab environment and install the necessary packages, run the following code block:

In [1]:
# Quote the version specifier so the shell does not treat '>' as an output redirect
!pip install "transformers>=4.27" torch datasets sentencepiece

Now, import the required libraries:

In [2]:
# AutoTokenizer loads the tokenizer that matches model_name; the model class is imported in the training cell
from transformers import AutoTokenizer
# load_dataset reads the plain-text files into a Hugging Face dataset
from datasets import load_dataset

Next, set the name of the model to fine-tune so that the `transformers` library loads the matching tokenizer and model throughout the notebook:

In [3]:
model_name = 'facebook/nllb-200-distilled-600M'

Clone the biblical machine learning repo and access the dataset:

In [4]:
!git clone https://github.com/ryderwishart/biblical-machine-learning.git


Cloning into 'biblical-machine-learning'...
remote: Enumerating objects: 3300, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 3300 (delta 12), reused 11 (delta 3), pack-reused 3267
Receiving objects: 100% (3300/3300), 229.38 MiB | 11.09 MiB/s, done.
Resolving deltas: 100% (523/523), done.
Updating files: 100% (2759/2759), done.


Load the dataset, then split it into training data (85%), testing/validation data (10.5%), and evaluation data (4.5%):

In [5]:
# Load the dataset
dataset = load_dataset('text', data_files='./biblical-machine-learning/data/texts/*.txt')
dataset

Resolving data files:   0%|          | 0/901 [00:00<?, ?it/s]

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-cfda968b67fb369d/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-cfda968b67fb369d/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1358068
    })
})

In [6]:
# Split the dataset into train (85%) and a held-out portion (15%); the seed makes the shuffle reproducible
split_dataset = dataset['train'].train_test_split(test_size=0.15, seed=42)

# Split the held-out portion once into test (70%) and eval (30%).
# Calling train_test_split twice would shuffle twice, so the two sets could overlap.
held_out = split_dataset['test'].train_test_split(test_size=0.3, seed=42)
split_dataset['test'], split_dataset['eval'] = held_out['train'], held_out['test']

# Print the sizes of the datasets
print(f"Train set size: {len(split_dataset['train'])}")
print(f"Test set size: {len(split_dataset['test'])}")
print(f"Eval set size: {len(split_dataset['eval'])}")

Train set size: 1154357
Test set size: 142597
Eval set size: 61114
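
The sizes printed above follow from the two split fractions: 15% of the rows are held out, and 30% of that held-out portion becomes the eval set. The sketch below reproduces the arithmetic in plain Python. It assumes the held-out count is rounded up and the remainder is kept, which agrees with the printed sizes; `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out

total_rows = 1358068  # num_rows reported when the dataset was loaded

# First split: keep 85% for training, hold out 15%
n_train, n_held_out = split_sizes(total_rows, 0.15)
# Second split: 70% of the held-out rows for test, 30% for eval
n_test, n_eval = split_sizes(n_held_out, 0.3)

for name, n in [("train", n_train), ("test", n_test), ("eval", n_eval)]:
    print(f"{name}: {n} rows ({n / total_rows:.1%})")
```

As fractions of the full dataset, the three sets come to roughly 85%, 10.5%, and 4.5%.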


Preprocess the data by tokenizing the text in each of the three splits:

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

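def tokenize_function(examples):
    # max_length=128 is an assumed cap, not a tuned value; without it,
    # padding='max_length' pads every example to the tokenizer's model_max_length
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)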

tokenized_train_dataset = split_dataset['train'].map(tokenize_function, batched=True)
tokenized_test_dataset = split_dataset['test'].map(tokenize_function, batched=True)
tokenized_eval_dataset = split_dataset['eval'].map(tokenize_function, batched=True)


Map:   0%|          | 0/1154357 [00:00<?, ? examples/s]

Map:   0%|          | 0/142597 [00:00<?, ? examples/s]

Map:   0%|          | 0/61114 [00:00<?, ? examples/s]
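
Because `padding='max_length'` pads every example to one fixed length, the size of the tokenized dataset scales with that length rather than with the length of the text. The arithmetic below is a rough sketch of how many token positions the training split would occupy at two candidate lengths. Both lengths (1024 and 128) are illustrative assumptions, not values read from the tokenizer; check `tokenizer.model_max_length` for the actual limit. `padded_token_slots` is a hypothetical helper.

```python
def padded_token_slots(n_examples, padded_length):
    """Token positions stored per tokenized column after fixed-length padding."""
    return n_examples * padded_length

n_train_examples = 1154357  # training split size printed earlier

# Hypothetical padded lengths, chosen only for comparison
for padded_length in (1024, 128):
    slots = padded_token_slots(n_train_examples, padded_length)
    print(f"padded_length={padded_length}: {slots:,} token positions per column")
```

Each tokenized column (`input_ids`, `attention_mask`) stores that many positions, so a shorter padded length or per-batch padding can cut preprocessing time and disk use considerably.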

Fine-tune the NLLB model using Hugging Face Transformers:

In [8]:
from transformers import AutoModelForSeq2SeqLM, TrainingArguments, Trainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4, 
    evaluation_strategy='epoch',
    logging_dir='./logs',
    learning_rate=5e-5,
    weight_decay=0.01,
    # gradient_accumulation_steps=4,  # Add gradient accumulation steps
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
)

trainer.train()




ValueError: ignored
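
The error message above is truncated, so its exact cause cannot be confirmed from the output. One plausible cause, stated here as an assumption, is that the tokenized datasets contain no `labels` column: a sequence-to-sequence model needs target token IDs to compute a loss, and the dataset here holds only monolingual text. The sketch below shows one common way to prepare labels from already-padded target IDs, replacing padding with -100 so that the loss skips those positions. `build_labels`, the example IDs, and the pad ID of 1 are all hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(target_ids, pad_token_id):
    """Copy padded target token IDs, replacing padding with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in row]
        for row in target_ids
    ]

# Hypothetical batch of two padded target sequences; 1 stands in for the pad ID
example_targets = [
    [10, 11, 12, 2, 1, 1],
    [10, 13, 2, 1, 1, 1],
]
labels = build_labels(example_targets, pad_token_id=1)
print(labels)
```

With real parallel data, the `DataCollatorForSeq2Seq` class in Transformers can apply the same -100 padding to labels per batch, which avoids storing padded labels in the dataset.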

Evaluate the fine-tuned model:

In [None]:
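# Compute the loss on the dataset passed to the Trainer as eval_dataset
evaluation_results = trainer.evaluate()

print(f"Validation loss: {evaluation_results['eval_loss']}")

# Save the model weights and the tokenizer to the same directory so they can be reloaded together
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')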


Create a translation pipeline instance and test it with some sample text:

In [None]:
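from transformers import pipeline
import random

# NLLB needs explicit source and target language codes (FLORES-200 style).
# 'eng_Latn' and 'fra_Latn' are placeholders; replace them with the languages of your data.
translation_pipeline = pipeline('translation', model=model, tokenizer=tokenizer, src_lang='eng_Latn', tgt_lang='fra_Latn')

# Pick one random row from the held-out eval set; each row is a dict with a 'text' field
sample_text = tokenized_eval_dataset[random.randrange(len(tokenized_eval_dataset))]['text']
translated_text = translation_pipeline(sample_text)[0]['translation_text']
print(f"Translated text: {translated_text}")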

In [None]:
email = "ryder.wishart@clear.bible"
print(f"Please contact us at {email} for any assistance.")