Saturday, February 18, 2023

This notebook will contain numerous examples of code extracted from the 'Translation' sub-section under the 'Natural Language Processing' portion of https://huggingface.co/models?sort=downloads

In [None]:
# Cell 19 throws an error. I kept its output and ran the remainder of the notebook.
# The total run time below is accurate.
# Run Date: Saturday, February 18, 2023
# Run Time: 00:26:37

In [1]:
# Do this at the beginning so we can run the notebook in one pass ...
from huggingface_hub import notebook_login

# Training in cell 17 throws an error if this is set to True ... meh ... 
pushToHub = False

if pushToHub:
    notebook_login()

Start this notebook from the next cell, to run it in one pass.

In [2]:
import time
from datetime import date

startTime = time.time()
todaysDate = date.today()

In [3]:
# only target the 2070 Super ...
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
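
One way to keep this kind of environment setup in a single place is sketched below. It assumes the variables are set before any library initialises CUDA or the tokenizers. `TOKENIZERS_PARALLELISM` is the variable named in the fork warning that `huggingface/tokenizers` prints near the end of this notebook, and the helper name `configure_environment` is made up for illustration.

```python
import os


def configure_environment(gpu_index: int = 0) -> dict:
    """Set the environment variables this notebook relies on.

    Hypothetical helper: call it before importing torch or transformers
    so the settings are in place when CUDA is initialised.
    """
    settings = {
        # Expose only one GPU to CUDA; inside the process it appears as device 0.
        "CUDA_VISIBLE_DEVICES": str(gpu_index),
        # Setting this explicitly avoids the "process just got forked" warning.
        "TOKENIZERS_PARALLELISM": "false",
    }
    os.environ.update(settings)
    return settings


applied = configure_environment(gpu_index=0)
print(applied)
```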

## https://huggingface.co/docs/transformers/tasks/translation

This is a walkthrough of the page above. When I first tried a simple copy and paste, some parts were missing, such as the definition of which model to use.

### Load OPUS Books dataset

In [4]:
from datasets import load_dataset

books = load_dataset("opus_books", "en-fr")

Found cached dataset opus_books (/home/rob/Data2/huggingface/datasets/opus_books/en-fr/1.0.0/e8f950a4f32dc39b7f9088908216cd2d7e21ac35f893d04d39eb594746af2daf)


  0%|          | 0/1 [00:00<?, ?it/s]

The code above does not define the model, so let's use the code from https://huggingface.co/t5-small as an example of how this is done.

When I first ran this, it complained that 'T5Tokenizer requires the SentencePiece library but it was not found in your environment', so I installed it and re-ran the cell.

In [5]:
#!pip install sentencepiece

In [6]:
# https://huggingface.co/t5-small
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")

input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state


2023-02-18 16:38:12.997351: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-18 16:38:13.589808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 16:38:13.589859: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

In [7]:
books = books["train"].train_test_split(test_size=0.2)

In [8]:
books["train"][0]

{'id': '120603',
 'translation': {'en': '" But she suddenly uttered a shrill cry; cold hands had seized her by the neck.',
  'fr': 'Mais elle eut un cri rauque: des mains froides venaient de la prendre au cou.'}}

### Preprocess

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
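
To see what `preprocess_function` receives when `batched=True`, here is a small sketch that leaves out the tokenizer. The toy batch and the helper name `build_pairs` are assumptions made for illustration; only the list comprehensions are taken from the cell above.

```python
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

# Toy batch in the layout `datasets` passes to a batched map function:
# a dict of columns, where "translation" is a list of {"en": ..., "fr": ...} dicts.
toy_batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Good morning.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}


def build_pairs(examples):
    """Return the (inputs, targets) lists that would be handed to the tokenizer."""
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return inputs, targets


inputs, targets = build_pairs(toy_batch)
print(inputs)
print(targets)
```

Every source sentence gets the task prefix, while the targets are passed through unchanged.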

In [11]:
tokenized_books = books.map(preprocess_function, batched=True)

  0%|          | 0/102 [00:00<?, ?ba/s]

  0%|          | 0/26 [00:00<?, ?ba/s]
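
The two progress bars count batches, not rows. As a rough check, the totals of 102 and 26 can be reproduced under three assumptions: `map` uses its default batch size of 1000, the training split has the 101668 rows reported in the training log further down, and the 80/20 split is exact.

```python
import math

train_rows = 101668            # "Num examples" from the training log in this notebook
test_rows = train_rows // 4    # assumption: exact 80/20 split, so test = train / 4
batch_size = 1000              # assumption: default batch size of Dataset.map(batched=True)

# The last, partially filled batch still counts, hence the ceiling.
train_batches = math.ceil(train_rows / batch_size)
test_batches = math.ceil(test_rows / batch_size)

print(train_batches, test_batches)
```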

In [12]:
from transformers import DataCollatorForSeq2Seq

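# Note: at this point `model` is still the T5Model loaded in cell 6;
# the seq2seq model used for training is only loaded in cell 15.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)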

### Evaluate

In [13]:
import evaluate

sacrebleu = evaluate.load("sacrebleu")

In [14]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # The original code failed on this next line, because it does not know what 'metric' is ...
    # result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = sacrebleu.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
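
The `-100` handling and the `gen_len` count in `compute_metrics` can be illustrated without NumPy or a tokenizer. The sketch below does the same thing on plain lists. The token ids and the helper names `unmask_labels` and `generated_length` are made up for the example, and the pad id is assumed to be 0, as it is for T5.

```python
IGNORE_INDEX = -100   # label value the loss ignores at padded positions
PAD_TOKEN_ID = 0      # assumption: pad id of the tokenizer (0 for T5)


def unmask_labels(labels):
    """Replace IGNORE_INDEX with the pad id so the ids can be decoded."""
    return [
        [tok if tok != IGNORE_INDEX else PAD_TOKEN_ID for tok in row]
        for row in labels
    ]


def generated_length(pred):
    """Count the non-pad tokens in one prediction, as gen_len does."""
    return sum(1 for tok in pred if tok != PAD_TOKEN_ID)


# Made-up token ids, purely for illustration.
labels = [[71, 12, 1, -100, -100], [8, 9, 10, 11, 1]]
preds = [[0, 71, 12, 1, 0], [0, 8, 9, 1, 0]]

clean_labels = unmask_labels(labels)
mean_gen_len = sum(generated_length(p) for p in preds) / len(preds)
print(clean_labels, mean_gen_len)
```

Without the replacement step, decoding would be handed `-100`, which is not a valid token id.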

In [15]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

Up to this point, we have not engaged the GPU. The next cell grabs 918MiB.

In [16]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=pushToHub, # This was True, but I set it to false, because I don't want to login to huggingface
)                          # So yeah, I now control this at the beginning of the notebook

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# trainer.train()

Using cuda_amp half precision backend


The next cell grabs 5310MiB of GPU memory.

In [17]:
%%time
# will this suppress the ridiculous amount of stuff that gets output when calling this method?? Let's see, shall we ...
# Nice! Yeah, it does! Good to know, right?! ...
# Hmm ... actually, Nope! ... it still outputs stuff ... sigh.
deleteThisCrap = trainer.train()

# This cell outputs a ridiculous amount of text ...
# CPU times: user 25min 42s, sys: 17.4 s, total: 26min
# Wall time: 25min 55s

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, translation. If id, translation are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 101668
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 12710
  Number of trainable parameters = 60506624
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.8674,1.62848,5.5363,17.6283
2,1.8084,1.60586,5.7242,17.6117


Saving model checkpoint to my_awesome_opus_books_model/checkpoint-500
Configuration saved in my_awesome_opus_books_model/checkpoint-500/config.json
Configuration saved in my_awesome_opus_books_model/checkpoint-500/generation_config.json
Model weights saved in my_awesome_opus_books_model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in my_awesome_opus_books_model/checkpoint-500/tokenizer_config.json
Special tokens file saved in my_awesome_opus_books_model/checkpoint-500/special_tokens_map.json
Copy vocab file to my_awesome_opus_books_model/checkpoint-500/spiece.model
Deleting older checkpoint [my_awesome_opus_books_model/checkpoint-11500] due to args.save_total_limit
Saving model checkpoint to my_awesome_opus_books_model/checkpoint-1000
Configuration saved in my_awesome_opus_books_model/checkpoint-1000/config.json
Configuration saved in my_awesome_opus_books_model/checkpoint-1000/generation_config.json
Model weights saved in my_awesome_opus_books_model/checkpoint-1000/pyt

CPU times: user 25min 56s, sys: 17 s, total: 26min 13s
Wall time: 26min 9s
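
The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes a checkpoint interval of 500 steps, which is consistent with the `checkpoint-500` and `checkpoint-1000` directories in the log, and takes `save_total_limit=3` from the training arguments.

```python
import math

num_examples = 101668      # "Num examples" from the log
batch_size = 16            # per_device_train_batch_size, one GPU, no accumulation
num_epochs = 2
save_steps = 500           # assumption: checkpoint interval, inferred from the log
save_total_limit = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# A checkpoint is written every `save_steps`; only the newest few are kept.
all_checkpoints = list(range(save_steps, total_steps + 1, save_steps))
kept = all_checkpoints[-save_total_limit:]

print(steps_per_epoch, total_steps, kept)
```

The total matches the 12710 optimization steps in the log. The oldest checkpoint that would survive a full run is 11500, the same `checkpoint-11500` that the log deletes when `checkpoint-500` is written, which suggests it was left over from an earlier run of this notebook.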


In [18]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

The next cell has the line ...

translator = pipeline("translation", model="my_awesome_opus_books_model")

... which does not work. Is this because I chose NOT to upload the model to Hugging Face? I'm going to re-run this, but this time I WILL upload the model, to see if it works.

In [19]:
from transformers import pipeline

translator = pipeline("translation", model="my_awesome_opus_books_model")
translator(text)

OSError: my_awesome_opus_books_model does not appear to have a file named config.json. Checkout 'https://huggingface.co/my_awesome_opus_books_model/None' for available files.
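
One plausible explanation for the `OSError`, going by the training log above, is that the `Trainer` only wrote `checkpoint-N` subdirectories, so `my_awesome_opus_books_model` itself has no `config.json`. Under that assumption, either calling `trainer.save_model()` after training or pointing `pipeline` at one of the checkpoint directories should avoid the error. The sketch below finds the highest-numbered checkpoint. `latest_checkpoint` is a hypothetical helper, not part of `transformers`, and the demo directories are throwaway stand-ins.

```python
import os
import re
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-N subdirectory with the largest N, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically so checkpoint-12500 beats checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a temporary directory that mimics the Trainer's output layout.
demo_dir = tempfile.mkdtemp()
for step in (500, 1000, 12500):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
print(latest_checkpoint(demo_dir).name)
```

With the real output directory, the call in cell 19 would then look something like `pipeline("translation", model=str(latest_checkpoint("my_awesome_opus_books_model")))`.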

In [21]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Sat Feb 18 17:05:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 36%   57C    P0    N/A /  70W |    576MiB /  2048MiB |      9%      Default |
|                               |            

In [20]:
endTime = time.time()

elapsedTime = time.strftime("%H:%M:%S", time.gmtime(endTime - startTime))

print(todaysDate.strftime('# Run Date: %A, %B %d, %Y'))
print(f"# Run Time: {elapsedTime}")

# Run Date: Saturday, February 18, 2023
# Run Time: 00:26:37
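
One caveat with `time.strftime("%H:%M:%S", time.gmtime(...))` is that the hours field wraps around after 24 hours. That does not matter for a 26-minute run, but for longer jobs a `divmod`-based formatter avoids it. A minimal sketch, with `format_elapsed` as a made-up helper name:

```python
def format_elapsed(seconds: float) -> str:
    """Format a duration as HH:MM:SS without wrapping at 24 hours."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_elapsed(26 * 60 + 37))     # the run time of this notebook
print(format_elapsed(25 * 3600 + 5))    # a duration longer than one day
```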
