# Language Translation with mT5 or T5 

## English to German with T5
* T5 https://huggingface.co/docs/transformers/model_doc/t5
* Translation intro course https://huggingface.co/learn/nlp-course/chapter7/4?fw=tf
* Two-way translation with T5 (discussion): https://stackoverflow.com/questions/66797042/using-googles-t5-for-translation-from-german-to-english
* Model capability output example: https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/blob/main/Chapter07/Summarizing_Text_with_T5.ipynb


## German to English with mT5
* mT5 https://huggingface.co/docs/transformers/model_doc/mt5
* MT5ForConditionalGeneration https://huggingface.co/docs/transformers/model_doc/mt5#transformers.MT5ForConditionalGeneration
* mT5 (Transformers v4.9.2 docs) https://huggingface.co/transformers/v4.9.2/model_doc/mt5.html

In [23]:
# `!` runs a shell command from IPython and returns its output lines
list=!nvidia-smi -L
for line in list:
    print(line)

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-51f84540-9ebb-1d44-7bb7-3c62ae55c20e)
  MIG 2g.20gb     Device  0: (UUID: MIG-f1e32298-70d4-52fc-9b1d-21a178d44529)


In [24]:
import re

def get_device_uuid(line: str) -> str:
    """Return the UUID from one line of `nvidia-smi -L` output, or "" if none is found."""
    # Raw string (r'...') so the backslash in \s reaches the regex engine unchanged
    match = re.search(r'UUID:\s(.+?)\)', line)
    # re.search returns None when the line has no "UUID: ...)" part
    return match.group(1) if match else ""

# skip the first GPU ID, only get the MIG IDs, using python list slice over index access
uuid_list = [get_device_uuid(e) for e in list[1:]]
# print(uuid_list)
UUIDs = ",".join(uuid_list)
print(UUIDs)

MIG-f1e32298-70d4-52fc-9b1d-21a178d44529
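
Slicing with `list[1:]` relies on the first line being the only physical GPU. On a host with several GPUs the output would interleave GPU and MIG lines, so a variant that filters on the `MIG-` prefix may be more robust. The following is a minimal sketch under that assumption; `mig_uuids` is a hypothetical helper name, and `SAMPLE` simply repeats the `nvidia-smi -L` output shown above.

```python
import re
from typing import List

# Sample text copied from the `nvidia-smi -L` output shown above.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-51f84540-9ebb-1d44-7bb7-3c62ae55c20e)
  MIG 2g.20gb     Device  0: (UUID: MIG-f1e32298-70d4-52fc-9b1d-21a178d44529)
"""

def mig_uuids(lines: List[str]) -> List[str]:
    """Collect only the UUIDs that start with "MIG-", whatever line they are on."""
    found = []
    for line in lines:
        # Capture everything from "MIG-" up to the closing parenthesis
        match = re.search(r"UUID:\s(MIG-[^)]+)\)", line)
        if match:
            found.append(match.group(1))
    return found

mig_list = mig_uuids(SAMPLE.splitlines())
print(",".join(mig_list))
```

The joined string can be assigned to `CUDA_VISIBLE_DEVICES` in the same way as `UUIDs`.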


In [25]:
import os, time, sys
from platform import python_version
os.environ["WORLD_SIZE"] = "1" 
os.environ["CUDA_VISIBLE_DEVICES"] = UUIDs # "0,1,2"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512" #512
display_architecture=True

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(python_version())

MIG-f1e32298-70d4-52fc-9b1d-21a178d44529
3.8.10


In [26]:
# set the model download cache directory
# DATA_ROOT="/data"
DATA_ROOT="/home/jovyan/llm-models"
os.environ['XDG_CACHE_HOME']=f"{DATA_ROOT}/core-kind/yinwang/models"

model_map = {
    "small": "google/mt5-small",  # 1.2 GB
    "base": "google/mt5-base",    # 2.33 GB
    "large": "google/mt5-large",  # 4.9 GB
    "xl": "google/mt5-xl",        # 15 GB
    "xxl": "google/mt5-xxl",      # 51.7 GB
    "custom": "Helsinki-NLP/opus-mt-de-en",  # MarianMT German-to-English model, not mT5
}

In [27]:
# model_type = "xl"
# model_type = "small"
model_type = "custom"
# fall back to the small checkpoint name (not the key "small") for unknown types
model_name = model_map.get(model_type, model_map["small"])

print(model_name)

Helsinki-NLP/opus-mt-de-en


In [28]:
import transformers
# T5
# from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

# mT5
# from transformers import MT5Model, MT5ForConditionalGeneration, MT5TokenizerFast, MT5Config

In [31]:
# tokenizer = T5Tokenizer.from_pretrained(model_name, model_max_length=512)
# tokenizer = MT5TokenizerFast.from_pretrained(model_name)
# type(tokenizer)

In [32]:
from transformers import pipeline

In [33]:
translator = pipeline(
    "translation",
    model=model_name,  # "Helsinki-NLP/opus-mt-de-en" when model_type == "custom"
    # torch_dtype=torch.float16,
    # device_map="auto",
    device=0,  # first device listed in CUDA_VISIBLE_DEVICES
)

In [34]:
# model = T5ForConditionalGeneration.from_pretrained(model_name)
# model = MT5ForConditionalGeneration.from_pretrained(model_name)

NameError: name 'MT5ForConditionalGeneration' is not defined

In [35]:
# type(model)

In [36]:
if display_architecture:
    # the pipeline exposes the loaded model; its config describes the architecture
    print(translator.model.config)

In [75]:
# if display_architecture == True:
#    print(model)

## Settings

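* `max_new_tokens`: not set in the `generate` call, so generation falls back to the model-agnostic default `max_length` of 20 and emits the warning below.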

```console
/home/jovyan/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1254: UserWarning: Using the model-agnostic default `max_length` (=20) to control thegeneration length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
```
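
The warning recommends passing `max_new_tokens` to `generate`. A minimal sketch of doing so is shown here; `generate_text` is a hypothetical helper name, the default of 64 is an arbitrary choice, and `model` and `tokenizer` are assumed to be a Transformers seq2seq model and its matching tokenizer loaded elsewhere.

```python
from typing import Any

def generate_text(model: Any, tokenizer: Any, text: str, max_new_tokens: int = 64) -> str:
    """Tokenize `text`, generate with an explicit token budget, and decode the first result."""
    inputs = tokenizer(text, return_tensors="pt")
    # An explicit max_new_tokens replaces the model-agnostic max_length=20 default
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call would look like `generate_text(model, tokenizer, "translate German to English: Das Haus ist wunderbar.")`. The inputs are not moved to a GPU here, so this sketch assumes model and inputs sit on the same device.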


In [77]:
from util.gpu_utils import GPUInfoHelper

gpu_info_helper = GPUInfoHelper()
# task_prefix = "translate English to German: "
task_prefix = "translate German to English: "
# task_prefix = "übersetze Deutsch zum Englisch: "

def translate_gen(
    model: transformers.PreTrainedModel,
    tokenizer: transformers.PreTrainedTokenizerBase,
    info: GPUInfoHelper,
    task_prefix: str = "translate English to German: ",
    max_new_tokens: int = 64,
):
    """Build a single-sentence translate function bound to a model and tokenizer.

    Args:
      task_prefix: T5-style instruction prepended to every input
      max_new_tokens: control the maximum length of the generation
        (64 is an arbitrary default chosen here)
    """

    def local(text: str) -> str:
        """Translate a single input string; batch input is not supported."""
        start = time.time()

        sentence = task_prefix + text

        # keep the input tensor on the same device as the model
        input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(model.device)
        outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
        result = tokenizer.decode(outputs[0], skip_special_tokens=True)

        duration = time.time() - start
        print("-"*20)
        print(f"walltime: {duration} in secs.")
        info.gpu_usage()

        return result
    return local

translate = translate_gen(model, tokenizer, info=gpu_info_helper, task_prefix=task_prefix)

In [78]:
text = "Das Haus ist wunderbar."
# text = "The house is wonderful."

In [79]:
translate(text)

--------------------
walltime: 0.5482296943664551 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


'<extra_id_0>'
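
`<extra_id_0>` is one of the sentinel tokens that T5 and mT5 use during span-corruption pre-training, not a translation. The mT5 documentation notes that mT5 was pre-trained only on mC4 without supervised training and has to be fine-tuned before it is usable on a downstream task. That would explain this output, assuming the `model` in use was a raw `google/mt5-*` checkpoint (the cell that loaded it is not shown). The sketch below flags such outputs; `is_sentinel_only` is a hypothetical helper name.

```python
import re

# One or more sentinel tokens such as <extra_id_0>, optionally separated by whitespace
SENTINEL_ONLY = re.compile(r"\s*(?:<extra_id_\d+>\s*)+")

def is_sentinel_only(text: str) -> bool:
    """Return True when the text contains nothing but sentinel tokens."""
    return SENTINEL_ONLY.fullmatch(text) is not None

print(is_sentinel_only("<extra_id_0>"))
print(is_sentinel_only("The house is wonderful."))
```

The `translator` pipeline created earlier loads `Helsinki-NLP/opus-mt-de-en`, a checkpoint trained for German to English, so calling `translator("Das Haus ist wunderbar.")` is likely the more direct route for this sentence.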