# Language Translation: German to English

## English to German with T5
* T5 https://huggingface.co/docs/transformers/model_doc/t5
* Translation intro course https://huggingface.co/learn/nlp-course/chapter7/4?fw=tf
* Two way translation with T5 discussion: https://stackoverflow.com/questions/66797042/using-googles-t5-for-translation-from-german-to-english
* Model capability output example: https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/blob/main/Chapter07/Summarizing_Text_with_T5.ipynb


## German to English with mT5
* https://huggingface.co/docs/transformers/model_doc/mt5
* https://huggingface.co/docs/transformers/model_doc/mt5#transformers.MT5ForConditionalGeneration
* https://huggingface.co/transformers/v4.9.2/model_doc/mt5.html

## German to English custom model
* https://stackoverflow.com/questions/66797042/using-googles-t5-for-translation-from-german-to-english

## MarianMT 
* MarianMT (translation models that share the BART architecture): https://huggingface.co/docs/transformers/model_doc/marian

In [1]:
# Note: `list` shadows the Python builtin; the name is kept because a later cell reads it.
list=!nvidia-smi -L
for line in list:
    print(line)

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-51f84540-9ebb-1d44-7bb7-3c62ae55c20e)
  MIG 2g.20gb     Device  0: (UUID: MIG-0efc9f06-6dca-5886-98af-0273ca7fde51)


In [2]:
import re

def get_device_uuid(line: str) -> str:
    """Return the UUID from one line of `nvidia-smi -L` output, or "" if none is found."""
    # Raw string, so the backslashes reach the regex engine unchanged.
    match = re.search(r'UUID:\s(.+?)\)', line)
    return match.group(1) if match else ""

# Skip the first line (the parent GPU) with a list slice and keep only the MIG device UUIDs.
uuid_list = [get_device_uuid(e) for e in list[1:]]
# print(uuid_list)
UUIDs = ",".join(uuid_list)
print(UUIDs)

MIG-0efc9f06-6dca-5886-98af-0273ca7fde51
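
The UUID extraction above can also be sketched without IPython magics, which makes it easy to check outside a notebook. This is a minimal sketch, not the notebook's method: the sample text is copied from the output of cell 1, and `parse_mig_uuids` is a hypothetical helper name. It assumes MIG UUIDs always start with `MIG-`, as they do in that output.

```python
import re
from typing import List

# Sample `nvidia-smi -L` output, copied from cell 1 of this notebook.
SAMPLE = """GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-51f84540-9ebb-1d44-7bb7-3c62ae55c20e)
  MIG 2g.20gb     Device  0: (UUID: MIG-0efc9f06-6dca-5886-98af-0273ca7fde51)
"""

def parse_mig_uuids(text: str) -> List[str]:
    """Return every MIG UUID found in the text, in order of appearance.

    Hypothetical helper: it matches only UUIDs that start with "MIG-", so
    parent GPU lines are skipped by the pattern rather than by list slicing.
    """
    return re.findall(r"UUID:\s*(MIG-[0-9a-fA-F-]+)\)", text)

mig_uuids = parse_mig_uuids(SAMPLE)
print(",".join(mig_uuids))
```

Filtering by prefix instead of slicing off the first line may be more robust on hosts with several physical GPUs, where GPU and MIG lines alternate.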


In [3]:
import os, time, sys
from platform import python_version
os.environ["WORLD_SIZE"] = "1" 
os.environ["CUDA_VISIBLE_DEVICES"] = UUIDs # "0,1,2"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512" #512
display_architecture=True

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(python_version())

MIG-0efc9f06-6dca-5886-98af-0273ca7fde51
3.8.10


In [4]:
# set the model download cache directory
# DATA_ROOT="/data"
DATA_ROOT="/home/jovyan/llm-models"
os.environ['XDG_CACHE_HOME']=f"{DATA_ROOT}/core-kind/yinwang/models"

model_map = {
   "small": "google/mt5-small", # 1.2 GB
   "base" : "google/mt5-base", # 2.33 GB
   "large" : "google/mt5-large", # 4.9 GB,
   "xl" : "google/mt5-xl", # 15 GB
   "xxl" : "google/mt5-xxl", # 51.7 GB,
   "custom": "Helsinki-NLP/opus-mt-de-en", 
}

In [5]:
model_type = "custom"
# Fall back to the small mT5 checkpoint when the key is unknown.
model_name = model_map.get(model_type, model_map["small"])

print(model_name)

Helsinki-NLP/opus-mt-de-en


In [6]:
from transformers import pipeline
import transformers

In [7]:
'''
device_map="auto" does not work with the "Helsinki-NLP/opus-mt-de-en" translation model,
so the pipeline is placed on an explicit GPU with device=0
(the first device visible through CUDA_VISIBLE_DEVICES).
'''
generator = pipeline(
    "translation", 
    model=model_name,
    # device_map="auto",
    device=0,
)

In [8]:
type(generator)

transformers.pipelines.text2text_generation.TranslationPipeline

## Settings

In [9]:
from util.gpu_utils import GPUInfoHelper

gpu_info_helper = GPUInfoHelper()
# task_prefix = "translate English to German: "
# task_prefix = "translate German to English: "
# task_prefix = "übersetze Deutsch zum Englisch: "
# Reference: https://huggingface.co/docs/transformers/model_doc/marian
def translate_gen(
    generator: transformers.pipelines.text2text_generation.TranslationPipeline, 
    info: GPUInfoHelper,
):  
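    """Build a translate function bound to the given pipeline and GPU info helper.

    Args:
      generator: translation pipeline that runs the model
      info: helper that prints GPU memory usage after each call
    """
    
    def local(sentences, max_length=400) -> list:
        """Translate the input, print wall time and GPU usage, return the pipeline result.

        Args:
          sentences: a string (or a list of strings) to translate
          max_length: maximum length of the generated translation, in tokens
        """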
        start = time.time()
        
        result = generator(
            sentences, 
            max_length=max_length,
            # return_tensors="pt"
        )
        
        end = time.time()
        duration = end - start
        print("-"*20)
        print(f"walltime: {duration} in secs.")
        info.gpu_usage()
        
        return result
    return local    

translate = translate_gen(generator, gpu_info_helper)

In [10]:
input="Das Haus ist wunderbar."

In [11]:
# translate() prints its own wall time; a bare %timeit line would measure nothing.
translate(input, max_length=1000)

--------------------
walltime: 16.0162136554718 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.310547 GB
Allocated memory : 0.285861 GB
Free      memory : 0.024686 GB
--------------------


[{'translation_text': 'The house is wonderful.'}]
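
The call above took about 16 s, while the per-chunk calls later in this notebook take roughly 0.4 to 0.6 s each. A single timing like the first one often includes one-time start-up cost, so it may not reflect steady-state speed (that is an assumption here, not something this notebook measures). A minimal sketch of timing several repeated calls separately follows; `time_calls` and `fake_translate` are hypothetical names, and `fake_translate` merely stands in for `lambda: translate("Das Haus ist wunderbar.")`.

```python
import time
from typing import Any, Callable, List, Tuple

def time_calls(fn: Callable[[], Any], repeats: int = 3) -> Tuple[Any, List[float]]:
    """Call fn `repeats` times; return the last result and each duration in seconds."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, suited to short intervals
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, durations

def fake_translate():
    """Stand-in for the real pipeline call, so the sketch runs without a GPU."""
    time.sleep(0.01)
    return [{"translation_text": "The house is wonderful."}]

result, durations = time_calls(fake_translate, repeats=3)
print(f"first call : {durations[0]:.3f} s")
print(f"later calls: {[round(d, 3) for d in durations[1:]]}")
```

Reporting the first call separately from the rest keeps any warm-up cost from skewing the average.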

In [12]:
from util.pdf_text_loader import PDFHelper
# DATA_ROOT="/home/jovyan/llm-models"
DATA_SUBDIR="core-kind/yinwang/data/medreports"
print(f"{DATA_ROOT}/{DATA_SUBDIR}")
loader = PDFHelper(data_folder = f"{DATA_ROOT}/{DATA_SUBDIR}", file_pattern="KK-SCIVIAS-*.pdf")

/home/jovyan/llm-models/core-kind/yinwang/data/medreports


In [13]:
loader.file_path_list

['/home/jovyan/llm-models/core-kind/yinwang/data/medreports/KK-SCIVIAS-00004-0054584394-2021-01-17.pdf',
 '/home/jovyan/llm-models/core-kind/yinwang/data/medreports/KK-SCIVIAS-00004-0051726752-2015-12-17.pdf']

In [14]:
# There are two test files; choose the PDF to translate by its list index.
# file_idx = 0
file_idx = 1

In [15]:
context = loader.read_pdf(file_idx)

In [16]:
loader.count_token(file_idx)

file: /home/jovyan/llm-models/core-kind/yinwang/data/medreports/KK-SCIVIAS-00004-0051726752-2015-12-17.pdf
total token: 17545


17545

In [17]:
# https://stackoverflow.com/questions/13673060/split-string-into-strings-by-length
def wrap(s, w):
    """
    Split a string into consecutive chunks of at most w characters.
    Args:
      s: original string
      w: width (number of characters) of each chunk
      
    Return:
      a list of strings, each of length w except possibly the last
    """
    return [s[i:i + w] for i in range(0, len(s), w)]
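
Fixed-width slicing can cut a word or sentence in half, so the model may receive an incomplete fragment at each chunk boundary. A minimal sketch of a whitespace-aware alternative using the standard library's `textwrap.wrap` is given here. The name `wrap_on_whitespace` is hypothetical, and whether this improves translation quality on these reports is untested.

```python
import textwrap
from typing import List

def wrap_on_whitespace(s: str, w: int) -> List[str]:
    """Split s into chunks of at most w characters, breaking only at whitespace.

    Note: textwrap replaces newlines and tabs with single spaces, and a single
    word longer than w is kept whole rather than cut (break_long_words=False).
    """
    return textwrap.wrap(s, width=w, break_long_words=False, break_on_hyphens=False)

text = "Das Haus ist wunderbar. Der Garten ist groß."
chunks = wrap_on_whitespace(text, 20)
for chunk in chunks:
    print(repr(chunk))
```

Because every break falls on a space, joining the chunks with a single space reproduces the original sentence text.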

In [18]:
splitted_content = wrap(context, 350)

In [19]:
len(splitted_content)

51
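
The chunk count can be checked by hand: slicing a string of n characters into pieces of width w yields ceil(n / w) pieces. With the total of 17545 reported by `count_token` and w = 350, that gives 51, which matches the output above. This suggests the reported "token" total is a character count, although that is an inference from the numbers and not confirmed by the helper's source. A small sketch of the arithmetic, with the hypothetical helper name `expected_chunks`:

```python
import math

def expected_chunks(n_chars: int, width: int) -> int:
    """Number of fixed-width slices needed to cover n_chars characters."""
    return math.ceil(n_chars / width)

n_chunks = expected_chunks(17545, 350)
print(n_chunks)

# Cross-check against actual slicing of a dummy string of the same length.
dummy = "x" * 17545
slices = [dummy[i:i + 350] for i in range(0, len(dummy), 350)]
print(len(slices), len(slices[-1]))
```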

In [20]:
output = []
for chunk in splitted_content:
    # The pipeline returns a list with one dict per input; keep its translated text.
    output.append(translate(chunk)[0].get('translation_text', '').strip())

--------------------
walltime: 0.5716536045074463 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.357422 GB
Allocated memory : 0.285861 GB
Free      memory : 0.071561 GB
--------------------
--------------------
walltime: 0.5071165561676025 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.357422 GB
Allocated memory : 0.285861 GB
Free      memory : 0.071561 GB
--------------------
--------------------
walltime: 0.3874013423919678 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.357422 GB
Allocated memory : 0.285861 GB
Free      memory : 0.071561 GB
--------------------
--------------------
walltime: 0.3659214973449707



--------------------
walltime: 0.4631073474884033 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.357422 GB
Allocated memory : 0.285861 GB
Free      memory : 0.071561 GB
--------------------
--------------------
walltime: 0.4110853672027588 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.359375 GB
Allocated memory : 0.285861 GB
Free      memory : 0.073514 GB
--------------------
--------------------
walltime: 0.5056653022766113 in secs.
num_of_gpus: 1
--------------------
Device_name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Multi_processor  : 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.359375 GB
Allocated memory : 0.285861 GB
Free      memory : 0.073514 GB
--------------------
--------------------
walltime: 0.491748571395874 

In [21]:
en_content = ''.join(output)

In [22]:
#print(en_content)

In [23]:
# len() on a string counts characters, not tokens
print(f"the translated text has characters: {len(en_content)}")

the translated text has characters: 14647
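
Two details of the join and count steps can be illustrated on toy data. Joining stripped chunks with an empty string fuses the end of one chunk to the start of the next, and `len()` on a string counts characters, not tokens or words. The sketch below uses two made-up translated chunks; it does not use the model's tokenizer, so the word count is only a rough whitespace-based figure.

```python
# Two made-up translated chunks, already stripped like the notebook's output list.
output = ["The house is wonderful.", "The garden is large."]

fused = "".join(output)    # no separator: chunk boundaries run together
spaced = " ".join(output)  # single space keeps the boundary readable

n_chars = len(spaced)          # character count
n_words = len(spaced.split())  # rough whitespace-delimited word count

print(fused)
print(spaced)
print(f"characters: {n_chars}, words: {n_words}")
```

A count in model tokens would need the tokenizer that belongs to the translation model, and it would generally differ from both figures.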


In [24]:
def store_txt(content, path):
    """Write the string content to a text file at path, overwriting any existing file."""
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(content)

In [25]:
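# Swap only the file extension; str.replace("pdf", "txt") would also alter any "pdf" elsewhere in the path.
en_txt_path = os.path.splitext(loader.file_path_list[file_idx])[0] + ".txt"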

In [26]:
store_txt(en_content, en_txt_path)