<a href="https://colab.research.google.com/github/webbigdata-jp/python_sample/blob/main/gemma_2_2b_jpn_it_tranlate_batch_translation_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [gemma-2-2b-jpn-it-translate](https://huggingface.co/webbigdata/gemma-2-2b-jpn-it-translate) batch translation sample

Translates an uploaded file in bulk, from English to Japanese or from Japanese to English, and writes the result to a new file.

Click the triangle (run) icons on the left, in order from top to bottom, to execute each cell.

- Only .txt files are supported.
- Translation quality varies: some writing styles translate well, while others translate noticeably worse.

In [None]:
%%capture
%%shell
#@title Install library
pip install -U transformers

In [None]:
#@title Upload Text File(.txt only)
from google.colab import files

uploaded = files.upload()
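
The upload cell leaves the raw file bytes in `uploaded` (a dict keyed by filename), so the text still has to be decoded before it can be translated. As a minimal sketch, assuming the input is usually UTF-8 or one of the common Japanese legacy encodings, a standard-library fallback chain could look like the following. The helper name `decode_with_fallback` and the candidate list are illustrative assumptions, not part of this notebook.

```python
def decode_with_fallback(data: bytes,
                         encodings=("utf-8-sig", "shift_jis", "euc_jp")) -> str:
    """Decode bytes with the first candidate encoding that succeeds.

    Hypothetical helper: the candidate order is a heuristic, not a guarantee.
    """
    for enc in encodings:
        try:
            # "utf-8-sig" also strips a leading byte order mark if present.
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "こんにちは".encode("shift_jis")
    print(decode_with_fallback(sample))
```

In Colab this could be called as `decode_with_fallback(next(iter(uploaded.values())))`. The candidate order is only a heuristic: a file in one legacy encoding can sometimes decode without error under another and produce garbled text, so a statistical detector may be preferable when the encoding is truly unknown.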

In [None]:
#@title Translation Setting
Translation_direction = 'English to Japanese' #@param ["Japanese to English", "English to Japanese"]

In [None]:
#@title Writing Style Setting
Writing_style = 'business' #@param ["business", "formal", "casual", "slang"]
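
The two settings above end up in the first user message as a direction instruction followed by a `[writing_style: ...]` hint. The sketch below shows one way that text could be assembled. The function name `build_instruction` is hypothetical, and the wording simply mirrors the strings used later in this notebook.

```python
def build_instruction(direction: str, writing_style: str) -> str:
    """Combine the direction and style settings into one instruction string.

    Hypothetical helper for illustration only.
    """
    if direction == "Japanese to English":
        instruct = "Translate Japanese to English."
    elif direction == "English to Japanese":
        instruct = "Translate English to Japanese."
    else:
        raise ValueError(f"Unsupported direction: {direction!r}")

    # An f-string is needed so the chosen style is actually substituted
    # into the hint rather than sent as literal placeholder text.
    hint = ("When translating, please use the following hints:\n"
            f"[writing_style: {writing_style}]")
    return instruct + "\n" + hint


if __name__ == "__main__":
    print(build_instruction("English to Japanese", "business"))
```

Raising on an unknown direction is a design choice of this sketch; it surfaces a mistyped setting immediately instead of silently defaulting to one direction.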


In [None]:
%%capture
#@title Download Model (may take a few minutes)

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_torch_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        prop = torch.cuda.get_device_properties(device)
        # Ampere or newer GPUs (compute capability 8.0+, e.g. L4) support bfloat16; older GPUs such as the T4 do not.
        if prop.major >= 8:
            return torch.bfloat16
    return torch.float16

model_name = "webbigdata/gemma-2-2b-jpn-it-translate"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=get_torch_dtype(),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token


In [None]:
#@title Translate line by line and write to file
import chardet  # Required for character encoding detection


def translate_file():
    system_prompt =  """You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating."""
    instruct = ""

    filename = next(iter(uploaded.keys()))
    with open(filename, "rb") as file:  # Read file in binary mode
        binary_content = file.read()
        # Fall back to Shift_JIS when chardet cannot determine the encoding
        detected_encoding = chardet.detect(binary_content)["encoding"] or "sjis"
        content = binary_content.decode(detected_encoding)  # Decode bytes to a Python str

        if Translation_direction == 'Japanese to English':
            sentences = [s for s in content.split('。') if s]
            sentences = [item for sublist in [s.split('\n') for s in sentences] for item in sublist]
            instruct = "Translate Japanese to English."
        else:
            sentences = [s for s in content.split('.') if s]
            sentences = [item for sublist in [s.split('\n') for s in sentences] for item in sublist]
            instruct = "Translate English to Japanese."

    # f-string so that the selected writing style is substituted into the hint
    style = f"When translating, please use the following hints:\n[writing_style: {Writing_style}]"

    initial_messages = [
        {"role": "user", "content": system_prompt + "\n\n" + instruct + "\n" + style},
        {"role": "assistant", "content": "OK"}
    ]
    messages = initial_messages.copy()

    output_path = filename.replace('.txt', '_Ja_to_En.txt') if Translation_direction == 'Japanese to English' else filename.replace('.txt', '_En_to_Ja.txt')
    with open(output_path, 'w', encoding='utf-8') as hyp_file:
        for line in sentences:
            if line.strip() == "":
                # Preserve blank lines without sending an empty prompt to the model
                hyp_file.write('\n')
                continue

            messages.append({"role": "user", "content": line.strip()})
            inputs = tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(model.device)  # Follow the device chosen by device_map="auto"

            with torch.no_grad():
                generated_ids = model.generate(
                    input_ids=inputs,
                    num_beams=3, max_new_tokens=1200, do_sample=True, temperature=0.5, top_p=0.3,
                    repetition_penalty=1.0
                )
            full_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

            model_marker = "\nmodel\n"
            model_response = full_outputs[0].split(model_marker)[-1].strip()

            hyp_file.write(model_response + '\n')
            messages.append({"role": "assistant",  "content": model_response})
            print(f"Translated: {line.strip()} -> {model_response}")

            if len(messages) > 8:  # 2 (initial) + 6 (new) = 8
                messages = initial_messages + messages[-6:]

    print(f"Translation completed. Please download the file: {output_path}")
    files.download(output_path)

translate_file()
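
Two steps inside `translate_file` can be illustrated in isolation: splitting the input into sentences and keeping the chat history short. The sketch below is a hedged variant, not the notebook's exact code. `split_sentences` keeps the sentence terminator (the cell above drops it when splitting), and `trim_history` mirrors the rule of keeping the two initial messages plus the six most recent ones. Both function names are hypothetical.

```python
import re


def split_sentences(text: str, terminator: str) -> list[str]:
    """Split text into sentences, keeping the terminator and blank lines.

    Naive by design: abbreviations or decimals such as "3.14" are also split.
    """
    t = re.escape(terminator)
    # One or more non-terminator characters, optionally followed by the terminator.
    pattern = re.compile(rf"[^{t}]+{t}?")
    result = []
    for line in text.splitlines():
        if not line.strip():
            result.append("")  # keep paragraph breaks as empty entries
            continue
        for match in pattern.finditer(line):
            sentence = match.group().strip()
            if sentence:
                result.append(sentence)
    return result


def trim_history(initial_messages: list, messages: list, keep_last: int = 6) -> list:
    """Return the initial messages plus at most keep_last of the latest turns."""
    turns = messages[len(initial_messages):]
    recent = turns[-keep_last:] if keep_last > 0 else []
    return initial_messages + recent


if __name__ == "__main__":
    print(split_sentences("今日は晴れです。明日は雨です。\n\nありがとう", "。"))
    initial = [{"role": "user", "content": "instructions"},
               {"role": "assistant", "content": "OK"}]
    history = initial + [{"role": "user", "content": str(i)} for i in range(10)]
    print(len(trim_history(initial, history)))
```

Bounding the history keeps the prompt length roughly constant across a long file while still giving the model a few preceding sentences as context.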


## Acknowledgment

### Referenced models.
Original Model  
google/gemma-2-2b-jpn-it  
https://huggingface.co/google/gemma-2-2b-jpn-it

This Model  
webbigdata/gemma-2-2b-jpn-it-translate  
https://huggingface.co/webbigdata/gemma-2-2b-jpn-it-translate


This script was created by [webbigdata](https://webbigdata.jp/).