<a href="https://colab.research.google.com/github/webbigdata-jp/python_sample/blob/main/C3TR_Adapter_v3_batch_translation_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [C3TR-Adapter v3](https://huggingface.co/webbigdata/C3TR-Adapter) batch translation sample

Translates an uploaded text file in bulk, from English to Japanese or from Japanese to English, and writes the result to a file.

To run it, choose "Runtime" -> "Run all" from the menu at the top.

Known issues:

- Long sentences cause an error. (This is a limitation of free Colab.)
- Meaningless sentences, or sentences that are not Japanese, may produce strange output.

## (1) Install required libraries

In [1]:
%%capture
%%shell
#@title Install Required Libraries
pip install peft==0.11.1 bitsandbytes==0.43.1 transformers==4.42.3
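
After the install cell finishes, it can be useful to confirm that the runtime really has the pinned versions, since Colab ships with its own preinstalled packages. The following is a minimal sketch using only the standard-library `importlib.metadata`; the helper name `check_pins` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell above.
PINS = {"peft": "0.11.1", "bitsandbytes": "0.43.1", "transformers": "4.42.3"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


print(check_pins(PINS) or "all pinned versions match")
```

An empty result means every pin is satisfied. Any entry shows the expected version next to what is actually installed.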

## (2) Setting Up

In [8]:
#@title Upload Text File(.txt only)
import os
from google.colab import files
import shutil

uploaded = files.upload()

Saving TODO2.txt to TODO2.txt
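
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes. Because the form title asks for `.txt` files only, one possible guard is to filter that dictionary before translating. This is only a sketch: `keep_txt_files` is a hypothetical helper, and the `example` dictionary stands in for a real upload.

```python
def keep_txt_files(uploaded):
    """Return only the entries whose file name ends with .txt (case-insensitive)."""
    return {
        name: data
        for name, data in uploaded.items()
        if name.lower().endswith(".txt")
    }


# Stand-in for the dict returned by files.upload().
example = {"TODO2.txt": b"hello", "photo.png": b"\x89PNG"}
print(list(keep_txt_files(example)))
```

In the notebook this could be applied as `uploaded = keep_txt_files(uploaded)` right after the upload call, so that later cells never see other file types.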


In [9]:
#@title Translation Setting
Translation_direction = 'Japanese to English' #@param ["Japanese to English", "English to Japanese"]

In [6]:
%%capture
#@title Download Model (may take a few minutes)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

model_id = "unsloth/gemma-2-9b-it-bnb-4bit"
peft_model_id = "webbigdata/C3TR-Adapter"

if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8:
    dtype = torch.bfloat16
else:
    dtype = torch.float16

model = AutoModelForCausalLM.from_pretrained(model_id,  torch_dtype=dtype, device_map="auto")
model = PeftModel.from_pretrained(model = model, model_id = peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

import re


def contains_japanese(text):
    # Regular expression pattern for detecting Japanese characters
    # Hiragana: 3040-309F, Katakana: 30A0-30FF, Kanji: 4E00-9FAF (old and new character forms)
    pattern = re.compile('[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF]')
    return re.search(pattern, text) is not None

def trans(prompt, model, tokenizer, Translation_direction):
    # Tokenize the prompt (truncated to 1600 tokens) and move it to the model's device
    input_ids = tokenizer(prompt, return_tensors="pt",
        padding=True, max_length=1600, truncation=True).input_ids.to(model.device)

    # Translation
    generated_ids = model.generate(input_ids=input_ids,
        max_new_tokens=800,
        num_beams=3, do_sample=True, temperature=0.5, top_p=0.3,
        )
    full_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return full_outputs[0].split("### Response:\n")[-1].strip()
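
For a causal language model, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded string still contains the prompt. That is why `trans` keeps only the text after the last `### Response:\n` marker. The sketch below isolates that step in plain Python, with no model involved. `build_prompt` and `extract_response` are hypothetical helper names, the short system prompt is a placeholder, and the template layout mirrors the prompt strings used in this notebook's translation cell.

```python
RESPONSE_MARKER = "### Response:\n"


def build_prompt(system_prompt, direction, sentence):
    """Assemble a prompt in the same layout as the notebook's prompt strings."""
    return (
        f"{system_prompt}\n\n"
        f"<start_of_turn>### Instruction:\nTranslate {direction}.\n\n"
        f"### Input:\n{sentence}\n<end_of_turn>\n"
        f"<start_of_turn>{RESPONSE_MARKER}"
    )


def extract_response(decoded):
    """Return the text after the last response marker, stripped of whitespace."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


prompt = build_prompt("You are a translator.", "Japanese to English", "こんにちは")
# Pretend the model echoed the prompt and appended a translation.
decoded = prompt + "Hello\n"
print(extract_response(decoded))
```

Splitting on the last marker means that an input sentence which itself contains `### Response:` would confuse the extraction. That is unlikely for ordinary prose but worth knowing.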


## (3) Do translation

In [10]:


#@title Translate line by line and write to file
import chardet  # Required for character encoding detection

for filename in uploaded.keys():
    translated_sentences = []

    with open(filename, "rb") as file:  # Read file in binary mode
        binary_content = file.read()
        detected_encoding = chardet.detect(binary_content)["encoding"] or "sjis"  # Fall back to Shift_JIS if detection fails
        content = binary_content.decode(detected_encoding)  # Decode the bytes into a Python str

        if Translation_direction == 'Japanese to English':
            if contains_japanese(content):
              sentences = [s for s in content.split('。') if s]
              sentences = [item for sublist in [s.split('\n') for s in sentences] for item in sublist]
            else:
              print(content)
              sentences = [content]
        else:
            sentences = [s for s in content.split('.') if s]
            sentences = [item for sublist in [s.split('\n') for s in sentences] for item in sublist]

    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) > 0:
            if Translation_direction == 'Japanese to English':
                if contains_japanese(content):
                    ja_prompt = f"You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n<start_of_turn>### Instruction:\nTranslate Japanese to English.\n\n### Input:\n{sentence}\n<end_of_turn>\n<start_of_turn>### Response:\n"
                    translated_sentences.append(trans(ja_prompt, model, tokenizer, Translation_direction))
                else:
                    translated_sentences.append(sentence)
            else:
                en_prompt = f"You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n<start_of_turn>### Instruction:\nTranslate English to Japanese.\n\n### Input:\n{sentence}\n<end_of_turn>\n<start_of_turn>### Response:\n"
                translated_sentences.append(trans(en_prompt, model, tokenizer, Translation_direction))
        else:
          translated_sentences.append("")

    output_filename = filename.replace('.txt', '_Ja_to_En.txt') if Translation_direction == 'Japanese to English' else filename.replace('.txt', '_En_to_Ja.txt')
    with open(output_filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(translated_sentences))

    print(f"Translation completed. Please download the file: {output_filename}")
    files.download(output_filename)


Translation completed. Please download the file: TODO2_Ja_to_En.txt
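
The translation cell splits the text on `。` or `.` with `str.split`, which discards the terminator, so each sentence reaches the model without its closing punctuation. If that matters for a given text (an assumption, not something this notebook measures), one alternative is to split after the terminator instead. The sketch below uses a regular-expression lookbehind for this and needs Python 3.7 or later. `split_sentences` is a hypothetical helper, and unlike the cell above it drops blank lines rather than preserving them.

```python
import re


def split_sentences(text, terminator):
    """Split text into sentences, keeping the terminator attached to each one."""
    pieces = []
    for line in text.split("\n"):
        # Split right after each terminator; the lookbehind keeps it in the sentence.
        for part in re.split(f"(?<={re.escape(terminator)})", line):
            part = part.strip()
            if part:
                pieces.append(part)
    return pieces


print(split_sentences("今日は晴れです。明日は雨です。", "。"))
print(split_sentences("It is sunny. It will rain.\nBring an umbrella.", "."))
```

As with the original `split('.')`, English text containing decimals or abbreviations such as `3.14` or `e.g.` will still be cut at every period.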

