<a href="https://colab.research.google.com/github/rickiepark/MLQandAI/blob/main/supplementary/q15-text-augment/backtranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Back-Translation for Data Augmentation

In [1]:
!pip install watermark sacremoses

%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p transformers

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

transformers: 4.44.2



In [2]:
from transformers import MarianMTModel, MarianTokenizer

def back_translate(text):
    """Translate English text to German and back; return (German, English)."""
    # English to German
    en_to_de_model_name = "Helsinki-NLP/opus-mt-en-de"
    en_to_de_tokenizer = MarianTokenizer.from_pretrained(en_to_de_model_name,
                                                         clean_up_tokenization_spaces=False)
    en_to_de_model = MarianMTModel.from_pretrained(en_to_de_model_name)

    inputs = en_to_de_tokenizer([text], return_tensors="pt")
    translated_german_tokens = en_to_de_model.generate(**inputs)
    translated_german_text = en_to_de_tokenizer.decode(translated_german_tokens[0], skip_special_tokens=True)

    # German to English
    de_to_en_model_name = 'Helsinki-NLP/opus-mt-de-en'
    de_to_en_tokenizer = MarianTokenizer.from_pretrained(de_to_en_model_name,
                                                         clean_up_tokenization_spaces=False)
    de_to_en_model = MarianMTModel.from_pretrained(de_to_en_model_name)

    inputs_back = de_to_en_tokenizer([translated_german_text], return_tensors="pt")
    translated_english_tokens = de_to_en_model.generate(**inputs_back)
    translated_back_english_text = de_to_en_tokenizer.decode(translated_english_tokens[0], skip_special_tokens=True)

    return translated_german_text, translated_back_english_text

In [3]:
text = ("Despite the intermittent rain showers, "
        "Amelia decided to venture outside with "
        "her new umbrella, hoping to enjoy the fresh "
        "air and perhaps bump into some old friends "
        "at the local café down the street."
       )

translated_text, back_translated_text = back_translate(text)

print("Original text:")
print(text)
print("--------------------------")

print("Translated text:")
print(translated_text)
print("--------------------------")

print("Back-translated text:")
print(back_translated_text)
print("--------------------------")

Original text:
Despite the intermittent rain showers, Amelia decided to venture outside with her new umbrella, hoping to enjoy the fresh air and perhaps bump into some old friends at the local café down the street.
--------------------------
Translated text:
Trotz der periodischen Regenschauer entschied sich Amelia, sich mit ihrem neuen Regenschirm nach draußen zu wagen, in der Hoffnung, die frische Luft zu genießen und vielleicht einige alte Freunde im örtlichen Café auf der Straße zu treffen.
--------------------------
Back-translated text:
Despite the periodic rain showers, Amelia decided to venture outside with her new umbrella, hoping to enjoy the fresh air and perhaps meet some old friends in the local café on the street.
--------------------------
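
The round trip above produces a paraphrase that keeps the meaning of the input, which is what makes it usable as an extra training example. The following is a minimal sketch of that augmentation step, assuming a labeled dataset of `(text, label)` pairs. The helper name `augment_with_back_translation`, the `paraphrase_fn` parameter, the toy paraphraser, and the sample data are all hypothetical and not part of the original notebook.

```python
from typing import Callable, Iterable


def augment_with_back_translation(
    examples: Iterable[tuple[str, str]],
    paraphrase_fn: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Return the original examples plus label-preserving paraphrases.

    `paraphrase_fn` is a hypothetical hook: any function that maps a
    text to its back-translated version.
    """
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        paraphrase = paraphrase_fn(text)
        # Skip round trips that reproduce the input unchanged,
        # since they add no new variation to the training set.
        if paraphrase.strip() != text.strip():
            augmented.append((paraphrase, label))
    return augmented


# Stand-in paraphraser for illustration only: it mimics one word swap
# seen in the output above instead of calling a translation model.
def toy_paraphrase(text: str) -> str:
    return text.replace("intermittent", "periodic")


# Toy data (assumed for this sketch).
data = [
    ("Despite the intermittent rain showers, Amelia went outside.", "weather"),
    ("Amelia went outside.", "weather"),
]
augmented_data = augment_with_back_translation(data, toy_paraphrase)
for text, label in augmented_data:
    print(label, "|", text)
```

With the notebook's own function, passing `paraphrase_fn=lambda t: back_translate(t)[1]` would plug in the real models. As `back_translate` is written, that reloads both models on every call, so loading them once outside the function is worth considering for larger datasets.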


In [4]:
import difflib


d = difflib.Differ()
diff = d.compare(text.split(),
                 back_translated_text.split())

print('\n'.join(diff))

  Despite
  the
- intermittent
+ periodic
  rain
  showers,
  Amelia
  decided
  to
  venture
  outside
  with
  her
  new
  umbrella,
  hoping
  to
  enjoy
  the
  fresh
  air
  and
  perhaps
+ meet
- bump
- into
  some
  old
  friends
- at
+ in
  the
  local
  café
- down
+ on
  the
  street.
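
The line-by-line diff above shows which words changed, but it can be handy to collapse it into a list of edited spans plus a single similarity score, for example to filter out paraphrases that drift too far from the input. Below is a rough sketch using `difflib.SequenceMatcher` from the standard library. The helper names `changed_spans` and `word_similarity` are assumptions made for this illustration, and the two strings are copied from the output above so the block runs without the translation models.

```python
import difflib

original = ("Despite the intermittent rain showers, Amelia decided to venture "
            "outside with her new umbrella, hoping to enjoy the fresh air and "
            "perhaps bump into some old friends at the local café down the street.")
round_trip = ("Despite the periodic rain showers, Amelia decided to venture "
              "outside with her new umbrella, hoping to enjoy the fresh air and "
              "perhaps meet some old friends in the local café on the street.")


def changed_spans(a: str, b: str) -> list[tuple[str, str]]:
    """Return (original words, new words) for every non-matching span."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return spans


def word_similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    matcher = difflib.SequenceMatcher(a=a.split(), b=b.split(), autojunk=False)
    return matcher.ratio()


spans = changed_spans(original, round_trip)
similarity = word_similarity(original, round_trip)

for old, new in spans:
    print(f"{old!r} -> {new!r}")
print(f"word-level similarity: {similarity:.2f}")
```

The score is `2 * matches / (len(a) + len(b))` over words, so it is a surface-overlap measure and says nothing about whether the meaning was preserved.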
