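### Transfer learning with the KETI model
- Uses a model from KETI (Korea Electronics Technology Institute)
- The model was trained on 36 million Korean documents and 38 million English documents
- The dataset is OPUS-100, published by Helsinki-NLP, used here to translate English into Korean
- OPUS-100 is made up of language pairs covering 100 languages worldwide and contains about 55 million sentences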

### Tokenizing the OPUS-100 dataset

In [1]:
from datasets import load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration

In [2]:
model_name = "KETI-AIR/long-ke-t5-small"
tokenizer = T5TokenizerFast.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

You are using a model of type longt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at KETI-AIR/long-ke-t5-small and are newly initialized: ['encoder.block.0.layer.0.SelfAttention.k.weight', 'encoder.block.0.layer.0.SelfAttention.o.weight', 'encoder.block.0.layer.0.SelfAttention.q.weight', 'encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'encoder.block.0.layer.0.SelfAttention.v.weight', 'encoder.block.1.layer.0.SelfAttention.k.weight', 'encoder.block.1.layer.0.SelfAttention.o.weight', 'encoder.block.1.layer.0.SelfAttention.q.weight', 'encoder.block.1.layer.0.SelfAttention.v.weight', 'encoder.block.2.layer.0.SelfAttention.k.weight', 'encoder.block.2.layer.0.SelfAttention.o.weight', 'encoder.block.2.layer.0.SelfAttention.q.weight', 'encoder.block.2.layer.0.SelfAttention.v.weight', 'encoder.block.3.layer.0.SelfAttention.k.weight', 'encoder.block.3.layer.0.SelfAttention.o.weight', 'encoder.block.3.layer.0.SelfAttention.q.weight', 'encoder.block.3.layer.0.SelfAt
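
The two messages above are worth reading together: the checkpoint's config declares the `longt5` model type while the cell instantiates `T5ForConditionalGeneration`, and the encoder self-attention weights listed are newly initialized rather than loaded from the checkpoint. Below is a minimal sketch of one way to avoid the mismatch, assuming `AutoModelForSeq2SeqLM` from `transformers` resolves this checkpoint to its LongT5 implementation. The helper name `load_seq2seq_model` is hypothetical, and every output in the rest of this notebook was produced with the original `T5ForConditionalGeneration` load, not with this variant.

```python
def load_seq2seq_model(model_name: str):
    """Load a seq2seq checkpoint with the model class that its config declares."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSeq2SeqLM

    # The Auto class reads config.json and picks the matching architecture,
    # instead of forcing a fixed class onto the checkpoint.
    return AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

It could be called as `model = load_seq2seq_model("KETI-AIR/long-ke-t5-small")`. Whether this changes the training loss or the translations below has not been measured here.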

In [3]:
dataset = load_dataset("Helsinki-NLP/opus-100", "en-ko")  # the en-ko subset has 1,000,000 training examples and 2,000 validation examples

In [4]:
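def preprocess_data(example, tokenizer):
    # With batched=True, example['translation'] is a list of {'en': ..., 'ko': ...} dicts.
    translation = example['translation']
    # Prefix each side with its language tag.
    translation_source = ['en: ' + instance['en'] for instance in translation]
    translation_target = ['ko: ' + instance['ko'] for instance in translation]
    # text_target tokenizes the targets into the 'labels' field.
    tokenized = tokenizer(
        translation_source,
        text_target=translation_target,
        truncation=True  # no max_length is set, so nothing is actually truncated
    )
    return tokenized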
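
To make the batched input shape concrete, here is a small sketch of the prefixing step on its own, without the tokenizer. The helper name `add_language_prefixes` and the two-row `toy_batch` are hypothetical and only illustrate the structure that `dataset.map(..., batched=True)` passes in.

```python
def add_language_prefixes(batch):
    """Build prefixed source and target strings from a batched 'translation' column."""
    translation = batch["translation"]
    sources = ["en: " + pair["en"] for pair in translation]
    targets = ["ko: " + pair["ko"] for pair in translation]
    return sources, targets


# Hypothetical two-row batch: one list entry per example, each a dict keyed by language.
toy_batch = {
    "translation": [
        {"en": "Hello.", "ko": "안녕하세요."},
        {"en": "Thank you.", "ko": "감사합니다."},
    ]
}

sources, targets = add_language_prefixes(toy_batch)
print(sources)  # ['en: Hello.', 'en: Thank you.']
print(targets)  # ['ko: 안녕하세요.', 'ko: 감사합니다.']
```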

In [7]:
processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=True,
    remove_columns = dataset['train'].column_names
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [11]:
sample = processed_dataset['test'][0]
print(sample)
print("English:", tokenizer.decode(sample['input_ids']))
print("Korean:", tokenizer.decode(sample['labels']))

{'input_ids': [20004, 20525, 20048, 20298, 20480, 20025, 20263, 20027, 20187, 20050, 43305, 20009, 21015, 20047, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [20004, 23477, 20048, 92, 14, 4256, 11, 1363, 71, 1133, 2951, 20371, 33, 16, 75, 242, 10, 513, 20047, 1]}
English: en: What makes you think I want an intro to anyone?</s>
Korean: ko: 내가 너를 누구에게 소개하고 싶어한다고 생각하니?</s>


### Training the machine translation model

In [12]:
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [13]:
seq2seq_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding = 'longest',
    return_tensors='pt'
)
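
As a rough picture of what `padding='longest'` means for a batch, the sketch below pads every sequence to the longest one in that batch. It is a simplified pure-Python illustration, not the library's implementation: the helper name `pad_batch`, the toy token ids, and `pad_token_id=0` are assumptions, while `-100` is the collator's default label padding value, which the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention 0 so the model does not attend to them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        l_pad = max_labels - len(f["labels"])
        # Padded label positions get -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


# Hypothetical tokenized examples of different lengths.
toy_features = [
    {"input_ids": [5, 6, 1], "attention_mask": [1, 1, 1], "labels": [7, 1]},
    {"input_ids": [5, 1], "attention_mask": [1, 1], "labels": [7, 8, 9, 1]},
]
padded = pad_batch(toy_features)
print(padded["input_ids"])  # [[5, 6, 1], [5, 1, 0]]
print(padded["labels"])     # [[7, 1, -100, -100], [7, 8, 9, 1]]
```

Padding per batch rather than to a fixed global length keeps short batches small, which matters here because the tokenization step does not truncate.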

In [15]:
# Hyperparameters
training_arguments = Seq2SeqTrainingArguments(
    output_dir = ".",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    eval_steps = 2000,  # only used when a step-based evaluation strategy is set; none is set here
    logging_steps = 2000,
    seed=42
)

In [16]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_arguments,
    data_collator=seq2seq_collator,
    train_dataset=processed_dataset['train'].select(range(100000)),
    eval_dataset=processed_dataset['validation'].select(range(1000))
)

In [17]:
import wandb, os
# Read the API key from the environment instead of hardcoding it in the notebook.
wandb.login(key=os.environ["WANDB_API_KEY"])
os.environ['WANDB_CONSOLE'] ='wrap'

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: zeushahn (zeushahn-khankong) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


In [18]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Step | Training Loss |
| --- | --- |
| 2000 | 3.1116 |
| 4000 | 2.8744 |
| 6000 | 2.8131 |




TrainOutput(global_step=6250, training_loss=2.92708841796875, metrics={'train_runtime': 1434.7098, 'train_samples_per_second': 69.701, 'train_steps_per_second': 4.356, 'total_flos': 1818984778039296.0, 'train_loss': 2.92708841796875, 'epoch': 1.0})
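
The `global_step=6250` figure can be checked with a little arithmetic. The sketch below assumes the run used two devices: this is inferred from the numbers, not stated in the notebook, since 100,000 examples at a per-device batch size of 8 on a single device would give 12,500 steps. It also assumes no gradient accumulation, as none is set in the training arguments. The helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_devices, epochs=1):
    """Number of optimizer steps when each step consumes one batch per device."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# num_devices=2 is an assumption inferred from the reported step count.
steps = optimizer_steps(num_examples=100_000, per_device_batch_size=8, num_devices=2)
print(steps)  # 6250

# Cross-check against the reported runtime of 1434.7098 seconds.
runtime_seconds = 1434.7098
samples_per_second = 100_000 / runtime_seconds
steps_per_second = steps / runtime_seconds
print(round(samples_per_second, 1), round(steps_per_second, 2))
```

Both throughput values line up with the `train_samples_per_second` and `train_steps_per_second` entries in the output above.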

### Running machine translation

In [19]:
import torch

model.eval()
device=torch.device("cuda" if torch.cuda.is_available() else 'cpu')
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(64100, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(64100, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

In [31]:
data = "en: It's alaways great to acquire new knowledge."
inputs = tokenizer(data, return_tensors='pt').to(device)

In [34]:
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        no_repeat_ngram_size=2,
        early_stopping=False
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

ko: 새로운 지식을 얻을 수 있어
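
The `no_repeat_ngram_size=2` argument prevents any bigram from appearing twice in a generated sequence. The sketch below illustrates the idea in pure Python: given the tokens generated so far, it lists which next tokens would complete an n-gram that already occurred. It is a simplified illustration under that reading of the option, not the library's implementation, and the helper name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram the next token would complete.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# "a b" already occurred and the sequence ends in "a", so "b" may not follow.
print(banned_next_tokens(["a", "b", "c", "a"], ngram_size=2))  # {'b'}
# The sequence ends in "c", which has not started any bigram yet, so nothing is banned.
print(banned_next_tokens(["a", "b", "c"], ngram_size=2))  # set()
```

With beam search, a check of this kind would apply to each beam separately, which is one reason short outputs like the one above rarely repeat phrases.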
