# Test_pretrained_ke-t5-small

## git

In [1]:
# !git config --global user.name candym1
# !git config --global user.email tmxk5283@gmail.com
# !git clone https://github.com/seuyon0101/saturi.git

In [2]:
# !cd saturi

## import

In [3]:
import torch
import os
import re
import pandas as pd
import numpy as np
import torch.nn as nn
import nltk
from nltk.tokenize import sent_tokenize

import datasets
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, T5ForConditionalGeneration, BartForConditionalGeneration
from transformers import pipeline
from transformers import AutoModel, AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import Trainer, TrainingArguments

## Test data upload

In [4]:
kor_path = os.getenv('HOME')+"/korean-english-park.train.ko"
eng_path = os.getenv('HOME')+"/korean-english-park.train.en"

In [5]:
with open(kor_path, "r") as f:
    kor = f.read().splitlines()

print("Data Size:", len(kor))
print("Example:")

for sen in kor[0:100][::20]: print(">>", sen)

Data Size: 94123
Example:
>> 개인용 컴퓨터 사용의 상당 부분은 "이것보다 뛰어날 수 있느냐?"
>> 북한의 핵무기 계획을 포기하도록 하려는 압력이 거세지고 있는 가운데, 일본과 북한의 외교관들이 외교 관계를 정상화하려는 회담을 재개했다.
>> "경호 로보트가 침입자나 화재를 탐지하기 위해서 개인적으로, 그리고 전문적으로 사용되고 있습니다."
>> 수자원부 당국은 논란이 되고 있고, 막대한 비용이 드는 이 사업에 대해 내년에 건설을 시작할 계획이다.
>> 또한 근력 운동은 활발하게 걷는 것이나 최소한 20분 동안 뛰는 것과 같은 유산소 활동에서 얻는 운동 효과를 심장과 폐에 주지 않기 때문에, 연구학자들은 근력 운동이 심장에 큰 영향을 미치는지 여부에 대해 논쟁을 해왔다.


In [6]:
with open(eng_path, "r") as f:
    eng = f.read().splitlines()

print("Data Size:", len(eng))
print("Example:")

for sen in eng[0:100][::20]: print(">>", sen)

Data Size: 94123
Example:
>> Much of personal computing is about "can you top this?"
>> Amid mounting pressure on North Korea to abandon its nuclear weapons program Japanese and North Korean diplomats have resumed talks on normalizing diplomatic relations.
>> “Guard robots are used privately and professionally to detect intruders or fire,” Karlsson said.
>> Authorities from the Water Resources Ministry plan to begin construction next year on the controversial and hugely expensive project.
>> Researchers also have debated whether weight-training has a big impact on the heart, since it does not give the heart and lungs the kind of workout they get from aerobic activities such as brisk walking or running for at least 20 minutes.
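
Both files report 94123 lines, and the later pairing step relies on the two lists being aligned line by line. A minimal sketch of a loader that makes that check explicit and pins the file encoding; `load_parallel` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def load_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    # A length mismatch means the files are no longer aligned, so fail early.
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return list(zip(src, tgt))


# Assumed usage with the notebook's paths:
# pairs = load_parallel(kor_path, eng_path)
```

Failing early on a mismatch is a design choice: a silent off-by-one in a parallel corpus would pair every later sentence with the wrong translation.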


In [7]:
# Pair each Korean sentence with its English counterpart, joined by a <TSL> separator.
cleaned_corpus = [ko + ' <TSL> ' + en for ko, en in zip(kor, eng)]

In [8]:
cleaned_corpus[2]

"그러나 이것은 또한 책상도 필요로 하지 않는다. <TSL> Like all optical mice, But it also doesn't need a desk."

In [9]:
def preprocess_sentence_ko(sentence, s_token=False, e_token=False):
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[^ㄱ-ㅎ가-힣a-zA-Z?.!,]+", " ", sentence)

    sentence = sentence.strip()

    if s_token:
        sentence = '<start> ' + sentence

    if e_token:
        sentence += ' <end>'
    
    return sentence

In [10]:
def preprocess_sentence_en(sentence, s_token=False, e_token=False):
    sentence = sentence.lower().strip()
    
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)

    sentence = sentence.strip()

    if s_token:
        sentence = '<start> ' + sentence

    if e_token:
        sentence += ' <end>'
    
    return sentence

In [11]:
cleaned_corpus[0]

'개인용 컴퓨터 사용의 상당 부분은 "이것보다 뛰어날 수 있느냐?" <TSL> Much of personal computing is about "can you top this?"'
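
The pairing step above joins the sentences line by line and does not remove pairs that recur elsewhere in the corpus. If corpus-wide de-duplication is wanted, a minimal sketch could look like this; `dedupe_pairs` is a hypothetical helper, and the two short lists stand in for `kor` and `eng`.

```python
def dedupe_pairs(src_lines, tgt_lines, sep=" <TSL> "):
    """Join aligned sentences with `sep` and drop exact duplicate pairs."""
    joined = (src + sep + tgt for src, tgt in zip(src_lines, tgt_lines))
    # dict.fromkeys keeps first-seen order (Python 3.7+), unlike set().
    return list(dict.fromkeys(joined))


# Illustrative stand-in data, not taken from the corpus.
kor_demo = ["안녕", "고마워", "안녕"]
eng_demo = ["hello", "thanks", "hello"]
print(dedupe_pairs(kor_demo, eng_demo))
```

Keeping the first-seen order matters here because later cells index into the corpus by position.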

In [12]:
enc_corpus = []
dec_corpus = []

num_examples = 30000

for z in range(num_examples):
    ko, en = cleaned_corpus[z].split(" <TSL> ")
    
    enc_corpus.append(preprocess_sentence_ko(ko))
    dec_corpus.append(preprocess_sentence_en(en))
    
print("Korean:", enc_corpus[0])
print("English:", dec_corpus[0])

Korean: 개인용 컴퓨터 사용의 상당 부분은 이것보다 뛰어날 수 있느냐 ?
English: much of personal computing is about can you top this ?
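
One side effect of the cleaning regex is that digits are not in the allowed character class, so numbers disappear from the text (scores, prices, speeds). The sketch below reproduces the Korean cleaning steps with an optional switch that keeps digits; `clean_ko` and its `keep_digits` flag are hypothetical, and the sample sentence is made up for illustration.

```python
import re


def clean_ko(sentence, keep_digits=False):
    """Same three regex steps as preprocess_sentence_ko, optionally keeping digits."""
    allowed = "ㄱ-ㅎ가-힣a-zA-Z?.!,"
    if keep_digits:
        allowed += "0-9"
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)     # pad punctuation with spaces
    sentence = re.sub(r'[" "]+', " ", sentence)           # collapse quotes and spaces
    sentence = re.sub(rf"[^{allowed}]+", " ", sentence)   # replace everything else
    return sentence.strip()


sample = "터키는 핀란드를 2-0으로 제압했다."  # illustrative sentence
print(clean_ko(sample))
print(clean_ko(sample, keep_digits=True))
```

Whether to keep digits depends on the task; for translation, dropping them removes information the model cannot recover.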


### Convert to DataFrame

In [13]:
df = pd.DataFrame(zip(enc_corpus, dec_corpus))
df.columns = ['input', 'target']

In [14]:
df.head()

| | input | target |
|---|---|---|
| 0 | 개인용 컴퓨터 사용의 상당 부분은 이것보다 뛰어날 수 있느냐 ? | much of personal computing is about can you to... |
| 1 | 모든 광마우스와 마찬가지 로 이 광마우스도 책상 위에 놓는 마우스 패드를 필요로 하... | so a mention a few weeks ago about a rechargea... |
| 2 | 그러나 이것은 또한 책상도 필요로 하지 않는다 . | like all optical mice , but it also doesn t ne... |
| 3 | . 달러하는 이 최첨단 무선 광마우스는 허공에서 팔목 , 팔 , 그외에 어떤 부분이... | uses gyroscopic sensors to control the cursor ... |
| 4 | 정보 관리들은 동남 아시아에서의 선박들에 대한 많은 테러 계획들이 실패로 돌아갔음을... | intelligence officials have revealed a spate o... |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   30000 non-null  object
 1   target  30000 non-null  object
dtypes: object(2)
memory usage: 468.9+ KB


### train, test data

In [16]:
x_train, x_test = train_test_split(df, test_size=0.2, random_state=77)

In [17]:
print(len(x_train))
print(len(x_test))

24000
6000


In [18]:
all_doc_f = np.concatenate((x_train,x_test))

In [19]:
print(len(all_doc_f))

30000


In [20]:
all_doc_f[0]

array(['한편 터키는 독일 뒤스브르크에서 열린 평가전에서 핀란드를 으로 제압했다 .',
       'meanwhile , turkey continued their preparations for the euro finals with a victory over finland in duisburg , germany .'],
      dtype=object)

In [21]:
x_train['input'][1]

'모든 광마우스와 마찬가지 로 이 광마우스도 책상 위에 놓는 마우스 패드를 필요로 하지 않는다 .'

## circulus/kobart-trans-en-ko-v2

### tokenizer

In [22]:
tokenizer1_en2ko = AutoTokenizer.from_pretrained("circulus/kobart-trans-en-ko-v2")
tokenizer1_ko2en = AutoTokenizer.from_pretrained("circulus/kobart-trans-ko-en-v2")

In [23]:
test_text = x_train['input'][1]
test_t5 = tokenizer1_en2ko(test_text).tokens()
print(test_t5)

['▁모든', '▁광', '마', '우', '스와', '▁마찬가지', '▁로', '▁이', '▁광', '마', '우스', '도', '▁책', '상', '▁위에', '▁놓', '는', '▁마', '우스', '▁패', '드를', '▁필요로', '▁하지', '▁않는다', '▁.']


In [24]:
print(len(x_train) == len(x_train['target']))
print(len(x_test) == len(x_test['target']))

True
True


In [25]:
tokenizer1_en2ko.model_max_length

1000000000000000019884624838656

In [26]:
tokenizer1_en2ko.model_input_names

['input_ids', 'token_type_ids', 'attention_mask']

In [27]:
train_ = pd.DataFrame({'input' : x_train['input'], 'target' : x_train['target']}).reset_index(drop=True)
test_ = pd.DataFrame({'input' : x_test['input'], 'target' : x_test['target']}).reset_index(drop=True)

In [28]:
train_d = Dataset.from_pandas(train_)
test_d = Dataset.from_pandas(test_)

# Convert to a DatasetDict
dataset = datasets.DatasetDict({"train":train_d,"test":test_d})

In [29]:
# Final check of the data values
dataset.set_format(type='pandas')
df = dataset['train'][:]
df

| | input | target |
|---|---|---|
| 0 | 한편 터키는 독일 뒤스브르크에서 열린 평가전에서 핀란드를 으로 제압했다 . | meanwhile , turkey continued their preparation... |
| 1 | 이들은 소속당의 열성 지지자뿐만아니라 부동층 및 상대정당의 당원까지 공략하고 나섰다 . | they are portraying themselves as uniters with... |
| 2 | 그가 받은 처벌이 어떠한 것인지에 대해서는 알려지지 않았다 . | the marine corps would not specify what that p... |
| 3 | 로딕은 처음부터 서비스가 잘 들어갔다 며 서비스가 위력적이어서 승리할 수 있었다 고... | one ace was . mph , breaking the dubai serve r... |
| 4 | 수도 전역의 여러 개의 붕괴된 건물 속에 많은 사람들이 실종된 것으로 보도되면서 사... | local media reports said one man died in his c... |
| ... | ... | ... |
| 23995 | 로열 아윈 병원의 렌 노타로스 박사는 호주 방송과의 인터뷰에서 대통령의 몸에 박힌 ... | surgeons operated on ramos horta for three hou... |
| 23996 | 노무현 대통령은 일 KTV 특집 인터뷰에서 제 차 남북정상회담을 언급하며 김정일 국... | north korean leader kim jong il is the most fl... |
| 23997 | 년 LZ 힌데브르크가 뉴저지에서 이륙 직전 추락해 화재로 타버렸을 때 라디오 저널리... | london , england cnn oh , the humanity . when ... |
| 23998 | 인도 헌법상 세습적 계급 제도를 근거로 한 신분 차별은 위법이며 대도시에선 이러한 ... | india s constitution outlaws caste based discr... |


In [30]:
# Encode and store the final data dict
dataset.set_format(type=None)
def tokenize(batch):
    return tokenizer1_en2ko(batch['input'], padding=True, truncation=True)

In [31]:
dataset_encoded = dataset.map(tokenize, batched=True, batch_size=None)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
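
The warning above follows from the `model_max_length` value printed earlier: it is a very large sentinel meaning "no limit", so `truncation=True` has no effect unless a `max_length` is passed explicitly. A pure-Python sketch of what truncation plus padding to the longest sequence does to a batch of token ids; `truncate_and_pad` is a hypothetical helper, the pad id of 3 is taken from the padded `input_ids` shown in this notebook, and real tokenizers additionally preserve special tokens when truncating.

```python
NO_LIMIT = int(1e30)  # the sentinel value printed for model_max_length


def truncate_and_pad(batch_ids, max_length=None, pad_id=3):
    """Cut each sequence to max_length (if given), then pad to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    return {
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids],
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids],
    }


batch = [[5, 6, 7, 8, 9], [5, 6]]  # illustrative ids
print(truncate_and_pad(batch))                # no limit: padded to length 5
print(truncate_and_pad(batch, max_length=4))  # truncated, then padded to length 4
```

Without a limit, a single very long sentence sets the padded width for the whole batch, which is why one example above carries such a long run of pad ids.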


In [32]:
print(dataset_encoded.column_names)
print(dataset_encoded["train"].column_names)

{'train': ['attention_mask', 'input', 'input_ids', 'target', 'token_type_ids'], 'test': ['attention_mask', 'input', 'input_ids', 'target', 'token_type_ids']}
['attention_mask', 'input', 'input_ids', 'target', 'token_type_ids']


In [33]:
print(dataset_encoded["train"][0])

{'target': 'meanwhile , turkey continued their preparations for the euro finals with a victory over finland in duisburg , germany .', 'input_ids': [14602, 14887, 18298, 15604, 14289, 11440, 11007, 17703, 14030, 14739, 14653, 16532, 20899, 10215, 15626, 18806, 25884, 16982, 17546, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [34]:
# Pass the tokenizer object itself, not the tokenize() helper function.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer1_en2ko, return_tensors="pt")

### model pipeline

In [35]:
nltk.download("punkt")

test = "안녕 내이름은 곱등이. 곱등 곱등"
sent_tokenize(test)

[nltk_data] Downloading package punkt to /aiffel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['안녕 내이름은 곱등이.', '곱등 곱등']

In [36]:
sample_text_en = dataset['train']["target"][7]
sample_text_ko = dataset['train']["input"][7]

In [103]:
summaries = {}

In [38]:
print("en : " + sample_text_en)
print("ko : " + sample_text_ko)

en : a spanish woman who lived in switzerland was killed , they said . further details about her were not immediately available .
ko : 경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .


In [39]:
pipe1_ko2en = pipeline("translation_ko_to_en", model="circulus/kobart-trans-ko-en-v2", tokenizer=tokenizer1_ko2en)

In [40]:
pipe_out1_ko2en = pipe1_ko2en(sample_text_ko, clean_up_tokenization_spaces=True, min_length=100)

Your input_length: 23 is bigger than 0.9 * max_length: 20. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


In [41]:
print(f"ko2en : {pipe_out1_ko2en}")

ko2en : [{'translation_text': 'The police did not reveal specific'}]


In [104]:
summaries["input_sentence"] = "\n".join(sent_tokenize(sample_text_ko))

In [105]:
summaries["circulus/kobart-trans-ko-en-v2"] = "\n".join(sent_tokenize("<output> : " + pipe_out1_ko2en[0]["translation_text"]))

In [106]:
summaries

{'input_sentence': '경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .',
 'circulus/kobart-trans-ko-en-v2': '<output> : The police did not reveal specific'}

In [107]:
torch.cuda.empty_cache()

The ko2en direction translated to some extent, but the en2ko direction produced only short, summary-like output.

## Helsinki-NLP/opus-mt-ko-en

### tokenizer

In [46]:
tokenizer2_ko2en = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")

### model pipeline

In [47]:
pipe2_ko2en = pipeline("translation_ko_to_en", model="Helsinki-NLP/opus-mt-ko-en", tokenizer=tokenizer2_ko2en)
torch.cuda.empty_cache()
# pipe2_ko2en = pipeline("translation_ko_to_en", model="alphahg/mbart-large-50-finetuned-en-to-ko-8603428-finetuned-en-to-ko-9914408", tokenizer=tokenizer2_ko2en)

In [48]:
torch.cuda.empty_cache()
pipe_out2_ko2en = pipe2_ko2en(sample_text_ko, clean_up_tokenization_spaces=True, min_length=100)

In [49]:
print(f"ko2en : {pipe_out2_ko2en}")

ko2en : [{'translation_text': 'The police did not reveal specific details other than the Spanish women who were living in Switzerland for the death of a woman, who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war.'}]
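
The output above repeats the same clause many times, which is plausibly a side effect of forcing `min_length=100` on a one-sentence input. A rough way to quantify such loops is the share of word n-grams that occur more than once; `repeated_ngram_ratio` below is a hypothetical helper, not a standard metric from any library.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=4):
    """Fraction of word n-grams in `text` that belong to an n-gram seen more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


short = "The police did not reveal specific"
looped = "who had been killed in the war and " * 5
print(repeated_ngram_ratio(short))
print(repeated_ngram_ratio(looped))
```

A value near 0 suggests ordinary text, while a value near 1 indicates the decoder is stuck in a loop; comparing runs with and without `min_length` would show whether the setting is the cause.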


In [108]:
summaries["Helsinki-NLP/opus-mt-ko-en"] = "\n".join(sent_tokenize("<output> : " + pipe_out2_ko2en[0]["translation_text"]))

In [109]:
summaries

{'input_sentence': '경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .',
 'circulus/kobart-trans-ko-en-v2': '<output> : The police did not reveal specific',
 'Helsinki-NLP/opus-mt-ko-en': '<output> : The police did not reveal specific details other than the Spanish women who were living in Switzerland for the death of a woman, who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war.'}

In [110]:
torch.cuda.empty_cache()

## alphahg/opus-mt-ko-en-finetuned-ko-to-en100

### tokenizer

In [53]:
tokenizer3_ko2en = AutoTokenizer.from_pretrained("alphahg/opus-mt-ko-en-finetuned-ko-to-en100")

### model pipeline

In [54]:
pipe3_ko2en = pipeline("translation_ko_to_en", model="alphahg/opus-mt-ko-en-finetuned-ko-to-en100", tokenizer=tokenizer3_ko2en)
torch.cuda.empty_cache()

In [55]:
pipe_out3_ko2en = pipe3_ko2en(sample_text_ko, clean_up_tokenization_spaces=True, min_length=100)

In [56]:
print(f"ko2en : {pipe_out3_ko2en}")

ko2en : [{'translation_text': 'The police did not disclose specific humanities except that it was a Spanish woman who lived in Switzerland for the death of a woman, and the police said that it was a woman who died in a state of distancing the death of a woman who died in a state of despair and was killed in the death of a woman who died in a state of despair and was killed in the death of a woman who died in the death of a woman who died in the death of a woman who died in the death of her husband.'}]


In [111]:
summaries["alphahg/opus-mt-ko-en-finetuned-ko-to-en100"] = "\n".join(sent_tokenize("<output> : " + pipe_out3_ko2en[0]["translation_text"]))

In [112]:
summaries

{'input_sentence': '경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .',
 'circulus/kobart-trans-ko-en-v2': '<output> : The police did not reveal specific',
 'Helsinki-NLP/opus-mt-ko-en': '<output> : The police did not reveal specific details other than the Spanish women who were living in Switzerland for the death of a woman, who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war.',
 'alphahg/opus-mt-ko-en-finetuned-ko-to-en100': '<output> : The police did not disclose specific humanities except that it was a Spanish woman who lived in Switzerland for the death of a woman, and the police said that it was a woman who died in a state of distancing the death of a woman who died in a state o

In [113]:
torch.cuda.empty_cache()

## jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90

### tokenizer

In [60]:
tokenizer4_ko2en = AutoTokenizer.from_pretrained("jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90")

### model pipeline

In [61]:
torch.cuda.empty_cache()
pipe4_ko2en = pipeline("translation_ko_to_en", model="jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90", tokenizer=tokenizer4_ko2en)

In [62]:
pipe_out4_ko2en = pipe4_ko2en(sample_text_ko, clean_up_tokenization_spaces=True, min_length=100)

In [63]:
print(f"ko2en : {pipe_out4_ko2en}")

ko2en : [{'translation_text': "en the police did not mention the fact that the person who died was a spain woman, other than that she lived in switzerland, nor didn't mention any specific personality information, except that she was a spain, spain, spain, spain, spain, spain, spain, spain, switzerland, spain, spain, spain."}]


In [114]:
summaries["jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90"] = "\n".join(sent_tokenize("<output> : " + pipe_out4_ko2en[0]["translation_text"]))

In [115]:
summaries

{'input_sentence': '경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .',
 'circulus/kobart-trans-ko-en-v2': '<output> : The police did not reveal specific',
 'Helsinki-NLP/opus-mt-ko-en': '<output> : The police did not reveal specific details other than the Spanish women who were living in Switzerland for the death of a woman, who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war.',
 'alphahg/opus-mt-ko-en-finetuned-ko-to-en100': '<output> : The police did not disclose specific humanities except that it was a Spanish woman who lived in Switzerland for the death of a woman, and the police said that it was a woman who died in a state of distancing the death of a woman who died in a state o

In [116]:
torch.cuda.empty_cache()

## hcho22/opus-mt-ko-en-finetuned-kr-to-en  
* Can't... be trusted...?

## inhee/m2m100_418M-finetuned-ko-to-  
* TypeError: got multiple values for keyword argument 'return_tensors'

## astrojihye/opus-mt-ko-en-finetuned-ko-to-en4  
* Bulgogi set meal..?

## Stxlla/ko-en
* TypeError: got multiple values for keyword argument 'return_tensors'

## Hayoung/my_awesome_ko_en_model  
* out

## tunib/electra-ko-en-base  
* IndexError: too many indices for tensor of dimension 2
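
The failures above were noted by hand. A small harness that runs each candidate and records either its output or its exception keeps such notes reproducible. This is a sketch under assumptions: `try_models` is a hypothetical helper, and the two runner functions are stand-ins; in the notebook each runner would wrap a `pipeline(...)` call for one checkpoint.

```python
def try_models(runners, text):
    """Call every runner on `text`, recording the output or the error message."""
    results = {}
    for name, run in runners.items():
        try:
            results[name] = {"ok": True, "output": run(text)}
        except Exception as exc:  # broad on purpose: any failure should be logged
            results[name] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return results


# Stand-in runners for illustration only.
def fake_translator(text):
    return text.upper()


def broken_translator(text):
    raise IndexError("too many indices for tensor of dimension 2")


report = try_models({"works": fake_translator, "fails": broken_translator}, "hello")
for name, entry in report.items():
    print(name, entry)
```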

In [117]:
summaries.keys()

dict_keys(['input_sentence', 'circulus/kobart-trans-ko-en-v2', 'Helsinki-NLP/opus-mt-ko-en', 'alphahg/opus-mt-ko-en-finetuned-ko-to-en100', 'jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90'])

In [145]:
print(f"""---- Overall model summary ----
summaries keys : {summaries.keys()}

input sentence : {summaries['input_sentence']}

model name : circulus/kobart-trans-ko-en-v2
{summaries['circulus/kobart-trans-ko-en-v2']}
google_translation : 경찰은 구체적으로 밝히지 않았다

model name : Helsinki-NLP/opus-mt-ko-en
{summaries['Helsinki-NLP/opus-mt-ko-en']}
google_translation : 경찰은 여성의 죽음을 위해 스위스에 거주하던 스페인 여성, 전쟁 중 전사한 여성, 전쟁 중 사망한 여성
그리고 전쟁 중 사망한 여성 등 구체적인 내용은 밝히지 않았다.
전쟁에서 죽었고 전쟁에서 죽었고 전쟁에서 죽었고 전쟁에서 죽었고 전쟁에서 죽었고 전쟁에서 죽었고 전쟁에서 죽었고
전쟁에서 죽은 사람 전쟁에서

model name : alphahg/opus-mt-ko-en-finetuned-ko-to-en100
{summaries['alphahg/opus-mt-ko-en-finetuned-ko-to-en100']}
google_translation : 경찰은 한 여성의 죽음으로 스위스에 살던 스페인 여성이라는 점 외에는 구체적인 인문학적 내용을 밝히지 않았으며,
경찰은 자가격리 중 사망한 여성의 죽음을 거리두기 상태에서 숨진 여성이라고 밝혔다.
자포자기한 상태에서 사망한 여인의 죽음으로 절망한 상태에서 사망한 여인의 죽음으로
사망한 여인의 죽음으로 사망한 여인의 죽음으로 그녀의 남편

model name : jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90
{summaries['jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90']}
google_translation : ko 경찰은 숨진 사람이 스페인 여성이라는 사실을 언급하지 않았고,
스위스에 살았다는 점 외에는 스페인, 스페인, 스페인, 스페인, 스페인 , 스페인, 스페인, 스페인, 스위스, 스페인, 스페인, 스페인.
""")

---- Overall model summary ----
summaries keys : dict_keys(['input_sentence', 'circulus/kobart-trans-ko-en-v2', 'Helsinki-NLP/opus-mt-ko-en', 'alphahg/opus-mt-ko-en-finetuned-ko-to-en100', 'jihyun/mbart-large-cc25-finetuned-ko-to-en_morp-90'])

input sentence : 경찰은 사망자에 대해 스위스에 거주하고 있던 스페인 여성이라는 것 외 에는 구체적인 인적사항을 밝히지 않았다 .

model name : circulus/kobart-trans-ko-en-v2
<output> : The police did not reveal specific
google_translation : 경찰은 구체적으로 밝히지 않았다

model name : Helsinki-NLP/opus-mt-ko-en
<output> : The police did not reveal specific details other than the Spanish women who were living in Switzerland for the death of a woman, who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war and who had been killed in the war.
google_translation : 경찰은 여성

## Standard-to-dialect translation pipeline test

### MarianMT

In [5]:
from transformers import MarianMTModel, MarianTokenizer

In [17]:
test_text_en = "HI, Hellow my name is jongin"
test_text_ko = "안녕, 반가워. 내 이름은 임종인이야"

In [20]:
ckpt = "guymorlan/DialectTransliterator"
tokenizer_dia = MarianTokenizer.from_pretrained(ckpt)

In [21]:
torch.cuda.empty_cache()
pipe_dia = pipeline("translation", model=ckpt, tokenizer=tokenizer_dia)

In [29]:
pipe_out_dia_en = pipe_dia(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_ko = pipe_dia(test_text_ko, clean_up_tokenization_spaces=True)

In [30]:
print(f"dialect test_en : {pipe_out_dia_en}")
print(f"dialect test_ko : {pipe_out_dia_ko}")

dialect test_en : [{'translation_text': 'hI، Helfōn mywem is jōnġin'}]
dialect test_ko : [{'translation_text': ',.     '}]


### T5_1

In [14]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [24]:
ckpt_1 = "declare-lab/dialect"
tokenizer_dia_1 = T5Tokenizer.from_pretrained(ckpt_1)

In [25]:
torch.cuda.empty_cache()
pipe_dia_1 = pipeline("translation", model=ckpt_1, tokenizer=tokenizer_dia_1)

In [31]:
pipe_out_dia_1_en = pipe_dia_1(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_1_ko = pipe_dia_1(test_text_ko, clean_up_tokenization_spaces=True)

In [33]:
print(f"dialect test_en : {pipe_out_dia_1_en}")
print(f"dialect test_ko : {pipe_out_dia_1_ko}")

dialect test_en : [{'translation_text': 'HI, Hallo, mein Name ist jongin'}]
dialect test_ko : [{'translation_text': ''}]


### T5_2

In [38]:
ckpt_2 = "shensq0814/DIALECT"
tokenizer_dia_2 = T5Tokenizer.from_pretrained(ckpt_2)

In [39]:
torch.cuda.empty_cache()
pipe_dia_2 = pipeline("translation", model=ckpt_2, tokenizer=tokenizer_dia_2)

In [40]:
pipe_out_dia_2_en = pipe_dia_2(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_2_ko = pipe_dia_2(test_text_ko, clean_up_tokenization_spaces=True)

In [41]:
print(f"dialect test_en : {pipe_out_dia_2_en}")
print(f"dialect test_ko : {pipe_out_dia_2_ko}")

dialect test_en : [{'translation_text': 'HI, Hallo, mein Name ist jongin'}]
dialect test_ko : [{'translation_text': ''}]


### MBART

In [45]:
from transformers import MBartForConditionalGeneration, BartTokenizer

In [46]:
ckpt_3 = "eunyounglee/mbart_finetuned_dialect_translation_4"
tokenizer_dia_3 = BartTokenizer.from_pretrained(ckpt_3)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BartTokenizer'.


In [47]:
torch.cuda.empty_cache()
pipe_dia_3 = pipeline("translation", model=ckpt_3, tokenizer=tokenizer_dia_3)

In [48]:
pipe_out_dia_3_en = pipe_dia_3(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_3_ko = pipe_dia_3(test_text_ko, clean_up_tokenization_spaces=True)

In [49]:
print(f"dialect test_en : {pipe_out_dia_3_en}")
print(f"dialect test_ko : {pipe_out_dia_3_ko}")

dialect test_en : [{'translation_text': 'HI, Hellow my name is jongin'}]
dialect test_ko : [{'translation_text': '안녕, 반가워. 내 이름은 임종인이야'}]


### BART_1

In [8]:
from transformers import BartForConditionalGeneration, BartTokenizer, PreTrainedTokenizerFast

In [69]:
ckpt_4 = "circulus/kobart-trans-dialect-v2"
tokenizer_dia_4 = PreTrainedTokenizerFast.from_pretrained(ckpt_4)

In [70]:
torch.cuda.empty_cache()
pipe_dia_4 = pipeline("translation", model=ckpt_4, tokenizer=tokenizer_dia_4)

In [71]:
pipe_out_dia_4_en = pipe_dia_4(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_4_ko = pipe_dia_4(test_text_ko, clean_up_tokenization_spaces=True)

In [72]:
print(f"dialect test_en : {pipe_out_dia_4_en}")
print(f"dialect test_ko : {pipe_out_dia_4_ko}")

dialect test_en : [{'translation_text': '마마  name  소련 바뀌가고 바뀌며며 스 스 스 스로'}]
dialect test_ko : [{'translation_text': '내 이름은 임종인이야야. 내 이름은 임종인이야워. 내 이름은 임'}]


### BART_2

In [18]:
ckpt_5 = "circulus/kobart-trans-jeju-v2"
tokenizer_dia_5 = PreTrainedTokenizerFast.from_pretrained(ckpt_5)

In [19]:
torch.cuda.empty_cache()
pipe_dia_5 = pipeline("translation", model=ckpt_5, tokenizer=tokenizer_dia_5)

In [20]:
pipe_out_dia_5_en = pipe_dia_5(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_5_ko = pipe_dia_5(test_text_ko, clean_up_tokenization_spaces=True)

In [21]:
print(f"dialect test_en : {pipe_out_dia_5_en}")
print(f"dialect test_ko : {pipe_out_dia_5_ko}")

dialect test_en : [{'translation_text': 'HI, Hellowmy name is jong점수'}]
dialect test_ko : [{'translation_text': '안녕, 반갑수다. 내 이름은 임종인이야'}]
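
For the dialect models, the interesting part is which words changed relative to the standard-Korean input; in the Jeju output above only one token differs. A sketch that lists the changed spans with `difflib` from the standard library; `changed_tokens` is a hypothetical helper, and splitting on whitespace is a simplification that ignores morpheme boundaries.

```python
import difflib


def changed_tokens(source, output):
    """Return (source_span, output_span) pairs where the whitespace tokens differ."""
    src, out = source.split(), output.split()
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(src[i1:i2]), " ".join(out[j1:j2])))
    return changes


standard = "안녕, 반가워. 내 이름은 임종인이야"
jeju = "안녕, 반갑수다. 내 이름은 임종인이야"
print(changed_tokens(standard, jeju))
```

An empty result means the model echoed its input unchanged, which is a quick way to flag checkpoints that do not perform any dialect conversion.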


### BART_3

In [22]:
ckpt_6 = "eunjin/kobart_jeju_translator"
tokenizer_dia_6 = PreTrainedTokenizerFast.from_pretrained(ckpt_6)

In [23]:
torch.cuda.empty_cache()
pipe_dia_6 = pipeline("translation", model=ckpt_6, tokenizer=tokenizer_dia_6)

In [24]:
pipe_out_dia_6_en = pipe_dia_6(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_6_ko = pipe_dia_6(test_text_ko, clean_up_tokenization_spaces=True)

In [25]:
print(f"dialect test_en : {pipe_out_dia_6_en}")
print(f"dialect test_ko : {pipe_out_dia_6_ko}")

dialect test_en : [{'translation_text': 'Hel Hellow my name is jongin'}]
dialect test_ko : [{'translation_text': '안녕. 안녕. 내 이름은 임종인이라'}]


### BART_4

In [26]:
ckpt_7 = "eunjin/kobart_gyeongsang_translator"
tokenizer_dia_7 = PreTrainedTokenizerFast.from_pretrained(ckpt_7)

In [27]:
torch.cuda.empty_cache()
pipe_dia_7 = pipeline("translation", model=ckpt_7, tokenizer=tokenizer_dia_7)

In [28]:
test_text_ko_gs = "오늘의 하루는 어떘어? 나는 즐겁고 기뻤어 너는 어때?"

In [29]:
pipe_out_dia_7_en = pipe_dia_7(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_7_ko = pipe_dia_7(test_text_ko_gs, clean_up_tokenization_spaces=True)

In [30]:
print(f"dialect test_en : {pipe_out_dia_7_en}")
print(f"dialect test_ko : {pipe_out_dia_7_ko}")

dialect test_en : [{'translation_text': 'Hel Hellow my name is jongin'}]
dialect test_ko : [{'translation_text': '오늘의 하루는 어어? 나는 즐겁고 기뻤어 니는 어때'}]


### BART_5

In [31]:
ckpt_8 = "circulus/kobart-trans-gyeongsang-v2"
tokenizer_dia_8 = PreTrainedTokenizerFast.from_pretrained(ckpt_8)

In [32]:
torch.cuda.empty_cache()
pipe_dia_8 = pipeline("translation", model=ckpt_8, tokenizer=tokenizer_dia_8)

In [33]:
pipe_out_dia_8_en = pipe_dia_8(test_text_en, clean_up_tokenization_spaces=True)
pipe_out_dia_8_ko = pipe_dia_8(test_text_ko_gs, clean_up_tokenization_spaces=True)

In [34]:
print(f"dialect test_en : {pipe_out_dia_8_en}")
print(f"dialect test_ko : {pipe_out_dia_8_ko}")

dialect test_en : [{'translation_text': 'HI, Hellow my name is jongin'}]
dialect test_ko : [{'translation_text': '오늘의 하루는 어노? 나는 즐겁고 기뻤어 너는 어때?'}]


### BART_6

In [35]:
ckpt_9 = "circulus/kobart-trans-chungcheong-v2"
tokenizer_dia_9 = PreTrainedTokenizerFast.from_pretrained(ckpt_9)

In [36]:
torch.cuda.empty_cache()
pipe_dia_9 = pipeline("translation", model=ckpt_9, tokenizer=tokenizer_dia_9)

In [37]:
pipe_out_dia_9_en = pipe_dia_9(test_text_ko_gs, clean_up_tokenization_spaces=True)
pipe_out_dia_9_ko = pipe_dia_9(test_text_ko, clean_up_tokenization_spaces=True)

In [38]:
print(f"dialect test_en : {pipe_out_dia_9_en}")
print(f"dialect test_ko : {pipe_out_dia_9_ko}")

dialect test_en : [{'translation_text': '오늘의 하루는 어어? 나는 즐겁고 기뻤어 너는 어뗘?'}]
dialect test_ko : [{'translation_text': '안녕, 반가워. 내 이름은 임종인이여.'}]


### BART_7

In [39]:
ckpt_10 = "circulus/kobart-trans-jeolla-v2"
tokenizer_dia_10 = PreTrainedTokenizerFast.from_pretrained(ckpt_10)

In [40]:
torch.cuda.empty_cache()
pipe_dia_10 = pipeline("translation", model=ckpt_10, tokenizer=tokenizer_dia_10)

In [41]:
test_text_jd = "어디가는거야? 거기 우산좀 줘봐 오늘은 비가 오니까"

In [42]:
pipe_out_dia_10_en = pipe_dia_10(test_text_ko_gs, clean_up_tokenization_spaces=True)
pipe_out_dia_10_ko = pipe_dia_10(test_text_ko, clean_up_tokenization_spaces=True)
pipe_out_dia_10_jd = pipe_dia_10(test_text_jd, clean_up_tokenization_spaces=True)

In [43]:
print(f"dialect test_en : {pipe_out_dia_10_en}")
print(f"dialect test_ko : {pipe_out_dia_10_ko}")
print(f"dialect test_ko : {pipe_out_dia_10_jd}")

dialect test_en : [{'translation_text': '오늘의 하루는 어어? 나는 즐겁고 기뻤어 너는 어뗘?'}]
dialect test_ko : [{'translation_text': '안녕, 반가워. 내 이름은 임종인이여'}]
dialect test_ko : [{'translation_text': '어디가는거야? 거기 우산좀 줘봐 오늘은 비가 온께'}]


### BART_8

In [44]:
ckpt_11 = "circulus/kobart-trans-gangwon-v2"
tokenizer_dia_11 = PreTrainedTokenizerFast.from_pretrained(ckpt_11)

In [45]:
torch.cuda.empty_cache()
pipe_dia_11 = pipeline("translation", model=ckpt_11, tokenizer=tokenizer_dia_11)

In [46]:
pipe_out_dia_11_en = pipe_dia_11(test_text_ko_gs, clean_up_tokenization_spaces=True)
pipe_out_dia_11_ko = pipe_dia_11(test_text_ko, clean_up_tokenization_spaces=True)
pipe_out_dia_11_jd = pipe_dia_11(test_text_jd, clean_up_tokenization_spaces=True)

In [47]:
print(f"dialect test_en : {pipe_out_dia_11_en}")
print(f"dialect test_ko : {pipe_out_dia_11_ko}")
print(f"dialect test_ko : {pipe_out_dia_11_jd}")

dialect test_en : [{'translation_text': '오늘의 하루는 어어? 나는 즐겁고 기뻤어 니는 어때'}]
dialect test_ko : [{'translation_text': '안녕, 반가워. 내 이름은 임종인이래.'}]
dialect test_ko : [{'translation_text': '어디가는거야? 거 우산좀 줘봐 오늘은 비가 오니까'}]


## Checklist of models with reasonably sound translations (model name)
* JJ (Jeju)
>* circulus/kobart-trans-jeju-v2 (BART)
>* eunjin/kobart_jeju_translator (BART)
* GS (Gyeongsang)
>* circulus/kobart-trans-gyeongsang-v2 (BART)
>* eunjin/kobart_gyeongsang_translator (BART)
* GW (Gangwon)
>* circulus/kobart-trans-gangwon-v2 (BART)
* CC (Chungcheong)
>* circulus/kobart-trans-chungcheong-v2 (BART)
* JD (Jeolla)
>* circulus/kobart-trans-jeolla-v2 (BART)

In [51]:
print(f""" --Model summary--

*Jeju
    circulus/kobart-trans-jeju-v2 : {pipe_out_dia_5_ko[0]['translation_text']}
    eunjin/kobart_jeju_translator : {pipe_out_dia_6_ko[0]['translation_text']}
    
*Gyeongsang
    circulus/kobart-trans-gyeongsang-v2 : {pipe_out_dia_8_ko[0]['translation_text']}
    eunjin/kobart_gyeongsang_translator : {pipe_out_dia_7_ko[0]['translation_text']}
    
*Gangwon
    circulus/kobart-trans-gangwon-v2 : {pipe_out_dia_11_ko[0]['translation_text']}
                                       {pipe_out_dia_11_jd[0]['translation_text']}
                                       
*Chungcheong
    circulus/kobart-trans-chungcheong-v2 : {pipe_out_dia_9_en[0]['translation_text']}
                                           {pipe_out_dia_9_ko[0]['translation_text']}
                                           
*Jeolla
    circulus/kobart-trans-jeolla-v2 : {pipe_out_dia_10_en[0]['translation_text']}
                                      {pipe_out_dia_10_jd[0]['translation_text']}
                                      
""")

 --Model summary--

*Jeju
    circulus/kobart-trans-jeju-v2 : 안녕, 반갑수다. 내 이름은 임종인이야
    eunjin/kobart_jeju_translator : 안녕. 안녕. 내 이름은 임종인이라
    
*Gyeongsang
    circulus/kobart-trans-gyeongsang-v2 : 오늘의 하루는 어노? 나는 즐겁고 기뻤어 너는 어때?
    eunjin/kobart_gyeongsang_translator : 오늘의 하루는 어어? 나는 즐겁고 기뻤어 니는 어때
    
*Gangwon
    circulus/kobart-trans-gangwon-v2 : 안녕, 반가워. 내 이름은 임종인이래.
                                       어디가는거야? 거 우산좀 줘봐 오늘은 비가 오니까
                                       
*Chungcheong
    circulus/kobart-trans-chungcheong-v2 : 오늘의 하루는 어어? 나는 즐겁고 기뻤어 너는 어뗘?
                                           안녕, 반가워. 내 이름은 임종인이여.
                                           
*Jeolla
    circulus/kobart-trans-jeolla-v2 : 오늘의 하루는 어어? 나는 즐겁고 기뻤어 너는 어뗘?
                                      어디가는거야? 거기 우산좀 줘봐 오늘은 비가 온께
                                      



### Model

In [94]:
num_labels = 2
num_epochs = 5
batch_size = 32

In [45]:
model = BartForConditionalGeneration.from_pretrained("circulus/kobart-trans-en-ko-v2")

In [46]:
# Release cached, unreferenced GPU blocks before training (one call is enough).
torch.cuda.empty_cache()

In [47]:
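def model_init():
    # AutoModel resolves this checkpoint to the bare T5Model, i.e. without the
    # lm_head that T5ForConditionalGeneration uses for sequence-to-sequence training.
    return AutoModel.from_pretrained("KETI-AIR/ke-t5-small")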

In [48]:
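args = TrainingArguments(
    output_dir="data_test",
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",        # checkpoint every epoch, in step with evaluation
    learning_rate=2e-5,
    save_steps=1_000_000,         # only consulted when save_strategy="steps"
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    disable_tqdm=True,
    load_best_model_at_end=True,  # reload the best checkpoint after training
)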

In [49]:
trainer = Trainer(model_init=model_init,
                  args=args,
                  data_collator=data_collator,
                  train_dataset=dataset_encoded["train"],
                  eval_dataset=dataset_encoded["test"],
                  tokenizer=tokenizer)

https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/config.json not found in cache or force_download set to True, downloading to /aiffel/.cache/huggingface/transformers/tmpndf3dkq5


storing https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/config.json in cache at /aiffel/.cache/huggingface/transformers/a240b555451a28d400c0fcd042656bc28d18c553be5503a17a5fff9ab86ecf1b.cfa5a0bf5803bcceef6e8ff70f41932d7d5eb3b077c6885fadf7e912703f33e9
creating metadata file for /aiffel/.cache/huggingface/transformers/a240b555451a28d400c0fcd042656bc28d18c553be5503a17a5fff9ab86ecf1b.cfa5a0bf5803bcceef6e8ff70f41932d7d5eb3b077c6885fadf7e912703f33e9
loading configuration file https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/config.json from cache at /aiffel/.cache/huggingface/transformers/a240b555451a28d400c0fcd042656bc28d18c553be5503a17a5fff9ab86ecf1b.cfa5a0bf5803bcceef6e8ff70f41932d7d5eb3b077c6885fadf7e912703f33e9
Model config T5Config {
  "_name_or_path": "hf/ke-t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.0,
  "eos_token_id": 1,
  "feed_forward_proj"

storing https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/pytorch_model.bin in cache at /aiffel/.cache/huggingface/transformers/b507b951b740a5181bad562d35074d8ca263278c343090e1a5ebf6e19c4576d6.9718d7ed3498702cfca52320624732deeacc9d0f4876059768b09881e73b561d
creating metadata file for /aiffel/.cache/huggingface/transformers/b507b951b740a5181bad562d35074d8ca263278c343090e1a5ebf6e19c4576d6.9718d7ed3498702cfca52320624732deeacc9d0f4876059768b09881e73b561d
loading weights file https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/pytorch_model.bin from cache at /aiffel/.cache/huggingface/transformers/b507b951b740a5181bad562d35074d8ca263278c343090e1a5ebf6e19c4576d6.9718d7ed3498702cfca52320624732deeacc9d0f4876059768b09881e73b561d
Some weights of the model checkpoint at KETI-AIR/ke-t5-small were not used when initializing T5Model: ['lm_head.weight']
- This IS expected if you are initializing T5Model from the checkpoint of a model trained on another task or with another architecture (e

In [50]:
trainer.train()

loading configuration file https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/config.json from cache at /aiffel/.cache/huggingface/transformers/a240b555451a28d400c0fcd042656bc28d18c553be5503a17a5fff9ab86ecf1b.cfa5a0bf5803bcceef6e8ff70f41932d7d5eb3b077c6885fadf7e912703f33e9
Model config T5Config {
  "_name_or_path": "hf/ke-t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.0,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.11.3",
  "use_cache": true,
  "vocab_size": 64128
}

loading weights file https://huggingface.co/KETI-AIR/ke-t5-small/resolve/main/pytorch_model.bin fro

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
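The `ValueError` is consistent with the warning logged when the `Trainer` was built: `AutoModel` instantiated a bare `T5Model` and dropped `lm_head.weight`. A bare `T5Model` expects the caller to pass `decoder_input_ids` explicitly, whereas `T5ForConditionalGeneration` (already imported at the top of the notebook) derives them from `labels` by shifting the labels one position to the right. The sketch below shows one possible fix together with a plain-Python illustration of that shift. It assumes that `dataset_encoded` carries a `labels` column, which this section does not show; `shift_right` is an illustrative helper, not the library's own function, and the two token ids are taken from the `T5Config` printed above.

```python
from typing import List

DECODER_START_TOKEN_ID = 0  # "decoder_start_token_id" in the printed T5Config
PAD_TOKEN_ID = 0            # "pad_token_id" in the printed T5Config
IGNORE_INDEX = -100         # label value that the loss function skips


def model_init():
    """Load ke-t5-small with its LM head rather than the bare T5Model."""
    # Imported inside the function so that defining it loads nothing.
    from transformers import T5ForConditionalGeneration
    return T5ForConditionalGeneration.from_pretrained("KETI-AIR/ke-t5-small")


def shift_right(labels: List[int]) -> List[int]:
    """Illustrate how decoder inputs are built from labels (teacher forcing)."""
    # Prepend the start token and drop the last label.
    shifted = [DECODER_START_TOKEN_ID] + labels[:-1]
    # Positions masked out of the loss become padding in the decoder input.
    return [PAD_TOKEN_ID if token == IGNORE_INDEX else token for token in shifted]


print(shift_right([5, 6, 1, IGNORE_INDEX]))
```

If `data_collator` is a `DataCollatorWithPadding`, the variable-length `labels` may also need a collator that pads them; `DataCollatorForSeq2Seq` from `transformers` is intended for that case, though whether it is needed here depends on how `dataset_encoded` was built.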