# Finetune GPT-2 Indonesian Pantun
This notebook fine-tunes an Indonesian GPT-2 model ([pretrained by Cahya Wirawan](https://github.com/cahya-wirawan/indonesian-language-models)) on Indonesian pantun, a traditional Malay verse form.

GitHub: https://github.com/ilhamfp/puisi-pantun-generator  
Data: https://www.kaggle.com/ilhamfp31/pantun-indonesia  
Example: https://www.kaggle.com/ilhamfp31/pembangkitan-pantun-otomatis  

# Preprocess

In [1]:
import os

# Expose only the GPU with index 1 to CUDA; set this before any CUDA library loads.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("/kaggle/input/pantun-indonesia/pantun.csv")
print(data.shape)
data.head()

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/pantun-indonesia/pantun.csv'

In [None]:
train, test = train_test_split(data.teks.values,
                               test_size=0.05,
                               random_state=31) 

print('Len train: ', len(train))
print('Len test: ', len(test))

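# Write one pantun per line, wrapped in <BOS>/<EOS> markers.
# repr() escapes line breaks so each pantun stays on a single line
# (needed for --line_by_line); [1:-1] strips the surrounding quotes.
with open('train.txt', 'w', encoding='utf-8') as f:
    for text in train:
        f.write('<BOS> ' + repr(text)[1:-1] + ' <EOS>\n')

with open('test.txt', 'w', encoding='utf-8') as f:
    for text in test:
        f.write('<BOS> ' + repr(text)[1:-1] + ' <EOS>\n')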
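
The `repr(text)[1:-1]` step above keeps every pantun on one line by escaping its line breaks. Below is a minimal sketch of that round trip, assuming the `teks` values contain real newline characters (if the CSV already stores a literal backslash followed by `n`, `repr` doubles the backslash instead). The helper names `encode_line` and `decode_line` and the sample text are hypothetical and not part of the notebook.

```python
def encode_line(text):
    # repr() escapes control characters such as newlines;
    # [1:-1] drops the quotes that repr() adds around the string.
    return '<BOS> ' + repr(text)[1:-1] + ' <EOS>'


def decode_line(line):
    # Strip the markers, then turn the escaped newline back into a real one.
    body = line[len('<BOS> '):-len(' <EOS>')]
    return body.replace('\\n', '\n')


sample = 'baris satu\nbaris dua'  # made-up two-line text
encoded = encode_line(sample)
print(encoded)
print(decode_line(encoded) == sample)
```

Because the encoded string contains no real line break, each training example occupies exactly one line of `train.txt`.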

# Train
Fine-tune the model using Hugging Face's `run_language_modeling.py` script with [Raymond's](https://towardsdatascience.com/fine-tuning-gpt2-for-text-generation-using-pytorch-2ee61a4f1ba7) modification.

In [None]:
!pip install transformers==3.0.2

In [None]:
import transformers
print(transformers.__version__)

In [None]:
!python ../usr/lib/run_language_modeling_py/run_language_modeling_py.py \
--output_dir='./gpt2-pantun' \
--model_type=gpt2 \
--model_name_or_path='cahya/gpt2-small-indonesian-522M' \
--do_train \
--train_data_file='train.txt' \
--do_eval \
--eval_data_file='test.txt' \
--per_device_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--line_by_line \
--evaluate_during_training \
--learning_rate=5e-5 \
--num_train_epochs=2

In [None]:
!head './gpt2-pantun/eval_results_lm.txt'
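
In the upstream transformers 3.0.2 script, the evaluation file holds `key = value` lines, and the reported perplexity is the exponential of the evaluation loss. The sketch below illustrates that relationship and one way to parse such a file, assuming the modified script used here keeps the upstream format. The helper name `parse_eval_results` and the loss value are hypothetical, not results from this run.

```python
import math


def parse_eval_results(text):
    # Parse "key = value" lines into a dict of floats.
    results = {}
    for line in text.splitlines():
        if '=' in line:
            key, value = line.split('=', 1)
            results[key.strip()] = float(value.strip())
    return results


eval_loss = 3.0  # hypothetical loss, for illustration only
sample_file = 'perplexity = {}\n'.format(math.exp(eval_loss))

metrics = parse_eval_results(sample_file)
print(metrics)
# Taking the log of the perplexity recovers the evaluation loss.
print(math.log(metrics['perplexity']))
```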

# Generate Pantun

In [None]:
!python ../usr/lib/run_generation_py/run_generation_py.py \
--model_type gpt2 \
--model_name_or_path './gpt2-pantun' \
--length 125 \
--prompt "<BOS>" \
--stop_token "<EOS>" \
--k 50 \
--num_return_sequences 100 > generated-pantun.txt
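
The `--k 50` flag limits sampling to the 50 most likely next tokens at each step, assuming the script follows the upstream `run_generation.py` and passes it to `model.generate` as `top_k`. The pure-Python sketch below shows top-k sampling on a toy distribution for illustration only. The function name `top_k_sample` and the toy logits are hypothetical, and the real script operates on model logits rather than a hand-written list.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (subtract the max for numerical stability).
    largest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - largest) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(31)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
# With k=2, only indices 0 and 1 can ever be drawn.
print(sorted(set(samples)))
```

A smaller `k` makes the output more conservative, while a larger `k` admits rarer tokens and more varied text.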

In [None]:
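result = []
with open('generated-pantun.txt', 'r') as f:
    for line in f:
        # Keep only the generated sequences, which start with the prompt token.
        if line.startswith('<BOS>'):
            # Drop the '<BOS>' prefix and turn each escaped newline marker
            # (two backslashes followed by 'n') into a real line break.
            result.append(line[5:].replace(r'\\n', '\n'))

for idx, pantun in enumerate(result):
    print(' ===========\n   Pantun {}\n ===========\n'.format(idx))
    print(pantun)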