# Finetune GPT-2 Indonesian Pantun
This notebook fine-tunes the Indonesian GPT-2 model ([pretrained by Cahya Wirawan](https://github.com/cahya-wirawan/indonesian-language-models)) on a dataset of Indonesian pantun.

GitHub: https://github.com/ilhamfp/puisi-pantun-generator  
Data: https://www.kaggle.com/ilhamfp31/pantun-indonesia  
Example: https://www.kaggle.com/ilhamfp31/pembangkitan-pantun-otomatis  

# Preprocess

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os
# Expose only the GPU with index 1 to CUDA-based libraries
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

data = pd.read_csv("../puisi-pantun-generator/data/pantun.csv")
print(data.shape)
data.head()

(440, 2)


|   | teks | tipe |
|---|------|------|
| 0 | Pakai baju ukurannya pas \n Baju biru pemberia... | Pantun Bijak |
| 1 | Orang bijak cinta bahasa \n Bahasa luas Bahasa... | Pantun Pendidikan |
| 2 | Kepada siapa datangnya wahyu \n Kepada Nabi wa... | Pantun Nasihat |
| 3 | Citah perang melawan citah \n Seekor pelatuk m... | Pantun Pendidikan |
| 4 | Ada gadis perawan, \n paling cantik di kampung... | Pantun Agama |


In [3]:
train, test = train_test_split(data.teks.values,
                               test_size=0.05,
                               random_state=31) 

print('Len train: ', len(train))
print('Len test: ', len(test))

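# Write one pantun per line. repr(...)[1:-1] escapes real newlines as a literal "\n",
# so that --line_by_line training treats each pantun as a single example.
with open('train.txt', 'w', encoding='utf-8') as f:
    for text in train:
        f.write('<BOS> ' + repr(text)[1:-1] + ' <EOS>\n')

with open('test.txt', 'w', encoding='utf-8') as f:
    for text in test:
        f.write('<BOS> ' + repr(text)[1:-1] + ' <EOS>\n')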

Len train:  418
Len test:  22
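
The preprocessing step stores each multi-line pantun on a single line by escaping its line breaks and wrapping it in `<BOS>` and `<EOS>` markers. The sketch below illustrates that round trip on one example. The helper names `to_training_line` and `from_generated_line` are hypothetical, and the explicit `str.replace` is a simplified stand-in for the notebook's `repr(text)[1:-1]` trick, which also escapes other characters.

```python
BOS, EOS = "<BOS>", "<EOS>"

def to_training_line(text):
    """Flatten a pantun to one line (hypothetical helper, not from the notebook)."""
    # Replace each real newline with the two characters backslash + n
    escaped = text.replace("\n", "\\n")
    return f"{BOS} {escaped} {EOS}"

def from_generated_line(line):
    """Recover the multi-line pantun from a flattened line."""
    body = line.strip()
    if body.startswith(BOS):
        body = body[len(BOS):]
    # Keep only the text before the end marker, if one is present
    body = body.split(EOS, 1)[0]
    # Turn the literal backslash + n back into real newlines
    return body.replace("\\n", "\n").strip()

sample = "Pakai baju ukurannya pas \n Baju biru"
flat = to_training_line(sample)
print(flat)
print(from_generated_line(flat) == sample)
```

Keeping one example per line matters here because the training command reads the files line by line.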


# Train
Fine-tune using Hugging Face's `run_language_modeling.py` script with [Raymond's](https://towardsdatascience.com/fine-tuning-gpt2-for-text-generation-using-pytorch-2ee61a4f1ba7) modifications.

In [1]:
import transformers
print(transformers.__version__)

2.1.1


In [7]:
!python ../puisi-pantun-generator/src/run_language_modeling.py \
--output_dir='./gpt2-pantun' \
--model_type=gpt2 \
--model_name_or_path='cahya/gpt2-small-indonesian-522M' \
--do_train \
--train_data_file='train.txt' \
--do_eval \
--eval_data_file='test.txt' \
--per_device_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--line_by_line \
--evaluate_during_training \
--learning_rate 5e-5 \
--num_train_epochs=2

Traceback (most recent call last):
  File "../puisi-pantun-generator/src/run_language_modeling.py", line 29, in <module>
    from transformers import (
ImportError: cannot import name 'CONFIG_MAPPING' from 'transformers' (/Users/syahrulanuar/miniforge3/envs/ganimg/lib/python3.8/site-packages/transformers/__init__.py)
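
The traceback above shows that the installed `transformers` (2.1.1, as printed earlier) does not export `CONFIG_MAPPING`, which the script imports. This most likely means the library is older than the script expects, so upgrading `transformers` is the probable fix. A minimal sketch for checking this before launching training follows. The helper name `has_symbol` is hypothetical and not part of the original notebook.

```python
import importlib

def has_symbol(module_name, symbol):
    """Return True if `module_name` can be imported and exposes `symbol`.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
        getattr(module, symbol)
    except (ImportError, AttributeError):
        return False
    return True

# The training script imports CONFIG_MAPPING from transformers,
# so this should print True before the script is run.
print(has_symbol("transformers", "CONFIG_MAPPING"))
```

If this prints `False`, the training command will fail with the same `ImportError` until a compatible `transformers` release is installed.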


In [None]:
!head './gpt2-pantun/eval_results_lm.txt'

# Generate Pantun

In [None]:
!python ../usr/lib/run_generation_py/run_generation_py.py \
--model_type gpt2 \
--model_name_or_path './gpt2-pantun' \
--length 125 \
--prompt "<BOS>" \
--stop_token "<EOS>" \
--k 50 \
--num_return_sequences 100 > generated-pantun.txt
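
The `--k 50` flag restricts sampling to the 50 most likely tokens at each generation step (top-k sampling). The following is a minimal pure-Python sketch of that idea on a toy list of scores. It is not the script's actual implementation, and the function name `top_k_sample` is hypothetical.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one index from the k highest-scoring entries of `logits`."""
    if k <= 0 or k > len(logits):
        k = len(logits)  # treat out-of-range k as "no filtering"
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving entries (shifted by the max for stability)
    highest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - highest) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(31)
toy_logits = [0.1, 5.0, 4.0, -2.0, 3.0]
# With k=2 only indices 1 and 2 can ever be drawn
print([top_k_sample(toy_logits, 2, rng) for _ in range(10)])
```

A smaller `k` makes the output more conservative, while a larger `k` allows rarer tokens and more varied pantun.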

In [None]:
result = []
with open('generated-pantun.txt', 'r', encoding='utf-8') as f:
    for line in f:
        if line.startswith('<BOS>'):
            # Drop the <BOS> prefix and turn the literal "\n" written during
            # preprocessing back into real line breaks.
            result.append(line[len('<BOS>'):].replace('\\n', '\n').strip())

for idx, pantun in enumerate(result):
    print(' ===========\n   Pantun {}\n ===========\n'.format(idx))
    print(pantun)