# Building a vocab for BERT training

Let's build the vocab that will be used for training. \
(The code below builds an English vocab with WordPiece; for the full vocab, use create_custom_vocab.ipynb.)

---

In [None]:
! python src/make_vocab/wordpiece.py \
--corpus=rsc/training_data/final_text_only_eng.txt \
--iter=10000 \
--fname=rsc/my_conf/ices_eng_vocab.txt

terminated vocabulary scanning
('##o', '##n')
('##e', '##r')
('##t', '##i')
('##e', '##n')
('##i', '##n')
('##e', '##s')
('##a', '##l')
('##q', '##u')
('##o', '##r')
('##qu', '##o')
('##a', '##n')
('##h', '##e')
('##ti', '##on')
('##e', '##d')
('##a', '##r')
('o', '##f')
('##s', '##quo')
('##i', '##g')
('##a', '##t')
('##l', '##e')
('##n', '##d')
('t', '##he')
('##i', '##c')
('##r', '##o')
('##i', '##t')
('##r', '##e')
('##i', '##s')
('##a', '##s')
('a', '##nd')
('##in', '##g')
('##en', '##t')
('##s', '##p')
('##e', '##l')
('i', '##n')
('F', '##ig')
('##a', '##b')
('##b', '##sp')
('n', '##bsp')
('##a', '##c')
('##e', '##t')
('##i', '##d')
('r', '##squo')
('##e', '##c')
('##i', '##m')
('l', '##squo')
('##m', '##p')
('##u', '##s')
('##o', '##l')
('##o', '##t')
('##ab', '##le')
('##u', '##l')
('T', '##able')
('##u', '##r')
('##i', '##l')
('a', '##l')
('##o', '##d')
('s', '##t')
('t', '##o')
('l', '##t')
('##a', '##m')
('##c', '##e')
('c', '##on')
('##v', '##e')
('f', '##or')
('##m', '##en
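
The pairs printed above look like successive merge steps: two adjacent pieces are joined into one longer piece, and `##` marks a piece that continues a word. The internals of `src/make_vocab/wordpiece.py` are not shown in this notebook, so the following is only a minimal sketch of that idea. It assumes pairs are ranked by raw frequency (as in BPE); the actual script may score pairs differently. The names `to_pieces`, `merge_pair`, and `learn_merges` are made up for this illustration.

```python
from collections import Counter


def to_pieces(word):
    # The first character stays bare; every later character gets the '##' prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]


def merge_pair(left, right):
    # The right piece is never word-initial, so it always starts with '##'.
    # ('##ti', '##on') -> '##tion', ('t', '##he') -> 'the'
    return left + right[2:]


def learn_merges(word_counts, num_merges):
    """Return the list of merged pairs, most frequent pair first at each step.

    Assumption: pairs are ranked by raw frequency, which may not match the
    scoring used by the notebook's wordpiece.py.
    """
    words = {tuple(to_pieces(w)): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for pieces, count in words.items():
            for pair in zip(pieces, pieces[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = merge_pair(*best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_words = {}
        for pieces, count in words.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges


print(learn_merges({"the": 5, "then": 2, "other": 1}, 2))
```

On this toy input the first merge joins `##h` and `##e`, and the second joins `t` and `##he`, which has the same shape as the `('##h', '##e')` and `('t', '##he')` lines in the output above.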

# Building preprocessed data for BERT training

Now that the vocab is ready, let's preprocess the corpus for BERT training.

In [5]:
!python src/make_preprocessed_data/create_pretraining_data.py \
--input_file=rsc/training_data/paper_full_sentence_v4.txt \
--vocab_file=rsc/my_conf/ices_vocab_v4_sejin.txt \
--do_lower_case=True \
--max_seq_length=512 \
--output_file=rsc/processed_training_data/paper_full_pretraining_data_tf_v4.record



W1124 06:51:16.645816 47249422990848 module_wrapper.py:139] From src/make_preprocessed_data/create_pretraining_data.py:425: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1124 06:51:16.646093 47249422990848 module_wrapper.py:139] From src/make_preprocessed_data/create_pretraining_data.py:425: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1124 06:51:16.964357 47249422990848 module_wrapper.py:139] From src/make_preprocessed_data/create_pretraining_data.py:432: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.


W1124 06:51:16.968078 47249422990848 module_wrapper.py:139] From src/make_preprocessed_data/create_pretraining_data.py:434: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:*** Reading from input files ***
I1124 06:51:16.968315 47249422990848 create_pretraining_data.py:434] *** Reading from input fil
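
For intuition about what this preprocessing step produces, here is a minimal sketch of masked-LM masking. It assumes the defaults described for the original BERT (about 15% of tokens selected; of those, 80% replaced by `[MASK]`, 10% kept, 10% replaced by a random vocab token) and a cap of 20 predictions per sequence, which matches `--max_predictions_per_seq=20` in the training command later in this notebook. Whether this repository's `create_pretraining_data.py` uses exactly these values is an assumption, and `mask_tokens` is a hypothetical helper, not a function from that script.

```python
import random


def mask_tokens(tokens, vocab, rng, masked_lm_prob=0.15, max_predictions=20):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output = list(tokens)
    labels = {}
    for i in sorted(candidates[:num_to_predict]):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"          # 80%: replace with the mask token
        elif roll < 0.9:
            output[i] = tokens[i]         # 10%: keep the original token
        else:
            output[i] = rng.choice(vocab)  # 10%: replace with a random token
    return output, labels


tokens = ["[CLS]", "the", "model", "is", "trained", "[SEP]"]
masked, labels = mask_tokens(tokens, ["the", "model", "is", "trained"], random.Random(0))
print(masked, labels)
```

The model is then trained to recover the tokens stored in `labels` from the masked sequence.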

# Training BERT

Now let's actually train BERT using the training data we just created.

In this run, we train on paper data.

---

In [None]:
! python src/make_bert_model/run_pretraining.py \
--input_file=rsc/processed_training_data/paper_full_pretraining_data_tf_v4.record \
--output_dir=rsc/my_pretrained_model \
--do_train=True \
--do_eval=True \
--bert_config_file=rsc/conf/bert_config.json \
--train_batch_size=16 \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--num_train_steps=80000 \
--learning_rate=5e-5 \
--save_checkpoints_steps=1000 \
--do_lower_case=True


2020-11-24 22:43:13.900159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-24 22:43:13.931948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:18:00.0
2020-11-24 22:43:13.932522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-11-24 22:43:13.933992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-11-24 22:43:13.937182: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-11-24 22:43:13.940088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.
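
As a quick sanity check on the flags above, the training budget they imply can be worked out with plain arithmetic. The sketch below only multiplies the values passed on the command line; `training_budget` is a hypothetical helper, and the token count is an upper bound because sequences shorter than `max_seq_length` are padded.

```python
def training_budget(train_batch_size, num_train_steps, max_seq_length,
                    save_checkpoints_steps):
    """Derive rough totals from the run_pretraining.py flags used above."""
    sequences_seen = train_batch_size * num_train_steps
    return {
        "sequences_seen": sequences_seen,
        # Upper bound: padded positions are counted as tokens here.
        "max_tokens_seen": sequences_seen * max_seq_length,
        # Periodic saves only; a framework may also save at start and end.
        "periodic_saves": num_train_steps // save_checkpoints_steps,
    }


budget = training_budget(train_batch_size=16, num_train_steps=80000,
                         max_seq_length=512, save_checkpoints_steps=1000)
print(budget)
```

With these values the run processes 1,280,000 sequences (at most 655,360,000 token positions) and reaches 80 periodic save points, the last of which falls on step 80000.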

# Fine-tuning on KorQuAD with the pretrained BERT model

This time, let's fine-tune the BERT model on KorQuAD.

---

In [None]:
!CUDA_VISIBLE_DEVICES=1 python src/make_bert_model/run_squad_2.py \
--vocab_file=rsc/my_conf/ices_vocab_v4_sejin.txt \
--bert_config_file=rsc/conf/bert_config.json \
--init_checkpoint=rsc/my_pretrained_model/model.ckpt-80000 \
--do_train=True \
--train_file=rsc/conf/QA_train.json \
--do_predict=True \
--predict_file=rsc/conf/QA_test.json \
--train_batch_size=16 \
--learning_rate=5e-5 \
--num_train_epochs=3.0 \
--max_seq_length=512 \
--doc_stride=128 \
--output_dir=QA_output/pre_80000 \
--save_checkpoints_steps=200 \
--do_lower_case=True




W1125 11:21:59.267806 47806889930240 module_wrapper.py:139] From src/make_bert_model/run_squad_2.py:1033: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1125 11:21:59.267999 47806889930240 module_wrapper.py:139] From src/make_bert_model/run_squad_2.py:1033: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1125 11:21:59.268158 47806889930240 module_wrapper.py:139] From /scratch/kedu21/workspace/Saejin/src/make_bert_model/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1125 11:21:59.276223 47806889930240 module_wrapper.py:139] From src/make_bert_model/run_squad_2.py:1039: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset
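
The `--max_seq_length=512` and `--doc_stride=128` flags control how a context longer than the model's input is split into overlapping windows. The sketch below follows the sliding-window logic of the reference BERT SQuAD script; it is an assumption that `run_squad_2.py` in this repository does the same. `doc_spans` is a hypothetical helper that returns `(start, length)` pairs in document-token positions.

```python
def doc_spans(num_doc_tokens, num_query_tokens, max_seq_length=512, doc_stride=128):
    """Split a tokenized context into overlapping (start, length) windows."""
    # Input layout: [CLS] query [SEP] context [SEP], so 3 positions are reserved.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("query is too long for the given max_seq_length")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window already reaches the end of the context
        start += min(length, doc_stride)
    return spans


print(doc_spans(num_doc_tokens=1000, num_query_tokens=9))
```

For a 1000-token context and a 9-token question, each window holds up to 500 context tokens and consecutive windows start 128 tokens apart, so most tokens appear in several windows. A smaller `doc_stride` gives more overlap and more training features per example.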

# Evaluating the fine-tuned KorQuAD model

Running the code below evaluates the fine-tuned KorQuAD model.

---



In [7]:
!python evaluate.py \
rsc/conf/QA_test.json \
QA_output/pre_80000/predictions.json

Evaluation expects v-1.1, but got dataset with v-ICES_test
{"exact_match": 9.563187729243081, "f1": 42.18527454239351}
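
The gap between `exact_match` and `f1` is easier to read with the metric definitions in hand. `evaluate.py` itself is not shown in this notebook; the version message suggests it is derived from the SQuAD v1.1 evaluation script, so the sketch below assumes SQuAD-style normalization and token-overlap F1. The actual script may normalize or tokenize differently, particularly for Korean text.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts tokens shared by prediction and answer.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the BERT model", "BERT"), f1_score("the BERT model", "BERT"))
```

A prediction that contains the right answer plus extra words scores 0 on exact match but still earns partial F1. Under these definitions, a low `exact_match` next to a much higher `f1`, as in the output above, would mean that many predicted spans overlap the reference answers without matching their boundaries exactly.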
