### Google Colab notebook: using a pre-trained BERT model for review classification

Take the pre-trained BERT model from https://github.com/ThAIKeras/bert.  
[Optional] Fine-tune it on my dataset for my sentiment classification task (optional because the pre-trained BERT project already supports classification on the Wongnai dataset).

See the *Pre_Trained_BERT_submission* notebook for predictions.

In [2]:
# install packages
!pip install sentencepiece
!pip install pythainlp

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

In [3]:
import tensorflow as tf
print(tf.__version__)

1.13.1


In [4]:
# make sure to use TensorFlow 1.13.1 (project requirement - tensorflow >= 1.11.0)
# !pip install tensorflow==1.13.1
# !pip install tensorflow-gpu==1.13.1
# import tensorflow as tf
# print(tf.__version__)

# restart the runtime after downgrading TF

Collecting tensorflow==1.13.1
  Downloading https://files.pythonhosted.org/packages/d4/29/6b4f1e02417c3a1ccc85380f093556ffd0b35dc354078074c5195c8447f2/tensorflow-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (92.6MB)
     |████████████████████████████████| 92.6MB 100kB/s 
Collecting keras-applications>=1.0.6
  Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
     |████████████████████████████████| 51kB 8.3MB/s 
Collecting tensorboard<1.14.0,>=1.13.0
  Downloading https://files.pythonhosted.org/packages/0f/39/bdd75b08a6fba41f098b6cb091b9e8c7a80e1b4d679a581a0ccd17b10373/tensorboard-1.13.1-py3-none-any.whl (3.2MB)
     |████████████████████████████████| 3.2MB 38.8MB/s 
Collecting tensorflow-estimator<1.14.0rc0,>=1.13.0
  Downloading https://files.pythonhosted.org/packages/bb/48/13f49fc3fa0fdf916aa1419013bb8f2ad09674c275b4046d5ee669a46873/tensorf

Collecting tensorflow-gpu==1.13.1
  Downloading https://files.pythonhosted.org/packages/2c/65/8dc8fc4a263a24f7ad935b72ad35e72ba381cb9e175b6a5fe086c85f17a7/tensorflow_gpu-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (345.0MB)
     |████████████████████████████████| 345.0MB 22kB/s 
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.13.1


2.4.1
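
The two version strings printed above (1.13.1 and 2.4.1) fall on opposite sides of the TensorFlow 1.x/2.x split, so it may help to check the installed version before running the classifier. The sketch below is only an illustration: the lower bound 1.11.0 comes from the project requirement quoted in the cell, while the upper bound 2.0.0 is an assumption, based on `tf.contrib` having been removed in TensorFlow 2.0 and on the `tf.contrib` notice in the run log further down. The helper names `parse_version` and `in_supported_range` are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.13.1' into (1, 13, 1), ignoring suffixes such as 'rc0'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def in_supported_range(version: str,
                       minimum: str = "1.11.0",
                       below: str = "2.0.0") -> bool:
    """True if minimum <= version < below.

    `minimum` is the requirement quoted in the notebook; `below` is an
    assumed upper bound, not something the project states.
    """
    return parse_version(minimum) <= parse_version(version) < parse_version(below)
```

In a notebook cell this could be used as `in_supported_range(tf.__version__)` right after `import tensorflow as tf`; if it returns `False`, the commented `pip install` lines above are the ones to run before restarting the runtime.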


In [4]:
# mount drive to use pre-trained (saved) BERT-Thai model
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# unzip pre-trained BERT-Thai model
!unzip drive/MyDrive/ai-palette-test/bert_base_th.zip

Archive:  drive/MyDrive/ai-palette-test/bert_base_th.zip
   creating: bert_base_th/
  inflating: bert_base_th/model.ckpt.index  
  inflating: bert_base_th/model.ckpt.meta  
  inflating: bert_base_th/bert_config.json  
  inflating: bert_base_th/model.ckpt.data-00000-of-00001  


In [6]:
# clone the project of pre-trained BERT model
!git clone https://github.com/ThAIKeras/bert

Cloning into 'bert'...
remote: Enumerating objects: 275, done.
remote: Total 275 (delta 0), reused 0 (delta 0), pack-reused 275
Receiving objects: 100% (275/275), 201.44 KiB | 18.31 MiB/s, done.
Resolving deltas: 100% (151/151), done.


In [7]:
# clone the project with dataset
!git clone https://github.com/wongnai/wongnai-corpus.git

Cloning into 'wongnai-corpus'...
remote: Enumerating objects: 127, done.
remote: Total 127 (delta 0), reused 0 (delta 0), pack-reused 127
Receiving objects: 100% (127/127), 39.65 MiB | 38.96 MiB/s, done.
Resolving deltas: 100% (45/45), done.


In [8]:
# unzip dataset
!mkdir wongnai_data
!unzip wongnai-corpus/review/review_dataset.zip -d wongnai_data/

Archive:  wongnai-corpus/review/review_dataset.zip
 extracting: wongnai_data/sample_submission.csv  
  inflating: wongnai_data/test_file.csv  
  inflating: wongnai_data/w_review_train.csv  
   creating: wongnai_data/__MACOSX/
  inflating: wongnai_data/__MACOSX/._sample_submission.csv  
  inflating: wongnai_data/__MACOSX/._test_file.csv  
  inflating: wongnai_data/__MACOSX/._w_review_train.csv  
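
Before training, it can be useful to look at the label distribution of the extracted training file. The sketch below counts ratings per class. It assumes that each row of `w_review_train.csv` ends with the rating and that fields are separated by `;`; neither is confirmed by this notebook, so check the file and adjust `delimiter` if needed. The function name `rating_counts` is hypothetical.

```python
import csv
from collections import Counter


def rating_counts(lines, delimiter=";"):
    """Count how often each rating occurs.

    `lines` is any iterable of text lines (for example an open file).
    The rating is assumed to be the last field of each row.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            # Skip blank or malformed rows instead of failing.
            continue
        counts[row[-1].strip()] += 1
    return counts
```

A possible call is `rating_counts(open("wongnai_data/w_review_train.csv", newline="", encoding="utf-8"))`, which returns a `Counter` keyed by the rating strings found in the file.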


In [9]:
# unzip pre-trained Thai SentencePiece model and vocab from BPEmb
!unzip drive/MyDrive/ai-palette-test/th_wiki_bpe.zip

Archive:  drive/MyDrive/ai-palette-test/th_wiki_bpe.zip
  inflating: th.wiki.bpe.op25000.model  
  inflating: th.wiki.bpe.op25000.vocab  


In [10]:
# save results to directory
!mkdir bert_task_output

In [22]:
# !export BPE_DIR=bpemb/th/
# !export WONGNAI_DIR=wongnai_data
# !export OUTPUT_DIR=bert_task_output
# !export BERT_BASE_DIR=bert_base_th
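
One likely reason the `!export` lines above are commented out is that each `!` command in a notebook runs in its own subshell, so an exported variable does not survive to the next command. A minimal sketch of an alternative is to set the variables on the notebook process through `os.environ`, so that subprocesses started later inherit them. The directory names are taken from the cells in this notebook, except `BPE_DIR`, which is set to `.` here on the assumption that the BPEmb files stay in the working directory where they were unzipped. The helper name `export_paths` is hypothetical.

```python
import os

# Paths used elsewhere in this notebook. BPE_DIR="." is an assumption:
# the BPEmb model and vocab were unzipped into the working directory.
PATHS = {
    "BPE_DIR": ".",
    "WONGNAI_DIR": "wongnai_data",
    "OUTPUT_DIR": "bert_task_output",
    "BERT_BASE_DIR": "bert_base_th",
}


def export_paths(paths: dict) -> None:
    """Set each name/value pair on the current process environment."""
    for name, value in paths.items():
        os.environ[name] = value


export_paths(PATHS)
```

Because child processes inherit the environment of the notebook kernel, a later shell command should be able to read these values, for example as `$OUTPUT_DIR`.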

## Use pre-trained Thai BERT to perform classification

In [11]:
!python bert/run_classifier.py \
  --task_name=wongnai \
  --do_train=true \
  --do_predict=true \
  --data_dir=wongnai_data \
  --vocab_file=th.wiki.bpe.op25000.vocab \
  --bert_config_file=bert_base_th/bert_config.json \
  --init_checkpoint=bert_base_th/model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=bert_task_output \
  --spm_file=th.wiki.bpe.op25000.model

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Using config: {'_model_dir': 'bert_task_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '
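
The `--train_batch_size` and `--num_train_epochs` flags together determine how many optimizer steps the run performs. In the upstream google-research BERT `run_classifier.py`, the step count is `int(num_examples / train_batch_size * num_train_epochs)` and the warm-up length is `int(num_train_steps * warmup_proportion)`, with `warmup_proportion` defaulting to 0.1. The sketch below reproduces that arithmetic; it assumes the ThAIKeras fork keeps the upstream formula, which this notebook does not confirm. The function name `training_schedule` is hypothetical.

```python
def training_schedule(num_examples: int,
                      train_batch_size: int = 32,
                      num_train_epochs: float = 2.0,
                      warmup_proportion: float = 0.1) -> tuple:
    """Return (num_train_steps, num_warmup_steps).

    Defaults mirror the flags passed in the cell above; warmup_proportion
    is the upstream default and is assumed, not taken from this notebook.
    """
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps
```

With checkpoints saved every 1000 steps (`_save_checkpoints_steps` in the config log), the returned step count also gives a rough idea of how many intermediate checkpoints to expect in `bert_task_output`.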

In [14]:
# save results to drive folder
!cp -r bert_task_output/ drive/MyDrive/ai-palette-test/
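
With `--do_predict=true`, the upstream BERT `run_classifier.py` writes a `test_results.tsv` file into the output directory, with one line per test example holding tab-separated class probabilities. Assuming the fork used here does the same, the sketch below turns those probabilities into predicted ratings by taking the most probable column. The mapping from column index to rating (`first_label=1`, so column 0 means rating 1) is an assumption about the label order, and `predicted_ratings` is a hypothetical helper name.

```python
import csv


def predicted_ratings(tsv_path: str, first_label: int = 1) -> list:
    """Read per-class probabilities and return the arg-max label per row.

    `first_label` is the rating assigned to column 0 (assumed, not
    confirmed by the notebook).
    """
    ratings = []
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # ignore blank lines
            probs = [float(value) for value in row]
            ratings.append(probs.index(max(probs)) + first_label)
    return ratings
```

A possible call is `predicted_ratings("bert_task_output/test_results.tsv")`; the resulting list can then be paired with the review IDs from the test file to build a submission.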