# Sentiment Analysis for Indonesian Text - Training Part

This notebook is the source code used for Gemastik 2019 Data Mining.

Created by:

Team: NamaTimnyaApa

Member
- Setyo Nugroho
- Cindy Alifia Putri
- Nurliah Awaliah

University: Telkom University

This is a modification of https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb using the TensorFlow 2.0 Keras implementation of BERT from [kpe/bert-for-tf2](https://github.com/kpe/bert-for-tf2) with the original [google-research/bert](https://github.com/google-research/bert) weights.


# Predicting Twitter Sentiment with [kpe/bert-for-tf2](https://github.com/kpe/bert-for-tf2)

First install some prerequisites:

In [None]:
!pip install tqdm  >> /dev/null

In [None]:
import os
import math
import datetime

from tqdm import tqdm

import pandas as pd
import numpy as np

# Select TensorFlow 2.x (the magic below only works in Colab)
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
  
import tensorflow as tf


TensorFlow 2.x selected.


In [None]:
tf.__version__

'2.0.0-rc1'

In [None]:
if tf.__version__.startswith("1."):
  tf.enable_eager_execution()


In addition to the standard libraries imported above, we need to install the [bert-for-tf2](https://github.com/kpe/bert-for-tf2) Python package and import what is required for loading the pre-trained weights and tokenizing the input text.

In [None]:
!pip install bert-for-tf2==0.11.6 >> /dev/null

In [None]:
import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization import FullTokenizer

# Data

The dataset is based on http://ridi.staff.ugm.ac.id/2019/03/06/indonesia-sentiment-analysis-dataset/

The dataset has already been cleaned and formatted to contain only positive (1) and negative (0) labels.

The dataset can be downloaded from https://drive.google.com/open?id=1TiaMpQZe99dPpC7_YuTy87wzW8bVWpYa

The data is stored on my own Google Drive. If you want to run this notebook, please change the dataset location.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from tensorflow import keras
import os
import re
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset from Google Drive and split it into train (80%) and test (20%) sets.
def download_and_load_datasets(force_download=False):
  df = pd.read_csv("drive/My Drive/dataset/indo_dataset.csv")

  df_train, df_test = train_test_split(df,  test_size=0.2)
  
  return df_train, df_test


Specify the BERT model and the location where the trained model will be saved:

In [None]:
bert_model_name  = "multi_cased_L-12_H-768_A-12"
bert_ckpt_dir    = os.path.join("drive/My Drive/bert_models/",bert_model_name)
bert_ckpt_file   = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")

model_save_location = "drive/My Drive/bert_models/my_sentiment.h5"

Let's use the `TwitterSentimentData` class below to prepare and encode
the data for feeding into our BERT model by:
  - tokenizing the text
  - trimming or padding it to a length of `max_seq_len`
  - adding the special tokens `[CLS]` and `[SEP]`
  - converting the string tokens to numerical IDs using the original model's token encoding from `vocab.txt`
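
The encoding steps listed above can be tried out without the real tokenizer. The following is a minimal sketch, not the notebook's implementation: it assumes a hypothetical toy vocabulary (`TOY_VOCAB`) and plain whitespace splitting in place of `FullTokenizer` and `vocab.txt`, and the function name `encode` is made up for illustration.

```python
# Hypothetical toy vocabulary; the notebook itself reads BERT's vocab.txt.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "saya": 4, "suka": 5, "film": 6, "ini": 7}


def encode(text, max_seq_len):
    """Tokenize, add special tokens, map to IDs, then trim or pad."""
    # 1. Tokenize (whitespace split stands in for WordPiece here).
    tokens = text.lower().split()
    # 2. Trim so that [CLS] and [SEP] still fit into max_seq_len.
    tokens = tokens[:max_seq_len - 2]
    # 3. Add the special tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 4. Convert tokens to IDs; unknown words map to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 5. Pad with the [PAD] id (0) up to max_seq_len.
    return ids + [TOY_VOCAB["[PAD]"]] * (max_seq_len - len(ids))


print(encode("saya suka film ini", max_seq_len=8))
# -> [2, 4, 5, 6, 7, 3, 0, 0]
print(encode("saya suka film ini sekali", max_seq_len=5))
# -> [2, 4, 5, 6, 3]
```

Every encoded example has exactly `max_seq_len` entries, which is what lets the examples be stacked into a single 2-D array.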

In [None]:

import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization import FullTokenizer


class TwitterSentimentData:
    DATA_COLUMN = "Tweet"
    LABEL_COLUMN = "sentimen"

    def __init__(self, tokenizer: FullTokenizer, sample_size=None, max_seq_len=1024):
        self.tokenizer = tokenizer
        self.sample_size = sample_size
        self.max_seq_len = max_seq_len
        train, test = download_and_load_datasets()
        
        train, test = map(lambda df: df.reindex(df[TwitterSentimentData.DATA_COLUMN].str.len().sort_values().index), 
                          [train, test])
                
        if sample_size is not None:
            assert sample_size % 128 == 0
            train, test = train.head(sample_size), test.head(sample_size)
            # train, test = map(lambda df: df.sample(sample_size), [train, test])
        
        ((self.train_x, self.train_y),
         (self.test_x, self.test_y)) = map(self._prepare, [train, test])

        print("max seq_len", self.max_seq_len)
        self.max_seq_len = min(self.max_seq_len, max_seq_len)
        ((self.train_x, self.train_x_token_types),
         (self.test_x, self.test_x_token_types)) = map(self._pad, 
                                                       [self.train_x, self.test_x])

    def _prepare(self, df):
        x, y = [], []
        with tqdm(total=df.shape[0], unit_scale=True) as pbar:
            for ndx, row in df.iterrows():
                text, label = row[TwitterSentimentData.DATA_COLUMN], row[TwitterSentimentData.LABEL_COLUMN]
                tokens = self.tokenizer.tokenize(text)
                tokens = ["[CLS]"] + tokens + ["[SEP]"]
                token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
                self.max_seq_len = max(self.max_seq_len, len(token_ids))
                x.append(token_ids)
                y.append(int(label))
                pbar.update()
        return np.array(x), np.array(y)

    def _pad(self, ids):
        x, t = [], []
        token_type_ids = [0] * self.max_seq_len
        for input_ids in ids:
            input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
            input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
            x.append(np.array(input_ids))
            t.append(token_type_ids)
        return np.array(x), np.array(t)


# Preparing the Data

Now let's fetch and prepare the data by taking the first `max_seq_len` tokens after tokenizing with the BERT tokenizer, and use `sample_size` examples for both training and testing.

To keep training fast, we use only the first 128 tokens of each example. Passing `sample_size=None` keeps the full dataset instead of a subsample. Transformer memory and computation requirements scale quadratically with the sequence length, so with a TPU you might use `max_seq_len=512`, but on a GPU this would be too slow, and you would have to use a very small `batch_size` to fit the model into GPU memory.
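
As a rough illustration of the quadratic scaling, the sketch below counts the entries of the self-attention score matrices for one example. It is a back-of-the-envelope estimate only: the helper name `attention_score_entries` is hypothetical, the layer and head counts are read off the model name `L-12_H-768_A-12`, and all other costs (feed-forward layers, embeddings, activations) are ignored.

```python
def attention_score_entries(seq_len, num_heads=12, num_layers=12):
    """Count attention-score entries for a single example.

    Each head in each layer builds a seq_len x seq_len score matrix.
    """
    return num_layers * num_heads * seq_len * seq_len


short = attention_score_entries(128)
long = attention_score_entries(512)

print(short)          # 2359296
print(long)           # 37748736
print(long // short)  # 16
```

Making the sequence 4 times longer multiplies this part of the cost by 16, which is why a longer `max_seq_len` forces a much smaller batch size.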

In [None]:
%%time

tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))
data = TwitterSentimentData(tokenizer, 
                       sample_size=None, 
                       max_seq_len=128)

100%|██████████| 4.38k/4.38k [00:01<00:00, 2.91kit/s]
100%|██████████| 1.10k/1.10k [00:00<00:00, 2.94kit/s]


max seq_len 128
CPU times: user 2.3 s, sys: 40.6 ms, total: 2.35 s
Wall time: 2.35 s


In [None]:
print("            train_x", data.train_x.shape)
print("train_x_token_types", data.train_x_token_types.shape)
print("            train_y", data.train_y.shape)

print("             test_x", data.test_x.shape)

print("        max_seq_len", data.max_seq_len)

            train_x (4383, 128)
train_x_token_types (4383, 128)
            train_y (4383,)
             test_x (1096, 128)
        max_seq_len 128
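
The row counts printed above can be checked against the split settings. This is a small arithmetic sketch, not part of the original notebook: the total of 5479 rows is inferred from the two printed shapes, and the rounding rules (test set rounded up by `train_test_split`, validation split truncated by Keras) are assumptions that happen to agree with the `Train on 3944 samples, validate on 439 samples` line in the training log.

```python
import math

n_train, n_test = 4383, 1096   # shapes printed by the notebook
n_total = n_train + n_test     # 5479 rows (inferred, not stated in the notebook)

# train_test_split(test_size=0.2): the test set size is rounded up,
# and the remaining rows form the training set.
assert math.ceil(0.2 * n_total) == n_test
assert n_total - n_test == n_train

# model.fit(validation_split=0.1): the last 10% of the training data
# is held out for validation (assuming truncation toward zero).
n_fit = int(n_train * (1 - 0.1))
n_val = n_train - n_fit

print(n_fit, n_val)  # 3944 439
```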


## Learning Rate Scheduler


In [None]:
def create_learning_rate_scheduler(max_learn_rate=5e-5,
                                   end_learn_rate=1e-7,
                                   warmup_epoch_count=10,
                                   total_epoch_count=90):

    def lr_scheduler(epoch):
        if epoch < warmup_epoch_count:
            res = (max_learn_rate/warmup_epoch_count) * (epoch + 1)
        else:
            res = max_learn_rate*math.exp(math.log(end_learn_rate/max_learn_rate)*(epoch-warmup_epoch_count+1)/(total_epoch_count-warmup_epoch_count+1))
        return float(res)
    learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_scheduler, verbose=1)

    return learning_rate_scheduler
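
To inspect the schedule without starting a training run, the same formula can be evaluated on its own. The sketch below copies the arithmetic from `lr_scheduler` into a standalone function; the name `lr_at_epoch` is hypothetical, and the argument values are the ones passed to `model.fit` later in this notebook.

```python
import math


def lr_at_epoch(epoch, max_learn_rate, end_learn_rate,
                warmup_epoch_count, total_epoch_count):
    """Linear warm-up followed by exponential decay (epoch is 0-based)."""
    if epoch < warmup_epoch_count:
        # Warm-up: grow linearly up to max_learn_rate.
        return (max_learn_rate / warmup_epoch_count) * (epoch + 1)
    # Decay: shrink exponentially from max_learn_rate toward end_learn_rate.
    decay_steps = total_epoch_count - warmup_epoch_count + 1
    return max_learn_rate * math.exp(
        math.log(end_learn_rate / max_learn_rate)
        * (epoch - warmup_epoch_count + 1) / decay_steps)


for epoch in (0, 9, 19, 20, 49):
    print(epoch, lr_at_epoch(epoch, 1e-5, 1e-7, 20, 50))
```

With these settings the first epoch uses a learning rate of about 5e-7, the peak of 1e-5 is reached at the end of warm-up (epoch index 19), and the rate decays from there on.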


# Creating a model


In [None]:
def create_model(max_seq_len, adapter_size=None):
  """Creates a classification model."""

  # adapter_size: adapter bottleneck size, e.g. 64 (see arXiv:1902.00751); None disables adapters

  # create the bert layer
  with tf.io.gfile.GFile(bert_config_file, "r") as reader:
      bc = StockBertConfig.from_json_string(reader.read())
      bert_params = map_stock_config_to_params(bc)
      bert_params.adapter_size = adapter_size
      bert = BertModelLayer.from_params(bert_params, name="bert")
        
  input_ids      = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="input_ids")
  # token_type_ids = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="token_type_ids")
  # output         = bert([input_ids, token_type_ids])
  output         = bert(input_ids)

  print("bert shape", output.shape)
  cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(output)
  cls_out = keras.layers.Dropout(0.3)(cls_out)
  logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
  logits = keras.layers.Dropout(0.5)(logits)
  logits = keras.layers.Dense(units=2, activation="softmax")(logits)

  # model = keras.Model(inputs=[input_ids, token_type_ids], outputs=logits)
  # model.build(input_shape=[(None, max_seq_len), (None, max_seq_len)])
  model = keras.Model(inputs=input_ids, outputs=logits)
  model.build(input_shape=(None, max_seq_len))

  # load the pre-trained model weights
  load_stock_weights(bert, bert_ckpt_file)

  # The last Dense layer already applies softmax, so the loss receives probabilities, not logits.
  model.compile(optimizer=keras.optimizers.Adam(),
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])

  model.summary()
        
  return model


In [None]:
model = create_model(data.max_seq_len)

bert shape (None, 128, 768)
Done loading 196 BERT weights from: drive/My Drive/bert_models/multi_cased_L-12_H-768_A-12/bert_model.ckpt into <bert.model.BertModelLayer object at 0x7fa31c51c160> (prefix:bert_1). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 128)]             0         
_________________________________________________________________
bert (BertModelLayer)        (None, 128, 768)          177261312 
_________________________________________________________________
lambda_1 (Lambda)            (None, 768)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 768)               0         
_________________________________________________________________
dense_2 (Dense

In [None]:
%%time

log_dir = "log/sentiment/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir)

total_epoch_count = 50
# model.fit(x=(data.train_x, data.train_x_token_types), y=data.train_y,
model.fit(x=data.train_x, y=data.train_y,
          validation_split=0.1,
          batch_size=45,
          shuffle=True,
          epochs=total_epoch_count,
          callbacks=[create_learning_rate_scheduler(max_learn_rate=1e-5,
                                                    end_learn_rate=1e-7,
                                                    warmup_epoch_count=20,
                                                    total_epoch_count=total_epoch_count),
                     keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
                     tensorboard_callback])

model.save_weights(model_save_location, overwrite=True)

Train on 3944 samples, validate on 439 samples

Epoch 00001: LearningRateScheduler reducing learning rate to 5.000000000000001e-07.
Epoch 1/50

Epoch 00002: LearningRateScheduler reducing learning rate to 1.0000000000000002e-06.
Epoch 2/50

Epoch 00003: LearningRateScheduler reducing learning rate to 1.5000000000000002e-06.
Epoch 3/50

Epoch 00004: LearningRateScheduler reducing learning rate to 2.0000000000000003e-06.
Epoch 4/50

Epoch 00005: LearningRateScheduler reducing learning rate to 2.5000000000000006e-06.
Epoch 5/50

Epoch 00006: LearningRateScheduler reducing learning rate to 3.0000000000000005e-06.
Epoch 6/50

Epoch 00007: LearningRateScheduler reducing learning rate to 3.5000000000000004e-06.
Epoch 7/50

Epoch 00008: LearningRateScheduler reducing learning rate to 4.000000000000001e-06.
Epoch 8/50

Epoch 00009: LearningRateScheduler reducing learning rate to 4.500000000000001e-06.
Epoch 9/50

Epoch 00010: LearningRateScheduler reducing learning rate to 5.000000000000001e-06

In [None]:
%%time

_, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

print("train acc", train_acc)
print(" test acc", test_acc)

train acc 0.8754278
 test acc 0.7436131
CPU times: user 34.4 s, sys: 2.7 s, total: 37.1 s
Wall time: 1min 12s


# Evaluation

To evaluate the trained model, let's load the saved weights in a new model instance, and evaluate.

In [None]:
%%time 

model = create_model(data.max_seq_len, adapter_size=None)
model.load_weights(model_save_location)

_, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

print("train acc", train_acc)
print(" test acc", test_acc)

bert shape (None, 128, 768)
Done loading 196 BERT weights from: drive/My Drive/bert_models/multi_cased_L-12_H-768_A-12/bert_model.ckpt into <bert.model.BertModelLayer object at 0x7fa31ff5e9e8> (prefix:bert_2). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 128)]             0         
_________________________________________________________________
bert (BertModelLayer)        (None, 128, 768)          177261312 
_________________________________________________________________
lambda_2 (Lambda)            (None, 768)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 768)               0         
_________________________________________________________________
dense_4 (Dense

# Prediction

For prediction, we need to prepare the input text the same way as we did for training: tokenize it, add the special `[CLS]` and `[SEP]` tokens at the beginning and end of the token sequence, and pad it to match the model input shape.

In [None]:
pred_sentences = [
  "ini pemerintah kok goblok yah",
  "saya dukung sih soal ruu ini",
  "saya suka makan kadal",
  "kenapa pemerintah bikin uu yang malah bikin kpk tidak berdaya",
  "pemerintah cacat, ini kebijakan yang saya tentang",
  "kebijakan ruu ini justru akan membuat kpk lebih tidak dapat menangkap para koruptor, ini tidak bisa di biarkan",
  "kalau bicara soal kebijakan, ini adalah salah satu yang saya tentang, saya sangat tidak setuju dengan adanya kebijakan tidak jelas ini",
  "ini sebenernya kebijakan bagus, tidak seperti yang lainnya, ini akan memperkuat kpk"
]

tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))
pred_tokens    = map(tokenizer.tokenize, pred_sentences)
pred_tokens    = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))

pred_token_ids = map(lambda tids: tids + [0] * (data.max_seq_len - len(tids)), pred_token_ids)
pred_token_ids = np.array(list(pred_token_ids))

print('pred_token_ids', pred_token_ids.shape)

res = model.predict(pred_token_ids).argmax(axis=-1)

# res = model.predict(pred_token_ids)
# print(res)
for text, sentiment in zip(pred_sentences, res):
  print(" text:", text)
  print("  res:", ["negative","positive"][sentiment])

pred_token_ids (8, 128)
 text: ini pemerintah kok goblok yah
  res: positive
 text: saya dukung sih soal ruu ini
  res: positive
 text: saya suka makan kadal
  res: positive
 text: kenapa pemerintah bikin uu yang malah bikin kpk tidak berdaya
  res: negative
 text: pemerintah cacat, ini kebijakan yang saya tentang
  res: positive
 text: kebijakan ruu ini justru akan membuat kpk lebih tidak dapat menangkap para koruptor, ini tidak bisa di biarkan
  res: negative
 text: kalau bicara soal kebijakan, ini adalah salah satu yang saya tentang, saya sangat tidak setuju dengan adanya kebijakan tidak jelas ini
  res: negative
 text: ini sebenernya kebijakan bagus, tidak seperti yang lainnya, ini akan memperkuat kpk
  res: positive
