## Sentiment Analysis with BERT
### Hugging Face Transformers, TensorFlow

ref. https://pypi.org/project/keras-bert/, https://github.com/CyberZHG/keras-bert/tree/master/keras_bert

In [None]:
!pip install transformers



In [None]:
import json
import os

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import BertTokenizer, TFBertModel

In [None]:
df = pd.read_csv("sample_data/IMDB_Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# Encode labels: positive -> 0, negative -> 1 (stored as strings here, cast to int in load_data)
df.loc[df['sentiment']=='positive', 'label'] = "0"
df.loc[df['sentiment']=='negative', 'label'] = "1"
df.head()

Unnamed: 0,review,sentiment,label
0,One of the other reviewers has mentioned that ...,positive,0
1,A wonderful little production. <br /><br />The...,positive,0
2,I thought this was a wonderful way to spend ti...,positive,0
3,Basically there's a family where a little boy ...,negative,1
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,0
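The label convention above (positive = 0, negative = 1) is easy to invert by accident later on. A minimal sketch, in plain Python and independent of pandas, of keeping the mapping in one place and failing loudly on unexpected values. The names `LABEL_MAP` and `encode_sentiment` are hypothetical helpers, not part of the notebook.

```python
# Hypothetical helper: one explicit mapping, matching the notebook's convention.
LABEL_MAP = {"positive": 0, "negative": 1}

def encode_sentiment(sentiments):
    """Map sentiment strings to integer labels, rejecting unknown values."""
    unknown = sorted(set(sentiments) - set(LABEL_MAP))
    if unknown:
        raise ValueError(f"unexpected sentiment values: {unknown}")
    return [LABEL_MAP[s] for s in sentiments]

labels = encode_sentiment(["positive", "positive", "negative"])
print(labels)
```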


In [None]:
train, test = train_test_split(df, test_size=0.2)
train.head(10)

Unnamed: 0,review,sentiment,label
16693,Jordan takes us into the seedy crime side of S...,positive,0
26279,"""Kicked in the Head"" is all about the Corrigan...",negative,1
22960,1993 was a time of change in the WWE but for t...,negative,1
15781,From the acclaim it got I was expecting more f...,negative,1
25873,This movie lost me with the crossbow RPG (rock...,negative,1
35967,I was really looking forward to seeing this mo...,negative,1
14972,Creature Unknown is the right word for this mo...,negative,1
27078,This is the middle cartoon of the three (betwe...,positive,0
10905,I just saw this at the 2006 Vancouver internat...,negative,1
37965,"This is a nicely-done story with pretty music,...",positive,0
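The split above is not seeded, so the rows shown will differ from run to run (`train_test_split` accepts a `random_state` argument for this). A minimal sketch of the underlying idea in plain Python, assuming a simple shuffled 80/20 split. `split_indices` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is reproducible
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50_000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```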


In [None]:
del train["sentiment"]
del test["sentiment"]

In [None]:
train = train.reset_index(drop=True)
train.head(10)

Unnamed: 0,review,label
0,Jordan takes us into the seedy crime side of S...,0
1,"""Kicked in the Head"" is all about the Corrigan...",1
2,1993 was a time of change in the WWE but for t...,1
3,From the acclaim it got I was expecting more f...,1
4,This movie lost me with the crossbow RPG (rock...,1
5,I was really looking forward to seeing this mo...,1
6,Creature Unknown is the right word for this mo...,1
7,This is the middle cartoon of the three (betwe...,0
8,I just saw this at the 2006 Vancouver internat...,1
9,"This is a nicely-done story with pretty music,...",0


In [None]:
test = test.reset_index(drop=True)
test.head(10)

Unnamed: 0,review,label
0,I was never all that impressed by Night Galler...,0
1,"This flick, which is a.k.a. ""Life In the Fast ...",1
2,"""The Cell"" is an exotic masterpiece, a dizzyin...",0
3,"Having been familiar with Hartley's ""The Go-Be...",0
4,"As other reviews have said, another of the cou...",1
5,*McCabe and Mrs. Miller* takes place in the tu...,1
6,by Dane Youssef<br /><br />I was kind of looki...,1
7,We loved this movie because it was so entertai...,0
8,"Well, this latest version of Mansfield Park se...",1
9,I just finished watching this movie and am dis...,1


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


In [None]:
print(tokenizer.tokenize("This is encode test"))
print(tokenizer.encode("This is encode test"))

['This', 'is', 'en', '##code', 'test']
[101, 10747, 10124, 10110, 54261, 15839, 102]
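The split of "encode" into `en` and `##code` comes from WordPiece's greedy longest-match-first lookup, and `encode` additionally wraps the ids in `[CLS]` (101) and `[SEP]` (102). A rough sketch of the matching step, assuming a toy vocabulary. `toy_vocab` is hypothetical and not the real `bert-base-multilingual-cased` vocabulary, and the real tokenizer also handles casing, punctuation and Unicode normalisation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"This", "is", "en", "##code", "test"}
tokens = [p for w in "This is encode test".split()
          for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
```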


In [None]:
print(tokenizer.tokenize("ஒரு சாதாரண வளர்ந்த மனிதனுடைய எலும்புக்கூடு"))
print(tokenizer.encode("ஒரு சாதாரண வளர்ந்த மனிதனுடைய எலும்புக்கூடு"))

['ஒரு', 'ச', '##ாத', '##ார', '##ண', 'வ', '##ளர்', '##ந்த', 'ம', '##னித', '##ன', '##ுடைய', 'எ', '##லும்', '##பு', '##க்க', '##ூ', '##டு']
[101, 13496, 1154, 88567, 81773, 40397, 1170, 81452, 17002, 1163, 67101, 17506, 77626, 1146, 26934, 29972, 19932, 59189, 35667, 102]


In [None]:
#token input
print(tokenizer.encode("ஒரு சாதாரண வளர்ந்த மனிதனுடைய எலும்புக்கூடு", max_length=128, truncation=True, padding='max_length'))


[101, 13496, 1154, 88567, 81773, 40397, 1170, 81452, 17002, 1163, 67101, 17506, 77626, 1146, 26934, 29972, 19932, 59189, 35667, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]




In [None]:
#segment input
print([0]*128)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
#mask input
valid_num = len(tokenizer.encode("ஒரு சாதாரண வளர்ந்த மனிதனுடைய எலும்புக்கூடு"))
print(valid_num * [1] + (128 - valid_num) * [0])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
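The three cells above build the token, segment and mask inputs by hand. A minimal sketch that puts them together for one already-encoded sentence, assuming the pad id is 0 and a single-sentence input. `build_bert_inputs` is a hypothetical helper, and its truncation is naive (the real tokenizer keeps the closing `[SEP]` when it truncates).

```python
def build_bert_inputs(token_ids, seq_len=128, pad_id=0):
    """Return (input_ids, attention_mask, segment_ids), each of length seq_len."""
    ids = list(token_ids[:seq_len])           # naive truncation if too long
    n_real = len(ids)
    n_pad = seq_len - n_real
    input_ids = ids + [pad_id] * n_pad        # pad on the right
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    segment_ids = [0] * seq_len               # one sentence, so segment 0 everywhere
    return input_ids, attention_mask, segment_ids

# Ids taken from the "This is encode test" output above; seq_len kept short for display.
ids, mask, segs = build_bert_inputs(
    [101, 10747, 10124, 10110, 54261, 15839, 102], seq_len=10)
print(ids)
print(mask)
print(segs)
```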


In [None]:
def convert_data(data_df):
    global tokenizer
    
    SEQ_LEN = 128 #SEQ_LEN : input length
    
    tokens, masks, segments, targets = [], [], [], []
    
    for i in tqdm(range(len(data_df))):
        # token : tokenise sentence
        token = tokenizer.encode(data_df[DATA_COLUMN][i], max_length=SEQ_LEN, truncation=True, padding='max_length')
       
        # mask : 1 for real tokens, 0 for padding (the pad token id is 0)
        num_zeros = token.count(0)
        mask = [1]*(SEQ_LEN-num_zeros) + [0]*num_zeros
        
        # segment : one sentence
        segment = [0]*SEQ_LEN

        # BERT input  
        tokens.append(token)
        masks.append(mask)
        segments.append(segment)
        
        # label : positive = 0, negative = 1
        targets.append(data_df[LABEL_COLUMN][i])

    # tokens, masks, segments, targets -> numpy array   
    tokens = np.array(tokens)
    masks = np.array(masks)
    segments = np.array(segments)
    targets = np.array(targets)

    return [tokens, masks, segments], targets

# call convert_data 
def load_data(pandas_dataframe):
    data_df = pandas_dataframe
    data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
    data_df[LABEL_COLUMN] = data_df[LABEL_COLUMN].astype(int)
    data_x, data_y = convert_data(data_df)
    return data_x, data_y

SEQ_LEN = 128
BATCH_SIZE = 20
# context column
DATA_COLUMN = "review"
# label column
LABEL_COLUMN = "label"

# convert train data to BERT input format
train_x, train_y = load_data(train)

100%|██████████| 40000/40000 [03:51<00:00, 172.51it/s]


In [None]:
test_x, test_y = load_data(test)

100%|██████████| 10000/10000 [00:56<00:00, 176.00it/s]


In [None]:
# TPU 
TPU = True
if TPU:
  resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
  tf.config.experimental_connect_to_cluster(resolver)
  tf.tpu.experimental.initialize_tpu_system(resolver)
else:
  pass

In [None]:
model = TFBertModel.from_pretrained('bert-base-multilingual-cased')

token_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_word_ids')
mask_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_masks')
segment_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_segment')

bert_outputs = model([token_inputs, mask_inputs, segment_inputs])

Some layers from the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], bert_outputs)
sentiment_model.summary()

Model: "model_11"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_segment (InputLayer)      [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_9 (TFBertModel)   TFBaseModelOutputWit 177853440   input_word_ids[0][0]             
                                                                 input_masks[0][0]         

In [None]:
# Index 1 is the pooled [CLS] representation of the sequence
bert_outputs = bert_outputs[1]
sentiment_first = tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))(bert_outputs)
sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], sentiment_first)
sentiment_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1.0e-5), loss=tf.keras.losses.BinaryCrossentropy(), metrics = ['accuracy'])


In [None]:
sentiment_model.summary()

Model: "model_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_segment (InputLayer)      [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_7 (TFBertModel)   TFBaseModelOutputWit 177853440   input_word_ids[0][0]             
                                                                 input_masks[0][0]          

In [None]:
# Rectified Adam optimiser
!pip install tensorflow_addons
import tensorflow_addons as tfa
opt = tfa.optimizers.RectifiedAdam(learning_rate=5.0e-5, total_steps = 2344*2, warmup_proportion=0.1, min_lr=1e-5, epsilon=1e-08, clipnorm=1.0)


In [None]:
def create_sentiment_bert():
  # pretrained BERT model load
  model = TFBertModel.from_pretrained('bert-base-multilingual-cased')
 
  token_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_word_ids')
  mask_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_masks')
  segment_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_segment')
  
  bert_outputs = model([token_inputs, mask_inputs, segment_inputs])

  bert_outputs = bert_outputs[1]
  sentiment_first = tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))(bert_outputs)
  sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], sentiment_first)

  sentiment_model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics = ['accuracy'])
  return sentiment_model

In [None]:
if TPU:
  # Build and compile the model inside the TPU strategy scope
  strategy = tf.distribute.TPUStrategy(resolver)
  with strategy.scope():
    sentiment_model = create_sentiment_bert()
else:
  sentiment_model = create_sentiment_bert()

sentiment_model.fit(train_x, train_y, epochs=4, shuffle=True, batch_size=100, validation_data=(test_x, test_y))


Epoch 1/4


INFO:absl:TPU has inputs with dynamic shapes: [<tf.Tensor 'Const:0' shape=() dtype=int32>, <tf.Tensor 'IteratorGetNext:0' shape=(None, 128) dtype=int64>, <tf.Tensor 'IteratorGetNext:1' shape=(None, 128) dtype=int64>, <tf.Tensor 'IteratorGetNext:2' shape=(None, 128) dtype=int64>, <tf.Tensor 'IteratorGetNext:3' shape=(None,) dtype=int64>]


Epoch 2/4
Epoch 3/4
Epoch 4/4
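The training run above uses `batch_size=100` for 4 epochs over the 40,000 training rows. Warmup schedules such as the one in `RectifiedAdam` are usually sized from the total number of optimiser steps. A quick sketch of that arithmetic follows. `total_training_steps` is a hypothetical helper, and the result may differ from the `total_steps` value passed to the optimiser earlier.

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Number of optimiser updates: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch counts
    return steps_per_epoch * epochs

total_steps = total_training_steps(n_examples=40_000, batch_size=100, epochs=4)
warmup_steps = int(total_steps * 0.1)  # warmup_proportion=0.1, as in the optimiser cell
print(total_steps, warmup_steps)
```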


In [None]:
import os
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [None]:
path = "gdrive/My Drive/Colab Notebooks/BERT/sentiment"

In [None]:
sentiment_model.save_weights(path+"/huggingface_bert.h5")

In [None]:
def predict_convert_data(data_df):
    global tokenizer
    tokens, masks, segments = [], [], []
    
    for i in tqdm(range(len(data_df))):

        token = tokenizer.encode(data_df[DATA_COLUMN][i], max_length=SEQ_LEN, truncation=True, padding='max_length')
        num_zeros = token.count(0)
        mask = [1]*(SEQ_LEN-num_zeros) + [0]*num_zeros
        segment = [0]*SEQ_LEN

        tokens.append(token)
        segments.append(segment)
        masks.append(mask)

    tokens = np.array(tokens)
    masks = np.array(masks)
    segments = np.array(segments)
    return [tokens, masks, segments]


def predict_load_data(pandas_dataframe):
    data_df = pandas_dataframe
    data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
    data_x = predict_convert_data(data_df)
    return data_x


In [None]:
test_set = predict_load_data(test)


100%|██████████| 10000/10000 [00:59<00:00, 166.68it/s]


In [None]:
test_set

[array([[  101,   146, 10134, ..., 10149, 10105,   102],
        [  101, 10747, 58768, ...,   187, 94671,   102],
        [  101,   107, 10117, ..., 10188, 10435,   102],
        ...,
        [  101, 51962, 10124, ...,     0,     0,     0],
        [  101, 11101,   146, ..., 11152, 15198,   102],
        [  101, 11590, 14384, ...,   169, 11897,   102]]),
 array([[1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])]

In [None]:
# The model keeps the strategy it was built under, so this also runs when TPU is False
preds = sentiment_model.predict(test_set)

INFO:absl:TPU has inputs with dynamic shapes: [<tf.Tensor 'Const:0' shape=() dtype=int32>, <tf.Tensor 'cond_8/Identity:0' shape=(None, 128) dtype=int64>, <tf.Tensor 'cond_8/Identity_1:0' shape=(None, 128) dtype=int64>, <tf.Tensor 'cond_8/Identity_2:0' shape=(None, 128) dtype=int64>]


In [None]:
preds

array([[0.56423783],
       [0.9963894 ],
       [0.00226241],
       ...,
       [0.9968954 ],
       [0.9984548 ],
       [0.003627  ]], dtype=float32)

In [None]:
from sklearn.metrics import classification_report
y_true = test['label']
# F1 Score 
print(classification_report(y_true, np.round(preds,0)))

              precision    recall  f1-score   support

           0       0.80      0.94      0.87      4960
           1       0.93      0.77      0.84      5040

    accuracy                           0.86     10000
   macro avg       0.87      0.86      0.85     10000
weighted avg       0.87      0.86      0.85     10000
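The per-class numbers in the report can be reproduced by hand from the thresholded predictions. A minimal sketch in plain Python for the class labelled 1 (negative reviews in this notebook), using made-up toy data. `binary_report` is a hypothetical helper, and it thresholds with `>= 0.5`, whereas `np.round` rounds exactly 0.5 down to 0.

```python
def binary_report(y_true, probs, threshold=0.5):
    """Precision, recall and F1 for label 1 at the given probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels and probabilities, not taken from the notebook's results.
precision, recall, f1 = binary_report([1, 1, 0, 0, 1], [0.9, 0.4, 0.2, 0.7, 0.8])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```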



In [None]:
#import logging
#tf.get_logger().setLevel(logging.ERROR)

### Prediction

In [None]:
def sentence_convert_data(data):
    global tokenizer
    tokens, masks, segments = [], [], []
    token = tokenizer.encode(data, max_length=SEQ_LEN, truncation=True, padding='max_length')
    
    num_zeros = token.count(0) 
    mask = [1]*(SEQ_LEN-num_zeros) + [0]*num_zeros 
    segment = [0]*SEQ_LEN

    tokens.append(token)
    segments.append(segment)
    masks.append(mask)

    tokens = np.array(tokens)
    masks = np.array(masks)
    segments = np.array(segments)
    return [tokens, masks, segments]

def movie_evaluation_predict(sentence):
    data_x = sentence_convert_data(sentence)
    predict = sentiment_model.predict(data_x)
    # Scalar probability that the review is negative (label 1)
    predict_value = float(np.ravel(predict)[0])
    predict_answer = int(np.round(predict_value))

    if predict_answer == 0:
        print("(Positive : %.2f)" % (1-predict_value))
    elif predict_answer == 1:
        print("(Negative : %.2f)" % predict_value)

In [None]:
movie_evaluation_predict("If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!")

(Positive : 1.00)


In [None]:
movie_evaluation_predict("I didn't hate this movie as much as some on my all time black list, but I consider it a total wast of film. Jeremy Irons, Iron Jeremy, Ron Jeremy. Think about it. Scene one is very good, all the rest are crap.")


(Negative : 0.99)
