<a href="https://colab.research.google.com/github/serdarbozoglan/My_NLP/blob/master/My_BERT_embedding2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 1: Importing dependencies

In [0]:
import numpy as np
import math
import re
import pandas as pd
from bs4 import BeautifulSoup
import random

from google.colab import drive

In [2]:
!pip install bert-for-tf2
!pip install sentencepiece

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/c2/d8/14e0cfa03bbeb72c314f0648267c490bcceec5e8fb25081ec31307b5509c/bert-for-tf2-0.12.6.tar.gz
Collecting py-params>=0.7.3
  Downloading https://files.pythonhosted.org/packages/ec/17/71c5f3c0ab511de96059358bcc5e00891a804cd4049021e5fa80540f201a/py-params-0.8.2.tar.gz
Collecting params-flow>=0.7.1
  Downloading https://files.pythonhosted.org/packages/0d/12/2604f88932f285a473015a5adabf08496d88dad0f9c1228fab1547ccc9b5/params-flow-0.7.4.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2-0.12.6-cp36-none-any.whl size=29115 sha256=f0861b771875d0640973aab9866506052f52ca894ba0c3fa7e2f3e94ff6d6584
  Stored in directory: /root/.cache/pip/wheels/24/19/54/51eeca468b219a1bc910c54aff87f0648b28a1fb71c115ba0f
  Building wheel for py-params (setup.py) ... done
  C

In [3]:
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

TensorFlow 2.x selected.


# Stage 2: Data preprocessing

## Loading files

We load the data files from our personal Google Drive.

In [4]:
drive.mount("/content/drive")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
cols = ["sentiment", "id", "date", "query", "user", "text"]
data = pd.read_csv(
    "/content/drive/My Drive/DS_Projects/BERT/sentiment_data/train.csv",
    header=None,
    names=cols,
    engine="python",
    encoding="latin1"
)

In [0]:
# For convenience, I only take the first 20K and the last 20K rows of the data (the data is ordered, so the first 20K are negative sentiment and the last 20K are positive sentiment)
data1 = data[:20000]
data2 = data[-20000:]
data = pd.concat([data1, data2], axis=0)

In [0]:
# Drop unnecessary columns
data.drop(["id", "date", "query", "user"],
          axis=1,
          inplace=True)

## Preprocessing

### Cleaning

In [0]:
def clean_tweet(tweet):
    tweet = BeautifulSoup(tweet, "lxml").get_text()
    # Removing @mentions such as @tigers
    tweet = re.sub(r"@[A-Za-z0-9]+", ' ', tweet)
    # Removing the URL links
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)
    # Keeping only letters and the characters . ! ? '
    # ([^...] is a negated class: every character that is not listed is replaced by a space)
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)
    # Removing additional whitespaces
    tweet = re.sub(r" +", ' ', tweet)
    return tweet
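
To see what the regex steps do to a tweet, here is a minimal sketch that reuses the same four substitutions on a made-up sample. The function name `clean_text` and the sample string are assumptions for illustration, and the BeautifulSoup HTML pass is left out so that the snippet only needs the standard library.

```python
import re

def clean_text(text):
    """Regex steps of clean_tweet, without the BeautifulSoup HTML pass."""
    text = re.sub(r"@[A-Za-z0-9]+", " ", text)            # drop @mentions
    text = re.sub(r"https?://[A-Za-z0-9./]+", " ", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z.!?']", " ", text)            # keep letters and . ! ? '
    text = re.sub(r" +", " ", text)                       # collapse runs of spaces
    return text

# Hypothetical tweet, not taken from the dataset
sample = "@tigers Check https://t.co/abc123 I'm SO happy!!! 100%"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Each removed piece is replaced by a space, so a single leading or trailing space can remain; the tokenizer splits on whitespace, so this should not affect the tokens.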

In [0]:
data_clean = [clean_tweet(tweet) for tweet in data.text]

In [0]:
data_labels = data.sentiment.values

# Positive sentiment is encoded as 4 in this dataset, so we map 4 to 1
data_labels[data_labels == 4] = 1

### Tokenization

We need to create a BERT layer to get access to the metadata the tokenizer needs (the vocabulary file and the lower-casing flag).

In [0]:
FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

In [13]:
tokenizer.tokenize("My dog loves, strawberries.")

['my', 'dog', 'loves', ',', 'straw', '##berries', '.']

In [14]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize('My dog loves, strawberries.'))

[2026, 3899, 7459, 1010, 13137, 20968, 1012]
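
The split of "strawberries" into 'straw' and '##berries' comes from WordPiece, which matches the longest known piece from left to right and marks continuation pieces with "##". The sketch below illustrates that greedy matching on a toy vocabulary. The function `wordpiece` and the set `toy_vocab` are assumptions made up for this example; it is not the `FullTokenizer` implementation and does not use BERT's real vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration only
toy_vocab = {"my", "dog", "loves", "straw", "##berries", "##berry"}
print(wordpiece("strawberries", toy_vocab))
print(wordpiece("dog", toy_vocab))
```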

Each BERT input here is a single sentence, so we add the [CLS] token at the beginning and the [SEP] token at the end of each sentence.

In [0]:
def encode_sentence(sent):
    return ["[CLS]"] + tokenizer.tokenize(sent) + ["[SEP]"]

In [0]:
data_inputs = [encode_sentence(sentence) for sentence in data_clean]

### Dataset creation

We need to create the 3 different inputs for each sentence.

In [0]:
# First input
def get_ids(tokens): # we get integers for strings
    return tokenizer.convert_tokens_to_ids(tokens)

# Second input
# Returns 1 for every real token and 0 for every "[PAD]" token
def get_mask(tokens): # padding mask
    return np.char.not_equal(tokens, "[PAD]").astype(int)

# Third input
# Every token up to and including the first "[SEP]" gets segment id 0; the tokens after it get 1.
# Reminder: we appended "[SEP]" after the sentence when tokenizing above
def get_segments(tokens):
    seg_ids = []
    current_seg_id = 0 # the first sentence gets segment id 0
    for tok in tokens:
        seg_ids.append(current_seg_id)
        if tok == "[SEP]":
            # A [SEP] token closes the current sentence, so we flip the id (0 -> 1 or 1 -> 0);
            # after the [SEP] that closes a second sentence the id goes back to 0
            current_seg_id = 1 - current_seg_id
    return seg_ids
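
As a quick check of the mask and segment logic, here is a plain-Python sketch of the same two rules applied to a single sentence and to a sentence pair. The helper names `mask_of` and `segments_of` and the token lists are assumptions for illustration; the notebook's own `get_mask` uses NumPy instead.

```python
def mask_of(tokens):
    # 1 for real tokens, 0 for "[PAD]"
    return [int(tok != "[PAD]") for tok in tokens]

def segments_of(tokens):
    # Same rule as get_segments: flip the id right after each "[SEP]"
    seg_ids, current = [], 0
    for tok in tokens:
        seg_ids.append(current)
        if tok == "[SEP]":
            current = 1 - current
    return seg_ids

single = ["[CLS]", "roses", "are", "red", ".", "[SEP]"]
pair = ["[CLS]", "a", "[SEP]", "b", "[SEP]"]  # hypothetical two-sentence input

print(mask_of(single))      # all ones: no "[PAD]" token present
print(segments_of(single))  # all zeros: only one sentence
print(segments_of(pair))    # zeros for sentence A, ones for sentence B
```

In this notebook the token lists never contain "[PAD]", so the mask is all ones at this stage; the zeros only appear later, when `padded_batch` pads all three inputs with 0.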

We will create padded batches (each batch is padded independently), which keeps the number of padding tokens to a minimum. To do that, we sort the sentences by length, apply padded_batch and then shuffle the batches.
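
To make the benefit of sorting concrete, the following sketch counts the padding tokens needed when each batch is padded to its own longest sentence. The sentence lengths and the helper `padding_needed` are made up for illustration.

```python
def padding_needed(lengths, batch_size):
    """Total padding tokens when each batch is padded to its longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 12, 4, 11, 5, 10]  # hypothetical sentence lengths in tokens

unsorted_padding = padding_needed(lengths, batch_size=2)
sorted_padding = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_padding, sorted_padding)
```

With these numbers, sorting first cuts the padding from 21 tokens to 7, because sentences of similar length end up in the same batch.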

In [0]:
data_with_len = [[sent, data_labels[i], len(sent)]
                 for i, sent in enumerate(data_inputs)]

# The original file has ordered labels: first the 0s, then the 4s (which we converted to 1s above),
# so we shuffle the negative and positive sentences together
random.shuffle(data_with_len)

# Each element of data_with_len holds, in order, the sentence, the label and the sentence length.
# We sort the list by sentence length, which is at index 2 (the 3rd element of each entry)
data_with_len.sort(key=lambda x: x[2])

In [0]:
# We build the three model inputs for each sentence, keeping the sorted order
# sent_lab is a [sentence, label, length] entry
sorted_all = [([get_ids(sent_lab[0]),  # token ids of the sentence
                get_mask(sent_lab[0]), # mask
                get_segments(sent_lab[0])],  # segments
                sent_lab[1]) # label
               for sent_lab in data_with_len if sent_lab[2] > 7] # we only keep sentences with more than 7 tokens ([CLS] and [SEP] included) and disregard the shorter ones

In [0]:
# A list is an iterable, so a lambda that returns it can serve as the generator for a dataset
all_dataset = tf.data.Dataset.from_generator(lambda: sorted_all,
                                             output_types=(tf.int32, tf.int32))

In [0]:
BATCH_SIZE = 32
all_batched = all_dataset.padded_batch(BATCH_SIZE,
                                       padded_shapes=((3, None), ()),
                                       padding_values=(0, 0))

In [0]:
NB_BATCHES = math.ceil(len(sorted_all) / BATCH_SIZE)
NB_BATCHES_TEST = NB_BATCHES // 10
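# We need to shuffle all_batched; otherwise the batches stay ordered from the shortest sentences to the longest.
# shuffle() returns a new dataset, so the result must be assigned. reshuffle_each_iteration=False
# keeps the order fixed across epochs, so the take/skip split below stays disjoint.
all_batched = all_batched.shuffle(NB_BATCHES, reshuffle_each_iteration=False)
test_dataset = all_batched.take(NB_BATCHES_TEST)  # we take the first NB_BATCHES_TEST batches for evaluation
train_dataset = all_batched.skip(NB_BATCHES_TEST) # we skip the first NB_BATCHES_TEST batches and keep the rest for training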
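
The batch arithmetic of the split can be checked in plain Python. In the sketch below, `split_counts` is a hypothetical helper and the example count of 36,150 sentences is an assumption chosen only because it yields 1130 batches, which matches the 1017 training steps and 113 evaluation steps reported in the outputs further down.

```python
import math

def split_counts(n_examples, batch_size, test_divisor=10):
    """Number of batches overall, for the test split and for training."""
    nb_batches = math.ceil(n_examples / batch_size)
    nb_test = nb_batches // test_divisor  # take(nb_test)
    nb_train = nb_batches - nb_test       # skip(nb_test)
    return nb_batches, nb_test, nb_train

# 36_150 is an assumed example count, not a value taken from the notebook
counts = split_counts(36_150, batch_size=32)
print(counts)
```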

In [25]:
# Below, an example sentence is sent to the BERT layer as input
my_sent = ["[CLS]"] + tokenizer.tokenize("Roses are red.") + ["[SEP]"]

# As explained above, the BERT layer needs 3 different inputs
# To simulate a batch we use tf.expand_dims, which adds a dimension
# tf.cast turns each input into a tensor; the token lists come from the helper functions, e.g. get_ids(my_sent)
# The 0 makes the new dimension the first one
bert_layer([tf.expand_dims(tf.cast(get_ids(my_sent), tf.int32), 0),
            tf.expand_dims(tf.cast(get_mask(my_sent), tf.int32), 0),
            tf.expand_dims(tf.cast(get_segments(my_sent), tf.int32), 0)])

# In the output below we see the following:
# The output has 2 elements. The first has shape (1, 768), where 1 is the simulated batch. It is the part that starts with the first array
# 768 is the hidden dimension, i.e. the embedding dimension. The 768 numbers in this array represent the WHOLE SENTENCE
# The second element has shape (1, 6, 768): 1 is the simulated batch, 6 is the number of tokens in our input (CLS + 3 words (Roses are red) + period (.) + SEP)
# 768 is again the embedding dimension (hidden size). This is the actual embedder output: it holds the embedding of every token in the sentence
# So we get 2 different outputs, and we use one or the other depending on what we use the model for
# For classification, the first element (the first array) is used
# The second element is used in other NLP tasks that work at the token level

[<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
 array([[-9.27935660e-01, -4.10335422e-01, -9.65755105e-01,
          9.07317996e-01,  8.12914014e-01, -1.74174443e-01,
          9.11234617e-01,  3.41952294e-01, -8.74521315e-01,
         -9.99989271e-01, -7.78410196e-01,  9.69385266e-01,
          9.86160517e-01,  6.36963129e-01,  9.48631346e-01,
         -7.51193285e-01, -4.58339691e-01, -7.08104610e-01,
          4.62098390e-01, -6.57927275e-01,  7.60414600e-01,
          9.99994814e-01, -3.96861047e-01,  3.44166279e-01,
          6.16488755e-01,  9.94400024e-01, -7.76633918e-01,
          9.38316584e-01,  9.59452271e-01,  7.32879519e-01,
         -6.93436861e-01,  2.93080509e-01, -9.93785501e-01,
         -1.64551809e-01, -9.67019558e-01, -9.95549560e-01,
          5.32935500e-01, -6.88060999e-01,  1.34715745e-02,
          2.98195519e-02, -9.18356597e-01,  4.20526326e-01,
          9.99988914e-01,  2.52676517e-01,  6.06235743e-01,
         -3.50750059e-01, -1.00000000e+00,  4.975

# Stage 3: Model building

In [0]:
# We create a deep CNN (Convolutional Neural Network) class which inherits from tf.keras.Model
class DCNNBERTEmbedding(tf.keras.Model):

    def __init__(self,
                 nb_filters=50,  # number of filters (feature detectors) per kernel size: by default 50 each for kernel sizes 2, 3 and 4
                 FFN_units=512,  # number of hidden units in the first Dense layer of the feed-forward part
                 nb_classes=2,  # we have 2 classes: positive and negative
                 dropout_rate=0.1,
                 name="dcnn"):  # model name

        super(DCNNBERTEmbedding, self).__init__(name=name) # initialize the parent tf.keras.Model class

        # We start creating the layers, beginning with the embedding layer
        self.bert_layer = hub.KerasLayer(
            "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
            trainable=False) # trainable=False freezes the BERT layer: we use BERT as it is, without fine-tuning

        # CNN layers
        # The first one has kernel size 2, so it focuses on 2 consecutive tokens; let's call it bigram
        # A Conv1D filter always spans the full embedding width, so it only slides along the sequence (remember the figure from the lesson). Splitting an embedding vector would make no sense, because the whole vector represents a single token
        # The stride is 1 (the default)
        self.bigram = layers.Conv1D(filters=nb_filters,
                                    kernel_size=2,     # the filter slides along 1 dimension only; kernel_size=2 for bigrams
                                    padding="valid",   # "valid" means no padding: the filter never runs past the input, so the output is kernel_size - 1 steps shorter
                                    activation="relu") # relu --> max(0, x): positive results are kept, negative ones become 0
        
        # We create the same layer for kernel sizes 3 and 4, which look at 3 and 4 consecutive tokens
        self.trigram = layers.Conv1D(filters=nb_filters,
                                     kernel_size=3,
                                     padding="valid",
                                     activation="relu")
        
        self.fourgram = layers.Conv1D(filters=nb_filters,
                                      kernel_size=4,
                                      padding="valid",
                                      activation="relu")
    
        # This layer takes the maximum of each filter's output over the whole sequence
        self.pool = layers.GlobalMaxPool1D()

        # Feed-forward part of the network (we use 2 dense layers)
        self.dense_1 = layers.Dense(units=FFN_units, activation="relu") # units = number of neurons

        # We apply dropout to reduce overfitting
        # Dropout is applied only in TRAINING, not in PREDICTION
        self.dropout = layers.Dropout(rate=dropout_rate)

        # The last Dense layer is the output layer
        if nb_classes == 2: # For binary classification
            self.last_dense = layers.Dense(units=1, # with two classes a single output neuron is enough
                                           activation="sigmoid")
        else: # multi-class
            self.last_dense = layers.Dense(units=nb_classes,
                                           activation="softmax")
    
    def embed_with_bert(self, all_tokens):
        # The "_" below is the vector that represents the whole sentence
        # embs holds the embeddings of the individual tokens; this is the one we use
        _, embs = self.bert_layer([all_tokens[:, 0, :], # first index --> all examples in the batch, 0 --> token ids, last index (:) --> all the values
                                   all_tokens[:, 1, :], # first index --> all examples in the batch, 1 --> mask, last index (:) --> all the values
                                   all_tokens[:, 2, :]]) # first index --> all examples in the batch, 2 --> segment ids (sentence A or B), last index (:) --> all the values
        return embs

    def call(self, inputs, training): # training is a boolean that tells whether we are in training mode

        x = self.embed_with_bert(inputs)

        x_1 = self.bigram(x) # (batch_size, seq_len-1, nb_filters)
        x_1 = self.pool(x_1) # (batch_size, nb_filters)
        x_2 = self.trigram(x) # (batch_size, seq_len-2, nb_filters)
        x_2 = self.pool(x_2) # (batch_size, nb_filters)
        x_3 = self.fourgram(x) # (batch_size, seq_len-3, nb_filters)
        x_3 = self.pool(x_3) # (batch_size, nb_filters)

        merged = tf.concat([x_1, x_2, x_3], axis=-1) # we concatenate along the last axis, which is the nb_filters axis
        # merged shape is (batch_size, 3*nb_filters)

        # We apply the first dense layer
        merged = self.dense_1(merged)

        # Dropout is applied only when training is True
        merged = self.dropout(merged, training)

        # output layer
        output = self.last_dense(merged)
        
        return output
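
The shape comments in `call` follow from how a "valid" 1D convolution and global max pooling behave. The sketch below reproduces both for a single filter in plain Python; the helper `conv1d_valid`, the toy sequence and the kernel weights are assumptions for illustration (no bias term, embedding dimension 2 instead of 768), not the Keras implementation.

```python
def conv1d_valid(seq, kernel):
    """One Conv1D filter with padding="valid", stride 1 and ReLU.

    seq:    list of token vectors, shape (seq_len, dim)
    kernel: list of weight vectors, shape (kernel_size, dim)
    """
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):  # seq_len - k + 1 positions
        window = seq[i:i + k]
        s = sum(w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row))
        out.append(max(0.0, s))  # ReLU
    return out

# Toy "embeddings": 6 tokens, dimension 2 (assumed values)
seq = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 2]]
bigram_kernel = [[1, -1], [1, 1]]  # kernel_size=2 spans the full width

features = conv1d_valid(seq, bigram_kernel)
pooled = max(features)  # global max pooling over the sequence
print(features, pooled)
```

A kernel of size 2 over 6 tokens yields 5 positions, and pooling reduces them to a single number per filter, which is why each pooled tensor in the model has shape (batch_size, nb_filters) whatever the sentence length.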

# Stage 4: Training

In [0]:
NB_FILTERS = 100
FFN_UNITS = 256
NB_CLASSES = 2

DROPOUT_RATE = 0.2

BATCH_SIZE = 32
NB_EPOCHS = 5

In [0]:
# We're creating our Neural Network
Dcnn = DCNNBERTEmbedding(nb_filters=NB_FILTERS,
                         FFN_units=FFN_UNITS,
                         nb_classes=NB_CLASSES,
                         dropout_rate=DROPOUT_RATE)

In [0]:
if NB_CLASSES == 2:
    Dcnn.compile(loss="binary_crossentropy",
                 optimizer="adam",
                 metrics=["accuracy"])
else:
    Dcnn.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam",
                 metrics=["sparse_categorical_accuracy"])
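
For the binary case the model is compiled with binary cross-entropy. As a reminder of what that loss measures for one example, here is a plain-Python sketch of the formula -(y·log(p) + (1-y)·log(1-p)). The function name `bce` is an assumption; the Keras loss additionally clips the probabilities and averages over the batch.

```python
import math

def bce(y_true, p):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = bce(1, 0.9)  # positive example, predicted 0.9
confident_wrong = bce(1, 0.1)  # positive example, predicted 0.1
print(round(confident_right, 3), round(confident_wrong, 3))
```

The loss is small when the sigmoid output is close to the true label and grows quickly when the model is confidently wrong.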

In [0]:
# We save the weights of the trained model so that we can reuse them later
checkpoint_path = "./drive/My Drive/DS_Projects/BERT/ckpt_bert_embedding/"

ckpt = tf.train.Checkpoint(Dcnn=Dcnn)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=1)
# max_to_keep sets how many checkpoints are kept in this folder; increase it to keep previous checkpoints as well

if ckpt_manager.latest_checkpoint: # path of the latest checkpoint in the folder, or None if there is none
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest checkpoint restored!!")

In [0]:
# A custom callback lets us run our own code at the end of any epoch or batch
class MyCustomCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None): # runs at the end of each epoch
        ckpt_manager.save() # save the state of the model
        print("Checkpoint saved at {}.".format(checkpoint_path))

## Result

In [32]:
Dcnn.fit(train_dataset,
         epochs=NB_EPOCHS,
         callbacks=[MyCustomCallback()])

Epoch 1/5
   1017/Unknown - 54s 53ms/step - loss: 0.4769 - accuracy: 0.7731
Checkpoint saved at ./drive/My Drive/DS_Projects/BERT/ckpt_bert_embedding/.
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f9944657d68>

1. Training is almost twice as fast as with the home-made/custom embedding shown in the My_BERT_tokenizer Python file.

2. Since I did not use the real/full dataset, the performance evaluation may not be realistic, and the performance here may be slightly worse. The evaluation accuracies need to be compared.

3. We get less overfitting: the model is less biased and less overfit.

4. Even if the test set is not Twitter data, the BERT embedding should give better results, because it generalizes better.



# Stage 5: Evaluation

In [33]:
results = Dcnn.evaluate(test_dataset)
print(results)

    113/Unknown - 3s 30ms/step - loss: 0.5088 - accuracy: 0.8053
[0.5088285698299915, 0.8053097]


In [0]:
def get_prediction(sentence):
    tokens = encode_sentence(sentence)

    input_ids = get_ids(tokens)
    input_mask = get_mask(tokens)
    segment_ids = get_segments(tokens)

    inputs = tf.stack(
        [tf.cast(input_ids, dtype=tf.int32),
         tf.cast(input_mask, dtype=tf.int32),
         tf.cast(segment_ids, dtype=tf.int32)],
         axis=0)
    inputs = tf.expand_dims(inputs, 0) # simulates a batch

    output = Dcnn(inputs, training=False)

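    # Threshold the sigmoid output at 0.5. (math.floor(output*2) would give 2,
    # and print nothing, for an output of exactly 1.0.)
    sentiment = int(output.numpy()[0, 0] >= 0.5)

    if sentiment == 0:
        print("Output of the model: {}\nPredicted sentiment: negative".format(
            output))
    else:
        print("Output of the model: {}\nPredicted sentiment: positive".format(
            output))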

In [36]:
get_prediction("This actor is a deception.")

Output of the model: [[0.38071552]]
Predicted sentiment: negative
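
A sigmoid output can be mapped to a class label either by flooring twice its value or by comparing it with 0.5. The sketch below compares the two rules in plain Python; the helper names `label_floor` and `label_threshold` are assumptions for illustration. The rules agree everywhere except at an output of exactly 1.0, where flooring gives 2.

```python
import math

def label_floor(p):
    # floor(2p): 0 for p < 0.5, 1 for 0.5 <= p < 1, but 2 for p == 1.0
    return math.floor(p * 2)

def label_threshold(p):
    # 1 when the probability reaches 0.5, else 0
    return int(p >= 0.5)

for p in (0.38071552, 0.62, 1.0):
    print(p, label_floor(p), label_threshold(p))
```

The first value is the model output shown above, which both rules map to 0 (negative).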
