<a href="https://colab.research.google.com/github/rajni-arora/Deep-Learning-Projects/blob/main/Bert_tokenizer(Multiclass%20Problem)Sentiment%20classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 1: Importing dependencies

In [1]:
import numpy as np
import math
import re
import pandas as pd
from bs4 import BeautifulSoup
import random

from google.colab import drive

In [2]:
!pip install bert-for-tf2
!pip install sentencepiece

Collecting bert-for-tf2
  Downloading bert-for-tf2-0.14.9.tar.gz (41 kB)
[?25l[K     |████████                        | 10 kB 22.6 MB/s eta 0:00:01[K     |████████████████                | 20 kB 27.0 MB/s eta 0:00:01[K     |███████████████████████▉        | 30 kB 31.2 MB/s eta 0:00:01[K     |███████████████████████████████▉| 40 kB 23.6 MB/s eta 0:00:01[K     |████████████████████████████████| 41 kB 131 kB/s 
[?25hCollecting py-params>=0.9.6
  Downloading py-params-0.10.2.tar.gz (7.4 kB)
Collecting params-flow>=0.8.0
  Downloading params-flow-0.8.2.tar.gz (22 kB)
Building wheels for collected packages: bert-for-tf2, params-flow, py-params
  Building wheel for bert-for-tf2 (setup.py) ... [?25l[?25hdone
  Created wheel for bert-for-tf2: filename=bert_for_tf2-0.14.9-py3-none-any.whl size=30534 sha256=bbaefd4bc7cb6f6a7eb4c2edc78d68cadb42afe59a2cf40934e40f8acba27f13
  Stored in directory: /root/.cache/pip/wheels/47/b6/e5/8c76ec779f54bc5c2f1b57d2200bb9c77616da83873e8acb53
  Buil

In [3]:
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

# Stage 2: Data preprocessing

## Loading files

We import files from our personal Google drive.

In [28]:
cols = ["Class", "Cleaned_Text"]

data = pd.read_csv(
    '/content/sample_data/processed_data.csv',
    header=None,
    names=cols,
    engine="python",
    encoding="latin1"
)

In [9]:
# data.drop(["id", "date", "query", "user"],
#           axis=1,
#           inplace=True)

In [80]:
data.head(10)

Unnamed: 0,Class,Cleaned_Text
1,Vaccine Challenges,faints from mild pain by the way a ton of girl...
2,Vaccine Challenges,painislife said not saying that is not the cas...
3,Consumer Experience,goldenwolf87 said i wonder how much more commo...
4,Consumer Experience,travelnomad said hpv herpes can be contracted ...
5,Consumer Experience,louisiana fisher said im sitting there playing...
6,Consumer Experience,wonder if there is a link between miss and hpv...
7,Consumer Experience,hpv has not been linked to developing miss pla...
8,Consumer Experience,hi i m mason i got this lump on penile shaft a...
9,Consumer Experience,be reddish in color or white sometimes you can...
10,Vaccine Challenges,hours to get through gardasil even causes canc...


In [30]:
data = data[1:]

## Preprocessing

### Cleaning

In [32]:
def clean_tweet(tweet):
    tweet = BeautifulSoup(tweet, "lxml").get_text()
    # Delete the @
    tweet = re.sub(r"@[A-Za-z0-9]+", ' ', tweet)
    # Delete URL links
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)
    # Just keep letters and important punctuation
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)
    # Remove additional spaces
    tweet = re.sub(r" +", ' ', tweet)
    return tweet

In [33]:
data_clean = [clean_tweet(tweet) for tweet in data.Cleaned_Text.astype(str)]

In [34]:
data["Class"].value_counts()

Consumer Experience     7454
HPV Screening           5356
Awareness Related       2136
Vaccine Challenges      1732
Others                   940
Study Results            475
Campaign_initiatives     190
Name: Class, dtype: int64

In [35]:
from sklearn.preprocessing import LabelEncoder

In [36]:
le = LabelEncoder()
labels = le.fit_transform(data["Class"])

In [78]:
labels[:100]

array([6, 6, 2, 2, 2, 2, 2, 2, 2, 6, 0, 0, 3, 1, 2, 2, 2, 6, 2, 0, 2, 2,
       3, 3, 2, 3, 2, 3, 6, 2, 2, 2, 2, 3, 2, 0, 3, 2, 2, 3, 3, 3, 2, 3,
       3, 3, 2, 2, 2, 0, 6, 6, 2, 2, 3, 2, 4, 2, 3, 2, 3, 3, 3, 2, 3, 3,
       2, 3, 0, 2, 3, 2, 2, 2, 5, 2, 2, 2, 0, 0, 0, 2, 6, 2, 0, 2, 2, 2,
       2, 3, 2, 2, 2, 3, 3, 3, 2, 3, 2, 6])

In [8]:
# data_labels = data.Class.values
# data_labels[data_labels == 4] = 1

### Tokenization

We need to create a BERT layer to have access to meta data for the tokenizer (like vocab size).

input = 'Where are you going?'
output = ['where', 'are', 'you', 'going', '?']

In [38]:
FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

In [39]:
def encode_sentence(sent):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent))

**encode_sentence** = Universal Sentence Encoder encodes entire sentence or text into vectors of real numbers that can be used for clustering, sentence similarity, text classification, and other Natural language processing (NLP) tasks. 

In [40]:
data_inputs = [encode_sentence(sentence) for sentence in data_clean]

In [None]:
data_inputs

### Dataset creation

We will create padded batches (so we pad sentences for each batch inpedendently), this way we add the minimum of padding tokens possible. For that, we sort sentences by length, apply padded_batches and then shuffle.

In [41]:
data_with_len = [[sent, labels[i], len(sent)] for i, sent in enumerate(data_inputs)]
random.shuffle(data_with_len)
data_with_len.sort(key=lambda x: x[2])


In [42]:
sorted_all = [(sent_lab[0], sent_lab[1])
              for sent_lab in data_with_len if sent_lab[2] >10]

In [43]:
# A list is a type of iterator so it can be used as generator for a dataset
all_dataset = tf.data.Dataset.from_generator(lambda: sorted_all,
                                             output_types=(tf.int32,tf.int32))

In [24]:
all_dataset

<FlatMapDataset shapes: (<unknown>,), types: (tf.int32,)>

In [44]:
next(iter(all_dataset))

(<tf.Tensor: shape=(11,), dtype=int32, numpy=
 array([2613, 4485, 6522, 2615, 1998, 2060, 4485, 7110, 2102, 2053, 8257],
       dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=2>)

In [17]:
BATCH_SIZE = 32
all_batched = all_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

In [45]:
next(iter(all_batched))

(<tf.Tensor: shape=(32, 11), dtype=int32, numpy=
 array([[ 2613,  4485,  6522,  2615,  1998,  2060,  4485,  7110,  2102,
          2053,  8257],
        [ 2054,  2015,  2008,  1998,  2054,  1998,  2043,  2003,  1996,
          6643,  2361],
        [ 2498,  2066,  2893,  1037,  6643,  2361, 15488, 14644,  2039,
          2115,  4451],
        [ 4756,  2026,  8239,  4632,  2125,  2157,  2066,  2017,  2525,
          2253,  5742],
        [ 2129,  2106,  2017,  2131,  2009,  3718,  2129,  2146,  2106,
          2009,  2202],
        [10413, 22911,  2053,  1045,  2031,  2025,  2018,  1996,  6522,
          2615, 17404],
        [ 2061,  2024,  2122, 14148,  2066,  2893,  1037,  6643,  2361,
         15488, 14644],
        [ 2033,  2043,  2049,  2051,  2005,  2026, 12142,  6643,  2361,
         15488, 14644],
        [ 1998,  2025,  2069, 28896,  2644, 13475,  2006,  2502, 21890,
         17830, 10338],
        [10166,  2035,  2013,  1037, 17404,  3745,  2014,  2466,  2023,
          3532,

In [46]:
NB_BATCHES = math.ceil(len(sorted_all) / BATCH_SIZE)
NB_BATCHES_TEST = NB_BATCHES // 10
all_batched.shuffle(NB_BATCHES)
test_dataset = all_batched.take(NB_BATCHES_TEST)
train_dataset = all_batched.skip(NB_BATCHES_TEST)

# Stage 3: Model building

In [63]:
class DCNN(tf.keras.Model):
    
    def __init__(self,
                 vocab_size,
                 emb_dim=128,
                 nb_filters=50,
                 FFN_units=512,
                 nb_classes=7,
                 dropout_rate=0.1,
                 training=False,
                 name="dcnn"):
        super(DCNN, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocab_size,
                                          emb_dim)
        self.bigram = layers.Conv1D(filters=nb_filters,
                                    kernel_size=2,
                                    padding="valid",
                                    activation="relu")
        self.trigram = layers.Conv1D(filters=nb_filters,
                                     kernel_size=3,
                                     padding="valid",
                                     activation="relu")
        self.fourgram = layers.Conv1D(filters=nb_filters,
                                      kernel_size=4,
                                      padding="valid",
                                      activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=FFN_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if nb_classes == 7:
            self.last_dense = layers.Dense(units=1,
                                           activation="softmax")
        else:
            self.last_dense = layers.Dense(units=nb_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        x = self.embedding(inputs)
        x_1 = self.bigram(x) # (batch_size, nb_filters, seq_len-1)
        x_1 = self.pool(x_1) # (batch_size, nb_filters)
        x_2 = self.trigram(x) # (batch_size, nb_filters, seq_len-2)
        x_2 = self.pool(x_2) # (batch_size, nb_filters)
        x_3 = self.fourgram(x) # (batch_size, nb_filters, seq_len-3)
        x_3 = self.pool(x_3) # (batch_size, nb_filters)
        
        merged = tf.concat([x_1, x_2, x_3], axis=-1) # (batch_size, 3 * nb_filters)
        merged = self.dense_1(merged)
        merged = self.dropout(merged, training)
        output = self.last_dense(merged)
        
        return output

# Stage 4: Training

In [64]:
VOCAB_SIZE = len(tokenizer.vocab)
EMB_DIM = 200
NB_FILTERS = 100
FFN_UNITS = 256
NB_CLASSES = 7

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

In [65]:
Dcnn = DCNN(vocab_size=VOCAB_SIZE,
            emb_dim=EMB_DIM,
            nb_filters=NB_FILTERS,
            FFN_units=FFN_UNITS,
            nb_classes=NB_CLASSES,
            dropout_rate=DROPOUT_RATE)

In [66]:
if NB_CLASSES == 7:
    Dcnn.compile(loss="categorical_crossentropy",
                 optimizer="adam",
                 metrics=["accuracy"])
else:
    Dcnn.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam",
                 metrics=["sparse_categorical_accuracy"])

In [67]:
checkpoint_path = "./drive/MyDrive/projects/BERT/ckpt_bert_tok/"

ckpt = tf.train.Checkpoint(Dcnn=Dcnn)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=1)

if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest Checkpoint restored!")

Latest Checkpoint restored!


In [68]:
class MyCustomCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        ckpt_manager.save()
        print("Checkpoint saved at {}.".format(checkpoint_path))

In [69]:
Dcnn.fit(train_dataset,
         epochs=NB_EPOCHS,
         callbacks=[MyCustomCallback()])

Epoch 1/5
Checkpoint saved at ./drive/MyDrive/projects/BERT/ckpt_bert_tok/.
Epoch 2/5
Checkpoint saved at ./drive/MyDrive/projects/BERT/ckpt_bert_tok/.
Epoch 3/5
Checkpoint saved at ./drive/MyDrive/projects/BERT/ckpt_bert_tok/.
Epoch 4/5
Checkpoint saved at ./drive/MyDrive/projects/BERT/ckpt_bert_tok/.
Epoch 5/5
Checkpoint saved at ./drive/MyDrive/projects/BERT/ckpt_bert_tok/.


<keras.callbacks.History at 0x7f20ab614b10>

# Stage 5: Evaluation

In [70]:
results = Dcnn.evaluate(test_dataset)
print(results)

[0.0, 0.006485849153250456]


In [81]:
def get_prediction(sentence):
    tokens = encode_sentence(sentence)
    inputs = tf.expand_dims(tokens, 0)

    output = Dcnn(inputs, training=False)

    sentiment = math.floor(output*2)

    if sentiment == 0:
        print("Output of the model: {}\nPredicted sentiment: Awareness Related.".format(
            output))
    elif sentiment == 1:
        print("Output of the model: {}\nPredicted sentiment: Campaign_initiatives	.".format(
            output))
    elif sentiment == 2:
        print("Output of the model: {}\nPredicted sentiment: Consumer Experience	.".format(
            output))
    elif sentiment == 3:
        print("Output of the model: {}\nPredicted sentiment: HPV Screening.".format(
            output))
    elif sentiment == 4:
        print("Output of the model: {}\nPredicted sentiment: Others.".format(
            output))
    elif sentiment == 5:
        print("Output of the model: {}\nPredicted sentiment: Study Results.".format(
            output))
    elif sentiment == 6:
        print("Output of the model: {}\nPredicted sentiment: Vaccine Challenges.".format(
            output))

In [82]:
get_prediction("louisiana fisher said im sitting there playing.")

Output of the model: [[1.]]
Predicted sentiment: Consumer Experience	.


In [83]:
get_prediction("I'd rather not do that again.")

Output of the model: [[1.]]
Predicted sentiment: Consumer Experience	.
