<a href="https://colab.research.google.com/github/seongyeon1/twitterNLP/blob/main/bert_modeling_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting BERT

In [3]:
!pip install --upgrade tensorflow_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import tensorflow_hub as hub
import numpy as np

## Load the BERT model

In [None]:
# Load BERT from TensorFlow Hub
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=False)

trainable=False freezes the pre-trained BERT layers, since we don't want to retrain them.

BERT model version: bert_en_uncased_L-24_H-1024_A-16

- L=24 hidden layers (Transformer blocks)
- H=1024 hidden size
- A=16 attention heads

This model is pre-trained on the Wikipedia and BooksCorpus datasets. en_uncased signifies that the model is pre-trained for English and is case-insensitive: the text is lowercased before tokenization.
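
The naming convention can be unpacked mechanically. The sketch below is only illustrative: parse_bert_handle is a hypothetical helper (it is not part of tensorflow_hub) that reads L, H, and A out of the model name and derives the width of each attention head as H / A.

```python
import re

def parse_bert_handle(name):
    """Hypothetical helper: read L, H and A out of a BERT model name."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", name)
    if match is None:
        raise ValueError(f"no L/H/A fields found in {name!r}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {
        "layers": layers,                # number of Transformer blocks
        "hidden_size": hidden,           # width of each token vector
        "attention_heads": heads,
        "head_size": hidden // heads,    # width handled by one attention head
        "uncased": "uncased" in name.split("_"),
    }

config = parse_bert_handle("bert_en_uncased_L-24_H-1024_A-16")
print(config)
```

For this model each of the 16 heads works on a 64-dimensional slice of the 1024-dimensional hidden vector.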

## Loading the tokenizer

For training, we need to convert our text dataset into a BERT-supported input format. To do this, we first tokenize the dataset and then convert the tokens into numeric features.
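
To make the two steps concrete, here is a minimal sketch of WordPiece-style tokenization (greedy longest-match-first, with ## marking a continuation piece) followed by the token-to-ID lookup. TOY_VOCAB, wordpiece, and encode are made up for illustration; the real tokenizer uses the full BERT vocabulary and also splits punctuation, which this sketch skips.

```python
# Toy vocabulary: these entries and IDs are invented for illustration,
# they are NOT the real bert-base-uncased vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5, "##ed": 6, "phone": 7, "new": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Step 1: tokenize. Step 2: map every token to its numeric ID."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens, [vocab[token] for token in tokens]

tokens, ids = encode("Playing new phone games", TOY_VOCAB)
print(tokens)
print(ids)
```

Here "playing" is split into "play" and "##ing", while "games" has no matching pieces in the toy vocabulary and falls back to [UNK].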

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")

In [None]:
# Note: train is defined in the "Loading the dataset" section below, so run that section first
tokenizer.tokenize(train.tweet[5])

## Understanding Input Data Format

### **Token Embeddings**

Token embeddings hold the content of our dataset: each unique token in the vocabulary is assigned a numeric ID, which is mapped to an embedding vector.

- A [CLS] token is added at the beginning of every input sequence to mark its start.
- A [SEP] token is added at the end of every sentence to mark its end.

### **Position Embeddings**

Position embeddings indicate the position of each token in a sentence.

This helps BERT capture the order of the information in a sentence.

### **Segment Embeddings**

The model must know whether a particular token belongs to sentence 1 or sentence 2.

In BERT, this is done by assigning each token a segment ID (0 or 1), which is mapped to a segment embedding.

So far we have discussed BERT, its input format, and how to load the BERT model.
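
BERT builds its input representation by summing the three embeddings for each position. The sketch below shows that sum with tiny random lookup tables; the sizes, the tables, and the embed function are illustrative assumptions, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random

random.seed(0)
HIDDEN = 4       # toy width; the model above uses H=1024
VOCAB_SIZE = 10  # toy vocabulary size
MAX_POS = 8      # toy maximum sequence length

def make_table(rows, cols):
    """Random lookup table standing in for learned embedding weights."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_table = make_table(VOCAB_SIZE, HIDDEN)
position_table = make_table(MAX_POS, HIDDEN)
segment_table = make_table(2, HIDDEN)  # one row for segment 0, one for segment 1

def embed(token_ids, segment_ids):
    """Element-wise sum of token, position and segment embeddings."""
    output = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], position_table[position], segment_table[seg])
        output.append([t + p + s for t, p, s in rows])
    return output

token_ids = [2, 4, 5, 3]    # e.g. [CLS] play ##ing [SEP] in a toy vocabulary
segment_ids = [0, 0, 0, 0]  # a single sentence, so every token is in segment 0
vectors = embed(token_ids, segment_ids)
print(len(vectors), len(vectors[0]))
```

The result has one HIDDEN-sized vector per input position, which is what the Transformer blocks consume.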

In [None]:
sample = tokenizer.tokenize(train.tweet[5])

# Assemble the BERT input sequence by hand
input_seq = ["[CLS]"] + sample + ["[SEP]"]
token = tokenizer.convert_tokens_to_ids(input_seq)  # map each token to its vocabulary ID
pad_len = 512 - len(token)
token = token + [0] * pad_len  # pad with 0 (the [PAD] ID) so every sequence has length 512
len(token)
# token is now the first input for BERT

In [None]:
print(input_seq)

In [None]:
token[:10]

## Loading the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

In [None]:
DATA_IN_PATH = '/content/drive/MyDrive/ColabNotebooks/datasets/sentiments/'

df = pd.read_csv(DATA_IN_PATH + 'preprocessed_df.csv')

In [None]:
df

### Data Cleaning

In [None]:
import re

In [None]:
# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

stemmer = SnowballStemmer("english")

# Negation words carry sentiment, so remove them from the stop-word set to keep them in the text
stop_words = set(stopwords.words("english"))
stop_words_list = ['no', 'nor', 'not', 'don', "don't", 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
                   "hadn't", 'hasn', "hasn't", "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn',
                   "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

for i in stop_words_list:
  stop_words.discard(i)  # discard() does not raise if the word is not in the set
  
# TEXT CLEANING
# Matches @mentions, URLs, and runs of non-alphanumeric characters (raw string avoids escape warnings)
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)


In [None]:
%%time
df['tweet'] = df['tweet'].apply(preprocess)
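
To see what the cleaning step does to a single tweet, here is a small self-contained sketch. It uses the same three kinds of pattern (mentions, URLs, runs of non-alphanumeric characters) but swaps in TOY_STOP_WORDS, a made-up stand-in for the NLTK stop-word set, so that it runs without any downloads. The clean function and the sample tweet are illustrative only.

```python
import re

TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
TOY_STOP_WORDS = {"i", "this", "t"}  # invented stand-in for the NLTK stop-word set

def clean(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip mentions/URLs/symbols, then drop stop words."""
    text = re.sub(TEXT_CLEANING_RE, " ", str(text).lower()).strip()
    return " ".join(token for token in text.split() if token not in stop_words)

cleaned = clean("@user I don't LOVE this!! https://t.co/abc #fail")
print(cleaned)
```

Note that the apostrophe in "don't" is replaced by a space, so the word reaches the stop-word filter as "don" and "t". That appears to be why bare forms such as 'don' are kept out of the stop-word set alongside "don't".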

In [None]:
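# .copy() avoids pandas SettingWithCopyWarning when columns are added to these frames later
train = df[:7920].copy()
test = df[7920:].copy()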

## Find max length of the data

In [None]:
max_length = 0
for text in df.tweet:
  length = len(text.split())
  if length > max_length:
    max_length = length

print(max_length)

## Pre-Processing Dataset into BERT Format

In [None]:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model

- The function bert_encoder takes the text data and the tokenizer and creates the token IDs, padding masks, and segment IDs that will be passed to our model for training.
- BERT supports a maximum sequence length of 512 tokens.

In [None]:
def bert_encoder(texts, tokenizer, max_len=41):
    
    # here we need 3 data inputs for bert training and fine tuning 
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text_sequence = text[:max_len-2]  # truncate to leave room for [CLS] and [SEP]
        input_sequences = ["[CLS]"] + text_sequence + ["[SEP]"]
        pad_len = max_len - len(input_sequences)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequences)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequences) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

- bert_encoder takes the tokenizer and text data as input and returns three arrays: token IDs, padding masks, and segment IDs.
- convert_tokens_to_ids looks up each token in the vocab file and returns its ID.
- max_len is the length every sequence is padded or truncated to; it can be at most 512, the longest input BERT supports.

**Note**: The token IDs and the padding mask must be passed to BERT for training; the segment IDs are all 0 here because each tweet is a single segment.
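
The effect of padding and truncation is easiest to see on a tiny case. The sketch below repeats the encoder's steps for one text using plain lists. ToyTokenizer is a hypothetical stand-in that exposes only the two methods the encoder relies on (its vocabulary is invented), and encode_one is an illustrative helper, not part of the notebook.

```python
class ToyTokenizer:
    """Hypothetical stand-in exposing the two methods the encoder relies on."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "good": 4, "phone": 5, "bad": 6, "battery": 7}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.vocab["[UNK]"]) for tok in tokens]

def encode_one(text, tokenizer, max_len):
    """Same steps as the encoder above, for a single text and with plain lists."""
    pieces = tokenizer.tokenize(text)[:max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + pieces + ["[SEP]"]
    pad_len = max_len - len(sequence)
    token_ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len       # 1 = real token, 0 = padding
    segments = [0] * max_len                         # single-sentence input
    return token_ids, mask, segments

toy = ToyTokenizer()
short = encode_one("good phone", toy, max_len=6)
long = encode_one("bad battery bad phone good phone", toy, max_len=6)
print(short)
print(long)
```

The short text is padded with two zeros and its mask ends in two zeros; the long text loses its last two words so that [CLS] and [SEP] still fit within max_len.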

Calling the encoding function:

In [None]:
train_input = bert_encoder(train.tweet.values, tokenizer, max_len=41)

- max_len = 41 since most tweets are within 41 words.
- train_input is a tuple of 3 arrays (all_tokens, all_masks, all_segments).
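
One way to sanity-check a choice of max_len is to measure how many texts fit without truncation. The sketch below is a rough check under stated assumptions: coverage is a hypothetical helper, the sample texts are made up, and it counts whitespace-separated words, which is only a lower bound because WordPiece can split one word into several tokens.

```python
def coverage(texts, max_len):
    """Fraction of texts whose whitespace word count fits in max_len - 2 slots."""
    budget = max_len - 2  # two positions are reserved for [CLS] and [SEP]
    fits = sum(1 for text in texts if len(text.split()) <= budget)
    return fits / len(texts)

# Made-up sample; in the notebook this would be the cleaned tweet column
sample_texts = ["love new phone", "not happy battery dies fast", "great " * 50]
share = coverage(sample_texts, max_len=41)
print(f"{share:.2f} of the sample fits without truncation")
```

If the share is low for the real data, a larger max_len (up to 512) trades memory and training time for less truncation.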

## Building model using BERT layers

We need to design a model for our use case on top of the pre-trained BERT model by adding a dense output layer that gives the final prediction.

In [None]:
import tensorflow as tf

In [None]:
def build_model():
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
max_len = 41

In [None]:
model_final = build_model()
model_final.summary()
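
Because the BERT layer is loaded with trainable=False, the only weights updated during training should be those of the final dense layer. The sketch below works out that count and writes the sigmoid head out by hand for a single example; trainable_head_params and sigmoid_head are illustrative helpers, and the input vector is made up.

```python
import math

HIDDEN = 1024  # H of the BERT model loaded above

def trainable_head_params(hidden, units=1):
    """Weights plus biases of a dense layer applied to one hidden vector."""
    return hidden * units + units

def sigmoid_head(cls_vector, weights, bias):
    """Dense(1, activation='sigmoid') written out for a single example."""
    logit = sum(x * w for x, w in zip(cls_vector, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

print(trainable_head_params(HIDDEN))

# With all-zero weights the head is undecided: sigmoid(0) = 0.5
probability = sigmoid_head([0.3] * HIDDEN, [0.0] * HIDDEN, 0.0)
print(probability)
```

The head has 1024 weights and 1 bias, so the trainable parameter count reported in the model summary should be small compared with the frozen BERT weights.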

## Training Step

In [None]:
history = model_final.fit(
    train_input, train.label,
    validation_split=0.2,
    epochs=5,
    batch_size=16
)
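
The split and batch settings translate into concrete sizes. Keras takes the validation rows from the end of the data, before any shuffling. The sketch below does the arithmetic, assuming the training frame really has 7920 rows; fit_schedule is a hypothetical helper, and the percentage is passed as an integer to keep the arithmetic exact.

```python
import math

def fit_schedule(n_samples, val_percent, batch_size):
    """Sizes implied by validation_split and batch_size (val_percent is an integer)."""
    n_train = n_samples * (100 - val_percent) // 100  # leading rows are used for training
    n_val = n_samples - n_train                       # trailing rows are held out
    return {
        "train_rows": (0, n_train),
        "val_rows": (n_train, n_samples),
        "train_steps": math.ceil(n_train / batch_size),
        "val_steps": math.ceil(n_val / batch_size),
    }

schedule = fit_schedule(n_samples=7920, val_percent=20, batch_size=16)
print(schedule)
```

Since the held-out rows are always the last ones, it can be worth shuffling the data beforehand if it is sorted in any meaningful way.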

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
 
epochs = range(len(acc))
 
plt.plot(epochs, acc, 'b', label='Training acc')
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
 
plt.figure()
 
plt.plot(epochs, loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
 
plt.show()

In [None]:
model_final.save('model.h5')

## Testing and validation
For testing and prediction, the test data must be in the same format as the training data.

Calling the **bert_encoder** function on the test data converts it into the three input arrays, which are then passed to the model.predict method.

In [None]:
test_input = bert_encoder(test.tweet.values, tokenizer, max_len=41)
test_pred = model_final.predict(test_input)
prediction = np.where(test_pred > 0.5, 1, 0).ravel()  # flatten (n, 1) to (n,) for use as a column
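
The thresholding step can be written out without NumPy to make the rule explicit. In this sketch to_labels is an illustrative helper and fake_pred is a made-up stand-in for the model output, which has one sigmoid probability per row.

```python
def to_labels(probabilities, threshold=0.5):
    """Flatten an (n, 1) nested list of probabilities and apply a strict threshold."""
    flat = [row[0] for row in probabilities]
    return [1 if p > threshold else 0 for p in flat]

fake_pred = [[0.91], [0.50], [0.07], [0.63]]  # invented sigmoid outputs, shape (4, 1)
labels = to_labels(fake_pred)
print(labels)
```

Because the comparison is strict, a probability of exactly 0.5 is labeled 0.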

In [None]:
pred = pd.DataFrame(test_pred)
pred.to_csv((DATA_IN_PATH + 'bert_final_0.88.csv'), index=False)

In [None]:
test['prediction'] = prediction

In [None]:
test[test.prediction == 1]

In [None]:
test.prediction

In [None]:
submission = pd.read_csv((DATA_IN_PATH + 'sample_submission.csv'))

In [None]:
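# Use .values so rows are matched by position, not by index (test is indexed from 7920, submission from 0)
submission['label'] = test.prediction.values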

In [None]:
submission