# Generating BERT Embeddings

As our exploratory data analysis showed, the textual content and raw sentiment of a review are indicative of its rating. However, raw sentiment alone is not suitable as training data because it does not capture any contextual information about the review. We therefore need an encoding that captures both sentiment and context.

We can create this encoding through transfer learning, using a pre-trained state-of-the-art deep neural network model: BERT.

In [13]:
# General Imports
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from os.path import join
from tqdm import tqdm

# BERT Specific Imports
import tensorflow as tf
import tensorflow_hub as hub 
import tensorflow_text as text  # Not called directly; importing it registers the ops used by the preprocessing model

We will be using a pre-trained BERT model provided through TensorFlow Hub. More specifically, we use one of the BERT Experts variants, which was pre-trained on Wikipedia and BooksCorpus (`wiki_books`) and then fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset.

In [2]:
# Preprocessing layer to generate the tokenized sentences and input mask
bert_preprocess = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
# Encoder layer which generates word-level and sentence-level 768-dimensional text embeddings
bert = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2')

In [8]:
# Load the downsampled data
data_dir = "data"
data = pd.read_csv(join(data_dir, "downsampled_train_50000.csv"), names=['Rating', 'Title', 'Review'])
# Drop the CSV's own header row, which was read in as data because `names` was passed
data = data.iloc[1:, :].reset_index(drop=True)
data["Rating"] = data["Rating"].apply(int) 
data 

Unnamed: 0,Rating,Title,Review
0,1,UNINSPIRED,This album is a travesty to the songs of the 5...
1,1,Mastiff Aristocratic Guardian by DeeDee Andersson,"I found this book a complete waist of time, I ..."
2,1,Scratches. Turns off by itself. Not working.,The product I got had scratches on its surface...
3,1,Worst College Book of all time,Ok well it may not be the worst book that I ha...
4,1,just okay,It was ok. Could of gotten into the other char...
...,...,...,...
49995,5,"Great product, no problems",Good little memory stick. Currently using as m...
49996,5,Beautiful,How anyone can write such fun tropical songs a...
49997,5,Is that thing loaded???,"Sure, its one of The Great Man's best movies, ..."
49998,5,WOW!,I finally bought and watched this classic epic...


In [9]:
# Create train, val, and test sets
train_data = shuffle(data)[:10000]
X = train_data["Review"].to_numpy()
y = train_data["Rating"].to_numpy() - 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)
print(f"X_train: {X_train.shape} | X_val: {X_val.shape} | X_test: {X_test.shape} | \n" +
    f"y_train: {y_train.shape} | y_val: {y_val.shape} | y_test: {y_test.shape} | ")

X_train: (7200,) | X_val: (1800,) | X_test: (1000,) | 
y_train: (7200,) | y_val: (1800,) | y_test: (1000,) | 
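
The sizes printed above follow from applying the two splits in sequence: 10% of the 10,000 sampled reviews is held out for testing, and 20% of the remaining 9,000 is held out for validation. The sketch below reproduces that arithmetic in plain Python. The helper name `split_sizes` is hypothetical, and rounding the held-out size up is an assumption about how scikit-learn treats fractional sizes (it makes no difference here, since both products are whole numbers).

```python
import math

def split_sizes(n_samples, test_fraction, val_fraction):
    """Return (n_train, n_val, n_test) for a test split followed by a validation split."""
    # First split: hold out the test set from the full sample
    n_test = math.ceil(test_fraction * n_samples)
    n_remaining = n_samples - n_test
    # Second split: hold out the validation set from what is left
    n_val = math.ceil(val_fraction * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(10000, test_fraction=0.1, val_fraction=0.2)
print(n_train, n_val, n_test)  # 7200 1800 1000
```

In terms of the original 10,000 reviews, this is a 72% / 18% / 10% split between training, validation, and test data.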


In [14]:
# Generate train embeddings
def generate_bert_embeddings(data):
    """Generate the BERT embeddings for a given Series/list of sentences"""
    return bert(bert_preprocess(data))['pooled_output']

def generate_embeddings_list(data):
    """Generate embeddings for 100 equal-sized chunks of the data and return them as a list.
    Created to overcome performance issues. Assumes len(data) is divisible by 100."""
    factor = data.shape[0] // 100
    embeddings_list = []
    for i in tqdm(range(0, 100)):
        # Slice the `data` argument, not the global X, so embeddings stay aligned with their labels
        embeddings_list.append(generate_bert_embeddings(data[factor*i: factor*(i+1)]))
    return embeddings_list
    
el = generate_embeddings_list(X_train)
embeddings = tf.stack(el)
embeddings

100%|██████████| 100/100 [14:10<00:00,  8.50s/it]


<tf.Tensor: shape=(100, 72, 768), dtype=float32, numpy=...>

In [18]:
embeddings = tf.reshape(embeddings, (7200, 768))
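
Stacking 100 chunks of 72 embeddings gives a tensor of shape (100, 72, 768), and reshaping it to (7200, 768) restores one row per review in the original order. This only works when the number of reviews is divisible by 100. As a minimal sketch of a more general pattern, the code below batches by a fixed batch size and extends a flat list, so a final smaller batch is handled too. It uses plain Python lists, and `toy_embed` is a hypothetical stand-in for the BERT call, not the real model.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    items: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Embed `items` batch by batch and return one row per item, in input order."""
    rows: List[List[float]] = []
    for batch in iter_batches(items, batch_size):
        rows.extend(embed_fn(batch))
    return rows

def toy_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for BERT: a 2-dimensional 'embedding' per sentence."""
    return [[float(len(s)), float(s.count(" ") + 1)] for s in batch]

reviews = [f"review number {i}" for i in range(10)]
rows = embed_in_batches(reviews, toy_embed, batch_size=4)  # batches of 4, 4, and 2
print(len(rows), len(rows[0]))
```

With real tensors, the equivalent of `rows.extend(...)` would be concatenating the per-batch outputs along the first axis instead of stacking and reshaping.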

In [23]:
# Save train embeddings
import pickle
# Use context managers so each file is flushed and closed after writing
with open(join(data_dir, "downsampled_shuffled_train_embeddings.pkl"), "wb") as f:
    pickle.dump(embeddings, f)
with open(join(data_dir, "downsampled_shuffled_train_labels.pkl"), "wb") as f:
    pickle.dump(y_train, f)
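
Since embeddings and labels are written to separate files, it is worth checking that they still line up when loaded back. The following is a minimal round-trip sketch using only the standard library. The helper names `save_pickle` and `load_pickle`, the toy data, and the temporary file names are all hypothetical and stand in for the real embeddings and labels.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize `obj` to `path`, closing the file when done."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-ins for the real embeddings and 0-indexed rating labels
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_labels = [0, 4]

with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, "toy_embeddings.pkl")
    lab_path = os.path.join(tmp_dir, "toy_labels.pkl")
    save_pickle(toy_embeddings, emb_path)
    save_pickle(toy_labels, lab_path)
    loaded_embeddings = load_pickle(emb_path)
    loaded_labels = load_pickle(lab_path)

# One label per embedding row is the invariant the downstream classifier relies on
assert len(loaded_embeddings) == len(loaded_labels)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.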


In [24]:
# Generate and save validation and test embeddings 
print("Generating val data...")
val_embeddings = tf.reshape(tf.stack(generate_embeddings_list(X_val)), (1800, 768))
print("Generating test data...")
test_embeddings = tf.reshape(tf.stack(generate_embeddings_list(X_test)), (1000, 768))

# Use context managers so each file is flushed and closed after writing
with open(join(data_dir, "downsampled_shuffled_val_embeddings.pkl"), "wb") as f:
    pickle.dump(val_embeddings, f)
with open(join(data_dir, "downsampled_shuffled_val_labels.pkl"), "wb") as f:
    pickle.dump(y_val, f)
with open(join(data_dir, "downsampled_shuffled_test_embeddings.pkl"), "wb") as f:
    pickle.dump(test_embeddings, f)
with open(join(data_dir, "downsampled_shuffled_test_labels.pkl"), "wb") as f:
    pickle.dump(y_test, f)


Generating val data...
100%|██████████| 100/100 [03:20<00:00,  2.00s/it]
Generating test data...
100%|██████████| 100/100 [01:54<00:00,  1.15s/it]
