# BERT Embedding Generation

This notebook generates BERT<sub>BASE</sub> embeddings using several pooling strategies.

In [None]:
import os
import urllib

from google.colab import drive, files
from getpass import getpass

In [None]:
ROOT = '/content/drive'
GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/recommender/w266-final'
PROJECT_PATH = os.path.join(ROOT, GOOGLE_DRIVE_PATH)

drive.mount(ROOT)

%cd {PROJECT_PATH}

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/recommender/w266-final


In [None]:
import os
import sys
import re
import pandas as pd
import numpy as np
import itertools
import pickle
import random
import tensorflow as tf

from commons.store import PickleStore, NpyStore
from tqdm import tqdm
from IPython.core.display import HTML
from importlib import reload

%load_ext autoreload
%autoreload 2

## 1. Load Pre-filtered Dataset

Load the clean pre-processed dataset.

In [None]:
amazon = False

if amazon:
  input_pkl = '../dataset/25-65_tokens_grouped_Movies_and_TV_v2.pkl'
else:
  input_pkl = '../dataset/25-65_tokens_grouped_yelp.pkl'

pkl_store = PickleStore(input_pkl)

grouped_reviews_df = pkl_store.load(asPandasDF=True,
                                    columns=['reviewerID', 'asin', 'overall', 'userReviews', 'itemReviews'])
print(len(grouped_reviews_df))
display(HTML(grouped_reviews_df.head(1).to_html()))

Loading ../dataset/25-65_tokens_grouped_yelp.pkl ...
	... 272296 records
Done!!
272296


Unnamed: 0,reviewerID,asin,overall,userReviews,itemReviews
0,JHXQEayrDHOWGexs0dCviA,KXCXaF5qimmtKKqnPc_LQA,1.0,"[the dark chocolate gelato is so rich and creamy and seductiveyou must lick it once in your life but their prices recently went up so take your credit card, great pastries and chocolate but why does the service have to be soooo slow and disorganized sigh, great ice cream but are the staff handpicked for their incompetence and total lack of training the service is that bad it's almost embarrassing, great baked goods but please don't remind me 1530 minutes before you close that we are closing at 6 as it's not polite it's not etiquette it's not nice to pressure customers to hurry up thank you, it's really hokey and old but not in a good way in an ancient dirty way a good place to view a cross section of west vancouver ie grey hair it needs to be renovatedthe 'old world charm' is simply not there, the gelato is good not quite to the italian standard but hey you can't have everything thx decor is tragic and the coffee is only 'accrptable' not 'outstanding', always a staple vancouver pizzeria not pretentious but good honest nongourmet pizza and pasta nice atmosphere and good staff, very good food and a basic honest restaurant the beef ribs are the best option it's a bit of a hipster place with loud hard rick music on the loud side not atmosphere inducing service was quick and fairly efficient but not overly attentive, this place has great freshly made tempura and good quality sushi excellent down town location and efficient rather than 'friendly' staff good value too, you can drink franchise coffee and then you can drink arti's coffee the difference is massive visit him once you will never look back, this is a great place to find beautifully crafted chocolates and frenchstyled pastries the coffee is good quality too the only downside is the lack of seating and the rather hap hazard way of serving the customers, food is good and wellpriced interior decor is tragic and service a bit slow if you want a good no frills 
greek experience this is a reasonable choice, good snack and coffee but needs a damn good clean it's pretty clear that the staff are not happy, it's a good place but really no ambience and when you use canned bamboo shoots you know you're not going to get authentic cuisine they try hard herebut not hard enough, you can't go past their pizza slices and freshsqueezed cranberry juice gets very crowded during lunch hourbut for a good reason nice, the pizza slice es used to be great but now they have compromised on the quality of their ingredients and it's gone down quite a bit, very good baked goods especially the croissants however the service sucks big time staff have horrible attitude, the ice cream is of course very nice if a little bit expensive the service is not goodslow and miserable staff should not serving ice cream to a willing public be a happy occasion, their burgers fries and shakes are simply the best in townand very healthy the pictures below do all the talking, just go you won't be disappointedgreat pizza great service great wine and beer selection tiramisu to die for happy happy joy joy, i had lamb korma and coconut chicken curry with naan and roti breadsvery good quality food and a fir price service is a little slow 25 minute wait for food but otherwise a good indian choicenice proud owner too, it's a very quirky place and microscopicso expect a lineup the service and the deserts easily make up for the wait good prices plunger coffeegood times, this is a terrific cafe with great art shows good coffee and deserts and a lovely calm ambience tyler the owner is fabulous and really values community and creativity a must try, very cool and authentic whiskeycocktail bar it's off the beaten track but that's what makes it so great huge selection knowledgeable staff very good happy hour and decent charcuterie board definitely a must see when in portland area, it's a great location just fabuloushorrendous service if you want to be ignored and treated with 
derision then this is the place for yousuch a damn shame, the happy hour choices are quite good atmosphere is fairly pedestrian and the service is very mediocre inattentive and uninterested staff, not that good honestly go to robson street location for better waffles and waaay better service, i went there for the first time about a month ago clean spa great spa pedicure prices are good and the toes came out amazing]","[tonight the macaron du jour' was the lycheeabsolutely yummybut still the hands down favourite are the coffee andor chocolate caramel whichever i munch on first, friday's nite batch just soso the filling just oozed out of every bite just guessing wasn't chilled long enoughsigh worst ones everbut still sweet the lemon raspberry , overhyped in my opinon honestly nothing exceptional macaroons and chocolates are very standard i love their passionfruit mousse though a staple every time i visit, artistic and tasty desserts it's a pricey place and crowded with young kids so it can be noisy but the desserts are worth it the macaroons are quite big so go nuts and enjoy the americano was delicious, i love their macarons it's chewy and not too sweet my only complaint is that they always run out of of varieties, very happening place lots of people from all over the world we had to chose from so many desserts that looked amazing don't think you can go wrong, amazing location delicious thick hot chocolate and great treatswe always ho there for dessert and we are both impressed each time will revisit, great selection of desserts the passionfruit dome cake with mini macarons on around the bottom of the cake was delicious, the desserts here are amazing we even got a cake for a birthday and it was delicious staff are wonderful coffee could be better, kenneth n's recommendation of thierry's macarons persuaded me to returnsmooth crust chewy cookie light creamy filling reasonable sweetness that does not overpower the flavour yum i haven't had a disappointing flavour to 
date, fall in love with their tiramisu looks so adorable and the cream is perfect friday night is busy we sat on the patio to enjoy the late night sweet, love the desserts at this place my favorites are the lemon tart millefeuille and chocolate marquise busy place desserts are not cheap but worth it love these french desserts and keep wanting to come back staff are pretty nice and service is pretty fast, a friend treated us to a sleeve of fruitflavoured macaronsthey are so delectablebest way to eat it is so pop it whole in your mouththe berry lemon are so goodi could eat a whole sleeve by myself, their pastries are the best in vancouver we went there on a saturday night but it was crowded as usualunfortunately choices were limited chocolate trio was awesome also the tea had a strong taste just like indian teas, sophisticated food really high quality and class but so crowded and noisy and often lacks free tables, beautiful cafe very attentive and knowledgable staff however their desserts are too sweet for me i've tried their macarons and their cakes and they're just way too sweet for me but their displays are very attractive, the macarons there are good but i think it's a bit too sweet and hard personally like soirette more the desserts are quite expensive but it's at great location to have a bakery though which is probably why, desserts and pastries are beautiful and almost is tasty try to find a quiet time otherwise you're competing with the self obsessed rude people coffee here is extra good and nice pick me up early in the day, visited last night it's so busy on a friday night food was great service was fast i love the liquid chocolate, everything i have had from thierry is to die for if you have a sweet tooth you absolutely need to try thierry, there is no place in vancouver like this omg everything is to die for i could eat their tiramisu cake everyday hahaaa since i moved back from toronto i was looking for something like this love u guys keep up the good 
work, since my last visit the quality bounced back up again it is simply one of the few good options you have in townthey now have two japanese baristas that makes coffee precisely and accurately it is fun to see them workpastry is good and cakes are fantastic, yes if you love french desserts this is the place to go i celebrated my birthday here and had a few of the delicacies and they were so good i'd recommend the layer cake and some of the classics it's a cute little cafe as well, the service was at best minimal coz they are under staffed also the price point is not justifiable for the quality of drinks and food better luck next time if they don't change will not come back, i had drinking chocolate and a pear danish which were superb and a croissant which was pretty good 10 for all three was neither great nor terrible but i felt it was reasonable given the high rent district, tried their coveted macaroons and an opera cake still not convinced that the place is worth returning to other than the fact that their bags are really pretty and would be really presentable as gifts, thierry is a treasure it's a wonderful place to visit with girlfriends and it's open until midnight the cakes there are amazing, i love this cafe it's one of my favorites in vancouver anything you order here will be amazing though i recommend the hot chocolate with or without liqueurs, really really good macaroons everything else is decent too just a nice upscale hangout place in downtown definitely beats timmy's across the street, great upscale cafe with boozie options available it's always packed but their outside seating is comfortable with the heaters going their liquid chocolate is so yummy as well as their macaroons their cakes are really beautiful and they even have a gluten free cake, amazing chocolates and desserts but terrible cappuccino unfortunately they don't know how to steam milk the way so it's nice and foamy and creamy it's just boiling hot milk mixed with espresso i wish they 
learn how to make it right, tried the forest cake and salted caramel latteboth are not my thing i didn't finish any of it, this is my late night downtown dessert place the service is always friendly and the dessert displays never fail to nudge my appetite even after dinner i love the lively atmosphere and how i can chat and catch up with a girlfriend without feeling pressured to leave even once it gets busy, great patio season just walked over for some macaroons and tea i love this place for a late night treat in the summer it's only april but it's hot enough to sit out yay, best place ever if you have a sweet tooth i love pretty much everything from the deserts to the drinks great service and also for it being packed in the evenings the staff work fast and efficient, good fresh ingredients and pastriessweets found it to be a bit pricey so i won't be a regular customer but is popular in vancouver, awesome dessert and macarons it's always busy during weekends which makes it hard to find a table to enjoy your dessert, great after dinner spot for dessert and coffee this place is trendy and expensive but given the area i understandthe desserts cakes are just right not too sweet or creamylooks like the serve alcohol too, cute little pastrychocolate shop in downtown vancouver on alberni street they have the best tiramisu and palmiers but i find that thomas haas has a better selection and quality it is great that they're open late and also serve alcohol, run by the same folks behind blue water cafe grabbed a decent pain au chocolat here before stanley park, tiramisu is my to go desert love it and also pair with london fog tea im in heaven lalalala, hands down favourite place to have desert and tea it is a known fine dining desert place so be ready to wait for seat deserts and tea are phenomenal definitely recommend to stop by if you are in the neighbourhood, 6 of us dropped in for a little treatit was my1st time and it was a pleasant surprise excellence is all over this 
place in every way you won't be disappointed at thierry, i love it when places stay consistent after a year their desserts are still just as good sure the price tag is sometimes a little high but in my opinion they are worth it, best macarons ever and the lattes are great nothing else i've tried has really stood out to me as being that amazing stick to the macarons they're outstanding my favourite flavours cassis lemoncherry lychee, sandwiches should be warm coffee is too hot always bad baristas i burned my tongue many times other than that they are always open holidays weekends late nights lots of seats napoleon is delicious, not even the aggressive couple behind me in line who tried to edge me out with their stroller loaded with tiffany's bags could stop me from coming here again just tried their macarons but holy crap were they delicious plus their selection of wobbly coffees is amazing, i got the macarons here and they were pretty good but it would've been nice if there was more selection, i love the desserts and coffee at this place is really good place to sit and relaxlove the passion cake and london fog here, i don't usually eat anything related to chocolate but cannot say no to theirry's chocolate marquise they also have the best tiramisu in town, winner winner chocolate for dinner macaroons and the triple layer chocolate cake are 10s just try and you will know that chef thierry knows desserts i am happy that i got to try the treats oh and the hot chocolate was just right, their hot chocolate is good their macaroons are too sweet not a fan overpriced food and drinks but once in a while is okay , expensive but tasty chocolate treats decent coffee in an area with few options outdoor seating is nice if you can get a seat often very busy pretty wood panel decori prefer 49th parallel one block away it has much better coffee and some of the best donuts in the city, knocking it down one star because of the lack of selection for macaron flavors i'm assuming that 
flavors tend to run out later in the day the only thing i wanted to get was the salted caramel and it would've been extremely disappointing if they didn't have that, a great place to enjoy a conversation over very good coffee and pastries interior is well done and comfortable they have a mixture of small tables and adirondack chairs outsidethe cinnamon rolls here are doughy and heavy on the icing]"


In [None]:
#grouped_reviews = grouped_reviews_df[['userReviews', 'itemReviews', 'overall']].to_numpy()
grouped_reviews = grouped_reviews_df.to_numpy()
grouped_reviews[0]

array(['JHXQEayrDHOWGexs0dCviA', 'KXCXaF5qimmtKKqnPc_LQA', 1.0,
       array(['the dark chocolate gelato is so rich and creamy and seductiveyou must lick it once in your life but their prices recently went up so take your credit card',
       'great pastries and chocolate but why does the service have to be soooo slow and disorganized sigh',
       "great ice cream but are the staff handpicked for their incompetence and total lack of training the service is that bad it's almost embarrassing",
       "great baked goods but please don't remind me 1530 minutes before you close that we are closing at 6 as it's not polite it's not etiquette it's not nice to pressure customers to hurry up thank you",
       "it's really hokey and old but not in a good way in an ancient dirty way a good place to view a cross section of west vancouver ie grey hair it needs to be renovatedthe 'old world charm' is simply not there",
       "the gelato is good not quite to the italian standard but hey you can't h

In [None]:
!pip install transformers


In [None]:
import tensorflow as tf

from transformers import BertTokenizer
from transformers import TFBertModel, BertConfig

In [None]:
# Detect hardware
try:
  tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
except ValueError:
  tpu_resolver = None
  gpus = tf.config.experimental.list_logical_devices("GPU")

# Select appropriate distribution strategy
if tpu_resolver:
  tf.config.experimental_connect_to_cluster(tpu_resolver)
  tf.tpu.experimental.initialize_tpu_system(tpu_resolver)
  strategy = tf.distribute.experimental.TPUStrategy(tpu_resolver)
  print('Running on TPU ', tpu_resolver.cluster_spec().as_dict()['worker'])
elif len(gpus) > 1:
  strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])
  print('Running on multiple GPUs ', [gpu.name for gpu in gpus])
elif len(gpus) == 1:
  strategy = tf.distribute.get_strategy() # default strategy that works on CPU and single GPU
  print('Running on single GPU ', gpus[0].name)
else:
  strategy = tf.distribute.get_strategy() # default strategy that works on CPU and single GPU
  print('Running on CPU')
  
print("Number of accelerators: ", strategy.num_replicas_in_sync)


Running on single GPU  /device:GPU:0
Number of accelerators:  1


Use the Hugging Face <img src='https://huggingface.co/front/assets/huggingface_logo.svg' width='20px'> Transformers library to load the BERT tokenizer and model.

In [None]:
bert_model_name = 'bert-base-uncased'

MAX_LEN = 128

# Load the checkpoint's own configuration and request all hidden states.
config = BertConfig.from_pretrained(bert_model_name, output_hidden_states=True)

with strategy.scope():
  tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True)
  user_bert = TFBertModel.from_pretrained(bert_model_name, config=config)
  item_bert = TFBertModel.from_pretrained(bert_model_name, config=config)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

## 2. Pooling Strategies
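
Each strategy below selects or combines entries of the `hidden_states` tuple returned by the model, then averages the result over all reviews and token positions. The following NumPy sketch mirrors that logic on random dummy data rather than real BERT output. The names `pool` and `fake_hidden_states` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

# Dummy stand-in for BERT-base `hidden_states`: 13 arrays (embedding output
# plus 12 encoder layers), each shaped (n_reviews, seq_len, hidden_size).
rng = np.random.default_rng(0)
n_reviews, seq_len, hidden_size = 3, 8, 768
fake_hidden_states = tuple(
    rng.normal(size=(n_reviews, seq_len, hidden_size)) for _ in range(13)
)

def pool(hidden_states, strategy):
    """Combine layers according to `strategy`, then average over reviews and tokens."""
    if strategy == 'last':
        combined = hidden_states[-1]
    elif strategy == 'sum_last_four':
        combined = np.sum(hidden_states[-4:], axis=0)
    elif strategy == 'sum_last_twelve':
        combined = np.sum(hidden_states[-12:], axis=0)
    elif strategy == 'second_to_last':
        combined = hidden_states[-2]
    elif strategy == 'first':
        combined = hidden_states[0]
    elif strategy == 'concat_last_four':
        combined = np.concatenate(hidden_states[-4:], axis=2)
    else:
        raise ValueError(f'unknown strategy: {strategy}')
    # Mean over the review axis (0) and the token axis (1).
    return combined.mean(axis=(0, 1))

shapes = {name: pool(fake_hidden_states, name).shape
          for name in ['last', 'sum_last_four', 'sum_last_twelve',
                       'second_to_last', 'first', 'concat_last_four']}
print(shapes)
```

The notebook's functions return tensors of shape `(1, hidden_size)` because of the extra stacked axis; the sketch drops that axis for brevity. Concatenating four layers is the only strategy that changes the vector width, from 768 to 3072.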




### 2.1. Last Hidden State

In [None]:
def last_hidden_state_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
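    # Index 0 of the model output is the last hidden state,
    # shaped (n_reviews, max_len, hidden_size).
    user_embedding = user_model(user_tokens)[0]
    item_embedding = item_model(item_tokens)[0]

    # Add a leading batch axis: (1, n_reviews, max_len, hidden_size).
    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    # Average over all reviews and token positions: (1, hidden_size).
    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)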

    #_embeddings[i] = ((dict(user_embedding=user_embedding, 
    #                      item_embedding=item_embedding)), label)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings
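
The `tf.stack([...])` plus `GlobalAveragePooling2D()` step in the function above amounts to a plain mean over the review axis and the token axis. Because the tokenizer pads every review to `max_length`, that mean also includes the vectors at padding positions. The NumPy sketch below reproduces the plain mean and, as a possible alternative that the notebook does not use, a padding-aware mean. The `attention_mask` values and the small dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews, seq_len, hidden_size = 4, 6, 5
token_vectors = rng.normal(size=(n_reviews, seq_len, hidden_size))

# Equivalent of tf.stack([x]) followed by GlobalAveragePooling2D():
# add a leading batch axis, then average over the two middle axes.
stacked = token_vectors[np.newaxis, ...]       # (1, n_reviews, seq_len, hidden)
pooled = stacked.mean(axis=(1, 2))             # (1, hidden)

# Hypothetical attention mask: 1 for real tokens, 0 for padding.
attention_mask = np.array([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 0],
                           [1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1]])

# Padding-aware alternative: only real tokens contribute to the mean.
weights = attention_mask[..., np.newaxis]      # (n_reviews, seq_len, 1)
masked_pooled = (token_vectors * weights).sum(axis=(0, 1)) / weights.sum()
masked_pooled = masked_pooled[np.newaxis, :]   # (1, hidden)

print(pooled.shape, masked_pooled.shape)
```

Whether the masked variant improves downstream results is not something this notebook measures; the sketch only shows how the two averages differ.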

In [None]:
%%time
start = 0
end = 150000
embeddings = last_hidden_state_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 150000 of 150000 record(s) -- 100.00%
	Done!
CPU times: user 14h 38min 48s, sys: 21min 50s, total: 15h 39s
Wall time: 17h 12min 4s
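
Since one pass over 150,000 records takes many hours, the `start`/`end` slice could be split into smaller ranges so that each range is embedded and saved separately. The helper below is a hypothetical sketch of that idea; `chunk_ranges` and the chunk size of 50,000 are assumptions, not something the notebook defines.

```python
def chunk_ranges(start, end, chunk_size):
    """Yield (chunk_start, chunk_end) pairs that cover the half-open range [start, end)."""
    for chunk_start in range(start, end, chunk_size):
        yield chunk_start, min(chunk_start + chunk_size, end)

ranges = list(chunk_ranges(0, 150000, 50000))
print(ranges)  # [(0, 50000), (50000, 100000), (100000, 150000)]
```

Each pair could then be used as `grouped_reviews[chunk_start:chunk_end]` and written to its own file, since the file names already encode the range.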


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)

embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)



Saving ../dataset/embedding/grouped_embedding_0-150000_yelp.npy ...
	... 150000 records
Done!!
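
The save cells in this notebook all build the output file name the same way, differing only in a strategy suffix. A small helper could centralise that logic. The sketch below is an assumption about how such a helper might look; `embedding_path` is not defined anywhere in the project.

```python
import os

def embedding_path(embedding_dir, strategy, start, end, amazon):
    """Build the .npy file name for one pooling strategy and record range."""
    dataset = 'Movies_and_TV' if amazon else 'yelp'
    # An empty strategy reproduces the un-suffixed name used in section 2.1.
    prefix = f'grouped_embedding_{strategy}_' if strategy else 'grouped_embedding_'
    return os.path.join(embedding_dir, f'{prefix}{start}-{end}_{dataset}.npy')

print(embedding_path('../dataset/embedding/', '', 0, 150000, amazon=False))
print(embedding_path('../dataset/embedding/', 'sumlastfour', 0, 150000, amazon=False))
```

The first call yields the same path that appears in the save output above.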



### 2.2. Sum Last Four Hidden States

In [None]:
def sum_last_four_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
    user_embedding = user_model(user_tokens).hidden_states
    item_embedding = item_model(item_tokens).hidden_states

    # Sum the last four hidden states
    user_embedding = tf.reduce_sum(user_embedding[-4:], axis=0)
    item_embedding = tf.reduce_sum(item_embedding[-4:], axis=0)

    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings

In [None]:
%%time
start = 0
end = 150000
embeddings = sum_last_four_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 150000 of 150000 record(s) -- 100.00%
	Done!
CPU times: user 10h 45s, sys: 20min 29s, total: 10h 21min 14s
Wall time: 13h 44min 12s


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_sumlastfour_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_sumlastfour_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)
  
embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)

Saving ../dataset/embedding/grouped_embedding_sumlastfour_0-150000_yelp.npy ...
	... 150000 records
Done!!


### 2.3. Sum Last Twelve Hidden States

In [None]:
def sum_last_twelve_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
    user_embedding = user_model(user_tokens).hidden_states
    item_embedding = item_model(item_tokens).hidden_states

    # Sum the last twelve hidden states (all twelve encoder layers)
    user_embedding = tf.reduce_sum(user_embedding[-12:], axis=0)
    item_embedding = tf.reduce_sum(item_embedding[-12:], axis=0)

    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings

In [None]:
%%time
start = 0
end = 150000
embeddings = sum_last_twelve_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 150000 of 150000 record(s) -- 100.00%
	Done!
CPU times: user 10h 14min 57s, sys: 21min 37s, total: 10h 36min 35s
Wall time: 13h 54min 38s


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_sumlasttwelve_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_sumlasttwelve_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)
  
embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)

Saving ../dataset/embedding/grouped_embedding_sumlasttwelve_0-150000_yelp.npy ...
	... 150000 records
Done!!


### 2.4. Second-To-Last Hidden State

In [None]:
def second_to_last_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
    user_embedding = user_model(user_tokens).hidden_states
    item_embedding = item_model(item_tokens).hidden_states
    
    # Second-to-last hidden state
    user_embedding = user_embedding[-2]
    item_embedding = item_embedding[-2]
    
    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings

In [None]:
%%time
start = 0
end = 150000
embeddings = second_to_last_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 150000 of 150000 record(s) -- 100.00%
	Done!
CPU times: user 10h 20min 45s, sys: 20min 35s, total: 10h 41min 20s
Wall time: 13h 56min 54s


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_secondtolast_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_secondtolast_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)
  
embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)

Saving ../dataset/embedding/grouped_embedding_secondtolast_0-150000_yelp.npy ...
	... 150000 records
Done!!


### 2.5. First Layer Hidden State

In [None]:
def first_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
    user_embedding = user_model(user_tokens).hidden_states
    item_embedding = item_model(item_tokens).hidden_states
    
    # hidden_states[0] is the output of the embedding layer,
    # i.e. the input to the first encoder layer
    user_embedding = user_embedding[0]
    item_embedding = item_embedding[0]
    
    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings

In [None]:
%%time
start = 0
end = 150000
embeddings = first_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 150000 of 150000 record(s) -- 100.00%
	Done!
CPU times: user 10h 15min 7s, sys: 21min 18s, total: 10h 36min 25s
Wall time: 13h 52min 36s


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_first_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_first_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)
  
embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)

### 2.6. Concat Last Four Hidden States

In [None]:
def concat_last_four_embedding(samples, tokenizer, user_model, item_model, max_len=128):

  def tokenize(reviews):
    return tokenizer(list(reviews), padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')

  total = len(samples)
  _embeddings = np.empty(len(samples), dtype=object)

  for i, reviews in enumerate(samples):

    user_tokens = tokenize(reviews[3])
    item_tokens = tokenize(reviews[4])
    
    reviewerID = reviews[0]
    asin = reviews[1]
    label = reviews[2]
    
    user_embedding = user_model(user_tokens).hidden_states
    item_embedding = item_model(item_tokens).hidden_states

    # Concatenate the last four hidden states along the feature axis
    user_embedding = tf.concat(user_embedding[-4:], axis=2)
    item_embedding = tf.concat(item_embedding[-4:], axis=2)

    user_embedding = tf.stack([user_embedding])
    item_embedding = tf.stack([item_embedding])

    user_embedding = tf.keras.layers.GlobalAveragePooling2D()(user_embedding)
    item_embedding = tf.keras.layers.GlobalAveragePooling2D()(item_embedding)
    
    _embeddings[i] = dict(reviewerID=reviewerID, asin=asin, 
                            user_embedding=user_embedding, 
                            item_embedding=item_embedding,
                            label=label)

    
    print(f'\rEmbedding... {i+1} of {total} record(s) -- {(i+1)/total*100:.2f}%', end='')
  
  print('\n\tDone!')
  return _embeddings

In [None]:
%%time
start = 0
end = 150000
embeddings = concat_last_four_embedding(grouped_reviews[start:end], tokenizer, user_bert, item_bert)

Embedding... 50000 of 50000 record(s) -- 100.00%
	Done!
CPU times: user 4h 48min 47s, sys: 7min 16s, total: 4h 56min 4s
Wall time: 5h 32min 36s


In [None]:
embedding_dir = '../dataset/embedding/'
if not os.path.exists(embedding_dir):
    os.makedirs(embedding_dir)

if amazon:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_concatlastfour_',str(start),'-',str(end),'_Movies_and_TV.npy'])
else:
  embedding_npy = ''.join([embedding_dir, 'grouped_embedding_concatlastfour_',str(start),'-',str(end),'_yelp.npy'])

if os.path.exists(embedding_npy):
  os.remove(embedding_npy)
  
embedding_store = NpyStore(embedding_npy)
embedding_store.write(embeddings)

Saving ../dataset/25-65_tokens/embedding/grouped_embedding_concatlastfour_0-50000_Movies_and_TV.npy ...
	... 50000 records
Done!!
