#Embedding methods - summary

1. Chunk & aggregate
  - Split text into smaller chunks
    - Sentences? Paras? Topics?
    - No. of tokens? (Fixed size chunking)
  - Embed each indiv chunk.
  - Aggregate across all chunk embeddings
    - Average
    - Maxpool
    - Attention mechanisms
2. Hierarchical embeddings
  - Embed small units of text eg sentences in a whole paragraph
  - Combine embeddings to reflect position and relationships (encode relationships)
  - Aggregate at lower levels first, then aggregate at higher levels (eg aggregate sentence embeddings for each review, then aggregate across all reviews)
  - Potentially capture overall summary/sentiment of the review at highest level
  - [Hierarchical Embedding for Amazon personalized product search](https://github.com/QingyaoAi/Hierarchical-Embedding-Model-for-Personalized-Product-Search)
  - Can also consider for product metadata to encode product features?
3. Extend context windows
  - Positional encoding interpolation/extrapolation (sinusoidal)
  - Attention approximation
4. Existing embedding models for consideration?
Comparison: BLaIR (roberta-base is 768)

  **3.1 Standalone models**

  Types to consider:

 A) larger embedding dimensions, more robust but increases training complexity
 - OpenAI ada-002 (but costs [$0.10 per 1M tokens](https://platform.openai.com/docs/pricing) ): embedding dim 1024

 B) smaller dim, less complex
 - E5-small-v2 (embedding dim 384, multilingual, but also truncates at 512 tokens)

 C) Embedding model + some other layers eg [BERT-BiGRU](https://link.springer.com/article/10.1007/s44196-025-00747-1#Tab1)
 - BiGRU: bidirectional gated recurrent unit.
 - Generate word embeddings from BERT first, then extract key features using BiGRU
 - research paper later feeds output features into classification GCN

 Other more custom options:

 D) other simpler/traditional methods eg glove, word2vec etc

 E) autoencoders?

  **3.2 Embedding model integrated w GNN**
  - BERTGCN: combines BERT and GNN arch

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
import os
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer

folder_path = '/content/drive/My Drive/DL 28 project_copy/Amazon review GNN + BLaIR'
os.chdir(folder_path)
print(f"Current working directory: {os.getcwd()}")

Current working directory: /content/drive/My Drive/DL 28 project_copy/Amazon review GNN + BLaIR


In [3]:
product_df = pd.read_csv('./Pre-processed data/All_Beauty cleaned data/product_metadata.csv')
product_df.head()

Unnamed: 0,parent_asin,meta,reviews
0,0124784577,WOW Organics Apple Cider Vinegar Shampoo - 300 mL,Product delivers Makes my hair look healthy ||...
1,0692508988,The Listening Cards The Listening Cards are an...,Delightful and Profound This is a wonderful to...
2,069267599X,Inspirational Card Deck Nicole Piar created th...,Sigh I soo much wanted to Really love this dec...
3,0764490117,Kingdom Rock Starter Kit: Where Kids Stand Str...,Amazing VBS kit! We just completed our week of...
4,0816091846,"Sissy Spacek Collection - Raggedy Man, The Riv...","Wish she would'a done comedy, too! Spacek can ..."


In [4]:
# note: other languages present
product_df.iloc[0].reviews

"Product delivers Makes my hair look healthy || Perfectly and all the above! I exercise a lot being older you don't  smell the best put this shampoo make a big difference! || Que cumplió con mis expectativas de higiene para el pelo La manera de entrega poca seguridad de que lleve bien para su uso"

# Adjust constants here

In [5]:
MODEL_NAME = "roberta-base"
TOKEN_LENGTH_PER_CHUNK = 512

# Utils

In [7]:
import torch


def add_sos_and_bos(self, sentence, mask):
    input_ids = torch.cat([torch.Tensor([101]).to(self.device), sentence, torch.Tensor([102]).to(self.device)])
    mask = torch.cat([torch.Tensor([1]).to(self.device), mask, torch.Tensor([1]).to(self.device)])
    return input_ids, mask


def aggregate_embeddings(self, input_id_chunks, attn_mask_chunks, method='mean'):
    with torch.no_grad():
      output = self.model(input_id_chunks, attn_mask_chunks)
      if method == 'mean':
          return output.last_hidden_state.mean(dim=0)
      if method == 'maxpool':
          return output.last_hidden_state.max(dim=0)

# Fixed size chunking

In [80]:
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig


class FixedSizeChunker():
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.agg_method = 'mean'
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.model = AutoModel.from_pretrained(MODEL_NAME).to(self.device)
        self.model.eval()

    def fixed_size_chunking(self, review):
        tokenized = self.tokenizer(review, padding=False, truncation=False, return_tensors="pt")
        input_ids_full = tokenized['input_ids'].squeeze(0).to(self.device)
        attn_mask_full = tokenized['attention_mask'].squeeze(0).to(self.device)

        input_id_chunks = list(input_ids_full.split(TOKEN_LENGTH_PER_CHUNK - 2))
        mask_chunks = list(attn_mask_full.split(TOKEN_LENGTH_PER_CHUNK - 2))

        # for chunk in input_id_chunks:
        # print(len(chunk))

        # reference: https://medium.com/data-science/how-to-apply-transformers-to-any-length-of-text-a5601410af7f
        for i in range(len(input_id_chunks)):
            input_id_chunks[i], mask_chunks[i] = add_sos_and_bos(self, input_id_chunks[i], mask_chunks[i])
            req_pad_len = TOKEN_LENGTH_PER_CHUNK - input_id_chunks[i].shape[0]

            if req_pad_len > 0:
                input_id_chunks[i] = torch.nn.functional.pad(input_id_chunks[i], (0, req_pad_len),value=self.tokenizer.pad_token_id)
                mask_chunks[i] = torch.nn.functional.pad(mask_chunks[i], (0, req_pad_len), value=0)

        return torch.stack(input_id_chunks).long(), torch.stack(mask_chunks)

    def chunk_and_embed(self, review, method='mean'):
        input_chunks, mask = self.fixed_size_chunking(review)
        print(f'number of chunks for fixed sized chunking: {input_chunks.shape[0]}')
        input_chunks = input_chunks.to(self.device)
        mask = mask.to(self.device)
        embeddings = aggregate_embeddings(self, input_chunks, mask, self.agg_method)
        return embeddings

In [81]:
chunker = FixedSizeChunker()

sample_product_df = product_df.iloc[0:5].copy()
sample_product_df.loc[:, 'embedding_fixed_chunk'] = sample_product_df['reviews'].apply(lambda review: chunker.chunk_and_embed(review))

print(f'shape of embedding for first entry: {sample_product_df.embedding_fixed_chunk.iloc[0].shape}') # this will be token_length per chunk x model embedding dim
sample_product_df.head()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (2158 > 512). Running this sequence through the model will result in indexing errors


number of chunks for fixed sized chunking: 1
number of chunks for fixed sized chunking: 1
number of chunks for fixed sized chunking: 5
number of chunks for fixed sized chunking: 1
number of chunks for fixed sized chunking: 2
shape of embedding for first entry: torch.Size([512, 768])


Unnamed: 0,parent_asin,meta,reviews,embedding_fixed_chunk
0,0124784577,WOW Organics Apple Cider Vinegar Shampoo - 300 mL,Product delivers Makes my hair look healthy ||...,"[[tensor(-0.0392, device='cuda:0'), tensor(0.0..."
1,0692508988,The Listening Cards The Listening Cards are an...,Delightful and Profound This is a wonderful to...,"[[tensor(-0.0545, device='cuda:0'), tensor(0.1..."
2,069267599X,Inspirational Card Deck Nicole Piar created th...,Sigh I soo much wanted to Really love this dec...,"[[tensor(-0.0412, device='cuda:0'), tensor(0.1..."
3,0764490117,Kingdom Rock Starter Kit: Where Kids Stand Str...,Amazing VBS kit! We just completed our week of...,"[[tensor(-0.0457, device='cuda:0'), tensor(0.0..."
4,0816091846,"Sissy Spacek Collection - Raggedy Man, The Riv...","Wish she would'a done comedy, too! Spacek can ...","[[tensor(-0.0525, device='cuda:0'), tensor(0.0..."


# Sentence chunking (nltk sentence tokenizer)

In [77]:
from nltk import PunktSentenceTokenizer
import pandas as pd
import torch
import itertools
from transformers import AutoModel, AutoTokenizer, AutoConfig


class SentenceChunker():
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.agg_method = 'mean'
        self.sentence_tokenizer = PunktSentenceTokenizer()
        self.model_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.model = AutoModel.from_pretrained(MODEL_NAME).to(self.device)
        self.model.eval()

    def sentence_chunking(self, combined_review):
      sentences = [self.sentence_tokenizer.tokenize(review.strip()) for review in combined_review.split("||")]
      sentences = list(itertools.chain(*sentences))
      print(f'number of sentences: {len(sentences)}')
      tokenized = self.model_tokenizer(sentences, padding=True, truncation=False, return_tensors="pt")

      input_ids_full = tokenized['input_ids'].squeeze(0).to(self.device)
      attn_mask_full = tokenized['attention_mask'].squeeze(0).to(self.device)

      input_id_sentences = list(input_ids_full)
      mask_sentences = list(attn_mask_full)

      for i in range(len(input_id_sentences)):
        input_id_sentences[i], mask_sentences[i] = add_sos_and_bos(self, input_id_sentences[i], mask_sentences[i])

      print(f'number of tokens in each sentence: {len(input_id_sentences[0])}')

      return torch.stack(input_id_sentences).long(), torch.stack(mask_sentences)

    def chunk_and_embed(self, review, method='mean'):
        input_chunks, mask = self.sentence_chunking(review)
        print(f'number of chunks for sentence chunking: {input_chunks.shape[0]}')
        input_chunks = input_chunks.to(self.device)
        mask = mask.to(self.device)
        embeddings = aggregate_embeddings(self, input_chunks, mask, self.agg_method)
        return embeddings

In [78]:
torch.cuda.empty_cache()
chunker = SentenceChunker()
sample_product_df = product_df.iloc[0:5].copy()
sample_product_df.loc[:, 'sentence_chunk'] = sample_product_df['reviews'].apply(lambda review: chunker.chunk_and_embed(review))

print(f'shape of embedding for first entry: {sample_product_df.sentence_chunk.iloc[0].shape}')   # unique for each row - will be no. tokens x model embedding dim
sample_product_df.head()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


number of sentences: 4
number of tokens in each sentence: 45
number of chunks for sentence chunking: 4
number of sentences: 5
number of tokens in each sentence: 29
number of chunks for sentence chunking: 5
number of sentences: 137
number of tokens in each sentence: 103
number of chunks for sentence chunking: 137
number of sentences: 13
number of tokens in each sentence: 29
number of chunks for sentence chunking: 13
number of sentences: 40
number of tokens in each sentence: 52
number of chunks for sentence chunking: 40
shape of embedding for first entry: torch.Size([45, 768])


Unnamed: 0,parent_asin,meta,reviews,sentence_chunk
0,0124784577,WOW Organics Apple Cider Vinegar Shampoo - 300 mL,Product delivers Makes my hair look healthy ||...,"[[tensor(-0.0548, device='cuda:0'), tensor(0.0..."
1,0692508988,The Listening Cards The Listening Cards are an...,Delightful and Profound This is a wonderful to...,"[[tensor(-0.0873, device='cuda:0'), tensor(0.1..."
2,069267599X,Inspirational Card Deck Nicole Piar created th...,Sigh I soo much wanted to Really love this dec...,"[[tensor(-0.0874, device='cuda:0'), tensor(0.1..."
3,0764490117,Kingdom Rock Starter Kit: Where Kids Stand Str...,Amazing VBS kit! We just completed our week of...,"[[tensor(-0.0869, device='cuda:0'), tensor(0.0..."
4,0816091846,"Sissy Spacek Collection - Raggedy Man, The Riv...","Wish she would'a done comedy, too! Spacek can ...","[[tensor(-0.0931, device='cuda:0'), tensor(0.0..."


# Hierarchical chunking

chunk sentences & aggregate

then aggregate across all reviews

In [79]:
from nltk import PunktSentenceTokenizer
import pandas as pd
import torch
import itertools
from transformers import AutoModel, AutoTokenizer
from torch.nn.utils.rnn import pad_sequence


class HierarchicalChunker():
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.agg_method = 'mean'
        self.sentence_tokenizer = PunktSentenceTokenizer()
        self.model_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.model = AutoModel.from_pretrained(MODEL_NAME).to(self.device)
        self.model.eval()

    def hierarchical_chunking(self, combined_review):
      print(combined_review)
      sentences_in_subreviews = [self.sentence_tokenizer.tokenize(sentence.strip()) for sentence in combined_review.split("||")]
      subreviews_embedded = list(map(self.embed_each_review, sentences_in_subreviews))

      subreview_token_count = [len(subreview) for subreview in subreviews_embedded]
      num_subreviews = len(subreviews_embedded)
      print(f'number of subreviews: {num_subreviews}')
      for i in range(num_subreviews):
        print(f'no. tokens in no. {i+1} subreview: {subreview_token_count[i]}')
      print(f'all subreviews will be padded to {max(subreview_token_count)} tokens')
      padded_all_reviews = pad_sequence(subreviews_embedded, batch_first=True)  # get shape: (num subreviews, highest token len of subreviews, model embedding dim)

      return padded_all_reviews.mean(dim=0)


    def embed_each_review(self, subreview):
      tokenized_subreview = self.model_tokenizer(subreview, padding=True, truncation=False, return_tensors="pt")

      input_ids_full = tokenized_subreview['input_ids'].to(self.device)
      attn_mask_full = tokenized_subreview['attention_mask'].to(self.device)

      input_id_sentences = list(input_ids_full)
      mask_sentences = list(attn_mask_full)

      for i in range(len(input_id_sentences)):
        input_id_sentences[i], mask_sentences[i] = add_sos_and_bos(self, input_id_sentences[i], mask_sentences[i])

      input_ids = torch.stack(input_id_sentences).long()
      mask = torch.stack(mask_sentences)
      subreview_embeddings = aggregate_embeddings(self, input_ids, mask, self.agg_method)

      return subreview_embeddings

In [82]:
torch.cuda.empty_cache()
chunker = HierarchicalChunker()
sample_product_df = product_df.iloc[0:5].copy()
sample_product_df.loc[:, 'hierar_chunk'] = sample_product_df['reviews'].apply(lambda review: chunker.hierarchical_chunking(review))

print(f'shape of embedding for first entry: {sample_product_df.hierar_chunk.iloc[0].shape}')   # unique for each row - will be no. tokens x model embedding dim
sample_product_df.head()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Product delivers Makes my hair look healthy || Perfectly and all the above! I exercise a lot being older you don't  smell the best put this shampoo make a big difference! || Que cumplió con mis expectativas de higiene para el pelo La manera de entrega poca seguridad de que lleve bien para su uso
number of subreviews: 3
no. tokens in no. 1 subreview: 11
no. tokens in no. 2 subreview: 25
no. tokens in no. 3 subreview: 45
all subreviews will be padded to 45 tokens
Delightful and Profound This is a wonderful tool. Visually and tactilely pleasing. Profoundly fun to use. And full of wisdom dished out in easily digestible portions. Working with this deck has helped me increase my capacity to listen to others more deeply and to enjoy that process more and more.
number of subreviews: 1
no. tokens in no. 1 subreview: 29
all subreviews will be padded to 29 tokens
Sigh I soo much wanted to Really love this deck! I Really love the art work but as oracle cards it lacks. Since each card has the meani

Unnamed: 0,parent_asin,meta,reviews,hierar_chunk
0,0124784577,WOW Organics Apple Cider Vinegar Shampoo - 300 mL,Product delivers Makes my hair look healthy ||...,"[[tensor(-0.0530, device='cuda:0'), tensor(0.0..."
1,0692508988,The Listening Cards The Listening Cards are an...,Delightful and Profound This is a wonderful to...,"[[tensor(-0.0873, device='cuda:0'), tensor(0.1..."
2,069267599X,Inspirational Card Deck Nicole Piar created th...,Sigh I soo much wanted to Really love this dec...,"[[tensor(-0.0823, device='cuda:0'), tensor(0.1..."
3,0764490117,Kingdom Rock Starter Kit: Where Kids Stand Str...,Amazing VBS kit! We just completed our week of...,"[[tensor(-0.0869, device='cuda:0'), tensor(0.0..."
4,0816091846,"Sissy Spacek Collection - Raggedy Man, The Riv...","Wish she would'a done comedy, too! Spacek can ...","[[tensor(-0.0861, device='cuda:0'), tensor(0.0..."
