# Embedding methods - summary

1. Chunk & aggregate
  - Split text into smaller chunks
    - Sentences? Paragraphs? Topics?
    - Number of tokens? (Fixed-size chunking)
  - Embed each individual chunk
  - Aggregate chunk embeddings
    - Average
    - Max pooling
    - Attention mechanisms
2. Hierarchical embeddings
  - Embed small units of text
  - Combine the embeddings so that they encode the position of each unit and the relationships between units
  - Potentially capture overall summary/sentiment of the review at highest level
  - [Hierarchical Embedding for Amazon personalized product search](https://github.com/QingyaoAi/Hierarchical-Embedding-Model-for-Personalized-Product-Search)
  - Could also be applied to product metadata to encode product features?
3. Extend context windows
  - Positional encoding interpolation/extrapolation (sinusoidal)
  - Attention approximation
4. Existing embedding models for consideration?
Baseline for comparison: BLaIR (built on roberta-base, embedding dim 768)

  **4.1 Standalone models**

  Types to consider:

 A) Larger embedding dimensions: more robust, but increases training complexity
 - OpenAI text-embedding-ada-002 (embedding dim 1536, but costs [$0.10 per 1M tokens](https://platform.openai.com/docs/pricing))

 B) Smaller embedding dimensions: less complex
 - E5-small-v2 (embedding dim 384, English-only, and also truncates at 512 tokens; multilingual-e5-small is the multilingual counterpart)

 C) Embedding model plus additional layers, e.g. [BERT-BiGRU](https://link.springer.com/article/10.1007/s44196-025-00747-1#Tab1)
 - BiGRU: bidirectional gated recurrent unit
 - Generate word embeddings with BERT first, then extract key features with the BiGRU
 - The paper then feeds the output features into a GCN for classification

 Other more custom options:

 D) Other simpler/traditional methods, e.g. GloVe, word2vec

 E) Autoencoders?

  **4.2 Embedding model integrated with a GNN**
  - BertGCN: combines BERT with a graph convolutional network (GCN), one type of GNN architecture
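
To make the "positional encoding interpolation" idea from point 3 concrete, here is a minimal sketch in plain Python. It assumes sinusoidal encodings and a model trained on `trained_len` positions; longer inputs are handled by rescaling positions so they stay inside the range seen in training. The function names and parameters are illustrative, not taken from any library. Note that roberta-base uses learned position embeddings rather than sinusoidal ones, so this would not apply to it directly.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal encoding of one (possibly fractional) position; dim must be even."""
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def interpolated_encodings(target_len, trained_len, dim):
    """Squeeze target_len positions into the [0, trained_len) range seen in training."""
    # Assumption: positions are scaled linearly; inputs shorter than trained_len are unchanged.
    scale = min(1.0, trained_len / target_len)
    return [sinusoidal_encoding(pos * scale, dim) for pos in range(target_len)]


# A sequence 4x longer than the training length: every position is scaled by 0.25.
encodings = interpolated_encodings(target_len=2048, trained_len=512, dim=8)
print(len(encodings), len(encodings[0]))
```

Extrapolation would instead evaluate the same formula at unscaled positions beyond `trained_len`, which the model has never seen during training.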

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
import os
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer

folder_path = '/content/drive/My Drive/DL 28 project_copy/Amazon review GNN + BLaIR'
os.chdir(folder_path)
print(f"Current working directory: {os.getcwd()}")

product_df = pd.read_csv('./Pre-processed data/All_Beauty cleaned data/product_metadata.csv')
product_df.head()

# Fixed size chunking
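
A small sketch of the chunking arithmetic on a plain list of token IDs may help before the full implementation. It uses a chunk length of 8 instead of 512 so the result is easy to inspect. `split_into_chunks` and its parameters are illustrative names; the default IDs (0, 2, 1) are the start, end, and padding IDs of roberta-base, and would differ for other tokenizers.

```python
def split_into_chunks(token_ids, chunk_len=8, bos_id=0, eos_id=2, pad_id=1):
    """Split token IDs into fixed-length chunks with start/end tokens and padding.

    Returns (chunks, masks); every chunk and mask has exactly chunk_len entries.
    """
    body_len = chunk_len - 2  # two positions are reserved for the special tokens
    chunks, masks = [], []
    for start in range(0, len(token_ids), body_len):
        body = token_ids[start:start + body_len]
        chunk = [bos_id] + body + [eos_id]
        mask = [1] * len(chunk)
        pad = chunk_len - len(chunk)  # only the last chunk can need padding
        chunks.append(chunk + [pad_id] * pad)
        masks.append(mask + [0] * pad)
    return chunks, masks


# 14 hypothetical token IDs with chunk_len=8 give bodies of 6, 6 and 2 tokens.
chunks, masks = split_into_chunks(list(range(100, 114)))
for chunk, mask in zip(chunks, masks):
    print(chunk, mask)
```

With a real tokenizer the same logic applies with `chunk_len=512`, so each chunk carries at most 510 content tokens.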

In [None]:
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig

MODEL_NAME = "roberta-base"
TOKEN_LENGTH_PER_CHUNK = 512

class FixedSizeChunker():
    def __init__(self):
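        # Prefer CUDA, then Apple MPS, and fall back to CPU so the class runs on any machine.
        if torch.cuda.is_available():
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            self.device = torch.device("mps")
        else:
            self.device = torch.device("cpu")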
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.model = AutoModel.from_pretrained(MODEL_NAME).to(self.device)
        self.model.eval()

    def fixed_size_chunking(self, review):
        # Tokenize without special tokens; they are added to each chunk below.
        tokenized = self.tokenizer(review, add_special_tokens=False, padding=False,
                                   truncation=False, return_tensors="pt")
        input_ids_full = tokenized['input_ids'].squeeze(0).to(self.device)
        attn_mask_full = tokenized['attention_mask'].squeeze(0).to(self.device)

        # Reserve two positions per chunk for the start and end tokens.
        input_id_chunks = list(input_ids_full.split(TOKEN_LENGTH_PER_CHUNK - 2))
        mask_chunks = list(attn_mask_full.split(TOKEN_LENGTH_PER_CHUNK - 2))

        # Take the special-token IDs from the tokenizer: 101/102 are BERT's [CLS]/[SEP],
        # whereas roberta-base uses <s> and </s>.
        start_id = torch.tensor([self.tokenizer.cls_token_id], device=self.device)
        end_id = torch.tensor([self.tokenizer.sep_token_id], device=self.device)
        one = torch.ones(1, dtype=attn_mask_full.dtype, device=self.device)

        for i in range(len(input_id_chunks)):
            input_id_chunks[i] = torch.cat((start_id, input_id_chunks[i], end_id))
            mask_chunks[i] = torch.cat((one, mask_chunks[i], one))

            req_pad_len = TOKEN_LENGTH_PER_CHUNK - input_id_chunks[i].shape[0]

            if req_pad_len > 0:
                input_id_chunks[i] = torch.nn.functional.pad(input_id_chunks[i], (0, req_pad_len),
                                                             value=self.tokenizer.pad_token_id)
                mask_chunks[i] = torch.nn.functional.pad(mask_chunks[i], (0, req_pad_len), value=0)

        return torch.stack(input_id_chunks).long(), torch.stack(mask_chunks)

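    def aggregate_embeddings(self, input_id_chunks, attn_mask_chunks, method='mean'):
        # Inference only, so skip gradient tracking to save memory.
        with torch.no_grad():
            output = self.model(input_ids=input_id_chunks, attention_mask=attn_mask_chunks)
        # Masked mean over tokens gives one vector per chunk: (num_chunks, hidden_size).
        mask = attn_mask_chunks.unsqueeze(-1).to(output.last_hidden_state.dtype)
        chunk_embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        # Aggregate across chunks into a single (hidden_size,) review embedding.
        if method == 'mean':
            return chunk_embeddings.mean(dim=0)
        if method == 'maxpool':
            # max() returns (values, indices); keep only the values.
            return chunk_embeddings.max(dim=0).values
        raise ValueError(f"Unknown aggregation method: {method}")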

    def chunk_and_embed(self, review, method='mean'):
        input_chunks, mask = self.fixed_size_chunking(review)
        print(f'number of chunks: {input_chunks.shape[0]}')
        input_chunks = input_chunks.to(self.device)
        mask = mask.to(self.device)
        embeddings = self.aggregate_embeddings(input_chunks, mask, method)
        return embeddings
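
The class above handles the average and max-pooling options from the summary. The third option, attention-based aggregation, could look roughly like the sketch below. It is written in plain Python so the arithmetic stays visible; in the notebook the same steps would run on torch tensors, and `query` (an illustrative name) would be a learned parameter rather than a fixed vector.

```python
import math


def attention_pool(chunk_embeddings, query):
    """Weighted average of chunk embeddings with softmax(query . chunk) weights."""
    dim = len(query)
    # Scaled dot-product score of each chunk embedding against the query vector.
    scores = [sum(q * x for q, x in zip(query, emb)) / math.sqrt(dim)
              for emb in chunk_embeddings]
    # Numerically stable softmax over the chunk scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the chunk embeddings, one output value per dimension.
    pooled = [sum(w * emb[j] for w, emb in zip(weights, chunk_embeddings))
              for j in range(dim)]
    return pooled, weights


# Two toy 2-dimensional chunk embeddings; a zero query reduces to plain averaging.
toy_chunks = [[1.0, 0.0], [0.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(toy_chunks, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(toy_chunks, [10.0, 0.0])
print(uniform_weights, focused_weights)
```

Unlike mean or max pooling, the weights let the model emphasise the chunks most relevant to the downstream task, at the cost of extra parameters to train.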

In [None]:
chunker = FixedSizeChunker()

sample_product_df = product_df.iloc[0:5].copy()
sample_product_df.loc[:, 'embedding_fixed_chunk'] = sample_product_df['reviews'].apply(lambda review: chunker.chunk_and_embed(review))

In [None]:
print(f'shape of embedding: {sample_product_df.embedding_fixed_chunk.iloc[0].shape}')
sample_product_df.head()
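
As a possible alternative to fixed-size chunking, the "Sentences?" option from the summary could be sketched as follows: whole sentences are packed greedily into chunks under a token budget, so no sentence is cut in half. This is a rough sketch under stated assumptions: `sentence_chunks` is an illustrative name, the regex is a naive sentence splitter, and the default whitespace word count is only a stand-in for a real tokenizer's token count.

```python
import re


def sentence_chunks(text, max_tokens=510, count_tokens=None):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    if count_tokens is None:
        # Assumption: whitespace word count as a cheap proxy for the token count.
        count_tokens = lambda s: len(s.split())
    # Naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens still becomes its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(' '.join(current))
    return chunks


review = "Great serum. Smells nice! Would I buy it again? Yes."
packed = sentence_chunks(review, max_tokens=4)
print(packed)
```

Each resulting chunk could then be embedded and aggregated in the same way as the fixed-size chunks.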