## Benchmarking Document Embedding of Wikipedia Articles on a Single RTX 4090
How much of Wikipedia is it feasible to embed, index, and retrieve on my desktop machine?
* How long would it take to embed Wikipedia with different embedding models?
    * For time's sake, only one model is tested here.
* How much storage would these embeddings require?
    * Again for time's sake, I only measure throughput (embeddings/sec) and extrapolate total time and storage from it.

In [1]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from torch.utils.data import DataLoader
import transformers
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

device = torch.device("cuda")

### Exploring Chunking on Wikipedia Dataset
Dataset is [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia):
* ~6.8 × 10^6 articles, each of which splits into several chunks, so the number of embeddings is a multiple of the article count
* With an embedding size of 384, stored as float16, and 1 embedding per article => ~5.2 GB (not including the overhead of the index data structure)
* The dataset itself is ~11 GB
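
The storage figure above is plain arithmetic. A minimal sketch, assuming one float16 vector per article and ignoring index overhead (the helper name `embedding_storage_bytes` is made up for illustration):

```python
def embedding_storage_bytes(n_vectors, dim=384, bytes_per_value=2):
    """Raw size of n_vectors embeddings; float16 uses 2 bytes per value."""
    return n_vectors * dim * bytes_per_value

n_articles = 6_800_000  # rough article count quoted above
size = embedding_storage_bytes(n_articles)
print(f"{size / 1e9:.2f} GB = {size / 2**30:.2f} GiB")
```

Since each article yields several chunks, the real figure scales with the chunks-per-article ratio, which this notebook does not measure.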

In [2]:
# NOTE: An issue I noticed is that formulae, tables, etc. are missing from data
# which means some paragraphs end abruptly and this info is lost
dataset = load_dataset('/home/stefanwebb/data/wikimedia/wikipedia', '20231101.en', split='train', streaming=True)

Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

### Test Document Embedder
* See how long it takes to process a single Parquet file

In [3]:
document_encoder = SentenceTransformer("/home/stefanwebb/models/llms/multi-qa-MiniLM-L6-cos-v1").to(device)

In [4]:
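# NOTE: Leave room for the special tokens added at encode time, hence 510 not 512
def chunk_iterator(seq, max_seq_length=510, overlap=128):
    """
    Given a list of tokens, yield it whole if it fits within max_seq_length;
    otherwise yield windows of max_seq_length tokens in which consecutive
    windows share `overlap` tokens. Requires overlap < max_seq_length.
    """
    if len(seq) <= max_seq_length:
        yield seq
    else:
        # Consecutive windows start `stride` tokens apart, so they share `overlap` tokens
        stride = max_seq_length - overlap
        # Ceiling division: number of strides needed to reach the end, plus the first window
        count_chunks = ((len(seq) - max_seq_length) + stride - 1) // stride + 1
        for idx in range(count_chunks):
            yield seq[(idx*stride):min(len(seq), (idx*stride) + max_seq_length)]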

In [5]:
# DEBUG: Test chunk_iterator
x = list(range(12))
for c in chunk_iterator(x, max_seq_length=15, overlap=4):
    print(c)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
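
The debug call above only exercises the short-sequence branch, since 12 tokens fit within a window of 15. As a rough sketch of the long-sequence case, here is a standalone helper (hypothetical name `sliding_windows`, not part of the notebook) that takes the step between window starts explicitly, so that consecutive windows overlap by `window - stride` tokens:

```python
def sliding_windows(seq, window, stride):
    """Yield seq whole if it fits, else windows starting every `stride` items."""
    if len(seq) <= window:
        yield seq
        return
    # Ceiling division for the number of extra windows after the first
    n_windows = -(-(len(seq) - window) // stride) + 1
    for i in range(n_windows):
        yield seq[i * stride : i * stride + window]

windows = list(sliding_windows(list(range(12)), window=5, stride=3))
for w in windows:
    print(w)
```

With these toy values every token is covered, neighbouring windows share 2 tokens, and the last window is shorter than the rest because the slice is clipped at the end of the sequence.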


In [6]:
def chunk_examples(articles):
    chunks = []
    for article in articles['text']:
        paragraphs = article.split('\n\n')
        # TODO: Can I avoid tokenizing twice: once here and once in model.encode?
        # model.encode doesn't seem to accept token IDs, only strings.
        # I could dig into the library source to find a solution...
        # Alternatively, just do fixed-length chunking with overlap.

        # Keep only non-empty paragraphs that start with a letter
        tokens = document_encoder.tokenizer([p for p in paragraphs if p != '' and p[0].isalpha()])['input_ids']
        # Split long paragraphs into windows and drop very short chunks
        tokens = [x for p in tokens for x in chunk_iterator(p) if len(x) > 7]

        # Decode back to strings, skipping special tokens so that literal
        # [CLS]/[SEP] markers do not end up in the chunk text
        paragraphs = document_encoder.tokenizer.batch_decode(tokens, skip_special_tokens=True)
        chunks += paragraphs

    return {"chunks": chunks}
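
To make the paragraph filter in `chunk_examples` concrete, here is a small sketch of the same `split` and `isalpha` test applied to an invented article string (the text is made up for illustration and is not taken from the dataset):

```python
article = (
    "Alan Turing was a mathematician.\n\n\n\n"
    "1912 births\n\n"
    "== References ==\n\n"
    '"Computing Machinery and Intelligence" is a 1950 paper.\n\n'
    "He worked at Bletchley Park."
)

paragraphs = article.split("\n\n")
# Same test as in chunk_examples: non-empty and first character is a letter
kept = [p for p in paragraphs if p != "" and p[0].isalpha()]
dropped = [p for p in paragraphs if p not in kept]

print(kept)
print(dropped)
```

The filter removes empty strings, headings, and lines starting with digits, but it also discards ordinary prose that happens to open with a quotation mark, which may or may not be desirable.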

In [7]:
chunked_ds = dataset.map(chunk_examples, batched=True, batch_size=4048, remove_columns=dataset.column_names)

In [8]:
chunked_pt = DataLoader(chunked_ds, batch_size=1024, num_workers=1)

In [9]:
# elem = iter(chunked_ds)
# for _ in range(1000):
#     x = next(elem)
#     print(len(document_encoder.tokenizer(x['chunks'])['input_ids']))

In [None]:
idx = 0
for batch in chunked_pt:
    chunks = batch['chunks']

    # TODO: encode_multi_process
    embeddings = document_encoder.encode(chunks, batch_size=1024, show_progress_bar=True)

    idx += 1
    if idx == 10:
        break
    # TODO: Enter into Vector DB
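
The loop above does not record timings itself. One way a chunks-per-second figure could be measured is sketched below; `measure_throughput` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the sketch runs without a GPU or the model:

```python
import time

def measure_throughput(batches, encode_fn, max_batches=10):
    """Return (items per second, items processed) over up to max_batches batches."""
    n_items = 0
    start = time.perf_counter()
    for i, batch in enumerate(batches):
        if i == max_batches:
            break
        encode_fn(batch)
        n_items += len(batch)
    elapsed = time.perf_counter() - start
    return n_items / elapsed, n_items

def fake_encode(batch):
    """Stand-in for the real encoder: sleeps briefly and returns dummy vectors."""
    time.sleep(0.001)
    return [[0.0] * 384 for _ in batch]

rate, n = measure_throughput([["chunk"] * 1024] * 12, fake_encode)
print(f"{n} items at {rate:.0f} items/sec")
```

In the notebook, the equivalent call would presumably pass `(b['chunks'] for b in chunked_pt)` as the batches and a wrapper around `document_encoder.encode` as `encode_fn`. Note that the timer then also includes the streaming, tokenizing, and chunking done while iterating, not only the GPU forward pass.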

In [None]:
# ~ 700 chunks per second
len(chunks), embeddings.shape, idx

(1024, (1024, 384))
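
Extrapolating from the ~700 chunks/sec noted above to the full dataset is a single division. A rough sketch, where the chunks-per-article ratios are hypothetical placeholders because the notebook does not measure the true ratio:

```python
def estimate_hours(n_chunks, chunks_per_sec=700):
    """Hours needed to embed n_chunks at a constant throughput."""
    return n_chunks / chunks_per_sec / 3600

n_articles = 6_800_000  # rough article count from the dataset notes
for chunks_per_article in (1, 5, 10):  # hypothetical ratios, not measured
    hours = estimate_hours(n_articles * chunks_per_article)
    print(f"{chunks_per_article:>2} chunks/article -> {hours:.1f} h")
```

At the lower bound of one chunk per article this comes to roughly 2.7 hours, and the estimate grows linearly with the actual number of chunks.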

In [34]:
type(embeddings)

numpy.ndarray
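
Since `encode` returns a NumPy array here, the float16 storage assumed in the earlier estimate could be obtained with a dtype cast. A minimal sketch, using random data as a stand-in for real embeddings and assuming the encoder output is float32 (its usual default):

```python
import numpy as np

# Random stand-in with the same shape as one batch of embeddings
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1024, 384)).astype(np.float32)

# Cast to half precision for storage
half = embeddings.astype(np.float16)

# Worst-case rounding error introduced by the cast
max_err = np.abs(embeddings - half.astype(np.float32)).max()

print(embeddings.nbytes, half.nbytes)
print(f"max abs error: {max_err:.5f}")
```

The cast halves the memory footprint. Whether the rounding error matters for retrieval quality would need to be checked on real embeddings.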