Updating an index for batch indexing #5

fabiannagel · 2024-06-21T10:03:24Z

Hi! Is it possible to use update an existing index, e.g. for batch indexing and larger-than-memory datasets?
Unfortunately, this does not work:

import Stemmer
import bm25s

batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]

batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]


def index_corpus(corpus_batch):
    corpus_tokens = bm25s.tokenize([d['text'] for d in corpus_batch], stopwords="en", stemmer=stemmer)
    retriever.index(corpus_tokens)


def query_corpus(query):
    query_tokens = bm25s.tokenize(query, stemmer=stemmer)

    all_batches = batch_0 + batch_1
    results, scores = retriever.retrieve(query_tokens, corpus=all_batches, k=2)

    for i in range(results.shape[1]):
        doc, score = results[0, i], scores[0, i]
        print(f"Rank {i + 1} (score: {score:.2f}): {doc}")


retriever = bm25s.BM25()
stemmer = Stemmer.Stemmer("english")

index_corpus(batch_0)
index_corpus(batch_1)

query_corpus("what is a fish?")

The text was updated successfully, but these errors were encountered:

xhluca · 2024-06-21T18:53:48Z

Unfortunately it is not possible! The reason for that is that BM25 calculates the individual score based on the term frequency (tf), number of documents (|D|) and average document length (L_avg); thus, any new document added will change the information previously calculated.

As for a multi-stage approach (so you can remove the data from memory), I've thought about how to deal with memory with incremental passes and chunked indexing, but unfortunately I did not have the chance to come up with better ideas that would be straightforward to implement and use. I think it's good to keep this issue open if someone has a suggestion & potential PR.

xhluca · 2024-06-21T18:58:34Z

one major hurdle is that we need multiple pass on the corpus tokens (once to get term freq, and average length, a second for actual scoring). Although it might be possible to use memmap to store tokens to file and read from disk, that'd make the indexing slower, and the overall code more complex. if you are willing to try, feel free to create a PR and I'm glad to review to make sure it's consistent with tthe rest of the API (proper new tests would be needed to avoid regressions).

xhluca · 2024-06-21T19:02:10Z

That said, if your goal is to reduce memory usage, you can also consider passing an iterator to the tokenizer instead of a list in memory, some pseudo code:

def load_corpus(fname):
    with open(fname) as f:
        for line in f:
            yield json.loads(line)['text']

corpus_tokens = bm25s.tokenize(load_corpus(fname), stemmer=stemmer)

DemirTonchev · 2024-06-25T15:21:37Z

@xhluca Great job! You obviously put a lot of time in this!
About the update method - look at my implementation..

https://github.com/DemirTonchev/yeabm25/blob/f2c120d518b2b18893b0e4fc6eaf06fcbeefb01c/src/yeabm25/bm25.py#L168

I will be glad to spend some time and see if I can bring this to your implementation.

# ........
def index_corpus(corpus_batch):
    corpus_tokens = bm25s.tokenize([d['text'] for d in corpus_batch], stopwords="en", stemmer=stemmer)
    retriever.update(corpus_tokens) #udpate given the current state -> change to new state as if you would do .index(full_corpus)
#.......
# 1)
index_corpus(batch_0) <- inits and calculates idf etc..
index_corpus(batch_1) <- updated and calculates new idf etc...
# 2)
index_corpus(batch_0 + batch_1)

(1) and (2) give the same index.

xhluca · 2024-06-25T18:15:19Z

Hey that'd be an interesting addition! I'd be happy to review a PR with the incremental feature, docstrings and tests for the incremental update part.

That said, one reason I have not touched it is due to how BM25 is calculated, which is not directly compatible with precomputed BM25 scores. For example, the term frequency, average length and document count changes after an incremental update. Do you have an idea how to update existing pre-computed scores in this case?

DemirTonchev · 2024-06-25T19:28:15Z

look at my code in the link.
But in some sense you compute again the idf scores with the updated information. Also you have updated document-word frequencies and you can score any query or document.
I will get to it at some point wont be in the next 2-3 weeks though, sorry.

xhluca · 2024-06-25T20:42:03Z

No worries, take your time!

But in some sense you compute again the idf scores with the updated information.

But wouldn't that be not very efficient if you process by batch? For example, if you have 100M docs, and you want to update by adding 10; it'd be very fast to update a term frequency dictionary and update the average doc length. However, updating the IDF calculation of all the 100M docs would take a lot more time since you can't increment the values, instead you need to recalculate the idf and tf component. Then, repeating that for each batch would simply make it very computationally inefficient (and would also need to keep everything in memory... which defeats the purpose of batched updates).

DemirTonchev · 2024-06-26T08:00:55Z

Well yes, but this is the price to pay if you want to batch... I dont have solution otherwise at this moment. Its either you can do it by batch save the state and get rid of those documents or you need to retrain every time you have new batch documents on the whole corpus anyway.

xhluca · 2024-06-26T15:41:20Z

Thanks. I've been thinking about this but I was hoping I missed some solution...

In this case i guess index updates (i.e. for batching) would not be possible, but moving forward, I think it will be nice to eventually support low-memory tokenizing (i.e. a streaming tokenizer that yields the tokens as they are being iterated rather than returning them all at once) and sharded indexing (by calculate term frequency, avg doc len and num docs separately, and then shard the indexes which are saved as partial_index-{i}-of-{N})

DemirTonchev · 2024-06-27T09:37:59Z

Sure, batching updates are not possible by incrementing the idf values. Still in my use case I don't mind recalculating everything but with only new documents as the old ones are already archived, its a trade off. What happens if you need to add 100 new documents to the index? You need to have them all at one place..
I can contribute for the other features a bit later. I think you can close this one in this case.

xhluca · 2024-06-27T15:11:17Z

Makes sense! I'll close now. Thank you!

DemirTonchev · 2024-07-02T15:23:55Z

Got idea about the batching:
One could accumulate the statistics of the batched documents in the doc_frequencies, in this case idf calculation would have to wait.

# 1)
index_corpus(batch_0) <- inits and calculates doc_frequencies, etc.. but not idfs
index_corpus(batch_1) <- updates
index_corpus.flush() <-  calculates idfs etc. 
# 2)
index_corpus(batch_0 + batch_1) 
index_corpus.flush()

1 and 2 would be the same.
Again depends if it makes sense for the use case.

xhluca · 2024-07-02T17:53:09Z

I think it would make sense to make it into a util class, e.g.:

idfs = {}
tfcs = {}
for batch in batches:
    idfs = bm25s.utils.corpus.update_idfs(idfs, batch)
    tfcs =  = bm25s.utils.corpus.update_tfcs(tfcs, batch)

Then, we could have a method of bm25 that allows manually specifying idfs and tfcs:

retriever.index_precomputed(idfs=idfs, tfcs=tfcs)

If that method works well (passes new unittests, as fast as original, does not increase memory consumption), then we can rewrite BM25.index to use that new index_precomputed method by default.

DemirTonchev · 2024-07-03T09:42:23Z

Yeah makes sense :)

xansar · 2024-11-14T02:36:57Z

I think it would make sense to make it into a util class, e.g.:
idfs = {}
tfcs = {}
for batch in batches:
    idfs = bm25s.utils.corpus.update_idfs(idfs, batch)
    tfcs =  = bm25s.utils.corpus.update_tfcs(tfcs, batch)
Then, we could have a method of bm25 that allows manually specifying idfs and tfcs:
retriever.index_precomputed(idfs=idfs, tfcs=tfcs)
If that method works well (passes new unittests, as fast as original, does not increase memory consumption), then we can rewrite BM25.index to use that new index_precomputed method by default.

Is there any new progress about this method?

xhluca · 2024-11-14T18:05:26Z

@xansar I have not looked into it, mainly because the gains in speed of precomputing idfs/tfcs are outweighted by the 2x larger memory usage in having to store them in memory at all time.

Moreover, the main bottleneck of indexing at this stage is that we are not even using the most efficient implementation, i.e. numba JIT'd version of the computations. I think being able to solve this would bring a much bigger speed improvement over caching idfs/tfcs.

Unfortunately I have limited bandwidth in developing new features, as it requires

extensive testing,
docstring,
updating readme with the new feature,
give examples of the usage,

If you are interested in the feature being implemented, feel free to create an issue that outlines how the changes would happen and how the API will look like. If it looks good, I'd be happy to review a PR that include the points above.

xhluca closed this as completed Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updating an index for batch indexing #5

Updating an index for batch indexing #5

fabiannagel commented Jun 21, 2024

xhluca commented Jun 21, 2024 •

edited

Loading

xhluca commented Jun 21, 2024

xhluca commented Jun 21, 2024

DemirTonchev commented Jun 25, 2024

xhluca commented Jun 25, 2024

DemirTonchev commented Jun 25, 2024

xhluca commented Jun 25, 2024 •

edited

Loading

DemirTonchev commented Jun 26, 2024

xhluca commented Jun 26, 2024

DemirTonchev commented Jun 27, 2024

xhluca commented Jun 27, 2024

DemirTonchev commented Jul 2, 2024

xhluca commented Jul 2, 2024

DemirTonchev commented Jul 3, 2024

xansar commented Nov 14, 2024

xhluca commented Nov 14, 2024 •

edited

Loading

Updating an index for batch indexing #5

Updating an index for batch indexing #5

Comments

fabiannagel commented Jun 21, 2024

xhluca commented Jun 21, 2024 • edited Loading

xhluca commented Jun 21, 2024

xhluca commented Jun 21, 2024

DemirTonchev commented Jun 25, 2024

xhluca commented Jun 25, 2024

DemirTonchev commented Jun 25, 2024

xhluca commented Jun 25, 2024 • edited Loading

DemirTonchev commented Jun 26, 2024

xhluca commented Jun 26, 2024

DemirTonchev commented Jun 27, 2024

xhluca commented Jun 27, 2024

DemirTonchev commented Jul 2, 2024

xhluca commented Jul 2, 2024

DemirTonchev commented Jul 3, 2024

xansar commented Nov 14, 2024

xhluca commented Nov 14, 2024 • edited Loading

xhluca commented Jun 21, 2024 •

edited

Loading

xhluca commented Jun 25, 2024 •

edited

Loading

xhluca commented Nov 14, 2024 •

edited

Loading