How realistic is serializing the internal C++ AnnoyIndex state? #614

Open
ApproximateIdentity opened this issue Nov 13, 2022 · 0 comments

Basically what I would want is to run something like this:

import os
import random

from annoy import AnnoyIndex

num_rows = 10000
num_trees = 10
num_dims = 512

try:
    os.remove("annoy_idx.ann")
except FileNotFoundError:
    pass

annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.on_disk_build("annoy_idx.ann")
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx, vector)

annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for

and then (after that program has finished and exited) I would like to continue appending data with something like this (this adds 10,000 new rows with indices 10,000, ..., 19,999):

import os
import random

from annoy import AnnoyIndex

num_rows = 10000
num_trees = 10
num_dims = 512

# Keep annoy_idx.ann from the first run: the new items should extend it, not start over.

annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx + num_rows, vector) # XXX - Note the increase in idx variable

So basically what I want is a serialize/deserialize ability so that I can continue the flow. It seems to me that the protected data here would need to be serialized:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885

In my case it seems to basically come down to serializing the node here:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463
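To make the idea concrete, here is a rough sketch of what such a sidecar state file might hold. The field names are only a guess at the kind of bookkeeping kept in the protected section linked above (item count, node count, allocated size, roots, seed, flags); they are assumptions, not Annoy's API, and `IndexState`, `save_state`, and `load_state` are hypothetical helpers. The node data itself would stay in the `.ann` file created by `on_disk_build()`.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class IndexState:
    # Hypothetical mirror of the index bookkeeping; all names are assumptions.
    n_dims: int
    n_items: int = 0
    n_nodes: int = 0
    nodes_size: int = 0
    roots: list = field(default_factory=list)
    seed: int = 0
    on_disk: bool = False
    built: bool = False


def save_state(state, path):
    """Write the bookkeeping fields to a small JSON sidecar file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(state), fh)


def load_state(path):
    """Read the sidecar file back into an IndexState."""
    with open(path, "r", encoding="utf-8") as fh:
        return IndexState(**json.load(fh))
```

JSON is used here only because the state is tiny and easy to inspect; anything that stores raw pointers or file descriptors could not be saved this way and would have to be re-created on load.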

So my question is the following:

How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws with my thinking if I decided to go this route?
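As a way of framing the mmap part of the question, here is a minimal standard-library sketch of the general pattern: one run writes fixed-size records into a file through a memory map, and a later run reopens the same file, grows it, and keeps appending. This is not Annoy's code. The record layout (`RECORD`, an item id plus four floats) and the helpers `append_records` and `read_records` are hypothetical and only illustrate the file-backed storage idea.

```python
import mmap
import os
import struct

# Hypothetical fixed-size record: one int32 item id followed by 4 float32 values.
RECORD = struct.Struct("<i4f")


def append_records(path, records):
    """Grow the file and write records at its end through a memory map.

    Returns the total number of records in the file afterwards.
    """
    old_size = os.path.getsize(path) if os.path.exists(path) else 0
    if not records:
        return old_size // RECORD.size
    new_size = old_size + len(records) * RECORD.size
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, new_size)  # extend the file before mapping it
        with mmap.mmap(fd, new_size) as mm:
            offset = old_size
            for rec in records:
                RECORD.pack_into(mm, offset, *rec)
                offset += RECORD.size
            mm.flush()
    finally:
        os.close(fd)
    return new_size // RECORD.size


def read_records(path):
    """Read every record back as a list of tuples."""
    with open(path, "rb") as fh:
        data = fh.read()
    return [RECORD.unpack_from(data, i) for i in range(0, len(data), RECORD.size)]
```

The sketch only shows that raw storage can be resumed across processes. Whether the index's own offsets, roots, and counters remain valid after reopening is the part that the serialized state would have to cover.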

Thanks for any help!
