Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize build? #70

Open
ravimody opened this issue May 29, 2015 · 28 comments
Open

Parallelize build? #70

ravimody opened this issue May 29, 2015 · 28 comments

Comments

@ravimody
Copy link

Conceptually, if the trees are independent, shouldn't we be able to build the trees in parallel across multiple cores? Is this something that could be implemented easily?

I haven't had a chance to dive deeply into the code/algorithm yet so I may be misunderstanding something.

@erikbern
Copy link
Collaborator

It wouldn't be too hard – you would need a mutex in a couple of places and also watch out for reallocations, but other than that it should be easy

@ravimody
Copy link
Author

ravimody commented Jun 1, 2015

Cool thanks, I'll look into it (although my C++ is a little rusty :))

@thomas4g
Copy link

Has there been any further work on this?

@erikbern
Copy link
Collaborator

not afaik :(

@ravimody
Copy link
Author

No work on it from my end.

@thomas4g
Copy link

Bummer. I'd love to help out, but I don't think I'm familiar enough with annoy yet (or frankly competent enough with C++) to spearhead anything. If anyone starts working on this and wants a hand, I'm happy to help out.

@thomas4g
Copy link

Actually, I take it back. It looks like a pretty simple change. If I understand it properly, it's line 496 of annoylib.h:

_roots.push_back(_make_tree(indices));

That can get parallelized, but with a mutex around _roots to avoid simultaneous push_back calls, right?

@erikbern
Copy link
Collaborator

roughly, but it's a bit more complicated._make_tree also modifies the datastructure. You also need a mutex around line 643-644 and line 702-703. Maybe that mutex could just be folded into _allocate_size and we could rename the function to something like _add_item ... probably makes more sense

then you obviously need to create pthreads and collect them afterwards... honestly I haven't done that in like 10 years, but iirc it's not too hard

@tjrileywisc
Copy link
Contributor

@erikbern Actually I was working on a branch to add this just this weekend. It doesn't quite work yet though (I can only get it to run the precision_test.cpp example, only a debug binary works and it crashes sometimes even then); should I submit a PR?

I didn't touch anything towards the bottom of annoylib.h (near those lines that you mentioned), that might be the problem...

@erikbern
Copy link
Collaborator

erikbern commented Nov 20, 2017

sure, feel free to submit a PR, just make sure to put "WIP" in the subject or something

you definitely need a mutex around those lines, that's probably the issue :)

@thomas4g
Copy link

@tjrileywisc what a coincidence! I also started working on this over the weekend. Mine also doesn't quite work... but I've pushed a copy to my fork: https://github.com/thomas4g/annoy/tree/parallelize_build

I hope to keep working on it tonight, but if you're further along let me know if you'd like any help! Feel free to ping me here or shoot me an email me@thomasshields.net

@tjrileywisc
Copy link
Contributor

tjrileywisc commented Nov 20, 2017

@thomas4g
Just took a peak at your fork - I'm actually using std::thread instead of pthread . I think Windows doesn't have support for pthread built in. std::thread should be available for gcc and MSVC.

@thomas4g
Copy link

@tjrileywisc ahh, I wanted to use that but got thrown off by the build not supporting my #include <mutex> by default. I didn't want to fiddle with build settings.

@tjrileywisc
Copy link
Contributor

tjrileywisc commented Nov 21, 2017

Just submitted a PR for my build_trees_threaded branch.

#246

@denkuzin
Copy link

denkuzin commented Jul 1, 2019

guys, is there any news on the multicore index building?

@os-gabe
Copy link

os-gabe commented Nov 25, 2019

I thought I might have a go at this. The actual threading machinery is all straightforward but I'm having some trouble seeing exactly which bits of _make_tree() modify common data and need to be guarded with a mutex. As per previous comments I wrapped some functionality into a thread safe _add_item() function but I'm still getting segfaults indicating data is getting modified out from under my threads somewhere else.

S _add_item() {
    std::lock_guard<std::mutex> guard(_mutex);
    _allocate_size(_n_nodes + 1);
    S item = _n_nodes++;
    return item;
}

Any help with this is appreciated.

@erikbern
Copy link
Collaborator

interesting – you're probably right that it should be fairly straightforward, but i can't think of other critical sections off the top of my head

@erikbern erikbern reopened this Nov 25, 2019
@os-gabe
Copy link

os-gabe commented Nov 25, 2019

Unless I'm misunderstanding the code this may be more complicated than I first thought. It appears that there are many places within _make_tree that touch the underlying data. For example _get would need a mutex since it could be attempting to access _nodes while another thread is calling _add_item. But worse, since it returns a pointer, anywhere that pointer is used would also need a mutex. I believe this is looking like you would wind up needing mutexes around quite a large portion of _make_tree which would mostly undo the benefits of parallelizing it. It's possible I'm still not understanding the code fully though.

@erikbern
Copy link
Collaborator

From what you say I think the main issue is that the underlying memory can be reallocated at any point in time, and that invalidates any pointers held by any other thread.

But those reallocations are actually pretty rare so there should be some way to fix.

I haven't dealt with concurrency code in C++ since maybe 2007 so my knowledge is a bit rusty but couldn't you use a shared lock for this? Almost all access will be nonexclusive (so near-zero overhead), but the few times when you need to reallocate the underlying storage, you would have to acquire an exclusive lock. Does that make sense?

@erikbern
Copy link
Collaborator

I meant a shared mutex. This looks like the right concurrency primitive: https://en.cppreference.com/w/cpp/thread/shared_mutex

So basically acquire a shared lock when writing individual vectors, acquire an exclusive lock when you have to resize the underlying data storage.

But I'm mostly speculating, could be wrong :)

@os-gabe
Copy link

os-gabe commented Nov 26, 2019

I think what you are saying makes sense. Thanks for the tip on shared mutex - I hadn't seen that before. It was introduced in c++17 though which may be it's own problem.

I'll take another look and see what I can figure out.

@chikubee
Copy link

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

@erikbern
Copy link
Collaborator

no, this would have to be implemented by someone

@chikubee
Copy link

@erikbern That's sad. I am trying to build an index for over 1M vectors and it is crashing even with on_disk_build. The process took up more than 30 GB memory and crashed.

@erikbern
Copy link
Collaborator

i'm not sure if parallelization would have helped, though. do you know what's causing it to crash?

@ravimody
Copy link
Author

Yeah agreed that parallelization probably would not help.

1M vectors isn't that much unless they are extremely high dimensional vectors. Did you try with a smaller number of vectors to see where it starts failing?

@erikbern
Copy link
Collaborator

good point – annoy isn't meant for super high dimensionality, so if that's what you're facing then you should probably run dimensionality reduction outside of annoy first!

@os-gabe
Copy link

os-gabe commented Jan 27, 2020

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

Unfortunately I had to move on to other things and did not get parallel build working

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants