
Multithreaded build #495

Merged: 14 commits into spotify:master on Aug 3, 2020

Conversation

novoselrok
Contributor

Issue: #70

This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04) all C++, Python and Lua tests seem to pass (I've re-run them multiple times). The only thing remaining is fixing the Go build, and I would appreciate some help with it 😄

I've run some speed tests on the glove-angular-25 index with n_trees=32. The reference time on master (without threads) was 49.9s. Here are the results for multiple thread counts:

| Num. threads | Time (s) |
| --- | --- |
| 1 | 51.28759694 |
| 2 | 26.675179 |
| 4 | 14.82837749 |
| 8 | 8.897883892 |
| 16 | 7.353658915 |

If you have any questions or concerns feel free to comment.

@novoselrok
Contributor Author

novoselrok commented Jul 28, 2020

Summary of CI issues:

  • Missing a cross-platform implementation of isnan (fixed by replacing NaN checks with checks before sqrt)
  • Missing the -std=c++11 flag when compiling the Lua and Go extensions
  • Windows Python 2.7 cannot find the mutex library

src/annoylib.h Outdated
for (int i = 0; i < n_threads; i++) {
int trees_per_thread = -1;
if (q > -1) {
// First thread picks up the remainder of the work
Collaborator

@erikbern erikbern Jul 28, 2020

this might lead to an uneven split, e.g. if q=20 and n_threads=7 then the distribution will be [8, 2, 2, 2, 2, 2, 2]

I think you can do trees_per_thread = work_per_thread + (i < work_remainder) and it should split it like [3, 3, 3, 3, 3, 3, 2]

Collaborator

actually even simpler: you can just do trees_per_thread = (q + i) / n_threads and you don't even have to define work_per_thread
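As a hedged sketch (the helper names here are assumptions, not Annoy's code), the two splitting schemes discussed in this thread can be checked side by side:

```cpp
#include <vector>

// Scheme A (suggested first): each thread gets floor(q / n_threads) trees,
// and the first (q % n_threads) threads pick up one extra tree each.
std::vector<int> split_even(int q, int n_threads) {
  int work_per_thread = q / n_threads;
  int work_remainder = q % n_threads;
  std::vector<int> out;
  for (int i = 0; i < n_threads; i++)
    out.push_back(work_per_thread + (i < work_remainder ? 1 : 0));
  return out;
}

// Scheme B (the simpler one-liner): trees_per_thread = (q + i) / n_threads.
// The extra trees land on the last threads instead of the first ones.
std::vector<int> split_simple(int q, int n_threads) {
  std::vector<int> out;
  for (int i = 0; i < n_threads; i++)
    out.push_back((q + i) / n_threads);
  return out;
}
```

For q=20 and n_threads=7, scheme A yields [3, 3, 3, 3, 3, 3, 2] and scheme B yields [2, 3, 3, 3, 3, 3, 3]; both sum to 20 and differ by at most one tree per thread.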

Contributor Author

Awesome, I was not exactly happy with my solution, so I'll reuse this.

src/annoylib.h Outdated
@@ -1172,7 +1206,87 @@ template<typename S, typename T, typename Distance, typename Random>
return get_node_ptr<S, Node>(_nodes, _s, i);
}

S _make_tree(const vector<S >& indices, bool is_root) {
void _thread_build(int q, int thread_idx, ThreadBarrier& barrier, std::mutex& _nodes_mutex) {
Collaborator

i think it would be good to have the same code for threaded and non-threaded building

Contributor Author

I'm not exactly sure what you mean by this: should we still support building without threads (e.g. when n_threads=1)?

Collaborator

I'd actually be ok only supporting threaded build so that we can remove the non-threaded build code

Contributor Author

I'm probably missing something, but _thread_build is just a private helper function that was extracted from the previous build function. It's there just so I don't have to stick everything inside a lambda function when spawning a new thread.

Could you point out which part of the code could be potentially removed? 😄

Collaborator

@erikbern erikbern Jul 28, 2020

I think _thread_build could be a pretty simple wrapper around _make_tree. I think the only two places that touch any state are:

I don't think you need the magic where you go into nodes and add offsets etc? So all _thread_build would have to do is to keep adding trees until you reach the limit.

Collaborator

Yeah exactly, that's what i was thinking.

Your approach at https://github.com/spotify/annoy/pull/495/files#diff-91eaadc9d0aa248375a22a64f315e3d8R1250 is very clever, but I'm a bit nervous it's too hacky and I'd love to have something slightly simpler :)

Doesn't your barrier thing roughly accomplish what a shared_mutex does? Or it should at least be possible to modify it into sort of a shared mutex? Haven't thought about it enough.

Contributor Author

I agree, I would definitely feel more comfortable if I could simplify it a bit 😄

Unfortunately a barrier does not accomplish what a shared_mutex is supposed to do. It can only be used as a global synchronization point, not for reader/writer style locking.

After reading a couple of SO threads, a custom shared mutex could prove really tricky to implement correctly. We could use std::shared_timed_mutex, which has a few additional capabilities and is available from C++14. Either way, I'll implement the less-magic version we discussed and then we can compare the performance and decide if it's worth it to bump the C++ version.

Contributor Author

Well, I managed to simplify my solution in the latest commit 😄 I removed all of the magic and introduced a couple more mutex variables and a shared_timed_mutex variable.

I've re-run my time measurements on glove-angular-25:

| Num. threads | n_trees=32 time (s) | n_trees=64 time (s) |
| --- | --- | --- |
| 1 | 49.286592960357666 | 99.88854050636292 |
| 2 | 26.54487133026123 | 53.29321575164795 |
| 4 | 15.026970863342285 | 29.342100143432617 |
| 8 | 10.422403812408447 | 18.966835498809814 |
| 16 | 9.757703065872192 | 17.069363117218018 |

Collaborator

Wow amazing! Any idea if there's any slowdown compared to the current master?

The new code looks much cleaner and less "clever". I don't know how well supported c++14 is but hopefully it's not a big problem.

Do you really need so many different mutexes? Not a concern, was just surprised to see 4 different ones

Contributor Author

I can prepare a more comprehensive timing benchmark and include a comparison with the current master. But as far as I can tell, there isn't a significant slowdown.

Regarding the number of mutexes: if I guard everything with the shared nodes_mutex, just one thread has to acquire an exclusive lock and everything grinds to a halt. So I opted for a more granular approach, where I exclusively lock the nodes_mutex only when absolutely necessary and add separate regular locks for all other shared variables (nodes_size, _n_nodes). I'll see if I can cut down on the number of mutexes, but I currently can't see a better way.

src/annoylib.h Outdated
std::mutex _nodes_mutex;
ThreadBarrier barrier(n_threads);
vector<std::thread> threads(n_threads);
int work_per_thread = (int)floor(q / (double)n_threads);
Collaborator

what happens if q = -1? you probably have to determine q before this code in case it's set to -1

@erikbern
Collaborator

this is super cool! i left a few minor comments, but generally this is great

src/annoylib.h Outdated
@@ -929,7 +952,7 @@ template<typename S, typename T, typename Distance, typename Random>
return true;
}

bool build(int q, char** error=NULL) {
bool build(int q, int n_threads=1, char** error=NULL) {
Collaborator

might be good to default it to -1 and if that's the case set it to the number of cpus
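A minimal sketch of that default (the helper name resolve_n_threads is an assumption): C++11's std::thread::hardware_concurrency() reports the number of cores, with the documented caveat that it may return 0 when the count cannot be determined:

```cpp
#include <thread>

// Sketch of the suggested default (assumed helper name): n_threads == -1
// means "use all available cores". hardware_concurrency() may return 0 if
// the core count is unknown, so fall back to a single thread in that case.
int resolve_n_threads(int n_threads) {
  if (n_threads == -1) {
    unsigned int hw = std::thread::hardware_concurrency();
    return hw > 0 ? (int)hw : 1;
  }
  return n_threads;
}
```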

vector<vector<Node*> > thread_trees;
vector<S> thread_roots;
while (1) {
if (q == -1) {
Collaborator

this code to determine q if q = -1 could probably just be left in the build function

Contributor Author

The thing with q = -1 is that I don't know beforehand how many trees each thread should build, because it depends on the number of nodes. That is why I have to check for -1 everywhere.

A better solution would be to introduce a different meaning to q = -1, where each thread would build just one tree for example.
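That stopping rule can be sketched as a small predicate (assumed names, modeled on the single-threaded rule where q == -1 means "build until the nodes outnumber the items"):

```cpp
// Hedged sketch of the q == -1 stopping rule (assumed names). With q == -1
// a thread cannot know its tree quota in advance, so each thread keeps
// building trees until the shared node count reaches a size-based limit
// (here 2 * n_items, an assumption modeled on the single-threaded rule).
bool should_stop(int q, int trees_built, long n_nodes, long n_items) {
  if (q == -1) return n_nodes >= 2 * n_items;  // size-based stop
  return trees_built >= q;                     // fixed quota of trees
}
```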

Collaborator

Ah yeah good point. I forgot about that. Ok if your code works then let's keep it, I guess

@novoselrok
Contributor Author

@erikbern Could you take a look at the failing Go and Lua builds, if you're able to help? I'm not sure how to get the -std=c++11 flag into the compilation.

Windows Python 2.7 might be tricky, because it seems it doesn't support C++11 thread/mutex at all.

src/annoylib.h Outdated
}

// Wait for all threads to finish before we can start inserting tree nodes into global _nodes array
barrier.wait();
Collaborator

isn't locking the mutex enough to enforce this?

Contributor Author

Unfortunately not, due to reasons noted here: #495 (comment) We have to wait for all threads to finish their work before we can mutate the global _nodes buffer.

if (!self->ptr)
return NULL;
static char const * kwlist[] = {"n_trees", NULL};
if (!PyArg_ParseTupleAndKeywords(args, kwargs, "i", (char**)kwlist, &q))
static char const * kwlist[] = {"n_trees", "n_threads", NULL};

Just an aside -- what do you think about naming this parameter in the Python wrapper to n_jobs instead of n_threads? n_jobs seems more consistent with the Python ecosystem. Either way, I think this is a great improvement!

Contributor Author

Sure, that makes sense. I can expose n_jobs keyword argument for Python, but keep n_threads everywhere else.

…h uses all CPU cores. Rename n_threads to n_jobs in the Python interface.
@erikbern
Collaborator

This looks quite amazing so far. Would be good to figure out the CI failures

@novoselrok
Contributor Author

This looks quite amazing so far. Would be good to figure out the CI failures

As far as I can tell Windows Py2.7 does not support the C++ thread/mutex libraries in any way. For Go and Lua I need to add the -std=c++14 flag to the compilation step, but I'm not really sure how to do that.

@erikbern
Collaborator

maybe there's some dumb way so that if those headers aren't present, it just defaults to single-threaded? not sure if that's a lot of work. i'd imagine there would be a lot of #ifdef scattered throughout the code
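One way to keep the #ifdefs in a single place is a compile-time policy, roughly like this sketch (ANNOYLIB_MULTITHREADED_BUILD is the macro a later commit introduces; the policy structs here are assumptions, not Annoy's exact code):

```cpp
// Sketch of the compile-time fallback suggested above (assumed names;
// ANNOYLIB_MULTITHREADED_BUILD is the macro a later commit introduces).
// All #ifdef logic lives here: build code calls policy.locked(...) and
// never needs to know whether real mutexes exist on the platform.
#ifdef ANNOYLIB_MULTITHREADED_BUILD
#include <mutex>
struct ThreadedBuildPolicy {
  std::mutex m;
  template <typename F>
  void locked(F f) {
    std::lock_guard<std::mutex> lock(m);
    f();
  }
};
typedef ThreadedBuildPolicy BuildPolicy;
#else
struct SingleThreadedBuildPolicy {
  template <typename F>
  void locked(F f) { f(); }  // no-op "lock": nothing to synchronize
};
typedef SingleThreadedBuildPolicy BuildPolicy;
#endif
```

When the macro is undefined (e.g. on Windows Python 2.7), the single-threaded policy compiles without pulling in thread/mutex headers at all.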

@erikbern
Collaborator

FYI the tests timed out on OSX which implies there might be some race condition or similar? https://travis-ci.org/github/spotify/annoy/jobs/713675853

…aded policies. Added ANNOYLIB_MULTITHREADED_BUILD which controls which threading policy to use.
@erikbern
Collaborator

erikbern commented Aug 2, 2020

nice, the last change is 👌

@novoselrok
Contributor Author

nice, the last change is 👌

Feel free to take another look and see if anything can be improved. OSX is going wonky, I'm still investigating the issue.

@novoselrok
Contributor Author

I think I figured out the OSX issue. Apparently locking/unlocking is much more expensive under OSX and I'm doing it excessively in a loop when adding indices. Moving the lock around the loop should do the trick 🤞
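The fix can be illustrated with a hedged sketch (function names are assumptions): both versions append the same indices, but the second acquires the lock once instead of once per element, which matters on platforms where lock/unlock is expensive:

```cpp
#include <mutex>
#include <vector>

// Per-iteration locking: the mutex is acquired and released src.size()
// times, which is the pattern that was slow on OSX.
void add_indices_per_item_lock(std::mutex& m, std::vector<int>& dst,
                               const std::vector<int>& src) {
  for (size_t i = 0; i < src.size(); i++) {
    std::lock_guard<std::mutex> lk(m);  // acquired once per element
    dst.push_back(src[i]);
  }
}

// Hoisted locking: the mutex is acquired once around the whole loop.
void add_indices_one_lock(std::mutex& m, std::vector<int>& dst,
                          const std::vector<int>& src) {
  std::lock_guard<std::mutex> lk(m);  // acquired once
  for (size_t i = 0; i < src.size(); i++) dst.push_back(src[i]);
}
```

The trade-off is that other threads wait for the whole loop instead of interleaving between elements, which is fine here since the critical section is short.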

@erikbern
Collaborator

erikbern commented Aug 2, 2020

this looks great – i'm happy to merge whenever you give 👍

@novoselrok
Contributor Author

Awesome 👍 I'll just prepare a short timing comparison with master sometime tomorrow. If there aren't any more surprises, it'll be ready to go.

@novoselrok
Contributor Author

novoselrok commented Aug 3, 2020

Here is a multithreaded build timing comparison for the angular-glove-25 dataset:

| Num. threads | n_trees=16 time (s) | n_trees=32 time (s) | n_trees=64 time (s) | n_trees=128 time (s) |
| --- | --- | --- | --- | --- |
| 1 | 24.35913038 | 49.1573956 | 98.2360599 | 197.1160657 |
| 2 | 13.40727019 | 25.64709997 | 51.36644101 | 104.5556092 |
| 4 | 7.058898449 | 14.07375836 | 29.17391777 | 58.58751988 |
| 8 | 4.247594595 | 8.404431581 | 17.18267322 | 33.64984083 |
| 16 | 3.579955816 | 7.196531773 | 14.31243014 | 28.61186814 |

Here are single threaded timings from master for comparison:

| Num. trees | Time (s) |
| --- | --- |
| 16 | 28.233051300048828 |
| 32 | 53.763163566589355 |
| 64 | 102.16749143600464 |
| 128 | 212.41096830368042 |

The speed-up is not linear, but it's good enough, especially for a large number of trees.

I think it's ready to 🚢

@novoselrok novoselrok changed the title WIP: Multithreaded build Multithreaded build Aug 3, 2020
@erikbern
Collaborator

erikbern commented Aug 3, 2020

Awesome, I’ll merge this now.

I won’t publish this to PyPI for a few weeks just in case there are some issues. There are some people using master, so that’s effectively a form of “beta testing”.

Really appreciate the work!

@erikbern erikbern merged commit 5caba52 into spotify:master Aug 3, 2020
@novoselrok
Contributor Author

Awesome, sounds good! 😄

@n1ru4l

n1ru4l commented Nov 6, 2020

Hey there, we are already using the multithreaded build with python and love the speed improvements! Now we are even thinking about migrating to golang in order to have better performance when loading the data before inserting it into the tree.

As far as I can tell Windows Py2.7 does not support the C++ thread/mutex libraries in any way. For Go and Lua I need to add the -std=c++14 flag to the compilation step, but I'm not really sure how to do that.

Is there a specific issue that blocked this from getting implemented? Do you plan to follow-up with a PR that enabled multithreaded builds, or should I try sending a PR?

@novoselrok novoselrok deleted the multithreaded-build branch November 6, 2020 17:26
@novoselrok
Contributor Author

Hey @n1ru4l, no I wasn't planning on it. You can give it a go, if you would like :) As I've mentioned, you would only need to supply the -std=c++14 flag when compiling for Go and possibly update the Go version in tests (https://github.com/spotify/annoy/blob/master/tox.ini#L13).

@n1ru4l

n1ru4l commented Nov 6, 2020

@novoselrok That sounds easy in theory :D I will give it a try!

@n1ru4l

n1ru4l commented Nov 9, 2020

So we decided against using golang. I won't implement this any time soon then.

@utkarshgupta137

utkarshgupta137 commented Aug 20, 2021

I'm running this in docker on ubuntu, but I'm only seeing 1 CPU core used at a time (100%), the rest are idle (0%).
I'm passing n_jobs = -1 & also tried to set it to 6 (which is the number of cores), but got the same result.

When I'm training a tensorflow model in the same docker, it uses 100% of the CPUs, so I don't think docker configuration is the issue.

from annoy import AnnoyIndex

index = AnnoyIndex(128, "angular")
for idx, doc in enumerate(weights):  # len(weights) = ~120K
    index.add_item(idx, doc)
index.build(256)

index.save(f"{loc_model}/annoy_index.ann")

[Screenshot 2021-08-21 at 00 01 57: CPU monitor showing one core at 100% and the rest idle]

@AniruddhaHumane

I am facing the same issue as @utkarshgupta137, did you find any solution for this?

@utkarshgupta137

I am facing the same issue as @utkarshgupta137, did you find any solution for this?

It has been a while, but I don't think I did.

@AniruddhaHumane

I see. That's unfortunate. I have created an issue here: #627. Hopefully the author will respond.
