Parallelized batch processing #3

Closed
juldaani opened this issue Dec 18, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@juldaani

Hey,

I'm wondering whether it would be possible to add a multi-core implementation for processing multiple batches of data simultaneously.

My current approach is:

import pynanoflann
import numpy as np

n_batches = 10
target = np.random.rand(n_batches, 10000, 3)
query = np.random.rand(n_batches, 2000, 3)

for i in range(n_batches):
    pts_target = target[i]
    pts_query = query[i]

    # Build a separate kd-tree for each batch and query it sequentially
    kd_tree = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
    kd_tree.fit(pts_target)
    d, nn_idx = kd_tree.kneighbors(pts_query)

Instead, I'd like to do something like this:

import pynanoflann
import numpy as np

n_batches = 10
target = np.random.rand(n_batches, 10000, 3)
query = np.random.rand(n_batches, 2000, 3)

kd_trees = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
kd_trees.fit(target)
distances, nn_indexes = kd_trees.kneighbors(query)

This would create a kd-tree for each batch in 'target'. Each batch in 'query' would then be searched against its corresponding kd-tree for nearest neighbors.

Currently, if I want to implement this kind of parallelized processing I have to do it in Python. For large data sets this is not a problem, but for smaller data sets it causes too much overhead. I assume it would be much faster to implement on the C++ side of the code.
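
For reference, the Python-level workaround I have in mind looks roughly like this (just a sketch using the standard library's concurrent.futures; the process pool and the per-batch helper are only an illustration, not part of pynanoflann):

import numpy as np
import pynanoflann
from concurrent.futures import ProcessPoolExecutor

def knn_one_batch(pts_target, pts_query):
    # Build and query one kd-tree; the tree itself is thrown away afterwards
    kd_tree = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
    kd_tree.fit(pts_target)
    return kd_tree.kneighbors(pts_query)

if __name__ == '__main__':
    n_batches = 10
    target = np.random.rand(n_batches, 10000, 3)
    query = np.random.rand(n_batches, 2000, 3)

    # One task per batch; starting processes and pickling the arrays is
    # exactly the Python-side overhead that hurts for small batches
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(knn_one_batch, target, query))

    distances = [d for d, _ in results]
    nn_indexes = [idx for _, idx in results]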

Best regards
Juho

@u1234x1234 u1234x1234 added the enhancement New feature or request label Dec 18, 2020
@u1234x1234
Owner

Hi Juho,

I get your point, but it seems like a custom solution that would require additional effort, as there is no such option in the original nanoflann library, so I have no plans for this right now. Contributions are welcome.

@u1234x1234
Owner

Do you need the kd-trees after the queries? Or would an interface like this work too?

import pynanoflann
import numpy as np

n_batches = 10
target = np.random.rand(n_batches, 10000, 3)
query = np.random.rand(n_batches, 2000, 3)

distances, nn_indexes = batched_search(target, query, n_neighbors=1, metric='L2', leaf_size=20)

@juldaani
Author

Hey Dmitrii,

Thanks for the very fast response.

In my current application I don't need the trees, so they can be thrown away.

@u1234x1234
Owner

I've added parallelized batch processing.

pip install git+https://github.com/u1234x1234/pynanoflann.git@0.0.6

Full example:
https://github.com/u1234x1234/pynanoflann/blob/master/tests/test_batched_kneighbors.py

Internally it works with a list of numpy arrays of arbitrary shapes, i.e.:

target = [
  np.random.rand(10000, 3),
  np.random.rand(5000, 3),
  np.random.rand(40000, 3),
]

I've implemented a more general interface at the cost of a small overhead.
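
A usage sketch with per-batch queries (the entry point name below is only assumed from the test file name; see the linked test for the exact function and arguments):

import numpy as np
import pynanoflann

# Point clouds and query sets may have different sizes per batch
target = [
    np.random.rand(10000, 3),
    np.random.rand(5000, 3),
    np.random.rand(40000, 3),
]
query = [
    np.random.rand(2000, 3),
    np.random.rand(1000, 3),
    np.random.rand(8000, 3),
]

# Assumed entry point (name taken from tests/test_batched_kneighbors.py);
# one kd-tree is built and queried per pair of arrays, in parallel
distances, nn_indexes = pynanoflann.batched_kneighbors(
    target, query, n_neighbors=1, metric='L2', leaf_size=20
)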

@juldaani
Author

Hey,

I tested the implementation and it works very well, no bugs found. I have to say the library is now very fast with these multi-core capabilities. As far as I know, this is the fastest kd-tree implementation for Python.

This helped me a lot and made my algorithm many times faster. Thank you very much.

@u1234x1234
Owner

I'm glad it helped you.
