ft[torch]: How can we exploit cpu/gpu-parallelization with fabrics. #130

Draft
wants to merge 9 commits into base: develop
Conversation

maxspahn
Copy link
Collaborator

The idea behind this PR is simple:

  1. Export the planner as native C code.
  2. Parse the C function into a numpy function using a custom translation.
  3. Enjoy parallelization with the generated numpy function.

A very similar approach should be applicable to torch. A sketch of step 1 is shown below.
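
For step 1, the export boils down to CasADi's built-in C code generation. A minimal sketch, with a placeholder function standing in for the real fabrics planner (the real one would come from the concretized planner):

```python
import casadi as ca

# Placeholder CasADi function; in fabrics this would be the planner's
# acceleration map obtained after concretization.
x = ca.SX.sym("x", 7)
xdot = ca.SX.sym("xdot", 7)
acc = -2.0 * x - 0.5 * xdot  # dummy dynamics, not the real planner
casadi_planner_function = ca.Function("planner", [x, xdot], [acc])

# Step 1: export as native C code (writes planner.c to disk).
casadi_planner_function.generate("planner.c")
```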

@saraybakker1 @AndreuMatoses

@saraybakker1
Copy link
Collaborator

Thanks @maxspahn, will have a look at it! :)

@AndreuMatoses
Copy link
Member

Thanks, I will have a look, too!

@maxspahn maxspahn marked this pull request as draft April 15, 2024 10:28
@maxspahn
Copy link
Collaborator Author

maxspahn commented Apr 18, 2024

Adds a simple comparison between looping and numpy parallelization.
It turns out that for 100 samples, you get a speed-up of around 120x for the
planner from the panda.py example.
@AndreuMatoses
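
For reference, a minimal sketch of the kind of comparison meant here. Both planner handles are trivial stand-ins so the snippet runs on its own; the real ones are the looped CasADi function and the generated batched numpy function from this PR:

```python
import time
import numpy as np

# Stand-ins for the real functions (same math, different call convention).
def planner_casadi(q, qdot):       # evaluates a single sample
    return -2.0 * q - 0.5 * qdot

def planner_np(q, qdot):           # evaluates a whole (N, dof) batch at once
    return -2.0 * q - 0.5 * qdot

N, dof = 100, 7
q, qdot = np.random.rand(N, dof), np.random.rand(N, dof)

# Looped evaluation: one call per sample.
t0 = time.perf_counter()
acc_loop = np.stack([planner_casadi(q[i], qdot[i]) for i in range(N)])
t_loop = time.perf_counter() - t0

# Batched evaluation: one call for all samples.
t0 = time.perf_counter()
acc_batch = planner_np(q, qdot)
t_batch = time.perf_counter() - t0

print(f"loop: {t_loop:.5f}s  batched: {t_batch:.5f}s")
```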

@AndreuMatoses
Copy link
Member

Okay, a 120x increase is indeed relevant. I will see if I can implement it for my case and potentially try to port it to torch. Thanks!

@maxspahn
Copy link
Collaborator Author

Also, the speed-up scales with the number of samples, so for 1000 environments, the speed-up is even bigger. But I wasn't patient enough to wait for the result :D

@AndreuMatoses
Copy link
Member

I have added the translator from the .c function to torch code. I also added some examples using the dingo+kinova setup and cubic obstacles.
CAREFUL: the generated torch code may have an issue when using torch.fmax(a, b) if, for example, b is a float and not a tensor. This seems to always happen because one of the first variables CasADi declares is something like a1 = 0.000, and then a2 is always used in the fmax functions. For now, I just changed the variable by hand in the resulting Python script to a1 = torch.tensor(0.0, device='cuda:0'), but this should be done in a more consistent way if someone wants to use this for any problem.
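
A minimal sketch of the issue and of the manual fix described above (the device string and variable name are just examples, not taken from the generated script):

```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
x = torch.rand(1000, device=device)

# Problem: the generated code declares plain float constants ...
a1 = 0.000
# ... which torch.fmax rejects, since both arguments must be tensors:
# torch.fmax(a1, x)  # raises a TypeError

# Manual fix applied in the generated script: wrap the constant as a tensor
# on the same device as the batched inputs.
a1 = torch.tensor(0.0, device=device)
y = torch.fmax(a1, x)  # works, broadcasts the scalar over the batch
```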

@AndreuMatoses
Copy link
Member

To get an idea of the difference in performance between the different options (casadi function, parallelized numpy function, parallelized torch function), check these computation times for the dinova example. Take them just as a reference, as the performance could change depending on many factors.

Casadi (looped) vs Numpy

[Plots: computation_time, average_computation_time_per_N]

Numpy vs Torch

[Plots: computation_time_torchVSnp, average_computation_time_torchVSnp_per_N]

Conclusion:

For fewer than ~100 N, looping the casadi function is best. Between 100 and 10k N, numpy is better as it has less overhead than torch. Above 10k N, torch becomes better, especially for very large N, as the computation time basically stays around 300 ms as long as you have enough VRAM.
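
A small sketch of how one could dispatch between the three options based on these rough crossover points. All three planner handles are hypothetical stand-ins (here trivial lambdas so the snippet runs), not part of this PR:

```python
import numpy as np

# Stand-ins for the three variants discussed above.
planner_casadi_loop = lambda q, qdot: np.stack([-q[i] for i in range(q.shape[0])])
planner_np = lambda q, qdot: -q
planner_torch = lambda q, qdot: -q  # real version would move data to the GPU

def evaluate_planner(q, qdot):
    """Pick the backend from the batch size N, using the crossover points above."""
    n = q.shape[0]
    if n < 100:          # small batches: looping the casadi function wins
        return planner_casadi_loop(q, qdot)
    elif n < 10_000:     # mid-range: numpy has less overhead than torch
        return planner_np(q, qdot)
    else:                # very large N: torch on the GPU stays roughly constant
        return planner_torch(q, qdot)
```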
