This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[BUG] [Jetson Nano Conda install hangs on installing pip dependencies] #304

Closed
emeldar opened this issue Jan 11, 2021 · 18 comments · Fixed by #305
Assignees
Labels
2 - In Progress (Currently a work in progress), doc (Documentation)

Comments

@emeldar

emeldar commented Jan 11, 2021

Describe the bug
When creating the conda environment on a Jetson Nano Development kit, the installation proceeds until installing pip dependencies, where it hangs indefinitely.

Steps/Code to reproduce bug
Fresh Jetpack install on Jetson Nano board.
Follow instructions for building from source on Jetson Nano exactly.

Expected behavior
The environment was expected to install successfully.

Environment details (please complete the following information):

  • Environment location: Jetson Nano board with Jetpack SDK
  • Method of cuSignal install: conda (specifically miniforge)

I've never used conda before, so I don't know exactly what logs are needed, but this is the last output from the install before it hangs:
Installing pip dependencies: ...working...

@emeldar emeldar added ? - Needs Triage (Need team to review and classify) and bug (Something isn't working) labels Jan 11, 2021
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Jan 11, 2021
@emeldar
Author

emeldar commented Jan 11, 2021

Upon trying to install the pip packages manually, I found that all of them are installed except for cupy>=8.0.0. When trying to install it manually using the pip binary from the environment, it hangs indefinitely while building the wheel for cupy. This might be the source of the issue, but I'm unsure what to do to build cupy for the aarch64 processor. Here is the line it hangs on:
Building wheels for collected packages: cupy
Building wheel for cupy (setup.py) ...

@znmeb

znmeb commented Jan 11, 2021

I've got scripts that install cupy and cusignal successfully on both a Nano and an AGX-Xavier. cupy takes a long time to install - it is compiling many kernels for the GPU using cicc.

On the AGX-Xavier it takes almost 47 minutes to install cupy!

Successfully installed cupy-8.3.0 fastrlock-0.5
2736.81user 43.31system 46:44.74elapsed 99%CPU (0avgtext+0avgdata 2640716maxresident)k
98096inputs+5670632outputs (22major+11542275minor)pagefaults 0swaps

It is probably working. Open another terminal on your Nano and run top; you should see the cicc compiles running. They're single-threaded; if there's some way to run four of them concurrently, it would cut the install time down.
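One quick way to check, sketched here as a suggestion rather than a verified recipe, is to look for the cicc processes from that second terminal:

top -c            # full command lines; look for cicc entries
pgrep -c cicc     # count how many cicc compiles are running right now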

@awthomp awthomp self-assigned this Jan 11, 2021
@awthomp
Member

awthomp commented Jan 11, 2021

Hi @eldaromer -- thanks for submitting an issue to cuSignal, and thanks for the quick input, @znmeb! I'd like to echo Ed's comments and say that cupy takes a very long time to compile on the Jetson platform, particularly the Nano. I'd recommend retrying the cupy pip install before you go to bed and report back the status. I'm happy to work with the cupy developers to get this working if we uncover some Jetson/aarch64 specific issue!

@leofang
Member

leofang commented Jan 11, 2021

Hi all, a cupy guru here. Could you please set this env var

export CUPY_NVCC_GENERATE_CODE="arch=compute_XX,code=sm_XX"

with XX being your device's compute capability, and then install cupy. I hope this would make the compilation a lot faster.
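For example, on a Jetson Nano (compute capability 5.3) that would presumably be:

export CUPY_NVCC_GENERATE_CODE="arch=compute_53,code=sm_53"
pip install cupy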

@awthomp
Member

awthomp commented Jan 11, 2021

Hi all, a cupy guru here. Could you please set this env var

export CUPY_NVCC_GENERATE_CODE="arch=compute_XX,code=sm_XX"

with XX being your device's compute capability, and then install cupy. I hope this would make the compilation a lot faster.

Thanks for the info, Leo! I'll update our documentation to reflect this suggestion too.

@awthomp awthomp added 2 - In Progress (Currently a work in progress) and doc (Documentation) labels and removed ? - Needs Triage (Need team to review and classify) and bug (Something isn't working) labels Jan 11, 2021
Bug Squashing automation moved this from Needs prioritizing to Closed Jan 11, 2021
@leofang
Member

leofang commented Jan 11, 2021

@znmeb @eldaromer Let us know if it helps reduce the compilation time.

@emeldar
Author

emeldar commented Jan 11, 2021

@leofang I have set the environment variable as instructed before with the compute capability set to 53 for the Jetson Nano. I am currently running the cupy install again, and I'm timing how long it takes. I will keep you updated.

Creating a pre-built wheel for the Nano, as mentioned in a new issue above, would be of great utility.

@emeldar
Author

emeldar commented Jan 11, 2021

OK, installing cupy on the Nano with the environment variable set took ~30 minutes to complete. Maybe that should be added to the build instructions so users know what to expect. Thank you all for the help.

@znmeb

znmeb commented Jan 12, 2021

My current setup compiles for compute capabilities 53 (TX1 and Nano), 62 (TX2), and 72 (AGX Xavier and, I assume, also Xavier NX): https://developer.nvidia.com/cuda-gpus. That dropped the compile time on the AGX Xavier from 47 minutes to 30. After that, cusignal only takes about two minutes.
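In env-var form, that setup would be something like:

export CUPY_NVCC_GENERATE_CODE="arch=compute_53,code=sm_53;arch=compute_62,code=sm_62;arch=compute_72,code=sm_72"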

Given how useful cupy is I think a pre-built wheel on conda-forge is a great idea. I'd like to see more of RAPIDS.AI migrated to conda-forge, even though a lot of it will only run on Volta or later.

@leofang
Member

leofang commented Jan 12, 2021

Thank you @znmeb @eldaromer for the quick feedback. Indeed, I've been wanting to build CuPy for ARM (which I assume is for Jetson devices?) on conda-forge. However, it's currently blocked by a few needed infrastructure changes, for example this one. Perhaps you could open an issue on CuPy's issue tracker to let them know your need, so they can evaluate whether PFN has the resources and bandwidth to support pip wheels for ARM? (I am not from PFN, so I can't speak for them on this.)

cc: @jakirkham Looks like we have at least two serious Jetson users in need of CuPy on ARM 🙂

@jakirkham
Member

jakirkham commented Jan 12, 2021

Well the first step would be packaging cudatoolkit. There's some initial work in PR ( conda-forge/cudatoolkit-feedstock#4 ) if someone would like to take a crack at it 😉

@znmeb

znmeb commented Jan 12, 2021

Well the first step would be packaging cudatoolkit. There's some initial work in PR ( conda-forge/cudatoolkit-feedstock#4 ) if someone would like to take a crack at it 😉

I'm trying to push out a release but I can test on a 4 GB Nano and a 16 GB AGX Xavier in my spare time (cringes as my 3090 feels unloved) :-)

arrow-cpp and pyarrow-cuda are on my conda-forge wishlist too, BTW. And POCL.

@leofang
Member

leofang commented Jan 12, 2021

This is perhaps something I can learn from you guys 🙂 I always imagined I could buy a Jetson device and have it sit and run on my desk, like a Raspberry Pi (which I don't have either). Is that the case? What's the best/cheapest/fastest way to set up a Jetson environment? What are the use cases for running cuSignal on Jetsons?

@znmeb

znmeb commented Jan 12, 2021

This is perhaps something I can learn from you guys 🙂 I always imagined I could buy a Jetson device and have it sit and run on my desk, like a Raspberry Pi (which I don't have either). Is that the case? What's the best/cheapest/fastest way to set up a Jetson environment? What are the use cases for running cuSignal on Jetsons?

For now, plunk down the $700 for an AGX Xavier, or the $400 for a Xavier NX. The Nano only has 4 GB of RAM, which I find more of a constraint than the cores or the Maxwell GPU.

My use case is digital audio, but IIRC the original motivation was software-defined radio.

@awthomp
Member

awthomp commented Jan 12, 2021

This is perhaps something I can learn from you guys 🙂 I always imagined I could buy a Jetson device and have it sit and run on my desk, like a Raspberry Pi (which I don't have either). Is that the case? What's the best/cheapest/fastest way to set up a Jetson environment? What are the use cases for running cuSignal on Jetsons?

For now, plunk down the $700 for an AGX Xavier, or the $400 for a Xavier NX. The Nano only has 4 GB of RAM, which I find more of a constraint than the cores or the Maxwell GPU.

My use case is digital audio, but IIRC the original motivation was software-defined radio.

Yes! @leofang, @znmeb is correct that SDR was the first Jetson use case. We have folks plugging in a ~$20 RTL-SDR and doing GPU-based FM demodulation, signal and modulation recognition, resampling and display, etc.

As for "how to get started" - you basically install JetPack on the Jetson and you're plopped into an Ubuntu environment.

@leofang
Member

leofang commented Jan 18, 2021

Thanks for the interesting answers, @awthomp @znmeb! $400 is very attractive -- now I don't know if I should get a PS5 or a Xavier first 😂 SDR seems to be a cool thing I'd never heard of, and I'm glad I asked!

Back to the slow compilation issue: @znmeb @eldaromer, it occurs to me that I didn't think too hard about the CPU performance difference. On a normal x86-64 system we always see thrust and cub being the two slowest components to build, but perhaps on Jetson, compiling other modules could also take non-negligible time.

If you have time, could you please try the following (see the sketch after this list):

  • Build with the verbose flag -v, e.g. pip install -v cupy, and eyeball which modules are slowest to build (sorry, I don't have a better recommendation here)
  • Try cranking up the env var CUPY_NUM_BUILD_JOBS to a higher number and see if it improves things. Its default is 4, IIRC.
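A minimal sketch combining the two suggestions, with the log file name (cupy-build.log) chosen here just for illustration:

export CUPY_NUM_BUILD_JOBS=$(nproc)
pip install -v cupy 2>&1 | tee cupy-build.log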

Let me know if it helps (or not)!

@znmeb

znmeb commented Jan 18, 2021

@leofang OK - I'm adding pyarrow with CUDA to the script. When I get that done I'll time the cupy build on both the AGX-Xavier and the Nano. It looked to me like it was only using one core.

@znmeb

znmeb commented Jan 19, 2021

OK ... here we go!

nano-cupy.log
agx-xavier-cupy.log

I ran both with CUPY_NUM_BUILD_JOBS equal to nproc, so 4 on the Nano and 8 on the AGX-Xavier. For both:

export CUPY_NVCC_GENERATE_CODE="arch=compute_53,code=sm_53;arch=compute_62,code=sm_62;arch=compute_72,code=sm_72"

The bottom line(s):

Nano: used 1.96 cores out of 4 on average (196%CPU)

3865.11user 68.66system 33:25.19elapsed 196%CPU (0avgtext+0avgdata 1973300maxresident)k
652688inputs+3901608outputs (1958major+9014625minor)pagefaults 0swaps

AGX-Xavier: used 2.30 cores out of 8 on average (230%CPU)

2056.31user 38.91system 15:10.60elapsed 230%CPU (0avgtext+0avgdata 1973048maxresident)k
16inputs+3901576outputs (0major+8841728minor)pagefaults 0swaps

5 participants