Trying to install Horovod from a fresh conda environment (with tensorflow) and nothing seems to work #2138

Closed
illumidas-agn opened this issue Jul 23, 2020 · 99 comments
@illumidas-agn

Environment:

  1. Framework: (TensorFlow)
  2. Framework version:
  3. Horovod version: 0.19.5
  4. MPI version: -
  5. CUDA version: -
  6. NCCL version: -
  7. Python version: 3.6
  8. OS and version: Ubuntu
  9. GCC version: -

Your question:

I looked through all the open questions. I am currently trying to run go-explore (https://github.com/uber-research/go-explore/tree/master/policy_based), and I have only managed to get Horovod working once, for reasons I don't understand.

I need it built with TensorFlow support (i.e. horovod.tensorflow), and when I try to force the TensorFlow flag during installation I get a ten-page log dump from which it is hard to tell what is actually missing.

How do I get horovod running?

I'm not sure what I'm doing wrong; I've tried everything else.

@tgaddair
Collaborator

Hey @illumidas-agn, have you taken a look at the Conda install guide here?

If you're still having issues after going through that, feel free to provide log output showing where things are breaking.

@illumidas-agn
Author

Hi @tgaddair, I'm going to go through that and let you know how it goes.

@illumidas-agn
Author

I'm trying to run this on a server where I don't have sudo access. Are there any alternatives using pip or conda install?

@tgaddair
Collaborator

Where are you running into permissions issues? @davidrpugh, do you have some thoughts on this?

If you don't need to use conda, you can also opt to install everything through pip in a virtual environment. The important thing is to make sure the CUDA devtools are available when building with NCCL support. If this is an issue, you may want to see if you can run in a containerized environment.

@illumidas-agn
Author

Installing the CUDA Toolkit requires sudo, and sadly I'm not an admin, so I can't install it the conventional way. I was able to get it running in a virtual environment on my local machine, but when I transferred it to the server I ran into issues. I spent the entire day trying to get Horovod and CUDA to run.

I can ask the admin of the server to install it if there's no quick fix for this.

@tgaddair
Collaborator

Either that or setting up Docker/Singularity would probably be the easiest way, yes. It's certainly possible to install locally (see: https://stackoverflow.com/questions/39379792/install-cuda-without-root), but managing the correct environment variables will likely be a challenge.
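For the no-root route, the environment variables in question usually come down to three. A minimal sketch, assuming the toolkit was unpacked under a user-writable directory (the `$HOME/cuda-10.0` default is a hypothetical location, not something taken from this thread):

```shell
# Hypothetical install location; change it to wherever the toolkit was unpacked.
CUDA_HOME="${CUDA_HOME:-$HOME/cuda-10.0}"
export CUDA_HOME

# Put the toolkit's binaries (nvcc) and shared libraries ahead of any system copies.
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Report whether nvcc can now be found.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found; check that $CUDA_HOME/bin exists" >&2
fi
```

These lines would typically go in a shell startup file or a per-project activation script so that every build sees the same toolkit.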

@illumidas-agn
Author

I see. In that case I'll contact the server admin and get back to you if everything works. Thank you for your help.

@davidrpugh
Contributor

@illumidas-agn you should be able to install Horovod using Conda without root privileges. You will need to use the cudatoolkit-dev=10.1 package from the conda-forge channel. The environment file below should work (you will still need the other files referenced in the Conda install guide).

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - cmake=3.16
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - mxnet-cu101mkl==1.6.* # makes sure installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.6 

Note that I have bumped a lot of version numbers from what is in the current guide. @tgaddair I will test this and then update the install guide accordingly. Perhaps it should state explicitly that you need the cudatoolkit-dev approach if you don't have permission to install the CUDA Toolkit as root.

@illumidas-agn
Author

The project I'm trying to run requires Python 3.6, but I will see whether 3.7 works and keep you updated.

@davidrpugh
Contributor

@illumidas-agn Then just change the Python version to 3.6. Shouldn't impact the build.

@illumidas-agn
Author

I see. I only asked because 3.7 is listed in the environment file that you posted above.

@davidrpugh
Contributor

@illumidas-agn I pin the version numbers in my environment file to the most recent versions of the various dependencies for which I am able to get a successful build. Other combinations of version numbers may also work just fine. Also note that if you only need TensorFlow, you can probably get by with the following environment file, which should build more quickly.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - -r file:requirements.txt
  - python=3.7 # python=3.6 should also work!
  - tensorboard=2.1
  - tensorflow-gpu=2.1

In the above environment file I have dropped the PyTorch and MXNet dependencies.

@illumidas-agn
Author

Perfect, I'm installing all the packages as we speak.

@illumidas-agn
Author

I installed all the packages, but I'm still getting errors when I use this command:

HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[tensorflow]

Error at the end of log:

"
Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

"

@davidrpugh
Contributor

Did you activate the Conda environment prior to running pip?
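One way to answer that question from the shell is to check where `pip` resolves. A minimal sketch, assuming the environment was activated with `conda activate` (which sets `CONDA_PREFIX`); the function name `pip_in_active_env` is made up for illustration:

```shell
# Succeeds only if the first `pip` on PATH lives inside the active Conda environment.
pip_in_active_env() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "no Conda environment is active" >&2
        return 1
    fi
    pip_path=$(command -v pip) || { echo "pip not found on PATH" >&2; return 1; }
    case "$pip_path" in
        "$CONDA_PREFIX"/*) echo "pip: $pip_path" ;;
        *) echo "pip resolves outside the environment: $pip_path" >&2; return 1 ;;
    esac
}

# Usage: pip_in_active_env && pip install --no-cache-dir horovod
```

If the check fails, the pip build would be compiled against whatever Python and TensorFlow the outer `pip` belongs to, not the ones in the environment.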

@illumidas-agn
Author

Yes, I'm running it inside the environment (as shown by the (envName) xxx@xxx prompt).

@davidrpugh
Contributor

After activating the Conda environment run the command conda list and share the output.

@illumidas-agn
Author

_libgcc_mutex 0.1 main
_tflow_select 2.1.0 gpu
absl-py 0.9.0 py36_0
astor 0.8.0 py36_0
astor 0.8.1
astunparse 1.6.3 py_0
atari-py 0.2.6
attrs 19.3.0 py_0
backcall 0.2.0 py_0
binutils 2.33.1 h53a641e_8 conda-forge
binutils_impl_linux-64 2.33.1 he1b5a44_7 conda-forge
binutils_linux-64 2.33.1 h9595d00_17 conda-forge
blas 1.0 mkl
bleach 3.1.5 py_0
bleach 1.5.0
blinker 1.4 py36_0
bokeh 2.1.1
brotlipy 0.7.0 py36h7b6447c_1000
c-ares 1.15.0 h7b6447c_1001
c-compiler 1.1.1 h516909a_0 conda-forge
ca-certificates 2020.6.24 0
cachetools 4.1.0 py_1
certifi 2020.6.20 py36_0
cffi 1.14.0 py36he30daa8_1
chardet 3.0.4 py36_1003
click 7.1.2 py_0
cloog 0.18.0 0
cloudpickle 1.3.0
cmake 3.18.0
cryptography 2.9.2 py36h1ba5d50_0
cudatoolkit 10.0.130 0
cudatoolkit-dev 10.1.243 h516909a_3 conda-forge
cudnn 7.6.5 cuda10.0_0
cupti 10.0.130 0
cxx-compiler 1.1.1 hc9558a2_0 conda-forge
cycler 0.10.0 py36_0
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
entrypoints 0.3 py36_0
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2
gast 0.2.2 py36_0
gast 0.2.2
gcc_impl_linux-64 7.3.0 hd420e75_5 conda-forge
gcc_linux-64 7.3.0 h553295d_17 conda-forge
glib 2.65.0 h3eb4bd4_0
gmp 6.1.2 h6c8ec71_1
google-auth 1.17.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
grpcio 1.27.2 py36hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gxx_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
gxx_linux-64 7.3.0 h553295d_17 conda-forge
gym 0.17.2
h5py 2.10.0 py36hd6299e0_1
hdf5 1.10.6 hb1b8bf9_0
html5lib 0.9999999
icu 58.2 he6710b0_3
idna 2.10 py_0
importlib-metadata 1.7.0 py36_0
importlib_metadata 1.7.0 0
intel-openmp 2020.1 217
ipykernel 5.3.3 py36h5ca1d4c_0
ipython 7.16.1 py36h5ca1d4c_0
ipython_genutils 0.2.0 py36_0
isl 0.12.2 0
jedi 0.17.1 py36_0
jinja2 2.11.2 py_0
jpeg 9b h024ee3a_2
json5 0.9.5 py_0
jsonschema 3.2.0 py36_0
jupyter_client 6.1.6 py_0
jupyter_core 4.6.3 py36_0
jupyterlab 2.1.5 py_0
jupyterlab_server 1.2.0 py_0
keras-applications 1.0.8 py_1
keras-preprocessing 1.1.0 py_1
kiwisolver 1.2.0 py36hfd86e86_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc 7.2.0 h69d50b8_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.3 hd408876_0
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 9.1.0 hdf63c60_0
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
markdown 3.1.1 py36_0
markupsafe 1.1.1 py36h7b6447c_0
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mistune 0.8.4 py36h7b6447c_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpc 1.0.3 hec55b23_5
mpfr 3.1.5 h11a74b3_2
mpi4py 3.0.3
nbconvert 5.6.1 py36_0
nbformat 5.0.7 py_0
nccl 1.3.5 cuda10.0_0
ncurses 6.2 he6710b0_1
nodejs 10.13.0 he6710b0_0
notebook 6.0.3 py36_0
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
oauthlib 3.1.0 py_0
opencv-python 4.3.0.36
openssl 1.1.1g h7b6447c_0
opt_einsum 3.1.0 py_0
packaging 20.4 py_0
pandoc 2.10 0
pandocfilters 1.4.2 py36_1
parso 0.7.0 py_0
pcre 8.44 he6710b0_0
pexpect 4.8.0 py36_0
pickleshare 0.7.5 py36_0
Pillow 7.2.0
pip 20.1.1 py36_1
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.5 py_0
protobuf 3.12.3 py36he6710b0_0
psutil 5.7.2
ptyprocess 0.6.0 py36_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.7 py_0
pycparser 2.20 py_2
pyglet 1.5.0
pygments 2.6.1 py_0
pyjwt 1.7.1 py36_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
pyrsistent 0.16.0 py36h7b6447c_0
pysocks 1.7.1 py36_0
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
python_abi 3.6 1_cp36m conda-forge
PyYAML 5.3.1
pyzmq 19.0.1 py36he6710b0_1
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 py_0
rsa 4.0 py_0
scipy 1.5.0 py36h0b6359f_0
send2trash 1.5.0 py36_0
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 2.2.1 pyh532a8cf_0
tensorboard 1.15.0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 1.15.2
tensorflow 2.0.0 gpu_py36h6b29c10_0
tensorflow-base 2.0.0 gpu_py36h0ec5d1f_0
tensorflow-estimator 1.15.1
tensorflow-estimator 2.0.0 pyh2649769_0
tensorflow-gpu 2.0.0 h0d30ee6_0
tensorflow-tensorboard 1.5.1
termcolor 1.1.0 py36_1
terminado 0.8.3 py36_0
testpath 0.4.4 py_0
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
traitlets 4.3.3 py36_0
typing-extensions 3.7.4.2
urllib3 1.25.9 py_0
wcwidth 0.2.5 py_0
webencodings 0.5.1 py36_1
werkzeug 0.16.1 py_0
wheel 0.34.2 py36_0
wrapt 1.12.1 py36h7b6447c_1
xz 5.2.5 h7b6447c_0
zeromq 4.3.2 he6710b0_2
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3

@davidrpugh
Contributor

What version of TensorFlow are you trying to use? What changes did you make to the environment file I sketched above? You seem to have two different versions of TensorFlow installed (2.0 and 1.15), as well as different versions of the various CUDA toolkit libraries.
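Duplicates like these can be picked out mechanically. A minimal sketch, assuming `conda list` prints one package per line with the name in the first column; the function name `find_duplicate_packages` is made up for illustration, and the sample rows are copied from the listing above:

```shell
# Print every package name that appears more than once in `conda list`-style
# text read from standard input. Header lines starting with '#' are skipped.
find_duplicate_packages() {
    awk '!/^#/ && NF { print tolower($1) }' | sort | uniq -d
}

# Sample rows from the listing above; in practice: conda list | find_duplicate_packages
find_duplicate_packages <<'EOF'
tensorboard 2.2.1 pyh532a8cf_0
tensorboard 1.15.0
tensorflow 1.15.2
tensorflow 2.0.0 gpu_py36h6b29c10_0
termcolor 1.1.0 py36_1
EOF
```

For the sample rows this prints `tensorboard` and `tensorflow`. A name that shows up twice usually means one copy came from a Conda channel and another from pip.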

@illumidas-agn
Author

  • Trying to use tensorflow==1.15.2.
  • I installed all the packages in the order given in the posts above.

@davidrpugh
Contributor

davidrpugh commented Jul 23, 2020

OK. To use TensorFlow 1.15 you will definitely need to make changes to the environment file that I suggested above. For one, I think you will need older versions of cudatoolkit-dev and nccl. I think the following environment file should work.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit-dev=10.0
  - cudnn=7.6
  - cupti=10.0
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.4
  - nodejs=13
  - pip=20.1
  - python=3.6 
  - tensorboard=1.15
  - tensorflow-gpu=1.15

I forget whether you need to install Keras separately with TensorFlow 1.15 or not.

You will also need to set the following environment variables slightly differently than what is mentioned in the user guide, given that you are using the cudatoolkit-dev approach.

$ export ENV_PREFIX=$PWD/env
$ export HOROVOD_CUDA_HOME=$ENV_PREFIX
$ export HOROVOD_NCCL_HOME=$ENV_PREFIX
$ export HOROVOD_GPU_OPERATIONS=NCCL

Next, create the Conda environment and try building Horovod using the following commands.

conda env create --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
pip install --no-cache-dir horovod==0.19.*

Don't bother explicitly setting HOROVOD_WITH_TENSORFLOW=1 and specifying horovod[tensorflow]. Just install Horovod and let Horovod determine which bindings to build (given only TF is installed it should only build TensorFlow).
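Once the install finishes, the result can be checked before running any real job. A minimal sketch, assuming the Conda environment is active and `python` resolves inside it; the function name `check_horovod_build` and its return codes are made up for illustration:

```shell
# Verify that Horovod was built and that the TensorFlow bindings import.
check_horovod_build() {
    if ! command -v horovodrun >/dev/null 2>&1; then
        echo "horovodrun not found on PATH; is the Conda environment active?" >&2
        return 1
    fi
    # List the frameworks, controllers and tensor operations compiled into this build.
    horovodrun --check-build || return 2
    # Import the TensorFlow bindings and initialise Horovod in a single process.
    python -c "import horovod.tensorflow as hvd; hvd.init(); print('size:', hvd.size())" || return 3
}

# Usage (inside the activated environment): check_horovod_build
```

If the import step fails with "Failed to load the native TensorFlow runtime", the problem most likely lies with the TensorFlow install itself and not with Horovod.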

@illumidas-agn
Author

It managed to install almost every package except the last one:

ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::cudatoolkit-dev-10.0-2'.
LinkError: post-link script failed for package conda-forge::cudatoolkit-dev-10.0-2
running your command again with -v will provide additional information
location of failed script: /home/cogs5/ge/go-explore-master/env/bin/.cudatoolkit-dev-post-link.sh

@illumidas-agn
Author

I ran conda list on my local machine, where go-explore works, and it appears that the CUDA toolkit isn't used there? This is all very confusing...

_libgcc_mutex 0.1 main
absl-py 0.9.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
astor 0.8.1 pypi_0 pypi
atari-py 0.2.6 pypi_0 pypi
baselines 0.1.6 dev_0
blas 1.0 mkl
ca-certificates 2020.6.24 0
certifi 2020.6.20 py36_0
cffi 1.14.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.2.2 pypi_0 pypi
cycler 0.10.0 py36_0
dataclasses 0.7 pypi_0 pypi
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 pypi_0 pypi
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2 pypi_0 pypi
gast 0.2.2 pypi_0 pypi
glib 2.65.0 h3eb4bd4_0
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.30.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gym 0.15.7 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
horovod 0.19.5 pypi_0 pypi
icu 58.2 he6710b0_3
imageio 2.9.0 py_0
importlib-metadata 1.7.0 pypi_0 pypi
intel-openmp 2020.1 217
joblib 0.16.0 pypi_0 pypi
jpeg 9b h024ee3a_2
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.2.0 py36hfd86e86_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
loky 2.8.0 pypi_0 pypi
lz4-c 1.9.2 he6710b0_0
mako 1.1.3 pypi_0 pypi
markdown 3.2.2 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpi 1.0 mpich
mpi4py 3.0.3 py36h028fd6f_0
mpich 3.3.2 hc856adb_0
ncurses 6.2 he6710b0_1
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
nvcc_linux-64 11.0 h4962215_6 nvidia
olefile 0.46 py36_0
opencv-python 4.3.0.36 pypi_0 pypi
openssl 1.1.1g h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
pcre 8.44 he6710b0_0
pillow 7.2.0 py36hb39fc2d_0
pip 20.1.1 py36_1
protobuf 3.12.2 pypi_0 pypi
psutil 5.7.2 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pyglet 1.5.0 pypi_0 pypi
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
pytools 2020.3.1 pypi_0 pypi
pyyaml 5.3.1 pypi_0 pypi
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
scipy 1.5.1 pypi_0 pypi
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 1.15.0 pypi_0 pypi
tensorflow 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
tqdm 4.48.0 pypi_0 pypi
werkzeug 1.0.1 pypi_0 pypi
wheel 0.34.2 py36_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h7b6447c_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h0b5b093_0

@davidrpugh
Contributor

davidrpugh commented Jul 24, 2020

Looks like TensorFlow is still being installed via pip from pypi and not from Conda channels as expected. Are you installing TF by including tensorflow-gpu=1.15 as a dependency in your environment file? Or are you installing TF via pip?
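The pip-installed packages can be filtered out of the listing directly. A minimal sketch, assuming the `conda list` format shown above, where pip-installed rows end in a `pypi` channel column; the function name `list_pypi_packages` is made up for illustration:

```shell
# Print "name version" for every row whose last column is the pypi channel.
list_pypi_packages() {
    awk '!/^#/ && $NF == "pypi" { print $1, $2 }'
}

# Sample rows from the listing above; in practice: conda list | list_pypi_packages
list_pypi_packages <<'EOF'
python 3.6.10 h7579374_2
tensorboard 1.15.0 pypi_0 pypi
tensorflow 1.15.0 pypi_0 pypi
tk 8.6.10 hbc83047_0
EOF
```

For the sample rows this prints the `tensorboard` and `tensorflow` entries only.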

Getting the cudatoolkit-dev package to work properly is a bit tedious and error-prone, which is why the Conda guide recommends that people install the full CUDA Toolkit themselves and then use the nvcc_linux-64 package to configure the environment, but I guess the standard install of the CUDA Toolkit requires root.

There are several builds of cudatoolkit-dev available on conda-forge. From your error I can see that you got the second one, but you needed the third.

cudatoolkit-dev                 10.0               1  conda-forge         
cudatoolkit-dev                 10.0               2  conda-forge         
cudatoolkit-dev                 10.0          py36_0  conda-forge

Replace the cudatoolkit-dev=10.0 with cudatoolkit-dev=10.0=py36_0 to make sure that the specific build number is picked up during install. Try again and let me know if that helps.
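The replacement can be made by hand in environment.yml or scripted. A minimal sketch, assuming the file contains the unpinned entry exactly as `- cudatoolkit-dev=10.0` on its own line; the function name `pin_cudatoolkit_build` is made up for illustration:

```shell
# Append the py36_0 build string to an unpinned cudatoolkit-dev=10.0 entry.
# Lines that are already pinned, and all other lines, are left unchanged.
pin_cudatoolkit_build() {
    env_file=$1
    tmp_file=$(mktemp) || return 1
    sed 's/^\([[:space:]]*-[[:space:]]*cudatoolkit-dev=10\.0\)[[:space:]]*$/\1=py36_0/' \
        "$env_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$env_file"
    rm -f "$tmp_file"
}

# Usage: pin_cudatoolkit_build environment.yml
```

After editing the file, the environment has to be recreated for the pinned build to be picked up.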

@davidrpugh
Contributor

davidrpugh commented Jul 24, 2020

@illumidas-agn I have created (and tested) a Horovod build for TensorFlow 1.15 using cudatoolkit-dev=10.0=py36_0 with support for JupyterLab. All of the required config files can be found here. In particular see the bin/create-conda-env.sh script which automates the environment creation process. Follow the instructions carefully and let me know how you get on!

@davidrpugh
Contributor

davidrpugh commented Jul 24, 2020

@tgaddair When I built the environment for Horovod 0.19.5 with HOROVOD_GPU_OPERATIONS=NCCL set and then ran horovodrun --check-build, it appeared that NCCL support was not built; when I built the environment using HOROVOD_GPU_ALLREDUCE=NCCL and HOROVOD_GPU_BROADCAST=NCCL and then ran horovodrun --check-build, it appeared that NCCL support was built.

Hopefully NCCL support was actually built with HOROVOD_GPU_OPERATIONS=NCCL and this is just a bug in the horovodrun --check-build command.

@tgaddair
Collaborator

Hey @davidrpugh, unfortunately HOROVOD_GPU_OPERATIONS was added recently to master and has not been released yet.

I recommend consulting the stable docs (for the latest release) when not building from source:

https://horovod.readthedocs.io/en/stable/summary_include.html#install
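For a 0.19.x release installed from PyPI, that means falling back to the two per-operation variables mentioned above. A minimal sketch; the function name `install_horovod_0_19` is made up for illustration, and the version pin is the one used earlier in this thread:

```shell
# HOROVOD_GPU_OPERATIONS is only on master at the time of writing, so clear it
# and set the per-operation variables that the 0.19.x build reads instead.
unset HOROVOD_GPU_OPERATIONS
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_GPU_BROADCAST=NCCL

# Build command, wrapped in a function so that sourcing this file does not start a build.
install_horovod_0_19() {
    pip install --no-cache-dir "horovod==0.19.*"
}

# Usage (inside the activated environment): install_horovod_0_19
```

Running horovodrun --check-build afterwards should then report NCCL among the available tensor operations, as described in the comment above.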

@illumidas-agn
Author

illumidas-agn commented Jul 24, 2020

To answer your question: yes, I installed TensorFlow with pip in order to get the specific version.

I'm not sure how you mean for me to install the specific CUDA build. Do I just run conda install cudatoolkit-dev=10.0=py36_0, or do I need to edit the environment.yml file to include it when I build the environment?

EDIT: Got it working below

I apologize for the hassle; I'm brand new to Anaconda and the ML environment.

I'm currently following the sh script that you linked to try to create the environment. I'll let you know how that goes.

@illumidas-agn
Author

I followed the scripts in the directory you showed me and overwrote my environment.yml file with the one from that directory.

Now it is unable to find python3.6:

"ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
"

@illumidas-agn
Author

It worked.....thank you so so so much

(/home/cogs5/ge/go-explore-master/env) cogs5@sci-gpu:~/ge/go-explore-master/policy_based$ ./run_policy_based_ge_montezuma.sh 
2020-08-06 14:58:29,217 - __main__ - INFO - Code hash: cd96b3cb3594c6f2f0a0c9861c4de8068bfaa36f90ffd4969ff4056a3413840c
2020-08-06 14:58:29,218 - __main__ - INFO - Experiment running in /home/cogs5/temp/0004_8a3f9d69e55640a993fad4a5739f8909/
/home/cogs5/ge/go-explore-master/env/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
2020-08-06 14:58:38,692 - __main__ - INFO - setup done
2020-08-06 14:58:38,693 - __main__ - INFO - Initializing logger
2020-08-06 14:58:38,693 - __main__ - INFO - Starting experiment
2020-08-06 14:58:38,693 - __main__ - INFO - Initiate cycle

@davidrpugh
Contributor

@illumidas-agn Hurray! Great. @tgaddair I guess we can close this issue.

@tgaddair
Collaborator

tgaddair commented Aug 6, 2020

Phew! Nice work @davidrpugh! That looked like a tough one.

@tgaddair tgaddair closed this as completed Aug 6, 2020
@niranjansuthar70

niranjansuthar70 commented Feb 27, 2021

Hello @davidrpugh, I am facing similar issues but could not resolve them by following this thread.

I get the following error when trying to run the environment creation script (create-conda-env.sh):

(base) ati-g1@ATI-G1:~/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0$ create-conda-env.sh
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: - By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html

\ By downloading and using the cuDNN conda packages, you accept the terms and conditions of the NVIDIA cuDNN EULA -
https://docs.nvidia.com/deeplearning/cudnn/sla/index.html

done
Installing pip dependencies: / Ran pip subprocess with arguments:
['/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/bin/python', '-m', 'pip', 'install', '-U', '-r', '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/condaenv.9hkklkb7.requirements.txt']
Pip subprocess output:
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting horovod==0.19.*
Downloading horovod-0.19.5.tar.gz (2.9 MB)
Collecting jupyterlab-nvdashboard==0.3.*
Downloading jupyterlab_nvdashboard-0.3.1-py3-none-any.whl (10.0 kB)
Collecting jupyter-tensorboard==0.2.*
Downloading jupyter_tensorboard-0.2.0.tar.gz (15 kB)
Requirement already satisfied, skipping upgrade: cloudpickle in /home/ati-g1/.local/lib/python3.6/site-packages (from horovod==0.19.*->-r file:requirements.txt (line 1)) (1.6.0)
Requirement already satisfied, skipping upgrade: psutil in /home/ati-g1/.local/lib/python3.6/site-packages (from horovod==0.19.*->-r file:requirements.txt (line 1)) (5.8.0)
Requirement already satisfied, skipping upgrade: pyyaml in /home/ati-g1/.local/lib/python3.6/site-packages (from horovod==0.19.*->-r file:requirements.txt (line 1)) (5.4.1)
Requirement already satisfied, skipping upgrade: six in /home/ati-g1/.local/lib/python3.6/site-packages (from horovod==0.19.*->-r file:requirements.txt (line 1)) (1.15.0)
Requirement already satisfied, skipping upgrade: cffi>=1.4.0 in /home/ati-g1/.local/lib/python3.6/site-packages (from horovod==0.19.*->-r file:requirements.txt (line 1)) (1.14.5)
Collecting bokeh<2
Downloading bokeh-1.4.0.tar.gz (32.4 MB)
Collecting pynvml
Downloading pynvml-8.0.4-py3-none-any.whl (36 kB)
Collecting jupyter-server-proxy
Downloading jupyter_server_proxy-1.6.0-py3-none-any.whl (20 kB)
Requirement already satisfied, skipping upgrade: notebook>=5.0 in ./env/lib/python3.6/site-packages (from jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (6.2.0)
Requirement already satisfied, skipping upgrade: pycparser in /home/ati-g1/.local/lib/python3.6/site-packages (from cffi>=1.4.0->horovod==0.19.*->-r file:requirements.txt (line 1)) (2.20)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in ./env/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (2.8.1)
Requirement already satisfied, skipping upgrade: Jinja2>=2.7 in ./env/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (2.11.3)
Requirement already satisfied, skipping upgrade: numpy>=1.7.1 in /home/ati-g1/.local/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (1.18.5)
Requirement already satisfied, skipping upgrade: pillow>=4.0 in /home/ati-g1/.local/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (8.1.0)
Requirement already satisfied, skipping upgrade: packaging>=16.8 in ./env/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (20.9)
Requirement already satisfied, skipping upgrade: tornado>=4.3 in ./env/lib/python3.6/site-packages (from bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (6.1)
Collecting simpervisor>=0.4
Downloading simpervisor-0.4-py3-none-any.whl (5.7 kB)
Collecting aiohttp
Downloading aiohttp-3.7.4-cp36-cp36m-manylinux2014_x86_64.whl (1.3 MB)
Requirement already satisfied, skipping upgrade: jupyter-core>=4.6.1 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (4.7.1)
Requirement already satisfied, skipping upgrade: nbformat in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (5.1.2)
Requirement already satisfied, skipping upgrade: ipykernel in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (5.5.0)
Requirement already satisfied, skipping upgrade: argon2-cffi in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (20.1.0)
Requirement already satisfied, skipping upgrade: pyzmq>=17 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (22.0.3)
Requirement already satisfied, skipping upgrade: jupyter-client>=5.3.4 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (6.1.11)
Requirement already satisfied, skipping upgrade: traitlets>=4.2.1 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (4.3.3)
Requirement already satisfied, skipping upgrade: ipython-genutils in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.2.0)
Requirement already satisfied, skipping upgrade: Send2Trash>=1.5.0 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (1.5.0)
Requirement already satisfied, skipping upgrade: prometheus-client in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.9.0)
Requirement already satisfied, skipping upgrade: terminado>=0.8.3 in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.9.2)
Requirement already satisfied, skipping upgrade: nbconvert in ./env/lib/python3.6/site-packages (from notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (6.0.7)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in ./env/lib/python3.6/site-packages (from Jinja2>=2.7->bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (1.1.1)
Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /home/ati-g1/.local/lib/python3.6/site-packages (from packaging>=16.8->bokeh<2->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (2.4.7)
Collecting multidict<7.0,>=4.5
Downloading multidict-5.1.0-cp36-cp36m-manylinux2014_x86_64.whl (141 kB)
Requirement already satisfied, skipping upgrade: typing-extensions>=3.6.5 in /home/ati-g1/.local/lib/python3.6/site-packages (from aiohttp->jupyter-server-proxy->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (3.7.4.3)
Collecting chardet<4.0,>=2.0
Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting idna-ssl>=1.0; python_version < "3.7"
Downloading idna-ssl-1.1.0.tar.gz (3.4 kB)
Requirement already satisfied, skipping upgrade: attrs>=17.3.0 in ./env/lib/python3.6/site-packages (from aiohttp->jupyter-server-proxy->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (20.3.0)
Collecting yarl<2.0,>=1.0
Downloading yarl-1.6.3-cp36-cp36m-manylinux2014_x86_64.whl (293 kB)
Collecting async-timeout<4.0,>=3.0
Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Requirement already satisfied, skipping upgrade: jsonschema!=2.5.0,>=2.4 in ./env/lib/python3.6/site-packages (from nbformat->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (3.2.0)
Requirement already satisfied, skipping upgrade: ipython>=5.0.0 in ./env/lib/python3.6/site-packages (from ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (7.16.1)
Requirement already satisfied, skipping upgrade: decorator in ./env/lib/python3.6/site-packages (from traitlets>=4.2.1->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (4.4.2)
Requirement already satisfied, skipping upgrade: ptyprocess; os_name != "nt" in ./env/lib/python3.6/site-packages (from terminado>=0.8.3->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.7.0)
Requirement already satisfied, skipping upgrade: nbclient<0.6.0,>=0.5.0 in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.5.3)
Requirement already satisfied, skipping upgrade: jupyterlab-pygments in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.1.2)
Requirement already satisfied, skipping upgrade: defusedxml in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.6.0)
Requirement already satisfied, skipping upgrade: entrypoints>=0.2.2 in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.3)
Requirement already satisfied, skipping upgrade: pygments>=2.4.1 in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (2.8.0)
Requirement already satisfied, skipping upgrade: mistune<2,>=0.8.1 in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.8.4)
Requirement already satisfied, skipping upgrade: pandocfilters>=1.4.1 in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (1.4.2)
Requirement already satisfied, skipping upgrade: testpath in ./env/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.4.4)
Requirement already satisfied, skipping upgrade: bleach in /home/ati-g1/.local/lib/python3.6/site-packages (from nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (1.5.0)
Requirement already satisfied, skipping upgrade: idna>=2.0 in ./env/lib/python3.6/site-packages (from idna-ssl>=1.0; python_version < "3.7"->aiohttp->jupyter-server-proxy->jupyterlab-nvdashboard==0.3.*->-r file:requirements.txt (line 2)) (2.10)
Requirement already satisfied, skipping upgrade: setuptools in /home/ati-g1/.local/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (53.0.0)
Requirement already satisfied, skipping upgrade: importlib-metadata; python_version < "3.8" in /home/ati-g1/.local/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (3.4.0)
Requirement already satisfied, skipping upgrade: pyrsistent>=0.14.0 in ./env/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.17.3)
Requirement already satisfied, skipping upgrade: jedi>=0.10 in ./env/lib/python3.6/site-packages (from ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.17.2)
Requirement already satisfied, skipping upgrade: backcall in ./env/lib/python3.6/site-packages (from ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.2.0)
Requirement already satisfied, skipping upgrade: pickleshare in ./env/lib/python3.6/site-packages (from ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.7.5)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in ./env/lib/python3.6/site-packages (from ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (4.8.0)
Requirement already satisfied, skipping upgrade: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in ./env/lib/python3.6/site-packages (from ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (3.0.16)
Requirement already satisfied, skipping upgrade: nest-asyncio in ./env/lib/python3.6/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (1.4.3)
Requirement already satisfied, skipping upgrade: async-generator in ./env/lib/python3.6/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (1.10)
Requirement already satisfied, skipping upgrade: html5lib!=0.9999,!=0.99999,<0.99999999,>=0.999 in /home/ati-g1/.local/lib/python3.6/site-packages (from bleach->nbconvert->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.9999999)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /home/ati-g1/.local/lib/python3.6/site-packages (from importlib-metadata; python_version < "3.8"->jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (3.4.0)
Requirement already satisfied, skipping upgrade: parso<0.8.0,>=0.7.0 in ./env/lib/python3.6/site-packages (from jedi>=0.10->ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.7.1)
Requirement already satisfied, skipping upgrade: wcwidth in ./env/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=5.0.0->ipykernel->notebook>=5.0->jupyter-tensorboard==0.2.*->-r file:requirements.txt (line 3)) (0.2.5)
Skipping wheel build for horovod, due to binaries being disabled for it.
Building wheels for collected packages: jupyter-tensorboard, bokeh, idna-ssl
Building wheel for jupyter-tensorboard (setup.py): started
Building wheel for jupyter-tensorboard (setup.py): finished with status 'done'
Created wheel for jupyter-tensorboard: filename=jupyter_tensorboard-0.2.0-py2.py3-none-any.whl size=15258 sha256=68b200b7e7af6621785339b53a7c38fd3f2d283629d6423187c16efd25585cdd
Stored in directory: /tmp/pip-ephem-wheel-cache-kym394y2/wheels/5b/ff/1d/88f2511c564a6b40eff44157694d8d9581e039e18f03682b42
Building wheel for bokeh (setup.py): started
Building wheel for bokeh (setup.py): finished with status 'done'
Created wheel for bokeh: filename=bokeh-1.4.0-py3-none-any.whl size=23689200 sha256=75aa1f354c69521545661bc0281cc0a2ef4db5e4d04fb4020df59f986c628038
Stored in directory: /tmp/pip-ephem-wheel-cache-kym394y2/wheels/b6/72/72/a6a223f72a9b02a4922a3c2fec55b2f65567254d398f6c5f74
Building wheel for idna-ssl (setup.py): started
Building wheel for idna-ssl (setup.py): finished with status 'done'
Created wheel for idna-ssl: filename=idna_ssl-1.1.0-py3-none-any.whl size=3160 sha256=6515b0dfe7892d104c44e3fe5c60c933fde2ac559351a6fd5e6475c21c9c4dac
Stored in directory: /tmp/pip-ephem-wheel-cache-kym394y2/wheels/6a/f5/9c/f8331a854f7a8739cf0e74c13854e4dd7b1af11b04fe1dde13
Successfully built jupyter-tensorboard bokeh idna-ssl
Installing collected packages: horovod, bokeh, pynvml, simpervisor, multidict, chardet, idna-ssl, yarl, async-timeout, aiohttp, jupyter-server-proxy, jupyterlab-nvdashboard, jupyter-tensorboard
Running setup.py install for horovod: started
Running setup.py install for horovod: still running...
Running setup.py install for horovod: finished with status 'done'
Attempting uninstall: chardet
Found existing installation: chardet 4.0.0
Uninstalling chardet-4.0.0:
Successfully uninstalled chardet-4.0.0
Successfully installed aiohttp-3.7.4 async-timeout-3.0.1 bokeh-1.4.0 chardet-3.0.4 horovod-0.19.5 idna-ssl-1.1.0 jupyter-server-proxy-1.6.0 jupyter-tensorboard-0.2.0 jupyterlab-nvdashboard-0.3.1 multidict-5.1.0 pynvml-8.0.4 simpervisor-0.4 yarl-1.6.3

done

To activate this environment, use

$ conda activate /home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env

To deactivate an active environment, use

$ conda deactivate

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

$ conda init <SHELL_NAME>

Currently supported shells are:

  • bash
  • fish
  • tcsh
  • xonsh
  • zsh
  • powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.
[LabBuildApp] JupyterLab 0.35.4
[LabBuildApp] Cleaning /home/ati-g1/anaconda3/share/jupyter/lab
Cleaning /home/ati-g1/anaconda3/share/jupyter/lab...
Success!
[LabBuildApp] Building in /home/ati-g1/anaconda3/share/jupyter/lab
Traceback (most recent call last):
  File "/home/ati-g1/anaconda3/bin/jupyter-lab", line 11, in <module>
    sys.exit(main())
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 1760, in start
    super(NotebookApp, self).start()
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyterlab/labapp.py", line 78, in start
    command=command, logger=self.log)
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyterlab/commands.py", line 271, in build
    _node_check(logger)
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyterlab/commands.py", line 1481, in _node_check
    node = which('node')
  File "/home/ati-g1/anaconda3/lib/python3.7/site-packages/jupyterlab_server/process.py", line 59, in which
    raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.
(base) ati-g1@ATI-G1:~/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0$

@davidrpugh
Contributor

@niranjansuthar70 Looks like Horovod built correctly in the first instance, but your Conda has not been properly initialized, so the conda activate command does not work. As it says above, you need to run the conda init command to initialize Conda for your shell (Bash is the default). Then everything should run as expected.

@niranjansuthar70

niranjansuthar70 commented Mar 1, 2021

Hey @davidrpugh, thanks for your response.
I have installed it and am getting this error while running the script:

(/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G1:~/dataset/ObjectRecognition$ python3 train.py
Goodbye, World!
2021-03-01 10:00:14.133594: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

WARNING:tensorflow:From /home/ati-g1/dataset/ObjectRecognition/nets/mobilenet.py:388: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

Traceback (most recent call last):
  File "train.py", line 14, in <module>
    import horovod.tensorflow as hvd
  File "/home/ati-g1/.local/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 26, in <module>
    from horovod.tensorflow import elastic
  File "/home/ati-g1/.local/lib/python3.6/site-packages/horovod/tensorflow/elastic.py", line 24, in <module>
    from horovod.tensorflow.functions import broadcast_object, broadcast_object_fn, broadcast_variables
  File "/home/ati-g1/.local/lib/python3.6/site-packages/horovod/tensorflow/functions.py", line 24, in <module>
    from horovod.tensorflow.mpi_ops import allgather, broadcast
  File "/home/ati-g1/.local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 46, in <module>
    MPI_LIB = _load_library('mpi_lib' + get_ext_suffix())
  File "/home/ati-g1/.local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 42, in _load_library
    library = load_library.load_op_library(filename)
  File "/home/ati-g1/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libmpi.so.40: cannot open shared object file: No such file or directory
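
Note that every Horovod and TensorFlow frame in the traceback above lives under /home/ati-g1/.local/lib/python3.6/site-packages, the per-user site directory, rather than under the Conda environment's own site-packages. One way to confirm which copy Python picks up, and whether the MPI library named in the error can be loaded at all, is sketched below. This is only a diagnostic sketch: the helper names (is_under, describe_origin, can_load) are made up for illustration and are not part of Horovod or Conda.

```python
import ctypes
import importlib.util
import site
import sys
from pathlib import Path


def is_under(path, root):
    """Return True if `path` lies inside the directory `root`."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def describe_origin(module_name):
    """Classify where a top-level module would be imported from.

    For a top-level name, find_spec only locates the module; it does not import it.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return "not found"
    origin = spec.origin
    if origin is None or not Path(origin).exists():
        return "no file on disk (built-in or namespace package)"
    if is_under(origin, site.getusersitepackages()):
        return "user site: " + origin
    if is_under(origin, sys.prefix):
        return "active environment: " + origin
    return "elsewhere: " + origin


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for name in ("horovod", "tensorflow"):
        print(name, "->", describe_origin(name))
    print("libmpi.so.40 loadable:", can_load("libmpi.so.40"))
```

If horovod resolves to the user site, starting Python with the -s flag or with PYTHONNOUSERSITE=1 set should keep ~/.local out of sys.path, so that only the packages inside the environment are visible.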

@niranjansuthar70

niranjansuthar70 commented Mar 1, 2021

Hello @davidrpugh, one more question: since I am using a Conda env for distributed training, should I run the "mpi server1,2" command on both machines? (I have 2 machines.) Should I set up the same env on both machines?

Also, "horovodrun --check-build" is giving me this error:

(/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G1:~/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0$ horovodrun --check-build
bash: /home/ati-g1/.local/bin/horovodrun: /home/ati-g1/anaconda3/envs/tf1-nv/bin/python: bad interpreter: No such file or directory
(/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G1:~/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0$
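
The "bad interpreter" message means that the first line (the #! line) of /home/ati-g1/.local/bin/horovodrun names /home/ati-g1/anaconda3/envs/tf1-nv/bin/python, and that file could not be found. The launcher being run is therefore the copy in ~/.local/bin, apparently left over from an earlier install, and not one inside the active environment. A small sketch for inspecting this follows; read_shebang and check_launcher are hypothetical helper names.

```python
import os
import shutil


def read_shebang(script_path):
    """Return the interpreter path from a script's '#!' line, or None if absent."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def check_launcher(name):
    """Report which launcher `name` resolves to on PATH and whether its interpreter exists."""
    script = shutil.which(name)
    if script is None:
        return f"{name}: not on PATH"
    interpreter = read_shebang(script)
    if interpreter is None:
        return f"{script}: no shebang line"
    status = "exists" if os.path.exists(interpreter) else "MISSING"
    return f"{script} -> {interpreter} ({status})"


if __name__ == "__main__":
    print(check_launcher("horovodrun"))
```

If the interpreter is reported as MISSING, the launcher belongs to an environment that no longer exists, and reinstalling Horovod inside the active environment (rather than into ~/.local) is the likely remedy.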

@davidrpugh
Contributor

@niranjansuthar70 I am very confused. Your environment prefix seems to suggest that you have installed tensorflow-gpu==1.15 with cudatoolkit-dev=10.0; however, the output above indicates that TF is loading the CUDA 11.0 runtime library.

@niranjansuthar70

niranjansuthar70 commented Mar 1, 2021

Hello @davidrpugh, I have installed Anaconda in my home directory and created the env in another directory from the project you provided, using its environment.yml file.

Edited: meanwhile "nvidia-smi" shows CUDA 11.2.

I ran exactly the same export commands:

export ENV_PREFIX=$PWD/env
export CUDA_HOME=$ENV_PREFIX
export NCCL_HOME=$ENV_PREFIX
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$NCCL_HOME
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_GPU_BROADCAST=NCCL

@davidrpugh
Contributor

@niranjansuthar70 I am sorry, but I cannot help you as I don't understand exactly what you have done. You seem to have CUDA 11.2 installed on your system, and this is conflicting with the CUDA version installed in your Conda environment for reasons that I cannot easily debug remotely.

@niranjansuthar70

niranjansuthar70 commented Mar 1, 2021

@davidrpugh but nvcc --version gives output v10.0.
Sir, you can tell me which commands to run to give you a better idea of what I have done.
As per the references I found, the CUDA version shown by nvidia-smi can differ from the CUDA version we want to compile against, i.e. CUDA 10.0.
Can you please provide steps to set up Horovod from scratch and to uninstall CUDA 11.2? I am not able to find a CUDA 11.2 path on my system.

@davidrpugh
Contributor

You said that nvidia-smi reports your CUDA version as 11.2. Is this the case?

@niranjansuthar70

Yes, here is the nvidia-smi output:

(base) ati-g1@ATI-G2:~$ nvidia-smi
Mon Mar 1 16:32:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   58C    P8    18W / 220W |    354MiB /  7979MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1288      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1406      G   /usr/bin/gnome-shell               76MiB |
|    0   N/A  N/A      1652      G   /usr/lib/xorg/Xorg                162MiB |
|    0   N/A  N/A      1800      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      2257      G   ...gAAAAAAAAA --shared-files       47MiB |
+-----------------------------------------------------------------------------+

@davidrpugh
Contributor

But when you activate the environment in which you have installed cudatoolkit-dev=10.0, you get nvcc --version returning 10.0? Hopefully yes; this is the expected behavior.
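
The two numbers come from different places: the nvidia-smi banner reports the newest CUDA version the installed driver supports, while nvcc reports the toolkit that is first on PATH (here, the one inside the Conda environment). A minimal sketch for printing both side by side; the helper names `parse_smi_cuda` and `parse_nvcc_release` are made up for illustration, and the parsing assumes the usual "CUDA Version: X.Y" and "release X.Y" wording in the two tools' output.

```shell
#!/bin/sh
# Hypothetical helper: pull "X.Y" out of the "CUDA Version: X.Y" banner
# that nvidia-smi prints (text is read from stdin).
parse_smi_cuda() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: pull "X.Y" out of nvcc's "release X.Y" line (stdin).
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only query the tools that are actually present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "driver supports up to CUDA: $(nvidia-smi | parse_smi_cuda)"
fi
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit on PATH (nvcc):     $(nvcc --version | parse_nvcc_release)"
fi
```

A driver value that is higher than the nvcc value (11.2 versus 10.0 in this thread) is normal and not a conflict on its own.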

@niranjansuthar70

Yes, I am getting CUDA 10.0 in the env I have created.

@davidrpugh
Contributor

OK, now I understand. Can you provide the exact sequence of commands that you used to create the environment for Horovod? From the above, it looks like maybe the pip install portion did not install into your Conda environment but rather installed somewhere in your home directory.
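
One way to test that suspicion is to look at which horovodrun launcher the shell picks up and which Python interpreter its first line points to. This is only a sketch: `shebang_of` is a hypothetical helper, and it assumes the launcher is a script whose first line starts with "#!", as pip-installed entry points normally are.

```shell
#!/bin/sh
# Hypothetical helper: print the interpreter path from a script's "#!" line.
shebang_of() {
    head -n 1 "$1" | sed -n 's/^#! *//p'
}

if command -v horovodrun >/dev/null 2>&1; then
    launcher="$(command -v horovodrun)"
    echo "horovodrun resolves to: $launcher"
    echo "its interpreter is:     $(shebang_of "$launcher")"
    # CONDA_PREFIX is set by "conda activate" to the active env's directory.
    echo "active env prefix:      ${CONDA_PREFIX:-<none>}"
fi
```

If the launcher sits under ~/.local/bin, or its interpreter lies outside the active env prefix, the pip step most likely installed Horovod outside the environment. That would match the earlier "bad interpreter" message, where /home/ati-g1/.local/bin/horovodrun pointed at /home/ati-g1/anaconda3/envs/tf1-nv/bin/python.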

@niranjansuthar70

Steps I have taken:

  1. Install Anaconda and update to the latest version --- dir /home/ati-g1/anaconda3/bin/conda
  2. Copy your project "horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0" to a new directory -- /home/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0
  3. Copy "env.yml, req.txt, create env.sh, postbuild" into this folder "horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0"
  4. Run these commands:
    bash --login
    conda env create --prefix $PWD/env --file environment.yml --force
    conda activate $PWD/env
    source postBuild
  5. Run the exports:
    export ENV_PREFIX=$PWD/env
    export CUDA_HOME=$ENV_PREFIX
    export NCCL_HOME=$ENV_PREFIX
    export HOROVOD_CUDA_HOME=$CUDA_HOME
    export HOROVOD_NCCL_HOME=$NCCL_HOME
    export HOROVOD_GPU_ALLREDUCE=NCCL
    export HOROVOD_GPU_BROADCAST=NCCL

@davidrpugh
Contributor

Is your conda properly initialized? Can you use the conda activate command to activate environments?

@niranjansuthar70

niranjansuthar70 commented Mar 1, 2021

Yes, it is properly initialized and I can use conda activate.
This time I started over with Miniconda, and I am getting this while activating:

(/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G2:~$ conda activate /home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env

ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/bin/xzmore' is not a directory
ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/bin/xzmore' is not a directory
ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/include/zmq_utils.h' is not a directory

@niranjansuthar70

@davidrpugh can you walk me through it step by step so I can try once from scratch? Let's say I have a new machine with only Ubuntu 18.04 and the latest NVIDIA driver installed.

@davidrpugh
Contributor

From a clean Ubuntu machine, I would first install Miniconda in my home directory using the instructions in this GitHub repo, which will install Miniconda AND properly initialize Conda for bash. Then I would run the following instructions.

git clone --single-branch --branch tensorflow-gpu-1.15-cudatoolkit-dev-10.0 https://github.com/kaust-vislab/horovod-gpu-data-science-project.git
cd horovod-gpu-data-science-project
./bin/create-conda-env.sh

I would then check the build with the following commands.

conda activate ./env
horovodrun --check-build
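
The check can also be scripted. The sketch below assumes the checklist layout that horovodrun --check-build prints, with "[X]" marking an available item and "[ ]" a missing one; `build_has` is a hypothetical helper name, not part of Horovod.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named item is ticked ("[X]") in
# `horovodrun --check-build` output read from stdin.
build_has() {
    grep -Eq "^[[:space:]]*\[X\][[:space:]]+${1}[[:space:]]*$"
}

if command -v horovodrun >/dev/null 2>&1; then
    if horovodrun --check-build | build_has NCCL; then
        echo "NCCL tensor operations are available"
    else
        echo "NCCL is missing from this build" >&2
    fi
fi
```

For a GPU build made with the NCCL settings discussed in this thread, the NCCL line is the one to watch; a build that only ticks MPI under the tensor operations was compiled without NCCL support.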

@niranjansuthar70

niranjansuthar70 commented Mar 2, 2021

Hello @davidrpugh, thanks for your response.

  1. I have properly installed Miniconda in my home directory.
  2. Then I initialized conda for bash using conda init bash, downloaded the project using the given link, and created the env.
  3. All packages were installed properly; then I ran conda init bash again and restarted the shell.
  4. I opened the project folder, opened a terminal inside it, and activated the conda env using conda activate ./env.
  5. The env was properly activated, but when I run "horovodrun --check-build" it gives me this error:

(base) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ conda init bash
no change /home/ati-g1/miniconda3/condabin/conda
no change /home/ati-g1/miniconda3/bin/conda
no change /home/ati-g1/miniconda3/bin/conda-env
no change /home/ati-g1/miniconda3/bin/activate
no change /home/ati-g1/miniconda3/bin/deactivate
no change /home/ati-g1/miniconda3/etc/profile.d/conda.sh
no change /home/ati-g1/miniconda3/etc/fish/conf.d/conda.fish
no change /home/ati-g1/miniconda3/shell/condabin/Conda.psm1
no change /home/ati-g1/miniconda3/shell/condabin/conda-hook.ps1
no change /home/ati-g1/miniconda3/lib/python3.8/site-packages/xontrib/conda.xsh
no change /home/ati-g1/miniconda3/etc/profile.d/conda.csh
no change /home/ati-g1/.bashrc
No action taken.
(base) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ conda activate ./env
(/home/ati-g1/horovod-gpu-data-science-project/env) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ horovodrun --check-build
Traceback (most recent call last):
  File "/home/ati-g1/horovod-gpu-data-science-project/env/bin/horovodrun", line 18, in <module>
    from horovod.run.runner import run_commandline
(/home/ati-g1/horovod-gpu-data-science-project/env) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ ^C

@niranjansuthar70

niranjansuthar70 commented Mar 2, 2021

(/home/ati-g1/horovod-gpu-data-science-project/env) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ export
declare -x ADDR2LINE="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-addr2line"
declare -x AR="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ar"
declare -x AS="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-as"
declare -x BUILD="x86_64-conda-linux-gnu"
declare -x CC="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-cc"
declare -x CC_FOR_BUILD="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-cc"
declare -x CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x CLUTTER_IM_MODULE="xim"
declare -x CMAKE_ARGS="-DCMAKE_AR=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ar -DCMAKE_CXX_COMPILER_AR=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ar -DCMAKE_C_COMPILER_AR=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ar -DCMAKE_RANLIB=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ranlib -DCMAKE_CXX_COMPILER_RANLIB=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ranlib -DCMAKE_C_COMPILER_RANLIB=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ranlib -DCMAKE_LINKER=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ld -DCMAKE_STRIP=/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-strip"
declare -x CMAKE_PREFIX_PATH="/home/ati-g1/horovod-gpu-data-science-project/env:/home/ati-g1/horovod-gpu-data-science-project/env/x86_64-conda-linux-gnu/sysroot/usr"
declare -x COLORTERM="truecolor"
declare -x CONDA_BUILD_SYSROOT="/home/ati-g1/horovod-gpu-data-science-project/env/x86_64-conda-linux-gnu/sysroot"
declare -x CONDA_DEFAULT_ENV="/home/ati-g1/horovod-gpu-data-science-project/env"
declare -x CONDA_EXE="/home/ati-g1/miniconda3/bin/conda"
declare -x CONDA_PREFIX="/home/ati-g1/horovod-gpu-data-science-project/env"
declare -x CONDA_PREFIX_1="/home/ati-g1/miniconda3"
declare -x CONDA_PROMPT_MODIFIER="(/home/ati-g1/horovod-gpu-data-science-project/env) "
declare -x CONDA_PYTHON_EXE="/home/ati-g1/miniconda3/bin/python"
declare -x CONDA_SHLVL="2"
declare -x CPP="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-cpp"
declare -x CPPFLAGS="-DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x CXX="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-c++"
declare -x CXXFILT="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-c++filt"
declare -x CXXFLAGS="-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x CXX_FOR_BUILD="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-c++"
declare -x DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/1000/bus,guid=d89dced45ce9b32a5a7ae2bc603cc1d7"
declare -x DBUS_STARTER_ADDRESS="unix:path=/run/user/1000/bus,guid=d89dced45ce9b32a5a7ae2bc603cc1d7"
declare -x DBUS_STARTER_BUS_TYPE="session"
declare -x DEBUG_CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x DEBUG_CPPFLAGS="-D_DEBUG -D_FORTIFY_SOURCE=2 -Og -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x DEBUG_CXXFLAGS="-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/ati-g1/horovod-gpu-data-science-project/env/include"
declare -x DESKTOP_SESSION="ubuntu"
declare -x DISPLAY=":1"
declare -x ELFEDIT="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-elfedit"
declare -x GCC="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc"
declare -x GCC_AR="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ar"
declare -x GCC_NM="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-nm"
declare -x GCC_RANLIB="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gcc-ranlib"
declare -x GDMSESSION="ubuntu"
declare -x GNOME_DESKTOP_SESSION_ID="this-is-deprecated"
declare -x GNOME_SHELL_SESSION_MODE="ubuntu"
declare -x GNOME_TERMINAL_SCREEN="/org/gnome/Terminal/screen/ca91e407_234d_4441_8440_b3e16fdde4ed"
declare -x GNOME_TERMINAL_SERVICE=":1.251"
declare -x GPG_AGENT_INFO="/run/user/1000/gnupg/S.gpg-agent:0:1"
declare -x GPROF="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-gprof"
declare -x GTK_IM_MODULE="ibus"
declare -x GTK_MODULES="gail:atk-bridge"
declare -x GXX="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-g++"
declare -x HOME="/home/ati-g1"
declare -x HOST="x86_64-conda-linux-gnu"
declare -x IM_CONFIG_PHASE="2"
declare -x INVOCATION_ID="4608b023abb24b4c9c0f01454bf44441"
declare -x JOURNAL_STREAM="9:53267"
declare -x LANG="en_US.UTF-8"
declare -x LD="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ld"
declare -x LDFLAGS="-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/ati-g1/horovod-gpu-data-science-project/env/lib -Wl,-rpath-link,/home/ati-g1/horovod-gpu-data-science-project/env/lib -L/home/ati-g1/horovod-gpu-data-science-project/env/lib"
declare -x LD_GOLD="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ld.gold"
declare -x LESSCLOSE="/usr/bin/lesspipe %s %s"
declare -x LESSOPEN="| /usr/bin/lesspipe %s"
declare -x LOGNAME="ati-g1"
declare -x LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:"
declare -x MANAGERPID="2000"
declare -x NM="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-nm"
declare -x OBJCOPY="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-objcopy"
declare -x OBJDUMP="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-objdump"
declare -x OLDPWD
declare -x PATH="/home/ati-g1/horovod-gpu-data-science-project/env/bin:/home/ati-g1/miniconda3/condabin:/home/ati-g1/.local/bin:/home/ati-g1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
declare -x PWD="/home/ati-g1/horovod-gpu-data-science-project"
declare -x QT4_IM_MODULE="xim"
declare -x QT_ACCESSIBILITY="1"
declare -x QT_IM_MODULE="ibus"
declare -x RANLIB="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-ranlib"
declare -x READELF="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-readelf"
declare -x SESSION_MANAGER="local/ATI-G1:@/tmp/.ICE-unix/2030,unix/ATI-G1:/tmp/.ICE-unix/2030"
declare -x SHELL="/bin/bash"
declare -x SHLVL="1"
declare -x SIZE="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-size"
declare -x SSH_AGENT_PID="2125"
declare -x SSH_AUTH_SOCK="/run/user/1000/keyring/ssh"
declare -x STRINGS="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-strings"
declare -x STRIP="/home/ati-g1/horovod-gpu-data-science-project/env/bin/x86_64-conda-linux-gnu-strip"
declare -x TERM="xterm-256color"
declare -x TEXTDOMAIN="im-config"
declare -x TEXTDOMAINDIR="/usr/share/locale/"
declare -x USER="ati-g1"
declare -x USERNAME="ati-g1"
declare -x VTE_VERSION="5202"
declare -x WINDOWPATH="2"
declare -x XAUTHORITY="/run/user/1000/gdm/Xauthority"
declare -x XDG_CONFIG_DIRS="/etc/xdg/xdg-ubuntu:/etc/xdg"
declare -x XDG_CURRENT_DESKTOP="ubuntu:GNOME"
declare -x XDG_DATA_DIRS="/usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop"
declare -x XDG_MENU_PREFIX="gnome-"
declare -x XDG_RUNTIME_DIR="/run/user/1000"
declare -x XDG_SEAT="seat0"
declare -x XDG_SESSION_DESKTOP="ubuntu"
declare -x XDG_SESSION_ID="2"
declare -x XDG_SESSION_TYPE="x11"
declare -x XDG_VTNR="2"
declare -x XMODIFIERS="@im=ibus"
declare -x _CE_CONDA=""
declare -x _CE_M=""
declare -x _CONDA_PYTHON_SYSCONFIGDATA_NAME="_sysconfigdata_x86_64_conda_cos6_linux_gnu"
declare -x build_alias="x86_64-conda-linux-gnu"
declare -x host_alias="x86_64-conda-linux-gnu"
(/home/ati-g1/horovod-gpu-data-science-project/env) ati-g1@ATI-G1:~/horovod-gpu-data-science-project$ 

@davidrpugh
Contributor

FYI you only need to run the conda init command once. You do not need to run it every time you wish to activate your environment.

I am having trouble reading the output because of the formatting. Can you please post the output of the following commands?

cd ~/horovod-gpu-data-science-project
conda activate ./env
horovodrun --check-build

Also run the following command in the activated environment and post the output.

conda list

@niranjansuthar70

niranjansuthar70 commented Mar 2, 2021

Hello @davidrpugh, do I need to run those export commands before or after creating the env? I am trying to reinstall everything, starting from miniconda3, on a new machine.

These commands?

export CUDA_HOME=$ENV_PREFIX
export NCCL_HOME=$ENV_PREFIX
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$NCCL_HOME
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_GPU_BROADCAST=NCCL

And here is the output of the commands you asked for:
conda activate ./env

(base) ati-g3@ATI-G3:~/horovod-gpu-data-science-project$ conda activate /home/ati-g3/horovod-gpu-data-science-project/env
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/bin/xzmore' is not a directory
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/bin/xzmore' is not a directory
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/lib/tkConfig.sh' is not a directory
ln: target '/home/ati-g3/horovod-gpu-data-science-project/env/include/zmq_utils.h' is not a directory

horovodrun --check-build

(/home/ati-g3/horovod-gpu-data-science-project/env) ati-g3@ATI-G3:~/horovod-gpu-data-science-project$ horovodrun --check-build
Horovod v0.19.5:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [ ] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [ ] Gloo    

conda list

(/home/ati-g3/horovod-gpu-data-science-project/env) ati-g3@ATI-G3:~/horovod-gpu-data-science-project$ conda list
# packages in environment at /home/ati-g3/horovod-gpu-data-science-project/env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_tflow_select             2.1.0                       gpu  
absl-py                   0.11.0           py36h5fab9bb_0    conda-forge
aiohttp                   3.7.4                    pypi_0    pypi
argon2-cffi               20.1.0           py36h8f6f2f9_2    conda-forge
astor                     0.8.1              pyh9f0ad1d_0    conda-forge
async-timeout             3.0.1                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
attrs                     20.3.0             pyhd3deb0d_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.1                      py_0    conda-forge
binutils                  2.35.1               hdd6e379_2    conda-forge
binutils_impl_linux-64    2.35.1               h193b22a_2    conda-forge
binutils_linux-64         2.35                h67ddf6f_30    conda-forge
bleach                    3.3.0              pyh44b312d_0    conda-forge
bokeh                     1.4.0                    pypi_0    pypi
brotlipy                  0.7.0           py36h8f6f2f9_1001    conda-forge
c-ares                    1.17.1               h36c2ea0_0    conda-forge
c-compiler                1.1.3                h7f98852_0    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cached-property           1.5.1                      py_0    conda-forge
certifi                   2020.12.5        py36h5fab9bb_1    conda-forge
cffi                      1.14.5           py36hc120d54_0    conda-forge
chardet                   3.0.4                    pypi_0    pypi
cloudpickle               1.6.0                    pypi_0    pypi
conda                     4.9.2            py36h5fab9bb_0    conda-forge
conda-package-handling    1.7.2            py36he6145b8_0    conda-forge
cryptography              3.4.4            py36hc39840e_0    conda-forge
cudatoolkit               10.0.130             hf841e97_8    conda-forge
cudatoolkit-dev           10.0                     py36_0    conda-forge
cudnn                     7.6.5.32             ha8d7eb6_1    conda-forge
cupti                     10.0.130                      0  
cxx-compiler              1.1.3                h4bd325d_0    conda-forge
decorator                 4.4.2                      py_0    conda-forge
defusedxml                0.6.0                      py_0    conda-forge
entrypoints               0.3             pyhd8ed1ab_1003    conda-forge
gast                      0.2.2                      py_0    conda-forge
gcc_impl_linux-64         9.3.0               h70c0ae5_18    conda-forge
gcc_linux-64              9.3.0               hf25ea35_30    conda-forge
google-pasta              0.2.0              pyh8c360ce_0    conda-forge
grpcio                    1.36.0           py36h8e87921_0    conda-forge
gxx_impl_linux-64         9.3.0               hd87eabc_18    conda-forge
gxx_linux-64              9.3.0               h3fbe746_30    conda-forge
h5py                      3.1.0           nompi_py36hc1bc4f5_100    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
horovod                   0.19.5                   pypi_0    pypi
icu                       68.1                 h58526e2_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
idna-ssl                  1.1.0                    pypi_0    pypi
importlib-metadata        3.7.0            py36h5fab9bb_0    conda-forge
importlib_metadata        3.7.0                hd8ed1ab_0    conda-forge
ipykernel                 5.5.0            py36he448a4c_1    conda-forge
ipython                   7.16.1           py36he448a4c_2    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.17.2           py36h5fab9bb_1    conda-forge
jinja2                    2.11.3             pyh44b312d_0    conda-forge
json5                     0.9.5              pyh9f0ad1d_0    conda-forge
jsonschema                3.2.0                      py_2    conda-forge
jupyter-server-proxy      1.6.0                    pypi_0    pypi
jupyter-tensorboard       0.2.0                    pypi_0    pypi
jupyter_client            6.1.11             pyhd8ed1ab_1    conda-forge
jupyter_core              4.7.1            py36h5fab9bb_0    conda-forge
jupyterlab                2.2.9              pyhd8ed1ab_0    conda-forge
jupyterlab-nvdashboard    0.3.1                    pypi_0    pypi
jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
jupyterlab_server         1.2.0                      py_0    conda-forge
keras-applications        1.0.8                      py_1    conda-forge
keras-preprocessing       1.1.2              pyhd8ed1ab_0    conda-forge
kernel-headers_linux-64   2.6.32              h77966d4_13    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-devel_linux-64     9.3.0               h7864c58_18    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libprotobuf               3.15.3               h780b84a_0    conda-forge
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-devel_linux-64  9.3.0               hb016644_18    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libuv                     1.41.0               h7f98852_0    conda-forge
markdown                  3.3.4              pyhd8ed1ab_0    conda-forge
markupsafe                1.1.1            py36h8f6f2f9_3    conda-forge
mistune                   0.8.4           py36h8f6f2f9_1003    conda-forge
mpi                       1.0                       mpich    conda-forge
mpi4py                    3.0.3            py36h7b8b12a_5    conda-forge
mpich                     3.4.1                h846660c_2    conda-forge
multidict                 5.1.0                    pypi_0    pypi
nbclient                  0.5.3              pyhd8ed1ab_0    conda-forge
nbconvert                 6.0.7            py36h5fab9bb_3    conda-forge
nbformat                  5.1.2              pyhd8ed1ab_1    conda-forge
nccl                      2.4.8.1              hd6f8bf8_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
nest-asyncio              1.4.3              pyhd8ed1ab_0    conda-forge
nodejs                    14.15.4              h92b4a50_1    conda-forge
notebook                  6.2.0            py36h5fab9bb_0    conda-forge
numpy                     1.19.5           py36h2aa4a07_1    conda-forge
openssl                   1.1.1j               h7f98852_0    conda-forge
opt_einsum                3.3.0                      py_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandoc                    2.11.4               h7f98852_0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parso                     0.7.1              pyh9f0ad1d_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    8.1.1                    pypi_0    pypi
pip                       20.1.1                     py_1    conda-forge
prometheus_client         0.9.0              pyhd3deb0d_0    conda-forge
prompt-toolkit            3.0.16             pyha770c72_0    conda-forge
protobuf                  3.15.3           py36hc4f0c31_0    conda-forge
psutil                    5.8.0                    pypi_0    pypi
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pycosat                   0.6.3           py36h8f6f2f9_1006    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.8.0              pyhd8ed1ab_0    conda-forge
pynvml                    8.0.4                    pypi_0    pypi
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyrsistent                0.17.3           py36h8f6f2f9_2    conda-forge
pysocks                   1.7.1            py36h5fab9bb_3    conda-forge
python                    3.6.13          hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.6                     1_cp36m    conda-forge
pyyaml                    5.4.1                    pypi_0    pypi
pyzmq                     22.0.3           py36h7068817_1    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
ruamel_yaml               0.15.80         py36h8f6f2f9_1004    conda-forge
scipy                     1.5.3            py36h9e8f40b_0    conda-forge
send2trash                1.5.0                      py_0    conda-forge
setuptools                49.6.0           py36h5fab9bb_3    conda-forge
simpervisor               0.4                      pypi_0    pypi
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.34.0               h74cdb3f_0    conda-forge
sysroot_linux-64          2.12                h77966d4_13    conda-forge
tensorboard               1.15.0                   py36_0    conda-forge
tensorflow                1.15.0          gpu_py36h5a509aa_0  
tensorflow-base           1.15.0          gpu_py36h9dcbed7_0  
tensorflow-estimator      1.15.1             pyh2649769_0  
tensorflow-gpu            1.15.0               h0d30ee6_0  
termcolor                 1.1.0                      py_2    conda-forge
terminado                 0.9.2            py36h5fab9bb_0    conda-forge
testpath                  0.4.4                      py_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
tornado                   6.1              py36h8f6f2f9_1    conda-forge
tqdm                      4.58.0             pyhd8ed1ab_0    conda-forge
traitlets                 4.3.3            py36h9f0ad1d_1    conda-forge
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.3             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
webencodings              0.5.1                      py_1    conda-forge
werkzeug                  0.16.1                     py_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
wrapt                     1.12.1           py36h8f6f2f9_3    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
yarl                      1.6.3                    pypi_0    pypi
zeromq                    4.3.4                h9c3ff4c_0    conda-forge
zipp                      3.4.0                      py_0    conda-forge

@davidrpugh
Contributor

Please look carefully at the script for creating the conda environment. You will see that the various HOROVOD_ environment variables are exported before the conda environment is created. These variables need to be defined when the environment is created in order for Horovod to build the appropriate extensions for CUDA, NCCL, etc. Once Horovod has been built during the Conda environment creation process these variables no longer need to be defined.

Are you using the script I provided? Or manually running the commands? I ask because you are getting errors that I have never seen (such as the ln errors when you activate your Conda env). Also the NCCL bindings did not build properly which suggests the CUDA bindings did not build. This further suggests that you didn’t export the Horovod build environment variables, which suggests that you didn’t run the provided script.
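One way to rule out the missing-exports case before re-creating the environment is a quick pre-flight check in the same shell. The following is only a sketch: the three variable names are an assumption about what a typical GPU/NCCL build script exports, and `missing_build_vars` is a hypothetical helper, not part of Horovod. Adjust the list to match the script you are actually using.

```python
import os

# Assumption: these are the build-time variables your script exports.
# Edit this tuple to match the script you actually run.
REQUIRED_BUILD_VARS = (
    "HOROVOD_CUDA_HOME",
    "HOROVOD_NCCL_HOME",
    "HOROVOD_GPU_OPERATIONS",
)


def missing_build_vars(env=None):
    """Return the assumed build variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BUILD_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_build_vars()
    if missing:
        print("Not exported in this shell: " + ", ".join(missing))
    else:
        print("All assumed Horovod build variables are set.")
```

Run it from the shell in which you are about to create the environment. Variables set in a different terminal session, or set without `export`, will show up as missing here.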

@davidrpugh
Contributor

It looks like you are nearly there though! Everything looks as expected except those ln errors you are getting when activating the environment and the fact that the NCCL bindings didn't build properly. The NCCL bindings are pretty important for performance so it is important to figure out why those didn't get built.
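To see at a glance whether NCCL was built after each rebuild attempt, the `[X]` / `[ ]` checklist printed by `horovodrun --check-build` can be parsed with a few lines of Python. This is a rough sketch: `parse_check_build` is a hypothetical helper, and `SAMPLE` is a hand-written illustration of the checklist layout, not output copied from this thread. The exact layout may differ between Horovod versions.

```python
import re

# Hand-written illustration of the checklist layout (an assumption,
# not real output from this issue).
SAMPLE = """\
Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch

Available Tensor Operations:
    [ ] NCCL
    [X] MPI
"""


def parse_check_build(text):
    """Return {section: {name: built?}} from a check-build style report."""
    report = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        flag = re.match(r"\[(X| )\]\s+(.+)$", stripped)
        if flag and section is not None:
            # "[X] Name" means built, "[ ] Name" means not built.
            report[section][flag.group(2)] = flag.group(1) == "X"
        elif stripped.endswith(":"):
            # A line ending in ":" starts a new section.
            section = stripped[:-1]
            report[section] = {}
    return report


if __name__ == "__main__":
    ops = parse_check_build(SAMPLE)["Available Tensor Operations"]
    print("NCCL built:", ops.get("NCCL", False))
```

To use it on a real install, capture the command's output (for example with `subprocess.run(["horovodrun", "--check-build"], capture_output=True, text=True).stdout`) and pass that string in place of `SAMPLE`.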

@niranjansuthar70

niranjansuthar70 commented Mar 2, 2021

Hello @davidrpugh, thanks for your response. I have successfully built the env and run some experiments with distributed training. I have posted my issue here:

#2553 (comment)

I am currently facing these issues.

@rrrongon

Hi,
I have been struggling to install Horovod properly for a while. Does anyone have an updated environment yml file that would help me install Horovod in a fresh conda environment?
In the end the install fails at NCCL linking. On top of that, there are a lot of package conflicts. It seems the packages need to be installed in a specific order to avoid them.
