Trying to install Horovod from a fresh conda environment (with tensorflow) and nothing seems to work #2138
Comments
Hey @illumidas-agn, have you taken a look at the Conda install guide here? If you're still having issues after going through that, feel free to provide log output showing where things are breaking. |
Hi @tgaddair, I'm going to go through that and let you know how it goes. |
I'm trying to run this on a server where I don't have sudo access. Are there any alternatives using pip or conda install? |
Where are you running into permissions issues? @davidrpugh, do you have some thoughts on this? If you don't need to use conda, you can also opt to install everything through pip in a virtual environment. But the important thing is you'll need to make sure the CUDA devtools are available when building in NCCL support. If this is an issue, you may want to see if you can run in a containerized environment. |
I need to be able to run the CUDA Toolkit as sudo and sadly I'm not an admin, so I can't install it the conventional way. I was able to make it run in my local virtual environment, and when I transferred it to the server, that's when I ran into issues. I tried the entire day to get Horovod and CUDA to run. I can ask the admin of the server to install it if there's no quick fix for this. |
Either that or setting up Docker/Singularity would probably be the easiest way, yes. It's certainly possible to install locally (see: https://stackoverflow.com/questions/39379792/install-cuda-without-root), but managing the correct environment variables will likely be a challenge. |
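For the local-install route mentioned above, the environment variables usually involved can be sketched as follows. This is a minimal sketch, not a tested recipe: `use_local_cuda` is a hypothetical helper, and both the `lib64` layout and the use of `CUDA_HOME` are assumptions about how the toolkit was unpacked and what the build looks for.

```shell
# Point the current shell at a CUDA toolkit installed under a user-writable prefix.
# Assumption: the toolkit was unpacked with bin/ and lib64/ directly under the prefix.
use_local_cuda() {
    cuda_prefix="$1"
    if [ ! -d "$cuda_prefix" ]; then
        echo "no such directory: $cuda_prefix" >&2
        return 1
    fi
    # CUDA_HOME is a common convention; whether a given build reads it is an assumption.
    export CUDA_HOME="$cuda_prefix"
    export PATH="$cuda_prefix/bin:$PATH"
    # Prepend lib64 without leaving a trailing colon when the variable was unset.
    export LD_LIBRARY_PATH="$cuda_prefix/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

Calling `use_local_cuda "$HOME/cuda-10.1"` (an illustrative path) only affects the current shell, so it has to be repeated, or added to a start-up file, for every session that builds or runs Horovod.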
I see, in that case I'll contact the server admin and get back to you if everything works. Thank you for your help. |
@illumidas-agn you should be able to install Horovod using Conda without root privileges. You will need to use the
name: null
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cmake=3.16
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - mxnet-cu101mkl==1.6.* # makes sure installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.6
Note that I have bumped a lot of version numbers from what is in the current guide. @tgaddair I will test this and then update the install guide accordingly. Perhaps an explicit indication that you will need to use the |
The project I'm trying to run requires Python 3.6, but I will see whether 3.7 works. Will keep you updated. |
@illumidas-agn Then just change the Python version to 3.6. Shouldn't impact the build. |
I see, I just saw it listed in the requirements that you posted above. |
@illumidas-agn I pin the version numbers in my environment file to the most recent versions of the various dependencies for which I am able to get a successful build. Other combinations of version numbers may also work just fine. Also note that if you only need TensorFlow then you can probably get by with the following environment file, which should build more quickly.
name: null
channels:
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - -r file:requirements.txt
  - python=3.7 # python=3.6 should also work!
  - tensorboard=2.1
  - tensorflow-gpu=2.1
In the above environment file I have dropped the PyTorch and MXNet dependencies. |
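An environment file like the ones above can be built without root access by creating the environment under a directory you own. The following is a minimal sketch, assuming the file is saved as `environment.yml` with the `requirements.txt` it refers to next to it. `create_env` is a hypothetical wrapper and `./env` is an illustrative prefix; only `conda env create --prefix ... --file ...` is the actual Conda command.

```shell
# Create a Conda environment from a file under a user-writable prefix.
# Returns 2 if the environment file is missing, 3 if conda is not on PATH.
create_env() {
    env_file="${1:-environment.yml}"
    env_prefix="${2:-$PWD/env}"
    if [ ! -f "$env_file" ]; then
        echo "environment file not found: $env_file" >&2
        return 2
    fi
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda is not on PATH" >&2
        return 3
    fi
    # Installs everything under $env_prefix, so no root privileges are involved.
    conda env create --prefix "$env_prefix" --file "$env_file"
}
```

After `create_env environment.yml ./env` finishes, the environment is activated by path with `conda activate ./env` rather than by name.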
Perfect, currently installing all the packages as we speak |
Installed all the packages, still getting errors when I use this command: HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[tensorflow] Error at the end of log: "
" |
Did you activate the Conda environment prior to running |
Yes, running it in the environment (Should be denoted by (envName) xxx@xxx) |
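The prompt prefix is a good sign, but it does not prove that `pip` and `python` resolve to the environment's own copies. A minimal sketch of a stricter check follows. It relies on `CONDA_PREFIX`, which `conda activate` sets; `tool_in_active_env` is a hypothetical helper name.

```shell
# Succeeds only if the given tool resolves to a path inside the active Conda environment.
tool_in_active_env() {
    tool="$1"
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "no Conda environment is active (CONDA_PREFIX is unset)" >&2
        return 1
    fi
    tool_path=$(command -v "$tool") || {
        echo "$tool not found on PATH" >&2
        return 1
    }
    case "$tool_path" in
        "$CONDA_PREFIX"/*) echo "$tool -> $tool_path" ;;
        *) echo "$tool resolves outside the environment: $tool_path" >&2; return 1 ;;
    esac
}
```

Running `tool_in_active_env pip && tool_in_active_env python` before the Horovod install helps rule out the case where the build picks up a system-wide interpreter.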
After activating the Conda environment run the command |
_libgcc_mutex 0.1 main |
What version of TensorFlow are you trying to use? What changes did you make to the environment file I sketched above? You seem to have two different versions of TensorFlow installed, 2.0 and 1.15, as well as different versions of the various CUDA toolkit libraries. |
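Mixed versions like these can be spotted mechanically in `conda list` output. The sketch below is a rough heuristic, not an official check: `versions_for` is a hypothetical helper, and it assumes the usual column layout in which the first field is the package name and the second is the version.

```shell
# Read `conda list` output on stdin and print "name version" for every package whose
# name starts with the given prefix. Exit status is 1 if more than one distinct
# version string appears among the matches, 0 otherwise.
versions_for() {
    awk -v prefix="$1" '
        /^#/ || NF < 2 { next }    # skip header comments and blank lines
        index($1, prefix) == 1 {
            print $1, $2
            if (!($2 in seen)) { seen[$2] = 1; distinct++ }
        }
        END { exit (distinct > 1 ? 1 : 0) }
    '
}
```

For example, `conda list | versions_for tensorflow || echo "mixed TensorFlow versions"` lists every TensorFlow-related package and warns when their versions disagree. Companion packages can legitimately carry a different version number, so a warning is a prompt to look closer rather than proof of a broken environment.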
|
OK. Well for sure to use TensorFlow 1.15 you will need to make changes to the environment file that I suggested above. For one, I think you will need an older version of
I forget whether you need to install Keras separately with TensorFlow 1.15 or not. You will also need to set the following environment variables slightly differently than what is mentioned in the user guide given that you are using the
Next, create the Conda environment and try building Horovod using the following commands.
Don't bother explicitly setting |
It managed to install almost every package except the last one: ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::cudatoolkit-dev-10.0-2'. |
I ran conda list on my local machine, where go-explore works, and it appears that the CUDA toolkit isn't used? This is all very confusing... _libgcc_mutex 0.1 main |
Looks like TensorFlow is still being installed via Getting There are several builds of
Replace the |
@illumidas-agn I have created (and tested) a Horovod build for TensorFlow 1.15 using |
@tgaddair When I built the environment for Horovod 19.5 setting Hopefully NCCL support was actually built with |
Hey @davidrpugh, unfortunately I recommend consulting the stable docs (for the latest release) when not building from source: https://horovod.readthedocs.io/en/stable/summary_include.html#install |
I used pip, yes, to get the specific version. I'm not sure what you mean by installing the specific CUDA version. Do I just use conda install cudatoolkit-dev=10.0=py36_0? Or do I need to edit the environment.yml file to include this when I build an environment? EDIT: Got it working below. I apologize for the hassle, I'm brand new to Anaconda and the ML environment. I'm currently following the sh script that you linked to try to create the environment. I'll let you know how that goes. |
Followed the scripts in the directory you showed me and overwrote the environment.yml file with the one from that directory. Currently it's unable to find python3.6: "ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'. Rolling back transaction: done FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'") |
It worked.....thank you so so so much
|
@illumidas-agn Hurray! Great. @tgaddair I guess we can close this issue. |
Phew! Nice work @davidrpugh! That looked like a tough one. |
Hello @davidrpugh, I am facing similar issues but could not resolve them by following this thread. I get the following error when running create-conda-env.sh:
(base) ati-g1@ATI-G1:~/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0$ create-conda-env.sh
By downloading and using the cuDNN conda packages, you accept the terms and conditions of the NVIDIA cuDNN EULA - done
done
To activate this environment, use
$ conda activate /home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env
To deactivate an active environment, use
$ conda deactivate
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
Currently supported shells are:
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
Errored, use --debug for full output:
Errored, use --debug for full output:
Errored, use --debug for full output:
Errored, use --debug for full output: |
@niranjansuthar70 Looks like Horovod built correctly in the first instance but that your Conda has not been properly initialized so that the |
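The `CommandNotFoundError` quoted above points at `conda init`. A minimal sketch of the two usual remedies follows, assuming a bash shell and a standard Conda layout in which `etc/profile.d/conda.sh` exists under the base install. `load_conda_hook` is a hypothetical helper name.

```shell
# One-time, interactive fix: let Conda add its hook to the shell start-up file,
# then close and reopen the shell.
#   conda init bash

# Inside a non-interactive script, load the hook explicitly before `conda activate`.
load_conda_hook() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda is not on PATH" >&2
        return 1
    fi
    conda_base=$(conda info --base) || return 1
    # Assumption: the hook script lives at this path, as in a default install.
    . "$conda_base/etc/profile.d/conda.sh"
}
```

With the hook loaded, `conda activate` works inside the same script, which is the situation a build script that activates an environment runs into.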
Hey @davidrpugh thanks for your response. (/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G1:~/dataset/ObjectRecognition$ python3 train.py
WARNING:tensorflow:From /home/ati-g1/dataset/ObjectRecognition/nets/mobilenet.py:388: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead. Traceback (most recent call last): |
Hello @davidrpugh, one more question: since I am using a Conda env for distributed training, should I run the "mpi server1,2" command on both machines? (I have 2 machines.) Should I set up the same env on both machines? Also, "horovodrun --check-build" is giving me this error. (/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G1: |
@niranjansuthar70 I am very confused. Your environment prefix seems to suggest that you have installed |
Hello @davidrpugh, I have installed Anaconda in my home directory and created the env, from the project you provided with the environment.yml files, in another directory. Edited: although "nvidia-smi" shows CUDA 11.2, I ran exactly the same commands to export the paths. export ENV_PREFIX=$PWD/env |
@niranjansuthar70 I am sorry but I cannot help you as I don't understand exactly what you have done. You seem to have CUDA 11.2 installed on your system and this is conflicting with the CUDA version installed in your Conda environment for reasons that I cannot easily debug remotely. |
@davidrpugh but nvcc --version gives output v10.0. |
You said that |
Yes nvidia-smi , (base) ati-g1@ATI-G2:~$ nvidia-smi +-----------------------------------------------------------------------------+ |
But when you activate the environment, in which you have installed |
Yes, I am getting CUDA 10.0 in the env I have created. |
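The two numbers in this exchange measure different things: `nvidia-smi` reports the highest CUDA version the installed driver supports, while `nvcc --version` reports the toolkit that is first on `PATH`. Seeing 11.2 from one and 10.0 from the other is therefore not a conflict in itself. A minimal sketch for pulling out the toolkit release, assuming the usual `release X.Y` wording in the `nvcc` banner (`nvcc_release` is a hypothetical helper):

```shell
# Extract the toolkit release number from `nvcc --version` output read on stdin.
# Assumption: the banner contains a phrase of the form "release <major>.<minor>".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Running `nvcc --version | nvcc_release` inside and outside the activated environment shows which toolkit each shell would actually compile against.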
OK, now I understand. Can you provide the exact sequence of commands that you used to create the environment for Horovod? From the above, it looks like maybe the |
steps I have taken,
|
Is your |
yes it is properly initialized and I can use conda activate. (/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env) ati-g1@ATI-G2:~$ conda activate /home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env ln: target '/home/ati-g1/Desktop/Niranjan/horovod-gpu-data-science-project-tensorflow-gpu-1.15-cudatoolkit-dev-10.0/env/bin/xzmore' is not a directory |
@davidrpugh can you walk me through it step by step so I can try it once from scratch? Let's say I have a new machine with only Ubuntu 18.04 and the latest NVIDIA driver installed. |
From a clean Ubuntu machine I would first install Miniconda in my home directory using the instructions in this GitHub repo which will install Miniconda AND properly initialize Conda for bash. Then I would run the following instructions.
I would then check the build with the following commands.
conda activate ./env
horovodrun --check-build |
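The report printed by `horovodrun --check-build` can also be checked in a script, for instance to confirm that TensorFlow and NCCL support were compiled in. This is a minimal sketch, assuming the report marks enabled items with `[X]` followed by the item name; `build_has` is a hypothetical helper.

```shell
# Succeeds if the named item is ticked in `horovodrun --check-build` output on stdin.
# Assumption: enabled items appear on their own line as "[X] <name>".
build_has() {
    awk -v item="$1" '
        $1 == "[X]" && $2 == item { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

A typical use would be `horovodrun --check-build | build_has NCCL || echo "NCCL support missing"`, run inside the activated environment.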
Hello @davidrpugh thanks for your response.
|
|
FYI you only need to run the I am having trouble reading the output because of the formatting. Can you please post the output of the following commands.
Also run the following command in the activated environment and return the output.
|
Hello @davidrpugh, do I need to run those export commands before or after creating the env? I am trying to reinstall everything, starting from miniconda3, on a new machine. Are these the commands?
And here is the output of the commands you asked for:
horovodrun --check-build
conda list
|
Please look carefully at the script for creating the conda environment. You will see that the various Are you using the script I provided? Or manually running the commands? I ask because you are getting errors that I have never seen (such as the |
It looks like you are nearly there though! Everything looks as expected except those |
Hello @davidrpugh, thanks for your response. I have successfully built the env and run some distributed training experiments. I have posted my issue here. I am currently facing these issues. |
Hi, |
Environment:
Your question:
I looked through all the available open questions. I'm currently trying to run go-explore (https://github.com/uber-research/go-explore/tree/master/policy_based) and have only managed to make Horovod work once, for whatever reason.
I need it built with TensorFlow (aka horovod.tensorflow), and when I try to force the TensorFlow flag during installation I get a 10-page log dump from which it is hard to tell what it actually needs.
How do I get Horovod running?
I'm not sure what I'm doing wrong. I've tried everything else.