wjxiz1992/e2e-train: Horovod+Spark+TensorFlow+cuDF

(DRAFT version)

Set up the environment for Horovod

// install cuDNN and NCCL (the command below covers NCCL only; install cuDNN separately)
sudo apt install libnccl2 libnccl-dev


// install conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
// install cuDF; adjust the cuDF, Python, and CUDA toolkit versions to match your system
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
 cudf=21.06 python=3.8 cudatoolkit=11.2
// install OpenMPI, along with OpenJDK 8 and CMake
conda install openjdk=8 cmake openmpi openmpi-mpicc -y

// install TensorFlow
pip install tensorflow

// install Horovod with MPI, TensorFlow, and NCCL support
HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL \
pip install horovod[spark] --no-cache-dir

// verify the Horovod build
horovodrun --check-build

The output should look similar to:

Horovod v0.22.1:

Available Frameworks:
 [X] TensorFlow
 [X] PyTorch
 [ ] MXNet

Available Controllers:
 [X] MPI
 [X] Gloo

Available Tensor Operations:
 [X] NCCL
 [ ] DDL
 [ ] CCL
 [X] MPI
 [X] Gloo
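
To check this output automatically rather than by eye, a small parser along the following lines could be used. This is only a sketch: the function names and the REQUIRED mapping are made up for illustration (REQUIRED mirrors the HOROVOD_WITH_MPI, HOROVOD_WITH_TENSORFLOW and HOROVOD_GPU_OPERATIONS=NCCL settings used during installation), and the sample text is copied from the output above instead of being captured from a live horovodrun call.

```python
import re

# Sample text copied from the expected output above. On a real machine the
# text could instead be captured from `horovodrun --check-build`.
SAMPLE_OUTPUT = """\
Horovod v0.22.1:

Available Frameworks:
 [X] TensorFlow
 [X] PyTorch
 [ ] MXNet

Available Controllers:
 [X] MPI
 [X] Gloo

Available Tensor Operations:
 [X] NCCL
 [ ] DDL
 [ ] CCL
 [X] MPI
 [X] Gloo
"""

# Hypothetical requirement list, derived from the build flags used above.
REQUIRED = {
    "Frameworks": {"TensorFlow"},
    "Controllers": {"MPI"},
    "Tensor Operations": {"NCCL"},
}


def parse_check_build(text):
    """Map each 'Available ...' section to the set of entries marked [X]."""
    enabled = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"Available (.+):$", line)
        if header:
            section = header.group(1)
            enabled[section] = set()
            continue
        entry = re.match(r"\[(X| )\] (.+)$", line)
        if entry and section is not None and entry.group(1) == "X":
            enabled[section].add(entry.group(2))
    return enabled


def missing_features(enabled, required=REQUIRED):
    """Return the required entries that are not enabled, grouped by section."""
    missing = {}
    for section, names in required.items():
        absent = names - enabled.get(section, set())
        if absent:
            missing[section] = absent
    return missing


build = parse_check_build(SAMPLE_OUTPUT)
print(missing_features(build))  # prints {} for the sample above
```

An empty result means every feature requested at install time is present. A non-empty result names the sections to revisit before launching the application.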

Download the training data from https://drive.google.com/file/d/1lPBCTfUv1aSiWBGjT4WBv1zntNx0MAB2/view?usp=sharing and extract the archive with:
```
tar -xvf train_100k_block_size.tar
```
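
If the extraction needs to happen from Python instead of the shell, the same step could be sketched with the standard tarfile module. The helper name extract_archive and the destination directory are assumptions made for this example; the archive name is the one used in the tar command above.

```python
import os
import tarfile

ARCHIVE = "train_100k_block_size.tar"  # archive name from the tar command above


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members (absolute paths, links outside dest_dir, ...).
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


# Only extract when the archive has actually been downloaded.
if os.path.exists(ARCHIVE):
    for name in extract_archive(ARCHIVE):
        print(name)
```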
Set up your Spark cluster as described in https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html, then modify the necessary parameters, such as SPARK_HOME and SPARK_URL, to match your environment.

Application-specific parameters:
1. --num-proc: the number of Spark executors used to run the application
2. --model-output-path: the path where the best model is saved
3. --input-data: the path to the training dataset
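
For illustration, the three parameters above could be declared with argparse roughly as follows. This is a sketch, not the actual contents of main.py: the default of 1 for --num-proc, the choice to make the two paths required, and the example paths are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three application-specific parameters."""
    parser = argparse.ArgumentParser(
        description="Sketch of the application-specific parameters")
    parser.add_argument(
        "--num-proc", type=int, default=1,  # default value is an assumption
        help="number of Spark executors used to run the application")
    parser.add_argument(
        "--model-output-path", required=True,
        help="path where the best model is saved")
    parser.add_argument(
        "--input-data", required=True,
        help="path to the training dataset")
    return parser


# Hypothetical values, for demonstration only.
args = build_parser().parse_args([
    "--num-proc", "2",
    "--model-output-path", "/tmp/best_model",
    "--input-data", "/data/train",
])
print(args.num_proc, args.model_output_path, args.input_data)
```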
Launch the application:
```
./run.sh
```

Note: when using more than one executor, a known issue occurs: horovod/horovod#3005
