Deep Learning on Big Data with Multi-Node GPU Jobs

Tutorial held at IEEE BigData 2019

Thomas Breuel, Alex Aizman

Both traditional machine learning (clustering, decision trees, parametric models, cross-validation, function decompositions) and deep learning (DL) are often used to analyze big data on hundreds of nodes (clustered servers). However, the systems and I/O considerations for multi-node deep learning differ substantially from those for traditional machine learning. While traditional machine learning is often well served by MapReduce-style infrastructure (Hadoop, Spark), distributed deep learning places different demands on hardware, storage software, and networking infrastructure. In this tutorial, we cover:

the structure and properties of large-scale GPU-based deep learning systems
large-scale distributed stochastic gradient descent and supporting frameworks (PyTorch, TensorFlow, Horovod, NCCL)
common storage and compression formats (TFRecord/tf.Example, DataLoader, etc.) and interconnects (Ethernet, InfiniBand, RDMA, NVLink)
common storage architectures for large-scale DL (network file systems, distributed file systems, object storage)
batch queueing systems, Kubernetes, and NGC for scheduling and large-scale parallelism
ETL techniques including distributed GPU-based augmentation (DALI)
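
As a rough illustration of the distributed stochastic gradient descent topic listed above, the following sketch simulates data-parallel training in plain Python: each simulated worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce from a library such as NCCL or Horovod would play on real hardware), and every worker applies the same update. The toy linear model and the names `local_gradient`, `all_reduce_mean`, and `train` are assumptions made for this example and are not taken from the tutorial materials.

```python
import random

def local_gradient(w, b, shard):
    """Gradient of the mean squared error for y ~ w*x + b over one worker's shard."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(shard)
    return gw / n, gb / n

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: average each gradient component across workers."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

def train(data, num_workers=4, lr=0.1, steps=500):
    # One shard per simulated worker; shards are equal-sized here, so the
    # average of the shard gradients equals the full-batch gradient.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On real hardware these gradients are computed concurrently, one per GPU.
        grads = [local_gradient(w, b, s) for s in shards]
        gw, gb = all_reduce_mean(grads)
        # Every worker applies the same averaged update, keeping replicas in sync.
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical toy data: noiseless samples of y = 3x + 1.
rng = random.Random(0)
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(64))]
w, b = train(data)
```

Because every replica sees the same averaged gradient, the result matches single-node full-batch training on this toy problem; the practical cost on a cluster is the communication needed for the averaging step.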

The tutorial will focus on techniques and tools by which deep learning practitioners can take advantage of these technologies and move from single-desktop training to training models on hundreds of GPUs and petascale datasets. It will also help researchers and system engineers to choose and size the systems necessary for such large-scale deep learning. Participants should have some experience in training deep learning models on a single node. The tutorial will cover both TensorFlow and PyTorch frameworks as well as additional open-source tools required to scale deep learning to multi-node storage and multi-node training.
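
One idea behind storage formats for very large datasets is to pack many samples into a small number of large files that can be read sequentially. The sketch below shows that idea using only Python's standard `tarfile` module. It is not the TFRecord format or the API of any particular library; the function names `write_shards` and `read_shard`, the `.bin` extension, and the shard naming scheme are assumptions chosen for illustration.

```python
import io
import os
import tarfile
import tempfile

def write_shards(samples, shard_size, prefix):
    """Write (key, payload_bytes) samples into numbered tar files,
    each holding at most shard_size samples."""
    paths = []
    for start in range(0, len(samples), shard_size):
        path = f"{prefix}-{start // shard_size:06d}.tar"
        with tarfile.open(path, "w") as tar:
            for key, payload in samples[start:start + shard_size]:
                info = tarfile.TarInfo(name=f"{key}.bin")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        paths.append(path)
    return paths

def read_shard(path):
    """Yield (name, payload) pairs in the order they were written."""
    # "r|" opens the archive as a non-seekable stream, so access is purely sequential.
    with tarfile.open(path, "r|") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()

# Usage with hypothetical in-memory samples written to a temporary directory.
tmpdir = tempfile.mkdtemp()
samples = [(f"sample{i:04d}", bytes([i]) * 8) for i in range(10)]
shards = write_shards(samples, shard_size=4, prefix=os.path.join(tmpdir, "train"))
restored = [item for path in shards for item in read_shard(path)]
```

Sequential reads of a few large files tend to suit network file systems and object stores better than random access to many small files, which is one reason shard-based layouts are common in large-scale training.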

Running Jupyter

Many of the examples in this directory are in Jupyter Notebook format.

There are two ways of running Jupyter: on the local machine or inside a container.

To run it directly on the local machine, you need to install Anaconda3. Afterwards, you can run:

/opt/anaconda3/bin/jupyter lab

To run Jupyter inside a container, you need to have NVIDIA Docker installed (e.g., using ansible-playbook docker-nv.yml). Then you can use the run script in the parent directory:

./run jupyter lab

Ansible Scripts

These are Ansible scripts that help you set up your machine:

anaconda3.yml -- install Python3 using Anaconda, plus various packages
docker-nv.yml -- install Docker with NVIDIA support
microk8s.yml -- install MicroK8s
gui.yml -- simple GUI tools for remote access via VNC

These are intended for Ubuntu 19.04 and 19.10. They install and reinstall various packages, so review them before running.

To use:

$ sudo apt-get install python-pip
$ sudo pip install ansible
...
$ cd Ansible
$ ansible-playbook anaconda3.yml
...
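
If you want to apply several playbooks in one go, a small wrapper along the following lines may help. It is only a sketch: the playbook names come from the list above, but the order, the `run_playbooks` and `playbook_commands` helpers, and the dry-run default are assumptions, and the elided steps in the instructions above are not covered.

```python
import subprocess

# Playbook names from the list above; the order shown here is an assumption.
PLAYBOOKS = ["anaconda3.yml", "docker-nv.yml", "microk8s.yml", "gui.yml"]

def playbook_commands(playbooks):
    """Build one ansible-playbook command line per playbook."""
    return [["ansible-playbook", name] for name in playbooks]

def run_playbooks(playbooks, cwd="Ansible", dry_run=True):
    """Print the commands by default; execute them only when dry_run is False."""
    for cmd in playbook_commands(playbooks):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops at the first playbook that fails.
            subprocess.run(cmd, cwd=cwd, check=True)

# Dry run only: prints the commands without changing the machine.
run_playbooks(PLAYBOOKS)
```

Keeping the dry run as the default fits the advice to review the playbooks first, since they install and reinstall packages.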

Repository Contents

Ansible
010-introduction.pdf
020-image-classifier.pdf
022-basic-recognition.ipynb
022-basic-recognition.pdf
023-docker.ipynb
023-docker.pdf
025-performance-profile.pdf
027-fp16.ipynb
027-fp16.pdf
030-multi-gpu.pdf
035-dataparallel.ipynb
035-dataparallel.pdf
040-formats-compression.pdf
045-storage-server.pdf
047-webdataset.ipynb
047-webdataset.pdf
050-multinode.pdf
052-k8s-intro.ipynb
052-k8s-intro.pdf
053-multinode.ipynb
053-multinode.pdf
054-k8s-templating.ipynb
054-k8s-templating.pdf
056-k8s-reduce.ipynb
056-k8s-reduce.pdf
058-k8s-distributed.ipynb
058-k8s-distributed.pdf
060-etl-augmentation.pdf
065-etl-map-reduce.ipynb
065-etl-map-reduce.pdf
080-tensorcom.pdf
085-tensorcom.ipynb
085-tensorcom.pdf
090-recommendations.pdf
README.md
helpers.py

tmbdev-archive/bigdata19-tutorial
