
Deep Learning on Big Data with Multi-Node GPU Jobs

Tutorial held at IEEE BigData 2019

Thomas Breuel, Alex Aizman

Both traditional machine learning (clustering, decision trees, parametric models, cross-validation, function decompositions) and deep learning (DL) are often used to analyze big data on hundreds of nodes (clustered servers). However, the systems and I/O considerations for multi-node deep learning are quite different from those for traditional machine learning. While traditional machine learning is often well served by MapReduce-style infrastructure (Hadoop, Spark), distributed deep learning places different demands on hardware, storage software, and networking infrastructure. In this tutorial, we cover:

  • the structure and properties of large-scale GPU-based deep learning systems
  • large-scale distributed stochastic gradient descent and supporting frameworks (PyTorch, TensorFlow, Horovod, NCCL)
  • common storage and compression formats (TFRecord/tf.Example, DataLoader, etc.) and interconnects (Ethernet, InfiniBand, RDMA, NVLink)
  • common storage architectures for large-scale DL (network file systems, distributed file systems, object storage)
  • batch queueing systems, Kubernetes, and NGC for scheduling and large-scale parallelism
  • ETL techniques including distributed GPU-based augmentation (DALI)

The tutorial will focus on techniques and tools by which deep learning practitioners can take advantage of these technologies and move from single-desktop training to training models on hundreds of GPUs and petascale datasets. It will also help researchers and system engineers to choose and size the systems necessary for such large-scale deep learning. Participants should have some experience in training deep learning models on a single node. The tutorial will cover both TensorFlow and PyTorch frameworks as well as additional open-source tools required to scale deep learning to multi-node storage and multi-node training.
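
As a rough illustration of the synchronous data-parallel SGD idea listed among the topics, the following toy sketch simulates several workers in plain Python. It is not taken from the tutorial notebooks: the function names (`local_gradient`, `all_reduce_mean`, `train`), the one-parameter model, and the synthetic data are all assumptions made for this example, and the averaging step only stands in for what a library such as NCCL or Horovod would do across real GPUs.

```python
# Toy simulation of synchronous data-parallel SGD (no GPUs, no real framework).
# Each simulated worker holds one shard of the data and computes a local
# gradient; averaging those gradients plays the role of an all-reduce.
# All names here are hypothetical and chosen only for illustration.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: returns the mean that every worker would receive."""
    return sum(values) / len(values)


def train(data, num_workers=4, lr=0.01, steps=200):
    # Split the data into equal-sized shards, one per simulated worker.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # Every worker computes a gradient on its own shard with the same weights.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so their weights stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w


# Synthetic data drawn from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = train(data, num_workers=4)
w_single = train(data, num_workers=1)
print(w_parallel, w_single)
```

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, so the four-worker run follows the same trajectory as the single-worker run. Real multi-node training adds communication cost and I/O bottlenecks, which are the systems issues the tutorial addresses.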

Running Jupyter

Many of the examples in this directory are in Jupyter Notebook format.

There are two ways of running Jupyter: on the local machine or inside a container.

To run it directly on the local machine, you need to install Anaconda3. Afterward, you can run:

/opt/anaconda3/bin/jupyter lab

To run Jupyter inside a container, you need to have NVIDIA Docker installed (e.g., using ansible-playbook docker-nv.yml). Then you can use the run script in the parent directory:

./run jupyter lab

Ansible Scripts

These are Ansible scripts that help you set up your machine:

  • anaconda3.yml -- install Python3 using Anaconda, plus various packages
  • docker-nv.yml -- install Docker with NVIDIA support
  • microk8s.yml -- install MicroK8s
  • gui.yml -- simple GUI tools for remote access via VNC

These are intended for recent versions of Ubuntu (19.04 and 19.10). They install and reinstall various packages, so review them before running.

To use:

$ sudo apt-get install python-pip
$ sudo pip install ansible
$ cd Ansible
$ ansible-playbook anaconda3.yml