In [None]:
import os
import multiprocessing as mp
import time

all_procs = []

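def background(command):
    """Run a shell command in a separate process without blocking the notebook."""
    proc = mp.Process(target=os.system, args=(command,))
    all_procs.append(proc)
    proc.start()

def kill_all():
    """Terminate every helper process started by background()."""
    for proc in all_procs:
        proc.terminate()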
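
One caveat with the helper above: `proc.terminate()` stops the Python helper process, but the shell command that `os.system` launched inside it may keep running. A possible alternative, sketched here with `subprocess.Popen`, keeps a handle on the command itself. The names `background_popen`, `stop_all`, and `popen_procs` are hypothetical and not part of the original notebook, and the sketch assumes commands that need no shell features such as pipes or redirection.

```python
import shlex
import subprocess

popen_procs = []

def background_popen(command):
    """Start `command` without blocking and return its Popen handle."""
    # shlex.split avoids an intermediate shell, so terminate() reaches the program itself.
    proc = subprocess.Popen(shlex.split(command))
    popen_procs.append(proc)
    return proc

def stop_all(timeout=5.0):
    """Terminate all started commands and wait for them to exit."""
    for proc in popen_procs:
        if proc.poll() is None:  # still running
            proc.terminate()
    for proc in popen_procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # The command ignored SIGTERM; force it to stop.
            proc.kill()
            proc.wait()
```

Waiting after `terminate()` also reaps the child processes, so no zombies are left behind when the notebook keeps running.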

# Sharded Dataset

The code below assumes that the ImageNet shards are stored in ./shards.

If the shards do not exist, we generate them directly from the original ImageNet data using a small script.

In [None]:
%%bash
test -f shards/imagenet-train-000000.tar || {
    python3 ./convert-imagenet.py ./imagenet-data ./shards
}
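
The conversion script itself is not shown here. As a rough illustration of what a shard is, the sketch below writes samples into a tar archive using only the standard library. It assumes a WebDataset-style layout in which the files of one sample share a key (`key.jpg`, `key.cls`); that layout, the function names `write_shard` and `shard_name`, and the sample keys are assumptions for illustration, not necessarily what `convert-imagenet.py` does.

```python
import io
import tarfile

def shard_name(prefix, index):
    """Build a shard file name such as imagenet-train-000000.tar."""
    return f"{prefix}-{index:06d}.tar"

def write_shard(fileobj, samples):
    """Write (key, image_bytes, class_index) samples into a tar archive.

    Each sample becomes two consecutive members: key.jpg and key.cls.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, image_bytes, class_index in samples:
            members = (
                (".jpg", image_bytes),
                (".cls", str(class_index).encode("ascii")),
            )
            for suffix, payload in members:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
```

Keeping the files of a sample adjacent in the archive means a reader can process the shard sequentially, which is what makes streaming it over HTTP practical.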

# Data Server

Since we are implementing distributed training, we need to be able to retrieve shards over the network.

Here, we use a small web server to serve the shards; the web server is simply nginx running in a Docker container.

In practice, you would use a permanently installed web server or, even better, the AIStore object store.

In [None]:
%%bash
# Stop any running nginx container; -r (GNU xargs) skips docker kill when none is found.
docker ps | awk '$2=="nginx"{print $1}' | xargs -r docker kill

In [None]:
%%bash
# Host directory that contains the shards; adjust this path for your machine.
imagenetdir=/media/tmb/data1/gs/nvdata-imagenet
docker run -it --rm -d -p 8080:80 --name web -v "$imagenetdir":/usr/share/nginx/html nginx

In [None]:
!curl http://$(hostname -i):8080/imagenet-train-000000.tar | tar tvf - | tail
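
The same check can be done from Python. The sketch below streams a shard over HTTP and lists its first few members, so it does not need to download the whole file (unlike the `tail` in the shell command, which reads to the end). The helper names `first_members` and `list_shard` are hypothetical; only standard-library calls are used.

```python
import tarfile
import urllib.request

def first_members(fileobj, limit=10):
    """Return the names of the first `limit` members of a tar stream."""
    names = []
    # Mode "r|" reads the archive sequentially, so the stream need not be seekable.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names

def list_shard(url, limit=10):
    """Open a shard URL and list its first `limit` members."""
    with urllib.request.urlopen(url) as response:
        return first_members(response, limit=limit)
```

For example, `list_shard("http://<host>:8080/imagenet-train-000000.tar")` should return file names if the server is reachable, with `<host>` replaced by the address printed by `hostname -i`.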

# Ray Cluster

In this example, we use Ray to launch distributed training. Ray is a distributed processing system for Python. We start a two-node Ray cluster; you can add as many nodes, with as many GPUs, as you like.

In [None]:
%%bash
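# Note: each %%bash cell runs in its own shell, so this activation does not carry over to later cells.
. ./venv/bin/activate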

In [None]:
%%bash
ray stop > /dev/null 2>&1 || true

In [None]:
%%bash
ray start --head

In [None]:
%%bash
ssh sedna "cd $(/bin/pwd) && . ./venv/bin/activate && ray start --address=$(hostname -i):6379"

In [None]:
%%bash
ray status

# Training Jobs

Because Ray manages the cluster and handles the distributed execution, starting a multi-GPU distributed job takes just a single command.

In [None]:
%%bash
python3 imagenet.py raytrain --verbose --bucket=http://$(hostname -i):8080/ --backend=gloo --mname=resnet18
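
How `imagenet.py` turns the `--bucket` URL into per-worker data is not shown in this notebook. A common pattern with sharded datasets, sketched below as an assumption rather than a description of that script, is to build the list of shard URLs from the bucket and give each worker a disjoint slice of it. The function names, the `imagenet-train` prefix default, and the shard count are hypothetical.

```python
def shard_urls(bucket, num_shards, prefix="imagenet-train"):
    """Build shard URLs such as <bucket>/imagenet-train-000000.tar."""
    base = bucket.rstrip("/")
    return [f"{base}/{prefix}-{i:06d}.tar" for i in range(num_shards)]

def shards_for_worker(urls, rank, world_size):
    """Return the shards assigned to worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Round-robin assignment: worker 0 gets shards 0, N, 2N, ...; worker 1 gets 1, N+1, ...
    return urls[rank::world_size]
```

With this scheme, every shard is read by exactly one worker per epoch, and no coordination between workers is needed beyond knowing the rank and the number of workers.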