In [None]:
import os
import multiprocessing as mp
import time

all_procs = []

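def background(command):
    """Run a shell command in a background process and remember it."""
    proc = mp.Process(target=os.system, args=(command,))
    all_procs.append(proc)
    proc.start()

def kill_all():
    """Terminate all background processes started with background()."""
    # Note: terminate() signals only the Python child process; the shell
    # command started by os.system may keep running.
    for proc in all_procs:
        proc.terminate()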
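
A possible alternative to the helpers above, shown only as a sketch: starting each command with `subprocess.Popen` in its own session makes it possible to signal the whole process group, so the shell and the programs it started (such as `ssh`) are stopped together. The names `background_popen`, `kill_all_popen`, and `procs` are hypothetical and not part of the original notebook; the code assumes a POSIX system.

```python
import os
import signal
import subprocess

procs = []  # hypothetical registry, separate from all_procs above

def background_popen(command):
    """Start `command` through the shell in its own session (process group)."""
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    procs.append(proc)
    return proc

def kill_all_popen(timeout=5.0):
    """Send SIGTERM to every process group, escalating to SIGKILL on timeout."""
    for proc in procs:
        if proc.poll() is None:
            try:
                # With start_new_session=True the group id equals proc.pid.
                os.killpg(proc.pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # the process exited between poll() and killpg()
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                os.killpg(proc.pid, signal.SIGKILL)
                proc.wait()
    procs.clear()
```

Usage mirrors the original helpers: call `background_popen("...")` to launch a job and `kill_all_popen()` to stop everything that was launched.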

# Sharded Dataset

For the subsequent code, we assume that the ImageNet shards are stored in ./shards.

If the shards do not exist, we generate them directly from the original ImageNet data using a small script.

In [None]:
%%bash
test -f shards/imagenet-train-000000.tar || {
    mkdir -p shards
    python3 ./convert-imagenet.py ./imagenet-data ./shards
}
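
The conversion script itself is not shown in this notebook. As a rough illustration of what a shard is, the sketch below writes samples into numbered tar files, using the convention (common for tar-based datasets, for example in the WebDataset library) that files sharing a basename form one sample. The function name `write_shards`, the `.jpg`/`.cls` extensions, and the shard size are assumptions for illustration; `convert-imagenet.py` may work differently.

```python
import io
import os
import tarfile
import tempfile

def write_shards(samples, outdir, prefix="imagenet-train", maxcount=1000):
    """Write (key, jpeg_bytes, class_index) samples into numbered tar shards."""
    os.makedirs(outdir, exist_ok=True)
    paths = []
    tar = None
    for i, (key, image, cls) in enumerate(samples):
        if i % maxcount == 0:
            # Start a new shard every `maxcount` samples.
            if tar is not None:
                tar.close()
            path = os.path.join(outdir, f"{prefix}-{len(paths):06d}.tar")
            tar = tarfile.open(path, "w")
            paths.append(path)
        # One tar member per field; the shared basename ties them together.
        for ext, data in ((".jpg", image), (".cls", str(cls).encode("ascii"))):
            info = tarfile.TarInfo(name=key + ext)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    if tar is not None:
        tar.close()
    return paths

# Usage with three dummy samples and at most two samples per shard.
demo_dir = tempfile.mkdtemp()
demo = [(f"n{i:04d}", b"fake-jpeg-bytes", i) for i in range(3)]
shard_paths = write_shards(demo, demo_dir, maxcount=2)
```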

# Data Server

Since we are implementing distributed training, we need to be able to retrieve shards over the network.

Here, we use a small web server to serve the shards; the web server is simply nginx running in a Docker container.

In practice, you would use a permanently installed web server or, even better, the AIStore object store.

In [None]:
%%bash
# Stop any running nginx containers (-r: do nothing if there are none; GNU xargs).
docker ps | awk '$2=="nginx"{print $1}' | xargs -r docker kill

In [None]:
%%bash
# Absolute path of the directory containing the shards; adjust to your setup.
imagenetdir=/media/tmb/data1/gs/nvdata-imagenet
docker run -it --rm -d -p 8080:80 --name web -v "$imagenetdir":/usr/share/nginx/html nginx

In [None]:
!curl http://$(hostname -i):8080/imagenet-train-000000.tar | tar tvf - | tail
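
The same check can be done from Python. The sketch below reads a tar archive sequentially from any file-like object, which is how a shard can be consumed while it is still being downloaded, without seeking. The helper names `list_members` and `list_remote_shard` are hypothetical; the URL passed to the second function would be the shard URL served by the nginx container above.

```python
import io
import tarfile
import urllib.request

def list_members(stream):
    """Return (name, size) for each regular file in a sequentially read tar stream."""
    # Mode "r|" tells tarfile to treat the input as a non-seekable stream.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        return [(member.name, member.size) for member in tar if member.isfile()]

def list_remote_shard(url):
    """Stream a shard over HTTP and list its contents."""
    with urllib.request.urlopen(url) as response:
        return list_members(response)
```

For example, `list_remote_shard("http://<server-address>:8080/imagenet-train-000000.tar")` would list the files of the first shard, where `<server-address>` stands for the address printed by `hostname -i`.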

# Training Jobs

Here, we simply start the distributed training jobs using ssh.
Each job needs the environment variables MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE.
We run RANK=0 on the local machine and RANK=1 on the remote host.
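
These four variables are the ones read by PyTorch's `env://` initialization method. The sketch below shows how a training script might validate them and then join the process group. Whether `imagenet.py` does exactly this is an assumption; `DistConfig`, `read_dist_config`, and `init_distributed` are hypothetical names.

```python
import os
from dataclasses import dataclass

@dataclass
class DistConfig:
    master_addr: str
    master_port: int
    rank: int
    world_size: int

def read_dist_config(env=os.environ):
    """Read and validate the rendezvous settings from the environment."""
    cfg = DistConfig(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
    if not 0 <= cfg.rank < cfg.world_size:
        raise ValueError(f"RANK {cfg.rank} is outside [0, {cfg.world_size})")
    return cfg

def init_distributed(backend="gloo"):
    """Join the process group; requires PyTorch, imported lazily here."""
    import torch.distributed as dist
    read_dist_config()  # fail early with a clear error if a variable is missing
    dist.init_process_group(backend=backend, init_method="env://")
```

Every process must see the same MASTER_ADDR, MASTER_PORT, and WORLD_SIZE, and each must have a distinct RANK.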

In [None]:
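%%bash
# Note: each %%bash cell runs in its own shell, so this activation
# does not carry over to later cells.
. ./venv/bin/activate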

In [None]:
background("""
ssh sedna "cd $(/bin/pwd) && . ./venv/bin/activate && env MASTER_ADDR=$(hostname -i) MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 python3 imagenet.py train --verbose --bucket=http://$(hostname -i):8080/ --mname=resnet18 --backend=gloo"
""")

In [None]:
%%bash
env MASTER_ADDR=$(hostname -i) MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 \
python3 imagenet.py train --verbose --bucket=http://$(hostname -i):8080/ --mname=resnet18 --backend=gloo
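
The two launch commands above differ only in RANK and in whether they are wrapped in ssh. As a sketch of how this might be generalized to more workers, the helpers below build the same command strings programmatically. The function names `train_command` and `ssh_command` are hypothetical; the flags are copied from the cells above.

```python
import shlex

def train_command(rank, world_size, master_addr, bucket,
                  master_port=29500, mname="resnet18", backend="gloo"):
    """Build the training command line for one rank."""
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
        "RANK": rank,
        "WORLD_SIZE": world_size,
    }
    env_part = " ".join(f"{k}={shlex.quote(str(v))}" for k, v in env.items())
    args = ["python3", "imagenet.py", "train", "--verbose",
            f"--bucket={bucket}", f"--mname={mname}", f"--backend={backend}"]
    return "env " + env_part + " " + " ".join(shlex.quote(a) for a in args)

def ssh_command(host, workdir, command):
    """Wrap a command so it runs on `host` inside `workdir` with the venv active."""
    remote = f"cd {shlex.quote(workdir)} && . ./venv/bin/activate && {command}"
    return f"ssh {shlex.quote(host)} {shlex.quote(remote)}"
```

The resulting strings could then be passed to `background(...)` for the remote ranks, while rank 0 runs locally.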