When the BERT [1] model was introduced in 2018, it had about 110 million parameters for the base model and 345 million for the large model. The cross-lingual model XLM-R [2] has 550 million parameters. GPT-2-XL [3] has 1.5 billion parameters, while GPT-3 [4] has about 175 billion. The Text-To-Text Transfer Transformer (T5) [5] has up to 11 billion parameters. As models grow bigger and bigger, new challenges appear: training or fine-tuning takes more and more time; some models are too big to fit into a single GPU's memory; and even when a model does fit, the batch size must be set very small, because the parameters and gradients also have to be stored for the parameter update, and GPU memory runs out.
1. torch.multiprocessing.spawn
torch.multiprocessing.spawn creates the processes and passes the local rank, along with the other arguments, to each of them. You run python from the command line to start a single process, which then spawns the worker processes inside the code.
It is as simple as these 3 steps:
- initialize the process group with
dist.init_process_group
- wrap the model in
torch.nn.parallel.DistributedDataParallel
- distribute the data by
torch.utils.data.distributed.DistributedSampler
mp.spawn(main_worker, nprocs = args['nprocs'], args = (args['nprocs'], args))
Initialize the process group and move the model to each device:
dist.init_process_group(backend = 'nccl', init_method="tcp://127.0.0.1:23456", world_size=args['nprocs'], rank=local_rank)
torch.cuda.set_device(local_rank)
model.cuda(local_rank)
...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids = [local_rank])
Set up the sampler to distribute the data:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
Make sure all the processes are synchronized (for example, before the loss calculated on each device is reduced) with
torch.distributed.barrier()
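The way the sampler splits the data can be illustrated without any GPUs. The sketch below is a plain-Python model of what DistributedSampler does when shuffling is disabled: pad the index list so it divides evenly across processes, then take every num_replicas-th index starting at the rank. The helper name shard_indices is hypothetical, not a PyTorch API:

```python
# Plain-Python model of DistributedSampler's index sharding (shuffle off).
# shard_indices is a hypothetical helper for illustration -- not part of torch.
def shard_indices(dataset_len, num_replicas, rank):
    indices = list(range(dataset_len))
    # pad with wrapped-around indices so every rank gets the same count
    total = -(-dataset_len // num_replicas) * num_replicas  # ceiling division
    indices += indices[: total - dataset_len]
    # rank r takes indices r, r + num_replicas, r + 2 * num_replicas, ...
    return indices[rank::num_replicas]

# 10 samples across 4 processes: ranks 2 and 3 reuse indices 0 and 1 as padding
for r in range(4):
    print(r, shard_indices(10, 4, r))
# → 0 [0, 4, 8]
#   1 [1, 5, 9]
#   2 [2, 6, 0]
#   3 [3, 7, 1]
```

Every rank sees the same number of samples, which keeps the per-step all-reduce balanced across processes.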
2. torch.distributed.launch
torch.distributed.launch runs the Python code in a distributed fashion. It is similar to spawn above, but it launches all n processes as soon as the command is kicked off.
It follows the same three steps as above.
def main():
    ...
    main_worker(args['local_rank'], args['nprocs'], args)

def main_worker(local_rank, nprocs, args):
    ...
    dist.init_process_group(backend = 'nccl', init_method="tcp://127.0.0.1:23456", world_size=args['nprocs'], rank=local_rank)
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # each process handles its share of the global batch
    args['batch_size'] = int(args['batch_size'] / args['nprocs'])
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids = [local_rank])
    ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
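From the command line, the launcher starts all the processes at once. A sketch, assuming the code above is saved as train.py (a hypothetical filename) on a single machine with 4 GPUs; the launcher passes a distinct --local_rank argument to each process, which the script should parse:

```shell
python -m torch.distributed.launch --nproc_per_node=4 train.py
```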
3. apex
NVIDIA's apex not only provides parallel training, but also automatically manages the precision of the model parameters, optimizer states, and loss (mixed precision) to reduce GPU memory usage and improve training speed.
It includes these key steps:
- use apex to initialize the model and optimizer so that it can automatically manage the precision
- wrap the model in
apex.parallel.DistributedDataParallel(model)
- scale the loss
from apex import amp
...
dist.init_process_group(backend = 'nccl', init_method="tcp://127.0.0.1:23456", world_size=args['nprocs'], rank=local_rank)
torch.cuda.set_device(local_rank)
model.cuda(local_rank)
args['batch_size'] = int(args['batch_size'] / args['nprocs'])
...
model, optimizer = amp.initialize(model, optimizer)
model = apex.parallel.DistributedDataParallel(model)
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
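Why scale the loss at all? Small gradients can underflow to zero in half precision. The toy below models that effect in plain Python: the constant is fp16's approximate smallest positive normal value, and to_fp16 is a crude illustrative stand-in for a real half-precision cast, not how amp actually casts:

```python
# Toy illustration of why amp scales the loss before backward().
FP16_MIN_NORMAL = 6.104e-05  # approx. smallest positive normal fp16 value

def to_fp16(x):
    # crude stand-in for an fp16 cast: flush tiny magnitudes to zero
    return 0.0 if abs(x) < FP16_MIN_NORMAL else x

grad = 1e-6          # a small gradient produced during backprop
scale = 1024.0       # loss scale factor (a power of two, so scaling is exact)

unscaled = to_fp16(grad)          # stored directly in "fp16": underflows to 0.0
scaled = to_fp16(grad * scale)    # scaled first: 1.024e-3 survives the cast
recovered = scaled / scale        # unscale again before the optimizer step

print(unscaled)    # → 0.0
print(recovered)   # → 1e-06
```

Scaling the loss scales every gradient by the same factor, so dividing by the scale before the optimizer step recovers the true gradient, while the intermediate values stay representable.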
4. Horovod
Horovod does its parallel computing through an MPI-style ring all-reduce, which reduces data transfer and thus improves training speed.
It mainly includes these steps:
- init the process
hvd.init()
- broadcast the model parameters
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
- broadcast the optimizer
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
def main():
    ...
    hvd.init()
    args['local_rank'] = hvd.local_rank()
    ...
    main_worker(args['local_rank'], args['nprocs'], args)

def main_worker(local_rank, nprocs, args):
    ...
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    ...
    optimizer = torch.optim.AdamW(model.parameters(), args['learning_rate'], weight_decay=args['weight_decay'])
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=hvd.Compression.fp16, op=hvd.Average)
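The ring all-reduce pattern itself can be sketched in plain Python. The toy below (a hypothetical helper, no MPI involved) keeps the math small by giving each worker a vector with one chunk per worker: a reduce-scatter phase leaves each worker owning the complete sum of one chunk, then an all-gather phase circulates the finished chunks around the ring; the final division models hvd.Average:

```python
def ring_allreduce_average(worker_data):
    """Average per-worker gradient vectors with a ring all-reduce.
    Toy model: n workers, each vector split into n chunks of size 1."""
    n = len(worker_data)
    data = [list(v) for v in worker_data]
    # reduce-scatter: in step s, worker i sends chunk (i - s) % n to its
    # right neighbour, which adds it in; after n - 1 steps, worker i owns
    # the complete sum of chunk (i + 1) % n
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] += data[i][c]
    # all-gather: in step s, worker i forwards chunk (i + 1 - s) % n,
    # already a complete sum, to its right neighbour
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]
    # divide by the number of workers to get the average
    return [[x / n for x in v] for v in data]

grads = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
print(ring_allreduce_average(grads))
# every worker ends up with the elementwise average [2.0, 2.0, 2.0]
```

Each worker only ever talks to its neighbours and sends one chunk per step, so the traffic per worker stays roughly constant as the number of workers grows, which is what makes the ring pattern bandwidth-efficient.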
5. Hugging Face Accelerate
Hugging Face Accelerate is a wrapper around PyTorch that makes parallel training much easier.
It mainly needs these steps:
- send the model, optimizer, and data loader to accelerator.
- backpropagate with accelerator
accelerator = Accelerator(fp16=args['fp16'], cpu=args['cpu'])
model = model.to(accelerator.device)
optimizer= AdamW(params = model.parameters(), lr = lr, correct_bias = correct_bias)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)
...
accelerator.backward(loss)
6. Gradient accumulation
Gradient accumulation simulates a larger batch size when GPU memory can only hold a small one:
- split the mini-batch into n splits, where n is the number of gradient accumulation steps
- for each split, run the forward pass and backpropagate the (normalized) loss
- update the parameters only after the gradients of all n splits have accumulated
for step, batch in enumerate(train_dataloader):
    split_size = int(args['train_batch_size'] / args['gradient_accumulation_steps'])
    split_accum = zip(*[torch.split(x, split_size) for x in batch])
    for j, batch_split in enumerate(split_accum):
        batch_split = [r.to(device) for r in batch_split]
        sent_id, mask, labels = batch_split
        # get model predictions for the current batch_split
        predictions = model(sent_id, mask)
        # compute the loss between actual and predicted values
        loss = criterion(predictions, labels)
        # normalize the loss so the accumulated gradient matches the full batch
        if args['gradient_accumulation_steps'] > 1:
            loss = loss / args['gradient_accumulation_steps']
        # add on to the total loss
        total_loss = total_loss + loss.item()
        if args['fp16']:
            # apex.amp mixed precision
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args['max_grad_norm'])
        else:
            # backward pass to calculate and accumulate the gradients
            loss.backward()
            # clip the gradients to max_grad_norm to help prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), args['max_grad_norm'])
        # update parameters once every gradient_accumulation_steps splits
        if (j + 1) % args['gradient_accumulation_steps'] == 0:
            optimizer.step()
            # reset gradient tensors only after the update, so the
            # gradients of the intermediate splits are not wiped out
            model.zero_grad()
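A quick arithmetic check of the loss normalization above: dividing each split's loss by the number of accumulation steps makes the accumulated total equal the average over the whole mini-batch (plain Python, no PyTorch needed):

```python
# per-example losses in one mini-batch of 4, split into 2 accumulation steps
losses = [0.5, 0.25, 0.75, 0.5]
accum_steps = 2
splits = [losses[:2], losses[2:]]

# each split contributes its mean loss divided by accum_steps ...
accumulated = sum((sum(s) / len(s)) / accum_steps for s in splits)
# ... which matches the mean over the whole mini-batch
full_batch = sum(losses) / len(losses)

print(accumulated, full_batch)   # → 0.5 0.5
```

Since gradients are linear in the loss, the same argument applies to the accumulated gradients: after n normalized backward passes, the gradient buffers hold the same values a single full-batch backward pass would have produced.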