# GPT pretraining

Configuration:

- `transformers` version: 4.11.0.dev0
- Platform: Linux-4.15.0-124-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.9.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes (8 GPUs)
- Using distributed or parallel set-up in script?: yes (Torch DDP and DeepSpeed, 8 processes on a single node)
- Deepspeed version: 0.5.3

Docker image: see the [dockerfile](./dockerfile).

## Torch DDP

Training `gpt2` on WikiText-2 with PyTorch distributed data parallel (DDP): one process per GPU, 8 GPUs on a single node.
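
With data parallelism, each process consumes its own mini-batch, so the number of samples per optimizer step is larger than the per-device value passed on the command line. The sketch below works this out for the launch command in this section (8 processes, `--per_device_train_batch_size=4`). It assumes `gradient_accumulation_steps` keeps its default of 1, since the command does not set it; `effective_batch_size` is a hypothetical helper name, not part of any library.

```python
def effective_batch_size(num_processes, per_device_batch_size,
                         gradient_accumulation_steps=1):
    """Samples consumed per optimizer step across all data-parallel workers."""
    return num_processes * per_device_batch_size * gradient_accumulation_steps


# 8 GPUs, 4 samples per GPU, no gradient accumulation (assumed default).
print(effective_batch_size(num_processes=8, per_device_batch_size=4))
```

The same arithmetic applies to the DeepSpeed run, which also uses 8 GPUs and a per-device batch size of 4.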

In [1]:
LOG_DIR = "./models/gpt2-small"
!rm -rf $LOG_DIR
cmd = """python3 -m torch.distributed.launch --nproc_per_node=8 my_run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --eval_steps=10 \
    --logging_steps=10 \
    --save_steps=200 \
    --save_total_limit=1 \
    --fp16=true \
    --per_device_train_batch_size=4 \
    --output_dir {} \
    --num_train_epochs=1 \
    --overwrite_output_dir
""".format(LOG_DIR)
!$cmd

The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : my_run_clm.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelas
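
The warnings at the top of this log ask for two changes: launch through `torch.distributed.run` instead of `torch.distributed.launch`, and read the local rank from the `LOCAL_RANK` environment variable. A minimal sketch of both follows. The helper names `get_local_rank` and `build_run_cmd` are hypothetical, the argument list is shortened for illustration, and whether `my_run_clm.py` needs any change depends on how it parses its arguments, which is not shown here.

```python
import os
import shlex


def get_local_rank(default=0):
    """Read the per-node rank that the launcher exports as LOCAL_RANK."""
    return int(os.environ.get("LOCAL_RANK", default))


def build_run_cmd(script, script_args, nproc_per_node=8):
    """Build a launch command that uses torch.distributed.run."""
    return [
        "python3", "-m", "torch.distributed.run",
        "--nproc_per_node={}".format(nproc_per_node),
        script,
        *script_args,
    ]


# Shortened argument list; the full set of flags is in the cell above.
cmd = build_run_cmd("my_run_clm.py", ["--model_name_or_path", "gpt2", "--do_train"])
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list avoids the backslash continuations inside a triple-quoted string and makes it easy to pass to `subprocess.run` if preferred over the `!` shell escape.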

## DeepSpeed with CPU offloading

In [4]:
%%writefile deepspeed-gpt2-small-V100.config.json

{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "_reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Overwriting deepspeed-gpt2-small-V100.config.json
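
The config above sets `"stage": 2` but also carries options that, as far as the DeepSpeed documentation describes them, only take effect at ZeRO stage 3 (`offload_param` and the `stage3_*` keys). It also contains `_reduce_bucket_size`, whose leading underscore means it does not match the documented `reduce_bucket_size` option. The sketch below is a small sanity check for this kind of mismatch. The helper names are hypothetical, the inline config is a trimmed copy used for illustration, and treating underscore-prefixed keys as disabled or misspelled is an assumption about intent.

```python
import json


def stage3_only_keys(config):
    """List ZeRO options that only apply at stage 3 when another stage is set."""
    zero = config.get("zero_optimization", {})
    if zero.get("stage") == 3:
        return []
    return sorted(
        key for key in zero
        if key == "offload_param" or key.startswith("stage3_")
    )


def underscore_keys(config):
    """List ZeRO keys with a leading underscore (assumed disabled or misspelled)."""
    zero = config.get("zero_optimization", {})
    return sorted(key for key in zero if key.startswith("_"))


# Trimmed copy of the zero_optimization section, for illustration only.
sample = json.loads("""
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "offload_param": {"device": "cpu", "pin_memory": true},
        "_reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto"
    }
}
""")

print("stage-3-only keys under stage 2:", stage3_only_keys(sample))
print("underscore-prefixed keys:", underscore_keys(sample))
```

To check the real file, replace the inline string with `json.load(open("deepspeed-gpt2-small-V100.config.json"))`.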


In [5]:
LOG_DIR = "./models/gpt2-small-deepspeed-V100"
!rm -rf $LOG_DIR
cmd = """deepspeed --num_gpus=8 my_run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --eval_steps=10 \
    --logging_steps=10 \
    --save_steps=50 \
    --fp16=true \
    --per_device_train_batch_size=4 \
    --output_dir {} \
    --save_total_limit=1 \
    --num_train_epochs=1 \
    --overwrite_output_dir=true \
    --deepspeed=deepspeed-gpt2-small-V100.config.json
""".format(LOG_DIR)
! $cmd

[2021-09-23 04:09:45,896] [INFO] [runner.py:360:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 my_run_clm.py --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --eval_steps=10 --logging_steps=10 --save_steps=50 --fp16=true --per_device_train_batch_size=4 --output_dir ./models/gpt2-small-deepspeed-V100 --save_total_limit=1 --num_train_epochs=1 --overwrite_output_dir=true --deepspeed=deepspeed-gpt2-small-V100.config.json
[2021-09-23 04:09:46,545] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.10.3
[2021-09-23 04:09:46,545] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2021-09-23 04:09:46,545] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=8, node_rank=0
[2021-09-23 04:09:46,545] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost'
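
The `--world_info` value in the first log line is a base64-encoded JSON mapping from host name to GPU indices, which is why the launcher later prints it as `WORLD INFO DICT`. The sketch below decodes the value copied from the log; `decode_world_info` is a hypothetical helper name, and the use of URL-safe base64 is an assumption that happens to be indistinguishable from standard base64 for this particular string.

```python
import base64
import json


def decode_world_info(encoded):
    """Decode the base64-encoded JSON passed to the launcher as --world_info."""
    return json.loads(base64.urlsafe_b64decode(encoded))


# Value copied verbatim from the runner.py log line above.
world_info = decode_world_info(
    "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
)
print(world_info)
```

The decoded mapping lists GPUs 0 through 7 on `localhost`, matching `--num_gpus=8` and the `nnodes=1, num_local_procs=8` line in the log.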

In [2]:
!ls /usr/local

bin   cuda-10.2  games	  lib	     man   share
cuda  etc	 include  licensing  sbin  src
