# transformers + deepspeed CLI

This notebook demonstrates how to set up `transformers` + `deepspeed` on colab to be run as an external process.

You can of course use it under any notebook environment.

It's possible to run `transformers` + `deepspeed` inside the notebook as well: 

**XXX**: make another notebook with a demo that isn't CLI



## Setting up the correct environment

In order to run `transformers` with `deepspeed`, you need:
1. enough general RAM. Different users seem to get instances with different amounts of allocated general RAM. Try `!free -h`, and if your process gets killed, you have probably run out of memory. If you can't get enough memory you can turn `cpu_offload` off in `ds_config.json` below.
2. matching cuda versions. Your pytorch needs to be built with exactly the same cuda version as your system-wide installed cuda. This is normally not needed to run `pytorch` alone, but is needed for building CUDA extensions, like DeepSpeed. You will find full documentation [here](https://huggingface.co/transformers/main_classes/trainer.html#installation-notes).

Since we can't control which cuda version colab has, it can be tricky to find the matching pytorch version. So this notebook saves you time by showing all the versions you need to install.
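
To make the version-matching requirement concrete, here is a minimal sketch of how the two cuda versions could be compared from a notebook cell. The helper names (`parse_nvcc_release`, `cuda_versions_match`, `check_environment`) are made up for this illustration, and it assumes `nvcc` is on the `PATH` and prints its usual `release X.Y` line.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' CUDA release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(torch_cuda: Optional[str], system_cuda: Optional[str]) -> bool:
    """True only when both versions are known and identical (e.g. '11.0')."""
    return torch_cuda is not None and torch_cuda == system_cuda


def check_environment() -> None:
    """Compare the CUDA version torch was built with against the system nvcc."""
    import torch  # imported here so the helpers above work without torch

    # Assumption: nvcc is installed and reachable on PATH.
    nvcc_output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    system_cuda = parse_nvcc_release(nvcc_output)
    print(f"torch built with CUDA: {torch.version.cuda}")
    print(f"system nvcc CUDA:      {system_cuda}")
    if not cuda_versions_match(torch.version.cuda, system_cuda):
        print("Mismatch: building DeepSpeed's CUDA extensions may fail.")
```

Calling `check_environment()` after installing pytorch should print both versions side by side, so a mismatch is visible before you try to build DeepSpeed.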

This notebook will surely get outdated in time, so make sure you check for the latest version of it at https://github.com/stas00/porting/blob/master/transformers/deepspeed/ and please let me know if deepspeed stops building and the notebook needs to be updated.

As I mentioned earlier, if Deepspeed builds but the training gets killed, you got a colab instance with too little RAM. There is no need to contact me then, as there is nothing I can do about it.

In [9]:
# Free colab seems to give different users different amounts of RAM

!free -h

              total        used        free      shared  buff/cache   available
Mem:            25G        676M         18G        1.0M        6.6G         24G
Swap:            0B          0B          0B
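
If you prefer to check the memory from Python rather than reading the `free -h` table, the following sketch parses `/proc/meminfo` directly. The function names are hypothetical helpers written for this example, and it assumes a Linux instance (as on colab) where `/proc/meminfo` exists.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: KiB}, keeping only 'kB' rows."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(meminfo: Dict[str, int]) -> float:
    """Return MemAvailable in GiB, falling back to MemFree if it is missing."""
    kib = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return kib / (1024 ** 2)


def report_memory(path: str = "/proc/meminfo") -> None:
    """Print how much general RAM is currently available on this instance."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    print(f"available RAM: {available_gib(info):.1f} GiB")
```

Running `report_memory()` before launching training gives a single number to compare between colab sessions.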


In [2]:
# need to match the system-wide installed cuda-11 for deepspeed to compile
# so install the matching pytorch

# pt-1.8 doesn't have cu110 build at the moment. 
# colab will eventually upgrade to cuda-11.1 and then we can use 1.8.0+cu111
# 
!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
#!pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu110
  Downloading https://download.pytorch.org/whl/cu110/torch-1.7.1%2Bcu110-cp37-cp37m-linux_x86_64.whl (1156.8MB)

In [3]:
# either install the release
#!pip install deepspeed
# or the master 
!pip install git+https://github.com/microsoft/deepspeed

Collecting git+https://github.com/microsoft/deepspeed
  Cloning https://github.com/microsoft/deepspeed to /tmp/pip-req-build-i3k2b31c
  Running command git clone -q https://github.com/microsoft/deepspeed /tmp/pip-req-build-i3k2b31c
  Running command git submodule update --init --recursive -q
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.3.13+22d5a1f-cp37-none-any.whl size=341169 sha256=aa960aa10b43e7ed9e3bec1a4c9c181b8f94e1a7b7cd8edc48e3a995e67c4598
  Stored in directory: /tmp/pip-ephem-wheel-cache-9vyb3bp_/wheels/33/7c/6d/1ac44092dd4e4b5ddd1dec9474fed46ec3fe5588be7b6ffe9e
Successfully built deepspeed


In [10]:
%%bash
git clone https://github.com/huggingface/transformers
cd transformers
# examples change a lot so let's pick a sha that we know this notebook will work with
# comment out/remove the next line if you want the master
git checkout 1c06240e1b34777281
pip install -e .
pip install -r examples/_tests_requirements.txt

# if needed free up some space used by cached pip packages
# rm -rf /root/.cache/pip


Obtaining file:///content/transformers
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Installing collected packages: transformers
  Found existing installation: transformers 4.4.0.dev0
    Can't uninstall 'transformers'. No files were found to uninstall.
  Running setup.py develop for transformers
Successfully installed transformers


fatal: destination path 'transformers' already exists and is not an empty directory.
Previous HEAD position was 1aa9c13f7 Fix GPU tests with speech
HEAD is now at 1c06240e1 Update training args ignore_skip_data -> ignore_data_skip (#10891)


In [11]:
%%bash

cd transformers

cat <<'EOT' > ds_config.json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,        
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
EOT
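
If training gets killed for lack of general RAM, `cpu_offload` can be switched off without retyping the whole file. Below is a rough sketch of one way to do that; `set_cpu_offload` and `rewrite_config` are names invented for this example, and the default path assumes the config was written to `transformers/ds_config.json` as in the cell above.

```python
import copy
import json
from typing import Any, Dict


def set_cpu_offload(config: Dict[str, Any], enabled: bool) -> Dict[str, Any]:
    """Return a copy of a DeepSpeed config with zero_optimization.cpu_offload set."""
    updated = copy.deepcopy(config)
    updated.setdefault("zero_optimization", {})["cpu_offload"] = enabled
    return updated


def rewrite_config(path: str = "transformers/ds_config.json") -> None:
    """Load the config file, turn cpu_offload off, and write it back in place."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(set_cpu_offload(config, False), f, indent=4)
```

Note that round-tripping through `json` rewrites values such as `2e8` as `200000000.0`; the number is the same, only its spelling changes.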


In [6]:
#!ls -l transformers
#!cat transformers/ds_config.json

## Running Training + Evaluation CLI style
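
For orientation, the command in this section limits training to 2000 samples with a per-device batch size of 16 on a single GPU for one epoch. The sketch below estimates how many optimizer steps that amounts to, which helps when judging values such as `--logging_steps 1000` and `--warmup_steps 500` against the length of the run. The function `optimizer_steps` is a hypothetical helper, and it assumes no gradient accumulation and that the last partial batch is kept.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch_size: int,
                    num_gpus: int = 1, grad_accum_steps: int = 1,
                    epochs: int = 1) -> int:
    """Estimate the number of optimizer steps, keeping the final partial batch."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_gpus))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


# Values taken from the command in this section: 2000 samples, batch size 16.
print(optimizer_steps(2000, 16))  # 125
```

Raising `--max_train_samples` or lowering the batch size changes this count accordingly.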

In [13]:
!cd transformers; export BS=16; rm -rf output_dir; \
PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 examples/seq2seq/run_translation.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --evaluation_strategy=steps \
--do_train --do_eval --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 \
--max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir  \
--per_device_train_batch_size $BS --per_device_eval_batch_size $BS --predict_with_generate --sortish_sampler \
--val_max_target_length 128 --warmup_steps 500 --max_train_samples 2000 --max_val_samples 500 \
--dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " --deepspeed ds_config.json --fp16

[2021-03-25 03:13:20,908] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 examples/seq2seq/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --evaluation_strategy=steps --do_train --do_eval --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --predict_with_generate --sortish_sampler --val_max_target_length 128 --warmup_steps 500 --max_train_samples 2000 --max_val_samples 500 --dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro --source_prefix translate English to Romanian:  --deepspeed ds_config.json --fp16
[2021-03-25 03:13:22,193] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2021-03-25 03:13:22,194] [INFO] [launch.