# DeepSpeed

[Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed)

[Reference blog: Tuning large models with DeepSpeed](https://velog.io/@seoyeon96/DeepSpeed%EB%A1%9C-%ED%81%B0-%EB%AA%A8%EB%8D%B8-%ED%8A%9C%EB%8B%9D%ED%95%98%EA%B8%B0)

## DeepSpeed Environment Setup

In [None]:
# Runtime: T4 GPU
!nvidia-smi

Wed Apr 10 16:50:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install deepspeed
# Alternatively, clone the source with "!git clone https://github.com/microsoft/DeepSpeed" and install it manually.

In [None]:
# Report which DeepSpeed ops are compatible with the current environment
!ds_report

[2024-04-10 16:50:31,146] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... 

In [None]:
# CUDA version check
!nvcc --version
# 12.2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
# Check the compute capability to use for TORCH_CUDA_ARCH_LIST
%env CUDA_VISIBLE_DEVICES=0
!python -c "import torch; print(torch.cuda.get_device_capability())"

(7, 5)
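
The `(7, 5)` tuple above is the compute capability of the T4, and `TORCH_CUDA_ARCH_LIST` expects it in dotted form (`7.5`). A minimal sketch of that conversion follows. `to_arch_list` is a hypothetical helper name, not part of PyTorch or DeepSpeed.

```python
def to_arch_list(capabilities):
    """Format (major, minor) compute capabilities for TORCH_CUDA_ARCH_LIST.

    Hypothetical helper for illustration; not a PyTorch or DeepSpeed API.
    """
    # De-duplicate and sort so the result is stable on multi-GPU machines.
    unique = sorted(set(capabilities))
    # Several architectures are joined with semicolons, e.g. "7.5;8.0".
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Value reported by torch.cuda.get_device_capability() in the cell above.
arch_list = to_arch_list([(7, 5)])
print(arch_list)
```

The resulting string can then be passed as `TORCH_CUDA_ARCH_LIST` when building the DeepSpeed ops.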


**Pre-installing DeepSpeed ops**

`DS_BUILD_CPU_ADAM=1`: builds the CPUAdam op ahead of time instead of JIT-compiling it at runtime.

In [None]:
%%bash
# Run this from the root of a cloned DeepSpeed source tree; `pip install .` fails elsewhere.
TORCH_CUDA_ARCH_LIST="7.5" DS_BUILD_CPU_ADAM=1  DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.


In [None]:
%env CUDA_LAUNCH_BLOCKING=1

In [None]:
# Install the mpi4py package for parallel (MPI-based) execution
!pip install mpi4py



**Reference material on configuring the model used with DeepSpeed**

[Fine-Tuning Llama-2 LLM on Google Colab: A Step-by-Step Guide.](https://medium.com/@csakash03/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-cf7bb367e790)

In [None]:
# Download the model requirements
!pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q trl xformers wandb datasets einops gradio sentencepiece bitsandbytes

In [None]:
!pip install git+https://github.com/huggingface/transformers.git@main accelerate bitsandbytes

## Getting Started

In [None]:
# Initialize the distributed backend used by DeepSpeed
import deepspeed
deepspeed.init_distributed()

[2024-04-10 17:00:59,791] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-10 17:01:01,304] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-10 17:01:01,305] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-04-10 17:01:01,679] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.28.0.12, master_port=29500
[2024-04-10 17:01:01,681] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl


In [None]:
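# DeepSpeed Configuration
cmd_args = {
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": True
  },
  # A bare boolean is the legacy form; DeepSpeed asks for a dict such as {"stage": 1}
  # (see the message printed by deepspeed.initialize further down).
  "zero_optimization": True,
  # Loss-scale settings are documented under the "fp16" block; this top-level key
  # does not appear to be picked up (the training log starts from a different scale).
  "loss_scale": 2**20
}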

In [None]:
# BART model import
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig, PretrainedConfig, AutoModel, AutoModelForPreTraining, AutoTokenizer, AutoModelForSeq2SeqLM

model_name_or_path = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name_or_path)
model = BartForConditionalGeneration.from_pretrained(model_name_or_path)

In [None]:
print(cmd_args)

{'train_batch_size': 8, 'gradient_accumulation_steps': 1, 'optimizer': {'type': 'Adam', 'params': {'lr': 0.00015}}, 'fp16': {'enabled': True}, 'zero_optimization': True, 'loss_scale': 1048576}
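
The config printed above passes `zero_optimization` as a bare boolean and `loss_scale` at the top level. DeepSpeed's config schema takes `zero_optimization` as a dictionary with a `stage` key and keeps loss-scale options inside the `fp16` block. A minimal sketch in that shape follows. ZeRO stage 1, an initial loss scale of 2**20 via `fp16.initial_scale_power`, and `gradient_clipping: 1.0` are illustrative assumptions, not settings from the original run, and `check_batch_size` is a hypothetical helper.

```python
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 0.00015}},
    "fp16": {
        "enabled": True,
        # Initial dynamic loss scale is 2**initial_scale_power (assumed: 20).
        "initial_scale_power": 20,
    },
    # Dictionary form of the ZeRO settings; stage 1 is an assumption.
    "zero_optimization": {"stage": 1},
    # Gradient clipping handled by DeepSpeed itself (assumed threshold).
    "gradient_clipping": 1.0,
}


def check_batch_size(config, micro_batch_per_gpu, world_size):
    """Hypothetical helper: verify the batch-size relation DeepSpeed expects.

    train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size
    """
    expected = (
        micro_batch_per_gpu
        * config["gradient_accumulation_steps"]
        * world_size
    )
    return config["train_batch_size"] == expected


# Single T4 GPU, so world_size is 1 and the micro-batch must be 8.
print(check_batch_size(ds_config, micro_batch_per_gpu=8, world_size=1))
```

A dict like this could be handed to `deepspeed.initialize` in place of `cmd_args`.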


In [None]:
import deepspeed

# Initialize the DeepSpeed engine.
# Calling initialize() without any config raises:
#   AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file
# Passing the config dict directly through `config` (older alias: `config_params`)
# removes the need for a config file.
model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                     model_parameters=model.parameters(),
                                                     config=cmd_args)

[2024-04-10 17:02:09,866] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.1+63029e8f, git-hash=63029e8f, git-branch=master
ZeRO optimization should be enabled as:
"session_params": {
  "zero_optimization": {
    "stage": [0|1|2],
    "stage3_max_live_parameters" : 1000000000,
    "stage3_max_reuse_distance" : 1000000000,
    "allgather_partitions": [true|false],
    "use_multi_rank_bucket_allreduce": [true|false],
    "allgather_bucket_size": 500000000,
    "reduce_scatter": [true|false],
    "contiguous_gradients" : [true|false]
    "overlap_comm": [true|false],
    "reduce_bucket_size": 500000000,
    "load_from_fp32_weights": [true|false],
    "cpu_offload": [true|false] (deprecated),
    "cpu_offload_params" : [true|false] (deprecated),
    "cpu_offload_use_pin_memory": [true|false] (deprecated),
    "sub_group_size" : 1000000000000,
    "offload_param": {...},
    "offload_optimizer": {...},
    "ignore_unused_parameters": [true|false],
    "round_robin_gradi

## Training

In [None]:
from datasets import load_dataset

# Load the IMDb dataset.
imdb_dataset = load_dataset("imdb")

In [None]:
train_dataset = imdb_dataset['train']
test_dataset = imdb_dataset['test']

In [None]:
train_dataset[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_engine.to(device)

DeepSpeedEngine(
  (module): BartForConditionalGeneration(
    (model): BartModel(
      (shared): Embedding(50264, 1024, padding_idx=1)
      (encoder): BartEncoder(
        (embed_tokens): Embedding(50264, 1024, padding_idx=1)
        (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
        (layers): ModuleList(
          (0-11): 12 x BartEncoderLayer(
            (self_attn): BartSdpaAttention(
              (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096,

In [None]:
model_engine.train()

In [None]:
from torch.nn.utils import clip_grad_norm_

for step, batch in enumerate(train_dataset):
    text = batch["text"]

    # Tokenize the text and move the encoded tensors to the training device.
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Forward pass; the input ids double as the labels.
    outputs = model_engine(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss

    # Backward pass through the DeepSpeed engine, followed by gradient clipping.
    # Note: with fp16 enabled the gradients are still loss-scaled at this point,
    # so manual clipping acts on scaled values; DeepSpeed also offers a
    # "gradient_clipping" config key for this purpose.
    model_engine.backward(loss)
    clip_grad_norm_(model_engine.parameters(), max_norm=1.0)  # choose a threshold that suits the task

    # Let DeepSpeed apply the weight update.
    model_engine.step()

    # Print the loss at regular intervals.
    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item()}")

model_engine.eval()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[2024-04-10 17:02:23,108] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Step 0, Loss: 0.37255859375
[2024-04-10 17:02:23,284] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-04-10 17:02:23,454] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-04-10 17:02:23,614] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-04-10 17:02:23,797] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-04-10 17:02:23,963] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1342
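
The OVERFLOW lines above show the dynamic loss scale being halved each time a step is skipped. A simplified simulation of that behaviour follows, assuming a starting scale of 2**32 (the first value in the log) and a factor of 2. `LossScaleSketch` is illustrative only and is not DeepSpeed's implementation.

```python
class LossScaleSketch:
    """Toy model of dynamic loss scaling: shrink the scale on overflow."""

    def __init__(self, init_scale=2.0**32, factor=2.0, min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.min_scale = min_scale

    def update(self, overflow):
        """Return True if the optimizer step should be applied."""
        if overflow:
            # Gradients overflowed in fp16: reduce the scale and skip the step.
            self.scale = max(self.scale / self.factor, self.min_scale)
            return False
        return True


scaler = LossScaleSketch()
for _ in range(5):
    scaler.update(overflow=True)

# 2**32 halved five times is 2**27 = 134217728, the value the fifth
# OVERFLOW line in the log reports as the reduced scale.
print(int(scaler.scale))
```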

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
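
One likely cause of the device-side assert is the tokenizer warning at the top of this output. Without `max_length`, truncation is disabled, so long reviews can exceed the 1024 positions that `BartLearnedPositionalEmbedding(1026, 1024)` provides (1026 rows minus an offset of 2) and index past the embedding table. A hedged sketch of a fix follows, assuming a limit of 1024 tokens. `encode_for_bart` is a hypothetical helper, and `tokenizer` stands for the `BartTokenizer` loaded earlier.

```python
# Assumed from the (1026, 1024) positional embedding minus its offset of 2.
BART_MAX_POSITIONS = 1024


def encode_for_bart(tokenizer, text, max_length=BART_MAX_POSITIONS):
    """Tokenize `text`, truncating it to at most `max_length` tokens.

    Hypothetical helper; `tokenizer` is any Hugging Face tokenizer callable.
    """
    return tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        # An explicit limit keeps position ids inside the embedding table.
        max_length=max_length,
    )
```

In the training loop this would replace the direct `tokenizer(...)` call, for example `inputs = encode_for_bart(tokenizer, batch["text"])`.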
