# 02 Fine-tuning
- In this step, we run fine-tuning with the translated dataset on Amazon SageMaker.

### Check the SageMaker environment

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs
# SageMaker creates this bucket automatically if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


### Create the configuration file for FSDP QLoRA training of the Llama 3 8B model

In [None]:
import os
config_folder_name = "accelerator_config"
os.makedirs(config_folder_name, exist_ok=True)

In [None]:
%%writefile accelerator_config/sm_llama_3_8b_fsdp_qlora.yaml
# script parameters
model_id:  "meta-llama/Meta-Llama-3-8B" # Hugging Face model id
max_seq_len:  2048              # max sequence length for model and packing of the dataset
# sagemaker specific parameters
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
validation_dataset_path: "/opt/ml/input/data/validation/" # path to where SageMaker saves validation dataset
test_dataset_path: "/opt/ml/input/data/test/"   # path to where SageMaker saves test dataset
output_dir: "/tmp/llama3"            # where the LoRA adapter weights are saved
# training parameters
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
###########################             
# For Debug
###########################             
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps to accumulate gradients before each optimizer update
###########################             
# For evaluation
###########################             
# num_train_epochs: 3                    # number of training epochs
# per_device_train_batch_size: 16         # batch size per device during training
# per_device_eval_batch_size: 8          # batch size for evaluation
# gradient_accumulation_steps: 2         # number of steps to accumulate gradients before each optimizer update
###########################             
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: true                             # use tf32 precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload" # remove offload if there is enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"

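1. Model and dataset settings:
   - Specify the model ID and the maximum sequence length.
   - Define the paths where SageMaker places the training, validation, and test datasets.

2. Training parameters:
   - Specify the learning rate, scheduler type, number of epochs, batch sizes, and related values.
   - Set the gradient accumulation steps, maximum gradient norm, and warmup ratio.
   - The active values under `For Debug` are for a quick test run; the commented-out values under `For evaluation` are for a full run.

3. Optimizer and precision settings:
   - Use the AdamW optimizer (`adamw_torch`).
   - Enable BFloat16 and TF32 precision.

4. Memory optimization:
   - Enable gradient checkpointing to reduce memory usage.

5. FSDP settings:
   - Set the FSDP mode to "full_shard auto_wrap offload" for distributed training: `full_shard` shards parameters, gradients, and optimizer states across GPUs, `auto_wrap` wraps layers into FSDP units automatically, and `offload` moves parameters and gradients to CPU memory.
   - Specify the FSDP-specific options (`backward_prefetch`, `forward_prefetch`, `use_orig_params`).

6. Logging and checkpointing:
   - Report metrics to TensorBoard.
   - Define the logging interval and the checkpoint saving strategy.

This configuration file combines memory and speed optimizations (bf16/TF32 precision, gradient checkpointing, CPU offload) with FSDP distributed training settings to train a large language model efficiently.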
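To see what the batch-size settings above mean in practice, here is a minimal sketch that computes the effective (global) batch size. This is the number of samples behind each optimizer update across all GPUs, under the assumption that every GPU is one data-parallel FSDP rank. The GPU counts are the published specs for the instance types listed in this notebook. The helper names are hypothetical. `warmup_steps` assumes that Hugging Face `TrainingArguments` converts `warmup_ratio` into steps by rounding `total_steps * warmup_ratio` up.

```python
import math

# GPUs per instance for the instance types listed in this notebook
GPUS_PER_INSTANCE = {
    "ml.g5.4xlarge": 1,    # 1 x NVIDIA A10G
    "ml.g5.12xlarge": 4,   # 4 x NVIDIA A10G
    "ml.g5.48xlarge": 8,   # 8 x NVIDIA A10G
    "ml.p4d.24xlarge": 8,  # 8 x NVIDIA A100
}


def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps,
                         instance_type, instance_count=1):
    """Samples contributing to one optimizer update across all GPUs (hypothetical helper)."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    return per_device_train_batch_size * gradient_accumulation_steps * world_size


def warmup_steps(total_update_steps, warmup_ratio=0.03):
    """Warmup steps derived from warmup_ratio, rounded up (assumed Trainer behavior)."""
    return math.ceil(total_update_steps * warmup_ratio)


# Debug values from the YAML on a single ml.g5.4xlarge
print(effective_batch_size(1, 2, "ml.g5.4xlarge"))    # 2
# Commented-out "For evaluation" values on ml.g5.12xlarge
print(effective_batch_size(16, 2, "ml.g5.12xlarge"))  # 128
```

With the debug values, each update sees only 2 samples. The full-run values on `ml.g5.12xlarge` give 128 (16 × 2 × 4 GPUs). Keep this in mind when reusing the same learning rate on a different instance type.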

#### Upload the config file to S3

In [None]:
from sagemaker.s3 import S3Uploader

def upload_data_s3(desired_s3_uri, file_name, verbose=True):
    """Upload a local file (here, the training config YAML) to S3 and return its S3 URI."""
    file_s3_path = S3Uploader.upload(local_path=file_name, desired_s3_uri=desired_s3_uri)

    # Only print the destination when verbose is enabled
    if verbose:
        print(f"{file_name} is uploaded to:")
        print(file_s3_path)

    return file_s3_path

In [None]:
input_path = f's3://{sess.default_bucket()}/'
config_desired_s3_uri = f"{input_path}config"
config_model_name = "accelerator_config/sm_llama_3_8b_fsdp_qlora.yaml"
train_config_s3_path = upload_data_s3(desired_s3_uri=config_desired_s3_uri, file_name=config_model_name, verbose=True)

In [None]:
run_debug_sample = True

s3_data = {
    'train': f"{input_path}data/train/ko_train_dataset.json",        # S3 URIs always use "/" separators
    'validation': f"{input_path}data/test/ko_test_dataset.json",
    'config': train_config_s3_path
}
s3_data    
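In the default File input mode, SageMaker copies each input channel to `/opt/ml/input/data/<channel_name>/` inside the training container. That is why the YAML points at `/opt/ml/input/data/train/` and why the `config` hyperparameter passed to the estimator later in this notebook points at `/opt/ml/input/data/config/...`.

The sketch below predicts those local paths from a channel dictionary such as `s3_data`. It uses hypothetical helper names and a made-up bucket, and it assumes each URI names a single object; a prefix URI would bring over the whole tree instead. It uses `posixpath` so the separators are always `/`, regardless of the local OS.

```python
import posixpath

# Root under which SageMaker (File input mode) places each channel's data
CONTAINER_INPUT_ROOT = "/opt/ml/input/data"


def container_path(channel_name, s3_uri):
    """Expected local path of a single S3 object inside the container (hypothetical helper)."""
    file_name = posixpath.basename(s3_uri)
    return posixpath.join(CONTAINER_INPUT_ROOT, channel_name, file_name)


def expected_container_paths(channels):
    """Map every channel name to the local path its object should appear at."""
    return {name: container_path(name, uri) for name, uri in channels.items()}


example_channels = {
    "train": "s3://my-bucket/data/train/ko_train_dataset.json",          # hypothetical bucket
    "config": "s3://my-bucket/config/sm_llama_3_8b_fsdp_qlora.yaml",
}
for name, path in expected_container_paths(example_channels).items():
    print(f"{name}: {path}")
```

Before launching a job, you can cheaply check that `expected_container_paths(s3_data)["config"]` matches the estimator's `config` hyperparameter.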

### SageMaker를 활용해 모델 트레이닝하기

In [None]:
instance_type = 'ml.g5.4xlarge'
# instance_type = 'ml.g5.12xlarge'
# instance_type = 'ml.g5.48xlarge'
# instance_type = 'ml.p4d.24xlarge'
# Example log output from the training script:
# {'train_runtime': 37.2985, 'train_samples_per_second': 0.375, 'train_steps_per_second': 0.054, 'train_loss': 2.3541293144226074, 'epoch': 1.0}
# {'eval_loss': 2.50766658782959, 'eval_runtime': 3.4741, 'eval_samples_per_second': 3.454, 'eval_steps_per_second': 0.864, 'epoch': 1.0}
metric_definitions=[
{"Name": "train:loss", "Regex": "'train_loss':(.*?),"},
{"Name": "validation:loss", "Regex": "'eval_loss':(.*?),"}
]
instance_count = 1
sagemaker_session = sagemaker.session.Session()
data = s3_data
nKeepAliveSeconds = 3600 # keep the instance in a SageMaker warm pool for 1 hour after the job ends
print(f"## Cloud mode: {instance_count} x {instance_type}")
print("dataset: \n", data)

- Set the instance type, metric definitions, and data locations used for training.
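SageMaker scans the job's logs with the `Regex` of each `metric_definitions` entry and records the captured group as the metric value. The sketch below approximates this locally with Python's `re` module so you can check the patterns before starting a job. SageMaker's own regex engine may differ in edge cases. The sketch uses the sample log lines from the instance-setup cell, and the helper name is hypothetical.

```python
import re

# Same patterns as metric_definitions in the cell above
definitions = [
    {"Name": "train:loss", "Regex": "'train_loss':(.*?),"},
    {"Name": "validation:loss", "Regex": "'eval_loss':(.*?),"},
]

# Sample log lines copied from the comment in the instance-setup cell
sample_logs = [
    "{'train_runtime': 37.2985, 'train_samples_per_second': 0.375, 'train_steps_per_second': 0.054, 'train_loss': 2.3541293144226074, 'epoch': 1.0}",
    "{'eval_loss': 2.50766658782959, 'eval_runtime': 3.4741, 'eval_samples_per_second': 3.454, 'eval_steps_per_second': 0.864, 'epoch': 1.0}",
]


def extract_metrics(lines, metric_defs):
    """Return {metric name: [values]} for every line where the metric's regex matches."""
    found = {d["Name"]: [] for d in metric_defs}
    for line in lines:
        for d in metric_defs:
            match = re.search(d["Regex"], line)
            if match:
                # The captured text keeps the space after ':'; float() ignores it
                found[d["Name"]].append(float(match.group(1)))
    return found


print(extract_metrics(sample_logs, definitions))
# {'train:loss': [2.3541293144226074], 'validation:loss': [2.50766658782959]}
```

`train_loss` appears only in the summary line that the Hugging Face Trainer prints at the end of training, so `train:loss` yields a single value per job. To also chart the periodic `{'loss': ...}` lines, you could add a pattern such as `'loss':\s*([0-9.e+-]+)`. This is a hypothetical addition that has not been tested against SageMaker; it does not match `'train_loss'` or `'eval_loss'`.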

In [None]:
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

import time
# Define the training job name prefix (SageMaker appends a timestamp to base_job_name)
job_name = f'llama3-8b-text2sql-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# chkpt_s3_path = f's3://{sess.default_bucket()}/{s3_prefix}/native/checkpoints'

# create the Estimator
os.environ['USE_SHORT_LIVED_CREDENTIALS']="1" 
huggingface_estimator = HuggingFace(
    entry_point          = 'run_fsdp_qlora_llama3.py',      # train script
    source_dir           = '../scripts',  # directory which includes all the files needed for training
    instance_type        = instance_type,  # instances type used for the training job
    instance_count       = instance_count,                 # the number of instances used for training
    sagemaker_session    = sagemaker_session,
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # prefix for the training job name
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 256,               # the size of the EBS volume in GB
    transformers_version = '4.36.0',          # the transformers version used in the training job
    pytorch_version      = '2.1.0',           # the PyTorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    metric_definitions = metric_definitions,
    hyperparameters      =  {
        "config": "/opt/ml/input/data/config/sm_llama_3_8b_fsdp_qlora.yaml" # container path of the config YAML delivered via the 'config' channel
    },
    disable_output_compression = True,        # skip compressing the output to save training time and cost
    keep_alive_period_in_seconds = nKeepAliveSeconds,     # warm pool 
    distribution={"torch_distributed": {"enabled": True}},   # enables torchrun
    environment  = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", # set env variable to cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),       # huggingface token to access gated models, e.g. llama 3
        "ACCELERATE_USE_FSDP": "1",             # enable FSDP
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"   # enable CPU RAM efficient loading
    }, 
)

- Use the Hugging Face Estimator to train the model on Amazon SageMaker; the same estimator can later be used to deploy it.
- Training takes about 8 hours on an `ml.g5.4xlarge` instance.
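SageMaker passes the entries of `hyperparameters` to the entry point as command-line arguments. As a result, `run_fsdp_qlora_llama3.py` receives `--config /opt/ml/input/data/config/sm_llama_3_8b_fsdp_qlora.yaml`.

This notebook does not show the script, and the script may parse its arguments differently, for example with TRL's own parser. The hypothetical sketch below only shows how an `argparse`-based entry point could read that argument.

```python
import argparse


def parse_entry_point_args(argv=None):
    """Hypothetical minimal argument parsing for an entry point like run_fsdp_qlora_llama3.py."""
    parser = argparse.ArgumentParser()
    # Matches the "config" key in the estimator's hyperparameters
    parser.add_argument("--config", type=str, required=True,
                        help="Path to the YAML training config inside the container")
    # Keep any extra arguments instead of failing on them
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_entry_point_args(
    ["--config", "/opt/ml/input/data/config/sm_llama_3_8b_fsdp_qlora.yaml"]
)
print(args.config)
```

The sketch uses `parse_known_args` so that an unexpected extra argument does not abort a long-running job at startup.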

In [None]:
from sagemaker.experiments.run import Run
from sagemaker.utils import unique_name_from_base

experiment_name = "text2sql"
# Unique run name so that each execution of this cell gets its own run
run_name = unique_name_from_base("training-job-experiment")
print(f"experiment_name: {experiment_name}")

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sagemaker_session) as run:
    # wait=False starts the job asynchronously; stream its logs with logs() in the next cell
    huggingface_estimator.fit(data, wait=False)

In [None]:
huggingface_estimator.logs()