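# Fine-tuning with LLaMA-Factory on SageMaker - Multi-Node
# 4. Distributed training with a SageMaker Training Job
- This example uses DeepSpeed for distributed training across multiple nodes and multiple GPUs.
- If the instance count is set to 1, training runs on a single node with multiple GPUs.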
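
Inside a SageMaker training container, the cluster layout is exposed through the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables. The following is a minimal sketch of how a launcher could derive the node count, node rank, and master address from them. It is not the notebook's actual entry script; `node_layout` is a hypothetical helper introduced for illustration.

```python
import json
import os


def node_layout(env=None):
    """Derive multi-node launch parameters from SageMaker's environment.

    Hypothetical helper for illustration; the real entry script may differ.
    """
    env = os.environ if env is None else env
    # SM_HOSTS is a JSON list such as '["algo-1", "algo-2"]'.
    hosts = sorted(json.loads(env.get("SM_HOSTS", '["localhost"]')))
    current = env.get("SM_CURRENT_HOST", hosts[0])
    return {
        "num_nodes": len(hosts),
        "node_rank": hosts.index(current),
        "master_addr": hosts[0],  # every node picks the same first host
    }


# Demo with a fake two-node environment instead of the real os.environ
demo_env = {"SM_HOSTS": '["algo-1", "algo-2"]', "SM_CURRENT_HOST": "algo-2"}
layout = node_layout(demo_env)
print(layout)
```

With a single instance the same function reports one node with rank 0, which corresponds to the single-node multi-GPU case.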

## Install Prerequisites

In [1]:
%pip install -Uq sagemaker boto3 datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import glob
import boto3
import pprint
from tqdm import tqdm
import sagemaker
from sagemaker.collection import Collection
from sagemaker.utils import name_from_base

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
sm_client = boto3.client('sagemaker', region_name=region)

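## Prepare the datasets
### Dataset 1. Download the ruozhiba dataset from Hugging Face
- The dataset has nearly 5k samples. For this experiment a small subset is enough for training; the training config below caps it with `max_samples`.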

In [4]:
from datasets import load_dataset
from random import randrange
dataset_name = "hfl/ruozhiba_gpt4"
# Load dataset from the hub
train_dataset = load_dataset(dataset_name, split="train",revision='41d2c61beb86c8d4c61916cc656c39d018c40ce5')

print(f"Training size: {len(train_dataset)}")
print("\nTraining sample:\n")
print(train_dataset[randrange(len(train_dataset))])

Training size: 4898

Training sample:

{'input': '', 'instruction': '大家有没有发现，无论考试还是啥，每次都是第一后面接着就是第二，第三名为什么从来就没有拿第二名或者第一名过，一定是有黑幕', 'output': '这个问题可能是一个误解。每次考试或比赛的排名是根据成绩或表现依次排列的，第一名之后自然是第二名，第三名之后紧接着是第四名。这是一个简单的逻辑顺序，没有什么黑幕。至于某个人能否拿不同的名次，取决于他的表现和成绩。如果一个人一直在某个名次上下波动，这是正常的现象。每次考试或比赛都有不同的参与者和变数，结果可能会不同。所以，第三名的人没有拿过第二名或者第一名，只是因为在那次特定的情况下他的表现决定了名次，而非有黑幕的存在。'}


{'output': '这个问题是针对《海绵宝宝》中的蟹老板和他的快餐店“蟹堡王”的情节。在动画中，蟹黄堡是蟹堡王里最受欢迎的食品，但是具体配方是保密的。虽然蟹老板是一只蟹，但是该动画并未明确指出蟹黄堡是使用蟹肉制作的。事实上，这更多是一种戏剧化的设定，用来增加情节的幽默感。因此，从字面上来理解蟹老板制作蟹黄堡吃掉同类是不正确的，这只是一种讽刺的艺术表现手法，旨在娱乐观众而已。', 'input': '', 'instruction': '为什么蟹老板做蟹黄堡蚕食它的同类？'}


### Dataset 2. Identity dataset
```json
[
  {
    "instruction": "hi",
    "input": "",
    "output": "Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
  },
  {
    "instruction": "hello",
    "input": "",
    "output": "Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
  },
  {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
  }
]
```
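Replace `name` and `author` with your own values. After fine-tuning, when the model is asked identity questions such as "Who are you, and who created you?", it will answer with these new values.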

In [5]:
def format_identity(origin_obj,name,author):
    ret = []
    for ele in origin_obj:
        ele['output'] = ele['output'].replace("{{name}}",name).replace("{{author}}",author)
        ret.append(ele)
    return ret
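
Note that `format_identity` rewrites the dictionaries it receives in place, so the loaded template no longer contains the placeholders afterwards. If the template needs to be reused with different values, a non-mutating variant may be preferable. The sketch below shows one way to do that; `format_identity_copy` and the sample record are hypothetical and not part of the original notebook.

```python
import copy


def format_identity_copy(records, name, author):
    """Return new records with {{name}} / {{author}} filled in.

    Hypothetical non-mutating variant; the input list is left unchanged.
    """
    filled = copy.deepcopy(records)
    for record in filled:
        record["output"] = (
            record["output"].replace("{{name}}", name).replace("{{author}}", author)
        )
    return filled


# Made-up sample record for the demo
template = [{"instruction": "hi", "input": "", "output": "I am {{name}}, built by {{author}}."}]
result = format_identity_copy(template, name="RiverBot", author="River")
print(result[0]["output"])    # filled-in copy
print(template[0]["output"])  # original still has the placeholders
```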

- Replace with your own settings

In [6]:
NAME = 'RiverBot'
AUTHOR = 'River'

In [7]:
import json
file_name = './LLaMA-Factory/data/identity.json'
with open(file_name) as f:
    identity = json.load(f)

# Use the constants defined above instead of hard-coded values
identity_2 = format_identity(identity, name=NAME, author=AUTHOR)
identity_2[:2]

[{'instruction': 'hi',
  'input': '',
  'output': 'Hello! I am RiverBot, an AI assistant developed by River. How can I assist you today?'},
 {'instruction': 'hello',
  'input': '',
  'output': 'Hello! I am RiverBot, an AI assistant developed by River. How can I assist you today?'}]

In [8]:
os.makedirs('./train',exist_ok=True)
with open('./train/identity_2.json','w') as f:
    json.dump(identity_2,f)

### Copy the data to S3

In [9]:
s3_data_uri = f"s3://{default_bucket}/dataset-for-training"
training_input_path = f'{s3_data_uri}/train'

In [10]:
# save train_dataset to s3
train_dataset.to_json('./train/ruozhiba.json')
sagemaker.s3.S3Uploader.upload(local_path="./train/ruozhiba.json", desired_s3_uri=training_input_path, sagemaker_session=sagemaker_session)
sagemaker.s3.S3Uploader.upload(local_path="./train/identity_2.json", desired_s3_uri=training_input_path, sagemaker_session=sagemaker_session)

print(f"saving training dataset to: {training_input_path}")


saving training dataset to: s3://sagemaker-us-east-1-342367142984/dataset-for-training/train
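
The two uploaded files differ in layout: `Dataset.to_json` writes JSON Lines by default (one object per line), while `identity_2.json` written with `json.dump` is a single JSON array. A quick local check before uploading can catch malformed records early. The sketch below is one way to do this, using temporary demo data; `load_records` and `check_columns` are hypothetical helpers, not part of the notebook.

```python
import json
import os
import tempfile


def load_records(path):
    """Load a JSON array file or a JSON Lines file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)  # single JSON array
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_columns(records, required=("instruction", "input", "output")):
    """Return the indices of records that lack any required column."""
    return [i for i, r in enumerate(records) if not all(k in r for k in required)]


# Demo with a temporary JSON Lines file standing in for ./train/ruozhiba.json
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "demo.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write('{"instruction": "q1", "input": "", "output": "a1"}\n')
        f.write('{"instruction": "q2", "input": ""}\n')  # missing "output"
    demo = load_records(jsonl_path)
    bad = check_columns(demo)

print(len(demo), bad)
```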


## Prepare the LLaMA-Factory dataset info

In [11]:
import json
file_name = './LLaMA-Factory/data/dataset_info.json'
with open(file_name) as f:
    datainfo = json.load(f)

In [12]:
datainfo['identity']={'file_name': 'identity_2.json'}

In [13]:
datainfo['ruozhiba'] = {
    'file_name': 'ruozhiba.json',
    'columns': {
        'prompt': 'instruction',
        'query': 'input',
        'response': 'output',
    },
}
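
The `columns` entry maps each role LLaMA-Factory expects (`prompt`, `query`, `response`) to the field name used in the data file. As a rough illustration of what such a mapping means, and not LLaMA-Factory's own implementation, the hypothetical `apply_columns` helper below renames the fields of a single made-up record:

```python
def apply_columns(record, columns):
    """Rename a raw record's fields to the roles named in `columns`."""
    return {role: record.get(field, "") for role, field in columns.items()}


columns = {"prompt": "instruction", "query": "input", "response": "output"}
raw = {"instruction": "Who are you?", "input": "", "output": "I am RiverBot."}
mapped = apply_columns(raw, columns)
print(mapped)
```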

In [14]:
with open('./LLaMA-Factory/data/dataset_info.json','w') as f:
    json.dump(fp=f,obj=datainfo)

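## Prepare the LLaMA-Factory training config YAML file
### Copy llama3_lora_sft_ds3.yaml from LLaMA-Factory/examples/train_lora/ and modify it
- This experiment uses LoRA training.
- For full-parameter fine-tuning, start from the corresponding template under LLaMA-Factory/examples/train_full/ instead (for example llama3_full_sft_ds3.yaml, if your LLaMA-Factory version ships it).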

In [15]:
#load template
import yaml
file_name = './LLaMA-Factory/examples/train_lora/llama3_lora_sft_ds3.yaml'
with open(file_name) as f:
    doc = yaml.safe_load(f)
doc

{'model_name_or_path': 'meta-llama/Meta-Llama-3-8B-Instruct',
 'stage': 'sft',
 'do_train': True,
 'finetuning_type': 'lora',
 'lora_target': 'all',
 'deepspeed': 'examples/deepspeed/ds_z3_config.json',
 'dataset': 'identity,alpaca_en_demo',
 'template': 'llama3',
 'cutoff_len': 1024,
 'max_samples': 1000,
 'overwrite_cache': True,
 'preprocessing_num_workers': 16,
 'output_dir': 'saves/llama3-8b/lora/sft',
 'logging_steps': 10,
 'save_steps': 500,
 'plot_loss': True,
 'overwrite_output_dir': True,
 'per_device_train_batch_size': 1,
 'gradient_accumulation_steps': 2,
 'learning_rate': 0.0001,
 'num_train_epochs': 3.0,
 'lr_scheduler_type': 'cosine',
 'warmup_ratio': 0.1,
 'fp16': True,
 'ddp_timeout': 180000000,
 'val_size': 0.1,
 'per_device_eval_batch_size': 1,
 'eval_strategy': 'steps',
 'eval_steps': 500}

- This experiment uses the original-precision LLaMA-3-8B, downloaded from the Hugging Face repo 'unsloth/llama-3-8b-Instruct'.

In [16]:
model_id = 'unsloth/llama-3-8b-Instruct'

In [17]:

doc['model_name_or_path'] = model_id
doc['output_dir'] = '/tmp/finetuned_model'
doc['num_train_epochs'] = 3
# In Hugging Face TrainingArguments, a positive warmup_steps takes precedence over warmup_ratio
doc['warmup_steps'] = 10
doc['per_device_train_batch_size'] = 1
doc['gradient_accumulation_steps'] = 2
# doc['lora_target'] = 'all'
doc['cutoff_len'] = 2048
# To keep the experiment short, use at most 200 samples from each dataset
doc['max_samples'] = 200
doc['dataset'] = 'identity,ruozhiba'
doc['eval_steps'] = 500
doc

{'model_name_or_path': 'unsloth/llama-3-8b-Instruct',
 'stage': 'sft',
 'do_train': True,
 'finetuning_type': 'lora',
 'lora_target': 'all',
 'deepspeed': 'examples/deepspeed/ds_z3_config.json',
 'dataset': 'identity,ruozhiba',
 'template': 'llama3',
 'cutoff_len': 2048,
 'max_samples': 200,
 'overwrite_cache': True,
 'preprocessing_num_workers': 16,
 'output_dir': '/tmp/finetuned_model',
 'logging_steps': 10,
 'save_steps': 500,
 'plot_loss': True,
 'overwrite_output_dir': True,
 'per_device_train_batch_size': 1,
 'gradient_accumulation_steps': 2,
 'learning_rate': 0.0001,
 'num_train_epochs': 3,
 'lr_scheduler_type': 'cosine',
 'warmup_ratio': 0.1,
 'fp16': True,
 'ddp_timeout': 180000000,
 'val_size': 0.1,
 'per_device_eval_batch_size': 1,
 'eval_strategy': 'steps',
 'eval_steps': 500,
 'warmup_steps': 10}
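
With data-parallel training (including DeepSpeed ZeRO), the number of samples consumed per optimizer step is the per-device batch size times the gradient accumulation steps times the total number of GPUs. The sketch below works this out for the config above. It assumes 2 nodes of `ml.g5.48xlarge` with 8 GPUs each, which matches the job submitted later in this notebook; `global_batch_size` is a hypothetical helper.

```python
def global_batch_size(per_device_batch, grad_accum_steps, gpus_per_node, num_nodes):
    """Samples consumed per optimizer step under data-parallel training."""
    return per_device_batch * grad_accum_steps * gpus_per_node * num_nodes


# per_device_batch and grad_accum_steps come from the config above;
# 8 GPUs per node and 2 nodes are assumptions about the training cluster.
batch = global_batch_size(per_device_batch=1, grad_accum_steps=2, gpus_per_node=8, num_nodes=2)
print(batch)  # 32
```

Changing the instance count therefore changes the effective batch size, so the learning rate or accumulation steps may need adjusting when scaling the cluster up or down.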

In [18]:
sg_config = 'sg_config_multl_node_lora_ds.yaml'
with open(f'./LLaMA-Factory/{sg_config}', 'w') as f:
    yaml.safe_dump(doc, f)

### Set up the LoRA weight merge config file

In [19]:
file_name = './LLaMA-Factory/examples/merge_lora/llama3_lora_sft.yaml'
with open(file_name) as f:
    doc2 = yaml.safe_load(f)

In [20]:
sg_lora_merge_config = 'sg_config_lora_merge.yaml'
doc2['model_name_or_path'] = model_id
doc2['adapter_name_or_path'] ='/tmp/finetuned_model'
doc2['export_dir'] ='/tmp/finetuned_model_merged'
with open(f'./LLaMA-Factory/{sg_lora_merge_config}', 'w') as f:
    yaml.safe_dump(doc2, f)
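
The merge step only works if the merge config points at the same base model and at the directory where training wrote the adapter. A small consistency check can catch a mismatch before the job is submitted. The sketch below uses plain dictionaries standing in for `doc` and `doc2`; `check_merge_config` is a hypothetical helper, not part of LLaMA-Factory.

```python
def check_merge_config(train_cfg, merge_cfg):
    """Return a list of mismatches between a training and a merge config."""
    problems = []
    if merge_cfg.get("model_name_or_path") != train_cfg.get("model_name_or_path"):
        problems.append("base model differs between training and merge configs")
    if merge_cfg.get("adapter_name_or_path") != train_cfg.get("output_dir"):
        problems.append("adapter path does not match the training output_dir")
    if merge_cfg.get("export_dir") == train_cfg.get("output_dir"):
        problems.append("export_dir would overwrite the adapter directory")
    return problems


# Stand-ins for the relevant keys of doc and doc2
train_cfg = {
    "model_name_or_path": "unsloth/llama-3-8b-Instruct",
    "output_dir": "/tmp/finetuned_model",
}
merge_cfg = {
    "model_name_or_path": "unsloth/llama-3-8b-Instruct",
    "adapter_name_or_path": "/tmp/finetuned_model",
    "export_dir": "/tmp/finetuned_model_merged",
}
problems = check_merge_config(train_cfg, merge_cfg)
print(problems)  # []
```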

## Submit the training job

- S3 output directory for the model

In [21]:
destination_s3 = f's3://{default_bucket}/llama3-8b-lora-sft-ds/output/'

In [None]:
from sagemaker.pytorch import PyTorch
from datetime import datetime

instance_count = 2  # number of nodes; set to 1 for single-node multi-GPU training
instance_type = 'ml.g5.48xlarge'
max_time = 3600 * 24  # maximum run time in seconds (24 hours)

# Print the submission time as a formatted string
current_time = datetime.now()
formatted_time = current_time.strftime("%Y%m%d%H%M%S")
print(formatted_time)

base_job_name = 'llama3-8b-instruct-finetune'
# Environment variables passed to the training container; all values are strings
environment = {
    'NODE_NUMBER': str(instance_count),
    'merge_lora': '1',  # whether to merge the LoRA weights into the base model
    'sg_config': sg_config,
    'sg_lora_merge_config': sg_lora_merge_config,
    's3_data_paths': f"{training_input_path}",
    'OUTPUT_MODEL_S3_PATH': destination_s3,
}

estimator = PyTorch(entry_point='entry-multi-nodes.py',
                    source_dir='./LLaMA-Factory/',
                    role=role,
                    base_job_name=base_job_name,
                    environment=environment,
                    framework_version='2.2.0',
                    py_version='py310',
                    script_mode=True,
                    instance_count=instance_count,
                    instance_type=instance_type,
                    enable_remote_debug=True,
                    # keep_alive_period_in_seconds=600,
                    max_run=max_time)

# No input channels are passed here; the S3 data location is handed over
# through the s3_data_paths environment variable instead.
estimator.fit()

20240710163701


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: llama3-8b-instruct-finetune-2024-07-10-16-37-01-740


2024-07-10 16:37:05 Starting - Starting the training job...
2024-07-10 16:37:06 Pending - Training job waiting for capacity......................................
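
The `environment` dictionary passed to the estimator shows up as ordinary environment variables inside each training container, always as strings. The entry script `entry-multi-nodes.py` is not shown in this notebook, so the following is only a sketch of how such a script could parse those variables. `read_job_settings` is a hypothetical helper, and treating `s3_data_paths` as a comma-separated list is an assumption.

```python
import os


def read_job_settings(env=None):
    """Parse the job's environment variables (all values arrive as strings)."""
    env = os.environ if env is None else env
    return {
        "num_nodes": int(env.get("NODE_NUMBER", "1")),
        "merge_lora": env.get("merge_lora", "0") == "1",
        "train_config": env["sg_config"],  # required
        "merge_config": env.get("sg_lora_merge_config"),
        # Assumption: several S3 prefixes could be passed separated by commas
        "data_paths": [p for p in env.get("s3_data_paths", "").split(",") if p],
        "output_s3": env.get("OUTPUT_MODEL_S3_PATH"),
    }


# Demo with a fake environment; the bucket name is a placeholder
demo_env = {
    "NODE_NUMBER": "2",
    "merge_lora": "1",
    "sg_config": "sg_config_multl_node_lora_ds.yaml",
    "sg_lora_merge_config": "sg_config_lora_merge.yaml",
    "s3_data_paths": "s3://example-bucket/dataset-for-training/train",
    "OUTPUT_MODEL_S3_PATH": "s3://example-bucket/llama3-8b-lora-sft-ds/output/",
}
settings = read_job_settings(demo_env)
print(settings["num_nodes"], settings["merge_lora"], settings["data_paths"])
```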