# 在Amazon SageMaker上训练bloomz7b(BELLE)

BELLE用200万条中文对话在Bloomz7b上FT了一版模型，之前测试法律问答分数比同样200万中文对话FT LLama130亿高一些。
所以第一次训练，采用参数较少的200万条FT的Bloom7b作为我们的基座。
后续如果在增加训练样本后模型表现无任何提升，则证明70亿参数基座无法满足我们的需求，到时候再考虑选择130亿LLama。
（目前Bloomz蒸馏的最大参数仅有70亿参数版本，再往上就是千亿参数了）

## 1、环境配置、准备数据、下载模型

In [15]:
##注意使用这个笔记本时，尽量翻墙，否则可能出现504导致你写了一半的代码无法保存。
## Update sagemaker python sdk version
!pip install -U sagemaker

import sagemaker
import boto3
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [2]:
## 下载BELLE项目的训练代码
!git clone https://github.com/LianjiaTech/BELLE.git
## 下载亚马逊官方教程代码
!git clone https://github.com/snowolf/alpaca-on-amazon-sagemaker.git

Cloning into 'BELLE'...
remote: Enumerating objects: 1758, done.[K
remote: Counting objects: 100% (754/754), done.[K
remote: Compressing objects: 100% (435/435), done.[K
remote: Total 1758 (delta 482), reused 501 (delta 312), pack-reused 1004[K
Receiving objects: 100% (1758/1758), 11.61 MiB | 24.93 MiB/s, done.
Resolving deltas: 100% (926/926), done.
Cloning into 'alpaca-on-amazon-sagemaker'...
remote: Enumerating objects: 46, done.[K
remote: Counting objects: 100% (46/46), done.[K
remote: Compressing objects: 100% (39/39), done.[K
remote: Total 46 (delta 20), reused 17 (delta 6), pack-reused 0[K
Receiving objects: 100% (46/46), 4.03 MiB | 16.42 MiB/s, done.
Resolving deltas: 100% (20/20), done.


In [3]:
## 手动上传训练数据后，再查看数据，可以看出，与LLama（Alpaca）不同，Belle开源方案的训练代码，考虑到兼容后续的多轮对话，所以训练数据格式如下：
##（第一次先用1000条跑一下，跑通了再改成50多万的）
!head -n 50 data/ilaw-dev.json

{"id": "ilaw-1", "conversations": [{"from": "human", "value": "民法商法-证券法2019-12-28:    \"第一百一十五条 证券交易所依照法律、行政法规和国务院证券监督管理机构的规定，制定上市规则、交易规则、会员管理规则和其他有关业务规则，并报国务院证券监督管理机构批准。\",\n民法商法-证券法2019-12-28:    \"在证券交易所从事证券交易，应当遵守证券交易所依法制定的业务规则。违反业务规则的，由证券交易所给予纪律处分或者采取其他自律管理措施。\",\n民法商法-证券法2019-12-28:    \"第一百一十六条 证券交易所的负责人和其他从业人员执行与证券交易有关的职务时，与其本人或者其亲属有利害关系的，应当回避。\",\n民法商法-证券法2019-12-28:    \"第一百一十七条 按照依法制定的交易规则进行的交易，不得改变其交易结果，但本法第一百一十一条第二款规定的除外。对交易中违规交易者应负的民事责任不得免除；在违规交易中所获利益，依照有关规定处理。\",\n请根据以上法律，选择最合适的法律生成问题的合理答复，问题是：\n小张在证券交易所进行证券交易时，未遵守证券交易所的业务规则，而被证券交易所给予了纪律处分，他想知道依据什么法规证券交易所可以制定业务规则。\n\n证券交易所可以根据哪些法规制定业务规则？"}, {"from": "assistant", "value": "证券交易所可以根据法律、行政法规和国务院证券监督管理机构的规定制定上市规则、交易规则、会员管理规则和其他有关业务规则，并报国务院证券监督管理机构批准，这是根据《中华人民共和国证券法》第一百一十五条规定的。"}]}
{"id": "ilaw-2", "conversations": [{"from": "human", "value": "软件著作权产品登记流程有哪些"}, {"from": "assistant", "value": "计算机软件版权登记完成整个时间大约需要4个月左右。整个流程为递交申请、版权中心受理、审查登记材料、获取登记证书 所需文件：计算机软件登记申请表、计算机源程序、软件说明书、著作权归属协议、委托书。"}]}
{"id": "ilaw-3", "conversations":

### 1.1、从HuggingFace Hub上，下载BELLE用200万条中文语料训练的Bloomz7b作为我们的训练基座

In [5]:
#安装huggingface库
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting huggingface_hub
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.14.1


In [6]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_cache_path = Path("./model")
local_cache_path.mkdir(exist_ok=True)

model_name = "BelleGroup/BELLE-7B-2M"

# 筛选出和pytorch模型文件相关的文件。
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading (…)00dc10a1/config.json:   0%|          | 0.00/742 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/199 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/28.3G [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

In [4]:
# 检查并列出模型文件目录（因为第六步的代码，会下载huggingface上项目整体目录结构，所以要用下面的代码列出模型文件的具体位置）
import os
from glob import glob

local_model_path = None

paths = os.walk(r'./model')
for root, dirs, files in paths:
    for file in files:
        if file == 'config.json':
            print(os.path.join(root,file))
            local_model_path = str(os.path.join(root,file))[0:-11]
            print(local_model_path)
if local_model_path == None:
    print("Model download may failed, please check prior step!")

./model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/config.json
./model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/


In [10]:
# 下载s5cmd命令，这个是文件传输提速用的，后续训练时，用是s5cmd可以节约模型传输时间，达到省钱的目的。
!curl -L https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz | tar -xz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 4176k  100 4176k    0     0  7656k      0 --:--:-- --:--:-- --:--:-- 50.1M


对于下面的脚本：
注意local_model_path要配置为第7行代码曾经输出过的目录地址
sagemaker_default_bucket是s3存储桶的地址，需要提前建好存储桶，登陆网址https://s3.console.aws.amazon.com/s3 去建立。
还要把sagemaker的s3扩展权限打开，让sagemaker有权访问s3资源
然后就可以配置下面的代码，把模型通过s5cmd高速传输命令，将我们之前下载好的模型文件存到s3存储桶中备用了。
最后删除notebook空间中的model,原因是后续训练会把根目录下所有文件都推到docker里运行，如果把模型推过去会浪费时间，所以要删除notebook里面的model

In [5]:
%%script env sagemaker_default_bucket="zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias" local_model_path="./model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/" bash

chmod +x ./s5cmd
./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/BELLE/pretrain/7B/

rm -rf model

cp model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/config.json s3://zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias/BELLE/pretrain/7B/config.json
cp model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/special_tokens_map.json s3://zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias/BELLE/pretrain/7B/special_tokens_map.json
cp model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/tokenizer_config.json s3://zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias/BELLE/pretrain/7B/tokenizer_config.json
cp model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/tokenizer.json s3://zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias/BELLE/pretrain/7B/tokenizer.json
cp model/models--BelleGroup--BELLE-7B-2M/snapshots/a9076d928eff1d94fe6b4372ba2bd3a800dc10a1/pytorch_model.bin s3://zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias/BELLE/pretrain/7B/pytorch_mod

## 2、配置并保存docker映像

这一章主要是配置的docker,后续会推送到集群上，相当于在集群上安装docker映像，然后再docker里面训练。

### 2.1、配置映像

In [2]:
%%writefile Dockerfile
## 注意现在只有us-west-2有机器了，所以配置里已经改为us-west-2
From 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04 

ENV LANG=C.UTF-8
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
## 正常流程是把开源项目的环境依赖考过去，然后安装就行了，但是这个开源项目的依赖有毛病，会出现pytorch和cuda版本不匹配的情况，所以还是手工安装一遍更保险。
#COPY BELLE/requirements.txt ./
#RUN python3 -m pip install -r requirements.txt

##作废pip命令默认安装的是旧版，所以要用下面的代码安装特定版本的huggingface transfomers
##RUN python3 -m pip install git+https://github.com/huggingface/transformers
RUN pip3 install transformers==4.28.1
## 安装BELLE开源项目所需的库
##RUN pip3 install torch torchvision
RUN pip3 uninstall -y deepspeed && pip3 install deepspeed==0.9.0
RUN pip3 install datasets==2.10.1
RUN pip3 install fire==0.5.0
RUN pip3 install accelerate==0.17.1
RUN pip3 install numpy
RUN pip3 install rouge_score
RUN pip3 install gensim==3.8.2
RUN pip3 install peft==0.2.0
RUN pip3 install bitsandbytes==0.37.1
RUN pip3 install tqdm==4.65.0
RUN pip3 install huggingface_hub==0.13.1
# accelerate==0.17.1
# bitsandbytes==0.37.1
# datasets==2.10.1
# fire==0.5.0
# huggingface_hub==0.13.1
# torch==1.13.0
# tqdm==4.65.0
# transformers==4.28.1
# deepspeed==0.9.0
# gradio

## Make all local GPUs visible
ENV NVIDIA_VISIBLE_DEVICES="all"

Writing Dockerfile


### 2.2、保存并将映像推送到ECR服务

In [15]:
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [12]:
## 登陆亚马逊ECR(Elastic Container Registry)服务,这个服务可以用来管理docker映像。
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 049701616856.dkr.ecr.us-west-2.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [5]:
## 定义repo name, 文件名要包含*sagemaker*
repo_name = "sagemaker-bloomz-demo"

In [None]:
%%script env repo_name=$repo_name bash

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
# The name of our algorithm
algorithm_name=${repo_name}

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.
sudo docker build -t ${algorithm_name} .
sudo docker tag ${algorithm_name} ${fullname}

sudo docker push ${fullname}

## 3、配置训练参数

### 3.1、配置deepspeed配置文件

In [21]:
#bloomz不支持半精度fp16加速，禁用半精度加速
##这个代码先别执行了，感觉用代码改完后，训练时读取会报错。还是本地用vscode
import json

ds_config_file = '../BELLE/train/configs/deepspeed_config.json'
with open (ds_config_file, 'rb') as f:
    ds_config = json.load(f)
    f.close()
    
ds_config['fp16']['enabled'] = False

with open(ds_config_file, 'w') as f:
    json.dump(ds_config, f, indent=2)
    f.close()

### 3.2、Generate training entrypoint script

**Note: DO NOT CHANGE BELOW VAlUE OF "output_dir" and "cache_dir", keep it "/tmp/llama_out" and "/tmp".**

Below is just a testing to fine-tune on a sample dataset (just 8 samples), you could change ```data_path``` to your dataset for furthur fine tune.
$MODEL_S3_BUCKET是我们在之前代码中定义的：zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias,后续会在调用时配置为环境变量
For the dataset download, you could follow the way how to download pretrain model:
```
./s5cmd sync s3://$MODEL_S3_BUCKET/BELLE/pretrain/7B/* /tmp/bloomz_pretrain/
```

It is recommend to use the folder ```/tmp/dataset/```.
第一次测试，先设置成1轮，数据集使用999条训练+30条验证，把流程跑通
第二次再用正式数据57w+3000条验证

In [1]:
%%writefile train.sh
#!/bin/bash
#使用s5cmd命令将我们之前保存在s3桶里的模型文件存到集群的/tmp/BELLEbloomz_pretrain/目录
chmod +x ./s5cmd
./s5cmd sync s3://$MODEL_S3_BUCKET/BELLE/pretrain/7B/* /tmp/BELLEbloomz_pretrain/
#第一次测试，为了快速跑通，先设置为1轮
torchrun --nproc_per_node 8 --master_port=12345 BELLE/train/src/train.py \
    --model_name_or_path "/tmp/BELLEbloomz_pretrain/" \
    --deepspeed "/BELLE/train/configs/deepspeed_config.json" \
    --train_file "/data/ilaw.json" \
    --validation_file "/data/ilaw-dev.json" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --model_max_length 1024 \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --learning_rate 8e-5 \
    --weight_decay 0.00001 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --evaluation_strategy "steps" \
    --fp16 False \
    --seed 1234 \
    --gradient_checkpointing True \
    --cache_dir "/tmp" \
    --output_dir "/tmp/bloomz_out"

if [ $? -eq 1 ]; then
    echo "Training script error, please check CloudWatch logs"
    exit 1
fi

./s5cmd sync /tmp/bloomz_out s3://$MODEL_S3_BUCKET/bloomz/output/$(date +%Y-%m-%d-%H-%M-%S)/




Writing train.sh


In [14]:
## The image uri which is build and pushed above
#核实一下名字是不是之前推送的[049701616856.dkr.ecr.us-west-2.amazonaws.com/sagemaker-bloomz-demo]
image_uri = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, repo_name)
image_uri

'049701616856.dkr.ecr.us-west-2.amazonaws.com/sagemaker-bloomz-demo:latest'

**The modified training script**

Everything is ready, let's launch the training job.

## Create SageMaker Training Job

In [16]:
import time
from sagemaker.estimator import Estimator

environment = {
              'MODEL_S3_BUCKET': 'zmoolb-ryhrefy1ejk8epxttmfkfagrde4w6usw2a-s3alias' # 改一下s3的桶地址
}

base_job_name = 'bloomz20230523v1'         

instance_type = 'ml.p4d.24xlarge'

estimator = Estimator(role=role,
                      entry_point='train.sh',
                      source_dir='./',
                      base_job_name=base_job_name,
                      instance_count=1,
                      instance_type=instance_type,
                      image_uri=image_uri,
                      environment=environment,
                      disable_profiler=True,
                      debugger_hook_config=False,
                      max_run=24*60*60*3)

estimator.fit()
# estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Using provided s3_resource


INFO:sagemaker:Creating training-job with name: bloomz20230523v1-2023-05-23-11-03-19-778


2023-05-23 11:03:26 Starting - Starting the training job......
2023-05-23 11:04:03 Starting - Preparing the instances for training.....................
2023-05-23 11:07:43 Downloading - Downloading input data...
2023-05-23 11:07:58 Training - Downloading the training image.....................
2023-05-23 11:11:30 Training - Training image download completed. Training in progress.....[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2023-05-23 11:12:25,603 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2023-05-23 11:12:25,666 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-05-23 11:12:25,675 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2023-05-23 11:12:25,677 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2

UnexpectedStatusException: Error for Training job bloomz20230523v1-2023-05-23-11-03-19-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "│    54 │   │   │   except (UnicodeDecodeError, AttributeError):               │
 │    55 │   │   │   │   raise ValueError(                                      │
 │                                                                              │
 │ /opt/conda/lib/python3.9/base64.py:133 in urlsafe_b64decode                  │
 │   130 │   """                                                                │
 │   131 │   s = _bytes_from_decode_data(s)                                     │
 │   132 │   s = s.translate(_urlsafe_decode_translation)                       │
 │ ❱ 133 │   return b64decode(s)                                                │
 │   134                                                                        │
 │   135                                                                        │
 │   136                                                                        │
 │ /, exit code: 1