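### ControlNet model fine-tuning
ControlNet gives users fine-grained control over a diffusion model's generation process by applying additional conditions. The technique was introduced in the paper Adding Conditional Control to Text-to-Image Diffusion Models and quickly became popular in the open-source diffusion community. The authors released 8 models, each letting users steer Stable Diffusion v1-5 with a different condition. These conditions include pose estimation, depth maps, edge maps, and sketches.

Next, we will use ControlNet to fine-tune our Stable Diffusion XL model.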

#### Notebook steps
1. Import boto3 and the SageMaker Python SDK
2. Build the ControlNet fine-tuning image
3. Fine-tune the model
   * Configure hyperparameters
   * Create the training job
4. Test

#### 1. Import boto3 and the SageMaker Python SDK

In [16]:
import sagemaker
import boto3
from sagemaker.pytorch import PyTorch
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region_name = boto3.session.Session().region_name

images_s3uri = 's3://{0}/controlnet-xl/images/'.format(bucket)
models_s3uri = 's3://{0}/stable-diffusion/models/'.format(bucket)
controlnet_s3uri = 's3://{0}/stable-diffusion/controlnet/'.format(bucket)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
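
The S3 prefixes in the cell above are assembled with string formatting, which makes it easy to end up with a missing or doubled `/` when further path segments are appended. As a minimal sketch (the helper name `s3_join` and the bucket name `my-bucket` are illustrative assumptions, not part of the notebook), the prefixes could be built like this:

```python
def s3_join(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an S3 prefix URI that ends with '/'."""
    # Strip stray slashes so that no part introduces an empty "//" segment.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned) + "/"


# "my-bucket" is a placeholder, not the notebook's real default bucket.
images_prefix = s3_join("my-bucket", "controlnet-xl", "images")
controlnet_prefix = s3_join("my-bucket", "stable-diffusion/controlnet/")
output_prefix = s3_join("my-bucket", "stable-diffusion/controlnet/", "/output/")

print(images_prefix)
print(controlnet_prefix)
print(output_prefix)
```

Normalizing every segment in one place means later code can append sub-paths without caring whether the base prefix already ends in a slash.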


#### 2. Build the ControlNet XL fine-tuning image

In [17]:
!rm -rf sd_controlnet
!mkdir -p sd_controlnet
!cd sd_controlnet && git clone https://github.com/huggingface/diffusers

Cloning into 'diffusers'...
remote: Enumerating objects: 48706, done.
remote: Counting objects: 100% (2239/2239), done.
remote: Compressing objects: 100% (834/834), done.
remote: Total 48706 (delta 1586), reused 1790 (delta 1257), pack-reused 46467
Receiving objects: 100% (48706/48706), 31.37 MiB | 34.57 MiB/s, done.
Resolving deltas: 100% (35863/35863), done.


In [None]:
!curl -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz | tar -xz && mv s5cmd sd_controlnet

In [31]:
%%writefile Dockerfile_controlnet
## Change the region code below to the region you are using; this sample uses us-west-2.
#From 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
From 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04

RUN pip install wandb
#RUN pip install xformers==0.0.19 --no-deps
RUN pip install xformers
RUN pip install bitsandbytes
#RUN export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6" && export FORCE_CUDA="1" && pip install ninja triton==2.0.0.dev20221120 && git clone https://github.com/xieyongliang/xformers.git /tmp/xformers && cd /tmp/xformers && git submodule update --init --recursive && pip install -r requirements.txt && pip install -e . 


ENV LANG=C.UTF-8
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

Overwriting Dockerfile_controlnet


* Build and push the Docker image

In [32]:
## Change the region code below to the region you are using; this sample uses us-west-2.
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [33]:
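## Define the ECR repository name. The original note says the name should contain *sagemaker*;
## the name below does not, so adjust it if your role's ECR permissions depend on that.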
repo_name = "sd_controlnet_finetuning"

In [34]:
%%script env repo_name=$repo_name bash

#!/usr/bin/env bash

# This script builds the Docker image and pushes it to ECR so that it is ready for use
# by SageMaker.

# The image name comes from the repo_name environment variable set by the cell magic above.
# It is used as the local image name and combined with the account and region to form
# the ECR repository name.
algorithm_name=${repo_name}

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} -f ./Dockerfile_controlnet .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



Login Succeeded
Sending build context to Docker daemon    132MB
Step 1/7 : From 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
 ---> 1f37d018af76
Step 2/7 : RUN pip install wandb
 ---> Using cache
 ---> 38494bb19d4b
Step 3/7 : RUN pip install xformers
 ---> Running in 415570f7d7f0
Collecting xformers
  Downloading xformers-0.0.22.post7-cp310-cp310-manylinux2014_x86_64.whl (211.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.8/211.8 MB 15.5 MB/s eta 0:00:00
Collecting torch==2.1.0 (from xformers)
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 670.2/670.2 MB 2.2 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.0->xformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 73.0 MB/s eta 0:00:00
Collecting nvidia-c
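
The build script above combines the account ID, region, and repository name into the full image name, and the training-job cell later rebuilds the same string in Python. A minimal sketch of that composition follows; the function names are illustrative assumptions, `123456789012` is a placeholder account ID, and the `amazonaws.com` suffix is assumed to apply to the region in use.

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Return the ECR registry host for an account and region."""
    # Assumes the standard "amazonaws.com" suffix for the chosen region.
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Return the full image URI: <registry>/<repository>:<tag>."""
    return f"{ecr_registry(account_id, region)}/{repo}:{tag}"


# "123456789012" is a placeholder account ID, not a real one.
uri = ecr_image_uri("123456789012", "us-west-2", "sd_controlnet_finetuning")
print(uri)
```

Keeping the registry host separate from the image URI is convenient because `docker login` targets the registry, while `docker tag` and `docker push` need the full URI.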

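#### 3. Fine-tune the model

   * image_uri: URI of the Docker image in the ECR repository
   * instance_type: instance type for the training job; ml.g4dn.xlarge or ml.g5.xlarge is suggested, although the job created below uses ml.g5.48xlarge
   * class_prompt: the category of the prompt
   * instance_prompt: the keyword used for your images
   * model_name: name of the pretrained model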
   

In [8]:
%%writefile ./sd_controlnet/train.sh
# Run training once; if it fails, reinstall a nightly PyTorch build and retry.
bash ./train_controlnet_sdxl_h100.sh || {
    pip uninstall -y torch torchvision
    pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu118
    bash ./train_controlnet_sdxl_h100.sh
}

Overwriting ./sd_controlnet/train.sh
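
The intent of `train.sh` is to run training, and only if that fails, repair the environment and try again. The same control flow can be sketched in Python with `subprocess`; the helper name `run_with_fallback` is an assumption for illustration, and the demo command below is a harmless stand-in for the real training script.

```python
import subprocess
import sys
from typing import List, Sequence


def run_with_fallback(primary: Sequence[str], repair_steps: List[Sequence[str]]) -> int:
    """Run `primary`; on failure run each repair step, then retry `primary` once.

    Returns the exit code of the last `primary` run.
    """
    first = subprocess.run(primary)
    if first.returncode == 0:
        return 0
    for step in repair_steps:
        # A failing repair step raises CalledProcessError instead of retrying blindly.
        subprocess.run(step, check=True)
    return subprocess.run(primary).returncode


# Stand-in command; in the notebook this would be the training shell script.
demo_code = run_with_fallback([sys.executable, "-c", "print('training ok')"], [])
print("exit code:", demo_code)
```

Retrying only after a failure avoids running a long training job twice when the first attempt already succeeded.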


In [28]:
%%writefile ./sd_controlnet/train_controlnet_sdxl.sh

# Do not commit a real key; replace the placeholder with your own Weights & Biases API key.
export WANDB_API_KEY="<YOUR_WANDB_API_KEY>"
export WANDB_WATCH="all"
export WANDB_ENTITY="121102723"
export WANDB_PROJECT="controlnet" 

mkdir -p /tmp/dog
ls -lt ./
chmod 777 ./s5cmd


cd diffusers && pip install -e .
cd examples/controlnet/ && pip install -r requirements_sdxl.txt

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
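# -y avoids the interactive confirmation prompt, which would abort in a non-interactive job.
apt-get install -y git-lfs

# Clone the training dataset (for production)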
#git clone https://huggingface.co/datasets/zobnec/controlnet_fs_dataset_df /tmp/dataset/


export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="/tmp/dataset/"
export OUTPUT_DIR="/tmp/ouput"
export controlnet_s3uri="s3://sagemaker-us-west-2-687912291502/stable-diffusion/controlnet/"

wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png

accelerate launch train_controlnet_sdxl.py \
 --pretrained_model_name_or_path=$MODEL_NAME \
 --output_dir=$OUTPUT_DIR \
 --dataset_name="fusing/fill50k" \
 --conditioning_image_column=conditioning_image \
 --image_column=image \
 --caption_column=text \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png"  \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=1 \
 --max_train_steps=15000 \
 --tracker_project_name="controlnet" \
 --checkpointing_steps=15000 \
 --validation_steps=15000 \
 --report_to="wandb"  \
 --enable_xformers_memory_efficient_attention 


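# checkpointing_steps equals max_train_steps above, so the script is expected to write only checkpoint-15000.
/opt/ml/code/s5cmd sync "$OUTPUT_DIR/checkpoint-15000/controlnet/" "${controlnet_s3uri}output/$(date +%Y-%m-%d-%H-%M-%S)/"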


Overwriting ./sd_controlnet/train_controlnet_sdxl.sh
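
The hyperparameters in the script above are passed to `train_controlnet_sdxl.py` as command-line flags. When they need to be varied from Python, one option is to keep them in a dictionary and render the flags from it. The following is only a sketch: `to_cli_args` is a hypothetical helper, and the dictionary holds a subset of the values from the script.

```python
import shlex
from typing import Any, Dict, List


def to_cli_args(params: Dict[str, Any]) -> List[str]:
    """Convert a hyperparameter dict into argparse-style command-line arguments."""
    args: List[str] = []
    for key, value in params.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)                  # boolean switch, emitted without a value
        elif value is False or value is None:
            continue                           # disabled switches are not emitted at all
        elif isinstance(value, (list, tuple)):
            args.append(flag)                  # multi-value option: flag, then each value
            args.extend(str(v) for v in value)
        else:
            args.append(f"{flag}={value}")     # single value: --key=value
    return args


# A subset of the values used in the training script above.
hyperparameters = {
    "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
    "dataset_name": "fusing/fill50k",
    "resolution": 512,
    "learning_rate": "1e-5",
    "train_batch_size": 1,
    "max_train_steps": 15000,
    "validation_prompt": ["red circle with blue background"],
    "enable_xformers_memory_efficient_attention": True,
}

command = ["accelerate", "launch", "train_controlnet_sdxl.py"] + to_cli_args(hyperparameters)
print(shlex.join(command))
```

`shlex.join` quotes values that contain spaces, so the printed line can be pasted into a shell script as-is.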


In [29]:
%%writefile ./sd_controlnet/train_controlnet_sdxl_h100.sh

# Do not commit a real key; replace the placeholder with your own Weights & Biases API key.
export WANDB_API_KEY="<YOUR_WANDB_API_KEY>"
export WANDB_WATCH="all"
export WANDB_ENTITY="121102723"
export WANDB_PROJECT="controlnet" 

mkdir -p /tmp/dog
ls -lt ./
chmod 777 ./s5cmd


cd diffusers && pip install -e .
cd examples/controlnet/ && pip install -r requirements_sdxl.txt

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
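# -y avoids the interactive confirmation prompt, which would abort in a non-interactive job.
apt-get install -y git-lfs

# Clone the training dataset (for production)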
#git clone https://huggingface.co/datasets/zobnec/controlnet_fs_dataset_df /tmp/dataset/


export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="/tmp/dataset/"
export OUTPUT_DIR="/tmp/ouput"
export controlnet_s3uri="s3://sagemaker-us-west-2-687912291502/stable-diffusion/controlnet/"


accelerate launch train_controlnet_sdxl.py \
 --pretrained_model_name_or_path=$MODEL_NAME \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=multimodalart/facesyntheticsspigacaptioned \
 --conditioning_image_column=spiga_seg \
 --image_column=image \
 --caption_column=image_caption \
 --resolution=512 \
 --learning_rate=1e-5 \
 --train_batch_size=2 \
 --max_train_steps=15000 \
 --tracker_project_name="controlnet" \
 --checkpointing_steps=5000 \
 --report_to="wandb"  \
 --enable_xformers_memory_efficient_attention 


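# With checkpointing_steps=5000 and max_train_steps=15000, the last checkpoint is expected to be checkpoint-15000.
/opt/ml/code/s5cmd sync "$OUTPUT_DIR/checkpoint-15000/controlnet/" "${controlnet_s3uri}output/$(date +%Y-%m-%d-%H-%M-%S)/"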




Overwriting ./sd_controlnet/train_controlnet_sdxl_h100.sh
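
Both training scripts finish by syncing one checkpoint directory to S3, so that directory has to match what the trainer actually writes. The sketch below lists the expected directories, assuming the diffusers script saves a `checkpoint-<step>` folder each time the global step is a multiple of `--checkpointing_steps` (an assumption about the script's behavior; `checkpoint_dirs` is a hypothetical helper).

```python
from typing import List


def checkpoint_dirs(max_train_steps: int, checkpointing_steps: int) -> List[str]:
    """List checkpoint directory names expected after a full training run."""
    if max_train_steps <= 0 or checkpointing_steps <= 0:
        raise ValueError("both step counts must be positive")
    # A checkpoint is assumed at every multiple of checkpointing_steps up to max_train_steps.
    steps = range(checkpointing_steps, max_train_steps + 1, checkpointing_steps)
    return [f"checkpoint-{s}" for s in steps]


# Settings of the two scripts above.
print(checkpoint_dirs(15000, 15000))
print(checkpoint_dirs(15000, 5000))
```

With either setting, no `checkpoint-1000` directory is produced, so a sync command pointing at that path would find nothing to upload.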


   * Create the training job

In [None]:
import time
from sagemaker.estimator import Estimator
from sagemaker.pytorch.estimator import PyTorch

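environment = {
    'PYTORCH_CUDA_ALLOC_CONF':'max_split_size_mb:32',
    # The training scripts export their own MODEL_NAME (SDXL base 1.0), which takes precedence over this value.
    'MODEL_NAME':'stabilityai/stable-diffusion-2-1-base'
}

## URI of the image that was built and pushed above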
image_uri = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account_id, region_name, repo_name)
base_job_name = 'sd-xl-controlnet-finetuning-high'
#instance_type = 'ml.p4d.24xlarge'
instance_type = 'ml.g5.48xlarge'
#inputs = {
#    'images': f"s3://{bucket}/controlnet-xl/images/"
#}

estimator = PyTorch(role=role,
                      entry_point='train_controlnet_sdxl.sh',
                      #entry_point='train_controlnet_sdxl_h100.sh',
                      source_dir='./sd_controlnet/',
                      base_job_name=base_job_name,
                      instance_count=1,
                      instance_type=instance_type,
                      image_uri=image_uri,
                      environment=environment,
                      volume_size = 1000,
                      keep_alive_period_in_seconds=3600, # warm pool: keeps the instance and image for the next training job (rolling renewal, max 1 hour); requires a warm-pool quota
                      disable_profiler=True,
                      debugger_hook_config=False,
                      max_run=24*60*60*2)

estimator.fit()
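
The estimator's `environment` argument can also carry settings such as the Weights & Biases variables, so that they do not have to be written into the training scripts. The sketch below merges fixed settings with values read from the notebook's own environment; `build_environment` is a hypothetical helper and the demo uses a fake source mapping with a dummy value. Values passed this way are likely still stored with the training job configuration, so a dedicated secrets store may be preferable for real keys.

```python
import os
from typing import Dict, List, Mapping


def build_environment(base: Dict[str, str],
                      secret_names: List[str],
                      source: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return a copy of `base` extended with the named variables taken from `source`."""
    missing = [name for name in secret_names if name not in source]
    if missing:
        # Fail early rather than start a long job that cannot report its metrics.
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    env = dict(base)
    for name in secret_names:
        env[name] = source[name]
    return env


# Demo with a fake source mapping; in practice `source` defaults to os.environ.
fake_source = {"WANDB_API_KEY": "dummy-value"}
environment = build_environment(
    {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"},
    ["WANDB_API_KEY"],
    source=fake_source,
)
print(sorted(environment))
```

The resulting dictionary has the same shape as the `environment` dictionary defined in the cell above and could be passed to the estimator in its place.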

In [36]:
print("Model artifact saved at:\n", controlnet_s3uri)

Model artifact saved at:
 s3://sagemaker-us-west-2-687912291502/stable-diffusion/controlnet/
