## FLUX.1-dev LoRA fine-tuning on SageMaker (training job method)  
***Note: Experimental only, because the kohya training scripts have not been officially released yet.  
Note: Training is decoupled from this notebook and runs as a separate SageMaker training job.***
* An ml.t3.medium notebook instance is sufficient. Configure at least 150 GB of storage, because building the large Docker image for the training job needs the disk space.
* The scripts and code are based on [kohya-ss/sd-scripts](https://github.com/kohya-ss/sd-scripts/tree/sd3) (`sd3` branch).
* The training images are from [here](https://github.com/shirayu/example_lora_training).  
* The trigger word "wta" (see the image captions) is used for LoRA training here; you can change or remove it.  

## 1. Download training dataset and Flux model files

In [None]:
import os

flux_lora_training = "./flux_lora_training"
os.makedirs(flux_lora_training, exist_ok=True)

%cd flux_lora_training/

train_image_dir = "./images"
docker_file_dir = "./dockerfile"
os.makedirs(train_image_dir, exist_ok=True)
os.makedirs(docker_file_dir, exist_ok=True)

In [None]:
!wget -P $train_image_dir https://huggingface.co/datasets/terrificdm/wikipe-tan/resolve/main/dataset.zip
!unzip $train_image_dir/dataset.zip -d $train_image_dir && rm $train_image_dir/dataset.zip

In [3]:
import json

metadata_file = os.path.join(train_image_dir, 'metadata.jsonl')
with open(metadata_file, 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        # Strip only the final extension so that names containing dots stay intact
        filename = os.path.splitext(data['file_name'])[0]
        text = data['text']
        output_file = os.path.join(train_image_dir, f'{filename}.txt')
        with open(output_file, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)
os.remove(metadata_file)
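
The cell above writes one `.txt` caption per entry in `metadata.jsonl`. As a quick sanity check before uploading, the sketch below lists images that ended up without a caption file. The helper name `find_images_without_caption` and the set of image extensions are assumptions made for illustration; adjust them to your dataset.

```python
import os

# Assumption: these are the image extensions present in the dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}

def find_images_without_caption(image_dir, caption_extension=".txt"):
    """Return the sorted image file names in image_dir that lack a caption file."""
    missing = []
    for name in sorted(os.listdir(image_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTENSIONS:
            continue  # skip captions, TOML files, and anything else
        caption_path = os.path.join(image_dir, stem + caption_extension)
        if not os.path.exists(caption_path):
            missing.append(name)
    return missing

# Example (in this notebook): print(find_images_without_caption(train_image_dir))
```

An empty list means every image has a matching caption, which is what `caption_extension = '.txt'` in the dataset config relies on.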

In [None]:
!pip install --upgrade huggingface_hub

In [None]:
from huggingface_hub import login, hf_hub_download

# Replace the placeholder below with your own Hugging Face access token.
access_token = "YOUR_ACCESS_TOKEN_HERE"
login(access_token)

In [6]:
model_dir = "./models"
os.makedirs(model_dir, exist_ok=True)

model_vae_repo_id = "black-forest-labs/FLUX.1-dev"
model_vae_files = ["flux1-dev.safetensors", "ae.safetensors"]
text_encoders_repo_id = "comfyanonymous/flux_text_encoders"
text_encoders_files = ["clip_l.safetensors", "t5xxl_fp16.safetensors"]

for file in model_vae_files:
    hf_hub_download(model_vae_repo_id, local_dir=model_dir, filename=file)
for file in text_encoders_files:
    hf_hub_download(text_encoders_repo_id, local_dir=model_dir, filename=file)
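
The training job later expects all four files under the `models` input channel, so it can help to confirm that the downloads completed. The following is a minimal sketch; `missing_model_files` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os

def missing_model_files(model_dir, expected_files):
    """Return the expected file names that are absent or empty in model_dir."""
    missing = []
    for name in expected_files:
        path = os.path.join(model_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

# Example (in this notebook):
# print(missing_model_files(model_dir, model_vae_files + text_encoders_files))
```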

## 2. Prepare training config files and Dockerfile (Docker image for the training job)

***Refer to "dataset-example.toml" to configure your own .toml file.***

In [None]:
%%writefile ./images/dataset.toml
[general]
enable_bucket = true
caption_extension = '.txt'
keep_tokens = 0

[[datasets]]
resolution = 1024
# min_bucket_reso = 640
# max_bucket_reso = 1536
bucket_reso_steps = 32
batch_size = 2

[[datasets.subsets]]
image_dir = '/opt/ml/input/data/images'

In [None]:
%%writefile ./images/sample_prompt.toml
[prompt]
sample_steps = 20
width = 1024
height = 1024

[[prompt.subset]]
prompt = "wta, 1girl, looking at viewer, blue hair, short twintails, hair ornament, blue eyes, blush, smile, open mouth, shirt, skirt, kneehighs, brown footwear, standing, solo"
seed = 1000
[[prompt.subset]]
prompt = "wta, 1girl, looking at viewer, blue hair, short twintails, hair ornament, blue eyes, blush, smile, open mouth, shirt, skirt, kneehighs, brown footwear, standing, solo"
seed = 2000

In [None]:
%%writefile ./dockerfile/Dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker

ENV PATH="/opt/ml/code:${PATH}"
ENV SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/code
ENV DEBIAN_FRONTEND=noninteractive

RUN git clone -b sd3 https://github.com/kohya-ss/sd-scripts /opt/ml/code

WORKDIR /opt/ml/code

RUN mv flux_train_network.py flux_train_network && \
    sed -i 's/-e \./\./g' requirements.txt && \
    pip3 install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124 && \
    pip install -r requirements.txt && \
    pip install wandb && \
    pip uninstall transformer-engine -y # Solve error of "transformer_engine_extensions.cpython-311-x86_64-linux-gnu.so: undefined symbol"

# RUN mkdir -p images/

# COPY ./images/* ./images/

WORKDIR /

ENV SAGEMAKER_PROGRAM accelerate.commands.launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network

## 3. Change the default Docker root directory of the SageMaker notebook
***The default Docker root directory (/var/lib/docker) on a SageMaker notebook instance has limited space, which is not enough for building large images.***

In [None]:
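import os

# Stop Docker before relocating its data directory
os.system('sudo service docker stop')

# Move the Docker data directory to the notebook's larger volume (first run only)
docker_dir = "/home/ec2-user/SageMaker/docker"
if not os.path.isdir(docker_dir):
    os.system(f'sudo mv /var/lib/docker {docker_dir}')

# Point the default location at the new directory
os.system(f'sudo ln -s {docker_dir} /var/lib/docker')

os.system('sudo service docker start')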
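
Before starting the image build, you may want to check how much space is actually free on the volume that now holds Docker's data. The sketch below uses only the standard library; the helper names and any threshold you pass are assumptions (the notebook itself only recommends 150 GB+ of total storage).

```python
import shutil

def free_gib(path):
    """Return the free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024**3

def has_enough_space(path, required_gib):
    """Return True if at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib

# Example (in this notebook): print(f"{free_gib('/home/ec2-user/SageMaker'):.1f} GiB free")
```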

## 4. Build docker image and push to ECR

In [None]:
%%sh

# Specify an algorithm name
algorithm_name=flux-lora-taining-job

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
base_image_repo="763104351884.dkr.ecr.${region}.amazonaws.com"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly

aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${base_image_repo}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} ./dockerfile
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
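
The image name pushed above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, and the same name must be passed to the estimator later. As an illustration, the sketch below builds that URI in Python; `ecr_image_uri` is a hypothetical helper, and it assumes a standard commercial AWS region (the domain suffix differs in some partitions, such as the China regions).

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI in the same format as ${fullname} in the build script."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# Example with a placeholder account ID:
# ecr_image_uri("123456789012", "us-east-1", "flux-lora-taining-job")
```

Building the URI from one function keeps the repository name identical between the build step and the training step, where a mismatch would make the job fail to pull the image.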

## 5. Train models with SageMaker training job

In [None]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region_name = boto3.session.Session().region_name
images_s3uri = 's3://{0}/flux-lora-train/dataset/'.format(bucket)
models_s3uri = 's3://{0}/flux-lora-train/models/'.format(bucket)

In [None]:
# Copy training dataset to S3 bucket

!aws s3 cp images $images_s3uri --recursive

In [None]:
# Copy flux model files to S3 bucket

!aws s3 cp models $models_s3uri --recursive

***Provide your own "wandb_api_key" in the script below.***

In [None]:
import json
def json_encode_hyperparameters(hyperparameters):
    for (k, v) in hyperparameters.items():
        print(k, v)
    return {k: json.dumps(v) for (k, v) in hyperparameters.items()}

docker_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/flux-lora-taining-job'.format(account_id, region_name)
instance_type = 'ml.g5.4xlarge'

lora_name = "flux_lora_wta"
output_dir="/opt/ml/model/"
wandb_api_key = "YOUR_WANDB_API_KEY_HERE" # Replace with your own wandb API key; never commit a real key

environment = {'LD_LIBRARY_PATH': "${LD_LIBRARY_PATH}:/opt/conda/lib/python3.11/site-packages/nvidia/nvjitlink/lib/"}

hyperparameters = {
                    'pretrained_model_name_or_path': '/opt/ml/input/data/models/flux1-dev.safetensors',
                    'clip_l': '/opt/ml/input/data/models/clip_l.safetensors',
                    't5xxl': '/opt/ml/input/data/models/t5xxl_fp16.safetensors',
                    'ae': '/opt/ml/input/data/models/ae.safetensors',
                    'save_model_as': 'safetensors',
                    'sdpa': '',
                    'persistent_data_loader_workers': '',
                    'max_data_loader_n_workers': 2,
                    'gradient_checkpointing': '',
                    'mixed_precision': 'bf16',
                    'save_precision': 'bf16',
                    'full_bf16': '',
                    'network_module': 'networks.lora_flux',
                    'network_dim': 64,
                    'network_alpha': 32,
                    'lr_scheduler': 'cosine_with_restarts',
                    'lr_scheduler_num_cycles': 1,
                    'optimizer_type': 'prodigy',
                    'optimizer_args': 'safeguard_warmup=True',
                    'learning_rate': 1.0,
                    'cache_latents_to_disk': '',
                    'cache_text_encoder_outputs_to_disk': '',
                    'fp8_base': '',
                    'highvram': '',
                    'max_train_steps': 540,
                    'save_every_n_steps': 120,
                    'dataset_config': '/opt/ml/input/data/images/dataset.toml',
                    'output_dir': output_dir,
                    'output_name': lora_name,
                    'timestep_sampling': 'shift',
                    'discrete_flow_shift': 3.1582,
                    'model_prediction_type': 'raw',
                    'guidance_scale': 1,
                    't5xxl_max_token_length': 512,
                    'sample_every_n_steps': 120,
                    'sample_prompts': '/opt/ml/input/data/images/sample_prompt.toml',
                    'sample_sampler': 'euler_a',
                    'logging_dir': '/opt/ml/code/logs',
                    'log_with': 'all',
                    'log_tracker_name': lora_name,
                    'log_config':'',
                    'wandb_api_key': wandb_api_key
}

hyperparameters = json_encode_hyperparameters(hyperparameters)
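
Note that `json_encode_hyperparameters` prints every value, including `wandb_api_key`, into the notebook output. A possible variant that masks sensitive values while printing is sketched below; the function names and the `SENSITIVE_KEYS` set are assumptions for illustration, and the returned dictionary is encoded exactly as before.

```python
import json

# Assumption: keys whose values should not appear in notebook output.
SENSITIVE_KEYS = {"wandb_api_key"}

def mask(value, visible=4):
    """Keep the first `visible` characters and replace the rest with '*'."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return text[:visible] + "*" * (len(text) - visible)

def json_encode_hyperparameters_masked(hyperparameters):
    """Print hyperparameters with secrets masked, then JSON-encode every value."""
    for k, v in hyperparameters.items():
        print(k, mask(v) if k in SENSITIVE_KEYS else v)
    return {k: json.dumps(v) for k, v in hyperparameters.items()}
```

Masking only affects what is printed here; the key is still sent to the training job as a hyperparameter.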

In [None]:
from sagemaker.estimator import Estimator

inputs = {
    'images': images_s3uri,
    'models': models_s3uri
}

estimator = Estimator(
    role = role,
    instance_count=1,
    instance_type = instance_type,
    image_uri = docker_image_uri,
    hyperparameters = hyperparameters,
    environment=environment,
    disable_output_compression = True
)
estimator.fit(inputs)

In [17]:
model_data = estimator.model_data
model_s3_path = model_data['S3DataSource']['S3Uri']
print("Model artifact saved at:", "\n"+model_s3_path+"\n")
!aws s3 ls {model_s3_path}

Model artifact saved at: 
s3://sagemaker-us-east-1-091166060467/flux-lora-taining-job-2024-09-01-00-15-09-583/output/model/

                           PRE sample/
2024-09-01 02:12:07  634012952 flux_lora_wta-step00000120.safetensors
2024-09-01 02:12:14  634012952 flux_lora_wta-step00000240.safetensors
2024-09-01 02:12:11  634012952 flux_lora_wta-step00000360.safetensors
2024-09-01 02:12:17  634012952 flux_lora_wta-step00000480.safetensors
2024-09-01 02:12:19  634012952 flux_lora_wta.safetensors


In [18]:
# To use a different checkpoint, change the LoRA weight file name below

lora_s3_path = model_s3_path + 'flux_lora_wta.safetensors'
print("Lora weight is saved at:", "\n"+lora_s3_path)

Lora weight is saved at: 
s3://sagemaker-us-east-1-091166060467/flux-lora-taining-job-2024-09-01-00-15-09-583/output/model/flux_lora_wta.safetensors
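
The listing above shows one intermediate checkpoint per `save_every_n_steps` (120, 240, 360, 480) plus the final weight. The sketch below reproduces that naming so you can point at a specific checkpoint; the helper names are hypothetical, and the file name pattern is inferred from this run's output rather than taken from the kohya documentation.

```python
def checkpoint_steps(max_train_steps, save_every_n_steps):
    """Steps with an intermediate checkpoint (the final weight is saved separately)."""
    return list(range(save_every_n_steps, max_train_steps, save_every_n_steps))

def checkpoint_uri(model_s3_path, lora_name, step=None):
    """Return the S3 URI of the final weight, or of the checkpoint at `step`."""
    prefix = model_s3_path if model_s3_path.endswith("/") else model_s3_path + "/"
    if step is None:
        return f"{prefix}{lora_name}.safetensors"
    # Pattern inferred from the listing, e.g. flux_lora_wta-step00000120.safetensors
    return f"{prefix}{lora_name}-step{step:08d}.safetensors"

# Example (in this notebook): checkpoint_uri(model_s3_path, lora_name, step=240)
```

Earlier checkpoints can be worth comparing against the final weight when the last steps look overfitted in the sample images.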
