# HW4: Stable Diffusion Fine-tuning
In this homework, you will fine-tune your own Stable Diffusion model to generate customized images from a given text description. For more details, please refer to the homework slides.

## **TODOs**

1. Read the slides and make sure you know the objectives of this homework.
2. Save a copy of this Colab notebook.
3. Follow the steps in this Colab notebook to fine-tune your Stable Diffusion.
4. Evaluate the outputs using FaceNet and CLIP.


This is based on the work of [Hugging Face](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py).

And special thanks to [Celebrity Face Image Dataset](https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset).


Thank you!


In [6]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
pip install pip==22.0

Collecting pip==22.0
  Downloading pip-22.0-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 14.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.0
    Uninstalling pip-23.0:
      Successfully uninstalled pip-23.0
Successfully installed pip-22.0


In [6]:
import os
from IPython import get_ipython
from IPython.display import display, Markdown

COLAB = True

if COLAB:
    from google.colab.output import clear as clear_output
else:
    from IPython.display import clear_output


#@markdown ## Link to Google Drive
#@markdown This cell will load some requirements and create the necessary folders in your Google Drive. <p>
#@markdown Your project name can't contain spaces but it can contain a single / to make a subfolder in your dataset.
project_name = "proj_brad" #@param {type:"string"}
project_name = project_name.strip()
# dataset_name = "Brad-512" #@param ["Brad-512", "Anne-512"]
dataset_name = "Brad"

if not project_name or any(c in project_name for c in " .()\"'\\") or project_name.count("/") > 1:
    print("Please write a valid project_name.")
else:
    if COLAB and not os.path.exists('/content/drive'):
      from google.colab import drive
      print("📂 Connecting to Google Drive...")
      drive.mount('/content/drive')

    project_base = project_name if "/" not in project_name else project_name[:project_name.rfind("/")]
    project_subfolder = project_name if "/" not in project_name else project_name[project_name.rfind("/")+1:]

    root_dir = "/content" if COLAB else os.path.expanduser("~/Loras")  # expand "~" so os.makedirs gets a real path
    main_dir        = os.path.join(root_dir, "drive/MyDrive/GenAI-HW4") if COLAB else root_dir
    project_dir =  os.path.join(main_dir, project_name)
    os.makedirs(main_dir, exist_ok=True)
    os.makedirs(project_dir, exist_ok=True)
    zip_file = os.path.join(main_dir, "Datasets.zip")
    !gdown 1OXPG2vNb8bG2334HML8vKpqo8UbAEV3d -O {zip_file}
    !unzip -q -o {zip_file} -d {main_dir}
    log_file = os.path.join(project_dir, "logs.zip")
    log_dir = os.path.join(project_dir, "logs")
    !git clone https://huggingface.co/yahcreeper/GenAI-HW10-Model {log_dir}
    # !cd {log_dir}
    # !git lfs pull
    # !gdown 1kalT3k7kEV0xcD6pf_OTSo7npHmsfZ_z -O {log_file}
    # !unzip -q -o {log_file} -d {project_dir}
    # !rm -f {log_file}
    model_path = os.path.join(project_dir, "logs", "checkpoint-last")
    images_folder   = os.path.join(main_dir, "Datasets", dataset_name)
    prompts_folder  = os.path.join(main_dir, "Datasets", "prompts")
    captions_folder = images_folder
    os.makedirs(images_folder, exist_ok=True)

    print(f"✅ Project {project_name} is ready!")
    step1_installed_flag = True

Downloading...
From: https://drive.google.com/uc?id=1OXPG2vNb8bG2334HML8vKpqo8UbAEV3d
To: /content/drive/MyDrive/GenAI-HW4/Datasets.zip
  0% 0.00/2.78M [00:00<?, ?B/s]100% 2.78M/2.78M [00:00<00:00, 144MB/s]
fatal: destination path '/content/drive/MyDrive/GenAI-HW4/proj_brad/logs' already exists and is not an empty directory.
✅ Project proj_brad is ready!
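
The project-name rule enforced by the cell above can be sketched as a pair of standalone functions. This is a minimal illustration rather than part of the homework code: the function names `is_valid_project_name` and `split_project_name` are hypothetical, and the checks simply mirror the condition and the `rfind("/")` slicing used in the cell.

```python
def is_valid_project_name(name: str) -> bool:
    """Return True if `name` passes the same checks as the setup cell."""
    name = name.strip()
    forbidden = " .()\"'\\"  # same forbidden characters as in the cell
    if not name:
        return False
    if any(c in name for c in forbidden):
        return False
    # At most one "/" is allowed, which creates a single subfolder.
    return name.count("/") <= 1


def split_project_name(name: str) -> tuple[str, str]:
    """Split 'base/sub' into (base, sub); a name without '/' maps to itself twice."""
    if "/" not in name:
        return name, name
    idx = name.rfind("/")
    return name[:idx], name[idx + 1:]


print(is_valid_project_name("proj_brad"))
print(split_project_name("faces/brad"))
```

Keeping the check in a function makes it easy to try candidate names before mounting Google Drive or downloading anything.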


In [7]:
#@markdown ##  Install the required packages
#@markdown In this section, we will install some well-established packages to facilitate the fine-tuning process. <p>
#@markdown The installation will take about 5 minutes.
os.chdir(root_dir)
!pip -q install timm==1.0.7 fairscale==0.4.13 transformers==4.41.2 requests==2.31.0 accelerate==0.31.0 diffusers==0.29.1 einop==0.0.1 safetensors==0.4.3 voluptuous==0.15.1 jax==0.4.28 peft==0.11.1 deepface==0.0.93 tensorflow==2.15.0 keras==2.15.0

In [8]:
!pip install jaxlib==0.4.28
!pip install facenet-pytorch

Collecting jaxlib==0.4.28
  Downloading jaxlib-0.4.28-cp311-cp311-manylinux2014_x86_64.whl (77.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.5/77.5 MB 7.7 MB/s eta 0:00:00
Installing collected packages: jaxlib
  Attempting uninstall: jaxlib
    Found existing installation: jaxlib 0.5.1
    Uninstalling jaxlib-0.5.1:
      Successfully uninstalled jaxlib-0.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flax 0.10.5 requires jax>=0.5.1, but you have jax 0.4.28 which is incompatible.
Successfully installed jaxlib-0.4.28
Collecting facenet-pytorch
  Downloading facenet_pytorch-2.6.0-py3-none-any.whl (1.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 12.3 MB/s eta 0:00:00
Collecting Pillow<10.3.0,>=10.2.0
  Downloading pil

In [1]:
#@markdown ##  Import necessary packages
#@markdown It is recommended NOT to change the code in this cell.
import argparse
import logging
import math
import os
import random
import glob
import shutil
from pathlib import Path
import numpy as np
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
import transformers
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
from tqdm.auto import tqdm
from peft import LoraConfig
from peft.utils import get_peft_model_state_dict
from transformers import AutoProcessor, AutoModel, CLIPTextModel, CLIPTokenizer

import diffusers
from diffusers import AutoencoderKL, DDPMScheduler, DiffusionPipeline, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from diffusers.utils import convert_state_dict_to_diffusers
from diffusers.training_utils import compute_snr
from diffusers.utils.torch_utils import is_compiled_module
from deepface import DeepFace
import cv2

  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)


25-04-06 03:29:47 - Directory /root/.deepface has been created
25-04-06 03:29:47 - Directory /root/.deepface/weights has been created


In [7]:
# Do not change the following parameters, or the process may crash because the GPU runs out of memory.
output_folder = os.path.join(project_dir, "logs") # folder for model checkpoints and validation results
seed = 1126 # random seed
train_batch_size = 2 # training batch size
resolution = 512 # image size
weight_dtype = torch.bfloat16 # dtype the model weights are cast to
snr_gamma = 5
#####/content/drive/MyDrive/GenAI-HW4/Datasets
#@markdown ## Important parameters for fine-tuning Stable Diffusion
#pretrained_model_name_or_path = "stablediffusionapi/cyberrealistic-41"
#pretrained_model_name_or_path = "/content/drive/MyDrive/GenAI-HW4/stablediffusion/sd1_5"
pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
lora_rank = 32
lora_alpha = 16
#@markdown ### ▶️ Learning Rate
#@markdown The learning rate is the most important for your results. If you want to train slower with lots of images, or if your dim and alpha are high, move the unet to 2e-4 or lower. <p>
#@markdown The text encoder helps your Lora learn concepts slightly better. It is recommended to make it half or a fifth of the unet. If you're training a style you can even set it to 0.
learning_rate = 1e-4 #@param {type:"number"}
unet_learning_rate = learning_rate
text_encoder_learning_rate = learning_rate
lr_scheduler_name = "cosine_with_restarts" # learning-rate schedule
lr_warmup_steps = 100 # number of warmup steps during which the learning rate ramps up
#@markdown ### ▶️ Steps
#@markdown Choose the number of training steps and the number of images generated at each validation.
max_train_steps = 200 #@param {type:"slider", min:200, max:2000, step:100}
validation_prompt = "validation_prompt.txt"
validation_prompt_path = os.path.join(prompts_folder, validation_prompt)
validation_prompt_num = 3 #@param {type:"slider", min:1, max:5, step:1}
validation_step_ratio = 1 #@param {type:"slider", min:0, max:1, step:0.1}
with open(validation_prompt_path, "r") as f:
    validation_prompt = [line.strip() for line in f.readlines()]
#####
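
How `max_train_steps` and `validation_step_ratio` translate into validation points can be sketched as below. The helper name `validation_steps` is hypothetical; the interval formula and the trigger condition follow the training loop later in this notebook. Note that the slider permits a ratio of 0, for which `int(max_train_steps * validation_step_ratio)` is 0; the sketch treats that case as "validate only at the end", which is an assumption rather than documented behaviour.

```python
def validation_steps(max_train_steps: int, validation_step_ratio: float) -> list[int]:
    """List the global steps at which a checkpoint is saved and evaluated."""
    interval = int(max_train_steps * validation_step_ratio)
    if interval <= 0:
        # Assumed guard: a zero interval would otherwise cause a modulo-by-zero.
        return [max_train_steps]
    return [
        step
        for step in range(1, max_train_steps + 1)
        if step % interval == 0 or step == max_train_steps
    ]


print(validation_steps(200, 1.0))
print(validation_steps(1000, 0.2))
```

A smaller ratio means more frequent validation, and each validation run generates images and computes scores, so it lengthens the total run time.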

## Define Some Useful Functions and Class

In [8]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".bmp", ".PNG", ".JPG", ".JPEG", ".WEBP", ".BMP"]
train_transform = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

class Text2ImageDataset(torch.utils.data.Dataset):
    """
    (1) Goal:
        - This class builds the dataset used to fine-tune the text-to-image model.

    """
    def __init__(self, images_folder, captions_folder, transform, tokenizer):
        """
        (2) Arguments:
            - images_folder: str, path to the images
            - captions_folder: str, path to the captions
            - transform: function, turns a raw image into a torch.Tensor
            - tokenizer: CLIPTokenizer, turns sentences into token ids
        """
        self.image_paths = []
        for ext in IMAGE_EXTENSIONS:
            self.image_paths.extend(glob.glob(f"{images_folder}/*{ext}"))
        self.image_paths = sorted(self.image_paths)
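        # Face embeddings of the training images, used later as the reference set in evaluate().
        self.train_emb = torch.tensor([DeepFace.represent(img_path, detector_backend="ssd", model_name="GhostFaceNet", enforce_detection=False)[0]['embedding'] for img_path in self.image_paths])
        # Captions are matched to images by sorted order, so both folders must sort consistently.
        caption_paths = sorted(glob.glob(f"{captions_folder}/*txt"))
        captions = []
        for p in caption_paths:
            with open(p, "r") as f:
                captions.append(f.readline().strip())  # first line only, without the trailing newline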
        inputs = tokenizer(
            captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        self.input_ids = inputs.input_ids
        self.transform = transform

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        input_id = self.input_ids[idx]
        try:
            image = Image.open(img_path).convert("RGB")
            # convert to tensor temporarily so dataloader will accept it
            tensor = self.transform(image)
        except Exception as e:
            print(f"Could not load image path: {img_path}, error: {e}")
            return None


        return tensor, input_id

    def __len__(self):
        return len(self.image_paths)



def prepare_lora_model(pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5", model_path=None):
    """
    (1) Goal:
        - This function loads the whole Stable Diffusion model with LoRA layers (Tokenizer, Noise Scheduler, UNet, VAE, and Text Encoder) and freezes all non-LoRA parameters.

    (2) Arguments:
        - pretrained_model_name_or_path: str, model name from Hugging Face
        - model_path: str, directory containing the pickled "unet.pt" and "text_encoder.pt" (required despite the None default)

    (3) Returns:
        - output: Tokenizer, Noise Scheduler, UNet, VAE, and Text Encoder

    """
    print("modelpath",pretrained_model_name_or_path)
    noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model_name_or_path, subfolder="scheduler")
    tokenizer = CLIPTokenizer.from_pretrained(
        pretrained_model_name_or_path,
        subfolder="tokenizer"
    )
    # text_encoder = CLIPTextModel.from_pretrained(
    #     pretrained_model_name_or_path,
    #     torch_dtype=weight_dtype,
    #     subfolder="text_encoder"
    # )
    vae = AutoencoderKL.from_pretrained(
        pretrained_model_name_or_path,
        subfolder="vae"
    )
    # unet = UNet2DConditionModel.from_pretrained(
    #     pretrained_model_name_or_path,
    #     torch_dtype=weight_dtype,
    #     subfolder="unet"
    # )
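    # unet.pt / text_encoder.pt are whole pickled modules (saved with torch.save(module, path)), not state dicts.
    # weights_only=False is spelled out because newer PyTorch releases default to weights_only=True; only load files you trust.
    text_encoder = torch.load(os.path.join(model_path, "text_encoder.pt"), weights_only=False)
    unet = torch.load(os.path.join(model_path, "unet.pt"), weights_only=False)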
    vae.requires_grad_(False)
    for name, param in unet.named_parameters():
        if "lora" in name:
            param.requires_grad_(True)
        else:
            param.requires_grad_(False)
    for name, param in text_encoder.named_parameters():
        if "lora" in name:
            param.requires_grad_(True)
        else:
            param.requires_grad_(False)

    unet.to(DEVICE, dtype=weight_dtype)
    vae.to(DEVICE, dtype=weight_dtype)
    text_encoder.to(DEVICE, dtype=weight_dtype)
    return tokenizer, noise_scheduler, unet, vae, text_encoder

def prepare_optimizer(unet, text_encoder, unet_learning_rate=5e-4, text_encoder_learning_rate=1e-4):
    """
    (1) Goal:
        - This function feeds the trainable parameters of the UNet and the Text Encoder into one optimizer, each group with its own learning rate

    (2) Arguments:
        - unet: UNet2DConditionModel, UNet from Hugging Face
        - text_encoder: CLIPTextModel, Text Encoder from Hugging Face
        - unet_learning_rate: float, learning rate for UNet
        - text_encoder_learning_rate: float, learning rate for Text Encoder

    (3) Returns:
        - output: Optimizer

    """
    unet_lora_layers = list(filter(lambda p: p.requires_grad, unet.parameters()))
    text_encoder_lora_layers = list(filter(lambda p: p.requires_grad, text_encoder.parameters()))
    trainable_params = [
        {"params": unet_lora_layers, "lr": unet_learning_rate},
        {"params": text_encoder_lora_layers, "lr": text_encoder_learning_rate}
    ]
    optimizer = torch.optim.AdamW(
        trainable_params,
        lr=unet_learning_rate,
    )
    return optimizer

def evaluate(pretrained_model_name_or_path, weight_dtype, seed, unet_path, text_encoder_path, validation_prompt, output_folder, train_emb):
    """
    (1) Goal:
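        - This function evaluates Stable Diffusion by loading the UNet and Text Encoder from the given paths and computing the face score (mean L2 distance between normalized face embeddings of generated and training images; lower is better), the CLIP score, and the number of faceless images.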

    (2) Arguments:
        - pretrained_model_name_or_path: str, model name from Hugging Face
        - weight_dtype: torch.type, model weight type
        - seed: int, random seed
        - unet_path: str, path to UNet model checkpoint
        - text_encoder_path: str, path to Text Encoder model checkpoint
        - validation_prompt: list, list of str storing texts for validation
        - output_folder: str, directory for saving generated images
        - train_emb: tensor, face features of training images

    (3) Returns:
        - output: face score (lower is better), CLIP score, the number of faceless images

    """
    pipeline = DiffusionPipeline.from_pretrained(
        pretrained_model_name_or_path,
        torch_dtype=weight_dtype,
        safety_checker=None,
    )
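    # Both checkpoints are whole pickled modules; weights_only=False is explicit because newer PyTorch releases default to True.
    pipeline.unet = torch.load(unet_path, weights_only=False)
    pipeline.text_encoder = torch.load(text_encoder_path, weights_only=False)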
    pipeline = pipeline.to(DEVICE)
    clip_model_name = "openai/clip-vit-base-patch32"
    clip_model = AutoModel.from_pretrained(clip_model_name)
    clip_processor = AutoProcessor.from_pretrained(clip_model_name)

    # run inference
    with torch.no_grad():
        generator = torch.Generator(device=DEVICE)
        generator = generator.manual_seed(seed)
        face_score = 0
        clip_score = 0
        mis = 0
        print("Generating validaion pictures ......")
        images = []
        for i in range(0, len(validation_prompt), 4):
            images.extend(pipeline(validation_prompt[i:min(i + 4, len(validation_prompt))], num_inference_steps=30, generator=generator).images)
        print("Calculating validaion score ......")
        valid_emb = []
        for i, image in enumerate(tqdm(images)):
            save_file = f"{output_folder}/valid_image_{i}.png"
            image.save(save_file)
            opencvImage = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
            emb = DeepFace.represent(
                opencvImage,
                detector_backend="ssd",
                model_name="GhostFaceNet",
                enforce_detection=False,
            )
            if emb == [] or emb[0]['face_confidence'] == 0:
                mis += 1
                continue
            emb = emb[0]
            inputs = clip_processor(text=validation_prompt[i], images=image, return_tensors="pt")
            with torch.no_grad():
                outputs = clip_model(**inputs)
            sim = outputs.logits_per_image
            clip_score += sim.item()
            valid_emb.append(emb['embedding'])
        if len(valid_emb) == 0:
            return 0, 0, mis
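        valid_emb = torch.tensor(valid_emb)
        # L2-normalize both embedding sets, then average all pairwise Euclidean distances (lower means more similar).
        valid_emb = (valid_emb / torch.norm(valid_emb, p=2, dim=-1)[:, None]).to(DEVICE)
        train_emb = (train_emb / torch.norm(train_emb, p=2, dim=-1)[:, None]).to(DEVICE)
        face_score = torch.cdist(valid_emb, train_emb, p=2).mean().item()
        # face_score = torch.min(face_score, 1)[0].mean()
        # Average the CLIP score over the images in which a face was detected.
        clip_score /= len(validation_prompt) - mis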
    return face_score, clip_score, mis
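
The face score computed at the end of `evaluate` can be illustrated without PyTorch. The sketch below uses plain Python lists in place of tensors, and the helper names `l2_normalize` and `mean_pairwise_distance` are hypothetical. It follows the same recipe: normalize every embedding to unit length, then average the Euclidean distance over all (generated, training) pairs. For unit vectors that distance equals sqrt(2 - 2·cos θ), so it ranges from 0 (same direction) to 2 (opposite direction), and lower values mean closer faces.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def mean_pairwise_distance(valid_embeddings, train_embeddings):
    """Mean Euclidean distance over all pairs of normalized embeddings."""
    valid = [l2_normalize(v) for v in valid_embeddings]
    train = [l2_normalize(t) for t in train_embeddings]
    distances = [math.dist(v, t) for v in valid for t in train]
    return sum(distances) / len(distances)


# Toy 2-D "embeddings": same direction, orthogonal, and opposite.
print(mean_pairwise_distance([[1.0, 0.0]], [[2.0, 0.0]]))
print(mean_pairwise_distance([[1.0, 0.0]], [[0.0, 3.0]]))
print(mean_pairwise_distance([[1.0, 0.0]], [[-1.0, 0.0]]))
```

Because the mean runs over every training image, one generated image is compared against the whole reference set rather than against its nearest match.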

## Prepare Dataset, LoRA model, and Optimizer
Declare everything needed for Stable Diffusion fine-tuning.

In [None]:
tokenizer, noise_scheduler, unet, vae, text_encoder = prepare_lora_model(pretrained_model_name_or_path, model_path)
optimizer = prepare_optimizer(unet, text_encoder, unet_learning_rate, text_encoder_learning_rate)
lr_scheduler = get_scheduler(
    lr_scheduler_name,
    optimizer=optimizer,
    num_warmup_steps=lr_warmup_steps,
    num_training_steps=max_train_steps,
    num_cycles=3
)

dataset = Text2ImageDataset(
    images_folder=images_folder,
    captions_folder=captions_folder,
    transform=train_transform,
    tokenizer=tokenizer,
)
def collate_fn(examples):
    # Drop samples the dataset returned as None because the image could not be loaded.
    examples = [example for example in examples if example is not None]
    pixel_values = []
    input_ids = []
    for tensor, input_id in examples:
        pixel_values.append(tensor)
        input_ids.append(input_id)
    pixel_values = torch.stack(pixel_values, dim=0).float()
    input_ids = torch.stack(input_ids, dim=0)
    return {"pixel_values": pixel_values, "input_ids": input_ids}
train_dataloader = torch.utils.data.DataLoader(
    dataset,
    shuffle=True,
    collate_fn=collate_fn,
    batch_size=train_batch_size,
    num_workers=8,
)
print("Preparation Finished!")

modelpath stable-diffusion-v1-5/stable-diffusion-v1-5


scheduler_config.json:   0%|          | 0.00/308 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/806 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/547 [00:00<?, ?B/s]

diffusion_pytorch_model.safetensors:   0%|          | 0.00/335M [00:00<?, ?B/s]

25-04-05 15:28:50 - Pre-trained weights is downloaded from https://github.com/HamadYA/GhostFaceNets/releases/download/v1.2/GhostFaceNet_W1.3_S1_ArcFace.h5 to /root/.deepface/weights/ghostfacenet_v1.h5


Downloading...
From: https://github.com/HamadYA/GhostFaceNets/releases/download/v1.2/GhostFaceNet_W1.3_S1_ArcFace.h5
To: /root/.deepface/weights/ghostfacenet_v1.h5
100%|██████████| 17.3M/17.3M [00:00<00:00, 46.9MB/s]


25-04-05 15:28:52 - Pre-trained weights is just downloaded to /root/.deepface/weights/ghostfacenet_v1.h5
25-04-05 15:28:52 - deploy.prototxt will be downloaded...


Downloading...
From: https://github.com/opencv/opencv/raw/3.4.0/samples/dnn/face_detector/deploy.prototxt
To: /root/.deepface/weights/deploy.prototxt
28.1kB [00:00, 43.4MB/s]                   


25-04-05 15:28:53 - res10_300x300_ssd_iter_140000.caffemodel will be downloaded...


Downloading...
From: https://github.com/opencv/opencv_3rdparty/raw/dnn_samples_face_detector_20170830/res10_300x300_ssd_iter_140000.caffemodel
To: /root/.deepface/weights/res10_300x300_ssd_iter_140000.caffemodel
100%|██████████| 10.7M/10.7M [00:00<00:00, 122MB/s]


Preparation Finished!
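
`Text2ImageDataset` matches the i-th sorted image with the i-th sorted caption file, so a missing or extra caption silently shifts every later pair. One way to check a folder beforehand is sketched below. It assumes that each caption shares its image's file stem (for example `001.jpg` and `001.txt`), which is a guess about the dataset layout and not something the notebook states; `pair_images_with_captions` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def pair_images_with_captions(images_folder, captions_folder):
    """Pair each image with the caption of the same stem; report images without one."""
    captions = {p.stem: p for p in Path(captions_folder).glob("*.txt")}
    pairs, missing = [], []
    for image_path in sorted(Path(images_folder).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip captions and any other non-image files
        caption_path = captions.get(image_path.stem)
        if caption_path is None:
            missing.append(image_path.name)
        else:
            pairs.append((image_path, caption_path))
    return pairs, missing


# Small self-contained demo in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "a.txt", "b.PNG", "notes.md"):
        (Path(tmp) / name).write_text("x")
    pairs, missing = pair_images_with_captions(tmp, tmp)
    paired_names = [(image.name, caption.name) for image, caption in pairs]

print(paired_names, missing)
```

Running a check like this on `images_folder` before training makes a caption mismatch visible instead of letting it degrade the fine-tuning quietly.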


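LoRA parameter changes:
*   lora_rank raised from 32 to 64: adds capacity to the LoRA layers
*   lora_alpha raised from 16 to 32: with the usual lora_alpha / lora_rank scaling this keeps the factor at 0.5 as the rank doubles, so the update strength stays comparable while the model has more room to fit the training data. These two values only take effect where the LoRA layers are created; this notebook loads ready-made `unet.pt` and `text_encoder.pt`.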
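
A short calculation shows what the rank and alpha settings do. It assumes the conventional LoRA update W + (lora_alpha / lora_rank)·B·A, where A has shape rank × d_in and B has shape d_out × rank; rank-stabilised variants scale differently. The helper names and the 320 × 320 layer size are hypothetical and chosen only for illustration.

```python
def lora_scaling(lora_alpha: float, lora_rank: int) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / lora_rank


def lora_param_count(d_in: int, d_out: int, lora_rank: int) -> int:
    """Trainable parameters added to one linear layer: A (rank x d_in) plus B (d_out x rank)."""
    return lora_rank * (d_in + d_out)


for rank, alpha in [(32, 16), (64, 32)]:
    print(
        f"rank={rank} alpha={alpha} "
        f"scaling={lora_scaling(alpha, rank)} "
        f"params={lora_param_count(320, 320, rank)}"
    )
```

Under that assumption, doubling both values doubles the number of trainable parameters per layer while leaving the scaling factor unchanged.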


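Learning-rate adjustments:

* Base learning rate lowered from 1e-4 to 5e-5: a smaller learning rate helps the model train more stably
* text_encoder_learning_rate set to half the UNet learning rate: this follows the recommendation to use half or a fifth of the UNet rate and helps the model pick up concepts
* Warmup steps raised from 100 to 200: a longer warmup keeps early training more stable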
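
The effect of the warmup and of `max_train_steps` on the learning rate can be sketched with the multiplier that a cosine schedule with hard restarts and linear warmup typically applies. The function below is a standalone approximation written for illustration; that the `"cosine_with_restarts"` option of `get_scheduler` behaves exactly this way is an assumption, and `cosine_hard_restarts_factor` is a hypothetical name.

```python
import math


def cosine_hard_restarts_factor(step, num_warmup_steps, num_training_steps, num_cycles=3):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    if progress >= 1.0:
        # Past the planned number of steps the multiplier stays at zero.
        return 0.0
    # Cosine decay that restarts from 1 at the start of each cycle.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))


base_lr = 5e-5
for step in (0, 100, 200, 500, 1000):
    print(step, base_lr * cosine_hard_restarts_factor(step, 200, 1000))
```

One consequence of this shape is that the multiplier is zero once `step` reaches `num_training_steps`, so a scheduler built for 200 steps would stop updating the weights if the loop then ran for 1000. The scheduler should therefore be created with the same `max_train_steps` that the training loop uses.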

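Training and validation settings:
* max_train_steps raised from 200 to 1000: more training steps so the model trains sufficiently
* validation_prompt_num raised from 3 to 4: more validation images for a broader evaluation
* validation_step_ratio lowered from 1 to 0.2: validation now runs every 20% of training (every 200 steps) instead of only at the end, which gives regular monitoring at the cost of extra evaluation time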

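Expected effects of these changes:
* A more expressive model (from the larger LoRA parameters)
* A more stable training process (from the tuned learning rate and warmup steps)
* More thorough training (from the additional training steps)
* A more comprehensive validation (from the additional validation images)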

In [9]:
# Do not change the following parameters, or the process may crash because the GPU runs out of memory.
output_folder = os.path.join(project_dir, "logs") # folder for model checkpoints and validation results
seed = 1126 # random seed
train_batch_size = 2 # training batch size
resolution = 512 # image size
weight_dtype = torch.bfloat16 # dtype the model weights are cast to
snr_gamma = 5
#####/content/drive/MyDrive/GenAI-HW4/Datasets
#@markdown ## Important parameters for fine-tuning Stable Diffusion
#@markdown ### ▶️ Learning Rate
#@markdown The learning rate is the most important for your results. If you want to train slower with lots of images, or if your dim and alpha are high, move the unet to 2e-4 or lower. <p>
#@markdown The text encoder helps your Lora learn concepts slightly better. It is recommended to make it half or a fifth of the unet. If you're training a style you can even set it to 0.
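# The preparation cell builds the optimizer and LR scheduler from these variables (including max_train_steps), so re-run it after changing them.
learning_rate = 5e-5 #@param {type:"number"}
unet_learning_rate = learning_rate
text_encoder_learning_rate = learning_rate * 0.5
lr_scheduler_name = "cosine_with_restarts" # learning-rate schedule
lr_warmup_steps = 200 # number of warmup steps during which the learning rate ramps up
#@markdown ### ▶️ Steps
#@markdown Choose the number of training steps and the number of images generated at each validation.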
max_train_steps = 1000 #@param {type:"slider", min:200, max:2000, step:100}
validation_prompt = "validation_prompt.txt"
validation_prompt_path = os.path.join(prompts_folder, validation_prompt)
validation_prompt_num = 4 #@param {type:"slider", min:1, max:5, step:1}
validation_step_ratio = 0.2 #@param {type:"slider", min:0, max:1, step:0.1}
with open(validation_prompt_path, "r") as f:
    validation_prompt = [line.strip() for line in f.readlines()]
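
The `snr_gamma = 5` setting feeds the loss weighting in the training loop, which clips the signal-to-noise ratio of each sampled timestep at `snr_gamma` before dividing. The scalar sketch below mirrors that arithmetic for a single timestep; `min_snr_weight` is a hypothetical helper, and the SNR values in the demo are made up for illustration.

```python
def min_snr_weight(snr: float, snr_gamma: float, prediction_type: str = "epsilon") -> float:
    """Per-timestep loss weight: min(snr, gamma) divided by snr (epsilon) or snr + 1 (v_prediction)."""
    clipped = min(snr, snr_gamma)
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1)
    raise ValueError(f"Unknown prediction type {prediction_type}")


# Low-noise timesteps (high SNR) are down-weighted; noisy timesteps keep weight 1.
for snr in (0.5, 5.0, 100.0):
    print(snr, min_snr_weight(snr, snr_gamma=5))
```

With epsilon prediction, every timestep whose SNR is at most `snr_gamma` keeps a weight of 1, while easier, nearly clean timesteps contribute proportionally less to the loss.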

In [None]:
# Fine-tuning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
progress_bar = tqdm(
    range(0, max_train_steps),
    initial=0,
    desc="Steps",
)
global_step = 0
num_epochs = math.ceil(max_train_steps / len(train_dataloader))
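# A ratio of 0 would give a zero interval and a modulo-by-zero below; fall back to validating only at the end.
validation_step = int(max_train_steps * validation_step_ratio) or max_train_steps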
best_face_score = float("inf")
for epoch in range(num_epochs):
    unet.train()
    text_encoder.train()
    for step, batch in enumerate(train_dataloader):
        if global_step >= max_train_steps:
            break
        latents = vae.encode(batch["pixel_values"].to(DEVICE, dtype=weight_dtype)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        # Sample noise that we'll add to the latents
        noise = torch.randn_like(latents)
        bsz = latents.shape[0]
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)
        timesteps = timesteps.long()
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

        # Get the text embedding for conditioning
        encoder_hidden_states = text_encoder(batch["input_ids"].to(latents.device), return_dict=False)[0]
        if noise_scheduler.config.prediction_type == "epsilon":
            target = noise
        elif noise_scheduler.config.prediction_type == "v_prediction":
            target = noise_scheduler.get_velocity(latents, noise, timesteps)
        else:
            raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")
        model_pred = unet(noisy_latents, timesteps, encoder_hidden_states, return_dict=False)[0]
        if not snr_gamma:
            loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
        else:
            snr = compute_snr(noise_scheduler, timesteps)
            mse_loss_weights = torch.stack([snr, snr_gamma * torch.ones_like(timesteps)], dim=1).min(
                dim=1
            )[0]
            if noise_scheduler.config.prediction_type == "epsilon":
                mse_loss_weights = mse_loss_weights / snr
            elif noise_scheduler.config.prediction_type == "v_prediction":
                mse_loss_weights = mse_loss_weights / (snr + 1)

            loss = F.mse_loss(model_pred.float(), target.float(), reduction="none")
            loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
            loss = loss.mean()

        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        global_step += 1

        if global_step % validation_step == 0 or global_step == max_train_steps:
            save_path = os.path.join(output_folder, f"checkpoint-last")
            unet_path = os.path.join(save_path, "unet.pt")
            text_encoder_path = os.path.join(save_path, "text_encoder.pt")
            print(f"Saving Checkpoint to {save_path} ......")
            os.makedirs(save_path, exist_ok=True)
            torch.save(unet, unet_path)
            torch.save(text_encoder, text_encoder_path)
            save_path = os.path.join(output_folder, f"checkpoint-{global_step + 1000}")
            os.makedirs(save_path, exist_ok=True)
            face_score, clip_score, mis = evaluate(
                pretrained_model_name_or_path=pretrained_model_name_or_path,
                weight_dtype=weight_dtype,
                seed=seed,
                unet_path=unet_path,
                text_encoder_path=text_encoder_path,
                validation_prompt=validation_prompt[:validation_prompt_num],
                output_folder=save_path,
                train_emb=dataset.train_emb
            )
            print("Step:", global_step, "Face Similarity Score:", face_score, "CLIP Score:", clip_score, "Faceless Images:", mis)
            if face_score < best_face_score:
                best_face_score = face_score
                save_path = os.path.join(output_folder, f"checkpoint-best-final")
                unet_path = os.path.join(save_path, "unet.pt")
                text_encoder_path = os.path.join(save_path, "text_encoder.pt")
                os.makedirs(save_path, exist_ok=True)
                torch.save(unet, unet_path)
                torch.save(text_encoder, text_encoder_path)
print("Fine-tuning Finished!!!")

Steps:   0%|          | 0/1000 [00:00<?, ?it/s]

Saving Checkpoint to /content/drive/MyDrive/GenAI-HW4/proj_brad/logs/checkpoint-last ......


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .


Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/4 [00:00<?, ?it/s]

Step: 200 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Saving Checkpoint to /content/drive/MyDrive/GenAI-HW4/proj_brad/logs/checkpoint-last ......


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]



Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/4 [00:00<?, ?it/s]

Step: 400 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Saving Checkpoint to /content/drive/MyDrive/GenAI-HW4/proj_brad/logs/checkpoint-last ......


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]



Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/4 [00:00<?, ?it/s]

Step: 600 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Saving Checkpoint to /content/drive/MyDrive/GenAI-HW4/proj_brad/logs/checkpoint-last ......


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]



Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/4 [00:00<?, ?it/s]

Step: 800 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Saving Checkpoint to /content/drive/MyDrive/GenAI-HW4/proj_brad/logs/checkpoint-last ......


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]



Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/4 [00:00<?, ?it/s]

Step: 1000 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Fine-tuning Finished!!!
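
The summary lines in the log above all follow the same `Step: ... Face Similarity Score: ... CLIP Score: ... Faceless Images: ...` layout, and in this run the values reported at steps 200 through 1000 are identical. A small parser makes that kind of pattern easy to spot. The following is a minimal sketch, not part of the original notebook: `LOG_PATTERN`, `parse_validation_log`, and `sample_log` are hypothetical names, and the sample text is copied from two of the lines above.

```python
import re

# Hypothetical helper (not from the notebook): matches the summary line
# printed after each validation pass in the training log.
LOG_PATTERN = re.compile(
    r"Step:\s*(?P<step>\d+)\s+"
    r"Face Similarity Score:\s*(?P<face>[\d.]+)\s+"
    r"CLIP Score:\s*(?P<clip>[\d.]+)\s+"
    r"Faceless Images:\s*(?P<faceless>\d+)"
)


def parse_validation_log(text):
    """Return one dict per 'Step: ...' summary line found in text."""
    rows = []
    for m in LOG_PATTERN.finditer(text):
        rows.append({
            "step": int(m["step"]),
            "face": float(m["face"]),
            "clip": float(m["clip"]),
            "faceless": int(m["faceless"]),
        })
    return rows


# Two summary lines copied from the log above.
sample_log = """
Step: 200 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
Step: 400 Face Similarity Score: 1.1893823146820068 CLIP Score: 30.515554904937744 Faceless Images: 0
"""

rows = parse_validation_log(sample_log)
# Count how many different (face, clip) pairs appear across validation passes.
distinct = {(r["face"], r["clip"]) for r in rows}
print(len(rows), "validation passes,", len(distinct), "distinct score pair(s)")
```

If every pass yields the same score pair, as it does here, it may be worth checking that validation is evaluating the weights being trained rather than a fixed checkpoint. The log alone does not say which is the case.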


In [10]:
checkpoint_path = os.path.join(output_folder, "checkpoint-best-final")  # choose which checkpoint to use for inference
unet_path = os.path.join(checkpoint_path, "unet.pt")
text_encoder_path = os.path.join(checkpoint_path, "text_encoder.pt")
inference_path = os.path.join(project_dir, "inference_best_final")
os.makedirs(inference_path, exist_ok=True)
train_image_paths = []
for ext in IMAGE_EXTENSIONS:
    train_image_paths.extend(glob.glob(f"{images_folder}/*{ext}"))
train_image_paths = sorted(train_image_paths)
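# Embed every training image with GhostFaceNet; enforce_detection=False keeps
# DeepFace from raising when the SSD detector finds no face in an image.
train_emb = torch.tensor([
    DeepFace.represent(
        img_path,
        detector_backend="ssd",
        model_name="GhostFaceNet",
        enforce_detection=False,
    )[0]["embedding"]
    for img_path in train_image_paths
])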

face_score, clip_score, mis = evaluate(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    weight_dtype=weight_dtype,
    seed=seed,
    unet_path=unet_path,
    text_encoder_path=text_encoder_path,
    validation_prompt=validation_prompt,
    output_folder=inference_path,
    train_emb=train_emb,
)
print("Face Similarity Score:", face_score, "CLIP Score:", clip_score, "Faceless Images:", mis)

25-04-06 03:31:43 - Pre-trained weights is downloaded from https://github.com/HamadYA/GhostFaceNets/releases/download/v1.2/GhostFaceNet_W1.3_S1_ArcFace.h5 to /root/.deepface/weights/ghostfacenet_v1.h5


Downloading...
From: https://github.com/HamadYA/GhostFaceNets/releases/download/v1.2/GhostFaceNet_W1.3_S1_ArcFace.h5
To: /root/.deepface/weights/ghostfacenet_v1.h5
100%|██████████| 17.3M/17.3M [00:00<00:00, 34.1MB/s]


25-04-06 03:31:44 - Pre-trained weights is just downloaded to /root/.deepface/weights/ghostfacenet_v1.h5
25-04-06 03:31:44 - deploy.prototxt will be downloaded...


Downloading...
From: https://github.com/opencv/opencv/raw/3.4.0/samples/dnn/face_detector/deploy.prototxt
To: /root/.deepface/weights/deploy.prototxt
28.1kB [00:00, 44.6MB/s]                   


25-04-06 03:31:45 - res10_300x300_ssd_iter_140000.caffemodel will be downloaded...


Downloading...
From: https://github.com/opencv/opencv_3rdparty/raw/dnn_samples_face_detector_20170830/res10_300x300_ssd_iter_140000.caffemodel
To: /root/.deepface/weights/res10_300x300_ssd_iter_140000.caffemodel
100%|██████████| 10.7M/10.7M [00:00<00:00, 23.6MB/s]



Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .



Generating validaion pictures ......


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Calculating validaion score ......


  0%|          | 0/25 [00:00<?, ?it/s]

Face Similarity Score: 1.1874144077301025 CLIP Score: 30.09142605463664 Faceless Images: 1
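
The internals of `evaluate` are defined earlier in the notebook and are not shown in this section, so how the face score is derived from `train_emb` is not visible here. As an illustration only, the sketch below shows one plausible way to compare two sets of face embeddings: the mean Euclidean distance over all (training, generated) pairs. The function names and the toy 2-D vectors are hypothetical. Real embeddings have many more dimensions, and the notebook may use a different metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(train_emb, gen_emb):
    """Average distance over every (training, generated) embedding pair.

    Assumption: this is one possible face score, not necessarily the one
    computed by the notebook's evaluate().
    """
    if not train_emb or not gen_emb:
        raise ValueError("both embedding lists must be non-empty")
    total = sum(euclidean(t, g) for t in train_emb for g in gen_emb)
    return total / (len(train_emb) * len(gen_emb))


# Toy 2-D embeddings standing in for real face embeddings.
train = [[0.0, 0.0], [3.0, 4.0]]
generated = [[0.0, 0.0]]

score = mean_pairwise_distance(train, generated)
print(score)  # distances are 0.0 and 5.0, so the mean is 2.5
```

Under this assumed metric, a smaller value would mean the generated faces sit closer to the training faces in embedding space. With tensors, the same pairwise computation could be written with `torch.cdist(train_emb, gen_emb).mean()`.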
