## LoRA + PEFT Fine-Tuning of Stable Diffusion with Metric Logging

This notebook fine-tunes the Stable Diffusion v1.5 model using Low-Rank
Adaptation (LoRA) and the PEFT framework on a custom speed-bump image–caption
dataset. The base model weights are frozen, and only LoRA adapter parameters
are trained. Training loss is logged at each step and saved as a JSON file for
visualization and analysis. The trained LoRA adapters are saved for
downstream testing and CLIP-based evaluation.

In [2]:
# IMPORTS
import os
import random
import numpy as np
from pathlib import Path
from dataclasses import dataclass
from datetime import datetime
import json

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

import warnings
warnings.filterwarnings("ignore")

#### Path setup

In [3]:
# PATH CONFIGURATION
IMG_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\data\processed\lora_ready")

OUTPUT_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\model\lora_peft_checkpoint")

METRICS_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\results\Metrics_json")

METRICS_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

if not IMG_DIR.exists():
    raise RuntimeError(f"Dataset directory not found: {IMG_DIR}")

print("Dataset directory:", IMG_DIR)
print("Output directory:", OUTPUT_DIR)

Dataset directory: D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\data\processed\lora_ready
Output directory: D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\model\lora_peft_checkpoint


#### Training configuration

In [4]:
# TRAINING CONFIG
@dataclass
class TrainConfig:
    model_id: str = "runwayml/stable-diffusion-v1-5"
    resolution: int = 512
    train_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    learning_rate: float = 1e-4
    max_train_steps: int = 1500 # keep this small for debug
    mixed_precision: str = "fp16" # fp16 for RTX 3050
    rank: int = 4
    seed: int = 42

cfg = TrainConfig()

In [5]:
# SMALL UTIL FUNCTIONS
def log(msg):
    print(f"[{datetime.now().strftime('%H:%M:%S')}] {msg}")

def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch (CPU and CUDA) RNGs for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

In [6]:
# DATASET
class LoraDebugDataset(Dataset):
    """Load images from a folder, resize and normalize them, and return image + caption pairs for LoRA training."""
    def __init__(self, img_dir: Path, resolution: int):
        self.img_dir = img_dir
        self.image_files = [
            f for f in img_dir.iterdir()
            if f.suffix.lower() in [".jpg", ".jpeg", ".png", ".bmp"]
        ]

        if len(self.image_files) == 0:
            raise RuntimeError("No images found in dataset folder!")

        self.transform = transforms.Compose([
            transforms.Resize((resolution, resolution)),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5])
        ])

        log(f"Dataset loaded with {len(self.image_files)} images")

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img_path = self.image_files[idx]
        txt_path = img_path.with_suffix(".txt")

        caption = txt_path.read_text(encoding="utf-8").strip()
        image = Image.open(img_path).convert("RGB")
        image = self.transform(image)

        return {
            "pixel_values": image,
            "caption": caption
        }

#### Training Function

In [7]:
# MAIN TRAINING FUNCTION
def main():

    # BASIC SETUP 
    set_seed(cfg.seed)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    log(f"Using device: {device}")

    # LOAD MODEL 
    log("Loading Stable Diffusion base model...")
    pipe = StableDiffusionPipeline.from_pretrained(
        cfg.model_id,
        torch_dtype=torch.float16 if cfg.mixed_precision == "fp16" else torch.float32,
        safety_checker=None
    ).to(device)

    tokenizer = pipe.tokenizer
    text_encoder = pipe.text_encoder
    vae = pipe.vae
    unet = pipe.unet
    scheduler = pipe.scheduler

    # FREEZE BASE MODEL 
    print("\n")
    log("Freezing VAE and Text Encoder...")
    vae.requires_grad_(False)
    text_encoder.requires_grad_(False)
    # The UNet is frozen as well; the LoRA adapters added below are the only trainable weights.
    unet.requires_grad_(False)

    if hasattr(unet, "enable_gradient_checkpointing"):
        unet.enable_gradient_checkpointing()

    # ADD LORA 
    log("Injecting LoRA into UNet...")
    lora_config = LoraConfig(
        r=cfg.rank,
        lora_alpha=cfg.rank * 4,
        lora_dropout=0.1,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"]
    )

    unet = get_peft_model(unet, lora_config).to(device)
    unet.train()

    trainable_params = [p for p in unet.parameters() if p.requires_grad]
    log(f"Trainable LoRA parameters: {sum(p.numel() for p in trainable_params):,}")

    # DATASET & DATALOADER 
    dataset = LoraDebugDataset(IMG_DIR, cfg.resolution)

    dataloader = DataLoader(
        dataset,
        batch_size=cfg.train_batch_size,
        shuffle=True,
        num_workers=0,
        pin_memory=(device == "cuda")
    )

    # OPTIMIZER & AMP 
    optimizer = torch.optim.AdamW(trainable_params, lr=cfg.learning_rate)
    use_fp16 = cfg.mixed_precision == "fp16" and device == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

    # METRIC LOGGING 
    train_losses = []
    step_history = []

    # TRAINING LOOP 
    global_step = 0
    print("\n")
    log("START TRAINING")

    while global_step < cfg.max_train_steps:
        for batch_idx, batch in enumerate(dataloader):

            if global_step >= cfg.max_train_steps:
                break

            with torch.cuda.amp.autocast(enabled=use_fp16):

                # TEXT -> EMBEDDINGS
                captions = batch["caption"]
                encoding = tokenizer(
                    list(captions),
                    padding="max_length",
                    truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt"
                ).to(device)

                with torch.no_grad():
                    text_embeddings = text_encoder(encoding.input_ids)[0]

                # IMAGE -> LATENTS
                pixels = batch["pixel_values"].to(device)

                with torch.no_grad():
                    latents = vae.encode(pixels).latent_dist.sample()
                    # 0.18215 is the SD v1.x latent scaling factor (vae.config.scaling_factor)
                    latents = latents * 0.18215

                # NOISE & TIMESTEP
                noise = torch.randn_like(latents)
                t = torch.randint(
                    0,
                    scheduler.config.num_train_timesteps,
                    (latents.shape[0],),
                    device=device
                )

                noisy_latents = scheduler.add_noise(latents, noise, t)

                # UNET PREDICTION
                preds = unet(
                    noisy_latents,
                    t,
                    encoder_hidden_states=text_embeddings
                ).sample

                # LOSS
                loss = nn.functional.mse_loss(preds, noise)
                loss = loss / cfg.gradient_accumulation_steps

            scaler.scale(loss).backward()

            if (global_step + 1) % cfg.gradient_accumulation_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)

            # LOG METRICS
            train_losses.append(loss.item() * cfg.gradient_accumulation_steps)
            step_history.append(global_step)

            if global_step % 10 == 0:
                gpu_mem = (
                    torch.cuda.memory_allocated() / (1024 ** 3)
                    if device == "cuda" else 0
                )

                log(
                    f"Step {global_step:04d} | "
                    f"Loss: {loss.item() * cfg.gradient_accumulation_steps:.6f} | "
                    f"GPU Mem: {gpu_mem:.2f} GB"
                )

            global_step += 1

    # SAVE LOGS 
    log_data = {
        "steps": step_history,
        "train_loss": train_losses
    }

    with open(METRICS_DIR / "train_logs.json", "w") as f:
        json.dump(log_data, f)
        
    print("\n")
    log("Training logs saved.")

    # SAVE TRAINED LORA 
    log("Saving trained LoRA adapter...")
    unet.save_pretrained(OUTPUT_DIR)

    log("TRAINING COMPLETE")
    log(f"LoRA saved to: {OUTPUT_DIR}")

In [8]:
# RUN
if __name__ == "__main__":
    main()

[23:44:25] Using device: cuda
[23:44:25] Loading Stable Diffusion base model...


Loading pipeline components...: 100%|██████████| 6/6 [00:07<00:00,  1.23s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .




[23:44:34] Freezing VAE and Text Encoder...
[23:44:34] Injecting LoRA into UNet...
[23:44:35] Trainable LoRA parameters: 797,184
[23:44:35] Dataset loaded with 61 images


[23:44:35] START TRAINING
[23:44:38] Step 0000 | Loss: 0.666697 | GPU Mem: 2.05 GB
[23:44:46] Step 0010 | Loss: 0.131168 | GPU Mem: 2.05 GB
[23:44:54] Step 0020 | Loss: 0.096082 | GPU Mem: 2.05 GB
[23:45:02] Step 0030 | Loss: 0.009931 | GPU Mem: 2.05 GB
[23:45:11] Step 0040 | Loss: 0.054086 | GPU Mem: 2.05 GB
[23:45:19] Step 0050 | Loss: 0.144473 | GPU Mem: 2.05 GB
[23:45:27] Step 0060 | Loss: 0.016732 | GPU Mem: 2.05 GB
[23:45:35] Step 0070 | Loss: 0.398107 | GPU Mem: 2.05 GB
[23:45:42] Step 0080 | Loss: 0.070085 | GPU Mem: 2.05 GB
[23:45:50] Step 0090 | Loss: 0.023817 | GPU Mem: 2.05 GB
[23:45:59] Step 0100 | Loss: 0.052459 | GPU Mem: 2.05 GB
[23:46:07] Step 0110 | Loss: 0.003549 | GPU Mem: 2.05 GB
[23:46:15] Step 0120 | Loss: 0.069916 | GPU Mem: 2.05 GB
[23:46:23] Step 0130 | Loss: 0.064113 | GPU Mem: 2.05 GB
[2

### Code Explanation 

This notebook sets up and runs LoRA-based fine-tuning of Stable Diffusion v1.5
on a custom image–caption dataset. After the imports and directory paths, a
`TrainConfig` dataclass collects the hyperparameters (model ID, resolution,
batch size, gradient accumulation steps, learning rate, step count, precision,
LoRA rank, and seed) in one place.
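
The configuration values interact: batch size times accumulation steps gives
the effective batch size, and the step budget fixes how many optimizer updates
and passes over the data the run makes. A minimal sketch of that arithmetic,
using the values from `TrainConfig` and the 61-image dataset size reported in
the run log. The names `Budget` and `training_budget` are hypothetical helpers,
not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    effective_batch_size: int
    optimizer_updates: int
    passes_over_data: float


def training_budget(batch_size, accumulation_steps, max_train_steps, dataset_size):
    """Derive batch, update, and epoch figures from the training configuration.

    In the notebook's loop, `global_step` counts micro-batches, and the
    optimizer steps once every `accumulation_steps` of them.
    """
    effective = batch_size * accumulation_steps
    updates = max_train_steps // accumulation_steps
    passes = max_train_steps * batch_size / dataset_size
    return Budget(effective, updates, passes)


# Values from TrainConfig and the logged dataset size (61 images).
budget = training_budget(batch_size=1, accumulation_steps=4,
                         max_train_steps=1500, dataset_size=61)
print(budget)
```

With these settings the run makes 375 optimizer updates at an effective batch
size of 4, roughly 24.6 passes over the 61 images.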

A custom PyTorch dataset class (`LoraDebugDataset`) loads each image together
with the caption stored in the `.txt` file of the same name, resizes the image
to 512×512, converts it to a tensor, and normalizes pixel values from [0, 1] to
[-1, 1], the input range the Stable Diffusion VAE expects.
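
Because `__getitem__` reads the caption with `img_path.with_suffix(".txt")`, an
image without a matching caption file only fails once the data loader reaches
it. A small pre-flight check can surface such files before training starts.
This is a sketch: `find_missing_captions` and the demo file names are
hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path

# Same image extensions the dataset class accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_missing_captions(img_dir):
    """Return image files in img_dir that have no same-named .txt caption."""
    missing = []
    for path in sorted(Path(img_dir).iterdir()):
        is_image = path.suffix.lower() in IMAGE_SUFFIXES
        if is_image and not path.with_suffix(".txt").exists():
            missing.append(path)
    return missing


# Demo on a throwaway folder: one captioned image and one uncaptioned image.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "bump_01.png").touch()
    (tmp / "bump_01.txt").write_text("a speed bump on a road", encoding="utf-8")
    (tmp / "bump_02.jpg").touch()
    demo_missing = [p.name for p in find_missing_captions(tmp)]

print(demo_missing)
```

In the notebook, the same function could be called as
`find_missing_captions(IMG_DIR)` before constructing the dataset.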

Inside the main training function, the Stable Diffusion base model is loaded
and its VAE, text encoder, and UNet parameters are all frozen. LoRA adapters
are then injected into the UNet attention projections (`to_q`, `to_k`, `to_v`,
and `to_out.0`) using PEFT, so that only the LoRA parameters are trainable
(797,184 at rank 4 in the logged run).
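
The trainable-parameter count reported in the run log can be reproduced by
hand. A rank-`r` adapter on a linear layer with `d_in` inputs and `d_out`
outputs adds two matrices, `A` (`r × d_in`) and `B` (`d_out × r`), so
`r * (d_in + d_out)` parameters. The sketch below assumes the usual Stable
Diffusion v1.5 UNet layout: 16 transformer blocks at channel widths 320, 640,
and 1280, each with one self-attention and one cross-attention module, and
768-dimensional text embeddings. Those layout numbers are not stated in the
notebook and are assumptions here.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def attention_block_lora_params(width, text_dim, rank):
    """LoRA parameters for one transformer block (self-attention + cross-attention)."""
    # Self-attention: to_q, to_k, to_v, to_out.0 all map width -> width.
    self_attn = 4 * lora_params(width, width, rank)
    # Cross-attention: to_q and to_out.0 map width -> width, while
    # to_k and to_v project the text embeddings (text_dim -> width).
    cross_attn = (2 * lora_params(width, width, rank)
                  + 2 * lora_params(text_dim, width, rank))
    return self_attn + cross_attn


# ASSUMED SD v1.5 UNet layout: number of transformer blocks per channel width.
BLOCKS_PER_WIDTH = {320: 5, 640: 5, 1280: 6}
TEXT_DIM = 768  # assumed CLIP text embedding size
RANK = 4        # cfg.rank in the notebook

total = sum(
    count * attention_block_lora_params(width, TEXT_DIM, RANK)
    for width, count in BLOCKS_PER_WIDTH.items()
)
print(f"{total:,}")  # 797,184
```

The total agrees with the "Trainable LoRA parameters" line in the run log,
which suggests the assumed layout matches the modules PEFT wrapped.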

During each training step, captions are tokenized and encoded into text
embeddings, images are encoded into latents by the VAE, and Gaussian noise is
added at a randomly sampled timestep through the scheduler's `add_noise`. The
UNet is trained to predict this noise with a Mean Squared Error (MSE) loss.
Gradients are accumulated over 4 steps before each optimizer update, and fp16
mixed precision with gradient scaling keeps memory use low (about 2 GB
allocated in the logged run).
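
The noising step has a closed form:
`x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps`, where `a_t` is the cumulative
product of `(1 - beta)` up to timestep `t`. The sketch below applies it to a
single scalar standing in for one latent value. The beta schedule
(scaled-linear from 0.00085 to 0.012 over 1000 steps) is the one commonly
configured for SD v1.x schedulers and is an assumption here, since the notebook
takes its scheduler from the loaded pipeline.

```python
import math
import random


def scaled_linear_alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear schedule (assumed values)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alphas_cumprod, running = [], 1.0
    for i in range(num_steps):
        # Betas are linear in sqrt-space, then squared.
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0, noise, t, alphas_cumprod):
    """Forward diffusion: mix the clean value with noise according to timestep t."""
    signal_scale = math.sqrt(alphas_cumprod[t])
    noise_scale = math.sqrt(1.0 - alphas_cumprod[t])
    return signal_scale * x0 + noise_scale * noise


alphas_cumprod = scaled_linear_alphas_cumprod()
rng = random.Random(42)
x0 = 0.8                                 # stand-in for one latent value
noise = rng.gauss(0.0, 1.0)              # the target the UNet learns to predict
t = rng.randrange(len(alphas_cumprod))   # random timestep, as in the training loop
x_t = add_noise(x0, noise, t, alphas_cumprod)

# Early timesteps keep almost all of the signal; late ones are almost pure noise.
print(math.sqrt(alphas_cumprod[0]), math.sqrt(alphas_cumprod[-1]))
```

Because the timestep is drawn at random for every batch, the per-step loss
mixes easy and hard denoising problems, which is one reason the logged loss
fluctuates from step to step.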

Per-step training loss and step indices are written to `train_logs.json` for
later visualization. After training completes, the learned LoRA adapter
weights are saved with `save_pretrained` for inference and evaluation.
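
The per-step loss is noisy (the logged values range from about 0.004 to 0.4),
so a smoothed curve is usually easier to read. The sketch below loads a file in
the same `{"steps": [...], "train_loss": [...]}` layout and applies a trailing
moving average. The helper names are hypothetical, and the demo uses only the
first four values printed in the run log rather than a real `train_logs.json`.

```python
import json
import tempfile
from pathlib import Path


def moving_average(values, window):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def load_train_logs(path):
    """Read the steps and losses from a JSON file in the notebook's log layout."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["steps"], data["train_loss"]


# Demo: write a tiny sample file, then read it back and smooth it.
sample = {
    "steps": [0, 10, 20, 30],
    "train_loss": [0.666697, 0.131168, 0.096082, 0.009931],
}
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "train_logs.json"
    log_path.write_text(json.dumps(sample), encoding="utf-8")
    steps, losses = load_train_logs(log_path)

smoothed = moving_average(losses, window=2)
print(list(zip(steps, smoothed)))
```

For the real file, pointing `load_train_logs` at
`METRICS_DIR / "train_logs.json"` and using a larger window (for example 50
steps) would be a reasonable starting point.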
