# Running Stable Diffusion 3 (SD3) DreamBooth LoRA training under 16GB GPU VRAM

## Install Dependencies

In [None]:
!pip install -q -U git+https://github.com/huggingface/diffusers
!pip install -q -U \
    transformers \
    accelerate \
    wandb \
    bitsandbytes \
    peft

SD3 is a gated model. Before using it with diffusers, go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form, and accept the gate. Then log in so that your environment is authorized to download the weights:

In [None]:
!huggingface-cli login

## Clone `diffusers`

In [None]:
!git clone https://github.com/huggingface/diffusers
%cd diffusers/examples/research_projects/sd3_lora_colab

## Download instance data images

In [12]:
from huggingface_hub import snapshot_download

local_dir = "./faith_knox/stable"
snapshot_download(
    "diffusers/dog-example",
    local_dir=local_dir, repo_type="dataset",
    ignore_patterns=".gitattributes",
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

'/content/faith_knox/stable'

In [None]:
!rm -rf ./faith_knox/stable/.cache

## Compute embeddings

Here we use the default instance prompt, "a photo of sks dog", but you can change it. Refer to the `compute_embeddings.py` script for details on the other supported arguments.

In [14]:
%%writefile /content/compute_embeddings.py
# Paste your script contents here
import os

def main():
    print("Running embedding computation...")

if __name__ == "__main__":
    main()


Writing /content/compute_embeddings.py


In [16]:
!python3 /content/compute_embeddings.py


Running embedding computation...


In [17]:
!python compute_embeddings.py --instance_data_dir="./faith_knox/stable"


Running embedding computation...


## Clear memory

In [None]:
import torch
import gc


def flush():
    torch.cuda.empty_cache()
    gc.collect()

flush()

## Train!

In [None]:
!accelerate launch train_dreambooth_lora_sd3_miniature.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers"  \
  --instance_data_dir="./faith_knox/stable" \
  --data_df_path="sample_embeddings.parquet" \
  --output_dir="trained-sd3-lora-miniature" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0"

Training takes about an hour, depending on the size of your dataset.
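
As a rough guide to how the flags above translate into passes over the data, the sketch below assumes the script follows the usual diffusers convention: `--max_train_steps` counts optimizer updates, and each update consumes up to `train_batch_size × gradient_accumulation_steps` samples. The helper name `training_schedule` is hypothetical and not part of the training script.

```python
import math


def training_schedule(num_images, train_batch_size, grad_accum_steps, max_train_steps):
    """Estimate optimizer updates per epoch and the number of epochs.

    Assumption: max_train_steps counts optimizer updates, and a partial
    accumulation window at the end of an epoch still triggers an update.
    """
    batches_per_epoch = math.ceil(num_images / train_batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    num_epochs = math.ceil(max_train_steps / updates_per_epoch)
    return updates_per_epoch, num_epochs


# Example: five instance images with the flags used in the command above.
updates_per_epoch, num_epochs = training_schedule(
    num_images=5, train_batch_size=1, grad_accum_steps=4, max_train_steps=500
)
print(f"{updates_per_epoch} updates per epoch, {num_epochs} epochs")
```

With a larger dataset the same 500 updates cover far fewer epochs, so `--max_train_steps` may need adjusting when the number of images changes.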

## Inference

In [None]:
flush()

In [None]:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
)
lora_output_path = "trained-sd3-lora-miniature"
pipeline.load_lora_weights(lora_output_path)

pipeline.enable_sequential_cpu_offload()

image = pipeline("a photo of sks dog in a bucket").images[0]
image.save("bucket_dog.png")

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
!unzip "/content/drive/MyDrive/faithknoxtraining/stable.zip" -d /content/faith_knox


Archive:  /content/drive/MyDrive/faithknoxtraining/stable.zip
   creating: /content/faith_knox/stable/
   creating: /content/faith_knox/stable/compressed/
  inflating: /content/faith_knox/__MACOSX/stable/._compressed  
  inflating: /content/faith_knox/stable/._compressed  
  inflating: /content/faith_knox/stable/th-2645293770.jpg  
  inflating: /content/faith_knox/__MACOSX/stable/._th-2645293770.jpg  
  inflating: /content/faith_knox/stable/._th-2645293770.jpg  
  inflating: /content/faith_knox/stable/th-3916788417.jpg  
  inflating: /content/faith_knox/__MACOSX/stable/._th-3916788417.jpg  
  inflating: /content/faith_knox/stable/._th-3916788417.jpg  
  inflating: /content/faith_knox/stable/th-4019533498.jpg  
  inflating: /content/faith_knox/__MACOSX/stable/._th-4019533498.jpg  
  inflating: /content/faith_knox/stable/._th-4019533498.jpg  
  inflating: /content/faith_knox/stable/th-4048935629.jpg  
  inflating: /content/faith_knox/__MACOSX/stable/._th-4048935629.jpg  
  inflating: /co
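
The listing above shows that the archive was created on macOS: alongside each image, `unzip` writes a `__MACOSX/` entry and an AppleDouble `._*` file. These files carry a `.jpg` extension but are not images, so they may trip up code that opens every file in the data directory. A minimal sketch of an extraction step that skips them, using only the standard library (the function names are hypothetical):

```python
import zipfile
from pathlib import Path, PurePosixPath


def is_macos_metadata(member_name):
    """Return True for __MACOSX/ entries and AppleDouble '._*' files."""
    parts = PurePosixPath(member_name).parts
    return "__MACOSX" in parts or any(part.startswith("._") for part in parts)


def extract_without_macos_metadata(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping macOS metadata entries.

    Returns the list of archive member names that were extracted.
    """
    dest = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if is_macos_metadata(info.filename):
                continue
            archive.extract(info, dest)
            extracted.append(info.filename)
    return extracted
```

In this notebook it could be called as `extract_without_macos_metadata("/content/drive/MyDrive/faithknoxtraining/stable.zip", "/content/faith_knox")` in place of the `unzip` command.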

In [4]:
!--instance_data_dir="/content/faith_knox"


/bin/bash: --: invalid option
Usage:	/bin/bash [GNU long option] [option] ...
	/bin/bash [GNU long option] [option] script-file ...
GNU long options:
	--debug
	--debugger
	--dump-po-strings
	--dump-strings
	--help
	--init-file
	--login
	--noediting
	--noprofile
	--norc
	--posix
	--pretty-print
	--rcfile
	--restricted
	--verbose
	--version
Shell options:
	-ilrsD or -c command or -O shopt_option		(invocation only)
	-abefhkmnptuvxBCHP or -o option


Note that inference with `enable_sequential_cpu_offload()` is very slow, because the individual components of the model are loaded onto the GPU and unloaded again as they are needed, which introduces significant data movement overhead. Refer to [this resource](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more memory optimization techniques.

In [7]:
!wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sd3.py


--2025-05-10 20:33:07--  https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sd3.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82055 (80K) [text/plain]
Saving to: ‘train_dreambooth_lora_sd3.py’


2025-05-10 20:33:07 (5.37 MB/s) - ‘train_dreambooth_lora_sd3.py’ saved [82055/82055]



In [10]:
!accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-3-medium-diffusers \
  --instance_data_dir=/content/faith_knox \
  --data_df_path=sample_embeddings.parquet \
  --output_dir=/content/trained-faithknox-lora \
  --mixed_precision=fp16 \
  --instance_prompt="a photo of sks faithknox woman" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to=wandb \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed=0


The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-05-10 20:34:25.970548: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746909265.989943    3097 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746909265.995933    3097 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-10 20:34:26.015297: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in perf

In [19]:
!pip install faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.3/31.3 MB 70.4 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [21]:
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git


Collecting ftfy
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.8/44.8 kB 2.9 MB/s eta 0:00:00
Installing collected packages: ftfy
Successfully installed ftfy-6.3.1
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-i31lt1vw
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-i31lt1vw
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting

In [22]:
import clip
import torch
from PIL import Image
import numpy as np


In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained CLIP model
model, preprocess = clip.load("ViT-B/32", device)


100%|███████████████████████████████████████| 338M/338M [00:04<00:00, 83.6MiB/s]


In [24]:
from google.colab import files

# Upload the images
uploaded = files.upload()

# Once uploaded, they will be available in the current working directory
# Example: Display the names of uploaded files
print(uploaded.keys())


Saving stable.zip to stable.zip
dict_keys(['stable.zip'])


In [26]:
import os

# List everything in the root directory
for f in os.listdir('/content'):
    print(f)


.config
stable.zip
faith_knox
compute_embeddings.py
drive
train_dreambooth_lora_sd3.py
sample_data


In [27]:
!unzip -q stable.zip -d /content/stable_unzipped


In [28]:
import os

unzipped_path = '/content/stable_unzipped'
for fname in os.listdir(unzipped_path):
    print(fname)


__MACOSX
stable


In [30]:
# Import glob module to search for image files
import glob

# Adjust for the actual path inside 'stable' folder
image_dir = '/content/stable_unzipped/stable'

image_paths = glob.glob(f"{image_dir}/**/*.jpg", recursive=True) + \
              glob.glob(f"{image_dir}/**/*.png", recursive=True) + \
              glob.glob(f"{image_dir}/**/*.jpeg", recursive=True)

print(f"Found {len(image_paths)} images.")



Found 208 images.
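
The `glob` patterns above are case-sensitive on Linux, so files such as `photo.JPG` would be missed, and `glob.glob` makes no guarantee about ordering. A hedged alternative using `pathlib` is sketched below; it matches extensions case-insensitively, skips hidden files explicitly (unlike the `glob` module, `Path.rglob("*")` does return dotfiles), and sorts the result so that the row order of any derived array is reproducible. The name `find_images` is hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_images(root):
    """Return sorted image paths under root, skipping hidden files and directories."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        # Skip anything whose relative path contains a dot-prefixed component,
        # e.g. AppleDouble "._*.jpg" files or ".cache" directories.
        rel_parts = path.relative_to(root).parts
        if any(part.startswith(".") for part in rel_parts):
            continue
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            found.append(str(path))
    return sorted(found)
```

For this notebook, `image_paths = find_images('/content/stable_unzipped/stable')` would be the equivalent call.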


In [31]:
import torch
from transformers import CLIPProcessor, CLIPModel
import numpy as np
from tqdm import tqdm


In [33]:
# Initialize the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


In [34]:
from PIL import Image

# Function to compute embeddings for images
def compute_image_embeddings(image_paths):
    embeddings = []

    # Loop through all the image paths and compute embeddings
    for image_path in tqdm(image_paths, desc="Processing images"):
        # Open the image and force 3-channel RGB (handles grayscale, RGBA, and palette files)
        with Image.open(image_path) as img:
            image = img.convert("RGB")

        # Preprocess the image with the CLIP processor
        inputs = processor(images=image, return_tensors="pt")

        # Generate the image embeddings without tracking gradients
        with torch.no_grad():
            image_features = model.get_image_features(**inputs)

        embeddings.append(image_features.cpu().numpy())  # Each entry has shape (1, embedding_dim)

    # Stack into a single array of shape (num_images, embedding_dim)
    return np.concatenate(embeddings, axis=0)


In [35]:
# Compute embeddings for all images
embeddings = compute_image_embeddings(image_paths)

# Save the embeddings to a .npy file (you can use this later for training or other purposes)
np.save('/content/faith_knox_embeddings.npy', embeddings)


Processing images: 100%|██████████| 208/208 [01:09<00:00,  2.99it/s]
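
One possible use of the saved embeddings is to compare images with each other, for example to spot near-duplicates in the training set. The sketch below is an illustration under stated assumptions rather than part of the original workflow: `cosine_similarity_matrix` is a hypothetical helper, and a small synthetic array stands in for the file written above. It accepts either one row per image or an extra singleton axis per image.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the N x N cosine-similarity matrix for N embedding vectors.

    Accepts shape (N, D) or (N, 1, D); the latter is what stacking
    per-image (1, D) arrays produces.
    """
    vectors = np.asarray(embeddings, dtype=np.float32)
    vectors = vectors.reshape(vectors.shape[0], -1)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Synthetic stand-in for np.load('/content/faith_knox_embeddings.npy')
demo = np.array([[[1.0, 0.0]], [[0.0, 2.0]], [[3.0, 0.0]]])
similarity = cosine_similarity_matrix(demo)
print(np.round(similarity, 3))
```

Entries close to 1 off the diagonal indicate pairs of images whose embeddings point in nearly the same direction.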
