## Data Preprocessing for Stable Diffusion LoRA Training

This notebook preprocesses the raw image dataset for Stable Diffusion
fine-tuning using LoRA and PEFT. The script reads the raw images, resizes
them to 512×512 (the native training resolution of Stable Diffusion v1.x
checkpoints), renames them sequentially, and writes them to a single output
directory. The result is a uniformly sized, consistently named dataset that
is ready for training.


In [4]:
import os
import cv2
from pathlib import Path
from tqdm import tqdm

In [5]:
# CONFIG 
SRC_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\data\raw\quality_images")
OUT_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\data\processed\lora_ready")

RES = 512      # target side length in pixels (output is RES x RES)
DEBUG = True   # set to False to silence the per-file [OK] logs

OUT_DIR.mkdir(parents=True, exist_ok=True)

In [6]:
# MAIN PROCESS
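img_count = 0   # images written successfully
skipped = 0     # unreadable, already-existing, or failed images
out_idx = 0     # index used in the output filename

image_files = sorted(SRC_DIR.glob("*.*"))
print(f"\nFound {len(image_files)} images in source folder\n")
for img_path in tqdm(image_files):
    try:
        # cv2.imread returns None (it does not raise) when decoding fails
        img = cv2.imread(str(img_path))
        if img is None:
            print(f"[WARNING] Could not read image: {img_path.name}")
            skipped += 1
            continue

        # Resize to RES x RES (this stretches non-square images)
        img = cv2.resize(img, (RES, RES), interpolation=cv2.INTER_AREA)

        # Clean renamed output. out_idx advances for every readable image,
        # so one existing file does not block all later images.
        out_name = f"speedbump_{out_idx:05d}.jpg"
        out_path = OUT_DIR / out_name
        out_idx += 1

        # Safety check: never overwrite an existing file
        if out_path.exists():
            print(f"[SKIP] File already exists: {out_name}")
            skipped += 1
            continue

        # cv2.imwrite returns False on failure instead of raising
        if not cv2.imwrite(str(out_path), img):
            print(f"[ERROR] Could not write: {out_name}")
            skipped += 1
            continue
        img_count += 1
        if DEBUG:
            print(f"[OK] {img_path.name} → {out_name}")
    except Exception as e:
        print(f"[ERROR] Failed on {img_path.name}: {e}")
        skipped += 1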
print("\n--")
print(f"Total processed: {img_count}")
print(f"Total skipped  : {skipped}")


Found 61 images in source folder



 10%|▉         | 6/61 [00:00<00:00, 58.50it/s]

[OK] speedbump_00000.jpg → speedbump_00000.jpg
[OK] speedbump_00001.jpg → speedbump_00001.jpg
[OK] speedbump_00002.jpg → speedbump_00002.jpg
[OK] speedbump_00003.jpg → speedbump_00003.jpg
[OK] speedbump_00004.jpg → speedbump_00004.jpg
[OK] speedbump_00005.jpg → speedbump_00005.jpg
[OK] speedbump_00006.jpg → speedbump_00006.jpg
[OK] speedbump_00007.jpg → speedbump_00007.jpg
[OK] speedbump_00008.jpg → speedbump_00008.jpg
[OK] speedbump_00009.jpg → speedbump_00009.jpg


 30%|██▉       | 18/61 [00:00<00:00, 55.95it/s]

[OK] speedbump_00010.jpg → speedbump_00010.jpg
[OK] speedbump_00011.jpg → speedbump_00011.jpg
[OK] speedbump_00012.jpg → speedbump_00012.jpg
[OK] speedbump_00013.jpg → speedbump_00013.jpg
[OK] speedbump_00014.jpg → speedbump_00014.jpg
[OK] speedbump_00015.jpg → speedbump_00015.jpg
[OK] speedbump_00016.jpg → speedbump_00016.jpg
[OK] speedbump_00017.jpg → speedbump_00017.jpg
[OK] speedbump_00018.jpg → speedbump_00018.jpg
[OK] speedbump_00019.jpg → speedbump_00019.jpg
[OK] speedbump_00020.jpg → speedbump_00020.jpg
[OK] speedbump_00021.jpg → speedbump_00021.jpg
[OK] speedbump_00022.jpg → speedbump_00022.jpg
[OK] speedbump_00023.jpg → speedbump_00023.jpg


 52%|█████▏    | 32/61 [00:00<00:00, 60.61it/s]

[OK] speedbump_00024.jpg → speedbump_00024.jpg
[OK] speedbump_00025.jpg → speedbump_00025.jpg
[OK] speedbump_00026.jpg → speedbump_00026.jpg
[OK] speedbump_00027.jpg → speedbump_00027.jpg
[OK] speedbump_00028.jpg → speedbump_00028.jpg
[OK] speedbump_00029.jpg → speedbump_00029.jpg
[OK] speedbump_00030.jpg → speedbump_00030.jpg
[OK] speedbump_00031.jpg → speedbump_00031.jpg
[OK] speedbump_00032.jpg → speedbump_00032.jpg
[OK] speedbump_00033.jpg → speedbump_00033.jpg
[OK] speedbump_00034.jpg → speedbump_00034.jpg
[OK] speedbump_00035.jpg → speedbump_00035.jpg
[OK] speedbump_00036.jpg → speedbump_00036.jpg
[OK] speedbump_00037.jpg → speedbump_00037.jpg


 75%|███████▌  | 46/61 [00:00<00:00, 62.53it/s]

[OK] speedbump_00038.jpg → speedbump_00038.jpg
[OK] speedbump_00039.jpg → speedbump_00039.jpg
[OK] speedbump_00040.jpg → speedbump_00040.jpg
[OK] speedbump_00041.jpg → speedbump_00041.jpg
[OK] speedbump_00042.jpg → speedbump_00042.jpg
[OK] speedbump_00043.jpg → speedbump_00043.jpg
[OK] speedbump_00044.jpg → speedbump_00044.jpg
[OK] speedbump_00045.jpg → speedbump_00045.jpg
[OK] speedbump_00046.jpg → speedbump_00046.jpg
[OK] speedbump_00047.jpg → speedbump_00047.jpg
[OK] speedbump_00048.jpg → speedbump_00048.jpg
[OK] speedbump_00049.jpg → speedbump_00049.jpg
[OK] speedbump_00050.jpg → speedbump_00050.jpg
[OK] speedbump_00051.jpg → speedbump_00051.jpg


100%|██████████| 61/61 [00:01<00:00, 60.51it/s]

[OK] speedbump_00052.jpg → speedbump_00052.jpg
[OK] speedbump_00053.jpg → speedbump_00053.jpg
[OK] speedbump_00054.jpg → speedbump_00054.jpg
[OK] speedbump_00055.jpg → speedbump_00055.jpg
[OK] speedbump_00056.jpg → speedbump_00056.jpg
[OK] speedbump_00057.jpg → speedbump_00057.jpg
[OK] speedbump_00058.jpg → speedbump_00058.jpg
[OK] speedbump_00059.jpg → speedbump_00059.jpg
[OK] speedbump_00060.jpg → speedbump_00060.jpg

--
Total processed: 61
Total skipped  : 0





### Code Explanation

The preprocessing pipeline imports `cv2` (OpenCV) for reading, resizing, and
writing images, `Path` from `pathlib` for cross-platform path handling, and
`tqdm` for a progress bar. `os` is also imported, but the code shown here
does not use it; all file handling goes through `Path`.

Two directory paths are defined:
- `SRC_DIR`: This points to the raw input image directory.
- `OUT_DIR`: This specifies where the processed images will be saved.

The output resolution is fixed at `512 × 512` through `RES`, the resolution
Stable Diffusion v1.x models were trained at; other model versions may expect
a different size. While the `DEBUG` flag is `True`, the script prints one log
line per processed image so the renaming can be verified.

The call `OUT_DIR.mkdir(parents=True, exist_ok=True)` creates the output
directory together with any missing parent directories, and does nothing if
the directory already exists.
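
As a small illustration of that behavior, the sketch below wraps the same `mkdir` call in a helper and runs it twice on a throwaway temporary directory. The helper name `ensure_dir` and the temporary layout are assumptions made for this example; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a temporary directory (hypothetical layout, not the project paths)
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_dir(Path(tmp) / "data" / "processed" / "lora_ready")
    ensure_dir(out_dir)  # second call is a no-op and raises nothing
    created = out_dir.is_dir()

print(created)  # prints: True
```

Without `exist_ok=True`, the second call would raise `FileExistsError`, which is why the flag matters when a notebook cell is re-run.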

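The script then collects every entry in the source directory whose name
matches `*.*`, sorts the paths, and iterates over them with a `tqdm` progress
bar. The pattern does not filter by file type, so non-image files are picked
up as well. Each file is loaded with `cv2.imread`, which returns `None`
instead of raising when it cannot decode a file; such files are skipped with
a warning and counted as skipped.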
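
If non-image files in the source folder are a concern, one option is to filter by extension before loading. The following is a minimal sketch, not the notebook's method: the function name `list_image_files` and the extension set `IMAGE_EXTS` are assumptions and should be adjusted to the actual dataset.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; adjust to whatever the raw dataset contains
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_image_files(src_dir, exts=IMAGE_EXTS):
    """Return the sorted regular files in src_dir whose suffix is in exts."""
    return sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in exts
    )


# Demo on a temporary folder containing a mix of image and non-image names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.JPG", "a.png", "notes.txt", "Thumbs.db"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_image_files(tmp)]

print(found)  # prints: ['a.png', 'b.JPG']
```

Comparing the lower-cased suffix keeps files such as `b.JPG` while dropping `notes.txt` and `Thumbs.db` before they ever reach the image loader.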

Each valid image is resized to 512 × 512 pixels with `cv2.INTER_AREA` (area
interpolation), which is well suited to downscaling. The resize goes straight
to a square, so non-square inputs are stretched rather than cropped. The
processed images are renamed in a sequential format (e.g.,
`speedbump_00001.jpg`) to give uniform names and avoid conflicts during
training.
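
When the stretching of non-square images is unwanted, a common alternative is to center-crop to a square first and then resize. The sketch below only computes the crop box and names the interpolation flag one might pick; both helper names are hypothetical, and the notebook itself resizes directly without cropping.

```python
def center_crop_box(width, height):
    """Return (x0, y0, x1, y1) of the largest centered square in the image."""
    side = min(width, height)
    x0 = (width - side) // 2
    y0 = (height - side) // 2
    return x0, y0, x0 + side, y0 + side


def pick_interpolation(side, target):
    """Name the OpenCV flag usually preferred for this scale direction.

    INTER_AREA is the usual choice for shrinking, INTER_CUBIC for enlarging.
    """
    return "INTER_AREA" if side >= target else "INTER_CUBIC"


box = center_crop_box(800, 600)
print(box)                           # prints: (100, 0, 700, 600)
print(pick_interpolation(600, 512))  # prints: INTER_AREA
print(pick_interpolation(300, 512))  # prints: INTER_CUBIC
```

With an OpenCV image array `img` of shape `(height, width, channels)`, the crop would be `img[y0:y1, x0:x1]`, followed by `cv2.resize` with the matching flag, for example `cv2.INTER_AREA`.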

Before saving, the script performs a safety check to avoid overwriting
existing files. If a filename already exists, the image is skipped. All
successfully processed images are saved to the output directory.
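
An alternative to skipping is to advance to the next unused index, so that new images can be appended to a partly filled output directory. This is a sketch under that assumption rather than what the notebook does; `next_free_path` is a hypothetical helper, and its `prefix` and `ext` defaults simply mirror the naming pattern above.

```python
import tempfile
from pathlib import Path


def next_free_path(out_dir, start=0, prefix="speedbump", ext=".jpg"):
    """Return (index, path) for the first unused sequential name in out_dir."""
    idx = start
    while (Path(out_dir) / f"{prefix}_{idx:05d}{ext}").exists():
        idx += 1
    return idx, Path(out_dir) / f"{prefix}_{idx:05d}{ext}"


# Demo: two names are already taken, so the next free index is 2
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "speedbump_00000.jpg").touch()
    (Path(tmp) / "speedbump_00001.jpg").touch()
    idx, path = next_free_path(tmp)
    free_name = path.name

print(idx, free_name)  # prints: 2 speedbump_00002.jpg
```

Passing the returned index plus one as `start` on the next call avoids rescanning names that are already known to be taken.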

The script also maintains counters for successfully processed and skipped
images. At the end of execution, a final summary is printed showing the total
number of processed and skipped images.
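
A single `skipped` counter does not say why an image was skipped. One possible refinement, sketched here with hypothetical outcome labels (`"ok"`, `"unreadable"`, `"exists"`) that the notebook does not define, is to record one label per file and tally them with `collections.Counter`.

```python
from collections import Counter


def summarize(outcomes):
    """Return (processed, skipped, reasons) from a list of outcome labels."""
    counts = Counter(outcomes)
    processed = counts.pop("ok", 0)   # everything else counts as skipped
    skipped = sum(counts.values())
    return processed, skipped, dict(counts)


# Example outcomes for five files (made-up data for illustration only)
outcomes = ["ok", "ok", "unreadable", "ok", "exists"]
processed, skipped, reasons = summarize(outcomes)

print(f"Total processed: {processed}")  # prints: Total processed: 3
print(f"Total skipped  : {skipped}")    # prints: Total skipped  : 2
print(reasons)
```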

This preprocessing step ensures that the dataset is clean, uniformly sized,
and properly structured for efficient LoRA-based fine-tuning of Stable
Diffusion.