## Testing and CLIP-Based Evaluation of LoRA Fine-Tuned Stable Diffusion

This notebook runs inference with the LoRA fine-tuned Stable Diffusion model
and evaluates the generated images with the CLIP score. Twenty
speed-bump-focused prompts, chosen to stay within the training distribution,
each produce one image. The images are saved to disk, and the CLIP text–image
cosine similarity is computed for each prompt–image pair to quantify semantic
alignment.

In [3]:
import torch
from diffusers import StableDiffusionPipeline
from peft import PeftModel
from pathlib import Path
import json
from PIL import Image
import clip
import warnings
warnings.filterwarnings("ignore")

In [4]:
# PATHS 
base_model = "runwayml/stable-diffusion-v1-5"

LORA_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\model\lora_peft_checkpoint")
CLIP_OUT_PATH = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\results\Metrics_json")
OUTPUT_DIR = Path(r"D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\results\After")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

In [5]:
# DEVICE
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print("Using dtype:", dtype)

Using dtype: torch.float16


In [6]:
# LOAD BASE SD
pipe = StableDiffusionPipeline.from_pretrained(
    base_model,
    torch_dtype=dtype,
    safety_checker=None
).to(device)

Loading pipeline components...: 100%|██████████| 6/6 [00:05<00:00,  1.03it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .


In [7]:
# LOAD & MERGE LORA 
pipe.unet = PeftModel.from_pretrained(pipe.unet, LORA_DIR).merge_and_unload()
pipe.set_progress_bar_config(disable=False)

In [8]:
# MULTIPLE TEST PROMPTS 
prompts = [
    "a close-up of a speed bump on an asphalt road surface, realistic photo",
    "a detailed view of a speed bump on an asphalt road, realistic photo",
    "a wide view of a speed bump on a residential asphalt road, realistic photo",
    "a ground-level view of a yellow and black speed bump on an asphalt street",
    "a clear view of a speed bump on a paved asphalt road, realistic photo",

    "a speed bump on a wet asphalt road after rain, realistic photo",
    "a speed bump on a dry asphalt road in bright daylight, realistic photo",
    "a speed bump on an asphalt road under cloudy lighting, realistic photo",
    "a speed bump on an asphalt road in soft evening light, realistic photo",
    "a speed bump on an asphalt road under street lighting at night, realistic photo",

    "a newly painted yellow and black speed bump on an asphalt road, realistic photo",
    "a slightly worn speed bump on an asphalt street, realistic photo",
    "a faded yellow speed bump on an asphalt road surface, realistic photo",

    "a speed bump near a pedestrian crossing on an asphalt road, realistic photo",
    "a speed bump in a residential asphalt street, realistic photo",
    "a speed bump in a parking area on asphalt, realistic photo",

    "a speed bump on a narrow asphalt road, realistic photo",
    "a speed bump on a straight asphalt road, realistic photo",
    "a speed bump on an empty asphalt road, realistic photo",
    "a speed bump on an urban asphalt street, realistic photo"
]
negative_prompt = (
    "blurry, low resolution, bad anatomy, distorted, deformed, extra objects, "
    "cartoon, anime, painting, illustration, unrealistic, fake, CGI, pothole, "
    "flat road, smooth road surface, no speed bump"
)

In [9]:
# IMAGE GENERATION LOOP 
generated_images = []
for idx, prompt in enumerate(prompts):
    print(f"Generating {idx+1}/{len(prompts)}")

    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=30,
        guidance_scale=7.5,
        height=512,
        width=512
    ).images[0]

    out_path = OUTPUT_DIR / f"lora_result_{idx:03d}.png"
    image.save(out_path)
    generated_images.append(out_path)
# Display the first generated image (reload it from disk; `image` holds the last one)
Image.open(generated_images[0]).show()

Generating 1/20


100%|██████████| 30/30 [00:08<00:00,  3.73it/s]


Generating 2/20


100%|██████████| 30/30 [00:07<00:00,  4.03it/s]


Generating 3/20


100%|██████████| 30/30 [00:07<00:00,  4.13it/s]


Generating 4/20


100%|██████████| 30/30 [00:07<00:00,  4.12it/s]


Generating 5/20


100%|██████████| 30/30 [00:07<00:00,  4.06it/s]


Generating 6/20


100%|██████████| 30/30 [00:07<00:00,  4.14it/s]


Generating 7/20


100%|██████████| 30/30 [00:07<00:00,  4.13it/s]


Generating 8/20


100%|██████████| 30/30 [00:07<00:00,  4.16it/s]


Generating 9/20


100%|██████████| 30/30 [00:07<00:00,  4.16it/s]


Generating 10/20


100%|██████████| 30/30 [00:07<00:00,  4.14it/s]


Generating 11/20


100%|██████████| 30/30 [00:07<00:00,  4.15it/s]


Generating 12/20


100%|██████████| 30/30 [00:07<00:00,  4.15it/s]


Generating 13/20


100%|██████████| 30/30 [00:07<00:00,  4.12it/s]


Generating 14/20


100%|██████████| 30/30 [00:07<00:00,  4.14it/s]


Generating 15/20


100%|██████████| 30/30 [00:07<00:00,  4.15it/s]


Generating 16/20


100%|██████████| 30/30 [00:07<00:00,  4.13it/s]


Generating 17/20


100%|██████████| 30/30 [00:07<00:00,  4.13it/s]


Generating 18/20


100%|██████████| 30/30 [00:07<00:00,  4.13it/s]


Generating 19/20


100%|██████████| 30/30 [00:07<00:00,  4.16it/s]


Generating 20/20


100%|██████████| 30/30 [00:07<00:00,  4.14it/s]


### CLIP Score Evaluation

In [10]:
# Load CLIP
model, preprocess = clip.load("ViT-B/32", device=device)
clip_scores = []
for idx, prompt in enumerate(prompts):
    img_path = OUTPUT_DIR / f"lora_result_{idx:03d}.png"
    image = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)

        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        similarity = (image_features @ text_features.T).item()
    clip_scores.append(similarity)
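# Save scores (create the metrics directory first if it does not exist yet)
CLIP_OUT_PATH.mkdir(parents=True, exist_ok=True)
with open(CLIP_OUT_PATH / "clip_scores.json", "w") as f:
    json.dump(clip_scores, f)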

print("CLIP scores saved to:", CLIP_OUT_PATH)
print("Average CLIP score:", sum(clip_scores) / len(clip_scores))

CLIP scores saved to: D:\work_space\projects\deep_learning\CAP6415_F25_project-Finding-and-solving-hard-to-generate-examples\results\Metrics_json
Average CLIP score: 0.29957275390625


### Code Explanation 

This notebook loads the base Stable Diffusion v1.5 model and merges the trained
LoRA adapter into the UNet weights with PEFT's `merge_and_unload()`, so
inference runs on a plain UNet without a separate adapter path. The device
(CUDA if available, otherwise CPU) and the numerical precision (float16 on GPU,
float32 on CPU) are selected automatically.
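
To make the merge step concrete, the sketch below folds a low-rank update into a
toy weight matrix in plain Python. It assumes the standard LoRA formulation, in
which the merged weight is `W + (alpha / r) * (B @ A)`. The matrices, the values
of `alpha` and `r`, and the helper names `matmul` and `merge_lora` are all
hypothetical and are not taken from the trained adapter or from PEFT's source.

```python
# Toy illustration of folding a LoRA update into a base weight matrix.
# All names and numbers are hypothetical; they do not come from the trained adapter.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), i.e. the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]     # base weight, shape (out=2, in=3)
A = [[0.5, -0.5, 1.0]]     # LoRA down-projection, shape (r=1, in=3)
B = [[2.0], [4.0]]         # LoRA up-projection, shape (out=2, r=1)
alpha, r = 2, 1
x = [[1.0], [2.0], [3.0]]  # input column vector, shape (in=3, 1)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix multiply with the folded weight.
W_merged = merge_lora(W, A, B, alpha, r)
merged = matmul(W_merged, x)

print(merged)    # [[17.0], [19.0]]
print(unmerged)  # [[17.0], [19.0]]
```

Both paths give the same output, which is why a merged model can drop the
adapter branch at inference time.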

Twenty speed-bump-focused evaluation prompts are defined along with a negative
prompt that discourages visual artifacts, non-photorealistic styles, and scenes
without a speed bump. For each prompt, the fine-tuned model generates one
512×512 image (30 inference steps, guidance scale 7.5), which is saved to disk
for evaluation.

For quantitative analysis, the pretrained CLIP model (ViT-B/32) is used to
measure text–image alignment. Each generated image and its corresponding
prompt are encoded using CLIP, and the cosine similarity between their
embeddings is computed as the CLIP score.
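
The score computation reduces to normalizing both embeddings to unit length and
taking their dot product. The sketch below shows this on short hand-written
vectors using only the standard library. The vectors and the helper names
`l2_normalize` and `clip_style_score` are hypothetical; real CLIP ViT-B/32
embeddings have 512 dimensions and come from the model's encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def clip_style_score(image_vec, text_vec):
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    img = l2_normalize(image_vec)
    txt = l2_normalize(text_vec)
    return sum(i * t for i, t in zip(img, txt))

# Hypothetical 3-dimensional stand-ins for an image and a text embedding.
image_vec = [3.0, 4.0, 0.0]
text_vec = [4.0, 3.0, 0.0]

score = clip_style_score(image_vec, text_vec)
print(round(score, 4))  # 0.96
```

The result always lies in [-1, 1], and it does not depend on the length of
either input vector, only on the angle between them.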

All CLIP scores are saved to `clip_scores.json`, and their average (about 0.30
in this run) is reported as an overall measure of how well the fine-tuned model
aligns with the given text prompts.
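
A bare list of scores loses the link between each score and its prompt. One
possible extension, sketched below, stores each prompt next to its score
together with a few summary statistics. The helper `build_report`, the file name
`clip_scores_report.json`, and the example values are hypothetical and are not
part of the notebook; the sketch writes to a temporary directory rather than to
`CLIP_OUT_PATH`.

```python
import json
import statistics
import tempfile
from pathlib import Path

def build_report(prompts, scores):
    """Pair each prompt with its score and add summary statistics."""
    if len(prompts) != len(scores):
        raise ValueError("prompts and scores must have the same length")
    return {
        "per_prompt": [
            {"prompt": p, "clip_score": s} for p, s in zip(prompts, scores)
        ],
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical placeholder data, not results from this notebook.
example_prompts = ["prompt a", "prompt b", "prompt c"]
example_scores = [0.25, 0.5, 0.75]

report = build_report(example_prompts, example_scores)

# Write to a temporary directory so the sketch does not touch project files.
out_file = Path(tempfile.mkdtemp()) / "clip_scores_report.json"
with open(out_file, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)

# Read the file back to confirm it round-trips.
with open(out_file, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded["mean"])  # 0.5
```

Keeping the minimum and maximum alongside the mean makes it easier to spot a
single poorly aligned prompt that the average would hide.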
