# Ultra Turbo Anime Generator

This notebook implements a **fully documented**, production‑ready image generation system
using **SD‑Turbo**, **SDXL‑Turbo**, **ControlNet**, **multiple LoRA merge modes**, **batching**,  
and **benchmarking**.

## 1. Introduction
This notebook demonstrates an optimized workflow for real‑time or near‑real‑time image synthesis.
It uses both **SD‑Turbo** and **SDXL‑Turbo**, distilled diffusion models designed to run in very few inference steps (typically 1–4).
SD‑Turbo focuses on speed, while SDXL‑Turbo delivers higher detail at moderate extra cost.
We integrate them into a single pipeline so that each generation config selects which engine to use.

A flexible LoRA system is included, supporting both **additive merge** (fast, stable)
and **sequential merge** (richer style influence).

In [12]:
import glob
import os
import time

import cv2
import numpy as np
import torch
from PIL import Image
from dataclasses import dataclass
from typing import List, Optional
from concurrent.futures import ThreadPoolExecutor
from diffusers import AutoPipelineForImage2Image, StableDiffusionXLImg2ImgPipeline, ControlNetModel
import safetensors.torch

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device

'mps'

## 2. Loading Input Images
We load input images from a folder and apply light preprocessing (resize, RGB conversion).
This ensures consistent shapes for batching and reduces GPU overhead.
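We use a thread pool because loading is dominated by file I/O and by Pillow's C‑level decoding and resizing, which largely run outside the Python interpreter lock, so several files can be processed concurrently.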

In [13]:
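def load_image(path, size):
    # Return a square RGB image, or None if the file cannot be read as an image.
    try:
        return Image.open(path).convert('RGB').resize((size, size))
    except (OSError, ValueError):
        return None

def load_images(folder='./assets/example_inputs', size=512, workers=8):
    paths = sorted(glob.glob(os.path.join(folder, '*')))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # ex.map preserves the order of `paths`.
        imgs = list(ex.map(lambda p: load_image(p, size), paths))
    # Drop entries that failed to load.
    return [i for i in imgs if i is not None]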

inputs = load_images()
len(inputs)

5
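
Because the glob pattern above matches every entry in the folder, non-image files are only rejected once Pillow fails to open them. A minimal sketch of filtering by extension before loading is shown here; the helper name `list_image_paths` and the extension set `IMAGE_EXTS` are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path

# Hypothetical extension whitelist; extend it to match your assets.
IMAGE_EXTS = {'.png', '.jpg', '.jpeg', '.webp', '.bmp'}

def list_image_paths(folder, exts=IMAGE_EXTS):
    """Return sorted paths of regular files in `folder` whose suffix looks like an image."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in exts
    )
```

The resulting list could replace `paths` inside `load_images`, which keeps the `None` filtering as a second line of defence against corrupt files.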

## 3. Model Loading (SD‑Turbo + SDXL‑Turbo)
We load both pipelines:
- **SD‑Turbo** (`stabilityai/sd-turbo`): the faster of the two, used here at 512 px.
- **SDXL‑Turbo** (`stabilityai/sdxl-turbo`): higher quality at higher cost. Its model card describes 512 px output, so treat 1024 px as experimental.

Each generation config chooses between them through its `use_turbo` flag.

In [14]:
def build_turbo():
    pipe = AutoPipelineForImage2Image.from_pretrained('stabilityai/sd-turbo', torch_dtype=torch.float16).to(device)
    return pipe

def build_turboxl():
    pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained('stabilityai/sdxl-turbo', torch_dtype=torch.float16).to(device)
    return pipe

pipe_turbo = build_turbo()
pipe_xl = build_turboxl()

Loading pipeline components...: 100%|██████████| 5/5 [00:06<00:00,  1.32s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.StableDiffusionImg2ImgPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Loading pipeline components...: 100%|██████████| 7/7 [00:18<00:00,  2.63s/it]
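
Loading both pipelines up front keeps two sets of weights in memory even when a run only uses one. One possible alternative is to build each pipeline on first use. The sketch below shows that pattern with a hypothetical `LazyPipelines` class and stand-in builder functions, so it runs without downloading any model.

```python
from typing import Any, Callable, Dict

class LazyPipelines:
    """Build each pipeline the first time it is requested, then reuse it."""

    def __init__(self, builders: Dict[str, Callable[[], Any]]):
        self._builders = builders
        self._cache: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        if name not in self._cache:
            # The expensive load happens here, once per name.
            self._cache[name] = self._builders[name]()
        return self._cache[name]

# Stand-in builders for illustration; in the notebook these would be
# build_turbo and build_turboxl.
pipes = LazyPipelines({
    'turbo': lambda: 'sd-turbo pipeline',
    'xl': lambda: 'sdxl-turbo pipeline',
})
selected = pipes.get('turbo')  # only the 'turbo' builder runs
```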


## 4. ControlNet Integration
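We load a **Canny ControlNet** to reinforce outlines and improve style consistency.
Canny edge maps are extracted from the input images so they can be passed as control conditions.
The generation loop in Section 8 does not pass them to the pipeline, so they only take effect once a ControlNet‑enabled pipeline is used.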

In [15]:
controlnet = ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-canny',
                                            torch_dtype=torch.float16).to(device)

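def make_canny(img):
    # Detect edges on a grayscale copy so each image yields one edge map.
    gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # lower / upper hysteresis thresholds
    # Replicate to 3 channels, the layout commonly used for conditioning images.
    return Image.fromarray(np.stack([edges] * 3, axis=-1))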

canny_maps = [make_canny(i) for i in inputs]

## 5. Multi‑LoRA System
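LoRA weights are applied in one of two modes:
- **Additive merge**: all LoRA deltas are combined and applied to the base weights together. Fast and stable.
- **Sequential merge**: LoRAs are applied one after another, each merged into weights already modified by the previous one. Intended for richer stylistic effects.

Any number of LoRAs can be listed in `lora_paths`. A LoRA only fits the base architecture it was trained for, so each file should match the pipeline it is applied to.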

In [16]:
lora_paths = ['./assets/models/lora/add_detail.safetensors']  # can add more

def apply_lora_add(pipe, paths, scale=1.0):
    # Additive merge: load every LoRA as a named adapter, then activate them
    # together so their contributions are summed at inference time.
    # Relies on the diffusers LoRA loader (named adapters need the `peft` package).
    names = []
    for i, pt in enumerate(paths):
        name = f'lora_{i}'
        pipe.load_lora_weights(pt, adapter_name=name)
        names.append(name)
    if names:
        pipe.set_adapters(names, adapter_weights=[scale] * len(names))

def apply_lora_seq(pipe, paths, scale=1.0):
    # Sequential merge: fuse each LoRA into the base weights before loading the next.
    for pt in paths:
        pipe.load_lora_weights(pt)
        pipe.fuse_lora(lora_scale=scale)
        pipe.unload_lora_weights()  # drop the adapter; the fused weights stay
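
To make the arithmetic behind the two modes concrete, here is a small worked example in plain Python. It assumes the common LoRA formulation in which each adapter contributes a low-rank delta, `W' = W + scale * (up @ down)`, and it omits the alpha/rank scaling that real LoRA files carry. The matrices and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_delta(up, down):
    """Low-rank update: (out x r) @ (r x in) -> (out x in)."""
    return matmul(up, down)

W = [[1.0, 0.0], [0.0, 1.0]]              # toy base weight
lora_a = ([[1.0], [0.0]], [[0.0, 2.0]])   # (up, down), rank 1
lora_b = ([[0.0], [1.0]], [[3.0, 0.0]])

# Additive: sum all deltas first, then add them to the base once.
total = add(lora_delta(*lora_a), lora_delta(*lora_b))
W_add = add(W, total, scale=0.5)

# Sequential: merge one LoRA, then the next, into the running weights.
W_seq = add(add(W, lora_delta(*lora_a), scale=0.5), lora_delta(*lora_b), scale=0.5)

print(W_add)
print(W_seq)
```

Under this formulation merging is a sum of deltas, so the order does not change the merged weights. Any visible difference between the modes would have to come from how or when the adapters are applied, not from the arithmetic itself.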

## 6. Unified Generation Configuration
Each generation task decides:
- Which model (Turbo or XL)
- Which LoRA merge mode (additive or sequential)
- Rendering parameters (steps, guidance, strength)
- Random seed

In [17]:
@dataclass
class GenCfg:
    prompt: str
    negative: str
    steps: int
    strength: float
    guidance: float
    use_turbo: bool
    lora_mode: str    # 'add' or 'seq'
    seed: int

cfg = GenCfg(
    prompt='high-quality anime portrait',
    negative='distorted, blur',
    steps=4,
    strength=0.85,
    guidance=1.0,     # diffusers enables classifier-free guidance only above 1, so `negative` has no effect here
    use_turbo=True,
    lora_mode='seq',
    seed=1234
)
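
In an img2img call, `steps` and `strength` interact: only part of the schedule is run. The sketch below assumes the rule used by the diffusers img2img pipelines, `min(int(steps * strength), steps)`. The function name `effective_steps` is hypothetical.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps an img2img call actually runs, under the assumed rule
    min(int(steps * strength), steps)."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(4, 0.85))  # 3
print(effective_steps(4, 0.2))   # 0
```

With `steps=4` and `strength=0.85` this gives 3, which matches the `3/3` progress bars in the Section 8 output. A result of 0 leaves nothing to denoise, so `steps * strength` should be at least 1.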

## 7. Batching for Speed
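Batching processes multiple images in a single pipeline call,
reducing per‑call overhead and raising throughput.
The feasible batch size depends on GPU memory; Turbo models are lightweight
and allow medium‑sized batches even on mid‑range GPUs.
Here the batch is simply every loaded input, and the benchmark in Section 8 runs them one at a time to record per‑image latency.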

In [18]:
BATCH = len(inputs)
batch_imgs = inputs[:BATCH]
batch_canny = canny_maps[:BATCH]
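
When the full input set does not fit in memory, it can be split into fixed-size batches. This is a minimal sketch with a hypothetical helper, `chunked`; the batch size of 2 is arbitrary.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar('T')

def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError('size must be >= 1')
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# Placeholder names stand in for the PIL images in `inputs`.
for batch in chunked(['img0', 'img1', 'img2', 'img3', 'img4'], 2):
    print(batch)
```

Each batch could then be handed to a single pipeline call. diffusers pipelines generally accept a list of images with a matching list of prompts, but this should be checked against the installed version.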

## 8. Benchmarking System
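We record per‑image latency and compute the mean.
Turbo models are expected to reach **tens of milliseconds per image** on a modern CUDA GPU.
The run below used the `mps` device and took about 1.6–1.8 s per image after a first call of about 12.9 s, which includes one‑time warm‑up and inflates the mean.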

In [19]:
# Select the engine first so LoRAs are applied only to the pipeline that runs;
# a LoRA file fits one base architecture, not both SD-Turbo and SDXL-Turbo.
pipe = pipe_turbo if cfg.use_turbo else pipe_xl

# Apply LoRAs
if cfg.lora_mode == 'add':
    apply_lora_add(pipe, lora_paths)
else:
    apply_lora_seq(pipe, lora_paths)

times = []
outputs = []

for i, img in enumerate(batch_imgs):
    # A distinct seed per image keeps each result reproducible.
    gen = torch.Generator(device=device).manual_seed(cfg.seed+i)
    t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()

    out = pipe(prompt=cfg.prompt,
               negative_prompt=cfg.negative,
               image=img,
               strength=cfg.strength,
               num_inference_steps=cfg.steps,
               guidance_scale=cfg.guidance,
               generator=gen).images[0]

    dt = (time.perf_counter() - t0) * 1000  # milliseconds
    times.append(dt)
    outputs.append(out)

print("Per-image times (ms):", times)
print("Average (ms):", sum(times)/len(times))


100%|██████████| 3/3 [00:01<00:00,  1.95it/s]
100%|██████████| 3/3 [00:00<00:00,  3.86it/s]
100%|██████████| 3/3 [00:00<00:00,  4.00it/s]
100%|██████████| 3/3 [00:00<00:00,  3.88it/s]
100%|██████████| 3/3 [00:00<00:00,  4.34it/s]


Per-image times (ms): [12873.667001724243, 1770.9660530090332, 1807.2218894958496, 1773.2958793640137, 1646.7430591583252]
Average (ms): 3974.378776550293
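
The first call is much slower than the rest, so the plain mean overstates steady-state latency. The sketch below summarises the timings printed above while skipping warm-up calls; the helper name `summarize` and the choice of one warm-up call are assumptions.

```python
import statistics

# Per-image times (ms) copied from the output above.
times_ms = [12873.667001724243, 1770.9660530090332, 1807.2218894958496,
            1773.2958793640137, 1646.7430591583252]

def summarize(times, warmup=1):
    """Report the first-call latency separately from steady-state statistics."""
    steady = times[warmup:] or times  # fall back if every call counts as warm-up
    return {
        'first_ms': times[0],
        'steady_mean_ms': statistics.mean(steady),
        'steady_median_ms': statistics.median(steady),
    }

for key, value in summarize(times_ms).items():
    print(f'{key}: {value:.1f}')
```

For this run the steady-state mean is roughly 1.75 s per image, against the 3.97 s average reported when the warm-up call is included.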


## 9. Generated Outputs

In [20]:
outputs

[<PIL.Image.Image image mode=RGB size=512x512>,
 <PIL.Image.Image image mode=RGB size=512x512>,
 <PIL.Image.Image image mode=RGB size=512x512>,
 <PIL.Image.Image image mode=RGB size=512x512>,
 <PIL.Image.Image image mode=RGB size=512x512>]

In [21]:
os.makedirs('./assets/example_outputs', exist_ok=True)
for i, out in enumerate(outputs):
    out.save(f'./assets/example_outputs/output_{i+1:03d}.png')