
VAE tiling may cause image interference between generations #1603

Closed

Opened by @wbruna · cross-referenced in leejet/stable-diffusion.cpp#703

Description


With the default VAE tiling, some image generations can seemingly be corrupted by previously generated images.

A test with plain text2img, model cyberrealisticPony_semiRealV30, DMD LoRA at 1.0, LCM 8 steps, CFG 1, Seed 1, 1024x640, VAE tiling on, in sequence (no batch, just clicking Generate repeatedly):

  • prompt: "car", 5 generations:

Image

  • prompt: "forest", 1 generation:

Image

  • prompt: "forest", 2 generations:

Image

  • prompt: "forest", 1 generation:

Image

These were generated on the ROCm build a7706be, but I see the same behavior with Vulkan on the 1.93.2 build.

I can't reproduce it when disabling VAE tiling, so this could be related to leejet/stable-diffusion.cpp#588 (though I get a warning "Requested buffer size (4362076160) exceeds device memory allocation limit (4294967292)!" when disabling VAE tiling, so I don't know if I can really trust this test).

Activity

LostRuins (Owner) commented on Jun 15, 2025

Yes, this is a known issue that seems hardware specific.

Image

Image

Goes back to at least 1.78, months ago. We were never able to find out why, but not everyone gets it.

What are your hardware and system specs?

wbruna (Author) commented on Jun 16, 2025

Yeah, I was able to hit this bug with @stduhpf's simple web server too, so those VAE fixes are not enough.

On the other hand, I need just three or four attempts to reproduce it with VAE (or TAESD) tiling, while I can't reproduce it at all with normal VAE.

> What are your hardware and system specs?

On Vulkan: AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

Happens on ROCm too. The specs: Device 0: AMD Radeon RX 7600 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32, running with HSA_OVERRIDE_GFX_VERSION=11.0.0.

16G VRAM, 40G RAM, Linux kernel 6.12.27, amdgpu from mainline. Nothing else running on the card (display is on the iGPU).

LostRuins (Owner) commented on Jun 17, 2025

pararace was using a 4060 Ti, so it's not AMD-specific.

stduhpf commented on Jun 17, 2025

@wbruna Can you give me steps to reproduce it with my server? I can't make it happen.

Also does stduhpf/stable-diffusion.cpp@e201588 fix it?

wbruna (Author) commented on Jun 17, 2025

> Can you give me steps to reproduce it with my server? I can't make it happen.

For me, it is (or was) enough to render a largish image (like 1024x576) repeatedly, with VAE or TAESD tiling, to eventually hit it.

> Also does stduhpf/stable-diffusion.cpp@e201588 fix it?

Interesting. That may have fixed it, thanks! I didn't hit the bug again for some 30 or 40 renders. So perhaps it's that zero-filling that's hardware- or system-specific? But why would that only affect tiling...

There may be another initialization bug lurking somewhere else. Running a generation a second time (same seed, and a non-random sampler), the second image changes very slightly in a few details (easy to see on the simple server, because it keeps showing the previous image until the new one is displayed in its place). After that, new renders are repetitions of that second one.

stduhpf commented on Jun 17, 2025

> But why would that only affect tiling...

Tiling works by adding the decoded tiles to the output buffer one after the other, rather than completely overwriting the content. This is done to be able to smoothly blend between neighboring tiles. So it makes sense that if the output buffer is not empty at the beginning, whatever was in there will interfere with the decoded image.

What's stranger to me is why it isn't happening to everyone more often, because I've used my server a lot and never had this issue. I was even sure that creating a new tensor would automatically initialize it to 0, but maybe its contents are just undefined.

wbruna (Author) commented on Jun 17, 2025

> But why would that only affect tiling...

> Tiling works by adding the decoded tiles to the output buffer one after the other, rather than completely overwriting the content. This is done to be able to smoothly blend between neighboring tiles. So it makes sense that if the output buffer is not empty at the beginning, whatever was in there will interfere with the decoded image.

Yeah, that makes sense.

> What's stranger to me is why it isn't happening to everyone more often? Because I used my server a lot and never had this issue. I was even sure that creating a new tensor would automatically initialize it to 0, but maybe it's just undefined behavior.

Looking at ggml.c, the memory-pool initialization ends up just calling malloc. So, at least on Linux, it depends on glibc's policy for that pool size: the memory could come directly from mmap (in which case the OS zeroes it out), or from the process heap (in which case it may reuse a previously freed area that still holds old data). If, for instance, the system is configured to return memory to the OS aggressively, it could avoid reusing the process heap for that pool entirely.

wbruna (Author) commented on Jun 19, 2025

@LostRuins , I'm reusing this VAE-related issue to avoid polluting the Chroma PR.

Koboldcpp 924dfa7, with model Fluently V4 LCM, seed 2, prompt "clear blue sky, few clouds", width 960, height 640, 10 steps, VAE tiling:

Image

That darker "band" (or a lighter one) should be very noticeable with pretty much any generation with a 960-pixel side and areas of uniform color.

Same parameters on sd.cpp, with the proposed fix for leejet/stable-diffusion.cpp#588:

Image

For comparison, the version without tiling:

Image

Of course, the "fixed" image still shows a few visible banding artifacts; I've chosen kind of a worst case for tiling, just to make the effect more evident.

added a commit that references this issue on Jun 20, 2025
LostRuins (Owner) commented on Jun 20, 2025

Fair enough, merged the fix. Tested it, seems to work.

LostRuins (Owner) commented on Jun 21, 2025

Please try v1.94, which should include the fix.

wbruna (Author) commented on Jun 21, 2025

1.94 seems to be working fine! No noisy or ghost images, no dark band, for some 30 generations. Tested with both Vulkan and ROCm. Thanks @LostRuins and @stduhpf!

