@wbruna wbruna commented Sep 13, 2025

I fixed the tiled parameter processing to limit both direct dimensions and relative factors, and added limits to the overlap factor (as I recall, we shouldn't allow an overlap factor larger than 0.5).

I also noticed the tile size bump for the encoding path wasn't covered by that limit, so I pulled it into the auxiliary function too.

And since different `rel_size` ranges already have different effects on the calculation, I replaced the explicit relative boolean with implicit `rel_size > 0` tests.
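
To make the description above concrete, here is a minimal sketch of that kind of validation. It is not the code from this PR: the helper names, the default tile size of 32 and the minimum of 4 are assumptions made for illustration, and `rel_size` is treated simply as a fraction of the latent dimension.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical constants; the real defaults are not stated in this thread.
constexpr int kDefaultTile = 32;
constexpr int kMinTile     = 4;

// Keep the target overlap factor inside [0, 0.5].
float clamp_overlap(float target) {
    return std::max(0.0f, std::min(target, 0.5f));
}

// Resolve one tile dimension (in latent units).
// rel_size > 0 selects the relative path; otherwise a valid direct size is used,
// and anything else falls back to the default.
int resolve_tile_size(int requested, float rel_size, int latent_size, float encode_bump = 1.0f) {
    int tile = kDefaultTile;
    if (rel_size > 0.0f) {
        tile = static_cast<int>(std::round(latent_size * rel_size));
    } else if (requested >= kMinTile) {
        tile = requested;
    }
    // Apply the encoding-path bump before clamping, so the bumped size is limited too.
    tile = static_cast<int>(tile * encode_bump);
    // Never exceed the latent dimension, never drop below the minimum.
    return std::max(std::min(tile, latent_size), kMinTile);
}
```

With a latent dimension of 104, `resolve_tile_size(512, 0.0f, 104)` is clamped to 104, and `resolve_tile_size(96, 0.0f, 104, 1.5f)` is also clamped to 104 because the bump is applied before the limit.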

Author

wbruna commented Sep 13, 2025

And it's not a problem with the PR itself, but another thing I've noticed:

[INFO ] stable-diffusion.cpp:2154 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:1493 - VAE Tile size: 48x48
[DEBUG] ggml_extend.hpp:811  - num tiles : 3, 5 
[DEBUG] ggml_extend.hpp:812  - optimal overlap : 0.416667, 0.500000 (targeting 0.500000)
[DEBUG] ggml_extend.hpp:845  - tile work buffer size: 1.72 MB
[INFO ] ggml_extend.hpp:858  - processing 15 tiles
[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)
  |===>                                              | 1/15 - 2.07it/s[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)
  |======>                                           | 2/15 - 2.11it/s[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)
  |==========>                                       | 3/15 - 2.11it/s[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)
  |=============>                                    | 4/15 - 2.12it/s[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)
  |================>                                 | 5/15 - 2.12it/s[DEBUG] ggml_extend.hpp:1540 - vae compute buffer size: 360.04 MB(VRAM)

We don't seem to be reusing the context across tiles.

It looks like that'd be controlled by the `free_compute_buffer_immediately` boolean from the `GGMLRunner::compute` function, but we actually call the VAE through `AutoEncoderKL::compute`, so we'd need to either add a boolean to that overload, or fix it to `false` and free the buffers explicitly at each call site. Am I on the right track? :-)
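
As a rough illustration of the second option (keep the buffer across tiles and free it once at the call site), here is a toy sketch. `ToyRunner` and `run_tiles` are hypothetical stand-ins, not the `GGMLRunner` or `AutoEncoderKL` API; the sketch only counts how often the scratch buffer gets allocated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical buffer size, for illustration only.
constexpr std::size_t kToyBufferBytes = 1024;

// Toy stand-in for a runner that owns a scratch compute buffer.
// This is NOT the GGMLRunner API; every name here is an assumption.
class ToyRunner {
public:
    int allocations = 0;

    void compute(bool free_buffer_immediately) {
        if (buffer_.empty()) {
            buffer_.resize(kToyBufferBytes);  // allocate only when no buffer is alive
            ++allocations;
        }
        // ... per-tile work using buffer_ would go here ...
        if (free_buffer_immediately) {
            free_compute_buffer();
        }
    }

    void free_compute_buffer() {
        std::vector<unsigned char>().swap(buffer_);  // actually release the memory
    }

private:
    std::vector<unsigned char> buffer_;
};

// Process all tiles, then free the buffer once at the call site.
// Returns how many times the buffer had to be allocated.
int run_tiles(ToyRunner& runner, int num_tiles, bool free_each_time) {
    for (int i = 0; i < num_tiles; ++i) {
        runner.compute(free_each_time);
    }
    runner.free_compute_buffer();
    return runner.allocations;
}
```

In this toy model, 15 tiles with `free_each_time = true` cause 15 allocations, while `free_each_time = false` causes a single one. Whether the real runner behaves the same way would need to be checked against its implementation.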

Owner

stduhpf commented Sep 13, 2025

> It looks like that'd be controlled by the `free_compute_buffer_immediately` boolean from the `GGMLRunner::compute` function, but we actually call the VAE through `AutoEncoderKL::compute`, so we'd need to either add a boolean to that overload, or fix it to `false` and free the buffers explicitly at each call site. Am I on the right track? :-)

Yes, I think so too. It may be worth investigating in a separate PR; it might make VAE tiling a bit faster.

Owner

stduhpf commented Sep 13, 2025

Did a few tests, LGTM

@stduhpf stduhpf merged commit 2995c92 into stduhpf:tiled-vae-encode Sep 13, 2025
9 checks passed
Author

wbruna commented Sep 13, 2025

I just noticed the decoding tile size became too big when I used a relative factor 😕

W and H are being multiplied by 8 at the beginning of decode_first_stage, so:

diff --git a/stable-diffusion.cpp b/stable-diffusion.cpp
index 9085bd3..b1fd72d 100644
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
@@ -1488,7 +1492,7 @@ public:
         if (!use_tiny_autoencoder) {
             float tile_overlap;
             int tile_size_x, tile_size_y;
-            get_tile_sizes(tile_size_x, tile_size_y, tile_overlap, vae_tiling_params, W, H);
+            get_tile_sizes(tile_size_x, tile_size_y, tile_overlap, vae_tiling_params, W / 8, H / 8);
 
             LOG_DEBUG("VAE Tile size: %dx%d", tile_size_x, tile_size_y);

Or maybe we should use x->ne instead, since the latent size factor is different for Wan?
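
A small sketch of why the units matter here, under stated assumptions: `tile_from_factor` is a hypothetical helper that treats the relative factor as a plain fraction of one dimension, `ToyTensor` is a minimal stand-in for a tensor with an `ne` shape array, and the 832-pixel width (104 in latent units at a factor of 8) is an illustrative value.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: tile size as a fraction of one dimension.
int tile_from_factor(int dim, float rel_size) {
    return static_cast<int>(std::round(dim * rel_size));
}

// Minimal stand-in for a tensor shape; only ne[0] (width) is used here.
struct ToyTensor {
    std::int64_t ne[4];
};

// Read the latent width from the tensor itself instead of dividing the
// pixel width by a hard-coded factor, so a different latent scale still works.
int latent_width(const ToyTensor& x) {
    return static_cast<int>(x.ne[0]);
}
```

With a relative factor of 0.5, `tile_from_factor(832, 0.5f)` gives 416, which is far larger than the 104-wide latent, while `tile_from_factor(832 / 8, 0.5f)` and `tile_from_factor(latent_width(x), 0.5f)` for a tensor with `ne[0] == 104` both give 52.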

stduhpf added a commit that referenced this pull request Sep 19, 2025
* implement  tiling vae encode support

* Tiling (vae/upscale): adaptative overlap

* Tiling: fix edge case

* Tiling: fix crash when less than 2 tiles per dim

* remove extra dot

* Tiling: fix edge cases for adaptative overlap

* tiling: fix edge case

* set vae tile size via env var

* vae tiling: refactor again, base on smaller buffer for alignment

* Use bigger tiles for encode (to match compute buffer size)

* Fix edge case when tile is bigger than latent

* non-square VAE tiling (#3)

* refactor tile number calculation

* support non-square tiles

* add env var to change tile overlap

* add safeguards and better error messages for SD_TILE_OVERLAP

* add safeguards and include overlapping factor for SD_TILE_SIZE

* avoid rounding issues when specifying SD_TILE_SIZE as a factor

* lower SD_TILE_OVERLAP limit

* zero-init empty output buffer

* Fix decode latent size

* fix encode

* tile size params instead of env

* Tiled vae parameter validation (#6)

* avoid crash with invalid tile sizes, use 0 for default

* refactor default tile size, limit overlap factor

* remove explicit parameter for relative tile size

* limit encoding tile to latent size

* unify code style and format code

* update docs

* fix get_tile_sizes in decode_first_stage

---------

Co-authored-by: Wagner Bruna <wbruna@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>