TUI subtraction overflow panic #1063

Closed
kameko opened this issue Dec 12, 2023 · 3 comments · Fixed by #1083
Labels
bug (Something isn't working)

Comments

@kameko

kameko commented Dec 12, 2023

Describe the bug
Training on the default MNIST example with Burn 0.11.1 does not complete because of a TUI panic, which I assume comes from this line:

let num_items = epoch_items + iteration_items - ignore_num_items;
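
For reference, a minimal standalone sketch (not Burn's renderer code; the counter values are made up) of why that expression panics in a debug build when the subtracted term is larger than the sum:

```rust
use std::hint::black_box;

fn main() {
    // Hypothetical counter values; in the real renderer these come from the
    // progress aggregation. black_box keeps the compiler from folding the
    // arithmetic away at compile time.
    let epoch_items: usize = black_box(2);
    let iteration_items: usize = black_box(0);
    let ignore_num_items: usize = black_box(5);

    // A debug build aborts here with "attempt to subtract with overflow"
    // whenever ignore_num_items > epoch_items + iteration_items; a release
    // build wraps around silently unless overflow-checks is enabled.
    let num_items = epoch_items + iteration_items - ignore_num_items;
    println!("{num_items}");
}
```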

Desktop (please complete the following information):

  • OS: Windows 11
  • Version: 0.11.1

Additional context
This is the last part of the training log:

2023-12-12T16:25:37.408801Z  INFO burn_train::learner::epoch: Executing training step for epoch 2    
2023-12-12T16:25:37.408798Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/model-1    
2023-12-12T16:25:37.408818Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/scheduler-1    
2023-12-12T16:25:37.408821Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/optim-1    
2023-12-12T16:25:37.408878Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.415033Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.421931Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.424361Z  INFO burn_train::learner::epoch: Iteration 1    
2023-12-12T16:25:37.454946Z ERROR burn_train::learner::log: PANIC => panicked at C:\Users\Kameko\.cargo\registry\src\index.crates.io-6f17d22bba15001f\burn-train-0.11.1\src\renderer\tui\progress.rs:179:21:
attempt to subtract with overflow    

Let me know if you need the rest of it.

Additionally, is there any way for me to fall back to a trainer that does not use the TUI?

UPDATE: For some reason, this worked when I ran it in debug mode in VSCode (still Windows, no remote/WSL). I really can't tell why my terminal didn't work: it's standard PowerShell, the same one VSCode uses in debug mode, and I didn't even change the window size.

UPDATE 2: It, predictably, works if I compile in release mode, since release builds don't do overflow checking by default. Although it's still odd that it worked in debug mode inside VSCode.

@nathanielsimard
Member

@kameko, maybe the order of operations was not respected in one of the builds you did (it's not supposed to affect the math, but it can cause an overflow). So maybe we should add some parentheses or checks?
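
For illustration, a small standalone sketch (not Burn code; made-up values) of why evaluation order matters with unsigned counters: a reordered but mathematically equivalent form can underflow on an intermediate step.

```rust
// Written left to right, as in progress.rs: the sum is computed first,
// so no intermediate value goes negative as long as the inputs are sane.
fn as_written(epoch_items: usize, iteration_items: usize, ignore_num_items: usize) -> usize {
    epoch_items + iteration_items - ignore_num_items
}

// Mathematically identical, but the intermediate epoch_items - ignore_num_items
// underflows whenever ignore_num_items > epoch_items, even though the final
// result would be non-negative.
fn reordered(epoch_items: usize, iteration_items: usize, ignore_num_items: usize) -> usize {
    epoch_items - ignore_num_items + iteration_items
}

fn main() {
    println!("{}", as_written(2, 3, 4)); // prints 1
    println!("{}", reordered(2, 3, 4));  // panics in a debug build
}
```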

@kameko
Author

kameko commented Dec 12, 2023

I doubt the compiler is arbitrarily rearranging the order of operations. I just tested that order of operations with some #[inline(never)] code in the playground, doing what amounts to 2 + 0 - 1, and it compiled.
Although I haven't peeked at any code above what I originally linked, I'm assuming some logic somewhere is providing negative values: when I ran the trainer, the completion percentage kept going up and down (2 to 3, 3 to 2, back to 3, then 3 to 4), which I assume is what was responsible.
Either way, an easy way around this is to replace the operators with overflowing_sub and overflowing_add, and either use the overflowing value or simply not update the status when they report an overflow.
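
A rough sketch of that suggestion (a hypothetical helper, not the actual progress.rs code; checked_add/checked_sub are used here, with the overflowing_* variants noted as the alternative described above):

```rust
/// Hypothetical stand-in for the update in progress.rs: returns the new item
/// count, or None when the arithmetic would overflow, so the caller can keep
/// the previous status instead of panicking.
fn safe_num_items(
    epoch_items: usize,
    iteration_items: usize,
    ignore_num_items: usize,
) -> Option<usize> {
    // checked_add/checked_sub return None instead of panicking (debug) or
    // wrapping (release) on overflow. overflowing_add/overflowing_sub would
    // work just as well, by testing the bool they return.
    epoch_items
        .checked_add(iteration_items)?
        .checked_sub(ignore_num_items)
}

fn main() {
    assert_eq!(safe_num_items(2, 3, 4), Some(1));
    assert_eq!(safe_num_items(2, 0, 5), None); // would have underflowed
}
```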

@antimora added the bug label on Dec 14, 2023
@louisfd
Member

louisfd commented Dec 19, 2023

Okay, so I think I found the root of the bug, which is rather indirect.

For context, the underflow comes from epoch_items + iteration_items - ignore_num_items, that is, when ignore_num_items is larger than epoch_items + iteration_items. In theory this should never happen, because ignore_num_items is computed from epoch_items + iteration_items at some point and then doesn't change, while epoch_items + iteration_items keeps growing monotonically.

However, epoch_items + iteration_items is computed from progress, and progress is computed from an aggregation of thread sub-progresses in burn-core/src/data/dataloader/multithread.rs.

It seems that at every epoch a new thread is spawned for each new iteration, until the maximum number of threads is reached. If this max is 4, then iterations 1, 2 and 3 run with fewer than the expected number of threads.
At those iterations the progress aggregation is therefore incomplete, so the value of epoch_items + iteration_items is invalid and may end up below ignore_num_items (which was computed at some other epoch, with fewer iterations having happened but with the full thread aggregation).
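
To make the failure mode concrete, here is a small self-contained sketch (made-up numbers, not the multithread dataloader code) of how aggregating only the workers that have reported so far can make the total temporarily drop below a baseline captured earlier with the full set of workers:

```rust
/// Sum the progress of only those workers that have already reported,
/// mimicking an aggregation over threads that are spawned lazily.
fn aggregate(worker_items: &[Option<usize>]) -> usize {
    worker_items.iter().flatten().sum()
}

fn main() {
    // Earlier epoch: all 4 workers are running, so the aggregation is complete
    // and ignore_num_items gets captured from this full total.
    let full = [Some(10), Some(10), Some(10), Some(10)];
    let ignore_num_items = aggregate(&full); // 40

    // Start of the next epoch: only 2 of the 4 workers have been spawned yet,
    // so the aggregated total is temporarily smaller than the baseline.
    let partial = [Some(12), Some(12), None, None];
    let current = aggregate(&partial); // 24

    // 24 - 40 underflows: this is the subtraction that panics in progress.rs.
    assert!(current < ignore_num_items);
    assert_eq!(current.checked_sub(ignore_num_items), None);
}
```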

@nathanielsimard how do you think we should fix this?
