TUI subtraction overflow panic #1063

Closed
kameko opened this issue Dec 12, 2023 · 3 comments · Fixed by #1083
Labels
bug (Something isn't working)

Comments

@kameko

kameko commented Dec 12, 2023

Describe the bug
Training on the default MNIST example with Burn 0.11.1 does not complete because of a TUI panic, which I assume comes from this line:

let num_items = epoch_items + iteration_items - ignore_num_items;
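
For reference, a minimal standalone sketch (not Burn's renderer code; the counter values are made up) of why that expression panics in a debug build when the subtracted term is larger than the sum:

```rust
use std::hint::black_box;

fn main() {
    // Hypothetical counter values; in the real renderer these come from the
    // progress aggregation. black_box keeps the compiler from folding the
    // arithmetic away at compile time.
    let epoch_items: usize = black_box(2);
    let iteration_items: usize = black_box(0);
    let ignore_num_items: usize = black_box(5);

    // A debug build aborts here with "attempt to subtract with overflow"
    // whenever ignore_num_items > epoch_items + iteration_items; a release
    // build wraps around silently unless overflow-checks is enabled.
    let num_items = epoch_items + iteration_items - ignore_num_items;
    println!("{num_items}");
}
```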

Desktop (please complete the following information):

  • OS: Windows 11
  • Version: 0.11.1

Additional context
This is the last part of the training log:

2023-12-12T16:25:37.408801Z  INFO burn_train::learner::epoch: Executing training step for epoch 2    
2023-12-12T16:25:37.408798Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/model-1    
2023-12-12T16:25:37.408818Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/scheduler-1    
2023-12-12T16:25:37.408821Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/Projects/tc/tc0/data\training\mnist/checkpoint/optim-1    
2023-12-12T16:25:37.408878Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.415033Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.421931Z  INFO burn_core::record::file: File exists, replacing    
2023-12-12T16:25:37.424361Z  INFO burn_train::learner::epoch: Iteration 1    
2023-12-12T16:25:37.454946Z ERROR burn_train::learner::log: PANIC => panicked at C:\Users\Kameko\.cargo\registry\src\index.crates.io-6f17d22bba15001f\burn-train-0.11.1\src\renderer\tui\progress.rs:179:21:
attempt to subtract with overflow    

Let me know if you need the rest of it.

Additionally, is there any way for me to fall back to a trainer that does not use the TUI?

UPDATE: For some reason, this worked when I ran it in debug mode in VSCode (still Windows, no remote/WSL). I really can't tell why my terminal didn't work: it's standard PowerShell, the same one VSCode uses in debug mode, and I didn't even change the window size.

UPDATE 2: It, predictably, works if I compile in release mode, since release builds don't do overflow checking by default. Although it's still odd that it worked in debug mode inside VSCode.

@nathanielsimard
Member

@kameko, maybe the order of operations was not respected in one of the builds you did (it's not supposed to affect the math, but it can cause an overflow). So maybe we should add some parentheses or checks?
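
For illustration, a small standalone sketch (not Burn code; made-up values) of why evaluation order matters with unsigned counters: a reordered but mathematically equivalent form can underflow on an intermediate step.

```rust
// Written left to right, as in progress.rs: the sum is computed first,
// so no intermediate value goes negative as long as the inputs are sane.
fn as_written(epoch_items: usize, iteration_items: usize, ignore_num_items: usize) -> usize {
    epoch_items + iteration_items - ignore_num_items
}

// Mathematically identical, but the intermediate epoch_items - ignore_num_items
// underflows whenever ignore_num_items > epoch_items, even though the final
// result would be non-negative.
fn reordered(epoch_items: usize, iteration_items: usize, ignore_num_items: usize) -> usize {
    epoch_items - ignore_num_items + iteration_items
}

fn main() {
    println!("{}", as_written(2, 3, 4)); // prints 1
    println!("{}", reordered(2, 3, 4));  // panics in a debug build
}
```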

@kameko
Author

kameko commented Dec 12, 2023

I doubt the compiler is arbitrarily rearranging the order of operations. I just tested that order of operations with some #[inline(never)] code in the playground, doing what amounts to 2 + 0 - 1, and it compiled.
Although I haven't peeked at any code above what I originally linked, I'm assuming some logic somewhere is providing negative values: when I ran the trainer, the completion percentage kept going up and down (2 to 3, 3 to 2, back to 3, then 3 to 4), which I assume is what was responsible.
Either way, an easy way around this is to replace the operators with overflowing_sub and overflowing_add, and either use the overflowing value or simply not update the status when they report an overflow.
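
A rough sketch of that suggestion (a hypothetical helper, not the actual progress.rs code; checked_add/checked_sub are used here, with the overflowing_* variants noted as the alternative described above):

```rust
/// Hypothetical stand-in for the update in progress.rs: returns the new item
/// count, or None when the arithmetic would overflow, so the caller can keep
/// the previous status instead of panicking.
fn safe_num_items(
    epoch_items: usize,
    iteration_items: usize,
    ignore_num_items: usize,
) -> Option<usize> {
    // checked_add/checked_sub return None instead of panicking (debug) or
    // wrapping (release) on overflow. overflowing_add/overflowing_sub would
    // work just as well, by testing the bool they return.
    epoch_items
        .checked_add(iteration_items)?
        .checked_sub(ignore_num_items)
}

fn main() {
    assert_eq!(safe_num_items(2, 3, 4), Some(1));
    assert_eq!(safe_num_items(2, 0, 5), None); // would have underflowed
}
```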

@antimora added the bug label on Dec 14, 2023
@louisfd
Member

louisfd commented Dec 19, 2023

Okay, so I think I found the root of the bug, which is rather indirect.

For context, the underflow comes from epoch_items + iteration_items - ignore_num_items, that is, when ignore_num_items is larger than epoch_items + iteration_items. In theory this should never happen, because ignore_num_items is computed from epoch_items + iteration_items at some point and then doesn't change, while epoch_items + iteration_items keeps growing monotonically.

However, epoch_items + iteration_items is computed from progress, and progress is computed from an aggregation of thread sub-progresses in burn-core/src/data/dataloader/multithread.rs.

It seems that at every epoch a new thread is spawned for each new iteration, until the maximum number of threads is reached. If this max is 4, then iterations 1, 2 and 3 run with fewer than the expected number of threads.
At those iterations the progress aggregation is therefore incomplete, so the value of epoch_items + iteration_items is invalid and may end up below ignore_num_items (which was computed at some other epoch, with fewer iterations having happened but with the full thread aggregation).
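
To make the failure mode concrete, here is a small self-contained sketch (made-up numbers, not the multithread dataloader code) of how aggregating only the workers that have reported so far can make the total temporarily drop below a baseline captured earlier with the full set of workers:

```rust
/// Sum the progress of only those workers that have already reported,
/// mimicking an aggregation over threads that are spawned lazily.
fn aggregate(worker_items: &[Option<usize>]) -> usize {
    worker_items.iter().flatten().sum()
}

fn main() {
    // Earlier epoch: all 4 workers are running, so the aggregation is complete
    // and ignore_num_items gets captured from this full total.
    let full = [Some(10), Some(10), Some(10), Some(10)];
    let ignore_num_items = aggregate(&full); // 40

    // Start of the next epoch: only 2 of the 4 workers have been spawned yet,
    // so the aggregated total is temporarily smaller than the baseline.
    let partial = [Some(12), Some(12), None, None];
    let current = aggregate(&partial); // 24

    // 24 - 40 underflows: this is the subtraction that panics in progress.rs.
    assert!(current < ignore_num_items);
    assert_eq!(current.checked_sub(ignore_num_items), None);
}
```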

@nathanielsimard how do you think we should fix this?
