ND hang of SD unit tests on N300 device #7560
Hanging on N150 in main post-commit (https://github.com/tenstorrent/tt-metal/actions/runs/8728371259/job/23948244160). It seems to be specifically this test
Some discussions are happening over here: https://tenstorrent.slack.com/archives/C055REZR6Q3/p1713992657339109
Quick update: some of the unit tests were raising OOM and allocator errors, but the e2e test was passing. I'll skip the tests with OOM errors and launch a pipeline after rebasing to the latest main to double-check that all SD unit tests pass.
Next step: please try to repro it on the latest FD2/main branch.
I have been able to successfully reproduce the hang running watcher without NOC sanitization three times. Twice on the same op, once on a different one. I also ran without hangs ~5 times, so still ND. The different one is in the same submodule (the one @mtatsumiTT identified as problematic), so probably related.
To repro, run
Sounds like we should try to repro on the submodule as the next step. Then we will have a smaller test to debug in detail, with fewer ops.
… hang observed on BM machine where perf tests are run.
… ND hang. No hang observed on BM machine where perf tests are run." This reverts commit e517e56.
FD2 was merged over the weekend. Can we rebase and re-test against main to get the latest baseline result?
Hang is still present on both tests. I rebased and pushed
@AleksKnezevic, have you ever seen a seg fault? I seem to be getting it around ~150 iterations (note this was on a T3K system; I will try to repro on a VM next)
I have on occasion seen a seg fault on VM but not on BM. I've been using an N300. @tt-aho, are you clearing the built directory before running the test?
High-bw debugging happening in a Slack thread (internal): https://tenstorrent.slack.com/archives/C055REZR6Q3/p1715102452304479
Catching up on the debug log. @jliangTT, can you add me to the Slack thread you linked?
Hey @AleksKnezevic, could you please summarize the tests used to reproduce the problem, both the small one and the complex one?
The small test is two matmuls and an elementwise multiply; the larger test is self-attention (matmul, softmax, matmul). The tests are currently on branch
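For reference, the structure of the two repro cases can be sketched in plain NumPy. This is a hypothetical illustration of the op graphs described above, not the actual ttnn tests; all function names, shapes, and values here are made up for illustration.

```python
# Hypothetical sketch of the two repro op graphs (NOT the actual ttnn tests).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def small_test(a, b, c, scale):
    # Small repro: two matmuls followed by an elementwise multiply.
    return (a @ b) @ c * scale

def self_attention(q, k, v):
    # Larger repro: matmul -> softmax -> matmul.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
a, b, c, scale = (rng.standard_normal((32, 32)) for _ in range(4))
out_small = small_test(a, b, c, scale)

q, k, v = (rng.standard_normal((1, 64, 32)) for _ in range(3))
out_attn = self_attention(q, k, v)
print(out_small.shape, out_attn.shape)  # (32, 32) (1, 64, 32)
```

The point of reducing to these two graphs is that any hang reproduced here implicates a handful of ops rather than the full SD pipeline.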
Fyi: #8644
Subblock specs shouldn't affect how many cores are used for matmul. They increase the number of loops in compute, which essentially slows it down.
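The core-count vs. loop-count distinction can be made concrete with a bit of arithmetic. This is an illustrative sketch only; the parameter names are hypothetical and do not correspond to the actual ttnn matmul config fields.

```python
# Illustrative arithmetic (hypothetical names, not the ttnn matmul config):
# each core covers its per-core output tiles in subblock-sized chunks, so
# shrinking the subblock raises the compute loop count while the core grid
# (and therefore how many cores run) stays fixed.
def compute_loops(out_h_tiles, out_w_tiles, subblock_h, subblock_w):
    assert out_h_tiles % subblock_h == 0 and out_w_tiles % subblock_w == 0
    return (out_h_tiles // subblock_h) * (out_w_tiles // subblock_w)

grid_cores = 8 * 8                 # unchanged by the subblock choice
print(compute_loops(8, 8, 4, 2))   # 8 loops per core
print(compute_loops(8, 8, 1, 1))   # 64 loops per core: slower, same grid
```

This is why smaller subblocks are a plausible di/dt mitigation: they stretch the same work over more, shorter compute iterations without changing how many cores draw power.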
Yup, those are 2 changes we needed to apply to get rid of di/dt (subblocks and grid size reduction) |
Running SD unit tests with `WH_ARCH_YAML` on N300 devices non-deterministically hangs. To repro the issue, switch to the `main` branch and run the following on an N300 device:

EDIT: Running the same test with watcher enabled in the fast-dispatch CI raises the `std::runtime_error` below on `tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py` (full log):

fyi @AleksKnezevic @vtangTT @TT-billteng