ND hang of SD unit tests on N300 device #7560

Open
mtatsumiTT opened this issue Apr 17, 2024 · 24 comments
Labels: bug (Something isn't working), ci-bug (bugs found in CI), didt_confirmed, P2_should_have

Comments

@mtatsumiTT
Contributor

mtatsumiTT commented Apr 17, 2024

Running the SD (Stable Diffusion) unit tests with WH_ARCH_YAML set on N300 devices hangs non-deterministically (ND).

To repro the issue, switch to the main branch and run the following on an N300 device:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/ttnn/integration_tests/stable_diffusion
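Since the hang is ND, it can help to run the command repeatedly under a timeout and count how often it passes, fails, or hangs. The following is a minimal sketch, not part of the repo: `classify_run`, `tally`, `REPRO_CMD`, and `REPRO_ENV` are hypothetical names, and treating a timeout as a hang is an assumption.

```python
import os
import subprocess
import sys

# Command and environment taken from the repro line above; running pytest via
# "python -m pytest" is an assumption made so the same interpreter is used.
REPRO_CMD = [sys.executable, "-m", "pytest",
             "tests/ttnn/integration_tests/stable_diffusion"]
REPRO_ENV = {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"}


def classify_run(cmd, timeout_s, extra_env=None):
    """Run cmd once and return 'pass', 'fail', or 'hang'.

    'hang' means the timeout expired (subprocess.run kills the child then).
    Any non-zero exit, including an abort or seg fault, counts as 'fail'.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def tally(cmd, runs, timeout_s, extra_env=None):
    """Repeat classify_run and count each outcome."""
    counts = {"pass": 0, "fail": 0, "hang": 0}
    for _ in range(runs):
        counts[classify_run(cmd, timeout_s, extra_env)] += 1
    return counts
```

For example, `tally(REPRO_CMD, runs=10, timeout_s=3600, extra_env=REPRO_ENV)`; the run count and the 3600-second budget are placeholders. A real hang may leave the device needing a reset before the next run, which this sketch does not handle.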

EDIT:
Running the same test with the watcher enabled in the fast-dispatch CI raises the std::runtime_error below in tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py (full log):

terminate called after throwing an instance of 'std::runtime_error'
  what():  Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
Fatal Python error: Aborted
Thread 0x00007f3744ff9700 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 558 in wait
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007f38db2c1740 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 616 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 693 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 471 in get_attention_scores_opt
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 706 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_basic_transformer_block.py", line 90 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_transformer_2d.py", line 298 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attn_upblock.py", line 153 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py", line 321 in test_cross_attn_up_block_2d_512x512
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>

fyi @AleksKnezevic @vtangTT @TT-billteng

@TT-billteng
Collaborator

TT-billteng commented Apr 18, 2024

hanging on N150 in main post-commit

https://github.com/tenstorrent/tt-metal/actions/runs/8728371259/job/23948244160
https://github.com/tenstorrent/tt-metal/actions/runs/8739120961/job/23980015228

seems to be specifically this test

tests/ttnn/unit_tests/test_sd_e2e.py::test_unet_2d_condition_model_512x512[batch_size=2-in_channels=4-input_height=64-input_width=64]

@jliangTT

some discussions are happening over here - https://tenstorrent.slack.com/archives/C055REZR6Q3/p1713992657339109

@mtatsumiTT
Contributor Author

Quick update: some of the unit tests were raising OOM and allocator errors, but the e2e test was passing. I'll skip the tests with OOM errors and launch a pipeline after rebasing onto the latest main to double-check that all SD unit tests pass.

@jliangTT

jliangTT commented May 1, 2024

Next step: Please try to repro it on the latest FD2/main branch.

@jliangTT

jliangTT commented May 1, 2024

Next step:

  • debugging/repro with watcher

@AleksKnezevic
Contributor

I have reproduced the hang three times while running with the watcher enabled and NOC sanitization disabled: twice on the same op and once on a different one. I also had ~5 runs without hangs, so it is still ND. The different op is in the same submodule (the one @mtatsumiTT identified as problematic), so it is probably related.
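With counts this small, it is worth estimating how many clean runs a candidate fix would need before the hang can be considered gone. A rough sketch, assuming each run hangs independently with a fixed probability (an assumption, not something established in this thread) and taking 3 hangs in about 8 runs as the estimate; `clean_runs_needed` is a hypothetical helper:

```python
import math


def clean_runs_needed(hang_rate, confidence=0.95):
    """Smallest n such that (1 - hang_rate)**n <= 1 - confidence.

    In words: if the hang were still present at the given rate, n clean runs
    in a row would happen by luck at most (1 - confidence) of the time.
    """
    if not 0.0 < hang_rate < 1.0:
        raise ValueError("hang_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))


# 3 hangs and ~5 clean runs reported above, so roughly 3 in 8.
print(clean_runs_needed(3 / 8))  # 7
```

The estimate is only as good as the observed rate, which comes from very few runs here, so a larger margin is sensible.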

@AleksKnezevic
Contributor

AleksKnezevic commented May 2, 2024

To repro, run the following on branch aknezevic/hang_debug:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest --count=100 -svv tests/ttnn/integration_tests/stable_diffusion/test_unet_2d_condition_model.py -k 512

@AleksKnezevic
Contributor

hang_debug.txt

@jvasilje
Collaborator

jvasilje commented May 2, 2024

Sounds like we should try to repro on the submodule as the next step. Then we will have a smaller test, with fewer ops, to debug in detail.

AleksKnezevic added a commit that referenced this issue May 4, 2024
… hang observed on BM machine where perf tests are run.
tt-asaigal added a commit that referenced this issue May 4, 2024
… ND hang. No hang observed on BM machine where perf tests are run."

This reverts commit e517e56.
@jliangTT

jliangTT commented May 6, 2024

FD2 was merged over the weekend. Can we rebase and re-test against main to get the latest baseline result?

@AleksKnezevic
Contributor

The hang is still present on both tests. I rebased and pushed the branch aknezevic/repro_MM_hang.

@tt-aho
Contributor

tt-aho commented May 6, 2024

@AleksKnezevic have you ever seen a seg fault when running the following?

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest -svv tests/ttnn/integration_tests/stable_diffusion/test_geglu.py::test_geglu_512x512[N=1-C=2-H=256-W=1280-index=1-model_name=CompVis/stable-diffusion-v1-4-device_l1_small_size=32768]

I seem to be getting it at around 150 iterations (note this was on a t3k system; I will try to repro on a VM next).

@AleksKnezevic
Contributor

I have on occasion seen a seg fault on VM but not on BM. I've been using an N300. @tt-aho, are you clearing the built directory before running the test?

ankitmcw pushed a commit that referenced this issue May 7, 2024
… hang observed on BM machine where perf tests are run.
@jliangTT

jliangTT commented May 7, 2024

High-bandwidth debugging is happening in an internal Slack thread: https://tenstorrent.slack.com/archives/C055REZR6Q3/p1715102452304479

jliangTT assigned aliuTT, ttmtrajkovic and rtawfik01 and unassigned jliangTT May 7, 2024
@aliuTT
Contributor

aliuTT commented May 8, 2024

catching up on debug log, @jliangTT can you add me to the slack thread you linked?

@ttmtrajkovic
Contributor

hey @AleksKnezevic,

Could you please summarize the tests used to reproduce the problem: both a small one and the complex one?
Also, is it reproducible on main?

@AleksKnezevic
Contributor

The small test is two matmuls and an elementwise multiply; the larger test is self attention (matmul, softmax, matmul). The tests are currently on branch aknezevic/repro_MM_hang.
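For readers without access to the branch, the two op sequences can be sketched on the host in plain Python. This is only an illustration of the patterns named above, not the ttnn code in the tests: the function names are hypothetical, the attention variant omits scaling, and having both matmuls in the small pattern read the same input is an assumption.

```python
import math


def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Numerically stable softmax over each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_like(q, k_t, v):
    """Larger pattern: matmul -> softmax -> matmul."""
    return matmul(softmax_rows(matmul(q, k_t)), v)


def two_matmuls_then_multiply(x, w_a, w_b):
    """Smaller pattern: two matmuls, then an elementwise multiply.

    Assumption: both matmuls consume the same input x.
    """
    a = matmul(x, w_a)
    b = matmul(x, w_b)
    return [[p * q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```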

@pavlepopovic
Contributor

Fyi: #8644
In falcon7b prefill, we've also had a problem with matmul->softmax->matmul.
We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce the number of cores allocated for matmuls to 57 to avoid di/dt issues.

@TT-BrianLiu
Contributor

> Fyi: #8644 In falcon7b prefill, we've also had a problem with matmul->softmax->matmul. We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce number of cores allocated for matmuls to 57 to avoid di/dt issues.

Subblock specs shouldn't affect how many cores are used for matmul. It increases the number of loops in compute, which essentially slows it down.
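The loop-count effect can be illustrated with a simplified model in which a core's output block is walked one subblock at a time. This is a sketch under that assumption, not the actual matmul kernel; `subblock_iterations` and the tile counts in the usage note are hypothetical.

```python
def subblock_iterations(per_core_m, per_core_n, subblock_h, subblock_w):
    """Count subblock steps needed to cover one core's output block.

    All sizes are in tiles. The per-core block size (and so the core count)
    is an input here; changing the subblock only changes the loop count.
    """
    if min(per_core_m, per_core_n, subblock_h, subblock_w) <= 0:
        raise ValueError("all sizes must be positive")
    if per_core_m % subblock_h or per_core_n % subblock_w:
        raise ValueError("subblock must evenly divide the per-core block")
    return (per_core_m // subblock_h) * (per_core_n // subblock_w)
```

For a hypothetical 4x8-tile per-core block, a 2x4 subblock gives 4 iterations while a 1x1 subblock gives 32, with the same number of cores in both cases.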

@pavlepopovic
Contributor

pavlepopovic commented May 30, 2024

>> Fyi: #8644 In falcon7b prefill, we've also had a problem with matmul->softmax->matmul. We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce number of cores allocated for matmuls to 57 to avoid di/dt issues.

> Subblock specs shouldn't affect how many cores are used for matmul. It increases the number of loops in compute, which essentially slows it down.

Yup, those are 2 changes we needed to apply to get rid of di/dt (subblocks and grid size reduction)
(Just realised that ‘furthermore’ is a bad choice of words for what I was trying to say :D)
