tt_lib.tensor.sum operation with dim=2 failed with low PCC [Grayskull] #7006

banekg · 2024-04-02T13:12:42Z

Describe the bug
The ttlib.sum_2 operation fails with a low PCC value in some test cases, and in many test cases when the data type is BFLOAT8_B. With BFLOAT8_B, the operation fails on both Grayskull and Wormhole cards.

To Reproduce
Steps to reproduce the behavior:

1. Check out the main branch.
2. Run the unit test test_sum_2.py using this command:
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_sum_2.py

Expected behavior
The unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_sum_2.py contains a test case that is expected to fail with a low PCC value.
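For readers unfamiliar with the metric: PCC here is taken to mean the Pearson correlation coefficient between the reference (PyTorch) output and the device output. The following is a minimal pure-Python sketch of that comparison, not the repository's own helper; the function name `pcc` and the handling of constant inputs are assumptions made for illustration.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length flat sequences.

    Hypothetical helper for illustration; the test suite's own comparison
    function may differ in details such as NaN/inf handling.
    """
    if len(expected) != len(actual) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # The coefficient is undefined for a constant sequence. As an
        # assumed convention, report 1.0 only on an exact match.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)
```

A value close to 1.0 means the two outputs agree up to a positive linear rescaling; a test typically passes when the value exceeds some chosen threshold.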

Getting Additional info for the operation under test and its behavior
To get additional information and results for the different combinations of input shapes, types, layouts and memory configs with which this operation was tested, you can also run the sweeps for tt_lib.tensor.sum locally and check the results. To do this:

1. Follow the Getting Started page to set up the repo, environment variables and python-env.
2. Activate the environment: source build/python_env/bin/activate
3. Run the sweeps: python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttlib_sum_2_test.yaml -o ./result-sweeps

After the run completes, all sweep results should be available in the specified output directory (in this case ./result-sweeps). There you will find sum_2_sweep.csv, which holds all executed sweeps, including the ones that failed and were recreated by the unit test. You can locate those by searching for their unique data_seed field.
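Searching the results file for a given seed can be scripted. The sketch below assumes the CSV has a header row with a column literally named `data_seed`; the other column in the sample and the seed values are made up, and `rows_with_seed` is a hypothetical helper, not part of the sweep tooling.

```python
import csv
import io


def rows_with_seed(csv_file, data_seed, column="data_seed"):
    """Return all rows (as dicts) whose `column` equals `data_seed`.

    `csv_file` is any open text file object; the column name is an
    assumption about the layout of the sweep results file.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get(column) == str(data_seed)]


# Made-up rows, only to exercise the function.
sample = io.StringIO("data_seed,input_shapes\n123,a\n456,b\n")
matches = rows_with_seed(sample, 123)
```

Against a real run, the same function could be called with `open("./result-sweeps/sum_2_sweep.csv", newline="")` in place of the in-memory sample.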

umadevimcw · 2024-04-30T09:51:01Z

@nemanjagrujic @banekg @jliangTT The reduce op is failing in multicore for TILE layout. To use ROW_MAJOR layout, the datatype should be bf16 instead of bf8. I have submitted PR #7962 with these changes. The tests are passing now.

@tt-aho

We have observed that the reduce op does not function as expected in the multicore implementation for the TILE layout (dim H) with the BF8 datatype.
For testing, I filled the tensor with the constant 1 for the shape (7, 14, 32, 160) (sum over dim H) in the above test file and dumped the results to text files. Some of the values in the output tensor are 16 instead of 32.

PFA

TT.txt

pytorch.txt
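To make the expected value concrete: summing an all-ones tensor of shape (7, 14, 32, 160) over H (dim=2) should give 32 at every output position, since H = 32. The sketch below reproduces that reference result in pure Python with nested lists; it is not the device op or the PyTorch call used by the test, and keeping a size-1 H dimension in the output is an assumption made for illustration.

```python
def ones(n, c, h, w):
    """Build an [N][C][H][W] nested list filled with 1.0."""
    return [[[[1.0] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]


def sum_dim2(tensor):
    """Sum a 4D nested list over H, returning shape [N][C][1][W]."""
    return [
        [[[sum(row[w] for row in plane) for w in range(len(plane[0]))]]
         for plane in batch]
        for batch in tensor
    ]


N, C, H, W = 7, 14, 32, 160
reference = sum_dim2(ones(N, C, H, W))

# Collect the distinct values in the output; a correct reduction over
# H = 32 ones leaves exactly one distinct value, 32.0.
distinct = {v for batch in reference for plane in batch for row in plane for v in row}
```

Any output entry equal to 16 therefore corresponds to only half of the 32 rows having been accumulated for that position.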

tt-aho · 2024-04-30T19:47:15Z

@umadevimcw is your test case for bfloat8_b? Could you check if 21e0671 fixes it? If it does you can cherry-pick it to your pr to merge and enable the tests for it.

tt-aho · 2024-05-01T23:15:33Z

I merged my commit to main so you can rebase to pick it up instead of cherry-picking

umadevimcw · 2024-05-02T01:29:04Z

Sure. Will check it out.

umadevimcw · 2024-05-02T09:08:26Z

@tt-aho With the recent changes, the test cases with bf8 and TILE layout pass. However, W must be a multiple of 32; if W is not a multiple of 32, I get the error message below:

E       RuntimeError: TT_THROW @ tt_metal/impl/program/program.cpp:36: tt::exception
E       info:
E       Failed to generate binaries for {} {}
E       pack_untilize
E       TT_THROW @ tt_metal/jit_build/build.cpp:412: tt::exception
E       info:
E       {} build failed
E       trisc2
E       backtrace:
E        --- tt::tt_metal::JitBuildState::compile_one(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tt::tt_metal::JitBuildSettings const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- auto tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)
E        --- tt::tt_metal::JitBuildState::compile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tt::tt_metal::JitBuildSettings const*) const
E        --- tt::tt_metal::JitBuildState::build(tt::tt_metal::JitBuildSettings const*) const
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- std::_Function_handler<void (), tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2de187) [0x7fa590c0c187]
E        --- std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run()
E        --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fa5cfd46df4]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa5d3410609]
E        --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa5d354a353]
E       
E       backtrace:
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- std::_Function_handler<void (), tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2de187) [0x7fa590c0c187]
E        --- std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run()
E        --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fa5cfd46df4]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa5d3410609]
E        --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa5d354a353]

tests/tt_eager/python_api_testing/sweep_tests/tt_lib_ops.py:1898: RuntimeError

tt-aho · 2024-05-02T13:38:04Z

What is the specific test to repro the error?

umadevimcw · 2024-05-02T13:40:24Z

@tt-aho (6, 1, 140, 110) is one of the shapes for which I hit the above issue.
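For reference, neither H = 140 nor W = 110 in this shape is a multiple of 32. A quick way to check a shape and see how much padding it would need, assuming a tile edge of 32 (the constant `TILE_DIM` and both helpers are hypothetical names for this sketch):

```python
TILE_DIM = 32  # assumed tile edge length, matching the "multiple of 32" constraint


def round_up(value, multiple=TILE_DIM):
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def tile_padding(shape):
    """Return how many elements of padding H and W need to be tile aligned."""
    h, w = shape[-2], shape[-1]
    return {"H": round_up(h) - h, "W": round_up(w) - w}


padding = tile_padding((6, 1, 140, 110))
```

Here `padding` comes out as 20 for H (140 rounds up to 160) and 18 for W (110 rounds up to 128), whereas the earlier shape (7, 14, 32, 160) needs no padding in either dimension.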

tt-aho · 2024-05-02T13:50:37Z

I don't see an error when changing the test to that shape. Could you rebuild/retry?

umadevimcw · 2024-05-03T08:20:14Z

@tt-aho I cloned the repo and compiled it. I am no longer getting the above error.

umadevimcw · 2024-05-06T08:51:10Z

Closing as the test passes with the updated config

banekg added the bug and GS labels Apr 2, 2024

banekg changed the title ~~ttlib.sum_2 failed with low PCC []~~ tt_lib.tensor.sum operation with dim=2 failed with low PCC [Grayskull] Apr 2, 2024

banekg added the op_cat: TM label Apr 3, 2024

nemanjagrujic added the WH label Apr 3, 2024

banekg added op_cat: reduces and removed op_cat: TM labels Apr 3, 2024

jliangTT assigned umadevimcw Apr 11, 2024

jliangTT added the P2_should_have label Apr 11, 2024

umadevimcw added a commit that referenced this issue Apr 30, 2024

#7006: Fix reduce sum2 fail issue

8a6bf3c

umadevimcw added a commit that referenced this issue Apr 30, 2024

#7006: Fix reduce sum2 fail issue

7824d65

umadevimcw added a commit that referenced this issue Apr 30, 2024

#7006: Fix reduce sum2 fail issue

3bac302

umadevimcw added a commit that referenced this issue May 2, 2024

#7006: Fix reduce sum2 fail issue

d4f3bd8

umadevimcw added a commit that referenced this issue May 2, 2024

#7006: Fix reduce sum2 fail issue

eeb0863

umadevimcw added a commit that referenced this issue May 3, 2024

#7006: Fix reduce sum2 fail issue

b9d9cd8

umadevimcw added a commit that referenced this issue May 3, 2024

#7006: Fix reduce sum2 fail issue

0f063c1

umadevimcw added a commit that referenced this issue May 6, 2024

#7006: Fix reduce sum2 fail issue

490ba67

umadevimcw added a commit that referenced this issue May 6, 2024

#7006: Fix reduce sum2 fail issue

b55da4b

umadevimcw added a commit that referenced this issue May 6, 2024

#7006: Fix reduce sum2 fail issue

f9dd970

umadevimcw closed this as completed May 6, 2024

ankitmcw pushed a commit that referenced this issue May 7, 2024

#7006: Fix reduce sum2 fail issue

9dca572

ankitmcw pushed a commit that referenced this issue May 7, 2024

#7006: Fix reduce sum2 fail issue

3491438
