tt_lib.tensor.sum operation with dim=2 failed with low PCC [Grayskull] #7006

Closed
banekg opened this issue Apr 2, 2024 · 10 comments

@banekg
Contributor

banekg commented Apr 2, 2024

Describe the bug
The ttlib.sum_2 operation (tt_lib.tensor.sum with dim=2) fails with a low PCC value in some test cases, and in many test cases when the data type is BFLOAT8_B. With BFLOAT8_B, the operation fails on both Grayskull and Wormhole cards.

To Reproduce
Steps to reproduce the behavior:

  1. Check out the main branch
  2. Run the unit test test_sum_2.py using this command:
    pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_sum_2.py

Expected behavior
The unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_sum_2.py contains a test case that is currently expected to fail with a low PCC value.
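For readers unfamiliar with the metric, PCC here refers to the Pearson correlation coefficient between the reference (PyTorch) output and the device output. The following is a minimal sketch of such a check, not the repository's actual comparison helper; the function names, the 0.99 threshold, and the handling of constant inputs are assumptions made for illustration.

```python
import math


def pcc(golden, actual):
    """Pearson correlation coefficient of two equal-length flat sequences."""
    if len(golden) != len(actual) or len(golden) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(golden)
    mean_g = sum(golden) / n
    mean_a = sum(actual) / n
    cov = sum((g - mean_g) * (a - mean_a) for g, a in zip(golden, actual))
    var_g = sum((g - mean_g) ** 2 for g in golden)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_g == 0 or var_a == 0:
        # Correlation is undefined when either input is constant.
        # Assumption for this sketch: report 1.0 only for an exact match.
        return 1.0 if list(golden) == list(actual) else 0.0
    return cov / math.sqrt(var_g * var_a)


def passes_pcc(golden, actual, threshold=0.99):
    """Return True when the correlation meets the (assumed) threshold."""
    return pcc(golden, actual) >= threshold


if __name__ == "__main__":
    print(pcc([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
    print(passes_pcc([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]))
```

A low PCC therefore means the device result does not track the reference result, which is stronger evidence of a functional bug than a small absolute error would be.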

Getting Additional info for the operation under test and its behavior
To get additional information and results for the different combinations of input shapes, types, layouts and memory configs with which this operation was tested, you can also run the sweeps for tt_lib.tensor.sum locally and check the results. To do this:

  1. Follow the Getting Started page to set up the repo, environment variables and python-env
  2. Activate the environment: source build/python_env/bin/activate
  3. Run the sweeps with python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttlib_sum_2_test.yaml -o ./result-sweeps
  4. After the run completes, all sweep results are available in the specified output directory (in this case ./result-sweeps). There you will find sum_2_sweep.csv, which holds all executed sweeps. The failing sweeps that were recreated in the unit test can be found by searching for their unique data_seed field.
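The lookup in step 4 can be sketched as below. This is only an illustration: the `data_seed` column name comes from the description above, while the `status` column, the seed values, and the sample rows are made up for the demo and may not match the real sum_2_sweep.csv.

```python
import csv
import io


def rows_for_seed(csv_file, data_seed):
    """Return all rows of an open CSV file whose data_seed matches."""
    reader = csv.DictReader(csv_file)
    # DictReader yields strings, so compare against the string form of the seed.
    return [row for row in reader if row.get("data_seed") == str(data_seed)]


if __name__ == "__main__":
    # Hypothetical stand-in for ./result-sweeps/sum_2_sweep.csv.
    sample = io.StringIO("data_seed,status\n111,pass\n222,fail\n")
    for row in rows_for_seed(sample, 222):
        print(row)
```

For a real run, open the generated file instead of the in-memory sample, for example `with open("./result-sweeps/sum_2_sweep.csv", newline="") as f:`.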
banekg added the bug (Something isn't working) and GS labels on Apr 2, 2024
banekg changed the title from "ttlib.sum_2 failed with low PCC []" to "tt_lib.tensor.sum operation with dim=2 failed with low PCC [Grayskull]" on Apr 2, 2024
umadevimcw added a commit that referenced this issue Apr 30, 2024
umadevimcw added a commit that referenced this issue Apr 30, 2024
@umadevimcw
Contributor

umadevimcw commented Apr 30, 2024

@nemanjagrujic @banekg @jliangTT The reduce op fails in the multicore implementation for TILE layout. With ROW_MAJOR layout, the data type must be bf16 instead of bf8. I have submitted PR #7962 with these changes, and the tests now pass.

@tt-aho

  • We have observed that the reduce op does not work correctly in the multicore implementation for TILE layout (reduction along dim H) with the BF8 data type.
  • For testing, we filled a tensor of shape (7, 14, 32, 160) with the constant 1 in the above test file, summed along dim H, and dumped the results to text files. Some of the values in the device output are 16 instead of the expected 32.

Please find the dumps attached:

TT.txt

pytorch.txt
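The expected result of the constant-ones experiment can be reproduced on the host without any device. The sketch below uses plain Python lists rather than PyTorch or tt_lib; the helper names are hypothetical, and keeping a size-1 H dimension in the output is an assumption about the op's output shape.

```python
from collections import Counter


def make_constant(shape, fill=1.0):
    """Build a nested-list tensor of shape (N, C, H, W) filled with `fill`."""
    n, c, h, w = shape
    return [[[[fill] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]


def sum_dim_h(tensor):
    """Sum along H (dim=2), keeping a size-1 H dimension (assumed)."""
    # zip(*plane) iterates over the W columns of one (H, W) plane.
    return [[[[sum(col) for col in zip(*plane)]] for plane in batch]
            for batch in tensor]


def value_histogram(tensor):
    """Count how often each value occurs in a 4-D nested-list tensor."""
    return Counter(v for batch in tensor for plane in batch
                   for row in plane for v in row)


if __name__ == "__main__":
    reference = sum_dim_h(make_constant((7, 14, 32, 160)))
    # Every output element accumulates H = 32 ones.
    print(value_histogram(reference))
```

Running the same histogram over the device dump would show at a glance how many elements came out as 16, that is, how many accumulated only half of the H rows.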

umadevimcw added a commit that referenced this issue Apr 30, 2024
@tt-aho
Contributor

tt-aho commented Apr 30, 2024

@umadevimcw is your test case for bfloat8_b? Could you check if 21e0671 fixes it? If it does you can cherry-pick it to your pr to merge and enable the tests for it.

@tt-aho
Contributor

tt-aho commented May 1, 2024

I merged my commit to main so you can rebase to pick it up instead of cherry-picking

@umadevimcw
Contributor

Sure. Will check it out.

umadevimcw added a commit that referenced this issue May 2, 2024
umadevimcw added a commit that referenced this issue May 2, 2024
@umadevimcw
Contributor

@tt-aho With the recent changes, the test cases with bf8 and TILE layout pass. However, W has to be a multiple of 32; when W is not a multiple of 32, I get the error below:

E       RuntimeError: TT_THROW @ tt_metal/impl/program/program.cpp:36: tt::exception
E       info:
E       Failed to generate binaries for {} {}
E       pack_untilize
E       TT_THROW @ tt_metal/jit_build/build.cpp:412: tt::exception
E       info:
E       {} build failed
E       trisc2
E       backtrace:
E        --- tt::tt_metal::JitBuildState::compile_one(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tt::tt_metal::JitBuildSettings const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- auto tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)
E        --- tt::tt_metal::JitBuildState::compile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tt::tt_metal::JitBuildSettings const*) const
E        --- tt::tt_metal::JitBuildState::build(tt::tt_metal::JitBuildSettings const*) const
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- std::_Function_handler<void (), tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2de187) [0x7fa590c0c187]
E        --- std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run()
E        --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fa5cfd46df4]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa5d3410609]
E        --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa5d354a353]
E       
E       backtrace:
E        --- std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<std::function<void ()> ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2cd5ed) [0x7fa590bfb5ed]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7fa5d34194df]
E        --- std::_Function_handler<void (), tt::tt_metal::detail::async<std::function<void ()> const&>(std::function<void ()> const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
E        --- /home/ubuntu/uma/tt-metal/build/lib/libtt_metal.so(+0x2de187) [0x7fa590c0c187]
E        --- std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run()
E        --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fa5cfd46df4]
E        --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa5d3410609]
E        --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa5d354a353]

tests/tt_eager/python_api_testing/sweep_tests/tt_lib_ops.py:1898: RuntimeError

@tt-aho
Contributor

tt-aho commented May 2, 2024

What is the specific test to repro the error?

@umadevimcw
Contributor

@tt-aho (6, 1, 140, 110) is one of the shapes for which I hit the above issue.
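To check which shapes in a sweep fall into the "not a multiple of 32" category, a small helper like the following can be used. It is a sketch under the assumption, taken from this thread, that the relevant tile dimension is 32; the function names are hypothetical, and applying the rounding to both H and W is an illustrative choice, not a description of what tt_lib does internally.

```python
TILE_DIM = 32  # assumed tile height/width, per the multiple-of-32 remark


def is_tile_aligned(dim):
    """True when a dimension is an exact multiple of the tile size."""
    return dim % TILE_DIM == 0


def round_up_to_tile(dim):
    """Smallest multiple of TILE_DIM that is >= dim."""
    return -(-dim // TILE_DIM) * TILE_DIM


def padded_shape(shape):
    """Round the last two dims (H, W) of an (N, C, H, W) shape up to tiles."""
    n, c, h, w = shape
    return (n, c, round_up_to_tile(h), round_up_to_tile(w))


if __name__ == "__main__":
    shape = (6, 1, 140, 110)
    print(is_tile_aligned(shape[3]))  # W = 110
    print(padded_shape(shape))
```

For the shape above, neither H = 140 nor W = 110 is a multiple of 32, so it exercises the unaligned path in both dimensions.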

@tt-aho
Contributor

tt-aho commented May 2, 2024

I don't see an error when changing the test to that shape. Could you rebuild/retry?

umadevimcw added a commit that referenced this issue May 3, 2024
umadevimcw added a commit that referenced this issue May 3, 2024
@umadevimcw
Contributor

@tt-aho I cloned the repo and compiled it, and I no longer get the above error.

umadevimcw added a commit that referenced this issue May 6, 2024
umadevimcw added a commit that referenced this issue May 6, 2024
umadevimcw added a commit that referenced this issue May 6, 2024
@umadevimcw
Contributor

Closing as the test passes with the updated config

ankitmcw pushed a commit that referenced this issue May 7, 2024
ankitmcw pushed a commit that referenced this issue May 7, 2024