
[Bug Report] Result of a high dimension matmul corrupts when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor #9849

Closed
marty1885 opened this issue Jun 29, 2024 · 7 comments
Labels: bug (Something isn't working), community

@marty1885
Contributor

marty1885 commented Jun 29, 2024

Describe the bug

The result of a 1x20x32x256 by 1x10x256x32 matmul is corrupted when the program cache is enabled and a few other very specific conditions are met. The situation is weirdly specific, but it happens to be exactly what I am doing when running GGML. I was able to create a minimal example:

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s,
        tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()));
}

int main()
{
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled

    auto m1 = make_random_tensor(tt::tt_metal::Shape{1, 10, 32, 256});
    auto m2 = make_random_tensor(tt::tt_metal::Shape{1, 20, 32, 256});

    auto m3 = make_random_tensor(tt::tt_metal::Shape{1, 1, 32, 32}); // m3 must already hold an allocated tensor
    // The transpose() in the following line must be there. Directly matmul-ing tensors created with the already-transposed shape does not trigger the bug
    m3 = ttnn::operations::matmul::matmul(m2, tt::tt_metal::transpose(m1, -2, -1), std::nullopt);

    // Declaring a fresh tensor instead of reassigning m3 does NOT trigger the bug:
    // auto m3 = ttnn::operations::matmul::matmul(m2, tt::tt_metal::transpose(m1, -2, -1), std::nullopt);

    auto& q = m3;
    std::vector<bfloat16> buf(q.shape().volume());
    tt::tt_metal::memcpy(buf.data(), q);
    std::cout << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // detect NaN and corrupted values
        if (std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;


    device->close();
}

To Reproduce
Steps to reproduce the behavior:

  1. Compile and run the above minimal example
    • The equivalent Python program does not experience the same issue
  2. Observe the output: NaN or corrupted value detected at index 10240
  3. Disable the program cache and observe that the NaN disappears.

Expected behavior

There should not be a NaN in the result whatsoever.


Environment information:

  • OS: Arch Linux
  • commit: 0ee098a
  • Card: Grayskull e75


@marty1885 marty1885 added the bug Something isn't working label Jun 29, 2024
@marty1885 marty1885 changed the title [Bug Report] Result of a high dimension matmul corrupted when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor [Bug Report] Result of a high dimension matmul corrupts when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor Jun 29, 2024
@TT-BrianLiu
Contributor

Hey Marty, I will take a look at this. We haven't done much testing with our matmul APIs from the C++ side, so thanks for pointing out this issue!

@TT-BrianLiu
Contributor

Can you help provide the full test file with the includes and show how you built and ran the test?

@marty1885
Contributor Author

@TT-BrianLiu No problem. I have uploaded the example code into a self-contained repository on my GitHub. Please let me know if the instructions in the README are not clear. I just tried again on c52e153 and I am experiencing the same issue. LMK if you cannot replicate it on your machine.

https://github.com/marty1885/ttnn-matmul-corruption-demo

@TT-BrianLiu
Contributor

Thank you! I will try running the test

@TT-BrianLiu
Contributor

I was able to repro it and I pushed the test here: jedi

@TT-BrianLiu
Contributor

TT-BrianLiu commented Jul 5, 2024

I figured out your issue. Our matmuls either support [B, 1, M, K] x [1, 1, K, N] (bcast_batch=True) or [B, 1, M, K] x [B, 1, K, N] (bcast_batch=False). In this test, the batches for input 1 (m2) and input 2 (m1) don't match, so this matmul should actually assert out, but it doesn't. Instead, our matmul is treating this as the second bcast_batch=False case and reading input 2 (m1) as a full [20, 1, 256, 32] tensor when only [2, 1, 256, 32] is actually allocated.
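For illustration, here is a minimal sketch of the kind of batch-dimension check that is missing (the type and function names below are mine for this sketch, not the actual tt-metal validation code):

#include <array>
#include <cstdint>
#include <stdexcept>

// Hypothetical 4D shape alias for this sketch only; the real code uses tt::tt_metal::Shape.
using Shape4D = std::array<uint32_t, 4>;

// Input 2 must either have batch 1 (bcast_batch=True) or the same batch as
// input 1 (bcast_batch=False); anything else should assert instead of letting
// the matmul read past the memory allocated for input 2.
void validate_matmul_batch(const Shape4D& in0, const Shape4D& in1) {
    const uint32_t batch0 = in0[0] * in0[1];
    const uint32_t batch1 = in1[0] * in1[1];
    if (batch1 != 1 && batch1 != batch0) {
        throw std::runtime_error("matmul: input 2 batch must be 1 or match input 1 batch");
    }
    if (in0[3] != in1[2]) {
        throw std::runtime_error("matmul: inner (K) dims do not match");
    }
}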

So, the fix is simple. I will add the missing asserts for the matmul variants that are missing them, but let me explain what you're seeing in your tests. I will leave your test below for future reference since I will remove it when I merge the fix. I also removed everything that is not relevant (e.g. the transpose, the extra allocation of m3).

  • The output tile that starts being corrupted exactly corresponds to the batch dim of input 2 (m1). In your original test, you had [10, 1, 256, 32], so the 11th tile was being corrupted; I switched it to [2, 1, 256, 32] and then the 3rd tile started being corrupted (see the short calculation after this list). This is good evidence that the matmul is reading in garbage inputs for input 2 (m1).
  • The reason you see corrupted values only when the program cache is enabled is that program binaries are dumped in DRAM, and the tilize_with_zero_padding op inside make_random_tensor also defaults to DRAM for the output. Tensors and binaries are allocated bottom-up in DRAM, so when the matmul reads memory above the allocated space for input 2 (m1), it's reading in non-zero values from the program binaries. When the program cache isn't enabled, it's either reading in sensible (but wrong) values of input 1 (if you allocate m2, then m1) or zeros (if you allocate m1, then m2). If you enable the program cache and let the test print instead of aborting, you will see some more sensible (but wrong) values later, once it gets past the region of memory where the program binaries are allocated.
  • If you switch the output of make_random_tensor to L1 by passing in an L1 memory config (see code), you will see the same result regardless of whether the program cache is enabled. This is because the incorrect space the matmul is reading from for input 2 (m1) is now in L1, and that region is unaffected by the program cache. You will still see incorrect values that change depending on the order of allocation for m1 and m2, just not complete garbage.
auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::L1};
return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
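As a back-of-the-envelope check (my own illustration, not part of the fix), the reported corruption index follows directly from the tile size and the batch of input 2:

// Each output batch is one 32x32 tile (1024 bfloat16 values). Input 2 (m1) in
// the original report only had 10 batches allocated, so the first output tile
// that reads out-of-bounds input data is tile index 10, i.e. element 10240.
#include <cstddef>
constexpr std::size_t TILE_ELEMENTS     = 32 * 32;
constexpr std::size_t M1_BATCHES        = 10;
constexpr std::size_t FIRST_BAD_ELEMENT = M1_BATCHES * TILE_ELEMENTS;
static_assert(FIRST_BAD_ELEMENT == 10240, "matches the index in the issue title");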

Reference code:

#include "tensor/host_buffer/functions.hpp"
#include "tensor/types.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include <cstddef>
#include <tt_eager/tensor/tensor.hpp>
#include <ttnn/core.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/device.hpp>
#include <tt_dnn/op_library/fully_connected/fully_connected_op.hpp>
#include <tt_dnn/op_library/eltwise_unary/eltwise_unary_op.hpp>
#include <tt_eager/tensor/tensor.hpp>
#include <tt_dnn/op_library/transpose/transpose_op.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/composite/composite_ops.hpp"
#include "tt_dnn/op_library/tilize/tilize_op.hpp"
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/matmul.hpp>
#include <tt_dnn/op_library/update_cache/update_cache_op.hpp>

#include <vector>
#include <iostream>

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s,
        tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::DRAM};  // Switch to L1
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
}

int main() {
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled if m1 is in DRAM; otherwise, test will "pass" with bad results either way

    auto m2 = make_random_tensor(tt::tt_metal::Shape{20, 1, 32, 256});
    auto m1 = make_random_tensor(tt::tt_metal::Shape{2, 1, 256, 32});

    auto m3 = ttnn::operations::matmul::matmul(m2, m1, std::nullopt);

    std::vector<bfloat16> buf(m3.shape().volume());
    tt::tt_metal::memcpy(buf.data(), m3);
    std::cout << "Total ele: " << buf.size() << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if (i % 8 != 0) continue; // Only print every 8th element of each row
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // detect NaN and corrupted values
        if (std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;

    device->close();
}

TT-BrianLiu added commits that referenced this issue on Jul 5 and Jul 8, 2024
- This adds these checks to matmul_multicore and matmul_multicore_reuse as an intended side effect
@TT-BrianLiu
Contributor

Added the appropriate checks here: #10013
