
[Bug Report] Result of a high dimension matmul corrupts when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor #9849

Closed
marty1885 opened this issue Jun 29, 2024 · 7 comments
Labels: bug (Something isn't working), community

@marty1885
Contributor

marty1885 commented Jun 29, 2024

Describe the bug

The result of a 1x20x32x256 by 1x10x256x32 matmul is corrupted when the program cache is enabled and a few other very specific conditions are met. The situation is weirdly specific, but it happens to be exactly what I am doing when running GGML. I was able to create a minimal example:

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s,
        tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()));
}

int main()
{
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled

    auto m1 = make_random_tensor(tt::tt_metal::Shape{1, 10, 32, 256});
    auto m2 = make_random_tensor(tt::tt_metal::Shape{1, 20, 32, 256});

    auto m3 = make_random_tensor(tt::tt_metal::Shape{1, 1, 32, 32}); // m3 must already hold an allocated tensor
    // The transpose() in the following line must be there. Directly matmul-ing tensors created with the already-transposed shape does not trigger the bug
    m3 = ttnn::operations::matmul::matmul(m2, tt::tt_metal::transpose(m1, -2, -1), std::nullopt);

    // Declaring a fresh tensor instead of reassigning m3 does NOT trigger the bug:
    // auto m3 = ttnn::operations::matmul::matmul(m2, tt::tt_metal::transpose(m1, -2, -1), std::nullopt);

    auto& q = m3;
    std::vector<bfloat16> buf(q.shape().volume());
    tt::tt_metal::memcpy(buf.data(), q);
    std::cout << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // detect NaN and corrupted values
        if (std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;


    device->close();
}

To Reproduce
Steps to reproduce the behavior:

  1. Compile and run the above minimal example
    • The equivalent Python program does not experience the same issue
  2. Observe the output: NaN or corrupted value detected at index 10240
  3. Disable the program cache and observe that the NaN disappears.

Expected behavior

There should not be a NaN in the result whatsoever.


Environment information:

  • OS: Arch Linux
  • commit: 0ee098a
  • Card: Grayskull e75


@marty1885 marty1885 added the bug Something isn't working label Jun 29, 2024
@marty1885 marty1885 changed the title [Bug Report] Result of a high dimension matmul corrupted when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor [Bug Report] Result of a high dimension matmul corrupts when program cache is enabled and assigning to an already allocated tensor and the 2nd matrix is transposed. At index 10240 of the result tensor Jun 29, 2024
@TT-BrianLiu
Contributor

Hey Marty, I will take a look at this. We haven't done much testing with our matmul APIs from the C++ side, so thanks for pointing out this issue!

@TT-BrianLiu
Contributor

Can you help provide the full test file with the includes and show how you built and ran the test?

@marty1885
Contributor Author

@TT-BrianLiu No problem. I have uploaded the example code into a self-contained repository on my GitHub. Please let me know if the instructions in the README are not clear. I just tried again on c52e153 and I am experiencing the same issue. LMK if you cannot replicate it on your machine.

https://github.com/marty1885/ttnn-matmul-corruption-demo

@TT-BrianLiu
Contributor

Thank you! I will try running the test

@TT-BrianLiu
Contributor

I was able to repro it and I pushed the test here: jedi

@TT-BrianLiu
Contributor

TT-BrianLiu commented Jul 5, 2024

I figured out your issue. Our matmuls either support [B, 1, M, K] x [1, 1, K, N] (bcast_batch=True) or [B, 1, M, K] x [B, 1, K, N] (bcast_batch=False). In this test, the batches for input 1 (m2) and input 2 (m1) don't match, so this matmul should actually assert out, but it doesn't. Instead, our matmul is treating this as the second bcast_batch=False case and reading input 2 (m1) as a full [20, 1, 256, 32] tensor when only [2, 1, 256, 32] is actually allocated.
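For illustration, here is a minimal sketch of the kind of batch-dimension check that is missing (the type and function names below are mine for this sketch, not the actual tt-metal validation code):

#include <array>
#include <cstdint>
#include <stdexcept>

// Hypothetical 4D shape alias for this sketch only; the real code uses tt::tt_metal::Shape.
using Shape4D = std::array<uint32_t, 4>;

// Input 2 must either have batch 1 (bcast_batch=True) or the same batch as
// input 1 (bcast_batch=False); anything else should assert instead of letting
// the matmul read past the memory allocated for input 2.
void validate_matmul_batch(const Shape4D& in0, const Shape4D& in1) {
    const uint32_t batch0 = in0[0] * in0[1];
    const uint32_t batch1 = in1[0] * in1[1];
    if (batch1 != 1 && batch1 != batch0) {
        throw std::runtime_error("matmul: input 2 batch must be 1 or match input 1 batch");
    }
    if (in0[3] != in1[2]) {
        throw std::runtime_error("matmul: inner (K) dims do not match");
    }
}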

So, the fix is simple. I will add the missing asserts for the matmul variants that are missing them, but let me explain what you're seeing in your tests. I will leave your test below for future reference since I will remove it when I merge the fix. I also removed everything that is not relevant (e.g. the transpose, the extra allocation of m3).

  • The output tile that starts being corrupted exactly corresponds to the batch dim of input 2 (m1). In your original test, you had [10, 1, 256, 32], so the 11th tile was being corrupted; I switched it to [2, 1, 256, 32] and then the 3rd tile started being corrupted (see the short calculation after this list). This is good evidence that the matmul is reading in garbage inputs for input 2 (m1).
  • The reason you see corrupted values only when the program cache is enabled is that program binaries are dumped in DRAM, and the tilize_with_zero_padding op inside make_random_tensor also defaults to DRAM for the output. Tensors and binaries are allocated bottom-up in DRAM, so when the matmul reads memory above the allocated space for input 2 (m1), it's reading in non-zero values from the program binaries. When the program cache isn't enabled, it's either reading in sensible (but wrong) values of input 1 (if you allocate m2, then m1) or zeros (if you allocate m1, then m2). If you enable the program cache and let the test print instead of aborting, you will see some more sensible (but wrong) values later, once it gets past the region of memory where the program binaries are allocated.
  • If you switch the output of make_random_tensor to L1 by passing in an L1 memory config (see code), you will see the same result regardless of whether the program cache is enabled. This is because the incorrect space the matmul is reading from for input 2 (m1) is now in L1, and that region is unaffected by the program cache. You will still see incorrect values that change depending on the order of allocation for m1 and m2, just not complete garbage.
auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::L1};
return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
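As a back-of-the-envelope check (my own illustration, not part of the fix), the reported corruption index follows directly from the tile size and the batch of input 2:

// Each output batch is one 32x32 tile (1024 bfloat16 values). Input 2 (m1) in
// the original report only had 10 batches allocated, so the first output tile
// that reads out-of-bounds input data is tile index 10, i.e. element 10240.
#include <cstddef>
constexpr std::size_t TILE_ELEMENTS     = 32 * 32;
constexpr std::size_t M1_BATCHES        = 10;
constexpr std::size_t FIRST_BAD_ELEMENT = M1_BATCHES * TILE_ELEMENTS;
static_assert(FIRST_BAD_ELEMENT == 10240, "matches the index in the issue title");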

Reference code:

#include "tensor/host_buffer/functions.hpp"
#include "tensor/types.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include <cstddef>
#include <tt_eager/tensor/tensor.hpp>
#include <ttnn/core.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/device.hpp>
#include <tt_dnn/op_library/fully_connected/fully_connected_op.hpp>
#include <tt_dnn/op_library/eltwise_unary/eltwise_unary_op.hpp>
#include <tt_eager/tensor/tensor.hpp>
#include <tt_dnn/op_library/transpose/transpose_op.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/composite/composite_ops.hpp"
#include "tt_dnn/op_library/tilize/tilize_op.hpp"
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/matmul.hpp>
#include <tt_dnn/op_library/update_cache/update_cache_op.hpp>

#include <vector>
#include <iostream>

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s,
        tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    auto MEMORY_CONFIG = MemoryConfig{.memory_layout = tt::tt_metal::TensorMemoryLayout::INTERLEAVED, .buffer_type = BufferType::DRAM};  // Switch to L1
    return tt::tt_metal::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()), MEMORY_CONFIG);
}

int main() {
    auto device = &ttnn::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device); // Program cache HAS to be enabled if m1 is in DRAM; otherwise, test will "pass" with bad results either way

    auto m2 = make_random_tensor(tt::tt_metal::Shape{20, 1, 32, 256});
    auto m1 = make_random_tensor(tt::tt_metal::Shape{2, 1, 256, 32});

    auto m3 = ttnn::operations::matmul::matmul(m2, m1, std::nullopt);

    std::vector<bfloat16> buf(m3.shape().volume());
    tt::tt_metal::memcpy(buf.data(), m3);
    std::cout << "Total ele: " << buf.size() << std::endl;
    for(size_t i = 0; i < buf.size(); i++) {
        if (i % 8 != 0) continue; // Only print every 8th element of each row
        if(i % 32 == 0)
            std::cout << std::endl;
        if(i % 1024 == 0)
            std::cout << std::endl;
        std::cout << buf[i].to_float() << " ";

        // detect NaN and corrupted values
        if (std::isnan(buf[i].to_float()) || std::abs(buf[i].to_float()) > 100) {
            std::cerr << "NaN or corrupted value detected at index " << i << std::endl;
            abort();
        }
    }

    std::cout << std::endl;

    device->close();
}

TT-BrianLiu added commits that referenced this issue on Jul 5 and Jul 8, 2024
- This adds these checks to matmul_multicore and matmul_multicore_reuse as an intended side effect
@TT-BrianLiu
Contributor

Added the appropriate checks here: #10013
