# Matrix Multiplication

In [None]:
import os

In [None]:
import ttnn

device_id = 0
device = ttnn.open_device(device_id=device_id)

## Enable program cache

Enabling the program cache speeds up operations that run repeatedly, because compiled kernels are reused instead of being rebuilt on every call.

In [None]:
device.enable_program_cache()

## Configuration

In [None]:
m = 1024
k = 1024
n = 1024

## Initialize tensors a and b with random values

In [None]:
a = ttnn.rand((m, k), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT)
b = ttnn.rand((k, n), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT)

## Matrix multiply tensors a and b
The operation takes longer the first time it runs because the kernels have to be compiled.

In [None]:
output = a @ b

Re-running the operation is significantly faster because the compiled program is reused from the program cache.

In [None]:
output = a @ b
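To see the difference between the first and second run, the two calls can be timed. The sketch below is a minimal, device-independent helper; `time_call` and its `sync` parameter are hypothetical names introduced here, and the stand-in workload only exists so that the cell runs without hardware. The commented line at the end assumes that `ttnn.synchronize_device` blocks until queued device work has finished.

```python
import time


def time_call(fn, sync=None, repeats=2):
    """Return the wall-clock duration in seconds of each of `repeats` calls to fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for outstanding work before stopping the clock
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload so the sketch runs without a device.
durations = time_call(lambda: sum(range(100_000)))
print([f"{d * 1e3:.3f} ms" for d in durations])

# On a device (assumption: synchronize_device waits for queued work to finish):
# durations = time_call(lambda: a @ b, sync=lambda: ttnn.synchronize_device(device))
```

With the program cache enabled, the first entry would be expected to include kernel compilation and the second entry would not.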

## Inspect the layout of matrix multiplication output

In [None]:
print(output.layout)

As can be seen, matrix multiplication produces outputs in a tile layout. That is because it's much more efficient to use this layout for computing matrix multiplications on Tenstorrent accelerators compared to a row-major layout.

This is also why the logs can show tilize operations: inputs that are in a row-major layout get automatically converted to the tile layout, one tilize per converted input. The tensors in this notebook are already created with `ttnn.TILE_LAYOUT`.

Learn more about the tile layout in the [tensor layouts tech report](https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/tensor_layouts/tensor_layouts.md#32-tiled-layout).
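As a rough illustration of what "tile layout" means, the NumPy sketch below regroups a row-major matrix into 32x32 tiles. It is a simplified model only: `to_tiles` is a hypothetical helper, and the sketch ignores any finer ordering the hardware applies within a tile.

```python
import numpy as np

TILE = 32  # tiles are 32x32 elements


def to_tiles(x, tile=TILE):
    """Regroup a 2D row-major array into a (tile_rows, tile_cols, tile, tile) array."""
    rows, cols = x.shape
    assert rows % tile == 0 and cols % tile == 0, "shape must be tile-aligned"
    # Split each axis into (tile index, offset within tile), then bring the
    # two tile indices to the front.
    return x.reshape(rows // tile, tile, cols // tile, tile).swapaxes(1, 2)


x = np.arange(64 * 96).reshape(64, 96)
tiles = to_tiles(x)
print(tiles.shape)  # (2, 3, 32, 32)

# tiles[i, j] is the 32x32 block starting at row 32*i, column 32*j.
print(np.array_equal(tiles[1, 2], x[32:64, 64:96]))  # True
```

Storing each tile contiguously means a compute kernel can fetch one whole 32x32 block at a time, which is the access pattern a blocked matrix multiplication needs.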

## Inspect the result of the matrix multiplication

To inspect the result, we first convert the output to row-major layout.

In [None]:
output = ttnn.to_layout(output, ttnn.ROW_MAJOR_LAYOUT)

print("Printing ttnn tensor")
print(f"shape: {output.shape}")
print(f"chunk of a tensor:\n{output[:1, :32]}")
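Beyond printing a slice, the result can be compared against a reference computed on the host. Because `bfloat16` has limited precision, an exact match is not expected, so a correlation-based score is one reasonable check. The sketch below is an assumption-laden example: `pcc` is a hypothetical helper written here with NumPy, and the random arrays stand in for the real tensors.

```python
import numpy as np


def pcc(expected, actual):
    """Pearson correlation coefficient between two arrays, computed over all elements."""
    e = np.asarray(expected, dtype=np.float64).ravel()
    a = np.asarray(actual, dtype=np.float64).ravel()
    return float(np.corrcoef(e, a)[0, 1])


# Stand-in data: a reference and a slightly perturbed copy of it.
rng = np.random.default_rng(0)
reference = rng.standard_normal((64, 64))
noisy = reference + 1e-3 * rng.standard_normal((64, 64))

score = pcc(reference, noisy)
print(f"PCC: {score:.6f}")

# With the tensors from this notebook (assumption: ttnn.to_torch returns a torch
# tensor; .float() is needed because NumPy has no bfloat16 dtype):
# expected = (ttnn.to_torch(a).float() @ ttnn.to_torch(b).float()).numpy()
# actual = ttnn.to_torch(output).float().numpy()
# print(pcc(expected, actual))
```

A score close to 1.0 indicates that the device result tracks the reference; the acceptable threshold depends on the data type and the size of `k`.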

## Matrix multiply tensors a and b using a more performant config
By default, matrix multiplication might not be as efficient as it could be. To speed it up significantly, the user can specify the grid of cores that the matrix multiplication should use. The example below also keeps the inputs and the output in L1 memory.

In [None]:
a = ttnn.rand((m, k), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
b = ttnn.rand((k, n), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)

Run once to compile the kernels.

In [None]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))

Enjoy a massive speedup on subsequent runs, which reuse the compiled kernels.

In [None]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
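To get a feel for what an 8x8 core grid means for this problem size, the arithmetic below estimates how much of the output each core would handle. This is a back-of-the-envelope sketch that assumes the output tiles are split evenly in two dimensions across the grid; how the work is actually partitioned depends on the matmul implementation that `ttnn` selects.

```python
TILE = 32                # tile edge length in elements
BYTES_PER_ELEMENT = 2    # bfloat16

m = k = n = 1024
grid_y, grid_x = 8, 8

num_cores = grid_y * grid_x
tiles_m, tiles_n = m // TILE, n // TILE  # output size measured in tiles

# Assumption: an even 2D split of the output tiles across the core grid.
tiles_per_core = (tiles_m // grid_y, tiles_n // grid_x)

# Payload size of one m x k bfloat16 tensor.
tensor_mib = m * k * BYTES_PER_ELEMENT / 2**20

print(f"cores: {num_cores}")                           # cores: 64
print(f"output tiles: {tiles_m} x {tiles_n}")          # output tiles: 32 x 32
print(f"output tiles per core: {tiles_per_core}")      # output tiles per core: (4, 4)
print(f"size of one input tensor: {tensor_mib} MiB")   # size of one input tensor: 2.0 MiB
```

Under that assumption each core produces a 4x4 block of output tiles, so choosing a grid whose dimensions divide the tile counts evenly keeps the load balanced.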

## Close the device

In [None]:
ttnn.close_device(device)