Skip to content

sarihammad/tensorlib

Repository files navigation

TensorLib

A tensor library featuring expression templates, custom aligned allocators, zero-copy views, and SIMD-optimized matrix multiplication. Built for performance-critical applications.

Key Features

1. Expression Templates - Zero Intermediate Copies

Complex tensor operations like D = A + B + C + E are evaluated in a single pass without creating temporary objects. This is achieved through C++ template metaprogramming.

Tensor<float, 1000> A, B, C, D, E;
E = A + B * 2.0f - C / 3.0f + D;  // Single evaluation pass, no temporaries!

2. Custom 64-byte Aligned Allocator

All tensors are allocated with 64-byte alignment, crucial for:

  • AVX-512 SIMD instructions (512-bit = 64 bytes)
  • Preventing cache line splits
  • Optimal memory access patterns

3. Zero-Copy Views (TensorView)

Create lightweight views into tensor data without copying:

Tensor<float, 100, 100> large_tensor;
auto view = large_tensor.view();  // No data copy!
view[0] = 42;  // Modifies original tensor

4. SIMD-Optimized GEMM (Matrix Multiplication)

Optimized matrix multiplication using:

  • AVX-512 intrinsics (16 floats or 8 doubles at once)
  • AVX2 fallback (8 floats or 4 doubles at once)
  • Cache blocking/tiling for L1/L2 optimization
  • Manual prefetching with __builtin_prefetch
  • Loop unrolling

Performance on modern CPUs: 100+ GFLOPS for 256×256 matrix multiplication.

Architecture

Memory Layout & Alignment

graph TD
    A[Tensor Request] --> B[AlignedAllocator]
    B --> C{Check Alignment}
    C -->|64-byte boundary| D[posix_memalign]
    D --> E[Cache-Friendly Access]
    E --> F[SIMD Instructions]
    F --> G[High Performance]
    
    style B fill:#e1f5ff
    style E fill:#d4edda
    style G fill:#d4edda
Loading

Expression Template Evaluation Flow

graph LR
    A[A + B + C] --> B[Parse Time]
    B --> C[BinaryOp&lt;Add, A, BinaryOp&lt;Add, B, C&gt;&gt;]
    C --> D[Single Evaluation]
    D --> E[No Temporaries]
    
    F[Traditional: A + B + C] --> G[Temp1 = A + B]
    G --> H[Temp2 = Temp1 + C]
    H --> I[2 Allocations + Copies]
    
    style E fill:#d4edda
    style I fill:#f8d7da
Loading

GEMM Optimization Pipeline

flowchart TD
    A[Matrix A: M×K] --> B{Size Check}
    A1[Matrix B: K×N] --> B
    
    B --> C[Cache Blocking]
    C --> D[Split into Blocks]
    D --> E[64×64×64 tiles]
    
    E --> F{SIMD Available?}
    F -->|AVX-512| G[16 floats/cycle]
    F -->|AVX2| H[8 floats/cycle]
    F -->|Scalar| I[1 float/cycle]
    
    G --> J[Prefetch Next Block]
    H --> J
    I --> J
    
    J --> K[FMA Instructions]
    K --> L[Output Matrix C: M×N]
    
    style G fill:#d4edda
    style H fill:#fff3cd
    style I fill:#f8d7da
Loading

Project Structure

tensor-library/
├── tensor.hpp          # Main tensor class with expression templates
├── gemm.hpp           # Optimized matrix multiplication
├── main.cpp           # Demo and test program
├── benchmark.cpp      # Google Benchmark integration
├── CMakeLists.txt     # CMake build configuration
├── Makefile           # Simple make-based build
└── README.md          # This file

Building and Running

Option 1: Using Make (Simplest)

make              # Build with automatic SIMD detection
make run          # Build and run demo
make clean        # Clean build artifacts

Option 2: Using CMake

mkdir build && cd build
cmake ..
make
./tensor_demo

Option 3: Manual Compilation

# With AVX-512 support
g++ -std=c++20 -O3 -march=native -mavx512f -mavx512dq main.cpp -o tensor_demo

# With AVX2 support only
g++ -std=c++20 -O3 -march=native -mavx2 -mfma main.cpp -o tensor_demo

# Scalar fallback
g++ -std=c++20 -O3 main.cpp -o tensor_demo

Usage Examples

Basic Tensor Operations

#include "tensor.hpp"
using namespace tensor;

// Create tensors
Tensor<float, 4> A{1.0f, 2.0f, 3.0f, 4.0f};
Tensor<float, 4> B{5.0f, 6.0f, 7.0f, 8.0f};

// Expression templates (no intermediate copies!)
Tensor<float, 4> C = A + B * 2.0f - A / 3.0f;

// Element-wise operations
C = (A + B) * (C - A);

// Scalar operations
C = A * 3.14f + 2.71f;

Matrix Operations

#include "tensor.hpp"
#include "gemm.hpp"
using namespace tensor;

// 2D tensors (matrices)
Tensor<float, 3, 4> A;  // 3×4 matrix
Tensor<float, 4, 2> B;  // 4×2 matrix

// Initialize
for (std::size_t i = 0; i < 3; ++i)
    for (std::size_t j = 0; j < 4; ++j)
        A(i, j) = i + j;

// Matrix multiplication: C = A × B
auto C = matmul(A, B);  // Results in 3×2 matrix

// Transpose
auto A_T = transpose(A);  // 4×3 matrix

// Matrix-vector multiplication
Tensor<float, 4> x;
auto y = matvec(A, x);  // Results in 3-element vector

Zero-Copy Views

Tensor<float, 100, 100> large_tensor;

// Create view (no copy)
auto view = large_tensor.view();

// Modify through view
view[0] = 999.0f;
// large_tensor[0] is now 999.0f!

// Views share the same memory
assert(view.data() == large_tensor.data());

Multi-dimensional Indexing

Tensor<float, 2, 3, 4> tensor3d;  // 3D tensor

// Access elements
tensor3d(0, 1, 2) = 42.0f;

// Linear indexing also works
tensor3d[5] = 3.14f;

Technical Deep Dive

Expression Templates Explained

Traditional approach (creates temporaries):

Tensor D = A + B + C;
// Compiled as:
// Temp1 = A + B        (allocation + computation)
// D = Temp1 + C        (allocation + computation)

With expression templates:

Tensor D = A + B + C;
// Compiled as:
// for (i = 0; i < size; ++i)
//     D[i] = A[i] + B[i] + C[i]  (single pass)

Type Structure

classDiagram
    class Expression~E~ {
        <<interface>>
        +self() E&
        +operator[](i) auto
        +size() size_t
    }
    
    class Tensor~T, Dims~ {
        -unique_ptr~T[]~ data_
        -AlignedAllocator alloc_
        +operator=(Expression)
        +operator[](i) T&
        +data() T*
        +view() TensorView
    }
    
    class BinaryOp~Op, L, R~ {
        -const L& left_
        -const R& right_
        +operator[](i) auto
    }
    
    class TensorView~T, Dims~ {
        -T* data_
        -array strides_
        +operator[](i) T&
    }
    
    Expression <|-- Tensor
    Expression <|-- BinaryOp
    Expression <|-- TensorView
    Tensor --> TensorView : creates
    BinaryOp --> Expression : composes
Loading

SIMD Optimization Levels

Instruction Set Vector Width Floats/Cycle Doubles/Cycle
Scalar N/A 1 1
SSE 128-bit 4 2
AVX2 256-bit 8 4
AVX-512 512-bit 16 8

Cache Blocking Strategy

graph TD
    A[Large Matrix A: 1024×1024] --> B[Block 1: 64×64]
    A --> C[Block 2: 64×64]
    A --> D[Block 3: 64×64]
    A --> E[...]
    
    B --> F[Fits in L1 Cache]
    F --> G[Minimize Cache Misses]
    G --> H[Higher Performance]
    
    style F fill:#d4edda
    style H fill:#d4edda
Loading

Performance Characteristics

Memory Alignment Benefits

Unaligned Access:  [-----|-----|-----|-----]
                         ^
                    Split across cache lines = 2× memory access

Aligned Access:    [-----------|]
                   ^
                   Single cache line access

GEMM Performance

Measured on modern CPU (AVX2/AVX-512):

Matrix Size Time (ms) GFLOPS SIMD
32×32 0.02 3.2 AVX-512
128×128 0.8 52 AVX-512
256×256 6.5 103 AVX-512
512×512 48 112 AVX-512

GFLOPS = (2 × M × N × K) / (time × 10⁹)

Testing

Run the demo program to see all features in action:

./tensor_demo

Output includes:

  • SIMD instruction set detection
  • Memory alignment verification
  • Expression template demonstrations
  • View/slicing examples
  • Matrix multiplication correctness
  • Performance benchmarks

Running Benchmarks

With Google Benchmark installed:

make benchmark_tensor
./benchmark_tensor

Customization

Custom Element Types

// Works with any arithmetic type
Tensor<double, 100> doubles;
Tensor<int, 50, 50> integers;

// Even custom types (must support arithmetic ops)
struct Complex {
    float real, imag;
    Complex operator+(const Complex& o) const { /* ... */ }
    // ... other operators
};
Tensor<Complex, 10> complex_tensor;

Custom Alignment

// Default is 64-byte alignment
template<typename T, std::size_t Alignment>
class AlignedAllocator { /* ... */ };

// Use 32-byte alignment instead
using Alloc32 = AlignedAllocator<float, 32>;

*Can use but not by default

About

High-Performance Tensor Library

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published