TensorLang: A Native AI/ML Language

TensorLang is an open-source, native machine learning (ML) language designed to compile tensor operations directly into GPU-accelerated kernels. This project addresses pain points in Python-based ML frameworks by offering a streamlined, hardware-aware alternative for high-performance tensor computations.

Overview

TensorLang eliminates Python bottlenecks by providing a unified stack for parsing, type checking, and GPU code generation. The language compiles directly to optimized CUDA kernels, enabling native GPU acceleration for machine learning workloads.

Key Benefits:

Native GPU acceleration - Direct CUDA kernel generation
Type safety - Compile-time tensor shape and type checking
Zero Python overhead - Native tensor operations without interpreter bottlenecks
ML-first design - Built specifically for neural network operations
Extensible architecture - Clean separation of parsing, type checking, and code generation

Features

Tensor Declarations: Explicit tensor types with shape inference (Tensor[f32, (batch, features)])
Comprehensive Operations: 40+ operations covering linear algebra, activations, reductions, and comparisons
GPU Acceleration: Compiles to optimized CUDA kernels with shared memory and broadcasting
Neural Network Support: Build complete MLPs, classification networks, and feature extractors
Batch Processing: Efficient handling of mini-batch training and inference
Memory Management: Automatic GPU memory allocation and cleanup
Pipeline Operations: Chain complex operations (slice → activate → reduce)
Comment Support: Single-line (//) and multi-line (/* */) comments
Comprehensive Testing: 47 test cases, each validated against expected results

Operations

Tensor Creation & Manipulation

Literals: [[1.0, 2.0], [3.0, 4.0]] - Multi-dimensional tensor literals
Fill: fill(0.0, (3, 3)) - Create tensors with constant values
Slicing: tensor[1:3, :] - Extract subtensors with NumPy-style syntax

Linear Algebra

Matrix Multiplication: matmul(A, B) - Optimized GPU matrix multiplication
Element-wise Operations: add, minus, mult, div with broadcasting support
Linear Layers: linear(input, weight, bias) - Complete neural network layers
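
As a reading aid, the pure-Python sketch below shows what a call such as linear(input, weight, bias) is assumed to compute: a matrix product followed by a bias add that is broadcast across rows. The shape convention (input (batch, in), weight (in, out), bias (out,)) is inferred from the MLP example in this README. The helper functions are illustrative reference code, not TensorLang's implementation.

```python
def matmul(a, b):
    """Multiply an (m, k) matrix by a (k, n) matrix, both given as nested lists."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]

def linear(x, weight, bias):
    """Assumed semantics: x @ weight + bias, with bias broadcast over rows."""
    out = matmul(x, weight)
    if len(bias) != len(out[0]):
        raise ValueError("bias length must equal the number of output features")
    return [[v + b for v, b in zip(row, bias)] for row in out]

x = [[1.0, 2.0], [3.0, 4.0]]              # shape (2, 2): batch of 2, 2 features
w = [[1.0, 0.0, -1.0], [0.5, 1.0, 2.0]]   # shape (2, 3)
b = [0.1, 0.2, 0.3]                       # shape (3,)
y = linear(x, w, b)                       # shape (2, 3)
```

A reference like this can be handy for checking the values a compiled kernel writes out on small inputs.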

Activation Functions

ReLU: relu(x) - Rectified Linear Unit activation
Sigmoid: sigmoid(x) - Logistic activation function
Tanh: tanh(x) - Hyperbolic tangent activation
Softmax: softmax(x, axis=1) - Normalized exponential with numerical stability
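
The "numerical stability" mentioned for softmax usually refers to subtracting the row maximum before exponentiating, which this README also names under Optimization Features. The sketch below illustrates that technique in plain Python for axis=1 on a 2-D input; it is a reference illustration, not the generated CUDA kernel.

```python
import math

def softmax_row(row):
    """Softmax of one row, shifted by the row maximum to avoid overflow."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

def softmax_axis1(matrix):
    """Apply softmax independently to each row (axis=1) of a 2-D nested list."""
    return [softmax_row(row) for row in matrix]

# math.exp(1000.0) alone raises OverflowError; the shifted form does not.
logits = [[1000.0, 1001.0, 1002.0],
          [0.0, 0.0, 0.0]]
probs = softmax_axis1(logits)
```

Because softmax is invariant to adding a constant to every input in a row, the shift changes only the intermediate values, not the result.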

Reductions

Statistical: sum, mean - Along specified axes or full tensor
Min/Max: min, max - Find minimum/maximum values
Argmin/Argmax: argmin, argmax - Locate indices of extreme values
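
To make the axis argument concrete, here is a small pure-Python sketch of axis-wise reductions on a 2-D tensor. It assumes the NumPy convention (axis=0 collapses rows and yields one value per column; axis=1 yields one value per row) and that argmax returns the first index on ties; neither detail is confirmed by this README.

```python
def reduce_axis(matrix, axis, fn):
    """Apply fn along one axis of a 2-D nested list."""
    if axis == 0:
        lines = zip(*matrix)      # one line per column
    elif axis == 1:
        lines = matrix            # one line per row
    else:
        raise ValueError("axis must be 0 or 1 for a 2-D tensor")
    return [fn(list(line)) for line in lines]

def mean(values):
    return sum(values) / len(values)

def argmax(values):
    """Index of the largest value; the first index wins on ties (assumption)."""
    return max(range(len(values)), key=values.__getitem__)

data = [[1.0, 5.0, 2.0],
        [4.0, 3.0, 6.0]]
col_sums = reduce_axis(data, 0, sum)       # one sum per column
row_means = reduce_axis(data, 1, mean)     # one mean per row
row_argmax = reduce_axis(data, 1, argmax)  # index of the maximum in each row
```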

Comparisons

Element-wise: greater, less, equal - Boolean operations returning 0.0/1.0
Broadcasting: Support different tensor shapes following NumPy rules
Masking: Create conditional masks for data filtering
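
The sketch below illustrates the two ideas in this list: the NumPy broadcasting rule for shapes, and a comparison that encodes its result as 0.0/1.0 so it can be used as a multiplicative mask. It is plain Python written for illustration; the function names mirror the operation names above but are not TensorLang's implementation.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes under NumPy's rules."""
    dims = []
    # Align shapes from the trailing dimension; missing dimensions count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            dims.append(da)
        elif da == 1:
            dims.append(db)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(dims))

def greater(xs, ys):
    """Element-wise comparison on equal-length lists, encoded as 0.0/1.0."""
    return [1.0 if x > y else 0.0 for x, y in zip(xs, ys)]

data = [-2.0, 0.5, 3.0, -0.1]
mask = greater(data, [0.0] * len(data))      # [0.0, 1.0, 1.0, 0.0]
kept = [x * m for x, m in zip(data, mask)]   # non-positive entries become zero
```

For example, broadcast_shape((32, 10), (10,)) yields (32, 10), while shapes (2, 3) and (4,) are rejected.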

Pipeline Examples

// Neural network layer
let hidden = relu(linear(input, w1, b1))
let output = softmax(linear(hidden, w2, b2))

// Data preprocessing pipeline
let filtered = mult(data, greater(data, zeros))  // Zero out negative values
let normalized = div(filtered, sum(filtered))    // Scale so the values sum to 1

// Batch processing
let batch = data[0:32, :]                       // Select batch
let features = batch[:, 1:10]                   // Extract features
let predictions = argmax(softmax(linear(features, weights, bias)), axis=1)

Setup

Clone the repository and prepare the environment to build and run TensorLang:

System Requirements

sudo apt install nvidia-cuda-toolkit python3-dev
sudo apt install python3.12-venv
pip install pycuda lark numpy

Clone

git clone https://github.com/davro/tensor-lang.git
cd tensor-lang

Build

# Ensure the build script is executable
chmod +x build.sh

# Activate the environment
source build.sh

Tests

Single Test Execution

# Run individual TensorLang programs
python3 tensorlang.py tests/mlp_network.tl
python3 tensorlang.py tests/linear_classification.tl

Full Test Suite

# Run all 47 test cases with validation
python3 tests/runner.py

Test Coverage:

Basic Operations: Element-wise arithmetic, matrix multiplication
Activation Functions: ReLU, sigmoid, tanh, softmax
Reductions: Sum, mean, min, max, argmin, argmax
Comparisons: Greater, less, equal with broadcasting
Slicing: 1D/2D tensor slicing with range specifications
Linear Layers: 1D and batch processing with bias
Neural Networks: Complete MLP construction and classification
Pipeline Operations: Multi-stage tensor transformations
Performance: Large tensor operations (4096×4096 matrices)

Cache

TensorLang stores all compilation artifacts in a cache directory organized by input file:

Cache Structure (`cache/tests/program.tl/`)

kernel.cu: Generated CUDA source code with optimized kernels
kernel.so: Compiled shared library for GPU execution
tensor_name.npy: NumPy arrays containing computed tensor results
Logs: Detailed compilation and execution information
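
The helper below sketches how a Python script might locate these artifacts for a given source file. It assumes the cache directory simply mirrors the source path under cache/, as the heading above suggests; the function names and the tensor name used in the example are hypothetical.

```python
from pathlib import PurePosixPath

def cache_dir(source, cache_root="cache"):
    """Assumed layout: cache/<relative path of the .tl source file>/."""
    return PurePosixPath(cache_root) / PurePosixPath(source)

def artifact_paths(source, tensor_names):
    """Map artifact labels to the paths they are expected to have."""
    base = cache_dir(source)
    paths = {
        "cuda_source": base / "kernel.cu",
        "shared_library": base / "kernel.so",
    }
    for name in tensor_names:
        paths[name] = base / f"{name}.npy"   # one .npy file per tensor
    return paths

# "hidden" is a hypothetical tensor name used only for illustration.
paths = artifact_paths("tests/program.tl", ["hidden"])
```

With NumPy installed, a saved tensor could then be read with numpy.load(str(paths["hidden"])), provided the file exists.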

Cache Benefits

Reusability: Compiled kernels can be loaded directly in Python
Debugging: Inspect generated CUDA code and intermediate results
Integration: Use TensorLang computations in existing Python workflows
Performance: Skip recompilation for repeated executions

Architecture

Compilation Pipeline

Lexing & Parsing: Lark-based grammar with comprehensive syntax support
AST Construction: Build abstract syntax tree with operation dependencies
Type Checking: Validate tensor shapes, broadcasting rules, and operation compatibility
CUDA Generation: Emit optimized GPU kernels with shared memory and atomic operations
Compilation: nvcc compilation to shared libraries
Execution: PyCUDA-based kernel launching with memory management
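
To illustrate the type-checking stage, the following sketch propagates shapes through matmul and linear and rejects incompatible operands before any kernel would be generated. It is a simplified stand-in written for this README, not the project's checker; dimensions may be integers or symbolic names such as "batch", and the shapes are taken from the MLP example in this document.

```python
def matmul_shape(a, b):
    """Result shape of matmul for 2-D operands, or TypeError on mismatch."""
    if len(a) != 2 or len(b) != 2:
        raise TypeError("matmul expects 2-D operands")
    if a[1] != b[0]:
        raise TypeError(f"inner dimensions differ: {a[1]} vs {b[0]}")
    return (a[0], b[1])

def linear_shape(x, weight, bias):
    """Result shape of linear(x, weight, bias); bias must be (out_features,)."""
    out = matmul_shape(x, weight)
    if bias != (out[1],):
        raise TypeError(f"bias shape {bias} does not match output {out}")
    return out

# Shapes from the 2-layer MLP example; "batch" is a symbolic dimension.
hidden = linear_shape(("batch", 784), (784, 256), (256,))
logits = linear_shape(hidden, (256, 2), (2,))
```

Passing a weight of shape (512, 2) to the second layer would raise TypeError here, which is the kind of error compile-time shape checking is meant to surface early.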

Design Principles

Type Safety: Compile-time shape checking prevents runtime errors
GPU-First: All operations generate native CUDA kernels
Composability: Operations chain naturally with automatic memory management
Performance: Optimized kernels with broadcasting, shared memory, and numerical stability
Extensibility: Clean separation enables easy addition of new operations

Performance

Benchmark Results (NVIDIA GPU):

Matrix Multiplication: 4096×4096 matrices execute efficiently
Batch Processing: Linear layers handle large batch sizes
Memory Efficiency: Automatic GPU memory management with cleanup
Test Suite: 47 tests complete in ~2 minutes including compilation

Optimization Features:

Shared Memory: Reduction operations use efficient parallel patterns
Broadcasting: Hardware-accelerated element-wise operations
Kernel Fusion: Potential for combining operations (future work)
Numerical Stability: Softmax uses max subtraction, comparisons use tolerance
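
This README does not say which parallel pattern the reduction kernels use. A common choice for shared-memory reductions is the pairwise (tree) reduction, simulated sequentially in Python below as an illustration of the idea; each pass of the inner loop stands in for one thread of a block, and the buffer stands in for shared memory.

```python
def tree_reduce_sum(values):
    """Sum values with the halving-stride pattern typical of GPU reductions."""
    buf = list(values)            # stands in for a shared-memory buffer
    size = 1
    while size < len(buf):        # round the length up to a power of two
        size *= 2
    buf += [0.0] * (size - len(buf))   # pad with the identity for addition

    stride = size // 2
    while stride > 0:
        # On a GPU these iterations run in parallel, one per thread,
        # with a barrier between successive strides.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
    return buf[0]

total = tree_reduce_sum([1.0, 2.0, 3.0, 4.0, 5.0])
```

The pattern finishes in about log2(n) steps instead of n, which is why it suits block-level reductions.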

Code Examples

Neural Network Construction

// 2-layer MLP for binary classification
let input: Tensor[f32, (batch, 784)] = load_data()

// Hidden layer: 784 -> 256 features
let hidden_w: Tensor[f32, (784, 256)] = random_weights()
let hidden_b: Tensor[f32, (256,)] = zeros()
let hidden = relu(linear(input, hidden_w, hidden_b))

// Output layer: 256 -> 2 classes
let output_w: Tensor[f32, (256, 2)] = random_weights() 
let output_b: Tensor[f32, (2,)] = zeros()
let logits = linear(hidden, output_w, output_b)
let probs = softmax(logits, axis=1)

// Predictions
let predictions = argmax(probs, axis=1)

Data Processing Pipeline

// Feature extraction and normalization
let raw_data: Tensor[f32, (1000, 50)] = load_features()

// Zero out outliers (|x| >= 3) using a mask
let outlier_mask = less(abs(raw_data), fill(3.0, (1000, 50)))
let clean_data = mult(raw_data, outlier_mask)

// Normalize by column statistics  
let col_means = mean(clean_data, axis=0)
let centered = minus(clean_data, col_means)
let col_stds = sqrt(mean(mult(centered, centered), axis=0))
let normalized = div(centered, col_stds)
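
For checking this pipeline on small inputs, a plain-Python reference of the same steps can be useful. The sketch below mirrors the operations above (mask, center by column mean, divide by the population standard deviation); like the TensorLang version, it does not guard against a column whose standard deviation is zero. The function name and sample data are made up for illustration.

```python
import math

def column_normalize(rows, limit=3.0):
    """Mask values with |x| >= limit to zero, then standardize each column."""
    clean = [[x if abs(x) < limit else 0.0 for x in row] for row in rows]
    n = len(clean)
    means = [sum(col) / n for col in zip(*clean)]
    centered = [[x - m for x, m in zip(row, means)] for row in clean]
    # Population standard deviation: sqrt(mean(centered ** 2)) per column.
    stds = [math.sqrt(sum(c * c for c in col) / n) for col in zip(*centered)]
    return [[x / s for x, s in zip(row, stds)] for row in centered]

sample = [[1.0, 10.0],
          [2.5, 2.0],
          [-1.0, -4.0]]
normalized = column_normalize(sample)
```

After normalization each column has mean 0 and mean square 1, up to floating-point rounding.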

Future Work

Short Term

More Activations: GELU, Swish, LayerNorm primitives
Advanced Indexing: Boolean and fancy indexing support
Optimizations: Kernel fusion for operation chains
Error Handling: Better error messages and recovery

Medium Term

Automatic Differentiation: Gradient computation for training
Control Flow: Conditional operations and loops
Custom Functions: User-defined operations and layers
Memory Optimization: Memory pooling and reuse strategies

Long Term

Multi-GPU: Distributed tensor operations
Mixed Precision: FP16/FP32 automatic casting
MLIR Integration: Leverage compiler infrastructure
Hardware Backends: Support for other accelerators (ROCm, Metal, etc.)

Contributing

Contributions welcome! TensorLang is designed for extensibility:

New Operations: Add grammar rules, type checking, and CUDA kernels
Optimizations: Improve existing kernel implementations
Testing: Add test cases for edge cases and new operations
Documentation: Improve examples and architectural documentation

See CONTRIBUTING.md for detailed guidelines.

License

Licensed under the GNU Lesser General Public License v3 (LGPL-3.0); see the LICENSE file for details.

TensorLang - Native ML language with GPU acceleration. Built for performance, designed for productivity.
