Skip to content
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
721 lines (551 sloc) 35.7 KB

Generating tensor contractions with Circle and TACO

  1. Tensor Algebra Compiler
  2. Program 1: Compile-time TACO
  3. Program 2: TACO as a compile-time library
  4. Program 3: Code generation from TACO IR
    1. Function declaration
    2. Expression generation
    3. Statement generation
  5. Program 4: A stringy frontend

The limitations of template metaprogramming has frustrated C++ programmers for decades. Circle provides an integrated interpreter and introspection and reflection features to enable you to express your intent and lower it to implementation using clean, simple, imperative programming constructs.

Circle's reflection mechanisms allow you to utilize C++'s semantic features like name lookup, overload resolution and type safety, without actually having to code your software in C++. An appropriate intermediate language, like the tensor IR defined by TACO, can generate code into the translation unit when fed through statement and expression macros.

TensorFlow XLA defines a higher-level IR to support its more comprehensive set of transformations. The TensorFlow package includes tooling to generate LLVM IR directly from the operator semantics. As with the TACO code examined below, Circle reflection can be used to generate code from TensorFlow XLA IR. Circle reflection knocks down barriers to generating code from IR, allowing more projects to benefit from this strategy.

Tensor Algebra Compiler

The Tensor Algebra Compiler (taco) is a C++ library that computes tensor algebra expressions on sparse and dense tensors. It uses novel compiler techniques to get performance competitive with hand-optimized kernels in widely used libraries for both sparse tensor algebra and sparse linear algebra.

You can use taco as a C++ library that lets you load tensors, read tensors from files, and compute tensor expressions. You can also use taco as a code generator that generates C functions that compute tensor expressions.

The Tensor Algebra Compiler

The most direct way to use the TACO compiler is with the command-line tool taco. Just type in a tensor contraction--two of the same index on the right-hand side of the assignment indicates an implicit contraction, as in the Einstein summation notation.

$ taco "y(i) = M(i, j) * x(j)"
// Generated by the Tensor Algebra Compiler (

int compute(taco_tensor_t *y, taco_tensor_t *M, taco_tensor_t *x) {
  double* restrict y_vals = (double*)(y->vals);
  int M1_dimension = (int)(M->dimensions[0]);
  int M2_dimension = (int)(M->dimensions[1]);
  double* restrict M_vals = (double*)(M->vals);
  double* restrict x_vals = (double*)(x->vals);

  #pragma omp parallel for schedule(static)
  for (int32_t iM = 0; iM < M1_dimension; iM++) {
    double tj = 0.0;
    for (int32_t jM = 0; jM < M2_dimension; jM++) {
      int32_t pM2 = iM * M2_dimension + jM;
      tj += M_vals[pM2] * x_vals[jM];
    y_vals[iM] = tj;
  return 0;

The taco tool is a mini-compiler. Let's break down the phases of translation:

  1. A lexer and parser transforms input text in Einstein notation to "index notation," which is TACO's own class hierarchy for representing tensor algebra. Consider this a parse tree. It represents the intent of the operation, but not its implementation.
  2. An analyzer lowers index notation to intermediate representation (IR). TACO's IR is another class hierarchy, but different from the index notation classes. It's more fine-grained and represents many statement and expression constructs in the C language. This is the implementation of the tensor contraction. Although not in a form convenient for execution, it includes the scheduling choices made by TACO's backend for computing a correct and efficient tensor operation.
  3. A code generator recurses over the IR and emits C code into a C++ string stream. Each IR node represents a statement or a fragment of an expression, and when visited, recurses on its child nodes to generate a correct textural fragment for the stringified program.

How does an end-user incorporate TACO kernels? Paste the result of the taco tool into a source file, change the name from compute to something unique, and call it from your program.

Visual inspection of your kernel is nice, but this process is fairly inconvenient for use in larger programs. It doesn't feel like a C++ library at all. Compilation is tedious and it's not possible to generate kernels in response to template instantiation. The TACO designers address this with a second compilation model. The TACO compiler's headers are included by the client application, and a special Tensor API is used for configuring the tensor contraction.

Format csr({Dense,Sparse});
Format csf({Sparse,Sparse,Sparse});
Format  sv({Sparse});

// Create tensors
Tensor<double> A({2,3},   csr);
Tensor<double> B({2,3,4}, csf);
Tensor<double> c({4},     sv);

// Form a tensor-vector multiplication expression
IndexVar i, j, k;
A(i,j) = B(i,j,k) * c(k);

// Compile the expression

// Initialize the tensors B and c with data.
// ...

// Assemble A's indices and numerically compute the result

The compilation model for the API interface is more involved than for the command-line tool:

  1. Declare Tensor and IndexVar object instances from the taco.h header.
  2. Evaluate an assignment expression involving the tensors and indices. Overloaded arithmetic operators return index notation objects representing each operation. For example, the tensor index operation B(i,j,k) returns an instance of the Access object, which conveys intent about accessing a tensor. Note that this is not a compile-time operation. It doesn't use expression templates or other template metaprogramming tricks to build an index notation representation of the tensor contraction. This is the API alternative to parsing, and it is executed when the program is run rather than when it is compiled.
  3. Call Tensor::compile. This sends the index notation tree built in step 2 to for lowering into IR.
  4. The TACO library sends the resulting IR to the code generator, which creates a C-code string version of the tensor operation.
  5. The TACO library makes a system call to execute a Standard C++ compiler (such as gcc, clang or nvcc), pointing it to the generated code and producing a shared object in a temp directory.
  6. dlopen opens the generated shared libary and dlsym finds the address of the compiled function that affects the tensor operation.
  7. Tensor::compute calls the returned function pointer and passes data bound to the Tensor objects during setup.

Although the designers have allowed TACO to be used like a library, it came at considerable implementation cost. Why go through the trials of stringifying a C++ data structure into C++ code, handing it off to a different instance of a C++ compiler than the one that it's running from, generating a different binary product than what's already been built, and loading the resulting image into a process that's already running? Why, in other words, not just do this all at compile time?

The answer is that C++ has inadequate facilities for compile-time programming. The TACO developers couldn't use C++ to run the parser or evaluate overloaded operators to generate index notation at compile time. They couldn't lower the index notation to IR at compile time. And they certainly couldn't use C++ to generate code from IR at compile time.

Circle provides many tools for compile-time execution and code generation. These example programs take an unmodified installation of the TACO library and incrementally relieve TACO from compiler frontend and backend responsibilities, highlighting TACO's nature as a provider of scheduling intelligence, rather than parsing and code generation, which are merely ancillary responsiblities of the library.

Program 1: Compile-time TACO

In program 1, we'll use the taco command-line compiler to generate our tensor code. The tensor formula is provided as a string, and the resulting C code is printed to the terminal.


// Use popen to make a system call and capture the output in a file handle.
// Make it inline to prevent it from being emitted by the backend, unless
// called by real code.
inline std::string capture_call(const char* cmd) {
  FILE* f = popen(cmd, "r");

  // We can't seek within this stream, so read in a KB at a time.
  std::vector<char> vec;
  char line[1024];
  size_t count = 0;
  do {
    count = fread(line, 1, 1024, f);
    vec.insert(vec.end(), line, line + count);

  } while(1024 == count);


  return std::string(vec.begin(), vec.end());

We don't want to execute taco manually, but programmatically and at compile time, as part of our program's source code. The POSIX API popen will execute a shell command and capture the terminal output into a file handle. capture_call is our utility function for reading the file handle out into a std::string and closing the process's file handle. This is a general function, and you can use it both from runtime and compile-time code. We mark it inline so we can define it in the util.hxx header file.


#include "util.hxx"
#include <taco.h>

@meta int taco_kernel_count = 0;

template<typename... args_t>
@meta void call_taco(@meta const char* pattern, args_t&&... args) {

  // Run the taco system call.
  @meta std::string s = format("taco \"%s\" -print-nocolor", pattern);
  @meta std::string code = capture_call(s.c_str());

  // Print the code we're injecting.
  @meta printf("%s\n", code.c_str());

  // taco always returns kernels named "kernel". To avoid name collisions, 
  // create a new namespace. Use the global counter taco_kernel_count.
  @meta std::string ns_name = format("taco_kernel_%d", ++taco_kernel_count);

  // Inject the code into a custom namespace.
  @statements namespace(@(ns_name))(code, "taco kernel");

  // Call the function. Pass each argument tensor by address.

int main() {
  taco_tensor_t v { }, M { }, x { }, N { };

  // inject taco_kernel_1::compute
  call_taco("v(i) = M(i, j) * x(j)", v, M, x);

  // inject taco_kernel_2::compute
  call_taco("v(i) = M(i, j) * N(j, k) * x(k)", v, M, N, x);

  return 0;

The first program is really simple. call_taco is a meta function which takes the formula string as a meta parameter (meaning its value is available during source translation) and each of the tensor arguments as real parameters.

The formula string is formatted into a shell command for the taco program. The capture_call function is called in a meta statement, and returns a compile-time string code, which is echoed to the terminal for visual inspection.

Every kernel that taco generates is given the name compute. We need to avoid name collisions, so we place each kernel into a uniquely-named namespace; the taco_kernel_count counter assists in generating unique namespaces names. The captured text is tokenized and parsed into the unique namespace by using the @statements extension. Finally, we call the newly-minted compute function and forward each of the arguments as pointers, as the generated code expecs.

Note that the four @meta statements and the @statements injection all occur once--during source translation. The resulting call_taco metafunction may be invoked many times at runtime with different data, but each invocation will call the same instance of the taco_kernels::compute function.

Neither the taco command-line tool nor the TACO library are runtime dependencies of the taco1 example program. Simply by leveraging the @meta features of Circle, we were able to use TACO as a compile-time library, something not possible with Standard C++.

$ nm taco1 | c++filt
0000000000400670 T taco_kernel_1::compute(taco_tensor_t*, taco_tensor_t*, taco_tensor_t*)
00000000004006e0 T taco_kernel_2::compute(taco_tensor_t*, taco_tensor_t*, taco_tensor_t*, taco_tensor_t*)

Listing the symbols in taco1 reveals the taco-generated functions, reposing in their dynamically-named namespaces.

Program 2: TACO as a compile-time library

We've been making system calls to the taco compiler, capturing its terminal output, and injecting that into a Circle program. But TACO also comes as a library with more options for configuration. Our Circle program will now assume the responsibilities of the taco compiler, calling the parser, lowering- and code-generation functions directly from the TACO API. I modelled this utility function on the simplest possible sparsity-supporting path through the taco compiler.


#include "util.hxx"
#include <taco.h>
#include <taco/parser/lexer.h>
#include <taco/parser/parser.h>
#include <taco/lower/lower.h>
#include <../../taco/src/codegen/codegen.h>  // This should be in include/taco

struct format_t {
  const char* tensor_name;
  const char* format;

struct options_t {
  std::vector<format_t> formats;

inline std::string gen_taco_kernel(const char* pattern, 
  const options_t& options, const char* func_name) {
  using namespace taco;

  // Options for compilation that we aren't using.
  std::map<std::string, Format> formats;
  std::map<std::string, Datatype> datatypes;
  std::map<std::string, std::vector<int> > dimensions;
  std::map<std::string, TensorBase> tensors;

  for(format_t format : options.formats) {
    std::vector<ModeFormatPack> modePacks;
    int len = strlen(format.format);
    for(int i = 0; i < len; ++i) {
      switch(format.format[i]) {
        case 'd': 
          modePacks.push_back({ ModeFormat::Dense });

        case 's': 
          modePacks.push_back({ ModeFormat::Sparse });

        case 'u':
          modePacks.push_back({ ModeFormat::Sparse(ModeFormat::NOT_UNIQUE) });

        case 'c': 
          modePacks.push_back({ ModeFormat::Singleton(ModeFormat::NOT_UNIQUE) });

        case 'q':
          modePacks.push_back({ ModeFormat::Singleton+ });


  // Make a simple parser. Need to add a larger default dimension or else the
  // code generator defaults to the small dimension of 5.
  parser::Parser parser(pattern, formats, datatypes, dimensions, tensors, 1000);

  // Parse the pattern.
  const TensorBase& tensor = parser.getResultTensor();

  // Lower the parsed tensor to IR.
  std::set<old::Property> computeProperties;
  ir::Stmt compute = old::lower(tensor.getAssignment(), func_name, 
    computeProperties, tensor.getAllocSize());

  // Stream the C code to oss.
  std::ostringstream oss;
  std::shared_ptr<ir::CodeGen> codegen = ir::CodeGen::init_default(oss, 

  // Compile the IR to C code.

  return oss.str();

Like capture_call above, gen_taco_kernel is a normal C++ function. It can be used at runtime or compile time to generate C code from a tensor formula. Although we're only using the defaults, the formats, datatypes, dimensions and tensors maps can be used to specify different lowering properties.


@meta int kernel_counter = 0;

template<typename... args_t>
@meta void call_taco(@meta const char* pattern, @meta const options_t& options, 
  args_t&&... args) {

  // Generate a unique function name for each call_taco.
  @meta std::string name = format("compute_%d", ++kernel_counter);

  // Execute the gen_taco_kernel at compile time.
  @meta std::string code = gen_taco_kernel(pattern, options, name.c_str());

  // Print the emitted kernel.
  @meta printf("%s\n", code.c_str());

  // Inject the code into namespace taco_kernel.
  @statements namespace(taco_kernel)(code, name);

  // Call the function. Pass each argument tensor by address.

// TACO relies on macros TACO_MIN and TACO_MAX. Define them as alias to
// std::min and std::max
#define TACO_MIN std::min
#define TACO_MAX std::max

int main() {
  taco_tensor_t a { }, b { }, c { };

  // Declare each tensor as 1D and sparse.
  @meta options_t options { };
  @meta options.formats.push_back({ "a", "s" });
  @meta options.formats.push_back({ "b", "s" });
  @meta options.formats.push_back({ "c", "s" });

  call_taco("a(i) = b(i) + c(i)", options, a, b, c);

  return 0;

The remainder of the program resembles program 1. However, we're using the kernel counter to generate a unique function name rather than a unique namespace name. The meta function passes the formula to gen_taco_kernel, which uses the TACO class library to parser, lower and generate C code, and captures the code in a string. This string is injected into namespace taco_kernel as in program 1.

We're not using the taco command-line compiler anymore. We're using the TACO library, but at compile time, so there's no runtime dependency. When we invoke circle, we need to provide the path to to indicate that it's a compile-time library, so that the interpreter will search the module when dealing with unresolved functions in TACO's headers.

$ circle taco2.cxx -I /home/sean/projects/opensource/taco/include \
> -M /home/sean/projects/opensource/taco/Debug/lib/

int compute_1(taco_tensor_t *a, taco_tensor_t *b, taco_tensor_t *c) {
  double* restrict a_vals = (double*)(a->vals);
  int* restrict b1_pos = (int*)(b->indices[0][0]);
  int* restrict b1_crd = (int*)(b->indices[0][1]);
  double* restrict b_vals = (double*)(b->vals);
  int* restrict c1_pos = (int*)(c->indices[0][0]);
  int* restrict c1_crd = (int*)(c->indices[0][1]);
  double* restrict c_vals = (double*)(c->vals);

  int32_t pa1 = 0;
  int32_t pb1 = b1_pos[0];
  int32_t b1_end = b1_pos[1];
  int32_t pc1 = c1_pos[0];
  int32_t c1_end = c1_pos[1];
  while (pb1 < b1_end && pc1 < c1_end) {
    int32_t ib = b1_crd[pb1];
    int32_t ic = c1_crd[pc1];
    int32_t i = TACO_MIN(ib,ic);
    if (ib == i && ic == i) {
      a_vals[pa1] = b_vals[pb1] + c_vals[pc1];
    else if (ib == i) {
      a_vals[pa1] = b_vals[pb1];
    else {
      a_vals[pa1] = c_vals[pc1];
    pb1 += (int32_t)(ib == i);
    pc1 += (int32_t)(ic == i);
  while (pb1 < b1_end) {
    a_vals[pa1] = b_vals[pb1];
  while (pc1 < c1_end) {
    a_vals[pa1] = c_vals[pc1];
  return 0;

Program 3: Code generation from TACO IR

This program demonstrates code generation directly from IR. It eliminates the need to stringify IR into C code. The techniques shown here may be the biggest improvements that Circle provides over Standard C++. We'll recurse over TACO's hierarchical IR, emitting statements and expression fragments at each node. Circle's statement and expression macros make this operation no harder than printing the node contents.

Function declaration

TACO's IR is similar to C++ AST. Most of the nodes correspond to subexpressions; the rest code for statements. The Stmt node returned from lower_taco_kernel is always an instance of Function:


struct Function : public StmtNode<Function> {
  std::string name;
  Stmt body;
  std::vector<Expr> inputs;
  std::vector<Expr> outputs;
  static Stmt make(std::string name,
                   std::vector<Expr> outputs, std::vector<Expr> inputs,
                   Stmt body);
  std::pair<std::vector<Datatype>,Datatype> getReturnType() const;
  static const IRNodeType _type_info = IRNodeType::Function;

The trickiest part of generating code from TACO IR is generating the function declaration. The Function node holds a vector of output and input parameters, which are stored as Expr objects but hold Var nodes. We can't use meta control flow inside a function declaration to deposit function parameters. But hope is not lost: consider the @expand_params extension. This expands a class or typed-enum to a set of function parameters: each non-static data member (or enumerator) expands to a function parameter with the corresponding type and name.


@macro void gen_kernel(const ir::Function* function) {
  @meta const char* name = function->name.c_str();
  @meta std::string struct_name = format("%s_t", name);

  @macro void make_member(const std::vector<ir::Expr>& body) {
    @meta for(const ir::Expr& expr : body) {
      @meta const ir::Var* var =<ir::Var>();
      @static_type(mtype_from_var(var)) @(var->name);
  struct @(struct_name) {
    // TACO convention is to put output vars followed by input vars.
    @macro make_member(function->outputs); 
    @macro make_member(function->inputs);

  // Define a function. The parameters are each of the members of the
  // structure defined above.
  void @(name)(@expand_params(@(struct_name))) {
    @meta printf("Creating %s function\n", name);

    // Inject the body. We expect the body to be a Scope, representing the
    // function's braces.
    @macro stmt_inject(function->body);    

gen_kernel defines a struct to specify the function's parameters. It expands the make_member macro in the class-specifier, which executes a compile-time loop over all input and output variables, defining a data member corresponding to each function parameter. The @expand_params extension transforms the structure into a set of function parameters in a function declaration--this is how we can programmatically define function declarations.

Once the function is declared, we expand the stmt_inject macro on the function's body node, which recurses over the nodes in the TACO IR to define the function.

Expression generation


@macro auto expr_inject(const ir::Expr& expr) {
  @meta+ if(const ir::Literal* literal =<ir::Literal>()) {
    @emit return @expression(util::toString(expr));

  } else if(const ir::Var* var =<ir::Var>()) {
    @emit return @(var->name);

  } else if(const ir::Neg* neg =<ir::Neg>()) {
    @emit return -expr_inject(add->a);

  } else if(const ir::Sqrt* sqrt =<ir::Sqrt>()) {
    @emit return sqrt(expr_inject(add->a));

  } else if(const ir::Add* add =<ir::Add>()) {
    @emit return expr_inject(add->a) + expr_inject(add->b);

  } else if(const ir::Sub* sub =<ir::Sub>()) {
    @emit return expr_inject(sub->a) - expr_inject(sub->b);

  } else if(const ir::Mul* mul =<ir::Mul>()) {
    @emit return expr_inject(mul->a) * expr_inject(mul->b);

  } else if(const ir::Div* div =<ir::Div>()) {
    @emit return expr_inject(div->a) / expr_inject(div->b);

  } else if(const ir::Rem* rem =<ir::Rem>()) {
    @emit return expr_inject(rem->a) % expr_inject(rem->b);

  } else if(const ir::Min* min =<ir::Min>()) {
    static_assert(2 == min->operands.size(), "only 2-operand min supported");
    @emit return std::min(

  } else if(const ir::Max* max =<ir::Max>()) {
    @emit return std::max(expr_inject(max->a), expr_inject(max->b));

  } else if(const ir::BitAnd* bit_and =<ir::BitAnd>()) {
    @emit return expr_inject(bit_and->a) & expr_inject(bit_and->b);

  } else if(const ir::BitOr* bit_or =<ir::BitOr>()) {
    @emit return expr_inject(bit_or->a) | expr_inject(bit_or->b);

  } else if(const ir::Eq* eq =<ir::Eq>()) {
    @emit return expr_inject(eq->a) == expr_inject(eq->b);

  } else if(const ir::Neq* neq =<ir::Neq>()) {
    @emit return expr_inject(neq->a) != expr_inject(neq->b);

  } else if(const ir::Gt* gt =<ir::Gt>()) {
    @emit return expr_inject(gt->a) > expr_inject(gt->b);

  } else if(const ir::Lt* lt =<ir::Lt>()) {
    @emit return expr_inject(lt->a) < expr_inject(lt->b);

  } else if(const ir::Gte* gte =<ir::Gte>()) {
    @emit return expr_inject(gte->a) >= expr_inject(gte->b);

  } else if(const ir::Lte* lte =<ir::Lte>()) {
    @emit return expr_inject(lte->a) <= expr_inject(lte->b);

  } else if(const ir::And* and_ =<ir::And>()) {
    @emit return expr_inject(and_->a) && expr_inject(and_->b);

  } else if(const ir::Or* or_ =<ir::Or>()) {
    @emit return expr_inject(or_->a) || expr_inject(or_->b);

  } else if(const ir::Cast* cast =<ir::Cast>()) {
    @emit return 
      (typename type_from_datatype_t<@meta cast->type.getKind()>::type_t)
  } else if(const ir::Load* load =<ir::Load>()) {
    @emit return expr_inject(load->arr)[expr_inject(load->loc)];

  } else if(const ir::GetProperty* prop =<ir::GetProperty>()) {
    @emit return expr_prop(prop);

  } else {
    @meta std::string error = format("unsupported expr kind '%s'", 
    static_assert(false, error);

IR types inheriting ExprNode carry information for subexpressions. Each node in the tree represents one operator. The TACO C code generator prints the left child, then the operator, then the right child, in order to stringify a program. Our job for direct code generation is even easier--we simply recurse on the left child, type the operator's symbol, then recurse on the right node. Because the operands are implemented with expression macros, each operand expansion is replaced by the expression in the return statement--which is either a terminal or another operator. Name lookup/ADL, argument deduction and overload resolution apply as normal. Even though we're recursing over a tree, the generated expression is ordinary C++, with no extraneous function calls or block scopes created.

If the authors of TACO had been using Circle to write their library, they could have defined a single IR operator type and stored an enum or string indicating the operator encoded. The Circle @op extension then be used in expr_inject to execute an operator given the operator's string name.

An impossible-to-underrate benefit of Circle is introspection into enums. If we don't support a specific Expr node in our code, we can stringify the enumerator's identifier with @enum_name, format that into a nice error string, and print it with static_assert. This takes the guess work out of debugging when using someone else's code.

Statement generation


@macro void stmt_inject(const ir::Stmt& stmt) {

  @meta+ if(const ir::Case* case_ =<ir::Case>()) {
    // TACO's Case statement isn't a C++ case statement at all. It's a sequence
    // of if/else statements all chained together.
    @macro stmt_case(case_, 0);

  } else if(const ir::Store* store =<ir::Store>()) {
    // store->arr[store->loc] = store->data
    @emit expr_inject(store->arr)[expr_inject(store->loc)] = 
  } else if(const ir::For* for_ =<ir::For>()) {
    const ir::Var* var = for_-><ir::Var>();
    @emit for(
      @static_type(mtype_from_var(var)) @(var->name) = 
      @(var->name) < expr_inject(for_->end); 
      @(var->name) += expr_inject(for_->increment)
    ) {
      @macro stmt_inject(for_->contents);

  } else if(const ir::While* while_ =<ir::While>()) {
    @emit while(expr_inject(while_->cond)) {
      @macro stmt_inject(while_->contents);

  } else if(const ir::Block* block =<ir::Block>()) {
    // Inject each individual statement.
    for(const ir::Stmt& stmt : block->contents)
      @macro stmt_inject(stmt);

  } else if(const ir::Scope* scope =<ir::Scope>()) {
    @macro stmt_inject(scope->scopedStmt);    

  } else if(const ir::VarDecl* var_decl =<ir::VarDecl>()) {
    const ir::Var* var = var_decl-><ir::Var>();
    @emit @static_type(mtype_from_var(var)) @(var->name) = 

  } else if(const ir::Assign* assign =<ir::Assign>()) {
    // assign->lhs = assign->rhs.
    @emit expr_inject(assign->lhs) = expr_inject(assign->rhs);

  } else {
    @meta std::string error = format("unsupported stmt kind '%s'", 
    static_assert(false, error);

@macro void stmt_case(const ir::Case* case_, size_t index) {
  // If this is anything but the last case, create an if-statement.
  @meta if(index + 1 < case_->clauses.size()) {
    if(expr_inject(case_->clauses[index].first)) {
      @macro stmt_inject(case_->clauses[index].second);

    } else {
      // Recursively build the else case.
      @macro stmt_case(case_, index + 1);

  } else {
    // We're in the last case, which is just an else. Emit only the body.
    @macro stmt_inject(case_->clauses[index].second);

Statements are generated by expanding Circle statement macros. We switch over each statement IR node and write ordinary C++ code for that statement: a for-statement, an object declaration, an assignment, and so on. An object declaration in C++, for example, has three parts: the type, the object name and the object initializer. In this code generator, each of those three parts is defined from the data in the IR node: the variable member holds the type information and object name, and the rhs member holds an expression tree that is macro-expanded into the object's initializer.

Circle code generation doesn't have an elaborate API or DOM that models the syntax of C++. Just write C++ constructs and use the metaprogramming tools Circle provides to generate syntax elements from data wherever needed.

Program 4: A stringy frontend

A blemish in our treatment of TACO is that we're supplying some information twice: the tensors are named both in the string formula and again as call_taco arguments.

taco_tensor_t a { }, b { }, c { };
call_taco("a(i) = b(i) + c(i)", options, a, b, c);

If we set a convention that the C++ taco_tensor_t objects are given the same names as their corresponding tensors in the formula, can we use Circle's metaprogramming features to just plain eliminate the restatement of the tensor names?


// An expression macro that takes only a compile-time pattern string and 
// compile-time options. Read out the tensor names from the IR's parsed 
// Function object and evaluate each of those names as an expression.
// Pass these to the kernel function.
@macro auto call_taco(const char* __pattern, const options_t& __options) {
  // Generate a unique function name for each call_taco. The taco IR
  // will include this name in its data structures.
  @meta std::string __name = format("compute_%d", ++kernel_counter);

  // Parse the pattern and lower to IR.
  @meta ir::Stmt __stmt = lower_taco_kernel(__pattern, __options, 

  // Generate the function in the taco_kernel namespace.
  @meta const ir::Function* __function =<ir::Function>();

  // Expand the gen_kernel in namespace taco_kernel. 
  @macro namespace(taco_kernel) gen_kernel(__function);

  // Collect the names of all function arguments.
  @meta std::vector<std::string> __expr_names;
  @meta for(const ir::Expr& expr : __function->outputs)
    @meta __expr_names.push_back(<ir::Var>()->name);
  @meta for(const ir::Expr& expr : __function->inputs)
    @meta __expr_names.push_back(<ir::Var>()->name);

  // Call the function. Evaluate each of the parameter names as an @expression
  // and expand into the generated function.
  return taco_kernel::@(__name)(

int main() {
  taco_tensor_t a { }, b { }, c { };

  // Declare each tensor as 1D and sparse.
  @meta options_t options { };
  @meta options.formats.push_back({ "a", "s" });
  @meta options.formats.push_back({ "b", "s" });
  @meta options.formats.push_back({ "c", "s" });

  call_taco("a(i) = b(i) + c(i)", options);
  return 0;

Change call_taco from a metafunction to an expression macro (i.e. an auto-returning macro). Keep in mind that expression macros are expanded in the scope of the call site, and may only contain meta statements and return statements. We'll follow taco3's example in parsing the tensor formula and lowering that to TACO IR. But we double-underscore our object names, so that we don't shadow tensor names in the calling scope.

TACO will parse the formula, lower it to IR, and return that IR as an ir::Function object. We'll read each of the function argument names into the array __expr_names as strings. Finally we call the generated kernel with this rather amazing construction:


Let's take this from the inside out:

  1. __integer_pack(__expr_names.size()) yields an unexpanded non-type parameter pack. When expanded, this produces integers in the range 0 through __expr_names.size() - 1.
  2. __expr_names[__integer_pack(__expr_names.size())] yields an unexpanded non-type parameter pack of std::strings. When expanded, this yields each of the function's argument names.
  3. @expression(__expr_names[__integer_pack(__expr_names.size())]) yields an unexpanded non-type parameter pack of expressions. When substituted, the @expression extension tokenizes, parses and builds AST from its string argument. The argument may be as complex as you'd like. For the TACO usage, each argument is just the name of a tensor, so the expression will yield an lvalue to that object, where name lookup uses a declarative region which includes the site of the macro expansion (i.e., main).
  4. &@expression(__expr_names[__integer_pack(__expr_names.size())])... yields an expanded parameter pack of object pointers. The @expression will return an object lvalue, and the unary & operator converts that to a pointer prvalue. The trailing ellipsis expands the parameter pack into the argument list for the kernel function.

What happens if there's a mismatch between the tensor formula and the tensor object names?

int main() {
  taco_tensor_t a { }, b { }, d { };      // we call it "d"
  call_taco("a(i) = b(i) + c(i)",  { });  // the formula calls it "c"
  return 0;
$ circle taco4.cxx -I /home/sean/projects/opensource/taco/include -M /home/sean/projects/opensource/taco/Debug/lib/  -filetype=ll
Creating compute_1 function
macro expansion: taco4.cxx:428:3
error in expansion of macro auto call_taco(const char*, const options_t&)
  code injection: taco4.cxx:420:6
  error thrown in @expression("c")
    error: @expression:1:1
    undeclared identifier 'c'

The error message is nested to help understand the events that led us to this failure. First, we get an undeclared identifier 'c' error. Name lookup tried finding object c, but we accidentally named it 'd'. That is given line:col offset 1:1 inside source "@expression", helping us locate the token's position within the string argument of @expression.

Containing that statement is a code injection error refering to taco4.cxx:420:6, which is the position of the @expression call in the line containing the text &@expression(__expr_names[__integer_pack(__expr_names.size())])....

Containing that, we get a macro expansion for 428:3, which is the token position for the original call_taco macro expansion in main.

Even though we're injecting code as a string harvested from an ir::Function object generated by at compile-time, the Circle compiler gives adequete errors to clue us in to the context of the error.

You can’t perform that action at this time.