Cyclops Tensor Framework: distributed-memory tensor algebra
C++ Other

README.md

Cyclops Tensor Framework (CTF)

alt text

CTF is a C++ library for algebraic operations on distributed multidimensional arrays.

The operations are expressed algebraically via vector, matrix, or tensor representation of the arrays. An array with k dimensions is represented by a tensor of order k. By default, the tensor elements are floating point numbers and tensor operations are combinations of products and sums. However, a user can define any statically-sized type and elementwise operations. CTF supports general tensor sparsity, so it is possible to define graph algorithms with the use of sparse adjacency matrices.

Applications

The CTF library enables general-purpose programming, but is particularly useful in some specific application domains.

Numerical methods based on tensor contractions

The framework provides a powerful domain specific language for computational chemistry and physics codes that work on higher order tensors (e.g. coupled cluster, lattice QCD, quantum informatics). See the CCSD and sparse MP3 examples, or the Aquarius application. This domain was the initial motivation for the development of CTF. An exemplary paper for this type of applications is

Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel; A massively parallel tensor contraction framework for coupled-cluster computations; Journal of Parallel and Distributed Computing, June 2014.

Algebraic graph algorithms

Much like the CombBLAS library, CTF provides sparse matrix primitives that enable development of parallel graph algorithms. Matrices and vectors can be defined on a user-defined semiring or monoid. Multiplication of sparse matrices with sparse vectors or other sparse matrices is parallelized automatically. The library includes example code for Bellman-Ford and betweenness centrality. An paper describing and analyzing the betweenness centrality code is

Edgar Solomonik, Maciej Besta, Flavio Vella, and Torsten Hoefler; Betweenness centrality is more parallelizable than dense matrix multiplication arXiv preprint, arXiv:1609.07008 [cs.MS], September 2016.

Prototyping of parallel numerical algorithms

The high-level abstractions for vectors, matrices, and order 3+ tensors allow CTF to be a useful tool for developing many numerical algorithms. Interface hook-ups for ScaLAPACK make coordination with distributed-memory solvers easy. Algorithms like algebraic multigrid, which require sparse matrix multiplication can be rapidly implemented with CTF. Further, higher order tensors can be used to express recursive algorithms like parallel scans and FFT. Some basic examples of numerical codes using CTF are presented in

Edgar Solomonik and Torsten Hoefler; Sparse tensor algebra as a parallel programming model arXiv preprint, arXiv:1512.00066 [cs.MS], November 2015.

Sample code and minimal tutorial

A simple Jacobi iteration code using CTF is given below, also found in this examples.

Vector<> Jacobi(Matrix<> A , Vector<> b , int n){
  Matrix<> R(A);
  R["ii"] = 0.0;
  Vector<> x (n), d(n), r(n);
  Function<> inv([]( double & d ){ return 1./d; });
  d["i"] = inv(A["ii"]); // set d to inverse of diagonal of A
  do {
    x["i"] = d["i"]*(b["i"]-R["ij"]*x["j"]);
    r["i"] = b["i"]-A ["ij"]*x["j"]; // compute residual
  } while ( r.norm2 () > 1. E -6); // check for convergence
  return x;
}

The above Jacobi function accepts n-by-n matrix A and n-dimensional vector b containing double precision floating point elements and solves Ax=b using Jacobi iteration. The matrix R is defined to be a copy of the data in A. Its diagonal is subsequently set to 0, while the diagonal of A is extracted into d and inverted. A while loop then computes the jacobi iteration by use of matrix vector multiplication, vector addition, and vector Hadamard products.

This Jacobi code uses Vector and Matrix objects which are specializations of the Tensor object. Each of these is a distributed data structure, which is partitioned across an MPI communicator.

The key illustrative part of the above example is

x["i"] = d["i"]*(b["i"]-R["ij"]*x["j"]);

to evaluate this expression, CTF would execute the following set of loops, each in parallel,

double y[n];
for (int i=0; i<n; i++){
  y[i] = 0.0;
  for (int j=0; j<n; j++){
    y[i] += R[i][j]*x[j];
  }
}
for (int i=0; i<n; i++) 
  y[i] = b[i]-y[i];
for (int i=0; i<n; i++) 
  x[i] = d[i]*y[i];

This parallelization is done with any programming language constructs outside of basic C++. Operator overloading is used to interpret the indexing strings and build an expression tree. Each operation is then evaluated by an appropriate parallelization. Locally, BLAS kernels are used when possible.

Its also worth noting the different ways indices are employed in the above example. In the matrix-vector multiplication,

y["i"]=R["ij"]*b["j"]

the index j is contracted (summed) over, as it does not appear in the output, while the index i appears in one of the operand and the output. Note, that a different semantic occurs if an index appears in two operands the result

x["i"]=d["i"]*y["i"];

This is most often referred to as a Hadamard product. In general, CTF can execute any operation of the form,

C["..."] += A["..."]*B["..."];

so long as the length of each of the three strings matches the order of the tensors. The operation can be interpreted by looping over all unique characters that appear in the union of the three strings, and putting the specified multiply-add operation in the innermost loop.

Another piece of functionality employed in the Jacobi example is the Function object and its application,

Function<> inv([]( double & d ){ return 1./d; });
d["i"] = inv(A["ii"]); // set d to inverse of diagonal of A

the same code could have been written even more concisely,

d["i"] = Function<> inv([]( double & d ){ return 1./d; })(A["ii"]);

This syntax defines and employs an elementwise function that inverts each element of A to which it is applied. The operation is executing the following loop,

for (int i=0; i<n; i++)
  d[i] = 1./A[i][i];

In this way, arbitrary functions can be applied to elements of the tensors. There are additional 'algebraic structure' constructs that allow a redefinition of addition and multiplication for any given tensor. Finally, there is a Transform object, which acts in a similar way as Function, but takes the output element by reference. The Transform construct is more powerful than Function, but limits the transformations that can be applied internally for efficiency. Both Function and Transform can operate on one or two tensors with different element types, and output another element type.

Documentation

Detailed documentation of all functionality and the organization of the source code can be found in the Doxygen page. Much of the functionality is expressed through the Tensor object.

The examples and aforementioned papers can be used to gain further insight. If you have any questions regarding usage, do not hesitate to contact us! You can do so by posting an issue on the github page or emailing solomon2@illinois.edu.

Building and testing

To build the library, it is necessary to execute './configure' and 'make'. The configure file does not use autotools and should be relatively easily to interpret and tweak. Run './configure --help' to see a list of options. The configure file will generate a file called config.mk. You can tweak this file further, it has options for running CTF in verbose mode (stating which contractions are executed), debug mode (doing extra checks and outputting various execution information) or with performance profiling (outputting a breakdown of performance at the end in the style of a TAU performance profile summary). It is possible to execute configure from an external folder to build the library out of source. Note that rerunning configure might overwrite any changes you make to the config.mk file.

For testing the library, run 'make test', which will run a suite of dozens of tests sequentially, but should only take a couple of seconds. You can also build and execute 'test_suite' using multiple MPI processors ('make test2', 'make test4', ... will build and run the test suite with 2, 4, ... processors, but only works for some small processor counts).

Parallel make (e.g. -j4) is supported. The library will build in seconds if you switch off the optimization flags in config.mk, but may take a few minutes otherwise.

Performance

Please see the aforementioned papers for various applications and benchmarks, which are also summarized in this recent presentation. Generally, the distributed-memory dense and sparse matrix multiplication performance should be very good. Similar performance is achieved for many types of contractions. CTF can leverage threading, but is fastest with pure MPI or hybrid MPI+OpenMP. The code aims at scalability to a large number of processors by minimizing communication cost, rather than necessarily achieving perfect absolute performance. User-defined functions naturally inhibit the sequential kernel performance. Algorithms that have a low flop-to-byte ratio may not achieve memory-bandwidth peak as some copying/transposition may take place. Absolute performance of operations that have Hadamard indices is relatively low for the time being, but will be improved.

Alternative frameworks

Elemental and ScaLAPACK provide distributed-memory support for dense matrix operations in addition to a powerful suite of solver routines. It is also possible to interface them with CTF, in particular, we provide routines for retrieving a ScaLAPACK descriptor.

A faster library for dense tensor contractions in shared memory is Libtensor.

An excellent distributed-memory library with native support for block-sparse tensors is TiledArray.

Acknowledging usage

The library and source code is available to everyone. If you would like to acknowledge the usage of the library, please cite one of our papers. The below one would be a good choice,

Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel; A massively parallel tensor contraction framework for coupled-cluster computations; Journal of Parallel and Distributed Computing, June 2014.

here is the bibtex.

We hope you enjoy writing your parallel program with algebra!