Add a leaf call transform for TBLIS #119

TimothyGu · 2022-05-10T07:09:08Z

The TBLIS leaf call transform rewrites any binary contraction of the form:

ForAll(i, ForAll(j, ForAll(k, … ForAll(z, A[idx1] = B[idx2] * C[idx3]) … )))

to a call to tblis::mult. The tblis::mult function takes an einsum string as input and performs the contraction in an optimal manner, so we don't have to explicitly figure out the sequence of GEMMs and reductions.
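To make concrete what an einsum-style binary contraction computes, here is a minimal reference sketch. It does not call TBLIS and is not the code this transform generates: the `Tensor` struct and the `contract` function are hypothetical names introduced only for illustration, and the naive loop nest stands in for whatever optimized kernels a real library would use. It assumes dense row-major tensors, one single-character label per dimension, and consistent extents for labels shared between the inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical dense tensor for illustration: row-major data plus one
// extent per dimension.
struct Tensor {
    std::vector<std::size_t> shape;
    std::vector<double> data;
};

// Row-major offset of the element selected by `labels` under the current
// assignment of label -> index value.
static std::size_t offsetOf(const Tensor& t, const std::string& labels,
                            const std::map<char, std::size_t>& value) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < labels.size(); ++d) {
        off = off * t.shape[d] + value.at(labels[d]);
    }
    return off;
}

// Naive reference for A[idxA] = B[idxB] * C[idxC]. Labels that appear in
// idxB or idxC but not in idxA are summed over.
Tensor contract(const std::string& idxA,
                const Tensor& B, const std::string& idxB,
                const Tensor& C, const std::string& idxC) {
    assert(idxB.size() == B.shape.size());
    assert(idxC.size() == C.shape.size());

    // Collect the extent of every label from the two inputs.
    std::map<char, std::size_t> extent;
    for (std::size_t d = 0; d < idxB.size(); ++d) extent[idxB[d]] = B.shape[d];
    for (std::size_t d = 0; d < idxC.size(); ++d) extent[idxC[d]] = C.shape[d];

    // Size the output from its labels and zero-initialize it.
    Tensor A;
    std::size_t sizeA = 1;
    for (char l : idxA) {
        A.shape.push_back(extent.at(l));
        sizeA *= extent.at(l);
    }
    A.data.assign(sizeA, 0.0);

    // Total number of points in the combined index space of all labels.
    std::size_t total = 1;
    for (const auto& kv : extent) total *= kv.second;

    // Visit every point by decoding a flat counter into one value per label.
    std::map<char, std::size_t> value;
    for (std::size_t flat = 0; flat < total; ++flat) {
        std::size_t rem = flat;
        for (const auto& kv : extent) {
            value[kv.first] = rem % kv.second;
            rem /= kv.second;
        }
        A.data[offsetOf(A, idxA, value)] +=
            B.data[offsetOf(B, idxB, value)] * C.data[offsetOf(C, idxC, value)];
    }
    return A;
}
```

With this sketch, a matrix multiply is `contract("ij", B, "ik", C, "kj")` and an inner product is `contract("", b, "i", c, "i")`; the same label-driven description covers contractions such as TTV without spelling out a separate sequence of GEMMs and reductions for each one.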

legion/pummaMM/taco-generated-tblis.cpp

src/index_notation/transformations.cpp

rohany

Can you add the instructions for building TBLIS into a document similar to the install_*.md files? I'm planning on writing an installation script to do everything (#120), so I can go back and look at that later.

Overall, this looks great! Just a bunch of small comments. I'm working on getting you a machine allocation so that we can do some initial performance testing with this as well. Ideally, we might be able to replace the hand-written kernels of TTV and TTMC with this (maybe innerprod too?) and see similar performance.

CMakeLists.txt

include/taco/index_notation/transformations.h

legion/CMakeLists.txt

legion/pummaMM/taco-generated-tblis.cpp

src/index_notation/transformations.cpp

test/tests-distributed.cpp

src/index_notation/transformations.cpp

TimothyGu · 2022-05-11T01:50:03Z

Addressed all the comments. PTAL

rohany

Generally looks great! Nice job Tim. I want to have some preliminary performance results, get distribution working, and add some correctness tests as well before we land this. Once we do, I'll let the chemistry folks know, and we can try it out! In the meantime, we can run comparisons of our own vs CTF to ensure that we're better.

legion/CMakeLists.txt

legion/chemTest/main.cpp

legion/src/taco_legion_header.cpp

legion/cmake/FindTBLIS.cmake

src/index_notation/transformations.cpp

TimothyGu · 2022-05-12T00:09:26Z

Okay, this should now work in a distributed setting. I will be testing it on sapling in the coming days. I also made sure to set the number of TBLIS threads to omp_get_max_threads(), as suggested by the cuNumeric folks, so it should now autoscale with -ll:othr and no longer hang.

TimothyGu · 2022-06-06T21:32:50Z

Rebased.

Here are some final benchmarks (using rank-per-socket except for the single node case):

$ bin/chemTest -n 70 -tblis -gx 2 -gy 1 -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
Execution time: 463 ms.
Execution time: 437 ms.
Execution time: 428 ms.
Execution time: 429 ms.
Execution time: 427 ms.
Execution time: 433 ms.
Execution time: 433 ms.
Execution time: 429 ms.
Execution time: 431 ms.
Execution time: 432 ms.

$ mpirun -H c0001:2,c0002:2 --bind-to socket bin/chemTest -n 83 -tblis -gx 2 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[3 - 7f73eda98d00]    0.000185 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 897 ms.
Execution time: 731 ms.
Execution time: 697 ms.
Execution time: 684 ms.
Execution time: 700 ms.
Execution time: 739 ms.
Execution time: 717 ms.
Execution time: 690 ms.
Execution time: 725 ms.
Execution time: 693 ms.

$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket bin/chemTest -n 99 -tblis -gx 4 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[5 - 7fcbca44ed00]    0.000215 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 1147 ms.
Execution time: 1049 ms.
Execution time: 1004 ms.
Execution time: 1048 ms.
Execution time: 1024 ms.
Execution time: 1022 ms.
Execution time: 1031 ms.
Execution time: 1104 ms.
Execution time: 992 ms.
Execution time: 1058 ms.

This is much faster than CTF in all cases (almost 2× improvement for the 4-node case):

$ mpirun -H c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 70
Execution time: 905 ms.
Execution time: 784 ms.
Execution time: 745 ms.
Execution time: 745 ms.
Execution time: 739 ms.
Execution time: 731 ms.
Execution time: 744 ms.
Execution time: 745 ms.
Execution time: 730 ms.
Execution time: 728 ms.

$ mpirun -H c0001:20,c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 83
Execution time: 968 ms.
Execution time: 926 ms.
Execution time: 984 ms.
Execution time: 916 ms.
Execution time: 933 ms.
Execution time: 985 ms.
Execution time: 952 ms.
Execution time: 887 ms.
Execution time: 970 ms.
Execution time: 933 ms.

$ mpirun -H c0001:20,c0002:20,c0003:20,c0004:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 99
Execution time: 1897 ms.
Execution time: 2058 ms.
Execution time: 1981 ms.
Execution time: 2038 ms.
Execution time: 2019 ms.
Execution time: 2148 ms.
Execution time: 2066 ms.
Execution time: 2046 ms.
Execution time: 2085 ms.
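
The comparison above can be summarized by mean execution time. The following is a small sketch of that arithmetic, using only the timings listed in this comment; the helper names `meanMs` and `speedup` are made up for this illustration, and the means include the first (slower, presumably warm-up) run of each configuration.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a list of execution times in milliseconds.
double meanMs(const std::vector<double>& times) {
    return std::accumulate(times.begin(), times.end(), 0.0) / times.size();
}

// Speedup of `candidate` over `baseline`, comparing mean execution times.
double speedup(const std::vector<double>& baseline,
               const std::vector<double>& candidate) {
    return meanMs(baseline) / meanMs(candidate);
}

// Execution times (ms) copied from the runs above.
const std::vector<double> tblis1 = {463, 437, 428, 429, 427, 433, 433, 429, 431, 432};
const std::vector<double> tblis2 = {897, 731, 697, 684, 700, 739, 717, 690, 725, 693};
const std::vector<double> tblis4 = {1147, 1049, 1004, 1048, 1024, 1022, 1031, 1104, 992, 1058};
const std::vector<double> ctf1 = {905, 784, 745, 745, 739, 731, 744, 745, 730, 728};
const std::vector<double> ctf2 = {968, 926, 984, 916, 933, 985, 952, 887, 970, 933};
const std::vector<double> ctf4 = {1897, 2058, 1981, 2038, 2019, 2148, 2066, 2046, 2085};
```

Evaluating `speedup(ctf1, tblis1)`, `speedup(ctf2, tblis2)`, and `speedup(ctf4, tblis4)` gives roughly 1.75×, 1.30×, and 1.94× (means of about 434 vs 760 ms, 727 vs 945 ms, and 1048 vs 2038 ms), which matches the "almost 2×" figure for the 4-node case.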

taco/test/tests-tensor_types.cpp: In instantiation of 'void gtest_case_VectorTensorTest_::types<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_ = unsigned char]':
taco/test/tests-tensor_types.cpp:42:1:   required from here
taco/test/tests-tensor_types.cpp:48:30: error: narrowing conversion of '1.0e+0' from 'double' to 'unsigned char' [-Wnarrowing]
   48 |   map<vector<int>,TypeParam> vals = {{{0}, 1.0}, {{2}, 2.0}};
      |

TBLIS is a fast CPU tensor contraction library that works with any binary contraction. Any current call site that uses GEMM can be changed to use TBLIS instead.

Not the best name...

Sample usage:

OpenMP: ./bin/chemTest -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30
TBLIS: ./bin/chemTest -tblis -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30

rohany · 2022-06-06T23:54:40Z

Looks good, great work Tim!

rohany requested changes May 10, 2022

TimothyGu force-pushed the tg/tblis branch from be26018 to 0d4f5e0 Compare May 11, 2022 01:38

rohany reviewed May 11, 2022

TimothyGu force-pushed the tg/tblis branch 2 times, most recently from 9cd1195 to 8b88489 Compare May 12, 2022 00:05

TimothyGu force-pushed the tg/tblis branch from 5a3b1cb to 3c7a873 Compare June 6, 2022 22:15

TimothyGu requested a review from rohany June 6, 2022 22:15

TimothyGu added 6 commits June 6, 2022 15:16

Comment out more of Git submodule things

a6cb0ee

legion/tblis: add submodule and build system

35ef46b

legion: silence some type warnings

fcc0902

index_notation: add leaf call transform for TBLIS

80d1d47

TimothyGu force-pushed the tg/tblis branch from 3c7a873 to 7cad48f Compare June 6, 2022 22:18

rohany merged commit a987496 into rohany:DISTAL Jun 6, 2022

rohany mentioned this pull request Jun 7, 2022

extend TBLIS leaf kernel strategy to use CuTensor for GPUs #129 (Open)

TimothyGu deleted the tg/tblis branch June 7, 2022 04:56
