Add a leaf call transform for TBLIS #119

Merged (6 commits) on Jun 6, 2022

@TimothyGu TimothyGu commented May 10, 2022

The TBLIS leaf call transform rewrites any binary contraction of the form:

ForAll(i, ForAll(j, ForAll(k, … ForAll(z, A[idx1] = B[idx2] * C[idx3]) … )))

to a call to tblis::mult. The tblis::mult function takes an einsum string as input and performs the contraction in an optimized manner, so we don't have to explicitly figure out the sequence of GEMMs and reductions ourselves.
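
As a rough illustration of what "binary contraction" and "einsum string" mean here, the sketch below builds a NumPy-style index string from the index lists of A, B, and C and evaluates the same loop nest naively. It is written in Python for brevity and is not the code in this PR. The function names, the `"ik,kj->ij"` string layout, and the dict-based tensors are all assumptions made for the example; the exact argument layout of tblis::mult is not shown in this thread.

```python
from itertools import product


def einsum_string(out_idx, lhs_idx, rhs_idx):
    """Return a string like 'ik,kj->ij' for A[out_idx] = B[lhs_idx] * C[rhs_idx]."""
    for v in out_idx:
        # An output index that appears in neither operand has no defined extent.
        if v not in lhs_idx and v not in rhs_idx:
            raise ValueError(f"output index {v!r} is not used by either operand")
    return f"{''.join(lhs_idx)},{''.join(rhs_idx)}->{''.join(out_idx)}"


def contracted_indices(out_idx, lhs_idx, rhs_idx):
    """Indices that are summed away: used by an operand but absent from the output."""
    operand_idx = dict.fromkeys(list(lhs_idx) + list(rhs_idx))  # ordered, de-duplicated
    return [v for v in operand_idx if v not in out_idx]


def naive_contract(out_idx, lhs_idx, rhs_idx, B, C, dims):
    """Reference semantics of the ForAll nest: visit every index point, accumulate into A.

    B and C are dicts keyed by coordinate tuples; dims maps index name -> extent.
    """
    all_idx = list(dict.fromkeys(list(out_idx) + list(lhs_idx) + list(rhs_idx)))
    A = {}
    for point in product(*(range(dims[v]) for v in all_idx)):
        env = dict(zip(all_idx, point))
        a = tuple(env[v] for v in out_idx)
        b = tuple(env[v] for v in lhs_idx)
        c = tuple(env[v] for v in rhs_idx)
        A[a] = A.get(a, 0) + B[b] * C[c]
    return A


# Matrix multiply A(i,j) = B(i,k) * C(k,j) as the simplest binary contraction.
B = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
C = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
A = naive_contract("ij", "ik", "kj", B, C, {"i": 2, "j": 2, "k": 2})
print(einsum_string("ij", "ik", "kj"), contracted_indices("ij", "ik", "kj"), A)
```

The point of handing the index string to a contraction library is that the nested loop in `naive_contract` never has to be written or scheduled by hand; the library decides how to lower it.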

@rohany (Owner) left a comment

Can you add the instructions for building TBLIS into a document similar to the install_*.md files? I'm planning on writing an installation script to do everything (#120), so I can go back and look at that later.

Overall, this looks great! Just a bunch of small comments. I'm working on getting you a machine allocation so that we can do some initial performance testing with this as well. Ideally, we might be able to replace the hand-written kernels of TTV and TTMC with this (maybe innerprod too?) and see similar performance.

@TimothyGu (Author) commented

Addressed all the comments. PTAL

@rohany (Owner) left a comment

Generally looks great! Nice job Tim. I want to have some preliminary performance results, get distribution working, and add some correctness tests as well before we land this. Once we do, I'll let the chemistry folks know, and we can try it out! In the meantime, we can run our own comparisons against CTF to confirm that we're faster.

@TimothyGu TimothyGu force-pushed the tg/tblis branch 2 times, most recently from 9cd1195 to 8b88489 Compare May 12, 2022 00:05
@TimothyGu (Author) commented

Okay, this should now work in a distributed setting. Will be testing it on sapling in the coming days. I also made sure to set the number of TBLIS threads to omp_get_max_threads() as suggested by the cuNumeric folks, so now it should autoscale with -ll:othr and no longer hang.

@TimothyGu (Author) commented Jun 6, 2022

Rebased.

Here are some final benchmarks (using rank-per-socket except for the single node case):

$ bin/chemTest -n 70 -tblis -gx 2 -gy 1 -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
Execution time: 463 ms.
Execution time: 437 ms.
Execution time: 428 ms.
Execution time: 429 ms.
Execution time: 427 ms.
Execution time: 433 ms.
Execution time: 433 ms.
Execution time: 429 ms.
Execution time: 431 ms.
Execution time: 432 ms.
$ mpirun -H c0001:2,c0002:2 --bind-to socket bin/chemTest -n 83 -tblis -gx 2 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[3 - 7f73eda98d00]    0.000185 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 897 ms.
Execution time: 731 ms.
Execution time: 697 ms.
Execution time: 684 ms.
Execution time: 700 ms.
Execution time: 739 ms.
Execution time: 717 ms.
Execution time: 690 ms.
Execution time: 725 ms.
Execution time: 693 ms.
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket bin/chemTest -n 99 -tblis -gx 4 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[5 - 7fcbca44ed00]    0.000215 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 1147 ms.
Execution time: 1049 ms.
Execution time: 1004 ms.
Execution time: 1048 ms.
Execution time: 1024 ms.
Execution time: 1022 ms.
Execution time: 1031 ms.
Execution time: 1104 ms.
Execution time: 992 ms.
Execution time: 1058 ms.

This is faster than CTF in all cases. By median run time, the speedup is roughly 1.7× on one node, 1.3× on two nodes, and almost 2× on four nodes:

$ mpirun -H c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 70
Execution time: 905 ms.
Execution time: 784 ms.
Execution time: 745 ms.
Execution time: 745 ms.
Execution time: 739 ms.
Execution time: 731 ms.
Execution time: 744 ms.
Execution time: 745 ms.
Execution time: 730 ms.
Execution time: 728 ms.
$ mpirun -H c0001:20,c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 83
Execution time: 968 ms.
Execution time: 926 ms.
Execution time: 984 ms.
Execution time: 916 ms.
Execution time: 933 ms.
Execution time: 985 ms.
Execution time: 952 ms.
Execution time: 887 ms.
Execution time: 970 ms.
Execution time: 933 ms.
$ mpirun -H c0001:20,c0002:20,c0003:20,c0004:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 99
Execution time: 1897 ms.
Execution time: 2058 ms.
Execution time: 1981 ms.
Execution time: 2038 ms.
Execution time: 2019 ms.
Execution time: 2148 ms.
Execution time: 2066 ms.
Execution time: 2046 ms.
Execution time: 2085 ms.
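
For anyone who wants to check the four-node figure, here is a small Python sketch that parses `Execution time: N ms.` lines and compares median run times. Using the median (rather than the mean or the best run) is a choice made for this example, mainly so that the slower first iteration does not skew the result; the helper names are hypothetical and not part of chemTest.

```python
import re
from statistics import median


def parse_times(log):
    """Extract millisecond timings from lines such as 'Execution time: 463 ms.'."""
    return [int(ms) for ms in re.findall(r"Execution time: (\d+) ms\.", log)]


def median_speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline, by median run time."""
    return median(baseline_ms) / median(candidate_ms)


# Four-node timings copied from the logs above (TBLIS: 10 runs, CTF: 9 runs).
tblis_4node = [1147, 1049, 1004, 1048, 1024, 1022, 1031, 1104, 992, 1058]
ctf_4node = [1897, 2058, 1981, 2038, 2019, 2148, 2066, 2046, 2085]

print(f"{median_speedup(ctf_4node, tblis_4node):.2f}x")
```

`parse_times` can be pointed at the raw log text directly, for example the captured stdout of a run, instead of typing the numbers in by hand.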

taco/test/tests-tensor_types.cpp: In instantiation of 'void gtest_case_VectorTensorTest_::types<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_ = unsigned char]':
taco/test/tests-tensor_types.cpp:42:1:   required from here
taco/test/tests-tensor_types.cpp:48:30: error: narrowing conversion of '1.0e+0' from 'double' to 'unsigned char' [-Wnarrowing]
   48 |   map<vector<int>,TypeParam> vals = {{{0}, 1.0}, {{2}, 2.0}};
      |
TBLIS is a fast CPU tensor contraction library that works with any
binary contraction. Any current call site that uses GEMM can be changed
to use TBLIS instead.
Not the best name...

Sample usage:

OpenMP: ./bin/chemTest        -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30
 TBLIS: ./bin/chemTest -tblis -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30
@rohany (Owner) commented Jun 6, 2022

Looks good, great work Tim!

@rohany rohany merged commit a987496 into rohany:DISTAL Jun 6, 2022
@TimothyGu TimothyGu deleted the tg/tblis branch June 7, 2022 04:56