Add a leaf call transform for TBLIS #119
Can you add the instructions for building TBLIS into a document similar to the install_*.md files? I'm planning on writing an installation script to do everything (#120), so I can go back and look at that later.
Overall, this looks great! Just a bunch of small comments. I'm working on getting you a machine allocation so that we can do some initial performance testing with this as well. Ideally, we might be able to replace the hand-written kernels of TTV and TTMC with this (maybe innerprod too?) and see similar performance.
Addressed all the comments. PTAL.
Generally looks great! Nice job Tim. I want to have some preliminary performance results, get distribution working, and add some correctness tests as well before we land this. Once we do, I'll let the chemistry folks know, and we can try it out! In the meantime, we can run our own comparisons against CTF to make sure that we're faster.
9cd1195 to 8b88489
Okay, this should now work in a distributed setting. Will be testing it on sapling in the coming days. I also made sure to set the number of TBLIS threads to |
Rebased. Here are some final benchmarks (using rank-per-socket except for the single node case):
This is much faster than CTF in all cases (almost 2× improvement for the 4-node case):
taco/test/tests-tensor_types.cpp: In instantiation of 'void gtest_case_VectorTensorTest_::types<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_ = unsigned char]':
taco/test/tests-tensor_types.cpp:42:1:   required from here
taco/test/tests-tensor_types.cpp:48:30: error: narrowing conversion of '1.0e+0' from 'double' to 'unsigned char' [-Wnarrowing]
   48 |   map<vector<int>,TypeParam> vals = {{{0}, 1.0}, {{2}, 2.0}};
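The error above comes from list-initializing an `unsigned char` from the `double` literal `1.0`, which C++11 brace initialization rejects as a narrowing conversion. One possible way to avoid it is to make the conversion explicit. The sketch below is only an illustration: `makeVals` is a hypothetical helper, not code from the test suite, and the project may have resolved the failure differently.

```cpp
#include <map>
#include <vector>

// Hypothetical helper that mirrors the failing initializer from the log.
// static_cast makes the double -> T conversion explicit, so the braced
// initializer no longer contains an implicit narrowing conversion.
template <typename T>
std::map<std::vector<int>, T> makeVals() {
    return {{{0}, static_cast<T>(1.0)}, {{2}, static_cast<T>(2.0)}};
}
```

With this form, `makeVals<unsigned char>()` and `makeVals<double>()` both compile, since the element type is converted before it reaches the initializer list.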
TBLIS is a fast CPU tensor contraction library that works with any binary contraction. Any current call site that uses GEMM can be changed to use TBLIS instead.
Not the best name...
Sample usage:
OpenMP: ./bin/chemTest -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30
TBLIS: ./bin/chemTest -tblis -n 32 -gx 1 -gy 1 -gz 1 -ll:ocpu 1 -ll:othr 2 -ll:nsize 1G -ll:ncsize 0 -lg:eager_alloc_percentage 30
Looks good, great work Tim!
The TBLIS leaf call transform rewrites any binary contraction of the form:
to a call to `tblis::mult`. The `tblis::mult` function takes an einsum string as input and performs the contraction in an optimized manner, so we don't have to explicitly figure out the sequence of GEMMs and reductions ourselves.
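To make "binary contraction described by einsum-style labels" concrete, here is a naive reference loop written for illustration only. It does not call TBLIS and says nothing about how `tblis::mult` is implemented or what its exact signature is; `DenseTensor`, `offsetOf`, and `contractReference` are hypothetical names, and tensors are assumed to be dense, row-major, and of type `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical dense tensor: one extent per mode plus row-major data.
struct DenseTensor {
    std::vector<std::size_t> shape;
    std::vector<double> data;
};

// Row-major offset of the element picked out by `labels`, given the
// current value assigned to every index label.
static std::size_t offsetOf(const DenseTensor& t, const std::string& labels,
                            const std::map<char, std::size_t>& idx) {
    std::size_t off = 0;
    for (std::size_t m = 0; m < labels.size(); ++m) {
        off = off * t.shape[m] + idx.at(labels[m]);
    }
    return off;
}

// Naive reference contraction: C[lc] += A[la] * B[lb]. Any label that
// appears in la or lb but not in lc is summed over.
void contractReference(const DenseTensor& A, const std::string& la,
                       const DenseTensor& B, const std::string& lb,
                       DenseTensor& C, const std::string& lc) {
    assert(la.size() == A.shape.size());
    assert(lb.size() == B.shape.size());
    assert(lc.size() == C.shape.size());

    // Record the extent of every label (extents are assumed consistent).
    std::map<char, std::size_t> extent;
    for (std::size_t m = 0; m < la.size(); ++m) extent[la[m]] = A.shape[m];
    for (std::size_t m = 0; m < lb.size(); ++m) extent[lb[m]] = B.shape[m];
    for (std::size_t m = 0; m < lc.size(); ++m) extent[lc[m]] = C.shape[m];

    std::map<char, std::size_t> idx;
    std::size_t total = 1;
    for (const auto& kv : extent) {
        idx[kv.first] = 0;
        total *= kv.second;
    }

    // Odometer-style walk over the whole iteration space.
    for (std::size_t n = 0; n < total; ++n) {
        C.data[offsetOf(C, lc, idx)] +=
            A.data[offsetOf(A, la, idx)] * B.data[offsetOf(B, lb, idx)];
        for (auto& kv : idx) {
            if (++kv.second < extent[kv.first]) break;
            kv.second = 0;  // wrap this label and carry into the next one
        }
    }
}
```

In this notation a GEMM is the label triple `"ik"`, `"kj"`, `"ij"`, and a TTV is `"ijk"`, `"k"`, `"ij"`. The same routine covers both, which is the property that lets a single contraction call stand in for hand-written kernels; a tuned library would of course replace the scalar loop with blocked, vectorized code.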