opt: add ReplaceMaskedMemOps pass #2836
Conversation
Force-pushed from 7ab5b84 to 01b43d5
Force-pushed from 01b43d5 to 6b3a053
The test from #2719 was added.
Would it make sense to add unit LIT tests? E.g., for a given input IR, check the new pass output.
It would, but at the moment I don't see such tests for other passes from
Does this PR supersede #2809?
src/opt.cpp
Outdated
@@ -731,6 +731,18 @@ void ispc::Optimize(llvm::Module *module, int optLevel) {
    optPM.addFunctionPass(PeepholePass());
    optPM.addFunctionPass(llvm::ADCEPass());
    optPM.addFunctionPass(ReplaceHalfMaskedMemOpsPass());
Is it possible to move it earlier to avoid extra cleanup passes?
I agree, it should be an earlier pass, to open more optimization doors.
I moved it earlier.
src/opt/ReplaceHalfMaskedMemOps.h
Outdated
// The masked.load and masked.store intrinsics are directly mapped to machine
// instructions with the specified full width of vector values being loaded or
// stored. This transformation allows the backend to generate shorter vector
// loads and stores avoidind extra spills.
// loads and stores avoidind extra spills.
// loads and stores avoiding extra spills.
Done
src/opt/ReplaceHalfMaskedMemOps.h
Outdated
// The masked.load and masked.store intrinsics are directly mapped to machine
// instructions with the specified full width of vector values being loaded or
// stored. This transformation allows the backend to generate shorter vector
// loads and stores avoidind extra spills.
Is it only about shorter memory ops, or is it also about shorter math operations? I assumed it's both.
Also, what do you mean by "spill" here? These are reads/writes of user-visible memory, while spills refer to storing/restoring from a temporary memory location when the register allocator runs out of registers.
Changed to "This transformation allows the backend to generate shorter vector memory operations and corresponding math operations avoiding extra spills of temporal values to memory"
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
auto N = CV->getType()->getNumElements();
for (auto i = N / 2; i < N; i++) {
    llvm::Constant *E = CV->getAggregateElement(i);
    if (!E || !llvm::isa<llvm::ConstantInt>(E) || !llvm::cast<llvm::ConstantInt>(E)->isZero()) {
Is it possible to use llvm::all_of in this function?
I am not quite sure how to use it for aggregates.
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
llvm::Value *lBitcastPointerType(llvm::IRBuilder<> &B, llvm::Value *ptr, llvm::Value *value) {
    auto *vecType = llvm::cast<llvm::VectorType>(value->getType());
    auto *newPtrType = llvm::PointerType::get(vecType, 0 /* TODO! */);
    // TODO! opaque pointer is no-op here, any special handling?
Does it make sense to check if the pointer is opaque and skip the bitcast in this case?
As I understand it, no bitcast is generated in that case. See tests/lit-tests/2611.ll
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
    return B.CreateBitCast(ptr, newPtrType);
}

llvm::Constant *lShrinkConstVec(llvm::LLVMContext &context, llvm::Value *originalValue) {
It would be clearer if the method accepted an llvm::Constant* argument.
I agree, fixed
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
for (auto CI : loadsToReplace) {
    lReplaceMaskedLoad(builder, CI);
}
Should llvm::PreservedAnalyses::all() be returned in case both storesToReplace and loadsToReplace are empty?
Done
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
// This function replaces, e.g.,
//
// %ptr = bitcast %v8_uniform_FVector4f* %Result.i to <8 x float>*
What would the code generation be if the original N = 4 and we cut it to load <2 x float>? We need to ensure it wouldn't be worse than just masked.load/store.v4f32.
Added tests/lit-tests/2611-2.ispc for this case.
src/opt/ReplaceHalfMaskedMemOps.cpp
Outdated
llvm::Value *lMergeVectors(llvm::IRBuilder<> &B, llvm::Value *firstVector, llvm::Value *secondVector,
                           llvm::Twine &name) {
    auto *firstVecType = llvm::cast<llvm::VectorType>(firstVector->getType());
Why do you use llvm::cast and not llvm::dyn_cast?

auto *firstVecType = llvm::dyn_cast<llvm::VectorType>(firstVector->getType());
auto *secondVecType = llvm::dyn_cast<llvm::VectorType>(secondVector->getType());
Done
I still see a mix of llvm::dyn_cast and llvm::cast in the code. It's not a problem if you're certain of the type you are casting to and the program logic guarantees that the type is correct. If it's not, I suggest using llvm::dyn_cast.
I have changed all llvm::cast to llvm::dyn_cast.
Force-pushed from 2ba1c81 to 053a80a
Yes, it is.
This is addressed by #2845 and
Force-pushed from 60739be to 66213c9
Please format lit-tests with
It traverses the bitcode for masked stores that have the second half of the mask turned off and the first half turned on. We can safely replace them with narrow unmasked stores, and loads followed by a shuffle with the passthrough value. This can help the back end generate better code (no extra spills, narrower register assignment).
Force-pushed from 66213c9 to 166a633
Done
LGTM!
namespace ispc {

bool lIsPowerOf2(unsigned n) { return (n > 0) && !(n & (n - 1)); }
There are such functions in LLVM, but it would need additional headers and libs to link.
It traverses the bitcode for masked stores that have the turned-off second part and the turned-on first part. We can safely replace them with narrow unmasked stores, and loads followed by a shuffle with the passthrough value. This can help the back end generate better code (no extra spills, narrower register assignment).
This fixes the last part of issue #2611.
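As an illustration of the store side of the transformation, here is hand-written IR in the spirit of the example quoted earlier in the review (not taken from the pass's actual tests): an 8-wide masked store whose upper mask half is all-false becomes a shuffle extracting the low lanes plus a plain narrow store.

```llvm
; Before: 8-wide masked store, upper half of the mask turned off
call void @llvm.masked.store.v8f32.p0(<8 x float> %val, ptr %ptr, i32 4,
    <8 x i1> <i1 true, i1 true, i1 true, i1 true,
              i1 false, i1 false, i1 false, i1 false>)

; After: extract the active low half and use an ordinary narrow store
%lo = shufflevector <8 x float> %val, <8 x float> poison,
                    <4 x i32> <i32 0, i32 1, i32 2, i32 3>
store <4 x float> %lo, ptr %ptr, align 4
```

For masked loads the narrow load is followed by a shuffle merging in the passthrough value for the disabled upper lanes, as described in the PR summary above.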