
opt: add ReplaceMaskedMemOps pass #2836

Merged: 1 commit merged into ispc:main on May 16, 2024

Conversation

@nurmukhametov (Collaborator) commented Apr 16, 2024

It traverses the bitcode looking for masked stores and loads whose mask has the first half turned on and the second half turned off. We can safely replace the stores with narrow unmasked stores, and the loads with narrow unmasked loads followed by a shuffle with the passthrough value. This can help the back-end generate better code (no extra spills, narrower register assignment).

This fixes the last part of issue #2611.
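For illustration, here is a hedged C++ sketch of the store side of the transformation (hypothetical helper, not the code in this PR); the load side is analogous, with a narrow unmasked load whose result is shuffled together with the passthrough value:

// Sketch only: replace a masked store whose mask is all-ones in the first
// half and all-zeros in the second half with a plain store of the low half.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

llvm::Value *storeActiveLowerHalf(llvm::IRBuilder<> &B, llvm::Value *value, llvm::Value *ptr) {
    auto *vecTy = llvm::cast<llvm::FixedVectorType>(value->getType());
    unsigned N = vecTy->getNumElements();
    // Gather element indices [0, N/2) to shuffle out the active low half.
    llvm::SmallVector<int, 16> lowHalf;
    for (unsigned i = 0; i < N / 2; ++i)
        lowHalf.push_back(static_cast<int>(i));
    llvm::Value *narrow = B.CreateShuffleVector(value, lowHalf, "low_half");
    // The upper lanes were masked off, so an unmasked narrow store is safe.
    return B.CreateStore(narrow, ptr);
}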

@nurmukhametov (Collaborator, Author)

The test from #2719 was added.

@turinevgeny (Collaborator) left a comment

Would it make sense to add unit LIT tests? E.g., for a given input IR, check the new pass output.

@nurmukhametov (Collaborator, Author)

> Would it make sense to add unit LIT tests? E.g., for a given input IR, check the new pass output.

It would, but at the moment I don't see such tests for other passes in src/opt, nor the ability to run a specific pass on its own over input IR.

@dbabokin (Collaborator) left a comment

Does this PR supersede #2809?

src/opt.cpp Outdated
@@ -731,6 +731,18 @@ void ispc::Optimize(llvm::Module *module, int optLevel) {

optPM.addFunctionPass(PeepholePass());
optPM.addFunctionPass(llvm::ADCEPass());
optPM.addFunctionPass(ReplaceHalfMaskedMemOpsPass());
Collaborator

Is it possible to move it earlier to avoid extra cleanup passes?

Collaborator

I agree, it should be an earlier pass, to open more optimization doors.

Collaborator Author

I moved it earlier.
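For context, a hedged sketch of the ordering under discussion (the surrounding passes are illustrative, not the exact final pipeline):

// Running the replacement before the cleanup passes lets PeepholePass and
// ADCE simplify whatever the new narrow loads and stores expose.
optPM.addFunctionPass(ReplaceHalfMaskedMemOpsPass()); // moved earlier
optPM.addFunctionPass(PeepholePass());
optPM.addFunctionPass(llvm::ADCEPass());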

// The masked.load and masked.store intrinsics are directly mapped to machine
// instructions with the specified full width of vector values being loaded or
// stored. This transformation allows the backend to generate shorter vector
// loads and stores avoidind extra spills.
Collaborator

Suggested change
// loads and stores avoidind extra spills.
// loads and stores avoiding extra spills.

Collaborator Author

Done

// The masked.load and masked.store intrinsics are directly mapped to machine
// instructions with the specified full width of vector values being loaded or
// stored. This transformation allows the backend to generate shorter vector
// loads and stores avoidind extra spills.
Collaborator

Is it only about shorter memory ops, or is it also about shorter math operations? I assumed it's both.

Also, what do you mean by "spill" here? These are reads/writes of user-visible memory, while spills refer to storing/restoring from a temporary memory location when the register allocator runs out of registers.

Collaborator Author

Changed to "This transformation allows the backend to generate shorter vector memory operations and corresponding math operations, avoiding extra spills of temporary values to memory".


// Verify that every mask element in the upper half of the constant vector is zero (turned off).
auto N = CV->getType()->getNumElements();
for (auto i = N / 2; i < N; i++) {
llvm::Constant *E = CV->getAggregateElement(i);
if (!E || !llvm::isa<llvm::ConstantInt>(E) || !llvm::cast<llvm::ConstantInt>(E)->isZero()) {
Collaborator

Is it possible to use llvm::all_of in this function?

Collaborator Author

I am not quite sure how to use it for aggregates.
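For reference, a hedged sketch (helper name hypothetical) of driving llvm::all_of with an index range via llvm::seq, which sidesteps the lack of ordinary iterators on constant aggregates:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/Sequence.h"
#include "llvm/IR/Constants.h"

// True when every mask element in the upper half of the constant vector is zero.
bool upperHalfIsOff(const llvm::Constant *CV, unsigned N) {
    return llvm::all_of(llvm::seq(N / 2, N), [CV](unsigned i) {
        auto *E = llvm::dyn_cast_or_null<llvm::ConstantInt>(CV->getAggregateElement(i));
        return E && E->isZero();
    });
}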

llvm::Value *lBitcastPointerType(llvm::IRBuilder<> &B, llvm::Value *ptr, llvm::Value *value) {
auto *vecType = llvm::cast<llvm::VectorType>(value->getType());
auto *newPtrType = llvm::PointerType::get(vecType, 0 /* TODO! */);
// TODO! opaque pointer is no-op here, any special handling?
Collaborator

Does it make sense to check if the pointer is opaque and skip the bitcast in this case?

Collaborator Author

As I understand it, no bitcast is generated in that case. See tests/lit-tests/2611.ll.

return B.CreateBitCast(ptr, newPtrType);
}
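A hedged sketch of the explicit skip the reviewer suggested; with opaque pointers the source and destination types compare equal, so this matches the folding behavior the author relies on:

// If the types already match (always true under opaque pointers), there is
// nothing to cast; CreateBitCast would fold to a no-op and return ptr anyway.
if (ptr->getType() == newPtrType)
    return ptr;
return B.CreateBitCast(ptr, newPtrType);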

llvm::Constant *lShrinkConstVec(llvm::LLVMContext &context, llvm::Value *originalValue) {
Collaborator

It would be clearer if the method accepted an llvm::Constant* argument.

Collaborator Author

I agree, fixed


for (auto CI : loadsToReplace) {
lReplaceMaskedLoad(builder, CI);
}
Collaborator

Should llvm::PreservedAnalyses::all() be returned in case both storesToReplace and loadsToReplace are empty?

Collaborator Author

Done
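A minimal sketch of the early-out discussed above (assumed shape, not necessarily the exact merged code):

// Nothing matched, so nothing changed: report that all analyses are preserved.
if (storesToReplace.empty() && loadsToReplace.empty())
    return llvm::PreservedAnalyses::all();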


// This function replaces, e.g.,
//
// %ptr = bitcast %v8_uniform_FVector4f* %Result.i to <8 x float>*
Collaborator

What would the generated code look like if the original N = 4 and we cut it down to loading <2 x float>? We need to ensure it wouldn't be worse than just masked.load/store.v4f32.

Collaborator Author

Added tests/lit-tests/2611-2.ispc for this case.

src/opt/ReplaceHalfMaskedMemOps.cpp: three more review threads resolved (outdated)

llvm::Value *lMergeVectors(llvm::IRBuilder<> &B, llvm::Value *firstVector, llvm::Value *secondVector,
llvm::Twine &name) {
auto *firstVecType = llvm::cast<llvm::VectorType>(firstVector->getType());
Collaborator

Why do you use llvm::cast and not llvm::dyn_cast?

auto *firstVecType = llvm::dyn_cast<llvm::VectorType>(firstVector->getType());
auto *secondVecType = llvm::dyn_cast<llvm::VectorType>(secondVector->getType());

Collaborator Author

Done

Collaborator

I still see a mix of llvm::dyn_cast and llvm::cast in the code. That's not a problem if you're certain of the type you are casting to and the program logic guarantees it is correct; if not, I suggest using llvm::dyn_cast.

Collaborator Author

I have changed all llvm::cast to llvm::dyn_cast.
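For readers unfamiliar with the distinction: llvm::cast asserts on a type mismatch, while llvm::dyn_cast returns nullptr, so its result must be checked. A hedged sketch of the checked pattern:

auto *firstVecType = llvm::dyn_cast<llvm::VectorType>(firstVector->getType());
auto *secondVecType = llvm::dyn_cast<llvm::VectorType>(secondVector->getType());
if (!firstVecType || !secondVecType)
    return nullptr; // bail out instead of asserting the way llvm::cast would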

src/opt/ReplaceHalfMaskedMemOps.cpp: one more review thread resolved (outdated)
@nurmukhametov nurmukhametov mentioned this pull request May 2, 2024
@nurmukhametov force-pushed the fix-2611-v3 branch 3 times, most recently from 2ba1c81 to 053a80a on May 9, 2024 15:37
@nurmukhametov (Collaborator, Author)

> Does this PR supersede #2809?

Yes, it does.

@nurmukhametov (Collaborator, Author)

> Would it make sense to add unit LIT tests? E.g., for a given input IR, check the new pass output.

This is addressed by #2845 and tests/lit-tests/2611.ll.

@nurmukhametov changed the title from "WIP: opt: add ReplaceHalfMaskedMemOps pass" to "opt: add ReplaceMaskedMemOps pass" on May 9, 2024
@nurmukhametov force-pushed the fix-2611-v3 branch 2 times, most recently from 60739be to 66213c9 on May 10, 2024 15:01
@aneshlya (Collaborator)

Please format lit-tests with clang-format.

It traverses bitcode for masked stores that have the turned-off second
half and the turned-on first half. We can safely replace them with
narrow unmasked stores, and loads with a following shuffle with the
passthrough value. This can help the back-end to generate better code
(no extra spills, assigning narrow registers).
@nurmukhametov (Collaborator, Author)

> Please format lit-tests with clang-format.

Done

@turinevgeny (Collaborator) left a comment

LGTM!


namespace ispc {

bool lIsPowerOf2(unsigned n) { return (n > 0) && !(n & (n - 1)); }
Collaborator

There are such functions in LLVM, but using them would require additional headers and libraries to link.
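For reference, the LLVM counterpart is llvm::isPowerOf2_32 from llvm/Support/MathExtras.h; a hedged sketch of using it in place of the hand-rolled check:

#include "llvm/Support/MathExtras.h"

// Same semantics as the hand-rolled version, including returning false for 0.
bool lIsPowerOf2(unsigned n) { return llvm::isPowerOf2_32(n); }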

@nurmukhametov merged commit 66c8e1d into ispc:main on May 16, 2024
61 checks passed
@nurmukhametov linked an issue on May 16, 2024 that may be closed by this pull request
Successfully merging this pull request may close these issues:

Efficient codegen for narrower register widths