[vulkan] Efficient gemm implementation #49609

Closed
wants to merge 27 commits into from

Conversation

SS-JIA (Contributor) commented Dec 18, 2020

Stack from ghstack:

Differential Revision: D26209677

SS-JIA pushed a commit that referenced this pull request Dec 18, 2020
ghstack-source-id: 0fc0936
Pull Request resolved: #49609
facebook-github-bot (Contributor) commented Dec 18, 2020

💊 CI failures summary and remediations

As of commit 19eb3aa (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


SS-JIA pushed a commit that referenced this pull request Dec 19, 2020
ghstack-source-id: 48f4a4b
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 21, 2020
ghstack-source-id: b856914
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 22, 2020
ghstack-source-id: a974859
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 22, 2020
ghstack-source-id: accedc6
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 22, 2020
ghstack-source-id: e70be38
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 28, 2020
ghstack-source-id: 2d55a26
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 29, 2020
ghstack-source-id: 19e047b
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Dec 30, 2020
ghstack-source-id: ad85e8a
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Jan 5, 2021
ghstack-source-id: 550874d
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Jan 6, 2021
ghstack-source-id: 5291806
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Jan 7, 2021
ghstack-source-id: afff895
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Jan 7, 2021
ghstack-source-id: 9ac8cff
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Jan 8, 2021
ghstack-source-id: 5065c40
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Feb 2, 2021
ghstack-source-id: 14f2590
Pull Request resolved: #49609
@SS-JIA SS-JIA requested a review from AshkanAliabadi February 2, 2021 19:15
SS-JIA pushed a commit that referenced this pull request Feb 2, 2021
ghstack-source-id: e4d9e29
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Feb 9, 2021
ghstack-source-id: e48a05d
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Feb 9, 2021
ghstack-source-id: 3dd6957
Pull Request resolved: #49609
AshkanAliabadi (Contributor) left a comment

Thanks Stephen!


if (all(lessThan(pos, uBlock.size.xyz))) {
const int base_x = 2*pos.x;
const int base_y = 2*pos.y;
Contributor

Think in terms of vectors. I am not sure this will perform better on modern scalar GPUs with a SIMT architecture (it should not be worse in any case), but it should perform better on older VLIW hardware.

By the way, swizzling in shaders is free.

const ivec2 base = 2 * pos.xy;

const ivec4 index = base + ivec4(0, 1, uBlock.orig_size.x, uBlock.orig_size.x + 1);
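
One possible reading of the suggestion above, sketched in C++ rather than GLSL so that it can be compiled and checked on the host: compute the base coordinate of the 2x2 block once, then derive all four linear indices from a single offset vector. The row-major layout, the `Int2` struct, and the `block_indices` helper are assumptions made for illustration and are not taken from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for GLSL's ivec2.
struct Int2 { int x, y; };

// Returns the four row-major linear indices covered by the 2x2 block at block
// coordinate `pos`, in a source plane that is `width` elements wide.
// Order: top-left, top-right, bottom-left, bottom-right.
// Assumption: row-major layout; the PR's shader may index differently.
std::array<int, 4> block_indices(Int2 pos, int width) {
  assert(width > 0);
  const Int2 base{2 * pos.x, 2 * pos.y};       // mirrors `2 * pos.xy`
  const int origin = base.y * width + base.x;  // index of the top-left element
  const std::array<int, 4> offsets{0, 1, width, width + 1};
  std::array<int, 4> index{};
  for (std::size_t i = 0; i < index.size(); ++i) {
    index[i] = origin + offsets[i];
  }
  return index;
}
```

For a plane 6 elements wide, block (1, 1) starts at element (2, 2), so the helper yields the indices 14, 15, 20 and 21.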

vec4 outvec = vec4(0,0,0,0);
if (base_x < uBlock.orig_size.x && base_y < uBlock.orig_size.y) {
Contributor

This shader is not performance sensitive if it is just a one-time transformation, but branches are still expensive in shaders. In general, it is better to rework the logic to avoid branches where you can.
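
A minimal sketch of one way to avoid the bounds branch, written in C++ for testability: clamp the coordinate so the load is always legal, then multiply the result by a 0/1 mask. The `load_or_zero` helper and the zero-padding behaviour are assumptions for illustration. Whether this is actually faster depends on the compiler and the hardware, and a GLSL version would more likely use built-ins such as `clamp` and `mix`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reads data[y * width + x], or 0.0f when (x, y) is out of bounds, without an
// explicit `if` on the bounds test.
// Assumptions: zero padding is the desired out-of-bounds value, and the data
// is finite (0 * inf would produce NaN).
float load_or_zero(const std::vector<float>& data, int width, int height, int x, int y) {
  assert(width > 0 && height > 0);
  assert(data.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
  // 1.0f when the coordinate is inside the plane, 0.0f otherwise.
  const float in_bounds =
      static_cast<float>((x >= 0) & (x < width) & (y >= 0) & (y < height));
  // Clamp so that the load below never leaves the buffer.
  const int cx = std::min(std::max(x, 0), width - 1);
  const int cy = std::min(std::max(y, 0), height - 1);
  return in_bounds * data[static_cast<std::size_t>(cy * width + cx)];
}
```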

const Shader::Descriptor& shader_descriptor,
const Shader::WorkGroup& global_work_group,
const Shader::WorkGroup& local_work_group_size,
Arguments&&... arguments);
Contributor

Please delete the old version of this function that does not take the local work group size explicitly, and keep only this new version. Then pass local_work_group_size (adapter->blah_blah() - don't remember the name) explicitly at all call sites. We are going to need that flexibility anyway for tweaking the local work group size.
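
To illustrate why the local work group size matters at every call site: `vkCmdDispatch` takes work group counts, so the number of groups to launch is the global extent divided by the local size, rounded up. The sketch below uses a hypothetical `WorkGroup` struct and `group_count` helper as stand-ins; it is not the PR's `Shader::WorkGroup` or dispatch code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Shader::WorkGroup.
struct WorkGroup { std::uint32_t x, y, z; };

// Rounds up so that a partially filled work group is still dispatched.
constexpr std::uint32_t div_up(std::uint32_t n, std::uint32_t d) {
  return (n + d - 1) / d;
}

// Number of work groups to dispatch for a global extent and an explicit
// local work group size.
WorkGroup group_count(const WorkGroup& global, const WorkGroup& local) {
  assert(local.x > 0 && local.y > 0 && local.z > 0);
  return {div_up(global.x, local.x),
          div_up(global.y, local.y),
          div_up(global.z, local.z)};
}
```

With a 10 x 10 x 1 global extent and a 4 x 4 x 1 local size this gives 3 x 3 x 1 groups; changing the local size changes the dispatch, which is why an implicit default is hard to tune.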

VK_IMAGE_PACK_NC4HW_3D = 0,
VK_IMAGE_PACK_NC4HW_2D = 1,
VK_IMAGE_PACK_H2W2 = 2,
} VkImagePackFormat;
Contributor

Do we still need this? Sorry, it may become apparent as I scroll down.

vec4 texel1 = texelFetch(uM1, ivec3(k, pos.y, pos.z), 0);
vec4 texel2 = texelFetch(uM2, ivec3(pos.x, k, pos.z), 0);
sum = fma(texel1.xxzz, texel2.xyxy, sum);
sum = fma(texel1.yyww, texel2.zwzw, sum);
Contributor

Is this a by-product of our new packing?

Contributor Author

Yes, the new packing makes use of the entire input texel.
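
A host-side sketch of what the two `fma` lines compute, assuming each texel holds a 2x2 tile in row-major order (x, y on the top row and z, w on the bottom row). Under that assumption, the swizzled products add up to an ordinary 2x2 matrix product, which is how all four components of each input texel contribute. The `Texel` alias and `tile_fma` helper are hypothetical names, not code from the PR.

```cpp
#include <array>
#include <cstddef>

// A texel holding a 2x2 tile in row-major order:
//   | x y |
//   | z w |
// Assumption: this matches the H2W2 packing discussed above.
using Texel = std::array<float, 4>;  // {x, y, z, w}

// Accumulates the product of tiles a and b into sum, mirroring
//   sum = fma(a.xxzz, b.xyxy, sum);
//   sum = fma(a.yyww, b.zwzw, sum);
Texel tile_fma(const Texel& a, const Texel& b, Texel sum) {
  const Texel a_xxzz{a[0], a[0], a[2], a[2]};
  const Texel a_yyww{a[1], a[1], a[3], a[3]};
  const Texel b_xyxy{b[0], b[1], b[0], b[1]};
  const Texel b_zwzw{b[2], b[3], b[2], b[3]};
  for (std::size_t i = 0; i < sum.size(); ++i) {
    sum[i] += a_xxzz[i] * b_xyxy[i] + a_yyww[i] * b_zwzw[i];
  }
  return sum;
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8} with a zero accumulator, the result is {19, 22, 43, 50}, the row-major product of the two 2x2 matrices.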

},
v_src.options()
};
const struct {
Contributor

Same comment regarding anonymous structs on GCC.

};

uint32_t orig_w = output_sizes[output_sizes.size() - 1];
uint32_t orig_h = output_sizes[output_sizes.size() - 2];
Contributor

const. const everywhere please. I am a const zealot. :)

return v_src_unpacked;
}

vTensor unpack_image1x1(vTensor v_src, c10::SmallVector<int64_t, 4u> output_sizes, api::Context* context, api::Command::Buffer& command_buffer) {
Contributor

Pass all objects larger than two machine words (2 x 64 bits on 64-bit targets, 2 x 32 bits on 32-bit targets) by [const] reference. I add a fudge factor of 2 because pointer chasing and dereferencing (which is effectively what references are: just syntactic sugar for pointers) has a cost, so it is best avoided when the cost of passing by value is small.
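
The rule of thumb above can be written down as a small type alias. This is only a sketch: `param_t`, `Extent2D`, `BigDescriptor`, and `sum_fields` are hypothetical names, not PyTorch APIs, and real code would likely also consider whether the type is cheap to copy.

```cpp
#include <cstdint>
#include <type_traits>

// By value for types no larger than two machine words, by const reference
// otherwise. Hypothetical helper that encodes only the size rule.
template <typename T>
using param_t =
    typename std::conditional<(sizeof(T) <= 2 * sizeof(void*)), T, const T&>::type;

struct Extent2D { std::uint32_t w, h; };            // 8 bytes: pass by value
struct BigDescriptor { std::uint64_t fields[8]; };  // 64 bytes: pass by reference

static_assert(std::is_same<param_t<Extent2D>, Extent2D>::value,
              "small types are passed by value");
static_assert(std::is_same<param_t<BigDescriptor>, const BigDescriptor&>::value,
              "large types are passed by const reference");

// Example signature that follows the rule.
std::uint64_t sum_fields(param_t<BigDescriptor> d, param_t<Extent2D> e) {
  std::uint64_t total = std::uint64_t{e.w} + e.h;
  for (const std::uint64_t f : d.fields) {
    total += f;
  }
  return total;
}
```

By the same size rule, parameters such as `vTensor` and `c10::SmallVector<int64_t, 4u>` in the signature under review would presumably fall on the const-reference side, since the latter stores four 64-bit elements inline.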

vTensor pack_image2d_h2w2(vTensor v_src, api::Context* context, api::Command::Buffer& command_buffer);
vTensor unpack_image2d_h2w2(vTensor v_src, c10::SmallVector<int64_t, 4u> output_sizes, api::Context* context, api::Command::Buffer& command_buffer);

vTensor unpack_image1x1(vTensor v_src, c10::SmallVector<int64_t, 4u> output_sizes, api::Context* context, api::Command::Buffer& command_buffer);
Contributor

If these functions are only used in one single implementation file, please remove the common header. Reason: Software engineering is the art (since it is not all science unfortunately) and science of change management, and the bedrock of managing changes is limiting scope. Limiting scope in general is the single most important tool software engineers have to get a handle on entropy.
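
One common way to limit scope in C++ when helpers are used by a single implementation file is to drop the header declaration and place them in an anonymous namespace, which gives them internal linkage. The sketch below uses hypothetical names (`div_up`, `h2w2_tile_count`) and is not the PR's packing code.

```cpp
#include <cassert>

namespace {

// Internal linkage: visible only inside this translation unit, so no header
// declaration is needed and no other file can come to depend on it.
constexpr int div_up(int n, int d) { return (n + d - 1) / d; }

}  // namespace

// The only symbol this file exposes to the rest of the program.
// Returns how many 2x2 tiles are needed to cover a height x width plane.
int h2w2_tile_count(int height, int width) {
  assert(height >= 0 && width >= 0);
  return div_up(height, 2) * div_up(width, 2);
}
```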


const auto check = almostEqual(out_cpu, out_vulkan.cpu());
if (!check) {
std::cout << "Expected:\n" << out_cpu << std::endl;
std::cout << "Got:\n" << out_vulkan.cpu() << std::endl;
showRtol(out_cpu, out_vulkan.cpu());
Contributor

Please change the other places to use this function as well.
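
For readers unfamiliar with this kind of check, here is a rough sketch of a relative-tolerance comparison on plain float vectors. It is an assumption about the general shape of such a helper; the actual `almostEqual` and `showRtol` in the PyTorch Vulkan tests operate on tensors and may use a different tolerance and formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute value in v (0 for an empty vector).
float max_abs(const std::vector<float>& v) {
  float m = 0.0f;
  for (const float x : v) {
    m = std::max(m, std::fabs(x));
  }
  return m;
}

// True when the largest element-wise difference is at most rtol times the
// largest magnitude found in either input. The default rtol is an assumption.
bool almost_equal(const std::vector<float>& a, const std::vector<float>& b,
                  float rtol = 1e-3f) {
  assert(a.size() == b.size());
  float max_diff = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
  }
  const float scale = std::max(max_abs(a), max_abs(b));
  return max_diff <= rtol * scale;
}
```

Routing every comparison through one helper like this keeps the tolerance policy in a single place, which is presumably the motivation for the request above.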

SS-JIA pushed a commit that referenced this pull request Feb 10, 2021
ghstack-source-id: b5f1c5d
Pull Request resolved: #49609
SS-JIA pushed a commit that referenced this pull request Feb 11, 2021
ghstack-source-id: 2bea274
Pull Request resolved: #49609
@facebook-github-bot
Contributor

@SS-JIA merged this pull request in 6385c13.

@facebook-github-bot facebook-github-bot deleted the gh/SS-JIA/15/head branch February 15, 2021 15:18
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary: Pull Request resolved: pytorch#49609

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26209677

Pulled By: SS-JIA

fbshipit-source-id: 773a944559bf0deb3cf3e233d833220a12f9f2ab