Adds optimization from proxy application #19

dmclark17 · 2020-10-12T21:53:44Z

Ported the proxy app changes from my end to GauXC. I am marking this as a draft because I did not include the downstream changes from the matrix transpose edit, so it is not currently correct.

wavefunction91 · 2020-10-12T22:57:04Z

@dmclark17 I've rebased this locally. Do you want me to push directly, or would you rather review it in a separate branch?

dmclark17 · 2020-10-12T23:02:34Z

I think pushing directly here would work

wavefunction91 · 2020-10-12T23:30:59Z

@dmclark17 Up-to-date. You can toggle your compact collocation kernel implementation by uncommenting the line at
https://github.com/wavefunction91/GauXC/pull/19/files#diff-c52215967dc1bb88bf7adcefa6c71b05a94b16c7af12c87ff3ea889112716b3cR4

and changing the analogous kernel launch. FWIW, this increases the register usage (-gencode sm_70,compute_70 -O3) from 64 to 88 registers per thread.

Note, I'd like to either

* Settle on one (optimized) implementation (i.e. remove the alg variants I have toggled), or
* Determine a criterion (hardware or inputs or both) to dispatch to a particular implementation.

src/integrator/cuda/cuda_eval_denvars.cu

dmclark17 · 2020-10-13T20:34:45Z

For the collocation_device_masked_combined_kernel_deriv1 kernel, I am also seeing that the grid-stride version is slower (135ms vs. 152ms for Taxol). I think we can stick with your implementation as the increased register usage is detrimental.
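To put the 64 vs. 88 register figures in context, here is a rough sketch of the occupancy ceiling that register usage alone imposes on an sm_70 device. It is not part of the PR. The function name is hypothetical, and the estimate deliberately ignores register allocation granularity, block size, and shared memory, all of which can only lower the real number.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM hardware limits for compute capability 7.0 (V100).
constexpr int kRegsPerSM       = 65536;  // 32-bit registers
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kWarpSize        = 32;

// Upper bound on resident warps per SM from register usage alone.
// Simplified: allocation granularity, block size, and shared memory
// limits are ignored, so the true occupancy may be lower.
// (Hypothetical helper, not a GauXC function.)
int max_resident_warps(int regs_per_thread) {
  assert(regs_per_thread > 0);
  const int regs_per_warp = regs_per_thread * kWarpSize;
  const int warp_limit    = kMaxThreadsPerSM / kWarpSize;  // 64 warps
  return std::min(kRegsPerSM / regs_per_warp, warp_limit);
}
```

Under these assumptions, `max_resident_warps(64)` gives 32 warps (1024 threads) per SM and `max_resident_warps(88)` gives 23 warps (736 threads), so the compact kernel has less latency-hiding headroom before any other limit is considered.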

wavefunction91 · 2020-10-13T20:38:05Z

I'm wondering if this might be because we're striding on multiple dimensions. It's possible that a stride along x (i.e. the grid points) is beneficial while a stride along y (shells) is detrimental, or vice versa.
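One way to test that hypothesis is to make the stride in each dimension an independent compile-time switch. The sketch below is a host-side C++ model of the index arithmetic only, not the GauXC kernel. All names are hypothetical, and in device code `tid_x`/`tid_y` and `dim_x`/`dim_y` would come from `threadIdx`, `blockIdx`, `blockDim`, and `gridDim`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Total launched threads per dimension (blockDim * gridDim on a device).
struct Launch2D { int dim_x, dim_y; };

// Work of one emulated thread. x indexes grid points, y indexes shells.
// A non-striding dimension uses the full extent as its step, so the loop
// runs at most once and behaves like a bounds check `if (tid < extent)`.
template <bool StrideX, bool StrideY>
void visit(int tid_x, int tid_y, Launch2D launch, int npts, int nshells,
           std::vector<int>& hits) {
  assert(launch.dim_x > 0 && launch.dim_y > 0 && npts > 0 && nshells > 0);
  const int step_x = StrideX ? launch.dim_x : npts;
  const int step_y = StrideY ? launch.dim_y : nshells;
  for (int ipt = tid_x; ipt < npts; ipt += step_x)
    for (int ish = tid_y; ish < nshells; ish += step_y)
      ++hits[static_cast<std::size_t>(ish) * npts + ipt];  // points fastest
}

// Runs every emulated thread and checks that each (point, shell) pair
// was visited exactly once.
template <bool StrideX, bool StrideY>
bool covers_exactly_once(Launch2D launch, int npts, int nshells) {
  std::vector<int> hits(static_cast<std::size_t>(npts) * nshells, 0);
  for (int ty = 0; ty < launch.dim_y; ++ty)
    for (int tx = 0; tx < launch.dim_x; ++tx)
      visit<StrideX, StrideY>(tx, ty, launch, npts, nshells, hits);
  for (int h : hits)
    if (h != 1) return false;
  return true;
}
```

The model makes the launch constraint explicit: a dimension that does not stride needs at least as many threads as its extent, so `covers_exactly_once<true, false>(Launch2D{8, 4}, 100, 10)` fails while the same call with `Launch2D{8, 10}` succeeds. Timing the four `<StrideX, StrideY>` combinations on the real kernel would show which dimension, if either, is responsible for the slowdown.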

src/integrator/cuda/cuda_eval_denvars.cu

wavefunction91 · 2020-10-15T19:06:16Z

@dmclark17 Is there anything else you'd like to include in this PR?

* [CI] Try container based CI
* [CI] Typo
* [CI] Typo
* [CI] Typo
* [CI] Typo
* [CI] Typo
* [CI] Add BLIS linkage
* [CI] Reenable LLVM in tests
* [CI] Reenable LLVM in tests
* [CI] Reenable LLVM in tests
* [CI] Reenable LLVM in tests
* [CI] typo
* [CI] Renable Debug + subproject tests
* [CI] Renable Debug + subproject tests
* [CI] Renable Debug + subproject tests
* [CI] Renable Debug + subproject tests
* [CI] Some cleanup
* [CI] Some cleanup
* [CI] Some cleanup
* [CI] Use installed LibXC in Docker container
* [CMake] bug in discovery export
* [CMake] Add ExchCXX discovery in export config... how did this ever work??
* [CI] Try running on self-hosted
* [CI] Enable CUDA CI
* [CI] Pass CMAKE_CUDA_ARCHITECTURES to GH Actions toolchain
* [CI] Disable MPI for CUDA tests
* Fix CUDA + no MPI Build
* Fix CUDA + no MPI Build
* Actually fix CUDA + no MPI
* Disable pinned vector for CUDA 12

wavefunction91 and others added 15 commits September 25, 2020 18:53

[CUDA] Added template parameter and __restrict__ keyword in angular collocation functions

1fedeb0

[CUDA] Propagate __restrict__ in collocation

abab6a4

[CUDA] Added npts to angular function signatures, force no inline to reduce register pressure

4308c46

Merge branch 'master' into cuda/opt_mem_access

70f4e9f

Merge branch 'master' into cuda/opt_mem_access

7ddd9d4

Merge branch 'master' into cuda/opt_mem_access

4fc2dfe

[CUDA] Transposed (col major -> row major) access patterns in CUDA integrator O(2-5x) improvement in GFLOP/s for collocation on V100 + PPC

8177db6

[CUDA] Experimenting with shmem optimization in collocation, yields minor performance degredation, disabled for now

312409a

Merge branch 'master' into cuda/opt_mem_access

1ab0e13

Merge branch 'cuda/opt_mem_access' into cuda/shmem_collocation_opt

3f782f9

[CUDA] Changed kernel launch in Z matrix to have points as fastest running

2a15942

Merge branch 'master' into cuda/opt_mem_access

ec4a574

Merge branch 'master' into cuda/opt_mem_access

4d89e24

Adds optimization

0c33841

Merge branch 'cuda/opt_mem_access' into clark/opt

9224958

[CUDA] Reenable compact collocation kernel of @dmclark17 + misc cleanup

c015f26

wavefunction91 added the cuda (CUDA related Issue) and enhancement (New feature or request) labels Oct 12, 2020

wavefunction91 reviewed Oct 12, 2020

src/integrator/cuda/cuda_eval_denvars.cu (outdated)

src/integrator/cuda/cuda_eval_denvars.cu

wavefunction91 mentioned this pull request Oct 13, 2020

[WIP][CUDA] Split CUDA collocation into radial and angular parts #20

Closed

wavefunction91 and others added 4 commits October 13, 2020 17:23

[CUDA] Reenable warp level reductions by default

0eadb5b

Merge branch 'master' into clark/opt

2aa640e

Changes launch parameters for gga kernel

c0d1c15

Merge branch 'clark/opt' of ssh://gitlab-master.nvidia.com:12051/daclark/gauxc-mirror into clark/opt

5336a1b

wavefunction91 reviewed Oct 14, 2020

src/integrator/cuda/cuda_eval_denvars.cu

wavefunction91 added 3 commits October 15, 2020 13:09

[CUDA] Cleanup of CUDA collocation

99fdddf

[CUDA] Cherry-pick cuda_kernel_max_threads_per_block utility function

9bbba9f

[CUDA] GGA_KERNEL_SM_BLOCK_X -> warp_size

35e006e

dmclark17 marked this pull request as ready for review October 15, 2020 19:23

wavefunction91 approved these changes Oct 15, 2020

wavefunction91 changed the title ~~WIP: Adds optimization from proxy application~~ Adds optimization from proxy application Oct 15, 2020

wavefunction91 merged commit 02fa8d9 into wavefunction91:master Oct 15, 2020
