
@dmclark17
Contributor

Ported proxy app changes from my end to GauXC. I am marking as a draft because I did not include the downstream changes from the matrix transpose edit so it is not currently correct.

@wavefunction91
Owner

@dmclark17 I've rebased this locally, do you want me to push directly or have you review in a separate branch?

@dmclark17
Contributor Author

I think pushing directly here would work

@wavefunction91 added the cuda and enhancement labels Oct 12, 2020
@wavefunction91
Owner

wavefunction91 commented Oct 12, 2020

@dmclark17 Up-to-date. You can toggle your compact collocation kernel implementation by uncommenting
https://github.com/wavefunction91/GauXC/pull/19/files#diff-c52215967dc1bb88bf7adcefa6c71b05a94b16c7af12c87ff3ea889112716b3cR4

and changing the analogous kernel launch. FWIW, this increases the register usage (-gencode sm_70,compute_70 -O3) from 64 to 88.

Note, I'd like to either

  1. Settle on one (optimized) implementation (i.e. remove the alg variants I have toggled)
  2. Determine criteria (hardware, inputs, or both) for dispatching to a particular implementation.
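
Option 2 could look roughly like the host-side selector sketched below, which picks a kernel variant from the device's compute capability and the batch size. Everything here is a hypothetical placeholder rather than GauXC code: the `CollocationAlg` enum, the `DispatchInfo` struct, the `select_collocation_alg` function, and especially the thresholds, which would have to come from benchmarks.

```cpp
#include <cstddef>

// Hypothetical tags for the two collocation variants discussed in this thread.
enum class CollocationAlg { PerThread, GridStride };

// Hypothetical description of the hardware and inputs known at dispatch time.
struct DispatchInfo {
  int         sm_major;  // compute capability major version, e.g. 7 for sm_70
  std::size_t npts;      // number of grid points in the batch
  std::size_t nshells;   // number of shells in the batch
};

// Placeholder policy: the cut-offs are illustrative, not measured.
inline CollocationAlg select_collocation_alg(const DispatchInfo& info) {
  const std::size_t work          = info.npts * info.nshells;
  const bool        large_problem = work >= (std::size_t(1) << 20);
  const bool        newer_device  = info.sm_major >= 8;
  return (large_problem && newer_device) ? CollocationAlg::GridStride
                                         : CollocationAlg::PerThread;
}
```

The call site would then switch on the returned tag to launch the matching kernel, which keeps the policy in one place if the criteria change later.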

@dmclark17
Contributor Author

For the collocation_device_masked_combined_kernel_deriv1 kernel, I am also seeing that the grid-stride version is slower (135 ms vs. 152 ms for Taxol). I think we can stick with your implementation, as the increased register usage is detrimental.

@wavefunction91
Owner

I'm wondering if this might be because we're striding along multiple dimensions. It's possible that a stride along x (i.e. the grid points) is beneficial while a stride along y (shells) is detrimental, or vice versa.
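
To make the x-only versus y-only idea concrete, here is a small host-side emulation of the index pattern, not the actual kernel. It assumes a 2D launch in which `tx` stands for `blockIdx.x * blockDim.x + threadIdx.x` and `gx` for `gridDim.x * blockDim.x` (likewise for y). The function names and the row-major `counts` layout are hypothetical. Each dimension can be strided independently, and the helper checks that every (point, shell) pair is still visited exactly once.

```cpp
#include <cstddef>
#include <vector>

// Emulate every thread of a gx-by-gy launch on the host and count how often
// each (point, shell) pair is visited. stride_x / stride_y enable a
// grid-stride loop along that dimension; without it a thread only handles
// its own index, so the launch itself must cover that dimension.
inline std::vector<int> emulate_launch(std::size_t npts, std::size_t nshells,
                                       std::size_t gx, std::size_t gy,
                                       bool stride_x, bool stride_y) {
  std::vector<int> counts(npts * nshells, 0);
  // A step equal to the extent yields at most one iteration (no striding).
  const std::size_t step_x = stride_x ? gx : npts;
  const std::size_t step_y = stride_y ? gy : nshells;
  for (std::size_t ty = 0; ty < gy; ++ty)
    for (std::size_t tx = 0; tx < gx; ++tx)
      for (std::size_t j = ty; j < nshells; j += step_y)    // shells (y)
        for (std::size_t i = tx; i < npts; i += step_x)     // grid points (x)
          ++counts[j * npts + i];
  return counts;
}

// True if every (point, shell) pair was visited exactly once.
inline bool covers_once(const std::vector<int>& counts) {
  for (int c : counts)
    if (c != 1) return false;
  return true;
}
```

For example, `emulate_launch(10, 3, 4, 3, true, false)` strides only along the grid points and relies on the launch to cover all three shells. In a real kernel the two variants could be timed separately to see which dimension is responsible for the slowdown.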

@wavefunction91
Owner

@dmclark17 Is there anything else you'd like to include in this PR?

@dmclark17 marked this pull request as ready for review October 15, 2020 19:23
@wavefunction91 changed the title from "WIP: Adds optimization from proxy application" to "Adds optimization from proxy application" Oct 15, 2020
@wavefunction91 merged commit 02fa8d9 into wavefunction91:master Oct 15, 2020
wavefunction91 added a commit that referenced this pull request Jun 13, 2023
* [CI] Try container based CI

* [CI] Typo

* [CI] Typo

* [CI] Typo

* [CI] Typo

* [CI] Typo

* [CI] Add BLIS linkage

* [CI] Reenable LLVM in tests

* [CI] Reenable LLVM in tests

* [CI] Reenable LLVM in tests

* [CI] Reenable LLVM in tests

* [CI] typo

* [CI] Renable Debug + subproject tests

* [CI] Renable Debug + subproject tests

* [CI] Renable Debug + subproject tests

* [CI] Renable Debug + subproject tests

* [CI] Some cleanup

* [CI] Some cleanup

* [CI] Some cleanup

* [CI] Use installed LibXC in Docker container

* [CMake] bug in discovery export

* [CMake] Add ExchCXX discovery in export config... how did this ever work??

* [CI] Try running on self-hosted

* [CI] Enable CUDA CI

* [CI] Pass CMAKE_CUDA_ARCHITECTURES to GH Actions toolchain

* [CI] Disable MPI for CUDA tests

* Fix CUDA + no MPI Build

* Fix CUDA + no MPI Build

* Actually fix CUDA + no MPI

* Disable pinned vector for CUDA 12
