Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays #1665

Merged
merged 1 commit into from
Sep 8, 2017

Conversation

tjfulle
Copy link
Contributor

@tjfulle tjfulle commented Aug 31, 2017

Work in Progress pull request to discuss with @mhoemmen

This pull request addresses thread parallelization of unpackAndCombineIntoCrsArrays. The current state passes all standard tests with and without OpenMP. All @trilinos/tpetra tests pass on CUDA (I did not run any other tests).

@tjfulle tjfulle requested a review from mhoemmen August 31, 2017 16:22
@bartlettroscoe bartlettroscoe added the stage: in progress Work on the issue has started label Aug 31, 2017
@tjfulle
Copy link
Contributor Author

tjfulle commented Aug 31, 2017

There is only one section of the (new) unpacking code that is not yet thread parallel. Once that is complete, this PR will be complete.

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 1, 2017

@mhoemmen, with my latest changes, all Tpetra tests pass on CUDA, all Tpetra tests pass on Darwin with OpenMP node type, and all Tpetra+downstream tests pass with the checkin script. However, two @trilinos/muelu tests fail for OpenMP node types. But, I don't think the issue is with this unpacking work, since they also fail on the HEAD of the develop branch. I'm bisecting the git history as we speak to determine when they started failing for OpenMP node types. Unfortunately, it takes sooooooooooooooooooooo long to recompile...

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 1, 2017

@tjfulle See #1660 -- it looks like MueLu uses Mehmet's sparse matrix-matrix multiply now by default, when OpenMP is enabled. I'm not sure if it's possible to disable this with a CMake option any more, but you could try the following, just to see if tests pass with that code disabled:

  -D KokkosKernels_ENABLE_Experimental:BOOL=ON
  -D MueLu_ENABLE_Kokkos_Refactor=ON
  -D Xpetra_ENABLE_Kokkos_Refactor=ON

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 1, 2017

@mhoemmen the failing tests are Muelu_TpetraUnitTes_MPI_[1,4]. My machine is running git bisect right now, I should have the offending commit in a few hours

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 1, 2017

Editing last comment to add that those are approximate names, I don't recall the exact names and I am away fro a computer

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 1, 2017

the failing tests are Muelu_TpetraUnitTes_MPI_[1,4].

Ross mentioned that there was a MueLu issue with those tests. Scroll to the bottom of #1304 (today's comments). I think it would be wise to wait until those tests pass, alas -- MueLu is the code that uses CrsMatrix pack and unpack the most.

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 1, 2017

@mhoemmen , I was not planning on removing the [WIP] label until those tests pass. When the git bisect magic is done, I'll post the failing commit commit here and #1304

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 1, 2017

@mhoemmen , once #1677 is fixed, I can retest this commit and it should be good to go

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 2, 2017

@mhoemmen , the fix for #1677 that @mndevec pushed resolved the test failures. This PR passes the standard checkin script tests with/without OpenMP and passes tpetra tests on CUDA.

@tjfulle tjfulle changed the title [WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 2, 2017
@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 2, 2017

@tjfulle Sweet! :-D

Albany has been reporting CUDA build issues involving Stokhos and Tpetra. See

https://github.com/gahansen/Albany/issues/176

I'm worried that I built Stokhos and Tpetra, but wasn't able to manifest this build error. Could you take a look? Thanks! :-D

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 3, 2017

@mhoemmen I'll take a look on Tuesday morning - I'm checking out for the holiday.

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments. Thanks!

# Tpetra::Details::unpackCrsMatrixAndCombine. Do so for the same
# Tpetra::Details::unpackCrsMatrixAndCombine,
# Tpetra::Details::unpackAndCombineIntoCrsArrays, and
# Tpetra::Details::unpackAndCombineWithOwningPIDsCount. Do so for the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for now, but later we could think about splitting these into separate files, for more parallel builds.

@@ -8172,12 +8166,12 @@ namespace Tpetra {
// in a huge list of arrays is icky. Can't we have a bit of an
// abstraction? Implementing a concrete DistObject subclass only
// takes five methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment still stands, but it stands for the future, not for this PR ;-) .

/// \param numPacketsPerLID [out] Entry k gives the number of bytes
/// packed for row exportLIDs[k] of the local matrix.
///
/// \param ixportLIDs [in] Local indices of the rows to pack.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

importLIDs not "ixportLIDs"

///
/// \param sourceMatrix [in] the CrsMatrix source
///
/// \param ixports [in/out] Output pack buffer; resized if needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"imports" not "ixports". More importantly, imports is a const Teuchos::ArrayView<const char>&, so it can't possibly be "resized if needed." (This could be just a matter of fixing the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied and pasted documentation. Perhaps at some time in the past it was an input/output? I'll fix the documentation to reflect the current state

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do; thanks!


if (num_bytes_per_value == 0) {
ST val; // packValueCount wants this
num_bytes_per_value = PackTraits<ST, Device>::packValueCount(val);
num_bytes_per_value = PackTraits<ST, DT>::packValueCount(ST());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break Stokhos, due to the assumption that default-constructed ST has the right size, but it was broken before anyway ;-) . Thus, it's OK to defer fixing this to a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this branch is unused and can be removed. All callers to this function send in a value > 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjfulle What's actually the point of the num_bytes_per_value branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe when I put the branch in I was combining the several unpacking schemes and this branch allowed for one case that the num_bytes_per_value was not known and Scalar was default constructed. On further combining/refactoring, that case went away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjfulle It looks like the idea is that Stokhos is supposed to pass in the right num_bytes_per_value value -- is that right? (@etphipp)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stohkos never calls this function, it calls packValueCount and friends. This function is a combination of similar functions in the unpackCrsMatrix and unpackWithOwningPIDs functions. Stokhos use of packValueCount is pretty simple and will allow simplifying packValueCount as @ibaned has advocated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just looking at the code in full now and remember why the branch exists. When unpacking the CrsMatrix, matrix values are not known a-priori, so the only (current) way of getting the size of the values is by the default constructor. When unpacking in to CrsArrays, at least some matrix values are known and those values can be used to determine the size needed to unpack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified the code to push the computation of the number of bytes per value further up stream. It does not remove the issue, but simplifies this function by removing the branch.

struct TotNumEntTag {};

/// \brief Functor to determine the number of entries in a matrix using
/// Kokkos::parallel_reduce
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation looks wrong -- this is a functor to determine either the total number of bytes required for unpacking incoming matrix entries, or the maximum number of bytes required for unpacking a row of incoming matrix entries, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the fun for does compute the max and total number of entries for array allocation purposes. They can probably be combined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tag could be more explicit, it's the max bum entries on any one row. It's used to allocate scratch space on device

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhoemmen said:

if so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.

I suppose they could be combined, but it would not save work done as the two counting functions are not called in the same procedure. Unpacking in to the CrsMatrix requires the maximum number of entries in any one row and unpacking in to CrsArrays requires total number of entries, thus two separate counting procedures (but only one functor)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it would not save work done as the two counting functions are not called in the same procedure.

It seems like you could compute both of them once, before anybody needs them, and then pass along those two integer values (the max and the total).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I wasn't too clear. The total is needed for transferAndFillComplete and the maximum is needed for packAndPrepare. I'm having a hard time seeing how computing the max and total at once would be beneficial as these two procedures are independent.

tjfulle added a commit to tjfulle/Trilinos that referenced this pull request Sep 4, 2017
Addresses feedback from @mhoemmen for PR trilinos#1665
@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 4, 2017

@mhoemmen, the latest commit addresses your comments. Passes all Tpetra+downstream on RHEL6 with/without OpenMP and Tpetra tests pass on CUDA 8.0.44 (I really ought to write a script that automagically runs all "standard" Tpetra tests for me...)

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 4, 2017

@tjfulle Not to scold, but I'm just curious: Do you have trouble getting downstream tests to build with CUDA? I would say, as long as MueLu and Stokhos pass with CUDA, we're golden :-) .

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 4, 2017

I really ought to write a script that automagically runs all "standard" Tpetra tests for me

If you mean "run Trilinos tests with all combinations of OpenMP and CUDA options," Christian might have a script that you could modify and use.

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 4, 2017 via email

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 4, 2017

I've only built and run Tpetra tests, I can run the others later tonight or tomorrow

It's trickier to get CUDA builds to work with downstream stuff. You may have to disable some packages that don't matter so much for MueLu and Stokhos testing.

Also, no hurry :) .

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 4, 2017

If you mean "run Trilinos tests with all combinations of OpenMP and CUDA options," Christian might have a script that you could modify and use.

Kinda, I was more thinking a script like Ross' remote testing script that would build and run tests on different remote machines to test all important builds simultaneously

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 4, 2017

@tjfulle wrote:

I was more thinking a script like Ross' remote testing script that would build and run tests on different remote machines to test all important builds simultaneously

I think we could build something like that out of Ross' script. It would make sense to farm out the test platforms to a cloud service. Ross is probably working on that already!

@etphipp
Copy link
Contributor

etphipp commented Sep 6, 2017

The fix will be in Stokhos.

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 6, 2017

@etphipp Awesome, thanks for digging into this! Glad we could help find a bug! :-D

@tjfulle I can do CUDA tests on my workstation if that could help, though that didn't catch everything, as we saw with the recent Albany issue.... Thanks for working on this :-D .

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 6, 2017

@mhoemmen , I can run tests easily on Cuda, but debugging is a bit clumsy on an interactive node.

@etphipp
Copy link
Contributor

etphipp commented Sep 6, 2017

I forgot to mention this issue in the commit message, but I just pushed a change to Stokhos that appears to fix the failing Stokhos test.

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 6, 2017

Thanks @etphipp !

@tjfulle tjfulle changed the title Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays [WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 7, 2017
@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 7, 2017

Adding the [WIP] label until the failing Stokhos tests can be resolved.

@etphipp
Copy link
Contributor

etphipp commented Sep 7, 2017

OK, I believe all of the failing Stokhos tests have been resolved. There was a problem with compute capability 6.x that was preventing the PCE mat-vec from functioning properly, which was causing the failures on ride (and nowhere else). All of the Stokhos tests pass on ride for me now.

- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up in to many many smaller functions (it
  was previously one large monolithic function).  Each of the small functions
  was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and there results amended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699

The failing Stokhos tests mentioned in trilinos#1655 were fixed with
commit e97e37b
@tjfulle tjfulle changed the title [WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 8, 2017
@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 8, 2017

@mhoemmen, with the updates to Stokhos, all tests are passing on RHEL6 and CUDA. This is PR is ready once it passes your review.

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

@tjfulle SWEET :-D I'll (re-)review!

@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 8, 2017

I apologize that the commit is so large! Swapping in the PackTraits code required a more extensive refactor and we wanted a single atomic commit.

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

I'm working on #1706 first, then this :-)

@mhoemmen mhoemmen merged commit 475118b into trilinos:develop Sep 8, 2017
@mhoemmen mhoemmen removed the stage: in progress Work on the issue has started label Sep 8, 2017
@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

@tjfulle turns out I can multitask! ;-) thanks for working on this!!! :-D

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

Hm, did we ever test this for complex builds? I'm getting some build errors.... I'll work on it though.

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

I've got a fix for complex builds ready and will push soon.

mhoemmen pushed a commit that referenced this pull request Sep 8, 2017
@trilinos/tpetra If Scalar could be std::complex<T>, it needs to turn
into impl_scalar_type (via reinterpret_cast) before it enters the
Kokkos world.
@tjfulle
Copy link
Contributor Author

tjfulle commented Sep 8, 2017

Doh! The last push was not tested with complex. What was the error?

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 8, 2017

@tjfulle Just pushed the fix for complex builds. There was one spot that was passing a Teuchos::ArrayView<Scalar> into create_mirror_view_from_raw_host_array. This resulted in a Kokkos::View<Scalar*, ...>, rather than a Kokkos::View<impl_scalar_type*, ...>.

I think we could fix create_mirror_view_from_raw_host_array to use ArithTraits to get the right type for the View.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants