Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays #1665

tjfulle · 2017-08-31T16:22:47Z

Work in Progress pull request to discuss with @mhoemmen

This pull request addresses thread parallelization of unpackAndCombineIntoCrsArrays. The current state passes all standard tests with and without OpenMP. All @trilinos/tpetra tests pass on CUDA (I did not run any other tests).

tjfulle · 2017-08-31T16:24:06Z

There is only one section of the (new) unpacking code that is not yet thread parallel. Once that is complete, this PR will be complete.

tjfulle · 2017-09-01T17:01:49Z

@mhoemmen, with my latest changes, all Tpetra tests pass on CUDA, all Tpetra tests pass on Darwin with OpenMP node type, and all Tpetra+downstream tests pass with the checkin script. However, two @trilinos/muelu tests fail for OpenMP node types. But, I don't think the issue is with this unpacking work, since they also fail on the HEAD of the develop branch. I'm bisecting the git history as we speak to determine when they started failing for OpenMP node types. Unfortunately, it takes sooooooooooooooooooooo long to recompile...

mhoemmen · 2017-09-01T21:57:14Z

@tjfulle See #1660 -- it looks like MueLu uses Mehmet's sparse matrix-matrix multiply now by default, when OpenMP is enabled. I'm not sure if it's possible to disable this with a CMake option any more, but you could try the following, just to see if tests pass with that code disabled:

  -D KokkosKernels_ENABLE_Experimental:BOOL=ON
  -D MueLu_ENABLE_Kokkos_Refactor=ON
  -D Xpetra_ENABLE_Kokkos_Refactor=ON

tjfulle · 2017-09-01T22:05:04Z

@mhoemmen the failing tests are Muelu_TpetraUnitTes_MPI_[1,4]. My machine is running git bisect right now, I should have the offending commit in a few hours

tjfulle · 2017-09-01T22:06:13Z

Editing last comment to add that those are approximate names, I don't recall the exact names and I am away fro a computer

mhoemmen · 2017-09-01T22:21:21Z

the failing tests are Muelu_TpetraUnitTes_MPI_[1,4].

Ross mentioned that there was a MueLu issue with those tests. Scroll to the bottom of #1304 (today's comments). I think it would be wise to wait until those tests pass, alas -- MueLu is the code that uses CrsMatrix pack and unpack the most.

tjfulle · 2017-09-01T22:30:21Z

@mhoemmen , I was not planning on removing the [WIP] label until those tests pass. When the git bisect magic is done, I'll post the failing commit commit here and #1304

tjfulle · 2017-09-01T23:08:28Z

@mhoemmen , once #1677 is fixed, I can retest this commit and it should be good to go

tjfulle · 2017-09-02T20:29:38Z

@mhoemmen , the fix for #1677 that @mndevec pushed resolved the test failures. This PR passes the standard checkin script tests with/without OpenMP and passes tpetra tests on CUDA.

mhoemmen · 2017-09-02T23:56:20Z

@tjfulle Sweet! :-D

Albany has been reporting CUDA build issues involving Stokhos and Tpetra. See

https://github.com/gahansen/Albany/issues/176

I'm worried that I built Stokhos and Tpetra, but wasn't able to manifest this build error. Could you take a look? Thanks! :-D

tjfulle · 2017-09-03T01:12:30Z

@mhoemmen I'll take a look on Tuesday morning - I'm checking out for the holiday.

mhoemmen

See inline comments. Thanks!

mhoemmen · 2017-09-03T22:18:19Z

packages/tpetra/core/src/CMakeLists.txt

-  # Tpetra::Details::unpackCrsMatrixAndCombine.  Do so for the same
+  # Tpetra::Details::unpackCrsMatrixAndCombine,
+  # Tpetra::Details::unpackAndCombineIntoCrsArrays, and
+  # Tpetra::Details::unpackAndCombineWithOwningPIDsCount.  Do so for the same


Fine for now, but later we could think about splitting these into separate files, for more parallel builds.

mhoemmen · 2017-09-03T22:20:00Z

packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp

@@ -8172,12 +8166,12 @@ namespace Tpetra {
    // in a huge list of arrays is icky.  Can't we have a bit of an
    // abstraction?  Implementing a concrete DistObject subclass only
    // takes five methods.


My comment still stands, but it stands for the future, not for this PR ;-) .

mhoemmen · 2017-09-03T22:20:51Z

packages/tpetra/core/src/Tpetra_Details_unpackCrsMatrixAndCombine_decl.hpp

+/// \param numPacketsPerLID [out] Entry k gives the number of bytes
+///   packed for row exportLIDs[k] of the local matrix.
+///
+/// \param ixportLIDs [in] Local indices of the rows to pack.


importLIDs not "ixportLIDs"

mhoemmen · 2017-09-03T22:22:30Z

packages/tpetra/core/src/Tpetra_Details_unpackCrsMatrixAndCombine_decl.hpp

+///
+/// \param sourceMatrix [in] the CrsMatrix source
+///
+/// \param ixports [in/out] Output pack buffer; resized if needed.


"imports" not "ixports". More importantly, imports is a const Teuchos::ArrayView<const char>&, so it can't possibly be "resized if needed." (This could be just a matter of fixing the documentation.

I copied and pasted documentation. Perhaps at some time in the past it was an input/output? I'll fix the documentation to reflect the current state

Please do; thanks!

mhoemmen · 2017-09-03T23:08:21Z

packages/tpetra/core/src/Tpetra_Details_unpackCrsMatrixAndCombine_def.hpp


  if (num_bytes_per_value == 0) {
-    ST val; // packValueCount wants this
-    num_bytes_per_value = PackTraits<ST, Device>::packValueCount(val);
+    num_bytes_per_value = PackTraits<ST, DT>::packValueCount(ST());


This will break Stokhos, due to the assumption that default-constructed ST has the right size, but it was broken before anyway ;-) . Thus, it's OK to defer fixing this to a separate PR.

I think this branch is unused and can be removed. All callers to this function send in a value > 0

@tjfulle What's actually the point of the num_bytes_per_value branch?

I believe when I put the branch in I was combining the several unpacking schemes and this branch allowed for one case that the num_bytes_per_value was not known and Scalar was default constructed. On further combining/refactoring, that case went away.

@tjfulle It looks like the idea is that Stokhos is supposed to pass in the right num_bytes_per_value value -- is that right? (@etphipp)

Stohkos never calls this function, it calls packValueCount and friends. This function is a combination of similar functions in the unpackCrsMatrix and unpackWithOwningPIDs functions. Stokhos use of packValueCount is pretty simple and will allow simplifying packValueCount as @ibaned has advocated.

I'm just looking at the code in full now and remember why the branch exists. When unpacking the CrsMatrix, matrix values are not known a-priori, so the only (current) way of getting the size of the values is by the default constructor. When unpacking in to CrsArrays, at least some matrix values are known and those values can be used to determine the size needed to unpack.

I've modified the code to push the computation of the number of bytes per value further up stream. It does not remove the issue, but simplifies this function by removing the branch.

mhoemmen · 2017-09-03T23:10:58Z

packages/tpetra/core/src/Tpetra_Details_unpackCrsMatrixAndCombine_def.hpp

+struct TotNumEntTag {};
+
+/// \brief Functor to determine the number of entries in a matrix using
+/// Kokkos::parallel_reduce


The documentation looks wrong -- this is a functor to determine either the total number of bytes required for unpacking incoming matrix entries, or the maximum number of bytes required for unpacking a row of incoming matrix entries, right?

If so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.

I believe the fun for does compute the max and total number of entries for array allocation purposes. They can probably be combined

The tag could be more explicit, it's the max bum entries on any one row. It's used to allocate scratch space on device

@mhoemmen said:

if so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.

I suppose they could be combined, but it would not save work done as the two counting functions are not called in the same procedure. Unpacking in to the CrsMatrix requires the maximum number of entries in any one row and unpacking in to CrsArrays requires total number of entries, thus two separate counting procedures (but only one functor)

but it would not save work done as the two counting functions are not called in the same procedure.

It seems like you could compute both of them once, before anybody needs them, and then pass along those two integer values (the max and the total).

Maybe I wasn't too clear. The total is needed for transferAndFillComplete and the maximum is needed for packAndPrepare. I'm having a hard time seeing how computing the max and total at once would be beneficial as these two procedures are independent.

@mhoemmen

Addresses feedback from @mhoemmen for PR trilinos#1665

tjfulle · 2017-09-04T05:26:38Z

@mhoemmen, the latest commit addresses your comments. Passes all Tpetra+downstream on RHEL6 with/without OpenMP and Tpetra tests pass on CUDA 8.0.44 (I really ought to write a script that automagically runs all "standard" Tpetra tests for me...)

mhoemmen · 2017-09-04T17:47:53Z

@tjfulle Not to scold, but I'm just curious: Do you have trouble getting downstream tests to build with CUDA? I would say, as long as MueLu and Stokhos pass with CUDA, we're golden :-) .

mhoemmen · 2017-09-04T17:48:44Z

I really ought to write a script that automagically runs all "standard" Tpetra tests for me

If you mean "run Trilinos tests with all combinations of OpenMP and CUDA options," Christian might have a script that you could modify and use.

tjfulle · 2017-09-04T17:50:06Z

I've only built and run Tpetra tests, I can run the others later tonight or tomorrow Cheers,

…

-- Tim -- Tim Fuller, PhD Scalable Algorithms Organization 1426 Sandia National Laboratories E: tjfulle@sandia.gov<mailto:tjfulle@sandia.gov> P: (505) 205-0003 On Sep 4, 2017, at 11:48 AM, Mark Hoemmen <notifications@github.com<mailto:notifications@github.com>> wrote: @tjfulle<https://github.com/tjfulle> Not to scold, but I'm just curious: Do you have trouble getting downstream tests to build with CUDA? I would say, as long as MueLu and Stokhos pass with CUDA, we're golden :-) . - You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#1665 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABfe5_1U6LXIDpqUlkm_IHierEZPgjV6ks5sfDfKgaJpZM4PJFI1>.

mhoemmen · 2017-09-04T18:18:06Z

I've only built and run Tpetra tests, I can run the others later tonight or tomorrow

It's trickier to get CUDA builds to work with downstream stuff. You may have to disable some packages that don't matter so much for MueLu and Stokhos testing.

Also, no hurry :) .

tjfulle · 2017-09-04T18:30:02Z

If you mean "run Trilinos tests with all combinations of OpenMP and CUDA options," Christian might have a script that you could modify and use.

Kinda, I was more thinking a script like Ross' remote testing script that would build and run tests on different remote machines to test all important builds simultaneously

mhoemmen · 2017-09-04T19:01:13Z

@tjfulle wrote:

I was more thinking a script like Ross' remote testing script that would build and run tests on different remote machines to test all important builds simultaneously

I think we could build something like that out of Ross' script. It would make sense to farm out the test platforms to a cloud service. Ross is probably working on that already!

etphipp · 2017-09-06T20:53:59Z

The fix will be in Stokhos.

mhoemmen · 2017-09-06T20:57:34Z

@etphipp Awesome, thanks for digging into this! Glad we could help find a bug! :-D

@tjfulle I can do CUDA tests on my workstation if that could help, though that didn't catch everything, as we saw with the recent Albany issue.... Thanks for working on this :-D .

tjfulle · 2017-09-06T21:01:34Z

@mhoemmen , I can run tests easily on Cuda, but debugging is a bit clumsy on an interactive node.

etphipp · 2017-09-06T22:57:18Z

I forgot to mention this issue in the commit message, but I just pushed a change to Stokhos that appears to fix the failing Stokhos test.

tjfulle · 2017-09-06T23:00:21Z

Thanks @etphipp !

tjfulle · 2017-09-07T06:22:08Z

Adding the [WIP] label until the failing Stokhos tests can be resolved.

etphipp · 2017-09-07T23:58:03Z

OK, I believe all of the failing Stokhos tests have been resolved. There was a problem with compute capability 6.x that was preventing the PCE mat-vec from functioning properly, which was causing the failures on ride (and nowhere else). All of the Stokhos tests pass on ride for me now.

@mhoemmen

- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp. - unpackAndCombineIntoCrsArrays broken up in to many many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel. - Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate. Addresses: trilinos#797, trilinos#800, trilinos#802 Review: @mhoemmen Tests were run on two different machines and there results amended to this commit: Build/Test Cases Summary [RHEL6, standard checkin script] Enabled Packages: TpetraCore Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Forward Packages 0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min) 1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min) Build/Test Cases Summary [ride.sandia.gov, CUDA] Enabled Packages: Tpetra,MueLu,Stokhos 0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min) The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699 The failing Stokhos tests mentioned in trilinos#1655 were fixed with commit e97e37b

tjfulle · 2017-09-08T04:45:50Z

@mhoemmen, with the updates to Stokhos, all tests are passing on RHEL6 and CUDA. This is PR is ready once it passes your review.

mhoemmen · 2017-09-08T16:10:19Z

@tjfulle SWEET :-D I'll (re-)review!

tjfulle · 2017-09-08T16:55:59Z

I apologize that the commit is so large! Swapping in the PackTraits code required a more extensive refactor and we wanted a single atomic commit.

mhoemmen · 2017-09-08T17:11:01Z

I'm working on #1706 first, then this :-)

mhoemmen · 2017-09-08T17:19:27Z

@tjfulle turns out I can multitask! ;-) thanks for working on this!!! :-D

mhoemmen · 2017-09-08T18:06:02Z

Hm, did we ever test this for complex builds? I'm getting some build errors.... I'll work on it though.

mhoemmen · 2017-09-08T18:51:26Z

I've got a fix for complex builds ready and will push soon.

@trilinos/tpetra If Scalar could be std::complex<T>, it needs to turn into impl_scalar_type (via reinterpret_cast) before it enters the Kokkos world.

tjfulle · 2017-09-08T18:52:48Z

Doh! The last push was not tested with complex. What was the error?

mhoemmen · 2017-09-08T18:54:23Z

@tjfulle Just pushed the fix for complex builds. There was one spot that was passing a Teuchos::ArrayView<Scalar> into create_mirror_view_from_raw_host_array. This resulted in a Kokkos::View<Scalar*, ...>, rather than a Kokkos::View<impl_scalar_type*, ...>.

I think we could fix create_mirror_view_from_raw_host_array to use ArithTraits to get the right type for the View.

tjfulle assigned mhoemmen Aug 31, 2017

tjfulle requested a review from mhoemmen August 31, 2017 16:22

bartlettroscoe added the stage: in progress Work on the issue has started label Aug 31, 2017

tjfulle force-pushed the topics/threaded-pack-unpack branch from b08e108 to 2262f5f Compare September 1, 2017 04:03

tjfulle mentioned this pull request Sep 1, 2017

Tpetra: shadow warnings #1672

Closed

tjfulle force-pushed the topics/threaded-pack-unpack branch from 2262f5f to bb0dff8 Compare September 2, 2017 20:26

tjfulle changed the title ~~[WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays~~ Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 2, 2017

This was referenced Sep 3, 2017

broken CUDA build? sandialabs/Albany#176

Closed

Tpetra: CUDA packCrs build failure #1673

Closed

tjfulle force-pushed the topics/threaded-pack-unpack branch from 3834521 to a58e8e1 Compare September 3, 2017 22:49

mhoemmen reviewed Sep 3, 2017

View reviewed changes

tjfulle added a commit to tjfulle/Trilinos that referenced this pull request Sep 4, 2017

Tpetra: Addressing feedback

7e512b0

Addresses feedback from @mhoemmen for PR trilinos#1665

tjfulle force-pushed the topics/threaded-pack-unpack branch from 74d47e6 to c8769a7 Compare September 7, 2017 03:54

tjfulle changed the title ~~Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays~~ [WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 7, 2017

tjfulle force-pushed the topics/threaded-pack-unpack branch from c8769a7 to a8b943e Compare September 8, 2017 02:16

tjfulle force-pushed the topics/threaded-pack-unpack branch from a8b943e to 7cc9af2 Compare September 8, 2017 03:02

tjfulle changed the title ~~[WIP] Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays~~ Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays Sep 8, 2017

mhoemmen approved these changes Sep 8, 2017

View reviewed changes

mhoemmen merged commit 475118b into trilinos:develop Sep 8, 2017

mhoemmen removed the stage: in progress Work on the issue has started label Sep 8, 2017

mhoemmen mentioned this pull request Sep 8, 2017

Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1 failing for clean builds #1710

Closed

mhoemmen pushed a commit that referenced this pull request Sep 8, 2017

Tpetra: Fix complex build error with #1665

dbd33f5

@trilinos/tpetra If Scalar could be std::complex<T>, it needs to turn into impl_scalar_type (via reinterpret_cast) before it enters the Kokkos world.

tjfulle deleted the topics/threaded-pack-unpack branch May 3, 2018 19:23

This was referenced Mar 17, 2023

Tpetra: improve performance of TAFC (Transfer and fillComplete) #11689

Closed

Tpetra: improve performance of unpackAndCombineIntoCrsArrays #11693

Closed

Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays #1665

Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays #1665

Conversation

tjfulle commented Aug 31, 2017

tjfulle commented Aug 31, 2017

tjfulle commented Sep 1, 2017

mhoemmen commented Sep 1, 2017

tjfulle commented Sep 1, 2017

tjfulle commented Sep 1, 2017

mhoemmen commented Sep 1, 2017

tjfulle commented Sep 1, 2017

tjfulle commented Sep 1, 2017

tjfulle commented Sep 2, 2017

mhoemmen commented Sep 2, 2017

tjfulle commented Sep 3, 2017

mhoemmen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tjfulle commented Sep 4, 2017 • edited

mhoemmen commented Sep 4, 2017

mhoemmen commented Sep 4, 2017

tjfulle commented Sep 4, 2017 via email

mhoemmen commented Sep 4, 2017

tjfulle commented Sep 4, 2017

mhoemmen commented Sep 4, 2017

etphipp commented Sep 6, 2017

mhoemmen commented Sep 6, 2017

tjfulle commented Sep 6, 2017

etphipp commented Sep 6, 2017

tjfulle commented Sep 6, 2017

tjfulle commented Sep 7, 2017

etphipp commented Sep 7, 2017

tjfulle commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

tjfulle commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

tjfulle commented Sep 8, 2017

mhoemmen commented Sep 8, 2017

tjfulle commented Sep 4, 2017 •

edited