
Implement transeq in omp backend #27

Merged: 21 commits merged into xcompact3d:main on Feb 19, 2024

Conversation

@Nanoseb (Collaborator) commented Jan 19, 2024

closes #21

still WIP, compiles but not tested
@Nanoseb added the labels core (Issue affecting core mechanisms of the software) and omp (Related to openMP backend) on Jan 19, 2024
@Nanoseb self-assigned this on Jan 19, 2024
Comment on lines 157 to 182
subroutine transeq_omp_dist(self, du, dv, dw, u, v, w, dirps)
   implicit none

   class(omp_backend_t) :: self
   class(field_t), intent(inout) :: du, dv, dw
   class(field_t), intent(in) :: u, v, w
   type(dirps_t), intent(in) :: dirps
   class(field_t), pointer :: duu, d2u, uu, du_temp

   ! du: first derivative of u
   du_temp => self%allocator%get_block()
   call tds_solve_omp(self, du_temp, u, dirps, dirps%der1st)

   ! duu: first derivative of u*u
   duu => self%allocator%get_block()
   uu => self%allocator%get_block()
   call vecmul_omp(uu, u, u, dirps)
   call tds_solve_omp(self, duu, uu, dirps, dirps%der1st_sym)

   ! d2u: second derivative of u
   d2u => self%allocator%get_block()
   call tds_solve_omp(self, d2u, u, dirps, dirps%der2nd)

end subroutine transeq_omp_dist

Member:

This is not the way we want to implement the transeq operations on the OpenMP backend. We can't take advantage of the cache and reduce data movement if we implement it this way.

We should instead implement a subroutine that is similar to exec_dist_tds_compact.
https://github.com/xcompact3d/x3d2/blob/main/src/omp/exec_dist.f90#L14C15-L14C36

The idea is to call der_univ_dist to get du/dx, d(uu)/dx, and d2u/dx2 one after another as we loop through the batches of lines we have in the domain (i.e. a loop over the 3rd dimension of the arrays, aka the group number).

It should look like the loop below, but with 3 distinct calls to der_univ_dist inside the loop, each with the correct tdsops instance.
https://github.com/xcompact3d/x3d2/blob/main/src/omp/exec_dist.f90#L40-L47

This will also make an operation like vecmul_omp unnecessary, saving some data movement.
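To make the suggestion concrete, here is a minimal self-contained sketch of that loop structure. This is not the actual x3d2 interface: deriv_batch below is a trivial stand-in for der_univ_dist (whose real argument list is longer), and the array shapes are made up.

program transeq_loop_sketch
   implicit none
   integer, parameter :: sz = 4, n = 32, n_block = 8
   real :: u(sz, n, n_block)
   real :: du(sz, n, n_block), dud(sz, n, n_block), d2u(sz, n, n_block)
   real :: u_i(sz, n), uu_i(sz, n)
   integer :: k

   call random_number(u)

   ! one pass over the groups of lines: u(:, :, k) stays in cache while all
   ! three derivatives are evaluated, and u*u is formed locally so no
   ! field-sized vecmul_omp pass is needed
   !$omp parallel do private(u_i, uu_i)
   do k = 1, n_block
      u_i = u(:, :, k)
      uu_i = u_i*u_i
      call deriv_batch(du(:, :, k), u_i)   ! du/dx    with der1st
      call deriv_batch(dud(:, :, k), uu_i) ! d(uu)/dx with der1st_sym
      call deriv_batch(d2u(:, :, k), u_i)  ! d2u/dx2  with der2nd
   end do
   !$omp end parallel do

   print *, sum(du) + sum(dud) + sum(d2u)

contains

   subroutine deriv_batch(df, f)
      ! placeholder: the real der_univ_dist applies the distributed compact
      ! scheme to a (SZ, n) batch of lines here
      real, intent(out) :: df(:, :)
      real, intent(in) :: f(:, :)

      df = f
   end subroutine deriv_batch

end program transeq_loop_sketch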

Collaborator Author:

Ah yes, that makes sense indeed. I will do that.

still WIP, not tested and not cleaned up
@Nanoseb (Collaborator Author) commented Jan 26, 2024

Just reimplemented it following your comment. Note that, for now, I have kept vecmul_omp because removing it would require a rewrite of der_univ_dist to do the multiplication in place. So it isn't as optimised as it could be yet; I will likely do that in another branch, because I may try to simplify der_univ_dist itself at the same time.

Still not ready to be merged: I need to clean up the naming convention I used, test it, and add a test.

@semi-h (Member) left a comment

Looks great! I think we now get the full benefit of the CPU cache.

call sendrecv_fields(self%w_recv_s, self%w_recv_e, self%w_send_s, self%w_send_e, &
SZ*n_halo*dirps%n_blocks, dirps%nproc, dirps%pprev, dirps%pnext)

end subroutine
Member:

Is it okay to omit the subroutine name here? On GitHub it broke the syntax highlighting for me in the subroutine below.

Collaborator Author:

Yes, it is never a requirement (maybe it used to be in older standards?). It is indeed odd that GitHub's syntax highlighting doesn't handle it. I can add it if needed.
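For reference, a standalone snippet (not from the PR, names made up) showing both forms:

module end_name_demo
contains
   subroutine foo()
   end subroutine       ! the name after "end subroutine" is optional
   subroutine bar()
   end subroutine bar   ! if the name is given, it must match
end module end_name_demo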

@semi-h (Member) commented Feb 6, 2024

I would only test exec_dist_transeq_compact for now. It would be much simpler and more targeted. transeq only calls exec_dist_transeq_compact 3x3 times after all, and testing it requires instantiating the allocator and the backend and then setting a few variables. I think that is beyond the scope of the current test, which should be more low level.
Also, it's very likely that the allocator will change a bit, and then we would need to edit this test program too. There will probably be changes in how we set the variables in the backend as well, and mirroring all of that here in the transeq test shouldn't be necessary.
I think any test initiating the backend should be one level up, so that we can test all backends, picking the right one at runtime, and the main procedures in the backends can be tested in a better way and in a single program.

@semi-h (Member) commented Feb 6, 2024

Also, it's a good idea to test the performance of exec_dist_transeq_compact now that we have it. With the distributed strategy the first phase of the algorithm reads u and v (or only u, but let's test with u and v) and writes out 3 fields: du, dud, and d2u. There is a bit of buffering and MPI after the first phase, but let's ignore its cost for the moment. Then we have the second phase, where we read the 3 fields du, dud, and d2u and write out 1 field as the result. So in total the data movement is equal to 5 reads and 4 writes, i.e. 5 + 4*2 = 13 field-sized transfers. If we run this function on a $512^3$ mesh the data movement is 13 GiB. If you run the test on ARCHER2 on a single node with 8 ranks, so that there is one rank per NUMA zone and the total available bandwidth is ~400 GiB/s, I would expect a single execution of this function to take 0.046 seconds, assuming we achieve a modest 70% of peak BW. Would you be able to test this when you have time? If it is too different from what we expect we should investigate the cause.
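To spell out the arithmetic behind these numbers (assuming double precision, so one $512^3$ field is $512^3 \times 8$ B = 1 GiB, and counting each write twice, which is presumably the write-allocate traffic behind the 4*2 above):

$t \approx \dfrac{(5 + 2 \times 4) \times 1\,\mathrm{GiB}}{0.7 \times 400\,\mathrm{GiB/s}} \approx \dfrac{13\,\mathrm{GiB}}{280\,\mathrm{GiB/s}} \approx 0.046\,\mathrm{s}$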

@Nanoseb (Collaborator Author) commented Feb 16, 2024

I would only test exec_dist_transeq_compact for now.

Added a test for it too. Now we are testing both.

@Nanoseb (Collaborator Author) commented Feb 16, 2024

Also, it's a good idea to test the performance of exec_dist_transeq_compact...

Now that we've decided to separate performance tests from verification/unit tests, I think I will leave that for now, until we have a framework in place (see #35).

@Nanoseb marked this pull request as ready for review on February 16, 2024, 12:31
@JamieJQuinn (Collaborator) left a comment

Looks great! Just a few small suggestions.

Nanoseb and others added 2 commits February 16, 2024 12:52
for consistency with the rest of the codebase

Co-authored-by: Jamie J Quinn <JamieJQuinn@users.noreply.github.com>
for consistency with the rest of the codebase

Co-authored-by: Jamie J Quinn <JamieJQuinn@users.noreply.github.com>
@Nanoseb marked this pull request as draft on February 16, 2024, 14:44
@Nanoseb marked this pull request as ready for review on February 16, 2024, 15:10
@semi-h (Member) commented Feb 16, 2024

The distributed solver requires up to 128/256 points per rank, depending on the particular compact scheme we solve. If you want to run parallel tests, the test can fail if you have too few points per rank.

We don't have a parallel testing environment yet, but can you confirm that the test passes when you run the executable by hand with multiple ranks? Because you use the default schemes, I think 64 points per rank should be more than enough.
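For example, something along these lines (the executable name here is hypothetical; use whatever name the test target actually builds to):

mpirun -n 4 ./test_omp_transeq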

@Nanoseb (Collaborator Author) commented Feb 16, 2024

The distributed solver requires up to 128/256 points per rank, depending on the particular compact scheme we solve. If you want to run parallel tests, the test can fail if you have too few points per rank.

Yes, exactly. With 64 cells it was failing with 4 cores (error ~7e-8) and passing with 2. That's why I have now increased it to 96, so it works up to 4 cores with the tolerance that was set (1e-8).

@Nanoseb (Collaborator Author) commented Feb 16, 2024

I lowered the cell count just to run the tests faster; it went from 2 s to 0.4 s, I think. For now it isn't a big deal, but when we have 10 or 20 tests, that quickly adds up. Being able to run tests quickly makes you run them more often.

@semi-h (Member) left a comment

Making the temporaries private inside the parallel loops is very important, and once that is fixed I'm happy to merge. We can work on the performance-related part in a new PR.
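For anyone reading along, a minimal self-contained illustration of the point (array shapes made up, not the PR's code): without private, all threads share one scratch array and race on it.

program private_tmp_demo
   implicit none
   integer, parameter :: sz = 4, n = 8, n_block = 64
   real :: u(sz, n, n_block), rhs(sz, n, n_block)
   real :: tmp(sz, n)   ! per-iteration scratch array
   integer :: k

   call random_number(u)

   ! private(tmp) gives each thread its own copy of the temporary;
   ! without it, every thread would write to the same tmp
   !$omp parallel do private(tmp)
   do k = 1, n_block
      tmp = u(:, :, k)*u(:, :, k)
      rhs(:, :, k) = tmp
   end do
   !$omp end parallel do

   print *, sum(rhs)
end program private_tmp_demo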


!$omp parallel do
do k = 1, n_block
call der_univ_subs(du(:, :, k), &
Member:

I realised that here we're writing 3 field-sized arrays into main memory unnecessarily. It is potentially increasing the runtime by about 20%.

In the second phase of the algorithm we pass parts of du, dud, and d2u into der_univ_subs, and they're all rewritten in place. Later we combine them into rhs for the final result. Ideally, we want du, dud, and d2u to be read only once and rhs to be written only once. However, because of the way der_univ_subs works, the updated data in the du arrays after the der_univ_subs call gets written back to main memory, even though we don't need it at all.

There are three ways we can fix this:

  • In the parallel do loop in the second phase we can copy the relevant parts of the du, dud, and d2u arrays into (SZ, n)-sized temporary arrays. We then pass the temporary arrays into der_univ_subs, and at the end use these temporaries to obtain the final rhs. This is the easiest solution, but it may not be the best in terms of performance (see the sketch after this list).
  • We can write an alternative der_univ_subs with separate input and output arrays. This way we can pass a part of the du arrays as we do now, and pass a small temporary array as the output. Because the du arrays would be input-only, no data would be written back to main memory. Then we can combine the temporaries to get rhs.
  • If we're writing an alternative der_univ_subs to be used in transeq, we can go one step further and have a fused version of it. This would probably be the most performant solution. der_univ_subs is relatively lightweight, so it isn't really hard to do. The new subroutine can take all of du, dud, and d2u as input and write the final result rhs.
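A rough sketch of the first option, to make it concrete. This is not actual x3d2 code: subs_batch is a stand-in for der_univ_subs (whose real interface takes the tdsops coefficients and the data prepared in the first phase), and the final combination is a placeholder.

program subs_temp_sketch
   implicit none
   integer, parameter :: sz = 4, n = 32, n_block = 8
   real :: du(sz, n, n_block), dud(sz, n, n_block), d2u(sz, n, n_block)
   real :: rhs(sz, n, n_block)
   real :: du_i(sz, n), dud_i(sz, n), d2u_i(sz, n)
   integer :: k

   call random_number(du)
   call random_number(dud)
   call random_number(d2u)

   !$omp parallel do private(du_i, dud_i, d2u_i)
   do k = 1, n_block
      ! copy each slice into a small (SZ, n) temporary so that du, dud and
      ! d2u are only read from main memory, never written back
      du_i = du(:, :, k)
      dud_i = dud(:, :, k)
      d2u_i = d2u(:, :, k)
      call subs_batch(du_i)
      call subs_batch(dud_i)
      call subs_batch(d2u_i)
      ! rhs is the only field-sized array written in this phase; the
      ! combination below is a placeholder for the real transeq weighting
      rhs(:, :, k) = du_i + dud_i + d2u_i
   end do
   !$omp end parallel do

   print *, sum(rhs)

contains

   subroutine subs_batch(f)
      ! stand-in for der_univ_subs, which finishes the distributed solve
      ! in place on a (SZ, n) batch of lines
      real, intent(inout) :: f(:, :)

      f = 2.0*f
   end subroutine subs_batch

end program subs_temp_sketch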

Member:

#40 is relevant here

Collaborator Author:

Good points, I will have a think about it, but indeed having it in a new PR focusing on performance makes sense.

Member:

Alright, I think the best way to move forward is implementing the first strategy and checking how much of the peak BW we get. If it's a good utilisation then maybe we don't need to bother with a new der_univ_subs at all.

@semi-h merged commit 00d0986 into xcompact3d:main on Feb 19, 2024
2 checks passed
Labels: core (Issue affecting core mechanisms of the software), omp (Related to openMP backend)
Projects: none yet
Development: successfully merging this pull request may close the issue "Implement transeq in omp backend".
3 participants