Performance improvement in transeq in the OpenMP backend #66

Open
semi-h opened this issue Feb 27, 2024 · 0 comments
Labels
omp Related to openMP backend performance

semi-h commented Feb 27, 2024

I'm copying my comment in #27 so that we don't forget about it.

x3d2/src/omp/exec_dist.f90

Lines 164 to 185 in 2d906a5

do k = 1, n_block
  call der_univ_subs(du(:, :, k), &
                     du_recv_s(:, :, k), du_recv_e(:, :, k), &
                     tdsops_du%n, tdsops_du%dist_sa, tdsops_du%dist_sc)
  call der_univ_subs(dud(:, :, k), &
                     dud_recv_s(:, :, k), dud_recv_e(:, :, k), &
                     tdsops_dud%n, tdsops_dud%dist_sa, tdsops_dud%dist_sc)
  call der_univ_subs(d2u(:, :, k), &
                     d2u_recv_s(:, :, k), d2u_recv_e(:, :, k), &
                     tdsops_d2u%n, tdsops_d2u%dist_sa, tdsops_d2u%dist_sc)
  do j = 1, n
    !$omp simd
    do i = 1, SZ
      rhs(i, j, k) = -0.5_dp*(v(i, j, k)*du(i, j, k) + dud(i, j, k)) + nu*d2u(i, j, k)
    end do
    !$omp end simd
  end do
end do

I realised that here we're writing three field-sized arrays into main memory unnecessarily. This is potentially increasing the runtime by 20%.

In the second phase of the algorithm here, we pass a part of du, dud, and d2u into der_univ_subs, and they're all rewritten in place. Later we combine them into rhs for the final result. Ideally, we want du, dud, and d2u to be read once and rhs to be written only once. However, because of the way der_univ_subs works, the updated data in the du arrays is written back to main memory after each der_univ_subs call, even though we never need it there.

There are three ways we can fix this:

  • In the parallel do loop in the second phase, we can copy the relevant parts of the du, dud, and d2u arrays into (SZ, n) sized temporary arrays. We then pass the temporary arrays into der_univ_subs, and at the end we use these temporaries to obtain the final rhs. This is the easiest solution, but it may not be the best in terms of performance.
  • We can write an alternative der_univ_subs with separate input and output arrays. This way we can pass a part of the du arrays as we do now, and pass a small temporary array as the output. Because the du arrays will be input-only, no data will be written back to main memory. Then we can combine the temporaries to get rhs.
  • If we're writing an alternative der_univ_subs to be used in transeq, we can go one step further and write a fused version of it. This would probably be the most performant solution. der_univ_subs is relatively lightweight, so this shouldn't be hard to do. The new subroutine can take du, dud, and d2u as input and write the final result to rhs.
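
For reference, the first option could look roughly like the sketch below. This is only a minimal, self-contained illustration and not the x3d2 implementation: the array sizes, the temporaries du_t, dud_t, and d2u_t, and subs_stub are all hypothetical. In particular, subs_stub is just a placeholder for der_univ_subs (it scales its argument in place and takes none of the halo or tdsops arguments), so only the data movement pattern is meaningful here.

```fortran
program transeq_temporaries_sketch
  implicit none
  integer, parameter :: dp = kind(0.0d0)
  ! Illustrative sizes only; the real SZ, n, n_block come from the solver.
  integer, parameter :: SZ = 4, n = 8, n_block = 3
  real(dp), parameter :: nu = 0.1_dp
  real(dp), dimension(SZ, n, n_block) :: v, du, dud, d2u, rhs, du0
  ! Hypothetical (SZ, n) temporaries, one set per thread.
  real(dp), dimension(SZ, n) :: du_t, dud_t, d2u_t
  integer :: i, j, k

  call random_number(v)
  call random_number(du)
  call random_number(dud)
  call random_number(d2u)
  du0 = du  ! copy kept only to show that du is never overwritten

  !$omp parallel do private(i, j, du_t, dud_t, d2u_t)
  do k = 1, n_block
    ! Read each field-sized array once, into small temporaries.
    du_t = du(:, :, k)
    dud_t = dud(:, :, k)
    d2u_t = d2u(:, :, k)

    ! The in-place update now touches only the temporaries.
    call subs_stub(du_t)
    call subs_stub(dud_t)
    call subs_stub(d2u_t)

    ! rhs is the only field-sized array written in this loop.
    do j = 1, n
      !$omp simd
      do i = 1, SZ
        rhs(i, j, k) = -0.5_dp*(v(i, j, k)*du_t(i, j) + dud_t(i, j)) &
                       + nu*d2u_t(i, j)
      end do
      !$omp end simd
    end do
  end do
  !$omp end parallel do

  print *, 'du left untouched: ', all(du == du0)

contains

  ! Placeholder for der_univ_subs: NOT the real substitution algorithm.
  subroutine subs_stub(f)
    real(dp), intent(inout) :: f(:, :)
    f = 2.0_dp*f
  end subroutine subs_stub

end program transeq_temporaries_sketch
```

The temporaries are declared private so that each thread works on its own copy, which should stay in cache if (SZ, n) is small enough. The second and third options would remove the explicit copy step, but they need changes inside der_univ_subs itself, so they are not sketched here.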
@Nanoseb added the omp (Related to openMP backend) and performance labels on Mar 1, 2024