Move dimensions to allocator #85

Closed · wants to merge 3 commits
Changes from all commits
36 changes: 31 additions & 5 deletions src/allocator.f90
@@ -40,7 +40,8 @@ module m_allocator
!> Padded dimensions for x, y, and z oriented fields
integer :: xdims_padded(3), ydims_padded(3), zdims_padded(3)
!> Padded dimensions for natural Cartesian ordering
- integer :: cdims_padded(3)
+ integer :: cdims_padded(3), cdims(3)
+ integer :: n_groups_x, n_groups_y, n_groups_z
!> The pointer to the first block on the list. Non associated if
!> the list is empty
! TODO: Rename first to head
@@ -66,9 +67,13 @@ module m_allocator
class(field_t), pointer :: next
real(dp), pointer, private :: p_data(:)
real(dp), pointer, contiguous :: data(:, :, :)
- integer :: dir
  integer :: refcount = 0
  integer :: id !! An integer identifying the memory block.
+ integer :: dir
+ integer :: n !! number of cells in the `dir` direction
+ integer :: n_padded !! number of cells in the `dir` direction including padding
+ integer :: SZ
+ integer :: n_groups
contains
procedure :: set_shape
end type field_t
@@ -119,10 +124,15 @@ function allocator_init(nx, ny, nz, sz) result(allocator)
allocator%ngrid = nx_padded*ny_padded*nz_padded
allocator%sz = sz

- allocator%xdims_padded = [sz, nx_padded, ny_padded*nz_padded/sz]
- allocator%ydims_padded = [sz, ny_padded, nx_padded*nz_padded/sz]
- allocator%zdims_padded = [sz, nz_padded, nx_padded*ny_padded/sz]
+ allocator%n_groups_x = ny_padded*nz_padded/sz
+ allocator%n_groups_y = nx_padded*nz_padded/sz
+ allocator%n_groups_z = nx_padded*ny_padded/sz
+
+ allocator%xdims_padded = [sz, nx_padded, allocator%n_groups_x]
+ allocator%ydims_padded = [sz, ny_padded, allocator%n_groups_y]
+ allocator%zdims_padded = [sz, nz_padded, allocator%n_groups_z]
  allocator%cdims_padded = [nx_padded, ny_padded, nz_padded]
+ allocator%cdims = [nx, ny, nz]
end function allocator_init
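
For orientation: every field is stored as groups of pencils with shape (sz, n_padded, n_groups), so the group count for one direction is the product of the padded extents of the other two directions divided by sz. A minimal sketch of that arithmetic, with illustrative grid sizes that are not taken from this PR:

program groups_demo
   implicit none
   integer, parameter :: sz = 4
   integer, parameter :: nx_padded = 64, ny_padded = 64, nz_padded = 32
   integer :: n_groups_x, n_groups_y, n_groups_z

   ! Mirrors the allocator_init arithmetic above: a DIR_X field is laid
   ! out as (sz, nx_padded, n_groups_x), i.e. the y-z plane tiled into
   ! sz-wide pencils.
   n_groups_x = ny_padded*nz_padded/sz   ! 64*32/4 = 512
   n_groups_y = nx_padded*nz_padded/sz   ! 64*32/4 = 512
   n_groups_z = nx_padded*ny_padded/sz   ! 64*64/4 = 1024

   print *, n_groups_x, n_groups_y, n_groups_z
end program groups_demo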

function create_block(self, next) result(ptr)
@@ -170,12 +180,28 @@ function get_block(self, dir) result(handle)
select case (dir)
case (DIR_X)
dims = self%xdims_padded
+ handle%n = self%cdims(1)
+ handle%SZ = self%xdims_padded(1)
+ handle%n_padded = self%xdims_padded(2)
+ handle%n_groups = self%xdims_padded(3)
case (DIR_Y)
dims = self%ydims_padded
+ handle%n = self%cdims(2)
+ handle%SZ = self%ydims_padded(1)
+ handle%n_padded = self%ydims_padded(2)
+ handle%n_groups = self%ydims_padded(3)
case (DIR_Z)
dims = self%zdims_padded
+ handle%n = self%cdims(3)
+ handle%SZ = self%zdims_padded(1)
+ handle%n_padded = self%zdims_padded(2)
+ handle%n_groups = self%zdims_padded(3)
case (DIR_C)
dims = self%cdims_padded
+ handle%n = -1
+ handle%SZ = -1
+ handle%n_padded = -1
+ handle%n_groups = -1
Review comment on lines +201 to +204 (Member):
I think this also suggests that we shouldn't store these variables in field_t. field_t is effectively a scratch space, and sometimes these variables have no meaning. These assignments in particular can cause annoying bugs.
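
As a sketch of the alternative this comment points at, the allocator could answer direction-dependent size queries on demand instead of field_t caching them; the get_n procedure below is hypothetical and not part of this PR:

function get_n(self, dir) result(n)
   ! Hypothetical helper: field_t stays pure scratch space, and a DIR_C
   ! field can never carry a meaningless -1.
   class(allocator_t), intent(in) :: self
   integer, intent(in) :: dir
   integer :: n

   select case (dir)
   case (DIR_X); n = self%cdims(1)
   case (DIR_Y); n = self%cdims(2)
   case (DIR_Z); n = self%cdims(3)
   case default
      error stop 'n is only meaningful for DIR_X/DIR_Y/DIR_Z fields.'
   end select
end function get_n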

case default
error stop 'Undefined direction, allocator cannot provide a shape.'
end select
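
Taken together, a short usage sketch of the metadata get_block now fills in (only names visible in this diff are used):

class(field_t), pointer :: u

u => allocator%get_block(DIR_X)
! u%data is indexed as (u%SZ, u%n_padded, u%n_groups); along DIR_X only
! the first u%n of the u%n_padded points carry physical data.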
107 changes: 56 additions & 51 deletions src/cuda/backend.f90
@@ -64,62 +64,63 @@ module m_cuda_backend

contains

- function init(globs, allocator) result(backend)
+ function init(allocator) result(backend)
implicit none

- class(globs_t) :: globs
class(allocator_t), target, intent(inout) :: allocator
type(cuda_backend_t) :: backend

type(cuda_poisson_fft_t) :: cuda_poisson_fft
- integer :: n_halo, n_block
+ integer :: n_halo, n_groups, sz

select type (allocator)
type is (cuda_allocator_t)
! class level access to the allocator
backend%allocator => allocator
end select

- backend%xthreads = dim3(SZ, 1, 1)
- backend%xblocks = dim3(globs%n_groups_x, 1, 1)
- backend%ythreads = dim3(SZ, 1, 1)
- backend%yblocks = dim3(globs%n_groups_y, 1, 1)
- backend%zthreads = dim3(SZ, 1, 1)
- backend%zblocks = dim3(globs%n_groups_z, 1, 1)
+ sz = allocator%sz
Review comment (Member):
For example, here I would assign allocator%sz to a new sz member of backend_t if we want to avoid using the SZ parameter from m_omp_common/m_cuda_common. In all the use cases of field_t%sz this PR introduces, we could instead do self%sz.

Review comment (Member):
However, I think using the SZ parameter directly from m_omp_common/m_cuda_common is the better idea, though I don't have a strong opinion on this. If we're already under omp/ or cuda/, it's fine to rely on the corresponding SZ parameter directly; otherwise we end up using sz from allocator_t to set a very critical value for us.
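
A sketch of the two options being weighed here (and at the tds_solve_cuda launch further down); the backend%sz member in option 1 is hypothetical:

! Option 1 (hypothetical): freeze the value on the backend at init time.
backend%sz = allocator%sz          ! in init()
threads = dim3(self%sz, 1, 1)      ! at each launch site

! Option 2: rely on the compile-time parameter already in scope under
! cuda/ (or omp/):
! use m_cuda_common, only: SZ
threads = dim3(SZ, 1, 1)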


- backend%nx_loc = globs%nx_loc
- backend%ny_loc = globs%ny_loc
- backend%nz_loc = globs%nz_loc
+ backend%xthreads = dim3(sz, 1, 1)
+ backend%xblocks = dim3(allocator%n_groups_x, 1, 1)
+ backend%ythreads = dim3(sz, 1, 1)
+ backend%yblocks = dim3(allocator%n_groups_y, 1, 1)
+ backend%zthreads = dim3(sz, 1, 1)
+ backend%zblocks = dim3(allocator%n_groups_z, 1, 1)
+
+ backend%nx_loc = allocator%dims(1)
+ backend%ny_loc = allocator%dims(2)
+ backend%nz_loc = allocator%dims(3)

n_halo = 4
! Buffer size should be big enough for the largest MPI exchange.
- n_block = max(globs%n_groups_x, globs%n_groups_y, globs%n_groups_z)

- allocate (backend%u_send_s_dev(SZ, n_halo, n_block))
- allocate (backend%u_send_e_dev(SZ, n_halo, n_block))
- allocate (backend%u_recv_s_dev(SZ, n_halo, n_block))
- allocate (backend%u_recv_e_dev(SZ, n_halo, n_block))
- allocate (backend%v_send_s_dev(SZ, n_halo, n_block))
- allocate (backend%v_send_e_dev(SZ, n_halo, n_block))
- allocate (backend%v_recv_s_dev(SZ, n_halo, n_block))
- allocate (backend%v_recv_e_dev(SZ, n_halo, n_block))
- allocate (backend%w_send_s_dev(SZ, n_halo, n_block))
- allocate (backend%w_send_e_dev(SZ, n_halo, n_block))
- allocate (backend%w_recv_s_dev(SZ, n_halo, n_block))
- allocate (backend%w_recv_e_dev(SZ, n_halo, n_block))

- allocate (backend%du_send_s_dev(SZ, 1, n_block))
- allocate (backend%du_send_e_dev(SZ, 1, n_block))
- allocate (backend%du_recv_s_dev(SZ, 1, n_block))
- allocate (backend%du_recv_e_dev(SZ, 1, n_block))
- allocate (backend%dud_send_s_dev(SZ, 1, n_block))
- allocate (backend%dud_send_e_dev(SZ, 1, n_block))
- allocate (backend%dud_recv_s_dev(SZ, 1, n_block))
- allocate (backend%dud_recv_e_dev(SZ, 1, n_block))
- allocate (backend%d2u_send_s_dev(SZ, 1, n_block))
- allocate (backend%d2u_send_e_dev(SZ, 1, n_block))
- allocate (backend%d2u_recv_s_dev(SZ, 1, n_block))
- allocate (backend%d2u_recv_e_dev(SZ, 1, n_block))
+ n_groups = max(allocator%n_groups_x, allocator%n_groups_y, allocator%n_groups_z)

+ allocate (backend%u_send_s_dev(sz, n_halo, n_groups))
+ allocate (backend%u_send_e_dev(sz, n_halo, n_groups))
+ allocate (backend%u_recv_s_dev(sz, n_halo, n_groups))
+ allocate (backend%u_recv_e_dev(sz, n_halo, n_groups))
+ allocate (backend%v_send_s_dev(sz, n_halo, n_groups))
+ allocate (backend%v_send_e_dev(sz, n_halo, n_groups))
+ allocate (backend%v_recv_s_dev(sz, n_halo, n_groups))
+ allocate (backend%v_recv_e_dev(sz, n_halo, n_groups))
+ allocate (backend%w_send_s_dev(sz, n_halo, n_groups))
+ allocate (backend%w_send_e_dev(sz, n_halo, n_groups))
+ allocate (backend%w_recv_s_dev(sz, n_halo, n_groups))
+ allocate (backend%w_recv_e_dev(sz, n_halo, n_groups))

+ allocate (backend%du_send_s_dev(sz, 1, n_groups))
+ allocate (backend%du_send_e_dev(sz, 1, n_groups))
+ allocate (backend%du_recv_s_dev(sz, 1, n_groups))
+ allocate (backend%du_recv_e_dev(sz, 1, n_groups))
+ allocate (backend%dud_send_s_dev(sz, 1, n_groups))
+ allocate (backend%dud_send_e_dev(sz, 1, n_groups))
+ allocate (backend%dud_recv_s_dev(sz, 1, n_groups))
+ allocate (backend%dud_recv_e_dev(sz, 1, n_groups))
+ allocate (backend%d2u_send_s_dev(sz, 1, n_groups))
+ allocate (backend%d2u_send_e_dev(sz, 1, n_groups))
+ allocate (backend%d2u_recv_s_dev(sz, 1, n_groups))
+ allocate (backend%d2u_recv_e_dev(sz, 1, n_groups))

end function init

@@ -226,7 +227,7 @@ subroutine transeq_cuda_dist(self, du, dv, dw, u, v, w, dirps, &
type is (cuda_tdsops_t); der2nd_sym => tdsops
end select

- call transeq_halo_exchange(self, u_dev, v_dev, w_dev, dirps)
+ call transeq_halo_exchange(self, u, v, w, dirps)

call transeq_dist_component(self, du_dev, u_dev, u_dev, &
self%u_recv_s_dev, self%u_recv_e_dev, &
@@ -246,22 +247,26 @@

end subroutine transeq_cuda_dist

- subroutine transeq_halo_exchange(self, u_dev, v_dev, w_dev, dirps)
+ subroutine transeq_halo_exchange(self, u, v, w, dirps)
class(cuda_backend_t) :: self
- real(dp), device, dimension(:, :, :), intent(in) :: u_dev, v_dev, w_dev
+ class(field_t), intent(in) :: u, v, w
  type(dirps_t), intent(in) :: dirps
+ real(dp), device, dimension(:, :, :) :: u_dev, v_dev, w_dev
integer :: n_halo

+ call resolve_field_t(u_dev, u)
+ call resolve_field_t(v_dev, v)
+ call resolve_field_t(w_dev, w)
! TODO: don't hardcode n_halo
n_halo = 4

! Copy halo data into buffer arrays
call copy_into_buffers(self%u_send_s_dev, self%u_send_e_dev, u_dev, &
- dirps%n)
+ u%n, u%sz)
call copy_into_buffers(self%v_send_s_dev, self%v_send_e_dev, v_dev, &
- dirps%n)
+ v%n, u%sz)
call copy_into_buffers(self%w_send_s_dev, self%w_send_e_dev, w_dev, &
- dirps%n)
+ w%n, u%sz)

! halo exchange
call sendrecv_3fields( &
Expand All @@ -271,7 +276,7 @@ subroutine transeq_halo_exchange(self, u_dev, v_dev, w_dev, dirps)
self%u_send_s_dev, self%u_send_e_dev, &
self%v_send_s_dev, self%v_send_e_dev, &
self%w_send_s_dev, self%w_send_e_dev, &
- SZ*n_halo*dirps%n_blocks, dirps%nproc, dirps%pprev, dirps%pnext &
+ u%sz*n_halo*u%n_groups, dirps%nproc, dirps%pprev, dirps%pnext &
)

end subroutine transeq_halo_exchange
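
resolve_field_t is called above but its implementation is outside this diff; a hedged guess at its shape, assuming a cuda_field_t extension that owns the device data (the names cuda_field_t and data_d are assumptions, not confirmed by this PR):

subroutine resolve_field_t(u_dev, u)
   real(dp), device, pointer, dimension(:, :, :) :: u_dev
   class(field_t), intent(in) :: u

   ! Assumption: the CUDA allocator hands out cuda_field_t blocks whose
   ! device array is exposed here as a contiguous 3-d pointer.
   select type (u)
   type is (cuda_field_t)
      u_dev => u%data_d
   end select
end subroutine resolve_field_t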
@@ -360,7 +365,7 @@ subroutine tds_solve_cuda(self, du, u, dirps, tdsops)
error stop 'DIR mismatch between fields and dirps in tds_solve.'
end if

- blocks = dim3(dirps%n_blocks, 1, 1); threads = dim3(SZ, 1, 1)
+ blocks = dim3(u%n_groups, 1, 1); threads = dim3(u%sz, 1, 1)
Review comment (Member):
For example, sz here doesn't depend on the field u; it is always the same SZ that is effectively frozen at backend instantiation.


call tds_solve_dist(self, du, u, dirps, tdsops, blocks, threads)

@@ -397,7 +402,7 @@ subroutine tds_solve_dist(self, du, u, dirps, tdsops, blocks, threads)

call sendrecv_fields(self%u_recv_s_dev, self%u_recv_e_dev, &
self%u_send_s_dev, self%u_send_e_dev, &
- SZ*n_halo*dirps%n_blocks, dirps%nproc, &
+ u%sz*n_halo*u%n_groups, dirps%nproc, &
dirps%pprev, dirps%pnext)

! call exec_dist
@@ -612,19 +617,19 @@ real(dp) function scalar_product_cuda(self, x, y) result(s)

end function scalar_product_cuda

- subroutine copy_into_buffers(u_send_s_dev, u_send_e_dev, u_dev, n)
+ subroutine copy_into_buffers(u_send_s_dev, u_send_e_dev, u_dev, n, sz)
implicit none

real(dp), device, dimension(:, :, :), intent(out) :: u_send_s_dev, &
u_send_e_dev
real(dp), device, dimension(:, :, :), intent(in) :: u_dev
- integer, intent(in) :: n
+ integer, intent(in) :: n, sz

type(dim3) :: blocks, threads
integer :: n_halo = 4

blocks = dim3(size(u_dev, dim=3), 1, 1)
- threads = dim3(SZ, 1, 1)
+ threads = dim3(sz, 1, 1)
call buffer_copy<<<blocks, threads>>>(u_send_s_dev, u_send_e_dev, & !&
u_dev, n, n_halo)
