
Support for errest/residual of individual field variables. #116

Merged: 1 commit, Feb 23, 2017

Conversation

nloppi (Member) commented Jan 24, 2017

No description provided.

nloppi (Member, Author) commented Jan 24, 2017

Hi Freddie,

I opened the pull request early just to see if you are happy with the structure. I am still polishing the code, checking formatting and fixing naming inconsistencies. The Phi backend still needs to be tested as well.
Thanks!

nloppi force-pushed the resid branch 3 times, most recently from 5ef5f73 to fa807c5 (January 24, 2017 14:39)
.format(npdtype_to_ctype(x.dtype))
# Determine the blocksize
block = (128, 1, 1)
# Determine the gridsize
Contributor:

Add newline.


# Build the reduction kernel
rkern = self._build_kernel(
'errest', src, [np.int32]*3 + [np.intp]*4 + [dtype]*2)
Contributor:

Move ) onto newline.

block = (128, 1, 1)
# Determine the gridsize
grid = (int((ncolb + block[0] - 1)/block[0]), 1, 1)
# Empty result buffer on host, shape = (nvars, nblocks)
Contributor:

Add newline.

grid = (int((ncolb + block[0] - 1)/block[0]), 1, 1)
# Empty result buffer on host, shape = (nvars, nblocks)
err_host = cuda.pagelocked_empty((ncola, grid[0]), dtype, 'C')
# Device memory allocation
Contributor:

Add newline.
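The launch-parameter snippets above use the standard ceil-division idiom to size the grid. As a quick sanity check (the column count here is hypothetical, not from the PR):

```python
# Sketch of the host-side launch parameters discussed above; names
# (block, grid, ncolb) follow the snippet, the sizes are assumptions.
ncolb = 1000          # hypothetical number of columns to reduce over
block = (128, 1, 1)   # threads per block

# Ceil-division: enough blocks to cover every column
grid = ((ncolb + block[0] - 1) // block[0], 1, 1)

print(grid)  # (8, 1, 1) -- 8 * 128 = 1024 >= 1000
```

The result buffer on the host then has one slot per (variable, block) pair, matching the `(ncola, grid[0])` shape in the diff.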

}

// Last warp
if (tid < 32){
Contributor:

Move { onto newline.

}

// Unrolled reduction within blocks except the last block
if (blockIdx.x != gridDim.x - 1){
Contributor:

Move { onto newline.


// Unrolled reduction within blocks except the last block
if (blockIdx.x != gridDim.x - 1){
if (tid < 64){
Contributor:

Move { onto newline.

}

// Copy to global memory
if (tid == 0){
Contributor:

Move { onto newline.

}

// Last block reduced with a variable sized loop
else{
Contributor:

Move { onto newline.

% endfor
}
}

Contributor:

Remove newline.


// Last block reduced with a variable sized loop
else{
for(int s = 1; s < lastblksize; s *= 2){
Contributor:

Move { onto newline.

// Last block reduced with a variable sized loop
else{
for(int s = 1; s < lastblksize; s *= 2){
if (tid % (2*s) == 0) {
Contributor:

Move { onto newline.


// Last warp
if (tid < 32){
% for n in [32, 16, 8, 4, 2, 1]:
Contributor:

Use pyfr.ndrange.

nrow, ldim, dtype = x.traits
ncola, ncolb = x.ioshape[1:]

# Determine the "blocksize"
Contributor:

Use OpenCL terminology.

# L2 norm
if self._norm == 'l2':
# Reduce locally (element types) and globally (MPI ranks)

Contributor:

Remove newline.

# Normalise
res = np.sqrt(rg/self._gndofs)

# Linfinity norm
Contributor:

L^


# Normalise
res = np.sqrt(rg/self._gndofs)

Contributor:

Remove newline.

rg = [comm.allreduce(r, op=get_mpi('sum')) for r in rl]

# Normalise
res = np.sqrt(rg/self._gndofs)
Contributor:

Space around /
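For reference, the normalisation step under review amounts to a root-mean-square residual per field variable; a minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical globally reduced sum-of-squares per field variable
rg = np.array([4.0, 16.0])
gndofs = 4.0  # hypothetical global number of degrees of freedom

# Normalise: RMS residual per field variable
res = np.sqrt(rg / gndofs)

print(res)  # [1. 2.]
```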

# Reduce locally (element types) and globally (MPI ranks)

rl = [np.sum(e) for e in zip(*errest.retval)]
rg = [comm.allreduce(r, op=get_mpi('sum')) for r in rl]
Contributor:

MPI can do vector -> vector reductions (see force calculation plugin for an example).
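A NumPy-only sketch of the vector form the reviewer is suggesting (the values are hypothetical; the actual MPI call is left as a comment since it needs a live communicator):

```python
import numpy as np

# Per-element-type partial residuals; e.g. one tuple per element type
# (hypothetical values)
retval = [(1.0, 2.0), (3.0, 4.0)]

# Local reduction over element types as a single vector operation,
# replacing the per-component list comprehension in the diff
rl = np.sum(retval, axis=0)

# With mpi4py the whole vector can then be reduced in one call, e.g.
#   rg = comm.allreduce(rl, op=get_mpi('sum'))
# (assumes comm and get_mpi are in scope, as elsewhere in the plugin)

print(rl)  # [4. 6.]
```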

@@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-

import math
import numpy as np
Contributor:

Add newline. (NumPy is not a standard Python package and so goes into its own stanza.)

self.queue = queue
rkern(queue.cl_queue_comp, (ncolb, 1, 1), block,
nrow, ncolb, ldim, err_dev, x.data, y.data,
z.data, atol, rtol)
Contributor:

Move ) onto newline.

zarr = Array(qcomp, cnt, dtype, data=z.data)
# Reduction across blocks outsourced to numpy
if norm == 'uniform':
return np.max(err_host, axis=1)
Contributor:

Outside the class define something like

reducer = np.max if norm == 'uniform' else np.sum

then just do return reducer(err_host, axis=1).
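The reviewer's suggestion, written out as a runnable sketch (the `err_host` contents are hypothetical, and `finalise` is just an illustrative wrapper name):

```python
import numpy as np

def finalise(err_host, norm):
    # Pick the reduction once, then apply it across blocks
    reducer = np.max if norm == 'uniform' else np.sum
    return reducer(err_host, axis=1)

# Hypothetical (nvars, nblocks) partial results
err_host = np.array([[1.0, 3.0],
                     [2.0, 2.0]])

print(finalise(err_host, 'uniform'))  # [3. 2.]
print(finalise(err_host, 'l2'))       # [4. 4.]
```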

FreddieWitherden (Contributor) commented:

The biggest things: the block size is currently always 128. This could be a problem in relation to running out of shared memory (think MHD, where we have 9 field variables). Adapt the code to support any power-of-two block size. The result should be one for loop with an if statement to decide whether we need to sync threads or not.

Secondly, the OpenCL code is probably wrong in the sense that OpenCL, unlike NVIDIA, does not have warps and does not guarantee that groups of 32 threads execute synchronously. As such you always need the barrier.
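A pure-Python model of the generalised power-of-two reduction being requested; this sketches the loop structure only, with the barrier placement noted in comments (the real kernel would be CUDA/OpenCL C):

```python
def block_reduce(vals):
    # Model of an in-block tree reduction for any power-of-two block size
    vals = list(vals)
    n = len(vals)
    assert n & (n - 1) == 0, 'block size must be a power of two'

    s = n // 2
    while s > 0:
        # Each "thread" tid in the first half folds in its partner
        for tid in range(s):
            vals[tid] += vals[tid + s]
        # In the real kernel: if s > 32, call __syncthreads() here on
        # CUDA; on OpenCL the barrier is needed at EVERY step, since
        # there is no warp-synchronous execution guarantee
        s //= 2

    return vals[0]

print(block_reduce(range(128)))  # 8128 (sum of 0..127)
```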

nloppi (Member, Author) commented Jan 24, 2017

Thanks for the comments, Freddie!

Yes, actually I always found a difference in the 5th or 6th digit between OpenCL (my desktop) and CUDA (nighthawk), and was planning to double-check on the test machine that has both backends. It must be related to this synchronisation behaviour.

nloppi (Member, Author) commented Jan 27, 2017

Now the OpenCL and CUDA outputs match as well. I still need to test Phi.

nloppi force-pushed the resid branch 2 times, most recently from 86d9814 to da8f2ec (February 20, 2017 15:18)
@@ -63,6 +64,12 @@ def __init__(self, intg, cfgsect, suffix=None):
raise RuntimeError('System {0} not supported by plugin {1}'
.format(intg.system.name, self.name))

# Check that we support this particular integrator formulation
if not ('*' in self.formulations
Contributor:

Overkill to support *. Just check for intg.formulation in self.formulations and require plugins to specify both std and dual.

@@ -14,6 +15,11 @@ def ndrange(context, *args):
return util.ndrange(*args)


def ilog2range(context, x):
n = x.bit_length() - 1
return [int(x/(2**(i + 1))) for i in range(n)]
Contributor:

In [8]: x = 256

In [9]: [2**i for i in range(x.bit_length())]
Out[9]: [1, 2, 4, 8, 16, 32, 64, 128, 256]
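Side by side, the float-division version from the diff and an integer-only formulation of the same descending powers, built from the reviewer's observation above:

```python
def ilog2range(x):
    # Float-division version from the diff under review
    n = x.bit_length() - 1
    return [int(x / 2**(i + 1)) for i in range(n)]

def ilog2range_v2(x):
    # Integer-only version: all powers of two up to x, then the
    # slice drops x itself and reverses the order
    return [2**i for i in range(x.bit_length())][-2::-1]

print(ilog2range(256))     # [128, 64, 32, 16, 8, 4, 2, 1]
print(ilog2range_v2(256))  # [128, 64, 32, 16, 8, 4, 2, 1]
```

The two agree for power-of-two inputs, which is the case relevant to the reduction block sizes.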


# Norm type
reduce_expr = 'a + b' if norm == 'l2' else 'max(a, b)'
# Determine the reduction block dimensions
Contributor:

Misleading comment.

{
% for i in range(ncola):
idx = r*ldim + X_IDX_AOSOA(${i}, ${ncola});
% if norm == 'uniform':
Contributor:

Indentation.

block = (128, 1, 1)

# Determine the number of groups
grid = ((ncolb + block[0] - 1)//block[0], 1, 1)
Contributor:

Space around //


self._retarr = rkern(xarr, yarr, zarr, atol, rtol,
queue=qcomp)
self.queue = queue
Contributor:

?

{
% for i in range(ncola):
idx = r*ldim + X_IDX_AOSOA(${i}, ${ncola});
% if norm == 'uniform':
Contributor:

Indentation.

@@ -63,6 +64,11 @@ def __init__(self, intg, cfgsect, suffix=None):
raise RuntimeError('System {0} not supported by plugin {1}'
.format(intg.system.name, self.name))

# Check that we support this particular integrator formulation
if not (intg.formulation in self.formulations):
Contributor:

not in.

FreddieWitherden (Contributor) commented Feb 21, 2017

In terms of the stats plugin I would change the code to firstly keep track of the total (global) number of pseudo iterations. Say, self.npseudoiter. Then I would do:

if monitoring:
    <stuff>
    self.pseudostats.append((self.npseudoiter, i, ...))
else:
    self.pseudostats.append((self.npseudoiter, i, None))

self.npseudoiter += 1

This way any plugin always knows the number of iterations taken and there is no issue with rejected steps (i resets, but that's fine, as npseudoiter is always increasing no matter what).
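A minimal model of the counter logic described above (the surrounding plugin machinery is omitted, and `step` is a hypothetical method name for illustration):

```python
class PseudoStats:
    # Model of the always-increasing global pseudo-iteration counter
    def __init__(self):
        self.npseudoiter = 0
        self.pseudostats = []

    def step(self, i, resid=None):
        # resid is None when monitoring is off for this iteration
        self.pseudostats.append((self.npseudoiter, i, resid))
        self.npseudoiter += 1

s = PseudoStats()
s.step(0, 0.5)
s.step(1)        # monitoring off: residual recorded as None
s.step(0, 0.25)  # i reset by a rejected step; npseudoiter keeps growing

print(s.pseudostats)
```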

nloppi (Member, Author) commented Feb 21, 2017

I did as you suggested. Now it prints (None,)*nvars if the residuals are not computed.

@@ -2,6 +2,7 @@

from collections import Iterable
import itertools as it
import math
Contributor:

Remove import.

@@ -14,6 +15,10 @@ def ndrange(context, *args):
return util.ndrange(*args)


def ilog2range(context, x):
return [2**i for i in range(x.bit_length())][-2::-1]
Contributor:

Rework range: range(x.bit_length() - 2, -1, -1).
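The two formulations are equivalent; a quick check (x is a hypothetical power-of-two block size):

```python
x = 128

# Version under review: build all powers, then slice-and-reverse
a = [2**i for i in range(x.bit_length())][-2::-1]

# Reviewer's reworked range, avoiding the throwaway slice
b = [2**i for i in range(x.bit_length() - 2, -1, -1)]

print(a == b)  # True
print(a)       # [64, 32, 16, 8, 4, 2, 1]
```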

block = (128, 1, 1)

# Determine the number of groups
grid = ((ncolb + block[0] - 1) // block[0], 1, 1)
FreddieWitherden (Contributor) commented Feb 22, 2017:

Expression is different to that used by the CUDA backend, check equivalence. We also do not use this variable properly.

Member Author:

Changed so that it is passed to the kernel

# Norm type
reduce_expr = 'a + b' if norm == 'l2' else 'max(a, b)'
# Reduction workgroup dimensions
block = (128, 1, 1)
Contributor:

I do not believe that PyOpenCL requires the additional dimensions if they're 1.

Member Author:

File "/home/niki/python3venv/venv/lib/python3.5/site-packages/pyopencl-2016.1-py3.5-linux-x86_64.egg/pyopencl/cffi_cl.py", line 1192, in enqueue_nd_range_kernel
work_dim = len(global_work_size)
TypeError: object of type 'int' has no len()

Contributor:

Try (128,).

@@ -15,7 +15,9 @@ def __init__(self, *args, **kwargs):
self._fnsteps = self.cfg.getint('soln-filter', 'nsteps', '0')

# Stats on the most recent step
self.stepinfo = []
self.pseudostepinfo = []
# Pseudo-step counter
Contributor:

Add newline.

Contributor:

Also dump into stats file (see collect_stats method).

Member Author:

Do you mean dumping the global iteration number self.npseudosteps into stats?

self._pseudo_aresid = self.cfg.getfloat(sect, 'pseudo-aresid')
self._pseudo_rresid = self.cfg.getfloat(sect, 'pseudo-rresid')
self._pseudo_residtol= self.cfg.getfloat(sect, 'pseudo-resid-tol')
self._norm = self.cfg.get(sect, 'pseudo-resid-norm', 'uniform')
Contributor:

Prefix with pseudo.

# The root rank needs to open the output file
if rank == root:
self.outf = init_csv(self.cfg, cfgsect,
'total_niter,t,pseudo-iter,' + fvars)
Contributor:

Inconsistent column names.

nloppi force-pushed the resid branch 2 times, most recently from 99d4850 to 03804bb (February 22, 2017 22:05)
nloppi (Member, Author) commented Feb 22, 2017

Now npseudosteps is dumped into the stats file.

Because the writer plugin is called for t=0.0 in the __init__ of BaseIntegrator, I need to initialise npseudosteps = 0 there. I can also do it in DualController/BaseDualIntegrator by switching the order:

def __init__(self, *args, **kwargs):
    self.npseudosteps = 0
    super().__init__(*args, **kwargs)

Which one do you prefer?


self._retarr = rkern(xarr, yarr, zarr, atol, rtol,
queue=qcomp)
rkern(queue.cl_queue_comp, (ncolb, 1, 1), block,
Contributor:

Might be able to get away with (ncolb,)

FreddieWitherden (Contributor) commented:

Putting it in the __init__ function before we call super will be fine.
