## HACCKernels

Roughly 33% of the run time is spent in each of the three instantiations of this template. Because the kernel is a template, a more detailed breakdown within each instantiation is not available.

```cpp
template <int PolyOrder, const float (&PolyCoefficients)[PolyOrder+1]>
static void GravityForceKernel(int n, float *RESTRICT x, float *RESTRICT y,
                               float *RESTRICT z, float *RESTRICT mass,
                               float x0, float y0, float z0,
                               float MaxSepSqrd, float SofteningLenSqrd,
                               float &RESTRICT ax, float &RESTRICT ay,
                               float &RESTRICT az) {
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;

// As written below, the mass array is conditionally accessed (i.e. accessed
// only if the interaction is not filtered by the distance checks). This will
// tend to inhibit vectorization on architectures without masked vector loads.
// With OpenMP 4+, we can explicitly inform the compiler that vectorization is
// safe.
#if _OPENMP >= 201307
#pragma omp simd reduction(+:lax,lay,laz)
#endif
  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
    float r2 = dx * dx + dy * dy + dz * dz;

    if (r2 >= MaxSepSqrd || r2 == 0.0f)
      continue;

    float r2s = r2 + SofteningLenSqrd;
    float f = PolyCoefficients[PolyOrder];
    for (int p = 1; p <= PolyOrder; ++p)
      f = PolyCoefficients[PolyOrder-p] + r2*f;

    f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
  }

  ax += lax;
  ay += lay;
  az += laz;
}
```
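To make the "three versions" concrete, here is a minimal sketch of how one template can yield three separately compiled kernels, each of which then appears as its own entry in a profile. It assumes the versions differ only in polynomial order and coefficient table. The orders 4, 5 and 6, the `DemoCoeffs*` tables (placeholder values, not the fitted HACC coefficients), and the `DemoGravityForce` dispatcher are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cmath>

// Placeholder coefficient tables: hypothetical values for illustration only,
// NOT the fitted coefficients used by HACC. Namespace-scope arrays can be
// bound to a reference template parameter.
const float DemoCoeffs4[5] = {0.5f, 0.0f, 0.0f, 0.0f, 0.0f};
const float DemoCoeffs5[6] = {0.5f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
const float DemoCoeffs6[7] = {0.5f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};

// Condensed copy of the kernel above (RESTRICT and the OpenMP pragma omitted
// so the sketch builds without extra macros or flags).
template <int PolyOrder, const float (&PolyCoefficients)[PolyOrder + 1]>
static void GravityForceKernel(int n, const float *x, const float *y,
                               const float *z, const float *mass, float x0,
                               float y0, float z0, float MaxSepSqrd,
                               float SofteningLenSqrd, float &ax, float &ay,
                               float &az) {
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;
  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
    float r2 = dx * dx + dy * dy + dz * dz;
    if (r2 >= MaxSepSqrd || r2 == 0.0f)
      continue;  // beyond the cutoff, or zero separation
    float r2s = r2 + SofteningLenSqrd;
    float f = PolyCoefficients[PolyOrder];
    for (int p = 1; p <= PolyOrder; ++p)  // Horner evaluation in r2
      f = PolyCoefficients[PolyOrder - p] + r2 * f;
    f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];
    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
  }
  ax += lax;
  ay += lay;
  az += laz;
}

// Hypothetical dispatcher: every case names a distinct instantiation, so the
// compiler emits three independent copies of the loop.
void DemoGravityForce(int order, int n, const float *x, const float *y,
                      const float *z, const float *mass, float x0, float y0,
                      float z0, float MaxSepSqrd, float SofteningLenSqrd,
                      float &ax, float &ay, float &az) {
  assert(order >= 4 && order <= 6);
  switch (order) {
  case 4:
    GravityForceKernel<4, DemoCoeffs4>(n, x, y, z, mass, x0, y0, z0,
                                       MaxSepSqrd, SofteningLenSqrd, ax, ay, az);
    break;
  case 5:
    GravityForceKernel<5, DemoCoeffs5>(n, x, y, z, mass, x0, y0, z0,
                                       MaxSepSqrd, SofteningLenSqrd, ax, ay, az);
    break;
  default:
    GravityForceKernel<6, DemoCoeffs6>(n, x, y, z, mass, x0, y0, z0,
                                       MaxSepSqrd, SofteningLenSqrd, ax, ay, az);
    break;
  }
}
```

Since `PolyOrder` is a compile-time constant, the inner Horner loop has a fixed trip count in each instantiation, which lets the compiler unroll it fully.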

### Serial Run

#### Top Level Characteristics
9.55e+06 usec CPUTIME  
50.1% of Instructions are Load/Store Instructions  
1.00 Cycles per Instruction

#### Issue Cycles  
2.32e+09 Full Issue | 10.6% Cycles Issuing Max Instructions  
1.00e+07 No Issue | less than 0.1% Cycles Issuing No Instructions  
2.18e+10 Total Cycles  
  
#### Retiring Cycles
6.62e+09 Full Retire | 30.4% Cycles Retiring Max Instructions  
6.62e+09 No Retire | 30.4% Cycles Retiring No Instructions  
2.18e+10 Total Cycles

### Memory
#### Data Cache
2.80e+07 L1 Data Cache Misses | 0.2% L1 Cache Miss Rate  
5.57e+03 L2 Data Cache Misses | over 99.9% L1 Misses Hit L2  
1.14e+10 Load/Store Instructions
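
The percentages in this report can be reproduced from the raw counters. The sketch below shows the arithmetic under assumed definitions: cycle percentages are taken relative to total cycles, the L1 miss rate relative to load/store instructions, and the L2 hit share relative to L1 misses. These definitions are inferred from the numbers, not taken from the profiler's documentation, and all function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Share of `part` in `total`, expressed as a percentage.
double PercentOf(double part, double total) {
  assert(total > 0.0);
  return 100.0 * part / total;
}

// Round to one decimal place, the precision used in the report.
double RoundToTenth(double value) { return std::round(value * 10.0) / 10.0; }

// Raw counter values copied from the serial-run report.
const double kTotalCycles = 2.18e10;
const double kFullIssueCycles = 2.32e9;
const double kNoIssueCycles = 1.00e7;
const double kFullRetireCycles = 6.62e9;
const double kL1DataMisses = 2.80e7;
const double kL2DataMisses = 5.57e3;
const double kLoadStoreInstructions = 1.14e10;

// Assumed definitions of the derived metrics (inferred, see lead-in).
double FullIssuePercent() { return PercentOf(kFullIssueCycles, kTotalCycles); }
double NoIssuePercent() { return PercentOf(kNoIssueCycles, kTotalCycles); }
double FullRetirePercent() { return PercentOf(kFullRetireCycles, kTotalCycles); }
double L1MissRatePercent() {
  return PercentOf(kL1DataMisses, kLoadStoreInstructions);
}
double L2HitPercentOfL1Misses() {
  return 100.0 - PercentOf(kL2DataMisses, kL1DataMisses);
}
```

Under these assumptions, `RoundToTenth(FullIssuePercent())` gives 10.6, `RoundToTenth(FullRetirePercent())` gives 30.4, and `RoundToTenth(L1MissRatePercent())` gives 0.2, matching the reported values. `NoIssuePercent()` comes out near 0.05% and `L2HitPercentOfL1Misses()` near 99.98%, consistent with "less than 0.1%" and "over 99.9%".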