## HACCKernels

```cpp
template <int PolyOrder, const float (&PolyCoefficients)[PolyOrder+1]>
static void GravityForceKernel(int n, float *RESTRICT x, float *RESTRICT y,
                               float *RESTRICT z, float *RESTRICT mass,
                               float x0, float y0, float z0,
                               float MaxSepSqrd, float SofteningLenSqrd,
                               float &RESTRICT ax, float &RESTRICT ay,
                               float &RESTRICT az) {
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;

// As written below, the mass array is conditionally accessed (i.e. accessed
// only if the interaction is not filtered by the distance checks). This will
// tend to inhibit vectorization on architectures without masked vector loads.
// With OpenMP 4+, we can explicitly inform the compiler that vectorization is
// safe.
#if _OPENMP >= 201307
#pragma omp simd reduction(+:lax,lay,laz)
#endif
  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
    float r2 = dx * dx + dy * dy + dz * dz;

    if (r2 >= MaxSepSqrd || r2 == 0.0f)
      continue;

    float r2s = r2 + SofteningLenSqrd;
    float f = PolyCoefficients[PolyOrder];
    for (int p = 1; p <= PolyOrder; ++p)
      f = PolyCoefficients[PolyOrder-p] + r2*f;

    f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
  }

  ax += lax;
  ay += lay;
  az += laz;
}
```
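The inner loop over `p` in the kernel evaluates a polynomial in `r2` using Horner's rule. The sketch below isolates that step so it can be checked on its own; `horner_eval` is a hypothetical helper introduced here for illustration and is not part of HACCKernels, and it takes the coefficients as a runtime pointer rather than a template parameter.

```cpp
#include <cassert>

// Hypothetical helper (not part of HACCKernels): evaluates
//   coeffs[0] + coeffs[1]*x + ... + coeffs[order]*x^order
// with Horner's rule, mirroring the kernel's inner loop where x is r2.
inline float horner_eval(const float *coeffs, int order, float x) {
  assert(order >= 0);
  float f = coeffs[order];            // start from the highest-order term
  for (int p = 1; p <= order; ++p)
    f = coeffs[order - p] + x * f;    // fold in the next lower coefficient
  return f;
}
```

For instance, with coefficients `{1, 2, 3}` and `x = 2`, the loop computes `3`, then `2 + 2*3 = 8`, then `1 + 2*8 = 17`, which matches `1 + 2*2 + 3*4`.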

#### Serial Run

#### Top Level Characteristics
9.55e+06 usec CPUTIME  
0.98 Cycles per Instruction (2.18e+10 cycles / 2.23e+10 instructions)  
13.3% of Cycles Issuing Max Instructions  
161x as many Cycles Issuing Max Instructions as Cycles Issuing No Instructions  
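
The cycles-per-instruction figure is the ratio of the two raw counter totals quoted in parentheses. A minimal sketch of that arithmetic follows; the function name is hypothetical and the inputs are assumed to be totals from the same run.

```cpp
#include <cassert>

// Hypothetical helper: cycles per instruction from two raw counter totals.
inline double cycles_per_instruction(double total_cycles,
                                     double total_instructions) {
  assert(total_instructions > 0.0);
  return total_cycles / total_instructions;
}
```

Calling `cycles_per_instruction(2.18e10, 2.23e10)` gives roughly 0.98.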

50.1% of Instructions are Load/Store Instructions

61.5% of Cycles Stalled on Any Resource  
30.7% of Cycles Retiring Max Instructions  
30.7% of Cycles Retiring No Instructions (these two retirement counters were measured together; they should be re-measured separately)  


#### Memory  
0.2% L1 Data Cache Miss Rate (2.80e+07 L1_DCM / (9.09e+09 LD_INS + 2.27e+09 SR_INS))  
99.9% of L1 Cache Misses hit L2 Cache (1 - (5.22e+03 L2_DCM / 2.80e+07 L1_DCM))
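
The two memory figures are simple ratios of the counter totals shown in parentheses. The sketch below reproduces that arithmetic; the function names are hypothetical, and the parameter names simply follow the counter labels used in the text.

```cpp
#include <cassert>

// Hypothetical helper: L1 data-cache miss rate, taken as misses divided
// by the total number of load and store instructions.
inline double l1_miss_rate(double l1_dcm, double ld_ins, double sr_ins) {
  assert(ld_ins + sr_ins > 0.0);
  return l1_dcm / (ld_ins + sr_ins);
}

// Hypothetical helper: fraction of L1 data-cache misses that hit in L2,
// i.e. one minus the share of L1 misses that also missed in L2.
inline double l2_hit_fraction(double l2_dcm, double l1_dcm) {
  assert(l1_dcm > 0.0);
  return 1.0 - l2_dcm / l1_dcm;
}
```

With the counts listed here, `l1_miss_rate(2.80e7, 9.09e9, 2.27e9)` is roughly 0.0025 and `l2_hit_fraction(5.22e3, 2.80e7)` is roughly 0.9998.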