Vectorized (SIMD) Numerical Schemes #1022

pcarruscag · 2020-06-10T10:56:28Z

Proposed Changes

The objective is to provide a natural (i.e. readable) way to write code that extracts all the performance modern CPUs have to offer.

As described in the last update of #789, this requires completely different numerics classes, which is also an opportunity to address other inefficiencies of the current ones. For now all centered schemes and plain Roe have been "ported" (this also adds JST with matrix dissipation).
Most of the auxiliary CNumerics functions were also ported as a consequence (inviscid fluxes, Jacobians, and so on), so implementing new schemes only gets easier.

If you want to test this, the new numerics are called for Roe on ideal gas without low Mach corrections (enabled by option USE_VECTORIZATION=YES) and also for centered schemes.
To get good performance on gcc you need the following optimization flags:
-O2 -funroll-loops -ffast-math -march=?? -mtune=??
where ?? should be an architecture with AVX (e.g. haswell, skylake).
There are AVX and AVX512 specific optimizations.
The OpenMP simd directive is also used. At least on gcc this makes a difference (because it is stubborn at vectorizing), so pass -Dwith-omp=true when calling meson.py.
You can expect a speedup of around 30% (+- 10%) on problems that do not need a lot of linear solver iterations, or more on machines with AVX512.
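To illustrate the OpenMP simd directive mentioned above, here is a minimal sketch of a loop annotated with it. The axpy function is a hypothetical example and is not code from this PR; it only shows where the directive goes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from SU2): y[i] += a * x[i].
inline void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  const double* xp = x.data();
  double* yp = y.data();
  const std::size_t n = x.size();
  // Ask the compiler to vectorize this loop. When OpenMP is not enabled
  // the pragma is ignored and the loop runs as ordinary scalar code.
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    yp[i] += a * xp[i];
  }
}
```

The result is the same with or without the directive; only the generated instructions are expected to differ.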

Related Work

#789

PR Checklist

I am submitting my contribution to the develop branch.
My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
My contribution is commented and consistent with SU2 style.
I have added a test case that demonstrates my contribution, if necessary.
I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

pcarruscag · 2020-06-18T22:14:17Z

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp

+template<size_t nDim>
+struct CCompressiblePrimitives {
+  enum : size_t {nVar = nDim+4};
+  VectorDbl<nVar> all;
+  FORCEINLINE Double& temperature() { return all(0); }
+  FORCEINLINE Double& pressure() { return all(nDim+1); }
+  FORCEINLINE Double& density() { return all(nDim+2); }
+  FORCEINLINE Double& enthalpy() { return all(nDim+3); }
+  FORCEINLINE Double& velocity(size_t iDim) { return all(iDim+1); }
+  FORCEINLINE const Double& temperature() const { return all(0); }
+  FORCEINLINE const Double& pressure() const { return all(nDim+1); }
+  FORCEINLINE const Double& density() const { return all(nDim+2); }
+  FORCEINLINE const Double& enthalpy() const { return all(nDim+3); }
+  FORCEINLINE const Double& velocity(size_t iDim) const { return all(iDim+1); }
+  FORCEINLINE const Double* velocity() const { return &velocity(0); }
+};


I'm also trying to address the magic indexes (iDim+1) by defining specific types for primitive and conservative variables rather than using raw arrays.
Initially I thought of making temperature, pressure, etc. references pointing into the correct position of the storage vector (all), but unfortunately that disables compiler-generated constructors... so they are functions.
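The trade-off between reference members and accessor functions can be sketched as follows. Both structs are hypothetical simplifications (plain double and std::array instead of the SIMD types). The sketch shows two effects that are certain for any class with reference members: the implicit copy assignment is deleted, and an implicit copy keeps referring to the source object's storage.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

// Hypothetical variant: named references into the storage array.
struct PrimitivesWithRefs {
  std::array<double, 4> all{};
  double& temperature = all[0];
  double& pressure = all[1];
};

// Variant with accessor functions, in the spirit of CCompressiblePrimitives.
struct PrimitivesWithFuncs {
  std::array<double, 4> all{};
  double& temperature() { return all[0]; }
  double& pressure() { return all[1]; }
};

// Reference members delete the implicit copy assignment operator...
static_assert(!std::is_copy_assignable<PrimitivesWithRefs>::value, "refs block assignment");
// ...whereas the function-based version keeps it.
static_assert(std::is_copy_assignable<PrimitivesWithFuncs>::value, "functions keep assignment");

inline void checkCopySemantics() {
  PrimitivesWithRefs a;
  PrimitivesWithRefs b = a;  // implicit copy constructor
  // The references of b were copied from a, so they alias a's storage.
  assert(&b.temperature == &a.all[0]);

  PrimitivesWithFuncs p;
  p.pressure() = 2.0;
  PrimitivesWithFuncs q;
  q = p;  // compiler-generated copy assignment
  assert(q.pressure() == 2.0);
  assert(&q.pressure() != &p.pressure());
}
```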

pcarruscag · 2020-06-18T22:21:26Z

Common/include/toolboxes/C2DContainer.hpp

+  /*!
+   * \brief Get a SIMD gather iterator to the inner dimension of the container.
+   */
+  template<size_t nCols, class T, size_t N>
+  FORCEINLINE CInnerIterGather<simd::Array<T,N> > innerIter(simd::Array<T,N> row) const noexcept
+  {
+    return CInnerIterGather<simd::Array<T,N> >(m_data, IsRowMajor? 1 : this->rows(), IsRowMajor? row*nCols : row);
+  }


The only places where the "vector nature" of the SIMD type needs to be handled explicitly are when retrieving data from containers (C2DContainer and CVectorOfMatrix) and when putting it back in other containers (CSysVector and CSysMatrix).
Data is retrieved via iterators (see #789 for the reason, spoiler alert, it's for performance).

EDIT: Or via methods that copy the data into a static container of SIMD types. The iterator approach only performs well if the target CPU has high-performance gather instructions (the majority don't).
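The "copy into a static container" alternative can be sketched roughly like this. Lanes and gatherRows are hypothetical names, and std::array stands in for the real SIMD type; the actual container methods in this PR may look different.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD type: N lanes of double.
template <std::size_t N>
using Lanes = std::array<double, N>;

// Copy nCols values from each of N (possibly non-contiguous) rows of a
// row-major matrix into a fixed-size container of lane arrays, such that
// result[j][k] holds column j of row rows[k].
template <std::size_t nCols, std::size_t N>
std::array<Lanes<N>, nCols> gatherRows(const std::vector<double>& data,
                                       std::size_t totalCols,
                                       const std::array<std::size_t, N>& rows) {
  assert(nCols <= totalCols);
  std::array<Lanes<N>, nCols> result{};
  for (std::size_t k = 0; k < N; ++k) {
    assert((rows[k] + 1) * totalCols <= data.size());
    for (std::size_t j = 0; j < nCols; ++j) {
      result[j][k] = data[rows[k] * totalCols + j];
    }
  }
  return result;
}
```

After this copy, the numerics only see fixed-size arrays of lanes, so the rest of the computation does not depend on how the rows were laid out in memory.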

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp

pcarruscag

So in #789 I mentioned that I want these new numerics to be thread-safe and have minimal virtual overhead.
For that I'm using CRTP (the curiously recurring template pattern) to get static polymorphism; the mechanics are explained below.

pcarruscag · 2020-06-18T23:12:05Z

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp

+                   const CVariable& solution_,
+                   UpdateType updateType,
+                   CSysVector<su2double>& vector,
+                   CSysMatrix<su2mixedfloat>& matrix) const final {
+
+    const bool implicit = (config.GetKind_TimeIntScheme() == EULER_IMPLICIT);
+    const auto& solution = static_cast<const CEulerVariable&>(solution_);


Another significant change from the current numerics is that (again for thread safety) we do not "set" anything into them. We pass them the whole variable structure (the solver nodes) and the numerics can read (and only read) any data they need.
Similarly to what is done in the solvers, since we know the type of variable that goes with the numerical scheme, we can static_cast to the derived type to avoid polymorphism.
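A minimal sketch of this read-only, static_cast-based access is shown below. VariableBase, EulerVariableLike and averageDensity are hypothetical stand-ins for CVariable, CEulerVariable and a numerical scheme; they are not the SU2 classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the generic variable base class.
class VariableBase {
public:
  virtual ~VariableBase() = default;
};

// Hypothetical stand-in for the flow variable class.
class EulerVariableLike : public VariableBase {
public:
  explicit EulerVariableLike(std::vector<double> density) : density_(std::move(density)) {}
  // Non-virtual accessor, so the call can be inlined.
  double density(std::size_t iPoint) const {
    assert(iPoint < density_.size());
    return density_[iPoint];
  }
private:
  std::vector<double> density_;
};

// The scheme receives the generic type through a const reference (read-only),
// but it knows which derived type goes with it, so it casts statically.
inline double averageDensity(const VariableBase& solution_, std::size_t iPoint, std::size_t jPoint) {
  const auto& solution = static_cast<const EulerVariableLike&>(solution_);
  return 0.5 * (solution.density(iPoint) + solution.density(jPoint));
}
```

The cast is only valid when the object really is of the derived type, which is why the pairing between scheme and variable type has to be known by construction.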

I see the base Roe flux is working with CEulerVariable. Given it's an inviscid flux considering density, momentum and energy, I suppose this is something that will work for all compressible solvers (compressible Euler, NS and RANS)?

It will work since CEulerVariable is used in all of those in one way or another, i.e. CNSVariable is also a CEulerVariable with extra methods.

CatarinaGarbacz

@pcarruscag, this significant contribution on vectorization seems very valuable as one more initiative to speed up (and tidy up) the code. It's great to see SU2 evolving to make use of handy programming and C++ performance strategies.

I am not an expert in this type of implementation, but overall it seems to be implemented in a neat and advanced way, making good use of C++ tools, functions and user-defined types.

CatarinaGarbacz · 2020-09-24T10:01:36Z

I see the base Roe flux is working with CEulerVariable. Given it's an inviscid flux considering density, momentum and energy, I suppose this is something that will work for all compressible solvers (compressible Euler, NS and RANS)?

pcarruscag · 2020-09-27T10:48:12Z

Thank you for the review, Catarina. Based on your comments I will try to explain the new structure better.

The interface

These numerics classes are still abstract and so the outside world only needs to know about their interface:

class CNumericsSIMD {
public:
  virtual void ComputeFlux(...) const = 0;
};

Any class that wants to be a SIMD numerics class needs to inherit from this base and in so doing is forced to implement ComputeFlux (because it is pure virtual).

Static decorator pattern

Now, in this structure we don't have separate convective and viscous classes. Instead, we have convective schemes that can be "decorated" with viscous fluxes. For example:

template<int nDim>
class CCompressibleViscousFlux : public CNumericsSIMD {
protected:
  void viscousTerms(...) const {...}
};

Note here that this class template is derived from CNumericsSIMD but does not implement ComputeFlux, thus it cannot be instantiated by itself. The idea is that convective schemes can use these viscous fluxes as a base class (thereby linking them to CNumericsSIMD) to access the viscousTerms method (when we don't want viscous terms we just use a dummy viscous class).
Note also the template parameter nDim. This is because we want to create specific versions of the numerical schemes for 2D and 3D (we "want" this because it allows static allocation and unrolling loops perfectly).
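As a small illustration of why a compile-time nDim helps, consider the sketch below. The dot function is a hypothetical example, not part of the PR: the array size and the loop trip count are both known at compile time, so storage can live on the stack and the compiler is free to unroll the loop fully.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: dot product of two nDim-sized vectors.
// nDim is a compile-time constant, so no dynamic allocation is needed
// and the loop has a fixed trip count (2 or 3 in practice).
template <std::size_t nDim>
double dot(const std::array<double, nDim>& a, const std::array<double, nDim>& b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < nDim; ++i) sum += a[i] * b[i];
  return sum;
}

// Usage: one instantiation is generated per dimension that is actually used.
inline void dotExample() {
  const std::array<double, 3> n{{1.0, 2.0, 3.0}};
  assert(dot(n, n) == 14.0);
}
```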

Then convective classes also need to be class templates so that we can programmatically change their base class:

template<class Base>
class MyConvectiveScheme : public Base {
public:
  void ComputeFlux(...) const {
    // do my own thing
    Base::viscousTerms(...);
    // update the linear system with the result
  }
};

Now to create an instance of this class template we do for example:

auto obj = new MyConvectiveScheme< CCompressibleViscousFlux<2> >(...);

which would create an object for 2D problems with viscous terms.

And so we need at least 4 instantiations of these class templates (2D/3D, with or without viscous terms). This is done in the factory method implemented in CNumericsSIMD.cpp, which is the only .cpp file in this entire implementation.
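A compilable toy version of the static decorator and its factory might look like the sketch below. All names (NumericsInterface, ViscousDecorator, NoViscousDecorator, ToyConvectiveScheme, createNumerics) and the scalar "fluxes" are hypothetical; the real interface takes many more arguments and updates a linear system instead of returning a value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical, heavily simplified interface.
class NumericsInterface {
public:
  virtual ~NumericsInterface() = default;
  virtual double ComputeFlux(double left, double right) const = 0;
};

// Decorator that provides a (toy) viscous contribution.
template <std::size_t nDim>
class ViscousDecorator : public NumericsInterface {
protected:
  double viscousTerms(double left, double right) const { return 0.25 * (right - left); }
};

// Dummy decorator used when viscous terms are not wanted.
template <std::size_t nDim>
class NoViscousDecorator : public NumericsInterface {
protected:
  double viscousTerms(double, double) const { return 0.0; }
};

// Convective scheme whose base class is chosen at compile time.
template <class Base>
class ToyConvectiveScheme final : public Base {
public:
  double ComputeFlux(double left, double right) const override {
    const double convective = 0.5 * (left + right);
    return convective - Base::viscousTerms(left, right);
  }
};

// Factory: the only place that needs to know about the four instantiations.
inline std::unique_ptr<NumericsInterface> createNumerics(std::size_t nDim, bool viscous) {
  assert(nDim == 2 || nDim == 3);
  using Ptr = std::unique_ptr<NumericsInterface>;
  if (nDim == 2) {
    if (viscous) return Ptr(new ToyConvectiveScheme<ViscousDecorator<2>>());
    return Ptr(new ToyConvectiveScheme<NoViscousDecorator<2>>());
  }
  if (viscous) return Ptr(new ToyConvectiveScheme<ViscousDecorator<3>>());
  return Ptr(new ToyConvectiveScheme<NoViscousDecorator<3>>());
}
```

Callers only hold a pointer to the interface, so the single virtual call happens once per ComputeFlux invocation, while the call to viscousTerms inside it is resolved at compile time.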

Static polymorphism

Another concept used in this implementation for efficiency is static polymorphism.
For example, in the non-vectorized numerics we have a family of Roe schemes, since a lot of code is shared and the only difference is how the dissipation terms are computed. There this is done with virtual functions; here we want none of that.
Virtual functions allow a parent class to dynamically call methods of its children. We want to do this statically, and so we need to let the parent class know who is deriving from it.

template<class Derived>
class Parent {
public:
  void parentMethod() {
    // "I know I am also a Derived and so I can cast myself."
    auto derived = static_cast<Derived*>(this);
    // "now I can use a method of derived"
    derived->childMethod();
  }
};

// A derived class needs to inform the parent about itself
class Child : public Parent<Child> {
public:
  void childMethod() {...}
};

Why is this better? Note that two of these derived classes don't actually have the same parent: one inherits from Parent<ChildA>, the other from Parent<ChildB>. This means that two versions of Parent are instantiated, one for each derived class, which allows code to be inlined and optimized for each, an ability lost with virtual functions.
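The same idea in a form that compiles, assuming a toy scalar flux: RoeLikeBase, FullDissipation and HalfDissipation are hypothetical names chosen to mirror the "family of Roe schemes that differ only in the dissipation" example, not the actual SU2 classes.

```cpp
#include <cassert>
#include <type_traits>

// Shared code lives in the parent, which is told who derives from it.
template <class Derived>
class RoeLikeBase {
public:
  double flux(double left, double right) const {
    // Static downcast: valid because Derived inherits from RoeLikeBase<Derived>.
    const auto* derived = static_cast<const Derived*>(this);
    return 0.5 * (left + right) - derived->dissipation(left, right);
  }
};

// Two children that differ only in how the dissipation is computed.
class FullDissipation : public RoeLikeBase<FullDissipation> {
public:
  double dissipation(double left, double right) const { return 0.5 * (right - left); }
};

class HalfDissipation : public RoeLikeBase<HalfDissipation> {
public:
  double dissipation(double left, double right) const { return 0.25 * (right - left); }
};

// The two children really do have different parent types.
static_assert(!std::is_same<RoeLikeBase<FullDissipation>, RoeLikeBase<HalfDissipation>>::value,
              "one parent instantiation per child");

inline void crtpExample() {
  assert(FullDissipation().flux(1.0, 3.0) == 1.0);
  assert(HalfDissipation().flux(1.0, 3.0) == 1.5);
}
```

No virtual function is involved: each call to dissipation is bound at compile time inside its own instantiation of the parent.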

Putting it all together

For vectorized central schemes we have something like:

// Intermediate class for centered schemes, note the 2 template parameters
template<class Derived, class Base>
class CCenteredBase : public Base {
public:
  // Main public method implemented here making use of "Derived" and "Base".
  void ComputeFlux(...) const final {
    // "I know I am also a Derived", as in the static polymorphism example.
    const auto derived = static_cast<const Derived*>(this);
    ... // gather data, do some computation
    derived->DissipationTerms(...); // static polymorphism
    Base::viscousTerms(...); // static decorator
    ... // update linear system
  }
};

// A final centered scheme, which is what we instantiate, with some viscous decorator.
template<class Decorator>
class CJSTScheme : public CCenteredBase<CJSTScheme<Decorator>, Decorator> {
public: // public so that CCenteredBase can call it through the Derived pointer
  void DissipationTerms(...) const {...}
};
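For readers who want to compile the combined pattern, here is a toy sketch with scalar fluxes. CenteredInterface, ToyViscous, NoViscous, ToyCenteredBase and ToyJST are hypothetical names and the arithmetic is invented for illustration; only the class structure mirrors the outline above.

```cpp
#include <cassert>

// Hypothetical simplified interface.
class CenteredInterface {
public:
  virtual ~CenteredInterface() = default;
  virtual double ComputeFlux(double left, double right) const = 0;
};

// Viscous decorators (toy arithmetic).
class ToyViscous : public CenteredInterface {
protected:
  double viscousTerms(double left, double right) const { return 0.25 * (right - left); }
};

class NoViscous : public CenteredInterface {
protected:
  double viscousTerms(double, double) const { return 0.0; }
};

// Intermediate class: uses Derived (CRTP) and Base (decorator).
template <class Derived, class Base>
class ToyCenteredBase : public Base {
public:
  double ComputeFlux(double left, double right) const final {
    const auto* derived = static_cast<const Derived*>(this);
    const double central = 0.5 * (left + right);
    return central
         - derived->dissipationTerms(left, right)  // static polymorphism
         - Base::viscousTerms(left, right);        // static decorator
  }
};

// Final scheme, instantiated with some decorator.
template <class Decorator>
class ToyJST : public ToyCenteredBase<ToyJST<Decorator>, Decorator> {
public:
  double dissipationTerms(double left, double right) const { return 0.125 * (right - left); }
};

// Client code only needs the interface.
inline double fluxThroughInterface(const CenteredInterface& scheme) {
  return scheme.ComputeFlux(1.0, 3.0);
}

inline void centeredExample() {
  ToyJST<ToyViscous> viscous;
  ToyJST<NoViscous> inviscid;
  assert(fluxThroughInterface(viscous) == 1.25);
  assert(fluxThroughInterface(inviscid) == 1.75);
}
```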

PR #1022 (SIMD) introduced some minor differences in my(!) CHT regression tests. 8184779..4b9f2a8x contains #1022 & #1080 (only 5 lines). The change is in the solid only and only affects the 2D case, not the 3D one. I don't know what specifically introduced the changes, but as they are small I assume for now that it is just a small numerical difference.

add basic simd type

82b9edf

pr-triage bot added the PR: draft label Jun 10, 2020

pcarruscag commented Jun 10, 2020

begin prototype of simd numerics

058a5b8

pcarruscag commented Jun 14, 2020

pcarruscag added 10 commits June 17, 2020 13:43

Merge branch 'feature_quasi_newton_adjoint' into feature_simd_numerics

2d2961a

Merge branch 'iteration_class' into feature_simd_numerics

1bfea72

optimize least squares gradients when periodic comms are not needed

e41c6f6

use CRTP for static polymorphism

07d325f

fix search/replace mistakes

8eeac65

fix LS gradients preacc

c24cd1c

add iterators to C2DContainer, fix compiler errors

20a7edc

Merge branch 'feature_quasi_newton_adjoint' into feature_simd_numerics

b72ac57

add SIMD set methods to CSysVector and CSysMatrix, fix 1000 compilation errors

d43c6ad

codefactor

ff0ea0b

pcarruscag commented Jun 18, 2020

pcarruscag commented Jun 18, 2020

pcarruscag added 11 commits June 19, 2020 22:21

make new numerics compatible with non-SIMD types (for AD)

7e47d1a

fetching edge nodes needs gather due to coloring, add C3DContainerDecorator

06956b6

improving and cleaning re-orientation checks

112bf04

optimize least squares gradients when periodic comms are not needed

b2db9ba

fix LS gradients preacc

a711554

Merge branch 'cleanup_orientation_checks' into feature_simd_numerics

d905213

Merge branch 'cleanup_orientation_checks' into feature_simd_numerics

80b9453

use scale factor in vector and matrix updates as a mask to handle "remainder" edges

c2b7049

template mechanism for static decorator pattern

88a6c33

small LS cleanups and comments

5fedf08

small LS cleanups and comments

6054e08

pcarruscag added 5 commits August 9, 2020 00:27

fix clang issues

6621d1d

Merge branch 'feature_simd_numerics' into feature_jst_matrix

1f8bddb

Merge remote-tracking branch 'upstream/develop' into feature_simd_numerics

4ced288

fix clang debug AD build issue

2efb124

re update testcases after merge with develop

beb8688

pcarruscag mentioned this pull request Aug 15, 2020

[WIP] SA turb model consistent implementation #1066

Closed

5 tasks

pcarruscag and others added 5 commits September 4, 2020 23:41

Merge remote-tracking branch 'upstream/develop' into feature_simd_numerics

10b56e9

Merge branch 'feature_simd_numerics' into feature_jst_matrix

94cd99d

Merge branch 'develop' into feature_simd_numerics

c98b530

Merge branch 'develop' into feature_simd_numerics

802d3ef

Merge branch 'feature_simd_numerics' into feature_jst_matrix

1414f3f

CatarinaGarbacz reviewed Sep 24, 2020

CatarinaGarbacz approved these changes Sep 24, 2020

pr-triage bot added PR: reviewed-approved and removed PR: unreviewed labels Sep 24, 2020

Merge remote-tracking branch 'upstream/develop' into feature_jst_matrix

99988b5

pcarruscag mentioned this pull request Sep 27, 2020

Fix EFFICIENCY calculation #1074

Merged

4 tasks

pcarruscag added 2 commits September 27, 2020 12:58

address PR comments, fix iDim==3 issues

d43a50b

Merge remote-tracking branch 'upstream/develop' into feature_simd_numerics

bd8d88f

pr-triage bot added PR: unreviewed and removed PR: reviewed-approved labels Sep 27, 2020

update config_template

6ea6116

pcarruscag merged commit bc39e7e into develop Sep 30, 2020

pcarruscag deleted the feature_simd_numerics branch September 30, 2020 14:04

pr-triage bot added PR: merged and removed PR: unreviewed labels Sep 30, 2020

pcarruscag mentioned this pull request Dec 29, 2020

Support for UQ and NICF with vectorized (SIMD) centered schemes #1149

Merged

5 tasks
