Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized (SIMD) Numerical Schemes #1022

Merged
merged 112 commits into from
Sep 30, 2020
Merged

Vectorized (SIMD) Numerical Schemes #1022

merged 112 commits into from
Sep 30, 2020

Conversation

pcarruscag
Copy link
Member

@pcarruscag pcarruscag commented Jun 10, 2020

Proposed Changes

The objective is to provide a natural (i.e. readable) way to write code that extracts all the performance modern CPU have to offer.

As described in the last update of #789 this will require completely different numerics classes, also to address other inefficiencies they have. For now all centered schemes and plain Roe have been "ported" (this also adds JST with matrix dissipation).
Most of the auxiliary CNumerics functions were also ported as consequence (inviscid fluxes, Jacobians, and so on) and so implementing new schemes only gets easier.

If you want to test this, the new numerics are called for Roe on ideal gas without low Mach corrections (enabled by option USE_VECTORIZATION=YES) and also for centered schemes.
To get good performance on gcc you need the following optimization flags:
-O2 -funroll-loops -ffast-math -march=?? -mtune=?? Where ?? should be something with AVX (e.g. haswell, skylake).
There are AVX and AVX512 specific optimizations.
Also the OpenMP simd directive is used, at least on gcc this makes a difference (because it is stubborn at vectorizing) so -Dwith-omp=true when calling meson.py.
You can expect around 30% speedup +- 10% on problems that do not need a lot of linear solver iterations, or more on machines with AVX512.

Related Work

#789

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp) , if necessary.

@pr-triage pr-triage bot added the PR: draft label Jun 10, 2020
Comment on lines 208 to 223
template<size_t nDim>
struct CCompressiblePrimitives {
enum : size_t {nVar = nDim+4};
VectorDbl<nVar> all;
FORCEINLINE Double& temperature() { return all(0); }
FORCEINLINE Double& pressure() { return all(nDim+1); }
FORCEINLINE Double& density() { return all(nDim+2); }
FORCEINLINE Double& enthalpy() { return all(nDim+3); }
FORCEINLINE Double& velocity(size_t iDim) { return all(iDim+1); }
FORCEINLINE const Double& temperature() const { return all(0); }
FORCEINLINE const Double& pressure() const { return all(nDim+1); }
FORCEINLINE const Double& density() const { return all(nDim+2); }
FORCEINLINE const Double& enthalpy() const { return all(nDim+3); }
FORCEINLINE const Double& velocity(size_t iDim) const { return all(iDim+1); }
FORCEINLINE const Double* velocity() const { return &velocity(0); }
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also trying to address the magic indexes (iDim+1) by defining specific types for primitive and conservative variables rather that using raw arrays.
Initially I thought of making temperature, pressure, etc. references pointing into the correct position of the storage vector (all) but unfortunately that disables compiler generated constructors... so they are functions.

Comment on lines 551 to 558
/*!
* \brief Get a SIMD gather iterator to the inner dimension of the container.
*/
template<size_t nCols, class T, size_t N>
FORCEINLINE CInnerIterGather<simd::Array<T,N> > innerIter(simd::Array<T,N> row) const noexcept
{
return CInnerIterGather<simd::Array<T,N> >(m_data, IsRowMajor? 1 : this->rows(), IsRowMajor? row*nCols : row);
}
Copy link
Member Author

@pcarruscag pcarruscag Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only places where the "vector nature" of the SIMD type needs to be handled explicitly are when retrieving data from containers (C2DContainer and CVectorOfMatrix) and when putting it back in other containers (CSysVector and CSysMatrix).
Data is retrieved via iterators (see #789 for the reason, spoiler alert, it's for performance).

EDIT: Or via methods that copy the data into a static container of simd types, the iterator approach only performs well if the target CPU has good performance gather instructions (the majority don't).

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp Outdated Show resolved Hide resolved
Copy link
Member Author

@pcarruscag pcarruscag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in #789 I mentioned that I want these new numerics to be thread-safe and have minimal virtual overhead.
For that I'm using CRTP (the curiously recurring template pattern) to have static polymorphism, the mechanics are explained below.

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp Outdated Show resolved Hide resolved
SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp Outdated Show resolved Hide resolved
SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp Outdated Show resolved Hide resolved
Comment on lines 532 to 538
const CVariable& solution_,
UpdateType updateType,
CSysVector<su2double>& vector,
CSysMatrix<su2mixedfloat>& matrix) const final {

const bool implicit = (config.GetKind_TimeIntScheme() == EULER_IMPLICIT);
const auto& solution = static_cast<const CEulerVariable&>(solution_);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another significant change from the current numerics is that (again for thread safety) we do not "set" anything into them, we pass them the whole variable structure (the solver nodes) and the numerics can read (and only read) any data it needs.
Similarly to what is done in the solvers, since we know the type of variable that goes with the numeric scheme we can static_cast to the derived type to avoid polymorphism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see Base Roe flux is working with CEulerVariable. Given it's a inviscid flux considering density, momentums and energy, I suppose this is something that will work for all compressible solvers (compressible euler, NS and RANS) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will work since CEulerVariable is used in all of those in one way or another, i.e. CNSVariable is also a CEulerVariable with extra methods.

Copy link
Contributor

@CatarinaGarbacz CatarinaGarbacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcarruscag , this significant contribution on vectorization seems very valuable as one more initiative to speed up (and tidy up) the code. It's great to see SU2 evolving to make use of handy programming and C++ performance strategies.

I am not an expert in this type of implementation, but overall it seems to be implemented in a neat and advanced way, making good use of C++ tools, functions and user defined types.

Comment on lines 532 to 538
const CVariable& solution_,
UpdateType updateType,
CSysVector<su2double>& vector,
CSysMatrix<su2mixedfloat>& matrix) const final {

const bool implicit = (config.GetKind_TimeIntScheme() == EULER_IMPLICIT);
const auto& solution = static_cast<const CEulerVariable&>(solution_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see Base Roe flux is working with CEulerVariable. Given it's a inviscid flux considering density, momentums and energy, I suppose this is something that will work for all compressible solvers (compressible euler, NS and RANS) ?

SU2_CFD/include/numerics_simd/CNumericsSIMD.hpp Outdated Show resolved Hide resolved
SU2_CFD/src/solvers/CEulerSolver.cpp Show resolved Hide resolved
@pcarruscag
Copy link
Member Author

Thank you for the review Catarina, based on your comments I will try to explain the new structure better.

The interface

These numerics classes are still abstract and so the outside world only needs to known about their interface:

class CNumericsSIMD {
public:
  virtual void ComputeFlux(...) const = 0;
};

Any class that wants to be a simd numerics needs to inherit from this base and in so doing is forced to implement ComputeFlux (because it is pure virtual).

Static decorator pattern

Now, in this structure we don't have convective and viscous classes separate. Instead, we have convective schemes that can be "decorated" with viscous fluxes. For example:

template<int nDim>
class CCompressibleViscousFlux : public CNumericsSIMD {
protected:
  void viscousTerms(...) const {...}
};

Here note that this class template is derived from CNumericsSIMD but it does not implement ComputeFlux, thus it cannot be instantiated by itself. The idea is that convective schemes can use these viscous fluxes as base class (thereby linking them to CNumericsSIMD) to access the viscousTerms method (when we don't want viscous terms we just use a dummy viscous class).
Note also the template parameter nDim, this is because we want to create specific versions of the numerical schemes for 2D and 3D (we "want" this because it allows static allocation and unrolling loops perfectly).

Then convective classes also need to be class templates so that we can programatically change their base class:

template<class Base>
class MyConvectiveScheme : public Base {
public:
  void ComputeFlux(...) const {
    // do my own thing
    Base::viscousTerms(...);
    // update the linear system with the result
  }
};

Now to create an instance of this class template we do for example:

auto obj = new MyConvectiveScheme< CCompressibleViscousFlux<2> >(...);

which would create an object for 2D problems with viscous terms.

And so we need at least 4 instantiations of these class templates, 2D/3D with or without viscous terms, and this is done in the factory method implemented in CNumericsSIMD.cpp, which is the only cpp in this entire implementation.

Static polymorphism

Another concept used in this implementation for efficiency is static polymorphism.
For example in the non vectorized numerics we have a family of Roe schemes since a lot of code is shared, and the only difference is how the dissipation terms are computed. There this is done with virtual functions, here we want none of that.
Virtual functions allow a parent class to dynamically call methods of its children, we want to do this statically and so we need to let the parent class know who is deriving from it.

template<class Derived>
class Parent {
public:
  void parentMethod() {
    // "I know I am also a Derived and so I can cast myself."
    auto derived = static_cast<Derived*>(this);
    // "now I can use a method of derived"
    derived->childMethod();
  }
};

// A derived class needs to inform the parent about itself
class Child : public Parent<Child> {
public:
  void childMethod() {...}
};

Why is this better? Note that 2 of these derived classes don't actually have the same parent, i.e. one inherits from Parent<ChildA> the other from Parent<ChildB> this means that 2 versions of Parent were instantiated specifically for each derived class, this allows code to be inlined and optimized for each, an ability lost with virtual functions.

Putting it all together

For vectorized central schemes we have something like:

// Intermediate class for centered schemes, note the 2 template parameters
template<class Derived, class Base>
class CCenteredBase : public Base {
public:
  // Main public method implemented here making use of "Derived" and "Base".
  void ComputeFlux(...) const final {
    ... // gather data, do some computation
    derived->DissipationTerms(...); // static polymorphism
    Base::ViscousTerms(...); // static decorator
    ... // update linear system
  }
}

// A final centered scheme, which is what we instantiate, with some viscous decorator.
template<class Decorator>
class CJSTScheme : public CCenteredBase<CJSTScheme<Decorator>, Decorator> {
protected:
  void DissipationTerms(...) const {...}
}

@pcarruscag pcarruscag mentioned this pull request Sep 27, 2020
4 tasks
@pcarruscag pcarruscag merged commit bc39e7e into develop Sep 30, 2020
@pcarruscag pcarruscag deleted the feature_simd_numerics branch September 30, 2020 14:04
TobiKattmann added a commit that referenced this pull request Oct 20, 2020
PR#1022 SIMD introduced some minor difference in my(!) cht reg tests.
8184779..4b9f2a8x contains #1022 & #1080 (only 5 lines).
The change is in solid only and only affects the 2D case. Not the 3D.
I dont know what specifically introduced the changes, but as they
are small I for now assume that it is just a little numeric change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants