Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing gather/scatter #4

Open
kfjahnke opened this issue Oct 18, 2019 · 10 comments
Open

missing gather/scatter #4

kfjahnke opened this issue Oct 18, 2019 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@kfjahnke
Copy link

Going through the 'Working Draft, C++Extensions forParallelism Version 2', I could not find any reference to gather/scatter functions. How is this supposed to be done? Additionally, I think that strided gather/scatter, where the data are equidistant in memory, would also be desirable.

@mattkretz
Copy link
Member

mattkretz commented Oct 18, 2019

Correct. It's not there and not proposed yet. Note that I just posted a request for feedback on the Parallelism TS 2. You can submit feedback at https://github.com/mattkretz/std-simd-feedback/ which I will collect into a paper for (one of) the next meeting(s).

Anyway, my plan is to finish the non-member subscript paper (I'd take contributions 😉). See what EWG thinks of it and then decide how to design the gather/scatter interface.

However, we could start earlier, since low-level gather/scatter (i.e. without using nice subscript interfaces) will be needed as well.

At this point, one needs a non-std extension to simd to make gather/scatter work. Which is totally doable with my implementation, since you can cast simd objects to [[gnu::vector_size(N)]] or SIMD intrinsics (and back).

@mattkretz mattkretz added the enhancement New feature or request label Oct 18, 2019
@kfjahnke
Copy link
Author

kfjahnke commented Oct 19, 2019

Anyway, my plan is to finish the non-member subscript paper

I can't get your papers to build here. Maybe you have a readymade copy of the relevant paper somewhere?

However, we could start earlier

I felt that the standard's constructor for simd using a generator function suggests code like this:

template < typename T , typename A >
struct gs_simd // class simd with gather/scatter
: public std::experimental::simd < T , A >
{
  typedef std::experimental::simd < T , A > base_t ;

  // perfect-forward any arguments to the base class c'tor.
  template < typename ... types >
  gs_simd ( types && ... args )
  : base_t ( std::forward < types > ( args ) ... )
  { } ;

  template < typename indexable_t ,
             typename index_t >
  void gather ( const indexable_t & src ,
                const index_t & indexes )
  {
    // assign base_t object created by gen-type c'tor
    *this = base_t ( [&] ( const size_t & i )
                         { return src [ indexes[i] ] ; } ) ;
  }

  template < typename indexable_t ,
             typename index_t >
  void scatter ( indexable_t & trg ,
                 const index_t & indexes )
  {
    // gen-type c'tor is only used for side effects; let the compiler
    // figure out that the result is unused
    base_t ( [&] ( const size_t & i )
                 { return trg [ indexes[i] ] = (*this)[i] ; } ) ;
  }
} ;

Mileage does of course vary with the compiler and ISA used, but it would be a start. It works for all simd objects, and gives a reference implementation for doublechecking.
I think it's best to start out with a perfectly general and simple solution and specialize it to speed up the binary where such opportunites can be detected - with c++17 the use of 'if constexpr' makes this much simpler than the hard-to-grasp use of enable_if in earlier standards.

@kfjahnke
Copy link
Author

kfjahnke commented Oct 19, 2019

BTW, coding the g/s this way also encompasses 'regular' g/s (from/to strided memory), just by passing a suitable object as 'indexes':

template < size_t S >
struct strides_t
{
  size_t operator[] ( const size_t & i ) const
  {
    return S * i ;
  }
} ;

@mattkretz
Copy link
Member

I just added a link from the README to the paper: https://web-docs.gsi.de/~mkretz/DNMSO.pdf.

And yes, I agree that also allowing a type like std::strided<4> in place of a simd<int> would be helpful.

Regarding the generator ctor. Functionally, this ctor will plug any missing-feature hole. In terms of optimization (which is the reason for simd to exist, of course) it's typically less than stellar. I could write so many missed optimization reports...

In terms of low-level API I'm confident with https://github.com/VcDevel/Vc/blob/1.4/Vc/common/gatherinterface.h and https://github.com/VcDevel/Vc/blob/1.4/Vc/common/scatterinterface.h to work. Though the optional Scale template parameter is still missing on most overloads.

The GatherArguments type basically represents a gather expression you would express via container[simd_int_obj]. The subscript itself cannot express the gather since it could also be the LHS of an assignment (i.e. scatter), the number of elements on the other side of the assignment expression might be different, or the element type could be different.

Anyway, I'll start on a paper ASAP. I'd be happy if you want to co-author.

@mattkretz
Copy link
Member

Oh, and of course, I need to work on an implementation to go with the paper...

@kfjahnke
Copy link
Author

And yes, I agree that also allowing a type like std::strided<4> in place of a simd would be helpful.

I meant it the other way round: pass strides_t<4> as 'indexes'.

Regarding the generator ctor. Functionally, this ctor will plug any missing-feature hole. In terms of optimization (which is the reason for simd to exist, of course) it's typically less than stellar. I could write so many missed optimization reports...

It does have great conceptual appeal, though, because it keeps the interface really small and simple and burdens the search for optimizations to 'further inside'. I did a lot of playing around with self-made 'pseudo-vector' types (you can look at my code here) and for the 'basics' (like, all of vspline) the resulting code came out just about as fast with Vc as without - nowadays compilers are very clever indeed (I use clang++, which does a better job for me than g++)

Anyway, I'll start on a paper ASAP. I'd be happy if you want to co-author.

Thanks for the offer. I won't have time for a couple of weeks now. I'll get in touch when I'm back 'on it'.

@mattkretz
Copy link
Member

And yes, I agree that also allowing a type like std::strided<4> in place of a simd would be helpful.

I meant it the other way round: pass strides_t<4> as 'indexes'.

That's also what I meant (or I'm still completely misunderstanding what you're saying). Basically indexes would be of type simd<int> in most cases. So:

void f(std::array<float, N>& data, simd<int> indexes) {
  auto x = simd<float>::gather(data.data(), indexes); // data[indexes[{0,1,2,3,...}]]
  auto y = simd<float>::gather(data.data(), std::strided<4>()); // data[{0,4,8,12,...}]
  ...
}

Regarding the generator ctor. Functionally, this ctor will plug any missing-feature hole. In terms of optimization (which is the reason for simd to exist, of course) it's typically less than stellar. I could write so many missed optimization reports...

It does have great conceptual appeal, though, because it keeps the interface really small and simple and burdens the search for optimizations to 'further inside'. I did a lot of playing around with self-made 'pseudo-vector' types (you can look at my code here) and for the 'basics' (like, all of vspline) the resulting code came out just about as fast with Vc as without - nowadays compilers are very clever indeed (I use clang++, which does a better job for me than g++)

I totally agree. And I won't stop anyone from using the generator ctor for such things. That's what it's meant to be used for. I failed to mention it here, but at least as important is the syntax issue.

histogram[index_vec] += 1;

is nicer than

simd<int>([&](auto i) {
  histogram[index_vec[i]] += 1;
});

Also, the generator callable is one of those weird code regions where exceptions, I/O, and locking mutexes may be UB. (just like for unseq parallel algorithms).

@kfjahnke
Copy link
Author

kfjahnke commented Nov 7, 2019

Sorry for the delay. Now I'm available again.

That's also what I meant (or I'm still completely misunderstanding what you're saying). Basically indexes would be of type simd in most cases.

I misunderstood your comment. So we agree. Yes, indexes would usually be simd or, in a more generalized way, any indexable object yielding integral types with operator[]. The general g/s formulation via the gen ctor covers all that and can serve as reference implementation. Then, by overloading the g/s methods, specific types of index_t objects - like simd - can dispatch to tweaked code, all the way down to directly using intrinsics if the vector sizes match the hardware vector size.

histogram[index_vec] += 1; is nicer than ...

I agree. It's great to have, to write generic code which works for both single and simdized indexes. But this is beyond g/s; it's a full gather/modify/scatter cycle. I think you'd like to avoid making g/s explicit and use subscripting with index vectors exclusively. I like having explict g/s methods, because it makes it easier to opt out by writing a class which can stand as a replacement for std::simd with the same interface. Coding with explicit g/s is more verbose, but the implementation can be very compact.

Initially, when I started using Vc, I just used Vc's API exclusively. After some time I had doubts about Vc's long-term viability and accessability and decided to become able to opt out, by building a class template to emulate the funcionality of Vc which my program was actually using, and at the same time I tried to narrow down my use of vector code to some comfortable minimum to keep the amount of code I had to emulate small. It turned out that using explicit g/s was helpful in doing that. The emulator class template uses small loops which can be picked up by the compiler's loop vectorizer. Next I added an option to use Vc, which delegates some operations to Vc. This way my code can run either way and allow users to avoid using Vc - e.g. because the version their package management offers is too old or they'd have to use Vc from source.

This gave me an idea for the task at hand: since we already have Vc with all of it's well-thought-out g/s functionality, how about delegating to Vc to have some 'real' g/s code to compare with g/s via the gen c'tor? Like:

#ifdef USE_VC

  using base_t::size ;

  template < typename indexable_t ,
             typename index_t >
  void gather ( const indexable_t & src ,
                const index_t & indexes )
  {
    typedef Vc::SimdArray < T , size() > vc_t ;
    // assuming simd and SimdArray are compatible
    vc_t & target ( * ( reinterpret_cast < vc_t * > ( this ) ) ) ; 
    target.gather ( src , indexes ) ;
  }

  template < typename indexable_t ,
             typename index_t >
  void scatter ( indexable_t & trg ,
                 const index_t & indexes ) const
  {
    typedef Vc::SimdArray < T , size() > vc_t ;
    // assuming simd and SimdArray are compatible
    const vc_t & src ( * ( reinterpret_cast < const vc_t * > ( this ) ) ) ; 
    src.scatter ( trg , indexes ) ;
  }
  
#else

// ...

This would provide a test-bed with minimal effort, which could be used to make the case for making g/s part of the standard (if actual test code does support the notion that this is faster). If these preliminary tests suggest it's a good idea, the Vc code can be pulled into your std::simd implementation.

@kfjahnke
Copy link
Author

A year has passed, so I thought I'd get back to this issue again. Just to remind you where I'm coming from, my library vspline requires efficient gather/scatter operations and (de)interleaving via load/shuffle/store. I currently use Vc 1.4.1 for the purpose. I'd like to eventually move to std::simd, but the standard does not address these issues.
Do you have any plans to add this functionality to your std::simd implementation? Or can you point me to other code providing it?

@kfjahnke
Copy link
Author

kfjahnke commented Dec 13, 2020

Again I'll post to this issue, following up our recent exchange on the Vc issue tracker. I took your progress with std::simd development as an encouragement to try and port pv, my image viewer (https://bitbucket.org/kfj/pv) to use std::simd instead of Vc. This has succeeded, using a wrapper class to produce a data type which internally is a std::simd (by private inheritance) and yet exposes an interface which is similar to Vc's SimdArray, to make it usable throughout pv's and vspline's code base. So now I can offer a large-ish, complex application, using your std::simd implementation extensively, to play with. I set up a new branch 'stdsimd' in the pv repo, the implementation of my wrapper class is here:

https://bitbucket.org/kfj/pv/src/stdsimd/vspline/simd_type.h

This class may be of interest to other Vc users wanting to port their code to std::simd, because it's interface covers a good part of Vc::SimdArray functionality.

Performance of a pv binary using this code, clang++ 10.0.0 and -Ofast, is about 30% worse than the Vc version. I tried 'slotting in' Vc's gather and scatter code by reinterpret_casting the std::simd types to their equivalent Vc types (this does indeed work), but this slowed things down. My code to employ where_expressions is probably suboptimal. I suspect that use of an equivalent of Vc's interleaved_memory_mapper may improve performance (by less than 10%). Another issue might be the vectorization of standard functions, which may still be better in Vc (especially the transcendental ones, which are used a lot in pv).

So this is encouraging, but it's not quite there yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants