SIMD Integration #502

Open
sebcrozet opened this Issue Dec 11, 2018 · 4 comments

sebcrozet commented Dec 11, 2018

This is a revival of #27 and #217 and a follow-up to that comment. We need to figure out what we can do to use explicit SIMD in order to improve the overall performance of nalgebra.

Once rusty-machine is integrated into nalgebra (#498), we could use it for benchmarking to see if our optimizations have any impact on some real-world machine-learning operations.

We should also keep in mind the ffsvm crate as an example of an application requiring peak performance for some linalg operations, and the simd_aligned crate as an example of an interesting design for a SIMD-friendly data structure backing the storage of a matrix or vector.

Here are some tasks we should start with to get some measurements that will serve as references for our optimizations. This will be useful to guide our work:

  • Set up a benchmark for various machine-learning operations from rusty-machine.
  • Identify which operations are the bottleneck on our benchmarks.
  • List the set of linalg operations from ffsvm that require peak performance.
  • Decide which crate we want to use for SIMD manipulation: faster, SIMDeez, std::simd, packed_simd?
ralfbiedert commented Dec 11, 2018

For ffsvm (and SVMs in general, I would argue) most of the CPU time is spent inside one loop inside the kernel. I don't recall the exact numbers, but for our actual production model I think it was in the ballpark of 80% for this code:

// Hot loop: RBF kernel between each support vector and the feature vector.
for (i, sv) in vectors.row_iter().enumerate() {
    let mut sum = f32s::splat(0.0);
    let feature: &[f32s] = &feature;

    // Accumulate the squared distance, one SIMD lane-group at a time.
    for (a, b) in sv.iter().zip(feature) {
        sum += (*a - *b) * (*a - *b);
    }

    // Horizontal sum of the lanes, then the RBF exponential.
    output[i] = f64::from((-self.gamma * sum.sum()).exp());
}

Here vectors is a "list of vectors", and each vector (sv and feature) is a [f32x8] (or similar). Each f32x8 is a packed_simd (std::simd) packed float type that is guaranteed to be properly aligned for SIMD operations, and has overloaded operators (+, *, .sum()) that map to native SIMD intrinsics when available.
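For readers without a nightly toolchain, here is a self-contained scalar sketch of that kernel. All names (rbf_kernel, LANES) are hypothetical; [f32; 8] arrays stand in for f32x8, with the per-lane arithmetic written out instead of SIMD intrinsics:

```rust
const LANES: usize = 8; // emulates the width of an f32x8

// Hypothetical sketch: `vectors` is a list of support vectors, each stored
// as chunks of LANES floats, mirroring the [f32x8] layout described above.
fn rbf_kernel(
    vectors: &[Vec<[f32; LANES]>],
    feature: &[[f32; LANES]],
    gamma: f32,
    output: &mut [f64],
) {
    for (i, sv) in vectors.iter().enumerate() {
        let mut sum = [0.0f32; LANES]; // splat(0.0)
        for (a, b) in sv.iter().zip(feature) {
            for l in 0..LANES {
                let d = a[l] - b[l];
                sum[l] += d * d; // (*a - *b) * (*a - *b), lane by lane
            }
        }
        let total: f32 = sum.iter().sum(); // .sum() horizontal reduction
        output[i] = f64::from((-gamma * total).exp());
    }
}

fn main() {
    // One support vector equal to the feature vector: squared distance is 0,
    // so the kernel value is exp(0) = 1.
    let sv = vec![vec![[1.0f32; LANES]]];
    let feature = vec![[1.0f32; LANES]];
    let mut out = [0.0f64; 1];
    rbf_kernel(&sv, &feature, 0.5, &mut out);
    assert!((out[0] - 1.0).abs() < 1e-12);
}
```

A real SIMD implementation would replace the inner `for l in 0..LANES` loop with a single vector subtract, multiply, and add.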

As a developer, what I would need for ffsvm is roughly:

  • The ability to create vectors of arbitrary length (decided at runtime based on a loaded model)
  • These vectors being aligned for f32s (f32x4, f32x8, ...) operations
  • Having "basic SIMD math" support for vector operations (+, -, *, .sum(), ...)
  • Using SIMD at least on x86 and ARM (and later WebAssembly, once the time comes)
  • Having the ability to load / set values float-by-float, as they might come from a weird model format
  • Having all (SIMD'ed or not) vector and matrix operations be "cache aware". That is, having data structures laid out so that iteration is predictable for the CPU and the amount of cache waste is minimal. Vectors probably do that naturally; for matrices, row/column ordering is important.
  • (Edit – compare comment below): Being able to control allocation (e.g., by operating in-place on a pre-allocated vector / matrix, or having nalgebra provide allocation-free SIMDified kernel operations)
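The first, second, and fifth points above can be sketched as a small wrapper type. Everything here (SimdVec and its methods) is hypothetical, and true 32-byte alignment for f32x8 loads is omitted; only the runtime length, lane-multiple padding, and float-by-float stores are shown:

```rust
const LANES: usize = 8; // width of a hypothetical f32x8

/// Hypothetical SIMD-friendly vector: length decided at runtime, storage
/// rounded up to a multiple of LANES so chunked (SIMD-style) iteration
/// never needs a scalar tail.
struct SimdVec {
    data: Vec<f32>, // zero-padded up to a LANES multiple
    len: usize,     // logical length
}

impl SimdVec {
    fn new(len: usize) -> Self {
        let padded = (len + LANES - 1) / LANES * LANES;
        SimdVec { data: vec![0.0; padded], len }
    }

    /// Float-by-float stores, e.g. when parsing a weird model format.
    fn set(&mut self, i: usize, v: f32) {
        assert!(i < self.len);
        self.data[i] = v;
    }

    /// Iterate in LANES-sized chunks, as a SIMD kernel would.
    fn chunks(&self) -> impl Iterator<Item = &[f32]> {
        self.data.chunks_exact(LANES)
    }
}

fn main() {
    let mut v = SimdVec::new(10); // runtime length 10, padded to 16
    for i in 0..10 {
        v.set(i, i as f32);
    }
    assert_eq!(v.data.len(), 16);
    let total: f32 = v.chunks().flat_map(|c| c.iter()).sum();
    assert_eq!(total, 45.0); // 0 + 1 + ... + 9; the zero padding adds nothing
}
```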

I think my benchmark baseline would be the equivalent of

    for (a, b) in sv.iter().zip(feature) {
        sum += (*a - *b) * (*a - *b);
    }

implemented in nalgebra to have the same performance as if implemented manually via std::simd.

Edit – I just tried to implement a naive benchmark and realized it's not that simple, since I'd have to deal with DVectors, which seem to be heap-allocated. That means all temporaries would also be heap-allocated. While it's fine to allocate ahead of time, there probably can't be any allocation in the hot loop.
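The benchmark pair described above might look like the following sketch (hypothetical function names; the chunked version stands in for the std::simd implementation, and a real benchmark would time both over pre-allocated inputs rather than just check agreement):

```rust
// Scalar baseline: sum of squared differences.
fn sq_dist_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Chunked version: one partial sum per lane, mirroring what a SIMD
// implementation does, plus a scalar tail for leftover elements.
fn sq_dist_chunked(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    for (ca, cb) in a.chunks_exact(LANES).zip(b.chunks_exact(LANES)) {
        for l in 0..LANES {
            let d = ca[l] - cb[l];
            acc[l] += d * d;
        }
    }
    // Scalar tail for lengths that are not a LANES multiple.
    let tail = a.len() / LANES * LANES;
    acc.iter().sum::<f32>() + sq_dist_scalar(&a[tail..], &b[tail..])
}

fn main() {
    let a: Vec<f32> = (0..20).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..20).map(|i| (i * 2) as f32).collect();
    // Both paths must agree before comparing their speed.
    assert_eq!(sq_dist_scalar(&a, &b), sq_dist_chunked(&a, &b));
}
```

Note that neither function allocates, which matters given the concern about heap-allocated temporaries in the hot loop.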

ralfbiedert commented Dec 11, 2018

Decide which crate we want to use for SIMD manipulation: faster, SIMDeez, std::simd, packed_simd?

Some thoughts:

  • faster is awesome when working with "random" arrays. However, there might be a bit of overhead when working with small vectors, and I don't know what its latest non-x86 support is like.
  • SIMDeez I haven't tried, unfortunately.
  • std::simd and packed_simd are the same thing, I believe: std::simd is discussed in a Rust RFC, while packed_simd is the repo / implementation of that RFC. From what I understood, std::simd is meant to be Rust's safe, cross-platform SIMD middle layer. I think it was even part of nightly for a brief while, but I haven't heard any status update lately.

If std::simd is still a thing, I would probably investigate that one first, as it might come with cross-platform compatibility for free.
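To make the packed_simd / std::simd style of API concrete, here is a toy stand-in that compiles on stable Rust. The type F32x4 is hypothetical; it mimics the surface (splat, overloaded operators, horizontal sum) that the real f32x4 maps to hardware intrinsics:

```rust
use std::ops::{Add, Mul, Sub};

// Toy stand-in for packed_simd's f32x4: same API surface, but implemented
// with plain scalar lanes instead of SIMD intrinsics.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl F32x4 {
    fn splat(v: f32) -> Self {
        F32x4([v; 4])
    }
    /// Horizontal sum of all lanes.
    fn sum(self) -> f32 {
        self.0.iter().sum()
    }
}

impl Add for F32x4 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for l in 0..4 { out[l] += rhs.0[l]; }
        F32x4(out)
    }
}

impl Sub for F32x4 {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self {
        let mut out = self.0;
        for l in 0..4 { out[l] -= rhs.0[l]; }
        F32x4(out)
    }
}

impl Mul for F32x4 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut out = self.0;
        for l in 0..4 { out[l] *= rhs.0[l]; }
        F32x4(out)
    }
}

fn main() {
    // Same shape as the ffsvm kernel body: (a - b) * (a - b), then .sum().
    let a = F32x4([1.0, 2.0, 3.0, 4.0]);
    let b = F32x4::splat(1.0);
    let d = a - b;
    assert_eq!((d * d).sum(), 14.0); // 0 + 1 + 4 + 9
}
```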


jswrenn commented Dec 12, 2018

A thought: nalgebra's matrix types aren't always backed by totally contiguous storage. There aren't any gaps between elements if the Matrix is backed by a VecStorage or ArrayStorage, but there can be gaps if it's backed by a SliceStorage. I suspect this will be a barrier to a straightforward drop-in of SIMD to accelerate matrix operations. We could get around this by lowering the actual implementations of the operations we want to accelerate to the Storage level; that way we could provide specialized implementations for each of the three storage types.
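Lowering to the storage level might look like the following sketch. This is not nalgebra's actual Storage trait; the trait, types, and method names here are all hypothetical, and only illustrate dispatching between a contiguous fast path and a strided fallback:

```rust
// Hypothetical storage abstraction: an operation lowered to this level can
// pick a fast path for contiguous data and fall back to strided access.
trait Storage {
    fn contiguous(&self) -> Option<&[f32]>; // Some(..) iff there are no gaps
    fn get(&self, i: usize) -> f32;
    fn len(&self) -> usize;
}

struct VecStorage(Vec<f32>);

struct SliceStorage<'a> {
    data: &'a [f32],
    stride: usize,
    len: usize,
}

impl Storage for VecStorage {
    fn contiguous(&self) -> Option<&[f32]> { Some(&self.0) }
    fn get(&self, i: usize) -> f32 { self.0[i] }
    fn len(&self) -> usize { self.0.len() }
}

impl<'a> Storage for SliceStorage<'a> {
    fn contiguous(&self) -> Option<&[f32]> { None } // gaps between elements
    fn get(&self, i: usize) -> f32 { self.data[i * self.stride] }
    fn len(&self) -> usize { self.len }
}

fn sum<S: Storage>(s: &S) -> f32 {
    match s.contiguous() {
        // Fast path: chunked iteration, where a real implementation
        // would use SIMD loads.
        Some(slice) => slice.chunks(4).map(|c| c.iter().sum::<f32>()).sum(),
        // Slow path: strided, element-by-element access.
        None => (0..s.len()).map(|i| s.get(i)).sum(),
    }
}

fn main() {
    let v = VecStorage(vec![1.0, 2.0, 3.0, 4.0]);
    // A strided view picking every other element of a backing buffer.
    let backing = [1.0, 9.0, 2.0, 9.0, 3.0, 9.0, 4.0, 9.0];
    let s = SliceStorage { data: &backing, stride: 2, len: 4 };
    assert_eq!(sum(&v), 10.0);
    assert_eq!(sum(&s), 10.0); // same logical elements, different layout
}
```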

@ybyygu referenced this issue Dec 20, 2018: SIMD support #5 (open)

HadrienG2 commented Jan 11, 2019

@jswrenn Note that non-contiguous storage can also be purposely used to improve SIMD performance on small matrices. For example, if you pad matrix rows to a multiple of the SIMD vector length, and are careful not to introduce NaNs and other IEEE 754 silliness in the padding, you can implement a very fast small-matrix multiplication.

In general, SIMD is very sensitive to data layout, which makes abstraction design harder.
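The row-padding idea above can be sketched as follows (hypothetical helpers; zero padding is used so dot products over padded rows stay exact, since the extra zero lanes contribute nothing):

```rust
const LANES: usize = 4; // assume f32x4-sized lanes

/// Pad a matrix row with zeros up to a LANES multiple. Zeros (rather than
/// uninitialized memory) avoid the NaN / IEEE 754 pitfalls mentioned above.
fn pad_row(row: &[f32]) -> Vec<f32> {
    let padded = (row.len() + LANES - 1) / LANES * LANES;
    let mut out = vec![0.0; padded];
    out[..row.len()].copy_from_slice(row);
    out
}

/// Dot product over LANES-sized chunks: because rows are padded, there is
/// never a scalar tail, which is what makes the SIMD loop simple and fast.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    a.chunks_exact(LANES)
        .zip(b.chunks_exact(LANES))
        .map(|(ca, cb)| ca.iter().zip(cb).map(|(x, y)| x * y).sum::<f32>())
        .sum()
}

fn main() {
    // A 3-element row padded to 4: the trailing zero contributes nothing.
    let a = pad_row(&[1.0, 2.0, 3.0]);
    let b = pad_row(&[4.0, 5.0, 6.0]);
    assert_eq!(dot_chunked(&a, &b), 32.0); // 4 + 10 + 18
}
```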
