
[C++] Undefined behaviour in generated code #506

Closed
rigtorp opened this issue Aug 30, 2017 · 21 comments

@rigtorp
Contributor

rigtorp commented Aug 30, 2017

The C++ code generator typically generates code like this for accessing integer fields:

    std::int32_t securityID(void) const
    {
        return SBE_LITTLE_ENDIAN_ENCODE_32(*((std::int32_t *)(m_buffer + m_offset + 24)));
    }

    NewOrderSingle &securityID(const std::int32_t value)
    {
        *((std::int32_t *)(m_buffer + m_offset + 24)) = SBE_LITTLE_ENDIAN_ENCODE_32(value);
        return *this;
    }

The compiler assumes that a pointer to an int32_t has the correct alignment. With this generated code the alignment requirement might not be fulfilled. On amd64 this is fine as long as the compiler doesn't try to use SSE or AVX instructions, but it's not safe in general to assume it will work.

The solution is to use memcpy like this (https://chromium.googlesource.com/chromium/src.git/+/master/base/bit_cast.h):

template <class Dest, class Source>
inline Dest bit_cast(const Source& source) {
  static_assert(sizeof(Dest) == sizeof(Source),
                "bit_cast requires source and destination to be the same size");
  static_assert(base::is_trivially_copyable<Dest>::value,
                "bit_cast requires the destination type to be copyable");
  static_assert(base::is_trivially_copyable<Source>::value,
                "bit_cast requires the source type to be copyable");

  Dest dest;
  memcpy(&dest, &source, sizeof(dest));
  return dest;
}

This should optimize to a single load on amd64, guaranteed not to use instructions requiring alignment.
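As a hedged illustration (not from the SBE codebase): the bit_cast pattern above, with std::is_trivially_copyable standing in for Chromium's base::is_trivially_copyable, can be exercised to reinterpret a float's bit pattern without any pointer casts:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Sketch of the Chromium-style bit_cast, using the standard library
// trait instead of base::is_trivially_copyable.
template <class Dest, class Source>
inline Dest bit_cast(const Source& source) {
  static_assert(sizeof(Dest) == sizeof(Source),
                "bit_cast requires source and destination to be the same size");
  static_assert(std::is_trivially_copyable<Dest>::value,
                "bit_cast requires the destination type to be copyable");
  static_assert(std::is_trivially_copyable<Source>::value,
                "bit_cast requires the source type to be copyable");
  Dest dest;
  std::memcpy(&dest, &source, sizeof(dest));
  return dest;
}
```

For example, `bit_cast<std::uint32_t>(1.0f)` yields the IEEE 754 bit pattern 0x3F800000 without invoking undefined behaviour.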

@mjpt777
Contributor

mjpt777 commented Aug 30, 2017

The schema, as per the SBE specification, can use the offset attribute for aligning fields when targeting platforms which do not support unaligned access.
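For illustration, a hypothetical schema fragment (field names and ids invented, not from any real schema) using the offset attribute to place securityID on a 4-byte boundary might look like:

```xml
<sbe:message name="NewOrderSingle" id="1">
    <!-- offsets chosen so each field starts on a multiple of its own size -->
    <field name="clOrdId"    id="11" type="uint64" offset="0"/>
    <field name="securityID" id="48" type="int32"  offset="24"/>
</sbe:message>
```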

@rigtorp
Contributor Author

rigtorp commented Aug 30, 2017

Currently our unit tests do not pass due to this issue and we had to disable the alignment check in UBSan. By using the memcpy trick we can inform the compiler that the load is unaligned and make sure it will not use instructions that require alignment. On amd64 the resulting binary should be the same, except the code is guaranteed not to break if some new optimization pass is introduced that would generate SSE or AVX instructions for this code.

@mjpt777
Contributor

mjpt777 commented Aug 30, 2017

Have you used the offset attribute to align your fields?

@rigtorp
Contributor Author

rigtorp commented Aug 30, 2017

I don't control the specification, I'm using what is provided by the exchange etc. But that is not the problem.

The problem is that the generated code is relying on undefined behaviour. The compiler will generate correct instructions on amd64 most of the time. By using memcpy we can make sure it will generate correct instructions all the time.

@tmontgomery
Contributor

Would love to see a PR that does this for gets and sets. But it would need to be optional so that the original intent (of using offset for guaranteeing aligned fields) would be preserved.

I'll put it on my list to get to. But may be a while. Lots of stuff ahead of it.

@myrgy

myrgy commented Aug 30, 2017

I think in this case it could be

    std::int32_t securityID(void) const
    {
        std::int32_t val;
        memcpy(&val, m_buffer + m_offset + 24, sizeof(val));
        return SBE_LITTLE_ENDIAN_ENCODE_32(val);
    }

    void setSecurityID(int32_t val)
    {
        val = SBE_LITTLE_ENDIAN_ENCODE_32(val);
        memcpy(m_buffer + m_offset + 24,  &val, sizeof(val)); 
    }

btw, what is the reason for using (void) parameter lists in C++ code? It's old C style; for a modern compiler it's completely redundant.

@tmontgomery
Contributor

Haven't exterminated all the (void) yet. And we are old C hackers anyway.

@rigtorp
Contributor Author

rigtorp commented Sep 1, 2017

I think @myrgy's code will generate optimal code on amd64 for any alignment. On platforms without unaligned loads it would probably generate correct, but less efficient, code. If loads are guaranteed to be aligned, the current code is better.

What about adding a template like this to the sbe.h support header:

#include <cassert>
#include <cstdint>
#include <cstring>

template <typename T>
// Add enable_if to only work with integer types
// Maybe add force inline attributes for supported compilers
inline T load(const char *buffer) {
#ifdef SBE_ASSUME_ALIGNED
  assert((reinterpret_cast<std::uintptr_t>(buffer) & (alignof(T) - 1)) == 0 &&
         "loads need to be aligned");
  return *reinterpret_cast<const T*>(buffer);
#else
  T val;
  std::memcpy(&val, buffer, sizeof(val));
  return val;
#endif
}
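A store counterpart along the same lines could be sketched as follows (SBE_ASSUME_ALIGNED is the same hypothetical macro as above; this is an illustration, not existing SBE code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of a store<T> mirroring the proposed load<T> helper: plain
// aligned store when alignment is assumed, memcpy otherwise.
template <typename T>
inline void store(char *buffer, T value) {
#ifdef SBE_ASSUME_ALIGNED
  assert((reinterpret_cast<std::uintptr_t>(buffer) & (alignof(T) - 1)) == 0 &&
         "stores need to be aligned");
  *reinterpret_cast<T*>(buffer) = value;
#else
  std::memcpy(buffer, &value, sizeof(value));
#endif
}
```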

@rigtorp
Contributor Author

rigtorp commented Sep 1, 2017

Or maybe we can let people on these esoteric platforms deal with this problem? AMD64 and ARM have efficient loads for unaligned integers, and the memcpy version should generate the correct instructions.

@mjpt777
Contributor

mjpt777 commented Sep 1, 2017

memcpy will have to at least check for alignment so it will always be slower to some extent. It is a difficult choice of how far to go with support with the costs it will incur in development and at runtime.

@myrgy

myrgy commented Sep 1, 2017

@mjpt777 , please take a look at compilation results:
https://godbolt.org/g/XWCYBo

gcc and clang generate almost identical code for memcpy and reinterpret_cast.
Regarding memcpy, take into account that the compiler is aware of the platform's capabilities and can optimize accordingly. In addition, gcc has its own __builtin_memcpy that could be used.

@rigtorp
Contributor Author

rigtorp commented Sep 1, 2017

There is zero cost to using this correct implementation. It also has the benefit of working on all CPUs without crashing. Given that the overhead is zero, I don't see why we should use an implementation which allows the compiler to generate code that will crash my system.

Minimal example:

#include <stdlib.h>
#include <memory.h>
typedef unsigned int u32;

u32 reada32(const void* ptr) { return *(const u32*) ptr; }

u32 readu32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

On amd64 (-std=c++14 -O3):

reada32(void const*):
        mov     eax, DWORD PTR [rdi]
        ret
readu32(void const*):
        mov     eax, DWORD PTR [rdi]
        ret

On arm64 (-std=c++14 -O3):

reada32(void const*):
        ldr     w0, [x0]
        ret
readu32(void const*):
        ldr     w0, [x0]
        ret

On arm (-std=c++14 -O3 -mcpu=cortex-m7 -march=armv6):

reada32(void const*):
        ldr     r0, [r0]
        bx      lr
readu32(void const*):
        ldr     r0, [r0]  @ unaligned
        bx      lr

@mjpt777
Contributor

mjpt777 commented Sep 1, 2017

Thanks, this is interesting to know. BTW, how does memcpy know the address in the buffer is aligned?

@mjpt777
Contributor

mjpt777 commented Sep 1, 2017

I can see, looking at your example, that it is easy for the compiler to determine that what it is copying is aligned. Have you tried it with the address coming from an opaque buffer, as is likely to happen in a real application?

@tmontgomery
Contributor

I do agree that this will mostly be optimized and thus have no cost. However, implying this would always be optimized on all systems, all CPUs, and with all compilers is a strong statement, and one I am pretty sure is false. When adding in that the buffer being used may or may not be aligned, it is guaranteed not to always be performant.

SBE is a performance oriented tool. That means some requirements get pushed back onto the user to make the correct choice.

I am not convinced that memcpy should be the default, given the prevalence of CPUs that can work on unaligned data (x86, ARMv6 and above, etc.) and given that the schema can be aligned when needed. The memcpy method, on ARMv6 and ARMv7, does sometimes defeat some optimizations and add cautious instructions. See this for example.

Although, I think a memcpy version should be enable-able via a param to SbeTool for generation. Having a C++ compile time macro isn't needed for this.

@myrgy

myrgy commented Sep 1, 2017

Yes, it can work, but if the compiler decides to vectorize operations on such data, it will crash the application, because vector registers don't support unaligned access. memcpy informs the compiler that the access might be unaligned, and so prevents such unsafe operations.

Here is an example of a crashing app: https://godbolt.org/g/TbVKee
With -O3, gcc performs automatic vectorization, and it will cause a crash even on platforms which allow unaligned access.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <algorithm>

#define SBE_LITTLE_ENDIAN_ENCODE_16(v) (v)
#define SBE_LITTLE_ENDIAN_ENCODE_32(v) (v)
#define SBE_LITTLE_ENDIAN_ENCODE_64(v) (v)

#define SIZE 32

struct Test{
    char m_buffer [SIZE];
    int m_offset = 5;
/*
    std::int32_t securityID(int i) const
    {
        std::int32_t val;
        memcpy(&val, m_buffer + m_offset + 24 + sizeof(std::int32_t) * i, sizeof(std::int32_t));
        return SBE_LITTLE_ENDIAN_ENCODE_32(val);
    }
//*/
//*
    std::int32_t securityID(int i) const
    {
        return SBE_LITTLE_ENDIAN_ENCODE_32(*((std::int32_t *)(m_buffer + m_offset + 24 + sizeof(std::int32_t) * i)));
    }
    //*/
};

int main() {
    char c;
    Test t1;
    Test t2;
   // std::generate(t1.m_buffer, t1.m_buffer + SIZE, std::rand);
    //std::generate(t2.m_buffer, t2.m_buffer + SIZE, std::rand); 

    std::int32_t r[SIZE];
    for(int i= 0; i < SIZE; i++) {
       r[i]   = t1.securityID(i)   + t2.securityID(i);
    }

    for (int i = 0; i < SIZE; ++i) {
        std::cout << r[i];
    }
    return 0;
}

@rigtorp
Contributor Author

rigtorp commented Sep 2, 2017

@mjpt777 The whole point of introducing the memcpy is to tell the compiler that this load is unaligned. Per the C standard, memcpy has no alignment requirement, and thus the compiler needs to ensure the load is done in a way that assumes no alignment. So memcpy specifically doesn't know about alignment at compile time; its implementation may check alignment at runtime to copy large blocks using vector instructions, but that is irrelevant here.
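To make that concrete, here is a minimal sketch (illustrative, not generated code): the compiler sees only a byte pointer, so it must emit a load that works at any offset:

```cpp
#include <cstdint>
#include <cstring>

// memcpy has no alignment requirement, so this read is well-defined
// even when p is not 4-byte aligned.
inline std::uint32_t read_u32(const char *p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

Writing a value at an odd offset and reading it back works on any platform.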

@tmontgomery I'm saying that using memcpy will cause correct code (sans compiler bugs) to be generated on all platforms. Currently a new optimization pass can break existing code, since SBE is relying on undefined behavior. By default we should not be relying on undefined behavior (i.e. use memcpy).

If someone wants to use this on some old SPARC platform etc. and take advantage of a properly aligned schema, I would say it is up to them to implement it. Currently the code is causing issues for the 99.9% of users not on old esoteric platforms.

Also, for the version without memcpy there should be asserts verifying the alignment requirements when loading and storing values.

@myrgy It actually generated unaligned vector loads in this case.

@myrgy

myrgy commented Sep 2, 2017

@rigtorp, yep, my fault - I saw xmm register usage and didn't check whether the load instructions support unaligned access.

@mjpt777
Contributor

mjpt777 commented Sep 3, 2017

Processors with AVX support allow unaligned access to XMM registers, having relaxed the older rule.

http://www.agner.org/optimize/optimizing_assembly.pdf

However this discussion is interesting. We'll look at how we can generate the memcpy option.

@tmontgomery
Contributor

Discussing it, memcpy should be the default. We use bounds checking as the default and this goes along with that "safe by default" mentality. But we do want to allow the current method to be generated if desired. However, when it is generated, all alignment is up to the app.

tmontgomery added a commit that referenced this issue Nov 29, 2017
…ts so that alignment isn't an issue. For #530 used a union to do byte swap on float and double accesses for single values and arrays.
@tmontgomery
Contributor

Moved to memcpy on access of all fields for safety. Decided not to make this optional as it was just too much of a nightmare.
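For illustration, a memcpy-based little-endian float getter in the spirit of that change might look like this (a sketch only; the actual commit uses a union for the byte swap, and SBE_BIG_ENDIAN is an assumed macro name, not necessarily the real one):

```cpp
#include <cstdint>
#include <cstring>

// Read a little-endian float from an arbitrarily aligned buffer by going
// through the integer representation, so no misaligned pointer is formed
// and any byte swap happens on integer bits rather than float values.
inline float getFloatLe(const char *p) {
  std::uint32_t bits;
  std::memcpy(&bits, p, sizeof(bits));
#if defined(SBE_BIG_ENDIAN)  // assumed macro: swap only on big-endian hosts
  bits = __builtin_bswap32(bits);
#endif
  float result;
  std::memcpy(&result, &bits, sizeof(result));
  return result;
}
```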
