Add ARM Blake2b #845

Closed

noloader opened this issue Jan 22, 2017 · 2 comments

noloader (Contributor) commented Jan 22, 2017

Attached and below is a patch for Blake2b using ARM NEON intrinsics. It's another partial patch, and others will have to complete it.

The code is based on a reference implementation provided by Samuel Neves (@sneves), who is one of the authors of BLAKE2. He provided it in a private email some time ago. It was recently revisited for some benchmarking, so it was a good time to offer a Botan cut-in. Samuel Neves and JP Aumasson should receive credit.

The dev boards used for testing were a BeagleBoard v3 (Cortex-A8), a Banana Pi (Cortex-A7) and a CubieTruck v5 (Cortex-A7). The BeagleBoard was configured with -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard, and the CubieTruck was configured with -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard.

Here are the relative numbers:

  • BeagleBoard, CXX implementation (0.95 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 11.721 MiB/sec (35.164 MiB in 3000.092 ms)
  • BeagleBoard, NEON implementation (0.95 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 31.662 MiB/sec (94.988 MiB in 3000.038 ms)
  • BananaPi, CXX implementation (0.96 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 15.769 MiB/sec (47.309 MiB in 3000.182 ms)
  • BananaPi, NEON implementation (0.96 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 41.872 MiB/sec (125.617 MiB in 3000.044 ms)
  • CubieTruck, CXX implementation (1.7 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 27.119 MiB/sec (81.359 MiB in 3000.123 ms)
  • CubieTruck, NEON implementation (1.7 GHz)
$ ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 70.449 MiB/sec (211.348 MiB in 3000.020 ms)
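
Computed from the numbers above, the NEON speedup is a consistent 2.6 to 2.7×: 31.662/11.721 ≈ 2.70 on the BeagleBoard, 41.872/15.769 ≈ 2.66 on the BananaPi, and 70.449/27.119 ≈ 2.60 on the CubieTruck.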

$ git diff > blake2.diff
$ cat blake2.diff
diff --git a/src/lib/hash/blake2/blake2b.cpp b/src/lib/hash/blake2/blake2b.cpp
index b478af106..b2e75acef 100644
--- a/src/lib/hash/blake2/blake2b.cpp
+++ b/src/lib/hash/blake2/blake2b.cpp
@@ -12,6 +12,19 @@
 #include <botan/rotate.h>
 #include <algorithm>

+// ARM32 and ARM64 Headers
+#if defined(__GNUC__)
+# include <stdint.h>
+#endif
+#if defined(__ARM_NEON)
+# include <arm_neon.h>
+#endif
+#if defined(__GNUC__) && !defined(__apple_build_version__)
+# if defined(__ARM_ACLE) || defined(__ARM_FEATURE_CRC32) || defined(__ARM_FEATURE_CRYPTO)
+#  include <arm_acle.h>
+# endif
+#endif  // ARM32 and ARM64 Headers
+
 namespace Botan {

 namespace {
@@ -64,8 +77,6 @@ void Blake2b::state_init()

 void Blake2b::compress(bool lastblock)
    {
-   uint64_t m[16];
-   uint64_t v[16];
    uint64_t* const H = m_H.data();
    const uint8_t* const block = m_buffer.data();

@@ -74,66 +85,307 @@ void Blake2b::compress(bool lastblock)
       m_F[0] = ~0ULL;
       }

-   for(int i = 0; i < 16; i++)
-      {
-      m[i] = load_le<uint64_t>(block, i);
-      }
+    #undef LOAD_MSG_0_1
+    #define LOAD_MSG_0_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m1)); b1 = vcombine_u64(vget_low_u64(m2), vget_low_u64(m3)); } while(0)

-   for(int i = 0; i < 8; i++)
-      {
-      v[i] = H[i];
-      v[i + 8] = blake2b_IV[i];
-      }
+    #undef LOAD_MSG_0_2
+    #define LOAD_MSG_0_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m0), vget_high_u64(m1)); b1 = vcombine_u64(vget_high_u64(m2), vget_high_u64(m3)); } while(0)

-   v[12] ^= m_T[0];
-   v[13] ^= m_T[1];
-   v[14] ^= m_F[0];
-   v[15] ^= m_F[1];
-
-#define G(r, i, a, b, c, d)                     \
-   do {                                         \
-   a = a + b + m[blake2b_sigma[r][2 * i + 0]];  \
-   d = rotate_right<uint64_t>(d ^ a, 32);         \
-   c = c + d;                                   \
-   b = rotate_right<uint64_t>(b ^ c, 24);         \
-   a = a + b + m[blake2b_sigma[r][2 * i + 1]];  \
-   d = rotate_right<uint64_t>(d ^ a, 16);         \
-   c = c + d;                                   \
-   b = rotate_right<uint64_t>(b ^ c, 63);         \
-   } while(0)
-
-#define ROUND(r)                                \
-   do {                                         \
-   G(r, 0, v[0], v[4], v[8], v[12]);            \
-   G(r, 1, v[1], v[5], v[9], v[13]);            \
-   G(r, 2, v[2], v[6], v[10], v[14]);           \
-   G(r, 3, v[3], v[7], v[11], v[15]);           \
-   G(r, 4, v[0], v[5], v[10], v[15]);           \
-   G(r, 5, v[1], v[6], v[11], v[12]);           \
-   G(r, 6, v[2], v[7], v[8], v[13]);            \
-   G(r, 7, v[3], v[4], v[9], v[14]);            \
-   } while(0)
-
-   ROUND(0);
-   ROUND(1);
-   ROUND(2);
-   ROUND(3);
-   ROUND(4);
-   ROUND(5);
-   ROUND(6);
-   ROUND(7);
-   ROUND(8);
-   ROUND(9);
-   ROUND(10);
-   ROUND(11);
-
-   for(int i = 0; i < 8; i++)
-      {
-      H[i] ^= v[i] ^ v[i + 8];
-      }
+    #undef LOAD_MSG_0_3
+    #define LOAD_MSG_0_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m4), vget_low_u64(m5)); b1 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_0_4
+    #define LOAD_MSG_0_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m5)); b1 = vcombine_u64(vget_high_u64(m6), vget_high_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_1_1
+    #define LOAD_MSG_1_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m7), vget_low_u64(m2)); b1 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m6)); } while(0)
+
+    #undef LOAD_MSG_1_2
+    #define LOAD_MSG_1_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m5), vget_low_u64(m4)); b1 = vextq_u64(m7, m3, 1); } while(0)
+
+    #undef LOAD_MSG_1_3
+    #define LOAD_MSG_1_3(b0, b1) \
+    do { b0 = vextq_u64(m0, m0, 1); b1 = vcombine_u64(vget_high_u64(m5), vget_high_u64(m2)); } while(0)
+
+    #undef LOAD_MSG_1_4
+    #define LOAD_MSG_1_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m1)); b1 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m1)); } while(0)
+
+    #undef LOAD_MSG_2_1
+    #define LOAD_MSG_2_1(b0, b1) \
+    do { b0 = vextq_u64(m5, m6, 1); b1 = vcombine_u64(vget_high_u64(m2), vget_high_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_2_2
+    #define LOAD_MSG_2_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m4), vget_low_u64(m0)); b1 = vcombine_u64(vget_low_u64(m1), vget_high_u64(m6)); } while(0)
+
+    #undef LOAD_MSG_2_3
+    #define LOAD_MSG_2_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m5), vget_high_u64(m1)); b1 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m4)); } while(0)
+
+    #undef LOAD_MSG_2_4
+    #define LOAD_MSG_2_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m7), vget_low_u64(m3)); b1 = vextq_u64(m0, m2, 1); } while(0)
+
+    #undef LOAD_MSG_3_1
+    #define LOAD_MSG_3_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m1)); b1 = vcombine_u64(vget_high_u64(m6), vget_high_u64(m5)); } while(0)
+
+    #undef LOAD_MSG_3_2
+    #define LOAD_MSG_3_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m0)); b1 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_3_3
+    #define LOAD_MSG_3_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m1), vget_high_u64(m2)); b1 = vcombine_u64(vget_low_u64(m2), vget_high_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_3_4
+    #define LOAD_MSG_3_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m3), vget_low_u64(m5)); b1 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m4)); } while(0)
+
+    #undef LOAD_MSG_4_1
+    #define LOAD_MSG_4_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m2)); b1 = vcombine_u64(vget_low_u64(m1), vget_low_u64(m5)); } while(0)
+
+    #undef LOAD_MSG_4_2
+    #define LOAD_MSG_4_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m0), vget_high_u64(m3)); b1 = vcombine_u64(vget_low_u64(m2), vget_high_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_4_3
+    #define LOAD_MSG_4_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m7), vget_high_u64(m5)); b1 = vcombine_u64(vget_low_u64(m3), vget_high_u64(m1)); } while(0)
+
+    #undef LOAD_MSG_4_4
+    #define LOAD_MSG_4_4(b0, b1) \
+    do { b0 = vextq_u64(m0, m6, 1); b1 = vcombine_u64(vget_low_u64(m4), vget_high_u64(m6)); } while(0)
+
+    #undef LOAD_MSG_5_1
+    #define LOAD_MSG_5_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m1), vget_low_u64(m3)); b1 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m4)); } while(0)
+
+    #undef LOAD_MSG_5_2
+    #define LOAD_MSG_5_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m5)); b1 = vcombine_u64(vget_high_u64(m5), vget_high_u64(m1)); } while(0)
+
+    #undef LOAD_MSG_5_3
+    #define LOAD_MSG_5_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m2), vget_high_u64(m3)); b1 = vcombine_u64(vget_high_u64(m7), vget_high_u64(m0)); } while(0)
+
+    #undef LOAD_MSG_5_4
+    #define LOAD_MSG_5_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m6), vget_high_u64(m2)); b1 = vcombine_u64(vget_low_u64(m7), vget_high_u64(m4)); } while(0)
+
+    #undef LOAD_MSG_6_1
+    #define LOAD_MSG_6_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m6), vget_high_u64(m0)); b1 = vcombine_u64(vget_low_u64(m7), vget_low_u64(m2)); } while(0)
+
+    #undef LOAD_MSG_6_2
+    #define LOAD_MSG_6_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m2), vget_high_u64(m7)); b1 = vextq_u64(m6, m5, 1); } while(0)
+
+    #undef LOAD_MSG_6_3
+    #define LOAD_MSG_6_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m3)); b1 = vextq_u64(m4, m4, 1); } while(0)
+
+    #undef LOAD_MSG_6_4
+    #define LOAD_MSG_6_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m1)); b1 = vcombine_u64(vget_low_u64(m1), vget_high_u64(m5)); } while(0)
+
+    #undef LOAD_MSG_7_1
+    #define LOAD_MSG_7_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m6), vget_high_u64(m3)); b1 = vcombine_u64(vget_low_u64(m6), vget_high_u64(m1)); } while(0)

-#undef G
-#undef ROUND
+    #undef LOAD_MSG_7_2
+    #define LOAD_MSG_7_2(b0, b1) \
+    do { b0 = vextq_u64(m5, m7, 1); b1 = vcombine_u64(vget_high_u64(m0), vget_high_u64(m4)); } while(0)
+
+    #undef LOAD_MSG_7_3
+    #define LOAD_MSG_7_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m2), vget_high_u64(m7)); b1 = vcombine_u64(vget_low_u64(m4), vget_low_u64(m1)); } while(0)
+
+    #undef LOAD_MSG_7_4
+    #define LOAD_MSG_7_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m2)); b1 = vcombine_u64(vget_low_u64(m3), vget_low_u64(m5)); } while(0)
+
+    #undef LOAD_MSG_8_1
+    #define LOAD_MSG_8_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m3), vget_low_u64(m7)); b1 = vextq_u64(m5, m0, 1); } while(0)
+
+    #undef LOAD_MSG_8_2
+    #define LOAD_MSG_8_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m7), vget_high_u64(m4)); b1 = vextq_u64(m1, m4, 1); } while(0)
+
+    #undef LOAD_MSG_8_3
+    #define LOAD_MSG_8_3(b0, b1) \
+    do { b0 = m6; b1 = vextq_u64(m0, m5, 1); } while(0)
+
+    #undef LOAD_MSG_8_4
+    #define LOAD_MSG_8_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m1), vget_high_u64(m3)); b1 = m2; } while(0)
+
+    #undef LOAD_MSG_9_1
+    #define LOAD_MSG_9_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m5), vget_low_u64(m4)); b1 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m0)); } while(0)
+
+    #undef LOAD_MSG_9_2
+    #define LOAD_MSG_9_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m1), vget_low_u64(m2)); b1 = vcombine_u64(vget_low_u64(m3), vget_high_u64(m2)); } while(0)
+
+    #undef LOAD_MSG_9_3
+    #define LOAD_MSG_9_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m7), vget_high_u64(m4)); b1 = vcombine_u64(vget_high_u64(m1), vget_high_u64(m6)); } while(0)
+
+    #undef LOAD_MSG_9_4
+    #define LOAD_MSG_9_4(b0, b1) \
+    do { b0 = vextq_u64(m5, m7, 1); b1 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m0)); } while(0)
+
+    #undef LOAD_MSG_10_1
+    #define LOAD_MSG_10_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m0), vget_low_u64(m1)); b1 = vcombine_u64(vget_low_u64(m2), vget_low_u64(m3)); } while(0)
+
+    #undef LOAD_MSG_10_2
+    #define LOAD_MSG_10_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m0), vget_high_u64(m1)); b1 = vcombine_u64(vget_high_u64(m2), vget_high_u64(m3)); } while(0)
+
+    #undef LOAD_MSG_10_3
+    #define LOAD_MSG_10_3(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m4), vget_low_u64(m5)); b1 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_10_4
+    #define LOAD_MSG_10_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m5)); b1 = vcombine_u64(vget_high_u64(m6), vget_high_u64(m7)); } while(0)
+
+    #undef LOAD_MSG_11_1
+    #define LOAD_MSG_11_1(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m7), vget_low_u64(m2)); b1 = vcombine_u64(vget_high_u64(m4), vget_high_u64(m6)); } while(0)
+
+    #undef LOAD_MSG_11_2
+    #define LOAD_MSG_11_2(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m5), vget_low_u64(m4)); b1 = vextq_u64(m7, m3, 1); } while(0)
+
+    #undef LOAD_MSG_11_3
+    #define LOAD_MSG_11_3(b0, b1) \
+    do { b0 = vextq_u64(m0, m0, 1); b1 = vcombine_u64(vget_high_u64(m5), vget_high_u64(m2)); } while(0)
+
+    #undef LOAD_MSG_11_4
+    #define LOAD_MSG_11_4(b0, b1) \
+    do { b0 = vcombine_u64(vget_low_u64(m6), vget_low_u64(m1)); b1 = vcombine_u64(vget_high_u64(m3), vget_high_u64(m1)); } while(0)
+
+    #define vrorq_n_u64_32(x) vreinterpretq_u64_u32(vrev64q_u32(vreinterpretq_u32_u64((x))))
+
+    #define vrorq_n_u64_24(x) vcombine_u64( \
+        vreinterpret_u64_u8(vext_u8(vreinterpret_u8_u64(vget_low_u64(x)), vreinterpret_u8_u64(vget_low_u64(x)), 3)), \
+        vreinterpret_u64_u8(vext_u8(vreinterpret_u8_u64(vget_high_u64(x)), vreinterpret_u8_u64(vget_high_u64(x)), 3)))
+
+    #define vrorq_n_u64_16(x) vcombine_u64( \
+        vreinterpret_u64_u8(vext_u8(vreinterpret_u8_u64(vget_low_u64(x)), vreinterpret_u8_u64(vget_low_u64(x)), 2)), \
+        vreinterpret_u64_u8(vext_u8(vreinterpret_u8_u64(vget_high_u64(x)), vreinterpret_u8_u64(vget_high_u64(x)), 2)))
+
+    #define vrorq_n_u64_63(x) veorq_u64(vaddq_u64(x, x), vshrq_n_u64(x, 63))
+
+    #undef G1
+    #define G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1) \
+    do { \
+      row1l = vaddq_u64(vaddq_u64(row1l, b0), row2l); \
+      row1h = vaddq_u64(vaddq_u64(row1h, b1), row2h); \
+      row4l = veorq_u64(row4l, row1l); row4h = veorq_u64(row4h, row1h); \
+      row4l = vrorq_n_u64_32(row4l); row4h = vrorq_n_u64_32(row4h); \
+      row3l = vaddq_u64(row3l, row4l); row3h = vaddq_u64(row3h, row4h); \
+      row2l = veorq_u64(row2l, row3l); row2h = veorq_u64(row2h, row3h); \
+      row2l = vrorq_n_u64_24(row2l); row2h = vrorq_n_u64_24(row2h); \
+    } while(0)
+
+    #undef G2
+    #define G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1) \
+    do { \
+      row1l = vaddq_u64(vaddq_u64(row1l, b0), row2l); \
+      row1h = vaddq_u64(vaddq_u64(row1h, b1), row2h); \
+      row4l = veorq_u64(row4l, row1l); row4h = veorq_u64(row4h, row1h); \
+      row4l = vrorq_n_u64_16(row4l); row4h = vrorq_n_u64_16(row4h); \
+      row3l = vaddq_u64(row3l, row4l); row3h = vaddq_u64(row3h, row4h); \
+      row2l = veorq_u64(row2l, row3l); row2h = veorq_u64(row2h, row3h); \
+      row2l = vrorq_n_u64_63(row2l); row2h = vrorq_n_u64_63(row2h); \
+    } while(0)
+
+    #define DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h) \
+    do { \
+      uint64x2_t t0 = vextq_u64(row2l, row2h, 1); \
+      uint64x2_t t1 = vextq_u64(row2h, row2l, 1); \
+      row2l = t0; row2h = t1; t0 = row3l;  row3l = row3h; row3h = t0; \
+      t0 = vextq_u64(row4h, row4l, 1); t1 = vextq_u64(row4l, row4h, 1); \
+      row4l = t0; row4h = t1; \
+    } while(0)
+
+    #define UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h) \
+    do { \
+      uint64x2_t t0 = vextq_u64(row2h, row2l, 1); \
+      uint64x2_t t1 = vextq_u64(row2l, row2h, 1); \
+      row2l = t0; row2h = t1; t0 = row3l; row3l = row3h; row3h = t0; \
+      t0 = vextq_u64(row4l, row4h, 1); t1 = vextq_u64(row4h, row4l, 1); \
+      row4l = t0; row4h = t1; \
+    } while(0)
+
+    #undef ROUND
+    #define ROUND(r) \
+    do { \
+      uint64x2_t b0, b1; \
+      LOAD_MSG_ ##r ##_1(b0, b1); \
+      G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
+      LOAD_MSG_ ##r ##_2(b0, b1); \
+      G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
+      DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
+      LOAD_MSG_ ##r ##_3(b0, b1); \
+      G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
+      LOAD_MSG_ ##r ##_4(b0, b1); \
+      G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
+      UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
+    } while(0)
+
+    const uint64x2_t m0 = vreinterpretq_u64_u8(vld1q_u8(block +  00));
+    const uint64x2_t m1 = vreinterpretq_u64_u8(vld1q_u8(block +  16));
+    const uint64x2_t m2 = vreinterpretq_u64_u8(vld1q_u8(block +  32));
+    const uint64x2_t m3 = vreinterpretq_u64_u8(vld1q_u8(block +  48));
+    const uint64x2_t m4 = vreinterpretq_u64_u8(vld1q_u8(block +  64));
+    const uint64x2_t m5 = vreinterpretq_u64_u8(vld1q_u8(block +  80));
+    const uint64x2_t m6 = vreinterpretq_u64_u8(vld1q_u8(block +  96));
+    const uint64x2_t m7 = vreinterpretq_u64_u8(vld1q_u8(block + 112));
+
+    uint64x2_t row1l, row1h, row2l, row2h;
+    uint64x2_t row3l, row3h, row4l, row4h;
+
+    const uint64x2_t h0 = row1l = vld1q_u64(&H[0]);
+    const uint64x2_t h1 = row1h = vld1q_u64(&H[2]);
+    const uint64x2_t h2 = row2l = vld1q_u64(&H[4]);
+    const uint64x2_t h3 = row2h = vld1q_u64(&H[6]);
+
+    row3l = vld1q_u64(&blake2b_IV[0]);
+    row3h = vld1q_u64(&blake2b_IV[2]);
+    row4l = veorq_u64(vld1q_u64(&blake2b_IV[4]), vld1q_u64(&m_T[0]));
+    row4h = veorq_u64(vld1q_u64(&blake2b_IV[6]), vld1q_u64(&m_F[0]));
+
+    ROUND(0);
+    ROUND(1);
+    ROUND(2);
+    ROUND(3);
+    ROUND(4);
+    ROUND(5);
+    ROUND(6);
+    ROUND(7);
+    ROUND(8);
+    ROUND(9);
+    ROUND(10);
+    ROUND(11);
+
+    vst1q_u64(&H[0], veorq_u64(h0, veorq_u64(row1l, row3l)));
+    vst1q_u64(&H[2], veorq_u64(h1, veorq_u64(row1h, row3h)));
+    vst1q_u64(&H[4], veorq_u64(h2, veorq_u64(row2l, row4l)));
+    vst1q_u64(&H[6], veorq_u64(h3, veorq_u64(row2h, row4h)));
    }

 void Blake2b::increment_counter(const uint64_t inc)
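
For readers mapping the NEON code back to the scalar version: each LOAD_MSG_r_i macro is a vectorized gather of four message words according to the blake2b_sigma permutation for round r. Here is a scalar sketch of the correspondence for the first macro (a hypothetical helper for illustration, not part of the patch; m0..m7 hold the sixteen message words in adjacent pairs):

void load_msg_0_1_scalar(const uint64_t m[16], uint64_t b0[2], uint64_t b1[2])
   {
   // Scalar equivalent of LOAD_MSG_0_1(b0, b1). Since blake2b_sigma[0] is the
   // identity permutation {0, 1, ..., 15}, b0 receives m[0], m[2] and b1
   // receives m[4], m[6] -- the words consumed by the first G1 half-round.
   b0[0] = m[0]; b0[1] = m[2]; // vcombine_u64(vget_low_u64(m0), vget_low_u64(m1))
   b1[0] = m[4]; b1[1] = m[6]; // vcombine_u64(vget_low_u64(m2), vget_low_u64(m3))
   }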

Only enable NEON for 32-bit ARM (A32). Do not use NEON for 64-bit ARM (AArch64). Here's a surprising result: (1) CXX and NEON run equally fast on the Cortex-A53 Pine64 and HiKey; (2) NEON runs 50% slower than CXX on the Cortex-A57 Overdrive 1000. I don't know what happens under AArch32 because I don't have a device.
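
A minimal sketch of how that gating might look at the preprocessor level (BOTAN_BLAKE2B_USE_NEON is a hypothetical name, not an existing Botan define):

// Enable the NEON path only for 32-bit ARM (A32/AArch32); on AArch64 the
// portable C++ compress function is used instead. Hypothetical macro name.
#if defined(__ARM_NEON) && !defined(__aarch64__)
#  define BOTAN_BLAKE2B_USE_NEON 1
#endif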

A quick look at a disassembly of the A57 CXX code reveals GCC never generates a NEON instruction for the compress function. It's purely integer operations on the CPU, and nothing goes to the coprocessor.

Here are the A57 timings after configuring with -march=armv8-a+crc+crypto -mtune=cortex-a57:

  • CXX implementation:
gcc117:~/botan> LD_LIBRARY_PATH=/opt/cfarm/gcc-latest/lib64 ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 315.197 MiB/sec (945.594 MiB in 3000.008 ms)
  • NEON implementation:
gcc117:~/botan> LD_LIBRARY_PATH=/opt/cfarm/gcc-latest/lib64 ./botan speed --msec=3000 Blake2b
Blake2b(512) [base] hash 148.028 MiB/sec (444.086 MiB in 3000.014 ms)

Here is the updated blake2.cpp and the diff packaged as a ZIP file.

blake2_updated.zip

noloader (Contributor, Author) commented Jan 23, 2017

@randombit,

I was talking with SN from the BLAKE2 team. Here's what he uncovered:

I got a private email about poor performance for BLAKE2. I could not
get all the details, but I was able to duplicate it at the GCC compile
farm using GCC117. GCC117 is an ARMv8/AArch64 machine with an 8-core
Cortex-A57.
...

The problem is/was, CXX outperforms NEON on a Cortex-A57 (ARM servers,
like the SoftIron Overdrive 1000). CXX and NEON perform about the same on a
Cortex-A53 (Pine64, HiKey, etc). A7, A8 and A9 perform as expected. It
was as if CXX and NEON were inverted under A57.
...

There's a small wildcard still in play. AArch32 is ARMv8 but in 32-bit
mode. NEON is still enabled for AArch32 because we effectively disable
NEON for 64-bit ARM. We don't know how it's going to perform because we
don't have a test device.

From the A57 optimization guide
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015b/index.html),
NEON addition and XOR have a throughput of 2, just like regular
instructions. But unlike regular instructions, where rotation is
nearly free, NEON needs to waste time with slow-ish (only one per
cycle) shift instructions. On balance, NEON loses out, not to mention
the extra message permutation work. Makes sense. I can't find any
detailed information for the A53, though. It probably can't do 2
independent instructions per cycle, I guess.

The AArch32 case is interesting; it doubles all of the necessary
instructions for the BLAKE2b general-purpose register case, so I would
expect performance to be more evenly paired between NEON and C,
but it may still not be worth enabling NEON. On the other hand, it
will probably make no difference whatsoever for BLAKE2s, so NEON
should remain disabled there.
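
To make the rotation cost concrete: a 64-bit rotate is a single, nearly-free instruction in the general-purpose register file, while NEON has no 64-bit rotate and must synthesize one from shifts. The patch special-cases the fixed BLAKE2b rotation counts (32, 24, 16 and 63) with vrev32/vext/add tricks to avoid the generic form. An illustrative sketch of the two forms (not taken from the patch):

#include <arm_neon.h>
#include <cstdint>

// Scalar rotate-right: compilers emit a single ROR instruction on AArch64.
inline uint64_t rotr64(uint64_t x, unsigned n)
   {
   return (x >> n) | (x << (64 - n));
   }

// Generic NEON rotate-right by a compile-time count N: two shifts plus an
// OR, and per the A57 optimization guide the shifts issue only one per
// cycle. This is the cost the vrorq_n_u64_* macros in the patch avoid.
template <unsigned N>
inline uint64x2_t vrorq_n_u64(uint64x2_t x)
   {
   return vorrq_u64(vshrq_n_u64(x, N), vshlq_n_u64(x, 64 - N));
   }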

randombit (Owner) commented:

Closing. I don't think we want to take this on; BLAKE2 is already fast enough in plain C++, so the extra complexity doesn't seem worthwhile.
