Skip to content
Permalink
Browse files

libFLAC: Add a workaround for a bug in MSVC2105 update2

MSVC2105 update2 compiles the C code:

    abs_residual_partition_sums[partition] =
                  (FLAC__uint32)_mm_cvtsi128_si32(mm_sum);

into this:

    movq    QWORD PTR [rsi], xmm2

while it should be:

    movd    eax, xmm2
    mov     QWORD PTR [rsi], rax

With this patch, MSVC emits:

    movq    QWORD PTR [rsi], xmm2
    mov     DWORD PTR [rsi+4], r9d

so the price of this workaround is 1 extra write instruction per
partition.

Patch-from: lvqcl <lvqcl.mail@gmail.com>
  • Loading branch information...
erikd committed May 5, 2016
1 parent 387992b commit 94a61241b02064c7d9fe508f72a742f2a90b8492
@@ -84,6 +84,10 @@ void FLAC__precompute_partition_info_sums_intrin_avx2(const FLAC__int32 residual
sum128 = _mm_hadd_epi32(sum128, sum128);
sum128 = _mm_hadd_epi32(sum128, sum128);
abs_residual_partition_sums[partition] = (FLAC__uint32)_mm_cvtsi128_si32(sum128);
/* workaround for a bug in MSVC2015U2 - see https://connect.microsoft.com/VisualStudio/feedback/details/2659191/incorrect-code-generation-for-x86-64 */
#if (defined _MSC_VER) && (_MSC_FULL_VER == 190023918) && (defined FLAC__CPU_X86_64)
abs_residual_partition_sums[partition] &= 0xFFFFFFFF; /**/
#endif
}
}
else { /* have to pessimistically use 64 bits for accumulator */
@@ -98,6 +98,10 @@ void FLAC__precompute_partition_info_sums_intrin_sse2(const FLAC__int32 residual
mm_sum = _mm_add_epi32(mm_sum, _mm_srli_si128(mm_sum, 8));
mm_sum = _mm_add_epi32(mm_sum, _mm_srli_si128(mm_sum, 4));
abs_residual_partition_sums[partition] = (FLAC__uint32)_mm_cvtsi128_si32(mm_sum);
/* workaround for a bug in MSVC2015U2 - see https://connect.microsoft.com/VisualStudio/feedback/details/2659191/incorrect-code-generation-for-x86-64 */
#if (defined _MSC_VER) && (_MSC_FULL_VER == 190023918) && (defined FLAC__CPU_X86_64)
abs_residual_partition_sums[partition] &= 0xFFFFFFFF;
#endif
}
}
else { /* have to pessimistically use 64 bits for accumulator */
@@ -87,6 +87,10 @@ void FLAC__precompute_partition_info_sums_intrin_ssse3(const FLAC__int32 residua
mm_sum = _mm_hadd_epi32(mm_sum, mm_sum);
mm_sum = _mm_hadd_epi32(mm_sum, mm_sum);
abs_residual_partition_sums[partition] = (FLAC__uint32)_mm_cvtsi128_si32(mm_sum);
/* workaround for a bug in MSVC2015U2 - see https://connect.microsoft.com/VisualStudio/feedback/details/2659191/incorrect-code-generation-for-x86-64 */
#if (defined _MSC_VER) && (_MSC_FULL_VER == 190023918) && (defined FLAC__CPU_X86_64)
abs_residual_partition_sums[partition] &= 0xFFFFFFFF;
#endif
}
}
else { /* have to pessimistically use 64 bits for accumulator */

2 comments on commit 94a6124

@40th

This comment has been minimized.

Copy link

replied Mar 1, 2018

For those interested, this is still a problem with the current MS compiler (_MSC_FULL_VER reports 191225835). I suggest removing at least the _MSC_FULL_VER check if you want an x64 release build to compress down more than 92%.

Also, the Connect site link goes nowhere. For more background,

https://www.mail-archive.com/flac-dev@xiph.org/msg04176.html

Update: the bug introduced by the release x64 build is a new one; same effect.

Here is the doesn't-work one (from stream_encoder_intrin_sse2.c -- don't forget to turn off LTCG (make that, /GL) for this file to get the /Facs generated):

`; 96 : __m128i mm_res = local_abs_epi32(_mm_cvtsi32_si128(residual[residual_sample]));

00120 66 0f 6e 01 movd xmm0, DWORD PTR [rcx]
00124 48 8d 49 04 lea rcx, QWORD PTR [rcx+4]

[snip]

; 102 : abs_residual_partition_sums[partition] = (FLAC__uint32)_mm_cvtsi128_si32(mm_sum);

00165 66 0f 7e c1 movd ecx, xmm0
00169 48 89 0e mov QWORD PTR [rsi], rcx
`
The problem is, the high 32 bits of rcx still have the result from a previous lea (at 00124), so same problem, different reason.

Here is the working one:
`; 102 : abs_residual_partition_sums[partition] = (FLAC__uint32)_mm_cvtsi128_si32(mm_sum);

00165 66 0f 7e c1 movd ecx, xmm0
00169 48 89 0e mov QWORD PTR [rsi], rcx

; 107 : abs_residual_partition_sums[partition] &= 0xFFFFFFFF;

0016c 44 89 4e 04 mov DWORD PTR [rsi+4], r9d
`
where r9d has 0, clearing out the high 32 bits.

@erikd

This comment has been minimized.

Copy link
Member Author

replied Mar 1, 2018

So how about a PR with an updated version check and a corrected URL?

Please sign in to comment.
You can’t perform that action at this time.