-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-SCA resistant elliptic curve fixed point multiplication #3184
Comments
There shouldn't be issues with caches but there are. Timing of the operation remotely should not be an issue either. The inherent jitter should mask any differences but again, attacks appear to be getting more advanced all the time. I'm looking into this now and will hopefully have an option of compiling without the scan by the end of the week. Would you mind raising a ticket on this? Sean |
Hi @SparkiDev , sure, I just didn't find a better place for the report/ticket - where I should rise the ticket? |
You can raise a ticket by emailing "support [at] wolfssl [dot] com" or by going to "https [colon //] wolfssl [dot] zendesk [dot] com Warm Regards, KH |
Done, the ticket number is 10722 |
Thanks @krizhanovsky The pull request has been merged. Sean |
Hi,
ECDSA uses elliptic curve fixed point multiplication and
sp_256_ecc_mulmod_add_only_avx2_4()
is responsible for the actual multiplication using the precomputed tablep256_table
.With
WOLFSSL_SP_SMALL
the table is relatively small: 256 * sizeof( sp_table_entry_256) = 256*2 * 4 * 8 = 16KB, but for withoutWOLFSSL_SP_SMALL
it is much larger: 256 * sizeof( sp_table_entry_256) = 2405 * 32 = 150KB. Having the typical L1d cache size of 32KB and keeping in mind the cache associativity, it's quite likely that accessing the table will cause many L1d cache misses, especially for the fastest version of the function. The level 1 data cache is shared between the 2 hyperthreads running on the same core of the modern Intel CPUs.sp_256_ecc_mulmod_add_only_avx2_4()
recodes the secret scalark
withsp_256_ecc_recode_7_4()
, but without any randomization. After that the recoded value is used to directly address the table of precomputed values, so different multiplication steps take different time, depending on which items of the table are cached.I also observed the same code for OpenSSL and mbedTLS. OpenSSL uses the same 150KB precomputed table, but fully scans it on each iteration (also see the paper "Fast prime field elliptic-curve cryptography with 256-bit primes" by Gueron and Krasnov). AVX2 for faster scanning. MbedTLS uses much smaller precomputed table (actually they use the same Comba method for both the fixed and unknown point multiplications), but they also use full table scan and point randomization.
I see at least 2 problems with the current implementation:
An attacked can observe different computation time remotely (timing SCA). Probably even more crucial for an embedded devices, which can be physically accessed and attacked by power analysis.
If the library work in a public cloud, then a container/VM (also applicable for a hostile process running on the same system) working on sibling hyperthread may influence the CPU cache and measure response times at the same time.
The text was updated successfully, but these errors were encountered: