Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-SCA resistant elliptic curve fixed point multiplication #3184

Closed
krizhanovsky opened this issue Aug 2, 2020 · 5 comments
Closed

Non-SCA resistant elliptic curve fixed point multiplication #3184

krizhanovsky opened this issue Aug 2, 2020 · 5 comments
Assignees

Comments

@krizhanovsky
Copy link

Hi,

ECDSA uses elliptic curve fixed point multiplication and sp_256_ecc_mulmod_add_only_avx2_4() is responsible for the actual multiplication using the precomputed table p256_table.

With WOLFSSL_SP_SMALL the table is relatively small: 256 * sizeof( sp_table_entry_256) = 256*2 * 4 * 8 = 16KB, but for without WOLFSSL_SP_SMALL it is much larger: 256 * sizeof( sp_table_entry_256) = 2405 * 32 = 150KB. Having the typical L1d cache size of 32KB and keeping in mind the cache associativity, it's quite likely that accessing the table will cause many L1d cache misses, especially for the fastest version of the function. The level 1 data cache is shared between the 2 hyperthreads running on the same core of the modern Intel CPUs.

sp_256_ecc_mulmod_add_only_avx2_4() recodes the secret scalar k with sp_256_ecc_recode_7_4(), but without any randomization. After that the recoded value is used to directly address the table of precomputed values, so different multiplication steps take different time, depending on which items of the table are cached.

I also observed the same code for OpenSSL and mbedTLS. OpenSSL uses the same 150KB precomputed table, but fully scans it on each iteration (also see the paper "Fast prime field elliptic-curve cryptography with 256-bit primes" by Gueron and Krasnov). AVX2 for faster scanning. MbedTLS uses much smaller precomputed table (actually they use the same Comba method for both the fixed and unknown point multiplications), but they also use full table scan and point randomization.

I see at least 2 problems with the current implementation:

  1. An attacked can observe different computation time remotely (timing SCA). Probably even more crucial for an embedded devices, which can be physically accessed and attacked by power analysis.

  2. If the library work in a public cloud, then a container/VM (also applicable for a hostile process running on the same system) working on sibling hyperthread may influence the CPU cache and measure response times at the same time.

@SparkiDev
Copy link
Contributor

Hi @krizhanovsky

There shouldn't be issues with caches but there are.
In normal usage, another process shouldn't be running on the same CPU.
Even in the cloud this is something that should not be allowed! But of course this is going to happen.

Timing of the operation remotely should not be an issue either. The inherent jitter should mask any differences but again, attacks appear to be getting more advanced all the time.
Embedded devices typically don't have the issue of cache attacks. You always have to go to memory.

I'm looking into this now and will hopefully have an option of compiling without the scan by the end of the week.

Would you mind raising a ticket on this?

Sean

@krizhanovsky
Copy link
Author

Hi @SparkiDev ,

sure, I just didn't find a better place for the report/ticket - where I should rise the ticket?

@kaleb-himes
Copy link
Contributor

@krizhanovsky,

You can raise a ticket by emailing "support [at] wolfssl [dot] com" or by going to "https [colon //] wolfssl [dot] zendesk [dot] com

Warm Regards,

KH

@krizhanovsky
Copy link
Author

Done, the ticket number is 10722

@SparkiDev
Copy link
Contributor

Thanks @krizhanovsky

The pull request has been merged.

#3195

Sean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants