# A Hardware-Friendly Shuffling Countermeasure Against Side-Channel Attacks for Kyber

Dejun Xu, Kai Wang, and Jing Tian, Member, IEEE

Abstract-CRYSTALS-Kyber has been standardized by NIST as the only key-encapsulation mechanism (KEM) scheme designed to withstand attacks by large-scale quantum computers. However, side-channel attacks (SCAs) on its implementations still need to be carefully considered for the upcoming migration. In this brief, we propose a secure and efficient hardware implementation of Kyber that incorporates a novel compact shuffling architecture. First, we modify the Fisher-Yates shuffle to make it more hardware-friendly. We then design an optimized shuffling architecture for a well-known open-source Kyber hardware implementation to protect all known and potential side-channel leakage points. Finally, we implement the modified Kyber design on an FPGA and evaluate its security and performance. The security is verified by conducting correlation power analysis (CPA) and test vector leakage assessment (TVLA) on the hardware. Meanwhile, FPGA place-and-route results show that the proposed design incurs only an 8.7% degradation in hardware efficiency compared with the original unprotected version, much better than existing hardware hiding schemes.

*Index Terms*—CRYSTALS-Kyber, hardware implementation, shuffling, side-channel attack, countermeasure.

## I. INTRODUCTION

With the rapid development of quantum computing, modern cryptography faces severe threats. Since 2016, NIST has held a post-quantum cryptography (PQC) standardization competition, receiving global proposals. After three rounds, CRYSTALS-Kyber (Kyber) was chosen as the only key-encapsulation mechanism (KEM) protocol [1]. Recently, NIST issued FIPS 203, a standard for the module-lattice-based key-encapsulation mechanism (ML-KEM) based on CRYSTALS-Kyber. Research on efficient hardware implementations of ML-KEM/Kyber has by now reached a relatively mature stage [2], [3]. Nevertheless, during the impending shift from conventional cryptographic algorithms to ML-KEM/Kyber, it is equally crucial to assess their security against physical attacks, particularly side-channel attacks (SCAs) and fault injection attacks (FIAs).

SCAs recover keys by acquiring the power consumption or electromagnetic emanations of cryptographic devices, and countermeasures such as hiding or masking are typically employed to defend against them. FIAs, on the other hand, aim to recover keys by introducing faults into cryptographic devices during their operation. Error detection mechanisms can effectively identify and respond to the injection of these faults. On the SCA side, Tosun *et al.* [4] successfully recovered the full key of Kyber from the polynomial multiplication in Kyber's software implementations. Zhao *et al.* [5] recently pointed out that there are three side-channel leakage points in Kyber's decryption procedure, and successfully carried out SCAs on Kyber's hardware implementations. On the FIA side, Ni *et al.* [6] successfully implemented bitstream fault injection attacks on Kyber's hardware implementations. In addition, combined fault and power attacks also pose a significant threat to lattice-based cryptographic algorithms.

This work was supported in part by the National Natural Science Foundation of China under Grant 62104097, in part by the Key Research Plan of Jiangsu Province of China under Grant BE2022098, and in part by the Young Elite Scientists Sponsorship Program by CAST under Grant 2023QNRC001. (Corresponding author: Jing Tian.)

The authors are with the School of Integrated Circuits, Nanjing University, Suzhou, 215163, China (e-mail: tianjing@nju.edu.cn).

However, defense work for Kyber, especially on hardware platforms, is still far from sufficient. Sarker *et al.* [7] proposed error detection architectures for hardware/software co-design approaches of the number theoretic transform (NTT). Zhao *et al.* [5] and Kamucheka *et al.* [8] proposed two different hardware masking schemes, both of which introduce significant area overhead. On the other hand, two hiding schemes on hardware platforms are provided in [9] and [10]. In [9], Moraitis *et al.* used randomized clocking and duplication to reduce the energy correlation during the operation of Kyber. In [10], Jati *et al.* added random delays and address and instruction shuffling to reduce the energy correlation. Nevertheless, these two hiding schemes result in high resource overhead and high clock cycle overhead, respectively. In this brief, we propose a hiding protection method against SCAs for Kyber's hardware implementation that takes both performance and security into consideration.

This work provides a good tradeoff between security enhancement and hardware efficiency. Specifically, we modify the Fisher-Yates shuffle to make it more hardware-friendly. Based on the algorithm, we propose an optimized shuffling architecture and apply it to the open-source hardware implementation of Kyber from [11]. We shuffle the sub-operation orders of all the potential side-channel leakage points during Kyber's decryption procedure. The proposed design shows only an 8.7% efficiency drop and surpasses current hiding schemes.

## II. PRELIMINARIES

### A. Related Work on Shuffling

As an efficient and effective countermeasure, shuffling is widely used in various cryptographic implementations against SCAs. In [12], Ravi *et al.* proposed three variants of the shuffling countermeasure with varying granularity for the NTT on software platforms. In [10], [13], [14], three different shuffling countermeasures on hardware platforms are proposed, as shown in Table I. The schemes from [10], [13] require a large amount of storage space to achieve high security, as their random permutations are generated outside of the hardware and preloaded into storage in advance. In contrast, Zijlstra *et al.* [14] used the random permutation generator (RPG) from [15] to protect lattice-based cryptographic schemes. However, the RPG they used is quite costly, making it difficult to meet low-cost requirements.

TABLE I
COMPARISON WITH EXISTING HARDWARE SHUFFLING METHODS

| Work           | Huge Space of Permutations | No Dynamic Input Required | No Pre-Storage Required |
|----------------|----------------------------|---------------------------|-------------------------|
| TECS [10]      | ×                          | ×                         | ×                       |
| TCAD [13]      | ×                          | ×                         | ×                       |
| INDOCRYPT [14] | ✓                          | ×                         | ✓                       |
| This work      | ✓                          | ✓                         | ✓                       |

### B. Kyber and Side-Channel Leakage Points in Kyber

Kyber is a CCA-secure protocol whose security is based on the module learning with errors (MLWE) problem. The core operations of Kyber are polynomial multiplications over the polynomial ring $\mathcal{R}_q = \mathbb{Z}_q[X]/(X^n+1)$, where n=256, q=3329, and the module rank k is 2, 3, or 4. The whole Kyber protocol consists of three procedures: key generation, encryption, and decryption. The side-channel leakage points of the secret key are mainly concentrated in Kyber's decryption procedure, and the main operations of this procedure are as follows:

$$m \leftarrow Compress_q(v - INTT(\hat{\boldsymbol{s}}^T \circ NTT(\boldsymbol{u})), 1). \tag{1}$$

They include NTT, point-wise multiplication (PWM), inverse NTT (INTT), subtraction, and compress, where m denotes the decrypted message, $\hat{s}$ denotes the unpacked secret key, and u and v denote the unpacked ciphertext. NTT, PWM, and INTT are used to reduce the complexity of polynomial multiplications from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$.
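As a small worked illustration of the last step in (1), the sketch below implements the 1-bit compression. It assumes the rounding definition $Compress_q(x, d) = \lceil (2^d/q) \cdot x \rfloor \bmod 2^d$ from the Kyber specification; the function names and the integer-only rounding are choices made here for illustration and are not taken from this brief.

```python
Q = 3329  # Kyber modulus q


def compress(x: int, d: int) -> int:
    """Round (2^d / q) * x to the nearest integer, then reduce mod 2^d."""
    x %= Q
    return (((x << d) + Q // 2) // Q) & ((1 << d) - 1)


def decode_message_bits(coeffs):
    """Map each coefficient of v - INTT(...) to one message bit (d = 1)."""
    return [compress(c, 1) for c in coeffs]


# Coefficients close to 0 (or to q) decode to 0, coefficients close to q/2 to 1.
print(decode_message_bits([0, 5, 1665, 3300]))  # prints [0, 0, 1, 0]
```

With d = 1, a coefficient decodes to 1 exactly when it is closer to q/2 than to 0 modulo q.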

According to [5], there are three side-channel leakage points in Kyber's decryption procedure (hardware) as shown in (2).

$$\begin{cases}
\text{point } 1 : \hat{\boldsymbol{s}}^T \circ \hat{\boldsymbol{u}}, \\
\text{point } 2 : (\hat{\boldsymbol{s}}^T \circ \hat{\boldsymbol{u}}) \bmod q, \\
\text{point } 3 : v - (\boldsymbol{s}^T \boldsymbol{u} \bmod q).
\end{cases} \tag{2}$$

The three leakage points lie in PWM, the modular reduction after PWM, and the subtraction after INTT, respectively. In addition, INTT has been successfully attacked in software implementations [16], so we retain the protection of INTT by shuffling the butterfly units using a series of fixed-length random permutations produced by the proposed RPG.
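The idea behind shuffling such coefficient-wise operations can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the proposed hardware: the helper names are hypothetical, the polynomial length of 256 is used only for concreteness, and Python's `random` module stands in for the hardware permutation source. The point is that the sub-operations are independent, so executing them in a random order leaves the result unchanged while the time slot in which a given coefficient is processed varies from run to run.

```python
import random

Q = 3329   # Kyber modulus
N = 256    # coefficients per polynomial


def subtract_in_order(v, w):
    """Reference result: coefficient-wise (v - w) mod q in natural order."""
    return [(a - b) % Q for a, b in zip(v, w)]


def subtract_shuffled(v, w, order):
    """Same subtraction, but coefficient i is processed at a random time slot."""
    result = [0] * len(v)
    for i in order:                  # 'order' is a permutation of the indexes
        result[i] = (v[i] - w[i]) % Q
    return result


rng = random.Random(2024)            # software stand-in for a hardware RPG
v = [rng.randrange(Q) for _ in range(N)]
w = [rng.randrange(Q) for _ in range(N)]
order = list(range(N))
rng.shuffle(order)

same = subtract_shuffled(v, w, order) == subtract_in_order(v, w)
```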

## III. PROPOSED SHUFFLING ARCHITECTURE FOR KYBER

The Fisher-Yates shuffle is one of the most widely used shuffling algorithms. Implementing it in hardware raises two problems: random seeds must be supplied continuously, and the range of indexes must be reduced dynamically. In this work, we address both problems. The key ideas of the improved algorithm are shown in Algorithm 1.
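For reference, the textbook Fisher-Yates shuffle is sketched below in software (this is the standard algorithm, not the modified version proposed in this brief). It makes the two hardware problems visible: every iteration consumes a fresh random number, and the range that number must fall in shrinks with each step.

```python
import random


def fisher_yates(items, rng=None):
    """Return a uniformly shuffled copy of items (textbook Fisher-Yates)."""
    rng = rng or random.Random()
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)     # fresh random draw; its range shrinks with i
        a[i], a[j] = a[j], a[i]
    return a


shuffled = fisher_yates(range(0x40), random.Random(7))
```

Drawing uniformly from a range that is not a power of two is what makes a direct hardware mapping awkward, since it needs either rejection sampling or a modular reduction.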

Fig. 1. The comparison of the shuffling method without and with adjustment. Recoverable panel titles: (a) The method with no adjustment. (c) The designed address permutation.

### A. Random Permutation Generator

RPG is a module that generates random permutations from an ordered one. There are five different random permutations to be generated, which are  $00 \sim 3f$ ,  $40 \sim 7f$ ,  $00 \sim 7f$ ,  $80 \sim bf$ , and  $c0 \sim ff$ . Since there is a correspondence between these permutations, it is only necessary to generate the permutation  $00 \sim 3f$ , and the other permutations can be obtained by extending it.
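A minimal sketch of such an extension is given below. It assumes the extension is the bit concatenation $\{line, addr\}$ of a 2-bit line number and a 6-bit address, which Section III-B states for addr0; the helper names are hypothetical, and the 7-bit permutation $00 \sim 7f$ is not modeled here.

```python
def extend(addr: int, line: int) -> int:
    """Concatenate a 2-bit line number and a 6-bit address: {line, addr}."""
    if not (0 <= addr < 0x40 and 0 <= line < 4):
        raise ValueError("addr must be 6 bits wide and line must be 2 bits wide")
    return (line << 6) | addr


def extend_permutation(base, line):
    """Derive a permutation of one 64-address range from a permutation of 00~3f."""
    return [extend(a, line) for a in base]


# Any permutation of 00~3f works as the base; reversed order is used for brevity.
base = [0x3F - a for a in range(0x40)]
upper = extend_permutation(base, 3)   # a permutation of c0~ff
```

Because the prefix bits are constant within a range, the relative order of the base permutation is preserved, so no second random permutation has to be generated.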

Before introducing the hardware architecture, note that because of overlapping addresses, restrictions must be imposed on the first six and last six elements of the randomly generated permutations to avoid address conflicts. The key idea is to introduce a fixed intermediate permutation and generate the random one in two stages. The comparison of the shuffling methods without and with adjustment is shown in Fig. 1. As explained above, if we directly use the random address permutation, errors may occur in the first six or the last six locations, as illustrated in Fig. 1(a). To avoid those errors, an intermediate address permutation is designed in advance, as shown in Fig. 1(c). It is divided into two regions, $per\_a$ and $per\_b$, which are defined as:

$$\begin{cases} per\_a = \{33, 32, 31, \ldots, 0d, 0c, 0b\}, \\ per\_b = \{34, \ldots, 38\} \cup \{00, \ldots, 0a\} \cup \{39, \ldots, 3f\}. \end{cases} \tag{3}$$

Actually, the designed address permutation only needs to restrict those 11 elements and can be in any form. For simplicity, we define it in the above form in this brief.

In the first stage of shuffling, the source of the address permutation is $per\_a$, and twelve addresses are selected according to the random indexes and placed at the rightmost positions of a new, empty permutation. The selected locations of $per\_a$ are then filled successively with the leftmost elements of $per\_b$. This stage protects against errors. In the second stage of shuffling, the remaining random addresses are selected from the remaining 52 elements of $per\_a$ and $per\_b$. After all the selections are finished, the filled permutation is cyclically shifted to the right 6 times and output as the final random address permutation, as shown in Fig. 1(b).

Fig. 2. The proposed architecture of RPG.

Based on the above analysis, the proposed architecture of RPG is shown in Fig. 2, divided into five parts. Part (a) contains one 64-to-1 MUX named MUX0, one 1-to-64 DEMUX, and one register group named REG with a data width of 6 and a depth of 64. REG stores the designed address permutation and its updated version. Under the control of idx', MUX0 continuously outputs an element from REG, and DEMUX routes an element from MUX0 back to update the data in REG. Part (b) contains one 3-to-1 MUX and a FIFO with a data width of 6 and a depth of 64, which is reused to cache both the random indexes from the LFSR module and the generated random address permutation, reducing the area consumption. Three control signals are used, i.e., sel\_0, ena, and enb. Under the control of ena and enb, FIFO works in the input mode and the cyclic mode, respectively. When FIFO is in the input mode, the data enter from REG or LFSR; when FIFO is in the cyclic mode, the output is pushed back into the input. Part (c) contains one 3-to-1 MUX named MUX2, two 2-to-1 MUXs, two comparators, one counter, one subtractor, and one AND gate. MUX2 has three inputs, $idx\_0$, $idx\_1$, and rest, controlled by sel\_1. The input random index idx output from FIFO is adjusted to match the two stages of shuffling, formulated as:

$$idx\_0 = \begin{cases} idx, & idx \le 28, \\ idx - 28, & idx > 28, \end{cases} \tag{4}$$

$$idx\_1 = \begin{cases} idx, & idx \le rest, \\ idx \ \& \ rest, & idx > rest, \end{cases} \tag{5}$$

where the value 28 is in hexadecimal format. The variable *rest* denotes the number of remaining elements to be selected minus 1, which is computed by a decrement counter. The output signal idx', selected from $idx\_0$, $idx\_1$, and rest, serves as the control signal of MUX0 and DEMUX in part (a). Part (d) contains a linear feedback shift register (LFSR) with a depth of 32 and a buffer. The buffer outputs every six cycles, converting six 1-bit numbers into one 6-bit number (index). The random seeds are generated by an external true random number generator (TRNG). When $rst\_l$ is equal to 1, the LFSR is updated with the input random seed. Part (e) contains a controller made up of several logic circuits. It produces the control signals ena, $sel\_0$, $sel\_1$, and finish based on the input signals for the other three modules.

Algorithm 1. The Processing Schedule of RPG

```
REG ← {0b, 0c, ..., 32, 33, 3f, 3e, ..., 3a, 39, 0a, 09, ..., 01, 00, 38, 37, 36, 35, 34}
FIFO_ena ← 1
for i = 0 to 63 do
    for j = 0 to 5 do
        BUF[j] ← LFSR
    end for
    FIFO_in ← BUF
end for
for k = 0 to 11 do
    rest ← 63 - k
    idx_0 ← FIFO_out > 0x28 ? FIFO_out - 0x28 : FIFO_out
    FIFO_in ← REG[idx_0]
    REG[idx_0] ← REG[rest]
end for
for k = 12 to 63 do
    rest ← 63 - k
    idx_1 ← FIFO_out > rest ? FIFO_out & rest : FIFO_out
    FIFO_in ← REG[idx_1]
    REG[idx_1] ← REG[rest]
end for
for i = 0 to 5 do
    FIFO_in ← FIFO_out
end for
FIFO_ena ← 0
```
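The LFSR and buffer of part (d) can be sketched in software as follows. This is an illustrative model under stated assumptions: the brief specifies only the LFSR depth of 32 and the six-bit buffering, so the Fibonacci structure, the tap positions, the bit order inside the buffer, and the class and function names are all assumptions made here.

```python
class Lfsr32:
    """32-bit Fibonacci LFSR; the tap positions are an assumption, not from the brief."""

    TAPS = (31, 21, 1, 0)

    def __init__(self, seed: int):
        self.state = seed & 0xFFFFFFFF
        if self.state == 0:
            raise ValueError("an all-zero seed locks up the LFSR")

    def next_bit(self) -> int:
        """Shift once and return the bit that leaves the register."""
        out = self.state & 1
        feedback = 0
        for tap in self.TAPS:
            feedback ^= (self.state >> tap) & 1
        self.state = (self.state >> 1) | (feedback << 31)
        return out


def next_index(lfsr: Lfsr32) -> int:
    """Model of the buffer: collect six 1-bit outputs into one 6-bit index."""
    value = 0
    for _ in range(6):
        value = (value << 1) | lfsr.next_bit()
    return value


lfsr = Lfsr32(seed=0xACE1ACE1)   # in hardware the seed would come from the TRNG
indexes = [next_index(lfsr) for _ in range(64)]   # 6 x 64 = 384 bits in total
```

Each index costs six LFSR cycles, which matches the statement that the buffer outputs every six cycles.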

To make this clearer, Algorithm 1 illustrates the processing schedule of RPG. The generation of random permutations can be divided into four steps: (a) initialization, (b) shuffling\_12, (c) shuffling\_52, and (d) cyclic\_shift\_6. In the initialization step, LFSR outputs $6 \times 64 = 384$ bits to fill up FIFO, and REG is set with the designed address permutation. Note that the 64 registers in REG use 00 to 3f as their serial numbers. In the shuffling\_12 step, 12 random addresses are selected from the 41 (0x00 $\sim$ 0x28) lower-side locations of REG based on the elements shifted out of FIFO<sub>out</sub> and Eq. (4), and cached into FIFO<sub>in</sub>. Meanwhile, the selected locations of REG are refilled with the elements from specific locations of REG (those at the dynamic maximum serial number rest). After 12 rounds, this step is finished and the shuffling\_52 step starts. In this step, the remaining 52 addresses in REG are shuffled based on the remaining random indexes in FIFO and Eq. (5). They are serially chosen from the lower-side locations of REG and cached into FIFO. For example, assume that rest is 0x02 and the random index is 0x05, which is larger than 0x02. Applying Eq. (5), we get an adjusted index of 0x00. We then use the new index to choose the element in the zeroth register of REG and push it into FIFO from the lower side. At the same time, the element in the zeroth register is updated by the element in the second register of REG (corresponding to the dynamic maximum serial number rest). When the shuffling\_52 step is completed, we start the fourth step, *i.e.*, the cyclic\_shift\_6 step. FIFO is shifted six times in the cyclic mode and we get the target random address permutation.

Fig. 3. The proposed architecture of ADDR.

The protection does not increase the total execution time of Kyber, since RPG runs in parallel with Kyber, takes far fewer cycles, and has a shorter critical path.
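The processing schedule of RPG can be modeled in software as sketched below. This is a behavioral sketch under stated assumptions, not the hardware design: the index adjustments follow Eqs. (4) and (5), the initial REG contents follow the order listed in Algorithm 1, a `deque` stands in for the FIFO, and Python's `random` module replaces the LFSR. All function names are hypothetical. The sketch only demonstrates that the schedule yields a permutation of $00 \sim 3f$; it does not check the address-conflict conditions of Fig. 1.

```python
import random
from collections import deque


def initial_reg():
    """Designed intermediate permutation, in the REG order of Algorithm 1."""
    return (list(range(0x0B, 0x34))           # 0b..33 (per_a, 41 elements)
            + list(range(0x3F, 0x38, -1))     # 3f..39
            + list(range(0x0A, -1, -1))       # 0a..00
            + list(range(0x38, 0x33, -1)))    # 38..34


def adjust_stage1(idx):
    """Eq. (4): fold the index into the 41 lower-side locations 0x00..0x28."""
    return idx - 0x28 if idx > 0x28 else idx


def adjust_stage2(idx, rest):
    """Eq. (5): mask the index with rest when it exceeds rest."""
    return idx & rest if idx > rest else idx


def rpg(random_indexes):
    """Run the shuffling_12, shuffling_52, and cyclic shift steps."""
    reg = initial_reg()
    fifo = deque(random_indexes)              # initialization: 64 six-bit indexes
    if len(fifo) != 64:
        raise ValueError("exactly 64 random indexes are required")
    for k in range(64):
        rest = 63 - k
        raw = fifo.popleft()                  # FIFO_out
        idx = adjust_stage1(raw) if k < 12 else adjust_stage2(raw, rest)
        fifo.append(reg[idx])                 # FIFO_in <- REG[idx]
        reg[idx] = reg[rest]                  # refill the selected location
    for _ in range(6):                        # cyclic mode: FIFO_in <- FIFO_out
        fifo.append(fifo.popleft())
    return list(fifo)


rng = random.Random(0)                        # software stand-in for the LFSR
permutation = rpg([rng.randrange(64) for _ in range(64)])
```

Every selected index is at most rest, so each step removes one element from the live region REG[0..rest] and moves the element at position rest into the gap, which is why no address can be output twice.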

### B. Address Controller

As shown in Fig. 3, the proposed architecture of the address controller (ADDR) is divided into four parts. In part (a), four new read addresses (raddr'\_x) and two new write addresses (waddr'\_x) are used to replace the read and write addresses of the RAMs and ROM. Six 2-to-1 MUXs are used to determine the data sources of these new addresses. When there is no need for shuffling, the original addresses are directly connected. When shuffling is required, $addr0 \sim addr5$ are selected. The generation of the control signals $repl0 \sim repl5$ for these MUXs is shown in part (c). In part (b), four shift registers are used to output the delayed random permutations and control signals. The external enable signal enb of FIFO in RPG is set to 1 every other clock cycle (PWM and subtraction). The generation of $addr0 \sim addr5$ is shown in part (d). The permutations $40 \sim 7f$ and $00 \sim 7f$ required by addr2, addr4, and addr5 can be obtained by concatenating one 0 or 1 after addr, $addr\_r12$, and $addr\_r13$, respectively. The four permutations $00 \sim 3f$, $40 \sim 7f$, $80 \sim bf$, and $c0 \sim ff$ required by addr0 are obtained by concatenating line (equal to $0 \sim 3$) and addr into $\{line, addr\}$.

Fig. 4. The CPA results of the unprotected scheme with $4\times10^3$ EM traces (a) and the protected scheme with $1\times10^5$ EM traces (b). The TVLA results of the unprotected scheme with $1\times10^4$ EM traces (c) and the protected scheme with $1\times10^7$ EM traces (d). The red dashed lines represent $\pm4.5$.

## IV. RESULTS AND COMPARISONS

### A. Evaluation on SCA Resistance

TABLE II
COMPARISON OF THE FPGA AREA AND TIME PERFORMANCE

| Implementation                | Parameter | Platform | LUTs   | FFs  | Slices            | DSPs | BRAMs | ENS <sup>1</sup> | Cycles (×10<sup>3</sup>) | Frequency (MHz) | ATP <sup>2</sup> (ENS×ms) |
|-------------------------------|-----------|----------|--------|------|-------------------|------|-------|------------------|--------------------------|-----------------|---------------------------|
| Xing et al. [11] <sup>3</sup> | Kyber512  | Artix-7  | 7353   | 4633 | 2173              | 2    | 3     | 2973             | 6.7                      | 206             | 96.7                      |
| Kamucheka et al. [8]          | Kyber512  | Virtex-7 | 143112 | -    | 81746             | 60   | 294   | 146546           | 126.6                    | 100             | $186 \times 10^{3}$       |
| Jati et al. [10]              | Kyber512  | Artix-7  | 7151   | 3730 | 2260 <sup>4</sup> | 2    | 5.5   | 3560             | 57.2                     | 258             | 789.3                     |
| This work                     | Kyber512  | Artix-7  | 8143   | 5151 | 2433              | 2    | 3     | 3233             | 6.7                      | 206             | 105.2                     |
| Xing et al. [11] <sup>3</sup> | Kyber768  | Artix-7  | 7353   | 4633 | 2173              | 2    | 3     | 2973             | 10.0                     | 206             | 144.3                     |
| Moraitis et al. [9]           | Kyber768  | Artix-7  | 14341  | 9190 | 4734 <sup>4</sup> | ≥2   | 6     | ≥6134            | 10.0                     | ≤206            | ≥297.8                    |
| This work                     | Kyber768  | Artix-7  | 8143   | 5151 | 2433              | 2    | 3     | 3233             | 10.0                     | 206             | 156.9                     |

- 1. ENS (equivalent number of slices) = $\#Slices + 100 \times \#DSPs + 200 \times \#BRAMs$ [17].
- 2. ATP (area-time product) = ENS $\times$ Time (ms) = ENS $\times$ Cycles $\times$ 1/Frequency $\times$ 10<sup>3</sup>.
- 3. The unprotected work was originally evaluated using Artix-7 XC7A12T. We reevaluate it using Artix-7 XC7A100T for a fair comparison.
- 4. The number of Slices is approximately computed by using $0.25 \times \text{\#LUTs} + 0.125 \times \text{\#FFs}$ [17].

To better assess the improvement in side-channel security after applying the protective measures, we conduct correlation power analysis (CPA) on PWM and test vector leakage assessment (TVLA) on Kyber's decryption procedure. In the experiments, the Pearson correlation coefficient and Welch's t-test are used as the metrics in CPA and TVLA, respectively. We run our experiments for Kyber768 on a target board containing an Artix-7 XC7A100T FPGA.
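The TVLA metric can be sketched as follows. This is a minimal illustration that assumes the common fixed-versus-random trace partition, which this brief does not spell out; the function names and the synthetic traces are hypothetical, and only the $\pm4.5$ threshold is taken from Fig. 4.

```python
from math import sqrt
from statistics import mean, variance

THRESHOLD = 4.5  # |t| bound marked by the dashed lines in Fig. 4


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def leaky_points(fixed_traces, random_traces, threshold=THRESHOLD):
    """Return the sample indexes whose |t| exceeds the threshold."""
    leaks = []
    for p in range(len(fixed_traces[0])):
        t = welch_t([tr[p] for tr in fixed_traces],
                    [tr[p] for tr in random_traces])
        if abs(t) > threshold:
            leaks.append(p)
    return leaks


# Synthetic traces with two sample points: point 0 is identical in both sets,
# point 1 has a data-dependent offset in the fixed set.
fixed_set = [[i % 2, 10 + i % 2] for i in range(100)]
random_set = [[i % 2, i % 2] for i in range(100)]
detected = leaky_points(fixed_set, random_set)
```

A design passes the assessment when no sample point exceeds the threshold for the number of traces collected.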

We target the registers at the output ports of the multipliers and use the Hamming distance model to simulate the energy consumption resulting from the dynamic flipping of these registers, similar to the methodology employed in [5]. Finally, the correlation between the assumed energy values and the collected energy values is calculated to identify the correct keys. As shown in Fig. 4, all experiments achieved the expected results, *i.e.*, the proposed shuffling countermeasure with its huge permutation space can significantly improve the side-channel security of Kyber's hardware implementation. Note that we omit the CPA results of modular reduction and subtraction for simplicity, as they adopt the same countermeasure and lead to similar conclusions.
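The CPA procedure can be illustrated on synthetic data as sketched below. This is a simplified model under several assumptions, not the measurement setup of this brief: the targeted register is assumed to hold the modularly reduced product of one secret coefficient and one known ciphertext coefficient, its previous value is assumed known to the attacker, the traces are simulated with Gaussian noise, and only a small set of key candidates is searched. A real attack would enumerate all values in $\mathbb{Z}_q$ and use measured traces.

```python
import random

Q = 3329  # Kyber modulus


def hamming_distance(a, b):
    """Number of bit positions that flip when a register changes from a to b."""
    return bin(a ^ b).count("1")


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cpa(prev_values, inputs, samples, candidates):
    """Return the key guess whose predicted leakage correlates best with samples."""
    best, best_rho = None, -1.0
    for guess in candidates:
        model = [hamming_distance(p, (guess * u) % Q)
                 for p, u in zip(prev_values, inputs)]
        rho = abs(pearson(model, samples))
        if rho > best_rho:
            best, best_rho = guess, rho
    return best, best_rho


rng = random.Random(1)
secret = 77                                             # hypothetical key coefficient
inputs = [rng.randrange(Q) for _ in range(500)]         # known ciphertext coefficients
prev_values = [rng.randrange(1 << 12) for _ in range(500)]
samples = [hamming_distance(p, (secret * u) % Q) + rng.gauss(0, 0.5)
           for p, u in zip(prev_values, inputs)]        # simulated leakage
recovered, rho = cpa(prev_values, inputs, samples, range(1, 128))
```

Shuffling weakens this attack because the sample point at which a given coefficient is processed changes from trace to trace, so the predicted and measured values no longer line up at a fixed point in time.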

### B. Comparisons on Area and Performance

Table II shows the experimental results and a comparison of Kyber's hardware implementations with different protection schemes, including the area and time performance. Compared to the unprotected Kyber implementation [11], the proposed protected design consumes only 260 extra Slices. The numbers of DSPs and BRAMs are both unchanged. Compared with the unprotected design, the ATP of the proposed version (and likewise its area, since the cycle count and frequency are unchanged) increases by only 8.7%.
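The 8.7% figure can be reproduced from the Table II entries with the ENS and ATP definitions given in the table footnotes. The short sketch below does this for the Kyber512 rows; the function names are chosen here for illustration.

```python
def ens(slices, dsps, brams):
    """Equivalent number of slices: #Slices + 100 x #DSPs + 200 x #BRAMs."""
    return slices + 100 * dsps + 200 * brams


def atp(slices, dsps, brams, kilo_cycles, freq_mhz):
    """Area-time product in ENS x ms."""
    time_ms = kilo_cycles / freq_mhz   # (kilo_cycles * 1e3) / (freq_mhz * 1e6) s, in ms
    return ens(slices, dsps, brams) * time_ms


baseline = atp(2173, 2, 3, 6.7, 206)    # Xing et al. [11], Kyber512
protected = atp(2433, 2, 3, 6.7, 206)   # this work, Kyber512
overhead = protected / baseline - 1

print(f"{baseline:.1f} {protected:.1f} {overhead:.1%}")  # prints 96.7 105.2 8.7%
```

Since the cycle count and frequency are identical for both designs, the ATP overhead equals the ENS overhead, 3233/2973 - 1.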

We also make comparisons with previous protected Kyber hardware implementations. The masked design proposed in [8] has tremendous area and time consumption, with an ATP more than three orders of magnitude larger than ours. The design in [10] has a relatively small area but requires far more clock cycles, so its ATP is also much worse than ours. In contrast, the design in [9] has the same number of clock cycles, but its area is almost twice that of ours.

## V. CONCLUSION

In this brief, we have devised an optimized shuffling architecture against SCAs for Kyber's hardware implementation. The experimental results show that the proposed design effectively improves the side-channel security and incurs only an 8.7% degradation in hardware efficiency compared with the original unprotected version, much better than existing hiding schemes.

## VI. DISCUSSION

For lattice-based KEMs and signature algorithms, NTT-based polynomial multiplication serves as the most basic operation, and these schemes share similar security foundations, which makes the proposed method applicable to other lattice-based schemes. In addition, guarding against FIAs is as important as guarding against SCAs. However, this brief has so far only protected the Kyber algorithm and has not yet integrated error detection schemes. Future work will apply the proposed method to other promising algorithms such as Dilithium and Raccoon, and consider combining it with error detection schemes.

## REFERENCES

- [1] G. Alagic, D. Apon, D. Cooper, Q. Dang, T. Dang, J. Kelsey, J. Lichtinger, C. Miller, D. Moody, R. Peralta *et al.*, "Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process," *US Department of Commerce, NIST*, 2022.
- [2] J. Zhang, J. Lu, A. Li, M. Wang, X. Li, T. Huang, L. Chen, and D. Liu, "Super K: A Superscalar CRYSTALS-KYBER Processor Based on Efficient Arithmetic Array," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, 2024.
- [3] H. Kim, H. Jung, A. Satriawan, and H. Lee, "A Configurable ML-KEM/Kyber Key-Encapsulation Hardware Accelerator Architecture," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, 2024.
- [4] T. Tosun and E. Savas, "Zero-Value Filtering for Accelerating Non-Profiled Side-Channel Attack on Incomplete NTT-Based Implementations of Lattice-based Cryptography," *IEEE Trans. Inf. Forensics Security*, 2024.
- [5] Y. Zhao, S. Pan, H. Ma, Y. Gao, X. Song, J. He, and Y. Jin, "Side Channel Security Oriented Evaluation and Protection on Hardware Implementations of Kyber," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 70, no. 12, pp. 5025–5035, 2023.
- [6] Z. Ni, A. Khalid, W. Liu, and M. O'Neill, "Bitstream Fault Injection Attacks on CRYSTALS Kyber Implementations on FPGAs," in *Proc. Des., Automat., Test Europe Conf. Exhib. (DATE)*, pp. 1-6, 2024.
- [7] A. Sarker, A. C. Canto, M. M. Kermani, and R. Azarderakhsh, "Error Detection Architectures for Hardware/Software Co-Design Approaches of Number-Theoretic Transform," *IEEE Trans. Comput. Aided Design Integr. Circuits Syst.*, vol. 42, no. 7, pp. 2418–2422, 2022.
- [8] T. Kamucheka, A. Nelson, D. Andrews, and M. Huang, "A Masked Pure-Hardware Implementation of Kyber Cryptographic Algorithm," in *Proc. Int. Conf. on Field-Program. Technol. (FPT)*, 2022.
- [9] M. Moraitis, Y. Ji, M. Brisfors, E. Dubrova, N. Lindskog et al., "Securing CRYSTALS-Kyber in FPGA Using Duplication and Clock Randomization," *IEEE Design & Test*, 2023.
- [10] A. Jati, N. Gupta, A. Chattopadhyay, and S. K. Sanadhya, "A Configurable CRYSTALS-Kyber Hardware Implementation with Side-Channel Protection," *ACM Trans. Embed. Comput. Syst.*, vol. 23, no. 2, pp. 1–25, 2024.
- [11] Y. Xing and S. Li, "A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA," *IACR Trans. Cryptograph. Hardw. Embedded Syst.*, pp. 328–356, 2021.
- [12] P. Ravi, R. Poussier, S. Bhasin, and A. Chattopadhyay, "On Configurable SCA Countermeasures Against Single Trace Attacks for the NTT: A Performance Evaluation Study over Kyber and Dilithium on the ARM Cortex-M4," in *Proc. Int. Conf. Secur., Privacy, Appl. Cryptogr. Eng.* (SPACE), pp. 123–146, 2020.
- [13] Z. Chen, Y. Ma, and J. Jing, "Low-Cost Shuffling Countermeasures Against Side-Channel Attacks for NTT-Based Post-Quantum Cryptography," *IEEE Trans. Comput. Aided Design Integr. Circuits Syst.*, vol. 42, no. 1, pp. 322–326, 2022.
- [14] T. Zijlstra, K. Bigou, and A. Tisserand, "FPGA Implementation and Comparison of Protections Against SCAs for RLWE," in *Proc. Int. Conf. Cryptol. India*, pp. 535–555, 2019.
- [15] A. G. Bayrak, N. Velickovic, P. Ienne, and W. Burleson, "An Architecture-Independent Instruction Shuffler to Protect Against Side-Channel Attacks," *ACM Trans. Archit. Code Optim.*, vol. 8, no. 4, pp. 1–19, 2012.
- [16] M. Hamburg, J. Hermelink, R. Primas, S. Samardjiska, T. Schamberger, S. Streit, E. Strieder, and C. van Vredendaal, "Chosen Ciphertext k-Trace Attacks on Masked CCA2 Secure Kyber," *IACR Trans. Cryptograph. Hardw. Embedded Syst.*, pp. 88–113, 2021.
- [17] M. Li, J. Tian, X. Hu, and Z. Wang, "Reconfigurable and High-Efficiency Polynomial Multiplication Accelerator for CRYSTALS-Kyber," *IEEE Trans. Comput. Aided Design Integr. Circuits Syst.*, vol. 42, no. 8, pp. 2540–2551, 2023.