

## Università di Pisa

Master of Science in Computer Engineering

### Project Report

Full Hash Algorithm (AES S-Box)

Hardware and Embedded Security

Tommaso Billi

 $A cademic\ Year\ 2020/2021$ 

## Contents

| 1 | Specification analysis                                                | 2          |  |  |  |  |  |  |  |  |
|---|-----------------------------------------------------------------------|------------|--|--|--|--|--|--|--|--|
| 2 | Design choices and block diagrams2.1 Design choices2.2 Block diagrams | <b>4</b> 4 |  |  |  |  |  |  |  |  |
| 3 | Expected waveforms                                                    | 7          |  |  |  |  |  |  |  |  |
| 4 | Testbenches                                                           |            |  |  |  |  |  |  |  |  |
| 5 | Implementation of RTL design on FPGA and results                      | 11         |  |  |  |  |  |  |  |  |
| 6 | Static Timing Analysis                                                |            |  |  |  |  |  |  |  |  |

#### Specification analysis

The aim of the project was to develop a hardware implementation of a hash module based on the Rijndael S-box. In figure 1.1 are reported the full specifications of the module to implement.

Design the following hash module based on the S-box of AES algorithm. The hash algorithm generates a 64-bit digest formed by the concatenation of 8 bytes H[i], from H[0] up to H[7], being H[0] the MSB. For each message the H[i] variables are initialized with the following values:

|             | H[0]  | <i>H</i> [1] | H[2]  | <b>H</b> [3] | <i>H</i> [4] | <i>H</i> [5] | <i>H</i> [6] | <i>H</i> [7] |
|-------------|-------|--------------|-------|--------------|--------------|--------------|--------------|--------------|
| Init. value | 8'h20 | 8'hA3        | 8'h9F | 8'h3E        | 8'hCC        | 8'h74        | 8'hCB        | 8'hF0        |

The, for each byte M of the input message (i.e. the 8-bit ASCII code of a message character) the hash module performs the following operation:

for 
$$(r = 0; r < 64; r + +)$$
  
for  $(i = 0; i < 8; i + +)$   
 $H[i] = S((H[(i + 1) \mod 8] \oplus M) \ll i)$ 

where

mod n is the modulo operator by n.

is the XOR operator.

 $X \ll n$  is the left circular shift by n bits.

S( ) is the S-box transformation of AES algorithm, that works over a byte.

Once the last message byte has been processed, the hash module performs the following operation:

$$for(i = 0; i < 8; i + +)$$
  
 $H[i] = S((H[(i + 1) mod 8] \oplus C[i]) \ll i)$ 

where C[i] is the i<sup>th</sup> byte, for i=0,1,2,...,7, of the counter C which reports the byte length of message. Note that C reports the real number of byte length, i.e. if the message length is 1 byte, then C=1 (not 0). The value of counter C has to be provided as input.

Figure 1.1: Project specifications.

Along with the specifications, a reference design that implements a similar hashing function based on the DES S-box was also provided.

The main differences of the AES S-box hashing module and the reference design are the following:

• The number of iterations of the outer-most round within the main round of the algorithm is 64, compared to 4 in the reference design;

- The main computation performed at each iteration, both in the main round and in the final round is substantially different;
- The produced digest is on 64 bits (formed by the concatenation of 8 bytes), instead of 32 bits (concatenation of 8 4-bit nibbles);
- The hash module uses the Rijndael (AES) S-box, instead of the DES S-box.

The reference design is made out of 5 different submodules (5 individual SystemVerilog files). More specifically, the submodules are the following:

- des\_sbox.sv: as the name states, it implements DES's S-box as a  $4 \times 16$  LUT.
- hash\_main\_round.sv: this module computes the inner-most operation within the main iteration of the algorithm. All 8 4-bit vectors are computed concurrently within this module.
- hash\_main\_iteration.sv: the module is a wrapper for all operations that are performed during the first step of the algorithm (i.e. it encapsulates the rounds of the main digest computation algorithm). It does so by cascading 4 main round modules. More specifically, the input of the first round is used as an output for the second, and so on.
- hash\_final\_round.sv: this submodule implements the last step of the hashing operation, performed after the last byte has been provided.
- full\_hash\_des\_top.sv: this module is a wrapper for all other submodules that make up the final circuit. It takes as input the clock signal, an asynchronous reset, the current byte to process, the M\_valid flag, and the count of the total bytes of the message (on a 64-bit register). It outputs the digest, and a digest\_ready flag, to indicate that the computation has ended for the current message.

The reference design is implemented as a finite state machine, to model the switching between the the different phases of the hashing algorithm. Three states have been defined: init\_state (the first state that is entered as soon as a new message is provided; the circuit is initialized with the provided initialization vector), compute\_state (the main state of the computation; after the initialization, when in this state, the module performs all the main round iterations), final\_state (once the last byte of the message has been processed, the circuit enters in this final state, in which the output of the main round goes through the final round submodule; at the end of this step, the final digest is presented at the output of the circuit).

Additionally, there exists also a fourth "hidden" state: when the M\_valid bit is set to 0, the module enters an idle state in which no computations are performed, and the circuit pauses every operation, waiting for a new valid input to be provided.

To make sure the requirements were clear, as an initial step a Python script that emulates the design was developed. Such prototype provides the hash computed according to the specifications for any given input. A modular design resembling the reference model was used in the Python script as well, in order to better separate the stages of the hashing algorithm. The results of such prototype proved to be extremely useful in the debugging and testing phases of the development.

#### Design choices and block diagrams

#### 2.1 Design choices

The first step of the design phase was the analysis of the reference design, that is briefly explained in the previous chapter.

The idea was to understand how the reference design was implemented in order to reuse as many components from it as possible, and to develop a variation of the design that made it compliant with this project's requirements.

Apart from the S-box as a LUT implementation, that had to be re-written from scratch, many elements of the original design could be reused. The partitioning in submodules was also coherent with the specifications of this project, so the overall structure was kept substantially unchanged.

As a design choice, the reference project's idea of sampling each input byte at every clock cycle was maintained. As opposed to the reference design, however, where only 4 iterations of the main round are performed, according to the specifications of this project, the main round needs to go through 64 iterations. In order to perform all these operations within a single clock cycle, the final circuit will have a long critical path formed by a cascade of hash\_main\_round modules. The practical implications of such design choice will be discussed more in detail in chapter 6.

#### 2.2 Block diagrams

In this section are reported the block diagrams of some key components of the circuit. The diagrams were exported from the RTL viewer tool of Quartus, after having successfully synthesized the design. As the final RTL design is extremely complex and hard to read, only some of the most significant sections of the circuit are shown.

In figure 2.1 we can see the H\_main registers (8 in total, one for each byte of the digest) and the M\_r register. They hold, respectively, the value of the digest computed at the previous iteration and the byte (ASCII character representation) that is currently being processed. Such registers form the inputs to the cascade of hash\_main\_round modules (in the picture, the first two are shown).



Figure 2.1: Circuit main iteration.

In figure 2.2 we can see the logic behind the counter of the circuit. The counter holds the value of the number of bytes that still need to be processed in the current message. If the M\_valid bit is set, the counter is decremented by 1 at the next clock cycle. Once the counter reaches a value of 0, the final\_state is triggered.



Figure 2.2: Counter logic.

Finally, in figure 2.3, it can be observed that when the final\_state is reached, the output from the final round module (whose inputs are the outputs of the main iteration, and a register which stores the message length) is propagated to the digest\_out register, while at the same time the hash\_ready register is set to 1.



Figure 2.3: Circuit's final round.

#### Expected waveforms

In this chapter the different expected waveforms associated with different scenarios are going to be addressed and discussed.

For the first example, shown in figure 3.1, one of the two testbenches (described in chapter 4) will be taken into account.

In this case, the string "Hello" is provided as input to the model, with each byte separated from one another by a number of clock cycles between 1 and 5. As soon as the first valid input is provided, the circuit performs the initialization round, computes the result of the main iteration for the first character, decrements the counter, and then waits for the next valid input. Once the last character of the input message has been processed, the output of the final\_round submodule is presented at the output of the circuit at the next clock cycle, and the hash\_ready flag is enabled.



Figure 3.1: Waveform with non-sequential input characters.

Figure 3.2 shows a possible waveform when a continuous stream (one character at every clock cycle) is provided as input. The situation is basically the same as before: at every round the same partial results are obtained, and in the end, since the same input string was provided, the same digest is obtained in the end. The only difference is that, once asserted, the M\_valid bit always remains set to 1, and thus the circuit never enters into the idle state.



Figure 3.2: Waveform with a continuous stream of characters.

Another interesting case to analyze is the one in which an empty string is provided as input. In this case, the counter is already set to zero, so the circuit never enters into the compute\_state phase. The only operations that it needs to perform are the initialization and the final round; the output is ready after two clock cycles (fig. 3.3).



Figure 3.3: Waveform with an empty string as input.

With this in mind, it can be said that the overall latency L of the module (i.e. the delay between an input is provided to the circuit, and the corresponding output is computed), when providing a continuous stream of data as input, is

$$L = T_{clk} \times (N_b + 2),$$

where  $N_b$  is the number of bytes in the input message. The final addend of two is accounting for the two additional clock cycles needed for the first input character to be available on the M\_r register, and for the final digest to be ready on the digest\_out register, respectively.

#### **Testbenches**

In order to test the RTL design, a SystemVerilog testbench and a modified version of the Python script that was initially used for prototyping were developed.

The Python script, which can be found inside the scripts/ directory, asks the user the input message, computes the hash, and outputs the digest on screen. By providing the "-w" option when launching the script, the script also produces two files, that by default will be created in the tb/test\_files/ directory.

The first file (message.txt) will contain the entire input for which the user wishes to compute the hash, as well as the length of the message itself, that is a mandatory input parameter to the circuit.

The second file (target.txt) will contain instead the hexadecimal (lowercase) string of the computed hash.

These files are then used for the System Verilog testbench itself. Please note that both these files are required for the test bench to run properly; the Python script with the "-w" option specified should be run at least once, prior to running the test bench itself.

From the first file, the length of the message is initially read, and C\_in (the 64-bit register on which the message length is stored) is set accordingly. Then, at each clock cycle, a single character of the message is read, until the end of the file is reached. After an additional clock cycle after the last character is read, the final digest is computed.

Once this operation is completed, the second file is opened, and the hash computed by the circuit is compared to the one that is found in the target.txt file. If the two digests match, the test passes, and the console will output "Success!"; otherwise, an error message will be printed.

Many different inputs were tried during the test phase, to ensure the correct functionality even in some edge cases. In particular, it was verified:

- that an identical input message would always produce the same hash;
- that the circuit correctly computed digests for long messages (500+ characters);
- that the digest was computed even for an empty message (0 characters long input).

In a second testbench, the correct behavior of the circuit when non-contiguous inputs were provided was tested (i.e. not a character to process per clock cycle).

The input string "Hello" was tested, providing one byte at a time, and then waiting for a random number of clock cycles (between 1 and 5) before providing the next

character. In order to generate a different random sequence at each simulation, the seed value (line 45) must be changed before launching the simulation.

In this case, it is important to set the M\_valid bit to 0 after providing each input byte, to let the module know that subsequent inputs have to be ignored, and not interpreted as repeated characters. If everything is done correctly, the circuit enters into an idle state, where neither the counter or the output are updated, and the waveform is the one seen in figure 3.1.

The test is considered passed if the final hash is always the same, even in different clock wait cycles.

# Implementation of RTL design on FPGA and results

Once the implementation was completed and adequately tested, the last step of the project was to use the Quartus tool by Intel to try to synthesize the RTL design on an actual FPGA. In this case, the 5CGXFC9D6F27C7 device of the Cyclone V family was chosen.

At the end of the processing, the final synthesis report that was produced is the one shown in figure 5.1.



Figure 5.1: Synthesis report of the design.

From the flow summary, it can be seen that there is a total utilization of 140 pins (1 real pin, and 139 virtual pins). This is coherent with the design, as each pin (excluding the clock) corresponds to a single bit of the inputs and outputs of the design. Since the input and output ports of the final top-level design is the following:

```
module full_hash_aes_top (
input clk

input rst_n

input M_valid

input [63:0] C_in

input [7:0] M

output reg hash_ready

output reg [63:0] digest_out

);
```

it can be said that this result reflects the initial expectations.

#### Static Timing Analysis

After making sure that the synthesis process could be performed without any errors, the final step was the timing analysis of the synthesized design.

An initial Synopsys Design Constraints (.sdc) file was created, defining a clock period constraint of 10 ns (100 MHz frequency), and constraining all inputs and outputs with a minimum and maximum delay. Such values were chosen using the empirical rule of picking the 10% of the clock period as the minimum delay, and the 20% of it as the maximum.

Said requirements were not expected to be met by the design, however; as all computations, including the 64 iterations of the first round, are computed in a single clock cycle, the critical path (i.e. the longest possible register-to-register path) is very long. This implies a considerably high propagation delay, that severely limits the maximum clock frequency of the design. The purpose of this was to get an idea of the performance of the synthesized design for an FPGA.

In the following figure the timing report summary for the worst case (Slow 1100mV 85C model) is reported.



Figure 6.1: Timing report summary  $(F_{clk} = 100 \text{ MHz})$ .

The negative slack (in red) represents the margin by which the timing requirements are

not met. Furthermore, in the first tab it can be observed that the maximum frequency for the clock is 5.7 MHz, which is far from the 100 MHz constraint that was defined. Additionally, it is worth mentioning that this analysis was performed prior to assigning the virtual pins of the FPGA device.

After these considerations, the .sdc file was modified, specifying a clock period of 200 ns (i.e. a 5 MHz clock frequency). All the other constraints regarding the inputs and the outputs were also modified accordingly. Before re-launching the analysis, the virtual pins were assigned as follows:

|   | Status      | From | То                   | Assignment Name | Value | Enabled | Entity            |
|---|-------------|------|----------------------|-----------------|-------|---------|-------------------|
| 1 | ✓ Ok        |      | ablash_ready         | Virtual Pin     | On    | Yes     | full_hash_aes_top |
| 2 | <b>✓</b> Ok |      | <mark>- rst_n</mark> | Virtual Pin     | On    | Yes     | full_hash_aes_top |
| 3 | <b>✓</b> Ok |      | 跲 C_in               | Virtual Pin     | On    | Yes     | full_hash_aes_top |
| 4 | <b>✓</b> Ok |      | <b>№</b> M           | Virtual Pin     | On    | Yes     | full_hash_aes_top |
| 5 | <b>✓</b> Ok |      |                      | Virtual Pin     | On    | Yes     | full_hash_aes_top |
| 6 | <b>✓</b> Ok |      | L M_valid            | Virtual Pin     | On    | Yes     | full_hash_aes_top |

Figure 6.2: Virtual pins assignments.

Once the new Synopsys Design Constraints file and the virtual pins were assigned, the Timing Analysis was launched a second time. The operation was successfully carried out, and the specified constraints were respected. No parts of the Timing Analyzer summary were highlighted in red, and success was confirmed by the Timing Closure Recommendations tab, which reported: "This design does not contain any failing setup paths. The worst-case slack is 13.343 ns.", and "No paths fail setup timing.".

The final results are reported in the following figure (6.3).



Figure 6.3: Timing analysis summary for Slow and Fast models (0 and 85C).

As a final consideration, it is worth mentioning that the low maximum clock frequency is due to the design choice of letting the user provide a character at every clock cycle. As previously stated, this choice creates a long critical path, that in turn is responsible for a considerable propagation delay for the circuit.

An alternative design to obtain a higher maximum clock frequency would have been to implement the 64 iterations of the main round in more stages (i.e. implement a pipeline by placing registers between one or more rounds of the main iteration). While increasing the maximum clock frequency, this would also increase the latency of the circuit. For example, placing a register every 2 or 4 rounds would require 32 or 16 clock cycles, respectively, for each input byte to be processed. In this case, the latency formula (assuming a constant input stream) would become the following:

$$L = T_{clk} \times (N_b + N_s + 2),$$

where  $N_s$  is the number of stages:  $N_s = 64 \div (\# \text{ rounds per stage})$ .