## Technical Report: A Market Based Framework for Manycore Processors with Accelerators

Rajshekar Kalayappan and Smruti R. Sarangi, Department of Computer Science & Engineering, Indian Institute of Technology, New Delhi, India. {rajshekark,srsarangi}@cse.iitd.ac.in

| Parameter                   | Value      | Parameter               | Value                |
|-----------------------------|------------|-------------------------|----------------------|
| System Configuration        |            |                         |                      |
| Cores                       | 24         |                         |                      |
| Accelerators                | 24         | Accelerator Types       | 6                    |
| Technology                  | 14 nm      | Frequency               | 3.4 GHz              |
| Shared Elements             |            |                         |                      |
| Shared L3 LLC               |            |                         |                      |
| Configuration               | NUCA       | Number of Banks         | 32                   |
| Write-mode                  | Write-back | Block size              | 64 bytes             |
| Associativity               | 8          | Bank Size               | 2 MB                 |
| Bank Latency                | 30 cycles  |                         |                      |
| Main Memory Latency         | 200 cycles |                         |                      |
| NOC and Traffic             |            |                         |                      |
| Topology                    | 2-D Torus  | Routing Alg.            | dynamic X-Y routing  |
| Flit size                   | 32 bytes   | Hop-latency             | 1 cycle              |
| Router-Latency              | 3 cycles   | Average Node Distance   | 5 hops               |
| General Purpose             |            |                         |                      |
| Pipeline                    |            |                         |                      |
| Retire Width                | 4          | Integer RF (phy)        | 160                  |
| Issue Width                 | 6          | Float RF (phy)          | 160                  |
| ROB size                    | 168        | Predictor               | Tournament (PAG-PAP) |
| IW size                     | 54         | Branch mispred. penalty | 14 cycles            |
| LSQ size                    | 64         | Multi-threading         | 4-way                |
| iTLB                        | 128 entry  | dTLB                    | 128 entry            |
| Integer ALU                 | 4 units    | Int ALU latency         | 1 cycle              |
| Integer Mul                 | 1 unit     | Int Mul latency         | 2 cycles             |
| Integer Div                 | 1 unit     | Int Div latency         | 4 cycles             |
| Float ALU                   | 2 units    | FP ALU latency          | 2 cycles             |
| Float Mul                   | 1 unit     | FP Mul latency          | 4 cycles             |
| Float Div                   | 1 unit     | FP Div latency          | 8 cycles             |
| Private L1 i-cache, d-cache |            |                         |                      |
| Write-mode                  | Write-back | Block size              | 64 bytes             |
| Associativity               | 8          | Size                    | 32 kB                |
| Latency                     | 4 cycles   |                         |                      |
| Private L2 Unified Cache    |            |                         |                      |
| Write-mode                  | Write-back | Block size              | 64 bytes             |
| Associativity               | 8          | Size                    | 256 kB               |
| Latency                     | 12 cycles  |                         |                      |
| Cryptographic Circuitry     |            |                         |                      |
| XOR Encryption              | 4 cycles   | SHA Encryption          | 41.04 ns [2]         |

Table I SIMULATION PARAMETERS
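
Two derived quantities follow directly from Table I and help when reading the later tables. The sketch below computes the total LLC capacity and a rough average NoC message latency. The per-hop cost model (one router traversal plus one link traversal per hop) and all function names are assumptions made for illustration; the report does not state how message latencies were derived.

```python
# Parameters copied from Table I.
LLC_BANKS = 32
LLC_BANK_SIZE_MB = 2
AVG_NODE_DISTANCE_HOPS = 5
ROUTER_LATENCY_CYCLES = 3
HOP_LATENCY_CYCLES = 1


def llc_capacity_mb(banks=LLC_BANKS, bank_size_mb=LLC_BANK_SIZE_MB):
    """Total shared L3 capacity in MB."""
    return banks * bank_size_mb


def avg_noc_latency_cycles(hops=AVG_NODE_DISTANCE_HOPS,
                           router=ROUTER_LATENCY_CYCLES,
                           link=HOP_LATENCY_CYCLES):
    """ASSUMED model: every hop pays one router and one link traversal."""
    return hops * (router + link)


if __name__ == "__main__":
    print(llc_capacity_mb())         # 64
    print(avg_noc_latency_cycles())  # 20
```

Under this assumed model an average message costs 20 cycles, which coincides with the 20-cycle point-to-point messages listed in Table II.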

The time taken to compute the MPP was obtained by simulating the operation on Tejas [1].
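
The cycle counts and times in Table II are related through the 3.4 GHz clock of Table I, and the 2-round total is the offer phase plus two bidding rounds plus finalization. A minimal sketch of that arithmetic follows; the function and constant names are hypothetical, and the phase sums are simply copied from Table II.

```python
FREQUENCY_GHZ = 3.4  # core frequency from Table I

# Phase sums in cycles, copied from Table II.
OFFER_PHASE_CYCLES = 411
BIDDING_ROUND_CYCLES = 104
FINALIZATION_CYCLES = 578


def cycles_to_ns(cycles, freq_ghz=FREQUENCY_GHZ):
    """Convert a cycle count to nanoseconds (1 GHz is 1 cycle per ns)."""
    return cycles / freq_ghz


def auction_cycles(rounds):
    """Total auction cost in cycles for a given number of bidding rounds."""
    return (OFFER_PHASE_CYCLES
            + rounds * BIDDING_ROUND_CYCLES
            + FINALIZATION_CYCLES)


if __name__ == "__main__":
    print(round(cycles_to_ns(367), 2))                # 107.94 (MPP computation)
    print(auction_cycles(2))                          # 1197
    print(round(cycles_to_ns(auction_cycles(2)), 1))  # 352.1
```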

| Task                                  | Cycles | Time (ns) |
|---------------------------------------|--------|-----------|
| Offer Announcement Phase              |        |           |
| host calculates MPP, adds QoE         | 367    | 107.94    |
| host → AC: Offer                      | 20     | 5.88      |
| AC book-keeping                       | 24     | 7.06      |
| AC ⇒ guests: Offer Announcement       | 24     | 7.00      |
| Sum                                   | 411    | 120.88    |
| Bidding Phase                         |        |           |
| guest(s) calculates MRP               | 45     | 13.24     |
| guest(s) → AC: Bid                    | 35     | 10.29     |
| (accommodating congestion)            | 33     | 10.27     |
| AC book-keeping                       | 24     | 7.06      |
| AC ⇒ guests: Offer Announcement       |        | 7.00      |
| 1 Round Sum                           | 104    | 30.59     |
| Auction Finalization                  |        |           |
| AC → host: Final Auction Terms        | 20     | 5.88      |
| AC → guest: Final Auction Terms       | 20     | 3.00      |
| host → AC: Signed Terms               |        | 41.04     |
| guest → AC: Signed Terms              |        |           |
| AC verifies host sign                 |        | 41.04     |
| AC verifies guest sign                |        | 41.04     |
| AC → host: Signed Terms               |        | 41.04     |
| AC → guest: Signed Terms              |        | 41.04     |
| Sum                                   | 578    | 170.04    |
| Considering 2-round Auction (average) |        |           |
| Sum                                   | 1197   | 352.1     |

Table II TIMING THE AUCTION PROCESS

## REFERENCES

- [1] G. Malhotra, P. Aggarwal, A. Sagar, and S. R. Sarangi, "ParTejas: A parallel simulator for multicore processors," in *ISPASS*, 2014.
- [2] L. Dadda, M. Macchetti, and J. Owen, "The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)," in *Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings*, vol. 3, Feb 2004, pp. 70–75 Vol. 3.
- [3] W. Huang, K. Rajamani, M. Stan, and K. Skadron, "Scaling with design constraints: Predicting the future of big chips," *Micro, IEEE*, vol. 31, no. 4, pp. 16–29, July 2011.
- [4] L. Gwennap, "Sandy Bridge spans generations," *Microprocessor Report*, vol. 9, no. 27, pp. 10–01, 2010.
- [5] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "CACTI 5.1," HP Laboratories, Tech. Rep., 2008.
- [6] Z. Keija, X. Ke, W. Yang, and M. Hao, "A novel ASIC implementation of RSA algorithm," in *ASIC, 2003. Proceedings. 5th International Conference on*, vol. 2, Oct 2003, pp. 1300–1303 Vol. 2.
- [7] X. Dong, C. Xu, Y. Xie, and N. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," *Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on*, vol. 31, no. 7, pp. 994–1007, July 2012.

| Component                           | Formula                      | Area Estimate   | Details                                                                                                     |
|-------------------------------------|------------------------------|-----------------|-------------------------------------------------------------------------------------------------------------|
| Cores                               | $6.5\,mm^2 \times 24$        | $156.00\,mm^2$  | Area of different structures obtained by scaling [3] the area estimates of the Sandy Bridge processor [4]   |
| LLC (L3)                            | $14.62\,mm^2 \times 8$       | $116.96\,mm^2$  | 64 MB LLC                                                                                                   |
| Graphics + Video                    |                              | $14.62\,mm^2$   |                                                                                                             |
| North-Bridge                        |                              | $5.85\,mm^2$    |                                                                                                             |
| Processor Area Without Accelerators | $293.43\,mm^2 \times 100/81$ | $362.26\,mm^2$  | [4] lists the percentage areas of only the above components, which add up to 81% of the area                |
| FFT accelerators                    | $9.45\,mm^2 \times 4$        | $37.80\,mm^2$   |                                                                                                             |
| MD5 accelerators                    | $0.003\,mm^2 \times 4$       | $0.01\,mm^2$    |                                                                                                             |
| Sort accelerators                   | $0.07\,mm^2 \times 4$        | $0.28\,mm^2$    |                                                                                                             |
| JPEG accelerators                   | $0.004\,mm^2 \times 4$       | $0.02\,mm^2$    |                                                                                                             |
| KMP accelerators                    | $0.045\,mm^2 \times 4$       | $0.18\,mm^2$    |                                                                                                             |
| LP accelerators                     | $9.19\,mm^2 \times 4$        | $36.76\,mm^2$   |                                                                                                             |
| Total Area of the Base System       |                              | $437.31\,mm^2$  |                                                                                                             |
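
The base-system area can be reproduced from the entries above: the listed processor components are summed, scaled by 100/81 to account for the area that [4] does not break down, and the accelerator areas are added. The following sketch assumes four copies of each accelerator type, as the formulas in the table indicate; the names are illustrative only.

```python
# Areas in mm^2, copied from the table above.
PROCESSOR_COMPONENTS = {
    "cores": 6.5 * 24,
    "llc_l3": 14.62 * 8,
    "graphics_video": 14.62,
    "north_bridge": 5.85,
}
LISTED_FRACTION = 0.81  # share of the die covered by the listed components [4]

ACCELERATOR_UNIT_AREA = {
    "FFT": 9.45, "MD5": 0.003, "Sort": 0.07,
    "JPEG": 0.004, "KMP": 0.045, "LP": 9.19,
}
COPIES_PER_TYPE = 4


def processor_area_mm2():
    """Listed components scaled up to the full processor area."""
    return sum(PROCESSOR_COMPONENTS.values()) / LISTED_FRACTION


def accelerator_area_mm2():
    """Combined area of all accelerator instances."""
    return COPIES_PER_TYPE * sum(ACCELERATOR_UNIT_AREA.values())


def base_system_area_mm2():
    return processor_area_mm2() + accelerator_area_mm2()


if __name__ == "__main__":
    print(round(processor_area_mm2(), 2))    # 362.26
    print(round(base_system_area_mm2(), 2))  # 437.31
```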

| Component                                | Formula                                                                                  | Area Estimate           | Details |
|------------------------------------------|------------------------------------------------------------------------------------------|-------------------------|---------|
| Major Components of the Guest Meter      |                                                                                          |                         |         |
| SHA                                      |                                                                                          | $0.003\,mm^2$           | [2]     |
| XOR                                      |                                                                                          | $\approx 0.000\,mm^2$   |         |
| I/O hash                                 |                                                                                          | $0.011\,mm^2$           | 1 kB cache for look-up table |
| Memory for Latency Distribution          | $R \times T \times nbins \times 4$ bytes                                                 |                         | $T$: number of tasks a guest can run at the same time; $R$: number of different resources a guest makes use of; $nbins$: number of bins used |
| Memory for Moving Averages               | $R \times 4$ bytes                                                                       |                         |         |
| Memory for Nonce Sequence                | $1024 \times 1$ bytes                                                                    |                         |         |
| Memory for QoS                           | $T \times 4$ bytes                                                                       |                         |         |
| Memory for I/O hashes                    | $T \times 2 \times 1024$ bits                                                            |                         |         |
| Memory Total                             | Taking $R = 2$, $T = 2$, $nbins = 190$                                                   | $0.031\,mm^2$           | 8 kB cache (Cacti [5]) |
| Sum                                      |                                                                                          | $0.045\,mm^2$           |         |
| Major Components of the                  |                                                                                          |                         |         |
| XOR                                      |                                                                                          | $\approx 0.000\,mm^2$   |         |
| Memory for Nonce Sequence                | $1024 \times 1$ bytes                                                                    | $0.011\,mm^2$           | 1 kB cache (Cacti [5]) |
| Sum                                      |                                                                                          | $0.011\,mm^2$           |         |
| Major Components of the A                |                                                                                          |                         |         |
| SHA                                      |                                                                                          | $0.003\,mm^2$           | [2]     |
| XOR                                      |                                                                                          | $\approx 0.00\,mm^2$    |         |
| RSA                                      |                                                                                          | $0.014\,mm^2$           | [6]     |
| Memory for outstanding auctions/jobs     | $N \times (4+8+20+20+12+8)$ bytes, taking $N = 112$                                      | $0.031\,mm^2$           | 8 kB cache (Cacti [5]); $N$: number of outstanding jobs; `price_bytes` = 4, `qos_bytes` = 8, `qoi_bytes` = 20, `qoo_bytes` = 20, `qoe_bytes` = 12, `outstanding_bid_bytes` = 8 |
| Non-volatile Memory for Auction Logs     | $ALN \times (64+8+12+2 \times 128+2 \times 64)$ bytes, taking $ALN = 8960$               |                         | $ALN$: number of auctions logged; `auction_term_bytes` = 64, `qos_log_bytes` = 8, `qoe_log_bytes` = 12, `I/O_hashes_log_bytes` = 2 × 128, `host_guest_signatures_bytes` = 2 × 64 |
| Non-volatile Memory for Account Balances | $(num\_guests + 1) \times 4$ bytes, taking $num\_guests = 24$                            |                         | +1 for host |
| Non-volatile Memory Total                |                                                                                          | $0.664\,mm^2$           | 4 MB 32 nm ReRAM [7] |
| Sum                                      |                                                                                          | $0.712\,mm^2$           |         |
| Major Components of the Gateway          |                                                                                          |                         |         |
| Memory for Access Lists                  |                                                                                          | $0.011\,mm^2$           | 1 kB cache (Cacti [5]) |
| Sum                                      |                                                                                          | $0.011\,mm^2$           |         |
| Total Area of Additional Hardware        | Taking $num\_guests = 24$                                                                | $2.320\,mm^2$           |         |
| Area Overhead                            |                                                                                          | 0.53%                   |         |

Table IV AREA ESTIMATION OF AUXILIARY STRUCTURES
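
The memory sizes and the final overhead figure in Table IV can be checked with a few lines of arithmetic. The sketch below evaluates the memory formulas with the stated parameter values and then combines the per-block sums into the total. The replication counts (24 copies each of the guest meter, of the second block, and of the gateway, plus one copy of the third block) are an assumption chosen because it reproduces the reported $2.320\,mm^2$; the table itself does not state them. All names are hypothetical.

```python
def guest_meter_memory_bytes(R=2, T=2, nbins=190):
    """Sum of the guest meter memory formulas, in bytes."""
    latency_distribution = R * T * nbins * 4
    moving_averages = R * 4
    nonce_sequence = 1024 * 1
    qos = T * 4
    io_hashes = (T * 2 * 1024) // 8  # this formula is given in bits
    return (latency_distribution + moving_averages
            + nonce_sequence + qos + io_hashes)


def outstanding_jobs_memory_bytes(N=112):
    return N * (4 + 8 + 20 + 20 + 12 + 8)


def non_volatile_memory_bytes(ALN=8960, num_guests=24):
    auction_logs = ALN * (64 + 8 + 12 + 2 * 128 + 2 * 64)
    account_balances = (num_guests + 1) * 4  # +1 for the host
    return auction_logs + account_balances


# Per-block sums in mm^2, copied from Table IV.
GUEST_METER = 0.045
SECOND_BLOCK = 0.011  # XOR + nonce memory block (heading truncated in the table)
THIRD_BLOCK = 0.712   # block holding SHA, XOR, RSA and the auction memories
GATEWAY = 0.011
BASE_SYSTEM_AREA = 437.31  # total area of the base system


def additional_area_mm2(copies=24):
    # ASSUMPTION: `copies` instances of every block except the third.
    return copies * (GUEST_METER + SECOND_BLOCK + GATEWAY) + THIRD_BLOCK


def area_overhead_percent(copies=24):
    return 100 * additional_area_mm2(copies) / BASE_SYSTEM_AREA


if __name__ == "__main__":
    print(guest_meter_memory_bytes())          # 4592, fits an 8 kB cache
    print(outstanding_jobs_memory_bytes())     # 8064, fits an 8 kB cache
    print(non_volatile_memory_bytes())         # 4193380, fits in 4 MB
    print(round(additional_area_mm2(), 3))     # 2.32
    print(round(area_overhead_percent(), 2))   # 0.53
```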