

Research project report

Candidate 2095J

**Project Title** 

Capability-Based Memory Protection for Scalable Vector Processing

Submitted in partial fulfilment of the requirements for the MPhil in Advanced Computer Science

Total Page Count: 95

Main chapters (excluding front-matter, references, appendix): 52 (pp 9-60)

Main chapters word count: 15,591 words

Word count methodology: ./texcount.pl -1 -sum -merge -q chapters.tex

TeXcount version: 3.2.0.41

Counts body text, heading text, caption text, and slightly over-counts cross-references. The anonymous version redacts some trivially identifying information, but uses the same word count as the de-anonymized version. The appendices contain non-essential data that is summarized in the main text (e.g. full results, details on how to compile code), or re-statements of work done in the main text (e.g. a summary of all changes made to the CHERI-RISC-V spec). [TODO1 Should the hypothesis reproductions also be counted?]

## **Capability-Based Memory Protection for Scalable Vector Processing**

CHERI, a generic architecture extension which improves memory safety, has garnered attention from industry partners for its low overhead and compatibility with existing source code. CHERI has been adapted to multiple ISAs, including RISC-V and Arm, but not to any scalable vector processors.

Vector processing, where the same operation is performed on multiple elements of a "vector" in parallel, is used everywhere in modern computing from high-performance number-crunching to the humble memcpy. Arm SVE[1] and RISC-V "V"[2] are new flagship vector extensions for Arm and RISC-V, which use a "vector-length agnostic programming model" to allow hardware implementations to choose their vector lengths. These scalable vector models are intended to stay in use long into the future, and it is essential for CHERI to support them.

This dissertation focuses on RISC-V V, presenting and evaluating a possible "CHERI-RVV" combination ISA by building and testing a reference implementation in Rust. We find that RVV is easily adaptable to CHERI with no issues, even maintaining binary compatibility with vanilla RVV programs, although other models like Arm SVE may require more investigation. We find a set of issues with the current CHERI compiler that make source-level compatibility difficult, and show they can be easily resolved with engineering effort. Finally, we explore storing capabilities-in-vectors in a limited context, to allow implementing memcpy with vector instructions, and show it does not violate security properties.

We conclude that it is viable to combine RISC-V Vectors with CHERI to enable vectorized arithmetic and memcpy operations without sacrificing performance, source-level compatibility, or memory protection.

# Contents

| 1 | Intr | oductio | on                                           | ç  |
|---|------|---------|----------------------------------------------|----|
|   | 1.1  | Motiva  | ation                                        | 10 |
|   | 1.2  | Hypot   | heses and Aims                               | 11 |
| 2 | Bac  | kgroun  | d                                            | 12 |
|   | 2.1  | RISC-V  | V                                            | 12 |
|   | 2.2  | A brie  | f history of vector processing               | 13 |
|   | 2.3  | The RV  | VV vector model                              | 13 |
|   |      | 2.3.1   | vtype                                        | 14 |
|   |      | 2.3.2   | vl and vstart — Prestart, body, tail         | 15 |
|   |      | 2.3.3   | Masking — Active/inactive elements           | 16 |
|   |      | 2.3.4   | Exception handling                           | 17 |
|   |      |         | 2.3.4.1 Imprecise vector traps               | 17 |
|   |      |         | 2.3.4.2 Precise vector traps                 | 17 |
|   |      |         | 2.3.4.3 Other modes                          | 18 |
|   |      | 2.3.5   | Summary                                      | 19 |
|   | 2.4  | Previo  | us RVV implementations                       | 19 |
|   | 2.5  | RVV n   | nemory instructions                          | 20 |
|   |      | 2.5.1   | Unit and Strided accesses                    | 21 |
|   |      | 2.5.2   | Unit fault-only-first loads                  | 22 |
|   |      | 2.5.3   | Indexed accesses                             | 23 |
|   |      | 2.5.4   | Unit whole-register accesses                 | 24 |
|   |      | 2.5.5   | Unit bytemask accesses                       | 24 |
|   | 2.6  | CHER    | I                                            | 25 |
|   |      | 2.6.1   | CHERI-RISC-V ISA                             | 26 |
|   |      | 2.6.2   | Instruction changes                          | 26 |
|   |      | 2.6.3   | Capability and Integer encoding mode         | 27 |
|   |      | 2.6.4   | Pure-capability and Hybrid compilation modes | 28 |
|   |      | 265     | Canability relocations                       | 28 |

|          |                   | uware                                                                                                                             | emulatio                                                                                                                                                | n investigation                                                                                                                                                                                                                                                                                               | 29                                                                                          |
|----------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| 3        | 3.1               | Devel                                                                                                                             | oping the                                                                                                                                               | emulator                                                                                                                                                                                                                                                                                                      | 2                                                                                           |
|          |                   | 3.1.1                                                                                                                             | Emulati                                                                                                                                                 | ng CHERI                                                                                                                                                                                                                                                                                                      | 30                                                                                          |
|          |                   |                                                                                                                                   | 3.1.1.1                                                                                                                                                 | rust-cheri-compressed-cap                                                                                                                                                                                                                                                                                     | 3                                                                                           |
|          |                   |                                                                                                                                   | 3.1.1.2                                                                                                                                                 | Integrating into the emulator                                                                                                                                                                                                                                                                                 | 3                                                                                           |
|          |                   | 3.1.2                                                                                                                             | Emulati                                                                                                                                                 | ng vectors                                                                                                                                                                                                                                                                                                    | 3                                                                                           |
|          |                   |                                                                                                                                   | 3.1.2.1                                                                                                                                                 | Decoding phase                                                                                                                                                                                                                                                                                                | 3                                                                                           |
|          |                   |                                                                                                                                   | 3.1.2.2                                                                                                                                                 | Fast-path checking phase                                                                                                                                                                                                                                                                                      | 3                                                                                           |
|          |                   |                                                                                                                                   | 3.1.2.3                                                                                                                                                 | Execution phase                                                                                                                                                                                                                                                                                               | 3                                                                                           |
|          |                   |                                                                                                                                   | 3.1.2.4                                                                                                                                                 | Integer vs. Capability encoding mode                                                                                                                                                                                                                                                                          | 3                                                                                           |
| 3        | 3.2               | Fast-p                                                                                                                            | ath calcul                                                                                                                                              | ations                                                                                                                                                                                                                                                                                                        | 3                                                                                           |
|          |                   | 3.2.1                                                                                                                             | Possible                                                                                                                                                | fast-path outcomes                                                                                                                                                                                                                                                                                            | 3                                                                                           |
|          |                   | 3.2.2                                                                                                                             | Whole-a                                                                                                                                                 | access fast-paths                                                                                                                                                                                                                                                                                             | 3                                                                                           |
|          |                   | 3.2.3                                                                                                                             | m-eleme                                                                                                                                                 | ent known-range fast-paths                                                                                                                                                                                                                                                                                    | 3                                                                                           |
| 3        | 3.3               | Going                                                                                                                             | beyond th                                                                                                                                               | ne emulator                                                                                                                                                                                                                                                                                                   | 3                                                                                           |
|          |                   | 3.3.1                                                                                                                             | Misalign                                                                                                                                                | ned accesses                                                                                                                                                                                                                                                                                                  | 3                                                                                           |
|          |                   | 3.3.2                                                                                                                             | Atomici                                                                                                                                                 | ty of accesses/General memory model                                                                                                                                                                                                                                                                           | 3                                                                                           |
|          |                   | 3.3.3                                                                                                                             | Relaxed                                                                                                                                                 | access ordering and precise traps                                                                                                                                                                                                                                                                             | 3                                                                                           |
| 3        | 3.4               | Testin                                                                                                                            | g hypothe                                                                                                                                               | ses                                                                                                                                                                                                                                                                                                           | 3                                                                                           |
|          |                   |                                                                                                                                   |                                                                                                                                                         |                                                                                                                                                                                                                                                                                                               |                                                                                             |
|          |                   | Hypot                                                                                                                             | thesis H-1                                                                                                                                              | - Feasibility                                                                                                                                                                                                                                                                                                 | 3                                                                                           |
|          |                   |                                                                                                                                   |                                                                                                                                                         | - Feasibility                                                                                                                                                                                                                                                                                                 |                                                                                             |
|          | The               | Hypot                                                                                                                             | thesis H-2                                                                                                                                              |                                                                                                                                                                                                                                                                                                               | 4                                                                                           |
| 7        | <b>The</b><br>4.1 | Hypot<br>CHER                                                                                                                     | thesis H-2                                                                                                                                              | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4                                                                                           |
| 7        |                   | Hypot<br>CHER                                                                                                                     | thesis H-2  I-RVV so  iling vecto                                                                                                                       | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4                                                                                 |
| 7        |                   | Hypot<br>CHER<br>Comp<br>4.1.1                                                                                                    | thesis H-2  I-RVV soliling vector  Available                                                                                                            | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4                                                                                 |
| 7        |                   | Hypot<br>CHER<br>Comp<br>4.1.1                                                                                                    | thesis H-2  I-RVV so iling vecto  Available  Automat                                                                                                    | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4<br>4                                                                            |
| 7        |                   | CHER<br>Comp<br>4.1.1<br>4.1.2                                                                                                    | I-RVV so<br>iling vector<br>Available<br>Automate<br>Vector in                                                                                          | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4<br>4<br>4                                                                       |
| 7        |                   | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3                                                                                  | I-RVV so<br>iling vector<br>Available<br>Automat<br>Vector in<br>Inline as                                                                              | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4<br>4<br>4                                                                       |
| 7        |                   | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5                                                                | I-RVV so illing vector Available Automate Vector in Inline as RVV vs.                                                                                   | - Fast-path checks                                                                                                                                                                                                                                                                                            | 4<br>4<br>4<br>4<br>4<br>4                                                                  |
| 7        | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5                                                                | I-RVV so iling vector Available Automate Vector in Inline as RVV vs. iling vector                                                                       | - Fast-path checks  ftware stack  or code  e compilers  tic vectorization  ntrinsics  ssembly  Arm SVE                                                                                                                                                                                                        | 44<br>44<br>44<br>44<br>44<br>44                                                            |
| 7        | 4.1               | CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp                                                                 | I-RVV so iling vector Available Automate Vector in Inline as RVV vs. iling vector Adapting                                                              | - Fast-path checks  ftware stack  or code e compilers tic vectorization otrinsics ssembly Arm SVE or code with CHERI-Clang                                                                                                                                                                                    | 44<br>44<br>44<br>44<br>44<br>44<br>44                                                      |
| 1        | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp<br>4.2.1                                               | I-RVV so illing vector in Inline as RVV vs. illing vector Adapting Adapting                                                                             | ftware stack or code e compilers tic vectorization ntrinsics seembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI                                                                                                                                                                 | 44<br>44<br>44<br>44<br>44<br>44<br>44                                                      |
| 5        | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp<br>4.2.1<br>4.2.2<br>4.2.3                             | I-RVV so illing vector Available Automate Vector in Inline as RVV vs. illing vector Adapting Adapting Storing s                                         | ftware stack or code e compilers tic vectorization ntrinsics ssembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI                                                                                                                                                                 | 44<br>44<br>44<br>44<br>44<br>44<br>44                                                      |
| 7 4      | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp<br>4.2.1<br>4.2.2<br>4.2.3<br>Testin                   | I-RVV so illing vector in Inline as RVV vs. illing vector Adapting Adapting Storing sign hypothe                                                        | Fast-path checks  ftware stack  or code e compilers tic vectorization ntrinsics ssembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI g vector intrinsics to CHERI scalable vectors on the stack                                                                                   | 44<br>44<br>44<br>44<br>44<br>44<br>44                                                      |
| 7 4      | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp<br>4.2.1<br>4.2.2<br>4.2.3<br>Testin<br>Hypot          | I-RVV so illing vector Available Automate Vector in Inline as RVV vs. illing vector Adapting Storing storing storing storing storing storing thesis H-3 | ftware stack or code e compilers tic vectorization ntrinsics ssembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI scalable vectors on the stack sses                                                                                                                              | 4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4 |
| : 1<br>4 | 4.1               | Hypote CHER Comp 4.1.1 4.1.2 4.1.3 4.1.4 4.1.5 Comp 4.2.1 4.2.2 4.2.3 Testin Hypote Hypote                                        | I-RVV so illing vector in Inline as RVV vs. illing vector Adapting Storing sign hypothesis H-3 thesis H-4                                               | ftware stack  or code e compilers tic vectorization ntrinsics ssembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI scalable vectors on the stack ses - Compiling/running legacy code in integer mode                                                                              | 4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4<br>4 |
| 1 4      | 4.1               | Hypot<br>CHER<br>Comp<br>4.1.1<br>4.1.2<br>4.1.3<br>4.1.4<br>4.1.5<br>Comp<br>4.2.1<br>4.2.2<br>4.2.3<br>Testin<br>Hypot<br>Hypot | I-RVV so illing vector Available Automate Vector in Inline as RVV vs. illing vector Adapting Adapting Storing sign hypothesis H-3 thesis H-4 thesis H-5 | ftware stack or code e compilers tic vectorization ntrinsics ssembly Arm SVE or code with CHERI-Clang g vector assembly instructions to CHERI g vector intrinsics to CHERI scalable vectors on the stack ses - Compiling/running legacy code in integer mode - Converting legacy code to pure-capability code | 4<br>4<br>4<br>4<br>4                                                                       |

| 5  | Cap    | abilities-in-vectors                                             | 51 |
|----|--------|------------------------------------------------------------------|----|
|    | 5.1    | Changing the emulator                                            | 51 |
|    |        | 5.1.1 Testing                                                    | 52 |
|    | 5.2    | Testing hypotheses                                               | 53 |
|    |        | Hypothesis H-7 - Holding capabilities in vectors                 | 53 |
|    |        | Hypothesis H-8 - Sending capabilities between vectors and memory | 53 |
|    |        | Hypothesis H-9 - Manipulating capabilities in vectors            | 54 |
| 6  | Eval   | luation                                                          | 56 |
|    | 6.1    | Testing the emulator                                             | 56 |
|    |        | 6.1.1 hello_world                                                | 56 |
|    |        | 6.1.2 vector_memcpy                                              | 57 |
|    |        | 6.1.2.1 Compiler differences                                     | 57 |
|    |        | 6.1.3 vector_memcpy_pointers                                     | 58 |
|    |        | 6.1.4 Summary                                                    | 58 |
| 7  | Con    | clusion                                                          | 59 |
| Re | eferei | nces                                                             | 61 |
|    |        |                                                                  |    |
| A  | rusi   | c_cheri_compressed_cap documentation                             | 64 |
| В  | Cod    | e Snippets                                                       | 77 |
|    | B.1    | riscv-v-lite - Vector memory accesses                            | 77 |
| C  | Fast   | path vector checks - Pre-checks                                  | 82 |
|    | C.1    | Masked accesses                                                  | 82 |
|    | C.2    | Unit accesses                                                    | 83 |
|    | C.3    | Strided accesses                                                 | 83 |
|    | C.4    | Indexed accesses                                                 | 84 |
| D  | Con    | npiler information                                               | 85 |
|    | D.1    | Building riscv-gnu-toolchain with vector support                 | 85 |
|    | D.2    | Vanilla RVV command-line options                                 | 86 |
|    | D.3    | CHERI-RVV command-line options                                   | 86 |
| E  | CHI    | ERI-RVV changes from CHERI and RVV                               | 88 |
|    | E.1    | Capabilities-in-vectors                                          | 88 |
|    |        | E.1.1 Relevant properties                                        | 89 |
| F  | Full   | vector_memcpy results                                            | 91 |

G Artifacts 94

# Introduction

Since 2010, the Cambridge Computer Lab (in association with SRI) has been developing the CHERI¹ architecture extension, which improves the security of any given architecture by checking all memory accesses in hardware. The core impact of CHERI, on a hardware level, is that memory can no longer be accessed directly through raw addresses, but must pass through a *capability*[3]. Capabilities are unforgeable tokens that grant fine-grained access to ranges of memory. Instead of generating them from scratch, capabilities must be *derived* from another capability with greater permissions. For example, a capability giving read-write access to an array of structures can be used to create a sub-capability granting read-only access to a single element. This vastly reduces the scope of security violations through spatial errors (e.g. buffer overflows[4]), and creates interesting opportunities for software compartmentalization[5].

Industry leaders have recognized the value CHERI provides. Arm Inc have manufactured the Morello System-on-Chip, based on their Neoverse N1 CPU, which incorporates CHERI capabilities into the Armv8.2 ISA. While this represents a great step forward, there are still elements on the SoC that haven't fully embraced CHERI (e.g. the GPU), and architecture extensions that haven't been investigated in the context of CHERI. One such example is Arm's Scalable Vector Extension (introduced in Armv8.2 but not included in Neoverse N1), which is designed to remain in use well into the future[1]. Supporting this and other scalable vector ISAs in CHERI is essential to CHERI's long-term relevance.

In the context of modern computer architecture, vector processing is the practice of dividing a large hardware register into a *vector* of multiple *elements* and executing the same operation on each element in a single instruction<sup>2</sup>. This data-level parallelism can drastically increase throughput, particularly for arithmetic-heavy programs. However, before computing arithmetic, the vectors must be populated with data.

<sup>&</sup>lt;sup>1</sup>Capability Hardware Enhanced RISC Instructions

<sup>&</sup>lt;sup>2</sup>This is a SIMD (Single Instruction Multiple Data) paradigm.

## 1.1 Motivation

Modern vector implementations all provide vector load/store instructions to access a whole vector's worth of memory. These range from simple contiguous accesses (where all elements are next to each other), to complex indexed accesses (where each element loads from a different location based on another vector). They can also have per-element semantics, e.g. "elements must be loaded in order, so if one element fails the preceding elements are still valid"[2, Section 7.7]. If CHERI CPUs want to benefit from vector processing's increased performance and throughput, they must support those instructions at some level. But adding CHERI's boundschecking to the mix may affect these semantics, and could impact performance (e.g. checking each element's access in turn may be slow).

Vector memory access performance is more critical than one may initially assume, because vectors are used for more than just computation. A prime example is memcpy: for x86\_64, glibc includes multiple versions of the function<sup>3</sup> taking advantage of vector platforms, then selects one to use at runtime<sup>4</sup>. These implementations are written in assembly and heavily optimized. If the memory accesses are hitting the cache, a few extra cycles of bounds-checking for each access could actually make a noticeable difference.

memcpy also raises the important question of how the vector model interacts with capabilities. In non-CHERI processors, memcpy will copy pointers around in memory without fuss. For a CHERI-enabled vector processor to support this, it would need to be able to load/store capabilities from vectors without violating any security guarantees. This may require more guarantees than otherwise necessary - for example, each vector register likely needs to be as large or larger than a single capability.

To explore this topic, we chose to focus on the RISC-V Vector extension[2] (shortened to RVV throughout). As of November 2021 this has been ratified by RISC-V International<sup>5</sup>, and will be RISC-V's standard vector instruction set moving forward. This has two key benefits. Studying RVV will allow reference "CHERI-RVV" implementations to be built for the CHERI project's open-source RISC-V cores<sup>6</sup>, which don't currently support vector processing. Secondly, RVV is a *scalable* vector model, where the length of each vector is implementation-dependent. This has more potential roadblocks than a fixed-length vector model, and investigating them here will make life easier if Arm wish to integrate their Scalable Vector Extension with CHERI later down the road.

<sup>&</sup>lt;sup>3</sup>It appears memcpy is implemented as a copy of memmove.

<sup>4</sup>sysdeps/x86\_64/multiarch/ifunc-memmove.h in bminor/glibc on GitHub

<sup>5</sup>https://wiki.riscv.org/display/HOME/Recently+Ratified+Extensions

<sup>6</sup>https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/cheri-risc-v.html

## 1.2 Hypotheses and Aims

The goal of this project is to investigate the impact of, and the roadblocks for, integrating a scalable vector architecture with CHERI's memory protection system. In particular, we focus on integrating RVV with the CHERI-RISC-V ISA, with the aim of enabling a future CHERI-RVV implementation and informing the approach for a future CHERI Arm SVE implementation.

The investigation was carried out by designing and testing a CHERI-RVV emulator written in Rust, but that is only a single implementation. To show that RVV can be integrated with CHERI-RISC-V for a wide range of processors, we use information gathered from the emulator to prove [TODO2 nine] hypotheses (Table 1.1). Hyps. H-1 and H-2 consider basic feasibility and potential hardware performance issues. Hyps. H-3 to H-6 ensure that vector software is compatible with potential CHERI-RVV software stacks. Hyps. H-7 to H-9 considers capabilities-in-vectors: the conditions under which vector registers can hold capabilities, and vectorized instructions can manipulate them.

This document also examines current practical realities, such as the limitations of the current CHERI-RVV software stack, and explains how to compile/execute CHERI-RVV code on baremetal platforms. The RVV and CHERI-RISC-V specifications are succinctly explained in Chapter 2.

#### *Hardware Hypotheses — Chapter 3*

- H-1 It is possible to use CHERI capabilities as memory references in all vector instructions.
- H-2 The capability bounds checks for vector elements within a known range (e.g. a cache line) can be performed in a single check, amortizing the cost.

#### *Software Hypotheses — Chapter 4*

- H-3 Vector code can be compiled in legacy forms (with integer addressing) and function correctly on CHERI with no source code changes.
- H-4 Legacy vector code can be compiled into a pure-capability form with no changes.
- H-5 Vector code that saves/restores variable-length vectors to/from the stack can be compiled on CHERI-RVV with no source code changes.
- H-6 CHERI-vector code can run correctly in multiprocessing systems, where execution may be paused and resumed on interrupts or context switches.

#### *Capabilities-in-Vectors — Chapter 5*

- H-7 It is possible for vector registers to hold capabilities to enable copying without violating CHERI security principles.
- H-8 It is possible for vector memory accesses to load and store capabilities from vector registers without violating CHERI security principles.
- H-9 It is possible for vector instructions to manipulate capabilities in vector registers without violating CHERI security principles.

**Table 1.1:** Project Hypotheses

# BACKGROUND

This chapter describes RISC-V (Section 2.1), RVV (Sections 2.2 to 2.4), and CHERI (Section 2.6) to the detail required to understand the rest of the dissertation. It summarizes the relevant sections of the RISC-V unprivileged spec[6], the RISC-V "V" extension specification v1.0[2, Sections 1–9, 17], the TR-951 CHERI ISAv8 technical report[7, Chapters 5, 8], and the TR-949 technical report about C/C++ safety on CHERI[8, Section 4.4, Appendix C]. Both vectors and CHERI are described, because this dissertation caters to those who may be familiar with one but not the other.

### 2.1 RISC-V

RISC-V is an open family of ISAs which defines "base integer ISAs" (e.g. all 64-bit RISC-V cores implement the RV64I base ISA) and extensions (e.g. the "M" extension for integer multiplication). A base instruction set combined with a set of extensions is known as a RISC-V ISA. Because RISC-V is open, anyone can design, manufacture, and sell chips implementing any RISC-V ISA.

Each RISC-V implementation has a set of constant parameters. The most common example is XLEN, the length of an integer register in bits, which is tied to the base integer ISA (e.g. 64-bit ISA implies XLEN=64). Other constant parameters include CLEN, the length of a capability in bits, defined by CHERI relative to XLEN; and VLEN and ELEN, which are used by RVV and entirely implementation-defined.

The extensions of most relevance to this project are the "V" vector extension (RVV, specified in [2]) and the CHERI extension (specified in [7]). RVV has recently been officially ratified, and is the de facto vector extension for RISC-V. The following sections summarize the vector extension, how it accesses memory, and previous implementations in academia.

## 2.2 A brief history of vector processing

Many vector implementations (Intel SSE/AVX, Arm's Advanced SIMD and Neon) use fixed-length vectors - e.g. 128-bit vectors which a program interprets as four 32-bit elements. As the industry's desire for parallelism grew, new implementations had to be designed with longer vectors of more elements. For example, Intel SSE/SSE2 (both 128-bit) was succeeded by AVX (128 and 256-bit), then AVX2 (entirely 256-bit), then AVX-512 (512-bit). Programs built for one extension, and hence designed for a specific vector size, could not automatically take advantage of longer vectors.

Scalable vectors address this by not specifying the vector length, and instead calculating it on the fly. Instead of hardcoding "this loop iteration uses a single vector of four 32-bit elements", the program has to ask "how many 32-bit elements will this iteration use?". This gives hardware designers more freedom, letting them select a suitable hardware vector length for their power/timing targets, while guaranteeing consistent execution of programs on arbitrarily-sized vectors. RVV uses a scalable vector model.

## 2.3 The RVV vector model

Summarizes [2, Sections 1-6, 17]



Figure 2.1: Example of a RVV vector register VLEN = 128, ELEN = 32

RVV defines thirty-two vector registers, each of an implementation-defined constant width VLEN. These registers can be interpreted as *vectors* of *elements*. The program can configure the size of elements, and the implementation defines a maximum width ELEN. Fig. 2.1 shows a simple example.

RVV also adds some state that defines how the vector registers are used (see Fig. 2.2). These are stored in RISC-V Control and Status Registers (CSRs), which the program can read. vtype (Section 2.3.1) defines how the vector registers are split into elements. vstart and vl (Section 2.3.2) divides the elements into three disjoint subsets: *prestart*, the *body*, and the *tail*. Masked accesses (Section 2.3.3) further divide the *body* into *active* and *inactive* elements. This section also describes the vector exception model (Section 2.3.4).



Figure 2.2: Summary of additional state used by RVV

### 2.3.1 vtype

The vtype CSR contains two key fields that describe how vector instructions interpret the contents of vector registers. The first is the Selected Element Width (SEW), which is self-explanatory. It can be 8, 16, 32, or 64. 128-bit elements are referenced a few times throughout but haven't been formally specified (see [2, p10, p32]).

The second field is the Vector Register Group Multiplier (LMUL). Vector instructions don't just operate over a single register, but over a register *group* as defined by this field. For example, if LMUL=8 then each instruction would operate over 8 register's worth of elements. These groups must use aligned register indices, so if LMUL=4 all vector register operands should be multiples of 4 e.g. v0, v4, v8 etc. In some implementations this may increase throughput, which by itself is beneficial for applications.

However, the true utility of LMUL lies in widening/narrowing operations (see Fig. 2.3). For example, an 8-by-8-bit multiplication can produce 16-bit results. Because the element size doubles, the number of vector registers required to hold the same number of elements also doubles. Doubling LMUL after such an operation allows subsequent instructions to handle all the results at once. At the start of such an operation, fractional LMUL (1/2, 1/4, or 1/8) can be used to avoid subsequent results using too many registers.



Figure 2.3: Example of using LMUL to access results of widening operations



Figure 2.4: Examples of vector utilization with vl and vstart

vtype also encodes two flags: mask-agnostic and tail-agnostic. If these are set, the implementation is *allowed* to overwrite any masked-out or tail elements with all 1s.

Most vector instructions will interpret their operands using vtype, but this is not always the case. Some instructions (such as memory accesses) use different Effective Element Widths (EEW) and Effective LMULs (EMUL) for their operands. In the case of memory accesses, the EEW is encoded in the instruction bits and the EMUL is calculated to keep the number of elements consistent. Another example is widening/narrowing operations, which by definition have to interpret the destination registers differently from the sources.

Programs update vtype through the vsetvl family of instructions. These are designed for a "stripmining" paradigm, where each iteration of a loop processes some elements until all elements are processed. vsetvl instructions take a requested vtype and the number of remaining elements to process (the Application Vector Length or AVL), and return the number of elements that will be processed in this iteration. This value is saved in a register for the program to use, and also saved in the internal vl CSR.

## 2.3.2 vl and vstart — Prestart, body, tail

The first CSR is the Vector Length v1, which holds the number of elements that could be updated from a vector instruction. The program updates this value through fault-only-first loads (Section 2.5.2) and more commonly vsetv1 instructions.

In the simple case, v1 is equal to the total available elements (see Fig. 2.4a). It can also be fewer (see Fig. 2.4b), in which case vector instructions will not write to elements in the "tail" (i.e. elements past v1). This eliminates the need for a 'cleanup loop' common in fixed-length vector programs.

In a similar vein, vstart specifies "the index of the first element to be executed by a vector instruction". Elements before vstart are known as the *pre-start* and are not touched by executed instructions. It is usually only set by the hardware whenever it is interrupted mid-instruction (see Fig. 2.5 and Section 2.3.4) so that the instruction can be re-executed later

without corrupting completed values. Whenever a vector instruction completes, vstart is reset to zero.

The program *can* set vstart manually, but it may not always work. If an implementation couldn't arrive at the value itself, then it is allowed to reject it. The specification gives an example where a vector implementation never takes interrupts during an arithmetic instruction, so it would never set vstart during an arithmetic instruction, so it could raise an exception if vstart was nonzero for an arithmetic instruction.

## 2.3.3 Masking — Active/inactive elements

Most vector instructions allow for per-element *masking* (see Fig. 2.6). When masking is enabled, register v0 acts as the 'mask register', where each bit corresponds to an element in the vector<sup>1</sup>. If the mask bit is 0, that element is *active* and will be used as normal. If the mask bit is 1, that element will be *inactive* and not written to (or depending on the mask-agnostic setting, overwritten with 1s). When masking is disabled, all elements are *active*.





**Figure 2.6:** Example of masking a vector operation

 $<sup>^{1}</sup>$ A single vector register will always have enough bits for all elements. The maximum element count is found when SEW is minimized (8 bits) and LMUL is maximized (8 registers), and is equal to VLEN \* LMUL / SEW = VLEN \* 8 / 8 = VLEN.

## 2.3.4 Exception handling

Summarizes [2, Section 17]

During the execution of a vector instruction, two events can prevent an instruction from fully completing: a synchronous exception in the instruction itself, or an asynchronous interrupt from another part of the system. Implementations may choose to wait until an instruction fully completes before handling asynchronous interrupts, making it unnecessary to pause the instruction halfway through, but synchronous exceptions cannot be avoided in this way (particularly where page fault exceptions must be handled transparently).

The RVV specification defines two modes for 'trapping' these events, which implementations may choose between depending on the context (e.g. the offending instruction), and notes two further modes which may be used in further extensions. All modes start by saving the PC of the trapping instruction to a CSR \*epc.

#### 2.3.4.1 Imprecise vector traps

Imprecise traps are intended for events that are not recoverable, where "reporting an error and terminating execution is the appropriate response". They do not impose any extra requirements on the implementation. For example, an implementation that executes instructions out-of-order does not need to guarantee that instructions older than \*epc have completed, and is allowed to have completed instructions newer than \*epc.

If the trap was triggered by a synchronous exception, the vstart CSR must be updated with the element that caused it. The specification calls out synchronous exceptions in particular, but does not mention asynchronous interrupts. It's likely that imprecise traps for asynchronous interrupts should also set vstart, but this issue has been raised with the authors for further clarification<sup>2</sup>. The specification also states "There is no support for imprecise traps in the current standard extensions", meaning that the other standard RISC-V exceptions do not use and have not considered imprecise traps.

#### 2.3.4.2 Precise vector traps

Precise vector traps are intended for instructions that can be resumed after handling the interrupting event. This means the architectural state (i.e. register values) when starting the trap could be saved and reloaded before continuing execution. Therefore it must look like instructions were completed in-order, even if the implementation is out-of-order:

- Instructions older than \*epc must have completed (committed all results to the architectural state)
- Instructions newer than \*epc must **not** have altered architectural state.

<sup>&</sup>lt;sup>2</sup>(redacted for anonymity)

On a precise trap, regardless of what caused it, the vstart CSR must be set to the element index on which the trap was taken. The save-and-reload expectation then add two constraints on the trapping instruction's execution:

- Operations affecting elements preceding vstart must have committed their results
- Operations affecting elements at or following vstart must either
  - not have committed results or otherwise affected architectural state
  - be *idempotent* i.e. produce exactly the same result when repeated.

The idempotency option gives implementations a lot of leeway. Some instructions, such as indexed segment loads (Section 2.5.3), are specifically prohibited from overwriting their inputs to make them idempotent. If an instruction is idempotent, an implementation is even allowed to repeat operations on elements *preceding* vstart. However for memory accesses the idempotency depends on the memory being accessed. For example, reading or writing a memory-mapped I/O region may not be idempotent.

Another memory-specific issue is that of *demand-paging*, where the OS needs to step in and move virtual memory pages into physical memory for an instruction to use. This use-case is specifically called out by the specification for precise traps. Usually, this is triggered by some element of a vector memory access raising a synchronous exception, invoking a precise trap, and writing the "Machine Trap Value" scalar register with the offending address[9, Section 3.1.21]. vstart must be set to an element at (or before<sup>3</sup>) the one that demanded the page, because that element must perform the access after reloading. If an implementation sets vstart to the offending element, because operations preceding vstart must have completed, any elements that could potentially trigger demand-paging *must* wait for the preceding elements to complete.

#### 2.3.4.3 Other modes

The RVV spec notes two other possible future trap modes. First is "Selectable precise/imprecise traps", where an implementation allows the user to select precise or imprecise traps for e.g. debugging or performance.

The second mode is "Swappable traps", where a trap handler could use special instructions to "save and restore the vector unit microarchitectural state". The intent seems to be to support context switching with imprecise traps, which could also require the *opaque* state (i.e. internal state not visible to the program) to be saved and restored. Right now, it seems that context switching always requires a precise trap.

<sup>&</sup>lt;sup>3</sup>If the memory region is idempotent, then vstart could any value where all preceding elements had completed. It could even be zero, in which case all accesses would be retried on resume, as long as it could guarantee forward progress.

### 2.3.5 Summary



Figure 2.7: Combined examples for RVV vector

Fig. 2.7 shows all of the above features used in a single configuration:

- The instruction was previously interrupted with a precise trap and restarted, so  ${\tt vstart=2}$
- Elements are 16-bit
- LMUL=4 to try and increase throughput
- Only 29 of the 32 available elements were requested, so v1=29 (3 tail elements)
- Some elements are masked out/inactive (in this case seemingly at random)
- Overall, 21 elements are active

## 2.4 Previous RVV implementations

Academia and industry have implemented RVV even before v1.0 was released. The scalable vector model allows great diversity: Johns and Kazmierski integrated a minimal vector processor into a microcontroller's scalar pipeline (VLEN=32) [10], Di Mascio et al. used RVV for deep learning in space[11], and AndesCode, SiFive, and Alibaba have released cores with VLENs up to 512[12][13][14]. Other academic examples include Ara[15], Arrow[16], RISC-V<sup>2</sup>[17], and Vicuna[18], which all decouple the vector processing from the scalar pipeline.

Very recently, more implementations were revealed at RISC-V Week in Paris (May 2022). Vitruvius[19] uses extremely long vectors VLEN = 16384, is implemented as a decoupled processor, and is the first RISC-V processor to support the Open Vector Interface (OVI)<sup>4</sup> to communicate with the scalar core. VecProM[20] splits its approach into two, where vectors beyond a certain length are strip-mined and processed in hardware using a scratch memory, using OVI to connect multiple heterogeneous vector processors to a scalar core. Both were produced from the Barcelona Supercomputing Center under the European Processor Initiative. It seems that adoption of RVV will continue, making it a good choice for adapting to CHERI.

<sup>&</sup>lt;sup>4</sup>semidynamics/OpenVectorInterface on Github

## 2.5 RVV memory instructions

Summarizes [2, Sections 7-9]

RVV defines three broad categories of memory access instructions, which can be further split into five archetypes with different semantics. This section summarizes each archetype, their semantics, their assembly mnemonics, and demonstrates how they map memory accesses to vector elements.

For the most part, memory access instructions handle their operands as described in Section 2.3. EEW and EMUL are usually derived from the instruction encoding, rather than reading the vtype CSR. In a few cases the Effective Vector Length EVL is different from the vl CSR, so for simplicity all instructions are described in terms of EVL.

### Segmented accesses

Three of the five archetypes (unit/strided, fault-only-first, and indexed) support segmented access. This is used for unpacking contiguous structures of  $1 \le nf \le 8$  fields and placing each field in a separate vector. In these instructions, the values of v1, vstart, and the mask register are interpreted in terms of segments.

Fig. 2.8 demonstrates a common example: the extraction of separate R, G, and B components from a color. Without segmentation, i.e. n=1, each consecutive memory address maps to a consecutive element in a single vector register group. With segmentation, elements are grouped into segments of n>1 fields, where each field is mapped to a different vector register group. This principle extends to LMUL > 1 (Fig. 2.8c).



(b) Element-address mapping for segmented access



(c) Example of segment mapping for LMUL > 1

**Figure 2.8:** Comparison between segmented and unsegmented accesses For readability, the vector registers are 2x as wide

#### 2.5.1 Unit and Strided accesses

| Mnmemonic                                                       | Data | Address | Stride | Masked   |
|-----------------------------------------------------------------|------|---------|--------|----------|
| vlseg <nf>e<eew>.v<br/>vlsseg<nf>e<eew>.v</eew></nf></eew></nf> | -    | •       | -      | vm<br>vm |

(a) Instruction



Figure 2.9: Segmented Unit/Strided Access Information

This archetype moves active elements of **nf** vector register groups to/from contiguous segments of memory, where the start of each segment is separated by stride bytes.

- Unit-stride instructions tightly pack segments, i.e. stride = nf \* eew / 8.
- Strided instructions read stride from register rs2.
  - stride may be positive, negative, or zero.
- These instructions don't do anything if vstart >= EVL.

Strided accesses where rs2 is register x0 may perform fewer than EVL memory accesses. Otherwise, even if stride = 0, implementations must appear to perform all accesses.

#### **Ordering**

There are no ordering guarantees, other than those required by precise vector traps (if used).

#### **Exception Handling**

If any element within segment i triggers a synchronous exception, vstart is set to i and a precise or imprecise trap is triggered. Load instructions may overwrite active segments past the segment index at which the trap is reported, but not past EVL.[2, Section 7.7] Upon entering a trap, it is implementation-defined how much of the faulting segment's accesses are performed.

### 2.5.2 Unit fault-only-first loads

| Mnemonio                        | c       |             | Data                                              | Address | Masked |
|---------------------------------|---------|-------------|---------------------------------------------------|---------|--------|
| vlseg <nf>e<eew>ff.v</eew></nf> |         |             | vd,                                               | (rs1),  | vm     |
|                                 |         | (a) Instru  | uction                                            |         |        |
|                                 | Masked? | vm ==       | 0                                                 | _       |        |
|                                 | EEW     | <eew></eew> |                                                   |         |        |
|                                 | EVL     | vl          |                                                   |         |        |
|                                 | EMUL    | VLEN >      | k <eew< td=""><td>&gt; / EVL</td><td></td></eew<> | > / EVL |        |
|                                 | NFIELDS | <nf></nf>   |                                                   |         |        |
|                                 |         | (b) Parar   | neters                                            |         |        |

Figure 2.10: Unit Fault-only-First Information

This archetype is equivalent to a unit load in all respects but exception handling. If any access in segment 0 raises an exception<sup>5</sup>, v1 is not modified and the trap is taken as usual. If any access in any active segment > 0 raises an exception, the trap is not taken, v1 is reduced to the index of the offending segment, and the instruction finishes. If an asynchronous interrupt is encountered at any point, the trap is taken and vstart is set as usual.

This archetype is intended for "loops with data-dependent exit conditions", and is commonly used for string operations. The specification uses it in a strcmp example ([2, Section A.9]).

Similar to plain loads, if an exception is encountered the instruction is allowed to update segments past the offender (but not past the original v1). If any synchronous exception or asynchronous interrupt occurs, regardless of the segment index, it is implementation-defined how much of the faulting segment's accesses are performed.

<sup>&</sup>lt;sup>5</sup>Segment 0 may be masked out, in which case this is impossible.

#### 2.5.3 Indexed accesses

| Mnmemonic                                 | Data | Address | Indices | Masked |  |
|-------------------------------------------|------|---------|---------|--------|--|
| vl <u o>xseg<nf>e<eew>.v</eew></nf></u o> | vd,  | (rs1),  | vs2,    | vm     |  |
| (a) Instruction                           |      |         |         |        |  |





(c) Example of a segmented indexed access EEW=8-bits, nf=3

Figure 2.11: Segmented Indexed Access Information

This archetype moves elements of **nf** vector register groups to/from contiguous segments of memory, where each segment is offset by an index (in bytes) taken from another vector.

- The start of each segment is defined by address + index\_vector[i].
- These instructions don't do anything if vstart >= EVL.
- Indexed segment loads may not overwrite the index vector<sup>6</sup>.

### **Ordering**

Accesses within each segment are not ordered relative to each other. If the ordered variant of this instruction is used, then the segments must be accessed in order (i.e. 17, 53, 8, 44 for Fig. 2.11c). Otherwise, segment ordering is not guaranteed.

#### **Exception Handling**

If any element within segment i triggers a synchronous exception, vstart is set to i and a precise or imprecise trap is triggered. Load instructions may overwrite active segments past the segment index at which the trap is reported, but not past EVL[2, Section 7.7]. Upon entering a trap, it is implementation-defined how much of the faulting segment's accesses are performed.

<sup>&</sup>lt;sup>6</sup>This allows restarting after raising an exception partway through a structure

### 2.5.4 Unit whole-register accesses

|                                 |      |         |                  | (b) Parameters       |
|---------------------------------|------|---------|------------------|----------------------|
| (a) Instruction                 |      |         | EMUL             | 1                    |
| vl <nreg>re<eew>.v</eew></nreg> | vd,  | (rs1)   | EVL              | NFIELDS * VLEN / EEW |
| Mnemonic                        | Data | Address | Registers<br>EEW | <nreg></nreg>        |
|                                 |      |         | Masked?          | False                |

Figure 2.12: Unit Whole Register Information

This archetype moves the contents of **nreg** vector registers to/from a contiguous range in memory. Equivalent to a unit-stride access where EVL equals the total number of elements in **nreg** registers.

- nreg must be a power of two.
- These instructions don't support segmented access.
- These instructions don't do anything if vstart >= EVL.

Ordering and exception handling are identical to unit-stride accesses (Section 2.5.1).

## 2.5.5 Unit bytemask accesses

| vlm.v vd, (rs1) | EVL<br>EMUL | ceil(v1/8) |
|-----------------|-------------|------------|
| (a) Instruction |             | rameters   |

Figure 2.13: Unit Bytemask Information

This archetype moves the contents of a mask register to/from a contiguous range of memory. It transfers at least v1 bits, one bit for each element that could be used in subsequent vector instructions. This will always fit in a single vector register (see Section 2.3.3), hence EMUL = 1 in all cases.

- These instructions operate as if the tail-agnostic setting of vtype is true.
- These instructions don't support segmented access.
- These instructions don't do anything if vstart >= EVL.

Ordering and exception handling are identical to unit-stride accesses (Section 2.5.1).

## 2.6 CHERI

In CHERI, addresses/pointers are replaced with capabilities: unforgeable tokens that provide *specific kinds of access* to an *address* within a *range of memory*. The above statement is enough to understand what capabilities contain<sup>7</sup>:

- Permission bits, to restrict access
- The *cursor*, i.e. the address it currently points to
- The bounds, i.e. the range of addresses this capability could point to

A great deal of work has gone into compressing capabilities down into a reasonable size (see [21], Fig. 2.14), and using the magic of floating-point all of this data has been reduced to just 2x the architectural register size. For example, on 64-bit RISC-V a standard capability is 128-bits long. The rest of this dissertation assumes capabilities are 128-bits long for simplicity.

A CHERI implementation has to enforce three security properties about its capabilities[7, Section 1.2.1]:

- Provenance Capabilities must always be derived from valid manipulations of other capabilities.
- Integrity Corrupted capabilities cannot be dereferenced.
- Monotonicity Capabilities cannot increase their rights.

Integrity is enforced by tagging registers and memory. Every 128-bit register and aligned 128-bit region of memory has an associated tag bit, which denotes if its data encodes a valid capability<sup>8</sup>. If any non-capability data is written to any part of the region the tag bit is zeroed out. Instructions that perform memory accesses can only do so if the provided capability has a valid tag bit. As above, significant work has gone into the implementation to reduce the DRAM overhead of this method (see [22]).

<sup>&</sup>lt;sup>8</sup>This has the side-effect that capabilities must be 128-bit aligned in memory.



**Figure 2.14:** 128-bit compressed capability representation − from [3]

<sup>&</sup>lt;sup>7</sup>This is a slight simplification. For the purposes of vector memory accesses the *otype* of a capability can be ignored, as any type other than UNSEALED cannot be dereferenced anyway.

Provenance and Monotonicity are enforced by all instructions that manipulate capabilities. If an implementation detects a violation of either property, it will zero out the tag bit and rely on Integrity enforcement to ensure it is not dereferenced. Some CHERI-enabled architectures, such as CHERI-RISC-V, also raise a synchronous exception when this occurs.

#### 2.6.1 CHERI-RISC-V ISA

The Cambridge Computer Lab's TR-951 report[7] describes the latest version of the CHERI architecture (CHERI ISAv8) and proposes applications to MIPS, x86-64, and RISC-V. CHERI-RISC-V is a mostly straightforward set of additions to basic RISC-V ISAs. It adds thirty-two general-purpose capability registers, thirty-two Special Capability Registers (SCRs), and many new instructions.

The new general-purpose capability registers are each of size CLEN = 2 \* XLEN plus a tag bit. These registers store compressed capabilities. While there is always a logical distinction between the pre-existing *integer* registers x0-x31 and the *capability* registers cx0-cx31, the architecture may store them in a Split or Merged register file. A Split register file stores the integer registers separately from capability registers, so programs can manipulate them independently. A Merged register file stores thirty-two registers of length CLEN, using the full width for the capability registers, and aliases the integer registers to the bottom XLEN bits. Under a merged register file, writing to an integer register makes the capability counterpart invalid, so programs have to be more careful with register usage.

Many of the new SCRs are intended to support the privileged ISA extensions for e.g. hypervisors or operating systems. The emulator doesn't use these, so their SCRs are not listed here, but there are two highly relevant SCRs for all modes: the Program Counter Capability and the Default Data Capability.

The PCC replaces the program counter and adds more metadata, ensuring instruction fetches have the same security properties as normal loads and stores. The DDC is used to sandbox integer addressing modes. CHERI-RISC-V includes new instructions which use integer addressing, and allows legacy (i.e. integer addressed) code to function on CHERI systems without recompiling for CHERI-RISC-V. These instructions all use integer addresses relative to the DDC, and the DDC controls the permissions those instructions have.

## 2.6.2 Instruction changes

TR-951[7, Chapter 8] specifies a suite of new instructions, as well as a set of modifications to pre-existing instructions. Many of the new instructions are unrelated to pre-existing instructions, and implement capability-specific operations like accessing fields of capability registers. The most relevant new instructions for our case are the various loads/stores.

| Name           | Direction | Data type  | Address calculation     |
|----------------|-----------|------------|-------------------------|
| L[BHWD][U].CAP | Load      | Integer    | via capability register |
| L[BHWD][U].DDC | Load      | Integer    | via DDC                 |
| LC.CAP         | Load      | Capability | via capability register |
| LC.DDC         | Load      | Capability | via DDC                 |
| S[BHWD].CAP    | Store     | Integer    | via capability register |
| S[BHWD].DDC    | Store     | Integer    | via DDC                 |
| SC.CAP         | Store     | Capability | via capability register |
| SC.DDC         | Store     | Capability | via DDC                 |

**Table 2.1:** New CHERI load/store instructions

| Name                                                      | Direction                                       | Data type                                           | Address calculation<br>(Capability/Integer mode)                                                                                                        |
|-----------------------------------------------------------|-------------------------------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| [C]LC <sup>1</sup><br>[C]SC <sup>2</sup>                  | Load<br>Store                                   | Capability<br>Capability                            | via capability/DDC<br>via capability/DDC                                                                                                                |
| L[BWHD][U] S[BWHD] FL[WDQ] FS[WDQ] LR SC AMO <sup>3</sup> | Load<br>Store<br>Load<br>Store<br>Load<br>Store | Integer Integer Float Float Integer Integer Integer | via capability/DDC |

<sup>&</sup>lt;sup>1</sup> Replaces RV128 LQ <sup>2</sup> Replaces RV128 SQ <sup>3</sup> All atomic memory operations

Table 2.2: Preexisting RISC-V load/store instructions modified by CHERI-RISC-V

CHERI-RISC-V adds new instructions for loading integer and capability data, either via capabilities or using integer addressing through the DDC (Table 2.1). These instructions are intended for hybrid-capability code (see Section 2.6.4, [7, p151]), so are more limited and don't support immediate offsets. The behaviour of basic RISC-V load/store opcodes changes to either use capabilities as memory references or use integer addressing via the DDC, depending on the encoding mode. If any instruction tries to dereference an invalid capability, it raises a synchronous exception.

## 2.6.3 Capability and Integer encoding mode

CHERI-RISC-V specifies two encoding modes, selected using a flag in the PCC flags field. *Capability mode* modifies the behaviour of pre-existing instructions to take address operands as capabilities. This makes the basic load/store instruction behaviour exactly equivalent to newly introduced counterparts: e.g. L[BWHD] [U] == L[BWHD] [U] . CAP. The DDC may still be used in this mode via the new instructions e.g. S[BWHD] . DDC.

Integer mode seeks to emulate a standard CHERI-less RISC-V architecture as much as possible. All pre-existing RISC-V memory access instructions take address operands as integers, which are dereferenced relative to the DDC<sup>9</sup>. This makes the basic load/store instruction behaviour exactly equivalent to newly introduced counterparts: e.g. L[BWHD] [U] == L[BWHD] [U] .DDC. The new instructions may still be used to dereference and inspect capability registers, but all other instructions access registers in an integer context i.e. ignoring the upper bits and tag from merged register files.

## 2.6.4 Pure-capability and Hybrid compilation modes

CHERI-Clang<sup>10</sup>, the main CHERI-enabled compiler, supports two ways to compile CHERI-RISC-V which map to the different encoding modes.

*Pure-capability* mode treats all pointers as capabilities, and emits pre-existing RISC-V instructions that expect to be run in capability mode<sup>11</sup>.

Hybrid mode treats pointers as integer addresses, dereferenced relative to the DDC, unless they are annotated with \_\_capability. This mode emits pre-existing RISC-V instructions that take integer operands, and uses capabilities through the new instructions. All capabilities in hybrid mode are created manually by the program by copying and shrinking the DDC.

Hybrid mode allows programs to be gradually ported to CHERI, making it very easy to adopt on legacy/large codebases. Any extensions to the model (e.g. CHERI-RVV) should try and retain this property.

## 2.6.5 Capability relocations

Summarizes [8, Section 4.4, Appendix C]

Binary applications compiled in pure-capability mode require some "global" capabilities to exist at startup, e.g. the capability which points to the main() function. It would be a security risk to synthesize these capabilities from thin air, or to allow the binary file itself to contain tag bits.

Instead, CHERI ELF binaries contain a set of requested "relocations" (the \_\_cap\_relocs section) which instruct the runtime environment to create capabilities with specific permissions and bounds in specific places. This process uses the normal CHERI capability instructions, so any invalid requests will cause a program crash, maintaining security. Further complexity is introduced with dynamic linking, and in the future these relocations may change format, both described in TR-949[8], but the above description is sufficient to understand the rest of this paper.

<sup>&</sup>lt;sup>9</sup>Of course, the DDC must be valid when it is used in this mode, and all bounds checks etc. must still pass.

<sup>10</sup>https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/cheri-llvm.html

<sup>&</sup>lt;sup>11</sup>This wasn't derived from documentation, but instead from manual inspection of emitted code.

# HARDWARE EMULATION INVESTIGATION

In order to experiment with integrating CHERI and RVV, we implemented a RISC-V emulator in the Rust programming language named riscv-v-lite. The emulator can partially emulate four unprivileged<sup>1</sup> RISC-V ISAs (Table 3.1), and was also used as the base for capabilities-in-vectors research (Chapter 5). This chapter explores the development of the emulator, the implementation of CHERI support (including supplementary libraries), the addition of vector support, and the conclusions drawn about the integration of CHERI and RVV (referred to as CHERI-RVV throughout).

|        | Architecture      | Extensions                                           |
|--------|-------------------|------------------------------------------------------|
| 32-bit | rv32imv           | Multiply, CSR, Vector <sup>1</sup>                   |
| 64-bit | rv64imv           | Multiply, CSR, Vector <sup>1</sup>                   |
| 64-bit | rv64imvxcheri     | Multiply, CSR, Vector <sup>1</sup> , CHERI           |
| 64-bit | rv64imvxcheri-int | Multiply, CSR, Vector <sup>1</sup> , CHERI (Integer) |

<sup>&</sup>lt;sup>1</sup> Floating-point parts of the vector extension are not supported.

Table 3.1: riscv-v-lite supported architectures

## 3.1 Developing the emulator

The emulator for each architecture follows a similar pattern. A Processor struct stores the register file and the available RAM. A separate ProcessorModules struct holds all ISA modules the processor can execute (e.g. the base RV64 Integer ISA, the Multiply extension, and the Vector extension).

The gap between the Processor and the ISA modules is bridged by a module-specific "connector" struct, which holds references to data in the Processor that is required by the

<sup>&</sup>lt;sup>1</sup>i.e. entirely bare-metal without privilege levels for OSs or hypervisors.

ISA module. For example, the RV64 Integer ISA's connector contains the current PC, a virtual reference to a register file, and a virtual reference to memory. This allows different Processor structs (e.g. a normal RV64 and a CHERI-enabled RV64) to reuse the same ISA modules despite using different register file implementations.

Each Processor implements a single stage pipeline. Instructions are fetched, decoded with a common decoder function<sup>2</sup>, and executed. The processor asks each ISA module in turn if it wants to handle the instruction, and uses the first module to say yes. If the ISA module returns a new PC value it is immediately applied, otherwise it is automatically incremented. This structure easily represents basic RISC-V architectures, and can scale up to support many different new modules.

## 3.1.1 Emulating CHERI

Manipulating CHERI capabilities securely and correctly is a must for any CHERI-enabled emulator. Capability encoding logic is not trivial by any means, so the cheri-compressed-cap C library was re-used rather than implementing it from scratch. Rust has generally decent interoperability with C, but some of the particulars of this library caused issues.

#### 3.1.1.1 rust-cheri-compressed-cap

cheri-compressed-cap provides two versions of the library by default, for 64-bit and 128-bit capabilities, which are generated from a common source through extensive use of the preprocessor. Each variant defines a set of preprocessor macros (e.g. the widths of various fields) before including two common header files cheri\_compressed\_cap\_macros.h and cheri\_compressed\_cap\_common.h. The latter then defines every relevant structure or function based on those preprocessor macros. For example, a function compute\_base\_top is generated twice, once as cc64\_decompress\_mem returning cc64\_cap\_t and another time as cc128\_decompress\_mem returning cc128\_cap\_t. Elegantly capturing both sets was the main challenge for the Rust wrapper.

One of Rust's core language elements is the Trait - a set of functions and "associated types" that can be *implemented* for any type. This gives a simple way to define a consistent interface: define a trait CompressedCapability with all of the functions from cheri\_compressed\_cap\_common.h, and implement it for two empty structures Cc64 and Cc128. In the future, this would allow the Morello versions of capabilities to be added easily. A struct CcxCap<T> is also defined which uses specific types for addresses and lengths pulled from a CompressedCapability. For example, the 64-bit capability structure holds a 32-bit address, and the 128-bit capability a 64-bit address.

128-bit capabilities can cover a 64-bit address range, and thus can have a length of  $2^{64}$ .

<sup>&</sup>lt;sup>2</sup>The decoder, and therefore all emulated processors, doesn't support RISC-V Compressed instructions.

Storing this length requires 65-bits, so all math in cheri\_compressed\_cap\_common.h uses 128-bit length values. C doesn't have any standardized 128-bit types, but GCC and LLVM provide so-called "extension types" which are used instead. Although the x86-64 ABI does specify how 128-bit values should be stored and passed as arguments[23], these rules do not seem consistently applied<sup>3</sup>. This causes great pain to anyone who needs to pass them across a language boundary.

Rust explicitly warns against passing 128-bit values across FFI, and the Clang User's Manual even states that passing i128 by value is incompatible with the Microsoft x64 calling convention<sup>4</sup>. This could be resolved through careful examination: for example, on LLVM 128-bit values are passed to functions in two 64-bit registers<sup>5</sup>, which could be replicated in Rust by passing two 64-bit values. For convenience, we instead rely on the Rust and Clang compilers using compatible LLVM versions and having identical 128-bit semantics.

The CHERI-RISC-V documentation contains formal specifications of all the new CHERI instructions, expressed in the Sail architecture definition language<sup>6</sup>. These definitions are used in the CHERI-RISC-V formal model (<sup>7</sup>), and require a few helper functions (see [7, Chapter 8.2]). To make it easier to port the formal definitions directly into the emulator the rust-chericompressed-cap library also provides those helper functions through a wrapper trait.

The above work is available online<sup>8</sup>, and includes documentation for all C functions (which is not documented in the main repository). That documentation is also available online<sup>9</sup> and partially reproduced in Appendix A.

#### 3.1.1.2 Integrating into the emulator

Integrating capabilities into the emulator was relatively simple thanks to the modular emulator structure. A capability-addressed memory type was created, which wraps a simple integer-addressed memory in logic which performs the relevant capability checks. For integer encoding mode, a further integer-addressed memory type was created which wraps the capability addressed mode, where all integer addresses are bundled with the DDC before passing through to the capability-addressed memory. Similarly, a merged capability register file type was created that exposed integer-mode and capability-mode accesses. This layered approach meant code for basic RV64I operations did not need to be modified to handle CHERI at all - simply passing the integer-mode memory and register file would perform all relevant checks.

#### [TODO3 diagram]

<sup>&</sup>lt;sup>3</sup>See https://godbolt.org/z/qj43jssr6 for an example.

<sup>&</sup>lt;sup>4</sup>clang/docs/UsersManual.rst:3384 in llvm/llvm-project (branch release/13.x) on GitHub

 $<sup>^5</sup>$ clang/lib/CodeGen/TargetInfo.cpp:2811 in llvm/llvm-project (branch release/13.x) on Git-Hub

<sup>&</sup>lt;sup>6</sup>rems-project/sail on Github

<sup>&</sup>lt;sup>7</sup>CTSRD-CHERI/sail-cheri-riscv on Github

<sup>&</sup>lt;sup>8</sup>(redacted for anonymity)

<sup>&</sup>lt;sup>9</sup>(redacted for anonymity)

Integrating capability instructions was also simple. Two new ISA modules were created: XCheri64 for the new CHERI instructions, and Rv64imCapabilityMode to override the behaviour of legacy instructions in capability-encoding-mode. [TODO4 show the program flow for using modules] The actual Processor structure was left mostly unchanged. Integer addresses were changed to capabilities throughout, memory and register file types were changed as described above, and the PCC/DDC were added.

The capability model presented by the C/Rust library has one flaw. Each CcxCap instance stores capability metadata (e.g. the uncompressed bounds) as well as the compressed encoding. This makes it potentially error-prone to represent untagged integer data with CcxCap, as the compressed and uncompressed data may not be kept in sync and cause inconsistencies later down the line. CcxCap also provides a simple interface to set the tag bit, without checking whether that is valid. The emulator introduced the SafeTaggedCap to resolve this: a sum type which represents either a CcxCap with the tag bit set, or raw data with the tag bit unset. This adds type safety, as the Rust compiler forces every usage of SafeTaggedCap to consider both options, preventing raw data from being interpreted as a capability by accident and enforcing Provenance.

The final hurdle was the capability relocations outlined in Section 2.6.5. Because we're emulating a bare-metal platform, there is no operating system to do this step for us. A bare-metal C function has been written to perform the relocations <sup>10</sup>, which could be compiled into the emulated program. However, I wasn't sure how to find the addresses of the generated relocations in C, so I performed the relocations in Rust by examining the compiled ELF file before starting emulation. In future research it should be doable to perform the relocations entirely in bare-metal C.

### 3.1.2 Emulating vectors

Vector instructions are executed by a Vector ISA module, which stores all registers and other state. VLEN is hardcoded as 128-bits, chosen because it's the largest integer primitive provided by Rust that's large enough to hold a capability. ELEN is also 128-bits, which isn't supported by the specification, but is required for capabilities-in-vectors (Chapter 5). Scaling VLEN and ELEN any higher would require the creation and integration of new types that were more than 128-bits long.

To support both CHERI and non-CHERI execution pointers are separated into an address and a *provenance*<sup>11</sup>. The vector unit retrieves an address + provenance pair from the base register, generates a stream of addresses to access, then rejoins each address with the provenance to access memory. When using capabilities, provenance is defined in terms of the base register

 $<sup>^{10} {\</sup>tt src/crt\_init\_globals.c}$  in CTSRD-CHERI/device-model on GitHub

<sup>&</sup>lt;sup>11</sup>The "original allocation the pointer is derived from"[24], or in CHERI terms the bounds within which the pointer is valid.

e.g. "the provenance is provided by capability register X", or defined by the DDC in integer  $mode^{12}$ . On non-CHERI platforms the vector unit doesn't check provenance.

Arithmetic and configuration instructions are generally simple to implement, so aren't covered here. The emulator splits vector memory accesses into three phases: decoding, checking, and execution. A separate decoding stage may technically not be necessary in hardware (especially the parts checking for errors and reserved instruction encodings, which a hardware platform could simply assume won't happen), but it allows each memory access instruction to be classified into one of the five archetypes outlined in Section 2.5. It is then easy to define the checking and execution phases separately for each archetype, as the hardware would need to do.

#### 3.1.2.1 Decoding phase

Decoding is split into two steps: finding the encoded **nf** and **eew** values, then interpreting them based on the encoded archetype. Vector memory access instructions are encoded similarly to the F extension's floating-point load/store instructions, which include an "element width". The vector extension adds four extra "element width" values which imply the access is vectorized. If any of these values are found, the instruction is interpreted as a vector access and **nf** is extracted.

Once the generic parameters are extracted, the mop is checked to determine the indexing method (Unit, Strided, Indexed-Ordered, or Indexed-Unordered). If a unit access is selected, the second argument field encodes an extra value to choose between different unit-stride archetypes (normal unit access, fault-only-first, whole register, or bytemask). Strided and indexed accesses just use their dedicated archetypes. Once the archetype is found, supplemental calculations can be performed (e.g. computing EVL = ceil(v1/8) for bytemask accesses), and the relevant information is returned as a DecodedMemOp enumeration.

[TODO5 diagram of floating point ld/st vs. vector ld/st] [TODO6 Decision tree for operation decoding]

#### 3.1.2.2 Fast-path checking phase

The initial motivation for this project was investigating the impact of capability checks on performance. Rather than check each element's access invidually, we determine a set of "fast-path" checks which count as checks for multiple elements at once. In the emulator, this is done by computing the "tight bounds" for each access, i.e. the exact range of bytes that will be accessed, and doing a single capability check with that bounds. Section 3.2 describes methods for calculating the "tight bounds" for each access type, and ways that architectural complexity can be traded off to calculate *wider* bounds.

 $<sup>^{12}</sup>$ See Section 3.1.2.4 for the reasoning behind this decision.

If the tight bounds don't pass the capability check, the emulator raises an imprecise trap and stops immediately. In the case of fault-only-first loads, where synchronous exceptions (e.g. capability checks) are explicitly handled, the access continues regardless and elements are checked individually. This is also the expected behaviour if a capability check for *wider* bounds fails. The emulator deviates from the spec in that vstart is *not* set when the tight bounds check fails, as it does not know exactly which element would have triggered the exception. As noted in Section 3.2, a fully compliant machine must check each access to find vstart in these cases.

#### 3.1.2.3 Execution phase

If the fast-path check deems it appropriate, the emulator continues execution of the instruction in two phases. First, the mapping of vector elements to accessed memory addresses is found. The code for this step is independent of the access direction, and an effective description of how each type of access works. It can be found in Appendix B.1. The previously computed tight bounds are sanity-checked against these accesses, and the accesses are actually performed.

#### 3.1.2.4 Integer vs. Capability encoding mode

As noted in Section 2.6.3 CHERI-RISC-V defines two execution modes that the program can switch between. In Integer mode "address operands to existing RISC-V load and store opcodes contain integer addresses" which are implicitly dereferenced relative to the default data capability, and in Capability mode those opcodes are modified to use capability operands. Integer mode was included in the interests of maintaining compatibility with legacy code that hasn't been adapted to capabilities. As similar vector code may also exist, CHERI-RVV treats vector memory access instructions as "existing RISC-V load and store opcodes" and requires that they respect integer/capability mode.

## 3.2 Fast-path calculations

A fast-path check can be performed over various sets of elements. The emulator chooses to perform a single fast-path check for each vector access, calculating the tight bounds before starting the actual access, but in hardware this may introduce prohibitive latency. This section describes the general principles surrounding fast-paths for CHERI-RVV, notes the areas where whole-access fast-paths are difficult to calculate, and describes possible approaches for hardware.

|                                 |                                                   | Success else if range is wide then Likely-Failure else if fault-only-first then Likely-Failure else Failure end if else Unchecked end if |
|---------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Success                         | All accesses will succeed                         |                                                                                                                                          |
| Failure                         | At least one access will raise an exception       |                                                                                                                                          |
| •                               | At least one access <i>may</i> raise an exception |                                                                                                                                          |
| (a) Possible fast-path outcomes |                                                   | (b) Algorithm                                                                                                                            |

if can calculate range then

if range is within capability then

Figure 3.1: Fast-path outcomes

## 3.2.1 Possible fast-path outcomes

In some cases, a failed address range check may not mean the access fails. The obvious case is fault-only-first loads, where capability exceptions may be handled without triggering a trap. Implementations may also choose to calculate wider bounds than accessed for the sake of simplicity, or even forego a fast-path check altogether. Thus, a fast-path check can have four outcomes depending on the circumstances.

A Success means no per-access capability checks are required. Likely-Failure and Unchecked results mean each access must be checked, to see if any of them actually raise an exception. Unfortunately, accesses still need to be checked under Failure, because both precise and imprecise traps need to report the offending element in vstart<sup>13</sup>.

Because all archetypes may have Failure or Likely-Failure outcomes, hardware must provide a fallback slow-path for each archetype which checks/performs each access in turn. In theory, a CHERI-RVV specification could relax the vstart requirement for imprecise traps, and state that all capability exceptions trigger imprecise traps. In this case, only archetypes that produce Likely-Failure outcomes need the slow-path. However, it is likely that for complexity reasons all masked accesses will use wide ranges, thus producing Likely-Failure outcomes and requiring slow-paths for all archetypes anyway. Because the Likely-Failure and Failure cases require the slow-path anyway, computing the fast-path can only be worthwhile if Success is the common case.

 $<sup>^{13}</sup>$ In very particular cases, e.g. unmasked unit-strided accesses where nf = 1, the capability bounds could be used to calculate what the offending element must have been. We believe this is too niche of a use case to investigate further, particularly given the complexity of the resulting hardware.

### 3.2.2 Whole-access fast-paths

It is technically possible to calculate a fast-path for the entirety of an access (see Appendix C), but for some situations it may be equally/more expensive than checking each access. For example, the bounds for masked accesses depend on finding the minimum and maximum active indices, which in hardware may require a linear scan. Indexed accesses require finding the minimum/maximum offset values, which likely requires an expensive parallel reduction over all/some elements. In these cases hardware implementations could defer to the slow-path on all masked/indexed accesses, or for masked accesses use the wider, unmasked bounds and generate Likely-Failure outcomes. Unit and strided accesses are much easier to handle.

Arbitrarily strided accesses (which may have positive, negative, or zero-valued strides) are relatively simple to calculate. After calculating the segment width (i.e. number of fields \* element width) the full bounds just depends on the sign of the stride. Unit-stride accesses simplify this further, because the stride is equal to the segment width and guaranteed to be positive. [TODO7 Captions for equations]

$$bounds(\texttt{stride}) = base + \begin{cases} [\texttt{vstart} * \texttt{stride}, (\texttt{evl} - 1) * \texttt{stride} + \texttt{nf} * \texttt{eew}) & \texttt{stride} \geq 0 \\ [(\texttt{evl} - 1) * \texttt{stride}, \texttt{vstart} * \texttt{stride} + \texttt{nf} * \texttt{eew}) & \texttt{stride} < 0 \end{cases}$$

$$base + [vstart*nf*eew, evl*nf*eew)$$
 (3.2.2)

Ultimately, the potential up-front latency seemed like a dealbreaker for this approach. We turned our attention to fast-pathing smaller groups of elements.

## 3.2.3 *m*-element known-range fast-paths

A hardware implementation of a vector unit may be able to issue m requests within a set range in parallel. For example, elements in the same cache line may be accessible all at once. In these cases, checking elements individually would either require m parallel bounds checks, m checks' worth of latency, or something inbetween. In this subsection we consider a fast-path check for m elements.

Capability checks can be split into two steps: address-agnostic (e.g. permissions checks, bounds decoding) and address-dependent (e.g. bounds checks). Address-agnostic steps can be performed before any bounds checking, and should add minimal start-up latency (bounds decoding must complete before the checks anyway, and permission checks can be performed in parallel). Once the bounds are decoded the actual checks consist of minimal logic <sup>14</sup>, so a fast-path must have very minimal logic to compete.

 $<sup>^{14}</sup>$ Likely requires two arithmetic operations per element, for checking against the top and bottom bounds.

We first consider unit and strided accesses, and note two approaches. First, one could amortize the checking logic cost over multiple sets of m elements by operating in terms of cache lines. Iterating through all accessed cache lines, and then iterating over the elements inside, allows the fast-path to hardcode the bounds width and do one check for multiple cycles of work (if cache lines contain more than m elements). Cache-line-aligned allocations benefit here, as all fast-path checks will be in-bounds i.e. Successful, but misaligned data is guaranteed to create at least one Likely-Failure outcome per access (requiring a slow-path check). Calculating tight bounds for the m accessed elements per cycle could address this.

For unit and strided accesses, the bounds occupied by m elements is straightforward to calculate, as the addresses can be generated in order. The minimum and maximum can then be picked easily to generate tight bounds. An m-way multiplexer is still required for taking the minimum and maximum, because evl and vstart may not be m-aligned. If m is small, this also neatly extends to handle masked/inactive elements. This may use less logic overall than m parallel bounds checks, depending on the hardware platform <sup>15</sup>, but it definitely uses more logic than the cache-line approach. Clearly, there's a trade-off to be made.

Indexed fast-paths are more complicated, because the addresses are unsorted. The two approaches above have different advantages for indexed accesses. If the offsets/indices are spatially close, just not sorted, cache line checks may efficiently cover all elements. An implementation could potentially cache the results, and refer back for each access, instead of trying to iterate through cache lines in order. Otherwise a m-way parallel reduction could be performed to find the min and max, but that would likely take up more logic than m comparisons. This may be a moot point depending on the cache implementation though - if the m accesses per cycle must be in the same cache line, and the addresses are spread out, you're limited to one access and therefore one check per cycle regardless.

In summary, there are fast-path checks that consume less logic than m parallel checks in certain circumstances. Even though a slow-path is always necessary, it can be implemented in a slow way (e.g. doing one check per cycle) to save on logic. Particularly if other parts of the system rely on constraining the addresses accessed in each cycle, a fast-path check can take advantage of those constraints.

# 3.3 Going beyond the emulator

The emulator is a single example of a conformant CHERI-RVV implementation, and does not exercise every part of the specification. Four properties stand out:

 The emulator assumes all element accesses are naturally aligned, but the spec allows misaligned accesses.

<sup>&</sup>lt;sup>15</sup>e.g. on FPGAs multiplexers can be relatively cheap.

- The emulator doesn't consider multiple hardware threads, essentially assuming all accesses are atomic.
- Segments/elements are always accessed in order, despite the spec not enforcing ordering
- Imprecise traps are used for all exceptions precise trap behaviour is not explored.

This section notes how relaxed access ordering and precise exceptions may affect the hardware in ways not previously explored.

#### 3.3.1 Misaligned accesses

Implementations are allowed to handle vector accesses that are not aligned to the size of the element. This support is independent of misaligned scalar access support, so if e.g. misaligned 64-bit scalar accesses are allowed, misaligned vector accesses of 64-bit elements do *not* have to be allowed.

Changing the emulator to allow misaligned accesses of integer data would not have any impact on CHERI correctness. Capability loads/stores must be aligned to CLEN[7, Section 3.5.2], and an implementation cannot change this. Writing misaligned integer values across a CLEN boundary would need to make sure to zero the tag bit on both regions, but this applies to scalar implementations as much as vector ones. Alignment only impacts CHERI-RVV to the extent that it impacts capabilities-in-vectors (Section 5.2).

#### 3.3.2 Atomicity of accesses/General memory model

Vector memory instructions are specified to follow the RISC-V Weak Memory Ordering model[2]<sup>16</sup>, although this model hasn't been fully explained in terms of vectors yet. RVWMO defines a global order of "memory operations": atomic operations that are either loads, stores, or both[6, Chapter 14]. The RVWMO spec assumes all memory instructions create exactly one memory operation but calls out that once the vector model is formalized, vector accesses may be defined to create multiple operations.

The RVV spec states "vector misaligned memory accesses follow the same rules for atomicity as scalar misaligned memory accesses", i.e. that misaligned accesses may be decomposed into multiple memory operations of any granularity<sup>17</sup>. This is the only mention of atomicity in that document.

Again, atomicity of integer data doesn't really impact the fusion of CHERI and RVV, as long as tag bits are correctly zeroed on all integer writes. However, it does impact capabilities-in-vectors (Section 5.2).

<sup>&</sup>lt;sup>16</sup>Behaviour under the Total Store Ordering extension hasn't been defined.

<sup>&</sup>lt;sup>17</sup>e.g. each byte could be written in a separate access.

#### 3.3.3 Relaxed access ordering and precise traps

Ordering is only enforced insofar as it is observable. The only instructions that are forced to perform their accesses in order are indexed-ordered accesses, which can be used to write to e.g. I/O regions where order matters, and instructions that trigger precise traps. Precise traps require vstart to be set to a value such that all elements before vstart have completed their accesses, and all accesses on/after vstart have not completed or are idempotent.

If a vector memory access instruction is 1. not indexed-ordered and 2. guaranteed not to trigger a precise trap<sup>18</sup> then it may execute out of order. This does not affect CHERI-RVV in any way.

# 3.4 Testing hypotheses

#### Hypothesis H-1 - Feasibility

It is possible to use CHERI capabilities as memory references in all vector instructions.

This is true. All vector memory access instructions index the scalar general-purpose register file to read the base address, and CHERI-RVV implementations can simply use this index for the scalar capability register file instead. This can be considered through the lens of adding CHERI to any RISC-V processor, and in particular adding Capability mode to adjust the behaviour of legacy instructions. RVV instructions can have their behaviour adjusted in exactly the same way as the scalar memory access instructions.

That approach then scales to other base architectures that have CHERI variants. For example, Morello's scalar Arm instructions were modified to use CHERI capabilities as memory references[25, Section 1.3], so one may simply try to apply those modifications to e.g. Arm SVE instructions. This only works where Arm SVE accesses memory references in the same way as scalar Arm instructions did i.e. through a scalar register file.

Arm SVE has some addressing modes like u64base, which uses a vector as a set of 64-bit integer addresses[26]. This has more complications, because simply dereferencing integer addresses without a capability is insecure. Would a CHERI version convert this mode to use capabilities-in-vectors, breaking compatibility with legacy code that expects integer references? Another option would be to only enable this instruction in Integer mode, and dereference relative to the DDC. It's possible to port this to CHERI, but requires further investigation and thought.

<sup>&</sup>lt;sup>18</sup>Even instructions that *would* trigger precise traps but are guaranteed not to throw an exception or respond to asynchronous interrupt may execute out of order.

#### Hypothesis H-2 - Fast-path checks

The capability bounds checks for vector elements within a known range (e.g. a cache line) can be performed in a single check, amortizing the cost.

This is also true, at least for Successful accesses. Because the RVV spec requires that the faulting element is *always* recorded[2, Section 17], a Failure due to a capability violation requires elements to be checked individually. CHERI-RVV could change the specification so the faulting element doesn't need to be calculated, which would make Failures faster, but that still requires Likely-Failures to take the slow-path.

There are many ways to combine the checks for a set of vector elements, which can take advantage of the range constraints. For example, a unit-stride access could a hierarchy of checks: cache-line checks until a Likely-Failure, then tight m-element bounds until a Likely-Failure, then the slow-path. However, the choice of fast-path checks is inherently a trade-off between latency, area, energy usage, and more. Picking the right one for the job is highly dependent on the existing implementation, and indeed an implementation may decide that parallel per-element checks is better than a fast-path.

# THE CHERI-RVV SOFTWARE STACK

This chapter explores the current state of the CHERI-RVV software stack: mainstream compiler support for vanilla RVV (Section 4.1) and the modifications required to bring support to CHERI-Clang (Section 4.2). The software hypotheses are tested with this knowledge (Section 4.3), and we recommend a set of changes to bring CHERI-Clang support to par with other compilers (Section 4.4).

# 4.1 Compiling vector code

Modern compilers provide many ways to generate vectorized code. While this support is very advanced for well established vector models, like x86-64 AVX, newer vector models like RVV don't have as many options. It can even be difficult to get the compiler to generate any vector instructions at all. This section examines support across the Clang and GCC compilers for various vectorization methods on RVV.

# 4.1.1 Available compilers

Compiler support for RVV varies. On Clang 13 and other LLVM-13-based compilers, version 0.1(?¹) of the vector specification is supported as an experimental extension. Clang/LLVM 14 and up support RVV v1.0. GCC is an interesting case — there is a version based on RISC-V GCC 10.1 that partially supports RVV (see Section 6.1.2.1), but it was left untouched for a year and deleted as of 17th May 2022. See Appendix D.1 for more information on finding and building this version, and Table D.1 for the required command-line arguments to enable RVV.

<sup>&</sup>lt;sup>1</sup>It is difficult to verify the actual corresponding version, because there is no readily available specification for v0.1, and the extension supports instructions only present from v0.8 such as whole register accesses.

#### 4.1.2 Automatic vectorization

[TODO8 Figure page showing generated ASM for increment loop - see godbolt links MD] Compilers with auto-vectorization can automatically create vectorized code from a scalar program. For example, a scalar loop over an array that increments each element could be converted to a vectorized loop that increments multiple elements at once. Currently there is no support for RVV auto-vectorization in Clang or GCC. Both compilers have support for Arm SVE auto-vectorization, explored further in Section 4.1.5.

#### 4.1.3 Vector intrinsics

"Intrinsics" are functions defined by the compiler that can invoke low-level functionality and instructions directly for a specific architecture. When automatic vectorization is not available, intrinsics are the next best thing — they aren't portable across different ISAs, but present a familiar high-level interface (function calls) that gives fine-grained control over instructions. The compiler then handles low-level decisions like register allocation under the hood, and sometimes may provide extra functionality for ease of use.

RVV has a comprehensive set of vector intrinsics[27]. With these, the general stripmining loop is easy to construct: [TODO9 example based on https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/master/examples/rvv\_memcpy.c]

- 1. Use a vsetvl intrinsic to get the vector length for this iteration.
- 2. Allocate vector registers by declaring variables with vector types (e.g. vuint32m8\_t represents 8 registers worth of 32-bit unsigned integers).
- 3. Pass the vector length to the computation/memory intrinsics, which operate on the vector variables.

# 4.1.4 Inline assembly

If a compiler doesn't supply complete intrinsics, or if the programmer desires even more control, inline assembly may be used. The programmer gives a string of handwritten assembly code to the compiler, which is parsed and directly inserted into the output code. The compiler still has to parse the instruction, but this method does not depend on any intrinsics being present (or functional). We used this for CHERI-Clang programs, as it could not use intrinsics without crashing.

Inline assembly can interact with C code and variables through a template syntax. The programmer inserts a placeholder in the assembly code with a corresponding expression, noting how the expression is stored using a "constraint". For our purposes, constraints enforce that a value is either in a register or in memory.

Using the constraint, the compiler determines how the expression's value is stored, and inserts a reference to it in the assembly string. Because this is done before the assembly string is parsed, and isn't immediately type-checked against the assembly instruction, it can lead to some difficult errors.

Clang and GCC support inline assembly for RVV quite well, and even allows the intrinsic vector types to be referenced by assembly templates (thus making the compiler do register allocation instead of the programmer). The only caveat is that *memory* constraints are not supported by RVV memory accesses. None of the vector memory access instructions support address offsets, unlike their scalar counterparts. Clang always treats the *memory* constraint as an offset access, even when that offset is zero, so it adds an offset to the assembly string (Fig. 4.1c), making it invalid. To get around this, one must use the pointer itself with a *register* constraint (Fig. 4.1d). On CHERI platforms, because pointers must be stored in capability registers, the *capability register* constraint must be used instead (Fig. 4.1e).

Broadly speaking, inline assembly supports more RVV instructions than intrinsics do. It is used extensively in the testbench code for the evaluation (Chapter 6) alongside intrinsics where possible.

```
int* ptr; int val; vint32m1_t vec_val;
```

#### (a) Preamble

```
// "ld a0, 0(a0)" - valid
asm ("ld %0, %1"
: "=r"(val)
: "m"(*ptr));
```

```
// "vle32.v v8, O(a0)" - invalid
asm ("vle32.v %0, %1"
: "=vr"(vec_val)
: "m"(*ptr));
```

(b) Load scalar from memory

```
// "vle32.v v8, (a0)" - valid
asm ("vle32.v %0, (%1)"
: "=vr"(vec_val)
: "r"(ptr));
```

**(c)** Failed attempt to load vector from memory

```
// "vle32.v v8, (ca0)" - valid
asm ("vle32.v %0, (%1)"
: "=vr"(vec_val)
: "c"(ptr));
```

(d) Load vector from pointer in register

(e) Load vector from capability in register

(f) Portable code for CHERI and non-CHERI.

pure\_capabilities is a feature added to CHERI-Clang for
this project

Figure 4.1: Inline assembly examples https://godbolt.org/z/rW9orr66a
[TODO10 Change the code font]

#### 4.1.5 RVV vs. Arm SVE

Summarizes [26]

Arm SVE uses a similar model to RVV, where the vector length may scale between 128 and 2048<sup>2</sup> and the instructions are designed to be totally agnostic across different platforms[1]. Arm have released a C language extension to support SVE development ([28]), supported by the Arm Compiler for Embedded<sup>3</sup>, Clang, and GCC. They support all of the previously examined vectorization types.

Auto-vectorization is supported, and the main focus of the user guide is helping the compiler decide whether to auto-vectorize [26]. Intrinsics are also supported, and seem to cover all of the SVE instructions, but take a slightly different approach to RVV. Arm SVE intrinsics do not directly map to available instructions, but aim to "provide a regular interface and leave the compiler to pick the best mapping to SVE instructions", while RVV intrinsics (at least for memory) tend to map 1:1 to existing instructions. Arm's approach gives more flexibility for future extensions, as the same intrinsics could be compiled to new instructions with newer compilers.

Arm SVE also supports inline assembly, but the experience is notably worse than for RVV. The two standout issues are a lack of register allocation and the use of condition code flags for branching. Unlike RVV, the intrinsic types for vector values cannot be referenced in inline assembly[1], so all vector registers must be allocated and tracked by the programmer. Arm SVE's equivalent of vsetvl, the while family[28], do not return the number of updated elements, and instead set the condition flags based on how many elements are updated. Because there is no way to branch based on the condition flags in C, the programmer must manually insert a label for the top of the loop, and a branch to that label (see https://godbolt.org/z/zoWh9jq3o), which is more error prone than the RVV method. Examples of Arm SVE code with autovectorization, intrinsics, and inline ASM can be found in [TODO11 appendix based on https://godbolt.org/z/zoWh9jq3o].

<sup>&</sup>lt;sup>2</sup>RVV slightly differs here, as it allows VLEN smaller than 128.

<sup>3</sup>https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Embedded

# 4.2 Compiling vector code with CHERI-Clang

Current CHERI compiler work is done on CHERI-Clang, a fork of Clang and other LLVM tools that supports capabilities. It's based on LLVM 13, so it includes support for vanilla RVV v0.1, but none of the vector-extension related code had been updated to work with capabilities. This section outlines the changes one has to make to CHERI-Clang and vector code to compile programs for CHERI-RVV. The required command-line options for CHERI-Clang are shown in Table D.2.

#### 4.2.1 Adapting vector assembly instructions to CHERI

LLVM uses a domain-specific language to describe the instructions it can emit for a given target. The RISC-V target describes multiple register sets that RISC-V instructions can use. Vanilla RVV vector memory accesses use the General Purpose Registers (GPR) to store the base address of each access. CHERI-Clang added a GPCR set, i.e. the General Purpose Capability Registers. As noted in Section 3.1.2.4 CHERI-RVV requires the vector memory accesses to support integer and capability mode, therefore two versions of the vector accesses must be created: versions which take a capability base address, only available in CHERI/Capability mode, and versions which take integer base addresses for Integer mode.

[TODO12 Appendix on how this was done?] With the above changes, inline assembly could be used to insert capability-enabled vector instructions[TODO13 Example]. However, as this requires using a capability register constraint for the base address, inline assembly code written for CHERI-RVV is not inherently compatible with vanilla RVV. For un-annotated pointers (e.g. int\*), which are only capabilities in pure-capability code and integers in legacy or hybrid code, a conditional macro can be used to insert the correct constraint: [TODO14 example]. However, this falls apart in hybrid code for manually annotated pointers (e.g. int\* \_\_capability) because the macro cannot detect the annotation.

#### 4.2.2 Adapting vector intrinsics to CHERI

Vector intrinsics are another story entirely. When compiling for pure-capability libraries, all attempts to use vector intrinsics crash CHERI-Clang [TODO15 example of error message?]. This is due to a similar issue to inline assembly: the intrinsics (both the Clang intrinsic functions and the underlying LLVM IR intrinsics) were designed to take regular pointers and cannot handle it when capabilities are used instead. [TODO16 Appendix which covers what I know so far about this problem?] Unfortunately the code for generating the intrinsics on both levels is spread across many files, and there's no simple way to change the associated pointer type (much less changing it for pure capability vs. hybrid mode).

It seems that significant compiler development work is required to bring vector intrinsics up

to scratch on CHERI-Clang. We did experiment with creating replacement wrapper functions, where each function tried to mimic an intrinsic using inline asssembly. These were rejected for two reasons: the increased overhead of a function call on every vector instruction<sup>4</sup>, and the lack of support for passing vector types as arguments or return values. The RISC-V ABI treats all vector registers as temporary and explicitly states that "vector registers are not used for passing arguments or return values" [29], and CHERI has its own issues with saving vector registers on the stack.

[TODO17 segue into saving registers on stack doesn't make sense - doesn't explain why someone would want to]

#### 4.2.3 Storing scalable vectors on the stack

If a program uses more data than can fit in registers, or calls a function which may overwrite important register values, the compiler will save those register values to memory on the stack. Because vector registers are temporary, and thus may be overwritten by called functions, they must also be saved/restored from the stack[TODO18 Example https://god-bolt.org/z/KPTW7rcvY]. This also applies to multiprocessing systems where a process can be paused, have the state saved, and resume later. RVV provides the whole-register memory access instructions explicitly to make this process easy.

CHERI-Clang contains an LLVM IR pass<sup>5</sup> which enforces strict bounds on so-called "stack capabilities" (capabilities pointing to stack-allocated data), which by definition requires knowing the size of the data ahead of time. This pass assumes all stack-allocated data has a static size, and crashes when dynamically-sized types e.g. scalable vectors are allocated. It is therefore impossible (for now) to save vectors on the stack in CHERI-Clang, although it's clear that it's theoretically possible. For example, the length of the required vector allocations could be calculated based on VLEN before each stack allocation is performed, or if performance is a concern stack bounds for those allocations could potentially be ignored altogether. These possibilities are investigated further in the next section.

[TODO19 Note: Arm Language C extensions https://developer.arm.com/document-ation/100987/0000/ defines the concept of a sizeless type, which may be stored on the stack. Would be a good base for RVV?]

[TODO20 LLVM IR pass is investigated in TR-949 \$3.8.2, worth noting somewhere]

<sup>&</sup>lt;sup>4</sup>This could have been eliminated by using preprocessor macros instead of real functions, but they are difficult to program and do not easily support returning values like intrinsics do.

 $<sup>^5</sup> llvm/lib/Code Gen/Cheri Bound Allocas.cpp \\$ 

# 4.3 Testing hypotheses

#### Hypothesis H-3 - Compiling/running legacy code in integer mode

Vector code can be compiled in legacy forms (with integer addressing) and function correctly on CHERI with no source code changes.

This is true for CHERI-RVV, when running the compiled programs in integer mode, as long as the programs only access memory within the DDC.

All vanilla RVV instructions have counterparts with identical encodings and behaviour in CHERI-RVV integer mode, assuming the accessed addresses are all accessible through the DDC. There are no changes to instruction behaviour that require the compiler's handling of them to change, so a non-CHERI compiler and an integer-mode-CHERI compiler can always produce the same vector instructions from the same code. This does not apply to capability-mode-CHERI, because integer addressing is not supported in capability-mode-CHERI-RVV.

All legacy vector programs should produce equivalent binaries when compiled for integer-mode-CHERI. On top of that, all binaries compiled for non-CHERI RVV platforms should produce the same results when run on an equivalent integer-mode-CHERI RVV platform. Both claims assume the program doesn't perform accesses that [TODO21 violate? exceed?] the DDC.

#### Hypothesis H-4 - Converting legacy code to pure-capability code

Legacy vector code can be compiled into a pure-capability form with no changes.

This is true for CHERI-RVV, but cannot be done in practice yet. Some engineering effort is required to support this in CHERI-Clang. Because this argument concerns source code, all three ways to generate CHERI-RVV instructions must be examined.

#### Inline Assembly – Unlikely

For GCC-style inline assembly, it is currently impossible for integer-addressed RVV source code to be recompiled in pure-capability mode without modification. Integer-addressed RVV assembly uses general-purpose registers for the base address, but the pure-capability instructions require capability registers instead. The base address register can either be specified directly, so must be changed to a capability register; or specified using template syntax and an "r" constraint (Section ??1), which must be changed to a capability "cr" constraint. Using a preprocessor macro in the template syntax (Section ??2) could make code portable between non-CHERI and CHERI, but adding this still requires source code changes.

In theory, one could change the behaviour of inline assembly to automatically convert general purpose registers/constraints to capability versions in specific circumstances. However,

this can have wide-reaching ramifications, potentially making code more difficult to understand, or even breaking existing code.

#### Intrinsics — Yes

The current specification for RVV intrinsics uses pointer types for all base addresses[27]. In pure-capability compilers all pointers should be treated as capabilities instead of integers, including those in intrinsics. All RVV memory intrinsics have equivalent RVV instructions, which all use capabilities in pure-capability mode, so changing the intrinsics to match is valid.

Assuming all base address pointers are created in a valid manner (e.g. through malloc or monotonic decrease, and not through integer literals), the conversion to pure-capability should make them all valid capabilities which are compatible with the intrinsics. Therefore well-behaved code using RVV intrinsics should be compilable in pure-capability mode without changes.

This is not currently the case for CHERI-Clang, as RVV memory access intrinsics are broken, but this can be fixed with engineering effort.

#### Auto-vectorization — Yes

All vanilla RVV instructions have counterparts with identical encodings and behaviour in CHERI-RVV pure-capability mode, assuming the base addresses can be converted to valid capabilities. Any scalar code that can be a) compiled in scalar pure-capability mode<sup>6</sup>, and b) autovectorized by a legacy RVV compiler, must have an equivalent pure-capability vectorized form. This form could be acquired by performing the auto-vectorization in legacy mode, ensuring all base addresses are available as capabilities, then making the vector instructions use those capabilities. Therefore a pure-capability compiler can always auto-vectorize CHERI-compliant scalar code if some legacy compiler can also auto-vectorize it.

This is not currently possible for CHERI-Clang, as RVV auto-vectorization is not implemented yet. Similar models (e.g. Arm SVE) already have auto-vectorization, so RVV auto-vectorization (and thus CHERI-RVV auto-vectorization) should be possible.

#### Hypothesis H-5 - Saving vectors on the stack

Vector code that saves/restores variable-length vectors to/from the stack can be compiled on CHERI-RVV with no source code changes.

This is true in theory, but not yet supported by CHERI-Clang in practice. Placing variable-length structures on the stack is possible as long as the length can be known at runtime (and as long as the stack has space, of course). This isn't exclusive to CHERI — to push and pop values

<sup>&</sup>lt;sup>6</sup>This ensures all memory accesses use valid capabilities.

on the stack, the stack pointer must be incremented or decremented by the size of the value. Because the length already has to be measured, and CHERI-RISC-V supports setting capability bounds from runtime-computed values, it's entirely possible to correctly set tight bounds for capabilities pointing to variable-length vectors on the stack.

# Hypothesis H-6 - Running CHERI-RVV code in a multiprocessing system

CHERI-vector code can run correctly in multiprocessing systems, where execution may be paused and resumed on interrupts or context switches.

This requires two conditions: an OS must be able to save and restore vector state, and the vector hardware must support resuming from an interrupted state. The first condition is easy to fulfil by extending the previous hypothesis. If it is possible to save variable-length vectors on the stack, given their length is known at runtime, it must also be possible to save their data on the heap. Some OSs might need to make changes to their "current process state" structure to support variable-length data, and they would also need to allocate space for the vtype value, but it is certainly possible.

The second condition can be upheld in two ways. First, if the OS only context switches and services interrupts while the vector hardware is in a complete state (i.e. not partially executing an instruction), then context switches and interrupts are completely transparent to the vector hardware and no changes need to be made. Secondly, if context switches and interrupts can actually interrupt vector instructions partway through, then they can only be cleanly resumed if the vector hardware supports precise traps for the exact instruction being executed.

# 4.4 Recommended changes for CHERI-Clang

- Build on current work to make all RVV memory access instructions and pseudoinstructions CHERI-compatible.
- Make RVV memory access intrinsics take capabilities as arguments when compiled in pure-capability mode.
- Make the CheriBoundAllocas IR pass handle scalable vectors by finding the length at runtime.
- Bring up CHERI-enabled auto-vectorization in parallel with vanilla auto-vectorization.

# CAPABILITIES-IN-VECTORS

Implementing memcpy correctly for CHERI systems requires copying the tag bits as well as the data. As it stands, any vectorized memcpy compiled and executed on the systems described in Chapters 3 and 4 will not copy the tag bits, because the vector registers cannot store the tag bits and indeed cannot store valid capabilities. This chapter examines the changes made to the emulator to support storing capabilities-in-vectors, and determines the conditions required for the related hypotheses to be true. Appendix E.1 lists the changes made and all the relevant properties of the emulator that allow storing capabilities in vectors.

# 5.1 Changing the emulator

This part of the project aimed to investigate Hyps. H-7 to H-9, which led to the following goals:

- (Hypothesis H-7) Vector registers should be able to hold capabilities
- (Hypothesis H-8) At least one vector memory operation should be able to load/store capabilities from vectors
  - Because memcpy should copy both integer and capability data, the vector memory operations should be able to handle both
- (Hypothesis H-9) Vector instructions should be able to affect capabilities in some way
  - Clearing the tag bit on a vector register counts as manipulation

First, we had to consider the impact of vectors on the theoretical vector model. We decided that any operation with element widths less than CLEN cannot output valid capabilities under any circumstances<sup>1</sup>. This, of course, means a new element width equal to CLEN must be

<sup>&</sup>lt;sup>1</sup>This avoids edge cases with masking, where one part of a capability could be modified while the other parts are left alone.

introduced. We set ELEN = VLEN = CLEN =  $128^2$  for our vector unit.

Two new memory access instructions were created to take advantage of this new element width, and the vsetvl family were adjusted to support 128-bit values. Similar to the CHERI-RISC-V LC/SC instructions, we implemented 128-bit unit-stride vector loads and stores, which took over officially-reserved encodings<sup>3</sup> we expected official versions to use. The memory instructions had to be added to CHERI-Clang manually, and Clang already has support for setting SEW=128 in the vsetvl family (Table E.1). These instruction changes affected inline assembly only, rather than adding vector intrinsics, because CHERI-Clang only supports inline assembly anyway.

The next step was to add capability support to the vector register file. Our approach to capabilities-in-vectors is similar in concept to the Merged scalar register file for CHERI-RISC-V (Section 2.6.1), in that the same bits of a register can be accessed in two contexts: an integer context, zeroing the tag, or a capability context which maintains the current tag. The only instructions which can access data in a capability context are the aforementioned 128-bit memory accesses<sup>4</sup>. All other instructions will read out untagged integer data, and clear tag bits when writing data. A new CHERI-specific vector register file was created, where each register is a SafeTaggedCap (page 32) i.e. either zero-tagged integer data or a valid tagged capability.

[TODO22 This paragraph is bad. Plan: split the above paragraph into two, incorporate impact on Provenance and required memory accesses into second half.] Using SafeTaggedCap had a few major consequences. Firstly, SafeTaggedCap enforces the Provenance security property within the vector unit. Secondly, it reuses the code path for accessing scalar capabilities in memory, so all related security properties are maintained (e.g. accesses must be 128-bit aligned, and are atomic).

#### 5.1.1 Testing

The above emulator changes allowed a minimal memcpy example to be constructed, which could copy both capabilities and integer data mixed together. This, along with another test to ensure arithmetic accesses the register file in an integer context, is described in the evaluation (Chapter 6).

<sup>&</sup>lt;sup>2</sup>The tag bits are implicitly instead of explicitly included here because VLEN, ELEN must be powers of two.

<sup>&</sup>lt;sup>3</sup>The RVV spec mentions, but does not specify, potential encodings for 128-bit element widths and instructions ([2, p10, p32], Table E.2).

<sup>&</sup>lt;sup>4</sup>The encoding mode (Section 2.6.3) does not affect register usage: when using the Integer encoding mode, instructions can still access the vector register in a capability context. This is just like how scalar capability registers are still accessible in Integer encoding mode.

# 5.2 Testing hypotheses

#### **Hypothesis H-7 - Holding capabilities in vectors**

It is possible for vector registers to hold capabilities to enable copying without violating CHERI security principles.

It is possible for a single vector register to hold a capability (and differentiate a capability from integer data) as long as VLEN = CLEN. VLEN could also be larger, and a compliant implementation must then have VLEN be an integer multiple of CLEN. In theory, one could also describe a scheme where capabilities must be held by multiple registers together (e.g. VLEN = CLEN/2 with one tag bit for every two registers), but this would complicate matters.

If an implementation decides, as we did, that elements of width CLEN are required to produce capabilities, then  $VLEN \geq ELEN$  therefore  $VLEN \geq CLEN$ . If a short VLEN is absolutely essential, one could place precise guarantees on a specific set of instructions to enable it (e.g. SEW=64, LMUL=2 unit-stride unmasked loads could guarantee atomic capability transfers) but the emulator does not consider this. The CHERI security properties also impose some conditions.

#### **Provenance & Monotonicity**

The tag bit must be protected such that capabilities cannot be forged from integer data. The emulator's integer/capability context approach, where the tag bit may only be set on copying a valid capability from memory, and the output tag bit is zeroed on all other accesses, enforces this correctly.

#### **Integrity**

Integrity is not affected by how a capability is stored, as long as the other properties are maintained.

#### Hypothesis H-8 - Sending capabilities between vectors and memory

It is possible for vector memory accesses to load and store capabilities from vector registers without violating CHERI security principles.

For this to be the case, the instructions which can load/store capabilities must fulfil certain alignment and atomicity requirements. They must require all accesses be CLEN-aligned, or at least only load valid caps from aligned addresses, because tag bits only apply to CLEN-aligned regions. TR-951 states that capability memory accesses must be atomic[7, Section 11.3]. This applies to vectors, even in ways that don't apply to scalar accesses.

Individual element accesses for a vector access must be atomic relative to each other. This is relevant for e.g. a strided store using an unaligned stride, such that one element writes a valid capability and another element overwrites part of that address range. If unaligned 128-bit accesses are allowed, then either the unaligned second element should "win" and clear relevant tag bits, or the first element should "win" and write the full capability atomically. The emulator requires all 128-bit accesses to be aligned so meets this requirement easily.

#### **Provenance**

Provenance requires the accesses be atomic as described above, and require that tag bits are copied correctly: the output tag bit must only be set if the input had a valid tag bit. These conditions also apply to scalar accesses.

#### Monotonicity

These loads/stores do not attempt to manipulate capabilities, so have no relevance to monotonicity.

#### **Integrity**

The same conditions for scalar and other vector accesses apply to maintain Integrity: namely that the base capability for each access should be checked to ensure it is valid. The emulator doesn't allow capabilities-in-vectors to be dereferenced directly, but if an implementation allows it those capabilities would also need to be checked.

# Hypothesis H-9 - Manipulating capabilities in vectors

It is possible for vector instructions to manipulate capabilities in vector registers without violating CHERI security principles.

The emulator limits all manipulation to clearing the tag bit, achieved by writing data to the register in an integer context. In theory, it's possible to do more complex transformations, which can be proven by implementing each vector manipulation on vector elements as sequential scalar manipulations on scalar elements. With this method, all pre-existing scalar capability manipulations can become vector manipulations, but the utility seems limited. For example, instructions for creating capabilities or manipulating bounds en masse don't have an obvious use case. If more transformations are added they should be considered carefully, rather than creating vector equivalents for all scalar manipulations.

#### **Provenance & Monotonicity**

Because the only possible manipulations clear the tag bit, it's impossible to create or change capabilities, so Provenance and Monotonicity cannot be violated. Any manipulations that create capabilities, or potentially any manipulations that transfer capabilities from vector registers directly to scalar registers, would require more scrutiny.

#### **Integrity**

As stated before, capabilities-in-vectors cannot be dereferenced directly, so there is no impact on Integrity.

# **EVALUATION**

[TODO23 Rewrite intro] [TODO24 Upgrade subsection -> section]

# 6.1 Testing the emulator

In order to verify the correct behaviour of the emulator, a set of self-checking integration test C programs were constructed. All the programs follow the same pattern: each test program contains multiple tests, which each initialize an environment, perform the action under test, and check the impact on the environment was correct. The individual test results are combined into a 64-bit bit vector and written to a special I/O memory location. A bit vector noting which tests were run is also written to I/O memory, because some compilers or architectures do not support all tests.

Four compilers are tested, described in detail in Chapter 4. They include a build of GCC with support for vector intrinsics (Appendix D.1), a binary distribution of Clang/LLVM-13, a custom build of Clang/LLVM-15, and a custom build of CHERI-Clang (based on LLVM 13) with changes to support RVV. 32-bit and 64-bit non-CHERI architectures are tested, along with 64-bit CHERI in Capability and Integer modes (Section 2.6.3). This section describes the tests in more detail, and notes any limitations in the software stack that prevent testing under some compilers/architectures.

#### 6.1.1 hello\_world

hello\_world runs three small functions which calculate Fibonacci numbers and factorials, testing that the emulator can handle basic programming constructs like recursive functions. Fibonacci is calculated in two ways: once with a simple recursive function, and once with *memoization* where previous outputs are cached in a static array. The tests compile on all compilers, and output the correct results on all architectures.

|                     | RV32    | RV-64   |         |     |       |             |
|---------------------|---------|---------|---------|-----|-------|-------------|
|                     | 11vm-13 | 11vm-13 | llvm-15 | gcc | CHERI | CHERI (Int) |
| factorial(10)       | Y       | Y       | Y       | Y   | Y     | Y           |
| fib(10) (recursive) | Y       | Y       | Y       | Y   | Y     | Y           |
| fib(33) (memoized)  | Y       | Y       | Y       | Y   | Y     | Y           |

**Table 6.1:** hello\_world results — Basic program tests

#### 6.1.2 vector\_memcpy

vector\_memcpy is generated from a Python script, and consists of fifty-seven tests of different vector memory access archetypes under various configurations (Table 6.2). Each of the nine test schemes are replicated for multiple vtype values, such that each individual LMUL and SEW value is tested. These tests are run under *harnesses*, which provide setup and self-checking code for common cases: The Vanilla harness tests a simple memcpy between two arrays; Masked tests that every other element is copied, not all of them; Segmented tests a memcpy into four separate output arrays, each a different field of a four-field structure. There is also a special test for fault-only-first: FoF loads are performed at the edge of mapped memory, and the test verifies that out-of-bounds exceptions are swallowed and v1 is reduced accordingly.

All tests were successful when they ran, showing the emulator is correctly emulating the instructions, but in some cases testbenches weren't available. The full set of test results is available in Appendix F.

| Test Scheme               | Harness   | Compilers    |  |
|---------------------------|-----------|--------------|--|
| Unit Stride               | Vanilla   | All          |  |
| Strided                   | Vanilla   | All          |  |
| Indexed                   | Vanilla   | All          |  |
| Whole Register            | Vanilla   | All          |  |
| Fault-only-First          | Vanilla   | All          |  |
| Unit Stride (Masked)      | Masked    | All          |  |
| Bytemask Load             | Masked    | llvm-15 only |  |
| Unit Stride (Segmented)   | Segmented | All          |  |
| Fault-only-First Boundary | -         | All          |  |

Table 6.2: vector\_memcpy test schemes and harnesses

#### 6.1.2.1 Compiler differences

While most compilers support all memory access archetypes, there are a few notable exceptions. GCC has the most: there is no support for fractional LMUL or bytemask accesses, the intrinsics for segmented accesses are named differently, and fault-only-first intrinsics emit incorrect

instructions<sup>1</sup>. GCC RVV support has been deprioritized in favor of LLVM<sup>2</sup>, so the rough edges make sense. LLVM-13-based compilers (including CHERI-Clang) support all specified archetypes except bytemask accesses. CHERI-Clang doesn't support intrinsics, but all inline assembly support is intact. Support for bytemask accesses is only available in LLVM-14 and up.

#### 6.1.3 vector\_memcpy\_pointers

vector\_memcpy\_pointers tests the behaviour of memcpy with capabilities-in-vectors, by copying an array of data structures that hold pointers to static data. On CHERI platforms, even in Integer mode, capability pointers are used and copied. The first test simply copies the data and tests that all the copied pointers still work, which succeeds on all compilers/architectures. The second test is CHERI-exclusive, and invalidates all pointers during the copy process by performing integer arithmetic on the vector registers. The copied pointers are examined to make sure their tag bits are all zeroed, and this test succeeds on both CHERI configurations.

|                   | RV32    | RV-64   |         |     |       |             |
|-------------------|---------|---------|---------|-----|-------|-------------|
|                   | 11vm-13 | 11vm-13 | llvm-15 | gcc | CHERI | CHERI (Int) |
| Сору              | Y       | Y       | Y       | Y   | Y     | Y           |
| Copy + Invalidate | -       | -       | -       | -   | Y     | Y           |

Table 6.3: vector\_memcpy\_pointers results

#### 6.1.4 Summary

All tests passed where available, and all tests were available on at least one compiler. This shows the emulator is correctly executing vector instructions, and that it is possible to correctly memcpy capabilties in vector registers.

<sup>&</sup>lt;sup>1</sup>On GCC, fault-only-first intrinsics seem to emit vsetvli.

<sup>&</sup>lt;sup>2</sup>https://github.com/riscv-collab/riscv-gcc/issues/320

# Conclusion

In this project we have built a new RISC-V emulator supporting one possible fusion of CHERI and RVV (Chapter 3), explored the impact on the software stack by compiling and running test programs (Chapters 4 and 6), and extended the RVV instruction set/vector model to store capabilities in vectors (Chapter 5). On top of this substantial implementation effort, we have presented conclusions to a set of hypotheses (Table 1.1) proving the general viability of adding scalable vector models to CHERI processsors.

Hypothesis H-1 proved the most essential point, that a blanket replacement of all memory references with capabilities in all RVV instructions is valid. Hypothesis H-2 then alleviated performance concerns by showing it was possible to combine the required capability checks for all vector accesses, amortizing the overall cost of checking, although with varying practical benefit.

On the software side Hyps. H-3 and H-4 showed that (in theory) non-CHERI vectorized code could be run on CHERI systems, and could also be recompiled for pure-capability execution with no source code changes, but that CHERI-Clang's current state adds some practical limitations. Hyps. H-5 and H-6 address the pausing and resuming of vector code, particularly saving and restoring variable-length architectural state, concluding that it is entirely possible but will require software adjustments.

Through a limited investigation of capabilities-in-vectors, Hyps. H-7 to H-9 showed that a highly constrained implementation could enable a fully-functional vectorized memcpy without violating CHERI security principles. It should be possible to extend the CHERI-RVV ISA with vector equivalents of existing CHERI scalar instructions, but I did not investigate this further.

Overall, it is clear that scalable vector models can be adapted to CHERI without significant loss of functionality. While most of the hypotheses are general enough to cover other scalable models, e.g. Arm SVE, any differences from RVV's model will require careful examination. Given the importance of vector processing to modern computing, and thus its importance to CHERI, I hope that this research and the software artifacts of this project (see Appendix G)

pave the way for future vector-enabled CHERI processor implementations.

# REFERENCES

- [1] Nigel Stephens et al. "The ARM Scalable Vector Extension". In: *IEEE Micro* 37.2 (March 2017), pp. 26–39. ISSN: 0272-1732. DOI: 10.1109/MM.2017.35.
- [2] RISC-V "V" Vector Extension. 20th September 2021. URL: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf.
- [3] Robert N M Watson et al. *An Introduction to CHERI*. UCAM-CL-TR-941. September 2019, p. 43.
- [4] László Szekeres et al. "SoK: Eternal War in Memory". In: 2013 IEEE Symposium on Security and Privacy. May 2013, pp. 48–62. DOI: 10.1109/SP.2013.13.
- [5] Robert N.M. Watson et al. "CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization". In: *2015 IEEE Symposium on Security and Privacy*. San Jose, CA: IEEE, May 2015, pp. 20–37. ISBN: 978-1-4673-6949-7. DOI: 10/gfpgzz.
- [6] Andrew Waterman and Krste Asanović, eds. *The RISC-V Instruction Set Manual Volume I: Unprivileged ISA*. 13th December 2019. URL: https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf.
- [7] Robert N. M. Watson et al. Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 8). UCAM-CL-TR-951. University of Cambridge, Computer Laboratory, 2020. URL: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-951.html (visited on 06/10/2021).
- [8] Alexander Richardson. Complete Spatial Safety for C and C++ Using CHERI Capabilities. UCAM-CL-TR-949. University of Cambridge, Computer Laboratory, June 2020, p. 189. URL: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-949.pdf.
- [9] Andrew Waterman, Krste Asanovic and John Hauser, eds. *The RISC-V Instruction Set Manual Volume II: Privileged Architecture.* 4th December 2021. URL: https://github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf.

- [10] Matthew Johns and Tom J. Kazmierski. "A Minimal RISC-V Vector Processor for Embedded Systems". In: *2020 Forum for Specification and Design Languages (FDL)*. September 2020, pp. 1–4. DOI: 10/gnrfdb.
- [11] Stefano Di Mascio et al. "On-Board Decision Making in Space with Deep Neural Networks and RISC-V Vector Processors". In: *Journal of Aerospace Information Systems* 18.8 (1st August 2021), pp. 553–570. DOI: 10/gnrfch.
- [12] AndesCore NX27V Processor. Andes Technology. URL: https://www.andestech.com/en/products-solutions/andescore-processors/riscv-nx27v/ (visited on 11/12/2021).
- [13] SiFive Intelligence X280 SiFive. sifive.com. URL: https://www.sifive.com/cores/intelligence-x280 (visited on 15/05/2022).
- [14] Chen Chen et al. "Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-Bit High Performance RISC-V Processor with Vector Extension: Industrial Product". In: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). May 2020, pp. 52–64. DOI: 10.1109/ISCA45697.2020.00016.
- [15] Matheus Cavalcante et al. "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor With Multiprecision Floating-Point Support in 22-Nm FD-SOI". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 28.2 (February 2020), pp. 530–543. ISSN: 1557-9999. DOI: 10/gnrd7v.
- [16] Imad Al Assir et al. "Arrow: A RISC-V Vector Accelerator for Machine Learning Inference". 15th July 2021. arXiv: 2107.07169 [cs]. URL: http://arxiv.org/abs/2107.07169 (visited on 11/12/2021).
- [17] Kariofyllis Patsidis et al. "RISC-V2: A Scalable RISC-V Vector Processor". In: 2020 IEEE International Symposium on Circuits and Systems (ISCAS). October 2020, pp. 1–5. DOI: 10/gnfrn3.
- [18] Michael Platzer and Peter Puschner. "Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation". In: 33rd Euromicro Conference on Real-Time Systems (ECRTS 2021). Ed. by Björn B. Brandenburg. Vol. 196. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl Leibniz-Zentrum für Informatik, 2021, 1:1–1:18. ISBN: 978-3-95977-192-4. DOI: 10/gnfrn2.
- [19] Francesco Minervini and Oscar Palomar Perez. "Vitruvius: An Area-Efficient RISC-V Decoupled Vector Accelerator for High Performance Computing" (RISC-V Week Paris). 4th May 2022. URL: https://www.youtube.com/watch?v=t1C5kMhrh-k (visited on 13/05/2022).

- [20] Gopinath Mahale et al. "A RISC-V VPU for Very Long and Sparse Vectors" (RISC-V Week Paris). March 2021. URL: https://open-src-soc.org/2022-05/media/posters/4th-RISC-V-Meeting-2022-05-03-Gopinath-Mahale-poster.pdf (visited on 13/05/2022).
- [21] Jonathan Woodruff et al. "CHERI Concentrate: Practical Compressed Capabilities". In: (2019), p. 15. DOI: 10/gm9ngf.
- [22] Alexandre Joannou et al. "Efficient Tagged Memory". In: 2017 IEEE International Conference on Computer Design (ICCD). November 2017, pp. 641–648. DOI: 10/ghnj26.
- [23] H.J. Lu et al., eds. System V Application Binary Interface v1.0. 28th January 2018. URL: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf (visited on 15/05/2022).
- [24] Kayvan Memarian et al. "Exploring C Semantics and Pointer Provenance". In: *Proceedings of the ACM on Programming Languages* 3 (POPL 2nd January 2019), pp. 1–32. ISSN: 2475-1421. DOI: 10.1145/3290380.
- [25] Arm Ltd. Arm Architecture Reference Manual Supplement Morello for A-profile Architecture. 25th June 2021. URL: https://developer.arm.com/documentation/ddi0606/latest.
- [26] Arm Ltd. Arm Compiler Scalable Vector Extension User Guide Version 6.12.0612-00. 27th February 2019. URL: https://developer.arm.com/documentation/100891/latest/ (visited on 13/05/2022).
- [27] RISC-V Vector Extension Intrinsics (v1.0). RISC-V Non-ISA Specifications, 16th November 2021. URL: https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/00882f19a84ab354dc8cf6a10c100b8daa2654e4/rvv-intrinsic-api.md (visited on 16/11/2021).
- [28] Arm Ltd. ARM C Language Extensions for SVE 0.0bet6. 00bet6. 2020. URL: https://developer.arm.com/documentation/100987/0000/ (visited on 13/05/2022).
- [29] RISC-V ABIs Specification v1.0rc2. 6th April 2022. URL: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/releases/download/v1.0-rc2/riscv-abi.pdf.
- [30] Shiva Chen and Hsiangkai Wang. "Compiler Support For Linker Relaxation in RISC-V" (RISC-V Workshop Taiwan). 13th March 2019. URL:

  https://riscv.org/wp-content/uploads/2019/03/11.15-Shiva-Chen-Compiler-Support-For-Linker-Relaxation-in-RISC-V-2019-03-13.pdf (visited on 04/05/2022).

# RUST\_CHERI\_COMPRESSED\_CAP DOCUMENTATION

This section reproduces the documentation for cheri-compressed-cap C structs and functions that were linked into Rust. The full documentation<sup>1</sup> explains other Rust-specific decisions and implementation details.

<sup>&</sup>lt;sup>1</sup>Available in **(redacted for anonymity)** or attached to submission

=



Struct CcxCap

# Struct rust\_cheri\_compressed\_cap::CcxCap

```
#[repr(C, align(16))]
pub struct CcxCap<T: CompressedCapability> {
    _cr_cursor: T::Addr,
    cr_pesbt: T::Addr,
    _cr_top: T::FfiLength,
    cr_base: T::Addr,
    cr_tag: u8,
    cr_bounds_valid: u8,
    cr_exp: u8,
    cr_extra: u8,
}
```

Structure matching the C type <code>\_cc\_N(cap)</code> . Field order and layout is binary-compatible with the C version, assuming the C preprocessor macro <code>\_CC\_REVERSE\_PESBT\_CURSOR\_ORDER</code> is *not* defined.

This is a plain-old-data type. It only supplies getters and setters, and does *not* guarantee any safety/correctness. For example, there are no added assertions or checks if you set the cursor to a value outside the bounds. However, the C FFI functions from CompressedCapability may have their own asserts. These are documented where possible.

For a safe interface, use one of the crate::wrappers

#### **Fields**

```
_cr_cursor: T::Addr

The bottom half of the capability as stored in memory.

If Self::cr_tag is 1, this is the capability's "cursor" i.e. the address it's actually pointing to.

cr_pesbt: T::Addr

The top half of the capability as stored in memory.

If Self::cr_tag is 1, this is the compressed capability metadata (permissions, otype, bounds, etc.).

_cr_top: T::Ffilength
```

The top of this capability's valid address range. Derived from Self::cr\_pesbt. As long as self::cr\_tag is 1, the getter/setter will ensure it matches.

```
cr_base: T::Addr
```

The base of this capability's valid address range. Derived from Self::cr\_pesbt. As long as Self::cr\_tag is 1, the getter/setter will ensure it matches.

```
cr_tag: u8
```

Tag - if 1, this is a valid capability, 0 it's just plain data

```
cr_bounds_valid: u8
```

0 (false) if the bounds decode step was given an invalid capability. Should be 1 (true) for all non-Morello capabilities.

```
cr_exp: u8
```

The exponent used for storing the bounds. Stored from various places, only used in Morello-exclusive function cap\_bounds\_uses\_value().

```
cr_extra: u8
```

"Additional data stored by the caller." Seemingly completely unused, essentially padding.

#### **Implementations**

```
impl<T: CompressedCapability> CcxCap<T> [src]
```

Implements getters and setters similar to the C++-only member functions in the header.

Returns a (tag, [cursor, pesbt]) tuple that represents all data required to store a capability in a register.

To store capabilities in memory, see Self::mem\_representation

```
pub fn mem_representation(&self) -> (bool, [T::Addr; 2]) [src]
```

Returns a (tag, [cursor, pesbt]) tuple that represents all data required to store a capability in memory.

To store capabilities in a register, see Self::reg\_representation

```
pub fn tag(&self) -> bool [src]
```

```
ub fn top(&self) -> T::Length
                                                                        [src]
pub fn bounds(&self) -> (T::Addr, T::Length)
                                                                        [src]
pub fn set_bounds_unchecked(
                                                                        [src]
    &mut self,
    req_base: T::Addr,
    req_top: T::Length
) -> bool
Sets the base and top of this capability using C FFI function
CompressedCapability::set_bounds. Updates the PESBT field correspondingly. On non-
Morello platforms, will fail with an assertion error if Self::tag() is not set.
pub fn address(&self) -> T::Addr
                                                                        [src]
pub fn set_address_unchecked(&mut self, addr: T::Addr)
                                                                        [src]
pub fn offset(&self) -> T::Offset
                                                                        [src]
pub fn length(&self) -> T::Length
                                                                        [src]
pub fn software_permissions(&self) -> u32
                                                                        [src]
pub fn set_software_permissions(&mut self, uperms: u32)
                                                                        [src]
pub fn permissions(&self) -> u32
                                                                        [src]
pub fn set_permissions(&mut self, perms: u32)
                                                                        [src]
pub fn otype(&self) -> u32
                                                                        [src]
pub fn is_sealed(&self) -> bool
                                                                        [src]
pub fn set_otype(&mut self, otype: u32)
                                                                        [src]
pub fn reserved_bits(&self) -> u8
                                                                        [src]
pub fn set_reserved_bits(&mut self, bits: u8)
                                                                        [src]
pub fn flags(&self) -> u8
                                                                        [src]
pub fn set_flags(&mut self, flags: u8)
                                                                        [src]
pub fn is_exact(&self) -> bool
                                                                        [src]
Helper function for easily calling FFI function
CompressedCapability::is_representable_cap_exact on this capability. Assertions are
present in the C code, but should never be triggered.
pub fn is_representable_with_new_addr(&self, new_addr: T::Addr) [src]
```

```
bool
```

Helper function for easily calling FFI function

CompressedCapability::is\_representable\_new\_addr on this capability. Assertions are present in the C code, but should never be triggered.

#### **Trait Implementations**

```
impl<T: Clone + CompressedCapability> Clone for CcxCap<T>
                                                                              [src]
   where
       T::Addr: Clone,
       T::Addr: Clone,
       T::FfiLength: Clone,
       T::Addr: Clone,
   impl<T: CompressedCapability> Debug for CcxCap<T>
                                                                              [src]
    Debug printer for capabilities that decodes the PESBT field instead of printing it raw.
  impl<T: CompressedCapability> Default for CcxCap<T>
                                                                              [src]
    Equivalent to initialization pattern used in tests:
      ccx_cap_t value;
      memset(&value, 0, sizeof(value));
    cc64.rs doesn't pick it up when it was automatically #derive-d, so it's manually
    implemented here
impl<T: CompressedCapability> PartialEq<CcxCap<T>> for
                                                                              [src]
   CcxCap<T>
    Implements operator== from cheri_compressed_cap_common.h
   impl<T: Copy + CompressedCapability> Copy for CcxCap<T>
                                                                              [src]
   where
       T::Addr: Copy,
       T::Addr: Copy,
       T::FfiLength: Copy,
       T::Addr: Copy,
   impl<T: CompressedCapability> Eq for CcxCap<T>
                                                                              [src]
  Auto Trait Implementations
   impl<T> RefUnwindSafe for CcxCap<T>
   where
       <T as CompressedCapability>::Addr: RefUnwindSafe,
       <T as CompressedCapability>::FfiLength: RefUnwindSafe,
   impl<T> Send for CcxCap<T>
```

```
ere
       <T as CompressedCapability>::Addr: Send,
       <T as CompressedCapability>::FfiLength: Send,
   impl<T> Sync for CcxCap<T>
   where
       <T as CompressedCapability>::Addr: Sync,
       <T as CompressedCapability>::FfiLength: Sync,
   impl<T> Unpin for CcxCap<T>
   where
       <T as CompressedCapability>::Addr: Unpin,
       <T as CompressedCapability>::FfiLength: Unpin,
   impl<T> UnwindSafe for CcxCap<T>
   where
       <T as CompressedCapability>::Addr: UnwindSafe,
       <T as CompressedCapability>::FfiLength: UnwindSafe,
  Blanket Implementations
-] impl<T> Any for T
                                                                               [src]
   where
       T: 'static + ?Sized,
impl<T> Borrow<T> for T
                                                                               [src]
   where
       T: ?Sized,
impl<T> BorrowMut<T> for T
                                                                               [src]
   where
       T: ?Sized,
  impl<T> From<T> for T
                                                                               [src]
-]
  impl<T, U> Into<U> for T
                                                                               [src]
   where
       U: From<T>,
   impl<T> ToOwned for T
                                                                               [src]
   where
       T: Clone,
      type Owned = T
    The resulting type after obtaining ownership.
impl<T, U> TryFrom<U> for T
                                                                               [src]
   where
       U: Into<T>,
      type Error = Infallible
    The type returned in the event of a conversion error.
```

```
= | ipl<T, U> TryInto<U> for T ere
                                                                               [src]
     U: TryFrom<T>,
    type Error = <U as TryFrom<T>>::Error
   The type returned in the event of a conversion error.
```



# Trait CompressedCapability

# **Trait rust\_cheri\_compressed\_cap::CompressedCapability**

```
pub trait CompressedCapability: Sized + Copy + Clone {
    type Length: NumType + From<Self::Addr>;
    type Offset: NumType + From<Self::Addr>;
    type Addr: NumType + Into<Self::Offset> + Into<Self::Length>;
    type FfiLength: FfiNumType<Self::Length>;
    type FfiOffset: FfiNumType<Self::Offset>;
[+] Show associated constants and methods
}
```

Trait defining an Rust version of the public API for a specific capability type. A type X implementing CompressedCapability is equivalent to the API provided by cheri\_compressed\_cap\_X.h in C, where ccx\_cap\_t is equivalent to CcxCap.

It is not recommended to call the trait functions directly. Instead, use one of the crate::wrappers.

#### **Associated Types**

```
type Length: NumType + From<Self::Addr>
                                                                               [src]
   ccx_length_t Rust-land equivalent - should be a superset of Addr
type Offset: NumType + From<Self::Addr>
                                                                               [src]
   ccx_offset_t Rust-land equivalent - should be a superset of Addr
type Addr: NumType + Into<Self::Offset> + Into<Self::Length>
                                                                               [src]
   ccx_addr_t equivalent
type FfiLength: FfiNumType<Self::Length>
                                                                               [src]
   ccx_length_t C-land equivalent - should have a memory layout identical to the C
   ccx_length_t. This is separate from Length because for 128-bit types the Rust and C
   versions may not look the same. In practice, we just assume they are the same (see
   crate::c_funcs documentation).
type FfiOffset: FfiNumType<Self::Offset>
                                                                               [src]
   ccx_offset_t C-land equivalent - should have a memory layout identical to the C
```

≡

ccx\_offset\_t. See Self::FfiLength for an explanation.

#### **Associated Constants**

```
const PERM_GLOBAL: u32
                                                                        [src]
    CCX_PERM_GLOBAL equivalent These are the same for 64 and 128bit, but should be
    overridden for Morello-128
 const PERM_EXECUTE: u32
                                                                        [src]
 const PERM_LOAD: u32
                                                                        [src]
 const PERM_STORE: u32
                                                                        [src]
 const PERM_LOAD_CAP: u32
                                                                        [src]
 const PERM_STORE_CAP: u32
                                                                        [src]
 const PERM_STORE_LOCAL: u32
                                                                        [src]
                                                                        [src]
 const PERM_SEAL: u32
 const PERM_CINVOKE: u32
                                                                        [src]
 const PERM_UNSEAL: u32
                                                                        [src]
 const PERM_ACCESS_SYS_REGS: u32
                                                                        [src]
 const PERM_SETCID: u32
                                                                        [src]
 const MAX_REPRESENTABLE_OTYPE: u32
                                                                        [src]
 const OTYPE_UNSEALED: u32
                                                                        [src]
    CCX_OTYPE_UNSEALED equivalent
 const OTYPE_SENTRY: u32
                                                                        [src]
 const OTYPE_RESERVED2: u32
                                                                        [src]
 const OTYPE_RESERVED3: u32
                                                                        [src]
 const MAX_UNRESERVED_OTYPE: u32
                                                                        [src]
Required methods
 fn compress_raw(src_cap: &CcxCap<Self>) -> Self::Addr
                                                                        [src]
```

Generate the pesbt bits for a capability (the top bits, which encode permissions, object type, compressed bounds, etc.) This transformation can be undone with

 $\equiv$ 

#### Self::decompress\_raw.

This is presumably intended for storing compressed capabilities in e.g. registers. Its counterpart for storing compressed capabilities in memory is Self::compress\_mem.

```
fn decompress_raw(
    pesbt: Self::Addr,
    cursor: Self::Addr,
    tag: bool
) -> CcxCap<Self>
```

Decompress a (pesbt, cursor) pair into a capability. This transformation can be undone with Self::compress\_raw.

```
fn compress_mem(src_cap: &CcxCap<Self>) -> Self::Addr [src]
```

Generate the pesbt bits for a capability (the top bits, which encode permissions, object type, compressed bounds, etc.) This transformation can be undone with Self::decompress\_mem.

This is presumably intended for storing compressed capabilities in memory. It is equivalent to calling Self::compress\_raw and XOR-ing the result with a "null mask". Presumably this transformation prevents all-zero data from being interpreted as a capability?

```
fn decompress_mem(
    pesbt: Self::Addr,
    cursor: Self::Addr,
    tag: bool
) -> CcxCap<Self>
```

Decompress a (pesbt, cursor) pair into a capability. This transformation can be undone with Self::compress\_mem.

This is equivalent to XOR-ing the pesbt with a "null mask" and calling Self::decompress\_raw. Presumably the null mask prevents all-zero data from being interpreted as a capability?

```
fn get_uperms(cap: &CcxCap<Self>) -> u32
[src]
```

Gets the user/software-defined permissions from the CcxCap::cr\_pesbt field

Counterpart: Self::update\_uperms

Gets the hardware-defined permissions from the CcxCap::cr\_pesbt field

Counterpart: Self::update\_perms

```
fn get_otype(cap: &CcxCap<Self>) -> u32
[src]
```

Gets the object type from the CcxCap::cr\_pesbt field

```
Counterpart: Self::update_otype
fn get_reserved(cap: &CcxCap<Self>) -> u8
                                                                              [src]
   Gets the reserved bits from the CcxCap::cr_pesbt field
   Counterpart: Self::update_reserved
fn get_flags(cap: &CcxCap<Self>) -> u8
                                                                              [src]
   Gets the flags from the CcxCap::cr_pesbt field
   Counterpart: Self::update_flags
fn update_uperms(cap: &mut CcxCap<Self>, value: u32)
                                                                              [src]
   Updates the user/software-defined permissions field in CcxCap::cr_pesbt
   Counterpart: Self::get_uperms
fn update_perms(cap: &mut CcxCap<Self>, value: u32)
                                                                              [src]
   Updates the hardware-defined permissions field in CcxCap::cr_pesbt
   Counterpart: Self::get_perms
fn update_otype(cap: &mut CcxCap<Self>, value: u32)
                                                                              [src]
   Updates the object type field in CcxCap::cr_pesbt
   Counterpart: Self::get_otype
fn update_reserved(cap: &mut CcxCap<Self>, value: u8)
                                                                              [src]
   Updates the reserved field in CcxCap::cr_pesbt
   Counterpart: Self::get_reserved
fn update_flags(cap: &mut CcxCap<Self>, value: u8)
                                                                              [src]
   Updates the flags field in CcxCap::cr_pesbt
   Counterpart: Self::get_flags
fn extract_bounds_bits(pesbt: Self::Addr) -> CcxBoundsBits
                                                                              [src]
   Extracts the floating-point encoded bounds from CcxCap::cr_pesbt
fn set_bounds(
                                                                              [src]
    cap: &mut CcxCap<Self>,
    req_base: Self::Addr,
    req_top: Self::Length
) -> bool
   Sets the capability bounds to bounds that encompass (req_base, req_top). Because a
   floating-point representation is used for bounds, it may not be able to set (req_base,
```

74

```
≡
```

req\_top) exactly. In this case it will return False.

Updates CcxCap::cr\_pesbt, CcxCap::\_cr\_top, CcxCap::cr\_base

```
fn is_representable_cap_exact(cap: &CcxCap<Self>) -> bool [src]
```

Check if the range (CcxCap::cr\_base, CcxCap::\_cr\_top) can be encoded exactly with the floating-point encoding

```
fn is_representable_new_addr(
    sealed: bool,
    base: Self::Addr,
    length: Self::Length,
    cursor: Self::Addr,
    new_cursor: Self::Addr
) -> bool
```

Check if a capability with the parameters sealed, base, length, cursor would be representable if the cursor were updated to new\_cursor.

```
fn make_max_perms_cap(
    base: Self::Addr,
    cursor: Self::Addr,
    top: Self::Length
) -> CcxCap<Self>
```

Generate a capability for base, top, cursor with the maximum available permissions

```
fn get_representable_length(length: Self::Length) -> [Src]
Self::Length
```

Get the minimum representable length greater than or equal to length.

If get\_representable\_length(l) == l then bounds of length l are exactly representable (if properly aligned).

See also Self::get\_required\_alignment, Self::get\_alignment\_mask.

```
fn get_required_alignment(length: Self::Length) -> Self::Length [src]
```

Get the alignment required for bounds of some length to be exactly represented.

See also Self::get\_representable\_length, Self::get\_alignment\_mask.

```
fn get_alignment_mask(length: Self::Length) -> Self::Length [src]
```

Get a mask which aligns a bounds of some length to be exactly representable.

See also Self::get\_representable\_length, Self::get\_required\_alignment.

### **Implementors**

```
npl CompressedCapability for Cc64
                                                                      [src]
     type Length = u64
     type Offset = i64
     type Addr = u32
     type FfiLength = u64
     type FfiOffset = i64
     const MAX_REPRESENTABLE_OTYPE: u32
                                                                      [src]
     _CC_N(OTYPE_UNSEALED_SIGNED) = (((int64_t)-1) - 0u)```
     The OTYPE field is 4 bits (50:47) in CC64
[-] impl CompressedCapability for Cc128
                                                                      [src]
     type Length = u128
     type Offset = i128
     type Addr = u64
     type FfiLength = u128
     type FfiOffset = i128
     const MAX_REPRESENTABLE_OTYPE: u32
                                                                      [src]
   The OTYPE field is 18 bits (108:91) in CC128
```

## CODE SNIPPETS

This appendix contains code snippets referenced in the document. The full codebases are available online at **(redacted for anonymity)**, and attached to the submission.

## B.1 riscv-v-lite - Vector memory accesses

```
/// Converts a decoded memory operation to the list of accesses it
   → performs.
  fn get_load_store_accesses(&mut self, rd: u8, addr_p: (u64,
     Provenance), rs2: u8, vm: bool, op: DecodedMemOp) ->
     Result<Vec<(VectorElem, u64)>> {
      // Vector of (VectorElem, Address)
      let mut map = vec![];
      let (base_addr, _) = addr_p;
      use DecodedMemOp::*;
      match op {
          Strided{stride, evl, nf, eew, emul, ..} => {
               // For each segment
11
              for i_segment in self.vstart..evl {
                   let seg addr = base addr + (i segment as u64 *
13
     stride);
14
                   // If we aren't masked out...
15
                   if !self.vreg.seg_masked_out(vm, i_segment) {
```

```
// For each field
                        let mut field_addr = seg_addr;
18
                        for i_field in 0..nf {
19
                            // ... perform the access
                            let vec elem = VectorElem::check with lmul(
                                // Register group start
                                // For field 0, = rd
23
                                // For field 1, = rd + (number of
24
       registers/group)
                                // etc.
25
                                rd + (i_field *
26
       emul.num_registers_consumed()),
                                eew, emul,
27
                                // Element index within register group
28
                                i_segment
29
                            );
                            map.push((vec_elem, field_addr));
31
                            // and increment the address
32
                            field addr += eew.width in bytes();
                        }
34
                   }
35
               }
36
           }
37
           FaultOnlyFirst{evl, nf, eew, emul} => {
               // We don't handle the exceptions here
39
               // This just lists the accesses that will be attempted
40
               // This is exactly the same code as for Strided, but it
       calculates the stride
               let stride = eew.width_in_bytes() * (nf as u64);
43
               // For each segment
44
               for i_segment in self.vstart..evl {
45
                   let seg_addr = base_addr + (i_segment as u64 *
46
       stride);
47
                   // If we aren't masked out...
48
                   if !self.vreg.seg_masked_out(vm, i_segment) {
```

```
// For each field
                        let mut field addr = seg addr;
51
                        for i_field in 0..nf {
52
                            // ... perform the access
                            let vec elem = VectorElem::check with lmul(
54
                                rd + (i_field *
55
       emul.num registers consumed()),
                                eew, emul,
56
                                i_segment
57
                            );
58
                            map.push((vec_elem, field_addr));
                            // and increment the address
60
                            field_addr += eew.width_in_bytes();
61
                        }
                   }
63
               }
           }
65
           Indexed{index ew, evl, nf, eew, emul, ..} => {
               // i = element index in logical vector (which includes
       groups)
               for i_segment in self.vstart..evl {
                   // Get our index
69
                   let seg_offset =
70
       self.vreg.load_vreg_elem_int(index_ew, rs2, i_segment)?;
                   let seg_addr = base_addr + seg_offset as u64;
71
72
                   // If we aren't masked out...
                    if !self.vreg.seg_masked_out(vm, i_segment) {
74
                        // For each field
75
                        let mut field addr = seg addr;
76
                        for i_field in 0..nf {
77
                            // ... perform the access
                            let vec_elem = VectorElem::check_with_lmul(
79
                                rd + (i_field *
80
       emul.num_registers_consumed()),
                                eew, emul,
81
                                i_segment
```

```
);
                              map.push((vec_elem, field_addr));
84
                              // and increment the address
85
                              field_addr += eew.width_in_bytes();
86
                         }
87
                     }
                }
89
            }
90
            WholeRegister{num_regs, eew, ..} => {
                if vm == false {
92
                     // There are no masked variants of this instruction
93
                     bail!("WholeRegister operations cannot be masked")
94
                }
95
                let mut addr = base addr;
97
                let vl = op.evl();
98
                // For element in register set...
                for i in self.vstart..vl {
100
                     // ...perform the access
101
                     let vec_elem = VectorElem::check_with_num_regs(
102
                         rd,
103
                         eew, num_regs,
104
                         i as u32
105
                     );
106
                     map.push((vec_elem, addr));
107
                     addr += eew.width_in_bytes();
108
                }
            }
110
            ByteMask{evl, ..} => {
111
                if vm == false {
112
                     // vlm, vsm cannot be masked out
113
                     bail!("ByteMask operations cannot be masked")
                }
115
                let mut addr = base_addr;
117
                // evl = number of 8-bit elements required for the mask
118
                // self.vstart = in terms of bytes
119
```

```
for i in self.vstart..evl {
120
                      let vec_elem = VectorElem::check_with_lmul(
121
                           rd,
122
                           Sew::e8, Lmul::e1,
123
124
                      );
                      map.push((vec_elem, addr));
126
                      addr += 1;
127
                  }
128
             }
129
        };
130
131
        Ok(map)
132
   }
133
```

## FAST PATH VECTOR CHECKS - PRE-CHECKS

This appendix describes methods of calculating tight bounds for vector memory accesses (Section 3.1.2.2) and ways that architectural complexity can be traded off to calculate *wider* bounds. These methods calculate the entire bounds up front, and while they are used in the emulator a hardware implementation may find it introduces too much latency. [TODO25 Justify appendix existence] [TODO26 reference Success, Indeterminate, Failure cases] [TODO27 fast-path could be split up? i.e. for LMUL = 8, could execute a fast-path for each register in the group rather than all 8 at once]

#### C.1 Masked accesses

For all masked accesses, masked-out/inactive segments should not trigger capability exceptions. Therefore, a tight bounds must include only the smallest and largest active segments. These segments can be found by inspecting the mask vector: either checking each bit in turn or using parallel logic to find the lowest/highest set bits. Care must be taken with these checks to ensure elements outside the range [vstart, evl) are not counted.

$$vstart_{active} = min(i \ \forall \ vstart \le i < evl \ where \ mask[i] = 1)$$
 (C.1.1)

$$evl_{active} = max(i \ \forall \ vstart \le i < evl \ where \ mask[i] = 1) + 1$$
 (C.1.2)

#### **Tradeoffs**

If using parallel logic to find the lowest/highest bits, it could be difficult to account for [vstart, evl]. An implementation could choose to only calculate tight bounds when the mask is fully utilized, i.e. vstart = 0, evl = VLEN, and assume wider bounds otherwise.

Accounting for masked accesses at all may not be worth the extra complexity. Only elements masked off on the edges make any difference, and it may be uncommon for long runs of edge

elements to be masked off. Thus, an implementation could choose to ignore masking entirely when computing the ranges. This does mean that all failures become Likely-Failure when masking is enabled, because all elements outside the capability bounds may be masked off.

[TODO28 iterating over elements may still be more energy-efficient than doing individual capability checks?]

#### C.2 Unit accesses

For unit segmented accesses, which includes fault-only first, the tight address range for an access is simple to calculate. Whole register and bytemask accesses can simplify this by fixing nf = 1 and eew = 8.

$$base + [vstart_{active} * nf * eew, evl_{active} * nf * eew)$$
 (C.2.1)

[TODO29 Note that the equation is a simplification of strided for stride = nf \* eew]

#### **Tradeoffs**

nf is not guaranteed to be a power of two (except for the whole-register case), so calculating the 'tight' address range would require a multiplication by an arbitrary four-bit value between 1 and 8. If this multiplication is too expensive, implementations could choose to classify all nf > 1 cases as Unchecked.

Unless extra restrictions are placed on vstart, calculating the start of this range requires another arbitrary multiplication. To avoid this one could assume vstart = 0 and treat failures as Likely-Failure for other cases. Once could also classify all nonzero vstart accesses as Unchecked.

Even if the previous two optimizations are applied, the final range still requires a multiplication evl \* eew. Thankfully, because eew may only be one of four powers-of-two, this can be encoded as a simple shift.

#### C.3 Strided accesses

Strided accesses bring further complication, especially as the stride may be negative.

$$bounds(\texttt{stride}) = base + \begin{cases} [\texttt{vstart}_{active} * \texttt{stride}, (\texttt{evl}_{active} - 1) * \texttt{stride} + \texttt{nf} * \texttt{eew}) & \texttt{stride} \geq 0 \\ [(\texttt{evl}_{active} - 1) * \texttt{stride}, \texttt{vstart}_{active} * \texttt{stride} + \texttt{nf} * \texttt{eew}) & \texttt{stride} < 0 \end{cases}$$

$$(C.3.1)$$

This is formed of three components:

- vstart<sub>active</sub> \* stride, the start of the first segment. This can be simplified to 0, just like for unit accesses, to avoid an arbitrary multiplication.
- $(\text{evl}_{active} 1)$  \* stride, the start of the final segment. This requires an arbitrary multiplication, unless strided accesses are all Unchecked.
- nf \* eew, the length of a segment, which can be implemented with a shift.

### C.4 Indexed accesses

This is the most complicated access of the bunch, because the addresses cannot be computed without reading the index register.

$$[base + min(offsets[vstart_{active}...evl_{active}]),$$
 (C.4.1)

$$base + \max(\text{offsets}[\text{vstart}_{active}..\text{evl}_{active}]) + \text{nf} * \text{eew})$$
 (C.4.2)

The most expensive components here are of course min, max of the offsets. These could be calculated in hardware through parallel reductions, making it slightly more efficient than looping over each element. A low-hanging optimization could be to remove the  $vstart_{active}..evl_{active}$  range condition, performing the reduction over the whole register group, which would make failures Likely-Failure where  $vstart_{active}! = 0 || evl_{active}! = VLMAX$ . This calculation could also be restricted to certain register configurations to reduce the amount of required hardware. Indeed, the amount of hardware could be reduced to zero by simply classifying all indexed accesses as Unchecked.

## COMPILER INFORMATION

## D.1 Building riscv-gnu-toolchain with vector support

As of May 2022, the RISC-V GNU toolchain (hosted at riscv-collab/riscv-gnu-toolchain on Github) does not support the vector extension or it's intrinsics. The rvv-intrinsic branch of this repository claimed to support vector intrinsics, but it was slightly outdated and has been deleted as of 17th May 2022. It referenced a repository for glibc that no longer exists as a submodule, which makes compilation impossible. I have archived the branch in my own repository ((redacted for anonymity)) and fixed that issue. This appendix describes how to build the toolchain.

To build the full toolchain with intrinsic support, perform the following steps (derived by the author independently, then amended based on macOS instructions from <sup>1</sup>):

- 1. Clone the repository itself:
- \$ git clone (redacted for anonymity)
- 2. Clone the riscv-gcc submodule:
- \$ git submodule update --init --recursive --progress --force ./riscv-gcc
  - On macOS, it may be necessary to disable SSL:
  - \$ git -c http.sslVerify=false submodule ...
- 3. Configure the compiler so it supports all General extensions, Compressed instructions, and Vector extension:
- \$ ./configure --prefix=<output directory> --with-arch=rv64gcv --with-abi=lp64d
- 4. Build the newlib version to compile for bare-metal platforms:

<sup>1</sup>https://github.com/riscv-collab/riscv-gcc/issues/323

## D.2 Vanilla RVV command-line options

| Compiler  | Required Arguments                                    | Notes                                                                                          |
|-----------|-------------------------------------------------------|------------------------------------------------------------------------------------------------|
| Clang-13  | -march=rv64gv0p10<br>-menable-experimental-extensions | Supports intrinsics, inline assembly for RVV v0.1                                              |
| Clang-14+ | -march=rv64gv                                         | Supports intrinsics, inline assembly for RVV v1.0                                              |
| GCC 10.1  | -march=rv64g_v                                        | Requires special toolchain (see Appendix D.1) and has incomplete support (see Section 6.1.2.1) |

**Table D.1:** Command-line arguments for compiling RVV code on non-CHERI compilers (assuming the base ISA is rv64g)

## D.3 CHERI-RVV command-line options

| Compiler | Required Arguments               | Notes                                |
|----------|----------------------------------|--------------------------------------|
| CHERI    | -march=rv64gv0p10xcheri          | Supports intrinsics, inline assembly |
| Clang-13 | -menable-experimental-extensions | for RVV v0.1                         |
|          | -mabi=164pc128                   | ABI string sets capability width.    |
|          | -mno-relax                       | Must disable linker relaxations.     |

**Table D.2:** Command-line arguments for compiling CHERI-RVV code (assuming the base ISA is rv64g)

By default CHERI-Clang doesn't actually compile capability-enabled code. The documentation on enabling capabilities is unfortunately sparse and outdated. In particular, the CHERI-Clang help menu states that --cheri will "Enable CHERI support with the default capability size", but this has no effect (at least on RISC-V). To find up-to-date answers, we consulted the source code for the CHERIbuild build tool<sup>2</sup>.

CHERIbuild's code<sup>3</sup> revealed three requirements:

- The architecture string must contain xcheri
- The capability length must be set using the ABI string

<sup>&</sup>lt;sup>2</sup>CSTRD-CHERI/cheribuild on Github

<sup>&</sup>lt;sup>3</sup>config/compilation\_targets.py:176 in CSTRD-CHERI/cheribuild on GitHub

- In pure-capability mode, pointers and capabilities are CLEN long
  - \* Example string: 164pc128
  - \* Integer width (long, or 1) = XLEN = 64-bits
  - \* Pointer width (p) = Capability width (p) = CLEN = 128-bits
- For hybrid mode, pointers remain XLEN long and capability length is not specified
  - \* Example string: 1p32
  - \* Integer width (1) = XLEN = Pointer width (p) = 32-bits
- "Linker relaxations", where function calls are converted to short jumps[30], must be disabled.
  - This is likely because CHERI requires function calls to go through capabilities
  - However the code that adds this option wasn't documented, so there may be more to it

Once the above options are set, plain CHERI-RISC-V code compiles without a hitch. Changes to CHERI-Clang itself are required to compile vectors (Section 4.2.1).

## CHERI-RVV CHANGES FROM CHERI AND RVV

[TODO30 CHERI-RVV changes RVV instructions to use capabilities as address references in all places, no other behaviour changes]

[TODO31 CHERI-RVV treats capability out-of-bounds for an element as a synchronous exception, permission checks for a vector access are treated as a synchronous exception on element 0.] [TODO32 The above means fault-only-first will silently swallow capability permission errors if element 0 is masked out.]

[TODO33 CHERI-RVV doesn't change CHERI behaviour]

## **E.1** Capabilities-in-vectors

[TODO34 Adds unit load/store 128, element width 128]

| SEW                | VS | ew [2: | :0] |
|--------------------|----|--------|-----|
| 8                  | 0  | 0      | 0   |
| 16                 | 0  | 0      | 1   |
| 32                 | 0  | 1      | 0   |
| 64                 | 0  | 1      | 1   |
| 128 <sup>new</sup> | 1  | 0      | 0   |

**Table E.1:** Selected element width encoding

|                     | mew | wid | lth[2 | :0] |
|---------------------|-----|-----|-------|-----|
| Vector 8b element   | 0   | 0   | 0     | 0   |
| Vector 16b element  | 0   | 1   | 0     | 1   |
| Vector 32b element  | 0   | 1   | 1     | 0   |
| Vector 64b element  | 0   | 1   | 1     | 1   |
| Vector 128b element | 1   | 0   | 0     | 0   |

Table E.2: Width encoding for vector loads and stores

### **E.1.1** Relevant properties

This subsection summarizes the properties of the emulator described in Chapter 5 than enable capabilities-in-vectors. These properties are not absolute requirements for all capability-in-vector implementations fulfilling Hyps. H-7 to H-9, but can be used as a starting point.

- ELEN = 128(+1) i.e. the length of a capability.
  - For a program to manipulate, load, and store capability values securely and atomically, it must be able to operate on appropriately sized logical elements.
- VLEN = 128(+1) i.e. the length of a capability.
  - VLEN  $\geq$  ELEN ([2, Chapter 2])
  - Larger VLEN could be supported, which must be a power-of-two[2, Chapter 2] and therefore will be a multiple of the capability length.
- The memory interface is identical to that used by the scalar processor, where each vector access is split into a set of sequential accesses less than 128 bits.
  - Therefore the safety properties for the scalar code still hold.
  - Example: capabilities can only be stored through a capability with the STORE\_CAP permission.
- Capability memory accesses use SafeTaggedCap as a unit.
  - This means it is impossible to set the tag bit on invalid capabilities.
  - It also means all capability accesses are 128-bit aligned, and atomic.
- The only instruction that can place capabilities in a vector register is 128-bit element loads.
  - There are no vectorized capability-to-capability or integer-to-capability instructions, such as CSetBounds or CFromPtr.
  - All other instructions that write to registers (e.g. arithmetic, 64-bit loads, etc.) unset the tag bit.
  - The only way to place a valid capability in a vector register is to copy a valid capability from memory.
  - Therefore the Provenance and Monotonicity properties are always upheld.
- The only 128-bit element vector load instruction is unit-stride.

- Strided and Indexed accesses could be supported as long as they enforced alignment and atomicity correctly.
- Whole-register accesses would need to be updated to always use 128-bit elements, in case capabilities are being accessed.
- Capabilities-in-vectors cannot be dereferenced directly.
  - Therefore Integrity cannot be violated by vector operations.

# APPENDIX F

# Full vector\_memcpy results

**Table F.1:** Results — Vectorized memcpy

|                    | RV32    | RV-64   |         |     |       |             |
|--------------------|---------|---------|---------|-----|-------|-------------|
|                    | 11vm-13 | 11vm-13 | llvm-15 | gcc | CHERI | CHERI (Int) |
| Unit Stride e8m1   | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride e16m2  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride e32m4  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride e64m8  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride e32mf2 | Y       | Y       | Y       | -   | Y     | Y           |
| Unit Stride e16mf4 | Y       | Y       | Y       | -   | Y     | Y           |
| Unit Stride e8mf8  | Y       | Y       | Y       | -   | Y     | Y           |
| Strided e8m1       | Y       | Y       | Y       | Y   | Y     | Y           |
| Strided e16m2      | Y       | Y       | Y       | Y   | Y     | Y           |
| Strided e32m4      | Y       | Y       | Y       | Y   | Y     | Y           |
| Strided e64m8      | Y       | Y       | Y       | Y   | Y     | Y           |
| Strided e32mf2     | Y       | Y       | Y       | -   | Y     | Y           |
| Strided e16mf4     | Y       | Y       | Y       | -   | Y     | Y           |
| Strided e8mf8      | Y       | Y       | Y       | -   | Y     | Y           |
| Indexed e8m1       | Y       | Y       | Y       | Y   | Y     | Y           |
| Indexed e16m2      | Y       | Y       | Y       | Y   | Y     | Y           |
| Indexed e32m4      | Y       | Y       | Y       | Y   | Y     | Y           |
| Indexed e64m8      | Y       | Y       | Y       | Y   | Y     | Y           |
| Indexed e32mf2     | Y       | Y       | Y       | -   | Y     | Y           |
| Indexed e16mf4     | Y       | Y       | Y       | -   | Y     | Y           |
| Indexed e8mf8      | Y       | Y       | Y       | -   | Y     | Y           |

**Table F.1:** Results — Vectorized memcpy

|                              | RV32    | RV-64   |         |     |       |             |
|------------------------------|---------|---------|---------|-----|-------|-------------|
|                              | 11vm-13 | 11vm-13 | llvm-15 | gcc | CHERI | CHERI (Int) |
| Unit Stride Masked e8m1      | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Masked e16m2     | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Masked e32m4     | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Masked e64m8     | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Masked e32mf2    | Y       | Y       | Y       | -   | Y     | Y           |
| Unit Stride Masked e16mf4    | Y       | Y       | Y       | -   | Y     | Y           |
| Unit Stride Masked e8mf8     | Y       | Y       | Y       | -   | Y     | Y           |
| Bytemask Load e8m1           | -       | -       | Y       | -   | -     | -           |
| Bytemask Load e16m2          | -       | -       | Y       | -   | -     | -           |
| Bytemask Load e32m4          | -       | -       | Y       | -   | -     | -           |
| Bytemask Load e64m8          | -       | -       | Y       | =   | -     | -           |
| Bytemask Load e32mf2         | -       | -       | Y       | -   | -     | -           |
| Bytemask Load e16mf4         | -       | -       | Y       | =   | -     | -           |
| Bytemask Load e8mf8          | -       | -       | Y       | =   | -     | -           |
| Unit Stride Segmented e8m2   | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Segmented e16m2  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Segmented e32m2  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Segmented e64m2  | Y       | Y       | Y       | Y   | Y     | Y           |
| Unit Stride Segmented e32mf2 | Y       | Y       | Y       | -   | Y     | Y           |
| Whole-Register e64m1         | Y       | Y       | Y       | Y   | Y     | Y           |
| Whole-Register e64m2         | Y       | Y       | Y       | Y   | Y     | Y           |
| Whole-Register e64m4         | Y       | Y       | Y       | Y   | Y     | Y           |
| Whole-Register e64m8         | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Memcpy e8m1              | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Memcpy e16m2             | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Memcpy e32m4             | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Memcpy e64m8             | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Memcpy e32mf2            | Y       | Y       | Y       | -   | Y     | Y           |
| FoF Memcpy e16mf4            | Y       | Y       | Y       | =   | Y     | Y           |
| FoF Memcpy e8mf8             | Y       | Y       | Y       | -   | Y     | Y           |
| FoF Boundary e8m1            | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Boundary e16m2           | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Boundary e32m4           | Y       | Y       | Y       | Y   | Y     | Y           |
| FoF Boundary e64m8           | Y       | Y       | Y       | Y   | Y     | Y           |

**Table F.1:** Results — Vectorized memcpy

|                     | RV32    | RV-64   |         |     |       |             |
|---------------------|---------|---------|---------|-----|-------|-------------|
|                     | 11vm-13 | 11vm-13 | llvm-15 | gcc | CHERI | CHERI (Int) |
| FoF Boundary e32mf2 | Y       | Y       | Y       | -   | Y     | Y           |
| FoF Boundary e16mf4 | Y       | Y       | Y       | -   | Y     | Y           |
| FoF Boundary e8mf8  | Y       | Y       | Y       | -   | Y     | Y           |

# APPENDIX G

# ARTIFACTS

[TODO35 artifacts appendix needs fixing] [TODO36 make sure to redact online copies]

| Artifact                              | Resources                                     |
|---------------------------------------|-----------------------------------------------|
| riscv-v-lite                          |                                               |
| Emulator                              | theturboturnip/riscv-v-lite on Github         |
|                                       | Documentation                                 |
| rust-cheri-compressed-cap             |                                               |
| Rust wrapper for C capability library | theturboturnip/cheri-compressed-cap on Github |
|                                       | Documentation                                 |
| llvm-project                          |                                               |
| CHERI-Clang fork with Vector support  | theturboturnip/llvm-project on Github         |

 Table G.1: Project artifacts