# CPU Trace Recompilation

Sam Epstein\*

January 5, 2025

## Abstract

We propose an $O(n \log n)$ time algorithm to recompile long sequences of assembly instructions into a novel code called SECODE. If the original trace uses s non-transient heap space and t non-transient stack operations, then the size of SECODE is O(s+t), and it runs in time $O(s \log s + t)$, with small constants. This new code can be used to cause the state of the computer to jump forward, increasing the efficiency of the CPU. The bounds are achieved by removing transient and redundant operations. The new code is completely pointer friendly. With FOR loops, the conditions of the most common paths in each loop are compiled together to create a condition tree. This is done without any extra programmer annotations. Floating point operations are handled directly or approximately. We believe this paper introduces a new class of algorithms: trace recompilation methods.

## 1 Introduction

CPU designers have shifted their focus away from increasing clock speed because processors cannot be cooled fast enough. Faster clock speeds create more heat, and a point has been reached where the energy needed to cool increasingly fast processors is prohibitively expensive. CPU manufacturers have turned to parallelism for speedups. For example, Intel has released the *Intel Parallel Studio*, which is a software development project to facilitate parallel programming to take advantage of multi-core processors. However, this complicates the coding process, eliminating coding abstractions that isolate the programmer from the underlying details of the computer running the code.

Figure 1: The proposed setup for exploiting hot paths.

This obstacle invites new approaches to increasing the speed of CPUs. With the Programmable Smart Machine Lab (PSML) [AWE+14, WAS12, App14], a new model is proposed. It is noted that traces of CPUs are very non-random. This implies there is redundancy in the trace streams, which implies there exist so-called hot paths, which are long pieces of code with minimal side effects that are repeatedly executed by the computer. PSML proposes a form of *memoization* of the hot paths, originally proposed by [Mic68]:

"It would be useful if computers could learn from experience and thus automatically improve the efficiency of their own programs during execution... When I write a clumsy program for a contemporary computer a thousand runs on the machine do not re-educate my handiwork. On every execution, each time-wasting blemish and crudity, each needless test and redundant evaluation, is meticulously reproduced."

<sup>\*</sup>samepst@jptheorygroup.com

Figure 2: The evolution of the state of the computer, represented as a binary image.

Michie further noted that functions can be *memoized* and either calculated directly (rule) or through a lookup table (rote).

"On each given occasion proceed either by rule, or by rote, or by a blend of the two, solely as dictated by the expediency of the moment ... rule versus rote decisions shall be handled by the machine behind the scenes."

With memoization in mind, the proposed setup (with some minor changes) can be seen in Figure 1. The CPU sends its operating trace to a GPU, which uses machine learning to identify hot paths. The hot paths are sent to a database. This database, upon receiving the same trace as the GPU, will periodically indicate a jump forward in the state of the computer. To the author's knowledge, PSML was the first team to propose this configuration.

This paper proposes a solution to the database component in Figure 1. This database is called the SEDATABASE, short for side effect database. The machine learning portion is out of the scope of this paper, but some discussion on this topic is included. We introduce a new method to directly compile, in $O(n \log n)$ time, large streams of assembler trace C into new code that contains a minimal set of conditions and side effects of C. This new compiled code is called SECODE, short for side effect code. SECODE contains conditions, called SECONDITIONS, and changes, called SECHANGES. If C uses s non-transient heap space and performs t non-transient stack operations, then the size of SECONDITIONS is O(s+t) and it runs in time $O(s\log s+t)$. The running time of SECHANGES is O(s+t). The SECHANGES code is highly optimized. The following components of C are removed:

- Redundant memory operations.
- Math operations.
- Transient register operations.
- Transient memory operations.
- Transient stack operations.

Very large FOR loops are handled by compiling together the SECONDITIONS of the most common paths taken. SECODE is compiled without any programmer annotation. This method can also handle floating point registers with approximate programming. However, this requires explicit markup in the code by the programmer.

In [AWE+14], the state of the computer was represented as a $1 \times n$ binary picture which evolves over time, as seen in Figure 2 (seen as an $n \times n$ picture). They described the evolution of the computer as a very long video. Partial states of the image are queried into the database and if a matching rule is found, the state of the computer is updated with a new partial picture.

However, this approach has some difficulties. Take, for example, the following code. It starts with an input of two integer variables n and m and a pointer head. It constructs an m-sized linked list starting at head, and at node i it stores func(n,i), for some complicated math function func. It then terminates. Because the head pointer is different at every run, one cannot use the method described in [AWE+14]. One would like the recompiled code to be pointer agnostic. Indeed, in this paper, SECODE treats pointers as variables, so it can create the linked list as well as memoize func(n,i) for every node. The following criterion is used to distinguish pointers<sup>1</sup> from regular variables:

Pointers are only added to, subtracted from, and compared with other pointers or zero.

An example is trace assembler code C that sorts a doubly linked list, with starting values (5,3,2,4,1). If C is compiled into SECODE and sent to the SEDATABASE, then the next time C is reached and a match with (5,3,2,4,1) is made, the SEDATABASE will automatically jump to the sorted list, (1,2,3,4,5). This will happen each time C is reached and matched to (5,3,2,4,1), regardless of the particular pointer values of each run.
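As an illustration only, the following sketch shows why a pointer-agnostic key lets two runs with different raw addresses match the same entry. It is not the SECONDITIONS algorithm of this paper. It assumes a singly linked list (rather than the doubly linked list of the example) and models memory as a Python dict from address to a (value, next) pair; the names `list_signature`, `heap_a`, `heap_b`, and `memo` are hypothetical.

```python
def list_signature(heap, head):
    """Return the values of a linked list as a tuple, ignoring raw addresses.

    heap maps an address to a (value, next_address) pair; 0 plays the role of NULL.
    """
    values = []
    seen = set()
    node = head
    while node != 0 and node not in seen:  # the seen set guards against cycles
        seen.add(node)
        value, node = heap[node]
        values.append(value)
    return tuple(values)


# Two runs that build the same list (5, 3, 2, 4, 1) at different addresses.
heap_a = {1000: (5, 1016), 1016: (3, 1032), 1032: (2, 1048), 1048: (4, 1064), 1064: (1, 0)}
heap_b = {7200: (5, 7040), 7040: (3, 9000), 9000: (2, 8800), 8800: (4, 7100), 7100: (1, 0)}

# A memo table keyed by the pointer-free signature (a stand-in for the SEDATABASE).
memo = {list_signature(heap_a, 1000): (1, 2, 3, 4, 5)}

# The second run finds the same entry even though every pointer differs.
result = memo.get(list_signature(heap_b, 7200))
```

A key built from raw memory contents would differ between the two runs, which is the difficulty described above.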

There are two phases in the overall setup, the SECODE construction phase and the execution phase.

The construction process starts when the machine learning component selects a region of trace assembler code to be compiled. Discussion of this component can be found in Section 6. Once the trace code is identified, it is recompiled into SECODE. To do this, the original code is compiled into an annotated directed graph. This graph is represented visually with SEDIAGRAMS. We will use the terms graph and SEDIAGRAMS interchangeably. SEDIAGRAMS can be reduced to a minimal set of conditions and side effects of the original code. Then the SEDIAGRAMS are recompiled into SECODE, in particular the SECONDITIONS and the SECHANGES. The entire process can be completed in $O(n \log n)$ time. This configuration can be found in Figure 3.

Figure 3: The original code is compiled into SEDIAGRAMS which are, in turn, reduced and recompiled into SECODE.

The execution of the SEDATABASE is as follows. For each compiled SECODE, the SECONDITIONS check whether there is a match. Multiple SECONDITIONS can be compiled together into a condition tree structure. If, during the course of operation, a SECONDITION is satisfied, then the output is an array of pointers. Using this array, the side effects of the compiled code are computed with its corresponding SECHANGES code. After that, normal operation resumes.
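The control flow of the execution phase can be sketched as follows. This is a minimal sketch under stated assumptions, not the SEDATABASE implementation: each SECODE entry is modeled as a hypothetical pair of Python functions, the condition tree is replaced by a linear scan, and registers and memory are plain dicts. The function names and the stored values are invented for the example.

```python
def run_sedatabase(entries, regs, mem):
    """Apply the first entry whose conditions match; return True if a jump occurred."""
    for conditions, changes in entries:
        pointers = conditions(regs, mem)  # array of pointers on a match, else None
        if pointers is not None:
            changes(pointers, regs, mem)  # apply the side effects of the original code
            return True
    return False  # no match: normal operation continues


def conditions(regs, mem):
    # Hypothetical condition: register 1 points at a word holding 5.
    p = regs[1]
    if mem.get(p) == 5:
        return [p]
    return None


def changes(pointers, regs, mem):
    # Hypothetical side effects: overwrite that word and set register 2.
    mem[pointers[0]] = 120
    regs[2] = 120


regs = {1: 4096, 2: 0}
mem = {4096: 5}
jumped = run_sedatabase([(conditions, changes)], regs, mem)
```

The conditions never compare against the raw address 4096; they only follow the pointer in register 1, so the same entry would match at any address.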

## 2 Related Work

Memoization has been implemented in functional languages [ABH11]. A program to dynamically identify areas of code that represent good candidates for memoization was introduced in [TPG15]. In [ACV05], lossy memoization for multimedia floating-point applications was proposed.

There is a lot of work on combining pattern recognition and computing when the results only need to be approximately correct [Mit16]. Machine learning has been incorporated at the operating system level with regard to learning configurations, learning policies, learning mechanism, and process scheduling [KK20]. Machine learning has also been suggested as a tool for managing the configuration of operating systems [ZH19]. A machine learning library for kernel space has been developed [UAZ20]. However, to the author's knowledge, there is nothing in the literature (aside from PSML) which suggests applying machine learning directly to CPU traces with the intention of leveraging hot paths.

<sup>1</sup>This paper doesn't deal with the case of testing the difference between two pointers, but this can be handled readily.

## 3 MIPs Instruction Set

In this paper, a subset of the MIPs instruction set is used for the traces. With some redundancy, the entire instruction set can be used. The majority of this paper assumes the operations are over integers. Section 5 addresses how the SEDATABASE handles floating point numbers. We assume the reader is familiar with operations of registers and the stack. All registers are denoted by \$n, for some number n. The stack pointer register is represented by \$sp. We assume a small, but unspecified, number of registers. The following MIPs operations will be used in this paper.

#### Load Immediate

li \$1, 300

The load immediate operation puts the constant second argument into the register specified in the first argument.

#### Move

move \$1, \$2

The move operation puts the contents of register \$2 into register \$1.

#### Add Immediate

addi \$1, \$2, 200

The contents of register \$1 is set to the contents of register \$2 plus the immediate value (in this case 200).

#### Add

add \$1, \$2, \$3

The contents of register \$1 is set to the contents of register \$2 plus register \$3.

#### Subtract

sub \$1, \$2, \$3

The contents of register \$1 is set to the contents of register \$2 minus register \$3.

#### Multiply

mul \$1, \$2, \$3

The contents of register \$1 is set to the contents of register \$2 times register \$3. The way SEDATABASE handles MUL is the same way it handles all mathematical operations other than addition and subtraction. This includes OR, AND, DIV, and SHIFT. Therefore, we exclude these operations from the scope of the paper.

#### Branch on Equal

beq \$1, \$2, 100

If register \$1 is equal to register \$2, then go to the program counter + 4 + offset (in this case, 100). SEDATABASE handles branch on not equal, BNE, exactly as it handles branch on equal, so BNE is not included.

#### Branch on Less Than

blt \$1, \$2, 100

If register \$1 is less than register \$2, then go to the program counter + 4 + offset (in this case, 100). SEDATABASE handles branch on less than or equal, greater than, and greater than or equal as it handles branch on less than, so these operations are not included in this paper.

#### Load Word

lw \$1, 100(\$2)

Register \$1 is set to the contents of the memory at the location specified by register \$2 plus the offset (in this case 100).

#### Store Word

sw \$1, 100(\$2)

The contents of register \$1 is saved into memory at the location in register \$2 plus the offset (in this case 100).

#### New

To allocate memory, MIPs specifies placing the memory size in a register and returns the address location in another. However, for simplification purposes, we will use the following notation to allocate memory.

```
new $1, $2
```

This operation creates memory of size equal to the contents of register \$2 and places its location in register \$1.

#### Free

The MIPs instruction set does not include a method for freeing memory. We create a new operation of the following form.

```
free $1
```

This operation frees the memory location at the address in register \$1.
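The semantics of the operations listed in this section can be summarized in a small interpreter. This is a sketch for illustration, not part of the SEDATABASE. It assumes instructions are given as Python tuples, each occupying 4 bytes, that `new` is a bump allocator starting at an arbitrary address, and that `free` simply forgets the word at the given address. The function `run` and the tuple encoding are hypothetical.

```python
def run(program, regs=None, mem=None):
    """Execute a list of instruction tuples; return (regs, mem).

    Each instruction occupies 4 bytes, so a taken branch goes to pc + 4 + offset.
    """
    regs = dict(regs or {})
    mem = dict(mem or {})
    next_free = 0x1000  # assumed start of the heap used by `new`
    pc = 0
    while 0 <= pc < 4 * len(program):
        op, *a = program[pc // 4]
        pc += 4
        if op == "li":
            regs[a[0]] = a[1]
        elif op == "move":
            regs[a[0]] = regs[a[1]]
        elif op == "addi":
            regs[a[0]] = regs[a[1]] + a[2]
        elif op == "add":
            regs[a[0]] = regs[a[1]] + regs[a[2]]
        elif op == "sub":
            regs[a[0]] = regs[a[1]] - regs[a[2]]
        elif op == "mul":
            regs[a[0]] = regs[a[1]] * regs[a[2]]
        elif op == "beq":
            if regs[a[0]] == regs[a[1]]:
                pc += a[2]
        elif op == "blt":
            if regs[a[0]] < regs[a[1]]:
                pc += a[2]
        elif op == "lw":    # ("lw", rt, offset, rs): rt = mem[rs + offset]
            regs[a[0]] = mem[regs[a[2]] + a[1]]
        elif op == "sw":    # ("sw", rt, offset, rs): mem[rs + offset] = rt
            mem[regs[a[2]] + a[1]] = regs[a[0]]
        elif op == "new":   # ("new", rd, rs): rd = address of a block of size rs
            regs[a[0]] = next_free
            next_free += regs[a[1]]
        elif op == "free":  # sketch: only forgets the word at the address
            mem.pop(regs[a[0]], None)
        else:
            raise ValueError("unknown operation: " + op)
    return regs, mem


program = [
    ("li", 2, 8),         # $2 = 8 (bytes to allocate)
    ("new", 1, 2),        # $1 = address of a fresh 8-byte block
    ("li", 3, 300),       # $3 = 300
    ("sw", 3, 4, 1),      # mem[$1 + 4] = $3
    ("lw", 4, 4, 1),      # $4 = mem[$1 + 4]
    ("addi", 5, 4, 200),  # $5 = $4 + 200
    ("beq", 4, 3, 4),     # $4 == $3, so the next instruction is skipped
    ("li", 5, 0),         # skipped
    ("sub", 6, 5, 3),     # $6 = $5 - $3
]
regs, mem = run(program)
```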

#### Trace Code

When the assembler code is run, the registers will have values associated with them. In order to represent them, numbers are put in the comments section.

```
add $1, $2, $3 # 5 4 1
```

This indicates register \$1 results in a value of 5 and registers \$2 and \$3 are 4 and 1, respectively. Depending on the operation, the first argument equals either the current register value or a new assigned value. An example is the LW and SW operations.

```
lw $1, 100($2) # 200 14352
sw $1, 400($2) # 300 45902
```

The LW operation indicates register \$1 results in a value of 200 and \$2 contains 14352, so the address 14352+100 was referenced. The SW indicates 300 is in register \$1, which is stored in address 45902+400.
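The annotated trace format above is simple to parse. The following sketch, with hypothetical helper names `parse_trace_line` and `effective_address`, splits a trace line into its operation, operands, and traced values, and computes the address referenced by an LW or SW as the traced base register value plus the offset in the instruction. It assumes comma-separated operands and integer annotations.

```python
import re

# Matches a memory operand such as 100($2): an offset and a base register.
MEM_OPERAND = re.compile(r"^(-?\d+)\(\$(\w+)\)$")


def parse_trace_line(line):
    """Split an annotated trace line into (op, operands, values)."""
    code, _, comment = line.partition("#")
    op, _, rest = code.strip().partition(" ")
    operands = [tok.strip() for tok in rest.split(",") if tok.strip()]
    values = [int(v) for v in comment.split()]
    return op, operands, values


def effective_address(line):
    """For lw/sw, return the traced base register value plus the offset."""
    op, operands, values = parse_trace_line(line)
    if op not in ("lw", "sw"):
        raise ValueError("not a memory operation: " + op)
    match = MEM_OPERAND.match(operands[1])
    if match is None:
        raise ValueError("malformed memory operand: " + operands[1])
    offset = int(match.group(1))
    base_value = values[1]  # second annotation is the base register's value
    return base_value + offset


load_address = effective_address("lw $1, 100($2) # 200 14352")   # 14352 + 100
store_address = effective_address("sw $1, 400($2) # 300 45902")  # 45902 + 400
```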

## 4 SEDIAGRAMS

SEDIAGRAMS are linearly constructed from large swaths of assembler trace code. SEDIAGRAMS consist of two types of constructs: nodes and lines.



Figure 4: An abstract SEDIAGRAM. The nodes of the SEDIAGRAM are compiled into SECODE from top to bottom, in the order specified.

Nodes are represented by gray rectangles and the lines have arrows at the end. The diagram is a directed acyclic graph that flows downward. An abstract SEDIAGRAM can be seen in Figure 4. For each operation in the trace code, zero or one node is created at the bottom of the SEDIAGRAM. New lines going into the node are constructed, connecting to previously created nodes higher up in the SEDIAGRAM. After the SEDIAGRAM has been constructed, it can be used to create the SECONDITIONS and SECHANGES. The SECONDITIONS construct is an annotated directed graph representing pointers and constant values. The SECHANGES is assembler code to be run on the SEDATABASE that will implement the side effects of the original code. The construction of the SECONDITIONS and SECHANGES from the SEDIAGRAM occurs with algorithms that start at the top of the diagram and work their way down.



Figure 5: The three different types of lines. The Var line is on register \$3 with val 198078. The Const line is on register \$5 with val 100. The SP line has offset equal to 20.




#### 4.1 Lines

Lines contain three pieces of information: a val, a register number, and a type. There are three different types of lines: Const, Var, and SP. The Const lines represent values in the computation which are not pointer values. The val of a Const line is the value of the constant. For example, if the trace code is the sorting of a linked list, then the Const lines represent the integer values accessed and modified in the linked list nodes. The Var lines represent the pointer values used in the original trace code. The val of a Var line is the raw pointer value. The Var lines have the restriction that the only operations that can be applied to them are addition or subtraction by a constant, or testing equality with other Var lines or NULL. The SP line, short for Stack Pointer, contains the value of the \$sp register. Thus the register of the SP line is always \$sp. The SP line always starts with a value of 0, representing an offset from the original \$sp value. Thus if the SP line contains a val of 100, then this means the stack pointer is 100 more than its original value. Other than the SP line, all lines are initialized with the Var type.

A line is active if there is no line further down with the same register. When an operation occurs, it may create a node. Some of the arguments of this operation are registers. If there are active lines with these registers, then they are connected to the input of these nodes. Otherwise, a new line is created with an indication that its starting point is the register at the starting point of the trace code. The end state of registers can be simply calculated from the active set of lines at the bottom of the SEDIAGRAM.

Figure 6: The move operation is represented by the M node. The value 44 is passed from register \$3 to register \$5. Then this line becomes inactive when 51 is passed from register \$4 into register \$5.

Figure 5 shows how the lines are displayed graphically, with their information shown on top. The register number and/or the value can be omitted to not overly clutter the diagram.
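The line bookkeeping described in this subsection can be sketched with a small data structure. This is an illustrative sketch, not the construction used in the paper: the class and field names (`Line`, `LineType`, `ActiveLines`, `from_start`) are hypothetical, and the set of active lines is assumed to be a dict keyed by register name.

```python
from dataclasses import dataclass
from enum import Enum


class LineType(Enum):
    CONST = "Const"
    VAR = "Var"
    SP = "SP"


@dataclass
class Line:
    register: str
    val: int
    type: LineType = LineType.VAR  # lines other than SP start as Var
    from_start: bool = False       # True if the line begins at the start of the trace


class ActiveLines:
    """Tracks, per register, the lowest (most recent) line in the diagram."""

    def __init__(self, sp_register="$sp"):
        # The SP line starts with val 0: an offset from the original $sp value.
        self.active = {sp_register: Line(sp_register, 0, LineType.SP)}

    def read(self, register, traced_val):
        """Return the active line for a register, creating one if none exists."""
        if register not in self.active:
            self.active[register] = Line(register, traced_val, from_start=True)
        return self.active[register]

    def write(self, line):
        """A new line on a register makes any older line on it inactive."""
        self.active[line.register] = line

    def end_state(self):
        """Register values at the bottom of the diagram."""
        return {reg: line.val for reg, line in self.active.items()}


lines = ActiveLines()
var_line = lines.read("$3", 198078)           # no active line yet, so one is created
lines.write(Line("$5", 100, LineType.CONST))  # a Const line on register $5
final_registers = lines.end_state()
```

Keeping only the active line per register is what allows the end state of the registers to be read off directly at the bottom of the diagram.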

#### 4.2 Move

The move operation transports the contents of one register into another register. It corresponds to the "M" node in the SEDIAGRAM, which is shown in Figure 6.
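The two moves of Figure 6 can be replayed with a short sketch. The dict-based representation and the helper `apply_move` are hypothetical, and it is an assumption of this sketch that the output line inherits the type of the input line.

```python
def apply_move(active, nodes, dst, src, src_val):
    """Add an M node: the value on src's active line is passed to a new line on dst."""
    if src not in active:
        # No active line yet: the line starts at the beginning of the trace.
        active[src] = {"register": src, "val": src_val, "type": "Var"}
    src_line = active[src]
    # Assumption: the new line keeps the type of the line it was copied from.
    out_line = {"register": dst, "val": src_line["val"], "type": src_line["type"]}
    nodes.append({"kind": "M", "input": src_line, "output": out_line})
    active[dst] = out_line  # any earlier line on dst becomes inactive
    return out_line


active, nodes = {}, []
first = apply_move(active, nodes, "$5", "$3", 44)   # move $5, $3
second = apply_move(active, nodes, "$5", "$4", 51)  # move $5, $4
```

After the second move, the line carrying 44 on register \$5 is no longer active, and only the line carrying 51 contributes to the end state.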

#### 4.3 Const Node

The multiply operator MUL (as well as the OR, AND, DIV, and SHIFT operations) will create a Const node. This is because we assume that the computation does not perform these operations on pointers. The input is two lines and the output is one line. The lines are transformed from type Var to type Const.
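A Const node can be sketched in the same style. The helper `apply_const_node` and the dict representation are hypothetical. The sketch follows one reading of the text above, in which both input lines are retyped from Var to Const and the output line is created as Const.

```python
def apply_const_node(op, left, right, dst, result_val):
    """Build a Const node for a non-additive operation such as MUL."""
    for line in (left, right):
        # Operands of MUL, OR, AND, DIV, and SHIFT are assumed not to be pointers.
        line["type"] = "Const"
    out = {"register": dst, "val": result_val, "type": "Const"}
    return {"kind": "Const", "op": op, "inputs": [left, right], "output": out}


a = {"register": "$2", "val": 6, "type": "Var"}
b = {"register": "$3", "val": 7, "type": "Var"}
node = apply_const_node("mul", a, b, "$1", 42)  # mul $1, $2, $3 with traced values 42 6 7
```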

## 5 Floating Point Numbers

## 6 Machine Learning

## References

- [ABH11] U. Acar, G. Blelloch, and R. Harper. Selective memoization. CoRR, abs/1106.0447, 2011.
- [ACV05] C. Alvarez, J. Corbal, and M. Valero. Fuzzy memoization for floating-point multimedia applications. *IEEE Transactions on Computers*, 54(7):922–927, 2005.
- [App14] 2013-2014. Discussions in Programmable Smart Machine Lab. Jonathan Appavoo, Steve Homer, Katherine Missimer, and Amos Waterland.
- [AWE+14] J. Appavoo, A. Waterland, S. Eldridge, K. Missimer, A. Joshi, S. Homer, and Seltzer. Programmable Smart Machines: A Hybrid Neuromorphic Approach to General Purpose Computation. In Neuromorphic Architectures (NeuroArch) Workshop at 41st International Symposium on Computer Architecture (ISCA-41), 2014.
- [KK20] M. Kulkarni and T. Kamble. Integration of Machine Learning into Operating Systems: A Survey. International Journal of Creative Research Thoughts, page 1270, 04 2020.
- [Mic68] D. Michie. "Memo" Functions and Machine Learning. *Nature*, 218(5136):19–22, 1968.
- [Mit16] S. Mittal. A survey of techniques for approximate computing. ACM Comput. Surv., 48(4), 2016.
- [TPG15] L. Toffola, M. Pradel, and T. Gross. Performance problems you can fix: a dynamic analysis of memoization opportunities. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, page 607–622, 2015.
- [UAZ20] I. Umit, A. Aydin, and E. Zadok. KMLib: Towards machine learning for operating systems. In Proceedings of the On-Device Intelligence Workshop, 2020.
- [WAS12] A. Waterland, J. Appavoo, and D. Schatzberg. Programmable smart machines. Technical Report BUCS-TR-2012-007, Computer Science Department, Boston University, 2012. http://hdl.handle.net/2144/11395.
- [ZH19] Y. Zhang and Y. Huang. "Learned": Operating Systems. SIGOPS Oper. Syst. Rev., 53:40–45, 2019.