# **Programming the FlexRAM Intelligent Memory Architecture**

Basilio B. Fraguela\*, José Renau<sup>†</sup>, Paul Feautrier<sup>‡</sup>, David Padua<sup>†</sup> and Josep Torrellas<sup>†</sup>

\* Dept. de Electrónica e Sistemas, Universidade da Coruña, E-15071, A Coruña, Spain. basilio@udc.es

† Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801. {renau,padua,torrella}@uiuc.edu

‡ PRiSM Laboratory, Universitè de Versailles, F-78035 Versailles Cedex, France. paul.feautrier@prism.uvsq.fr

Abstract—Advances in VLSI technology have enabled the implementation of computer architectures based on Processing-in-Memory (PIM) chips. These architectures can help bridge the growing gap between processor and main memory speeds and, as a result, improve the performance of both numeric and symbolic codes. While PIM organization has been extensively studied from the machine organization point of view, the problem of effectively programming this kind of systems has been largely ignored. In this paper, we address how to program FlexRAM, a generalpurpose computer with PIM memory modules. We present CFlex, a family of compiler directives inspired in OpenMP, which enable the development of portable programs. We also explore enhancing the programmability with libraries of functions called Intelligent Memory Operations (IMOs). These libraries are optimized to make use of the PIMs but hide all their complexity. The geometric mean of the speedups achieved for our test suite by a system with one FlexRAM chip programmed with CFlex relative to a conventional machine is 14.4.

## I. Introduction

THE performance of modern computers with high-frequency processors is often limited by the long latencies and low bandwidths of their memory systems. A promising approach to remove this memory bottleneck is to migrate all or part of the computation to processors embedded in the memory chips. The integration of processors and memory in the same chip, commonly known as Processing-in-Memory (PIM), has been made possible by advances in VLSI technology. In fact, recent advances allow the integration of processors with clock rates matching those found in logic-only chips with DRAM that is only about 10% less dense than what is typical in memory-only chips [1], [2].

Several research groups are studying this new technology following one of three main approaches. In the first approach, the PIM chip contains the main processor [3], [4], as much memory as possible and, in some cases, one or more special-purpose processors such as vector processors. The second approach is to design special-purpose processors or co-processors integrated with much memory that are oriented to specific tasks [5], [6]. The third approach consists in using the PIM chips as intelligent memories that replace all or some of the standard memory modules of a server or workstation [7], [8], [9]. These latter architectures are able to run in the PIM modules the most memory-intensive parts of the application, which are those for which the current systems exhibit the worst performance.

While much effort has been devoted to the analysis and evaluation of PIM architectures, programming issues have been largely ignored. These issues are particularly important for the third group of PIM architectures, as they potentially include a large number of simple, general-purpose processors that must act in coordination with the main processor(s) of the system.

To address this problem, this paper presents a programming language and its associated run-time system for FlexRAM [7], which is one of the architectures that use PIM chips as intelligent memories. In FlexRAM, the processors in the PIM chips lack most of the synchronization and communication mechanisms found in many traditional multiprocessor systems. This is due to several reasons. First, the PIM chips replace conventional memory chips, which are never masters of the memory bus. As a result, their simple processors can never invoke the main processor(s). Another reason is that the caches in the memory processors cannot be kept coherent with those of the main processor(s) using hardware mechanisms because the system lacks the required connections. Other limitations come from the need to keep the hardware of the memory processors as simple as possible. For example, the memory processors do not have support for atomic operations that would allow them to use well-known synchronization mechanisms based on regular locks and barriers. Overall, all these limitations provide a significant challenge to effectively program the FlexRAM architecture.

The rest of this paper is organized as follows. In the next section, we describe the second version of the FlexRAM system introduced in [7]. Then, Sect. III presents the CFlex compiler directives that we propose to program this system. CFlex directives are as simple and easy to use as OpenMP [10] directives, but they are much more flexible. In fact, as the examples in Sect. IV will show, CFlex directives allow parallelization without extensive reprogramming of more codes in our test suite than standard OpenMP. The operating system extensions and runtime system underlying CFlex are discussed in Sect. V. In this section, we also discuss the design of libraries that encapsulate the use of PIM processors. These libraries, called Intelligent Memory Operations (IMOs), enhance programmability while providing near optimal performance. The results of an evaluation of FlexRAM when programmed using CFlex and IMOs are reported in Sect. VI. Finally, Sect. VII presents conclusions and future work.

#### II. FLEXRAM ARCHITECTURE

A FlexRAM system is an off-the-shelf workstation or server where some of the DRAM chips in the main memory have been PAAM replaced by FlexRAM PIM chips [7]. Each FlexRAM PIM chip contains memory plus many simple, general-purpose processing elements called *PArrays*. The resulting memory system is a versatile accelerator where many processors can access memory with high bandwidth and low latency. Moreover, legacy codes can run unmodified, since they see the FlexRAM chips as standard memory chips.

In our current design, each FlexRAM chip has 64 Mbytes of memory organized in 64 1-Mbyte banks. Each bank is associated to one PArray. Since each PArray can be programmed independently, PArrays can run in SPMD or MIMD mode. Due to density and power restrictions, PArrays are much simpler than the main processor of the system, which we call *PHost*. However, FlexRAM chips do not require any change to the existing memory system specifications to be integrated. The only requirement is a memory standard that provides additional power and ground signals to enable on-chip processing. A standard that meets these requirements is Rambus [11].

Compared to the original FlexRAM architecture proposed in [7], three main architectural improvements have been made. First, each PArray now has a 8 KByte write-back data cache and a 2 KByte instruction cache. Second, each FlexRAM chip now includes an internal 2-D torus connecting the PArrays, since there are enough metal layers to allow its implementation. The torus enables a PArray to access any of the banks in its chip. The torus is connected to a bus that connects all the FlexRAM chips (FlexRAM bus). Consequently, a PArray can also access the banks in the other FlexRAM chips. Finally, given the high connectivity of PArrays, we remove the per-chip controller (PMem) proposed in [7]. The main purpose of the PMem was to shuffle data between the PArrays banks. Instead, we only have a much simpler FlexRAM chip controller (FXCC) that acts as the interface between the PHost and the PArrays. The FXCC provides services for PArray synchronization. Specifically, it contains registers to store the requests from the PHost to the PArrays, such as tasks to spawn, and from the PArrays to the PHost, such as page faults. Figs. 1 and 2 depict the current FlexRAM system.

These architectural changes cause changes in other areas like address translation. Specifically, recall that PArrays use virtual addresses in the virtual space of the application run by the PHost. Each PArray has a relatively small TLB that contains some entries from the PHost's page table. Given the higher connectivity of the PArrays, PArrays can now serve their own TLB misses by accessing the PHost's page table. However, page faults, page migrations, or other operations that involve changes in the page tables must still be managed by the PHost operating system.



Fig. 1. Workstation or server with FlexRAM chips.



Fig. 2. Structure of a PArray and a memory bank inside a FlexRAM chip.

# A. Interprocessor Communication and Synchronization

PArrays communicate with the PHost(s) and with other PArrays in two ways: through memory or via registers in the FX-CCs. The input and output data of the tasks executed by the PArrays is typically retrieved from and stored to memory, respectively. However, the commands and service requests that modify the state of the PArrays are stored in registers in the FXCCs. FXCCs are also used for the synchronization among PArrays and between PArrays and PHost when locks are involved. In this section, we examine these communications in detail.

1) Communication from PHost to PArray: The PHost communicates with the PArrays to spawn tasks on them, answer service petitions, and order maintenance operations such as flushing certain pages from their caches and invalidating the TLB entry associated with those pages. In either case, the PHost issues a command that consists of a command type code and at most two words. For example, assume that the input data of a task are stored in memory. Then, the command to order a PArray to start executing the task includes the address of the routine to execute and a pointer to the input data buffer. The PHost delivers such a command by programming the FXCC of the corresponding FlexRAM chip using special features and reserved code words from the Rambus definition. The FXCC can be envisioned as an I/O device with a series of ports that can be written to send control and data signals, or read to check the status of the device. The FXCC temporarily stores the PHost commands and sends them to the corresponding PArray by means

of the internal torus in the chip.

- 2) Communication from PArray to PHost: When a PArray requires a service from the PHost, such as handling a page fault, it sends a message with the service code and its parameters to its FXCC through the torus. The FXCC cannot deliver this message to the PHost, since only the PHost(s) can be master(s) of the memory bus. Consequently, the PHost periodically polls the FXCCs of the chips to check for potential requests from the PArrays. The FXCCs have registers to hold PArray requests until the next PHost poll.
- 3) Synchronization: There are a number of well-known mechanisms for processor synchronization. Unfortunately, many of these mechanisms rely on atomic operations and hardware coherence, which are too expensive to implement in our simple PArrays. Still, FlexRAM chips have a simple yet effective synchronization mechanism based on locks managed by the FXCCs. Such locks may be acquired and released by both PArrays and PHost(s), thus enabling any synchronization pattern in the system. FXCC locks are implemented as tokens. Each FXCC is responsible for a disjoint set of such locks. For those locks that are currently acquired, the home FXCC records the owner processor in a set of registers. Both acquire and release operations are implemented as requests to the home FXCC, which will check its registers to grant or deny the corresponding operation. When an acquire request for a free lock arrives at a FXCC that has run out of free lock registers, it denies the operation until there is a register available to store the owner of the lock.
- 4) Data coherence: Our system lacks hardware support for cache coherence between PHost(s) and PArrays. The cheapest and simplest solution to still maintain cache coherence involves at least the ability to flush and invalidate caches. Note that, in this environment, a (total or partial) flush operation of the write-back caches is always required to ensure that the latest version of the data is in memory and, therefore, visible to other processors.

In our system, we use the following procedure. The PHost flushes its caches before starting a task on the FlexRAM chips. This is done to make sure that the PArrays see the latest versions of data. Moreover, before the PHost executes code that may use results from the PArrays, the PHost invalidates a superset of such variables from its caches. As for the PArrays, when they complete a task, they flush and invalidate their small caches. This ensures that the data generated by the task is visible to the whole system and that, when the PArray resumes, its cache is free of stale variables.

While coherence among the PArrays could be supported with a directory-based hardware scheme, we feel that it is too costly for a system like FlexRAM. Consequently, we choose to restrict our programming scope to problems in which different PArrays write on different data items. The PArray caches include a dirty bit per word so that when a line is displaced from a cache, only the modified words are written back. This way, data coherence is maintained even in the presence of false sharing.

Finally, in applications where several PArrays need to writeshare data, these mechanisms are not enough. In these cases, the compiler or the programmer marks as critical sections those code portions where such data is manipulated. The compiler then inserts the corresponding cache flushes, invalidations, and FXCC lock operations to ensure program correctness.

#### III. CFLEX

We have chosen to program FlexRAM adding compiler directives to the sequential version of the programs rather than adding new features to standard programming languages. The reason is that in this way the same program can also be compiled for conventional execution by ignoring the directives, and therefore a single version of the program can be executed both on conventional systems and systems extended with FlexRAM chips.

The translation of CFlex directives has been implemented using the SUIF compiler [12] augmented with a preprocessor, and is currently aimed at C programs, although it can be easily extended to other languages. The basic structure of a CFlex directive is

# #pragma FlexRAM directive-type [clauses]

There are three classes of directives:

- Execution modifiers which indicate how a given piece of the code should be executed. These are the most widely used directives. They include the requests to spawn a given portion of the program for execution on a given processor or group of processors.
- Data modifiers to request that data structures fulfill certain conditions. An example are directives to pad one of the dimensions of an array to make their size a multiple of the page size.
- Executable directives that are just instructions in the program. Examples of this class include barriers, prefetch operations, etc.

The only directives that are required to use FlexRAM are those in the first class that define, spawn and synchronize tasks. Because of space limitations we will only discuss this type of directives. A complete description of CFlex can be found in [13]. These directives can only appear in the code executed by a PHost, as the PArrays cannot perform this type of operations. The *directive-type* is the class of processor on which the code will execute and its possible values are phost and parray. The code of the task to be spawned is the (possibly compound) statement following the directive.

Optional clauses in this directive include:

- on\_home(x) to specify that the task should be executed on the PArray on whose memory bank is located the data structure x.
- sync/async to specify whether the task spawning the new task must stop until the new task finishes (sync) or it can continue (async). The default is tt sync. A task does

not finish until all the tasks it has spawned have finished. Asynchronous tasks must be created inside the syntactical scope of a synchronous task, so that there is a point in the program execution flow where they are known to have finished. Such point is the end of the synchronous task inside which they have been spawn. The power, flexibility and simplicity of this approach will be illustrated in Sect. IV

- if (cond) controls the execution of the directive where it appears. The directive is executed only if cond is true.
- else also controls the execution of a directive. The directive is executed if the cond in the if clause in the immediately preceding directive is false.
- shared, private, lastprivate, firstprivate, reduction, migrate which have the same semantics as the directives of the same name in OpenMP. In contrast to OpenMP, CFlex allows their application to only one part of a structure (e.g., a field of a struct, an element of a vector). A migrate clause has been considered. It designates shared data structures whose pages should migrate to the first PArray that touches them once the task(s) created by this directive begin their execution. This clause has not been implemented for two reasons. First, the high cost of automatic page migration offsets very often its advantages even in architectures, such as the NUMA machines [14], that are much better suited for it than the PIMs. Second, although its functionality is more restricted, our on\_home clause already allows the migration of computations to the processor that owns certain data at much cheaper cost.
- flush specifies which pieces of data need to be flushed from the PHost cache so that the PArrays executing the task(s) have access to the latest version of the data they require. By default the compiler flushes the whole cache of the PHost to ensure that the PArrays do not use outdated data residing in memory. The programmer can use this clause to improve the performance by restricting the flush to a certain set of variables.

Notice that CFlex can be used to express parallelism in systems without PIMs because it can generate and synchronize parallel tasks at the PHost level without ever having to refer to PArrays. In fact, CFlex is an easy and sensible approach to port existing parallel codes to NUMA systems thanks to clauses like on\_home or migrate. As an additional benefit, the sync/async clauses enable us to generate new tasks dynamically outside loops. This gives CFlex the power to parallelize accesses to lists, trees and other pointer-based structures as well as recursive algorithms. This last ability gives CFlex an advantage over OpenMP, which can only express the parallelization of iterative constructs.

# IV. Examples of the use of CFLex

Our basic description of CFlex provides enough context to illustrate its flexibility with some examples. Consider the code

in Fig. 3. The original task, say  $T_1$ , generates a task,  $T_2$ , corresponding to the loop. Task  $T_1$  will wait until task  $T_2$  completes. Task  $T_2$  spawns one task for each iteration of the loop that traverses the linked list. Since these tasks are asynchronous,  $T_2$ does not wait for them to complete and continues spawning without pause until in has completed all loop iterations. The task corresponding to each iteration of the loop will be run in the PArray on whose bank p->data is located and it will receive a privatized copy of the value of pointer p for that iteration. The for loop (task  $T_2$ ) is marked as a synchronous task in the PHost. As a result, the compiler will insert code to wait for the tasks generated in the PArray(s) to finish before  $T_1$  continues with the code following the loop. It is interesting to note how this scheme allows the specification of overlapping tasks in the PHost and the PArrays in a natural way just by adding more code to be run by the PHost inside the synchronous task.

As can be seen in this example, loops can be parallelized by enclosing the whole loop inside a synchronous task that will provide the required synchronization point, and by specifying that each iteration of the body is an asynchronous task. Loops are the most common source of parallelism, so CFlex has been extended with a pfor clause that tells the compiler to break the loop following it into a series of tasks and wait for those tasks to finish before continuing. The previous loop rewritten using a pfor clause is shown in Fig. 4.

A more elaborate parallelization scheme is required when there are portions of code that must be run sometimes in the PHost, and sometimes in the PArrays. Fig. 5 shows an example of this kind extracted from the treeadd code, part of the Olden benchmark suite [15]. This routine adds the values stored in the nodes of a tree. During the construction of the tree, the upper levels were allocated by the PHost while the subtrees below a given level were allocated by the PArrays. Our runtime system allows all the PArrays to allocate and deallocate heap memory in parallel.

The function lcl, which is part of this runtime system, returns zero if a node was allocated by the PHost and a positive number if it was allocated by a PArray. The routine proceeds by spawning one PHost task for each of the left children in the upper levels of the tree (those allocated by the PHost and therefore with an lcl value of zero). Once a PArray task is spawned, no new tasks will be created since PArrays are incapable of creating new tasks. Therefore, once a PArray task is spawned, the whole subtree will be processed sequentially.

Because PArrays cannot create tasks, the compiler generates two versions of this kind of routines. The PHost version includes all the tests and the task data generation, spawn and synchronization calls, while the PArray version only contains the original code of the routine.

Notice also that in both examples it was not necessary to add or modify any line of the sequential code. All that was needed to parallelize the code was to insert directives. This is typical of the codes we have parallelized for FlexRAM.

```
#pragma FlexRAM phost sync
  for(p = head; p != NULL; p = p->next)
#pragma FlexRAM parray async on_home(*(p->data)) firstprivate(p)
    process(p->data);
```

Fig. 3. Parallelized linked list processing using the sync and async clauses

```
#pragma FlexRAM parray pfor on_home(*(p->data)) firstprivate(p)
  for(p = head; p != NULL; p = p->next)
    process(p->data);
```

Fig. 4. Parallelized linked list processing using the pfor clause

```
int TreeAdd (register tree_t *t) {
   if (t == NULL) return 0;
   else {
     int leftval, rightval, value;
     tree_t *tleft, *tright;

#pragma FlexRAM phost sync
   {
      tleft = t->left;

#pragma FlexRAM parray async on_home(*tleft) if (lcl(tleft))
   #pragma FlexRAM phost async else
      leftval = TreeAdd(tleft);

      tright = t->right;

#pragma FlexRAM parray async on_home(*tright) if (lcl(tright))
      rightval = TreeAdd(tright);
   }
   value = t->val;
   return leftval + rightval + value;
   }
}
```

Fig. 5. Parallelized tree processing

### V. SOFTWARE SUPPORT

The management of our PIM system requires a series of extensions to the Operating System (OS). For example, when a page is referenced for the first time and that reference comes from a PArray, our Operating System places the page in the memory bank of that PArray. Also, our OS must ensure that the PArrays do not fall in an incorrect state when a page replacement is required. Currently this is achieved by precluding the replacement of pages containing information used by the PArrays. In the future, we plan to enable the replacement of these pages by implementing operations to flush all the references to them from the caches and TLBs of the PArrays. The implementation could be based on a table containing the list of PArrays that point to each physical page in their TLB. When a PArray suffered a TLB miss that led to the replacement of the entry associated to page P, it would remove itself from the list associated to P, and it would flush the data of this page from its cache. Then it would add itself to the list of PArrays that keep the mapping for the page that generated the TLB miss. These two steps could be restricted to only those pages that are not already in the PArray bank, as we expect the PArrays to generate most of their references to the local bank. Our OS would use these tables to know which PArrays need to be instructed about the replacement of a given page in order to flush the data of the page and invalidate the corresponding mapping. Notice that this table would be also very useful to implement the page migration mechanism because only the specified processors would have to be informed about the migration. The FXCC locks introduced in Sect. II-A would provide for the required mutual exclusion during the operation of the OSs of the different processors.

A runtime system implemented as a dynamically linked library would take care of those software procedures that do not require OS support. Those functions include:

- an interface to the FXCC locks and constructions built upon them, like barriers.
- heap memory management including parallel allocation and deallocation of space.
- the polling of the FlexRAM chips to detect requests from the PArrays (see Sect. II-A). The OS might have to be invoked to serve some requests such as page faults.

| Function description              | Abstract expression                        | Syntax                          |  |
|-----------------------------------|--------------------------------------------|---------------------------------|--|
| Apply func f with arg a           | $\operatorname{exec} f(v(i),a), 0 < i < N$ | Vector_apply(v,f,a)             |  |
| Search element that fulfills      | ret any v(i) such that                     | Vector_search(v,f,a)            |  |
| condition f with arg a            | $f(v(i),a) \neq 0, 0 < i < N$              |                                 |  |
| Generate vector with the result   | v2(i) = f(v(i),a)                          | v2=Vector_map(v,f,a)            |  |
| of applying func f with arg a     | 0 < i < N                                  |                                 |  |
| Reduce vector applying func f,    | ret $f(f(a,v(0)),v(N-1))$                  | <pre>Vector_reduce(v,f,a)</pre> |  |
| whose neutrum is a                | where $f(a,x)=x$                           |                                 |  |
| Process together two vectors and  | v3(i) = f(v(i), v2(i), a)                  | v3=Vector_map2(v,v2,f,a)        |  |
| an arg a, generating a new vector | 0 < i < N                                  |                                 |  |

TABLE I Some Vector container IMOs

Although the CFlex pragmas are all we need to program our system, programmability can be highly improved by providing libraries of subroutines that make use of the PArrays but that hide completely their existence. We call these subroutines Intelligent Memory Operations (IMOs). The IMOs may work on data structures defined by the user performing typical operations on them. Examples of this kind of IMOs would be finding the minimum value in a vector and adding two matrices. Other IMOs can be envisioned like STL classes [16] that make use of the PIMs throughout all the phases of the life of a given data structure. For example, IMO STL classes may define and operate on containers such as lists, hashes, sets, etc. making use of the PIMs to perform in parallel allocations and deallocations of elements, searches, and other computations with the elements stored in these containers. Some IMO functions for a vector container are proposed in Table I.

#### VI. EVALUATION

The performance evaluation of our hardware/software system, including our compiler, is based on a cycle-by-cycle detailed simulation using a tool derived from MINT [17]. Our baseline is a standard workstation with a 1.6 GHz five-issue high-performance processor similar to the Power4. We measure the speedup achieved when its plain memory modules are replaced with FlexRAM chips programmed using CFlex and our runtime system. The main related architectural parameters are listed in Table II, where all the times are measured in PHost cycles. Notice that the main processor of the system is extremely powerful, has a large L2 cache and is able to sustain many simultaneous accesses in parallel.

The applications and problem sizes used in our evaluation are described in Table III. TSP and TreeAdd are two well-known Olden benchmarks [15]. Equake is an SPEC OMP2001 run with the test input set. Matrix is a dense matrix product with blocking in the three dimensions that uses 1000x1000 double precision matrices. Finally, Distance and Path are two applications extracted from [18] that make use of an IMO implementing a singly-linked list. The last three columns of Table III attempt to measure the effort required to parallelize the applications by showing their original number of lines of code as well as the number of CFlex directives and additional lines of code

| PHost processor                           | PHost caches                    | Bus & Memory                         |
|-------------------------------------------|---------------------------------|--------------------------------------|
| Freq: 1.6 GHz                             | L1 size: 32 KB                  | Bus: split transaction               |
| Issue width: 5 Dyn issue: yes             | L1 OC,RT: 1,3<br>L1 assoc: 2    | Bus width: 16 B<br>Bus freq: 400 MHz |
| I-window size: 64                         | L1 line: 128 B                  | Mem: 1-channel Rambus                |
| Ld/St units: 2,1                          | L1 HUM: 16                      | Mem RT: 180 (112.5 ns.)              |
| Int,FP units: 5,4<br>Pending Ld,St: 32,32 | L2 size: 1 MB<br>L2 OC,RT: 4,12 |                                      |
| BR penalty: 12                            | L2 oc,R1. 4,12<br>L2 assoc: 8   |                                      |
| TLB entries: 128                          | L2 line: 128 B                  |                                      |
|                                           | L2 HUM: 8                       |                                      |
| PArray processor                          | PArray Cache                    | FlexRAM torus & bus                  |
| Freq: 1 .6GHz                             | L1 size: 8 KB                   | Torus width: 8B                      |
| Issue width: 2                            | L1 OC,RT:1,2                    | Torus freq: 1.6 GHz                  |
| Dyn issue: no                             | L1 assoc: 2                     | Bus: split transaction               |
| Ld/St units: 1,1                          | L1 line: 32 B                   | Bus width: 16 B                      |
| Int,FP units: 1,1                         | L1 HUM: 0                       | Bus freq: 400 MHz                    |
| Pending Ld,St: 2,2                        |                                 |                                      |
| BR penalty: 5                             |                                 |                                      |
| TLB entries: 32                           |                                 |                                      |

TABLE II

PARAMETERS OF THE ARCHITECTURE SIMULATED. OC, RT AND HUM STAND FOR OCCUPANCY, ROUND TRIP FROM THE PROCESSOR AND HIT UNDER MISS, RESPECTIVELY.

| Applic.  | Proble    | m size   | Original | Directives | Additional |
|----------|-----------|----------|----------|------------|------------|
|          | Nodes     | Memory   | lines    |            | lines      |
| TSP      | 512 K     | 22 MB    | 485      | 12         | 5          |
| TreeAdd  | 2 M       | 40 MB    | 71       | 8          | 4          |
| Equake   | 7 K       | 3.6 MB   | 1227     | 14         | 6          |
| Matrix   | 1000x1000 | 24 MB    | 81       | 1          | 2          |
| Distance | 30 K      | 1.2 MB   | 108      | 17         | 7          |
| Path     | 400 K     | 12.8 MB  | 165      | 17         | 9          |
| Average  | 991K      | 17.26 MB | 356      | 11.5       | 5.5        |

TABLE III
APPLICATION CHARACTERISTICS

required to generate the parallel version. Little effort has been made to optimize the parallel version, other than using a strategy that distributes the tasks evenly among the PArrays. Rather, we have stressed the simplicity of the parallelization scheme by making as few modifications as possible, as the figures in the table show.

The parallelization of the two Olden benchmarks is similar: they both build a tree and then perform some kind of processing on it. Both stages of these programs have been parallelized by choosing a level in the tree under which each one of the subtrees is processed by a single PArray. This scheme provides a good load balance between the PArrays. An example corresponding to the tree reduction in TreeAdd can be seen in Fig. 5. The upper levels of the tree are thus processed by fewer PArrays as the number of subtrees is reduced in TSP. The PHost is in charge of processing all the nodes of the uppers levels in TreeAdd because of the very short computation necessary per node.

As for Equake, it has been parallelized by replacing the OpenMP pragmas provided in the SPEC OMP suite with CFlex pragmas. It deserves mention that it allocates 290 additional KB of memory per each new PArray thread. Consequently, its memory usage grows with the degree of parallelization.

The matrix product has been parallelized simply by instructing the system to calculate the result for each block of rows of the destination matrix in a different PArray. This way, only one pragma is required to split this parallel loop.

In the case of Distance and Path we have not started from a sequential version. Instead, we have written the programs making use of the IMOs. Thus, for these applications we show the total number of CFlex directives inside the IMO library (Directives column) and the static number of calls to IMO functions in these applications (Additional lines). Also, the original code size reflects the number of lines of the applications without including the IMO library, which is 714 lines long, as it is provided by the system. The IMO functions are designed to be near-optimal both for the sequential and the parallel execution. Still, there are particular cases in which the sequential execution performance can be hindered by the structure of the code oriented to the parallel execution. In those cases, we have written versions of these functions that are optimized for the sequential execution, and we use them as our baseline when the code is run on the PHost.

### A. Speedups

The speedups achieved for these applications using one or two FlexRAM chips over the plain system are shown in Fig. 6. We see that despite the simple parallelization schemes, very good speedups have been achieved for most of the codes. This is particularly encouraging when we consider the simplicity of our PArrays compared to the main processor of the system.

It is interesting to see how some of the applications benefit from the usage of several chips while others do not. The main reason for this behavior difference lies in the degree of locality in the accesses to memory by the PArrays. Specifically, the use of several FlexRAM chips usually implies communication through the FlexRAM bus, which turns the bus into a source of contention when the locality of the PArrays is poor, and they need to access data in banks located in other chips. For example, in TSP each PArray can initially process locally a subtree of the lower part of the tree. However, as the processing of the tree goes up, there are fewer subtrees, which means both less parallelism and that the PArrays have to access data in more and more banks. The performance penalty is higher when some



Fig. 6. Speedups obtained using FlexRAM chips.

of those banks are located in another chip. This is the reason for the small improvement as we go to two FlexRAM chips in TSP.

TreeAdd does not suffer from this problem because its parallelization allows each PArray to process exclusively data located in its bank, and the reduction of the data provided by the PArrays can be quickly performed by the PHost.

The bus bottleneck problem is the worst for very regular parallel codes in which many processors request almost simultaneously certain pieces of data, thus turning the corresponding banks into sources of contention too. This is what happens in the dense matrix product.

There are situations in which although the parallelization scheme and the distribution of the data are both good, there is just not enough data or processing to justify splitting the work among more processors. The advantages obtained by the use of more PArrays vanish due to the growth of the overheads, in Equake and Path. There is an additional reason for the relatively small speedups of Equake: the maximum theoretical speedup it can achieve for the test input set is 2.13, as only 53% of its time is spent in the parallelizable loops. Our parallelization reduces the execution time of the parallel loops to about half, achieving a speedup of 1.38 with one FlexRAM chip and 1.18 with two chips.

### VII. CONCLUSIONS

This paper addressed the issue of providing ease of use and efficient programming support for FlexRAM [7], an intelligent memory architecture. We know of no other works focused on solving the programmability issues related to this kind of systems. This task required solving several challenges, as FlexRAM imposes a large number of restrictions. On the one hand, its memory processors lack a large number of interprocessor synchronization and communication mechanisms and there is very little hardware support for cache coherence. On the other hand, the memory processors depend on the main processor(s) of the system to perform several tasks. These prob-

lems have been overcome by combining CFlex, a novel family of pragmas inspired on OpenMP, with a run-time system. Our directives are adapted to the nature of our PIM chips and they allow simple and effective parallelization of a much larger class of problems than the current standard OpenMP does. The complexity of the system can be further hidden by providing libraries of functions that operate on data structures making use of the PIMs but without requiring the programmer to have any knowledge of them. Such functions, called Intelligent Memory Operations (IMOs), optimize the usage of the system while reducing the programming effort. Our experiments show that a workstation with PIM chips can easily achieve speedups of over one order of magnitude with few changes to the source code of the applications.

### REFERENCES

- [1] IBM Microelectronics, "Blue Logic SA-27E ASIC." http://www.chips.ibm.com/news/1999/sa27e, February 1999.
- [2] S. İyer and H. Kalter, "Embedded DRAM technology: opportunities and challenges," *IEEE Spectrum*, April 1999.
- [3] D. Patterson et al., "A Case for Intelligent DRAM," IEEE Micro, pp. 33–44, March/April 1997.
- [4] E. Waingold et al., "Baring It All to Software: Raw Machines," *IEEE Computer*, pp. 86–93, September 1997.
- [5] S. Rixner et al., "A Bandwidth-Efficient Architecture for Media Processing," in *International Symposium on Microarchitecture*, November 1998.
- [6] S. Kaxiras, R. Sugumar, and J. Schwarzmeier, "Distributed Vector Architecture: Beyond a Single Vector-IRAM," in First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.

- [7] Y. Kang, W. Huang, S. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, "FlexRAM: Toward an Advanced Intelligent Memory System," in *International Conference on Computer Design*, pp. 192–201, October 1999.
- [8] M. Oskin, F. Chong, and T. Sherwood, "Active Pages: A Computation Model for Intelligent Memory," in *International Symposium on Computer Architecture*, pp. 192–203, June 1998.
- [9] M. Hall et al., "Mapping Irregular Aplications to DIVA, a PIM-based Data-Intensive Architecture," in Supercomputing, November 1999.
- [10] OpenMP Architecture Review Board, "OpenMP C and C++ Application Program Interface Version 2.0." March 2002.
- [11] R. Crisp, "Direct Rambus Technology: the New Main Memory Standard," in *IEEE Micro*, pp. 18–28, November 1997.
- [12] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J.-A. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers," *SIGPLAN Notices*, vol. 29, no. 12, pp. 31–37, 1994.
- [13] B. Fraguela, J. Renau, P. Feautrier, D. Padua, and J. Torrellas, "CFlex, a Programming Language for the FlexRAM Intelligent Memory Architecture," tech. rep., Department of Computer Science, University of Illinois at Urbana-Champaign, July 2002.
- [14] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson, and C. D. Offner, "Extending OpenMP for NUMA Machines," in SC2000: High Performance Networking and Computing (ACM, ed.), pp. 68–69, ACM Press and IEEE Computer Society Press, November 2000
- [15] A. Rogers, M. C. Carlisle, J. H. Reppy, and L. J. Hendren, "Supporting Dynamic Data Structures on Distributed-Memory Machines," ACM Transactions on Programming Languages and Systems, vol. 17, pp. 233–263, March 1995.
- [16] A. A. Stepanov and M. Lee, "The Standard Template Library," Tech. Rep. X3J16/94-0095, WG21/N0482, ISO Programming Language C++ Project, May 1994.
- [17] J. Veenstra and R. Fowler, "MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors," in *Second International Workshop* on *Modeling, Analysis, and Simulation of Computer and Telecommunica*tion Systems, pp. 201–207, January 1994.
- [18] C. Foster, Content Addressable Parallel Processors. Van Nostrand Reinhold Co, New York, 1976.