# Optimizing the Performance of Multi-threaded Linear Algebra Libraries, a Task Granularity based Approach

### A Dissertation

Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy

in

Department of Electrical Engineering and Computer Science

by Shahrzad Shirzad B.Sc., Sharif University of Technology, 2006 M.Sc., K. N. Toosi University of Technology, 2009 December 2019

# Acknowledgments

# **Table of Contents**

| ACKNO  | DWLEDGMENTS                                 | ii  |
|--------|---------------------------------------------|-----|
| LIST O | F TABLES                                    | iv  |
| LIST O | F FIGURES                                   | V   |
| ABSTR  | ACT                                         | vii |
| CHAPT  | ΓER                                         |     |
| 1.     | INTRODUCTION                                |     |
|        | 1.1. Thesis Statement                       |     |
|        | 1.2. Contributions                          |     |
|        | 1.3. Document Organization                  | 2   |
| 2.     | BACKGROUND                                  | 3   |
|        | 2.1. Asynchronous Many Task Runtime Systems |     |
|        | 2.2. Blaze                                  | 4   |
|        | 2.3. Task Granularity                       | 4   |
|        | 2.4. Universal Scalibility Law              | 6   |
|        | 2.5. Loop Scheduling                        | 8   |
| 3.     | LITERATURE REVIEW                           | 11  |
|        | 3.1. Literature Review                      | 11  |
| 4.     | METHOD                                      | 15  |
|        | 4.1. Parallelization in Blaze               |     |
|        | 4.2. Experiments                            | 16  |
|        | 4.3. Method                                 | 23  |
| 5.     | RESULTS                                     | 33  |
| 6.     | FUTURE PLANS                                | 35  |
|        | 6.1. What is the role of matrix size?       | 35  |
| REFER  | ENCES                                       | 37  |

# APPENDIX

# **List of Tables**

| 4.1. | List of some of the thresholds applied to the operations performed          |    |
|------|-----------------------------------------------------------------------------|----|
|      | by Blaze, starting from which the operation is executed in parallel         | 15 |
| 4.2. | List of different values used for each variable for running the DMATDMATADD |    |
|      | benchmark                                                                   | 19 |
| 5.1. | Specifications of the Marvin node from Rostam cluster at CCT                | 34 |
| 5.2. | Cache specifications of the Marvin node from Rostam cluster at CCT          | 34 |
| 5.3. | Specifications of the libraries used to run our experiments                 | 34 |

# **List of Figures**

| 2.1.           | The effect of task size on execution time for Stencil application [1]                                                                 | 5  |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------|----|
| 2.2.           | USL compared to the ideal linear speedup where $\sigma = 0.04$ and $\kappa = 0.005$                                                   | 7  |
| 4.1.           | The results obtained from running $DMATDMATADD$ benchmark                                                                             | 1  |
| T.1.           | through Blazemark for matrix size 690×690 on different number of cores                                                                | 19 |
| 4.2.           | The results obtained from running <i>DMATDMATADD</i> benchmark                                                                        | 13 |
| 1.2.           | through Blazemark for matrix of size 690×690 from two different angles                                                                | 20 |
| 4.3.           | The results obtained from running <i>DMATDMATADD</i> benchmark                                                                        |    |
| 1.01           | through Blazemark for matrix sizes from 200×200 to 1587×1587                                                                          | 21 |
| 4.4.           | The results obtained from running <i>DMATDMATADD</i> benchmark                                                                        |    |
|                | through Blazemark for matrix sizes from 690×690 with different com-                                                                   |    |
|                | binations of block size and chunk size on 4 cores                                                                                     | 22 |
| 4.5.           | The results obtained from running <i>DMATDMATADD</i> benchmark                                                                        |    |
|                | through Blazemark for matrix size 690×690 on 4 cores                                                                                  | 23 |
| 4.6.           | Throughput vs. grain size graph obtained from running <i>DMATDMATADD</i>                                                              |    |
|                | benchmark on 4 cores for matrix sizes (a) smaller than 793×793 and                                                                    |    |
|                | (b) larger than 793×793                                                                                                               | 24 |
|                |                                                                                                                                       |    |
| 4.7.           | The results of fitting the throughput vs grain size data into a 2d poly-                                                              |    |
|                | nomial for <i>DMATDMATADD</i> benchmark for matrix size 690×690                                                                       |    |
|                | with different number of cores on the test data set (a) 1 core, (b) 2                                                                 |    |
|                | cores, (c) 3 cores, (d) 4 cores, (e) 5 cores, (f) 6 cores, (g) 7 cores, (h) 8 cores                                                   | 25 |
| 4.8.           | The training and test error for fitting data obtained from the <i>DMATDMATADD</i>                                                     | 00 |
| 4.0            | benchmark for matrix size 690 × 690 against different number of cores cores                                                           | 26 |
| 4.9.           | Fitting the parameters of the quadratic function with a 3rd degree                                                                    |    |
|                | polynomial from the <i>DMATDMATADD</i> benchmark for matrix size                                                                      | 27 |
| 4 10           | 690 × 690 against different number of cores.                                                                                          |    |
| 4.10.<br>4.11. | The error in fitting the parameters $a$ , $b$ , and $c$ for matrix size $690 \times 690$                                              | 21 |
| 4.11.          | Results of fitting the data from <i>DMATDMATADD</i> benchmark with a polynomial of degree 2 in terms of grain size and of degree 3 in |    |
|                | terms of number of cores for matrix size 690 × 690 for (a) 2 core, (b) 4                                                              |    |
|                | cores, (c) 8 cores                                                                                                                    | 28 |
| 4.12.          | The training and test error obtained fitting the data to a polynomial                                                                 | 20 |
| 7.12.          | of degree 2 in terms of grain size and of degree 3 in terms of number                                                                 |    |
|                | of cores for matrix size $690 \times 690$ , for each number of cores. (a) All                                                         |    |
|                | the data points are include in caluculation of error, (b) the leftmost                                                                |    |
|                | sample was removed from error calculation                                                                                             | 28 |
| 4.13.          | The range of grain size (shown as the red line) that leads to a perfor-                                                               |    |
| 11101          | mance within 10% of the maximum performance for (a) 2 cores, (b)                                                                      |    |
|                | 4 cores and (b) 8 cores                                                                                                               | 29 |
| 4.14.          | The range of grain size within 10% of the maximum performance                                                                         |    |
|                | of the fitted polynomial function for <i>DMATDMATADD</i> benchmark                                                                    |    |
|                | and matrix size 690 × 690 against different number of cores                                                                           | 30 |

| 31 |
|----|
|    |
| 33 |
|    |
| 35 |
|    |
|    |
| 36 |
|    |

#### Abstract

Linear algebra libraries play a very important role in achieving a high performance in HPC applications. In this work we propose to tune the linear algebra library to a set of compile-time and runtime characteristics such as, the machine architecture, the expression being computed, the number of cores to run the application on, and also the size of arrays.

For this purpose we decided to use Blaze C++ library, a high performance math library, and HPX, a C++ standard library for concurrency and parallelism, as our runtime system. We propose that instead of dividing the work equally among the cores and assigning one chunk to each core, we should be able to achieve a higher performance by selecting the right amount of work to be assigned to each core, based on a set of runtime and compile-time characteristics.

Having collected a significant amount of data, we realized that the grain size, the amount of work assigned to each task, is the key factor for the performance for a fixed number of cores. Knowing that, we tried two different approaches to model the relationship between performance and grain size, in order to find a range of grain size that could lead us to the maximum performance.

The primary results suggest that, using the data collected for matrix-matrix addition, we are able to improve the performance for matrix-matrix addition, by finding the right range of grain size.

In the next step, this problem should be generalized to different matrix sizes, architectures, and expressions.

# **Chapter 1. Introduction**

The current microprocessors are able to deliver a peak performance in range of hundreds of Mflops to Gflops. But in order to achieve these performances a lot of effort needs to be made to optimize your program based on the architecture it would be run on[2].

The core element of any high performance computing library is the linear algebra library. linear algebra libraries like ATLAS, SPIRAL,... try to use hardware-specific optimizations to improve their performance. In this work, we are trying to optimize the performance based on the application parameters such as matrix size, operation, and data layout.

#### 1.1. Thesis Statement

The main objective of this thesis is to propose a hybrid runtime and compile-time solution for a linear algebra library to fully take advantage of the available parallelism and resources. We chose Blaze math library since it is a nice high performance template-based C++ library that allows you to access the expression tree for each assignment at compile time, and we chose HPX as as asynchronous may-task runtime system to manage the parallelism. HPX makes it possible to create thousands to millions of lightweight user threads, to avoid expensive context switching. Through analyzing and modeling the relationship between throughput and grain size (the amount of work assigned to as task), we would be able to identify a range of grain size that leads us to maximum performance. Once decided how big one unit of work should be, based on the identified range we would be able to decide on how many units of work should be packed into one task.

#### 1.2. Contributions

There has been a wide study, mostly by Gunther[3, 4, 5, 6], on different models to represent the relationship between the throughput and the number of cores, for a fixed size problem. Grubel et.al[1] has studied the effect of task granularity on the performance with a fixed

number of cores. Our contributions could be summarized into:

- We propose a novel physical model to represent how the execution time is expected to change based on grain size.
- To our knowledge, there has not been a work to create a 3D model of the throughput, grain size, and number of cores.

## 1.3. Document Organization

In Chapter 2, we will explain briefly the background needed for this thesis, including Blaze and HPX library, the effect of task granularity, and the Universal Scaling Law(USL) method for modeling the throughput based on number of cores. Chapter 3 refers to other works that have been done in our area of our focus. We explain our proposed method to optimize the performance, along with the models we used in Chapter 4. Our work heavily relies on the collected data, the environment we collected the data from and also the library versions are mentioned in Chapter 5. Finally we discuss our concerns and the further steps that needs to be taken in Chapter 6.

# Chapter 2. Background

## 2.1. Asynchronous Many Task Runtime Systems

Some of well known AMTs include: HPX[7], Charm++[8], Cilk[9], AM++[10], Legion[11].

#### 2.1.1. HPX

HPX[7] is a C++ runtime system for parallel and distributed applications based on ParalleX execution model[12]. HPX contains 5 main modules: Performance Monitoring System, Local Control Objects(LCOs), Thread Scheduling System, Parcel Transport Layer, Active Global Address Space (AGAS). HPX provides you with lightweight user-level threads with fast context switching, and allows you to hide the latency with using millions of these light-weight threads[13].

#### 2.1.1.1. Execution Model

The "SLOW" model identifies four potential sources for performance degradation as: Starvation, Latency, Overheads, and Waiting or contention[12].

Starvation refers to a situation where there is insufficient amount work for the computing resource. this could be due to insufficient total amount of work available, or unbalanced distribution of work among resources[13].

Latency is the time distance, usually measured in processor clock, of accessing remote data or services[14].

Overhead refers to the effort that needs to be taken to manage parallel resources and actions on the critical path[13].

Finally, waiting is the contention of the shared physical and logical resources causing one request to be blocked by another access of the same resource[14]. This could happen due to limited network bandwidth, shared communication channels, memory bank conflicts[13],[14].

#### **2.2.** Blaze

Blaze Math Library[15] is a C++ library for linear algebra. Blaze, based upon Expression Templates(ETs)[16], introduces "smart" expression templates(SETs)[15] to optimize the performance for array-based operations. Expression Templates[16] is an abstraction technique that uses overloaded operators in C++ to prevent creation of unnecessary temporaries, while evaluating arithmetic expressions, in order to improve the performance[15]. The ET-based approaches create a parse tree of the expression at compile time and postpone the actual evaluation to when the expression is assigned to a target.

Although being able to achieve promising performances for element-wise operations, these methods are not suitable for high performance computing for the following reasons. Due to their abstraction from both the data type and also the operation itself, they do not allow optimizations specific to the type of the arrays, alongside the operation[15]. As a solution, Blaze proposes smart ETs with these three main additions: integration with architecture-specific highly optimized compute kernels, creation of intermediate temporaries when needed, and selecting optimal evaluation method automatically for compound expressions[15].

Some of the ET-based linear algebra libraries are: Blitz++[17], Boost uBLAS[18], MTL[19], and Eigen[20]. Among these libraries, Eigen, MTL, alongside Blaze, impose different conceptual changes to ETs in order to make them suitable for HPC.

#### 2.3. Task Granularity

Defining the grain size as the amount of work assigned to one HPX thread, Grubel[1] studies the effect of grain size on the execution time for a fixed number of cores. The results show that, for small grain sizes the overhead of creating the tasks, and for large grain sizes the starvation, is the dominant factor affecting the execution time[1]. When grain size is small, to perform same amount of work, higher number of tasks is created, and there is an overhead associated with creation of each task. Although this overhead is very small (order of microsec-

onds), when the amount of work performed by each thread is also small, this overhead becomes significant. A the grain size increases, these overheads are amortized by the time it takes to execute the task.

On the other hand, when grain size is increases, the number of tasks being created decreases, up until a point where the number of tasks being created is smaller than the number of cores. At this point another factor would interfere with the performance, which is referred to as starvation. Starvation happens where a large amount of work is assigned to some of the cores while the other cores are idling. At this point we are not using our resources efficiently.

While overheads of creating tasks degrades the performance for small grain sizes and starvation causes the execution time to increase for large grain sizes, there is a region in between where changing the grain size does not affect the performance.



Figure 2.1. The effect of task size on execution time for Stencil application [1]

#### 2.4. Universal Scalibility Law

Amdahl's law[21], states that the amount of achievable speed up by adding more processors when running a parallel application, is restricted by the amount of code that could actually be parallelized. Equation 2.1, shows the relationship between speedup and number of processors, where  $\sigma$  is the serial fraction of the execution time, based on Amdahl's law[5].

$$S(p) = \frac{p}{1 + \sigma(p-1)} \tag{2.1}$$

On the other hand, Gunther[5] extends Amdahl's law by incorporating the effect of three factors, namely concurrency, contention, and coherency, as shown in Equation 2.2.

$$S(p) = \frac{p}{1 + \sigma(p-1) + \kappa p(p-1)}$$
 (2.2)

Concurrency(p) represents the linear speedup that could have been achieved if no interaction existed among the processors, contention( $\sigma$ ) represents the serialization effect of shared writable data, and finally coherency or data consistency( $\kappa$ ) represents the effort that needs to be made for keeping shared writable data consistent[5].

Figure 2.2 shows an example of the ideal linear speedup we expect to see when increasing the number of the processors, against the actual achievable speedup based on Amdahl's law and USL.

Equation 2.3 generalizes Equation 2.2 to represent the throughput by adding another parameter( $\gamma$ ) to represent the serial throughput.

$$X(p) = \frac{\gamma p}{1 + \sigma(p-1) + \kappa p(p-1)}$$
(2.3)



Figure 2.2. An example of the achievable speedup based on Amdahl's law and USL compared to the ideal linear speedup where  $\sigma = 0.04$  and  $\kappa = 0.005$ .

Universal scalibility law also suggests that for some values of  $\sigma$  and  $\kappa$  there could be a certain number of processors that yield to maximum performance[5]. Increasing the number of processor beyond that point would only cause performance degradation.

#### 2.4.1. Other Models

There are a few other models that have also been suggested to simulate the scalibility. Geometric model is a one-parameter model, in which speedup has the following relationship with the number of processors:

$$S(p) = \frac{1 - \phi^p}{1 - \phi} \tag{2.4}$$

The parameter  $\phi$ , where  $0 < \phi \le 1$ , is called the MP factor, and it represents the remaining section of the processor capacity after deducting overheads. The geometric model is non-physical for large number of processors, due to inconsistency with Coxian queuing model[4].

Quadratic model[3], with overhead parameter  $\gamma$ , where  $0 \le \gamma < 1$ , is represented in Equationquad.

$$S(p) = p - \gamma p(p-1) \tag{2.5}$$

The quadratic model has a critical point at:

$$p^* = \lfloor \frac{1+\gamma}{2\gamma} \rfloor \tag{2.6}$$

The problem with this model is that its not physical. This model represents an inverted parabola that will intersect the x axis at two points, implying that there with be a certain number of processors starting from which the speedup would be negative.

Exponential model, is also a single parameter model  $\alpha$  where  $0 < \alpha \le 1$ . This parameter is a combination of coherency and contention.

$$S(p) = p(1 - \alpha)^{(p-1)}$$
(2.7)

This model also has a critical point, but this point is very sensitive to  $\alpha$ . Although this model works very well for small number of processors, it imposes a severe capacity degradation for large number of processors.

#### 2.5. Loop Scheduling

Loop scheduling refers to different ways iterations could be assigned to the processors and the order of their execution. The main reason for performance degradation in loop scheduling is load imbalance, which refers to situations where different amount of work is assigned to different processors[22].

The simplest loop scheduling method is static scheduling, in which, the iterations are divided evenly among all the processors statically, either as a consecutive block -also called cyclic- or in a round-robin manner[23]. Since all the assignments happen at compile time or before execution of the application, this method imposes no runtime scheduling overhead. Several factors including interprocessor communication, cache misses, and page faults can lead to different execution times for different iterations, leading to load imbalance among the processors[24].

In the meanwhile, dynamic scheduling methods postpone the assignment to runtime, which tends to improve load balancing, at the cost of higher scheduling overhead. Some of dynamic scheduling methods include: Pure Self-scheduling, Chunk Self-scheduling, Guided Self-scheduling[25], Factoring[26] and Trapazoid Self-scheduling[27],[23]. We briefly go over some of these loop scheduling techniques here.

In Pure Self-scheduling every time a processors becomes idle, it fetches one loop iteration. This approach, while achieving a high load balance, imposes a considerable amount of scheduling overhead when we are dealing with a fine-grain workload, and a large number of iterations. Also frequent access to shared variables like loop index could lead to memory contention[23].

In order to decrease the high scheduling overhead of Pure Self-scheduling methods, Chunk Self-scheduling method assigns a certain number of iterations(called chunk size) to each idle processor. This method trades lower scheduling overhead with higher load imbalance. Selection of the chunk size plays a very important role in the performance, as so a large chunk size increases the scheduling overhead decreases and causes load imbalance, while a small chunk size increases memory contention and scheduling overhead[23].

As an adaptive loop scheduling technique, Guided Self-scheduling[25] divides the remaining number of iterations at each request evenly among the processors, and assigns it to the processor that made the request, while updating the number of remaining iterations. This causes larger number of iterations to be assigned to the processors at the beginning of the loop execution, which results in lower scheduling overhead. The number of iterations assigned to each processor decreases as it approaches to the end of the execution, generating tasks containing only one or two iterations, causing an increase in the scheduling overhead. In order to tackle this issue, a minimum number of chunks could be set to avoid creation of very small chunks[28].

Very similar to Guided Self-scheduling, Factoring[26] also decreases the chunk size as the loop execution proceeds, with this difference that it does it in batches of equal sized chunks.

If the first iterations of the loop are more time consuming then the rest of the iterations, Factoring performs better than Guided scheduling[29].

Along with the mentioned loop scheduling approaches, two other scheduling techniques can be utilized for load balancing. Work stealing[30] lets the processors to steal work from other processors queue, resulting in a more balanced load distribution. In work sharing on the other hand, each time a processor creates new threads, the scheduler would try to migrate some of them to other processors for a more balanced lad distribution[30].

But each of these methods work well for specific problem. We are looking for a general solution which can automatically decide on the chunk size parameter to achieve the best performance.

# Chapter 3. Literature Review

#### 3.1. Literature Review

Loop scheduling techniques has been extensively studied by different researchers. In [31] the authors propose a hybrid static/dynamic method for loop scheduling that improves the performance of dense matrix factorization, compared to both fully static and fully dynamic scheduling. The authors of [31], divide the dependency graph into two subgraphs, one of which is scheduled dynamically and the other one is scheduled statically. The tasks on the critical path are scheduled statically and each thread is forced to prioritize the static tasks[31]. They were able to improve data locality and scheduling overhead, while creating a more balanced workload.

The previous work on predicting the performance of a parallel application mainly focuses on three major types of models: analytical, trace-based, and empirical models[32].

The analytical models[33],[34],[35], while providing an arithmetic formula to represent the execution time of an application, require a deep understanding of the application, to apply platform-specific optimizations, and can not be generalized to different domains and architectures[36],[37],[38]. Traced-based models, on the other hand, use the traces collected through instrumentation, to predict the performance. These models, opposed to analytical models, do not rely on an expert's knowledge of the application, but while adding some overhead to the runtime, these models require a large storage space to save the traces, and are hard to interpret[37]. In empirical modeling, the results obtained from running an application with a set of parameters on a specific set of machines to build a model for unknown set of application and system parameters[32]. This type of modeling includes machine learning based approaches.

In [39], the authors use neural networks to predict the performance focusing on SMG2000 application, a parallel multigrid solver for linear systems [40], on two different platforms. Defining application parameters  $N_x$ ,  $N_y$ ,  $N_z$ , representing the working set size per processor, and  $P_x$ ,  $P_y$ ,  $P_z$ , describing the three-dimension processor topology, as the features, [39] uses a

fully connected neural network to learn the model. Since they use absolute mean square error as the loss function, they use stratification to replicate samples with lower values by a factor which is proportional to their target value. They also apply bagging technique to decrease the variance in the model. As they increase the size of the training set to 5K points, they reach an error rate of 4.9%.

As a trace-based model, [37] analyzes the abstract syntax tree of the code and collects data through inserting special code for instrumentation when encounters 4 different situations, namely, assignments, branches, loops, and MPI communications. The authors then use 5 different machine learning methods including random forests, support vector machine, and ridge regression to build a prediction model from the collected data. Through applying two filtration processes, they were able to decrease the amount of overhead introduced along with the storage space requirement. Their results were inclined towards random forest, mainly because of the lower impact of categorical features on it, which is helpful in general cases where we do not have any knowledge about the type of features[37].

In [32] the authors investigate a set of machine learning techniques, including deep neural networks, support vector machine, decision tree, random forest, and k-nearest neighbor to predict the execution time of 4 different applications. Each of these applications require a certain set of features as input, for example, for the miniMD application in molecular dynamics, the number of processes and the number of atoms were considered as the input features, while for miniAMR, an application for studying adaptive mesh refinement, number of processes and also block sizes in x, y, and z direction, where used as the input features. While achieving promising results especially for deep neural networks, bagging, and boosting methods, [32] suggest utilizing transfer learning through deep neural networks to predict performance on other platforms.

Although concentrating on GPUs, [41] proposes a lightweight machine learning based performance model to choose the number of threads to use for parallelization for a specific data size and operation. With the final goal of improving the training time in a neural network,

[41] selects 4 performance features collected by hardware counters namely, number of CPU cycles, number of cache misses, cache accesses for the last cache level, and number of level 1 cache hits. Then they take two different approaches to build their model. In the first on they try 10 different regression models including random forest, and in the second one they use hill climbing algorithm to choose the number of threads. In addition to hardware independent, and not requiring the training process, hill climbing algorithm achieves a much higher accuracy compared to the best performing regression model.

In this paper, we suggest using machine learning to directly predict the optimal chunk size to achieve the best performance instead of predicting the execution time or the optimal number of cores to run the application on. For this purpose, we have offered a set of general features that are not specific to an application and could easily be extracted at compile time or at run time. Once the data has been collected and our model has been created, the prediction results could be easily applied to a new application with a negligible overhead.

As another field to use machine learning, [42] collects seven runtime events and uses machine learning not to predict the performance, but to schedule the tasks. These events include, task creation, suspension, execution, completion, implicit/explicit barrier, parallel region, and finally loop/master/single region runtime events, collected through the OMPT using ORA API. Experimenting with four different machine learning techniques, including support vector machine, random forest, neural networks, and naive bayes, they would select one specific task pool configuration out of the three pre-defined options as the final classification result. Testing this framework on a real life molecular dynamics application, they observed an up to 31% improvement in performance.

The authors of [43] propose using machine learning to predict the optimal number of threads, and also the optimal scheduling policy for running an OpenMP application. Through that, they were able to develop an automatic compiler-based method to map a parallel application to a multicore processor. They collect three type of features namely, code, data, and runtime features. Code features are extracted from the code directly, and they include cycles

per instruction, number of branches, load and store instructions, and computations per instruction. While the code features could be collected statically at compile time, the data and run-time features are collected through low-cost profiling runs. This group of features include loop iteration count, branch miss rate, and L1 data cache miss rate. The authors of [43] then use an artificial neural network to predict the speedup achieved for a program with certain number of threads, and at the same time they use a support vector machine model to predict the best scheduling policy, out of block, cyclic, dynamic, and guided scheduling policies, for an unseen program.

# Chapter 4. Method

#### 4.1. Parallelization in Blaze

Depending on the operation and the size of operands, this assignment could be parallelized through four different backends, namely, HPX, OpenMP[44], C++ threads, and Boost[45]. Table 4.1 shows the default value for some of the threshold for parallelization applied to operations performed in Blaze. It should be noted that these thresholds should be tuned based on the parallelization backend and also the system architecture.

| Benchmark                 | Array size                                      |  |
|---------------------------|-------------------------------------------------|--|
| DVECDVCEADD, DVECDVECMULT | 38000                                           |  |
| DMATDMATADD               | 36100 elements equivalent to a 175 × 175 matrix |  |
| DMATDMATMULT              | 3025 elements equivalent to a 55 × 55 matrix    |  |

Table 4.1. List of some of the thresholds applied to the operations performed by Blaze, starting from which the operation is executed in parallel

### 4.1.1. Implementation of HPX Backend

As stated earlier, as an ET-based library, blaze performs the calculations when an expression is assigned to a target, which is implemented through the *blaze::Assign* function.

The four mentioned backends, parallelize this assignment process through a parallel forloop, in which at each iteration a specific section of each of the vectors or matrices(called a block) is selected and assigned to a core. Each core then performs the operation on the block they have been assigned to.

Each backend uses their own method for parallelizing this for loop. For HPX backend, current implementation uses a HPX *parallel::for\_loop* with static chunking policy and chunk size of 1. This way, knowing the number of cores to run the application on, we can divide the

original matrix equally among the cores, while the order of assignment of blocks to the cores is known at compile time. Listings4.1 shows the current implementation of the HPX backend in Blaze.

What we suggest here is that, some prior knowledge for example, architecture of the system we are running the application on, the expression that has to be executed, number of cores of the system, size and type of the arrays we are dealing with, and etc. should be able to help us to achieve a higher performance. For this purpose we introduced two parameters block\_size and chunk\_size.

### **4.1.2.** HPX *for\_loop*

HPX *for\_loop* takes an execution policy as first argument, which is set to *dynamic\_chunk\_size* execution policy in case of HPX backend for Blaze.

### 4.2. Experiments

In order to capture the relationship between number of cores, *chunk\_size*, *block\_size*, and the performance, we ran a series of experiments with different of these parameters and measured the number of floating point operations per second performed.

For these experiments ,at the first step we selected the DMatDMatADD benchmark which was implemented in Blazemark. DMatDMatADD benchmark is a level 3 BLAS function to perform matrix-matrix addition in the form of A = B + C, where A, B, C are square matrices of the same size. For simplification we are only studying raw-major matrices at this point. Our final goal is to extend the work to cover arbitrary data layouts for arrays.

To avoid adding the scheduling overhead for small matrix sizes, Blaze uses a threshold to start parallelization, which is specific to the type of operation. For matrix-matrix addition, if the number of elements in the matrix is greater than 36100 elements (which is equivalent to a square matrix of size  $190 \times 190$ ) Blaze uses the configured backend to parallelize the assignment operation. For this reason, we start our experiments with matrix size of  $200 \times 200$  and

Listing 4.1: Previous implementation of Assign function for HPX backend in Blaze.

```
// Type of the left-hand side dense matrix
1 template< typename MT1
2 , bool SO1
                    // Storage order of the left-hand side dense matrix
3 , typename MT2
                    // Type of the right-hand side dense matrix
4 , bool SO2
                    // Storage order of the right-hand side dense matrix
5 , typename OP > // Type of the assignment operation
6 void hpxAssign( DenseMatrix<MT1,SO1>& lhs, const DenseMatrix<MT2,SO2>& rhs, OP op )
7 {
8 using hpx::parallel::for_loop;
9 using hpx::parallel::execution::par;
11 BLAZE_FUNCTION_TRACE;
12
13 using ET1 = ElementType_t<MT1>;
14 using ET2 = ElementType_t<MT2>;
16 constexpr bool simdEnabled (MT1::simdEnabled && MT2::simdEnabled && IsSIMDCombinable_v<ET1,ET2> );
17 constexpr size_t SIMDSIZE( SIMDTrait< ElementType_t<MTl> >::size );
19 const bool lhsAligned( (~lhs).isAligned() );
20 const bool rhsAligned( (~rhs).isAligned() );
22 const size_t threads
                            ( getNumThreads() );
23 const ThreadMapping threadmap( createThreadMapping( threads, ~rhs ) );
24
                            ( ( (~rhs).rows() % threadmap.first ) != OUL )? 1UL : OUL );
25 const size t addon1
26 const size_t equalSharel( (~rhs).rows() / threadmap.first + addonl );
27 const size t rest1
                         ( equalShare1 & ( SIMDSIZE - 1UL ) );
28 const size_t rowsPerThread( ( simdEnabled && rest1 )?( equalShare1 - rest1 + SIMDSIZE ):( equalShare1 ) );
                           ( ( (~rhs).columns() % threadmap.second ) != 0UL )? 1UL : 0UL );
30 const size t addon2
31 const size_t equalShare2( (~rhs).columns() / threadmap.second + addon2 );
                        ( equalShare2 & ( SIMDSIZE - 1UL ) );
32 const size t rest2
{\tt 33} \ \ const \ \ size\_t \ \ colsPerThread(\ (\ simdEnabled \ \&\& \ rest2\ )\ ?(\ \ equalShare2-rest2+SIMDSIZE\ )\ :(\ \ equalShare2\ )\ );
35 for_loop( par, size_t(0), threads, [&](int i)
36 {
37 const size_t row ( ( i / threadmap.second ) * rowsPerThread );
38 const size_t column( ( i % threadmap.second ) * colsPerThread );
40 if ( row >= (\sim rhs).rows() || column >= (\sim rhs).columns())
41 return;
43 const size_t m( min( rowsPerThread, (~rhs).rows()
                                                          - row
44 const size_t n( min( colsPerThread, (~rhs).columns() - column ) );
46 if ( simdEnabled && lhsAligned && rhsAligned ) {
47 auto target( submatrix<aligned>( ~lhs, row, column, m, n ) );
48 const auto source( submatrix<aligned>( ~rhs, row, column, m, n ) );
49 op( target, source );
50 }
51 else if( simdEnabled && lhsAligned ) {
             target( submatrix<aligned>( ~lhs, row, column, m, n ) );
53 const auto source( submatrix<unaligned>( ~rhs, row, column, m, n ) );
54 op( target, source );
55 }
56 else if( simdEnabled && rhsAligned ) {
             target ( submatrix < unaligned > ( ~lhs, row, column, m, n ) );
58 const auto source( submatrix<aligned>( ~rhs, row, column, m, n ) );
59 op( target, source );
60 }
61 else {
62 auto
              target ( submatrix < unaligned > ( ~lhs, row, column, m, n ) );
63 const auto source ( submatrix < unaligned > ( ~rhs, row, column, m, n ) );
64 op( target, source );
65 }
66 } );
67 }
```

Listing 4.2: New implementation of Assign function for HPX backend in Blaze.

```
1 template< typename MT1
                          // Type of the left-hand side dense matrix
2 , bool SO1
                   // Storage order of the left-hand side dense matrix
                   // Type of the right-hand side dense matrix
3 , typename MT2
4 , bool SO2
                   // Storage order of the right-hand side dense matrix
5 , typename OP > // Type of the assignment operation
6 void hpxAssign( DenseMatrix<MT1,SOl>& lhs, const DenseMatrix<MT2,SO2>& rhs, OP op )
8 using hpx::parallel::for_loop;
9 using hpx::parallel::execution::par;
11 BLAZE_FUNCTION_TRACE;
12
using ET1 = ElementType_t<MT1>;
14 using ET2 = ElementType_t<MT2>;
16 constexpr bool simdEnabled( MT1::simdEnabled && MT2::simdEnabled && IsSIMDCombinable_v<ET1,ET2> );
17 constexpr size_t SIMDSIZE( SIMDTrait< ElementType_t<MTl> >::size );
19 const bool lhsAligned((~lhs).isAligned());
20 const bool rhsAligned( (~rhs).isAligned() );
22 const size_t threads
                          ( getNumThreads() );
23 const size_t numRows ( min( static_cast<std::size_t>( BLAZE_HPX_MATRIX_BLOCK_SIZE_ROW ), (~rhs).rows() ) );
24 const size_t numCols ( min( static_cast<std::size_t>( BLAZE_HPX_MATRIX_BLOCK_SIZE_COLUMN ), (~rhs).columns()
       ));
25
                          ( numRows & ( SIMDSIZE - 1UL ) );
26 const size t rest1
27 const size_t rowsPerIter( ( simdEnabled && rest1 )?( numRows - rest1 + SIMDSIZE ):( numRows ) );
28 const size t addon1
                         ( ( (~rhs).rows() % rowsPerIter ) != 0UL )? 1UL : 0UL );
29 const size_t equalSharel( (~rhs).rows() / rowsPerIter + addonl );
                          ( numCols & ( SIMDSIZE - 1UL ) );
31 const size t rest2
32 const size_t colsPerIter( ( simdEnabled && rest2 )?( numCols - rest2 + SIMDSIZE ):( numCols ) );
                        ( ( (~rhs).columns() % colsPerIter ) != OUL )? 1UL : OUL );
33 const size t addon2
34 const size_t equalShare2( (~rhs).columns() / colsPerIter + addon2 );
36 hpx::parallel::execution::dynamic_chunk_size chunkSize (BLAZE_HPX_MATRIX_CHUNK_SIZE);
38 for_loop( par.with( chunkSize ), size_t(0), equalShare1 * equalShare2, [&](int i)
39 {
40 const size_t row ( ( i / equalShare2 ) * rowsPerIter );
41 const size_t column( ( i % equalShare2 ) * colsPerIter );
43 if (row >= (\sim rhs).rows() || column >= (\sim rhs).columns())
46 const size_t m( min( rowsPerIter, (~rhs).rows()
                                                     - row
47 const size_t n( min( colsPerIter, (~rhs).columns() - column ) );
49 if ( simdEnabled && lhsAligned && rhsAligned ) {
             target( submatrix<aligned>( ~lhs, row, column, m, n ) );
51 const auto source( submatrix<aligned>( ~rhs, row, column, m, n ) );
52 op( target, source );
53 }
54 else if ( simdEnabled && lhsAligned ) {
             target( submatrix<aligned>( ~lhs, row, column, m, n ) );
56 const auto source( submatrix<unaligned>( ~rhs, row, column, m, n ) );
57 op( target, source );
58 }
59 else if( simdEnabled && rhsAligned ) {
             target( submatrix<unaligned>( ~lhs, row, column, m, n ) );
61 const auto source( submatrix<aligned>( ~rhs, row, column, m, n ) );
62 op( target, source );
63 }
64 else {
             target( submatrix<unaligned>( ~lhs, row, column, m, n ) );
66 const auto source ( submatrix < unaligned > ( ~rhs, row, column, m, n ) );
67 op( target, source );
68 }
69 } );
70 }
```

gradually increase the size to  $1587 \times 1587$ . Table 4.2 show the matrix sizes and the number of cores chosen for our experiments with DMATDMATADD benchmark.

| Matrix sizes                   | 200, 230, 264, 300, 396, 455, 523, 600, 690, 793, 912, 1048, 1200, 1380, 1587 |  |  |
|--------------------------------|-------------------------------------------------------------------------------|--|--|
| Number of cores                | 1, 2, 3, 4, 5, 6, 7, 8                                                        |  |  |
| Number of rows in the block    | 4, 8, 12, 16, 20, 32                                                          |  |  |
| Number of columns in the block | 64, 128, 256, 512, 1024                                                       |  |  |
| Chunk size                     | Between 1 and total number of blocks (logarithmic increase)                   |  |  |

Table 4.2. List of different values used for each variable for running the *DMATDMATADD* benchmark

Figure 4.2 shows the results of running DMatDMatADD benchmark for matrix sizes and number of cores listed in Tbale 4.2 based on grain size.

On the other hand, Figure 4.3 integrates the results obtained from running the same benchmark with different matrix sizes. Each color in this graph represents a specific matrix size.



Figure 4.1. The results obtained from running DMATDMATADD benchmark through Blazemark for matrix size  $690 \times 690$  on different number of cores.





Figure 4.2. The results obtained from running DMATDMATADD benchmark through Blazemark for matrix of size 690×690 from two different angles



Figure 4.3. The results obtained from running DMATDMATADD benchmark through Blazemark for matrix sizes from  $200 \times 200$  to  $1587 \times 1587$ 

#### 4.2.1. Observation

The final purpose of our experiments is to find a chunk size that gives us the best performance for a given matrix size on a given machine. This chunk size should also be tailored to the expression being executed, and this all is based on assuming that we have already fixed the block size. So the first step appeared to be selecting the block size. For this purpose, we ran the experiments with a selection of block sizes as shown in Table 4.2.

It should be mentioned that there were three constraints on selecting the block sizes. First, Blaze forces the number of columns in a raw-major matrix to be divisible to SIMD register size in order to be able to take advantage of vectorization. Second, we have selected the number of columns in our blocks to be either divisible by cache line or to contain all the columns of the matrix.



Figure 4.4. The results obtained from running DMATDMATADD benchmark through Blazemark for matrix sizes from  $690\times690$  with different combinations of block size and chunk size on 4 cores

The collected data, as seen in Figure 4.4, suggests two main points:

- For each selected block size, there is a range of chunk sizes that gives us the best performance.
- Except for some uncommon cases, no matter which block size we choose, we are able to achieve the maximum performance if we select the right chunk size.

This motivated us to move our search parameter from chunk size to grain size. As stated earlier, grain size is the amount of work assigned to one HPX thread. Here we represent grain size by number of floating point operations performed by a HPX thread. For example, performing addition among two matrices, if we choose the block size as  $4 \times 64$  and chunk size as 3, the grain size would be  $3 \times 4 \times 64 = 768$ . Note that in our experiments whenever the number of columns of the original matrix is not divisible to the selected number of columns for block size, there would be a set of blocks with less number of elements than the selected block size, this has been considered when calculating the grain size.

By changing our focus to the grain size instead of the block size and the chunk size, Figure 4.5 shows how the throughput changes with regards to the grain size for the *DMATDMATADD* 

benchmark, for each specific block size. Each combination of block size and chunk size generates a point in the graph. On the other hand, Figure 4.1 looks at these graphs from another aspect, keeping the problem size constant but changing the number of the cores to run the benchmark on, instead.



Figure 4.5. The results obtained from running DMATDMATADD benchmark through Blazemark for matrix size  $690 \times 690$  on 4 cores.

#### 4.3. Method

Looking at the throughput vs. grain size graphs and the consistent pattern observable motivated us to try to model the relationship between throughput and grain size. In order to simplify the process and eliminate the effect of different possible factors, we started with limiting the problem to a fixed matrix size.

### 4.3.1. Polynomial Fit

In our first attempt we used a 2nd degree polynomial to model throughput against grain size. For each matrix size, we fitted the corresponding graphs shown in Figure 4.6 to a second degree polynomial.



Figure 4.6. Throughput vs. grain size graph obtained from running DMATDMATADD benchmark on 4 cores for matrix sizes (a) smaller than  $793 \times 793$  and (b) larger than  $793 \times 793$ .



Figure 4.7. The results of fitting the throughput vs grain size data into a 2d polynomial for DMATDMATADD benchmark for matrix size  $690 \times 690$  with different number of cores on the test data set (a) 1 core, (b) 2 cores, (c) 3 cores, (d) 4 cores, (e) 5 cores, (f) 6 cores, (g) 7 cores, (h) 8 cores.

Figure 4.7 shows the results of using a quadratic function to fit the data for one matrix size with different number of threads.

For our experiment, we divided the data into two sections, training and test. 60% of the data was randomly chosen for the training part and the rest was considered as the test set. The training set was used to find the best 2nd degree polynomial for the data, and once the parameters were identified, the generated 2nd degree polynomial was applied to the test set to measure how good our fit was performing.

For the matrix size  $690 \times 690$  our dataset contained 117 data points, 72 of which was randomly selected to build the model. The mean relative error for each number of cores, calculated using Equation 4.1, is represented in Figure 4.8 for training and test set. In this equation,  $t_i$  and  $p_i$  denote the true value and the predicted value of the ith sample respectively, where n is the number of samples with the particular number of cores.

$$Relative Error = \frac{1}{n} \sum_{i=1}^{n} 1 - p_i / t_i$$
 (4.1)



Figure 4.8. The training and test error for fitting data obtained from the DMATDMATADD benchmark for matrix size  $690 \times 690$  against different number of cores cores.

## 4.3.1.1. Generalizing the fitted function to include number of cores

In this step, we try to generalize the fitted 2nd degree polynomial obtained from the previous step, represented by  $P = ag^2 + bg + c$ , where P is the throughput and g is the grain size, by looking at how the three parameters a, b, and c change when number of cores changes. A 3rd degree polynomial seems to a reasonable fit for each of these parameters, in regards to number of cores. In order to avoid overfitting, we excluded two of the data points (2 and 5) from the data points used for fitting the polynomial and tested the fitted function on those two points to see how well the function is working on unseen data points.



Figure 4.9. Fitting the parameters of the quadratic function with a 3rd degree polynomial from the DMATDMATADD benchmark for matrix size 690 × 690 against different number of cores.



Figure 4.10. The error in fitting the parameters a, b, and c for matrix size  $690 \times 690$ .

Using this 3rd degree polynomial to fit the parameters, we can generalize the relationship between throughput and grain size in the following equation:

$$P = a_{11}g^2N^3 + a_{10}g^2N^2 + \dots + a_1N + a_0$$
(4.2)

where P is the throughput, g is the grain size, and N is the number of cores and coefficients  $a_{11},...,a_0$  are the real values.

Knowing that a polynomial of degree 2 in terms of grain size and of degree 3 in terms of number of cores, we can try to fit our original data directly to the above mentioned formula (Equation 4.2). The results of the original data obtained from running DMATDMATADD benchmark, the fitted polynomial based on Equation 4.2, is represented in Figure 4.11, for 2,4, and 8 cores for a matrix of size  $690 \times 690$ .



Figure 4.11. Results of fitting the data from DMATDMATADD benchmark with a polynomial of degree 2 in terms of grain size and of degree 3 in terms of number of cores for matrix size  $690 \times 690$  for (a) 2 core, (b) 4 cores, (c) 8 cores.



Figure 4.12. The training and test error obtained fitting the data to a polynomial of degree 2 in terms of grain size and of degree 3 in terms of number of cores for matrix size  $690 \times 690$ , for each number of cores. (a) All the data points are include in caluculation of error, (b) the leftmost sample was removed from error calculation.

Figure 4.12a shows the obtained relative error on both training and test sets. The graph

suggests a higher test error compared to the training error, mostly caused by the left hand side of the graph. The effect of removing the leftmost sample from error calculations is depicted in Figure 4.12b.

Although we are interested in finding a model that results in a low training and test error, our purpose is mainly finding the region that generates the highest performance. So, even though our model might not match the original data in all data points, due to having a different nature than a quadratic function, our focus would be on how this fit can help us to find which range of grain sizes, or how big the task sizes should be, to achieve the highest performance.

## 4.3.1.2. Finding the range of grain size to achieve high performance

The major advantage of using a quadratic function to fit the data in terms of grain size, when number of cores is fixed, is the simplicity of the formula, which makes it possible for us to find the peak of the graph very easily. n order to add some uncertainty to our prediction, instead of finding the maximum of the quadratic function, we identified the range of grain size that results in a performance within 10% of the maximum performance. For a second degree polynomial in terms of g,  $P = ag^2 + bg + c$ , the minimum or maximum of the polynomial is located at  $p^* = \frac{-b}{2a}$ .



Figure 4.13. The range of grain size (shown as the red line) that leads to a performance within 10% of the maximum performance for (a) 2 cores, (b) 4 cores and (b) 8 cores.



Figure 4.14. The range of grain size within 10% of the maximum performance of the fitted polynomial function for DMATDMATADD benchmark and matrix size  $690 \times 690$  against different number of cores.

Figure 4.14 shows the calculated range for matrix size  $690 \times 690$  with different number of threads. In order to generalize the solution, we selected the intersection of all the ranges to find a range of grain size that we expect to work well for all the different number of cores in our experiment. This silver region in Figure 4.14 corresponds to this range.

#### 4.3.1.3. Estimating the chunk size

Once we identified a range of grain sizes that is expected to leads us to highest achievable performances for a specific matrix size, the next step is finding the possible combinations of block size and chunk size to achieve that range of grain sizes. As stated earlier in this chapter, results obtained from Figure 4.5 suggests that with a fixed grain size, our choice of block size does not affect the performance directly, as long as there exist a chunk size that when combined by the block size could result in the specified grain size.

In our experiment, we selected our block size to be  $4 \times 256$ . With this assumption, in order for the grain size to be within the specified range for each matrix size, chunk size has to be within a specific range size too.

For example, for a  $690 \times 690$  matrix we calculated the range of maximum performance for

all number of cores to be [4.069, 4.719] in logarithmic scale which is equivalent to [11711, 52368]. Setting the block size to  $4 \times 256$ , this range forces the chunk size to be within the range [13, 56]. The range of chunk sizes to match the range of grain sizes identified, and their corresponding throughput is shown in Figure 4.15, for matrix size  $690 \times 690$  and block size  $4 \times 256$ . The green line is the throughput achieved by the current implementation of HPX backend.



Figure 4.15. The range of chunk sizes to produce a grain size within 10% of the maximum performance of the fitted quadratic function for DMATDMATADD benchmark for matrix size  $690 \times 690$  with block size of  $4 \times 256$  on (a) 2 cores, (b) 4 cores, and (c) 8 cores, and block size of  $4 \times 512$  on (d) 2 cores, (e) 4 cores, and (f) 8 cores. Silver points denotes the detected range of chunk size.

#### **4.3.2.** Bathtub fit

In the previous section we studied the possibility of using a polynomial to capture the relationship between grain size, number of cores, and throughput for a fixed matrix size, with the purpose of finding a range of grain size that leads us to maximum performance. Although the polynomial function was helpful in directing us toward our objective, it does not have a physical implication.

This motivated us to change our view, and instead of looking just at the data and trying to find a function to fit the data, study the behavior of the data, and then find a function that would be likely to fit the data. That function would be a good fit mostly because that's how we expect the throughput to change with grain size, and not just how the data looks like.

## Chapter 5. Results

#### 5.0.1. Blazemark

Blazemark is a benchmark suite provided by Blaze to compare the performance of Blaze with other linear algebra libraries including Blitz++[17], Boost uBLAS[45], GMM++[46], Armadillo[47], MTL4[19], and Eigen3[48], alongside plain BLAS libraries like Atlas[49], Goto[50], and Intel MKL.[51]

```
Dense Vector/Dense Vector Addition:
  C-like implementation [MFlop/s]:
   100
               1115.44
   10000000
               206.317
  Classic operator overloading [MFlop/s]:
               415.703
   10000000
               112.557
  Blaze [MFlop/s]:
                                              N=100, steps=55116257
   100
                2602.56
                                                C-like
                                                           = 2.33322 (4.94123)
   10000000
                292.569
                                                                       (13.2586)
                                                Classic
                                                            = 6.26062
  Boost uBLAS [MFlop/s]:
                                                Blaze
                                                Boost uBLAS = 2.4628
                                                                        (5.21565)
   10000000
                                                Blitz++
                                                            = 2.57398
  Blitz++ [MFlop/s]:
                                                            = 2.33325
                                                                        (4.94129)
                                                Armadillo
                                                            = 2.3749
                                                                        (5.0295)
   10000000
                207.855
                                                MTL
                                                            = 2.55537
                                                                        (5.41168)
  GMM++ [MFlop/s]:
                                                            = 1.19742
                                                                       (2.53585)
                                                Eigen
   100
                1115.42
                                              N=100000000, steps=8
   10000000
                207.699
                                                C-like
                                                            = 1.41805
                                                                       (0.387753)
  Armadillo [MFlop/s]:
                                                                        (0.710753)
                                                Classic
                                                            = 2.5993
   100
                1095.86
                                                            = 1
                                                                        (0.27344)
                                                Blaze
   10000000
                208.658
                                                Boost uBLAS = 1.40227
                                                                       (0.383437)
 MTL [MFlop/s]:
                                                            = 1.40756
                                                                        (0.384884)
                                                Blitz++
                1018.47
   100
                                                            = 1.40862
                                                                        (0.385172)
                                                GMM++
   10000000
               209.065
                                                Armadillo
                                                            = 1.40215
                                                                        (0.383403)
  Eigen [MFlop/s]:
                                                MTL
                                                            = 1.39941
                                                                        (0.382656)
                2173.48
   100
                                                Eigen
                                                            = 1.39386
                                                                        (0.381136)
   10000000
```

Figure 5.1. An example of the results obtained from running DVECDVECADD benchmark through Blazemark

#### **5.0.2.** Setup

Our experiments were run on Marvin nodes of Rostam cluster at Center for Computation and Technology(CCT) at Louisiana State University. Table 5.1 and Table ?? show some of the specifications of this node.

| CPU             | 2 x Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz |
|-----------------|----------------------------------------------|
| RAM             | 48 GB                                        |
| Number of Cores | 16                                           |
| Hyperthreading  | Off                                          |

Table 5.1. Specifications of the Marvin node from Rostam cluster at CCT.

| Cache Level | Coherency Line Size | Number of Sets | Ways of Associativity | Size    |
|-------------|---------------------|----------------|-----------------------|---------|
| 1           | 64                  | 512            | 8                     | 32KB    |
| 2           | 64                  | 512            | 8                     | 256KB   |
| 3           | 64                  | 512            | 20                    | 20480KB |

Table 5.2. Cache specifications of the Marvin node from Rostam cluster at CCT.

| HPX version commit | 1.3.0 |
|--------------------|-------|
| Blaze version      | 3.5   |

Table 5.3. Specifications of the libraries used to run our experiments.

## **Chapter 6. Future Plans**

#### 6.1. What is the role of matrix size?

In Chapter 4 two different functions were proposed to model the relationship between grain size, number of cores, and performance (execution time or throughput). These functions were fitted for each matrix size individually for simplification, but this assumption is not practical, since even though the characteristics of the function is the same for all matrix sizes, the fitted parameters vary from one matrix size to another.

The matrix size should somehow be integrated into the model itself. For this purpose, first we need to understand how changing the matrix size affects the relationship between grain size and performance. One immediate effect of increasing the matrix size is an increase in maximum possible grain size (right hand side of Figure 6.1), while the minimum possible grain size is the same.



Figure 6.1. Throughput vs. grain size graph obtained from running *DMATDMATADD* benchmark on 4 cores.

Moreover, Figure 6.2 shows how the predicted grain size range changes for different matrix sizes.



Figure 6.2. The predicted range of grain size for maximum performance for different matrix sizes, using polynomial fit, based on DMATDMATADD benchmark.

#### References

- [1] Patricia Grubel, Hartmut Kaiser, Jeanine Cook, and Adrian Serio. The performance implication of task size for applications on the hpx runtime system. In *2015 IEEE International Conference on Cluster Computing*, pages 682–689. IEEE, 2015.
- [2] R Clinton Whaley and Jack J Dongarra. Automatically tuned linear algebra software. In *SC'98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing*, pages 38–38. IEEE, 1998.
- [3] Neil J Gunther. The practical performance analyst. iuniverse. com inc. *Lincoln, Nebraska*, 2000.
- [4] Neil J Gunther. A new interpretation of amdahl's law and geometric scalability. *arXiv* preprint cs/0210017, 2002.
- [5] Neil J Gunther. What is guerrilla capacity planning? Springer, 2007.
- [6] Neil J Gunther. A new interpretation of amdahl's law and geometric scaling. *arXiv* preprint cs/0210017, 2011.
- [7] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. Hpx: A task based programming model in a global address space. In *Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models*, page 6. ACM, 2014.
- [8] Laxmikant V Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on c++. In *OOPSLA*, volume 93, pages 91–108. Citeseer, 1993.
- [9] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. *Journal of parallel and distributed computing*, 37(1):55–69, 1996.
- [10] Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, and Andrew Lumsdaine. Am++: A generalized active message framework. In *Proceedings of the 19th international conference on Parallel architectures and compilation techniques*, pages 401–410. ACM, 2010.
- [11] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In *SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis*, pages 1–11. IEEE, 2012.
- [12] Hartmut Kaiser, Maciek Brodowicz, and Thomas Sterling. Parallex an advanced parallel execution model for scaling-impaired applications. In *2009 International Conference on Parallel Processing Workshops*, pages 394–401. IEEE, 2009.
- [13] Abhishek Kulkarni and Andrew Lumsdaine. A comparative study of asynchronous manytasking runtimes: Cilk, charm++, parallex and am++. *arXiv preprint arXiv:1904.00518*, 2019.

- [14] Guang R Gao, Thomas Sterling, Rick Stevens, Mark Hereld, and Weirong Zhu. Parallex: A study of a new parallel computation model. In *2007 IEEE International Parallel and Distributed Processing Symposium*, pages 1–6. IEEE, 2007.
- [15] Klaus Iglberger, Georg Hager, Jan Treibig, and Ulrich Rüde. Expression templates revisited: a performance analysis of current methodologies. *SIAM Journal on Scientific Computing*, 34(2):C42–C69, 2012.
- [16] Todd Veldhuizen. Expression templates. *C++ Report*, 7(5):26–31, 1995.
- [17] Blitz++ Library. http://www.oonumerics.org/blitz/.
- [18] Boost uBLAS Library. http://www.boost.org/doc/libs/1\_45\_0/libs/numeric/ublas/doc/index.htm.
- [19] MTL4 Library. http://www.simunova.com/de/mtl4.
- [20] Gaël Guennebaud, Benoit Jacob, et al. Eigen. URl: http://eigen. tuxfamily. org, 2010.
- [21] Gene M Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In *Proceedings of the April 18-20, 1967, spring joint computer conference*, pages 483–485. ACM, 1967.
- [22] Florina M Ciorba, Christian Iwainsky, and Patrick Buder. Openmp loop scheduling revisited: making a case for more schedules. In *International Workshop on OpenMP*, pages 21–36. Springer, 2018.
- [23] Jie Liu, Vikram A Saletore, and Ted G Lewis. Safe self-scheduling: a parallel loop scheduling scheme for shared-memory multiprocessors. *International Journal of Parallel Programming*, 22(6):589–616, 1994.
- [24] Teebu Philip. *Increasing chunk size loop scheduling algorithms for data independent loops.* PhD thesis, Citeseer, 1995.
- [25] Constantine D Polychronopoulos and David J Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. *Ieee transactions on computers*, 100(12):1425–1439, 1987.
- [26] Susan Flynn Hummel, Edith Schonberg, and Lawrence E Flynn. Factoring: A method for scheduling parallel loops. *Communications of the ACM*, 35(8):90–102, 1992.
- [27] Ten H Tzen and Lionel M Ni. Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers. *IEEE Transactions on parallel and distributed systems*, 4(1):87–98, 1993.
- [28] David J Lilja. Exploiting the parallelism available in loops. Computer, 27(2):13–26, 1994.
- [29] Ali Mohammed, Ahmed Eleliemy, Florina M Ciorba, Franziska Kasielke, and Ioana Banicescu. Experimental verification and analysis of dynamic loop scheduling in scientific applications. In 2018 17th International Symposium on Parallel and Distributed Computing (ISPDC), pages 141–148. IEEE, 2018.

- [30] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded computations by work stealing. *Journal of the ACM (JACM)*, 46(5):720–748, 1999.
- [31] Simplice Donfack, Laura Grigori, William D Gropp, and Vivek Kale. Hybrid static/dynamic scheduling for already optimized dense matrix factorization. In *2012 IEEE 26th International Parallel and Distributed Processing Symposium*, pages 496–507. IEEE, 2012.
- [32] Preeti Malakar, Prasanna Balaprakash, Venkatram Vishwanath, Vitali Morozov, and Kalyan Kumaran. Benchmarking machine learning methods for performance modeling of scientific applications. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pages 33–44. IEEE, 2018.
- [33] Filip Blagojevic, Xizhou Feng, Kirk W Cameron, and Dimitrios S Nikolopoulos. Modeling multigrain parallelism on heterogeneous multi-core processors: a case study of the cell be. In *International Conference on High-Performance Embedded Architectures and Compilers*, pages 38–52. Springer, 2008.
- [34] Darren J Kerbyson, Henry J Alme, Adolfy Hoisie, Fabrizio Petrini, Harvey J Wasserman, and Mike Gittings. Predictive performance and scalability modeling of a large-scale application. In *SC'01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing*, pages 39–39. IEEE, 2001.
- [35] Leslie G Valiant. A bridging model for parallel computation. *Communications of the ACM*, 33(8):103–111, 1990.
- [36] Benjamin C Lee, David M Brooks, Bronis R de Supinski, Martin Schulz, Karan Singh, and Sally A McKee. Methods of inference and learning for performance modeling of parallel applications. In *Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming*, pages 249–258. ACM, 2007.
- [37] Jingwei Sun, Shiyan Zhan, Guangzhong Sun, and Yong Chen. Automated performance modeling based on runtime feature detection and machine learning. In 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/I-UCC), pages 744–751. IEEE, 2017.
- [38] Sabri Pllana, Ivona Brandic, and Siegfried Benkner. Performance modeling and prediction of parallel and distributed computing systems: A survey of the state of the art. In *First International Conference on Complex, Intelligent and Software Intensive Systems (CI-SIS'07)*, pages 279–284. IEEE, 2007.
- [39] Engin Ipek, Bronis R De Supinski, Martin Schulz, and Sally A McKee. An approach to performance prediction for parallel applications. In *European Conference on Parallel Processing*, pages 196–205. Springer, 2005.
- [40] Robert D Falgout and Ulrike Meier Yang. hypre: A library of high performance preconditioners. In *International Conference on Computational Science*, pages 632–641. Springer, 2002.

- [41] Jiawen Liu, Dong Li, Gokcen Kestor, and Jeffrey Vetter. Runtime concurrency control and operation scheduling for high performance neural network training. *arXiv* preprint *arXiv*:1810.08955, 2018.
- [42] Ahmad Qawasmeh, Abid M Malik, and Barbara M Chapman. Adaptive openmp task scheduling using runtime apis and machine learning. In *2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)*, pages 889–895. IEEE, 2015.
- [43] Zheng Wang and Michael FP O'Boyle. Mapping parallelism to multi-cores: a machine learning based approach. In *ACM Sigplan notices*, volume 44, pages 75–84. ACM, 2009.
- [44] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard api for shared-memory programming. *Computing in Science & Engineering*, (1):46–55, 1998.
- [45] Boost C++ Framework. https://www.boost.org.
- [46] Gmm++ Library. http://getfem.org/gmm/.
- [47] Conrad Sanderson and Ryan Curtin. Armadillo: a template-based c++ library for linear algebra. *Journal of Open Source Software*, 1(2):26, 2016.
- [48] Eigen Library. http://eigen.tuxfamily.org/.
- [49] Automatically Tuned Linear Algebra Software. http://math-atlas.sourceforge.net/.
- [50] GotoBLAS2. https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2.
- [51] Intel Math Kernel Library. https://software.intel.com/en-us/mkl.