# FIRST LINE OF THE TITLE SECOND LINE OF THE TITLE

by

Author
A Thesis
Submitted to the
Graduate Faculty
of
George Mason University
In Partial fulfillment of
The Requirements for the Degree
of
Master of Science
Discipline

| Committee: |                                                     |
|------------|-----------------------------------------------------|
|            | Dr. First Last, Thesis Director                     |
|            | Dr. First Last, Committee Member                    |
|            | Dr. First Last, Committee Member                    |
|            | Dr. First Last, Department Head                     |
|            | Dr. First Last, Dean                                |
| Date:      | X Semester Year George Mason University Fairfax, VA |

The Complete Title is to be Repeated Here without any Line Breaks for the Second Page and for the Abstract Page

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University

By

Author Bachelor of Science My Other Former School, Year of first degree

> Director: Dr. First Last, Professor Department of Name of Department

> > X Semester Year George Mason University Fairfax, VA

Copyright © Year by Author All Rights Reserved

# Dedication

I dedicate this dissertation to ...

# Acknowledgments

I would like to thank the following people who made this possible ...

# Table of Contents

- List of Tables
- List of Figures
- Abstract
- 1 Introduction
  - 1.1 Motivation
  - 1.2 Research Questions
  - 1.3 Contributions
- 2 Background
  - 2.1 Intel Optane DC Persistent Memory
    - 2.1.1 Overview of Intel Optane DC PMM
    - 2.1.2 Performance Characterization
    - 2.1.3 Operating Modes and Applications
    - 2.1.4 Programming
  - 2.2 Serverless Computing
    - 2.2.1 Function-as-a-Service (FaaS)
    - 2.2.2 Storage for FaaS
    - 2.2.3 Service Level Agreements
  - 2.3 Reinforcement Learning
    - 2.3.1 Overview of Reinforcement Learning
    - 2.3.2 Q-Learning
    - 2.3.3 Function Approximation using Linear Regression Models
    - 2.3.4 Exploration-Exploitation Tradeoff
    - 2.3.5 Reward shaping
- 3 NVM Middleware: A control layer for persistent memory
  - 3.1 Motivation
    - 3.1.1 Concurrency Control Challenges in a serverless storage service
    - 3.1.2 NVM Middleware Design Overview
  - 3.2 Architecture
  - 3.3 Programming Interface
  - 3.4 Reinforcement Learning Component
    - 3.4.1 Integration with the NVM Middleware
    - 3.4.2 Reinforcement Learning Model
    - 3.4.3 Training Methodology
  - 3.5 Implementation
- 4 Evaluation
  - 4.1 Experimental Setup
    - 4.1.1 Platform
    - 4.1.2 Optane DC PMem Configuration
    - 4.1.3 FaaS Workload Traces
- 5 Related Work
- 6 Conclusions and Future Work
- Bibliography

# List of Tables

| Table | Title                                |
|-------|--------------------------------------|
| 3.1   | Programming Interface                |
| 3.2   | The State Representation             |
| 3.3   | Possible Actions in the Action Space |
| 4.1   | Experimental Platform Specifications |

# List of Figures

| Figure | Caption                                  |
|--------|------------------------------------------|
| 2.1    | Memory Hierarchy. Taken from [1]         |
| 2.2    | Communication between iMC and Optane PMM |
| 2.3    | RL Workflow                              |
| 3.1    | NVM Middleware Architecture              |
| 3.2    | RL Workflow                              |
| 3.3    | Overview of the Environment Architecture |
| 3.4    | Agent Process flow                       |

## Abstract

THE COMPLETE TITLE IS TO BE REPEATED HERE WITHOUT ANY LINE BREAKS FOR THE SECOND PAGE AND FOR THE ABSTRACT PAGE

Author, MS

George Mason University, Year

Thesis Director: Dr. First Last

Enter abstract text.

## Chapter 1: Introduction

## 1.1 Motivation

Serverless computing is an increasingly popular cloud execution model that liberates application developers from the burden of traditional infrastructure management. With serverless platforms (e.g., AWS Lambda, Google Cloud Functions, Azure Functions), developers focus solely on writing their code as event-driven functions that execute on demand in response to events or triggers. Cloud providers are responsible for dynamically allocating and scaling resources to meet demand as the event triggers occur. With a pay-as-you-go pricing model, users pay only for the resources consumed during their function invocations, making serverless computing a cost-effective solution.

Cloud providers designed serverless functions to be stateless, meaning that they do not retain state between function invocations. This intentional statelessness is a fundamental aspect for achieving high elasticity. By eliminating the need to store state within the function invocation, serverless platforms promote scalability and ease of deployment. Cloud providers can execute functions in parallel, allowing for efficient resource utilization. Any data needed between function invocations must be stored in remote storage.

Although the stateless nature of serverless computing is key to achieving high elasticity, it limits the type of applications that can run efficiently on serverless platforms. Previous studies [2] have found that data-intensive applications running on serverless platforms (e.g., data analytics, ML workflows, databases) are limited by the capacity and performance gaps among the existing storage services. Object storage services, such as AWS S3, provide cheap long-term storage but exhibit high access latencies. On the other hand, in-memory clusters, such as AWS ElastiCache, exhibit low access latencies and high throughput, but they are expensive and are not transparently provisioned. In between, key-value databases, such as AWS DynamoDB, provide high throughput, but are expensive and can take a long time to scale.

Given the limitations of existing storage solutions, previous work motivates the development of a serverless storage service capable of handling the wide variety of workloads running on serverless platforms. These studies identify three requirements that such a service must meet. First, it should provide low latency and high throughput for a wide range of object sizes and data access patterns. Second, it should be transparently provisioned and should scale to meet workload demands. Third, it must ensure isolation and predictable performance across applications and tenants.

To meet the first requirement, cloud providers must first close the capacity and performance gap between main memory and persistent storage media. As mentioned above, existing storage services have fixed tradeoffs that reflect the traditional memory hierarchy built from RAM, flash memory, and magnetic disk drives. Leveraging non-volatile memory is a promising approach to bridge the gap between the memory and storage tiers. Non-volatile memory combines the persistence and capacity of traditional storage with the low latency and byte addressability of main memory.

Non-volatile memory technology experienced a breakthrough with the release of the Intel Optane DC Persistent Memory Module (PMM). Optane PMM is an emerging technology in which non-volatile media is placed in a Dual In-Line Memory Module (DIMM) and installed on the memory bus, alongside traditional DRAM (Dynamic Random Access Memory). Similar to DRAM, this technology presents a byte-addressable interface and achieves speeds comparable to DRAM (roughly 2x-3x slower). The main difference between the two is that Optane PMM has higher capacities and can retain data when the system is shut down or loses power. This allows Optane PMM to be used as a form of persistent storage with memory-like speeds.

The unique combination of persistence and low access latency makes Optane PMM an ideal candidate to speed up data-intensive workloads running on serverless platforms. Thus, this thesis presents an analysis of how to make efficient use of Optane PMM to build a serverless storage service.

## 1.2 Research Questions

With the release of Intel Optane DIMM, researchers have started to understand its characteristics, capabilities, and limitations [3–5]. The initial expectation was that Intel Optane DC PMM would behave similarly to DRAM, but with lower performance (higher latency and lower bandwidth). However, recent studies suggest that it should not be treated as a "slower, persistent DRAM". Compared to DRAM, Optane DC PMM exhibits complicated behaviors, and its performance changes based on multiple factors, such as the access size, access type (read vs. write), and degree of concurrency.

Intel Optane DC PMM differs from DRAM in two ways. First, there is a mismatch between the CPU cache line access granularity (64 bytes) and the 3D-XPoint media access granularity (256 bytes) in Intel Optane DC PMM. This difference can lead to write or read amplification if the data access is smaller than 256 bytes. Second, to compensate for the gap in access granularity, Intel Optane DC PMM implements a small (16 KB) write-combining buffer that merges small writes and reduces write amplification. However, the buffer's limited capacity can cause contention within the device, limiting its ability to handle accesses from multiple threads simultaneously.
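The effect of the granularity mismatch can be illustrated with a short calculation. The sketch below assumes a deliberately simplified model in which every access is aligned, touches whole 256-byte media blocks, and is not merged by the write-combining buffer; the function name and the model are illustrative only and do not describe the device's internal logic.

```python
import math

MEDIA_BLOCK = 256  # 3D-XPoint access granularity in bytes (from the text)


def amplification(access_bytes: int, block: int = MEDIA_BLOCK) -> float:
    """Ratio of bytes moved at the media to bytes requested.

    Assumes an aligned access and no write combining (simplified model).
    """
    if access_bytes <= 0:
        raise ValueError("access size must be positive")
    # The media is always accessed in whole blocks.
    media_bytes = math.ceil(access_bytes / block) * block
    return media_bytes / access_bytes


for size in (64, 128, 256, 512):
    print(f"{size:4d} B access -> {amplification(size):.1f}x amplification")
```

Under this model a single 64-byte cache line access moves four times as much data at the media as the application requested, while accesses that are multiples of 256 bytes incur no amplification.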

The complex behavior of Intel Optane DC PMM introduces interesting challenges for building a serverless storage service using this technology. Previous works have found that serverless functions vary considerably in multiple ways, including the way they access and process data, and their quality-of-service (QoS) demands. Furthermore, these workloads can spike by orders of magnitude and change dramatically over time. Knowing how these large-scale variations affect the system's performance and QoS for applications can assist in building an efficient serverless storage service.

Consequently, this thesis addresses the following research questions:

- How does Optane PMM affect the system's performance when used as persistent storage for serverless functions?
- How does Optane PMM performance under serverless workloads affect the QoS of applications?
- How can we overcome the limitations of Optane PMM to make efficient use of the device in a serverless scenario?
- How do we keep the system optimized and compliant with QoS requirements over time as workload shifts occur?

## 1.3 Contributions

The experiments described in Section 3 provide several insights into the behavior of Optane PMM when used as persistent storage for serverless workloads. First, we find that sharing the Optane PMM among hundreds of serverless functions leads to performance loss (higher latency and lower bandwidth) in the device. This was expected given the contention issues that Optane PMM experiences at higher thread counts. Second, we find that, depending on the workload, the performance degradation in Optane PMM affects one performance metric more than the other (latency vs. bandwidth). This suggests that the QoS of some applications might be affected more than that of others. Therefore, we conclude that the success of Optane PMM should be measured by its capability to meet the QoS requirements of the current workload.

To help alleviate the limitations of Intel Optane PMM, we introduce a control layer that runs on top of Optane and guides the efficient use of the device under dynamic workloads. Our control layer, called NVM Middleware, is designed to limit access to persistent memory in order to reduce contention. While doing so, the NVM Middleware keeps track of the types of applications running in the system and applies different optimization policies to each one to ensure that their QoS requirements are met. Using machine learning, the NVM Middleware learns how to scale resources to meet the current demand and dynamically adapts them to changing workloads. We propose using online reinforcement learning algorithms, given that data access patterns in serverless workloads can change over time.

In summary, this thesis makes the following contributions:

- We present an experimental study that describes the capabilities and limitations of Intel Optane PMM when used as persistent storage for serverless workloads. To our knowledge, Optane PMM has not been tested yet in this scenario.
- We present the NVM Middleware, a control layer that promotes the efficient use of Optane PMM while ensuring that the QoS requirements of different types of applications are met.
- We propose a Reinforcement Learning model and framework that allows the NVM Middleware to learn from historical data and adapt resources to changing workloads.
- Finally, we present empirical results that demonstrate the benefits of our solution.

## Chapter 2: Background

## 2.1 Intel Optane DC Persistent Memory

Intel Optane Persistent Memory (PMM) represents a significant advancement in persistent memory technology, bridging the gap between dynamic random-access memory (DRAM) and storage devices [6]. This section provides an overview of the architecture, features, benefits, and applications of Intel Optane PMM.

#### 2.1.1 Overview of Intel Optane DC PMM



Figure 2.1: Memory Hierarchy. Taken from [1]

Persistent memory, also referred to as Non-Volatile Memory (NVM), represents a significant evolution in the memory/storage hierarchy (Figure 2.1), addressing the performance and capacity gap between dynamic random-access memory (DRAM) and traditional storage mediums. This innovative technology combines the characteristics of both DRAM and storage, offering the speed of DRAM and the non-volatile nature of storage devices [6].

Like DRAM, persistent memory is available in the form of Dual In-line Memory Modules (DIMMs), which are directly connected to the memory bus. This direct connection enables applications to access persistent memory with the same ease as traditional DRAM, eliminating the need for frequent data transfers between memory and storage. However, unlike DRAM, persistent memory DIMMs provide significantly greater capacity and retain data even when power is removed, thereby enhancing system performance and enabling fundamental changes in computing architecture [6,7].

Intel Optane DC Persistent Memory Module (Optane PMM) stands at the forefront of commercial implementations of persistent memory technology, building on Intel's 3D-XPoint technology. At its introduction, Optane PMM offered capacities of up to 512 GiB per module and was supported exclusively by Intel's Cascade Lake platform. Each processor within this platform is equipped with two integrated memory controllers (iMCs), and each iMC supports three channels. This architecture integrates Optane PMM with DRAM, allowing users to deploy up to one Optane PMM per channel and up to six per CPU socket, for a total capacity of up to 3 TiB per socket [3,4].
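As a quick check of the per-socket figure, assuming that every channel of both iMCs is populated with the largest (512 GiB) module:

$$2\ \text{iMCs} \times 3\ \frac{\text{channels}}{\text{iMC}} \times 512\ \text{GiB} = 3072\ \text{GiB} = 3\ \text{TiB}$$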

Similar to conventional DRAM DIMMs, Optane PMMs sit on the memory bus and connect directly to the processor's iMC. Figure 2.2 depicts the communication between the iMC and the Optane PMM, which uses the DDR-T protocol, adapted for persistent memory and operating at cache line granularity (64 B). Each memory access to the Optane PMM is first handled by the on-board controller, which manages access to the 3D-XPoint media. Analogous to SSDs, the Optane PMM performs address translation for wear-leveling and bad block management, using an address indirection table (AIT). After translation, the access to the storage media takes place. Because the 3D-XPoint access granularity is 256 B, the controller converts 64-byte accesses into 256-byte accesses, which causes write amplification. To mitigate this, the controller incorporates a 16 KB write-combining buffer that merges adjacent writes [3–5].



Figure 2.2: Communication between iMC and Optane PMM

To ensure data persistence, Intel platforms place the iMC and Optane PMM within the asynchronous DRAM refresh (ADR) domain. Intel's ADR feature ensures that CPU stores that reach the ADR domain survive a power failure [4]. The iMC maintains read and write pending queues for each Optane PMM, and the ADR domain includes the write pending queue. Once data reaches the write pending queue, ADR ensures that it is persisted to the Optane PMM in the event of a power failure. The ADR domain excludes the CPU caches, so executing a store instruction is not enough to guarantee persistence. Instead, stores must be explicitly flushed from the CPU caches using instructions provided by Intel's Instruction Set Architecture (ISA), including CLFLUSH, CLFLUSHOPT, and CLWB [3, 4, 7].

#### 2.1.2 Performance Characterization

Previous studies [3, 4] conducted an empirical performance assessment of Optane PMM, revealing its nuanced behavior compared to DRAM. They observed that Optane's performance varies significantly depending on the access pattern, including access size, access type, and concurrency level. Notably, they found that Optane's read latency is roughly three times that of DRAM, primarily due to Optane's longer media latency. However, sequential access patterns show notably lower latency, indicating that Optane PMM can consolidate adjacent requests into single 256-byte accesses. The studies also report that Optane PMM achieves a maximum random read bandwidth of 6.6 GB/s and a write bandwidth of 2.3 GB/s. Sequential access improves bandwidth further, by up to a factor of four [3, 4].

An important observation by Izraelevitz et al. [3] is that Optane PMM's bandwidth can become saturated when used by real-world multi-threaded applications, which introduces performance overhead. This occurs because Optane PMM's performance does not scale proportionally with thread count, primarily due to contention within the processor's integrated memory controller (iMC) and Optane PMM's buffer. Contention within the buffer increases the frequency of evictions and write-backs to the 3D-XPoint media, so that Optane writes more data internally than the application requires. Furthermore, because Optane PMM is slower than DRAM, its write pending queues drain more slowly, which introduces head-of-line blocking effects. As the number of threads concurrently accessing Optane PMM increases, contention on the device escalates, raising the likelihood that the processor blocks while waiting for previous store operations to complete [4].

#### 2.1.3 Operating Modes and Applications

Intel Optane persistent memory (PMem) offers two distinct operating modes: Memory mode and App Direct mode.

In Memory mode, Optane PMem serves as a high-capacity main memory without persistence. In this configuration, DRAM is concealed from users and acts solely as a cache for Optane PMem, seamlessly managed by the operating system [4].

Conversely, in App Direct mode, Optane PMMs are directly exposed to the operating system as independent persistent memory devices, which allows them to be used for persistent storage [3]. The operating system sees DRAM and Optane PMem as distinct memory pools, with the latter offering data persistence. Applications can access Intel Optane persistent memory through direct load/store operations or via a file system configured with the dax (direct access) option. Such a file system is termed a PM-aware file system, and it provides direct access to persistent memory without relying on the page cache [7,8].
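The load/store style of access can be sketched with an ordinary memory-mapped file. This is only an illustration under stated assumptions: the helper names are hypothetical, a temporary file stands in for a file on a dax-mounted file system, and Python's `mmap.flush()` (which issues a synchronous flush of the mapping) stands in for the cache-flush instructions that native persistent memory code would use. It is not the access path used in this thesis.

```python
import mmap
import os
import tempfile


def write_record(path: str, payload: bytes, size: int = 4096) -> None:
    """Store payload at offset 0 of a memory-mapped file and flush it."""
    if len(payload) > size:
        raise ValueError("payload larger than mapping")
    with open(path, "w+b") as f:
        f.truncate(size)  # give the file a fixed size before mapping it
        with mmap.mmap(f.fileno(), size) as region:
            region[:len(payload)] = payload  # plain stores into the mapping
            region.flush()  # push the modified range to the backing file


def read_record(path: str, length: int) -> bytes:
    """Read the first `length` bytes through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
            return bytes(region[:length])


with tempfile.TemporaryDirectory() as directory:
    # On a real system this path would live on a dax-mounted file system.
    pool = os.path.join(directory, "pool.bin")
    write_record(pool, b"hello pmem")
    print(read_record(pool, 10))
```

With a PM-aware file system the same mapping would refer to persistent memory directly, so the stores would bypass the page cache rather than being written back through it.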

In the context of this thesis, Optane PMem is exclusively employed in App Direct Mode, coupled with a PM-aware file system to harness its storage capabilities.

#### 2.1.4 Programming

In the realm of persistent memory technology, maintaining data consistency across runtime and system reboots is essential. To address this challenge, prior research underscores the necessity for applications leveraging persistent memory to implement transactions that are atomic, consistent, thread-safe, and resilient to system failures—a paradigm akin to ACID transactions in database systems. However, achieving such robustness in real-world scenarios poses significant complexity. Recognizing this, Intel has developed the Persistent Memory Development Kit (PMDK) to tackle this challenge [6,7].

PMDK comprises a comprehensive suite of libraries and tools tailored for both application developers and system administrators, aiming to streamline the management and utilization of persistent memory devices. Drawing on the SNIA NVM Programming model [9] as its foundation, these libraries extend its capabilities to varying extents. Some libraries offer simplified wrappers around operating system primitives, facilitating ease of use, while others provide sophisticated data structures optimized for persistent memory usage [6].

In this thesis, we use pmemkv [10], a persistent local key-value store provided by PMDK. Designed with cloud environments in mind, pmemkv complements PMDK's suite of libraries with cloud-native support, hiding the intricacies of programming with persistent memory behind a familiar key-value API. Notably, pmemkv differs from traditional key-value databases by enabling direct access to data: reading data from persistent memory does not require copying it into DRAM, which significantly improves the performance of applications that use persistent memory [6].

## 2.2 Serverless Computing

Serverless computing, a prominent execution model within cloud computing, revolutionizes the deployment process by allowing developers to deploy code without the need for provisioning or managing server infrastructure. Although termed "serverless," this model still utilizes servers provided by cloud vendors to execute developers' code. However, the distinguishing feature lies in the abstraction of infrastructure management from the developer's perspective. Developers no longer concern themselves with resource provisioning, scaling, fault tolerance, monitoring, or security patches; instead, they focus solely on code development. Cloud providers take on the responsibility of handling these infrastructure-related tasks on behalf of their customers. Consequently, developers are charged based on the execution time and resources consumed during their code invocations, offering a pay-per-use billing model [2, 11, 12].

#### 2.2.1 Function-as-a-Service (FaaS)

At the heart of serverless computing lies Function-as-a-Service (FaaS), introduced by AWS Lambda in 2015. Since then, various commercial and open-source alternatives have emerged, including Google Cloud Functions, Azure Functions, Apache OpenWhisk, and others. FaaS enables developers to express application logic as stateless functions written in high-level languages such as Java, Python, C, or C++. These functions, known as serverless functions, are packaged together with their dependencies and submitted to the serverless platform. Additionally, developers associate events with each serverless function, such as HTTP requests, file uploads, database triggers, and more. Upon the occurrence of a trigger, the cloud provider promptly executes the associated serverless function, offering a scalable and event-driven approach to application development and deployment [13–16].

Previous research has underscored the dynamic and unpredictable nature of Function-as-a-Service (FaaS) workloads within serverless computing environments. Typically, applications in serverless computing are composed of interconnected serverless functions, each serving a specific logical purpose. Analyzing these workloads often involves examining real-world FaaS provider logs to discern patterns and characteristics. Cloud providers face significant challenges in predicting the next function invocation due to the diverse array of triggers employed by applications. Moreover, the heterogeneity of these applications results in substantial variations in invocation frequency, data access sizes, and usage patterns. Data access sizes can range from mere bytes to gigabytes, while invocation frequencies can span several orders of magnitude, with some functions being infrequently called. Learning the invocation patterns of rarely invoked functions is particularly challenging. Additionally, a significant portion of data accesses exhibit bursty behavior, leading to rapid surges in I/O requests as multiple function instances are dynamically spun up to meet application demands. This phenomenon often results in short-lived bursts of intense activity within applications [11, 12, 17].

#### 2.2.2 Storage for FaaS

Serverless providers enforce a restriction on direct communication between serverless functions, necessitating the adoption of remote storage mechanisms for data interchange among them [2,11,12].
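The pattern can be sketched as two stateless functions that exchange an intermediate result through storage. Everything in this sketch is hypothetical: the function names, the event fields, and the in-memory dictionary that stands in for a remote storage service are assumptions made for illustration, not the interface of any particular platform.

```python
from typing import Dict

# Stand-in for a remote storage service. A real deployment would call an
# object store or key-value database over the network instead.
REMOTE_STORE: Dict[str, bytes] = {}


def producer(event: dict) -> dict:
    """First function: computes a result and writes it to storage."""
    key = f"job/{event['job_id']}/intermediate"
    REMOTE_STORE[key] = event["text"].upper().encode()
    # Only the key is handed on; no state is kept inside the function.
    return {"key": key}


def consumer(event: dict) -> dict:
    """Second function: fetches the intermediate result by key."""
    data = REMOTE_STORE[event["key"]]
    return {"length": len(data)}


result = consumer(producer({"job_id": 7, "text": "hello"}))
print(result)  # {'length': 5}
```

Because neither function retains state between invocations, every byte passed from one to the other crosses the storage service, which is why the latency and throughput of that service matter so much for data-intensive applications.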

One approach to facilitate data exchange between serverless functions is through the utilization of serverless storage, a framework within cloud computing that abstracts the intricacies of managing storage infrastructure from developers. With serverless storage, developers harness storage services provided by cloud vendors like AWS S3, Google Cloud Storage, Azure Blob Storage, AWS DynamoDB, and others. These services empower developers to store and retrieve data via APIs or SDKs without the burden of managing servers or storage clusters. Moreover, serverless storage services offer features such as automatic scaling, data durability, and pay-per-use pricing models, enabling developers to seamlessly adjust their storage resources in line with demand without the need for upfront provisioning or capacity planning. This scalability is of paramount importance for applications characterized by fluctuating storage requirements or unpredictable workloads [18–24].

Despite the high scalability of existing serverless storage solutions, they are primarily optimized for durability rather than performance [2,12,25]. Studies have demonstrated that object-based storage solutions like AWS S3 may exhibit latencies of up to 10 milliseconds for small object reads or writes [26]. Similarly, key-value databases such as AWS DynamoDB, Google Cloud Datastore, and Azure Cosmos provide high throughput but may entail high expenses and lengthy scaling processes [2].

In scenarios demanding higher performance, developers may opt for in-memory key-value stores like AWS ElastiCache [27]. These solutions offer low access latencies and high throughput but incur the higher costs associated with DRAM. Additionally, they do not provide persistence, meaning that data is not retained in the event of a system failure. A further drawback is their lack of autoscaling capabilities, which requires developers to manage and scale clusters manually when integrating them with serverless functions [2, 12, 25].

#### 2.2.3 Service Level Agreements

Service level agreements (SLAs) are integral to cloud storage, as they establish the terms and conditions governing the quality of service between cloud providers and customers. An SLA is a negotiated contract in which both parties agree on system-related attributes, including the client's anticipated request characteristics and the expected performance of the storage service under these conditions. SLAs also define the consequences of any deviation from the agreed service levels, which often take the form of service credits, refunds, or other forms of redress. Such measures incentivize providers to fulfill their obligations and to address any disruptions or deficiencies in service promptly [20, 28, 29].

In the realm of storage services, latency and throughput-related SLAs are of paramount importance. While conventional practice often involves describing performance SLAs using mean or median metrics, these measures may not adequately ensure a consistently positive user experience across all customers. Instead, a more effective approach involves assessing performance SLAs in terms of tail (99th) percentiles, thereby prioritizing the resolution of exceptional cases and optimizing service quality for all users [20].
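The difference between average and tail metrics can be seen in a small worked example. The latency values below are made up for illustration, and the nearest-rank percentile definition is one common choice among several; neither comes from the measurements in this thesis.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in microseconds: 98 fast, 2 slow.
latencies_us = [100] * 98 + [5000, 9000]

print("mean  :", statistics.mean(latencies_us))    # 238
print("median:", statistics.median(latencies_us))  # 100.0
print("p99   :", percentile(latencies_us, 99))     # 5000
```

The mean and median suggest a service that responds in a few hundred microseconds at most, while the 99th percentile shows that one request in a hundred waits at least 5 ms, which is the behavior a tail-latency SLA is meant to bound.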


## 2.3 Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning concerned with learning optimal decision-making policies through interactions with an environment [30].

#### 2.3.1 Overview of Reinforcement Learning

The fundamental concept underlying RL is the notion of an agent, which takes actions in an environment and receives feedback in the form of rewards, indicating the quality of its decisions. The agent's objective is to learn a policy that maximizes cumulative rewards over time. Moreover, the agent is not provided with explicit instructions on which actions to take; instead, it must discover the actions that lead to the highest rewards by trying them.



Figure 2.3: RL Workflow

Figure 2.3 presents a schematic representation of a standard reinforcement learning scenario. In discrete time steps, the agent perceives the current state  $s_t$  from the set of all possible states S. It then selects an action  $a_t$  from the available actions  $A(s_t)$  in the current state. The environment transitions to a new state  $s_{t+1}$ , and the agent receives a reward  $r_t$  associated with the transition  $(s_t, a_t, s_{t+1})$ .

The agent's behavior is governed by its policy, which maps perceived states to actions.

The ultimate aim is to learn an optimal or near-optimal policy that maximizes the cumulative reward.

#### 2.3.2 Q-Learning

One of the foundational algorithms in RL is Q-Learning, introduced by Watkins in 1989 [31]. The algorithm belongs to the class of model-free RL algorithms, meaning it learns directly from experience without requiring a model of the environment dynamics [32].

At the core of Q-Learning is the Q-value function, denoted as Q(s, a), which represents the expected cumulative reward the agent will receive by taking action a in state s and following an optimal policy thereafter. The objective of Q-Learning is to iteratively update the Q-values based on observed transitions and rewards, eventually converging to the optimal Q-values that maximize long-term rewards.

The Q-Learning algorithm proceeds as follows: the agent interacts with the environment by selecting actions based on its current estimate of the Q-values. Upon taking an action, the agent observes the resulting reward and the next state. It then updates the Q-value of the previous state-action pair using the observed reward and the estimated value of the next state.

The Q-value update rule in Q-Learning is based on the Bellman equation, which expresses the relationship between the Q-values of successive states [32]:

$$Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \cdot \left(r + \gamma \cdot \max_{a'} Q(s', a')\right)$$

Here,  $\alpha$  is the learning rate, determining the extent to which new information overrides the old one, and  $\gamma$  is the discount factor, representing the importance of future rewards relative to immediate rewards. The term  $r + \gamma \cdot \max_{a'} Q(s', a')$  is known as the temporal-difference (TD) target, combining the immediate reward r with the discounted maximum Q-value of the next state s' [32].
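The update rule can be sketched for the tabular case as follows. This is a minimal illustration: the state names, action names, and parameter values are hypothetical.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update in place and return the TD target."""
    td_target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * td_target
    return td_target

# Q-table keyed by (state, action); unseen pairs default to 0.0.
q_table = defaultdict(float)
q_table[("s1", "right")] = 2.0

# One observed transition: in s0, taking "right" yields reward 1.0 and leads to s1.
target = q_update(q_table, "s0", "right", reward=1.0, next_state="s1",
                  actions=["left", "right"], alpha=0.5, gamma=0.9)

# TD target: 1.0 + 0.9 * 2.0 = 2.8; new Q(s0, right): 0.5 * 0.0 + 0.5 * 2.8 = 1.4
print(target, q_table[("s0", "right")])
```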

#### 2.3.3 Function Approximation using Linear Regression Models

One of the key advantages of Q-Learning is its simplicity and ease of implementation. It requires only a table to store the Q-values, making it computationally efficient for small state and action spaces. However, Q-Learning faces challenges in environments with large state spaces, as maintaining a lookup table becomes infeasible due to memory and computational constraints.

Function approximation is a fundamental technique in reinforcement learning (RL) aimed at approximating the Q-value function when dealing with large state or action spaces where tabular representations become impractical [32]. This approach allows RL agents to generalize from observed states to unseen states, facilitating decision-making in unexplored regions of the state space.

In the context of RL, linear regression models are commonly used for function approximation [30]. These models approximate the Q-value function by leveraging a weighted linear combination of features, with each feature capturing a distinct aspect of the state space. Employing gradient-descent methods, notably stochastic gradient descent, enables iterative refinement of the parameters governing the linear function, aimed at minimizing a predefined loss function. This iterative optimization process empowers the model to progressively enhance its predictive accuracy and capture intricate patterns within the state-action space.
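As a worked illustration of this idea, a linear approximation of the Q-value and one stochastic gradient step on the squared error against the TD target can be written as follows. The symbols $\phi(s)$ for the feature vector, $\mathbf{w}_a$ for a per-action weight vector, and $y$ for the TD target are notation introduced here for illustration, and keeping one weight vector per action is only one common arrangement.

$$\hat{Q}(s, a) = \mathbf{w}_a^{\top} \phi(s), \qquad y = r + \gamma \cdot \max_{a'} \hat{Q}(s', a')$$

$$\mathbf{w}_a \leftarrow \mathbf{w}_a + \alpha \cdot \left(y - \hat{Q}(s, a)\right) \cdot \phi(s)$$

Here $\alpha$ and $\gamma$ play the same roles as in the tabular update rule. The target $y$ is treated as a constant when taking the gradient of $\frac{1}{2}\left(y - \hat{Q}(s, a)\right)^2$ with respect to $\mathbf{w}_a$, which is why this step is usually called a semi-gradient update.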

Hyperparameter tuning is a critical aspect of training linear regression models in RL [33]. Hyperparameters, such as the learning rate, regularization strength, and feature scaling, significantly impact the performance and convergence of the models. A systematic approach to hyperparameter tuning involves experimenting with different combinations of hyperparameters, evaluating the performance of the trained models on a validation set, and selecting the optimal hyperparameters based on predefined criteria, such as validation error or performance metrics [32].

#### 2.3.4 Exploration-Exploitation Tradeoff

The exploration-exploitation tradeoff poses a significant challenge in reinforcement learning [30]. The agent must strike a balance between exploring unfamiliar actions to gather information and exploiting known actions for immediate rewards. Finding this balance is crucial for effective learning and task performance, as the agent gradually favors actions with higher expected rewards.

One classic strategy for balancing exploration and exploitation is the epsilon-greedy ($\epsilon$-greedy) algorithm [30]. The $\epsilon$-greedy policy selects the action that maximizes the estimated value with probability  $1 - \epsilon$  (exploitation) and selects a random action with probability  $\epsilon$  (exploration). This approach ensures that the agent continues to explore the environment while gradually exploiting more rewarding actions as it gains knowledge.

Decayed $\epsilon$-greedy methods aim to strike a balance between exploration and exploitation by gradually reducing the exploration rate  $\epsilon$  as the agent gains more experience or as the training progresses [30]. This decay encourages the agent to explore the environment more extensively in the early stages of learning while gradually shifting towards exploitation as it becomes more knowledgeable.
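Both ideas fit in a few lines. The sketch below is illustrative only: the exponential schedule and its constants (`start`, `end`, `decay`) are assumptions, since many decay schedules are possible.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon,
    otherwise the index of the highest estimated Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    """Exponential decay of the exploration rate, floored at `end`.
    All constants are illustrative defaults."""
    return max(end, start * decay ** step)

# Early in training the agent mostly explores; later it mostly exploits.
for step in (0, 100, 1000):
    eps = decayed_epsilon(step)
    action = epsilon_greedy([0.1, 0.7, 0.3], eps)
    print(f"step={step} epsilon={eps:.3f} action={action}")
```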

#### 2.3.5 Reward shaping

Reward shaping is a technique in reinforcement learning (RL) aimed at accelerating learning by modifying the reward signal provided to the agent. Traditional RL algorithms rely solely on sparse reward signals, which can make learning slow and inefficient, especially in complex environments. Reward shaping addresses this issue by providing additional, shaped rewards that guide the agent towards desirable behaviors. These shaped rewards are designed to provide more informative feedback to the agent, encouraging it to explore the state-action space more effectively. However, reward shaping must be carefully designed to avoid unintended consequences such as overfitting to the shaped rewards or incentivizing undesirable behaviors [32].

# Chapter 3: NVM Middleware: A control layer for persistent memory

As we have discussed, the release of Intel Optane PMM opens a major opportunity for serverless storage services. This memory technology provides a unique combination of affordable larger capacity, high-performance, and support for data persistence [34]. When configured in App-Direct mode, the Optane DIMM and DRAM DIMMs act as independent memory resources under direct load/store control of the applications. This allows the Optane PMM capacity to be used as byte-addressable persistent memory that is mapped into the system application space and directly accessible by applications. Together, these advantages enable Optane PMM to be used as persistent storage with memory-like speeds.

Unfortunately, the resource contention observed within Optane PMM can impose serious performance and contractual implications for a multi-tenant serverless storage service. Given the hallmark autoscaling features of serverless computing, the memory's limited ability to handle accesses from multiple threads can degrade the overall system's performance when workload spikes occur. Furthermore, these storage systems make efficient use of their infrastructure by allowing multiple users, or tenants, to share the physical resources. The performance degradation caused by Optane PMM can lead tenants to experience significant performance variations. The latter inhibits service providers from offering certain service level agreements.

To reduce the contention effect, previous studies recommend limiting the number of threads that access Optane PMM simultaneously. Yang et al. [4] improve the performance of an NVM-aware file system by limiting the number of writer threads that access each Optane DIMM. Similarly, Ribbon [5] controls the number of threads performing CLF and adjusts this number dynamically at runtime. While this approach provides a viable solution, it introduces management problems for the system administrator of a multi-tenant serverless storage service.

Given the complexity of serverless computing workloads, implementing efficient concurrency control mechanisms for an Optane-based serverless storage service is a challenging task. These challenges are discussed in Section 3.1, but in short, service providers have three crucial tasks when implementing these control mechanisms. First, they must provide predictable performance, ensuring that the SLAs of all workloads are met. Second, they must scale resources transparently to meet the current workload demand. Finally, they must devise policies that allow their system to adapt quickly to sudden workload shifts. To this end, we propose a solution that takes over these responsibilities from the service providers.

In this work, we present a shim layer that addresses the shortcomings of Intel Optane PMM highlighted above, while meeting the different service level agreements from multiple tenants under shifting workloads. Our shim layer, called NVM Middleware, distinguishes between latency-critical and throughput-oriented workloads and applies different concurrency control mechanisms for each one. This enables the system to reduce the contention on the memory device, as well as the interference among workloads with different service level agreements. In addition, we propose the development of a reinforcement learning agent to adapt the NVM Middleware quickly to changing workloads. The agent takes into account the characteristics and service level agreements and learns from past experiences to scale resources accordingly.

#### 3.1 Motivation

In this section, we discuss the pain points of controlling the number of threads to optimize Optane PMM within a serverless storage service and explain the design goals of the NVM Middleware.

#### 3.1.1 Concurrency Control Challenges in a serverless storage service

When building an Optane PMM based serverless storage service, optimizing the memory's performance is just the start. Early works in serverless computing have identified several tasks that a storage service must perform efficiently to meet the demands of serverless computing [2,11,12,28,35,36]. As a result, service providers must ensure that their concurrency control policies do not interfere with these design goals. In this work, we focus on three challenges faced by service providers when designing a high-performance storage service based on Optane PMM.

**Support for a wide heterogeneity of applications.** In serverless computing, users typically deploy their applications as a collection of serverless functions that share data among them using remote storage. Previous studies suggest that these applications vary considerably in the way they store, distribute, and process data. This diversity is reflected in multiple ways, such as data access size [11, 12], data access patterns [11], and performance requirements [180275,jonas2019cloud]. Therefore, service providers face the challenge of tuning the concurrency level to support many types of applications. In this work, we argue that considering the workload characteristics is key to tuning the system efficiently, because the appropriate allocation of resources varies with the workload type.

**Compliance with Service Level Agreements.** The success of a storage service relies on its ability to comply with various service level agreements (SLAs). SLAs play a critical role in governing the relationship between the storage provider and its customers. They help establish clear expectations between both parties regarding the quality of storage service. Therefore, service providers face the challenge of staying in compliance with these SLAs while they seek to optimize Optane PMM.

**Automatic and transparent scaling.** Serverless workloads are extremely unpredictable. These workloads can launch hundreds of functions instantaneously to meet application demands [35]. Furthermore, the data access patterns of the applications can change dramatically over time [11,36]. Service providers face the challenge of scaling the resources, such as the number of threads, automatically to meet the demands of changing workloads. In addition, they must ensure that the system adapts quickly enough to avoid missing SLAs.

#### 3.1.2 NVM Middleware Design Overview

We design NVM Middleware with three main design goals.

**Workload-aware Contention Management.** We focus our work on two main types of workloads: interactive and batch applications. Interactive applications, such as web-based platforms, enable real-time interactions between the user and the application. Low latency is critical to ensure that the user input is processed quickly, and feedback is delivered in real-time. On the other hand, batch applications, such as data analytics jobs, facilitate efficient processing of large-scale data. These workloads prioritize high throughput to process large volumes of data efficiently.

The NVM Middleware leverages insights about the workload characteristics, resource demands, and performance requirements of applications to make informed decisions about resource allocation and contention resolution. By dynamically adjusting resource allocation and contention resolution mechanisms based on the workload characteristics, the NVM Middleware mitigates contention-induced performance degradation and ensures efficient resource sharing among co-located applications. This adaptive approach enables the NVM Middleware to allocate resources judiciously, maximizing overall system efficiency while meeting the diverse performance requirements of both interactive and batch applications. By using the workload-aware contention management offered by the NVM Middleware, a storage system built on Optane PMM can balance the needs of different workload types and sustain good performance and resource utilization in multi-tenant environments.

**SLA-driven autoscaling policies.** The NVM Middleware leverages SLAs, which define the quality-of-service parameters agreed upon between the service provider and its customers, to dynamically adjust contention resolution mechanisms in response to changes in SLA metrics. It continuously monitors SLA metrics, such as 99th percentile latency and throughput, and evaluates its own performance against predefined SLA targets. This real-time monitoring allows the NVM Middleware to detect deviations from SLA requirements and to trigger scaling actions that adjust resource allocation. By aligning resource provisioning with SLA requirements, the NVM Middleware can ensure consistent and reliable performance from Optane PMM, even under dynamic workload changes.

**RL-driven autoscaling policies.** Besides leveraging SLAs to dynamically provision resources and adjust contention resolution mechanisms, our solution proposes the use of Reinforcement Learning to learn from past experiences and predict future behaviors. These RL-driven policies enable the NVM Middleware to adapt to changing workload patterns over time and meet SLA objectives more effectively than traditional threshold-based approaches []. Moreover, given the dynamic and unpredictable nature of serverless workloads, we propose a model-free algorithm, Q-Learning, to continuously learn the optimal policy based on observed experiences, allowing the NVM Middleware to adapt to new scenarios without needing to explicitly model them.

#### 3.2 Architecture

Figure 3.1 provides an overview of the NVM Middleware architecture. Positioned as a middle layer connecting user applications with Optane PMM, its design is tailored for seamless integration within a storage service, serving as an optimization layer specifically targeting Optane PMM. It comprises a request handler, two concurrency thread pools, and a monitoring and resource management module.

Figure 3.1: NVM Middleware Architecture

The request handler serves as the primary interface for handling user I/O requests. Upon receipt, it segregates requests into two distinct non-blocking First-In-First-Out (FIFO) queues: one tailored for latency-sensitive requests and the other for throughput-centric ones. Leveraging insights into workload characteristics, the handler allocates each request to the appropriate queue. Moreover, each queue is assigned a dedicated pool of worker threads tasked with dispatching I/O requests to Optane PMM using PMEMKV. Notably, these thread pools operate independently and are dynamically managed and scaled by the Reinforcement Learning agent to meet predetermined latency and throughput goals.
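The routing step performed by the request handler can be sketched as follows. This is a simplified illustration rather than the middleware's implementation: the class and method names are hypothetical, and Python's `queue.Queue` uses internal locks, so it only stands in for the non-blocking concurrent queues described above.

```python
import queue

INTERACTIVE, BATCH = "interactive", "batch"

class RequestHandler:
    """Hypothetical stand-in that routes requests into one FIFO queue per mode."""

    def __init__(self):
        self.queues = {INTERACTIVE: queue.Queue(), BATCH: queue.Queue()}

    def submit(self, op, key, value=None, mode=BATCH):
        """Enqueue a request on the queue selected by its mode hint."""
        if mode not in self.queues:
            raise ValueError(f"unknown mode: {mode}")
        self.queues[mode].put_nowait((op, key, value))

    def next_request(self, mode):
        """Return the oldest pending request for a mode, or None if the queue is empty.
        A worker thread of the matching pool would call this in a loop."""
        try:
            return self.queues[mode].get_nowait()
        except queue.Empty:
            return None

handler = RequestHandler()
handler.submit("put", "user:42", "profile-bytes", mode=INTERACTIVE)
handler.submit("get", "log:2024", mode=BATCH)
print(handler.next_request(INTERACTIVE))  # served by the interactive pool
print(handler.next_request(BATCH))        # served by the batch pool
```

Keeping the two queues separate means a burst of batch requests cannot delay interactive requests at the queueing stage; the two classes then compete only at the memory device, where the thread counts of the two pools bound the contention.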

The Monitoring and Resource Management module offers an interface to monitor system load and SLA performance metrics. This module starts a separate control thread tasked with gathering data on key parameters within the NVM Middleware, such as 99th percentile latency, throughput, and system load. Using this information, the RL agent makes data-driven decisions regarding thread pool scaling. These decisions are then communicated to the Monitoring and Resource Management module, which executes the required actions within the NVM Middleware.

Table 3.1: Programming Interface

| Category | API Name                                    | Functionality                                                                                                                                                     |
|----------|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| System   | start(db, interactiveThreads, batchThreads) | Creates the PMEMKV database, starts the interactive and batch thread pools, and initiates system monitoring in the Monitoring and Resource Management module. |
| System   | stop()                                      | Closes the PMEMKV database, stops the thread pools, and stops system monitoring.                                                                                  |
| System   | get(key, mode)                              | Retrieves the value for key from persistent memory.                                                                                                               |
| System   | put(key, value, mode)                       | Writes the key-value pair to persistent memory.                                                                                                                   |
| RL       | get_stats()                                 | Provides the 99th percentile latency and throughput observed by the NVM Middleware.                                                                               |
| RL       | get_state()                                 | Provides the current state within the NVM Middleware.                                                                                                             |
| RL       | perform_action(action)                      | Triggers a scaling action.                                                                                                                                        |

## 3.3 Programming Interface

Table 3.1 outlines the NVM Middleware's programming interface, presenting a set of functions designed to facilitate interaction with a storage system and the reinforcement learning agent.

The *start* function initializes the PMEMKV database, initializes the thread pools with an initial thread count, and triggers the system monitoring within the Monitoring and Resource Management Module. In contrast, the *stop* function terminates the database connection, halts all threads in the thread pools, and stops system monitoring. Furthermore, the *get* and *put* functions facilitate key-value interactions with the persistent memory, allowing for read and write operations. These functions are designed to accommodate hints regarding the request type (e.g., latency-sensitive or throughput-oriented), aiding the request handler in queue allocation.

The get\_stats function reports the 99th percentile latency and throughput observed by the NVM Middleware at any given moment. Similarly, the get\_state function provides real-time state information as outlined in Table 3.2. Finally, the perform\_action function accepts the scaling actions detailed in Table 3.3 and initiates the corresponding action within the NVM Middleware.

## 3.4 Reinforcement Learning Component

In this section, we discuss the Q-learning algorithm used by the Reinforcement Learning agent to dynamically adjust the number of threads assigned to each thread pool. The agent's goal is to find the best combination of threads that meets predetermined latency and throughput SLAs while minimizing contention and ensuring efficient utilization of Intel Optane PMM.

#### 3.4.1 Integration with the NVM Middleware



Figure 3.2: RL Workflow

Figure 3.2 offers a visual representation of the interaction between the reinforcement learning (RL) agent and the NVM Middleware. At each time step, the NVM Middleware receives a diverse influx of requests, spanning both latency-sensitive and throughput-oriented tasks. These requests necessitate translation into actionable I/O commands directed towards the Intel Optane Persistent Memory Module (PMM).

Concurrently, the RL agent observes the environment's current state, which combines the workload characteristics and performance metrics reported by the monitoring module. Using this information, the agent selects an action that adjusts the number of threads in the interactive and batch thread pools, for example increasing the number of interactive threads to absorb a growing interactive workload.

Following action selection, the NVM Middleware's resource management module implements the chosen course of action, fine-tuning the NVM Middleware's interactive and batch threads to efficiently handle incoming user requests. Upon the completion of each time step, the action's effectiveness is rigorously assessed against predefined service level agreement (SLA) targets, yielding a reward signal generated by a reward manager.

This reward serves as invaluable feedback for the RL agent, empowering iterative policy updates aimed at refining decision-making strategies in subsequent time steps. Thus, the presented framework embodies a recursive learning cycle, wherein the RL agent continuously hones its behavior through real-world interactions, ensuring adaptive responsiveness to evolving workload dynamics.

# 3.4.2 Reinforcement Learning Model

### **State Space**

Table 3.2: The State Representation

| Name                  | Description                                                                         | Values                                         |
|-----------------------|-------------------------------------------------------------------------------------|------------------------------------------------|
| interactiveThreads    | Number of (interactive) threads assigned to the interactive thread pool.            | $1 \le \text{interactiveThreads} \le 32$       |
| batchThreads          | Number of (batch) threads assigned to the batch thread pool.                        | $1 \le \text{batchThreads} \le 32$             |
| interactiveQueueSize  | Number of requests in the interactive queue.                                        | $\text{interactiveQueueSize} \in \mathbb{R}^+$ |
| batchQueueSize        | Number of requests in the batch queue.                                              | $\text{batchQueueSize} \in \mathbb{R}^+$       |
| interactiveBlockSize  | Average block size of interactive workload.                                         | $\text{interactiveBlockSize} \in \mathbb{R}^+$ |
| batchBlockSize        | Average block size of batch workload.                                               | $\text{batchBlockSize} \in \mathbb{R}^+$       |
| interactiveWriteRatio | Proportion of write requests compared to read requests in the interactive workload. | $\text{interactiveRWRatio} \in \mathbb{R}^+$   |
| batchWriteRatio       | Proportion of write requests compared to read requests in the batch workload.       | $\text{batchRWRatio} \in \mathbb{R}^+$         |

Table 3.2 presents the features of our state representation. At each time step t, we define the state  $s_t$  as a tuple:

$$s_t = (\text{interactiveThreads}_t, \text{batchThreads}_t, \text{interactiveQueueSize}_t, \text{batchQueueSize}_t, \text{interactiveBlockSize}_t, \text{batchBlockSize}_t, \text{interactiveRWRatio}_t, \text{batchRWRatio}_t)$$

where  $s_t \in S$  and $S$ denotes the state space. The tuple encapsulates the key features characterizing the system's current state, including the number of interactive and batch threads, the number of pending requests in the queues, the average block size of each workload, and the write ratio of both the interactive and batch workloads.
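One possible in-memory representation of this tuple is sketched below. The Python type and field names are hypothetical (they mirror the features in Table 3.2 in snake case), and the example values are invented for illustration.

```python
from typing import NamedTuple

class State(NamedTuple):
    """Hypothetical container mirroring the eight state features."""
    interactive_threads: int
    batch_threads: int
    interactive_queue_size: float
    batch_queue_size: float
    interactive_block_size: float
    batch_block_size: float
    interactive_rw_ratio: float
    batch_rw_ratio: float

    def as_features(self):
        """Flatten the state into a list of floats for a regression model."""
        return [float(x) for x in self]

# Invented example observation for a single time step.
example = State(4, 8, 120.0, 300.0, 4096.0, 65536.0, 0.3, 0.9)
print(example.as_features())
```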

Table 3.3: Possible Actions in the Action Space

| Action | Effect on Interactive Threads | Effect on Batch Threads |
|--------|-------------------------------|-------------------------|
| 0      | No change                     | No change               |
| 1      | Increase by 1                 | No change               |
| 2      | Decrease by 1                 | No change               |
| 3      | No change                     | Increase by 1           |
| 4      | No change                     | Decrease by 1           |
| 5      | Increase by 1                 | Increase by 1           |
| 6      | Increase by 1                 | Decrease by 1           |
| 7      | Decrease by 1                 | Increase by 1           |
| 8      | Decrease by 1                 | Decrease by 1           |

### **Action Space**

Table 3.3 illustrates the feasible actions within the action space. Each action corresponds to a potential adjustment in the number of interactive and batch threads. The table enumerates nine distinct actions, each with its respective effect on the number of interactive threads and batch threads.

Mathematically, the set of actions A is defined as  $A = \{0, 1, 2, 3, 4, 5, 6, 7, 8\}$  for a given state  $s_t \in S$ .
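The mapping in Table 3.3 can be encoded as a lookup from action index to a pair of thread-count deltas, as in the sketch below. Clamping the result to the range 1 to 32 from Table 3.2 is an assumption made here; the text does not state how actions that would leave this range are handled.

```python
# Action index -> (change in interactive threads, change in batch threads), per Table 3.3.
ACTION_DELTAS = {
    0: (0, 0), 1: (1, 0), 2: (-1, 0),
    3: (0, 1), 4: (0, -1), 5: (1, 1),
    6: (1, -1), 7: (-1, 1), 8: (-1, -1),
}
MIN_THREADS, MAX_THREADS = 1, 32  # bounds taken from Table 3.2

def apply_action(interactive_threads, batch_threads, action):
    """Return the new (interactive, batch) thread counts after an action.
    Assumption: out-of-range results are clamped to the valid range."""
    d_interactive, d_batch = ACTION_DELTAS[action]

    def clamp(n):
        return max(MIN_THREADS, min(MAX_THREADS, n))

    return clamp(interactive_threads + d_interactive), clamp(batch_threads + d_batch)

print(apply_action(4, 8, 6))   # (5, 7): one more interactive thread, one fewer batch thread
print(apply_action(1, 32, 7))  # (1, 32): both counts are already at their bounds
```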

### Reward

To guide the optimization process of the reinforcement learning agent, we establish an algorithm (Algorithm 1) to calculate a reward value based on observed and target latency and throughput metrics. This algorithm, outlined below, serves as a crucial component in training the RL agent to make informed decisions.

1. Lines 1-5 define goals, scaling factors, and penalties. The observed and target latency (lat, lat\_goal) and throughput (tp, tp\_goal) metrics are scaled to a normalized range using scaling factors (max\_scale\_lat, max\_scale\_tp) and a minimum scale (min\_scale). This normalization ensures that both metrics contribute proportionally to the reward calculation.
2. Lines 6-7 compare the scaled latency (lat) and throughput (tp) metrics against the scaled target values for latency (lat\_goal) and throughput (tp\_goal). The absolute differences between observed and target values are computed to quantify the error in latency (error\_lat) and throughput (error\_tp).
3. Lines 8-12 determine the reward based on four distinct scenarios. First, if both the latency and throughput goals are achieved, a high positive reward is assigned. Second, if neither goal is met, a negative reward is assigned that takes into account both the latency and throughput errors. The separate penalties, lat\_penalty and tp\_penalty, ensure that both types of errors contribute proportionately to the overall reward. Third, if only the latency goal is unmet, a negative reward is assigned that incorporates the latency penalty and error. Finally, if only the throughput goal is unmet, a similar negative reward is assigned that incorporates the throughput penalty and error.
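The three steps above can be sketched in Python as follows. This is an interpretation of the description, not a transcription of Algorithm 1: the linear scaling formula, the default penalty values, the value of the positive reward, and the way errors and penalties are combined are all assumptions.

```python
def scale(value, max_value, min_scale=0.0, max_scale=1.0):
    """Assumed normalization: map [0, max_value] linearly onto [min_scale, max_scale]."""
    ratio = min(max(value / max_value, 0.0), 1.0)
    return min_scale + ratio * (max_scale - min_scale)

def compute_reward(lat, tp, lat_goal, tp_goal, max_scale_lat, max_scale_tp,
                   min_scale=0.0, lat_penalty=2.0, tp_penalty=1.0, success_reward=1.0):
    """Reward from observed vs. target tail latency and throughput.
    Penalty and reward constants are illustrative placeholders."""
    # Step 1: normalize observed and target metrics.
    lat_s = scale(lat, max_scale_lat, min_scale)
    lat_goal_s = scale(lat_goal, max_scale_lat, min_scale)
    tp_s = scale(tp, max_scale_tp, min_scale)
    tp_goal_s = scale(tp_goal, max_scale_tp, min_scale)

    # Step 2: absolute errors between observed and target values.
    error_lat = abs(lat_s - lat_goal_s)
    error_tp = abs(tp_s - tp_goal_s)

    # Step 3: four scenarios. Latency must stay at or below its goal,
    # throughput must stay at or above its goal.
    lat_met = lat_s <= lat_goal_s
    tp_met = tp_s >= tp_goal_s
    if lat_met and tp_met:
        return success_reward
    if not lat_met and not tp_met:
        return -(lat_penalty * error_lat + tp_penalty * error_tp)
    if not lat_met:
        return -lat_penalty * error_lat
    return -tp_penalty * error_tp

# Latency goal missed (200 vs. 100), throughput goal met (50 vs. 40).
print(compute_reward(lat=200, tp=50, lat_goal=100, tp_goal=40,
                     max_scale_lat=1000, max_scale_tp=100))
```

A larger latency penalty than throughput penalty, as in the placeholder defaults, would bias the agent towards protecting the interactive workload; the actual values are a design decision that the sketch does not attempt to reproduce.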

## 3.4.3 Training Methodology

#### **Environment Design**

The environment architecture designed for training and evaluating the RL agent is depicted in Figure 3.3. This architecture comprises several key components, including an interactive multi-threaded application, a batch multi-threaded application, the NVM Middleware, and Intel Optane PMM.

To simulate a multi-tenant serverless scenario, both applications are executed concurrently. Workload patterns for each application are derived from collected serverless traces. To emulate high concurrency levels typical in serverless environments, multiple threads within each application are employed to dispatch requests to the NVM Middleware via the API described in Section 3.3. Meanwhile, the NVM Middleware processes these requests in accordance with the workflow outlined in Section 3.2.



Figure 3.3: Overview of the Environment Architecture

To model the time steps inherent in an RL process, the environment organizes the applications' requests into 1-second windows, processing one window per time step. Figure 3.4 illustrates the interactions between the RL agent and the environment at each time step. Beginning with a state observation from the preceding step, the agent communicates the intended action to the environment. Subsequently, the environment relays this action to the NVM Middleware, which then allocates resources accordingly. Upon successful execution of the action, the environment initiates processing for the next window of requests. Once all requests within the window are handled, the environment gathers metrics from the NVM Middleware and furnishes a new state observation along with a reward signal to the agent. The agent utilizes this reward to update its policy, perpetuating the iterative learning process.

Figure 3.4: Agent Process flow
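The windowing of requests into time steps can be sketched as below. The trace format, a list of (timestamp, payload) pairs with timestamps in seconds relative to the start of the trace, is an assumption made for illustration.

```python
from collections import defaultdict

def group_into_windows(requests, window_seconds=1.0):
    """Bucket (timestamp, payload) pairs into consecutive fixed-length windows.
    Windows without requests are kept as empty lists so that each list index
    still corresponds to one time step."""
    buckets = defaultdict(list)
    for timestamp, payload in requests:
        buckets[int(timestamp // window_seconds)].append(payload)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]

# Hypothetical trace: two requests in the first second, one in the second, one in the fourth.
trace = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
for step, window in enumerate(group_into_windows(trace)):
    print(step, window)
```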

## **Function Approximation**

Because the state space in our environment is continuous, tabular Q-learning is impractical: the number of states is too large to map into a Q-table. Consequently, we employ function approximation to estimate the value of each action from the state.

Specifically, we train nine separate linear regression models, each corresponding to one of the available actions, using stochastic gradient descent. This approach allows us to capture the underlying patterns in the data and generalize across states not encountered during training, enabling our agent to make informed decisions even in novel situations.
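The arrangement of one linear model per action can be sketched without external dependencies as follows. The class below is a simplified stand-in written for illustration, not the model used in this work; its learning rate and zero initialization are assumptions.

```python
class LinearQModel:
    """Linear estimate of Q(s, a) for a single action: bias + weights . features."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def partial_fit(self, features, target):
        """One stochastic gradient step on 0.5 * (target - prediction)^2."""
        error = target - self.predict(features)
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias += self.learning_rate * error

N_ACTIONS = 9   # one model per action in Table 3.3
N_FEATURES = 8  # one input per state feature in Table 3.2
models = [LinearQModel(N_FEATURES) for _ in range(N_ACTIONS)]

def q_values(features):
    """Estimated value of every action in the given state."""
    return [model.predict(features) for model in models]

print(q_values([0.0] * N_FEATURES))  # all zeros before any training
```

Keeping a separate model per action means that an update for one action never changes the estimates of the other eight, at the cost of not sharing what is learned across actions.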

However, selecting appropriate hyperparameters for our regression models presents a significant challenge. Online training alone is insufficient for accurately assessing model performance, as it can be time-consuming and computationally intensive. To overcome this limitation, we adopt a batch learning approach with offline historical data.

By leveraging historical data collected from the environment, we can tune our models' hyperparameters and incorporate prior knowledge into our RL agent. This approach accelerates the learning process by bootstrapping our models with valuable insights gained from past experiences [37,38].

To construct our dataset, we deploy a non-optimal agent that performs random actions in the environment, capturing state-action-reward transitions. Following established machine learning practices, we split the dataset into training and testing sets and employ 5-fold cross-validation on the training set to evaluate model performance rigorously.

Additionally, we preprocess the features by standardizing them using the standard scaler and apply polynomial preprocessing to enhance the model's ability to capture nonlinear relationships within the data.
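The preprocessing and validation steps can be illustrated with a dependency-free sketch. The polynomial degree of 2 and the contiguous, unshuffled fold layout are assumptions; the helper names are hypothetical.

```python
import statistics
from itertools import combinations_with_replacement

def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std > 0 else 0.0 for x in column]

def polynomial_features(row, degree=2):
    """Bias term followed by every monomial of the inputs up to `degree`."""
    features = [1.0]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            product = 1.0
            for value in combo:
                product *= value
            features.append(product)
    return features

def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

print(standardize([1.0, 2.0, 3.0]))
print(polynomial_features([2.0, 3.0]))  # [1, x1, x2, x1^2, x1*x2, x2^2]
print([val for _, val in k_fold_indices(10, k=5)])
```

Standardizing before the polynomial expansion keeps the squared and cross terms on a comparable scale, which matters for gradient-based training because features with very different magnitudes lead to poorly conditioned updates.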

### Proposed Q-Learning Algorithm

Algorithm 2 outlines the Q-Learning process for training an agent to make optimal decisions in an environment. It takes the bootstrapped Q-value models $M_a$ for all actions $a$ as input and outputs the updated models after training.

The algorithm initializes the training parameters and then iterates over a specified number of episodes. Within each episode, the environment is reset, and the agent interacts with it until the episode is complete. At each step, the agent observes the current state $s_t$, selects an action $a_t$ based on an $\epsilon$-greedy policy, takes the action, and observes the resulting reward $r$ and next state $s_{t+1}$.

The Q-value models are updated based on the observed reward and next state. If the episode is not done, the target Q-value is calculated using the reward and the maximum Q-value for the next state. If the episode is done, the target Q-value is simply set to the reward.

The model for the selected action  $a_t$  is updated using the target Q-value, and the state is updated to the next state. Additionally, the exploration rate  $\epsilon$  is decreased according to an exploration schedule.
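The two decisions made at every step, which action to take and which target to regress toward, can be sketched in plain Python as below. The function names are hypothetical, and `decay_epsilon` is only one possible exploration schedule (a multiplicative decay with a floor); the thesis does not specify which schedule it uses.

```python
import random


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # uniformly random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # argmax


def td_target(reward, next_q_values, gamma, done):
    """Regression target for the model of the action that was taken."""
    if done:
        return reward  # terminal step: the target is the reward alone
    return reward + gamma * max(next_q_values)  # bootstrap from the next state


def decay_epsilon(epsilon, rate=0.99, floor=0.05):
    """Hypothetical schedule: shrink epsilon each episode, never below floor."""
    return max(floor, epsilon * rate)


rng = random.Random(0)
q_next = [0.5, 2.0, -1.0]
print(select_action(q_next, epsilon=0.0, rng=rng))     # 1
print(td_target(1.0, q_next, gamma=0.5, done=False))   # 2.0
print(td_target(1.0, q_next, gamma=0.5, done=True))    # 1.0
```

With `epsilon=0.0` the policy is purely greedy, so the first call returns the index of the largest Q-value; the two target calls show the non-terminal and terminal cases described in the text.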

# 3.5 Implementation

The NVM Middleware, detailed in Section 3.3, is implemented using C++. We leverage PMEMKV from the Persistent Memory Development Kit [6] to facilitate reading and writing data into Intel Optane PMM. To manage concurrent operations efficiently, we utilize the non-locking, concurrent queue provided by the Intel Threading Building Blocks [39] library for both the interactive and batch queues.

For the RL Environment, as described in Section 3.4.3, we adopt a hybrid approach employing C++ and Python. The environment itself is constructed in C++, aligning with the specifications outlined in Section 3.4.3. Conversely, the RL agent and the Q-Learning algorithm, also discussed in the same section, are developed using Python. We leverage the SGDRegressor model from the Scikit-learn [40] library to represent our linear regression models for function approximation. Additionally, we employ Scikit-learn for hyperparameter tuning. To seamlessly integrate the C++ and Python components, we utilize pybind11 [41].

```
Algorithm 1: Reward Calculation Algorithm
Input: System statistics: stat
Output: Reward value: reward
// Initialize variables
max_scale_lat \leftarrow 1000, max_scale_tp \leftarrow 10, min_scale \leftarrow 1, lat_goal \leftarrow 200,
    tp_goal \leftarrow 250000, lat_penalty \leftarrow 500.0, tp_penalty \leftarrow 5000.0;
// Scale observed and target latency and throughput
lat \leftarrow ((max_scale_lat - min_scale) \times (stat.tailLatency - min_value) / (max_latency - min_value)) + min_scale;
tp \leftarrow ((max_scale_tp - min_scale) \times (stat.throughput - min_value) / (max_throughput - min_value)) + min_scale;
lat_goal \leftarrow ((max_scale_lat - min_scale) \times (lat_goal - min_value) / (max_latency - min_value)) + min_scale;
tp_goal \leftarrow ((max_scale_tp - min_scale) \times (tp_goal - min_value) / (max_throughput - min_value)) + min_scale;
// Calculate errors
error_lat \leftarrow |lat - lat_goal|;
error_tp \leftarrow |tp - tp_goal|;
// Calculate reward
if lat \leq lat_goal and tp \geq tp_goal then
    reward \leftarrow 10 \times (error_lat + error_tp);  // High reward for meeting both latency and throughput goals
else
    if lat > lat_goal and tp < tp_goal then
        reward \leftarrow -1 \times (lat_penalty \times error_lat + tp_penalty \times error_tp);  // Penalize for high latency and low throughput
    else
        if lat > lat_goal then
            reward \leftarrow -1 \times lat_penalty \times error_lat;  // Penalize for high latency
        else
            reward \leftarrow -1 \times tp_penalty \times error_tp;  // Penalize for low throughput
        end
    end
end
```
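For reference, Algorithm 1 can be transcribed into Python roughly as below. The sketch assumes that `min_value`, `max_latency`, and `max_throughput` are supplied by the caller, since the listing does not assign them; the function names and the bounds used in the usage lines are hypothetical.

```python
def rescale(value, min_value, max_value, min_scale, max_scale):
    """Linearly map value from [min_value, max_value] to [min_scale, max_scale]."""
    return (max_scale - min_scale) * (value - min_value) / (max_value - min_value) + min_scale


def compute_reward(tail_latency, throughput, *, min_value, max_latency, max_throughput,
                   lat_goal=200.0, tp_goal=250_000.0,
                   max_scale_lat=1000.0, max_scale_tp=10.0, min_scale=1.0,
                   lat_penalty=500.0, tp_penalty=5000.0):
    """Reward following Algorithm 1; the three bounds are assumed inputs."""
    # Scale observed and target latency and throughput onto common ranges.
    lat = rescale(tail_latency, min_value, max_latency, min_scale, max_scale_lat)
    tp = rescale(throughput, min_value, max_throughput, min_scale, max_scale_tp)
    lat_g = rescale(lat_goal, min_value, max_latency, min_scale, max_scale_lat)
    tp_g = rescale(tp_goal, min_value, max_throughput, min_scale, max_scale_tp)

    # Distance of each scaled metric from its scaled goal.
    error_lat = abs(lat - lat_g)
    error_tp = abs(tp - tp_g)

    if lat <= lat_g and tp >= tp_g:
        return 10.0 * (error_lat + error_tp)       # both goals met
    if lat > lat_g and tp < tp_g:
        return -(lat_penalty * error_lat + tp_penalty * error_tp)  # both missed
    if lat > lat_g:
        return -lat_penalty * error_lat            # latency goal missed
    return -tp_penalty * error_tp                  # throughput goal missed


# Hypothetical bounds, chosen only to exercise each branch.
bounds = dict(min_value=0.0, max_latency=1000.0, max_throughput=500_000.0)
print(compute_reward(100.0, 300_000.0, **bounds))
print(compute_reward(400.0, 200_000.0, **bounds))
```

Because the penalty weights are much larger than the factor applied when both goals are met, a missed goal dominates the signal the agent receives.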

```
Algorithm 2: Q-Learning Algorithm
Input: Pre-trained Q-value models M_a for all actions a
Output: Learned Q-value models M_a for all actions a
Initialize the training parameters \alpha, \gamma, \epsilon;
for episode \leftarrow 1 to E do
    Reset the environment;
    repeat
        Observe the state s_t;
        // Choose action a_t using the \epsilon-greedy policy
        Generate random number u from uniform distribution in [0, 1];
        if u < \epsilon then
            Select a random action a_t from the action space;
        else
            for each action a do
                Predict Q-value Q_a(s_t) using model M_a: Q_a(s_t) \leftarrow M_a.predict(s_t);
            end
            Select action a_t \leftarrow \arg\max_a Q_a(s_t);
        end
        Take action a_t, observe reward r and next state s_{t+1};
        // Update the Q-value model using reward and next state
        if not done then
            for each action a do
                Predict Q-value Q_a(s_{t+1}) using model M_a: Q_a(s_{t+1}) \leftarrow M_a.predict(s_{t+1});
            end
            Calculate target Q-value: target \leftarrow r + \gamma \cdot \max_a Q_a(s_{t+1});
        else
            Set target Q-value to the reward: target \leftarrow r;
        end
        Update the model for action a_t with the target Q-value: M_{a_t}.partial_fit(s_t, target);
        Update state: s_t \leftarrow s_{t+1};
    until episode is done;
    Decrease \epsilon according to exploration schedule;
end
```

# Chapter 4: Evaluation

In this chapter, we evaluate the NVM Middleware and the Q-Learning model.

# 4.1 Experimental Setup

### 4.1.1 Platform

Table 4.1: Experimental Platform Specifications

| Processor         | Intel® Xeon® Gold 6252            |
|-------------------|-----------------------------------|
| Sockets           | 2                                 |
| Cores per socket  | 24                                |
| Threads per core  | 2                                 |
| NUMA nodes        | 2                                 |
| CPU Frequency     | 2.7 GHz (3.7 GHz Turbo frequency) |
| L1d cache         | 1.5 MiB                           |
| L1i cache         | 1.5 MiB                           |
| L2 Cache          | 48 MiB                            |
| L3 Cache          | 71.5 MiB                          |
| DRAM              | 16 GB DDR4 DIMM x 6 per socket    |
| Persistent Memory | 128 GB Optane PMM x 6 per socket  |
| Operating System  | Ubuntu 20.04.4 LTS (Focal Fossa)  |

The experimental platform utilized in this study is detailed in Table 4.1. It features an Intel® Xeon® Gold 6252 processor with 2 sockets, each hosting 24 cores with 2 threads per core, for a total of 2 NUMA nodes. Each socket is equipped with three memory channels, housing 16 GB DDR4 DIMMs and 128 GB Optane PMMs. In aggregate, the system comprises 192 GB of DRAM and 1.5 TB of Optane persistent memory. To mitigate the NUMA effect, one socket is designated for running the NVM Middleware threads, while the other handles the interactive and batch applications, as described in Section 3.4.3.

### 4.1.2 Optane DC PMem Configuration

As outlined earlier, this thesis concentrates on exploring the persistent capabilities of Optane DC PMem. Consequently, Optane DC PMem is employed in App Direct Mode throughout our experiments. To facilitate the utilization of persistent memory, we expose it via an XFS file system configured in DAX mode, thereby bypassing the page cache. Additionally, we enhance memory management and performance by configuring the persistent memory with huge pages (2 MiB) [8]. Lastly, we deploy a PMEMKV database with a capacity of 600 GB, configured with its persistent concurrent engine.

### 4.1.3 FaaS Workload Traces

#### **Interactive Traces**

We collect the traces from [42].

#### **Batch Traces**

We collect the traces from [43].

# Chapter 5: Related Work

# Chapter 6: Conclusions and Future Work

# Bibliography

- [1] Persistent Memory Documentation. https://docs.pmem.io/persistent-memory/getting-started-guide/introduction. 2024.
- [2] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, et al. Cloud programming simplified: A Berkeley view on serverless computing. arXiv preprint arXiv:1902.03383, 2019.
- [3] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
- [4] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve Swanson. An empirical guide to the behavior and use of scalable persistent memory. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 169–182, 2020.
- [5] Kai Wu, Ivy Peng, Jie Ren, and Dong Li. Ribbon: High performance cache line flushing for persistent memory. In *Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques*, pages 427–439, 2020.
- [6] Steve Scargall. Programming Persistent Memory: A Comprehensive Guide for Developers. Apress, USA, 1st edition, 2020.
- [7] Andy Rudoff. Persistent memory programming. Login: The Usenix Magazine, 42(2):34–40, 2017.
- [8] Intel. Speeding Up I/O Workloads with Intel Optane Persistent Memory. https://www.intel.com/content/www/us/en/developer/articles/technical/speeding-up-io-workloads-with-intel-optane-dc-persistent-memory-modules.html. 2024.
- [9] Storage Networking Industry Association. NVM Programming Model. https://www.snia.org/sites/default/files/technical-work/npm/release/SNIA-NVM-Programming-Model-v1.2.pdf. 2024.
- [10] Intel. pmem/pmemkv: Key/Value Datastore for Persistent Memory. https://github.com/pmem/pmemkv. 2024.
- [11] Francisco Romero, Gohar Irfan Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, Neeraja J. Yadwadkar, Rodrigo Fonseca, Christos Kozyrakis, and Ricardo Bianchini. Faa\$T: A Transparent Auto-Scaling Cache for Serverless Applications, 2021.

- [12] Ana Klimovic, Yawen Wang, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, and Christos Kozyrakis. Pocket: Elastic ephemeral storage for serverless analytics. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 427–444, 2018.
- [13] Amazon. AWS Lambda. https://aws.amazon.com/lambda/. 2024.
- [14] Microsoft. Azure Functions. https://azure.microsoft.com/en-us/products/functions/. 2024.
- [15] Google. Cloud Functions. https://cloud.google.com/functions. 2024.
- [16] Apache Software Foundation. Apache OpenWhisk. https://openwhisk.apache.org/. 2024.
- [17] Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX annual technical conference (USENIX ATC 20), pages 205–218, 2020.
- [18] Amazon. AWS S3. https://aws.amazon.com/s3/. 2024.
- [19] Amazon. Amazon DynamoDB. https://aws.amazon.com/pm/dynamodb/. 2024.
- [20] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In ACM Symposium on Operating System Principles, 2007.
- [21] Microsoft. Azure Blob Storage. https://azure.microsoft.com/en-us/products/storage/blobs. 2024.
- [22] Google. Cloud Storage. https://cloud.google.com/storage. 2024.
- [23] Google. Datastore. https://cloud.google.com/datastore. 2024.
- [24] Microsoft. Azure Cosmos DB. https://azure.microsoft.com/en-us/products/cosmos-db. 2024.
- [25] Jingyuan Zhang, Ao Wang, Xiaolong Ma, Benjamin Carver, Nicholas John Newman, Ali Anwar, Lukas Rupprecht, Vasily Tarasov, Dimitrios Skourtis, Feng Yan, and Yue Cheng. Infinistore: Elastic serverless cloud storage. *Proc. VLDB Endow.*, 16(7):1629–1642, mar 2023.
- [26] Public Cloud Object-store Performance is Very Unequal across AWS S3, Google Cloud Storage, and Azure Blob Storage. https://dev.to/sachinkagarwal/public-cloud-object-store-performance-is-very-unequal-across-aws-s3-google-cloud-st 2024.
- [27] Amazon. Amazon ElastiCache. https://aws.amazon.com/pm/elasticache/. 2024.

- [28] David Shue, Michael J. Freedman, and Anees Shaikh. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 349–362, Hollywood, CA, October 2012. USENIX Association.
- [29] Ali Tariq, Austin Pahl, Sharat Nimmagadda, Eric Rozner, and Siddharth Lanka. Sequoia: Enabling quality-of-service in serverless computing. In *Proceedings of the 11th ACM symposium on cloud computing*, pages 311–327, 2020.
- [30] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [31] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge United Kingdom, 1989.
- [32] Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2020.
- [33] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012.
- [34] Intel. Intel Optane Persistent Memory. https://www.intel.com/content/www/us/en/products/docs/memory-storage/optane-persistent-memory/overview.html. 2024.
- [35] Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Jonas Pfefferle, and Animesh Trivedi. Understanding ephemeral storage for serverless analytics. In 2018 USENIX annual technical conference (USENIX ATC 18), pages 789–794, 2018.
- [36] Chenggang Wu, Vikram Sreekanti, and Joseph M Hellerstein. Autoscaling tiered cloud storage in Anna. *Proceedings of the VLDB Endowment*, 12(6):624–638, 2019.
- [37] Ignacio Cano, Srinivas Aiyar, Varun Arora, Manosiz Bhattacharyya, Akhilesh Chaganti, Chern Cheah, Brent Chun, Karan Gupta, Vinayak Khot, and Arvind Krishnamurthy. Curator: Self-Managing Storage for Enterprise Clusters. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 51–66, 2017.
- [38] Vukosi N Marivate. Improved Empirical Methods in Reinforcement Learning Evaluation. PhD thesis, Rutgers, New Brunswick, New Jersey, 2015.
- [39] Intel. Intel oneAPI Threading Building Blocks. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html#gs.4oc8fg.
- [40] scikit-learn: machine learning in Python. https://scikit-learn.org/stable/. 2024.
- [41] Wenzel Jakob. pybind11 documentation. https://pybind11.readthedocs.io/en/stable/index.html. (Accessed on 02/18/2024).
- [42] Microsoft. Azure/AzurePublicDataset: Microsoft Azure Traces. https://github.com/Azure/AzurePublicDataset. 2024.

- [43] Benjamin Carver, Jingyuan Zhang, Ao Wang, Ali Anwar, Panruo Wu, and Yue Cheng. Wukong: A scalable and locality-enhanced framework for serverless parallel computing. In *Proceedings of the 11th ACM symposium on cloud computing*, pages 1–15, 2020.

# Biography

Include your biography here detailing your background, education, and professional experience.