#### Bachelor's thesis

EFFICIENCY AND
UTILIZATION OF
VECTOR PACKET
PROCESSING IN
HIGH-SPEED
NETWORKS

## Ondřej Slavík

Faculty of Information Technology Department of Computer Systems Supervisor: Ing. Jan Fesl, Ph.D. April 21, 2025



## Assignment of bachelor's thesis

Title: Efficiency and utilization of Vector Packet Processing in high-

speed networks

Student:Ondřej SlavíkSupervisor:Ing. Jan Fesl, Ph.D.

Study program: Informatics

Branch / specialization: Computer Networks and Internet 2021

Department: Department of Computer Systems

Validity: until the end of summer semester 2025/2026

#### Instructions

Vector Packet Processing (VPP) je moderní softwarový framework, který umožňuje zpracování paketů ve vysokorychlostních sítích na úrovni uživatelského prostoru operačního systému. Významnou výhodou využití VPP by mělo být výrazné zvýšení propustnosti a snížení latence v rámci vysokorychlostní sítě. Zmíněné výhody VPP jsou primárně teoretické a zatím nebyly experimentálně dostatečně prokázány.

V rámci tvorby bakalářské práce postupujte dle níže uvedených kroků:

- 1) Nastudujte a popište detailně všechny principy, které VPP používá, jak je implementováno a jak lze VPP efektivně využívat.
- 2) Vytvořte testovací scénáře, které umožní srovnat efektivitu a cenu využití VPP oproti běžnému způsobu zpracování paketů na úrovni jádra operačního systému.
- 3) Po poradě s vedoucím práce realizujte infrastrukturu vhodnou pro reálné otestování
- 4) Na základě bodu 2) proveďte dostatečný počet měření (minimálně stovky) a srovnejte možný dosažitelný průtok, latenci a spotřebu el. energie s využitím resp. bez využití VPP.
- 5) Proveďte důkladný rozbor a diskuzi výsledků z předchozího kroku a explicitně uveďte nevýhody využití VPP, pokud nějaké budou.

Czech Technical University in Prague Faculty of Information Technology © 2025 Ondřej Slavík. All rights reserved.

This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author's permission is prohibited (with exceptions defined by the Copyright Act).

Citation of this thesis: Slavík Ondřej. *Efficiency and utilization of Vector Packet Processing in high-speed networks*. Bachelor's thesis. Czech Technical University in Prague, Faculty of Information Technology, 2025.

I would like to express my sincere gratitude to my thesis supervisor, Ing. Jan Fesl, Ph.D., for his guidance, support, and valuable insights throughout the entire process of writing this thesis.

My thanks also go to the Silicon Hill club of the CTU Student Union for providing an inspiring environment and the technical resources that supported my work. Finally, I would like to thank my family for their unwavering support, encouragement, and never-ending patience during my studies.

#### **Declaration**

I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.

I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In accordance with Section 2373(2) of Act No. 89/2012 Coll., the Civil Code, as amended, I hereby grant a non-exclusive authorization (licence) to utilize this thesis, including all computer programs that are part of it or attached to it and all documentation thereof (hereinafter collectively referred to as the "Work"), to any and all persons who wish to use the Work. Such persons are entitled to use the Work in any manner that does not diminish the value of the Work and for any purpose (including use for profit). This authorisation is unlimited in time, territory and quantity.

In Prague on April 21, 2025

#### Abstract

Fill in the abstract of this thesis in English. Lorem ipsum dolor sit amet. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos. Cras pede libero, dapibus nec, pretium sit amet, tempor quis. Sed vel lectus. Donec odio tempus molestie, porttitor ut, iaculis quis, sem. Suspendisse sagittis ultrices augue.

**Keywords** Vector Packet Processing, Network benchmark, Energy efficiency, Linux network stack, Data Plane Development Kit

#### **Abstrakt**

Fill in the abstract of this thesis in Czech. Lorem ipsum dolor sit amet. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos. Cras pede libero, dapibus nec, pretium sit amet, tempor quis. Sed vel lectus. Donec odio tempus molestie, porttitor ut, iaculis quis, sem. Suspendisse sagittis ultrices augue.

Klíčová slova enter, comma, separated, list, of, keywords, in, CZECH

## Contents

| In | trod | uction                                                       | 1  |
|----|------|--------------------------------------------------------------|----|
| 1  | The  | eoretical part                                               | 3  |
|    | 1.1  | "Vector Packet Processing (VPP) and Its Operating Principles | 3  |
|    |      | 1.1.1 Traditional network traffic processing                 | 3  |
|    |      | 1.1.2 An Introduction to VPP                                 | 4  |
|    |      | 1.1.3 Techniques used in VPP                                 | 5  |
|    |      | 1.1.4 VPP Processing Graph and Graph nodes                   | 6  |
|    |      | 1.1.5 DPDK and Its Role in VPP                               | 7  |
|    |      | 1.1.5.1 Poll Mode Drivers                                    | 7  |
|    |      | 1.1.5.2 Memory management and Hugepages                      | 8  |
|    |      | 1.1.5.3 Packet Reception and Transmission: A com-            |    |
|    |      | parison between Linux Network Stack and DPDK                 | 8  |
|    | 1.2  | Implementation of Vector Packet Processing                   | 10 |
|    |      | 1.2.0.1 VPPINFRA                                             | 11 |
|    |      | 1.2.0.2 VNET                                                 | 12 |
|    |      | 1.2.0.3 VLIB                                                 | 12 |
|    |      | 1.2.0.4 Plugins                                              | 13 |
|    |      | 1.2.1 Configuration and Startup                              | 13 |
|    | 1.3  | Utilization of Vector Packet Processing                      | 13 |
|    |      | 1.3.1 Integration with the SDN/NFV Ecosystem                 | 13 |
|    |      | 1.3.2 VPP as a Complete Router Solution                      | 14 |
|    | 1.4  | Survey of Traffic Generation Tools                           | 14 |
| 2  | Pra  | tical part                                                   | 16 |
|    | 2.1  | Building Infrastructure for Measurement                      | 16 |
|    | 2.2  | Metodology                                                   | 17 |
|    | 2.3  | Test Scenarios & Results                                     | 18 |
|    |      | 2.3.1 Bidirectional UDP 1 Gbit/s                             | 18 |
|    |      | 2.3.1.1 64-bytes frames                                      | 19 |
|    |      | 2.3.1.2 512-bytes frames                                     | 19 |
|    |      | 2.3.1.3 TODO 889-bytes frames                                | 19 |
|    |      | 2.3.1.4 1280-bytes frames                                    | 20 |
|    |      | 2.3.1.5 1518-bytes frames                                    | 21 |
|    |      | 2.3.1.6 Test Conclusion                                      | 21 |
|    | 2.4  | Presentation and Analysis of Results                         | 21 |

| Contents         |    |  |
|------------------|----|--|
| 3 Conclusion     | 22 |  |
| A Nějaká příloha | 23 |  |
| Obsah příloh     | 27 |  |

## List of Figures

| 1.1<br>1.2        | Picture showing the VPP Processing Graph [4] | 7                    |
|-------------------|----------------------------------------------|----------------------|
| -· <b>-</b>       | the Linux Network Stack and DPDK             | 10                   |
| 2.1               | Picture showing hardware setup               | 16                   |
|                   |                                              |                      |
|                   |                                              |                      |
|                   | T: 4 C T 11                                  |                      |
|                   | List of Tabl                                 | les                  |
|                   | List of Table                                | les                  |
| 2.1               | Hardware details for DUT (Device Under Test) | les                  |
| 2.1<br>2.2        |                                              |                      |
|                   | Hardware details for DUT (Device Under Test) | 17                   |
| 2.2               | Hardware details for DUT (Device Under Test) | 17<br>17             |
| 2.2<br>2.3        | Hardware details for DUT (Device Under Test) | 17<br>17<br>19       |
| 2.2<br>2.3<br>2.4 | Hardware details for DUT (Device Under Test) | 17<br>17<br>19<br>20 |

List of code listings

## List of abbreviations

| DFA | Determ | inistic | Finite | Automaton |
|-----|--------|---------|--------|-----------|
|     |        |         |        |           |

FA Finite Automaton

LPS Labelled Prüfer Sequence

NFA Nondeterministic Finite Automaton

NPS Numbered Prüfer Sequence

XML Extensible Markup Language

XPath XML Path Language

XSLT eXtensible Stylesheet Language Transformations

W3C World Wide Web Consortium

## Introduction

Modern high-performance network devices are usually proprietary systems that combine custom hardware, specialized operating systems, and tightly coupled software. While these solutions offer high throughput and reliability, they are typically expensive, inflexible, and slower to evolve due to their closed design and development model. Vector Packet Processing (VPP) is a high-performance network stack that operates at layers 2 to 4 of the ISO/OSI model. It was originally developed by Cisco Systems, Inc. (which is a world leader in networking) and open-sourced in 2016 under the Fast Data Project (FD.io), that is part of the Linux Foundation. VPP brings the ability to perform efficient, high-speed packet processing on common off-the-shelf (COTS) hardware, across a wide range of platforms and operating systems. Its open and flexible architecture opens the door to a new class of network applications that can be deployed and scaled more easily than traditional hardware appliances. In this way, VPP could represent a shift in the traditionally conservative networking world, echoing the "Mainframe to PC" revolution, where generalpurpose systems replaced proprietary platforms, enabling broader innovation and accessibility.

Since VPP was open-sourced only recently, it has not yet been widely adopted by the market, and there are only a limited number of academic studies on the subject. As a result, this area remains underexplored. This thesis aims to contribute to this field by evaluating VPP's<sup>1</sup> performance, with a particular focus on its electricity consumption. The findings could provide valuable insights for the industry and guide future research, especially in light of the increasing importance of energy efficiency, as highlighted in recent forecasts by ČEPS a.s. regarding the future of energy resources in the Czech Republic.

With the development of AI and the growing demand for high-resolution streaming services, it is highly likely that the demand for internet bandwidth

 $<sup>^1{\</sup>rm The}$  abbreviation VPP is also commonly used in a cademic literature to refer to a Virtual Power Plant.

Introduction 2

will continue to rise. This will result in an increased need for network equipment capable of processing larger volumes of data more efficiently. Therefore, it is crucial to explore technologies like VPP that are capable to handle this growing demand and to explore their energy efficiency.

This thesis is divided into two parts: Theoretical and Practical. The Theoretical part presents the traditional approach to networking and packet processing, as well as an overview of how VPP is designed and the principles on which it operates. Additionally, it introduces the testing scenarios used. The Practical part describes the testing infrastructure, presents the results of various measurements, and provides an analysis of the findings.

## Chapter 1

# Theoretical part

# 1.1 "Vector Packet Processing (VPP) and Its Operating Principles

. . . . . . . . . . .

This section describes the fundamental principles behind the Vector Packet Processing (VPP) technology, which aims to enable efficient and high-performance network packet processing. VPP is built on modern programming and architectural principles that allow maximum utilization of contemporary hardware, particularly in parallel processing and memory access optimization.

The section begins with a brief description of traditional network traffic processing methods used by operating systems and their limitations in terms of performance and scalability. Following that, the architecture of VPP is explored in detail, explaining how packets are processed in vectors, the use of a node graph, and the various techniques that contribute to its high efficiency—such as I/O and compute batching, zero-copy methods, and lock-free multithreading. The purpose of this section is to provide a theoretical foundation for understanding how VPP operates.

## 1.1.1 Traditional network traffic processing

A network packet is a basic unit of data transmitted over a network. It consists of a header, which includes control information such as source and destination IP addresses, and a payload, which carries the actual user data. Packets are routed independently through the network and reassembled at the destination. This structure allows efficient and reliable communication, even over complex or unreliable network paths.

Currently, packet processing works as follows: a packet arrives at the network card, which then issues a system call (syscall) to the operating system for packet processing. The microprocessor must save the currently executing instruction, perform a context switch, locate the appropriate service routine in the interrupt vector table, and handle the packet processing. Once completed, it must restore the saved instruction, perform another context switch, and return to processing the interrupted program.

This system for operating peripherals was designed under the assumption that the peripherals would not request interrupts continuously, which is not the case with network devices that need to process large volumes of data split into small parts. This method requires the microprocessor to execute a significant number of instructions not directly related to packet processing. Chase et al. [1] discovered  $^1$  that if MTU is 1500 bytes, then interrupt handling accounts for 20% - 25% of receiver packet-processing overhead. Another disadvantage of tradidtional packet processing is the inefficient handling of cache memory; the processing of the packets one by one in response to interrupts leads to frequent cache misses in both cache & inctruction caches.  $^2$ [2]

#### 1.1.2 An Introduction to VPP

Vector Packet Processing (VPP) is a multi-platform network stack that operates at layers 2-4 of the ISO/OSI model and is developed by the FD.io project. It consists of a set of forwarding vertices arranged in an oriented graph and auxiliary software and provides out-of-the-box switch/router functionality. Unlike traditional network stacks, which run in the kernel, VPP operates in user space.

In a traditional approach, packets are processed one by one. In contrast, VPP reads the largest available number of packets called vector from the network interface card (NIC) and processes the entire vector through a VPP node-graph one node at a time. Each node in this graph handles a specific part of the packet processing. This approach reduces cache misses and spreads fixed overhead costs across multiple packets, lowering the average processing cost per packet. Additionally, it allows VPP to take advantage of multiple cores, enabling parallel processing, which significantly improves overall performance.

Vector Packet Processing (VPP) runs on common off-the-shelf hardware (COTS), ensuring its broad compatibility and flexibility for deployment. It supports various architectures such as x86, ARM, and Power, and can be deployed on both standard servers and embedded devices. The design of VPP is agnostic to hardware, kernel, and deployment platform, meaning it can operate across a wide range of systems, including bare metal servers, virtual machines (VMs), and containers. This approach allows VPP to be deployed on widely available infrastructure without the need for specialized hardware.[3]

<sup>&</sup>lt;sup>1</sup>kap. 3.3 obr. 6

 $<sup>^2</sup>$ kap. 4.2

## 1.1.3 Techniques used in VPP

TBD DOPAST !!! Low-level code optimization technique
According to Linguaglossa et al.[4] VPP uses theese kernel-bypass techniques:

- Lock-Free Multi-Threading (LFMT) is a programming technique that leverages modern multi-core CPUs to increase system performance. In network applications, parallelism is achieved by running multiple threads in the same time. Ideally, the more threads are used, the better the system performance but only up to a saturation point beyond which additional threads bring no gainns. However, to reach this ideal performance, traditional synchronization mechanisms such as mutexes and semaphores must be avoided, as they introduce delays due to thread contention. Instead, lock-free architectures have to be used, allowing threads to operate independently without blocking each other. In the context of VPP this approach is enabled by hardware features like multi-queue NICs, which allow each thread to handle a distinct subset of traffic, ensuring efficient and parallel processing.
- I/O batching (IOB) is a key technique used in VPP. Instead of raising an interrupt for every incoming packet, the network interface card (NIC) collects multiple packets into a buffer and triggers an interrupt only when the buffer is full. This reduces the overhead caused by frequent context switching and interrupt handling. VPP typically uses poll-mode drivers, which collect packets in batches without relying on interrupts. Moreover, the batching technique is applied system-wide in VPP. This approach maximizes CPU efficiency, improves cache usage, and delivers stable, high-throughput performance even under heavy load.
- Compute batching (CB) is a technique that extends I/O batching to the processing phase itself. Instead of processing one packet at a time, network functions are designed to operate on entire batches of packets. This approach minimizes overhead from function calls (such as context switches and stack setup) and improves instruction cache efficiency. When a batch of packets enters a processing function, only the first packet might cause an instruction cache miss, while the rest benefit from already-warmed cache. Additionally it is possible to take advatage of instruction-level parallelism.
- Receive-Side Scaling (RSS) is a hardware-based technique used by modern NICs to distribute incoming packets across multiple RX queues. This enables parallel packet processing by allowing each queue to be handled by a separate thread, improving scalability and throughput. Packet assignment is typically done using a hash function over packet header fields (e.g., the 5-tuple).

- Zero-Copy (Z-C) is a technique used to eliminate unnecessary memory copying during packet processing. Instead of copying incoming packets from the network interface card (NIC) to a separate buffer via system calls, the NIC writes packets directly into a pre-allocated memory region that is shared with the user-space application via Direct Memory Access (DMA). This allows the application to access packet data without invoking system calls or duplicating memory, significantly reducing CPU overhead.
- Cache Coherence and Locality (CC&L) are critical factors in the performance of modern software-based packet processing systems. In current COTS architectures, memory access has become a major bottleneck, which is mitigated by a multi-level cache hierarchy. Minimizing cache misses and maintaining data locality during packet processing is essential for achieving high performance and low latency.
- Low-Level Parallelism (LLP) refers to the ability to exploit the internal micro-architecture of modern CPUs, including multi-stage pipelines, arithmetic-logical units (ALUs), and branch predictors that help maintain pipeline efficiency. Well-optimized code can keep these pipelines full and execute multiple instructions per clock cycle, increasing overall throughput. Performance can be further improved by giving hints to the compiler such as indicating the likely outcome of conditional branches to reduce pipeline stalls. Vectorized packet processing and specific coding practices can take full advantage of these hardware features and VPP was specifically designed to taky advantage of LLP.

### 1.1.4 VPP Processing Graph and Graph nodes

At the core of VPP (Vector Packet Processing) lies the Packet Processing Graph, a directed graph composed of relatively small, modular & loosely coupled nodes. Each node is designed to perform a specific task and there are 3 types of them: process, input & internal. Process nodes do not participate in the packet forwarding graph; instead, they handle timers, events, and other background tasks within the VPP runtime. Input nodes are used for input of data and internal nodes are used for vector processing. Internal nodes also serves as output nodes. When a vector of packets is prepared by input node, it is then pushed through the internal nodes. During processing, the vector may be split if the batch contains packets of different protocols or types, as they may need to follow different paths through the graph When the original vector is completely processed, the process repeats. Illustration of this Processing Graph is shown in fig. 1.1.

Thanks to VPP's modular design, the processing graph is highly customizable and extensible. New nodes – referred to as plugins – can be easily added to implement specific functionality or repleace existing ones. Plugins are shared libraries that are loaded during startup of VPP, and they are not dependent



■ Figure 1.1 Picture showing the VPP Processing Graph [4]

on the VPP source code, allowing them to be developed independently. Moreover, existing nodes can be rewired to modify the packet processing logic when necessary.[4, 5, 6]

#### 1.1.5 DPDK and Its Role in VPP

The Data Plane Development Kit (DPDK) is an open-source collection of libraries and drivers designed to support high-speed packet processing in user space. It was initially developed by Intel in 2010 and is now maintained as a Linux Foundation project. DPDK provides a set of APIs and components that allow applications to bypass the kernel network stack and directly access network interface cards (NICs) through poll-mode drivers (PMD), significantly reducing the overhead associated with traditional packet handling mechanisms.[7]

DPDK is used in VPP for interfacing with hardware. It is implemented as a plugin called dpdk-pluqin.[4, 5]

While VPP supports multiple mechanisms for accessing network devices, such as *af\_packet*, to the best of the author's knowledge, DPDK is by far the most widely used option.

#### 1.1.5.1 Poll Mode Drivers

Poll Mode Drivers (PMDs) are a key component of the DPDK framework. Unlike traditional network drivers, which rely on interrupts to signal packet arrival, PMDs continuously poll the network interface card (NIC) (specifically its RX queue) in a busy-loop, completely avoiding traditional interrupt-based

mechanisms. This approach allows packets to be retrieved, processed, and delivered directly to user space without kernel involvement. While this results in very low latency and high throughput, it also causes constant CPU utilization on the cores assigned to polling, regardless of actual traffic load.[8]

Not every network interface card is supported by DPDK. Each supported device requires a specific Poll Mode Driver (PMD), which must be available and compatible with the given hardware. An up-to-date list of supported NICs and their corresponding PMDs is maintained on the official DPDK website.[9]

#### 1.1.5.2 Memory management and Hugepages

DPDK uses a user-space memory model that eliminates the need for kernel involvement during packet processing. It operates on memory regions reserved as hugepages – large memory pages, typically 2 MB or 1 GB in size, which are allocated at startup. These hugepages are used to store packet buffers and manage memory pools. DPDK defines its own memory management structures, such as mempools, which consist of preallocated fixed-size objects.

DPDK is also explicitly NUMA-aware. Most memory allocation functions require the application to specify the target NUMA node, ensuring that memory is allocated close to the CPU core accessing it. This minimizes latency caused by cross-node memory access and helps optimize performance on multisocket systems. [10]

# 1.1.5.3 Packet Reception and Transmission: A comparison between Linux Network Stack and DPDK

When a packet arrives at a NIC managed by the Linux Network Stack, it is first stored in the NIC's internal buffers. The NIC then writes the packet via Direct Memory Access (DMA) to the section of RAM provided by the driver and updates the corresponding descriptor in the RX buffer. The RX buffer is implemented as a ring queue.

Once the packet has been saved, the corresponding interrupt request (IRQ) is triggered to notify the CPU that one or more packets have arrived in that queue. Then, the corresponding IRQ handler is executed, which acknowledges the interrupt and calls the napi\_schedule and \_\_\_raise\_softirq\_irqoff functions.

The first function marks the associated napi\_struct<sup>3</sup> as ready for processing, while the second one raises a software interrupt (SoftIQR) specifically intended for processing incoming packets. Once the SoftIRQ is triggered, the kernel handles the actual packet processing in a deferred context. It goes through a list of network devices that have indicated pending work (i.e., their

<sup>&</sup>lt;sup>3</sup>napi\_struct represents a NAPI context associated with a specific receive queue of a network device.

associated napi\_struct has been marked as ready) and calls their associated poll functions to retrieve and process packets from the receive queues.

This happens on the same CPU core that handled the original interrupt. If the system is busy or the processing takes too long, the remaining work may be handled by the *ksoftirqd* kernel thread. The packets may be aggregated into a single larger packet using Generic Receive Offload (GRO), or processed individually. In both cases, they are passed to the IP stack via the netif\_receive\_skb function.

The transmission path is handled in a similar manner, using ring buffers, DMA, and deferred processing. However, unlike reception, packet transmission is initiated from the IP stack using the \_\_dev\_queue\_xmit function. Depending on the qdisc in use, packets are either enqueued in the software queue or passed directly to the driver for transmission. Once a packet is selected for transmission, the driver places a descriptor into the TX ring buffer and sets up DMA so that the NIC can read the packet data from memory. After the NIC finishes transmitting the packet, it triggers a TX interrupt, which allows the driver to perform post-processing such as unmapping DMA buffers and freeing memory.[11]

When there is a NIC with multiple RX queues available, it is assigned to one of the queues based on the NIC's configuration<sup>4</sup>. The selection of the target queue is typically based on a hash function computed over network and/or transport layer headers. Each queue has a dedicated IRQ, which can be assigned to specific CPU cores based on system settings. This mechanism is known as Receive Side Scaling (RSS).

When sending a packet from a NIC equipped with multiple TX queues, Transmit Packet Steering (XPS) is used to determine the appropriate TX queue. The first option is that a CPU core is assigned specific TX queues. The second option is to use the TX queue corresponding to the RX queue from which the flow originated. If multiple queues are eligible, a hash function is used to select the specific queue.[12]

On the other hand, in DPDK the incoming packets are delivered by the network interface card (NIC) using direct memory access (DMA), which writes packet data into pre-allocated memory buffers specified by receive (Rx) descriptors. These descriptors are organized in a circular ring (Rx queue), where the NIC populates entries at the head, while the VPP continuously polls the tail using functions such as  $rte\_eth\_rx\_burst()$ . This polling mechanism enables the VPP to retrieve multiple packets in a batch, minimizing interrupt overhead and reducing latency, thereby increasing throughput and core efficiency.[13]

Transmission is handled similarly, using a ring buffer known as the transmit (Tx) queue. The application prepares transmit (Tx) descriptors at the tail of this queue, each containing the address and length of the packet to be sent.

<sup>&</sup>lt;sup>4</sup>For network interface cards with multi-queue capabilities, the corresponding kernel driver often provides a module parameter to define how many hardware queues should be initialized and utilized.

These descriptors reference memory buffers (mbufs) holding the packet data. After the descriptors are written, the application updates the Tx queue's tail pointer to notify the NIC that new packets are available. The NIC then reads the descriptors from the head of the queue, fetches the packet data via DMA, and transmits the packets on the wire.[14]



■ Figure 1.2 Diagram illustrating the differences in packet handling between the Linux Network Stack and DPDK.

Based on the description above, several key differences between the Linux Network Stack and DPDK can be observed. Although both rely on a similar underlying mechanism – ring buffer queues – their implementations differ fundamentally. In the Linux Network Stack, memory management is handled by the kernel through device drivers. In contrast, DPDK allocates memory in user space and manages it through its own framework, providing packet buffers and ring structures directly to the application.

Linux is heavily dependent on hardware interrupts (IRQs) for packet reception and software interrupts (softIRQs) for deferred processing, which introduces frequent context switches. While NAPI uses polling to process packets from receive queues, they are still handled one at a time. This increases the likelihood of cache misses during packet processing, as each packet is processed independently and may not benefit from cache locality.

In contrast, DPDK works entirely in user space and uses continuous active polling, completely bypassing the need for interrupts and context switching. Since it does not wait for an interrupt to occur, packet processing can begin sooner, reducing initial latency. Additionally, DPDK can retrieve multiple packets in a single burst, preparing them for vectorized processing in VPP. Figure 1.2 presents a simplified diagram highlighting the key differences.

## 1.2 Implementation of Vector Packet Processing

VPP's dataplane is implemented by four main architectural layers: VPPIN-FRA, VNET, VLIB, and Plugins. Each of these layers provides distinct functionality that supports efficient networking operations, from low-level data structure management to high-level network function optimizations.

VPPINFRA provides foundational libraries for tasks such as memory handling, vectors, rings, hash table lookups, and timers. VNET focuses on implementing network protocols for layers 2 to 4 and includes the control plane. VLIB serves as the runtime environment for vectorized processing and also provides the command-line interface and application lifecycle management. Finally, plugins allow the system to be extended or customized by adding new features or modifying existing ones [15].

#### 1.2.0.1 **VPPINFRA**

VPPINFRA is a collection of foundational libraries designed to provide highperformance capabilities for various internal tasks within VPP. It includes dynamic arrays, hash tables, bitmaps, timing utilities, logging mechanisms, and data structure serialization, all optimized for speed and efficiency.

- Vectors dynamically resized arrays with user-defined headers. Vectors are used as the basis for other structures such as pools or hash tables and support efficient memory reuse through safe length resetting.
- **Bitmaps** compact data structures used to efficiently track the true/false state of multiple indexed items using individual bits, built on top of vectors.
- Pools structures used to quickly allocate and free fixed-size data structures, such as packet buffers or per-session metadata. Internally, they are implemented using vectors and bitmaps.
- **Hashes** lookup structures optimized for fast access using hash functions.
- **Timekeeping** utilities providing precise, low-overhead timing based on CPU cycles. VPPINFRA continuously adjusts its time calibration by comparing CPU ticks against kernel time, ensuring accurate time measurement without expensive system calls.
- **Timer wheel** subsystem for efficiently managing timers and periodic events. It supports multiple configuration options, including the number of wheels, slots, and timers per object, allowing high-performance scheduling in time-sensitive applications.
- **Logging and formatting** includes support for fast event logging, trace output, and data formatting used for debugging and diagnostics.
- Serialization support for serializing and deserializing internal data structures for persistent storage or communication between threads.

The infrastructure layer underpins many of VPP's internal mechanisms and enables building scalable and fast packet processing pipelines [16].

#### 1.2.0.2 **VNET**

VNET (VPP Network Stack) implements the core networking logic in VPP, providing graph nodes for Layer 2 and Layer 3 packet processing.

A key mechanism provided by VNET is the concept of feature arcs. These represent named sequences of graph nodes within the packet processing graph, allowing custom nodes – such as NAT, ACLs, or telemetry – to be inserted into existing pipelines in a defined order. Feature arcs enable modular composition of processing features without modifying the core graph logic. For example, an ACL node can be inserted at the beginning of the ip4-unicast<sup>5</sup> feature arc.

In addition to protocol and interface handling, VNET also provides a flexible framework for packet tracing, allowing developers to inspect and debug the path each packet takes through the graph in fine detail. This is especially useful for analyzing the behavior of custom nodes or diagnosing complex feature interactions.

Finally, VNET includes a built-in packet generator, which can be used to simulate traffic and evaluate the performance of specific graph paths under controlled conditions.[17, 18]

#### 1.2.0.3 VLIB

VLIB provides the runtime environment and execution engine that powers VPP's packet processing model. One of its core responsibilities is managing the registration and execution of all graph nodes, including input, internal, and process nodes.

The execution of graph nodes in VPP is coordinated by a lightweight cooperative scheduler. Each iteration of the main loop begins with input nodes producing vectors of packets, which are then passed through a sequence of internal nodes forming a directed graph. Nodes process the incoming vector and determine, for each packet, which next node should process it based on routing, classification, or protocol-specific logic.

Packets destined for the same next node are grouped together and placed into a new vlib\_frame\_t, which is then enqueued to that node for processing. This selective forwarding enables efficient vector splitting and maintains high performance by improving cache locality and reducing per-packet overhead.

When a node cannot or should not be executed immediately, VPP defers its execution by adding it to a list of pending operations using pending\_frames. These frames are processed later in the main loop

VLIB also provides the command-line interface (CLI), which allows operators to interact with the VPP runtime.[19]

 $<sup>^5</sup>$ ip4-unicast is a feature arc that processes unicast IPv4 packets before they reach the routing logic.

#### 1.2.0.4 **Plugins**

Plugins in VPP are implemented as shared object libraries that are automatically discovered and loaded by VLIB during startup. They allow developers to add new features or extend existing functionality without modifying the VPP core.

To create a plugin, developers add a new directory under src/plugins, define build instructions using plugin.mk and CMakeLists.txt, and implement the required logic. After compilation, the plugin is placed into the designated plugin directory and becomes available for VPP to load.

This modular architecture enables rapid experimentation and integration of custom network functions while keeping the base system clean and maintainable. [20, 21]

#### 1.2.1 Configuration and Startup

TBD

## 1.3 Utilization of Vector Packet Processing

VPP supports a comprehensive set of Layer 2 to Layer 4 network features. At Layer 2, it provides Ethernet bridging, MAC learning, VLAN tagging (including dot1q and QinQ), and support for L2 cross-connects and policers.

At Layer 3, VPP implements both IPv4 and IPv6 routing with ECMP support, NAT44/NAT64, and ACL-based filtering. It also supports tunneling mechanisms such as GTP-U, IP-in-IP, and VXLAN. Segment routing (SRv6), LISP, and punt redirect mechanisms are included as well.

At the transport layer (L4), basic UDP and TCP stack functionality is available, enabling packet forwarding and processing for a wide range of use cases within virtualized networking environments.

Additionally, supported features include PPPoE, the WireGuard VPN protocol, GRE tunneling, DHCP client and proxy functionality, and L2TPv3.[22]

According to the authors, VPP can be for example effectively utilized as a virtual switch, virtual router, gateway or used as a basis for a firewall, IDS and load balancer.[3] It already includes enough features to be deployed in production environments.

## 1.3.1 Integration with the SDN/NFV Ecosystem

To meet the requirements of modern virtualized and cloud-native networking environments, Vector Packet Processing (VPP) was architected with a clear separation between the data plane and control plane. This design choice enables its integration into SDN and NFV frameworks, where packet forwarding logic can operate independently from centralized control mechanisms. VPP's

modularity and userspace implementation allow it to function efficiently within dynamic, multi-tenant infrastructure, while remaining compatible with orchestration systems and control-plane protocols commonly used in such deployments

VPP is fully compatible with both Virtual Network Functions (VNFs) and Cloud-Native Network Functions (CNFs). Its modular architecture allows deployment in environments utilizing service function chaining, Kubernetes-based orchestration, or OpenStack-based infrastructures. Because of its userspace design and performance-optimized data plane, VPP can serve as the fast packet processing backend for SDN-controlled systems and NFV orchestrators.[23]

## 1.3.2 VPP as a Complete Router Solution

Vector Packet Processing (VPP) is implemented solely as a data-plane, meaning it is not a complete routing solution on its own. VPP is dedicated to efficiently forwarding packets between interfaces based on routing rules and access control filters, but it does not include a native control-plane or support for dynamic routing protocols such as BGP or OSPF.

However, as demonstrated by the authors of the VBSR (VPP-Bird Software Router) project [24], it is possible to integrate VPP with additional components such as the Linux Control Plane (Linux-CP) plugin and the BIRD routing daemon. Bird acting as a control-plane enables dynamic routing using protocols like BGP and the Linux-CP is responsible for communication between VPP and BIRD This integrated system creates a nearly feature-complete router solution, comparable in functionality to commercial routers.

It is important to note, however, that firewall functionality is still limited and was left by authors of VBSR as a future work.[24] While VPP supports basic packet filtering through ACLs, it lacks advanced stateful firewall features[22]. These would need to be handled externally.

## 1.4 Survey of Traffic Generation Tools

In order to evaluate the performance of network devices and data-plane frameworks such as VPP, synthetic traffic must be generated in a controlled and reproducible manner. Selecting appropriate traffic generation tools is therefore essential for conducting accurate benchmarking and stress-testing. Although numerous traffic generation tools exist [25], this section focuses on a subset commonly used for high-performance benchmarking and synthetic traffic generation in research and practice, namely iPerf3, D-ITG, TRex, Pktgen-DPDK & Genesids.

■ iPerf3 – iPerf3 is a network testing tool used to measure TCP, UDP, and SCTP throughput between two endpoints. It allows detailed configuration of testing parameters such as buffer size, number of parallel streams, test

duration, and jitter. iPerf3 can also measure jitter, providing insights into the variation in packet arrival times, which is useful for evaluating network stability. Its client-server architecture makes it a common tool for performance benchmarking of networks and devices.[26]

- D-ITG Distributed Internet Traffic Generator is a network traffic generator designed to produce traffic flows that accurately emulate a wide range of real-world application behaviors. It supports multiple transport layer protocols, including TCP, UDP, DCCP, and SCTP. D-ITG allows users to define parameters such as packet size, inter-departure time, and number of flows, making it suitable for controlled experiments on delay, jitter, packet loss, and throughput. It can operate in both single-node and distributed modes, enabling flexible deployment for testing complex topologies and performance conditions. D-ITG also includes tools for logging and analyzing the generated traffic, facilitating detailed post-experiment evaluation. [27]
- TRex TRex, developed by Cisco, is a high-performance, stateful and stateless traffic generator built on top of DPDK. It supports the generation of realistic Layer 4–7 traffic using pre-recorded PCAP files and emulates multiple concurrent users and flows. TRex is especially suited for benchmarking network function virtualization (NFV) platforms, routers, and firewalls in both laboratory and production-like environments. [28]
- Pktgen-DPDK Pktgen-DPDK is a high-performance traffic generator tool developed as part of the Data Plane Development Kit (DPDK). Pktgen-DPDK supports various network protocols, including IPv4, IPv6, UDP, and TCP. The tool allows precise control over traffic parameters, such as packet rate, size, and timing. Pktgen-DPDK is used in network performance tests and can capture packet-level statistics to assess the performance of the devices under test. [29]

Among the reviewed tools, the author decided to utilize iPerf3 and TRex in the subsequent experimental evaluation. IPerf3 was selected due to its status as a de facto standard for basic throughput and jitter measurements, ease of use, and widespread adoption in academic and practical contexts. TRex was chosen for its modern architecture, support for high-speed stateful and stateless traffic generation, ability to simulate real-world traffic. In addition, TRex provides a Python-based API that enables scripting and automation of test scenarios, making it well-suited for integration into continuous testing pipelines and reproducible experiments. Other tools, such as Pktgen-DPDK and D-ITG, were excluded due to their relatively complex usage (Pktgen-DPDK) or limited maintenance and outdated design (D-ITG).

## Chapter 2

## Pratical part

## 2.1 Building Infrastructure for Measurement

The testing infrastructure has been implemented as recomended in RFC 2544, which defines methods for evaluating network performance. It consists of a device under test (DUT), connected to a measurement device called *Tester.*<sup>1</sup>. In line with the more modern RFC 8219, which states that: "All tests described SHOULD be performed with bidirectional traffic" [30], the infrastructure is designed to operate with bidirectional traffic. This approach ensures more accurate performance measurement under real-world network conditions, as opposed to unidirectional traffic. The Device Under Test (DUT) and the measurement device are connected using 100Gbit capable cables, preventing any potential bottlenecks in the connection. The illustration of this hardware setup is shown in fig. 2.1



**Figure 2.1** Picture showing hardware setup

The Device Under Test (DUT) is the network device being evaluated during testing. It is configured with a specific network stack and settings based on measurement scenario and serves as the focus of performance and behavior analysis in a controlled test environment. The DUT is responsible for processing network traffic and responding to the test conditions set by the measure-

<sup>&</sup>lt;sup>1</sup>The hardware used in this testing setup was loaned free of charge for the purposes of this bachelor thesis by Silicon Hill club.

Metodology 17

ment device. Additionally, the electrical power consumption of the DUT is monitored and measured during the tests to assess its energy efficiency under varying loads. The hardware of DUT is shown in table 2.1.

| Hardware Component            | DUT (Device Under Test)                      |  |
|-------------------------------|----------------------------------------------|--|
| CPU Model                     | 2x Intel(R) Xeon(R) CPU E5-2660 v3           |  |
| Frequency                     | $2.60 \mathrm{GHz}$                          |  |
| Cores                         | 10 physical cores each (one thread per core) |  |
| Memory (RAM)                  | Size, type, speed                            |  |
| Network Interface Cards (NIC) | Mellanox ConnectX-6 Dx (Dual-port)           |  |

■ Table 2.1 Hardware details for DUT (Device Under Test)

The Tester (Measurement Device), on the other hand, is responsible in generating the network traffic and capturing the responses from the DUT. Its physical features are shown in table 2.2.

| Hardware Component            | Tester (Measurement Device)                   |  |
|-------------------------------|-----------------------------------------------|--|
| CPU Model                     | 2x Intel(R) Xeon(R) Gold 6136 CPU             |  |
| Frequency                     | $3.00 \mathrm{GHz}$                           |  |
| Cores                         | 12 physical cores each (two threads per core) |  |
| Memory (RAM)                  | Size, type, speed                             |  |
| Network Interface Cards (NIC) | 2x Mellanox ConnectX-5                        |  |

■ Table 2.2 Hardware details for Tester (Measurement Device)

The DUT is running Debian GNU/Linux 12 (Bookworm)  $x86\_64$  with Linux kernel version 6.1.0-32-amd64, VPP v25.02-release, and DPDK version 24.11.1.

The tester is running ...

## 2.2 Metodology

The RFC 2544 recommends to test be at least 60 seconds in duration[31] and NAJÍT ZDROJ??? kolikrát opakovat Each test scenario executed using TRex was repeated 30 times, with each individual run lasting five minutes. The reported results represent the arithmetic mean of these 30 measurements. In cases where an anomalous spike or irregularity was observed in the results, the corresponding measurement was discarded and the test was repeated. All this steps should ensure consistency and statistical reliability.

Transmitted packets and bytes refer exclusively to those that were successfully sent and received without any loss (i.e., without packet drops).

To evaluate energy efficiency, the number of packets per watt (PPW) and bytes per watt (BPW) was used, with all values rounded to two decimal places. These metrics provide a measure of how efficiently energy is utilized for each

transmitted packet and byte. The machine – when idle – consumes  $144~\mathrm{Watts}$  per minute.

### 2.3 Test Scenarios & Results

## 2.3.1 Bidirectional UDP 1 Gbit/s

This section presents a set of performance and efficiency tests performed on the Device Under Test (DUT) using bidirectional UDP traffic at a total rate of 1 Gbit/s. The goal is to evaluate the forwarding performance of the DUT under varying packet sizes, simulating realistic traffic patterns with increasing stress on the packet-processing path.

To provide a comprehensive and representative view, the tests are structured into five subsections, each corresponding to a different Ethernet frame size. Four of the selected sizes – 64 bytes, 512 bytes, 1280 bytes, and 1518 – bytes are recommended by RFC2544[31] covering both edge-cases and practically relevant intermediate values. The fifth size, 888 bytes, was chosen because it was identified as the average size in real-world network traffic by Jurkiewicz et al. [32]. This selection covers the full range of standard Ethernet frame sizes, from the minimum to the maximum non-jumbo frames, and includes a representative average frame size observed in real network traffic.

Traffic in all scenarios is generated using TRex with the udp\_1pkt\_src\_ip\_split.py profile, which ensures that each packet carries a unique source IP address to simulate multiple concurrent clients, while maintaining a single destination IP per direction. The routing table of the DUT contains only two active forwarding entries, corresponding to the test routes, in addition to two administrative entries used for management. The total offered load is 1 Gbit/s, symmetrically split between both directions (500 Mbit/s each). The chosen load of 1 Gbit/s is representative of a realistic aggregate traffic pattern that could be observed in a small or medium-sized enterprise network, especially when routed through a central gateway.

The DUT is configured with the Vector Packet Processing (VPP) stack and tested under three levels of parallelism: using 1, 4, and 10 worker threads. The number of RX/TX queues is aligned with the number of active worker threads in each configuration to ensure balanced packet distribution and optimal resource utilization.

To provide a baseline for comparison, all scenarios are also executed using the standard Linux kernel networking stack, configured with equivalent routing and interface parameters, but able to use all 20 cores of CPU. This allows for a direct comparison between VPP and traditional kernel-based forwarding in terms of performance and energy efficiency.

The aim of this test is to observe the behavior of the VPP forwarding plane under low traffic load of small packets and to evaluate its energy efficiency.

#### 2.3.1.1 64-bytes frames

This test evaluates the behavior of the VPP forwarding plane under a low-throughput traffic load composed of minimum-sized Ethernet frames (64 bytes). Such frames result in a high packet-per-second rate for a given bandwidth, which increases the processing overhead per bit of data. This configuration represents a stress scenario for packet forwarding and allows assessment of the system's efficiency in handling a large number of small packets.

| Configuration    | Watts used | PPW           | BPW              |
|------------------|------------|---------------|------------------|
| VPP – 1 worker   | 848.08     | 20 726 965.60 | 1 326 525 798.50 |
| VPP – 4 workers  | 951.03     | 18 483 249.76 | 1 182 927 982.40 |
| VPP – 10 workers | 1 193.56   | 14 727 474.94 | 942 558 396.09   |
| Linux stack      | 1 257.25   | 13 981 407.81 | 894 810 100.29   |

**Table 2.3** Result of Bidirectional UDP 1 Gbit/s of 64-bytes packets test

As the results in Table 2.4 show, the power consumption increases notably with the number of worker threads in the VPP stack. While all VPP configurations deliver identical packet and byte throughput, the most energy-efficient setup is this measurement is the single-worker variant, consuming roughly 850 Watts during the test. In contrast, the traditional Linux network stack demonstrates the highest energy usage, despite handling the same volume of packets.

This discrepancy can likely be attributed to the cost of processing a high number of small packets in kernel space. Since the test uses fixed-size 64-byte packets, which are known to generate frequent system calls and context switches in Linux, the forwarding path becomes less efficient compared to VPP's user-space architecture, where such overheads are significantly reduced. The results highlight the energy cost of kernel-based packet forwarding in scenarios dominated by small-packet traffic.

#### 2.3.1.2 512-bytes frames

TODO - CITOVAT DNS OD CZ NIC NA VELIKOST DNS This test focuses on the forwarding performance of the VPP data plane when processing medium-sized Ethernet frames of 512 bytes. These frames offer a balance between protocol overhead and payload efficiency and are representative of many real-world applications that do not utilize maximum frame sizes. The test helps to evaluate how the system handles typical traffic patterns with moderate packet rates and processing demands.

#### 2.3.1.3 TODO 889-bytes frames

This test focuses on the forwarding performance of the VPP data plane when processing medium-sized Ethernet frames of 888 bytes. These frames offer a

| Configuration    | Watts used  | PPW              | BPW              |
|------------------|-------------|------------------|------------------|
| VPP – 1 worker   | 851.21      | $2\ 581\ 343.78$ | 1 321 648 015.98 |
| VPP – 4 workers  | 969.49      | $2\ 266\ 413.93$ | 1 160 403 931.63 |
| VPP – 10 workers | $1\ 175.07$ | $1\ 869\ 901.91$ | 957 389 779.06   |
| Linux stack      | 965.58      | $2\ 275\ 591.60$ | 1 165 102 847.70 |

**Table 2.4** Result of Bidirectional UDP 1 Gbit/s of 512-bytes frames test

balance between protocol overhead and payload efficiency. As said above this size was found as the average size in real-work network traffic.

The test helps to evaluate how the system handles traffic patterns with moderate packet rates and processing demands,

| Configuration    | Watts used | PPW          | BPW              |
|------------------|------------|--------------|------------------|
| VPP – 1 worker   |            |              |                  |
| VPP – 4 workers  | 952.46     | 1 328 629.91 | 1 181 151 986.18 |
| VPP – 10 workers | 1 190.42   | 1 063 042.32 | 945 044 623.54   |
| Linux stack      | 940.33     | 1 345 768.87 | 1 196 388 523.99 |

**Table 2.5** Result of Bidirectional UDP 1 Gbit/s of 512-bytes frames test

As shown in table 2.5, the second test with 870-byte packets, VPP's power consumption remains relatively stable across different worker thread configurations, with a variation of only about  $\pm 5\%$  compared to previous measurement. This indicates that VPP's performance and power consumption are consistent, regardless of the increased packet size. On the other hand, the Linux stack's power consumption shows a positive reduction compared to the previous test, which is attributed to the lower number of packets being processed. Since larger packets reduce the packet-per-second (PPS) rate, the system overhead and processing cost in Linux are lower, leading to more efficient energy usage in this scenario.

#### 2.3.1.4 1280-bytes frames

This test focuses on the forwarding performance of the VPP data plane when processing large Ethernet frames of 1280 bytes. These frames represent a size commonly used in environments where larger payloads are transferred, but without exceeding the typical Ethernet MTU limit of 1500 bytes. This size strikes a balance between higher data throughput and maintaining efficient protocol overhead.

The test evaluates how the system handles traffic patterns with higher packet sizes, which are common in applications involving large data transfers such as video streaming, file sharing, or other high-bandwidth scenarios in enterprise and data center environments.

| Configuration    | Watts used | PPW          | BPW              |
|------------------|------------|--------------|------------------|
| VPP – 1 worker   | 866.91     | 1 013 837.98 | 1 297 712 609.61 |
| VPP – 4 workers  | 961.01     | 914 565.18   | 1 170 643 425.56 |
| VPP – 10 workers | 1188.39    | 739 577.31   | 946 658 957.41   |
| Linux stack      | 926.77     | 948 354.26   | 1 213 893 456.20 |

■ **Table 2.6** Result of Bidirectional UDP 1 Gbit/s of 1280-frames test

#### 2.3.1.5 1518-bytes frames

This test evaluates the performance of the VPP forwarding plane when handling full-sized Ethernet frames (1518 bytes, including headers). These frames represent the upper limit of standard Ethernet without jumbo frame extensions and represent throughput efficiency under optimal conditions. Meaning that the ratio of payload to protocol overhead is maximized. The goal is to observe how the system behaves when forwarding large packets that minimize per-packet processing overhead.

| Configuration    | Watts used | PPW           | BPW              |
|------------------|------------|---------------|------------------|
| VPP – 1 worker   | 854.66     | 867 136.33    | 1 316 312 956.40 |
| VPP – 4 workers  | 972.93     | 761 726.68    | 1 156 301 102.16 |
| VPP – 10 workers | 1 192.34   | $621\ 556.55$ | 943 522 846.94   |
| Linux stack      | 934.24     | 793 272.33    | 1 204 187 394.37 |

■ Table 2.7 Result of Bidirectional UDP 1 Gbit/s of 1518-frames test

As shown in Table 2.7, even when using the maximum standard Ethernet frame size – thus minimizing the number of interrupts and per-packet processing overhead – Linux does not outperform VPP running on single core (with one worker-thread) in terms of energy efficiency.

This shows that the Linux networking stack must perform a number of energy-intensive operations, which cannot be fully offset even under ideal conditions with large frame sizes and reduced interrupt overhead. As a result, it remains less efficient than VPP running on a single core.

#### 2.3.1.6 Test Conclusion

## 2.4 Presentation and Analysis of Results

# Chapter 3 Conclusion

# Appendix A Nějaká příloha

Sem přijde to, co nepatří do hlavní části.

## **Bibliography**

- 1. GALLATIN, Andrew J.; CHASE, Jeffrey S.; YOCUM, Kenneth G. Trapeze/IP: TCP/IP at Near-Gigabit Speeds. In: *Proceedings of the USENIX Annual Technical Conference*. 1999, pp. 109-120. Available also from: https://www.usenix.org/event/usenix99/full\_papers/gallatin/gallatin.pdf.
- 2. COX, Alan L.; SCHAELICKE, Lambert; DAVIS, Al; MCKEE, Sally A. Profiling I/O Interrupts in Modern Architectures. *Proceedings of the Workshop on Performance Analysis and Its Impact on Design.* 2000. Available also from: https://users.cs.utah.edu/~ald/pubs/interrupts.pdf.
- FD.IO. What is VPP? [https://wiki.fd.io/view/VPP/What\_is\_ \_VPP%3F]. 2025. Accessed: 2025-04-07.
- 4. LINGUAGLOSSA, Leonardo; ROSSI, Dario; PONTARELLI, Salvatore; BARACH, Dave; MARJON, Damjan; PFISTER, Pierre. High-speed data plane and network functions virtualization by vectorizing packet processing. Computer Networks. 2019, vol. 149, pp. 187–199. ISSN 1389-1286. Available from DOI: https://doi.org/10.1016/j.comnet.2018.11.033.
- 5. BARACH, David; LINGUAGLOSSA, Leonardo; MARION, Damjan; PFISTER, Pierre; PONTARELLI, Salvatore; ROSSI, Dario. High-speed Software Data Plane via Vectorized Packet Processing. *IEEE Communication Magazine* [https://perso.telecom-paristech.fr/drossi/paper/rossi18commag.pdf]. 2018, vol. 56, no. 12, pp. 97–103. ISSN 0163-6804. Available from DOI: 10 . 1109/MCOM.2018.1800069.
- 6. FD.IO. Extensible: VPP and its plugin architecture [https://fd.io/docs/vpp/v2101/whatisvpp/extensible]. 2021. Accessed: 2025-04-10.
- 7. DPDK PROJECT. About DPDK [https://www.dpdk.org/about/]. 2025. Accessed: 2025-04-13.

Bibliography 25

8. FREITAS, Eduardo; DE OLIVEIRA FILHO, Assis T.; DO CARMO, Pedro R.X.; SADOK, Djamel; KELNER, Judith. A survey on accelerating technologies for fast network packet processing in Linux environments. *Computer Communications*. 2022, vol. 196, pp. 148–166. ISSN 0140-3664. Available from DOI: https://doi.org/10.1016/j.comcom.2022.10.003.

- 9. DPDK PROJECT. Supported NICs [https://core.dpdk.org/supported/nics/]. 2024. Accessed: 2025-04-20.
- 10. BURAKOV, Anatoly. Memory in DPDK, Part 1: General Concepts. DPDK Blog. 2019. Available also from: https://www.dpdk.org/memory-in-dpdk-part-1-general-concepts/. Accessed: 2025-04-20.
- 11. ZHANG, Shiqi; GUPTA, Mridul; DEZFOULI, Behnam. *Understanding* and Enhancing Linux Kernel-based Packet Switching on WiFi Access Points. 2024. Available from arXiv: 2408.01013 [cs.NI].
- 12. DOCUMENTATION, Linux Kernel. Scaling in the Linux Networking Stack [https://docs.kernel.org/networking/scaling.html]. 2023. Accessed: 2025-04-19.
- 13. INTEL CORPORATION. Core Utilization in DPDK Applications [htt ps://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2025-0/core-utilization-in-dpdk-apps.html]. 2024. Accessed: 2025-04-19.
- 14. INTEL CORPORATION. PCIe Traffic in DPDK Applications [https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2025-0/pcie-traffic-in-dpdk-apps.html]. 2024. Accessed: 2025-04-19.
- 15. FD.IO PROJECT. *VPP Software Architecture*. 2025. Available also from: https://s3-docs.fd.io/vpp/25.06/developer/corearchitecture/s oftwarearchitecture.html. Accessed: 2025-04-21.
- 16. FD.IO PROJECT. *VPP Infrastructure Layer*. 2025. Available also from: https://s3-docs.fd.io/vpp/25.06/developer/corearchitecture/infrastructure.html. Accessed: 2025-04-21.
- 17. FD.IO PROJECT. *VNET VPP Network Stack.* 2025. Available also from: https://s3-docs.fd.io/vpp/25.06/developer/corearchitecture/vnet.html. Accessed: 2025-04-21.
- 18. FD.IO PROJECT. Feature Arcs. 2025. Available also from: https://s3 -docs.fd.io/vpp/25.06/developer/corearchitecture/featurearcs.html. Accessed: 2025-04-21.
- 19. VLIB Architecture VPP Developer Documentation (Release 25.06). FD.io Project, 2025. Available also from: https://s3-docs.fd.io/vpp/25.06/developer/corearchitecture/vlib.html. Accessed: 2025-04-21.

Bibliography 26

20. FD.IO PROJECT. *Plugins* [online]. 2025. [visited on 2025-04-21]. Available from: https://s3-docs.fd.io/vpp/25.06/developer/plugins/index.html.

- 21. FD.IO PROJECT. Add a New Plugin [online]. 2025. [visited on 2025-04-21]. Available from: https://s3-docs.fd.io/vpp/25.06/developer/plugindoc/add\_plugin.html.
- 22. FD.IO PROJECT. VPP Feature List FD.io Documentation (Release 25.02). 2025. Available also from: https://s3-docs.fd.io/vpp/25.02/aboutvpp/featurelist.html. Accessed: 2025-04-15.
- FD.IO PROJECT. FD.io VPP: High Performance, Modular, and Production Quality Software Forwarder. 2017-07. Tech. rep. FD.io. Available also from: https://fd.io/docs/whitepapers/FDioVPPwhitepaperJuly2017.pdf. Accessed: 2025-04-16.
- 24. SLAVIC, Goran; KRAJNOVIC, Nenad. Practical implementation of the vector packet processing software router. In: 2024 32nd Telecommunications Forum (TELFOR). 2024, pp. 1–4. Available from DOI: 10.1109/TELF0R63250.2024.10819057.
- ADELEKE, Oluwamayowa Ade; BASTIN, Nicholas; GURKAN, Deniz. Network Traffic Generation: A Survey and Methodology. ACM Comput. Surv. 2022, vol. 55, no. 2. ISSN 0360-0300. Available from DOI: 10.1145/3488375.
- 26. TEAM, The iPerf. *iPerf The Network Testing Tool.* 2025. Available also from: https://iperf.fr/. Accessed: 2025-04-16.
- 27. COMPUTER NETWORKING GROUP, Università degli Studi di Napoli Federico II. *D-ITG: Distributed Internet Traffic Generator Manual*. 2013. Available also from: https://traffic.comics.unina.it/software/ITG/manual/. Accessed: 2025-04-16.
- 28. CISCO SYSTEMS, Inc. TRex Traffic Generator. 2025. Available also from: https://trex-tgn.cisco.com/. Accessed: 2025-04-16.
- 29. PROJECT, DPDK. *Pktgen-DPDK* [https://pktgen-dpdk.readthedocs.io/]. 2025. Accessed: 2025-04-16.
- 30. ALLAN, David; MARTINEZ, Jordi Palet. Benchmarking Methodology for IPv6 Transition Technologies [https://datatracker.ietf.org/doc/html/rfc8219]. 2017. RFC 8219.
- 31. Benchmarking Methodology for Network Interconnect Devices [RFC 2544]. RFC Editor, 1999. Request for Comments, no. 2544. Available from DOI: 10.17487/RFC2544.
- 32. JURKIEWICZ, Piotr; RZYM, Grzegorz; BORYŁO, Piotr. Flow length and size distributions in campus Internet traffic. *Computer Communications*. 2021, vol. 167, pp. 15–30. ISSN 0140-3664. Available from DOI: https://doi.org/10.1016/j.comcom.2020.12.016.

# Obsah příloh

| / | /                                     |                          |
|---|---------------------------------------|--------------------------|
|   | readme.txtstr                         | ručný popis obsahu média |
|   | exeadresář se spustiteli              |                          |
|   | src                                   |                          |
|   | implzdr                               | ojové kódy implementace  |
|   | src<br>implzdr<br>thesiszdrojová form | a práce ve formátu LATEX |
|   | text                                  |                          |
|   | thesis ndf te                         | xt práce ve formátu PDF  |