# Purpose

This notebooks graphs the performance results of 5G core networks. The traffic for the 5G core networks is generated using a [5G core traffic generator](https://github.com/tariromukute/core-tg). The performance results are collected using a bcc and bpftrace tools.

In [1]:
# configure spark variables
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
    
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)

# load up other dependencies
import re
import pandas as pd

import glob
import matplotlib.pyplot as plt
import numpy as np

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/25 03:43:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
import os
if not os.path.exists("images"):
    os.mkdir("images")

import os
import glob
import plotly.express as px
from plotly.subplots import make_subplots
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import expr
basePath = "../results"

In [3]:
html_output_file = '../open5gs.html'
with open(html_output_file, 'w') as f:
    f.write('<h1>Open GiLAN Testbed Results</h1>')
    f.write('<h3> The graphs summaries the NFV performance metrics<h3>')
    f.write('<h4> General Workload Chacterisation </h4>')
    f.write('<a href="#system-chaterisation"> Skip to results </a></h5>')
    f.write('<p> The system run different processes depending on the applications running. The operations of these applications and their respective processes is \
        execute through system calls. There are wide range of system calls that can be run by the OS. In general the frequent types of system calls can provide \
        a general chaterisationof the workload running on the OS. The workload charaterisation is a good starting point in understanding the system or applications \
        running the system. In addition to the frequent system calls, details on the processes making syscalls is helpful in understanding the system. \
        \
        The latency of the both the system calls and the processes making system calls is a starting point in understand the latency of the system as a whole. From these \
        results we can go further to look at the performance results of the different compute resources. The chaterisations helps in knowing the compute results to focus on \
        e.g., if there is a load or read syscalls we can focus on Filesystem and cache.</p>')
    f.write('<h4> CPU  <h4>')
    f.write('<h5><a href="#cpu-metrics"> Skip to results </a></h5>')
    f.write('<p> The CPU is responsible for executing all workloads on the NFV. Like other resources, the CPU is managed by the kernel. The user-level applications access CPU resources by sending system calls to the kernel. The kernel also receives other system call requests from different processes; memory loads and stores can issue page faults system calls. The primary consumers of CPU resources are threads (also called tasks), which belong to procedures, kernel routines and interrupt routes. The kernel manages the sharing via a CPU scheduler.</p>')
    f.write('<p> There are three thread states: ON-PROC for threads running on a CPU, RUNNABLE for threads that could run but are waiting their turn, and SLEEP for blocked lines on another event, including uninterruptible waits. These can be categorised into two for more accessible analysis, on-CPU referring to ON-PROC, and off-CPU referring to all other states, where the thread is not running on a CPU. Lines leave the CPU in one of two ways: (1) voluntary if they block on I/O, a lock, or asleep, or (2) involuntary if they have exceeded their scheduled allocation of CPU time. When a CPU switches from running one process or thread to another, it switches address spaces and other metadata. This process is called context switching; it also consumes the CPU resources. All these processes, described, in general, consume the CPU time. In addition to the time, another CPU resource used by the methods, kernel routines and interrupts routines is the CPU cache.</p>')
    f.write('<p> There are typically multiple levels of CPU cache, increasing in both size and latency. The caches end with the last-level store (LLC), large (Mbytes) and slower. On a processor with three levels of supplies, the LLC is also the Level 3 cache. Processes are instructions to be interpreted and run by the CPU. This set of instructions is typically loaded from RAM and cached into the CPU cache for faster access. The CPU first checks the lower cache, i.e., L1 cache. If the CPU finds the data, this is called a hit. If the CPU does not see the data, it looks for it in L2 and then L3. If the CPU does not find the data in any memory caches, it can access it from your system memory (RAM). When that happens, it is known as a cache miss. In general, a cache miss means high latency, i.e., the time needed to access data from memory. </p>')

    f.write('<h4> Memory <h4>')
    f.write('<h5><a href="#memory-metrics"> Skip to results </a> </h5>')
    f.write('<p> The kernel and processor are responsible for mapping the virtual memory to physical memory. For efficiency, memory mappings are created in groups of memory called <em>pages</em>. When an application starts, it begins with a request for memory allocation. In the case that there is no free memory on the heap, the syscall <em>brk()</em> is issued to extend the size of the bank. However, if there is free memory on the heap, a new memory segment is created via the <em>mmap()</em> syscall. Initially, this virtual memory mapping does not have a corresponding physical memory allocation. Therefore when the application tries to access this allocated memory segment, the error called <em>page fault</em> occurs on the MMU. The kernel then handles the page fault, mapping from the virtual to physical memory. The amount of physical memory allocated to a process is called resident set size (RSS). When there is too much memory demand on the system, the kernel page-out daemon (kswapd) may look for memory pages to free. Three types of pages can be released in their order: pages that we read but not modified (backed by disk) these can be immediately rid, pages that have been modified (dirty) these need to be written to disk before they can be freed and pages of application memory (anonymous) these must be stored on a swap device before they can be released. kswapd, a page-out daemon, runs periodically to scan for inactive and active pages with no memory to free. It is woken up when free memory crosses a low threshold and goes back to sleep when it crosses a high threshold. Swapping usually causes applications to run much more slowly.</p>')

    f.write('<h4>Filesytem <h4>')
    f.write('<h5><a href="#filesystem-metrics"> Skip to results </a> </h5>')
    f.write('<p> The file system that applications usually interact with directly and file systems can use caching, read-ahead, buffering, and asynchronous I/O to avoid exposing disk I/O latency to the application. Logical I/O describes requests to the file system. If these requests must be served from the storage devices, they become physical I/O. Not all I/O will; many logical read requests may be returned from the file system cache and never become physical I/O. File systems are accessed via a virtual file system (VFS). It provides operations for reading, writing, opening, closing, etc., which are mapped by file systems to their internal functions. Linux uses multiple caches to improve the performance of storage I/O via the file system. These are Page cache: This contains virtual memory pages and enhances the performance of file and directory I/O. Inode cache, which are data structures used by file systems to describe their stored objects. The directory cache caches mappings from directory entry names to VFS inodes, improving the performance of pathname lookups. The page cache grows to be the largest of all these because it caches the contents of files and includes “dirty” pages that have been modified but not yet written to disk.</p>')

    f.write('<h4>Disk I/O <h4>')
    f.write('<h5><a href="#disk-metrics"> Skip to results </a> </h5>')
    f.write('<p> Linux exposes rotational magnetic media, flash-based storage, and network storage as storage devices. Therefore, disk I/O refers to I/O operations on these devices. Disk I/O is a common source of performance issues because I/O latency on storage devices is orders of magnitude slower than the nanosecond or microsecond speed of CPU and memory operations. Block I/O refers to device access in blocks. I/O is queued and scheduled in the block layer. The wait time is spent in the block layer scheduler queues and device dispatcher queues from the operating system. Service time is the time from device issue to completion. This may include the time spent waiting in an on-device line. Request time is the overall time from when an I/O was inserted into the OS queues to its completion. The request time matters the most, as that is the time that applications must wait if I/O is synchronous.</p>')

    f.write('<h4>Networking<h4>')
    f.write('<h5><a href="#networking-metrics"> Skip to results </a> </h5>')
    f.write('<p> Networking is a complex part of the Linux system. It involves many different layers and protocols, including the application, protocol libraries, syscalls, TCP or UDP, IP, and device drivers for the network interface. In general, the Networking system can be broken down into four. The NIC and Device Driver Processing first reads packets from the NIC and puts them into kernel buffers. Besides the NIC and Device driver, this process includes the DMA and particular memory regions on the RAM for storing receive and transmit packets called rings and the NAPI system for poling packets from these rings to the kernel buffers. It also incorporates some early packet processing hooks like XDP and AF\_XDP and can have custom drivers that bypass the kernel (i.e., the following two processes) like DPDK. Following is the Socket processing. This part also includes queuing and different queuing disciplines. It also incorporates some packet processing hooks like TC, Netfilter etc., which can alter the flow of the networking stack. After that is the  Protocol processing layer, which applies functions for different IP and transport protocols, both these protocols run under the context of SoftIrq. Lastly is the application process. The application receives and sends packets on the destination socket</p>')
    
    f.write('<h4>Flame Graphs to analyse code paths<h4>')
    f.write('<h5><a href="#flame-graphs"> Skip to results </a> </h5>')
    f.write('<p> A flame graph visualizes a distributed request trace and represents each service call that occurred during the requests execution path with a timed, color-coded, horizontal bar. Flame graphs for distributed traces include error and latency data to help developers identify and fix bottlenecks in their applications..</p>')

In [4]:
# General chaterisation
import plotly; print(plotly.__version__)

5.15.0


In [28]:
import subprocess

# Helper functions
def remove_noise_processes(df, field, values):
    a = df.loc[df[field].isin(values)].index.array.tolist()
    df.drop(a, inplace=True)
    return df

def pivot_dataframe_to_gnuplot_format(df, values, index='ues', columns='cn'):
    print(df.head())

    # Group the DataFrame by 'Country' and 'Year'
    grouped_data = df.groupby([index, columns]).sum()

    # Pivot the resulting grouped data
    pivoted_df = grouped_data.pivot_table(index='ues', columns='cn', values=values).reset_index()

    return pivoted_df

def draw_gnuplot_linepoints(df, name, title, xlabel, ylabel):
    df.to_csv(f'{name}.csv', index=False)
    print(df.columns)
    # Write the Gnuplot script
    with open(f'{name}.gnu', 'w') as f:
        f.write('set style data linespoints\n')
        f.write(f"set output '{name}.png'\n")
        f.write('set key autotitle columnhead\n')
        f.write("set datafile separator ','\n")
        f.write(f'set title "{title}"\n')
        f.write('set grid xtics ytics mytics\n')
        f.write(f'set xlabel "{xlabel}"\n')
        f.write(f'set ylabel "{ylabel}"\n')
        f.write(f'plot for [i=2:{len(df.columns)}] "{name}.csv" u 1:i t columnhead')

    # Run the Gnuplot script
    subprocess.call(['gnuplot', '-p', f'{name}.gnu'])

    # display the image on the screen
    from IPython.display import Image
    Image(filename=f'{name}.png')

labels = {
    "ues": "Number of UEs",
    "time (ms)": "Time (ms)",
    "syscall": "System calls",
    "count": "Number of calls",
    "avg": "Average time per syscall (ms)",
    "cn": "Core network"
}

In [31]:
""" This shows how the usage syscalls change as the load changes. 
(a) The number of syscalls as the traffic load increases
(b) The time spent executing syscalls as the traffic increases
(c) The average time spent per syscall as the traffic increases
This can tell us:
1. How the core network is architectures to respond to increasing load
2. Comparing can tell us the core network that sends more time on syscalls. We can use that to corellate to the performance of the core network
3. We have the details on overal performance of the core networks, we can look at the results that correlate the performance
4. Is there a general trend to syscalls that can show well architected e.g., the latency should increase as load increase etc.
If there is an ideal trend or correlation, does it match the trend of the core networks and correlate to the performance we are seeing
"""

top_n = 5

syscount_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscount = syscount_df.toPandas().groupby(['cn', 'ues']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

df_syscount['avg'] = (df_syscount['time (ms)'] / df_syscount['count'])

title='Syscalls across the system (by latency)'
syscount_fig = px.line(df_syscount, x="ues", y="time (ms)", color="cn", labels=labels,
                title=title, markers=True)
syscount_fig.show()
# syscount_fig.write_image("images/syscount_fig_m2.medium.jpeg")
gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'time (ms)')
draw_gnuplot_linepoints(gnuplot_df, name='syscount', title=title,
                        xlabel='Number of UEs', ylabel=labels['time (ms)'])

title=f'Processes making syscall (by number of calls)'
sysprocess_count_fig = px.line(df_syscount, x="ues", y="count", color="cn", labels=labels,
                hover_data=["count", "time (ms)"],
                title=title, markers=True)
sysprocess_count_fig.show()
gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'count')
draw_gnuplot_linepoints(gnuplot_df, name='syscount', title=title,
                        xlabel='Number of UEs', ylabel=labels['count'])

title=f'Processes making syscall (by number of calls)'
sysprocess_count_fig = px.line(df_syscount, x="ues", y="avg", color="cn", labels=labels,
                hover_data=["count", "time (ms)"],
                title=title, markers=True)
sysprocess_count_fig.show()
gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'avg')
draw_gnuplot_linepoints(gnuplot_df, name='syscount_avg', title=title,
                        xlabel='Number of UEs', ylabel=labels['avg'])


        cn  ues    count     time (ms)         avg
0  free5gc    0    20399  3.294732e+06  161.514367
1  free5gc    5    48316  3.749627e+06   77.606314
2  free5gc   10    62475  3.719466e+06   59.535277
3  free5gc   50   718714  3.381769e+06    4.705305
4  free5gc  100  1357965  3.279550e+06    2.415048
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')


qt.qpa.fonts: Populating font family aliases took 63 ms. Replace uses of missing font family "Sans" with one that exists to avoid this cost. 


        cn  ues    count     time (ms)         avg
0  free5gc    0    20399  3.294732e+06  161.514367
1  free5gc    5    48316  3.749627e+06   77.606314
2  free5gc   10    62475  3.719466e+06   59.535277
3  free5gc   50   718714  3.381769e+06    4.705305
4  free5gc  100  1357965  3.279550e+06    2.415048
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')


qt.qpa.fonts: Populating font family aliases took 54 ms. Replace uses of missing font family "Sans" with one that exists to avoid this cost. 


        cn  ues    count     time (ms)         avg
0  free5gc    0    20399  3.294732e+06  161.514367
1  free5gc    5    48316  3.749627e+06   77.606314
2  free5gc   10    62475  3.719466e+06   59.535277
3  free5gc   50   718714  3.381769e+06    4.705305
4  free5gc  100  1357965  3.279550e+06    2.415048
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')


qt.qpa.fonts: Populating font family aliases took 56 ms. Replace uses of missing font family "Sans" with one that exists to avoid this cost. 


In [None]:
""" A tabular view with ratios of the most sum of (latency per syscall, count per syscall and average latency of syscall). The tabular view will
1. Show us for each core network what is the ratio of a syscall precense over the other e.g., recvfrom has 4x more latency than sendto
2. Across core networks, we can compare the ratio of presence of a syscall e.g., free5gc invokes recvfrom 4x more than open5gs
3. For grouped syscalls, we can tell which flavor a given call network uses more e.g., for multiplexing syscalls, we can may see that free5gc uses
select more than epoll_wait and infer based on the relative performance of them
4. In addition to (3), for different core networks we can see that e.g., free5gc use select which is 4x more that epoll_wait being used by open5gs.
Tying this with the theory of the syscall we may be able to get the reasons for difference in performance

"""

In [80]:
""" The top X active syscalls and process per core network
We can look at:
1. The composition of the core network, the system calls that run or maintain the system
2. It can tell us what the system spends most of it's time on
3. For these we can see if they syscalls follow the 'ideal trend' of responding to traffic load

"""
top_n = 6

def top_processes(df, field):
    label_maxes = df.groupby(['comm'])[field].sum().sort_values(ascending=False)

    # Select the top n labels with the highest y-values
    top_labels = label_maxes.head(top_n).index.tolist()

    return top_labels

sysprocess_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=sysprocess")

df_process = sysprocess_df.toPandas()
df_process = remove_noise_processes(df_process, 'comm', ['python3'])
df_process['avg'] = (df_process['time (ms)'] / df_process['count'])

grouped_data = df_process.groupby(['cn'])
for group_name, group_df in grouped_data:
    top_labels = top_processes(group_df, 'time (ms)')

    sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
                x="ues", y="time (ms)", color="comm",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active processes making syscall {group_name[0]} (by latency)',
                markers=True)
    sysprocess_fig.show()

    top_labels = top_processes(group_df, 'count')

    sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
                x="ues", y="count", color="comm",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active processes making syscall {group_name[0]} (by number of calls)',
                markers=True)
    sysprocess_fig.show()

    top_labels = top_processes(group_df, 'avg')

    sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
                x="ues", y="avg", color="comm",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "avg": "Average time per syscall (ms)",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active processes making syscall {group_name[0]} (by average latency)',
                markers=True)
    sysprocess_fig.show()

In [50]:

top_n = 10

def top_syscalls(df, field):
    label_maxes = df.groupby(['syscall'])[field].sum().sort_values(ascending=False)

    # Select the top n labels with the highest y-values
    top_labels = label_maxes.head(top_n).index.tolist()

    return top_labels

syscount_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscall = syscount_df.toPandas()

df_syscall['avg'] = (df_syscall['time (ms)'] / df_syscall['count'])

grouped_data = df_syscall.groupby(['cn'])
for group_name, group_df in grouped_data:
    top_labels = top_syscalls(group_df, 'time (ms)')

    syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
                x="ues", y="time (ms)", color="syscall",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active syscalls {group_name[0]} (by latency)',
                markers=True)
    syscount_fig.show()

    top_labels = top_syscalls(group_df, 'count')

    syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
                x="ues", y="count", color="syscall",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active syscalls {group_name[0]} (by number of calls)',
                markers=True)
    syscount_fig.show()

    top_labels = top_syscalls(group_df, 'avg')

    syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
                x="ues", y="avg", color="syscall",
                hover_data=["count", "time (ms)"],
                labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Number of calls",
                     "avg": "Average time per syscall (ms)",
                     "comm": "Process name"
                },
                title=f'Top {top_n} active syscalls {group_name[0]} (by average latency)',
                markers=True)
    syscount_fig.show()

In [107]:
""" For each of the syscalls, we plot for the core networks, the amount of times and occurances. This can tell us:
1. How the core network are architected to respond to load for different operations e.g., how their socket read logic is implemented
to work and how that responses to change in traffic load
2. Relative to other core network which syscalls it uses the most. For example this can tell us the syscall that has the most differentiating factor, 
e.g., if all syscalls are relatively the same and there is a huge difference for sched_yield, then it is likely the differentiating syscall or design

"""
import pandas as pd
import plotly.graph_objs as go

def grouped_syscall_stats(df_sysprocess, writer=None):

    cn_df = df_sysprocess.groupby(['cn', 'ues']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    cn_df['avg'] = (cn_df['time (ms)'] / cn_df['count'])

    sysprocess_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="time (ms)", color="cn", 
                    hover_data=["count", "time (ms)"],
                    labels={
                        "ues": "Number of UEs",
                        "time (ms)": "Time (ms)",
                        "syscall": "System calls",
                        "count": "Number of calls",
                        "cn": "Core Network"
                    },
                    template='ggplot2',
                    title=f'Core network syscall {syscall} (by latency)',
                    markers=True)
    # sysprocess_fig.update_traces(textinfo='value')
    sysprocess_fig.show()
    sysprocess_fig.write_image(f"images/sysprocess_{syscall}_fig_m2.medium.jpeg")

    
    with open(html_output_file, 'a') as f:
        f.write(f'<h2 id="{syscall}-syscall-processes"> Processes are making {syscall} syscalls with latency information </h2>')
        f.write('<p>  </p>') 
        f.write(sysprocess_fig.to_html(full_html=False, include_plotlyjs='cdn'))
    

    sysprocess_count_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="count", color="cn",
                    hover_data=["count", "time (ms)"],
                    labels={
                        "ues": "Number of UEs",
                        "time (ms)": "Time (ms)",
                        "syscall": "System calls",
                        "count": "Number of calls",
                        "cn": "Core network"
                    },
                    title=f'Processes making {syscall} syscall (by number of calls)',
                    markers=True)
    sysprocess_count_fig.show()

    sysprocess_count_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="avg", color="cn",
                    hover_data=["count", "time (ms)"],
                    labels={
                        "ues": "Number of UEs",
                        "time (ms)": "Time (ms)",
                        "syscall": "System calls",
                        "count": "Number of calls",
                        "avg": "Average time per syscall (ms)",
                        "cn": "Core network"
                    },
                    title=f'Processes making {syscall} syscall (by average latency)',
                    markers=True)
    sysprocess_count_fig.show()

    sysprocess_count_fig.write_image(f"images/sysprocess_count_{syscall}_fig_m2.medium.jpeg")
    with open(html_output_file, 'a') as f:
        f.write(f'<h2 id="{syscall}-syscall-count-processes"> Processes are making {syscall} syscalls by number of calls</h2>')
        f.write('<p>  </p>') 
        f.write(sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn'))

    return cn_df


def grouped_processes_stats(df_sysprocess, writer=None):
    comm_df = df_sysprocess.groupby(['cn', 'comm']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    comm_df['avg'] = (comm_df['time (ms)'] / comm_df['count'])

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='time (ms)',
                  color='cn', hover_data=['count'],
                  title=f"Processes making {syscall} syscall (by average latency)")
    sunburst_fig.show()

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='count',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Processes making {syscall} syscall (by number of calls)")
    sunburst_fig.show()

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='avg',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Processes making {syscall} syscall (by average latency)")
    sunburst_fig.show()

    return comm_df

def grouped_syscall_types(syscalls, syscall_type):
    df = pd.DataFrame()
    for syscall in syscalls:
        sysprocess_df = spark.read.option("basePath", basePath).json(
            f"{basePath}/cn=*/ues=*/tool=sysprocess_{syscall}")
        df1 = sysprocess_df.toPandas()
        df1['syscall'] = syscall
        df = pd.concat([df, df1])

    print(df)
    df_syscall = remove_noise_processes(df, 'comm', noise_processes)

    syscall_df = df_syscall.groupby(['cn', 'syscall']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    syscall_df['avg'] = (syscall_df['time (ms)'] / syscall_df['count'])

    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall'], values='time (ms)',
                  color='cn', hover_data=['count'],
                  title=f"Core networks making {syscall_type} syscall (by average latency)")
    sunburst_fig.show()

    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall'], values='count',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Core networks making {syscall_type} syscall (by number of calls)")
    sunburst_fig.show()

    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall'], values='avg',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Core networks making {syscall_type} syscall (by average latency)")
    sunburst_fig.show()

def compute_grouped_stats(syscall, summary_df):
    sysprocess_df = spark.read.option("basePath", basePath).json(
    f"{basePath}/cn=*/ues=*/tool=sysprocess_{syscall}")

    df_sysprocess = sysprocess_df.toPandas()

    df1 = remove_noise_processes(df_sysprocess, 'comm', noise_processes)
    syscall_df = grouped_syscall_stats(df1, writer)

    comm_df = grouped_processes_stats(df1, writer)

    # Get the summary
    df2 = comm_df.groupby(['cn']).agg({ 'count': 'sum', 'time (ms)': 'sum', 'avg': 'sum' }).reset_index()
    
    df2['syscall'] = syscall

    summary_df = pd.concat([summary_df, df2])

# writer = pd.ExcelWriter('ActiveProcessesPerSyscall-WithoutNoiseProcesses.xlsx', engine='xlsxwriter')
writer = None
noise_processes = ['python3', 'systemd', 'snapd', 'sshd', 'sudo', 'multipathd', 'systemd-logind', 'systemd-timesyn', 'systemd-resolve', 'systemd-udevd', 'systemd-network', 'systemctl', 'accounts-daemon', 'dbus-daemon', 'mongod', 'mysqld', '[unknown]']


io_multiplex_syscalls = ['epoll_wait', 'poll', 'ppoll', 'epoll_pwait', 'select']
grouped_syscall_types(io_multiplex_syscalls, 'I/O Multiplexing')
print("Syscalls for io multiplexing")
# Run for each syscall
grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
for syscall in io_multiplex_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)     

print(grouped_io_df)

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
socket_write_syscalls = ['write', 'sendto', 'sendmsg']
grouped_syscall_types(socket_write_syscalls, 'Write')
print("Syscalls for socket write operations")
for syscall in socket_write_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)

print(grouped_io_df)

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
socket_read_syscalls = [ 'recvmsg', 'recvfrom', 'read']
grouped_syscall_types(socket_read_syscalls, 'Read')
print("Syscalls for socket read operations")
for syscall in socket_read_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)

print(grouped_io_df)

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
time_syscalls = ['clock_nanosleep', 'nanosleep']
grouped_syscall_types(time_syscalls, 'Time')
print("Syscalls for process time operations")
for syscall in time_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)

print(grouped_io_df)

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
locks_syscalls = ['futex']
grouped_syscall_types(locks_syscalls, 'Locks')
print("Syscalls for locks operations")
for syscall in locks_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)

print(grouped_io_df)

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
control_syscalls = ['sched_yield']
grouped_syscall_types(control_syscalls, 'Control operations')
print("Syscalls for control operations")
for syscall in control_syscalls:
    compute_grouped_stats(syscall, grouped_io_df)

print(grouped_io_df)

               comm  count    pid      time     time (ms)       cn  ues  \
0           python3     69  59597  14:22:10  69061.962025      oai    5   
1            mysqld     70    717  14:22:10  69061.254584      oai    5   
2           python3     68  59621  14:22:10  68059.357671      oai    5   
3           systemd    209      1  14:22:10  68025.277205      oai    5   
4   systemd-journal    208    332  14:22:10  67885.656903      oai    5   
..              ...    ...    ...       ...           ...      ...  ...   
67          python3     13  26186  21:55:50  65061.921249  free5gc    5   
68          python3      1  21169  21:19:27  70070.126055  free5gc    0   
69          python3     13  21163  21:19:27  65051.544214  free5gc    0   
70          python3      1  73929  06:55:05  70063.723470  free5gc   50   
71          python3     13  73922  06:55:05  65059.763258  free5gc   50   

                     tool     syscall  
0   sysprocess_epoll_wait  epoll_wait  
1   sysprocess_epol

Syscalls for io multiplexing


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []
                comm  count     pid      time  time (ms)       cn  ues  \
0           rsyslogd    243     599  03:28:16   7.439022      oai  300   
1          [unknown]     20  206484  03:28:16   7.123676      oai  300   
2               ausf     37  206634  03:28:16   3.279406      oai  300   
3                udr     50  206652  03:28:16   3.278574      oai  300   
4    systemd-resolve     84     547  03:28:16   1.921271      oai  300   
..               ...    ...     ...       ...        ...      ...  ...   
544          systemd      8       1  23:02:20   0.139765  free5gc   10   
545   systemd-logind      9     695  23:02:20   0.138929  free5gc   10   
546  systemd-network      2     569  23:02:20   0.080403  free5gc   10   
547        [unknown]      3   32425  23:02:20   0.024176  free5gc   10   
548          polkitd      2     691  23:02:20   0.023974  free5gc   10   

                   tool  syscall  
0   

Syscalls for socket write operations


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []
                comm  count    pid      time  time (ms)       cn  ues  \
0                udm   1502  85118  18:10:36   6.233582      oai   50   
1                amf   1113  85138  18:10:36   5.447773      oai   50   
2                udr   1242  85139  18:10:36   4.677987      oai   50   
3               ausf   1087  85117  18:10:36   4.333754      oai   50   
4      systemd-udevd    612    364  18:10:36   1.277580      oai   50   
..               ...    ...    ...       ...        ...      ...  ...   
874   systemd-logind      2    695  07:37:20   0.012431  free5gc  100   
875          udisksd      6    700  07:37:20   0.012166  free5gc  100   
876     ModemManager      6    772  07:37:20   0.009953  free5gc  100   
877  systemd-timesyn      1  43904  07:37:20   0.009161  free5gc  100   
878        [unknown]      1  79535  07:37:20   0.003784  free5gc  100   

                   tool  syscall  
0    sysprocess_

Syscalls for socket read operations


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []
          comm  count    pid      time     time (ms)       cn  ues  \
0       mysqld     69    717  21:55:03  69008.784139      oai   50   
1   multipathd     69    459  21:55:03  69005.335739      oai   50   
2         cron      1    586  21:55:03  60000.131689      oai   50   
3          nrf      4  86263  21:55:03    200.687199      oai   50   
4          smf      5  86286  21:55:03     23.503996      oai   50   
..         ...    ...    ...       ...           ...      ...  ...   
79       snapd    973    692  15:10:27    241.790472  open5gs   50   
80       snapd    619    692  16:10:07    245.406503  open5gs  100   
81       snapd    462    692  14:36:59    176.029606  open5gs   10   
82       snapd    628    692  17:54:50    169.679299  open5gs  200   
83       snapd    628    692  19:05:26    199.212320  open5gs  300   

                          tool          syscall  
0   sysprocess_clock_nanosleep  clock

Syscalls for process time operations


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []
           comm  count    pid      time      time (ms)       cn  ues  \
0        mysqld  21364    717  12:00:53  880860.637669      oai    0   
1           smf    100  47909  12:00:53  179996.486792      oai    0   
2         snapd    523    602  12:00:53  177050.697768      oai    0   
3           amf     46  47911  12:00:53  126255.179612      oai    0   
4    multipathd    209    459  12:00:53   69008.925390      oai    0   
..          ...    ...    ...       ...            ...      ...  ...   
329       snapd    302    692  17:30:54   80611.145291  open5gs  200   
330  multipathd    209    459  17:30:54   69014.730357  open5gs  200   
331    rsyslogd  22796    691  17:30:54   66553.282355  open5gs  200   
332     polkitd     12    689  17:30:54       0.075680  open5gs  200   
333     udisksd     12    698  17:30:54       0.054194  open5gs  200   

                 tool syscall  
0    sysprocess_futex   futex  

Syscalls for locks operations


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []
      comm  count    pid      time  time (ms)       cn  ues  \
0      amf   2443  18598  07:44:03  38.705375  free5gc  100   
1   mongod   7343  13372  07:44:03  17.122063  free5gc  100   
2      nrf     95  18587  07:44:03   8.285115  free5gc  100   
3      udm    324  18639  07:44:03   8.048028  free5gc  100   
4      udr    227  18620  07:44:03   6.450657  free5gc  100   
..     ...    ...    ...       ...        ...      ...  ...   
57   snapd      3  85885  03:17:44   0.031693      oai  300   
58   snapd     24    602  16:51:53   0.196251      oai   10   
59   snapd     12    692  16:05:00   0.025004  open5gs  100   
60   snapd     34    692  13:04:39   1.391539  open5gs    0   
61   snapd      1    602  15:13:17   0.037596      oai    5   

                      tool      syscall  
0   sysprocess_sched_yield  sched_yield  
1   sysprocess_sched_yield  sched_yield  
2   sysprocess_sched_yield  sched_yield  
3  

Syscalls for control operations


Empty DataFrame
Columns: [cn, count, time (ms), avg, syscall]
Index: []


In [None]:
""" For each syscall look at the processes that are making the calls
(a) Graphs
(b) Tables with the sum per latency, count and average latency
This should give us:
1. An idea of the processes making use of the most relavant syscall or the syscall we are looking at in the study
2. It will give us an ide of the relavance of these processes and making it easier for us to analysis e.g., if the rsyslog system
is the most active process per syscall, we know we need to do further work to disable logs or looking at another logging mechanism
3. 
"""

In [102]:
import plotly.io as pio

print(pio.templates.)

Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']



In [105]:
import pandas as pd
import plotly.express as px

# create a sample dataframe
data = {'year': [2010, 2010, 2011, 2011, 2012, 2012],
        'region': ['A', 'B', 'A', 'B', 'A', 'B'],
        'sales': [100, 200, 150, 250, 180, 220]}
df = pd.DataFrame(data)

# create a sample dataframe
sunburst_data = {'region': ['Asia', 'Asia', 'Europe', 'Europe', 'Africa', 'Africa'],
        'country': ['China', 'Japan', 'France', 'Germany', 'Nigeria', 'South Africa'],
        'industry': ['Technology', 'Automobiles', 'Energy', 'Manufacturing', 'Oil & Gas', 'Mining'],
        'sales': [100, 50, 75, 90, 60, 40]}
sunburs_df = pd.DataFrame(sunburst_data)


for template in ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']:
    # create a line chart using Plotly Express
        fig = px.line(df, x='year', y='sales', color='region',
                title=f"Line chart theme {template}", 
                template=template, markers=True)
        fig.show()

        fig = px.sunburst(sunburs_df, path=['region', 'country', 'industry'], values='sales',
                        title=f"Sunburst chart theme {template}", 
                        template=template)
        fig.show()
   

In [106]:
def my_theme(fig):
    #This is my own personal preferences you can create a default & pass a plotly graph_object
    #changes theme, height & width to my preferences
    fig.update_layout(template='plotly_white', width=1000, height=700)
    #I like grid lines on my x & y axis
    fig.update_xaxes(showline=False,linewidth=0.2, gridwidth=1, linecolor='white', gridcolor='lightgrey',categoryorder='total descending',color='black')
    fig.update_yaxes(showline=False,linewidth=0.2, gridwidth=1, linecolor='white', gridcolor='lightgrey')
    fig.update_traces(texttemplate='<b>%{y:0,.1f}')

#converts every plot to my default styling    
my_theme(fig)
fig.show()

In [4]:
import pandas as pd
import subprocess

# Define your data
data = {'year': [2010, 2011, 2012, 2013, 2014],
        'sales': [100, 130, 240, 350, 500],
        'purchases': [10, 13, 24, 35, 500]}

# Convert to a DataFrame and save to CSV
df = pd.DataFrame(data)



draw_gnuplot_linepoints(df, 'sample1', 'Sample graph', "Year", "Sales")

Index(['year', 'sales', 'purchases'], dtype='object')


qt.qpa.fonts: Populating font family aliases took 56 ms. Replace uses of missing font family "Sans" with one that exists to avoid this cost. 


FileNotFoundError: [Errno 2] No such file or directory: 'sample1.png'