# Purpose

This notebook graphs the performance results of 5G core networks. The traffic for the 5G core networks is generated using a [5G core traffic generator](https://github.com/tariromukute/core-tg). The performance results are collected using bcc and bpftrace tools.

In [1]:
# configure spark variables
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
    
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)

# load up other dependencies
import re
import pandas as pd

import glob
import matplotlib.pyplot as plt
import numpy as np

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/01 13:31:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
import os
import glob

import plotly.express as px
from plotly.subplots import make_subplots
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import expr

# Create every output directory that the cells below write to
for out_dir in ("images", "plotly", "gnuplot"):
    os.makedirs(out_dir, exist_ok=True)

basePath = "../results"
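
The later cells read results from paths of the form `{basePath}/cn=*/ues=*/tool=...`. With the `basePath` option set, Spark's partition discovery turns those `key=value` directory names into columns, which is why `cn` and `ues` can be used in `groupby` afterwards. The following is a minimal sketch of that path-to-column mapping in plain Python. The helper `parse_partitions` and the example file name are hypothetical and exist only for illustration; they are not part of the notebook or of Spark's API.

```python
from pathlib import PurePosixPath


def parse_partitions(path, base_path):
    """Return the key=value directory components below base_path as a dict."""
    relative = PurePosixPath(path).relative_to(base_path)
    columns = {}
    for part in relative.parts:
        if "=" in part:
            key, value = part.split("=", 1)
            # Mimic type inference for integer-looking partition values only
            columns[key] = int(value) if value.isdigit() else value
    return columns


# Hypothetical file name below the notebook's basePath
example = "../results/cn=free5gc/ues=100/tool=syscount/part-0000.json"
print(parse_partitions(example, "../results"))
```

In this sketch the file name itself is ignored because it contains no `=`; only the directory components contribute columns.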

In [3]:
html_output_file = '../general.html'

with open(html_output_file, 'w') as f:
    f.write('<head>')
    f.write('<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-T3c6CoIi6uLrA9TneNEoa7RxnatzjcDSCmG1MXxSR1GAsXEV/Dwwykc2MPK8M2HN" crossorigin="anonymous">')
    f.write('<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/js/bootstrap.bundle.min.js" integrity="sha384-C6RzsynM9kWDrMNeT87bh95OGNyZPhcTNXj1NW7RuBCsyN/o0jlpcV8Qyq46cDfL" crossorigin="anonymous"></script>')
    f.write('</head>')

# Append from here on: reopening the file with 'w' would truncate the <head> written above
with open(html_output_file, 'a') as f:
    f.write('<h1>Performance evaluation of open-source 5G Core networks</h1>')
    f.write('<h4> General Workload Characterisation </h4>')

    f.write('<p> Virtual Network Functions (VNFs) are network functions that run on virtualised platforms. Virtualisation allows \
NFs to run on commercial off-the-shelf infrastructure. There are different virtualisation approaches, the most \
common being hypervisor virtualisation and OS-level virtualisation [? ], [? ]. Regardless of the manner of \
virtualisation employed, in most cases, a VNF is a piece of software that runs on top of the Linux kernel, and this \
degrades its performance compared to the hardware NF [? ]</p>')
    f.write('<p> To gain a better understanding of how virtualisation affects performance and how the software architecture \
of VNFs affects the performance of the 5G core networks, it is necessary to dive deeper into the underlying \
operations of the Linux kernel. A standard Linux system is based on the monolithic kernel architecture, which \
divides the system into two main components: kernel-space and user-space, which operate at different privilege \
levels. The kernel-space is responsible for managing the hardware and software components of the system and \
accounts for independent operations of programs. The user-space contains and executes user-defined programs \
and data. These two spaces are kept separate from each other, and communication between them is only possible \
through secure system calls. This concept allows the Linux kernel to abstract away the complexities of interacting \
directly with the hardware, which requires specialised communication</p>')
    f.write('<p> The system calls provide abstract APIs for various functionalities, including process management, memory \
management, file systems, device drivers, and networking. The kernel manages and executes these system calls \
by interfacing with the software or hardware support modules, as described and illustrated in [? ]. Of importance \
to the scope of this study is process manage</p>')
    f.write('<h5><a href="#syscalls-across-system"> Syscalls across the system </a> </h5>')
    f.write('<h5><a href="#process-across-system">Processes making syscalls </a> </h5>')
    f.write('<h5><a href="#io-multiplexing">epoll/poll/select </a> </h5>')
    f.write('<h5><a href="#read-write">read/write </a> </h5>')
    f.write('<h5><a href="#recv">recv, recvfrom, recvmsg, recvmmsg </a> </h5>')
    f.write('<h5><a href="#send">send, sendto, sendmsg, sendmmsg </a> </h5>')
    f.write('<h5><a href="#sleep">nanosleep/clock_nanosleep </a> </h5>')
    f.write('<h5><a href="#futex">futex </a> </h5>')
    f.write('<h5><a href="#sched_yield">sched_yield </a> </h5>')
    f.write('<h5><a href="#free5gc">free5GC: by process analysis </a> </h5>')
    f.write('<h5><a href="#open5gs">Open5GS: by process analysis </a> </h5>')
    f.write('<h5><a href="#oai">OAI: by process analysis </a> </h5>')
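
The report is built up incrementally: this cell creates the file and the later cells append their sections with `open(html_output_file, 'a')`. The sketch below shows that pattern in isolation, assuming a temporary file instead of `../general.html`. The helper names `start_report` and `append_section` are hypothetical. It also illustrates how the table-of-contents links resolve: a link such as `href="#futex"` jumps to the element whose `id` is `futex`, without a leading `#`.

```python
import os
import tempfile


def start_report(path, title):
    # 'w' truncates the file, so use it exactly once, for the document head
    with open(path, "w") as f:
        f.write(f"<head><title>{title}</title></head>")


def append_section(path, html):
    # 'a' keeps what is already in the file and adds to the end
    with open(path, "a") as f:
        f.write(html)


report = os.path.join(tempfile.mkdtemp(), "general.html")
start_report(report, "5G core evaluation")
append_section(report, '<h5><a href="#futex">futex</a></h5>')
append_section(report, '<h2 id="futex">futex</h2>')

with open(report) as f:
    content = f.read()
print(content)
```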
    

In [4]:
# General characterisation
import plotly; print(plotly.__version__)

5.15.0


In [5]:
import subprocess
import os

# Helper functions
def remove_noise_processes(df, field, values):
    """Drop rows whose `field` is in `values`. Mutates `df` in place and returns it."""
    noise_index = df.loc[df[field].isin(values)].index
    df.drop(noise_index, inplace=True)
    return df

def pivot_dataframe_to_gnuplot_format(df, values, index='ues', columns='cn'):
    """Return a wide frame: one row per `index` value, one column per `columns` value."""

    # Sum the measurements for each (index, columns) pair
    grouped_data = df.groupby([index, columns]).sum()

    # Pivot the grouped data so that each `columns` value becomes its own column
    pivoted_df = grouped_data.pivot_table(index=index, columns=columns, values=values).reset_index()

    return pivoted_df

def draw_gnuplot_linepoints(df, name, title, xlabel, ylabel):
    df.to_csv(f'gnuplot/{name}.csv', index=False)
    print(df.columns)
    # Write the Gnuplot script
    with open(f'gnuplot/{name}.gnu', 'w') as f:
        f.write('set style data linespoints\n')
        f.write('set term png\n')
        f.write(f"set output '{name}.png'\n")
        # f.write('set key outside left bottom horizontal spacing 1 width 2 height 1.5\n')
        f.write('set key noenhanced\n')
        f.write('set key top left\n')
        f.write('set key autotitle columnhead\n')
        f.write("set datafile separator ','\n")
        f.write(f'set title "{title}"\n')
        f.write('set grid xtics ytics mytics\n')
        f.write(f'set xlabel "{xlabel}"\n')
        f.write(f'set ylabel "{ylabel}"\n')
        f.write(' # Create theme \n \
        dpi = 600 ## dpi (variable) \n \
        width = 164.5 ## mm (variable) \n \
        height = 100 ## mm (variable) \n \
        \n \
        in2mm = 25.4 # mm (fixed) \n \
        pt2mm = 0.3528 # mm (fixed) \n \
        \n \
        mm2px = dpi/in2mm \n \
        ptscale = pt2mm*mm2px \n \
        round(x) = x - floor(x) < 0.5 ? floor(x) : ceil(x) \n \
        wpx = round(width * mm2px) \n \
        hpx = round(height * mm2px) \n \
        \n \
        set terminal pngcairo size wpx,hpx fontscale ptscale/1.4 linewidth ptscale pointscale ptscale \n \
        \n \
        colors = "blue red green brown black magenta orange purple sienna1 slategray tan1 yellow turquoise orchid khaki" \n ')
        f.write(f'plot for [i=2:{len(df.columns)}] "{name}.csv" u 1:i t columnhead lc rgb word(colors, i-1)')

    # Run the Gnuplot script
    
    # Relative path of the desired working directory
    relative_dir_path = 'gnuplot'
    
    # Get the absolute path of the working directory
    curr_dir = os.getcwd()
    
    # Create the full path to the desired directory
    my_dir_path = os.path.join(curr_dir, relative_dir_path)
    
    subprocess.call(['gnuplot', '-p', f'{name}.gnu'],  cwd=my_dir_path)
    # Change current working directory to 'mydir'
    # os.chdir('./gnuplot')

    # # Execute command 'mycommand' in the new directory
    # os.system('gnuplot -p cn_perf_ue_avg_exp.gnu')

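    # Display the rendered image in the notebook; a bare Image(...) expression
    # inside a function is not shown, so pass it to display() explicitly
    from IPython.display import Image, display
    display(Image(filename=f'gnuplot/{name}.png'))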

labels = {
    "ues": "Number of UEs",
    "time (ms)": "Time (ms)",
    "syscall": "System calls",
    "count": "Number of calls",
    "avg": "Average time per syscall (ms)",
    "avg_duration": "Time (ms)",
    "cn": "Core network",
    "rct": "Rate of change of number of calls"
}

noise_processes_excl_db = ['python3', 'systemd', 'snapd', 'sshd', 'sudo', 'multipathd', 'systemd-logind', 'systemd-timesyn', 'systemd-resolve', 'systemd-udevd', 'systemd-network', 'systemctl', 'accounts-daemon', 'dbus-daemon', '[unknown]']
noise_processes = noise_processes_excl_db + ['mongod', 'mysqld']
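
The theme block that `draw_gnuplot_linepoints` writes sizes the PNG in millimetres and converts to pixels inside gnuplot. The same arithmetic is worked through below in Python, using the values hard-coded in that script (600 dpi, 164.5 mm by 100 mm). The Python names mirror the gnuplot variables and are for illustration only.

```python
import math

dpi = 600          # dots per inch, as in the gnuplot theme
width_mm = 164.5   # figure width in millimetres
height_mm = 100    # figure height in millimetres
in2mm = 25.4       # millimetres per inch
pt2mm = 0.3528     # millimetres per typographic point


def gp_round(x):
    # Same rule as the gnuplot helper: fractions below 0.5 round down, others up
    return math.floor(x) if x - math.floor(x) < 0.5 else math.ceil(x)


mm2px = dpi / in2mm        # pixels per millimetre
ptscale = pt2mm * mm2px    # pixels per point, used to scale fonts and lines
wpx = gp_round(width_mm * mm2px)
hpx = gp_round(height_mm * mm2px)

print(wpx, hpx, round(ptscale, 3))
```

With these inputs the canvas comes out at 3886 by 2362 pixels, which is what the `set terminal pngcairo size wpx,hpx` line receives.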

In [6]:
""" This shows how syscall usage changes as the load changes.
(a) The number of syscalls as the traffic load increases
(b) The time spent executing syscalls as the traffic increases
(c) The average time spent per syscall as the traffic increases
This can tell us:
1. How the core network is architected to respond to increasing load
2. Which core network spends more time on syscalls, which we can then correlate with the performance of that core network
3. Since we have details on the overall performance of the core networks, we can look for results that correlate with that performance
4. Whether there is a general syscall trend that indicates a well-architected core network, e.g., latency should increase as load increases
If there is an ideal trend or correlation, does it match the trend of the core networks and correlate with the performance we are seeing?
"""

html = ""

html += '</br>' + '<h2 id="syscalls-across-system">Syscalls across the system.</h2>'
html += '</br>' + '<p>Analysing system calls (syscalls) across the system helps in categorising \
the workload of the system. This information is valuable in identifying the hardware resources that require \
optimisation, such as installing an accelerated network interface card (NIC) or a cryptographic accelerator.</p>'

top_n = 5

syscount_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscount = syscount_df.toPandas().groupby(['cn', 'ues']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

df_syscount['avg'] = (df_syscount['time (ms)'] / df_syscount['count'])

title='Syscalls across the system (by latency)'
syscount_fig = px.line(df_syscount, x="ues", y="time (ms)", color="cn", labels=labels,
                title=title, markers=True)
# syscount_fig.show()
syscount_fig.write_image("./plotly/syscount_latency.jpeg")
html += '</br>' + syscount_fig.to_html(full_html=False, include_plotlyjs='cdn')

gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'time (ms)')
draw_gnuplot_linepoints(gnuplot_df, name='syscount_latency', title=title,
                        xlabel='Number of UEs', ylabel=labels['time (ms)'])

title=f'Syscalls across the system (by number of calls)'
sysprocess_count_fig = px.line(df_syscount, x="ues", y="count", color="cn", labels=labels,
                hover_data=["count", "time (ms)"],
                title=title, markers=True)
# sysprocess_count_fig.show()
sysprocess_count_fig.write_image("plotly/syscount_count.jpeg")
html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'count')
draw_gnuplot_linepoints(gnuplot_df, name='syscount_count', title=title,
                        xlabel='Number of UEs', ylabel=labels['count'])

title=f'Syscalls across the system (by average latency)'
sysprocess_count_fig = px.line(df_syscount, x="ues", y="avg", color="cn", labels=labels,
                hover_data=["count", "time (ms)"],
                title=title, markers=True)
# sysprocess_count_fig.show()
sysprocess_count_fig.write_image("plotly/syscount_avg.jpeg")
html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

gnuplot_df = pivot_dataframe_to_gnuplot_format(df_syscount, 'avg')
draw_gnuplot_linepoints(gnuplot_df, name='syscount_avg', title=title,
                        xlabel='Number of UEs', ylabel=labels['avg'])

with open(html_output_file, 'a') as f:
    f.write(html)


Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')
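
The `Index([...])` lines printed above are the columns of the wide table handed to gnuplot: one row per UE count and one column per core network. The sketch below reproduces the three steps of this cell (sum per core network and UE count, divide time by count, pivot to wide form) in plain Python. The input rows are made up for illustration and are not measured results.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format rows: one per (core network, UE count, syscall)
rows = [
    {"cn": "free5gc", "ues": 0, "count": 10, "time (ms)": 5.0},
    {"cn": "free5gc", "ues": 0, "count": 30, "time (ms)": 15.0},
    {"cn": "oai", "ues": 0, "count": 20, "time (ms)": 40.0},
    {"cn": "free5gc", "ues": 5, "count": 50, "time (ms)": 100.0},
    {"cn": "oai", "ues": 5, "count": 25, "time (ms)": 50.0},
]

# Step 1: sum count and time per (cn, ues)
totals = defaultdict(lambda: {"count": 0, "time (ms)": 0.0})
for r in rows:
    key = (r["cn"], r["ues"])
    totals[key]["count"] += r["count"]
    totals[key]["time (ms)"] += r["time (ms)"]

# Step 2: average time per call = total time / total number of calls
avg = {key: t["time (ms)"] / t["count"] for key, t in totals.items()}

# Step 3: pivot to wide form, one row per UE count, one column per core network
networks = sorted({cn for cn, _ in avg})
ue_levels = sorted({ues for _, ues in avg})
wide = [[ues] + [avg[(cn, ues)] for cn in networks] for ues in ue_levels]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ues"] + networks)
writer.writerows(wide)
print(out.getvalue())
```

Note that the average here is a ratio of sums, so frequently called syscalls weigh more than rare ones. Averaging the per-syscall averages would give a different number.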


In [7]:
""" Show the system calls that are called for each core network. This is aggregated for all UEs and not broken down by UE

This can tell us:
1. The system calls most involved and their proportional sizes per core network
2. The core network with the most active system calls and their proportional sizes
"""

html = ""

html += '</br>' + '<h2 id="syscalls-across-system-sunburst">Syscalls across the system.</h2>'

top_n = 5

syscount_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscount = syscount_df.toPandas().groupby(['cn', 'syscall']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

title="Syscalls per core network (by latency)"
sunburst_fig = px.sunburst(df_syscount, path=['cn', 'syscall'], values='time (ms)',
            color='cn', hover_data=['time (ms)'],
            title=title)

sunburst_fig.update_traces(textinfo="label+percent root")
# sunburst_fig.show()
sunburst_fig.write_image(f"plotly/grouped_syscount_latency.jpeg")
html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

title="Syscalls per core network (by number of calls)"
sunburst_fig = px.sunburst(df_syscount, path=['cn', 'syscall'], values='count',
            color='cn', hover_data=['time (ms)'],
            title=title)

sunburst_fig.update_traces(textinfo="label+percent root")
# sunburst_fig.show()
sunburst_fig.write_image(f"plotly/grouped_syscount_count.jpeg")
html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

with open(html_output_file, 'a') as f:
    f.write(html)
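
The sunburst charts above are labelled with `textinfo="label+percent root"`, so every sector shows its share of the whole chart rather than of its parent. The sketch below works through that calculation, assuming the root total is the sum over all core networks. The latencies and the pairing of syscalls with core networks are hypothetical, not measured.

```python
# Hypothetical aggregated latencies (ms) per (core network, syscall)
latency = {
    ("free5gc", "futex"): 600.0,
    ("free5gc", "epoll_pwait"): 200.0,
    ("open5gs", "epoll_wait"): 150.0,
    ("open5gs", "recvmsg"): 50.0,
}

# Assumption: the root is the total over every core network
root_total = sum(latency.values())

# Inner ring: one sector per core network, sized by the sum of its syscalls
per_cn = {}
for (cn, _), ms in latency.items():
    per_cn[cn] = per_cn.get(cn, 0.0) + ms


def percent_of_root(value):
    return 100.0 * value / root_total


for cn, ms in per_cn.items():
    print(f"{cn}: {percent_of_root(ms):.1f}%")
for (cn, syscall), ms in latency.items():
    print(f"{cn}/{syscall}: {percent_of_root(ms):.1f}%")
```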


In [8]:
""" A tabular view with ratios of the sums of (latency per syscall, count per syscall and average latency of syscall). The tabular view will
1. Show us, for each core network, the ratio of one syscall's presence to another's, e.g., recvfrom has 4x more latency than sendto
2. Across core networks, let us compare the ratio of presence of a syscall, e.g., free5gc invokes recvfrom 4x more than open5gs
3. For grouped syscalls, tell us which flavor a given core network uses more, e.g., for multiplexing syscalls, we may see that free5gc uses
select more than epoll_wait and infer based on their relative performance
4. In addition to (3), across core networks we may see that, e.g., free5gc uses select 4x more than open5gs uses epoll_wait.
Tying this to the theory of the syscall, we may be able to find the reasons for the difference in performance

"""

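
The cell above only describes the ratio table. One way the two kinds of ratio could be computed is sketched below: within a core network (one syscall against another) and across core networks (the same syscall in each). The totals are made up so that the ratios match the "4x" wording of the description. The function names are hypothetical.

```python
# Hypothetical per-syscall totals for two core networks (not measured data)
totals = {
    "free5gc": {
        "recvfrom": {"count": 400, "time (ms)": 80.0},
        "sendto": {"count": 100, "time (ms)": 20.0},
    },
    "open5gs": {
        "recvfrom": {"count": 100, "time (ms)": 30.0},
        "sendto": {"count": 100, "time (ms)": 10.0},
    },
}


def ratio_within(cn, syscall_a, syscall_b, field):
    """How many times larger `field` is for syscall_a than syscall_b in one core network."""
    return totals[cn][syscall_a][field] / totals[cn][syscall_b][field]


def ratio_across(cn_a, cn_b, syscall, field):
    """How many times larger `field` is for `syscall` in cn_a than in cn_b."""
    return totals[cn_a][syscall][field] / totals[cn_b][syscall][field]


# recvfrom latency relative to sendto latency inside free5gc
print(ratio_within("free5gc", "recvfrom", "sendto", "time (ms)"))
# recvfrom call count in free5gc relative to open5gs
print(ratio_across("free5gc", "open5gs", "recvfrom", "count"))
```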

In [9]:
""" The top X active processes per core network
We can look at:
1. The composition of the core network, the system calls that run or maintain the system
2. It can tell us what the system spends most of its time on
3. For these we can see if their syscalls follow the 'ideal trend' of responding to traffic load

"""
html = ""

html += '</br>' + '<h2 id="process-across-system">Processes making syscalls</h2>'
html += '</br>' + '<p>The information about the processes that make system calls provides valuable \
insights into the most active processes during the registration procedure. By observing the changes in latency \
and frequency of the system calls made by a process as the number of UEs increases, we can identify processes \
that have a high probability of becoming bottlenecks. The information can be used to make several mitigation \
decisions, such as allocating more resources or dedicated resources to a given process or Network Function (NF), \
optimising the usage by the NF or process, and examining the configuration of the process, among other things.</p>'
top_n = 6

def top_processes(df, field):
    label_maxes = df.groupby(['comm'])[field].sum().sort_values(ascending=False)

    # Select the top n labels with the highest y-values
    top_labels = label_maxes.head(top_n).index.tolist()

    return top_labels

sysprocess_df = spark.read.option("basePath", basePath).json(
f"{basePath}/cn=*/ues=*/tool=sysprocess")

df_sysprocess = sysprocess_df.toPandas()
df_process = remove_noise_processes(df_sysprocess, 'comm', noise_processes_excl_db)
df_process['avg'] = (df_process['time (ms)'] / df_process['count'])

top_labels = top_processes(df_process, 'time (ms)')

sunburst_fig = px.sunburst(df_process[df_process['comm'].isin(top_labels)], path=['cn', 'comm'], values='time (ms)',
            color='cn', hover_data=['time (ms)'],
            title=f"Processes making syscall (by latency)")
sunburst_fig.update_traces(textinfo="label+percent root")
# sunburst_fig.show()
sunburst_fig.write_image(f"plotly/grouped_sysprocess_latency.jpeg")
html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

top_labels = top_processes(df_process, 'count')

sunburst_fig = px.sunburst(df_process[df_process['comm'].isin(top_labels)], path=['cn', 'comm'], values='count',
            color='cn', hover_data=['time (ms)'],
            title=f"Processes making syscall (by number of calls)")
sunburst_fig.update_traces(textinfo="label+percent root")
# sunburst_fig.show()
sunburst_fig.write_image(f"plotly/grouped_sysprocess_count.jpeg")
html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

top_labels = top_processes(df_process, 'avg')

sunburst_fig = px.sunburst(df_process[df_process['comm'].isin(top_labels)], path=['cn', 'comm'], values='avg',
            color='cn', hover_data=['time (ms)'],
            title=f"Processes making syscall (by average latency)")
sunburst_fig.update_traces(textinfo="label+percent root")
# sunburst_fig.show()
sunburst_fig.write_image(f"plotly/grouped_sysprocess_avg.jpeg")
html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

# Create line graphs for all processes per core network
top_labels = top_processes(df_process, 'count')
df_cn_process = remove_noise_processes(df_sysprocess, 'comm', noise_processes)
df_cn_process = df_cn_process.groupby(['ues', 'cn']).agg({ 'count': 'sum', 'time (ms)': 'sum'}).reset_index()

title=f'Core networks: Top {top_n} active processes making syscall (by number of calls)'
sysprocess_fig = px.line(df_cn_process.sort_values('ues'),
        x="ues", y="count", color="cn",
        hover_data=["count", "time (ms)"],
        labels=labels,
        title=title,
        markers=True)
# sysprocess_fig.show()
sysprocess_fig.write_image(f"plotly/core_network_sum_sysprocess_count.jpeg")
html += '</br>' + sysprocess_fig.to_html(full_html=False, include_plotlyjs='cdn')

gnuplot_df = pivot_dataframe_to_gnuplot_format(df_cn_process, 'count', index='ues', columns='cn')
draw_gnuplot_linepoints(gnuplot_df, name=f'core_network_sum_sysprocess_count', title=title,
                xlabel='Number of UEs', ylabel=labels['count'])


# html += '</br>' + "<h4>Performance per process</h4>"
# For the active processes remove databases
df_process = remove_noise_processes(df_sysprocess, 'comm', noise_processes)
df_process['avg'] = (df_process['time (ms)'] / df_process['count'])

grouped_data = df_process.groupby(['cn'])
for group_name, group_df in grouped_data:
     html += '</br>' + f'<h5>Process {group_name[0]}</h5>'
    #  top_labels = top_processes(group_df, 'time (ms)')

    #  title=f'{group_name[0]}: Top {top_n} active processes making syscall (by latency)'
    #  sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
    #             x="ues", y="time (ms)", color="comm",
    #             hover_data=["count", "time (ms)"],
    #             labels=labels,
    #             title=title,
    #             markers=True)
    #  sysprocess_fig.show()
    #  sysprocess_fig.write_image(f"plotly/{group_name[0]}_sysprocess_latency.jpeg")
    #  html += '</br>' + sysprocess_fig.to_html(full_html=False, include_plotlyjs='cdn')

    #  gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['comm'].isin(top_labels)], 'count', index='ues', columns='comm')
    #  draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_sysprocess_latency', title=title,
    #                     xlabel='Number of UEs', ylabel=labels['time (ms)'])

     top_labels = top_processes(group_df, 'count')

     title=f'{group_name[0]}: Top {top_n} active processes making syscall (by number of calls)'
     sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
                x="ues", y="count", color="comm",
                hover_data=["count", "time (ms)"],
                labels=labels,
                title=title,
                markers=True)
    #  sysprocess_fig.show()
     sysprocess_fig.write_image(f"plotly/{group_name[0]}_sysprocess_count.jpeg")
     html += '</br>' + sysprocess_fig.to_html(full_html=False, include_plotlyjs='cdn')

     gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['comm'].isin(top_labels)], 'count', index='ues', columns='comm')
     draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_sysprocess_count', title=title,
                        xlabel='Number of UEs', ylabel=labels['count'])
     
    #  top_labels = top_processes(group_df, 'avg')

    #  title = f'{group_name[0]}: Top {top_n} active processes making syscall (by average latency)'
    #  sysprocess_fig = px.line(group_df[group_df['comm'].isin(top_labels)].sort_values('ues'),
    #             x="ues", y="avg", color="comm",
    #             hover_data=["count", "time (ms)"],
    #             labels=labels,
    #             title=title,
    #             markers=True)
    #  sysprocess_fig.show()
    #  sysprocess_fig.write_image(f"plotly/{group_name[0]}_sysprocess_avg.jpeg")
    #  html += '</br>' + sysprocess_fig.to_html(full_html=False, include_plotlyjs='cdn')

    #  gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['comm'].isin(top_labels)], 'count', index='ues', columns='comm')
    #  draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_sysprocess_avg', title=title,
    #                     xlabel='Number of UEs', ylabel=labels['avg'])
     
with open(html_output_file, 'a') as f:
    f.write(html)

Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')
Index(['ues', 'amf', 'ausf', 'nrf', 'pcf', 'udm', 'udr'], dtype='object', name='comm')
Index(['ues', 'amf', 'ausf', 'nrf', 'systemd-journal', 'udm', 'udr'], dtype='object', name='comm')
Index(['ues', 'open5gs-amfd', 'open5gs-ausfd', 'open5gs-scpd', 'open5gs-udmd',
       'rsyslogd', 'systemd-journal'],
      dtype='object', name='comm')
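
The column lists printed above are the outcome of two steps: background processes are dropped first, and the remaining processes are then ranked by their summed value so that only the top `top_n` are plotted. The sketch below shows that selection in plain Python. The samples are hypothetical and the noise set is a small subset of the notebook's list.

```python
from collections import Counter

# Hypothetical (process name, syscall count) samples
samples = [
    ("amf", 500), ("udm", 300), ("sshd", 900), ("amf", 250),
    ("nrf", 100), ("ausf", 200), ("systemd", 700), ("udr", 50),
]
noise = {"sshd", "systemd"}  # subset of the notebook's noise list


def top_processes(samples, noise, n):
    totals = Counter()
    for comm, count in samples:
        if comm not in noise:      # drop background processes first
            totals[comm] += count  # then sum per process
    # most_common returns the entries sorted by total, highest first
    return [comm for comm, _ in totals.most_common(n)]


print(top_processes(samples, noise, 3))
```

Filtering before ranking matters: without it, `sshd` and `systemd` would take two of the three slots in this example.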


In [10]:
df_cn_process = df_process.groupby(['ues', 'cn']).agg({ 'count': 'sum', 'time (ms)': 'sum'}).reset_index()
print(df_cn_process.head())

   ues       cn   count     time (ms)
0    0  free5gc    9990  1.291965e+06
1    0      oai  284715  9.058338e+05
2    0  open5gs   11252  3.279446e+06
3    5  free5gc   34966  1.618149e+06
4    5      oai  283691  7.738424e+05
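
The `labels` dictionary defines an `rct` entry ("rate of change" of the number of calls), but this section does not show how it is computed. The sketch below assumes one plausible definition, the change in call count per additional UE between consecutive load levels. It uses the `free5gc` and `oai` rows from the table printed above. The function name and the definition itself are assumptions, not the notebook's method.

```python
# (ues, total syscall count) pairs copied from the table printed above
free5gc = [(0, 9990), (5, 34966)]
oai = [(0, 284715), (5, 283691)]


def rate_of_change(points):
    """Assumed definition: change in call count per additional UE between load levels."""
    rates = []
    for (ues_a, count_a), (ues_b, count_b) in zip(points, points[1:]):
        rates.append((count_b - count_a) / (ues_b - ues_a))
    return rates


print(rate_of_change(free5gc))
print(rate_of_change(oai))
```

Under this definition a positive value means the call count grows with load, and a value near zero or below suggests the count is dominated by work that does not depend on the number of UEs.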


In [11]:
# """ The top X active syscalls per core network

# """
# html = ""

# top_n = 10

# def top_syscalls(df, field):
#     label_maxes = df.groupby(['syscall'])[field].sum().sort_values(ascending=False)

#     # Select the top n labels with the highest y-values
#     top_labels = label_maxes.head(top_n).index.tolist()

#     return top_labels

# syscount_df = spark.read.option("basePath", basePath).json(
# f"{basePath}/cn=*/ues=*/tool=syscount")

# df_syscall = syscount_df.toPandas()

# df_syscall['avg'] = (df_syscall['time (ms)'] / df_syscall['count'])

# grouped_data = df_syscall.groupby(['cn'])
# for group_name, group_df in grouped_data:
#      top_labels = top_syscalls(group_df, 'time (ms)')

#      f'Top {top_n} active syscalls {group_name[0]} (by latency)'
#      syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
#                     x="ues", y="time (ms)", color="syscall",
#                     hover_data=["count", "time (ms)"],
#                     labels=labels,
#                     title=title,
#                     markers=True)
#      syscount_fig.show()
#      syscount_fig.write_image(f"plotly/{group_name[0]}_top_syscall_latency.jpeg")
#      html += '</br>' + syscount_fig.to_html(full_html=False, include_plotlyjs='cdn')
     
#      gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['syscall'].isin(top_labels)], 'time (ms)', index='ues', columns='syscall')
#      draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_top_syscall_latency.jpeg', title=title,
#                         xlabel='Number of UEs', ylabel=labels['time (ms)'])
     
#      top_labels = top_syscalls(group_df, 'count')

#      title=f'Top {top_n} active syscalls {group_name[0]} (by number of calls)'
#      syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
#                     x="ues", y="count", color="syscall",
#                     hover_data=["count", "time (ms)"],
#                     labels=labels,
#                     title=title,
#                     markers=True)
#      syscount_fig.show()
#      syscount_fig.write_image(f"plotly/{group_name[0]}_top_syscall_count.jpeg")
#      html += '</br>' + syscount_fig.to_html(full_html=False, include_plotlyjs='cdn')

#      gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['syscall'].isin(top_labels)], 'count', index='ues', columns='syscall')
#      draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_top_syscall_count.jpeg', title=title,
#                         xlabel='Number of UEs', ylabel=labels['count'])

#      top_labels = top_syscalls(group_df, 'avg')

#      title=f'Top {top_n} active syscalls {group_name[0]} (by average latency)'
#      syscount_fig = px.line(group_df[group_df['syscall'].isin(top_labels)].sort_values('ues'),
#                     x="ues", y="avg", color="syscall",
#                     hover_data=["count", "time (ms)"],
#                     labels=labels,
#                     title=title,
#                     markers=True)
#      syscount_fig.show()
#      syscount_fig.write_image(f"plotly/{group_name[0]}_top_syscall_avg.jpeg")
#      html += '</br>' + syscount_fig.to_html(full_html=False, include_plotlyjs='cdn')

#      gnuplot_df = pivot_dataframe_to_gnuplot_format(group_df[group_df['syscall'].isin(top_labels)], 'avg', index='ues', columns='syscall')
#      draw_gnuplot_linepoints(gnuplot_df, name=f'{group_name[0]}_top_syscall_avg.jpeg', title=title,
#                         xlabel='Number of UEs', ylabel=labels['avg'])
     
# with open(html_output_file, 'a') as f:
#     f.write(html)

In [12]:
""" For each of the syscalls, we plot, for the core networks, the time spent and the number of occurrences.

This can tell us:
1. How the core networks are architected to respond to load for different operations, e.g., how their socket read logic is implemented
and how it responds to changes in traffic load
2. Relative to the other core networks, which syscalls a core network uses the most. For example, this can point to the most differentiating syscall,
e.g., if all syscalls are relatively the same and there is a huge difference for sched_yield, then it is likely the differentiating syscall or design

"""
import pandas as pd
import plotly.graph_objs as go


def grouped_syscall_stats(df_sysprocess, syscall, writer=None):
    html = ""

    cn_df = df_sysprocess.groupby(['cn', 'ues']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    cn_df['avg'] = (cn_df['time (ms)'] / cn_df['count'])

    title=f'Core network syscall {syscall} (by latency)'
    sysprocess_count_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="time (ms)", color="cn", 
                    hover_data=["count", "time (ms)"],
                    labels=labels,
                    title=title,
                    markers=True)
    # sysprocess_count_fig.show()
    sysprocess_count_fig.write_image(f"plotly/core_network_on_{syscall}_latency.jpeg")
    html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

    # Plot latency here, matching the file name and y-axis label of this graph
    gnuplot_df = pivot_dataframe_to_gnuplot_format(cn_df, 'time (ms)', index='ues', columns='cn')
    draw_gnuplot_linepoints(gnuplot_df, name=f'core_network_on_{syscall}_latency', title=title,
                        xlabel='Number of UEs', ylabel=labels['time (ms)'])
    
    title=f'Core network syscall {syscall} (by number of calls)'
    sysprocess_count_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="count", color="cn",
                    hover_data=["count", "time (ms)"],
                    labels=labels,
                    title=title,
                    markers=True)
    # sysprocess_count_fig.show()
    sysprocess_count_fig.write_image(f"plotly/core_network_on_{syscall}_count.jpeg")
    html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

    gnuplot_df = pivot_dataframe_to_gnuplot_format(cn_df, 'count', index='ues', columns='cn')
    draw_gnuplot_linepoints(gnuplot_df, name=f'core_network_on_{syscall}_count', title=title,
                        xlabel='Number of UEs', ylabel=labels['count'])

    title=f'Core network syscall {syscall} (by average latency)'
    sysprocess_count_fig = px.line(cn_df.sort_values('ues'),
                    x="ues", y="avg", color="cn",
                    hover_data=["count", "time (ms)"],
                    labels=labels,
                    title=title,
                    markers=True)
    # sysprocess_count_fig.show()
    sysprocess_count_fig.write_image(f"plotly/core_network_on_{syscall}_avg.jpeg")
    html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

    gnuplot_df = pivot_dataframe_to_gnuplot_format(cn_df, 'avg', index='ues', columns='cn')
    draw_gnuplot_linepoints(gnuplot_df, name=f'core_network_on_{syscall}_avg', title=title,
                        xlabel='Number of UEs', ylabel=labels['avg'])


    return cn_df, html


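def grouped_processes_stats(df_sysprocess, writer=None):
    # NOTE: the titles and file names below use `syscall`, which is not a parameter here;
    # it must exist as a global (for example a loop variable) when this function is called.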
    html = ""

    comm_df = df_sysprocess.groupby(['cn', 'comm']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    comm_df['avg'] = (comm_df['time (ms)'] / comm_df['count'])

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='time (ms)',
                  color='cn', hover_data=['count'],
                  title=f"Processes making {syscall} syscall (by latency)")
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_sysprocess_on_{syscall}_latency.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='count',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Processes making {syscall} syscall (by number of calls)")
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_sysprocess_on_{syscall}_count.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    sunburst_fig = px.sunburst(comm_df, path=['cn', 'comm'], values='avg',
                  color='cn', hover_data=['time (ms)'],
                  title=f"Processes making {syscall} syscall (by average latency)")
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_sysprocess_on_{syscall}_avg.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    return comm_df, html

def grouped_syscall_types(syscalls, syscall_type):
    html = ""

    # Collect one frame per syscall and concatenate once at the end.
    frames = []
    for syscall in syscalls:
        sysprocess_df = spark.read.option("basePath", basePath).json(
            f"{basePath}/cn=*/ues=*/tool=sysprocess_{syscall}")
        df1 = sysprocess_df.toPandas()
        df1['syscall'] = syscall
        frames.append(df1)

    df = pd.concat(frames, ignore_index=True)
    df_syscall = remove_noise_processes(df, 'comm', noise_processes)
    syscall_df = df_syscall.groupby(['cn', 'syscall', 'comm']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    syscall_df['avg'] = (syscall_df['time (ms)'] / syscall_df['count'])

    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall', 'comm'], values='time (ms)',
                  color='cn', hover_data=['count'],
                #   title=f"Core networks making {syscall_type} syscall (by latency)"
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_systypes_on_{syscall_type.replace('/', '')}_latency.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall', 'comm'], values='count',
                  color='cn', hover_data=['time (ms)'],
                #   title=f"Core networks making {syscall_type} syscall (by number of calls)"
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_systypes_on_{syscall_type.replace('/', '')}_count.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')


    sunburst_fig = px.sunburst(syscall_df, path=['cn', 'syscall', 'comm'], values='avg',
                  color='cn', hover_data=['time (ms)'],
                #   title=f"Core networks making {syscall_type} syscall (by average latency)"
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/grouped_systypes_on_{syscall_type.replace('/', '')}_avg.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    return None, html

def compute_grouped_stats(syscall, summary_df):
    """Plot per-syscall and per-process stats; return the updated summary and the HTML."""
    html = ""
    sysprocess_df = spark.read.option("basePath", basePath).json(
        f"{basePath}/cn=*/ues=*/tool=sysprocess_{syscall}")

    df_sysprocess = sysprocess_df.toPandas()
    df1 = df_sysprocess
    # df1 = remove_noise_processes(df_sysprocess, 'comm', noise_processes)
    syscall_df, lhtml = grouped_syscall_stats(df1, syscall, writer)
    html += '</br>' + lhtml

    comm_df, lhtml = grouped_processes_stats(df1, writer)
    html += '</br>' + lhtml

    # Get the summary per core network.
    # NOTE: 'avg' here is the sum of the per-process averages, not time / count.
    df2 = comm_df.groupby(['cn']).agg({ 'count': 'sum', 'time (ms)': 'sum', 'avg': 'sum' }).reset_index()

    df2['syscall'] = syscall

    summary_df = pd.concat([summary_df, df2])
    summary_df = summary_df.reset_index(drop=True)
    # Return the updated summary so that callers can keep accumulating it.
    return summary_df, html

html = ""
# writer = pd.ExcelWriter('ActiveProcessesPerSyscall-WithoutNoiseProcesses.xlsx', engine='xlsxwriter')
writer = None
noise_processes = ['python3', 'systemd', 'snapd', 'sshd', 'sudo', 'multipathd', 'systemd-logind', 'systemd-timesyn', 'systemd-resolve', 'systemd-udevd', 'systemd-network', 'systemctl', 'accounts-daemon', 'dbus-daemon', 'mongod', 'mysqld', '[unknown]']


html += '</br>' + '<h3 id="#io-multiplexing">epoll/poll/select</h3>'
html += '</br>' + '<p>The system calls epoll/poll/select implement I/O multiplexing, which enables the simultaneous \
monitoring of multiple input and output sources in a single operation. These system calls are based on the \
Linux design principle, which considers everything as a file and operates by monitoring files to determine if they \
are ready for the requested operation. The main advantage of multiplexing I/O operations is that it avoids blocking \
read and write where a process will wait for data while on the CPU. Instead, one waits for the multiplexing I/O \
system calls to determine which files are ready for read or write.</p>'

io_multiplex_syscalls = ['epoll_wait', 'poll', 'ppoll', 'epoll_pwait', 'select']
_, lhtml = grouped_syscall_types(io_multiplex_syscalls, 'IO Multiplexing')
html += '</br>' + lhtml
print("Syscalls for io multiplexing")
# Run for each syscall
grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
for syscall in io_multiplex_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml   


html += '</br>' + '<h3 id="#read-write">read/write</h3>'
html += '</br>' + '<p>The read() system call is used to retrieve data from a file stored in the file system, while \
the write() system call is used to write data from a buffer to a file. Both system calls take into account the \
"count", which represents the number of bytes to read or write. Upon successful execution, these system calls \
return the number of bytes that were successfully read or written. By default, these system calls are blocking but \
can be changed to non-blocking using the fcntl system call. Blocking is a problem for programs that should operate concurrently, since blocked processes \
are suspended. There are two different, complementary ways to solve this problem. They are nonblocking mode \
and I/O multiplexing system calls, such as select and epoll. The architectural decision to use a combination of \
multiplexing I/O operations and non-blocking system calls offers advantages depending on the use cases. Some \
scenarios where this approach is beneficial include situations where small buffers would result in repeated system \
calls, when the system is dedicated to one function, or when multiple I/O system calls return an error.</p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
socket_files_syscalls = ['read', 'write']
_, lhtml = grouped_syscall_types(socket_files_syscalls, 'Files')
html += '</br>' + lhtml
print("Syscalls for read or write for files operations")
for syscall in socket_files_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


html += '</br>' + '<h3 id="#recv">recv, recvfrom, recvmsg, recvmmsg</h3>'
html += '</br>' + '<p>recv(), recvfrom(), recvmsg() and recvmmsg() are all system calls used to receive \
messages from a socket. They can be used to receive data on a socket, whether or not it is connection-oriented. \
By default these system calls block; if no messages are available at the socket, the receive calls wait for a \
message to arrive. If the socket is set to non-blocking, then the value -1 is returned and errno is set to EAGAIN or \
EWOULDBLOCK. Passing the flag MSG_DONTWAIT to the system call enables non-blocking operation. \
This provides behaviour similar to setting O_NONBLOCK with fcntl, except that MSG_DONTWAIT applies per call. \
The recv() call is normally used only on a connected socket and is identical to recvfrom() with a NULL source address \
argument. recv(), recvfrom() and recvmsg() return the number of bytes received, or -1 if an error occurred. For \
connected sockets whose remote peer was shut down, 0 is returned when no more data is available. The \
recvmmsg() call returns the number of messages received, or -1 if an error occurred.</p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
socket_read_syscalls = [ 'recvmsg', 'recvfrom']
_, lhtml = grouped_syscall_types(socket_read_syscalls, 'Receive')
html += '</br>' + lhtml
print("Syscalls for socket read operations")
for syscall in socket_read_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


html += '</br>' + '<h3 id="#send">send, sendto, sendmsg, sendmmsg</h3>'
html += '</br>' + '<p>The send() call may only be used when the socket is in a connected state \
(so that the intended recipient is known). send() is similar to write(); the only difference is the flags argument. sendto() \
and sendmsg() work on both connected and unconnected sockets. The sendmsg() call also allows sending ancillary \
data (also known as control information). The sendmmsg() system call is an extension of sendmsg() that allows \
the caller to transmit multiple messages on a socket using a single system call. \
The approaches to optimising the send system calls are similar to those discussed for the recv \
system calls. These include I/O multiplexing, using the system calls in non-blocking mode, and sending multiple \
messages in a single system call where possible.</p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
socket_write_syscalls = ['sendto', 'sendmsg' ]
_, lhtml = grouped_syscall_types(socket_write_syscalls, 'Send')
html += '</br>' + lhtml
print("Syscalls for socket write operations")
for syscall in socket_write_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


html += '</br>' + '<h3 id="#sleep">nanosleep/clock_nanosleep</h3>'
html += '</br>' + '<p>The nanosleep and clock_nanosleep system calls are used to allow the calling \
thread to sleep for a specific interval with nanosecond precision. The clock_nanosleep differs from nanosleep \
in two ways. Firstly, it allows the caller to select the clock against which the sleep interval is to be measured. \
Secondly, it enables the specification of the sleep interval as either an absolute or a relative value. Using an \
absolute time avoids the timer drift that a relative nanosleep can accumulate when it is repeatedly restarted after interruption by signals.</p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
time_syscalls = ['clock_nanosleep', 'nanosleep']
_, lhtml = grouped_syscall_types(time_syscalls, 'Time')
html += '</br>' + lhtml
print("Syscalls for process time operations")
for syscall in time_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


html += '</br>' + '<h3 id="#futex">futex</h3>'
html += '</br>' + '<p>The futex() system call offers a mechanism to wait until a specific condition becomes true. It is \
typically used as a blocking construct in the context of shared-memory synchronisation. Additionally, futex() \
operations can be employed to wake up processes or threads that are waiting for a particular condition. The \
main design goal of futex is to manage the mutex keys in the user space to avoid context switches when handling \
mutex in kernel space. In the futex design, the kernel is involved only when a thread needs to sleep or the system \
needs to wake up another thread. Essentially, the futex system call can be described as providing a kernel side \
wait queue indexed by a user space address, allowing threads to be added or removed from user space. A \
high frequency of calls to the futex system call may indicate a high degree of concurrent access to shared resources \
or data structures by multiple threads or processes. </p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
locks_syscalls = ['futex']
_, lhtml = grouped_syscall_types(locks_syscalls, 'Locks')
html += '</br>' + lhtml
print("Syscalls for locks operations")
for syscall in locks_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


html += '</br>' + '<h3 id="#sched_yield">sched_yield</h3>'
html += '</br>' + '<p>The sched_yield system call is used by a thread to relinquish the CPU and give \
other threads a chance to run. Strategic calls to sched_yield() can improve performance by giving \
other threads or processes an opportunity to run when (heavily) contended resources, such as mutexes, have been \
released by the caller. Previous work reports improved throughput from calling \
sched_yield after a process handles each batch of packets and before it calls poll. On the other \
hand, sched_yield can result in unnecessary context switches, which will degrade system performance if not used \
appropriately. The latter is mainly true in generic Linux systems, as the scheduler is responsible for deciding \
which process runs. In most cases, when a process yields, the scheduler may still perceive it as the highest-priority task and \
put it back into execution, where it yields again in a loop. This behaviour is mainly due to the algorithm and \
logic used by Linux’s default scheduler to determine the process with the higher priority.</p>'

grouped_io_df = pd.DataFrame(columns=['cn', 'count', 'time (ms)', 'avg', 'syscall'])
control_syscalls = ['sched_yield']
_, lhtml = grouped_syscall_types(control_syscalls, 'Control operations')
html += '</br>' + lhtml
print("Syscalls for control operations")
for syscall in control_syscalls:
    _, lhtml = compute_grouped_stats(syscall, grouped_io_df)
    html += '</br>' + lhtml


with open(html_output_file, 'a') as f:
    f.write(html)


Syscalls for io multiplexing
Index(['ues', 'free5gc', 'oai', 'open5gs'], dtype='object', name='cn')
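
The receive and I/O multiplexing descriptions in the cell above can be illustrated with a minimal sketch. It uses Python's standard `socket` and `selectors` modules rather than the C system calls that the core networks issue, and it assumes a Linux host, where `socket.MSG_DONTWAIT` is available and `selectors.DefaultSelector` is typically backed by epoll. The helper names `recv_nonblocking` and `wait_readable` and the local socket pair are hypothetical and are not part of the measured systems.

```python
import selectors
import socket


def recv_nonblocking(sock, size=1024):
    """Attempt one receive with MSG_DONTWAIT; return None if it would block."""
    try:
        return sock.recv(size, socket.MSG_DONTWAIT)
    except BlockingIOError:
        # Raised when the kernel reports EAGAIN / EWOULDBLOCK.
        return None


def wait_readable(socks, timeout=1.0):
    """Return the sockets that are ready to read, waiting at most `timeout` seconds."""
    sel = selectors.DefaultSelector()
    try:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return [key.fileobj for key, _ in sel.select(timeout)]
    finally:
        sel.close()


# Hypothetical demo on a connected local socket pair.
a, b = socket.socketpair()
first_try = recv_nonblocking(b)   # nothing has been sent yet
a.send(b"ping")
ready = wait_readable([b])        # b is now reported as readable
payload = recv_nonblocking(b)     # the data is there, so this returns at once
a.close()
b.close()
```

In this sketch the per-call flag leaves the socket itself in blocking mode, which mirrors the distinction between `MSG_DONTWAIT` and `O_NONBLOCK` made in the text; the selector plays the role of the multiplexing call that reports which descriptors are ready before any receive is attempted.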

In [13]:
""" Show the characterisation of each process. We show for each process the stats for grouped syscalls
e.g., for open5gs-amfd we show results for the multiplexing system calls.

For each of the core networks, we show their processes and then
1. The syscalls active for the process by frequency - line graph as the number of UEs increases, and sunburst
2. The syscalls active for the process by latency - line graph as the number of UEs increases, and sunburst
3. The syscalls active for the process by average latency - line graph as the number of UEs increases, and sunburst
"""

import re

noise_processes = ['python3', 'systemd', 'snapd', 'sshd', 'sudo', 'multipathd', 'systemd-logind', 'systemd-timesyn', 'systemd-resolve', 'systemd-udevd', 'systemd-network', 'systemctl', 'accounts-daemon', 'dbus-daemon', 'mongod', 'mysqld', '[unknown]']

noise_processes = noise_processes + ['rsyslogd', 'systemd-journal', 'irqbalance', 'fwupd']

html = ""

def grouped_process_and_syscall_types(syscalls, syscall_type, core_network):
    html = ""
    # Collect one frame per syscall and concatenate once at the end.
    frames = []
    for syscall in syscalls:
        sysprocess_df = spark.read.option("basePath", basePath).json(
            f"{basePath}/cn={core_network}/ues=*/tool=sysprocess_{syscall}")
        df1 = sysprocess_df.toPandas()
        df1['syscall'] = syscall
        frames.append(df1)

    df = pd.concat(frames, ignore_index=True)
    df_syscall = remove_noise_processes(df, 'comm', noise_processes)
    syscall_df = df_syscall.groupby(['comm', 'ues', 'syscall']).agg({ 'count': 'sum', 'time (ms)': 'sum' }).reset_index()

    syscall_df['avg'] = (syscall_df['time (ms)'] / syscall_df['count'])

    # Sunburst summarising
    title=f"{core_network}: {syscall_type} syscalls (by latency)"
    file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
    sunburst_fig = px.sunburst(syscall_df, path=['comm', 'ues', 'syscall'], values='time (ms)',
                  color='comm', hover_data=['count'],
                  title=title
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    sunburst_fig.update_traces(sort=False, selector=dict(type='sunburst')) 
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/{file_name}.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    # Line graph
    for process_name, process_syscall_df in syscall_df.groupby('comm'):
        title=f"{core_network}: {syscall_type} syscalls by {process_name} (by latency)"
        sysprocess_count_fig = px.line(process_syscall_df.sort_values('ues'),
                        x="ues", y="time (ms)", color="syscall", 
                        facet_row="comm",
                        hover_data=["count", "time (ms)"],
                        labels=labels,
                        title=title,
                        markers=True)
        # sysprocess_count_fig.show()
        file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
        sysprocess_count_fig.write_image(f"plotly/{file_name}.jpeg")
        html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

        gnuplot_df = pivot_dataframe_to_gnuplot_format(process_syscall_df, 'time (ms)', 'ues', 'syscall')
        draw_gnuplot_linepoints(gnuplot_df, name=file_name, title=title,
                        xlabel='Number of UEs', ylabel=labels['time (ms)'])
    
    title=f"{core_network}: {syscall_type} syscalls (by number of calls)"
    file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
    sunburst_fig = px.sunburst(syscall_df, path=['comm', 'ues', 'syscall'], values='count',
                  color='comm', hover_data=['time (ms)'],
                  title=title
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    sunburst_fig.update_traces(sort=False, selector=dict(type='sunburst'))
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/{file_name}.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    # Line graph
    for process_name, process_syscall_df in syscall_df.groupby('comm'):
        title=f"{core_network}: {syscall_type} syscalls by {process_name} (by number of calls)"
        sysprocess_count_fig = px.line(process_syscall_df.sort_values('ues'),
                        x="ues", y="count", color="syscall", 
                        facet_row="comm",
                        hover_data=["count", "time (ms)"],
                        labels=labels,
                        title=title,
                        markers=True)
        # sysprocess_count_fig.show()
        file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
        sysprocess_count_fig.write_image(f"plotly/{file_name}.jpeg")
        html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

        gnuplot_df = pivot_dataframe_to_gnuplot_format(process_syscall_df, 'count', 'ues', 'syscall')
        draw_gnuplot_linepoints(gnuplot_df, name=file_name, title=title,
                        xlabel='Number of UEs', ylabel=labels['count'])

    title=f"{core_network}: {syscall_type} syscalls (by average latency)"
    file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
    sunburst_fig = px.sunburst(syscall_df, path=['comm', 'ues', 'syscall'], values='avg',
                  color='comm', hover_data=['time (ms)'],
                  title=title
                  )
    sunburst_fig.update_traces(textinfo="label+percent root")
    sunburst_fig.update_traces(sort=False, selector=dict(type='sunburst'))
    # sunburst_fig.show()
    sunburst_fig.write_image(f"plotly/{file_name}.jpeg")
    html += '</br>' + sunburst_fig.to_html(full_html=False, include_plotlyjs='cdn')

    # Line graph
    for process_name, process_syscall_df in syscall_df.groupby('comm'):
        title=f"{core_network}: {syscall_type} syscalls by {process_name} (by average latency)"
        sysprocess_count_fig = px.line(process_syscall_df.sort_values('ues'),
                        x="ues", y="avg", color="syscall", 
                        facet_row="comm",
                        hover_data=["count", "time (ms)"],
                        labels=labels,
                        title=title,
                        markers=True)
        # sysprocess_count_fig.show()
        file_name = re.sub(r'[^\w\s]','_', title).replace(' ', '_')
        sysprocess_count_fig.write_image(f"plotly/{file_name}.jpeg")
        html += '</br>' + sysprocess_count_fig.to_html(full_html=False, include_plotlyjs='cdn')

        gnuplot_df = pivot_dataframe_to_gnuplot_format(process_syscall_df, 'avg', 'ues', 'syscall')
        draw_gnuplot_linepoints(gnuplot_df, name=file_name, title=title,
                        xlabel='Number of UEs', ylabel=labels['avg'])

    return html

io_multiplex_syscalls = ['epoll_wait', 'poll', 'ppoll', 'epoll_pwait', 'select']
socket_files_syscalls = ['read', 'write']
socket_write_syscalls = ['sendto', 'sendmsg']
socket_read_syscalls = [ 'recvmsg', 'recvfrom']
time_syscalls = ['clock_nanosleep', 'nanosleep']
locks_syscalls = ['futex']
control_syscalls = ['sched_yield']

# Free 5GC
html += '</br>' + '<h2>free5GC</h2>'
html += '</br>' + '<p>This section focuses on the free5GC core network and presents results per process. This \
helps to identify the processes that are responsible for the aggregated usage and frequency. This information can \
be useful for developers, who can use it to focus their optimisation efforts on the most critical processes. It can \
also be used to identify processes that are potential bottlenecks.</p>'

lhtml = grouped_process_and_syscall_types(io_multiplex_syscalls, 'IO Multiplexing', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_files_syscalls, 'File operation', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_write_syscalls, 'Socket write', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_read_syscalls, 'Socket read', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(time_syscalls, 'Sleep', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(locks_syscalls, 'Resource contention', 'free5gc')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(control_syscalls, 'Scheduling', 'free5gc')
html += '</br>' + lhtml


# Open5gs
html += '</br>' + '<h2>Open5GS</h2>'
html += '</br>' + '<p>This section focuses on the Open5GS core network and presents results per process. This helps to identify which processes \
are responsible for the aggregated usage and frequency, and informs developers which processes to focus on. It \
can also help to identify processes that may be potential bottlenecks.</p>'

lhtml = grouped_process_and_syscall_types(io_multiplex_syscalls, 'IO Multiplexing', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_files_syscalls, 'File operation', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_write_syscalls, 'Socket write', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_read_syscalls, 'Socket read', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(time_syscalls, 'Sleep', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(locks_syscalls, 'Resource contention', 'open5gs')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(control_syscalls, 'Scheduling', 'open5gs')
html += '</br>' + lhtml


# OAI
html += '</br>' + '<h2>OAI</h2>'
html += '</br>' + '<p>This section focuses on the OAI core network and presents results per process. This analysis is complementary to the results presented in the main body of the paper, which showed \
the aggregated usage and frequency of system calls across the core network. By identifying the processes that \
are responsible for the highest system call usage and frequency, we can help developers to focus their efforts on \
optimising these processes and identifying potential bottlenecks.</p>'

lhtml = grouped_process_and_syscall_types(io_multiplex_syscalls, 'IO Multiplexing', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_files_syscalls, 'File operation', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_write_syscalls, 'Socket write', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(socket_read_syscalls, 'Socket read', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(time_syscalls, 'Sleep', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(locks_syscalls, 'Resource contention', 'oai')
html += '</br>' + lhtml
lhtml = grouped_process_and_syscall_types(control_syscalls, 'Scheduling', 'oai')
html += '</br>' + lhtml

with open(html_output_file, 'a') as f:
    f.write(html)


Index(['ues', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_pwait'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'sendmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'sched_yield'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'poll'], dtype='object', name='syscall')




Index(['ues', 'poll'], dtype='object', name='syscall')




Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait', 'poll'], dtype='object', name='syscall')
Index(['ues', 'epoll_wait'], dtype='object', name='syscall')
Index(['ues', 'poll'], dtype='object', name='syscall')
Index(['ues', 'poll'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='ob



Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read'], dtype='object', name='syscall')




Index(['ues', 'read'], dtype='object', name='syscall')




Index(['ues', 'read'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 



Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues



Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], dtype='object', name='syscall')
Index(['ues', 'recvfrom'], 



Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')




Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')




Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'fut



Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')




Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read'], dtype='object', name='syscall')
Index(['ues', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'read', 'write'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')




Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')




Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendmsg', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'sendto'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvmsg'], dtype='object', name='syscall')




Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'recvfrom', 'recvmsg'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')




Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'clock_nanosleep'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')




Index(['ues', 'futex'], dtype='object', name='syscall')
Index(['ues', 'futex'], dtype='object', name='syscall')


In [28]:
sysprocess_scatter_df = spark.read.option("basePath", basePath).json(
    f"{basePath}/cn=*/ues=*/tool=sysprocess")

df_sysprocess_scatter = sysprocess_scatter_df.toPandas()

# Calculate average rate of increase per slot
sorted_df = df_sysprocess_scatter.groupby(['cn', 'comm', 'ues']).first().sort_values(by='ues').reset_index()

sorted_df['diff'] = sorted_df.groupby(['cn', 'comm'])['count'].diff() / sorted_df.groupby(['cn', 'comm'])['ues'].diff()

sum_df = sorted_df.groupby(['cn', 'comm'])['count'].sum().to_frame()
duration_df = sorted_df.groupby(['cn', 'comm'])['time (ms)'].sum().to_frame()
mean_df = sorted_df.groupby(['cn', 'comm'])['diff'].mean().to_frame()


merged_df = pd.merge(sum_df, duration_df, on=['cn', 'comm']) \
             .merge(mean_df.rename(columns={'diff': 'rct'}), on=['cn', 'comm']).reset_index()


merged_df['avg_duration'] = (merged_df['time (ms)'] / 1000000)
merged_df["avg_duration"] = [0.25 if val < 1 else val for val in merged_df['avg_duration']]
merged_df = remove_noise_processes(merged_df, 'comm', noise_processes_excl_db)
fig = px.scatter(merged_df, x="count", y="rct",
                 size="avg_duration", color="cn",
                 labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Cumulative number of calls",
                     "avg": "Average time per syscall (ms)",
                     "avg_duration": "Time (ms)",
                     "cn": "Core network",
                     "rct": "Rate of change of number of calls"
                 },
                 # title="Total number of calls against their rate of change as the number of UEs generated increases per core network. Each bubble represents a process.",
                 hover_name="comm", log_x=True, log_y=True, size_max=50)
fig.update_layout(title_font=dict(size=10))
x_avg = merged_df['count'].mean()
y_avg = merged_df['rct'].mean()
fig.add_vline(x=x_avg, line_width=1, opacity=0.5)
fig.add_hline(y=y_avg, line_width=1, opacity=0.5)

# Update layout with new dimensions as percentages
# fig.update_layout(
#     height=600,     
#     width=800
# )

fig.show()
fig.write_image("plotly/processes_scatter_summary.jpeg")
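
The "rate of change" plotted on the y-axis above is the mean, per group, of the change in call count divided by the change in number of UEs between consecutive runs. The following is a minimal sketch of that calculation on made-up data; the values and the `toy` frame are purely illustrative, and only the column names are taken from the cell above.

```python
import pandas as pd

# Toy data (made up for illustration): call counts per core network,
# process and number of UEs.
toy = pd.DataFrame({
    "cn": ["a", "a", "a", "b", "b", "b"],
    "comm": ["amf", "amf", "amf", "amf", "amf", "amf"],
    "ues": [100, 200, 400, 100, 200, 400],
    "count": [1000, 3000, 5000, 500, 600, 1000],
})

# Order the runs by number of UEs so that diff() compares consecutive runs.
toy = toy.sort_values(by="ues").reset_index(drop=True)
grouped = toy.groupby(["cn", "comm"])

# Change in count divided by change in UEs, within each (cn, comm) group.
# The first run of each group has no predecessor, so its value is NaN.
toy["diff"] = grouped["count"].diff() / grouped["ues"].diff()

# mean() skips the NaN rows, leaving the average slope per group.
rct = toy.groupby(["cn", "comm"])["diff"].mean()
print(rct)
```

Because the first run in each group yields NaN, a group with a single run has no defined rate of change and would not appear on a log-scaled axis.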

In [30]:
# Get the data
syscall_scatter_df = spark.read.option("basePath", basePath).json(
    f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscall_scatter = syscall_scatter_df.toPandas()
df_syscall_scatter['avg'] = ((df_syscall_scatter['time (ms)'] / df_syscall_scatter['count']) / 100)
study_syscalls = ['futex', 'epoll_wait', 'epoll_wait2', 'epoll_pwait', 'epoll_pwait2', 'poll', 'ppoll', 'select', 'nanosleep', 'clock_nanosleep', 'read', 'write', 'recv', 'recvfrom', 'recvmsg', 'recvmmsg', 'send', 'sendto', 'sendmsg', 'sendmmsg', 'sched_yield']
df_syscall_scatter = df_syscall_scatter[df_syscall_scatter["syscall"].isin(study_syscalls)]

# Calculate average rate of increase per slot
sorted_df = df_syscall_scatter.groupby(['cn', 'syscall', 'ues']).first().sort_values(by='ues').reset_index()

sorted_df['diff'] = sorted_df.groupby(['cn', 'syscall'])['count'].diff() / sorted_df.groupby(['cn', 'syscall'])['ues'].diff()

sum_df = sorted_df.groupby(['cn', 'syscall'])['count'].sum().to_frame()
duration_df = sorted_df.groupby(['cn', 'syscall'])['time (ms)'].sum().to_frame()
mean_df = sorted_df.groupby(['cn', 'syscall'])['diff'].mean().to_frame()


merged_df = pd.merge(sum_df, duration_df, on=['cn', 'syscall']) \
             .merge(mean_df.rename(columns={'diff': 'rct'}), on=['cn', 'syscall']).reset_index()

merged_df['avg_duration'] = (merged_df['time (ms)'] / 1000000)
merged_df["avg_duration"] = [0.5 if val < 1 else val for val in merged_df['avg_duration']]
fig = px.scatter(merged_df, x="count", y="rct",
                 labels={
                     "ues": "Number of UEs",
                     "time (ms)": "Time (ms)",
                     "syscall": "System calls",
                     "count": "Cumulative number of calls",
                     "avg": "Average time per syscall (ms)",
                     "avg_duration": "Time (ms)",
                     "cn": "Core network",
                     "rct": "Rate of change of number of calls"
                 },
                 # title="Total number of calls against their rate of change as the number of UEs generated increases per core network. Each bubble represents a syscall.",
                 size="avg_duration", color="cn",
                 hover_name="syscall",
                 log_x=True, log_y=True, size_max=50)
fig.update_layout(title_font=dict(size=10))
x_avg = merged_df['count'].mean()
y_avg = merged_df['rct'].mean()
fig.add_vline(x=x_avg, line_width=1, opacity=0.5)
fig.add_hline(y=y_avg, line_width=1, opacity=0.5)
#label each bubble
# fig.update_traces(textposition='top center')

# fig.update_layout(
#     height=600,     
#     width=800
# )

fig.show()
fig.write_image("plotly/syscalls_scatter_summary.jpeg")
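
The vertical and horizontal reference lines above sit at the arithmetic means of `count` and `rct`, which splits the plot into four quadrants. As a rough sketch, the same split can be expressed as a column in the table; the `summary` frame, its values and the quadrant names below are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy summary table (values made up); column names mirror merged_df.
summary = pd.DataFrame({
    "syscall": ["futex", "read", "poll", "sendto"],
    "count": [1000, 10, 900, 20],
    "rct": [50.0, 1.0, 2.0, 40.0],
})

# Same reference values as the vline/hline drawn on the scatter plot.
x_avg = summary["count"].mean()
y_avg = summary["rct"].mean()

# Label each row by which side of each reference line it falls on.
many = np.where(summary["count"] >= x_avg, "many calls", "few calls")
fast = np.where(summary["rct"] >= y_avg, "fast growth", "slow growth")
summary["quadrant"] = [f"{m}, {f}" for m, f in zip(many, fast)]
print(summary)
```

Note that the means are arithmetic while both axes are logarithmic, so a few very large values can pull the reference lines well away from the visual centre of the plot.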

In [27]:
# Get the data
syscall_scatter_df = spark.read.option("basePath", basePath).json(
    f"{basePath}/cn=*/ues=*/tool=syscount")

df_syscall_scatter = syscall_scatter_df.toPandas()
df_syscall_scatter['avg'] = ((df_syscall_scatter['time (ms)'] / df_syscall_scatter['count']) / 100)
study_syscalls = ['futex', 'epoll_wait', 'epoll_wait2', 'epoll_pwait', 'epoll_pwait2', 'poll', 'ppoll', 'select', 'nanosleep', 'clock_nanosleep', 'read', 'write', 'recv', 'recvfrom', 'recvmsg', 'recvmmsg', 'send', 'sendto', 'sendmsg', 'sendmmsg', 'sched_yield']
df_syscall_scatter = df_syscall_scatter[df_syscall_scatter["syscall"].isin(study_syscalls)]

# Calculate average rate of increase per slot
sorted_df = df_syscall_scatter.groupby(['cn', 'syscall', 'ues']).first().sort_values(by='ues').reset_index()

sorted_df['diff'] = sorted_df.groupby(['cn', 'syscall'])['count'].diff() / sorted_df.groupby(['cn', 'syscall'])['ues'].diff()

sum_df = sorted_df.groupby(['cn'])['count'].sum().to_frame()
duration_df = sorted_df.groupby(['cn'])['time (ms)'].sum().to_frame()
mean_df = sorted_df.groupby(['cn', 'syscall'])['diff'].mean().to_frame()
mean_2_df = mean_df.groupby(['cn'])['diff'].mean().to_frame()


merged_df = pd.merge(sum_df, duration_df, on=['cn']) \
             .merge(mean_2_df.rename(columns={'diff': 'rct'}), on=['cn']).reset_index()

merged_df.to_csv("merged_df.csv")
merged_df['avg_duration'] = (merged_df['time (ms)'] / 1000000)
merged_df["avg_duration"] = [0.5 if val < 1 else val for val in merged_df['avg_duration']]

fig = px.scatter(merged_df, x="count", y="rct",
                 labels=labels,
                 # title="Total number of calls against their rate of change as the number of UEs generated increases per core network. Each bubble represents a core network.",
                 size="avg_duration", color="cn",
                 # hover_name="syscall",
                 log_x=True, log_y=True, size_max=50)
fig.update_layout(title_font=dict(size=10))
x_avg = merged_df['count'].mean()
y_avg = merged_df['rct'].mean()
fig.add_vline(x=x_avg, line_width=1, opacity=0.5)
fig.add_hline(y=y_avg, line_width=1, opacity=0.5)
#label each bubble
# fig.update_traces(textposition='top center')

# fig.update_layout(
#     height=600,     
#     width=800
# )

# fig.update_layout(
#     yaxis=dict(
#         range=[-50, 200]
#     )
# )

fig.show()
fig.write_image("plotly/syscalls_scatter_summary.jpeg")

In [16]:
import re

my_string = "Hello / World: How, Are You?"

new_string = re.sub(r'[^\w\s]+','_', my_string).replace(' ', '_')

print(new_string) # Output: Hello___World__How__Are_You_


Hello___World__How__Are_You_
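
The substitution above can be wrapped in a small helper so that it can be reused, for example when building names from free-form labels. This is only a sketch: the function name `sanitize_name` is hypothetical, and the intended use is an assumption since the notebook does not state it.

```python
import re


def sanitize_name(text: str) -> str:
    """Replace each run of punctuation, and each space, with an underscore."""
    # [^\w\s]+ matches one or more characters that are neither word
    # characters nor whitespace; each such run becomes a single "_".
    without_punctuation = re.sub(r"[^\w\s]+", "_", text)
    # Spaces are replaced one by one, so adjacent underscores are kept.
    return without_punctuation.replace(" ", "_")


print(sanitize_name("Hello / World: How, Are You?"))
```

Only the space character is replaced in the second step, so tabs and newlines pass through unchanged.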


In [17]:
""" For each syscall look at the processes that are making the calls
(a) Graphs
(b) Tables with the sum per latency, count and average latency
This should give us:
1. An idea of the processes making use of the most relevant syscalls, or of the syscalls we are looking at in the study
2. An idea of the relevance of these processes, making the analysis easier, e.g., if the rsyslog system
is the most active process per syscall, we know we need to do further work to disable logs or look at another logging mechanism
3. 
"""

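
The table described in (b) could be built along the following lines. This is a minimal sketch on made-up records: it assumes each record carries both a `syscall` and a `comm` column, which the notebook does not confirm, and the `records` and `table` names are illustrative.

```python
import pandas as pd

# Toy records (made up); column names mirror those used in this notebook.
records = pd.DataFrame({
    "syscall": ["futex", "futex", "futex", "read"],
    "comm": ["amf", "amf", "rsyslogd", "amf"],
    "count": [10, 30, 100, 5],
    "time (ms)": [2.0, 6.0, 50.0, 1.0],
})

# Sum the call count and total latency per (syscall, process) pair.
table = (
    records.groupby(["syscall", "comm"])[["count", "time (ms)"]]
    .sum()
    .reset_index()
)

# Average latency per call for each pair.
table["avg (ms)"] = table["time (ms)"] / table["count"]

# Within each syscall, list the most active process first.
table = table.sort_values(["syscall", "count"], ascending=[True, False])
print(table)
```

Sorting by count within each syscall puts the most active process first, which makes a dominant logger such as rsyslog easy to spot.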

In [18]:
def my_theme(fig):
    """Apply preferred default styling to a Plotly figure (modifies it in place)."""
    # Use the white template with a fixed width and height.
    fig.update_layout(template='plotly_white', width=1000, height=700)
    # Show light grey grid lines on both axes.
    fig.update_xaxes(showline=False, linewidth=0.2, gridwidth=1, linecolor='white', gridcolor='lightgrey', categoryorder='total descending', color='black')
    fig.update_yaxes(showline=False, linewidth=0.2, gridwidth=1, linecolor='white', gridcolor='lightgrey')
    # Label each point with its y value in bold, to one decimal place.
    fig.update_traces(texttemplate='<b>%{y:0,.1f}')

# Apply the default styling to the most recent figure.
my_theme(fig)
fig.show()