# L41: Lab 1 - Getting started with kernel tracing - I/O

This JupyterLab Notebook is intended to get you started with:

1. Building and running the benchmark
2. Extracting and plotting data collected by the benchmark itself (e.g., execution time)
3. Extracting and plotting data collected externally by DTrace

This file is not intended to be a template for your solutions; we recommend that you create a new Notebook, placing your work there, rather than using this Notebook as your starting point.

Make sure to run the cells in order (pressing Ctrl-Enter when in the cell) so that each cell's dependencies have already been executed. For example, Python imports must occur before running the remainder of the code, and data must be collected before it can be plotted.

Note: When you execute a cell in JupyterLab, the bracketed number to the left (e.g., `[1]`) will temporarily change to `[*]` to indicate that the cell has not yet completed. Running benchmarks or longer forms of data analysis or plotting may take considerable time on our RPi4 boards, so do exercise patience.

# 1. Building and running the benchmark

First, we need to build the benchmark using `make` (no text output is expected from a successful build):

In [None]:
!make -C io

Next, we can run the benchmark using Jupyter's `!` syntax, illustrating its command-line arguments:

In [None]:
!io/io-benchmark

Now, create a data file suitable for the I/O benchmark to use; the default parameters are fine:

In [None]:
!io/io-benchmark -c iofile

Then run a quick test of the benchmark using small parameters so that we can see the JSON format of the output, which you will need to know in order to extract various results of interest:

In [None]:
!io/io-benchmark -b 262144 -g -j -v -n 2 -r iofile

The `"host_configuration"` and `"benchmark_configuration"` blocks provide information about the configuration of the host and the benchmark.

The `"benchmark_samples"` block consists of an array of individual measurements with various results for each measurement. In general, dropping the first sample is a good idea, as it may contain artifacts from "first runs" -- such as the costs of dynamic linking. The metrics captured with this benchmark command line are:

- `bandwidth`: The average bandwidth over the run of the benchmark's work loop.
- `time`: Wall-clock time running the work loop.
- `utime` and `stime`: Sampled user and system (kernel) time. These may not add up to wall-clock time if software has to sleep awaiting I/O. Further, while `time` is measured using precise clock reads, `utime` and `stime` are sampled by the timer interrupt. You therefore cannot expect that (`time` == `utime` + `stime`).
- `inblock` and `outblock`: The number of actual block I/O operations performed by the process measured using `getrusage(2)`.
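
As a minimal sketch of how these fields might be extracted, the following parses a hand-written stand-in for the benchmark output, drops the first sample, and compares `time` against `utime` + `stime`. The numeric values are invented for illustration, and the helper names `steady_state_samples` and `unaccounted_time` are hypothetical; only the field names come from the list above.

```python
import json

# Hand-written stand-in for the benchmark's JSON output. The values are
# invented for illustration; only the field names come from the list above.
example_output = """
{
  "benchmark_samples": [
    {"bandwidth": 100000.0, "time": 0.160, "utime": 0.010, "stime": 0.120},
    {"bandwidth": 125000.0, "time": 0.128, "utime": 0.008, "stime": 0.110},
    {"bandwidth": 128000.0, "time": 0.125, "utime": 0.000, "stime": 0.117}
  ]
}
"""

def steady_state_samples(json_text):
    """Parse benchmark JSON and drop the first (warm-up) sample."""
    samples = json.loads(json_text)["benchmark_samples"]
    return samples[1:]

def unaccounted_time(sample):
    """Wall-clock time not covered by sampled user + system time."""
    return sample["time"] - (sample["utime"] + sample["stime"])

kept = steady_state_samples(example_output)
for s in kept:
    print("bandwidth:", s["bandwidth"],
          "unaccounted:", round(unaccounted_time(s), 6))
```

A non-zero difference is not in itself an error: it may reflect time spent asleep awaiting I/O, or simply the coarser sampling of `utime` and `stime`.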

# 2. Extracting and plotting data generated by the benchmark

Next, we import some Python module dependencies, and set configuration parameters:

In [None]:
import json
# Enable Jupyter notebook mode for matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

# Set low for experimentation; consider using 11 "in production", but this will run for a long time!
iterations=4

Next, run the benchmark and process the results. We run the benchmark binary once for each buffer size, `iterations` iterations each time, generating JSON. We import the JSON into Python, and generate some summary statistics (medians and quartiles) for each buffer size. In this example, we consider only bandwidth, but you can also easily plot properties such as I/O counts or user/system time.

You will likely want to modify this code to drop the first sample for each size.

In [None]:
benchmark_strings = {}
print("Benchmark run starting")
for buffersize in [2**v for v in range(25)]:
    print("Buffer size: ", buffersize)
    output = !io/io-benchmark -b $buffersize -j -n $iterations -r -v iofile
    benchmark_strings[buffersize] = ' '.join(output)
print("Benchmark run completed")

bw_samples = {}       # Lists of bandwidth samples indexed by buffer size
medians = {}          # Median bandwidth indexed by buffer size
q3s = {}              # Distance from the median up to the third quartile
q1s = {}              # Distance from the median down to the first quartile

for buffersize in [2**v for v in range(25)]:
    j = json.loads(benchmark_strings[buffersize])
    samples = list(j["benchmark_samples"])
    bw_samples[buffersize] = list([x["bandwidth"] for x in samples])
    medians[buffersize] = np.median(bw_samples[buffersize])
    q1s[buffersize] = medians[buffersize] - np.quantile(bw_samples[buffersize], 0.25)
    q3s[buffersize] = np.quantile(bw_samples[buffersize], 0.75) - medians[buffersize] 
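
One way to drop the first sample and compute the same statistics is sketched below. The helper name `summarise` and the sample values are hypothetical, and this is only one possible structure rather than the expected solution.

```python
import numpy as np

def summarise(samples, drop_first=True):
    """Return (median, low_err, high_err) for one buffer size.

    low_err and high_err are distances from the median to the first and
    third quartiles, which is the form matplotlib's errorbar() accepts
    for asymmetric error bars.
    """
    if drop_first and len(samples) > 1:
        samples = samples[1:]
    median = np.median(samples)
    q1, q3 = np.quantile(samples, [0.25, 0.75])
    return float(median), float(median - q1), float(q3 - median)

# Invented bandwidth samples; the first is deliberately an outlier.
example = [10.0, 100.0, 110.0, 120.0, 130.0]
median, low_err, high_err = summarise(example)
print(median, low_err, high_err)
```

In the loop above, this could be called as `summarise(bw_samples[buffersize])`, with the three results stored in `medians`, `q1s`, and `q3s`. Note that dropping a sample only makes sense when `iterations` is at least 2.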

Finally, we generate a plot using `matplotlib`, consisting of medians and error bars based on IQR:

In [None]:
fig1, ax = plt.subplots()
ax.set_title("buffer size vs. bandwidth")

x_coords = []
y_coords = []
low_errs = []
high_errs = []

for x in [2**v for v in range(25)]:
    x_coords.append(x)
    y_coords.append(medians[x])
    low_errs.append(q1s[x])
    high_errs.append(q3s[x])

ax.set_xscale("log")
ax.errorbar(x_coords, y_coords, [low_errs, high_errs])
plt.show()

In analysing this plot, it is worth considering key inflection points: points on the plot where there are behavioural changes, and what they reflect. We can directly annotate those points on the plot using `axvline`.

In the next plot, we've manually placed several vertical lines at points where the data you collect is likely to experience inflection points. If they don't line up, check that you are collecting data as expected.

Be sure to take note of the linear Y axis and logarithmic X axis, and consider their implications for data analysis.

In [None]:
### This content is the same as in the cell above
fig1, ax = plt.subplots()
ax.set_title("buffer size vs. bandwidth")

x_coords = []
y_coords = []
low_errs = []
high_errs = []

for x in [2**v for v in range(25)]:
    x_coords.append(x)
    y_coords.append(medians[x])
    low_errs.append(q1s[x])
    high_errs.append(q3s[x])

ax.set_xscale("log")
ax.errorbar(x_coords, y_coords, [low_errs, high_errs])

### This is new content relative to the prior cell
ax.axvline(x=4*1024, color="red", label="4KB", linestyle=":")
ax.axvline(x=64*1024, color="blue", label="64KB", linestyle=":")
ax.axvline(x=128*1024, color="green", label="128KB", linestyle=":")
ax.legend()
plt.show()
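
Besides placing lines by eye, it can help to check numerically where the medians change most between adjacent buffer sizes. The following is a rough sketch under that assumption; the helper name `largest_changes` and the example values are hypothetical, and a large step between neighbouring sizes is only a hint of where to look, not a definition of an inflection point.

```python
import numpy as np

def largest_changes(sizes, medians, top=3):
    """Return the `top` adjacent-size steps with the largest relative change.

    Each result is (smaller_size, larger_size, ratio), where ratio is
    median(larger_size) / median(smaller_size).
    """
    sizes = np.asarray(sizes)
    values = np.asarray(medians, dtype=float)
    ratios = values[1:] / values[:-1]
    # Rank by distance from 1.0 in log space so that halving and
    # doubling count as equally large changes.
    order = np.argsort(-np.abs(np.log(ratios)))[:top]
    return [(int(sizes[i]), int(sizes[i + 1]), float(ratios[i])) for i in order]

# Invented medians for illustration only.
example_sizes = [1024, 2048, 4096, 8192, 16384]
example_medians = [100.0, 190.0, 200.0, 100.0, 105.0]
steps = largest_changes(example_sizes, example_medians, top=2)
for lo, hi, ratio in steps:
    print(lo, "->", hi, "ratio", round(ratio, 2))
```

With the lists built in the plotting cell, this could be called as `largest_changes(x_coords, y_coords)`.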

You can save a plot out to disk as a PDF -- e.g., for use in a lab report -- using this API:

In [None]:
#plt.savefig("performance.pdf")

# 3. Extracting and plotting data generated using DTrace

DTrace scripts can be run directly from Python and Jupyter, returning a data structure that describes the resulting output. The details of the data structure depend on the script you have written. The DTrace script will run asynchronously while the benchmark runs, and you then collect the data after completion.

You will likely wish to develop DTrace scripts using the `dtrace(1)` command-line tool rather than in Python, as that will give more ready access to debug output (such as script compilation failure details).

First you need to import the `python-dtrace` module:

In [None]:
from dtrace import DTraceConsumerThread

The following example uses DTrace to record the number of times each system call is made by the `io-benchmark` benchmark while reading the benchmark data file. Data collection is bracketed both by the executable name (`io-benchmark`) and by the start and finish of the benchmark loop, as detected using calls to the `clock_gettime(2)` system call (which is invoked directly in `io-benchmark` to bypass `vdso` optimisation). We set iterations to 1 to avoid capturing data from more than one run:

In [None]:
# D Language script
io_syscall_script = """
syscall::clock_gettime:return
/execname == "io-benchmark" && !in_benchmark/
{
    in_benchmark = 1;
}

syscall::clock_gettime:entry
/execname == "io-benchmark" && in_benchmark/
{
    in_benchmark = 0;
}

syscall:::entry
/execname == "io-benchmark" && in_benchmark && probefunc != "clock_gettime"/
{
    @a[probefunc] = count();
}
"""

from collections import defaultdict
values = defaultdict(int)

# Callback invoked to process the aggregation
def simple_walk(action, identifier, keys, value):
    """
    action -- type of action (sum, avg, ...)
    identifier -- the id.
    keys -- list of keys.
    value -- the value.
    """
    values[keys[0]] += value

# Create a separate thread to run the DTrace instrumentation
dtrace_thread = DTraceConsumerThread(io_syscall_script,
                                     walk_func=simple_walk,
                                     out_func=lambda v: None,
                                     chew_func=lambda v: None,
                                     chewrec_func=lambda v: None,
                                     sleep=1)

# Start the DTrace instrumentation
dtrace_thread.start()

# Display header to indicate that the benchmarking has started
print("Starting io-benchmark read performance measurement")

# Run the io-benchmark benchmark    
BUFFER_SIZE = 512

output_dtrace = !io/io-benchmark -r -b {str(BUFFER_SIZE)} iofile
        
# The benchmark has completed - stop the DTrace instrumentation
dtrace_thread.stop()
dtrace_thread.join()
    
# Print the syscalls and their frequency
for x in values.keys():
    print("Number of {} calls: {}".format(x, values[x]))

# Display footer to indicate that the benchmarking has finished
print("Finished io-benchmark read performance measurement")
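
Once the counts are in a dictionary, ordinary Python can be used to post-process them. As a small sketch, the following ranks system calls by frequency and reports each one's share of the total. The helper name `rank_counts` and the counts shown are invented for illustration and stand in for the `values` dictionary filled in by `simple_walk`.

```python
from collections import defaultdict

def rank_counts(counts):
    """Return (name, count, fraction_of_total) tuples, most frequent first."""
    total = sum(counts.values())
    # Sort by descending count, breaking ties alphabetically by name.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n, n / total if total else 0.0) for name, n in ranked]

# Invented counts standing in for the `values` dictionary.
example_values = defaultdict(int, {"read": 32768, "write": 2, "lseek": 1, "close": 1})
ranked = rank_counts(example_values)
for name, n, frac in ranked:
    print("{:<8} {:>8} {:6.2%}".format(name, n, frac))
```

After a real run, `rank_counts(values)` would give the same view of the collected data.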

This approach can be used to extract a variety of kernel trace data using DTrace. One known limitation is that `stack()` results are stored as a set of code addresses, rather than being expanded to strings, so you will likely prefer to use the `dtrace(1)` command-line tool to capture stack data.