# Read-While-Writing

## Lustre file system

The tests were performed on the *drpneh* Lustre file system. The specs are:
* 12 OSSs, each with 2 OSTs (drp-neh-ffb101 - drp-neh-ffb112)
* each OST: 8 SSDs in raidz1 with a total size of 2.8TB
* SSDs: 480GB INTEL SSDSC2BB48
* Lustre version on OSS: lustre-2.12.7 with zfs-0.8.6
* clients are drp-neh-cmp0NN
* Lustre client: lustre-client-2.12.0, except drp-neh-cmp022, which runs lustre-client-2.12.7

## Tests

The tool _~wilko/psdm/dev/psdm-testing/filesystem/rww/rww.go_ is used for both the reader and the writer:

1) write a 100GB file (default blocksize is 1MiB)
    * rww --mode write --wsize 100000 --fn /drp/neh/wktst/f1
2) wait a few minutes and start the reader on a different client machine
    * rww --mode read --wsize 100000 --fn /drp/neh/wktst/f1
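
The reader side of this procedure can be sketched in Python. This is not the `rww.go` implementation: the names `write_blocks` and `follow_read`, the polling interval, and the timeout are assumptions made for illustration. The reader follows a growing file and records how long each read call takes, so a read that blocks for a long time shows up as a latency outlier.

```python
import time

BLOCK = 1 << 20  # 1 MiB, the default block size stated above


def write_blocks(path, n_blocks, block=BLOCK):
    """Append n_blocks blocks of zeros to path, flushing after each write."""
    buf = bytes(block)
    with open(path, "ab") as f:
        for _ in range(n_blocks):
            f.write(buf)
            f.flush()


def follow_read(path, expected_bytes, block=BLOCK, poll=0.01, timeout=5.0):
    """Read path until expected_bytes are consumed.

    Returns (total_bytes_read, latencies), where latencies holds the duration
    in seconds of every read call that returned data. When the reader catches
    up with the writer it sleeps for `poll` seconds and retries, and gives up
    after `timeout` seconds without progress.
    """
    latencies = []
    total = 0
    last_progress = time.monotonic()
    with open(path, "rb") as f:
        while total < expected_bytes:
            t0 = time.monotonic()
            data = f.read(min(block, expected_bytes - total))
            dt = time.monotonic() - t0
            if data:
                total += len(data)
                latencies.append(dt)
                last_progress = time.monotonic()
            elif time.monotonic() - last_progress > timeout:
                raise TimeoutError(f"no new data after {timeout}s at offset {total}")
            else:
                time.sleep(poll)
    return total, latencies
```

Running `write_blocks` on one client and `follow_read` on another against the same path would mimic the two steps above on a small scale; `max(latencies)` then gives the longest single read.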
    
### Setting Lustre cache size

    % lctl set_param llite.*.max_cached_mb=16384
    
To make the setting persistent, run the above command with the _-P_ option on the MDS. This requires a remount on the client.
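
As a small illustration of the arithmetic, the helper below builds the command line for a cache limit given in GiB. It assumes the command is Lustre's `lctl` and that the "GB" figures used in this notebook mean GiB, as the value 16384 for the 16GB case suggests. The function name `max_cached_mb_command` is hypothetical.

```python
def max_cached_mb_command(cache_gib, persistent=False):
    """Build the lctl command line for a client cache limit given in GiB."""
    if cache_gib <= 0:
        raise ValueError("cache size must be positive")
    mb = int(cache_gib * 1024)  # max_cached_mb takes the limit in MiB
    flag = " -P" if persistent else ""  # -P makes the setting persistent
    return f"lctl set_param{flag} llite.*.max_cached_mb={mb}"


# Command lines for the three cache limits used in the tests
for gib in (80, 4, 16):
    print(max_cached_mb_command(gib))
```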

In [8]:
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, save
from bokeh.themes import built_in_themes
from bokeh.io import curdoc

import jmespath
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import pyarrow.feather as feather

output_notebook()

In [9]:
curdoc().theme = 'light_minimal'
rate_data = "rww_20210929.feather"

## Save the Prometheus IB rates 

Read the rates from Prometheus and store them in a file, since Prometheus keeps the data for only 3-4 weeks.

In [10]:
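def read_test():
    """Query Prometheus for the IB transmit/receive rates of the test clients.

    Returns a DataFrame with one column per client and direction
    (e.g. cmp014_out, cmp013_in) holding rates in MiB/s, indexed by time.
    """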
    import promquery as pq
    from itertools import cycle, chain
    
    tst_start = datetime(2021, 9, 29, 21, 21, 0)
    tst_dt = timedelta(seconds=1200)
    trange = pq.time_range(tst_start, tst_dt)
    
    rate_ib_out = pq.get_data_prom(
        'irate(node_infiniband_port_data_transmitted_bytes_total{job="nehdrp",instance=~"drp-neh-cmp014:9100|drp-neh-cmp013:9100"}[30s])', 
        *trange)
    rate_ib_in = pq.get_data_prom(
        'irate(node_infiniband_port_data_received_bytes_total{job="nehdrp",instance="drp-neh-cmp013:9100"}[30s])', 
        *trange)
    
    out = zip(rate_ib_out['data']['result'], cycle(["out"]))
    ind = zip(rate_ib_in['data']['result'], cycle(["in"]))

    to_mb = pow(2, 20)  # bytes per MiB: converts rates from bytes/s to MiB/s
    df = pd.DataFrame({
            f"{r['metric']['instance'][8:14]}_{n}":
                pd.Series((np.float64(v[1])/to_mb for v in r['values']), index=(pd.Timestamp(v[0], unit='s') for v in r['values']))
            for r,n in chain(ind, out) })
    return df

do_create_file = False
if do_create_file:
    import os
    import pyarrow as pa
    import pyarrow.feather as feather

    df = read_test()
    if os.path.exists(rate_data):
        print("data file already exists", rate_data)
    else:
        table = pa.Table.from_pandas(df, preserve_index=True)
        feather.write_feather(table, rate_data)
        print("Created data file", rate_data)
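
The `promquery` module used above is a local helper and is not shown here. As a rough sketch of what such a helper might do, the code below builds a request URL for the standard Prometheus HTTP API endpoint `/api/v1/query_range` and converts a range-query response into a DataFrame of MiB/s columns. The function names, the base URL, and the default step are assumptions; only the endpoint, its `query`/`start`/`end`/`step` parameters, and the response layout come from the Prometheus API.

```python
import pandas as pd
from urllib.parse import urlencode

MIB = 2 ** 20  # bytes per MiB


def query_range_url(base_url, query, start, end, step="15s"):
    """Build a Prometheus query_range URL (start and end as Unix seconds)."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def matrix_to_frame(response, suffix):
    """Convert a query_range JSON response (already parsed into a dict)
    into a DataFrame with one MiB/s column per instance."""
    cols = {}
    for r in response["data"]["result"]:
        host = r["metric"]["instance"].split(":")[0]   # drop the port
        name = f"{host.rsplit('-', 1)[-1]}_{suffix}"   # e.g. cmp013_in
        ts = pd.to_datetime([v[0] for v in r["values"]], unit="s")
        # Prometheus returns sample values as strings
        cols[name] = pd.Series([float(v[1]) / MIB for v in r["values"]], index=ts)
    return pd.DataFrame(cols)
```

Fetching the URL and decoding the JSON body is left out so that the sketch runs without network access.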

## Read the data from a feather file

In [12]:
# read the data from file
df = feather.read_feather(rate_data)

from bokeh.models import ColumnDataSource
cds = ColumnDataSource(df)

In [13]:
from bokeh.models.formatters import DatetimeTickFormatter
p = figure(title="Read and write for RWW tests", x_axis_label="time", y_axis_label="rate [MiB/s]", x_axis_type='datetime', width=1000, height=600)
p.line("index", "cmp014_out", source=cds, line_color="red", legend_label="writer")
p.line("index", "cmp013_in", source=cds, line_color="blue", legend_label="reader")
p.xaxis.major_label_orientation = np.pi/4
p.xaxis.formatter = DatetimeTickFormatter(minutes="%H:%M")
p.xgrid.grid_line_color = "black"
p.ygrid.grid_line_color = "black"
p.xgrid.grid_line_alpha = .3
p.ygrid.grid_line_alpha = .3

#p.xaxis.major_label_text_font_size = "20pt"
show(p)

## Result

The plot above shows the InfiniBand (IB) transmit and receive rates of the clients to and from the Lustre file system. The red curve is the writer's transmit rate and the blue curve is the reader's receive rate.
Three tests were run, limiting the Lustre client cache size to 80GB, 4GB and 16GB.

### max-cache = 80GB (time=4:21-4:24)

The first test (4:21:30-4:24:15) used a max cache of 80GB. The write rate drops to zero as soon as the reader starts; only after about 30s does the writer continue and
the reader begin reading. This is confirmed by the reader application, which reports that the very first read (each read is 1MiB) took about 30s, while at the same time
the writer reported that a write syscall took 30s.
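
A stall like this can also be located in the rate data programmatically. The sketch below finds runs of consecutive samples where a rate stays below a threshold; the function name `find_stalls`, the threshold of 1 MiB/s, and the minimum duration are assumptions, not values used in the tests.

```python
import pandas as pd


def find_stalls(rate, threshold=1.0, min_duration=pd.Timedelta(seconds=10)):
    """Return (start, end, duration) for each run of samples below threshold.

    rate: Series of rates in MiB/s with a DatetimeIndex.
    The duration spans the first to the last low sample, so it can
    underestimate the real stall by up to one sampling interval.
    """
    low = rate < threshold
    # A new run starts wherever `low` flips; cumsum gives each run an id.
    run_id = (low != low.shift()).cumsum()
    stalls = []
    for _, run in rate[low].groupby(run_id[low]):
        start, end = run.index[0], run.index[-1]
        if end - start >= min_duration:
            stalls.append((start, end, end - start))
    return stalls
```

With the DataFrame read from the feather file, `find_stalls(df["cmp014_out"])` would list the writer stalls, assuming the frame is indexed by timestamp as in the plot.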

### max-cache = 4GB (time=4:30-4:32)

When the cache is limited to 4GB, the stall of the reader and writer disappears. The figure indicates that the write rate drops a bit due to the
reduced cache size.

### max-cache = 16GB (time=4:37-4:39)

The results for a cache limit of 16GB are similar to those for 4GB. The rates show more fluctuations, and the write rate is a bit higher
in the 16GB case than with the 4GB limit (at least without a reader). There is no indication of any write stall beyond a few seconds.
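
To put numbers on such comparisons, the mean rate in each test window can be computed from the rate data. The helper below is a minimal sketch; the name `window_means` and the window labels are hypothetical, and the window boundaries have to be supplied as timestamps that match the index of the data.

```python
import pandas as pd


def window_means(df, windows):
    """Mean of every column within each labelled (start, end) time window.

    df: DataFrame with a DatetimeIndex.
    windows: dict mapping a label to a (start, end) pair; both ends inclusive.
    """
    rows = {label: df.loc[start:end].mean() for label, (start, end) in windows.items()}
    # One row per window, one column per rate series
    return pd.DataFrame(rows).T
```

For the three tests, `windows` would hold the time ranges given in the section headings, with the date added.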

## Discussion

Lustre grants locks to the writer and the reader. When the reader starts up, the writer holds a lock on the file and has to return it before the reader is granted one; the writer then asks for a new one.
The test results indicate that when the cache size is large the lock exchange takes a long time, possibly due to validation that the data in the writer's cache has been committed to the server. Most of the
data in the cache must already have been written to the server, because no data is sent to the OSS while the writer stalls (the rate is 0).