# Reproducing the LinnOS End-to-End Workflow
NOTE: Please read the README first before running the end-to-end workflows.

This artifact is designed to be run on the Chameleon testbed using Jupyter. It runs two end-to-end workflows on a Chameleon instance: baseline and LinnOS (see the README for details). Specifically, it performs the following steps:

1. __Process I/O traces to obtain baseline results:__ <br>
   a. Acquire an instance with a local SSD array from Chameleon <br>
   b. Populate those drives with the prepared I/O trace dataset <br>
   c. Run the SSD replayer and obtain the baseline CDF results <br>
2. __Train the LinnOS ML model and implant the learned weights into the LinnOS kernel code:__ <br>
   a. Train the LinnOS ML model using the output obtained from running the baseline and save the learned weights <br>
   b. Use the header generator to convert the saved weight files to LinnOS kernel-compatible headers (and put them inside the LinnOS kernel source code) <br>
3. __Install the LinnOS kernel:__ <br>
   a. Prepare the config file <br>
   b. Install the required packages <br>
   c. Compile and reboot <br>
4. __Using LinnOS, replay the I/O traces to obtain ML results:__ <br>
   a. Run the SSD replayer (with the LinnOS kernel enabled) and obtain the linnos-ml CDF results <br>

__Requirements:__ A Chameleon account, familiarity with OpenStack, and a storage hierarchy instance with Ubuntu 18.04 (or later)

__Total run time of each step is shown below:__

* __step 1 ----> Depending on the condition of the physical instance, runtime of this step can range from several minutes to several hours__
* __step 2 ----> around 30 minutes__
* __step 3 ----> around 30 minutes__
* __step 4 ----> around 10 minutes__

This package contains the following files: LinnOS.ipynb, LinnOSResultsPlot.ipynb, LinnOSWriterReplayer.tgz, linux-5.4.8-linnos.tar.gz, reservation.sh, stack.yaml and the References folder.

- LinnOS.ipynb (this script) is the main script that follows the steps mentioned above.
- LinnOSResultsPlot.ipynb is the script that plots the end result of LinnOS (i.e., the graph of the baseline vs. LinnOS-ML lines). Run this script after LinnOS.ipynb has executed successfully.
- LinnOSWriterReplayer.tgz contains the LinnOS scripts used for populating drives, running SSD failover experiments, etc.
- linux-5.4.8-linnos.tar.gz contains the LinnOS kernel code.
- reservation.sh and stack.yaml are used for creating a lease and an instance.
- The References folder includes our recent results on Chameleon.

In [None]:
# These parameters can be tuned to the user's needs by simply changing the strings.
# For example, OS_PROJECT_NAME can be changed to match your project name.
export OS_PROJECT_NAME="Chameleon Reproducibility Research"
export OS_REGION_NAME="CHI@TACC"
export NODE_TYPE="storage_hierarchy"

export RESOURCE_NAME="$USER-LinnOSStorage"

In [None]:
# This script creates/starts the lease and exports a public IP address for accessing the instance after
# it is created. A "no available host" error means that all available storage hierarchy nodes
# are in use. Changing the start date of the reservation in the reservation.sh script might solve this.
# If the cell runs correctly, you will see the "Lease started successfully!" message.

source ./reservation.sh

In [None]:
# Please proceed only after the lease has started successfully.
# Once the lease is created, it takes roughly 10 minutes to initiate an instance.

key_pair_upload

stack_name="$RESOURCE_NAME"

openstack stack create "$stack_name" --wait \
  --template stack.yaml \
  --parameter floating_ip="$FLOATING_IP_ID" \
  --parameter reservation_id="$RESERVATION_ID" \
  --parameter key_name="$USER-jupyter"
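
##### The stack creation above depends on variables set by earlier cells (RESOURCE_NAME from the first cell; FLOATING_IP_ID and RESERVATION_ID are presumably exported by reservation.sh). A minimal sketch of a fail-fast check follows. `require_vars` is a hypothetical helper and not part of this artifact; it could be called before `openstack stack create` to catch a lease cell that did not finish.

```shell
# Hypothetical helper (not part of the artifact): fail if any of the named
# shell variables is unset or empty.
require_vars() {
  for var_name in "$@"; do
    # Indirect lookup of the variable whose name is stored in var_name.
    eval "var_value=\${$var_name:-}"
    if [ -z "$var_value" ]; then
      echo "required variable is not set: $var_name" >&2
      return 1
    fi
  done
}

# Example: require_vars RESOURCE_NAME FLOATING_IP_ID RESERVATION_ID
```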

In [None]:
# Repeat this cell until it returns success.
wait_ssh "$FLOATING_IP"
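
##### Instead of re-running the cell by hand, the check can be wrapped in a retry loop. The sketch below assumes `wait_ssh` (provided by the sourced scripts) returns a non-zero status on failure; `retry` itself is a hypothetical helper, not part of reservation.sh.

```shell
# Hypothetical helper: run a command until it succeeds, giving up after
# max_attempts tries and sleeping delay_seconds between tries.
retry() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Example: retry 20 30 wait_ssh "$FLOATING_IP"
```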

In [None]:
TraceTag='trace'
echo $FLOATING_IP
export ConIP=$FLOATING_IP

# STEP 1: Process IO traces to obtain Baseline results (Non-ML)

In [None]:
scp LinnOSWriterReplayer.tgz cc@"$ConIP":/home/cc/

In [None]:
ssh cc@"$ConIP" tar -xzf LinnOSWriterReplayer.tgz

##### LinnOS requires three SSD drives to demonstrate failover behaviour. Here we list all the block devices and pick three SSD drives (for convenience we picked sde, sdf and sdg). Note that picking other drives requires modifying the ml_model.h file inside the linux-5.4.8-linnos/block folder.

In [None]:
ssh cc@"$ConIP" lsblk
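
##### To narrow the `lsblk` listing down to SSD candidates, the output can be filtered on the ROTA (rotational) column. This is only a sketch: `list_ssds` is a hypothetical helper, and the ROTA flag reflects what the kernel reports, which may be inaccurate behind some RAID controllers.

```shell
# Hypothetical helper: print the names of whole disks that report ROTA=0.
# Expects the output of `lsblk -d -n -o NAME,ROTA,TYPE` on stdin.
list_ssds() {
  awk '$2 == 0 && $3 == "disk" { print $1 }'
}

# Example: ssh cc@"$ConIP" lsblk -d -n -o NAME,ROTA,TYPE | list_ssds
```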

##### Here we obtain the baseline SSD performance results (i.e., the CDF graph). To do that, each picked drive needs to be populated with the I/O trace data. Note that each drive needs to be populated only once. Then we run the SSD replayer to measure the performance of the three-drive SSD group on the populated I/O data.

##### In the following cell, we populate the sde, sdf and sdg drives with the anonymous.drive0, anonymous.drive1 and anonymous.drive2 I/O traces using the writer script. The writer script takes the SSD drive and the I/O trace as arguments and populates the specified drive with the chosen trace.
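
##### Each input trace line carries five fields: timestamp in ms, disk ID, offset in bytes, I/O size in bytes, and an r/w flag (1 for read, 0 for write). The sketch below is a sanity check for such a file before a long population run. It assumes the fields are whitespace-separated, which this notebook does not state; `check_trace` is a hypothetical helper and not part of LinnOSWriterReplayer.

```shell
# Hypothetical helper: report the first malformed line of a trace on stdin.
# Assumes five whitespace-separated fields per line.
check_trace() {
  awk '
    NF != 5            { printf "line %d: expected 5 fields, got %d\n", NR, NF; bad = 1; exit }
    $5 != 0 && $5 != 1 { printf "line %d: r/w flag must be 0 or 1\n", NR; bad = 1; exit }
    $4 <= 0            { printf "line %d: I/O size must be positive\n", NR; bad = 1; exit }
    END                { exit bad }
  '
}

# Example (run on the instance): check_trace < testTraces/anonymous.drive0.trace
```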

In [None]:
# Input trace file format:
# 1: timestamp in ms
# 2: disk ID (not used)
# 3: offset in bytes
# 4: I/O size in bytes
# 5: r/w type, 1 for read and 0 for write
# Depending on the condition of the physical instance, the runtime of this cell
# can range from several minutes to several hours.
# Currently the writer returns a segmentation fault on Chameleon instances, but this does
# not affect populating the drives with data, and you can proceed to the next cell.
# The heredoc is unquoted, so $TraceTag is expanded locally; \$! is escaped so that the
# remote shell records the PID of each background writer.
ssh cc@"$ConIP" << EOF
 cd LinnOSWriterReplayer
 nohup sudo ./writer /dev/sde 'testTraces/anonymous.drive0.'$TraceTag &
 id_1=\$!
 nohup sudo ./writer /dev/sdf 'testTraces/anonymous.drive1.'$TraceTag &
 id_2=\$!
 sudo ./writer /dev/sdg 'testTraces/anonymous.drive2.'$TraceTag
 wait \$id_1
 wait \$id_2
EOF
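
##### The cell above mixes two kinds of expansion inside one unquoted heredoc: `$TraceTag` is replaced by the local shell before anything is sent over SSH, while an escaped `\$` is passed through and expanded on the instance. The sketch below shows the difference without needing a remote host; `run_remote` is a hypothetical stand-in for `ssh cc@"$ConIP"`.

```shell
# Stand-in for `ssh host`: a child shell that reads its script from stdin.
run_remote() { sh -s; }

TraceTag='trace'
output=$(run_remote << EOF
TraceTag='remote-value'
echo "unescaped: $TraceTag"
echo "escaped: \$TraceTag"
EOF
)
# The unescaped reference shows the local value, the escaped one the value
# assigned inside the child shell.
echo "$output"
```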

##### In the following cell we use the replayer_fail script to obtain the baseline results on the test I/O set. (The training I/O set is used to train the machine learning model; the test I/O set is used for the baseline vs. LinnOS-ML comparison.) replayer_fail requires three SSD devices and their corresponding traces. With the LinnOS kernel installed, it demonstrates failover behaviour: if a given I/O request is costly, it is redirected to another drive, and the same logic applies to every drive. Without the LinnOS kernel, replayer_fail simply sends each trace to its corresponding drive. The script outputs the resulting latency values along with some other features.

In [None]:
# The replayer output format is:
# <Index of I/O>,<Index of device>,<schedule timestamp>,<I/O latency>,<I/O type>,<I/O size>,<I/O offset>,<submission timestamp>,<return state>
# This process takes 3 minutes. If the cell successfully runs, you will see "All done!" message.
ssh cc@"$ConIP" << EOF
 cd LinnOSWriterReplayer
 sudo ./replayer_fail /dev/sde-/dev/sdf-/dev/sdg \
 'testTraces/testdrive0.'$TraceTag \
 'testTraces/testdrive1.'$TraceTag \
 'testTraces/testdrive2.'$TraceTag py/TestTraceOutput
EOF
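
##### Since the replayer output is comma-separated with the device index in field 2 and the I/O latency in field 4, a quick per-device summary can be computed before moving on. This is a sketch only: `latency_by_device` is a hypothetical helper, it assumes the output file has no header line, and it reports latencies in whatever unit the replayer emits.

```shell
# Hypothetical helper: print "<device> <request count> <mean latency>" for
# each device index found in replayer output read from stdin.
latency_by_device() {
  awk -F',' '
    { count[$2]++; total[$2] += $4 }
    END {
      for (dev in count)
        printf "%s %d %.2f\n", dev, count[dev], total[dev] / count[dev]
    }
  ' | sort -n
}

# Example: ssh cc@"$ConIP" cat LinnOSWriterReplayer/py/TestTraceOutput | latency_by_device
```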

In [None]:
# Installing numpy dependency (for the percentile.py script)
ssh cc@"$ConIP" pip3 install numpy

##### percentile.py takes the latency values produced by the replayer as input and calculates the CDF to produce the baseline curve in the graph.

In [None]:
ssh cc@"$ConIP" python3 LinnOSWriterReplayer/py/percentile.py 2 read \
LinnOSWriterReplayer/py/TestTraceOutput LinnOSWriterReplayer/py/BaselineData
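
##### For a quick spot check of a single point on the CDF, a percentile can also be computed directly in the shell. The sketch below uses the nearest-rank definition for an integer percentile; it is not the algorithm used by percentile.py, whose output may differ slightly, and `percentile` is a hypothetical helper.

```shell
# Hypothetical helper: nearest-rank percentile (integer p, 0-100) of the
# numbers read from stdin, one per line.
percentile() {
  sort -n | awk -v p="$1" '
    { value[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print value[rank]
    }'
}

# Example: cut -d',' -f4 TestTraceOutput | percentile 95
```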

In [None]:
# Here the baseline performance output is saved to be used later in the LinnOSResultsPlot.ipynb script.
scp cc@"$ConIP":/home/cc/LinnOSWriterReplayer/py/BaselineDataread_percentile.csv .

# STEP 2: Train the LinnOS ML model and implant the learned weights to LinnOS kernel code 

##### In this cell we use the replayer_fail script to prepare our traces for LinnOS. (The training I/O set is used to train the machine learning model; the test I/O set is used for the baseline vs. LinnOS-ML comparison.)

In [None]:
# This process takes around 10 minutes. If the cell runs successfully, you will see the "All done!" message.
ssh cc@"$ConIP" << EOF
 cd LinnOSWriterReplayer
 sudo ./replayer_fail /dev/sde-/dev/sdf-/dev/sdg \
 'testTraces/traindrive0.'$TraceTag \
 'testTraces/traindrive1.'$TraceTag \
 'testTraces/traindrive2.'$TraceTag py/TrainTraceOutput
EOF

In [None]:
scp linux-5.4.8-linnos.tar.gz cc@"$ConIP":/home/cc/

In [None]:
ssh cc@"$ConIP" tar -xf linux-5.4.8-linnos.tar.gz

In [None]:
#Installing dependencies (for the pred1.py script)
ssh cc@"$ConIP" pip3 install --upgrade pip
ssh cc@"$ConIP" pip3 install tensorflow==1.15.2
ssh cc@"$ConIP" pip3 install keras==2.1.3 
ssh cc@"$ConIP" pip3 install pandas
ssh cc@"$ConIP" pip3 install scikit-learn

##### The trace parser takes the replayer output as input and converts it into an ML-friendly dataset (i.e., the ML trace dataset), which we will use to train our model.

In [None]:
# Run-time is around a couple of minutes.
for i in 0 1 2 
do
   ssh cc@"$ConIP" python3 LinnOSWriterReplayer/py/traceParser.py direct 3 4 \
   LinnOSWriterReplayer/py/TrainTraceOutput LinnOSWriterReplayer/mlData/temp1 \
   LinnOSWriterReplayer/mlData/"mldrive${i}.csv" "$i"
done

##### Here we train the LinnOS ML model on the ML trace data. The script automatically saves the learned weights/biases to /home/cc/LinnOSWriterReplayer/mlData directory.

In [None]:
# Run-time is around 20 minutes.
# The custom loss modifier can be changed by modifying the custom_loss parameter in the pred1.py script.
# Given limited time, we have not tested the inflection point
# on Chameleon and currently apply a constant p85 threshold.
for i in 0 1 2 
do
   ssh cc@"$ConIP" python3 LinnOSWriterReplayer/py/pred1.py \
   LinnOSWriterReplayer/mlData/"mldrive${i}.csv" > "mldrive${i}results".txt
done
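
##### To illustrate what a constant p85 threshold means, the sketch below labels each latency as slow (1) if it exceeds the 85th percentile of the sample and fast (0) otherwise. It is an assumption-laden illustration, not the code in pred1.py: `label_slow` is a hypothetical helper and uses a nearest-rank percentile.

```shell
# Hypothetical helper: read one latency per line from stdin and print
# "<latency> <label>", where label is 1 if the latency is above the p-th
# percentile (nearest rank) of the input and 0 otherwise.
label_slow() {
  latencies=$(cat)
  threshold=$(printf '%s\n' "$latencies" | sort -n | awk -v p="$1" '
    { value[NR] = $1 }
    END { rank = int((p * NR + 99) / 100); if (rank < 1) rank = 1; print value[rank] }')
  printf '%s\n' "$latencies" | awk -v t="$threshold" '{ print $1, (($1 > t) ? 1 : 0) }'
}

# Example: cut -d',' -f4 TrainTraceOutput | label_slow 85
```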

##### LinnOS requires each ML trace dataset to have a separate directory with its own learned weight/bias files. Hence here we group the weight files for each drive.

In [None]:
ssh cc@"$ConIP" << EOF
 cd LinnOSWriterReplayer/mlData
 mkdir -p drive0weights
 mkdir -p drive1weights
 mkdir -p drive2weights
 cp mldrive0.csv.* drive0weights
 cp mldrive1.csv.* drive1weights
 cp mldrive2.csv.* drive2weights
EOF

##### Here we use the mlHeaderGen.py script to convert the saved weight files to LinnOS kernel-compatible headers. The script writes its output into the LinnOS kernel source tree. The added machine learning header files can be configured (i.e., disabled and enabled) by modifying the ml_models.h file located in the /home/cc/linux-5.4.8-linnos/block directory.

In [None]:
ssh cc@"$ConIP" python3 LinnOSWriterReplayer/mlHeaderGen/mlHeaderGen.py \
Trace sde /home/cc/LinnOSWriterReplayer/mlData/drive0weights /home/cc/linux-5.4.8-linnos/block

ssh cc@"$ConIP" python3 LinnOSWriterReplayer/mlHeaderGen/mlHeaderGen.py \
Trace sdf /home/cc/LinnOSWriterReplayer/mlData/drive1weights /home/cc/linux-5.4.8-linnos/block

ssh cc@"$ConIP" python3 LinnOSWriterReplayer/mlHeaderGen/mlHeaderGen.py \
Trace sdg /home/cc/LinnOSWriterReplayer/mlData/drive2weights /home/cc/linux-5.4.8-linnos/block

# STEP 3: Install LinnOS kernel

In [None]:
# Preparing the config file
# The current kernel config file is copied here for backward compatibility.
ssh cc@"$ConIP" cp /boot/config-4.15.0-112-generic linux-5.4.8-linnos/.config

# Installing the required packages
ssh cc@"$ConIP" sudo apt-get -y install build-essential libncurses-dev bison flex libssl-dev libelf-dev

# Updating the copied config for the new kernel (new options take their default values)
ssh cc@"$ConIP" make -C /home/cc/linux-5.4.8-linnos olddefconfig

In [None]:
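# Compiling the LinnOS kernel
# The remote command is single-quoted so that $(nproc) counts the cores of the instance, not of the Jupyter server.
ssh cc@"$ConIP" 'make -C /home/cc/linux-5.4.8-linnos -j "$(nproc)"' > makeLinnosLog.txt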

In [None]:
# Installing the kernel modules
ssh cc@"$ConIP" sudo make -C /home/cc/linux-5.4.8-linnos modules_install > modulesInstallLinnosLog.txt

In [None]:
# Installing the LinnOS kernel
ssh cc@"$ConIP" sudo make -C /home/cc/linux-5.4.8-linnos install  

In [None]:
ssh cc@"$ConIP" sudo update-initramfs -c -k 5.4.8

In [None]:
ssh cc@"$ConIP" sudo update-grub

In [None]:
ssh cc@"$ConIP" sudo reboot

In [None]:
# Give the instance a few minutes to come back up after the reboot, then
# check the kernel version to make sure that it is Linux 5.4.8-linnos x86_64
ssh cc@"$ConIP" uname -mrs 
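
##### The version check can also be made explicit so that later cells do not run against the stock kernel by accident. The sketch below only inspects a kernel release string; `is_linnos_kernel` is a hypothetical helper, and the pattern assumes the release string contains both "5.4.8" and "linnos", as the expected output above suggests.

```shell
# Hypothetical helper: succeed only if the given `uname -r` string looks
# like a LinnOS build of Linux 5.4.8.
is_linnos_kernel() {
  case "$1" in
    5.4.8*linnos*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   is_linnos_kernel "$(ssh cc@"$ConIP" uname -r)" && echo "LinnOS kernel is running"
```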

# STEP 4: Using the LinnOS, replay the IO traces to obtain ML results

##### In step 1 we already populated the drives, and in step 3 we installed the LinnOS kernel. Hence here we can just run the replayer_fail script to trigger the failover behaviour. The failover decision is made by the LinnOS machine learning model: if an I/O request is predicted to be costly (i.e., to have high latency), it is redirected to another drive. Note that here we run the replayer on the testdrive traces, just as we did to obtain the baseline.

In [None]:
# This process takes about 5 minutes. If the cell runs successfully, you will see the "All done!" message.
ssh cc@"$ConIP" << EOF
 cd LinnOSWriterReplayer
 sudo ./replayer_fail /dev/sde-/dev/sdf-/dev/sdg \
 'testTraces/testdrive0.'$TraceTag \
 'testTraces/testdrive1.'$TraceTag \
 'testTraces/testdrive2.'$TraceTag py/MLOutput
EOF

In [None]:
#Converting the results to graph-friendly format
ssh cc@"$ConIP" python3 LinnOSWriterReplayer/py/percentile.py 2 read \
LinnOSWriterReplayer/py/MLOutput LinnOSWriterReplayer/py/MLData

In [None]:
scp cc@"$ConIP":/home/cc/LinnOSWriterReplayer/py/MLDataread_percentile.csv .

##### To obtain the baseline vs. LinnOS graph, run LinnOSResultsPlot.ipynb. The closer a line gets to the upper left, the better its performance.

# Clean Up

In [None]:
# If you encounter a "Failed to validate token (HTTP 404)" error,
# you can stop/restart your server at https://jupyter.chameleoncloud.org/hub/home to fix it.
openstack stack delete "$RESOURCE_NAME" --yes --wait

In [None]:
blazar lease-delete "$RESOURCE_NAME"