# TriScale tutorial

|18-05-2021|90 min version|CPS-IoTBench 2021|
|---|---|---|

This notebook contains 
1. A short demonstration of the _TriScale_ API, how the different functions are meant to be used, and the visualizations they produce. 
2. A few simple tasks that invite you to use _TriScale_ to 
    1. design an experiment, and 
    2. analyse (synthetic) data to assess the replicability of the results.

For more details about _TriScale,_ you may refer to [the paper](https://doi.org/10.5281/zenodo.3464273).


## Following the tutorial
To follow the tutorial, some basic knowledge of Python is required. 
You are invited to play around with the cells from this Jupyter notebook, which interleaves code snippets and text explanations. 

**Not familiar with Jupyter notebooks?** No worries!  
You need very little knowledge to successfully follow the tutorial, most of which you can find in these doc pages:
- [Notebook user interface and Document structure](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface)
- [Basic workflow](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#basic-workflow)
- [User interface components](https://jupyter-notebook.readthedocs.io/en/stable/ui_components.html) 


## List of imports
To get started, let's import the Python libraries we need in this tutorial. All the _TriScale_-specific functions are part of one module called `triscale`.

In [None]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

import triscale

## Part 0. Performance evaluation of Banana

Throughout the tutorial presentation, we used the Banana communication protocol as an example. Let's see how easily we can improve the experiment design and analysis with _TriScale._

> We want to measure the overall energy consumption achieved by the protocol.  
For this purpose, we can use a simple metric: the sum of the energy consumed by all nodes in the network. The lower, the better.  
> 
> Note that we could pick _any metric_; the choice of the metric is independent of TriScale's methodology.

TriScale defines performance objectives as percentiles of the metric distributions. The experiments aim to estimate these percentiles with a given confidence level: that's what we call the KPIs, or Key Performance Indicators. 

In [None]:
# Definition of Banana's KPI
percentile = 50 # the median
confidence = 95 # the confidence level, in %

These two values are sufficient to define the minimal number of runs required to compute this KPI:

In [None]:
triscale.experiment_sizing(
    percentile, 
    confidence,
    verbose=True); 

We need a **minimum of 5 runs.**

We do the same thing to estimate the long-term variability with the variability score.

In [None]:
# Definition of Banana's variability score
percentile = 25 # the 25th-75th percentile range
confidence = 95 # the confidence level, in %

In [None]:
triscale.experiment_sizing(
    percentile, 
    confidence,
    verbose=True); 

We need a **minimum of 11 series.**

Hence, with only these four parameters, we can connect the total number of runs one needs to perform (a minimum of 11 series of 5 runs) with the corresponding performance claims that one can make:
- **KPI**: In a series of runs, the median of the run metric values is lower than or equal to the KPI, with a confidence of 95%.
- **Variability score**: The range of KPI values of the middle 50% of series is less than or equal to the variability score, with a confidence of 95%.

--- 

The rest of this notebook provides more details about how TriScale computes these minimal numbers of runs and series (Part 1.). It then introduces an actual experiment dataset as support to present TriScale's analysis functions (Part 2.). 

Have fun!

## Part 1. Experiment design

During the design phase of an experiment, one important question to answer is "how many times should the experiments be performed?" This question directly relates to the definition of _TriScale_ KPIs and variability scores. 

_TriScale_ implements a statistical method that makes it possible to estimate, based on a data sample, any percentile of the underlying distribution with any level of confidence. Importantly, the estimation does not rely on any assumption about the nature of the underlying distribution (e.g., normal or Poisson). The estimate is valid as long as the sample is independent and identically distributed (or _iid_).

Intuitively, it is "easier" to estimate the median (50th percentile) than the 99th percentile; the more extreme the percentile, the more samples are required to provide an estimate for a given level of confidence. More precisely, the minimal number of samples $N$ required to compute a lower bound for a percentile $0<p\leq 0.5$ with confidence $0<C<1$ (percentiles above the median are handled by symmetry, replacing $p$ with $1-p$) is given by:

$$N \;\geq\; \frac{\log(1-C)}{\log(1-p)}$$
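
As a quick cross-check of the formula above, the following minimal sketch evaluates it in plain Python. The helper `min_samples` is hypothetical (it is not part of the `triscale` module); it takes the percentile and confidence in %, and it assumes that percentiles above the median are handled by symmetry.

```python
import math

def min_samples(percentile, confidence):
    """Smallest N such that the extreme sample bounds the percentile.

    Hypothetical helper, not part of triscale.
    Both arguments are given in % (e.g., 10 and 99).
    """
    p = percentile / 100
    c = confidence / 100
    # Assumption: percentiles above the median use the symmetric case
    p = min(p, 1 - p)
    # N >= log(1 - C) / log(1 - p), rounded up to the next integer
    return math.ceil(math.log(1 - c) / math.log(1 - p))

print(min_samples(50, 95))  # 5
print(min_samples(25, 95))  # 11
print(min_samples(10, 99))  # 44
print(min_samples(90, 99))  # 44
```

The first two values match the numbers of runs and series obtained for Banana in Part 0.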

_TriScale_'s `experiment_sizing()` function implements this computation and returns the minimal number of samples $N$, as illustrated below.

In [None]:
# Select the percentile we want to estimate 
percentile = 10

# Select the desired level of confidence for the estimation
confidence = 99 # in %

# Compute the minimal number of samples N required
triscale.experiment_sizing(
    percentile, 
    confidence,
    verbose=True); 

Let us assume the samples $x$ are ordered such that $x_1 \leq x_2 \leq \ldots \leq x_N$. 
The previous result indicates that for $N = 44$ samples and above, $x_1$ is a lower bound for the 10th percentile with probability larger than 99%. 

Observe that the probabilities are symmetric; it takes the same number of samples to compute a lower bound for the 10th percentile as to compute an upper bound for the 90th percentile.

In [None]:
percentile = 90
confidence = 99 # in %

triscale.experiment_sizing(
    percentile, 
    confidence,
    verbose=True); 

As you can see, the `experiment_sizing` function automatically returns an upper bound for percentiles $P>50$ and a lower bound for percentiles $P<50$.

To get a better feeling of how this minimal number of samples evolves with increasing confidence and more extreme percentiles, let us compute the minimal number of samples for a range of values and display the results in a table (where the columns are the percentiles to estimate).

In [None]:
# Sets of percentiles and confidence levels to try
percentiles = [0.1, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.9]
confidences = [75, 90, 95, 99, 99.9, 99.99]
min_number_samples = []

# Computing the minimum number of runs for each (perc., conf.) pair
for c in confidences:
    tmp = []
    for p in percentiles:
        N = triscale.experiment_sizing(p,c)
        tmp.append(N[0])
    min_number_samples.append(tmp)
    
# Put the results in a DataFrame for a convenient display of the results
df = pd.DataFrame(columns=percentiles, data=min_number_samples)
df['Confidence level'] = confidences
df.set_index('Confidence level', inplace=True)

display(df)

Similarly, one can compute the minimal number $N$ such that any sample $x_m$ is an estimate (instead of $x_1$). This can be obtained from the `experiment_sizing()` function using the optional `robustness` argument.

In [None]:
percentile = 10
confidence = 99
triscale.experiment_sizing(
    percentile, 
    confidence,
    robustness=3,
    verbose=True); 

The previous result indicates that for $N = 97$ samples and above, $x_4$ is a lower bound for the 10th percentile with probability larger than 99%.
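
To see where such a number can come from, here is a minimal sketch based on the binomial distribution: $x_m$ fails to be a lower bound only if fewer than $m$ samples fall below the true percentile. The helper `min_samples_robust` is hypothetical, and it is an assumption that `robustness=r` corresponds to using the sample $x_{r+1}$, as the example above suggests.

```python
from math import comb

def min_samples_robust(percentile, confidence, robustness=0):
    """Smallest N such that x_(robustness+1) bounds the percentile.

    Hypothetical helper, not part of triscale.
    Both percentile and confidence are given in %.
    """
    p = percentile / 100
    c = confidence / 100
    p = min(p, 1 - p)      # assumption: symmetric handling above the median
    m = robustness + 1     # index of the ordered sample used as the bound
    n = m
    while True:
        # Probability that fewer than m samples fall below the percentile,
        # i.e., that x_m is not a valid lower bound
        miss = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m))
        if miss <= 1 - c:
            return n
        n += 1

print(min_samples_robust(10, 99))                # 44
print(min_samples_robust(10, 99, robustness=3))  # 97
```

With `robustness=0` the sum reduces to $(1-p)^N$, which gives back the closed-form expression used for $x_1$.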

## Part 1. Your turn! 

Based on the explanations above, use _TriScale_'s `experiment_sizing` function to answer the following questions:
- What is the minimal number of runs required to estimate the **90th** percentile with **90%** confidence?
- What is the minimal number of runs required to estimate the **90th** percentile with **95%** confidence?
- What is the minimal number of runs required to estimate the **95th** percentile with **90%** confidence?
- Based on the answers to the previous questions, is it harder (i.e., does it require more runs) to increase the confidence level, or to estimate a more extreme percentile? 

In [None]:
# 
# 
# 
# You can either write your own code here, or modify and re-execute some of the cells above. 
# 
# 
# Have fun! 
# 
# 
# 


Answering these questions should be fairly easy. Want a harder one? Then try this:  
- For $N = 50$ samples, what is the index $m$ of the best possible (i.e., the largest) lower bound for the 25th percentile, estimated with a 95% confidence level? 

## Part 2. Data analysis

### Runs and Metrics

Metrics in _TriScale_ evaluate a performance dimension across a run. The computation of metrics is implemented in the `analysis_metric()` function, which takes two compulsory arguments:
- the raw data,
- the metric definition.

The raw data can be passed as a file name (i.e., a string) or as a pandas DataFrame. 
- If a string is passed, the function tries to read the file name as a csv file (comma separated) with `x` data in the first column and `y` data in the second column. 
- If a pandas DataFrame is passed, `data` must contain columns named `x` and `y`.

The metric definition is provided as a dictionary, with only the `measure` key being compulsory. This defines "what is the computation to be performed" on the data. The measure can be any percentile ($0<P<100$) or `mean`, `minimum`, `maximum`.
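
To illustrate what a measure amounts to, here is a rough sketch in plain Python. The helper `compute_measure` is hypothetical, and the linear interpolation used for percentiles is an assumption; the percentile definition used inside `analysis_metric()` may differ.

```python
import math

def compute_measure(values, measure):
    """Apply a measure to the y values of a run (hypothetical helper).

    measure is 'mean', 'minimum', 'maximum', or a percentile 0 < P < 100.
    """
    if measure == 'mean':
        return sum(values) / len(values)
    if measure == 'minimum':
        return min(values)
    if measure == 'maximum':
        return max(values)
    # Otherwise, treat the measure as a percentile.
    # Assumption: linear interpolation between the ordered samples.
    ordered = sorted(values)
    rank = (measure / 100) * (len(ordered) - 1)
    low = math.floor(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])

delays = [12.0, 15.0, 11.0, 30.0, 14.0]    # made-up values, in ms
print(compute_measure(delays, 50))          # 14.0
print(compute_measure(delays, 'maximum'))   # 30.0
```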

In [None]:
# Input data file (one-way delay of a full-throttled flow using TCP BBR)
data = 'ExampleData/raw_data.csv'

# Definition of a TriScale metric
metric = {  
    'measure': 50,   # Integer: interpreted as a percentile
    'unit'   : 'ms', # For display only
         }

has_converged, metric_measure, plot = triscale.analysis_metric( 
    data,
    metric)

print('Run metric: %0.2f %s' % (metric_measure, metric['unit']))

Passing the optional argument `plot=True` generates a plot of the raw data. 

In [None]:
has_converged, metric_measure, plot = triscale.analysis_metric( 
    data,
    metric,
    plot=True,
)

> As presented here, this function is not very interesting: it "only" returns some percentile of an array... 
The main value of the `analysis_metric()` function is when the metric attempts to estimate the long-term performance; that is, the value one would obtain should the run last longer or more data points be collected. 
When this is the case, _TriScale_ performs a convergence test on the data, which can be triggered in the `analysis_metric()` function by passing the optional `convergence` parameter. 
The study of convergence goes beyond the scope of this tutorial; refer to the [paper](https://doi.org/10.5281/zenodo.3464273) for more details. 

### Series and KPIs

_TriScale_'s key performance indicators (KPIs) evaluate performance dimensions across a series of runs. Performing multiple runs makes it possible to mitigate the inherent variability of the experimental conditions. KPIs capture this variability by estimating percentiles of the (unknown) metric distributions. Concretely, a _TriScale_ KPI is a one-sided confidence interval of a percentile, e.g., a lower bound for the 25th percentile of a throughput metric, estimated with a 95% confidence level.

The computation of KPIs is implemented in the `analysis_kpi()` function, which takes two compulsory arguments:
- the metric data
- the KPI definition

The metric data can be passed as a list or a NumPy array. 
The KPI definition is provided as a dictionary with three compulsory keys: `percentile` ($0<P<100$), `confidence` ($0<C<100$), and `bounds`. The KPI bounds are the expected extremal values of the metric; this is necessary to perform the independence test (see below).

> If the metric bounds are unknown, simply pass the minimum and maximum metric values as bounds.

The `analysis_kpi()` function performs two computations:
1. It performs an empirical independence test; that is, the function tests whether the metric data appears to be iid.
2. It computes the KPI value for the metric data.  

The metric data must be iid for the KPI to be a valid estimate of the underlying metric distribution. In general, independence is a property of the data collection process. However, in many practical cases in networking experiments, independence cannot be guaranteed (for example, because the interference conditions of successive experiments are correlated). 
In such a case, one can perform an _empirical_ test for independence; essentially, this assesses whether the level of correlation in the data appears sufficiently low such that the data can be assumed iid.
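
To give an idea of what "sufficiently low correlation" can mean, here is a generic textbook check based on the sample autocorrelation. This is only a sketch: the helpers `autocorrelation` and `looks_uncorrelated` are hypothetical, and the $\pm 1.96/\sqrt{N}$ threshold and the number of lags are assumptions. The criterion that _TriScale_ actually applies is described in the paper and may differ.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation coefficient at a given lag (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0  # constant series: no variation to correlate
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def looks_uncorrelated(values, max_lag=5):
    """True if all coefficients up to max_lag stay within +/- 1.96/sqrt(N)."""
    bound = 1.96 / math.sqrt(len(values))
    return all(abs(autocorrelation(values, lag)) <= bound
               for lag in range(1, max_lag + 1))

# A steadily increasing series is strongly correlated with itself
trend = list(range(20))
print(autocorrelation(trend, 1))   # close to 0.85
print(looks_uncorrelated(trend))   # False
```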

In [None]:
# Load sample metrics data (failure recovery time, in seconds)
data = 'ExampleData/metric_data.csv'
df = pd.read_csv(data, header=0, names=['metric'])

# Minimal KPI definition
KPI = {
    'percentile': 75,
    'confidence': 95,
    'bounds': [0,10],
    'unit': 's'
}

# Computes the KPI
indep_test_passed, KPI_value = triscale.analysis_kpi(
    df.metric.values,
    KPI,
)

# Output
if indep_test_passed:
    print('The metric data appears iid.')
    print('KPI value: %0.2f %s' % (KPI_value, KPI['unit']))
else:
    print('The metric data does not appear iid.')

Since the metric data appears to be iid, we can interpret the KPI value as follows:  
> __With a confidence level of 95%, the 75th percentile of the metric is smaller than or equal to 1.92s.__   
In other words, with a probability of 95%, the performance metric is smaller than or equal to 1.92s in three quarters of the runs. 

If the independence test fails, the KPI value is computed and returned nonetheless. However, the user must be aware that the resulting KPI is then not a trustworthy estimate of the corresponding percentile. Refer to the [paper](https://doi.org/10.5281/zenodo.3464273) for more details about the empirical independence test implemented in _TriScale._
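
To make the KPI computation itself more concrete, the following sketch derives a one-sided upper bound for a percentile from the ordered samples, using the same binomial reasoning as the sizing formula in Part 1. The helper `percentile_upper_bound` is hypothetical; that `analysis_kpi()` follows exactly these steps is an assumption.

```python
from math import comb

def percentile_upper_bound(samples, percentile, confidence):
    """Upper bound for a percentile based on order statistics.

    Hypothetical helper, not part of triscale.
    Returns None when there are too few samples for the requested confidence.
    """
    p = percentile / 100
    c = confidence / 100
    ordered = sorted(samples)
    n = len(ordered)
    cumulative = 0.0
    for m in range(1, n + 1):
        # Add P(exactly m-1 samples lie below the true percentile);
        # the running sum is P(x_m is at or above the true percentile)
        cumulative += comb(n, m - 1) * p ** (m - 1) * (1 - p) ** (n - m + 1)
        if cumulative >= c:
            return ordered[m - 1]
    return None

# Median with 95% confidence: 5 samples suffice, 4 do not
print(percentile_upper_bound([3, 1, 2, 5, 4], 50, 95))  # 5
print(percentile_upper_bound([1, 2, 3, 4], 50, 95))     # None
```

With exactly the minimal number of samples, the bound is the largest sample; collecting more samples lets the bound move towards the center of the data.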

Optionally, the `analysis_kpi()` function plots 
- the metric data series (`series`)
- the autocorrelation plot (`autocorr`)
- the metric data and the corresponding KPI value (`horizontal`)

In [None]:
# Computes the KPI and plot
indep_test_passed, KPI_value = triscale.analysis_kpi(
    df.metric.values,
    KPI,
    # to_plot=['series','autocorr','horizontal']
    to_plot=['horizontal']
)

### Sequels and Variability Scores

Sequels are repetitions of series of runs. _TriScale_'s variability score evaluates the variations of KPI values across sequels. Sequels enable _TriScale_ to detect long-term variations of KPIs and ultimately quantify the reproducibility of an experiment. 
Concretely, a variability score is a two-sided confidence interval, i.e., an estimation of a symmetric pair of percentiles. For example, a 75% confidence interval for the 25-75th percentiles.
The underlying computations are the same as for the [KPI values](#Series-and-KPIs).

The computation of variability scores is implemented in the `analysis_variability()` function, which takes two compulsory arguments:
- the KPI data
- the variability score definition

The KPI data can be passed as a list or a NumPy array. 
The variability score definition is provided as a dictionary with three compulsory keys: `percentile` ($0<P<100$), `confidence` ($0<C<100$), and `bounds`. The bounds are the expected extremal values of the KPI; this is necessary to perform the independence test (see below).

> If the KPI bounds are unknown, simply pass the minimum and maximum KPI values as bounds.

Like in `analysis_kpi()`, `analysis_variability()` performs both the empirical independence test and the computation of the variability score. The same plotting options are available.

In [None]:
# Load sample KPI data (failure recovery time, in seconds)
data = 'ExampleData/kpi_data.csv'
df = pd.read_csv(data, header=0, names=['kpi'])

# Minimal variability score definition
score = {
    'percentile': 25, # the 25th-75th  percentiles range
    'confidence': 95,
    'bounds': [0,10],
    'unit': 's'
}

# Computes the variability score
(indep_test_passed, 
 upper_bound, 
 lower_bound, 
 var_score, 
 rel_score) = triscale.analysis_variability(
    df.kpi.values,
    score,
    to_plot=['series','horizontal']
)

# Output
if indep_test_passed:
    print('The KPI data appears iid.')
    print('Variability score: %0.2f %s' % (var_score, score['unit']))
else:
    print('The KPI data does not appear iid.')

Since the KPI data appears to be iid, we can interpret the variability score as follows:  
> __With a confidence level of 95%, the inter-quartile (25th-75th perc) range of the KPIs is smaller than or equal to 0.4s.__   
In other words, with a probability of 95%, across all series, the middle 50% of KPI values differ by 0.4s or less.


## Part 2. Your turn! 
We have collected data for a comparative evaluation of congestion-control schemes using the [Pantheon platform](https://pantheon.stanford.edu/). Details about the experiment setup can be found in the [TriScale paper](https://doi.org/10.5281/zenodo.3464273).

For the purpose of this tutorial, we provide a dataset containing the metric values computed based on the raw data of each run:
- the mean **throughput** 
- the 95th percentile of the **one-way delay**

We performed five series of ten runs each. Your task consists in analysing this data using _TriScale_ in order to compute and interpret variability scores for different congestion-control schemes. 

Let us first load and visualise the dataset:

In [None]:
# Load the entire dataset
df = pd.read_csv(Path('ExampleData/metrics_wo_convergence.csv'))

display(df)

The following cell contains a simple function to filter this dataset to extract metric values per scheme and per series.

In [None]:
# Extract the list of congestion control schemes and experiment dates
cc_scheme = df.cc.unique()
dates = df.datetime.unique()

# Uncomment to print out the list of schemes
# print(cc_scheme)

# Function parsing the dataset to extract metric values for one scheme and one series of runs
def filter_results(df, cc, date, metric):

    # Setup the data filter
    mask = (
        (df.cc == cc) &
        (df.datetime == date) 
    )
    # Filter
    df_filtered = df.where(mask).dropna()

    # Return the desired metric data
    return df_filtered[(metric+'_value')].values

We can use this function to easily extract all metrics values for one scheme and one metric (e.g., the mean `throughput` of `copa`).

In [None]:
# Let us extract the metric values for the 5 series of runs
cc = 'copa'
metric = 'throughput' # valid options: 'throughput' and 'delay'
results = []

for date in dates:
    values = filter_results(df, cc, date, metric)
    results.append(list(values))

print(results)

Let's now analyse these data! The KPI and variability score definitions for both the `throughput` and `delay` are provided below. 

> Note that we aim to estimate the 25th percentile for the `throughput`, where higher is better; whereas we estimate the 75th percentile of the `delay`, where lower is better.  
Thus, for both metrics, the KPI provides the minimum performance expected in at least 75% of the runs.

In [None]:
# KPIs
KPI_tput  = {'percentile': 25,
             'confidence': 75,
             'name': 'KPI Throughput',
             'unit': 'Mbit/s',
             'bounds':[0,120],    # expected value range
            }
KPI_delay = {'percentile': 75,
             'confidence': 75,
             'name': 'KPI One-way delay',
             'unit': 'ms',
             'bounds':[0,100],    # expected value range
            }

# Variability scores
score_tput  = {'percentile': 75,
             'confidence': 75,
             'name': 'Throughput',
             'unit': 'Mbit/s',
             'bounds':[0,120],    # expected value range
            }
score_delay = {'percentile': 75,
             'confidence': 75,
             'name': 'One-way delay',
             'unit': 'ms',
             'bounds':[0,100],    # expected value range
            }

You are now all set to analyse these data! 
1. Use _TriScale_'s `analysis_kpi` and `analysis_variability` functions to compute the variability score for one scheme and one performance metric (whichever ones).
2. Modify the definition of the variability score to estimate the median (`'percentile': 50`) instead of the 25-75th percentile range. How does this change the variability score obtained? Does this make sense to you?
3. Compute the scores for a few different schemes. Do they vary a lot? Does this variability appear "big" with respect to the range of KPI values?

In [None]:
# 
# 
# 
# You can either write your own code here, or modify and re-execute some of the cells above. 
# 
# 
# Have fun! 
# 
# 
# 


---
If you got down all this way, congratulations! We hope you found this tutorial useful, and that you will remember some of these techniques for the design and analysis of your next experiment! 

Of course, many complexities and aspects of _TriScale_ were not discussed in this tutorial. For more details, you can explore the [(longer) live demo](https://mybinder.org/v2/gh/TriScale-Anon/triscale/master?filepath=triscale_demo.ipynb) and the [TriScale paper](https://doi.org/10.5281/zenodo.3464273).