Let's import `tcbench` under the alias `tcb`.

The module automatically imports a few functions and constants.

In [1]:
import tcbench as tcb

## The `.get_datasets_root_folder()` method

First, you can discover the `<root>` path where the datasets are
installed using `.get_datasets_root_folder()`.

In [2]:
root_folder = tcb.get_datasets_root_folder()
root_folder

PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets')

The function returns a [`pathlib` path](https://docs.python.org/3/library/pathlib.html?highlight=pathlib),
so you can use it to navigate the subfolder structure.

For instance:

In [3]:
list(root_folder.iterdir())

[PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage22'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19')]

As the output shows, each dataset is mapped to a different folder
named after the dataset itself. This means that, again taking advantage of `pathlib`,
you can compose paths from strings.

For instance:

In [4]:
list((root_folder / 'ucdavis-icdm19').iterdir())

[PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/raw'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed')]
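The same navigation pattern can be tried without a tcbench installation. The sketch below recreates the `ucdavis-icdm19` layout shown above inside a temporary directory (the temporary root is only a stand-in for the real `root_folder`) and composes paths with the `/` operator.

```python
import tempfile
from pathlib import Path

# Stand-in for tcb.get_datasets_root_folder(): a temporary directory
# that mimics the layout printed above (an assumption for illustration).
with tempfile.TemporaryDirectory() as tmp:
    root_folder = Path(tmp)
    for sub in ("raw", "preprocessed"):
        (root_folder / "ucdavis-icdm19" / sub).mkdir(parents=True)

    # Compose a path from strings with the "/" operator.
    dataset_folder = root_folder / "ucdavis-icdm19"

    # iterdir() yields entries in arbitrary order, so sort for stable output.
    subfolders = sorted(p.name for p in dataset_folder.iterdir() if p.is_dir())

print(subfolders)
```

Sorting the names is a small design choice: `iterdir()` makes no ordering guarantee, which is also why the listings above do not appear in alphabetical order.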

## The `.DATASETS` enum

A more polished way to reference datasets is via the `tcbench.DATASETS` attribute, which corresponds to a [Python enumeration](https://docs.python.org/3/library/enum.html?highlight=enum#enum.Enum) object.

In [5]:
type(tcb.DATASETS), list(tcb.DATASETS)

(enum.EnumMeta,
 [<DATASETS.UCDAVISICDM19: 'ucdavis-icdm19'>,
  <DATASETS.UTMOBILENET21: 'utmobilenet21'>,
  <DATASETS.MIRAGE19: 'mirage19'>,
  <DATASETS.MIRAGE22: 'mirage22'>])
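To see how a string-valued enumeration of this kind behaves, here is a minimal sketch using a hypothetical `DatasetName` class that mirrors the members printed above. It is not tcbench's actual class definition, only a stand-in built with the standard `enum` module.

```python
from enum import Enum


# Hypothetical stand-in mirroring the members printed above;
# this is not tcbench's actual class definition.
class DatasetName(Enum):
    UCDAVISICDM19 = "ucdavis-icdm19"
    UTMOBILENET21 = "utmobilenet21"
    MIRAGE19 = "mirage19"
    MIRAGE22 = "mirage22"


# Look a member up by its string value (raises ValueError if unknown).
member = DatasetName("mirage19")
print(member)

# The .value attribute gives back the folder-style name.
folder_names = [d.value for d in DatasetName]
print(folder_names)

# Members can also be fetched by attribute name.
same_member = DatasetName["MIRAGE22"] is DatasetName.MIRAGE22
print(same_member)
```

Because each member's value matches a folder name, an enumeration like this can stand in for a raw string wherever a dataset has to be identified, while still catching misspelled names early.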

## The `.get_dataset_folder()` method

For instance, instead of composing a dataset folder path by hand,
you can pass an enum member to `.get_dataset_folder()` to obtain the specific
dataset folder you are looking for.

In [6]:
dataset_folder = tcb.get_dataset_folder(tcb.DATASETS.UCDAVISICDM19)
dataset_folder

PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19')

## Listing files

Via `pathlib` you can easily discover all the parquet files that make up a dataset.

In [7]:
list(dataset_folder.rglob('*.parquet'))

[PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet'),
 PosixPath('/opt/anaconda/anaconda3/envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.p
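A flat `rglob` listing can be hard to scan, so one option is to group the file names by their parent folder. The sketch below does this on a temporary directory populated with empty placeholder files; the three file names are taken from the listing above, while the temporary location and the `files_by_folder` name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real dataset folder; the empty files below reuse
    # names from the listing above and contain no data.
    dataset_folder = Path(tmp) / "ucdavis-icdm19"
    curated = dataset_folder / "preprocessed" / "imc23"
    curated.mkdir(parents=True)
    (dataset_folder / "preprocessed" / "ucdavis-icdm19.parquet").touch()
    (curated / "train_split_0.parquet").touch()
    (curated / "test_split_human.parquet").touch()

    # Group file names by their parent folder, relative to the dataset folder.
    files_by_folder = {}
    for path in sorted(dataset_folder.rglob("*.parquet")):
        folder = path.parent.relative_to(dataset_folder).as_posix()
        files_by_folder.setdefault(folder, []).append(path.name)

print(files_by_folder)
```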

You can also programmatically call the `datasets lsparquet` subcommand of the CLI using `get_rich_tree_parquet_files()`.

In [8]:
from tcbench.libtcdatasets.datasets_utils import get_rich_tree_parquet_files
get_rich_tree_parquet_files(tcb.DATASETS.UCDAVISICDM19)

## The `.load_parquet()` method

Finally, the generic `.load_parquet()` can be used to load one of the parquet files.

For instance, the following loads the unfiltered monolithic file of the `ucdavis-icdm19` dataset.

In [9]:
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19)

In [10]:
df.head(2)

| | row_id | app | flow_id | partition | num_pkts | duration | bytes | unixtime | timetofirst | pkts_size | pkts_dir | pkts_iat |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | google-doc | GoogleDoc-100 | pretraining | 2925 | 116.348 | 816029 | [1527993495.652867, 1527993495.685678, 1527993... | [0.0, 0.0328109, 0.261392, 0.262656, 0.263943,... | [354, 87, 323, 1412, 1412, 107, 1412, 180, 141... | [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... | [0.0, 0.0328109, 0.2285811, 0.0012639999999999... |
| 1 | 1 | google-doc | GoogleDoc-1000 | pretraining | 2813 | 116.592 | 794628 | [1527987720.40456, 1527987720.422811, 15279877... | [0.0, 0.0182509, 0.645106, 0.646344, 0.647689,... | [295, 87, 301, 1412, 1412, 1412, 180, 113, 141... | [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ... | [0.0, 0.0182509, 0.6268551, 0.0012380000000000... |


In [11]:
df.groupby(['partition', 'app'])['app'].value_counts()

partition                    app          
pretraining                  google-doc       1221
                             google-drive     1634
                             google-music      592
                             google-search    1915
                             youtube          1077
retraining-human-triggered   google-doc         15
                             google-drive       18
                             google-music       15
                             google-search      15
                             youtube            20
retraining-script-triggered  google-doc         30
                             google-drive       30
                             google-music       30
                             google-search      30
                             youtube            30
Name: count, dtype: int64

Besides the dataset name, the function has only two other parameters, but
their semantics and values are intertwined with the curation process adopted.

In [12]:
tcb.load_parquet?

Signature:
tcb.load_parquet(
    dataset_name: 'str | DATASETS',
    min_pkts: 'int' = -1,
    split: 'str' = None,
) -> 'pd.DataFrame'
Docstring:
Load and returns a dataset parquet file

Arguments:
    dataset_name: The name of the dataset
    min_pkts: the filtering rule applied when curating the datasets.
        If -1, load the unfiltered dataset
    split: if min_pkts!=-1, is used to request the loading of 
        the split file. For DATASETS.UCDAVISICDM19 
        values can be "human", "script" or a number 
        between 0 and 4.
        For all other dataset split can be anything 
        which is

## How `.load_parquet()` maps to parquet files

The logic to follow to load specific files can be confusing. The table below reports a global view across datasets:

| Dataset | min_pkts=-1 | min_pkts=10 | min_pkts=1000 | split=True | split=0..4 | split=human | split=script |
|:-------:|:-------------:|:-------------:|:---------------:|:------------:|:------------:|:-------------:|:--------------:|
|`ucdavis-icdm19`| yes | - | - | - | yes (train+val) | yes (test)| yes (test)|
|`mirage19`| yes | yes| - | yes (train/val/test) | - | - | - |
|`mirage22`| yes | yes |yes|yes (train/val/test) | - | - | - | 
|`utmobilenet21`| yes | yes |-|yes (train/val/test) | - | - | - |

* `min_pkts=-1` is set by default and corresponds to loading the unfiltered parquet files, i.e., the files stored immediately under `/preprocessed`. All other files are stored under the `imc23` subfolders.

* For `ucdavis-icdm19`, the parameter `min_pkts` is not used. The loading of training(+validation) and test data is controlled by `split`.

* For all other datasets, `min_pkts` specifies which filtered version of the data to use, while `split=True` loads the split indexes.
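The table and the rules above can also be written down as a small lookup. The sketch below is only an illustrative encoding of this documentation: the names `SUPPORTED_MIN_PKTS`, `UCDAVIS_SPLITS` and `is_supported` are hypothetical, and tcbench's own validation logic may differ. For datasets other than `ucdavis-icdm19` it follows the docstring in treating `split` as meaningful only when `min_pkts != -1`.

```python
# Illustrative encoding of the table above; all names here are hypothetical
# and tcbench's own logic may differ.
SUPPORTED_MIN_PKTS = {
    "ucdavis-icdm19": {-1},
    "mirage19": {-1, 10},
    "mirage22": {-1, 10, 1000},
    "utmobilenet21": {-1, 10},
}

UCDAVIS_SPLITS = {"human", "script", "0", "1", "2", "3", "4"}


def is_supported(dataset, min_pkts=-1, split=None):
    """Return True if the table lists this combination as available."""
    if dataset not in SUPPORTED_MIN_PKTS:
        return False
    if dataset == "ucdavis-icdm19":
        # min_pkts is not used; split selects train(+val) or test data.
        return split is None or str(split) in UCDAVIS_SPLITS
    if min_pkts not in SUPPORTED_MIN_PKTS[dataset]:
        return False
    # Assumption taken from the docstring: split only applies to
    # filtered data, i.e. when min_pkts != -1.
    return split is None or (split is True and min_pkts != -1)


print(is_supported("mirage22", min_pkts=1000, split=True))
print(is_supported("mirage19", min_pkts=1000))
print(is_supported("ucdavis-icdm19", split="human"))
```

A check like this can be handy in scripts that loop over several datasets, since it filters out combinations the table marks as unavailable before any file is read.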

## Loading `ucdavis-icdm19`

For instance, to load the `human` test split of `ucdavis-icdm19` you can run

In [13]:
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='human')
df['app'].value_counts()

app
youtube          20
google-drive     18
google-doc       15
google-music     15
google-search    15
Name: count, dtype: int64

The logic is very similar for the `script` partition.

In [14]:
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='script')
df['app'].value_counts()

app
google-doc       30
google-drive     30
google-music     30
google-search    30
youtube          30
Name: count, dtype: int64

However, to load a specific train split, pass its index (a number between 0 and 4) as `split`.

In [16]:
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='0')
df['app'].value_counts()

app
google-doc       100
google-drive     100
google-music     100
google-search    100
youtube          100
Name: count, dtype: int64

## Loading other datasets

By default, without any parameter besides the dataset name, the function loads the unfiltered version of a dataset.

In [17]:
df = tcb.load_parquet(tcb.DATASETS.MIRAGE19)
df.shape

(122007, 135)

Recall the structure of the `mirage19` dataset

In [18]:
get_rich_tree_parquet_files(tcb.DATASETS.MIRAGE19)

So there is only one filtered version, obtained with `min_pkts=10`.

In [19]:
df = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10)
df.shape

(64172, 20)

Based on the dataframe shape, we can see that (indeed) we loaded a reduced version of the unfiltered dataset.

While for `ucdavis-icdm19` the "split" files represent 100 samples selected for training (because there are two ad-hoc test splits), for all other datasets the "split" files contain indexes indicating the rows to use for train/val/test.

Thus, issuing `split=True` is enough to indicate the need to load the split table.

In [20]:
df_split = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10, split=True)
df_split

| | train_indexes | val_indexes | test_indexes | split_index |
|:---:|:---:|:---:|:---:|:---:|
| 0 | [18965, 24694, 59797, 35708, 42030, 356, 39052... | [36752, 7114, 48500, 39083, 44382, 58758, 2001... | [20363, 36256, 24604, 11752, 40529, 50086, 470... | 0 |
| 1 | [8741, 55715, 47053, 37161, 59608, 6777, 47281... | [41506, 56625, 18344, 23114, 10634, 44785, 130... | [21524, 19560, 41837, 57207, 35174, 38440, 563... | 1 |
| 2 | [58596, 59589, 26514, 56766, 51386, 20802, 453... | [11552, 34447, 16180, 21248, 28195, 16763, 387... | [43026, 28228, 29243, 27753, 50389, 48093, 85,... | 2 |
| 3 | [22303, 11403, 53901, 919, 54389, 22144, 51538... | [26990, 50118, 45109, 29126, 16420, 10965, 257... | [10721, 35420, 47187, 51800, 30736, 44707, 134... | 3 |
| 4 | [21918, 7887, 5426, 22788, 40262, 34857, 58966... | [30232, 23269, 16058, 30390, 60505, 26499, 258... | [7366, 48552, 27092, 40144, 19834, 15065, 5229... | 4 |
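One way to use such a split table is to pick one row (one split) and use its index lists to select samples from the filtered dataset. The sketch below shows the idea on toy data with plain Python lists. It assumes the indexes are positional row numbers, which this tutorial does not state explicitly; the `flows` list, the index values and the `select` helper are all made up for illustration. With a pandas dataframe, the analogous selection would be `df.iloc[...]` for positions or `df.loc[...]` for labels.

```python
# Toy stand-ins: six "flows" and one split row with the same column
# names as the split table. Values are made up for illustration.
flows = ["flow-a", "flow-b", "flow-c", "flow-d", "flow-e", "flow-f"]
split_row = {
    "train_indexes": [4, 0, 3],
    "val_indexes": [1],
    "test_indexes": [5, 2],
    "split_index": 0,
}


def select(rows, indexes):
    """Pick rows by position (assumption: indexes are positional)."""
    return [rows[i] for i in indexes]


train = select(flows, split_row["train_indexes"])
val = select(flows, split_row["val_indexes"])
test = select(flows, split_row["test_indexes"])

# In this toy example the three partitions are disjoint and cover every row.
all_indexes = (
    split_row["train_indexes"]
    + split_row["val_indexes"]
    + split_row["test_indexes"]
)
assert sorted(all_indexes) == list(range(len(flows)))

print(train, val, test)
```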
