# Datasets

This notebook walks through the different datasets, their metadata, and their statistics. All datasets are stored on the Google Cloud Platform, and each dataset has its own folder. A file called `metadata.json` in each folder collects metadata and statistics about the dataset. By our convention, dataset names start with **scpa**, while experiment names start with **exp**.

Let us first show how to search and filter the datasets.

## Search datasets

With the following function, you can search through all our datasets (and experiments) on the Google Cloud Platform. The possible arguments are as follows:

 - project_dir (str): The project directory. For this project: ltl-repair
 - name_includes (List[str], optional): A list of strings that should be included in the blob path. Defaults to [].
 - name_excludes (List[str], optional): A list of strings that must not be included in the blob path. Defaults to [].
 - datasets_metadata (Optional[Dict], optional): Filter options for datasets. If not set, no datasets are collected. Defaults to None.
 - experiments_args (Optional[Dict], optional): Filter options for experiments. If not set, no experiments are collected. Defaults to None.
 - ignore_missing_keys (bool, optional): If True, keys specified in the filters do not need to exist in the cloud dataset. Defaults to True.
 - bucket (str, optional): The Google Cloud bucket. Defaults to "ml2-bucket".
 - return_param (List[str], optional): If set, a parameter from the artifact metadata is returned together with the name of the artifact. Specify a list of hierarchical keys to reach the parameter you want. Pass an empty list to return all metadata.

The arguments datasets_metadata and experiments_args specify filters. If an argument is not set, no datasets or experiments, respectively, are collected. If it is an empty dictionary, no filter is applied, so all datasets or experiments, respectively, are collected.

The dictionary should have the same structure as the metadata / args file, including its hierarchy. In place of each value, a tuple (value, operator) is expected, where the operator configures whether we search for values that are
 - equal to the filter value (=),
 - greater than the filter value (>),
 - less than the filter value (<),
 - greater than or equal to the filter value (>=),
 - less than or equal to the filter value (<=).

Make sure the types match and are comparable; otherwise an exception is thrown.
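To make the filter semantics concrete, here is a minimal sketch of how such a nested filter could be evaluated against a metadata dictionary. The helper `matches` and the toy metadata are hypothetical and written only for illustration; this is not the actual implementation behind `cloud.find`. It assumes that a missing or null key is skipped when `ignore_missing_keys` is True.

```python
import operator
from typing import Any, Dict

# Comparison symbols from the list above, mapped to Python operators.
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches(
    metadata: Dict[str, Any],
    filters: Dict[str, Any],
    ignore_missing_keys: bool = True,
) -> bool:
    """Hypothetical helper: check nested metadata against nested (value, operator) filters."""
    for key, condition in filters.items():
        actual = metadata.get(key)
        if actual is None:
            # Assumption: a missing or null key is skipped if ignore_missing_keys is set.
            if ignore_missing_keys:
                continue
            return False
        if isinstance(condition, dict):
            # Nested filter: descend into the corresponding metadata level.
            if not isinstance(actual, dict):
                return False
            if not matches(actual, condition, ignore_missing_keys):
                return False
        else:
            value, op = condition
            # Incomparable types raise a TypeError here.
            if not OPERATORS[op](actual, value):
                return False
    return True


# Toy metadata, made up for illustration only.
toy_metadata = {
    "replace_minimal": False,
    "load_from_beamsize": None,
    "test": {"changed_fraction": 0.5},
}
toy_filters = {
    "replace_minimal": (False, "="),
    "load_from_beamsize": (1, "="),
    "test": {"changed_fraction": (0.33, ">")},
}
print(matches(toy_metadata, toy_filters))
print(matches(toy_metadata, toy_filters, ignore_missing_keys=False))
```

In this sketch the first call accepts the toy metadata because the null `load_from_beamsize` is skipped, while the second call rejects it.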

The following is an example query:

In [1]:
import ml2.gcp_bucket as cloud

cloud.find(
    project_dir="ltl-repair",
    datasets_metadata=dict(
        replace_minimal=(False, "="),
        load_from_beamsize=(1, "="),
        test=dict(changed_fraction=(0.33, ">")),
    ),
    name_includes=["/scpa-repair-gen"],
    name_excludes=["raw"],
    ignore_missing_keys=True
)

Key load_from_beamsize is None in ltl-repair/scpa-repair-gen-75/metadata.json. Ignore...


['scpa-repair-gen-1',
 'scpa-repair-gen-10',
 'scpa-repair-gen-15',
 'scpa-repair-gen-16',
 'scpa-repair-gen-2',
 'scpa-repair-gen-22',
 'scpa-repair-gen-23',
 'scpa-repair-gen-27',
 'scpa-repair-gen-28',
 'scpa-repair-gen-3',
 'scpa-repair-gen-32',
 'scpa-repair-gen-33',
 'scpa-repair-gen-75',
 'scpa-repair-gen-8',
 'scpa-repair-gen-9']

## Search datasets and compare parameters

We can specify a parameter to return for each dataset, which makes it easy to compare that parameter across datasets. In the following, we want to know what fraction of the samples with the status "Changed" satisfy the specification. We look at the train split of all datasets whose name contains "/scpa-repair-alter".

Passing the empty list [] returns all metadata for the datasets found.

In [2]:
import ml2.gcp_bucket as cloud

cloud.find(
    project_dir="ltl-repair",
    datasets_metadata={},
    name_includes=["/scpa-repair-alter"],
    name_excludes=["raw"],
    ignore_missing_keys=True,
    return_param=["train","satisfied_in_changed_fraction"]
)

[('scpa-repair-alter-0', 0.0185),
 ('scpa-repair-alter-1', 0.0145),
 ('scpa-repair-alter-10', 0.016),
 ('scpa-repair-alter-11', 0.015),
 ('scpa-repair-alter-12', 0.038),
 ('scpa-repair-alter-13', 0.03625),
 ('scpa-repair-alter-14', 0.042),
 ('scpa-repair-alter-15', 0.00525),
 ('scpa-repair-alter-16', 0.0045),
 ('scpa-repair-alter-17', 0.0045),
 ('scpa-repair-alter-18', 0.0045),
 ('scpa-repair-alter-19', 0.00625),
 ('scpa-repair-alter-2', 0.0185),
 ('scpa-repair-alter-20', 0.007),
 ('scpa-repair-alter-21', 0.004),
 ('scpa-repair-alter-3', 0.019),
 ('scpa-repair-alter-4', 0.0175),
 ('scpa-repair-alter-5', 0.01575),
 ('scpa-repair-alter-6', 0.0125),
 ('scpa-repair-alter-9', 0.01375)]
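As a rough illustration of how a list of hierarchical keys can address a value in nested metadata, and of how the returned pairs can be ranked afterwards, consider the sketch below. The helper `get_param` and the toy metadata dictionary are hypothetical and only mimic the behaviour described above; the three result tuples are copied from the output of the query.

```python
from functools import reduce
from typing import Any, Dict, List


def get_param(metadata: Dict[str, Any], keys: List[str]) -> Any:
    """Hypothetical helper: follow a list of hierarchical keys into nested metadata."""
    # With an empty key list, reduce returns the metadata dictionary itself.
    return reduce(lambda node, key: node[key], keys, metadata)


# Toy metadata, made up for illustration only.
toy_metadata = {
    "splits": ["train", "val", "test"],
    "train": {"satisfied_in_changed_fraction": 0.0185},
}
print(get_param(toy_metadata, ["train", "satisfied_in_changed_fraction"]))
print(get_param(toy_metadata, []) is toy_metadata)

# Three of the (name, value) pairs from the output above.
results = [
    ("scpa-repair-alter-0", 0.0185),
    ("scpa-repair-alter-12", 0.038),
    ("scpa-repair-alter-14", 0.042),
]
# Rank the datasets by the returned parameter, largest value first.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Sorting by the second tuple element is convenient because the names in the query output are ordered lexicographically, not by value.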

## Download Datasets and Metadata

We can print the metadata of any dataset with the following command. The dataset does not need to be downloaded for that.

In [3]:
from ml2.data import SplitData
import ml2.gcp_bucket as cloud

print(SplitData.pretty_metadata_(cloud.get_metadata(artifact_name="scpa-repair-gen-80", project_dir="ltl-repair")))




{
    "eval_alpha": 0.5,
    "eval_beam_size": "1",
    "from": null,
    "inputs": ["i0", "i1", "i2", "i3", "i4"],
    "load_from_alpha": 0.5,
    "load_from_beamsize": 1,
    "load_from_model": "repair-data-2",
    "load_from_num_samples": null,
    "max_changes": 50,
    "max_distance": 50,
    "max_match_fraction": 0.1,
    "outputs": ["o0", "o1", "o2", "o3", "o4"],
    "parent_dataset": "scpa-2",
    "parent_model": "repair-data-2",
    "reference_alphas": null,
    "reference_beamsizes": [
        "1",
        "2",
        "3",
        "4"
    ],
    "remove_or_alter": "remove",
    "replace_minimal": true,
    "replace_satisfied": false,
    "splits": [
        "train",
        "val",
        "test"
    ],
    "test": {
        "changed_fraction": 0.0,
        "distance_all_mean": 20.56,
        "distance_all_median": 20.0,
        "distance_all_std": 15.08,
        "distance_broken_mean": 22.84,
        "distance_broken_median": 22.0,
        "distance_broken_std": 14.16,
     

With the following commands we can download and inspect a dataset.

In [4]:
from ml2.ltl.ltl_repair.ltl_repair_data import LTLRepairSplitData

dataset = LTLRepairSplitData.load("scpa-repair-gen-9")
dataset["train"].plot_lev(
    filter_range=[None, None],
    bins=300,
    x_label=None,
    dataset_name=dataset.name + "(train)",
).show()
dataset["train"].data_frame.head()


INFO:ml2.artifact:Found split_data scpa-repair-gen-9 locally
INFO:ml2.ltl.ltl_repair.ltl_repair_data:Read in metadata
INFO:ml2.ltl.ltl_repair.ltl_repair_data:Load data from file:/home/matthias/ml2-storage/ltl-repair/scpa-repair-gen-9/train.csv
INFO:ml2.ltl.ltl_repair.ltl_repair_data:Load data from file:/home/matthias/ml2-storage/ltl-repair/scpa-repair-gen-9/val.csv
INFO:ml2.ltl.ltl_repair.ltl_repair_data:Load data from file:/home/matthias/ml2-storage/ltl-repair/scpa-repair-gen-9/test.csv


Unnamed: 0,level_0,status,assumptions,guarantees,repair_circuit,inputs,outputs,realizable,circuit,hash,levenshtein_distance
0,3,Violated,((X (G ((! (o1)) | (((! (i0)) & (! (i4))) U ((...,"((G (F ((! (i0)) | (X (o0)))))),((G ((o4) -> (...",aag 13 5 2 5 6\n2\n4\n6\n8\n10\n12 22\n14 26\n...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 13 5 2 5 6\n2\n4\n6\n8\n10\n12 26\n14 22\n...,00941481d18a078fa4e4052dc205a2e4f1ba27ec6d706d...,7
1,4,Violated,,(((G (((i4) | (i3)) | (i1))) <-> (G (F (o0))))...,aag 12 5 2 5 5\n2\n4\n6\n8\n10\n12 21\n14 25\n...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",1,aag 12 5 2 5 5\n2\n4\n6\n8\n10\n12 21\n14 25\n...,02fc2aa7933c0163d9580fdfdf5e6cd6a6f542fc799674...,1
2,5,Violated,,"((G (((! (o2)) & (! (o1))) | (i0)))),((G ((i4)...",aag 8 5 0 5 3\n2\n4\n6\n8\n10\n0\n16\n0\n4\n13...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",1,aag 9 5 0 5 4\n2\n4\n6\n8\n10\n0\n14\n0\n4\n19...,871efabacf59e5fb8e3358399c0ec0090c8efa0cd09fc0...,16
3,7,Violated,,"(((i0) R (! (o4)))),((G ((((i2) & (! (i4))) & ...",aag 5 5 0 5 0\n2\n4\n6\n8\n10\n0\n1\n0\n1\n0\n...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 5 5 0 5 0\n2\n4\n6\n8\n10\n0\n1\n0\n0\n0\n...,0e1c0752e5ee577bdc3c397037414fa71ad762c8d94253...,1
4,9,Violated,,"((((o1) U (i2)) | (G (o1)))),((G ((o0) -> (X (...",aag 7 5 0 5 2\n2\n4\n6\n8\n10\n14\n10\n0\n8\n0...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",1,aag 7 5 0 5 2\n2\n4\n6\n8\n10\n14\n13\n0\n8\n0...,8f734a607d8c353c8a4bddbab6fb28d5e62766bdb22406...,5
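The `circuit` and `repair_circuit` columns appear to hold circuits in the ASCII AIGER format, whose first line has the form `aag M I L O A` (maximum variable index, number of inputs, latches, outputs, and AND gates). Under that assumption, the following sketch reads the header of a single circuit string. The helper `parse_aag_header` is hypothetical and not part of ml2; the example string is the visible prefix of the first row above.

```python
from typing import Dict


def parse_aag_header(circuit: str) -> Dict[str, int]:
    """Hypothetical helper: parse the 'aag M I L O A' header of an ASCII AIGER circuit."""
    first_line = circuit.split("\n", 1)[0]
    tag, *fields = first_line.split()
    if tag != "aag" or len(fields) != 5:
        raise ValueError(f"not an ASCII AIGER header: {first_line!r}")
    names = ("max_var", "inputs", "latches", "outputs", "and_gates")
    return dict(zip(names, map(int, fields)))


# Visible prefix of the repair_circuit in row 0 (the rest is truncated in the output).
circuit_prefix = "aag 13 5 2 5 6\n2\n4\n6\n8\n10\n12 22\n14 26"
header = parse_aag_header(circuit_prefix)
print(header)
```

For this row the header reports five inputs and five outputs, which agrees with the `inputs` and `outputs` columns.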


With a slight modification, this is also possible for the original Neural Circuit Synthesis dataset.

In [5]:
from ml2.ltl.ltl_repair.ltl_syn_data import LTLSynSplitData

dataset = LTLSynSplitData.load("scpa-2")
dataset["train"].data_frame.head()



INFO:ml2.artifact:Found split_data scpa-2 locally
INFO:ml2.ltl.ltl_repair.ltl_syn_data:Read in metadata


Unnamed: 0,assumptions,guarantees,inputs,outputs,realizable,circuit,hash
0,,(G (((((i4) & (! (i0))) & (! (i3))) & (! (i2))...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 16 5 2 5 9\n2\n4\n6\n8\n10\n12 26\n14 33\n...,433908336677874c9654d86950a01fa2ac7987e0b75dee...
1,(G ((! (i4)) | (! (i0)))),((((! (i4)) & (! (i0))) & (i3)) R ((((o0) | (o...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 8 5 1 5 2\n2\n4\n6\n8\n10\n12 17\n0\n0\n1\...,8650303123812682bc304b7ea7f818bf526fa2541a4f3d...
2,,(G ((((i2) & (i1)) & (! (i0))) -> (F ((((! (o0...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 14 5 1 5 8\n2\n4\n6\n8\n10\n12 29\n0\n13\n...,8ce4d53d1d2f20b9dbf60f5bb2373dabbc4338fd19e3a3...
3,"(G (F (! (i0)))),(G (((i0) & (X ((! (o4)) & (!...","(G (((i4) & (X (i2))) -> (F ((o1) & (o4))))),(...","i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",1,aag 7 5 1 5 1\n2\n4\n6\n8\n10\n12 15\n0\n14\n1...,81ab829166e0924cf944bd660bae2f101f4806fb73ff7f...
4,,(G (((o0) & (X (! (o0)))) -> (X ((! (o0)) U ((...,"i0,i1,i2,i3,i4","o0,o1,o2,o3,o4",0,aag 5 5 0 5 0\n2\n4\n6\n8\n10\n1\n1\n1\n0\n0\n...,289a7bc5063e8d623167c919ac75b5d979085fa188e71a...
