# MAITE Compatibility demo

This notebook contains an end-to-end demonstration of Dioptra that can be run on any modern laptop.

## Setup

Below we import the necessary Python modules and ensure the proper environment variables are set so that all the code blocks will work as expected.

In [2]:
# Import packages from the Python standard library
import importlib.util
import os
import sys
import pprint
import time
import warnings
from pathlib import Path


def register_python_source_file(module_name: str, filepath: Path) -> None:
    """Import a source file directly.

    Args:
        module_name: The module name to associate with the imported source file.
        filepath: The path to the source file.

    Notes:
        Adapted from the following implementation in the Python documentation:
        https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
    """
    spec = importlib.util.spec_from_file_location(module_name, str(filepath))
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)


# Filter out warning messages
warnings.filterwarnings("ignore")

# Experiment name
EXPERIMENT_NAME = "pytorch_maite"

# Default address for accessing the RESTful API service
RESTAPI_ADDRESS = "http://localhost:30080"

# Set DIOPTRA_RESTAPI_URI variable if not defined, used to connect to RESTful API service
if os.getenv("DIOPTRA_RESTAPI_URI") is None:
    os.environ["DIOPTRA_RESTAPI_URI"] = RESTAPI_ADDRESS

# Default address for accessing the MLFlow Tracking server
MLFLOW_TRACKING_URI = "http://localhost:35000"

# Set MLFLOW_TRACKING_URI variable, used to connect to MLFlow Tracking service
if os.getenv("MLFLOW_TRACKING_URI") is None:
    os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI

# Path to workflows archive
WORKFLOWS_TAR_GZ = Path("workflows.tar.gz")

# Register the examples/scripts directory as a Python module
register_python_source_file("scripts", Path("..", "scripts", "__init__.py"))

from scripts.client import DioptraClient
from scripts.utils import make_tar

# Import third-party Python packages
import numpy as np
from mlflow.tracking import MlflowClient

# Create random number generator
rng = np.random.default_rng(54399264723942495723666216079516778448)
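
The "set only if not already defined" pattern used for `MLFLOW_TRACKING_URI` above can also be written with `os.environ.setdefault`. The following is a minimal sketch of that idiom; the `DEMO_*` variable names and the `set_env_default` helper are placeholders invented for illustration and are not used by Dioptra.

```python
import os


def set_env_default(name: str, default: str) -> str:
    """Set an environment variable only if it is not already defined.

    Returns the value that is in effect afterwards.
    """
    return os.environ.setdefault(name, default)


# Hypothetical variables: one undefined, one already set by the user
os.environ.pop("DEMO_RESTAPI_URI", None)
os.environ["DEMO_TRACKING_URI"] = "http://example.invalid:5000"

restapi_uri = set_env_default("DEMO_RESTAPI_URI", "http://localhost:30080")
tracking_uri = set_env_default("DEMO_TRACKING_URI", "http://localhost:35000")

print(restapi_uri)   # the default is used because the variable was undefined
print(tracking_uri)  # the pre-existing value is preserved
```

With this approach, a value exported in the shell before starting the notebook takes precedence over the notebook's defaults.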

## Submit and run jobs

The entrypoints that we will be running in this example are implemented in the Python source files under `src/` and the `src/MLproject` file.
To run these entrypoints within Dioptra's architecture, we need to package those files up into an archive and submit it to the Dioptra RESTful API to create a new job.
For convenience, we provide the `make_tar` helper function defined in `examples/scripts/utils.py`.

In [46]:
def mlflow_run_id_is_not_known(response_fgm):
    return response_fgm["mlflowRunId"] is None and response_fgm["status"] not in [
        "failed",
        "finished",
    ]
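
The helper above is meant to be used in a polling loop. As a minimal sketch, such a loop could be wrapped in a function with a timeout so that a stalled job does not block the notebook indefinitely. The `wait_for_run_id` name, its parameters, and the `fake_fetch` stand-in (which replaces a real call such as `restapi_client.get_job_by_id`) are assumptions made for this illustration.

```python
import time
from typing import Callable


def mlflow_run_id_is_not_known(response: dict) -> bool:
    return response["mlflowRunId"] is None and response["status"] not in [
        "failed",
        "finished",
    ]


def wait_for_run_id(
    job: dict,
    fetch_job: Callable[[str], dict],
    poll_interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll until the job has an MLFlow run ID, has ended, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while mlflow_run_id_is_not_known(job):
        if time.monotonic() > deadline:
            raise TimeoutError(f"Job {job['jobId']} did not start within {timeout} s")
        time.sleep(poll_interval)
        job = fetch_job(job["jobId"])
    return job


# Hypothetical stand-in for the REST client: returns canned job states in order
_states = iter(
    [
        {"jobId": "demo", "mlflowRunId": None, "status": "queued"},
        {"jobId": "demo", "mlflowRunId": "abc123", "status": "started"},
    ]
)


def fake_fetch(job_id: str) -> dict:
    return next(_states)


initial = {"jobId": "demo", "mlflowRunId": None, "status": "queued"}
result = wait_for_run_id(initial, fake_fetch, poll_interval=0.01)
print(result["mlflowRunId"])
```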

In [54]:
make_tar(["src"], WORKFLOWS_TAR_GZ)

PosixPath('/mnt/c/Users/jtsexton/Documents/GitHub/dioptra/examples/pytorch-maite/workflows.tar.gz')
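
To give a sense of what a helper like `make_tar` does, here is a minimal sketch built on the standard library's `tarfile` module. It is not the implementation in `examples/scripts/utils.py`; the function name `make_tar_sketch` and its `working_dir` parameter are assumptions made for this illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tar_sketch(source_dirs, tarball_path, working_dir="."):
    """Bundle the given directories into a gzip-compressed tarball.

    Returns the absolute path to the archive that was written.
    """
    tarball_path = Path(tarball_path)
    working_dir = Path(working_dir)
    with tarfile.open(tarball_path, "w:gz") as archive:
        for source_dir in source_dirs:
            # arcname keeps paths inside the archive relative (e.g. "src/MLproject")
            archive.add(working_dir / source_dir, arcname=str(source_dir))
    return tarball_path.resolve()


# Demonstrate on a throwaway directory instead of the real src/ folder
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "src").mkdir()
    (tmp_path / "src" / "MLproject").write_text("name: demo\n")

    tarball = make_tar_sketch(["src"], tmp_path / "workflows.tar.gz", working_dir=tmp_path)
    with tarfile.open(tarball, "r:gz") as archive:
        members = sorted(archive.getnames())

print(members)
```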

To connect with the endpoint, we will use a client class defined in the `examples/scripts/client.py` file that is able to connect with the Dioptra RESTful API using the HTTP protocol.
We connect using the client below.
The client uses the environment variable `DIOPTRA_RESTAPI_URI`, which we configured at the top of the notebook, to figure out how to connect to the Dioptra RESTful API.

The MlflowClient object is used to retrieve our results.

In [55]:
restapi_client = DioptraClient()
mlflow_client = MlflowClient()

We need to register an experiment under which to collect our job runs.
Before doing so, the next two cells register the custom task plugins and the job queues.
The code that follows then checks whether the relevant experiment exists: if it does, it returns information about the experiment; if it doesn't, it registers a new experiment.

In [49]:
!python ../scripts/register_task_plugins.py --force --plugins-dir ../task-plugins --api-url http://localhost:30080

╭─────────────────────────────────────────────────╮
│ Dioptra Examples - Register Custom Task Plugins │
╰─────────────────────────────────────────────────╯
 ‣ plugins_dir: ../task-plugins
 ‣ api_url: http://localhost:30080
 ‣ force: True
 ✔ Overwritten. Removed and re-registered the custom task plugin 'custom_fgm_plugins'.
 ✔ Overwritten. Removed and re-registered the custom task plugin 'custom_patch_plugins'.
 ✔ Overwritten. Removed and re-registered the custom task plugin 'custom_poisoning_plugins'.
 ✔ Overwritten. Removed and re-registered the custom task plugin 'evaluation'.
 ✔ Overwritten. Removed and re-registered the c

In [50]:
!python ../scripts/register_queues.py --api-url http://localhost:30080

╭────────────────────────────────────╮
│ Dioptra Examples - Register Queues │
╰────────────────────────────────────╯
 ‣ queue: tensorflow_cpu, tensorflow_gpu, pytorch_cpu, pytorch_gpu
 ‣ api_url: http://localhost:30080
Ⓘ  Skipped. The queue 'tensorflow_cpu' is already registered.
Ⓘ  Skipped. The queue 'tensorflow_gpu' is already registered.
Ⓘ  Skipped. The queue 'pytorch_cpu' is already registered.
Ⓘ  Skipped. The queue 'pytorch_gpu' is already registered.
 ✔ Queue registration is complete.


In [51]:
response_experiment = restapi_client.get_experiment_by_name(name=EXPERIMENT_NAME)

if response_experiment is None or "Not Found" in response_experiment.get("message", []):
    response_experiment = restapi_client.register_experiment(name=EXPERIMENT_NAME)

response_experiment

{'experimentId': 1,
 'createdOn': '2024-04-11T15:20:35.000038',
 'lastModified': '2024-04-11T15:20:35.000038',
 'name': 'pytorch_maite'}
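
The get-or-register logic above is idempotent: running the cell twice should return the same experiment rather than creating a duplicate. The following sketch demonstrates this with an in-memory stand-in for the REST client. `InMemoryExperimentClient` is hypothetical and exists only for this illustration; in particular, the shape of its "Not Found" response is an assumption chosen to match the check used above, not a description of the real API.

```python
class InMemoryExperimentClient:
    """Hypothetical stand-in for the REST client; keeps experiments in a dict."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        # Assumed error shape: a dict whose "message" mentions "Not Found"
        return self._experiments.get(name, {"message": "Not Found"})

    def register_experiment(self, name):
        experiment = {"experimentId": len(self._experiments) + 1, "name": name}
        self._experiments[name] = experiment
        return experiment


def get_or_register_experiment(client, name):
    """Return the named experiment, registering it first if it does not exist."""
    response = client.get_experiment_by_name(name=name)
    if response is None or "Not Found" in response.get("message", []):
        response = client.register_experiment(name=name)
    return response


client = InMemoryExperimentClient()
first = get_or_register_experiment(client, "pytorch_maite")
second = get_or_register_experiment(client, "pytorch_maite")
print(first)
print(first == second)
```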

In [58]:
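import json
import shlex


def get_output(res):
    """Wait for a job to end, then print the metrics logged to its MLFlow run."""
    # Stop polling on "failed" as well as "finished" so a failed job cannot loop forever
    while res["status"] not in ["failed", "finished"]:
        time.sleep(1)
        res = restapi_client.get_job_by_id(res["jobId"])
    if res["status"] == "failed" or res["mlflowRunId"] is None:
        print(f"Job {res['jobId']} ended with status {res['status']!r}; no metrics to show.")
        return
    out = mlflow_client.get_run(res["mlflowRunId"])
    pprint.pprint(out.data.metrics)


def format_kwargs_dict(kwargs_dict):
    """Serialize a dictionary as compact JSON (no spaces) for use as a parameter value."""
    return json.dumps(kwargs_dict, separators=(",", ":"))


def post_process_kwargs(args):
    """Turn a list of "key=value" strings into a shell-quoted "-P key=value ..." string."""
    print(args)
    return " ".join("-P " + shlex.quote(arg) for arg in args)


def gen_attack_kwargs(library, name, kwargs_dict):
    """Build the "-P" parameter string for the attack entry point."""
    args = [
        "subset=100",
        "save_original=False",
        "batch_size=2",
        f"attack_name={name}",
        f"attack_library={library}",
        f"attack_kwargs={format_kwargs_dict(kwargs_dict)}",
    ]
    return post_process_kwargs(args)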

def submit_job(ep, ep_kwargs):
    return restapi_client.submit_job(
        workflows_file=WORKFLOWS_TAR_GZ,
        experiment_name=EXPERIMENT_NAME,
        entry_point=ep,
        entry_point_kwargs=ep_kwargs,
        queue="pytorch_cpu",
        timeout="1h",
    )

def infer_from_artifact():
    return submit_job(ep="infer_from_artifact", ep_kwargs={"run_id": "",
                                                           "adv_tar_name": "fgm.tar.gz",
                                                           "adv_data_dir": "adv_testing",
                                                           "image_size": [3,224,224],
                                                           "new_size": 224})

def infer_from_dataset_maite():
    kwargs = {'provider_name':'huggingface',
               'dataset_name':'cifar10',
               'task':'image-classification',
               'split':'test'}
    ep_kwargs={"local_dataset": False, "dataset_kwargs": format_kwargs_dict(kwargs) }
    
    args = post_process_kwargs([m + '=' + str(ep_kwargs[m]) for m in ep_kwargs])
    return submit_job(ep="infer_from_dataset",ep_kwargs=args)
    
def infer_from_dataset_local():
    kwargs = {"data_dir":"/dioptra/data/Mnist/testing",
              "image_size":"[28,28,3]",
              "new_size":224,
              "validation_split": 0.3}
    ep_kwargs={"local_dataset": True, "dataset_kwargs": format_kwargs_dict(kwargs) }    
    args = post_process_kwargs([m + '=' + str(ep_kwargs[m]) for m in ep_kwargs])

    return submit_job(ep="infer_from_dataset", ep_kwargs=args)

def register_model_from_maite():
    return submit_job(ep="register_model", ep_kwargs={})

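def do_gen_attack(attack_name, kwargs_dict, attack_library="art.attacks.evasion"):
    # NOTE: the default value of attack_library is an assumption; adjust it to the
    # module that provides the attack classes in your setup.
    cmdline = gen_attack_kwargs(attack_library, attack_name, kwargs_dict)
    return submit_job(ep="attack", ep_kwargs=cmdline)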
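
The helpers above pass dictionaries as compact JSON and rely on `shlex.quote` so that each `-P key=value` pair survives as a single shell token. The sketch below shows that round trip in isolation. The `build_cmdline` and `parse_cmdline` names are invented for this illustration, and the parsing step is only an approximation of what the receiving side might do.

```python
import json
import shlex


def build_cmdline(params: dict) -> str:
    """Render parameters as a shell-quoted "-P key=value" string."""
    args = []
    for key, value in params.items():
        if isinstance(value, dict):
            # Compact JSON avoids spaces, but quotes and braces still need shell quoting
            value = json.dumps(value, separators=(",", ":"))
        args.append(f"{key}={value}")
    return " ".join("-P " + shlex.quote(arg) for arg in args)


def parse_cmdline(cmdline: str) -> dict:
    """Split the string back into a {key: value-string} mapping."""
    tokens = shlex.split(cmdline)
    parsed = {}
    for flag, arg in zip(tokens[0::2], tokens[1::2]):
        if flag != "-P":
            raise ValueError(f"Unexpected flag: {flag}")
        key, _, value = arg.partition("=")
        parsed[key] = value
    return parsed


cmdline = build_cmdline(
    {
        "local_dataset": False,
        "dataset_kwargs": {"provider_name": "huggingface", "split": "test"},
    }
)
recovered = parse_cmdline(cmdline)
dataset_kwargs = json.loads(recovered["dataset_kwargs"])

print(cmdline)
print(dataset_kwargs)
```

Note that every value arrives as a string after the round trip, so booleans such as `False` need to be converted on the receiving side.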


The `infer_from_dataset` entry point uses basic MAITE functionality: it loads a dataset from huggingface or a local dataset, loads a model from huggingface or uses a registered model, loads a metric from torchvision, and runs that metric on that model/dataset. If the model is newly loaded, it is also saved to MLFlow. In this example, we will load both the dataset and the model from MAITE.

In [67]:
response_test_metrics = infer_from_dataset_maite()  # pull a dataset from maite, and a model from maite.
pprint.pprint(response_test_metrics)

['local_dataset=False', 'dataset_kwargs={"provider_name":"huggingface","dataset_name":"cifar10","task":"image-classification","split":"test"}']
{'createdOn': '2024-06-18T19:29:52.045294',
 'dependsOn': None,
 'entryPoint': 'infer_from_dataset',
 'entryPointKwargs': '-P local_dataset=False -P '
                     '\'dataset_kwargs={"provider_name":"huggingface","dataset_name":"cifar10","task":"image-classification","split":"test"}\'',
 'experimentId': 1,
 'jobId': '3febbd7e-1304-4c60-a0ab-ca8a5d7457e0',
 'lastModified': '2024-06-18T19:29:52.045294',
 'mlflowRunId': None,
 'queueId': 3,
 'status': 'queued',
 'timeout': '1h',
 'workflowUri': 's3://workflow/938ee15e18c44663878ba27e0b564661/workflows.tar.gz'}


In this example, we will load the dataset from disk and the model from MAITE.

In [68]:
response_test_metrics = infer_from_dataset_local()  # load a dataset from disk, and a model from maite.
pprint.pprint(response_test_metrics)

['local_dataset=True', 'dataset_kwargs={"data_dir":"/dioptra/data/Mnist/testing","image_size":"[28,28,3]","new_size":224,"validation_split":0.3}']
{'createdOn': '2024-06-18T19:33:15.474895',
 'dependsOn': None,
 'entryPoint': 'infer_from_dataset',
 'entryPointKwargs': '-P local_dataset=True -P '
                     '\'dataset_kwargs={"data_dir":"/dioptra/data/Mnist/testing","image_size":"[28,28,3]","new_size":224,"validation_split":0.3}\'',
 'experimentId': 1,
 'jobId': '6e07c700-0869-44f7-bc2f-545a15b7c7de',
 'lastModified': '2024-06-18T19:33:15.474895',
 'mlflowRunId': None,
 'queueId': 3,
 'status': 'queued',
 'timeout': '1h',
 'workflowUri': 's3://workflow/1ef5b336c8cf4efd9c02d452a7ca12d3/workflows.tar.gz'}


The `register_model` entry point loads a model from huggingface and saves it to MLFlow.

In [None]:
response_model = register_model_from_maite()
pprint.pprint(response_model)

The `test_model` entrypoint loads the previously saved model from MLFlow into a MAITE-readable format, then uses MAITE to evaluate it against a dataset and metric.

Note: Currently this saves the dataset to `/dioptra/data/tmp`. Make sure the Docker container has permission to write to this location, or change where the dataset is saved.

In [None]:
while mlflow_run_id_is_not_known(response_model):
    time.sleep(1)
    response_model = restapi_client.get_job_by_id(response_model["jobId"])

response_use_model = restapi_client.submit_job(
    workflows_file=WORKFLOWS_TAR_GZ,
    experiment_name=EXPERIMENT_NAME,
    entry_point="test_model",
    entry_point_kwargs=" ".join([
        "-P model_name=loaded_model",
        "-P model_version=1",
        "-P subset=500"
    ]),
    queue="pytorch_cpu",
    timeout="1h",
)

The `gen` entrypoint loads a dataset using MAITE, runs the specified attack on it, and saves the output of the attack to MLFlow as an artifact. It can optionally write the original dataset to disk as well. The `do_gen_attack` function takes the class name of the attack and a dictionary of parameters. Any parameters that the attack does not need are filtered out and reported in the logs.

Note: In theory this function could also work with poisoning attacks, since it uses heart-lib and nothing in it specifically requires an evasion attack. At the time of writing, however, there did not seem to be a compatible poisoning example. Most poisoning examples require knowledge of the model's feature layers, which is not available in the context of this notebook, while `PoisoningAttackBackdoor` is blocked by an incompatibility with heart-lib. This may be corrected in the future, in which case that example may work.

In [None]:
response_gen_fgm = do_gen_attack("FastGradientMethod", {'eps': 0.3, 'eps_step': 0.1, 'norm': 'inf', 'minimal': False})
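
The filtering of unneeded parameters described above happens inside the entrypoint, whose code is not shown in this notebook. As a rough sketch of how such filtering can be done in general, the example below compares the supplied keyword arguments against a callable's signature using `inspect.signature`. The `filter_kwargs` helper and the `example_attack` function are hypothetical and are not part of Dioptra, MAITE, or any attack library.

```python
import inspect
import logging

logger = logging.getLogger(__name__)


def filter_kwargs(func, kwargs):
    """Split kwargs into those accepted by func and the names that were dropped."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means the callable accepts any keyword argument
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), []
    accepted = {key: value for key, value in kwargs.items() if key in params}
    rejected = sorted(key for key in kwargs if key not in params)
    if rejected:
        logger.warning("Ignoring unsupported parameters: %s", rejected)
    return accepted, rejected


def example_attack(eps=0.3, eps_step=0.1, norm="inf"):
    """Hypothetical attack constructor used only to demonstrate the filter."""
    return {"eps": eps, "eps_step": eps_step, "norm": norm}


accepted, rejected = filter_kwargs(
    example_attack, {"eps": 0.1, "minimal": False, "unknown": 1}
)
print(accepted)
print(rejected)
configured = example_attack(**accepted)
```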

In [None]:
# may require GPU
response_gen_pt = do_gen_attack("PixelAttack", {})

In [None]:
# may require GPU
response_gen_pgd = do_gen_attack("ProjectedGradientDescentPyTorch", {})

In [None]:
# may require GPU
response_gen_hsj = do_gen_attack("HopSkipJump", {})

In [None]:
# Does not work currently due to a problem with heart-lib. It is kept here to show the syntax
# for passing an existing function as an argument and for using a different library.
# response_gen_poison = do_gen_attack("PoisoningAttackBackdoor", {'perturbation_FUNCTION': 'art.attacks.poisoning.perturbations.add_single_bd' }, attack_library='art.attacks.poisoning')
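
The commented-out call above passes a function by its dotted import path under a key ending in `_FUNCTION`. How the entrypoint interprets that key is not shown in this notebook. The sketch below illustrates one plausible approach, under the assumption that the suffix is stripped and the string is resolved to a callable with `importlib`. Both helper names are invented here, and `math.sqrt` stands in for a real perturbation function.

```python
import importlib


def resolve_dotted_name(dotted_name: str):
    """Import "package.module.attr" and return the attribute."""
    module_name, _, attr_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def resolve_function_kwargs(kwargs: dict, suffix: str = "_FUNCTION") -> dict:
    """Replace "<name>_FUNCTION": "dotted.path" entries with "<name>": callable.

    ASSUMPTION: stripping the suffix mirrors what the entrypoint might do;
    the actual behavior is not confirmed by the notebook.
    """
    resolved = {}
    for key, value in kwargs.items():
        if key.endswith(suffix):
            resolved[key[: -len(suffix)]] = resolve_dotted_name(value)
        else:
            resolved[key] = value
    return resolved


resolved = resolve_function_kwargs({"scale_FUNCTION": "math.sqrt", "eps": 0.3})
print(sorted(resolved))
print(resolved["scale"](16.0))
```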

The `infer` entrypoint takes the previously generated FGM attack results and runs them against a given model and metric. It is wrapped in a function here and tested against four CIFAR10 models on huggingface from different authors. Note that not all CIFAR10 models on huggingface are compatible, for various reasons: a missing `config.json`, different requirements for data formatting, and so on. The examples included below worked at the time of testing.

Although MAITE also supports torchvision as a provider, torchvision does not seem to provide pretrained CIFAR10 models. An ImageNet example may be better suited to cross-testing torchvision and huggingface models.

In [None]:
def test_cifar10_fgm(provider, model):
    global response_gen_fgm
    while mlflow_run_id_is_not_known(response_gen_fgm):
        time.sleep(1)
        response_gen_fgm = restapi_client.get_job_by_id(response_gen_fgm["jobId"])
    response_infer_fgm = restapi_client.submit_job(
        workflows_file=WORKFLOWS_TAR_GZ,
        experiment_name=EXPERIMENT_NAME,
        entry_point="infer",
        entry_point_kwargs=" ".join([
            f"-P run_id={response_gen_fgm['mlflowRunId']}",
            f"-P model_provider_name={provider}",
            f"-P model_name={model}",
            f"-P model_task=image-classification"
        ]),
        queue="pytorch_cpu",
        timeout="1h",
        depends_on=response_gen_fgm["jobId"],
    )
    return response_infer_fgm

In [None]:
model1_results = test_cifar10_fgm("huggingface","aaraki/vit-base-patch16-224-in21k-finetuned-cifar10")
get_output(model1_results)

In [None]:
model2_results = test_cifar10_fgm("huggingface","abhishek/autotrain_cifar10_vit_base")
get_output(model2_results)

In [None]:
model3_results = test_cifar10_fgm("huggingface","Weili/vit-base-patch16-224-finetuned-cifar10")
get_output(model3_results)

In [None]:
model4_results = test_cifar10_fgm("huggingface","arize-ai/resnet-50-cifar10-quality-drift")
get_output(model4_results)