
Benchmarks Discussion #2

Open
stanleybak opened this issue Mar 4, 2021 · 78 comments

@stanleybak
Owner

stanleybak commented Mar 4, 2021

Use this issue to describe your benchmark and solicit feedback. A benchmark consists of one or more ONNX network files and one or more VNNLIB property files. For each instance (combination of .onnx and .vnnlib file), a timeout should also be suggested, where the sum of all the timeouts is less than six hours.

For image classification benchmarks, rather than picking a specific set of images, it's suggested to describe how to generate the .vnnlib file from a specific image (for example, using a fixed epsilon disturbance value and checking for a different maximum output). For the competition we would then sample images from MNIST/CIFAR at random, to prevent tailoring of tools to specific images.

Also, indicate if there should be input scaling, a particular channel ordering, or anything else. It's probably a good idea to provide an example input or image and the expected output, to make sure the execution semantics for the network are clear.
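
For readers new to the format, a VNNLIB property file is a small SMT-LIB-style text file over input variables X_i and output variables Y_j. A minimal, purely illustrative sketch (the bounds and network shape here are made up) for a network with two inputs and two outputs might look like:

; hypothetical property: inputs constrained to a box,
; counterexample exists if output Y_1 reaches or exceeds Y_0
(declare-const X_0 Real)
(declare-const X_1 Real)
(declare-const Y_0 Real)
(declare-const Y_1 Real)
(assert (>= X_0 0.0))
(assert (<= X_0 1.0))
(assert (>= X_1 -0.5))
(assert (<= X_1 0.5))
(assert (>= Y_1 Y_0))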

For reference, here are the topics from last year's benchmark discussion:
verivital/vnn-comp#1
verivital/vnn-comp#2
verivital/vnn-comp#3

@stanleybak
Owner Author

stanleybak commented Apr 17, 2021

To start the discussion, I'll propose a few examples from last year. Here's a link to the preferred benchmark format, demonstrated on a simple example with VNNLIB: link

Here are the ACAS Xu system properties 1-4 that we analyzed last year in this format: link

The benchmark directory includes three parts: a set of .onnx network files, a set of .vnnlib spec files, and a .csv listing each combination of .onnx and .vnnlib file that needs to be analyzed, along with the per-instance timeout. The total timeout for all instances should equal six hours.
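
For concreteness, each line of the .csv gives the network file, the property file, and the per-instance timeout in seconds. A hypothetical excerpt (file names and timeouts are made up for illustration) might look like:

onnx/ACASXU_run2a_1_1_batch_2000.onnx,vnnlib/prop_1.vnnlib,116
onnx/ACASXU_run2a_1_2_batch_2000.onnx,vnnlib/prop_1.vnnlib,116
onnx/ACASXU_run2a_1_3_batch_2000.onnx,vnnlib/prop_2.vnnlib,113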

I'll also provide one example soon showing how to do image classifiers using DNNV on standard image classification systems.

@naizhengtan

Hi all, I'd like to propose our benchmark: verifying learned indexes of databases. The full proposal is here. I will briefly introduce this benchmark below.

Background: a learned index is a neural network (NN) based database index proposed by this paper (2018). It has great potential but also has one drawback due to NNs---for keys that do not exist in the database, the outputs of a learned index can be arbitrary.

What we do: to tame the NN's uncertainty, we design a specification that dictates how "far" a predicted position can be from its actual position (or from the position where a non-existing key should be).

What to verify: our benchmark provides multiple pairs of (i) learned indexes (trained NNs) and (ii) corresponding specifications. We design these pairs with different parameters such that they cover different user needs and have varied difficulties for verification tools. Please see an example here.

A note on broader impact: using NNs for systems is a broad topic, but many existing works lack strict safety guarantees. We believe that NN Verification can help system developers gain confidence in NNs to apply them to critical systems. We hope our benchmark can be an early step towards this vision.

@pat676

pat676 commented Apr 23, 2021

Hi all, we want to propose the same three benchmarks as in VNN-COMP 20.

In short, the three benchmarks consist of MNIST classifiers with 2, 4, and 6 hidden layers, respectively, and 256 ReLU nodes in each hidden layer. We suggest 25 images with perturbation radii 0.03 and 0.05 and a timeout of 7 minutes. Note that we changed the radius (VNN-COMP 20 used 0.02 and 0.05) since perturbations with radius 0.02 were often very easy to verify.

The networks and vnnlib properties for the first 25 images from the MNIST test-set can be found in https://github.com/pat676/vnn-21-fc-mnist

We also added a script, "generate_properties.py", which can be used to create properties for random images by setting the random parameter to "True" in the if __name__ == '__main__' block at the end of the script.
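
For anyone implementing a similar generator, here is a minimal sketch of the core idea (this is not the actual generate_properties.py; the function name and details are assumptions):

import numpy as np

def write_mnist_robustness_vnnlib(image, label, eps, path):
    """Write an L-inf local-robustness property for one MNIST image.

    image: 784 pixel values in [0, 1]; label: correct class (0-9);
    eps: perturbation radius; path: output .vnnlib file.
    """
    image = np.asarray(image).reshape(-1)
    with open(path, "w") as f:
        # one input variable per pixel, one output variable per class
        for i in range(784):
            f.write(f"(declare-const X_{i} Real)\n")
        for j in range(10):
            f.write(f"(declare-const Y_{j} Real)\n")
        # input box: the eps-ball around the image, clipped to the valid pixel range
        for i, x in enumerate(image):
            f.write(f"(assert (>= X_{i} {max(0.0, x - eps):.8f}))\n")
            f.write(f"(assert (<= X_{i} {min(1.0, x + eps):.8f}))\n")
        # counterexample condition: some other class reaches the true label's score
        f.write("(assert (or\n")
        for j in range(10):
            if j != label:
                f.write(f"  (and (>= Y_{j} Y_{label}))\n")
        f.write("))\n")

Sampling random images then only requires seeding the RNG and indexing into the MNIST test set.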

@changliuliu
Collaborator

> Hi all, I'd like to propose our benchmark: verifying learned indexes of databases. The full proposal is here. I will briefly introduce this benchmark below.
>
> Background: learned index is a neural network (NN) based database index proposed by this paper, 2018. It has great potential but also has one drawback due to NNs---for non-existing keys (in the database), the outputs of a learned index can be arbitrary.
>
> What we do: to tame NN's uncertainty, we design a specification to dictate how "far" one predicted position can be, compared to its actual position (or the positions that non-existing keys should be).
>
> What to verify: our benchmark provides multiple pairs of (i) learned indexes (trained NNs) and (ii) corresponding specifications. We design these pairs with different parameters such that they cover different user needs and have varied difficulties for verification tools. Please see an example here.
>
> A note on broader impact: using NNs for systems is a broad topic, but many existing works lack strict safety guarantees. We believe that NN Verification can help system developers gain confidence in NNs to apply them to critical systems. We hope our benchmark can be an early step towards this vision.

Following up on this benchmark, there are two important features that we would like to seek participants' input on:

  1. In this benchmark, the specifications (.vnnlib) contain IF-ELSE conditions (i.e., a set of conditions stating that if the input belongs to a given range, the output should satisfy a certain constraint). There are so many such conditions that breaking them into separate .vnnlib files would produce far too many files and too much overhead in handling them. We would like to know whether participants are okay with these IF-ELSE conditions in .vnnlib (a rough sketch of how one such condition might be flattened is shown after this list).

  2. The networks used in this benchmark are large sparse networks. The sparsity does not come from convolutions or other predefined structures in onnx, and onnx currently does not support arbitrary sparse matrices, so these networks are stored as dense matrices, which can be as large as 500MB (~15MB after zipping). Does any participant have concerns about handling networks of this size?
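
Regarding point 1, here is a rough, purely illustrative sketch (all constants are made up) of how one IF-THEN condition such as "if the key X_0 lies in [100, 200], the predicted position Y_0 must stay within ±50 of 300" could be flattened into the disjunction-of-conjunctions (counterexample) form of VNN-LIB, with two terms per branch:

(assert (or
  (and (>= X_0 100.0) (<= X_0 200.0) (>= Y_0 350.0))   ; branch 1: prediction too high
  (and (>= X_0 100.0) (<= X_0 200.0) (<= Y_0 250.0))   ; branch 1: prediction too low
  (and (>= X_0 200.0) (<= X_0 300.0) (>= Y_0 480.0))   ; branch 2: prediction too high
  (and (>= X_0 200.0) (<= X_0 300.0) (<= Y_0 380.0))   ; branch 2: prediction too low
))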

@pat676

pat676 commented Apr 29, 2021

Hi all, we also want to suggest some larger Cifar10 convolutional networks. We've created an updated repository with the two Cifar10 nets from last year's competition as well as vnnlib properties for the first 100 images and scripts to create random properties. The repo also contains the mnist-fc networks suggested earlier:

https://github.com/pat676/vnn-comp21-benchmarks

The vnnlib files for the same properties as in last year's competition can be generated by running generate_all_properties.sh and vnnlib files based on random images can be created by running generate_all_properties_random.sh.

We also feel that it is important to have some classifiers with Sigmoid and Tanh activation functions. We'll update our repo with some suggestions tomorrow.

EDIT:
We have updated our repo with one MNIST Sigmoid and one Tanh classifier. The networks were used for benchmarking in https://vas.doc.ic.ac.uk/papers/20/ecai20-HL.pdf and are originally from https://github.com/eth-sri/eran.

Based on the results from https://vas.doc.ic.ac.uk/papers/20/ecai20-HL.pdf, we suggest two epsilons 0.025 and 0.035, and 25 images for each epsilon.

@wu-haoze
Contributor

Hi, we'd like to suggest three cifar10 convolutional networks. The benchmarks are in the following repo:
https://github.com/anwu1219/VNN-COMP21-benchmarks
We also included a script to generate targeted l-inf robustness queries, taking the test image index and the epsilon as inputs.
The architectures of the networks are specified in https://github.com/anwu1219/VNN-COMP21-benchmarks/blob/master/info.txt
The network expects the input image to be (32x32x3) and normalized to [0, 1].

Based on our evaluation, we suggest two epsilons, 0.012 and 0.024, and a 6-minute timeout for each epsilon.

@pat676

pat676 commented May 2, 2021

Thanks, @anwu1219, we have a few questions regarding the benchmarks.

We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network?

Regarding the Dropout layers, how are these implemented in practice? We are concerned that few toolkits natively support dropout, and this may have a negative effect on widening participation.

[attached image: ONNX graph of the large network]

@pat676

pat676 commented May 2, 2021

@changliuliu, @naizhengtan

Thanks, @naizhengtan for the suggestion. While we find the benchmarks interesting, we also agree with @changliuliu that the format indeed seems unfamiliar and may in practice require quite an effort from every participant. To encourage as many toolkits as possible to participate, we suggest that we keep the benchmarks in this competition similar to benchmarks that have been used in previous papers and are widely supported already.

@wu-haoze
Contributor

wu-haoze commented May 2, 2021

> Thanks, @anwu1219, we have a few questions regarding the benchmarks.
>
> We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network?
>
> Regarding the Dropout layers, how are these implemented in practice? We are concerned that few toolkits may natively support dropout and this may have a negative effect on widening participation.
>
> [attached image: ONNX graph of the large network]

Hi @pat676,

Thanks for trying to load the network in your toolkit.
The reason for the transpose node in the network is that the networks were originally stored in tensorflow graphdef format (which uses the NHWC channel order) and were converted to onnx format, whose operators expect NCHW, to conform to the competition format.

I'm not sure that I understand the point about the sequential layer. Attached is a visualization of the architecture of the cifar10_large.onnx network using Netron. I believe the node types in the onnx file are transpose, conv, relu, reshape, matmul and add, which are fairly common operators. Please do let me know if I am missing anything.
[attached image: Netron visualization of cifar10_large.onnx]

Regarding the dropout layer, it was added for certain layers at training time, but it does not play a role at inference time. The onnx files should be agnostic of whether the networks were trained using dropouts or not.

@huanzhang12
Contributor

Apologies for submitting our benchmarks a little late - I hope they are still useful for VNN COMP 2021. Together with @tcwangshiqi-columbia and @KaidiXu, we propose a new set of benchmarks of residual networks (ResNets) on CIFAR-10, which is available here: https://github.com/tcwangshiqi-columbia/cifar10_resnet_benchmark.

Currently, most networks in VNN COMP are feedforward NNs, and many tools are hardcoded to handle feedforward networks only. To make neural network verification more useful in practical scenarios, we advocate that tools should handle more general architectures, and ResNet is the first step towards this goal. We hope this can provide some incentives for the community to develop better tools.

Model details: We provided two small ResNet models on CIFAR-10 with the following structures:

  • ResNet-2B with 2 residual blocks: 5 convolutional layers + 2 linear layers
  • ResNet-4B with 4 residual blocks: 9 convolutional layers + 2 linear layers

Since this is one of the first benchmarks using ResNet, we keep the networks relatively small (compared to the ResNet-50 used in many vision tasks) so hopefully most tools can handle them: the two networks have only 2 and 4 residual blocks, respectively, and they are relatively narrow -- although they are still large compared to most models used in the complete NN verification literature. The networks are trained using adversarial training with L_\infty perturbation epsilon (2/255). We report basic model performance numbers below:

Model      Clean acc. (%)   PGD acc. (%)   CROWN/DeepPoly verified acc. (%)
ResNet-2B  69.25            56.67          26.88
ResNet-4B  67.21            55.25          8.81

Since the models are trained using adversarial training, it also poses a challenge for many verifiers - the CROWN/DeepPoly verified accuracy (as a simple baseline) is much lower than PGD accuracy, and we hope this benchmark can motivate researchers in the community to develop stronger tools that can make this gap smaller.

Data format: The input images should be normalized using the mean and std computed from the CIFAR-10 training set. The perturbation budget is element-wise, eps=2/255 on unnormalized images, clipped to the [0, 1] range. We provide eval.py as a simple PyTorch example of loading data (e.g., data preprocessing, channel order, etc.).
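
As an illustration of this convention (a sketch only; the mean/std constants below are the commonly used CIFAR-10 statistics, and the benchmark's eval.py is the authoritative reference), the per-pixel input bounds fed to a verifier would be computed roughly as:

import numpy as np

# commonly used CIFAR-10 statistics; eval.py in the repository is authoritative
MEAN = np.array([0.4914, 0.4822, 0.4465]).reshape(3, 1, 1)
STD = np.array([0.2471, 0.2435, 0.2616]).reshape(3, 1, 1)

def input_bounds(image_chw, eps=2.0 / 255.0):
    """image_chw: unnormalized CIFAR-10 image, shape (3, 32, 32), values in [0, 1]."""
    # perturb in unnormalized space and clip to the valid pixel range ...
    lower = np.clip(image_chw - eps, 0.0, 1.0)
    upper = np.clip(image_chw + eps, 0.0, 1.0)
    # ... then normalize, since the network consumes normalized inputs
    return (lower - MEAN) / STD, (upper - MEAN) / STD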

Data Selection: We propose to randomly select 100 images from the test set which are classified correctly and cannot be attacked by a 50-step targeted PGD attack with 5 random restarts. For each image, we specify the runner-up label (the class with the second-largest logit) as the target for verification, and verify the property that the logit of the runner-up label is not larger than that of the ground-truth label under an L_\infty perturbation of eps=2/255 on the input. See the instructions below for generating test images with a script; some example properties are in the vnnlib_properties_pgd_filtered folder.

To keep the sum of all timeouts under 6 hours (as suggested by @stanleybak), the suggested timeout is 6 minutes per example (assuming only a fraction of all examples time out).

It is also possible to change the target label for evaluation (e.g., using the target label following the ground truth label, a random target label, or all 9 target labels) - I think it is probably better to align the target label selection mechanism with other benchmarks after they are finalized. We are also willing to split the dataset into easy, medium, hard parts based on results by PGD/CROWN/DeepPoly, if that makes the evaluation easier.

@pat676

pat676 commented May 3, 2021

> Thanks, @anwu1219, we have a few questions regarding the benchmarks.
> We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network?
> Regarding the Dropout layers, how are these implemented in practice? We are concerned that few toolkits may natively support dropout and this may have a negative effect on widening participation.
> [attached image: ONNX graph of the large network]
>
> Hi @pat676,
>
> Thanks for trying to load the network in your toolkit.
> The reason for the transpose node in the network is that the networks were originally stored in tensorflow graphdef format (which uses the channel order NHWC) and were converted to onnx format, whose operators expect NCHW, to conform to the competition format.
>
> I'm not sure that I understand the point about the sequential layer. Attached is a visualization of the architecture of the cifar10_large.onnx network using Netron. I believe the node types in the onnx file are transpose, conv, relu, reshape, matmul and add, which are fairly common operators. Please do let me know if I am missing anything.
> [attached image: Netron visualization of cifar10_large.onnx]
>
> Regarding the dropout layer, it was added for certain layers at training time, but it does not play a role at inference time. The onnx files should be agnostic of whether the networks were trained using dropouts or not.

Thank you, your visualization of the net helped a lot, we’ll check out Netron.

Indeed, the problem wasn’t the sequential layer. We don’t currently support transpose ops, so I tried skipping the initial transpose but hadn’t noticed there is one more before the FC layers. However, adding support for transpose is easy enough, so we’re happy with the architecture of the networks.

@FlorianJaeckle

Sorry for the late response. If it is not too late, we would like to provide an adversarial robustness benchmark with image-dependent perturbation radii (epsilons).

All the adversarial robustness benchmarks posted so far consider only image-independent epsilon values, possibly resulting in some (image, epsilon, network) pairings that are either easily verified by all verification methods, or too hard to be verified (in the usually small timeout) by any of them.

In line with the OVAL verification dataset employed in last year's VNN-COMP, versions of which have already been used in various published works [1, 2, 3, 4, 5], we propose a benchmark consisting of (image, epsilon, network) pairings that have been found via binary search to ensure that all properties are challenging to solve.

We hope this will address some of the points raised by @alessiolomuscio and others in the Rules Discussion Issue.
A brief description of the dataset follows. We will post a link to the properties in VNNLib format in case there is appetite for this benchmark's adoption.

Models:
The 3 CIFAR-10 convolutional networks from last year's OVAL verification dataset.
These consist of: a "base" model with 2 convolutional layers followed by 2 fully connected layers, a "wide" model with more hidden units in each layer, and a "deep" model that features more layers (but a similar number of hidden units per layer as the "base" model). All three models are trained robustly using the method provided by Kolter and Wong [6].

Verification properties:
We propose a dataset of 300 properties for each model: 100 taken from the dataset that was used last year (consisting only of (image, epsilon, network) pairings verified to be robust), 100 more challenging properties (likely to be robust pairings), and 100 pairings that provably contain adversarial vulnerabilities.
In order to avoid overfitting, and to comply with the 6-hour cumulative timeout, we suggest sampling 10 properties for each network at competition time, using a timeout of 720 seconds each. (We note that the timeout is picked somewhat arbitrarily and offers a trade-off between difficulty and how many properties can be sampled within the 6-hour budget.)
The pairing generation procedure is as follows: given a correctly classified image and a network, we randomly sample a misclassified label, then find the epsilon value (in the infinity norm) via binary search with predefined timeouts.
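
A sketch of this binary-search idea follows (illustrative only; the verify callable is a placeholder for whichever verification tool is actually used):

def find_epsilon(verify, network, image, wrong_label, timeout, steps=10, eps_hi=1.0):
    """Binary-search an L-inf radius at which verification becomes challenging.

    verify is a user-supplied callable returning "verified", "falsified" or
    "timeout" for a given (network, image, wrong_label, eps, timeout) query.
    """
    eps_lo = 0.0
    for _ in range(steps):
        eps = 0.5 * (eps_lo + eps_hi)
        result = verify(network, image, wrong_label, eps, timeout)
        if result == "verified":
            eps_lo = eps   # property still holds at this radius; try a larger one
        else:
            eps_hi = eps   # falsified or timed out; shrink the radius
    return eps_lo          # largest radius still verified within the timeout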

@pkouvaros

Thank you @anwu1219 for the proposed benchmark. In our experience, radii of the magnitude proposed are easy to verify for convolutional networks, since under such radii a significant percentage of ReLUs are stable after a bound-propagation pass. We are wondering whether this is also the case here? If so, in the interest of having a more meaningful evaluation of complete verification (where more branching on the states of the ReLUs needs to be done to solve a verification problem), we suggest increasing the radii and timeouts and accordingly reducing the number of images.

Thank you @huanzhang12 for the proposed benchmark. Given the size of the networks and their low verified accuracy with DeepPoly, we suggest increasing the timeout and reducing the number of images so as to increase the chances of the tools resolving a verification query in what appears to be a challenging benchmark. Also, we feel that the residual blocks may be problematic for several tools that do not currently support them, thereby hindering wide participation.

Overall, we now have 10 CIFAR10 CNNs proposed. As we believe using all of these would be overkill, we suggest selecting only three networks: one small, one medium and one large. Also, to promote participation, we suggest giving emphasis to CNNs with simple architectures. Regarding the properties, to avoid overfitting of the methods, we believe it is important that the random inputs are drawn from the whole test set and that the selection of the verification properties is not based on the efficiency of a single tool in verifying them. Finally (if we limit the benchmarks to three CIFAR10 networks), we feel that the properties should prove robustness against all other classes, instead of a single class.

@Neelanjana314
Contributor

Neelanjana314 commented May 8, 2021

Sorry to join the party late. If permissible, I (on behalf of my lab: VeriVITAL) would like to propose the following benchmark: https://github.com/Neelanjana314/vnn-comp21-benchmarks

Two MNIST classifiers, one with a maxpooling layer and the other with an average pooling layer.

For both the networks we suggest 25 images (provided in the data folder: https://github.com/Neelanjana314/vnn-comp21-benchmarks/tree/main/Data);

  • for the average pooling network, consider perturbation radii of 0.02 and 0.04 and a timeout of 5 minutes.

  • for the max pooling network, consider a perturbation radius of 0.004 and a timeout of 7 minutes.

The networks expect the input image to be (28x28x1) and normalized to [0, 1].

For any queries, please feel free to contact us.

@huanzhang12
Contributor

@pkouvaros Thank you for your comments on our benchmark. Yes, our benchmark is indeed more challenging than most other proposals, but better tools are also being developed throughout the year - in fact, state-of-the-art verifiers such as Fast-and-Complete, Active Set, Lagrangian decomposition and Beta-CROWN can finish the VNN COMP 2020 benchmark on CIFAR-10 pretty quickly, sometimes in just a few seconds per instance. If we only have small and simple networks in the benchmark suite, then it is hard to distinguish the performance differences among these tools, as they all finish in seconds. Thus, we believe introducing a more challenging and future-proof benchmark is a good idea. For this year, if our benchmark is selected, we can certainly reduce the number of test images (e.g., from 100 down to 50 or 20 images) and increase the timeout threshold to make it easier.

Also, we agree with you that ResNet can be a challenge for many tools. One purpose of our benchmark is to give tool developers an incentive to handle general network structures instead of restricting to mostly toy problems. We hope the incentive can help the VNN community to grow and make broader impacts outside of the community. I do agree that, to promote participation, simple networks should be included in the benchmark suite, but ideally we should also have a track for challenging networks which pushes the boundary and inspires researchers to invent more efficient and powerful verifiers.

Besides, we agree with you that it is important to use random inputs from the whole test-set rather than selecting properties based on the efficiency of a single tool, which may create bias. So far it seems the benchmark proposals have different ways to create test examples, but I think it is probably better to unify the data selection process across benchmarks.

@stanleybak
Owner Author

stanleybak commented May 9, 2021

My opinion is that good benchmarks should have a mix of easy and difficult instances so that everyone can get some result, while still having hard instances so that new improvements can be evaluated.

In terms of benchmark generation, the goal is to have the specific images chosen randomly whenever possible, to ensure the methods are generalizable and not only applicable to a hand-selected set of benchmark images or categories. This can be done in a number of ways. Some people have Python scripts that generate specific vnnlib instances from a random seed passed as a command-line argument.

Here's another way to do it using DNNV (from David Shriver): localrobustness.dnnp

For this file you can use dnnv to create a series of vnnlib files to check for local robustness:

pip install dnnv
dnnv localrobustness.dnnp --network N cifar10_2_255.onnx --prop.x_i=3 --convert --convert.to=vnnlib --convert.dest=/tmp/outputdir/

This uses i=3 (image #3 from the CIFAR dataset) to create 9 local robustness vnnlib specification files and onnx networks (one for each potential misclassification category), based on these networks from last year.

Also: we've extended the benchmark submission phase a few weeks. The plan is to let everyone do some initial evaluation on the benchmarks and then we'll do a large group meeting with everyone to discuss the benchmarks near the end of the month or early next month.

@pat676

pat676 commented May 9, 2021

I agree with all of @pkouvaros comments, as well as @stanleybak on generating properties from random images. However, it is my opinion that we should prioritise a single vnnlib property for checking robustness against all other classes, not 9 properties where each property is for a single adversarial class. This seems more in line with most practical use-cases, where we want to determine robustness against all other classes, and toolkits optimised for checking against multiple classes at once should be rewarded in this competition.

@huanzhang12 we have not seen any indication that SoTA toolkits can solve all cases for the CIFAR10 GGN networks from VNN-COMP'20 within the timeout; on the contrary, it seems like most public SoTA methods have a good balance of difficulty on these benchmarks. I'm not against more challenging benchmarks in general, but I agree with @pkouvaros that the complexity of the architectures should be kept to a minimum. I don't think it is realistic to expect developers to significantly extend the supported architectures in the relatively limited timeframe we have now; if incentivising developers to extend the supported architectures is a goal of VNN-COMP, then this and the architectures under consideration should be made clear at an earlier stage.

@Neelanjana314, related to the comment above, several toolkits do not seem to support pooling layers; if we want to have pooling, I would at least suggest limiting it to average pooling, which is theoretically simpler than max pooling.

@wu-haoze
Contributor

@pkouvaros Thanks for the comments on our proposed benchmarks. Experimentally, we found the proposed radii yield benchmarks with mixed levels of difficulty. Indeed, some can be solved fairly quickly without the need for branching, but some do require it.

@stanleybak
Owner Author

> However, it is my opinion that we should prioritise a single vnnlib-property for checking robustness against all other classes, not 9 properties where each property is for a single adversarial class.

This is fine for me, but this will mean the VNN-LIB properties will have disjunctions. Do people want to add support for that?

@mnmueller

@stanleybak I agree with @pat676's comments that proving robustness against all other classes seems to be a practically more relevant use-case and allows tools that can reuse intermediate results to show this strength.
Further, I think this should only lead to conjunctions of the form \wedge_{i \neq j} (y_j > y_i), which could be translated to a set of n-1 properties without either conjunctions or disjunctions for tools that do not support conjunctions.

Would the other participants be open for benchmarks using different, non-piecewise linear activation functions such as tanh or sigmoid?

I agree with @huanzhang12 that it will be valuable to also have slightly more complex network architectures such as ResNets in the benchmark mix as they have great practical importance and will provide a great opportunity for the comparison of faster (incomplete) methods.

@stanleybak
Owner Author

stanleybak commented May 11, 2021

> I think this should only lead to conjunctions of the form \wedge_{i \neq j} (y_j > y_i)

This property would be SAT if we can find any input where the classification comes out as j, which is not quite what we want. The violation occurs if any class other than j is greatest, which is why (I think) we need to either support disjunction directly, or break it up into 9 specifications.
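
For example (an illustrative sketch for a 3-class network with correct label j = 0), the counterexample-style encoding would be a disjunction over the other logits:

; violation: some class other than the correct class 0 scores at least as high
(assert (or
  (and (>= Y_1 Y_0))
  (and (>= Y_2 Y_0))
))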

@mnmueller

@stanleybak I see - this does indeed depend on whether we view the specification as the property to be verified (prove it holds for all permissible inputs) or as a description of a counter-example (prove there is no satisfying example).
My intent was in line with the former.
After checking the vnnlib standard again, I realized that it is in line with your interpretation.

However, looking at the .vnnlib specifications generated by some of the other participants (e.g., @pat676 (https://github.com/pat676/vnn-comp21-benchmarks/blob/main/cifar-conv_2_255/vnnlib_properties/prop_0_eps_0.008.vnnlib) and @huanzhang12 (https://github.com/tcwangshiqi-columbia/cifar10_resnet_benchmark/blob/main/vnnlib_properties/prop_0_eps_0.008.vnnlib)) I think I was not alone in this (mis-)understanding. Therefore, I think it might be important that we clarify this.

Regardless of which notation we choose, I think we should allow for specifications of this type.

@pat676

pat676 commented May 11, 2021

@stanleybak @mnmueller thanks, I had indeed misinterpreted the vnnlib format. I agree that since the vnnlib format encodes the requirements of a counterexample (as opposed to the requirements of safety/robustness) we have to support disjunctions in the vnnlib format to encode this correctly. However, I don't think this should be problematic, especially if we stick to the special cases of maximal (and possibly minimal?) outputs.

From a toolkit perspective, it is my experience that most (all?) toolkits already support such disjunctions for classification problems, especially since most benchmarks in last year's vnn-comp used exactly this specification. If any toolkits do not support it, they can always solve each class individually, as @mnmueller commented.

@huanzhang12
Contributor

@stanleybak @mnmueller Apologies for misinterpreting the vnnlib format. I agree with @mnmueller and @pat676 that proving robustness against all other classes is more relevant to practical problems. To avoid the potential issues with disjunctions and make properties easier/faster to handle, an approximation is to just use the runner-up class (the class with the second-largest logit and smallest margin to the top-1 class) as the target label, which is often the worst case. It probably makes more sense than selecting a random target class. Actually, I am okay with any target-label selection scheme (random, runner-up or all targets), but I think it would be better to make it consistent across benchmarks if possible.

@stanleybak
Owner Author

We're setting up a meeting the week of May 31st to discuss the benchmarks and other aspects of the competition this year. If you haven't gotten our e-mail on the mailing list, please let me or one of the organizers know.

@stanleybak
Owner Author

stanleybak commented May 23, 2021

@huanzhang12 Although just using the runner-up class or another single class would be easier to encode, there could be strategies that work better if they can directly deal with a disjunction. We could directly present the disjunction, and then have the option of converting it to 9 independent queries if the tool doesn't support disjunctions. This might be slightly more work for the organizers, though.

The best option is if everyone adds support for disjunctions, although I don't know if this is realistic. We can discuss it during the meeting. I think @changliuliu's benchmark could be converted from a bunch of if-then-else statements to disjunctions, if we allow for constraints over both the input and output variables.

@mnmueller

I would like to propose the following benchmark:

  1. A ReLU FFN with 8 hidden layers of 200 neurons each for MNIST, that was normally trained with eps=0.015. 36 samples and a 5min timeout per sample, with the property being robustness against all other classes.
  2. A Sigmoid FFN with 6 hidden layers of 200 neurons each for MNIST, that was normally trained with eps=0.012. 36 samples and a 5min timeout per sample with the property being robustness against all other classes.

Both networks and corresponding specs are provided here: https://github.com/mnmueller/vnncomp2021_eran_benchmarks
Also included is a script to generate new (potentially random) specs and one to load and evaluate the networks.

@stanleybak it looks like the meeting time around Tuesday evening (CET) has the highest availability. Do we want to fix a time?

@stanleybak
Owner Author

@mnmueller great thanks. The meeting time is Tuesday at 12pm (noon) to 1pm Eastern US time. The zoom link should have been sent to the mailing list.

@REST-XD

REST-XD commented May 31, 2021

If permitted, we would like to propose a benchmark of properties related to data augmentation.

Counterexamples to adversarial robustness are essentially inputs augmented with random noise on the original data. We could design properties whose counterexamples are:

  1. inputs augmented with random noise on flipped images (horizontal flip),
  2. inputs augmented with random noise on the padding area of a zoomed image (fixed zoom-in), mainly for MNIST images, which have a constant-value background,
  3. inputs augmented with random noise on the original data, where the element-wise noise lies in [delta_min, delta_max] instead of [-delta, delta] (brightness).

Those augmentations can be easily expressed. In our experiments, verifying horizontal-flip robustness took a similar amount of time to verifying adversarial robustness.
We feel that these properties are worth verifying, especially brightness.
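
For instance (an illustrative sketch with made-up numbers), the brightness property only changes the per-pixel input bounds, shifting each interval by [delta_min, delta_max] instead of centering it at the original value:

; original pixel value 0.62, delta_min = 0.0, delta_max = 0.05 (pure brightening)
(assert (>= X_0 0.62))
(assert (<= X_0 0.67))
; the output constraints are the same as for ordinary adversarial robustness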

@stanleybak may I join the discussion? I am very lucky to have seen this thread before the discussion started.

@stanleybak
Owner Author

@REST-XD looks interesting. Do you have specific networks for this in .onnx format and can you provide example .vnnlib files? You can see some of the earlier discussion for examples. We're planning to have a discussion about this on Tuesday. Send me a message if you're not on the mailing list and I can get you the meeting info.

@huanzhang12
Contributor

Thank you @mnmueller! BTW, it seems the two CIFAR models are different from last year's GGC-CNN models that @pat676 proposed (the checksums are not the same as those of the models @pat676 posted here). Is that expected? I thought we wanted the same set of models to have a rough comparison to last year.

@mnmueller

@huanzhang12 Only cifar10_2_255 was taken from last year; the other model is slightly larger (62k instead of 49k neurons, with an extra conv layer: 4 instead of 3).
For the cifar10_2_255.onnx, I directly copied the one from the benchmark directory from last years competition and also get the same checksum, so I am not sure where the discrepancy to @pat676 comes from.

I will split the benchmarks into two folders and adapt the script for the cifar spec to consider both networks from last year and the slightly larger one that I would suggest to add. If we decide to just use a subset adapting the script will be as easy as deleting the corresponding line.

@pat676

pat676 commented Jun 9, 2021

@huanzhang12, @mnmueller, I believe there was a complex flatten operation (a combination of shape, gather, reshape, concat, ...) in last year's networks that I changed to a flatten op in my version of the networks.

@mnmueller

@huanzhang12 @pat676 @stanleybak If this is fine with everybody, I will update the networks to @pat676's specification.

@pat676

pat676 commented Jun 10, 2021

I have a question regarding exactly how "prepare_instance.sh" and "run_instance.sh" should be used.

In particular, is it allowed to use "prepare_instance.sh" to start a process, load the network and vnnlib property into memory, start child-procs, etc. and then keep the process alive until "run_instance.sh" is called, at which point the analysis is started with the same process? Or is "prepare_instance.sh" supposed to start a new process?

In my experience, loading the network and starting child-procs may take longer than the analysis in a lot of "easy" instances, thus the difference may be significant.

@stanleybak
Owner Author

It's not intended that prepare_instance.sh keeps anything running or in memory. Instead, it's more like converting the .onnx network to another framework, or converting the .vnnlib file into whatever tool input format is needed, or killing off any zombie processes from previous runs that errored or crashed.

The intention of only measuring run_instance.sh is that this is the runtime a user would expect when running your tool, assuming everything was in the expected format. I guess you raise an interesting point, that maybe tools could be run in a batch mode where they don't need to keep starting from scratch. I'll note this down for consideration... hopefully the impact isn't too large.

@Neelanjana314
Contributor

Neelanjana314 commented Jun 10, 2021

Hi,

I am facing an issue while converting .onnx networks to a Keras model. It is caused by the same 'transpose' layers mentioned by @pat676 in a previous comment. Translating a Keras network to an ONNX one works, but the reverse process is giving me trouble. I tried checking online but had no luck.
Do any of you know how to get rid of these 'transpose' layers or, even if they are there, how I can successfully convert an ONNX network to a Keras one?

Thanks
Neelanjana

@stanleybak
Owner Author

Hmm, good question. Are people converting the onnx files to other frameworks (which frameworks)? Maybe it's easier to convert to something else first and then to Keras?

@dlshriver

Which network is causing the issue? You might be able to use DNNV to remove transpose operations. DNNV will remove a transpose operation if it comes before an optional Flatten or flattening Reshape operation followed by a Gemm or MatMul operation. This covers many cases. If you use the --convert option to convert to vnnlib, you can take any of the onnx models that are output; they should all be equivalent. Here is an example of running DNNV to convert a network:

$ pip install dnnv
$ dnnv property.dnnp --network N cifar10_2_255.onnx --convert --convert.to=vnnlib --convert.dest=/path/to/output/

or

$ pip install dnnv
$ dnnv property.vnnlib --vnnlib --network N cifar10_2_255.onnx --convert --convert.to=vnnlib --convert.dest=/path/to/output/

If you just want to convert a network, I would suggest using a random property from the benchmark, or you can use the following property (in DNNP) which should be true for any network:

from dnnv.properties import *

N = Network("N")

Forall(x, Implies(N(x) > 1, N(x) < 2 * N(x)))

With the above property, the simplified network should be saved as /path/to/output/property0.onnx.

@pat676

pat676 commented Jun 11, 2021

@stanleybak, thanks. I agree this is the best solution this year and that a batch solution may be a good idea for next year.

@Neelanjana314, I'm using PyTorch, but I've noticed that it differs between networks whether the weight matrices for FC layers are stored in TensorFlow or PyTorch format. For PyTorch it seems like the solution is simply to transpose the weight matrix in most cases (except the one network that had transpose layers that were more complicated).

I can confirm that dnnv works great for simplifying several of the networks. However, some networks are not supported in dnnv due to e.g. residual connections and unsupported layers such as sub/div.

@Neelanjana314
Contributor

@dlshriver, @pat676, thanks for the suggestions. I was trying to just convert an ONNX model to a Keras one. However, I will try the dnnv tool to check if I can work around the issue. I will also check whether using PyTorch can help. It's actually the network I proposed: it was trained in Keras and can be converted to ONNX easily from both the MATLAB and Keras models, but converting back to Keras seems difficult.

@dlshriver

@pat676 Thanks for the feedback. I just updated dnnv to hopefully handle the networks that you were having problems with. I've checked it on the networks in the vnncomp2021/benchmarks directory and it now seems to convert them all without errors. If there are any other networks that do not seem to work, feel free to let me know and I will try to get them working as soon as possible.

@huanzhang12
Contributor

> It's not intended that prepare_instance.sh keeps anything running or in memory. Instead, it's more like converting the .onnx network to another framework, or converting the .vnnlib file into whatever tool input format is needed, or killing off any zombie processes from previous runs that errored or crashed.
>
> The intention of only measuring run_instance.sh is that this is the runtime a user would expect when running your tool, assuming everything was in the expected format. I guess you raise an interesting point, that maybe tools could be run in a batch mode where they don't need to keep starting from scratch. I'll note this down for consideration... hopefully the impact isn't too large.

I am testing some models and have similar observations to @pat676: some easy instances can take a few seconds to load but can be verified almost instantly. In particular, importing PyTorch/TensorFlow in Python and initializing the GPU (for GPU tools) takes a few seconds, while the verification procedure itself only takes around 0.1 seconds. So for easy instances we are just measuring the importing/initialization overhead. It seems that in many papers people assume batch mode and do not count library-import/model-loading time, or amortize this part across all examples.

Is it allowed to leave a Python process in the background with the model and libraries preloaded via prepare_instance.sh (but which does not precompute anything verification-related), where the real verification procedure only kicks in when run_instance.sh is started? For example, run_instance.sh could signal the preloaded process to start the real verification procedure. Or is it expected to always start a fresh process in run_instance.sh and disallow the creation of pre-initialized processes in prepare_instance.sh? This has a huge impact on simple instances, so perhaps we need to make an explicit rule on it - this year we probably just need to say whether it is allowed or disallowed. Next year, we can consider running tools in batch mode instead. @stanleybak what do you think?

@stanleybak
Owner Author

Hmm... do other people have thoughts on this? I feel like this is more of a way to get a better score than something technically important... and it could lead to more complexity that messes up error handling or something else with the competition scripts.

Maybe if an instance is verified in under 3 seconds, then all tools under 3 seconds should get a first-place score? This would be a change in the rules, though, so I think we would need to see most of the participants reply before we do something like this.

@mnmueller

I agree that using batch mode is a more realistic use case. However, I think it was clearly communicated that no processes may be left running after the prepare_instance.sh. Therefore, I would suggest we stick with this decision for this year's version of the competition and switch to a batch mode next year.

I would be fine with the proposal of counting max(3s, measured time) as a workaround for this year.

@pat676

pat676 commented Jun 14, 2021

I agree with @mnmueller.

@huanzhang12
Contributor

I also agree with @mnmueller and @pat676 that we should make it clear that no background processes shall be used. However it is still unclear to me how to measure time in a fair and scientific manner. On the AWS p3.2xlarge machine I tried the following:

time python -c 'import torch; torch.cuda.FloatTensor(1)'

It takes 5.3s just to import PyTorch and initialize the GPU, not even counting the model loading time. This heavily penalizes PyTorch-based tools on easy instances (e.g., in last year's report, some ACASXu instances could be solved in 0.01 seconds), although in practical scenarios the startup time is not an issue when running the tool in batch or daemon mode. Tools based on other frameworks that may have a slow startup time (e.g., Matlab, Julia) are also affected.

I feel the 3-second threshold cannot really solve this issue - in the above example the initialization already takes over 5 seconds. Even if we increase the threshold, it still incorrectly penalizes some tools (e.g., if a 6s threshold is used, a PyTorch-based tool only has 0.7s of effective time).

Since we won't be able to run in batch mode this year, I personally think the most scientific way is to properly measure the initialization time of each tool and subtract that time in the evaluation. This requires a little bit of effort but seems doable - each tool just needs to provide an option like --dry-run that returns directly after initialization and model loading, without doing any verification. What do you think?

@mnmueller

Since one benchmark can encompass multiple networks this year, I was wondering how to interpret the rule that we may use different settings per benchmark, but not per instance. Are we allowed to use different settings depending on the network to be analysed for a specific instance?

@stanleybak
Owner Author

stanleybak commented Jun 17, 2021

In discussing with the organizers, here's what we came up with:

@huanzhang12 In terms of overhead, we're going to prepare a trivially unsat example (1-2 inputs and outputs and one small hidden layer), and then run everyone's tool on that to measure overhead. This overhead will then be reported and subtracted from the runtimes on the other instances, to hopefully provide an overhead-free verification time. No process should be left running after prepare_instance.sh runs. We may also consider all small runtimes (like < 0.5 seconds or < 1 second) to be the same for scoring, just due to noise in the measurement process.

@mnmueller in terms of tuning, it's okay to have settings that adjust based on general features of the network (for example, for lots of inputs and lots of layers use one set of settings, otherwise use a different set of settings). What we do not want to see are fine-tuned strategies for targeting a specific network or instance (for example, we don't want to see if the hash of the .onnx file is something, use settings A, otherwise use settings B). The motivation for this is that from a user's perspective, general tuning strategies are great and helpful, but fine-tuned network or instance-specific strategies add no value.

@huanzhang12
Contributor

@stanleybak Thank you for considering the overhead problem! I like the solution of using a trivially unsat example to measure the overhead. It totally makes sense. And considering small runtimes like 1 second to be the same also seems fair.

@castrong

Hi all - I realize it is likely too late to suggest a new benchmark, but I wanted to share a set of benchmarks I've put together from a recent project in case it is useful either this year or for thinking about next year. The queries are all on a low-dimensional (4 input) network, which consists of a conditional GAN concatenated with an image-based control network for a taxiing aircraft. It is a feedforward ReLU network with dimensions 4 --> 256 --> 256 --> 256 --> 256 --> 16 --> 8 --> 8 --> 2. The GAN represents a model of the possible images we expect our controller to encounter, and we would like to guarantee performance over this set of images. The queries represent questions about the possible control effort in different regions, and I generated queries with varying input sizes and in more and less difficult regions of the state space to have a variety of difficulties. We could also use GAN and control networks of varying size, and increase the input dimension by adding more latent dimensions to the GAN to increase or decrease the difficulty. For more detailed justification of this network feel free to see this paper.

This is a github repo with the network, queries, and script used to generate the queries.

It seems like many of the benchmarks currently in the repo / thread are on robustness of image networks to adversarial perturbations, so I thought it would be worth bringing up another low-input-dimension control example (in the same vein as ACAS Xu), as I think these low-dimensional scenarios are relevant to many control problems, are worth working towards solving efficiently, and may be underrepresented in the current set of benchmarks. I'd appreciate any thoughts anyone has on this!

@stanleybak
Owner Author

@castrong yes, it's probably too late to add a new benchmark for the competition, but if you get it in the correct format I think we can add it as an unscored benchmark, in case others plan to use some of the benchmarks for evaluation in their papers. If you prepare the benchmark using .onnx and .vnnlib specification files and an instances.csv with timeouts, you can issue a pull request.

@huanzhang12
Contributor

@stanleybak regarding the launching overhead test, are we going to use test_small.onnx with test_small.vnnlib and test_tiny.onnx with test_tiny.vnnlib? I am a little bit confused because test_instances.csv in that folder actually uses the ACASXu networks, not the small/tiny onnx. I don't think we want to use ACASXu to test overhead?

@stanleybak
Owner Author

Yes, I'll update the test/instances.csv eventually. Both networks should have similar runtimes... I think one has one input and one output and the other has two inputs and two outputs. I may do multiple measurements during the runs and take the minimum as the overhead. Probably the tiny one, as that should be slightly faster. Are you getting different runtimes for these?

@huanzhang12
Contributor

huanzhang12 commented Jun 29, 2021

@stanleybak No, for these two networks (test_small.onnx and test_tiny.onnx) the runtimes are roughly the same for us. The csv file just needs to be updated. And taking multiple measurements sounds like a good plan!

@mnmueller

@stanleybak for the cifar2020 benchmarks, I reduced the timeout to 100s to fit within this year's total timeout of 6h for the set, but this is considerably shorter than last year's 300s. If we want to use them for comparison, should I update this to the same timeout as last year? This would require an explicit exception to the rejection of long benchmarks.

@huanzhang12
Contributor

I personally feel the 100s timeout is fine (we cannot make direct comparisons anyways due to different hardware) and it's probably too late to change a benchmark, but it all depends on the organizers (increasing the timeout will certainly increase their cost for evaluation).
