List of next steps #65

Closed
tylergu opened this issue Apr 7, 2022 · 1 comment
Labels
Discussion, documentation

Comments


tylergu commented Apr 7, 2022

We discussed several interesting next steps to pursue at this stage:

Input space pruning 1 - Prune the fields that are simply copied over to a template

Cluster management systems like Kubernetes provide generic functionality for managing applications, for example Affinity and PersistentVolume. Operators enable users to manage their applications with a single application-specific input (the CR), and within this input they still allow users to specify the generic functionality provided by Kubernetes. In the operator's logic, these generic fields are then simply handed over to Kubernetes. If we can identify such fields in the operator's input, we can prune their subfields.

For example, in the rabbitmq-operator's code, spec.Affinity is simply copied into the podTemplateSpec when the operator creates the statefulSet. In this case, we can prune all the children of the spec.Affinity field in rabbitmq's CR, as sketched below.
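A minimal sketch of this pruning, assuming a JSON-schema-like representation of the CR and a hypothetical list of copied-over paths (neither is Acto's actual API):

```python
# Paths known (e.g. from manual inspection or static analysis) to be copied
# verbatim into a Kubernetes template; the entries here are hypothetical examples.
COPIED_OVER_PATHS = [
    ("spec", "affinity"),
    ("spec", "tolerations"),
]

def prune_copied_fields(schema: dict, path: tuple = ()) -> dict:
    """Drop the subfields of any field that is copied over unchanged."""
    if path in COPIED_OVER_PATHS:
        # Keep the field itself (so "present vs. absent" is still exercised),
        # but do not descend into its Kubernetes-generic children.
        return {k: v for k, v in schema.items() if k != "properties"}
    props = schema.get("properties", {})
    if not props:
        return schema
    return {**schema,
            "properties": {k: prune_copied_fields(v, path + (k,)) for k, v in props.items()}}
```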

Input space pruning 2 - Prune the fields that are too expensive to test

We observe that rabbitmq-operator's input has 1323 fields in total. Of these, 1109 fields are under spec.override.statefulSet.spec, because this field contains the complete schema of the statefulSet.spec resource. But spec.override.statefulSet.spec is only used as a patch to perform a strategic merge patch on the existing statefulSet, as shown in the code below, where podSpecOverride corresponds to the spec.override.statefulSet.spec field:

```go
patch, err := json.Marshal(podSpecOverride)
patchedJSON, err := strategicpatch.StrategicMergePatch(originalPodSpec, patch, corev1.PodSpec{})
```

It is too expensive for us to spend 99% of the test cases on this single functionality. We can use program analysis to identify the fields that the operator directly accesses, estimate the cost of testing each field, and prune the fields that are too expensive to test.

Cass-operator also has a field, spec.podSpecTemplate, which embeds the entire schema of the statefulSet's podTemplate; this spec.podSpecTemplate alone has ~1000 fields.
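A minimal sketch of such cost-based pruning, assuming a field-count budget and a flag from program analysis indicating whether the operator reads the subtree directly (the budget value and helper names are assumptions, not Acto's actual implementation):

```python
def count_fields(schema: dict) -> int:
    """Count the leaf fields reachable from a schema node."""
    props = schema.get("properties", {})
    if not props:
        return 1
    return sum(count_fields(child) for child in props.values())

def should_prune(schema: dict, directly_accessed: bool, budget: int = 100) -> bool:
    # Keep fields that the operator's code reads directly; for fields that are
    # only passed through (e.g. as a strategic-merge-patch body), prune subtrees
    # whose testing cost would exceed the budget.
    return (not directly_accessed) and count_fields(schema) > budget
```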

Reduce false alarms

Running several test cases

Currently each test case changes only one field. If we can exercise several test cases at the same time, i.e., change multiple fields in one CR submission, we can significantly reduce the testing time; see the sketch after the list below.

There are two potential challenges:
1. There are dependencies among the fields
2. Changing multiple fields at a time could complicate the oracle
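A minimal sketch of batching independent field mutations, assuming each mutation object exposes the field it targets and a dependency predicate is available from analysis (both are hypothetical, not Acto's actual API):

```python
def batch_mutations(mutations, depends_on, batch_size=3):
    """Greedily group mutations whose target fields do not depend on each other."""
    batches, current = [], []
    for m in mutations:
        conflicts = any(depends_on(m.field, other.field) or depends_on(other.field, m.field)
                        for other in current)
        if conflicts or len(current) >= batch_size:
            batches.append(current)
            current = []
        current.append(m)
    if current:
        batches.append(current)
    return batches
```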

Run the rabbitmq-operator/cass-operator systematically

We need to run rabbitmq-operator and cass-operator with the new input generation. We can first run them with a manually pruned input space; the results will show us how many false alarms we have.

Test plan generation

  1. Generating good/bad values: As discussed two weeks ago, an operator's input goes through two levels of checks.
    1. The first level is server-side validation, which uses the schema and the validation webhook. If the input cannot pass this level, it is rejected directly without reaching the operator code. Acto can recognize when an input is rejected by the server side, and tries its best to generate inputs that pass this first-level check.
    2. The second level is in the operator's logic. When the operator receives the input, it performs some sanity checks. The challenge here is that Acto cannot tell whether the input fails this second-level check, which causes some false alarms.
  2. Back and forth testing: The idea is that, given the declarative nature of operators, submitting the same input to the operator should produce exactly the same deployed application. During testing, we can revert to an input we submitted before and compare the two applications produced by the operator; if they differ, we have found a bug. A sketch follows this list.
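A minimal sketch of back and forth testing, assuming hypothetical helpers for applying a CR and snapshotting the relevant cluster state (not Acto's actual oracle):

```python
def back_and_forth_test(apply_cr, snapshot_state, cr_a, cr_b):
    apply_cr(cr_a)
    state_first = snapshot_state()   # e.g. the StatefulSet/Service specs the operator manages
    apply_cr(cr_b)
    apply_cr(cr_a)                   # revert to the earlier input
    state_second = snapshot_state()
    # A declarative operator should converge to the same state for the same input.
    assert state_first == state_second, "state diverged after reverting to the same CR"
```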

Testing delete, and creating 2nd application

Currently we only test different inputs in a single CR. It is also possible to test deleting the CR and recreating it. We can also test inputs across two CRs; in rabbitmq's case, for example, we would create two RabbitMQ clusters.

Parallelize Acto.

Acto is based on Kind clusters, which use Docker containers to virtualize clusters, so it is possible to run multiple Kubernetes clusters with different test cases on the same machine. I noticed that not all cores are used efficiently while running Acto, so it might be beneficial to explore running Acto in a multi-cluster setting, as sketched below.
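A minimal sketch of that multi-cluster setting, using kind's --name flag to create one cluster per worker; the cluster naming scheme and the run_test_case helper are assumptions for illustration:

```python
import subprocess
from multiprocessing import Pool

def run_worker(args):
    worker_id, test_cases = args
    cluster = f"acto-{worker_id}"                       # hypothetical naming scheme
    subprocess.run(["kind", "create", "cluster", "--name", cluster], check=True)
    try:
        for tc in test_cases:
            run_test_case(tc, cluster)                  # hypothetical: point Acto at this cluster
    finally:
        subprocess.run(["kind", "delete", "cluster", "--name", cluster], check=True)

def run_parallel(test_cases, num_workers=4):
    # Split the test cases across workers, each owning its own Kind cluster.
    slices = [test_cases[i::num_workers] for i in range(num_workers)]
    with Pool(num_workers) as pool:
        pool.map(run_worker, list(enumerate(slices)))
```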

Automate the CRD generation into the pipeline.

Some operators do not provide a complete CRD that fully reflects the input structure defined in their API types. In such cases we can use kubebuilder's tooling to generate the CRD automatically, and we need to incorporate this option into Acto's pipeline.
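A minimal sketch of hooking this into the pipeline by shelling out to controller-gen (the CRD generator used by kubebuilder); the function name and output directory here are assumptions:

```python
import subprocess

def generate_crd(operator_repo: str, out_dir: str = "config/crd/bases") -> str:
    """Regenerate the CRD from the operator's API types before running Acto."""
    subprocess.run(
        ["controller-gen", "crd", "paths=./...", f"output:crd:artifacts:config={out_dir}"],
        cwd=operator_repo,
        check=True,
    )
    return out_dir
```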


tianyin commented Jun 28, 2022

See #131
