Meeting summary 03/24 #48

tylergu · 2022-03-24T20:13:30Z

In this meeting, we went through the problems and progress during the past two weeks:

Stabilizing Acto's testing - being able to run for a long time

Goal: Run Acto continuously for an overnight run or a week-long run.
Problem: Acto pulls both the operator image and the application image too frequently, which causes ImagePullBackOff.
Solution:
- For the operator image, we can preload the image into the Kind cluster and set the pull policy to IfNotPresent.
- For the application image, the workaround is to provide a command-line option to preload some frequently used images.
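
The preloading step above can be sketched as follows. This is a minimal illustration, not Acto's actual code: the function names and the `dry_run` flag are hypothetical, and it assumes the images are already present in the local Docker daemon (for example after a `docker pull`), because `kind load docker-image` copies local images into the cluster nodes.

```python
import subprocess
from typing import List


def kind_load_command(image: str, cluster_name: str) -> List[str]:
    """Build the `kind load docker-image` command for one local image."""
    return ["kind", "load", "docker-image", image, "--name", cluster_name]


def preload_images(images: List[str], cluster_name: str,
                   dry_run: bool = False) -> List[List[str]]:
    """Load each image into the Kind cluster once, before the tests start.

    With the pull policy set to IfNotPresent, pods can then start from the
    preloaded copy instead of pulling from the registry on every test.
    Returns the commands so that callers can log or inspect them.
    """
    commands = [kind_load_command(image, cluster_name) for image in images]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if kind reports a failure.
            subprocess.run(command, check=True)
    return commands
```

Calling `preload_images(images, cluster_name, dry_run=True)` only builds the commands, which is convenient for checking the list of frequently used images before a long run.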

Input space

Rabbitmq-operator has 1323 fields in its CR.

If we assume no dependencies among the fields, allow only one field to change at a time, and test three different values for each field, we would need to run 1323 × 3 = 3969 tests.

The exploration strategy Acto currently uses is a random walk: at each step, Acto selects a random field and then a random value for that field. As a result, we run a lot of redundant tests.

Comments: We have a huge input space to explore, so it would be interesting to see how it can be reduced.
We could also explore more systematically. Instead of a stateless random walk, Acto can remember which fields have already been explored and bias selection towards the fields that have not been explored yet.
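
One way to picture the stateful exploration suggested above is the sketch below. It is only an illustration under assumptions: the class name and its interface are hypothetical and not part of Acto, and the bias chosen here is the strongest possible one (always pick among the least-visited fields).

```python
import random
from collections import Counter
from typing import List, Optional


class BiasedFieldSelector:
    """Pick the next field to mutate, preferring fields visited least often."""

    def __init__(self, fields: List[str], seed: Optional[int] = None):
        self.fields = list(fields)
        self.visits = Counter()          # field path -> number of times picked
        self.rng = random.Random(seed)   # private RNG so runs are reproducible

    def next_field(self) -> str:
        # Restrict the random choice to the fields with the lowest visit count,
        # so every field is tried once before any field is tried twice.
        fewest = min(self.visits[f] for f in self.fields)
        candidates = [f for f in self.fields if self.visits[f] == fewest]
        field = self.rng.choice(candidates)
        self.visits[field] += 1
        return field
```

A softer variant would keep all fields eligible and weight each one by something like `1 / (1 + visits)`, which still allows revisiting a field early but makes it less likely.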

Solving the nested value comparison

Problem: The value in the delta could be nested, causing problems when doing value comparison.

Solution: Flatten the dict before comparison

Comments:
- We should make heuristics easily extensible. We may add other heuristics later (e.g. solving the format problem in value comparison), and users can implement their own heuristics.
- This problem is essentially one of matching input deltas to system state deltas. There may be related work on object-relational mapping, e.g. https://github.com/Frankkkkk/pykorm
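
The flattening step from the solution above can be sketched as follows. This is a minimal version under assumptions, not Acto's implementation: the `flatten` name and the tuple-based path representation are choices made for the example.

```python
from typing import Any, Dict, Tuple

Path = Tuple[Any, ...]


def flatten(value: Any, prefix: Path = ()) -> Dict[Path, Any]:
    """Map every leaf of a nested dict/list structure to its full path.

    Dict keys and list indices become path components. Empty dicts and
    empty lists are kept as leaves so that they are not silently dropped.
    """
    if isinstance(value, dict) and value:
        flat: Dict[Path, Any] = {}
        for key, child in value.items():
            flat.update(flatten(child, prefix + (key,)))
        return flat
    if isinstance(value, list) and value:
        flat = {}
        for index, child in enumerate(value):
            flat.update(flatten(child, prefix + (index,)))
        return flat
    return {prefix: value}
```

After flattening, both the input delta and the system state delta are plain path-to-value maps, so a comparison heuristic only has to deal with leaf values rather than nested structures.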

Trying bad values

When the CR yaml is submitted, it goes through two levels of checks. The first level is the Kubernetes server. Since the CRD was previously registered with the server, the server uses the schema in the CRD to validate the CR yaml. If the CR yaml fails this check, the server rejects it before it ever reaches the operator code, and the user is shown an error message.

Our goal should be to test the operator with both good values and bad values that pass the server-side check. Testing with such bad values shows whether the operator handles them properly; if it does not, there would be a failure.

In fact, Acto already generates a lot of bad values that pass the server-side check.
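
To make the distinction between the two levels of checks concrete, here is a toy sketch. It is not how the Kubernetes server validates a CR: it covers only a tiny, assumed subset of schema keywords (`type`, `enum`, `minimum`, `maximum`), and the function names and the example field are hypothetical.

```python
from typing import Any, Dict, List

# JSON schema type name -> accepted Python types (subset, for illustration).
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def passes_schema(value: Any, schema: Dict[str, Any]) -> bool:
    """Return True if the value satisfies the (toy) schema constraints."""
    type_name = schema.get("type")
    expected = _TYPES.get(type_name)
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name != "boolean":
            return False
        if not isinstance(value, expected):
            return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and "minimum" in schema and value < schema["minimum"]:
        return False
    if is_number and "maximum" in schema and value > schema["maximum"]:
        return False
    return True


def server_accepted(candidates: List[Any], schema: Dict[str, Any]) -> List[Any]:
    """Keep only the candidates that would reach the operator code."""
    return [value for value in candidates if passes_schema(value, schema)]
```

For a hypothetical replica-count field with schema `{"type": "integer", "minimum": 0}`, a value such as `-1` is rejected at the first level, while an absurdly large count is structurally valid and would be passed on to the operator. The second kind of value is the one worth testing.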

Generating CRD using Kubebuilder

Problem: We previously discussed whether to use the CRD or the API to help the input generation.
Acto currently relies on the schema in the CRD to generate structurally correct inputs. The quality of the inputs we generate largely depends on the quality of the schema in the CRD.

We found that some operator developers specify only a very opaque schema in their CRD. In this case, the API definition would still contain the correct structure information, so we also want to take advantage of it.

Solution: Kubebuilder can generate CRDs automatically, and this feature is cleanly separated out as a CLI.
I was able to generate the CRD for Percona's mongodb-operator with a single command, without modifying the source code.

Next step

Port 2 more operators: choice(cass-operator, redis-operator, ibm-cloud-operator...)
Explore the input space more systematically, and think about how to prune it.
Make oracle heuristics easily extensible, as we will most likely add more heuristics later, and allow users to extend them too.

tylergu assigned tianyin, pdettori, vazirim, marshtompsxd, wangchen615 and tylergu Mar 24, 2022

tylergu closed this as completed Apr 27, 2022
