EKS: Run the ping-pong test without making functional changes #1416
Comments
Does it depend on the CNI used?
This is the plan: to verify whether it also happens on other CNI combinations. Also, could you please clarify what that means?
If that's the plan, could you update the description with the list of tasks? Regarding reproducing: the ticket description is not clear whether this bug is in testground, the sidecar, the test itself, or whether it is something common to any application using CNI X and Y. In other words, if I write a simple application, would it hit the same problem?
This has been happening on both combinations. Here is some additional info about the issue (tested on AWS VPC CNI + Weave). Command used to invoke the test, and the output in the daemon pod; notice that one pod quickly goes into a failed state:
And then the test hangs indefinitely until the pods are manually deleted and the daemon deployment is restarted; removing the pods alone is not enough. Logs from the ping-pong pods:
Hey Laurent, thanks for the replies! That being said, everything mentioned above is CNI-agnostic and should not manifest itself differently depending on the type of CNI being used. Just to reiterate, interface substitution is working properly, as expected. When we predefine the IP values of pods in pingpong.go, the tests pass. We're currently working on resolving this. I will update the issue name and description accordingly once you've seen this message, so I don't cause additional confusion.
Thanks for the details @dektech & @AbominableSnowman730,
It sounds like we might be asking too much from the network simulation layer, and the solution could be a change in the SDK.
You're quite welcome. :)
Yup, I think you're right. I'd also point out this line. The way IP allocation is handled now differs a bit from the previous setup, if I'm not mistaken: we're getting IPs that are allocated to pods dynamically. We could have a deterministic way of predicting what the IPs will be, but it's probably a better idea to approach it dynamically. I'd have to do a bit more research, but I'd guess that we can get those values by polling Kubernetes. If you have any suggestions, it'd be great if you could share them.
This is what we previously used. It's heavily modified, but the point to note is that we've "locked" 2 addresses that were expected to be allocated to pods: x.x.x.2 and x.x.x.4. I'd just like to point out that, even now, inter-pod communication works without any issues (even if pods are hosted on different nodes). This is just a matter of the "active" pod not being given the proper IP address of the "passive" pod to talk to.
I've seen cases of the instance finding its own data network IP dynamically:

- in the go sdk: https://github.com/testground/sdk-go/blob/49c90fa754052018b70c63d87b7f1d37f6080a78/network/address.go#L13-L50
- in the rust example plan: testground/plans/example-rust/src/main.rs, lines 10 to 16 (at eaab278)
Is it helpful? What else is missing?
Sorry for the late reply, I somehow missed this. Thanks for the references. That portion of code is working as expected; we should already be using it in the test plans, if I'm not mistaken. However, we're trying to get the IP of the instance's pair, i.e. the pod that the running (active) test container is supposed to establish a connection with. We cannot grab that address directly from the interfaces on the active pod.

The test itself was successfully run on an EKS cluster with the Calico/Weave CNI pairing. We're in the process of running it on the AWS VPC/Weave CNI pairing as well. This would be our preferred pairing, since it is AWS' native solution, and we'd delegate connectivity issues on the control network to AWS altogether (we shouldn't strive to maintain and troubleshoot connectivity issues on two CNI plugins).

While on the topic, we've encountered an issue with running connectivity tests between nodes on the out-of-the-box AWS VPC/Weave setup (and, for that matter, with any other overlay network plugin). This is related to iptables rules that encapsulate outgoing packets on secondary network interfaces.
Is it correct to rephrase this by saying:
If that's true: that's what the sync service is used for. It's the service, living on the control network, that instances use to coordinate with each other. You can see it in use in the storm test: testground/plans/benchmarks/storm.go, lines 232 to 255 (at eaab278).
Yeah, that's true. Completely agreed, the sync service should be used for this purpose. However, we chose a temporary option of polling the API server, to reduce the amount of traffic between worker nodes during the test run while we were still pinpointing the connectivity issue between nodes. Now that we've located the issue, we no longer need it.
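The temporary API-server-polling approach boils down to listing the pods and reading status.podIP from each (those are the real Kubernetes Pod API field names). A minimal sketch of the parsing half, assuming the JSON comes from `kubectl get pods -o json` or an equivalent API call:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList mirrors only the fields we need from the Kubernetes Pod list
// API (metadata.name and status.podIP).
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			PodIP string `json:"podIP"`
		} `json:"status"`
	} `json:"items"`
}

// podIPs extracts a name -> IP map from raw pod-list JSON.
func podIPs(raw []byte) (map[string]string, error) {
	var pl podList
	if err := json.Unmarshal(raw, &pl); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(pl.Items))
	for _, it := range pl.Items {
		out[it.Metadata.Name] = it.Status.PodIP
	}
	return out, nil
}

func main() {
	// Hypothetical sample output; a real run would fetch this from the
	// API server.
	sample := []byte(`{"items":[{"metadata":{"name":"pingpong-0"},"status":{"podIP":"10.32.0.4"}}]}`)
	ips, err := podIPs(sample)
	fmt.Println(ips, err)
}
```

As noted in the thread, this was only a stopgap; the sync service is the intended mechanism.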
Can we please be clearer on the done criteria? I haven't been following the issue, but is it something like, "As a testground developer I should be able to run command X on an EKS cluster and get result Y"?
@brdji : the done criteria are better - thanks. To fully drive this home, what would we demo to show this is fully working as expected? What command would we run and what output would we expect to get? |
Thanks for the suggestion - I've updated the issue to clarify what needs to be run and checked. |
Thanks for updating. Any blockers on this? The docker version is already tested in CI: https://github.com/testground/testground/blob/8a2a5f15e2b723eb2dd77af7b3c7d2a753a94c2f/integration_tests/06_docker_network_ping-pong.sh. It would make sense to create a similar one for the cluster:k8s runner.
There are no blockers on this, and we are in the process of making the changes and running them on the current cluster(s).
I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster? |
I think they were all disabled now (thinking of Kubo and Lotus). @laurentsenta do you know of any other ones? |
Update: We have developed and tested a working solution using the sync service. The solution is currently on the feature branch.
There are a few What I would like to see is an
master...tmp_eks_cluster is starting to grow, and I'm worried we'll get burned by a massive, 6-month-old review. Could we split it into multiple PRs, one per deliverable? If we don't want to merge into
I'll keep an eye on the changes in master and update the feature branch when needed. |
We have managed to get the ping-pong test working on both EKS clusters; however, a strange issue has popped up:
I am assuming the cause is one of two things:
Yes, this is probably the best way to handle this.
@brdji could you create a PR with the test updated? It should still pass with the docker runner. |
Closing, superseded by #1499.
Description
While running benchmarks on the EKS trial cluster, we discovered that the ping-pong test cannot pass. The test attempts to reach one pod from another, waits for this operation to finish, and hangs indefinitely. This happens because the test relies on predefined sets of IP addresses for the running pods, which do not match the EKS setup/installation. We need to find a way of resolving the issue without impacting the basic functionality that the test offers.
What defines this endeavor to be complete
We should minimally modify the ping-pong test so that it does not rely on pods receiving IP addresses in an expected range (e.g. from 16.0.0.0 to 16.0.0.1). In other words, the pods should communicate their IP addresses to each other, and not assume them beforehand.
The output of the command

testground run single --plan network --testcase ping-pong --instances 2 --builder docker:go --runner cluster:k8s

should result in a successful job result.

One possible solution (using the sync service) would be to have each pod publish its own IP address and subscribe to its peer's before attempting to connect. The change should keep the test passing with the local:docker runner as well.