
EKS: Run the ping-pong test without making functional changes #1416

Closed
3 tasks done
brdji opened this issue Aug 3, 2022 · 23 comments
@brdji
Contributor

brdji commented Aug 3, 2022

Description

While running benchmarks on the EKS trial cluster, we discovered that the ping-pong test cannot pass. The test attempts to reach one pod from another, and it hangs indefinitely while waiting for that operation to finish.

This happens because the test assumes a predefined set of IP addresses for the running pods, and those addresses do not match the ones allocated in the EKS setup/installation. We need to find a way to resolve the issue without impacting the basic functionality that the test offers.

What defines this endeavor to be complete

We should minimally modify the ping-pong test so that it does not rely on pods receiving IP addresses in an expected range (e.g. from 16.0.0.0 to 16.0.0.1). In other words, the pods should communicate their IP addresses to each other rather than assuming them beforehand.

Running the command
testground run single --plan network --testcase ping-pong --instances 2 --builder docker:go --runner cluster:k8s
should result in a successful job run.

One possible solution (using the sync service, sketched below) would be:

  • Modify the ping-pong test so that the listening pod broadcasts its listening IP address to the sync-service, and the sending pod receives the address from the sync-service
  • Verify the test is working successfully on the EKS test cluster
  • Verify the test is working for the local:docker runner, as well
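
For illustration, here is a minimal sketch of that flow using the sdk-go sync client. The topic name, payload, and helper function are illustrative placeholders, not the actual change:

package pingpong

import (
    "context"

    "github.com/testground/sdk-go/runtime"
    "github.com/testground/sdk-go/sync"
)

// listenAddrTopic is an illustrative topic: the listening instance publishes
// the address it listens on, and the dialing instance subscribes to learn it.
var listenAddrTopic = sync.NewTopic("ping-pong-listen-addr", "")

// shareListenAddr returns the listener's address. The instance with sequence
// number 1 acts as the listener and publishes its own address; every other
// instance waits to receive it instead of assuming a predefined IP.
func shareListenAddr(ctx context.Context, runenv *runtime.RunEnv, client sync.Client, seq int64, myAddr string) (string, error) {
    if seq == 1 {
        _, err := client.Publish(ctx, listenAddrTopic, myAddr)
        return myAddr, err
    }
    ch := make(chan string, 1)
    if _, err := client.Subscribe(ctx, listenAddrTopic, ch); err != nil {
        return "", err
    }
    select {
    case addr := <-ch:
        runenv.RecordMessage("learned listener address: %s", addr)
        return addr, nil
    case <-ctx.Done():
        return "", ctx.Err()
    }
}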
@laurentsenta
Contributor

Does it depend on the CNI used?
Is this related to Testground and its sidecar or is it an issue that we can reproduce outside of Testground?
Do we have a plan? I guess something like:

  • Reproduce the issue outside of Testground
  • Create a minimal test case & share here
  • Try with different combinations of CNIs
    • CNI1-CNIA: pass / fail
    • CNI2-CNIA: pass / fail
    • CNI1-CNIB: pass / fail

@dektech
Contributor

dektech commented Aug 8, 2022

That is the plan: to verify whether it also happens with other CNI combinations.

Also, could you please clarify what reproducing the issue outside of Testground means? My understanding was that this is a Testground-exclusive test that only works inside an environment running the daemon and sidecar.

@laurentsenta
Contributor

If that's the plan, could you update the description with the list of tasks?
Maybe copy my comment and update with the CNIs we have tested / planning to test. This would make it easier to follow along.

Regarding reproducing: the ticket description is not clear whether this bug is in testground, the sidecar, the test itself, or if it is something common to any application using CNI X and Y.

In other words, if I write a simple main.go program that tries to remove and attach a network interface, will this container hang? Can I reproduce this issue outside of the Testground environment?

@dektech
Contributor

dektech commented Aug 8, 2022

This has been happening on both Calico + Weave and AWS VPC CNI + Weave clusters. We are currently working on finding the exact issue and where it is happening.
As for the last question, I really cannot answer that; maybe @brdji can, but he is away until next week.
You can try it and let us know the results.

Here is some additional info about the issue (tested on AWS VPC CNI + weave):

Command used to invoke the test:
testground run single --plan network --testcase ping-pong --builder docker:go --runner cluster:k8s --instances 2

Output in the daemon pod; notice that one pod quickly goes into the succeeded state:

Aug  8 14:07:38.078040  DEBUG   testplan pods state     {"runner": "cluster:k8s", "run_id": "cbohgt926un0htvnf1eg", "running_for": "12s", "succeeded": 0, "running": 2, "pending": 0, "failed": 0, "unknown": 0}
Aug  8 14:07:40.093865  DEBUG   testplan pods state     {"runner": "cluster:k8s", "run_id": "cbohgt926un0htvnf1eg", "running_for": "14s", "succeeded": 1, "running": 1, "pending": 0, "failed": 0, "unknown": 0}

The test then hangs indefinitely until the pods are manually deleted and the daemon deployment is restarted; even if we remove the pods and issue testground terminate --runner=cluster:k8s, the task still shows as running.

Logs from the ping-pong pods:

kubectl logs tg-network-cbohgt926un0htvnf1eg-single-1

{"ts":1659967651240735595,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)"}}}
{"ts":1659967651240792538,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"start_event":{"runenv":{"plan":"network","case":"ping-pong","run":"cbohgt926un0htvnf1eg","params":{},"instances":2,"outputs_path":"/outputs/cbohgt926un0htvnf1eg/single/1","temp_path":"","network":"10.32.0.0/12","group":"single","group_instances":2}}}}
{"ts":1659967651272437160,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"waiting for network to initialize"}}}
{"ts":1659967653232734345,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialisation successful"}}}
{"ts":1659967653233064561,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialization complete"}}}
{"ts":1659967653233495189,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"claimed sequence numbers; global=1, group(single)=1"}}}
{"ts":1659967653233510369,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before sync.MustBoundClient"}}}
{"ts":1659967653233620345,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before netclient.MustConfigureNetwork"}}}
{"ts":1659967653335749609,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"I am 2, and my desired IP is 10.32.1.2\n"}}}
{"ts":1659967653335782474,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before reconfiguring network"}}}
Attempting to dial 10.32.1.1
Received an error attempting to dial 10.32.1.1
{"ts":1659967656496696131,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"failure_event":{"group":"single","error":"dial tcp4 10.32.1.1:1234: connect: no route to host"}}}
kubectl logs tg-network-cbohgt926un0htvnf1eg-single-0

{"ts":1659967651240882526,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)"}}}
{"ts":1659967651240921442,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"start_event":{"runenv":{"plan":"network","case":"ping-pong","run":"cbohgt926un0htvnf1eg","params":{},"instances":2,"outputs_path":"/outputs/cbohgt926un0htvnf1eg/single/0","temp_path":"","network":"10.32.0.0/12","group":"single","group_instances":2}}}}
{"ts":1659967651272384084,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"waiting for network to initialize"}}}
{"ts":1659967653232722287,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialisation successful"}}}
{"ts":1659967653233058661,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialization complete"}}}
{"ts":1659967653233601847,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"claimed sequence numbers; global=2, group(single)=2"}}}
{"ts":1659967653233640973,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before sync.MustBoundClient"}}}
{"ts":1659967653233788098,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before netclient.MustConfigureNetwork"}}}
{"ts":1659967653335661303,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"I am 1, and my desired IP is 10.32.1.1\n"}}}
{"ts":1659967653335722425,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before reconfiguring network"}}}
This container is listening!

@AbominableSnowman730
Contributor

If that's the plan, could you update the description with the list of tasks? Maybe copy my comment and update with the CNIs we have tested / planning to test. This would make it easier to follow along.

Regarding reproducing: the ticket description is not clear whether this bug is in testground, the sidecar, the test itself, or if it is something common to any application using CNI X and Y.

In other words, if I write a simple main.go program that tries to remove and attach a network interface, will this container hang? Can I reproduce this issue outside of Testground environment?

Hey Laurent, thanks for the replies!
We confirmed on Friday that interfaces are indeed being recreated inside running containers, which is one of the things ping-pong is trying to accomplish. The issue is incorrectly titled: the errors occur when pods try to reach each other after address reallocation, because the pods attempt to connect to a "predefined" set of IP addresses. We're currently working on resolving this.

That being said, everything mentioned above is CNI-agnostic, and should not manifest itself differently depending on the type of CNI being used.

Just to reiterate, interface substitution is working properly, as expected. When we predefine the pods' IP values in pingpong.go, the tests pass.

I will update the issue name and description accordingly, once you've seen the message, so I don't cause additional confusion.

@laurentsenta
Contributor

Thanks for the details @dektech & @AbominableSnowman730,
I appreciate the thoughtfulness regarding waiting to rename the issue and changing descriptions ;) In the future please do not worry and go ahead, GitHub logs everything and the team is used to following along with these.

This is happening due to pods attempting to connect to "predefined" sets of IP addresses.
...
When we predefine IP values of pods in pingpong.go, the tests are passing.

It sounds like we might be asking too much from the network simulation layer, and the solution could be a change in the SDK:

  • That issue is caused by the call to MustConfigureNetwork with a hardcoded IP, correct?
  • Can you share a branch with the minimal change you use to make this test pass?

@AbominableSnowman730
Contributor

AbominableSnowman730 commented Aug 9, 2022

Thanks for the details @dektech & @AbominableSnowman730, I appreciate the thoughtfulness regarding waiting to rename the issue and changing descriptions ;) In the future please do not worry and go ahead, GitHub logs everything and the team is used to following along with these.

You're quite welcome. :)

It sounds like we might be asking too much from the network simulation layer, and the solution could be a change in the SDK,

* That issue is caused by the call to `MustConfigureNetwork` with [a hardcoded IP](https://github.com/testground/testground/blob/775a58896b01063ae7d9fef0afbb4faf7a5f2bf5/plans/network/pingpong.go#L60-L63) correct?

Yup, I think you're right. I'd also point out this line. The way IP allocation is handled now differs a bit from the previous setup, if I'm not mistaken: IPs are allocated to pods dynamically. We could find a deterministic way of predicting what the IPs will be, but it's probably a better idea to approach it dynamically. I'd have to do a bit more research, but I'd guess we can get those values by polling Kubernetes. If you have any suggestions, it'd be great if you could share them.

* Can you share a branch with the minimal change you use to make this test pass?

This is what we previously used. It's heavily modified, but the point to note is that we "locked" two addresses that were expected to be allocated to the pods: x.x.x.2 and x.x.x.4.

I'd just like to point out that, even now, inter-pod communication works without any issues (even when pods are hosted on different nodes). This is just a matter of the "active" pod not being given the proper IP address of the "passive" pod to talk to in the test.

AbominableSnowman730 changed the title from "EKS: Cannot remove and attach network interfaces" to "EKS: Run the ping-pong test without making functional changes" on Aug 9, 2022
@laurentsenta
Contributor

I've seen cases of the instance finding its own data network IP dynamically:

  • in the Go SDK: https://github.com/testground/sdk-go/blob/49c90fa754052018b70c63d87b7f1d37f6080a78/network/address.go#L13-L50
  • in Rust:

let local_addr = &if_addrs::get_if_addrs()
    .unwrap()
    .into_iter()
    .find(|iface| iface.name == "eth1")
    .unwrap()
    .addr
    .ip();

Is it helpful? What else is missing?

@AbominableSnowman730
Contributor

I've seen cases of the instance finding its own data network IP dynamically:

in the go sdk: https://github.com/testground/sdk-go/blob/49c90fa754052018b70c63d87b7f1d37f6080a78/network/address.go#L13-L50 in rust:

let local_addr = &if_addrs::get_if_addrs()
.unwrap()
.into_iter()
.find(|iface| iface.name == "eth1")
.unwrap()
.addr
.ip();

Is it helpful? What else is missing?

Sorry for the late reply. I somehow missed this.

Thanks for the references. This portion of code is working as expected. We should already be using it in test plans, if I'm not mistaken.

However, we're trying to get the IP of its pair (the test container/pod that the running one is supposed to establish a connection with). We cannot grab it directly from the interfaces on the running (active) pod.
We've been able to get the address by polling the Kubernetes API server.
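
For reference, that lookup is roughly like the sketch below, which polls the API server for the other test pod in the run and reads its pod IP. The namespace, the testground.run_id label selector, and the helper name are assumptions for illustration, not the exact code on our branch:

package k8sutil

import (
    "context"
    "fmt"
    "os"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// peerPodIP lists the test plan pods for the current run and returns the
// pod IP of the peer (any pod in the run that is not ourselves).
func peerPodIP(ctx context.Context, runID string) (string, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return "", err
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return "", err
    }
    pods, err := clientset.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
        LabelSelector: fmt.Sprintf("testground.run_id=%s", runID), // hypothetical label
    })
    if err != nil {
        return "", err
    }
    self := os.Getenv("HOSTNAME") // defaults to the pod's own name
    for _, p := range pods.Items {
        if p.Name != self && p.Status.PodIP != "" {
            return p.Status.PodIP, nil
        }
    }
    return "", fmt.Errorf("no peer pod IP found for run %s", runID)
}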

The test itself was successfully run on an EKS cluster with the Calico/Weave CNI pairing. We're in the process of running it on the AWS VPC/Weave CNI pairing as well. That would be our preferred pairing, since the AWS VPC CNI is AWS's native solution and we'd delegate control-network connectivity issues entirely to AWS (we shouldn't have to maintain and troubleshoot connectivity issues on two CNI plugins).

While on the topic, we've encountered an issue when running connectivity tests between nodes on the out-of-the-box AWS VPC/Weave setup (and, for that matter, with any other overlay network plugin). It is related to iptables rules that encapsulate outgoing packets on secondary network interfaces.
When the iptables rules are flushed on both nodes/servers, the connection is established successfully. We are currently trying to find the optimal solution for the problem; most likely, we will have to prepend an iptables rule on the nodes during the bootstrapping process.

@laurentsenta
Contributor

laurentsenta commented Aug 26, 2022

However, we're trying to get the IP of it's pair (the running test container's/pod's pairing that's supposed to establish connection with. We cannot grab it directly using interfaces on the running (active) pod.
We've been able to get the address by polling the Kubernetes' API Server.

Is it correct to rephrase this by saying:

  • you have 2 instances, A and B.
  • Each instance can find its address, IP_a, IP_b using the code I shared above,
  • Now you want A to "learn" IP_b in order to send a ping to B?

If that's true: that's what the sync service is used for. That's the service living on the control network that instances use to coordinate with each other.

It's in the storm test:

func shareAddresses(ctx context.Context, client sync.Client, runenv *runtime.RunEnv, mynodeInfo *ListenAddrs) ([]string, error) {
    subCtx, cancel := context.WithCancel(ctx)
    defer cancel()

    ch := make(chan *ListenAddrs)
    if _, _, err := client.PublishSubscribe(subCtx, PeerTopic, mynodeInfo, ch); err != nil {
        return nil, errors.Wrap(err, "publish/subscribe failure")
    }

    res := []string{}
    for i := 0; i < runenv.TestInstanceCount; i++ {
        select {
        case info := <-ch:
            runenv.RecordMessage("got info: %d: %s", i, info.Addrs)
            res = append(res, info.Addrs...)
        case <-ctx.Done():
            return nil, ctx.Err()
        }
        runenv.D().Counter("got.info").Inc(1)
    }
    return res, nil
}

@AbominableSnowman730
Contributor

Is it correct to rephrase this by saying:

* you have 2 instances, A and B.

* Each instance can find its address, IP_a, IP_b using the code I shared above,

* Now you want A to "learn" IP_b in order to send a ping to B?

If that's true: that's what the sync service is used for. That's the service living on the control network that instances use to coordinate with each other.

It's in the storm test:

Yeah, that's true. Completely agreed, the sync service should be used for this purpose. However, we chose the temporary option of polling the API server so that we could reduce the amount of traffic between worker nodes during the test run while we were still pinpointing the inter-node connectivity issue. Now that we've located the issue, we no longer need it.
Unfortunate and imprecise wording on my part, apologies.

@BigLep

BigLep commented Aug 26, 2022

Can we please be clearer on the done criteria? I haven't been following the issue, but is it something like, "As a testground developer I should be able to run command X on an EKS cluster and get result Y"?

@BigLep

BigLep commented Aug 29, 2022

@brdji : the done criteria are better - thanks. To fully drive this home, what would we demo to show this is fully working as expected? What command would we run and what output would we expect to get?

@brdji
Contributor Author

brdji commented Aug 29, 2022

Thanks for the suggestion - I've updated the issue to clarify what needs to be run and checked.

@laurentsenta
Contributor

Thanks for updating.

Any blockers on this?

The docker version is already tested in CI: https://github.com/testground/testground/blob/8a2a5f15e2b723eb2dd77af7b3c7d2a753a94c2f/integration_tests/06_docker_network_ping-pong.sh

It would make sense to create a similar eks integration test.

@brdji
Contributor Author

brdji commented Aug 30, 2022

There are no blockers on this, and we are in the process of making the changes and running them on the current cluster(s).

It would make sense to create a similar eks integration test.

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

@galargh
Contributor

galargh commented Sep 1, 2022

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

I think they were all disabled now (thinking of Kubo and Lotus). @laurentsenta do you know of any other ones?

@brdji
Contributor Author

brdji commented Sep 2, 2022

Update: we have developed and tested a working solution using the sync service. The solution is on the tmp_eks_branch, and we are currently running the ping-pong test during the cluster scaling benchmarks.

@laurentsenta
Contributor

laurentsenta commented Sep 2, 2022

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

I think they were all disabled now (thinking of Kubo and Lotus). @laurentsenta do you know of any other ones?

There are a few k8s tests in CI; they use a local kind test cluster, but I'm not aware of any EKS tests in the Testground repo.

What I would like to see is an EKS test suite that we can point at a cluster; we'd use it to test changes on local development clusters, and on the staging cluster before doing releases, for example. That should be the deliverable for this task, something like:

  • there is an EKS ping test that a developer can trigger with EKS_CLUSTER=staging.eks.testground.com make test-eks.

@laurentsenta
Contributor

master...tmp_eks_cluster is starting to grow, and I'm worried we'll get burned by a massive, 6-month-old review.

Could we split into multiple PRs, for every deliverable?

If we don't want to merge into master directly, we can:

  • create a feature/eks feature branch from master
  • create a PR for each issue (like "fixing the ping test for k8s").

I'll keep an eye on the changes in master and update the feature branch when needed.

@brdji
Contributor Author

brdji commented Sep 5, 2022

We have managed to get the ping-pong test working on both EKS clusters; however, a strange issue has popped up:

{"ts":1662143280332427650,"msg":"","group_id":"single","run_id":"cc94ma926un1kj0521eg","event":{"failure_event":{"group":"single","error":"expected an RTT between 200ms and 215ms, got 89.571µs"}}}

I am assuming the cause is one of two things:

  • The pods are communicating over a different interface (i.e. the control network) during the test (is this possible?)
  • The network configuration change did not apply the latency: the IP address of the interface changed, but the latency did not (see the sketch below of how latency is normally requested).
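
For context, here is a rough sketch of how a plan requests latency from the sidecar via the sdk-go network package; the 100ms value, state name, and helper are illustrative, not the actual ping-pong code. If this shaping is only applied to the data interface, traffic going over the control network would not see the expected RTT:

package pingpong

import (
    "context"
    "time"

    "github.com/testground/sdk-go/network"
)

// configureLatency asks the sidecar to shape the data network. With a
// symmetric 100ms one-way latency, the measured RTT should land around
// 200ms; an RTT of ~90µs would suggest the shaping never applied to the
// path that was actually used.
func configureLatency(ctx context.Context, netclient *network.Client) {
    netclient.MustConfigureNetwork(ctx, &network.Config{
        Network: "default", // the data network, not the control network
        Enable:  true,
        Default: network.LinkShape{
            Latency: 100 * time.Millisecond,
        },
        CallbackState: "latency-configured",
        RoutingPolicy: network.AllowAll,
    })
}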

Could we split into multiple PRs, for every deliverable?

Yes, this is probably the best way to handle it. The tmp_eks_branch does not have as many changes as one would guess from looking at the commit log, as most of the changes are localized to a small area of the codebase. However, we will extract the final changes once we have confirmed everything is working 100%.

@laurentsenta
Contributor

@brdji could you create a PR with the updated test? It should still pass with the docker runner.

@laurentsenta
Contributor

Closing: superseded by #1499, tackled in testground/infra#78 and #1350.
