
EKS: Run the ping-pong test without making functional changes #1416

Closed
3 tasks done
brdji opened this issue Aug 3, 2022 · 23 comments
@brdji
Contributor

brdji commented Aug 3, 2022

Description

While running benchmarks on the EKS trial cluster, we discovered that the ping-pong test cannot pass. The test attempts to reach one pod from another, and it hangs indefinitely while waiting for that operation to finish.

This happens because the test assumes a predefined set of IP addresses for the running pods, and those addresses do not match the ones allocated in the EKS setup/installation. We need to find a way to resolve the issue without impacting the basic functionality that the test offers.

What defines this endeavor to be complete

We should minimally modify the ping-pong test so that it does not rely on pods receiving IP addresses in an expected range (e.g. from 16.0.0.0 to 16.0.0.1). In other words, the pods should communicate their IP addresses to each other rather than assuming them beforehand.

Running the command
testground run single --plan network --testcase ping-pong --instances 2 --builder docker:go --runner cluster:k8s
should result in a successful job run.

One possible solution (using the sync service, sketched below) would be:

  • Modify the ping-pong test so that the listening pod broadcasts its listening IP address to the sync-service, and the sending pod receives the address from the sync-service
  • Verify the test is working successfully on the EKS test cluster
  • Verify the test is working for the local:docker runner, as well
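
For illustration, here is a minimal sketch of that flow using the sdk-go sync client. The topic name, payload, and helper function are illustrative placeholders, not the actual change:

package pingpong

import (
    "context"

    "github.com/testground/sdk-go/runtime"
    "github.com/testground/sdk-go/sync"
)

// listenAddrTopic is an illustrative topic: the listening instance publishes
// the address it listens on, and the dialing instance subscribes to learn it.
var listenAddrTopic = sync.NewTopic("ping-pong-listen-addr", "")

// shareListenAddr returns the listener's address. The instance with sequence
// number 1 acts as the listener and publishes its own address; every other
// instance waits to receive it instead of assuming a predefined IP.
func shareListenAddr(ctx context.Context, runenv *runtime.RunEnv, client sync.Client, seq int64, myAddr string) (string, error) {
    if seq == 1 {
        _, err := client.Publish(ctx, listenAddrTopic, myAddr)
        return myAddr, err
    }
    ch := make(chan string, 1)
    if _, err := client.Subscribe(ctx, listenAddrTopic, ch); err != nil {
        return "", err
    }
    select {
    case addr := <-ch:
        runenv.RecordMessage("learned listener address: %s", addr)
        return addr, nil
    case <-ctx.Done():
        return "", ctx.Err()
    }
}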
@laurentsenta
Contributor

Does it depend on the CNI used?
Is this related to Testground and its sidecar or is it an issue that we can reproduce outside of Testground?
Do we have a plan? I guess something like:

  • Reproduce the issue outside of Testground
  • Create a minimal test case & share here
  • Try with different combinations of CNIs
    • CNI1-CNIA: pass / fail
    • CNI2-CNIA: pass / fail
    • CNI1-CNIB: pass / fail

@dektech
Contributor

dektech commented Aug 8, 2022

That is the plan: to verify whether it also happens with other CNI combinations.

Also, could you please clarify what reproducing the issue outside of Testground means? My understanding was that this is a Testground-exclusive test that only works inside an environment running the daemon and sidecar.

@laurentsenta
Contributor

If that's the plan, could you update the description with the list of tasks?
Maybe copy my comment and update with the CNIs we have tested / planning to test. This would make it easier to follow along.

Regarding reproducing: the ticket description is not clear whether this bug is in testground, the sidecar, the test itself, or if it is something common to any application using CNI X and Y.

In other words, if I write a simple main.go program that tries to remove and attach a network interface, will this container hang? Can I reproduce this issue outside of the Testground environment?

@dektech
Contributor

dektech commented Aug 8, 2022

This has been happening on both Calico + Weave and AWS VPC CNI + Weave clusters. We are currently working on finding the exact issue and where it is happening.
As for the last question, I really cannot answer that; maybe @brdji can, but he is away until next week.
You can try it and let us know the results.

Here is some additional info about the issue (tested on AWS VPC CNI + weave):

Command used to invoke the test:
testground run single --plan network --testcase ping-pong --builder docker:go --runner cluster:k8s --instances 2

Output in the daemon pod; notice that one pod quickly goes into the succeeded state:

Aug  8 14:07:38.078040  DEBUG   testplan pods state     {"runner": "cluster:k8s", "run_id": "cbohgt926un0htvnf1eg", "running_for": "12s", "succeeded": 0, "running": 2, "pending": 0, "failed": 0, "unknown": 0}
Aug  8 14:07:40.093865  DEBUG   testplan pods state     {"runner": "cluster:k8s", "run_id": "cbohgt926un0htvnf1eg", "running_for": "14s", "succeeded": 1, "running": 1, "pending": 0, "failed": 0, "unknown": 0}

The test then hangs indefinitely until the pods are manually deleted and the daemon deployment is restarted; even if we remove the pods and issue testground terminate --runner=cluster:k8s, the task still shows as running.

Logs from the ping-pong pods:

kubectl logs tg-network-cbohgt926un0htvnf1eg-single-1

{"ts":1659967651240735595,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)"}}}
{"ts":1659967651240792538,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"start_event":{"runenv":{"plan":"network","case":"ping-pong","run":"cbohgt926un0htvnf1eg","params":{},"instances":2,"outputs_path":"/outputs/cbohgt926un0htvnf1eg/single/1","temp_path":"","network":"10.32.0.0/12","group":"single","group_instances":2}}}}
{"ts":1659967651272437160,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"waiting for network to initialize"}}}
{"ts":1659967653232734345,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialisation successful"}}}
{"ts":1659967653233064561,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialization complete"}}}
{"ts":1659967653233495189,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"claimed sequence numbers; global=1, group(single)=1"}}}
{"ts":1659967653233510369,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before sync.MustBoundClient"}}}
{"ts":1659967653233620345,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before netclient.MustConfigureNetwork"}}}
{"ts":1659967653335749609,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"I am 2, and my desired IP is 10.32.1.2\n"}}}
{"ts":1659967653335782474,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before reconfiguring network"}}}
Attempting to dial 10.32.1.1
Received an error attempting to dial 10.32.1.1
{"ts":1659967656496696131,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"failure_event":{"group":"single","error":"dial tcp4 10.32.1.1:1234: connect: no route to host"}}}
kubectl logs tg-network-cbohgt926un0htvnf1eg-single-0

{"ts":1659967651240882526,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)"}}}
{"ts":1659967651240921442,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"start_event":{"runenv":{"plan":"network","case":"ping-pong","run":"cbohgt926un0htvnf1eg","params":{},"instances":2,"outputs_path":"/outputs/cbohgt926un0htvnf1eg/single/0","temp_path":"","network":"10.32.0.0/12","group":"single","group_instances":2}}}}
{"ts":1659967651272384084,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"waiting for network to initialize"}}}
{"ts":1659967653232722287,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialisation successful"}}}
{"ts":1659967653233058661,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"network initialization complete"}}}
{"ts":1659967653233601847,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"claimed sequence numbers; global=2, group(single)=2"}}}
{"ts":1659967653233640973,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before sync.MustBoundClient"}}}
{"ts":1659967653233788098,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before netclient.MustConfigureNetwork"}}}
{"ts":1659967653335661303,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"I am 1, and my desired IP is 10.32.1.1\n"}}}
{"ts":1659967653335722425,"msg":"","group_id":"single","run_id":"cbohgt926un0htvnf1eg","event":{"message_event":{"message":"before reconfiguring network"}}}
This container is listening!

@AbominableSnowman730
Contributor

If that's the plan, could you update the description with the list of tasks? Maybe copy my comment and update with the CNIs we have tested / planning to test. This would make it easier to follow along.

Regarding reproducing: the ticket description is not clear whether this bug is in testground, the sidecar, the test itself, or if it is something common to any application using CNI X and Y.

In other words, if I write a simple main.go program that tries to remove and attach a network interface, will this container hang? Can I reproduce this issue outside of Testground environment?

Hey Laurent, thanks for the replies!
We confirmed on Friday that interfaces are indeed being recreated inside running containers, which is one of the things ping-pong is trying to accomplish. The issue is incorrectly titled: the errors occur when pods try to reach each other after address reallocation, because the pods attempt to connect to a "predefined" set of IP addresses. We're currently working on resolving this.

That being said, everything mentioned above is CNI-agnostic, and should not manifest itself differently depending on the type of CNI being used.

Just to reiterate, interface substitution is working properly, as expected. When we predefine the pods' IP values in pingpong.go, the tests pass.

I will update the issue name and description accordingly, once you've seen the message, so I don't cause additional confusion.

@laurentsenta
Contributor

Thanks for the details @dektech & @AbominableSnowman730,
I appreciate the thoughtfulness regarding waiting to rename the issue and changing descriptions ;) In the future please do not worry and go ahead, GitHub logs everything and the team is used to following along with these.

This is happening due to pods attempting to connect to "predefined" sets of IP addresses.
...
When we predefine IP values of pods in pingpong.go, the tests are passing.

It sounds like we might be asking too much from the network simulation layer, and the solution could be a change in the SDK:

  • That issue is caused by the call to MustConfigureNetwork with a hardcoded IP, correct?
  • Can you share a branch with the minimal change you use to make this test pass?

@AbominableSnowman730
Contributor

AbominableSnowman730 commented Aug 9, 2022

Thanks for the details @dektech & @AbominableSnowman730, I appreciate the thoughtfulness regarding waiting to rename the issue and changing descriptions ;) In the future please do not worry and go ahead, GitHub logs everything and the team is used to following along with these.

You're quite welcome. :)

It sounds like we might be asking too much from the network simulation layer, and the solution could be a change in the SDK,

* That issue is caused by the call to `MustConfigureNetwork` with [a hardcoded IP](https://github.com/testground/testground/blob/775a58896b01063ae7d9fef0afbb4faf7a5f2bf5/plans/network/pingpong.go#L60-L63) correct?

Yup, I think you're right. I'd also point out this line. The way IP allocation is handled now differs a bit from the previous setup, if I'm not mistaken: IPs are allocated to pods dynamically. We could find a deterministic way of predicting what the IPs will be, but it's probably a better idea to approach it dynamically. I'd have to do a bit more research, but I'd guess we can get those values by polling Kubernetes. If you have any suggestions, it'd be great if you could share them.

* Can you share a branch with the minimal change you use to make this test pass?

This is what we previously used. It's heavily modified, but the point to note is that we "locked" two addresses that were expected to be allocated to the pods: x.x.x.2 and x.x.x.4.

I'd just like to point out that, even now, inter-pod communication works without any issues (even when pods are hosted on different nodes). This is just a matter of the "active" pod not being given the proper IP address of the "passive" pod to talk to in the test.

AbominableSnowman730 changed the title from "EKS: Cannot remove and attach network interfaces" to "EKS: Run the ping-pong test without making functional changes" on Aug 9, 2022
@laurentsenta
Contributor

I've seen cases of the instance finding its own data network IP dynamically:

  • in the Go SDK: https://github.com/testground/sdk-go/blob/49c90fa754052018b70c63d87b7f1d37f6080a78/network/address.go#L13-L50
  • in Rust:

let local_addr = &if_addrs::get_if_addrs()
    .unwrap()
    .into_iter()
    .find(|iface| iface.name == "eth1")
    .unwrap()
    .addr
    .ip();

Is it helpful? What else is missing?

@AbominableSnowman730
Contributor

I've seen cases of the instance finding its own data network IP dynamically:

in the go sdk: https://github.com/testground/sdk-go/blob/49c90fa754052018b70c63d87b7f1d37f6080a78/network/address.go#L13-L50 in rust:

let local_addr = &if_addrs::get_if_addrs()
.unwrap()
.into_iter()
.find(|iface| iface.name == "eth1")
.unwrap()
.addr
.ip();

Is it helpful? What else is missing?

Sorry for the late reply. I somehow missed this.

Thanks for the references. This portion of code is working as expected. We should already be using it in test plans, if I'm not mistaken.

However, we're trying to get the IP of its pair (the test container/pod that the running one is supposed to establish a connection with). We cannot grab it directly from the interfaces on the running (active) pod.
We've been able to get the address by polling the Kubernetes API server.
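
For reference, that lookup is roughly like the sketch below, which polls the API server for the other test pod in the run and reads its pod IP. The namespace, the testground.run_id label selector, and the helper name are assumptions for illustration, not the exact code on our branch:

package k8sutil

import (
    "context"
    "fmt"
    "os"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// peerPodIP lists the test plan pods for the current run and returns the
// pod IP of the peer (any pod in the run that is not ourselves).
func peerPodIP(ctx context.Context, runID string) (string, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return "", err
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return "", err
    }
    pods, err := clientset.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
        LabelSelector: fmt.Sprintf("testground.run_id=%s", runID), // hypothetical label
    })
    if err != nil {
        return "", err
    }
    self := os.Getenv("HOSTNAME") // defaults to the pod's own name
    for _, p := range pods.Items {
        if p.Name != self && p.Status.PodIP != "" {
            return p.Status.PodIP, nil
        }
    }
    return "", fmt.Errorf("no peer pod IP found for run %s", runID)
}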

The test itself was successfully run on an EKS cluster with the Calico/Weave CNI pairing. We're in the process of running it on the AWS VPC/Weave CNI pairing as well. That would be our preferred pairing, since the AWS VPC CNI is AWS's native solution and we'd delegate control-network connectivity issues entirely to AWS (we shouldn't have to maintain and troubleshoot connectivity issues on two CNI plugins).

While on the topic, we've encountered an issue when running connectivity tests between nodes on the out-of-the-box AWS VPC/Weave setup (and, for that matter, with any other overlay network plugin). It is related to iptables rules that encapsulate outgoing packets on secondary network interfaces.
When the iptables rules are flushed on both nodes/servers, the connection is established successfully. We are currently trying to find the optimal solution for the problem; most likely, we will have to prepend an iptables rule on the nodes during the bootstrapping process.

@laurentsenta
Contributor

laurentsenta commented Aug 26, 2022

However, we're trying to get the IP of it's pair (the running test container's/pod's pairing that's supposed to establish connection with. We cannot grab it directly using interfaces on the running (active) pod.
We've been able to get the address by polling the Kubernetes' API Server.

Is it correct to rephrase this by saying:

  • you have 2 instances, A and B.
  • Each instance can find its address, IP_a, IP_b using the code I shared above,
  • Now you want A to "learn" IP_b in order to send a ping to B?

If that's true: that's what the sync service is used for. That's the service living on the control network that instances use to coordinate with each other.

It's in the storm test:

func shareAddresses(ctx context.Context, client sync.Client, runenv *runtime.RunEnv, mynodeInfo *ListenAddrs) ([]string, error) {
    subCtx, cancel := context.WithCancel(ctx)
    defer cancel()

    ch := make(chan *ListenAddrs)
    if _, _, err := client.PublishSubscribe(subCtx, PeerTopic, mynodeInfo, ch); err != nil {
        return nil, errors.Wrap(err, "publish/subscribe failure")
    }

    res := []string{}
    for i := 0; i < runenv.TestInstanceCount; i++ {
        select {
        case info := <-ch:
            runenv.RecordMessage("got info: %d: %s", i, info.Addrs)
            res = append(res, info.Addrs...)
        case <-ctx.Done():
            return nil, ctx.Err()
        }
        runenv.D().Counter("got.info").Inc(1)
    }
    return res, nil
}

@AbominableSnowman730
Contributor

Is it correct to rephrase this by saying:

* you have 2 instances, A and B.

* Each instance can find its address, IP_a, IP_b using the code I shared above,

* Now you want A to "learn" IP_b in order to send a ping to B?

If that's true: that's what the sync service is used for. That's the service living on the control network that instances use to coordinate with each other.

It's in the storm test:

Yeah, that's true. Completely agreed, the sync service should be used for this purpose. However, we chose the temporary option of polling the API server so that we could reduce the amount of traffic between worker nodes during the test run while we were still pinpointing the inter-node connectivity issue. Now that we've located the issue, we no longer need it.
Unfortunate and imprecise wording on my part, apologies.

@BigLep

BigLep commented Aug 26, 2022

Can we please be clearer on the done criteria? I haven't been following the issue, but is it something like, "As a testground developer I should be able to run command X on an EKS cluster and get result Y"?

@BigLep

BigLep commented Aug 29, 2022

@brdji : the done criteria are better - thanks. To fully drive this home, what would we demo to show this is fully working as expected? What command would we run and what output would we expect to get?

@brdji
Contributor Author

brdji commented Aug 29, 2022

Thanks for the suggestion - I've updated the issue to clarify what needs to be run and checked.

@laurentsenta
Contributor

Thanks for updating.

Any blockers on this?

The docker version is already tested in CI: https://github.com/testground/testground/blob/8a2a5f15e2b723eb2dd77af7b3c7d2a753a94c2f/integration_tests/06_docker_network_ping-pong.sh

It would make sense to create a similar eks integration test.

@brdji
Contributor Author

brdji commented Aug 30, 2022

There are no blockers on this, and we are in the process of making the changes and running them on the current cluster(s).

It would make sense to create a similar eks integration test.

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

@galargh
Contributor

galargh commented Sep 1, 2022

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

I think they were all disabled now (thinking of Kubo and Lotus). @laurentsenta do you know of any other ones?

@brdji
Contributor Author

brdji commented Sep 2, 2022

Update: we have developed and tested a working solution using the sync service. The solution is on the tmp_eks_branch, and we are currently running the ping-pong test during the cluster scaling benchmarks.

@laurentsenta
Contributor

laurentsenta commented Sep 2, 2022

I suppose it would be a good idea to add the ping-pong as a basic healthcheck. Do we have any CI tests that are running on the current TaaS cluster?

I think they were all disabled now (thinking of Kubo and Lotus). @laurentsenta do you know of any other ones?

There are a few k8s tests in CI; they use a local kind test cluster, but I'm not aware of any EKS tests in the Testground repo.

What I would like to see is an EKS test suite that we can point at a cluster; we'd use it to test changes on local development clusters, and on the staging cluster before doing releases, for example. That should be the deliverable for this task, something like:

  • there is an EKS ping test that a developer can trigger with EKS_CLUSTER=staging.eks.testground.com make test-eks.

@laurentsenta
Contributor

master...tmp_eks_cluster is starting to grow, and I'm worried we'll get burned by a massive, 6-month-old review.

Could we split into multiple PRs, for every deliverable?

If we don't want to merge into master directly, we can:

  • create a feature/eks feature branch from master
  • create a PR for each issue (like "fixing the ping test for k8s").

I'll keep an eye on the changes in master and update the feature branch when needed.

@brdji
Contributor Author

brdji commented Sep 5, 2022

We have managed to get the ping-pong test working on both EKS clusters; however, a strange issue has popped up:

{"ts":1662143280332427650,"msg":"","group_id":"single","run_id":"cc94ma926un1kj0521eg","event":{"failure_event":{"group":"single","error":"expected an RTT between 200ms and 215ms, got 89.571µs"}}}

I am assuming the cause is one of two things:

  • The pods are communicating over a different interface (i.e. the control network) during the test (is this possible?)
  • The network configuration change did not apply the latency: the IP address of the interface changed, but the latency did not (see the sketch below of how latency is normally requested).
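
For context, here is a rough sketch of how a plan requests latency from the sidecar via the sdk-go network package; the 100ms value, state name, and helper are illustrative, not the actual ping-pong code. If this shaping is only applied to the data interface, traffic going over the control network would not see the expected RTT:

package pingpong

import (
    "context"
    "time"

    "github.com/testground/sdk-go/network"
)

// configureLatency asks the sidecar to shape the data network. With a
// symmetric 100ms one-way latency, the measured RTT should land around
// 200ms; an RTT of ~90µs would suggest the shaping never applied to the
// path that was actually used.
func configureLatency(ctx context.Context, netclient *network.Client) {
    netclient.MustConfigureNetwork(ctx, &network.Config{
        Network: "default", // the data network, not the control network
        Enable:  true,
        Default: network.LinkShape{
            Latency: 100 * time.Millisecond,
        },
        CallbackState: "latency-configured",
        RoutingPolicy: network.AllowAll,
    })
}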

Could we split into multiple PRs, for every deliverable?

Yes, this is probably the best way to handle it. The tmp_eks_branch does not have as many changes as one would guess from looking at the commit log, as most of the changes are localized to a small area of the codebase. However, we will extract the final changes once we have confirmed everything is working 100%.

@laurentsenta
Contributor

@brdji could you create a PR with the updated test? It should still pass with the docker runner.

@laurentsenta
Contributor

Closing: superseded by #1499, tackled in testground/infra#78 and #1350.
