QUESTION: how to get sample app WordCount.jar to run with version 1.15.3 and 2 taskmanager replicas #618
Comments
Hey! This is rather odd. It looks like there may be a network issue somewhere. Could you look into your namespace events and see if anything pops up?
@jto - I appreciate you responding so quickly. I believe the networking is fine. I deployed an Ubuntu container with networking tools in the same namespace and was able to reach each of the jobmanager and taskmanager pods and ports. The only interesting events relate to the taskmanager that hits the gating issue: its readiness probes are failing with a 500 response.
Another data point: if I delete the failing taskmanager pod, it comes back up as the healthy one, and the other, previously healthy taskmanager pod becomes unhealthy with the above.
After doing some pcaps, the network traffic between the taskmanagers and the jobmanager looks pretty healthy and normal. Seeing heartbeat timeouts like the one below led us to suspect that our Envoy proxies were blocking the heartbeat communication.
After adding an Istio annotation to exclude traffic on port 6123, both taskmanagers were able to register and stay up.
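The exact annotation used is not given in this thread. As a rough sketch, assuming the standard Istio sidecar traffic-exclusion annotations were applied to the jobmanager and taskmanager pod metadata, the fix might look something like the fragment below. Where these annotations go in the flinkcluster CR depends on the operator version, so the placement shown here is an assumption.

```yaml
# Sketch only: pod-level annotations that tell the Istio sidecar (Envoy)
# not to intercept traffic on port 6123, the port excluded in this thread.
# Which of the two annotations was actually used is an assumption.
metadata:
  annotations:
    # Do not redirect inbound traffic on 6123 through Envoy (jobmanager side).
    traffic.sidecar.istio.io/excludeInboundPorts: "6123"
    # Do not redirect outbound traffic to 6123 through Envoy (taskmanager side).
    traffic.sidecar.istio.io/excludeOutboundPorts: "6123"
```

Excluding the port in both directions is the broad version of the fix. Depending on which pods carry the sidecar, one of the two annotations may be enough.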
Hi all,
I am attempting to deploy the sample WordCount.jar app using 2 taskmanager replicas, but only 1 taskmanager is able to register successfully with the ResourceManager at any one time. When I deploy the same app with a single taskmanager replica, it seems to work fine.
Flink UI showing the running job in a single taskmanager replica setup:
However, when deploying a 2 replica taskmanager configuration, one of the taskmanagers gets an Akka gated exception.
Logs from the failed taskmanager pod in a 2 replica setup:
Flink UI showing the job with 1 taskmanager registered and running healthy, although 2 taskmanager pods are running:
The same Flink UI observed a few minutes later, showing the job failed:
I created a custom Flink 1.15.3 image using the Dockerfile below:
Where wc.jar was downloaded from here:
https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.15.3/flink-examples-streaming_2.12-1.15.3-WordCount.jar
Below is the flinkcluster CR I'm using to deploy this setup: