DNS Lookup Loop Never Quits #36

Open
wbarnha opened this issue Mar 8, 2024 · 0 comments

wbarnha commented Mar 8, 2024

I have Python workers in Docker Image A (kafka-python). The 4 workers connect to Docker Image B (kafka-server), which runs kafka-server. If Docker Image B (kafka-server) goes down, the workers in Docker Image A loop on DNS lookups indefinitely until Docker Image B (kafka-server) comes back online.

Here's part of the log:

2023-02-17 15:48:32,489 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/client_async.py:331 - Node 1 connection failed -- refreshing metadata
2023-02-17 15:48:33,430 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:33,430 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:34,323 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:34,323 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,110 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,110 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,955 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,955 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:36,795 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:36,795 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)

When Docker Image B (kafka-server) comes back online, the workers reconnect. But because of timeouts, only one worker connects, which causes kafka-server to create the topic with 1 partition instead of the expected 4.

It would be nice for the workers to actually give up trying to connect and return execution to the main loop, so I can handle the event when Docker Image B (kafka-server) goes offline.
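
As a stopgap on the application side, something like the following might work. It is a rough sketch, not a fix inside kafka-python: it assumes the worker's main loop regains control periodically (for example by calling `poll()` with a timeout) and can run a bounded DNS check of its own before carrying on. The names `broker_resolves` and `wait_for_broker` are hypothetical helpers, and only the standard library is used.

```python
import socket
import time


def broker_resolves(host, port, resolve=socket.getaddrinfo):
    """Return True if the broker hostname currently resolves."""
    try:
        resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        # Same failure kafka-python logs as "DNS lookup failed".
        return False


def wait_for_broker(host, port, attempts=5, delay=1.0,
                    resolve=socket.getaddrinfo, sleep=time.sleep):
    """Try to resolve the broker a bounded number of times.

    Returns True as soon as a lookup succeeds, or False once `attempts`
    lookups have failed, so the caller regains control instead of
    retrying forever. `resolve` and `sleep` are injectable for testing.
    """
    for attempt in range(attempts):
        if broker_resolves(host, port, resolve):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In a worker this could be called as `wait_for_broker("kafka-server", 19092)` between polls. A `False` result would be the signal to close the consumer and run whatever offline handling the main loop needs.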

What I've been seeing is that when kafka-server comes back online, 1 worker reconnects, 2 connect but are not assigned a partition, and 1 gets a wakeup socket error:
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/client_async.py#L937

versions

$ python3 --version
Python 3.6.8
$ cat /usr/local/lib/python3.8/site-packages/kafka/version.py 
__version__ = '2.0.2'

Also, a side note: this line should return a value but is just a bare return.
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/conn.py#L323

I'm sure I'm missing some details but at least this will get a thread/conversation started about what I'm observing.