IPs resolved from *.service.consul URLs seem to be cached #453

We use *.service.consul URLs so that the IPs can be resolved at runtime to the correct Kafka brokers. However, we started noticing stale IPs whenever KafkaJS tried to connect. If we deleted the pods in k8s and created new ones, the IPs were correct.

I haven't really dug too deep into the source yet to know exactly where this happens, but maybe a configuration option would be nice to disable the caching? Or is there something I am missing?

Comments
I haven't done any testing yet, so don't take this as an authoritative answer, but I would suspect that there's something in your environment that is caching the IP. KafkaJS uses the Net and TLS modules to create sockets, which default to the operating system's DNS resolution rather than caching anything themselves.
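To illustrate that point, here is a minimal sketch (not KafkaJS internals) of how a Node socket resolves a hostname: `net.connect()` accepts an optional `lookup` function that defaults to `dns.lookup()`, which delegates to the OS resolver, so logging from a custom `lookup` shows when resolution actually happens. The hostname and port below are placeholders.

```js
// Minimal sketch (not KafkaJS code): observe when a hostname is resolved by
// passing a custom `lookup` to net.connect(). The default is dns.lookup(),
// which delegates to the operating system's resolver.
const net = require('net')
const dns = require('dns')

const socket = net.connect({
  host: 'kafka.service.consul', // placeholder hostname
  port: 9092,                   // placeholder port
  lookup: (hostname, options, callback) => {
    dns.lookup(hostname, options, (err, address, family) => {
      console.log(`resolved ${hostname} -> ${address}`) // fires per connection attempt
      callback(err, address, family)
    })
  },
})

socket.on('error', (err) => console.error('connection error:', err.message))
```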
So, when we provide brokers: [process.env.BROKER], where BROKER is one of our *.service.consul URLs, KafkaJS ends up connecting to a stale IP. But, when we remote into the pod and ping that same *.service.consul hostname directly, it resolves to the correct IP.
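For context, a minimal client setup along the lines described might look like the sketch below; the client id is made up and the env var holds the *.service.consul hostname.

```js
// Rough sketch of the setup described above: the broker address comes from an
// environment variable holding a *.service.consul hostname, so DNS resolution
// is expected to happen whenever KafkaJS opens its sockets.
const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'example-service',   // made-up client id
  brokers: [process.env.BROKER], // e.g. 'kafka.service.consul:9092'
})

const producer = kafka.producer()
```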
When you connect, is that the only time the lookup would happen? And if a broker is replaced and its IP changes, would a new lookup happen when KafkaJS reconnects?
The lookup would only happen when it tries to establish a connection, yes. So if the broker is replaced and the resolved IP changed, my assumption would be that the existing connection is closed and KafkaJS would try to reconnect, which should mean that a new lookup is done - but this is only an assumption at this point (the lookup is done internally by Net/TLS.Socket, not by us, so I'm not 100% sure). I think this sounds like a promising line of reasoning, though. It would be good to try to get a repro case up. Do you think you could explain a bit more about your setup and maybe share some debug logs from when this is happening? And just to make sure, what does your consul DNS configuration look like (e.g. TTLs and stale reads)?
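Since debug logs were requested, a sketch of how they could be captured: the client's log level can be raised via the logLevel option, which is part of the public KafkaJS configuration. The client id and broker address are placeholders.

```js
// Sketch: run the client with DEBUG logging so connection attempts and
// metadata refreshes show up in the logs requested above.
const { Kafka, logLevel } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'repro-client',      // placeholder
  brokers: [process.env.BROKER], // placeholder, e.g. 'kafka.service.consul:9092'
  logLevel: logLevel.DEBUG,      // verbose output, including network activity
})
```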
@eliw00d it works like this: like @Nevon said, we use Net and TLS, so we don't cache DNS; resolution is delegated to the OS. Have you configured KAFKA_ADVERTISED_HOST_NAME on your brokers?
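One way to see what host/port the brokers actually advertise (and therefore what ends up in the metadata KafkaJS holds) is the admin API's describeCluster call. This is only a debugging sketch with placeholder names:

```js
// Sketch: print the broker host/port pairs advertised in cluster metadata,
// to compare against what the *.service.consul name currently resolves to.
const { Kafka } = require('kafkajs')

async function printAdvertisedBrokers() {
  const kafka = new Kafka({ clientId: 'debug-admin', brokers: [process.env.BROKER] })
  const admin = kafka.admin()
  await admin.connect()
  const { brokers } = await admin.describeCluster()
  brokers.forEach(({ nodeId, host, port }) => {
    console.log(`broker ${nodeId}: ${host}:${port}`)
  })
  await admin.disconnect()
}

printAdvertisedBrokers().catch(console.error)
```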
I know that we changed the dns_config to not allow stale reads and I'm pretty sure we don't have KAFKA_ADVERTISED_HOST_NAME configured anywhere. I will try logging to see if that helps narrow anything down, and try to get some more information for you guys.
So, I put some logging in and re-deployed the pod. Then, I deleted the brokers it knew about and tried to do a producer.send(). In the logged metadata you can clearly see the brokers have been updated, so KafkaJS does have the updated metadata during this time. It definitely seems like no new lookups were done, though.

Should I be doing

await producer.connect()
await producer.send()
await producer.disconnect()

? It already seems to connect/disconnect on its own before/after the call to send(). I also grabbed a snippet of the preceding logs (there were many similar loops).
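For reference, a common pattern is a long-lived producer: connect once, reuse it for sends, and disconnect on shutdown. This is only a sketch of that pattern (the topic and message contents are made up), not a statement about how it interacts with the caching behaviour discussed here.

```js
// Sketch of the long-lived producer pattern: connect once, reuse the producer
// for every send, and disconnect only when the process shuts down.
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'example-service', brokers: [process.env.BROKER] })
const producer = kafka.producer()

async function main() {
  await producer.connect()

  // ... many sends over the lifetime of the process ...
  await producer.send({
    topic: 'example-topic', // made-up topic
    messages: [{ key: 'k1', value: 'hello' }],
  })
}

// Disconnect once on shutdown rather than around every send
process.on('SIGTERM', async () => {
  await producer.disconnect()
  process.exit(0)
})

main().catch(console.error)
```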
Okay, so I finally dug through the source and found this. So, if the nodeId matches the previous metadata's nodeId, it returns result without any changes. In our case, we reuse nodeIds, so it never gets a chance to check if the host and port match or assign new values to that broker. Maybe that could be changed to:

this.brokers = this.metadata.brokers.reduce((result, { nodeId, host, port, rack }) => {
  const existingBroker = result[nodeId]
  if (existingBroker && existingBroker.host === host && existingBroker.port === port && existingBroker.rack === rack) {

or similar?
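Filled out a bit, the suggested change might look like the sketch below. This is only an illustration of the idea; createBroker is a stand-in for however the cluster actually constructs broker instances, not a real KafkaJS function.

```js
// Hypothetical sketch of the idea above: reuse a cached broker only when the
// nodeId AND its host/port/rack are unchanged; otherwise rebuild the entry so
// the next connection triggers a fresh DNS lookup.
this.brokers = this.metadata.brokers.reduce((result, { nodeId, host, port, rack }) => {
  const existingBroker = result[nodeId]

  const unchanged =
    existingBroker &&
    existingBroker.host === host &&
    existingBroker.port === port &&
    existingBroker.rack === rack

  if (unchanged) {
    return result // keep the existing (possibly already connected) broker
  }

  // nodeId was reused but the address changed, or the broker is new:
  // replace the entry with a freshly constructed broker
  return { ...result, [nodeId]: createBroker({ nodeId, host, port, rack }) } // createBroker is a placeholder
}, { ...this.brokers })
```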
@tulios - In response to your points made above:
This should likely refetch metadata if a broker connection fails. Otherwise, if a broker IP changes (such as when hosted in Kubernetes), you could have a missing broker for up to 5 minutes or more (presumably the default metadata refresh interval; see the sketch after these points).
This can be dangerous. It should likely fall back to the original bootstrap servers and do another metadata lookup. See https://issues.apache.org/jira/browse/KAFKA-3068 for reasoning from the past. Broker IPs should not be assumed to be near-static; if metadata is outdated, the cache should be updated if possible (i.e. the broker is up but its IP changed), or fail if the broker is down (as expected).
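On the 5-minute figure mentioned above: that is presumably the default metadata refresh interval, which KafkaJS exposes as the metadataMaxAge option (default 300000 ms) on producers and consumers. Below is a sketch of where that knob lives; note that, per the reduce() discussed earlier, a metadata refresh alone may not replace a cached broker entry, so this is context rather than a fix.

```js
// Context for the "up to 5 minutes" figure: metadata is force-refreshed every
// metadataMaxAge milliseconds (default 300000, i.e. 5 minutes).
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'example-service', brokers: [process.env.BROKER] })

const producer = kafka.producer({
  metadataMaxAge: 300000, // default: force a cluster metadata refresh every 5 minutes
})
```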
I don't know how often rack would change, if at all, but since it's used here, would it be safe to make it part of the criteria for replacing as well?
What would be a good workaround for this in the interim?
@tulios