Unable to obtain Agent info #504
Comments
As mentioned on Slack, there are no The fact that there are no resources available probably means that anything |
Restarting two agents ( |
Reopening because a) the CLI should be resilient to this and b) we need to understand why these agents couldn't get their resource info through.
The `conduct agents` command now tolerates the scenario in which some agents contain resource information and some do not. Fixes a) of typesafehub#504.
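To illustrate the kind of tolerance described in that commit, here is a minimal sketch of rendering agent rows when some agents lack resource information. It is not the actual conductr-cli code: the record keys `address`, `resources`, `cpus`, and `memory`, and the function names, are assumptions made for the example and may not match the real ConductR response.

```python
# Hypothetical agent records; the keys "address", "resources", "cpus" and
# "memory" are illustrative assumptions, not the actual ConductR schema.
PLACEHOLDER = "N/A"


def agent_row(agent):
    """Build one display row, using a placeholder where resource info is missing."""
    # Treat both a missing key and an explicit None as "no resource info".
    resources = agent.get("resources") or {}
    return (
        agent.get("address", PLACEHOLDER),
        str(resources.get("cpus", PLACEHOLDER)),
        str(resources.get("memory", PLACEHOLDER)),
    )


def agent_rows(agents):
    """Render every agent instead of failing on the first one without resources."""
    return [agent_row(agent) for agent in agents]


if __name__ == "__main__":
    sample = [
        {"address": "agent-1", "resources": {"cpus": 4, "memory": 8192}},
        {"address": "agent-2"},  # resource info never arrived
    ]
    for row in agent_rows(sample):
        print("  ".join(row))
```

The point of the sketch is that a missing `resources` entry degrades to a placeholder column rather than an exception, so the remaining agents are still listed.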
@edwardcallahan @huntc The following error occurs on ConductR agent several times when the bundle is scaled.
So the bundle fails frequently because of |
Additionally, I see that 2 hours before the
After the restart of the
This log message indicates that the resource poller actor gets started. We don't log any message when the start was successful. Once a resource poller actor has been started, it sends the resources to ConductR core every 15 seconds. Here, too, we do not log any message on the agent, so I cannot tell from the log output whether something is wrong. However, the fact that ConductR core has no
At the very beginning, before the above log messages, I can see that the agent connected successfully to the core and started sending resources:
Summary: Based on the ConductR agent log files, I am not able to identify why the resources are not updated on ConductR core. The ConductR agent was able to connect to a core initially and continues to send the resources. To analyze the issue further, I'd need the ConductR core log files from the same time frame as well. @edwardcallahan Are you able to attach these log files as well?
@markusjura wrt #504 (comment)
core_logs.tar.gz Note: When the cluster was built, it was overprovisioned with agents. These private agent nodes are stopped before loading bundles to provide 'ready to go' spare agents. That might be the cause of the reconcile event you observed in the logs, @markusjura.
@edwardcallahan By stopped, do you mean a manual stop by a user? I've tried to continue the investigation with the provided core log. It turns out that I need that day's log files from all 3 core nodes. I tried to get them by SSHing into selkirk and then onto the respective nodes, but the
@markusjura The nodes were stopped using AWS EC2 controls. Not terminated, but stopped, so that they can be resumed later, if needed, without incurring compute costs. Regarding the logfile rotation, it seems that logrotate is enabled and the logs were rotated away. Sorry. That said, the cluster is back in that state today. I'll assert that we can get it back into this state again with the bundles we have currently loaded.
@edwardcallahan If you can (and want to) reproduce the issue with the current bundles, please do and send me the logs of the failing agent and all core nodes. Otherwise, we can close the issue and open another one once we have reproduced it.
Closing the issue because it cannot be reproduced. |
version 1.2.4, conductr 2.1.1
After simultaneously loading multiple bundles, ie
None of the bundles could be started.
Although agents are present
The agents command cannot be used
Agents logs:
agent6.23.tar.gz