Add support for CRaC to the Java Agent #1697

Closed
kford-newrelic opened this issue Jan 18, 2024 · 2 comments · May be fixed by #2250
Labels: 5 Story Point Estimate, feature request, jan-mar qtr

Comments

@kford-newrelic
Contributor

Is your feature request related to a problem? Please describe.

Determine how the Java Agent can support CRaC.

Feature Description

Our customer wants to use CRaC in their environments to reduce warm-up time when scaling. This is very important to them, and they are now finding that they cannot even take a snapshot when the Java Agent is enabled, because of the number of files it holds open.

Their application is Spring Boot based, which now supports CRaC.

Describe Alternatives

N/A

Additional context

Original FR - NR-182805

Priority

Must Have

@kford-newrelic added the feature request label on Jan 18, 2024
@workato-integration

@elucus added the jan-mar qtr label on Dec 12, 2024
@elucus moved this from Triage to Next Quarter Candidates in Java Engineering Board on Dec 13, 2024
@jbedell-newrelic self-assigned this on Jan 6, 2025
@kmudduluru moved this from Next Quarter Candidates to In Quarter in Java Engineering Board on Jan 7, 2025
@kmudduluru moved this from In Quarter to In Sprint in Java Engineering Board on Jan 13, 2025
@kmudduluru added the 8 Story Point Estimate label on Jan 13, 2025
@kmudduluru moved this from In Sprint to In Progress in Java Engineering Board on Jan 22, 2025
@kmudduluru added the 5 Story Point Estimate label and removed the 8 Story Point Estimate label on Jan 27, 2025
@kmudduluru moved this from In Progress to In Quarter in Java Engineering Board on Feb 10, 2025
@jbedell-newrelic
Contributor

Moving this back to the backlog as we've run into some hurdles. I'll describe my findings here.

For background, an essential aspect of CRaC is that no file handles can be open during the checkpointing process. In my testing I ran the spring-petclinic app with and without the agent attached and found that, with the agent attached, numerous file handles were open. I was unable to checkpoint a running JVM with the agent attached, so I never got as far as attempting a restore. My findings here are therefore limited to checkpointing.
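To get a rough picture of which handles a JVM is holding at a given moment, one option is to read the process's own descriptor table. The following is a minimal sketch, assuming a Linux host where `/proc/self/fd` is available; the class name `OpenFds` and its method are hypothetical helpers for illustration, not part of the agent or of CRaC, and not every entry it prints is necessarily one that blocks a checkpoint.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: lists the current process's open file descriptors.
// Assumes Linux; on other platforms it returns an empty list.
final class OpenFds {
    private OpenFds() {}

    /** Returns entries such as "7 -> /path/to/file" for each open descriptor. */
    static List<String> targets() {
        Path fdDir = Path.of("/proc/self/fd");
        List<String> out = new ArrayList<>();
        if (!Files.isDirectory(fdDir)) {
            return out; // no procfs here, nothing to report
        }
        try (Stream<Path> entries = Files.list(fdDir)) {
            entries.forEach(fd -> {
                try {
                    // Each entry is a symlink to the file, socket, or pipe behind the descriptor.
                    out.add(fd.getFileName() + " -> " + Files.readSymbolicLink(fd));
                } catch (IOException e) {
                    // The descriptor was closed while we were listing; skip it.
                }
            });
        } catch (IOException | UncheckedIOException e) {
            // Listing failed part-way; return what was collected so far.
        }
        return out;
    }
}
```

Calling `OpenFds.targets().forEach(System.out::println)` once without the agent and once with it attached gives two lists that can be compared. The listing itself briefly holds one descriptor for the directory, so that entry shows up in the output as well.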

Note: If you start the JVM with the -Djdk.crac.collect-fd-stacktraces=true option, any exception thrown because of an open handle during a checkpoint attempt will include a stack trace showing where the handle was opened. If an open handle has no accompanying stack trace, it was created in native code.

The open file handles I found, and where each one stands:

  • Our .old and .new class files, written before and after weaving. In this method we do not close these files after writing them. Sometimes they appear to get closed on their own, sometimes not. The fix is simply to close the files once we are done with them.
  • Our log file. This was a simple matter of implementing the API interface in the appropriate place to close the log file and re-open it when needed. The first wrinkle is making sure we re-open correctly when the underlying system may have changed and the previous log file is no longer there. The second wrinkle is what, if anything, to do with messages that arrive after we have closed the log file but before checkpointing is complete, especially if checkpointing never finishes successfully and the JVM stays running. I did not explore this second wrinkle.
  • Backend collector connection. Again, a simple matter of implementing the API (perhaps here) to close the connection and re-open it.
  • Log config inside of newrelic.jar. I did not work this one through to a solution. The file is actually opened by Log4J, not by us, during this call. We may be able to work around that, or we may have to involve Log4J. I have not tried to solve this yet because the next problem took priority.
  • Temp instrumentation JARs. During agent premain startup we add several agent-related instrumentation JARs to the bootstrap class loader. The running JVM appears to hold on to those file handles in native code, and we are unable to close them. We have engaged the CRaC team at Azul to consult.
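The log file item can be sketched as a small close-then-reopen wrapper. This is a rough illustration under stated assumptions, not the agent's actual code: `CheckpointAware` is a hypothetical stand-in for the CRaC resource callbacks (the real `org.crac.Resource` methods also take a `Context` argument and may throw), and `CheckpointAwareLog` is an invented name. Buffering messages in memory while the file is closed is just one possible answer to the second wrinkle.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the CRaC callback pair (see org.crac.Resource).
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

// Sketch: a log sink that releases its file handle for the checkpoint.
final class CheckpointAwareLog implements CheckpointAware {
    private final Path path;
    private final List<String> pending = new ArrayList<>(); // messages held while closed
    private Writer writer; // null while the file is closed for a checkpoint

    CheckpointAwareLog(Path path) {
        this.path = path;
        this.writer = open(path);
    }

    private static Writer open(Path p) {
        try {
            // After a restore the directory or file may be gone, so recreate both.
            Path parent = p.toAbsolutePath().getParent();
            if (parent != null && !Files.isDirectory(parent)) {
                Files.createDirectories(parent);
            }
            return Files.newBufferedWriter(p, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void log(String message) {
        if (writer == null) {
            pending.add(message); // file is closed: keep the message in memory
            return;
        }
        write(message);
    }

    private void write(String message) {
        try {
            writer.write(message);
            writer.write(System.lineSeparator());
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public synchronized void beforeCheckpoint() {
        if (writer == null) return;
        try {
            writer.close(); // release the file handle before the checkpoint
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer = null;
    }

    @Override
    public synchronized void afterRestore() {
        if (writer != null) return;
        writer = open(path);
        for (String m : pending) write(m); // replay what arrived while closed
        pending.clear();
    }

    synchronized boolean isOpen() { return writer != null; }
    synchronized int pendingCount() { return pending.size(); }
}
```

With the real API, the class would implement `org.crac.Resource` and be registered through `Core.getGlobalContext().register(...)`. The sketch keeps buffering until `afterRestore` is called and the buffer is unbounded, so what to do when a checkpoint never completes remains as open here as it is in the findings above.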

@kmudduluru moved this from In Quarter to In Sprint in Java Engineering Board on Feb 18, 2025
@jbedell-newrelic moved this from In Sprint to In Progress in Java Engineering Board on Feb 25, 2025
@jbedell-newrelic moved this from In Progress to Needs Review in Java Engineering Board on Mar 3, 2025
@meiao linked a pull request on Mar 3, 2025 that will close this issue
@deleonenriqueta moved this from Needs Review to Code Complete/Done in Java Engineering Board on Mar 4, 2025
@deleonenriqueta closed this as completed by moving to Code Complete/Done in Java Engineering Board on Mar 4, 2025
@deleonenriqueta moved this from Code Complete/Done to Needs Review in Java Engineering Board on Mar 4, 2025
Projects
Status: Needs Review

5 participants