Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK Flight Recorder Extension #37900

Merged
merged 1 commit into from
Jun 3, 2024
Merged

JDK Flight Recorder Extension #37900

merged 1 commit into from
Jun 3, 2024

Conversation

chiroito
Copy link
Member

This PR integrates the JDK Flight Recorder into Quarkus.
It will work with Resteasy Reactive and Opentelemetry.

For blocking REST, it records a single blocking event. This is a common way of recording events in JFR.

For non-blocking REST, this will record Start and End events. This is to avoid overlapping multiple events in a non-blocking process, since multiple requests may be processed at the same time, apparently due to I/O waiting.

Also, the request ID is recorded in these events.
This assumes propagation of the ID using HTTP headers, such as the common X-Request-ID or WebLogic Server's ECID-Context.
If OpenTelemetry is enabled, the Request ID also includes its TraceId and SpanID.

@quarkus-bot quarkus-bot bot added area/dependencies Pull requests that update a dependency file area/documentation labels Dec 22, 2023
@quarkus-bot quarkus-bot bot added this to To do in Quarkus Documentation Dec 22, 2023
@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

@gsmet
Copy link
Member

gsmet commented Dec 22, 2023

Hey. Thanks for this contribution.

I wonder if it would be better to have it as an external extension hosted on the Quarkiverse Hub.

@maxandersen WDYT?

@maxandersen
Copy link
Contributor

Hmmm. Weve talked about how jfr would fit with respect to tracing in past. Not sure where we would put it. Kinda feels like it could make sense to have in core if makes things more consistent/integrated.

Does this link up with opentlemetry span id or something completely different?

Cc @brunobat @cescoffier

@quarkus-bot

This comment has been minimized.

@quarkus-bot quarkus-bot bot added the area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins label Dec 24, 2023
@quarkus-bot

This comment has been minimized.

@chiroito
Copy link
Member Author

I expect this extension to work on its own, or in conjunction with OpenTelemetry.

For systems that already have some kind of tracing in place, the request will be linked to an X-Request-ID or TraceId/SpanId. We analyze the JFR with reference to these IDs when the time of that request becomes long. This will help in analyzing the types of problems that OpenTelemetry cannot solve.

There may also be some systems where tracing is difficult to implement. In such systems, JFR alone can be used to solve performance problems.
In this case we need the ID of the Request for the Reactive type analysis. I could not find anything like RequestId inside Quarkus. Therefore, I am assuming that I would have this ID generated by nginx or something similar that would be in front of Quarkus and then passed on to Quarkus.

The reasons why the Request ID is needed are as follows
When accessing the REST API provided by Quarkus as Blocking, Quarkus records the time from the start of the process to the end as a single JFR event for blocking. In between, if Quarkus records DB and other events, we can see which part of the request is taking a long time.
With Reactive, however, multiple requests are processed simultaneously. Therefore, in a similar way to blocking, JFR events will overlap each other. Therefore, I have implemented Start and End events. In the future, when we expand to record information in DBs, etc., we will also implement Start and End events to record reactives. However, the linkage of this information requires a mechanism to associate events with each other, such as by RequestId.

@quarkus-bot

This comment has been minimized.

Copy link
Contributor

@brunobat brunobat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Contributor

@brunobat brunobat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @chiroito.
We need something like this but we need to discuss it at high level before we merge this.
Questions to answer:

  • Does this belong in core or Quarkiverse? It's not clear now.
  • How does this work from the user side? Can you please elaborate a bit on what the user can configure, where he will see the data and the format of the events?
  • We need documentation.

After this we can go into the details of the PR, which is already very big.

Quarkus Documentation automation moved this from To do to Review pending Jan 3, 2024
@chiroito
Copy link
Member Author

chiroito commented Jan 5, 2024

Thank you for your reply @brunobat .

Users only need to install this Extention and then start JFR. That's it, this will record JAX-RS information to JFR. Users can get the JFR dump file as usual and load it into a tool such as JDK Mission Control.

I understood that the documentation we need is an introduction to these processes. Is this right?

Should I document this below?

https://github.com/quarkusio/quarkus/tree/main/docs/src/main/asciidoc

@brunobat
Copy link
Contributor

brunobat commented Jan 5, 2024

I understood that the documentation we need is an introduction to these processes. Is this right?

Should I document this below?

https://github.com/quarkusio/quarkus/tree/main/docs/src/main/asciidoc

Yes, please name it telemetry-jfr.adoc and follow a similar pattern to what we have in https://github.com/quarkusio/quarkus/blob/main/docs/src/main/asciidoc/opentelemetry.adoc

Independently if this will leave in the core repo, we need an explanation of what it will do, which events will be recorded and where the file will be stored.
An explanation of the event format and links to the relevant JFR spec parts would also be appreciated. All this will help people to use the extension.

I don't see any configuration classes. Would it be possible to at least create a config that disables the JFR data production at runtime?

@brunobat
Copy link
Contributor

brunobat commented Jan 5, 2024

There's also a design problem. I would like very much to integrate this at some point: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library
I wonder if you could take a look at it, @chiroito because I think you are duplicating the work done in there.


There is also existing documentation about the JFR here: https://github.com/roberttoyonaga/quarkus/blob/main/docs/src/main/asciidoc/native-reference.adoc
This would need to be adapted to include the new extension.

@brunobat
Copy link
Contributor

brunobat commented Jan 5, 2024

CC @roberttoyonaga

@roberttoyonaga
Copy link
Contributor

@chiroito

We analyze the JFR with reference to these IDs when the time of that request becomes long.... With Reactive, however, multiple requests are processed simultaneously. Therefore, in a similar way to blocking, JFR events will overlap each other. Therefore, I have implemented Start and End events.

It seems like your goal is to create custom start/end JFR events for requests (containing the requestID). Then you want to use that time window to attribute other JFR events to that requestID? If so, how would this be possible if you have multiple concurrent requests resulting in multiple overlapping time-windows?

@roberttoyonaga
Copy link
Contributor

There's also a design problem. I would like very much to integrate this at some point: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library
I wonder if you could take a look at it, @chiroito because I think you are duplicating the work done in there.

I don't think there will be too much overlap. This PR is introducing new events, while the OTEL runtime metric gatherers only collect some existing JFR data and make it accessible through opentelemetry.

@chiroito
Copy link
Member Author

chiroito commented Jan 9, 2024

@brunobat
I understand, I will write it as telemetry-jfr.adoc, referring to OpenTelemetry.

In addition to that, I will include the following at a minimum

  • How to Activate JFR Extension
  • How to start recording JFR in JVM mode/native mode
  • How to dump JFR
  • How to open dump file by JMC/jfr command
  • How to check events added by the extension
  • Explanation of event formats
  • Link about JVM startup arguments
  • Link about jfr command
  • Link about JMC

I would also like to add a note to the JFR section of the Native Reference Guide.

It is possible to create a config to disable JFR recording on this extension.
However, this can be done in the JFR configuration as well.
I did not add a config to disable it because I thought it would confuse users if there were two ways to set it up.
Should I create a config?

@chiroito
Copy link
Member Author

chiroito commented Jan 9, 2024

@roberttoyonaga
Reactive processing is recorded as a JFR event with a duration of zero at the start and end of processing.
JFR events have standard processing duration information. However, this extension records a different processing duration than that in the end-of-process event.
By implementing JFR in this way, even if multiple requests overlap by non-blocking, the JFR events do not overlap each other.

In order to link these start and end points, some kind of ID is needed.
This ID is also used to link multiple types of events (e.g., REST and DB).
If the caller is another application, this ID is also needed to link to it.

One common way of ID is to store the ID in the HTTP header in the traditional way, and a newer way would be to use OpenTelemetry's TraceId/SpanId.
If these IDs are available, use those IDs.
In the method of storing the ID in the HTTP header, the header key can vary.
I believe the most commonly used for this is X-Request-ID, but in WebLogic Server, for example, it is ECID-Context.
So I allow the user to change which key to use.

However, without these keys, the respective events of the JFR cannot be linked.
Therefore, a better operation would be for nginx or others to generate an X-Request-ID and put it in the header, or to introduce OpenTelemetry.
It would be nice if Quarkus provided an ID for each request, but I have not been able to find anything like that.

@brunobat
Copy link
Contributor

brunobat commented Jan 9, 2024

Thanks very much for the reply, @chiroito. I would prefer to get the correlation IDs from OTel, if the extension is present. That can be activated in the processor class.
An X-Request-ID would be a duplication of the OTel propagation that is already automatically handled. I would only use it if OTel is not there... If the header is not present, a local ID would need to be used. This calls for some sort of strategy in the use/generation of the ids.

It would be also interesting to write a motivation for the PR that will allow reviewers and users to better understand the work scope given that this somehow duplicates some of the work already done in OTel and that there are also ways to generate JFR events that can become available in OTel.

In this case, the created JFR events could be used for fine grained debugging by analysing the JFR stream file. Am I right?

@chiroito
Copy link
Member Author

Thank you for your reply, @brunobat

I completely agree with you in the case of OTel extensions. However, it is true that not all applications that implement tracing use OTel.

The tracing technology options would be

  • Older tracing technologies (e.g. X-Request-ID)
  • W3C Trace Context (e.g. OTel)
  • No tracing

It may appear that the old tracing technology and the W3C Trace Context overlap. I am aware these use different headers, so I think each can store different values. If Quarkus only supports W3C Trace Context, tracing will be lost if Quarkus is deployed on the boundary between the old tracing technology and Trace Context.

Therefore, both tracing technologies should be supported and their respective IDs should be recorded. Currently, OTel does not support the older tracing technology. So, I think this extension should not be completely dependent on OTel. Of course, if OTel is present, we will take advantage of it.

How we should record IDs needs to be considered. If Quarkus does not get any of the IDs, then each record cannot be linked to each other, as mentioned earlier. In this PR, the extension records the old tracing technology ID, W3C TraceId, and SpanID. If none of these are available, Quarkus would be better provided with the ID. And this extension defines different JFR events with and without OTel. For simpler code, it may be possible to make it one event.

Also, the extension should not propagate and should only record those information if it is available. If the Quarkus application uses OTel, then OTel will do the propagation, and if it uses older tracing technology, then the application or REST Server/Client extension should do it.

Since it is JFR, the recorded information can be consumed in a variety of ways, including JMC / jfr commands / JFR Stream API. Of course, it can also be transferred to Grafana using Cryostat. We could watch the JAX-RS response time recorded by this extension, and if it exceeds a certain threshold, we could retrieve all the information associated with the corresponding ID.

@brunobat
Copy link
Contributor

@chiroito This context propagation is a deep subject. With the quarkus-opentelemetry extension, apart from the default W3C headers we support many others. See: https://quarkus.io/guides/opentelemetry#propagators .

I think we shouldn't try to boil the ocean here and try to support everything. We should instead focus on creating the right abstractions.

One way would be to create some interface that abstracts the retrieval of the current span id. Each implementation could have a priority and we can start with a couple of implementations, lets say W3C Trace Context and 'X-Request-ID'.
More implementations can be added with time.

A propagation header detection could be set as a user configuration first and automated later.

What do you think?

@gastaldi
Copy link
Contributor

If it's not too late, for easier maintenance purposes, I'd recommend moving this extension to Quarkiverse.

Hmmm. Weve talked about how jfr would fit with respect to tracing in past. Not sure where we would put it. Kinda feels like it could make sense to have in core if makes things more consistent/integrated.

Quarkiverse extensions are consistent and integrated too, plus you have the freedom to release anytime you feel like, independent of the core releases.

@maxandersen
Copy link
Contributor

@gastaldi i dont think we have conclusion on wether it makes sense to have this in core or not.

@chiroito
Copy link
Member Author

@brunobat
Thanks for the good idea.

In the future, if we create an IdGenerator that uses the ID of the old tracing technology, such as X-Request-ID, we can deploy the Quarkus application on the boundary between the old tracing technology and OTel and still link the tracings correctly.

So how about the following policy?

  • OTel is responsible for propagating tracing information.
  • This extension is not responsible for tracing information propagation, only for recording JFR events.
  • If OTel is present, this extension records the tracing information to JFR together with the tracing information.
  • If OTel is not present, it generates its own ID and records it to JFR together with the ID.

I expect the following as a concrete implementation.

  • First, in this extension, I create an interface which we can get the TraceId and SpanId. Then, I will implement two classes that implement this interface depending on whether OTel is present or not.
  • If OTel is present, the implementation of this interface returns the TraceId and SpanId obtained from the OTel.
  • If OTel is not present, the class that implements this interface creates its own ID and returns it as TraceId. In this implementation, SpanId returns null.

@gastaldi @maxandersen
What information would help us decide whether to put it in this repository or in the Quarkiverse?

@gastaldi
Copy link
Contributor

gastaldi commented Jan 13, 2024

@chiroito here is the documentation about Quarkiverse and how to onboard a new extension: https://hub.quarkiverse.io/

IMHO the best part about having in Quarkiverse is that you can decide who can push to the repository, whereas if that lands in core everything needs to be reviewed and merged by a Quarkus core member, which may not always know the full context and have the necessary know-how to approve it.

@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

@brunobat brunobat requested a review from geoand May 30, 2024 19:59
@MetadataDefinition
@Name("io.quarkus.SpanId")
@Label("Span ID")
@Description("Link events that have the common Span ID")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the OpenJDK project, we have tried to avoid using the word event in labels and descriptions. You might want to do the same.

The rationale is that data may be pushed into a database or visualized in ways where the event concept is no longer visible to the user. I would drop the description, since it is pretty obvious, or keep it more abstract. i.e. "Links spans with the same ID together". Same for the TraceID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

@quarkus-bot
Copy link

quarkus-bot bot commented Jun 2, 2024

Status for workflow Quarkus Documentation CI

This is the status report for running Quarkus Documentation CI on commit 5e0f32f.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

⚠️ There are other workflow runs running, you probably need to wait for their status before merging.

@quarkus-bot
Copy link

quarkus-bot bot commented Jun 2, 2024

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit 5e0f32f.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

⚙️ JVM Tests - JDK 17

📦 integration-tests/reactive-messaging-kafka

io.quarkus.it.kafka.KafkaConnectorTest.testFruits - History

  • Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
	at io.quarkus.it.kafka.KafkaConnectorTest.testFruits(KafkaConnectorTest.java:63)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)

⚙️ JVM Tests - JDK 21

📦 extensions/smallrye-reactive-messaging-kafka/deployment

io.quarkus.smallrye.reactivemessaging.kafka.deployment.dev.KafkaDevServicesDevModeTestCase.sseStream - History

  • Assertion condition defined as a Lambda expression in io.quarkus.smallrye.reactivemessaging.kafka.deployment.dev.KafkaDevServicesDevModeTestCase Expecting size of: [] to be greater than or equal to 2 but was 0 within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: 
Assertion condition defined as a Lambda expression in io.quarkus.smallrye.reactivemessaging.kafka.deployment.dev.KafkaDevServicesDevModeTestCase 
Expecting size of:
  []
to be greater than or equal to 2 but was 0 within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)

@chiroito chiroito requested a review from geoand June 3, 2024 02:02
@geoand geoand removed area/documentation area/dependencies Pull requests that update a dependency file area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins labels Jun 3, 2024
Copy link
Contributor

@geoand geoand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

Quarkus Documentation automation moved this from Review pending to Reviewer approved Jun 3, 2024
@geoand geoand merged commit 531e9e4 into quarkusio:main Jun 3, 2024
55 checks passed
Quarkus Documentation automation moved this from Reviewer approved to Done Jun 3, 2024
@quarkus-bot quarkus-bot bot added this to the 3.12 - main milestone Jun 3, 2024
@gsmet gsmet changed the title JDK Flight Recorder Extention JDK Flight Recorder Extension Jun 26, 2024
@quarkus-bot quarkus-bot bot added area/dependencies Pull requests that update a dependency file area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/documentation labels Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dependencies Pull requests that update a dependency file area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/documentation area/tracing release/noteworthy-feature triage/flaky-test
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

9 participants