Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Managed executor thread has the AppClassLoader on HttpClient dependent task #41311

Closed
antoniomacri opened this issue Jun 19, 2024 · 20 comments
Closed
Labels
kind/question Further information is requested

Comments

@antoniomacri
Copy link

Describe the bug

I'm passing the ManagedExecutor to the Java HttpClient in order for it to run its tasks. However, when I set a "small" CPU limit on the docker container (via docker run or k8s CPU limit) in dependent tasks the contextual classloader is the JVM jdk.internal.loader.ClassLoaders$AppClassLoader. As a consequence, I have ClassNotFoundExceptions.

Expected behavior

The classloader on threads running dependent tasks should be the RunnerClassLoader from Quarkus.

Actual behavior

The classloader on threads running dependent tasks is the JVM jdk.internal.loader.ClassLoaders$AppClassLoader.

How to Reproduce?

Clone https://github.com/antoniomacri/quarkus-reproducer-http-client-classloader.git

Then run

./build-image.sh && ./run-image.sh

On another shell run:

curl http://localhost:8080/test

The expected output should be:

io.quarkus.bootstrap.runner.RunnerClassLoader

but instead it is:

jdk.internal.loader.ClassLoaders$AppClassLoader

If I set --cpus=2.1 or greater, the classloader is correct.

Output of uname -a or ver

No response

Output of java -version

No response

Quarkus version or git rev

3.11.2

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

Running MacOS on M2 Pro with 12 cores

@antoniomacri antoniomacri added the kind/bug Something isn't working label Jun 19, 2024
@antoniomacri
Copy link
Author

antoniomacri commented Jun 20, 2024

Setting -XX:ActiveProcessorCount=N, instead of running in a docker container with cpu limit, produces the same results.

@geoand
Copy link
Contributor

geoand commented Jun 21, 2024

Thanks for reporting.

I believe this is not a Quarkus issue

@antoniomacri
Copy link
Author

Hi @geoand. Could you expand on that?

@geoand
Copy link
Contributor

geoand commented Jun 22, 2024

We don't do anything specific for either the HTTP client or respond specifically to container limits.

I would suggest trying to reproduce something similar (essentially the executor not being taken into account) with plain Java

@antoniomacri
Copy link
Author

I may be wrong, but the executor seems OK to me. From this code:

    @PostConstruct
    void createHttpClient() {
        this.httpClient = HttpClient.newBuilder()
                .executor(managedExecutor)
                .version(HttpClient.Version.HTTP_2)
                .build();

        managedExecutor.supplyAsync(() -> "").thenApplyAsync(response -> {
            Log.infof("PostConstruct thread=%s", Thread.currentThread().hashCode());
            Log.infof("PostConstruct classLoaderAfter=%s", Thread.currentThread().getContextClassLoader());
            return Thread.currentThread().getContextClassLoader().getClass().getName();
        }, managedExecutor);
    }


    @GET
    @Produces({MediaType.TEXT_PLAIN})
    public CompletableFuture<String> test() {
        URI uri = URI.create("http://www.columbia.edu/~fdc/sample.html");
        HttpRequest request = HttpRequest.newBuilder().GET().uri(uri).build();

        Log.infof("classLoaderBefore=%s", Thread.currentThread().getContextClassLoader());
        return httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString()).thenApplyAsync(response -> {
            Log.infof("thread=%s", Thread.currentThread().hashCode());
            Log.infof("classLoaderAfter=%s", Thread.currentThread().getContextClassLoader());
            return Thread.currentThread().getContextClassLoader().getClass().getName();
        }, managedExecutor);
    }

I get this log:

2024-06-23 08:46:15,765 INFO  [ClassloadingResource] (executor-thread-1) PostConstruct thread=10217343
2024-06-23 08:46:15,765 INFO  [ClassloadingResource] (executor-thread-1) PostConstruct classLoaderAfter=io.quarkus.bootstrap.runner.RunnerClassLoader@1
2024-06-23 08:46:15,766 INFO  [ClassloadingResource] (vert.x-eventloop-thread-1) classLoaderBefore=io.quarkus.bootstrap.runner.RunnerClassLoader@1
2024-06-23 08:46:16,093 INFO  [ClassloadingResource] (executor-thread-2) thread=1224931749
2024-06-23 08:46:16,094 INFO  [ClassloadingResource] (executor-thread-2) classLoaderAfter=jdk.internal.loader.ClassLoaders$AppClassLoader@3d71d552

The thread appears to be from the managed executor.

The problem seems related to the classloader not being set/propagated on thread context.

Correct?

@dmlloyd
Copy link
Member

dmlloyd commented Jun 24, 2024

This does seem like a real bug to me. Our thread pools should always be propagating the correct class loader, so it's odd that it's being erased somehow.

@geoand
Copy link
Contributor

geoand commented Jun 25, 2024

cc @manovotn for the ManagedExecutor part

@manovotn
Copy link
Contributor

cc @manovotn for the ManagedExecutor part

Looking at this part:

        return httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString()).thenApplyAsync(response -> {
            Log.infof("thread=%s", Thread.currentThread().hashCode());
            Log.infof("classLoaderAfter=%s", Thread.currentThread().getContextClassLoader());
            return Thread.currentThread().getContextClassLoader().getClass().getName();
        }, managedExecutor);

The ManagedExecutor will capture contexts from the thread where it is invoked.
You firstly call httpClient.sendAsync and IIUIC, this goes outside of Quarkus and can use different TCCL. The ManagedExecutor then just captures whatever was used and keeps using it within the subsequent thenApplyAsync call.

Now, the above is what I think happens, I need to get a closer look this afternoon to verify that claim.

@antoniomacri
Copy link
Author

Thanks @manovotn. I set the .executor(managedExecutor) on the HttpClient builder, so it should already have the correct TCCL, I think.

@manovotn
Copy link
Contributor

Thanks @manovotn. I set the .executor(managedExecutor) on the HttpClient builder, so it should already have the correct TCCL, I think.

Yes, I noticed that. I am looking deeper into it just now.
What's off is this - below is a string representation of the ManagedExecutor you're getting and it includes the list of contexts it is propagating (which by default should be all that it finds providers for):

io.smallrye.context.SmallRyeManagedExecutor, with maxAsync: -1, with maxQueued: -1, with cleared contexts: [], with propagated contexts: [io.quarkus.arc.runtime.context.ArcContextProvider@34e1e537], with unchanged contexts: [],  with injection point name: no-ip

This looks like only ArcContextProvider is present and so only CDI context would be propagated and not the TCCL.

FTR, this section of docs describes what contexts are (conditionally) present in Quarkus. The one handling TCCL should be the Application context.
I am not yet sure why it isn't present...

@manovotn
Copy link
Contributor

I am not yet sure why it isn't present...

Ok, so normally SR CP has a dedicated module that performs this (here) but Quarkus doesn't seem to use this at all, so I assume our thread pool does this elsewhere automagically.
@dmlloyd out of curiosity, any pointers as to where our thread pools handle the tccl setup?

Anyhow, I did try adding the SR module manually and even with that in place the same issue appears - propagation actually happens but at some point the TCCL underneath changes and the propagation carries that information forward.
In other words, this doesn't seem like the problem of propagation or of ManagedExecutor.

I cannot put my finger on the exact spot which causes the issue but it seems to be originating inside HttpClientImpl.
It appears their (async) tasks are handled via "selector manager thread" (see HttpClientImpl$SelectorManager JDK class) which then delegates to the user provided executor - however, the context capture happens at this point and you therefore read the TCCL from their thread instead of the original one in your reproducer class.

I won't pretend I understand why is this happening and especially why it isn't happening when you don't set the CPU limits so ideas are welcome 🤔

@dmlloyd
Copy link
Member

dmlloyd commented Jun 25, 2024

I don't know about how any of the higher-level frameworks do it, but at the lowest level the thread pool (EnhancedQueueExecutor) captures the TCCL of the calling thread at the time that Executor#execute (or any other task-submit method) is called. So if it's wrong at one call site, the "wrongness" could spread through task propagation as those tasks in turn execute more tasks. If it originates in the JDK, that would explain a lot.

@manovotn
Copy link
Contributor

I see.
That corresponds to what the SR impl does just at a lower level.

The biggest mystery here is that the reproducer actually works if you increase the --cpus=2.1 option with docker...

@dmlloyd
Copy link
Member

dmlloyd commented Jun 25, 2024

Poking around in the JDK, it looks like it only uses the delegate executor sometimes. Sometimes you're running from the selector manager thread which doesn't necessarily do anything with TCCL and is controlled completely within the HttpClient implementation. I guess the problem likely originates here. Maybe having more threads available increases the likelihood of the source "infecting" task running on a safe thread pool thread instead of the selector thread?

It doesn't look like there's any way to access that thread, in any event, other than through a subscription to a socket flow event (like read- or write-ready events).

@manovotn
Copy link
Contributor

Poking around in the JDK, it looks like it only uses the delegate executor sometimes. Sometimes you're running from the selector manager thread which doesn't necessarily do anything with TCCL and is controlled completely within the HttpClient implementation. I guess the problem likely originates here. Maybe having more threads available increases the likelihood of the source "infecting" task running on a safe thread pool thread instead of the selector thread?

Ok that seems to confirm my earlier findings, thanks for looking into it as well!

It doesn't look like there's any way to access that thread, in any event, other than through a subscription to a socket flow event (like read- or write-ready events).

I think there is no guarantee that the executor you pass to the HttpClient will "just execute the task" on it - it has no notion of possible context propagation we are trying to do and hence can affect the result in certain ways, TCCL included 🤷

@geoand
Copy link
Contributor

geoand commented Jun 26, 2024

Thanks for reporting.

I believe this is not a Quarkus issue

In addition to the findings of Matej and David, another reason I still believe this is that we've never encountered such an issue with any Quarkus managed component

@dmlloyd
Copy link
Member

dmlloyd commented Jun 26, 2024

A workaround could be to wrap the injected ManagedExecutor with your own executor that simply initializes the TCCL before submitting the task.

@antoniomacri
Copy link
Author

Poking around in the JDK, it looks like it only uses the delegate executor sometimes. Sometimes you're running from the selector manager thread which doesn't necessarily do anything with TCCL and is controlled completely within the HttpClient implementation. I guess the problem likely originates here. Maybe having more threads available increases the likelihood of the source "infecting" task running on a safe thread pool thread instead of the selector thread?

By "selector thread" you mean the thread that is awoken when the socket receives data? Any pointer to the source code?

However it seems deterministic. I've never encountered the issue with -XX:ActiveProcessorCount=3 but always had it with -XX:ActiveProcessorCount=2.

A workaround could be to wrap the injected ManagedExecutor with your own executor that simply initializes the TCCL before submitting the task.

Something like this?

private static Executor createContextualExecutor(Executor innerExecutor) {
    final ClassLoader classLoaderToPropagate = Thread.currentThread().getContextClassLoader();
    return command -> {
        Thread currentThread = Thread.currentThread();
        ClassLoader old = currentThread.getContextClassLoader();
        currentThread.setContextClassLoader(classLoaderToPropagate);
        try {
            innerExecutor.execute(command);
        } finally {
            currentThread.setContextClassLoader(old);
        }
    };
}

Thanks!

@manovotn
Copy link
Contributor

manovotn commented Jun 28, 2024

By "selector thread" you mean the thread that is awoken when the socket receives data? Any pointer to the source code?

This is code directly in the JDK's HttpClientImpl
Take a look at the private static final class SelectorManager defined inside jdk.internal.net.http.HttpClientImpl.

However it seems deterministic. I've never encountered the issue with -XX:ActiveProcessorCount=3 but always had it with -XX:ActiveProcessorCount=2.

Yes, the reproducer works well in that regard (thanks for taking the time to make it!).
What David means, or at least what I observed as well, is that the selector thread is not unconditionally used but the choice is outside of our reach - JDK internals seem to make that decision, maybe with the processor limitation in mind 🤷

Something like this?

Yes, that should work

@manovotn manovotn added kind/question Further information is requested and removed kind/bug Something isn't working labels Jul 4, 2024
@manovotn
Copy link
Contributor

manovotn commented Jul 4, 2024

Closing this issue as I don't think we can do more here.
Feel free to reopen if you feel otherwise.

@manovotn manovotn closed this as completed Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants