Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientAbortException from StreamingResponseBody can cause infinite loop of dispatches #32042

Closed
straubp opened this issue Jan 16, 2024 · 13 comments
Assignees
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Milestone

Comments

@straubp
Copy link

straubp commented Jan 16, 2024

Spring Boot 3.2.0
Java 17.0.8.1

Client aborting a StreamingResponseBody download in combination with an ExceptionHandler can cause an infinite loop of these exceptions (once every second):

java.lang.NullPointerException: Cannot invoke "org.apache.catalina.core.AsyncContextImpl.timeout()" because "asyncConImpl" is null
	at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:135) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:243) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:57) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]

There seems to be some sort of race condition as this doesn't happen if you "wait" briefly after catching the ClientAbortException.

What should happen (or actually happens when we wait)
ClientAbortException -> "clean up ot the request" -> request ends

What actually happens (when we don't wait) is something like this:
ClientAbortException -> "clean up" starts -> ExceptionHandler get's called which forwards to the original URL again -> new request comes in -> "clean up" ends (=> asyncConImpl is set to null) -> request never ends -> timeout -> NPEs

Why is the ExceptionHandler called at all? Does this make sense when we use a StreamingResponseBody and encounter a ClientAbortEexception?

Steps to reproduce:

Code:

package com.example.coyote;

import org.apache.catalina.connector.ClientAbortException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpHeaders;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;

import java.nio.charset.StandardCharsets;

@SpringBootApplication
@RestController
@RestControllerAdvice
@Configuration
public class CoyoteApplication {

    private static final Logger LOG = LoggerFactory.getLogger(CoyoteApplication.class);

    public static void main(String[] args) {
        SpringApplication.run(CoyoteApplication.class, args);
    }

    @GetMapping(value = "/download")
    public ResponseEntity<StreamingResponseBody> download(@RequestParam(value = "wait", defaultValue = "false") boolean wait) {

        LOG.info("wait: {}", wait);

        var data = "test".getBytes(StandardCharsets.UTF_8);
        var times = 10;

        StreamingResponseBody responseBody = outputStream -> {
            LOG.info("start download");
            try {
                for (int i = 0; i < times; i++) {
                    outputStream.write(data);
                    outputStream.flush();
                    waitMillis(1000);
                }
            } catch (Exception e) {

                // this might already take too long to reproduce the behavior (see comment below)
                LOG.error("error");

                // if we rethrow the exception fast enough, the exception handler is called,
                // and we end up in a very strange state
                // (just wait 20 seconds - spring.mvc.async.request-timeout - to see it in the logs)
                // note: if not happening, remove the error log from above, as it might already take too much time
                if (!wait) {
                    throw e;
                }

                // if we wait, the exception handler is not called and everything is fine
                var before = Thread.currentThread().isInterrupted() ? "interrupted" : "running";
                waitMillis(100);
                var after = Thread.currentThread().isInterrupted() ? "interrupted" : "running";
                LOG.info("waited: {} -> {}", before, after);

                throw e;
            }
            LOG.info("finish download");
        };

        return ResponseEntity.ok()
                .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=file.txt")
                .header(HttpHeaders.CONTENT_LENGTH, String.valueOf(data.length * times))
                .header(HttpHeaders.CONTENT_TYPE, "plain/txt")
                .body(responseBody);
    }

    @ExceptionHandler({ClientAbortException.class})
    public void handleClientAbortException() {
        LOG.info("CLIENT ABORT");
    }

    private static void waitMillis(long millis) {
        long now = System.currentTimeMillis();
        while (System.currentTimeMillis() - now < millis) ;
    }
}

application.properties:

spring.mvc.async.request-timeout=20000ms

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>3.2.0</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>com.example</groupId>
	<artifactId>coyote</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>coyote</name>
	<description>Demo project for Spring Boot</description>
	<properties>
		<java.version>17</java.version>
	</properties>
	<dependencies>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
				<configuration>
					<image>
						<builder>paketobuildpacks/builder-jammy-base:latest</builder>
					</image>
				</configuration>
			</plugin>
		</plugins>
	</build>

</project>

Logs when not waiting:

2024-01-02T10:37:32.455+01:00  INFO 12788 --- [nio-8080-exec-1] com.example.coyote.CoyoteApplication     : wait: false
2024-01-02T10:37:32.469+01:00  INFO 12788 --- [         task-1] com.example.coyote.CoyoteApplication     : start download
2024-01-02T10:37:36.471+01:00 ERROR 12788 --- [         task-1] com.example.coyote.CoyoteApplication     : error
2024-01-02T10:37:36.481+01:00  INFO 12788 --- [nio-8080-exec-2] com.example.coyote.CoyoteApplication     : CLIENT ABORT
2024-01-02T10:37:36.484+01:00  INFO 12788 --- [nio-8080-exec-2] com.example.coyote.CoyoteApplication     : wait: false
2024-01-02T10:37:36.485+01:00  INFO 12788 --- [         task-2] com.example.coyote.CoyoteApplication     : start download
2024-01-02T10:37:36.485+01:00 ERROR 12788 --- [         task-2] com.example.coyote.CoyoteApplication     : error
2024-01-02T10:37:57.090+01:00 ERROR 12788 --- [nio-8080-exec-3] o.a.catalina.connector.CoyoteAdapter     : Exception while processing an asynchronous request

java.lang.NullPointerException: Cannot invoke "org.apache.catalina.core.AsyncContextImpl.timeout()" because "asyncConImpl" is null
	at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:135) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:243) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:57) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]

2024-01-02T10:37:58.100+01:00 ERROR 12788 --- [nio-8080-exec-4] o.a.catalina.connector.CoyoteAdapter     : Exception while processing an asynchronous request

java.lang.NullPointerException: Cannot invoke "org.apache.catalina.core.AsyncContextImpl.timeout()" because "asyncConImpl" is null
	at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:135) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:243) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:57) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-10.1.16.jar:10.1.16]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]

2024-01-02T10:37:59.094+01:00 ERROR 12788 --- [nio-8080-exec-5] o.a.catalina.connector.CoyoteAdapter     : Exception while processing an asynchronous request
...

Logs when waiting:

2024-01-02T10:41:04.131+01:00  INFO 7688 --- [nio-8080-exec-1] com.example.coyote.CoyoteApplication     : wait: true
2024-01-02T10:41:04.145+01:00  INFO 7688 --- [         task-1] com.example.coyote.CoyoteApplication     : start download
2024-01-02T10:41:08.146+01:00 ERROR 7688 --- [         task-1] com.example.coyote.CoyoteApplication     : error
2024-01-02T10:41:08.247+01:00  INFO 7688 --- [         task-1] com.example.coyote.CoyoteApplication     : waited: running -> interrupted

(Issue originally posted here spring-projects/spring-boot#38950)

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Jan 16, 2024
@jhoeller jhoeller added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Jan 17, 2024
@KazzmanK
Copy link

We have same issues with infinite loop of exceptions with StreamingResponse.
Also, if there are more root issues, there will be more exceptions at 1 second rate. On our test site it looks like
image

But more investigation bring us to conclusion what added extra waiting does not help in our case.
Also, going back to Boot 3.1.3 helps, but to Boot 3.1.7 - exception at 1 second rate

@rstoyanchev rstoyanchev self-assigned this Jan 23, 2024
@rstoyanchev
Copy link
Contributor

rstoyanchev commented Jan 23, 2024

Thanks for the report. I am unable to reproduce the behavior, and therefore cannot narrow the exact cause, but I have an idea of what's going on.

For any async request, Spring MVC saves the result in WebAsyncManager#concurrentResult and performs an ASYNC dispatch to get back to a Servlet container thread to complete processing. If the result was an exception, then we do exception handling as for any request.

StreamingResponseBody is a bit different as an async handler in that it is more low level and writes directly to the response. When an IO error occurs, Spring MVC handles it as any other, but Tomcat is also aware of it and notifies AsyncListeners of which we have one in order to detect and handle IO errors. So there is a race between two threads trying to do exactly the same, but the method they call setConcurrentResultAndDispatch is protected so only one should perform an async dispatch.

What I think happens is, in some cases the controller thread performs the async dispatch quicker. After that when the Tomcat thread is done calling AsyncListeners, and if none did anything to complete the async request, it tries to take further action explained by Mark in this comment. However, I can't confirm what exactly happens at that point (an error dispatch, or something unexpected due to the ASYNC dispatch that already completed) because I can't reproduce the issue.

It's possible there may be room for Tomcat to handle this better, avoiding the NullPointerException and the continuous loop of dispatches, /cc @markt-asf.

That said I can make an improvement on our side so that an exception implies a disconnected client, we don't attempt to handle it any further, knowing the Servlet container will be doing the same. I'll let you know when there is a snapshot to try the change with in order to confirm whether it helps or if something more is needed.

@rstoyanchev rstoyanchev added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Jan 23, 2024
@rstoyanchev rstoyanchev added this to the 6.1.4 milestone Jan 23, 2024
@markt-asf
Copy link

I'll see if I can re-create this as well.

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Jan 23, 2024

@straubp and @KazzmanK please give 6.1.4-SNAPSHOT a try to see what the behavior is now after this commit 9230a7d.

@KazzmanK
Copy link

Is it sufficient dependency? Couse boot 3.3.0-snapshot points to spring 6.1.3, and needs overrides.

[INFO] +- org.springframework:spring-web:jar:6.1.4-SNAPSHOT:compile
[INFO] |  +- org.springframework:spring-beans:jar:6.1.3:compile
[INFO] |  \- org.springframework:spring-core:jar:6.1.3:compile

@straubp
Copy link
Author

straubp commented Jan 24, 2024

@rstoyanchev I can confirm that the fix is working. Thank you! :-)

Using this in my POM fixed the issue:

	<dependencyManagement>
		<dependencies>
			<dependency>
				<groupId>org.springframework</groupId>
				<artifactId>spring-web</artifactId>
				<version>6.1.4-SNAPSHOT</version>
			</dependency>
		</dependencies>
	</dependencyManagement>

With the previous version the issue still exists:

	<dependencyManagement>
		<dependencies>
			<dependency>
				<groupId>org.springframework</groupId>
				<artifactId>spring-web</artifactId>
				<version>6.1.3</version>
			</dependency>
		</dependencies>
	</dependencyManagement>

Logs using the fixed version:

2024-01-24T08:39:00.413+01:00  INFO 15828 --- [nio-8080-exec-1] com.example.coyote.CoyoteApplication     : wait: false
2024-01-24T08:39:00.429+01:00  INFO 15828 --- [         task-1] com.example.coyote.CoyoteApplication     : start download
2024-01-24T08:39:04.431+01:00 ERROR 15828 --- [         task-1] com.example.coyote.CoyoteApplication     : error
2024-01-24T08:39:04.444+01:00  INFO 15828 --- [nio-8080-exec-2] com.example.coyote.CoyoteApplication     : CLIENT ABORT

@rstoyanchev
Copy link
Contributor

Thanks for confirming @straubp.

@rstoyanchev rstoyanchev changed the title Aborting a StreamingResponseBody download can cause an infinite loop ClientAbortException from StreamingResponseBody can cause infinite loop of dispatches Jan 24, 2024
@markt-asf
Copy link

I am able to reproduce this issue sometimes and I have identified what is happening internally in Tomcat.

I agree that it is the concurrent error handling by both the app (on a non-container thread) and the container (on a container thread) where things start to go wrong.

Note that there is work ongoing in the Servlet project to clarify expectations around error handling but I don't think it is going to help in this case.

I'm currently looking at possible fixes to Tomcat to better handle the results of concurrent error handling but I did wonder about the following change at the framework level:
If the framework intends to respond to an error in an non-container thread and that response includes calling complete() or dispatch() then the framework should not allow AsyncListener.onError() (which will be called from a different, container thread) to return to the container until complete() or dispatch() has been called in the non-container thread.

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Jan 24, 2024

Good that you were able to reproduce it.

If the framework intends to respond to an error in an non-container thread and that response includes calling complete() or dispatch() then the framework should not allow AsyncListener.onError() (which will be called from a different, container thread) to return to the container until complete() or dispatch() has been called in the non-container thread.

The case of a controller method returning StreamingResponseBody is a bit unique as mentioned earlier as it's doing low level streaming directly to the response, and propagates an exception that Tomcat is also aware of. In the typical case, however, we would get a result value from async processing first, then we would async dispatch and write to the response from there as we would for regular requests.

More generally though, regardless of how a controller method completes async handling, i.e. with success (producing a result value), or with an error of its own, there is an inherent race between us trying to then perform an async dispatch to continue and finish processing, and a potential error notification coming from the container side via AsyncListener.

We could introduce something to guard against this so that if the async dispatch gets there first, it is allowed to dispatch before our AsyncListener#onError returns. Or otherwise, if the onError gets there first, we don't proceed with the async dispatch.

That said I'm wondering if the Servlet container could help prevent this race from its side, or otherwise any other Servlet application (not using our support) would also need to guard against this. If I recall there is a similar rule for when an async dispatch occurs before the REQUEST dispatch that started async processing has returned, but I'm not sure if that is comparable or not.

@markt-asf
Copy link

I've added some protection against the original error (will be in the February release round for all currently supported Tomcat versions) and I am looking to see if there is anything I can do more generally.

@markt-asf
Copy link

I've applied a more general fix. It too will be in the February release round. The fix prevents application threads from calling methods on the AsyncContext once AsyncListener.onError() has returned to the container.

@rstoyanchev
Copy link
Contributor

Thanks for the changes, @markt-asf.

rstoyanchev added a commit that referenced this issue Feb 29, 2024
In preparation for a larger update, start by aligning with
6.1.x, which includes changes for gh-32042 and gh-30232.

See gh-32342
rstoyanchev added a commit that referenced this issue Feb 29, 2024
In preparation for a larger update, start by aligning with
6.1.x, which includes changes for gh-32042 and gh-30232.

See gh-32341
@KazzmanK
Copy link

KazzmanK commented Mar 1, 2024

@straubp and @KazzmanK please give 6.1.4-SNAPSHOT a try to see what the behavior is now after this commit 9230a7d.

As of recently released spring-boot 3.2.3 based on spring 6.1.4 issue looks resolved.

rstoyanchev added a commit that referenced this issue Mar 3, 2024
Dispatching was prevented for disconnected client errors after
recent reports like #32042 when running on Tomcat, and the
async request was completed from the onError notification.
This has the side effect of not allowing exception resolvers
to take final action even if the response is not writeable.

After all updates for this issue, it appears the dispatch no
longer causes issues. Tomcat actually does not do the dispatch,
but it doesn't seem to cause any issues, and on other servers
like Jetty where the dispatch works, applications can have a
chance to handle the exception.

This change removes the disconnected client checks and allows
dispatching again. After the change DisconnectedClientHelper
is no longer needed in the 5.3.x branch.

See gh-32342
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

6 participants