Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swoole:5.1.2-php8.3 - Maximum call stack size of 8339456 bytes (zend.max_allowed_stack_size - zend.reserved_stack_size) reached. #5347

Closed
ValiDrv opened this issue May 30, 2024 · 4 comments
Labels

Comments

@ValiDrv
Copy link

ValiDrv commented May 30, 2024

Short version

swoole:5.1.2 and php8.3 gives randomly Maximum call stack ... reached. Infinite recursion warnings and sometimes even hangs when using dispatch_func.

  • Usually this happens onRequest after a restart, or randomly a few days after the server has been running.
  • It randomly fixes itself with no interaction after X random time.
  • It can be fixed, or caused, if the swoole server is restarted.
  • It sometimes causes high latency, and sometimes stops responding to any requests, on one or multiple threads.
  • Proxies in-front of the Swoole server respond with a 504 Gateway Timeout Error error (no data from backend)
  • Usually, but not always, 10k requests to the proxy will create 10k proxy logs and only 9999 Swoole access logs (last one is a 504 error for the proxy)

It might be hard to replicate, sometimes I get it 50 times per day, sometimes once in 4 hours... but I included a stripped down version of the code that I could replicate the issue with.

Note: It's not a hardware capacity issue, since Swoole can serve 70k RPS on our hardware and this issue happens with 20rps, 5k rps or 50k rps. Servers are Ubuntu 22.04.3 LTS, and can handle more than 1mill connections, multiple IPs, ip_local_port_range, ulimit, somaxconn, etc.

The only difference is PHP 8.3 vs PHP 8.2 and the Swoole version.

Long version

Please answer these questions before submitting your issue.

  1. What did you do? If possible, provide a simple script for reproducing the error.

There are some really strange/weird and intermittent issues with Swoole 5.1.2 and PHP 8.3 when using dispatch_func, as explained after the code.

cat docker-compose.yml

version: '3.8'
services:
  swoole:
    image: phpswoole/swoole:5.1.2-php8.3
      #network_mode: host
    ports:
    - '13091:13091'
    environment:
      AUTORELOAD_PROGRAMS: "swoole"
      AUTORELOAD_ANY_FILES: 1
      BOOT_MODE: TASK
    restart: unless-stopped
    volumes:
      - .:/var/www

cat server.php

<?php

declare(strict_types=1);

use Swoole\Http\Server as Server;
use Swoole\Http\Request as Request;
use Swoole\Http\Response as Response;

class myServer
{
    public Server $server;

    public function __construct(string $listen_ip = '0.0.0.0', $listen_port = 13091, int $worker_num = 4) {
        Swoole\Runtime::enableCoroutine(SWOOLE_HOOK_ALL | SWOOLE_HOOK_NATIVE_CURL);
        $http = new Server($listen_ip, $listen_port, SWOOLE_PROCESS);


        $http->on('request', fn(Request $request, Response $response) => $this->onRequest($request, $response, $http));

	# Setting
        $http->set(
            [
                # LOG
                'daemonize' => false,    # Must be FALSE, else can't kill workers in DEV...
                'log_file' => null,      # Only works if daemonize is true, so instead we use the STDOUT/STDERR
                'log_level' => SWOOLE_LOG_ERROR,
                'log_rotation' => SWOOLE_LOG_ROTATION_SINGLE,

                'tcp_fastopen' => true,     # This feature can improve the response speed of TCP short connections by sending data with the SYN packet when the client completes the third step of the handshake.
                'tcp_user_timeout' => 10 * 1000, # The TCP_USER_TIMEOUT option is a socket option at the TCP layer, with a value indicating the maximum time in milliseconds to wait for an ACK confirmation after a data packet is sent. Please refer to the man page for details.

                # GZIP 1-9, higher uses more CPU
                'http_gzip_level' => 7,
                'http_compression' => true,
                # Options
                'max_connection' => 1000000,                # Accept connections
                'buffer_output_size' => 32 * 1024 * 1024,   # 32M (byte) can't send files larger than this.
                'socket_buffer_size' => 32 * 1024 * 1024,   # Set the socket buffer size
                 'max_queued_bytes' => 32 * 1024 * 1024,   # Set the maximum queue length of the received buffer. If exceeded, stop receiving.
                'max_conn' => 32768,                        # the max tcp connection number of the server
                'backlog' => 327680,                            # How many connections waiting for accept can be handled simultaneously (net.core.somaxconn=32768 & net.ipv4.tcp_max_syn_backlog=32768). Note: `netstat -s | grep 'times the listen queue of a socket overflowed'`
                # 'max_request' => 10000,                     # Restart workers after X requests
                'open_cpu_affinity' => true,                # Stick to CPU core (helps with cpu cache)
                # 'discard_timeout_request' => true,          # Discard worker received data after closing connection. NOTE: might affect web workers.

                # Max coroutines
                "max_coro_num" => 1000000,                  # Total coroutines, default 3000
                "max_coroutine" => 1000000,                  # Total coroutines, default 3000

                # Threads
                'worker_num' => $worker_num,
                'task_worker_num' => 0,
                'task_enable_coroutine' => true,    # Add coroutines for task workers, so they don't block
                'enable_coroutine' => true,           # Automatically add coroutine to onRequest
                'reactor_num' => 4,  # Number of events IO threads, normally number of CPU cores, but at one we don't lock between threads

                'enable_reuse_port' => true,    # Enabling port reuse allows for starting multiple Server programs listening on the same port.
                # Keep-Alive connections.
                'open_tcp_nodelay' => true,
                'tcp_keepidle' => 600,
                'tcp_keepinterval' => 60,
		'tcp_keepcount' => 1000,
		# Custom dispatch function
		'dispatch_func' => $this->dispatch_func(...),
            ]);


        # Need to access
        $this->server = $http;

        # Last this is to start it
	echo $this->server->start() ? "Server started" : "Could not start server";
    }

    public function onRequest(Request $swoole_request, Response $swoole_response, Server $swoole_server): void {
	if($swoole_request->server['request_uri'] !== '/ping') {
	    echo getmypid() . " onRequest({$swoole_server->getWorkerId()}): {$swoole_request->server['request_uri']}\n";
	    Co::sleep(2);
        }
        $swoole_response->status(200);
        $swoole_response->end('OK');
    }

   # Used to return different worker ids based on the URL and whatnot.
    public function dispatch_func(Server $swoole_server, int $fd, int $type, string $data = null): int {

#	    if ($type !== 0 || is_null($data)) {
#		    echo "drop: $fd";
            # Use worker ID#0 to open/close the connection (does not matter what worker we get, but we don't have the data to select a worker...)
            # - 3: indicating that the client connection is established
            # - 4: indicating that the client connection is closed
#            return 0;
#	}

	    preg_match("~\A(\S+)\s(\S+)\s~", $data ?? '', $m);
	    $method = $m[1] ?? 'GET';
	    $url = $m[2] ?? '';
	    
	    return $fd % $swoole_server->setting['worker_num'];
    }
}


echo "\n----------------------\n";
echo `php --ri swoole`;
echo "\n----------------------\n";
echo `php --version`;
echo "\n----------------------\n";
echo `gcc --version`;
echo "\n----------------------\n";

echo "\n\n -== SERVER RESTART ==- \n\n";
$r = new myServer();

cat makefile

# Hide Entering/Leaving directory
MAKEFLAGS += --no-print-directory
# If a file exists with the make target, still run the target
MAKEFLAGS += --always-make

RUN_DOCKER_COMPOSE := CURDIR=$(CURDIR) HOSTNAME=$(shell hostname) docker-compose --compatibility -p swoole-test -f ./docker-compose.yml

logs:
	@$(RUN_DOCKER_COMPOSE) logs -t -f --tail=100


##
# Stop and remove passed in ${SERVICES} exited containers
down:
	# Stopping and removing service containers...
	$(RUN_DOCKER_COMPOSE) down -v --remove-orphans || true

##
# Build the passed in ${SERVICES}
build: down
	# Get latest container versions
	@$(RUN_DOCKER_COMPOSE) pull
	# Building containers
	@$(RUN_DOCKER_COMPOSE) build --parallel
	# Starting containers and waiting for healthy status...
	@$(RUN_DOCKER_COMPOSE) up -d --force-recreate --wait
	# Containers started [OK]
	@$(RUN_DOCKER_COMPOSE) ps

ps:
	$(RUN_DOCKER_COMPOSE) ps


  1. What did you expect to see?

Most the time the server works great, no errors, fast response and so on.

time ab -k -c 1024 -n 10000 http://x.x.x.x:13091/ping
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking x.x.x.x (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        swoole-http-server
Server Hostname:        x.x.x.x
Server Port:            13091

Document Path:          /ping
Document Length:        2 bytes

Concurrency Level:      1024
Time taken for tests:   0.375 seconds
Complete requests:      10000
Failed requests:        0
Keep-Alive requests:    10000
Total transferred:      1540000 bytes
HTML transferred:       20000 bytes
Requests per second:    26692.22 [#/sec] (mean)
Time per request:       188.220 [ms] (mean)
Time per request:       0.037 [ms] (mean, across all concurrent requests)
Transfer rate:          4014.26 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   45  48.5     40     138
Processing:    28   88  23.7     91     137
Waiting:        1   88  23.7     91     128
Total:         28  132  40.0    132     211

Percentage of the requests served within a certain time (ms)
  50%    132
  66%    163
  75%    170
  80%    173
  90%    182
  95%    190
  98%    198
  99%    202
 100%    211 (longest request)

real	0m0.437s
user	0m0.071s
sys	0m0.252s

  1. What did you see instead?

BUT, every now and then, usually fixed by or after a server restart, we get this error for some request.

 Fatal error: Uncaught Error: Maximum call stack size of 8339456 bytes (zend.max_allowed_stack_size - zend.reserved_stack_size) reached. Infinite recursion? in /var/www/server.php:109
 Stack trace:
 #0 [internal function]: myServer->dispatch_func(Object(Swoole\Http\Server), 5056, 4)
 #1 /var/www/server.php(98): Swoole\Http\Response->end('OK')
 #2 /var/www/server.php(21): myServer->onRequest(Object(Swoole\Http\Request), Object(Swoole\Http\Response), Object(Swoole\Http\Server))
 #3 [internal function]: myServer->{closure}(Object(Swoole\Http\Request), Object(Swoole\Http\Response))
 #4 {main}

The weird some requests part:

  • sometimes it happens on every request,
  • sometimes every request on one worker (pid),
  • sometimes on every healtcheck request (/ping) but not the rest.
  • sometimes it stops randomly, and the site works fine.
  • sometimes the server stops responding for a few seconds/hours/ever, with no errors anywhere.

Usually, AFTER this happens, we start to get 504 timeouts (2min) in our proxy logs, and just before, but not always accompanied by Maximum call stack size errors, latency starts to go up and up.

time ab -k -c 1024 -n 10000 http://x.x.x.x:13091/ping
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking x.x.x.x (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        swoole-http-server
Server Hostname:        10.9.9.113
Server Port:            13091

Document Path:          /ping
Document Length:        2 bytes

Concurrency Level:      1024
Time taken for tests:   12.545 seconds
Complete requests:      10000
Failed requests:        0
Keep-Alive requests:    10000
Total transferred:      1540000 bytes
HTML transferred:       20000 bytes
Requests per second:    797.13 [#/sec] (mean)
Time per request:       6302.607 [ms] (mean)
Time per request:       1.254 [ms] (mean, across all concurrent requests)
Transfer rate:          119.88 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0 3672 4614.4    120   12272
Processing:    38  104 124.5    112   12270
Waiting:        1  103  26.3    112     147
Total:         38 3777 4607.8    259   12384

Percentage of the requests served within a certain time (ms)
  50%    259
  66%   5036
  75%   7728
  80%   9229
  90%  12232
  95%  12314
  98%  12321
  99%  12359
 100%  12384 (longest request)

real	0m12.614s
user	0m0.076s
sys	0m12.393s
  1. What version of Swoole are you using (show your php --ri swoole)?
swoole

Swoole => enabled
Author => Swoole Team <team@swoole.com>
Version => 5.1.2
Built => Apr 29 2024 21:13:40
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
sockets => enabled
openssl => OpenSSL 3.0.11 19 Sep 2023
dtls => enabled
http2 => enabled
json => enabled
curl-native => enabled
zlib => 1.2.13
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
mysqlnd => enabled
async_redis => enabled
coroutine_pgsql => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_fiber_mock => Off => Off
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608
  1. What is your machine environment used (show your uname -a & php -v & gcc -v) ?
 PHP 8.3.2 (cli) (built: Feb 13 2024 05:36:23) (NTS)
 Copyright (c) The PHP Group
 Zend Engine v4.3.2, Copyright (c) Zend Technologies
 
 ----------------------
 gcc (Debian 12.2.0-14) 12.2.0
 Copyright (C) 2022 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
@NathanFreeman
Copy link
Member

I am trying to reproduce the issue by stress-testing the code for a long time with wrk.

@ValiDrv
Copy link
Author

ValiDrv commented Jun 4, 2024

For the 504 http code errors, wrk will not work, since it counts responses and not the response code. So if one takes 2 min it still counts it as GOOD.
Use ab, and you see the non 200 response codes.
(basically I could replicate it about 20 times more often with ab compared to wrk)

Also, it seems unrelated to the number of requests/load past a certain amount. As in, I got 12 failures in 1000 requests, and 16 in 10k requests.

For the Maximum call stack ... reached. Infinite recursion it seems more related to how/when the server is restarted. Sometimes I don't get it for days/weeks, other times I get it on every request. (so we just deploy a few times until it doesn't happen)
And once running, it just "randomly" happens for a few minutes, and then stops by itself.

matyhtf added a commit that referenced this issue Jun 6, 2024
* fix bug #5347

* skip ci

* optimize

* remove useless files

---------

Co-authored-by: MARiA so cute <33935209+NathanFreeman@users.noreply.github.com>
@NathanFreeman
Copy link
Member

@ValiDrv
We have released Swoole v5.1.3. You can try Swoole v5.1.3 to see if this issue still exists.

@ValiDrv
Copy link
Author

ValiDrv commented Jun 6, 2024

We moved away from using dispatch_func in production, but will keep an eye on it.

@matyhtf matyhtf closed this as completed Jun 12, 2024
@matyhtf matyhtf added the fixed label Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants