Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In some cases, changing a vserver/alias' domain to a fake domain might break multi-versions PHP FPM #649

Open
nekohayo opened this issue Sep 23, 2023 · 4 comments

Comments

@nekohayo
Copy link

nekohayo commented Sep 23, 2023

I have encountered a probable bug with Virtualmin 7.7 (which was recently auto-upgraded to 7.8.2) on my Debian 11 server at home, where I am hosting some of my personal websites.

I'm not entirely sure how this bug was triggered, but I have spent some weeks investigating and have come towards a reasonable guess.

Summary

The bug's symptom:

  • In some cases, I will get mysterious HTTP 503 "Service Unavailable" failures on some of my Virtualmin vservers, depending on what PHP version they are configured to use (on a server with multiple PHP versions available).
  • When I dug deeper, I found that PHP-FPM 7.4 is unable to start (or it might be PHP 8.2, depending on "whichever comes up first" during boot/reloads).

The possible cause I'm suspecting:
using fake domain names (like "exampleplaceholder" instead of "example.com" in your vservers, and then deleting those vservers, might confuse virtualmin and leave some cruft in PHP FPM configuration on the filesystem that is invisible to virtualmin/webmin's UI… and then those conflicting configs prevent PHP from starting.


The server's specs and usage type

This Debian 11 + Virtualmin 7.8.2 server has both PHP 7.4 and 8.2, installed from Sury's Debian repositories. Both PHP versions have the same modules installed, with this command:

apt install php{7.4,8.2}-{cgi,cli,fpm,pdo,gd,mbstring,mysqlnd,opcache,xml,zip,curl,imagick,intl,bcmath,sqlite3}

This Debian 11 server is pretty much the same thing as my other home server that runs Debian 12 + Virtualmin 7.8.2, I just haven't upgraded it to Debian 12 yet.

  • The other (Debian 12) server was used as my cPanel migration sandbox, i.e. a trashy "disposable" server where all the post-migration domains/vservers shuffling happens
  • This one (Debian 11) is kind of a clean "production" server, so I try not to do too stupid stuff with it, so in theory the websites that got there are "clean" transfers "from Virtualmin to Virtualmin" rather than cPanel to Virtualmin, once the websites/domains have been rearranged in a way that makes sense

So in theory this server shouldn't misbehave, but it does. Investigation details below.

Symptoms in detail / troubleshooting so far

I first noticed the issue when some of my personal websites started randomly failing, showing only:

Service Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

This HTTP/Apache error page is the result of PHP not working at all on those vservers. Showing a plain index.html page (instead of some index.php) works fine.

Doing tail -f logs/error_log in such a vserver's home, I get things like this whenever a visitor tries to hit the website:

[proxy:error] [pid 2567842:tid 140347971720960] (2)No such file or directory: AH02454: FCGI: attempt to connect to Unix domain socket /var/php-fpm/16891050803698117.sock (*:80) failed

[proxy_fcgi:error] [pid 2567842:tid 140347971720960] [client some_ip_address:63460] AH01079: failed to make connection to backend: httpd-UDS

The number of that socket might vary across different failing vservers, for example I also have the same error with:

[proxy:error] [pid 1438585:tid 140120789436160] (2)No such file or directory: AH02454: FCGI: attempt to connect to Unix domain socket /var/php-fpm/16887681373193083.sock (*:80) failed

[proxy_fcgi:error] [pid 1438585:tid 140120789436160] [remote some_other_IP:55104] AH01079: failed to make connection to backend: httpd-UDS

Searching for those errors, I saw this forum thread which led me to #96 (comment), which led me nowhere. Indeed, I did all the updates, reinstalled the various PHP 7.4 and 8.2 debian packages, etc etc.

Curiously, the package installations would now fail because they would be unable to start the services, too:

Setting up php7.4-fpm (1:7.4.33-8+0~20230904.88+debian11~1.gbp87c414) ...
NOTICE: Not enabling PHP 7.4 FPM by default.
NOTICE: To enable PHP 7.4 FPM in Apache2 do:
NOTICE: a2enmod proxy_fcgi setenvif
NOTICE: a2enconf php7.4-fpm
NOTICE: You are seeing this message because you have apache2 package installed.
Job for php7.4-fpm.service failed because the control process exited with error code.
See "systemctl status php7.4-fpm.service" and "journalctl -xe" for details.
invoke-rc.d: initscript php7.4-fpm, action "restart" failed.
● php7.4-fpm.service - The PHP 7.4 FastCGI Process Manager
     Loaded: loaded (/lib/systemd/system/php7.4-fpm.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2023-09-10 14:19:28 EDT; 8ms ago
       Docs: man:php-fpm7.4(8)
    Process: 2583978 ExecStart=/usr/sbin/php-fpm7.4 --nodaemonize --fpm-config /etc/php/7.4/fpm/php-fpm.conf (code=exited, status=78)
    Process: 2583979 ExecStopPost=/usr/lib/php/php-fpm-socket-helper remove /run/php/php-fpm.sock /etc/php/7.4/fpm/pool.d/www.conf 74 (code=exited, status=0/SUCCESS)
   Main PID: 2583978 (code=exited, status=78)
        CPU: 58ms

Sep 10 14:19:28 my_home_server systemd[1]: Starting The PHP 7.4 FastCGI Process Manager...
Sep 10 14:19:28 my_home_server php-fpm7.4[2583978]: [10-Sep-2023 14:19:28] ERROR: Another FPM instance seems to already listen on /var/php-fpm/16801964152855.sock
Sep 10 14:19:28 my_home_server php-fpm7.4[2583978]: [10-Sep-2023 14:19:28] ERROR: FPM initialization failed
Sep 10 14:19:28 my_home_server systemd[1]: php7.4-fpm.service: Main process exited, code=exited, status=78/CONFIG
Sep 10 14:19:28 my_home_server systemd[1]: php7.4-fpm.service: Failed with result 'exit-code'.
Sep 10 14:19:28 my_home_server systemd[1]: Failed to start The PHP 7.4 FastCGI Process Manager.
dpkg: error processing package php7.4-fpm (--configure): installed php7.4-fpm package post-installation script subprocess returned error exit status 1
Errors were encountered while processing: php7.4-fpm
E: Sub-process /usr/bin/dpkg returned an error code (1)

...eventually the only thing that made a visible difference was a full reboot, but the full reboot just shuffled my luck in whether it's PHP 7.4 failing, or PHP 8.2 failing. So you have a different set of websites failing to load, depending on which PHP version they use... Flip a coin!

As per my friend @s3phy's recommendation, I looked at grep "listen" /etc/php/*/fpm/pool.d/www.conf to see if things were using separate sockets, and they were:

/etc/php/7.4/fpm/pool.d/www.conf:listen = /run/php/php7.4-fpm.sock
/etc/php/8.2/fpm/pool.d/www.conf:listen = /run/php/php8.2-fpm.sock

Alright then, could it be some stale socket file thing going on? Let's look at the dates in ls -l /var/php-fpm/ (some characters replaced by "*"):

total 0
srw-rw---- 1 c***                 c***                 0 Sep 10 14:20 167695548612615.sock
srw-rw---- 1 ****_sandbox         ****_sandbox         0 Sep 10 14:20 16791107091046257.sock
srw-rw---- 1 go*****              go*****              0 Sep 10 14:20 16791566971152905.sock
srw-rw---- 1 exampleplaceholder   exampleplaceholder   0 Sep 10 14:20 16801964152855.sock
srw-rw---- 1 ****archive          ****archive          0 Sep 10 14:20 1682961999589308.sock
srw-rw---- 1 je****               je****               0 Sep 10 14:20 16885075782852274.sock
srw-rw---- 1 s****                s****                0 Sep 10 14:20 16887637243176092.sock
srw-rw---- 1 opensource           opensource           0 Sep 10 14:20 16887681373193083.sock
srw-rw---- 1 exampleplaceholder   exampleplaceholder   0 Sep 10 14:20 16889114523420091.sock
srw-rw---- 1 tab**                tab**                0 Sep 10 14:20 16891050803698117.sock

Currently, both the "tab**" and "opensource" websites are failing, and both happen to be configured to use PHP 7.4 instead of 8.2.

Make a mental note of the "exampleplaceholder" fake domain / vserver here. It was there in that folder when I first encountered this issue around September 10th, but it's absent today. I think it might be key to the problem. More on that in a minute.

Currently, the Webmin dashboard "Servers status" indicates that the PHP FPM 7.4 service is down, and PHP FPM 8.2 is up. If you try to restart it, it fails, and suggests you look at systemctl status php7.4-fpm.service, in which case you see:

● php7.4-fpm.service - The PHP 7.4 FastCGI Process Manager
     Loaded: loaded (/lib/systemd/system/php7.4-fpm.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2023-09-23 14:56:44 EDT; 9s ago
       Docs: man:php-fpm7.4(8)
    Process: 1438823 ExecStart=/usr/sbin/php-fpm7.4 --nodaemonize --fpm-config /etc/php/7.4/fpm/php-fpm.conf (code=exited, status=78)
    Process: 1438824 ExecStopPost=/usr/lib/php/php-fpm-socket-helper remove /run/php/php-fpm.sock /etc/php/7.4/fpm/pool.d/www.conf 74 (code=exited, status=0/SUCCE>
   Main PID: 1438823 (code=exited, status=78)
        CPU: 61ms

Sep 23 14:56:44 my_home_server systemd[1]: Starting The PHP 7.4 FastCGI Process Manager...
Sep 23 14:56:44 my_home_server php-fpm7.4[1438823]: [23-Sep-2023 14:56:44] ERROR: [pool 16801964152855] cannot get uid for user 'exampleplaceholder'
Sep 23 14:56:44 my_home_server php-fpm7.4[1438823]: [23-Sep-2023 14:56:44] ERROR: FPM initialization failed
Sep 23 14:56:44 my_home_server systemd[1]: php7.4-fpm.service: Main process exited, code=exited, status=78/CONFIG
Sep 23 14:56:44 my_home_server systemd[1]: php7.4-fpm.service: Failed with result 'exit-code'.
Sep 23 14:56:44 my_home_server systemd[1]: Failed to start The PHP 7.4 FastCGI Process Manager.

...okay then, we have a much better clue what might be the problem: it mentions "exampleplaceholder". That's the fake domain name I had renamed a vserver to (or maybe was it an alias server, child of some other vserver? I don't remember, but I certainly had used the "Change domain name" feature in the Virtualmin menus), before recently deleting it via the virtualmin web interface. That's suspicious, why is it referencing something that doesn't even exist anymore?

Unfortunately, Virtualmin and Webmin don't seem to have any way for me to search for that string, so I went Conan-the-barbarian-style and did grep -Rn "exampleplaceholder" /etc/* as root, and the only mentions are in /etc/php/7.4/fpm/pool.d/16801964152855.conf. Here are the contents of that file:

[16801964152855]
user = exampleplaceholder
group = exampleplaceholder
listen.owner = exampleplaceholder
listen.group = exampleplaceholder
listen.mode = 0660
listen = /var/php-fpm/16801964152855.sock
pm = dynamic
pm.max_children = 10
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 5
php_value[upload_tmp_dir] = /home/exampleplaceholder/tmp
php_value[session.save_path] = /home/exampleplaceholder/tmp

But wait, does exampleplaceholder still exist elsewhere? Doesn't seem so:

# ls /home/ | grep exampleplaceholder
(No matches)

# groups exampleplaceholder
groups: ‘exampleplaceholder’: no such user

Suspected causes

In conclusion, this leads me to the following hypothesis: there probably is some scenario, somehow, where Virtualmin 7.7 (and maybe 7.8.2) fails to delete some PHP configuration thing when deleting a renamed vserver/alias, maybe when:

  • it's not a real domain name (ex: exampleplaceholder instead of example.com)?
  • when some server (whether the one you are deleting, or its parent) is "disabled"?

I wish I had better clues as to the exact cause or steps to trigger this issue, but I'm hoping that the info and thought process so far here might ring a bell. Otherwise, I am happy to provide further information you may request; for now I haven't deleted the sock or conf files or anything like that, to ensure I can still provide information for you, until you tell me it's not needed.

Besides, I don't know if it would be good/safe for me to just rm /etc/php/7.4/fpm/pool.d/16801964152855.conf or if webmin/virtualmin/PHP/Apache would get even more confused 😅

@marclaporte
Copy link
Contributor

Thank you @nekohayo for the investigation.

We at @EvoluData / @WikiSuite have experienced these mysterious HTTP 503 "Service Unavailable" every few months, on various servers, but we were unable to pinpoint the cause. And on production servers, we often have to fix the problem quickly, and can't keep the server as is as evidence for a senior sysadmin to investigate.

We have noticed however that it seems to happen after a Virtualmin Virtual Server is deleted. As to your clues
1- "not a real domain name": We never do this as all our Virtualmin Virtual Servers are given a subdomain of the main server (we use a wildcard)
2- "when some server (whether the one you are deleting, or its parent) is "disabled"?": This is very common for us, as we frequently disable a Virtualmin Virtual Server for a little while before we delete it.

Recently, there has been some work on the PHP-FPM code in Virtualmin and I am hopeful it will take us step closer to a resolution.

@iliajie
Copy link
Collaborator

iliajie commented Sep 23, 2023

We expect those to be fixed in 7.8.2!

@jcameron
Copy link
Collaborator

We did recently fix a bug in 7.8.2 that could cause the PHP FPM server not to restart properly when changing the PHP version for a domain - but this only happens when using a TCP port, not a socket file. The fix will be included in the 7.8.3 release..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants