nonroot install: ubuntu18 failed to start for NOFILE rlimit too low #7133

Closed
amoskong opened this issue Aug 27, 2020 · 19 comments

@amoskong
Contributor

amoskong commented Aug 27, 2020

Installation details
Scylla version (or git commit hash): unified-package-0.20200824.9636a3399.tar.gz
Cluster size: 1
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu18.

Description

After offline installation on Ubuntu18.04, scylla-server fails to start.
Root install works well, and nonroot install on CentOS 8 and Debian 10 works well. The problem only occurs on Ubuntu 18.04.
I see that scylla-server.service sets a high limit (LimitNOFILE=800000).

● scylla-server.service - Scylla Server
   Loaded: loaded (/home/scylla-test/.config/systemd/user/../../../scylladb/etc/systemd/scylla-server.service; bad; vendor preset: enabled)
  Drop-In: /home/scylla-test/.config/systemd/user/scylla-server.service.d
           └─nonroot.conf
   Active: failed (Result: exit-code) since Thu 2020-08-27 01:16:05 UTC; 5min ago
  Process: 16163 ExecStart=/home/scylla-test/scylladb/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET (code=exited, status=1/FAILURE)
 Main PID: 16163 (code=exited, status=1/FAILURE)

Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]: Warning: number of IO queues (8) greater than logical cores (2). Adjusting dow
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - installing SIGHUP handler
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - Scylla version 666.development-0.20200825.f7c5c48df with bui
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum s
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - Shutting down sighup
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - Shutting down sighup was successful
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 scylla[16163]:  [shard 0] init - Startup failed: std::runtime_error (NOFILE rlimit too low)
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 systemd[3148]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 systemd[3148]: scylla-server.service: Failed with result 'exit-code'.
Aug 27 01:16:05 artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1 systemd[3148]: Failed to start Scylla Server.
scylla-test@artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1:~$ cat .config/systemd/user/scylla-server.service
[Unit]
Description=Scylla Server
Wants=scylla-jmx.service
Wants=scylla-housekeeping-restart.timer
Wants=scylla-housekeeping-daily.timer

[Service]
PermissionsStartOnly=true
Type=notify
LimitMEMLOCK=infinity
LimitNOFILE=800000
LimitAS=infinity
LimitNPROC=8096
EnvironmentFile=/etc/sysconfig/scylla-server
EnvironmentFile=/etc/scylla.d/*.conf
ExecStartPre=/opt/scylladb/scripts/scylla_prepare
ExecStart=/usr/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET $MEM_CONF
ExecStopPost=/opt/scylladb/scripts/scylla_stop
TimeoutStartSec=1y
TimeoutStopSec=900
KillMode=process
Restart=on-abnormal
User=scylla
OOMScoreAdjust=-950
StandardOutput=syslog
StandardError=syslog
SyslogLevelPrefix=false
Slice=scylla-server.slice

[Install]
WantedBy=multi-user.target
scylla-test@artifacts-ubuntu1804-jenkins-db-node-52b70dd2-0-1:~$ cat .config/systemd/user/scylla-server.service.d/nonroot.conf
[Service]
EnvironmentFile=
EnvironmentFile=/home/scylla-test/scylladb//etc/default/scylla-server
EnvironmentFile=/home/scylla-test/scylladb/etc/scylla.d/*.conf
ExecStartPre=
ExecStart=
ExecStart=/home/scylla-test/scylladb/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET
ExecStopPost=
User=

/CC @roydahan @syuu1228

@avikivity
Member

I wonder how we should fix it.

  1. read /proc to see how many open files are supported, and adjust the .service file to match
  2. tell the user to adjust /proc
  3. a combination
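Option 1 could be sketched roughly as follows. This is only an illustration, not code from the Scylla scripts: the function names are hypothetical, and it assumes the usual /proc/<pid>/limits layout where the row name "Max open files" is followed by the soft limit, the hard limit, and the unit.

```python
def _to_int(value):
    # The kernel prints 'unlimited' when a limit is RLIM_INFINITY.
    return None if value == "unlimited" else int(value)


def parse_open_files_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: 'Max', 'open', 'files', soft, hard, units.
            soft, hard = line.split()[3:5]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' line found")


def read_own_open_files_limit():
    """Read the limit of the calling process (Linux only)."""
    with open("/proc/self/limits") as f:
        return parse_open_files_limit(f.read())
```

A setup script could call read_own_open_files_limit() and write the hard value into the unit file, although that alone would not help if systemd applies a lower ceiling of its own.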

@amoskong
Contributor Author

Limit                     Soft Limit           Hard Limit           Units 
Max open files            1024                 1048576              files     
scylla-test@artifacts-ubuntu1804-jenkins-db-node-f62ccf05-0-1:~$ cat /proc/$$/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             29659                unlimited            processes 
Max open files            1024                 1048576              files     
Max locked memory         16777216             16777216             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       29659                29659                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   

@avikivity
Member

Hmm. Not sure what to do about it, we rely on the raised limit so much.

@penberg
Contributor

penberg commented Sep 4, 2020

AFAICT, you could change the rlimits for user systemd services in ~/.config/systemd/user/scylla.service.d/limits.conf as per the long discussion here:

https://bugzilla.redhat.com/show_bug.cgi?id=1364332

However, you cannot go beyond the OS-configured hard limit.
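The hard-limit ceiling can also be observed from inside a process: an unprivileged process may raise its soft limit up to the hard limit, but not past it. A minimal sketch using Python's standard resource module (the function name is hypothetical and this is not part of any Scylla script):

```python
import resource


def raise_nofile_soft_limit(wanted):
    """Raise the soft NOFILE limit toward `wanted`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means there is no hard cap; otherwise clamp to it.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        # The hard limit is left unchanged; only the soft limit moves.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

With a hard limit of 4096, asking for 800000 would leave the process at 4096 at most.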

@syuu1228
Contributor

syuu1228 commented Sep 5, 2020

Hmm, I think we at least want to adjust LimitNOFILE to the hard limit value (1048576 in this case), correct?

@avikivity
Member

It looks like there are two sources of limits:

$ grep 'Max open files' /proc/$$/limits
Max open files            1000000              1000000              files     
$ systemctl show user@1000.service | grep LimitNOFILE
LimitNOFILE=524288

The first is from /etc/security/limits.conf (I adjusted mine), the second is from deep in the bowels of systemd.
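To compare the two sources programmatically, the KEY=VALUE output of systemctl show can be parsed with a few lines. This is a sketch with hypothetical function names; it treats "infinity" as "no limit" and assumes every other value is a plain integer.

```python
def parse_systemctl_show(output):
    """Turn `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key] = value
    return props


def _num(value):
    # None stands for 'absent' or 'infinity'.
    return None if value in (None, "infinity") else int(value)


def manager_nofile(output):
    """Return (soft, hard) from the LimitNOFILESoft / LimitNOFILE properties."""
    props = parse_systemctl_show(output)
    return _num(props.get("LimitNOFILESoft")), _num(props.get("LimitNOFILE"))
```

The text passed in would be the output of a command such as systemctl show user@1000.service, for example captured with subprocess.run(..., capture_output=True, text=True).stdout.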

slivne added this to the 4.3 milestone Sep 6, 2020
@syuu1228
Contributor

It looks like there are two sources of limits:

$ grep 'Max open files' /proc/$$/limits
Max open files            1000000              1000000              files     
$ systemctl show user@1000.service | grep LimitNOFILE
LimitNOFILE=524288

The first is from /etc/security/limits.conf (I adjusted mine), the second is from deep in the bowels of systemd.

@avikivity What does it mean?
Does it mean that NOFILE in user-mode systemd units is lower than the /proc/$$/limits value?
If so, we probably have to check both values and use the lower one. Is that correct?
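The "use the lower one" idea amounts to taking the minimum of the two hard limits. A tiny sketch (hypothetical function name; None stands for an unlimited or unknown value, and whether the lower value really is the one that applies is still an open question at this point in the thread):

```python
def effective_nofile_hard_limit(shell_hard, user_manager_hard):
    """Pick the lower of two hard limits; None means 'no limit known'."""
    known = [v for v in (shell_hard, user_manager_hard) if v is not None]
    return min(known) if known else None
```

With the values quoted above, effective_nofile_hard_limit(1000000, 524288) gives 524288.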

@avikivity
Member

I'm not sure what it means. We'll have to find out what systemd does.

@syuu1228
Contributor

I think user@1000.service is lowering NOFILE for user systemd units, at least on Ubuntu 18.
I added scylla_prepare as ExecStartPre and modified the script to print NOFILE as follows:

    print(resource.getrlimit(resource.RLIMIT_NOFILE))

Even though the hard limit of NOFILE for a standard process is 1048576, scylla_prepare reports (1024, 4096).
The output matches what user@1000.service reports, so it looks like we cannot get a larger value than that.
But 4096 is too small; we cannot run Scylla with such a setting.

vagrant@ubuntu1604:~/scylladb$ cat /proc/$$/limits|grep "Max open files"
Max open files            1024                 1048576              files  
vagrant@ubuntu1604:~/scylladb$ systemctl show user@1000.service | grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024
vagrant@ubuntu1604:~/scylladb$ systemctl --user start scylla-server.service
Job for scylla-server.service failed because the control process exited with error code. See "systemctl --user status scylla-server.service" and "journalctl -xe" for details.
vagrant@ubuntu1604:~/scylladb$ systemctl --user --no-pager -l status scylla-server.service
● scylla-server.service - Scylla Server
   Loaded: loaded (/home/vagrant/.config/systemd/user/scylla-server.service; enabled; vendor preset: enabled)
  Drop-In: /home/vagrant/.config/systemd/user/scylla-server.service.d
           └─nonroot.conf
   Active: failed (Result: exit-code) since Mon 2020-09-28 09:20:09 UTC; 2s ago
  Process: 10728 ExecStartPre=/home/vagrant/scylladb/scripts/scylla_prepare (code=exited, status=1/FAILURE)
 Main PID: 10304 (code=exited, status=1/FAILURE)

Sep 28 09:20:09 ubuntu1604.localdomain systemd[1864]: Starting Scylla Server...
Sep 28 09:20:09 ubuntu1604.localdomain scylla_prepare[10728]: (1024, 4096)
Sep 28 09:20:09 ubuntu1604.localdomain systemd[1864]: scylla-server.service: Control process exited, code=exited status=1
Sep 28 09:20:09 ubuntu1604.localdomain systemd[1864]: Failed to start Scylla Server.
Sep 28 09:20:09 ubuntu1604.localdomain systemd[1864]: scylla-server.service: Unit entered failed state.
Sep 28 09:20:09 ubuntu1604.localdomain systemd[1864]: scylla-server.service: Failed with result 'exit-code'.

@syuu1228
Contributor

Ahh, I just posted the user@1000.service values for our supported distributions, but I had run systemctl cat user@1000.service, which shows different values from systemctl show user@1000.service, so I need to re-run it and post the results again.

@syuu1228
Contributor

It seems that only newer distros have a larger LimitNOFILE value. If it is applied as the hard limit, we should see the NOFILE rlimit too low error on CentOS8 too.

CentOS8

[vagrant@localhost ~]$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024

Ubuntu16

vagrant@ubuntu1604:~$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024

Ubuntu18

vagrant@ubuntu1804:~$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024

Ubuntu20

vagrant@vagrant:~$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=524288
LimitNOFILESoft=1024

Debian9

vagrant@debian9:~$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024

Debian10:

vagrant@debian10:~$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=524288
LimitNOFILESoft=1024

@syuu1228
Contributor

I made one more mistake: I had tried to reproduce the issue on Ubuntu 16, not 18.
#7133 (comment)

I am going to try on Ubuntu 18, and on other distros as well.

@syuu1228
Contributor

Confirmed: I am able to reproduce the issue on Ubuntu 18, too.

vagrant@ubuntu1804:~/scylladb$ systemctl --user -l --no-pager status scylla-server.service
● scylla-server.service - Scylla Server
   Loaded: loaded (/home/vagrant/.config/systemd/user/scylla-server.service; enabled; vendor preset: enabled)
  Drop-In: /home/vagrant/.config/systemd/user/scylla-server.service.d
           └─nonroot.conf
   Active: failed (Result: exit-code) since Mon 2020-09-28 10:59:48 UTC; 20s ago
  Process: 16558 ExecStart=/home/vagrant/scylladb/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET (code=exited, status=1/FAILURE)
 Main PID: 16558 (code=exited, status=1/FAILURE)

Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - installing SIGHUP handler
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - Scylla version 666.development-0.20200928.abe6da8e0 with build-id beaaaf9dface7d2fbd73e3041ecde92b5e8769ba starting ...
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; refusing to start.
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - Shutting down sighup
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - Shutting down sighup was successful
Sep 28 10:59:48 ubuntu1804.localdomain scylla[16558]:  [shard 0] init - Startup failed: std::runtime_error (NOFILE rlimit too low)
Sep 28 10:59:48 ubuntu1804.localdomain systemd[2188]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
Sep 28 10:59:48 ubuntu1804.localdomain systemd[2188]: scylla-server.service: Failed with result 'exit-code'.
Sep 28 10:59:48 ubuntu1804.localdomain systemd[2188]: Failed to start Scylla Server

@avikivity
Member

We will just have to live with it. Let's make setup warn that there will be limits on the total amount of data.

@syuu1228
Contributor

syuu1228 commented Oct 3, 2020

Seems like only new distro has larger value of LimitNOFILE, if it applied as hard limit, we should see NOFILE rlimit too low error on CentOS8 too.

CentOS8

[vagrant@localhost ~]$ systemctl show user@1000.service|grep LimitNOFILE
LimitNOFILE=4096
LimitNOFILESoft=1024

This was wrong. CentOS8 has a much larger value, unlike Ubuntu 18:

[vagrant@localhost ~]$ systemctl  show user@1000.service|grep LimitNOFILE
LimitNOFILE=262144
LimitNOFILESoft=1024

I tried printing the NOFILE value in the scylla-server service on CentOS8. It looks like it is coming from user@1000.service, not from scylla-server.service:

Oct 03 14:28:37 localhost.localdomain scylla_prepare[34781]: (1024, 262144)

On Ubuntu 16 it was:

Sep 28 09:20:09 ubuntu1604.localdomain scylla_prepare[10728]: (1024, 4096)

So I think LimitNOFILE in scylla-server.service is ignored in user mode on all distributions, probably because our value, LimitNOFILE=800000, is larger than the limit of user@1000.service.
Only Ubuntu 16/18 hit the error because their LimitNOFILE=4096 is too small.
We raise an error when NOFILE < 10000:
https://github.com/scylladb/scylla/blob/master/main.cc#L240
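To make the threshold concrete, here is a small sketch that classifies the hard limits reported in this thread against the two numbers from the log message (minimum 10000, recommended 200000). It is not the code from main.cc: the names are hypothetical, and the middle "warn" tier is an assumption based on the wording of the message. Comparing against the hard limit is inferred from the observations above, where the soft limit is 1024 everywhere but only the 4096 hard limit fails.

```python
MINIMUM_NOFILE = 10000       # "minimum setting" in the log message
RECOMMENDED_NOFILE = 200000  # "recommended setting" in the log message

# LimitNOFILE of user@1000.service as reported in this thread
# (CentOS 8 uses the corrected value).
OBSERVED_HARD_LIMITS = {
    "Ubuntu 16": 4096,
    "Ubuntu 18": 4096,
    "Debian 9": 4096,
    "CentOS 8": 262144,
    "Ubuntu 20": 524288,
    "Debian 10": 524288,
}


def classify_nofile(hard_limit):
    """Return 'refuse', 'warn' or 'ok' for a NOFILE hard limit."""
    if hard_limit < MINIMUM_NOFILE:
        return "refuse"
    if hard_limit < RECOMMENDED_NOFILE:
        return "warn"  # assumed tier: below recommended but above minimum
    return "ok"


def distros_needing_workaround(observed=OBSERVED_HARD_LIMITS):
    """Distributions whose limit falls below the minimum."""
    return sorted(d for d, v in observed.items() if classify_nofile(v) == "refuse")
```

distros_needing_workaround() returns the three distributions with a 4096 limit, which are the ones where developer mode would be needed.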

I think we have to use developer mode for these distributions.

syuu1228 added a commit to syuu1228/scylla that referenced this issue Oct 3, 2020
…n NOFILE is too low

On Ubuntu 16/18 and Debian 9, LimitNOFILE is set to 4096 and not able to override from
user unit.
To run scylla-server in such environment, we need to turn on developer mode and show
warnings.

Fixes scylladb#7133
@amoskong
Contributor Author

amoskong commented Nov 26, 2020

Hi @avikivity @syuu1228

I can still see this issue with scylla-unified-package-4.3.rc2.0.20201123.bc922a743.tar.gz
Scylla-server failed to start, and it didn't switch to developer mode.

test job: https://jenkins.scylladb.com/job/scylla-4.3/job/artifacts-offline-install/job/artifacts-ubuntu1804-test-nonroot/2/console
test id: 3b0e729f-adf0-4b61-b678-358b321e36d5
db log: https://cloudius-jenkins-test.s3.amazonaws.com/3b0e729f-adf0-4b61-b678-358b321e36d5/20201126_163557/db-cluster-3b0e729f.zip

Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: Starting Scylla Server...
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]: Scylla version 4.3.rc2-0.20201123.bc922a743 with build-id e058a76df3c44abbbd8514772ec2fd43e7277ec1 starting ...
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]: command used: "/home/scylla-test/scylladb/bin/scylla --workdir /home/scylla-test/scylladb --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --num-io-queues=8 --io-properties-file=/home/scylla-test/scylladb/etc/scylla.d/io_properties.yaml"
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]: parsed command line options: [workdir: /home/scylla-test/scylladb, log-to-syslog: 1, log-to-stdout: 0, default-log-level: info, network-stack: posix, num-io-queues: 8, io-properties-file: /home/scylla-test/scylladb/etc/scylla.d/io_properties.yaml]
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]: Warning: number of IO queues (8) greater than logical cores (2). Adjusting downwards.
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - installing SIGHUP handler
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - Scylla version 4.3.rc2-0.20201123.bc922a743 with build-id e058a76df3c44abbbd8514772ec2fd43e7277ec1 starting ...
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; refusing to start.
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - Shutting down sighup
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - Shutting down sighup was successful
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 scylla[16940]:  [shard 0] init - Startup failed: std::runtime_error (NOFILE rlimit too low)
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: scylla-server.service: Failed with result 'exit-code'.
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: Failed to start Scylla Server.
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: Dependency failed for Scylla JMX.
Nov 26 16:35:41 artifacts-ubuntu1804-jenkins-db-node-3b0e729f-0-1 systemd[3215]: scylla-jmx.service: Job scylla-jmx.service/start failed with result 'dependency'.

roydahan reopened this Nov 30, 2020
@roydahan

@penberg / @syuu1228 please see @amoskong's comment above.

@amoskong
Contributor Author

Hi @avikivity @syuu1228

I can still see this issue with scylla-unified-package-4.3.rc2.0.20201123.bc922a743.tar.gz
Scylla-server failed to start, and it didn't switch to developer mode.

The failure isn't caused by a Scylla issue. The problem can be solved by passing the NIC name on the scylla_setup command line.
scylla_setup then succeeds, and developer mode is used as expected.
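For illustration only, one simple way to detect a NIC name before calling scylla_setup is to take the first non-loopback interface. This is an assumption about what "detected nic" could mean, not the logic of the SCT patch. The function names are hypothetical, and the exact scylla_setup option for passing the name is not shown in this thread.

```python
import socket


def pick_nic(names):
    """Return the first interface name that is not the loopback device."""
    for name in names:
        if name != "lo":
            return name
    return None


def detect_nic():
    # socket.if_nameindex() yields (index, name) pairs for local interfaces.
    return pick_nic(name for _, name in socket.if_nameindex())
```

On a host with several interfaces this picks whichever comes first, so a real test setup may want a stricter rule.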

I also sent a patch to SCT:

  • [PATCH] fix(nonroot offline install): scylla_setup with detected nic PR link

Let's close this issue.

@avikivity
Member

Fix present on all active branches, not backporting.
