Elasticsearch unable to start because of java.lang.IllegalStateException: Failed to create node environment #2609

Closed
vincenzodnp opened this issue Apr 15, 2021 · 5 comments

Comments

@vincenzodnp
Contributor

Describe the bug

The Elasticsearch image is not able to create the "node environment" in the mounted (persistent) /usr/share/elasticsearch/data directory.
This is due to permission issues because of fsGroup (it was set to 0).
The Java exception is: java.lang.IllegalStateException: Failed to create node environment

To Reproduce

Steps to reproduce the behavior:

  1. Create an Elasticsearch deployment
  2. See error

Expected behavior

Elasticsearch pod up and running.

Additional context

Tested by adding fsGroup: 0 to the deployment securityContext, and it works as expected.

@Schnitzel
Contributor

@vincenzodnp
Can you elaborate on what you mean by "it was set to 0"?
The elasticsearch deployment.yaml was added with no fsGroup setting from the very beginning, see https://github.com/amazeeio/lagoon/blame/e426942e825b5a03660f58520115ad77622866fc/images/kubectl-build-deploy-dind/helmcharts/elasticsearch/values.yaml

So I'm confused about why this needs to be changed now, given that it worked from February 2021 until now.

@dasrecht
Contributor

Maybe this is something specific to GCP? It happened when we tried to deploy things to GCP today.

@vincenzodnp
Contributor Author

vincenzodnp commented Apr 15, 2021

Here is the change made by @shreddedbacon:
1cba802#diff-287110c964bdfece67930f362083b08892c8a619c0a8e8611322b0e99e8b32c7
where he changed fsGroup: 0 to {{- toYaml .Values.podSecurityContext | nindent 8 }}

@Schnitzel
Copy link
Contributor

I did some more research on this, as I was confused about why this had not been a problem earlier.

It turns out that if you mount a NEW PVC into a container without any securityContext.fsGroup setting in the Pod, this is how the PVC is mounted into the container:

[lagoonproject]dev@elasticsearch:/usr/share/elasticsearch/data$ ls -lisa
total 24
      2  4 drwxr-xr-x 3 root          root  4096 Apr 15 21:21 .
1851635  4 drwxrwxr-x 1 elasticsearch root  4096 Feb 13  2019 ..
     11 16 drwx------ 2 root          root 16384 Apr 15 21:21 lost+found

Note the drwxr-xr-x on the main folder. The container itself is running as user/group root/root, so technically it should have write access (user root is the owner of the folder and has write access), but the elasticsearch process is started under the user/group elasticsearch/root, see here:

[lagoonproject]dev@elasticsearch:/usr/share/elasticsearch$ ps -aux   
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4364   624 ?        Ss   10:41   0:00 /sbin/tini -- /lagoon/entrypoints.bash /usr/local/bin/docker-entrypoint.sh
elastic+       6  0.2  2.9 3917836 480732 ?      Sl   10:41   2:12 /opt/jdk-11.0.1/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupan

This means the elasticsearch user has no write access to the PVC: the user doesn't match the owner, and although the group matches, the group does not have write access.
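
To make the reasoning above concrete, here is a minimal Python sketch of the classic owner/group/other write check. It is a simplification (no ACLs, no capabilities, no root bypass). The numeric uid 1000 for the elasticsearch user is an assumption, since the listing only shows names; `can_write` is a hypothetical helper, not something from the Lagoon code base.

```python
import stat


def can_write(mode, owner_uid, owner_gid, proc_uid, proc_gids):
    """Simplified POSIX write check: exactly one permission class applies."""
    if proc_uid == owner_uid:
        return bool(mode & stat.S_IWUSR)   # owner class
    if owner_gid in proc_gids:
        return bool(mode & stat.S_IWGRP)   # group class
    return bool(mode & stat.S_IWOTH)       # everyone else


ES_UID = 1000   # assumption: numeric uid of the elasticsearch user
ROOT_GID = 0    # the process runs with group root

# drwxr-xr-x root root: group matches, but the group has no write bit
print(can_write(0o755, 0, 0, ES_UID, {ROOT_GID}))   # False

# drwxrwsr-x root root: group matches and the group write bit is set
print(can_write(0o2775, 0, 0, ES_UID, {ROOT_GID}))  # True
```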

Now, if you set securityContext.fsGroup: 0 inside the pod it looks like this:

[lagoonproject]dev@elasticsearch:/usr/share/elasticsearch/data$ ls -lisa
total 24
      2  4 drwxrwsr-x 3 root          root  4096 Apr 15 21:21 .
1851635  4 drwxrwxr-x 1 elasticsearch root  4096 Feb 13  2019 ..
     11 16 drwxrws--- 2 root          root 16384 Apr 15 21:21 lost+found

The big difference is the drwxrwsr-x on the . folder, meaning that the group root has write access. Therefore elasticsearch is able to access the data folder and do its thing.
So it turns out that setting securityContext.fsGroup: 0 not only sets the filesystem group to 0 (root) but also changes the permissions of the filesystem to be writable by the group.
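
As a rough model of that permission change, the following Python sketch reproduces the two listings. The masks are assumptions chosen to match the observed output (group read/write OR-ed into every entry, plus execute and setgid on directories); they are not taken from this thread. `apply_fs_group` and `mode_string` are hypothetical helpers, and the change of group ownership to the fsGroup is not modeled.

```python
import stat

RW_MASK = 0o660    # assumed: owner/group read+write is OR-ed into every entry
EXEC_MASK = 0o110  # assumed: directories also get owner/group execute


def apply_fs_group(mode, is_dir):
    """Return the permission bits after the assumed fsGroup adjustment."""
    new_mode = mode | RW_MASK
    if is_dir:
        # setgid makes new files inherit the directory's group
        new_mode |= EXEC_MASK | stat.S_ISGID
    return new_mode


def mode_string(mode, is_dir):
    """Render permission bits the way ls -l does."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | mode)


# "." on a fresh PVC: drwxr-xr-x
print(mode_string(apply_fs_group(0o755, True), True))  # drwxrwsr-x

# lost+found on a fresh PVC: drwx------
print(mode_string(apply_fs_group(0o700, True), True))  # drwxrws---
```

Both results match the second listing, which suggests the model is at least consistent with what was observed here.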

Now, why did this not cause more havoc?
After the permissions have been set once to drwxrwsr-x on the PVC, they stay that way. So even though we removed securityContext.fsGroup: 0 recently, all PVCs that were created before the removal had drwxrwsr-x on the . folder and everything was fine. Only when a new PVC (like a new project/migration) was added did this cause issues, right from the very beginning.

It also only causes issues with container images that switch the user of the service to something other than root, like the elasticsearch or solr images do.
I still changed it in the PR for mariadb-single, mongo-single, and postgres-single though, just to be safe.

@rocketeerbkw
Member

Fixed in #2610
