
Stargate fails to start because of too many open files #1286

Closed
jsanda opened this issue Sep 29, 2021 · 11 comments · Fixed by #1300
Comments


jsanda commented Sep 29, 2021

I am running Stargate 1.0.31 in Kubernetes with K8ssandra. The Stargate image used is stargateio/stargate-3_11:v1.0.31. In one of our automated tests we have seen Stargate fail to start a few times with a resource limit error like this:

INFO  [main] 2021-09-28 21:25:10,597 AbstractConnector.java:331 - Started Server@29862f53{HTTP/1.1, (http/1.1)}{0.0.0.0:8082}
INFO  [main] 2021-09-28 21:25:10,598 Server.java:415 - Started @80358ms
INFO  [main] 2021-09-28 21:25:10,599 BaseActivator.java:185 - Started restapi
Finished starting bundles.
Unexpected error: java.io.IOException: User limit of inotify instances reached or too many open files
java.lang.RuntimeException: java.io.IOException: User limit of inotify instances reached or too many open files
        at io.stargate.starter.Starter.watchJarDirectory(Starter.java:539)
        at io.stargate.starter.Starter.start(Starter.java:441)
        at io.stargate.starter.Starter.cli(Starter.java:619)
        at io.stargate.starter.Starter.main(Starter.java:660)
Caused by: java.io.IOException: User limit of inotify instances reached or too many open files
        at sun.nio.fs.LinuxWatchService.<init>(LinuxWatchService.java:64)
        at sun.nio.fs.LinuxFileSystem.newWatchService(LinuxFileSystem.java:47)
        at io.stargate.starter.Starter.watchJarDirectory(Starter.java:526)
        ... 3 more

This is in a CI environment with limited CPU/memory resources. The test is running in the free-tier runner in GitHub Actions. The runner VM has 2 CPUs and 7 GB of memory. The particular test in which this failed had already deployed two Cassandra nodes and one Stargate node. This failure is from the second Stargate node.

I believe the open file limit on the VM is set to 65536. I don't think I am able to increase it. Maybe the solution is to run my tests in an environment with more resources, but it would be nice if Stargate could be less demanding, especially considering this happens on startup.
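
For context on the error text: "User limit of inotify instances reached or too many open files" is typically raised when the JDK cannot create a new inotify instance, which can happen either because the per-user cap on inotify instances (fs.inotify.max_user_instances, commonly 128 by default) is reached or because the process is out of file descriptors, so the 65536 open-file limit is not necessarily the one being exhausted. A minimal Linux-only diagnostic sketch (hypothetical, not part of Stargate) that prints both values from inside the container:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical diagnostic, not Stargate code: print the two limits that can
// produce "User limit of inotify instances reached or too many open files".
public class InotifyLimitCheck {
    public static void main(String[] args) throws Exception {
        // Per-user cap on inotify instances; LinuxWatchService allocates one per WatchService.
        Path maxInstances = Paths.get("/proc/sys/fs/inotify/max_user_instances");
        System.out.println("fs.inotify.max_user_instances = "
                + new String(Files.readAllBytes(maxInstances)).trim());
        // File descriptors currently held by this JVM process.
        try (Stream<Path> fds = Files.list(Paths.get("/proc/self/fd"))) {
            System.out.println("open fds in this process = " + fds.count());
        }
    }
}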

@dougwettlaufer
Contributor

Huh, well that's a new one. We run some things in the GitHub Actions free tier as well without a problem, granted it is with fewer nodes.

The odd thing is that file descriptors shouldn't be heavily consumed until the services start taking traffic. In a resource-constrained environment I'd expect the error you're seeing to occur under load rather than on startup.

We can take a look at Dropwizard to see if there's anything we can tune. Although I wonder if your runner had a noisy neighbor?

@jsanda
Author

jsanda commented Sep 30, 2021 via email

@jdonenine

@jsanda @Miles-Garnsey Maybe what we can do is attempt these tests on the self-hosted runner we're going to set up, keep the file descriptor limits at the defaults, and see if we run into this problem there. That would help rule out the noisy neighbor problem, right?

@jsanda
Author

jsanda commented Sep 30, 2021

The file limit error has only happened a few times. We can run the test N times on a self-hosted runner without the error happening. That doesn't mean it won't happen, but it does give increased confidence. We have been deploying nodes with heaps configured as low as 256 MB and 384 MB. Surprisingly, that works fine a lot of the time, but we still hit issues too often. The issues are not limited to this open file limit error. The situation is like a game of Jenga :)

@ivansenic
Contributor

ivansenic commented Oct 1, 2021

@dougwettlaufer @jsanda How about creating a small fix here by adding an --enableBundlesWatch flag that would be false by default, so watching of bundles has to be explicitly enabled? Or vice versa.

Doug, I don't think we need bundle watching by default, but turning it off would be a breaking change. So we could also go with --disableBundlesWatch: keep the current behavior by default, but give anybody an option to avoid it.

UPDATE: an even simpler solution is to wrap that watchJarDirectory call in a try/catch, log the error with a note that bundles will not be watched, and continue loading Stargate.
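
A minimal sketch of that fallback, with assumed names rather than the actual Starter internals; the newWatchService() call is the one that fails in the stack trace above:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchService;

// Sketch of the proposed fail-soft behavior: if the watcher cannot be created,
// log the problem and keep starting instead of aborting.
public class BundleWatchFallback {

    static WatchService watchJarDirectory(Path dir) throws IOException {
        // This is where the inotify/file-descriptor limit surfaces as an IOException.
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    public static void main(String[] args) {
        Path jarDir = Paths.get(args.length > 0 ? args[0] : ".");
        try {
            watchJarDirectory(jarDir);
            System.out.println("Watching " + jarDir + " for bundle changes");
        } catch (IOException e) {
            // Log and continue; only hot reload is lost, startup proceeds.
            System.err.println("Cannot watch " + jarDir + "; bundle hot-reload disabled: " + e);
        }
    }
}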

@jdonenine

I'm just curious here, what is the bundle watching used for, @ivansenic?

@ivansenic
Contributor

I'm just curious here, what is the bundle watching used for, @ivansenic?

With OSGi you can replace bundles at runtime. Meaning you can drop a new version of a jar into the folder we are watching, and that specific bundle is updated to the new version while the process is running.
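
For illustration only (not the Stargate implementation; class and method names here are made up): once a directory watcher reports a new or changed jar, the standard OSGi API can swap that bundle in place while the framework keeps running:

import java.io.FileInputStream;
import java.io.InputStream;
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;

// Hypothetical helper showing runtime bundle replacement with plain OSGi APIs.
public class BundleReloader {

    private final BundleContext context;

    public BundleReloader(BundleContext context) {
        this.context = context;
    }

    // Invoked when the watcher sees a new or modified jar in the bundle directory.
    public void reload(String jarPath) throws Exception {
        String location = "file:" + jarPath;
        Bundle existing = context.getBundle(location);
        try (InputStream in = new FileInputStream(jarPath)) {
            if (existing != null) {
                existing.update(in);                 // replace a known bundle with the new version
            } else {
                context.installBundle(location, in); // or install it for the first time
            }
        }
    }
}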

@jdonenine

Makes sense. I guess I was more wondering whether that's something that is often done with Stargate? Just trying to gauge, for example, whether that's an option we'd want to see exposed through K8ssandra, or whether it would be sufficient to just turn it off by default when deployed through K8ssandra.

@ivansenic
Contributor

If you ask me, and you do 😄, I would say it should be turned off in Kubernetes. I mean, this is old tech, developed for monoliths, and this bundle reloading was a way to achieve something you would nowadays do in the cloud: you have a new version, no problem, deploy it. In fact, that's the whole benefit of cloud-native development, that you can deploy as many times as you want.

@dougwettlaufer
Contributor

Right, watching the directory is there to enable the hot-reload use case, which really doesn't apply in the cloud. How about we add both the --disableBundlesWatch flag and the try/catch, @ivansenic? That way the flag can prevent the error from ever happening, and if it isn't set, the try/catch avoids completely breaking startup.
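
A sketch of how those two pieces could fit together, reusing the watchJarDirectory sketch from the earlier comment; the flag wiring is an assumption, not necessarily what the eventual fix looks like:

import java.io.IOException;
import java.nio.file.Path;

// Hypothetical wiring: the flag skips the watcher entirely, and the try/catch
// keeps startup alive if watching is enabled but the limit is still hit.
public class StarterWatchWiring {

    // e.g. populated from a --disableBundlesWatch command-line option.
    private final boolean disableBundlesWatch;

    public StarterWatchWiring(boolean disableBundlesWatch) {
        this.disableBundlesWatch = disableBundlesWatch;
    }

    void startBundleWatcher(Path jarDirectory) {
        if (disableBundlesWatch) {
            return; // operator opted out, e.g. in Kubernetes deployments
        }
        try {
            BundleWatchFallback.watchJarDirectory(jarDirectory);
        } catch (IOException e) {
            // Fail soft: only hot reload is lost, startup continues.
            System.err.println("Bundle watching unavailable: " + e);
        }
    }
}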

@ivansenic ivansenic self-assigned this Oct 1, 2021
jsanda added a commit to jsanda/k8ssandra-operator that referenced this issue Oct 6, 2021
With a fix for stargate/stargate#1286 I think we
might be able to reduce resource requirements in our test fixtures. This will
be helpful for freeing up limited resources in the GHA free runner.
jsanda added a commit to k8ssandra/k8ssandra-operator that referenced this issue Oct 6, 2021
* reduce on and off heap memory for C* and Stargate

With a fix for stargate/stargate#1286 I think we
might be able to reduce resource requirements in our test fixtures. This will
be helpful for freeing up limited resources in the GHA free runner.

* add comment explaining the usage of the custom image