
Stargate fails to start because of too many open files #1286

Closed
jsanda opened this issue Sep 29, 2021 · 11 comments · Fixed by #1300
Comments


jsanda commented Sep 29, 2021

I am running Stargate 1.0.31 in Kubernetes with K8ssandra. The Stargate image used is stargateio/stargate-3_11:v1.0.31. In one of our automated tests we have seen Stargate fail to start a few times with a resource limit error like this:

INFO  [main] 2021-09-28 21:25:10,597 AbstractConnector.java:331 - Started Server@29862f53{HTTP/1.1, (http/1.1)}{0.0.0.0:8082}
INFO  [main] 2021-09-28 21:25:10,598 Server.java:415 - Started @80358ms
INFO  [main] 2021-09-28 21:25:10,599 BaseActivator.java:185 - Started restapi
Finished starting bundles.
Unexpected error: java.io.IOException: User limit of inotify instances reached or too many open files
java.lang.RuntimeException: java.io.IOException: User limit of inotify instances reached or too many open files
        at io.stargate.starter.Starter.watchJarDirectory(Starter.java:539)
        at io.stargate.starter.Starter.start(Starter.java:441)
        at io.stargate.starter.Starter.cli(Starter.java:619)
        at io.stargate.starter.Starter.main(Starter.java:660)
Caused by: java.io.IOException: User limit of inotify instances reached or too many open files
        at sun.nio.fs.LinuxWatchService.<init>(LinuxWatchService.java:64)
        at sun.nio.fs.LinuxFileSystem.newWatchService(LinuxFileSystem.java:47)
        at io.stargate.starter.Starter.watchJarDirectory(Starter.java:526)
        ... 3 more

This is in a CI environment with limited CPU/memory resources. The test is running in the free-tier runner in GitHub Actions. The runner VM has 2 CPUs and 7 GB of memory. The particular test in which this failed had already deployed two Cassandra nodes and one Stargate node. This failure is from the second Stargate node.

I believe the open file limit on the VM is set to 65536. I don't think I am able to increase it. Maybe the solution is to run my tests in an environment with more resources, but it would be nice if Stargate could be less demanding, especially considering this happens on startup.
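
For context on the error text: "User limit of inotify instances reached or too many open files" is typically raised when the JDK cannot create a new inotify instance, which can happen either because the per-user cap on inotify instances (fs.inotify.max_user_instances, commonly 128 by default) is reached or because the process is out of file descriptors, so the 65536 open-file limit is not necessarily the one being exhausted. A minimal Linux-only diagnostic sketch (hypothetical, not part of Stargate) that prints both values from inside the container:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical diagnostic, not Stargate code: print the two limits that can
// produce "User limit of inotify instances reached or too many open files".
public class InotifyLimitCheck {
    public static void main(String[] args) throws Exception {
        // Per-user cap on inotify instances; LinuxWatchService allocates one per WatchService.
        Path maxInstances = Paths.get("/proc/sys/fs/inotify/max_user_instances");
        System.out.println("fs.inotify.max_user_instances = "
                + new String(Files.readAllBytes(maxInstances)).trim());
        // File descriptors currently held by this JVM process.
        try (Stream<Path> fds = Files.list(Paths.get("/proc/self/fd"))) {
            System.out.println("open fds in this process = " + fds.count());
        }
    }
}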

@dougwettlaufer
Contributor

Huh, well that's a new one. We run some things in the GitHub Actions free tier as well without a problem, granted it is with fewer nodes.

The odd thing is that file descriptors shouldn't be heavily consumed until the services start taking traffic. In a resource-constrained environment I'd expect the error you're seeing to occur under load rather than on startup.

We can take a look at Dropwizard to see if there's anything we can tune. Although I wonder if your runner had a noisy neighbor?

@jsanda
Author

jsanda commented Sep 30, 2021 via email

@jdonenine

@jsanda @Miles-Garnsey Maybe what we can do is attempt these tests on the self-hosted runner we're going to set up, keep the file descriptor limits at the defaults, and see if we run into this problem there. That would help rule out the noisy neighbor problem, right?

@jsanda
Author

jsanda commented Sep 30, 2021

The file limit error has only happened a few times. We can run the test N times on a self-hosted runner without the error happening. That doesn't mean it won't happen, but it does give increased confidence. We have been deploying nodes with heaps configured as low as 256 MB and 384 MB. Surprisingly, that works fine a lot of the time, but we still hit issues too often. The issues are not limited to this open file limit error. The situation is like a game of Jenga :)

@ivansenic
Contributor

ivansenic commented Oct 1, 2021

@dougwettlaufer @jsanda How about creating a small fix here by adding an --enableBundlesWatch flag that would be false by default, so watching of bundles has to be explicitly enabled? Or vice versa.

Doug, I don't think we need bundle watching by default, but turning it off would be a breaking change. So we could also go with --disableBundlesWatch: keep the current behavior by default, but give anybody an option to avoid it.

UPDATE: an even simpler solution is to wrap that watchJarDirectory call in a try/catch, log the error with a note that bundles will not be watched, and continue loading Stargate.
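
A minimal sketch of that fallback, with assumed names rather than the actual Starter internals; the newWatchService() call is the one that fails in the stack trace above:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchService;

// Sketch of the proposed fail-soft behavior: if the watcher cannot be created,
// log the problem and keep starting instead of aborting.
public class BundleWatchFallback {

    static WatchService watchJarDirectory(Path dir) throws IOException {
        // This is where the inotify/file-descriptor limit surfaces as an IOException.
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    public static void main(String[] args) {
        Path jarDir = Paths.get(args.length > 0 ? args[0] : ".");
        try {
            watchJarDirectory(jarDir);
            System.out.println("Watching " + jarDir + " for bundle changes");
        } catch (IOException e) {
            // Log and continue; only hot reload is lost, startup proceeds.
            System.err.println("Cannot watch " + jarDir + "; bundle hot-reload disabled: " + e);
        }
    }
}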

@jdonenine

I'm just curious here, what is the bundle watching used for, @ivansenic?

@ivansenic
Contributor

I'm just curious here, what is the bundle watching used for, @ivansenic?

With OSGi you can replace bundles at runtime. Meaning you can drop a new version of a jar into the folder we are watching, and that specific bundle is updated to the new version while the process is running.
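
For illustration only (not the Stargate implementation; class and method names here are made up): once a directory watcher reports a new or changed jar, the standard OSGi API can swap that bundle in place while the framework keeps running:

import java.io.FileInputStream;
import java.io.InputStream;
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;

// Hypothetical helper showing runtime bundle replacement with plain OSGi APIs.
public class BundleReloader {

    private final BundleContext context;

    public BundleReloader(BundleContext context) {
        this.context = context;
    }

    // Invoked when the watcher sees a new or modified jar in the bundle directory.
    public void reload(String jarPath) throws Exception {
        String location = "file:" + jarPath;
        Bundle existing = context.getBundle(location);
        try (InputStream in = new FileInputStream(jarPath)) {
            if (existing != null) {
                existing.update(in);                 // replace a known bundle with the new version
            } else {
                context.installBundle(location, in); // or install it for the first time
            }
        }
    }
}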

@jdonenine

Makes sense. I guess I was more wondering whether that's something that is often done with Stargate? Just trying to gauge, for example, whether that's an option we'd want to see exposed through K8ssandra, or whether it would be sufficient to just turn it off by default when deployed through K8ssandra.

@ivansenic
Contributor

If you ask me, and you do 😄, I would say it should be turned off in Kubernetes. I mean, this is old tech, developed for monoliths, and this bundle reloading was a way to achieve something you would nowadays do in the cloud: you have a new version, no problem, deploy it. In fact, that's the whole benefit of cloud-native development, that you can deploy as many times as you want.

@dougwettlaufer
Contributor

Right, watching the directory is there to enable the hot-reload use case, which really doesn't apply in the cloud. How about we add both the --disableBundlesWatch flag and the try/catch, @ivansenic? That way the flag can prevent the error from ever happening, and if it isn't set, the try/catch avoids completely breaking startup.
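
A sketch of how those two pieces could fit together, reusing the watchJarDirectory sketch from the earlier comment; the flag wiring is an assumption, not necessarily what the eventual fix looks like:

import java.io.IOException;
import java.nio.file.Path;

// Hypothetical wiring: the flag skips the watcher entirely, and the try/catch
// keeps startup alive if watching is enabled but the limit is still hit.
public class StarterWatchWiring {

    // e.g. populated from a --disableBundlesWatch command-line option.
    private final boolean disableBundlesWatch;

    public StarterWatchWiring(boolean disableBundlesWatch) {
        this.disableBundlesWatch = disableBundlesWatch;
    }

    void startBundleWatcher(Path jarDirectory) {
        if (disableBundlesWatch) {
            return; // operator opted out, e.g. in Kubernetes deployments
        }
        try {
            BundleWatchFallback.watchJarDirectory(jarDirectory);
        } catch (IOException e) {
            // Fail soft: only hot reload is lost, startup continues.
            System.err.println("Bundle watching unavailable: " + e);
        }
    }
}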

@ivansenic ivansenic self-assigned this Oct 1, 2021
jsanda added a commit to jsanda/k8ssandra-operator that referenced this issue Oct 6, 2021
With a fix for stargate/stargate#1286 I think we
might be able to reduce resource requirements in our test fixtures. This will
be helpful for freeing up limited resources in the GHA free runner.
jsanda added a commit to k8ssandra/k8ssandra-operator that referenced this issue Oct 6, 2021
* reduce on and off heap memory for C* and Stargate

With a fix for stargate/stargate#1286 I think we
might be able to reduce resource requirements in our test fixtures. This will
be helpful for freeing up limited resources in the GHA free runner.

* add comment explaining the usage of the custom image