Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler stalls #312

Closed
realhik opened this issue Mar 3, 2019 · 10 comments
Closed

Scheduler stalls #312

realhik opened this issue Mar 3, 2019 · 10 comments
Labels
plugins Issue relates to a plugin

Comments

@realhik
Copy link

realhik commented Mar 3, 2019

I am using SmartHomeNG and plugin version 1.5.1.

In the last weeks I observed a stalled scheduler queue for a few times. The system lags and execution times in the scheduler queue point to a timestamp in the past. In this case I have to restart SmarthomeNG to fix this.

In the smarthome.log I found the following lines which are probably related. The oldest entries in the scheduler queue point to 2019-03-02 11:48:44:

2019-03-02 11:46:47 ERROR sonos_discover_scheduler Method sonos_discover_scheduler exception: can't start new thread Traceback (most recent call last): File "/usr/local/smarthome/lib/scheduler.py", line 517, in _task obj() File "/usr/local/smarthome/plugins/sonos/__init__.py", line 2797, in _discover sonos_speaker[uid].check_subscriptions() File "/usr/local/smarthome/plugins/sonos/__init__.py", line 404, in check_subscriptions self.zone_subscription.subscribe() File "/usr/local/smarthome/plugins/sonos/__init__.py", line 178, in subscribe self._thread.start() File "/usr/lib/python3.5/threading.py", line 844, in start _start_new_thread(self._bootstrap, ()) RuntimeError: can't start new thread

and a few lines later:

2019-03-02 11:47:34 ERROR Scheduler Needing more worker threads than the specified maximum of 20!

It seems that the sonos plugin is the origin of this issue and the thread limit is only a subsequent problem. Any idea why this happens and how to fix it?

@aschwith
Copy link
Contributor

aschwith commented Mar 3, 2019

The latest version of the sonos plugin is subject to a memory leak coming from the SoCo framework. As far as I know, it has not been adressed yet. See more details here:
SoCo/SoCo#653

@realhik
Copy link
Author

realhik commented Mar 4, 2019

Yesterday I increased the maximum number of threads to 30. Today, I got ERROR Scheduler Needing more worker threads than the specified maximum of 30! again but without the Sonos error. It seems that the issue is not related to that plugin (in contrast to my #1 assumption).

I have now turned debugging on and hope that the problem can be reproduced soon.

@msinn
Copy link
Member

msinn commented Mar 5, 2019

The scheduler is only the messenger of the bad news, not the source (what the title of your issue implys). The source has to be searched in one of the plugins you use.

  • You need to specify which plugins you use.
  • Further it could be helpful to know the list of running threads (from the threads page of the backend plugin)

@realhik
Copy link
Author

realhik commented Mar 6, 2019

Plugins in use:

  • knx
  • BackendServer
  • thz (in use for 2 years)
  • websocket
  • sql
  • smartvisu
  • cli
  • Sonos (in use for months)
  • GPIO (in use for months)
  • telegram (added a few weeks ago)

I have been using THZ & GPIO for months now without problems. Problems occurred with update to 1.5.1. But maybe it could also be the Telegram plugin which I added a few days before the update.

I'll disable the Telegram plugin now. If it is the reason for the issue then the daily scheduler stalls should disappear. For completeness here the active threads according to the log file:

2019-03-04  19:37:19 ERROR    Scheduler    Needing more worker threads than the specified maximum of 30!
2019-03-04  19:37:19 INFO     Scheduler    Threads: 
762d6382-c969-46db-a365-8937ea08f7f5_0: 1, 
762d6382-c969-46db-a365-8937ea08f7f5_1: 1, 
762d6382-c969-46db-a365-8937ea08f7f5_2: 1, 
762d6382-c969-46db-a365-8937ea08f7f5_3: 1, 
CP Server Thread-10: 1, 
CP Server Thread-11: 1, 
CP Server Thread-12: 1, 
CP Server Thread-13: 1, 
CP Server Thread-14: 1, 
CP Server Thread-16: 1, 
CP Server Thread-17: 1, 
CP Server Thread-18: 1, 
CP Server Thread-19: 1
CP Server Thread-20: 1, 
CP Server Thread-21: 1, 
CP Server Thread-22: 1, 
CP Server Thread-23: 1, 
CP Server Thread-24: 1, 
CP Server Thread-25: 1, 
CP Server Thread-7: 1, 
CP Server Thread-8: 1, 
CP Server Thread-9: 1, 
dispatcher: 1, 
Dummy-42: 1, 
HTTPServer Thread-15: 1, 
HTTPServer Thread-6: 1, 
job_queue: 1, 
Main: 1, 
QueueFeederThread: 2, 
Scheduler: 1, 
Thread-26: 1, 
Thread-33: 1, 
Thread-34: 1, 
Thread-35920: 1, 
Thread-35921: 1, 
Thread-35923: 1, 
Thread-35924: 1, 
Thread-35926: 1, 
Thread-35927: 1, 
Thread-35928: 1, 
Thread-35931: 1, 
Thread-35938: 1, 
Thread-35939: 1, 
Thread-35941: 1, 
Thread-35942: 1, 
Thread-35944: 1, 
Thread-35945: 1, 
Thread-35949: 1,
Thread-35951: 1, 
Thread-35953: 1, 
Thread-35954: 1, 
Thread-67: 1, 
Thread-68: 1, 
THZ PortHandler: 1, 
THZ: 30, 
thzServer: 1, 
updater: 1, 

@bmxp
Copy link
Member

bmxp commented Mar 7, 2019

I've taken your posts and sorted the threads in the message to get a better overview.
Telegram has at least 7 or 8 threads that are not named beautifully at the moment (See python-telegram-bot/python-telegram-bot#1332)
Telegram should also be responsible for a delay in shutdown shng down to a kill. It depends a little on your timeout in plugin.

The other Thread-xyz will mostly be originated by sonos plugin as @aschwith already mentioned above or by THZ plugin.

@ohinckel
Copy link
Member

ohinckel commented Mar 7, 2019

@realhik, is this the first log entry about reaching the maximum number of threads or just the last one? Usually this message is logged whenever a new thread is created but maximum already reached.

Is there any reason why the THZ plugin has 30 threads? Currently there's no implementation available in the plugins repo to review the code, but I found an old PR smarthomeNG/plugins#116.

Maybe something changed in the core/scheduler which makes the plugin creating more threads then expected.

@realhik
Copy link
Author

realhik commented Mar 7, 2019

@ohinckel, @bmxp, thank you for your hints.

My log snippets each consider the first entry after a restart of SmarthomeNG. There are no peculiar entries around this log entry. However, I will try to debug the THZ Plugin which can be reviewed at github. Btw, I am user and not really maintainer of this helpful plugin.

Just to understand the scheduler data: THZ: 30 means that there is a scheduler entry named THZ with 30 threads? And Thread-26: 1, Thread-33: 1, etc. means that Sonos created multiple scheduler entries with a single thread each?

@realhik
Copy link
Author

realhik commented Mar 17, 2019

Ten days ago, I disabled the Telegram plugin and the problem occurred again yesterday. As I increased the number of threads the log file showed a huge number of scheduler entries: THZ: 98.

Above I gave the link to the source code of the THZ plugin. According to the method run of class THZ in __init__.py there is only one thread added to the scheduler queue. I could not find any other places where it is added again to the scheduler.

As an intermediate result, it seems that the THZ plugin or the maybe the scheduler is responsible. I'll try figure it out. Any hints are welcome

@msinn
Copy link
Member

msinn commented Mar 18, 2019

Since the problem hits only you, I bet it is the THZ plugin (99.5% chance).

Usually in plugins schedulers are added and the threads finish after thy have done their work. If needed, a new scheduler is added.

The HUE plugin had the problem, that threads that encountered an error situation didn't terminate, so threads got created and created and created ...

@realhik
Copy link
Author

realhik commented Mar 26, 2019

Yes, it is the THZ plugin and the reason that the number of threads increases has something to do with a broken serial communication. Usually, there subsequent requests and answers in the logs, e.g.:

2019-03-26  08:03:50 DEBUG    THZ PortHandler Tx: 02 01 00 88 0a 05 78 10 03
2019-03-26  08:03:50 DEBUG    THZ          Rx: 01 00 78 0a 05 78 00 f0
2019-03-26  08:03:50 DEBUG    THZ          param = p39FanStage3AirIn (0A0578)
2019-03-26  08:03:50 DEBUG    THZ PortHandler Tx: 02 01 00 1b 0b 01 0e 10 03
2019-03-26  08:03:50 DEBUG    THZ          Rx: 01 00 31 0b 01 0e 00 16
...

However, after one point in time I could observe only requests every one second:

2019-03-26  08:04:22 DEBUG    THZ PortHandler Tx: 02 01 00 58 0a 02 4b 10 03
2019-03-26  08:04:23 DEBUG    THZ PortHandler Tx: 02 01 00 1e 0a 00 13 10 03
2019-03-26  08:04:24 DEBUG    THZ PortHandler Tx: 02 01 00 81 0a 05 71 10 03

Polling all registers with a one second timeout takes longer than the cyclic execution of 30 seconds and the number of threads increase. So, @msinn, you were totally right.

Btw, this malfunction will also affect other plugins as the maximum number of threads exceed. Maybe there is a way to make the scheduler more robust in case of erroneous plugin code...

@msinn msinn added the plugins Issue relates to a plugin label Mar 27, 2019
@msinn msinn closed this as completed May 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plugins Issue relates to a plugin
Projects
None yet
Development

No branches or pull requests

5 participants