Strange workers behaviour #1010

Closed
dav009 opened this issue Aug 20, 2015 · 21 comments

Comments

@dav009

dav009 commented Aug 20, 2015

I've been using uwsgi for a while. Lately I'm seeing the following problem (to diagnose it I'm using uwsgitop):

  • my instance usually has 5-7 busy workers, serving roughly 50 to 70 requests per second, with CPU usage around 50-60%
  • at some point CPU usage drops to 20%, only 2 workers seem busy (via uwsgitop), and requests per second drop to 10-20. It is worth noticing that the Exceptions column shows thousands of exceptions, yet I can't see anything unusual in uwsgi.log. Is there any hint of where to find those exceptions?
  • the problem gets solved by restarting uwsgi.

Thank you in advance.

[screenshot: screen shot 2015-08-19 at 16 01 43]
in the screenshot: 527 and 598 exceptions.

Here is my .ini file

[uwsgi]
socket=/tmp/content_services.sock
wsgi-file=/mnt/code/current/content_services.wsgi
callable=app
protocol=uwsgi
master=true
single-interpreter=true
enable-threads=true
vacuum=true
processes=30
harakiri=45
listen=4096
pythonpath=/mnt/code/current/python_modules
uid=www-data
gid=
buffer-size=12992
no-orphans=true
disable-logging=false
logto=/var/log/cs/uwsgi.log
log-date=true
ignore-sigpipe=true
ignore-write-errors=true
log-maxsize=15000000
log-backupname=/var/log/cs/uwsgi.log.old

# Caching config

cache2=name=content,items=300,blocksize=655360,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=scrape,items=800,blocksize=655360,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=resolve,items=7000,blocksize=21846,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=metaknight,items=300,blocksize=327680,bitmap=1,ignore_full=1,purge_lru=1
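
Side note: uwsgitop reads from the uWSGI stats server, which is not part of the config above. A minimal sketch, assuming a socket path of /tmp/stats.sock, would be adding:

stats=/tmp/stats.sock

and attaching with:

uwsgitop /tmp/stats.sock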
@unbit
Owner

unbit commented Aug 20, 2015

That exceptions counter is pretty scary, especially because you do not have the exceptions in the logs.

I would attach strace to the busy processes to understand what is going on, and disable ignore-write-errors just to check whether it is an unreported network issue.
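
A typical invocation, with <worker-pid> standing in for the PID of a busy worker, would be something like:

strace -f -tt -p <worker-pid>

(-f follows threads, -tt adds microsecond timestamps).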

@dav009
Author

dav009 commented Aug 20, 2015

Thanks for the quick reply.
So could write errors give a hint about where the problem is?
I will try to reproduce the scenario and strace the process.

I was thinking of adding reload-on-exception or max-requests to force the bugged workers to start taking requests again. The funny thing is that the workers with exceptions are the ones taking requests; all the others do nothing.
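
For reference, both options would go in the [uwsgi] section; the values here are only illustrative:

reload-on-exception=true
max-requests=5000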

@unbit
Owner

unbit commented Aug 20, 2015

Yes, reload-on-exception would help, but honestly I think it would be better to understand what is going on :)

@dav009
Author

dav009 commented Aug 20, 2015

Definitely 👍 thanks for the support. I will come back with some straces.

@dav009
Author

dav009 commented Aug 20, 2015

I'm going to clean up some of the straces.
Here are some of them for one of the running workers when the problem occurs:
https://gist.github.com/dav009/acd076f94a2712bc78bc

@dav009
Author

dav009 commented Aug 22, 2015

Stracing some of the inactive workers, some show what seems to be normal output:

futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527335, {1440216849, 194353000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 194598}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527337, {1440216849, 199598000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 200773}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527339, {1440216849, 205773000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 206016}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527341, {1440216849, 211016000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 211259}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527343, {1440216849, 216259000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 216496}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1^CProcess 26240 detached

However, others just sit at:

Process 19028 attached
futex(0x7f49d51bc580, FUTEX_WAIT_PRIVATE, 2, NULL

@dav009
Author

dav009 commented Aug 25, 2015

So when the issue happens, I go around killing the workers which are idle and outputting:

futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843097, {1440516253, 399105000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 426563}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843099, {1440516253, 431563000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 433662}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843101, {1440516253, 438662000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 439742}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843103, {1440516253, 444742000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 445104}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843105, {1440516253, 450104000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 475187}, NULL) = 0
...

over and over.

After I kill each one, they become active and the RPS starts going up.
By the time I've killed all the idle ones, everything is back to normal.

@dav009
Author

dav009 commented Aug 25, 2015

When the number of workers is really low (e.g. 3), then right after the app starts, every worker's strace output looks like what the previous comment showed.
No requests are processed.

@dav009
Author

dav009 commented Aug 25, 2015

A healthy idle worker's output looks like:

Process 18544 attached
epoll_wait(11,

@dav009
Author

dav009 commented Aug 25, 2015

It turned out that pymongo's MongoClient is somehow related to the issue.
Explicitly passing MongoClient(..., max_pool_size=100) made the workers start without locking.

@unbit
Owner

unbit commented Aug 25, 2015

The strace you reported shows a GIL deadlock, and MongoClient is heavily thread based.

Maybe you initialize it before fork()?

@dav009
Author

dav009 commented Aug 25, 2015

I initialized MongoClient inside a with app.app_context(): block to share it among workers.

@unbit
Owner

unbit commented Aug 25, 2015

I am not sure that is supported by MongoClient; I fear the socket will be clobbered. I think initializing a new pool for each worker is the best approach.

@dav009
Author

dav009 commented Aug 25, 2015

I'm going to try it; I saw similar posts regarding uwsgi + gevent + pymongo.
Thanks again for the quick responses 👍

@dav009
Author

dav009 commented Aug 26, 2015

Quick summary & solution for anybody having a similar issue:

Stack: python, flask, uwsgi, pymongo

Symptoms

  • idle workers
  • strange straces (futex)
  • RPS low
  • particular workers (2 or 3) taking 90% of the requests

Cause of the problem

  • Pymongo was initialized in app_context inside the flask app.
  • Sharing the MongoClient created idle workers from the very start of the server; this can be checked by stracing the server right after startup.
    The few healthy workers will eventually become idle, leaving a restart as the only option.

Solution

After trying various things/settings, the only way to avoid it is to create a MongoClient per worker/request (see the sketch below).

Probably a heads up for any other lib using threads.
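
A minimal sketch of the per-worker approach, assuming Flask and the uwsgidecorators module shipped with uWSGI (names such as MONGO_URI and the /ping route are illustrative, not taken from my app):

# Each worker builds its own MongoClient after fork(), so no connection pool
# or lock state is inherited from the master process.
from flask import Flask
from pymongo import MongoClient

MONGO_URI = 'mongodb://127.0.0.1:27017'  # illustrative

app = Flask(__name__)
app.mongo = None  # filled in per worker

try:
    from uwsgidecorators import postfork  # only available when running under uWSGI

    @postfork
    def _connect_mongo():
        # runs once in every worker, right after fork()
        app.mongo = MongoClient(MONGO_URI, max_pool_size=100)
except ImportError:
    # plain dev server: connect eagerly
    app.mongo = MongoClient(MONGO_URI, max_pool_size=100)

@app.route('/ping')
def ping():
    app.mongo.admin.command('ping')  # touches the per-worker connection pool
    return 'ok'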

dav009 closed this as completed Aug 26, 2015
@unbit
Owner

unbit commented Aug 26, 2015

Just as a note: if you do not want to change your app code, you can use the --lazy-apps option, which will load the app once per worker.
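
In the .ini above that is just one extra line, e.g.:

lazy-apps=true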

@ryankuczka

I am having a similar issue that appears to be caused by pymongo; however, I cannot figure out why. Workers slowly start hanging, with no exceptions and no responses. They just seem to stop, and once all the workers reach this state my app goes down, since there are no workers left to respond to requests.

I am using the following:

uwsgi==2.0.12
Django==1.7.10
mongoengine==0.8.7
pymongo==2.7.2

Using strace on a hanging worker shows only:

futex(0x36e7210, FUTEX_WAIT_PRIVATE, 0, NULL

This is exactly the same for every hanging worker, and using gdb I discovered the following:

(gdb) py-bt
#4 Frame 0x39b4920, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/thread_util.py, line 235, in acquire (self=<BoundedSemaphore(_value=94, _initial_value=100, _cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x3914070>, acquire=<built-in method acquire of thread.lock object at remote 0x3914070>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x3914070>) at remote 0x39435d0>) at remote 0x3940d40>, blocking=True, timeout=None, rc=False, endtime=None)
    self._cond.acquire()
#8 Frame 0x5f373a0, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/pool.py, line 323, in get_socket (self=<Pool(ssl_certfile=None, ssl_cert_reqs=None, conn_timeout=None, ssl_ca_certs=None, lock=<thread.lock at remote 0x3914150>, _socket_semaphore=<BoundedSemaphore(_value=94, _initial_value=100, _cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x3914070>, acquire=<built-in method acquire of thread.lock object at remote 0x3914070>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x3914070>) at remote 0x39435d0>) at remote 0x3940d40>, ssl_keyfile=None, pid=19943, pool_id=0, _ident=<ThreadIdent(_lock=<DummyLock at remote 0x3943710>, _refs={}, _local=<thread._local at remote 0x3916f50>) at remote 0x3943790>, wait_queue_multiple=None, net_timeout=None, use_ssl=None, pair=(u'mongossd.srv.myapp.com', 27017), wait_queue_timeout=None, _check_interval_seconds=1, _request_counter=<Counter(_counters={}, ident=<Threa...(truncated)
    elif not self._socket_semaphore.acquire(True, self.wait_queue_timeout):
#12 Frame 0x59db940, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/mongo_replica_set_client.py, line 1299, in __socket (self=<MongoReplicaSetClient(_MongoReplicaSetClient__use_greenlets=False, _BaseObject__write_concern=<WriteConcern at remote 0x38fb580>, _MongoReplicaSetClient__wait_queue_timeout=None, _MongoReplicaSetClient__monitor_class=<type at remote 0x37ae8a0>, _MongoReplicaSetClient__seeds=set([(...), (...)]), _MongoReplicaSetClient__name='rs_myapp', _MongoReplicaSetClient__document_class=<type at remote 0x7f6fe0af4e60>, _MongoReplicaSetClient__ssl_ca_certs=None, _MongoReplicaSetClient__net_timeout=None, _MongoReplicaSetClient__index_cache={myapp': {u'user': {u'_cls_1': <datetime.datetime at remote 0x719ba08>, u'username_1': <datetime.datetime at remote 0x71d50a8>}, u'django_session': {u'expire_date_1': <datetime.datetime at remote 0x719bf80>}}}, _BaseObject__safe=True, _MongoReplicaSetClient__opts={'replicaset': 'rs_myapp', 'read_preference': False}, _MongoRep...(truncated)
    sock_info = member.pool.get_socket(force=force)

Thus, I am fairly certain the issue is due to sharing a single pymongo client across all the workers. I am using the --lazy-apps option to load the app in the workers. I have added print statements around where the pymongo clients get created by mongoengine in my Django settings file, like so:

from mongoengine import connect, connection
print 'Existing Connections:', connection._connections
if MONGO_REPLICA:
    connect(MONGO_DB, host=MONGO_IP, replicaSet=MONGO_REPLICA)
else:
    connect(MONGO_DB, port=MONGO_PORT, host=MONGO_IP)
print 'New Connections:', connection._connections
print 'Module ID:', id(connection)
print 'Connection Dict ID:', id(connection._connections)
print 'New Connection ID:', id(connection._connections['default'])

This prints the following when starting uwsgi (I have truncated it, but it is the same for every worker):

Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
...

My guess is that it has something to do with the pymongo client being stored in a module-level global dictionary in mongoengine, but I do not understand why every uwsgi process is using the same instance of the module, dictionary, and client (as suggested by the identical memory addresses).

I am running uwsgi with the following options:

/home/myapp/myapp/venv/bin/uwsgi \
    --socket=/tmp/django.sock \
    --listen=512 \
    --logto=/var/log/myapp/uwsgi.log \
    --chmod-socket=660 \
    --chown-socket=myapp:www-data \
    --chdir=/home/myapp/myapp/django/myapp \
    --wsgi-file=wsgi.py \
    --touch-reload=/home/myapp/myapp/bin/uwsgi-reload \
    --master \
    --processes=10 \
    --die-on-term \
    --lazy-apps \
    --buffer-size=32768 \
    --post-buffering=1

@unbit
Owner

unbit commented Apr 18, 2017

Hi, try adding --enable-threads; from the strace (and if I remember correctly) the MongoClient spawns a thread in the background.
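
That is, appending one more flag to the invocation above (sketch only, the ellipsis stands for the existing options):

/home/myapp/myapp/venv/bin/uwsgi \
    ... \
    --lazy-apps \
    --enable-threads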

@ryankuczka

Hmm, adding --enable-threads still results in the same module and MongoClient IDs for every process. Is that an actual issue, or is it possibly a red herring? Upon further investigation, it appears the same thing occurs for every module, so I think that might be expected...

I haven't been able to reliably reproduce the actual hanging of the processes, as it seems to occur randomly, so I'll try letting it run for a while with --enable-threads to see if it still hangs. Thanks for the tip!
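
For what it's worth, a quick standalone check that matching id() values across forked processes don't by themselves imply a shared object (id() is just an address inside each process's own memory space):

# After fork() the child sees the same id() for an object created beforehand,
# yet mutating it in the child does not affect the parent: the processes do
# not actually share the object.
import os

connections = {'default': 'client'}  # stands in for mongoengine's module-level dict
print('parent id: %d' % id(connections))

pid = os.fork()
if pid == 0:  # child process
    print('child id:  %d' % id(connections))  # same number as in the parent
    connections['default'] = 'changed'        # invisible to the parent
    os._exit(0)

os.waitpid(pid, 0)
print('parent still sees: %r' % connections)  # {'default': 'client'}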

@ryankuczka

Hi @unbit, I am still seeing my application hang on connections with the MongoClient. Do you have any other insight into why every uwsgi process uses the same memory address? When running the Django development server I see the following:

(venv) $ python manage.py runserver 0.0.0.0:8888


Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 53909104
Dict ID: 53746048
New Connection ID: 55134352
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48065064
Dict ID: 47747152
New Connection ID: 49027856

Performing system checks...

System check identified no issues (1 silenced).
April 19, 2017 - 16:22:17
Django version 1.7.10, using settings 'settings'
Starting development server at http://0.0.0.0:8888/
Quit the server with CONTROL-C.

I'll admit I'm not entirely familiar with how multiple processes/threads are handled, or how what the development server does differs from uwsgi, but I do know there are 2 processes for the development server, and each has a different instance of the module/dict/MongoClient, which is what I would expect from uwsgi as well...

@bgehman

bgehman commented Jan 10, 2019

I stumbled into this after trying to figure out a phantom, consistent ~3% CPU usage from my supposedly idle Docker container (which has python, uwsgi, and the pymongo pip module).

strace output for just a couple of seconds (against one of the workers) shows futex in a tight error loop.

$ strace -cfp 16

strace: Process 16 attached with 3 threads
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.190000          99      1902      1897 futex
  0.00    0.000000           0         2         2 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.190000                  1904      1899 total

The app works great, except for the phantom CPU usage leak. Adding enable-threads seems to have fixed it. Note that lazy-apps did not. I will likely report this issue against pymongo.
