Strange workers behaviour #1010

Closed
dav009 opened this issue Aug 20, 2015 · 21 comments

Comments

@dav009

dav009 commented Aug 20, 2015

I've been using uwsgi for a while. Lately I'm seeing the following problem (to diagnose it I'm using uwsgitop):

  • my instance usually has 5-7 busy workers, serving roughly 50 to 70 requests per second, with CPU usage around 50-60%
  • at some point CPU usage drops to 20%, only 2 workers seem busy (via uwsgitop), and requests per second drop to 10-20. It is worth noticing that the Exceptions column shows thousands of exceptions, yet I can't see anything unusual in uwsgi.log. Is there any hint of where to find those exceptions?
  • the problem gets solved by restarting uwsgi.

Thank you in advance.

[screenshot: screen shot 2015-08-19 at 16 01 43]
in the screenshot: 527 and 598 exceptions.

Here is my .ini file

[uwsgi]
socket=/tmp/content_services.sock
wsgi-file=/mnt/code/current/content_services.wsgi
callable=app
protocol=uwsgi
master=true
single-interpreter=true
enable-threads=true
vacuum=true
processes=30
harakiri=45
listen=4096
pythonpath=/mnt/code/current/python_modules
uid=www-data
gid=
buffer-size=12992
no-orphans=true
disable-logging=false
logto=/var/log/cs/uwsgi.log
log-date=true
ignore-sigpipe=true
ignore-write-errors=true
log-maxsize=15000000
log-backupname=/var/log/cs/uwsgi.log.old

# Caching config

cache2=name=content,items=300,blocksize=655360,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=scrape,items=800,blocksize=655360,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=resolve,items=7000,blocksize=21846,bitmap=1,ignore_full=1,purge_lru=1
cache2=name=metaknight,items=300,blocksize=327680,bitmap=1,ignore_full=1,purge_lru=1
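
Side note: uwsgitop reads from the uWSGI stats server, which is not part of the config above. A minimal sketch, assuming a socket path of /tmp/stats.sock, would be adding:

stats=/tmp/stats.sock

and attaching with:

uwsgitop /tmp/stats.sock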
@unbit
Owner

unbit commented Aug 20, 2015

That exceptions counter is pretty scary, especially because you do not have the exceptions in the logs.

I would attach strace to the busy processes to understand what is going on, and disable ignore-write-errors just to check whether it is an unreported network issue.
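
A typical invocation, with <worker-pid> standing in for the PID of a busy worker, would be something like:

strace -f -tt -p <worker-pid>

(-f follows threads, -tt adds microsecond timestamps).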

@dav009
Author

dav009 commented Aug 20, 2015

Thanks for the quick reply.
So could write errors give a hint about where the problem is?
I will try to reproduce the scenario and strace the process.

I was thinking of adding reload-on-exception or max-requests to force the bugged workers to start taking requests again. The funny thing is that the workers with exceptions are the ones taking requests; all the others do nothing.
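
For reference, both options would go in the [uwsgi] section; the values here are only illustrative:

reload-on-exception=true
max-requests=5000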

@unbit
Owner

unbit commented Aug 20, 2015

Yes, reload-on-exception would help, but honestly I think it would be better to understand what is going on :)

@dav009
Author

dav009 commented Aug 20, 2015

Definitely 👍 thanks for the support. I will come back with some straces.

@dav009
Author

dav009 commented Aug 20, 2015

I'm going to clean up some of the straces.
Here are some of them for one of the running workers when the problem occurs:
https://gist.github.com/dav009/acd076f94a2712bc78bc

@dav009
Author

dav009 commented Aug 22, 2015

Stracing some of the inactive workers, some show what seems to be normal output:

futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527335, {1440216849, 194353000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 194598}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527337, {1440216849, 199598000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 200773}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527339, {1440216849, 205773000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 206016}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527341, {1440216849, 211016000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 211259}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc3c8c62504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 527343, {1440216849, 216259000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440216849, 216496}, NULL) = 0
futex(0x7fc3c8c62580, FUTEX_WAKE_PRIVATE, 1^CProcess 26240 detached

However, others just sit at:

Process 19028 attached
futex(0x7f49d51bc580, FUTEX_WAIT_PRIVATE, 2, NULL

@dav009
Author

dav009 commented Aug 25, 2015

So when the issue happens, I go around killing the workers which are idle and outputting:

futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843097, {1440516253, 399105000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 426563}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843099, {1440516253, 431563000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 433662}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843101, {1440516253, 438662000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 439742}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843103, {1440516253, 444742000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 445104}, NULL) = 0
futex(0x7fc4b137b580, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fc4b137b504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 157843105, {1440516253, 450104000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1440516253, 475187}, NULL) = 0
...

over and over.

After I kill each one, they become active and the RPS starts going up.
By the time I've killed all the idle ones, everything is back to normal.

@dav009
Author

dav009 commented Aug 25, 2015

When the number of workers is really low (e.g. 3), then right after the app starts, every worker's strace output looks like what the previous comment showed.
No requests are processed.

@dav009
Author

dav009 commented Aug 25, 2015

A healthy idle worker's output looks like:

Process 18544 attached
epoll_wait(11,

@dav009
Author

dav009 commented Aug 25, 2015

It turned out that pymongo's MongoClient is somehow related to the issue.
Explicitly passing MongoClient(..., max_pool_size=100) made the workers start without locking.

@unbit
Owner

unbit commented Aug 25, 2015

The strace you reported shows a GIL deadlock, and MongoClient is heavily thread based.

Maybe you initialize it before fork()?

@dav009
Author

dav009 commented Aug 25, 2015

I initialized MongoClient inside a with app.app_context(): block to share it among workers.

@unbit
Owner

unbit commented Aug 25, 2015

I am not sure that is supported by MongoClient; I fear the socket will be clobbered. I think initializing a new pool for each worker is the best approach.

@dav009
Author

dav009 commented Aug 25, 2015

I'm going to try it; I saw similar posts regarding uwsgi + gevent + pymongo.
Thanks again for the quick responses 👍

@dav009
Author

dav009 commented Aug 26, 2015

Quick summary & solution for anybody having a similar issue:

Stack: python, flask, uwsgi, pymongo

Symptoms

  • idle workers
  • strange straces (futex)
  • RPS low
  • particular workers (2 or 3) taking 90% of the requests

Cause of the problem

  • Pymongo was initialized in app_context inside the flask app.
  • Sharing the MongoClient created idle workers from the very start of the server; this can be checked by stracing the server right after startup.
    The few healthy workers will eventually become idle, leaving a restart as the only option.

Solution

After trying various things/settings, the only way to avoid it is to create a MongoClient per worker/request (see the sketch below).

Probably a heads up for any other lib using threads.
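
A minimal sketch of the per-worker approach, assuming Flask and the uwsgidecorators module shipped with uWSGI (names such as MONGO_URI and the /ping route are illustrative, not taken from my app):

# Each worker builds its own MongoClient after fork(), so no connection pool
# or lock state is inherited from the master process.
from flask import Flask
from pymongo import MongoClient

MONGO_URI = 'mongodb://127.0.0.1:27017'  # illustrative

app = Flask(__name__)
app.mongo = None  # filled in per worker

try:
    from uwsgidecorators import postfork  # only available when running under uWSGI

    @postfork
    def _connect_mongo():
        # runs once in every worker, right after fork()
        app.mongo = MongoClient(MONGO_URI, max_pool_size=100)
except ImportError:
    # plain dev server: connect eagerly
    app.mongo = MongoClient(MONGO_URI, max_pool_size=100)

@app.route('/ping')
def ping():
    app.mongo.admin.command('ping')  # touches the per-worker connection pool
    return 'ok'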

dav009 closed this as completed Aug 26, 2015
@unbit
Owner

unbit commented Aug 26, 2015

Just as a note: if you do not want to change your app code, you can use the --lazy-apps option, which will load the app once per worker.
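
In the .ini above that is just one extra line, e.g.:

lazy-apps=true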

@ryankuczka

I am having a similar issue that appears to be caused by pymongo; however, I cannot figure out why. Workers slowly start hanging, with no exceptions and no responses. They just seem to stop, and once all the workers reach this state my app goes down, since there are no workers left to respond to requests.

I am using the following:

uwsgi==2.0.12
Django==1.7.10
mongoengine==0.8.7
pymongo==2.7.2

Using strace on a hanging worker shows only:

futex(0x36e7210, FUTEX_WAIT_PRIVATE, 0, NULL

This is exactly the same for every hanging worker, and using gdb I discovered the following:

(gdb) py-bt
#4 Frame 0x39b4920, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/thread_util.py, line 235, in acquire (self=<BoundedSemaphore(_value=94, _initial_value=100, _cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x3914070>, acquire=<built-in method acquire of thread.lock object at remote 0x3914070>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x3914070>) at remote 0x39435d0>) at remote 0x3940d40>, blocking=True, timeout=None, rc=False, endtime=None)
    self._cond.acquire()
#8 Frame 0x5f373a0, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/pool.py, line 323, in get_socket (self=<Pool(ssl_certfile=None, ssl_cert_reqs=None, conn_timeout=None, ssl_ca_certs=None, lock=<thread.lock at remote 0x3914150>, _socket_semaphore=<BoundedSemaphore(_value=94, _initial_value=100, _cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x3914070>, acquire=<built-in method acquire of thread.lock object at remote 0x3914070>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x3914070>) at remote 0x39435d0>) at remote 0x3940d40>, ssl_keyfile=None, pid=19943, pool_id=0, _ident=<ThreadIdent(_lock=<DummyLock at remote 0x3943710>, _refs={}, _local=<thread._local at remote 0x3916f50>) at remote 0x3943790>, wait_queue_multiple=None, net_timeout=None, use_ssl=None, pair=(u'mongossd.srv.myapp.com', 27017), wait_queue_timeout=None, _check_interval_seconds=1, _request_counter=<Counter(_counters={}, ident=<Threa...(truncated)
    elif not self._socket_semaphore.acquire(True, self.wait_queue_timeout):
#12 Frame 0x59db940, for file /home/myapp/myapp/venv/local/lib/python2.7/site-packages/pymongo/mongo_replica_set_client.py, line 1299, in __socket (self=<MongoReplicaSetClient(_MongoReplicaSetClient__use_greenlets=False, _BaseObject__write_concern=<WriteConcern at remote 0x38fb580>, _MongoReplicaSetClient__wait_queue_timeout=None, _MongoReplicaSetClient__monitor_class=<type at remote 0x37ae8a0>, _MongoReplicaSetClient__seeds=set([(...), (...)]), _MongoReplicaSetClient__name='rs_myapp', _MongoReplicaSetClient__document_class=<type at remote 0x7f6fe0af4e60>, _MongoReplicaSetClient__ssl_ca_certs=None, _MongoReplicaSetClient__net_timeout=None, _MongoReplicaSetClient__index_cache={myapp': {u'user': {u'_cls_1': <datetime.datetime at remote 0x719ba08>, u'username_1': <datetime.datetime at remote 0x71d50a8>}, u'django_session': {u'expire_date_1': <datetime.datetime at remote 0x719bf80>}}}, _BaseObject__safe=True, _MongoReplicaSetClient__opts={'replicaset': 'rs_myapp', 'read_preference': False}, _MongoRep...(truncated)
    sock_info = member.pool.get_socket(force=force)

Thus, I am fairly certain the issue is due to sharing a single pymongo client across all the workers. I am using the --lazy-apps option to load the app in the workers. I have added print statements around where the pymongo clients get created by mongoengine in my Django settings file, like so:

from mongoengine import connect, connection
print 'Existing Connections:', connection._connections
if MONGO_REPLICA:
    connect(MONGO_DB, host=MONGO_IP, replicaSet=MONGO_REPLICA)
else:
    connect(MONGO_DB, port=MONGO_PORT, host=MONGO_IP)
print 'New Connections:', connection._connections
print 'Module ID:', id(connection)
print 'Connection Dict ID:', id(connection._connections)
print 'New Connection ID:', id(connection._connections['default'])

This prints the following when starting uwsgi (I have truncated it, but it is the same for every worker):

Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48170384
Dict ID: 48303232
New Connection ID: 49126224
...

My guess is that it has something to do with the pymongo client being stored in a module-level global dictionary in mongoengine, but I do not understand why every uwsgi process is using the same instance of the module, dictionary, and client (as suggested by the identical memory addresses).

I am running uwsgi with the following options:

/home/myapp/myapp/venv/bin/uwsgi \
    --socket=/tmp/django.sock \
    --listen=512 \
    --logto=/var/log/myapp/uwsgi.log \
    --chmod-socket=660 \
    --chown-socket=myapp:www-data \
    --chdir=/home/myapp/myapp/django/myapp \
    --wsgi-file=wsgi.py \
    --touch-reload=/home/myapp/myapp/bin/uwsgi-reload \
    --master \
    --processes=10 \
    --die-on-term \
    --lazy-apps \
    --buffer-size=32768 \
    --post-buffering=1

@unbit
Owner

unbit commented Apr 18, 2017

Hi, try adding --enable-threads; from the strace (and if I remember correctly) the MongoClient spawns a thread in the background.
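
That is, appending one more flag to the invocation above (sketch only, the ellipsis stands for the existing options):

/home/myapp/myapp/venv/bin/uwsgi \
    ... \
    --lazy-apps \
    --enable-threads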

@ryankuczka

Hmm, adding --enable-threads still results in the same module and MongoClient IDs for every process. Is that an actual issue, or is it possibly a red herring? Upon further investigation, it appears the same thing occurs for every module, so I think that might be expected...

I haven't been able to reliably reproduce the actual hanging of the processes, as it seems to occur randomly, so I'll try letting it run for a while with --enable-threads to see if it still hangs. Thanks for the tip!
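
For what it's worth, a quick standalone check that matching id() values across forked processes don't by themselves imply a shared object (id() is just an address inside each process's own memory space):

# After fork() the child sees the same id() for an object created beforehand,
# yet mutating it in the child does not affect the parent: the processes do
# not actually share the object.
import os

connections = {'default': 'client'}  # stands in for mongoengine's module-level dict
print('parent id: %d' % id(connections))

pid = os.fork()
if pid == 0:  # child process
    print('child id:  %d' % id(connections))  # same number as in the parent
    connections['default'] = 'changed'        # invisible to the parent
    os._exit(0)

os.waitpid(pid, 0)
print('parent still sees: %r' % connections)  # {'default': 'client'}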

@ryankuczka

Hi @unbit, I am still seeing my application hang on connections with the MongoClient. Do you have any other insight into why every uwsgi process uses the same memory address? When running the Django development server I see the following:

(venv) $ python manage.py runserver 0.0.0.0:8888


Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 53909104
Dict ID: 53746048
New Connection ID: 55134352
Existing Connections: {}
New Connections: {'default': MongoClient('127.0.0.1', 27017)}
Module ID: 48065064
Dict ID: 47747152
New Connection ID: 49027856

Performing system checks...

System check identified no issues (1 silenced).
April 19, 2017 - 16:22:17
Django version 1.7.10, using settings 'settings'
Starting development server at http://0.0.0.0:8888/
Quit the server with CONTROL-C.

I'll admit I'm not entirely familiar with how multiple processes/threads are handled, or how what the development server does differs from uwsgi, but I do know there are 2 processes for the development server, and each has a different instance of the module/dict/MongoClient, which is what I would expect from uwsgi as well...

@bgehman

bgehman commented Jan 10, 2019

I stumbled into this after trying to figure out a phantom, consistent ~3% CPU usage from my supposedly idle Docker container (which has python, uwsgi, and the pymongo pip module).

strace output for just a couple of seconds (against one of the workers) shows futex in a tight error loop.

$ strace -cfp 16

strace: Process 16 attached with 3 threads
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.190000          99      1902      1897 futex
  0.00    0.000000           0         2         2 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.190000                  1904      1899 total

The app works great, except for the phantom CPU usage leak. Adding enable-threads seems to have fixed it. Note that lazy-apps did not. I will likely report this issue against pymongo.
