EOFError Stream has been closed #237

Closed
AndyXan opened this Issue Nov 7, 2017 · 10 comments

@AndyXan

AndyXan commented Nov 7, 2017

Hello rpyc community,

I am here with more of a question than an issue. The general problem is that the error I am facing occurs quite randomly, in a setup that is not reproducible here.

The short description: I'm using a OneShotServer in an application. In more detail, the OneShotServer runs inside the IDA-Pro Python environment (Python 2.7, rpyc 3.4.4) and exports the common IDA modules. The client application uses those exported modules. Single-threaded, only one connection, over localhost (Python 2.7, rpyc 3.4.4).

This setup worked so far and still does. However, I recently switched to pypy, and since that change an "EOFError: Stream has been closed" exception suddenly occurs at random points (sometimes after 5 minutes, sometimes after 10 or 15) - according to the stack trace, somewhere along the execution path of syncreq. Completely random, at different module calls and different times.

So basically the only thing that changed with the transition to pypy is that the requests reach the OneShotServer faster - approximately 1000 times faster, or even more. We're talking about several million RPC invocations per hour, all over localhost.

The question now is: How does the OneShotServer behave under such loads? Can this be the source of the "EOFError: Stream has been closed" error, and if so, what could I do about it other than adding "sleep" calls on the client side?

Some thoughts on this are very much appreciated.
Best Regards,
Andreas

@coldfix

coldfix commented Nov 7, 2017

Sounds to me like an error due to the GC collecting the remote end of the stream. Can you give a stack trace?

@AndyXan

AndyXan commented Nov 8, 2017

Good Morning,

yes, I can give a bit more background here. First, the stack traces:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 74, in ida_function_cfg_dict_to_angr_function_cfg_dict
    self.convert_ida_cfg_to_angr_cfg(ida_functionEA)
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 128, in convert_ida_cfg_to_angr_cfg
    angr_split_node_list = self.angr_blockify(ida_basic_block)
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 273, in angr_blockify
    angrBasicBlockEnd = self.ida_to_angr_address(idaBasicBlock.endEA)
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 542, in ida_to_angr_address
    return self.calculateRealAddress(ida_ea) - self.ida.idaapi.get_imagebase() + 0x100000000
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 526, in calculateRealAddress
    if self.isThumb(functionEA) == True:
  File "/home/wittmann/GitRepo/SIT-ANGR/sit-angr-cfg-implementations/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 508, in isThumb
    thumbRegId = self.ida.idaapi.str2reg('T')
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/netref.py", line 199, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/netref.py", line 72, in syncreq
    return conn.sync_request(handler, oid, *args)
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 515, in sync_request
    self._send_request(seq, handler, args)
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 271, in _send_request
    self._send(consts.MSG_REQUEST, seq, (handler, self._box(args)))
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 266, in _send
    self._channel.send(data)
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/channel.py", line 69, in send
    self.stream.write(buf)
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/stream.py", line 233, in write
    count = self.sock.send(data[:self.MAX_IO_CHUNK])
  File "/home/wittmann/.virtualenvs/pypy-angr/site-packages/rpyc/core/stream.py", line 82, in __getattr__
    raise EOFError("stream has been closed")
EOFError: stream has been closed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/angr/projects-sit-angr/sit-cfg-standalone/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 72, in ida_function_cfg_dict_to_angr_function_cfg_dict
    self.convert_ida_cfg_to_angr_cfg(ida_functionEA)
  File "/home/angr/projects-sit-angr/sit-cfg-standalone/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 126, in convert_ida_cfg_to_angr_cfg
    angr_split_node_list = self.angr_blockify(ida_basic_block)
  File "/home/angr/projects-sit-angr/sit-cfg-standalone/src/sit_cfg_implementations/ida_cfg/idalink_cfg.py", line 301, in angr_blockify
    if idaBasicBlock.type in self.ida_basic_block_ignore_types:
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/netref.py", line 151, in __getattribute__
    return syncreq(self, consts.HANDLE_GETATTR, name)
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/netref.py", line 72, in syncreq
    return conn.sync_request(handler, oid, *args)
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 515, in sync_request
    self._send_request(seq, handler, args)
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 271, in _send_request
    self._send(consts.MSG_REQUEST, seq, (handler, self._box(args)))
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/protocol.py", line 266, in _send
    self._channel.send(data)
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/channel.py", line 69, in send
    self.stream.write(buf)
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/stream.py", line 233, in write
    count = self.sock.send(data[:self.MAX_IO_CHUNK])
  File "/home/angr/.virtualenvs/pypy-angr/site-packages/rpyc/core/stream.py", line 82, in __getattr__
    raise EOFError("stream has been closed")
EOFError: stream has been closed

The exceptions occur in different remote module calls. As mentioned, it's random - not at a deterministic point during execution.

The general setup for the connection is (shortened):

Server (inside the IDA-Python environment, which is basically the standard system Python 2.7.12, rpyc 3.4.4. For completeness: IDA-Python only runs within an existing IDA session, so IDA/IDA-Python is running inside a screen or tmux session.)

if __name__ == '__main__':

    idaapi.autoWait()
    idc.Wait()
    
    port = int(idc.ARGV[1]) if idc.ARGV[1:] else 18861

    from rpyc.core import SlaveService
    from rpyc.utils.server import OneShotServer

    print('OneShotServer - Single Connection')
    OneShotServer(SlaveService, port=port).start()
    idc.Exit(0)

On the client side (pypy, rpyc 3.4.4):

def _ida_connect(host, port):
    link = rpyc_classic.connect(host, port) # also tested with keepalive=True
    l.debug('Connected to %s:%d', host, port)

    idc = link.root.getmodule('idc')
    idaapi = link.root.getmodule('idaapi')
    idautils = link.root.getmodule('idautils')
    idadex = link.root.getmodule('idadex')

    return link, idc, idaapi, idautils, idadex

And after that point it's just a few million requests per hour to the modules that are only available in the ida-python environment. This worked, and still works, when using python instead of pypy on the client side with rpyc.

So basically, from my understanding, the only thing that changed is the frequency of remote module calls, up by a factor of a few thousand invocations per minute. The GC process on the server side shouldn't be affected, since the server environment remained untouched by the switch to PyPy.

I'm still doing more testing on this problem - but if you have some idea, that would be great :)

For now, I worked around the problem by catching the EOFError exception, terminating everything ida-related, restarting the server, and resuming execution. This still gives me an approx. 2000% speedup compared to running the analysis with Python instead of PyPy.
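Sketched generically, that restart-and-resume workaround might look like this (a minimal sketch; `call_with_restart`, `invoke`, and `restart` are hypothetical names, and the real `restart` would tear down IDA and bring the server back up):

```python
def call_with_restart(invoke, restart, max_restarts=3):
    # Hypothetical helper: retry `invoke` after running `restart`
    # whenever the rpyc stream dies with an EOFError.
    for attempt in range(max_restarts + 1):
        try:
            return invoke()
        except EOFError:
            if attempt == max_restarts:
                raise  # give up after too many restarts
            restart()

# Tiny self-contained demonstration: a call that fails once, then succeeds.
state = {"calls": 0, "restarts": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] == 1:
        raise EOFError("stream has been closed")
    return "ok"

def restart():
    state["restarts"] += 1

print(call_with_restart(flaky, restart))  # ok
print(state["restarts"])                  # 1
```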

@coldfix

coldfix commented Nov 8, 2017

Hey,

First, thanks for the stack traces and the example code. I have some superficial feedback now; hopefully I will be able to take a closer look tonight.

This might be related to a history of unresolved EOFErrors: #124, #162, #190 and #217. However, in none of these cases was sufficient information reported.

Can you provide a full minimal working example by any chance?

So basically, from my understanding, the only thing that changed is the frequency of remote module calls, up by a factor of a few thousand invocations per minute. The GC process on the server side shouldn't be affected, since the server environment remained untouched by the switch to PyPy.

cpython has a garbage collector too, so this doesn't rule out a garbage collection problem. If you do more work (more created objects), the GC is typically run sooner (which would be consistent with your statement that you have more requests/time).

But you are right: from the server code it looks like there is always a hard reference to the connection on the stack (you could make extra sure by assigning it first: server = OneShotServer(...); server.start()). On the other hand, your example doesn't tell me much about the structure of the client. Does it always keep a hard reference to the connection, (indirectly) available through the stack or globally? (Note that some async/callback systems do not store coroutines/callbacks as hard references!)
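The "hard reference on the stack" point can be demonstrated with plain stdlib code (a generic sketch with made-up names, not rpyc itself): on CPython, if only a weak proxy to an object escapes a function, the object is collected as soon as the function returns.

```python
import weakref

class Conn(object):
    """Stand-in for a connection object (not an rpyc class)."""

def escape_weak():
    c = Conn()
    return weakref.proxy(c)  # only a weak proxy escapes; `c` dies on return

def escape_hard():
    c = Conn()
    return c                 # the caller receives a hard reference

def alive(obj):
    try:
        obj.__class__        # any attribute access on a dead proxy raises
        return True
    except ReferenceError:
        return False

weak = escape_weak()
hard = escape_hard()
print(alive(weak))  # False on CPython: nothing kept the Conn alive
print(alive(hard))  # True: the hard reference keeps it alive
```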

Could you add import gc; gc.disable() to either/both your client/server to check whether it is gc related? (of course, it will be a huge memory leak, not sure whether you have enough resources)

I'm still doing more testing on this problem - but if you have some idea, that would be great :)

When there is a non-deterministic bug associated with high resource usage, it is most of the time related to either multithreading or garbage collection, so I would check these first.

BTW: the server is terminated at the same time as the client receives the EOFError, I assume?

For now, I worked around the problem by catching the EOFError exception, terminating everything ida-related, restarting the server, and resuming execution. This still gives me an approx. 2000% speedup compared to running the analysis with Python instead of PyPy.

Dirty..;)

@AndyXan

AndyXan commented Nov 10, 2017

Hello. Thanks for the reply.

Yes, the server is terminated at the same time the client receives the EOFError. I am going to try your suggestions next week and report back as soon as I have results.

Best Regards,
Andreas

@AndyXan

AndyXan commented Nov 21, 2017

Hello again. For now, I am closing this issue, with the option to reopen it. The GC config changes did not solve it; however, in my current tests, after changing some things unrelated to rpyc in the algorithmic part of my code, the error did not occur today. This needs more investigation on my side.

We'll see. When I can provide more useful information, I'll reopen the issue.

Best regards

@AndyXan AndyXan closed this Nov 21, 2017

@AndyXan

AndyXan commented Feb 15, 2018

Hello coldfix.

I'm leaving this closed - but going to give some more details:

It's GC related, on the client side (the PyPy instance).

So basically, wrapping my client-side code with gc.disable() { .... code } and gc.enable() solves the problem. So I am assuming it's the PyPy implementation of garbage collection that causes the problem.
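That wrap-around can be made a bit safer with try/finally, so the GC is re-enabled even if the wrapped code raises (a hedged sketch; `run_without_gc` is a made-up helper name):

```python
import gc

def run_without_gc(func, *args, **kwargs):
    # Hypothetical helper: run `func` with the GC disabled, restoring
    # the previous GC state afterwards even if an exception escapes.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return func(*args, **kwargs)
    finally:
        if was_enabled:
            gc.enable()

print(run_without_gc(sum, range(10)))  # 45
print(gc.isenabled())                  # True (restored)
```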

Updating this when there is more insight.

@AndyXan

AndyXan commented Feb 21, 2018

So, I'm back - and I think I found the problem.

As mentioned before, after turning off garbage collection on the client side (pypy), the problem disappeared. I let rpyc print the stack trace each time __del__ is called and noticed that garbage collection is handled via async requests, which makes perfect sense. However, rpyc uses weakref.proxy(), which pypy has problems with according to their documentation: http://doc.pypy.org/en/latest/cpython_differences.html

protocol.py :: 532
res = AsyncResult(weakref.proxy(self))

Changing this to

protocol.py :: 532
res = AsyncResult(weakref.ref(self))

seems to fix the problem. I cannot tell you exactly why this fixes it, nor do I know whether weakref.proxy() was used here, instead of weakref.ref() as everywhere else, for a reason.
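For context, `weakref.ref` and `weakref.proxy` surface a dead referent differently, so they are not drop-in replacements for each other (a plain stdlib demonstration, not rpyc code): a ref must be called to dereference it and returns None once the target is gone, whereas a proxy is used like the object itself and raises ReferenceError when dead.

```python
import weakref

class Target(object):
    pass

obj = Target()
r = weakref.ref(obj)    # must be called: r() -> object, or None once dead
p = weakref.proxy(obj)  # used like the object itself

del obj  # drop the last hard reference (collected immediately on CPython)

print(r() is None)  # True: a dead ref dereferences to None
try:
    p.__class__     # attribute access on a dead proxy raises
    print("proxy alive")
except ReferenceError:
    print("proxy dead")
```

Since code receiving a ref has to call it to obtain the object, such a swap also changes what the callee sees, which may be part of why the effect is hard to predict.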

Maybe that helps solving this issue.

Best regards,
Andreas

@AndyXan AndyXan reopened this Feb 21, 2018

@coldfix

coldfix commented Feb 23, 2018

Hey, thanks for the additional information! I don't fully believe that using weakref.ref will fix all of these problems, but if it helps in your case, I'm all for it. Can you submit a PR implementing this change? It may take me a few days to process it, though.

Best, Thomas

@AndyXan

AndyXan commented Feb 27, 2018

Going to do it next week. There's also some more time to test this thoroughly! Greetings

@AndyXan AndyXan closed this Feb 27, 2018

@coldfix

coldfix commented Mar 7, 2018

BTW, I am currently doubting the usefulness of using a weakref for netrefs/asyncs at all. The simplest solution may be to just pass a regular reference - unless you see any problems with that?

This means that netrefs will keep connections alive even if you throw away the reference to conn after getting the netref - which I consider preferable and more pythonic anyway.

coldfix added a commit that referenced this issue Mar 7, 2018

Keep a hard reference to connection in netrefs
This means that netrefs will now keep connections alive.

This may be related to several EOFError issues where the program only
keeps reference to a netref, but not the connection.

May also be related to PyPy EOFError issues where weakref.proxy
may have problems, see #237.

coldfix added a commit that referenced this issue Jun 11, 2018

Release rpyc 4.0.0
This release brings a few minor backward incompatibilities, so be sure to read
on before upgrading. However, fear not: the ones that are most likely relevant
to you have a relatively simple migration path.

Backward Incompatibilities
^^^^^^^^^^^^^^^^^^^^^^^^^^

* ``classic.teleport_function`` now executes the function in the connection's
  namespace by default. To get the old behaviour, use
  ``teleport_function(conn, func, conn.modules[func.__module__].__dict__)``
  instead.

* Changed signature of ``Service.on_connect`` and ``on_disconnect``, adding
  the connection as argument.

* Changed signature of ``Service.__init__``, removing the connection argument

* no longer store connection as ``self._conn``. (allows services that serve
  multiple clients using the same service object, see `#198`_).

* ``SlaveService`` is now split into two asymmetric classes: ``SlaveService``
  and ``MasterService``. The slave exposes functionality to the master but can
  not anymore access remote objects on the master (`#232`_, `#248`_).
  If you were previously using ``SlaveService``, you may experience problems
  when feeding the slave with netrefs to objects on the master. In this case, do
  any of the following:

  * use ``ClassicService`` (acts exactly like the old ``SlaveService``)
  * use ``SlaveService`` with a ``config`` that allows attribute access etc
  * use ``rpyc.utils.deliver`` to feed copies rather than netrefs to
    the slave

* ``RegistryServer.on_service_removed`` is once again called whenever a service
  instance is removed, making it symmetric to ``on_service_added`` (`#238`_)
  This reverts PR `#173`_ on issue `#172`_.

* Removed module ``rpyc.experimental.splitbrain``. It's too confusing and
  undocumented for me and I won't be developing it, so better remove it
  altogether. (It's still available in the ``splitbrain`` branch)

* Removed module ``rpyc.experimental.retunnel``. Seemingly unused anywhere, no
  documentation, no clue what this is about.

* ``bin/rpyc_classic.py`` will bind to ``127.0.0.1`` instead of ``0.0.0.0`` by
  default

* ``SlaveService`` no longer serves exposed attributes (i.e., it now uses
  ``allow_exposed_attrs=False``)

* Exposed attributes no longer hide plain attributes if one otherwise has the
  required permissions to access the plain attribute. (`#165`_)

.. _#165: #165
.. _#172: #172
.. _#173: #173
.. _#198: #198
.. _#232: #232
.. _#238: #238
.. _#248: #248

What else is new
^^^^^^^^^^^^^^^^

* teleported functions will now be defined by default in the globals dict

* Can now explicitly specify globals for teleported functions

* Can now use streams as context manager

* keep a hard reference to the connection in netrefs; may fix some ``EOFError``
  issues, in particular on PyPy (`#237`_)

* handle synchronous and asynchronous requests uniformly

* fix deadlock with connections talking to each other multithreadedly (`#270`_)

* handle timeouts cumulatively

* fix possible performance bug in ``Win32PipeStream.poll`` (oversleeping)

* use readthedocs theme for documentation (`#269`_)

* actually time out sync requests (`#264`_)

* clarify documentation concerning exceptions in ``Connection.ping`` (`#265`_)

* fix ``__hash__`` for netrefs (`#267`_, `#268`_)

* rename ``async`` module to ``async_`` for py37 compatibility (`#253`_)

* fix ``deliver()`` from IronPython to CPython2 (`#251`_)

* fix brine string handling in py2 IronPython (`#251`_)

* add gevent_ Server. For now, this requires using ``gevent.monkey.patch_all()``
  before importing rpyc. Client connections can already be made without
  further changes to rpyc, just using gevent's monkey patching. (`#146`_)

* add function ``rpyc.lib.spawn`` to spawn daemon threads

* fix several bugs in ``bin/rpycd.py`` that crashed this script on startup
  (`#231`_)

* fix problem with MongoDB, or more generally any remote objects that have a
  *catch-all* ``__getattr__`` (`#165`_)

* fix bug when copying remote numpy arrays (`#236`_)

* added ``rpyc.utils.helpers.classpartial`` to bind arguments to services (`#244`_)

* can now pass services optionally as instance or class (could only pass as
  class, `#244`_)
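
The idea behind ``classpartial`` can be sketched in plain Python (a generic illustration of the concept, not rpyc's actual implementation; the service class and its argument are made up):

```python
import functools

def classpartial(cls, *args, **kwargs):
    # Generic sketch of the concept (NOT rpyc's implementation):
    # return a subclass whose constructor has the given arguments pre-bound.
    class Bound(cls):
        __init__ = functools.partialmethod(cls.__init__, *args, **kwargs)
    Bound.__name__ = cls.__name__
    return Bound

class EchoService(object):
    # Hypothetical service taking a config argument at construction time.
    def __init__(self, config):
        self.config = config

BoundService = classpartial(EchoService, {"db": "test"})
svc = BoundService()  # instantiated later without arguments
print(svc.config)                    # {'db': 'test'}
print(isinstance(svc, EchoService))  # True
```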

* The service is now charged with setting up the connection, doing so in
  ``Service._connect``. This allows using custom protocols by e.g. subclassing
  ``Connection``.  More discussions and related features in `#239`_-`#247`_.

* service can now easily override protocol handlers, by updating
  ``conn._HANDLERS`` in ``_connect`` or ``on_connect``. For example:
  ``conn._HANDLERS[HANDLE_GETATTR] = self._handle_getattr``.

* most protocol handlers (``Connection._handle_XXX``) now directly get the
  object rather than its ID as first argument. This makes overriding
  individual handlers feel much more high-level. And by the way it turns out
  that this fixes two long-standing issues (`#137`_, `#153`_)

* fix bug with proxying context managers (`#228`_)

* expose server classes from ``rpyc`` top level module

* fix logger issue on jython

.. _#137: #137
.. _#146: #146
.. _#153: #153
.. _#165: #165
.. _#228: #228
.. _#231: #231
.. _#236: #236
.. _#237: #237
.. _#239: #239
.. _#244: #244
.. _#247: #247
.. _#251: #251
.. _#253: #253
.. _#264: #264
.. _#265: #265
.. _#267: #267
.. _#268: #268
.. _#269: #269
.. _#270: #270

.. _gevent: http://www.gevent.org/