Memory leak when processing big file upload #740

Closed
JerryKwan opened this issue Apr 18, 2013 · 26 comments

@JerryKwan

Something is wrong in Tornado v3.0.1.
When I upload a large file (about 101 MB, larger than the default max_buffer_size), the Tornado server raises an exception as follows:
ERROR:tornado.application:Error in connection callback
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/tcpserver.py", line 228, in _handle_connection
    self.handle_stream(stream, address)
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/httpserver.py", line 157, in handle_stream
    self.no_keep_alive, self.xheaders, self.protocol)
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/httpserver.py", line 190, in __init__
    self.stream.read_until(b"\r\n\r\n", self._header_callback)
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/iostream.py", line 148, in read_until
    self._try_inline_read()
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/iostream.py", line 398, in _try_inline_read
    if self._read_to_buffer() == 0:
  File "/usr/local/lib/python2.7/dist-packages/tornado-3.0.1-py2.7.egg/tornado/iostream.py", line 432, in _read_to_buffer
    raise IOError("Reached maximum read buffer size")
IOError: Reached maximum read buffer size
But after several big file upload requests there is a large memory increase in the Tornado server, so I think there may be a memory leak when processing big files.
Any suggestions? How could I fix it?

@schlamar
Contributor

Try to set self._read_buffer = None in BaseIOStream.close.

A proper fix should clear the write buffer, too.
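
A rough sketch of how one might try that suggestion from application code, without editing the installed package (a monkey-patch for Tornado 3.0.x; untested, and _read_buffer/_write_buffer are private attributes of tornado/iostream.py that may change):

from tornado import iostream

_orig_close = iostream.BaseIOStream.close

def _close_and_clear(self, exc_info=False):
    # Run the normal close logic first, then drop references to any
    # buffered data so it can be garbage collected.
    _orig_close(self, exc_info=exc_info)
    self._read_buffer = None
    self._write_buffer = None

iostream.BaseIOStream.close = _close_and_clear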

@wsantos
Contributor

wsantos commented Apr 18, 2013

Should we dynamically increase max_buffer_size?

        if self._read_buffer_size >= self.max_buffer_size:
            gen_log.error("Reached maximum read buffer size")
            self.close()
            raise IOError("Reached maximum read buffer size")
        return len(chunk)

@schlamar
Contributor

@wsantos I think the issue is not that the IOError is thrown, which is expected behavior. (Am I right @JerryKwan?)

The unexpected issue is that the memory of the read buffer does not get freed after the IOStream is closed (I assume it doesn't get garbage collected). So the right thing to do would be to free the read (and write) buffer while closing the IOStream.

@schlamar
Contributor

Increasing a max_* value is really arbitrary behavior, I strongly discourage that.

@wsantos
Contributor

wsantos commented Apr 18, 2013

@schlamar, so it's OK to have a buffer of size 104857600 (100 MB) and send 101 MB? How does the IOStream handle this?

@schlamar
Contributor

@JerryKwan BTW, for such large uploads I would patch Tornado to buffer directly on disk.

@wsantos It raises the IOError as expected. At a low level you can handle this gracefully by processing portions of the data (e.g. with read_bytes). The HTTP part, however, does read_until(HTTP_END), so if the data sent is bigger than 100 MB (in one HTTP request, of course) it will fail. This is IMO fine, as you can either adjust the limit on your own or overcome this limitation completely by manually buffering on disk. The latter would require some tough work, but is the most suitable option in this case IMO.
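
For reference, the low-level portion-by-portion approach could look roughly like this with the callback-style read_bytes of the Tornado 3.0 IOStream (only a sketch; buffer_body_to_disk, content_length and on_upload_done are illustrative names, not Tornado APIs):

import tempfile

CHUNK_SIZE = 64 * 1024  # consume the body 64 KB at a time

def buffer_body_to_disk(stream, content_length, on_upload_done):
    """Read content_length bytes from the stream in small chunks and
    spill them to a temporary file instead of Tornado's read buffer."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    state = {"remaining": content_length}

    def read_next():
        stream.read_bytes(min(CHUNK_SIZE, state["remaining"]), on_chunk)

    def on_chunk(data):
        tmp.write(data)
        state["remaining"] -= len(data)
        if state["remaining"] > 0:
            read_next()
        else:
            tmp.flush()
            on_upload_done(tmp.name)

    read_next()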

@wsantos
Contributor

wsantos commented Apr 18, 2013

@schlamar I agree; now we can start a server and set the limit we can handle, see #732.

@JerryKwan is it OK for you to have a configurable option at server start? Maybe that would be the best option.

@schlamar
Contributor

FYI: bottle automatically switches to a temporary file for file uploads if the content length of the request is bigger than the max buffer size: https://github.com/defnull/bottle/blob/d89d676a6510541e679fac1ac1e8a5745eb87f76/bottle.py#L1050

Tornado should do the same IMO (and decrease the max buffer size to 1 MB or less).
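
The bottle policy boils down to a size check along these lines (illustrative only; MEM_LIMIT is an assumed threshold, not an existing Tornado setting):

import tempfile
from io import BytesIO

MEM_LIMIT = 1024 * 1024  # ~1 MB kept in memory, as suggested above

def body_target(content_length):
    """Pick where to buffer a request body based on its declared size."""
    if content_length and content_length > MEM_LIMIT:
        return tempfile.TemporaryFile()  # large body: buffer on disk
    return BytesIO()                     # small body: keep in memory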

@wsantos
Contributor

wsantos commented Apr 18, 2013

We need to look at this with care; will it block Tornado during a file upload?

@schlamar
Contributor

We need to look at this with care; will it block Tornado during a file upload?

What exactly would block, and why should any of the proposals here block?

@wsantos
Contributor

wsantos commented Apr 18, 2013

When we are writing to the tempfile.

@schlamar
Contributor

When we are writing to the tempfile.

Really? Which do you think blocks more seriously: writing 100 MB to a file all at once (which is what a user usually does after a file upload), or writing small chunks to a file?

@saghul

saghul commented Apr 18, 2013

The unexpected issue is that the memory of the read buffer does not get freed after the IOStream is closed (I assume it doesn't get garbage collected). So the right thing to do would be to free the read (and write) buffer while closing the IOStream.

The problem is that someone may want to read the remaining data in the buffer. I guess that reading until you get an exception would consume the buffer and 'fix' the issue. Or maybe have a dedicated cleanup method.

@schlamar
Contributor

I guess that reading until you get an exception would consume the buffer

It already reads until the buffer is full, then closes the stream and throws the exception. The buffer doesn't get consumed at all if the content length > max buffer size.

Why should anyone read the buffer after the stream is closed? A close forbids any further interaction with the stream anyhow (see _check_closed).

@saghul

saghul commented Apr 18, 2013

Imagine I'm using read_until with a delimiter and the connection gets closed. There is data in the buffer which wasn't consumed.

You can actually read that data with read_bytes; the close check is done after attempting a read from the buffer: https://github.com/facebook/tornado/blob/master/tornado/iostream.py#L383

@wsantos
Contributor

wsantos commented Apr 18, 2013

It's the way to go, but I'm just wondering how to implement it and avoid blocking problems, if there are any.

@schlamar
Contributor

Sure. But this is IMO practically irrelevant. If you could consume the buffer with read_bytes, you would have done so in the first place and avoided the buffer overflow entirely :)
If your buffer is full, you have reached a point where the content of the buffer is not usable and can be thrown away. Otherwise you just designed your protocol wrong.

@schlamar
Contributor

Or you could allow the user to register a callback for buffer overflow, and the default case (no callback registered) would be to raise the IOError and clear the buffer.
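
A toy, standalone illustration of that idea (not Tornado code; the class and its names are made up):

class OverflowingBuffer(object):
    """Accumulates data; on overflow, either hands everything to a
    user-registered callback or raises and clears itself (the proposed
    default)."""

    def __init__(self, max_size, overflow_callback=None):
        self.max_size = max_size
        self.overflow_callback = overflow_callback
        self._chunks = []
        self._size = 0

    def feed(self, data):
        self._chunks.append(data)
        self._size += len(data)
        if self._size >= self.max_size:
            buffered = b"".join(self._chunks)
            self._chunks = []  # free the memory either way
            self._size = 0
            if self.overflow_callback is not None:
                self.overflow_callback(buffered)  # e.g. spill it to disk
            else:
                raise IOError("Reached maximum read buffer size")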

@wsantos
Contributor

wsantos commented Apr 18, 2013

We only need to work on fixing the memory leak in this issue, and create another one to discuss the tempfile when the size is more than max_buffer_size. What do you think?

@schlamar
Contributor

and create another one to discuss the tempfile when the size is more than max_buffer_size. What do you think?

Exactly my thought 👍

@wsantos
Contributor

wsantos commented Apr 18, 2013

Issue for streams bigger than max_buffer_size: #743. I'm running some tests on the memory leak issue.

regards

@saghul

saghul commented Apr 18, 2013

Sure. But this is IMO practically irrelevant. If you could consume the buffer with read_bytes, you would have done so in the first place and avoided the buffer overflow entirely :)
If your buffer is full, you have reached a point where the content of the buffer is not usable and can be thrown away. Otherwise you just designed your protocol wrong.

IIRC there are other cases where the connection is interrupted (for example, when the remote side closes it) and the buffer is not emptied either. It can happen for reasons other than overflow.

@schlamar
Contributor

IIRC there are other cases where the connection is interrupted (for example, when the remote side closes it) and the buffer is not emptied either. It can happen for reasons other than overflow.

This case should be covered here: https://github.com/facebook/tornado/blob/master/tornado/iostream.py#L289

@JerryKwan
Author

Sorry for the late reply.
As discussed before, I think the buffered data should be cleared when an exception occurs.
When I add the following data-clearing statements to the close() function in iostream.py, the memory leak problem seems to be solved.
Now the close() function looks like this:

    def close(self, exc_info=False):
        """Close this stream.
        If ``exc_info`` is true, set the ``error`` attribute to the current
        exception from `sys.exc_info` (or if ``exc_info`` is a tuple,
        use that instead of `sys.exc_info`).
        """
        if not self.closed():
            if exc_info:
                if not isinstance(exc_info, tuple):
                    exc_info = sys.exc_info()
                if any(exc_info):
                    self.error = exc_info[1]
            if self._read_until_close:
                callback = self._read_callback
                self._read_callback = None
                self._read_until_close = False
                self._run_callback(callback,
                                   self._consume(self._read_buffer_size))
            if self._state is not None:
                self.io_loop.remove_handler(self.fileno())
                self._state = None
            self.close_fd()
            # clear unneeded buffered data
            self._read_buffer = None
            self._write_buffer = None
            self._closed = True

        self._maybe_run_close_callback()

@bdarnell
Member

The IOStream, buffered data and all, should be getting garbage collected when it's no longer needed. What's keeping the IOStreams alive in this case?
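
One way to investigate that (a debugging sketch, not Tornado API): scan the garbage collector for live IOStream instances and print what still refers to them, e.g. from a periodic callback or a debug request handler:

import gc
from tornado import iostream

def dump_live_iostreams():
    """Count live IOStream objects and show the types of their referrers."""
    streams = [obj for obj in gc.get_objects()
               if isinstance(obj, iostream.IOStream)]
    print("live IOStream objects: %d" % len(streams))
    for s in streams[:5]:  # inspect only a handful
        referrer_types = set(type(r).__name__ for r in gc.get_referrers(s))
        print("  %r referred to by: %s" % (s, ", ".join(sorted(referrer_types))))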
