cgi.py multipart/form-data #44313

Closed
teyc mannequin opened this issue Dec 7, 2006 · 22 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@teyc
Mannequin

teyc mannequin commented Dec 7, 2006

BPO 1610654
Nosy @atsuoishimoto, @orsenthil, @pitrou, @bitdancer, @florentx, @hynek, @serhiy-storchaka
Files
  • issue1610654.patch
  • issue1610654_1.patch
  • issue1610654_2.patch
  • issue1610654_3.patch
  • issue1610654_4.patch
  • issue1610654_5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = None
    created_at = <Date 2006-12-07.09:18:18.000>
    labels = ['library', 'performance']
    title = 'cgi.py multipart/form-data'
    updated_at = <Date 2019-04-26.17:29:15.916>
    user = 'https://bugs.python.org/teyc'

    bugs.python.org fields:

    activity = <Date 2019-04-26.17:29:15.916>
    actor = 'BreamoreBoy'
    assignee = 'serhiy.storchaka'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2006-12-07.09:18:18.000>
    creator = 'teyc'
    dependencies = []
    files = ['36895', '36927', '37128', '37149', '37399', '37400']
    hgrepos = []
    issue_num = 1610654
    keywords = ['patch']
    message_count = 21.0
    messages = ['61046', '110090', '114915', '119535', '166021', '178292', '222416', '222557', '222591', '222952', '229235', '229262', '229265', '229285', '229389', '230170', '230620', '230860', '231554', '232377', '232379']
    nosy_count = 10.0
    nosy_names = ['ishimoto', 'orsenthil', 'pitrou', 'teyc', 'r.david.murray', 'flox', 'hynek', 'serhiy.storchaka', 'Chui.Tey', 'rishi.maker.forum']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1610654'
    versions = ['Python 3.5']

    @teyc
    Mannequin Author

    teyc mannequin commented Dec 7, 2006

Uploading large binary files using multipart/form-data can be very inefficient, because the LF character may occur very frequently, resulting in read_lines_to_outerboundary looping too many times.

    *** cgi.py.Py24 Thu Dec 7 18:46:13 2006
    --- cgi.py      Thu Dec 7 16:38:04 2006
    ***************
    *** 707,713 ****
              last = next + "--"
              delim = ""
              while 1:
    !             line = self.fp.readline()
                  if not line:
                      self.done = -1
                      break
    --- 703,709 ----
              last = next + "--"
              delim = ""
              while 1:
    !             line = self.fp_readline()
                  if not line:
                      self.done = -1
                      break
    ***************
    *** 729,734 ****
    --- 730,753 ----
                  delim = ""
              self.__write(odelim + line)

    +     def fp_readline(self):
    +         tell = self.fp.tell()
    +         buffer = self.fp.read(1 << 17)
    +         parts = buffer.split("\n")
    +         retlst = []
    +         for part in parts:
    +             if part.startswith("--"):
    +                 if retlst:
    +                     retval = "\n".join(retlst) + "\n"
    +                 else:
    +                     retval = part + "\n"
    +                 self.fp.seek(tell + len(retval))
    +                 return retval
    +             else:
    +                 retlst.append(part)
    +         return buffer
    +
          def skip_lines(self):
              """Internal: skip lines until outer boundary if defined."""
              if not self.outerboundary or self.done:

The patch reads the file in larger increments. For my test file of 138 MB, it reduced parsing time from 168 seconds to 19 seconds.

    #------------ test script --------------------

import cgi
    import os
    import profile
    import stat
    
    def run():
        filename = 'body.txt'
        size = os.stat(filename)[stat.ST_SIZE]
        fp = open(filename,'rb')
        environ = {}
        environ["CONTENT_TYPE"]   = open('content_type.txt','rb').read()
        environ["REQUEST_METHOD"] = "POST"
        environ["CONTENT_LENGTH"] = str(size)
    
        fieldstorage = cgi.FieldStorage(fp, None, environ=environ)
        return fieldstorage
    
    import hotshot, hotshot.stats
    import time
    if 1:
        t1 = time.time()
        prof = hotshot.Profile("bug1718.prof")
        # hotshot profiler will crash with the 
        # patch applied on windows xp
        #prof_results = prof.runcall(run)
        prof_results  = run()
        prof.close()
        t2 = time.time()
        print t2-t1
        if 0:
          for key in prof_results.keys():
            if len(prof_results[key].value)> 100:
                print key, prof_results[key].value[:80] + "..."
            else:
                print key, prof_results[key]

    content_type.txt
    ----------------------------
    multipart/form-data; boundary=----------ThIs_Is_tHe_bouNdaRY_$

    @teyc teyc mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Dec 7, 2006
    @devdanzin devdanzin mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Mar 30, 2009
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 12, 2010

Chui Tey, does this issue still apply? If yes, could you please provide a patch according to the guidelines here:
python.org/dev/patches

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 25, 2010

    No reply to msg110090.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Aug 25, 2010
    @bitdancer
    Member

    I don't think it was appropriate to close this issue.

    @bitdancer bitdancer reopened this Oct 25, 2010
    @florentx
    Mannequin

    florentx mannequin commented Jul 21, 2012

    It needs tests to demonstrate the issue in 3.x, and an updated patch.

    @hynek
    Member

    hynek commented Dec 27, 2012

    It would be great if someone could port this patch to Python 3.4 and verify its effectiveness.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 6, 2014

    @hynek could you port the patch as you've shown some interest in it?

    @hynek
    Member

    hynek commented Jul 8, 2014

I would have long ago if I had any domain knowledge on this topic, but alas….

    @ChuiTey
    Mannequin

    ChuiTey mannequin commented Jul 8, 2014

    Hi,

I'm still available. There's a test case in the patch; would you like me to separate it into another file? Would that help with assessing it?

    Best,


    @bitdancer
    Member

To move this issue along we need someone to convert it into our standard patch format (a unified diff against either the 3.4 or the default branch, preferably produced via an 'hg diff' command without the --git option), with the test included as a unit test added to the appropriate test file (Lib/test/test_cgi.py).

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 13, 2014

My observation is that a file with a higher-than-normal number of line-feed characters (exact numbers below) takes far too long to parse.

I tried porting the above patch to my default branch. It has some boundary and CRLF/LF issues, but more importantly it relies on seeking the file object, which in a real deployment is stdin and hence cannot be seeked in that environment.

I have attached a patch which is based on the same principle as Chui mentioned, i.e. reading a large buffer, but this patch does not deal with line feeds at all. It instead searches for the entire boundary in a large buffer.

The cgi module only relies on the file object's read and readline functionality, so I created a wrapper class around read and readline to introduce buffering (attached as a patch).

When multipart boundaries are being searched for, the patch fills a huge buffer, as in the original solution. It searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line.

When searching, there are corner cases (a boundary overlapping two buffers) and CRLF issues, and a boundary itself could contain repeating characters, adding search complexity. To overcome this, the patch uses simple regular expressions without any expansion or wildcard characters. If a boundary is not found, it returns a chunk of the buffer length minus a possible CRLF/boundary prefix, ensuring that a boundary overlapping two consecutive buffers is never missed. The expressions take care of the CRLF issues.

When read and readline are called, the patch looks for data in the buffer and returns it appropriately.

There is an overall performance improvement for large files, and a very significant one for files with a very high number of LF characters.
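As a rough illustration of the buffered, regex-based boundary search described above (a minimal sketch only; the function name, buffer size and details are assumptions, not the attached patch):

    import re

    BUFSIZE = 1 << 16  # assumed read size; the real patch may use a different value

    def read_to_boundary(fp, boundary, write):
        # fp: binary stream, boundary: bytes, write: callable receiving payload bytes.
        # The boundary is matched literally, preceded by CRLF or LF; re.escape()
        # keeps repeated or special characters in the boundary from expanding.
        pattern = re.compile(rb"(?:\r\n|\n)--" + re.escape(boundary))
        keep = len(boundary) + 4   # longest prefix a split boundary could leave behind
        tail = b""
        while True:
            data = fp.read(BUFSIZE)
            if not data:                    # EOF: flush whatever is left over
                write(tail)
                return b""
            chunk = tail + data
            m = pattern.search(chunk)
            if m:
                write(chunk[:m.start()])    # payload up to (not including) the boundary
                return chunk[m.end():]      # bytes already read past the boundary
            # No boundary in this chunk: emit all but a short tail so a boundary
            # straddling two consecutive reads is still found next time.
            write(chunk[:-keep])
            tail = chunk[-keep:]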

To begin with, I created a 20 MB file with 20% of the file filled with line feeds.

File - 20MB.bin
Size - 20 MB
Description - file filled with 20% (~4 MB) '\n'
Parse time with default cgi module - 53 seconds
Parse time with patch - 0.4 seconds

This time increases linearly with the number of LFs for the default module, i.e. keeping the size the same at 20 MB and doubling the number of LFs to 40% would double the parse time.

I also tried a normal large binary file that I found on my machine.
Size - 88 MB
Description - binary executable on my machine; the binary image has 140k LFs
Parse time with default cgi module - 2.7 seconds
Parse time with patch - 0.7 seconds

I have tested with a few other files and noticed the time is cut by at least half for large files.

    Note:
These numbers are consistent over multiple observations.
    I tested this using the script attached, and also on my localhost server.
    The time taken is obtained by running the following code.

    t1=time.time()
    cProfile.run("fs = cgi.FieldStorage()")
    print(str(len(fs['datafile'].value)))
    t2 = time.time()
    print(str(t2 - t1))

I have tried to keep the patch compatible with the current module. However, I have introduced a ValueError exception in the module when the boundary is very large (over 1024 bytes); the RFC specifies the maximum length to be 70 bytes.

    @pitrou
    Member

    pitrou commented Oct 13, 2014

    Rishi, thanks for the patch. I was going to give a review but first I have to ask: is so much support code necessary for this?

    Another approach would be to wrap self.fp in a io.BufferedReader (if it's not already buffered) and then use the peek() method to find the boundary without advancing the file pointer.
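A minimal sketch of that idea, assuming fp is (or wraps) a raw binary stream; the names and window size are illustrative, not the cgi module's actual code, and edge cases such as a very short look-ahead window near EOF are glossed over:

    import io

    def read_until_boundary(fp, boundary):
        if not isinstance(fp, io.BufferedReader):
            fp = io.BufferedReader(fp)     # assumes fp is a raw binary stream here
        marker = b"\n--" + boundary        # boundary lines always start on a new line
        payload = bytearray()
        while True:
            window = fp.peek(1 << 16)      # look ahead without advancing the file pointer
            if not window:
                break                      # EOF
            pos = window.find(marker)
            if pos != -1:
                payload += fp.read(pos)    # consume the payload only; stop at the boundary
                break
            # No boundary in view: consume all but a marker-sized tail so a
            # boundary split across two peeks is still detected on the next pass.
            payload += fp.read(max(len(window) - len(marker), 1))
        return bytes(payload)

The split between peek() and read() is what keeps the file position from ever running past the boundary.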

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 13, 2014

Antoine, I will upload a patch that relies on BufferedReader. As you mentioned, it gets rid of the hand-rolled buffer and removes a lot of code.
The only issue is that I need to know whether the current buffer is at EOF (the documentation of peek() does not guarantee that a return shorter than requested means EOF), because the patterns at EOF are different, but I think I can work around that.

    @serhiy-storchaka
    Member

I doubt we can use io.BufferedReader or handmade buffering here. The current code doesn't read more bytes than necessary. A buffered reader will read ahead, and there is no way to return read bytes back to the stream in the general case (seekable streams are an exception). It can also block while trying to fill a buffer with unnecessary bytes.

I think the user of the cgi module is responsible for wrapping a stream in io.BufferedReader (if that is acceptable); the cgi module shouldn't do this itself.

But we can implement special cases for buffered or seekable streams. This optimization will not work in the general case, however (e.g. for stdin or a socket).
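A small sketch of what such special-casing could look like (the helper name is assumed, not taken from any attached patch):

    def can_look_ahead(fp):
        # Chunked boundary search is only safe when the stream either exposes
        # peek() (a buffer we can inspect without consuming) or can seek back
        # after an over-read; pipes and sockets offer neither.
        peek = getattr(fp, "peek", None)
        return callable(peek) or (hasattr(fp, "seekable") and fp.seekable())

    # Hypothetical use inside the parser:
    #   if can_look_ahead(self.fp):
    #       ... chunked boundary search ...
    #   else:
    #       ... existing line-by-line readline() loop ...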

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 15, 2014

I have recreated the patch (issue1610654_1.patch), and it performs more or less like the earlier patch.

Serhiy,
I agree we cannot use handmade buffering here without seeking ahead.
I believe we can make optimizations for streams which are buffered and non-seekable.
The cgi module's default file object is the BufferedReader wrapping sys.stdin, so the solution is fairly generic.

I have removed the handmade buffering, and I do not create a Buffered* object either.
We rely on the user to create the buffered object. The sys.stdin that the cgi module uses has a decent buffer underneath that works well on Apache.

The patch attached does not seek, nor does it read ahead. It only looks ahead.
As Antoine suggests, it peeks at the buffer and determines through a fast lookup whether the buffer contains a boundary or not.
It moves forward only if it is convinced that the current buffer lies entirely before the next boundary.

The issue is that the current implementation deals with lines and not chunks.
Even when a savvy user wraps sys.stdin in a large BufferedReader, there is little to no performance improvement in the current implementation for large files in my observation, and it does not solve the bug mentioned either.
The difference in extreme cases like Chui's is 53 s against 0.7 s, and even otherwise, for larger files the patch is about 3 times faster than the current implementation.
I have tested this on an Apache2 server where sys.stdin is buffered.

    @pitrou
    Member

    pitrou commented Oct 28, 2014

    Thanks for the updated patch. I'll take a look soon if no-one beats me to it.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Nov 4, 2014

    Patch updated from review comments. Also added a few corner test cases.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Nov 8, 2014

    Hi,
I have created a new patch with a small design change. The change is that, in situations where I don't find the boundary, instead of keeping the last x bytes in the buffer I simply drain the whole buffer and call readline().
This seems like the right thing to do as well. I managed to get rid of the two obfuscated helper functions keep_x_buffer and remove_x_buffer that I had, and the code should look familiar (I hope) to a module owner.
This also let me get rid of quite a few class member variables that I could move into locals of my main function (multi_read).
I still need to maintain an overlap, but only for the trailing CRLF before a boundary.
I ran all the new and old tests, and tested on Apache with the Ubuntu server ISO image. Without the patch the Ubuntu server ISO took 93 seconds; with the patch it took 25 seconds.
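Very roughly, the simplified read step described here might look like the following (a sketch under assumptions; the helper name, window size and pattern handling are not taken from the patch itself):

    def next_piece(fp, pattern):
        # pattern: compiled regex for the full boundary, e.g.
        # re.compile(rb"(?:\r\n|\n)--" + re.escape(boundary)); fp supports peek().
        window = fp.peek(1 << 16)
        m = pattern.search(window)
        if m:
            # Boundary in sight: everything before it is payload; the boundary
            # line itself is left unread for the caller to consume.
            return fp.read(m.start()), b""
        # No boundary visible: drain the window up to its last newline, then
        # readline() to fetch the next complete line. The caller's existing
        # line-oriented check decides whether that line is a boundary; the CRLF
        # preceding a boundary still needs the small overlap mentioned above.
        cut = window.rfind(b"\n") + 1
        return fp.read(cut), fp.readline()

The caller appends the first element to the payload and runs the second through the same line-based boundary check as the original loop.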

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 22, 2014
    @serhiy-storchaka
    Member

The new test fails with non-modified code. Either there is a bug in the current code or the test is wrong.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Dec 9, 2014

    There is indeed a test failure that occurs without the patch. This is a new test I had added.
The reason is that in the existing implementation, when a boundary does not exist, the trailing CRLF, LF or, for that matter, CR is not included as part of the payload. I think that is not correct.

However, to keep this patch compatible with the behavior of the existing implementation, I have updated the patch to strip a single CRLF, LF or CR from the payload if a boundary is not found.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Dec 9, 2014

One of my comments exceeded the wrapped line limit. I also changed the test in question from checking the lengths of the expected and actual buffers to checking the contents of the respective buffers.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @kumaraditya303
    Contributor

Closing, as the cgi module is deprecated per PEP 594 and no improvements or bug fixes will be made.
