cgi.py multipart/form-data #44313
Uploading large binary files using multipart/form-data can be very inefficient, because the LF character may occur very frequently, causing read_lines_to_outerboundary to loop too many times.

*** cgi.py.Py24	Thu Dec 7 18:46:13 2006
***************
*** 707,713 ****
          last = next + "--"
          delim = ""
          while 1:
!             line = self.fp_readline()
              if not line:
                  self.done = -1
                  break
***************
*** 729,734 ****
--- 730,753 ----
+     def fp_readline(self):

The patch reads the file in larger increments. For my test file of 138 MB, it reduced parsing time from 168 seconds to 19 seconds.
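The body of fp_readline is cut off in the migrated report. As a rough illustration of the technique the message describes (reading the file in larger increments and serving line requests from an internal buffer), a sketch could look like the following; the attribute name and buffer size are assumptions, not the actual patch:

    def fp_readline(self, bufsize=1 << 16):
        # Sketch only: refill a private buffer with large reads instead
        # of calling self.fp.readline() once per LF-terminated line,
        # then hand lines out of that buffer.
        if not hasattr(self, '_readbuf'):
            self._readbuf = ''
        while '\n' not in self._readbuf:
            chunk = self.fp.read(bufsize)
            if not chunk:                       # EOF: flush what is left
                line, self._readbuf = self._readbuf, ''
                return line
            self._readbuf += chunk
        pos = self._readbuf.index('\n') + 1
        line, self._readbuf = self._readbuf[:pos], self._readbuf[pos:]
        return line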
#------------ test script --------------------
import cgi
import os
import stat
import time
import hotshot
import hotshot.stats

def run():
    filename = 'body.txt'
    size = os.stat(filename)[stat.ST_SIZE]
    fp = open(filename, 'rb')
    environ = {}
    environ["CONTENT_TYPE"] = open('content_type.txt', 'rb').read()
    environ["REQUEST_METHOD"] = "POST"
    environ["CONTENT_LENGTH"] = str(size)
    fieldstorage = cgi.FieldStorage(fp, None, environ=environ)
    return fieldstorage

if 1:
    t1 = time.time()
    prof = hotshot.Profile("bug1718.prof")
    # hotshot profiler will crash with the
    # patch applied on Windows XP
    #prof_results = prof.runcall(run)
    prof_results = run()
    prof.close()
    t2 = time.time()
    print t2 - t1

if 0:
    for key in prof_results.keys():
        if len(prof_results[key].value) > 100:
            print key, prof_results[key].value[:80] + "..."
        else:
            print key, prof_results[key]
Chui Tey, does this issue still apply? If yes, could you please provide a patch according to the guidelines here.
No reply to msg110090.
I don't think it was appropriate to close this issue.
It needs tests to demonstrate the issue in 3.x, and an updated patch.
It would be great if someone could port this patch to Python 3.4 and verify its effectiveness.
@hynek could you port the patch as you've shown some interest in it?
I would have long ago if I had any domain knowledge on this topic, but alas…
Hi, I'm still available. There's a test case in the patch.
To move this issue along we need someone to convert it into our standard patch format (a unified diff against either the 3.4 or the default branch, preferably produced via an 'hg diff' command without the --git option), with the test included as a unit test added to the appropriate test file (Lib/test/test_cgi.py).
My observation is that a file with a higher than normal number of line-feed characters (exact numbers below) takes far too long to parse. I tried porting the above patch to my default branch, but it has some boundary and CRLF/LF issues; more importantly, it relies on seeking the file object, which in the real world is stdin for CGI scripts and hence is illegal in that environment.

I have attached a patch which is based on the same principle Chui mentioned, i.e. reading a large buffer, but this patch does not deal with line feeds at all. Instead it searches for the entire boundary in a large buffer. The cgi module only relies on the file object's read and readline functionality, so I created a wrapper class around read and readline to introduce buffering (attached as a patch). When multipart boundaries are being searched for, the patch fills a huge buffer, as in the original solution, then searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line. The search has corner cases (a boundary overlapping two buffers) and CRLF issues, and a boundary may itself contain repeating characters, which adds search complexity. When read and readline are called, the patch looks for the data in its buffer and returns it appropriately.

There is an overall performance improvement for large files, and a very significant one for files with a very high number of LF characters. To begin with, I created a 20 MB file (20MB.bin) with 20% of the file filled with line feeds. The parse time increases linearly with the number of LFs for the default module, i.e. keeping the size at 20 MB and doubling the number of LFs to 40% doubles the parse time. I also tried a normal large binary file that I found on my machine, and have tested with a few other files; the time is cut by at least half for large files.

Note: the timing script was

import cgi
import cProfile
import time

t1 = time.time()
cProfile.run("fs = cgi.FieldStorage()")
print(str(len(fs['datafile'].value)))
t2 = time.time()
print(str(t2 - t1))

I have tried to keep the patch compatible with the current module. However, I have introduced a ValueError exception in the module when the boundary is very large, i.e. over 1024 bytes; RFC 2046 specifies a maximum boundary length of 70 bytes.
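To make the buffer-overlap corner case concrete, here is a minimal sketch of a chunk-based boundary search in the spirit of the attached patch; the function name, buffer size, and the omission of CRLF handling are simplifications, not the patch's actual code:

    def iter_payload(fp, boundary, bufsize=1 << 16):
        # Search for the full boundary in large chunks. Keep the last
        # len(boundary) - 1 bytes of each chunk so a boundary that
        # straddles two reads is still found on the next pass.
        keep = len(boundary) - 1
        tail = b''
        while True:
            data = fp.read(bufsize)
            if not data:                  # EOF without seeing the boundary
                if tail:
                    yield tail
                return
            chunk = tail + data
            pos = chunk.find(boundary)
            if pos >= 0:
                yield chunk[:pos]         # payload up to the boundary
                return
            if keep:
                yield chunk[:-keep]       # cannot contain the boundary start
                tail = chunk[-keep:]
            else:
                yield chunk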
Rishi, thanks for the patch. I was going to give it a review, but first I have to ask: is so much support code necessary for this? Another approach would be to wrap self.fp in an io.BufferedReader (if it's not already buffered) and then use the peek() method to find the boundary without advancing the file pointer.
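To make the suggestion concrete, a minimal sketch of the peek-based look-ahead might read as follows; the function name and window size are illustrative, not from any attached patch:

    import io

    def find_boundary_in_window(fp, boundary):
        # Illustrative only: peek() returns buffered bytes without
        # advancing the stream position, so nothing is consumed even if
        # the boundary starts inside the window. The size argument is a
        # hint; peek() may return more or fewer bytes than requested.
        if not isinstance(fp, io.BufferedReader):
            fp = io.BufferedReader(fp)   # assumes fp is a raw binary stream
        window = fp.peek(65536)
        return fp, window.find(b'--' + boundary)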
Antoine, I will upload a patch that relies on BufferedReader. As you mentioned, it will get rid of supporting the buffer and reduce a lot of code.
I doubt we can use io.BufferedReader or handmade buffering here. The current code doesn't read more bytes than necessary. A buffered reader will read ahead, and there is no way to return read bytes back to the stream in the general case (seekable streams are an exception); it can block while trying to fill a buffer with unnecessary bytes. I think the user of the cgi module is responsible for wrapping a stream in io.BufferedReader (if that is acceptable); the cgi module shouldn't do this itself. But we could implement special cases for buffered or seekable streams. However, this optimization will not work in the general case (e.g. for stdin or a socket).
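For the seekable special case mentioned here, the look-ahead could, as a rough sketch, rewind after reading; the function name is illustrative:

    def peek_seekable(fp, size):
        # Works only when fp.seekable() is true: read ahead, then seek
        # back so the extra bytes are effectively returned to the
        # stream. For stdin or a socket there is no such escape hatch,
        # which is the objection above.
        pos = fp.tell()
        data = fp.read(size)
        fp.seek(pos)
        return data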
I have recreated the patch (issue1610654_1.patch) and it performs more or less like the earlier patch. Serhiy, I have removed the handmade buffering, and I do not create a Buffered* object either. The attached patch does not seek, nor does it read ahead; it only looks ahead. The issue is that the current implementation deals with lines, not chunks.
Thanks for the updated patch. I'll take a look soon if no-one beats me to it.
Patch updated per review comments. Also added a few corner-case tests.
The new test fails with unmodified code. Either there is a bug in the current code or the tests are wrong.
There is indeed a test failure that occurs without the patch; this is a new test I had added. However, to keep this patch compatible with the behavior of the existing implementation, I have updated it to strip a single CRLF, LF, or CR from the payload if a boundary is not found.
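The stripping rule described amounts to something like the following sketch (not the patch's actual code):

    def strip_trailing_eol(data):
        # Drop exactly one line ending from the payload: CRLF first,
        # then a lone LF or CR.
        if data.endswith(b'\r\n'):
            return data[:-2]
        if data.endswith(b'\n') or data.endswith(b'\r'):
            return data[:-1]
        return data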
One of my comments overshot the line-wrap limit. Also changed the test in question from checking the lengths of the expected and actual buffers to checking their contents.
Closing, as the cgi module is deprecated by PEP 594 and no improvements or bug fixes will be made.