cgi.py multipart/form-data #44313

Closed
teyc mannequin opened this issue Dec 7, 2006 · 22 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@teyc
Mannequin

teyc mannequin commented Dec 7, 2006

BPO 1610654
Nosy @atsuoishimoto, @orsenthil, @pitrou, @bitdancer, @florentx, @hynek, @serhiy-storchaka
Files
  • issue1610654.patch
  • issue1610654_1.patch
  • issue1610654_2.patch
  • issue1610654_3.patch
  • issue1610654_4.patch
  • issue1610654_5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = None
    created_at = <Date 2006-12-07.09:18:18.000>
    labels = ['library', 'performance']
    title = 'cgi.py multipart/form-data'
    updated_at = <Date 2019-04-26.17:29:15.916>
    user = 'https://bugs.python.org/teyc'

    bugs.python.org fields:

    activity = <Date 2019-04-26.17:29:15.916>
    actor = 'BreamoreBoy'
    assignee = 'serhiy.storchaka'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2006-12-07.09:18:18.000>
    creator = 'teyc'
    dependencies = []
    files = ['36895', '36927', '37128', '37149', '37399', '37400']
    hgrepos = []
    issue_num = 1610654
    keywords = ['patch']
    message_count = 21.0
    messages = ['61046', '110090', '114915', '119535', '166021', '178292', '222416', '222557', '222591', '222952', '229235', '229262', '229265', '229285', '229389', '230170', '230620', '230860', '231554', '232377', '232379']
    nosy_count = 10.0
    nosy_names = ['ishimoto', 'orsenthil', 'pitrou', 'teyc', 'r.david.murray', 'flox', 'hynek', 'serhiy.storchaka', 'Chui.Tey', 'rishi.maker.forum']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1610654'
    versions = ['Python 3.5']

    @teyc
    Mannequin Author

    teyc mannequin commented Dec 7, 2006

Uploading large binary files using multipart/form-data can be very inefficient, because the LF character may occur very frequently, resulting in read_lines_to_outerboundary looping too many times.

    *** cgi.py.Py24 Thu Dec 7 18:46:13 2006
    --- cgi.py      Thu Dec 7 16:38:04 2006
    ***************
    *** 707,713 ****
              last = next + "--"
              delim = ""
              while 1:
    !             line = self.fp.readline()
                  if not line:
                      self.done = -1
                      break
    --- 703,709 ----
              last = next + "--"
              delim = ""
              while 1:
    !             line = self.fp_readline()
                  if not line:
                      self.done = -1
                      break
    ***************
    *** 729,734 ****
    --- 730,753 ----
                  delim = ""
              self.__write(odelim + line)

    +     def fp_readline(self):
    +         tell = self.fp.tell()
    +         buffer = self.fp.read(1 << 17)
    +         parts = buffer.split("\n")
    +         retlst = []
    +         for part in parts:
    +             if part.startswith("--"):
    +                 if retlst:
    +                     retval = "\n".join(retlst) + "\n"
    +                 else:
    +                     retval = part + "\n"
    +                 self.fp.seek(tell + len(retval))
    +                 return retval
    +             else:
    +                 retlst.append(part)
    +         return buffer
    +
          def skip_lines(self):
              """Internal: skip lines until outer boundary if defined."""
              if not self.outerboundary or self.done:

The patch reads the file in larger increments. For my test file of 138 MB, it reduced parsing time from 168 seconds to 19 seconds.

    #------------ test script --------------------

import cgi
    import os
    import profile
    import stat
    
    def run():
        filename = 'body.txt'
        size = os.stat(filename)[stat.ST_SIZE]
        fp = open(filename,'rb')
        environ = {}
        environ["CONTENT_TYPE"]   = open('content_type.txt','rb').read()
        environ["REQUEST_METHOD"] = "POST"
        environ["CONTENT_LENGTH"] = str(size)
    
        fieldstorage = cgi.FieldStorage(fp, None, environ=environ)
        return fieldstorage
    
    import hotshot, hotshot.stats
    import time
    if 1:
        t1 = time.time()
        prof = hotshot.Profile("bug1718.prof")
        # hotshot profiler will crash with the 
        # patch applied on windows xp
        #prof_results = prof.runcall(run)
        prof_results  = run()
        prof.close()
        t2 = time.time()
        print t2-t1
        if 0:
          for key in prof_results.keys():
            if len(prof_results[key].value)> 100:
                print key, prof_results[key].value[:80] + "..."
            else:
                print key, prof_results[key]

    content_type.txt
    ----------------------------
    multipart/form-data; boundary=----------ThIs_Is_tHe_bouNdaRY_$

    @teyc teyc mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Dec 7, 2006
    @devdanzin devdanzin mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Mar 30, 2009
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 12, 2010

Chui Tey, does this issue still apply? If yes, could you please provide a patch according to the guidelines here:
python.org/dev/patches

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 25, 2010

    No reply to msg110090.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Aug 25, 2010
    @bitdancer
    Member

    I don't think it was appropriate to close this issue.

    @bitdancer bitdancer reopened this Oct 25, 2010
    @florentx
    Mannequin

    florentx mannequin commented Jul 21, 2012

    It needs tests to demonstrate the issue in 3.x, and an updated patch.

    @hynek
    Member

    hynek commented Dec 27, 2012

    It would be great if someone could port this patch to Python 3.4 and verify its effectiveness.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 6, 2014

    @hynek could you port the patch as you've shown some interest in it?

    @hynek
    Member

    hynek commented Jul 8, 2014

I would have long ago if I had any domain knowledge on this topic, but alas….

    @ChuiTey
    Mannequin

    ChuiTey mannequin commented Jul 8, 2014

    Hi,

I'm still available. There's a test case in the patch; would you like me to separate it into another file? Would that help with assessing it?

    Best,


    @bitdancer
    Member

To move this issue along we need someone to convert it into our standard patch format (a unified diff against either the 3.4 or the default branch, preferably produced via an 'hg diff' command without the --git option), with the test included as a unit test added to the appropriate test file (Lib/test/test_cgi.py).

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 13, 2014

My observation is that a file with a higher-than-normal number of line-feed characters (exact numbers below) takes far too long to parse.

I tried porting the above patch to my default branch. It has some boundary and CRLF/LF issues, but more importantly it relies on seeking the file object, which in a real deployment is stdin and hence cannot be seeked in that environment.

I have attached a patch which is based on the same principle as Chui mentioned, i.e. reading a large buffer, but this patch does not deal with line feeds at all. It instead searches for the entire boundary in a large buffer.

The cgi module only relies on the file object's read and readline functionality, so I created a wrapper class around read and readline to introduce buffering (attached as a patch).

When multipart boundaries are being searched for, the patch fills a huge buffer, as in the original solution. It searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line.

When searching, there are corner cases (a boundary overlapping two buffers) and CRLF issues, and a boundary itself could contain repeating characters, adding search complexity. To overcome this, the patch uses simple regular expressions without any expansion or wildcard characters. If a boundary is not found, it returns a chunk of the buffer length minus a possible CRLF/boundary prefix, ensuring that a boundary overlapping two consecutive buffers is never missed. The expressions take care of the CRLF issues.

When read and readline are called, the patch looks for data in the buffer and returns it appropriately.

There is an overall performance improvement for large files, and a very significant one for files with a very high number of LF characters.
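As a rough illustration of the buffered, regex-based boundary search described above (a minimal sketch only; the function name, buffer size and details are assumptions, not the attached patch):

    import re

    BUFSIZE = 1 << 16  # assumed read size; the real patch may use a different value

    def read_to_boundary(fp, boundary, write):
        # fp: binary stream, boundary: bytes, write: callable receiving payload bytes.
        # The boundary is matched literally, preceded by CRLF or LF; re.escape()
        # keeps repeated or special characters in the boundary from expanding.
        pattern = re.compile(rb"(?:\r\n|\n)--" + re.escape(boundary))
        keep = len(boundary) + 4   # longest prefix a split boundary could leave behind
        tail = b""
        while True:
            data = fp.read(BUFSIZE)
            if not data:                    # EOF: flush whatever is left over
                write(tail)
                return b""
            chunk = tail + data
            m = pattern.search(chunk)
            if m:
                write(chunk[:m.start()])    # payload up to (not including) the boundary
                return chunk[m.end():]      # bytes already read past the boundary
            # No boundary in this chunk: emit all but a short tail so a boundary
            # straddling two consecutive reads is still found next time.
            write(chunk[:-keep])
            tail = chunk[-keep:]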

To begin with, I created a 20 MB file with 20% of the file filled with line feeds.

File - 20MB.bin
Size - 20 MB
Description - file filled with 20% (~4 MB) '\n'
Parse time with default cgi module - 53 seconds
Parse time with patch - 0.4 seconds

This time increases linearly with the number of LFs for the default module, i.e. keeping the size the same at 20 MB and doubling the number of LFs to 40% would double the parse time.

I also tried a normal large binary file that I found on my machine.
Size - 88 MB
Description - binary executable on my machine; the binary image has 140k LFs
Parse time with default cgi module - 2.7 seconds
Parse time with patch - 0.7 seconds

I have tested with a few other files and noticed the time is cut by at least half for large files.

    Note:
These numbers are consistent over multiple observations.
    I tested this using the script attached, and also on my localhost server.
    The time taken is obtained by running the following code.

    t1=time.time()
    cProfile.run("fs = cgi.FieldStorage()")
    print(str(len(fs['datafile'].value)))
    t2 = time.time()
    print(str(t2 - t1))

I have tried to keep the patch compatible with the current module. However, I have introduced a ValueError exception in the module when the boundary is very large (over 1024 bytes); the RFC specifies the maximum length to be 70 bytes.

    @pitrou
    Member

    pitrou commented Oct 13, 2014

    Rishi, thanks for the patch. I was going to give a review but first I have to ask: is so much support code necessary for this?

    Another approach would be to wrap self.fp in a io.BufferedReader (if it's not already buffered) and then use the peek() method to find the boundary without advancing the file pointer.
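A minimal sketch of that idea, assuming fp is (or wraps) a raw binary stream; the names and window size are illustrative, not the cgi module's actual code, and edge cases such as a very short look-ahead window near EOF are glossed over:

    import io

    def read_until_boundary(fp, boundary):
        if not isinstance(fp, io.BufferedReader):
            fp = io.BufferedReader(fp)     # assumes fp is a raw binary stream here
        marker = b"\n--" + boundary        # boundary lines always start on a new line
        payload = bytearray()
        while True:
            window = fp.peek(1 << 16)      # look ahead without advancing the file pointer
            if not window:
                break                      # EOF
            pos = window.find(marker)
            if pos != -1:
                payload += fp.read(pos)    # consume the payload only; stop at the boundary
                break
            # No boundary in view: consume all but a marker-sized tail so a
            # boundary split across two peeks is still detected on the next pass.
            payload += fp.read(max(len(window) - len(marker), 1))
        return bytes(payload)

The split between peek() and read() is what keeps the file position from ever running past the boundary.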

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 13, 2014

Antoine, I will upload a patch that relies on BufferedReader. As you mentioned, it gets rid of the hand-rolled buffer and removes a lot of code.
The only issue is that I need to know whether the current buffer is at EOF (the documentation of peek() does not guarantee that a return shorter than requested means EOF), because the patterns at EOF are different, but I think I can work around that.

    @serhiy-storchaka
    Member

I doubt we can use io.BufferedReader or handmade buffering here. The current code doesn't read more bytes than necessary. A buffered reader will read ahead, and there is no way to return read bytes back to the stream in the general case (seekable streams are an exception). It can also block while trying to fill a buffer with unnecessary bytes.

I think the user of the cgi module is responsible for wrapping a stream in io.BufferedReader (if that is acceptable); the cgi module shouldn't do this itself.

But we can implement special cases for buffered or seekable streams. This optimization will not work in the general case, however (e.g. for stdin or a socket).
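A small sketch of what such special-casing could look like (the helper name is assumed, not taken from any attached patch):

    def can_look_ahead(fp):
        # Chunked boundary search is only safe when the stream either exposes
        # peek() (a buffer we can inspect without consuming) or can seek back
        # after an over-read; pipes and sockets offer neither.
        peek = getattr(fp, "peek", None)
        return callable(peek) or (hasattr(fp, "seekable") and fp.seekable())

    # Hypothetical use inside the parser:
    #   if can_look_ahead(self.fp):
    #       ... chunked boundary search ...
    #   else:
    #       ... existing line-by-line readline() loop ...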

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Oct 15, 2014

I have recreated the patch (issue1610654_1.patch), and it performs more or less like the earlier patch.

Serhiy,
I agree we cannot use handmade buffering here without seeking ahead.
I believe we can make optimizations for streams which are buffered and non-seekable.
The cgi module's default file object is the BufferedReader wrapping sys.stdin, so the solution is fairly generic.

I have removed the handmade buffering, and I do not create a Buffered* object either.
We rely on the user to create the buffered object. The sys.stdin that the cgi module uses has a decent buffer underneath that works well on Apache.

The patch attached does not seek, nor does it read ahead. It only looks ahead.
As Antoine suggests, it peeks at the buffer and determines through a fast lookup whether the buffer contains a boundary or not.
It moves forward only if it is convinced that the current buffer lies entirely before the next boundary.

The issue is that the current implementation deals with lines and not chunks.
Even when a savvy user wraps sys.stdin in a large BufferedReader, there is little to no performance improvement in the current implementation for large files in my observation, and it does not solve the bug mentioned either.
The difference in extreme cases like Chui's is 53 s against 0.7 s, and even otherwise, for larger files the patch is about 3 times faster than the current implementation.
I have tested this on an Apache2 server where sys.stdin is buffered.

    @pitrou
    Member

    pitrou commented Oct 28, 2014

    Thanks for the updated patch. I'll take a look soon if no-one beats me to it.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Nov 4, 2014

    Patch updated from review comments. Also added a few corner test cases.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Nov 8, 2014

    Hi,
I have created a new patch with a small design change. The change is that, in situations where I don't find the boundary, instead of keeping the last x bytes in the buffer I simply drain the whole buffer and call readline().
This seems like the right thing to do as well. I managed to get rid of the two obfuscated helper functions keep_x_buffer and remove_x_buffer that I had, and the code should look familiar (I hope) to a module owner.
This also let me get rid of quite a few class member variables that I could move into locals of my main function (multi_read).
I still need to maintain an overlap, but only for the trailing CRLF before a boundary.
I ran all the new and old tests, and tested on Apache with the Ubuntu server ISO image. Without the patch the Ubuntu server ISO took 93 seconds; with the patch it took 25 seconds.
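Very roughly, the simplified read step described here might look like the following (a sketch under assumptions; the helper name, window size and pattern handling are not taken from the patch itself):

    def next_piece(fp, pattern):
        # pattern: compiled regex for the full boundary, e.g.
        # re.compile(rb"(?:\r\n|\n)--" + re.escape(boundary)); fp supports peek().
        window = fp.peek(1 << 16)
        m = pattern.search(window)
        if m:
            # Boundary in sight: everything before it is payload; the boundary
            # line itself is left unread for the caller to consume.
            return fp.read(m.start()), b""
        # No boundary visible: drain the window up to its last newline, then
        # readline() to fetch the next complete line. The caller's existing
        # line-oriented check decides whether that line is a boundary; the CRLF
        # preceding a boundary still needs the small overlap mentioned above.
        cut = window.rfind(b"\n") + 1
        return fp.read(cut), fp.readline()

The caller appends the first element to the payload and runs the second through the same line-based boundary check as the original loop.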

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 22, 2014
    @serhiy-storchaka
    Member

The new test fails with non-modified code. Either there is a bug in the current code or the test is wrong.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Dec 9, 2014

    There is indeed a test failure that occurs without the patch. This is a new test I had added.
The reason is that in the existing implementation, when a boundary does not exist, the trailing CRLF, LF or, for that matter, CR is not included as part of the payload. I think that is not correct.

However, to keep this patch compatible with the behavior of the existing implementation, I have updated the patch to strip a single CRLF, LF or CR from the payload if a boundary is not found.

    @rishimakerforum
    Mannequin

    rishimakerforum mannequin commented Dec 9, 2014

One of my comments exceeded the wrapped line limit. I also changed the test in question from checking the lengths of the expected and actual buffers to checking the contents of the respective buffers.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @kumaraditya303
    Contributor

Closing, as the cgi module is deprecated per PEP 594 and no improvements or bug fixes will be made.
