OBOE (off by one error) when determining if all chunks have been read #55

Closed
mhoglan opened this issue Mar 7, 2015 · 5 comments

@mhoglan
Contributor

mhoglan commented Mar 7, 2015

The code for determining if all chunks have been read has an off-by-one error (OBOE) that causes not all bytes to be read when the chunk contains exactly 1 byte more than the number of bytes read after the copy:

        if g.cIdx >= g.rChunk.size-1 { // chunk complete
            g.sp.give <- g.rChunk.b
            g.chunkID++
            g.rChunk = nil
        }

This occurs when the chunk size is a multiple of the default byte buffer size that io.Copy uses (32 KiB, 32 * 1024), plus 1 byte. When the copy exits, g.cIdx is updated with the number of bytes read, which makes the variable a 1-based count (as opposed to a 0-based index into the buffer). The chunk size is a length, which is also 1-based, so there is no need to offset by 1.
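
For illustration, the check without the extra offset would read as follows (mirroring the snippet above; not necessarily the exact patch):

        // cIdx is a byte count and size is a length, both 1-based, so no -1 offset
        if g.cIdx >= g.rChunk.size { // chunk complete
            g.sp.give <- g.rChunk.b
            g.chunkID++
            g.rChunk = nil
        }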

When this off-by-one is hit, the goroutines end up in an infinite select: the next iteration of the loop checks whether g.bytesRead == g.contentLen, which is false because 1 byte remains in the chunk, and it then calls g.nextChunk() since g.rChunk was cleared out earlier. The goroutine now blocks on the select in nextChunk() even though there are no more chunks.

Goroutine trace from pprof:

goroutine 43 [select, 83 minutes]:
github.com/rlmcpherson/s3gof3r.(*getter).nextChunk(0xc2083705a0, 0x44c2d4, 0x0, 0x0)
    /go/src/github.com/rlmcpherson/s3gof3r/getter.go:244 +0x2d9
github.com/rlmcpherson/s3gof3r.(*getter).Read(0xc2083705a0, 0xc21b1e0000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
    /go/src/github.com/rlmcpherson/s3gof3r/getter.go:207 +0x182
io.Copy(0x7feef9a5e5a8, 0xc208064800, 0x7feef9a6e320, 0xc2083705a0, 0x180000, 0x0, 0x0)
    /usr/local/go/src/pkg/io/io.go:353 +0x1f3
project/dbimport.(*bucketUtils).concat(0xc21b2071c0, 0xc208242000, 0x5a, 0x80, 0xc21b64ab40, 0x40, 0x0, 0x0)
    /go/src/project/main.go:217 +0x7ef
project/dbimport.processS3Files(0xc2080ec1e0, 0xc2081ec660, 0xc2083ce2c0, 0x7fffc6ee6e1f, 0xf, 0xc20821ff50, 0x2a, 0xc208269200, 0x5a, 0x5a, ...)
    /go/src/project/main.go:792 +0x17a3
project/dbimport.func·007(0xc2083cc2a0)
    /go/src/project/main.go:622 +0x13b6


goroutine 93481 [select]:
github.com/rlmcpherson/s3gof3r.func·002()
    /go/src/github.com/rlmcpherson/s3gof3r/pool.go:42 +0x6cd
created by github.com/rlmcpherson/s3gof3r.bufferPool
    /go/src/github.com/rlmcpherson/s3gof3r/pool.go:68 +0x15a

This can be reproduced with any file that is 1 byte larger than a multiple of 32 KiB, such as, in my case, 1081345 bytes (32 * 1024 * 33 + 1), with the default PartSize of 20 MiB.

The last couple of iterations of the loop, with some debug output:

...
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:246: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:271: ------nextChunk select------

This results in an infinite select loop. The file on disk is missing 1 byte (the last byte of the file).

This can also be reproduced by using any PartSize that is a multiple of 32 KiB plus 1. Below is an example of a file that succeeded with the default 20 MiB PartSize but fails with a PartSize of 131073 (32 * 1024 * 4 + 1); it also shows a multipart download, which the example above did not.

...
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 32768
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:246: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 29900
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 29900
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 95436
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:271: ------nextChunk select------

This results in an infinite select loop. The file on disk will be missing 1 byte for every chunk.

A pull request is incoming which addresses the issue and adds a couple of guards (see the sketch after this list):

  • If the bytes read do not equal the content length and all chunks have been processed, error
  • If the bytes read are greater than the content length, error
    • This should not occur, since Go uses a LimitedReader up to the content length, but it is here for completeness
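
A rough sketch of what those guards might look like inside Read (the field names follow the debug output above; this is not the exact code from the pull request):

    // Sketch only: fail fast instead of selecting forever.
    if g.chunkID >= g.chunkTotal && g.bytesRead != g.contentLen {
        return 0, fmt.Errorf("all %d chunks processed but read %d of %d bytes",
            g.chunkTotal, g.bytesRead, g.contentLen)
    }
    if g.bytesRead > g.contentLen {
        return 0, fmt.Errorf("read %d bytes, exceeding content length %d",
            g.bytesRead, g.contentLen)
    }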

For more robustness, there could be a timeout on the select in nextChunk(): if for any reason the routine gets in here and there are no more chunks, it will block forever. We cannot rely on the underlying HTTP connection going away and triggering a close, because the data has already been read into memory. I did consider having the worker() function close g.readCh, but that would not break it out of the select (it would if it were a range over the channel instead of a select). I have not thought this fully out, but I feel something can be done to signal that no more chunks will be arriving on the channel because the workers are gone.
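
For illustration, a bounded wait in that select might look something like this (a sketch only; g.readCh is referenced in this thread, but the timeout constant, error formatting, and elided cases are assumptions):

    // Sketch only: the existing quit-channel case is elided.
    select {
    case g.rChunk = <-g.readCh:
        // existing bookkeeping for the new chunk
    case <-time.After(chunkWaitTimeout): // chunkWaitTimeout is a hypothetical duration
        return fmt.Errorf("timed out waiting for chunk %d", g.chunkID)
    }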

rlmcpherson added a commit that referenced this issue Mar 9, 2015
* mhoglan-nil-key-request:
  add nil path test
  Added guard protections for requests / responses
  Add guard to handle if nil path is requested
  Add guard to handle if content-length from response is -1, which represents that the response is chunked transfer encoding or EOF; close response
  change naked URL to a link from "IAM role"
  Correct usage of help command.
  Correct version number.
  Handle cases where etag is missing gracefully.
@rlmcpherson
Owner

Regarding the nextChunk method, if the quit channel is closed due to an error, it will return that error. That is the designed way for this method to exit when a problem occurs.

@mhoglan
Contributor Author

mhoglan commented Mar 10, 2015

I was thinking more of a notification that no more chunks are coming, so the nextChunk method doesn't select forever. It still needs to work through whatever chunks are present, but if no chunks exist and the workers are done, it should quit. With the guards it is less likely to get into that state, so the benefit is lower now. Maybe it is as simple as adding a channel that the worker() function closes when it is done iterating over the chunks.

I was also looking at wrapping the reader and timing out at the application level (if my read takes too long, or if after so many checks I notice the read has not progressed), then calling reader.Close(), which will close the quit channel and let everything exit out.
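
Something like the following wrapper is what I had in mind (a sketch only, with hypothetical names; assumes the standard io and time packages):

    type watchdogReader struct {
        rc     io.ReadCloser // e.g. the s3gof3r getter
        window time.Duration // maximum time a single Read may block
    }

    func (w *watchdogReader) Read(p []byte) (int, error) {
        // If this Read blocks longer than the window, close the underlying
        // reader; per the discussion above, that closes the quit channel and
        // unblocks the select, so Read returns an error instead of hanging.
        timer := time.AfterFunc(w.window, func() { w.rc.Close() })
        defer timer.Stop()
        return w.rc.Read(p)
    }

    func (w *watchdogReader) Close() error { return w.rc.Close() }

The copy would then go through the wrapper, e.g. io.Copy(dst, &watchdogReader{rc: r, window: 2 * time.Minute}).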

@rlmcpherson
Owner

The worker() method can't do that since there is more than one of them running, and they only send on readCh when a chunk is complete. Only the Read method calls nextChunk(), and it is also the only method that increments chunkID, so unless the Read method is incorrect, nextChunk() can only be called when there are chunks remaining to be received from readCh. I'm open to a better solution, but since all of this is internal, the scenario cannot occur unless the Read implementation is incorrect.

@mhoglan
Contributor Author

mhoglan commented Mar 12, 2015

It would have to be a separate goroutine that has a CountDownLatch (I'm not sure if this exists in vanilla Go) on the workers finishing, and when triggered it would signal that all workers are done.
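
For what it's worth, Go's closest built-in equivalent to a CountDownLatch is sync.WaitGroup. A minimal, self-contained sketch of the pattern (generic names, not the actual s3gof3r internals):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        readCh := make(chan int)           // stands in for g.readCh
        workersDone := make(chan struct{}) // closed once every worker has exited
        var wg sync.WaitGroup

        // Each worker delivers one "chunk" and exits.
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                readCh <- id
            }(i)
        }

        // Separate goroutine: wait on the latch, then signal no more chunks.
        go func() {
            wg.Wait()
            close(workersDone)
        }()

        // Consumer: with an unbuffered readCh, workersDone can only close
        // after every chunk has been received, so nothing is lost.
        for {
            select {
            case c := <-readCh:
                fmt.Println("got chunk", c)
            case <-workersDone:
                fmt.Println("all workers done; no more chunks will arrive")
                return
            }
        }
    }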

Agreed that the chances of it happening now are much lower. The guard check for all chunks being finished while the bytes read do not equal the content length will catch the majority of this.

I'll think about it some more and put together a more concrete example; I know it would help to see pseudocode or a more concrete list of actions, and as I do that I will probably come to the same understanding you are at, since you know the code / flow better. I am working on a plan that could utilize the WriterAt type and allow even more concurrency, and that might inherently do this as well.

@rlmcpherson
Owner

I'm closing this issue as the original bug is resolved. Please feel free to open another issue for the topic discussed here.
