
Block Uploads vs. Offset Ranges #55

Closed
Acconut opened this issue Jan 23, 2015 · 19 comments

Acconut commented Jan 23, 2015

In this issue I will propose two extensions which both aim to implement the features needed for parallel uploads and non-contiguous chunks (see #3). In the end we have to choose one of them or go with something else.

Before starting I want to clarify that streaming uploads (with unknown length at the beginning) have not been included in these thoughts since you should not use them in conjunction with parallel uploads.


The first solution uses offset ranges. Instead of defining a single offset which starts at 0 and is incremented with each PATCH request, the server would store one or multiple ranges of allowed and free offsets. These ranges would be returned in the Offset-Range header of the HEAD response, replacing the Offset response (!) header. The client then uses this information to choose an offset and uploads the same way as currently implemented.

Here is an example of a 300-byte file of which the second 100 bytes (100-199) have been uploaded:

HEAD /files/foo HTTP/1.1

HTTP/1.1 204 No Content
Entity-Length: 300
Offset-Range: 0-99, 200-299

PATCH /files/foo HTTP/1.1
Content-Length: 100
Offset: 200

[100 bytes]

HTTP/1.1 204 No Content

HEAD /files/foo HTTP/1.1

HTTP/1.1 204 No Content
Entity-Length: 300
Offset-Range: 0-99

The range of the last 100 bytes (200-299) has been removed since this buffer has been filled successfully by the upload.

While this solution allows the maximum of flexibility (compared to my second proposal), since you can upload at any offset as long as it's available, it may be a tough extension for servers to implement. The server has to ensure that both the start and the end of the uploaded chunk fall inside a free offset range. Using the example from above, you're not allowed to patch a 150-byte chunk at offset 0 because the bytes starting from 100 have already been written.
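To illustrate that validation burden, here is a minimal sketch (function and variable names are made up for this example) of the check a server would have to perform, assuming the free ranges are stored as inclusive start/end pairs as in the Offset-Range example above:

def chunk_fits(free_ranges, offset, length):
    # Return True if the chunk [offset, offset + length - 1] lies entirely
    # inside one of the free (inclusive) offset ranges.
    end = offset + length - 1
    return any(start <= offset and end <= stop for start, stop in free_ranges)

# Using the example above: the ranges 0-99 and 200-299 are still free.
free = [(0, 99), (200, 299)]
assert chunk_fits(free, 200, 100)      # the PATCH from the example is accepted
assert not chunk_fits(free, 0, 150)    # rejected: bytes 100-149 are already written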


The second solution I came up with involves a bit more: When creating a new upload (using the file creation extension or some other way) a blocksize is defined, using which the file is separated into different blocks. For example, for a file of 5KB and a blocksize of 2KB you would end up with two blocks of 2KB and a single one of 1KB. The important point is that each block has its own offset which starts at position 0 relative to the starting position of the block.

Considering the last example, the relative offset 100 of the second block corresponds to the absolute offset 2148: 2048 (the starting position of the second block) + 100 (the relative offset).
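As a quick illustrative sketch of that conversion (function and parameter names are made up for this example):

def absolute_offset(blocksize, block_index, relative_offset):
    # Translate a block-relative offset into an absolute file offset.
    return block_index * blocksize + relative_offset

# The example above: second block (index 1), blocksize 2048, relative offset 100.
assert absolute_offset(2048, 1, 100) == 2148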

Only one upload is allowed at the same time per block. In this example a maximum of three parallel uploads is allowed. Each new PATCH request must resume where the last upload of the block has stopped; jumps are not allowed.

In the following example we consider a file of 5KB with a blocksize of 2KB. The first block is already fully uploaded (2048 bytes), the second is filled with 100 bytes and the last one has not received a single write yet. We are going to upload 100 bytes at the relative offset of 100 into the second block:

HEAD /files/bar HTTP/1.1

HTTP/1.1 204 No Content
Entity-Length: 5120
Blocksize: 2048
Block-Offset: 2048, 100, 0

PATCH /files/bar HTTP/1.1
Content-Length: 100
Offset: 2148

[100 bytes]

HTTP/1.1 204 No Content

HEAD /files/bar HTTP/1.1

HTTP/1.1 204 No Content
Entity-Length: 5120
Blocksize: 2048
Block-Offset: 2048, 200, 0

Please post your opinion about these solutions (I prefer my last proposal) or any additional way we could achieve parallel and non-contiguous uploads. Also take the time to consider the implementation effort for servers and clients.

Acconut added this to the 1.0 milestone Jan 23, 2015

kvz commented Jan 25, 2015

I think the second solution is better too

The first one is hard to implement, but also brittle as it requires some form of synchronization between the parallel uploaders in order to guarantee they don't write in the same ranges.

I wonder if a combination of the solutions makes sense:

Upon file creation, the client specifies how many parallel blocks it wants to work on.
The server returns a range of absolute offsets.
The client starts writing into these blocks in parallel, but each 'worker' owns a block.
The upside of partitioning early on is less chance of conflicts / easier implementation. The downside is the upload is as slow as the slowest block.

A different thought I had is: what if we leave the parallel logic out of the uploading itself, but offer a merge request that lets the server know several completed uploaded files (in reality they are blocks) should be concatenated?


Acconut commented Jan 27, 2015

The downside is the upload is as slow as the slowest block.

This is a general "problem" about parallel upload which occurs for every solution. The file isn't fully uploaded until the last byte is received.

A different thought I had is: what if we leave the parallel logic out of the uploading itself, but offer a merge request that lets the server know several completed uploaded files (in reality they are blocks) should be concatenated?

That's a brilliant idea! While keeping the parallel extension spec as small and focused as possible, parallel uploads can also take advantage of all other features.
I could think of the following upload flow:
When creating the different blocks (using the file creation extension) you have to tell the server that each upload is just part of one bigger file (so it doesn't start processing it once a single upload is done).
After the client has created (but not necessarily uploaded) all blocks, it must tell the server which uploads (identified by their file URLs) are to be concatenated and in which order.
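To make that flow concrete, here is a rough client-side sketch. It is purely illustrative: the Merge: partial / Merge: final header values and the separator in the URL list follow the draft being discussed here and are assumptions, not a finalized API.

# Illustrative sketch of the create-upload-concatenate flow; header names
# and the Merge: final value format are assumptions.
import requests

BASE = "http://tus.example.org"

part_urls = []
for chunk in (b"hello, ", b"world"):
    # 1. Create a partial upload and mark it as part of a bigger file.
    r = requests.post(BASE + "/files", headers={
        "Merge": "partial",
        "Entity-Length": str(len(chunk)),
    })
    url = r.headers["Location"]
    # 2. Upload the chunk (in practice these PATCH requests run in parallel).
    requests.patch(url, data=chunk, headers={"Offset": "0"})
    part_urls.append(url)

# 3. Tell the server which uploads to concatenate and in which order.
requests.post(BASE + "/files", headers={"Merge": "final;" + " ".join(part_urls)})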


Acconut commented Jan 28, 2015

@kvz After thinking a bit more about the merge approach, it offers a way to drop the Offset header entirely. Its value in a PATCH request must not be smaller than the current offset of the resource (see #51). Furthermore, non-contiguous uploads should be implemented by creating multiple uploads and merging them (at any time). This means that no Offset bigger than the current offset must be allowed. So the only allowed value of this request header is the one returned in the HEAD request.
In the end we should think about removing the Offset header altogether since its main functionality, indicating the offset to write at, is not necessary any more. The only reason to keep it is to verify that the client is aware of which offset it uses. In some cases (e.g. a connection drop or internal errors) the offset on the server may be different from the one the client expects.
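For illustration, a minimal sketch of using the Offset header purely as that sanity check on the server side; the 409 status for a mismatch and all names here are assumptions, not part of the spec:

class Upload:
    def __init__(self):
        self.offset = 0
        self.data = bytearray()

    def append(self, chunk):
        self.data += chunk
        self.offset += len(chunk)

def handle_patch(upload, offset_header, body):
    client_offset = int(offset_header)
    if client_offset != upload.offset:
        # Client and server disagree (e.g. after a dropped connection):
        # reject instead of silently appending at the wrong position.
        return 409, {"Offset": str(upload.offset)}
    upload.append(body)
    return 204, {"Offset": str(upload.offset)}

u = Upload()
assert handle_patch(u, "0", b"hello")[0] == 204
assert handle_patch(u, "3", b"oops")[0] == 409  # stale offset is rejected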

In addition I had some initial thoughts on how to implement this feature. Here's my first draft using a new Merge header in the POST request: https://gist.github.com/Acconut/65228c8486e284d49e10


kvz commented Jan 29, 2015

This is a general "problem" about parallel upload which occurs for every solution. The file isn't fully uploaded until the last byte is received.

I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.

As for merging 'regular' tus uploads, I'm glad you like it :)

To verify I understand, you're saying we can remove Offset as any PATCH will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?

As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful Merge: final;?


Acconut commented Jan 29, 2015

I meant that if you don't work with fixed partitioning, you could repartition so that threads could help out with remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.

I agree, this case is too specific.

To verify I understand, you're saying we can remove Offset as any PATCH will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?

Blocks with failed checksum validation never get written and so the offset of the file is never changed. As I said the only function left to the Offset header is verifying that the client uploads the data from the correct offset.

As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful Merge: final;?

In some cases you want to and sometimes you don't (e.g. reusing a partial upload to merge with another). What about keeping this application-specific?


kvz commented Jan 30, 2015

Agreed, Marius. Unless other contributors have objections I'd say this can be formalized into the protocol.

Acconut added a commit that referenced this issue Jan 30, 2015
addresses #55, #51 and #3

vayam commented Feb 1, 2015

@Acconut that is awesome you are tackling the parallel upload problem.
@kvz some nice feedback there.

While the Merge approach looks good, I have two concerns:

  • The Merge: final; header getting bloated for large files. How about posting it as a JSON array instead?
  • The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.


Acconut commented Feb 1, 2015

@vayam Thanks for the feedback, I appreciate it a lot! 👍

The Merge: final; header getting bloated for large files. How about posting it as a JSON array instead?

This is a concern we have discussed internally, too. For this reason we have allowed removing the protocol and hostname from the URLs (see https://github.com/tus/tus-resumable-upload-protocol/blob/merge/protocol.md#merge-1):

The host and protocol scheme of the URLs MAY be omitted. In this case the value of the Host header MUST be used as the host and the scheme of the current request.

This enables you to use /files/24e533e02ec3bc40c387f1a0e460e216 instead of http://master.tus.io/files/24e533e02ec3bc40c387f1a0e460e216 to save some bytes.

Assuming a maximum of 4KB of total headers (default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-, User-Agent, Cache-) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?
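Back-of-the-envelope, assuming roughly 40 bytes per relative URL (the URL length is an assumption, not part of the proposal):

# Rough header budget estimate.
header_limit = 4096      # nginx default buffer for request headers
other_headers = 300      # Host, Accept-*, User-Agent, Cache-*, ...
url_length = 40          # e.g. "/files/24e533e02ec3bc40c387f1a0e460e216" plus a separator
print((header_limit - other_headers) // url_length)  # -> 94, i.e. roughly 90 URLs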

I would like to leave the body untouched by tus (against my earlier opinion). I had the idea of allowing the merging of final uploads: the uploads A and B are merged into AB, C and D are merged into CD, and in the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?

The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.

What about including the Merge header in the HEAD response?


vayam commented Feb 1, 2015

Assuming a maximum of 4KB of total headers (default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-, User-Agent, Cache-) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?

You are right. I don't expect more than 10-15 parallel uploads. Say you are uploading 10GB as 1024 10MB chunks. The client uploads chunks 1-10 in parallel. Once it finishes those it merges them into, say, B, then does 10 more parallel uploads and merges them with the first merged file B, and so on, right?

What about including the Merge header in the HEAD response?

That works! But how would the server tie the part uploads to the main upload? I don't understand. Can you explain?

I came up with a slightly modified version:
https://gist.github.com/vayam/774990b5919cc863dee7
Alert: I am thinking aloud here.
Let me know what you guys think?


Acconut commented Feb 1, 2015

Say you are uploading 10GB as 1024 10MB chunks. The client uploads chunks 1-10 in parallel. Once it finishes those it merges them into, say, B, then does 10 more parallel uploads and merges them with the first merged file B, and so on, right?

Currently this is not allowed by the specification. You can only merge partial uploads and not final uploads which consist of merged partial uploads. But we may allow this in the future if it is necessary. Your example is basically the same as mine:

The uploads A and B are merged into AB, C and D are merged into CD, and in the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?

Let me know if you want to see this in the protocol.

But how would the server tie the part uploads to the main upload? I don't understand. Can you explain?

Ok, I think I paid too little attention to this point: basically, merging uploads is a simple concatenation. Assume the following three uploads:

ID  Content
1   abc
2   def
3   ghi

If you then merge the uploads 1, 2 and 3 in this order, the final upload will have a length of 9 bytes (3 * 3) and its content will be abcdefghi.
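A tiny sketch of that semantic (illustrative only):

# Concatenate completed partial uploads in the order given by the client.
parts = {1: b"abc", 2: b"def", 3: b"ghi"}
order = [1, 2, 3]

final = b"".join(parts[i] for i in order)
assert final == b"abcdefghi"
assert len(final) == 9  # 3 * 3 bytes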

I came up with a slightly modified version:
https://gist.github.com/vayam/774990b5919cc863dee7
Alert: I am thinking aloud here.
Let me know what you guys think?

I am not able to follow your thought. Could you please explain it?


kvz commented Feb 2, 2015

We 'decided' against allowing the concatenation of finals to keep focus and to reduce the surface for bugs to appear, but I'm willing to be persuaded otherwise. Are there compelling use cases we can think of?

Also, would Concatenate be a better word than Merge?


vayam commented Feb 2, 2015

@Acconut I reread the Merge proposal. Say I am uploading a 10GB file with 10MB chunks. I will have about 1024 partial files to merge. Wouldn't that make the header bloated? To keep the Merge: final header small I have to make sure I upload larger file chunks.

Second question, about the server listing the partial files: say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.

We 'decided' against allowing the concatenation of finals to keep focus and to reduce the surface for bugs to appear, but I'm willing to be persuaded otherwise. Are there compelling use cases we can think of? Also, would Concatenate be a better word than Merge?

@kvz concatenate or concat works too. I am also not very keen on concatenating finals.

@Acconut @kvz sorry, my gist was unclear. I will try to explain better.

POST /files HTTP/1.1
Merge: partial
Entity-Length: 10737418240

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/a/

The server creates a directory instead of a file. A directory can be a more logical unit if you are in turn persisting into cloud storage, say.

The client puts parts of the file using PUT /files/id/part1 .. /files/id/partN

PUT /files/a/part1 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content

PUT /files/a/part2 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content

Client lists files (needs more discussion)

HEAD /files/a/ HTTP/1.1
MERGE:final

HTTP/1.1 200 Ok
MERGE:final;part1-part1024

The next step is to create the final upload. In the following request no
Entity-Length header is present.

POST /files/a/ HTTP/1.1
Merge: final

HTTP/1.1 204 No Content
MERGE:final;part1-part1024
Location: http://tus.example.org/files/b

The goals I am trying to address:

Allow simultaneous stateless partial uploads.
Allow a large number of partial uploads.
The client has as little state as possible. The client can request partials for a specific whole entity at any time.


Acconut commented Feb 2, 2015

Second question, about the server listing the partial files: say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.

A client is able to get the offset of every partial upload using a HEAD request. Of course, the client MUST store the URLs of these uploads in order to resume them. If the browser/environment crashes and this information is lost, we, as the server, cannot do anything to prevent this. Your solution has the same "problem": if the client loses the URL of the created directory (/files/a/, in this case) it is not able to resume.
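For illustration, a client-side recovery sketch under the assumption that the part URLs survived the crash (e.g. in local storage); the header names follow this thread, everything else is made up:

# Sketch: after a restart, ask the server how far each partial upload got.
import requests

def remaining_bytes(part_urls):
    progress = {}
    for url in part_urls:
        r = requests.head(url)
        offset = int(r.headers["Offset"])
        length = int(r.headers["Entity-Length"])
        progress[url] = length - offset  # bytes still missing for this part
    return progress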

Speaking about your proposal I have a question:

Must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning. The same applies for creating new parts (/files/id/part1).

A general problem with your approach is that you require the server to define the URLs, which is against a principle written in the 1.0 branch:

This specification does not describe the structure of URLs, as that is left for the specific implementation to decide. All URLs shown in this document are meant for example purposes only.

While I won't defend this rule to the death, I want to question breaking it as long as it is not absolutely necessary.


Acconut commented Feb 2, 2015

@vayam Another problem I see with your solution is that the client needs to create the parts chronologically in their order. You are not able to change them afterwards and have to know how to partition the file before uploading it. Using the merge/concat approach you indeed have the possibility to throw stuff around. This is especially important if you deal with non-contiguous chunks.


vayam commented Feb 2, 2015

@Acconut I agree with you. The part1 naming was just an example. You are right, the protocol shouldn't dictate it.

Must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning.

You are right. Entity-Length is not needed.

The same applies for creating new parts (/files/id/part1).

Agreed

Using the merge/concat approach you indeed have the possibility to throw stuff around. This is especially important if you deal with non-contiguous chunks.

Good point. What if we did

Client lists files (needs more discussion)

HEAD /files/a/ HTTP/1.1
MERGE:final

HTTP/1.1 200 Ok
MERGE:final;part1,part2....part1024

Final Merge

POST /files/a/ HTTP/1.1
MERGE:final;part1,part2,part1024

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/b

I was thinking more along the lines of http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuAndPermissions.html

I am convinced your approach would be no issue for most use cases. Let us go with that.


Acconut commented Feb 2, 2015

I will add a section to the merge extension to return the Merge header on HEAD requests. In addition I will clarify how files are merged/concatenated.

@vayam Just to be sure: are you ok with going with the current proposal in #56 (plus the changes in the sentence above)? If so, I will merge the PR and release the 1.0 Prerelease.


vayam commented Feb 2, 2015

👍 sure go ahead and merge!


Acconut commented Feb 3, 2015

@vayam See 5e0ccb3.


Acconut commented Feb 3, 2015

Merged in #56.
