Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upDynamic chunk size #122
Conversation
juanshishido
added some commits
Aug 21, 2017
This comment has been minimized.
This comment has been minimized.
|
Looking for feedback. If accepted, the plan is to implement the same logic in the Ruby and PHP SDKs. |
| response_time = int(response.headers['x-response-time']) | ||
| chunk_size = min_chunk_size * size(self._DEFAULT_CHUNK_SIZE, | ||
| self._RESPONSE_TIME_MAX, | ||
| response_time) |
This comment has been minimized.
This comment has been minimized.
jbabichjapan
Aug 21, 2017
Just confirming...does this need something like 253: location = response.headers['location'] from the else branch above on the last chunk? (We had some issues related to whether or not the last chunk has the location header, maybe it's on the first chunk..)
This comment has been minimized.
This comment has been minimized.
juanshishido
Aug 22, 2017
Contributor
On the POST request to the /1.1/ton/data/ta_partner endpoint, you'll see something like:
location: /1.1/ton/data/ta_partner/2417045708/AeEFBiuxuFY2zgt.txt?resumable=true&resumeId=396213
On the last PUT request, you'll see something like:
location: /ta_partner/2417045708/AeEFBiuxuFY2zgt.txt
What we do, using the location value from line 253, is return location.split("?")[0], keeping everything before the ?: /1.1/ton/data/ta_partner/2417045708/AeEFBiuxuFY2zgt.txt.
| resource = "{0}{1}".format(self._DEFAULT_RESOURCE, self.bucket) | ||
| response = self.__upload(resource, open(self._file_path, 'rb').read()) | ||
| return response.headers['location'] | ||
| else: | ||
| response = self.__init_chunked_upload() | ||
| chunk_size = int(response.headers['x-ton-min-chunk-size']) | ||
| min_chunk_size = int(response.headers['x-ton-min-chunk-size']) | ||
| chunk_size = min_chunk_size * self._DEFAULT_CHUNK_SIZE |
This comment has been minimized.
This comment has been minimized.
tushdante
Aug 25, 2017
Contributor
Are we sure that the value of x-ton-min-chunk-size won't change? Having the chunk_size be dependent on this value may cause issues if the header value is updated/corrected. I think we should either:
a. ensure that this header value is updated to reflect the actual minimum chunk size
b. set the chunk_size value as a const
cc @jbabichjapan what do you think?
This comment has been minimized.
This comment has been minimized.
juanshishido
Aug 26, 2017
•
Contributor
Thanks, @tushdante!
x-ton-min-chunk-size is only returned on the POST request to the /1.1/ton/bucket/ta_partner endpoint and is always 1 MiB (1,048,576 bytes). We only see that value once during the upload process. Does that change your thought on this?
This comment has been minimized.
This comment has been minimized.
jbabichjapan
Aug 26, 2017
The code should in best scenario have elegant handling for network errors on the first attempt to upload or subsequent chunks, but otherwise think it should be okay to iterate after this check in
juanshishido commentedAug 21, 2017
Context
This addresses issues such as this one, where SDK users are being rate limited when sending large audience files.
The problem with the current implementation
The
TONUpload'sperformmethod initiates a chunked upload when the input file is greater than or equal to_MIN_FILE_SIZE, currently set at 64 MB (1024 * 1024 * 64). The problem is that it sendsx-ton-min-chunk-size-sized (1 MB) chunks. This means a 63 MB file, because it follows the single-upload path, will only make two requests to the1.1/ton/bucket/ta_partnerendpoint—the initial POST and the subsequent PUT—while a 65 MB file will make 66 requests to the same endpoint.Proposal
For chunked uploads, choose the chunk size based on:
_RESPONSE_TIME_MAXx-response-timeI've defined a 5-second threshold (
_RESPONSE_TIME_MAX = 5000). When the response time for the previous request is less than or equal to that value, we send a max-sized chunk (64 MB, for example). Otherwise, we scale it based on the inverse of the actual response time divided by the threshold. With the 5-second max, if the previous request took 10 seconds, the chunk will be 32 MB (assuming the 64 MB default).