ENG-41215: Client Multipart Upload implementation (together-py) #159
Conversation
Force-pushed from db7db93 to 36a5dd9
src/together/lib/resources/files.py (Outdated)

```python
# Mock server scenario - return mock values for testing
if response.status_code == 200:
    return "https://mock-upload-url.com", "mock-file-id"
```
Let's mock only inside the testing code, never in the actual implementation
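For example, the stub could live entirely in the test suite by patching the HTTP call. A minimal sketch, assuming a `requests`-based transport (the SDK's actual client and `get_upload_url` signature may differ); the endpoint URL and response field names here are illustrative:

```python
from unittest import mock

import requests  # assumed transport; the SDK may use a different HTTP client


def get_upload_url(endpoint: str) -> tuple[str, str]:
    """Stand-in for the real implementation: no mock branch inside."""
    response = requests.post(endpoint)
    response.raise_for_status()
    body = response.json()
    return body["upload_url"], body["file_id"]


def test_get_upload_url() -> None:
    fake = mock.Mock(status_code=200)
    fake.json.return_value = {
        "upload_url": "https://mock-upload-url.com",
        "file_id": "mock-file-id",
    }
    # Patch the HTTP call only in the test; production code stays mock-free.
    with mock.patch("requests.post", return_value=fake):
        assert get_upload_url("https://api.example.com/files") == (
            "https://mock-upload-url.com",
            "mock-file-id",
        )
```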
src/together/lib/resources/files.py (Outdated)

```python
    f"Unsupported file extension: '{file.suffix}'. Supported extensions: .jsonl, .parquet, .csv"
)

def _calculate_parts(self, file_size: int) -> Tuple[int, int]:
```
This seems duplicated across async and non-async file managers
We could make this a plain function and reuse it; it uses no state.
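A sketch of that refactor: lift `_calculate_parts` to a module-level function shared by both the sync and async managers. The constants below are assumed names and values, not the ones defined in this PR:

```python
import math

# Assumed constants; the PR introduces its own configurable values.
TARGET_PART_SIZE = 100 * 1024 * 1024  # 100 MiB per part
MAX_PARTS = 250


def calculate_parts(file_size: int) -> tuple[int, int]:
    """Stateless helper usable by both the sync and async managers."""
    num_parts = min(MAX_PARTS, max(1, math.ceil(file_size / TARGET_PART_SIZE)))
    part_size = math.ceil(file_size / num_parts)
    return part_size, num_parts
```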
```python
file_size = os.stat(file.as_posix()).st_size
file_size_gb = file_size / NUM_BYTES_IN_GB

if file_size_gb > MAX_FILE_SIZE_GB:
```
Is this check a duplicate of the one in the get_upload_url fn? Can we keep just one?
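One way to consolidate: a single validation helper called from one place on the upload path. A sketch; `check_file_size` is a hypothetical name, and `NUM_BYTES_IN_GB = 2**30` is an assumption (the PR may define it as `10**9`):

```python
NUM_BYTES_IN_GB = 2**30   # assumed definition
MAX_FILE_SIZE_GB = 25.0   # per this PR's new limit


def check_file_size(file_size: int) -> None:
    """Single size check, invoked once on the upload path."""
    file_size_gb = file_size / NUM_BYTES_IN_GB
    if file_size_gb > MAX_FILE_SIZE_GB:
        raise ValueError(
            f"File is {file_size_gb:.1f} GB; maximum is {MAX_FILE_SIZE_GB} GB"
        )
```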
```python
    return part_size, num_parts

def _get_file_type(self, file: Path) -> str:
```
This can be a module-level function, or we can make it static.
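For example, the static variant (a sketch; the suffix-to-type mapping is inferred from the supported extensions quoted above, and `FileManager` is a placeholder class name):

```python
from pathlib import Path


class FileManager:
    @staticmethod
    def _get_file_type(file: Path) -> str:
        """Needs no instance state, so a staticmethod (or free function) works."""
        suffixes = {".jsonl": "jsonl", ".parquet": "parquet", ".csv": "csv"}
        try:
            return suffixes[file.suffix]
        except KeyError:
            raise ValueError(
                f"Unsupported file extension: '{file.suffix}'. "
                "Supported extensions: .jsonl, .parquet, .csv"
            ) from None
```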
This implementation adds comprehensive multipart upload functionality to support large dataset uploads up to 25GB, with automatic routing based on file size (a routing sketch follows this list):

- **Size-based routing**: Files >5GB automatically use multipart upload
- **Concurrent uploads**: Up to 4 concurrent parts for optimal performance
- **Progress tracking**: Real-time progress bars for multipart uploads
- **Error handling**: Robust cleanup and retry mechanisms
- **Integrity verification**: SHA256 hash validation for uploaded files
- **Sync multipart**: MultipartUploadManager for synchronous uploads
- **Async multipart**: AsyncMultipartUploadManager using ThreadPoolExecutor
- **Consistent API**: Same upload_file() interface for both approaches
- **Performance optimized**: Efficient concurrent part uploads
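A minimal sketch of the size-based routing and concurrent part uploads described above, not the PR's actual API: `_put_part` stands in for the real presigned-URL PUT, and `NUM_BYTES_IN_GB` and the default part size are assumed values. The >5GB threshold and 4 concurrent parts come from the list above.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

NUM_BYTES_IN_GB = 2**30        # assumed definition of one GB
MULTIPART_THRESHOLD_GB = 5.0   # >5GB routes to multipart, per the list above
MAX_CONCURRENT_PARTS = 4       # per the list above


def _put_part(file: Path, index: int, part_size: int) -> str:
    """Placeholder for the real per-part PUT; returns a fake ETag."""
    with file.open("rb") as f:
        f.seek(index * part_size)
        data = f.read(part_size)
    return hashlib.md5(data).hexdigest()


def upload_file(file: Path, part_size: int = 100 * 1024 * 1024) -> list[str]:
    """Route by size, then upload parts concurrently, keeping ETags in part order."""
    size = file.stat().st_size
    if size / NUM_BYTES_IN_GB <= MULTIPART_THRESHOLD_GB:
        return [_put_part(file, 0, size)]  # single-part path
    num_parts = -(-size // part_size)  # ceiling division
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_PARTS) as pool:
        futures = [pool.submit(_put_part, file, i, part_size) for i in range(num_parts)]
        return [f.result() for f in futures]
```

Collecting futures in submission order keeps the ETag list aligned with part numbers, which the completion call needs.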
Force-pushed from ad4944f to 6fb42a1
* chore(api): Remove auto-generated files upload API to support custom coded version
* feat(api): api update
* feat(api): file upload method signature and functionality match previous version (#174)
  * disable test
  * fix locks
* ENG-41215: Client Multipart Upload implementation (together-py) (#159)
  * feat: Add multipart upload support for large files up to 25GB (full feature list in the PR description above)
  * Update codebase after recent changes
  * reduce duplicated code in sync/async upload managers
* release: 0.1.0-alpha.28

Co-authored-by: stainless-app[bot] <142633134+stainless-app[bot]@users.noreply.github.com>
Co-authored-by: Blaine Kasten <blainekasten@gmail.com>
Co-authored-by: Soroush <soroush.bassam@gmail.com>
Note
Implements concurrent multipart file uploads (sync/async) with larger file limits, updates docs/CLI/examples, and adds tests for file upload.
- `MultipartUploadManager` and `AsyncMultipartUploadManager` with initiate / part PUT / complete / abort flow, ETag handling, progress bars, and error handling.
- `UploadManager`/`AsyncUploadManager` auto-select multipart vs single-part based on size; improved presigned URL handling and validations.
- `MAX_FILE_SIZE_GB` raised to `25.0`; configurable multipart constants introduced (part size/timeout/concurrency/threshold).
- `files upload` CLI passes the `file=` keyword to the client.
- `src/together/resources/files.py` (sync/async) reworked with improved header/redirect logic.
- Default `purpose="fine-tune"`.
- `examples/file-upload.py` and dataset samples (`examples/coqa*.jsonl`) added.
- `test_files_resource.py` added for the upload flow (presign → PUT → preprocess).

Written by Cursor Bugbot for commit 6fb42a1.
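As a sketch of what those configurable constants might look like: only `MAX_FILE_SIZE_GB = 25.0`, the 5GB threshold, and 4 concurrent parts are stated in this PR; the other names and values are illustrative assumptions.

```python
# Illustrative multipart configuration; names/values other than
# MAX_FILE_SIZE_GB, the threshold, and concurrency are assumptions.
MAX_FILE_SIZE_GB = 25.0           # hard upload limit
MULTIPART_THRESHOLD_GB = 5.0      # above this, use multipart upload
MULTIPART_PART_SIZE_MB = 100      # size of each uploaded part (assumed)
MULTIPART_UPLOAD_TIMEOUT = 300    # per-part timeout in seconds (assumed)
MAX_CONCURRENT_PARTS = 4          # parallel part uploads
```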