@sbassam sbassam commented Sep 27, 2025

Note

Implements concurrent multipart file uploads (sync/async) with larger file limits, updates docs/CLI/examples, and adds tests for file upload.

  • Files Upload (Core):
    • Add concurrent multipart upload support via MultipartUploadManager and AsyncMultipartUploadManager with initiate/part PUT/complete/abort flow, ETag handling, progress bars, and error handling.
    • Enhance UploadManager/AsyncUploadManager to auto-select multipart vs single-part based on size; improve presigned URL handling and validations.
    • Increase MAX_FILE_SIZE_GB to 25.0; introduce configurable multipart constants (part size/timeout/concurrency/threshold).
  • CLI:
    • Fix files upload to pass file= keyword to client.
  • SDK Resources:
    • Wire new managers into src/together/resources/files.py (sync/async) and improve header/redirect logic.
  • Docs & Examples:
    • Update README file upload example to include purpose="fine-tune".
    • Add example script examples/file-upload.py and dataset samples (examples/coqa*.jsonl).
  • Tests:
    • Add/enable unit test test_files_resource.py for upload flow (presign → PUT → preprocess).

Written by Cursor Bugbot for commit 6fb42a1.
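
A minimal sketch of the initiate / part PUT / complete / abort flow named in the summary above. `InMemoryBackend` and `multipart_upload` are hypothetical names used only for illustration; they are not the SDK's `MultipartUploadManager` API, and the SHA-256 digest merely stands in for whatever ETag the real storage service returns.

```python
import hashlib
from typing import Dict, List, Optional


class InMemoryBackend:
    """Hypothetical stand-in for the storage service (illustration only)."""

    def __init__(self, fail_on_part: Optional[int] = None) -> None:
        self.parts: Dict[str, Dict[int, bytes]] = {}
        self.completed: Dict[str, bytes] = {}
        self.aborted: List[str] = []
        self.fail_on_part = fail_on_part
        self._next_id = 1

    def initiate(self) -> str:
        upload_id = f"upload-{self._next_id}"
        self._next_id += 1
        self.parts[upload_id] = {}
        return upload_id

    def put_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        if part_number == self.fail_on_part:
            raise OSError(f"simulated failure on part {part_number}")
        self.parts[upload_id][part_number] = data
        # Stand-in for the ETag a real service would return for this part.
        return hashlib.sha256(data).hexdigest()

    def complete(self, upload_id: str, etags: Dict[int, str]) -> None:
        stored = self.parts.pop(upload_id)
        if set(stored) != set(etags):
            raise ValueError("ETag list does not match the uploaded parts")
        self.completed[upload_id] = b"".join(stored[n] for n in sorted(stored))

    def abort(self, upload_id: str) -> None:
        self.parts.pop(upload_id, None)
        self.aborted.append(upload_id)


def multipart_upload(backend: InMemoryBackend, chunks: List[bytes]) -> str:
    """Initiate, PUT each part, then complete; abort if any step fails."""
    upload_id = backend.initiate()
    etags: Dict[int, str] = {}
    try:
        for part_number, data in enumerate(chunks, start=1):
            etags[part_number] = backend.put_part(upload_id, part_number, data)
        backend.complete(upload_id, etags)
    except Exception:
        # Release server-side state for the partial upload, then re-raise.
        backend.abort(upload_id)
        raise
    return upload_id
```

Aborting inside the `except` branch means a failed part does not leave an orphaned upload behind, which is presumably what "robust cleanup" refers to.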

linear bot commented Sep 27, 2025

@sbassam sbassam force-pushed the feature/multipart-upload branch 3 times, most recently from db7db93 to 36a5dd9 September 29, 2025 18:02
@sbassam sbassam requested a review from zainhas September 29, 2025 18:13
Comment on lines 268 to 270
# Mock server scenario - return mock values for testing
if response.status_code == 200:
    return "https://mock-upload-url.com", "mock-file-id"

Let's mock only inside the testing code, never in the actual implementation
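
One way to follow this suggestion, sketched with `unittest.mock`: the production function contains no mock branch, and the test injects a fake response instead. The function signature, the injected `http_post` callable, the 302 status, and the `X-File-Id` header name are all assumptions made for illustration; they are not taken from the PR.

```python
from typing import Callable, Tuple
from unittest import mock


def get_upload_url(http_post: Callable[[str], object]) -> Tuple[str, str]:
    """Hypothetical production code: handles only real responses."""
    response = http_post("/files")
    if response.status_code != 302:
        raise RuntimeError(f"unexpected status {response.status_code}")
    # Header names are assumed for this sketch.
    return response.headers["Location"], response.headers["X-File-Id"]


def test_get_upload_url_with_mocked_transport() -> None:
    """The mock values live in the test, not in the implementation."""
    fake_response = mock.Mock(
        status_code=302,
        headers={
            "Location": "https://mock-upload-url.com",
            "X-File-Id": "mock-file-id",
        },
    )
    http_post = mock.Mock(return_value=fake_response)

    assert get_upload_url(http_post) == (
        "https://mock-upload-url.com",
        "mock-file-id",
    )
    http_post.assert_called_once_with("/files")
```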

f"Unsupported file extension: '{file.suffix}'. Supported extensions: .jsonl, .parquet, .csv"
)

def _calculate_parts(self, file_size: int) -> Tuple[int, int]:

This seems duplicated across async and non-async file managers

We can consider making it a standalone function and reusing it in both managers; it uses no instance state.
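
A rough sketch of what a shared, stateless helper could look like. The PR does not show the body of `_calculate_parts`, so the splitting rule and both default constants below are assumptions; only the `(part_size, num_parts)` return shape comes from the quoted code.

```python
from typing import Tuple

# Assumed defaults for illustration; the PR's real constants are not shown.
TARGET_PART_SIZE_BYTES = 250 * 1024 * 1024
MAX_PARTS = 250


def calculate_parts(
    file_size: int,
    target_part_size: int = TARGET_PART_SIZE_BYTES,
    max_parts: int = MAX_PARTS,
) -> Tuple[int, int]:
    """Return (part_size, num_parts) for a file of `file_size` bytes."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Integer ceiling division avoids float rounding on very large files.
    num_parts = min(max_parts, -(-file_size // target_part_size))
    part_size = -(-file_size // num_parts)
    return part_size, num_parts
```

Because the function takes everything it needs as arguments, both the sync and async managers could import and call it directly.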

file_size = os.stat(file.as_posix()).st_size
file_size_gb = file_size / NUM_BYTES_IN_GB

if file_size_gb > MAX_FILE_SIZE_GB:

Is this check a duplicate of the one in the get_upload_url function? Can we keep just one?
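
If the two checks are indeed the same, one option is a single helper that both call sites share. This is only a sketch: `validate_file_size` is a hypothetical name, `NUM_BYTES_IN_GB` is assumed to be 2**30, and `ValueError` stands in for whatever exception type the SDK raises. `MAX_FILE_SIZE_GB = 25.0` matches the value stated in the PR summary.

```python
NUM_BYTES_IN_GB = 2**30  # assumption: binary gigabytes
MAX_FILE_SIZE_GB = 25.0  # value stated in the PR summary


def validate_file_size(file_size: int) -> None:
    """Raise if the file exceeds the upload limit; otherwise return None."""
    file_size_gb = file_size / NUM_BYTES_IN_GB
    if file_size_gb > MAX_FILE_SIZE_GB:
        raise ValueError(
            f"File size {file_size_gb:.2f} GB exceeds the "
            f"{MAX_FILE_SIZE_GB} GB limit"
        )
```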


return part_size, num_parts

def _get_file_type(self, file: Path) -> str:

This can be a module-level function, or we can make it a static method.
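
For illustration, a module-level version might look like the sketch below. The mapping values and the `ValueError` are assumptions; the supported extensions and the error text are taken from the message quoted earlier in this review.

```python
from pathlib import Path

# Extensions come from the error message in the PR; the returned
# type strings are assumed for this sketch.
_FILE_TYPES = {".jsonl": "jsonl", ".parquet": "parquet", ".csv": "csv"}


def get_file_type(file: Path) -> str:
    """Map a file's extension to its upload type without using instance state."""
    try:
        return _FILE_TYPES[file.suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported file extension: '{file.suffix}'. "
            "Supported extensions: .jsonl, .parquet, .csv"
        ) from None
```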

@mryab mryab requested a review from khaykingleb September 30, 2025 10:33
@blainekasten blainekasten changed the base branch from main to next November 18, 2025 13:32
sbassam and others added 3 commits November 18, 2025 09:15
This implementation adds comprehensive multipart upload functionality to support
large dataset uploads up to 25GB, with automatic routing based on file size.

- **Size-based routing**: Files >5GB automatically use multipart upload
- **Concurrent uploads**: Up to 4 concurrent parts for optimal performance
- **Progress tracking**: Real-time progress bars for multipart uploads
- **Error handling**: Robust cleanup and retry mechanisms
- **Integrity verification**: SHA256 hash validation for uploaded files

- **Sync multipart**: MultipartUploadManager for synchronous uploads
- **Async multipart**: AsyncMultipartUploadManager using ThreadPoolExecutor
- **Consistent API**: Same upload_file() interface for both approaches
- **Performance optimized**: Efficient concurrent part uploads
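
The size-based routing and bounded concurrency described in this commit message can be sketched as follows. The function names are hypothetical, the threshold assumes a binary gigabyte, and `put_part` is a caller-supplied callable standing in for the real part upload; none of this is the SDK's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Values from the commit message; a GB is assumed to be 2**30 bytes.
MULTIPART_THRESHOLD_BYTES = 5 * 2**30
MAX_CONCURRENT_PARTS = 4


def choose_strategy(file_size: int) -> str:
    """Files larger than the threshold go multipart; others use a single PUT."""
    return "multipart" if file_size > MULTIPART_THRESHOLD_BYTES else "single"


def upload_parts_concurrently(
    parts: List[Tuple[int, bytes]],
    put_part: Callable[[int, bytes], str],
    max_workers: int = MAX_CONCURRENT_PARTS,
) -> Dict[int, str]:
    """Upload (part_number, data) pairs on a bounded thread pool.

    Returns a mapping of part number to the ETag returned by `put_part`.
    Any exception raised by a part upload propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(put_part, n, data): n for n, data in parts}
        return {n: future.result() for future, n in futures.items()}
```

Keying the result by part number keeps the ETags matched to their parts regardless of the order in which the uploads finish.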
@blainekasten blainekasten force-pushed the feature/multipart-upload branch from ad4944f to 6fb42a1 November 18, 2025 15:17
@blainekasten blainekasten merged commit 91cffde into next Nov 18, 2025
10 checks passed
blainekasten added a commit that referenced this pull request Nov 18, 2025
* chore(api): Remove auto-generated files upload API to support custom coded version

* feat(api): api update

* feat(api): file upload method signature and functionality match previ… (#174)

* feat(api): file upload method signature and functionality match previous version

* disable test

* fix locks

* ENG-41215: Client Multipart Upload implementation (together-py) (#159)

* feat: Add multipart upload support for large files up to 25GB

* Update codebase after recent changes

* reduce duplicated code in sync/async upload managers

---------

Co-authored-by: Blaine Kasten <blainekasten@gmail.com>

* release: 0.1.0-alpha.28

---------

Co-authored-by: stainless-app[bot] <142633134+stainless-app[bot]@users.noreply.github.com>
Co-authored-by: Blaine Kasten <blainekasten@gmail.com>
Co-authored-by: Soroush <soroush.bassam@gmail.com>