Support additional configuration parameters for S3 transfer #4359

@sserdyukov

Problem
When S3 is used as a DVC remote, boto3 also reads the ~/.aws/config file that the AWS CLI normally uses. Not all parameters in this file currently take effect in DVC. For instance, in the configuration below, the last line, multipart_threshold = 500MB, is ignored by DVC, while the preceding line, signature_version = s3, is applied when pushing data to S3:

[default]
output = json
s3 =
  signature_version = s3
  multipart_threshold = 500MB

Suggestion
Support the optional but useful parameters from ~/.aws/config, as described in the AWS CLI S3 documentation, so that users can tune S3 transfer performance or unlock advanced configuration capabilities.

These parameters should be passed to TransferConfig:

  • max_concurrent_requests - The maximum number of concurrent requests.
  • max_queue_size - The maximum number of tasks in the task queue.
  • multipart_threshold - The size threshold the CLI uses for multipart transfers of individual files.
  • multipart_chunksize - When using multipart transfers, this is the chunk size that the CLI uses for multipart transfers of individual files.
  • max_bandwidth - The maximum bandwidth that will be consumed for uploading and downloading data to and from Amazon S3.
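To illustrate what passing these parameters to TransferConfig might involve, here is a minimal sketch that translates string values from the s3 section into keyword arguments for boto3.s3.transfer.TransferConfig. The helper names (parse_size, parse_rate, transfer_kwargs) are hypothetical and not part of DVC or boto3. The sketch assumes sizes are 1024-based, as in the AWS CLI, and it leaves max_queue_size unmapped because boto3's TransferConfig does not appear to expose a direct equivalent.

```python
import re

# Assumption: units are binary multiples (1MB = 1024**2 bytes), as in the AWS CLI.
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([A-Za-z]*)\s*$")


def parse_size(value: str) -> int:
    """Convert a string such as '500MB' or '1024' to a number of bytes."""
    match = _SIZE_RE.match(value)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def parse_rate(value: str) -> int:
    """Convert a string such as '50MB/s' to bytes per second."""
    text = value.strip()
    if not text.lower().endswith("/s"):
        raise ValueError(f"unrecognised rate: {value!r}")
    return parse_size(text[:-2])


def transfer_kwargs(s3_section: dict) -> dict:
    """Hypothetical helper: map ~/.aws/config names to TransferConfig keywords."""
    kwargs = {}
    if "max_concurrent_requests" in s3_section:
        kwargs["max_concurrency"] = int(s3_section["max_concurrent_requests"])
    if "multipart_threshold" in s3_section:
        kwargs["multipart_threshold"] = parse_size(s3_section["multipart_threshold"])
    if "multipart_chunksize" in s3_section:
        kwargs["multipart_chunksize"] = parse_size(s3_section["multipart_chunksize"])
    if "max_bandwidth" in s3_section:
        kwargs["max_bandwidth"] = parse_rate(s3_section["max_bandwidth"])
    # max_queue_size is deliberately not mapped in this sketch.
    return kwargs
```

The resulting dictionary could then be used as TransferConfig(**transfer_kwargs(section)) wherever DVC builds its upload and download calls.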

If reasonable and feasible, also support the following parameters:

  • use_accelerate_endpoint - Use the Amazon S3 Accelerate endpoint for all s3 and s3api commands. You must first enable S3 Accelerate on your bucket before attempting to use the endpoint. This is mutually exclusive with the use_dualstack_endpoint option.
  • use_dualstack_endpoint - Use the Amazon S3 dual IPv4 / IPv6 endpoint for all s3 and s3api commands. This is mutually exclusive with the use_accelerate_endpoint option.
  • addressing_style - Specifies which addressing style to use. This controls whether the bucket name is in the hostname or part of the URL. Valid values are: path, virtual, and auto. The default value is auto.
  • payload_signing_enabled - Refers to whether or not to SHA256 sign sigv4 payloads. By default, this is disabled for streaming uploads (UploadPart and PutObject) when using https.
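As a rough sketch of how these client-level options might be validated before use, the hypothetical helper below converts the string values and enforces the mutual exclusivity described above. The function name and the rule that only the string "true" counts as true are assumptions made for illustration. The result is shaped like the s3 dictionary accepted by botocore.config.Config, although which keys a given botocore version honours there should be checked against its documentation.

```python
_ADDRESSING_STYLES = ("path", "virtual", "auto")
_BOOLEAN_OPTIONS = (
    "use_accelerate_endpoint",
    "use_dualstack_endpoint",
    "payload_signing_enabled",
)


def as_bool(value: str) -> bool:
    """Assumption: only the string 'true' (case-insensitive) means True."""
    return value.strip().lower() == "true"


def s3_client_options(s3_section: dict) -> dict:
    """Hypothetical helper: validate and convert client-level s3 options."""
    options = {}
    for name in _BOOLEAN_OPTIONS:
        if name in s3_section:
            options[name] = as_bool(s3_section[name])

    # The two endpoint options are documented as mutually exclusive.
    if options.get("use_accelerate_endpoint") and options.get("use_dualstack_endpoint"):
        raise ValueError(
            "use_accelerate_endpoint and use_dualstack_endpoint are mutually exclusive"
        )

    style = s3_section.get("addressing_style")
    if style is not None:
        style = style.strip()
        if style not in _ADDRESSING_STYLES:
            raise ValueError(f"invalid addressing_style: {style!r}")
        options["addressing_style"] = style
    return options
```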

These values must be set under the top level s3 key in the AWS Config File, which has a default location of ~/.aws/config. Below is an example configuration:

[profile development]
aws_access_key_id=foo
aws_secret_access_key=bar
s3 =
  max_concurrent_requests = 20
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB
  max_bandwidth = 50MB/s
  use_accelerate_endpoint = true
  addressing_style = path
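For illustration, the nested s3 block in a file like the one above can be read with Python's standard configparser, which treats the indented lines as a multi-line value of the s3 key. The sketch below is not how DVC or botocore necessarily do it; read_s3_section and EXAMPLE are hypothetical names, and the example text is a shortened copy of the configuration above.

```python
import configparser

# Shortened copy of the example configuration, used only for demonstration.
EXAMPLE = """\
[profile development]
s3 =
  max_concurrent_requests = 20
  multipart_threshold = 64MB
  addressing_style = path
"""


def read_s3_section(config_text: str, section: str) -> dict:
    """Return the nested 's3' settings of one section as a dict of strings."""
    # RawConfigParser avoids '%' interpolation, which config values may contain.
    parser = configparser.RawConfigParser()
    parser.read_string(config_text)
    if not parser.has_option(section, "s3"):
        return {}

    nested = {}
    # The indented sub-keys arrive as one multi-line string; split it back up.
    for line in parser.get(section, "s3").splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        nested[key.strip()] = value.strip()
    return nested


if __name__ == "__main__":
    print(read_s3_section(EXAMPLE, "profile development"))
```

All values come back as strings, so sizes such as 64MB and booleans such as true still need to be converted before they are handed to boto3.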

Labels

feature request (Requesting a new feature), help wanted, p2-medium (Medium priority, should be done, but less important)