
Handle large (>5GB) files in Data Pipeline #25

Open
arifwider opened this issue Dec 23, 2017 · 0 comments

Comments

@arifwider
Collaborator

When handling files larger than 5GB, a couple of issues appear:

Creating such a file in the local file system (e.g. when using boto.download_file(), or when preparing to use boto.upload_file()) causes an [Errno 28] No space left on device OS error, no matter how much space is actually left on disk.
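For reference, a minimal sketch of the local-file approach that runs into this (bucket, key, and paths are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key/path; writing the >5GB object to local disk is
# where the [Errno 28] No space left on device error shows up.
s3.download_file("my-bucket", "data/large_file.csv", "/tmp/large_file.csv")

# The same applies when a >5GB file has to be written locally first so
# that it can be uploaded with upload_file():
s3.upload_file("/tmp/large_file.csv", "my-bucket", "data/large_file.csv")
```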

The alternative is to stream data directly from S3 into a pandas DataFrame and vice versa, e.g. using boto.put_object(). However, when uploading more than 5GB as a single file to S3, the following error occurs:
botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when calling the PutObject operation: Your proposed upload exceeds the maximum allowed size
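A minimal sketch of the streaming approach, with placeholder data and bucket/key names; the put_object() call is the one that fails with EntityTooLarge once the serialized payload exceeds 5GB:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# The frame is serialized in memory and uploaded in a single PutObject call,
# so nothing touches the local disk, but PutObject is capped at 5GB and
# raises EntityTooLarge beyond that.
df = pd.DataFrame({"value": range(10)})
buffer = io.StringIO()
df.to_csv(buffer, index=False)
s3.put_object(Bucket="my-bucket", Key="data/large_file.csv",
              Body=buffer.getvalue().encode("utf-8"))

# Reading back without a local file works the same way:
obj = s3.get_object(Bucket="my-bucket", Key="data/large_file.csv")
df_roundtrip = pd.read_csv(io.BytesIO(obj["Body"].read()))
```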

This can be prevented with multipart uploads/downloads by passing a TransferConfig to upload_file() or upload_fileobj(). However, that in turn leads back to the local file system issues described above.
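A sketch of how a TransferConfig might be passed to enable multipart transfers (part size and concurrency here are assumptions, not values from this project); note that upload_file()/download_file() still take a local path, which is where the disk-space problem reappears:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Part size and concurrency are assumptions. Anything above the threshold
# is transferred as a multipart upload/download, which sidesteps the 5GB
# PutObject limit.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=4,
)

# upload_file()/download_file() take a local path, so the >5GB file still
# has to exist on the local file system, reintroducing the Errno 28 issue.
s3.upload_file("/tmp/large_file.csv", "my-bucket", "data/large_file.csv", Config=config)
s3.download_file("my-bucket", "data/large_file.csv", "/tmp/large_file.csv", Config=config)
```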
