To practice multipart upload with a roughly 150 MB Parquet file, you can split it and upload it in multiple parts, even though a file this small does not require it. The steps below do that with Boto3.

In [1]:
import boto3
import os

s3 = boto3.client('s3')
# bucket_name = 'my-bucket-name'
# file_path = 'myfile.parquet'
# s3_key = 'uploaded/myfile.parquet'  # Path in S3


In [4]:
s3.list_buckets()["Buckets"][0]['Name']

'nyc-taxi-data-pipeline'

In [5]:
path = 's3://nyc-taxi-data-pipeline/multipart-parquet/'
bucket_name = 'nyc-taxi-data-pipeline'
file_path = '/mount_folder/alpha/multipart_s3_upload/part-00000-2224c996-15d6-400a-8ae4-2d0740e74c18.c000.gz.parquet'
s3_key = 'multipart-parquet/Multipart.parquet'

In [6]:
response = s3.create_multipart_upload(Bucket=bucket_name, Key=s3_key)
upload_id = response['UploadId']
print(f"Upload ID: {upload_id}")


Upload ID: nX8s_GJnmoxp_DkU.Xl0V1S4PLXKjjo_iqePeo81VQOvNJAqNZ129uwJ63C8MPdpaaPyPzRvZlS5hbdWVLo.AxRZYbaMTu7aLGzlYxAvZOFRTCdPHX0GHTnyn6g2UDNE


In [7]:
chunk_size = 50 * 1024 * 1024  # 50MB per part
parts = []

with open(file_path, 'rb') as f:
    part_number = 1
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        response = s3.upload_part(
            Bucket=bucket_name,
            Key=s3_key,
            PartNumber=part_number,
            UploadId=upload_id,
            Body=data
        )
        parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
        print(f"Uploaded Part {part_number}")
        part_number += 1


Uploaded Part 1
Uploaded Part 2
Uploaded Part 3


In [8]:
parts

[{'ETag': '"048b48651b16bfc00b73d4c3d9638f29"', 'PartNumber': 1},
 {'ETag': '"771e16630fbed987bd913ea456b714b0"', 'PartNumber': 2},
 {'ETag': '"b4c187729ee3feeeeb9277c79e2be891"', 'PartNumber': 3}]

In [10]:
s3.complete_multipart_upload(
    Bucket=bucket_name,
    Key=s3_key,
    UploadId=upload_id,
    MultipartUpload={'Parts': parts}
)
print("Multipart Upload Completed!")


Multipart Upload Completed!


After the `complete_multipart_upload` API is called, S3 **reassembles** the parts, in part-number order, into a single object, so the uploaded object in **S3** is the **same as the original file**.
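
One thing the step-by-step cells do not cover is failure handling: if `upload_part` or `complete_multipart_upload` raises, the parts already uploaded stay in S3 (and are billed) until the upload is aborted. The following is a minimal sketch of the same create, upload, complete sequence with cleanup. The function name `multipart_upload` and its `s3_client` parameter are illustrative and not part of the notebook; `abort_multipart_upload` is the Boto3 client call that discards an unfinished upload.

```python
def multipart_upload(s3_client, bucket, key, file_path, chunk_size=50 * 1024 * 1024):
    """Upload file_path in parts and abort the upload if any step fails.

    s3_client is assumed to be a Boto3 S3 client, e.g. boto3.client('s3').
    """
    upload_id = s3_client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        with open(file_path, "rb") as f:
            part_number = 1
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                response = s3_client.upload_part(
                    Bucket=bucket,
                    Key=key,
                    PartNumber=part_number,
                    UploadId=upload_id,
                    Body=data,
                )
                parts.append({"ETag": response["ETag"], "PartNumber": part_number})
                part_number += 1
        s3_client.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Discard the parts stored so far, then re-raise the original error.
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return parts
```

With the notebook's variables this might be called as `multipart_upload(s3, bucket_name, s3_key, file_path)`.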

However, if you want to **verify** the upload:
- **Download the file from S3** and compare it with the original.
- **Check the file size and content** after downloading.

### **Verify by Downloading the File**
```python
s3.download_file(bucket_name, s3_key, 'downloaded_myfile.parquet')

# Check file size
import os
print("Original file size:", os.path.getsize(file_path))
print("Downloaded file size:", os.path.getsize('downloaded_myfile.parquet'))
```

If the sizes match, no bytes were lost or added during the upload. Matching sizes alone do not prove that the contents are identical; comparing checksums of the two files is a stronger check.
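
A content-level comparison can be sketched with the standard library's `hashlib`. The helper names `sha256_of_file` and `files_identical` are made up for this example; the files are read in blocks so that a large Parquet file is never held in memory all at once.

```python
import hashlib

def sha256_of_file(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file)
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def files_identical(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

For the download above, the check would be `files_identical(file_path, 'downloaded_myfile.parquet')`.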


In [11]:
s3.download_file(bucket_name, s3_key, 'Multipart.parquet')

# Check file size
import os
print("Original file size:", os.path.getsize(file_path))
print("Downloaded file size:", os.path.getsize('Multipart.parquet'))


Original file size: 139528869
Downloaded file size: 139528869
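
The stored size can also be read without downloading the object, using the Boto3 client call `head_object`, whose response includes `ContentLength` and `ETag`. The sketch below wraps that call; the helper name `object_summary` is hypothetical, and reading a part count from a `-N` suffix on the ETag is an assumption based on commonly observed S3 behavior for multipart uploads, not something this notebook's output shows.

```python
def object_summary(s3_client, bucket, key):
    """Return size, ETag and (if detectable) part count of an S3 object.

    s3_client is assumed to be a Boto3 S3 client, e.g. boto3.client('s3').
    """
    head = s3_client.head_object(Bucket=bucket, Key=key)
    etag = head["ETag"].strip('"')
    # Assumption: an ETag ending in "-N" marks a multipart upload made of N parts.
    _, sep, suffix = etag.rpartition("-")
    part_count = int(suffix) if sep and suffix.isdigit() else None
    return {"size": head["ContentLength"], "etag": etag, "part_count": part_count}
```

Comparing `object_summary(s3, bucket_name, s3_key)["size"]` with `os.path.getsize(file_path)` gives the same size check as the download, minus the transfer.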


# Merge two files into one and upload the result as a multipart upload

In [12]:
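import os
import shutil

# Define input and output files
file1 = '/mount_folder/alpha/multipart_s3_upload/part-00000-2224c996-15d6-400a-8ae4-2d0740e74c18.c000.gz.parquet'
file2 = '/mount_folder/alpha/multipart_s3_upload/part-00002-6409130e-6e9b-44d1-90ec-9f9d0ade2504.c000.gz.parquet'
merged_file = 'merged_file.parquet'

# Merge files by raw byte concatenation. This only builds a larger test payload
# for the upload; the result is not expected to be readable as one Parquet file,
# because each input keeps its own footer metadata.
with open(merged_file, 'wb') as outfile:
    for fname in [file1, file2]:
        with open(fname, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)  # stream instead of loading the whole file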

# Check file size after merging
print(f"Merged file size: {os.path.getsize(merged_file)} bytes")


Merged file size: 277963098 bytes


In [13]:
path = 's3://nyc-taxi-data-pipeline/multipart-parquet/'
bucket_name = 'nyc-taxi-data-pipeline'
file_path = '/mount_folder/alpha/multipart_s3_upload/merged_file.parquet'
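s3_key = 'multipart-parquet/Multipart.parquet'  # same key as the first upload, so this replaces that object (or adds a new version if bucket versioning is on)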

response = s3.create_multipart_upload(Bucket=bucket_name, Key=s3_key)
upload_id = response['UploadId']
print(f"Upload ID: {upload_id}")

chunk_size = 50 * 1024 * 1024  # 50MB per part
parts = []

with open(file_path, 'rb') as f:
    part_number = 1
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        response = s3.upload_part(
            Bucket=bucket_name,
            Key=s3_key,
            PartNumber=part_number,
            UploadId=upload_id,
            Body=data
        )
        parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
        print(f"Uploaded Part {part_number}")
        part_number += 1
        
s3.complete_multipart_upload(
    Bucket=bucket_name,
    Key=s3_key,
    UploadId=upload_id,
    MultipartUpload={'Parts': parts}
)
print("Multipart Upload Completed!")


Upload ID: QvG.mpVegbgmA3a.Owbz8d3PrD5.F3tAAhzV2y6r3MgQNu6pLrRo1irs3Sj74TzT2ARP.IpAfWu1NwBshivbed9VdOe.TnE_S_wrZy0EoLwgWMG52awPEheUQfUSiLdI
Uploaded Part 1
Uploaded Part 2
Uploaded Part 3
Uploaded Part 4
Uploaded Part 5
Uploaded Part 6
Multipart Upload Completed!
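
The part counts printed in the two runs (3 and 6) follow from the file sizes and the 50 MB chunk size. The sketch below reproduces them with ceiling division and also checks two S3 limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. The function name `count_parts` and the constant names are illustrative only.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # every part except the last must be at least 5 MiB
MAX_PARTS = 10_000               # S3 allows at most 10,000 parts per upload

def count_parts(file_size, chunk_size):
    """Return how many parts a file of file_size bytes needs at chunk_size bytes per part."""
    parts = (file_size + chunk_size - 1) // chunk_size  # ceiling division
    if parts > 1 and chunk_size < MIN_PART_SIZE:
        raise ValueError("chunk_size is below the 5 MiB minimum for non-final parts")
    if parts > MAX_PARTS:
        raise ValueError("more than 10,000 parts needed; use a larger chunk_size")
    return parts

chunk_size = 50 * 1024 * 1024
print(count_parts(139_528_869, chunk_size))  # 3, the single-file upload
print(count_parts(277_963_098, chunk_size))  # 6, the merged-file upload
```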
