### Upload Train and Validation Text Documents to AWS S3 for LLM Fine-Tuning with SageMaker

### Create an AWS S3 Bucket
Amazon S3 (Simple Storage Service) provides scalable object storage and is a foundational element of data storage on AWS. The steps below create an S3 bucket and upload the training files to it.

Prerequisites: ensure you have an AWS account and credentials with the necessary S3 permissions (for example, configured through the AWS CLI). This walkthrough uses the session credentials provided by an AWS SageMaker notebook.

### Step 1: Initialize Boto3 and S3 Client

We start by importing boto3 and setting up an S3 client. Boto3 is the AWS SDK for Python, which allows Python developers to write software that uses AWS services.


In [3]:
import boto3

aws_session = boto3.Session()
s3_client = aws_session.client('s3')
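
Before creating any resources, it can help to confirm which identity and region the session actually resolved to. The following is a minimal sketch, assuming the notebook's role is allowed to call STS; `describe_session` is a hypothetical helper name, not part of boto3.

```python
def describe_session(session):
    """Return the account, ARN, and region that a boto3 session resolves to.

    `session` is expected to be a boto3.Session (or any object with the same
    `client()` method and `region_name` attribute).
    """
    # STS get_caller_identity reports who the current credentials belong to.
    identity = session.client("sts").get_caller_identity()
    return {
        "account": identity["Account"],
        "arn": identity["Arn"],
        "region": session.region_name,
    }
```

In the notebook this could be called as `describe_session(aws_session)`. A `region` of `None` would suggest that no default region is configured for the session.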

### Step 2: Define and Create Your Bucket

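We then define a name for our bucket and create it. S3 bucket names must be globally unique, so choose a name that no other AWS account is likely to have taken. Note that calling create_bucket with only a bucket name works in us-east-1; in any other region the call must also pass a CreateBucketConfiguration with a matching LocationConstraint.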
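
A bucket name also has to follow S3's naming rules, and a quick local check can catch obvious mistakes before the API call. The sketch below covers only a subset of those rules (length, allowed characters, no consecutive dots, not shaped like an IP address); `looks_like_valid_bucket_name` is a hypothetical helper, and passing it does not mean the name is still available.

```python
import re

# 3 to 63 characters: lowercase letters, digits, dots, and hyphens,
# starting and ending with a letter or digit.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
# Names formatted like an IPv4 address are not allowed.
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

def looks_like_valid_bucket_name(name):
    """Check a candidate name against a subset of the S3 naming rules."""
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:
        return False
    if _IP_RE.fullmatch(name):
        return False
    return True
```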

In [9]:

bucket_name = 'rany-domain-adaptation-training-sagemaker'
s3_client.create_bucket(Bucket=bucket_name)
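
The call above relies on the session's region being us-east-1. Elsewhere, S3 expects the region to be named in a `CreateBucketConfiguration`. The following sketch builds the arguments for either case; `create_bucket_kwargs` is a hypothetical helper name introduced here for illustration.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3_client.create_bucket in a given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a
    # LocationConstraint; every other region has to be named explicitly.
    if region and region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs
```

In the notebook this could be used as `s3_client.create_bucket(**create_bucket_kwargs(bucket_name, aws_session.region_name))`.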

### Step 3: Upload Files to S3

The following code demonstrates how to upload files. We use upload_fileobj, which uploads from a file-like object opened in binary mode; this is why each file is opened with 'rb' before being passed to the client.

In [11]:
from botocore.exceptions import ClientError

# Upload the train file
train_object_name = 'domain_adaptation/gptj6b/train/train.txt'
train_file_name = 'train.txt'
try:
    with open(train_file_name, 'rb') as file:
        s3_client.upload_fileobj(file, bucket_name, train_object_name)  # Takes the open file object, not the file name
    print(f"'{train_file_name}' has been uploaded to '{bucket_name}' as '{train_object_name}'.")
except ClientError as e:
    print(f"Error: {e}")
except FileNotFoundError:
    print("The file was not found. Please check the file path.")

'train.txt' has been uploaded to 'rany-domain-adaptation-training-sagemaker' as 'domain_adaptation/gptj6b/train/train.txt'.
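
Because upload_fileobj accepts any binary file-like object, the same call also works for text that is already in memory and was never written to disk. A minimal sketch, where `upload_text` is a hypothetical helper and `s3_client` is assumed to be a boto3 S3 client:

```python
import io

def upload_text(s3_client, bucket, key, text, encoding="utf-8"):
    """Upload a string as an S3 object and return the number of bytes sent."""
    # upload_fileobj needs a binary file-like object, so encode and wrap the text.
    data = text.encode(encoding)
    s3_client.upload_fileobj(io.BytesIO(data), bucket, key)
    return len(data)
```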


In [12]:
# Upload the validation file
validation_object_name = 'domain_adaptation/gptj6b/validation/validation.txt'
validation_file_name = 'validation.txt'
try:
    with open(validation_file_name, 'rb') as file:
        s3_client.upload_fileobj(file, bucket_name, validation_object_name)  # Takes the open file object, not the file name
    print(f"'{validation_file_name}' has been uploaded to '{bucket_name}' as '{validation_object_name}'.")
except ClientError as e:
    print(f"Error: {e}")
except FileNotFoundError:
    print("The file was not found. Please check the file path.")

'validation.txt' has been uploaded to 'rany-domain-adaptation-training-sagemaker' as 'domain_adaptation/gptj6b/validation/validation.txt'.
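
The two upload cells differ only in the file name and key prefix, so they could be folded into one helper. The sketch below uses upload_file, the path-based counterpart of upload_fileobj, and returns an `s3://` URI for the uploaded object. `upload_split` is a hypothetical name, and keeping the local file name as the last key component is an assumption that matches the keys used above.

```python
from pathlib import Path

def upload_split(s3_client, bucket, local_path, key_prefix):
    """Upload one local file under key_prefix and return its s3:// URI."""
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # Keep the local file name as the last component of the object key.
    key = f"{key_prefix.rstrip('/')}/{path.name}"
    # upload_file takes a path and opens the file itself.
    s3_client.upload_file(str(path), bucket, key)
    return f"s3://{bucket}/{key}"
```

For example, `upload_split(s3_client, bucket_name, 'train.txt', 'domain_adaptation/gptj6b/train')` would target the same key as the first upload cell.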


### Listing the Contents of Your S3 Bucket

Finally, to verify that your files are uploaded, you can list the contents of your bucket.

In [13]:
# Specify the prefix (folder path)
prefix = 'domain_adaptation/gptj6b/'

# List objects within the specified prefix
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

# Check if the 'Contents' key is in the response (it is absent when no object keys match the prefix)
if 'Contents' in response:
    # Iterate through the objects in the response and print their names
    for obj in response['Contents']:
        print(obj['Key'])
else:
    print("No objects found in the specified path.")

domain_adaptation/gptj6b/train/train.txt
domain_adaptation/gptj6b/validation/validation.txt
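
A single list_objects_v2 call returns at most 1,000 keys, which is plenty for the two files here but not for larger datasets. The sketch below follows pagination with boto3's paginator; `list_keys` is a hypothetical helper name, and `s3_client` is assumed to be a boto3 S3 client.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Yield every object key under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from pages with no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

In the notebook, `for key in list_keys(s3_client, bucket_name, prefix): print(key)` would print the same keys as the cell above.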
