### Upload Training and Validation Text Documents to AWS S3 for LLM Fine-Tuning with SageMaker

#### Create an AWS S3 Bucket
Amazon S3 (Simple Storage Service) provides scalable object storage and is a foundational element for data storage in AWS. The cells below create an S3 bucket and upload the training and validation files to it.

#### Prerequisites
You need an AWS account and credentials with permission to create buckets and upload objects. This walkthrough runs in an AWS SageMaker notebook and uses the session credentials the notebook provides.

#### Step 1: Initialize Boto3 and the S3 Client
We start by importing boto3 and creating an S3 client. Boto3 is the AWS SDK for Python, which lets Python code call AWS services.

In [3]:
import boto3

aws_session = boto3.Session()
s3_client = aws_session.client('s3')

In [9]:

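# Bucket names are globally unique. This call works as written in us-east-1;
# other regions also require a CreateBucketConfiguration with a LocationConstraint.
bucket_name = 'rany-domain-adaptation-training-sagemaker'
s3_client.create_bucket(Bucket=bucket_name)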
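
The `create_bucket` call above passes only the bucket name, which is accepted in `us-east-1`. In any other region S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below builds the keyword arguments for either case. The helper name `build_create_bucket_kwargs` and the bucket name `example-bucket` are hypothetical and used only for illustration.

```python
def build_create_bucket_kwargs(bucket_name, region_name):
    """Return keyword arguments for s3_client.create_bucket in the given region.

    Hypothetical helper for illustration; it only builds a dict and makes no AWS call.
    """
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region_name and region_name != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    return kwargs


print(build_create_bucket_kwargs("example-bucket", "us-east-1"))
print(build_create_bucket_kwargs("example-bucket", "eu-west-1"))
```

In the notebook this could be used as `s3_client.create_bucket(**build_create_bucket_kwargs(bucket_name, aws_session.region_name))`, assuming the session has a region configured.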

In [11]:
from botocore.exceptions import ClientError

# Upload the train file
train_object_name = 'domain_adaptation/gptj6b/train/train.txt'
train_file_name = 'train.txt'
try:
    with open(train_file_name, 'rb') as file:
        s3_client.upload_fileobj(file, bucket_name, train_object_name)  # upload_fileobj expects an open binary file object, not a path
    print(f"'{train_file_name}' has been uploaded to '{bucket_name}' as '{train_object_name}'.")
except ClientError as e:
    print(f"Error: {e}")
except FileNotFoundError:
    print("The file was not found. Please check the file path.")

'train.txt' has been uploaded to 'rany-domain-adaptation-training-sagemaker' as 'domain_adaptation/gptj6b/train/train.txt'.


In [12]:
# Upload the validation file
validation_object_name = 'domain_adaptation/gptj6b/validation/validation.txt'
validation_file_name = 'validation.txt'
try:
    with open(validation_file_name, 'rb') as file:
        s3_client.upload_fileobj(file, bucket_name, validation_object_name)  # upload_fileobj expects an open binary file object, not a path
    print(f"'{validation_file_name}' has been uploaded to '{bucket_name}' as '{validation_object_name}'.")
except ClientError as e:
    print(f"Error: {e}")
except FileNotFoundError:
    print("The file was not found. Please check the file path.")

'validation.txt' has been uploaded to 'rany-domain-adaptation-training-sagemaker' as 'domain_adaptation/gptj6b/validation/validation.txt'.
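
The train and validation cells repeat the same upload logic. One possible way to fold them into a single loop is sketched below. The function name `upload_files` and its `files` mapping are assumptions, not part of the original notebook. To stay self-contained the sketch handles only `FileNotFoundError`; in the notebook you would likely keep the `ClientError` handler as well.

```python
def upload_files(s3_client, bucket_name, files):
    """Upload each local file to its S3 key and return the keys that were uploaded.

    `files` maps a local path to the destination object key.
    `s3_client` is any object with a boto3-style upload_fileobj(fileobj, bucket, key).
    """
    uploaded = []
    for local_path, object_key in files.items():
        try:
            # upload_fileobj needs a file object opened in binary mode.
            with open(local_path, "rb") as file_obj:
                s3_client.upload_fileobj(file_obj, bucket_name, object_key)
        except FileNotFoundError:
            print(f"'{local_path}' was not found. Please check the file path.")
            continue
        uploaded.append(object_key)
        print(f"'{local_path}' has been uploaded to '{bucket_name}' as '{object_key}'.")
    return uploaded
```

With the names from the cells above, the call would look like `upload_files(s3_client, bucket_name, {train_file_name: train_object_name, validation_file_name: validation_object_name})`.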


In [13]:
# Specify the prefix (folder path)
prefix = 'domain_adaptation/gptj6b/'

# List objects within the specified prefix
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

# Check if 'Contents' key is in the response (it's not present if the directory is empty or does not exist)
if 'Contents' in response:
    # Iterate through the objects in the response and print their names
    for obj in response['Contents']:
        print(obj['Key'])
else:
    print("No objects found in the specified path.")

domain_adaptation/gptj6b/train/train.txt
domain_adaptation/gptj6b/validation/validation.txt
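
A single `list_objects_v2` call returns at most 1,000 keys, which is plenty for the two files here. If a prefix ever holds more objects, a paginator can walk all pages. The following is a minimal sketch, where `list_keys` is a hypothetical helper name and the client is passed in as an argument.

```python
def list_keys(s3_client, bucket_name, prefix):
    """Return every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent from a page when no objects match the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

In the notebook, `list_keys(s3_client, bucket_name, prefix)` would be expected to return the same two keys printed above.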


In [16]:
from sagemaker import image_uris

training_instance_type = "ml.g5.12xlarge"
model_id, model_version = "huggingface-textgeneration1-gpt-j-6b", "1.*"


# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

Using model 'huggingface-textgeneration1-gpt-j-6b' with wildcard version identifier '1.*'. You can pin to version '1.3.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


In [18]:
print(train_image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04


In [19]:
from sagemaker import script_uris

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)


print(train_source_uri)

s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/huggingface/transfer_learning/textgeneration1/prepack/v1.2.0/sourcedir.tar.gz
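
To download the script with Boto3, the `s3://` URI has to be split into a bucket name and an object key. Rather than copying the two parts by hand, they can be derived from the URI. The sketch below uses only the standard library; `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/path' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/huggingface/"
    "transfer_learning/textgeneration1/prepack/v1.2.0/sourcedir.tar.gz"
)
print(bucket)
print(key)
```

Calling `split_s3_uri(train_source_uri)` in the notebook would yield the bucket and key used in the download cell.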


In [20]:
import boto3
import tarfile

# Initialize the S3 client
s3_client = boto3.client('s3')

# Specify the JumpStart bucket and object key (both taken from train_source_uri).
# Separate variable names keep bucket_name pointing at our own training-data bucket.
script_bucket_name = 'jumpstart-cache-prod-us-east-1'
script_object_key = 'source-directory-tarballs/huggingface/transfer_learning/textgeneration1/prepack/v1.2.0/sourcedir.tar.gz'
local_file_name = 'sourcedir.tar.gz'

# Download the file from S3
s3_client.download_file(script_bucket_name, script_object_key, local_file_name)
print(f"Downloaded {script_object_key} to {local_file_name}")

# Extract the gzipped tarball into the current directory
with tarfile.open(local_file_name, "r:gz") as tar:
    tar.extractall()
print(f"Extracted {local_file_name}")


Downloaded source-directory-tarballs/huggingface/transfer_learning/textgeneration1/prepack/v1.2.0/sourcedir.tar.gz to sourcedir.tar.gz
Extracted sourcedir.tar.gz
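
Before extracting an archive into the working directory, it can be useful to look at what it contains. The sketch below lists the member names without extracting anything and flags names that would land outside the target directory. Both helper names are hypothetical, and the check is a simple illustration rather than a complete safety audit.

```python
import tarfile


def list_tar_members(archive_path):
    """Return the member names of a gzipped tarball without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def unsafe_members(names):
    """Return names that are absolute or contain a '..' path component."""
    return [n for n in names if n.startswith("/") or ".." in n.split("/")]
```

In the notebook, `list_tar_members(local_file_name)` would show the contents of `sourcedir.tar.gz`, and `unsafe_members(...)` returning an empty list is a quick sanity check before calling `extractall()`.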
