Base Model Outline Objective



Train a basic text-to-text model using a pre-trained Hugging Face transformer (e.g., T5-small) on tokenized data to establish a baseline for evaluation.

We start by installing and importing the libraries required for this task.

In [1]:
!pip install -r SIGROPM1/model/sigropm/requirements.txt
!pip install -U sagemaker
!pip install boto3 awscli --upgrade


Collecting botocore<1.36.0,>=1.35.76 (from boto3<2.0,>=1.35.75->sagemaker->-r SIGROPM1/model/sigropm/requirements.txt (line 4))
  Using cached botocore-1.35.76-py3-none-any.whl.metadata (5.7 kB)
Using cached botocore-1.35.76-py3-none-any.whl (13.2 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.162
    Uninstalling botocore-1.34.162:
      Successfully uninstalled botocore-1.34.162
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.13.3 requires botocore<1.34.163,>=1.34.70, but you have botocore 1.35.76 which is incompatible.
Successfully installed botocore-1.35.76


In [2]:
# Core SageMaker libraries
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

# For model training and deployment
from sagemaker.huggingface import HuggingFace
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

# For data preprocessing and handling
import boto3  # AWS SDK for Python
import pandas as pd
import numpy as np

# For managing S3 bucket and files
from sagemaker.s3 import S3Uploader, S3Downloader




sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Next, we load our data. Because the dataset cannot be hosted on GitHub, we upload it to S3 and read it from there in this and subsequent projects.

In [3]:

s3 = boto3.client('s3')
bucket_name = "squad-training-data"  # Use a valid bucket name
region = "us-west-1"

try:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
    )
    print(f"Bucket '{bucket_name}' created successfully.")
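except s3.exceptions.BucketAlreadyExists:
    # The name is taken; S3 bucket names are shared across all accounts.
    print(f"Bucket '{bucket_name}' already exists.")
except Exception as e:
    # BucketAlreadyOwnedByYou (re-running this cell on a bucket you own) is also reported here.
    print(f"Error creating bucket: {e}")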




Error creating bucket: An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
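
The message above is not a real failure: the bucket already exists and belongs to this account. A minimal sketch of a re-runnable variant is shown below. The helper name `ensure_bucket` is hypothetical, and the sketch assumes the client raises botocore-style errors that expose the service error code under `response["Error"]["Code"]`.

```python
def ensure_bucket(client, bucket_name, region):
    """Create the bucket if needed.

    Returns True if the bucket was created and False if this account
    already owns it. Any other error is re-raised unchanged.
    """
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    try:
        client.create_bucket(**kwargs)
        return True
    except Exception as exc:
        # Assumption: botocore's ClientError keeps the error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "BucketAlreadyOwnedByYou":
            return False
        raise
```

In this notebook it could be called as `ensure_bucket(s3, bucket_name, region)`, so that re-running the cell prints nothing alarming while genuine errors (for example, missing permissions) still surface.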


In [4]:
local_data_path = "/home/sagemaker-user/SIGROPM1/data/expanded_training_data.jsonl"
s3_data_key = "datasets/training_data.jsonl"  # Path in S3

try:
    s3.upload_file(local_data_path, bucket_name, s3_data_key)
    print(f"Dataset uploaded to s3://{bucket_name}/{s3_data_key}")
except Exception as e:
    print(f"Error uploading dataset: {e}")


Dataset uploaded to s3://squad-training-data/datasets/training_data.jsonl


In [5]:
!pip install s3fs


Collecting botocore<1.34.163,>=1.34.70 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Using cached botocore-1.34.162-py3-none-any.whl.metadata (5.7 kB)
Using cached botocore-1.34.162-py3-none-any.whl (12.5 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.35.76
    Uninstalling botocore-1.35.76:
      Successfully uninstalled botocore-1.35.76
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.35.76 requires botocore<1.36.0,>=1.35.76, but you have botocore 1.34.162 which is incompatible.
awscli 1.36.17 requires botocore==1.35.76, but you have botocore 1.34.162 which is incompatible.
Successfully installed botocore-1.34.162
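
The two install cells pull botocore in opposite directions: boto3 and awscli want 1.35.76, while aiobotocore (a dependency of s3fs) wants a version below 1.34.163. A quick way to see what actually ended up installed, using only the standard library, might look like this. The helper name `installed_versions` and the package list are illustrative.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages involved in the conflicts reported by pip in this notebook.
for name, version in installed_versions(
    ["boto3", "botocore", "aiobotocore", "awscli", "s3fs"]
).items():
    print(f"{name}: {version or 'not installed'}")
```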


In [6]:
import pandas as pd
import s3fs  # pandas relies on s3fs to read s3:// paths; importing it confirms it is installed

s3_file_path = "s3://squad-training-data/datasets/training_data.jsonl"

# Load JSONL file into a pandas DataFrame
df = pd.read_json(s3_file_path, lines=True)
print(df.head())


                                              prompt  \
0  A television show, TV program, or simply a TV ...   
1  A television show, TV program, or simply a TV ...   
2  A television show, TV program, or simply a TV ...   
3  A television show, TV program, or simply a TV ...   
4  A television show, TV program, or simply a TV ...   

                             squad  
0         171. Season Finale Fans.  
1  383. 'Serial Speculators Squad'  
2    64. Candy Bar Commercial Club  
3            9. Retrovision Rebels  
4     220. Sports Broadcast Buffs.  


In [7]:
import sagemaker
from sagemaker import get_execution_role

# Get the SageMaker execution role
sagemaker_role = get_execution_role()

print(f"SageMaker Role: {sagemaker_role}")


SageMaker Role: arn:aws:iam::022043654838:role/service-role/AmazonSageMaker-ExecutionRole-20241127T173380


In [8]:

import pandas as pd
from datasets import Dataset
# Load JSONL file into a pandas DataFrame
df = pd.read_json(s3_file_path, lines=True)
print("Data Sample:")
print(df.head())

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)
print("Dataset Preview:")
print(dataset)

Data Sample:
                                              prompt  \
0  A television show, TV program, or simply a TV ...   
1  A television show, TV program, or simply a TV ...   
2  A television show, TV program, or simply a TV ...   
3  A television show, TV program, or simply a TV ...   
4  A television show, TV program, or simply a TV ...   

                             squad  
0         171. Season Finale Fans.  
1  383. 'Serial Speculators Squad'  
2    64. Candy Bar Commercial Club  
3            9. Retrovision Rebels  
4     220. Sports Broadcast Buffs.  
Dataset Preview:
Dataset({
    features: ['prompt', 'squad'],
    num_rows: 197990
})


We now proceed to tokenization and data splitting.
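
The tokenization call itself does not appear in the cells that follow, so here is a minimal sketch of what it could look like for the `prompt` and `squad` columns. The function name `preprocess` and the two length limits are assumptions, not values taken from this project; the tokenizer is passed in so the function works with any Hugging Face tokenizer that accepts `text_target`.

```python
def preprocess(batch, tokenizer, max_input_len=512, max_target_len=32):
    """Tokenize a batch of prompt/squad pairs for a text-to-text model.

    `batch` is a dict of lists, as produced by Dataset.map(..., batched=True).
    The length limits are illustrative defaults.
    """
    # Encode the source text (the prompt).
    model_inputs = tokenizer(
        batch["prompt"], max_length=max_input_len, truncation=True
    )
    # Encode the target text (the squad name) and attach it as labels.
    labels = tokenizer(
        text_target=batch["squad"], max_length=max_target_len, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Possible usage with the dataset built above (requires transformers):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("t5-small")
# tokenized = dataset.map(
#     lambda b: preprocess(b, tokenizer),
#     batched=True,
#     remove_columns=["prompt", "squad"],
# )
```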

In [9]:
from transformers import T5ForConditionalGeneration

# Load the model
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Move model to GPU if available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


2024-12-06 10:42:05.467227: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-06 10:42:05.487938: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-06 10:42:05.494000: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-06 10:42:05.508981: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [10]:
import torch
print(f"Pre-installed PyTorch version: {torch.__version__}")


Pre-installed PyTorch version: 2.5.1+cpu


In [11]:
entry_point = "/home/sagemaker-user/SIGROPM1/model/sigropm/train.py"


In [12]:
from sklearn.model_selection import train_test_split
import json

# Load the dataset (one JSON object per line)
data_path = "/home/sagemaker-user/SIGROPM1/data/expanded_training_data.jsonl"
with open(data_path, "r") as file:
    data = [json.loads(line) for line in file]

# Split the data into training and validation sets
train_data, validation_data = train_test_split(data, test_size=0.2, random_state=42)

# Save the splits locally
train_data_path = "/home/sagemaker-user/SIGROPM1/data/train_data.jsonl"
validation_data_path = "/home/sagemaker-user/SIGROPM1/data/validation_data.jsonl"

with open(train_data_path, "w") as train_file:
    for entry in train_data:
        json.dump(entry, train_file)
        train_file.write("\n")

with open(validation_data_path, "w") as validation_file:
    for entry in validation_data:
        json.dump(entry, validation_file)
        validation_file.write("\n")

print(f"Train data saved to {train_data_path}")
print(f"Validation data saved to {validation_data_path}")

Train data saved to /home/sagemaker-user/SIGROPM1/data/train_data.jsonl
Validation data saved to /home/sagemaker-user/SIGROPM1/data/validation_data.jsonl
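
One caveat with a row-level split: the data preview shows what looks like the same prompt on several rows (the text is truncated, so this is an assumption). If prompts do repeat, the same prompt can land in both the training and validation sets. A sketch of a split that keeps each prompt on one side only is shown below, using only the standard library. The function name `split_by_prompt` is hypothetical, and `test_size` here is a fraction of distinct prompts, not of rows.

```python
import random
from collections import defaultdict


def split_by_prompt(records, test_size=0.2, seed=42):
    """Split records so that all rows sharing a prompt stay in the same set."""
    # Group rows by their prompt text.
    groups = defaultdict(list)
    for record in records:
        groups[record["prompt"]].append(record)

    # Shuffle the distinct prompts reproducibly.
    prompts = sorted(groups)
    random.Random(seed).shuffle(prompts)

    # Reserve a fraction of the prompts (at least one) for validation.
    n_val = max(1, round(len(prompts) * test_size))
    train = [r for p in prompts[n_val:] for r in groups[p]]
    validation = [r for p in prompts[:n_val] for r in groups[p]]
    return train, validation
```

With the `data` list loaded above, `train_data, validation_data = split_by_prompt(data)` would be a drop-in replacement for the `train_test_split` call.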


In [13]:

# Define the S3 bucket name
bucket_name = "s3-sigrom-model-data-bucket"

# Initialize the S3 client
s3_client = boto3.client("s3")


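# `region` was defined in the first bucket-creation cell above
try:
    s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
    )
    print(f"Bucket '{bucket_name}' created successfully.")
except s3_client.exceptions.BucketAlreadyExists:
    print(f"Bucket '{bucket_name}' already exists.")
except Exception as e:
    # BucketAlreadyOwnedByYou (re-running this cell on a bucket you own) is also reported here.
    print(f"Error creating bucket: {e}")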



Error creating bucket: An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.


In [14]:

# # Upload the train dataset to S3
# s3_client.upload_file(
#     Filename=train_data_path,  # Local path to the train data
#     Bucket=bucket_name,        # Name of your S3 bucket
#     Key=train_s3_path,         # Path in the S3 bucket
# )

# # Upload the validation dataset to S3
# s3_client.upload_file(
#     Filename=validation_data_path,  # Local path to the validation data
#     Bucket=bucket_name,             # Name of your S3 bucket
#     Key=validation_s3_path,         # Path in the S3 bucket
# )

# # Generate S3 URIs
# train_s3_uri = f"s3://{bucket_name}/{train_s3_path}"
# validation_s3_uri = f"s3://{bucket_name}/{validation_s3_path}"

# # Print confirmation
# print(f"Train data uploaded to: {train_s3_uri}")
# print(f"Validation data uploaded to: {validation_s3_uri}")



In [17]:
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

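# Define S3 input
# Note: this URI points to the full dataset uploaded earlier, not the train split saved locally above.
train_s3_uri = "s3://squad-training-data/datasets/training_data.jsonl"
train_input = TrainingInput(train_s3_uri, content_type="application/jsonlines")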

estimator = PyTorch(
    entry_point="train.py",
    source_dir="/home/sagemaker-user/SIGROPM1/model/sigropm",  # Directory containing train.py and requirements.txt
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.t3.xlarge",
    framework_version="1.12.0",
    py_version="py38",
    dependencies=["/home/sagemaker-user/SIGROPM1/model/sigropm/requirements.txt"],  # Ensure requirements.txt is included
    hyperparameters={"epochs": 5, "batch_size": 16},
)
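
The contents of train.py are not shown in this notebook. In SageMaker script mode, the `hyperparameters` dict is passed to the entry point as command-line arguments, and the data channel and model output locations are exposed through environment variables. A sketch of how an entry point might read them follows; the defaults for `--epochs` and `--batch_size` are placeholders, and the argument names `--train` and `--model-dir` are a common convention rather than something taken from this project.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as "--epochs 5 --batch_size 16".
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=8)
    # The "train" channel from estimator.fit({"train": ...}) is mounted here.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    # Files written here are packaged as the model artifact.
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Inside train.py, `args = parse_args()` would then give `args.epochs`, `args.batch_size`, `args.train`, and `args.model_dir`.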




In [18]:
# Run the training job
estimator.fit({"train": train_input})
