# Fine Tune T5 LLM

This notebook shows how to fine-tune a pre-trained T5 LLM on our own dataset using SageMaker.

In [3]:
!pip install datasets --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytest-astropy 0.8.0 requires pytest-cov>=2.0, which is not installed.
pytest-astropy 0.8.0 requires pytest-filter-subpackage>=0.1, which is not installed.
spyder 4.0.1 requires pyqt5<5.13; python_version >= "3", which is not installed.
spyder 4.0.1 requires pyqtwebengine<5.13; python_version >= "3", which is not installed.
sagemaker 2.165.0 requires importlib-metadata<5.0,>=1.4.0, but you have importlib-metadata 6.6.0 which is incompatible.
sparkmagic 0.20.4 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
spyder 4.0.1 requires jedi==0.14.1, but you have jedi 0.18.2 which is incompatible.

[notice] A new release of pip is available: 23.1.2 -> 23.2.1

## Loading Processed Dataset

In [4]:
from datasets import load_from_disk

train_path = "s3://sagemaker-us-east-2-003294323742/newsarticle-t5-summary/train"
valid_path = "s3://sagemaker-us-east-2-003294323742/newsarticle-t5-summary/validation"
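
Both URIs above follow the pattern `s3://<bucket>/<prefix>/<split>`. As a minimal sketch, they could be assembled from their parts so that the bucket name lives in one place. `channel_uri` and `data_bucket` are hypothetical names introduced here for illustration; they are not part of `datasets` or `sagemaker`.

```python
def channel_uri(bucket: str, prefix: str, split: str) -> str:
    """Join a bucket, key prefix, and split name into an S3 URI."""
    parts = [bucket.strip("/"), prefix.strip("/"), split.strip("/")]
    return "s3://" + "/".join(parts)


# Values copied from the cell above.
data_bucket = "sagemaker-us-east-2-003294323742"
prefix = "newsarticle-t5-summary"

train_path = channel_uri(data_bucket, prefix, "train")
valid_path = channel_uri(data_bucket, prefix, "validation")
```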

In [5]:
train = load_from_disk(train_path)
valid = load_from_disk(valid_path)

## Fine Tune LLM

In [6]:
import sagemaker
import boto3

# Initialize the SageMaker session and get the default S3 bucket name
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Get the IAM execution role
role = sagemaker.get_execution_role()

# Get the region name
region = boto3.Session().region_name

In [7]:
model_checkpoint = "google/flan-t5-large"

In [12]:
# Configure the hyperparameters that are passed to the training script
epochs = 1
learning_rate = 1e-6
train_batch_size = 1
eval_batch_size = 8
model = model_checkpoint  # Hugging Face Hub ID of the base model

instance_type = "ml.p3.2xlarge"
instance_count = 1
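
With `train_batch_size = 1`, every optimizer step sees a single example, so one epoch takes as many steps as there are training examples. This assumes no gradient accumulation, which the notebook does not show. The arithmetic can be sketched roughly as follows; `steps_per_epoch` is a hypothetical helper, and the dataset size of 1,000 is a made-up number for illustration, not the size of the dataset used here.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int,
                    num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Optimizer steps needed to see every example once."""
    effective_batch = batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


# Hypothetical dataset size, for illustration only.
print(steps_per_epoch(1000, batch_size=1))  # 1000
print(steps_per_epoch(1000, batch_size=8))  # 125
```

A larger batch size reduces the number of steps but needs more GPU memory per step, which is the usual reason for settling on a batch size of 1 with a large checkpoint.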

In [13]:
hyperparameters = {
    "epochs": epochs,
    "learning-rate": learning_rate,
    "train-batch-size": train_batch_size,
    "eval-batch-size": eval_batch_size,
    "model-name": model,
}
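
SageMaker passes each entry of `hyperparameters` to the entry-point script as a command-line argument (for example `--epochs 1`), and it exposes the local path of each input channel through an `SM_CHANNEL_<NAME>` environment variable. The actual `train.py` is not shown in this notebook, so the following is only a sketch of how such a script might read these values. The function name `parse_args`, the `--train-dir`, `--valid-dir`, and `--model-dir` options, and the defaults are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way SageMaker passes them on the command line."""
    parser = argparse.ArgumentParser()
    # Keys mirror the `hyperparameters` dict defined in the notebook.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-6)
    parser.add_argument("--train-batch-size", type=int, default=1)
    parser.add_argument("--eval-batch-size", type=int, default=8)
    parser.add_argument("--model-name", type=str, default="google/flan-t5-large")
    # Locations set by SageMaker inside the training container; the
    # fallbacks are the container's standard paths (assumed option names).
    parser.add_argument("--train-dir",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--valid-dir",
                        default=os.environ.get("SM_CHANNEL_VALID", "/opt/ml/input/data/valid"))
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    return parser.parse_args(argv)


# Simulate the arguments a training job would receive.
args = parse_args(["--epochs", "1", "--learning-rate", "1e-06",
                   "--model-name", "google/flan-t5-large"])
print(args.epochs, args.learning_rate, args.model_name)
```

Note that `argparse` turns the hyphenated keys into attributes with underscores, so `--learning-rate` is read back as `args.learning_rate`.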

In [14]:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    role=role,
    entry_point="train.py",
    source_dir="../src",
    dependencies=["../requirements.txt"],
    hyperparameters=hyperparameters,
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    instance_type=instance_type,
    instance_count=instance_count,
    # Uncomment to enable SageMaker distributed data parallel training
    # on a supported multi-GPU instance type.
    # distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)


In [15]:
huggingface_estimator.fit({"train": train_path, "valid": valid_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-07-28-18-10-11-071


Using provided s3_resource
2023-07-28 18:10:11 Starting - Starting the training job...
2023-07-28 18:10:35 Starting - Preparing the instances for training......
2023-07-28 18:11:33 Downloading - Downloading input data...
2023-07-28 18:11:53 Training - Downloading the training image...........................
2023-07-28 18:16:34 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-07-28 18:16:53,226 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-07-28 18:16:53,246 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-07-28 18:16:53,259 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-07-28 18:16:53,261 sagemaker_pytorch_container.training INFO     Invoking user training scrip

In [16]:
huggingface_estimator.model_data

's3://sagemaker-us-east-2-003294323742/huggingface-pytorch-training-2023-07-28-18-10-11-071/output/model.tar.gz'
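
`model_data` is a plain S3 URI pointing at the packaged model artifact. To download or inspect the archive, the URI usually has to be split into a bucket and an object key first. A minimal sketch using only the standard library follows; `split_s3_uri` is a hypothetical helper name, not a SageMaker or boto3 function.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the cell output above.
model_uri = ("s3://sagemaker-us-east-2-003294323742/"
             "huggingface-pytorch-training-2023-07-28-18-10-11-071/output/model.tar.gz")
bucket_name, key = split_s3_uri(model_uri)
print(bucket_name)
print(key)
```

The resulting pair can then be passed to a boto3 S3 client, for example `download_file(Bucket, Key, Filename)`, to fetch `model.tar.gz` locally.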