# Model Training 


In [20]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
IAM role arn used for running training: arn:aws:iam::523272053639:role/service-role/AmazonSageMaker-ExecutionRole-20231214T100902
S3 bucket used for storing artifacts: sagemaker-us-east-2-523272053639


We are in the fortunate position of not having to write our own training script. Instead, we use the summarization example script from the transformers library on GitHub: https://github.com/huggingface/transformers/blob/v4.6.1/examples/pytorch/summarization/run_summarization.py. Note that this link points to the v4.6.1 tag, whereas the `git_config` below pins v4.26.1.

In [21]:
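# Pin the transformers repository to a release tag.
# This only takes effect if git_config is passed to the estimator.
git_config = {
    'repo': 'https://github.com/huggingface/transformers.git',
    'branch': 'v4.26.1',
}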

These are the parameters for training, and they are among the most important levers available once we reach the experimentation phase. Changing them can affect model performance, and finding the best model involves some trial and error. For automated hyperparameter tuning, see https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html.
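The trial-and-error search described above can be organized as a small grid over the values you want to vary. The following is a minimal sketch and not part of the original notebook: the `base` dictionary, the `search_space` values, and the `candidate_configs` helper are all hypothetical and chosen only for illustration. Each resulting dictionary could be passed as `hyperparameters` to a separate training job.

```python
from itertools import product

# Hypothetical baseline; in the notebook this role is played by the
# `hyperparameters` dict defined in the next cell.
base = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 2,
    "num_train_epochs": 1,
}

# Hypothetical values to try (assumptions, not tuned recommendations).
search_space = {
    "learning_rate": [3e-5, 5e-5, 1e-4],
    "per_device_train_batch_size": [2, 4],
}


def candidate_configs(base, search_space):
    """Yield one full hyperparameter dict per combination in the search space."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        # Values from the search space override the baseline.
        yield {**base, **dict(zip(names, values))}


configs = list(candidate_configs(base, search_space))
print(f"{len(configs)} candidate configurations")
for cfg in configs:
    print(cfg)
```

A grid grows multiplicatively with each added parameter, so for larger search spaces the automatic model tuning service linked above is likely the better fit.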

In [35]:
# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 2,
                 'per_device_eval_batch_size': 2,
                 'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
                 'train_file': '/opt/ml/input/data/datasets/train.csv',
                 'validation_file': '/opt/ml/input/data/datasets/val.csv',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': '/opt/ml/model',
                 'num_train_epochs': 1,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'val_max_target_length': 20,
                 'text_column': 'text',
                 'summary_column': 'summary',
                 # 'force_download':True,
                 }

# configuration for running training on smdistributed Data Parallel
# distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
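For context on how these values reach the script: in script mode, SageMaker passes each hyperparameter to the entry point as a `--name value` command-line argument, which `run_summarization.py` then parses. The sketch below imitates that conversion for illustration only. `to_cli_args` is a hypothetical helper, not an SDK function, and the exact formatting applied by the training toolkit may differ.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a ['--key', 'value', ...] list."""
    args = []
    for name, value in hyperparameters.items():
        # Every value is rendered as a string, including booleans and floats.
        args.extend([f"--{name}", str(value)])
    return args


# A small subset of the notebook's hyperparameters, used as an example.
example = {"do_train": True, "num_train_epochs": 1, "learning_rate": 5e-5}
cli_args = to_cli_args(example)
print(" ".join(cli_args))
```

This is why the keys in `hyperparameters` must match the argument names the training script expects.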

In [25]:
from sagemaker.image_uris import retrieve

deploy_instance_type = 'ml.p3.2xlarge'

pytorch_inference_image_uri = retrieve('huggingface',
                                       region='us-east-2',
                                       version='4.6.1',
                                       instance_type=deploy_instance_type,
                                       base_framework_version='pytorch1.8.1',
                                       image_scope='inference')

In [40]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='./run_summarization.py',
    source_dir='.',
    # git_config=git_config,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    role=role,
    hyperparameters=hyperparameters,
    # distribution=distribution,
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


This kicks off the training job, which should take around 1 hour. Distributed training across more instances is also an option; see https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html. Running this training on 2 distributed instances should take about 40 minutes.
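When weighing the distributed option, it helps to remember that under data-parallel training each GPU processes its own batch, so the effective batch size per optimizer step grows with the number of GPUs. The sketch below shows the arithmetic. `effective_batch_size` is a hypothetical helper, the GPU count of 1 for `ml.p3.2xlarge` is an assumption stated in the code (check the instance specification for other types), and gradient accumulation is assumed to be 1.

```python
def effective_batch_size(per_device_batch, gpus_per_instance,
                         instance_count, grad_accum_steps=1):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * gpus_per_instance * instance_count * grad_accum_steps


# Assumption: ml.p3.2xlarge provides a single GPU.
GPUS_PER_P3_2XLARGE = 1

# per_device_train_batch_size is 2 in the notebook's hyperparameters.
single = effective_batch_size(2, GPUS_PER_P3_2XLARGE, instance_count=1)
double = effective_batch_size(2, GPUS_PER_P3_2XLARGE, instance_count=2)
print(f"1 instance: {single}, 2 instances: {double}")
```

Because the effective batch size changes with the instance count, results from a distributed run may not be directly comparable to the single-instance run unless other settings such as the learning rate are revisited.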

In [41]:
huggingface_estimator.fit({'datasets':f's3://{bucket}/summarization/data/'}, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-12-15-07-37-02-392


2023-12-15 07:37:04 Starting - Starting the training job......
2023-12-15 07:37:38 Starting - Preparing the instances for training...
2023-12-15 07:38:34 Downloading - Downloading input data...
2023-12-15 07:38:54 Downloading - Downloading the training image..............................
2023-12-15 07:43:56 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-12-15 07:44:12,570 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-12-15 07:44:12,590 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-12-15 07:44:12,604 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-12-15 07:44:12,607 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-12-1

training-job: huggingface-pytorch-training-2023-12-14-18-53-25-852
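The `datasets` key passed to `fit` names an input channel, and SageMaker makes each channel's data available inside the training container under `/opt/ml/input/data/<channel name>`. That is why the `train_file` and `validation_file` hyperparameters point at that directory. The sketch below reconstructs those paths. `channel_file` is a hypothetical helper, and it assumes `train.csv` and `val.csv` sit directly under the S3 prefix given to `fit`.

```python
import posixpath

# Root directory under which SageMaker exposes input channels in the container.
INPUT_ROOT = "/opt/ml/input/data"


def channel_file(channel, filename):
    """Return the in-container path of a file delivered through a channel."""
    return posixpath.join(INPUT_ROOT, channel, filename)


train_file = channel_file("datasets", "train.csv")
validation_file = channel_file("datasets", "val.csv")
print(train_file)
print(validation_file)
```

If the channel name in `fit` is changed, the two path hyperparameters need to change with it.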

In [42]:
huggingface_estimator.model_data

's3://sagemaker-us-east-2-523272053639/huggingface-pytorch-training-2023-12-15-07-37-02-392/output/model.tar.gz'
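To work with the artifact outside the estimator, for example to download it, the URI usually has to be split into a bucket name and an object key. The following is a minimal sketch using only the standard library. `split_s3_uri` is a hypothetical helper, and the URI is the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


model_uri = (
    "s3://sagemaker-us-east-2-523272053639/"
    "huggingface-pytorch-training-2023-12-15-07-37-02-392/output/model.tar.gz"
)
bucket_name, key = split_s3_uri(model_uri)
print(bucket_name)
print(key)
```

The two parts could then be handed to an S3 client, for instance `boto3`'s `download_file`, to fetch `model.tar.gz` locally.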

In [46]:
huggingface_estimator

<sagemaker.huggingface.estimator.HuggingFace at 0x7fad8be499f0>