# Part 3 - Training (aka *fine-tuning*) a Transformer model

In this part we will finally fine-tune our very own Transformer model. We saw that the zero-shot model didn't produce great results, probably because it was trained to summarise news articles, not academic papers.

These lines of code are the typical setup for SageMaker; we need them for training jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

In [23]:
import sagemaker

bucket = sagemaker.Session().default_bucket()
region = sagemaker.Session().boto_region_name

# get_execution_role() doesn't work when running the notebook locally against the API;
# see "0b_data_prep_reviews_corrected.ipynb" for an explanation and for how to obtain
# the role ARN
# role = sagemaker.get_execution_role()
role = 'arn:aws:iam::595714217589:role/service-role/AmazonSageMaker-ExecutionRole-20220331T161122'

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {bucket}")

IAM role arn used for running training: arn:aws:iam::595714217589:role/service-role/AmazonSageMaker-ExecutionRole-20220331T161122
S3 bucket used for storing artifacts: sagemaker-us-east-1-595714217589
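As a quick sanity check on the output above: unless a custom default bucket is configured, `default_bucket()` names the bucket `sagemaker-{region}-{account_id}`, and the same account ID is embedded in the role ARN. The sketch below checks that the two agree using plain string handling. The helper names are hypothetical and not part of the SageMaker SDK.

```python
def account_id_from_role_arn(role_arn: str) -> str:
    """Return the 12-digit account ID embedded in an IAM role ARN."""
    # IAM role ARNs look like: arn:<partition>:iam::<account-id>:role/<path-and-name>
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"not an IAM ARN: {role_arn!r}")
    account_id, resource = parts[4], parts[5]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"unexpected account ID in ARN: {role_arn!r}")
    if not resource.startswith("role/"):
        raise ValueError(f"ARN does not refer to a role: {role_arn!r}")
    return account_id


def expected_default_bucket(region: str, account_id: str) -> str:
    """Name the SDK gives the default bucket when none is configured."""
    return f"sagemaker-{region}-{account_id}"


role = "arn:aws:iam::595714217589:role/service-role/AmazonSageMaker-ExecutionRole-20220331T161122"
account_id = account_id_from_role_arn(role)
print(expected_default_bucket("us-east-1", account_id))
```

If the printed name differs from the bucket reported by the session, the hard-coded role probably belongs to a different account or region than the active credentials.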


We are in the fortunate position of not having to write our own training script. Instead, we will use a script from the transformers library on GitHub: https://github.com/huggingface/transformers/blob/v4.6.1/examples/pytorch/summarization/run_summarization.py

In [25]:
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

These are the training hyperparameters, one of the most important levers we have once we are in the experimentation phase. Changing them can influence model performance, and finding the best model involves some trial and error. Also check out https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html for automated hyperparameter tuning.
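The trial-and-error search described above can be organised as a small grid of hyperparameter variants. This is only a sketch: the candidate values in `search_space` are illustrative assumptions, not recommendations from the original tutorial, and `expand_grid` is a hypothetical helper. Each resulting dict could be passed as `hyperparameters` to a separate training job.

```python
from itertools import product

# Subset of the training hyperparameters, used as the fixed baseline.
base_hyperparameters = {
    "per_device_train_batch_size": 4,
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
    "seed": 7,
}

# Candidate values to try (assumed for illustration only).
search_space = {
    "learning_rate": [3e-5, 5e-5, 1e-4],
    "per_device_train_batch_size": [2, 4],
}


def expand_grid(base, space):
    """Yield one hyperparameter dict per combination of values in `space`."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        # Copy the baseline and override only the searched keys.
        yield {**base, **dict(zip(names, values))}


trials = list(expand_grid(base_hyperparameters, search_space))
for trial in trials:
    print(trial["learning_rate"], trial["per_device_train_batch_size"])
```

A full grid grows multiplicatively with every added parameter, which is one reason to consider the managed tuning service linked above for larger searches.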

In [26]:
# hyperparameters, which are passed into the training job - original version
# hyperparameters={'per_device_train_batch_size': 4,
#                  'per_device_eval_batch_size': 4,
#                  'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
#                  'train_file': '/opt/ml/input/data/datasets/train.csv',
#                  'validation_file': '/opt/ml/input/data/datasets/val.csv',
#                  'do_train': True,
#                  'do_eval': True,
#                  'do_predict': False,
#                  'predict_with_generate': True,
#                  'output_dir': '/opt/ml/model',
#                  'num_train_epochs': 3,
#                  'learning_rate': 5e-5,
#                  'seed': 7,
#                  'fp16': True,
#                  'val_max_target_length': 20,
#                  'text_column': 'text',
#                  'summary_column': 'summary',
#                  }

# hyperparameters, which are passed into the training job - version modified to use
# S3 URIs in my AWS account (the original above uses container-local paths under /opt/ml)
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
                 'train_file': 's3://sagemaker-us-east-1-595714217589/summarization/data/train.csv',
                 'validation_file': 's3://sagemaker-us-east-1-595714217589/summarization/data/val.csv',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': 's3://sagemaker-us-east-1-595714217589/summarization/output/model',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'val_max_target_length': 20,
                 'text_column': 'text',
                 'summary_column': 'summary',
                 }




# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
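The container-local paths in the commented-out original version follow SageMaker's convention: each input channel passed to `fit()` (here named `datasets`) is made available under `/opt/ml/input/data/<channel>` inside the training container, and whatever the script writes to `/opt/ml/model` is uploaded to S3 when the job ends. A minimal sketch of that mapping, with a hypothetical helper name:

```python
import posixpath

# Root directory under which SageMaker mounts input channels in the container.
SAGEMAKER_INPUT_ROOT = "/opt/ml/input/data"


def channel_file(channel: str, filename: str) -> str:
    """Path at which a file from the given input channel appears in the container."""
    return posixpath.join(SAGEMAKER_INPUT_ROOT, channel, filename)


print(channel_file("datasets", "train.csv"))
print(channel_file("datasets", "val.csv"))
```

Whether the training script can read `s3://` URIs directly depends on the library versions inside the container, so it is worth verifying that before relying on the modified paths.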

In [34]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/pytorch/summarization',
    git_config=git_config,
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution,
)
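With data parallelism, every GPU processes its own batch, so the instance settings above change the effective batch size. The sketch below estimates it, assuming one worker process per GPU and no gradient accumulation unless stated. The GPU counts are those of the P3 instance family, and the helper name is hypothetical.

```python
# Number of GPUs per instance for the P3 family.
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def global_batch_size(instance_type, instance_count, per_device_batch_size,
                      gradient_accumulation_steps=1):
    """Samples contributing to one optimizer step across all GPUs."""
    total_gpus = GPUS_PER_INSTANCE[instance_type] * instance_count
    return total_gpus * per_device_batch_size * gradient_accumulation_steps


# Settings from the estimator and hyperparameters above: 2 x 8 GPUs x 4 samples.
print(global_batch_size("ml.p3.16xlarge", 2, 4))  # prints 64
```

Because the effective batch size grows with the number of GPUs, a learning rate tuned for a single GPU may need revisiting when scaling out.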

This will kick off the training job. On a single instance it should take around 1 hour; with 2 distributed instances, as configured above, it should take ~40 minutes. For more on distributed training, see https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html.

In [35]:
huggingface_estimator.fit({'datasets':f's3://{bucket}/summarization/data/'}, wait=False)

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.16xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 2 Instances. Please contact AWS support to request an increase for this limit.

### Error trying to run training job above
The job was rejected with the `ResourceLimitExceeded` error shown above: the account-level limit for `ml.p3.16xlarge` training instances is 0, while the job requests 2.

Pages with info:
- https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html
    - This page states: "Depending on your activities and resource usage over time, your SageMaker quotas might be different from the default SageMaker quotas listed on Amazon SageMaker endpoints and quotas in the AWS General Reference. If you encounter error messages that you've exceeded your quota and you need to scale up your SageMaker resources, follow the steps in the Request a service quota increase for SageMaker resources procedure on this page to request a quota increase from AWS Support."

I followed the instructions on that page to submit the quota increase request:

Limit increase request 1
Service: SageMaker Training Jobs
Region: US East (Northern Virginia)
Resource Type: SageMaker Training
Limit name: ml.p3.16xlarge
New limit value: 2
------------
Use case description: I'm learning how to use Sagemaker to train models. I'm following an example written by Heiko Hotz (Senior Solutions Architect at AWS) posted at https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8 . I'm unable to launch the training job due to the following error:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.16xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 2 Instances. Please contact AWS support to request an increase for this limit.

I'm trying to run this job remotely instead of from a Sagemaker Jupyterlab instance.
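When launching jobs from a script rather than a notebook, it can help to pull the numbers out of the quota error so the required limit is obvious. The sketch below parses the message text with a regular expression. It assumes the wording shown in the error above, which is not a documented, stable format, and `parse_quota_error` is a hypothetical helper.

```python
import re

# Pattern based on the wording of the error message shown earlier (an assumption).
QUOTA_ERROR = re.compile(
    r"The account-level service limit '(?P<resource>[^']+)' is "
    r"(?P<limit>\d+) Instances, with current utilization of "
    r"(?P<used>\d+) Instances and a request delta of "
    r"(?P<delta>\d+) Instances"
)


def parse_quota_error(message):
    """Extract quota details from a ResourceLimitExceeded message, or return None."""
    match = QUOTA_ERROR.search(message)
    if match is None:
        return None
    limit = int(match["limit"])
    used = int(match["used"])
    delta = int(match["delta"])
    return {
        "resource": match["resource"],
        "limit": limit,
        "used": used,
        "requested": delta,
        # Smallest limit that would let this request through.
        "required_limit": used + delta,
    }


message = (
    "ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling "
    "the CreateTrainingJob operation: The account-level service limit "
    "'ml.p3.16xlarge for training job usage' is 0 Instances, with current utilization "
    "of 0 Instances and a request delta of 2 Instances. Please contact AWS support to "
    "request an increase for this limit."
)
details = parse_quota_error(message)
print(details)
```

Here `required_limit` comes out as 2, which matches the new limit value requested in the support case above.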