# Part 3 - Training (aka *fine-tuning*) a Transformer model

In this part we finally train our own Transformer model. We saw that the zero-shot model didn't produce great results, probably because it was trained to summarise news articles, not academic papers.

The following lines are typical SageMaker setup and are required for training jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

In [8]:
import sagemaker

bucket = sagemaker.Session().default_bucket()
region = sagemaker.Session().boto_region_name

# The "get_execution_role()" method doesn't work when running a notebook locally and using the API.
# See "0b_data_prep_reviews_corrected.ipynb" for an explanation and how to get
# the proper value.
# role = sagemaker.get_execution_role()
role = 'arn:aws:iam::595714217589:user/Administrator'

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {bucket}")

IAM role arn used for running training: arn:aws:iam::595714217589:user/Administrator
S3 bucket used for storing artifacts: sagemaker-us-east-1-595714217589
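
When running locally, one common way to obtain the ARN is to look the role up by name through IAM. SageMaker training jobs expect a role ARN of the form `arn:aws:iam::<account-id>:role/<role-name>`. The sketch below is a minimal helper under stated assumptions: the function name `resolve_role_arn` and the role name `sagemaker-execution-role` are hypothetical, and the commented usage assumes boto3 credentials are configured.

```python
def resolve_role_arn(iam_client, role_name):
    """Return the ARN of an IAM role, looked up by its name.

    `iam_client` is expected to behave like boto3.client("iam").
    """
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage (needs AWS credentials; the role name is a placeholder):
# import boto3
# role = resolve_role_arn(boto3.client("iam"), "sagemaker-execution-role")
```

Passing the client in as an argument keeps the helper easy to try out without touching a real AWS account.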


Fortunately, we don't have to write our own training script. Instead we use a script from the transformers library on GitHub: https://github.com/huggingface/transformers/blob/v4.6.1/examples/pytorch/summarization/run_summarization.py

In [None]:
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

These are the parameters for training, and they are among the most important levers we have once we are in the experimentation phase. Changing them can influence model performance, and finding the best model involves some trial and error. Also check out https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html for automated hyperparameter tuning.

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
                 'train_file': '/opt/ml/input/data/datasets/train.csv',
                 'validation_file': '/opt/ml/input/data/datasets/val.csv',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': '/opt/ml/model',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'val_max_target_length': 20,
                 'text_column': 'text',
                 'summary_column': 'summary',
                 }

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}
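
SageMaker hands the hyperparameters to the entry point script as command-line arguments, which is why the keys match the argument names of `run_summarization.py`. The sketch below only illustrates the idea: `to_cli_args` is a hypothetical helper, not part of the SageMaker SDK, and the exact serialization SageMaker performs may differ in detail.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a '--name value' argument list."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


# A small subset of the hyperparameters used in this notebook
example = {"num_train_epochs": 3, "learning_rate": 5e-5, "do_train": True}
print(to_cli_args(example))
# prints ['--num_train_epochs', '3', '--learning_rate', '5e-05', '--do_train', 'True']
```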

In [None]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/pytorch/summarization',
    git_config=git_config,
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution,
)
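
With data parallelism, every GPU processes its own batch, so the effective batch size grows with the number of instances. The arithmetic below is a rough sketch, assuming 8 GPUs per `ml.p3.16xlarge` instance, one worker process per GPU, and no gradient accumulation (none is set in the hyperparameters above).

```python
instance_count = 2                # matches instance_count in the estimator
gpus_per_instance = 8             # assumption: ml.p3.16xlarge provides 8 GPUs
per_device_train_batch_size = 4   # matches the hyperparameters above

# Each GPU runs one worker, and each worker sees its own batch per step
num_workers = instance_count * gpus_per_instance
global_batch_size = num_workers * per_device_train_batch_size

print(f"workers: {num_workers}, global batch size: {global_batch_size}")
# prints workers: 16, global batch size: 64
```

Keeping this number in mind helps when changing `instance_count`, since a larger global batch size can call for a different learning rate.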

This kicks off the training job, which should take around 1 hour. Distributed training across more instances is also an option, see https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html. Running this training on 2 distributed instances should take ~40 minutes. Because `wait=False` is set, the call returns immediately instead of blocking until the job finishes.

In [None]:
huggingface_estimator.fit({'datasets': f's3://{bucket}/summarization/data/'}, wait=False)
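
Since the call above does not block, the job status has to be checked separately. The sketch below shows one way to poll for it. `wait_for_training_job` is a hypothetical helper, and `sm_client` is assumed to behave like `boto3.client("sagemaker")`, whose `describe_training_job` call reports a `TrainingJobStatus` field.

```python
import time


def wait_for_training_job(sm_client, job_name, poll_seconds=60):
    """Poll a training job until it reaches a terminal status, then return it."""
    terminal_statuses = {"Completed", "Failed", "Stopped"}
    while True:
        description = sm_client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status in terminal_statuses:
            return status
        time.sleep(poll_seconds)


# Usage (needs AWS credentials and a started job):
# import boto3
# job_name = huggingface_estimator.latest_training_job.name
# print(wait_for_training_job(boto3.client("sagemaker"), job_name))
```

The same information is also visible in the SageMaker console under the list of training jobs.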