# Hugging Face SageMaker SDK - Distributed Training Demo

### Model Parallelism using `SageMakerTrainer` 

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Preprocessing](#Preprocessing)   
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
    2. [Estimator Parameters](#Estimator-Parameters)   
    3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)
    4. [Attach an old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)  
5. [_Coming soon_: Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)

# Introduction

Welcome to our end-to-end distributed text-classification example. In this demo, we use the Hugging Face `transformers` and `datasets` libraries together with an Amazon SageMaker SDK extension to run the GLUE `mnli` benchmark on a multi-node, multi-GPU cluster using the [SageMaker Model Parallelism Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html). The demo uses the `smdistributed` library to run training on multiple GPUs. We extended the `Trainer` API with a `SageMakerTrainer` that uses the model parallelism library, so you only have to change the imports in your `train.py`.

_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have it installed already._

In [2]:
!pip install "sagemaker>=2.48.0" --upgrade

Collecting sagemaker>=2.48.0
  Downloading sagemaker-2.144.0.tar.gz (712 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 712.8/712.8 kB 9.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting PyYAML==5.4.1
  Using cached PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.144.0-py2.py3-none-any.whl size=958086 sha256=b5b4869136dbb5ab1f5d1a15b84553153838fff2b0452a8976e0ab846df3aab9
  Stored in directory: /root/.cache/pip/wheels/07/b6/ac/c8fd0c283eb5375b8f4b23643985018319a9388bd185db4acb
Successfully built sagemaker
Installing collected packages: PyYAML, importlib-metadata, sagemaker
  Attempting uninstall: PyYAML
    Found existing instal

## Development environment 

In [3]:
import sagemaker.huggingface

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [4]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::514385905925:role/service-role/AmazonSageMaker-ExecutionRole-20201218T184365
sagemaker bucket: sagemaker-us-east-1-514385905925
sagemaker session region: us-east-1


# Fine-tuning & starting Sagemaker Training Job

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 
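
As a minimal sketch of the receiving side, a training script could read these named arguments with `argparse`. The argument names mirror the command line shown above; the default values and the `parse_args` helper are assumptions made for illustration and are not taken from the actual `train.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed to the entry point as `--key value` pairs."""
    parser = argparse.ArgumentParser()
    # Each key of the estimator's `hyperparameters` dict arrives as a named argument.
    # The defaults below are illustrative assumptions.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # parse_known_args tolerates extra flags that this script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Same arguments as in the command line shown above.
args = parse_args(
    ["--epochs", "1", "--model_name", "distilbert-base-uncased", "--train_batch_size", "32"]
)
print(args.epochs, args.train_batch_size, args.model_name)
```

Using `parse_known_args` instead of `parse_args` is a design choice here: it keeps the script from failing if the job passes additional arguments that the script does not declare.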

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX:` A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
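
The variables listed above can be read inside a training script roughly as follows. This is a sketch under assumptions: `read_sagemaker_env` is a hypothetical helper, and the fallback values are placeholders for running outside of SageMaker.

```python
import os


def read_sagemaker_env(environ=None):
    """Collect the SageMaker environment variables described above."""
    environ = os.environ if environ is None else environ
    # Fallback values are assumptions for running outside of SageMaker.
    model_dir = environ.get("SM_MODEL_DIR", "./model")
    num_gpus = int(environ.get("SM_NUM_GPUS", "0"))
    # Every input channel passed to fit() shows up as SM_CHANNEL_<NAME>.
    prefix = "SM_CHANNEL_"
    channels = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    return {"model_dir": model_dir, "num_gpus": num_gpus, "channels": channels}


# Example with a hand-written environment instead of the real one.
example_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_NUM_GPUS": "8",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
}
settings = read_sagemaker_env(example_env)
print(settings)
```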


To run your training job locally, you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


## Creating an Estimator and start a training job

In this example we are going to use the `run_glue.py` from the transformers example scripts. We modified it and included `SageMakerTrainer` instead of the `Trainer` to enable model-parallelism. You can find the code [here](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification).

```python
from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments, SageMakerTrainer as Trainer
```

In [5]:
from sagemaker.huggingface import HuggingFace

In [43]:
# hyperparameters, which are passed into the training job
#hyperparameters for flan-t5-xxl
'''
hyperparameters={
    'train_dataset_path': '/opt/ml/input/data/training',
    'test_dataset_path': '/opt/ml/input/data/test',
    "learning_rate": 1e-4,
}
'''

model_id="decapoda-research/llama-7b-hf"
# hyperparameters, which are passed into the training job
#hyperparameters for llama
hyperparameters={
  'model_name': model_id,                      # pre-trained model
  'training_dir': '/opt/ml/input/data/train',  # path where SageMaker places the training dataset
  'test_dir': '/opt/ml/input/data/test',       # path where SageMaker places the test dataset
  'num_train_epochs': 1,                       # number of training epochs
  'per_device_train_batch_size': 2,            # batch size for training
  'per_device_eval_batch_size': 2,             # batch size for evaluation
  'learning_rate': 1e-4,                       # learning rate used during training
  'gradient_accumulation_steps': 4,            # batches accumulated before each optimizer step
  #'model_max_length': 512
  'model_max_length': 1536                     # maximum sequence length
}

# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,
}
smp_options = {
    "enabled":True,
    "parameters": {
        #"microbatches": 4,
        #"placement_strategy": "spread",
        #"sharded_data_parallel_degree": 16,
        #"ddp_dist_backend": "nccl",
        "pipeline_parallel_degree": 16,
        "placement_strategy": "cluster",
        #"pipeline": "interleaved",
        "tensor_parallel_degree": 1,
        #"optimize": "speed",
        "partitions": 16,
        "fp16": True,
        "ddp": True,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

# instance configurations
instance_type='ml.p4d.24xlarge'
instance_count = 2
#volume_size = 200

# metric definition to extract the results
'''
metric_definitions=[
     {'Name': 'train_runtime', 'Regex':"train_runtime.*=\D*(.*?)$"},
     {'Name': 'train_samples_per_second', 'Regex': "train_samples_per_second.*=\D*(.*?)$"},
     {'Name': 'epoch', 'Regex': "epoch.*=\D*(.*?)$"},
     {'Name': 'f1', 'Regex': "f1.*=\D*(.*?)$"},
     {'Name': 'exact_match', 'Regex': "exact_match.*=\D*(.*?)$"}]
'''

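
To get a feel for what the commented-out `metric_definitions` above would capture, the regular expressions can be tried locally with Python's `re` module. This is only an approximation: SageMaker applies the expressions to the job logs itself, the `extract_metrics` helper is hypothetical, and the sample log line is an assumed format rather than real output of this job.

```python
import re

# Raw strings keep `\D` from being treated as a Python string escape.
metric_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "epoch", "Regex": r"epoch.*=\D*(.*?)$"},
]


def extract_metrics(log_line, definitions):
    """Return the first capture group of every definition that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = match.group(1)
    return found


# Hypothetical log line; the real training log format may differ.
sample_line = "train_runtime = 123.45"
print(extract_metrics(sample_line, metric_definitions))
```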

In [44]:
# estimator
environment = {'CUDA_LAUNCH_BLOCKING': '1'}
huggingface_estimator = HuggingFace(entry_point='train-llama-no-specail-token.py',
                                    source_dir='.',
                                    #metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    #volume_size=volume_size,
                                    role=role,
                                    #transformers_version='4.26.0',
                                    #pytorch_version='1.13.1',
                                    #py_version='py39',
                                    transformers_version='4.17',
                                    pytorch_version='1.10',
                                    py_version='py38',
                                    distribution=distribution,
                                    hyperparameters=hyperparameters,
                                    environment=environment,
                                    debugger_hook_config=False)

In [45]:
huggingface_estimator.hyperparameters()

{'model_name': '"decapoda-research/llama-7b-hf"',
 'training_dir': '"/opt/ml/input/data/train"',
 'test_dir': '"/opt/ml/input/data/test"',
 'num_train_epochs': '1',
 'per_device_train_batch_size': '2',
 'per_device_eval_batch_size': '2',
 'learning_rate': '0.0001',
 'gradient_accumulation_steps': '4',
 'model_max_length': '1536',
 'sagemaker_mpi_enabled': 'true',
 'sagemaker_mpi_num_of_processes_per_host': '8',
 'sagemaker_mpi_custom_mpi_options': '""',
 'mp_parameters': '{"pipeline_parallel_degree": 16, "placement_strategy": "cluster", "tensor_parallel_degree": 1, "partitions": 16, "fp16": true, "ddp": true}',
 'sagemaker_distributed_dataparallel_enabled': 'false',
 'sagemaker_instance_type': '"ml.p4d.24xlarge"'}
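
The output above shows 8 MPI processes per host and a `pipeline_parallel_degree` of 16, running on two `ml.p4d.24xlarge` instances. As a rough sanity check, the sketch below works out how these numbers fit together. It assumes that the data-parallel degree is the total number of processes divided by the product of pipeline and tensor parallel degrees; `parallel_layout` is a hypothetical helper, not part of the SageMaker SDK.

```python
def parallel_layout(instance_count, processes_per_host,
                    pipeline_parallel_degree, tensor_parallel_degree):
    """Derive the process layout implied by the distribution settings."""
    # One process per GPU: total processes across all instances.
    world_size = instance_count * processes_per_host
    # Number of GPUs needed to hold one model replica.
    model_parallel_size = pipeline_parallel_degree * tensor_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"{world_size} processes cannot be split into groups of {model_parallel_size}"
        )
    return {
        "world_size": world_size,
        "model_parallel_size": model_parallel_size,
        "data_parallel_degree": world_size // model_parallel_size,
    }


# Values taken from the configuration in this notebook.
layout = parallel_layout(
    instance_count=2,
    processes_per_host=8,
    pipeline_parallel_degree=16,
    tensor_parallel_degree=1,
)
print(layout)
```

Under that assumption, the 16 GPUs of the cluster hold a single pipeline-parallel model replica, so the data-parallel degree comes out as 1.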

In [46]:
# starting the train job with our uploaded datasets as input
#test_input_path = 's3://sagemaker-us-east-1-514385905925/samples/datasets/test001/test'
test_input_path = 's3://sagemaker-us-east-1-514385905925/samples/datasets/lala-no-special-token-test0406/test/'


data = {
    'train': test_input_path,
    'test': test_input_path
}

huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-04-06-12-46-09-325


2023-04-06 12:46:43 Starting - Starting the training job...
2023-04-06 12:47:19 Starting - Preparing the instances for training............
2023-04-06 12:49:18 Downloading - Downloading input data
2023-04-06 12:49:18 Training - Downloading the training image..................
2023-04-06 12:52:19 Training - Training image download completed. Training in progress.......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-06 12:53:18,756 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-06 12:53:18,821 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-06 12:53:18,823 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-04-06 12:53:21,016 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2023-04-06-12-46-09-325: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: CUDA error: device-side assert triggered
 
 During handling of the above exception, another exception occurred
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/module_manager.py", line 484, in record_execution_time
 yield
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 56, in trace_forward
 output = original_forward(self, *args, **kwargs)
 File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 575, in forward
 inputs_embeds = self.embed_tokens(input_ids)
 File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
 return forward_call(*input, **kwargs)
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 75, in trace_forward
 raise e
 File "/opt/conda/lib/python3.8/contextlib.py"
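
The traceback above fails inside `self.embed_tokens(input_ids)`. The log does not state the root cause, but one common reason for a device-side assert in an embedding lookup is a token id that lies outside the embedding table. The following sketch shows a CPU-side check that could be run on a tokenized batch before training. The `find_out_of_range_ids` helper, the sample ids, and the vocabulary size are assumptions for illustration, not values taken from this job.

```python
def find_out_of_range_ids(input_ids, vocab_size):
    """Return (position, token_id) pairs that cannot index an embedding of size vocab_size."""
    return [
        (position, token_id)
        for position, token_id in enumerate(input_ids)
        if token_id < 0 or token_id >= vocab_size
    ]


# Assumed values for illustration only.
vocab_size = 32000
sample_ids = [1, 15043, 32000, -1]

bad_ids = find_out_of_range_ids(sample_ids, vocab_size)
print(bad_ids)
```

If such a check reports any entries, comparing the tokenizer's vocabulary size with the model's embedding size would be a reasonable next step.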

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g5.48xlarge")

Then, we use the returned predictor object to call the endpoint.

In [None]:
sentiment_input = {"inputs": "I love using the new Inference DLC."}

predictor.predict(sentiment_input)

Finally, we delete the endpoint again.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()