# Distributed Training Demo for LLaMA on SageMaker

### Model parallelism using the SageMaker model parallelism library

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Preprocessing](#Preprocessing)   
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  

# Introduction

Welcome to our end-to-end distributed training example. In this demo, we use the Hugging Face `transformers` and `datasets` libraries together with the Amazon SageMaker Python SDK to train on a multi-node, multi-GPU cluster using the [SageMaker Model Parallelism Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html). The demo uses the `smdistributed` library to run training across multiple GPUs.

_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you haven't installed it already._

In [None]:
!pip install "sagemaker>=2.156.0" --upgrade

After upgrading the SageMaker SDK, please restart the Jupyter kernel and execute the following cell.

## Development environment 

In [None]:
import sagemaker.huggingface

In [None]:
print(sagemaker.__version__)

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Fine-tuning & starting Sagemaker Training Job

To create a SageMaker training job, we need an Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script is used as `entry_point`, which `instance_type` is used, which `hyperparameters` are passed in, and so on.

When we create a SageMaker training job, SageMaker starts and manages all the required EC2 instances for us with the training container, uploads the provided fine-tuning script (the `entry_point` file), and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then it starts the training job by running that script.

The `hyperparameters` you define in the estimator are passed to the script as named command-line arguments.
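As an illustration of how a training script might receive these named arguments, here is a minimal sketch using `argparse`. The function name `parse_hyperparameters` is hypothetical, and only a subset of this notebook's hyperparameters is shown; the actual entry-point script may parse its arguments differently.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters supplied as `--name value` pairs (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--model_max_length", type=int, default=1536)
    # parse_known_args tolerates extra arguments that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the argument list the script would see for two of the hyperparameters
args = parse_hyperparameters(["--num_train_epochs", "1", "--learning_rate", "1e-05"])
print(args.num_train_epochs, args.learning_rate, args.model_max_length)
```

Using `parse_known_args` rather than `parse_args` is a design choice for this sketch: it keeps the script from failing when the estimator passes hyperparameters that the parser does not list.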

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels named `train` and `test` in the estimator's `fit` call, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
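A training script could read the variables listed above roughly as follows. This is a sketch only: the helper name `read_sagemaker_env` is hypothetical, and the environment is simulated with a plain dictionary so the snippet also runs outside a training container.

```python
import os


def read_sagemaker_env(environ=os.environ):
    """Return the GPU count and a channel-name -> directory mapping (hypothetical helper)."""
    # SM_NUM_GPUS is a string in the environment; fall back to 0 when it is not set
    num_gpus = int(environ.get("SM_NUM_GPUS", "0"))
    prefix = "SM_CHANNEL_"
    channels = {
        name[len(prefix):].lower(): path
        for name, path in environ.items()
        if name.startswith(prefix)
    }
    return num_gpus, channels


# Simulated environment matching the `train` and `test` channels used in this demo
fake_env = {
    "SM_NUM_GPUS": "8",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
}
num_gpus, channels = read_sagemaker_env(fake_env)
print(num_gpus, channels["train"], channels["test"])
```

Inside a real training job you would call `read_sagemaker_env()` without arguments so that it reads `os.environ`.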


To run your training job locally, you can define `instance_type='local'`, or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


## Creating an Estimator and start a training job


In [None]:
from sagemaker.huggingface import HuggingFace

In [None]:
# hyperparameters passed into the training job (values for LLaMA)
hyperparameters={
  'training_dir': '/opt/ml/input/data/train',  # container path where SageMaker places the training dataset
  'test_dir': '/opt/ml/input/data/test',       # container path where SageMaker places the test dataset
  'num_train_epochs': 1,                       # number of training epochs
  'per_device_train_batch_size': 2,            # batch size for training
  'per_device_eval_batch_size': 2,             # batch size for evaluation
  'learning_rate': 1e-5,                       # learning rate used during training
  'gradient_accumulation_steps': 4,            # batches accumulated before each optimizer step
  'model_max_length': 1536                     # maximum sequence length in tokens
}

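# configuration for running training with the smdistributed model parallel library
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,             # one process per GPU; ml.p4d.24xlarge has 8 GPUs
}

smp_options = {
    "enabled":True,
    "parameters": {
        "pipeline_parallel_degree": 16,   # 16 pipeline stages across 2 instances x 8 GPUs
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 1,      # no tensor parallelism
        "partitions": 16,                 # kept equal to pipeline_parallel_degree
        "fp16": True,                     # train in half precision
        "ddp": True,
    }
}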

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

# instance configurations
instance_type='ml.p4d.24xlarge'
instance_count = 2
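
The parallelism settings above have to be consistent with the number of processes in the cluster. The sketch below checks this arithmetic under the assumption that the total number of processes equals the pipeline-parallel degree times the tensor-parallel degree times the data-parallel degree; the helper `data_parallel_degree` is hypothetical and not part of any SageMaker API.

```python
def data_parallel_degree(instance_count, processes_per_host,
                         pipeline_parallel_degree, tensor_parallel_degree):
    """Derive the data-parallel degree implied by a cluster layout (hypothetical helper)."""
    world_size = instance_count * processes_per_host
    model_parallel_size = pipeline_parallel_degree * tensor_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"pipeline x tensor degree {model_parallel_size}"
        )
    return world_size // model_parallel_size


# Values taken from the configuration in this notebook:
# 2 instances, 8 processes per host, pipeline degree 16, tensor degree 1
dp_degree = data_parallel_degree(2, 8, 16, 1)
print(dp_degree)
```

With these values all 16 processes are used by a single pipeline-parallel model replica, so the implied data-parallel degree is 1.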

In [None]:
# estimator
# S3 path where the trained model assets will be stored
# Note: replace this with your own S3 path
target_model_s3_path='s3://your_bucket/llama-smp-finetuned-052111/model/'

# S3 path of the source model before training
# Note: keep the wildcard character '*' at the end of this path; otherwise an error occurs.
source_model_s3_path = 's3://your_bucket/llama/pretrained/7B/model/*'

# CUDA_LAUNCH_BLOCKING=1 makes CUDA kernel launches synchronous (helps debugging, slows training)
environment = {'CUDA_LAUNCH_BLOCKING': '1',
               'SOURCE_MODEL_BEFORE_TRAINING_S3_PATH': source_model_s3_path,
               'TARGET_MODEL_AFTER_TRAINING_S3_PATH': target_model_s3_path}

from sagemaker.pytorch import PyTorch

'''
huggingface_estimator = HuggingFace(entry_point='train-llama-file-lock-for-HF-container.py',
                                    source_dir           = '.', 
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    role=role,
                                    transformers_version='4.17',
                                    pytorch_version='1.10',
                                    py_version='py38',
                                    distribution= distribution,
                                    hyperparameters = hyperparameters,
                                    environment = environment,
                                    debugger_hook_config=False)
'''

huggingface_estimator = PyTorch(entry_point='train-llama-for-pytorch-container.py',
                                source_dir           = '.', 
                                instance_type=instance_type,
                                instance_count=instance_count,
                                role=role,
                                framework_version='1.12.0',
                                py_version='py38',
                                distribution= distribution,
                                hyperparameters = hyperparameters,
                                environment = environment,
                                debugger_hook_config=False)


In [None]:
huggingface_estimator.hyperparameters()

In [None]:
# start the training job with our uploaded datasets as input
train_input_path = 's3://your_bucket/samples/datasets/1536-token-length-for-llama/train'
test_input_path = 's3://your_bucket/samples/datasets/1536-token-length-for-llama/test'
data = {
    'train': train_input_path,
    'test': test_input_path
}

huggingface_estimator.fit(data, wait=True)