# Fine-Tuning LLMs with LoRA and QLoRA in SM DDP and MDP Modes

In [None]:
# ST NOTES:

"""
MAIN OBJECTIVE: 
Implement LoRA and QLoRA fine-tuning of LLMs in the DDP and MDP distributed modes of SageMaker and test its efficiency
           
SECONDARY OBJECTIVE(s): 
- Test relative performance and applicability of p3 and g5 EC2 instances to LLM fine-tuning
- Test the impact of LEARNING RATE on training performance in distributed modes
- Test the ability to increase the Batch Size with LoRA and QLoRA in distributed modes


MOTIVATION:
The success of LLMs is rooted in the HUGE size of models trained on a HUGE number of examples on MANY GPUs over LONG periods of time, resulting in a HUGE TRAINING COST.
Thus, LLMs are beyond the reach of individual businesses due to technical and cost restrictions.
Recently, the research community came up with (x)LoRA approaches to train LLMs with reduced hardware requirements.
Today, a 65B parameter model can be fine-tuned on a single GPU with those techniques.
While we welcome this development, we see huge potential in combining (x)LoRA training with Data Parallelism for even faster training.
In such a scenario, LoRA is the enabler that makes industrial application of LLMs feasible (reasonable hardware requirements and budgets),
and Data Parallelism provides the training speedups that make fine-tuning of LLMs feasible for wide use by industry (reasonable time scales).
Moreover, we explore the feasibility of combining (x)LoRA approaches with Model Parallelism to enable efficient training of models bigger than 65B, which we expect to be of huge interest to the industry in the coming years.


KEY FINDINGs:
- Both LoRA and QLoRA can be implemented in DDP distributed mode in SageMaker with expected benefits in:
    - resource requirements
    - training speed
- QLoRA could not be run in MDP distributed mode: training freezes without any error message, so deeper troubleshooting is needed
- In Single GPU Mode, P3 instances train faster with LoRA and QLoRA (for this small model) despite less GPU memory (P3: 16GB vs. G5: 24GB)
- G5 instances could not be used in the distributed SageMaker modes (ml.g5.12xlarge is rejected by smdataparallel)


FINDINGs for SINGLE GPU:

- Except for (Q)LoRA, G5 instances train faster than P3 instances (93% of the P3 training time) with comparable fit measures
- The effect of LoRA and QLoRA differs between the EC2 instance types
    - For P3 instances:
        - LoRA trains faster than NO-LoRA (80% of the training time) with a small reduction in the fit measures (97% of NO-LoRA) (after 1 epoch!)
        - QLoRA trains even faster (67% of the NO-LoRA training time) but it also reduces the fit measures further (95% of NO-LoRA) (after 1 epoch!)
        - The above results meet expectations as:
            - It is expected that LoRA offers faster training due to the reduction in the number of trainable parameters
            - It is expected that QLoRA offers even faster training due to additional reduction in the memory footprint of the parameters
            - It is expected that LoRA offers lower fit measure per given amount of training as information is encoded into fewer parameters
            - It is expected that QLoRA offers even lower fit measure as 4-bit parameter encoding offers lower information capacity than higher bit widths
        - But the LoRA and QLoRA papers indicate that both methods should provide fit measures comparable to those offered by full training:
            - We assume that training over more epochs would provide comparable model performance to the fully trained models
            - Possibly such training might require more epochs than full training but given speed up efficiencies the overall cost might be still beneficial      
    - For G5 Instances:
        - The effects do not materialize as clearly and some are even reversed for QLoRA!
        - Training Speed:
            - LoRA trains faster (90% of the NO-LoRA training time) but QLoRA trains slower (102% of the NO-LoRA training time, 113% of the LoRA training time!)
                - We suspect that P3 instances benefit from hardware optimizations that allow efficient handling of 4-bit parameters on NVIDIA GPUs.
                - The issue might be related to a lack of proper libraries that handle communication with the GPU or to design differences between the GPU types - TBC...
                - Alternatively, the SMDDP API might have some G5-specific parameters that are not utilized in the code; TBC...
        - Training Fit Measures:
            - G5 instances offer performance comparable to that of the P3 instances


FINDINGs for 8xGPU Data Parallel Mode:

- Only P3 instances are available due to SageMaker environment limitations
- Training Speed:
    - Almost linear training-time speedup over Single GPU instances for all three (x)LoRA scenarios!!!
- Training Fit Measures:
    - Comparable to those achieved by NO-LoRA on a single GPU!!!
        - But LoRA and QLoRA needed a bump in the LR: the single-GPU LR times Num-of-GPUs, multiplied further by:
            - LoRA: 8
            - QLoRA: 16
        - Further exploration of those effects is needed, but it still indicates that (x)LoRA approaches can be used successfully in distributed settings.
- Impact of Batch Size:
    - NO LoRA works with a Batch Size of only 32
        - this is odd as it is the same batch size as the one used in the Single GPU mode
        - the parameter that controls the batch size is "per_device_train_batch_size".
            - Despite the name, the online documentation states that it is the global batch size.
            - Thus, it is expected that the distributed batch size should be Num_GPUs * SingleGPU Batch Size... TBC...
    - LoRA -- not tested
    - QLoRA works with a Batch Size of up to 128
        - This constitutes a 4-fold increase over NO-LoRA mode!
        - Fit is lower than desired but this should be addressable with fine-tuning of the LR and more training epochs.


FINDINGs for 8xGPU Model Parallel Mode:

- Only P3 instances are available due to SageMaker environment limitations
- QLoRA training failed!!!
    - Training "freezes" for an unknown reason
        - some additional tests were performed without any indication of the cause. TBC...
    - We expect that the combination of LoRA and MDP training will enable training of big-enough models for multiple applications before the need for QLoRA appears...
- Training Speed:
    - With the given model size, MDP is OVERKILL... thus we cannot properly test it.
    - But we notice that LoRA offers a speedup over NO-LoRA training (90% of the training time) similar to that on a Single GPU
    - The main consideration becomes the number of partitions and thus the level of data parallelism.
        - Reducing partitions from 4 to 2 results in an almost 4-fold speedup, which is attributable to:
            - Significant reduction in inter-GPU communications.
            - 2-fold increase in data parallelism
- Training Fit Measures:
    - We can achieve fit measures comparable to those achieved by SINGLE GPU training
    - But it is important to control the LEARNING RATE, which should be scaled relative to the Data Parallelism level
- Impact of Batch Size:
    - Both NO LoRA and LoRA work with Batch Sizes of 32 and 64 with 2 Partitions and they break with Batch Size 128 (OOM error)
        - For NO LoRA, it makes sense as 2 partitions means that each batch is spread across 2 GPUs and we can work with 4 unique batches in parallel
        - For LoRA, this is odd as the DDP test shows that it is possible to work with a Batch Size of 128. One would expect similar performance in this scenario as well

CAVEATS:
- testing training speed with only one epoch does not give the best performance estimate:
    - This is because the bulk of the time is spent on initialization of training and on evaluation.
    - It would be better to time an individual epoch (preferably not the first one) or even a series of epochs.
    - Still, the results above are directionally valid...
- Analyzing the "freezing" of QLoRA in MDP requires a deeper review beyond the standard error messages generated by Python. We will come back to it.
- Subsequent tests will be performed on larger models and more epochs to get better estimates of the effects of interest.
"""

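The notes above attribute the LoRA speedup to fewer trainable parameters and the QLoRA savings to 4-bit weights. The arithmetic behind both claims can be sketched as below. This is a back-of-the-envelope illustration, not the code in `train-LORA.py` (which is not shown in this notebook). The rank `r = 8`, the choice of two attention projections per layer as LoRA targets, and the DistilBERT-like shape (6 layers, hidden size 768) are assumptions made for the example; biases are ignored.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to store the weights, in GiB (no activations, no optimizer state)."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed setup (hypothetical): DistilBERT-like encoder, LoRA on two projections per layer.
hidden, layers, targets_per_layer, rank = 768, 6, 2, 8

# Full fine-tuning updates every entry of each targeted hidden x hidden matrix.
full = hidden * hidden * layers * targets_per_layer
# LoRA only trains the low-rank factors attached to those matrices.
lora = lora_trainable_params(hidden, hidden, rank) * layers * targets_per_layer

print(f"full-rank params in the targeted matrices: {full:,}")
print(f"LoRA params for the same matrices:         {lora:,} ({lora / full:.1%})")

# Weight storage for a 65B-parameter model at 16-bit versus 4-bit precision.
for bits in (16, 4):
    print(f"65B weights at {bits}-bit: {weight_memory_gib(65e9, bits):.1f} GiB")
```

The 4-bit figure covers weight storage only; activations, adapter gradients and optimizer state come on top of it.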
## Environment 

In [12]:
### REQUIRES INSTANCE WITH AT LEAST 8GB OF MEMORY
### You might need to restart the notebook!!!
!pip -q install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" "torch" --upgrade
!pip -q install accelerate

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM #, BitsAndBytesConfig

#import sys
#print(sys.version)
#!pip show datasets

Name: datasets
Version: 2.10.1
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /opt/conda/lib/python3.7/si

In [3]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::578864530451:role/service-role/AmazonSageMaker-ExecutionRole-20210306T201609
sagemaker bucket: sagemaker-us-east-1-578864530451
sagemaker session region: us-east-1


# Preprocessing

We are using the `datasets` library to download and preprocess the `imdb` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [imdb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset consists of 25000 training and 25000 testing highly polar movie reviews.

## Prepare Dataset

In [30]:
# DATASET USED
dataset_name = 'imdb'

if dataset_name == 'imdb':
    # tokenizer used in preprocessing
    tokenizer_name = 'distilbert-base-uncased'

    # s3 key prefix for the data
    s3_prefix = 'samples/datasets/imdb'

    # load dataset
    dataset = load_dataset(dataset_name)

    # download tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # tokenizer helper function
    def tokenize(batch):
        return tokenizer(batch['text'], padding='max_length', truncation=True)

    # load the train and test splits
    train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
    test_dataset = test_dataset.shuffle().select(range(10000)) # reduce the test dataset to 10k examples

    # tokenize dataset
    train_dataset = train_dataset.map(tokenize, batched=True)
    test_dataset = test_dataset.map(tokenize, batched=True)

    # set format for pytorch
    train_dataset =  train_dataset.rename_column("label", "labels")
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
    test_dataset = test_dataset.rename_column("label", "labels")
    test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

    #print("test_dataset[0]\n\n", test_dataset[0])

    # save train_dataset to s3
    training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
    train_dataset.save_to_disk(training_input_path)

    # save test_dataset to s3
    test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
    test_dataset.save_to_disk(test_input_path)

dataset, #dataset["train"]    




(DatasetDict({
     train: Dataset({
         features: ['text', 'label'],
         num_rows: 25000
     })
     test: Dataset({
         features: ['text', 'label'],
         num_rows: 25000
     })
     unsupervised: Dataset({
         features: ['text', 'label'],
         num_rows: 50000
     })
 }),)
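The `tokenize` helper above calls the tokenizer with `padding='max_length', truncation=True`, so every review ends up with the same length. A toy sketch of that behaviour follows. The token IDs, `pad_id = 0` and the function name are made up for illustration. A real tokenizer also keeps special tokens (such as the closing separator) when it truncates, which this simplified version does not.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, or right-pad it with pad_id up to max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Padding everything to the maximum length keeps tensor shapes fixed, at the cost of spending compute on padding for short reviews.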

## SINGLE GPU

In [22]:
from sagemaker.huggingface import HuggingFace
# hyperparameters, which are passed into the training job

if dataset_name == 'imdb':
    hyperparameters={'epochs': 1,
                     'train_batch_size': 32,
                     'model_name':'distilbert-base-uncased',
                     "eval_batch_size": 256, # InExample: 64
                     "learning_rate": 5e-5 * 1 # InExample: 5e-5
                    }

#entry_point = 'train.py' # NO LORA
#entry_point = 'train-LORA.py'
entry_point = "train-QLORA.py"

huggingface_estimator = HuggingFace(entry_point=entry_point,
                                    source_dir='./',
                                    instance_type='ml.g5.4xlarge', #'ml.p3.2xlarge', 'ml.g5.4xlarge'-- for 64GB of Memory
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.26',
                                    pytorch_version='1.13',
                                    py_version='py39', 
                                    hyperparameters = hyperparameters)

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 
                           'test': test_input_path})

### SINGLE GPU RESULTS ANALYSIS

# MODEL: distilbert-base-uncased
# DATASET: imdb
# TASK: BinarySequenceClassification



### NO LORA -- BENCHMARK

## INSTANCE_TYPE: ml.p3.2xlarge

# EXEC TIME: 411.8172905445099 -- train & first evaluation time

# 'eval_accuracy': 0.928, 
# 'eval_f1': 0.9274924471299093, 
# 'eval_precision': 0.9373091797272542, 
# 'eval_recall': 0.9178792106836755,

## INSTANCE_TYPE: ml.g5.4xlarge

## EXEC TIME: 381.3633711338043

# 'eval_loss': 0.18705244362354279, 
# 'eval_accuracy': 0.9284, 
# 'eval_f1': 0.9277205733898648, 
# 'eval_precision': 0.9381380155165374, 
# 'eval_recall': 0.9175319488817891, 



### WITH LoRA -- as expected: FASTER ALTHOUGH SLIGHTLY WEAKER FIT

## INSTANCE_TYPE: ml.p3.2xlarge

# EXEC TIME: 331.6438286304474

# 'eval_accuracy': 0.9055, 
# 'eval_f1': 0.9063893016344725, 
# 'eval_precision': 0.9009452540370224, 
# 'eval_recall': 0.9118995415587005

## INSTANCE_TYPE: ml.g5.4xlarge

# EXEC TIME: 346.6780774593353

# 'eval_loss': 0.27975377440452576, 
# 'eval_accuracy': 0.8929, 
# 'eval_f1': 0.893294809205938, 
# 'eval_precision': 0.8914297076953669, 
# 'eval_recall': 0.895167731629393,



### WITH QLoRA -- as expected: EVEN WEAKER, BUT SLIGHTLY SLOWER!?!?!?!

## INSTANCE_TYPE: ml.p3.2xlarge

# EXEC TIME: 275.81963658332825 -- MUCH FASTER?!?!?!?

# 'eval_loss': 0.3110389709472656, 
# 'eval_accuracy': 0.8835, 
# 'eval_f1': 0.8835582208895553, 
# 'eval_precision': 0.8845307184310587, 
# 'eval_recall': 0.8825878594249201

## INSTANCE_TYPE: ml.g5.4xlarge

# EXEC TIME: 392.0947148799896

# 'eval_loss': 0.29552170634269714, 
# 'eval_accuracy': 0.8848, 
# 'eval_f1': 0.8866141732283463, 
# 'eval_precision': 0.8742236024844721, 
# 'eval_recall': 0.8993610223642172,
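The relative speed and fit figures quoted in the notes can be recomputed from the numbers listed above. The sketch below uses only the reported EXEC TIME and `eval_accuracy` values; the dictionary layout and the function name are illustrative. The resulting ratios land close to the percentages in the notes.

```python
# (EXEC TIME in seconds, eval_accuracy) copied from the single-GPU results above.
runs = {
    ("p3", "no_lora"): (411.8172905445099, 0.928),
    ("p3", "lora"): (331.6438286304474, 0.9055),
    ("p3", "qlora"): (275.81963658332825, 0.8835),
    ("g5", "no_lora"): (381.3633711338043, 0.9284),
    ("g5", "lora"): (346.6780774593353, 0.8929),
    ("g5", "qlora"): (392.0947148799896, 0.8848),
}


def relative_to_baseline(instance: str, method: str):
    """Return (time ratio, accuracy ratio) versus NO-LoRA on the same instance type."""
    t, acc = runs[(instance, method)]
    t0, acc0 = runs[(instance, "no_lora")]
    return t / t0, acc / acc0


for instance in ("p3", "g5"):
    for method in ("lora", "qlora"):
        t_ratio, acc_ratio = relative_to_baseline(instance, method)
        print(f"{instance} {method:5s}: time {t_ratio:.0%} of NO-LoRA, accuracy {acc_ratio:.0%} of NO-LoRA")
```

A time ratio above 100% (as for QLoRA on G5) means the run was slower than the NO-LoRA baseline.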

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-09-22-12-36-20-432


2023-09-22 12:37:01 Starting - Starting the training job...
2023-09-22 12:37:17 Starting - Preparing the instances for training......
2023-09-22 12:38:27 Downloading - Downloading input data...
2023-09-22 12:38:53 Training - Downloading the training image...............
2023-09-22 12:41:24 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-22 12:42:10,410 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-22 12:42:10,427 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-22 12:42:10,437 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-22 12:42:10,439 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-09-22 12:42:15,941

2ND RUN
{'eval_loss': 0.30123773217201233, 'eval_accuracy': 0.883, 'eval_f1': 0.884318766066838, 'eval_precision': 0.8758323540932237, 'eval_recall': 0.8929712460063898, 'eval_runtime': 38.5191, 'eval_samples_per_second': 259.612, 'eval_steps_per_second': 1.038, 'epoch': 1.0}
100%|██████████| 782/782 [06:31<00:00,  2.85it/s]
100%|██████████| 40/40 [00:37<00:00,  1.04it/s]
{'train_runtime': 391.7503, 'train_samples_per_second': 63.816, 'train_steps_per_second': 1.996, 'train_loss': 0.5238332626459848, 'epoch': 1.0}
100%|██████████| 782/782 [06:31<00:00,  2.85it/s]
100%|██████████| 782/782 [06:31<00:00,  2.00it/s]
EXEC TIME: 391.8400411605835

In [45]:
#####################################################################
######################## DATA DISTRIBUTED ###########################
#####################################################################

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job

instance_type = 'ml.p3.16xlarge' #'ml.p3.16xlarge', 'ml.g5.12xlarge'
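partitions = 2 # divides num_GPUs in the LR formula below; with pure data parallelism the DP degree is num_GPUs (i.e. partitions = 1)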
Adj_BatchSize = 4
LR_Adj_LoRA = 8

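if instance_type == 'ml.g5.12xlarge': #ValueError: Provided instance_type ml.g5.12xlarge is not supported by smdataparallel.
    num_GPUs = 4
elif instance_type == 'ml.p3.16xlarge':
    num_GPUs = 8
else:
    # fail early instead of hitting a NameError on num_GPUs further down
    raise ValueError(f"GPU count not defined for instance type {instance_type}")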

if dataset_name == 'imdb':
    hyperparameters={'epochs': 1,
                     'train_batch_size': 32 * Adj_BatchSize, # 32
                     'model_name':'distilbert-base-uncased',
                     "eval_batch_size": 256, # InExample: 64
                     "learning_rate": 5e-5 * (num_GPUs/partitions) * Adj_BatchSize * LR_Adj_LoRA # LR for DATA PARALLEL training is LR_Single_GPU * #OfGPUs * LoRA Multiplier!!!
                    }

# configuration for running training on smdistributed data parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True,
                                                   "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION",
                                                   "parameters": {"ddp": True}}}}

#entry_point = 'train.py' # NO LoRA!
#entry_point = 'train-LORA.py'
entry_point = 'train-QLORA.py'

huggingface_estimator_DDP = HuggingFace(entry_point=entry_point,
                            source_dir='./',
                            instance_type=instance_type, #'ml.p3.16xlarge', 'ml.g5.12xlarge'
                            instance_count=1,
                            role=role,
                            transformers_version='4.26.0', # InExample: '4.26' with Error about the name
                            pytorch_version='1.13.1', # InExample: '1.13' with Error about the version
                            py_version='py39',
                            hyperparameters = hyperparameters,
                            distribution = distribution)

# starting the train job with our uploaded datasets as input
huggingface_estimator_DDP.fit({'train': training_input_path, 'test': test_input_path})



### INSTANCE_TYPE: ml.p3.16xlarge -- ALL MODELs as G5 Instances are not supported by Distributed SageMaker!!!



### NO LORA -- BENCHMARK

## 8 GPU ANALYSIS, LR_8_GPU == LR_Single_GPU -- LOWER FIT MEASURES!!!

# EXEC TIME: 56.322752952575684 - TRAIN & FIRST EVAL -- almost linear speedup!!!

# 'eval_accuracy': 0.897,
# 'eval_f1': 0.8962321176707637, 
# 'eval_precision': 0.9060908535343247, 
# 'eval_recall': 0.8865856089296392

## 8 GPU ANALYSIS, LR_8_GPU == LR_Single_GPU * 8 -- HIGHER FIT MEASURES!!! but still not as high as on single GPU -- TBC...

# EXEC TIME: 56.322752952575684 - TRAIN & FIRST EVAL -- almost linear speedup!!!

# 'eval_accuracy': 0.917, 
# 'eval_f1': 0.9147493837304849, 
# 'eval_precision': 0.9436321254503073, # HIGHER THAN on SINGLE_GPU!!!
# 'eval_recall': 0.8875822204504684

## 8 GPU ANALYSIS, LR_8_GPU == LR_Single_GPU * 16 -- training breaks down!!!

# EXEC TIME: 56.03306198120117

# 'eval_loss': 0.5477789044380188, 
# 'eval_accuracy': 0.7475, 
# 'eval_f1': 0.6715233511122676, 
# 'eval_precision': 0.9634191862635312, 
# 'eval_recall': 0.5153753993610224,


## IMPACT OF BATCH_SIZE (Standard BS = 32):

# BS = 64 -- OOM Error!


### WITH LORA

## LR_8_GPU == LR_Single_GPU * 8

# EXEC TIME: 44.2592236995697 -- comparable times for other runs

# 'eval_accuracy': 0.688, 
# 'eval_f1': 0.6741854636591478, 
# 'eval_precision': 0.7080500109673173, 
# 'eval_recall': 0.6434123978473191,

## LR_8_GPU == LR_Single_GPU * 2

# 'eval_accuracy': 0.5162, 
# 'eval_f1': 0.11779722830051056, 
# 'eval_precision': 0.6916488222698073, 
# 'eval_recall': 0.06438110424556508,

## LR_8_GPU == LR_Single_GPU * 8 * 2

# 'eval_accuracy': 0.8371,
# 'eval_f1': 0.8364622025901014, 
# 'eval_precision': 0.8426375404530745, 
# 'eval_recall': 0.8303767191548734,

## LR_8_GPU == LR_Single_GPU * 8 * 4

# 'eval_accuracy': 0.8819, 
# 'eval_f1': 0.8771199667048174, 
# 'eval_precision': 0.9175010883761427, 
# 'eval_recall': 0.8401435120589994,

## LR_8_GPU == LR_Single_GPU * 8 * 8 --- BEST AND MOST BALANCED!!!

# 'eval_accuracy': 0.899, 
# 'eval_f1': 0.9017127286882055, 
# 'eval_precision': 0.8809659631108576, 
# 'eval_recall': 0.923460235200319,

## LR_8_GPU == LR_Single_GPU * 8 * 16

# 'eval_accuracy': 0.9001, 
# 'eval_f1': 0.896358543417367, 
# 'eval_precision': 0.9346603202077023, 
# 'eval_recall': 0.8610723539964122,



### WITH QLORA

## LR_8_GPU == LR_Single_GPU * 8 * 16  --- BEST: GOOD AND MOST BALANCED!!!

# EXEC TIME: 36.523969411849976 -- comparable times for other runs

# 'eval_loss': 0.24202285706996918, 
# 'eval_accuracy': 0.9067, 
# 'eval_f1': 0.9062028752387654, 
# 'eval_precision': 0.9079371474617244, 
# 'eval_recall': 0.9044752157334939

### LR_8_GPU == LR_Single_GPU * 8 * 8

# 'eval_loss': 0.25302860140800476, 
# 'eval_accuracy': 0.9028, 
# 'eval_f1': 0.9036860879904877, 
# 'eval_precision': 0.8925425719318849, 
# 'eval_recall': 0.9151113786875377,

### LR_8_GPU == LR_Single_GPU * 8 * 32

# 'eval_loss': 0.24021868407726288, 
# 'eval_accuracy': 0.9076, 
# 'eval_f1': 0.9105691056910569, 
# 'eval_precision': 0.8794167134043747, 
# 'eval_recall': 0.9440096327513546,

### LR_8_GPU == LR_Single_GPU * 8 * 16  WITH GRADIENT CHECKPOINTING

# EXEC TIME: 36.55616497993469 -- No increase in execution time !?!?!?

# 'eval_loss': 0.24579724669456482, 
# 'eval_accuracy': 0.9045, 
# 'eval_f1': 0.9036229690180644, 
# 'eval_precision': 0.9088509947218839, 
# 'eval_recall': 0.8984547461368654,


## IMPACT OF BATCH_SIZE (Standard BS = 32):

# BS = 64 

#[1,mpirank:0,algo-1]<stderr>:#015100%|██████████| 49/49 [00:31<00:00,  1.66it/s]

# EXEC TIME: 35.3283314704895

# 'eval_loss': 0.6055842041969299, 
# 'eval_accuracy': 0.809, 
# 'eval_f1': 0.829006266786034, 
# 'eval_precision': 0.7538261152718984, 
# 'eval_recall': 0.9208432776451869,


# BS = 128

# [1,mpirank:0,algo-1]<stderr>:#015100%|██████████| 25/25 [00:31<00:00,  1.18s/it]

# EXEC TIME: 35.87735557556152

# 'eval_loss': 0.674795389175415, 
# 'eval_accuracy': 0.7293, 
# 'eval_f1': 0.6782360632354689, 
# 'eval_precision': 0.8428360413589365, 
# 'eval_recall': 0.5674224343675418


# BS = 256

# OOM Error!


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-09-22-19-33-44-463


2023-09-22 19:34:26 Starting - Starting the training job......
2023-09-22 19:35:12 Starting - Preparing the instances for training.........
2023-09-22 19:36:39 Downloading - Downloading input data...
2023-09-22 19:37:04 Training - Downloading the training image..................
2023-09-22 19:40:00 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-22 19:40:31,921 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-22 19:40:31,982 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-22 19:40:31,994 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-22 19:40:31,996 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2023-09-22 19:40:31,997 s

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2023-09-22-19-33-44-463: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB (GPU 2; 15.77 GiB total capacity; 9.97 GiB already allocated; 2.54 GiB free; 12.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/
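Both distributed cells in this notebook compute the learning rate with the same expression: the single-GPU LR times `num_GPUs / partitions`, times the batch-size factor, times a LoRA/QLoRA multiplier. A small helper makes the data-parallel degree explicit. The function and argument names below are hypothetical, and the scaling rule is the empirical one from these notes rather than an established recipe.

```python
def data_parallel_degree(num_gpus: int, partitions: int = 1) -> int:
    """Number of model replicas: GPUs divided by model-parallel partitions."""
    if partitions < 1 or num_gpus % partitions != 0:
        raise ValueError("num_gpus must be a positive multiple of partitions")
    return num_gpus // partitions


def scaled_learning_rate(base_lr: float, num_gpus: int, partitions: int = 1,
                         batch_size_factor: int = 1, adapter_multiplier: int = 1) -> float:
    """Single-GPU LR scaled by DP degree, batch-size factor and the (Q)LoRA multiplier."""
    return base_lr * data_parallel_degree(num_gpus, partitions) * batch_size_factor * adapter_multiplier


# QLoRA, 8-way data parallel, multiplier 16 (the best DDP setting reported above)
print(scaled_learning_rate(5e-5, num_gpus=8, adapter_multiplier=16))
# LoRA, model parallel with 2 partitions (DP degree 4), multiplier 8
print(scaled_learning_rate(5e-5, num_gpus=8, partitions=2, adapter_multiplier=8))
```

Deriving the DP degree from the same `partitions` value that is passed to the model-parallel configuration avoids scaling the LR for one partition count while training with another.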

In [49]:
#####################################################################
######################### MODEL PARALLEL ############################
#####################################################################

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job

instance_type = 'ml.p3.16xlarge' #'ml.p3.16xlarge', 'ml.g5.12xlarge'
partitions = 4
Adj_BatchSize = 4
LR_Adj_LoRA = 8

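if instance_type == 'ml.g5.12xlarge': #ValueError: Provided instance_type ml.g5.12xlarge is not supported by smdataparallel.
    num_GPUs = 4
elif instance_type == 'ml.p3.16xlarge':
    num_GPUs = 8
else:
    # fail early instead of hitting a NameError on num_GPUs further down
    raise ValueError(f"GPU count not defined for instance type {instance_type}")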

# hyperparameters, which are passed into the training job
if dataset_name == 'imdb':
    hyperparameters={'epochs': 1,
                     'train_batch_size': 32 * Adj_BatchSize, # 32
                     'model_name':'distilbert-base-uncased',
                     "eval_batch_size": 256, # InExample: 64
                     "learning_rate": 5e-5 * (num_GPUs/partitions) * Adj_BatchSize * LR_Adj_LoRA # LR for DATA PARALLEL training is LR_Single_GPU * #OfGPUs * LoRA Multiplier!!!
                    }

# configuration for running training on smdistributed model parallel
mpi_options = {"enabled" : True,
               "processes_per_host" : 8}

#""" # InExample: -- but I had to change instance type and it results in GPU MEM ERROR

smp_options = {"enabled":True,
               "parameters": {"microbatches": 1, # InExample: 4 but changed to address the memory error
                              "placement_strategy": "spread",
                              "pipeline": "interleaved",
                              "optimize": "speed",
                              "partitions": 2, # InExample: 4 but changed to address the memory error; NOTE: differs from `partitions = 4` above, which the LR formula uses
                              "ddp": True,}}
#"""

distribution={"smdistributed": {"modelparallel": smp_options},
              "mpi": mpi_options}

#entry_point = 'train.py'
entry_point = 'train-LORA.py'
#entry_point = 'train-QLORA.py'

# create the Estimator
huggingface_estimator_DMP = HuggingFace(entry_point=entry_point,
                                    source_dir='./',
                                    instance_type=instance_type,
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.26.0',
                                    pytorch_version='1.13.1',
                                    py_version='py39',
                                    hyperparameters = hyperparameters,
                                    distribution = distribution)

# starting the train job with our uploaded datasets as input
huggingface_estimator_DMP.fit({'train': training_input_path, 'test': test_input_path})


### INSTANCE_TYPE: ml.p3.16xlarge -- ALL MODELs as G5 Instances are not supported by Distributed SageMaker!!!



## NO LORA

###### "microbatches": 4, "partitions": 4,
# MEM ERROR

###### "microbatches": 1, "partitions": 4,
# TRAIN EXEC TIME: 405.6870391368866

### LR = 8 * LR_Single_GPU -- very weak possibly because DP is 2 not 8 due to 4 Partitions (PP)...
# 'eval_accuracy': 0.8664, 
# 'eval_f1': 0.8531545394592219, 
# 'eval_precision': 0.9509924038225925, 
# 'eval_recall': 0.7735698624676102,

### LR = 2 * LR_Single_GPU -- DECENT PERFORMANCE; 2x as DP is 2 due to 4 Partitions (PP)!!!
# 'eval_accuracy': 0.9086, 
# 'eval_f1': 0.9041325781413887, 
# 'eval_precision': 0.9541731237547044, # VERY HIGH!!! 
# 'eval_recall': 0.8590791309547539,


###### "microbatches": 2, "partitions": 4,
# TRAIN EXEC TIME: 647.028391122818


### IMPACT OF BATCH_SIZE (Standard BS = 32):

## BS = 32 -- LR = LR_Single_GPU * 4 * 1 * 8 -- # 4x as DP is 4 due to 2 Partitions (PP) * 1 Adj_BatchSize * 8 as LoRA Adjustment

# [1,mpirank:0,algo-1]<stderr>:#015100%|██████████| 196/196 [01:46<00:00,  1.89it/s]
    
# EXEC TIME: 121.97890901565552

# 'eval_loss': 0.29470083117485046, 
# 'eval_accuracy': 0.8781, 
# 'eval_f1': 0.8800551018400079, 
# 'eval_precision': 0.8708860759493671, 
# 'eval_recall': 0.8894192521877486,
 

## BS = 64 -- LR = LR_Single_GPU * 4 * 2 * 8 -- # 4x as DP is 4 due to 2 Partitions (PP) * 2 Adj_BatchSize * 8 as LoRA Adjustment

# [1,mpirank:0,algo-1]<stderr>:#015 90%|████████▉ | 88/98 [01:31<00:09,  1.01it/s]
    
# EXEC TIME: 116.97890901565552

# 'eval_loss': 0.26501214504241943, 
# 'eval_accuracy': 0.8889, 
# 'eval_f1': 0.8889999000899191, 
# 'eval_precision': 0.8931941377233488, 
# 'eval_recall': 0.8848448687350835,

# BS = 256, 128 -- OOM Error!


## BS = 64 -- LR = LR_Single_GPU * 2 * 2 * 8 -- # 2x as DP is 2 due to 4 Partitions (PP) * 2 Adj_BatchSize * 8 as LoRA Adjustment

# [1,mpirank:0,algo-1]<stderr>:100%|██████████| 98/98 [01:57<00:00,  1.01it/s]

# EXEC TIME: 117.58248496055603

# 'eval_loss': 0.2440081536769867, 
# 'eval_accuracy': 0.8955, 
# 'eval_f1': 0.8903807825448442, 
# 'eval_precision': 0.9420643729189789, 
# 'eval_recall': 0.8440731901352426,
 
## BS = 128 -- LR = LR_Single_GPU * 2 * 2 * 8 -- # 2x as DP is 2 due to 4 Partitions (PP) * 2 Adj_BatchSize * 8 as LoRA Adjustment

# OOM Error!



### WITH LORA

###### "microbatches": 1, "partitions": 4,
### LR = LR_Single_GPU * 2 * 8 -- # 2x as DP is 2 due to 4 Partitions (PP); * 8 as LoRA Adjustment

# EXEC TIME: 364.84994435310364

# 'eval_accuracy': 0.9117, 
# 'eval_f1': 0.910418991579588, 
# 'eval_precision': 0.9270661157024793, 
# 'eval_recall': 0.8943591787921068

### LR = LR_Single_GPU * 2 * 16 -- # 2x as DP is 2 due to 4 Partitions (PP); * 16 as LoRA Adjustment
# 'eval_accuracy': 0.9162, 
# 'eval_f1': 0.9155412215279178, 
# 'eval_precision': 0.9259938837920489, 
# 'eval_recall': 0.9053219055212278,

### LR = LR_Single_GPU * 2 * 32 -- # 2x as DP is 2 due to 4 Partitions (PP); * 32 as LoRA Adjustment
# 'eval_accuracy': 0.8947, 
# 'eval_f1': 0.8900490759110369, 
# 'eval_precision': 0.9346491228070175, 
# 'eval_recall': 0.8495116603547938

###### "microbatches": 1, "partitions": 2,
### LR = LR_Single_GPU * 2 * 8 -- # 2x was a mistake -- should be 4x as DP is 4 due to 2 Partitions (PP) -- * 8 as LoRA Adjustment

# EXEC TIME: 102.41368317604065

# 'eval_accuracy': 0.9014, 
# 'eval_f1': 0.8990994678673762, 
# 'eval_precision': 0.923869610935857, 
# 'eval_recall': 0.8756228822005182,

### LR = LR_Single_GPU * 4 * 8 -- # 4x as DP is 4 due to 2 Partitions (PP) * 8 as LoRA Adjustment - BEST, MOST BALANCED, FASTEST!!!

# EXEC TIME: 102.41368317604065

# 'eval_accuracy': 0.9119, 
# 'eval_f1': 0.9117499749574276, 
# 'eval_precision': 0.9164317358034636, 
# 'eval_recall': 0.9071158062587203

### LR = LR_Single_GPU * 4 * 16 -- # 4x as DP is 4 due to 2 Partitions (PP) * 16 as LoRA Adjustment
# 'eval_accuracy': 0.9144, 
# 'eval_f1': 0.9131493506493507, 
# 'eval_precision': 0.9299442033477991, 
# 'eval_recall': 0.8969503687462627,



### IMPACT OF BATCH_SIZE (Standard BS = 32):
# Assumption that increase of Batch_Size requires proportional increase in LR
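
# A minimal sketch of the LR scaling rule used throughout these notes: LR = LR_Single_GPU * DP * Adj_BatchSize * LoRA adjustment,
# with DP = (GPUs per instance) / partitions. It assumes the 8 GPUs of one ml.p3.16xlarge and the standard BS of 32.
# The helper names and the base LR of 1e-5 are illustrative assumptions, not part of the training script.

```python
import math

NUM_GPUS = 8          # GPUs on one ml.p3.16xlarge instance
BASE_BATCH_SIZE = 32  # "Standard BS" in these notes


def data_parallel_degree(partitions, num_gpus=NUM_GPUS):
    """DP degree left over once each model replica spans `partitions` GPUs."""
    if num_gpus % partitions != 0:
        raise ValueError("partitions must divide the number of GPUs")
    return num_gpus // partitions


def scaled_learning_rate(lr_single_gpu, partitions,
                         batch_size=BASE_BATCH_SIZE, lora_factor=1):
    """LR = LR_Single_GPU * DP * Adj_BatchSize * LoRA adjustment."""
    dp = data_parallel_degree(partitions)
    batch_adj = batch_size / BASE_BATCH_SIZE  # proportional-increase assumption
    return lr_single_gpu * dp * batch_adj * lora_factor


LR_SINGLE_GPU = 1e-5  # placeholder value, assumed for illustration only

# 2 partitions -> DP = 4; BS = 64 -> Adj_BatchSize = 2; LoRA adjustment = 8
lr = scaled_learning_rate(LR_SINGLE_GPU, partitions=2, batch_size=64, lora_factor=8)
print(f"DP = {data_parallel_degree(2)}, LR = {lr:.2e}")
```

# Keeping the three factors separate makes it easy to rerun a configuration with only one of them changed.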

## BS = 64 -- LR = LR_Single_GPU * 4 * 1 * 8 -- # 4x as DP is 4 due to 2 Partitions (PP) * 1 Adj_BatchSize * 8 as LoRA Adjustment

# EXEC TIME: 99.77318120002747

# 'eval_loss': 0.30923572182655334, 
# 'eval_accuracy': 0.8839, 
# 'eval_f1': 0.8823112012164217, 
# 'eval_precision': 0.8960263537162858,
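
# The run above was logged without 'eval_recall'. Since F1 is the harmonic mean of precision and recall,
# the missing value can be recovered from the two logged metrics. A minimal sketch; the helper name
# `recall_from_f1` is hypothetical and not part of the training script.

```python
def recall_from_f1(f1, precision):
    """Invert F1 = 2*P*R / (P + R) to obtain R = F1*P / (2*P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("inconsistent F1/precision pair")
    return f1 * precision / denom


# Sanity check against a run in these notes where all three metrics were logged
f1, precision, recall = 0.9001572327044025, 0.8896658896658897, 0.9108989657915673
assert abs(recall_from_f1(f1, precision) - recall) < 1e-6

# The run that was logged without eval_recall
missing = recall_from_f1(0.8823112012164217, 0.8960263537162858)
print(f"recovered eval_recall: {missing:.4f}")
```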

## BS = 64 -- LR = LR_Single_GPU * 4 * 2 * 8 -- # 4x as DP is 4 due to 2 Partitions (PP) * 2 Adj_BatchSize * 8 as LoRA Adjustment

# EXEC TIME: 99.77318120002747

# 'eval_loss': 0.25893861055374146,
# 'eval_accuracy': 0.8984,
# 'eval_f1': 0.9001572327044025,
# 'eval_precision': 0.8896658896658897,
# 'eval_recall': 0.9108989657915673,

# BS = 256, 128 -- OOM Error!

## BS = 128 -- LR = LR_Single_GPU * 2 * 2 * 8 -- # 2x as DP is 2 due to 4 Partitions (PP) * 2 Adj_BatchSize * 8 as LoRA Adjustment

# OOM Error!



### WITH QLoRA -- FAILED EXECUTION!!!

# QLoRA with SMDMP fails for an unknown reason, producing a series of error statements not seen in the SMDDP mode.
# SMDMP mode is expected to enable fine-tuning for a wide variety of applications, so the issue will be explored further later.
# TBC...


# ERRORs:

# Execution Freezes on:
# Process OMPI jobid: [41117,1] App: 0 Process rank: 0 Bound: N/A -- IS N/A an issue?

# SMDDPCollectivesInitWarning: The system is not compatible or not configured to run SMDDP collectives optimized for AWS infrastructure. 
# The training job will fall back to NCCL.

# torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. 
# In order to use Torch DDP, launch your script with `python -m torch.distributed.launch`

# W smdistributed/modelparallel/torch/optimizers/optimizer.py:111] parameter base_model.model.classifier.modules_to_save.default.bias 
# is missing when loading optimizer's state_dict, skip.


# SUBSEQUENTLY, the training starts but freezes...
# [1,mpirank:0,algo-1]<stderr>:  0%|          | 0/98 [00:00<?, ?it/s]


# SOLUTION ATTEMPTS:

# To test whether it is a memory error, we tried different settings of the "partitions" parameter.

# Since the LoRA script worked in SMDMP mode, we created a QLORA-MDP.py script
# in which we tested rolling back the modifications to LORA.py that were introduced in QLORA.py.



INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-09-22-20-27-32-127


2023-09-22 20:28:13 Starting - Starting the training job......
2023-09-22 20:29:12 Starting - Preparing the instances for training.........
2023-09-22 20:30:28 Downloading - Downloading input data...
2023-09-22 20:30:53 Training - Downloading the training image..................
2023-09-22 20:33:59 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-22 20:34:31,064 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-22 20:34:31,126 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-22 20:34:31,139 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-22 20:34:31,142 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-09-22 20:34:

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2023-09-22-20-27-32-127: Failed. Reason: AlgorithmError: Framework Error: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/trainer.py", line 88, in train
    entrypoint()
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 153, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 100, in train
    entry_point.run(uri=training_environment.module_dir,
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/entry_point.py", line 99, in run
    return runner.get(runner_type, user_entry_point, args, env_vars, extra_opts).run(
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/mpi.py", line 398, in run
    process_spawned = process.check_error(
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/process.py", line 335, in check_error
    raise error_class(
TypeError: SMTrainingCompilerConfigurationError() takes no keyword arguments

SMTrain