Source: https://github.com/philschmid/huggingface-llama-2-samples/blob/master/training/sagemaker-notebook.ipynb

# Fine-tune LLaMA 2 on Amazon SageMaker

In this SageMaker example, we are going to fine-tune [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) using [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314). [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) is the next version of [LLaMA](https://arxiv.org/abs/2302.13971). Compared to the V1 model, it is trained on more data (2T tokens) and supports a context window of up to 4K tokens. Learn more about LLaMA 2 in the [""]() blog post.

QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
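To make the savings from low-rank adapters concrete, here is a back-of-the-envelope sketch that counts trainable parameters for a single weight matrix. The matrix shape (5120 x 5120) and the rank (64) are illustrative assumptions, not values taken from this notebook's training script.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative shape and rank (assumptions, not read from the training script)
d_out, d_in, rank = 5120, 5120, 64

full = d_out * d_in                               # parameters updated by full fine-tuning
lora = lora_trainable_params(d_out, d_in, rank)   # parameters updated by the adapter
fraction = lora / full

print(f"full: {full:,}  adapter: {lora:,}  fraction: {fraction:.2%}")
```

Under these assumptions the adapter trains only a few percent of the parameters of the matrix it is attached to, while the original matrix stays frozen.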

In our example, we are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 

Outline of the notebook:
1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker
4. Deploy model: We deploy the model in a separate notebook, [link](https://d-zxgsdsdm6wdj.studio.us-east-1.sagemaker.aws/jupyter/default/files/huggingface-llama-2-samples/training/deploy_llama.ipynb?_xsrf=2%7C1713465d%7Cf7c757f2602e3ee07c19d53612a09c11%7C1690993444).

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- (Q)LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
- IA3: [Infused Adapter by Inhibiting and Amplifying Inner Activations](https://arxiv.org/abs/2205.05638)



### Access LLaMA 2

Before we can start training, we have to make sure that we have accepted the license of [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) to be able to use it. You can accept the license by clicking the Agree and access repository button on the model page at:
* [LLaMa 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf)
* [LLaMa 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf)
* [LLaMa 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf)

## 1. Setup Development Environment

In [2]:
!pip install --upgrade pip


In [3]:
!pip install "transformers==4.30.2" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet


In [4]:
pip install textstat

Note: you may need to restart the kernel to use updated packages.


#### Huggingface token to access Llama-2 model weights

Before you execute the next line, make sure that you have saved your Hugging Face token as an environment variable.
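As a minimal sketch of that check, the helper below reads the token from the environment and fails early with a clear message if it is missing. The variable name `HF_TOKEN` and the helper `require_hf_token` are assumptions made for illustration; use whatever name you stored the token under.

```python
import os


def require_hf_token(env=os.environ, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is unset."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before logging in.")
    return token


# Demo with an explicit mapping; omit the argument to read os.environ instead.
token = require_hf_token({"HF_TOKEN": "hf_example"})
```

Reading the token this way keeps it out of the notebook source and out of saved cell outputs.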


In [5]:
!huggingface-cli login --token "$HF_TOKEN"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [6]:
# Use this code snippet in your app.
# If you need more information about configurations
# or implementing the sample code, visit the AWS docs:
# https://aws.amazon.com/developer/language/python/
import json
import boto3
from botocore.exceptions import ClientError


def get_secret(secret_name):
    region_name = "us-east-2"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        # For a list of exceptions thrown, see
        # https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
        raise e

    # Decrypts secret using the associated KMS key.
    secret = json.loads(get_secret_value_response['SecretString'])
    
    return secret

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [7]:
import sagemaker
import boto3
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
SESS = sagemaker.Session()
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and SESS is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = SESS.default_bucket()

try:
    ROLE = sagemaker.get_execution_role()
except ValueError:
    IAM = boto3.client('iam')
    ROLE = IAM.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

SESS = sagemaker.Session(default_bucket=sagemaker_session_bucket)
S3_CLIENT = boto3.client('s3')

# Get hugging_face token
#secret_name = "hf_token"
#HF_TOKEN = get_secret(secret_name)['HF_TOKEN']

print(f"sagemaker role arn: {ROLE}")
print(f"sagemaker bucket: {SESS.default_bucket()}")
print(f"sagemaker session region: {SESS.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::275461957965:role/service-role/AmazonSageMaker-ExecutionRole-20230627T145146
sagemaker bucket: sagemaker-us-east-1-275461957965
sagemaker session region: us-east-1


## 2. Load and prepare the dataset

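To tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function (`format_dataset` below) that takes a sample and returns a string in our instruction format. The data used here is a CSV of product records: the `title`, `brand`, `price`, and `feature` columns are combined into the prompt, and the `description` column is the target text. We keep the 1,000 highest-scoring records (by length and readability) for training. The text has to be converted into one of the two prompt styles below: the Alpaca style for the base model, or the chat template for the instruction-tuned model.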

1) Llama-2

```python
{
    "instruction": "",
    "input": "",
    "output": "",
}
```

2) Llama-2-instruct

```python
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>
```
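The chat template above can be filled in with a small helper. The notebook's own helper, `format_prompt_instruct`, lives in `scripts/prompt_utils.py` and is not shown here, so `build_llama2_chat_prompt` below is a hypothetical stand-in that only mirrors the template text.

```python
def build_llama2_chat_prompt(system_prompt: str, user_prompt: str, answer: str = "") -> str:
    """Fill the Llama-2 chat template shown above.

    The <s> and </s> markers are written as literal text to mirror the template.
    If your tokenizer or pipeline adds them itself, leave them out here.
    """
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"
    if answer:
        # Training samples include the answer; inference prompts stop at [/INST].
        prompt += f" {answer} </s>"
    return prompt


example = build_llama2_chat_prompt(
    system_prompt="You write product descriptions.",
    user_prompt="title: Mug",
    answer="A sturdy mug.",
)
print(example)
```

Leaving `answer` empty yields the prompt you would send at inference time, where the model is expected to continue after `[/INST]`.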

In [8]:
s3_bucket = 'd3-data-bucket'
s3_key = 'labs/trustworthy_ai/data/data_finetuned.csv'

local_path="./data/data_finetuned.csv"

# Download the CSV file from S3
S3_CLIENT.download_file(s3_bucket, s3_key, local_path)

In [9]:
# Define a function to calculate length, readability scores, and a combined score
def compute_scores(example):
    text = example['description']
    length = len(text)
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)
    # Example combined score: plain sum of length and flesch_kincaid_grade
    combined_score = length + flesch_kincaid_grade  # add weights here to change the balance
    return {
        "length": length,
        "flesch_reading_ease": flesch_reading_ease,
        "flesch_kincaid_grade": flesch_kincaid_grade,
        "combined_score": combined_score
    }

In [11]:
from datasets import Dataset
import pandas as pd
import textstat

# Open the file
file_contents = pd.read_csv(local_path)
print(file_contents.info())

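# Drop rows without a description; copy so the new column below is added to an independent frame
file_contents = file_contents.loc[~file_contents['description'].isna(), :].copy()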
print(file_contents.info())

# Specify the columns for which you want to create new columns
selected_columns = ['title', 'brand', 'price', 'feature']

# Create a new column by combining all column names and values
file_contents['combined_prompt'] = file_contents.apply(lambda row: ', '.join([f'{col}: {row[col]}' for col in selected_columns]), axis=1)

# Convert the Pandas DataFrame into a Dataset object
dataset_raw_full = Dataset.from_pandas(file_contents)

# Apply the function to the dataset
dataset_with_length = dataset_raw_full.map(compute_scores)

# Sort the dataset by the combined score in decreasing order
dataset_sorted = dataset_with_length.sort("combined_score", reverse=True)
    
# Select the first 1000 rows to create a new Dataset
dataset_raw = dataset_sorted.select(range(1000))

dataset_raw[0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6890 entries, 0 to 6889
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Unnamed: 0             6890 non-null   int64 
 1   title                  6890 non-null   object
 2   brand                  6890 non-null   object
 3   feature                6890 non-null   object
 4   description            6890 non-null   object
 5   price                  6890 non-null   object
 6   prompt                 431 non-null    object
 7   generated_description  395 non-null    object
dtypes: int64(1), object(7)
memory usage: 430.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6890 entries, 0 to 6889
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Unnamed: 0             6890 non-null   int64 
 1   title                  6890 non-null   object
 2   brand       

Map:   0%|          | 0/6890 [00:00<?, ? examples/s]

{'Unnamed: 0': 36409,
 'title': 'Sterling Silver 5mm Ball Stud Earrings',
 'brand': 'Amazon Collection',
 'feature': "['Simple stud earring featuring polished sterling silver ball affixed to friction-back post', 'Imported', 'Crafted in .925 Sterling Silver', 'High Polished', 'You can return this item for any reason and get a full refund: no shipping charges. The item must be returned in new and unused condition.', 'Read the full returns policy', 'Go to Your Orders to start the return', 'Print the return shipping label', 'Ship it!', 'Product Dimensions:\\n                    \\n4.3 x 3.9 x 0.2 inches', 'Shipping Weight:\\n                    \\n0.32 ounces (View shipping rates and policies)']",
 'description': '[\'\', \'The Amazon Curated Collection\', \'Discover the Amazon Curated Collection of fine and fashion jewelry. The expansive selection of high-quality jewelry featured in the Amazon Curated Collection offers everyday values that range from precious gemstone and diamond pieces to

In addition to formatting our samples, we tokenize each one to a fixed sequence length (padding or truncating as needed) so that training batches have a uniform shape.

In [12]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id,use_auth_token=True)
tokenizer.pad_token = '[PAD]'
tokenizer.padding_side = "right"

We define some helper functions to apply the prompt template to each sample and to create the labels; tokenization follows in the next cell.

In [13]:
import scripts.prompt_utils
from scripts.prompt_utils import format_prompt_instruct

def format_dataset(sample):
    print(sample)
    system_prompt = scripts.prompt_utils.system_prompt_descriptor
    user_instruction = sample['combined_prompt']
    answer = sample['description']
    
    prompt = format_prompt_instruct(user_instruction=user_instruction, answer=answer, system_prompt=system_prompt)
    return prompt

def template_dataset(sample):
    sample["text"] = f"{format_dataset(sample)}{tokenizer.eos_token}"
    return sample

def create_labels(sample):
    sample["labels"] = sample["input_ids"].copy()
    return sample

# Format dataset

from random import randrange
print(format_dataset(dataset_raw[randrange(len(dataset_raw))]))

{'Unnamed: 0': 59840, 'title': 'Teemzone Mens and Womens Genuine Leather Credit Card Holder Case Wallet with Id Windows', 'brand': 'teemzone', 'feature': '[\'Imported\', \'Material: Cowhide\', \'Size: 12cm8cm1cm\', \'Structure: Be made of real leather outside with Polyester inside. With one photo holder, 5 card slots, 2 receipt holders and 40 slots for business card or credit card.\', \'Hasp closure\', "With anti-degaussing function card slots, it\'s so easy to put the cards in&out of the slots", \'You can return this item for any reason and get a full refund: no shipping charges. The item must be returned in new and unused condition.\', \'Read the full returns policy\', \'Go to Your Orders to start the return\', \'Print the return shipping label\', \'Ship it!\', \'Product Dimensions:\\n                    \\n4.7 x 0.4 x 3.1 inches\', \'Shipping Weight:\\n                    \\n4.2 ounces (View shipping rates and policies)\']', 'description': "['#productDescription .aplus-3p {width: 97

In [14]:
# apply prompt template per sample
dataset = dataset_raw.map(template_dataset, remove_columns=list(dataset_raw.features))
# print a summary of the templated dataset
print(dataset)

# tokenize each sample, padding or truncating it to a fixed length of 1800 tokens
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"], max_length=1800, padding="max_length", truncation=True),
    batched=True,
    remove_columns=list(dataset.features),
).map(
    create_labels, batched=True
).filter(
    # keep only samples that were not truncated, i.e. that end in padding or EOS
    lambda x: (x["input_ids"][-1] == tokenizer.pad_token_id) or (x["input_ids"][-1] == tokenizer.eos_token_id)
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'Unnamed: 0': 36409, 'title': 'Sterling Silver 5mm Ball Stud Earrings', 'brand': 'Amazon Collection', 'feature': "['Simple stud earring featuring polished sterling silver ball affixed to friction-back post', 'Imported', 'Crafted in .925 Sterling Silver', 'High Polished', 'You can return this item for any reason and get a full refund: no shipping charges. The item must be returned in new and unused condition.', 'Read the full returns policy', 'Go to Your Orders to start the return', 'Print the return shipping label', 'Ship it!', 'Product Dimensions:\\n                    \\n4.3 x 3.9 x 0.2 inches', 'Shipping Weight:\\n                    \\n0.32 ounces (View shipping rates and policies)']", 'description': '[\'\', \'The Amazon Curated Collection\', \'Discover the Amazon Curated Collection of fine and fashion jewelry. The expansive selection of high-quality jewelry featured in the Amazon Curated Collection offers everyday values that range from precious gemstone and diamond pieces to the

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]
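The filter at the end of the tokenization cell keeps only sequences whose last token is padding or EOS. The toy sketch below shows why that drops truncated samples. The token ids and the helper names `pad_or_truncate` and `is_complete` are made up for illustration; real ids come from the tokenizer.

```python
PAD_ID, EOS_ID = 0, 2  # toy ids chosen for this example only


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Right-pad to max_length, or cut off anything beyond it."""
    return (ids + [pad_id] * max_length)[:max_length]


def is_complete(ids, pad_id=PAD_ID, eos_id=EOS_ID):
    """A sequence is complete if it ends in padding or in the EOS token."""
    return ids[-1] in (pad_id, eos_id)


short = pad_or_truncate([5, 6, 7, EOS_ID], max_length=6)                # fits and is padded
exact = pad_or_truncate([5, 6, 7, 8, 9, EOS_ID], max_length=6)          # fits exactly
long_ = pad_or_truncate([5, 6, 7, 8, 9, 10, 11, EOS_ID], max_length=6)  # EOS is cut off

kept = [seq for seq in (short, exact, long_) if is_complete(seq)]
print(len(kept), "of 3 sequences kept")
```

A truncated sequence ends in an ordinary content token, so it fails the check and is removed. The model therefore never trains on a sample whose answer was cut off mid-sentence.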

After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) of `datasets` to upload it to S3. The cell below writes to a fixed bucket path; adjust `training_input_path` if you want to store the dataset in a different S3 bucket, for example the session default returned by `SESS.default_bucket()`. We will pass this S3 path to the training job later.

In [15]:
# save train_dataset to s3
training_input_path = 's3://d3-data-bucket/labs/trustworthy_ai/processed/'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

severe performance issues, see also https://github.com/dask/dask/issues/10276

To fix, you should specify a lower version bound on s3fs, or
update the current installation.



Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://d3-data-bucket/labs/trustworthy_ai/processed/


## 3. Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker

We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.
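
The three steps above can be sketched on a toy 2 x 2 weight matrix in plain Python. This is a deliberately simplified illustration: it uses symmetric absmax rounding to integers in [-7, 7] rather than the NF4 data type from the QLoRA paper, and every function name here is hypothetical rather than part of PEFT or `run_clm.py`.

```python
def quantize_absmax_4bit(weights):
    """Map floats to integers in [-7, 7] plus one scale (a simplification of NF4)."""
    scale = max(abs(w) for row in weights for w in row) / 7 or 1.0
    q = [[round(w / scale) for w in row] for row in weights]
    return q, scale


def dequantize(q, scale):
    return [[v * scale for v in row] for row in q]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def qlora_forward(q, scale, lora_a, lora_b, x):
    """Frozen quantized base path plus the trainable low-rank path B(Ax)."""
    base = matvec(dequantize(q, scale), x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base, update)]


w = [[0.7, -0.3], [0.1, 0.0]]        # pretend pretrained weights
q, scale = quantize_absmax_4bit(w)   # step 1: quantize, then never touch q again
lora_a = [[1.0, 1.0]]                # step 2: rank-1 adapter, A is 1 x 2
lora_b = [[0.0], [0.0]]              # B is 2 x 1 and starts at zero
x = [1.0, 2.0]

y_before = qlora_forward(q, scale, lora_a, lora_b, x)
lora_b = [[0.5], [0.0]]              # step 3: training changes only the adapter
y_after = qlora_forward(q, scale, lora_a, lora_b, x)
```

Because `B` starts at zero, the adapted layer initially reproduces the quantized base layer. Training then changes the output only through `A` and `B`, while `q` and `scale` stay fixed.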

We prepared a [run_clm.py](./scripts/run_clm.py), which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code. The model will be temporarily offloaded to disk if it is too large to fit into memory.

To create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by running the entry point script (here `run_clm.py`) with the configured hyperparameters.
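
Inside the container, the hyperparameters reach the entry point as command-line arguments of the form `--name value`. The real `run_clm.py` is not reproduced in this notebook, so the parser below is a hypothetical sketch of how such a script could read them. The argument names follow the `hyperparameters` dictionary used in this notebook.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Booleans arrive as text on the command line, e.g. 'True'."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--merge_weights", type=str_to_bool, default=True)
    # parse_known_args tolerates extra arguments that this sketch does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the argument list the script would receive
args = parse_args([
    "--model_id", "meta-llama/Llama-2-13b-chat-hf",
    "--epochs", "3",
    "--merge_weights", "True",
])
print(args.model_id, args.epochs, args.merge_weights)
```

Parsing booleans explicitly matters because a plain `type=bool` would treat any non-empty string, including `"False"`, as true.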

### Hardware requirements

We also ran several experiments to determine which instance type can be used for the different model sizes. The following table shows the results: model size, instance type, max batch size, and context length.

| Model         | Instance Type       | Max Batch Size                    | Context Length |
|---------------|---------------------|-----------------------------------|----------------|
| [LLama 7B]()  | `(ml.)g5.4xlarge`   | `3`                               | `2048`         |
| [LLama 13B]() | `(ml.)g5.4xlarge`   | `2`                               | `2048`         |
| [LLama 70B]() | `(ml.)p4d.24xlarge` | `1++` (need to test more configs) | `2048`         |


> You can also use `g5.2xlarge` instead of the `g5.4xlarge` instance type, but then it is not possible to use the `merge_weights` parameter, since merging the LoRA weights into the model weights requires the model to fit into memory. You could instead save the adapter weights and merge them after training using [merge_adapter_weights.py](./scripts/merge_adapter_weights.py).

_Note: We plan to extend this list in the future. Feel free to contribute your setup!_
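
The note above about `merge_weights` says the model has to fit into memory for the merge. A rough way to see why that is harder than training is to compare the weight footprint at 4 bits with the footprint at 16 bits. This sketch assumes nominal parameter counts and that the merged model is held in 16-bit precision. It ignores activations, adapter weights, optimizer state, and framework overhead, so treat the numbers as lower bounds.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Memory taken by the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Nominal parameter counts (assumption: the model names are taken at face value)
sizes = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

for name, n in sizes.items():
    print(f"{name}: 4-bit ~{weight_gib(n, 4):.1f} GiB, 16-bit ~{weight_gib(n, 16):.1f} GiB")
```

The 16-bit footprint is four times the 4-bit one, so an instance that trains comfortably can still run out of memory when merging.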

In [16]:
import os
import time
from sagemaker.huggingface import HuggingFace

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': os.environ['HF_TOKEN'],               # huggingface token to access llama 2, read from the environment (variable name assumed)
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = ROLE,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# start the training job with our uploaded dataset as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2024-01-19-01-46-55-2024-01-19-01-46-58-278


2024-01-19 01:46:59 Starting - Starting the training job
2024-01-19 01:46:59 Pending - Training job waiting for capacity......
2024-01-19 01:47:53 Pending - Preparing the instances for training.........
2024-01-19 01:49:13 Downloading - Downloading input data...
2024-01-19 01:49:33 Downloading - Downloading the training image........................
2024-01-19 01:53:34 Training - Training image download completed. Training in progress...
3%|▎         | 46/1500 [09:00<4:44:19, 11.73s/it]
17%|█▋        | 258/1500 [50:27<4:02:52, 11.73s/it]
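
The `1500` in the progress bar above is consistent with 1,000 training samples, a per-device batch size of 2, and 3 epochs. The sketch below reproduces that arithmetic, assuming no gradient accumulation, a single data-parallel worker, and that the filter step kept all 1,000 samples. The helper `total_steps` and its extra parameters are hypothetical.

```python
import math


def total_steps(n_samples, per_device_batch_size, epochs, n_data_parallel=1, grad_accum=1):
    """Optimizer steps for a run, rounding up the last partial batch of each epoch."""
    effective_batch = per_device_batch_size * n_data_parallel * grad_accum
    return math.ceil(n_samples / effective_batch) * epochs


steps = total_steps(n_samples=1000, per_device_batch_size=2, epochs=3)
hours = steps * 11.73 / 3600  # 11.73 s/it is the rate reported in the log

print(f"{steps} steps, roughly {hours:.1f} hours at 11.73 s/it")
```

The estimate lines up with the remaining-time figures printed in the log, which suggests the run processes one batch of 2 per step.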


In [None]:
huggingface_estimator.model_data

## 4. Deploy model

The model is deployed in a separate notebook: [deploy_llama.ipynb](https://d-zxgsdsdm6wdj.studio.us-east-1.sagemaker.aws/jupyter/default/files/huggingface-llama-2-samples/training/deploy_llama.ipynb?_xsrf=2%7C1713465d%7Cf7c757f2602e3ee07c19d53612a09c11%7C1690993444).