# End to end fine tuning Notebook

1. data preparation
2. tokenization
3. fine tuning with QLora
3. Evaluation

In [4]:
!pip install huggingface_hub --upgrade --quiet

[0m

In [5]:
!pip install "transformers==4.30.2" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.150 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
jupyterlab-server 2.22.1 requires jsonschema>=4.17.3, but you have jsonschema 3.2.0 which is incompatible.[0m[31m
[0m

If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.



In [6]:
#required to work in local_mode on your notebook instance for development/debugging purpose
#!pip install 'sagemaker[local]' --upgrade --quiet
#!pip install docker-compose --quiet

In [7]:
import sagemaker
import boto3
import os
from sagemaker import LocalSession


sess = sagemaker.Session()

#uncomment to run in local mode
#sess = LocalSession()

#the below help setting up the container's root on the EBS volume of your instance.
#sess.config = {'local' : {'local_code' : True, 'container_root' : '/home/ec2-user/SageMaker/'}}

#if you're running local mode and run into out of space issues, consider running docker_scripts/prepare-docker.sh to set the docker root under /home/ec2-user/SageMaker

In [8]:
#region
region = sess.boto_region_name

notebook_home = "/home/ec2-user/SageMaker/"

# sagemaker will automatically create this bucket if it not exists
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

s3_client = boto3.client("s3")    

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")


sagemaker role arn: arn:aws:iam::327216439222:role/Sagemaker
sagemaker bucket: sagemaker-us-east-1-327216439222
sagemaker session region: us-east-1


### Model selection

Choose the model you want to fine tune.

In [9]:
model_id = "tiiuae/falcon-7b"
#model_id = "tiiuae/falcon-7b-instruct"
model_name = model_id.split("/")[-1]

## 2. Data Preparation

### BBC Dataset

We're using BBC articles for our fine tuning contained in the local zip file.

In [7]:
import zipfile

#name fo the zip file that we'll use
data_zip = "BBC_news_summary.zip"
s3_prefix = "model-fine-tuning"

base_dir = os.path.join(os.getcwd())

path_to_file = os.path.join(os.getcwd(), "data", data_zip)

#unziping file
with zipfile.ZipFile(os.path.join(base_dir, "data", data_zip), 'r') as zip_ref:
    zip_ref.extractall(os.path.join(notebook_home, "data"))

#Folders that we'll iterate through after unzipping.
articles_folder = "News Articles"
summaries_folder = "Summaries"
sub_folders = ["business", "entertainment", "politics", "sport", "tech"]

articles_folders = f"{notebook_home}/data/BBC_news_summary/" + articles_folder
summaries_folder = f"{notebook_home}/data/BBC_news_summary/" + summaries_folder

### Transform folder base data into jsonlines

See below the format that we want:

{

  "id": "13818513",
  
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  
  "content": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
  
}

In [8]:
import json

with open(os.path.join(notebook_home, "data", "data_jsonlines.jsonl"), 'w') as outfile:
    for folder in os.scandir(path = articles_folders):
        for filename in os.scandir(path = articles_folders + "/" + str(folder.name)):
            if filename.is_file():
                try:
                    #create article id of the form folder_001
                    id_article = str(folder.name) + "_" + str(filename.name).split(".")[0]

                    #get article content
                    content = ""
                    with open(filename, 'rb') as file:
                        content = file.read()
                    #get article summary
                    summary = ""
                    equivalent_summary_file = summaries_folder + "/" + str(folder.name) + "/" + str(filename.name)
                    with open(equivalent_summary_file, 'rb') as file:
                        summary = file.read()

                    #create json object
                    data = {}
                    data['id'] = id_article
                    data['content'] = content.decode("utf-8")
                    data['summary'] = summary.decode("utf-8")
   
                    json.dump(data, outfile)
                    outfile.write('\n')

                except UnicodeDecodeError:
                    print(f"skipping:{id_article} due to UnicodeDecodeError")

skipping:sport_199 due to UnicodeDecodeError


### Data Tokenization

In [9]:
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

#we load the data into a dataset object
dataset = load_dataset('json', data_files=os.path.join(notebook_home, "data", "data_jsonlines.jsonl"), split="train")

# Load tokenizer of falcon
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value

Downloading and preparing dataset json/default to /home/ec2-user/.cache/huggingface/datasets/json/default-c37d906bda4cd879/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/json/default-c37d906bda4cd879/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


Applying the template to the dataset

In [10]:
from random import randint
from itertools import chain
from functools import partial
from string import Template

# custom instruct prompt start
prompt_template_domain = f"Provide a summary for the following article:\n{{content}}\n---\nSummary:\n{{summary}}{{eos_token}}"

prompt_template_instruction = {
"prompt": "Provide a summary for the following text article:. Text: $content",
"completion": "$summary",
}

# template dataset to add prompt to each sample
def template_dataset_domain_tuning(sample):
    sample["text"] = prompt_template_domain.format(content=sample["content"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample

def template_dataset_instruction_tuning(sample):
    prompt_template_instruction = {
        "prompt": "Provide a summary for the following text article:. Text: $content",
        "completion": "$summary",
    }
    
    args = {'content': sample["content"], 'summary': sample["summary"]}
    sample["text"] = Template(prompt_template_instruction).substitute(args)
    return sample

# apply prompt template per sample
dataset_domain = dataset.map(template_dataset_domain_tuning, remove_columns=list(dataset.features))

dataset_instruction = dataset.map(template_dataset_instruction_tuning, remove_columns=list(dataset.features))



Map:   0%|          | 0/2224 [00:00<?, ? examples/s]

In [11]:
#printing an example
print(dataset[randint(0, len(dataset))]["text"])

Summarize the news article:
UK firm faces Venezuelan land row

Venezuelan authorities have said they will seize land owned by a British company as part of President Chavez's agrarian reform programme.

Officials in Cojedes state said on Friday that farmland owned by a subsidiary of the Vestey Group would be taken and used to settle poor farmers. The government is cracking down on so-called latifundios, or large rural estates, which it says are lying idle. The Vestey Group said it had not been informed of any planned seizure.

The firm, whose Agroflora subsidiary operates 13 farms in Venezuela, insisted that it had complied fully with Venezuelan law. Prosecutors in the south of the country have targeted Hato El Charcote, a beef cattle ranch owned by Agroflora. According to Reuters, they plan to seize 12,900 acres (5,200 hectares) from the 32,000 acre (13,000 hectare) farm.

Officials claim that Agroflora does not possess valid documents proving its ownership of the land in question. The

Concatenating sample dataset into chunk of equal size for the training

In [12]:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "token_type_ids": [], "attention_mask": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    #print(concatenated_examples.keys())
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

Map:   0%|          | 0/2224 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3091 > 2048). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/2224 [00:00<?, ? examples/s]

Total number of samples: 804


Uploading tokenized and chunked dataset to S3

In [13]:
# save train_dataset to s3
training_input_path = os.path.join("s3://", sagemaker_session_bucket, s3_prefix, "tokenized", "train", "")
lm_dataset.save_to_disk(training_input_path)

print(f"training dataset uploaded to: {training_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/804 [00:00<?, ? examples/s]

training dataset uploaded to: s3://sagemaker-us-east-1-327216439222/model-fine-tuning/tokenized/train/


## Download and upload the model to S3

In [14]:
#model_id = "EleutherAI/gpt-j-6b"
#model_id = "meta-llama/Llama-2-13b"
model_id = "tiiuae/falcon-7b"
model_name = model_id.split("/")[-1]

In [15]:
!pip show huggingface_hub

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0mName: huggingface-hub
Version: 0.16.4
Summary: Client library to download and publish models, datasets and other repos on the huggingface.co hub
Home-page: https://github.com/huggingface/huggingface_hub
Author: Hugging Face, Inc.
Author-email: julien@huggingface.co
License: Apache
Location: /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages
Requires: filelock, fsspec, packaging, pyyaml, requests, tqdm, typing-extensions
Required-by: datasets, evaluate, transformers


In [16]:
from pathlib import Path
from huggingface_hub import snapshot_download

model_tar_dir = Path(os.path.join(notebook_home, "models", model_name))
if not os.path.isdir(model_tar_dir):
    os.makedirs(model_tar_dir)

# Download model from Hugging Face into model_dir
snapshot_download(model_id, 
                  local_dir=str(model_tar_dir), 
                  local_dir_use_symlinks=False,
                  cache_dir="/home/ec2-user/SageMaker/models/tmp")


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

Downloading (…)49f25d4eb1/README.md:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading (…)f25d4eb1/config.json:   0%|          | 0.00/950 [00:00<?, ?B/s]

Downloading (…)/configuration_RW.py:   0%|          | 0.00/2.61k [00:00<?, ?B/s]

Downloading (…)4eb1/modelling_RW.py:   0%|          | 0.00/47.6k [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Downloading (…)d4eb1/.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Downloading (…)d4eb1/tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

'/home/ec2-user/SageMaker/models/falcon-7b'

In [21]:
print(model_tar_dir)
print("/home/ec2-user/SageMaker/models/falcon-7b")

/home/ec2-user/SageMaker/models/falcon-7b
/home/ec2-user/SageMaker/models/falcon-7b


In [25]:
cwd = str(Path.cwd())
p = Path(os.path.join(Path.cwd(), model_tar_dir))
mydirs = list(p.glob('**'))

In [124]:
#uploading the model to S3
def upload_to_s3(model_tar_dir, s3_prefix, sagemaker_session_bucket):
    stop_list = ['.ipynb_checkpoints', '.gitattributes']
    files = os.listdir(model_tar_dir)   
    for file in files:
        if file not in stop_list:
            try:
                local_path = os.path.join(model_tar_dir, file)
                if os.path.isfile(local_path):
                    remote_path = os.path.join(s3_prefix, file)
                    s3_client.upload_file(local_path, sagemaker_session_bucket, remote_path)
                    print(f"{local_path} uploaded to s3 folder: {remote_path}")
                else:
                    new_local_dir = os.path.join(model_tar_dir,file)
                    new_remote_dir = os.path.join(s3_prefix,file)
                    upload_to_s3(new_local_dir, new_remote_dir, sagemaker_session_bucket)

            except Exception as e:
                print(e)

In [125]:
upload_to_s3(model_tar_dir, os.path.join(s3_prefix, "models", model_name, ''), sagemaker_session_bucket)

/home/ec2-user/SageMaker/models/falcon-7b/pytorch_model-00001-of-00002.bin uploaded to s3 folder: model-fine-tuning/models/falcon-7b/pytorch_model-00001-of-00002.bin
/home/ec2-user/SageMaker/models/falcon-7b/pytorch_model.bin.index.json uploaded to s3 folder: model-fine-tuning/models/falcon-7b/pytorch_model.bin.index.json
/home/ec2-user/SageMaker/models/falcon-7b/pytorch_model-00002-of-00002.bin uploaded to s3 folder: model-fine-tuning/models/falcon-7b/pytorch_model-00002-of-00002.bin
/home/ec2-user/SageMaker/models/falcon-7b/test/test_file.txt uploaded to s3 folder: model-fine-tuning/models/falcon-7b/test/test_file.txt
/home/ec2-user/SageMaker/models/falcon-7b/config.json uploaded to s3 folder: model-fine-tuning/models/falcon-7b/config.json
/home/ec2-user/SageMaker/models/falcon-7b/README.md uploaded to s3 folder: model-fine-tuning/models/falcon-7b/README.md
/home/ec2-user/SageMaker/models/falcon-7b/configuration_RW.py uploaded to s3 folder: model-fine-tuning/models/falcon-7b/configur

In [126]:
#storing model path and output model path to reuse later
model_path = os.path.join("s3://", sagemaker_session_bucket, s3_prefix, "models", model_name, '')
fine_tuned_model_path = os.path.join("s3://", sagemaker_session_bucket, s3_prefix, "tuned-models", model_name, '')

In [127]:
print(training_input_path)
print(model_path)
print(fine_tuned_model_path)

s3://sagemaker-us-east-1-327216439222/model-fine-tuning/tokenized/train/
s3://sagemaker-us-east-1-327216439222/model-fine-tuning/models/falcon-7b/
s3://sagemaker-us-east-1-327216439222/model-fine-tuning/tuned-models/falcon-7b/


In [12]:
#TO REMOVE, JUST SHORCUTING CODE
training_input_path = "s3://sagemaker-us-east-1-327216439222/model-fine-tuning/tokenized/train/"
model_path = "s3://sagemaker-us-east-1-327216439222/model-fine-tuning/models/falcon-7b/"
fine_tuned_model_path = "s3://sagemaker-us-east-1-327216439222/model-fine-tuning/tuned-models/falcon-7b/"

## Fine-Tune with QLoRA on Amazon SageMaker

We are going to use the recently introduced method in the paper "[QLoRA: Quantization-aware Low-Rank Adapter Tuning for Language Generation](https://arxiv.org/abs/2106.09685)" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. The TL;DR; of how QLoRA works is: 

* Quantize the pretrained model to 4 bits and freezing it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.

We prepared a [run_clm.py](./scripts/run_clm.py), which implements QLora using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code.

In order to create a sagemaker training job we need an `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. The Estimator manages the infrastructure use. 
SagMaker takes care of starting and managing all the required ec2 instances for us, provides the correct huggingface container, uploads the provided scripts and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by running.


In [13]:
print(f"model to be fine-tuned: {model_id}")

model to be fine-tuned: tiiuae/falcon-7b


In [14]:
import time
# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace
from sagemaker.pytorch import PyTorch

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',       # path where sagemaker will save training dataset
  'model_path' : '/opt/ml/input/data/pre-trained/',    # path to load the model from
  'epochs': 1,                                         # number of training epochs
  'per_device_train_batch_size': 4,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}


# create the Estimator for fine tuning on single GPU
huggingface_estimator = HuggingFace(
    #sagemaker_session=sess,                   #required for setting new container root to EBS volume
    entry_point          = 'run_clm.py',         # train script
    source_dir           = './qlora_scripts',    # directory which includes all the files needed for training
    #instance_type        = 'local_gpu',         #uncomment and comment below to switch to local mode.
    instance_type        = 'ml.g5.12xlarge',     # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 500,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',
    #framework_version    = '2.0.0',            # the transformers version used in the training job
    pytorch_version      = '2.0.0',            # the pytorch_version version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      =  hyperparameters,
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

'''
# create the Estimator for fine tuning on multiple GPU (and local mode)
huggingface_estimator = HuggingFace(
    sagemaker_session=sess,                   #required for setting new container root to EBS volume
    entry_point          = 'launch_accelerate.sh',         # train script
    source_dir           = './qlora_scripts',    # directory which includes all the files needed for training
    instance_type        = 'local_gpu',
    #instance_type        = 'ml.g5.12xlarge',     # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 400,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',
    #framework_version    = '2.0.0',            # the transformers version used in the training job
    pytorch_version      = '2.0.0',            # the pytorch_version version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      =  hyperparameters,
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    #this configuration enabled data/model parallel distribution
    #distribution={
    #    "torch_distributed": {
    #        "enabled": True
    #    }
    #}
)
'''

'\n# create the Estimator for fine tuning on multiple GPU (and local mode)\nhuggingface_estimator = HuggingFace(\n    sagemaker_session=sess,                   #required for setting new container root to EBS volume\n    entry_point          = \'launch_accelerate.sh\',         # train script\n    source_dir           = \'./qlora_scripts\',    # directory which includes all the files needed for training\n    instance_type        = \'local_gpu\',\n    #instance_type        = \'ml.g5.12xlarge\',     # instances type used for the training job\n    instance_count       = 1,                 # the number of instances used for training\n    base_job_name        = job_name,          # the name of the training job\n    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3\n    volume_size          = 400,               # the size of the EBS volume in GB\n    transformers_version = \'4.28.1\',\n    #framework_version    = \'2.0.0\',            #

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [15]:
from sagemaker.inputs import TrainingInput
#from torch import inf

fast_file = lambda x: TrainingInput(x, input_mode="FastFile")
huggingface_estimator.fit(
    {
        "pre-trained": fast_file(model_path),
        "training": fast_file(training_input_path),
    },
    wait=False,
)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-07-26-00-26-18-2023-07-26-00-26-23-935


Using provided s3_resource


In [269]:
training_job_name = huggingface_estimator.latest_training_job.name
training_job_name

'huggingface-qlora-2023-07-25-07-31-59-2023-07-25-07-32-09-436'

In our example, the SageMaker training job took 1h for one epoch. The ml.g5.12xlarge instance we used costs `$7.09 per hour` for on-demand usage.

In [270]:
#url to the fined tuned model
model_tar_gz_s3 = os.path.join("s3://", sagemaker_session_bucket, training_job_name, "output/model.tar.gz")
print(model_tar_gz_s3)

s3://sagemaker-us-east-1-327216439222/huggingface-qlora-2023-07-25-07-31-59-2023-07-25-07-32-09-436/output/model.tar.gz


Wait for the model to be fine tuned and model.tar.gz to be created or use one already created from folder: < TODO provide folder >

### Downloading the fine tuned model

In [271]:
model_tuned_dir = os.path.join(notebook_home, "models", model_name + "-tuned", "")
if not os.path.isdir(model_tuned_dir):
    os.makedirs(model_tuned_dir)
model_tuned_file = model_tuned_dir + "model.tar.gz"
print(model_tuned_file)

/home/ec2-user/SageMaker/models/falcon-7b-tuned/model.tar.gz


In [None]:
#download from s3 to local
s3_client.download_file(sagemaker_session_bucket, 
                        os.path.join(training_job_name, "output/model.tar.gz"), 
                       model_tuned_file)

In [None]:
#untar to check the content and make sure it includes everything
!tar -xvf $model_tuned_file --directory $model_tuned_dir

## Register model

### Create model group

In [156]:
import time

sm_client = boto3.client('sagemaker', region_name=region)

model_package_group_name = "gai-fine-tuned-" + model_name

try:
    model_package_group_input_dict = {
     "ModelPackageGroupName" : model_package_group_name,
     "ModelPackageGroupDescription" : f"fine tuned versions of {model_id}"
    }
    create_model_package_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)
    print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))
except:
    print("Model group already exists")

Model group already exists


### Register a model version

In [157]:
image_url = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"

modelpackage_inference_specification =  {
    "InferenceSpecification": {
      "Containers": [
         {
            "Image": image_url,
             "ModelDataUrl": model_tar_gz_s3
         }
      ],
      "SupportedContentTypes": ["application/json"],
      "SupportedResponseMIMETypes": ["application/json"],
   }
 }

create_model_package_input_dict = {
    "ModelPackageGroupName" : model_package_group_name,
    "ModelPackageDescription" : f"qlora fine tuning of model {model_id} done during training job {training_job_name}",
    "ModelApprovalStatus" : "PendingManualApproval"
}

create_model_package_input_dict.update(modelpackage_inference_specification)

create_model_package_input_dict

{'ModelPackageGroupName': 'gai-fine-tuned-falcon-7b',
 'ModelPackageDescription': 'qlora fine tuning of model huggingface-llm-falcon-7b-bf16 done during training job huggingface-qlora-2023-07-24-11-45-33-2023-07-24-11-45-36-488',
 'ModelApprovalStatus': 'PendingManualApproval',
 'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04',
    'ModelDataUrl': 's3://sagemaker-us-east-1-327216439222/huggingface-qlora-2023-07-24-11-45-33-2023-07-24-11-45-36-488/output/model.tar.gz'}],
  'SupportedContentTypes': ['application/json'],
  'SupportedResponseMIMETypes': ['application/json']}}

In [158]:
create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)
model_package_arn = create_model_package_response["ModelPackageArn"]
print('ModelPackage Version ARN : {}'.format(model_package_arn))

ModelPackage Version ARN : arn:aws:sagemaker:us-east-1:327216439222:model-package/gai-fine-tuned-falcon-7b/1


In [159]:
sm_client.list_model_packages(ModelPackageGroupName=model_package_group_name)

{'ModelPackageSummaryList': [{'ModelPackageGroupName': 'gai-fine-tuned-falcon-7b',
   'ModelPackageVersion': 1,
   'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:327216439222:model-package/gai-fine-tuned-falcon-7b/1',
   'ModelPackageDescription': 'qlora fine tuning of model huggingface-llm-falcon-7b-bf16 done during training job huggingface-qlora-2023-07-24-11-45-33-2023-07-24-11-45-36-488',
   'CreationTime': datetime.datetime(2023, 7, 25, 0, 18, 49, 264000, tzinfo=tzlocal()),
   'ModelPackageStatus': 'Completed',
   'ModelApprovalStatus': 'PendingManualApproval'}],
 'ResponseMetadata': {'RequestId': 'cb1f1c2e-d548-4620-a38d-0b888e929507',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'cb1f1c2e-d548-4620-a38d-0b888e929507',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '492',
   'date': 'Tue, 25 Jul 2023 00:19:01 GMT'},
  'RetryAttempts': 0}}

### Approve model

In [160]:
model_package_update_input_dict = {
    "ModelPackageArn" : model_package_arn,
    "ModelApprovalStatus" : "Approved"
}
model_package_update_response = sm_client.update_model_package(**model_package_update_input_dict)

## Deploy the fined tuned model

In [174]:
endpoint_name = f'{model_name}-tuned-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
endpoint_name

'falcon-7b-tuned-2023-07-25-00-27-41'

### from the registry

In [170]:
from sagemaker import ModelPackage
from time import gmtime, strftime

def deploy_from_registry(role, model_package_arn, sess, name):
    model = ModelPackage(role=role, 
                         model_package_arn=model_package_arn, 
                         sagemaker_session=sess)
    model.deploy(initial_instance_count=1, instance_type='ml.g5.12xlarge', wait=False, endpoint_name=name)

In [172]:
deploy_from_registry(role, model_package_arn, sess, endpoint_name)

INFO:sagemaker:Creating model with name: gai-fine-tuned-falcon-7b-2023-07-25-00-26-10-810
INFO:sagemaker:Creating endpoint-config with name falcon-7b-tuned-2023-07-25-00-26-09
INFO:sagemaker:Creating endpoint with name falcon-7b-tuned-2023-07-25-00-26-09


### or directly with HuggingFaceModel sagemaker API

In [176]:
model_tar_gz_s3

's3://sagemaker-us-east-1-327216439222/huggingface-qlora-2023-07-24-11-45-33-2023-07-24-11-45-36-488/output/model.tar.gz'

In [None]:
from sagemaker.huggingface import HuggingFaceModel

#URL: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=model_tar_gz_s3,
   role=role, 
   transformers_version="4.28.1", 
   pytorch_version="2.0.0", 
   py_version="py310",
   model_server_workers=1,
)

# deploy model to SageMaker Inference
predictor_hf = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type= "ml.g5.12xlarge",
    wait=False,
    endpoint_name=endpoint_name,
)

endpoint_name = predictor_hf.endpoint_name
print(endpoint_name)

## Deploy the original model for comparison

In [33]:
from sagemaker.jumpstart.model import JumpStartModel

model_id, model_version = "huggingface-llm-falcon-7b-bf16", "*"

js_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")
predictor_js = js_model.deploy(wait=False)

INFO:sagemaker:Creating model with name: hf-llm-falcon-7b-bf16-2023-07-26-00-51-19-279
INFO:sagemaker:Creating endpoint-config with name hf-llm-falcon-7b-bf16-2023-07-26-00-51-19-603
INFO:sagemaker:Creating endpoint with name hf-llm-falcon-7b-bf16-2023-07-26-00-51-19-603


### Querying the endpoints

In [177]:
#setting it manually here
endpoint_name = "falcon-7b-tuned-2023-07-25-00-26-09"
#endpoint_name = "gai-fine-tuned-falcon-7b-1690174749-2023-07-24-06-42-41-602"
original_model_endpoint_name = "hf-llm-falcon-7b-bf16-2023-07-24-12-14-32-008"

In [22]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response

#method used to parse the inference model's response. we pass it as part of the model's config
def parse_response_model(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    return [gen["generated_text"] for gen in model_predictions]

def query_llm(payload, endpoint_name):
    query_response = query_endpoint_with_json_payload(json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name)
    return parse_response_model(query_response)

In [17]:
prompt_template = f"Summarize the following text:\n{{text}}\n---\nSummary:\n"

In [18]:
text = """Ad sales boost Time Warner profit. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. "Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility," chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake."""

In [19]:
prompt = prompt_template.format(text=text)

In [20]:
payload = {
    "inputs": prompt,
    "parameters":{
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

In [240]:
print(prompt)
print(f"original:{query_llm(payload, original_model_endpoint_name)}")

Summarize the following text:
Ad sales boost Time Warner profit. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

In [242]:
print(prompt)
print(f"fine-tuned:{query_llm(payload, endpoint_name)}")

Summarize the following text:
Ad sales boost Time Warner profit. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "module \u0027transformers_modules.model.modelling_RW\u0027 has no attribute \u0027RWForCausalLM\u0027"
}
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/falcon-7b-tuned-2023-07-25-00-26-09 in account 327216439222 for more information.

## Evaluation of the fine tuned model

We're using the LM Evaluation Harness framework to evaluate our fine tuned model

https://github.com/EleutherAI/lm-evaluation-harness

list of possible test/tasks to use:

https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md


### installation of lm-evaluation-harness

In [71]:
#installing the latest version of lm-evaluation-harness. note that the big-refactor is the new version branch that will probably become the main one soon.
#tested on revision 2820042d05e91c87852c82293f8973dc841c1a25 of the big-refactor branch
!git clone https://github.com/EleutherAI/lm-evaluation-harness && cd lm-evaluation-harness && git checkout big-refactor && pip install -e . --quiet

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 14128, done.[K
remote: Counting objects: 100% (2540/2540), done.[K
remote: Compressing objects: 100% (483/483), done.[K
remote: Total 14128 (delta 2264), reused 2158 (delta 2053), pack-reused 11588[K
Receiving objects: 100% (14128/14128), 19.21 MiB | 26.62 MiB/s, done.
Resolving deltas: 100% (9379/9379), done.
branch 'big-refactor' set up to track 'origin/big-refactor'.
Switched to a new branch 'big-refactor'
[0m

### Important

you need to modify lm-evaluation-harness/lm_eval/models/huggingface.py and add around line 60:

MODEL_FOR_CAUSAL_LM_MAPPING_NAMES["RefinedWebModel"] = "RWForCausalLM"

otherwise "self._config = transformers.AutoConfig.from_pretrained" set the model_type to "RefinedWebModel" which is not in the mapping list for MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.

as a result the auto_model_class selected  is transformers.AutoModelForSeq2SeqLM instead of transformers.AutoModelForCausalLM.

## Which tasks?

Not all tasks are available in the new refactored lm-evaluation-harness framework (you can always use the old version). see below for the WIP list:

https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/lm_eval/tasks

you'll find details under each folder for the tasks. the challenge is that there are 200+ tasks from the old frameworks so it is challenging to know which ones is the best of what you're trying to evaluate.

See below some resources that I found that can help:

https://super.gluebenchmark.com/tasks

https://gluebenchmark.com/tasks/

Notes from tasks that have already moved to the new framework:

- anli: Q&A
- ARC: Q&A science exam
- arithmetic
- hellaswag: sentence finishing
- lambada: next word prediction
- glue/qnli: Q&A https://rajpurkar.github.io/SQuAD-explorer/
- gsm8k: math problems
- headqa: health care complex reasoning
- hendrycks_ethics: ethic benchmark
- mathqa: math word problem solving
- openbookqa: Q&A??
- pile: general comprehension/Q&A across books, github repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers.
- piqa: commonsense Q&A, https://huggingface.co/datasets/piqa
- prost: contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts, https://paperswithcode.com/dataset/prost
- pubmedqa: biomedical Q&A https://github.com/pubmedqa/pubmedqa
- qa4mre: Q&A https://www.tensorflow.org/datasets/catalog/qa4mre
- race Q&A/comprehension https://huggingface.co/datasets/race
- sciq: crowdsourced science exam questions about Physics, Chemistry and Biology, https://huggingface.co/datasets/sciq
- webqs: WebQuestions is a benchmark for question answering.
- toxigen:  large-scale and machine-generated dataset of 274,186 toxic and benign statements about 13 minority groups.  https://paperswithcode.com/dataset/toxigen
- unscramble: Unscramble is a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word.

### Useful Paths

In [90]:
lm_evaluation_git_dir = base_dir + "/lm-evaluation-harness"
lm_evaluation_git_dir

'/home/ec2-user/SageMaker/gai-finetuning/lm-evaluation-harness'

In [77]:
print(f"model_tuned_dir:{model_tuned_dir}")

model_tuned_dir:/home/ec2-user/SageMaker/models/falcon-7b-tuned/


In [112]:
output_file = "eval_tuned.txt"

### Select and launch the tasks

In [178]:
tasks = "hendrycks_ethics"

In [179]:
#to monitor GPU Memory usage on notebook instance, open a separate terminal and run: watch -n 0.4 nvidia-smi 
!cd evaluation_scripts && ./evaluation.sh hf $lm_evaluation_git_dir $model_tuned_dir $tasks > $output_file

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0m2023-07-24:04:08:41,819 INFO     [utils.py:148] Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2023-07-24:04:08:41,819 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
                                 /home/ec2-user/SageMaker/gai-finetuning/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utilitarianism_original.yaml
                                 Config will not be added to registry
                                 Error: argument of type 'NoneType' is not iterable
2023-07-24:04:08:42,975 INFO     [instantiator.py:21] Created a temporary directory at /tmp/tmphtatndxe
2023-07-24:04:08:42,975 INFO     [instantiator.py:76] Writing /tmp/tmphtat

### Extract json from output

In [180]:
output_path = "./evaluation_scripts/" + output_file

#quick and dirty way to extract json in the midst of the text file
def extract_json(output_path):
    try:
        with open(output_path, 'r') as file:
                data_str = file.read()
                pos1 = int(data_str.find("{"))
                pos2 = int(data_str.rfind("}"))
                return json.loads(data_str[pos1:pos2+1])
    except Exception as e:
        print(f"cannot extract json from lm evaluation output, error: {e}")

In [235]:
json_obj = extract_json(output_path)

In [237]:
json_obj['results']

{'ethics_virtue': {'acc,none': 0.43537688442211053,
  'acc_stderr,none': 0.00703006143345469},
 'ethics_justice': {'acc,none': 0.5088757396449705,
  'acc_stderr,none': 0.009615647725764407},
 'ethics_deontology': {'acc,none': 0.5150166852057843,
  'acc_stderr,none': 0.008335364597109167},
 'ethics_cm': {'acc,none': 0.5917631917631918,
  'acc_stderr,none': 0.007886611421325077},
 'ethics_utilitarianism': {'acc,none': 0.5214226289517471,
  'acc_stderr,none': 0.007204999520618655}}

### Update model's metadata with results in model registry

In [233]:
#formatting the metrics as metadata to be stored in the model registry with the model
acc_none = 'acc,none'
acc_stderr = 'acc_stderr,none'

metadataProperties = {}
for key, value in json_obj['results'].items():
    metadataProperties[key + "_acc"] = str(json_obj['results'][key][acc_none])
    metadataProperties[key + "_acc_stderr"] = str(json_obj['results'][key][acc_stderr])

metadataProperties

{'ethics_virtue_acc': '0.43537688442211053',
 'ethics_virtue_acc_stderr': '0.00703006143345469',
 'ethics_justice_acc': '0.5088757396449705',
 'ethics_justice_acc_stderr': '0.009615647725764407',
 'ethics_deontology_acc': '0.5150166852057843',
 'ethics_deontology_acc_stderr': '0.008335364597109167',
 'ethics_cm_acc': '0.5917631917631918',
 'ethics_cm_acc_stderr': '0.007886611421325077',
 'ethics_utilitarianism_acc': '0.5214226289517471',
 'ethics_utilitarianism_acc_stderr': '0.007204999520618655'}

In [234]:
model_package_update_response = sm_client.update_model_package(ModelPackageArn=model_package_arn, CustomerMetadataProperties=metadataProperties)

## Summarisation evaluation/metrics - ROUGE

ROUGE, or Recall-Orientpip install rouge-scoreed Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

In the ROUGE paper, two flavors of ROUGE are described:


    sentence-level: Compute longest common subsequence (LCS) between two pieces of text. Newlines are ignored. This is called rougeL in this package.
    summary-level: Newlines in the text are interpreted as sentence boundaries, and the LCS is computed between each pair of reference and candidate sentences, and something called union-LCS is computed. This is called rougeLsum in this package. 

https://github.com/google-research/google-research/tree/master/rouge

In [30]:
!pip install rouge-score --quiet

[0m

### Example

In [31]:
examples_text = """Wine comedy up for six film gongs. Sideways, a wine-tasting comedy starring Paul Giamatti, is up for six Independent Spirit Awards, the art-house version of the Oscars.The awards are held on 26 February, the day before the Oscars. Spanish drama Maria Full of Grace, about a Colombian woman who becomes a drug courier, got five nominations. Controversial biopic Kinsey, starring Liam Neeson as sex researcher Alfred Kinsey, was one of four films to get four nominations. The awards, now in their 20th year, honour quirky low-budget films, all of which must have a degree of independent financing. Sideways is written and directed by Alexander Payne, who directed the 2002 hit About Schmidt, winning Jack Nicholson his 12th Academy Award nomination."These awards, for better or worse, mean everything," said Sideways producer Michael London, adding they were a "huge first step" toward getting recognition from other awards. Among the other films receiving four nominations apiece were Brother to Brother, a drama about a young gay black man forced to live on the streets, Robbing Peter and Primer. Primer, a $7,000 (£3,650) tale of discovery, won top prize at the Sundance film festival earlier this year. Walter Salles critically acclaimed The Motorcycle Diaries and the forthcoming thriller The Woodsman, starring Kevin Bacon, received three nominations each. Also in the running, with two nominations, are high school comedy Napoleon Dynamite, The Door in the Floor and Garden State - written, directed and starring Scrubs star Zach Braff alongside Natalie Portman. The awards were announced by actors Selma Blair and Dennis Quaid in Los Angeles on Tuesday."""
summary = """Sideways, a wine-tasting comedy starring Paul Giamatti, is up for six Independent Spirit Awards, the art-house version of the Oscars.Sideways is written and directed by Alexander Payne, who directed the 2002 hit About Schmidt, winning Jack Nicholson his 12th Academy Award nomination.Controversial biopic Kinsey, starring Liam Neeson as sex researcher Alfred Kinsey, was one of four films to get four nominations.Among the other films receiving four nominations apiece were Brother to Brother, a drama about a young gay black man forced to live on the streets, Robbing Peter and Primer.Also in the running, with two nominations, are high school comedy Napoleon Dynamite, The Door in the Floor and Garden State - written, directed and starring Scrubs star Zach Braff alongside Natalie Portman."""

In [32]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL', 'rougeLsum'], use_stemmer=True)
scores = scorer.score(examples_text, summary)
scores

INFO:absl:Using default tokenizer.


{'rouge1': Score(precision=1.0, recall=0.47388059701492535, fmeasure=0.6430379746835443),
 'rougeL': Score(precision=0.8503937007874016, recall=0.40298507462686567, fmeasure=0.5468354430379747),
 'rougeLsum': Score(precision=0.8503937007874016, recall=0.40298507462686567, fmeasure=0.5468354430379747)}

### Deploy an endpoint

## Next Steps 

You can deploy your fine-tuned model to a SageMaker endpoint and use it for inference. Check out the [Deploy Falcon 7B & 40B on Amazon SageMaker](https://www.philschmid.de/sagemaker-falcon-llm) and [Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker](https://www.philschmid.de/sagemaker-llm-vpc) for more details.

## Debugging cells  -  TO DELETE LATER

In [27]:
#download from s3 to local
s3_client.download_file(sagemaker_session_bucket, 
                        "huggingface-qlora-2023-07-25-07-31-59-2023-07-25-07-32-09-436/output/model.tar.gz", 
                       "/home/ec2-user/SageMaker/models/falcon-7b-tuned/model.tar.gz")

In [29]:
!tar -xvf /home/ec2-user/SageMaker/models/falcon-7b-tuned/model.tar.gz --directory /home/ec2-user/SageMaker/models/falcon-7b-tuned/

pytorch_model-00002-of-00002.bin
generation_config.json
pytorch_model-00001-of-00002.bin
pytorch_model.bin.index.json
config.json


In [248]:
#s3_client.upload_file("/home/ec2-user/SageMaker/models/test-new-py-file/model.tar.gz", sagemaker_session_bucket, "model-fine-tuning/models/test-new-py-file/model.tar.gz")

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
import json

role = sagemaker.get_execution_role()

test_endpoint_name = "test2"

# create Hugging Face Model Class
huggingface_model_test = HuggingFaceModel(
   model_data="s3://sagemaker-us-east-1-327216439222/huggingface-qlora-2023-07-25-07-31-59-2023-07-25-07-32-09-436/output/model.tar.gz",
   role=role, 
   transformers_version="4.28.1", 
   pytorch_version="2.0.0", 
   py_version="py310",
   model_server_workers=1,
)

# deploy model to SageMaker Inference
predictor_test = huggingface_model_test.deploy(
    initial_instance_count=1,
    #instance_type= "local_gpu",
    instance_type= "ml.g5.12xlarge",
    wait=False,
    endpoint_name=test_endpoint_name,
)

In [None]:
print(predictor_test.endpoint_name)

In [25]:
print(prompt)
print(query_llm(payload, "test2"))

Summarize the following text:
Ad sales boost Time Warner profit. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Loading /opt/ml/model requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code\u003dTrue` to remove this error."
}
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/test2 in account 327216439222 for more information.