1. [Set Up](#1.-Set-Up)
2. [Introduction](#2.-Introduction)
3. [Run Inference on the pre-trained model](#3.-Run-Inference-on-the-pre-trained-model)
4. [Fine-Tune the pre-trained model on a custom dataset](#4.-Fine-Tune-the-pre-trained-model-on-a-custom-dataset)

## 1. Set Up
To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3.

In [1]:
!pip install sagemaker ipywidgets --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
matrix-client 0.4.0 requires urllib3~=1.21, but you have urllib3 2.0.7 which is incompatible.

In [1]:
import sagemaker, boto3, json
from sagemaker import get_execution_role

aws_role = get_execution_role()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml


***
You can continue with the default model or choose a different model from the dropdown generated by running the next cell. A complete list of JumpStart models is also available at [JumpStart Models](https://sagemaker.readthedocs.io/en/stable/algorithms/text/text_summarization_hugging_face.html#).
***

In [8]:
model_id = "huggingface-tc-bert-base-cased"

In [9]:
import IPython
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And

# Retrieves all Text Classification models available by SageMaker Built-In Algorithms.
filter_value = And("task == tc", "framework == huggingface")
tc_models = list_jumpstart_models(filter=filter_value)
# display the model-ids in a dropdown, for user to select a model.
dropdown = Dropdown(
    value=model_id,
    options=tc_models,
    description="Sagemaker Pre-Trained Text Classification Models:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select a pre-trained model from the dropdown below"))
display(dropdown)

## Select a pre-trained model from the dropdown below

Dropdown(description='Sagemaker Pre-Trained Text Classification Models:', layout=Layout(width='max-content'), …

In [10]:
dropdown.value

'huggingface-tc-bert-base-cased'

In [11]:
# model_version="*" fetches the latest version of the model.
infer_model_id, infer_model_version = dropdown.value, "*"

hub = {}
HF_MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # Pass any other HF_MODEL_ID from - https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
if infer_model_id == "huggingface-tc-models":
    hub["HF_MODEL_ID"] = HF_MODEL_ID
    hub["HF_TASK"] = "text-classification"

In [12]:
hub

{'HF_MODEL_ID': 'distilbert-base-uncased-finetuned-sst-2-english',
 'HF_TASK': 'text-classification'}

## 3. Run Inference on the pre-trained model
***
Using SageMaker, we can perform inference on the fine-tuned model. For this example, that means taking an input sentence and predicting its class label from the 2 classes of the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset. If you chose a different model from the Hugging Face [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) list, the endpoint predicts one of that model's class labels instead.
***

### 3.1. Deploy an Endpoint
***
To host the pre-trained model, we create a `JumpStartModel`, a subclass of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html), and deploy it. Given the model ID, the `JumpStartModel` retrieves the deploy_image_uri, deploy_source_uri, and base_model_uri for the pre-trained model.
***

In [14]:
from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(
    model_id=infer_model_id,
    env=hub,
    enable_network_isolation=(infer_model_id != "huggingface-tc-models"),
)
model_predictor = my_model.deploy()

--------!

### 3.2. Example input sentences for inference
***
These examples are taken from the SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html).
***

In [15]:
text1 = "astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment"
text2 = "simply stupid , irrelevant and deeply , truly , bottomlessly cynical "

### 3.3. Query endpoint and parse response
***
Input to the endpoint is a single sentence. Response from the endpoint is a dictionary containing the predicted class label, and a list of class label probabilities.
***
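As a minimal sketch of the response structure described above, the snippet below parses a hand-written JSON payload instead of a live endpoint response. The sample values are copied from the example output in this notebook, and the helper name `label_from_probabilities` is hypothetical. It assumes that the predicted label is the class with the highest probability, which holds for the example outputs shown here.

```python
import json

# Hand-written payload mimicking the verbose JSON response described above.
# The numbers are taken from the example output of this notebook.
sample_payload = json.dumps(
    {
        "probabilities": [0.0001314024266321212, 0.999868631362915],
        "labels": ["NEGATIVE", "POSITIVE"],
        "predicted_label": "POSITIVE",
    }
)


def label_from_probabilities(response):
    """Return the label whose probability is highest (hypothetical helper)."""
    probabilities = response["probabilities"]
    labels = response["labels"]
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best_index]


sample_response = json.loads(sample_payload)
recomputed_label = label_from_probabilities(sample_response)
print(recomputed_label, recomputed_label == sample_response["predicted_label"])
```

Recomputing the label this way can serve as a quick sanity check when the label order of a custom model is unclear.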

In [16]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text):
    response = model_predictor.predict(
        encoded_text, {"ContentType": "application/x-text", "Accept": "application/json;verbose"}
    )
    return response


def parse_response(query_response):
    model_predictions = query_response
    probabilities, labels, predicted_label = (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )
    return probabilities, labels, predicted_label


for text in [text1, text2]:
    query_response = query_endpoint(text.encode("utf-8"))
    probabilities, labels, predicted_label = parse_response(query_response)
    print(
        f"Inference:{newline}"
        f"Input text: '{text}'{newline}"
        f"Model prediction: {probabilities}{newline}"
        f"Labels: {labels}{newline}"
        f"Predicted Label: {bold}{predicted_label}{unbold}{newline}"
    )

Inference:
Input text: 'astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment'
Model prediction: [0.0001314024266321212, 0.999868631362915]
Labels: ['NEGATIVE', 'POSITIVE']
Predicted Label: [1mPOSITIVE[0m

Inference:
Input text: 'simply stupid , irrelevant and deeply , truly , bottomlessly cynical '
Model prediction: [0.9998064637184143, 0.000193581247003749]
Labels: ['NEGATIVE', 'POSITIVE']
Predicted Label: [1mNEGATIVE[0m



### 3.4. Clean up the endpoint

In [18]:
# Delete the SageMaker endpoint and the attached resources
model_predictor.delete_model()
model_predictor.delete_endpoint()

## 4. Fine-Tune the pre-trained model on a custom dataset
***
We support fine-tuning any pre-trained model available on Hugging Face [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) and [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads). However, only the models in the dropdown list can be fine-tuned in network isolation. If you can't find the model you want to fine-tune in the dropdown list, select huggingface-tc-models in the dropdown above and specify the ID of any Hugging Face [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) or [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) model in the HF_MODEL_ID variable below.

***

In [None]:
HF_MODEL_ID = "distilbert-base-uncased"  # Specify the HF_MODEL_ID here from https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads or https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads

***
Previously, we saw how to run inference on a fine-tuned model. Next, we discuss how a model can be fine-tuned on a custom dataset with any number of classes.

The Text Embedding model can be fine-tuned on any text classification dataset in the same way the
model available for inference has been fine-tuned on the SST2 movie review dataset.

The model available for fine-tuning attaches a classification layer to the Text Embedding model
and initializes the layer parameters to random values.
The output dimension of the classification layer is determined based on the number of classes
detected in the input data. The fine-tuning step fine-tunes all the model
parameters to minimize prediction error on the input data and returns the fine-tuned model.
The model returned by fine-tuning can be further deployed for inference.
Below are the instructions for how the training data should be formatted for input to the model.


- **Input:** A directory containing a 'data.csv' file.
    - Each row of the first column of 'data.csv' should have an integer class label, starting at 0 (for example, 0 and 1 for the two SST2 classes).
    - Each row of the second column should have the corresponding text.
- **Output:** A trained model that can be deployed for inference.

Below is an example of a 'data.csv' file showing the values in its first two columns. Note that the file should not have a header row.

|   |   |
|---|---|
|0	|hide new secretions from the parental units|
|0	|contains no wit , only labored gags|
|1	|that loves its characters and communicates something rather beautiful about human nature|
|...|...|

The SST2 dataset is downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://jumpstart-cache-prod-us-west-2.s3-us-west-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html).
***
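The input format described above can be sketched with the standard `csv` module. This is only an illustration under stated assumptions: the rows are hypothetical (the first three texts come from the example table, the last one is made up to show quoting), and a temporary directory stands in for the training data directory. The snippet writes a header-less 'data.csv' with the label in the first column and the text in the second, then reads it back and counts the distinct labels, which is the quantity that determines the output dimension of the classification layer.

```python
import csv
import os
import tempfile

# Hypothetical rows: (integer class label, text). The first three texts come
# from the example table above; the last one is made up to show CSV quoting.
rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
    (1, 'a "must-see" , charming film'),
]

data_dir = tempfile.mkdtemp()  # stand-in for the training data directory
data_path = os.path.join(data_dir, "data.csv")

# Write the label first, the text second, and no header row.
with open(data_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the file back and count the distinct labels.
with open(data_path, newline="", encoding="utf-8") as f:
    parsed = [(int(label), text) for label, text in csv.reader(f)]

num_classes = len({label for label, _ in parsed})
print(f"{len(parsed)} rows, {num_classes} classes")
```

Using `csv.writer` rather than manual string joins means that texts containing commas or quotes are quoted correctly, as in the second line of the sample data shown later in this notebook.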

### 4.1. Selecting a Model

In [19]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = dropdown.value, "*"
training_instance_type = "ml.p3.2xlarge"

### 4.2. Set Training parameters
***
Now that we are done with all the setup that is needed, we are ready to fine-tune our Text Classification model. To begin, let us create a [`sagemaker.estimator.Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job.

There are two kinds of parameters that need to be set for training. 

The first set consists of the parameters for the training job. These include: (i) Training data path: the S3 folder in which the input data is stored. (ii) Output path: the S3 folder in which the training output is stored. (iii) Training instance type: the type of machine on which to run the training. Typically, we use GPU instances for this kind of training. We defined the training instance type above to fetch the correct train_image_uri.
***
The second set consists of algorithm-specific training hyper-parameters. They are also used for specifying the model name if we want to fine-tune a model that is not present in the dropdown list.
***

In [20]:
# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

In [25]:
# download the training data into local directory for inspection
!aws s3 cp {training_dataset_s3_path} ./ --recursive

download: s3://jumpstart-cache-prod-us-east-1/training-datasets/SST/data.csv to ./data.csv


In [29]:
# check the first few lines of the training data and count the number of lines in the training data
!head -n 5 ./data.csv && wc -l ./data.csv

0,hide new secretions from the parental units 
0,"contains no wit , only labored gags "
1,that loves its characters and communicates something rather beautiful about human nature 
0,remains utterly satisfied to remain the same throughout 
0,on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 
68221 ./data.csv


***
For algorithm-specific hyper-parameters, we start by fetching a Python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. These can then be overridden with custom values.
***
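A minimal sketch of the override step, using a plain dictionary instead of a live call to the SDK. The defaults below are a subset of the values shown in this notebook's output, where every value is a string; the helper `with_overrides` is hypothetical and not part of the SageMaker SDK. It rejects unknown names, which catches typos early, and converts every override to a string.

```python
# Subset of the default values shown in this notebook's output; in practice
# the full dictionary comes from hyperparameters.retrieve_default(...).
default_hyperparameters = {
    "epochs": "3",
    "learning_rate": "2e-05",
    "eval_batch_size": "8",
    "train_only_top_layer": "False",
}


def with_overrides(defaults, **overrides):
    """Return a copy of defaults with overrides applied as strings (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown hyperparameters: {unknown}")
    updated = dict(defaults)
    updated.update({name: str(value) for name, value in overrides.items()})
    return updated


custom_hyperparameters = with_overrides(default_hyperparameters, epochs=5, learning_rate=3e-5)
print(custom_hyperparameters)
```

Returning a copy leaves the defaults untouched, so the cell can be re-run with different overrides without fetching the defaults again.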

In [21]:
from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["batch_size"] = "64"

# Set eval_accumulation_steps to a smaller value in hyperparameters if you get a CUDA out-of-memory error during evaluation.
# This moves the accumulated predictions from the GPU to the CPU more frequently and frees GPU memory.
# hyperparameters['eval_accumulation_steps'] = "10"

***
We will use the HF_MODEL_ID passed earlier here to support all the Hugging Face [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) and [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) models.
***

In [30]:
if model_id == "huggingface-tc-models":
    hyperparameters["hub_key"] = HF_MODEL_ID

In [31]:
hyperparameters

{'epochs': '3',
 'learning_rate': '2e-05',
 'batch_size': '64',
 'eval_batch_size': '8',
 'eval_accumulation_steps': 'None',
 'reinitialize_top_layer': 'Auto',
 'train_only_top_layer': 'False',
 'hub_key': 'distilbert-base-uncased-finetuned-sst-2-english'}

### 4.3. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) <a id='AMT'></a>
***
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.
***
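To make the idea concrete, here is a toy sketch of searching a learning rate on a logarithmic scale and keeping the value with the best objective metric. It is not how SageMaker implements tuning: it uses plain random search, and `toy_validation_accuracy` is a made-up function standing in for a real training job. The range of 1e-5 to 1e-4 and the job count of 6 are illustrative assumptions.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_validation_accuracy(learning_rate):
    """Made-up objective that peaks at a learning rate of 3e-5 (not a real model)."""
    return 1.0 - 0.1 * abs(math.log10(learning_rate) - math.log10(3e-5))


rng = random.Random(0)
max_jobs = 6

# Each "job" tries one sampled learning rate and records the objective metric.
trials = []
for _ in range(max_jobs):
    learning_rate = sample_log_uniform(rng, 1e-5, 1e-4)
    trials.append((toy_validation_accuracy(learning_rate), learning_rate))

# The best job has the highest metric, since the objective is maximized.
best_accuracy, best_learning_rate = max(trials)
print(f"best learning rate: {best_learning_rate:.2e}, accuracy: {best_accuracy:.4f}")
```

Sampling in log space spreads the trials evenly across orders of magnitude, which is why a logarithmic scale is a common choice for learning rates.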

In [32]:
from sagemaker.tuner import ContinuousParameter

# Use AMT for tuning and selecting the best model
use_amt = False

# Define objective metric, based on which the best model will be selected, the regex captures any sequence of digits and dots within the input string
amt_metric_definitions = {
    "metrics": [{"Name": "val_accuracy", "Regex": "'eval_accuracy': ([0-9\\.]+)"}],
    "type": "Maximize",
}

# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model.(https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.00001, 0.0001, scaling_type="Logarithmic")
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2

### 4.4. Start Training
***
We start by creating the estimator object with all the required assets and then launch the training job.
***

In [None]:
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner
from sagemaker.jumpstart.estimator import JumpStartEstimator


training_metric_definitions = [
    {"Name": "val_accuracy", "Regex": "'eval_accuracy': ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "'eval_loss': ([0-9\\.]+)"},
    {"Name": "train_loss", "Regex": "'loss': ([0-9\\.]+)"},
    {"Name": "val_f1", "Regex": "'eval_f1': ([0-9\\.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9\\.]+)"},
]


# Create SageMaker Estimator instance
tc_estimator = JumpStartEstimator(
    hyperparameters=hyperparameters,
    model_id=dropdown.value,
    instance_type=training_instance_type,
    metric_definitions=training_metric_definitions,
    output_path=s3_output_location,
    enable_network_isolation=(model_id != "huggingface-tc-models"),
)

if use_amt:
    hp_tuner = HyperparameterTuner(
        tc_estimator,
        amt_metric_definitions["metrics"][0]["Name"],
        hyperparameter_ranges,
        amt_metric_definitions["metrics"],
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
        objective_type=amt_metric_definitions["type"],
        base_tuning_job_name="jumpstart-example-tc",  # literal base name: training_job_name is not defined until section 4.5
    )

    # Launch a SageMaker Tuning job to search for the best hyperparameters
    hp_tuner.fit({"training": training_dataset_s3_path})
else:
    # Launch a SageMaker Training job by passing s3 path of the training data
    tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

### 4.5. Extract Training performance metrics
***
Performance metrics such as training loss and validation accuracy/loss can be accessed through CloudWatch during training. We can also fetch these metrics and analyze them within the notebook.
***
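These metrics are extracted from the training logs with the regular expressions given in the metric definitions. The sketch below applies three of those expressions to a hypothetical log line with Python's `re` module. The log line is an assumption made for illustration (real training logs may be formatted differently), and `extract_metrics` is a hypothetical helper, not an SDK function.

```python
import re

# Same regular expressions as the metric definitions used for training.
metric_definitions = [
    {"Name": "val_accuracy", "Regex": "'eval_accuracy': ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "'eval_loss': ([0-9\\.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9\\.]+)"},
]

# Hypothetical log line; real training logs may be formatted differently.
log_line = "{'eval_loss': 0.05678, 'eval_accuracy': 0.982851, 'epoch': 1.0}"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line, metric_definitions))
```

Testing a regular expression locally like this is a cheap way to confirm that a custom metric definition captures the intended number before launching a training job.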

In [34]:
from sagemaker import TrainingJobAnalytics

if use_amt:
    training_job_name = hp_tuner.best_training_job()
else:
    training_job_name = tc_estimator.latest_training_job.job_name


df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

Unnamed: 0,timestamp,metric_name,value
0,0.0,val_accuracy,0.982851
1,120.0,val_accuracy,0.981825
2,240.0,val_accuracy,0.982191
3,0.0,val_loss,0.05678
4,120.0,val_loss,0.065166
5,240.0,val_loss,0.077473
6,0.0,train_loss,0.0584
7,60.0,train_loss,0.055317
8,120.0,train_loss,0.042143
9,180.0,train_loss,0.037867


### 4.6. Deploy & run Inference on the fine-tuned model
***
A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class label of an input sentence. We follow the same steps as in [3. Run inference on the pre-trained model](#3.-Run-Inference-on-the-pre-trained-model). We start by retrieving the artifacts for deploying an endpoint. However, instead of the pre-trained model, we deploy the `tc_estimator` that we fine-tuned.
***

In [35]:
inference_instance_type = "ml.p2.xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

endpoint_name = name_from_base(f"jumpstart-example-FT-{model_id}-")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = (hp_tuner if use_amt else tc_estimator).deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name,
)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-705247044519/jumpstart-example-tc-training/output/hf-tc-models-2024-03-14-08-04-44-158/output/model.tar.gz), script artifact (s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/huggingface/inference/tc/v1.0.1/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-705247044519/hf-tc-models-2024-03-14-09-43-17-641/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: hf-tc-models-2024-03-14-09-43-17-641
INFO:sagemaker:Creating endpoint-config with name jumpstart-example-FT-huggingface-tc-mod-2024-03-14-09-43-17-640
INFO:sagemaker:Creating endpoint with name jumpstart-example-FT-huggingface-tc-mod-2024-03-14-09-43-17-640


----------!

---
Next, we input example sentences for running inference.
These examples are taken from the SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html).

---

In [36]:
text1 = "astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment"
text2 = "simply stupid , irrelevant and deeply , truly , bottomlessly cynical "

---
Next, we query the fine-tuned model, parse the response, and print the predictions.

---

In [37]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text):
    response = finetuned_predictor.predict(
        encoded_text, {"ContentType": "application/x-text", "Accept": "application/json;verbose"}
    )
    return response


def parse_response(query_response):
    model_predictions = query_response
    probabilities, labels, predicted_label = (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )
    return probabilities, labels, predicted_label


for text in [text1, text2]:
    query_response = query_endpoint(text.encode("utf-8"))
    probabilities, labels, predicted_label = parse_response(query_response)
    print(
        f"Inference:{newline}"
        f"Input text: '{text}'{newline}"
        f"Model prediction: {probabilities}{newline}"
        f"Labels: {labels}{newline}"
        f"Predicted Label: {bold}{predicted_label}{unbold}{newline}"
    )

Inference:
Input text: 'astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment'
Model prediction: [0.00019735825480893254, 0.9998026490211487]
Labels: ['NEGATIVE', 'POSITIVE']
Predicted Label: [1mPOSITIVE[0m

Inference:
Input text: 'simply stupid , irrelevant and deeply , truly , bottomlessly cynical '
Model prediction: [0.9995166063308716, 0.00048344270908273757]
Labels: ['NEGATIVE', 'POSITIVE']
Predicted Label: [1mNEGATIVE[0m



In [38]:
# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: hf-tc-models-2024-03-14-09-43-17-641
INFO:sagemaker:Deleting endpoint configuration with name: jumpstart-example-FT-huggingface-tc-mod-2024-03-14-09-43-17-640
INFO:sagemaker:Deleting endpoint with name: jumpstart-example-FT-huggingface-tc-mod-2024-03-14-09-43-17-640
