# Document Understanding Solution - Question Answering

Question Answering is useful when you want to query a large amount of text for specific information. Maybe you're interested in extracting the
date a certain event happened. You can construct a question (or query) in natural language to retrieve this information, e.g. 'When did Company X release Product Y?'. Similar to the extractive summarization we saw in the last notebook, Question Answering returns a verbatim slice of the
text as the answer. It won't generate new words to answer the question. In this notebook, we demonstrate three use cases of Question Answering:

1. How to directly deploy a pretrained Transformer-based extractive question answering model to perform inference.
2. How to fine-tune a pre-trained Transformer model on a custom dataset, and then run inference on the fine-tuned model.
3. How to run [SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) (a hyperparameter optimization procedure) to search for a better model than the one fine-tuned in point 2. Both the tuned model and the model fine-tuned in point 2 are evaluated on hold-out test data.

**Note**: When running this notebook on SageMaker Studio, you should make
sure the `PyTorch 1.10 Python 3.8 CPU Optimized` image/kernel is used. When
running this notebook on a SageMaker Notebook Instance, you should make
sure the `sagemaker-soln` kernel is used.

This solution relies on a config file to run the provisioned AWS resources. Run the cell below to generate that file.

In [2]:
import boto3
import os
import json

client = boto3.client('servicecatalog')

# The provisioned product name is the folder directly under 'S3Downloads'
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]

# Look up the outputs of the last successful provisioning record
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

# Keep only the outputs that carry both a key and a value
stack_output = {
    x['OutputKey']: x['OutputValue']
    for x in record['RecordOutputs']
    if 'OutputKey' in x and 'OutputValue' in x
}

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)

## 1. Set Up

Before executing the notebook, some initial setup steps are required. This notebook requires the latest versions of `sagemaker` and `ipywidgets`.

In [3]:
!pip install -U sagemaker ipywidgets --find-links file://$PWD/../wheelhouse

Looking in links: file:///root/S3Downloads/jumpstart-prod-doc_ewrtgp/notebooks/../wheelhouse

We start by importing a variety of packages that will be used throughout
the notebook. One of the most important packages is the Amazon SageMaker
Python SDK (i.e. `import sagemaker`). We also import modules from our own
custom (and editable) package that can be found at `../package`.

In [4]:
import sys
import boto3
import json
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.local import LocalSession

sys.path.insert(0, '../package')
from package import config, utils

aws_role = config.IAM_ROLE
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

Up next, we define the current folder and create a SageMaker client (from
`boto3`). We can use the SageMaker client to call SageMaker APIs
directly, as an alternative to using the Amazon SageMaker SDK. We'll use
it at the end of the notebook to delete certain resources that are
created in this notebook.

In [5]:
current_folder = utils.get_current_folder(globals())
sagemaker_client = boto3.client('sagemaker')

## 2. Run inference on the pre-trained extractive question answering model

Our question answering system needs a machine learning model. In this
section, we'll deploy a model to an Amazon SageMaker Endpoint and then
invoke the endpoint from the notebook. We'll use a pre-trained model from
the [transformers](https://huggingface.co/transformers/) library instead
of training a model from scratch, specifically the BERT Large model that
has been pre-trained on the SQuAD dataset.

We'll use the unique solution prefix to name the model and endpoint.

In [6]:
endpoint_name = f"{config.SOLUTION_PREFIX}-question-answering-endpoint"

### 2.1. Deploy an endpoint

Up next, we need to define the Amazon SageMaker Model, which references
the source code and specifies which container to use.

Our pre-trained model is the Extractive Question Answering model [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad), built on a Transformer model from Hugging Face. It takes two strings as inputs: the first string is a question and the second string is the context, i.e. any text in which you want to find the answer to the question. It returns a sub-string of the context as the answer to the question.

We use the `PyTorchModel` from the Amazon SageMaker Python SDK. Using `PyTorchModel` and setting the `framework_version` argument means that our deployed model will run inside a container that has PyTorch pre-installed. Other requirements can be installed by defining a `requirements.txt` file at the specified `source_dir` location. We use the `entry_point` argument to reference the code (within `source_dir`) that should be run for model inference: functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` are expected to be defined. Lastly, you can pass `model_data` from a training job, but we are going to load the pre-trained model in the source code running on the endpoint (i.e., the model is downloaded on the fly). We still
need to provide `model_data`, so we pass an empty archive.
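
As a rough illustration of the four handler functions named above, the sketch below wires them together with a toy stand-in for the model. It is not the solution's actual `entry_point.py` (which lives in `../containers/question_answering` and is not shown here): the `toy_model` logic, the response shape, and the content-type handling are assumptions chosen only to make the request flow visible.

```python
import json


def model_fn(model_dir):
    """Load the model once when the container starts.

    Assumption: a toy callable stands in for the real Transformer pipeline.
    """
    def toy_model(question, context):
        # Hypothetical heuristic: return the last word of the context as the answer.
        start = context.rfind(" ") + 1
        return {
            "answers": [
                {"score": 1.0, "start": start, "end": len(context), "answer": context[start:]}
            ]
        }

    return toy_model


def input_fn(request_body, request_content_type):
    """Deserialize the request payload into a Python object."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_data, model):
    """Run the model on the deserialized input."""
    return model(input_data["question"], input_data["context"])


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


# Simulate what the serving stack does for one request.
loaded_model = model_fn(model_dir="/opt/ml/model")
request = json.dumps({"question": "what is my name?", "context": "my name is thom"})
parsed_input = input_fn(request, "application/json")
response_body = output_fn(predict_fn(parsed_input, loaded_model), "application/json")
print(response_body)
```

The split into four functions keeps loading, parsing, prediction, and serialization separate, so the model is loaded once while the other three run on every request.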

In [7]:
model = PyTorchModel(
    model_data=f"{config.SOURCE_S3_PATH}/artifacts/models/empty.tar.gz",
    entry_point="entry_point.py",
    source_dir="../containers/question_answering",
    role=config.IAM_ROLE,
    framework_version="1.5.0",
    py_version="py3",
    code_location="s3://" + config.S3_BUCKET + "/code",
    env={
        "MODEL_ASSETS_S3_BUCKET": config.SOURCE_S3_BUCKET,
        "MODEL_ASSETS_S3_PREFIX": f"{config.SOURCE_S3_PREFIX}/artifacts/models/question_answering/",
        "MMS_DEFAULT_RESPONSE_TIMEOUT": "3000",
    },
)

Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a
dedicated instance. We choose to deploy the endpoint on a single
ml.p3.2xlarge instance (or ml.g4dn.2xlarge if unavailable in this
region). Our question answering model is a transformer that
benefits from GPU acceleration, and an ml.p3.2xlarge has a high
performance NVIDIA V100 GPU that can reduce inference latency on each
request. You can expect this deployment step to take around 5 minutes.
After a series of dashes (approximately 15), you can expect to see an exclamation mark,
which indicates a successful deployment.

In [8]:
import time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = model.deploy(
    endpoint_name=endpoint_name,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

time.sleep(10)

-------!

When you're trying to update the model for development purposes, but
experiencing issues because the model/endpoint-config/endpoint already
exists, you can delete the existing model/endpoint-config/endpoint by
uncommenting and running the following commands:

In [9]:
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)

When calling our new endpoint from the notebook, we use an Amazon
SageMaker SDK
[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).
A `Predictor` is used to send data to an endpoint (as part of a request)
and interpret the response. Our `model.deploy` command returned a
`Predictor` that, by default, would send and receive numpy arrays. Our
endpoint expects to receive (and also sends) JSON formatted objects, so
we passed a `JSONSerializer` and a `JSONDeserializer` to `model.deploy` to use JSON instead of the PyTorch endpoint
default of numpy arrays. JSON is used here because it is a standard
endpoint format and the endpoint response can contain nested data
structures.

### 2.2. Example input sentences for inference & Query endpoint

With our model successfully deployed and our predictor configured, we can
try out the question answering model on example inputs. All we need
to do is construct a dictionary object with two keys. `context` is the
text that we wish to retrieve information from. `question` is the natural
language query which specifies what information we're interested in
extracting. We call `predict` on our predictor and we should get a
response from the endpoint that contains the most likely answers.

In [10]:
data = {'question': 'what is my name?', 'context': "my name is thom"}
response = predictor.predict(data=data)

We have the response and we can print out the most likely answers that
have been extracted from the text above. You'll see each answer has a
confidence score used for ranking (but this score shouldn't be
interpreted as a true probability). In addition to the verbatim answer,
you also get the start and end character indexes of the answer in the
original context.

In [11]:
print(response['answers'])

[{'score': 0.9793591499328613, 'start': 11, 'end': 15, 'answer': 'thom'}, {'score': 0.02019440196454525, 'start': 0, 'end': 15, 'answer': 'my name is thom'}, {'score': 4.349117443780415e-05, 'start': 3, 'end': 15, 'answer': 'name is thom'}]
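
Because the model is extractive, each answer should be recoverable by slicing the context with the returned character offsets. The sketch below checks this for the response printed above. It assumes `end` is an exclusive index, which is consistent with this output, and it hard-codes the answers so that it can run without the endpoint.

```python
context = "my name is thom"

# Answers copied from the endpoint response printed above.
answers = [
    {"score": 0.9793591499328613, "start": 11, "end": 15, "answer": "thom"},
    {"score": 0.02019440196454525, "start": 0, "end": 15, "answer": "my name is thom"},
    {"score": 4.349117443780415e-05, "start": 3, "end": 15, "answer": "name is thom"},
]

for item in answers:
    span = context[item["start"]:item["end"]]
    # Every answer is a verbatim slice of the context.
    assert span == item["answer"]
    print(f"{item['score']:.4f}  [{item['start']}:{item['end']}]  {span!r}")

# Pick the highest-scoring answer explicitly rather than relying on list order.
best = max(answers, key=lambda item: item["score"])
```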


You can try more examples above, but note that this model has been
pretrained on the SQuAD dataset. You may need to fine-tune this model
with your own question answering data to obtain better results.

### 2.3. Clean up the endpoint

When you've finished with the question answering endpoint (and associated
endpoint-config), make sure that you delete it to avoid accidental
charges.

In [12]:
# Delete the SageMaker endpoint and the attached resources
predictor.delete_model()
predictor.delete_endpoint()

## 3. Finetune the pre-trained model on a custom dataset

Previously, we saw how to run inference on a pre-trained extractive question answering model. Next, we discuss how a model can be fine-tuned on a custom dataset.

The Text Embedding model can be fine-tuned on any extractive question 
answering dataset in the same way the model available for inference has been 
fine-tuned on the SQuAD2.0 dataset.
The model available for fine-tuning attaches an answer extracting layer to the Text Embedding model
and initializes the layer parameters to random values. The fine-tuning step fine-tunes 
all the model parameters to minimize prediction error on the input data and returns the fine-tuned model.
The model returned by fine-tuning can be further deployed for inference. Below are the instructions 
for how the training data should be formatted for input to the model. 

- **Input:**  A directory containing a 'data.csv' file.
    - The first column of the 'data.csv' should have a question.
    - The second column should have the corresponding context.
    - The third column should have the integer character starting position for the answer in the context.
    - The fourth column should have the integer character ending position for the answer in the context.
- **Output:** A trained model that can be deployed for inference. 
 
Below is an example of 'data.csv' file showing values in its first four columns. Note that the file should not have any header.

|   |  |  |   |
|---|---|---|---|
|In what country is Normandy located?|	The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.|	159|	165
|...   | ... |...  | ... |
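
To make the expected layout concrete, the following sketch builds rows in this format from (question, context, answer text) triples using only the standard library. The helper name `make_row` is an illustrative assumption, the context is shortened to its first sentence, and `str.find` locates only the first occurrence of the answer, which may not be the intended span in longer contexts.

```python
import csv
import io


def make_row(question, context, answer_text):
    """Return [question, context, start, end] with character offsets (end exclusive)."""
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("The answer must appear verbatim in the context.")
    return [question, context, start, start + len(answer_text)]


context = (
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people "
    "who in the 10th and 11th centuries gave their name to Normandy, a region in France."
)
rows = [
    make_row("In what country is Normandy located?", context, "France"),
    make_row("When were the Normans in Normandy?", context, "10th and 11th centuries"),
]

# Write the rows without a header, matching the format described above.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The computed offsets for the first row agree with the example above (159 and 165), since the answer lies within the first sentence of the context.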
 

The SQuAD2.0 dataset is downloaded from the
[Dataset Homepage](https://rajpurkar.github.io/SQuAD-explorer/) and is distributed under the
[CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

### 3.1. Retrieve JumpStart Training artifacts
Here, for the selected model, we retrieve the training docker container, the training algorithm source, the pre-trained model, and a python dictionary of the training hyper-parameters that the algorithm accepts with their default values. Note that `model_version="*"` fetches the latest version of the model. We also need to specify the `training_instance_type` to fetch `train_image_uri`.

You can continue with the default model id, or choose a different model from the list printed by the manifest cell below.

In [13]:
model_id = 'pytorch-eqa-bert-base-uncased'

In [14]:
# Download the JumpStart models manifest file.
boto3.client("s3").download_file(
    f"jumpstart-cache-prod-{aws_region}", "models_manifest.json", "models_manifest.json"
)
with open("models_manifest.json", "rb") as json_file:
    model_list = json.load(json_file)

# Keep only the extractive question answering ("-eqa-") models from the manifest list.
eqa_models_all_versions = [
    model["model_id"] for model in model_list if "-eqa-" in model["model_id"]
]
# Drop duplicate ids (one entry per model version) while preserving order.
eqa_models = list(dict.fromkeys(eqa_models_all_versions))

print("All the other available extractive question answering models are as below.\n")
for each in eqa_models:
    print(each)

All the other available extractive question answering models are as below.

huggingface-eqa-bert-base-cased
huggingface-eqa-bert-base-multilingual-cased
huggingface-eqa-bert-base-multilingual-uncased
huggingface-eqa-bert-base-uncased
huggingface-eqa-bert-large-cased
huggingface-eqa-bert-large-cased-whole-word-masking
huggingface-eqa-bert-large-uncased
huggingface-eqa-bert-large-uncased-whole-word-masking
huggingface-eqa-distilbert-base-cased
huggingface-eqa-distilbert-base-multilingual-cased
huggingface-eqa-distilbert-base-uncased
huggingface-eqa-distilroberta-base
huggingface-eqa-roberta-base
huggingface-eqa-roberta-base-openai-detector
huggingface-eqa-roberta-large
pytorch-eqa-bert-base-cased
pytorch-eqa-bert-base-multilingual-cased
pytorch-eqa-bert-base-multilingual-uncased
pytorch-eqa-bert-base-uncased
pytorch-eqa-bert-large-cased
pytorch-eqa-bert-large-cased-whole-word-masking
pytorch-eqa-bert-large-cased-whole-word-masking-finetuned-squad
pytorch-eqa-bert-large-uncased
pytorch-eq

In [61]:
model_id = "huggingface-eqa-bert-base-uncased" 
# model_id = "pytorch-eqa-bert-base-uncased"

In [62]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_version = "*"
training_instance_type = config.TRAINING_INSTANCE_TYPE

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


### 3.2. Set Training parameters

Now that we are done with all the setup that is needed, we are ready to fine-tune our extractive question answering model. To begin, let us create a [``sagemaker.estimator.Estimator``](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job.

There are two kinds of parameters that need to be set for training. 

The first set consists of the parameters for the training job. These include: (i) Training data path: the S3 folder in which the input data is stored. (ii) Output path: the S3 folder in which the training output is stored. (iii) Training instance type: the type of machine on which to run the training. Typically, we use GPU instances for this kind of training. We defined the training instance type above to fetch the correct `train_image_uri`.

The second set consists of the algorithm-specific training hyper-parameters.

In [17]:
# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
# For a quick demonstration of training we have created a random subset of the SQuAD-v2 dataset.
# For the complete SQuAD-v2 dataset replace "SQuAD-v2-tiny" with "SQuAD-v2" in the line below.
training_data_prefix = "training-datasets/SQuAD-v2-tiny/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = config.S3_BUCKET
output_prefix = "EQA"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

For the algorithm-specific hyper-parameters, we start by fetching a Python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. These can then be overridden with custom values.

In [18]:
from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["batch-size"] = "16"
print(hyperparameters)

{'epochs': '3', 'adam-learning-rate': '2e-05', 'batch-size': '16', 'reinitialize-top-layer': 'Auto', 'train-only-top-layer': 'False'}
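
One way to guard against typos when overriding values is to reject keys that the algorithm does not list. The sketch below does this for a plain dictionary. The helper `override_hyperparameters` is hypothetical (it is not part of the SageMaker SDK), and the starting values are copied from the output above, i.e. after the `batch-size` override.

```python
def override_hyperparameters(defaults, overrides):
    """Return a copy of `defaults` updated with `overrides`.

    Unknown keys raise an error; values are stored as strings, matching the defaults.
    """
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unsupported hyperparameters: {unknown}")
    updated = dict(defaults)
    updated.update({key: str(value) for key, value in overrides.items()})
    return updated


# Values as printed above for the selected model.
defaults = {
    "epochs": "3",
    "adam-learning-rate": "2e-05",
    "batch-size": "16",
    "reinitialize-top-layer": "Auto",
    "train-only-top-layer": "False",
}

custom = override_hyperparameters(defaults, {"epochs": 5, "batch-size": 8})
print(custom)
```

A misspelled key such as `"batchsize"` raises a `KeyError` here instead of being silently passed along to the training job.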


### 3.3. Download, preprocess, and upload the training data

In [19]:
!aws s3 cp --recursive $training_dataset_s3_path ../data/squad2

download: s3://jumpstart-cache-prod-us-east-1/training-datasets/SQuAD-v2-tiny/data.csv to ../data/squad2/data.csv


In [20]:
import pandas as pd
data = pd.read_csv("../data/squad2/data.csv", header=None)

View the first five observations of the training data:

In [21]:
data.head(5)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | In what country is Normandy located? | The Normans (Norman: Nourmands; French: Norman... | 159 | 165 |
| 1 | In what country is Normandy located? | The Normans (Norman: Nourmands; French: Norman... | 159 | 165 |
| 2 | In what country is Normandy located? | The Normans (Norman: Nourmands; French: Norman... | 159 | 165 |
| 3 | In what country is Normandy located? | The Normans (Norman: Nourmands; French: Norman... | 159 | 165 |
| 4 | When were the Normans in Normandy? | The Normans (Norman: Nourmands; French: Norman... | 94 | 117 |


In [22]:
from sklearn.model_selection import train_test_split

In [23]:
train_data, test_data = train_test_split(data, test_size=0.15, random_state=42)

In [24]:
train_data.to_csv("../data/squad2/split_train.csv", header=False, index=False)

In [25]:
import os

prefix = "EQA"
boto3.Session().resource("s3").Bucket(config.S3_BUCKET).Object(
    os.path.join(prefix, "train/data.csv")
).upload_file("../data/squad2/split_train.csv")

Next, we process the test data to prepare it for inference. Questions with multiple answers appear as multiple rows in `data` above. For example, one question with three answers yields three rows, each with the same question and one of the answers. For evaluation, we combine the rows that correspond to the same question, so that each test example has a unique question and one or more ground truth answers. For each test example, which consists of one question and its context, the model outputs a predicted answer. We then compare the predicted answer with each of the ground truth answers and use the best comparison result as the model's score on that example. Details are shown in the inference section below.

In [26]:
unique_test_examples = {}

for idx, row in test_data.iterrows():
    if row[0] not in unique_test_examples:
        unique_test_examples[row[0]] = {
            "context": row[1],
            "answer": [row[1][row[2]: row[3]]]
        }
    else:
        assert row[1] == unique_test_examples[row[0]]["context"]
        unique_test_examples[row[0]]["answer"].append(row[1][row[2]: row[3]])

In [27]:
test_examples = []
ground_truth = []
for key in unique_test_examples:
    test_examples.append([key, unique_test_examples[key]["context"]])
    ground_truth.append(unique_test_examples[key]["answer"])

### 3.4. Fine-tuning without hyperparameter optimization

In [28]:
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner

from sagemaker import get_execution_role

role = get_execution_role()

training_job_name = f"{config.SOLUTION_PREFIX}-eqa-finetune"

# Create SageMaker Estimator instance
eqa_estimator = Estimator(
    role=role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    base_job_name=training_job_name,
    debugger_hook_config=False,
)


training_data_path_updated = f"s3://{config.S3_BUCKET}/{prefix}/train"
# Launch a SageMaker Training job by passing s3 path of the training data
eqa_estimator.fit({"training": training_data_path_updated}, logs=True)

INFO:sagemaker:Creating training-job with name: sagemaker-soln-documents-js-ewrtgp-eqa--2023-06-28-02-51-35-109


2023-06-28 02:51:35 Starting - Starting the training job...
2023-06-28 02:51:51 Starting - Preparing the instances for training......
2023-06-28 02:52:57 Downloading - Downloading input data...
2023-06-28 02:53:32 Training - Downloading the training image............
2023-06-28 02:55:18 Training - Training image download completed. Training in progress.....bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-28 02:55:55,954 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-28 02:55:55,982 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-28 02:55:55,984 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-06-28 02:55:56,251 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameter

### 3.5. Deploy & run Inference on the fine-tuned model

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the answer to a question given its context. We follow the same steps as in `2. Run inference on the pre-trained extractive question answering model`. We start by retrieving the JumpStart artifacts for deploying an endpoint. However, instead of the pre-trained model, we deploy the `eqa_estimator` that we fine-tuned.

In [None]:
inference_instance_type = config.HOSTING_INSTANCE_TYPE

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

endpoint_name_finetune = f"{config.SOLUTION_PREFIX}-eqa-finetune-endpoint-1"

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = eqa_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name_finetune,
)

time.sleep(10)

In [33]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text, predictor):
    response = predictor.predict(
        encoded_text, {"ContentType": "application/list-text", "Accept": "application/json;verbose"}
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response)
    answer = (model_predictions["answer"],)
    return answer
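
The request body sent by `query_endpoint` is a JSON list of the form `[question, context]` encoded as UTF-8 bytes. The sketch below shows that encoding and the matching response parsing in isolation. The helper names are illustrative, and the sample response body is a made-up stand-in (a real response may carry additional fields), so it runs without an endpoint.

```python
import json


def encode_example(question, context):
    """Encode one [question, context] pair as sent with 'application/list-text'."""
    return json.dumps([question, context]).encode("utf-8")


def extract_answer(response_body):
    """Pull the answer string out of a JSON response body (bytes or str)."""
    return json.loads(response_body)["answer"]


payload = encode_example("what is my name?", "my name is thom")
print(payload)

# Hypothetical response body, used only to exercise the parsing step.
fake_response = b'{"answer": "thom"}'
print(extract_answer(fake_response))
```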

In [34]:
predictions = []
for question_context in test_examples:
    query_response = query_endpoint(json.dumps(question_context).encode("utf-8"), finetuned_predictor)
    answer = parse_response(query_response)
    predictions.append(answer[0])

In [35]:
# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)
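
A small worked example helps to see how the two metrics behave. The sketch below is self-contained, so it re-implements a compact normalization, and it uses the multiset (`collections.Counter`) token overlap found in the official SQuAD evaluation script, whereas `compute_f1` above uses a set intersection; the two agree whenever no token is repeated. The function names ending in `_sketch` are illustrative.

```python
import re
import string
from collections import Counter


def normalize_sketch(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_sketch(prediction, truth):
    pred_tokens = normalize_sketch(prediction).split()
    truth_tokens = normalize_sketch(truth).split()
    if not pred_tokens or not truth_tokens:
        # No-answer case: score 1 only if both are empty.
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_sketch(prediction, truth):
    return int(normalize_sketch(prediction) == normalize_sketch(truth))


gold_answers = ["France", "a region in France"]
prediction = "in France"

# Score against every gold answer and keep the best result for the example.
best_em = max(exact_match_sketch(prediction, gold) for gold in gold_answers)
best_f1 = max(f1_sketch(prediction, gold) for gold in gold_answers)
print(best_em, round(best_f1, 3))
```

Against "France" the prediction has precision 1/2 and recall 1, giving F1 = 2/3. Against "a region in France" (the article is dropped during normalization) it has precision 1 and recall 2/3, giving F1 = 0.8, which is the value kept for this example. Neither gold answer matches exactly, so the exact match score is 0.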

In [36]:
import numpy as np

em_score, f1_score = [], []

for prediction, gold_answers in zip(predictions, ground_truth):
    em_score.append(
        max((compute_exact_match(prediction, answer)) for answer in gold_answers)
    )
    f1_score.append(
        max((compute_f1(prediction, answer)) for answer in gold_answers)
    )
    
print(f"Average Exact Matching score: {np.mean(em_score)}")    
print(f"Average F1 score: {np.mean(f1_score)}")

Average Exact Matching score: 0.29133858267716534
Average F1 score: 0.42883719517577784


In [37]:
result = {"Average Exact Matching score": [np.mean(em_score)], "Average F1 Score": [np.mean(f1_score)]}

In [38]:
result = pd.DataFrame.from_dict(result, orient='index', columns=["No HPO"])

In [39]:
result

| | No HPO |
|---|---|
| Average Exact Matching score | 0.291339 |
| Average F1 Score | 0.428837 |


## 4. Finetune the pre-trained model on a custom dataset with automatic model tuning (AMT)

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.
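
To build some intuition for searching a hyperparameter range, the sketch below draws learning rates log-uniformly between 1e-5 and 1e-2, the range used for `adam-learning-rate` later in this section. This is only an illustration of what logarithmic scaling means; it is a simplified stand-in and not how SageMaker's tuner selects values internally.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

# With logarithmic scaling each decade (1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2) gets
# roughly a third of the samples; linear scaling would put about 90% above 1e-3.
share_below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"share of samples below 1e-3: {share_below_1e3:.2f}")
```

For parameters such as learning rates, which span several orders of magnitude, this spreads the search evenly across decades instead of concentrating it near the upper bound.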

In [56]:
from sagemaker.tuner import ContinuousParameter, IntegerParameter, CategoricalParameter, HyperparameterTuner


# Define objective metric per framework, based on which the best model will be selected.
# metric_definitions_per_model = {
#     "pytorch": {
#         "metrics": [{"Name": "validation:loss", "Regex": "val_loss: ([0-9\\.]+)"}],
#         "type": "Minimize",
#     }
# }

metric_definitions_per_model = {
    "pytorch": {
        "metrics": [
            {"Name": "loss", "Regex": "'loss': ([0-9\\.]+)"},
            {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9\\.]+)"}
        ],
        "type": "Minimize",
    }
}


# metric_definitions = [
#      {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9.]+)'},
#      {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9.]+)'},
# ]

# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model. See https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html
hyperparameter_ranges = {
    "adam-learning-rate": ContinuousParameter(0.00001, 0.01, scaling_type="Logarithmic"),
    "epochs": IntegerParameter(3, 10),
    "train-only-top-layer": CategoricalParameter(["True", "False"]),
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 1
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 1
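
SageMaker extracts each metric by applying its regular expression to the training job's log output. The sketch below applies the two regexes defined above to made-up log lines (their exact format is an assumption, modeled on a dictionary-style training log) to show which number each one captures.

```python
import re

# Same patterns as in `metric_definitions_per_model`, written as raw strings.
metric_regexes = {
    "loss": r"'loss': ([0-9\.]+)",
    "eval_loss": r"'eval_loss': ([0-9\.]+)",
}

# Hypothetical log lines; real training logs may be formatted differently.
log_lines = [
    "{'loss': 1.8342, 'learning_rate': 2e-05, 'epoch': 1.0}",
    "{'eval_loss': 1.2051, 'eval_runtime': 3.21, 'epoch': 1.0}",
]

captured = {}
for line in log_lines:
    for name, pattern in metric_regexes.items():
        match = re.search(pattern, line)
        if match:
            captured[name] = float(match.group(1))

print(captured)
```

Note that the `loss` pattern does not match inside `'eval_loss'`, because the pattern requires a quote immediately before `loss`. The character class `[0-9\.]` also means that a value printed in scientific notation would only be captured up to the `e`.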

### 4.1. Fine-tuning with hyperparameter optimization

In [57]:
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner

tuning_job_name = f"{config.SOLUTION_PREFIX}-eqa-hpo-1"

# Create SageMaker Estimator instance
eqa_estimator = Estimator(
    role=role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    base_job_name=tuning_job_name,
    debugger_hook_config=False,
)

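# The metric definitions above are keyed by framework prefix. We look up the
# "pytorch" entry with a separate variable instead of overwriting `model_id`.
metric_framework = "pytorch"

metric_definitions = next(
    value for key, value in metric_definitions_per_model.items() if metric_framework.startswith(key)
)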

# metric_definitions = next(
#     (value for key, value in metric_definitions_per_model.items() if model_id.startswith(key)), 
#     None
# )


hp_tuner = HyperparameterTuner(
    eqa_estimator,
    metric_definitions["metrics"][0]["Name"],
    hyperparameter_ranges,
    metric_definitions["metrics"],
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type=metric_definitions["type"],
    base_tuning_job_name=training_job_name,
)

# Launch a SageMaker Tuning job to search for the best hyperparameters
hp_tuner.fit({"training": training_data_path_updated})

INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-soln-docum-230628-0452


..........................................................................................................................................................................!


### 4.2. Deploy & run Inference on the fine-tuned model

In [None]:
# Retrieve the inference docker container uri
model_id = "huggingface-eqa-bert-base-uncased" 

deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

endpoint_name_hpo = f"{config.SOLUTION_PREFIX}-eqa-hpo-endpoint"

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor_hpo = hp_tuner.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name_hpo,
)

time.sleep(10)

In [64]:
predictions_hpo = []
for question_context in test_examples:
    query_response = query_endpoint(json.dumps(question_context).encode("utf-8"), finetuned_predictor_hpo)
    answer = parse_response(query_response)
    predictions_hpo.append(answer[0])

In [65]:
import numpy as np

em_score_hpo, f1_score_hpo = [], []

for prediction, gold_answers in zip(predictions_hpo, ground_truth):
    em_score_hpo.append(
        max((compute_exact_match(prediction, answer)) for answer in gold_answers)
    )
    f1_score_hpo.append(
        max((compute_f1(prediction, answer)) for answer in gold_answers)
    )
    
print(f"Average Exact Matching score: {np.mean(em_score_hpo)}")    
print(f"Average F1 score: {np.mean(f1_score_hpo)}")

Average Exact Matching score: 0.5354330708661418
Average F1 score: 0.723536437315965


In [66]:
result_hpo = {"Average Exact Matching score": [np.mean(em_score_hpo)], "Average F1 Score": [np.mean(f1_score_hpo)]}

In [67]:
result_hpo = pd.DataFrame.from_dict(result_hpo, orient='index', columns=["With HPO"])

In [68]:
pd.concat([result, result_hpo], axis=1)

| | No HPO | With HPO |
|---|---|---|
| Average Exact Matching score | 0.291339 | 0.535433 |
| Average F1 Score | 0.428837 | 0.723536 |


We can see that the model trained with hyperparameter optimization performs better on the hold-out test data.
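
For a quick sense of the size of the gap, the arithmetic below computes the absolute and relative change for each metric, using the values from the table above. Keep in mind that these numbers come from one fine-tuning job and one tuning job (`max_jobs = 1`) on the small SQuAD-v2-tiny subset.

```python
# (No HPO, With HPO) scores copied from the comparison table above.
scores = {
    "Average Exact Matching score": (0.291339, 0.535433),
    "Average F1 Score": (0.428837, 0.723536),
}

changes = {}
for metric, (no_hpo, with_hpo) in scores.items():
    absolute = with_hpo - no_hpo
    relative = absolute / no_hpo
    changes[metric] = (absolute, relative)
    print(f"{metric}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```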

### 4.3. Clean up the endpoints

When you've finished with the question answering endpoints (and associated
endpoint-configs), make sure that you delete them to avoid accidental
charges.

In [69]:
# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

finetuned_predictor_hpo.delete_model()
finetuned_predictor_hpo.delete_endpoint()

INFO:sagemaker:Deleting model with name: sagemaker-jumpstart-2023-06-28-03-09-45-131
INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-soln-documents-js-ewrtgp-eqa-finetune-endpoint-1
INFO:sagemaker:Deleting endpoint with name: sagemaker-soln-documents-js-ewrtgp-eqa-finetune-endpoint-1
INFO:sagemaker:Deleting model with name: sagemaker-jumpstart-2023-06-28-05-19-03-153
INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-soln-documents-js-ewrtgp-eqa-hpo-endpoint
INFO:sagemaker:Deleting endpoint with name: sagemaker-soln-documents-js-ewrtgp-eqa-hpo-endpoint
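
If one of the delete calls fails (for example because a resource was already removed), the remaining resources would be left running. A defensive variant, sketched below, attempts every deletion and reports failures at the end. The helper `delete_all` and the `FakePredictor` class are hypothetical; the sketch only assumes that each predictor exposes `delete_model()` and `delete_endpoint()`, as the predictors above do.

```python
def delete_all(predictors):
    """Call delete_model() and delete_endpoint() on each predictor, collecting errors."""
    errors = []
    for predictor in predictors:
        for method_name in ("delete_model", "delete_endpoint"):
            try:
                getattr(predictor, method_name)()
            except Exception as exc:  # keep going so later resources are still cleaned up
                errors.append((predictor, method_name, exc))
    return errors


class FakePredictor:
    """Stand-in used to exercise the helper without touching AWS."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")
        if self.fail_on == "delete_model":
            raise RuntimeError("model not found")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


first, second = FakePredictor(fail_on="delete_model"), FakePredictor()
failures = delete_all([first, second])
print(len(failures), first.calls, second.calls)
```

In this notebook the helper could be called as `delete_all([finetuned_predictor, finetuned_predictor_hpo])`, with any returned failures inspected afterwards.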


## Next Stage

We've just looked at how you can query a document for specific information.
Up next we'll look at a technique that can be used to extract the key
entities from a document, called Entity Recognition.

[Click here to continue with Named Entity Recognition.](./4_entity_recognition.ipynb)