# Document Understanding Solution - Named Entity Recognition

Named entity recognition (NER) seeks to locate named entities in text and classify them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages. In this notebook, we demonstrate three use cases of named entity recognition:

1. How to directly deploy a pre-trained named entity recognition model and run inference on it.
2. How to fine-tune a pre-trained Transformer model on a custom dataset, and then run inference on the fine-tuned model.
3. How to run [SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) (a hyperparameter optimization procedure) to find the best model and compare it with the model fine-tuned in step 2. Both models are evaluated on hold-out test data.

**Note**: When running this notebook on SageMaker Studio, make sure the
`PyTorch 1.10 Python 3.8 CPU Optimized` image/kernel is used. When
running this notebook on a SageMaker notebook instance, make sure the
`sagemaker-soln` kernel is used.

This solution relies on a config file to run the provisioned AWS resources. Run the cell below to generate that file.

In [1]:
import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [ x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [ x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
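
The cell above walks `RecordOutputs` twice, once for keys and once for values. As a minimal sketch, the same mapping can be built in a single pass with a dictionary comprehension. The `record` dictionary below is a made-up stand-in for the `describe_record` response, and its key names and values are hypothetical.

```python
# Hypothetical stand-in for the response of client.describe_record(...).
record = {
    "RecordOutputs": [
        {"OutputKey": "SolutionPrefix", "OutputValue": "sagemaker-soln-demo"},
        {"OutputKey": "S3Bucket", "OutputValue": "example-bucket"},
        {"Description": "an entry without key/value is skipped"},
    ]
}

# Keep only entries that carry both an output key and an output value.
stack_output = {
    x["OutputKey"]: x["OutputValue"]
    for x in record["RecordOutputs"]
    if "OutputKey" in x and "OutputValue" in x
}
print(stack_output)
```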

## 1. Set Up

Before executing the notebook, a few setup steps are required. This notebook requires the latest versions of `sagemaker` and `ipywidgets`; the cell below also installs `datasets` and `seqeval`, which are used for evaluation.

In [3]:
!pip install -U sagemaker ipywidgets datasets seqeval --find-links file://$PWD/../wheelhouse

Looking in links: file:///root/S3Downloads/jumpstart-prod-doc_ewrtgp/notebooks/../wheelhouse
Processing /root/S3Downloads/jumpstart-prod-doc_ewrtgp/wheelhouse/seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2

We start by importing a variety of packages that will be used throughout
the notebook. One of the most important packages is the Amazon SageMaker
Python SDK (i.e. `import sagemaker`). We also import modules from our own
custom (and editable) package that can be found at `../package`.

In [4]:
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel
import sys
from sagemaker.huggingface import HuggingFace
from sagemaker.huggingface import HuggingFaceModel

sys.path.insert(0, '../package')
from package import config, utils

aws_role = config.IAM_ROLE
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

Up next, we define the current folder and create a SageMaker client (from
`boto3`). We can use the SageMaker client to call SageMaker APIs
directly, as an alternative to using the Amazon SageMaker SDK. We'll use
it at the end of the notebook to delete certain resources that are
created in this notebook.

In [5]:
current_folder = utils.get_current_folder(globals())
sagemaker_client = boto3.client('sagemaker')

## 2. Run inference on the pre-trained name entity recognition model

We'll use the unique solution prefix to name the model and endpoint. Next, we define the Amazon SageMaker Model, which references
the source code and specifies which container to use.

This is a named entity recognition model, [en_core_web_md](https://spacy.io/models/en#en_core_web_md), from the [spaCy](https://spacy.io) library. It takes a text string as input and predicts named entities in the input text.

The pre-trained model from the spaCy library doesn't rely on a specific deep learning framework. For consistency with the other notebooks, we continue to use the `PyTorchModel` from the Amazon SageMaker Python SDK. Using `PyTorchModel` and setting the `framework_version` argument means that our deployed model runs inside a container that has PyTorch pre-installed. Other requirements can be installed by defining a `requirements.txt` file at the specified `source_dir` location. We use the `entry_point` argument to reference the code (within `source_dir`) that should be run for model inference: functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` are expected to be defined. Lastly, you can pass `model_data` from a training job, but here we load the pre-trained model in the source code running on the endpoint. We still need to provide `model_data`, so we pass an empty archive.
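
To make the four-function contract more concrete, here is a rough sketch of how such an entry point could be laid out. It is not the solution's actual `entry_point.py`: the "model" is a toy stand-in (an assumption for illustration) that flags capitalised words, so the snippet runs without spaCy or a SageMaker container.

```python
import json


def model_fn(model_dir):
    """Load the model once when the container starts (toy stand-in here)."""
    # Hypothetical "model": returns every capitalised word as an entity.
    return lambda text: [w for w in text.split() if w[:1].isupper()]


def input_fn(request_body, content_type):
    """Deserialize the request body into a Python object."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)


def predict_fn(data, model):
    """Run the loaded model on the deserialized input."""
    return {"entities": model(data["text"])}


def output_fn(prediction, accept):
    """Serialize the prediction for the HTTP response."""
    return json.dumps(prediction)


# Simulate how a serving stack chains the four functions for one request.
model = model_fn("/opt/ml/model")
body = json.dumps({"text": "Amazon SageMaker is a managed service"})
response = output_fn(
    predict_fn(input_fn(body, "application/json"), model), "application/json"
)
print(response)
```

The split keeps model loading (done once) separate from per-request work, which is why the pre-trained model can be loaded inside `model_fn` even though `model_data` is an empty archive.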

In [6]:
endpoint_name = f"{config.SOLUTION_PREFIX}-entity-recognition-endpoint"

### 2.1. Deploy an endpoint

In [7]:
model = PyTorchModel(
    model_data=f"{config.SOURCE_S3_PATH}/artifacts/models/empty.tar.gz",
    entry_point="entry_point.py",
    source_dir="../containers/entity_recognition",
    role=config.IAM_ROLE,
    framework_version="1.5.0",
    py_version="py3",
    code_location="s3://" + config.S3_BUCKET + "/code",
    env={
        "MMS_DEFAULT_RESPONSE_TIMEOUT": "3000"
    }
)

Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a
dedicated instance. We choose to deploy the endpoint on a single
ml.p3.2xlarge instance (or ml.g4dn.2xlarge if unavailable in this
region). You can expect this deployment step to take
around 5 minutes. After a series of dashes, you can expect to see an
exclamation mark, which indicates a successful deployment.

In [8]:
import time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = model.deploy(
    endpoint_name=endpoint_name,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

time.sleep(10)

------!

When you're trying to update the model for development purposes, but
experiencing issues because the model/endpoint-config/endpoint already
exists, you can delete the existing model/endpoint-config/endpoint by
uncommenting and running the following commands:

In [9]:
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)

When calling our new endpoint from the notebook, we use an Amazon
SageMaker SDK
[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).
A `Predictor` is used to send data to an endpoint (as part of a request),
and interpret the response. Our `model.deploy` command returned a
`Predictor` which, by default, sends and receives numpy arrays. Our
endpoint expects to receive (and also sends) JSON formatted objects, so
we passed a `JSONSerializer` and a `JSONDeserializer` to `model.deploy`
to use JSON instead of the PyTorch endpoint default of numpy arrays.
JSON is used here because it is a standard endpoint format and the
endpoint response can contain nested data structures.

### 2.2. Example input sentences for inference & Query endpoint

With our model successfully deployed and our predictor configured, we can
try out the entity recognizer on example inputs. All we need to do is
construct a dictionary object with a single key called `text` and provide
the input string. We call `predict` on our predictor and we should
get a response from the endpoint that contains our entities.

In [10]:
data = {'text': 'Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.'}
response = predictor.predict(data=data)

We have the response and can print out the named entities and noun
chunks that have been extracted from the text above. You will see the
verbatim text of each alongside its location in the original text (given
by start and end character indexes). Usually a document will contain many
more noun chunks than named entities, but named entities have an
additional field called `label` that indicates the class of the named
entity. Since the spaCy model was trained on the OntoNotes 5 corpus, it
uses the following classes:

| TYPE | DESCRIPTION |
|---|---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |

In [11]:
print(response['entities'])
print(response['noun_chunks'])

[{'text': 'Amazon SageMaker', 'start_char': 0, 'end_char': 16, 'label': 'ORG'}]
[{'text': 'Amazon SageMaker', 'start_char': 0, 'end_char': 16}, {'text': 'a fully managed service', 'start_char': 20, 'end_char': 43}, {'text': 'that', 'start_char': 44, 'end_char': 48}, {'text': 'every developer and data scientist', 'start_char': 58, 'end_char': 92}, {'text': 'the ability', 'start_char': 98, 'end_char': 109}, {'text': 'ML', 'start_char': 156, 'end_char': 158}]
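
The `start_char` and `end_char` fields are plain character offsets into the input string, with the end index exclusive. As a small check, the sketch below copies part of the response printed above and slices the original text with those offsets.

```python
text = (
    "Amazon SageMaker is a fully managed service that provides every developer "
    "and data scientist with the ability to build, train, and deploy machine "
    "learning (ML) models quickly."
)

# A subset of the endpoint response shown above.
response = {
    "entities": [
        {"text": "Amazon SageMaker", "start_char": 0, "end_char": 16, "label": "ORG"}
    ],
    "noun_chunks": [
        {"text": "a fully managed service", "start_char": 20, "end_char": 43},
        {"text": "ML", "start_char": 156, "end_char": 158},
    ],
}

# Slicing the input with the offsets recovers the verbatim span text.
for span in response["entities"] + response["noun_chunks"]:
    recovered = text[span["start_char"]:span["end_char"]]
    print(repr(recovered), span.get("label", "(noun chunk)"))
```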


You can try more examples above, but note that this model has been
pre-trained on the OntoNotes 5 dataset. You may need to fine-tune this
model with your own named entity recognition data to obtain better results.

### 2.3. Clean up the endpoint

When you've finished with the entity recognition endpoint (and associated
endpoint-config), make sure that you delete it to avoid accidental
charges.

In [12]:
# Delete the SageMaker endpoint and the attached resources
predictor.delete_model()
predictor.delete_endpoint()

## 3. Fine-tune the pre-trained model on a custom dataset

Previously, we saw how to run inference on a pre-trained named entity recognition model. Next, we discuss how a model can be fine-tuned on a custom dataset.

The model for fine-tuning attaches a token classification layer to each token embedding output by the text embedding model
and initializes the layer parameters to random values. The fine-tuning step updates
all the model parameters to minimize prediction error on the input data and returns the fine-tuned model. The text embedding model we use in this demonstration is [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) from Hugging Face. The dataset we fine-tune the model on is the English subset of [WikiANN](https://github.com/afshinrahimi/mmner) (sometimes called PAN-X), a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with `LOC` (location), `PER` (person), and `ORG` (organisation) tags in the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).


The model returned by fine-tuning can be further deployed for inference. Below are the instructions
for how the training data should be formatted for input to the model.

- **Input:**  A directory containing a `txt` format file.
    - The first column of the `txt` format file should have the tokens parsed from a sentence.
    - The second column should have the corresponding named entity tag.
- **Output:** A trained model that can be deployed for inference.
 
Below is an example of a `txt` format file showing the values in its two columns. Note that the file itself should not have a header row. For the `B`, `I`, and `O` prefixes of the tags, please check the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for details. The data for training and validation will be downloaded into the directory `../data/wikiann` in the following section.

| Token | Tag |
|--- |---|
| R.H.   | B-ORG  |
|Saunders| I-ORG |
|(| O |
|St.| B-ORG |
|Lawrence| I-ORG |
|River| I-ORG |
|)| O |
|(| O |
|...   | ... |
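
To illustrate how the `B`/`I`/`O` prefixes group tokens into entities, here is a minimal sketch that decodes the rows shown in the table above. The helper name `iob2_to_spans` is hypothetical and not part of the solution's package.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that continues nothing) closes the entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["R.H.", "Saunders", "(", "St.", "Lawrence", "River", ")"]
tags = ["B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```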
 

The WikiANN dataset is downloaded from [Dataset Homepage](https://github.com/afshinrahimi/mmner). [Apache 2.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

Citation:
@inproceedings{rahimi-etal-2019-massively,
    title = "Massively Multilingual Transfer for {NER}",
    author = "Rahimi, Afshin  and
      Li, Yuan  and
      Cohn, Trevor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1015",
    pages = "151--164",
}


### 3.1. Download, preprocess, and upload the training data

In [13]:
!aws s3 cp --recursive $config.SOURCE_S3_PATH/artifacts/data/wikiann/ ../data/wikiann

download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/wikiann/validation/dev.txt to ../data/wikiann/validation/dev.txt
download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/wikiann/train/train.txt to ../data/wikiann/train/train.txt
download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/wikiann/test/test.txt to ../data/wikiann/test/test.txt


The dataset has already been partitioned into `train.txt`, `dev.txt`, and `test.txt`, so we don't need to split the training data as we did in previous notebooks. The `train.txt` and `dev.txt` files will be used as training and validation data. The `test.txt` file will be used as hold-out test data to evaluate model performance with and without hyperparameter optimization. Next, we upload the training and validation files to the S3 path that will be used as input for training.

In [14]:
import os


bucket = config.S3_BUCKET
prefix = "NER"

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/data.txt")
).upload_file("../data/wikiann/train/train.txt")

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/data.txt")
).upload_file("../data/wikiann/validation/dev.txt")

### 3.2. Set Training parameters

Now that we are done with all the setup that is needed, we are ready to fine-tune our named entity recognition model.

In [15]:
hyperparameters = {
    "pretrained-model": "distilbert-base-uncased",
    "learning-rate": 2e-6,
    "num-train-epochs": 2,
    "batch-size": 16,
    "weight-decay": 1e-5,
    "early-stopping-patience": 2,
}

### 3.3. Fine-tuning without hyperparameter optimization

We use the `HuggingFace` estimator from the Amazon SageMaker Python SDK. The entry script is located at `../containers/entity_recognition/finetuning/training.py`.

In [16]:
training_job_name = f"{config.SOLUTION_PREFIX}-ner-finetune"

training_instance_type = config.TRAINING_INSTANCE_TYPE

ner_estimator = HuggingFace(
    pytorch_version='1.10.2',
    py_version='py38',
    transformers_version="4.17.0",
    entry_point='training.py',
    source_dir='../containers/entity_recognition/finetuning',
    hyperparameters=hyperparameters,
    role=aws_role,
    instance_count=1,
    instance_type=training_instance_type,
    output_path=f"s3://{bucket}/{prefix}/output",
    code_location=f"s3://{bucket}/{prefix}/output",
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sess,
    volume_size=30,
    env={
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'
    },
    base_job_name = training_job_name
)

In [None]:
ner_estimator.fit({
    "train": f"s3://{bucket}/{prefix}/train/",
    "validation": f"s3://{bucket}/{prefix}/validation/",
})

### 3.4. Deploy & run inference on the fine-tuned model

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, it means predicting the entity tag of an input text. 

In [19]:
inference_instance_type = config.HOSTING_INSTANCE_TYPE
endpoint_name_finetune = f"{config.SOLUTION_PREFIX}-ner-finetune-endpoint"

finetuned_predictor = HuggingFaceModel(
    model_data=ner_estimator.model_data,
    source_dir='../containers/entity_recognition/finetuning',
    entry_point='inference.py',
    role=aws_role,
    py_version="py38",
    pytorch_version='1.10.2',
    transformers_version="4.17.0",
)

In [None]:
finetuned_predictor.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name_finetune,
)

time.sleep(10)

Before using the test examples to query the deployed endpoint, we first convert `test.txt` into the right format. For each sentence, we create a list of words and a list of the corresponding entity labels. We store the result in a `pandas.DataFrame`, reading `test.txt` so that each sentence becomes one row.

In [21]:
import itertools
import pandas as pd


def get_tokens_and_ner_tags(filename):
    with open(filename, 'r', encoding="utf8") as f:
        lines = f.readlines()
        split_list = [list(y) for x, y in itertools.groupby(lines, lambda z: z == '\n') if not x]
        tokens = [[x.split('\t')[0].split("en:")[1] for x in y] for y in split_list]
        entities = [[x.split('\t')[1][:-1] for x in y] for y in split_list] 
    return pd.DataFrame({'tokens': tokens, 'ner_tags': entities})
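
The parsing logic above implies a particular file layout. As a sketch under that assumption (one `en:<token>`, a tab, and the tag per line, with an empty line between sentences, inferred from the code rather than from a format specification), the same grouping can be tried on an in-memory sample without pandas:

```python
import itertools

# Assumed layout, inferred from the parser: "en:<token>\t<tag>" per line,
# with an empty line separating sentences.
sample = (
    "en:Blacktown\tB-ORG\n"
    "en:railway\tI-ORG\n"
    "en:station\tI-ORG\n"
    "\n"
    "en:Renton\tB-LOC\n"
    "en:,\tI-LOC\n"
    "en:Washington\tI-LOC\n"
)

lines = sample.splitlines(keepends=True)

# groupby splits the line stream into runs of blank / non-blank lines;
# keeping only the non-blank runs yields one group per sentence.
sentences = [
    list(group)
    for is_blank, group in itertools.groupby(lines, lambda line: line == "\n")
    if not is_blank
]

tokens = [[l.split("\t")[0].split("en:", 1)[1] for l in s] for s in sentences]
ner_tags = [[l.split("\t")[1].rstrip("\n") for l in s] for s in sentences]
print(tokens)
print(ner_tags)
```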

In [22]:
test_data = get_tokens_and_ner_tags('../data/wikiann/test/test.txt')

In [23]:
test_data

| | tokens | ner_tags |
|---|---|---|
| 0 | [Shortly, afterward, ,, an, encouraging, respo... | [O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O,... |
| 1 | [:, Kanye, West, featuring, Jamie, Foxx, —, ``... | [O, B-PER, I-PER, O, B-PER, I-PER, O, O, B-ORG... |
| 2 | [Blacktown, railway, station] | [B-ORG, I-ORG, I-ORG] |
| 3 | ['', Mycalesis, perseus, lalassis, '', (, Hewi... | [O, B-LOC, I-LOC, I-LOC, O, O, O, O, O, O] |
| 4 | [Jonny, Lee, Miller, -, Eli, Stone, ''] | [B-PER, I-PER, I-PER, O, B-ORG, I-ORG, O] |
| ... | ... | ... |
| 9995 | [Tony, Stewart, ', '', (, PC4, ), ', ''] | [B-PER, I-PER, O, O, O, O, O, O, O] |
| 9996 | [Maryland, Route, 472] | [B-ORG, I-ORG, I-ORG] |
| 9997 | [Renton, ,, Washington] | [B-LOC, I-LOC, I-LOC] |
| 9998 | [He, served, as, a, member, of, the, South, Ea... | [O, O, O, O, O, O, O, B-ORG, I-ORG, I-ORG, O] |


In [24]:
content_type = "application/list-text"

def query_endpoint(payload, endpoint_name):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    predicted_label = model_predictions["predict_label"]
    token = model_predictions["token"]
    word_id = model_predictions["word_id"]
    return predicted_label, token, word_id


Now we query the endpoint. Each text string (corresponding to each row in `test.txt`) is tokenized into one or more tokens that can be passed to the Transformer model. When one text string is tokenized into multiple tokens (for example, the text string `R.H.` is tokenized as `R`, `.`, `H`, `.`), each of the four tokens gets a predicted named entity tag. In this case, we need to duplicate the ground truth named entity tag of the text string for all four tokens. As a result, the number of predicted tags equals the number of ground truth tags, and an evaluation score can be computed. The returned `word_id` identifies the tokens that belong to the same text string.
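
The duplication step can be sketched on a single sentence. The tokens and word ids below are hypothetical values chosen to mirror the `R.H.` example; the real ones come from the endpoint response. Special tokens carry no word id, so they receive the placeholder `-100` and are dropped before scoring.

```python
def align_labels(word_labels, word_ids):
    """Copy each word's tag to all of its tokens; -100 marks special tokens."""
    return [-100 if w is None else word_labels[w] for w in word_ids]


# Hypothetical tokenizer output for the two words "R.H." and "Saunders".
tokens = ["[CLS]", "r", ".", "h", ".", "saunders", "[SEP]"]
word_ids = [None, 0, 0, 0, 0, 1, None]
word_labels = ["B-ORG", "I-ORG"]  # one ground truth tag per word

aligned = align_labels(word_labels, word_ids)

# Drop the special tokens so predictions and references line up one to one.
pairs = [(t, l) for t, l in zip(tokens, aligned) if l != -100]
print(pairs)
```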

In [25]:
import numpy as np
import json

batch_size = 10
num_examples = test_data.shape[0]
predicted_label, token, word_id = [], [], []

for i in np.arange(0, num_examples, step=batch_size):
    query_response_batch = query_endpoint(
        test_data.iloc[i : (i + batch_size), :].tokens.values.tolist(),
        endpoint_name_finetune,
    )

    predicted_label_batch, token_batch, word_id_batch = parse_response(query_response_batch)
    predicted_label.extend(predicted_label_batch)
    token.extend(token_batch)
    word_id.extend(word_id_batch)


The returned predictions contain `predicted_label`, `token`, and `word_id`, each of which has as many elements as there are rows (sentences) in `test_data`. Each element of `predicted_label` (or `token` or `word_id`) is itself a list, whose entries correspond to the tokens of that sentence. Let's first run a sanity check that the number of predictions equals the number of rows in `test_data`.

In [26]:
assert len(predicted_label) == len(token) == len(word_id) == test_data.shape[0]

In [27]:
def tokenize_and_align_labels(examples, word_ids_all):
    """Expand word-level tags to token level using the returned word ids."""
    label_all_tokens = True  # copy a word's tag to every token of that word
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = word_ids_all[i]
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] have no word id.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first token of a word keeps the word's tag.
                label_ids.append(label[word_idx])
            else:
                # The remaining tokens of the same word repeat the tag.
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    return labels

Because each text string can be tokenized into one or more tokens, we need to duplicate the ground truth named entity tag of the text string for all the tokens associated with it.

In [28]:
labels_gt = tokenize_and_align_labels(test_data, word_id)

Let's run another sanity check: within each sentence, the number of tokens (word ids) should equal the number of ground truth named entity tags we just created.

In [29]:
for idx, i in enumerate(predicted_label):
    # Compare against the aligned ground truth tags, not the predictions themselves.
    assert len(i) == len(token[idx]) == len(word_id[idx]) == len(labels_gt[idx])

Now we load the evaluation metric used to compute the evaluation scores.

In [30]:
from datasets import load_metric

In [31]:
metric = load_metric("seqeval")


For details of the evaluation metrics, please check the [official documentation](https://huggingface.co/spaces/evaluate-metric/seqeval). For the overall precision, recall, F1, and accuracy, a larger value indicates better performance.
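
Precision, recall, and F1 are computed over whole entities rather than individual tokens: a predicted entity counts as correct only if both its boundaries and its type match a reference entity. The sketch below is a simplified, standard-library re-implementation for illustration, not the `seqeval` code itself. For instance, it ignores an `I-` tag that has no preceding `B-` tag, which `seqeval` may treat differently.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB2 sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(references, predictions):
    """Entity-level precision, recall, and F1 over a list of sentences."""
    gold = pred = correct = 0
    for ref, hyp in zip(references, predictions):
        g, p = extract_spans(ref), extract_spans(hyp)
        gold, pred, correct = gold + len(g), pred + len(p), correct + len(g & p)
    precision = correct / pred if pred else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


references = [["B-PER", "I-PER", "O", "B-ORG"]]
predictions = [["B-PER", "I-PER", "O", "B-LOC"]]  # second entity has the wrong type
precision, recall, f1 = entity_prf(references, predictions)
print(precision, recall, f1)
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities matches, which is why entity-level scores are usually lower than token accuracy.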

In [32]:
token_all, predict_all, groundtruth_all = [], [], []

for idx, i in enumerate(predicted_label):
    tmp_token, tmp_predict, tmp_gt = [], [], []
    for idx2, each_token in enumerate(token[idx]):
        if each_token in ['[CLS]', '[SEP]']: # exclude the CLS and SEP tokens
            continue
        assert len(i) == len(labels_gt[idx]) == len(token[idx])
        tmp_token.append(each_token)
        tmp_predict.append(i[idx2])
        tmp_gt.append(labels_gt[idx][idx2])
    assert len(tmp_token) == len(tmp_predict) == len(tmp_gt)
    token_all.append(tmp_token)
    predict_all.append(tmp_predict)
    groundtruth_all.append(tmp_gt)
    
assert all(-100 not in x for x in groundtruth_all)

In [33]:
metrics = metric.compute(predictions=predict_all, references=groundtruth_all)
result = {"precision": [metrics["overall_precision"]], "recall": [metrics["overall_recall"]], "f1": [metrics["overall_f1"]], "accuracy": [metrics["overall_accuracy"]]}    

In [34]:
result = pd.DataFrame.from_dict(result, orient='index', columns=["No HPO"])

In [35]:
result

| | No HPO |
|---|---|
| precision | 0.621406 |
| recall | 0.647711 |
| f1 | 0.634286 |
| accuracy | 0.857885 |


## 4. Fine-tune the pre-trained model on a custom dataset with automatic model tuning (AMT)

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.

In [36]:
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

hyperparameters_range = {
    "learning-rate": ContinuousParameter(1e-5, 0.1, scaling_type="Logarithmic"),
    "weight-decay": ContinuousParameter(1e-6, 1e-2, scaling_type="Logarithmic"),
}

hyperparameters = {
    "pretrained-model": "distilbert-base-uncased",
    "num-train-epochs": 3,
    "batch-size": 16,
    "token-column-name": "tokens",
    "tag-column-name": "ner_tags",
    "early-stopping-patience": 3,
    
}
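
Both ranges use `scaling_type="Logarithmic"`, which searches the exponent rather than the raw value. The sketch below only illustrates what log-uniform sampling over the learning-rate range looks like. It is not how SageMaker picks candidates (its default strategy is Bayesian optimization), and the helper name is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed in [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 0.1, rng) for _ in range(1000)]

# The range spans four decades, so about half of the draws fall below 1e-3.
share_small = sum(s < 1e-3 for s in samples) / len(samples)
print(f"min={min(samples):.2e} max={max(samples):.2e} share below 1e-3: {share_small:.2f}")
```

With linear scaling, by contrast, roughly 99% of draws from the same range would exceed `1e-3`, leaving the small learning rates that typically suit fine-tuning almost unexplored.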

### 4.1. Fine-tuning with hyperparameter optimization

In [38]:
tuning_job_name = f"{config.SOLUTION_PREFIX}-ner-hpo"


estimator = HuggingFace(
    pytorch_version='1.10.2',
    py_version='py38',
    transformers_version="4.17.0",
    entry_point='training.py',
    source_dir='../containers/entity_recognition/finetuning',
    hyperparameters=hyperparameters,
    role=aws_role,
    instance_count=1,
    instance_type=training_instance_type,
    output_path=f"s3://{bucket}/{prefix}/output",
    code_location=f"s3://{bucket}/{prefix}/output",
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sess,
    volume_size=30,
    env={
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'
    }
)

tuner = HyperparameterTuner(
    estimator,
    "f1",
    hyperparameters_range,
    [{"Name": "f1", "Regex": "'eval_f1': ([0-9\\.]+)"}],
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type="Maximize",
    base_tuning_job_name=tuning_job_name,
)

tuner.fit({
    "train": f"s3://{bucket}/{prefix}/train/",
    "validation": f"s3://{bucket}/{prefix}/validation/",
}, logs=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-soln-docum-230628-0539


.....................................................................................................................................................................................................................................!
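
The tuner reads the objective value from the training logs using the regular expression passed in the metric definition. As a quick check of that pattern, the sketch below applies it to a hypothetical log line (made up here to resemble an evaluation dictionary printed during training).

```python
import re

# Same pattern as the "Regex" entry in the metric definition above.
metric_regex = "'eval_f1': ([0-9\\.]+)"

# Hypothetical log line; real lines come from the training job's output.
log_line = "{'eval_loss': 0.2731, 'eval_f1': 0.8288, 'epoch': 3.0}"

match = re.search(metric_regex, log_line)
value = float(match.group(1)) if match else None
print(value)
```

Note that the character class only covers digits and dots, so a value printed in scientific notation (for example `1e-05`) would be captured only up to the `e`.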


Fetch the exact tuning job name.

In [39]:
sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

'sagemaker-soln-docum-230628-0539'

In [40]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_maximize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    == "Maximize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

4 training jobs have completed


In [41]:

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=False)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

Number of training jobs with valid objective: 4
{'lowest': 0.0, 'highest': 0.8288232088088989}




| | learning-rate | weight-decay | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.000105 | 0.000201 | sagemaker-soln-docum-230628-0539-003-5d365d1d | Completed | 0.828823 | 2023-06-28 05:51:51+00:00 | 2023-06-28 05:58:23+00:00 | 392.0 |
| 0 | 0.025527 | 7e-06 | sagemaker-soln-docum-230628-0539-004-efd261c2 | Completed | 0.0 | 2023-06-28 05:51:52+00:00 | 2023-06-28 05:58:30+00:00 | 398.0 |
| 2 | 0.006734 | 0.004764 | sagemaker-soln-docum-230628-0539-002-831c1a00 | Completed | 0.0 | 2023-06-28 05:41:07+00:00 | 2023-06-28 05:50:15+00:00 | 548.0 |
| 3 | 0.040223 | 0.000718 | sagemaker-soln-docum-230628-0539-001-a04b6186 | Completed | 0.0 | 2023-06-28 05:40:53+00:00 | 2023-06-28 05:50:06+00:00 | 553.0 |


In [None]:
df = df[df["TrainingJobStatus"] == "Completed"] # filter out the failed jobs
output_path_best_tuning_job = os.path.join(f"s3://{bucket}/{prefix}/output", df["TrainingJobName"].iloc[0], "output")

print(f"The output path of the best model from the hpo tuning is: {output_path_best_tuning_job}")

### 4.2. Deploy & run Inference on the fine-tuned model

In [None]:
endpoint_name_hpo = f"{config.SOLUTION_PREFIX}-ner-hpo-endpoint"

tuning_best_model = HuggingFaceModel(
    model_data=os.path.join(output_path_best_tuning_job, "model.tar.gz"),
    source_dir="../containers/entity_recognition/finetuning",
    entry_point="inference.py",
    role=aws_role,
    py_version="py38",
    pytorch_version='1.10.2',
    transformers_version="4.17.0",
)

finetuned_predictor_hpo = tuning_best_model.deploy(
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name_hpo,
    initial_instance_count=1,
)

time.sleep(10)

In [44]:
content_type = "application/list-text"

batch_size = 10
num_examples = test_data.shape[0]
predicted_label_hpo, token_hpo, word_id_hpo = [], [], []
for i in np.arange(0, num_examples, step=batch_size):
    query_response_batch = query_endpoint(
        test_data.iloc[i : (i + batch_size), :].tokens.values.tolist(),
        endpoint_name_hpo,
    )

    predicted_label_batch, token_batch, word_id_batch = parse_response(query_response_batch)
    predicted_label_hpo.extend(predicted_label_batch)
    token_hpo.extend(token_batch)
    word_id_hpo.extend(word_id_batch)

In [45]:
token_all_hpo, predict_all_hpo, groundtruth_all_hpo = [], [], []

for idx, i in enumerate(predicted_label_hpo):
    tmp_token, tmp_predict, tmp_gt = [], [], []
    for idx2, each_token in enumerate(token_hpo[idx]):
        if each_token in ['[CLS]', '[SEP]']:
            continue
        assert len(i) == len(labels_gt[idx]) == len(token_hpo[idx])
        tmp_token.append(each_token)
        tmp_predict.append(i[idx2])
        tmp_gt.append(labels_gt[idx][idx2])
    assert len(tmp_token) == len(tmp_predict) == len(tmp_gt)
    token_all_hpo.append(tmp_token)
    predict_all_hpo.append(tmp_predict)
    groundtruth_all_hpo.append(tmp_gt)
    
assert all(-100 not in x for x in groundtruth_all_hpo)

In [46]:
metrics_hpo = metric.compute(predictions=predict_all_hpo, references=groundtruth_all_hpo)
result_hpo = {"precision": [metrics_hpo["overall_precision"]], "recall": [metrics_hpo["overall_recall"]], "f1": [metrics_hpo["overall_f1"]], "accuracy": [metrics_hpo["overall_accuracy"]]}  

In [47]:
result_hpo = pd.DataFrame.from_dict(result_hpo, orient='index', columns=["With HPO"])

In [48]:
pd.concat([result, result_hpo], axis=1) 

| | No HPO | With HPO |
|---|---|---|
| precision | 0.621406 | 0.811914 |
| recall | 0.647711 | 0.839227 |
| f1 | 0.634286 | 0.825345 |
| accuracy | 0.857885 | 0.922633 |


We can see that the model tuned with hyperparameter optimization performs better on the hold-out test data across all four metrics.

### 4.3. Clean up the endpoints

When you've finished with the fine-tuned entity recognition endpoints (and
associated endpoint-configs), make sure that you delete them to avoid
accidental charges.

In [49]:
# Delete the SageMaker endpoints and the attached resources
sagemaker_client = boto3.client("sagemaker")

finetuned_predictor.delete_model()
sagemaker_client.delete_endpoint(EndpointName=endpoint_name_finetune) ## cannot call finetuned_predictor.delete_endpoint() because 'HuggingFaceModel' object has no attribute 'delete_endpoint'
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_finetune)

finetuned_predictor_hpo.delete_model()
sagemaker_client.delete_endpoint(EndpointName=endpoint_name_hpo)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_hpo)

INFO:sagemaker:Deleting model with name: huggingface-pytorch-inference-2023-06-28-05-33-36-747
INFO:sagemaker:Deleting model with name: huggingface-pytorch-inference-2023-06-28-05-59-26-571


{'ResponseMetadata': {'RequestId': 'd8da2308-8ac4-4aff-b4d0-1c380b618b79',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd8da2308-8ac4-4aff-b4d0-1c380b618b79',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 28 Jun 2023 06:05:12 GMT'},
  'RetryAttempts': 0}}

## Next Stage

We've just looked at how you can extract named entities and noun chunks
from a document. Up next we'll look at a technique that can be used to
classify relationships between entities.

[Click here to continue with Relation Extraction.](./5_relationship_extraction.ipynb)