# Document Understanding Solution - Relationship Extraction

Relation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities. In this notebook, we demonstrate two use cases of Relation Extraction:

1. How to fine-tune a pre-trained Transformer model on a custom dataset, and then run inference on the fine-tuned model.
2. How to run [SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) (a hyperparameter optimization procedure) to search for a better model than the one fine-tuned in step 1. Both the tuned model and the model fine-tuned in step 1 are evaluated on hold-out test data.

**Note**: When running this notebook on SageMaker Studio, you should make
sure the `PyTorch 1.10 Python 3.8 CPU Optimized` image/kernel is used. When
running this notebook on a SageMaker Notebook Instance, you should make
sure the `sagemaker-soln` kernel is used.

This solution relies on a config file to run the provisioned AWS resources. Run the cell below to generate that file.

In [2]:
import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [ x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [ x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
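
The two list comprehensions in the cell above flatten the `RecordOutputs` list into a dictionary. As a minimal sketch of that transformation, the snippet below applies the same filter to a hand-written record. The helper name `outputs_to_dict` and all keys and values are invented for illustration; they are not real stack outputs.

```python
# Hypothetical sample shaped like the `RecordOutputs` list read above.
# The keys and values are invented for illustration only.
sample_record_outputs = [
    {"OutputKey": "S3Bucket", "OutputValue": "example-bucket"},
    {"OutputKey": "IamRole", "OutputValue": "arn:aws:iam::123456789012:role/example"},
    {"OutputKey": "Incomplete"},  # no OutputValue, so it is skipped
]


def outputs_to_dict(record_outputs):
    """Keep only entries that carry both a key and a value."""
    return {
        item["OutputKey"]: item["OutputValue"]
        for item in record_outputs
        if "OutputKey" in item and "OutputValue" in item
    }


stack_output = outputs_to_dict(sample_record_outputs)
print(stack_output)
```

A single dictionary comprehension applies the same filter as the paired `keys`/`values` lists, without relying on the two lists staying aligned.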

Install the packages required to run this notebook.

In [3]:
!pip install -U sagemaker ipywidgets --find-links file://$PWD/../wheelhouse

Looking in links: file:///root/S3Downloads/jumpstart-prod-doc_ewrtgp/notebooks/../wheelhouse

## 1. Set Up

We start by importing a variety of packages that will be used throughout
the notebook. One of the most important packages is the Amazon SageMaker
Python SDK (i.e. `import sagemaker`). We also import modules from our own
custom (and editable) package that can be found at `../package`.

In [4]:
import boto3
from pathlib import Path
import sagemaker
from sagemaker.pytorch import PyTorch
import sys

sys.path.insert(0, '../package')
from package import config, utils

aws_role = config.IAM_ROLE
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

## 2. Finetune the pre-trained model on a custom dataset

This is a Relationship Extraction model built on [Bert-base-uncased](https://huggingface.co/bert-base-uncased) from the [transformers](https://huggingface.co/transformers/) library.

The model for fine-tuning attaches a linear classification layer, initialized with random parameters, that takes a pair of token embeddings output by the text embedding model. The fine-tuning step updates all the model parameters to minimize prediction error on the input data and returns the fine-tuned model. The text embedding model used in this demonstration is the same [Bert-base-uncased](https://huggingface.co/bert-base-uncased) model. The dataset used for fine-tuning is [SemEval-2010 Task 8](https://aclanthology.org/S10-1006/), a dataset for multi-way classification of mutually exclusive semantic relations between pairs of nominals.
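
One way to write down the classification head described above, as a sketch only: assume the two entity embeddings $h_{e_1}, h_{e_2} \in \mathbb{R}^{d}$ are concatenated before the linear layer and that "prediction error" is the cross-entropy loss. The concatenation, the loss, and the symbols $W$, $b$, $K$ (number of relation labels) and $y^{\ast}$ (true label) are assumptions for illustration; the text only states that the layer takes a pair of token embeddings.

```latex
\begin{aligned}
z &= W \, [\, h_{e_1} ; h_{e_2} \,] + b,
  \qquad W \in \mathbb{R}^{K \times 2d},\; b \in \mathbb{R}^{K} \\
p(y = k \mid x) &= \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} \\
\mathcal{L} &= -\log p(y = y^{\ast} \mid x)
\end{aligned}
```

Under these assumptions, $W$ and $b$ start from random values, and fine-tuning updates them together with the encoder parameters that produce $h_{e_1}$ and $h_{e_2}$.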


The model returned by fine-tuning can be further deployed for inference. Below are the instructions 
for how the training data should be formatted for input to the model. 

- **Input:**  A directory containing a `txt` format file.
    - Each observation contains three components, text, semantic relation label, and comment (optional), each of which takes a line in the `txt` format file. Observations are separated by an empty line. For each observation, there are markers highlighting the two terms in the text and their semantic relation label in the line below.
- **Output:** A trained model that can be deployed for inference. 
 
Below is an example of a `txt` format file. Note: relation labels that share a name but differ in entity order are counted as different labels. For example, `Component-Whole(e2,e1)` and `Component-Whole(e1,e2)` are different semantic relation labels. The data for training and validation will be downloaded into the directory `../data/semeval2010t8` in the following section.

|   |
|--- |
|1  "The system as described above has its greatest application in an arrayed <e1>configuration</e1> of antenna <e2>elements</e2>."|
|Component-Whole(e2,e1)|
|Comment: Not a collection: there is structure here, organisation.|
||
|2  "The <e1>child</e1> was carefully wrapped and bound into the <e2>cradle</e2> by means of a cord."|
|Other|
|Comment: NA|
| |
|3  "The <e1>author</e1> of a keygen uses a <e2>disassembler</e2> to look at the raw assembly code."|
|Instrument-Agency(e2,e1)|
|Comment: NA|
||
|...   |
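
The layout above (sentence line, label line, comment line, empty line) can be read with a few lines of Python. The sketch below is a hypothetical parser written for illustration; it is not the `parse_file` helper used later in this notebook, and the function name `parse_observations` and the returned dictionary keys are assumptions.

```python
import re

# Two observations copied from the example above, in the same layout.
SAMPLE = '''2  "The <e1>child</e1> was carefully wrapped and bound into the <e2>cradle</e2> by means of a cord."
Other
Comment: NA

3  "The <e1>author</e1> of a keygen uses a <e2>disassembler</e2> to look at the raw assembly code."
Instrument-Agency(e2,e1)
Comment: NA
'''

# Matches <e1>...</e1> and <e2>...</e2> and captures the marked term.
ENTITY = re.compile(r"<e([12])>(.*?)</e\1>")


def parse_observations(text):
    """Parse 'sentence / label / comment' blocks separated by empty lines."""
    observations = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip malformed blocks
        # Drop the leading index and the surrounding double quotes.
        sentence = lines[0].split(None, 1)[1].strip('"')
        entities = {f"e{num}": term for num, term in ENTITY.findall(sentence)}
        observations.append(
            {"sentence": sentence, "label": lines[1], "entities": entities}
        )
    return observations


parsed = parse_observations(SAMPLE)
for obs in parsed:
    print(obs["label"], obs["entities"])
```

The optional comment line is ignored here because only the sentence and the label are needed for training.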
 


Citation:
@inproceedings{hendrickx-etal-2010-semeval,
    title = "{S}em{E}val-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals",
    author = "Hendrickx, Iris  and
      Kim, Su Nam  and
      Kozareva, Zornitsa  and
      Nakov, Preslav  and
      {\'O} S{\'e}aghdha, Diarmuid  and
      Pad{\'o}, Sebastian  and
      Pennacchiotti, Marco  and
      Romano, Lorenza  and
      Szpakowicz, Stan",
    booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
    month = jul,
    year = "2010",
    address = "Uppsala, Sweden",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/S10-1006",
    pages = "33--38",
}

### 2.1. Download, preprocess, and upload the training data

In [5]:
!aws s3 cp --recursive $config.SOURCE_S3_PATH/artifacts/data/semeval2010t8/ ../data/semeval2010t8

download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/semeval2010t8/test/test.txt to ../data/semeval2010t8/test/test.txt
download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/semeval2010t8/train/train.txt to ../data/semeval2010t8/train/train.txt
download: s3://sagemaker-solutions-prod-us-east-1/0.2.0/Document-understanding/3.0.3/artifacts/data/semeval2010t8/validation/validation.txt to ../data/semeval2010t8/validation/validation.txt


The dataset has already been partitioned into `train.txt`, `validation.txt`, and `test.txt`, so we don't need to split the training data as we did in previous notebooks. `train.txt` and `validation.txt` are used as training and validation data, and `test.txt` is used as hold-out test data to evaluate model performance with and without hyperparameter optimization. Next, we upload the training and validation files to the S3 path that will be used as input for training.

In [6]:
import os


bucket = config.S3_BUCKET
prefix = "RE"

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.txt")
).upload_file("../data/semeval2010t8/train/train.txt")

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/validation.txt")
).upload_file("../data/semeval2010t8/validation/validation.txt")
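
The cell above builds S3 object keys with `os.path.join`, which uses the separator of the local operating system. That works on the Linux kernels this notebook targets. As a hedged alternative, the sketch below uses `posixpath.join`, which always joins with `/`, the separator used in S3 keys. The helper name `s3_key` is hypothetical.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key components with '/', regardless of the local OS."""
    return posixpath.join(prefix, *parts)


prefix = "RE"  # same prefix as in the cell above
train_key = s3_key(prefix, "train", "train.txt")
validation_key = s3_key(prefix, "validation", "validation.txt")
print(train_key, validation_key)
```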

### 2.2. Set Training parameters

Now that we are done with all the setup that is needed, we are ready to fine-tune our relation extraction model.

In [36]:
hyperparameters = {
    "pretrained-model": "bert-base-uncased",
    "learning-rate": 0.0002,
    "max-epoch": 2,
    "weight-decay": 0.01,
    "batch-size": 8,
    "accumulate-grad-batches": 1,
    "gradient-clip-val": 1.0
}
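
In SageMaker script mode, each entry of the `hyperparameters` dictionary is passed to the entry script as a `--name value` command-line argument. The sketch below shows one way such arguments could be parsed with `argparse`. It is an assumption about how an entry script might look, not the contents of `entry_point.py`; the function name `parse_hyperparameters` is hypothetical and the defaults simply mirror the values in the cell above.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters passed as `--name value` arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pretrained-model", type=str, default="bert-base-uncased")
    parser.add_argument("--learning-rate", type=float, default=2e-4)
    parser.add_argument("--max-epoch", type=int, default=2)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--accumulate-grad-batches", type=int, default=1)
    parser.add_argument("--gradient-clip-val", type=float, default=1.0)
    return parser.parse_args(argv)


args = parse_hyperparameters(["--learning-rate", "0.0002", "--batch-size", "8"])

# argparse converts dashes in option names to underscores in attribute names.
# With gradient accumulation, the number of examples per optimizer step is
# the batch size multiplied by the number of accumulated batches.
effective_batch_size = args.batch_size * args.accumulate_grad_batches
print(args.learning_rate, effective_batch_size)
```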

### 2.3. Fine-tuning without hyperparameter optimization

We use the `PyTorch` estimator from the Amazon SageMaker Python SDK. The entry script is located at `../containers/relationship_extraction/entry_point.py`.

In [37]:
training_job_name = f"{config.SOLUTION_PREFIX}-re-finetune"

train_instance_type = config.TRAINING_INSTANCE_TYPE
#train_instance_type = 'ml.g4dn.4xlarge'

re_estimator = PyTorch(
    framework_version='1.10.0',
    py_version='py38',
    entry_point='entry_point.py',
    source_dir='../containers/relationship_extraction',
    hyperparameters=hyperparameters,
    role=aws_role,
    instance_count=1,
    instance_type=train_instance_type,
    output_path=f"s3://{bucket}/{prefix}/output",
    code_location=f"s3://{bucket}/{prefix}/output",
    base_job_name=training_job_name,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sess,
    volume_size=30,
    env={
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'
    },
    debugger_hook_config=False,
)

In [38]:
print(train_instance_type)

ml.g4dn.2xlarge


In [None]:
re_estimator.fit({
    "train": f"s3://{bucket}/{prefix}/train/",
    "validation": f"s3://{bucket}/{prefix}/validation/",
})
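
Each key passed to `fit` (`train`, `validation`) names an input channel. Inside the training container, SageMaker downloads each channel to `/opt/ml/input/data/<channel>` and exposes the path through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows how a training script could resolve those directories. The helper `channel_dir` is hypothetical and the environment is simulated so the snippet runs outside a training job.

```python
import os


def channel_dir(channel, environ=None):
    """Return the local directory SageMaker mounts for an input channel."""
    environ = os.environ if environ is None else environ
    default = f"/opt/ml/input/data/{channel}"
    return environ.get(f"SM_CHANNEL_{channel.upper()}", default)


# Simulated environment for illustration; inside a training job these
# variables are set by SageMaker.
fake_env = {"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}
print(channel_dir("train", fake_env))
print(channel_dir("validation", fake_env))  # falls back to the default path
```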

### 2.4. Deploy and run inference on the fine-tuned model

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the semantic relation label between two text spans within an input text.

We'll use the unique solution prefix to name the model and endpoint.

In [48]:
inference_instance_type = config.HOSTING_INSTANCE_TYPE

endpoint_name_finetune = f"{config.SOLUTION_PREFIX}-re-finetune-endpoint-1"

In [None]:
import time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

finetuned_predictor = re_estimator.deploy(
    endpoint_name=endpoint_name_finetune,
    instance_type=inference_instance_type,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

time.sleep(10)

When calling our new endpoint from the notebook, we use an Amazon
SageMaker SDK
[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).
A `Predictor` is used to send data to an endpoint (as part of a request)
and interpret the response. The `re_estimator.deploy` call above returns a
`Predictor`. By default, a PyTorch `Predictor` sends and receives numpy
arrays, but our endpoint expects to receive (and also sends) JSON formatted
objects, so we pass `JSONSerializer` and `JSONDeserializer` to `deploy`.
JSON is used here because it is a standard endpoint format and the
endpoint response can contain nested data structures.

With our model successfully deployed and our predictor configured, we can
try out the relationship extraction model on example inputs.

In [50]:
finetuned_predictor.predict(
    data={
        'sequence': 'Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.',
        'entity_one_start': 0,
        'entity_one_end': 6,
        'entity_two_start': 7,
        'entity_two_end': 16
    }
)

{'Label_id': 14, 'Label': 'Other'}
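
In the request above, the entity positions appear to be character offsets with an exclusive end index: `sequence[0:6]` is `Amazon` and `sequence[7:16]` is `SageMaker`. The sketch below builds such a payload from the entity strings instead of hand-counted offsets. The helper `build_payload` is hypothetical, and it assumes the second entity occurs after the first in the text.

```python
def build_payload(sequence, entity_one, entity_two):
    """Build a request dict with character offsets (end index exclusive)."""
    start_one = sequence.index(entity_one)
    end_one = start_one + len(entity_one)
    # Assumption: entity two appears after entity one, so search from end_one.
    start_two = sequence.index(entity_two, end_one)
    end_two = start_two + len(entity_two)
    return {
        "sequence": sequence,
        "entity_one_start": start_one,
        "entity_one_end": end_one,
        "entity_two_start": start_two,
        "entity_two_end": end_two,
    }


sequence = "Amazon SageMaker is a fully managed service."
payload = build_payload(sequence, "Amazon", "SageMaker")
print(payload)
```

`str.index` raises `ValueError` when an entity is not found, which surfaces a malformed request before it is sent to the endpoint.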

Next, let's query the deployed endpoint to get the prediction for each test example in `../data/semeval2010t8/test/test.txt`.

In [51]:
from utils_relation_extraction import parse_file

In [52]:
examples, ground_truth = parse_file("../data/semeval2010t8/test/test.txt")

In [53]:
prediction_labels = []
for each_example in examples:
    prediction_labels.append(
        finetuned_predictor.predict(each_example)["Label"]
    )

In [54]:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# scikit-learn metrics take the ground truth first, then the predictions.
accuracy = accuracy_score(ground_truth, prediction_labels)
f1_macro = f1_score(ground_truth, prediction_labels, average='macro')
f1_micro = f1_score(ground_truth, prediction_labels, average='micro')

result = {"Accuracy": [accuracy], "F1 Macro": [f1_macro], "F1 Micro": [f1_micro]}

result = pd.DataFrame.from_dict(result, orient='index', columns=["No HPO"])

In [55]:
result

| Metric | No HPO |
|--- |--- |
| Accuracy | 0.167096 |
| F1 Macro | 0.015071 |
| F1 Micro | 0.167096 |


Since the task is essentially a multiclass classification task, we use accuracy, macro F1, and micro F1 as the evaluation scores. For each of them, a higher value indicates better results.
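
To make the difference between the three scores concrete, here is a small pure-Python sketch on invented toy labels (not the notebook's results). It shows that a classifier predicting a single label for every input can reach a moderate accuracy while its macro F1 stays low, because every class it never predicts contributes an F1 of 0. In single-label multiclass classification, micro F1 equals accuracy. The function names are hypothetical.

```python
def counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_from_counts(*counts(y_true, y_pred, l)) for l in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """Pool the counts over all classes, then compute a single F1."""
    labels = set(y_true) | set(y_pred)
    totals = [sum(c) for c in zip(*(counts(y_true, y_pred, l) for l in labels))]
    return f1_from_counts(*totals)


# Invented toy data: the classifier always predicts "Other".
y_true = ["Other", "Other", "Other", "Cause-Effect(e1,e2)",
          "Component-Whole(e2,e1)", "Component-Whole(e2,e1)"]
y_pred = ["Other"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred), micro_f1(y_true, y_pred))
```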

## 3. Finetune the pre-trained model on a custom dataset with automatic model tuning (AMT)

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.

In [59]:
from sagemaker.tuner import ContinuousParameter, IntegerParameter, CategoricalParameter, HyperparameterTuner


# Define objective metric per framework, based on which the best model will be selected.
metric_definitions = {
    "metrics": [{"Name": "validation_accuracy", "Regex": "valid_accuracy=([0-9\\.]+)"}],
    "type": "Maximize",
}

# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model.(https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)
hyperparameter_ranges = {
    "learning-rate": ContinuousParameter(0.0001, 0.001, scaling_type="Logarithmic"),
    #"max-epoch": IntegerParameter(3, 8),
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 2
# Change the number of parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# If max_jobs == max_parallel_jobs, all jobs start at once, so Bayesian search effectively behaves like random search.
max_parallel_jobs = 2
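
The `Regex` in `metric_definitions` is applied to the training job's log output, and the captured group becomes the value of `validation_accuracy`. The sketch below applies the same pattern to two hypothetical log lines. The log format is an assumption; the real training script's output may differ.

```python
import re

METRIC_REGEX = "valid_accuracy=([0-9\\.]+)"  # same pattern as in the cell above

# Hypothetical log lines; the real training script's log format may differ.
log_lines = [
    "epoch=0 train_loss=1.93 valid_accuracy=0.6125",
    "epoch=1 train_loss=1.21 valid_accuracy=0.7250",
    "saving checkpoint",  # no metric on this line, so it is skipped
]

values = [
    float(match.group(1))
    for line in log_lines
    if (match := re.search(METRIC_REGEX, line))
]
print(values)
```

Testing the pattern locally like this is a cheap way to confirm that the metric name printed by the training script and the regex agree before launching a tuning job.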

### 3.1. Fine-tuning with hyperparameter optimization

In [63]:
tuning_job_name = f"{config.SOLUTION_PREFIX}-re-hpo"

hyperparameters = {
    "max-epoch": 2,
    "weight-decay": 0,
    "batch-size": 8,
    "accumulate-grad-batches": 1,
    "gradient-clip-val": 1.0
}


estimator = PyTorch(
    framework_version='1.10.0',
    py_version='py38',
    entry_point='entry_point.py',
    source_dir='../containers/relationship_extraction',
    hyperparameters=hyperparameters,
    role=aws_role,
    instance_count=1,
    instance_type=train_instance_type,
    output_path=f"s3://{bucket}/{prefix}/output",
    code_location=f"s3://{bucket}/{prefix}/output",
    base_job_name=tuning_job_name,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sess,
    volume_size=30,
    env={
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'
    },
    debugger_hook_config=False,
)

In [64]:
re_tuner = HyperparameterTuner(
    estimator,
    metric_definitions["metrics"][0]["Name"],
    hyperparameter_ranges,
    metric_definitions["metrics"],
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type=metric_definitions["type"],
    base_tuning_job_name=tuning_job_name,
)

re_tuner.fit({
    "train": f"s3://{bucket}/{prefix}/train/",
    "validation": f"s3://{bucket}/{prefix}/validation/",
})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-soln-docum-230628-0859




### 3.2. Deploy & run Inference on the fine-tuned model

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name_hpo = f"{config.SOLUTION_PREFIX}-re-hpo-endpoint"

finetuned_predictor_hpo = re_tuner.deploy(
    endpoint_name=endpoint_name_hpo,
    instance_type=inference_instance_type,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

time.sleep(10)

In [66]:
prediction_labels_hpo = []
for each_example in examples:
    prediction_labels_hpo.append(
        finetuned_predictor_hpo.predict(each_example)["Label"]
    )

In [67]:
from sklearn.metrics import accuracy_score, f1_score

# scikit-learn metrics take the ground truth first, then the predictions.
accuracy_hpo = accuracy_score(ground_truth, prediction_labels_hpo)
f1_macro_hpo = f1_score(ground_truth, prediction_labels_hpo, average='macro')
f1_micro_hpo = f1_score(ground_truth, prediction_labels_hpo, average='micro')


result_hpo = {"Accuracy": [accuracy_hpo], "F1 Macro": [f1_macro_hpo], "F1 Micro": [f1_micro_hpo]}

result_hpo = pd.DataFrame.from_dict(result_hpo, orient='index', columns=["With HPO"])


In [68]:
pd.concat([result, result_hpo], axis = 1)

| Metric | No HPO | With HPO |
|--- |--- |--- |
| Accuracy | 0.167096 | 0.167096 |
| F1 Macro | 0.015071 | 0.015071 |
| F1 Micro | 0.167096 | 0.167096 |


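In this run, the two models score identically on the hold-out test data, so hyperparameter optimization did not improve performance here. The search above is deliberately small (`max_jobs = 2`, only `learning-rate` is tuned, and each job trains for 2 epochs); a larger search budget may produce a different result.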

## 4. Clean up the endpoints

When you've finished with the relationship extraction endpoints (and the
associated endpoint configurations), make sure that you delete them to
avoid accidental charges.

In [69]:
# Delete the SageMaker endpoints and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

finetuned_predictor_hpo.delete_model()
finetuned_predictor_hpo.delete_endpoint()

INFO:sagemaker:Deleting model with name: sagemaker-soln-documents-js-ewrtgp-re-f-2023-06-28-08-30-00-319
INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-soln-documents-js-ewrtgp-re-finetune-endpoint-1
INFO:sagemaker:Deleting endpoint with name: sagemaker-soln-documents-js-ewrtgp-re-finetune-endpoint-1
INFO:sagemaker:Deleting model with name: sagemaker-soln-docum-2023-06-28-09-34-44-865
INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-soln-documents-js-ewrtgp-re-hpo-endpoint
INFO:sagemaker:Deleting endpoint with name: sagemaker-soln-documents-js-ewrtgp-re-hpo-endpoint
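
If one of the deletions fails (for example, because an endpoint was already removed), the remaining calls in the cell above are not executed. The sketch below wraps the same two calls, in the same order, so that cleanup continues and failures are collected. `clean_up` and `FakePredictor` are hypothetical; the fake class only stands in for a SageMaker `Predictor` so the snippet runs without AWS access.

```python
class FakePredictor:
    """Stand-in that records calls; not part of the SageMaker SDK."""

    def __init__(self, endpoint_name):
        self.endpoint_name = endpoint_name
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


def clean_up(predictors):
    """Delete the model, then the endpoint, for each predictor."""
    failures = []
    for predictor in predictors:
        name = getattr(predictor, "endpoint_name", repr(predictor))
        # Same order as in the cell above: model first, endpoint second.
        for action in ("delete_model", "delete_endpoint"):
            try:
                getattr(predictor, action)()
            except Exception as error:  # keep going so other endpoints are still removed
                failures.append((name, action, str(error)))
    return failures


fakes = [FakePredictor("finetune-endpoint"), FakePredictor("hpo-endpoint")]
failures = clean_up(fakes)
print(failures)
```

With real predictors, the call would be `clean_up([finetuned_predictor, finetuned_predictor_hpo])`, and a non-empty return value lists the resources that still need manual deletion.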
