# Custom Named Entity Recognition

Custom entity recognition lets you identify entity types that are specific to your domain and are not covered by the preset generic entity types. This means that you can analyze documents and extract entities, such as product codes or other business-specific entities, that fit your particular needs.

Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and the selection of the right algorithms and parameters for model training. Amazon Comprehend helps to reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.

Creating a custom entity recognition model is a more effective approach than using string matching or regular expressions to extract entities from documents. For example, to extract ENGINEER names from a document, it is difficult to enumerate all possible names. Without context, it is also challenging to distinguish between ENGINEER names and ANALYST names. A custom entity recognition model can learn the context in which those names are likely to appear. In addition, string matching does not detect entities that contain typos or follow new naming conventions, whereas a custom model can.

You have two options for creating a custom model:

* Annotations – provide a data set containing annotated entities for model training.
* Entity lists (plaintext only) – provide a list of entities and their type label (such as PRODUCT_CODES) and a set of unannotated documents containing those entities for model training.
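
To make the difference between the two options concrete, the sketch below builds a tiny example of each input file in memory. The annotations header matches the columns of the annotation file used later in this notebook; the `Text,Type` header for the entity list, the document text, the file name `documents.txt`, and the `PRODUCT_CODE` label are illustrative assumptions, and the end-offset convention (exclusive here) should be checked against the Amazon Comprehend documentation before you rely on it.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical one-line training document and entity (not from the dataset).
document = "Order AB-1234 shipped yesterday"
entity_text, entity_type = "AB-1234", "PRODUCT_CODE"

# Option 1: annotations locate each entity by document line and character offsets.
begin = document.find(entity_text)
end = begin + len(entity_text)  # assumption: exclusive end offset
annotations_csv = to_csv(
    ["File", "Line", "Begin Offset", "End Offset", "Type"],
    [["documents.txt", 0, begin, end, entity_type]],
)

# Option 2: an entity list only pairs the entity text with its type label.
entity_list_csv = to_csv(["Text", "Type"], [[entity_text, entity_type]])

print(annotations_csv)
print(entity_list_csv)
```

Annotations carry positional context, so they take more effort to produce; an entity list is cheaper to build but leaves it to the service to find the listed strings in the unannotated documents.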

# MIT Movie Dataset

For our example, we'll use a named entity recognition dataset as input data to train a model that recognizes entities specific to the movie industry.
The MIT movie dataset is widely used as a benchmark for evaluating named entity recognition. The dataset is detailed in the paper: https://arxiv.org/pdf/2106.01760.pdf
The dataset can be downloaded from: https://groups.csail.mit.edu/sls/downloads/movie/

## Highlights

* Multiple training and validation files are available. In this example, we use **trivia10ktrain.bio** and **trivia10ktest.bio** as the basis of the training and validation datasets.
* Since the input files are already in BIO format (https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), we will use the annotations option for training the custom model.
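
The notebook loads a ready-made `train_annotation.csv`; the conversion from BIO tags to that layout is not shown. The following is a minimal sketch of how such a conversion could work, not the procedure actually used to produce the files. It assumes each BIO line holds a tag followed by a token separated by whitespace, that sentences are separated by blank lines, that tokens are joined with single spaces, and that end offsets are exclusive. All of these are assumptions to verify against the real files and the Amazon Comprehend annotation requirements. The function name `bio_to_annotations` and the sample text are hypothetical.

```python
def bio_to_annotations(bio_text, file_name="train_document.txt"):
    """Convert BIO-tagged text into (documents, annotation rows).

    Each row is (File, Line, Begin Offset, End Offset, Type).
    Assumes "TAG TOKEN" per line and a blank line between sentences.
    """
    documents, rows = [], []
    tokens, tags = [], []

    def flush():
        # Turn the buffered sentence into one document line plus its rows.
        if not tokens:
            return
        line_no = len(documents)
        offset, current = 0, None  # current = [begin, end, type]
        for token, tag in zip(tokens, tags):
            begin, end = offset, offset + len(token)
            label = tag[2:]
            continues = tag.startswith("I-") and current and current[2] == label
            if continues:
                current[1] = end  # extend the open entity
            else:
                if current:
                    rows.append((file_name, line_no, *current))
                # A "B-" tag (or a stray "I-") opens a new entity; "O" opens none.
                current = [begin, end, label] if tag != "O" else None
            offset = end + 1  # account for the joining space
        if current:
            rows.append((file_name, line_no, *current))
        documents.append(" ".join(tokens))
        tokens.clear()
        tags.clear()

    for raw in bio_text.splitlines():
        parts = raw.split()
        if not parts:
            flush()
            continue
        tags.append(parts[0])
        tokens.append(parts[-1])
    flush()
    return documents, rows


# Hypothetical two-sentence sample in the assumed layout.
sample = "B-Actor\tjohn\nI-Actor\ttravolta\nO\tstarred\nO\tin\nB-Year\t2008\n\nB-Genre\tanimated\nO\tfilm\n"
documents, rows = bio_to_annotations(sample)
print(documents)
print(rows)
```

Writing `documents` one per line and `rows` as CSV would give a document file and an annotation file in the shape this notebook uploads. If the service expects inclusive end offsets, subtract one from each end value.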


In [2]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import boto3
import time
import sagemaker

In [3]:
comprehend = boto3.client("comprehend")
session = sagemaker.session.Session()
default_bucket = session.default_bucket()
s3_movie_entities_prefix = "data/comprehend/movie-ner/entities"
s3_movie_entities_train_upload_prefix = "data/comprehend/movie-ner/train"
s3_movie_entities_test_upload_prefix = "data/comprehend/movie-ner/test"
s3_movie_annotation_prefix = "data/comprehend/movie-ner/annotations"
data_access_role_arn = "arn:aws:iam::869530972998:role/service-role/AmazonComprehendServiceRole-default"
s3_movie_ner_job_output_prefix = "data/comprehend/movie-ner/job-output"

## Explore the train annotation dataset

In [4]:
entity_list_df = pd.read_csv("data/bio/train_annotation.csv")

In [5]:
entity_list_df.head()

Unnamed: 0,File,Line,Begin Offset,End Offset,Type
0,train_document.txt,0,0,12,Actor
1,train_document.txt,0,25,58,Plot
2,train_document.txt,0,60,74,Opinion
3,train_document.txt,0,76,100,Plot
4,train_document.txt,1,0,12,Actor


In [6]:
top_entity_list = entity_list_df['Type'].unique()
entity_list_for_training = [ { "Type" : x} for x in top_entity_list ]

In [7]:
top_entity_list

array(['Actor', 'Plot', 'Opinion', 'Award', 'Year', 'Genre', 'Origin',
       'Director', 'Soundtrack', 'Relationship', 'Character_Name',
       'Quote'], dtype=object)

# Train a Custom Named Entity Recognition Model
A custom entity recognizer identifies only the entity types that you include when you train the model. It does not automatically include the preset entity types. If you also want to identify preset entity types, such as LOCATION, DATE, or PERSON, you need to provide additional training data for those entities.

When you create a custom entity recognizer using annotated PDF files, you can use the recognizer with a variety of input file formats: plaintext, image files (JPG, PNG, TIFF), PDF files, and Word documents, with no pre-processing or doc flattening required. Amazon Comprehend doesn't support annotation of image files or Word documents.

In [8]:
s3 = boto3.client("s3")

# (local file, S3 key prefix) pairs. S3 keys always use "/" separators,
# so the key is built with an f-string rather than os.path.join.
uploads = [
    ("data/bio/train_document.txt", s3_movie_entities_train_upload_prefix),
    ("data/bio/test_document.txt", s3_movie_entities_test_upload_prefix),
    ("data/bio/train_annotation.csv", s3_movie_annotation_prefix),
    ("data/bio/test_annotation.csv", s3_movie_annotation_prefix),
]
for local_path, prefix in uploads:
    with open(local_path, "rb") as f:
        s3.upload_fileobj(f, default_bucket, f"{prefix}/{os.path.basename(local_path)}")

In [9]:
## Uncomment the following block to train a custom entity recognizer model

In [10]:
# response = comprehend.create_entity_recognizer(
#     RecognizerName=f"Movies-NER-{int(time.time())}",
#     LanguageCode="en",
#     DataAccessRoleArn=data_access_role_arn,
#     InputDataConfig={
#         'EntityTypes': entity_list_for_training,
#         "Documents": {
#             "S3Uri": f"s3://{default_bucket}/{s3_movie_entities_train_upload_prefix}",
#             'TestS3Uri': f"s3://{default_bucket}/{s3_movie_entities_test_upload_prefix}",
#             'InputFormat': 'ONE_DOC_PER_LINE'
#         },
#         'Annotations': {
#             'S3Uri': f"s3://{default_bucket}/{s3_movie_annotation_prefix}/train_annotation.csv",
#             'TestS3Uri': f"s3://{default_bucket}/{s3_movie_annotation_prefix}/test_annotation.csv"
#         },
#     }
# )
# recognizer_arn = response["EntityRecognizerArn"]

In [11]:
recognizer_arn = "arn:aws:comprehend:us-east-2:869530972998:entity-recognizer/Movies-NER-1670126643"

In [12]:
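# Poll until training reaches a terminal state
while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if status in ("TRAINED", "TRAINED_WITH_WARNING"):
        break
    if status in ("IN_ERROR", "STOPPED", "DELETING"):
        # Stop polling on any state from which training cannot finish
        print(f"Training did not complete; status: {status}")
        break

    time.sleep(10)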

In [13]:
# Describe the entity recognizer job status
response = comprehend.describe_entity_recognizer(EntityRecognizerArn=recognizer_arn)
print(response)

{'EntityRecognizerProperties': {'EntityRecognizerArn': 'arn:aws:comprehend:us-east-2:869530972998:entity-recognizer/Movies-NER-1670126643', 'LanguageCode': 'en', 'Status': 'TRAINED', 'SubmitTime': datetime.datetime(2022, 12, 4, 4, 4, 3, 458000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 12, 4, 4, 18, 20, 733000, tzinfo=tzlocal()), 'TrainingStartTime': datetime.datetime(2022, 12, 4, 4, 8, 22, 207000, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2022, 12, 4, 4, 17, 26, 21000, tzinfo=tzlocal()), 'InputDataConfig': {'DataFormat': 'COMPREHEND_CSV', 'EntityTypes': [{'Type': 'Actor'}, {'Type': 'Plot'}, {'Type': 'Opinion'}, {'Type': 'Award'}, {'Type': 'Year'}, {'Type': 'Genre'}, {'Type': 'Origin'}, {'Type': 'Director'}, {'Type': 'Soundtrack'}, {'Type': 'Relationship'}, {'Type': 'Character_Name'}, {'Type': 'Quote'}], 'Documents': {'S3Uri': 's3://sagemaker-us-east-2-869530972998/data/comprehend/movie-ner/train', 'TestS3Uri': 's3://sagemaker-us-east-2-869530972998/data/compr

# Perform Real-Time Analysis with the Trained Custom Entity Recognizer

You can use Amazon Comprehend to run real-time analysis with a custom model. First, you create an endpoint for the model. After the endpoint is created, you send requests to it to run the real-time analysis.

In [14]:
## Uncomment the following block to deploy an endpoint

In [15]:
# endpoint_name = f"MoviesNER-{int(time.time())}"
# response = comprehend.create_endpoint(
#     EndpointName=endpoint_name,
#     ModelArn=recognizer_arn,
#     DesiredInferenceUnits=10,
#     DataAccessRoleArn=data_access_role_arn
# )
# print(response)

In [16]:
endpoint_arn = "arn:aws:comprehend:us-east-2:869530972998:entity-recognizer-endpoint/MoviesNER-1670128101"
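
An endpoint takes some time to provision, and requests sent before it is ready are expected to fail. The helper below is a sketch of how you might wait for the endpoint before calling `detect_entities`. It relies on the boto3 `describe_endpoint` call and the `EndpointProperties` / `Status` fields of its response; the function name `wait_for_endpoint`, the polling interval, and the timeout are illustrative assumptions. The `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import time


def wait_for_endpoint(client, endpoint_arn, poll_seconds=30,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll a Comprehend endpoint until it is IN_SERVICE and return its properties."""
    waited = 0
    while True:
        props = client.describe_endpoint(EndpointArn=endpoint_arn)["EndpointProperties"]
        status = props["Status"]
        if status == "IN_SERVICE":
            return props
        if status in ("FAILED", "DELETING"):
            # The endpoint can no longer become available.
            raise RuntimeError(f"Endpoint is not usable; status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeComprehend:
    """Stand-in client (hypothetical) that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def describe_endpoint(self, EndpointArn):
        return {"EndpointProperties": {"Status": self._statuses.pop(0)}}


# Dry run against the stand-in client; no AWS calls are made here.
demo = wait_for_endpoint(
    FakeComprehend(["CREATING", "CREATING", "IN_SERVICE"]),
    "arn:example",
    sleep=lambda seconds: None,
)
print(demo["Status"])
```

In the notebook you would call `wait_for_endpoint(comprehend, endpoint_arn)` with the real client before running the detection cell.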

In [17]:
response = comprehend.detect_entities(
    Text='what 2008 disney animated film starred john travolta as the titular dog and miley cyrus as his owner',
    LanguageCode='en',
    EndpointArn=endpoint_arn
)

In [18]:
response

{'Entities': [{'Score': 0.9999878406524658,
   'Type': 'Year',
   'Text': '2008',
   'BeginOffset': 5,
   'EndOffset': 9},
  {'Score': 0.5947182774543762,
   'Type': 'Genre',
   'Text': 'disney animated',
   'BeginOffset': 10,
   'EndOffset': 25},
  {'Score': 0.9999480843544006,
   'Type': 'Actor',
   'Text': 'john travolta',
   'BeginOffset': 39,
   'EndOffset': 52},
  {'Score': 0.9843842387199402,
   'Type': 'Actor',
   'Text': 'miley cyrus',
   'BeginOffset': 76,
   'EndOffset': 87}],
 'ResponseMetadata': {'RequestId': 'b3acad40-55b8-4304-bd4d-620b7d6512ed',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b3acad40-55b8-4304-bd4d-620b7d6512ed',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '398',
   'date': 'Tue, 10 Jan 2023 17:57:29 GMT'},
  'RetryAttempts': 0}}
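
Each detected entity carries a confidence `Score`, and the response above contains one comparatively weak prediction (`disney animated`, about 0.59). A common follow-up step is to drop low-confidence entities and group the rest by type. The sketch below does that on a trimmed copy of the response shown above; the helper names and the 0.9 threshold are illustrative assumptions, not recommended values.

```python
def filter_entities(response, min_score=0.9):
    """Keep only the entities whose confidence score is at least min_score."""
    return [e for e in response["Entities"] if e["Score"] >= min_score]


def group_by_type(entities):
    """Map each entity type to the list of entity texts, in detection order."""
    grouped = {}
    for entity in entities:
        grouped.setdefault(entity["Type"], []).append(entity["Text"])
    return grouped


# Trimmed copy of the detect_entities response shown above.
sample_response = {
    "Entities": [
        {"Score": 0.9999878406524658, "Type": "Year", "Text": "2008"},
        {"Score": 0.5947182774543762, "Type": "Genre", "Text": "disney animated"},
        {"Score": 0.9999480843544006, "Type": "Actor", "Text": "john travolta"},
        {"Score": 0.9843842387199402, "Type": "Actor", "Text": "miley cyrus"},
    ]
}

confident = group_by_type(filter_entities(sample_response, min_score=0.9))
print(confident)  # {'Year': ['2008'], 'Actor': ['john travolta', 'miley cyrus']}
```

The right threshold depends on how costly false positives are for your application, so it is worth tuning against held-out data rather than fixing it in advance.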

# Run Custom Entity Recognition Job Asynchronously

You can run an asynchronous analysis job to detect custom entities in a set of one or more documents.

## Before you begin
You need a custom entity recognition model (also known as a recognizer) before you can detect custom entities. For more information about these models, see Training custom recognizers.

A recognizer that is trained with plain-text annotations supports entity detection for plain-text documents only. A recognizer that is trained with PDF document annotations supports entity detection for plain-text documents, images, PDF files, and Word documents. For files other than text files, Amazon Comprehend performs text extraction before running the analysis. For information about the input files, see Inputs for asynchronous custom analysis.

To run an async analysis job, you perform the following overall steps:

* Store the documents in an Amazon S3 bucket.
* Use the API or console to start the analysis job.
* Monitor the progress of the analysis job.
* After the job runs to completion, retrieve the results of the analysis from the S3 bucket that you specified when you started the job.

In [33]:
response = comprehend.start_entities_detection_job(
    EntityRecognizerArn=recognizer_arn,
    JobName=f"Movies-NER-{int(time.time())}",
    LanguageCode="en",
    DataAccessRoleArn=data_access_role_arn,
    InputDataConfig={
        # Each line of the input file is treated as a separate document
        "S3Uri": f"s3://{default_bucket}/{s3_movie_entities_test_upload_prefix}",
        "InputFormat": "ONE_DOC_PER_LINE"
    },
    OutputDataConfig={
        "S3Uri": f"s3://{default_bucket}/{s3_movie_ner_job_output_prefix}"
    }
)
print(response)

{'JobId': 'fc076550e00c40e04895ee664701af6c', 'JobArn': 'arn:aws:comprehend:us-east-2:869530972998:entities-detection-job/fc076550e00c40e04895ee664701af6c', 'JobStatus': 'SUBMITTED', 'ResponseMetadata': {'RequestId': '31b74b9e-5e06-4bfb-a21a-34cb21db895d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '31b74b9e-5e06-4bfb-a21a-34cb21db895d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '177', 'date': 'Mon, 05 Dec 2022 18:47:43 GMT'}, 'RetryAttempts': 0}}
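
The job starts in the `SUBMITTED` state, so the monitoring and retrieval steps listed earlier still have to happen before the results can be read. The sketch below shows one way to do both. It uses the boto3 `describe_entities_detection_job` call and assumes the job output is a gzipped tar archive containing JSON Lines records, which you should confirm against your own job output. The function names, the polling interval, and the stand-in client are illustrative assumptions.

```python
import io
import json
import tarfile
import time


def wait_for_detection_job(client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll until the job reaches a terminal state and return its properties."""
    while True:
        props = client.describe_entities_detection_job(JobId=job_id)[
            "EntitiesDetectionJobProperties"
        ]
        status = props["JobStatus"]
        if status == "COMPLETED":
            return props
        if status in ("FAILED", "STOPPED"):
            raise RuntimeError(f"Job {job_id} ended with status {status}")
        sleep(poll_seconds)


def read_job_output(tar_bytes):
    """Parse every JSON Lines record from a .tar.gz archive held in memory."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            text = archive.extractfile(member).read().decode("utf-8")
            records.extend(json.loads(line) for line in text.splitlines() if line.strip())
    return records


class FakeComprehend:
    """Stand-in client (hypothetical) that replays a fixed list of job statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def describe_entities_detection_job(self, JobId):
        return {"EntitiesDetectionJobProperties": {"JobStatus": self._statuses.pop(0)}}


# Dry run against the stand-in client; no AWS calls are made here.
job_props = wait_for_detection_job(
    FakeComprehend(["SUBMITTED", "IN_PROGRESS", "COMPLETED"]),
    "example-job-id",
    sleep=lambda seconds: None,
)
print(job_props["JobStatus"])
```

With the real client, `wait_for_detection_job(comprehend, response["JobId"])` returns properties whose `OutputDataConfig["S3Uri"]` points at the output archive. After downloading that object (for example with `s3.download_fileobj` into a `BytesIO`), pass its bytes to `read_job_output`, or extract it to a local folder as the next cell expects.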


In [36]:
entity_recognition_job_output_df = pd.read_json("results/entity_recognition/output", lines=True)

In [None]:
entity_recognition_job_output_df.head()

In [None]:
# Each record in the job output holds its detected entities under "Entities"
entity_recognition_job_output_df['Entities']