# Custom Entity Detection with Textract and Comprehend

## Contents
1. [Background](#Background)
1. [Setup](#Setup)
1. [Data Prep](#Data-Prep)
1. [Textract OCR++](#Textract-OCR++)
1. [Amazon GroundTruth Labeling](#Amazon-GroundTruth-Labeling)
1. [Comprehend Custom Entity Training](#Comprehend-Custom-Entity-Training)
1. [Model Performance](#Model-Performance)
1. [Inference](#Inference)
1. [Results](#Results)


## Background

In this notebook, we cover how to build a custom entity recognizer using Amazon Textract and Amazon Comprehend. We use Amazon Textract to perform OCR++ on scanned documents, Amazon SageMaker Ground Truth to label the entities of interest, and then pass the extracted text to Amazon Comprehend to train a custom entity recognition model. No prior machine learning knowledge is required.

In this example, we use a public dataset from Kaggle: [Resume Entities for NER](https://www.kaggle.com/dataturks/resume-entities-for-ner?select=Entity+Recognition+in+Resumes.json). The dataset comprises 220 candidate resumes in JSON format.


## Setup
_This notebook was created on an ml.t2.medium notebook instance._

Let's start by installing and importing all the necessary libraries:

In [2]:
# Installing tqdm Python Library
!pip install tqdm



In [3]:
import sagemaker
import logging
import boto3
import glob
import time
import os 
from tqdm import tqdm
import json


region = boto3.Session().region_name    
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()
prefix = 'textract_comprehend_NER'

iam = boto3.client('iam')


# Create Data Access Role for Comprehend

## Create Policy

In [4]:
assume_role_policy_doc = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "comprehend.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
} 

## Create Role and Attach Policies

In [5]:
iam_textract_comprehend_role_name = 'DSOAWS_Textract_Comprehend'

In [6]:
# import json
# import time

# from botocore.exceptions import ClientError

# try:
#     iam_role_textract_comprehend = iam.create_role(
#         RoleName=iam_textract_comprehend_role_name,
#         AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
#         Description='DSOAWS Textract Comprehend Role'
#     )
# except ClientError as e:
#     if e.response['Error']['Code'] == 'EntityAlreadyExists':
#         iam_role_textract_comprehend = iam.get_role(RoleName=iam_textract_comprehend_role_name)
#         print("Role already exists")
#     else:
#         print("Unexpected error: %s" % e)
        
# time.sleep(30)

In [7]:
textract_comprehend_s3_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(bucket)
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket)
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(bucket)
            ],
            "Effect": "Allow"
        }
    ]
}

print(textract_comprehend_s3_policy_doc)

{'Version': '2012-10-17', 'Statement': [{'Action': ['s3:GetObject'], 'Resource': ['arn:aws:s3:::sagemaker-us-east-1-662559257807/*'], 'Effect': 'Allow'}, {'Action': ['s3:ListBucket'], 'Resource': ['arn:aws:s3:::sagemaker-us-east-1-662559257807'], 'Effect': 'Allow'}, {'Action': ['s3:PutObject'], 'Resource': ['arn:aws:s3:::sagemaker-us-east-1-662559257807/*'], 'Effect': 'Allow'}]}


## Attach Policy to Role

In [8]:
# import time

# response = iam.put_role_policy(
#     RoleName=iam_textract_comprehend_role_name,
#     PolicyName='DSOAWS_ComprehendPolicyToS3',
#     PolicyDocument=json.dumps(textract_comprehend_s3_policy_doc)
# )

# print(response)

# time.sleep(30)

In [9]:
#iam_role_textract_comprehend_arn = iam_role_textract_comprehend['Role']['Arn']
iam_role_textract_comprehend_arn = 'arn:aws:iam::662559257807:role/myComprehendDataAccessRole'

## Data Prep <a class="anchor" id="Data-Prep"></a>

PDF and PNG are the most common formats for scanned documents within enterprises. We have already converted these resumes into PDF format to emulate this. Let's upload these PDF resumes to S3 for Textract processing. Note that the dataset contains only 220 resumes, which is very small by modern standards. The dataset also comes with a few labeled custom entities. However, we will run it through Amazon SageMaker Ground Truth to obtain a fresh entity list.

In [10]:
# Uploading PDF resumes to S3
pdfResumeFileList = glob.glob("./resume_pdf/*.pdf")
prefix_resume_pdf = prefix + "/resume_pdf/"

for filePath in tqdm(pdfResumeFileList):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)

resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'resume_pdf/'
print('Uploaded Resume PDFs :\t', resume_pdf_bucket_name)

100%|██████████| 220/220 [00:34<00:00,  6.36it/s]

Uploaded Resume PDFs :	 s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/resume_pdf/





## Textract OCR++ <a class="anchor" id="Textract-OCR++"></a>

Now that these PDFs are ready for Textract to perform OCR++, you can kick off the process with the asynchronous [StartDocumentTextDetection](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html) API call. Here we process only the first five resume PDFs to demonstrate the process. To save time, all 220 resumes have already been processed and are available for you. See the textract_output directory for all the results.
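The cell that follows imports its helpers from `s3_textract_functions`, whose source is not shown in this notebook. As a rough idea of what such helpers might do, here is a minimal sketch built on the boto3 Textract calls `start_document_text_detection` and `get_document_text_detection`. The function names, the `client` parameter, and the polling interval are assumptions made for illustration, not the notebook's actual implementation.

```python
import time


def start_text_detection(client, bucket, key):
    """Submit an asynchronous Textract text-detection job and return its JobId."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def get_text_detection_blocks(client, job_id, poll_seconds=5):
    """Poll until the job leaves IN_PROGRESS, then collect blocks from every result page."""
    response = client.get_document_text_detection(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        time.sleep(poll_seconds)
        response = client.get_document_text_detection(JobId=job_id)

    # Anything other than SUCCEEDED (for example FAILED) is treated as an error here.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])

    blocks = list(response.get("Blocks", []))
    # Larger documents are returned in pages linked by NextToken.
    while "NextToken" in response:
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


def lines_from_blocks(blocks):
    """Keep only the text of LINE blocks, in the order Textract returned them."""
    return [block["Text"] for block in blocks if block["BlockType"] == "LINE"]


# Usage (requires AWS credentials, so it is not executed here):
#   import boto3
#   textract = boto3.client("textract")
#   job_id = start_text_detection(textract, bucket, "path/to/resume.pdf")
#   text = "\n".join(lines_from_blocks(get_text_detection_blocks(textract, job_id)))
```

Passing the client in as an argument keeps the sketch easy to test with a stand-in object and avoids creating AWS clients at import time.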

In [11]:
s3_client = boto3.client('s3')
pdf_object_list = []

# Getting a list of resume PDF files:
response = s3_client.list_objects(
    Bucket= bucket,
    Prefix= prefix+'/'+'resume_pdf/text_output'
)

for obj in response['Contents']:
    pdf_object_list.append(obj['Key'])

pdf_object_list[:5]

['textract_comprehend_NER/resume_pdf/text_output_1.pdf',
 'textract_comprehend_NER/resume_pdf/text_output_10.pdf',
 'textract_comprehend_NER/resume_pdf/text_output_100.pdf',
 'textract_comprehend_NER/resume_pdf/text_output_101.pdf',
 'textract_comprehend_NER/resume_pdf/text_output_102.pdf']

In [12]:
%%time

from s3_textract_functions import *
import codecs

sample_to_process = 5

# We are only processing few files as example; You do not need to process all 220 files
for file_obj in tqdm(pdf_object_list[:sample_to_process]):
    print('Textract Processing PDF: \t'+ file_obj)             
    job_id = StartDocumentTextDetection(bucket, file_obj)
    print('Textract Job Submitted: \t'+ job_id)
    response = getDocumentTextDetection(job_id)
    
    # renaming .pdf to .text
    text_output_name = file_obj.replace('.pdf', '.txt')
    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]
    print('Output Name:\t', text_output_name)
    
    output_dir = './textract_output/'
    
    # Writing Textract Output to Text Files (the with block closes the file automatically):
    with codecs.open(output_dir + text_output_name, "w", "utf-8") as output_file:
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                #print('\033[94m' + item["Text"] + '\033[0m')
                output_file.write(item["Text"]+'\n')


  0%|          | 0/5 [00:00<?, ?it/s]

Textract Processing PDF: 	textract_comprehend_NER/resume_pdf/text_output_1.pdf
Textract Job Submitted: 	1a87d21ea8d8876c9556b9c149292b8bfc763d0cb5d37373f276c9b49aa29607
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS


 20%|██        | 1/5 [02:05<08:23, 125.97s/it]

Job status: SUCCEEDED
Output Name:	 text_output_1.txt
Textract Processing PDF: 	textract_comprehend_NER/resume_pdf/text_output_10.pdf
Textract Job Submitted: 	65750aefe80c8f71a0abbec898ca0f2f8482ea95748b20283a9d173dbd42166d
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED


 40%|████      | 2/5 [04:27<06:31, 130.59s/it]

Output Name:	 text_output_10.txt
Textract Processing PDF: 	textract_comprehend_NER/resume_pdf/text_output_100.pdf
Textract Job Submitted: 	9787182d66d37d0c4ce889294f9a520a7bbfdb42b8ece0f62963c43b13046c1a
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED


 60%|██████    | 3/5 [05:33<03:42, 111.17s/it]

Output Name:	 text_output_100.txt
Textract Processing PDF: 	textract_comprehend_NER/resume_pdf/text_output_101.pdf
Textract Job Submitted: 	15f1dfdb9406103fcbe639ea6e6057b7526dc5ddc90a5de45ea696c8dde69f45
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED


 80%|████████  | 4/5 [05:53<01:24, 84.06s/it] 

Output Name:	 text_output_101.txt
Textract Processing PDF: 	textract_comprehend_NER/resume_pdf/text_output_102.pdf
Textract Job Submitted: 	3cdeb0482dc8b8b7a64c9cdff914cd6a9e3777e26510e75a9a882ada8961608f
Job status: IN_PROGRESS
Job status: SUCCEEDED


100%|██████████| 5/5 [06:04<00:00, 72.98s/it]

Output Name:	 text_output_102.txt
CPU times: user 950 ms, sys: 49.1 ms, total: 999 ms
Wall time: 6min 4s





In [13]:
from tqdm import tqdm
    
# Uploading Textract Output to S3
textract_output_filelist = glob.glob("./textract_output/*.txt")
prefix_textract_output = prefix + "/textract_output/"

for filePath in tqdm(textract_output_filelist):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_textract_output, file_name)).upload_file(filePath)

comprehend_input_doucuments = 's3://' + bucket+'/'+prefix_textract_output
print('Textract Output:\t', comprehend_input_doucuments)

100%|██████████| 6/6 [00:01<00:00,  3.66it/s]

Textract Output:	 s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/textract_output/





## Amazon GroundTruth Labeling <a class="anchor" id="Amazon-GroundTruth-Labeling"></a>

Training a custom entity recognition model with Comprehend, as with any machine learning model, requires a large amount of labeled training data. In this example, we use Amazon SageMaker Ground Truth to label our entities. By default, Amazon Comprehend can already recognize entities such as [Person, Title, Organization, and so on](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html). To demonstrate the custom entity recognition capability, we focus on Skill entities inside these resumes. We have already labeled and cleaned the data with Ground Truth (see: entity_list.csv). If you are interested, you can follow this blog post to [add a data labeling workflow for named entity recognition](https://aws.amazon.com/blogs/machine-learning/adding-a-data-labeling-workflow-for-named-entity-recognition-with-amazon-sagemaker-ground-truth/).


Before we start training, let's upload the entity list to S3:
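The contents of entity_list.csv are not shown in this notebook. Comprehend expects an entity list to be a CSV file with a `Text,Type` header row, where `Type` matches one of the entity types passed at training time (`SKILLS` here). The sketch below shows how a file in that shape could be produced; the sample entries and the helper name `build_entity_list_csv` are hypothetical and only illustrate the layout.

```python
import csv
import io

# Hypothetical sample entries; the real list comes from the Ground Truth labeling job.
sample_entities = [
    ("Java", "SKILLS"),
    ("SQL", "SKILLS"),
    ("problem solving", "SKILLS"),
]


def build_entity_list_csv(entities):
    """Return CSV text with the Text,Type header followed by one row per entity."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    writer.writerows(entities)
    return buffer.getvalue()


print(build_entity_list_csv(sample_entities))
```

Using the `csv` module rather than string concatenation means entries containing commas or quotes are escaped correctly.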

In [14]:
# Uploading Entity List to S3
entity_list_file = './entity_list.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/entity_list/', 'entity_list.csv')).upload_file(entity_list_file)

comprehend_input_entity_list = 's3://' + bucket+'/'+prefix+'/entity_list/'+'entity_list.csv'
print('Entity List:\t', comprehend_input_entity_list)

Entity List:	 s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/entity_list/entity_list.csv


## Comprehend Custom Entity Training <a class="anchor" id="Comprehend-Custom-Entity-Training"></a>

Now that we have both the raw and the labeled data, we are ready to train our model. You can kick off the process with the `create_entity_recognizer` API call.

In [15]:
comprehend_client = boto3.client('comprehend')

custom_recognizer_name = 'resume-entity-recognizer-'+ str(int(time.time()))

comprehend_custom_recognizer_response = comprehend_client.create_entity_recognizer(
    RecognizerName = custom_recognizer_name,
    DataAccessRoleArn=iam_role_textract_comprehend_arn,
    InputDataConfig={
        'EntityTypes': [
            {
                'Type': 'SKILLS'
            },
        ],
        'Documents': {
            'S3Uri': comprehend_input_doucuments
        },
        'EntityList': {
            'S3Uri': comprehend_input_entity_list
        }
    },
    LanguageCode='en'
)

print(json.dumps(comprehend_custom_recognizer_response, indent=2))

{
  "EntityRecognizerArn": "arn:aws:comprehend:us-east-1:662559257807:entity-recognizer/resume-entity-recognizer-1649795645",
  "ResponseMetadata": {
    "RequestId": "1fffff98-f50e-453b-836b-4df06ea908f5",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "1fffff98-f50e-453b-836b-4df06ea908f5",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "121",
      "date": "Tue, 12 Apr 2022 20:34:05 GMT"
    },
    "RetryAttempts": 0
  }
}


Once the training job is submitted, you can see the recognizer being trained in the Comprehend console. 
Training takes approximately 20 minutes.
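Because training runs for many minutes, any loop that waits for it should also stop on failure states and give up after a time limit. The following is a minimal sketch of such a wait helper. The name `wait_for_status`, its parameters, and the default timeout are assumptions for illustration; the status strings passed in the usage comment are the ones Comprehend reports for entity recognizers.

```python
import time


def wait_for_status(get_status, success, failure, poll_seconds=10,
                    timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() until it returns `success`; raise on failure or timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == success:
            return status
        if status in failure:
            raise RuntimeError("Job ended with status " + status)
        if waited >= timeout_seconds:
            raise TimeoutError("Gave up after %d seconds, last status %s" % (waited, status))
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the recognizer created above (requires AWS credentials, not executed here):
#   def recognizer_status():
#       response = comprehend_client.describe_entity_recognizer(
#           EntityRecognizerArn=comprehend_custom_recognizer_response['EntityRecognizerArn'])
#       return response['EntityRecognizerProperties']['Status']
#
#   wait_for_status(recognizer_status, 'TRAINED', {'IN_ERROR', 'STOPPED'})
```

The same helper could be reused for the detection job later in the notebook by passing `'COMPLETED'` as the success state and `{'FAILED', 'STOPPED'}` as the failure states.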

## Model Performance <a class="anchor" id="Model-Performance"></a>

During training, Comprehend divides the dataset into training documents and test documents. Once the recognizer is trained, you can see its overall performance, as well as the performance for each entity.

In [16]:
comprehend_model_response = comprehend_client.describe_entity_recognizer(
    EntityRecognizerArn=comprehend_custom_recognizer_response['EntityRecognizerArn']
)
status = comprehend_model_response['EntityRecognizerProperties']['Status']
print('Training Job Status:\t', status)

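# Poll until training reaches a terminal state, so a failed job does not loop forever
while status not in ('TRAINED', 'IN_ERROR', 'STOPPED'):
    comprehend_model_response = comprehend_client.describe_entity_recognizer(
        EntityRecognizerArn=comprehend_custom_recognizer_response['EntityRecognizerArn']
    )
    status = comprehend_model_response['EntityRecognizerProperties']['Status']
    print(status)
    time.sleep(10)

if status != 'TRAINED':
    raise RuntimeError('Recognizer training ended with status: ' + status)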

Training Job Status:	 SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
SUBMITTED
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINED


In [17]:
print('Number of Document Trained:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTrainedDocuments'])
print('Number of Document Tested:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTestDocuments'])
print('\n-------------- Evaluation Metrics: ----------------')
print('Precision:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Precision'])
print('ReCall:\t\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Recall'])
print('F1 Score:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['F1Score'])    

Number of Document Trained:	 415
Number of Document Tested:	 53

-------------- Evaluation Metrics: ----------------
Precision:	 75.0
ReCall:		 68.42105263157895
F1 Score:	 71.55963302752295
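The F1 score is the harmonic mean of precision and recall, so the reported value can be checked directly from the other two metrics. A short sketch using the numbers printed above (the function name `f1_score` is just a local helper for this check):

```python
# Values copied from the evaluation metrics printed above (percentages).
precision = 75.0
recall = 68.42105263157895


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


print(round(f1_score(precision, recall), 4))  # 71.5596
```

The result agrees with the F1 score reported by Comprehend, which confirms how the three metrics relate.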


## Inference

Next, we have prepared a small sample of text to test our newly trained custom entity recognizer. First, we upload the document to S3 and start a custom recognizer job. Once the job is submitted, you can follow its progress in the console under `Amazon Comprehend` ==> `Analysis Jobs`.

### Uploading Test PDF resumes to S3 for OCR++

In [18]:
pdfResumeFileList = glob.glob("./test_document/*.pdf")
prefix_resume_pdf = prefix + "/test_document/"

for filePath in tqdm(pdfResumeFileList):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)

resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'test_document/'
print('Uploaded Resume PDFs :\t', resume_pdf_bucket_name)

100%|██████████| 1/1 [00:00<00:00,  6.36it/s]

Uploaded Resume PDFs :	 s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/test_document/





### Performing OCR++ Using Textract

In [19]:
pdf_object_list = []
pdf_object_list.append(prefix_resume_pdf+"test_document.pdf")

output_dir = './test_document/'

for file_obj in tqdm(pdf_object_list):
    print('Textract Processing PDF: \t'+ file_obj)             
    job_id = StartDocumentTextDetection(bucket, file_obj)
    print('Textract Job Submitted: \t'+ job_id)
    response = getDocumentTextDetection(job_id)
    
    # renaming .pdf to .text
    text_output_name = file_obj.replace('.pdf', '.txt')
    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]
    print('Output Name:\t', text_output_name)
    
    
    # Writing Textract Output to Text Files (the with block closes the file automatically):
    with codecs.open(output_dir + text_output_name, "w", "utf-8") as output_file:
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')
                output_file.write(item["Text"]+'\n')

  0%|          | 0/1 [00:00<?, ?it/s]

Textract Processing PDF: 	textract_comprehend_NER/test_document/test_document.pdf
Textract Job Submitted: 	19dce9f13396692b4d51a1a1d4f8fbe51a297d4bba71319c2ca27015cf886a1a
Job status: IN_PROGRESS


100%|██████████| 1/1 [00:10<00:00, 10.39s/it]

Job status: SUCCEEDED
Output Name:	 test_document.txt
Tom Jackson
Skill Summary:
- Strong analytical and problem solving skills
- Holds AWS Certified Associated Solution Architect Certification
- Databases: MySQL, SQL
- Programming Languages: C, C++, Java, PHP, JavaScript





### Uploading the Textract Result for Inference

In [20]:
# Uploading test document onto S3:
test_document = './test_document/test_document.txt'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/test_document/', 'test_document.txt')).upload_file(test_document)

s3_test_document = 's3://' + bucket+'/'+prefix+'/test_document/'+'test_document.txt'
s3_test_document_output = 's3://' + bucket+'/'+prefix+'/test_document/'
print('Test Document Input: ', s3_test_document)
print('Test Document Output: ', s3_test_document_output)

Test Document Input:  s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/test_document/test_document.txt
Test Document Output:  s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/test_document/


In [21]:
!aws s3 ls s3://$bucket/textract_comprehend_NER/entity_list/

2022-04-12 20:34:06     107010 entity_list.csv


In [22]:
# Start a recognizer Job:
custom_recognizer_job_name = 'recognizer-job-'+ str(int(time.time()))

recognizer_response = comprehend_client.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_document,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_test_document_output
    },
    DataAccessRoleArn=iam_role_textract_comprehend_arn,
    JobName=custom_recognizer_job_name,
    EntityRecognizerArn=comprehend_model_response['EntityRecognizerProperties']['EntityRecognizerArn'],
    LanguageCode='en'
)

Use the following code to check whether the detection job has completed:

In [23]:
job_response = comprehend_client.describe_entities_detection_job(
    JobId=recognizer_response['JobId']
)

status = job_response['EntitiesDetectionJobProperties']['JobStatus']

In [30]:
# Poll until the job reaches a terminal state, so a failed job does not loop forever
while status not in ('COMPLETED', 'FAILED', 'STOPPED'):
    job_response = comprehend_client.describe_entities_detection_job(
        JobId=recognizer_response['JobId']
    )
    status = job_response['EntitiesDetectionJobProperties']['JobStatus']
    print(status)
    time.sleep(10)

In [31]:
print('Detection Job Name:\t', job_response['EntitiesDetectionJobProperties']['JobName'])
print('Detection Job ID:\t', job_response['EntitiesDetectionJobProperties']['JobId'])


Detection Job Name:	 recognizer-job-1649796524
Detection Job ID:	 07cafca19885d269587cd7cfebc8c9a8


## Results

Once the analysis job is done, you can download the output and see the results. Here we convert the JSON result into a table.

In [32]:
output_url = job_response['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']

print('S3 Output URL:\t', output_url)

S3 Output URL:	 s3://sagemaker-us-east-1-662559257807/textract_comprehend_NER/test_document/662559257807-NER-07cafca19885d269587cd7cfebc8c9a8/output/output.tar.gz
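The output location is an S3 URI, and the S3 download call needs the bucket name and the object key as separate arguments. A minimal sketch of that split with `urllib.parse.urlparse` follows; the helper name `split_s3_uri` and the example URI are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    parsed = urlparse(uri, allow_fragments=False)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI used only to demonstrate the split.
bucket_name, key = split_s3_uri("s3://example-bucket/some/prefix/output/output.tar.gz")
print(bucket_name)  # example-bucket
print(key)          # some/prefix/output/output.tar.gz
```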


In [34]:
from urllib.parse import urlparse

#create dir for output file:
!mkdir -p test_document_output

# Downloading Output File
if job_response['EntitiesDetectionJobProperties']['JobStatus'] == 'COMPLETED':
    filename = './test_document_output/output.tar.gz'
    output_url_o = urlparse(output_url, allow_fragments=False)
    s3_client.download_file(output_url_o.netloc, output_url_o.path.lstrip('/'), filename)

    !cd test_document_output; tar -xvzf output.tar.gz
    
    print("Output downloaded ... ")
else:
    print("Analysis job did not finish successfully.")

output
Output downloaded ... 


In [35]:
from IPython.display import HTML, display

output_file_name = './test_document_output/output'
data = [['Start Offset', 'End Offset', 'Confidence', 'Text', 'Type']]

with open(output_file_name, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        json_line = json.loads(line)  # converting line of text into JSON
        entities = json_line['Entities']
        if(len(entities)>0):
            for entry in entities:
                entry_data = [entry['BeginOffset'], entry['EndOffset'], entry['Score'], entry['Text'],entry['Type']]
                data.append(entry_data)
        
display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
       )
))

| Start Offset | End Offset | Confidence | Text | Type |
|---|---|---|---|---|
| 9 | 39 | 1.0 | analytical and problem solving | SKILLS |
| 20 | 23 | 1.0 | SQL | SKILLS |
| 2 | 13 | 1.0 | Programming | SKILLS |
| 25 | 26 | 1.0 | C | SKILLS |
| 28 | 31 | 1.0 | C++ | SKILLS |
| 33 | 37 | 1.0 | Java | SKILLS |
| 39 | 42 | 1.0 | PHP | SKILLS |
| 44 | 54 | 1.0 | JavaScript | SKILLS |
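Since the detection job was started with `InputFormat` set to `ONE_DOC_PER_LINE`, each line of the test document is treated as its own document, and the offsets in the results are relative to the start of that line rather than the whole file. The sketch below checks this against the Textract output printed earlier. Which line each detection belongs to is inferred here from the text itself and is an assumption, as is the helper name `check_offsets`.

```python
# Lines copied from the Textract output of the test document.
lines = [
    "- Strong analytical and problem solving skills",
    "- Databases: MySQL, SQL",
    "- Programming Languages: C, C++, Java, PHP, JavaScript",
]

# (assumed line index, begin offset, end offset, detected text) from the results.
detections = [
    (0, 9, 39, "analytical and problem solving"),
    (1, 20, 23, "SQL"),
    (2, 2, 13, "Programming"),
    (2, 25, 26, "C"),
    (2, 28, 31, "C++"),
    (2, 33, 37, "Java"),
    (2, 39, 42, "PHP"),
    (2, 44, 54, "JavaScript"),
]


def check_offsets(lines, detections):
    """Return True for each detection whose offsets slice out the detected text."""
    return [lines[i][begin:end] == text for i, begin, end, text in detections]


print(all(check_offsets(lines, detections)))  # True
```

Slicing each line with `[begin:end]` reproduces every detected string, which shows that the end offset is exclusive.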

