# Amazon Comprehend - Custom Entity Recognition


**Description:** This lab walks you through the steps required to prepare a dataset and submit a custom entity recognizer for Amazon Comprehend.

More information on how to create a custom entity recognizer model can be found here.

   https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html


*Note: This notebook and content has been created using content from the following sources and adapted for this workshop.*  
   - [Amazon Comprehend Custom Workshop](https://github.com/aws-samples/amazon-comprehend-custom-entity)
   - [AWS Blog - Build a custom entity recognizer using Amazon Comprehend](https://aws.amazon.com/blogs/machine-learning/building-a-custom-classifier-using-amazon-comprehend/)


## Setup
Before you start, make sure that your SageMaker Execution Role has the credentials that will be required for this lab.  First, grab the SageMaker Execution Role attached to this session. 

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print('SageMaker Execution Role: ', role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

**Required Credentials**

Open the [IAM Console - Roles](https://console.aws.amazon.com/iam/home?region=us-east-1#/roles), search for the role above, and ensure that the role has the required credentials listed below.

  (1) Comprehend Full Access
  
  (2) SageMaker Full Access
  
  (3) S3 Full Access 
  
  (4) You will need to create and add an inline policy allowing iam:PassRole as shown below:
  
             {
               "Version": "2012-10-17",
               "Statement": [
                {
                 "Action": [
                 "iam:PassRole"
                 ],
                "Effect": "Allow",
                "Resource": "*"
               }
              ]
             }
             
   (5) You will also need to add the following trust policies to your IAM Role:
   
           {
             "Version": "2012-10-17",
             "Statement": [
             {
               "Effect": "Allow",
               "Principal": {
                 "Service": [
                   "sagemaker.amazonaws.com",
                   "s3.amazonaws.com",
                   "comprehend.amazonaws.com"
                 ]
               },
            "Action": "sts:AssumeRole"
             }
            ]
          }
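
If you'd like to sanity-check a trust policy document before attaching it, a small helper along the following lines may help. This is only a sketch: `missing_trusted_services` is a hypothetical helper name, not part of boto3 or IAM, and it only inspects a policy document that you already hold as a Python dict.

```python
import json


def missing_trusted_services(trust_policy, required):
    """Return the required service principals that the policy does not trust."""
    trusted = set()
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        trusted.update(services)
    return sorted(set(required) - trusted)


# Illustrative policy that trusts SageMaker only (an assumption for this example)
example_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": ["sagemaker.amazonaws.com"]},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

missing = missing_trusted_services(
    example_policy,
    ["sagemaker.amazonaws.com", "comprehend.amazonaws.com"],
)
print(missing)
```

An empty list would mean that every service you asked about is already trusted by the policy.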

**Import additional libraries we will be using for the lab...**

In [None]:
import botocore
import re
import numpy as np
import pandas as pd
import matplotlib
import csv
import time
import os
import datetime

comprehend = boto3.client('comprehend')

**Set your S3 bucket and prefix...**

In this case we will be using our default session bucket for simplicity.  This bucket will be used for our model data. 

In [None]:
bucket = sess.default_bucket()
prefix = 'comprehend-custom-entity'
print('S3 Bucket for our model data: ', bucket)

## Download Data

In this example we will be using the following Twitter dataset, which contains tweets to and from companies doing customer support on Twitter: https://www.kaggle.com/thoughtvector/customer-support-on-twitter

**Download the dataset above and save it in the ./data folder on this notebook instance**

*Note: If you don't have an account on Kaggle, you can run the following commands from the notebook terminal to copy the dataset to your notebook instance:* 

   aws s3 cp s3://phi-demo-london/twcs/twcs.zip /home/ec2-user/SageMaker/ml-workshop/data/twcs.zip

   cd /home/ec2-user/SageMaker/ml-workshop/data

   unzip twcs.zip

**Let's explore our data a bit by loading it into a DataFrame...**

In [None]:
tweets = pd.read_csv('./data/twcs.csv',encoding='utf-8')
print(tweets.shape)
tweets.head()

The schema for the dataset above includes: 

  (1) **tweet_id:** Unique ID for this tweet
  
  (2) **author_id:** Unique ID for this tweet author (anonymized for non-company users)
  
  (3) **inbound:** Indicates whether the tweet was sent inbound to a company
  
  (4) **created_at:** When the tweet was created
  
  (5) **text:** Text content of the tweet
  
  (6) **response_tweet_id:** The unique ID of the tweet that responded to this tweet
  
  (7) **in_response_to_tweet_id:** The tweet this tweet was in response to
  

### Data Wrangling

This is a very interesting tweet data set, about 3 million tweets, and we have information on the author of the tweets and whether the tweet was a query or a response (the "inbound" column). If the tweet was a query, the response_tweet_id gives the response made by the support team.

It would be interesting to modify this dataframe to get query - response pairs in every row.
The following code, to do just what we want, was pulled from [this kernel](https://www.kaggle.com/soaxelbrooke/first-inbound-and-response-tweets)

In [None]:
first_inbound = tweets[pd.isnull(tweets.in_response_to_tweet_id) & tweets.inbound]

QnR = pd.merge(first_inbound, tweets, left_on='tweet_id', 
                                  right_on='in_response_to_tweet_id')

# Filter to only outbound replies (from companies)
QnR = QnR[~QnR.inbound_y]
print(f'Data shape: {QnR.shape}')
QnR.head()
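
To make the merge above concrete, here is the same pairing logic applied to a few made-up rows, written with plain dictionaries rather than pandas. The rows and the `pair_first_inbound_with_replies` name are illustrative assumptions; they are not part of the dataset or of the original kernel.

```python
def pair_first_inbound_with_replies(rows):
    """Pair each conversation-opening customer tweet with the company replies to it."""
    # Customer tweets that start a conversation: inbound and not a reply to anything
    first_inbound = {
        row["tweet_id"]: row
        for row in rows
        if row["inbound"] and row["in_response_to_tweet_id"] is None
    }
    pairs = []
    for row in rows:
        parent = first_inbound.get(row["in_response_to_tweet_id"])
        # Keep only outbound replies (from companies) to those first tweets
        if parent is not None and not row["inbound"]:
            pairs.append((parent["text"], row["text"]))
    return pairs


rows = [
    {"tweet_id": 1, "inbound": True, "in_response_to_tweet_id": None,
     "text": "My phone won't charge"},
    {"tweet_id": 2, "inbound": False, "in_response_to_tweet_id": 1,
     "text": "Sorry to hear that, please DM us"},
    {"tweet_id": 3, "inbound": True, "in_response_to_tweet_id": 2,
     "text": "Done, thanks"},
]

pairs = pair_first_inbound_with_replies(rows)
print(pairs)
```

The follow-up customer tweet (id 3) is dropped because it is neither a conversation opener nor a company reply, which mirrors what the filter on `inbound_y` does.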

In [None]:
#Let's filter the dataframe so that it contains only the needed columns
QnR = QnR[["author_id_x","created_at_x","text_x","author_id_y","created_at_y","text_y"]]
QnR.head(5)

## Filter to only telco tweets
In our example, we want to create a custom entity to recognize smartphone devices. Let's filter our dataframe to only include the T-Mobile and Sprint tweets.

In [None]:
# Take a copy so that adding a column below does not trigger SettingWithCopyWarning
tweet_telco = QnR[QnR["author_id_y"].isin(["TMobileHelp", "sprintcare"])].copy()

Let's concatenate the question and response into one column.

In [None]:
tweet_telco['text'] = tweet_telco['text_x'] + ' | ' + tweet_telco['text_y']

Let's save our telco tweets as a csv file.

In [None]:

tweet_telco['text'].to_csv('./data/tweet_telco.csv', encoding='utf-8', index=False)
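
Tweets can contain line breaks, and the detection job later in this notebook reads its input as one document per line, so it may be worth collapsing whitespace before writing the file. The helper below is a sketch of that idea; `to_single_line` is a hypothetical name, and this step is not part of the original workshop code.

```python
import re


def to_single_line(text):
    """Collapse any run of whitespace, including newlines, into a single space."""
    return re.sub(r"\s+", " ", text).strip()


cleaned = to_single_line("My iPhone X\nkeeps   dropping calls\r\n")
print(cleaned)
```

With pandas, this could be applied as `tweet_telco['text'].map(to_single_line)` before calling `to_csv`.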


## Entity list
In order to create our dataset we need to provide an entity list for our new class named DEVICE.

For now, in order to create our entity list, we will generate keywords of different smartphones manually. The list includes unique entities that have at least 1000 matches in our training dataset.

*Note: In the interest of time, we are only executing one notebook that is part of a larger workshop for [Comprehend Custom](https://github.com/aws-samples/amazon-comprehend-custom-entity).  We'd encourage those interested in exploring Comprehend Custom further to check out the [second notebook](https://github.com/aws-samples/amazon-comprehend-custom-entity/blob/master/2-BlazingText-Word2Vec-Telco-tweets.ipynb) in that workshop where we load a corpus into a word2vec model and generate a list of keywords that are contextually similar. This technique will be used in the custom classifier in the third notebook. The same technique could alternatively be applied here.*

In [None]:
sphones = ['iPhone X', 'iPhoneX', 'iphoneX', 'Samsung Galaxy', 'Samsung Note', 'iphone', 'iPhone', 'android', 'Android']

df_entity_list = pd.DataFrame(sphones, columns=['Text'])
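
One way to check how often each candidate keyword appears is to count the documents that contain it. The sketch below does this on a few made-up tweets. `count_matches` is a hypothetical helper, and it uses a simple case-sensitive substring test, which may differ from how the list above was actually vetted.

```python
def count_matches(documents, keywords):
    """Count, for each keyword, how many documents contain it (case-sensitive)."""
    return {kw: sum(kw in doc for doc in documents) for kw in keywords}


# Made-up sample documents, for illustration only
sample_docs = [
    "My iPhone X won't turn on",
    "Is the Samsung Galaxy on sale?",
    "iPhone battery drains fast",
]

match_counts = count_matches(
    sample_docs, ["iPhone X", "iPhone", "Samsung Galaxy", "Android"]
)
print(match_counts)
```

Keywords whose count falls below your chosen threshold could then be dropped from the list before training.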


Let's add another column with our class label. This is a required part of the Amazon Comprehend training dataset.

More information can be found here.

https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html


In [None]:
df_entity_list['Type'] = 'DEVICE'


In [None]:
df_entity_list.head(10)
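
For reference, the entity list is a CSV file with a `Text,Type` header row followed by one entity per line. The sketch below renders that layout with the standard `csv` module instead of pandas; `entity_list_csv` is a hypothetical helper name used only for illustration.

```python
import csv
import io


def entity_list_csv(entities, entity_type):
    """Render an entity list as CSV text with a Text,Type header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    for text in entities:
        writer.writerow([text, entity_type])
    return buffer.getvalue()


csv_text = entity_list_csv(["iPhone X", "Samsung Galaxy"], "DEVICE")
print(csv_text)
```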

**Let's create our training, entity list, and test files and upload them to S3...**

In [None]:
import os

training_file = './data/telco_train.csv'
tweet_telco['text'].to_csv(training_file, encoding='utf-8', index=False)

entity_file = './data/entity_list.csv'
df_entity_list.to_csv(entity_file, encoding='utf-8', index=False)

test_file = './data/telco_device_test.csv'
tweet_telco['text'].tail(10000).to_csv(test_file, encoding='utf-8', index=False)


In [None]:
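def upload_to_s3(s3path, file):
    """Upload a local file to the lab bucket under the given key."""
    s3 = boto3.resource('s3')
    # Use a context manager so the file handle is closed after the upload
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=s3path, Body=data)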

s3_train_key = prefix + "/train/telco_train.csv" 
s3_test_key = prefix + "/test/telco_device_test.csv"
s3_entity_key = prefix + "/entity/telco_entity.csv"

upload_to_s3(s3_train_key, training_file)
upload_to_s3(s3_test_key, test_file)
upload_to_s3(s3_entity_key, entity_file)

In [None]:
#Create s3 paths variable 
s3_train_data = 's3://{}/{}'.format(bucket, s3_train_key)
s3_train_entity = 's3://{}/{}'.format(bucket, s3_entity_key)
s3_test_data = 's3://{}/{}'.format(bucket, s3_test_key)
s3_output_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, "telco_test_output.json")
print('uploaded training data location: {}'.format(s3_train_data))


## Training
Let's prepare the Custom Entity training job request file.

In [None]:
custom_entity_request = {

      "Documents": { 
         "S3Uri": s3_train_data
      },
      "EntityList": { 
         "S3Uri": s3_train_entity
      },
      "EntityTypes": [ 
         { 
            "Type": "DEVICE"
         }
      ]
   
}

In [None]:

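# Use the current epoch time as a suffix so the recognizer name is unique.
# (strftime("%s") is platform-specific, and `id` shadows a Python builtin.)
recognizer_suffix = str(int(time.time()))
create_custom_entity_response = comprehend.create_entity_recognizer(
        RecognizerName = "custom-device-recognizer" + recognizer_suffix, 
        DataAccessRoleArn = role,
        InputDataConfig = custom_entity_request,
        LanguageCode = "en"
)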

In [None]:
jobArn = create_custom_entity_response['EntityRecognizerArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_custom_recognizer = comprehend.describe_entity_recognizer(
        EntityRecognizerArn = jobArn
    )
    status = describe_custom_recognizer["EntityRecognizerProperties"]["Status"]
    print("Custom entity recognizer: {}".format(status))
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)
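
The polling pattern above can be factored into a small reusable helper. The version below is a sketch under stated assumptions: `wait_for_status` is a hypothetical name, and the status source is passed in as a callable so the example can run with a stand-in instead of a real Comprehend call.

```python
import time


def wait_for_status(get_status, terminal_states, interval_seconds=60,
                    timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.time() + timeout_seconds
    status = get_status()
    while status not in terminal_states and time.time() < deadline:
        sleep(interval_seconds)
        status = get_status()
    return status


# Stand-in for a describe call: yields a fixed sequence of statuses
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    {"TRAINED", "IN_ERROR"},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final_status)
```

In the notebook, `get_status` could wrap the `describe_entity_recognizer` call and return the `Status` field. Note that the helper returns the last status seen, so a timeout still has to be detected by checking that value.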

## Evaluation

You can see the different metrics for our custom entity recognizer. Amazon Comprehend provides you with metrics to help you estimate how well an entity recognizer should work for your job. They are based on training the recognizer model, and so while they accurately represent the performance of the model during training, they are only an approximation of the API performance during entity discovery.

More information can be found here: https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html

In [None]:
print(json.dumps(describe_custom_recognizer["EntityRecognizerProperties"]["RecognizerMetadata"]["EntityTypes"], indent=2, default=str))

Looking at our output above we can evaluate our model for common metrics: 

 (1) **Precision:** This indicates how many of the model's entity identifications are correct, compared to the total number of identifications it attempted. It is reported as a percentage of the total number of identifications.
 
 (2) **Recall:** This indicates how many times the model makes a correct entity identification compared to the number of times the entity is actually present, that is, the total of correct identifications (true positives, tp) and missed identifications (false negatives, fn).
 
 (3) **F1:** This is a combination of the Precision and Recall metrics, which measures the overall accuracy of the model for custom entity recognition. The F1 score is the harmonic mean of the Precision and Recall metrics.
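
As a worked illustration of these three definitions, the sketch below computes them from raw counts of true positives (tp), false positives (fp, incorrect identifications), and false negatives (fn). The counts are made up and `precision_recall_f1` is a hypothetical helper; Amazon Comprehend reports these metrics for you, so this is only meant to show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 90 correct identifications, 10 incorrect, 30 missed
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```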

## Testing our custom entity model

Let's invoke the Comprehend API to run our test job from the test file we prepared earlier.

In [None]:
test_response = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_data,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_output_test_data
    },
    DataAccessRoleArn=role,
    JobName='Custom_Device_Test',
    EntityRecognizerArn=jobArn,
    LanguageCode='en'
)

Let's monitor the job.

In [None]:
jobId = test_response['JobId']
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_job = comprehend.describe_entities_detection_job(
        JobId = jobId
    )
    status = describe_job["EntitiesDetectionJobProperties"]["JobStatus"]
    print("Job Status: {}".format(status))
    
    if status == "COMPLETED" or status == "FAILED":
        break
        
    time.sleep(60)

In [None]:
#Download the test output to local machine
from urllib.parse import urlparse

job_output = describe_job["EntitiesDetectionJobProperties"]["OutputDataConfig"]["S3Uri"]
# The object key is the path part of the S3 URI without its leading slash
job_key = urlparse(job_output).path.lstrip('/')

s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(job_key, 'output.tar.gz')


In [None]:
!tar xvzf output.tar.gz
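
If you'd rather not shell out to `tar`, the standard library's `tarfile` module can do the same extraction. The sketch below is self-contained: it builds a tiny stand-in archive in a temporary directory and then extracts it. The `extract_archive` name and the stand-in file contents are assumptions made for the example.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in archive containing a single file named "output"
    source = os.path.join(workdir, "output")
    with open(source, "w") as f:
        f.write('{"Entities": []}\n')
    archive_path = os.path.join(workdir, "output.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source, arcname="output")

    extracted_dir = os.path.join(workdir, "extracted")
    os.makedirs(extracted_dir)
    members = extract_archive(archive_path, extracted_dir)
    extracted_ok = os.path.exists(os.path.join(extracted_dir, "output"))

print(members, extracted_ok)
```

In the notebook, `extract_archive('output.tar.gz', '.')` would be the rough equivalent of the shell command above.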

In [None]:
#Load all the Entities values in a list
import json

data = []
with open('output', 'r') as f:
    for line in f:
        if not line.strip():
            continue
        # Each line holds the JSON result for one document; collect every entity in it
        entities = json.loads(line).get('Entities') or []
        data.extend(entity['Text'] for entity in entities)


# function to print the unique values
def unique(list1):
    # a set drops duplicates; sorting keeps the printed order stable
    for x in sorted(set(list1)):
        print(x)

unique(data)
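
Beyond listing unique values, it can be useful to see how often each spelling was recognized. The sketch below counts entity texts with `collections.Counter`. The sample lines are made up and only mimic the fields the code above reads (`Entities` and `Text`); `count_entity_texts` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_entity_texts(lines):
    """Count how many times each entity text appears across JSON result lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for entity in json.loads(line).get("Entities") or []:
            counts[entity["Text"]] += 1
    return counts


# Made-up result lines, for illustration only
sample_lines = [
    '{"Entities": [{"Text": "iPhone X", "Type": "DEVICE"}], "Line": 0}',
    '{"Entities": [], "Line": 1}',
    '{"Entities": [{"Text": "iphone", "Type": "DEVICE"}, '
    '{"Text": "iPhone X", "Type": "DEVICE"}], "Line": 2}',
]

entity_counts = count_entity_texts(sample_lines)
print(entity_counts.most_common())
```

In the notebook, you could pass the open `output` file object in place of `sample_lines`.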


Let's compare the list of entities recognized above with the manual entity list we created and used as input to our training...

In [None]:
df_entity_list.head(10)

Looking at the results above, we can see that Comprehend Custom Entity Recognition recognized entities from the list we created for training. You'll also notice that Amazon Comprehend has picked up additional words with varying spellings, which can be expected when analyzing data that contains typos or abbreviated spellings.
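
To make that comparison explicit, you could split the recognized entities into those that were already in the training list and those that are new. The sketch below uses set operations on made-up values. `split_by_seed_list` is a hypothetical helper, and the sample strings are not actual output from this lab.

```python
def split_by_seed_list(recognized, seed_entities):
    """Split recognized entity texts into those in the seed list and those outside it."""
    seed = set(seed_entities)
    found = set(recognized)
    return sorted(found & seed), sorted(found - seed)


# Made-up values, for illustration only
known, new = split_by_seed_list(
    ["iPhone X", "iphone 8", "Galaxy S8", "iPhone"],
    ["iPhone X", "iPhone", "Samsung Galaxy"],
)
print("In the seed list:", known)
print("Not in the seed list:", new)
```

In the notebook, `recognized` would be the `data` list and `seed_entities` would be `df_entity_list['Text']`.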

### CONGRATULATIONS! 
You've successfully created a Custom Entity Recognizer using Amazon Comprehend.