# AWS Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. You can also use AutoML capabilities in Amazon Comprehend to build a custom set of entities or text classification models that are tailored uniquely to your organization's needs.

## Configuring Credentials

There are two types of configuration data in boto3: credentials and non-credentials. Credentials include items such as aws_access_key_id, aws_secret_access_key, and aws_session_token. Non-credential configuration includes items such as which region to use or which addressing style to use for Amazon S3. The distinction between credentials and non-credentials configuration is important because the lookup process is slightly different. Boto3 will look in several additional locations when searching for credentials that do not apply when searching for non-credential configuration.

Boto3 searches an ordered list of possible locations and stops as soon as it finds credentials. The search order is:

* Passing credentials as parameters in the **boto3.client()** method
* Passing credentials as parameters when creating a **Session** object
* Environment variables
* Shared credential file (~/.aws/credentials)
* AWS config file (~/.aws/config)
* Assume Role provider
* Boto2 config file (/etc/boto.cfg and ~/.boto)
* Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
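
The "first match wins" lookup described above can be sketched with the standard library alone. This is only an illustration of the idea, not botocore's actual implementation: the function names `from_environment`, `from_shared_file`, and `resolve_credentials` are hypothetical, and only two of the locations from the list are modelled.

```python
import configparser
import os
from pathlib import Path


def from_environment():
    """Return credentials from the standard AWS environment variables, if set."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"access_key": key, "secret_key": secret, "source": "env"}
    return None


def from_shared_file(path=Path.home() / ".aws" / "credentials", profile="default"):
    """Return credentials from the shared credentials file, if it has the profile."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently skipped
    if parser.has_section(profile):
        key = parser[profile].get("aws_access_key_id")
        secret = parser[profile].get("aws_secret_access_key")
        if key and secret:
            return {"access_key": key, "secret_key": secret, "source": "shared-file"}
    return None


def resolve_credentials(providers):
    """Try each provider in order and stop at the first one that returns credentials."""
    for provider in providers:
        creds = provider()
        if creds is not None:
            return creds
    return None


# Hypothetical usage: environment variables are consulted before the shared file.
creds = resolve_credentials([from_environment, from_shared_file])
print(creds["source"] if creds else "no credentials found")
```

Because the search stops at the first hit, credentials set in environment variables hide anything stored in `~/.aws/credentials`, which is a common source of confusion when switching accounts.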

## Examples to initialize boto3

```python
import boto3
client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)

# Or via the Session
session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
```

In [11]:
# Get Packages
from __future__ import print_function
import time
import boto3
from botocore.exceptions import ClientError
from botocore import exceptions
import pandas as pd
import json
import requests

In [12]:
# read credentials from file
with open('aws_credentials.json') as json_file:
    creds = json.load(json_file)

In [13]:
# Create the S3 client and resource objects
# using the credentials loaded from the JSON file above
s3_client = boto3.client('s3',
    aws_access_key_id=creds['AWSAccessKeyId'],
    aws_secret_access_key=creds['AWSSecretKey'])
s3 = boto3.resource('s3',
    aws_access_key_id=creds['AWSAccessKeyId'],
    aws_secret_access_key=creds['AWSSecretKey'])
for bucket in s3.buckets.all():
    print(bucket.name)

scion-dev


In [14]:
# Get comprehend session
comprehend = boto3.client(service_name='comprehend', region_name='REGION_CODE',
                          aws_access_key_id=creds['AWSAccessKeyId'],
                          aws_secret_access_key=creds['AWSSecretKey'])

## Comprehend IAM Role

In [16]:
data_access_role_arn = "arn:aws:iam::Account_ID:role/service-role/AmazonComprehendServiceRole-Comprehend_Role"

### DetectDominantLanguage

In [5]:
text = "It is raining today in Seattle"

print('Calling DetectDominantLanguage')
print(json.dumps(comprehend.detect_dominant_language(Text = text), sort_keys=True, indent=4))
print("End of DetectDominantLanguage\n")

Calling DetectDominantLanguage
{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.9958581924438477
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "64",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Jul 2020 06:39:45 GMT",
            "x-amzn-requestid": "c36543af-f42b-4ca7-99c7-20c951aa4496"
        },
        "HTTPStatusCode": 200,
        "RequestId": "c36543af-f42b-4ca7-99c7-20c951aa4496",
        "RetryAttempts": 0
    }
}
End of DetectDominantLanguage
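
In practice only the top-scoring language is usually needed rather than the whole response. A minimal sketch, assuming the response has the shape shown above; `dominant_language` is a hypothetical helper name, not part of boto3:

```python
def dominant_language(response):
    """Return (language_code, score) for the highest-scoring language, or None."""
    languages = response.get("Languages", [])
    if not languages:
        return None
    best = max(languages, key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]


# Trimmed copy of the response printed above
sample_response = {
    "Languages": [{"LanguageCode": "en", "Score": 0.9958581924438477}]
}
print(dominant_language(sample_response))
```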



### Detecting Named Entities

In [6]:
text = "It is raining today in Seattle"

print('Calling DetectEntities')
print(json.dumps(comprehend.detect_entities(Text=text, LanguageCode='en'), sort_keys=True, indent=4))
print('End of DetectEntities\n')

Calling DetectEntities
{
    "Entities": [
        {
            "BeginOffset": 14,
            "EndOffset": 19,
            "Score": 0.9999421834945679,
            "Text": "today",
            "Type": "DATE"
        },
        {
            "BeginOffset": 23,
            "EndOffset": 30,
            "Score": 0.999826967716217,
            "Text": "Seattle",
            "Type": "LOCATION"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "199",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Jul 2020 06:39:46 GMT",
            "x-amzn-requestid": "cb5b1f0f-2fc5-408d-90d0-d6b60dcbfc57"
        },
        "HTTPStatusCode": 200,
        "RequestId": "cb5b1f0f-2fc5-408d-90d0-d6b60dcbfc57",
        "RetryAttempts": 0
    }
}
End of DetectEntities
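
The entity list is often easier to work with once it is grouped by type. The sketch below assumes the response shape shown above; `entities_by_type` and its `min_score` threshold are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def entities_by_type(response, min_score=0.0):
    """Group entity texts by their type, skipping entities below min_score."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Trimmed copy of the response printed above
sample_response = {
    "Entities": [
        {"BeginOffset": 14, "EndOffset": 19, "Score": 0.9999421834945679,
         "Text": "today", "Type": "DATE"},
        {"BeginOffset": 23, "EndOffset": 30, "Score": 0.999826967716217,
         "Text": "Seattle", "Type": "LOCATION"},
    ]
}
print(entities_by_type(sample_response, min_score=0.9))
```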



### Detecting Key Phrases

In [7]:
text = "It is raining today in Seattle"

print('Calling DetectKeyPhrases')
print(json.dumps(comprehend.detect_key_phrases(Text=text, LanguageCode='en'), sort_keys=True, indent=4))
print('End of DetectKeyPhrases\n')

Calling DetectKeyPhrases
{
    "KeyPhrases": [
        {
            "BeginOffset": 14,
            "EndOffset": 19,
            "Score": 1.0,
            "Text": "today"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "77",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Jul 2020 06:39:46 GMT",
            "x-amzn-requestid": "ccda5d22-8a62-4294-b76b-0a7e24ee78ec"
        },
        "HTTPStatusCode": 200,
        "RequestId": "ccda5d22-8a62-4294-b76b-0a7e24ee78ec",
        "RetryAttempts": 0
    }
}
End of DetectKeyPhrases



### Detecting Sentiment 

In [8]:
text = "It is raining today in Seattle"

print('Calling DetectSentiment')
print(json.dumps(comprehend.detect_sentiment(Text=text, LanguageCode='en'), sort_keys=True, indent=4))
print('End of DetectSentiment\n')

Calling DetectSentiment
{
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "161",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Jul 2020 06:39:48 GMT",
            "x-amzn-requestid": "ba74d61c-15ef-4ec7-82f3-0d855cce6616"
        },
        "HTTPStatusCode": 200,
        "RequestId": "ba74d61c-15ef-4ec7-82f3-0d855cce6616",
        "RetryAttempts": 0
    },
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Mixed": 0.00021913634554948658,
        "Negative": 0.162128284573555,
        "Neutral": 0.7376415133476257,
        "Positive": 0.10001111775636673
    }
}
End of DetectSentiment
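
The `Sentiment` label is upper case while the keys of `SentimentScore` are capitalized, so looking up the confidence for the predicted label needs a small conversion. A minimal sketch, assuming the response shape shown above; `summarize_sentiment` is a hypothetical helper name.

```python
def summarize_sentiment(response):
    """Return (label, confidence) where confidence is the score of the predicted label."""
    label = response["Sentiment"]            # e.g. "NEUTRAL"
    scores = response["SentimentScore"]      # keys such as "Neutral"
    return label, scores[label.capitalize()]


# Trimmed copy of the response printed above
sample_response = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Mixed": 0.00021913634554948658,
        "Negative": 0.162128284573555,
        "Neutral": 0.7376415133476257,
        "Positive": 0.10001111775636673,
    },
}
print(summarize_sentiment(sample_response))
```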



### Detecting Parts of Speech

In [9]:
text = "It is raining today in Seattle"
 
print('Calling DetectSyntax')
print(json.dumps(comprehend.detect_syntax(Text=text, LanguageCode='en'), sort_keys=True, indent=4))
print('End of DetectSyntax\n')

Calling DetectSyntax
{
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "714",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Jul 2020 06:39:48 GMT",
            "x-amzn-requestid": "945d6348-20ae-459c-86e7-db9e942899fc"
        },
        "HTTPStatusCode": 200,
        "RequestId": "945d6348-20ae-459c-86e7-db9e942899fc",
        "RetryAttempts": 0
    },
    "SyntaxTokens": [
        {
            "BeginOffset": 0,
            "EndOffset": 2,
            "PartOfSpeech": {
                "Score": 0.9999788999557495,
                "Tag": "PRON"
            },
            "Text": "It",
            "TokenId": 1
        },
        {
            "BeginOffset": 3,
            "EndOffset": 5,
            "PartOfSpeech": {
                "Score": 0.9020146131515503,
                "Tag": "AUX"
            },
            "Text": "is",
            "TokenId": 2
        },
        {
            "BeginOffset": 6,
         

# Working with Custom Jobs and Inference 

# Topic Detection Job

In [11]:
# start_topics_detection_job 

input_s3_url = "s3://BUCKET_NAME/input_data/topics_data.csv"
input_doc_format = "ONE_DOC_PER_FILE"
output_s3_url = "s3://BUCKET_NAME/job_output"

number_of_topics = 10
 
input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}
 
start_topics_detection_job_result = comprehend.start_topics_detection_job(JobName="Topics",NumberOfTopics=number_of_topics,
                                                                              InputDataConfig=input_data_config,
                                                                              OutputDataConfig=output_data_config,
                                                                              DataAccessRoleArn=data_access_role_arn)
 
print('start_topics_detection_job_result: ' + json.dumps(start_topics_detection_job_result))

start_topics_detection_job_result: {"JobId": "216a5699a96627feb93dad9ab9c3bd46", "JobStatus": "SUBMITTED", "ResponseMetadata": {"RequestId": "543ec08b-1b96-4e6a-a95f-e4f38afff25f", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "543ec08b-1b96-4e6a-a95f-e4f38afff25f", "content-type": "application/x-amz-json-1.1", "content-length": "68", "date": "Thu, 02 Jul 2020 06:41:38 GMT"}, "RetryAttempts": 0}}


In [12]:
job_id = start_topics_detection_job_result["JobId"]
 
print('job_id: ' + job_id)

job_id: 216a5699a96627feb93dad9ab9c3bd46


In [11]:
# from bson import json_util
from pprint import pprint
import time

while True:
    describe_topics_detection_job_result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_properties = describe_topics_detection_job_result['TopicsDetectionJobProperties']
    if job_properties['JobStatus'] == "COMPLETED":
        break
    # Stop polling if the job can no longer complete, otherwise the loop never ends
    if job_properties['JobStatus'] in ("FAILED", "STOP_REQUESTED", "STOPPED"):
        raise RuntimeError("Topic detection job ended with status " + job_properties['JobStatus'])
    print("Not ready yet...")
    time.sleep(5)

# datetime objects are not JSON serializable, so convert them to strings first
job_properties['SubmitTime'] = job_properties['SubmitTime'].strftime('%m/%d/%Y %H:%M:%S.%f')
job_properties['EndTime'] = job_properties['EndTime'].strftime('%m/%d/%Y %H:%M:%S.%f')

print('describe_topics_detection_job_result: ' + json.dumps(describe_topics_detection_job_result))
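
The same polling pattern recurs for every asynchronous Comprehend job, so it can be factored into one helper. The sketch below is an assumption-laden illustration rather than an AWS-provided utility: `wait_for_status` and all of its parameters are hypothetical names, and the status fetcher is passed in as a callable so the helper does not depend on boto3.

```python
import time


def wait_for_status(fetch_status, success, failure,
                    interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_status() until it returns a success or failure status.

    Raises RuntimeError on a failure status and TimeoutError when the
    timeout (in seconds) elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in success:
            return status
        if status in failure:
            raise RuntimeError(f"Job ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Gave up after {timeout} seconds; last status was {status}")
        sleep(interval)


# Hypothetical usage with the client and job_id defined earlier:
# wait_for_status(
#     lambda: comprehend.describe_topics_detection_job(JobId=job_id)
#         ['TopicsDetectionJobProperties']['JobStatus'],
#     success={"COMPLETED"},
#     failure={"FAILED", "STOP_REQUESTED", "STOPPED"},
# )
```

Passing the `sleep` function as a parameter keeps the helper easy to exercise without real waiting.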

### Read Output Data

In [41]:
output_data_location = describe_topics_detection_job_result['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
pprint("Output data location:"+output_data_location)

('Output data '
 'location:s3://scion-dev/job_output/952449798557-TOPICS-216a5699a96627feb93dad9ab9c3bd46/output/output.tar.gz')


In [57]:
# Get the data from s3
bucket_path = f"job_output/{'/'.join(output_data_location.split('/')[4:])}"
with open(r'C:\Users\USER\output\topics.tar.gz', 'wb') as f:
    s3_client.download_fileobj("scion-dev", bucket_path, f)
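
Slicing the URI by position works for this bucket layout but breaks if the output prefix changes. A more general sketch splits any `s3://` URI into bucket and key with the standard library; `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# The output location printed earlier in this notebook
uri = ("s3://scion-dev/job_output/"
       "952449798557-TOPICS-216a5699a96627feb93dad9ab9c3bd46/output/output.tar.gz")
bucket, key = split_s3_uri(uri)
print(bucket)
print(key)
```

The resulting pair can be passed to `s3_client.download_fileobj(bucket, key, f)` in place of the hard-coded bucket name and hand-built path.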

In [58]:
# Extract downloaded file
import shutil
shutil.unpack_archive(r'C:\Users\USER\output\topics.tar.gz', 
                      r'C:\Users\USER\output\topics')

In [1]:
# List files from extracted folder
import glob

# Sort so the order is deterministic: doc-topics.csv first, then topic-terms.csv
extracted_files = sorted(glob.glob(r'C:\Users\USER\output\topics\*'))
print(extracted_files)

In [64]:
# Load the topic detection results
import pandas as pd

doc_topics_df = pd.read_csv(extracted_files[0])
topic_terms_df = pd.read_csv(extracted_files[1])

#### Number of Topics from Document

In [69]:
doc_topics_df.head(10)

| | docname | topic | proportion |
|---|---|---|---|
| 0 | topics_data.csv | 0 | 0.379628 |
| 1 | topics_data.csv | 1 | 0.225014 |
| 2 | topics_data.csv | 2 | 0.201156 |
| 3 | topics_data.csv | 3 | 0.161101 |
| 4 | topics_data.csv | 4 | 0.033101 |


#### Topics from Document

In [68]:
topic_terms_df.head(50)

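| | topic | term | weight |
|---|---|---|---|
| 0 | 0 | patchy | 0.000484 |
| 1 | 0 | alan | 0.000484 |
| 2 | 0 | editorial | 0.000484 |
| 3 | 0 | torture | 0.000484 |
| 4 | 0 | robbins | 0.000484 |
| 5 | 0 | mirco | 0.000484 |
| 6 | 0 | fair | 0.000484 |
| 7 | 0 | controller | 0.000484 |
| 8 | 0 | phase | 0.000484 |
| 9 | 0 | allot | 0.000484 |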
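
To read the topics at a glance, the terms can be grouped per topic and ordered by weight. The sketch below uses only the standard library and assumes the `topic`, `term`, and `weight` columns shown above; `top_terms` is a hypothetical helper name, and the weights in the sample text are made up so that the ordering is visible.

```python
import csv
import io
from collections import defaultdict


def top_terms(csv_text, n=5):
    """Return {topic: [term, ...]} with up to n terms per topic, heaviest first."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_topic[int(row["topic"])].append((row["term"], float(row["weight"])))
    return {
        topic: [term for term, _ in sorted(terms, key=lambda tw: tw[1], reverse=True)[:n]]
        for topic, terms in by_topic.items()
    }


# Illustrative data only; real input would be the contents of the topic terms file
sample_csv = "topic,term,weight\n0,patchy,0.3\n0,alan,0.5\n1,fair,0.2\n"
print(top_terms(sample_csv, n=2))
```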


## Listing existing TopicDetection Jobs

In [38]:
list_topics_detection_jobs = comprehend.list_topics_detection_jobs()

In [10]:
list_topics_detection_jobs

# Custom Classification 

In [30]:
# train data
import pandas as pd

pd.read_csv(r"C:\Users\USER\bbc-text\bbc-text.csv").head()

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


##### Training data format
* Headers should be removed.
* The class name is placed first, followed by the complete document.
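
A file in this format can be produced with the standard `csv` module. This is a sketch under stated assumptions: `write_training_csv` is a hypothetical helper, and collapsing whitespace so that each document stays on one line is an added precaution, not something the notes above require.

```python
import csv
import io


def write_training_csv(rows, handle):
    """Write (label, text) pairs as a headerless CSV: label first, then the document."""
    writer = csv.writer(handle)
    for label, text in rows:
        # Collapse newlines and repeated spaces so each document occupies one line
        writer.writerow([label, " ".join(text.split())])


# Illustrative rows; real input would come from the labelled dataset
buffer = io.StringIO()
write_training_csv(
    [("tech", "tv future in the hands\nof viewers"),
     ("sport", "tigers wary of farrell, gamble")],
    buffer,
)
print(buffer.getvalue())
```

Fields that contain a comma are quoted automatically by `csv.writer`, which keeps the two-column layout intact.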

## Sample MultiClass classification with custom models

### Train model using Custom Document Classifier

In [9]:
# Create a document classifier
create_response = comprehend.create_document_classifier(
    InputDataConfig={
        'S3Uri': 's3://BUCKET_NAME/input_data/train.csv'
    },
    DataAccessRoleArn=data_access_role_arn,
    DocumentClassifierName='SampleCodeClassifier1',
    LanguageCode='en'
)
print(f"Create response: {create_response}\n")

In [8]:
describe_response = comprehend.describe_document_classifier(
    DocumentClassifierArn=create_response['DocumentClassifierArn'])
describe_response

In [7]:
# Check the status of the classifier

from pprint import pprint
import time

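# 'Status': 'SUBMITTED'|'TRAINING'|'DELETING'|'STOP_REQUESTED'|'STOPPED'|'IN_ERROR'|'TRAINED'
while True:
    describe_response = comprehend.describe_document_classifier(
        DocumentClassifierArn=create_response['DocumentClassifierArn'])
    status = describe_response['DocumentClassifierProperties']['Status']
    if status == "TRAINED":
        break
    # Stop polling if training can no longer succeed, otherwise the loop never ends
    if status in ("IN_ERROR", "STOP_REQUESTED", "STOPPED", "DELETING"):
        raise RuntimeError(f"Classifier training ended with status {status}")
    print("Not ready yet...")
    time.sleep(300)

print(f"Describe response: {describe_response}\n")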

In [6]:
# List all classifiers in account
list_response = comprehend.list_document_classifiers()
print(f"List response: {list_response}\n")

### Start a classifier Job to predict labels using created custom classifier

In [34]:
start_response = comprehend.start_document_classification_job(
    InputDataConfig={
        'S3Uri': 's3://BUCKET_NAME/input_data/test.csv',
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': 's3://BUCKET_NAME/job_output'
    },
    DataAccessRoleArn=data_access_role_arn,
    DocumentClassifierArn=
    'arn:aws:comprehend:ap-south-1:ACCOUNT_ID:document-classifier/SampleCodeClassifier1'
)

print(f"Start response: {start_response}\n")

Start response: {'JobId': 'ffa665fbe20c3d0a4340fa42ce0224ec', 'JobStatus': 'SUBMITTED', 'ResponseMetadata': {'RequestId': '7efb3e2f-806d-465e-bb5c-1ec8563ac705', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7efb3e2f-806d-465e-bb5c-1ec8563ac705', 'content-type': 'application/x-amz-json-1.1', 'content-length': '68', 'date': 'Thu, 02 Jul 2020 09:38:48 GMT'}, 'RetryAttempts': 0}}


In [5]:
# from bson import json_util
from pprint import pprint
import time

# Check the status of the job
while True:
    describe_response = comprehend.describe_document_classification_job(JobId=start_response['JobId'])
    job_properties = describe_response['DocumentClassificationJobProperties']
    if job_properties['JobStatus'] == "COMPLETED":
        break
    # Stop polling if the job can no longer complete, otherwise the loop never ends
    if job_properties['JobStatus'] in ("FAILED", "STOP_REQUESTED", "STOPPED"):
        raise RuntimeError("Classification job ended with status " + job_properties['JobStatus'])
    print("Not ready yet...")
    time.sleep(200)

# datetime objects are not JSON serializable, so convert them to strings first
job_properties['SubmitTime'] = job_properties['SubmitTime'].strftime('%m/%d/%Y %H:%M:%S.%f')
job_properties['EndTime'] = job_properties['EndTime'].strftime('%m/%d/%Y %H:%M:%S.%f')

print('describe_document_classification_job_result: ' + json.dumps(describe_response))

## Read Output Data

In [4]:
classifier_output_data_location = describe_response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']
pprint("Output data location:"+classifier_output_data_location)

In [40]:
# Get the data from s3
bucket_path = f"job_output/{'/'.join(classifier_output_data_location.split('/')[4:])}"
with open(r'C:\Users\USER\output\classifier.tar.gz', 'wb') as f:
    s3_client.download_fileobj("scion-dev", bucket_path, f)

In [41]:
# Extract downloaded file
import shutil
shutil.unpack_archive(r'C:\Users\USER\output\classifier.tar.gz', 
                      r'C:\Users\USER\output\classifier')

In [2]:
# List files from extracted folder
import glob

extracted_files = glob.glob(r'C:\Users\USER\output\classifier\*')
print(extracted_files)

In [47]:
# read jsonl file
with open(extracted_files[0],'r') as fp:
    fdata = fp.readlines()
pprint(fdata)

['{"File": "test.csv", "Line": "0", "Classes": [{"Name": "business", "Score": '
 '0.9735}, {"Name": "entertainment", "Score": 0.013}, {"Name": "tech", '
 '"Score": 0.0059}]}\n',
 '{"File": "test.csv", "Line": "1", "Classes": [{"Name": "tech", "Score": '
 '0.9586}, {"Name": "entertainment", "Score": 0.0318}, {"Name": "politics", '
 '"Score": 0.0054}]}\n',
 '{"File": "test.csv", "Line": "2", "Classes": [{"Name": "politics", "Score": '
 '0.7099}, {"Name": "business", "Score": 0.2326}, {"Name": "tech", "Score": '
 '0.03}]}\n',
 '{"File": "test.csv", "Line": "3", "Classes": [{"Name": "tech", "Score": '
 '0.9539}, {"Name": "entertainment", "Score": 0.0321}, {"Name": "politics", '
 '"Score": 0.0061}]}\n',
 '{"File": "test.csv", "Line": "4", "Classes": [{"Name": "politics", "Score": '
 '0.8461}, {"Name": "entertainment", "Score": 0.0932}, {"Name": "business", '
 '"Score": 0.0425}]}\n',
 '{"File": "test.csv", "Line": "5", "Classes": [{"Name": "tech", "Score": '
 '0.8239}, {"Name": "business", "

In [48]:
import json
pres = []
for line in fdata:
    da = json.loads(line)['Classes']
    pres.append(da[0]['Name'])

In [53]:
# load test data and predictions

test_data = pd.read_csv(r"C:\Users\USER\bbc-text\test.csv",names = ['labels','docs'])

In [55]:
test_data['predictions'] = pres

In [56]:
test_data

| | labels | docs | predictions |
|---|---|---|---|
| 0 | business | wall street cool to ebay s profit shares in on... | business |
| 1 | tech | uk pioneers digital film network the world s f... | tech |
| 2 | business | ban on forced retirement under 65 employers wi... | politics |
| 3 | tech | local net tv takes off in austria an austrian ... | tech |
| 4 | politics | profile: david miliband david miliband s rapid... | politics |
| 5 | tech | argonaut founder rebuilds empire jez san the ... | tech |
| 6 | entertainment | dance music not dead says fatboy dj norman coo... | entertainment |
| 7 | politics | kennedy questions trust of blair lib dem leade... | politics |
| 8 | tech | california sets fines for spyware the makers o... | tech |
| 9 | entertainment | snicket tops us box office chart the film adap... | entertainment |
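
With labels and predictions side by side, a simple accuracy figure summarizes how the classifier did. A minimal sketch in plain Python; `accuracy` is a hypothetical helper, and the two lists below are copied from the ten rows shown above.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# True labels and predicted classes for the ten rows displayed above
labels = ["business", "tech", "business", "tech", "politics",
          "tech", "entertainment", "politics", "tech", "entertainment"]
predicted = ["business", "tech", "politics", "tech", "politics",
             "tech", "entertainment", "politics", "tech", "entertainment"]
print(accuracy(labels, predicted))
```

On the full test set the same function could be called as `accuracy(test_data['labels'].tolist(), pres)`.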


In [3]:
# List all classification jobs in account
list_response = comprehend.list_document_classification_jobs()
print(f"List response: {list_response}\n")

## References

https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api-med.html

https://docs.aws.amazon.com/comprehend/latest/dg/get-started-customclass.html

https://docs.aws.amazon.com/comprehend/latest/dg/get-started-topics.html

https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html

https://www.youtube.com/watch?v=p5vaikbltIk

https://github.com/hervenivon/aws-experiments-comprehend-custom-classifier