<h1> <b> Comprehend Primitives and Pre-Built APIs </b></h1>

Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text. We will explore 6 pre-trained APIs: Identifying Named Entities, Extracting Key Phrases, Identifying the Dominant Language, Determining Emotional Sentiment, Determining Syntax, and Detecting Personally Identifiable Information (PII).
 
- Selected kernel: Python 3 (Data Science)
- IAM setting: ComprehendFullAccess

In [1]:
import boto3
import pprint
import pandas as pd
import numpy as np

In [2]:
# Initialize the Comprehend client with Boto3
comprehend = boto3.client(service_name='comprehend')

In [3]:
#sample text we will be using with Comprehend
sample_text = '''
Hello Zhang Wei. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0000 has a minimum payment of $24.53 that is due by July 31st. 
Based on your autopay settings, we will withdraw your payment on the due date from your bank account XXXXXX1111 with the routing number XXXXX0000. 
Your latest statement was mailed to 100 Main Street, Anytown, WA 98121. 
After your payment is received, you will receive a confirmation text message at 206-555-0100. 
If you have questions about your bill, AnyCompany Customer Service is available by phone at 206-555-0199 or email at support@anycompany.com.
'''


<h1>Identifying Named Entities</h1>

A named entity is a real-world object (a person, place, organization, etc.) that can be denoted with a proper name. Amazon Comprehend can extract named entities from a document or text. This can be useful, for example, for indexing, document labeling, or search. For more information, see Detect Entities. The API used to extract these entities is the DetectEntities API. For each entity detected, Amazon Comprehend returns both the type, for instance "Person" or "Date", and a confidence score that indicates how confident the model is in this detection. In your implementation you can use this confidence score to set threshold values.
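
As a minimal sketch of the thresholding idea, the helper below keeps only entities whose `Score` meets a cutoff. The function name `filter_by_score`, the 0.9 cutoff, and the hard-coded `sample_response` (shaped like a DetectEntities response, with two values taken from this notebook's sample run) are assumptions for illustration; in practice you would pass in the result of `comprehend.detect_entities`.

```python
def filter_by_score(entities, min_score=0.9):
    """Return only the entities whose confidence score is at least min_score."""
    return [entity for entity in entities if entity['Score'] >= min_score]

# Hypothetical stand-in for a DetectEntities response (no AWS call is made here).
sample_response = {
    'Entities': [
        {'Text': 'Zhang Wei', 'Type': 'PERSON', 'Score': 0.9991},
        {'Text': 'AnyCompany Customer Service', 'Type': 'ORGANIZATION', 'Score': 0.78477},
    ]
}

confident_entities = filter_by_score(sample_response['Entities'], min_score=0.9)
print([entity['Text'] for entity in confident_entities])
```

The right cutoff depends on the use case: a search index may tolerate lower scores than an automated workflow that acts on the detected entities.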

<h4>Important Terminologies</h4>
<img src="comprehend_terminology.png" alt="comprehend_terminology" width="1000"/>

In [4]:
#detect and print entities
detected_entities = comprehend.detect_entities(Text=sample_text, LanguageCode='en')
pprint.pprint(detected_entities['Entities'][0:5])

[{'BeginOffset': 7,
  'EndOffset': 16,
  'Score': 0.9990997314453125,
  'Text': 'Zhang Wei',
  'Type': 'PERSON'},
 {'BeginOffset': 23,
  'EndOffset': 57,
  'Score': 0.9994251132011414,
  'Text': 'AnyCompany Financial Services, LLC',
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 78,
  'EndOffset': 97,
  'Score': 0.988307774066925,
  'Text': '1111-0000-1111-0000',
  'Type': 'OTHER'},
 {'BeginOffset': 123,
  'EndOffset': 129,
  'Score': 0.998004138469696,
  'Text': '$24.53',
  'Type': 'QUANTITY'},
 {'BeginOffset': 145,
  'EndOffset': 154,
  'Score': 0.9991244077682495,
  'Text': 'July 31st',
  'Type': 'DATE'}]


The entity detection output contains five elements for each entity. BeginOffset and EndOffset give the position of the text in the document, measured in characters; for example, Zhang Wei starts at character 7 and finishes at character 16. Score is the confidence of the prediction, Text is the text that was classified, and Type is the entity type that was detected. This response pattern is common across nearly all Amazon Comprehend API commands. You can also build <b>Custom Entity Detection</b>, which we will touch on in a later workshop.

<h4>Explain Entity Output</h4>
<img src="entity_output.png" alt="entity_output" width="1000"/>
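
To make the offsets concrete, the sketch below slices a shortened copy of the sample text with the BeginOffset and EndOffset values from the output above. The helper name `span_text` is an assumption for illustration. The example treats EndOffset as exclusive, which matches Python slicing for this ASCII sample.

```python
# A shortened copy of the notebook's sample text. The leading newline matters
# because it occupies character position 0.
text = "\nHello Zhang Wei. Your AnyCompany Financial Services, LLC credit card account"

# Offsets copied from the DetectEntities output above.
entities = [
    {'BeginOffset': 7, 'EndOffset': 16, 'Type': 'PERSON'},
    {'BeginOffset': 23, 'EndOffset': 57, 'Type': 'ORGANIZATION'},
]

def span_text(document, entity):
    """Recover the matched text by slicing the document with the entity's offsets."""
    return document[entity['BeginOffset']:entity['EndOffset']]

for entity in entities:
    print(entity['Type'], '->', span_text(text, entity))
```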

In [5]:
# Display the values in a more human-readable way
detected_entities_df = pd.DataFrame([ [entity['Text'], entity['Type'], entity['Score']] for entity in detected_entities['Entities']],
                columns=['Text', 'Type', 'Score'])
display(detected_entities_df)

Unnamed: 0,Text,Type,Score
0,Zhang Wei,PERSON,0.9991
1,"AnyCompany Financial Services, LLC",ORGANIZATION,0.999425
2,1111-0000-1111-0000,OTHER,0.988308
3,$24.53,QUANTITY,0.998004
4,July 31st,DATE,0.999124
5,XXXXXX1111,OTHER,0.974339
6,XXXXX0000,OTHER,0.971708
7,"100 Main Street, Anytown, WA 98121",LOCATION,0.983216
8,206-555-0100,OTHER,0.999324
9,AnyCompany Customer Service,ORGANIZATION,0.78477



<h1>Detecting Key Phrases</h1>

Amazon Comprehend can extract **key noun phrases** that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score. This can be used, for example, for indexing or summarization. For more information, see Detect Key Phrases.

The API used to extract these key phrases is the DetectKeyPhrases API.

Amazon Comprehend returns the key phrases, as well as a confidence score which indicates how confident the model is in this detection. In your implementation you can use this confidence score to set threshold values.
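
One way to use key phrases for indexing is to turn them into a deduplicated list of index terms. The sketch below is an assumption-laden illustration, not part of the Comprehend API: the helper `index_terms`, the 0.9 cutoff, and the lowercasing step are all hypothetical choices, and the last entry in `sample_key_phrases` is made up to show deduplication. The other values are taken from this notebook's sample run.

```python
def index_terms(key_phrases, min_score=0.9):
    """Build a sorted list of unique, lowercased phrases that meet the cutoff."""
    terms = {
        phrase['Text'].lower()
        for phrase in key_phrases
        if phrase['Score'] >= min_score
    }
    return sorted(terms)

# Hypothetical stand-in for the 'KeyPhrases' list of a DetectKeyPhrases response.
sample_key_phrases = [
    {'Text': 'Hello Zhang Wei', 'Score': 0.842634},
    {'Text': 'a minimum payment', 'Score': 0.999948},
    {'Text': 'your payment', 'Score': 0.999452},
    {'Text': 'Your payment', 'Score': 0.95},  # made-up duplicate with different casing
]

print(index_terms(sample_key_phrases))
```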


In [6]:
#Call detect key phrases API
detected_key_phrases = comprehend.detect_key_phrases(Text=sample_text, LanguageCode='en')
pprint.pprint(detected_key_phrases['KeyPhrases'][0:3])

[{'BeginOffset': 1,
  'EndOffset': 16,
  'Score': 0.8426344990730286,
  'Text': 'Hello Zhang Wei'},
 {'BeginOffset': 18,
  'EndOffset': 52,
  'Score': 0.9881375432014465,
  'Text': 'Your AnyCompany Financial Services'},
 {'BeginOffset': 54,
  'EndOffset': 97,
  'Score': 0.8444651961326599,
  'Text': 'LLC credit card account 1111-0000-1111-0000'}]


In [7]:
# Display the values in a more human-readable way
detected_key_phrases_df = pd.DataFrame([ [phrase['Text'], phrase['Score']] for phrase in detected_key_phrases['KeyPhrases']],
                columns=['Text', 'Score'])
display(detected_key_phrases_df)

Unnamed: 0,Text,Score
0,Hello Zhang Wei,0.842634
1,Your AnyCompany Financial Services,0.988138
2,LLC credit card account 1111-0000-1111-0000,0.844465
3,a minimum payment,0.999948
4,$24.53,0.999779
5,July 31st,0.999465
6,your autopay settings,0.999296
7,your payment,0.999452
8,the due date,0.999488
9,your bank account XXXXXX1111,0.984577



<h1>Identifying the Dominant Language</h1>

Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can currently identify many languages. This can be useful as a first step before further processing, for example when phone call transcripts can be in different languages. For more information, including which languages can be identified, see Detect the Dominant Language.

The API used to identify the dominant language is the DetectDominantLanguage API.

Amazon Comprehend returns the dominant language, as well as a confidence score which indicates how confident the model is in this detection. In your implementation you can use this confidence score to set threshold values. If more than one language is detected, it will return each detected language and its corresponding confidence score.
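
When language detection is the first step in a pipeline, a small helper can pick the language code to pass to later calls. The sketch below assumes a list shaped like the `Languages` field of a DetectDominantLanguage response; the helper name `pick_language`, the 0.8 cutoff, the `fallback` parameter, and the sample scores are hypothetical.

```python
def pick_language(languages, min_score=0.8, fallback=None):
    """Return the highest-scoring LanguageCode, or fallback if none is confident enough."""
    if not languages:
        return fallback
    best = max(languages, key=lambda language: language['Score'])
    return best['LanguageCode'] if best['Score'] >= min_score else fallback

# Hypothetical response content with two detected languages.
sample_languages = [
    {'LanguageCode': 'en', 'Score': 0.99},
    {'LanguageCode': 'es', 'Score': 0.01},
]

print(pick_language(sample_languages))
```

Returning a fallback instead of raising an error lets the caller decide whether to skip the document or route it for manual review.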


In [8]:
# Call the DetectDominantLanguage API
detected_language = comprehend.detect_dominant_language(Text=sample_text)
pprint.pprint(detected_language)

{'Languages': [{'LanguageCode': 'en', 'Score': 0.9913273453712463}],
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '64',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Fri, 21 Jan 2022 14:49:41 GMT',
                                      'x-amzn-requestid': 'f42bec22-430c-4a8d-91aa-5b625784f608'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'f42bec22-430c-4a8d-91aa-5b625784f608',
                      'RetryAttempts': 0}}


In [9]:
#Making it more human readable
detected_language_df = pd.DataFrame([ [code['LanguageCode'], code['Score']] for code in detected_language['Languages']],
                columns=['Language Code', 'Score'])
display (detected_language_df)

Unnamed: 0,Language Code,Score
0,en,0.991327


<h1>Determining Emotional Sentiment</h1>

Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. This can be useful, for example, to analyze the content of reviews or transcripts from call centres. For more information, see Detecting Sentiment.

The API used to extract the emotional sentiment is the DetectSentiment API.

Amazon Comprehend returns the different sentiments and the related confidence score for each of them, which indicates how confident the model is in this detection. The sentiment with the highest confidence score can be seen as the predominant sentiment in the text.
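
The relationship between the per-sentiment scores and the predominant sentiment can be sketched as a simple maximum over the `SentimentScore` dictionary. The helper name `predominant` is an assumption for illustration, and the scores are copied from this notebook's sample run. The response also carries a `Sentiment` field, so this calculation is only needed when you want to apply your own logic to the scores.

```python
# SentimentScore values copied from this notebook's sample run.
sentiment_scores = {
    'Mixed': 9.69591928878799e-06,
    'Negative': 0.016147520393133163,
    'Neutral': 0.9832557439804077,
    'Positive': 0.0005869901506230235,
}

def predominant(scores):
    """Return the sentiment label with the highest confidence score, uppercased."""
    return max(scores, key=scores.get).upper()

print(predominant(sentiment_scores))
```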


In [10]:
#calling detect_sentiment 
detected_sentiment = comprehend.detect_sentiment(Text=sample_text, LanguageCode='en')
pprint.pprint(detected_sentiment)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '163',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Fri, 21 Jan 2022 14:50:10 GMT',
                                      'x-amzn-requestid': 'dc2df383-7f85-4e05-9f15-cb84db63910f'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'dc2df383-7f85-4e05-9f15-cb84db63910f',
                      'RetryAttempts': 0},
 'Sentiment': 'NEUTRAL',
 'SentimentScore': {'Mixed': 9.69591928878799e-06,
                    'Negative': 0.016147520393133163,
                    'Neutral': 0.9832557439804077,
                    'Positive': 0.0005869901506230235}}


In [11]:
# Find the predominant sentiment and make the scores more human-readable
predominant_sentiment = detected_sentiment['Sentiment']
detected_sentiments_df = pd.DataFrame([ [sentiment, detected_sentiment['SentimentScore'][sentiment]] for sentiment in detected_sentiment['SentimentScore']],
                columns=['Sentiment', 'Score'])
# Sentiment scores across the document
display(detected_sentiments_df)
# Predominant sentiment
display(predominant_sentiment)

Unnamed: 0,Sentiment,Score
0,Positive,0.000587
1,Negative,0.016148
2,Neutral,0.983256
3,Mixed,1e-05


'NEUTRAL'


<h1>Detecting Personally Identifiable Information (PII)</h1>

Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number. This can be useful, for example, for information extraction and indexing, and to comply with legal requirements around data protection. For more information, see Detect Personally Identifiable Information (PII).

Amazon Comprehend can help you identify the location of individual PII in your document or help you label documents that contain PII.

<h2>Identify the location of PII in your text documents</h2>

Amazon Comprehend can help you identify the location of individual PII in your document. Select "Offsets" in the Personally identifiable information (PII) analysis mode.

The API used to identify the location of individual PII is the DetectPiiEntities API.

Amazon Comprehend returns the different PII and the related confidence score for each of them, which indicates how confident the model is in this detection.


In [12]:
#Calling the Detect PII API
detected_pii_entities = comprehend.detect_pii_entities(Text=sample_text, LanguageCode='en')
pprint.pprint(detected_pii_entities)

{'Entities': [{'BeginOffset': 7,
               'EndOffset': 16,
               'Score': 0.9999306797981262,
               'Type': 'NAME'},
              {'BeginOffset': 78,
               'EndOffset': 97,
               'Score': 0.999817430973053,
               'Type': 'BANK_ACCOUNT_NUMBER'},
              {'BeginOffset': 145,
               'EndOffset': 154,
               'Score': 0.9995546340942383,
               'Type': 'DATE_TIME'},
              {'BeginOffset': 258,
               'EndOffset': 268,
               'Score': 0.9999951124191284,
               'Type': 'BANK_ACCOUNT_NUMBER'},
              {'BeginOffset': 293,
               'EndOffset': 302,
               'Score': 0.9995330572128296,
               'Type': 'BANK_ROUTING'},
              {'BeginOffset': 341,
               'EndOffset': 375,
               'Score': 0.9999313354492188,
               'Type': 'ADDRESS'},
              {'BeginOffset': 458,
               'EndOffset': 470,
               'Score': 0.99

In [13]:
#Make more human readable
detected_pii_entities_df = pd.DataFrame([ [entity['Type'], entity['Score']] for entity in detected_pii_entities['Entities']],
                columns=['Type', 'Score'])
display (detected_pii_entities_df)


Unnamed: 0,Type,Score
0,NAME,0.999931
1,BANK_ACCOUNT_NUMBER,0.999817
2,DATE_TIME,0.999555
3,BANK_ACCOUNT_NUMBER,0.999995
4,BANK_ROUTING,0.999533
5,ADDRESS,0.999931
6,PHONE,0.999917
7,PHONE,0.999959
8,EMAIL,0.999556
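
Because each detected PII entity comes with offsets, one possible use is to mask the matching spans in the text. The sketch below is a hypothetical helper, not a Comprehend feature: the name `redact`, the `[TYPE]` placeholder format, and the 0.5 cutoff are assumptions. It uses a shortened copy of the sample text and the first two entities from the output above, and it does not handle overlapping spans.

```python
def redact(document, pii_entities, min_score=0.5):
    """Replace each detected PII span with its type in square brackets."""
    redacted = document
    # Work from the end of the text backwards so earlier offsets stay valid.
    for entity in sorted(pii_entities, key=lambda e: e['BeginOffset'], reverse=True):
        if entity['Score'] >= min_score:
            redacted = (
                redacted[:entity['BeginOffset']]
                + '[' + entity['Type'] + ']'
                + redacted[entity['EndOffset']:]
            )
    return redacted

# Shortened copy of the sample text; the leading newline is character 0.
text = ("\nHello Zhang Wei. Your AnyCompany Financial Services, LLC "
        "credit card account 1111-0000-1111-0000 has a minimum payment")

# First two entities copied from the DetectPiiEntities output above.
pii = [
    {'BeginOffset': 7, 'EndOffset': 16, 'Score': 0.9999306797981262, 'Type': 'NAME'},
    {'BeginOffset': 78, 'EndOffset': 97, 'Score': 0.999817430973053, 'Type': 'BANK_ACCOUNT_NUMBER'},
]

print(redact(text, pii))
```

Replacing spans from the last offset to the first avoids recomputing positions after each substitution changes the length of the text.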



<h2>Label text documents with PII</h2>

Amazon Comprehend can help you label documents that contain PII. Select "Labels" in the Personally identifiable information (PII) analysis mode.

The API used to label a document with the types of PII it contains is the ContainsPiiEntities API.

Amazon Comprehend returns the different PII labels and the related confidence score for each of them, which indicates how confident the model is in this detection. These labels indicate the presence of these types of PII in the document.


In [14]:
# Label the text with the PII types it contains
detected_pii_labels = comprehend.contains_pii_entities(Text=sample_text, LanguageCode='en')
pprint.pprint(detected_pii_labels)

{'Labels': [{'Name': 'DATE_TIME', 'Score': 1.0},
            {'Name': 'EMAIL', 'Score': 1.0},
            {'Name': 'ADDRESS', 'Score': 0.8439050912857056},
            {'Name': 'BANK_ROUTING', 'Score': 0.6791887283325195},
            {'Name': 'PHONE', 'Score': 1.0}],
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '200',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Fri, 21 Jan 2022 14:51:40 GMT',
                                      'x-amzn-requestid': '85358d24-d024-4ec8-b4d5-45b0f41f42e7'},
                      'HTTPStatusCode': 200,
                      'RequestId': '85358d24-d024-4ec8-b4d5-45b0f41f42e7',
                      'RetryAttempts': 0}}


In [15]:
#Make more human readable
detected_pii_labels_df = pd.DataFrame([ [entity['Name'], entity['Score']] for entity in detected_pii_labels['Labels']],
                columns=['Name', 'Score'])
display (detected_pii_labels_df)


Unnamed: 0,Name,Score
0,DATE_TIME,1.0
1,EMAIL,1.0
2,ADDRESS,0.843905
3,BANK_ROUTING,0.679189
4,PHONE,1.0
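
Since the labels describe the whole document rather than individual spans, a typical follow-up is a yes/no decision about how to handle the document. The sketch below applies a cutoff to the labels shown above and intersects the result with a list of sensitive types. The helper name `pii_types_present`, the 0.8 cutoff, and the `sensitive_types` policy are hypothetical.

```python
def pii_types_present(labels, min_score=0.8):
    """Return the set of PII label names whose score is at least min_score."""
    return {label['Name'] for label in labels if label['Score'] >= min_score}

# Labels copied from the ContainsPiiEntities output above.
labels = [
    {'Name': 'DATE_TIME', 'Score': 1.0},
    {'Name': 'EMAIL', 'Score': 1.0},
    {'Name': 'ADDRESS', 'Score': 0.8439050912857056},
    {'Name': 'BANK_ROUTING', 'Score': 0.6791887283325195},
    {'Name': 'PHONE', 'Score': 1.0},
]

present = pii_types_present(labels)

# Hypothetical policy: flag the document if any of these types is present.
sensitive_types = {'EMAIL', 'PHONE', 'BANK_ROUTING'}
print(sorted(present & sensitive_types))
```

With the 0.8 cutoff, BANK_ROUTING (score 0.679) is excluded; lowering the cutoff would include it, which shows how the threshold shapes the outcome.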
