### Comprehend NER

This notebook walks through Amazon Comprehend's Detect Entities operation. It performs Named Entity Recognition (NER) over a text or a collection of texts and tags the terms or phrases that refer to a specific named object, person, place, or other type of entity. To use Comprehend from Python, you create a Comprehend client with the boto3 SDK. We use this client to request NER results and load them into dataframes that are easy to manipulate.

In [2]:
import boto3
import pandas as pd

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# One input bucket per culture of origin, in the same order as the sections used later.
buckets = [
    'arabian-folktales-transformed-comprehendable',
    'chinese-folktales-transformed-comprehendable',
    'english-folktales-transformed-comprehendable',
    'german-folktales-transformed-comprehendable',
    'indian-folktales-transformed-comprehendable',
    'russian-folktales-transformed-comprehendable',
]

In this project we analyze the results of Comprehend's NER annotator. Our input files are cleaned text files, each smaller than 5,000 bytes. To keep the results consistent, we group the stories by culture of origin and build one dataframe per section, holding the pertinent information for every term or phrase tagged by the NER annotator.
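The size constraint above is expressed in bytes, not characters, so a text with non-ASCII characters can exceed it with fewer than 5,000 characters. The following is a minimal sketch of a pre-flight check, assuming UTF-8 encoding; the helper name `fits_size_limit` and the `max_bytes` default are illustrative and not part of boto3 or of the original notebook.

```python
def fits_size_limit(text, max_bytes=5000):
    """Return True if the UTF-8 encoding of `text` is smaller than `max_bytes`.

    Hypothetical helper: the 5,000-byte default mirrors the limit stated in
    this notebook and should be adjusted if the service limit differs.
    """
    return len(text.encode('utf-8')) < max_bytes


# ASCII characters take one byte each; many accented characters take two.
ascii_story = 'a' * 4999
accented_story = '\u00e9' * 2500  # 2,500 characters, 5,000 bytes in UTF-8

print(fits_size_limit(ascii_story))
print(fits_size_limit(accented_story))
```

A check like this could be run on each story before it is sent to Comprehend, so that oversized files are skipped or split rather than causing a failed request.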

In [16]:
# Destination bucket for the per-section NER results.
s3.create_bucket(Bucket='folktales-comprehend-ner')

sections = ['arabian', 'chinese', 'english', 'german', 'indian', 'russian']

# sections and buckets are listed in the same order, so pair them directly.
for section, currBucket in zip(sections, buckets):
    scoreList = []
    typeList = []
    tokenList = []
    beginList = []
    endList = []
    # list_objects_v2 returns at most 1,000 keys per call; 'Contents' is absent for an empty bucket.
    for item in s3.list_objects_v2(Bucket=currBucket).get('Contents', []):
        # Read the story text and request NER annotations for this single document.
        text = s3.get_object(Bucket=currBucket, Key=item['Key'])['Body'].read().decode('utf-8')
        resultsNER = comprehend.detect_entities(Text=text, LanguageCode='en')
        for entity in resultsNER.get('Entities', []):
            scoreList.append(entity.get('Score'))
            typeList.append(entity.get('Type'))
            tokenList.append(entity.get('Text'))
            beginList.append(entity.get('BeginOffset'))
            endList.append(entity.get('EndOffset'))

    # One row per tagged entity, across all stories in this section.
    df = pd.DataFrame({'Score': scoreList, 'Type': typeList, 'Text': tokenList, 'BeginOffset': beginList, 'EndOffset': endList})

    # Write the section's results as a CSV file to the results bucket.
    key = 'ner-results-' + section + '.csv'
    s3.put_object(Bucket='folktales-comprehend-ner', Body=df.to_csv(), Key=key)

Comprehend offers two ways to request NER annotations: one document at a time, or in batches of up to 25 documents, each smaller than 5,000 bytes. Because some of our 'transformed-comprehendable' buckets contain more than 25 documents, we request NER annotations for each available document individually. For each document, we collect the data for every tagged entity; the 'Score', 'Type', and 'Text' columns matter most for our analysis. We build one dataframe per section and write each one as a CSV file to a new S3 bucket.
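For comparison, the batch route could still be used on larger buckets by splitting the documents into groups of at most 25. The sketch below is an alternative to the approach taken in this notebook, not what the notebook runs. It assumes the boto3 `batch_detect_entities` call, whose response carries a `ResultList` and an `ErrorList` with an `Index` into the submitted batch; the helper name `batch_detect_all` and its parameters are hypothetical.

```python
def batch_detect_all(client, texts, language_code='en', batch_size=25):
    """Run batch NER over `texts` in chunks of at most `batch_size` documents.

    Hypothetical helper. `client` is assumed to be a boto3 Comprehend client
    (or any object exposing a compatible batch_detect_entities method).
    Returns (entities_per_text, errors), where entities_per_text[i] is the
    entity list for texts[i] and errors holds (index, error_code) pairs.
    """
    entities_per_text = [[] for _ in texts]
    errors = []
    for offset in range(0, len(texts), batch_size):
        chunk = texts[offset:offset + batch_size]
        response = client.batch_detect_entities(TextList=chunk, LanguageCode=language_code)
        # 'Index' is relative to the submitted chunk, so shift it by the chunk offset.
        for result in response.get('ResultList', []):
            entities_per_text[offset + result['Index']] = result.get('Entities', [])
        for error in response.get('ErrorList', []):
            errors.append((offset + error['Index'], error.get('ErrorCode')))
    return entities_per_text, errors
```

With a real client this would be called as `batch_detect_all(boto3.client('comprehend'), texts)`, where `texts` is the list of decoded story strings for one bucket. Batching reduces the number of requests, at the cost of having to map each result back to its document through the returned index.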