# Amazon Textract
### Reading a Journal Article

**Based on an example in the** [AWS Textract Developer Guide](https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf#examples-blocks)  

![Amazon](https://media.gettyimages.com/photos/closeup-of-sign-with-logo-on-facade-of-the-regional-headquarters-of-picture-id1065011338?s=2048x2048)


### Article Title Page

![Ethical Concerns Regarding Twitter Data in Research](../data/research_title_page.png)

#### Import AWS API Library (boto3)

In [1]:
import boto3
import time
def startJob(s3BucketName, objectName):
    """Start an asynchronous text-detection job for a document stored in S3."""
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def isJobComplete(jobId):
    """Poll the job every 5 seconds and return its final status string."""
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def getJobResults(jobId):
    """Collect every page of results for a finished job, following NextToken."""
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))

    # Results are paginated; keep requesting until no NextToken is returned
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages

# Document
s3BucketName = "jdreed-hadley"
documentName = "The_Ethical_Challenges_Publishing_Twitter.pdf"

output_file = 'pdf_extract.txt'

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
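# isJobComplete returns the final status string, which is truthy even for
# "FAILED", so compare it explicitly before fetching results
if isJobComplete(jobId) == "SUCCEEDED":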
    response = getJobResults(jobId)
    
    # Write each detected line to the output file, wrapped in ANSI
    # color codes (\033[94m = bright blue, \033[0m = reset)
    with open(output_file, "wt") as f:
        for resultPage in response:
            for item in resultPage["Blocks"]:
                if item["BlockType"] == "LINE":
                    f.write('\033[94m' + item["Text"] + '\033[0m\n')

    # show the results
    print(f'\n\nPDF Extract OUTPUT FILE: {output_file}')

Started job with id: f817424c441392157354d7383dc856b51d6828aa8cb12eafe4815541704e0870
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED
Resultset page received: 1
Resultset page received: 2
Resultset page received: 3
Resultset page received: 4
Resultset page received: 5
Resultset page received: 6
Resultset page received: 7
Resultset page received: 8
Resultset page received: 9
Resultset page received: 10


PDF Extract OUTPUT FILE: pdf_extract.txt
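The polling loop above waits indefinitely and treats every final status alike. As a rough sketch of one alternative, the helper below bounds the wait and hands the final status back to the caller. The names `wait_for_job` and `fetch_status` are hypothetical, not part of boto3; `fetch_status` is assumed to be any zero-argument callable that returns the job's `JobStatus` string.

```python
import time


def wait_for_job(fetch_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll fetch_status() until the job leaves IN_PROGRESS or the timeout expires.

    fetch_status is a hypothetical zero-argument callable returning a status
    string such as "IN_PROGRESS", "SUCCEEDED" or "FAILED".
    """
    waited = 0.0
    status = fetch_status()
    while status == "IN_PROGRESS":
        if waited >= timeout_seconds:
            raise TimeoutError("job still IN_PROGRESS after {:.0f}s".format(waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = fetch_status()
    return status


# Possible use with the client from this notebook (not executed here):
# client = boto3.client('textract')
# status = wait_for_job(
#     lambda: client.get_document_text_detection(JobId=jobId)["JobStatus"])
# if status != "SUCCEEDED":
#     raise RuntimeError("Textract job ended with status " + status)
```

Passing the status lookup in as a callable keeps the loop independent of AWS, so it can be exercised with canned statuses before running against a real job.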


In [16]:
!head pdf_extract.txt

Long Session VI: Reflecting, Thinking, Understanding
WebSci '17, June 25-28, 2017, Troy, NY, USA
The Ethical Challenges of Publishing Twitter Data
for Research Dissemination
Helena Webb
Bernd Carsten Stahl
William Housley
Marina Jirotka
De Montfort University,
Adam Edwards
University of Oxford,
Department of Informatics
Matthew Williams
Department of Computer Science
Leicester, United Kingdom
Cardiff University,
Oxford, United Kingdom
bstahl@dmu.ac.uk
School of Social Sciences
helena.webb@cs.ox.ac.uk
Cardiff, United Kingdom
marina.jirotka.cs.ox.ac.uk
Omer Rana
housleyw@cardiff.ac.uk
Pete Burnap
edwardsa2@cardiff.ac.uk
Rob Procter*
Cardiff University, School
williamsm7@cardiff.ac.uk
University of Warwick,
of Computer Science and Informatics
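
Because the script writes each line wrapped in ANSI color codes, the saved file carries those escape sequences along with the text. The sketch below shows one way to get plain text instead, either by collecting the `LINE` blocks directly from the result pages or by stripping the color codes from text already written. The helper names `extract_lines` and `strip_ansi` are hypothetical, and the regular expression only covers color/style sequences of the form `ESC[...m`.

```python
import re

# Matches ANSI color/style sequences such as \033[94m and \033[0m
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def extract_lines(pages):
    """Return the text of every LINE block across a list of result pages."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def strip_ansi(text):
    """Remove ANSI color/style escape sequences from a string."""
    return ANSI_ESCAPE.sub("", text)


# Small hand-made sample shaped like the Textract responses used above
sample_pages = [
    {"Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Helena Webb"},
        {"BlockType": "WORD", "Text": "Helena"},
    ]},
    {"Blocks": [{"BlockType": "LINE", "Text": "Omer Rana"}]},
]
print(extract_lines(sample_pages))
print(strip_ansi("\033[94mHelena Webb\033[0m"))
```

With the real job, `extract_lines(response)` could be applied to the list returned by `getJobResults`, since each element there is one response dictionary containing a `Blocks` list.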