### Amazon Textract - Receipts Demo

In this notebook, we use Amazon Textract to perform optical character recognition (OCR) on receipts, and then use Amazon Machine Learning to organize the dataset.
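As a rough illustration of the OCR step, the sketch below pulls the recognized lines of text out of a Textract-style response. The helper name `extract_lines`, the `min_confidence` parameter, and the sample response are assumptions made for illustration; the sample is hand-written to mimic the `Blocks` structure and is not real Textract output.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of each LINE block in a Textract-style response dict."""
    lines = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and WORD blocks; keep whole lines only.
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block.get("Text", ""))
    return lines


# Hand-written sample shaped like a text-detection response (hypothetical data).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "COFFEE SHOP", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "COFFEE", "Confidence": 99.3},
        {"BlockType": "LINE", "Text": "TOTAL 4.50", "Confidence": 97.8},
    ]
}

print(extract_lines(sample_response))
```

In a live session, the response would typically come from `boto3.client('textract').detect_document_text(...)` called with a `Document` that points at an image in S3. The helper only reads the returned dictionary, so it can be tried without AWS credentials.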

### General Imports

These are the libraries we need for the operations in this notebook.

In [1]:
import boto3
import sagemaker
import sys
import os
import re
import numpy as np
import pandas as pd
import subprocess
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
import gzip
from io import BytesIO
import zipfile
import random
import json
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.metrics import classification_report


### Constant Set-up
Configure all global constants here. These variables remain constant throughout the execution of the notebook.

In [2]:

# NOTE: S3 bucket name must begin with "deeplens-" for DeepLens deployment
bucket_name='aws-demo-receipts'
prefix = '' #only use this if you want to have your files in a folder 
dataset_prefix = 'receipts_data'
training_data_prefix = 'training_data'
output_data = 'output_data'
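Because `prefix` may be empty, simple string concatenation can produce keys with a leading or doubled slash. The following is a minimal sketch of one way to combine the constants into an S3 object key. The helper name `s3_key` and the file name `receipt_001.jpg` are hypothetical and are not part of the original notebook.

```python
def s3_key(*parts):
    """Join non-empty path parts with '/' to form an S3 object key."""
    # Drop empty parts and stray slashes so an empty prefix adds nothing.
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join(cleaned)


prefix = ""                       # optional top-level folder
dataset_prefix = "receipts_data"  # same value as the constant above

print(s3_key(prefix, dataset_prefix, "receipt_001.jpg"))
```

With an empty `prefix` the key starts directly at `receipts_data/`. Setting `prefix` to a folder name would nest everything under that folder.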

### Environment Setup

Setting up the environment involves ensuring that the correct session and IAM role are configured. We also need to ensure that the correct region and bucket are available.

In [3]:
def setup_env():
    
    role = get_execution_role()

    sess = sagemaker.Session()

    
    AWS_REGION = 'us-east-1'
    s3 = boto3.resource('s3')

    s3_bucket = s3.Bucket(bucket_name)

    if s3_bucket.creation_date is None:
        # Create the S3 bucket because it does not exist yet
        print('Creating S3 bucket {}.'.format(bucket_name))
        resp = s3.create_bucket(
            ACL='private',
            Bucket=bucket_name
        )
    else:
        print('Bucket already exists')
    return role, sess, AWS_REGION, s3,s3_bucket

role, sess,  AWS_REGION, s3, s3_bucket = setup_env()

Bucket already exists
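The setup above hardcodes `us-east-1`, where `create_bucket` needs no location settings. In other regions, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below builds the keyword arguments for either case. The helper name `bucket_create_kwargs` is an assumption introduced here, not something from the notebook.

```python
def bucket_create_kwargs(bucket_name, region):
    """Build keyword arguments for an S3 create_bucket call in the given region."""
    kwargs = {"ACL": "private", "Bucket": bucket_name}
    # us-east-1 is the default location and takes no LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(bucket_create_kwargs("aws-demo-receipts", "us-east-1"))
print(bucket_create_kwargs("aws-demo-receipts", "eu-west-1"))
```

The resulting dictionary could then be passed as `s3.create_bucket(**kwargs)`, which keeps the region handling separate from the call itself.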


### Download Data and Create Manifest

Here we download the data to our bucket if it does not already exist. This is a pre-compiled dataset of car images that contains two types:

- Whole (i.e. without damage)
- Damaged (i.e. with damage)

The following create_dataset function first copies the zip archive of the data and then unpacks it into the bucket named in the global constants.



In [4]:
def create_dataset(bucket_name, s3_bucket):
    
    dataset_key = 'car-damage-dataset.zip'
    objs = list(s3_bucket.objects.filter(Prefix=dataset_key))
    if len(objs) > 0 and objs[0].key == dataset_key:
        exists = True
        print('{} Already Exists'.format(dataset_key) )
    else:
        exists = False
    
    if not exists:
    
        bucket = bucket_name
        #copy first

        source= { 'Bucket' : 'public-datasets', 'Key': dataset_key}
        s3_bucket.copy(source, dataset_key)


        s3 = boto3.client('s3', use_ssl=False)
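        # Key prefix for the unpacked files. dataset_unpacked_dir was never defined;
        # this assumes the files belong under dataset_prefix.
        Key_unzip = dataset_prefix + '/'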

    
        s3_resource = boto3.resource('s3')
        # Load the zip archive from S3 into memory, then upload each member separately
        zip_obj = s3_resource.Object(bucket_name=bucket_name, key=dataset_key)
        print('Unpacking {}\n'.format(dataset_key))
        
        print (zip_obj)
        buffer = BytesIO(zip_obj.get()["Body"].read())
        z = zipfile.ZipFile(buffer)
        for filename in z.namelist():
            file_info = z.getinfo(filename)
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=bucket_name,
                Key=Key_unzip + f'{filename}')
            
        
    
# create_dataset(bucket_name, s3_bucket)
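The unpacking logic can be tried locally without S3. The sketch below reads a zip archive from memory and yields a target key and the contents for every file in it, skipping directory entries, which would otherwise be uploaded as empty objects. The helper name `iter_zip_members` and the tiny demo archive are assumptions made for illustration.

```python
import zipfile
from io import BytesIO


def iter_zip_members(zip_bytes, key_prefix=""):
    """Yield (key, data) pairs for every file in an in-memory zip archive."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Directory entries carry no data, so skip them.
            if info.is_dir():
                continue
            yield key_prefix + info.filename, archive.read(info)


# Build a small demo archive in memory (hypothetical contents).
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("images/", "")
    archive.writestr("images/a.txt", "hello")

demo_members = list(iter_zip_members(buffer.getvalue(), "receipts_data/"))
print(demo_members)
```

In the notebook, each yielded pair could be uploaded with a call such as `upload_fileobj(BytesIO(data), Bucket=bucket_name, Key=key)`. Reading the whole archive into memory is only reasonable for small datasets.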