# <b>Document text detection batch feature demo</b>

The AIServiceVisionClient offers the document text detection feature in batch mode. This notebook walks through the requirements, usage, and output of the batch (asynchronous) API.<br>
<ul>
    <li>The output response files are stored at the object storage specified in <code>data/output_object_document_batch.json</code>. </li>
<li>The detected text for a randomly selected document from the batch input is displayed in the last section of the notebook.</li>
</ul>

### Steps to run the notebook:
<details>
    <summary>Notebook session setup</summary>
    <ol>
        <li><font size="2">Installing the OCI Vision SDK</font></li>
        <li><font size="2">Installing other dependencies</font></li>
        <li><font size="2">Setting up sample input documents</font></li>
        <li><font size="2">Setting up helper .py files</font></li>
    </ol>
</details>

<details>
    <summary>Importing the required modules</summary>
</details>

<details>
    <summary>Setting the input variables</summary>
     <font size="2">The input variables are assigned default values; change them if necessary.</font>
</details>

<details>
    <summary>Running the main pipeline</summary>
    <font size="2">Run all cells to get the output in the <code>output</code> directory. </font><br>
</details>

### Notebook session setup
<details>
    <summary>Instructions</summary>
    <ul>
        <li><font size="2">Setup needs to be done only once.</font></li>
        <li><font size="2">Uncomment the commented cells and run them once to complete the setup.</font></li>
        <li><font size="2">Comment the same cells out again afterwards so they do not rerun.</font></li>
    </ul>
</details>

#### Installing the OCI Vision SDK

In [1]:
# !wget "https://objectstorage.us-ashburn-1.oraclecloud.com/n/axhheqi2ofpb/b/vision-demo-notebooks/o/vision_service_python_client-0.3.45-py2.py3-none-any.whl"
# !pip install vision_service_python_client-0.3.45-py2.py3-none-any.whl
# !rm vision_service_python_client-0.3.45-py2.py3-none-any.whl

#### Installing other dependencies

In [2]:
# !pip install matplotlib==3.3.4
# !pip install pandas==1.1.5

#### Set up sample input documents

In [3]:
# !wget "https://objectstorage.us-ashburn-1.oraclecloud.com/n/axhheqi2ofpb/b/vision-demo-notebooks/o/input_objects_document_batch.json"
# !wget "https://objectstorage.us-ashburn-1.oraclecloud.com/n/axhheqi2ofpb/b/vision-demo-notebooks/o/output_object_document_batch.json"
# !mkdir data
# !mv input_objects_document_batch.json data
# !mv output_object_document_batch.json data

#### Set up helper .py files

In [4]:
# !wget "https://objectstorage.us-ashburn-1.oraclecloud.com/n/axhheqi2ofpb/b/vision-demo-notebooks/o/analyze_document_batch_utils.py"
# !mkdir helper
# !mv analyze_document_batch_utils.py helper

### Imports

In [5]:
import time
import json
import io
from random import randint
import oci

from vision_service_python_client.models import output_location
from vision_service_python_client.ai_service_vision_client import AIServiceVisionClient
from vision_service_python_client.models.create_document_job_details import CreateDocumentJobDetails
from vision_service_python_client.models.document_text_detection_feature import DocumentTextDetectionFeature
from helper.analyze_document_batch_utils import load_input_object_locations, load_output_object_location, display_classes, clean_output
from IPython.display import JSON

### Set input variables
<details>
    <summary><font size="3">input_location_path</font></summary>
    <font size="2">The file <code>data/input_objects_document_batch.json</code> specifies where the input documents are to be taken from. Sample files have been provided. The user needs to provide the following in this file:
        <ul>
            <li><code>compartment_id</code> : Compartment ID</li>
            <li><code>input_objects</code> : List of object locations, each in the following format:</li>
            <ul>
                <li><code>namespace</code> : Namespace name</li>
                <li><code>bucket</code> : Bucket name</li>
                <li><code>objects</code> : List of object names</li>
            </ul>
        </ul>
    </font>
</details>
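
As a rough illustration of the input file layout described above, the sketch below builds a sample specification and expands it into one entry per document. All values are hypothetical placeholders, and `flatten_input_objects` is an illustrative function, not part of the helper module; the actual `load_input_object_locations` helper may process the file differently.

```python
import json

# Hypothetical placeholder values; replace them with your own tenancy details.
sample_input = {
    "compartment_id": "ocid1.compartment.oc1..example",
    "input_objects": [
        {
            "namespace": "my-namespace",
            "bucket": "my-bucket",
            "objects": ["invoice.pdf", "receipt.png"],
        }
    ],
}


def flatten_input_objects(spec):
    """Expand grouped entries into one (namespace, bucket, object) tuple per document."""
    flat = []
    for group in spec["input_objects"]:
        for name in group["objects"]:
            flat.append((group["namespace"], group["bucket"], name))
    return flat


# Round-trip through JSON text to mimic reading the file from disk.
loaded = json.loads(json.dumps(sample_input, indent=2))
documents = flatten_input_objects(loaded)
print(documents)
```

Each tuple corresponds to one document that the batch job would be asked to process.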

<details>
    <summary><font size="3">output_location_path</font></summary>
    <font size="2">The file <code>data/output_object_document_batch.json</code> specifies where the output files will be stored. Sample files have been provided. The user needs to provide the following in this file:
        <ul>
            <li><code>namespace</code> : Namespace name</li>
            <li><code>bucket</code> : Bucket name</li>
            <li><code>prefix</code> : Prefix name</li>
        </ul>
    </font>
</details>
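
Before submitting a job, it can help to confirm that the output file contains all three keys listed above. The following is a minimal sketch under that assumption; `check_output_location` and the sample values are hypothetical and are not provided by the helper module.

```python
import json

REQUIRED_OUTPUT_KEYS = ("namespace", "bucket", "prefix")


def check_output_location(spec):
    """Return the names of required keys that are missing or empty."""
    return [key for key in REQUIRED_OUTPUT_KEYS if not spec.get(key)]


# Hypothetical file content; in the notebook this would be read from
# data/output_object_document_batch.json.
sample_output = json.loads(
    '{"namespace": "my-namespace", "bucket": "my-bucket", "prefix": "results"}'
)
missing = check_output_location(sample_output)
print("Missing keys:", missing)
```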

In [6]:
input_location_path = 'data/input_objects_document_batch.json'
output_location_path = 'data/output_object_document_batch.json'

### Authorize user config

In [7]:
config = oci.config.from_file('~/.oci/config')

### Load input and output object locations

In [8]:
compartment_id, input_location = load_input_object_locations(input_location_path)
output_location = load_output_object_location(output_location_path)

### Create AI service vision client and document job

In [9]:
ai_service_vision_client = AIServiceVisionClient(config=config)
create_document_job_details = CreateDocumentJobDetails()

document_text_detection_feature = DocumentTextDetectionFeature()
features = [document_text_detection_feature]
create_document_job_details.features = features
create_document_job_details.compartment_id = compartment_id
create_document_job_details.input_location = input_location
create_document_job_details.output_location = output_location

res = ai_service_vision_client.create_document_job(create_document_job_details=create_document_job_details)

### Job submitted
The job has been created and is in the <code>ACCEPTED</code> state.

In [10]:
res_json = json.loads(repr(res.data))
clean_res = clean_output(res_json)
JSON(clean_res)


### Job in progress
The job status is polled every 5 seconds until the job leaves the <code>ACCEPTED</code> and <code>IN_PROGRESS</code> states.

In [11]:
job_id = res.data.id
print("Job ID :", job_id, '\n')
seconds = 0
res = ai_service_vision_client.get_document_job(document_job_id=job_id)

while res.data.lifecycle_state in ["IN_PROGRESS", "ACCEPTED"]:
    print("Job is IN_PROGRESS for " + str(seconds) + " seconds")
    time.sleep(5)
    seconds += 5
    res = ai_service_vision_client.get_document_job(document_job_id=job_id)

Job ID : ocid1.aivisiondocumentjob.oc1.iad.amaaaaaa74akfsaa7svtdsbcb4cbc75bqbwzxnubfdj3oojstcmrd5vj7kta 

Job is IN_PROGRESS for 0 seconds
Job is IN_PROGRESS for 5 seconds
Job is IN_PROGRESS for 10 seconds
Job is IN_PROGRESS for 15 seconds
Job is IN_PROGRESS for 20 seconds
Job is IN_PROGRESS for 25 seconds
Job is IN_PROGRESS for 30 seconds
Job is IN_PROGRESS for 35 seconds
Job is IN_PROGRESS for 40 seconds
Job is IN_PROGRESS for 45 seconds
Job is IN_PROGRESS for 50 seconds
Job is IN_PROGRESS for 55 seconds
Job is IN_PROGRESS for 60 seconds
Job is IN_PROGRESS for 65 seconds
Job is IN_PROGRESS for 70 seconds
Job is IN_PROGRESS for 75 seconds
Job is IN_PROGRESS for 80 seconds
Job is IN_PROGRESS for 85 seconds
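
The polling loop above runs until the job leaves its pending states, which could take indefinitely if a job stalls. The sketch below shows one way to add an upper bound on the wait. `wait_for_job`, its parameters, and the simulated state sequence are illustrative assumptions, not part of the Vision SDK; the function takes any callable that returns the current state string.

```python
import time

PENDING_STATES = ("ACCEPTED", "IN_PROGRESS")


def wait_for_job(get_state, interval=5, timeout=600, sleep=time.sleep):
    """Poll get_state() until the state is no longer pending.

    Returns (final_state, seconds_waited). Raises TimeoutError once the
    accumulated wait reaches `timeout` while the job is still pending.
    """
    waited = 0
    state = get_state()
    while state in PENDING_STATES:
        if waited >= timeout:
            raise TimeoutError(f"Job still {state} after {waited} seconds")
        sleep(interval)
        waited += interval
        state = get_state()
    return state, waited


# Simulated state sequence so the sketch runs without a live job.
states = iter(["ACCEPTED", "IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
final_state, waited = wait_for_job(lambda: next(states), sleep=lambda s: None)
print(final_state, waited)
```

In this notebook, the callable could be `lambda: ai_service_vision_client.get_document_job(document_job_id=job_id).data.lifecycle_state`, which reuses the same call as the loop above.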


### Job completed
The job has completed and is in the <code>SUCCEEDED</code> state.

In [12]:
res_json = json.loads(repr(res.data))
clean_res = clean_output(res_json)
JSON(clean_res)


### Display detected text
The detected text will be displayed for a randomly selected document from the batch input.
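
The display step in this section relies on two things: the name under which each result file is stored, and the `pages` / `lines` / `text` layout of the result JSON. As a hedged sketch, both can be factored into small functions. The function names and sample values here are hypothetical, and the naming scheme is inferred from this notebook's own code rather than from service documentation.

```python
def build_output_object_name(prefix, job_id, namespace, bucket, object_name):
    """Compose the result object name: <prefix>/<job_id>/<namespace>_<bucket>_<object>.json"""
    return f"{prefix}/{job_id}/{namespace}_{bucket}_{object_name}.json"


def extract_page_texts(result):
    """Return one list of line strings per page; pages without lines give an empty list."""
    return [
        [line["text"] for line in page.get("lines", [])]
        for page in result.get("pages", [])
    ]


# Hypothetical values and a hand-written result fragment for illustration.
name = build_output_object_name(
    "results", "job-123", "my-namespace", "my-bucket", "TextDetection.pdf"
)
sample_result = {
    "pages": [
        {"lines": [{"text": "ORACLE"}, {"text": "DATABASE"}]},
        {"lines": []},
    ]
}
page_texts = extract_page_texts(sample_result)
print(name)
print(page_texts)
```

Returning the text as nested lists keeps printing separate from parsing, so the same result can also be written to a file or loaded into a DataFrame.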

In [13]:
object_storage_client = oci.object_storage.ObjectStorageClient(config)

index = randint(0, len(input_location.object_locations) - 1)
object_location = input_location.object_locations[index]

# Result objects are stored as <prefix>/<job id>/<namespace>_<bucket>_<object name>.json
output_object_name = (
    f"{output_location.prefix}/{res.data.id}/"
    f"{object_location.namespace_name}_{object_location.bucket_name}_"
    f"{object_location.object_name}"
)

res_json = object_storage_client.get_object(
    output_location.namespace_name,
    output_location.bucket_name,
    object_name=output_object_name + ".json",
).data.content
res_dict = json.loads(res_json)

print("Document :", object_location.object_name, '\n')

if 'pages' in res_dict:
    for j, page in enumerate(res_dict['pages']):
        print('**************** PAGE NO.', j+1, '****************\n')

        if len(page['lines']) == 0:
            print("No text detected.\n")
            continue

        for line in page['lines']:
            print(line['text'])
        print('\n')

Document : TextDetection.pdf 

**************** PAGE NO. 1 ****************

ORACLE
12€
DATABASE
Big Data Analytics with Oracle Advanced Analytics
Making Big Data and Analytics Simple
O R A C L E W H I T E P A P E R
|
J U L Y 2 0 1 5
ORACLE®


**************** PAGE NO. 2 ****************

Disclaimer
The following is intended to outline our general product direction. It is intended for information
purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any
material, code, or functionality, and should not be relied upon in making purchasing decisions. The
development, release, and timing of any features or functionality described for Oracle's products
remains at the sole discretion of Oracle.
BIG DATA ANALYTICS WITH ORACLE ADVANCED ANALYTICS


**************** PAGE NO. 3 ****************

Table of Contents
Disclaimer
Executive Summary: Big Data Analytics with Oracle Advanced Analytics
Big Data and Analytics-New Opportunities and New Challenges
3
Pr