<h1>Ubiquitous Octo Invention</h1>
Demonstrates semantic search over visual-feature and entity tags extracted from photos. We'll use the [JFK Assassination Records](https://www.archives.gov/research/jfk/release) as the document set.

First: how do we get the URLs out of the JFK document project's Excel spreadsheet? With [openpyxl](https://openpyxl.readthedocs.io/en/stable/tutorial.html).

<h2>Download the Documents</h2>

In [16]:
from openpyxl import load_workbook
jfkIndex = load_workbook(filename = 'jfk.xlsx')
# sheet_ranges = jfkIndex['Worksheet']
# for cell in jfkIndex['A2':'A10']:

In [4]:
type(jfkIndex['Worksheet']['A2'])

openpyxl.cell.cell.Cell

In [13]:
# A cell without a hyperlink has .hyperlink set to None, so .target raises AttributeError
url = None
try:
    url = jfkIndex['Worksheet']['A2'].hyperlink.target
except AttributeError as exc:
    print(exc)
url

'https://www.archives.gov/files/research/jfk/releases/2018/104-10196-100270001.pdf'

In [10]:
jfkIndex['Worksheet']['A2'].value

'2018/104-10196-100270001.pdf'
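The two lookups above (`.hyperlink.target` for the URL, `.value` for the relative file name) can be combined into one pass over the column. This is a minimal sketch, assuming a worksheet-like object that supports `sheet["A2"]` indexing the way openpyxl worksheets do. `iter_links` is a hypothetical helper, not part of openpyxl. It skips cells that have no hyperlink instead of raising.

```python
from typing import Iterator, Tuple


def iter_links(sheet, column: str, first_row: int, last_row: int) -> Iterator[Tuple[str, str]]:
    """Yield (cell value, hyperlink target) for each linked cell in a column.

    `sheet` is anything indexable by a cell reference such as "A2" whose
    cells expose `.value` and `.hyperlink` (openpyxl worksheets do).
    Rows first_row..last_row are visited, both ends included.
    """
    for row in range(first_row, last_row + 1):
        cell = sheet[f"{column}{row}"]
        link = cell.hyperlink
        # openpyxl leaves `hyperlink` as None when the cell has no link.
        if link is None or not link.target:
            continue
        yield cell.value, link.target
```

With the workbook loaded above, `for filename, url in iter_links(jfkIndex['Worksheet'], 'A', 2, 10): print(filename, url)` would list the linked entries in rows 2 through 10.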

In [None]:
urlBase = 'https://www.archives.gov/files/research/jfk/releases/'

In [19]:
import requests
from pathlib import Path
from utils.whoopsie import whoopsie

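def savePDF(fn, pdf):
    """Write the PDF bytes to fn, creating parent folders such as docs/2018/ first."""
    Path(fn).parent.mkdir(parents=True, exist_ok=True)
    with open(fn, 'wb') as f:
        f.write(pdf)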


In [31]:
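with requests.Session() as session:
    # .content reads the whole body into memory, so stream=True gains nothing here
    with session.get('https://www.archives.gov/files/research/jfk/releases/2018/104-10196-100270001.pdf', timeout=300) as conn:
        conn.raise_for_status()
        savePDF('docs/' + 'test.pdf', conn.content)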

In [28]:
"docs/" + jfkIndex['Worksheet']['A2'].value

'docs/2018/104-10196-100270001.pdf'

In [63]:
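with requests.Session() as session:
    # Rows 10002 through 54637: range() excludes its stop value, so stop one past the last row
    for c in range(10002, 54638):
        cell = jfkIndex['Worksheet']["A" + str(c)]
        if cell.hyperlink is None:
            whoopsie("no hyperlink in " + cell.coordinate)
            continue
        url = cell.hyperlink.target
        try:
            # Keep the request inside the try so a network error is logged instead of ending the run
            with session.get(url, timeout=300) as conn:
                conn.raise_for_status()
                savePDF("docs2/" + cell.value, conn.content)
        except Exception as exc:
            whoopsie(str(exc) + " " + url)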
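The loop above holds each PDF fully in memory before writing it. For large scans it may be gentler to stream the body to disk in chunks. This is a sketch under assumptions: `download_to` is a hypothetical helper, and `session` is expected to behave like a `requests.Session` (it is passed in rather than imported here). The chunk size and timeout defaults are arbitrary.

```python
from pathlib import Path


def download_to(session, url: str, dest: str, chunk_size: int = 1 << 16, timeout: float = 300) -> int:
    """Stream `url` to `dest` in chunks and return the number of bytes written.

    `session` is assumed to be a requests.Session or a compatible object.
    """
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)  # e.g. docs2/2018/
    written = 0
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # avoid saving an HTML error page as a .pdf
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

Inside the loop this would replace the `session.get` / `savePDF` pair, for example `download_to(session, url, "docs2/" + filename)`.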


In [62]:
# That's the end of the column
jfkIndex["Worksheet"]["A54637"].hyperlink.target

'https://www.archives.gov/files/research/jfk/releases/docid-32404866.pdf'

<h2>Trim the Text Quality Distribution</h2>
A lot of these scans are poor quality, which makes the extracted text fairly unintelligible. How can I assess the intelligibility of a document and trim the distribution of documents to, say, the best fifth?
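One crude way to approach the question is to score each extracted text by the share of its tokens that look like real words, then keep the top fifth. The sketch below is only a heuristic under stated assumptions, not a validated quality metric: the "word-like" rule (alphabetic, 2 to 20 letters, at least one vowel) and the names `intelligibility` and `best_fraction` are made up for illustration.

```python
import string
from typing import Dict, List

VOWELS = set("aeiouy")


def intelligibility(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like real words.

    A token counts as word-like when, after stripping surrounding
    punctuation, it is purely alphabetic, 2 to 20 letters long (or one of
    the single letters "a" / "i"), and contains at least one vowel.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = 0
    for token in tokens:
        core = token.strip(string.punctuation).lower()
        if core in ("a", "i"):
            wordlike += 1
        elif core.isalpha() and 2 <= len(core) <= 20 and VOWELS & set(core):
            wordlike += 1
    return wordlike / len(tokens)


def best_fraction(docs: Dict[str, str], fraction: float = 0.2) -> List[str]:
    """Return the names of the top `fraction` of documents by score."""
    ranked = sorted(docs, key=lambda name: intelligibility(docs[name]), reverse=True)
    keep = max(1, round(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]
```

Here `docs` maps a document name to its OCR text. A dictionary lookup or a language-model score would likely separate good scans from bad ones more sharply; the token rule is just cheap to run over tens of thousands of files.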

<h2>OCR the Image PDFs</h2>
The PDFs are all image PDFs! There's no text!

pdf2image (to rasterize the pages for a local OCR engine) or [Azure OCR](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/client-library?tabs=visual-studio&pivots=programming-language-python)?

<h3>OCR Remotely</h3>
Azure Computer Vision

In [14]:
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from msrest.authentication import CognitiveServicesCredentials

from array import array
import os
from PIL import Image
import sys
import time

'''
Authenticate
Authenticates your credentials and creates a client.
'''
subscription_key = ""
endpoint = ""

computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

In [4]:
'''
OCR: Read File using the Read API, extract text - remote
This example will extract text in an image, then print results, line by line.
This API call can also extract handwriting style text (not shown).

You can also read text from a local image. See the 
ComputerVisionClientOperationsMixin methods, such as read_in_stream. 
Or, see the sample code on GitHub for scenarios involving local images.


'''
print("===== Read File - remote =====")
# Get an image with text
# read_image_url = "https://raw.githubusercontent.com/MicrosoftDocs/azure-docs/master/articles/cognitive-services/Computer-vision/Images/readsample.jpg"
read_image_url = "https://www.archives.gov/files/research/jfk/releases/2018/104-10196-100270001.pdf"

# Call API with URL and raw response (allows you to get the operation location)
read_response = computervision_client.read(read_image_url,  raw=True)

===== Read File - remote =====


In [7]:
# Get the operation location (URL with an ID at the end) from the response
read_operation_location = read_response.headers["Operation-Location"]
# Grab the ID from the URL
operation_id = read_operation_location.split("/")[-1]

# Call the "GET" API and wait for it to retrieve the results 
while True:
    read_result = computervision_client.get_read_result(operation_id)
    if read_result.status not in ['notStarted', 'running']:
        break
    time.sleep(1)

# Print the detected text, line by line
if read_result.status == OperationStatusCodes.succeeded:
    for text_result in read_result.analyze_result.read_results:
        for line in text_result.lines[:10]:
            print(line.text)
            # print(line.bounding_box)
print()

JFK Assassination System
Date:
5/2/2018
Identification Form
Agency Information
AGENCY :
CIA
RECORD NUMBER :
104-10196-10027
RECORD SERIES :
CIA HISTORICAL REVIEW PROGRAM
RELEASE IN FULL 1998.
CIA ACTIVITIES AND THE WARREN COMMISSION INVESTIGATION
SUMMARY
There is a need to investigate the role of the CIA and the FBI in the
investigation of the assassination of President Kennedy, and their relation-
ships.with Lee Harvey Oswald. Since several agencies had files on Oswald which
can be checked against each other, a useful case study of CIA practices is
possible. These records should shed light on the interception of mail to Russia,
CIA concern about (and file-keeping on) domestic political activities, and other



In [15]:
def extract_doc (read_image_url, filename) : 
    # Call API with URL and raw response (allows you to get the operation location)
    read_response = computervision_client.read(read_image_url,  raw=True)
    # Get the operation location (URL with an ID at the end) from the response
    read_operation_location = read_response.headers["Operation-Location"]
    # Grab the ID from the URL
    operation_id = read_operation_location.split("/")[-1]

    # Call the "GET" API and wait for it to retrieve the results 
    while True:
        read_result = computervision_client.get_read_result(operation_id)
        if read_result.status not in ['notStarted', 'running']:
            break
        time.sleep(1)

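    # Collect the detected text, one OCR line per output line
    lines = []

    if read_result.status == OperationStatusCodes.succeeded:
        for text_result in read_result.analyze_result.read_results:
            for line in text_result.lines:
                lines.append(line.text)

    # Create folders such as extracted_docs/2018/ before writing
    os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write("\n".join(lines))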

# extract_doc(read_image_url, "extracted_docs/104-10196-100270001.txt")
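The `while True` polling loop never gives up if an operation stays in `running`. A bounded variant might look like the sketch below. `poll_until_done` is a hypothetical helper, not part of the Azure SDK, and the 120-second default is an arbitrary assumption.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def poll_until_done(
    fetch: Callable[[], T],
    get_status: Callable[[T], str],
    pending: Iterable[str] = ("notStarted", "running"),
    interval: float = 1.0,
    timeout: float = 120.0,
) -> T:
    """Call `fetch` until its status leaves `pending` or `timeout` elapses."""
    pending = tuple(pending)
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if get_status(result) not in pending:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {get_status(result)!r} after {timeout} s")
        time.sleep(interval)
```

In `extract_doc` the loop would then become `read_result = poll_until_done(lambda: computervision_client.get_read_result(operation_id), lambda r: r.status)`, and a stuck document surfaces as a `TimeoutError` that the caller can log and skip.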


In [26]:
def ocr_docs():
    # The Read API fetches each PDF from its URL itself, so nothing is downloaded locally
    for c in range(3, 4982):
        cell = jfkIndex['Worksheet']["A" + str(c)]
        if cell.hyperlink is None:
            whoopsie("no hyperlink in " + cell.coordinate)
            continue
        url = cell.hyperlink.target
        # Swap the .pdf extension for .txt
        filename = cell.value[:-4] + '.txt'
        try:
            extract_doc(url, "acs_extracted_docs/" + filename)
        except Exception as exc:
            whoopsie(str(exc) + " " + url)


In [27]:
ocr_docs()

Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/docid-32269709.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/194-10013-10448.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/144-10001-10195_docid_6606912_binary_sealed.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/144-10001-10288_docid_6607058_binary_sealed.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/144-10001-10127_docid_6606147_binary_sealed.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/files/research/jfk/releases/2018/144-10001-10176_docid_6606900_binary_sealed.pdf
Operation returned an invalid status code 'Bad Request' https://www.archives.gov/file

Aww damn! The free Computer Vision tier maxes out at 5K transactions/month! The next tier up, S1, would suffice and would cost a few tens of dollars. [Pricing - Computer Vision API](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/)
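With a monthly cap in play, a rerun should not spend transactions on documents that already have text. This is a minimal sketch, assuming the output layout used by `ocr_docs` (one `.txt` per PDF under a folder such as `acs_extracted_docs/`). `pending_docs` is a hypothetical helper. It treats an empty `.txt` as not done, since `extract_doc` writes an empty file when the operation does not succeed.

```python
from pathlib import Path
from typing import Iterable, List


def pending_docs(pdf_names: Iterable[str], out_dir: str) -> List[str]:
    """Return the PDF names that have no non-empty .txt under `out_dir` yet."""
    root = Path(out_dir)
    todo = []
    for name in pdf_names:
        # "2018/abc.pdf" maps to "<out_dir>/2018/abc.txt"
        txt = root / Path(name).with_suffix(".txt")
        if not (txt.is_file() and txt.stat().st_size > 0):
            todo.append(name)
    return todo
```

Feeding `ocr_docs` only the names returned here would let the job resume where the quota ran out instead of starting again from row 3.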