## Tesseract
Tesseract is an open-source OCR engine whose development was long sponsored by Google (https://github.com/tesseract-ocr/tesseract).  
We will call Tesseract through the pytesseract bindings from Spark to perform distributed OCR against document strings in a dataframe.  

### First, 
We develop our algorithm locally.  
The file we are using is a public-domain Shakespeare play (Julius Caesar), about 100 pages long.  
We can copy this file a number of times to put load on the system, but I can guarantee that without a real cluster behind this Spark context you will wish you hadn't.  

In [1]:
import pytesseract
import pdf2image
import time
import os
import re

In [2]:
#! rm /home/bsavoy/Downloads/ocr_docs/copy_shakespeare* 
! curl -O https://shakespeare.folger.edu/downloads/pdf/julius-caesar_PDF_FolgerShakespeare.pdf 
! seq 8 | xargs -I XY cp ./julius-caesar_PDF_FolgerShakespeare.pdf ./copy_shakespeareXY.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  603k    0  603k    0     0   459k      0 --:--:--  0:00:01 --:--:--  459k


In [3]:
# Full paths of every PDF in the top level of the working directory
cwd = os.getcwd()
fnames = [os.path.join(cwd, f) for f in next(os.walk(cwd))[2] if f.endswith('.pdf')]

In [4]:
fnames

['/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare1.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare4.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/julius-caesar_PDF_FolgerShakespeare.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare7.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare3.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare2.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare8.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare5.pdf',
 '/home/bsavoy/pytesseract_on_pyspark/copy_shakespeare6.pdf']

In [None]:
content = pdf2image.convert_from_bytes(open(fnames[0],'rb').read())
doc_string = ''
start = time.time()
for page in content:
    doc_string += pytesseract.image_to_string(page)
    
elapsed = time.time() - start
print('took {}s'.format(elapsed))

## Hmm, that's long for one document
I don't see that scaling well beyond a handful of documents at a time; it might do for an ad-hoc process.  
Performance may be better if we multiprocess it. pytesseract is a wrapper around the tesseract executable, after all, and sub-processes also sidestep Python's global interpreter lock (GIL).  
We should be able to get some benefit by forking more external processes.

In [5]:
def do_ocr(fname):
    # Render the given PDF into one image per page, then OCR each page
    with open(fname, 'rb') as f:
        content = pdf2image.convert_from_bytes(f.read())
    doc_string = ''
    for page in content:
        doc_string += pytesseract.image_to_string(page)
    return doc_string

In [None]:
from multiprocessing import Pool
data = []
# My VM has 8 CPU cores available from the host (4 physical, 8 logical)
for dop in range(2,8):  # dop = degree of parallelism (number of worker processes)
    with Pool(dop) as p:
        start = time.time()
        p.map(do_ocr, fnames)
        elapsed = time.time() - start
        print('dop={} files={} seconds={}'.format(dop,len(fnames),elapsed))
        data.append((dop, len(fnames), elapsed))  # append takes a single item, so store a tuple


## Conclusion
The benefit of parallelism drops off before we even saturate all logical cores, so the optimum is at most 4 concurrent tesseract processes (the number of physical cores on this host);
beyond that, performance starts to degrade.
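
One way to make that drop-off visible is to turn the timings into speedup and parallel efficiency figures. The helper below is only a sketch: `scaling_report` is a hypothetical name, and it assumes `data` holds `(dop, files, seconds)` tuples with the lowest `dop` first.

```python
def scaling_report(data):
    """Summarise (dop, n_files, seconds) timings relative to the first entry."""
    base_dop, _, base_seconds = data[0]
    report = []
    for dop, n_files, seconds in data:
        speedup = base_seconds / seconds  # how much faster than the baseline run
        ideal = dop / base_dop            # speedup we would see with perfect scaling
        report.append({
            'dop': dop,
            'files_per_second': n_files / seconds,
            'speedup': speedup,
            'efficiency': speedup / ideal,  # 1.0 means perfect scaling
        })
    return report
```

Printing each row of `scaling_report(data)` shows the `dop` at which efficiency starts to fall away from 1.0.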

We also have the major disadvantage that this solution MUST run on a single node.  
We could split the workload across a number of hosts, but that brings orchestration and scheduling headaches.    
Also, what if you wanted to work with this data coherently through a dataframe API of some sort?  
Spark helps with these problems:  
* Spark can handle arbitrarily large datasets by natively splitting data across a cluster   
* Spark can apply transformations in a distributed fashion and return a single coherent dataset from those transformations  
* The workload can scale out to N nodes: we tune executors for the individual hosts we plan to have in the cluster and let Spark loose to schedule the work

In [None]:
# start = time.time()
# for fname in fnames:
#     print(list(fnames))
#     print('Convert file={}'.format(fname))
#     content = pdf2image.convert_from_bytes(open(fname,'rb').read())
#     doc_string = ''
#     for page in content:
#         doc_string += pytesseract.image_to_string(page)
# print('took {} seconds'.format(time.time() - start))

## Next,
* We will package our OCR logic into a function that PySpark will pickle and distribute to worker nodes, eerily similar to the do_ocr function above.    
* In our driver program, we will instruct Spark to read the PDFs as binary data into an RDD  
* We will use the built-in RDD map function to apply our OCR function to the binary content of the PDF in each element  
* When execution is complete, we will have a new RDD consisting of the text parsed by tesseract, along with some telemetry and error handling (for troublesome PDFs, or PDFs that, well, aren't PDFs)
* Finally, we can simply convert the RDD to a dataframe and use Spark SQL API functions against our OCR'd text

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.getOrCreate()

In [3]:
an_rdd = spark.sparkContext.parallelize(range(10))

In [4]:
an_rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]:
def do_pytesseract(record):
    import socket
    import time
    
    file_name = record[0]
    binary_content = record[1]
    start_time = time.time()
    try: 
        ocr_text = do_ocr(binary_content)
    except Exception as e:
        ocr_text  = str(e)
    final_time = time.time() - start_time
    
    return [socket.gethostname(), file_name, ocr_text, str(final_time) + 's']

def do_ocr(binary_content):
    from pdf2image import convert_from_bytes
    import pytesseract
    
    content = convert_from_bytes(binary_content)
    doc_string = ''
    for page in content:
        doc_string += pytesseract.image_to_string(page)
        
    return doc_string
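
The driver-side steps could be wired together roughly as follows. This is a sketch rather than the notebook's final code: `run_ocr_job`, `OCR_COLUMNS`, and the column names are hypothetical, and it assumes an active SparkSession plus PDFs stored somewhere every worker can read.

```python
# Hypothetical column names for the four values each record function returns
OCR_COLUMNS = ['host', 'file_name', 'ocr_text', 'elapsed']


def run_ocr_job(spark, pdf_glob, record_fn):
    """Read PDFs as (path, bytes) pairs, OCR them on the workers, return a dataframe."""
    # binaryFiles yields one (path, content) pair per file
    raw_rdd = spark.sparkContext.binaryFiles(pdf_glob)
    # record_fn runs on the executors; tuples convert cleanly into dataframe rows
    ocr_rdd = raw_rdd.map(record_fn).map(tuple)
    return ocr_rdd.toDF(OCR_COLUMNS)
```

Calling `run_ocr_job(spark, pdf_glob, do_pytesseract)` would return one row per PDF, where `pdf_glob` is a path pattern matching the PDFs. Nothing is OCR'd until an action such as `show()` or `count()` runs on the result.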
    

## Last,
In the program's current state, if a worker picks up a large document, the rest of the cluster may finish its documents while that one document keeps the stage open, bottlenecking the job and wasting compute.  
It is more efficient to distribute the pages of each document rather than entire documents, then coalesce the pages back together, in order, per document.  
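
A minimal sketch of that page-level approach, under some assumptions: the function names are hypothetical, pages are shipped between workers as PNG bytes, and the telemetry and error handling from do_pytesseract are left out for brevity. Each PDF is still rendered to images on a single worker; only the OCR step is spread out.

```python
import io


def explode_pages(record):
    """(path, pdf_bytes) -> one (path, (page_no, png_bytes)) pair per page."""
    from pdf2image import convert_from_bytes  # imported on the worker

    path, pdf_bytes = record
    pairs = []
    for page_no, page in enumerate(convert_from_bytes(pdf_bytes)):
        buf = io.BytesIO()
        page.save(buf, format='PNG')  # plain bytes pickle cleanly for the shuffle
        pairs.append((path, (page_no, buf.getvalue())))
    return pairs


def ocr_page(page_record):
    """(page_no, png_bytes) -> (page_no, text)."""
    import pytesseract
    from PIL import Image

    page_no, png_bytes = page_record
    return page_no, pytesseract.image_to_string(Image.open(io.BytesIO(png_bytes)))


def assemble_document(pages):
    """Join (page_no, text) pairs into one string, ordered by page number."""
    return ''.join(text for _, text in sorted(pages, key=lambda p: p[0]))


def ocr_by_page(raw_rdd, num_partitions):
    """raw_rdd holds (path, pdf_bytes) pairs, e.g. from sparkContext.binaryFiles."""
    return (raw_rdd
            .flatMap(explode_pages)
            .repartition(num_partitions)  # spread individual pages across the cluster
            .mapValues(ocr_page)
            .groupByKey()                 # gather each document's pages again
            .mapValues(assemble_document))
```

The explicit `repartition` matters here: without it, `flatMap` and `mapValues` run in the same task, so all pages of a document would still be OCR'd by the worker that rendered it.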

In [None]:
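import os

# Read every PDF in the working directory into a dataframe with columns
# path, modificationTime, length and content (the raw bytes).
# The binaryFile data source needs Spark 3.0+; on older versions use spark.sparkContext.binaryFiles(path)
raw_df = spark.read.format('binaryFile').option('pathGlobFilter', '*.pdf').load(os.getcwd())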