## Preprocess EEBO-TCP

This notebook turns all of EEBO-TCP (60,300 files) into pickled representations of those files, which are later used for matching.

The logic for processing one file is in the function preprocess_one_file in matching_functions.py.  Because so much logic is offloaded to matching_functions.py, this process looks simpler than it really is.

In practice, this notebook is just a way to queue up files for processing and to start multiple workers that work through the queue.
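
For orientation, "shingling" usually means breaking a text into overlapping runs of a fixed number of consecutive tokens, which is presumably what SHINGLE_LENGTH controls. The sketch below illustrates that general technique only; it is an assumption, not the contents of preprocess_one_file, and the name `word_shingles` is hypothetical.

```python
def word_shingles(tokens, shingle_length=3):
    """Return every run of `shingle_length` consecutive tokens as a tuple.

    Hypothetical helper for illustration only; the real logic lives in
    preprocess_one_file in matching_functions.py and may differ.
    """
    return [tuple(tokens[i:i + shingle_length])
            for i in range(len(tokens) - shingle_length + 1)]


tokens = 'of mans first disobedience and the fruit'.split()
shingles = word_shingles(tokens, 3)
print(shingles[0])    # ('of', 'mans', 'first')
print(len(shingles))  # 5
```

A text shorter than the shingle length yields an empty list.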

In [None]:
import time, glob, os
import pickle
from queue import Queue
from threading import Thread
from matching_functions import *

INPUT_FOLDER = '/home/spenteco/0/eebo_adorned/'
OUTPUT_FOLDER = '/home/spenteco/0/eebo_shingled/'
SHINGLE_LENGTH = 3
NUM_WORKER_THREADS = 6

### Define a Worker

A worker reads a task from the queue, parses it, and then calls code in matching_functions.py, where the actual processing occurs.  When it's done, it marks the task as done, and then gets the next task.  And so on, until the queue is empty.
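
The get, process, mark-done cycle can be tried out on toy tasks before pointing it at real files. The following is a minimal sketch of the pattern, not the worker used in this notebook: `run_workers` and `handle_task` are hypothetical names, and squaring a number stands in for the real per-file work. It uses `get_nowait()`, so a worker that finds the queue empty simply exits.

```python
from queue import Queue, Empty
from threading import Thread


def run_workers(tasks, handle_task, num_workers=3):
    """Process every task with a pool of threads and return the results.

    `handle_task` is a stand-in for the real per-file work.
    """
    task_queue = Queue()
    for task in tasks:
        task_queue.put(task)

    results = []

    def worker():
        while True:
            try:
                task = task_queue.get_nowait()
            except Empty:
                return  # nothing left to do
            try:
                # list.append is thread-safe in CPython
                results.append(handle_task(task))
            finally:
                task_queue.task_done()  # always mark the task done

    threads = [Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for t in threads:
        t.start()

    task_queue.join()  # returns once every task has been marked done
    return results


print(sorted(run_workers(range(5), lambda n: n * n)))  # [0, 1, 4, 9, 16]
```

Results arrive in whatever order the threads finish, which is why the example sorts them before printing.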

In [1]:
def shingle_worker(worker_number):
    
    # Imported here so this cell does not depend on the import cell.
    from queue import Empty
    
    while True:
        
        # get_nowait() avoids blocking forever if another worker takes the
        # last task between a q.empty() check and q.get().
        try:
            body = q.get_nowait()
        except Empty:
            break
        
        result_message = None
        
        try:
            # A task is 'input path|output folder|shingle length'.
            PATH_TO_INPUT_FILE, OUTPUT_FOLDER, SHINGLE_LENGTH = body.split('|')
            TCP_ID = PATH_TO_INPUT_FILE.split('/')[-1].split('.')[0]
            PATH_TO_OUTPUT_FILE = OUTPUT_FOLDER + TCP_ID + '.pickle'
            
            # Skip files that an earlier run has already pickled.
            if not os.path.isfile(PATH_TO_OUTPUT_FILE):
                
                file_data = preprocess_one_file(PATH_TO_INPUT_FILE, int(SHINGLE_LENGTH))
                
                with open(PATH_TO_OUTPUT_FILE, 'wb') as f:
                    pickle.dump(file_data, f)
                
        except etree.XMLSyntaxError:
            result_message = 'ERROR XMLSyntaxError ' + PATH_TO_INPUT_FILE
            
        finally:
            # Mark the task done even on failure, so q.join() cannot hang.
            q.task_done()
        
        if result_message is not None:
            print(result_message + '\n\n')

### Load up the Queue

One item per EEBO-TCP file.  An item is the same as one task for a worker.
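
Each task in this notebook is a single string: the input path, the output folder, and the shingle length, joined by `|`. The sketch below shows how such a string can be built and unpacked again. The helper names `make_task` and `parse_task`, the paths, and the file name `A00001.xml` are all made up for illustration.

```python
import os


def make_task(path_to_input_file, output_folder, shingle_length):
    """Join the three fields of a task into one '|'-separated string."""
    return path_to_input_file + '|' + output_folder + '|' + str(shingle_length)


def parse_task(body):
    """Split a task string into its fields, plus the file ID and output path."""
    path_to_input_file, output_folder, shingle_length = body.split('|')
    # The ID is the file name up to its first dot.
    tcp_id = os.path.basename(path_to_input_file).split('.')[0]
    return {
        'input': path_to_input_file,
        'shingle_length': int(shingle_length),
        'tcp_id': tcp_id,
        'output': output_folder + tcp_id + '.pickle',
    }


task = make_task('/data/in/A00001.xml', '/data/out/', 3)
print(task)                        # /data/in/A00001.xml|/data/out/|3
print(parse_task(task)['output'])  # /data/out/A00001.pickle
```

This format assumes that neither path contains a `|` character; putting tuples on the queue instead of strings would remove that restriction.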

In [None]:
q = Queue()

for path_to_file in glob.glob(INPUT_FOLDER + '*.xml'):
    q.put(path_to_file + '|' + OUTPUT_FOLDER + '|' + str(SHINGLE_LENGTH))

### Start a bunch of workers

I've been starting 6 on my fairly generic Dell workstation.  These can process EEBO-TCP in about four hours.

The workers seem to be limited by the speed of the local disk.  I expect I could get it to run faster if I put the files on an SSD.
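
The figures above imply a rough throughput, which is handy as a baseline when comparing disks or worker counts. This is back-of-the-envelope arithmetic only: "about four hours" is approximate, and the per-worker figure assumes the six workers share the load evenly.

```python
num_files = 60300     # files in EEBO-TCP, as stated at the top of this notebook
elapsed_hours = 4     # approximate run time reported above
num_workers = 6

files_per_second = num_files / (elapsed_hours * 3600)
seconds_per_file_per_worker = num_workers / files_per_second

print(round(files_per_second, 2))             # 4.19
print(round(seconds_per_file_per_worker, 2))  # 1.43
```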

In [None]:
start_time = time.time()
    
for a in range(NUM_WORKER_THREADS):
    t = Thread(target=shingle_worker, args=(a,))
    t.daemon = True
    t.start()

q.join()

print('time:', (time.time() - start_time))

print('Done!')