## Find Bible Quotation

Search all of EEBO-TCP for places where some version of the Bible is quoted.  Save the results in a useful form for later analysis.

Here, I'm using three Bibles: the EEBO-TCP transcriptions of the Geneva Bible and of the King James Version, and an Oxford Text Archive copy of the KJV.

Most of the logic for this process is packed into matching_functions.py, so this notebook is more or less just a method for queuing up matching tasks, and for starting workers to handle those tasks.

The process will match three Bibles, verse by verse, against all of EEBO-TCP in three or four hours on a fairly generic Dell workstation.

### Magic Number(s)

It's my habit to try to put all constant declarations and imports at the top of notebooks.  Usually, I don't comment on them.

Here, however, a word about **MAX_GAP_ALLOWED**: quotation finding works by finding matching three-lemma sequences, and then by merging overlapping matches.  For example, we may have matching lemma sequences with starting and ending locations like

    [10, 12], [11, 13], [12, 14], [13, 15]
    
which we'd want to merge as

    [10, 15]

However, there's enough noise in the data (gaps, slightly different phrasing, misquotation) that we need some way to keep that noise from causing us to overlook actual overlap. For example, we might have matches like

    [10, 12], [13, 15]

which, because they do not overlap, wouldn't merge.  **MAX_GAP_ALLOWED** specifies how large a gap between two matches we're willing to tolerate and still merge them.
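
The merging idea can be sketched in a few lines. This is only an illustration, not the code in matching_functions.py: the function name `merge_matches` is hypothetical, and I'm assuming the gap is measured from the end of one match to the start of the next.

```python
def merge_matches(spans, max_gap):
    """Merge [start, end] spans whose gap is at most max_gap positions."""
    merged = []
    for start, end in sorted(spans):
        # Assumed gap definition: distance from the previous span's end
        # to this span's start (zero or negative when they touch or overlap).
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

overlapping = [[10, 12], [11, 13], [12, 14], [13, 15]]
disjoint = [[10, 12], [13, 15]]

print(merge_matches(overlapping, max_gap=0))  # the overlapping case
print(merge_matches(disjoint, max_gap=0))     # no tolerance: stays split
print(merge_matches(disjoint, max_gap=5))     # with tolerance: merges
```

Under that assumption, the overlapping matches collapse to a single span, while the two non-overlapping matches only merge once some gap is tolerated.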

In [1]:
import time, pickle, glob, os
from queue import Queue
from threading import Thread
from matching_functions import *

BIBLE_SHINGLE_FOLDER = '/home/spenteco/text_reuse.HOME/bible_pickles/'
EEBO_SHINGLE_FOLDER = '/home/spenteco/0/eebo_shingled/'
RESULTS_FOLDER = '/home/spenteco/0/bible_matches/'

NUM_WORKER_THREADS = 6
MAX_GAP_ALLOWED = 5

### Define the Worker Function

Really, not much to see here: it's mostly input and output wrapped around a call to the match_two_files function found in matching_functions.py.

A couple of things to note:

1.  Workers share a global variable, BIBLE_PICKLE_DATA, which is a list of the verses in each of the three versions of the Bible this runs against (~99,000 verses, ~33,000 from each of three bibles), each verse of which is treated as a separate file.

2.  When I use the match_two_files function to compare two full EEBO-TCP texts, I pass in a MIN_MATCH_LENGTH variable which specifies how long the merged lemma sequences should be in order to be considered a match.  Here, I do the same; however, for each verse, I compute a MIN_MATCH_LENGTH value appropriate for the length of the verse (I wouldn't want a MIN_MATCH_LENGTH value longer than the verse!).
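
To make the per-verse threshold concrete, here is a small sketch. The key names ('reference', 'tokens', 'non_space_lemmas') are the ones the worker below reads; the values are made up for illustration, and the helper name `min_match_length` is hypothetical. I'm also assuming that 'tokens' includes the spaces, since the worker rebuilds the verse text by joining them with an empty string.

```python
def min_match_length(verse, fraction=0.75):
    """Return the per-verse threshold: a fraction of the verse's lemma count."""
    # int() truncates, so the threshold never exceeds the verse length.
    return int(len(verse['non_space_lemmas']) * fraction)

# A made-up verse record, shaped like the entries the worker expects.
verse = {
    'reference': 'hypothetical reference',
    'tokens': ['In', ' ', 'the', ' ', 'beginning'],
    'non_space_lemmas': ['in', 'the', 'beginning'],
}

print(min_match_length(verse))
print(min_match_length({'non_space_lemmas': ['x'] * 10}))
print(min_match_length({'non_space_lemmas': ['x']}))
```

Note that truncation makes the threshold zero for a one-lemma record, so very short verses may deserve a second look in later analysis.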

In [None]:
BIBLE_PICKLE_DATA = []
all_the_results = []

def find_quotation_worker(worker_number):
    
    while not q.empty():
        
        result_message = None
        
        start_time = time.time()
        
        path_to_eebo_pickle = q.get()
        
        if os.path.isfile(RESULTS_FOLDER + path_to_eebo_pickle.split('/')[-1]):
            pass
        else:
        
            worker_start_time = time.time()
        
            eebo_data = load_pickle_file(path_to_eebo_pickle)

            all_verses_results = []

            for bible_verse in BIBLE_PICKLE_DATA:

                bible_file = bible_verse[0]
                verse_reference = bible_verse[1]['reference']
                    
                MIN_MATCH_LENGTH = int(len(bible_verse[1]['non_space_lemmas']) * 0.75)
        
                results = match_two_files(bible_verse[1], 
                                            eebo_data,
                                            MAX_GAP_ALLOWED, MIN_MATCH_LENGTH,
                                            return_match_offsets=True)

                for r in results:
                    all_verses_results.append([bible_file, verse_reference, 
                                               ''.join(bible_verse[1]['tokens']), r])
            
            with open(RESULTS_FOLDER + path_to_eebo_pickle.split('/')[-1], 'wb') as f:
                pickle.dump(all_verses_results, f)
                    
            worker_stop_time = time.time()
        
        q.task_done()
        
        if result_message is not None:
            print(result_message + '\n\n')

### Load up the Queue

The task queue is just a list of files in EEBO-TCP.

I also read Bibles and load BIBLE_PICKLE_DATA.

In [None]:
q = Queue()

for path_to_eebo in glob.glob(EEBO_SHINGLE_FOLDER + '*.pickle'):
    q.put(path_to_eebo)
    
for path_to_bible in glob.glob(BIBLE_SHINGLE_FOLDER + '*.pickle'):
    for v in load_pickle_file(path_to_bible):
        
        BIBLE_PICKLE_DATA.append([path_to_bible, v])
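
For readers who want to try the queue-and-workers pattern in isolation, here is a minimal sketch with dummy tasks standing in for the EEBO-TCP pickles. The names `run_workers` and `handle_task` are hypothetical. It differs from the notebook's worker in one respect: it takes tasks with `get_nowait()` and catches `Empty`, so a worker cannot end up waiting on a queue that another worker has just emptied.

```python
from queue import Queue, Empty
from threading import Thread, Lock

def run_workers(tasks, handle_task, num_workers=3):
    """Drain `tasks` with `num_workers` threads; return the collected results."""
    q = Queue()
    for task in tasks:
        q.put(task)

    results = []
    results_lock = Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()   # never blocks, unlike q.get()
            except Empty:
                return                  # queue drained: let the thread exit
            try:
                outcome = handle_task(task)
                with results_lock:
                    results.append(outcome)
            finally:
                q.task_done()           # always signal, even if handle_task raised

    threads = [Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()                            # wait until every task is marked done
    return results

# Dummy "work": upper-case some made-up file names.
demo = run_workers(['A00001.pickle', 'A00002.pickle', 'A00003.pickle'], str.upper)
print(sorted(demo))
```

The order in which results arrive depends on thread scheduling, which is why the demo sorts them before printing.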

### Start Workers

In [None]:
start_time = time.time()
            
    
for a in range(NUM_WORKER_THREADS):
    t = Thread(target=find_quotation_worker, args=(a,))
    t.daemon = True
    t.start()

q.join()

print('time:', (time.time() - start_time))

print('Done!')
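
Once the run finishes, each file in RESULTS_FOLDER holds a pickled list of `[bible_file, verse_reference, verse_text, match]` entries, as written by the worker. A sketch of reading them back for later analysis follows. The function name `load_all_results` and the dictionary keys are my own choices, and the contents of `match` depend on what match_two_files returns, so this treats it as opaque.

```python
import glob
import os
import pickle

def load_all_results(results_folder):
    """Flatten every per-text results pickle into one list of dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(results_folder, '*.pickle'))):
        with open(path, 'rb') as f:
            for bible_file, verse_reference, verse_text, match in pickle.load(f):
                rows.append({
                    'eebo_file': os.path.basename(path),  # which EEBO-TCP text
                    'bible_file': bible_file,             # which Bible pickle
                    'verse_reference': verse_reference,
                    'verse_text': verse_text,
                    'match': match,                       # opaque: from match_two_files
                })
    return rows

# Usage, assuming the RESULTS_FOLDER constant defined at the top of the notebook:
# rows = load_all_results(RESULTS_FOLDER)
# print(len(rows), 'matches loaded')
```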