## Tests for matching_functions, part I

"Test" is perhaps not the exact term for what happens in this notebook; it would be more accurate to say that the notebook demonstrates how morphadorned EEBO-TCP data is transformed into data suitable for finding a certain kind of text reuse.

The goal is to create for each EEBO-TCP file (or, in the case of Biblical texts, for each verse) data like:

    {'tokens': ['25', ' ', '¶', ' ', 'And', ' ', 'Adam', ' ', '•new', ' ', 'his', ' ', 'wife', ' ', 'again',
        ',', ' ', 'and', ' ', 'she', ' ', '〈◊〉', ' ', 'a', ' ', 'son', ',', ' ', 'and', ' ', 'called', ' ',
        'his', ' ', 'name', ' ', 'Seth', ':', ' ', 'For', ' ', 'God', ',', ' ', 'said', ' ', 'she', ',', ' ',
        'hath', ' ', 'appo•••••', ' ', 'me', ' ', 'another', ' ', 'seed', ' ', 'in', ' ', 'stead', ' ', 'of', 
        ' ', 'Abel', ',', ' ', 'whom', ' ', 'Cain', ' ', 'slew', '.', '8', ' ', 'And', ' ', 'Abraham', ' ',
        'said', ',', ' ', 'My', ' ', 'son', ',', ' ', 'God', ' ', 'will', ' ', 'provide', ' ', 'himself', ' ',
        'a', ' ', 'lamb', ' ', 'for', ' ', 'a', ' ', 'burnt-offering', ':', ' ', 'so', ' ', 'they', ' ',
        'went', ' ', 'both', ' ', 'of', ' ', 'them', ' ', 'together', '.'], 
    'lemmas': [' ', ' ', ' ', ' ', ' ', ' ', 'adam', ' ', ' ', ' ', ' ', ' ', 'wife', ' ', ' ', ' ', ' ', 
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'son', ' ', ' ', ' ', ' ', 'call', ' ', ' ', ' ', 'name', 
        ' ', 'seth', ' ', ' ', ' ', ' ', 'god', ' ', ' ', 'say', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 
        ' ', ' ', 'another', ' ', 'seed', ' ', ' ', ' ', 'stead', ' ', ' ', ' ', 'abel', ' ', ' ', ' ', 
        ' ', 'cain', ' ', 'slay', ' ', ' ', ' ', ' ', ' ', 'abraham', ' ', 'say', ' ', ' ', ' ', ' ', 'son', 
        ' ', ' ', 'god', ' ', ' ', ' ', 'provide', ' ', ' ', ' ', ' ', ' ', 'lamb', ' ', ' ', ' ', ' ', ' ', 
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'go', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'together', ' '], 
    'non_space_lemmas': ['adam', 'wife', 'son', 'call', 'name', 'seth', 'god', 'say', 'another', 'seed',
        'stead', 'abel', 'cain', 'slay', 'abraham', 'say', 'son', 'god', 'provide', 'lamb', 'go', 'together'], 
    'offsets': [6, 12, 25, 30, 34, 36, 41, 44, 55, 57, 61, 65, 70, 72, 78, 80, 85, 88, 92, 98, 111, 119], 
    'shingles': {('adam', 'wife', 'son'): [[0, 2]], ('wife', 'son', 'call'): [[1, 3]], 
        ('son', 'call', 'name'): [[2, 4]], ('call', 'name', 'seth'): [[3, 5]], 
        ('name', 'seth', 'god'): [[4, 6]], ('seth', 'god', 'say'): [[5, 7]], 
        ('god', 'say', 'another'): [[6, 8]], ('say', 'another', 'seed'): [[7, 9]], 
        ('another', 'seed', 'stead'): [[8, 10]], ('seed', 'stead', 'abel'): [[9, 11]], 
        ('stead', 'abel', 'cain'): [[10, 12]], ('abel', 'cain', 'slay'): [[11, 13]], 
        ('cain', 'slay', 'abraham'): [[12, 14]], ('slay', 'abraham', 'say'): [[13, 15]], 
        ('abraham', 'say', 'son'): [[14, 16]], ('say', 'son', 'god'): [[15, 17]], 
        ('son', 'god', 'provide'): [[16, 18]], ('god', 'provide', 'lamb'): [[17, 19]], 
        ('provide', 'lamb', 'go'): [[18, 20]], ('lamb', 'go', 'together'): [[19, 21]]}}
        
The data is a dictionary with 5 key-value pairs:

* **tokens** contains the sequence of words (original spelling), spaces and punctuation from the corresponding morphadorned EEBO-TCP file; we preserve it so that, when we find a match between two texts, we can reconstruct the full passages.
* **lemmas** contains the sequence of lemmas that correspond to the tokens; certain lemmas (stop words, punctuation, numbers, lemmas made up partially or entirely of non-Latin alphabets) are replaced with spaces.
* **non_space_lemmas** contains the same data as lemmas, minus the spaces.
* **offsets** contains the position in lemmas and tokens of every entry in non_space_lemmas.  E.g. "adam", the first entry in non_space_lemmas, is at position 6 in tokens (counting from 0), and "wife" is at position 12.
* **shingles** contains n-grams from non_space_lemmas, each with a list of its starting and stopping positions in offsets.  E.g., ('adam', 'wife', 'son') starts at 0 in offsets and stops at 2.  This makes it possible to work backward from a shingle to a fuller passage in the original text.  If we start here:

    ('adam', 'wife', 'son'): [[0, 2]]
    
The value at position 0 in offsets is 6, which points to a starting position in tokens ("Adam").  The value at position 2 in offsets is 25, which points to an ending position in tokens ("son").  Everything between the two, endpoints included, is the original text that corresponds to the shingle:

    'Adam', ' ', '•new', ' ', 'his', ' ', 'wife', ' ', 'again', ',', ' ', 'and', ' ', 'she', ' ', '〈◊〉',
    ' ', 'a', ' ', 'son'
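The lookup just described can be traced in a few lines. This is a minimal sketch that uses only the first 26 tokens and the first three offsets from the example data; the helper name `passage_for_shingle` is hypothetical and is not part of matching_functions.

```python
# First 26 tokens and first three offsets, copied from the example data.
tokens_head = ['25', ' ', '¶', ' ', 'And', ' ', 'Adam', ' ', '•new', ' ', 'his',
               ' ', 'wife', ' ', 'again', ',', ' ', 'and', ' ', 'she', ' ',
               '〈◊〉', ' ', 'a', ' ', 'son']
offsets_head = [6, 12, 25]


def passage_for_shingle(tokens, offsets, start, stop):
    """Return the original text spanned by a shingle's [start, stop] pair.

    Hypothetical helper: start and stop index into offsets, whose values
    in turn index into tokens. The slice includes the last token.
    """
    first_token = offsets[start]
    last_token = offsets[stop]
    return ''.join(tokens[first_token:last_token + 1])


passage = passage_for_shingle(tokens_head, offsets_head, 0, 2)
print(passage)
```

Joining the slice rather than returning the list gives a readable passage, damaged characters and gap markers included.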
    
It's a lot of bother, of course, but it's worth it: comparing two texts' shingles is ridiculously fast, which is what makes the whole "find all the Bible quotations" thing possible.
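To make the speed claim concrete: because each text's shingles are dictionary keys, finding the n-grams two texts share can be reduced to a set intersection over hashed keys. The sketch below assumes that is roughly how a comparison might work; the function name `shared_shingles` and the second text's data are invented for illustration and are not taken from matching_functions.

```python
def shared_shingles(shingles_a, shingles_b):
    """Map each n-gram found in both texts to its positions in each text.

    Illustrative only: dict key views support set intersection, so no
    pairwise scan of the two texts is needed.
    """
    return {ngram: (shingles_a[ngram], shingles_b[ngram])
            for ngram in shingles_a.keys() & shingles_b.keys()}


# Two shingles copied from the example data ...
verse_shingles = {('adam', 'wife', 'son'): [[0, 2]],
                  ('god', 'provide', 'lamb'): [[17, 19]]}
# ... and an invented second text that reuses one of them.
other_shingles = {('god', 'provide', 'lamb'): [[40, 42]],
                  ('lamb', 'take', 'away'): [[41, 43]]}

matches = shared_shingles(verse_shingles, other_shingles)
print(matches)
```

Each match carries the offset positions for both texts, which is what allows the fuller passages to be recovered afterward.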

In [2]:
from matching_functions import *

### Metadata

How we load and search metadata in these notebooks . . . 

In [3]:
metadata = load_metadata('metadata/EEBO_metadata.tsv')

for k, v in metadata.items():
    if 'Herrick' in v['author'] and v['year'] == '1648':
        print(k, v)

A43441 {'year': '1648', 'author': 'Herrick, Robert, 1591-1674.|Marshall, William, fl. 1617-1650.', 'title': 'Hesperides, or, The works both humane & divine of Robert Herrick, Esq.'}
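For readers without matching_functions at hand, here is one way such a loader could work. It is a sketch under assumptions: the column names (`id`, `year`, `author`, `title`) and the tab-separated layout with a header row are guesses based on the output above, and `load_metadata_sketch` is a hypothetical stand-in, not the real `load_metadata`.

```python
import csv
import io


def load_metadata_sketch(tsv_file):
    """Read a TSV with a header row into {id: {year, author, title}}.

    Assumes (not confirmed by the source) columns named id, year,
    author and title.
    """
    reader = csv.DictReader(tsv_file, delimiter='\t')
    return {row['id']: {'year': row['year'],
                        'author': row['author'],
                        'title': row['title']}
            for row in reader}


# One-row stand-in for the real file, built from the record printed above.
sample_tsv = io.StringIO(
    'id\tyear\tauthor\ttitle\n'
    'A43441\t1648\t'
    'Herrick, Robert, 1591-1674.|Marshall, William, fl. 1617-1650.\t'
    'Hesperides, or, The works both humane & divine of Robert Herrick, Esq.\n'
)
metadata_sketch = load_metadata_sketch(sample_tsv)

# Same search as in the cell above; years are kept as strings.
hits = [k for k, v in metadata_sketch.items()
        if 'Herrick' in v['author'] and v['year'] == '1648']
print(hits)
```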


### Select useful lemma

In the morphadorned output, lemma attributes can contain all sorts of things: gap markers, words written in non-Latin alphabets, astrological symbols, etc.  We want to filter all of that out.

In [4]:
print('tree', is_lemma_valid('tree'))
print('Tree', is_lemma_valid('Tree'))
print(None, is_lemma_valid(None))
print('Tree!', is_lemma_valid('Tree!'))

tree True
Tree True
None False
Tree! False
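A filter with the behaviour shown above could be as simple as a letters-only check. The sketch below is an assumption about the rule, not the actual body of `is_lemma_valid`: it treats a lemma as valid when it is a non-empty string made up entirely of unaccented Latin letters. The name `is_lemma_valid_sketch` is hypothetical.

```python
import re

# Assumption: a "valid" lemma is one or more unaccented Latin letters.
LATIN_LETTERS_ONLY = re.compile(r'[A-Za-z]+')


def is_lemma_valid_sketch(lemma):
    """Return True for a non-empty string of Latin letters, else False."""
    return (isinstance(lemma, str)
            and LATIN_LETTERS_ONLY.fullmatch(lemma) is not None)


# The four cases from the cell above, plus a Greek word and a gap marker.
for candidate in ['tree', 'Tree', None, 'Tree!', 'λόγος', '〈◊〉']:
    print(candidate, is_lemma_valid_sketch(candidate))
```

Under this rule a hyphenated form would also be rejected; whether the real function behaves the same way is not something this sketch can confirm.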


### get_one_file . . . 

Demonstrate the process for reading a morphadorned EEBO-TCP file and performing the first set of transformations on it.
 
     get_tokens_for_iterator
     select_token_and_lemma
     is_lemma_valid
     
### The test file contains 121 . . . 

 . . . words, spaces, and punctuation marks.

In [5]:
!egrep '<pc|<c|<w' test_adorned_xml/test_get_one_file.xml | wc -l

121


### The actual process and results

Note that

1.  We get 121 tokens and 121 lemmas, which is the right number.
2.  Spaces count as tokens, as does punctuation.
3.  The lemmas list keeps only valid, non-stopword lemmas (see the section above, "Select useful lemma").  When a lemma is a stop word or is not considered "valid" or "useful", we replace it with a space, so tokens and lemmas stay the same length.

In [6]:
tokens, lemmas = tokenize_lemmatize_one_file('test_adorned_xml/test_get_one_file.xml')

print('len(tokens)', len(tokens), 'len(lemmas)', len(lemmas))
print()
print(tokens)
print()
print(lemmas)

len(tokens) 121 len(lemmas) 121

['25', ' ', '¶', ' ', 'And', ' ', 'Adam', ' ', '•new', ' ', 'his', ' ', 'wife', ' ', 'again', ',', ' ', 'and', ' ', 'she', ' ', '〈◊〉', ' ', 'a', ' ', 'son', ',', ' ', 'and', ' ', 'called', ' ', 'his', ' ', 'name', ' ', 'Seth', ':', ' ', 'For', ' ', 'God', ',', ' ', 'said', ' ', 'she', ',', ' ', 'hath', ' ', 'appo•••••', ' ', 'me', ' ', 'another', ' ', 'seed', ' ', 'in', ' ', 'stead', ' ', 'of', ' ', 'Abel', ',', ' ', 'whom', ' ', 'Cain', ' ', 'slew', '.', '8', ' ', 'And', ' ', 'Abraham', ' ', 'said', ',', ' ', 'My', ' ', 'son', ',', ' ', 'God', ' ', 'will', ' ', 'provide', ' ', 'himself', ' ', 'a', ' ', 'lamb', ' ', 'for', ' ', 'a', ' ', 'burnt-offering', ':', ' ', 'so', ' ', 'they', ' ', 'went', ' ', 'both', ' ', 'of', ' ', 'them', ' ', 'together', '.']

[' ', ' ', ' ', ' ', ' ', ' ', 'adam', ' ', ' ', ' ', ' ', ' ', 'wife', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'son', ' ', ' ', ' ', ' ', 'call', ' ', ' ', ' ', 'name', ' ', 'seth'
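The parallel-list idea can be sketched on a tiny inline fragment. Everything here is illustrative: the `<w>`, `<pc>` and `<c>` element names come from the grep above, but the `lemma` attribute name, the two-word stop list, the XML fragment itself and the function name `tokenize_lemmatize_sketch` are assumptions, not the real file format or the real `tokenize_lemmatize_one_file`.

```python
import re
import xml.etree.ElementTree as ET

STOP_WORDS = {'and', 'his'}  # hypothetical stop list, for illustration only

# Invented fragment; the real files are full morphadorned documents.
SAMPLE_XML = ('<p><w lemma="and">And</w><c> </c><w lemma="Adam">Adam</w>'
              '<c> </c><w lemma="his">his</w><c> </c>'
              '<w lemma="wife">wife</w><pc>,</pc></p>')


def tokenize_lemmatize_sketch(xml_text):
    """Return parallel token and lemma lists of equal length."""
    tokens, lemmas = [], []
    for element in ET.fromstring(xml_text).iter():
        if element.tag not in ('w', 'pc', 'c'):
            continue  # skip structural elements such as <p>
        tokens.append(element.text or ' ')
        lemma = (element.get('lemma') or '').lower()
        keep = (element.tag == 'w'
                and re.fullmatch(r'[a-z]+', lemma) is not None
                and lemma not in STOP_WORDS)
        # Anything not kept becomes a space so positions stay aligned.
        lemmas.append(lemma if keep else ' ')
    return tokens, lemmas


sketch_tokens, sketch_lemmas = tokenize_lemmatize_sketch(SAMPLE_XML)
print(sketch_tokens)
print(sketch_lemmas)
```

Replacing rejected lemmas with a space, rather than dropping them, is what keeps position i in one list aligned with position i in the other.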

### The shingle building process

The first step in the shingle building process is the creation of two lists: one of the non-space lemmas; the other of the locations of those non-space lemmas in the tokens and lemmas lists created in the preceding step.

In [7]:
non_space_lemmas, offsets = get_non_space_lemmas_and_offsets(lemmas)

print('len(non_space_lemmas)', len(non_space_lemmas), 'len(offsets)', len(offsets))
print()
print(non_space_lemmas)
print()
print(offsets)

len(non_space_lemmas) 22 len(offsets) 22

['adam', 'wife', 'son', 'call', 'name', 'seth', 'god', 'say', 'another', 'seed', 'stead', 'abel', 'cain', 'slay', 'abraham', 'say', 'son', 'god', 'provide', 'lamb', 'go', 'together']

[6, 12, 25, 30, 34, 36, 41, 44, 55, 57, 61, 65, 70, 72, 78, 80, 85, 88, 92, 98, 111, 119]
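This step amounts to one pass over lemmas that records each non-space entry together with its index. The function below is a hedged stand-in written for illustration (the name `non_space_lemmas_and_offsets_sketch` is hypothetical); it is not the source of `get_non_space_lemmas_and_offsets`.

```python
def non_space_lemmas_and_offsets_sketch(lemmas):
    """Split lemmas into the non-space entries and their positions."""
    non_space, positions = [], []
    for position, lemma in enumerate(lemmas):
        if lemma != ' ':
            non_space.append(lemma)
            positions.append(position)
    return non_space, positions


# First 13 entries of the lemmas list printed earlier.
lemmas_head = [' ', ' ', ' ', ' ', ' ', ' ', 'adam',
               ' ', ' ', ' ', ' ', ' ', 'wife']

head_lemmas, head_offsets = non_space_lemmas_and_offsets_sketch(lemmas_head)
print(head_lemmas)
print(head_offsets)
```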


**Shingles** are derived from non_space_lemmas and offsets.  A shingle consists of:

1.  An n-gram drawn from non_space_lemmas;
2.  A list of starting and ending positions within offsets.

In [9]:
SHINGLE_LENGTH = 3

shingles = shingle_tokens(non_space_lemmas, SHINGLE_LENGTH)

print('len(shingles)', len(shingles))
print()
print(shingles)

len(shingles) 20

{('adam', 'wife', 'son'): [[0, 2]], ('wife', 'son', 'call'): [[1, 3]], ('son', 'call', 'name'): [[2, 4]], ('call', 'name', 'seth'): [[3, 5]], ('name', 'seth', 'god'): [[4, 6]], ('seth', 'god', 'say'): [[5, 7]], ('god', 'say', 'another'): [[6, 8]], ('say', 'another', 'seed'): [[7, 9]], ('another', 'seed', 'stead'): [[8, 10]], ('seed', 'stead', 'abel'): [[9, 11]], ('stead', 'abel', 'cain'): [[10, 12]], ('abel', 'cain', 'slay'): [[11, 13]], ('cain', 'slay', 'abraham'): [[12, 14]], ('slay', 'abraham', 'say'): [[13, 15]], ('abraham', 'say', 'son'): [[14, 16]], ('say', 'son', 'god'): [[15, 17]], ('son', 'god', 'provide'): [[16, 18]], ('god', 'provide', 'lamb'): [[17, 19]], ('provide', 'lamb', 'go'): [[18, 20]], ('lamb', 'go', 'together'): [[19, 21]]}
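The shape of this output suggests a sliding window over non_space_lemmas. The sketch below reproduces that shape under the assumption that a repeated n-gram accumulates further [start, stop] pairs in its list; `shingle_tokens_sketch` is a hypothetical name and may differ from the real `shingle_tokens` in its details.

```python
def shingle_tokens_sketch(non_space_lemmas, shingle_length):
    """Map each n-gram tuple to a list of inclusive [start, stop] pairs."""
    shingles = {}
    last_start = len(non_space_lemmas) - shingle_length
    for start in range(last_start + 1):
        ngram = tuple(non_space_lemmas[start:start + shingle_length])
        stop = start + shingle_length - 1
        # setdefault lets a repeated n-gram collect several positions.
        shingles.setdefault(ngram, []).append([start, stop])
    return shingles


# First five non-space lemmas from the output above.
head_shingles = shingle_tokens_sketch(
    ['adam', 'wife', 'son', 'call', 'name'], 3)
print(head_shingles)
```

A list of n items yields n - length + 1 windows, which agrees with the 20 shingles obtained from 22 lemmas above.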


### Serialize

Basic code to save a file's data to disk, and to load it.

In [10]:
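import pickle  # made explicit rather than relying on the wildcard import

file_a_contents = {'tokens': tokens, 'lemmas': lemmas,
                   'non_space_lemmas': non_space_lemmas, 'offsets': offsets,
                   'shingles': shingles}

# A context manager closes the file even if pickle.dump raises.
with open('test_pickles/test_get_one_file.pickle', 'wb') as f:
    pickle.dump(file_a_contents, f)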

file_a_contents = load_pickle_file('test_pickles/test_get_one_file.pickle')

print(file_a_contents)

{'tokens': ['25', ' ', '¶', ' ', 'And', ' ', 'Adam', ' ', '•new', ' ', 'his', ' ', 'wife', ' ', 'again', ',', ' ', 'and', ' ', 'she', ' ', '〈◊〉', ' ', 'a', ' ', 'son', ',', ' ', 'and', ' ', 'called', ' ', 'his', ' ', 'name', ' ', 'Seth', ':', ' ', 'For', ' ', 'God', ',', ' ', 'said', ' ', 'she', ',', ' ', 'hath', ' ', 'appo•••••', ' ', 'me', ' ', 'another', ' ', 'seed', ' ', 'in', ' ', 'stead', ' ', 'of', ' ', 'Abel', ',', ' ', 'whom', ' ', 'Cain', ' ', 'slew', '.', '8', ' ', 'And', ' ', 'Abraham', ' ', 'said', ',', ' ', 'My', ' ', 'son', ',', ' ', 'God', ' ', 'will', ' ', 'provide', ' ', 'himself', ' ', 'a', ' ', 'lamb', ' ', 'for', ' ', 'a', ' ', 'burnt-offering', ':', ' ', 'so', ' ', 'they', ' ', 'went', ' ', 'both', ' ', 'of', ' ', 'them', ' ', 'together', '.'], 'lemmas': [' ', ' ', ' ', ' ', ' ', ' ', 'adam', ' ', ' ', ' ', ' ', ' ', 'wife', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'son', ' ', ' ', ' ', ' ', 'call', ' ', ' ', ' ', 'name', ' ', 'seth', ' ', ' ', 