## Create Bible Pickles from EEBO-TCP Bibles . . . 

 . . . but not from all of them because, even though the Bibles are all from the same place, and they're all TEI, they don't mark up verses in the same way.  Or, in some cases, at all.
 
So, this notebook really only processes A10675 (Geneva) and A97378 (KJV).  The other Bibles I wanted to use (Coverdale, the Bishops' Bible, the Great Bible) aren't marked up in a way that makes them useful for verse-by-verse matching.

In [1]:
import pickle  # used by bible_to_verses() to write the results

from lxml import etree
from matching_functions import *

SHINGLE_LENGTH = 3
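
For anyone unfamiliar with the term: a shingle is a run of consecutive tokens, and `SHINGLE_LENGTH = 3` means runs of three. The real `shingle_tokens` lives in `matching_functions` and isn't shown in this notebook, so the sketch below is only an assumption about the general idea. The name `toy_shingles` and the choice to return tuples are hypothetical.

```python
def toy_shingles(lemmas, length):
    """Return every run of `length` consecutive lemmas, in order.

    Hypothetical stand-in for shingle_tokens(); the real function
    may represent shingles differently (e.g. as joined strings).
    """
    # A list shorter than `length` yields no shingles at all.
    return [tuple(lemmas[i:i + length])
            for i in range(len(lemmas) - length + 1)]


example = toy_shingles(['in', 'the', 'beginning', 'god'], 3)
print(example)
```

Overlapping shingles like these are what make it possible to match a short quotation against a verse even when the quotation starts or stops mid-verse.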

### Any oddities in the XML?

At one point, I wasn't finding any quotations of the Psalms in EEBO-TCP.  Which was laughable.

It turns out that Psalms does not have chapters (i.e., it doesn't have `<div type="chapter">`); instead, it has Psalms (`<div type="Psalm">`).  At least in the two files I'm processing.

Here, I'm just checking which `type` values are attached to `div` elements.

In [2]:
from collections import defaultdict, Counter

div_types = defaultdict(int)

def count_div_types(TCP_ID):

    tree = etree.parse('xml_bible/' + TCP_ID + '.xml')

    for div in tree.xpath('//tei:div', 
                           namespaces={'tei': 'http://www.tei-c.org/ns/1.0'}):

        if div.get('type') is not None:
            div_types[div.get('type')] += 1
    
for TCP_ID in ['A10675', 'A97378',]:
    count_div_types(TCP_ID)
    
for w in Counter(div_types).most_common():
    print(w[0], w[1])

chapter 2406
Psalm 300
book 155
section 22
part 22
title_page 3
division 3
chronology 2
table 2
endnotes 2
prologue 2
addendum_to_the_book_of_Daniel 2
table_of_contents 1
letter 1
to_the_reader 1
Song_of_the_n3_children 1
Bel_and_the_dragon 1
map 1
description 1
key 1
engraved_title_page 1
dedication 1
summary_of_contents 1
Old_Testament 1
Apocrypha 1
addenda_to_the_book_of_Esther 1
New_Testament 1


### A function to process EEBO-TCP Bibles . . . 

 . . . two, at least.  Note the fallbacks: if a chapter has no number, I treat it as chapter 1, and a verse with no number becomes verse 1.  I also assume that each verse sits inside its own `p` tag, which is true for the most part.

In [3]:
def bible_to_verses(TCP_ID):
    
    print(TCP_ID)

    tree = etree.parse('xml_bible/' + TCP_ID + '.xml')

    books = tree.xpath('//tei:div[@type="book"]', 
                       namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})

    results = []

    for b in books:

        book_title = b.get('n')

        chapters = b.xpath('descendant::tei:div[@type="chapter"]|descendant::tei:div[@type="Psalm"]', 
                           namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})

        for c in chapters:

            chapter_n = c.get('n')
            if chapter_n is None:
                chapter_n = '1'

            verses = c.xpath('descendant::tei:p', 
                           namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})

            for v in verses:
                
                # Paragraphs inside an argument element are summaries, not verses
                if 'argument' in v.getparent().tag:
                    continue
                
                verse_n = v.get('n')
                if verse_n is None:
                    verse_n = '1'

                tokens, lemmas = get_tokens_for_iterator(v)
                non_space_lemmas, offsets = get_non_space_lemmas_and_offsets(lemmas)
                shingles = shingle_tokens(non_space_lemmas, SHINGLE_LENGTH)

                reference = book_title + '.' + chapter_n + '.' + verse_n

                results.append({'reference': reference, 
                                'tokens': tokens, 
                                'lemmas': lemmas,
                                'non_space_lemmas': non_space_lemmas, 
                                'offsets': offsets,
                                'shingles': shingles})

    print(len(results))

    # The with block closes the file even if pickling fails
    with open('bible_pickles/' + TCP_ID + '.pickle', 'wb') as f:
        pickle.dump(results, f)
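
To see what the chapter and verse fallbacks do, here is a minimal sketch on a made-up fragment. It uses the standard library's `xml.etree.ElementTree` rather than lxml so it runs on its own, and the `SAMPLE` markup, the book name `Toy`, and the function name `toy_references` are all invented for illustration. It is not the function above, just the same defaulting logic in miniature.

```python
import xml.etree.ElementTree as ET

TEI = '{http://www.tei-c.org/ns/1.0}'

# Invented fragment: the chapter has no n, and neither does the first p.
SAMPLE = '''<div xmlns="http://www.tei-c.org/ns/1.0" type="book" n="Toy">
  <div type="chapter">
    <p>text of an unnumbered verse</p>
    <p n="2">text of a numbered verse</p>
  </div>
</div>'''


def toy_references(xml_text):
    """Build book.chapter.verse references, defaulting missing numbers to '1'."""
    book = ET.fromstring(xml_text)
    references = []
    for div in book.iter(TEI + 'div'):
        # Treat Psalm divs the same way as chapter divs.
        if div.get('type') not in ('chapter', 'Psalm'):
            continue
        chapter_n = div.get('n', '1')    # no chapter number: chapter 1
        for p in div.iter(TEI + 'p'):
            verse_n = p.get('n', '1')    # no verse number: verse 1
            references.append('.'.join([book.get('n'), chapter_n, verse_n]))
    return references


print(toy_references(SAMPLE))
```

One consequence worth keeping in mind: if a chapter contains more than one unnumbered `p`, each of them gets verse number 1, so references are not guaranteed to be unique.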

### Actually process files

The function prints the TCP ID of whatever file it's processing, then the number of verses it found.

Which means that Geneva (A10675, 36,717 verses) has more verses than the KJV (A97378, 36,555 verses).  Both are reasonable numbers.  I'm not unduly alarmed by the difference of 162, since I didn't have to look far to find a legitimate reason for at least one of the discrepancies . . . 

In [4]:
for TCP_ID in ['A10675', 'A97378',]:
    bible_to_verses(TCP_ID)

A10675
36717
A97378
36555
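
One way to dig into the difference between the two counts is to compare the `reference` strings in the two result lists. The sketch below is a rough starting point rather than something this notebook actually ran. `compare_references` is a hypothetical helper, and the toy lists stand in for the contents of the two pickles.

```python
from collections import Counter


def compare_references(verses_a, verses_b):
    """Report references unique to each list, plus any that repeat."""
    refs_a = Counter(v['reference'] for v in verses_a)
    refs_b = Counter(v['reference'] for v in verses_b)
    return {
        'only_in_a': sorted(set(refs_a) - set(refs_b)),
        'only_in_b': sorted(set(refs_b) - set(refs_a)),
        # Repeats can arise when several unnumbered p tags default to verse 1.
        'repeated_in_a': sorted(r for r, n in refs_a.items() if n > 1),
        'repeated_in_b': sorted(r for r, n in refs_b.items() if n > 1),
    }


# Toy stand-ins for the unpickled lists of verse dictionaries.
geneva_like = [{'reference': 'Toy.1.1'},
               {'reference': 'Toy.1.2'},
               {'reference': 'Toy.1.2'}]
kjv_like = [{'reference': 'Toy.1.1'},
            {'reference': 'Toy.1.3'}]

report = compare_references(geneva_like, kjv_like)
print(report)
```

With the real data, the two lists would come from `pickle.load` on the files in `bible_pickles/`. If the two files label their books differently in the `n` attribute, the "only in" lists will be inflated, so the book titles may need normalizing first.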
