Goal of the notebook
-----------

In this notebook we clean the raw gospel data extracted from a Bible database and convert it into trainable data.


Steps
------

1. Load the raw gospel data.

Here we use the American Standard Version (ASV) of the Gospel of Matthew.

2. Tokenize each sentence into words and special characters using an efficient word tokenizer.

We use [spaCy](https://spacy.io/) to tokenize sentences. The spaCy model used is `en_core_web_md`.

3. Add tokens to structure the trainable data. 

    - <SOV\> : Start of a verse
    - <EOV\> : End of a verse
    - <SOC\> : Start of a chapter
    - <EOC\> : End of a chapter
    
4. Save training data to a file. 

Replace any unwanted characters if spotted.
 

In [1]:
import json
import spacy
nlp = spacy.load("en_core_web_md")

In [3]:
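def load_json(file_name):
    """
    Load a JSON file and return the parsed object
    """
    # Use a context manager so the file handle is closed after reading
    with open(file_name) as json_file:
        return json.load(json_file)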

In [4]:
# Load raw Matthew American Standard Version 
raw_mattew_asv = load_json('../raw_gospel_data/asv/matthew_asv.json')

In [5]:
raw_mattew_asv[0]

[40001001,
 40,
 1,
 1,
 'The book of the generation of Jesus Christ, the son of David, the son of Abraham.']
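
Each record appears to be a five-element list: verse ID, book ID, chapter number, verse number and verse text. The sketch below unpacks the record shown above. The decoding of the verse ID as book, chapter and verse digits (`BBCCCVVV`) is an assumption inferred from this single record, not something the data source documents here.

```python
# First record, copied from the output above.
record = [40001001, 40, 1, 1,
          "The book of the generation of Jesus Christ, the son of David, the son of Abraham."]

# Positions 0-4: verse ID, book ID, chapter number, verse number, verse text.
verse_id, book_id, chapter, verse, text = record

# Assumption: the ID packs book, chapter and verse as BBCCCVVV.
book_from_id, rest = divmod(verse_id, 1_000_000)
chapter_from_id, verse_from_id = divmod(rest, 1000)

print(book_from_id, chapter_from_id, verse_from_id)  # 40 1 1
```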

In [6]:
def tokenize_sentence(input_sentence, spacy_object):
    """
    Tokenize a given sentence and return the tokens as a list
    """
    return [token.text for token in spacy_object(input_sentence)]

In [7]:
def generate_training_data(raw_data, spacy_object):
    """
    Tokenize, add structure tokens <SOV>, <EOV>, <SOC>, <EOC> 
    Returns a single string with tokens separated by space
    
    verse_data[0] = Verse ID
    verse_data[1] = Book ID
    verse_data[2] = Chapter Number
    verse_data[3] = Verse Number
    verse_data[4] = Verse string
    
    """

    training_data = ""
    chapter_number = 1
    chapter_start = 1

    for i, verse_data in enumerate(raw_data):
        tokens = tokenize_sentence(verse_data[4], spacy_object)
        temp = ["<SOV>"] + tokens + ["<EOV>"]
        if chapter_number == verse_data[2] and chapter_start == verse_data[3]:
            temp = ["<SOC>"] + temp
        elif i+1 < len(raw_data) and raw_data[i+1][2] == chapter_number+1:
            temp = temp + ["<EOC>"]
            chapter_number+=1
        elif i+1 == len(raw_data):
            temp = temp + ["<EOC>"] 
        temp = " ".join(temp)
        training_data = training_data + temp + " "
    return training_data
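
Note that the `if`/`elif` chain above cannot attach both `<SOC>` and `<EOC>` to the same verse, so a chapter with only one verse would miss its `<EOC>`. Matthew has no such chapter, so the function works for this data. For other inputs, one possible alternative is to group verses by chapter first. The sketch below is a hypothetical variant (`generate_training_data_grouped` is not part of the notebook); it takes any tokenizer callable and is demonstrated with `str.split` on toy records instead of spaCy. Unlike the function above, it does not leave a trailing space.

```python
from itertools import groupby

def generate_training_data_grouped(raw_data, tokenize):
    """Group verses by chapter number, then wrap each chapter and verse."""
    parts = []
    # Assumes records are ordered so that verses of a chapter are contiguous.
    for _, verses in groupby(raw_data, key=lambda record: record[2]):
        parts.append("<SOC>")
        for record in verses:
            parts += ["<SOV>"] + tokenize(record[4]) + ["<EOV>"]
        parts.append("<EOC>")
    return " ".join(parts)

# Toy records in the same layout as the raw data; chapter 2 has a single verse.
toy_data = [
    [1, 40, 1, 1, "First verse ."],
    [2, 40, 1, 2, "Second verse ."],
    [3, 40, 2, 1, "Only verse ."],
]
sample = generate_training_data_grouped(toy_data, str.split)
print(sample)
```

In the notebook, the tokenizer argument could be `lambda s: tokenize_sentence(s, nlp)`.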

In [33]:
training_data = generate_training_data(raw_mattew_asv, nlp)

In [39]:
# Let's sample the first chapter

training_data[:3049]

"<SOC> <SOV> The book of the generation of Jesus Christ , the son of David , the son of Abraham . <EOV> <SOV> Abraham begat Isaac ; and Isaac begat Jacob ; and Jacob begat Judah and his brethren ; <EOV> <SOV> and Judah begat Perez and Zerah of Tamar ; and Perez begat Hezron ; and Hezron begat Ram ; <EOV> <SOV> and Ram begat Amminadab ; and Amminadab begat Nahshon ; and Nahshon begat Salmon ; <EOV> <SOV> and Salmon begat Boaz of Rahab ; and Boaz begat Obed of Ruth ; and Obed begat Jesse ; <EOV> <SOV> and Jesse begat David the king . And David begat Solomon of her ` that had been the wife ' of Uriah ; <EOV> <SOV> and Solomon begat Rehoboam ; and Rehoboam begat Abijah ; and Abijah begat Asa ; <EOV> <SOV> and Asa begat Jehoshaphat ; and Jehoshaphat begat Joram ; and Joram begat Uzziah ; <EOV> <SOV> and Uzziah begat Jotham ; and Jotham begat Ahaz ; and Ahaz begat Hezekiah ; <EOV> <SOV> and Hezekiah begat Manasseh ; and Manasseh begat Amon ; and Amon begat Josiah ; <EOV> <SOV> and Josiah beg

Let's replace the backtick (`` ` ``) with an apostrophe (`'`).

In [40]:
training_data = training_data.replace("`", "'")

We need some tests to check that the training data is generated correctly.

The Gospel of Matthew in the American Standard Version has 1071 verses and 28 chapters, so the structure tokens should have the following counts.

1. <SOV\> - 1071
2. <EOV\> - 1071
3. <SOC\> - 28
4. <EOC\> - 28

In [41]:
# Tests 

assert training_data.count("<SOV>") == 1071, "Count of <SOV> is wrong: {}".format(training_data.count("<SOV>"))
assert training_data.count("<EOV>") == 1071, "Count of <EOV> is wrong: {}".format(training_data.count("<EOV>"))
assert training_data.count("<SOC>") == 28, "Count of <SOC> is wrong: {}".format(training_data.count("<SOC>"))
assert training_data.count("<EOC>") == 28, "Count of <EOC> is wrong: {}".format(training_data.count("<EOC>"))
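
Matching counts do not guarantee that the tokens appear in the right order. As an optional extra check, the sketch below walks the tokens and verifies that every verse sits inside a chapter and that nothing is left open. `check_structure` is a hypothetical helper, shown here on toy strings rather than the real training data.

```python
def check_structure(text):
    """Return True if structure tokens are properly nested and balanced."""
    in_chapter = False
    in_verse = False
    for token in text.split():
        if token == "<SOC>":
            if in_chapter:  # chapter opened twice
                return False
            in_chapter = True
        elif token == "<SOV>":
            if not in_chapter or in_verse:  # verse outside chapter or nested
                return False
            in_verse = True
        elif token == "<EOV>":
            if not in_verse:  # verse closed without being opened
                return False
            in_verse = False
        elif token == "<EOC>":
            if not in_chapter or in_verse:  # chapter closed too early
                return False
            in_chapter = False
    # Nothing may remain open at the end.
    return not in_chapter and not in_verse

good = "<SOC> <SOV> a b . <EOV> <EOC>"
bad = "<SOC> <SOV> a b . <EOV> <SOC> <SOV> c . <EOV> <EOC>"
print(check_structure(good), check_structure(bad))  # True False
```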

In [42]:
# Write the training data to a file 

with open("../training_data/matthew_asv.txt", "w") as text_file:
    text_file.write(training_data)
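
To sanity-check the saved format, the text can be split back into chapters and verses. The following is a minimal sketch using regular expressions on a toy string; `split_verses` is a hypothetical helper, and the same call could be applied to the contents of the written file.

```python
import re

def split_verses(text):
    """Return a list of chapters, each a list of verse strings."""
    # Non-greedy matches pick out each <SOC>...<EOC> span in order.
    chapters = re.findall(r"<SOC>(.*?)<EOC>", text, flags=re.DOTALL)
    return [
        [verse.strip()
         for verse in re.findall(r"<SOV>(.*?)<EOV>", chapter, flags=re.DOTALL)]
        for chapter in chapters
    ]

sample_text = "<SOC> <SOV> a b . <EOV> <SOV> c . <EOV> <EOC> <SOC> <SOV> d . <EOV> <EOC> "
parsed = split_verses(sample_text)
print(parsed)  # [['a b .', 'c .'], ['d .']]
```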

You can check out the scripts folder to see this demo in action.