<td>
<a href="https://colab.research.google.com/github/raoulg/MADS-DAV/blob/main/notebooks/6.3.2-tanach_preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</td>

# Old Testament

Let's load the text files.

In [None]:
from pathlib import Path
datadir = Path('../data/raw/tanach').resolve()

# Sort the paths so the file order (and the index used below) is reproducible.
files = sorted(datadir.glob("*.txt"))

Let's pick a text file and see what it looks like.

In [None]:
filepath = files[16]
filepath

In [None]:
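# Assumes the files are UTF-8 encoded; set it explicitly so the platform default is not used.
with filepath.open(encoding="utf-8") as f: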
    text = f.read()
text.split("\n")[0:10]

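So, the text contains some special Unicode characters. U+202A, U+202B and U+202C are invisible directional formatting characters that mark whether the text should be read from left to right or from right to left. We will remove them from the text, together with U+05C3 (the Hebrew sof pasuq, which marks the end of a verse). We will also replace the non-breaking space \xa0 with a regular space.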
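To see what these characters are, the standard library's `unicodedata` module can print the official Unicode name of each code point. This is only a small sketch: the string `sample` is made up for illustration and is not taken from the data files.

```python
import re
import unicodedata

# The code points that are removed or replaced in the cleaning step.
codes = ["\u202a", "\u202b", "\u202c", "\u05c3", "\xa0"]
names = {f"U+{ord(c):04X}": unicodedata.name(c) for c in codes}
for code, name in names.items():
    print(code, name)

# Hypothetical sample (not from the real files): a header wrapped in
# embedding marks, with a non-breaking space in the middle.
sample = "\u202bxxxx\xa0header\u202c"
stripped = re.sub(r"[\u202a\u202b\u202c\u05c3]", "", sample).replace("\xa0", " ")
print(repr(stripped))
```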

In addition, the file starts with some information about the book and chapter, on lines that begin with "xxxx". We will use some of the specific patterns in the text to process it (e.g. skip the lines starting with xxxx, and extract the chapter and verse numbers).
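The two patterns can be tried on single lines first. This is a minimal sketch: both lines below are made-up placeholders, not real content from the files, and it assumes the layout the parser relies on (two numbers separated by whitespace, the first read as the verse and the second as the chapter).

```python
import re

digits = r"\d+\s+\d+"

# Hypothetical, already-cleaned line: verse number, chapter number, then text.
line = " 5  3   placeholder verse text"

match = re.search(digits, line)
verse, chap = (int(n) for n in match.group(0).split())
# Remove only the first occurrence of the number pair, then trim whitespace.
text = re.sub(digits, "", line, count=1).strip()

# Hypothetical header line: re.match only succeeds at the start of the string.
header = "xxxx  made-up header line"
is_header = re.match(r"xxxx", header) is not None

print(chap, verse, text, is_header)
```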

In [None]:
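import re
from loguru import logger


def clean(text):
    # Remove the directional formatting characters and the sof pasuq (U+05C3).
    ucodes = r"\u202a|\u202b|\u202c|\u05c3"
    text = re.sub(ucodes, "", text)
    # Replace non-breaking spaces with regular spaces.
    text = re.sub(r"\xa0", " ", text)
    return text


def parse_text(text, bookname):
    startswithx = r"xxxx"
    digits = r"\d+\s+\d+"
    cleaned = clean(text)
    data = []
    for i, line in enumerate(cleaned.split("\n")):
        # Skip the header lines and blank lines (e.g. the trailing newline).
        if re.match(startswithx, line) or not line.strip():
            continue
        match = re.search(digits, line)
        if match:
            # The first number is read as the verse, the second as the chapter.
            verse, chap = match.group(0).split()
            line = re.sub(digits, "", line, count=1).strip()
            data.append({"book": bookname, "chap": int(chap), "verse": int(verse), "text": line})
        elif data:
            # A line without numbers continues the previous verse.
            data[-1]["text"] += " " + line.strip()
            logger.warning(f"Line {i}:{line} in {bookname} is added to {chap}:{verse}")
        else:
            # No verse has been parsed yet, so there is nothing to append to.
            logger.warning(f"Line {i}:{line} in {bookname} precedes the first verse; skipped")
    return data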


In [None]:
import pandas as pd

In [None]:
testament = []
for filepath in files:
    with filepath.open(encoding="utf-8") as f:  # assumes UTF-8 encoded files
        text = f.read()
        bookname = filepath.stem
        data = parse_text(text, bookname)
        testament.extend(data)

We now have a list of records, each with book, chapter and verse metadata and the text of the verse. Let's store it in a DataFrame, so we can use this data to do some analysis.

In [None]:
df = pd.DataFrame(testament)
df
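A quick sanity check on a frame like this is to count the verses per book and chapter, and to look for duplicated book/chapter/verse combinations. The sketch below uses a tiny made-up frame (`toy`, with placeholder book names and text) that has the same columns as `df`; on the real data, the same calls would be applied to `df` instead.

```python
import pandas as pd

# Hypothetical stand-in for df, with placeholder values.
toy = pd.DataFrame(
    [
        {"book": "BookA", "chap": 1, "verse": 1, "text": "placeholder"},
        {"book": "BookA", "chap": 1, "verse": 2, "text": "placeholder"},
        {"book": "BookA", "chap": 2, "verse": 1, "text": "placeholder"},
        {"book": "BookB", "chap": 1, "verse": 1, "text": "placeholder"},
    ]
)

# Number of verses in every (book, chapter) pair.
verses_per_chapter = toy.groupby(["book", "chap"]).size()

# True if any (book, chapter, verse) combination occurs more than once.
has_duplicates = bool(toy.duplicated(subset=["book", "chap", "verse"]).any())

print(verses_per_chapter)
print("duplicates:", has_duplicates)
```

A chapter with far fewer verses than expected, or any duplicate, would suggest that a line was parsed into the wrong verse.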

In [None]:
df.book.unique()

In [None]:
df.to_parquet("../data/processed/tanach.parquet")