# Preprocess MorphGNT

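This notebook processes all MorphGNT text files into a single, standard `morphgnt.csv` file.

The column names in `morphgnt.csv` are defined in `biblesdk.columns`. The table below maps each MorphGNT source column to the output columns derived from it; a ditto mark (") means the row is derived from the same source column as the row above.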

| MorphGNT | morphgnt.csv | Description |
|----------|--------------|-------------|
| Column 1 | SCRIPTURE_REFERENCE | Scripture reference in BBCCVV form. |
| " | BOOK | Book in integer form. |
| " | CHAPTER | Chapter in integer form. |
| " | VERSE | Verse in integer form. |
| Column 2 | PART_OF_SPEECH_CODE | Encoded part of speech. |
| " | PART_OF_SPEECH | Part of speech description. |
| Column 3 | INFLECTION_CODES | Encoded inflection information. |
| " | PERSON | Person description. |
| " | TENSE | Tense description. |
| " | VOICE | Voice description. |
| " | MOOD | Mood description. |
| " | CASE | Case description. |
| " | NUMBER | Number description. |
| " | GENDER | Gender description. |
| " | DEGREE | Degree description. |
| Column 4 | TEXT | Word including punctuation. |
| Column 5 | WORD | Word without punctuation. |
| Column 6 | NORMALIZED_WORD | Normalized word (e.g., movable nu). |
| Column 7 | LEMMA | Lemma (dictionary form). |
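
To make the mapping concrete, the sketch below splits one whitespace-separated line into the seven source columns. The sample line is assumed to follow the seven-column layout in the table, and the lowercase field names are illustrative only; they are not the constants from `biblesdk.columns`.

```python
# Illustrative sample line, assumed to follow the seven-column MorphGNT layout.
sample_line = "010101 N- ----NSF- Βίβλος Βίβλος βίβλος βίβλος"

# Hypothetical field names mirroring the seven source columns in the table.
field_names = [
    "scripture_reference",
    "part_of_speech_code",
    "inflection_codes",
    "text",
    "word",
    "normalized_word",
    "lemma",
]

# Columns are separated by whitespace, so a plain split() recovers them.
fields = dict(zip(field_names, sample_line.split()))

for name, value in fields.items():
    print(f"{name}: {value}")
```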


## Imports and Parameters

In [1]:
from glob import glob
from os import path
from pandas import concat, read_csv, DataFrame

import biblesdk.columns as bc

INPUT_PATH_NAME = "../BibleCore/Resources/MorphGnt"
OUTPUT_FILE_NAME = "morphgnt.csv"

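## Parse MorphGNT Files into DataFrame (DF_WORDS)

Loads every `*.txt` file under `INPUT_PATH_NAME` into a single DataFrame, `DF_WORDS`. The scripture reference is read as a string so that its leading zeros are preserved.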

In [2]:
# glob() returns files in arbitrary order; sort so the row order is deterministic.
all_files = sorted(glob(path.join(INPUT_PATH_NAME, "*.txt")))

DF_WORDS: DataFrame = concat(
    (
        read_csv(
            f,
            names=[
                bc.SCRIPTURE_REFERENCE,
                bc.PART_OF_SPEECH_CODE,
                bc.INFLECTION_CODES,
                bc.TEXT,
                bc.WORD,
                bc.NORMALIZED_WORD,
                bc.LEMMA,
            ],
            dtype={bc.SCRIPTURE_REFERENCE: "str"},
            sep="\\s+",
            index_col=False,
        )
        for f in all_files
    ),
    ignore_index=True,
)

# print("===== DF_WORDS")
# print(DF_WORDS.__class__.__name__)
# print("-----")
# pprint(vars(DF_WORDS))
# print("-----")
# pprint(DF_WORDS)
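
The effect of the `read_csv` options can be checked without the source files. The following is a minimal sketch that parses two illustrative lines from an in-memory buffer; the lines are assumed to follow the seven-column layout, and the column names are hypothetical stand-ins for the `biblesdk.columns` constants.

```python
from io import StringIO

from pandas import read_csv

# Two illustrative lines in the assumed seven-column, whitespace-separated layout.
sample = StringIO(
    "010101 N- ----NSF- Βίβλος Βίβλος βίβλος βίβλος\n"
    "010101 N- ----GSF- γενέσεως γενέσεως γενέσεως γένεσις\n"
)

# Hypothetical column names standing in for the biblesdk.columns constants.
names = [
    "reference",
    "pos_code",
    "inflection_codes",
    "text",
    "word",
    "normalized_word",
    "lemma",
]

df = read_csv(
    sample,
    names=names,
    dtype={"reference": "str"},  # keep "010101" as text, not the integer 10101
    sep=r"\s+",                  # any run of whitespace separates columns
    index_col=False,
)

print(df["reference"].tolist())
print(df["lemma"].tolist())
```

Reading the reference as a string matters for the next step, which slices it by character position.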

## Parse Scripture Reference

Splits the `SCRIPTURE_REFERENCE` column (a six-character `BBCCVV` string) into separate integer `BOOK`, `CHAPTER`, and `VERSE` columns.

In [3]:
DF_WORDS[bc.BOOK] = DF_WORDS[bc.SCRIPTURE_REFERENCE].str[0:2].astype(int)
DF_WORDS[bc.CHAPTER] = DF_WORDS[bc.SCRIPTURE_REFERENCE].str[2:4].astype(int)
DF_WORDS[bc.VERSE] = DF_WORDS[bc.SCRIPTURE_REFERENCE].str[4:6].astype(int)

# print("===== DF_WORDS")
# print(DF_WORDS.__class__.__name__)
# print("-----")
# pprint(vars(DF_WORDS))
# print("-----")
# pprint(DF_WORDS)
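
The same slicing can be written for a single value in plain Python. This is a sketch only; `split_reference` is a hypothetical helper, not part of the notebook or of `biblesdk`.

```python
def split_reference(reference: str) -> tuple[int, int, int]:
    """Split a six-character BBCCVV string into (book, chapter, verse)."""
    if len(reference) != 6 or not reference.isdigit():
        raise ValueError(f"expected six digits in BBCCVV form, got {reference!r}")
    return int(reference[0:2]), int(reference[2:4]), int(reference[4:6])


print(split_reference("010203"))  # (1, 2, 3)

# If the reference had been parsed as an integer, the leading zero would be
# lost and every slice would be shifted by one character.
print(str(int("010203")))  # 10203
```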

## Parse Part of Speech Code

Maps each two-character value in `PART_OF_SPEECH_CODE` to a readable description in `PART_OF_SPEECH`.

In [4]:
parts_of_speech = {
    "A-": "Adjective",
    "C-": "Conjunction",
    "D-": "Adverb",
    "I-": "Interjection",
    "N-": "Noun",
    "P-": "Preposition",
    "RA": "Definite Article",
    "RD": "Pronoun - Demonstrative",
    "RI": "Pronoun - Indefinite",
    "RP": "Pronoun - Personal",
    "RR": "Pronoun - Relative",
    "V-": "Verb",
    "X-": "Particle",
}

DF_WORDS[bc.PART_OF_SPEECH] = DF_WORDS[bc.PART_OF_SPEECH_CODE].map(parts_of_speech)

# print("===== DF_WORDS")
# print(DF_WORDS.__class__.__name__)
# print("-----")
# pprint(vars(DF_WORDS))
# print("-----")
# pprint(DF_WORDS)
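
`Series.map` with a dictionary yields `NaN` for any value that is not a key, so a code missing from the lookup table would pass through silently. A possible sanity check is sketched below; the lookup table is abbreviated and `"ZZ"` is a made-up code used only to trigger the check.

```python
from pandas import Series

# Abbreviated lookup table; the full one is defined in the cell above.
parts_of_speech = {"N-": "Noun", "V-": "Verb"}

# "ZZ" is a made-up code that is deliberately absent from the table.
codes = Series(["N-", "V-", "ZZ", "N-"])

mapped = codes.map(parts_of_speech)

# Collect the distinct codes that had no entry in the lookup table.
unmapped = sorted(codes[mapped.isna()].unique())
print(unmapped)  # ['ZZ']
```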

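## Parse Inflection Codes

Each character position in `INFLECTION_CODES` encodes one attribute, in the order person, tense, voice, mood, case, number, gender, degree. A character with no entry in the corresponding lookup table (such as `-`) maps to a missing value.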

In [5]:
# One lookup table per character position of the inflection code.
inflection_person = {"1": "First", "2": "Second", "3": "Third"}

inflection_tense = {
    "P": "Present",
    "I": "Imperfect",
    "F": "Future",
    "A": "Aorist",
    "X": "Perfect",
    "Y": "Pluperfect",
}

inflection_voice = {
    "A": "Active",
    "M": "Middle",
    "P": "Passive",
}

inflection_mood = {
    "I": "Indicative",
    "D": "Imperative",
    "S": "Subjunctive",
    "O": "Optative",
    "N": "Infinitive",
    "P": "Participle",
}

inflection_case = {
    "N": "Nominative",
    "G": "Genitive",
    "D": "Dative",
    "A": "Accusative",
    "V": "Vocative",
}

inflection_number = {
    "S": "Singular",
    "P": "Plural",
}

inflection_gender = {"M": "Masculine", "F": "Feminine", "N": "Neuter"}

inflection_degree = {"C": "Comparative", "S": "Superlative"}

# Take the character at each position and map it to its description.
DF_WORDS[bc.PERSON] = DF_WORDS[bc.INFLECTION_CODES].str[0].map(inflection_person)
DF_WORDS[bc.TENSE] = DF_WORDS[bc.INFLECTION_CODES].str[1].map(inflection_tense)
DF_WORDS[bc.VOICE] = DF_WORDS[bc.INFLECTION_CODES].str[2].map(inflection_voice)
DF_WORDS[bc.MOOD] = DF_WORDS[bc.INFLECTION_CODES].str[3].map(inflection_mood)
DF_WORDS[bc.CASE] = DF_WORDS[bc.INFLECTION_CODES].str[4].map(inflection_case)
DF_WORDS[bc.NUMBER] = DF_WORDS[bc.INFLECTION_CODES].str[5].map(inflection_number)
DF_WORDS[bc.GENDER] = DF_WORDS[bc.INFLECTION_CODES].str[6].map(inflection_gender)
DF_WORDS[bc.DEGREE] = DF_WORDS[bc.INFLECTION_CODES].str[7].map(inflection_degree)

# print("===== DF_WORDS")
# print(DF_WORDS.__class__.__name__)
# print("-----")
# pprint(vars(DF_WORDS))
# print("-----")
# pprint(DF_WORDS)
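
The positional decoding can also be illustrated for a single code in plain Python. This is a sketch under assumptions: `decode_inflection` is a hypothetical helper, the attribute names are illustrative, and the lookup tables are abbreviated versions of the ones in the cell above.

```python
# Attribute stored at each of the eight character positions, in order.
POSITIONS = ("person", "tense", "voice", "mood", "case", "number", "gender", "degree")

# Abbreviated lookup tables; the full ones are defined in the cell above.
LOOKUPS = {
    "person": {"1": "First", "2": "Second", "3": "Third"},
    "tense": {"P": "Present", "A": "Aorist"},
    "voice": {"A": "Active", "M": "Middle", "P": "Passive"},
    "mood": {"I": "Indicative", "P": "Participle"},
    "case": {"N": "Nominative", "G": "Genitive"},
    "number": {"S": "Singular", "P": "Plural"},
    "gender": {"M": "Masculine", "F": "Feminine", "N": "Neuter"},
    "degree": {"C": "Comparative", "S": "Superlative"},
}


def decode_inflection(code: str) -> dict[str, str]:
    """Return only the attributes whose character has a lookup entry."""
    decoded = {}
    for name, char in zip(POSITIONS, code):
        label = LOOKUPS[name].get(char)  # "-" has no entry, so it is skipped
        if label is not None:
            decoded[name] = label
    return decoded


print(decode_inflection("----NSF-"))
# {'case': 'Nominative', 'number': 'Singular', 'gender': 'Feminine'}
print(decode_inflection("3AAI-S--"))
# {'person': 'Third', 'tense': 'Aorist', 'voice': 'Active', 'mood': 'Indicative', 'number': 'Singular'}
```

Note that the same letter can mean different things at different positions (for example, `A` is Aorist in the tense slot and Active in the voice slot), which is why each position needs its own table.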

## Write morphgnt.csv File

Writes `DF_WORDS` to `OUTPUT_FILE_NAME`, labelling the index column with `bc.INDEX`.

In [6]:
DF_WORDS.to_csv(OUTPUT_FILE_NAME, index_label=bc.INDEX)
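
When the CSV is read back, the reference column needs the same `dtype` treatment as on input, or its leading zeros are lost. The sketch below shows the round trip through an in-memory buffer; the column and index names are hypothetical stand-ins for the `biblesdk.columns` constants, and the rows are made up.

```python
from io import StringIO

from pandas import DataFrame, read_csv

# Made-up rows with hypothetical column names.
df = DataFrame({"reference": ["010101", "010102"], "lemma": ["βίβλος", "γένεσις"]})

# Write to an in-memory buffer instead of a file, labelling the index column.
buffer = StringIO()
df.to_csv(buffer, index_label="index")

# Read back with the reference forced to a string.
buffer.seek(0)
restored = read_csv(buffer, index_col="index", dtype={"reference": "str"})
print(restored["reference"].tolist())  # ['010101', '010102']

# Without the dtype hint, pandas infers integers and drops the leading zero.
buffer.seek(0)
naive = read_csv(buffer, index_col="index")
print(naive["reference"].tolist())  # [10101, 10102]
```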