# Preprocess Lexemes

Converts the lexemes YAML file (`lexemes.yaml`) into a standard `lexemes.csv` file.

## Parse Lexemes File into DataFrame (DF_LEXEMES)

Columns (source YAML key in parentheses):

* Part of Speech Code (pos)
* Full Citation Form (full-citation-form)
* BDAG Entry (bdag-headword)
* Danker Entry (danker-entry)
* Dodson Entry (dodson-entry)
* Mounce Entry (mounce-headword)
* Strongs (strongs)
* GK (gk)
* Dodson Part of Speech Code (dodson-pos)
* Gloss (gloss)
* Mounce MorphCat (mounce-morphcat)
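To make the expected input shape concrete, here is a minimal sketch. It assumes `lexemes.yaml` is a mapping from lemma to a dictionary of the keys listed above; the `sample` dictionary is a hypothetical stand-in for what `yaml.safe_load` would return, and its values are illustrative rather than copied from the real file.

```python
import pandas as pd

# Hypothetical sample shaped like the result of yaml.safe_load():
# {lemma: {yaml-key: value, ...}}. Values are illustrative only.
sample = {
    "λόγος": {"pos": "N", "gloss": "word", "strongs": 3056},
    "καί": {"pos": "C", "gloss": "and"},  # "strongs" omitted on purpose
}

# orient="index" turns each outer key into a row label.
df = pd.DataFrame.from_dict(sample, orient="index")
df.index.name = "Lemma"
df = df.rename(
    columns={"pos": "Part of Speech Code", "gloss": "Gloss", "strongs": "Strongs"}
)

print(df)
```

A key that is missing for a lemma becomes `NaN` in that row, which also turns an otherwise integer column such as `Strongs` into floats.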

In [1]:
import pandas as pd
import yaml
from pprint import pprint

with open("../BibleCore/Resources/lexemes.yaml", "r", encoding="utf-8") as file:
    yaml_data = yaml.safe_load(file)

DF_LEXEMES = pd.DataFrame.from_dict(yaml_data, orient="index")
DF_LEXEMES.index.name = "Lemma"
DF_LEXEMES.rename(
    columns={
        "pos": "Part of Speech Code",
        "full-citation-form": "Full Citation Form",
        "bdag-headword": "BDAG Entry",
        "danker-entry": "Danker Entry",
        "dodson-entry": "Dodson Entry",
        "mounce-headword": "Mounce Entry",
        "strongs": "Strongs",
        "gk": "GK",
        "dodson-pos": "Dodson Part of Speech Code",
        "gloss": "Gloss",
        "mounce-morphcat": "Mounce MorphCat",
    },
    inplace=True,
)

# print("===== DF_LEXEMES")
# print(DF_LEXEMES.__class__.__name__)
# print("-----")
# pprint(vars(DF_LEXEMES))
# print("-----")
# pprint(DF_LEXEMES)

## Add Part of Speech

The Part of Speech column is derived from the `morphgnt.csv` file: each lemma receives the first Part of Speech value recorded for it there.

In [2]:
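DF_MORPHGNT = pd.read_csv("morphgnt.csv", index_col="Index")
# Group by a single key (not a one-element list) so that each group name is
# the lemma string itself, regardless of the pandas version.
GB_LEMMA = DF_MORPHGNT.groupby("Lemma")

# Map each lemma to the first Part of Speech value recorded for it.
morphgnt_parts_of_speech = {
    name: group["Part of Speech"].unique()[0] for name, group in GB_LEMMA
}

# The assignment aligns on the index; lemmas absent from morphgnt.csv get NaN.
DF_LEXEMES["Part of Speech"] = pd.Series(morphgnt_parts_of_speech)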

# print("===== DF_LEXEMES")
# print(DF_LEXEMES.__class__.__name__)
# print("-----")
# pprint(vars(DF_LEXEMES))
# print("-----")
# pprint(DF_LEXEMES)
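
The lookup above can be tried on a small scale. The following is a minimal sketch with hypothetical data standing in for `morphgnt.csv` and the lexemes table. It uses `groupby(...).first()` in place of the dictionary built in the cell above (note that `first()` skips missing values), and adds an optional check for lemmas that carry more than one Part of Speech value.

```python
import pandas as pd

# Hypothetical stand-in for morphgnt.csv: one row per token.
tokens = pd.DataFrame(
    {
        "Lemma": ["λόγος", "λόγος", "καί", "καί"],
        "Part of Speech": ["noun", "noun", "conjunction", "adverb"],
    }
)

# One value per lemma: the first non-missing value encountered.
pos_by_lemma = tokens.groupby("Lemma")["Part of Speech"].first()

# Hypothetical lexemes table; "θεός" has no tokens in the sample.
lexemes = pd.DataFrame(index=pd.Index(["λόγος", "καί", "θεός"], name="Lemma"))
lexemes["Part of Speech"] = pos_by_lemma  # aligned on the Lemma index

# Lemmas whose tokens disagree on Part of Speech.
distinct_counts = tokens.groupby("Lemma")["Part of Speech"].nunique()
ambiguous = distinct_counts[distinct_counts > 1]

print(lexemes)
print(ambiguous)
```

Because the assignment aligns on the index, a lemma with no rows in the token table ends up with `NaN`, and `ambiguous` lists the lemmas for which taking the first value discards information.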

## Write lexemes.csv File

In [3]:
DF_LEXEMES.to_csv("lexemes.csv", index_label="Lemma")
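
To read the file back with the lemma as the index, pass `index_col="Lemma"` to `pd.read_csv`. A minimal round-trip sketch, using an in-memory buffer and a hypothetical two-row table instead of the real `lexemes.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row table standing in for DF_LEXEMES.
df = pd.DataFrame(
    {"Gloss": ["word", "and"]},
    index=pd.Index(["λόγος", "καί"], name="Lemma"),
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer, index_label="Lemma")

# Rewind and read back, restoring the Lemma column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Lemma")

print(restored)
```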

## Utility - Obtain all unique combinations of Part of Speech Code and Dodson Part of Speech Code

In [4]:
# Count the lexemes for each (Part of Speech Code, Dodson Part of Speech Code) pair.
GB_POS = (
    DF_LEXEMES.groupby(["Part of Speech Code", "Dodson Part of Speech Code"])
    .size()
    .reset_index(name="Count")
)

# print("===== GB_POS")
# print(GB_POS.__class__.__name__)
# print("-----")
# pprint(vars(GB_POS))
# print("-----")
# pprint(GB_POS)

GB_POS

| | Part of Speech Code | Dodson Part of Speech Code | Count |
|---|---|---|---|
| 0 | A | A | 718 |
| 1 | A | A,A-NUI | 1 |
| 2 | A | A,ADV | 4 |
| 3 | A | A,ADV-C | 2 |
| 4 | A | A,N:F,N:M | 1 |
| ... | ... | ... | ... |
| 106 | X/INJ | INJ | 3 |
| 107 | X/INJ | INJ,N-OI | 1 |
| 108 | X/PRT-I | PRT-I,PRT-N | 1 |
| 109 | X/V | INJ | 1 |
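
The same counting pattern can be tried on a small table. This is a minimal sketch on hypothetical rows, not on the real lexemes data. It also filters for the pairs where the two codes differ, which is one possible way to use the result.

```python
import pandas as pd

# Hypothetical rows standing in for DF_LEXEMES.
df = pd.DataFrame(
    {
        "Part of Speech Code": ["A", "A", "A", "N"],
        "Dodson Part of Speech Code": ["A", "A", "A,ADV", "N:F"],
    }
)

keys = ["Part of Speech Code", "Dodson Part of Speech Code"]

# One row per distinct pair, with the number of lexemes sharing it.
counts = df.groupby(keys).size().reset_index(name="Count")

# Pairs in which the two coding schemes do not use the same code.
mismatched = counts[counts[keys[0]] != counts[keys[1]]]

print(counts)
print(mismatched)
```

Note that `groupby` drops rows with a missing value in either key by default, so lexemes lacking one of the two codes are not counted.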
