# Preprocess Lexemes

Converts the lexemes YAML file into a standard `lexemes.csv` file, then adds a part-of-speech column derived from `morphgnt.csv`.
The table below maps each source field to its output column.

| lexemes.yaml | morphgnt.csv | lexemes.csv |
|--------------|--------------|-------------|
| pos | | PART_OF_SPEECH_CODE |
| full-citation-form | | FULL_CITATION_FORM |
| bdag-headword | | BDAG_ENTRY |
| danker-entry | | DANKER_ENTRY |
| dodson-entry | | DODSON_ENTRY |
| mounce-headword | | MOUNCE_ENTRY |
| strongs | | STRONGS |
| gk | | GK |
| dodson-pos | | DODSON_PART_OF_SPEECH_CODE |
| gloss | | GLOSS |
| mounce-morphcat | | MOUNCE_MORPHCAT |
| | PART_OF_SPEECH | PART_OF_SPEECH |

## Define File Names

In [1]:
INPUT_FILE_NAME = "../BibleCore/Resources/lexemes.yaml"
OUTPUT_FILE_NAME = "lexemes.csv"
MORPHGNT_CSV = "morphgnt.csv"

## Parse Lexemes File into DataFrame (DF_LEXEMES)

Columns:

* Part of Speech Code (pos)
* Full Citation Form (full-citation-form)
* BDAG Entry (bdag-headword)
* Danker Entry (danker-entry)
* Dodson Entry (dodson-entry)
* Mounce Entry (mounce-headword)
* Strongs (strongs)
* GK (gk)
* Dodson Part of Speech Code (dodson-pos)
* Gloss (gloss)
* Mounce MorphCat (mounce-morphcat)

In [2]:
import pandas as pd
import yaml
from pprint import pprint
import biblesdk.constants as bc

with open(INPUT_FILE_NAME, "r", encoding="utf-8") as file:
    yaml_data = yaml.safe_load(file)

DF_LEXEMES = pd.DataFrame.from_dict(yaml_data, orient="index")
DF_LEXEMES.index.name = bc.LEMMA
DF_LEXEMES.rename(
    columns={
        "pos": bc.PART_OF_SPEECH_CODE,
        "full-citation-form": bc.FULL_CITATION_FORM,
        "bdag-headword": bc.BDAG_ENTRY,
        "danker-entry": bc.DANKER_ENTRY,
        "dodson-entry": bc.DODSON_ENTRY,
        "mounce-headword": bc.MOUNCE_ENTRY,
        "strongs": bc.STRONGS,
        "gk": bc.GK,
        "dodson-pos": bc.DODSON_PART_OF_SPEECH_CODE,
        "gloss": bc.GLOSS,
        "mounce-morphcat": bc.MOUNCE_MORPHCAT,
    },
    inplace=True,
)

# print("===== DF_LEXEMES")
# print(DF_LEXEMES.__class__.__name__)
# print("-----")
# pprint(vars(DF_LEXEMES))
# print("-----")
# pprint(DF_LEXEMES)
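To illustrate what `from_dict(..., orient="index")` does with this kind of input, here is a minimal sketch on toy data. The dictionary below is a hypothetical stand-in for what `yaml.safe_load` returns, and the column labels are assumptions rather than the actual values of the `bc` constants.

```python
import pandas as pd

# Hypothetical stand-in for the parsed YAML: one top-level key per lemma,
# each mapping to that lemma's fields. Values are illustrative only.
sample = {
    "λόγος": {"pos": "N", "gloss": "word", "strongs": 3056},
    "ἀγαθός": {"pos": "A", "gloss": "good"},  # deliberately has no "strongs"
}

# orient="index" turns each top-level key into a row label.
df = pd.DataFrame.from_dict(sample, orient="index")
df.index.name = "lemma"  # assumed label; the notebook uses bc.LEMMA

# Assumed output labels; the notebook takes them from biblesdk.constants.
df = df.rename(columns={"pos": "Part of Speech Code", "gloss": "Gloss"})

print(df)
```

Fields missing from an entry come out as NaN, so a sparse YAML file still yields a rectangular table.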

## Add Part of Speech

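The Part of Speech column is derived from `morphgnt.csv`: each lemma receives the first part-of-speech value recorded for it in that file. Lemmas that do not occur in `morphgnt.csv` are left empty (NaN).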

In [3]:
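DF_MORPHGNT = pd.read_csv(MORPHGNT_CSV, index_col=bc.INDEX)
# Group by a scalar key so each group name is the lemma string itself,
# regardless of pandas version.
GB_LEMMA = DF_MORPHGNT.groupby(bc.LEMMA)

# Map each lemma to the first part of speech recorded for it.
morphgnt_parts_of_speech = {
    name: group[bc.PART_OF_SPEECH].unique()[0] for name, group in GB_LEMMA
}

# Assignment aligns on the lemma index; lemmas absent from morphgnt.csv get NaN.
DF_LEXEMES[bc.PART_OF_SPEECH] = pd.Series(morphgnt_parts_of_speech)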

# print("===== DF_LEXEMES")
# print(DF_LEXEMES.__class__.__name__)
# print("-----")
# pprint(vars(DF_LEXEMES))
# print("-----")
# pprint(DF_LEXEMES)
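Because only the first value per lemma is kept, it can be useful to see which lemmas carry more than one part of speech and which end up without any. The following is a rough sketch on toy data; the column names `lemma` and `part_of_speech` are assumptions standing in for `bc.LEMMA` and `bc.PART_OF_SPEECH`, and the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical token-level rows in the shape of morphgnt.csv.
tokens = pd.DataFrame(
    {
        "lemma": ["καί", "καί", "λόγος", "καί"],
        "part_of_speech": ["conjunction", "adverb", "noun", "conjunction"],
    }
)

by_lemma = tokens.groupby("lemma")["part_of_speech"]

# First non-missing value per lemma; this matches unique()[0]
# whenever the column has no missing values.
first_pos = by_lemma.first()

# Lemmas with more than one distinct value lose information in this step.
distinct = by_lemma.nunique()
ambiguous = distinct[distinct > 1].index.tolist()

# Assigning a Series to a DataFrame column aligns on the index,
# so a lemma that never appears in the token data gets NaN.
lexemes = pd.DataFrame(index=["λόγος", "καί", "ἀγάπη"])
lexemes["part_of_speech"] = first_pos

print(ambiguous)
print(lexemes)
```

Whether an ambiguous lemma should keep its first value, its most frequent value, or all values is a design choice this sketch does not settle.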

## Write lexemes.csv File

In [4]:
DF_LEXEMES.to_csv(OUTPUT_FILE_NAME, index_label=bc.LEMMA)
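A quick round trip can confirm that the lemma index survives the write. This sketch uses an in-memory buffer and toy data instead of the real file; the labels `Lemma`, `Gloss`, and `Strongs` are assumptions, not the actual values of the `bc` constants.

```python
import io

import pandas as pd

# Toy frame with a named index and one missing value.
df = pd.DataFrame(
    {"Gloss": ["word", "good"], "Strongs": [3056, None]},
    index=pd.Index(["λόγος", "ἀγαθός"], name="Lemma"),
)

# Write to an in-memory buffer instead of lexemes.csv.
buffer = io.StringIO()
df.to_csv(buffer, index_label="Lemma")

# Read it back, using the same label as the index column.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Lemma")

print(restored)
```

Missing values are written as empty fields and read back as NaN.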

## Utility - Obtain all unique combinations of Part of Speech Code and Dodson Part of Speech Code

In [5]:
GB_POS = DF_LEXEMES.groupby([bc.PART_OF_SPEECH_CODE, bc.DODSON_PART_OF_SPEECH_CODE])
GB_POS = GB_POS.size()
GB_POS = GB_POS.reset_index()
GB_POS = GB_POS.rename(columns={0: bc.COUNT})

# print("===== GB_POS")
# print(GB_POS.__class__.__name__)
# print("-----")
# pprint(vars(GB_POS))
# print("-----")
# pprint(GB_POS)

GB_POS

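| | Part of Speech Code | Dodson Part of Speech Code | Count |
|-----|---------------------|----------------------------|-------|
| 0 | A | A | 718 |
| 1 | A | A,A-NUI | 1 |
| 2 | A | A,ADV | 4 |
| 3 | A | A,ADV-C | 2 |
| 4 | A | A,N:F,N:M | 1 |
| ... | ... | ... | ... |
| 106 | X/INJ | INJ | 3 |
| 107 | X/INJ | INJ,N-OI | 1 |
| 108 | X/PRT-I | PRT-I,PRT-N | 1 |
| 109 | X/V | INJ | 1 |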
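The same counting pattern can be sketched on toy data. This version names the count column directly through `reset_index(name=...)` and sorts by frequency; the column names `pos`, `dodson_pos`, and `count` are assumptions standing in for the `bc` constants. Note that `groupby` drops rows whose key is missing by default, so lexemes without a Dodson code do not appear in the result.

```python
import pandas as pd

# Hypothetical lexeme rows; the last one has no Dodson code.
df = pd.DataFrame(
    {
        "pos": ["A", "A", "A", "N", "N", "V"],
        "dodson_pos": ["A", "A", "A,ADV", "N:M", "N:M", None],
    }
)

counts = (
    df.groupby(["pos", "dodson_pos"])  # rows with a missing key are dropped
    .size()
    .reset_index(name="count")  # names the size column in one step
    .sort_values("count", ascending=False, ignore_index=True)
)

print(counts)
```

Passing `dropna=False` to `groupby` would keep the rows with a missing code as their own group, if those should be counted too.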
