<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/macula_data_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple TSV Data Exploration Notebook
This notebook loads and explores TSV data files for Greek. Hebrew support is not yet implemented.

## Setup
First, import the necessary libraries.

In [None]:
import pandas as pd
import os
import random

## Download and Load Data
Download the TSV file with the `!wget` command if it is not already present, then load it with pandas.

Choose which language data to load

In [None]:
language = "Greek"  # Change this to "Hebrew" to load Hebrew data - not yet implemented

In [None]:
if language == "Greek":
    # Download only if the file is not already in the working directory
    if not os.path.exists("macula-greek.tsv"):
        !wget -O macula-greek.tsv https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv
    tsv_file = "macula-greek.tsv"
elif language == "Hebrew":
    if not os.path.exists("macula-hebrew.tsv"):
        # <URL_FOR_HEBREW_TSV> is still a placeholder, so this branch does not run yet
        !wget -O macula-hebrew.tsv <URL_FOR_HEBREW_TSV>  # FIXME: need to use Git LFS for this one
    tsv_file = "macula-hebrew.tsv"
else:
    raise ValueError("Invalid language choice")
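
Outside Colab, `wget` may not be installed. As a minimal sketch, the same check-then-download step can be written with the Python standard library. The helper name `ensure_local_copy` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def ensure_local_copy(url: str, filename: str) -> str:
    """Download `url` to `filename` unless the file already exists.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename


# Example usage (commented out so the sketch does not hit the network):
# tsv_file = ensure_local_copy(
#     "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv",
#     "macula-greek.tsv",
# )
```

Because the helper returns the filename, its result can be assigned straight to `tsv_file`.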

Load the chosen TSV file

In [None]:
data = pd.read_csv(tsv_file, sep="\t")
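
To see what `pd.read_csv` does with tab-separated input without downloading the full file, here is a small sketch that reads an in-memory sample. The three rows are hand-made for illustration and include only three of the real columns; they are not an excerpt of the file itself. Empty fields come back as `NaN`.

```python
import io

import pandas as pd

# Hand-made three-row sample (illustrative only, not the real macula-greek.tsv).
sample_tsv = (
    "text\tclass\tmood\n"
    "Βίβλος\tnoun\t\n"
    "ἐλθὼν\tverb\tparticiple\n"
    "πίωσιν\tverb\tsubjunctive\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# The noun has no mood, so pandas stores NaN in that cell.
print(sample)
print(sample["mood"].isna().tolist())
```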

## Column Information
The cell below builds an overview table that lists each column, the number of unique values it contains, a sample of those values, and a short description of what the column means.

In [None]:
# Create an empty list to store dictionaries for each column
column_info_list = []

# Iterate through each column in the data DataFrame
for column in data.columns:
    num_unique_values = data[column].nunique()
    unique_values = data[column].unique()
    column_info = {"column_name": column,
                   "num_unique_values": num_unique_values,
                   "unique_values": unique_values}
    column_info_list.append(column_info)

# Convert the list of dictionaries into a DataFrame (one row per column of `data`)
overview = pd.DataFrame(column_info_list)

# Create a dictionary with attribute descriptions
attribute_descriptions = {
    "after": "Encodes the following character, including a blank space.",
    "articular": "'true' if the word has an article (i.e., modified by the word 'the').",
    "case": "Grammatical case: nominative, genitive, dative, accusative, or vocative",
    "class": "On words, the class is the word's part of speech",
    "cltype": "Explicitly marks Verbless Clauses, Verb Elided Clauses, and Minor Clauses",
    "degree": "A derivative lexical category, indicating the degree of the adjective",
    "discontinuous": "'true' if the word is discontinuous with respect to sentence order due to reordering in the syntax tree",
    "domain": "Semantic domain information from the Semantic Dictionary of Biblical Greek (SDBG)",
    "frame": "Frames of verbs, refers to the arguments of the verb",
    "gender": "Grammatical gender values",
    "gloss": "SIL data, not Berean",
    "lemma": "Form of the word as it appears in a dictionary.",
    "ln": "Short for Louw-Nida, representing the semantic domain entry in Johannes P. Louw and Eugene Albert Nida, Greek-English Lexicon of the New Testament: Based on Semantic Domains (New York: United Bible Societies, 1996).",
    "mood": "Grammatical mood",
    "morph": "Morphological parsing codes",
    "normalized": "The normalized form of the token (i.e., no trailing or leading punctuation or accent shifting depending on context)",
    "number": "Grammatical number",
    "person": "Grammatical person",
    "ref": "Verse!word reference to this edition of the Nestle1904 text by USFM id",
    "referent": "The xml:id of the node to which a pronoun (i.e., 'he') refers. Note that some of these IDs are not word IDs but rather phrase or clause IDs.",
    "role": "The clause-level role of the word.",
    "strong": "Strong's number for the lemma",
    "subjref": "The xml:id of the node that is the implied subject of a verb (for verbs without an explicit subject). Note that some of these IDs are not word IDs but rather phrase or clause IDs.",
    "tense": "Grammatical tense form",
    "text": "Text content associated with the ID",
    "type": "Indicates different types of pronominals",
    "voice": "Grammatical voice",
    "xml:id": "XML ids occur on every word and encode the corpus ('n' for New Testament), the book (40 for Matthew), the chapter (001), verse (001), and word (001)."
}

# Add descriptions to the overview dataframe
overview["description"] = overview["column_name"].map(attribute_descriptions)

# Display the overview dataframe with descriptions
print('Note that empty or "nan" values occur when a column does not apply to a given word')
overview


Note that empty or "nan" values occur when a column does not apply to a given word


| | column_name | num_unique_values | unique_values | description |
|---|---|---|---|---|
| 0 | xml:id | 137779 | [n40001001001, n40001001002, n40001001003, n40... | XML ids occur on every word and encode the cor... |
| 1 | ref | 137779 | [MAT 1:1!1, MAT 1:1!2, MAT 1:1!3, MAT 1:1!4, M... | Verse!word reference to this edition of the Ne... |
| 2 | role | 10 | [nan, s, v, vc, p, adv, o, aux, io, o2, apposi... | The clause-level role of the word. |
| 3 | class | 11 | [noun, verb, det, conj, pron, prep, adj, adv, ... | On words, the class is the word's part of speech |
| 4 | type | 9 | [common, proper, nan, personal, relative, demo... | Indicates different types of pronominals |
| 5 | gloss | 20021 | [[The] book, of [the] genealogy, of Jesus, Chr... | SIL data, not Berean |
| 6 | text | 19477 | [Βίβλος, γενέσεως, Ἰησοῦ, Χριστοῦ, υἱοῦ, Δαυεὶ... | Text content associated with the ID |
| 7 | after | 17 | [ , ., ,, ·, ;, —, ὶ, ς, ε, χ, ὸ, ι, α, ί, ὁ, ... | Encodes the following character, including a b... |
| 8 | lemma | 5401 | [βίβλος, γένεσις, Ἰησοῦς, Χριστός, υἱός, Δαυίδ... | Form of the word as it appears in a dictionary. |
| 9 | normalized | 18480 | [Βίβλος, γενέσεως, Ἰησοῦ, Χριστοῦ, υἱοῦ, Δαυεί... | The normalized form of the token (i.e., no tra... |
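
The `xml:id` and `ref` descriptions imply a fixed layout that can be unpacked with a regular expression. The following is a minimal sketch, assuming word IDs always follow the pattern seen in the sample values (one corpus letter, a two-digit book, then three digits each for chapter, verse, and word) and that references always look like `MAT 1:1!1`. The function names `parse_xml_id` and `parse_ref` are hypothetical.

```python
import re

# Assumed layout, inferred from sample values such as "n40001001001".
XML_ID_PATTERN = re.compile(r"^([a-z])(\d{2})(\d{3})(\d{3})(\d{3})$")
# Assumed layout, inferred from sample values such as "MAT 1:1!1".
REF_PATTERN = re.compile(r"^(\w+) (\d+):(\d+)!(\d+)$")


def parse_xml_id(xml_id: str) -> dict:
    """Split a word-level xml:id into corpus, book, chapter, verse, and word."""
    match = XML_ID_PATTERN.match(xml_id)
    if match is None:
        raise ValueError(f"Unexpected xml:id format: {xml_id!r}")
    corpus, book, chapter, verse, word = match.groups()
    return {
        "corpus": corpus,
        "book": int(book),
        "chapter": int(chapter),
        "verse": int(verse),
        "word": int(word),
    }


def parse_ref(ref: str) -> dict:
    """Split a verse!word reference into book, chapter, verse, and word."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"Unexpected ref format: {ref!r}")
    book, chapter, verse, word = match.groups()
    return {
        "book": book,
        "chapter": int(chapter),
        "verse": int(verse),
        "word": int(word),
    }


print(parse_xml_id("n40001001002"))
print(parse_ref("MAT 1:1!2"))
```

As the column descriptions note, `referent` and `subjref` can hold phrase or clause IDs rather than word IDs, so this word-level pattern may not match every value in those columns.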


## Inspect a column

Set `column_name` to the column you want to inspect, then view up to five random non-null rows from the dataset as examples.

In [None]:
# Specify the column name in a variable
column_name = "mood"  # Replace with your desired column name

# Filter out null values from the specified column
non_null_data = data[data[column_name].notnull()]

# If there are fewer than 5 non-null rows, adjust the sample size accordingly
num_non_null_rows = len(non_null_data)
sample_size = min(5, num_non_null_rows)

# Select 5 (or fewer) random non-null rows
random_rows = random.sample(range(num_non_null_rows), sample_size)
random_cells = non_null_data[[column_name, 'text']].iloc[random_rows]

# Get description from overview dataset
column_description = overview[overview['column_name'] == column_name]['description'].values[0]

print(f"COLUMN NAME: {column_name}\nCOLUMN DESCRIPTION: {column_description}\n")
print(f"Five (or fewer) random non-null cells from the '{column_name}' and 'text' columns:")
print(random_cells)


COLUMN NAME: mood
COLUMN DESCRIPTION: Grammatical mood

Five (or fewer) random non-null cells from the 'mood' and 'text' columns:
              mood        text
28035   participle       ἐλθὼν
66956   indicative  ἐμελέτησαν
29496  subjunctive      πίωσιν
44827   participle   ἐρχόμενος
4732    indicative  ἠνεῴχθησαν
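
Random rows give a feel for individual cells, while a frequency table shows how the values are distributed. Here is a minimal sketch using `Series.value_counts`, run on a hand-made sample so that it stands alone. In the notebook you would pass `data[column_name]` instead. The counts in this sample say nothing about the real dataset.

```python
import pandas as pd

# Hand-made sample for illustration; in the notebook, use data[column_name].
moods = pd.Series(
    [
        "participle",
        "indicative",
        "subjunctive",
        "participle",
        "indicative",
        None,
        None,
        "indicative",
    ],
    name="mood",
)

# dropna=False keeps an entry for missing values,
# i.e. words to which the column does not apply.
counts = moods.value_counts(dropna=False)
print(counts)
```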
