<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/semantic_domains_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Semantic Domains in Greek with a Simple TSV
This notebook is designed to load and explore the MACULA Greek semantic domains.

## Setup
Import the necessary libraries.

In [1]:
import pandas as pd
import os

## Download and Load Data
Download the data files with the `!wget` shell command, skipping any file that already exists locally. The cells that follow load them with pandas and `json`.

We need two files, namely the TSV data and also the dictionary of semantic domain labels, since semantic domains are encoded as numbers.

In [3]:
# Download each file only if it is not already in the working directory
if not os.path.exists('macula-greek.tsv'):
    !wget -q 'https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv'
if not os.path.exists('marble-domain-label-mapping.json'):
    !wget -q 'https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/sources/MARBLE/SDBG/marble-domain-label-mapping.json'
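The `!wget` syntax only works inside Jupyter or Colab. For running the same step as a plain Python script, a minimal sketch using the standard library might look like the following. The helper name `download_if_missing` and its `fetch` parameter are hypothetical and are not part of the original notebook.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filename=None, fetch=urlretrieve):
    """Download url to filename unless the file already exists.

    filename defaults to the last path segment of the URL.
    fetch is injectable so the helper can be exercised without network access.
    """
    filename = filename or url.rsplit("/", 1)[-1]
    if not os.path.exists(filename):
        fetch(url, filename)
    return filename


# Example usage (commented out because it needs network access):
# download_if_missing('https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv')
# download_if_missing('https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/sources/MARBLE/SDBG/marble-domain-label-mapping.json')
```

Returning the filename makes it easy to pass the result straight to `pd.read_csv` or `open`.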

Load the TSV file into a pandas DataFrame.

In [4]:
data = pd.read_csv('macula-greek.tsv', sep="\t")
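Domain codes are zero-padded numeric strings, so it is worth checking how pandas typed the `domain` column after loading. The sketch below uses a tiny in-memory TSV with invented rows (not taken from MACULA; only the column names mirror the ones used in this notebook) to show what happens when a column holds nothing but numeric-looking codes, and how `dtype` keeps them as text.

```python
import io

import pandas as pd

# Toy TSV with invented rows; only the column names mirror this notebook.
sample_tsv = (
    "text\tgloss\tdomain\tref\n"
    "word1\tgloss1\t001004\tBOOK 1:1!1\n"
    "word2\tgloss2\t001006\tBOOK 1:1!2\n"
)

# Default type inference parses the codes as integers and drops the zero padding.
inferred = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Forcing the column to str keeps the codes usable as dictionary keys.
as_text = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype={"domain": str})

print(inferred["domain"].tolist())
print(as_text["domain"].tolist())
```

The real file may already load this column as strings, for example if it contains non-numeric values. Checking `data['domain'].dtype` confirms this either way.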

## Create Semantic Domain Lookup Dictionary

In [5]:
# Import domain-label mapping
import json

# Open the JSON file
with open('marble-domain-label-mapping.json', 'r') as f:

    # Load the contents of the file as a dictionary
    domain_labels = json.load(f)

domain_labels['missing'] = 'no domain'

# Display the first seven entries of the resulting dictionary
from itertools import islice

for code, label in islice(domain_labels.items(), 7):
    print(code, label)

001 Geographical Objects and Features
001001 Universe, Creation
001002 Regions Above the Earth
001003 Regions Below the Surface of the Earth
001004 Heavenly Bodies
001005 Atmospheric Objects
001006 The Earth's Surface
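The printed entries suggest a two-level scheme: a three-digit key names a top-level domain, and a six-digit key that starts with the same three digits names one of its subdomains. That reading is inferred from the output above rather than from the MARBLE documentation, so treat the following as a sketch under that assumption. The function name `describe_domain` is hypothetical.

```python
# A few entries copied from the printed output above.
sample_labels = {
    "001": "Geographical Objects and Features",
    "001001": "Universe, Creation",
    "001004": "Heavenly Bodies",
}


def describe_domain(code, labels):
    """Return 'Domain > Subdomain' for a six-digit code, or just the domain for a three-digit code."""
    # Assumption: the first three digits identify the top-level domain.
    parent = labels.get(code[:3], "unknown domain")
    if len(code) <= 3:
        return parent
    return f"{parent} > {labels.get(code, 'unknown subdomain')}"


print(describe_domain("001004", sample_labels))
```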


## Search Domains by Label

Define a helper that finds every domain whose label contains a given substring, then returns the first `top_n` rows of the dataset tagged with one of those domains.

In [17]:
def search_domains(data, domain_labels, label_substring, top_n):
    """
    Searches the values in domain_labels for matching substrings, and returns the top_n rows in data that match the domain label.

    Parameters:
    data (pandas.DataFrame): The MACULA Greek DataFrame to search
    domain_labels (dict): A dictionary where the keys are numeric strings and the values are human readable labels
    label_substring (str): The substring to search for in the domain labels
    top_n (int): The number of matching rows to return

    Returns:
    pandas.DataFrame: The top_n rows in data that match the domain label
    """

    label_substring_clean = ''.join([c for c in label_substring.lower() if c.isalpha()])
    # Find all the matching domain labels
    matching_domains = []
    for label in domain_labels.values():
        label_clean = ''.join([c for c in label.lower() if c.isalpha()])
        if label_substring_clean in label_clean:
            matching_domains.append(label)

    # Filter the data for rows where the domain label matches
    matching_rows = data[data['domain'].isin([k for k, v in domain_labels.items() if v in matching_domains])].copy()

    # Replace the 'domain' column with its corresponding label from domain_labels
    matching_rows['domain'] = matching_rows['domain'].apply(lambda domain: domain_labels[domain])

    # Return the top_n matching rows with only the desired columns
    return matching_rows[['text', 'gloss', 'domain', 'ref']].head(top_n)
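For comparison, here is a more compact sketch of the same idea, run on toy data so the behavior is easy to verify. It collects the matching codes directly and uses `Series.map` for the label lookup. The names `normalize` and `search_domains_compact` are hypothetical, and the toy rows are invented for illustration only.

```python
import pandas as pd


def normalize(text):
    """Lowercase text and keep only alphabetic characters."""
    return "".join(c for c in text.lower() if c.isalpha())


def search_domains_compact(data, domain_labels, label_substring, top_n):
    """Return the first top_n rows whose domain label contains label_substring."""
    needle = normalize(label_substring)
    # Collect the codes whose normalized label contains the normalized search term.
    matching_codes = {
        code for code, label in domain_labels.items() if needle in normalize(label)
    }
    rows = data[data["domain"].isin(matching_codes)].copy()
    # Swap each code for its human-readable label.
    rows["domain"] = rows["domain"].map(domain_labels)
    return rows[["text", "gloss", "domain", "ref"]].head(top_n)


# Toy inputs: the labels come from the printed dictionary, the rows are invented.
toy_labels = {"001004": "Heavenly Bodies", "001006": "The Earth's Surface"}
toy_data = pd.DataFrame(
    {
        "text": ["word1", "word2", "word3"],
        "gloss": ["gloss1", "gloss2", "gloss3"],
        "domain": ["001004", "001006", "001004"],
        "ref": ["BOOK 1:1!1", "BOOK 1:1!2", "BOOK 1:2!1"],
    }
)

result = search_domains_compact(toy_data, toy_labels, "Earth's", 5)
print(result)
```

Because both the search term and the labels are normalized, punctuation and case in the query do not affect matching.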


In [29]:
# Search for an English word (e.g., 'earth', 'save')
search_string = 'save'
search_domains(data, domain_labels, search_string, 10)

Unnamed: 0,text,gloss,domain,ref
359,σώσει,will save,Save in a Religious Sense,MAT 1:21!12
10532,ἀπόληται,should perish,Save in a Religious Sense,MAT 18:14!13
11334,σωθῆναι,to be saved,Save in a Religious Sense,MAT 19:25!11
24935,σωθῆναι,to be saved,Save in a Religious Sense,MRK 10:26!11
29471,σωθήσεται,will be saved,Save in a Religious Sense,MRK 16:16!5
29575,σωτηρίας,,Save in a Religious Sense,MRK 16:99!33
30706,σωτηρίας,of salvation,Save in a Religious Sense,LUK 1:77!4
30919,Σωτήρ,a Savior,Save in a Religious Sense,LUK 2:11!5
31250,σωτήριόν,salvation,Save in a Religious Sense,LUK 2:30!7
31721,σωτήριον,salvation,Save in a Religious Sense,LUK 3:6!6


In [27]:
# Search loop
search_string = ''
number_of_results = 5
while search_string != 'q':
    search_string = input('Enter an english word to search for: ("q" to stop this loop): ')
    if search_string != 'q':
        print(search_domains(data, domain_labels, search_string, number_of_results))


Enter an english word to search for: ("q" to stop this loop): building
            text       gloss                        domain          ref
614       οἰκίαν       house                     Buildings   MAT 2:11!5
1120    ἀποθήκην        barn                     Buildings  MAT 3:12!20
1295   πτερύγιον    pinnacle  Parts and Areas of Buildings   MAT 4:5!15
1297       ἱεροῦ      temple                     Buildings   MAT 4:5!17
1584  συναγωγαῖς  synagogues                     Buildings  MAT 4:23!10
Enter an english word to search for: ("q" to stop this loop): q
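The results for `building` span two domains because the search is a substring match over labels. To see which labels a term will hit before filtering the data, a small helper along these lines can be useful. The function name `matching_labels` and the dictionary keys are hypothetical placeholders; only the two building labels are taken from the output above.

```python
def matching_labels(domain_labels, label_substring):
    """Return the labels whose normalized text contains the normalized search term."""
    def normalize(text):
        # Lowercase and keep letters only, mirroring the cleaning in search_domains.
        return "".join(c for c in text.lower() if c.isalpha())

    needle = normalize(label_substring)
    return [label for label in domain_labels.values() if needle in normalize(label)]


# Placeholder keys; the real numeric codes for these labels are not shown in this notebook.
example_labels = {
    "code_a": "Buildings",
    "code_b": "Parts and Areas of Buildings",
    "code_c": "Heavenly Bodies",
}

print(matching_labels(example_labels, "building"))
```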
