<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/expand_macula_data_and_query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Empowering Bots with Data

This notebook shows how to build a question-answering system by pairing an OpenAI large language model (LLM) with a specific knowledge base, in this case the Bible. The combined system aims to answer Bible-related queries accurately and to demonstrate what LLMs can do when paired with structured data sources, which give them a form of long-term memory and greater specificity.

The main components of this notebook are:

1. **Building a Bible QA System**: A question-answering system grounded in the Bible is created. The Bible text is vectorized and stored in a vector store. The QA system is built with the `RetrievalQA` chain from the `langchain.chains` module, which retrieves relevant passages from the vector store and hands them to the LLM to reason over.

2. **Initializing an LLM Agent**: An LLM Agent is initialized with the Bible QA system as a tool. This allows the agent to utilize the Bible QA system when answering questions, essentially enabling it to "remember" information from the Bible and reason about it. The LLM Agent acts as a router and decision-maker, determining how and when to use the Bible QA system based on the input question.

3. **Running Queries**: The combined system is then used to answer complex, multi-step questions related to the Bible. The LLM Agent's ability to perform multi-step reasoning is showcased, with the agent using the Bible QA system to answer individual parts of the question and then combining those answers into a final response.

This notebook thus exemplifies the use of LLMs as a reasoning tool, enhanced by a specific knowledge database. This approach not only improves the accuracy of responses to specific queries but also provides the LLM with a form of long-term memory, allowing it to consistently access and reason about a large, fixed body of knowledge.
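
The retrieval step behind component 1 can be pictured with a toy example. The sketch below is not the notebook's pipeline: it swaps the real embedding model and vector store for a bag-of-words counter and an in-memory list, and the three English verse renderings are illustrative sample data only. All names here (`embed`, `cosine`, `store`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical miniature "vector store": (reference, text, vector) triples.
verses = {
    "JHN 11:35": "Jesus wept.",
    "JHN 1:1": "In the beginning was the Word, and the Word was with God.",
    "MAT 5:9": "Blessed are the peacemakers, for they shall be called sons of God.",
}
store = [(ref, text, embed(text)) for ref, text in verses.items()]


def retrieve(question, k=1):
    """Return the k stored passages most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[2]), reverse=True)
    return [(ref, text) for ref, text, _ in ranked[:k]]


print(retrieve("Who wept?"))
```

In a real retrieval chain the passages returned here would be placed into the LLM prompt as context, so the quality of the final answer depends heavily on how well this ranking step works.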

In [6]:
!pip install langchain openai chromadb tiktoken unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
import os
os.environ["LANGCHAIN_TRACING"] = "false"

# Provision data for context and "prosify"

In [62]:
import requests, json, re, os
import pandas as pd


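def download_file(url, file_name):
    """Download url and write the response body to file_name."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open(file_name, "wb") as file:
        file.write(response.content)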


# file1_url = "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv"
file1_url = "https://github.com/Clear-Bible/macula-greek/raw/feature/add-sentence-id-to-tsv/Nestle1904/TSV/macula-greek.tsv" # PR version with sentence IDs
file2_url = "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/sources/MARBLE/SDBG/marble-domain-label-mapping.json"
file1_name = "macula-greek.tsv"
file2_name = "marble-domain-label-mapping.json"

if file1_name not in os.listdir():
    download_file(file1_url, file1_name)

if file2_name not in os.listdir():
    download_file(file2_url, file2_name)

file_3_url = "https://raw.githubusercontent.com/Clear-Bible/speaker-quotations/main/json/SpeakerProjections-clear.json"
file_4_url = "https://raw.githubusercontent.com/Clear-Bible/speaker-quotations/main/json/character_detail.semantic_data.json"
file_3_name = "SpeakerProjections-clear.json"
file_4_name = "character_detail.semantic_data.json"  # stores info about each unique character id (unique string value)

if file_3_name not in os.listdir():
    download_file(file_3_url, file_3_name)

if file_4_name not in os.listdir():
    download_file(file_4_url, file_4_name)
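
The cell above repeats the same "download only if missing" check for each file. A minimal sketch of that pattern as one loop follows. The helper name `ensure_files` and its `fetch` parameter are hypothetical. The downloader is passed in as an argument so the logic can be exercised without network access.

```python
import os


def ensure_files(manifest, fetch, directory="."):
    """Call fetch(url, path) for every manifest entry missing from directory.

    manifest maps file names to URLs. Returns the names that were fetched.
    """
    fetched = []
    for file_name, url in manifest.items():
        path = os.path.join(directory, file_name)
        if not os.path.exists(path):
            fetch(url, path)
            fetched.append(file_name)
    return fetched
```

With the notebook's own names, a call might look like `ensure_files({file1_name: file1_url, file2_name: file2_url}, download_file)`. A second run then returns an empty list because the files are already on disk.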



In [63]:


# Import Macula Greek data
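# Read every column as a string so zero-padded codes (e.g. domain numbers) keep
# their leading zeros; the "*" key in `converters` is not a wildcard in pandas
mg = pd.read_csv(
    "macula-greek.tsv", index_col="xml:id", sep="\t", header=0, dtype=str
).fillna("missing")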
# add an 'id' column
mg["id"] = mg.index

# mg['domain'] = mg['domain'].astype(str).fillna('missing')

# Extract book, chapter, and verse into separate columns
mg[["book", "chapter", "verse"]] = mg["ref"].str.extract(r"(\d?[A-Z]+)\s(\d+):(\d+)")

# Add columns for book + chapter, and book + chapter + verse for easier grouping
mg["book_chapter"] = mg["book"] + " " + mg["chapter"].astype(str)
mg["book_chapter_verse"] = mg["book_chapter"] + ":" + mg["verse"].astype(str)




# Import domain-label mapping
# Open the JSON file
with open("marble-domain-label-mapping.json", "r") as f:
    # Load the contents of the file as a dictionary
    domain_labels = json.load(f)

domain_labels["missing"] = "no domain"
domain_labels["nan"] = "no domain"

# Use domain labels to create a new column


def get_domain_label(domain_string_number):
    labels = [domain_labels[label] for label in domain_string_number.split(" ")]
    return labels


mg["domain_label"] = mg["domain"].apply(get_domain_label)
mg.head()


# Create a dataframe from the SpeakerProjections-clear.json file
with open("SpeakerProjections-clear.json", "r") as f:
    speaker_projections = json.load(f)
    speaker_data = pd.DataFrame(speaker_projections)

# Create a dataframe from the character_detail.semantic_data.json file
with open("character_detail.semantic_data.json", "r") as f:
    # this data is an array of JSON objects
    character_detail = json.load(f)
    character_data = pd.DataFrame(character_detail)

# Transpose the DataFrame
transposed_speaker_data = speaker_data.transpose()

# Reset the index
transposed_speaker_data.reset_index(inplace=True)

# Rename the columns
transposed_speaker_data.columns = ["row_id", "instance_data", "projections"]

# Normalize the 'instance_data' column
flattened_instance_data = pd.json_normalize(transposed_speaker_data["instance_data"])

# Merge the normalized DataFrame with the original transposed DataFrame
merged_speaker_data = pd.concat(
    [transposed_speaker_data.drop(columns=["instance_data"]), flattened_instance_data],
    axis=1,
)

# Collect the expanded rows in a list and build the DataFrame once at the end
# (DataFrame.append is deprecated and was removed in pandas 2.0)
expanded_rows = []

# Iterate through the rows in the merged_speaker_data DataFrame
for idx, row in merged_speaker_data.iterrows():
    projections = row["projections"]

    # Iterate through the projections
    for proj_idx, projection in enumerate(projections):
        # Create a new row with the inherited speaker instance data
        new_row = row.drop("projections").to_dict()
        new_row.update(projection)

        # Set the row ID with the projection index
        new_row["row_id"] = f"{row['row_id']}|{proj_idx}"

        expanded_rows.append(new_row)

expanded_speaker_data = pd.DataFrame(expanded_rows)

# Extract the 'Text' and 'Id' values from each row's 'Words' data into
# 'tokens' and 'token_ids' columns
expanded_speaker_data["tokens"] = expanded_speaker_data["Words"].apply(
    lambda words: [word["Text"] for word in words]
)
expanded_speaker_data["token_ids"] = expanded_speaker_data["Words"].apply(
    lambda words: [word["Id"] for word in words]
)

# Create a dataframe from 'word_level_semantic_data.tsv'. First row is header
if "word_level_semantic_data.tsv" not in os.listdir():
    download_file('https://raw.githubusercontent.com/ryderwishart/biblical-machine-learning/main/gpt-inferences/word_level_semantic_data.tsv', 'word_level_semantic_data.tsv')
semantic_role_data = pd.read_csv("word_level_semantic_data.tsv", sep="\t", header=0)
semantic_role_data = semantic_role_data.fillna("")

semantic_role_data_summary = pd.DataFrame(
    [
        [
            column,
            semantic_role_data[column].unique(),
            len(semantic_role_data[column].unique()),
        ]
        for column in semantic_role_data.columns
    ],
    columns=["column_name", "unique_values", "num_unique_values"],
)

# Read roles.json and load it into a Python object
if 'roles.json' not in os.listdir():
    download_file('https://raw.githubusercontent.com/ryderwishart/biblical-machine-learning/main/gpt-inferences/roles.json', 'roles.json')
with open("roles.json", "r") as f:
    data = json.load(f)


def extract_data(node):
    node_data = {
        "node_category": node["node_category"],
        "node_head": node["node_head"],
        "field_name": node["field_name"],
        "field_type": node["field_type"],
        "forms": [],
        "lemmas": [],
        "glosses": [],
        "ids": [],
    }

    for word_data in node["words"].values():
        node_data["forms"].append(word_data["word"])
        node_data["lemmas"].append(word_data["lemma"])
        node_data["glosses"].append(word_data["gloss"])
        node_data["ids"].append(word_data["word_identifier"])

    return node_data


processed_data = [extract_data(node) for node in data]

semantic_role_wordings_lookup = pd.DataFrame(processed_data)

# for i in range(0, len(semantic_role_wordings_lookup)):
#     for j in semantic_role_wordings_lookup.iloc[i]['ids']:
#         if j.startswith('n5700102000'):
#             print(semantic_role_wordings_lookup.iloc[i].to_dict())

attribute_descriptions = {
    "after": "Encodes the following character, including a blank space.",
    "articular": "'true' if the word has an article (i.e., modified by the word 'the').",
    "case": "Grammatical case: nominative, genitive, dative, accusative, or vocative",
    "class": "On words, the class is the word's part of speech",
    "cltype": "Explicitly marks Verbless Clauses, Verb Elided Clauses, and Minor Clauses",
    "degree": "A derivative lexical category, indicating the degree of the adjective",
    "discontinuous": "'true' if the word is discontinuous with respect to sentence order due to reordering in the syntax tree",
    "domain": "Semantic domain information from the Semantic Dictionary of Biblical Greek (SDBG)",
    "frame": "Frames of verbs, refers to the arguments of the verb",
    "gender": "Grammatical gender values",
    "gloss": "SIL data, not Berean",
    "lemma": "Form of the word as it appears in a dictionary.",
    "ln": "The semantic domain entry in Louw and Nida's 'Greek-English Lexicon of the New Testament: Based on Semantic Domains'.",
    "mood": "Grammatical mood",
    "morph": "Morphological parsing codes",
    "normalized": "The normalized form of the token (i.e., no trailing or leading punctuation or accent shifting depending on context)",
    "number": "Grammatical number",
    "person": "Grammatical person",
    "ref": "Verse!word reference to this edition of the Nestle1904 text by USFM id",
    "referent": "The xml:id of the node to which a pronoun (i.e., 'he') refers. Note that some of these IDs are not word IDs but rather phrase or clause IDs.",
    "role": "The clause-level role of the word.",
    "strong": "Strong's number for the lemma",
    "subjref": "The xml:id of the node that is the implied subject of a verb (for verbs without an explicit subject). Note that some of these IDs are not word IDs but rather phrase or clause IDs.",
    "tense": "Grammatical tense form",
    "text": "Text content associated with the ID",
    "type": "Indicates different types of pronominals",
    "voice": "Grammatical voice",
    "xml:id": "XML ids occur on every word and encode the corpus ('n' for New Testament), the book (40 for Matthew), the chapter (001), verse (001), and word (001).",
}

discourse_types = {
    "Main clauses": {
        "description": "Main clauses are the top-level clauses in a sentence. They are the clauses that are not embedded in other clauses."
    },
    "Historical Perfect": {
        "description": "Highlights not the speech or act to which it refers but the event(s) that follow (DFNTG §12.2)."
    },
    "Specific Circumstance": {
        "description": "The function of ἐγενετο ‘it came about’ and an immediately following temporal expression varies with the author (see DFNTG §10.3). In Matthew’s Gospel, it usually marks major divisions in the book (e.g. Mt 7:28). In Luke-Acts, in contrast, ‘it picks out from the general background the specific circumstance for the foreground events that are to follow’ (ibid.), as in Acts 9:37 (see also Mt 9:10)."
    },
    "Verb Focus+": {
        "description": "Verb in final position in clause demonstrates verb focus."
    },
    "Articular Pronoun": {
        "description": "Articular pronoun, which often introduces an ‘intermediate step’ in a reported conversation."
    },
    "Topical Genitive": {
        "description": "A genitival constituent that is nominal is preposed within the noun phrase for two purposes: 1) to bring it into focus; 2) within a point of departure, to indicate that it is the genitive in particular which relates to a corresponding constituent of the context (DFNTG §4.5)."
    },
    "Embedded DFE": {
        "description": "'Dominant focal elements' embedded within a constituent in P1."
    },
    "Reported Speech": {"description": "Reported speech."},
    "Ambiguous": {"description": "Marked but ambiguous constituent order."},
    "Over-encoding": {
        "description": "Any instance in which more encoding than the default is employed to refer to an active participant or prop. Over-encoding is used in Greek, as in other languages: to mark the beginning of a narrative unit (e.g. Mt 4:5); and to highlight the action or speech concerned (e.g. Mt 4:7)."
    },
    "Highlighter": {
        "description": "Presentatives - Interjections such as ἰδού and ἴδε ‘look!, see!’ typically highlight what immediately follows (Narr §5.4.2, NonNarr §7.7.3)."
    },
    "Referential PoD": {
        "description": "Pre-verbal topical subject or other referential point of departure (NARR §3.1, NonNarr §4.3, DFNTG §§2.2, 2.8; as in 1 Th 1:6)."
    },
    "annotations": {"description": "Inline annotations."},
    "Left-Dislocation": {
        "description": "Point of departure - A type of SENTENCE in which one of the CONSTITUENTS appears in INITIAL position and its CANONICAL position is filled by a PRONOUN or a full LEXICAL NOUN PHRASE with the same REFERENCE, e.g. John, I like him/the old chap."
    },
    "Focus+": {
        "description": "Constituents placed in P2 to give them focal prominence."
    },
    "Tail-Head linkage": {
        "description": "Point of departure involving renewal - Tail-head linkage involves “the repetition in a subordinate clause, at the beginning (the ‘head’) of a new sentence, of at least the main verb of the previous sentence (the tail)” (Dooley & Levinsohn 2001:16)."
    },
    "Postposed them subject": {
        "description": "When a subject is postposed to the end of its clause (following nominals or adjuncts), it is marked ThS+ (e.g. Lk 1:41 [twice]). Such postposing typically marks as salient the participant who performs the next event in chronological sequence in the story (see Levinsohn 2014)."
    },
    "EmbeddedRepSpeech": {
        "description": "Embedded reported speech - speech that is reported within a reported speech."
    },
    "Futuristic Present": {
        "description": "Highlights not the speech or act to which it refers but the event(s) that follow (DFNTG §12.2)."
    },
    "OT quotes": {"description": "Old Testament quotations."},
    "Constituent Negation": {
        "description": "Negative pro-forms when they are in P2 indicate that the constituent has been negated rather than the clause as a whole."
    },
    "Split Focal": {
        "description": "The second part of a focal constituent with only the first part in P2 (NonNarr §5.5, DFNTG §4.4)."
    },
    "Right-Dislocated": {
        "description": "Point of departure - A type of SENTENCE in which one of the CONSTITUENTS appears in FINAL position and its CANONICAL position is filled by a PRONOUN with the same REFERENCE, e.g. ... He’s always late, that chap."
    },
    "Appositive": {"description": "Appositive"},
    "Situational PoD": {
        "description": "Situational point of departure (e.g. temporal, spatial, conditional) (NARR §3.1, NonNarr §4.3, DFNTG §§2.2, 2.8; as in 1 Th 3:4)."
    },
    "Historical Present": {
        "description": "Highlights not the speech or act to which it refers but the event(s) that follow (DFNTG §12.2)."
    },
    "Noun Incorporation": {
        "description": "Some nominal objects that appear to be in P2 may precede their verb because they have been “incorporated” (Rosen 1989) in the verb phrase. Typically, the phrase consists of an indefinite noun and a “light verb” such as “do, give, have, make, take” (Wikipedia entry on Light Verbs)."
    },
    "Thematic Prominence": {
        "description": "Thematic prominence - In Greek, prominence is given to active participants and props who are the current centre of attention (NARR §4.6) by omitting the article (DFNTG §§9.2.3-9.4), by adding αυτος ‘-self’ (e.g. in 1 Th 3:11), by using the proximal demonstrative οὗτος (NARR chap. 9, Appendix 1; e.g. in 3:3), and by postposing the constituent concerned (e.g. Mt 14:29). If such constituents are NOT in position P1, they are demonstrating topical prominence."
    },
    "Cataphoric Focus": {
        "description": "An expression that points forward to and highlights something which ‘is about to be expressed.’"
    },
    "Cataphoric referent": {
        "description": "The clause or sentence to which a cataphoric reference refers when NOT introduced with ὅτι or ἵνα."
    },
    "DFE": {
        "description": "Constituents that may be moved from their default position to the end of a proposition to give them focal prominence include verbs, pronominals and objects that follow adjuncts (NonNarr §5.3, DFNTG §3.5). Such constituents are also called ‘dominant focal elements’ or DFEs (Heimerdinger 1999:167)."
    },
    "Embedded Focus+": {
        "description": "A constituent of a phrase or embedded clause preposed for focal prominence."
    },
}

ENDPOINT = "https://macula-atlas-api-qa-25c5xl4maa-uk.a.run.app/graphql/"
headers = {"Content-Type": "application/json"}

# Levinsohn discourse features query

discourse_features_query = """
query AnnotationFeatures($filters1: AnnotationFeatureFilter, $filters2: AnnotationFilter, $filters3: WordTokenFilter ) {
  annotationFeatures(filters: $filters1) {
    label
    uri
    instances(filters: $filters2) {
      uri
      tokens(filters: $filters3) {
        ref
        wordValue
        xmlId
      }
    }
  }
}
"""


def get_discourse_annotation_types(xmlId):
    """Return the discourse-feature labels the API reports for one token."""
    token_data = mg.loc[xmlId].to_dict()
    # 'ref' has the form passage!word; keep only the passage part
    passage = token_data["ref"].split("!")[0]

    variables = {
        "filters1": {
            "reference": passage,
        },
        "filters2": {
            "reference": passage,
        },
        "filters3": {
            "xmlId": xmlId,
        },
    }

    payload = {"query": discourse_features_query, "variables": variables}

    response = requests.post(ENDPOINT, json=payload, headers=headers, timeout=60)
    response.raise_for_status()  # surface HTTP errors before parsing the body

    annotation_features = response.json()["data"]["annotationFeatures"]
    return [feature["label"] for feature in annotation_features]
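
The `xml:id` and `ref` descriptions in `attribute_descriptions` imply a fixed layout for both identifiers. The sketch below decodes them under two assumptions: that every `xml:id` has the 12-character shape described there (one corpus letter, a 2-digit book, then 3 digits each for chapter, verse, and word), and that `ref` values look like `MAT 1:1!1`. Both helper names are hypothetical, and the sample values are illustrative rather than taken from the TSV.

```python
import re

XML_ID_PATTERN = re.compile(r"([a-z])(\d{2})(\d{3})(\d{3})(\d{3})")
REF_PATTERN = re.compile(r"(\d?[A-Z]+)\s(\d+):(\d+)(?:!(\d+))?")


def parse_xml_id(xml_id):
    """Split an id such as 'n40001001001' into corpus, book, chapter, verse, word."""
    match = XML_ID_PATTERN.fullmatch(xml_id)
    if match is None:
        raise ValueError(f"unexpected xml:id format: {xml_id!r}")
    corpus, book, chapter, verse, word = match.groups()
    return {
        "corpus": corpus,
        "book": int(book),
        "chapter": int(chapter),
        "verse": int(verse),
        "word": int(word),
    }


def parse_ref(ref):
    """Split a reference such as 'MAT 1:1!1' into book, chapter, verse, word."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unexpected ref format: {ref!r}")
    book, chapter, verse, word = match.groups()
    return {
        "book": book,
        "chapter": int(chapter),
        "verse": int(verse),
        "word": int(word) if word is not None else None,
    }


print(parse_xml_id("n40001001001"))
# {'corpus': 'n', 'book': 40, 'chapter': 1, 'verse': 1, 'word': 1}
print(parse_ref("MAT 1:1!1"))
# {'book': 'MAT', 'chapter': 1, 'verse': 1, 'word': 1}
```

Raising `ValueError` on a mismatch is a deliberate choice here: some `referent` and `subjref` values are phrase or clause IDs rather than word IDs, so silently returning partial data could hide a wrong lookup.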



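
The speaker-projection expansion in the cell above turns each speaker instance into one row per projection. Here is a pandas-free sketch of the same idea on toy data. It assumes each entry in the JSON is a two-item `[instance_data, projections]` pair, which is inferred from the column renaming in the cell rather than confirmed against the file. The `speaker` field and the `q1` key are hypothetical, and unlike `pd.json_normalize` this version does not flatten nested instance fields.

```python
def expand_projections(speaker_projections):
    """Return one flat dict per projection, inheriting the instance data."""
    rows = []
    for row_id, (instance_data, projections) in speaker_projections.items():
        for proj_idx, projection in enumerate(projections):
            row = {**instance_data, **projection}
            # Same row-id scheme as the notebook: "<instance id>|<projection index>"
            row["row_id"] = f"{row_id}|{proj_idx}"
            words = projection.get("Words", [])
            row["tokens"] = [word["Text"] for word in words]
            row["token_ids"] = [word["Id"] for word in words]
            rows.append(row)
    return rows


# Hypothetical sample shaped like the assumed input.
sample = {
    "q1": [
        {"speaker": "hypothetical-character-id"},
        [
            {"Words": [{"Id": "n40001001001", "Text": "Βίβλος"}]},
            {"Words": [{"Id": "n40001001002", "Text": "γενέσεως"}]},
        ],
    ]
}

rows = expand_projections(sample)
print([row["row_id"] for row in rows])  # ['q1|0', 'q1|1']
```

Passing the resulting list to `pd.DataFrame(rows)` builds the table in one step, which avoids growing a DataFrame row by row.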

In [57]:

situations_lookup_json = json.loads(
    """
                                    {
    "parameters": [
        {
            "name": "field",
            "type": "register_parameter",
            "summary": "Field is the subject matter of a situation and concerns the nature and the structure of the activity being carried out. It involves three parameters: abstractness, activity focus, and goals.",
            "system": "field"
        },
        {
            "name": "tenor",
            "type": "register_parameter",
            "summary": "Tenor pertains to the social roles and relationships among the participants involved in a situation. It covers five parameters: value orientation predisposition, publicity, number of speaking participants, control, and social distance.",
            "system": "tenor"
        },
        {
            "name": "mode",
            "type": "register_parameter",
            "summary": "Mode is the dimension of a situation through which participants are brought into contact with each other. It involves the systems of material contact and semantic contact, focusing on four parameters: language role, process sharing, channel, and medium.",
            "system": "mode"
        }
    ],
    "systems": [
        {
            "name": "abstractness",
            "type": "system",
            "summary": "The abstractness of a situation refers to the distinction between conceptual and practical activities, with the former involving theoretical or abstract activities, while the latter involves concrete or immediate actions."
        },
        {
            "name": "activity focus",
            "type": "system",
            "summary": "The activity focus of a situation involves the domain of experience that participants are focusing on. An experiential focus refers to what is happening, an interpersonal focus refers to why it is happening, and a logical focus refers to how, when, or where it is happening. The analysis only captures the beginning and the end of the focus, even if it changes during the episode."
        },
        {
            "name": "goals",
            "type": "system",
            "summary": "The goals of a situational activity involve the motivation of actions, and can be instructing, projecting, or asserting."
        },
        {
            "name": "control",
            "type": "system",
            "summary": "Control involves social tendencies related to deference between participants based on their relative status, power, authority, or institutional roles. Situations may be hierarchic or non-hierarchic, with the former being unequal and the latter being equal. In unequal relationships, there may be numerous subjects that cannot typically be discussed, whereas equal relationships allow for a greater range of meanings to be exchanged."
        },
        {
            "name": "plurality",
            "type": "system",
            "summary": "The plurality system pertains to the number of speaking participants in a situation, and includes the parameters of monological, dialogical, and multilogical. This system recognizes that more than two participants may be engaged in dialogical activity, interacting with each other in various overlapping arrangements over the course of a situation."
        },
        {
            "name": "value-orientation-disposition",
            "type": "system",
            "summary": "Value-orientation disposition refers to the nature of relative alliance or opposition between agents in a situation, and answers the question of whether the situation is presented as if there is agreement or opposition between participants. The system distinguishes between an allying disposition that realizes agreement and an opposing disposition that realizes disagreement."
        },
        {
            "name": "social-distance",
            "type": "system",
            "summary": "Social distance refers to the level of familiarity between participants in a situation. Close participants may exchange more kinds of meanings and require less explicitness in their communication, while distant participants tend to require more explicitness and have a more restricted set of possible meaning exchanges."
        },
        {
            "name": "publicity",
            "type": "system",
            "summary": "Publicity is a dimension of tenor that refers to the presence or absence of onlookers with regard to a social act, and the various levels of engagement such onlookers might reveal. It includes disinterested, interested (neutral or biased), and private situations."
        },
        {
            "name": "language-role",
            "type": "system",
            "summary": "Language role refers to the amount of work language does in accomplishing a situation's activity, and can be constitutive or ancillary depending on whether language is the primary means of accomplishing the activity or simply assists in the unfolding of non-linguistic actions."
        },
        {
            "name": "process-sharing",
            "type": "system",
            "summary": "Process sharing refers to the degree of active participation by more than one participant in the unfolding of text, and can be active or passive depending on whether participants share in the creation of the text or engage with it more passively."
        },
        {
            "name": "channel",
            "type": "system",
            "summary": "Channel refers to the physical mechanics of the addressee's interaction with the text, and can be phonic or graphic. It is closely related to process sharing, and is decided by the nature of the social activity and of the social relation between the participants."
        },
        {
            "name": "medium",
            "type": "system",
            "summary": "Medium refers to the style or patterning of the wordings themselves, and can be spoken or written. It is a matter of style, and is related to the extemporaneousness of the language realizing a situation."
        }
    ],
    "features": [
        {
            "name": "conceptual-ie-internally-oriented",
            "type": "feature",
            "register_parameter": "field",
            "description": "Conceptual field values involve abstract or theoretical activities, such as theological or philosophical discussions.",
            "system": "field",
            "parameter": "abstractness"
        },
        {
            "name": "practical-ie-outwardly-oriented",
            "type": "feature",
            "register_parameter": "field",
            "description": "Practical field values involve activities that are concrete or focused on immediate actions, such as fishing, a healing, or a miracle.",
            "system": "field",
            "parameter": "abstractness"
        },
        {
            "name": "experiential",
            "type": "feature",
            "register_parameter": "field",
            "description": "Experiential activity focus characterizes a situation whose linguistic activity chiefly relates to the unfolding of events or happenings, where participants are involved in carrying out or observing the activity.",
            "system": "field",
            "parameter": "activity_focus"
        },
        {
            "name": "interpersonal",
            "type": "feature",
            "register_parameter": "field",
            "description": "Interpersonal activity focus pertains to the social interaction between participants, focusing on their roles, relationships, and attitudes.",
            "system": "field",
            "parameter": "activity_focus"
        },
        {
            "name": "logical",
            "type": "feature",
            "register_parameter": "field",
            "description": "Logical activity focus involves reasoning, argumentation, or explanation, where participants engage in activities that require logical thinking.",
            "system": "field",
            "parameter": "activity_focus"
        },
        {
            "name": "instructing",
            "type": "feature",
            "register_parameter": "field",
            "description": "Instructing goals are centered around teaching, explaining, or providing guidance to others.",
            "system": "field",
            "parameter": "goals"
        },
        {
            "name": "projecting",
            "type": "feature",
            "register_parameter": "field",
            "description": "Projecting goals involve making predictions, prophesying, or discussing future events or possibilities.",
            "system": "field",
            "parameter": "goals"
        },
        {
            "name": "asserting",
            "type": "feature",
            "register_parameter": "field",
            "description": "Asserting goals involve stating or affirming beliefs, claims, or opinions, often in a declarative manner.",
            "system": "field",
            "parameter": "goals"
        },
        {
            "name": "allying",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Allying value orientation predisposition refers to participants who share the same views or are supportive of each other's positions.",
            "system": "tenor",
            "parameter": "value_orientation_predisposition"
        },
        {
            "name": "opposing",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Opposing value orientation predisposition refers to participants who hold different views or are antagonistic toward each other's positions.",
            "system": "tenor",
            "parameter": "value_orientation_predisposition"
        },
        {
            "name": "disinterested",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Disinterested publicity refers to a neutral stance where participants are not personally invested in the outcome or do not take sides.",
            "system": "tenor",
            "parameter": "publicity"
        },
        {
            "name": "neutral",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Neutral publicity refers to a situation where participants neither support nor oppose a particular stance or outcome.",
            "system": "tenor",
            "parameter": "publicity"
        },
        {
            "name": "on-someones-side",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "On-someones-side publicity refers to a situation where participants actively support a particular stance or outcome.",
            "system": "tenor",
            "parameter": "publicity"
        },
        {
            "name": "private",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Private publicity refers to a situation where no onlookers are present for the social act.",
            "system": "tenor",
            "parameter": "publicity"
        },
        {
            "name": "monological",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Monological refers to situations with only one speaking participant, such as a monologue or a soliloquy.",
            "system": "tenor",
            "parameter": "number_of_speaking_participants"
        },
        {
            "name": "dialogical",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Dialogical refers to situations with two speaking participants, such as a dialogue or conversation.",
            "system": "tenor",
            "parameter": "number_of_speaking_participants"
        },
        {
            "name": "multilogical",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Multilogical refers to situations with three or more speaking participants, such as group discussions or debates.",
            "system": "tenor",
            "parameter": "number_of_speaking_participants"
        },
        {
            "name": "institutional",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Institutional control refers to situations where one participant or a group of participants hold authority or power over others.",
            "system": "tenor",
            "parameter": "control"
        },
        {
            "name": "non-institutional-or-neutralized",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Non-institutional or neutralized control refers to situations where no specific participant or group holds authority or power over others.",
            "system": "tenor",
            "parameter": "control"
        },
        {
            "name": "unclear",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Unclear control refers to situations where it is not evident who holds authority or power.",
            "system": "tenor",
            "parameter": "control"
        },
        {
            "name": "equalized",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Equalized control refers to situations where all participants have an equal share of authority or power.",
            "system": "tenor",
            "parameter": "control"
        },
        {
            "name": "close",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Close social distance refers to situations where participants have a close relationship or are familiar with each other.",
            "system": "tenor",
            "parameter": "social_distance"
        },
        {
            "name": "distant",
            "type": "feature",
            "register_parameter": "tenor",
            "description": "Distant social distance refers to situations where participants have a distant relationship or are not familiar with each other.",
            "system": "tenor",
            "parameter": "social_distance"
        },
        {
            "name": "constitutive",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Constitutive language role refers to situations where language is the primary means of carrying out the activity or achieving the goal.",
            "system": "mode",
            "parameter": "language_role"
        },
        {
            "name": "ancillary",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Ancillary language role refers to situations where language plays a secondary or supporting role in carrying out the activity or achieving the goal.",
            "system": "mode",
            "parameter": "language_role"
        },
        {
            "name": "addressee-more-active",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Addressee-more-active process sharing refers to situations where the recipient of the message is more actively involved in the communication process, such as asking questions or providing feedback.",
            "system": "mode",
            "parameter": "process_sharing"
        },
        {
            "name": "addressee-more-passive",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Addressee-more-passive process sharing refers to situations where the recipient of the message is less actively involved in the communication process, such as listening or reading without providing feedback.",
            "system": "mode",
            "parameter": "process_sharing"
        },
        {
            "name": "phonic",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Phonic channel refers to communication through sound, such as spoken language or music.",
            "system": "mode",
            "parameter": "channel"
        },
        {
            "name": "graphic",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Graphic channel refers to communication through visual means, such as written language, images, or symbols.",
            "system": "mode",
            "parameter": "channel"
        },
        {
            "name": "spoken",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Spoken medium refers to communication that takes place through speech, either in face-to-face conversations or through audio recordings.",
            "system": "mode",
            "parameter": "medium"
        },
        {
            "name": "written",
            "type": "feature",
            "register_parameter": "mode",
            "description": "Written medium refers to communication that takes place through text, either in print or digital formats.",
            "system": "mode",
            "parameter": "medium"
        }
    ]
}
"""
)

# Situation data query

SITUATIONS_ENDPOINT = "https://gospelgenre.ryderwishart.com/api/token/"

# just fetch the raw JSON from the situations endpoint plus the ref (e.g., MAT 3:13)
# Note, you only need to submit one token to get the social situation data, if available
# def get_situations_data(tokenRef):
#     # validate tokenRef
#     if not re.match(r'^\d?[A-Z]+ \d+:\d+$', tokenRef):
#         return {'error': 'invalid ref'};
#     expanded_endpoint = SITUATIONS_ENDPOINT + tokenRef
#     response = requests.get(expanded_endpoint)
#     return response.json()

clusterLabels = {
    0: "narration/account",
    1: "denouncement",
    2: "forewarning/private discussion",  # and predictions? look into this
    3: "assignment",  # with predictions of how it will go?
    4: "charge",
    5: "appraisal",
    6: "questioning",
    7: "controversial action",  # jesus heals a person, and people ask questions about it
    8: "disputation",
    9: "vilifying story",
    10: "rebuke",
    11: "organizing",
    12: "judicial examination",
    13: "public execution",
    14: "presumptive interaction",
    15: "announcement",
    16: "examination",
    17: "public spectacle/novelty",
    18: "correction",
    19: "surprising turn of events",
    20: "redirection",  # NOTE: not sure about this one
    21: "solicitation",
    22: "illustrated lesson",
    23: "conflict",
    24: "oration",
    25: "accommodation",
    26: "challenge",
    27: "disappointing request",
    28: "disagreement",
}  # NOTE: these should probably be reworded to represent situational activities better

file_5_url = 'https://raw.githubusercontent.com/ryderwishart/biblical-machine-learning/main/gpt-inferences/situations_prose_descriptions.json'
file_5_name = 'situations_prose_descriptions.json'
file_6_url = 'https://raw.githubusercontent.com/ryderwishart/biblical-machine-learning/main/gpt-inferences/situations_data.json'
file_6_name = 'situations_data.json'

if file_5_name not in os.listdir():
    download_file(file_5_url, file_5_name)
if file_6_name not in os.listdir():
    download_file(file_6_url, file_6_name)

with open("situations_prose_descriptions.json", "r") as f:
    situations_prose_descriptions = json.load(f)

# Each description has a title and a description; the title corresponds to a clusterLabels *value*.
# Replace each label string in clusterLabels with a dict that also carries the description.
for cluster, label in clusterLabels.items():
    matching_description = next(
        (
            item
            for item in situations_prose_descriptions
            if item["title"].lower() == label
        ),
        None,
    )
    clusterLabels[cluster] = {
        "number": cluster,
        # Fall back to the bare label when no prose description matches it
        "title": matching_description["title"] if matching_description else label,
        "description": matching_description["description"]
        if matching_description
        else "",
    }


# I'm instead just going to load the situations data locally, since there seems to be some problem with the next.js endpoint for some tokens/pericopes
def get_situations_data(tokenRef):
    print(tokenRef)
    with open("situations_data.json", "r") as f:
        situations_data = json.load(f)

        """
        Array of situations, like this:
                  
          [
            {
              "situation": "01-01",
              "title": [
                "The Genealogy of Christ"
              ],
              "preTextFeatures": "no embedded discourse",
              "viaTextFeatures": "no embedded discourse",
              "start": [
                "SBLGNT.Matt.1.1.w1"
              ],
              "section": [
                "01-01"
              ],
              "morphGntId": [
                "010101"
              ],
              "ref": "MAT 1:1!1 MAT 1:1!1 MAT 1:1!1 MAT 1:1!1 ...",
              "text": "Βίβλος γενέσεως Ἰησοῦ χριστοῦ υἱοῦ Δαυὶδ...",
              "token_ids": "n40001001001 n40001001002 n40001001003...",
              "token_refs": [
                "MAT 1:2!14",
                "MAT 1:3!9",
                "MAT 1:7!4",
                ...
        """
        # Find the situation where tokenRef is in token_refs array
        match = None
        for situation in situations_data:
            try:
                if not situation.get("token_refs"):
                    print("no token_refs for situation", situation["situation"])
                    continue
                if tokenRef in situation["token_refs"]:
                    match = situation
            except (KeyError, TypeError, AttributeError):
                continue

        # No situation contains this token
        if match is None:
            return None

        # Attach the cluster label and description, read from the matched situation
        # (not from whichever situation the loop happened to end on)
        if not match.get("cluster"):
            return match
        situation_type = clusterLabels[int(match["cluster"][0])]
        return {**match, "type": situation_type}


def process_features(lookup, feature_list):
    feature_descriptions = []
    system_descriptions = set()

    for feature_name in feature_list:
        feature = next(
            (item for item in lookup["features"] if item["name"] == feature_name), None
        )
        if feature:
            feature_descriptions.append(feature["description"])

            system = next(
                (
                    item
                    for item in lookup["systems"]
                    if item["name"] == feature["system"]
                ),
                None,
            )
            if system:
                system_descriptions.add(system["summary"])

    return feature_descriptions, system_descriptions


def generate_mutations(pre_text_features, via_text_features, lookup):
    mutations = []

    pre_text_features_set = set(pre_text_features)
    via_text_features_set = set(via_text_features)

    gained_features = via_text_features_set - pre_text_features_set
    lost_features = pre_text_features_set - via_text_features_set

    # Sort the feature names so the output order is stable (set order is not)
    if gained_features:
        mutations.append(
            "gained the following features: " + ", ".join(sorted(gained_features))
        )

    if lost_features:
        mutations.append(
            "lost the following features: " + ", ".join(sorted(lost_features))
        )

    # These features still need descriptions (i.e., process_features), since they don't occur in the pre-text features
    gained_feature_descriptions, _ = process_features(lookup, sorted(gained_features))
    if gained_feature_descriptions:
        mutations.append("\n".join(gained_feature_descriptions))

    return mutations


# test mutations
test_sit = get_situations_data("MAT 3:15!1")
generate_mutations(
    test_sit["preTextFeatures"], test_sit["viaTextFeatures"], situations_lookup_json
)

# Get speaker quotation data for a token
"""
speaker_data.columns =
['CharacterId',
'MaxSpeakers',
'Gender',
'Age',
'Comment',
'SDBH',
'LouwNida',
'FCBHCharacter',
'Divinity']
"""


def get_speaker_quotation_data(token_ref: str):
    """
    Accepts a token ref, and returns the speaker quotation data (from expanded_speaker_data) for that token.
    """
    # The token id is the row Name
    token_data = mg[mg["ref"] == token_ref]
    # print(token_data)
    token_id = token_data.index[0]
    # print(token_ref, 'matched to', token_id)
    speaker_data_for_token = expanded_speaker_data[
        expanded_speaker_data["token_ids"].apply(lambda x: token_id in x)
    ]
    # print(speaker_data_for_token)

    if speaker_data_for_token.empty:
        return None

    speaker_ids = speaker_data_for_token["CharacterIds"].iloc[0]
    # print(speaker_ids)
    results = []
    for speaker_id in speaker_ids:
        speaker_character_data = character_data[
            character_data["CharacterId"] == speaker_id
        ]
        # Drop float values, which here are the NaNs pandas uses for missing fields
        speaker_character_data = {
            key: value
            for key, value in speaker_character_data.iloc[0].items()
            if not isinstance(value, float)
        }
        # print('speaker_character_data', speaker_character_data)
        # print(speaker_character_data)

        result = {
            "who_is_speaking": speaker_id,
            "delivery_tone": speaker_data_for_token["Delivery"].iloc[0],
            # "contained_in_speech_by": # TODO: somehow I would like to note that the Baptist's speech is contained in the Narrator's speech
            "what_is_said_truncated": " ".join(
                speaker_data_for_token["tokens"].iloc[0][:10]
            )
            + "...",
            "what_is_said_complete": " ".join(speaker_data_for_token["tokens"].iloc[0]),
        }
        for character_item in speaker_character_data:
            result[character_item] = speaker_character_data[character_item]

        results.append(result)

    return results


# test
get_speaker_quotation_data("MAT 3:14!6")

# Use semantic_role_data to get the semantic role data for a token

import pandas as pd
import re


def describe_semantic_configuration(query_id, df, output_template):
    # Find the row with the query ID
    row = df.loc[df["xml:id"] == query_id].iloc[0]

    # Find the matching rows for the verb frame IDs, handling multiple frame verb rows
    frame_verb_ids = row["frame_verb_id"].split("|")

    # FIXME: If there are no frame verb IDs, we probably just need to find the syntactic clause containing this token and return the treedown

    frame_verb_rows = [
        df.loc[df["xml:id"] == frame_verb_id].iloc[0]
        for frame_verb_id in frame_verb_ids
        if not df.loc[df["xml:id"] == frame_verb_id].empty
    ]

    outputs = []

    for frame_verb_row in frame_verb_rows:
        # Extract all frame roles and IDs using regex, and store them in a dictionary where the role (e.g., A0) is the key and the value is a list of xml:ids
        role_id_dict = {
            role: ids.split(";")
            for role, ids in re.findall(
                r"(A[0-9A]+):([^;\s]+)", frame_verb_row["frame"]
            )
        }

        # Find the rows with the extracted role IDs
        role_rows = {
            role: df.loc[df["xml:id"].isin(ids)] for role, ids in role_id_dict.items()
        }

        # Generate the output
        frame_verb_lemma, frame_verb_gloss, frame_verb_role = (
            frame_verb_row["lemma"],
            frame_verb_row["gloss"],
            "Process",
        )

        # Default to empty A0 values so the template below can still be filled when a frame has no A0 role
        a0_string, a0_gloss, a0_roles = "A0:", "", [""]
        a0_rows = role_rows.get("A0", pd.DataFrame())
        if not a0_rows.empty and "lemma" in a0_rows:
            a0_lemmas, a0_glosses, a0_roles = (
                a0_rows["lemma"].tolist(),
                a0_rows["gloss"].tolist(),
                a0_rows["semantic_role_label"].tolist(),
            )
            a0_string = f'A0: {", ".join(a0_lemmas)}'
            a0_gloss = f'{", ".join(a0_glosses)}'

        all_other_role_strings = []
        all_other_role_gloss_strings = []
        for role, rows in role_rows.items():
            # A0 is handled above; also skip roles whose ids were not found in df
            if role == "A0" or rows.empty:
                continue
            lemmas, glosses, roles = (
                rows["lemma"].tolist(),
                rows["gloss"].tolist(),
                rows["semantic_role_label"].tolist(),
            )
            role_string = f'[{role}: {", ".join(lemmas)}]'
            role_gloss_string = f'[{roles[0]}: {", ".join(glosses)}]'
            all_other_role_strings.append(role_string)
            all_other_role_gloss_strings.append(role_gloss_string)

        output = output_template.format(
            a0_string=a0_string,
            frame_verb_lemma=frame_verb_lemma,
            all_other_role_strings=" ".join(all_other_role_strings),
            a0_role=a0_roles[0],
            a0_gloss=a0_gloss,
            frame_verb_role=frame_verb_role,
            frame_verb_gloss=frame_verb_gloss,
            all_other_role_gloss_strings=" ".join(all_other_role_gloss_strings),
        )

        outputs.append(output)

    return "\n".join(outputs)


# Example usage:
query_id = "n41001005008"
output_template = "[{a0_string}] [{frame_verb_lemma}] {all_other_role_strings} / [{a0_role}: {a0_gloss}] [{frame_verb_role}: {frame_verb_gloss}] {all_other_role_gloss_strings}"
result = describe_semantic_configuration(query_id, semantic_role_data, output_template)
print(result)

# Get the plain treedown representation for a token's sentence

# example endpoint: "https://labs.clear.bible/symphony-dev/api/GNT/Nestle1904/lowfat?usfm-ref=JHN%2014:1" - JHN 14:1

from lxml import etree


def process_element(element, usfm_ref, indent=0):
    treedown_str = ""

    if element.get("class") == "cl":
        treedown_str += "\n" + "  " * indent

    if element.get("role"):
        role = element.attrib["role"]
        if role == "adv":
            role = "+"
        treedown_str += "\n" + "  " * indent + role + ": "

    if element.tag == "w" and element.text:
        text = element.text
        # bold the matching token using usfm ref
        if element.get("ref") == usfm_ref:
            text = f"**{text}**"
        treedown_str += element.attrib.get("gloss", "") + f"[{text}]"
        treedown_str += element.attrib.get("after", "") + " "

    for child in element:
        treedown_str += process_element(child, usfm_ref, indent + 1)

    return treedown_str


def get_treedown_by_ref(usfm_ref):
    usfm_passage = usfm_ref.split("!")[0]
    endpoint = (
        "https://labs.clear.bible/symphony-dev/api/GNT/Nestle1904/lowfat?usfm-ref="
        + usfm_passage
    )

    # Note: the response is XML like this:
    """
    <sentences xml:lang="grc" ref="JHN 14:1">
        <sentence>
        <p>
        <milestone unit="verse" id="JHN 14:1">JHN 14:1</milestone>
        Μὴ ταρασσέσθω ὑμῶν ἡ καρδία·
        </p>
        <wg>
        <wg class="cl" rule="ADV-V-S">
        <w role="adv" ref="JHN 14:1!1" after=" " class="adv" id="n43014001001" lemma="μή" normalized="Μή" strong="3361" gloss="Not" domain="069002" ln="69.3" morph="PRT-N" unicode="Μὴ">Μὴ</w>
        ...
    """
    text_response = requests.get(endpoint).text
    # print(text_response)

    xml = etree.fromstring(text_response.encode("utf-8"))
    # turn xml into simple treedown, with all text on one line, except a new line for <wg class="cl".../> elements, and a new indented line for <w role.../> elements

    # Pass the full ref (including any "!word" suffix) so the matching token can be bolded
    treedown = process_element(xml, usfm_ref)
    return treedown


# test

get_treedown_by_ref("MAT 3:14")


OPENTEXT_ENDPOINT = "https://ww-network-annotations---macula-atlas-api-qa-25c5xl4maa-uk.a.run.app/graphql/"

opentext_query = """
query TokensWithAllOpenTextSystemValues(
  $wordTokensFilters: WordTokenFilter
  $wordTokensPagination: OffsetPaginationInput
  $annotationInstancesFilters: AnnotationFilter
) {
  wordTokens(filters: $wordTokensFilters, pagination: $wordTokensPagination) {
    ref
    annotationInstances(filters: $annotationInstancesFilters) {
      uri
      feature {
        label
        data
      }
    }
  }
}
"""


def get_opentext_syntax_data(xmlId):
    tokenData = mg.loc[xmlId].to_dict()
    passage = tokenData["ref"].split("!")[0]

    opentext_variables = {
        "wordTokensFilters": {"passageReference": passage},
        "annotationInstancesFilters": {
            "uri": {
                "regex": "https://github.com/OpenText-org/placeholder-data:system-values..+.all"
            }
        },
    }

    opentext_payload = {"query": opentext_query, "variables": opentext_variables}

    response = requests.post(OPENTEXT_ENDPOINT, json=opentext_payload, headers=headers)
    response_data = json.loads(response.text)
    words = response_data["data"]["wordTokens"]

    results = []
    for word in words:
        ref = word["ref"]
        if ref == tokenData["ref"]:
            annotation_features = word["annotationInstances"]
            for feature in annotation_features:
                feature_name = feature["feature"]["label"]
                if feature_name.startswith("$"):
                    feature_description = (
                        f"The grammar requires a {feature_name} lemma here"
                    )
                else:
                    feature_description = feature["feature"]["data"]["description"]
                    if feature_name.endswith("_tbd"):
                        feature_description += (
                            " (Fall back to morphological description of this wording)"
                        )
                # Record every feature, including the "$" lemma requirements
                results.append(
                    {
                        "feature_name": feature_name,
                        "feature_description": feature_description,
                    }
                )
    # for feature in annotation_features:
    #     feature_name = feature["uri"].split(".")[-1]
    #     feature_description = feature["data"]["description"]
    #     instances = feature["instances"]
    #     results.append(
    #         {
    #             "feature_name": feature_name,
    #             "feature_description": feature_description,
    #             "instances": instances,
    #         }
    #     )

    return results


# test
# get_opentext_syntax_data("MAT 3:14!6")


MAT 3:15!1
no token_refs for situation 02-15
no token_refs for situation 02-16
[A0: χώρα] [ἐκπορεύομαι]  / [Source: region] [Process: were going out] 
[A0: χώρα] [βαπτίζω] [A1: χώρα] / [Source: region] [Process: were being baptized] [Source: region]
[A0: χώρα] [ἐξομολογέω] [A1: ἁμαρτία] / [Source: region] [Process: confessing] [Goal: sins]
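
As an illustration of the comparison that `generate_mutations` performs, the following minimal sketch diffs two feature lists with set operations. The helper name `diff_features` and the two example lists are hypothetical (only the feature names themselves come from the lookup above), and the sketch sorts its results so the output order is stable, which plain sets do not guarantee.

```python
def diff_features(pre_text_features, via_text_features):
    """Return (gained, lost) feature names, each sorted for stable output."""
    pre, via = set(pre_text_features), set(via_text_features)
    gained = sorted(via - pre)  # present after the text but not before it
    lost = sorted(pre - via)  # present before the text but not after it
    return gained, lost


# Hypothetical feature lists, not taken from situations_data.json
pre = ["monological", "distant", "written"]
via = ["dialogical", "distant", "written"]

gained, lost = diff_features(pre, via)
print("gained the following features:", ", ".join(gained))
print("lost the following features:", ", ".join(lost))
```

Note that this only behaves as intended when the features are passed as lists of names; passing a single string would be split into a set of characters.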


## Prosifying functions

# Generate prosaic context function
def generate_prosaic_context(word_id, selected_fields=None):
    # The user may pass in a verse!word ref or an id
    if not "!" in word_id:
        word_data = mg.loc[word_id].to_dict()
        # Get annotations using combined annotations function
        word_ref = word_data["ref"]
    else:
        # word_id is a verse!word ref here, so use it as the ref and look up the token row
        word_ref = word_id
        matching_rows = mg.loc[mg["ref"] == word_ref]
        word_data = matching_rows.iloc[0].to_dict()
        # mg is indexed by token id (cf. mg.loc[xmlId] in get_opentext_syntax_data), so take the id from the index
        word_id = matching_rows.index[0]

    lemma = word_data["lemma"]

    # print(word_ref)

    if not selected_fields:
        selected_fields = list(attribute_descriptions.keys())

    context_data = {
        "1. Lexical features": [],
        "2. Syntactic context and function": [],
        "3. Discourse context": [],
        "4. Social context": [],  # To be implemented
        "5. Cultural/encyclopedic knowledge": [],
    }

    for key in selected_fields:
        value = word_data.get(key)
        if value not in (None, "missing", "nan"):
            if key == "class":
                context_data["1. Lexical features"].append(
                    f"- {key}: {lemma} is a {value},"
                )
            elif key == "gloss":
                context_data["1. Lexical features"].append(
                    f'- {key}: meaning "{value}."'
                )
            elif key == "lemma":
                context_data["1. Lexical features"].append(
                    f"- {key}: The lemma form of this word is {value},"
                )
            elif key == "morph":
                context_data["1. Lexical features"].append(
                    f"- {key}: and it is parsed as a {value}"
                )  # TODO: expand morphological parse codes into prose - although, is this necessary given the other data points?
            elif key == "strong":
                context_data["1. Lexical features"].append(
                    f"- {key}: with a Strong's number of {value}."
                )
            elif key in (
                "person",
                "number",
                "gender",
                "case",
                "tense",
                "voice",
                "mood",
                "degree",
                "type",
            ):
                context_data["2. Syntactic context and function"].append(
                    f"- {key}: {attribute_descriptions[key]}: {value},"
                )
            elif key == "ln":
                context_data["5. Cultural/encyclopedic knowledge"].append(
                    f"- {key}: {attribute_descriptions[key]}: {value},"
                )
            # Exact key match; a bare ("domain_label") is a string, so `in` would do a substring test
            elif key in ("domain", "domain_label"):
                context_data["5. Cultural/encyclopedic knowledge"].append(
                    f"- {key}: {attribute_descriptions[key]}: {domain_labels[value]},"
                )

    # Add semantic role data to syntactic context: 'this word is the {semantic_role} in the configuration {semantic_configuration}'
    semantic_configuration = describe_semantic_configuration(
        word_id, semantic_role_data, output_template
    )  # output_template was defined above
    # Find the matching row from semantic_role_data, and get all of the labels plus the column names
    role_data_row = semantic_role_data.loc[
        semantic_role_data["xml:id"] == word_id
    ].iloc[0]
    semantic_role_info = {
        key: value for key, value in role_data_row.items() if not value == ""
    }
    if semantic_role_info.get("semantic_role_label"):
        semantic_role = semantic_role_info["semantic_role_label"]
        context_data["2. Syntactic context and function"].append(
            f"- Semantic configuration: This word is the {semantic_role_info['semantic_role_label']} {'in `' + semantic_configuration + '`' if semantic_configuration else ''}, and it has the following data: {semantic_role_info}"
        )
    # TODO: need to exploit the roles.json file in order to get all related wordings for each frame.

    # Add opentext syntax data to syntactic context
    opentext_syntax_data = get_opentext_syntax_data(word_id)
    if opentext_syntax_data:
        context_data["2. Syntactic context and function"].append(
            f"- OpenText syntax: This word has the following syntactic selection features:"
        )
        for feature in opentext_syntax_data:
            context_data["2. Syntactic context and function"].append(
                f"  - {feature['feature_name']}: {feature['feature_description']}"
            )

    # Add treedown syntax data
    treedown_data = get_treedown_by_ref(word_ref)
    if treedown_data:
        context_data["2. Syntactic context and function"].append(
            f"- Treedown syntax: This word is part of the following sentence:\n{treedown_data}"
        )

    # Add Levinsohn discourse features
    discourse_features = get_discourse_annotation_types(word_id)
    if discourse_features:
        context_data["3. Discourse context"].append(
            f"This word functions within {len(discourse_features)} discourse features:"
        )
        for feature in discourse_features:
            context_data["3. Discourse context"].append(
                f"- {feature} is defined as {discourse_types[feature]['description']}"
            )

    speaker_information = get_speaker_quotation_data(word_ref)
    """
    Speaker information is an array of speakers, like this:
    [{'who_is_speaking': 'John the Baptist',
        'delivery_tone': 'humble',
        'what_is_said_truncated': 'ἐγὼ χρείαν ἔχω ὑπὸ σοῦ βαπτισθῆναι καὶ σὺ ἔρχῃ πρός...',
        'what_is_said_complete': 'ἐγὼ χρείαν ἔχω ὑπὸ σοῦ βαπτισθῆναι καὶ σὺ ἔρχῃ πρός με',
        'CharacterId': 'John the Baptist',
        'MaxSpeakers': 1, # this represents the number of speakers who are speaking at the same time (up to n)
        'Gender': 'Male', # this value, if present, can inform the pronouns used in the prosaic description below
        'LouwNida': ['93.190a',
        '93.190b',
        '93.190e',
        '93.190f',
        '93.190d',
        '93.190c']}]
        
    With possible additional values for the speaker, like this:
    'Age', # this value, if present, can be appended to the parenthetical content after the speaker's name (e.g. "John the Baptist ({Age} old)"
    'Comment', # this value, such as an alternate name for the speaker, can be appended to the parenthetical content after the speaker's name (e.g. "John the Baptist ({Age} old, {Comment})"
    'SDBH', # not used, available in macula greek database, but potentially useful for profiling the subject matter of the speech
    'LouwNida', # not used, available in macula greek database, but potentially useful for profiling the subject matter of the speech
    'FCBHCharacter', # not used, alternate id
    'Divinity' # will be 'Y' if present, otherwise not present. This value can be used to determine whether to use "he" or "He" in the prosaic description below, with "He" used for divinities.
    """
    if speaker_information:
        if len(speaker_information) > 1:
            context_data["3. Discourse context"].append(
                f"This word is spoken by {len(speaker_information)} speakers:"
            )
        for speaker in speaker_information:
            # Build one complete sentence per speaker so the fragments are not split across lines
            sentence = f"This word is spoken by {speaker['who_is_speaking']}"
            if speaker.get("Divinity") == "Y":
                sentence += " (a divinity)"
            if speaker.get("Age"):
                sentence += f", age: {speaker['Age']}"
            if speaker.get("Comment"):
                sentence += f" ({speaker['Comment']})"
            # Use the complete quotation unless it is long; guard against missing values
            speech = speaker.get("what_is_said_complete") or ""
            if len(speech) > 100:
                speech = speaker.get("what_is_said_truncated") or speech
            # The tone may be stored under 'delivery_tone' or 'Delivery'; accept either
            tone = speaker.get("delivery_tone") or speaker.get("Delivery")
            tone_phrase = f" (in a {tone} tone)" if tone else ""
            sentence += f", who says{tone_phrase}, \"{speech}\""
            context_data["3. Discourse context"].append(sentence)

    lookup = situations_lookup_json
    situation_data = get_situations_data(word_ref)
    print(">>> situation_data", situation_data)
    # if situation_data and situation_data.get('matchingSituation'):
    #     situation_data = situation_data['matchingSituation']
    if situation_data:
        pre_text_features = situation_data["preTextFeatures"]
        via_text_features = situation_data["viaTextFeatures"]

        pre_text_feature_descriptions, pre_text_system_descriptions = process_features(
            lookup, pre_text_features
        )
        via_text_feature_descriptions, via_text_system_descriptions = process_features(
            lookup, via_text_features
        )

        mutations = generate_mutations(pre_text_features, via_text_features, lookup)
        print(mutations)

        context_data["4. Social context"].append(
            f"This word is part of the passage '{situation_data['title'][0]}'"
        )

        # Add situation type information
        if situation_data.get("type"):
            situation_type = situation_data["type"]
            context_data["4. Social context"].append(
                f"This passage is a {situation_type['title']} situation, which can be described in typical terms as follows: {situation_type['description']}"
            )

        context_data["4. Social context"].append(
            f"It begins as a {' '.join(pre_text_features)} situation"
        )
        context_data["4. Social context"].extend(pre_text_feature_descriptions)
        # context_data["4. Social context"].extend(pre_text_system_descriptions)

        # context_data["4. Social context"].append(f"And ends as a {' '.join(via_text_features)} situation")
        # context_data["4. Social context"].extend(via_text_feature_descriptions)
        # context_data["4. Social context"].extend(via_text_system_descriptions)

        if mutations and mutations[-1] != "":
            context_data["4. Social context"].append(
                "During the passage, the situation:"
            )
            # extend() with a joined string would add one list item per character
            context_data["4. Social context"].extend(mutations)
    output_lines = []
    for header, sentences in context_data.items():
        if sentences:
            output_lines.append(f"## {header}\n")
            output_lines.append("\n".join(sentences))
            output_lines.append("\n")

    prosaic_context = "".join(output_lines)
    # print(prosaic_context)
    return prosaic_context
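
The final assembly step above renders each non-empty section under a `##` header. A minimal sketch of that step follows, using made-up section contents (the sample sentences and the `render_sections` name are illustrative, not from the dataset). It also shows the difference between extending a list with another list and extending it with a string, since the latter adds one item per character.

```python
def render_sections(context_data):
    """Render {header: [sentences]} as markdown-style sections, skipping empty ones."""
    output_lines = []
    for header, sentences in context_data.items():
        if sentences:
            output_lines.append(f"## {header}\n")
            output_lines.append("\n".join(sentences))
            output_lines.append("\n")
    return "".join(output_lines)


# Hypothetical sample data, for illustration only
mutations = ["- becomes more formal", "- shifts to a private setting"]
context_data = {
    "3. Discourse context": [],
    "4. Social context": ["During the passage, the situation:"],
}

# extend() with a list adds one item per mutation ...
context_data["4. Social context"].extend(mutations)

# ... whereas extend() with a string adds one item per character
per_character = []
per_character.extend("\n".join(mutations))

rendered = render_sections(context_data)
print(rendered)
```

Empty sections (here, the discourse section) are skipped entirely, so the rendered text only contains headers that have content.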

## Turn macula data into text to retrieve from Chroma DB

In [75]:
# ## VERSE-BASED TEXTS

# import numpy as np

# # Initialize lists
# text_list = []
# dict_list = []
# id_list = []
# glosses_list = []

# # Group the DataFrame by 'book_chapter_verse'
# grouped = mg.groupby('book_chapter_verse')

# for name, group in grouped:
#     # Combine the 'text' and 'after' fields into a single string for each group
#     text = ''.join(group['text'] + group['after'].replace(np.nan, '', regex=True) + ' ')
#     text_list.append(text)

#     # Extract book, chapter, and verse from the group
#     book = group['book'].values[0]
#     chapter = group['book_chapter'].str.split().str[1].values[0]
#     verse = group['book_chapter_verse'].str.split(':').str[1].values[0]
#     b_c_v = group['book_chapter_verse'].values[0]
    
#     # Use the 'xml:id' field to create a list of IDs for the verse (joined by pipes)
#     id_entry = '|'.join(group['id'].tolist())
    
#     # ... Same for glosses
#     verse_gloss = ''.join(group['gloss'].replace(np.nan, '[no gloss]', regex=True) + group['after'].replace(np.nan, '', regex=True) + ' ')
#     glosses_list.append(b_c_v + ' - ' + verse_gloss)
    
#     # Add metadata for verse
#     dict_entry = {'source': b_c_v, 'book': book, 'chapter': chapter, 'verse': verse, 'ids': id_entry, 'gloss': verse_gloss}
#     dict_list.append(dict_entry)
    
#     id_list.append(b_c_v)

# # Print the lists for testing
# print(text_list[:5])
# print(dict_list[:5])
# print(id_list[:5])

In [64]:
mg.columns

Index(['sentence', 'ref', 'role', 'class', 'type', 'gloss', 'text', 'after',
       'lemma', 'normalized', 'strong', 'morph', 'person', 'number', 'gender',
       'case', 'tense', 'voice', 'mood', 'degree', 'domain', 'ln', 'frame',
       'subjref', 'referent', 'id', 'book', 'chapter', 'verse', 'book_chapter',
       'book_chapter_verse', 'domain_label'],
      dtype='object')

In [74]:
### SENTENCE-BASED TEXTS

import numpy as np

# Initialize lists
text_list = []
dict_list = []
id_list = []
glosses_list = []

# Group the DataFrame by the 'sentence' column
grouped = mg.groupby('sentence')
for name, group in grouped:
    # print('>>>', name, group)
    # Combine the 'text' and 'after' fields into a single string for each group
    text = ''.join(group['text'] + group['after'].replace(np.nan, '', regex=True) + ' ')
#     text_list.append(text)

#     # Extract book, chapter, and verse from the group
#     book = group['book'].values[0]
#     chapter = group['book_chapter'].str.split().str[1].values[0]
#     verse = group['book_chapter_verse'].str.split(':').str[1].values[0]
#     b_c_v = group['book_chapter_verse'].values[0]
    
#     # Use the 'xml:id' field to create a list of IDs for the verse (joined by pipes)
#     id_entry = '|'.join(group['id'].tolist())
    
#     # ... Same for glosses
#     verse_gloss = ''.join(group['gloss'].replace(np.nan, '[no gloss]', regex=True) + group['after'].replace(np.nan, '', regex=True) + ' ')
#     glosses_list.append(b_c_v + ' - ' + verse_gloss)
    
#     # Add metadata for verse
#     dict_entry = {'source': b_c_v, 'book': book, 'chapter': chapter, 'verse': verse, 'ids': id_entry, 'gloss': verse_gloss}
#     dict_list.append(dict_entry)
    
#     id_list.append(b_c_v)

# # Print the lists for testing
# print(text_list[:5])
# print(dict_list[:5])
# print(id_list[:5])

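
The sentence-based grouping can also be sketched without pandas. The example below assumes token rows carry the `sentence`, `text`, `after`, `gloss`, and `id` fields listed in `mg.columns`; the sample rows, their ids, and the `build_sentence_records` helper are hypothetical.

```python
from collections import defaultdict


def build_sentence_records(rows):
    """Group token rows by sentence and build parallel text, gloss, metadata, and id lists."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sentence"]].append(row)

    text_list, glosses_list, dict_list, id_list = [], [], [], []
    for sentence_id, tokens in grouped.items():
        # Join each surface form with its trailing punctuation (a missing 'after' becomes '')
        text = " ".join(t["text"] + (t.get("after") or "") for t in tokens)
        gloss = " ".join(
            (t.get("gloss") or "[no gloss]") + (t.get("after") or "") for t in tokens
        )
        text_list.append(text)
        glosses_list.append(f"{sentence_id} - {gloss}")
        dict_list.append(
            {"source": sentence_id, "ids": "|".join(t["id"] for t in tokens), "gloss": gloss}
        )
        id_list.append(sentence_id)
    return text_list, glosses_list, dict_list, id_list


# Hypothetical token rows, for illustration only
rows = [
    {"sentence": "s1", "id": "n1", "text": "Ἐν", "after": None, "gloss": "In"},
    {"sentence": "s1", "id": "n2", "text": "ἀρχῇ", "after": None, "gloss": "[the] beginning"},
    {"sentence": "s1", "id": "n3", "text": "ἦν", "after": None, "gloss": "was"},
    {"sentence": "s1", "id": "n4", "text": "ὁ", "after": None, "gloss": None},
    {"sentence": "s1", "id": "n5", "text": "λόγος", "after": ",", "gloss": "Word"},
    {"sentence": "s2", "id": "n6", "text": "καὶ", "after": None, "gloss": "and"},
]

text_list, glosses_list, dict_list, id_list = build_sentence_records(rows)
print(text_list)
print(glosses_list)
```

Keeping the four lists parallel matters because `add_texts` pairs each text with the metadata and id at the same position.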

In [16]:
import getpass
import os

secret_key = getpass.getpass('Enter OpenAI secret key: ')
os.environ['OPENAI_API_KEY'] = secret_key

Enter OpenAI secret key: ··········


In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain
from langchain.document_loaders import TextLoader
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from pathlib import Path

# Load Language Model
llm = OpenAI(temperature=0)

In [18]:
# # Load and process the Bible text
# doc_path = filename
# loader = TextLoader(doc_path)
# documents = loader.load()
# text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# texts = text_splitter.split_documents(documents)

In [19]:
# Create embeddings and store in a vectorstore
embeddings = OpenAIEmbeddings()
collection = Chroma("bible-qa", embeddings)
# Chroma constructor parameters, for reference:
#   collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
#   embedding_function: Optional[Embeddings] = None,
#   persist_directory: Optional[str] = None,
#   client_settings: Optional[chromadb.config.Settings] = None,
#   collection_metadata: Optional[Dict] = None,
#   client: Optional[chromadb.Client] = None,

# bible_chroma = Chroma.from_documents(texts, embeddings, collection_name="kjv-bible")

In [20]:
# Add greek texts with metadata
# collection.add_texts(
#     texts=text_list,
#     metadatas=dict_list,
#     ids=id_list
#     )

In [21]:
# Add glossed texts with metadata
collection.add_texts(
    texts=glosses_list,
    metadatas=dict_list,
    ids=[f'{text_id}_gloss' for text_id in id_list]  # avoid shadowing the built-in id()
    )

['1CO 10:1_gloss',
 '1CO 10:10_gloss',
 '1CO 10:11_gloss',
 '1CO 10:12_gloss',
 '1CO 10:13_gloss',
 '1CO 10:14_gloss',
 '1CO 10:15_gloss',
 '1CO 10:16_gloss',
 '1CO 10:17_gloss',
 '1CO 10:18_gloss',
 '1CO 10:19_gloss',
 '1CO 10:2_gloss',
 '1CO 10:20_gloss',
 '1CO 10:21_gloss',
 '1CO 10:22_gloss',
 '1CO 10:23_gloss',
 '1CO 10:24_gloss',
 '1CO 10:25_gloss',
 '1CO 10:26_gloss',
 '1CO 10:27_gloss',
 '1CO 10:28_gloss',
 '1CO 10:29_gloss',
 '1CO 10:3_gloss',
 '1CO 10:30_gloss',
 '1CO 10:31_gloss',
 '1CO 10:32_gloss',
 '1CO 10:33_gloss',
 '1CO 10:4_gloss',
 '1CO 10:5_gloss',
 '1CO 10:6_gloss',
 '1CO 10:7_gloss',
 '1CO 10:8_gloss',
 '1CO 10:9_gloss',
 '1CO 11:1_gloss',
 '1CO 11:10_gloss',
 '1CO 11:11_gloss',
 '1CO 11:12_gloss',
 '1CO 11:13_gloss',
 '1CO 11:14_gloss',
 '1CO 11:15_gloss',
 '1CO 11:16_gloss',
 '1CO 11:17_gloss',
 '1CO 11:18_gloss',
 '1CO 11:19_gloss',
 '1CO 11:2_gloss',
 '1CO 11:20_gloss',
 '1CO 11:21_gloss',
 '1CO 11:22_gloss',
 '1CO 11:23_gloss',
 '1CO 11:24_gloss',
 '1CO 11:25

In [22]:
# Inspect some texts 
print('MAT 1:1 -->', collection.search('MAT 1:1', search_type='similarity'))
print('blind -->', collection.search('blind', search_type='similarity'))
# TODO: sort out metadata filtering
print('blind with filters -->', collection.search('blind in 1PE', search_type='similarity'))

MAT 1:1 --> [Document(page_content='MAT 1:1 - [The] book  of [the] genealogy  of Jesus  Christ  son  of David  son  of Abraham. ', metadata={'source': 'MAT 1:1', 'book': 'MAT', 'chapter': '1', 'verse': '1', 'ids': 'n40001001001|n40001001002|n40001001003|n40001001004|n40001001005|n40001001006|n40001001007|n40001001008', 'gloss': '[The] book  of [the] genealogy  of Jesus  Christ  son  of David  son  of Abraham. '}), Document(page_content='MAT 1:25 - and  not  knew  her  until  that  she had brought forth  a son· and  he called  the  name  of Him  Jesus. ', metadata={'source': 'MAT 1:25', 'book': 'MAT', 'chapter': '1', 'verse': '25', 'ids': 'n40001025001|n40001025002|n40001025003|n40001025004|n40001025005|n40001025006|n40001025007|n40001025008|n40001025009|n40001025010|n40001025011|n40001025012|n40001025013|n40001025014', 'gloss': 'and  not  knew  her  until  that  she had brought forth  a son· and  he called  the  name  of Him  Jesus. '}), Document(page_content='MAT 1:21 - She will bear 
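
For the metadata-filtering TODO, one option is to filter on the stored metadata instead of putting the book name in the query string. LangChain's Chroma wrapper is understood to accept a `filter` argument on `similarity_search` (for example `filter={'book': '1PE'}`), though this should be checked against the installed version. The sketch below shows the same idea as a plain-Python post-filter; `Doc` is a hypothetical stand-in for LangChain's `Document`, and `filter_by_metadata` is an illustrative helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every key/value pair in criteria."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    Doc(
        "2PE 1:9 - In whomever for not are present these things, blind he is being short sighted, "
        "forgetfulness having received of the purification the former of him sins.",
        {"source": "2PE 1:9", "book": "2PE", "chapter": "1", "verse": "9"},
    ),
    Doc(
        "JHN 9:1 - And passing by He saw a man blind from birth.",
        {"source": "JHN 9:1", "book": "JHN", "chapter": "9", "verse": "1"},
    ),
]

john_only = filter_by_metadata(docs, book="JHN")
print([doc.metadata["source"] for doc in john_only])
```

Post-filtering only narrows the results already returned, so a store-side filter is preferable when the relevant verses might not appear in the top hits.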

# Chains

In [23]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

In [24]:
docs = collection.search('blind', search_type='similarity')
docs

[Document(page_content='2PE 1:9 - In whomever  for  not  are present  these things, blind  he is  being short sighted, forgetfulness  having received  of the  purification  the  former  of him  sins. ', metadata={'source': '2PE 1:9', 'book': '2PE', 'chapter': '1', 'verse': '9', 'ids': 'n61001009001|n61001009002|n61001009003|n61001009004|n61001009005|n61001009006|n61001009007|n61001009008|n61001009009|n61001009010|n61001009011|n61001009012|n61001009013|n61001009014|n61001009015|n61001009016', 'gloss': 'In whomever  for  not  are present  these things, blind  he is  being short sighted, forgetfulness  having received  of the  purification  the  former  of him  sins. '}),
 Document(page_content='JHN 9:1 - And  passing by  He saw  a man  blind  from  birth. ', metadata={'source': 'JHN 9:1', 'book': 'JHN', 'chapter': '9', 'verse': '1', 'ids': 'n43009001001|n43009001002|n43009001003|n43009001004|n43009001005|n43009001006|n43009001007', 'gloss': 'And  passing by  He saw  a man  blind  from  b

In [25]:
from langchain.retrievers.tfidf import TFIDFRetriever

class ExtendedTFIDFRetriever(TFIDFRetriever):
    def get_distinctive_terms(self, doc_index, top_n=10):
        feature_names = self.vectorizer.get_feature_names_out()
        tfidf_vector = self.tfidf_array[doc_index]

        # Convert the sparse matrix row to a dense array
        tfidf_array_dense = tfidf_vector.toarray().flatten()

        # Get indices of top_n features
        top_indices = tfidf_array_dense.argsort()[-top_n:][::-1]

        # Get the corresponding feature names
        top_terms = [feature_names[i] for i in top_indices]

        # TODO: Filter out stopwords using list generated in one of the missional-ai notebooks

        return top_terms

texts = [doc.page_content for doc in docs] # get page content out of the queried docs

retriever = ExtendedTFIDFRetriever.from_texts(texts)
distinctive_terms = retriever.get_distinctive_terms(0) # index of document in texts
print(distinctive_terms)

['of', 'the', 'present', 'received', 'being', 'short', 'sighted', 'sins', 'for', 'purification']
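
The top terms above are dominated by function words such as 'of' and 'the', which is what the stopword TODO is meant to address. A minimal sketch of that filtering step follows, assuming a small hand-written English stopword set (`ENGLISH_STOPWORDS` and `drop_stopwords` are illustrative, not the list from the missional-ai notebooks).

```python
# Illustrative stopword set; a real list would be much longer
ENGLISH_STOPWORDS = {"of", "the", "for", "being", "a", "and", "in", "to"}


def drop_stopwords(ranked_terms, stopwords=ENGLISH_STOPWORDS, top_n=5):
    """Return the first top_n terms that are not stopwords, preserving rank order."""
    kept = [term for term in ranked_terms if term.lower() not in stopwords]
    return kept[:top_n]


ranked = ['of', 'the', 'present', 'received', 'being', 'short', 'sighted', 'sins', 'for', 'purification']
filtered = drop_stopwords(ranked)
print(filtered)
```

scikit-learn's `TfidfVectorizer` also takes a `stop_words` argument, so the same filtering could happen at vectorization time instead of after ranking.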


In [48]:
if 'tfidf.dictionary.model' not in os.listdir():
    !wget "https://github.com/ryderwishart/biblical-machine-learning/raw/main/gpt-inferences/tfidf.dictionary.model"
if 'tfidf.model' not in os.listdir():
    !wget "https://github.com/ryderwishart/biblical-machine-learning/raw/main/gpt-inferences/tfidf.model"




In [51]:
from collections.abc import Mapping
import numpy
from unidecode import unidecode
from gensim.models.tfidfmodel import TfidfModel
from gensim.corpora import Dictionary
tfidf_dictionary = Dictionary.load('tfidf.dictionary.model')
tfidf_model = TfidfModel.load('tfidf.model')  # load() is a classmethod, so no instance is needed

perseus_stopwords = "μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός".split(', ')
perseus_stopwords += "συ δ μοι".split(' ')
perseus_stopwords = [unidecode(w) for w in perseus_stopwords]

def tfidf_tokenize(string):
    # Keep only alphabetic characters and spaces (this drops digits and punctuation)
    output = ''.join(filter(lambda x: x.isalpha() or x == ' ', string))
    # Lowercase each token and drop stopwords; unidecode strips accents for the comparison only
    return [token.lower() for token in output.split() if unidecode(token.lower()) not in perseus_stopwords]

# example
input_text = "Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος."
input_tokens = tfidf_tokenize(input_text)
input_bow = tfidf_dictionary.doc2bow(input_tokens)
input_tfidf = tfidf_model[input_bow]
summary = sorted(input_tfidf, key=lambda x: x[1], reverse=True)[:10]
print('Most significant words in input text: ')
for token_id, score in summary:
    token = tfidf_dictionary[token_id]
    print(f'{score:.2f}: {token}')
    
def get_tfidf_summary(input_text):
    # Return up to ten (score, token) tuples, highest TF-IDF score first
    input_tokens = tfidf_tokenize(input_text)
    input_bow = tfidf_dictionary.doc2bow(input_tokens)
    input_tfidf = tfidf_model[input_bow]
    summary = sorted(input_tfidf, key=lambda x: x[1], reverse=True)[:10]
    output = []
    for token_id, score in summary:
        token = tfidf_dictionary[token_id]
        output.append((f'{score:.2f}', token))
    return output

print(get_tfidf_summary(input_text))

Most significant words in input text: 
0.56: λόγος
0.56: θεόν
0.48: ἀρχῇ
0.39: θεὸς
[('0.56', 'λόγος'), ('0.56', 'θεόν'), ('0.48', 'ἀρχῇ'), ('0.39', 'θεὸς')]
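
To make the scores above easier to interpret, here is a small hand-rolled TF-IDF sketch. It assumes the weighting is raw term frequency times log2(N/df) followed by L2 normalization, which is understood to be gensim's default for `TfidfModel`; the toy corpus and the `tfidf_scores` helper are made up for illustration and do not reproduce the pretrained model's numbers.

```python
import math
from collections import Counter


def tfidf_scores(doc_tokens, corpus):
    """Score the tokens of one document against a corpus of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: the number of documents each token appears in
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    term_freq = Counter(doc_tokens)
    # Raw term frequency times log2(N / df); tokens unseen in the corpus are skipped
    weights = {
        token: tf * math.log2(n_docs / doc_freq[token])
        for token, tf in term_freq.items()
        if doc_freq[token]
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    # L2-normalize and drop zero weights (tokens that occur in every document)
    return {token: w / norm for token, w in weights.items() if w > 0}


# Toy corpus, for illustration only
corpus = [["λόγος", "θεός"], ["λόγος", "ἀρχή"], ["λόγος", "φῶς", "φῶς"]]
scores = tfidf_scores(corpus[2], corpus)
print(scores)
```

A token that occurs in every document gets a weight of zero, which is why common words drop out of the summary even when they are frequent in the passage.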


In [52]:
def get_lexical_information(word_id):
    # Retrieve and return lexical prose only for the given word
    word_data = mg.loc[word_id].to_dict()
    lemma = word_data['lemma']
    text = word_data['text']
    output = f"Lexical information for {lemma}:\n"
    for key, value in word_data.items():
        if value not in (None, 'missing', 'nan'):
            if key == "class":
                output += f"- {key}: {text} is a {value},\n"
            elif key == "gloss":
                output += f"- {key}: meaning \"{value}.\"\n"
            elif key == "lemma":
                output += f"- {key}: The lemma form of this word is {value},\n"
            elif key == "morph":
                # output += f"- {key}: and it is parsed as a {value}\n" # TODO: expand morphological parse codes into prose - although, is this necessary given the other data points?
                pass
            elif key == "strong":
                # output += f"- {key}: with a Strong's number of {value}\n"
                pass
            elif key == "ln":
                # output += f"- {key}: {attribute_descriptions[key]}: {value}\n"
                pass
            elif key == "domain_label":
                # output += f"- {key} (relates to the general subject matter of the lemma): {attribute_descriptions[key]}: {domain_labels[value]}\n"
                pass
    
    return output

def get_tfidf_filtered_lexical_information(token_ids):
    print('getting token data for token ids', token_ids)
    # Retrieve and return TFIDF-filtered lexical information for the given passage
    tokens = [mg.loc[token_id].to_dict() for token_id in token_ids]
    
    passage_string = ' '.join([token['text'] for token in tokens])
    most_significant_token_tuples_in_passage = get_tfidf_summary(passage_string) # returns a list of (score, token) tuples
    print(most_significant_token_tuples_in_passage)
    
    tokens_data = []
    for score, token in most_significant_token_tuples_in_passage:
        # Look the token up within the passage itself rather than the whole corpus,
        # normalizing the surface form the same way tfidf_tokenize does
        matches = [t for t in tokens if ''.join(filter(str.isalpha, t['text'])).lower() == token]
        if not matches:
            continue
        token_data = dict(matches[0])
        token_data['tfidf_score'] = score
        tokens_data.append(token_data)
    
    return tokens_data

def get_syntactic_information(word_id):
    # Retrieve and return syntactic information for the given word
    word_data = mg.loc[word_id].to_dict()
    word_ref = word_data['ref']
    lemma = word_data['lemma']
    text = word_data['text']
    output = f"Syntactic information for {lemma}:\n"
    for key, value in word_data.items():
        if value not in (None, 'missing', 'nan'):
            if key == "subjref":
                # The id of the implied subject of the verb  - need to retrieve this from the macula greek 'mg' dataframe
                subject_referent_data = mg.loc[value].to_dict()
                # get the 'text' 'gloss' 'lemma' and semantic role label
                subject_referent_text = subject_referent_data['text']
                subject_referent_gloss = subject_referent_data['gloss']
                subject_referent_lemma = subject_referent_data['lemma']
                # subject_referent_semantic_role_label, complete_wordings = semantic_role_data.loc[semantic_role_data['xml:id'] == value].iloc[0]['semantic_role_label'], get_complete_wordings_for_role(value)
                # for wording in complete_wordings:
                #     complete_wordings_array = []
                #     if wording['forms'] != subject_referent_text:
                #         complete_wordings_array.append(' '.join(wording['forms']))
                #         break
                #     else:
                #         complete_wordings = None
                # complete_wordings_string = ', '.join(complete_wordings_array) if complete_wordings_array else None
                
                output += f"- Subject referent: {attribute_descriptions[key]}: {subject_referent_text} ({subject_referent_gloss}, lemma {subject_referent_lemma}) is the subject of the verb.\n" # and it plays the role of {subject_referent_semantic_role_label}")
                # if complete_wordings_string:
                #     output += f" (complete wording: {complete_wordings_string})\n"
                # output += f" in its semantic configuration.\n"
            elif key == "referent":
                # The id of the (usually pronominal) referent - need to retrieve this from the macula greek 'mg' dataframe
                referent_data = mg.loc[value].to_dict()
                # get the 'text' 'gloss' 'lemma' and semantic role label
                referent_text = referent_data['text']
                referent_gloss = referent_data['gloss']
                referent_lemma = referent_data['lemma']
                # referent_semantic_role_label, complete_wordings = semantic_role_data.loc[semantic_role_data['xml:id'] == value].iloc[0]['semantic_role_label'], get_complete_wordings_for_role(value)
                # for wording in complete_wordings:
                #     complete_wordings_array = []
                #     if wording['forms'] != referent_text:
                #         complete_wordings_array.append(' '.join(wording['forms']))
                #         break
                #     else:
                #         complete_wordings = None
                # complete_wordings_string = ', '.join(complete_wordings_array) if complete_wordings_array else None
                
                output += f"- Referent: {attribute_descriptions[key]}: {referent_text} ({referent_gloss}, lemma {referent_lemma}) is the referent of the pronoun.\n" # playing the role of {referent_semantic_role_label}\n"
                # if complete_wordings_string:
                #     output += f" (complete wording: {complete_wordings_string})\n"
                # output += f" in its semantic configuration.\n"
            elif key in ("person", "number", "gender", "case", "tense", "voice", "mood", "degree", "type"):
                output += f"- {key}: {attribute_descriptions[key]}: {value},\n"
     
    # Add semantic role data to syntactic context: 'this word is the {semantic_role} in the configuration {semantic_configuration}' 
    semantic_configuration = describe_semantic_configuration(word_id, semantic_role_data, output_template) # output_template was defined above
    # Find the matching row from semantic_role_data, and get all of the labels plus the column names
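    role_rows = semantic_role_data.loc[semantic_role_data['xml:id'] == word_id]
    # Not every word has a semantic role row, so guard against an empty match
    semantic_role_info = {}
    if not role_rows.empty:
        semantic_role_info = {key: value for key, value in role_rows.iloc[0].items() if value != ''}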
    if semantic_role_info.get('semantic_role_label'):
        semantic_role = semantic_role_info['semantic_role_label']    
        output += f"- Semantic configuration (useful for figuring out what is taking place in the sentence and how this word plays a role): This word is the {semantic_role} {'in `' + semantic_configuration + '`' if semantic_configuration else ''}, and it has the following data: {semantic_role_info}\n"
    # TODO: need to exploit the roles.json file in order to get all related wordings for each frame. 
    
    
    # Add opentext syntax data to syntactic context
    # opentext_syntax_data = get_opentext_syntax_data(word_id)
    # if opentext_syntax_data:
    #     output += f"- OpenText syntax (useful for identifying all of the grammatical choices that led up to this word, such as whether it is part of a derived 'entity' definition or a nested 'turn' or a particular kind of speech act):\n  This word has the following syntactic selection features:\n"
    #     for feature in opentext_syntax_data:
    #         output += f"  - {feature['feature_name']}: {feature['feature_description']}\n"
    
    # Add treedown syntax data
    treedown_data = get_treedown_by_ref(word_ref)
    if treedown_data:
        output += f"- Treedown syntax: This word is part of the following sentence:\n{treedown_data}\n"
    
    return output

# def get_syntactic_information_for_verse(verse_ref):
    

def get_discourse_information(word_id):
    # Retrieve and return discourse information for the given word
    output = ''
    word_data = mg.loc[word_id].to_dict()
    word_ref = word_data['ref']
    
    discourse_features = get_discourse_annotation_types(word_id)
    if discourse_features:
        output += f"This word functions within {len(discourse_features)} discourse features (these are useful heuristic interpretive annotations that tell you about the nature of the proposition a word is in):\n"
        for feature in discourse_features:
            output += f"- {feature} is defined as {discourse_types[feature]['description']}\n"
    
    speaker_information = get_speaker_quotation_data(word_ref)
    if speaker_information:
        output += "\nSpeaker data is critical to identifying quoted material and relating it to the proper speaker.\n"
        if len(speaker_information) > 1:
            output += f"This word is spoken by {len(speaker_information)} speakers:\n"
        for speaker in speaker_information:
            output += f"This word is spoken by {speaker['who_is_speaking']}"
            if speaker.get("Divinity") == "Y":
                output += " (a divinity)"
            if speaker.get("Age"):
                output += f", age: {speaker['Age']}"
            if speaker.get("Comment"):
                output += f" ({speaker['Comment']})"
            # Assign speech before measuring it; use the truncated form for long quotations
            speech = speaker.get("what_is_said_complete") or ""
            if len(speech) > 100:
                speech = speaker.get("what_is_said_truncated") or speech
            # The tone may be stored under 'delivery_tone' or 'Delivery'; accept either
            tone = speaker.get("delivery_tone") or speaker.get("Delivery")
            tone_phrase = f" (in a {tone} tone)" if tone else ""
            output += f", who says{tone_phrase}, \"{speech}\"\n"
    return output

def get_social_information(word_id):
    # Retrieve and return social information for the given word
    output = ''  # returned unchanged when no situation data is found
    word_data = mg.loc[word_id].to_dict()
    word_ref = word_data['ref']
    
    lookup = situations_lookup_json
    situation_data = get_situations_data(word_ref)
    # if situation_data and situation_data.get('matchingSituation'):
    #     situation_data = situation_data['matchingSituation']
    if situation_data:
        pre_text_features = situation_data['preTextFeatures']
        via_text_features = situation_data['viaTextFeatures']
        
        if pre_text_features == 'no embedded discourse':
            return None

        pre_text_feature_descriptions, pre_text_system_descriptions = process_features(lookup, pre_text_features)
        via_text_feature_descriptions, via_text_system_descriptions = process_features(lookup, via_text_features)

        mutations = generate_mutations(pre_text_features, via_text_features, lookup)
        print(mutations)

        output = f"This word is part of the passage '{situation_data['title'][0]}'\n"

        # Add situation type information
        if situation_data.get('type'):
            situation_type = situation_data['type']
            output += f"This passage is a {situation_type['title']} situation, which can be described in typical terms as follows: {situation_type['description']}\n"

        output += f"It begins as a {' '.join(pre_text_features)} situation\n"
        output += '\n'.join(pre_text_feature_descriptions) + '\n'
        # output += '\n'.join(pre_text_system_descriptions)

        # output += f"And ends as a {' '.join(via_text_features)} situation\n"
        # output += '\n'.join(via_text_feature_descriptions)
        # output += '\n'.join(via_text_system_descriptions)

        if mutations and mutations[-1] != '':
            output += "During the passage, the situation:\n"
            output += '\n'.join(mutations)
    
    return output

def get_cultural_information(word_id):
    # Retrieve and return cultural information for the given word
    word_data = mg.loc[word_id].to_dict()
    lemma = word_data['lemma']
    text = word_data['text']
    output = f"Cultural information for {lemma}:\n"
    
    ln_data = word_data['ln']
    domain_data = word_data['domain_label']
    if not domain_data:
        print('no domain data for word!', word_data)
    
    if domain_data not in (None, 'missing', 'nan'):
        domain_string = '; '.join(domain_data)
        output += f"- Domain label (relates to the general subject matter of the lemma): {domain_string},\n"
    if ln_data not in (None, 'missing', 'nan'):
        output += f"- Louw and Nida domain: {attribute_descriptions['ln']}: {ln_data},\n"
                
    return output

def get_context_for_word(word_id, selected_fields=('lexis',)):
    output = ''
    # Generate and return the prosaic context for the given word
    for field in selected_fields:
        if field not in ('lexis', 'syntax', 'discourse', 'social', 'cultural'):
            raise ValueError(f"Invalid field name '{field}'")
        elif field == 'lexis':
            lexical_data = get_lexical_information(word_id)
            if lexical_data:
                output += lexical_data
        elif field == 'syntax':
            syntax_data = get_syntactic_information(word_id)
            if syntax_data:
                output += syntax_data
        elif field == 'discourse':
            discourse_data = get_discourse_information(word_id)
            if discourse_data:
                output += discourse_data
        elif field == 'social':
            social_data = get_social_information(word_id)
            if social_data:
                output += social_data
        elif field == 'cultural':
            cultural_data = get_cultural_information(word_id)
            if cultural_data:
                output += cultural_data
    return output
            
def get_context_for_verse(verse_ref):
    # Generate and return the prosaic context for the given verse
    if '!' in verse_ref:
        # If a word ref gets passed in, just get all of the words for the word's verse
        verse_tokens = mg.loc[mg['book_chapter_verse'] == verse_ref.split('!')[0]].to_dict('records')
    else:
        # If a verse ref gets passed in, get all of the words for the verse
        verse_tokens = mg.loc[mg['book_chapter_verse'] == verse_ref].to_dict('records')
    
    token_ids = [token['id'] for token in verse_tokens]
    print(token_ids)
    # Get the most distinctive words in the verse
    most_significant_tokens = get_tfidf_filtered_lexical_information(token_ids)
    print('most_significant_tokens in verse', len(most_significant_tokens), f'of {len(token_ids)} total tokens')
    
    word_data = []
    for token in most_significant_tokens:
        print('Processing token', token['text'])
        # Get lexical context for those words (the default field); pass
        # selected_fields=['lexis', 'cultural'] to include cultural context as well
        word_data.append(get_context_for_word(token['id']))
    
        
    # Discourse, syntactic, and social context are taken from the first significant word.
    # Guard against an empty list so an unmatched verse ref does not raise an IndexError.
    if most_significant_tokens:
        first_token_id = most_significant_tokens[0]['id']
        for field in ('discourse', 'syntax', 'social'):
            word_data.append(get_context_for_word(first_token_id, selected_fields=[field]))
    
    # Drop duplicate entries while keeping their original order
    # (a set would not preserve the order of the sections)
    word_data = list(dict.fromkeys(word_data))
    
    return '\n'.join(word_data)
    
        
def get_context_for_chapter(chapter):
    # Generate and return the prosaic context for the given chapter
    pass

def get_context_for_book(book):
    # Generate and return the prosaic context for the given book
    pass

def get_context_for_pericope(pericope):
    # Generate and return the prosaic context for the given pericope
    pass
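
Duplicate context blocks are filtered out before the verse context is joined into one string, and the order of the remaining blocks matters for readability. A minimal sketch of order-preserving de-duplication follows; the helper name `dedupe_preserving_order` and the sample strings are illustrative assumptions, not part of the notebook's data.

```python
def dedupe_preserving_order(blocks):
    """Return blocks with exact duplicates removed, keeping first occurrences in order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(blocks))


# Illustrative sample data, not real output from the notebook
blocks = [
    "Lexical information for word A",
    "Discourse information",
    "Lexical information for word A",
    "Social information",
]

unique_blocks = dedupe_preserving_order(blocks)
print('\n'.join(unique_blocks))
```

Unlike a plain `set`, this keeps the lexical sections ahead of the discourse and social sections in the final text.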


In [53]:


def get_similar_docs(query_string):
    return collection.search(query_string, search_type='similarity')

def get_range_of_examples(query_string):
    return collection.search(query_string, search_type='mmr')

def get_prosaic_context_for_verse(verse_ref_string):
    return get_context_for_verse(verse_ref_string)

# prompt should be something like this:
"""
(
    "You are a language model trained in linguistics," 
    " and you are great at summarizing structured data while"
    " focusing on linguistic features without delving into theological issues."
    " Please provide a concise textual commentary on the given {linguistic_data}"
    " (use the treedown representation for the larger text context of the target word)"
    " by examining its key lexical choices, syntactic structures, discourse organization,"
    " social context, and cultural references. Avoid personal opinions and maintain objectivity."
    " Illuminate the passage's nuances, foster clarity, and establish connections within the work,"
    " empowering readers to grasp the author's intentions and the interplay between language and content."
    " Please format your output using the following headings:"
    " 1. Lexical features"
    " 2. Syntactic context and function"
    " 3. Discourse context"
    " 4. Social context"
    " 5. Cultural/encyclopedic knowledge"    
).format(linguistic_data=linguistic_data)
"""

def answer_question_with_context(input_question):
    # first, find the relevant bible verse source ids
    similar_verses = get_similar_docs(input_question)
    print(similar_verses)
    # second, get the prosaic context for the first verse (later this can be a multi-verse thing...)
    verse_refs = [verse.metadata['source'] for verse in similar_verses]
    # print('>>>>>>', similar_verses[0].metadata['source'])
    prose = get_prosaic_context_for_verse(verse_refs[0])
    return prose
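
The prompt sketched in the string above could be wrapped in a small helper so that the prose returned by `answer_question_with_context` can be dropped straight into it. This is only a sketch: the names `COMMENTARY_PROMPT_TEMPLATE` and `build_commentary_prompt` are assumptions introduced here, and the sample input is made up.

```python
# Prompt text copied from the draft above; {linguistic_data} is the only placeholder
COMMENTARY_PROMPT_TEMPLATE = (
    "You are a language model trained in linguistics,"
    " and you are great at summarizing structured data while"
    " focusing on linguistic features without delving into theological issues."
    " Please provide a concise textual commentary on the given {linguistic_data}"
    " (use the treedown representation for the larger text context of the target word)"
    " by examining its key lexical choices, syntactic structures, discourse organization,"
    " social context, and cultural references. Avoid personal opinions and maintain objectivity."
    " Illuminate the passage's nuances, foster clarity, and establish connections within the work,"
    " empowering readers to grasp the author's intentions and the interplay between language and content."
    " Please format your output using the following headings:"
    " 1. Lexical features"
    " 2. Syntactic context and function"
    " 3. Discourse context"
    " 4. Social context"
    " 5. Cultural/encyclopedic knowledge"
)


def build_commentary_prompt(linguistic_data):
    """Fill the template with the prosaic context generated for a verse."""
    return COMMENTARY_PROMPT_TEMPLATE.format(linguistic_data=linguistic_data)


# Made-up input, standing in for the output of answer_question_with_context(...)
prompt = build_commentary_prompt("Cultural information for a sample lemma")
print(prompt[:120])
```

Keeping the template as a module-level constant makes it easy to revise the wording without touching the code that calls the LLM.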

In [54]:
print(answer_question_with_context('Who healed the man born blind?'))

[Document(page_content='JHN 9:1 - And  passing by  He saw  a man  blind  from  birth. ', metadata={'source': 'JHN 9:1', 'book': 'JHN', 'chapter': '9', 'verse': '1', 'ids': 'n43009001001|n43009001002|n43009001003|n43009001004|n43009001005|n43009001006|n43009001007', 'gloss': 'And  passing by  He saw  a man  blind  from  birth. '}), Document(page_content='MAT 21:14 - And  came  to Him  blind  and  lame  in  the  temple, and  He healed  them. ', metadata={'source': 'MAT 21:14', 'book': 'MAT', 'chapter': '21', 'verse': '14', 'ids': 'n40021014001|n40021014002|n40021014003|n40021014004|n40021014005|n40021014006|n40021014007|n40021014008|n40021014009|n40021014010|n40021014011|n40021014012', 'gloss': 'And  came  to Him  blind  and  lame  in  the  temple, and  He healed  them. '}), Document(page_content='JHN 9:2 - And  asked  Him  the  disciples  of Him  saying  Rabbi, who  sinned, this [man]  or  the  parents  of him, that  blind  he should be born; ', metadata={'source': 'JHN 9:2', 'book': 'J
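
As the output above shows, each retrieved document carries the verse reference in `metadata['source']` and a pipe-delimited string of word IDs in `metadata['ids']`. The sketch below shows one way to unpack those fields. The helper names are assumptions, and the word-reference form `'JHN 9:1!3'` is a hypothetical example based only on the `'!'` split used in `get_context_for_verse`.

```python
def word_ids_from_metadata(metadata):
    """Split the pipe-delimited 'ids' field into a list of word IDs."""
    return [word_id for word_id in metadata.get('ids', '').split('|') if word_id]


def verse_ref_from(ref):
    """Return the verse part of a reference, dropping any '!word' suffix."""
    return ref.split('!')[0]


# Metadata copied from the first document in the output above
metadata = {
    'source': 'JHN 9:1',
    'book': 'JHN',
    'chapter': '9',
    'verse': '1',
    'ids': 'n43009001001|n43009001002|n43009001003|n43009001004|n43009001005|n43009001006|n43009001007',
}

word_ids = word_ids_from_metadata(metadata)
print(metadata['source'], len(word_ids), 'word ids')
print(verse_ref_from('JHN 9:1!3'))  # hypothetical word reference
```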

# Chains

In [None]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
query = "Is blindness a generally positive or negative trait?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

# Agents

In [None]:
# # Create a RetrievalQA tool with this vectorstore
# bible_tool_no_sources = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=collection.as_retriever())

# tools = [
#     Tool(
#         name = "Bible QA System",
#         func=bible_tool_no_sources.run,
#         description="useful for when you need to find relevant documents for answering the user's questions. Input should be a fully formed question.",
#         return_direct=True,  # If you want to use the agent as a router and return results directly
#     ),
# ]

# Create a RetrievalQAWithSourcesChain tool with this vectorstore
# bible_tool = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=collection.as_retriever())

# tools = [
#     Tool(
#         name = "Bible QA System",
#         func=bible_tool.run,
#         description="useful for when you need to answer questions about the Bible. Input should be a fully formed question.",
#         # return_direct=True  # If you want to use the agent as a router and return results directly
#     ),
# ]

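# Create a QA-with-sources chain that answers from documents passed to it directly
bible_doc_qa_chain = load_qa_with_sources_chain(llm=llm, chain_type="stuff", verbose=True)
query = "Is blindness a generally positive or negative trait?"
print('test run', bible_doc_qa_chain({"input_documents": docs, "question": query}, return_only_outputs=True))

def run_bible_doc_qa(question):
    # A Tool passes a single string, but the chain expects documents plus a question,
    # so retrieve the relevant verses first and then run the chain on them
    relevant_docs = get_similar_docs(question)
    result = bible_doc_qa_chain({"input_documents": relevant_docs, "question": question}, return_only_outputs=True)
    return result["output_text"]

tools = [
    Tool(
        name = "Bible QA System",
        func=run_bible_doc_qa,
        description="useful for when you need to answer questions about the Bible. Input should be a fully formed question.",
        verbose=True
        # return_direct=True  # If you want to use the agent as a router and return results directly
    ),
]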

In [None]:
# Initialize the agent
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)


In [None]:
agent.run("How does Jesus heal a man born blind?")

Here is a two-step question. The zero-shot ReAct description agent cannot handle the second part properly.

In [None]:
# agent.run("Where was Job from? Does that place get mentioned anywhere else in the Bible?")
agent.run("What is an ephod? Who wears one?")



However, the self-ask with search agent can.

In [None]:
from langchain import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType
from langchain.chains import RetrievalQA

# Assuming `llm` and the `collection` vectorstore have already been initialized above
bible_qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=collection.as_retriever())

tools = [
    Tool(
        name="Intermediate Answer",
        func=bible_qa.run,
        description="useful for when you need to answer questions about the Bible. Input should be a fully formed question."
    )
]

self_ask_with_search = initialize_agent(tools, llm, agent=AgentType.SELF_ASK_WITH_SEARCH, verbose=True)
# self_ask_with_search.run("Where was Job from? Does that place get mentioned in the Bible?")
self_ask_with_search.run("What is an ephod? Who wears one?")


Much better.
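
The idea behind the improvement is that each part of the question is answered through the "Intermediate Answer" tool before a final response is assembled. The toy sketch below illustrates only that decomposition step. It is not how the LangChain agent works internally (there the LLM itself generates the follow-up questions), and `answer_in_steps` and `fake_bible_qa` are hypothetical names, with `fake_bible_qa` standing in for `bible_qa.run`.

```python
def answer_in_steps(question, intermediate_answer):
    """Naively split a multi-part question on '?' and answer each part separately."""
    sub_questions = [part.strip() + '?' for part in question.split('?') if part.strip()]
    return [(sub_q, intermediate_answer(sub_q)) for sub_q in sub_questions]


def fake_bible_qa(sub_question):
    # Hypothetical stand-in for bible_qa.run; a real tool would query the vectorstore
    return f"(answer to: {sub_question})"


steps = answer_in_steps("What is an ephod? Who wears one?", fake_bible_qa)
for sub_q, answer in steps:
    print(sub_q, '->', answer)
```

A real self-ask agent can also feed earlier answers into later follow-up questions, which a simple split like this cannot do.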

## Next steps:

- Combine the self-ask agent with the Bible vector retrieval agent.
- Think through some custom tooling for exciting new applications such as:
  - Bible translation functionality in low- or no-resource languages
  - Extracting new structured datasets using text plus existing structured data that can be retrieved on the fly
  - Evaluating Bible translations using qualitative "metrics" (as opposed to more conventional ones) on the basis of structured data extracted from the back translation

Questions? Ideas? Reach out to ryderwishart (at) gmail dot com!