<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/gpt-inferences/greek-hebrew-tsv-qa-agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question answering over MACULA Greek and Hebrew

In [None]:
# !pip install tabulate pandas langchain

Collecting tabulate
  Using cached tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0


## Set up MACULA dataframes

First, let's load the Greek data and add a `domain_label` column that holds human-readable domain labels in place of the numeric domain codes.

In [None]:
import requests, json, re, os
import pandas as pd

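def download_file(url, file_name):
    """Download `url` and save the response body to `file_name`."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open(file_name, "wb") as file:
        file.write(response.content)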

# file1_url = "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv"
file1_url = "https://github.com/Clear-Bible/macula-greek/raw/feature/add-sentence-id-to-tsv/Nestle1904/TSV/macula-greek.tsv" # PR version with sentence IDs
file1_name = "macula-greek.tsv"

# Mapping from MARBLE (SDBG) domain numbers to human-readable labels
file2_url = "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/sources/MARBLE/SDBG/marble-domain-label-mapping.json"
file2_name = "marble-domain-label-mapping.json"

if file1_name not in os.listdir():
    download_file(file1_url, file1_name)

if file2_name not in os.listdir():
    download_file(file2_url, file2_name)


In [None]:
# Import Macula Greek data
mg = pd.read_csv(
    "macula-greek.tsv", index_col="xml:id", sep="\t", header=0, converters={"*": str}
).fillna("missing")
# add an 'id' column
mg["id"] = mg.index

# mg['domain'] = mg['domain'].astype(str).fillna('missing')

# Extract book, chapter, and verse into separate columns
mg[["book", "chapter", "verse"]] = mg["ref"].str.extract(r"(\d?[A-Z]+)\s(\d+):(\d+)")

# Add columns for book + chapter, and book + chapter + verse for easier grouping
mg["book_chapter"] = mg["book"] + " " + mg["chapter"].astype(str)
mg["book_chapter_verse"] = mg["book_chapter"] + ":" + mg["verse"].astype(str)

# Import domain-label mapping
# Open the JSON file
with open("marble-domain-label-mapping.json", "r") as f:
    # Load the contents of the file as a dictionary
    domain_labels = json.load(f)

domain_labels["missing"] = "no domain"
domain_labels["nan"] = "no domain"

# Use domain labels to create a new column


def get_domain_label(domain_string_number):
    labels = [domain_labels[label] for label in domain_string_number.split(" ")]
    return labels


mg["domain_label"] = mg["domain"].apply(get_domain_label)
mg.head()

xml:id,sentence,ref,role,class,type,gloss,text,after,lemma,normalized,...,frame,subjref,referent,id,book,chapter,verse,book_chapter,book_chapter_verse,domain_label
n40001001001,n40001001001_n40001001008,MAT 1:1!1,missing,noun,common,[The] book,Βίβλος,,βίβλος,Βίβλος,...,missing,missing,missing,n40001001001,MAT,1,1,MAT 1,MAT 1:1,[Written Language]
n40001001002,n40001001001_n40001001008,MAT 1:1!2,missing,noun,common,of [the] genealogy,γενέσεως,,γένεσις,γενέσεως,...,missing,missing,missing,n40001001002,MAT,1,1,MAT 1,MAT 1:1,[Kinship Relations Involving Successive Genera...
n40001001003,n40001001001_n40001001008,MAT 1:1!3,missing,noun,proper,of Jesus,Ἰησοῦ,,Ἰησοῦς,Ἰησοῦ,...,missing,missing,missing,n40001001003,MAT,1,1,MAT 1,MAT 1:1,[Persons]
n40001001004,n40001001001_n40001001008,MAT 1:1!4,missing,noun,proper,Christ,Χριστοῦ,,Χριστός,Χριστοῦ,...,missing,missing,missing,n40001001004,MAT,1,1,MAT 1,MAT 1:1,[Persons]
n40001001005,n40001001001_n40001001008,MAT 1:1!5,missing,noun,common,son,υἱοῦ,,υἱός,υἱοῦ,...,missing,missing,missing,n40001001005,MAT,1,1,MAT 1,MAT 1:1,[Kinship Relations Involving Successive Genera...
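
To make the two transformations in the cell above concrete, here is a minimal sketch on a toy dataframe. The refs follow the `BOOK chapter:verse!word` pattern visible in the output, but the domain codes and labels in `toy_labels` are invented for illustration and are not taken from the MARBLE mapping. The helper `lookup_labels` is hypothetical: unlike `get_domain_label`, it falls back to "no domain" for an unknown code instead of raising a `KeyError`.

```python
import pandas as pd

# Toy rows; the domain codes below are placeholders, not real MARBLE codes.
toy = pd.DataFrame(
    {
        "ref": ["MAT 1:1!1", "1CO 13:4!2", "3JN 1:14!7"],
        "domain": ["001", "001 002", "missing"],
    }
)

# Hypothetical code-to-label mapping standing in for marble-domain-label-mapping.json.
toy_labels = {"001": "Label A", "002": "Label B", "missing": "no domain"}

# Same pattern as the notebook: optional leading digit, uppercase book code, chapter:verse.
toy[["book", "chapter", "verse"]] = toy["ref"].str.extract(r"(\d?[A-Z]+)\s(\d+):(\d+)")


def lookup_labels(domain_string, mapping=toy_labels):
    """Map a space-separated string of domain codes to a list of labels."""
    return [mapping.get(code, "no domain") for code in domain_string.split(" ")]


toy["domain_label"] = toy["domain"].apply(lookup_labels)
print(toy[["book", "chapter", "verse", "domain_label"]])
```

Note that `str.extract` returns strings, so `chapter` and `verse` would need an explicit `astype(int)` before any numeric sorting or comparison.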


Next, let's load the Hebrew data and add the same kind of `domain_label` column, this time built from the `lexdomain` codes.

In [None]:
# macula-hebrew.tsv is stored with Git LFS, so fetch it from media.githubusercontent.com;
# the raw.githubusercontent.com URL would return only the LFS pointer file.
file3_url = "https://media.githubusercontent.com/media/Clear-Bible/macula-hebrew/main/TSV/macula-hebrew.tsv"
file3_name = "macula-hebrew.tsv"

file4_url = "https://raw.githubusercontent.com/Clear-Bible/macula-hebrew/main/sources/MARBLE/SDBH/domain-label-mapping-1.json"
file4_name = "domain-label-mapping-1.json"

if file3_name not in os.listdir():
    # IPython expands {file3_name} and {file3_url} before running the shell command
    !wget -O {file3_name} {file3_url}
    
if file4_name not in os.listdir():
    download_file(file4_url, file4_name)

# Import Macula Hebrew data
mh = pd.read_csv(
    "macula-hebrew.tsv", index_col="xml:id", sep="\t", header=0, converters={"*": str}
).fillna("missing")
# add an 'id' column
mh["id"] = mh.index

# mh['domain'] = mh['domain'].astype(str).fillna('missing')

# Extract book, chapter, and verse into separate columns
mh[["book", "chapter", "verse"]] = mh["ref"].str.extract(r"(\d?[A-Z]+)\s(\d+):(\d+)")
# Add columns for book + chapter, and book + chapter + verse for easier grouping
mh["book_chapter"] = mh["book"] + " " + mh["chapter"].astype(str)
mh["book_chapter_verse"] = mh["book_chapter"] + ":" + mh["verse"].astype(str)

# Import the domain-label mapping for Hebrew as well, using the SDBH domain mapping
# Open the JSON file
with open("domain-label-mapping-1.json", "r") as f:
    # Load the contents of the file as a dictionary
    hebrew_domain_labels = json.load(f)

def get_hebrew_domain_label(domain_string_number):
    labels = [hebrew_domain_labels[label] for label in domain_string_number.split(">") if label in hebrew_domain_labels]
    return labels

mh["domain_label"] = mh["lexdomain"].apply(get_hebrew_domain_label)

mh.head()


xml:id,ref,class,text,transliteration,after,strongnumberx,stronglemma,sensenumber,greek,greekstrong,...,frame,subjref,participantref,id,book,chapter,verse,book_chapter,book_chapter_verse,domain_label
o010010010011,GEN 1:1!1,prep,בְּ,bə,missing,0871a,b,missing,ἐν,1722.0,...,missing,missing,missing,o010010010011,GEN,1,1,GEN 1,GEN 1:1,[]
o010010010012,GEN 1:1!1,noun,רֵאשִׁ֖ית,rēʾšiyṯ,,7225,7225,1.0,ἀρξῇ,746.0,...,missing,missing,missing,o010010010012,GEN,1,1,GEN 1,GEN 1:1,[Begin]
o010010010021,GEN 1:1!2,verb,בָּרָ֣א,bārāʾ,,1254,1254 a,1.0,ἐποίησεν,4160.0,...,A0:010010010031; A1:010010010052;010010010072;,missing,missing,o010010010021,GEN,1,1,GEN 1,GEN 1:1,[Exist]
o010010010031,GEN 1:1!3,noun,אֱלֹהִ֑ים,ʾĕlōhiym,,0430,430,1.0,θεὸς,2316.0,...,missing,missing,missing,o010010010031,GEN,1,1,GEN 1,GEN 1:1,[Deities]
o010010010041,GEN 1:1!4,om,אֵ֥ת,ʾēṯ,,0853,853,missing,missing,missing,...,missing,missing,missing,o010010010041,GEN,1,1,GEN 1,GEN 1:1,[Identifiers]
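
Because the Hebrew TSV is tracked with Git LFS, a download from the wrong endpoint can silently produce a tiny pointer file instead of the data. A minimal sketch of a sanity check follows. `looks_like_lfs_pointer` is a hypothetical helper, not part of the notebook; it relies on the fact that Git LFS pointer files begin with a `version https://git-lfs.github.com/spec/v1` line. The file contents in the demonstration are illustrative only.

```python
import os
import tempfile

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file at `path` begins with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    pointer_path = os.path.join(tmp, "pointer.tsv")
    data_path = os.path.join(tmp, "data.tsv")
    with open(pointer_path, "wb") as f:
        f.write(b"version https://git-lfs.github.com/spec/v1\noid sha256:0000\nsize 123\n")
    with open(data_path, "wb") as f:
        f.write("xml:id\tref\ttext\n".encode("utf-8"))
    pointer_result = looks_like_lfs_pointer(pointer_path)
    data_result = looks_like_lfs_pointer(data_path)

print(pointer_result, data_result)
```

In this notebook the check could be run as `looks_like_lfs_pointer("macula-hebrew.tsv")` before calling `pd.read_csv`.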


In [None]:
print("Macula Greek column names: ", mg.columns)
print("Macula Hebrew column names: ", mh.columns)

Macula Greek column names:  Index(['sentence', 'ref', 'role', 'class', 'type', 'gloss', 'text', 'after',
       'lemma', 'normalized', 'strong', 'morph', 'person', 'number', 'gender',
       'case', 'tense', 'voice', 'mood', 'degree', 'domain', 'ln', 'frame',
       'subjref', 'referent', 'id', 'book', 'chapter', 'verse', 'book_chapter',
       'book_chapter_verse', 'domain_label'],
      dtype='object')
Macula Hebrew column names:  Index(['ref', 'class', 'text', 'transliteration', 'after', 'strongnumberx',
       'stronglemma', 'sensenumber', 'greek', 'greekstrong', 'gloss',
       'english', 'mandarin', 'stem', 'morph', 'lang', 'lemma', 'pos',
       'person', 'gender', 'number', 'state', 'type', 'lexdomain',
       'contextualdomain', 'coredomain', 'sdbh', 'extends', 'frame', 'subjref',
       'participantref', 'id', 'book', 'chapter', 'verse', 'book_chapter',
       'book_chapter_verse', 'domain_label'],
      dtype='object')


## Set up QA agent

In [None]:
import getpass
secret_key = getpass.getpass('Enter OpenAI secret key: ')
os.environ['OPENAI_API_KEY'] = secret_key

In [None]:
import pandas as pd
from langchain.llms import OpenAI
from langchain.agents import create_pandas_dataframe_agent

In [None]:
multi_df_agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0),
    [mg, mh],  # list of dataframes: mg = Macula Greek, mh = Macula Hebrew
    verbose=True,
    max_tokens_limit=3000,
    max_tokens_per_prompt=1000,
)

In [None]:
multi_df_agent.run("How many unique domain labels are there (each domain label is a list)?")



> Entering new AgentExecutor chain...
Thought: I need to find the unique domain labels in the dataframes
Action: python_repl_ast
Action Input: df1['domain_label'].nunique()
Observation: TypeError: unhashable type: 'list'
Thought: I need to convert the domain labels to strings
Action: python_repl_ast
Action Input: df1['domain_label'].apply(lambda x: str(x)).nunique()
Observation: 1648
Thought: I now know the final answer
Final Answer: 1648

> Finished chain.


'1648'
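
It is worth noting what the agent actually counted: stringifying each list and calling `nunique()` counts distinct *combinations* of labels, and only on `df1` (the Greek dataframe). If the goal is the number of distinct individual labels, `explode` is one way to get it. A minimal sketch on toy data, with placeholder labels:

```python
import pandas as pd

# Toy stand-in for a list-valued domain_label column.
toy = pd.DataFrame(
    {"domain_label": [["Persons"], ["Persons", "Kinship"], ["Kinship"], ["Persons"]]}
)

# Lists are unhashable, so nunique() on the raw column raises TypeError.
# Converting each list to a tuple (or a string) counts distinct combinations.
n_combinations = toy["domain_label"].apply(tuple).nunique()

# explode() yields one row per label, so nunique() counts distinct labels.
n_labels = toy["domain_label"].explode().nunique()

print(n_combinations, n_labels)  # 3 2
```

On the real data, the same two expressions applied to `mg["domain_label"]` would show how far the agent's figure is from the number of individual labels.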

In [None]:
multi_df_agent.run("If you group all the rows by book, which books do not have some form of 'lord' or 'Lord' in their aggregated gloss column?")



> Entering new AgentExecutor chain...
Thought: I need to group the rows by book and then check if any of the aggregated gloss columns contain 'lord' or 'Lord'.
Action: python_repl_ast
Action Input: df1.groupby('book')['gloss'].agg(lambda x: ' '.join(x)).str.contains('lord|Lord')
Observation: book
1CO     True
1JN    False
1PE     True
1TH     True
1TI     True
2CO     True
2JN    False
2PE     True
2TH     True
2TI     True
3JN    False
ACT     True
COL     True
EPH     True
GAL     True
HEB     True
JAS     True
JHN     True
JUD     True
LUK     True
MAT     True
MRK     True
PHM     True
PHP     True
REV     True
ROM     True
TIT    False
Name: gloss, dtype: bool
Thought: I now know the final answer
Final Answer: The books that do not have some form of 'lord' or 'Lord' in their aggregated gloss column are 1JN, 2JN, 3JN, and TIT.

> Finished chain.


"The books that do not have some form of 'lord' or 'Lord' in their aggregated gloss column are 1JN, 2JN, 3JN, and TIT."

This was a bit shocking to me, but none of First, Second, or Third John, or Titus has 'lord' or 'Lord' anywhere in its glosses! (The agent only queried `df1`, the Greek dataframe, which is why only New Testament books appear.) To double-check this, look at the vocab list through [Scaife](https://vocab.perseus.org/word-list/urn:cts:greekLit:tlg0031.tlg017.perseus-grc2/?page=all)
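
The agent's answer can also be reproduced without the LLM. Below is a minimal sketch on toy data; the book codes and glosses are invented, and `books_without_word` is a hypothetical helper rather than anything from the notebook. On the real data the same function could be applied to `mg`.

```python
import pandas as pd


def books_without_word(df, pattern):
    """Return sorted book codes whose joined glosses never match `pattern` (case-insensitive)."""
    joined = df.groupby("book")["gloss"].agg(" ".join)
    has_match = joined.str.contains(pattern, case=False, regex=True)
    return sorted(has_match[~has_match].index)


toy = pd.DataFrame(
    {
        "book": ["AAA", "AAA", "BBB", "BBB", "CCC"],
        "gloss": ["the Lord", "said", "grace", "peace", "of lords"],
    }
)

missing = books_without_word(toy, r"lord")
print(missing)  # ['BBB']
```

This is a substring match, so it also counts glosses such as "lords"; a pattern like `r"\blord\b"` would restrict it to the whole word.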