The goal is to take every row of the conditionals data and attach Macula IDs to it, separately for the protasis and the apodosis.

Ingest Excel sheets

In [1]:
import re
import os
import pandas as pd

In [2]:
# Set the maximum number of rows and columns to display (set them to None for unlimited)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
ids_file = 'macula_ids_with_glosses.tsv'
excel_path = 'CanIL Analysis of NT Conditionals by book220831.xlsx'

In [4]:
# Function to merge columns with similar names
def merge_columns(df):
    # Group by the first word in the column name
    grouped = df.groupby(by=lambda x: x.split(" ")[0], axis=1)
    
    # Combine grouped columns
    for name, group in grouped:
        if len(group.columns) > 1:
            df[name] = group.apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)

    # Drop the original columns, keeping only the merged ones
    for name, group in grouped:
        if len(group.columns) > 1:
            df.drop(columns=group.columns[1:], inplace=True)
    return df

In [5]:

# Read all sheet names in the Excel file
xl = pd.ExcelFile(excel_path)
sheet_names = xl.sheet_names

# Skip the first and last sheets, as they are not per-book data (the first is just introductory material)
sheet_names = sheet_names[1:-1]

# Read and concatenate all the sheets, making the column names lowercase
df_list = []
for sheet in sheet_names:
    df = pd.read_excel(excel_path, sheet_name=sheet)
    df.columns = map(str.lower, df.columns)  # Making column names lowercase
    df = merge_columns(df)  # Merge similar columns
    
    # Need to process the 'reference' column to add the sheet name + ' ' to the beginning of each value
    try:
        df['reference'] = sheet + ' ' + df['reference'].astype(str)
    except KeyError:
        print('Sheet name', sheet, 'does not have a "reference" column.')
    
    df_list.append(df)

In [6]:
# Concatenate all DataFrames in df_list
concatenated_df = pd.concat(df_list, ignore_index=True)

# From here on, work with the concatenated DataFrame under the name mark_df
mark_df = concatenated_df

ids_df = pd.read_csv(ids_file, sep='\t')
# Add 'macula_ids_in_protasis' and 'macula_ids_in_apodosis' columns to mark_df as empty arrays
mark_df['macula_ids_in_protasis'] = [[] for _ in range(len(mark_df))]
mark_df['macula_ids_in_apodosis'] = [[] for _ in range(len(mark_df))]
# let's extract the 'ref' into a 'b_ch_v' column: e.g., MRK 1:1!1 -> MRK 1:1
ids_df['b_ch_v'] = ids_df['ref'].str.extract(r'(\w+ \d+:\d+)')
mark_df['b_ch_v'] = mark_df['reference'].str.extract(r'(\w+ \d+:\d+)')

# Create a subset of ids_df containing only rows whose 'b_ch_v' value also appears in mark_df's 'b_ch_v' column
# e.g., the ids_df row with ref MRK 1:1!1 has b_ch_v 'MRK 1:1'; isin() tests exact membership, not substrings
subset = ids_df[ids_df['b_ch_v'].isin(mark_df['b_ch_v'])]

# Fill all NaN in mark_df with empty strings (ids_df is left untouched)
mark_df = mark_df.fillna('')

In [7]:
# add matched english words (which line up with matched ids) into the dataframe as a new column, one for 'english', one for 'gloss'
mark_df['matched_english_in_protasis'] = [[] for _ in range(len(mark_df))]
mark_df['matched_english_in_apodosis'] = [[] for _ in range(len(mark_df))]
mark_df['matched_gloss_in_protasis'] = [[] for _ in range(len(mark_df))]
mark_df['matched_gloss_in_apodosis'] = [[] for _ in range(len(mark_df))]
mark_df['all_verse_word_tuples'] = [[] for _ in range(len(mark_df))]

mark_df['matched_protasis_words'] = [[] for _ in range(len(mark_df))]
mark_df['all_protasis_words'] = [[] for _ in range(len(mark_df))]
mark_df['matched_apodosis_words'] = [[] for _ in range(len(mark_df))] 
mark_df['all_apodosis_words'] = [[] for _ in range(len(mark_df))] 
mark_df['unmatched_protasis_words'] = [[] for _ in range(len(mark_df))] 
mark_df['unmatched_apodosis_words'] = [[] for _ in range(len(mark_df))] 

backup_df = mark_df.copy()

In [8]:
mark_df.head()

Unnamed: 0,reference,scope of conditional (esv unless noted),class,inv.,probability,time orientation,illocutionary force,english translations,notes,parallel passage(s),unnamed: 10,unnamed:,parallel passages,unnamed: 9,scope of conditional (esv unless otherwise indicated),scope of conditional (esv),scope of conditional (esv unless stated otherwise),macula_ids_in_protasis,macula_ids_in_apodosis,b_ch_v,matched_english_in_protasis,matched_english_in_apodosis,matched_gloss_in_protasis,matched_gloss_in_apodosis,all_verse_word_tuples,matched_protasis_words,all_protasis_words,matched_apodosis_words,all_apodosis_words,unmatched_protasis_words,unmatched_apodosis_words
0,MAT 4:3,p: (if you are the Son of God)\nq: (command th...,1,,Factual,Present,Exhort,"ESV, NASB, NRSV, NIV, NLT: ""if""",P presents a fact that the temptor knew to be ...,Luke 4:3,,,,,,,,[],[],MAT 4:3,[],[],[],[],[],[],[],[],[],[],[]
1,MAT 4:6,p: (if you are the Son of God)\nq: (throw your...,1,,Factual,Present,Exhort,"ESV, NASB, NRSV, NIV, NLT: ""if""","As with 4:3, this conditional expresses no dou...",Luke 4:9,,,,,,,,[],[],MAT 4:6,[],[],[],[],[],[],[],[],[],[],[]
2,MAT 4:9,q:(All these I will give you)\np: (if you will...,3,x,Very Unlikely,Present,Promise / Exhort,"ESV, NASB, NRSV, NIV, NLT: ""if""",In 4:3 and 4:6 the exhortations (or temptation...,Luke 4:7,,,,,,,,[],[],MAT 4:9,[],[],[],[],[],[],[],[],[],[],[]
3,MAT 5:13a,p: (if salt has lost its taste)\nq: (how shall...,3,,Unlikely,Gnomic,Assert,"ESV, NASB, NRSV, NIV, NLT: ""if""",This conditional must be read in its context o...,Mark 9:50; Luke 14:34,,,,,,,,[],[],MAT 5:13,[],[],[],[],[],[],[],[],[],[],[]
4,MAT 5:20,p: (unless your righteousness exceeds that of ...,3,,Very Unlikely,Gnomic,Warn,"ESV, NASB, NRSV, NIV, NLT: ""unless""","Greek: εαν μη. Nolland, Hagner, and others agr...",,,,,,,,,[],[],MAT 5:20,[],[],[],[],[],[],[],[],[],[],[]


Next, we need to match up strings. Here's an example.

In the `Scope of conditional (ESV unless noted)` column, we have the string:

"p: (If you will) q: (you can make me clean)"
`p: \(.*?\)` matches the protasis
`q: \(.*?\)` matches the apodosis
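
The two patterns above can be combined into one helper. This is only a minimal sketch: `split_scope` and `CLAUSE_RE` are hypothetical names, not part of the notebook. It assumes each clause is wrapped in a single pair of parentheses, so a clause that itself contains parentheses would be cut short at the first `)`.

```python
import re

# Hypothetical helper: pull out the text inside "p: (...)" and "q: (...)".
# \s* tolerates forms such as "q:(All these ...)" with no space after the colon.
CLAUSE_RE = re.compile(r'\b([pq]):\s*\((.*?)\)', re.DOTALL)

def split_scope(scope):
    """Return {'p': [...], 'q': [...]} with the clause texts found in scope."""
    clauses = {'p': [], 'q': []}
    for label, text in CLAUSE_RE.findall(scope):
        clauses[label].append(text.strip())
    return clauses

example = "p: (If you will) q: (you can make me clean)"
parts = split_scope(example)
print(parts)
```

Returning lists rather than single strings leaves room for rows that mark more than one `p:` or `q:` clause.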

Now, in ids_df, the rows whose 'b_ch_v' value matches a given mark_df row (i.e., the subset relevant to just one row in mark_df) look like this:

xml:id	ref	english	gloss	text	b_ch_v
603	n41001040001	MRK 1:40!1	and	And	Καὶ	1:40
604	n41001040002	MRK 1:40!2	came	comes	ἔρχεται	1:40
605	n41001040003	MRK 1:40!3	to	to	πρὸς	1:40
606	n41001040004	MRK 1:40!4	him	Him	αὐτὸν	1:40
607	n41001040005	MRK 1:40!5	leper	a leper	λεπρὸς	1:40

So, here we note that there is both an 'english' and 'gloss' column.

We want to find all ids_df rows whose 'b_ch_v' value matches the 'Reference' column in mark_df. We then take the 'english' and 'gloss' values from those rows and match them against the English words in the 'Scope of conditional (ESV unless noted)' column of mark_df, stripping parentheses and square brackets from both sides before comparing.
Finally, we populate the columns 'macula_ids_in_protasis' and 'macula_ids_in_apodosis' with the 'xml:id' values of the matching ids_df rows.
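
A naive version of this matching step might look like the sketch below. The helpers `normalize` and `match_clause` are hypothetical, and the toy rows are copied from the MAT 4:3 table in this notebook. It treats the clause as a bag of words, so it can over-match common words such as "the" that occur elsewhere in the verse; it is meant to illustrate the idea, not to be the final alignment method.

```python
import re
import pandas as pd

def normalize(text):
    """Lowercase and split into words, dropping brackets and punctuation."""
    return re.findall(r"[a-z0-9']+", str(text).lower())

def match_clause(clause, verse_rows):
    """Return xml:id values whose 'english' or 'gloss' words occur in clause."""
    clause_words = set(normalize(clause))
    matched = []
    for _, row in verse_rows.iterrows():
        candidates = []
        for col in ('english', 'gloss'):
            if pd.notna(row[col]):  # skip missing english/gloss cells
                candidates.extend(normalize(row[col]))
        if any(word in clause_words for word in candidates):
            matched.append(row['xml:id'])
    return matched

# Toy rows taken from the MAT 4:3 table
toy = pd.DataFrame({
    'xml:id': ['n40004003005', 'n40004003007', 'n40004003008', 'n40004003009'],
    'english': ['said', 'if', 'son', 'are'],
    'gloss': ['he said', 'If', 'Son', 'You are'],
})
ids = match_clause('if you are the Son of God', toy)
print(ids)
```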

In [34]:
mark_df.tail()

Unnamed: 0,reference,scope of conditional (esv unless noted),class,inv.,probability,time orientation,illocutionary force,english translations,notes,parallel passage(s),unnamed: 10,unnamed:,parallel passages,unnamed: 9,scope of conditional (esv unless otherwise indicated),scope of conditional (esv),scope of conditional (esv unless stated otherwise),macula_ids_in_protasis,macula_ids_in_apodosis,b_ch_v,matched_english_in_protasis,matched_english_in_apodosis,matched_gloss_in_protasis,matched_gloss_in_apodosis,all_verse_word_tuples,matched_protasis_words,all_protasis_words,matched_apodosis_words,all_apodosis_words,unmatched_protasis_words,unmatched_apodosis_words
940,REV 13:15,,3.0,,,,,"ESV, NRSV ""those who""\nNASB: ""as many as""\nNIV: ""all who""\nNLT: ""anyone""",Greek: ὅσοι ἐὰν 'as many as' Not in Boyer or Elliot,,,,,,,,p: (so that the image of the beast might speak and might cause)\nq: (those who would not worship the image of the beast to be slain),[],[],REV 13:15,[],[],[],[],[],[],[],[],[],[],[]
941,REV 14:3,,,x,,,,"ESV, NASB, NRSV, NIV, NLT: ""except""",Greek: εἰ μὴ 'except' Not in Elliot,,,,,,,,"q: (no one could learn that song)\np: (except the 144,000 who had been redeemed from the earth)",[],[],REV 14:3,[],[],[],[],[],[],[],[],[],[],[]
942,REV 14:11,,1.0,x,,,,"ESV, NASB: ""whoever""\nNRSV, NIV: ""anyone who""\nNLT: ""for they have... accepted the mark...""",Greek: εἴ τις 'if anyone' but here the sense is 'whoever' or 'those who',,,,,,,,"q: (and the smoke of their torment goes up forever and ever, and they have no rest, day or night, these worshippers of the beast and its image)\np: (and whoever receives the mark of its name)",[],[],REV 14:11,[],[],[],[],[],[],[],[],[],[],[]
943,REV 19:12,,,,,,,"ESV, NRSV, NIV: ""but""\nNASB, NLT: ""except""",Greek: εἰ μὴ 'except' Not in Elliot,,,,,,,,q: (and he has a name written that no one knows)\np: (but himself),[],[],REV 19:12,[],[],[],[],[],[],[],[],[],[],[]
944,REV 21:27,,,x,,,,"ESV, RSV, NIV, NASB: ""but only""",Greek: εἰ μὴ 'except' Not in Elliot,,,,,,,,"q: (but nothing uncleas will ever enter it, nor anyone who does what is detestable or false)\np: (but only those who are written in the Lamb's book of life)",[],[],REV 21:27,[],[],[],[],[],[],[],[],[],[],[]
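
The `tail()` output above shows the scope text for later sheets sitting in differently named columns (for example `scope of conditional (esv unless stated otherwise)`), so a lookup on `scope of conditional (esv unless noted)` alone comes back empty for those rows. One possible way to coalesce them into a single column is sketched below. `coalesce_scope` and the target column name `scope` are hypothetical, and the toy values are placeholders rather than real data.

```python
import pandas as pd

def coalesce_scope(df, prefix='scope of conditional', target='scope'):
    """Copy the first non-empty value among columns starting with prefix into target."""
    scope_cols = [c for c in df.columns if str(c).startswith(prefix)]
    scope = pd.Series('', index=df.index, dtype=object)
    for col in scope_cols:
        values = df[col].fillna('').astype(str)
        # Keep what is already filled; otherwise take this column's value
        scope = scope.where(scope != '', values)
    out = df.copy()
    out[target] = scope
    return out

# Toy frame with placeholder clause text
toy = pd.DataFrame({
    'reference': ['ROW 1', 'ROW 2'],
    'scope of conditional (esv unless noted)': ['p: (a) q: (b)', None],
    'scope of conditional (esv unless stated otherwise)': [None, 'q: (c) p: (d)'],
})
merged = coalesce_scope(toy)
print(merged['scope'].tolist())
```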


In [10]:
verse_df = ids_df[ids_df['b_ch_v'] == mark_df.iloc[0]['b_ch_v']]
verse_df

Unnamed: 0,xml:id,ref,english,gloss,text,b_ch_v
1247,n40004003001,MAT 4:3!1,and,And,καὶ,MAT 4:3
1248,n40004003002,MAT 4:3!2,came,having come,προσελθὼν,MAT 4:3
1249,n40004003003,MAT 4:3!3,the,the,ὁ,MAT 4:3
1250,n40004003004,MAT 4:3!4,tempter,[one] tempting,πειράζων,MAT 4:3
1251,n40004003005,MAT 4:3!5,said,he said,εἶπεν,MAT 4:3
1252,n40004003006,MAT 4:3!6,him,to Him,αὐτῷ,MAT 4:3
1253,n40004003007,MAT 4:3!7,if,If,Εἰ,MAT 4:3
1254,n40004003008,MAT 4:3!8,son,Son,υἱὸς,MAT 4:3
1255,n40004003009,MAT 4:3!9,are,You are,εἶ,MAT 4:3
1256,n40004003010,MAT 4:3!10,,-,τοῦ,MAT 4:3


In [23]:
pd.set_option('display.max_colwidth', None)

In [33]:
def generate_prompt(verse):
    p_q = mark_df.loc[mark_df['b_ch_v'] == verse]
    verse_df = ids_df[ids_df['b_ch_v'] == verse]                            

    prompt = f'''
    ## Instruction:
    Use the rows from the table below to associate xml:ids with the protasis and apodosis listed below to create 2 csv files. One for the protasis, and one for the apodosis:

    ## Context
    {verse_df}

    Here is an example:
    Protasis: if you are the Son of God
    ```
    if, If, Εἰ, n40004003007
    son, Son, υἱὸς, n40004003008
    are, You are, εἶ, n40004003009
    ...
    ```

    Apodosis: command these stones to become bread
    ```
    command, speak, εἰπὲ, n40004003012
    to, that, ἵνα, n40004003013
    NaN, the, οἱ, n40004003014
    ...
    ```

    Now follow a similar format with this protasis and apodosis pair:
    {' '.join(p_q['scope of conditional (esv unless noted)'].astype(str))}


    ## Results:

    '''
    return prompt

In [28]:
import openai
import getpass

In [29]:
openai_pass = getpass.getpass('Enter OpenAI secret key: ')

In [31]:
# Define the GPT completion function
# Note: openai.ChatCompletion and openai.error belong to the legacy openai<1.0 package interface

openai.api_key = openai_pass

model = 'gpt-3.5-turbo' # or 4

MAX_RETRIES = 10
def align(prompt):
    system_prompt = "Analyze the p-q phrases and align the individual words to ids with the table the user provides"
    messages = [
        {"role": 'system', "content": system_prompt},
        {"role": 'user', 'content': prompt}
    ]
    for i in range(MAX_RETRIES):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=messages,
                temperature=0.1,
            )
            generated_texts = [
                choice.message["content"] for choice in response["choices"]
            ]
            return generated_texts[0]
        except (openai.error.APIConnectionError, openai.error.APIError) as e:
            print('Error in alignment:', e)
            if i < MAX_RETRIES - 1:  # i is zero indexed
                continue
            else:
                return {"error": str(e)}

In [35]:
refs = mark_df['b_ch_v'].tolist()
refs[:5]

['MAT 4:3', 'MAT 4:6', 'MAT 4:9', 'MAT 5:13', 'MAT 5:20']

In [32]:
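# Append each verse's alignment to a text file, opened once as UTF-8 for the Greek text
with open('pq_macula.txt', 'a', encoding='utf-8') as file:
    for ref in refs:
        prompt = generate_prompt(ref)
        results = align(prompt)
        # align() returns {"error": ...} after exhausting its retries; record it instead of crashing
        if isinstance(results, dict):
            results = f"ERROR for {ref}: {results['error']}"
        file.write(results)
        file.write('\n')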

'Protasis:\n```\nif, If, Εἰ, n40004006004\nyou, You, σὺ, n40004006006\nare, are, εἶ, n40004006006\nthe, the, τοῦ, n40004006007\nSon, Son, υἱὸς, n40004006005\nof, of, τοῦ, n40004006007\nGod, God, θεοῦ, n40004006008\n```\n\nApodosis:\n```\nthrow, throw, βάλε, n40004006009\nyourself, Yourself, σεαυτὸν, n40004006010\ndown, down, κάτω, n40004006011\n```'
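
To get from a response like the one above to the `macula_ids_in_protasis` and `macula_ids_in_apodosis` lists, the text has to be parsed. The sketch below is one possible parser, not the notebook's own method. `parse_alignment` and `BLOCK_RE` are hypothetical names. It assumes the model keeps the "Protasis:" / "Apodosis:" labels followed by a block fenced with three backticks, and that each id is the last comma-separated field and looks like `n` plus 11 digits, as in the samples shown here. Lines that do not fit are skipped.

```python
import re

# Matches "Protasis:" or "Apodosis:" followed by a block fenced with three backticks
BLOCK_RE = re.compile(r'(Protasis|Apodosis):\s*`{3}\n(.*?)`{3}', re.DOTALL)

def parse_alignment(text):
    """Return {'protasis': [ids], 'apodosis': [ids]} from one model response."""
    parsed = {'protasis': [], 'apodosis': []}
    for label, body in BLOCK_RE.findall(text):
        for line in body.strip().splitlines():
            fields = [field.strip() for field in line.split(',')]
            # Assumption: the id is the last field, shaped like n + 11 digits
            if re.fullmatch(r'n\d{11}', fields[-1]):
                parsed[label.lower()].append(fields[-1])
    return parsed

fence = '`' * 3  # built this way to avoid a literal fence inside the example
sample = (
    f'Protasis:\n{fence}\nif, If, Εἰ, n40004006004\n'
    f'you, You, σὺ, n40004006006\n{fence}\n\n'
    f'Apodosis:\n{fence}\nthrow, throw, βάλε, n40004006009\n{fence}'
)
parsed = parse_alignment(sample)
print(parsed)
```

Because the model's ids are not guaranteed to be correct, it may be worth checking each parsed id against `ids_df['xml:id']` for the same verse before storing it.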