# Create a Glossary

In theory, GPT-4 can take a small amount of text, pick out the most obvious words, and infer how they are used.  Given more sections over time, it should learn more and be able to refine the glossary.

 - [ ] Given a portion of Scripture, generate a glossary
 - [ ] Expand the glossary by sending the old glossary plus more text and asking GPT to output only new words and edits to existing words (to reduce tokens)
 - [ ] Given a completed draft glossary, loop word by word, find verses where each word is used, and ask GPT to improve the entry
 - [ ] Save the glossary to a database and make it editable, with indexes on potential words
 - [ ] When a translation is requested, add the potential words to the request
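
The second item (asking GPT for only new words and edits) implies a merge step on our side. A minimal sketch of that merge, assuming each update is a dict in the glossary format with at least a `word` key; `merge_glossary` is a hypothetical helper, not something that exists in this notebook yet:

```python
from typing import Dict, List

Glossary = Dict[str, dict]


def merge_glossary(old: Glossary, updates: List[dict]) -> Glossary:
    """Return a new glossary with `updates` applied on top of `old`.

    An update for a known word overwrites only the fields it supplies;
    an update for an unknown word adds a new entry.
    """
    # Copy each entry so the caller's glossary is left untouched
    merged = {word: dict(entry) for word, entry in old.items()}
    for update in updates:
        word = update["word"]
        merged.setdefault(word, {"word": word}).update(update)
    return merged


old = {"erw": {"word": "erw", "english": ["and"], "strongs": ["G2532"]}}
updates = [
    {"word": "erw", "english": ["and", "also"]},  # edit to an existing word
    {"word": "lxi", "english": ["the"]},          # brand new word
]
merged = merge_glossary(old, updates)
print(merged["erw"]["english"])  # ['and', 'also']
```

Only the changed fields need to come back from GPT, which is where the token saving would come from.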

In [1]:
# Set up some defaults
TRAINING_SOURCE = ['MRK']
TEST_SOURCE = ['MAT']
GPT_VERSION = 'gpt-4-32k'
TOKENS_RESERVED_FOR_NEW_GRAMMER_PER_WORD = 500
VERSION = 'birrig'  # target version
SOURCES = ['source_content']
SAMPLES_PER_WORD = 50
MAX_WORDS_PER_BATCH = 1


In [2]:
# Imports and fixes to pathing
%reload_ext autoreload
import sys
sys.path.append('../lib')

import pandas as pd
import tiktoken
import openai, time, os
from openai.error import RateLimitError, OpenAIError
from config import get_config
from collections import defaultdict
from cipher import substitution_cipher

openai.api_type = os.environ["OPENAI_API_TYPE"] = get_config('openai')['api_type']
openai.api_base = os.environ["OPENAI_API_BASE"] = get_config('openai')['api_base']
openai.api_key = os.environ["OPENAI_API_KEY"] = get_config('openai')['api_key']
openai.api_version = os.environ["OPENAI_API_VERSION"] = get_config('openai')['api_version']

## Encode the language

Based on [https://github.com/ChrisPriebe/BibleTranslation/blob/exp/test-basic/get_bible.ipynb], encode the English BBE (Bible in Basic English) version using a letter substitution cipher.  This ensures that every word is new to the model and simulates an unseen language.

PRO
 - Starts with an empty language
 - Because the ciphered text comes from BBE while the model is given non-BBE versions as the source, word-for-word translation is not possible.  The model has to capture meanings

CONS
 - Does not reflect the linguistic nuances of other languages, such as stemming or changes to word order.
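
To make the idea concrete, here is a minimal sketch of a letter substitution cipher. It is not the `substitution_cipher` imported from `cipher` above, whose key and implementation are not shown here; `make_cipher_key` and `toy_substitution_cipher` are hypothetical names, and the key is just a seeded random permutation of the alphabet.

```python
import random
import string


def make_cipher_key(seed: int = 0) -> dict:
    """Build a one-to-one mapping over the lowercase ASCII letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the key is reproducible
    return dict(zip(letters, shuffled))


def toy_substitution_cipher(text: str, key: dict, encode: bool = True) -> str:
    """Swap each letter using `key`; keep case, spaces and punctuation."""
    table = key if encode else {v: k for k, v in key.items()}
    out = []
    for ch in text:
        mapped = table.get(ch.lower(), ch)  # non-letters pass through unchanged
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


key = make_cipher_key(seed=42)
original = "At the first God made the heaven and the earth."
encoded = toy_substitution_cipher(original, key)
decoded = toy_substitution_cipher(encoded, key, encode=False)
print(encoded)
print(decoded == original)
```

Because only letters are swapped, word boundaries, word lengths and punctuation survive, which is why the ciphered text cannot show stemming or word-order differences.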

 

In [3]:


# Read the data/birrig.csv file into a dataframe
df = pd.read_csv('../data/birrig.csv')
# rename df[0] to df['vref']
df.rename(columns={df.columns[0]: 'vref'}, inplace=True)
df.head()

Unnamed: 0,vref,book,chapter,verse,eng-web,eng-asv,eng-kjv2006,engBBE,hin2017,arbnav,latVUC,amo,source_content,birrig
0,GEN 1:1,GEN,1,1,"In the beginning, God created the heavens and ...",In the beginning God created the heavens and t...,In the beginning God created the heaven and th...,At the first God made the heaven and the earth.,आदि में परमेश्‍वर ने आकाश और पृथ्वी की सृष्टि ...,فِي الْبَدْءِ خَلَقَ اللهُ السَّمَاوَاتِ وَالأ...,In principio creavit Deus cælum et terram.,,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁ...,El lxi sovzl Guw newi lxi xiemir erw lxi ievlx.
1,GEN 1:2,GEN,1,2,The earth was formless and empty. Darkness was...,And the earth was waste and void; and darkness...,"And the earth was without form, and void; and ...",And the earth was waste and without form; and ...,"पृथ्वी बेडौल और सुनसान पड़ी थी, और गहरे जल के ...",وَإِذْ كَانَتِ الأَرْضُ مُشَوَّشَةً وَمُقْفِرَ...,"Terra autem erat inanis et vacua, et tenebræ e...",,וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖...,Erw lxi ievlx hez hezli erw holxual suvn; erw ...
2,GEN 1:3,GEN,1,3,"God said, “Let there be light,” and there was ...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","तब परमेश्‍वर ने कहा, “उजियाला हो*,” तो उजियाला...",أَمَرَ اللهُ: «لِيَكُنْ نُورٌ». فَصَارَ نُورٌ،,Dixitque Deus: Fiat lux. Et facta est lux.,,וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי...,"Erw Guw zeow, Pil lxivi fi pogxl: erw lxivi he..."
3,GEN 1:4,GEN,1,4,"God saw the light, and saw that it was good. G...","And God saw the light, that it was good: and G...","And God saw the light, that it was good: and G...","And God, looking on the light, saw that it was...",और परमेश्‍वर ने उजियाले को देखा कि अच्छा है*; ...,وَرَأَى اللهُ النُّورَ فَاسْتَحْسَنَهُ وَفَصَل...,Et vidit Deus lucem quod esset bona: et divisi...,,וַיַּ֧רְא אֱלֹהִ֛ים אֶת־ הָא֖וֹר כִּי־ ט֑וֹ...,"Erw Guw, puucorg ur lxi pogxl, zeh lxel ol hez..."
4,GEN 1:5,GEN,1,5,"God called the light “day”, and the darkness h...","And God called the light Day, and the darkness...","And God called the light Day, and the darkness...","Naming the light, Day, and the dark, Night. An...",और परमेश्‍वर ने उजियाले को दिन और अंधियारे को ...,وَسَمَّى اللهُ النُّورَ نَهَاراً، أَمَّا الظَّ...,"Appellavitque lucem Diem, et tenebras Noctem: ...",,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאוֹר֙ י֔וֹם וְלַחֹ...,"Renorg lxi pogxl, Wej, erw lxi wevc, Rogxl. Er..."


In [4]:
# loop through all of eng-web and count word by word and tell me which is the most common word
# and how many times it occurs
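
This cell was left empty. One possible way to do the count is sketched below, assuming the verses are available as plain strings (for example from `df['eng-web'].dropna()`); `most_common_word` is a hypothetical helper and the two sample verses are only illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def most_common_word(verses: Iterable[str]) -> Tuple[str, int]:
    """Return the most frequent word and how many times it occurs."""
    counts = Counter()
    for verse in verses:
        for token in verse.split():
            # lowercase and strip surrounding punctuation
            word = token.lower().strip('.,;:!?"“”\'()')
            if word:
                counts[word] += 1
    if not counts:
        raise ValueError("no words found")
    return counts.most_common(1)[0]


sample = [
    "In the beginning, God created the heavens and the earth.",
    "The earth was formless and empty.",
]
result = most_common_word(sample)
print(result)  # ('the', 4)
```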



In [5]:
training_df = df[df['book'].isin(TRAINING_SOURCE)]
test_df = df[df['book'].isin(TEST_SOURCE)]
print(f"TRAIN SIZE: {len(training_df)} verses")
print(f"TEST SIZE: {len(test_df)} verses")

TRAIN SIZE: 678 verses
TEST SIZE: 1071 verses


# Word frequency analysis


In [6]:
# Word analysis
# Determine the most common words in the train data

def get_words(df, version='birrig', min_frequency=0, min_length=0, cipher_decode=True):
    """Return (word, decoded word, count) tuples, most frequent first."""
    word_count = defaultdict(int)
    for index, row in df.iterrows():
        for word in row[version].split():
            # lowercase and remove punctuation
            word = word.lower().strip('.,;!?')
            if len(word) >= min_length:
                word_count[word] += 1

    # Sort the words by frequency, most common first
    sorted_words = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
    # Decode each word with the cipher only when requested
    all_words = [
        (word, substitution_cipher(word, encode=False).strip() if cipher_decode else word, count)
        for word, count in sorted_words
    ]

    # keep only the words that appear more than min_frequency times
    return [word for word in all_words if word[2] > min_frequency]

words = get_words(training_df)
# convert words to dataframe
words_df = pd.DataFrame(words, columns=['word', 'cipher', 'frequency'])
test_words = get_words(test_df)
# convert words to dataframe
test_words_df = pd.DataFrame(test_words, columns=['word', 'cipher', 'frequency'])

# merge the two dataframes
words_df = words_df.merge(test_words_df, on='word', how='outer', suffixes=('_train', '_test'))
words_df


Unnamed: 0,word,cipher_train,frequency_train,cipher_test,frequency_test
0,erw,and,1197.0,and,1488.0
1,lxi,the,942.0,the,1601.0
2,lu,to,639.0,to,977.0
3,us,of,476.0,of,861.0
4,xi,he,397.0,he,414.0
...,...,...,...,...,...
1644,zruh:,,,snow:,1.0
1645,uvwiviw:,,,ordered:,1.0
1646,kavvirl,,,current,1.0
1647,qvizirl,,,present,1.0


In [7]:
# how many words are in training but not test
print(f"TRAINING ONLY WORDS: {len(words_df[words_df['frequency_test'].isna()])}")
print(f"TEST ONLY WORDS: {len(words_df[words_df['frequency_train'].isna()])}")
print(f"OVERLAPPING WORDS: {len(words_df[words_df['frequency_train'].notna() & words_df['frequency_test'].notna()])}")

TRAINING ONLY WORDS: 209
TEST ONLY WORDS: 487
OVERLAPPING WORDS: 953


## Order the data to learn the most common first
Instead of learning words in random order, start with the most common ones (a, the, Jesus, etc.) so that they are established correctly first, then work down the list.  Rarer words are then learned in the context of words that are already known, which should help the learning process.

# GPT ANALYSIS


In [8]:
system_message = """
# Role
You are an expert linguist and polyglot.  You love to learn new languages.  You also have a doctorate in Theology and are fluent in the Biblical languages and the original word meanings.  

# Task
You will be given a new language and you are to create a glossary and dictionary for each word in the targeted words list.  

Here are the steps you must follow.  Show your work for each step.

## Step 1
The targeted word appears in every target verse we provide.  Therefore, if you list the concepts that appear in all of the corresponding source verses, that list will contain the meaning of our word.

Make that list now of the meanings that appear in all verses.  (If no meaning appears in all of them, assume this word has more than one meaning and find all those meanings.)

Create a dictionary of words and meanings, adding it to the dictionary the user provides (if they provide one).

## Step 2
For each targeted word look at all the verses it appears in.  What is the most likely word that appears in all those verses and is not yet defined in our dictionary?  What are the second and third most likely words?

## Step 3
Disagree with your answer in Step 2 and argue why it could be a different word.

## Step 4
Create the glossary in the format provided below.

Here are some guidelines

 - Use the vref to look up the original language, NIV, NKJV, Amplified, Swahili, Arabic, Chinese, German and French translations.  I am also providing the original text and the targeted text. 
 - Think step by step.  Propose 2 to 3 words it could be.  Explain why by quoting sentence fragments from English, Greek, German, Arabic, Swahili, Hebrew.

# Glossary format
Return the results in properly formatted JSON inside a code block

word: string

strongs: (string[]) array of strongs concordance numbers in order of likely meaning (max 3)

english: (string[]) array of english words in order of likely meaning (max 3)

grammer: (char[])  array of grammar codes in order of likely meaning (part of speech, tense, number, gender, stem (which word it is derived from), etc.)  Focus only on this targeted word and think step by step through the examples of how it was used.

debug: (string: max 200 words) Show your work here, what words did you consider, why is this one picked, quote sentence fragments from Greek and English to show why.  What is an alternate word it could mean?  

exceptions: (string): Note any edge cases, exceptions, unique language rules for this word.

# Input format
```
"""

input_format = """
## Targeted Words
{targeted_words}

## Current glossary (if any)
```
word\tdefinition
```
{definitions}

## Training Data
```
| Bible Verse Reference | Target Language to Learn |
{verses}
```
"""
system_message += input_format

def format_verses(row, version='source_content'):
    source = row[version].replace("|",":")
    target = row[VERSION].replace("|",":")
    return f"""
| {row['vref']} | {source} | {target} |
""".strip()

In [9]:
def call_gpt(system_message, content, max_tokens, debug=False):
    """Send one chat request; return the reply text, or None on failure."""
    messages = [
                {"role":"system","content": system_message},
                {"role":"user","content": content}
            ]
    start_time = time.time()
    if debug:
        print("Calling GPT", messages)
    try:
        response = openai.ChatCompletion.create(
            engine=GPT_VERSION,
            messages=messages,
            temperature=0.01,
            max_tokens=max_tokens,
        )
        result = response.get('choices',[{}])[0].get('message',{'content':''}).get('content','')  
        if debug:
            print("GPT Response", time.time()-start_time, result)
        return result
    
    except RateLimitError as e:
        # Back off, then return None so the caller can retry
        print(f"Rate Limit Error: {e}")
        time.sleep(30)
        return None
        
    except OpenAIError as e:
        print(f"OpenAI Error: {e}")
        return None

In [10]:
import re, json
###
# Extract glossary words from a GPT response

def extract_glossary(response, glossary=None):
    # Avoid a shared mutable default argument
    if glossary is None:
        glossary = {}
    if not response:
        return glossary
    
    # use a regex to find every fenced code block (optionally tagged ```json)
    matches = re.findall(r'```(?:json)?(.+?)```', response, re.DOTALL)
    new_words = {}
    for match in matches:
        # if it is JSON with a "word" key then use it, otherwise ignore it
        try:
            result = json.loads(match)
            new_words[result['word']] = result
        except (ValueError, KeyError, TypeError):
            pass

    return new_words

    # for line in response.split('\n'):
    #     if line.startswith('|'):
    #         parts = line.split('|')
    #         if len(parts) >= 5:
    #             word = parts[1].strip()
    #             decoded = substitution_cipher(word, encode=False).strip()
    #             strongs = parts[2].strip()
    #             english = parts[3].strip()
    #             grammer = parts[4].strip()
    #             glossary[word] = {'decoded': decoded, 'strongs': strongs, 'english': english, 'grammer': grammer, 'row': line}
    # return glossary

example = "## Step 1\nThe word \"ol\" appears in all the target verses. The most common meanings that appear in all source languages are:\n\n1. And\n2. But\n3. For\n\nCreating a dictionary of words and meanings:\n\n```\nword    definition\nol      and\nol      but\nol      for\n```\n\n## Step 2\nFor each targeted word, the most likely word that appears in all those verses and is not yet defined in our dictionary is \"ol\". The second most likely word is \"erw\", and the third most likely word is \"xi\".\n\n## Step 3\nIt could be argued that \"ol\" might also mean \"then\" or \"so\" in some contexts, as these words also appear frequently in the verses provided.\n\n## Step 4\nCreating the glossary:\n\n```\n{\n  \"word\": \"ol\",\n  \"strongs\": [\"G2532\", \"G1161\", \"G1063\"],\n  \"english\": [\"and\", \"but\", \"for\"],\n  \"grammer\": [\"CC\", \"CC\", \"CC\"],\n  \"debug\": \"Considered words: and, but, for. Picked 'ol' because it appears in all verses and matches the meanings of 'and', 'but', 'for' in Greek and English. Alternate word: then, so.\",\n  \"exceptions\": \"Might also mean 'then' or 'so' in some contexts.\"\n}\n```"
extract_glossary(example)


{'ol': {'word': 'ol',
  'strongs': ['G2532', 'G1161', 'G1063'],
  'english': ['and', 'but', 'for'],
  'grammer': ['CC', 'CC', 'CC'],
  'debug': "Considered words: and, but, for. Picked 'ol' because it appears in all verses and matches the meanings of 'and', 'but', 'for' in Greek and English. Alternate word: then, so.",
  'exceptions': "Might also mean 'then' or 'so' in some contexts."}}

In [11]:
# Test it
glossary = extract_glossary("""| word | strongs | english | grammer |
|------|---------|---------|---------|
| erw  | G2532   | and     | CC      |
| lxi  | G3588   | the     | T-NSF   |
| xon  | G846    | him     | P-DSM   |
| lu   | G4314   | to      | P       |
| xi   | G3588   | the     | T-DSM   |
| zeow | G3004   | say     | V-PAI-3S|
| jua  | G4771   | you     | P-2DP   |
| ol   | G3756   | not     | D       |
| zii  | G5100   | some    | PX-NSM  |
| lxin | G846    | them    | P-DPM   |
| hivi | G1096   | be      | V-PNI-3P|
| us   | G1537   | from    | P       |
| gu   | G1519   | into    | P       |
| or   | G1722   | in      | P       |
| fal  | G1161   | but     | CC      |
| zu   | G2443   | so that | CS      |
| lopp | G2193   | until   | CS      |
| lxel | G3704   | when    | CS      |
| lopp | G2193   | until   | CS      |
| lxi  | G3588   | the     | T-NSM   |
| hopp | G3361   | not     | D       |
| rul  | G1410   | able    | V-PNI-3S|
| leci | G2983   | take    | V-2AAS-3S|
| ni   | G1722   | in      | P       |
| az   | G1473   | us      | P-1GP   |
| xemi | G5100   | any     | PX-ASN  |
| ru   | G1722   | in      | P       |
| nurij| G3778   | this    | PD-DSF  |
| wuri | G3704   | so that | CS      |
| zu   | G2443   | so that | CS      |
| lxel | G3704   | when    | CS      |
| lxi  | G3588   | the     | T-NSF   |
| hopp | G3361   | not     | D       |
| rul  | G1410   | able    | V-PNI-3S|
| leci | G2983   | take    | V-2AAS-3S|
| ni   | G1722   | in      | P       |
| az   | G1473   | us      | P-1GP   |
| xemi | G5100   | any     | PX-ASN  |
| ru   | G1722   | in      | P       |
| nurij| G3778   | this    | PD-DSF  |""",{})

# convert glossary into a dataframe
pd.DataFrame(glossary).transpose()


#### Analysis

The table above came from asking for these words:

## Targeted Words
 - erw
 - lxi
 - lu
 - us
 - xi
 - xon
 - lxij
 - e
 - zeow
 - or
 - jua

GPT also returned other words it felt confident about, but I think we should leave them out because it does not have enough context for them yet.

It got xi wrong, so I said:

The word xi is not "the"  Please think step by step and let me know what it should be, show your work

It replied:

```
I apologize for the confusion. I will reanalyze the word "xi" and provide the correct translation with my work.

In the provided verses, "xi" appears to be functioning as a pronoun rather than the article "the." Let's take a closer look at one of the verses:

MRK 5:31: καὶ ἔλεγον αὐτῷ οἱ μαθηταὶ αὐτοῦ Βλέπεις τὸν ὄχλον συνθλίβοντά σε, καὶ λέγεις Τίς μου ἥψατο;
And his disciples said to him, "You see the crowd pressing around you, and yet you say, 'Who touched me?'"

In this verse, "xi" is used as a pronoun referring to Jesus ("him"):

Erw xoz wozkoqpiz zeow lu xon, Jua zii lxi qiuqpi vuarw jua ur imivj zowi, erw xi zeow, Hxj evi jua zlopp lvuafporg lxi Nezliv?

I will now update the glossary entry for "xi":
word	strongs	english	grammer
xi	G846	him, his	D
```

I like the "him, his" result.
We may need to go step by step for each targeted word like this and then produce a summary table at the end.
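
A summary table like that could be built from the per-word glossary entries once each word has been worked through. A minimal sketch, assuming entries follow the JSON glossary format above; `summarize_glossary` is a hypothetical helper and the sample entry is the corrected `xi` row from GPT's reply:

```python
from typing import Dict, List


def summarize_glossary(glossary: Dict[str, dict]) -> str:
    """Render glossary entries as a tab-separated summary table."""
    rows: List[str] = ["word\tstrongs\tenglish\tgrammer"]
    for word, entry in glossary.items():
        rows.append("\t".join([
            word,
            ", ".join(entry.get("strongs", [])),
            ", ".join(entry.get("english", [])),
            ", ".join(entry.get("grammer", [])),  # key spelled as in the prompt
        ]))
    return "\n".join(rows)


glossary = {
    "xi": {"word": "xi", "strongs": ["G846"], "english": ["him", "his"], "grammer": ["D"]},
}
table = summarize_glossary(glossary)
print(table)
```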

In [None]:
# every message follows <im_start>{role/name}\n{content}<im_end>\n
tokenizer = tiktoken.encoding_for_model('gpt-4')
tokens_left = 32000

frozen_tokens_left = tokens_left
glossary = {}

# Reset these after each GPT call
# We only have so many tokens to work with so we need to break up the data
verses = "" # The verses we are learning from
glossary_words = ""  # The words we already know
glossary_current = {} # Quick lookup table of words we already know to prevent duplication
targeted_words = {}

class GlossaryCreator:
    def __init__(self, training_df, words, target_language='birrig'):
        self.df = training_df.copy()
        # drop all rows that have no text in the target language before normalizing
        self.df = self.df[self.df[target_language].notna()].copy()
        self.df['normalized'] = self.df[target_language].apply(self.normalize_text)
        self.words = words
        self.index = 0
        self.target_language = target_language
        self.glossary = {}
        self.definitions = {}
        self.load_dictionary()
        self._reset_for_next_call()

    def load_dictionary(self):
        dictionary = pd.read_csv('dictionary.csv')
        # export pandas dataframe to dictionary
        for index, row in dictionary.iterrows():
            self.definitions[self.normalize_text(row['target'])] = row['source']

    def run(self):
        for (word, decoded, count) in self.words:
            self._next_word(word)
        # Flush the final batch to GPT, if any words are still pending
        if self.targetted_words:
            self.call_gpt()
        return self.glossary


    def normalize_text(self, text):
        # Remove punc and lowercase all words
        # TODO: for more languages you can use unicode base tools to look at the type of char it is
        # and if type of punc then skip it.
        return " ".join([ word.lower().strip('.,;!?[]{}()\\') for word in text.split()])

    def format_verse(self, verse):
        return f"{verse['vref']}\t{verse['source_content']}\t{verse[self.target_language]}" or ''
        #return f"{verse['vref']}\t{verse[self.target_language]}" or ''

    def format_word_definitions(self, word_definitions):
        return "\n".join([f"{word}\t{definition}" for (word, definition) in word_definitions.items() if definition])

    def get_used_word_definitions(self, verses_df):
        # Get all the words used in the verses
        words_used = {}
        for index, row in verses_df.iterrows():
            for word in row['normalized'].split():
                if word not in self.glossary_words:
                    words_used[word] = self.definitions.get(word, None)
        return words_used

    def _reset_for_next_call(self):
        self._reset_tokens()
        self.verses = ""
        self.glossary_words = {}
        self.targetted_words = []
        self.word_in_batch = 0

    def _reset_tokens(self):
        self.tokens_available = 32000
        # Each message carries a few tokens of framing overhead (approximated as 4)
        self.tokens_available -= len(tokenizer.encode(system_message)) + 4
        self.tokens_available -= len(tokenizer.encode(input_format)) + 4

    def _next_word(self, word):
        # find all verses that contain the word
        # shuffle the verses so we don't keep getting the same ones
        df = self.df.sample(frac=1)
        # Find x verses that contain this word (whole-word match, not substring)
        normalized = self.normalize_text(word)
        pattern = rf"(?:^|\s){re.escape(normalized)}(?:\s|$)"
        verses = df[df['normalized'].str.contains(pattern)].head(SAMPLES_PER_WORD)
        verses_text = "\n".join(verses.apply(self.format_verse, axis=1))
        new_word_definitions = self.get_used_word_definitions(verses)
        new_word_definition_text = self.format_word_definitions(new_word_definitions)

        # See if we have enough tokens to add this word
        tokens_used = len(tokenizer.encode(verses_text))+1 # +1 for the newline
        tokens_used += len(tokenizer.encode(word+"\n"))
        tokens_used += len(tokenizer.encode(new_word_definition_text))
        tokens_used += TOKENS_RESERVED_FOR_NEW_GRAMMER_PER_WORD

        if self.tokens_available - tokens_used > 0:
            # Push these changes to our data
            self.verses += "\n" + verses_text
            self.targetted_words.append(word)
            self.glossary_words.update(new_word_definitions)
            self.tokens_available -= tokens_used
            self.word_in_batch += 1
            if self.word_in_batch >= MAX_WORDS_PER_BATCH:
                self.call_gpt()
                self._reset_for_next_call()

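        else:
            # This word does not fit: send the current batch, then start a new one with it
            self.call_gpt()
            self._reset_for_next_call()
            self.verses = verses_text
            self.targetted_words = [word]
            # Rebuild glossary words as the glossary changed and we need all the words we are using
            self.glossary_words = self.get_used_word_definitions(verses)
            self.tokens_available -= tokens_used
            self.word_in_batch = 1
            if self.word_in_batch >= MAX_WORDS_PER_BATCH:
                self.call_gpt()
                self._reset_for_next_call()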

    def call_gpt(self):
        definitions = self.format_word_definitions(self.glossary_words)
        content = input_format.format(definitions=definitions, verses=self.verses, targeted_words="\n".join(self.targetted_words))
        print(content)
        for attempt in range(3):
            # -100 is just a buffer as you will get an openai error if you calculated it wrongly
            response = call_gpt(system_message, content, max_tokens=(self.tokens_available-100),debug=True)
            if response:
                break
        
        new_words = extract_glossary(response, self.glossary)
        for word, metadata in new_words.items():
            self.definitions[word] = json.dumps(metadata['english'])
        self.glossary.update(new_words)
        self.save_glossary()

    def save_glossary(self):
        # convert glossary into a dataframe
        pd.DataFrame(self.glossary).transpose().to_csv('glossary.csv')
        

glossary = GlossaryCreator(training_df, words).run()











## Targeted Words
erw

## Current glossary (if any)
```
word	definition
```
yuxr	John

yizaz	Jesus

qiliv	Peter


## Training Data
```
| Bible Verse Reference | Target Language to Learn |

MRK 7:2	καὶ  ἰδόντες  τινὰς  τῶν  μαθητῶν  αὐτοῦ  ὅτι  κοιναῖς  χερσίν, τοῦτ’  ἔστιν  ἀνίπτοις, ἐσθίουσιν  τοὺς  ἄρτους,—	Erw xew ziir lxel zuni us xoz wozkoqpiz luuc lxiov fview holx arkpier, lxel oz, arhezxiw, xerwz.
MRK 10:41	Καὶ  ἀκούσαντες  οἱ  δέκα  ἤρξαντο  ἀγανακτεῖν  περὶ  Ἰακώβου  καὶ  Ἰωάνου.	Erw xievorg lxoz, lxi lir fikeni mivj ergvj holx Yeniz erw Yuxr.
MRK 15:43	ἐλθὼν  Ἰωσὴφ  ὁ  ἀπὸ  Ἀριμαθαίας, εὐσχήμων  βουλευτής, ὃς  καὶ  αὐτὸς  ἦν  προσδεχόμενος  τὴν  βασιλείαν  τοῦ  Θεοῦ, τολμήσας  εἰσῆλθεν  πρὸς  τὸν  Πειλᾶτον  καὶ  ᾐτήσατο  τὸ  σῶμα  τοῦ  Ἰησοῦ.	Lxivi keni Yuziqx us Evonelxeie, e vizqurzofpi ner or xogx xuruav, hxu hez xonzips heolorg suv lxi corgwun us Guw; erw xi hirl or lu Qopeli holxual siev, erw newi e vitaizl suv lxi fuwj us Yizaz.
MRK 16:1	Καὶ  διαγενομένου  τοῦ  σαββάτο