# Create a Glossary

In theory, GPT-4 can take a small amount of text, pick out the most obvious words, and infer how they are used. Given more sections over time, it should learn more and be able to refine the glossary.

 - [ ] Given a portion of Scripture, generate a glossary
 - [ ] Expand the glossary by supplying the old glossary plus more text, and ask GPT to output only new words and edits to existing words (to reduce tokens)
 - [ ] Given a completed draft glossary, loop word by word, find references where each word is used, and ask GPT to improve the glossary
 - [ ] Save the glossary to a database and make it editable, with indexes on potential words
 - [ ] When requesting a translation, add the potential words
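
The second step above (asking GPT for only new and edited words) implies a merge on the caller's side. A minimal sketch, assuming the glossary is held in memory as a dict keyed by word, with each entry carrying the `strongs`, `english`, and `grammer` fields used in the glossary format later in this notebook. `merge_glossary` is a hypothetical helper, not part of the notebook:

```python
def merge_glossary(current, updates):
    """Return a new glossary where entries in `updates` add to or replace `current`.

    Both arguments map a word to a dict with 'strongs', 'english' and
    'grammer' keys (a hypothetical in-memory shape mirroring the table columns).
    """
    merged = {word: dict(entry) for word, entry in current.items()}
    for word, entry in updates.items():
        merged[word] = dict(entry)  # a new word, or an edit that overrides the old entry
    return merged


old = {"guw": {"strongs": ["G2316"], "english": ["God"], "grammer": ["N"]}}
new = {
    "guw": {"strongs": ["G2316"], "english": ["God", "god"], "grammer": ["N"]},
    "zur": {"strongs": ["G5207"], "english": ["son"], "grammer": ["N"]},
}
glossary = merge_glossary(old, new)
print(sorted(glossary))
```

The old glossary is left untouched, so a bad batch of edits can be discarded without losing the previous state.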

In [1]:
# Set up some defaults
TRAINING_SOURCE = ['MRK']
TEST_SOURCE = ['MAT']
GPT_VERSION = 'gpt-4-32k'
TOKENS_RESERVED_FOR_NEW_GRAMMER = 8000


In [2]:
# Imports and fixes to pathing
%reload_ext autoreload
import sys
sys.path.append('../lib')

import pandas as pd
import tiktoken
import openai, time, os
from openai.error import RateLimitError, OpenAIError
from config import get_config
from collections import defaultdict
from cipher import substitution_cipher

openai.api_type = os.environ["OPENAI_API_TYPE"] = get_config('openai')['api_type']
openai.api_base = os.environ["OPENAI_API_BASE"] = get_config('openai')['api_base']
openai.api_key = os.environ["OPENAI_API_KEY"] = get_config('openai')['api_key']
openai.api_version = os.environ["OPENAI_API_VERSION"] = get_config('openai')['api_version']

## Encode the language

Based on [https://github.com/ChrisPriebe/BibleTranslation/blob/exp/test-basic/get_bible.ipynb], encode the English BBE version using a letter substitution cipher. This ensures that every word is new to the model and simulates a new language.

PRO
 - Starts with an empty language
 - Using BBE as the input and training on translation from a non-BBE version prevents word-for-word translation; the model has to capture meanings

CONS
 - Does not reflect linguistic nuances found in other languages, such as stemming and changes to word order.
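
For reference, a letter substitution cipher of this kind can be sketched in a few lines. This is not the `cipher.substitution_cipher` imported above, whose key is not shown here; `make_key` and `toy_substitution_cipher` are hypothetical names, and the key is just a seeded shuffle of the alphabet:

```python
import random
import string


def make_key(seed=0):
    """Build a letter-to-letter mapping from a seeded shuffle (illustrative key only)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def toy_substitution_cipher(text, key, encode=True):
    """Swap each ASCII letter through `key`, keeping case; other characters pass through."""
    table = key if encode else {v: k for k, v in key.items()}
    out = []
    for ch in text:
        mapped = table.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


original = "At the first God made the heaven and the earth."
key = make_key(seed=42)
encoded = toy_substitution_cipher(original, key)
decoded = toy_substitution_cipher(encoded, key, encode=False)
print(encoded)
print(decoded)
```

Because the mapping is one-to-one, word boundaries, word lengths, and word order all survive encoding, which is the limitation listed under CONS.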

 

In [3]:


# Read the data/birrig.csv file into a dataframe
df = pd.read_csv('../data/birrig.csv')
# Rename the first column to 'vref'
df.rename(columns={df.columns[0]: 'vref'}, inplace=True)
df.head()

Unnamed: 0,vref,book,chapter,verse,eng-web,eng-asv,eng-kjv2006,engBBE,hin2017,arbnav,latVUC,amo,source_content,birrig
0,GEN 1:1,GEN,1,1,"In the beginning, God created the heavens and ...",In the beginning God created the heavens and t...,In the beginning God created the heaven and th...,At the first God made the heaven and the earth.,आदि में परमेश्‍वर ने आकाश और पृथ्वी की सृष्टि ...,فِي الْبَدْءِ خَلَقَ اللهُ السَّمَاوَاتِ وَالأ...,In principio creavit Deus cælum et terram.,,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁ...,El lxi sovzl Guw newi lxi xiemir erw lxi ievlx.
1,GEN 1:2,GEN,1,2,The earth was formless and empty. Darkness was...,And the earth was waste and void; and darkness...,"And the earth was without form, and void; and ...",And the earth was waste and without form; and ...,"पृथ्वी बेडौल और सुनसान पड़ी थी, और गहरे जल के ...",وَإِذْ كَانَتِ الأَرْضُ مُشَوَّشَةً وَمُقْفِرَ...,"Terra autem erat inanis et vacua, et tenebræ e...",,וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖...,Erw lxi ievlx hez hezli erw holxual suvn; erw ...
2,GEN 1:3,GEN,1,3,"God said, “Let there be light,” and there was ...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","तब परमेश्‍वर ने कहा, “उजियाला हो*,” तो उजियाला...",أَمَرَ اللهُ: «لِيَكُنْ نُورٌ». فَصَارَ نُورٌ،,Dixitque Deus: Fiat lux. Et facta est lux.,,וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי...,"Erw Guw zeow, Pil lxivi fi pogxl: erw lxivi he..."
3,GEN 1:4,GEN,1,4,"God saw the light, and saw that it was good. G...","And God saw the light, that it was good: and G...","And God saw the light, that it was good: and G...","And God, looking on the light, saw that it was...",और परमेश्‍वर ने उजियाले को देखा कि अच्छा है*; ...,وَرَأَى اللهُ النُّورَ فَاسْتَحْسَنَهُ وَفَصَل...,Et vidit Deus lucem quod esset bona: et divisi...,,וַיַּ֧רְא אֱלֹהִ֛ים אֶת־ הָא֖וֹר כִּי־ ט֑וֹ...,"Erw Guw, puucorg ur lxi pogxl, zeh lxel ol hez..."
4,GEN 1:5,GEN,1,5,"God called the light “day”, and the darkness h...","And God called the light Day, and the darkness...","And God called the light Day, and the darkness...","Naming the light, Day, and the dark, Night. An...",और परमेश्‍वर ने उजियाले को दिन और अंधियारे को ...,وَسَمَّى اللهُ النُّورَ نَهَاراً، أَمَّا الظَّ...,"Appellavitque lucem Diem, et tenebras Noctem: ...",,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאוֹר֙ י֔וֹם וְלַחֹ...,"Renorg lxi pogxl, Wej, erw lxi wevc, Rogxl. Er..."


In [4]:
training_df = df[df['book'].isin(TRAINING_SOURCE)]
test_df = df[df['book'].isin(TEST_SOURCE)]
print(f"TRAIN SIZE: {len(training_df)} verses")
print(f"TEST SIZE: {len(test_df)} verses")

TRAIN SIZE: 678 verses
TEST SIZE: 1071 verses


# Word frequency analysis


In [5]:
# Word analysis
# Determine the most common words in the train data

def get_words(df, version='birrig', min_frequency=0, min_length=0, cipher_decode=True):
    """Return (word, decoded word, count) tuples, most frequent first."""
    word_count = defaultdict(int)
    for _, row in df.iterrows():
        for word in row[version].split():
            # Lowercase and strip common punctuation (other marks, such as ':', are kept)
            word = word.lower().strip('.,;!?')
            if len(word) >= min_length:
                word_count[word] += 1

    # Sort the words by frequency, most frequent first
    sorted_words = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
    # Decode each word with the substitution cipher only when cipher_decode is set
    all_words = [
        (word, substitution_cipher(word, encode=False).strip() if cipher_decode else word, count)
        for word, count in sorted_words
    ]

    # Keep only the words that appear more than min_frequency times
    return [word for word in all_words if word[2] > min_frequency]

words = get_words(training_df)
# convert words to dataframe
words_df = pd.DataFrame(words, columns=['word', 'cipher', 'frequency'])
test_words = get_words(test_df)
# convert words to dataframe
test_words_df = pd.DataFrame(test_words, columns=['word', 'cipher', 'frequency'])

# merge the two dataframes
words_df = words_df.merge(test_words_df, on='word', how='outer', suffixes=('_train', '_test'))
words_df


Unnamed: 0,word,cipher_train,frequency_train,cipher_test,frequency_test
0,erw,and,1197.0,and,1488.0
1,lxi,the,942.0,the,1601.0
2,lu,to,639.0,to,977.0
3,us,of,476.0,of,861.0
4,xi,he,397.0,he,414.0
...,...,...,...,...,...
1644,zruh:,,,snow:,1.0
1645,uvwiviw:,,,ordered:,1.0
1646,kavvirl,,,current,1.0
1647,qvizirl,,,present,1.0
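
Entries such as `zruh:` and `uvwiviw:` in the table above keep their trailing colon because the strip set in `get_words` is `'.,;!?'`. A possible variant, assuming it is acceptable to strip all ASCII punctuation from both ends of each token; `count_words` is a hypothetical helper and takes plain strings rather than a dataframe:

```python
import string
from collections import Counter


def count_words(verses, min_length=1):
    """Count lowercased tokens after stripping ASCII punctuation from both ends."""
    counts = Counter()
    for verse in verses:
        for token in verse.split():
            word = token.lower().strip(string.punctuation)
            if len(word) >= min_length:
                counts[word] += 1
    return counts


sample = ["Erw Guw zeow, Pil lxivi fi pogxl:", "erw lxivi hez pogxl."]
counts = count_words(sample)
print(counts.most_common(3))
```

Non-ASCII punctuation such as curly quotes is not part of `string.punctuation` and would need to be added to the strip set explicitly.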


In [6]:
# how many words are in training but not test
print(f"TRAINING ONLY WORDS: {len(words_df[words_df['frequency_test'].isna()])}")
print(f"TEST ONLY WORDS: {len(words_df[words_df['frequency_train'].isna()])}")
print(f"OVERLAPPING WORDS: {len(words_df[words_df['frequency_train'].notna() & words_df['frequency_test'].notna()])}")

TRAINING ONLY WORDS: 209
TEST ONLY WORDS: 487
OVERLAPPING WORDS: 953
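
The same three counts can be read off in one pass with the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column holding `left_only`, `right_only`, or `both`. A small sketch on made-up frames (the frequencies below are illustrative, not taken from the data):

```python
import pandas as pd

train = pd.DataFrame({"word": ["erw", "lxi", "guuw"], "frequency": [3, 2, 1]})
test = pd.DataFrame({"word": ["erw", "lxi", "zruh", "kavvirl"], "frequency": [5, 4, 1, 1]})

# indicator=True records which side(s) each word came from
merged = train.merge(test, on="word", how="outer",
                     suffixes=("_train", "_test"), indicator=True)
counts = merged["_merge"].value_counts()

print(f"TRAINING ONLY WORDS: {counts['left_only']}")
print(f"TEST ONLY WORDS: {counts['right_only']}")
print(f"OVERLAPPING WORDS: {counts['both']}")
```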


# GPT ANALYSIS


In [7]:
system_message = """
# Role
You are an expert linguist and polyglot.  You love to learn new languages.  You also have a doctorate in Theology and are fluent in the Biblical languages and the original word meanings.

# Task
You will be given a new language and you are to create a glossary for each word in that language.  Here are some guidelines:

 - Only add words you are fairly certain of
 - List the most certain words first
 - Only return words not present in the current glossary (given by the user, if any) AND words whose meanings you changed
 - Use the vref to look up the original language, NIV, NKJV, Amplified, Swahili, Arabic, Chinese, German and French translations to see if you can get more context for the word, but don't show those translations.
 
# Glossary format
Table Format

word: (string)
strongs: (string[]) array of strongs concordance numbers in order of likely meaning (max 3)
english: (string[]) array of english words in order of likely meaning (max 3)
grammer: (char[])  array of grammer codes in order of likely meaning (part of speech N for noun, Tense, number, gender, stem (which word it is derived from), etc)

# Input format
"""

input_format = """
## Current glossary (if any)
```
| word | strongs | english | grammer |
{glossary}
```

## Training Data
```
| Bible Verse Reference | Source Language | Target Language to Learn |
{verses}
```
"""
system_message += input_format

def format_verses(row, version='source_content'):
    return f"""
| {row['vref']} | {row[version]} | {row['birrig']} |
""".strip()

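Since the model is asked to reply in the same table layout, its reply has to be parsed before it can be fed back in as `{glossary}`. A minimal sketch, assuming the reply is a pipe-delimited table with the four columns above and that multi-valued cells are comma-separated (the prompt does not pin down a separator, so that part is a guess). `parse_glossary_table` and `format_glossary_rows` are hypothetical helpers:

```python
COLUMNS = ["word", "strongs", "english", "grammer"]


def parse_glossary_table(text):
    """Parse a pipe-delimited glossary table into dicts, skipping header and rule rows."""
    entries = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != len(COLUMNS):
            continue  # blank, truncated, or malformed row
        if cells == COLUMNS or all(set(cell) <= set("-: ") for cell in cells):
            continue  # header row or |---|---| separator row
        word, strongs, english, grammer = cells
        entries.append({
            "word": word,
            "strongs": [s.strip() for s in strongs.split(",") if s.strip()],
            "english": [e.strip() for e in english.split(",") if e.strip()],
            "grammer": [g.strip() for g in grammer.split(",") if g.strip()],
        })
    return entries


def format_glossary_rows(entries):
    """Render entries as table rows suitable for the {glossary} placeholder."""
    return "\n".join(
        f"| {e['word']} | {', '.join(e['strongs'])} | {', '.join(e['english'])} | {', '.join(e['grammer'])} |"
        for e in entries
    )


reply = """
| word | strongs | english | grammer |
|------|---------|---------|---------|
| Guw  | G2316   | God     | N       |
| Zur  | G5207   | Son     | N       |
| puvw  | G2962
"""
entries = parse_glossary_table(reply)
print(format_glossary_rows(entries))
```

Rows cut off by the `max_tokens` limit, like the last one in the sample reply, are skipped rather than stored half-filled.
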
In [8]:


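def call_gpt(system_message, content, max_retries=3):
    """Send one chat request; retry on rate limits and return None on failure."""
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                engine=GPT_VERSION,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": content}
                ],
                temperature=0.01,
                max_tokens=TOKENS_RESERVED_FOR_NEW_GRAMMER,
            )
            return response.get('choices', [{}])[0].get('message', {'content': ''}).get('content', '')

        except RateLimitError as e:
            # Back off, then try again on the next loop iteration
            print(f"Rate Limit Error: {e}")
            time.sleep(30)

        except OpenAIError as e:
            print(f"OpenAI Error: {e}")
            return None

    # Still rate limited after max_retries attempts
    return None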
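
Filling a prompt up to a token budget, as the next cell does for the first batch, generalizes to splitting all verses into batches. A sketch under stated assumptions: `batch_by_tokens` is a hypothetical helper, `count_tokens` is any callable that returns a token count (in this notebook it could be `lambda s: len(tokenizer.encode(s))`), and the whitespace-based default is only a stand-in so the example runs without `tiktoken`. The row texts are placeholders:

```python
def batch_by_tokens(lines, budget, count_tokens=lambda s: len(s.split())):
    """Yield lists of lines whose combined token count stays within `budget`.

    A single line larger than the budget is yielded alone rather than dropped.
    """
    batch, used = [], 0
    for line in lines:
        cost = count_tokens(line) + 1  # +1 as a rough allowance for the joining newline
        if batch and used + cost > budget:
            yield batch
            batch, used = [], 0
        batch.append(line)
        used += cost
    if batch:
        yield batch


rows = [
    "| MRK 1:1 | source text one | target text one |",
    "| MRK 1:2 | source text two | target text two |",
    "| MRK 1:3 | source text three | target text three |",
]
batches = list(batch_by_tokens(rows, budget=30))
for batch in batches:
    print(len(batch), "rows")
```

Each yielded batch can be joined with `"\n"` and passed as `verses` to `input_format.format(...)`, with the glossary from earlier batches passed as `glossary`.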

In [9]:
# Every message follows <im_start>{role/name}\n{content}<im_end>\n
tokenizer = tiktoken.encoding_for_model('gpt-4')
tokens_left = 32000
tokens_left -= TOKENS_RESERVED_FOR_NEW_GRAMMER
tokens_left -= len(tokenizer.encode(system_message))-4
tokens_left -= len(tokenizer.encode(input_format))-4
frozen_tokens_left = tokens_left  # budget available for verses in each batch
verses = ""

for index, row in training_df.iterrows():
    verse = format_verses(row, 'source_content')
    tokens_left -= len(tokenizer.encode(verse))-1   # -1 for the newline
    if tokens_left < 0:
        # Budget used up: send the verses collected so far as one batch
        content = input_format.format(glossary='', verses=verses)
        print(call_gpt(system_message, content))
        verses = ""
        tokens_left = frozen_tokens_left  # reset the budget for the next batch
        break  # only the first batch is sent for now
    verses += verse + "\n"

23625
| word | strongs | english | grammer |
|------|---------|---------|---------|
| Lxi  | G3588   | the     | N       |
| sovzl | G746    | beginning | N       |
| huvwz | G3056   | word     | N       |
| us    | G3588   | of       | N       |
| guuw  | G2098   | gospel   | N       |
| rihz  | G5547   | Christ   | N       |
| Yizaz | G2424   | Jesus    | N       |
| Kxvozl | G5547  | Christ   | N       |
| Zur   | G5207   | Son      | N       |
| Guw   | G2316   | God      | N       |
| imir  | G2531   | as       | N       |
| ez    | G1722   | in       | N       |
| ol    | G3753   | when     | N       |
| or    | G1909   | on       | N       |
| zii   | G5101   | who      | N       |
| wu    | G4160   | do       | N       |
| uri   | G1415   | able     | N       |
| suv   | G1487   | if       | N       |
| fal   | G1161   | but      | N       |
| xi    | G3778   | this     | N       |
| zeni  | G2962   | Lord     | N       |
| oz    | G2076   | is       | N       |
| puvw  | G2962