# Align Words across all languages

In theory AI supported language translation has 3 parts to it (this notebook dealing with part 1) (at least from a big data perspective)
 1. Align all words across all languages so Jesus is Yeshua is Iesu.  All languages have similar word embeddings.
 1. Pack all language nuance, commentaries, sermons, translation notes into an embedding representing a verse.
 1. Auto-discover language family relationships (as all words come from somewhere) to learn from other related grammers and stemming etc.




## A Novel Approach
If we consider the case of a concordance, let's say we wanted to know what "love" means.  We would thus search the Bible for all verses with love in it.  After reading it we get an impression of what the Bible says/means about love.  

Now imagine if we took all those verses in a standarized language that AI understood well (in this case English b/c chatGPT is strongest at that language) and we asked GPT, what is love based on these verses?  In fact we can do that, AI has the concept of a sentence embedding (but sentence can be any text up to 8000 tokens (aprox 6000 words)).  That is it will represent the meaning of all those verses together in a single vector (a list of numbers).

If we did this for a second language by looking up every verse reference each word of that language appears in then swapping out the references for the english text (since GPT knows English best) we would get a similar meaning.  Of course in Greek we have Eros, Philia, Agape, etc.  In this case the Greek work Agape would only return the verses using that form of love.  Thus the Greek would have it's own unique embedding for Agape but if you mapped all the words on a graph you should see Greek word Agape sharing similar numbers in space with the English word Love.  This is the goal of this notebook.  That is to get all numbers for similar words to be similar.

The alternate approach is you have to know what many words are in each language then get the AI to make them similar in number.  This approach means we do not need to know anything about the language, it is purely stastical and it is leveraging the unique ability that the Bible is aligned in thought verse by verse with the same thoughts shared across all languages for the same verses.

Essentially I believe all languages can be reverse engineered.  Further I think it is like a jigsaw puzzle where if you find the easier pieces first (distinct concepts, proper nouns) then the rest of the puzzle becomes easier to solve.  Much like doing the edges first then getting down to the hard pieces last who have but one place to go.





## Technical Issues

languages like Arabic (so I'm told) where words like "you" are embedded in the verb itself.  Thus the verb then is not one word but changes based on the pronoun.  

And in Chinese it is based on characters (so I understand).

And of course there is idioms and words that do not translate to another word.

But we do not want word to word translation.  What we want is a concept in one language is "close" to the same concept in another language.  This is represented in vector space.  That is a series of numbers like [1,1,2,1,1,1] is similar to [1,1,2.31,1,1,1] but not to [2,2,1,1,1,2]

We want to do this automatically and statistically and this is a well researched technique.  In short we feed an entire corpus like the Bible into a neural network and it learns the vector space.  Then we can take a word like "green" and find the closest word like "blue" or "yellow".  The question remains how to get green to mean "vert" (thus aligning it across languages).  We have a unique advantage in that the Bible is chunked into the same verses (essentially but statistically consistently enough).  So we can "align" all languages to the same in a unique way.

WHY?  If all our words (or more accurately concepts) are aligned with each other than when we say "For God so loved the world" we get the same meaning in all languages.  This is the first step to AI supported language.  Without this alignment then we are training each language by itself and are unable to leverage the big data findings we start to get from language families sharing a grammer and nuances.

### More Technical Issues: Small common words with little meaning on own.

Because we are grabbing the entire verse there is more meaning in that verse than the word we want.  In the case of common words with little distinct meaning on their own like "the", "his", "you" these will get lost.  I have tried to remediate this by giving them many verses to match with.  Presumably the other language will also have the same verses but in reality many languages use different conjunctions/stems with these words and so they may not align on the same verses.  

A potential solution is to consider these some of the inner pieces of a jigsaw puzzle.  We could first resolve the easier words (which we could determine statistically by the nearness to other words) then use the approach in the book _create_glossary) but translate all the known words first, thus leaving the small words which will now have a context of how they are used.



## Potential Issues
 - Ambiguity: Many words have multiple meanings, and their usage may differ even within the same document based on the context. It might be challenging to align these types of words across languages effectively.
 - Non-Biblical Texts: The Bible, while extensive, is a specific type of text with its own set of linguistic features and themes. How well the method generalizes to other kinds of texts is an open question.
 - Linguistic Drift: Languages evolve over time, and what might be a perfect alignment today may not hold in the future.
 - If there are minimum verses and usally appear in the same phrase then it may be bound closer to that phrase ex. "you have heard that it was said but I say to you" may map "it was said" to "I say to you"

## A Novel way to Align Languages

Since all Bible verses are aligned by verses we can use that alignment to extract a meaning.  To do this we will follow these steps

1. Count the occurances of all words in a language (I will show later we will actualy count ngrams to capture meanings but more on this in a moment)
1. Starting with the most common word search for all verses that contain that word
1. Look up all those verses in a common language (not the source language as openAI sentence embeddings are not aligned the same for each verse across languages - see universal_embeddings.ipynb), it is the common language that will give us a shared meaning of a word across all languages
1. Generate a sentence embedding that covers all the verses and thus embeds all the meaning of those verses (this will not give us the meaning of the word but the meaning of that group of verses will be shared with all other languages even when partial as in the case of "the" in greek having 16 variations, thus it will be similar but not the same which is good as each langauge has slighly different nuances per word)

A potential extension of this work
1. Take FastText and all Bibles across all languages train new embeddings
1. Using the embeddings above as pre-trained embeddings
1. Thus deepening the meaning of each word to actually be the words meaning for that language but with a starting point of it's shared meaning as used in verses across all languages.




In [126]:
# Setup some defaults
TRAINING_SOURCE = ['MAT',
 'MRK',
 'LUK',
 'JHN',
 'ACT',
 'ROM',
 '1CO',
 '2CO',
 'GAL',
 'EPH',
 'PHP',
 'COL',
 '1TH',
 '2TH',
 '1TI',
 '2TI',
 'TIT',
 'PHM',
 'HEB',
 'JAS',
 '1PE',
 '2PE',
 '1JN',
 '2JN',
 '3JN',
 'JUD',
 'REV'
 ]
VERSIONS = ['eng-web','eng-asv','eng-kjv2006','engBBE','hin2017', 'arbnav', 'latVUC', 'amo']
COMMON_VERSION = 'eng-asv'
MAX_VERSES = 50
MODEL = 'text-embedding-ada-002'


In [127]:
# Imports and fixes to pathing
%reload_ext autoreload
import sys
sys.path.append('../lib')

import pandas as pd
import tiktoken
import openai, time, os
from openai.error import RateLimitError, OpenAIError
from config import get_config
from collections import defaultdict
from cipher import substitution_cipher

# openai.api_type = os.environ["OPENAI_API_TYPE"] = get_config('openai')['api_type']
# openai.api_base = os.environ["OPENAI_API_BASE"] = get_config('openai')['api_base']
# openai.api_key = os.environ["OPENAI_API_KEY"] = get_config('openai')['api_key']
# openai.api_version = os.environ["OPENAI_API_VERSION"] = get_config('openai')['api_version']
openai.api_key = os.environ["OPENAI_API_KEY"] = get_config('embeddings')['api_key']



# Get our Bibles

https://github.com/ChrisPriebe/BibleTranslation/blob/master/get_bible.ipynb

In [128]:
#TODO: remplace this with download from ecorpus and prepare it

# Read the data/berrig.csv file into dataframe
df = pd.read_csv('../data/birrig.csv')
df.rename(columns={df.columns[0]: 'vref'}, inplace=True)
df.head()

Unnamed: 0,vref,book,chapter,verse,eng-web,eng-asv,eng-kjv2006,engBBE,hin2017,arbnav,latVUC,amo,source_content,birrig
0,GEN 1:1,GEN,1,1,"In the beginning, God created the heavens and ...",In the beginning God created the heavens and t...,In the beginning God created the heaven and th...,At the first God made the heaven and the earth.,आदि में परमेश्‍वर ने आकाश और पृथ्वी की सृष्टि ...,فِي الْبَدْءِ خَلَقَ اللهُ السَّمَاوَاتِ وَالأ...,In principio creavit Deus cælum et terram.,,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁ...,El lxi sovzl Guw newi lxi xiemir erw lxi ievlx.
1,GEN 1:2,GEN,1,2,The earth was formless and empty. Darkness was...,And the earth was waste and void; and darkness...,"And the earth was without form, and void; and ...",And the earth was waste and without form; and ...,"पृथ्वी बेडौल और सुनसान पड़ी थी, और गहरे जल के ...",وَإِذْ كَانَتِ الأَرْضُ مُشَوَّشَةً وَمُقْفِرَ...,"Terra autem erat inanis et vacua, et tenebræ e...",,וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖...,Erw lxi ievlx hez hezli erw holxual suvn; erw ...
2,GEN 1:3,GEN,1,3,"God said, “Let there be light,” and there was ...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","And God said, Let there be light: and there wa...","तब परमेश्‍वर ने कहा, “उजियाला हो*,” तो उजियाला...",أَمَرَ اللهُ: «لِيَكُنْ نُورٌ». فَصَارَ نُورٌ،,Dixitque Deus: Fiat lux. Et facta est lux.,,וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי...,"Erw Guw zeow, Pil lxivi fi pogxl: erw lxivi he..."
3,GEN 1:4,GEN,1,4,"God saw the light, and saw that it was good. G...","And God saw the light, that it was good: and G...","And God saw the light, that it was good: and G...","And God, looking on the light, saw that it was...",और परमेश्‍वर ने उजियाले को देखा कि अच्छा है*; ...,وَرَأَى اللهُ النُّورَ فَاسْتَحْسَنَهُ وَفَصَل...,Et vidit Deus lucem quod esset bona: et divisi...,,וַיַּ֧רְא אֱלֹהִ֛ים אֶת־ הָא֖וֹר כִּי־ ט֑וֹ...,"Erw Guw, puucorg ur lxi pogxl, zeh lxel ol hez..."
4,GEN 1:5,GEN,1,5,"God called the light “day”, and the darkness h...","And God called the light Day, and the darkness...","And God called the light Day, and the darkness...","Naming the light, Day, and the dark, Night. An...",और परमेश्‍वर ने उजियाले को दिन और अंधियारे को ...,وَسَمَّى اللهُ النُّورَ نَهَاراً، أَمَّا الظَّ...,"Appellavitque lucem Diem, et tenebras Noctem: ...",,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאוֹר֙ י֔וֹם וְלַחֹ...,"Renorg lxi pogxl, Wej, erw lxi wevc, Rogxl. Er..."


In [129]:
training_df = df[df['book'].isin(TRAINING_SOURCE)]
print(f"TRAIN SIZE: {len(training_df)} verses")


TRAIN SIZE: 7957 verses


In [None]:
def normalize_text(text):
        # Remove punc and lowercase all words
        # TODO: for more languages you can use unicode base tools to look at the type of char it is
        # and if type of punc then skip it.
        return " ".join([ word.lower().strip('.,;!?[]{}()\'"\\') for word in text.split()])


def get_ngrams(df, column_name, max_ngarms_length=10, trim_bigrams=5, normalize=False):
    ngrams = defaultdict(int)
    for index, row in df.iterrows():
        text = normalize_text and normalize_text(row[column_name]) or row[column_name]
        words = text.split()
        for i, word in enumerate(words):
            for j in range(0, max_ngarms_length):
                if i+j < len(words):
                    ngrams[" ".join(words[i:i+j+1])] += 1

    # Remove any ngrams that have only one count
    if trim_bigrams:
        ngrams = {ngram: count for ngram, count in ngrams.items() if count >= trim_bigrams or " " not in ngram}

    # Sort the ngrams by frequency
    sorted_ngrams = sorted(ngrams.items(), key=lambda item: item[1], reverse=True)
    return sorted_ngrams

def get_embedding(inputs, words_df):
    for i in range(5):
        try:
            result = openai.Embedding.create(input=[input for (_, input) in inputs], model=MODEL)['data']
            for i, (word, _) in enumerate(inputs):
                embeddings.loc[len(embeddings)] = [word['word'], result[i]['embedding']]
            return
        
        except RateLimitError as e:
            print(f"Rate Limit Error: {e}")
            time.sleep(30)
            continue

        except Exception as e:
            if "502 Bad Gateway" in e.message:
                print("Bad Gateway Error")
                time.sleep(30)
                continue

            print("Error", e)
            return

tokenizer = tiktoken.get_encoding('cl100k_base')

for VERSION in VERSIONS:
    print("Processing", VERSION)
    # remove nan from training_df[VERSION]
    df = training_df[training_df[VERSION].notna()]
    df = df[df[COMMON_VERSION].notna()]
    df.loc[:, 'normalized'] = df[VERSION].apply(normalize_text)

    words = get_ngrams(df, 'normalized')
    # convert words to dataframe
    words_df = pd.DataFrame(words, columns=['word', 'frequency'])
    words_df['grams'] = words_df['word'].apply(lambda x: len(x.split()))
    # limit to words that appear at least 3 times
    words_df = words_df[words_df['frequency'] > 2]

    print("Total Words", len(words_df))

    current_tokens = 0
    inputs = []
    embeddings = pd.DataFrame(columns=['word', 'embeddings'])
    total_tokens = 0
    total = 0
    
    # Create our sentence embedding
    for _, word in words_df.iterrows():
        # Find all the occurrences of the word in the training data
        verses = df[df['normalized'].str.contains(word['word'])][['vref', COMMON_VERSION]]

        # Create the input for the embedding
        keep = len(verses) > MAX_VERSES and MAX_VERSES / len(verses) or 1.0
        while input is None or len(tokenizer.encode(input)) > 7000:
            # Cut the input by 20% until it's under 8000 tokens
            #input = '\n'.join([f"{verse['vref']}\t{verse[COMMON_VERSION]}" for _, verse in verses.sample(frac=keep, random_state=1).iterrows()])
            input = '\n'.join([verse[COMMON_VERSION] for _, verse in verses.sample(frac=keep, random_state=1).iterrows()])
            keep = keep * 0.8

        current_tokens += len(tokenizer.encode(input))
        if current_tokens > 7000:
            total % 100 == 0 and print(f"Getting Embeddings\t {total} of {len(words_df)} {total/len(words_df)}%")
            get_embedding(inputs, words_df)
            total_tokens += current_tokens
            current_tokens = len(tokenizer.encode(input))
            inputs = []

    
        inputs.append((word, input))
        total += 1
        input = None

    # One last call to finish it off
    get_embedding(inputs, words_df)
    total_tokens += current_tokens

    print(f"DONE {VERSION}\ttotal tokens {total_tokens}\tAprox Cost ${total_tokens*0.0000001}")

    words_df['version'] = VERSION
    # join the embeddings to the words_df on word
    df2 = words_df.merge(embeddings, on='word', how='left')
    df2.to_json(f'../data/{VERSION}_embeddings.jsonl', orient='records', lines=True)
    

In [149]:
# TODO 
# compare embeddings to see which are matched
# test with other languages (like the greek the issue where has many forms)

# load the embeddings and concat them
df = pd.concat([
    pd.read_json('../data/eng-web_embeddings.jsonl', lines=True),\
    #pd.read_json('../data/birrig_embeddings.jsonl', lines=True),\
    pd.read_json('../data/engBBE_embeddings.jsonl', lines=True), \
    pd.read_json('../data/eng-asv_embeddings.jsonl', lines=True), \
    pd.read_json('../data/hin2017_embeddings.jsonl', lines=True), \
    pd.read_json('../data/latVUC_embeddings.jsonl', lines=True), \
    pd.read_json('../data/arbnav_embeddings.jsonl', lines=True), \
    pd.read_json('../data/source_content_embeddings.jsonl', lines=True), \
])
len(df)

68466

In [151]:
# Drop the column embedding if it exists
df.drop(columns=['embedding'], inplace=True, errors='ignore')

# drop rows with no value in embeddings
df.dropna(inplace=True)

In [137]:
import numpy as np
from openai.embeddings_utils import cosine_similarity
# convert the DataFrame to a matrix

for _, word in df[1000:1010].iterrows():
    print(f"Processing '{word['word']}' from {word['version']} which is most similar to")
    # Remove this verse so we don't get ourselves
    df['similarities'] = df.embeddings.apply(lambda x: cosine_similarity(x, word['embeddings']))
    print(df.sort_values('similarities', ascending=False)[['word','version', 'grams']].head(10))


Processing 'amen' from eng-web which is most similar to
                    word  version  grams
1000                amen  eng-web      1
1987               be it   engBBE      2
12472               आमीन  hin2017      1
2259         be with you  eng-asv      3
1885               so be   engBBE      2
1496               आमीन।  hin2017      1
2925   مَعَكُمْ جَمِيعاً   arbnav      2
11865  with you all amen  eng-web      4
11866       you all amen  eng-web      3
11867           all amen  eng-web      2
Processing 'of life' from eng-web which is most similar to
                   word  version  grams
1001            of life  eng-web      2
1061            of life  eng-asv      2
1564           i am the  eng-asv      3
4716          “i am the  eng-web      3
2637       believeth on  eng-asv      2
4412   who has faith in   engBBE      4
2503           the life  eng-asv      2
3390     मुझ पर विश्वास  hin2017      3
4404  that believeth on  eng-asv      3
4302       जीवन के लिये  hin2017  

In [139]:
# Test some specific words, created with a GPT prompt
tests = {
  "love": {
    "Arabic": ["حب"],
    "Latin": ["amor"],
    "Hindi": ["प्रेम", "मोहब्बत"],
    "Greek": ["ἀγάπη", "έρως"]
  },
  "peace": {
    "Arabic": ["سلام"],
    "Latin": ["pax"],
    "Hindi": ["शांति", "अमन"],
    "Greek": ["εἰρήνη"]
  },
  "Jesus": {
    "Arabic": ["يسوع"],
    "Latin": ["Iesus"],
    "Hindi": ["ईसा"],
    "Greek": ["Ἰησοῦς"]
  },
  "Moses": {
    "Arabic": ["موسى"],
    "Latin": ["Moses"],
    "Hindi": ["मूसा"],
    "Greek": ["Μωυσῆς"]
  },
  "meek": {
    "Arabic": ["وديع", "خاضع"],
    "Latin": ["mitis"],
    "Hindi": ["विनम्र", "कोमल"],
    "Greek": ["πρᾳός"]
  },
  "slave": {
    "Arabic": ["عبد", "رقيق"],
    "Latin": ["servus"],
    "Hindi": ["गुलाम", "दास"],
    "Greek": ["δοῦλος"]
  },
  "the": {
    "Arabic": ["ال"],
    "Latin": ["ille", "illa", "illud"],
    "Hindi": ["यह", "वह"],
    "Greek": ["ὁ", "ἡ", "τό"]
  },
  "slave to sin": {
    "Arabic": ["عبد للخطيئة"],
    "Latin": ["servus peccati"],
    "Hindi": ["पाप का गुलाम"],
    "Greek": ["δοῦλος τῆς ἁμαρτίας"]
  },
  "he said": {
    "Arabic": ["قال"],
    "Latin": ["dixit"],
    "Hindi": ["उसने कहा"],
    "Greek": ["εἶπεν"]
  },
  "then I saw": {
    "Arabic": ["ثم رأيت"],
    "Latin": ["tum vidi"],
    "Hindi": ["फिर मैंने देखा"],
    "Greek": ["τότε εἶδον"]
  }
}


In [154]:
# Loop through each word in tests and for each language look up that word and compare the embeddings of each
for word, translations in tests.items():
    # Find the word in the dataframe
    word = normalize_text(word)
    word_df = df[df['word'] == word]
    if len(word_df) == 0:
        print(f"{word} not found in dataframe")
        continue
    word = word_df.iloc[0]
    print(f"Testing {word['word']} from {word['version']} that appeared {word['frequency']} times")

    for language, translations in translations.items():
        for translation in translations:
            # Find the word in the dataframe
            translation = normalize_text(translation)
            df2 = df[df['word'] == translation]
            if len(df2) == 0:
                print(f"  {language}\t{translation} not found")
                continue

            # Get the cosine similarity between word and translation
            similarity = cosine_similarity(word['embeddings'], df2.iloc[0]['embeddings'])
            print(f"  {language}\t{translation}\t{similarity}")
        
    df['similarities'] = df.embeddings.apply(lambda x: cosine_similarity(x, word['embeddings']))
    print(df.sort_values('similarities', ascending=False)[['word','version', 'similarities']].head(10))


Testing love from eng-web that appeared 205 times
  Arabic	حب not found
  Latin	amor not found
  Hindi	प्रेम	0.9248584900873895
  Hindi	मोहब्बत not found
  Greek	ἀγάπη not found
  Greek	έρως not found
                         word  version  similarities
156                      love  eng-web      1.000000
878                     loved  eng-web      0.945976
806             الأَحِبَّاءُ،   arbnav      0.945891
805    أَيُّهَا الأَحِبَّاءُ،   arbnav      0.945891
2281                  प्रियों  hin2017      0.945433
1037                    loved  eng-asv      0.941613
2719                 my loved   engBBE      0.938760
5560                تُحِبُّوا   arbnav      0.937771
1722                  for one   engBBE      0.937285
11915                हे प्रिय  hin2017      0.937039
Testing peace from eng-web that appeared 86 times
  Arabic	سلام not found
  Latin	pax not found
  Hindi	शांति not found
  Hindi	अमन not found
  Greek	εἰρήνη not found
                word  version  similarities
406  

# TODO

 - [ ] try sorting matching verses by length of verse (shortest first) to see if that helps
 - [/] confirm sample with random_state keeps randomizer current
 - [ ] try matching more verses
 - [ ] try different versions like amplified which has more words
 - [ ] Would fasttext sentence embeddings be better starting with a an already 