# Few-shot translation with a language model

- [ ] Zip together the source and target sentences into a single list of tuples.
- [ ] Split the data into train and test sets.
- [ ] Store the sentences in a Chroma vectorstore.
- [ ] Move through the test sentences one at a time.
- [ ] Use LangChain to dynamically retrieve examples similar to the current sentence.
- [ ] Use an OpenAI LLM to draft new translations of the test sentences from those examples.
- [ ] Backtranslate the results to the source language without seeing the original translations.
- [ ] Compare the resulting target translation and back-translated source to the original sentence pair.
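
The first two steps of this plan can be sketched with the standard library alone. The sentences below are made-up placeholders, and `split_pairs` is a hypothetical helper introduced only for illustration (the notebook itself uses scikit-learn's `train_test_split` further down).

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle a copy of the pairs and split off a test set (hypothetical helper)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data, not real corpus sentences
source_sentences = [f"source sentence {i}" for i in range(10)]
target_sentences = [f"target sentence {i}" for i in range(10)]

# Zip source and target into a single list of (source, target) tuples
pairs = list(zip(source_sentences, target_sentences))
train_pairs, test_pairs = split_pairs(pairs)
print(len(train_pairs), len(test_pairs))
```

Fixing the seed keeps the split stable between runs, which makes later comparisons of translations easier to repeat.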

In [2]:
!pip install kor

Collecting kor
  Using cached kor-0.9.2-py3-none-any.whl (28 kB)
Installing collected packages: kor
Successfully installed kor-0.9.2


In [1]:
# Import required libraries
import openai
from kor import create_extraction_chain, Object, Text
from langchain.chat_models import ChatOpenAI
from sklearn.model_selection import train_test_split
from langchain.vectorstores import Chroma
import os
import json

In [24]:
import getpass
secret_key = getpass.getpass('Enter OpenAI secret key: ')
os.environ['OPENAI_API_KEY'] = secret_key

## Scrape some target language data

In [6]:
!pip install bs4

Collecting bs4
  Using cached bs4-0.0.1-py3-none-any.whl
Collecting beautifulsoup4 (from bs4)
  Using cached beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Using cached soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.12.2 bs4-0.0.1 soupsieve-2.4.1


In [15]:
import requests
from bs4 import BeautifulSoup
import json

def scrape_url(url, chapter_number):
    # Send a get request to the webpage
    response = requests.get(url)

    # Parse the response text with BeautifulSoup using the html.parser
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the 'chapter justify' div
    chapter_div = soup.find('div', class_='chapter justify')

    # chapter looks like """<div class="chapter justify"><span class="drop-caps">1</span>"""
    chapter = chapter_div.find('span', class_='drop-caps').text
    
    # Find all span elements with the class 'align-left' in the div
    verse_spans = chapter_div.find_all('span', class_='align-left')

    # Extract chapter, verse, and text from each span and store in a dictionary
    verse_dicts = []
    for span in verse_spans:
        verse = span['data-verseid']
        text = span.find('span', {'data-verseid': verse}).text
        verse_dicts.append({'ch': str(chapter_number), 'v': verse, 'text': text})
        print(chapter_number, verse, text)
    # Convert list of dictionaries to json
    # verse_json = json.dumps(verse_dicts)
    
    return verse_dicts

# Replace this with your actual URL
url = "http://live.bible.is/bible/ABMTSC/LUK/"

target_verses = []
# Luke has 24 chapters
for i in range(1, 25):
    # print(scrape_url(url + str(i)))
    # break
    target_verses.append(scrape_url(url + str(i), i))

print(len(target_verses))

1 1 Tiyɔfilɔs agbem anɛ agba nɔg ayɛlɛ nsɔl yi ɛlɔ lem na agira.
1 2 Ɛyog nsɔl yifa na anyɔ anɛ ba akɔlɔ amir abɔ ayɛn nsɔl yi ɛlemɔ na ɛkuma nub na anɛ ba abongɔ alom Ɔsɔwɔ.
1 3 Mɛ ngba nar biji nkpele nsɔl fɛb na ɛkuma nub ntamɛ ntu ɛlama ndɛ mɛ nyɛlɛ nsɔl yifɔ jangjang mɔng ɛjɔlɔ Tiyɔfilɔs.
1 4 Mɛ nem jifa na ji wɔ ɔla kun ɛtete ɔbangɛ kpakpa nsɔl yi alɔ wa teb.
1 5 Ngara yi Hɛrɔd ajɔlɔ ntol na Judiya ɛbɛl nemanjɔm Ɔsɔwɔ yɔ ajɔl wɛ kog arɛ Sakariya yɔ ajogɔ na ɔgu bi balemanjɔm Ɔsɔwɔ bi Abija. Nkal ɔwɛ Ɛlisabɛt ajog ɔgu bi Ɛrɔng.
1 6 Abɔ anɛ abalabal ajɔl ateng anɛ na libri Ɔsɔwɔ afɛng ajɔ bom ajing ya Tata na ajigi ayɛ.
1 7 Abɔ abɔn agbɛlam, ateb arɛ Ɛlisabɛt amalam ayibi ɔla. Abɔ anɛ abal abal agba kol.
1 8 Ɔfo brang ɛbɔngɛ balemanjɔm Ɔsɔwɔ ji Sakariya ajɔl na litom, ajɔl jege Ɔsɔwɔ mɔng nemanjɔm.
1 9 Mɔng ligbɛna li balemanjɔm Ɔsɔwɔ, atob lifanga ayege Sakariya arɛ asɔng na njo Ɔsɔwɔ ɔfo bifɔ aji sɔ nsɔl yi ɛkpenɔ ɛteng ɛlu akarɛ Ɔsɔwɔ.
1 10 Anɛ ajɔl kag ɔrɔ na ɛtegla na ji ajɔl 

In [4]:
target_verses[0]

{'ch': 1,
 'v': 1,
 'text': 'Tiyɔfilɔs agbem anɛ agba nɔg ayɛlɛ nsɔl yi ɛlɔ lem na agira.'}

In [3]:
# Sample target data from file
with open('target_verses.txt', 'r') as f:
    target_verses = json.loads(f.read())
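
The scraping cell collects one list per chapter, whereas the file loaded above holds a flat list of verse dictionaries. A minimal sketch of normalising both shapes and building a lookup keyed on (chapter, verse) might look like this; `flatten_verses` and `index_verses` are hypothetical helpers, and the sample records are placeholders rather than real verses.

```python
def flatten_verses(verses):
    """Accept either a flat list of verse dicts or a list of per-chapter lists."""
    flat = []
    for item in verses:
        if isinstance(item, dict):
            flat.append(item)
        else:
            flat.extend(item)
    return flat

def index_verses(verses):
    """Map (chapter, verse) string pairs to the verse text."""
    return {(str(v['ch']), str(v['v'])): v['text'] for v in flatten_verses(verses)}

# Placeholder records in the nested, per-chapter shape
nested = [
    [{'ch': '1', 'v': '1', 'text': 'placeholder a'}],
    [{'ch': 2, 'v': 1, 'text': 'placeholder b'}],
]
verse_index = index_verses(nested)
print(verse_index.get(('2', '1')))
```

A dictionary lookup of this kind avoids scanning every verse each time a source verse needs its target text.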

## Get source sentences

In [5]:
import requests, json, re, os
import pandas as pd

def download_file(url, file_name):
    response = requests.get(url)
    with open(file_name, "wb") as file:
        file.write(response.content)

# file1_url = "https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv"
file1_url = "https://github.com/Clear-Bible/macula-greek/raw/feature/add-sentence-id-to-tsv/Nestle1904/TSV/macula-greek.tsv" # PR version with sentence IDs
file1_name = "macula-greek.tsv"

if file1_name not in os.listdir():
    download_file(file1_url, file1_name)


In [6]:
# Import Macula Greek data
mg = pd.read_csv(
    "macula-greek.tsv", index_col="xml:id", sep="\t", header=0, converters={"*": str}
).fillna("missing")
# add an 'id' column
mg["id"] = mg.index

# mg['domain'] = mg['domain'].astype(str).fillna('missing')

# Extract book, chapter, and verse into separate columns
mg[["book", "chapter", "verse"]] = mg["ref"].str.extract(r"(\d?[A-Z]+)\s(\d+):(\d+)")

# Add columns for book + chapter, and book + chapter + verse for easier grouping
mg["book_chapter"] = mg["book"] + " " + mg["chapter"].astype(str)
mg["book_chapter_verse"] = mg["book_chapter"] + ":" + mg["verse"].astype(str)

# note that 'Luke' == 'LUK'
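
To see what the `str.extract` pattern above captures, the same regular expression can be tried on single reference strings with the standard `re` module. `parse_ref` is a hypothetical helper, and the `1CO` reference is constructed by analogy with the `LUK` references in the data.

```python
import re

# Same pattern as in the str.extract call above
REF_PATTERN = re.compile(r"(\d?[A-Z]+)\s(\d+):(\d+)")

def parse_ref(ref):
    """Return (book, chapter, verse) as strings, or None if the ref does not match."""
    match = REF_PATTERN.search(ref)
    if match is None:
        return None
    return match.groups()

print(parse_ref("LUK 1:1!1"))
print(parse_ref("1CO 10:1!3"))
```

The optional leading digit in the first group is what lets book codes such as `1CO` match; the word position after `!` is ignored.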

In [7]:
# Extract sentences from Luke
luke_sentences = mg[mg['book'] == 'LUK']

In [8]:
luke_sentences

xml:id,ref,role,class,type,gloss,text,after,lemma,normalized,strong,...,ln,frame,subjref,referent,id,book,chapter,verse,book_chapter,book_chapter_verse
n42001001001,LUK 1:1!1,missing,conj,missing,Inasmuch as,Ἐπειδήπερ,,ἐπειδήπερ,Ἐπειδήπερ,1895,...,89.32,missing,missing,missing,n42001001001,LUK,1,1,LUK 1,LUK 1:1
n42001001002,LUK 1:1!2,s,adj,missing,many,πολλοὶ,,πολύς,πολλοί,4183,...,59.1,missing,missing,missing,n42001001002,LUK,1,1,LUK 1,LUK 1:1
n42001001003,LUK 1:1!3,v,verb,missing,have undertaken,ἐπεχείρησαν,,ἐπιχειρέω,ἐπεχείρησαν,2021,...,68.59,A0:n42001001002 A1:n42001001004,missing,missing,n42001001003,LUK,1,1,LUK 1,LUK 1:1
n42001001004,LUK 1:1!4,v,verb,missing,to draw up,ἀνατάξασθαι,,ἀνατάσσομαι,ἀνατάξασθαι,392,...,62.3,A0:n42001001002 A1:n42001001005,n42001001002,missing,n42001001004,LUK,1,1,LUK 1,LUK 1:1
n42001001005,LUK 1:1!5,o,noun,common,a narration,διήγησιν,,διήγησις,διήγησιν,1335,...,33.11,missing,missing,missing,n42001001005,LUK,1,1,LUK 1,LUK 1:1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
n42024053006,LUK 24:53!6,missing,det,missing,the,τῷ,,ὁ,τῷ,3588,...,92.24,missing,missing,missing,n42024053006,LUK,24,53,LUK 24,LUK 24:53
n42024053007,LUK 24:53!7,missing,noun,missing,temple,ἱερῷ,,ἱερός,ἱερῷ,2411,...,7.16,missing,missing,missing,n42024053007,LUK,24,53,LUK 24,LUK 24:53
n42024053008,LUK 24:53!8,missing,verb,missing,blessing,εὐλογοῦντες,,εὐλογέω,εὐλογοῦντες,2127,...,33.356,A0:n42024033013;n42024033015;n42024013003 A1:n...,n42024033013 n42024033015 n42024013003,missing,n42024053008,LUK,24,53,LUK 24,LUK 24:53
n42024053009,LUK 24:53!9,missing,det,missing,-,τὸν,,ὁ,τόν,3588,...,92.24,missing,missing,missing,n42024053009,LUK,24,53,LUK 24,LUK 24:53


In [9]:
### SENTENCE-BASED TEXTS

import pandas as pd
import numpy as np

# Initialize lists
text_list = []
target_verse_list = []
dict_list = []
id_list = []

# Group the DataFrame by 'sentence'
# grouped = mg.groupby('sentence')
grouped = mg.groupby('book_chapter_verse')

for name, group in grouped:
    # Combine the 'text' and 'after' fields into a single string for each group
    greek_text = ''.join(group['text'] + group['after'].replace(np.nan, ' ', regex=True))
    sentence_gloss = ''.join(group['gloss'].replace(np.nan, '[no gloss]', regex=True) + group['after'].replace(np.nan, '', regex=True) + ' ')
    text_list.append(sentence_gloss)
    
    # Extract book, chapter, and verse from the group
    book = group['book'].values[0]
    chapter = group['book_chapter'].str.split().str[1].values[0]
    verse = group['book_chapter_verse'].str.split(':').str[1].values[0]
    b_c_v = group['book_chapter_verse'].values[0]
    # id_entry = group['sentence'].values[0]
    id_entry = '|'.join(group['id'].tolist())
    
    target_verse = None
    
    # Find the target text for the verse if scraped
    # for verse_list in target_verses:
    #     for verse_dict in verse_list:
    #         print(verse_dict)
    #         if verse_dict['ch'] == chapter and verse_dict['v'] == verse:
    #             target_verse = verse_dict['text']
    
    # Find the target text for the verse if loaded from file
    for verse_dict in target_verses:
        if str(verse_dict['ch']) == chapter and str(verse_dict['v']) == verse:
            target_verse = verse_dict['text']
    
    
    
    # dict_entry = {'book': book, 'chapter': chapter, 'verse': verse, 'greek': greek_text}
    dict_entry = {'source': b_c_v, 'book': book, 'chapter': chapter, 'verse': verse, 'ids': id_entry, 'greek': greek_text}
    if target_verse:
        dict_entry['target'] = target_verse
    dict_list.append(dict_entry)
    
    
    # Use the 'xml:id' field to create an ID for the verse
    id_entry = group['id'].tolist()
    id_list.append(id_entry)

# Print the lists for testing
print(text_list[:5])
print(dict_list[:5])
print(id_list[:5])

['Not  I want  for  you  to be ignorant, brothers, that  the  fathers  of us  all  under  the  cloud  were  and  all  through  the  sea  passed, ', 'Neither  are you to grumble, as  some  of them  grumbled, and  perished  by  the  Destroyer. ', 'These things  now  [as] types  happened  to them, were written  then  for  admonition  of us, to  whom  the  ends  of the  ages  are arrived. ', 'Therefore  the [one]  thinking  to stand  let him take heed  lest  he fall, ', 'Temptation  you  not  has seized  if  not  what is common to man· faithful  now  -  [is] God, who  not  will allow  you  to be tempted  beyond  what  you are able, but  will provide  with  the  temptation  also  the  escape  -  to be able  to endure [it]. ']
[{'source': '1CO 10:1', 'book': '1CO', 'chapter': '10', 'verse': '1', 'ids': 'n46010001001|n46010001002|n46010001003|n46010001004|n46010001005|n46010001006|n46010001007|n46010001008|n46010001009|n46010001010|n46010001011|n46010001012|n46010001013|n46010001014|n46010001

In [10]:
# Filter out any text, dict, ids, that are not in Luke

# zip all three together
zipped = zip(text_list, dict_list, id_list)

luke_sentences = [v for v in zipped if v[1]['book'] == 'LUK']



In [164]:
with open('sentence_sets.txt', 'w', encoding='utf8') as f:
    f.write(json.dumps(luke_sentences, ensure_ascii=False, indent=2))

In [11]:
len(luke_sentences)

1149

In [12]:
luke_sentences[0][1]['source']

'LUK 10:1'
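
The first entry is `LUK 10:1` rather than `LUK 1:1` because `groupby` sorts its string keys lexicographically by default. If canonical order matters, a numeric sort key is one option. The sketch below uses a hypothetical helper, `ref_sort_key`, on a handful of references.

```python
import re

def ref_sort_key(ref):
    """Sort key that compares chapter and verse numerically (hypothetical helper)."""
    book, chapter, verse = re.match(r"(\S+)\s(\d+):(\d+)", ref).groups()
    return (book, int(chapter), int(verse))

refs = ["LUK 1:1", "LUK 10:1", "LUK 2:15", "LUK 10:10"]
print(sorted(refs))                    # plain string order
print(sorted(refs, key=ref_sort_key))  # chapter and verse compared as integers
```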

In [13]:
print(luke_sentences[0][1]['source'])
print(luke_sentences[0][0])
print(luke_sentences[0][1]['greek'])
print(luke_sentences[0][1]['target'])

# Make an array of objects like this
triples_list = []
for sentence in luke_sentences:
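    # Take the Greek from the current sentence; indexing luke_sentences[0] here would repeat the Greek of LUK 10:1 in every entry
    triples_list.append({'source': sentence[1]['source'], 'greek': sentence[1]['greek'], 'gloss': sentence[0], 'target': sentence[1]['target']})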
    
triples_list[0:5]

LUK 10:1
After  now  these things  appointed  the  Lord  others  seventy, and  sent  them  in  two [by]  before  [the] face  of Himself  into  every  city  and  place  where  was about  He Himself  to go. 
Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.
Naji jifɔ ɛlɔ seng Tata arang anɛ atil ara na woba na abal atom abɔ na ɛtegla anɛ abal abal. Arɛ agbɔmba asɔng na bijiba na afom ya wɛ akɛrɛnɔ arɛ wɛ aji.


[{'source': 'LUK 10:1',
  'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.',
  'gloss': 'After  now  these things  appointed  the  Lord  others  seventy, and  sent  them  in  two [by]  before  [the] face  of Himself  into  every  city  and  place  where  was about  He Himself  to go. ',
  'target': 'Naji jifɔ ɛlɔ seng Tata arang anɛ atil ara na woba na abal atom abɔ na ɛtegla anɛ abal abal. Arɛ agbɔmba asɔng na bijiba na afom ya wɛ akɛrɛnɔ arɛ wɛ aji.'},
 {'source': 'LUK 10:10',
  'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.',
  'gloss': 'Into  whatever  now  -  city  you might enter  and  not  they receive  you, having gone out  into  the  streets  of it  say  ',
  'target': 'Wun ɔlɔ fɛngɛ ɔyɛl na ɛjiba ji adɛnɔ wun kpera, ba sɔng na aba ɛjiba wun ɔbong

## Process data

In [14]:
# Use the source/gloss/target triples as the dataset
data = triples_list

# Split the data into train and test sets
train_data, test_data = train_test_split(data, test_size=0.2)

In [18]:
persist_directory = '/Users/ryderwishart/biblical-machine-learning/ai-translation'

from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings
embeddings = HuggingFaceEmbeddings()

collection = Chroma("translation-db", embeddings, persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: /Users/ryderwishart/biblical-machine-learning/ai-translation


In [19]:
# Add target-language texts, with the source reference, Greek, and gloss as metadata
"""
[{'source': 'LUK 10:1',
  'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.',
  'gloss': 'After  now  these things  appointed  the  Lord  others  seventy, and  sent  them  in  two [by]  before  [the] face  of Himself  into  every  city  and  place  where  was about  He Himself  to go. ',
  'target': 'Naji jifɔ ɛlɔ seng Tata arang anɛ atil ara na woba na abal atom abɔ na ɛtegla anɛ abal abal. Arɛ agbɔmba asɔng na bijiba na afom ya wɛ akɛrɛnɔ arɛ wɛ aji.'},
 
"""

collection.add_texts(
    texts=[sentence['target'] for sentence in train_data],
    metadatas=[{'source': sentence['source'], 'greek': sentence['greek'], 'gloss': sentence['gloss'], 'target': sentence['target']} for sentence in train_data],
    ids=[sentence['source'] for sentence in train_data],
)
collection.persist()

In [None]:
# Load the persisted database from disk and use it as normal
collection_from_disk = Chroma(persist_directory=persist_directory, embedding_function=embeddings, collection_name="translation-db")  # same collection name as when the store was created
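
Conceptually, the `collection.search(..., search_type='similarity')` calls in later cells embed the query and return the stored texts whose vectors lie closest to it. The toy sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. It is an illustration only: the vectors are invented, `top_k` is a hypothetical helper, and Chroma's actual index and distance measure may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Invented vectors standing in for sentence embeddings
toy_store = [
    ("LUK 1:1", [1.0, 0.0, 0.0]),
    ("LUK 1:2", [0.9, 0.1, 0.0]),
    ("LUK 1:3", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.0, 0.0], toy_store))
```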

## Set up translation and back-translation chains

In [121]:
import copy

# Initialize the ChatOpenAI model
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)

# Define the schema for the translation task
schema = Object(
    id="translation",
    description="Translate the source sentence to the target language.",
    attributes=[
        Text(
            id="source_sentence",
            description="The sentence in the source language",
            examples=[],
            many=False,
        ),
        Text(
            id="target_sentence",
            description="The translated sentence in the target language",
            examples=[],
            many=False,
        ),
    ],
    many=False,
)

# Create an extraction chain with the model and schema
translate_chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')

# Make a deep copy of the schema (not just a reference) for backtranslation
backtranslation_schema = copy.deepcopy(schema)
backtranslation_chain = create_extraction_chain(llm, backtranslation_schema, encoder_or_encoder_class='json')

                    frequency_penalty was transferred to model_kwargs.
                    Please confirm that frequency_penalty is what you intended.
                    presence_penalty was transferred to model_kwargs.
                    Please confirm that presence_penalty is what you intended.
                    top_p was transferred to model_kwargs.
                    Please confirm that top_p is what you intended.


In [122]:
# Example of how to find similar verses based on target language text
collection.search(test_data[0]['target'], search_type='similarity', k=5)

[Document(page_content='“Ndɛ abong atɔg barɔgantom abɛ arɛ, ‘Baba kakab! Na ɛteng ɛleba wun ɔba yɔbɛ wɛ. Ɔkagɛ ɛbel na nyig ɔbɔ na akpagata na ata ayɛ.', metadata={'source': 'LUK 15:22', 'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.', 'gloss': 'Said  then  the  father  to  the  servants  of him  Quickly  bring out  robe  the  best  and  clothe  him, and  give  a ring  for  the  hand  of him  and  sandals  for  his  feet, ', 'target': '“Ndɛ abong atɔg barɔgantom abɛ arɛ, ‘Baba kakab! Na ɛteng ɛleba wun ɔba yɔbɛ wɛ. Ɔkagɛ ɛbel na nyig ɔbɔ na akpagata na ata ayɛ.'}),
 Document(page_content='Na ji bateb ɔteb Ɔsɔwɔ alɔ tum abɔ asa afɛngɛ njim ajɛn na libonta, ba babe ɔjɔra akuma mbonga atɔga atɛm, “Ɛjɛn na Bɛtlɛm ɛji yɛn nsɔl yi ɛlɔ lem ji Tata alɔ wur tɔg.”', metadata={'source': 'LUK 2:15', 'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺ

In [134]:
# Move through the test sentences one at a time
for sentence_dict in test_data:
    # Query the Chroma collection for stored sentences similar to the current one
    similar_sentences = collection.search(sentence_dict['target'], search_type='similarity', k=10)
    print(similar_sentences)
    print(dir(similar_sentences[0]))
    # examples = [(source_sentence, target_sentence) for source_sentence, target_sentence in similar_sentences] # use similar sentences
    # Compare on page_content: a Document never equals a plain string, so the original filter excluded nothing
    examples = [(sentence.metadata['gloss'], sentence.page_content) for sentence in similar_sentences if sentence.page_content != sentence_dict['target']] # use similar sentences

    # Build some examples for back translation which put the gloss second and the target language first
    backtranslation_examples = [(sentence.page_content, sentence.metadata['gloss']) for sentence in similar_sentences if sentence.page_content != sentence_dict['target']] # use similar sentences

    # Update the examples in the schema
    schema.attributes[0].examples = examples
    
    # Update the examples in the backtranslation schema
    backtranslation_schema.attributes[0].examples = backtranslation_examples
    
    print(sentence_dict['source'], '-->', examples)

    # Use the OpenAI chat model (via the Kor chain) to draft a translation of the test sentence from the examples
    prediction = translate_chain.predict_and_parse(text=sentence_dict['gloss'])

    print('prediction', prediction)
    translated_sentence = prediction['data']['translation']['source_sentence']
    
    # TODO: Backtranslate the results to the source language without seeing the original translations
    # This requires an additional translation model or API that can translate from the target language back to the source language.
    reverse_prediction = backtranslation_chain.predict_and_parse(text=translated_sentence)
    print('reverse_prediction', reverse_prediction)
    backtranslated_target_sentence = reverse_prediction['data']['translation']['source_sentence']
    
    # Backtranslate the original, gold-standard translation for a comparative basis
    original_backtranslation_prediction = backtranslation_chain.predict_and_parse(text=sentence_dict['target'])
    backtranslated_original_target_sentence = original_backtranslation_prediction['data']['translation']['source_sentence']
    
    # TODO: Compare the resulting target translation and back-translated source to the original sentence pair
    # This requires a suitable metric for comparison, such as the BLEU score for translation tasks.
    
    # For now, just print the results and compare to the original
    print('Original:', sentence_dict['target'])
    print('Predicted:', translated_sentence)
    print('Backtranslated version of the model\'s prediction:', backtranslated_target_sentence)
    print('Backtranslated version of the gold-standard sentence:', backtranslated_original_target_sentence)
    break


[Document(page_content='“Ndɛ abong atɔg barɔgantom abɛ arɛ, ‘Baba kakab! Na ɛteng ɛleba wun ɔba yɔbɛ wɛ. Ɔkagɛ ɛbel na nyig ɔbɔ na akpagata na ata ayɛ.', metadata={'source': 'LUK 15:22', 'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.', 'gloss': 'Said  then  the  father  to  the  servants  of him  Quickly  bring out  robe  the  best  and  clothe  him, and  give  a ring  for  the  hand  of him  and  sandals  for  his  feet, ', 'target': '“Ndɛ abong atɔg barɔgantom abɛ arɛ, ‘Baba kakab! Na ɛteng ɛleba wun ɔba yɔbɛ wɛ. Ɔkagɛ ɛbel na nyig ɔbɔ na akpagata na ata ayɛ.'}), Document(page_content='Na ji bateb ɔteb Ɔsɔwɔ alɔ tum abɔ asa afɛngɛ njim ajɛn na libonta, ba babe ɔjɔra akuma mbonga atɔga atɛm, “Ɛjɛn na Bɛtlɛm ɛji yɛn nsɔl yi ɛlɔ lem ji Tata alɔ wur tɔg.”', metadata={'source': 'LUK 2:15', 'greek': 'Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς

KeyError: 'target_sentence'
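
A `KeyError` like this one appears whenever the parsed output lacks an expected key. One defensive option is to walk the nested dictionary with a helper that returns a default instead of raising. `get_nested` below is a hypothetical helper, and the `prediction` dictionary is a placeholder shaped like the access pattern used in this notebook (`prediction['data']['translation'][...]`), not real model output.

```python
def get_nested(mapping, *keys, default=None):
    """Follow a chain of keys through nested dicts, returning default if any is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Placeholder shaped like the parsed output accessed in this notebook
prediction = {'data': {'translation': {'source_sentence': 'placeholder text'}}}

print(get_nested(prediction, 'data', 'translation', 'target_sentence', default='[missing]'))
print(get_nested(prediction, 'data', 'translation', 'source_sentence'))
```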

In [141]:
print('Original:\n', sentence_dict['target'])
print('Original gloss:\n', sentence_dict['gloss'])
print('Backtranslated version of the gold-standard sentence:\n', backtranslated_original_target_sentence)

print('\nPredicted:\n', prediction['data']['translation']['source_sentence'])
print('Backtranslated version of the model\'s prediction:\n', backtranslated_target_sentence)

Original:
 “Baba wur ɛtig Tata Ɔsɔwɔ bi Isrɛl! ateb wɛ agba ba naji anɛ abɛ arɛ aba nyangɛ abɔ.
Original gloss:
 Blessed be  [the] Lord  the  God  -  of Israel, because  He has visited  and  has performed  redemption  [on] the  people  of Him, 
Backtranslated version of the gold-standard sentence:
 Blessed  [be]  the  Lord  the  God  of Israel, because  He has visited  and  made  redemption  for  His  people, 

Predicted:
 “Ɔtɔŋŋɛ Ɔsɔwɔ, Tata alɔ Isirayel, ɛlɛ mɛnɛŋ mɛnɛŋ, ɛlɛ mɛnɛŋ ɛtɛm ji ɛla ba ɛ, ɛlɛ mɛnɛŋ ɛtɛm ji ɛla ba ɛ, ɛlɛ mɛnɛŋ ɛtɛm ji ɛla ba ɛ.
Backtranslated version of the model's prediction:
 And  having taken  bread  having given thanks  He broke [it]  and  gave  to them  saying  This  is  My  body  which  for  you  is given· this  do  in  the  of Me  remembrance. 
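
The TODO in the loop above asks for a comparison metric such as BLEU. As a rough stand-in, the sketch below scores two strings by token-overlap F1. This is not BLEU, and `tokenize` and `token_f1` are hypothetical helpers. The three strings are the gloss and the two backtranslations printed above, with whitespace normalised.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of word characters only."""
    return re.findall(r"\w+", text.lower())

def token_f1(candidate, reference):
    """Harmonic mean of token precision and recall (a crude overlap score, not BLEU)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

original_gloss = "Blessed be [the] Lord the God - of Israel, because He has visited and has performed redemption [on] the people of Him,"
gold_backtranslation = "Blessed [be] the Lord the God of Israel, because He has visited and made redemption for His people,"
model_backtranslation = "And having taken bread having given thanks He broke [it] and gave to them saying This is My body which for you is given· this do in the of Me remembrance."

print(round(token_f1(gold_backtranslation, original_gloss), 2))
print(round(token_f1(model_backtranslation, original_gloss), 2))
```

A score of this kind only measures shared vocabulary, so it should be read as a sanity check rather than a translation quality metric.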


In [143]:
print(translate_chain.prompt.format_prompt(text="[user input]").to_string())
print('>>>>>>>>>')
print(backtranslation_chain.prompt.format_prompt(text="[user input]").to_string())



Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

translation: { // Translate the source sentence to the target language.
 source_sentence: string // The sentence in the source language
 target_sentence: string // The translated sentence in the target language
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: Said  then  the  father  to  the  servants  of him  Quickly  bring out  robe  the  best  and  

## Define translation functions

In [20]:

# TODO: implement several ways of breaking down the Greek text, including:
# 1. syntactic - either clauses or word groups? Similar examples would be lexical...?
# 2. semantic - broken down by semantic roles? Similar examples would be semantic role based, and lexical?
# 3. discourse - broken down by speech acts? Similar examples would be act plus same core lexeme?

def split_verse_by_chunk_type(verse_ref, chunk_type="speech act"):
    """Splits a verse based on speech acts or other chunk types.
    
    Possible types:
    - "speech act": leverages OpenText syntax
    - "word group": leverages OpenText syntax
    - "token": leverages MACULA tokens
    - "clause": leverages OpenText syntax
    - "verse": leverages MACULA syntax (gets sentence that contains verse)
    """
    
    return ['this is an array', 'of chunks of the specified type']
    # TODO: figure out how best to get chunks by type from a verse ref...
    # I think what I need to do is create a database for each type of chunk.
    # TODO: use oxygen to create the basic transforms

def build_few_shot_prompt(source_text, chroma_database, backtranslate=False):
    # TODO: use prompt template instead of kor extraction chain, since it's not really structured data...
    # e.g., 'translate this text, using only data from the examples...
    # perhaps I should tokenize the input string and make sure every
    # token is represented in the examples. That way... there is at least
    # a good chance the model can figure out some way to associate source 
    # and target sequences
    
    examples = [] # TODO: provision examples for each token in source_text (see note above)
    # NOTE: I think for the examples I need some kind of example validation function
    # e.g., check for tfidf token coverage (or reverse?? since infrequent tokens can
    # be transliterated and frequent ones will help secure collocations/colligations?)
    
    # FIXME: get similar examples based on chunks in source text... each chunk's core/head needs to be represented
    # chunk_type = 'speech act'
    # chunks = split_verse_by_chunk_type(verse_ref, chunk_type)
    # for chunk in chunks:
    #     verse_with_chunk = chroma_database.search(chunk, search_type='similarity', k=1) # FIXME: query a db of the chunk type
    similar_sentences = chroma_database.search(source_text, search_type='similarity', k=10)
    
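    if backtranslate:
        # Back-translation: target-language text is the source, the gloss is the target
        examples = [f"Source: {sentence.page_content}\nTarget: {sentence.metadata['gloss']}" for sentence in similar_sentences if sentence.page_content != source_text]
    else:
        # Forward translation: the gloss is the source, target-language text is the target
        examples = [f"Source: {sentence.metadata['gloss']}\nTarget: {sentence.page_content}" for sentence in similar_sentences if sentence.metadata['gloss'] != source_text]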

    examples = '\n'.join(examples)
    
    # Validate examples
    unaccounted_tokens = []
    for token in source_text.split(): # TODO: remove punctuation as well
        if token.lower() not in examples.lower():
            unaccounted_tokens.append(token)
            
    unaccounted_tokens = ','.join(unaccounted_tokens)
        
    output_instructions = (
        "Translate the source text below using only the data from the "
        "supplied example pairs. Do not fill in any gaps in the sense if the "
        "examples do not provide adequate coverage of the source text. "
        "Rather, retain any untranslated source text content in [square brackets]. "
        "As this will help identify where new example pairs are needed. "
    )
    
    if len(unaccounted_tokens) > 0:
        output_instructions += (
            "The following tokens are not accounted for, so don't bother trying to "
            "translate them: {unaccounted_tokens}"
            ).format(unaccounted_tokens=unaccounted_tokens)
    
    prompt_template = (
        "You are LowResourceLanguageTranslationBot. With minimal input "
        "data, you can identify covariance patterns between source and "
        "target texts. Your aim is to draft translations. Given a source "
        "text, you require only some examples of the source tokens being "
        "used in other translation pairs."
        "\n"
        "{output_instructions}\n\n"
        "Example pairs:\n{examples}\n\n"
        "Source: {source_text}\nTarget:"
    ).format(output_instructions=output_instructions, source_text=source_text, examples=examples)
    
    prompt = prompt_template
    
    return prompt
    
    # Use OpenAI's completion endpoint for annotating new translations on the test sentences using the examples
    # prediction = translate_chain.predict_and_parse(text=sentence_dict['gloss'])
    

    # print('prediction', prediction)
    # translated_sentence = prediction['data']['translation']['source_sentence']
    
    # # TODO: Backtranslate the results to the source language without seeing the original translations
    # # This requires an additional translation model or API that can translate from the target language back to the source language.
    # reverse_prediction = backtranslation_chain.predict_and_parse(text=translated_sentence)
    # print('reverse_prediction', reverse_prediction)
    # backtranslated_target_sentence = reverse_prediction['data']['translation']['source_sentence']
    
    # # Backtranslate the original, gold-standard translation for a comparative basis
    # original_backtranslation_prediction = backtranslation_chain.predict_and_parse(text=sentence_dict['target'])
    # backtranslated_original_target_sentence = original_backtranslation_prediction['data']['translation']['source_sentence']
    
    # # TODO: Compare the resulting target translation and back-translated source to the original sentence pair
    # # This requires a suitable metric for comparison, such as the BLEU score for translation tasks.
    
    # # For now, just print the results and compare to the original
    # print('Original:', sentence_dict['target'])
    # print('Predicted:', prediction['data']['source_sentence'])
    # print('Backtranslated version of the model\'s prediction:', backtranslated_target_sentence)
    # print('Backtranslated version of the gold-standard sentence:', backtranslated_original_target_sentence)


In [26]:
test_sentence = 'Blessed be  [the] Lord  the  God  -  of Israel, because  He has visited  and  has performed  redemption  [on] the  people  of Him,'

prompt = build_few_shot_prompt(test_sentence, collection)
print(prompt)

You are LowResourceLanguageTranslationBot. With minimal input data, you can identify covariance patterns between source and target texts. Your aim is to draft translations. Given a source text, you require only some examples of the source tokens being used in other translation pairs.
Translate the source text below using only the data from the supplied example pairs. Do not fill in any gaps in the sense if the examples do not provide adequate coverage of the source text. Rather, retain any untranslated source text content in [square brackets]. As this will help identify where new example pairs are needed. The following tokens are not accounted for, so don't bother trying to translate them: Blessed,[the],Lord,Israel,,because,visited,performed,redemption,[on]

Example pairs:
Source: Mɛ ngɔrɛ batɔnanjim aba ndɛ akam ɛbe ɛtugi jifɔ ayege, amalam.”
Target: And  I begged  the  disciples  of You  that  they might cast out  it, and  not  they were able. 
Source: “Mbima ɛgbɛ ajil abim amug a ns
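
The unaccounted-token list above contains entries such as `Israel,` with punctuation attached, and the substring test in `build_few_shot_prompt` would also count `be` as covered by a word like `begged`. A possible refinement, sketched here with a hypothetical helper `find_unaccounted_tokens` and placeholder example text, is to compare whole word tokens after stripping punctuation.

```python
import re

def find_unaccounted_tokens(source_text, examples_text):
    """Return source tokens that never occur as whole words in the examples."""
    example_vocab = set(re.findall(r"\w+", examples_text.lower()))
    missing = []
    for token in re.findall(r"\w+", source_text.lower()):
        if token not in example_vocab and token not in missing:
            missing.append(token)
    return missing

# Placeholder example text; only the first line echoes a gloss shown above
examples_text = "Source: And I begged the disciples of You\nTarget: placeholder target text"
print(find_unaccounted_tokens("Blessed be [the] Lord, of Israel", examples_text))
```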

In [31]:
# Use OpenAI completion endpoint (davinci 003) to complete the prompt from template
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0.1)
# prompt = PromptTemplate(
#     input_variables=["product"],
#     template="What is a good name for a company that makes {product}?",
# )
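# Build the fully rendered prompt with the prompt expansion function defined earlier
prompt = build_few_shot_prompt(test_sentence, collection)

# LLMChain needs a PromptTemplate rather than a plain string; passing a string raises
# "ValidationError: 1 validation error for LLMChain\nprompt\n value is not a valid dict (type=type_error.dict)".
# NOTE: the wrapper below still fails validation, because the rendered prompt
# contains no {source_text} placeholder to match input_variables.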
prompt = PromptTemplate(
    input_variables=["source_text"],
    template=prompt,
)

from langchain.chains import LLMChain
chain = LLMChain(
    llm=llm, 
    prompt=prompt)


ValidationError: 1 validation error for PromptTemplate
__root__
  Invalid prompt schema; check for mismatched or missing input parameters. {'source_text'} (type=value_error)
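
The error above is raised when the declared `input_variables` do not match the placeholders actually present in the template. Here `build_few_shot_prompt` appears to return a fully rendered string, so there is no `{source_text}` field left to fill. The sketch below reproduces that kind of check with the standard library only; `template_variables` is a hypothetical helper, and this is not LangChain's own validation code.

```python
from string import Formatter


def template_variables(template):
    """Return the set of named {placeholders} found in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# An already rendered prompt: nothing left to substitute
rendered = "Example pairs:\nSource: ...\nTarget: ...\n\nSource: Blessed be the Lord\nTarget:"
# A reusable template: the source text is still a placeholder
reusable = "Example pairs:\n{examples}\n\nSource: {source_text}\nTarget:"

declared = {"source_text"}
for name, template in [("rendered", rendered), ("reusable", reusable)]:
    found = template_variables(template)
    print(name, "| missing:", declared - found, "| undeclared:", found - declared)
```

A template passes this check only when both sets are empty, which is why either the template must keep its placeholders or the declared variables must be adjusted to match.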

In [None]:
chain.run(test_sentence)

## Move functionality into langchain ecosystem

In [33]:
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains.base import Chain
from langchain.chains import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import TransformChain, SimpleSequentialChain
from langchain.llms import OpenAI

In [151]:
class TranslatePromptBuilder(LLMChain):
    class Config:
        extra = "allow"

    def __init__(self, chroma_database, backtranslate=False, **kwargs):
        super().__init__(**kwargs)
        self.chroma_database = chroma_database
        self.backtranslate = backtranslate

    @property
    def input_keys(self):
        return ["source_text"]

    @property
    def output_keys(self):
        return ["prompt"]

    def _call(self, inputs, run_manager=None):
        source_text = inputs["source_text"]
        examples = self.chroma_database.search(source_text, search_type='similarity', k=8)
        # Drop any retrieved example whose source side is identical to the input text
        if self.backtranslate:
            examples = [f"Source: {doc.page_content}\nTarget: {doc.metadata['gloss']}" for doc in examples if doc.page_content != source_text]
        else:
            examples = [f"Source: {doc.metadata['gloss']}\nTarget: {doc.page_content}" for doc in examples if doc.metadata['gloss'] != source_text]
        examples = '\n'.join(examples)
        unaccounted_tokens = [token for token in source_text.split() if token.lower() not in examples.lower()]
        unaccounted_tokens = ','.join(unaccounted_tokens)
        output_instructions = (
            "Translate the source text below using only the data from the "
            "supplied example pairs. Do not fill in any gaps in the sense if the "
            "examples do not provide adequate coverage of the source text. "
            "Rather, retain any untranslated source text content in [square brackets]. "
            "As this will help identify where new example pairs are needed. "
        )
        
        if self.backtranslate:
            output_instructions += (
                "Your job is to back-translate a draft translation for QA purposes. "
                "Please translate the translation text back into the source language. "
            )
        
        if len(unaccounted_tokens) > 0:
            output_instructions += (
                "The following tokens are not accounted for, so don't bother trying to "
                "translate them: {unaccounted_tokens}"
                ).format(unaccounted_tokens=unaccounted_tokens)
        
        prompt_template = ( # FIXME: extract this logic so it can be used in the backtranslation prompt builder too
            "You are LowResourceLanguageTranslationBot. With minimal input "
            "data, you can identify covariance patterns between source and "
            "target texts. Your aim is to draft translations. Given a source "
            "text, you require only some examples of the source tokens being "
            "used in other translation pairs."
            "\n"
            "{output_instructions}\n\n"
            "Example pairs:\n{examples}\n\n"
            "Source: {source_text}\nTarget:"
        ).format(output_instructions=output_instructions, source_text=source_text, examples=examples)
        return {"prompt": prompt_template}
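
The FIXME above suggests extracting the prompt-building logic so that the translation and back-translation builders can share it. One possible shape for that refactor is a plain function with no LangChain dependency. This is a sketch under assumptions: `build_translation_prompt`, `find_unaccounted_tokens`, and the `(gloss, vernacular)` pair format are hypothetical, and the instruction text is abbreviated.

```python
def find_unaccounted_tokens(source_text, examples_text):
    """Tokens of the source that never occur (as substrings) in the examples."""
    haystack = examples_text.lower()
    return [tok for tok in source_text.split() if tok.lower() not in haystack]


def build_translation_prompt(source_text, pairs, backtranslate=False):
    """Render a few-shot prompt from (gloss, vernacular) pairs.

    Forward translation goes gloss -> vernacular; back-translation swaps them.
    """
    if backtranslate:
        pairs = [(vern, gloss) for gloss, vern in pairs]
    # Skip any example whose source side is the sentence being translated
    examples = "\n".join(
        f"Source: {src}\nTarget: {tgt}" for src, tgt in pairs if src != source_text
    )
    instructions = "Translate the source text using only the example pairs. "
    if backtranslate:
        instructions += "Translate the draft back into the source language. "
    unaccounted = find_unaccounted_tokens(source_text, examples)
    if unaccounted:
        instructions += "Tokens not covered by the examples: " + ",".join(unaccounted)
    return f"{instructions}\n\nExample pairs:\n{examples}\n\nSource: {source_text}\nTarget:"


pairs = [(
    "And  I begged  the  disciples  of You  that  they might cast out  it, and  not  they were able.",
    "Mɛ ngɔrɛ batɔnanjim aba ndɛ akam ɛbe ɛtugi jifɔ ayege, amalam.”",
)]
print(build_translation_prompt("the disciples of Israel", pairs))
```

The substring test mirrors the class above, so a short token can count as covered when it merely appears inside a longer word or inside the `Source:`/`Target:` labels.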


In [152]:
# Create your example selector
example_selector = SemanticSimilarityExampleSelector(
    vectorstore=collection,
    k=10
)
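
`SemanticSimilarityExampleSelector` picks the `k` stored examples closest to the input. To make the idea concrete without an embedding model or vector store, the following stand-in ranks examples by cosine similarity over token counts. This is deliberately not what the selector does internally (it queries the vector store's embeddings); `select_similar` and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_similar(query, examples, k=2):
    """Return the k examples whose 'source' text is most similar to the query."""
    query_vec = Counter(query.lower().split())
    scored = [(cosine(query_vec, Counter(ex["source"].lower().split())), ex) for ex in examples]
    # Sort on the score only, so the example dicts themselves are never compared
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]


examples = [
    {"source": "And He said to them", "target": "..."},
    {"source": "they were not able", "target": "..."},
    {"source": "He said to the people", "target": "..."},
]
for ex in select_similar("He said to Him", examples, k=2):
    print(ex["source"])
```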

In [153]:

# Create your prompt template
prompt_template = FewShotPromptTemplate(
    example_selector=example_selector, 
    example_prompt=PromptTemplate(input_variables=["source", "target"], template="Source: {source}\nTarget: {target}"), 
    suffix="Source: {input}\nTarget:", 
    input_variables=["input"]
)

# Create your translate chain (named translate_chain to match the call in the cell below)
translate_chain = TranslatePromptBuilder(
    chroma_database=collection,
    prompt=prompt_template,  # FIXME: either stop passing this in, or stop building the prompt inside the class
    llm=OpenAI()
)

# Create your back translate chain
back_translate_chain = TranslatePromptBuilder(
    chroma_database=collection,
    backtranslate=True,
    prompt=prompt_template,
    llm=OpenAI()
)


In [112]:
# Your test sentence
test_sentence = "Blessed be  [the] Lord  the  God  -  of Israel, because  He has visited  and  has performed  redemption  [on] the  people  of Him,"

# Use the translate chain
translation_prompt = translate_chain.run(test_sentence)
# print("## Translation:\n\n", translation_prompt)

In [126]:
result = openai.Completion.create(
    engine="davinci",
    prompt=translation_prompt,
    max_tokens=200,
    temperature=0.9,
    # top_p=1,
    # frequency_penalty=0,
    # presence_penalty=0,
    stop=["\n"]
)

In [127]:
completed = ' '.join([translation_prompt, result.choices[0].text])

print("## Translation:\n\n", completed)

## Translation:

 You are LowResourceLanguageTranslationBot. With minimal input data, you can identify covariance patterns between source and target texts. Your aim is to draft translations. Given a source text, you require only some examples of the source tokens being used in other translation pairs.
Translate the source text below using only the data from the supplied example pairs. Do not fill in any gaps in the sense if the examples do not provide adequate coverage of the source text. Rather, retain any untranslated source text content in [square brackets]. As this will help identify where new example pairs are needed. The following tokens are not accounted for, so don't bother trying to translate them: Blessed,[the],Lord,Israel,,because,visited,performed,redemption,[on]

Example pairs:
Source: And  I begged  the  disciples  of You  that  they might cast out  it, and  not  they were able. 
Target: Mɛ ngɔrɛ batɔnanjim aba ndɛ akam ɛbe ɛtugi jifɔ ayege, amalam.”
Source: Went out  the
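
The prompt asks the model to leave any source content it cannot translate in [square brackets] so that gaps in example coverage can be identified. A small post-processing step can collect those bracketed spans from a completion. The helper name `untranslated_spans` and the placeholder string are made up for illustration; no model output is shown here.

```python
import re


def untranslated_spans(completion):
    """Return the text of every [bracketed] span in a model completion."""
    return [span.strip() for span in re.findall(r"\[([^\[\]]+)\]", completion)]


# Placeholder completion with dummy tokens, not real model output
sample = "aaa [Blessed] bbb [has performed] ccc"
print(untranslated_spans(sample))
```

The English glosses in this notebook already use brackets for supplied words such as `[the]`, so this check is most meaningful when the source text itself contains none.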

In [154]:
# Use the back translate chain
# Note: the back translate chain takes the draft translation (the model's completion) as its only input.
back_translation_prompt = back_translate_chain.run(result.choices[0].text)


In [155]:
back_translation_prompt

"You are LowResourceLanguageTranslationBot. With minimal input data, you can identify covariance patterns between source and target texts. Your aim is to draft translations. Given a source text, you require only some examples of the source tokens being used in other translation pairs.\nTranslate the source text below using only the data from the supplied example pairs. Do not fill in any gaps in the sense if the examples do not provide adequate coverage of the source text. Rather, retain any untranslated source text content in [square brackets]. As this will help identify where new example pairs are needed. Your job is to back-translate a draft translation for QA purposes. Please translate the translation text back into the source language. The following tokens are not accounted for, so don't bother trying to translate them: “Abɔbat,Ɔrɔja,niyi,abɔ.,Ana,kɔn,arɛfɛl,nɛmɛ,niyi,yi.\n\nExample pairs:\nSource: “Ajing ya Mosis na ɔteb bi ba yɛna amir atetebe tɛtɛ na ngara Jɔn nwura anɛ alib. B

In [158]:
back_result = openai.Completion.create(
    engine="davinci",
    prompt=back_translation_prompt,
    max_tokens=150,
    temperature=0.9,
    # top_p=1,
    # frequency_penalty=0,
    # presence_penalty=0,
    stop=["\n"]
)

In [159]:
back_completed = ' '.join([back_translation_prompt, back_result.choices[0].text])
print("## Back Translation:\n\n", back_completed)

## Back Translation:

 You are LowResourceLanguageTranslationBot. With minimal input data, you can identify covariance patterns between source and target texts. Your aim is to draft translations. Given a source text, you require only some examples of the source tokens being used in other translation pairs.
Translate the source text below using only the data from the supplied example pairs. Do not fill in any gaps in the sense if the examples do not provide adequate coverage of the source text. Rather, retain any untranslated source text content in [square brackets]. As this will help identify where new example pairs are needed. Your job is to back-translate a draft translation for QA purposes. Please translate the translation text back into the source language. The following tokens are not accounted for, so don't bother trying to translate them: “Abɔbat,Ɔrɔja,niyi,abɔ.,Ana,kɔn,arɛfɛl,nɛmɛ,niyi,yi.

Example pairs:
Source: “Ajing ya Mosis na ɔteb bi ba yɛna amir atetebe tɛtɛ na ngara Jɔn
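
The final step of the plan is to compare the back-translation with the original source sentence. One lightweight way to see what was lost or added in the round trip is a token-level diff with `difflib` from the standard library. The function `round_trip_diff` and the example strings are assumptions for illustration, not results from the notebook.

```python
from difflib import SequenceMatcher


def round_trip_diff(original, back_translated):
    """Return (dropped, added) token lists between original and back-translation."""
    # split() also collapses the double spaces used in the interlinear glosses
    orig_tokens, back_tokens = original.split(), back_translated.split()
    dropped, added = [], []
    matcher = SequenceMatcher(None, orig_tokens, back_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            dropped.extend(orig_tokens[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(back_tokens[j1:j2])
    return dropped, added


dropped, added = round_trip_diff(
    "because  He has visited  the  people",
    "because He visited the nation",
)
print("dropped:", dropped)
print("added:", added)
```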

## Use Alignment to improve utility of examples

In [1]:
# Plain (non-f) string so that the {examples} placeholder survives until
# alignment_prompt_template.format(examples=...) is called.
alignment_prompt_template = """
Please give your best guess as to the optimal aligning of token ngrams \
(i.e., phrases, so don't get too caught up in having one token match one \
other token; rather, you should recognize that, e.g., 'Jisɔs yɔ Nasarɛt' \
likely aligns with 'Jesus of Nazareth', etc. ) in these two strings \
(which are a translation pair). Please attempt a semantic alignment, not \
simply paying attention to whitespace, but also reordering the English/target \
words wherever necessary to match the source text as best as possible, and \
you should take phonetic similarity into account for content words. That is, \
you can tell from the English which words are content words, so try to find \
matching content words such as proper nouns using phonetic similarities \
('Nasar-' sounds like 'Nazar-', for example):

Here's an example of a good alignment:
## Input:
Source: Wɛ abib arɛ, “Ba nsɔl yi?” Abɔ afanga arɛ, “Nsɔl yi ɛlemɔ Jisɔs yɔ Nasarɛt. Ajɔl nyɛna amir abɛl ɛkɔ na alom na nema na libri Ɔsɔwɔ na anɛ kpakpa.
Target: And  He said  to them  What things; -  And  they said  to Him  The things  concerning  Jesus  of  Nazareth, who  was  a man  a prophet  mighty  in  deed  and  word  before  -  God  and  all  the  people,

## Alignment:
| Source token(s) | Target token(s) | Notes |
|---|---|---|
| Wɛ abib arɛ, | And He said to them | Phrasal alignment. Likely a common way to say "He said to them" |
| “Ba nsɔl yi?” | What things; | Orthographic (question mark and quotation marks following a speech verb) |
| Abɔ afanga arɛ, | And they said to Him | Phrasal and phonological; similar to 'Wɛ abib arɛ' above |
| “Nsɔl yi ɛlemɔ Jisɔs yɔ Nasarɛt. | The things concerning Jesus of Nazareth, | Phonological matches; 'yɔ' seems to align with 'of' |
| Ajɔl nyɛna amir abɛl | who was a man a prophet mighty | Semantic; 'abɛl' ~ 'mighty/able' |
| ɛkɔ na alom na nema na libri | in deed and word before | Semantic; 'na' seems to be a correlative connective |
| Ɔsɔwɔ na anɛ kpakpa. | God and all the people, | Semantic; 'na' ~ 'and', 'anɛ' ~ 'all', 'Ɔsɔwɔ' ~ 'God', 'kpakpa' ~ 'people' |

Here are some example translation pairs for additional context to aid you in pattern matching:
{examples}

For example, I would expect that 'Nasarɛt' aligns with 'Nazareth', and 'libri' aligns with 'word'. Please format the output as a markdown table (| Source token | Target token | Notes [concise comments or notes or rationale as to why you aligned these tokens] |).
"""
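
Because the prompt asks for the alignment as a markdown table, the model's reply needs to be parsed before the aligned phrases can be reused as examples. A minimal parser might look like the following. `parse_alignment_table` is a hypothetical helper, and the sample reply is hand-written from the example alignment in the prompt rather than actual model output.

```python
def parse_alignment_table(markdown):
    """Parse a 3-column markdown table into (source, target, notes) tuples."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip malformed rows, the header row, and the |---|---|---| separator row
        if len(cells) != 3 or cells[0].lower().startswith("source") or set(cells[0]) <= set("-: "):
            continue
        rows.append(tuple(cells))
    return rows


reply = """
| Source token | Target token | Notes |
|---|---|---|
| Wɛ abib arɛ, | And He said to them | Phrasal alignment |
| Jisɔs yɔ Nasarɛt. | Jesus of Nazareth, | Phonological match |
"""
for source, target, notes in parse_alignment_table(reply):
    print(f"{source!r} -> {target!r} ({notes})")
```

This simple split assumes no cell contains a literal `|`; rows that do would be dropped by the three-column check.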