In [None]:
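# This notebook uses the legacy openai.ChatCompletion interface, which was removed in openai 1.0
!pip install "openai<1"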

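The cells below import the dependencies and point the client at a local server (`http://localhost:1234/v1`). Change the configuration cell if you are not using a local model.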

In [None]:
import openai
import json
import os

In [None]:
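# The client requires a key to be set; 'EMPTY' is a placeholder for the local server
os.environ['OPENAI_API_KEY'] = 'EMPTY'
openai.api_key = os.environ["OPENAI_API_KEY"]
# The legacy (pre-1.0) client reads the endpoint from api_base
openai.api_base = "http://localhost:1234/v1"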
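
If the endpoint changes often, one option is to read it from environment variables instead of editing the cell each time. The following is only a sketch: `resolve_client_settings` and the `ALIGN_API_BASE` variable are hypothetical names invented for this example, not part of the `openai` package. Only `OPENAI_API_KEY` and the localhost URL come from this notebook.

```python
import os

LOCAL_API_BASE = "http://localhost:1234/v1"  # default used by this notebook


def resolve_client_settings(env=None):
    """Return the API key and base URL to use, preferring environment overrides.

    ALIGN_API_BASE is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "api_key": env.get("OPENAI_API_KEY", "EMPTY"),
        "api_base": env.get("ALIGN_API_BASE", LOCAL_API_BASE),
    }


# With no overrides, the settings fall back to the local defaults
print(resolve_client_settings({}))
```

The returned values could then be assigned to `openai.api_key` and `openai.api_base`.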

## New Spanish alignment


- [ ] Fix the output parsing part of the notebook. Perhaps make a function called `get_alignments_from_prompt_output()` instead of just splitting on highly specific string values
- [ ] (TODO LATER) Programmatically load a Spanish translation from eBible corpus


Some ideas for improving the initial prompt.

- [x] Add language typology information about Spanish
- [x] Add information about translation style of specific Spanish translation we are using (more literal vs. more dynamic)
- [x] Add more examples?
- [x] Replace current examples with examples from the language (e.g., Spanish)
- [x] Move the rationale of each alignment before the alignment itself
- [x] Experiment with tweaking the system prompt to see if it helps
- [x] SIMPLIFY the prompt down to just the most relevant content (do we need a rationale? Does it work without? We can probably drop out relevant grammatical patterns)

In [None]:
# Path to the JSON file of verse records (each has 'vref', 'target', 'bsb', and 'macula' entries)
data = 'C:/Users/natha/Downloads/spapddpt.json'

In [None]:
def generate_broad_greek_alignment_prompt(file_path, verse):
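    # Read as UTF-8 so the Greek text is decoded correctly regardless of platform default
    with open(file_path, 'r', encoding='utf-8') as f: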
        data = json.load(f)
    
    for element in data:
        if element['vref'] == verse:
            return f'''
Here are some general facts to note about Spanish:
Spanish is a fusional language, ensure correct affix attachment; follow SVO order; mark verbs for tense, aspect, mood.
For translating from Greek: replace Greek's three-gender system with Spanish's two-gender system, ensuring agreement; shift to SVO order; adapt Greek Voice/Aspect/Mood markings to Spanish system.

Translation style:
The Spanish translation is a literal translation trying to stick closely to the Greek word order, but there may occasionally be instances where Spanish phrases differ to produce a more natural translation.

Here is a sentence:
Spanish: —¿Qué es lo que ha pasado? —preguntó. Ellos respondieron: —Lo de Jesús de Nazaret. Era un profeta poderoso en obras y en palabras delante de Dios y de todo el pueblo.
English: And He said to them What things; - And they said to Him The things concerning Jesus of Nazareth, who was a man a prophet mighty in deed and word before - God and all the people,
Greek: καὶ εἶπεν αὐτοῖς Ποῖα;οἱ δὲ εἶπαν αὐτῷ Τὰ περὶ Ἰησοῦ τοῦ Ναζαρηνοῦ,ὃς ἐγένετο ἀνὴρ προφήτης δυνατὸς ἐν ἔργῳ καὶ λόγῳ ἐναντίον τοῦ Θεοῦ καὶ παντὸς τοῦ λαοῦ,

Here is a phonological, semantic, orthographic alignment of that sentence:
```
[
    {{
        "Spanish phrase": "—preguntó.",
        "English phrase": "And He said to them",
        "Greek phrase": "καὶ εἶπεν αὐτοῖς"
    }},
    {{
        "Spanish phrase": "¿Qué es lo que ha pasado?",
        "English phrase": "What things;",
        "Greek phrase": "Ποῖα;"
    }},
    {{
        "Spanish phrase": "Ellos respondieron",
        "English phrase": "And they said to Him",
        "Greek phrase": "οἱ δὲ εἶπαν αὐτῷ"
    }},
    {{
        "Spanish phrase": "—Lo de Jesús de Nazaret.",
        "English phrase": "The things concerning Jesus of Nazareth,",
        "Greek phrase": "Τὰ περὶ Ἰησοῦ τοῦ Ναζαρηνοῦ"
    }},
    {{
        "Spanish phrase": "Era un profeta poderoso en obras y en palabras",
        "English phrase": "who was a man a prophet mighty in deed and word",
        "Greek phrase": "ὃς ἐγένετο ἀνὴρ προφήτης δυνατὸς ἐν ἔργῳ καὶ λόγῳ"
    }},
    {{
        "Spanish phrase": "delante de Dios y de todo el pueblo.",
        "English phrase": "before - God and all the people",
        "Greek phrase": "ἐναντίον τοῦ Θεοῦ καὶ παντὸς τοῦ λαοῦ"
    }}
]

```

Please also align the following sentence. Avoid including multiple phrases in a single alignment unit. You may need to break phrases on commas or other major punctuation, including enclosing quotation marks. But you may also need to break a phrase along conjunctions or other words that typically mark the start of a new phrase:

Spanish Phrase: {element['target']['content']}
English Phrase: {element['bsb']['content']}
Greek Phrase: {element['macula']['content']}
'''
    else:
        return 'Please enter a valid Bible reference in the format of: BOOK(3 letter code) CH:VERSE'

In [None]:
def generate_broad_hebrew_alignment_prompt(file_path, verse):
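    # Read as UTF-8 so the Hebrew text is decoded correctly regardless of platform default
    with open(file_path, 'r', encoding='utf-8') as f: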
        data = json.load(f)
    
    for element in data:
        if element['vref'] == verse:
            return f'''
Here are some general facts to note about Spanish:
Spanish is a fusional language, ensure correct affix attachment; follow SVO order; mark verbs for tense, aspect, mood.
For translating from Hebrew: shift to SVO order; adapt Hebrew Voice/Aspect/Mood markings to Spanish system.

Translation style:
The Spanish translation is a literal translation trying to stick closely to the Hebrew word order, but there may occasionally be instances where Spanish phrases differ to produce a more natural translation.

Here is a sentence:
Spanish: Pero la tierra estaba desolada y vacía, y había oscuridad sobre la superficie del abismo. El Espíritu de ʼElohim se movía sobre la superficie de las aguas.
English: Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.
Hebrew: וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

Here is a phonological, semantic, orthographic alignment of that sentence:
```
[
    {{
        "Spanish phrase": "Pero la tierra estaba desolada y vacía,",
        "English phrase": "Now the earth was formless and void,",
        "Hebrew phrase": "וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ"
    }},
    {{
        "Spanish phrase": "y había oscuridad sobre la superficie del abismo.",
        "English phrase": "and darkness was over the surface of the deep.",
        "Hebrew phrase": "וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם"
    }},
    {{
        "Spanish phrase": "El Espíritu de ʼElohim se movía sobre la superficie de las aguas.",
        "English phrase": "And the Spirit of God was hovering over the surface of the waters.",
        "Hebrew phrase": "וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃"
    }}
]

```

Please also align the following sentence. Avoid including multiple phrases in a single alignment unit. You may need to break phrases on commas or other major punctuation, including enclosing quotation marks. But you may also need to break a phrase along conjunctions or other words that typically mark the start of a new phrase:

Spanish Phrase: {element['target']['content']}
English Phrase: {element['bsb']['content']}
Hebrew Phrase: {element['macula']['content']}
'''
    else:
        return 'Please enter a valid Bible reference in the format of: BOOK(3 letter code) CH:VERSE'
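
Both prompt builders reload the JSON file and scan it linearly on every call. When aligning many verses, building a lookup table once may be more convenient. The sketch below assumes the record layout used above (`vref`, `target`, `bsb`, `macula`). The helper names `index_by_vref` and `load_verse_index` are hypothetical, and the sample record holds placeholder text rather than real verse content.

```python
import json


def index_by_vref(records):
    """Map each record's 'vref' string to the record itself."""
    return {record["vref"]: record for record in records}


def load_verse_index(file_path):
    """Load the JSON file once and index it by verse reference (UTF-8 assumed)."""
    with open(file_path, "r", encoding="utf-8") as f:
        return index_by_vref(json.load(f))


# In-memory stand-in for the real file; the contents are placeholders
sample_records = [
    {
        "vref": "GEN 1:1",
        "target": {"content": "<Spanish text>"},
        "bsb": {"content": "<English text>"},
        "macula": {"content": "<Hebrew text>"},
    },
]

verse_index = index_by_vref(sample_records)
print(verse_index["GEN 1:1"]["target"]["content"])
print(verse_index.get("XXX 0:0"))  # an unknown reference yields None
```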

In [None]:
book_idx = {'GEN': 1, 'EXO': 2, 'LEV': 3, 'NUM': 4, 'DEU': 5, 'JOS': 6, 'JDG': 7, 'RUT': 8, '1SA': 9, '2SA': 10,
 '1KI': 11, '2KI': 12, '1CH': 13, '2CH': 14, 'EZR': 15, 'NEH': 16, 'EST': 17, 'JOB': 18, 'PSA': 19, 'PRO': 20,
 'ECC': 21, 'SNG': 22, 'ISA': 23, 'JER': 24, 'LAM': 25, 'EZK': 26, 'DAN': 27, 'HOS': 28, 'JOL': 29, 'AMO': 30,
 'OBA': 31, 'JON': 32, 'MIC': 33, 'NAH': 34, 'HAB': 35, 'ZEP': 36, 'HAG': 37, 'ZEC': 38, 'MAL': 39, 'MAT': 40,
 'MRK': 41, 'LUK': 42, 'JHN': 43, 'ACT': 44, 'ROM': 45, '1CO': 46, '2CO': 47, 'GAL': 48, 'EPH': 49, 'PHP': 50,
 'COL': 51, '1TH': 52, '2TH': 53, '1TI': 54, '2TI': 55, 'TIT': 56, 'PHM': 57, 'HEB': 58, 'JAS': 59, '1PE': 60,
 '2PE': 61, '1JN': 62, '2JN': 63, '3JN': 64, 'JUD': 65, 'REV': 66}


In [None]:
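def generate_broad_alignment_prompt(file_path, reference):
    # Books 1-39 are Old Testament (Hebrew source); books 40-66 are New Testament (Greek source)
    book = reference[:3]
    if book not in book_idx:
        # Avoid a KeyError on an unknown book code
        return 'Please enter a valid Bible reference in the format of: BOOK(3 letter code) CH:VERSE'
    if book_idx[book] < 40:
        return generate_broad_hebrew_alignment_prompt(file_path, reference)
    else:
        return generate_broad_greek_alignment_prompt(file_path, reference)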
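
The reference format named in the error message (`BOOK CH:VERSE`, e.g. `GEN 1:1`) can be checked before any lookup. This is a minimal sketch under that assumption. `parse_reference` and `REFERENCE_PATTERN` are hypothetical names, and the pattern only checks the shape of the book code, not whether it appears in `book_idx`.

```python
import re

# Three-character book code (optionally starting with a digit 1-3), a space, then chapter:verse
REFERENCE_PATTERN = re.compile(r"^([1-3A-Z][A-Z]{2}) (\d+):(\d+)$")


def parse_reference(reference):
    """Return (book, chapter, verse), or None if the reference is malformed."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None
    book, chapter, verse = match.groups()
    return book, int(chapter), int(verse)


print(parse_reference("GEN 1:1"))
print(parse_reference("Genesis 1:1"))
```

A caller could combine this with a `book_idx` membership test to reject well-formed but unknown book codes.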

In [None]:
prompt = generate_broad_alignment_prompt(data, 'GEN 1:1')

In [None]:
messages = [
    {"role": "system", "content": "You are LangAlignerGPT. Analyze the user-supplied alignment examples below and follow any instructions the user gives."},
    {"role": "user", "content": prompt},
]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3,
    n=1,
    presence_penalty=0.5,
    frequency_penalty=0.5,
)

generated_texts = [
    choice.message["content"].strip() for choice in response["choices"]
]
print(generated_texts[0])

## Align individual chunks from output

In [None]:
def get_alignments_from_prompt_output(generated_text):
  """Extract and parse the JSON alignment list from a single model response."""
  start_index = generated_text.find('```')
  if start_index == -1:
    # No code fence at all: assume the whole response is JSON
    json_data = generated_text
  else:
    end_index = generated_text.rfind('```')
    if end_index == start_index:
      # Only an opening fence: take everything after it
      end_index = len(generated_text)
    json_data = generated_text[start_index + 3:end_index]
    # Drop a language tag such as "json" directly after the opening fence
    if json_data.lower().startswith('json'):
      json_data = json_data[4:]
  json_data = json_data.strip()
  print(json_data)

  try:
    data = json.loads(json_data)
  except json.JSONDecodeError:
    print("Invalid JSON data in the generated text.")
    return None
  return data


In [None]:
output = get_alignments_from_prompt_output(generated_texts[0])
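
Parsing can succeed even when the model returns JSON in the wrong shape, so a structural check before the second pass may help. The sketch below assumes each unit should carry the three keys used in the example prompts (`Spanish phrase`, `English phrase`, and either `Greek phrase` or `Hebrew phrase`). `validate_alignment_units` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def validate_alignment_units(units, source_language):
    """Return a list of problems found; an empty list means the structure looks right."""
    expected_keys = {"Spanish phrase", "English phrase", f"{source_language} phrase"}
    if not isinstance(units, list):
        return ["top-level value is not a list"]
    problems = []
    for position, unit in enumerate(units):
        if not isinstance(unit, dict):
            problems.append(f"unit {position} is not an object")
            continue
        missing = expected_keys - set(unit)
        if missing:
            problems.append(f"unit {position} is missing: {sorted(missing)}")
    return problems


# A unit that lacks its source-language phrase is reported
incomplete = [{"Spanish phrase": "a", "English phrase": "b"}]
print(validate_alignment_units(incomplete, "Greek"))
```

Because the parser returns `None` on invalid JSON, passing its result straight into this check also covers that case.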

In [None]:
# Write valid JSON (the manual loop left a trailing comma); UTF-8 keeps the Greek/Hebrew text intact
with open('C:/Users/natha/Downloads/dynamic_posalign.txt', 'w', encoding='utf-8') as file:
    json.dump(output, file, ensure_ascii=False, indent=4)

In [None]:
def refined_greek_alignments_from_prompt_output(output, chunk):
    return '''Here is a phrase:
        {POS_aligned_chunks}

        Please further align and break down the provided chunk into a mapping of the fewest possible tokens (sometimes multiple tokens will align to one token; that's expected) in a format similar to the following:

        E.g.: "Spanish phrase": "Después de esto,",\n\t"English phrase": "After now these things",\n\t"Greek phrase": "Μετὰ δὲ ταῦτα"

        - Spanish token(s)\t\t-->\tEnglish token(s)\t\t-->\tGreek token(s)
        - Después de\t\t-->\tAfter now\t\t-->\tΜετὰ δὲ
        - esto\t\t-->\tthese things\t\t-->\tταῦτα

        Do not include any other information or commentary. Only tell me what the alignment is.

        The chunk I want you to align is:
            ```
            {current_chunk}
            ```
        '''.format(POS_aligned_chunks=json.dumps(output, ensure_ascii = False), current_chunk=json.dumps(chunk, ensure_ascii = False))

In [None]:
def refined_hebrew_alignments_from_prompt_output(output, chunk):
    return '''Here is a phrase:
        {POS_aligned_chunks}

        Please further align and break down the provided chunk into a mapping of the fewest possible tokens (sometimes multiple tokens will align to one token; that's expected) in a format similar to the following:

        E.g.: "Spanish phrase": "El Espíritu de ʼElohim se movía sobre la superficie de las aguas.",\n\t"English phrase": "And the Spirit of God was hovering over the surface of the waters.",\n\t"Hebrew phrase":  "וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃"

        - Spanish token(s)\t\t-->\tEnglish token(s)\t\t-->\tHebrew token(s)
        - El Espíritu de ʼElohim\t\t-->\tAnd the Spirit of God\t\t-->\tוְר֣וּחַ אֱלֹהִ֔ים
        - se movía sobre la superficie de las aguas.\t\t-->\twas hovering over the surface of the waters.\t\t-->\tמְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

        Do not include any other information or commentary. Only tell me what the alignment is.

        The chunk I want you to align is:
            ```
            {current_chunk}
            ```
        '''.format(POS_aligned_chunks=json.dumps(output, ensure_ascii = False), current_chunk=json.dumps(chunk, ensure_ascii = False))

In [None]:
def generate_texts(prompt):
    messages = [
        {"role": "system", "content": "You are LangAlignerGPT. Analyze the user-supplied alignment examples below and follow any instructions the user gives."},
        {"role": "user", "content": prompt},
    ]

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=0.3,
        n=1,
        presence_penalty=0.5,
        frequency_penalty=0.5,
    )

    generated_texts_for_chunk = [
        choice.message["content"].strip() for choice in response["choices"]

    ]
    
    return generated_texts_for_chunk

In [None]:
final_alignments = []

for chunk in output:
  if 'Hebrew' in json.dumps(chunk, ensure_ascii = False):
    prompt = refined_hebrew_alignments_from_prompt_output(output=output, chunk=chunk)
  else:
    prompt = refined_greek_alignments_from_prompt_output(output=output, chunk=chunk)

  generated_texts = generate_texts(prompt)
  
  # Count the occurrences of 'phrase":' across the outputs to improve the odds of well-formed output
  phrase_count = sum(element.count('phrase":') for element in generated_texts)

  # Retry until 'phrase":' appears exactly three times, giving up after 5 attempts
  # (the original condition referenced an undefined colon_count and could loop forever)
  attempts = 1
  while phrase_count != 3 and attempts < 5:
    generated_texts = generate_texts(prompt)
    phrase_count = sum(element.count('phrase":') for element in generated_texts)
    attempts += 1

  print(generated_texts[0])
  final_alignments.append(generated_texts[0])
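
The refined prompts ask for lines of the form `- Spanish --> English --> Greek` (or Hebrew). A small parser can turn that text into structured rows. This is a sketch that assumes the model follows the requested format. `parse_token_alignment` is a hypothetical helper, and the sample text is the example from the Greek prompt above.

```python
def parse_token_alignment(text):
    """Parse lines of the form '- a --> b --> c' into (a, b, c) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("-") or "-->" not in line:
            continue  # skip commentary, blank lines, and code fences
        # Drop the leading bullet, then split on the arrows
        parts = [part.strip() for part in line[1:].split("-->")]
        if len(parts) != 3:
            continue  # not a three-way alignment row
        if parts[0].startswith("Spanish token"):
            continue  # header row echoed from the prompt
        rows.append(tuple(parts))
    return rows


sample = (
    "- Spanish token(s)\t\t-->\tEnglish token(s)\t\t-->\tGreek token(s)\n"
    "- Después de\t\t-->\tAfter now\t\t-->\tΜετὰ δὲ\n"
    "- esto\t\t-->\tthese things\t\t-->\tταῦτα"
)
for row in parse_token_alignment(sample):
    print(row)
```

Lines that do not match the expected shape are skipped rather than raising, so an empty result can serve as a signal to retry the request.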

In [None]:
for i in final_alignments:
    print(i)
    print('-----')

# Write valid JSON (the manual loop left a trailing comma); UTF-8 keeps the Greek/Hebrew text intact
with open('C:/Users/natha/Downloads/refined_dynamic_posalign.txt', 'w', encoding='utf-8') as file:
    json.dump(final_alignments, file, ensure_ascii=False, indent=4)
