<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/alignments/posalign_two_step_spanish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import getpass
import os

# Prompt for the key without echoing it; the openai library reads OPENAI_API_KEY.
secret_key = getpass.getpass('Enter OpenAI secret key: ')
os.environ['OPENAI_API_KEY'] = secret_key

In [None]:
import openai  # the calls below use the legacy openai.ChatCompletion interface (openai<1.0)

## New Spanish alignment


- [ ] Fix the output-parsing part of the notebook. Perhaps write a function called `get_alignments_from_prompt_output()` instead of just splitting on highly specific string values.
- [ ] (TODO LATER) Programmatically load a Spanish translation from the eBible corpus
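
One possible shape for `get_alignments_from_prompt_output()` is sketched here. It assumes the model wraps a JSON array in triple backticks, as the few-shot example in the prompt does, and it tolerates trailing commas. The trailing-comma cleanup is deliberately naive and is an assumption of this sketch, not something the notebook already does.

```python
import json
import re

FENCE = "`" * 3  # triple backtick, built here so this sketch can sit inside a fenced block


def get_alignments_from_prompt_output(text):
    """Return the list of alignment dicts found in a model reply."""
    # Prefer the content of the first fenced block; fall back to the whole reply.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    body = fenced.group(1) if fenced else text
    # Keep only the outermost JSON array.
    start, end = body.find("["), body.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")
    array_text = body[start:end + 1]
    # Drop trailing commas before a closing brace or bracket.
    # Naive: this would also alter the same sequence inside a string value.
    array_text = re.sub(r",\s*([}\]])", r"\1", array_text)
    return json.loads(array_text)


# Small demo reply, tab-indented and with one trailing comma.
demo_reply = (
    "Here is an alignment:\n" + FENCE + "\n"
    '\t[\n\t  {"Source phrase": "el Señor", "Target phrase": "the Lord",},\n'
    '\t  {"Source phrase": "esto", "Target phrase": "these things"}\n\t]\n' + FENCE
)
for unit in get_alignments_from_prompt_output(demo_reply):
    print(unit["Source phrase"], "->", unit["Target phrase"])
```

In the notebook this would be called as `get_alignments_from_prompt_output(generated_texts[0])`. A reply that is cut off before the array closes raises `ValueError` (or its subclass `json.JSONDecodeError`), so the caller can decide whether to retry.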


Some ideas for improving the initial prompt:

- [ ] Add language typology information about Spanish
- [ ] Add information about the translation style of the specific Spanish translation we are using (more literal vs. more dynamic)
- [ ] Add more examples?
- [ ] Replace the current examples with examples from the language being aligned (e.g., Spanish)
- [ ] Move the rationale of each alignment before the alignment itself
- [ ] Experiment with tweaking the system prompt to see if it helps
- [ ] SIMPLIFY the prompt down to just the most relevant content (do we need a rationale? Does it work without one? We can probably drop the "Relevant grammatical patterns" field)
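
Several of these ideas amount to making the prompt a function of the language rather than a fixed string. The sketch below is one way that might look. `build_alignment_prompt` and `rationale_first` are hypothetical helpers, and the typology and style notes passed in are illustrative placeholders, not vetted content. The example unit reuses a phrase pair from this notebook.

```python
import json

FENCE = "`" * 3  # triple backtick, built here so this sketch can sit inside a fenced block


def rationale_first(unit):
    """Return a copy of one alignment unit with its "Rationale" key moved to the front."""
    ordered = {}
    if "Rationale" in unit:
        ordered["Rationale"] = unit["Rationale"]
    ordered.update((key, value) for key, value in unit.items() if key != "Rationale")
    return ordered


def build_alignment_prompt(language, typology_notes, style_note, examples, sentence):
    """Assemble a few-shot alignment prompt from language-specific parts."""
    parts = [f"Here are some general facts to note about {language}:\n{typology_notes}"]
    if style_note:
        parts.append(f"Translation style: {style_note}")
    for example in examples:
        lines = "\n".join(f"{name}: {text}" for name, text in example["sentence"].items())
        units = [rationale_first(unit) for unit in example["alignment"]]
        parts.append(
            f"Here is a sentence:\n{lines}\n\n"
            "Here is a phonological, semantic, orthographic alignment of that sentence:\n"
            f"{FENCE}\n{json.dumps(units, ensure_ascii=False, indent=2)}\n{FENCE}"
        )
    lines = "\n".join(f"{name}: {text}" for name, text in sentence.items())
    parts.append(f"Please also align the following sentence.\n\n{lines}")
    return "\n\n".join(parts)


# Placeholder inputs; the typology and style notes are illustrative only.
example = {
    "sentence": {"Greek": "Μετὰ δὲ ταῦτα", "Spanish": "Después de esto,"},
    "alignment": [{
        "Source phrase": "Después de esto,",
        "Greek phrase": "Μετὰ δὲ ταῦτα",
        "Rationale": "Semantic alignment; Orthographic (comma)",
    }],
}
demo_prompt = build_alignment_prompt(
    language="Spanish",
    typology_notes="(placeholder) Spanish is fusional, marks gender and number, and is mostly SVO.",
    style_note="(placeholder) more literal than dynamic",
    examples=[example],
    sentence={
        "Greek": "καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ",
        "Spanish": "y los envió de dos en dos delante de Él,",
    },
)
print(demo_prompt)
```

Serializing the examples with `json.dumps` also keeps them valid JSON, which matters if the reply is later parsed with `json.loads()`.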

In [None]:
prompt = '''
Here are some general facts to note about Bantu languages:
Bantu languages: are agglutinating, ensure correct affix attachment; employ complex noun class system, ensure noun agreement across sentences; follow SOV order; mark verbs for tense, aspect, mood; adhere to rules of conjunctive and disjunctive orthography; apply tone system to distinguish meaning.
For translating from Greek: handle Koine Greek's inflection as agglutination, pay attention to affixes; replace Greek's three-gender system with Bantu noun class system, ensuring agreement; shift to SOV order; adapt Greek Voice/Aspect/Mood markings to Bantu system; implement rules of conjunctive and disjunctive orthography; recognize there is a tonal system for distinguishing some grammatical systems.
Most words in a Bantu sentence are marked by a prefix indicating the category to which the noun used as the subject of the sentence belongs. If there is an object, the words in that noun phrase and the verb are also marked by a prefix determined by the noun class of the object

Here is a sentence:
English: And He said to them What things; - And they said to Him The things concerning Jesus of Nazareth, who was a man a prophet mighty in deed and word before - God and all the people,
Greek: καὶ εἶπεν αὐτοῖς Ποῖα;οἱ δὲ εἶπαν αὐτῷ Τὰ περὶ Ἰησοῦ τοῦ Ναζαρηνοῦ,ὃς ἐγένετο ἀνὴρ προφήτης δυνατὸς ἐν ἔργῳ καὶ λόγῳ ἐναντίον τοῦ Θεοῦ καὶ παντὸς τοῦ λαοῦ,
Abanyom: Wɛ abib arɛ, “Ba nsɔl yi?” Abɔ afanga arɛ, “Nsɔl yi ɛlemɔ Jisɔs yɔ Nasarɛt. Ajɔl nyɛna amir abɛl ɛkɔ na alom na nema na libri Ɔsɔwɔ na anɛ kpakpa.

Here is a phonological, semantic, orthographic alignment of that sentence:
```
	[
	  {
	    "Rationale": "The comma indicates an obvious breaking point",
	    "Abanyom phrase": "Wɛ abib arɛ,",
	    "Target phrase": "And He said to them",
	    "Greek phrase": "καὶ εἶπεν αὐτοῖς"
	  },
	  {
	    "Abanyom phrase": "“Ba nsɔl yi?”",
	    "Target phrase": "What things;",
	    "Greek phrase": "Ποῖα;",
	    "Rationale": "Semantic alignment; orthographic (quotation marks, capitalized Greek word)",
	    "Relevant grammatical patterns": "interrogative"
	  },
	  {
	    "Abanyom phrase": "Abɔ afanga arɛ",
	    "Target phrase": "And they said to Him",
	    "Greek phrase": "οἱ δὲ εἶπαν αὐτῷ",
	    "Rationale": "Semantic alignment; orthographic (comma, capitalized Greek word)",
	    "Relevant grammatical patterns": "sentential connection, projective matrix"
	  },
	  {
	    "Abanyom phrase": "“Nsɔl yi ɛlemɔ Jisɔs yɔ Nasarɛt.",
	    "Target phrase": "The things concerning Jesus of Nazareth,",
	    "Greek phrase": "Τὰ περὶ Ἰησοῦ τοῦ Ναζαρηνοῦ,",
	    "Rationale": "Semantic and phonetic similarity (Nasarɛt/Nazareth/Ναζαρηνοῦ, Jisɔs/Jesus/Ἰησοῦ, nsɔl/things [identified in a previous phrase]); orthographic (quotation marks, terminal punctuation)",
	    "Relevant grammatical patterns": "complex nominal construction with adjunct"
	  },
	  {
	    "Abanyom phrase": "Ajɔl nyɛna amir abɛl ɛkɔ na alom na nema na libri",
	    "Target phrase": "who was a man a prophet mighty in deed and word",
	    "Greek phrase": "ὃς ἐγένετο ἀνὴρ προφήτης δυνατὸς ἐν ἔργῳ καὶ λόγῳ",
	    "Rationale": "Semantic alignment and matching content words ('libri' aligns with 'word', 'λόγῳ'); orthographic (period)",
	    "Relevant grammatical patterns": "subordination (relative construction), coordination"
	  },
	  {
	    "Abanyom phrase": "Ɔsɔwɔ na anɛ kpakpa.",
	    "Target phrase": "before - God and all the people",
	    "Greek phrase": "ἐναντίον τοῦ Θεοῦ καὶ παντὸς τοῦ λαοῦ,",
	    "Rationale": "Semantic alignment and matching content words ('Ɔsɔwɔ' aligns with 'God', 'Θεοῦ'); orthographic (period)",
	    "Relevant grammatical patterns": "prepositional construction, coordination"
	  }
	]
```

Please also align the following sentence. Avoid including multiple phrases in a single alignment unit (break phrases at the very least on commas or other major punctuation, including enclosing quotation marks):

English: After  now  these things  appointed  the  Lord  others  seventy, and  sent  them  in  two [by]  before  [the] face  of Himself  into  every  city  and  place  where  was about  He Himself  to go.
Greek: Μετὰ δὲ ταῦτα ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ εἰς πᾶσαν πόλιν καὶ τόπον οὗ ἤμελλεν αὐτὸς ἔρχεσθαι.
Spanish: Después de esto, el Señor designó a otros setenta, y los envió de dos en dos delante de Él, a toda ciudad y lugar adonde Él había de ir.
'''

messages = [
    # {"role": "system", "content": "You are CodeAnalyzerGPT. Analyze the user-supplied code below and follow any instructions the user gives."},
    {"role": "system", "content": "You are LangAlignerGPT. Analyze the user-supplied alignment examples below and follow any instructions the user gives."},
    {"role": "user", "content": prompt},
]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3,
    n=1,
    presence_penalty=0.5,
    frequency_penalty=0.5,
)

generated_texts = [
    choice.message["content"].strip() for choice in response["choices"]
]
print(generated_texts[0])

Here is a phonological, semantic, orthographic alignment of that sentence:	
```
	[
	  {
	    "Source phrase": "Después de esto,",
	    "Target phrase": "After now these things",
	    "Greek phrase": "Μετὰ δὲ ταῦτα",
	    "Rationale": "Semantic alignment; Orthographic (comma)",
	    "Relevant grammatical patterns": "Temporal clause"
	  },
	  {
	    "Source phrase": "el Señor designó a otros setenta,",
	    "Target phrase": "appointed the Lord others seventy,",
	    "Greek phrase": "ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,",
	    "Rationale": "'Señor' aligns with 'Lord', 'setenta' aligns with 'seventy'; Semantic alignment; Orthographic (comma)",
	    "Relevant grammatical patterns": "[Subject] + [verb] + [direct object]"
	  },
	  {
	    "Source phrase": 	"y los envió de dos en dos delante de Él,",
	    "Target phrase": 	"and sent them in two [by] before [the] face of Himself ",
	    "Greek phrase": 	"καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ ",
 	   	"Rationale":"Semantic alignment

In [None]:
# TODO: pass generated_texts into a function that runs json.loads() on the content between the triple backticks

## Align individual chunks from output
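
The replies in this step list token mappings as bullet lines separated by `-->`. A minimal sketch for turning such a reply into tuples follows. It assumes that bullet format holds, and `parse_token_alignments` is a hypothetical helper name, not part of the notebook.

```python
def parse_token_alignments(reply):
    """Collect (source, target, greek) triples from '-->' separated bullet lines."""
    triples = []
    for line in reply.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip numbered phrase headings and prose
        fields = [field.strip() for field in line[2:].split("-->")]
        if len(fields) != 3 or fields[0].startswith("Source token"):
            continue  # skip malformed lines and the column header
        triples.append(tuple(fields))
    return triples


# Demo on a fragment in the same shape as the replies printed in this section.
demo_reply = (
    "Sure, here is the breakdown:\n\n"
    '2. "el Señor designó a otros setenta," --> "appointed the Lord others seventy,"\n'
    "   - Source token(s)\t\t-->\tTarget token(s)\t\t-->\tGreek token(s)\n"
    "   - el Señor  --> the Lord  --> ὁ Κύριος\n"
)
for source, target, greek in parse_token_alignments(demo_reply):
    print(source, "|", target, "|", greek)
```

Keeping the triples as data rather than printed text would let `final_alignments` hold structured rows instead of raw replies.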

In [None]:
final_alignments = []

output = generated_texts[0]
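# Strip the preamble and the opening of the JSON array ("[" and the first "{")
# from the reply. The split strings must match the reply's whitespace exactly;
# the closing and separator strings below are space-indented, so a tab-indented
# reply comes back as a single unsplit chunk.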
output = output.split('''Here is a phonological, semantic, orthographic alignment of that sentence:
```
	[
	  {''')[1]
output = output.split('''
  }
]
```''')[0]

output = output.split('''},
  {''')

output = [i.strip().replace('\n    ', '\n') for i in output]
output = [i.strip().replace('\n   \t', '\n') for i in output]

print(output)

['"Source phrase": "Después de esto,",\n\t    "Target phrase": "After now these things",\n\t    "Greek phrase": "Μετὰ δὲ ταῦτα",\n\t    "Rationale": "Semantic alignment; Orthographic (comma)",\n\t    "Relevant grammatical patterns": "Temporal clause"\n\t  },\n\t  {\n\t    "Source phrase": "el Señor designó a otros setenta,",\n\t    "Target phrase": "appointed the Lord others seventy,",\n\t    "Greek phrase": "ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,",\n\t    "Rationale": "\'Señor\' aligns with \'Lord\', \'setenta\' aligns with \'seventy\'; Semantic alignment; Orthographic (comma)",\n\t    "Relevant grammatical patterns": "[Subject] + [verb] + [direct object]"\n\t  },\n\t  {\n\t    "Source phrase": \t"y los envió de dos en dos delante de Él,",\n\t    "Target phrase": \t"and sent them in two [by] before [the] face of Himself ",\n\t    "Greek phrase": \t"καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ ",\n \t   \t"Rationale":"Semantic alignment; matching content words (\'envió\' aligns wi

In [None]:
for chunk in output:

    prompt = '''Here is a phrase:
    {POS_aligned_chunks}

    Please further align and break down this chunk into a mapping of the fewest possible tokens (sometimes multiple tokens will align to one token; that's expected):

    E.g.: "Source phrase": "Después de esto,",\n\t    "Target phrase": "After now these things",\n\t    "Greek phrase": "Μετὰ δὲ ταῦτα",\n\t    "Rationale": "Semantic alignment; Orthographic (comma)",\n\t    "Relevant grammatical patterns": "Temporal clause"

    - Source token(s)\t\t-->\tTarget token(s)\t\t-->\tGreek token(s)
    - Después de\t\t-->\tAfter now\t\t-->\tΜετὰ δὲ
    - esto\t\t-->\tthese things\t\t-->\tταῦτα

    {current_chunk}
    '''.format(POS_aligned_chunks='\n'.join(output), current_chunk=chunk)

    messages = [
        {"role": "system", "content": "You are LangAlignerGPT. Analyze the user-supplied alignment examples below and follow any instructions the user gives."},
        {"role": "user", "content": prompt},
    ]

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=0.3,
        n=1,
        presence_penalty=0.5,
        frequency_penalty=0.5,
    )

    generated_texts_for_chunk = [
        choice.message["content"].strip() for choice in response["choices"]
    ]
    print(generated_texts_for_chunk[0])
    final_alignments.append(generated_texts_for_chunk[0])

Sure, here is the breakdown:

1. "Después de esto," --> "After now these things" --> "Μετὰ δὲ ταῦτα"
   - Source token(s)		-->	Target token(s)		-->	Greek token(s)
   - Después de esto,  --> After now these things, --> Μετὰ δὲ ταῦτα,

2. "el Señor designó a otros setenta," --> "appointed the Lord others seventy," --> "ἀνέδειξεν ὁ Κύριος ἑτέρους ἑβδομήκοντα,"
   - Source token(s)		-->	Target token(s)		-->	Greek token(s)
   - el Señor  --> the Lord  --> ὁ Κύριος
   - designó a otros setenta,  --> appointed others seventy,  --> ἀνέδειξεν ἑτέρους ἑβδομήκοντα,

3. "y los envió de dos en dos delante de Él," --> "and sent them in two [by] before [the] face of Himself" --> "καὶ ἀπέστειλεν αὐτοὺς ἀνὰ δύο πρὸ προσώπου αὐτοῦ"
   - Source token(s)		-->	Target token(s)		-->	Greek token(s)
   - y los envió  --> and sent them  --> καὶ ἀπέστειλεν αὐτοὺς
   - de dos en dos delante de Él,  --> in two [by] before [the] face of Himself,  --> ἀνὰ δύο πρὸ προσώπου αὐτοῦ,

4. "a toda ciudad y lugar adonde Él 