# Prompt tuning for translating English to Mambai

To translate an English sentence into Mambai:
1. Find the closest corpus sentences using LASER embeddings
2. Find dictionary entries for the words in the sentence
3. Construct a prompt from a mix of example sentences and dictionary entries

TODO:
* Clean up Mambai corpus
  * Some dict entries are missing because extraction relies on font weight, which is not always OCRed correctly
  * Other entries need to be separated (e.g. "sit; live")
  * Some sentences are poorly aligned

* Get similar sentences based on syntactic similarity, instead of `get_sentences_starting_with_same_words`
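
One possible direction for the syntactic-similarity TODO, as a rough sketch only: compare part-of-speech tag sequences with `difflib.SequenceMatcher` and keep the best-scoring corpus sentences. The helper names `pos_similarity` and `rank_by_syntax` are hypothetical, and the tags below are written by hand for illustration; with spaCy they would presumably come from `[token.pos_ for token in nlp(sentence)]`.

```python
from difflib import SequenceMatcher

def pos_similarity(tags_a, tags_b):
    """Return a 0..1 similarity ratio between two POS-tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b, autojunk=False).ratio()

def rank_by_syntax(query_tags, candidates, k=5):
    """candidates: list of (sentence, tags) pairs. Return the k most similar sentences."""
    scored = sorted(candidates, key=lambda c: pos_similarity(query_tags, c[1]), reverse=True)
    return [sentence for sentence, _ in scored[:k]]

# Hand-written tags (an assumption for illustration, not spaCy output).
query = ["PRON", "AUX", "VERB", "ADV"]
corpus = [
    ("We will sit together.", ["PRON", "AUX", "VERB", "ADV"]),
    ("Do you have milk?", ["AUX", "PRON", "VERB", "NOUN"]),
]
print(rank_by_syntax(query, corpus, k=1))
```

Tag-sequence matching ignores the words themselves, so it would complement the LASER neighbours rather than replace them.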

In [None]:
!pip install laser_encoders
!python -m spacy download en_core_web_sm
# !wget -O mambai_parallel_data.csv "https://docs.google.com/spreadsheets/d/1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss/export?format=csv&id=1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss&gid=1811721104"

### Get Mambai corpus, split between sentences and dict entries

In [3]:
import csv
import random

# newline='' is what the csv module recommends; utf-8 keeps characters such as ô and é intact
with open('mambai_parallel_data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    data = list(reader)

print(f"Total of {len(data)} rows in the dataset.")

# Spot check: not displayed, since a notebook only shows the value of a cell's last expression
random.sample(data, 5)

# for now, keep only rows where Mambai and English are defined
data = [row for row in data if row['Mambai (mgm)'] and row['English (eng)']]
print(f"Total of {len(data)} rows where both Mambai and English are defined.")

Total of 1792 rows in the dataset.
Total of 1681 rows where both Mambai and English are defined.


In [4]:
import json

with open("eng_mgm.json") as f:
    dict_entries = json.load(f)
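
The TODO about combined entries such as "sit; live" could be handled right after loading. The sketch below assumes each row is a dict with `entry` and `definition` keys (the keys used later in this notebook) and that combined headwords sit in the `entry` field separated by `;`, which has not been verified against `eng_mgm.json`. `split_entries` is a hypothetical helper, and the definition in the sample row is a placeholder, not real dictionary data.

```python
def split_entries(entries, sep=";"):
    """Expand rows whose 'entry' lists several headwords into one row per headword."""
    out = []
    for row in entries:
        for headword in row["entry"].split(sep):
            headword = headword.strip()
            if headword:
                # Copy the row so every headword keeps the same definition
                out.append({**row, "entry": headword})
    return out

# Placeholder definition, for illustration only
sample = [{"entry": "sit; live", "definition": "<mambai word>"}]
print(split_entries(sample))
```

With this in place, the exact-match lookup on `row['entry']` would be able to find both "sit" and "live".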

### Get LASER encoder, encode English sentences from Mambai corpus

In [5]:
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")

embeddings = encoder.encode_sentences([row['English (eng)'] for row in data])

100%|██████████| 1.01M/1.01M [00:00<00:00, 17.4MB/s]
100%|██████████| 179M/179M [00:01<00:00, 93.1MB/s]
100%|██████████| 470k/470k [00:00<00:00, 9.12MB/s]


### Construct prompt

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

import spacy
nlp = spacy.load("en_core_web_sm")


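def get_closest_sentences(sentence, k=5):
    """Return the k corpus rows whose English side is closest to `sentence` in LASER space."""
    embedded_input = encoder.encode_sentences([sentence])
    # argsort is ascending, so take the last k indices and reverse them (most similar first)
    closest_indices = cosine_similarity(embedded_input, embeddings)[0].argsort()[-k:][::-1]
    return [data[i] for i in closest_indices]

def get_sentences_starting_with_same_words(sentence):
    """Yield corpus rows whose English side starts with the same first two words."""
    first_two_words = " ".join(sentence.split()[:2])
    for row in data:
        if row['English (eng)'].startswith(first_two_words):
            yield row

def get_relevant_dict_entries(sent):
    """Yield dictionary rows whose headword equals the lemma of a token in `sent`."""
    doc = nlp(sent)
    lemmas = [token.lemma_ for token in doc]
    for lemma in lemmas:
        for row in dict_entries:
            if row['entry'] == lemma:
                yield row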
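
To make the nearest-neighbour step in `get_closest_sentences` concrete, here is a plain-Python restatement of the same idea: score every stored vector by cosine similarity against the query and keep the indices of the best `k`. The toy 2-D vectors stand in for LASER embeddings, and `cosine` and `top_k_indices` are hypothetical names used only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_indices(query, vectors, k=5):
    """Indices of the k vectors most similar to `query`, most similar first."""
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]

# Toy vectors, not real embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k_indices([1.0, 0.0], toy_vectors, k=2))
```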

In [14]:
prompt = """You are a translator for the Mambai language, originally from Timor-Leste.

# Example sentences
{sentences_str}

# Dictionary entries
{dict_str}

English: {input}
Reasoning:"""

def get_sentences_str(rows):
    out = ''
    for row in rows:
        out += f"English: {row['English (eng)']}\n"
        out += f"Mambai: {row['Mambai (mgm)']}\n"
        out += "\n"
    return out

def get_dict_str(dict_entries):
    out = ""
    for row in dict_entries:
        out += f"English: {row['entry']}\n"
        out += f"Mambai: {row['definition']}\n"
        out += "\n"
    return out

def get_prompt(input):
    sentences = get_closest_sentences(input)
    more_sentences = list(get_sentences_starting_with_same_words(input))
    sentences.extend(more_sentences[:5])
    dict_entries = list(get_relevant_dict_entries(input))
    return prompt.format(
        sentences_str=get_sentences_str(sentences),
        dict_str=get_dict_str(dict_entries),
        input=input
    )
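
Because `get_prompt` concatenates two retrieval results, the same row can appear twice in the example section when both methods return it. A minimal order-preserving de-duplication sketch follows. `dedupe_rows` is a hypothetical helper, and keying on the English column is an assumption about what counts as a duplicate.

```python
def dedupe_rows(rows, key="English (eng)"):
    """Drop rows whose `key` value was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

rows = [
    {"English (eng)": "We will sit together.", "Mambai (mgm)": "It hei medei put."},
    {"English (eng)": "We will be with God.", "Mambai (mgm)": "It hei mori futu nor Maromak."},
    {"English (eng)": "We will sit together.", "Mambai (mgm)": "It hei medei put."},
]
print(len(dedupe_rows(rows)))
```

It could be applied to `sentences` just before `get_sentences_str` is called, which would also free prompt space for more distinct examples.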

In [15]:
print(get_prompt("We will be sitting there having coffee"))

# ChatGPT gives: It hei medei lala mua kafé.

You are a translator for the Mambai language, originally from Timor-Leste.

# Example sentences
English: \Ne would like to stop somewhere for lunch (for coffee).
Mambai: Am hakarak deskansa nei hati kid ôd mua meiudia (ôd ên kafé).

English: Please bring something sweet to eat with the coffee.
Mambai: Favór id ôd nam midar seri ma mua nor kafé.

English: Do you have milk for the tea (coffee)?
Mambai: Nei sus-era paraôd sur put nor xa (kafé) rai?

English: We will sit together.
Mambai: It hei medei put.

English: I will give you a sedative.
Mambai: Au hei né ai-moruk boe ni.

English: We will be happy forever.
Mambai: It hei kontente la man hati.

English: We will be with God.
Mambai: It hei mori futu nor Maromak.

English: We will sit together.
Mambai: It hei medei put.

English: We will come back in one hour.
Mambai: Am ma sul nei oras id ni lala.

English: We will have to spend the night in this village.
Mambai: It tenke boe nei knua rai.



# Dictionary entries
English: we
Mambai: (