# Prompt tuning for translating English > Mambai

To translate an English sentence into Mambai:
1. Find the closest sentences in the corpus using LASER embeddings
2. Find dictionary entries for the words in the sentence
3. Construct a prompt that mixes the example sentences and the dictionary entries

TODO:
* Clean up Mambai corpus
  * Some sentences poorly aligned
  * Some dict entries need to be separated (e.g. "sit; live")
* Get similar sentences based on syntactic similarity, instead of `get_sentences_starting_with_same_words`
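
The dictionary clean-up item above could be handled by splitting multi-sense English glosses on `;`. A minimal sketch, assuming rows are dicts keyed by the same column names used later in this notebook; `split_multi_sense_entries` is a hypothetical helper and `<mambai word>` is a placeholder, not real data.

```python
def split_multi_sense_entries(rows, key="English (eng)", sep=";"):
    """Return a new list in which every row carries exactly one English gloss."""
    out = []
    for row in rows:
        # "sit; live" -> ["sit", "live"]; empty fragments are discarded
        glosses = [g.strip() for g in row[key].split(sep) if g.strip()]
        for gloss in glosses:
            new_row = dict(row)  # copy so the original row is left untouched
            new_row[key] = gloss
            out.append(new_row)
    return out


example_rows = [
    # Placeholder Mambai value: the real entry is not shown in this notebook
    {"English (eng)": "sit; live", "Mambai (mgm)": "<mambai word>"},
    {"English (eng)": "coffee", "Mambai (mgm)": "kafé, kapé"},
]

split_rows = split_multi_sense_entries(example_rows)
for row in split_rows:
    print(row["English (eng)"], "|", row["Mambai (mgm)"])
```

Only the English side is split here. Comma-separated Mambai variants such as `kafé, kapé` are left alone, since they look like spelling variants rather than separate senses.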

In [7]:
!pip install laser_encoders

Collecting laser_encoders
  Downloading laser_encoders-0.0.1-py3-none-any.whl (24 kB)
Collecting sacremoses==0.1.0 (from laser_encoders)
  Downloading sacremoses-0.1.0-py3-none-any.whl (895 kB)
Collecting unicategories>=0.1.2 (from laser_encoders)
  Downloading unicategories-0.1.2.tar.gz (12 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece>=0.1.99 (from laser_encoders)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting fairseq>=0.12.2 (from laser_encoders)
  Downloading fairseq-0.12.2.tar.gz (9.6 MB)
  Installing build depend

### Get Mambai corpus, split between sentences and dict entries

In [50]:
# The URL is quoted so the shell does not treat '&' as a background operator
!wget -O mambai_parallel_data.csv "https://docs.google.com/spreadsheets/d/1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss/export?format=csv&id=1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss&gid=1811721104"

--2024-01-29 03:36:02--  https://docs.google.com/spreadsheets/d/1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss/export?format=csv
Resolving docs.google.com (docs.google.com)... 142.250.97.102, 142.250.97.100, 142.250.97.139, ...
Connecting to docs.google.com (docs.google.com)|142.250.97.102|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://doc-04-94-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/djnt29o9njajsgmqa493p7fs4k/1706499360000/111605484374857807395/*/1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss?format=csv [following]
--2024-01-29 03:36:02--  https://doc-04-94-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/djnt29o9njajsgmqa493p7fs4k/1706499360000/111605484374857807395/*/1AtPC9JCq-2CWFnjYc-CRhS7WFRNtm-VPV7dcDENE2ss?format=csv
Resolving doc-04-94-sheets.googleusercontent.com (doc-04-94-sheets.googleusercontent.com)... 173.194.210.132, 2607:f8b0:400c:c0f::84
Connecting to doc-04-94-sheets.googleusercont

In [51]:
import csv

# Read as UTF-8 so accented Mambai characters (e.g. "á", "ô") survive on any platform
with open('mambai_parallel_data.csv', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    data = list(reader)

print(f"Total of {len(data)} rows in the dataset.")

# For now, keep only rows where both Mambai and English are defined
data = [row for row in data if row['Mambai (mgm)'] and row['English (eng)']]
print(f"Total of {len(data)} rows where both Mambai and English are defined.")

Total of 7309 rows in the dataset.
Total of 7192 rows where both Mambai and English are defined.


In [21]:

for i, row in enumerate(data):
    # dict starts where Mambai = "á" and English = "to eat"
    if row['Mambai (mgm)'] == "á" and row['English (eng)'] == "to eat":
        start_of_dict_index = i
        break

print(f"Dict starts at index {start_of_dict_index}")

dict_entries = data[start_of_dict_index:]

Dict starts at index 3054


### Get LASER encoder, encode English sentences from Mambai corpus

In [13]:
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")

embeddings = encoder.encode_sentences([row['English (eng)'] for row in data])
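
To make the retrieval step concrete: the encoded sentences are later ranked by cosine similarity against the encoded input. The toy sketch below uses hand-made 3-dimensional vectors in place of real LASER embeddings (which are much higher-dimensional); `cosine` and `top_k` are illustrative names, not part of `laser_encoders`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, vectors, k):
    """Indices of the k vectors most similar to `query`, best first."""
    scores = [cosine(query, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy stand-ins for sentence embeddings
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

print(top_k(query, toy_vectors, k=2))
```

The first two toy vectors point in nearly the same direction as the query, so they are ranked ahead of the two orthogonal ones.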

### Construct prompt

In [42]:
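from sklearn.metrics.pairwise import cosine_similarity

def get_closest_sentences(text, k=5):
    """Return the k corpus rows whose English side is closest to `text` in LASER space."""
    embedded_input = encoder.encode_sentences([text])
    similarities = cosine_similarity(embedded_input, embeddings)[0]
    # argsort is ascending: take the last k indices and reverse for best-first order
    closest_indices = similarities.argsort()[-k:][::-1]
    return [data[i] for i in closest_indices]

def get_sentences_starting_with_same_words(text):
    """Yield corpus rows whose English side starts with the first two words of `text`."""
    first_two_words = " ".join(text.split()[:2])
    for row in data:
        if row['English (eng)'].startswith(first_two_words):
            yield row


def get_relevant_dict_entries(text):
    """Yield the first dictionary row matching each word of `text` (exact, case-sensitive)."""
    for word in text.split():
        for row in dict_entries:
            if row['English (eng)'] == word:
                yield row
                break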
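
`get_relevant_dict_entries` compares each whitespace-separated token to the English gloss exactly, so a capitalised or punctuated token such as `coffee.` will not match. One possible refinement is sketched below, under the assumption that glosses are single words; `normalize_token` and `lookup_entries` are hypothetical names, and the toy `entries` list reuses only pairs printed elsewhere in this notebook.

```python
import string


def normalize_token(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def lookup_entries(sentence, entries, key="English (eng)"):
    """Yield one dictionary row per distinct word of `sentence`, in sentence order."""
    index = {}
    for row in entries:
        # setdefault keeps the first row seen for each gloss
        index.setdefault(row[key].lower(), row)
    seen = set()
    for token in sentence.split():
        word = normalize_token(token)
        if word in index and word not in seen:
            seen.add(word)
            yield index[word]


entries = [
    {"English (eng)": "will", "Mambai (mgm)": "hei"},
    {"English (eng)": "there", "Mambai (mgm)": "ran"},
    {"English (eng)": "coffee", "Mambai (mgm)": "kafé, kapé"},
]

matches = list(lookup_entries("We WILL have coffee, there.", entries))
for row in matches:
    print(row["English (eng)"], "|", row["Mambai (mgm)"])
```

Unlike the notebook's function, this sketch emits each entry at most once even if a word repeats in the input, and it builds a lookup table once instead of scanning the dictionary per word.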

In [45]:
prompt = """You are a translator for the Mambai language, originally from Timor-Leste.

Here are some example sentences, translated from English to Mambai:
{sentences_str}

Here are some dictionary entries:
{dict_str}

Translate the following sentence from English to Mambai:
{input}
"""

def get_sentences_str(rows):
    out = ''
    for row in rows:
        out += f"English: {row['English (eng)']}\n"
        out += f"Mambai: {row['Mambai (mgm)']}\n"
        out += "\n"
    return out

def get_dict_str(dict_entries):
    out = ""
    for row in dict_entries:
        out += f"English: {row['English (eng)']} | "
        out += f"Mambai: {row['Mambai (mgm)']}\n"
    return out

def get_prompt(input):
    sentences = get_closest_sentences(input)
    more_sentences = list(get_sentences_starting_with_same_words(input))
    sentences.extend(more_sentences[:5])
    # Local name chosen so it does not shadow the module-level `dict_entries`
    relevant_entries = list(get_relevant_dict_entries(input))
    return prompt.format(
        sentences_str=get_sentences_str(sentences),
        dict_str=get_dict_str(relevant_entries),
        input=input
    )
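
Because `get_prompt` concatenates two retrieval results, the same row can end up in the prompt twice. A small sketch of order-preserving de-duplication on the (English, Mambai) pair follows; `dedupe_rows` is a hypothetical helper, and the sample rows are copied from the corpus examples shown in this notebook.

```python
def dedupe_rows(rows, keys=("English (eng)", "Mambai (mgm)")):
    """Drop rows whose values for `keys` were already seen, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


rows = [
    {"English (eng)": "She will make coffee", "Mambai (mgm)": "Urá hei pun kafé"},
    {"English (eng)": "We will sit together.", "Mambai (mgm)": "It hei medei put."},
    {"English (eng)": "She will make coffee", "Mambai (mgm)": "Urá hei pun kafé"},
]

unique_rows = dedupe_rows(rows)
print(len(rows), "->", len(unique_rows))
```

Rows that share an English sentence but differ on the Mambai side are kept on purpose: they may be legitimate alternatives or alignment errors, and either case is better reviewed by hand than silently dropped.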

In [46]:
print(get_prompt("We will be sitting there having coffee"))

You are a translator for the Mambai language, originally from Timor-Leste.

Here are some example sentences, translated from English to Mambai:
English: She's going to make coffee
Mambai: Urá pun sôp kafé

English: She will make coffee
Mambai: Urá hei pun kafé

English: She's about to make coffee
Mambai: Urá pun tel sôp kafé

English: She makes coffee
Mambai: urá pun kafé

English: Please bring something sweet to eat with the coffee
Mambai: Favór id ôd nam midar seri ma mua nor kafé

English: We will be happy forever.
Mambai: It hei kontente la man hati.

English: We will be with God.
Mambai: It hei mori futu nor Maromak.

English: We will sit together.
Mambai: Favór id pei bal óleu nor er.

English: We will come back in one hour.
Mambai: Au hakarak meza kid lao ada hoda.

English: We will sit together.
Mambai: It hei medei put.



Here are some dictionary entries:
English: will | Mambai: hei
English: there | Mambai: ran
English: coffee | Mambai: kafé, kapé


Translate the following se