# Transliteration via Large Language Models (LLMs) 
---
This notebook transliterates English text with large language models (LLMs), specifically OpenAI's GPT models. Running it requires access to the OpenAI API; visit [OpenAI's website](https://openai.com/index/openai-api/) to obtain API access and the required quota. Once you have your credentials, set your API key (`api_key`) and API base URL (`api_base`) in the following cell.
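
As an alternative to pasting credentials into the notebook, they can be read from the environment. The following is a minimal sketch, not part of the project: `load_credentials` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `OPENAI_BASE_URL` are assumptions chosen for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Return (api_key, api_base) read from environment variables.

    OPENAI_API_KEY and OPENAI_BASE_URL are assumed variable names.
    api_base is None when the base URL variable is unset or empty.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the notebook.")
    api_base = env.get("OPENAI_BASE_URL") or None
    return api_key, api_base
```

Calling `api_key, api_base = load_credentials()` before the client is created would then define the two variables the next cell expects, without storing the key in the notebook file.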

---

In [None]:
#####################################################
################# API Key of OpenAI #################
#####################################################

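# Uncomment the two lines below and fill in your credentials;
# the OpenAI client created later in this cell needs both variables.
# api_key = "sk_..."
# api_base = ""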

#####################################################
#####################################################
#####################################################

import warnings
warnings.filterwarnings("ignore")

from phonemizer import phonemize
import pandas as pd
import collections
import numpy as np
from tqdm import tqdm
import glob
import os
from openai import OpenAI

import sys
sys.path.append('../pyfiles/')
from normalizer import EnglishTextNormalizer
from postprocessing import get_json_result, CheckResultValidity, PostprocessTransliteration, GetResult
from gpt import gpt_api_no_stream, GetLLMPrompt


normalizer = EnglishTextNormalizer()
client = OpenAI(api_key=api_key, base_url=api_base)

#########################################
### Get the phonemes of "the" and "a" ###
#########################################

adds = {
    "zhi": ["the", ["ðɪ"]],
    "za": ["the pineapple", ["ðə", "pˈaɪnæpəl"]],
    "ah": ["a little awkward", ["ɐ","lˈɪɾəl","ˈɔːkwɚd"]],
}
postprocessing = {a: {} for a in adds}
for addname in adds:
    sentence, phonemized = adds[addname]
    for language in ["Hindi", "Korean", "Japanese"]:
        filelists = glob.glob(f"./responses_the_a/{language}/postprocessing_{addname}_*.npy")
        a_list = [np.load(path, allow_pickle=True).item() for path in filelists]
        dirs = []
        for a in a_list:
            a = {key: a[key] for key in sentence.split()}
            dirs += [a]
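        # Weight each candidate by rank: the j-th entry of "similarity order"
        # is repeated (3 - j) times, so higher-ranked candidates count more.
        for i in range(len(dirs)):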
            for key in dirs[i]:
                newlist = []
                for j in range(len(dirs[i][key]["similarity order"])):
                    newlist += [dirs[i][key]["similarity order"][j]]*(3-j)
                dirs[i][key]["similarity order"] = newlist
        data = {key: [element for i in range(len(dirs)) for element in dirs[i][key]["similarity order"]] for key in dirs[0]}
        # For each word, pick the candidate with the highest weighted count
        arrays = []
        counts = []
        for word in sentence.split():
            c = collections.Counter(data[word])
            df = pd.DataFrame(c.items(), columns=["phonemes", "count"]).sort_values("count", ascending=False).values
            arrays += [df[0,0]]
            counts += [list(df[:1,1])]
            
        # Keep only the result for the first word of the sentence ("the" or "a")
        postprocessing[addname][language] = arrays[0]

---
# Trial of Transliteration via LLMs
---

In this example, we will transliterate an English sentence using a GPT model. Please adjust the following variables:

- `sentence`: A string containing the English sentence you wish to transliterate.
- `language`: A string specifying the target language. Supported options are "Hindi", "Korean", and "Japanese".
- `gptmodel`: A string indicating which GPT model to use. Available options include "gpt-3.5", "gpt-4omini", "gpt-4o", and "gpt-o1mini". You can add or modify the list of released models by editing the file `MacST-project-page/sho_util/pyfiles/gpt.py`.

The cell below requests a single response from the GPT model and transliterates the sentence using these variables.
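
The options listed above can be checked before any API call is made. The following is a minimal sketch: `check_settings`, `SUPPORTED_LANGUAGES`, and `KNOWN_MODELS` are hypothetical names that do not exist in the project, and the model tuple simply mirrors the list above rather than the actual contents of `gpt.py`.

```python
# Values copied from the option lists above (assumed to be complete).
SUPPORTED_LANGUAGES = ("Hindi", "Korean", "Japanese")
KNOWN_MODELS = ("gpt-3.5", "gpt-4omini", "gpt-4o", "gpt-o1mini")


def check_settings(sentence, language, gptmodel):
    """Raise ValueError if a setting falls outside the documented options."""
    if not isinstance(sentence, str) or not sentence.strip():
        raise ValueError("sentence must be a non-empty string")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"language must be one of {SUPPORTED_LANGUAGES}, got {language!r}"
        )
    if gptmodel not in KNOWN_MODELS:
        raise ValueError(
            f"gptmodel must be one of {KNOWN_MODELS}, got {gptmodel!r}"
        )
    return True
```

Failing early on a misspelled language or model name avoids spending API quota on a request that cannot be post-processed. If models are added in `gpt.py`, the tuple in this sketch would need the same additions.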

---

In [None]:
###########################################
########## Adjustable Parameters ##########
###########################################

sentence = "Transliterate English text into Hindi text."
language = "Hindi"
gptmodel = "gpt-4omini"

###########################################
###########################################
###########################################

inputtext = normalizer(sentence)
prompt = GetLLMPrompt(inputtext, language)
result = GetResult(client, prompt, gptmodel, inputtext, normalizer, display_print=True)
transliterated = PostprocessTransliteration(sentence, [result], normalizer, adds, postprocessing)

print("English       :", sentence)
print("Normalized    :", inputtext)
print("Transliterated:", transliterated)
print("\n----------------------------------------\n----------------------------------------\n----------------------------------------\n")
print("PROMPT:\n")
print(prompt)
print("\n----------------------------------------\n----------------------------------------\n----------------------------------------\n")
print("Response:\n")
print(result)  # `response` was never defined; print the value returned by GetResult instead

---
# Transliterate Multiple Texts
---

In this example, we will transliterate multiple English sentences using a GPT model. To improve the reliability of the results, the code generates several transliteration responses for each sentence. Adjust the following variables as needed:

- `sentence_list`: A dictionary where each key is a text name and the corresponding value is the English sentence you want to transliterate.
- `language`: A string specifying the target language for transliteration. The supported options are "Hindi", "Korean", and "Japanese".
- `gptmodel`: A string that indicates which GPT model to use. The available options include "gpt-3.5", "gpt-4omini", "gpt-4o", and "gpt-o1mini". You can add or modify the list of models by editing the file `MacST-project-page/sho_util/pyfiles/gpt.py`.
- `savedir`: A string that specifies the directory where all transliteration responses will be saved.
- `repeatnum`: An integer that sets the number of responses (transliterations) to generate for each sentence.
- `reset_response`: A boolean that determines whether to re-generate the transliteration responses, even if previous responses exist in `savedir`.
- `transliterated_results`: The output of the cell, not a setting. A dictionary where each key is a text name and the corresponding value is the transliterated text.
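
To illustrate why several responses per sentence can make the result more reliable, the sketch below combines ranked candidates from repeated responses by rank-weighted voting. It mirrors the `(3 - j)` weighting that the setup cell applies to the "similarity order" lists for "the" and "a"; whether `PostprocessTransliteration` aggregates responses in the same way is an assumption. `rank_weighted_vote` is a hypothetical helper, and the candidate strings are placeholders.

```python
import collections


def rank_weighted_vote(ranked_lists, top_k=3):
    """Return the candidate with the highest rank-weighted score.

    ranked_lists: one best-first list of candidates per response.
    A candidate at 0-based rank j scores (top_k - j) points;
    candidates at rank top_k or later are ignored.
    """
    scores = collections.Counter()
    for ranked in ranked_lists:
        for j, candidate in enumerate(ranked[:top_k]):
            scores[candidate] += top_k - j
    if not scores:
        raise ValueError("no candidates to vote on")
    return scores.most_common(1)[0][0]


# Three hypothetical responses, each ranking placeholder candidates.
responses = [["a", "b"], ["b", "a"], ["b", "c"]]
print(rank_weighted_vote(responses))  # b (scores: a=5, b=8, c=2)
```

With a single response the vote reduces to that response's top candidate, which is why raising `repeatnum` can only help when the model's answers vary between calls.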

---

In [None]:
###########################################
########## Adjustable Parameters ##########
###########################################

sentence_list = {
    "text1": "ICASSP in India.",
    "text2": "I'm Sho Inoue.",
}
language = "Hindi"
gptmodel = "gpt-4omini"
savedir = f"./responses_{language}_{gptmodel}/"
repeatnum = 3 # Increase this number for more reliable transliteration
reset_response = False

###########################################
###########################################
###########################################

# Save the valid responses
for key in sentence_list:
    print(key)
    exist_length = len(glob.glob(savedir+f"{key}_*.npy"))
    if not(reset_response) and exist_length>=repeatnum:
        continue
    sentence = sentence_list[key]
    inputtext = normalizer(sentence)
    prompt = GetLLMPrompt(inputtext, language)
    
    for r in tqdm(range(repeatnum)):
        savepath = savedir + f"{key}_{r}.npy"
        if not(reset_response) and os.path.exists(savepath):
            continue
        result = GetResult(client, prompt, gptmodel, inputtext, normalizer, display_print=False)
        os.makedirs(os.path.dirname(savepath), exist_ok=True)
        np.save(savepath, result)

transliterated_results = {}
for key in sentence_list:
    files = sorted(glob.glob(savedir+f"{key}_*.npy"))  # sorted so the input order is deterministic
    transliterated = PostprocessTransliteration(sentence_list[key], [np.load(path, allow_pickle=True).item() for path in files], normalizer, adds, postprocessing)
    transliterated_results[key] = transliterated
    
for key in sentence_list:
    print("\n----------------------------------------\n----------------------------------------\n----------------------------------------\n")
    print("Key           :", key)
    print("English       :", sentence_list[key])
    print("Transliterated:", transliterated_results[key])