# V5

However, the output may contain many JSON syntax errors, such as:

`Expecting ',' delimiter: line 4 column 9 (char 65)`  
```json  
[     
    {         
        "correction": "atmosphere had vanished."         
        "zh-tw": "氣氛已消失殆盡。"   
    }
]
```
`Expecting ',' delimiter: line 4 column 9 (char 99)`
```json  
[
    {
        "correction": "“Yes, sir! Understood! I’ll begin preparations right away."
        "zh-tw": "好的，長官！了解！我馬上開始準備。"
    },
    {
        "correction": "And then...is it okay if we watch?”"
        "zh-tw": "然後...我們可以觀看嗎？"
    },
    {
        "correction": "“Sure, I don’t mind. I’m the only one who can hold it, so you should at least have a look.”"
        "zh-tw": "當然，我沒意見。只有我能拿著它，所以你至少應該看看。"
    }
]
```
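
Both messages come from Python's `json` module: when the comma between two members is missing, the decoder stops at the opening quote of the next key. Below is a minimal reproduction using only the standard library. The `broken` string is the first example above with trailing spaces trimmed, so the character offset will differ from the message shown.

```python
import json

# The first example above: no comma after the "correction" value.
broken = """[
    {
        "correction": "atmosphere had vanished."
        "zh-tw": "氣氛已消失殆盡。"
    }
]"""

error = None
try:
    json.loads(broken)
except json.JSONDecodeError as e:
    error = e
    # msg, lineno and colno are attributes of JSONDecodeError.
    print(e.msg, e.lineno, e.colno)
```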

## Gemini Pro API and OpenAI API
Check [Bilingual-NovelTranslator-v1.ipynb](Bilingual-NovelTranslator-v1.ipynb) for more information.

**Note**: Remember to fill in your API key in the `.env` file.

In [1]:
%load_ext autoreload
%autoreload 2
    
import utils

geminipro, openai = utils.init_llm()

So we have to correct these JSON syntax errors.
In my experience, you should not ask a single prompt to do too many different things at the same time,
say,  
```python
"""
You should split the sentence, detect the speaker of any dialogue, fix the typo and translate the article. 
You should also mark all character names to make a character list. 
If the text is a title, you should add markdown to highlight it...
...
"""
```

So I prefer to fix the syntax errors in a separate chain.

I also rewrote the split function.  
I believe a complete passage should end with `.\n`, `!\n`, and so on.


In [2]:
def split_paragraph(source_path):
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=0,
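        # Raw strings keep regex escapes such as \. and \s from being read as (invalid) Python string escapes.
        separators= [r'["\.!\?”][\s\t]?\n\n\n\n',
                     r'["\.!\?”][\s\t]?\n\n\n',
                     r'["\.!\?”][\s\t]?\n\n',
                     r'["\.!\?”][\s\t]?\n'
                    ],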
        # length_function=len,
        keep_separator=True,
        is_separator_regex=True,
    )
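    # Read the whole file as UTF-8 so the result does not depend on the platform default encoding.
    with open(source_path, 'r', encoding='utf-8') as f:
        content = f.read()

    return text_splitter.split_text(content)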
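
To see which line breaks the weakest separator accepts, here is a small sketch using only the standard `re` module. The `sample` text is made up, and the sketch only cuts at candidate boundaries. `RecursiveCharacterTextSplitter` also merges pieces up to `chunk_size`, which is not reproduced here.

```python
import re

# Same pattern as the last separator in split_paragraph:
# closing punctuation, an optional whitespace character, then a newline.
SENTENCE_END = re.compile(r'["\.!\?”][\s\t]?\n')

# Made-up OCR-like text: the second line is broken in the middle of a word.
sample = 'Alice was tired.\nShe sat by her sis-\nter on the bank!\n"Oh dear"\nshe said'

# Cut the text right after every accepted boundary.
pieces = []
start = 0
for match in SENTENCE_END.finditer(sample):
    pieces.append(sample[start:match.end()])
    start = match.end()
pieces.append(sample[start:])

for piece in pieces:
    print(repr(piece))
```

The line break after `sis-` is not a boundary, so the sentence that was split by mistake stays in one piece for the LLM to repair.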

Now I add functions to verify and fix the JSON.
To help the LLM debug, I pass the error message as well.

In [3]:
def verify_json(markdown_string):
    import re
    import json

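    # Use a regular expression to find every fragment that starts with ```json and ends with ```
    pattern = re.compile(r"```json(.*?)```", re.DOTALL)
    matches = pattern.findall(markdown_string)
    
    # Initialize an empty list to hold the merged data
    # merged_list = []
    
    # Parse each block and add it to the merged list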
    for match in matches:
        try:
            json_data = json.loads(match.strip())
            # merged_list.extend(json_data)
        except json.JSONDecodeError as e:
            print(e)
            print(f"Error decoding JSON in block: {match}")
            return str(e)
    return None
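
One thing to keep in mind: if the reply contains no fenced json block at all, `findall` returns an empty list and `verify_json` returns `None`, which the caller treats as valid. A stricter variant could report that case too. This is only a sketch; `verify_json_strict` and its error text are hypothetical names, not part of the notebook.

```python
import json
import re

# Build the fence marker from single backticks so this snippet contains no literal fence.
FENCE_MARK = "`" * 3
JSON_BLOCK = re.compile(FENCE_MARK + r"json(.*?)" + FENCE_MARK, re.DOTALL)

def verify_json_strict(markdown_string):
    """Return None if every fenced json block parses, else an error message.

    Hypothetical variant of verify_json: a reply without any fenced
    json block is reported as an error instead of passing silently.
    """
    blocks = JSON_BLOCK.findall(markdown_string)
    if not blocks:
        return "No fenced json block found"
    for block in blocks:
        try:
            json.loads(block.strip())
        except json.JSONDecodeError as e:
            return str(e)
    return None
```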

In [4]:
def fix_json(llm, json_string, err_msg):
    from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
"""You are a professional JavaScript engineer. Your task is to validate the JSON format.

There will be a code snippet formatted as JSON.
You should check whether the code snippet is well-formed JSON and fix it if necessary.

Make sure the final JSON is correct and output the validated JSON code. Do not print anything else.

Here's the error message about the code snippet:
{err_msg}

The original code snippet is as follows:

{input}

""")
    chain = prompt | llm | StrOutputParser()
    temp = chain.invoke({"input": json_string, "err_msg":err_msg})
    print("\nError Hint\n")
    print(f"{err_msg}")
    print("\nFixed Json\n")
    print(temp)
    print("\n---\n")
    return temp
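
Since the two errors shown at the top are purely mechanical (a missing comma between members, a trailing comma before a closing bracket), a cheap rule-based pass could be tried before spending an LLM call. The following is only a sketch under that assumption. `add_missing_commas` is a hypothetical helper, and the second rule could in principle also touch a string value that happens to contain a comma followed by a bracket.

```python
import json
import re

def add_missing_commas(json_text):
    """Hypothetical helper: patch the two mistakes seen above without an LLM."""
    # A closing quote at the end of a line followed by an opening quote on the
    # next line means the comma between two members was dropped.
    patched = re.sub(r'"[ \t]*\n([ \t]*")', r'",\n\1', json_text)
    # A comma directly before a closing bracket or brace is a trailing comma.
    patched = re.sub(r',(\s*[\]}])', r'\1', patched)
    return patched

broken = (
    '[\n'
    '    {\n'
    '        "correction": "She begged."   \n'
    '        "zh-tw": "她如此哀求著。"\n'
    '    },\n'
    ']'
)
fixed = add_missing_commas(broken)
print(json.loads(fixed))
```

If the patched text still fails to parse, it can be handed to `fix_json` as before.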

Now I slightly modify the main LLM chain.
In addition to sentences, I ask the LLM to split dialogue as well.
I also try to fix invalid JSON, with up to 3 retries.

In [5]:
def autoprocess_paragraph(llm, paragraphs, target_file, start=0):
    from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
"""You are a professional translation assistant. Your task is to translate a paragraph from English to Traditional Chinese.

There will be an OCR document with many layout issues.
You should first find every complete sentence and speech in the article,
and check and fix any sentence or speech that was split by a line break by mistake.
Every sentence or speech should be separated and distinct.
You should check and fix misspellings and typos as well.

Then you should translate each sentence into zh-tw.
If a sentence is nonsense, you can just copy it to the correction and translation fields without any modification.

The output should be a markdown code snippet formatted in the following json schema, including the leading and trailing "```json" and "```":

```json
[
    {{
        "correction": string,  // the sentence after error correction
        "zh-tw": string  // the sentence translated into zh-tw
    }},
    {{
        "correction": string,  // the sentence after error correction
        "zh-tw": string  // the sentence translated into zh-tw
    }},
    .......
    ........
    {{
        "correction": string,  // the sentence after error correction
        "zh-tw": string  // the sentence translated into zh-tw
    }}
]
```

Example:
---
input:

Eefore one girl and another even younger one stood a figure in 
full p1ate armor brandishing a 5word.Lhe blade swung, sparkl-
ing in the sun1ight as if to say that taking their lives in a 
single stroke would be an act of mercy.
"No, let me go" She begged.
""
---
output:

```json
[
    {{
        "correction": "Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.",
        "zh-tw": "在一名少女以及比她更年輕的少女面前，站著一位身穿全身板甲、揮舞著劍的男子。"
    }},
    {{
        "correction": "The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.",
        "zh-tw": "刀鋒揮動，在陽光下閃爍，彷彿在說一刀奪命是仁慈的作為。"
    }},
    {{
        "correction": "“No, let me go”",
        "zh-tw": "不，放我走。"
    }},
    {{
        "correction": "She begged.",
        "zh-tw": "她如此哀求著。"
    }}
]
```
---


The original paragraph is as follows:

{input}

""")
    print(f"using llm: {llm}")
    chain = prompt | llm | StrOutputParser()
    translated=[]
    from tqdm.notebook import tqdm, trange
    errorindx = -1
    with tqdm(total=len(paragraphs)) as progress_bar:
        for i, paragraph in enumerate(paragraphs):
            if i < start:
                progress_bar.update(1)
                continue
            if len(paragraph.strip()) > 0:
                try:
                    # temp = ""
                    # for chunk in chain.stream({"input": paragraph}):
                    #     print(chunk, end="", flush=True)
                    #     temp += chunk
                    # print("\n------\n")    
                    temp = chain.invoke({"input": paragraph})
                except Exception as e:
                    print(f"Error occurs at {i}:")
                    print(paragraph)
                    print(e)
                    errorindx = i
                    return errorindx,translated
                retry=0
                temp2 = temp
                err_msg = verify_json(temp2)
                # Up to 3 fix attempts, each one re-verified; if none succeeds, keep the original output untouched.
                while err_msg and retry < 3:
                    temp2 = fix_json(llm, temp2, err_msg)
                    retry += 1
                    err_msg = verify_json(temp2)
                if not err_msg:
                    temp = temp2

                translated.append(temp)
                with open(target_file, 'a+', encoding='utf-8') as f:
                    f.write(temp)
                    f.write('\n')
            progress_bar.update(1)
    return errorindx,translated
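
Stripped of the LLM calls, the verify-and-fix step is a bounded retry loop. Here is a minimal sketch of that pattern with plain callables standing in for `verify_json` and `fix_json`. The name `repair_with_retries` and its parameters are hypothetical, and in this sketch the result of the last fix attempt is also validated before giving up.

```python
def repair_with_retries(text, validate, fix, max_retries=3):
    """Return a validated candidate, or the original text if none validates.

    validate(text) returns None when the text is fine, else an error message.
    fix(text, err_msg) returns a new candidate to validate.
    """
    candidate = text
    for attempt in range(max_retries + 1):
        err_msg = validate(candidate)
        if err_msg is None:
            return candidate
        if attempt < max_retries:
            candidate = fix(candidate, err_msg)
    # Every attempt failed: fall back to the untouched original.
    return text

# Toy stand-ins for the two LLM-backed functions.
calls = []

def toy_validate(text):
    return None if text.endswith("]") else "missing ]"

def toy_fix(text, err_msg):
    calls.append(err_msg)
    return text + "]"

print(repair_with_retries("[1", toy_validate, toy_fix))
```

Returning the original text on failure means a bad block is still written to the output file, where it can be found and repaired by hand later.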

In [6]:
source_filenames = ["alice_in_wonderland.txt"]

In [7]:
errorbookindex=-1
errorindx=-1

In [8]:
for i, source_file in enumerate(source_filenames):
    if i < errorbookindex:
        continue

    print(source_file)
    target = source_file + ".v5.txt"
    path = source_file
    paragraphs = split_paragraph(path)[1:5]

    progress=0
    while (progress < len(paragraphs)):
        errorindx, translated_paragraphs = autoprocess_paragraph(geminipro, paragraphs, target, start=errorindx)
        if errorindx < 0:
            progress=len(paragraphs)
        else: # errorindx >= 0: error, retry
            _errorindx = -1
            for _llm in [openai]: # use optional llm to retry
                _paragraphs = paragraphs[errorindx:errorindx+1]
                _errorindx, _translated_paragraphs = autoprocess_paragraph(_llm, _paragraphs, target, start=0)
                if _errorindx < 0: # retry success
                    break
            if _errorindx < 0: # any retry success
                errorindx=errorindx+1 # continue next
                progress = errorindx
                continue
            break # every fallback llm failed as well: stop instead of retrying forever

    if progress < len(paragraphs): # stopped early: remember this book so a rerun resumes here
        errorbookindex=i
        break
    errorindx = -1 # book finished: the next one starts from its first paragraph
        

alice_in_wonderland.txt
using llm: model='gemini-pro' temperature=0.5 top_p=0.85 safety_settings={<HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: 10>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HATE_SPEECH: 8>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HARASSMENT: 7>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: 9>: <HarmBlockThreshold.BLOCK_NONE: 4>} client= genai.GenerativeModel(
   model_name='models/gemini-pro',
   generation_config={}.
   safety_settings={}
) convert_system_message_to_human=True


  0%|          | 0/4 [00:00<?, ?it/s]

  
You can open the file `alice_in_wonderland.txt.v5.txt` to check the result.  
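
To check the result programmatically rather than by eye, the fenced blocks in the output file can be parsed back into sentence pairs. This is a sketch only. `load_pairs` is a hypothetical helper, and it assumes each block is a list of objects with the `correction` and `zh-tw` keys requested in the prompt.

```python
import json
import re

# Build the fence marker from single backticks so this snippet contains no literal fence.
FENCE_MARK = "`" * 3
JSON_BLOCK = re.compile(FENCE_MARK + r"json(.*?)" + FENCE_MARK, re.DOTALL)

def load_pairs(text):
    """Collect (correction, zh-tw) pairs from every valid fenced json block."""
    pairs = []
    skipped = 0
    for block in JSON_BLOCK.findall(text):
        try:
            items = json.loads(block.strip())
        except json.JSONDecodeError:
            skipped += 1  # still broken after the retries; leave it for manual review
            continue
        for item in items:
            if isinstance(item, dict) and "correction" in item and "zh-tw" in item:
                pairs.append((item["correction"], item["zh-tw"]))
    return pairs, skipped
```

For example, `load_pairs(open("alice_in_wonderland.txt.v5.txt", encoding="utf-8").read())` returns the pairs together with the number of blocks that could not be parsed.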