# V7
V6 used the LLM itself to fix some of the JSON problems in its output, but not all of them.  
Errors such as a missing `,` or `\` still show up.  
  
An LLM is not reliable at fixing this kind of issue,  
so a conventional library is probably a better choice:  
https://github.com/mangiucugna/json_repair

We will write `repair_markdown_json()` to replace the earlier `verify_json()` and `fix_json()`.
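
To make the failure mode concrete, here is a minimal sketch using only Python's standard `json` module. The sample string is made up for illustration (it is not real model output), and `try_parse` is a hypothetical helper, not part of `utils`. It shows that a strict parser rejects a reply with a single missing comma, which is the kind of error a repair step has to handle.

```python
import json

# Hypothetical LLM reply: the comma after the "correction" value is missing.
broken = '[{"correction": "She begged." "zh-tw": "她如此哀求著。"}]'


def try_parse(text):
    """Return (parsed_object, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as e:
        return None, f"{e.msg} at line {e.lineno}, column {e.colno}"


# The strict parser refuses the broken reply and reports where it gave up.
obj, err = try_parse(broken)
print(obj, err)

# Once the comma is restored by hand, the same text parses cleanly.
fixed = broken.replace('" "zh-tw"', '", "zh-tw"')
obj, err = try_parse(fixed)
print(obj, err)
```

A repair library automates the second step, so no extra LLM round trip is needed for each syntax error.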


## Gemini Pro API and OpenAI API
Check [Bilingual-NovelTranslator-v1.ipynb](Bilingual-NovelTranslator-v1.ipynb) for more information.

**Note**: Remember to fill in your API key in the `.env` file.

In [1]:
%load_ext autoreload
%autoreload 2
    
import utils

geminipro, openai = utils.init_llm()

In [2]:

import utils
# Check [Bilingual-NovelTranslator-v6.ipynb](Bilingual-NovelTranslator-v6.ipynb) for more information.
utils.monkeypatch_split_text_with_regex()

# refer to V5 for more info
split_paragraph = utils.split_paragraph

In [3]:
!pip install -qU json_repair

In [4]:
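def repair_markdown_json(bad_markdown_json):
    """Repair each ```json block in an LLM reply and return them in a single ```json fence."""
    import json
    import re
    from json_repair import repair_json

    # Capture the body of every fenced json block; DOTALL lets `.` span newlines.
    pattern = re.compile(r"```json(.*?)```", re.DOTALL)
    matches = pattern.findall(bad_markdown_json)
    merged_list = []
    for match in matches:
        try:
            # return_objects=True returns the parsed Python object instead of a string.
            json_data = repair_json(match.strip(), return_objects=True)
            # Re-serialize with ensure_ascii=False so the Chinese text stays readable.
            new_json = json.dumps(json_data, indent=4, ensure_ascii=False)
            merged_list.append(new_json)
        except json.JSONDecodeError as e:
            print(e)
            print(f"Error decoding JSON in block: {match}")
    return "\n".join(["```json"] + merged_list + ["```\n"])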
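
The extraction step can be tried on its own without `json_repair`. The sketch below reuses the same regular expression shape and checks each block with the strict standard-library parser to see which ones would need repair. The helper names (`extract_json_blocks`, `needs_repair`) and the sample reply are hypothetical and exist only for this illustration.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal code fence

# Same pattern shape as in repair_markdown_json(): capture everything between the fences.
BLOCK_RE = re.compile(FENCE + r"json(.*?)" + FENCE, re.DOTALL)


def extract_json_blocks(markdown_text):
    """Return the stripped body of every fenced json block in the text."""
    return [m.strip() for m in BLOCK_RE.findall(markdown_text)]


def needs_repair(block):
    """True if the strict standard-library parser rejects the block."""
    try:
        json.loads(block)
        return False
    except json.JSONDecodeError:
        return True


# Hypothetical reply with one valid block and one block missing a comma.
reply = (
    f'{FENCE}json\n[{{"correction": "Hi.", "zh-tw": "嗨。"}}]\n{FENCE}\n'
    f'Some chatter from the model.\n'
    f'{FENCE}json\n[{{"correction": "Bye." "zh-tw": "再見。"}}]\n{FENCE}\n'
)

blocks = extract_json_blocks(reply)
flags = [needs_repair(b) for b in blocks]
print(flags)
```

Text outside the fences is ignored, so any commentary the model adds around the JSON does not reach the parser.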

In [5]:
def autoprocess_paragraph(llm, paragraphs, target_file, start=0):
    from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
"""You are a professional translation assistant. Your task is to translate a paragraph from english to traditional Chinese.

The input is an OCR document with many layout issues.
You should first identify every complete sentence and line of speech in the text,
and repair any sentence or speech that was split by a stray line break.
Every sentence or speech should be separated and distinct.
You should also check and fix misspellings and typos.

Then you should translate each sentence into zh-tw.
If a sentence is nonsense, you can just copy it to the correction and translation fields without any modification.

The output should be a markdown code snippet formatted in the following json schema, including the leading and trailing "```json" and "```":

```json
[
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }},
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }},
    .......
    ........
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }}
]
```

Example:
---
input:

Eefore one girl and another even younger one stood a figure in 
full p1ate armor brandishing a 5word.Lhe blade swung, sparkl-
ing in the sun1ight as if to say that taking their lives in a 
single stroke would be an act of mercy.
"No, let me go" She begged.
""
---
output:

```json
[
    {{
        "correction": "Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.",
        "zh-tw": "在一名少女以及比她更年輕的少女面前，站著一位身穿全身板甲、揮舞著劍的男子。"
    }},
    {{
        "correction": "The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.",
        "zh-tw": "刀鋒揮動，在陽光下閃爍，彷彿在說一刀奪命是仁慈的作為。"
    }},
    {{
        "correction": "“No, let me go”",
        "zh-tw": "不，放我走。"
    }},
    {{
        "correction": "She begged.",
        "zh-tw": "她如此哀求著。"
    }}
]
```
---


The original paragraph is as following:

{input}

""")
    print(f"using llm: {llm}")
    chain = prompt | llm | StrOutputParser()
    translated=[]
    from tqdm.notebook import tqdm, trange
    errorindx = -1
    with tqdm(total=len(paragraphs)) as progress_bar:
        for i, paragraph in enumerate(paragraphs):
            if i < start:
                progress_bar.update(1)
                continue
            if len(paragraph.strip()) > 0:
                try:
                    # temp = ""
                    # for chunk in chain.stream({"input": paragraph}):
                    #     print(chunk, end="", flush=True)
                    #     temp += chunk
                    # print("\n------\n")    
                    temp = chain.invoke({"input": paragraph})
                except Exception as e:
                    print(f"Error occurs at {i}:")
                    print(paragraph)
                    print(e)
                    errorindx = i
                    return errorindx, translated
                    
                temp = repair_markdown_json(temp)
                # V6 retried verify_json()/fix_json() up to 3 times at this point;
                # the single repair_markdown_json() call above replaces that loop.
                        
                translated.append(temp)
                # Append as UTF-8 so the zh-tw text is written correctly on every platform.
                with open(target_file, 'a', encoding='utf-8') as f:
                    f.write(temp)
                    f.write('\n')
            progress_bar.update(1)
    return errorindx, translated

In [6]:
source_filenames = ["alice_in_wonderland.txt"]

In [7]:
errorbookindex=-1
errorindx=-1

In [8]:
for i, source_file in enumerate(source_filenames):
    if i < errorbookindex:
        continue

    print(source_file)
    target = source_file + ".v7.txt"
    path = source_file
    paragraphs = split_paragraph(path)[1:5]

    progress=0
    while (progress < len(paragraphs)):
        errorindx, translated_paragraphs = autoprocess_paragraph(geminipro, paragraphs, target, start=errorindx)
        if errorindx < 0:
            progress=len(paragraphs)
        else: # errorindx >= 0: error, retry that one paragraph
            _errorindx = -1
            for _llm in [openai]: # use optional llm to retry
                _paragraphs = paragraphs[errorindx:errorindx+1]
                _errorindx, _translated_paragraphs = autoprocess_paragraph(_llm, _paragraphs, target, start=0)
                if _errorindx < 0: # retry success
                    break
            if _errorindx >= 0: # every retry failed: stop so the loop cannot spin forever
                break
            errorindx=errorindx+1 # retry succeeded, continue with the next paragraph
            progress = errorindx

    if progress < len(paragraphs): # stopped early: remember which book failed
        errorbookindex=i
        break
    errorindx = -1 # book finished, start the next one from the beginning

alice_in_wonderland.txt
using llm: model='gemini-pro' temperature=0.5 top_p=0.85 safety_settings={<HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: 10>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HATE_SPEECH: 8>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HARASSMENT: 7>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: 9>: <HarmBlockThreshold.BLOCK_NONE: 4>} client= genai.GenerativeModel(
   model_name='models/gemini-pro',
   generation_config={}.
   safety_settings={}
) convert_system_message_to_human=True


  0%|          | 0/4 [00:00<?, ?it/s]
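
The retry logic in the driver loop above can be hard to follow because of the index bookkeeping. As a simplified sketch of the same idea (try the primary model, fall back to another model for the paragraph that failed, stop if nothing works), the example below uses plain functions in place of the real LLM chains. Every name in it (`translate_with_fallback`, `flaky_model`, `backup_model`, `broken_model`) is hypothetical and not part of `utils`.

```python
def translate_with_fallback(paragraphs, primary, fallbacks):
    """Translate each paragraph with `primary`; on failure try each fallback once.

    Returns (results, failed_index); failed_index is -1 when every paragraph succeeded.
    """
    results = []
    for i, paragraph in enumerate(paragraphs):
        for translate in [primary, *fallbacks]:
            try:
                results.append(translate(paragraph))
                break  # this paragraph is done, move on to the next one
            except Exception as e:
                print(f"paragraph {i}: {translate.__name__} failed: {e}")
        else:
            # No model succeeded: stop and report where to resume.
            return results, i
    return results, -1


# Hypothetical stand-ins for the two LLM chains.
def flaky_model(text):
    if "bad" in text:
        raise RuntimeError("blocked")
    return f"primary:{text}"


def backup_model(text):
    return f"backup:{text}"


def broken_model(text):
    raise RuntimeError("always fails")


results, failed = translate_with_fallback(
    ["one", "bad two", "three"], flaky_model, [backup_model]
)
print(results, failed)
```

Returning the failed index rather than raising mirrors how `autoprocess_paragraph()` reports `errorindx`, so a later run can resume from that paragraph.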