# V6

However, some problems remain.
The RecursiveCharacterTextSplitter doesn’t do a good job of splitting our string.
It attaches each trailing separator to the start of the next slice, which makes the sentences look weird.
```python
[
    "“I’m glad I was able to see you after so long",
    ".” \n\n“It makes me glad to hear you say that", 
    "!” \n\n“But I should probably get going soon...”"
]
```
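So we have to monkey patch `_split_text_with_regex`, the helper that the RecursiveCharacterTextSplitter relies on, so that each separator stays at the end of its own slice.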
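
The difference is easy to reproduce with plain `re.split`. The sketch below is not the library's code: `keep_separator_at_start` and `keep_separator_at_end` are hypothetical helper names, and the sample text and separator are made up for illustration. It also assumes the separator pattern contains no capturing groups of its own. The first helper mimics the behaviour shown above, the second the behaviour we want.

```python
import re
from typing import List


def keep_separator_at_start(text: str, separator: str) -> List[str]:
    """Attach each separator to the slice that FOLLOWS it (the unwanted result)."""
    # With a capturing group, re.split returns [text, sep, text, sep, ..., text].
    parts = re.split(f"({separator})", text)
    chunks = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    return [c for c in chunks if c != ""]


def keep_separator_at_end(text: str, separator: str) -> List[str]:
    """Attach each separator to the slice that PRECEDES it (the wanted result)."""
    parts = re.split(f"({separator})", text)
    chunks = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    chunks += parts[-1:]  # trailing text that has no separator after it
    return [c for c in chunks if c != ""]


sample = "Glad to see you. Me too! Bye"
start_chunks = keep_separator_at_start(sample, r"[.!]")
end_chunks = keep_separator_at_end(sample, r"[.!]")
print(start_chunks)  # ['Glad to see you', '. Me too', '! Bye']
print(end_chunks)    # ['Glad to see you.', ' Me too!', ' Bye']
```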

## Gemini Pro API and OpenAI API
Check [Bilingual-NovelTranslator-v1.ipynb](Bilingual-NovelTranslator-v1.ipynb) for more information.

**Note**: Remember to fill in your API key in the `.env` file.

In [1]:
%load_ext autoreload
%autoreload 2
    
import utils

geminipro, openai = utils.init_llm()

In [2]:
# Monkey patch
from langchain_text_splitters import character
from typing import List
import re
def _split_text_with_regex(
    text: str, separator: str, keep_separator: bool
) -> List[str]:
    # Now that we have the separator, split the text
    if separator:
        if keep_separator:
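            # The parentheses in the pattern keep the delimiters in the result.
            _splits = re.split(f"({separator})", text)
            # Pair each piece of text with the separator that follows it, so the
            # separator ends its own slice instead of starting the next one.
            splits = [_splits[i] + _splits[i + 1] for i in range(0, len(_splits)-1, 2)]
            # The last piece has no separator after it; keep it as is.
            if len(_splits) % 2 == 1:
                splits += _splits[-1:]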
        else:
            splits = re.split(separator, text)
    else:
        splits = list(text)
    return [s for s in splits if s != ""]

character._split_text_with_regex = _split_text_with_regex
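
Replacing a module attribute only affects code that looks the name up through the module at call time. The sketch below shows this with a throwaway module built in memory; `fake_splitter`, `_split`, and `split_text` are hypothetical names, not part of langchain. We are assuming that the splitter in `langchain_text_splitters.character` calls `_split_text_with_regex` through its module globals in the same way, which would explain why the assignment above takes effect.

```python
import types

# Source of a tiny stand-in module (hypothetical, for illustration only).
SOURCE = '''
def _split(text):
    return text.split(" ")

def split_text(text):
    # Looks up _split in the module globals every time it is called.
    return _split(text)
'''

fake_splitter = types.ModuleType("fake_splitter")
exec(SOURCE, fake_splitter.__dict__)

early_reference = fake_splitter._split  # reference taken before the patch


def _split_on_comma(text):
    return text.split(",")


fake_splitter._split = _split_on_comma  # the monkey patch

patched_result = fake_splitter.split_text("a,b c")
early_result = early_reference("a,b c")
print(patched_result)  # ['a', 'b c']
print(early_result)    # ['a,b', 'c']
```

A reference copied before the patch keeps the old behaviour, so applying the patch before anything else imports or calls the splitter is the safe order.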

In [3]:
# refer to V5 for more info
split_paragraph = utils.split_paragraph

In [4]:
def verify_json(markdown_string):
    """Return None if every fenced json block parses, otherwise the first error message."""
    import re
    import json

    # Grab the content of every fenced json block in the LLM response.
    # Note: a response without any fenced json block also returns None.
    pattern = re.compile(r"```json(.*?)```", re.DOTALL)
    matches = pattern.findall(markdown_string)

    for match in matches:
        try:
            json.loads(match.strip())
        except json.JSONDecodeError as e:
            print(e)
            print(f"Error decoding JSON in block: {match}")
            return str(e)
    return None
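
An unescaped ASCII double quote inside a string value is one typical way for a model to break its JSON. The sketch below restates the check in a compact, standalone form to show what it catches; `first_json_error` is a hypothetical name, and the sample strings are made up.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
JSON_BLOCK = re.compile(FENCE + r"json(.*?)" + FENCE, re.DOTALL)


def first_json_error(markdown_string: str) -> Optional[str]:
    """Return the first JSON decoding error found in the fenced blocks, or None."""
    for block in JSON_BLOCK.findall(markdown_string):
        try:
            json.loads(block.strip())
        except json.JSONDecodeError as e:
            return str(e)
    return None


good = FENCE + 'json\n[{"zh-tw": "她說：「你好」"}]\n' + FENCE
# The ASCII double quotes around 你好 end the string early and break the JSON.
bad = FENCE + 'json\n[{"zh-tw": "她說："你好""}]\n' + FENCE

good_error = first_json_error(good)
bad_error = first_json_error(bad)
print(good_error)  # None
print(bad_error)   # starts with: Expecting ',' delimiter

# json.dumps escapes the inner quotes, so the result parses again.
escaped = json.dumps({"zh-tw": '她說："你好"'}, ensure_ascii=False)
print(json.loads(escaped)["zh-tw"])
```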

In [5]:
def fix_json(llm, json_string, err_msg):
    from langchain_core.prompts import PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
"""You are a professional JavaScript engineer. Your task is to validate JSON.

There will be a code snippet formatted as JSON.
You should check whether the code snippet is well-formed JSON and fix it if necessary.

Make sure the final JSON is valid and output the validated JSON code; do not print anything else.

Here's the error message about the code snippet:
{err_msg}

The original code snippet is as follows:

{input}

""")
    chain = prompt | llm | StrOutputParser()
    temp = chain.invoke({"input": json_string, "err_msg":err_msg})
    print("\nError Hint\n")
    print(f"{err_msg}")
    print("\nFixed Json\n")
    print(temp)
    print("\n---\n")
    return temp
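
One detail worth keeping in mind with these templates: as far as we can tell, `PromptTemplate.from_template` defaults to Python's f-string style, so it should follow the same brace rules as `str.format`. Literal braces in the template text must be doubled, while braces inside a substituted value (such as the JSON passed as `input`) are left alone. The sketch below uses plain `str.format` as a stand-in rather than langchain itself, and the template text is made up.

```python
# Plain str.format is used here as a stand-in for the prompt template.
template = 'Schema: {{"zh-tw": string}}\nSnippet:\n{input}'

json_snippet = '[{"zh-tw": "你好"}]'  # braces in a value need no escaping
prompt_text = template.format(input=json_snippet)
print(prompt_text)

# A single, undoubled brace in the template itself is rejected.
try:
    'Schema: {"zh-tw": string}'.format(input=json_snippet)
    unescaped_ok = True
except (KeyError, ValueError, IndexError):
    unescaped_ok = False
print(unescaped_ok)  # False
```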

In [6]:
def autoprocess_paragraph(llm, paragraphs, target_file, start=0):
    from langchain_core.prompts import PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
"""You are a professional translation assistant. Your task is to translate a paragraph from English to Traditional Chinese.

There will be an OCR document with many layout issues.
You should first find every complete sentence and piece of speech in the article,
then check whether a sentence or speech was split (by a line break) by mistake, and fix it.
Every sentence or speech should be separate and distinct.
You should check and fix misspellings and typos as well.

Then you should translate each sentence into zh-tw.
If a sentence is nonsense, you can just copy it to the correction and translation fields without any modification.

The output should be a markdown code snippet formatted in the following json schema, including the leading and trailing "```json" and "```":

```json
[
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }},
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }},
    .......
    ........
    {{
        "correction": string  // the sentence after error correction
        "zh-tw": string  // the translated sentence zh-tw
    }}
]
```

Example:
---
input:

Eefore one girl and another even younger one stood a figure in 
full p1ate armor brandishing a 5word.Lhe blade swung, sparkl-
ing in the sun1ight as if to say that taking their lives in a 
single stroke would be an act of mercy.
"No, let me go" She begged.
""
---
output:

```json
[
    {{
        "correction": "Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.",
        "zh-tw": "在一名少女以及比她更年輕的少女面前，站著一位身穿全身板甲、揮舞著劍的男子。"
    }},
    {{
        "correction": "The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.",
        "zh-tw": "刀鋒揮動，在陽光下閃爍，彷彿在說一刀奪命是仁慈的作為。"
    }},
    {{
        "correction": "“No, let me go”",
        "zh-tw": "不，放我走。"
    }},
    {{
        "correction": "She begged.",
        "zh-tw": "她如此哀求著。"
    }}
]
```
---


The original paragraph is as follows:

{input}

""")
    print(f"using llm: {llm}")
    chain = prompt | llm | StrOutputParser()
    translated=[]
    from tqdm.notebook import tqdm, trange
    errorindx = -1
    with tqdm(total=len(paragraphs)) as progress_bar:
        for i, paragraph in enumerate(paragraphs):
            if i < start:
                progress_bar.update(1)
                continue
            if len(paragraph.strip()) > 0:
                try:
                    # temp = ""
                    # for chunk in chain.stream({"input": paragraph}):
                    #     print(chunk, end="", flush=True)
                    #     temp += chunk
                    # print("\n------\n")    
                    temp = chain.invoke({"input": paragraph})
                except Exception as e:
                    print(f"Error occurs at {i}:")
                    print(paragraph)
                    print(e)
                    errorindx = i
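                    return errorindx, translated
                temp2 = temp
                for _ in range(3):  # Try to repair the JSON at most 3 times.
                    err_msg = verify_json(temp2)
                    if not err_msg:
                        break
                    temp2 = fix_json(llm, temp2, err_msg)
                # Keep the repaired text only if it is valid now (this also checks the
                # last repair attempt); otherwise leave the original output untouched.
                if verify_json(temp2) is None:
                    temp = temp2

                translated.append(temp)
                # The output contains Chinese text, so write it as UTF-8
                # regardless of the platform default encoding.
                with open(target_file, 'a+', encoding='utf-8') as f:
                    f.write(temp)
                    f.write('\n')
            progress_bar.update(1)
    return errorindx, translated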
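
The repair step can also be isolated into a small helper that is easy to test without calling an LLM. This is a minimal sketch, not the notebook's code: `check`, `repair_with_retries`, and `add_missing_comma` are hypothetical names, and the stub fixer stands in for `fix_json`.

```python
import json
from typing import Callable, Optional


def check(text: str) -> Optional[str]:
    """Return an error message if text is not valid JSON, else None."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as e:
        return str(e)


def repair_with_retries(
    text: str, fixer: Callable[[str, str], str], max_retries: int = 3
) -> str:
    """Ask fixer to repair text at most max_retries times; fall back to the original."""
    candidate = text
    for _ in range(max_retries):
        error = check(candidate)
        if error is None:
            break
        candidate = fixer(candidate, error)
    # Accept the candidate only if it is valid now, including after the last attempt.
    return candidate if check(candidate) is None else text


def add_missing_comma(text: str, error: str) -> str:
    """Stub fixer standing in for the LLM call; it only knows one repair."""
    return text.replace('" "', '", "', 1)


repaired = repair_with_retries('["a" "b" "c"]', add_missing_comma)
print(repaired)  # ["a", "b", "c"]

hopeless = repair_with_retries("[1, 2", add_missing_comma)
print(hopeless)  # [1, 2
```

Returning the original text when every attempt fails matches the "just don't touch it" policy: the raw output still lands in the target file for manual inspection.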

In [7]:
source_filenames = ["alice_in_wonderland.txt"]

In [8]:
errorbookindex=-1
errorindx=-1

In [9]:
for i, source_file in enumerate(source_filenames):
    if i < errorbookindex:
        continue

    print(source_file)
    target = source_file + ".v6.txt"
    path = source_file
    paragraphs = split_paragraph(path)[1:5]

    progress=0
    while (progress < len(paragraphs)):
        errorindx, translated_paragraphs = autoprocess_paragraph(geminipro, paragraphs, target, start=errorindx)        
        if errorindx < 0:
            progress=len(paragraphs)
        else: # errorindx >= 0:# error, retry
            _errorindx = -1
            for _llm in [openai]: # use optional llm to retry
                _paragraphs = paragraphs[errorindx:errorindx+1]
                _errorindx, _translated_paragraphs = autoprocess_paragraph(_llm, _paragraphs, target, start=0)
                if _errorindx < 0: # retry success
                    break
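            if _errorindx < 0: # any retry success
                errorindx=errorindx+1 # continue next
                progress = errorindx
                if progress >= len(paragraphs):
                    errorindx = -1 # the last paragraph was recovered, nothing left to resume
                continue
            break # every fallback llm failed as well: stop and keep errorindx for a manual resume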

    if errorindx >= 0:
        errorbookindex=i
        break
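
The index bookkeeping above can be hard to follow. An alternative shape is to try the fallback models per paragraph and stop at the first paragraph nobody can handle. The sketch below shows that idea with stub functions in place of real LLM calls; `process_with_fallback`, `picky`, and `lenient` are hypothetical names and this is not how the notebook is implemented.

```python
from typing import Callable, List, Sequence, Tuple

Worker = Callable[[str], str]


def process_with_fallback(
    items: Sequence[str], primary: Worker, fallbacks: Sequence[Worker]
) -> Tuple[List[str], int]:
    """Process items in order; return (results, first unrecoverable index or -1)."""
    results: List[str] = []
    for index, item in enumerate(items):
        for worker in [primary, *fallbacks]:
            try:
                results.append(worker(item))
                break  # this item is done, move on to the next one
            except Exception as e:
                print(f"{worker.__name__} failed at {index}: {e}")
        else:
            # No worker succeeded: stop so the caller can resume from this index later.
            return results, index
    return results, -1


def picky(text: str) -> str:
    """Stub for the primary LLM: refuses one particular input."""
    if "blocked" in text:
        raise ValueError("refused")
    return text.upper()


def lenient(text: str) -> str:
    """Stub for the fallback LLM: accepts everything."""
    return text.title()


paragraphs = ["one", "blocked two", "three"]
results, failed_at = process_with_fallback(paragraphs, picky, [lenient])
print(results, failed_at)  # ['ONE', 'Blocked Two', 'THREE'] -1
```

With no fallback available, the same call stops at the refused paragraph and reports its index, which plays the role of `errorindx`.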

alice_in_wonderland.txt
using llm: model='gemini-pro' temperature=0.5 top_p=0.85 safety_settings={<HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: 10>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HATE_SPEECH: 8>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_HARASSMENT: 7>: <HarmBlockThreshold.BLOCK_NONE: 4>, <HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: 9>: <HarmBlockThreshold.BLOCK_NONE: 4>} client= genai.GenerativeModel(
   model_name='models/gemini-pro',
   generation_config={}.
   safety_settings={}
) convert_system_message_to_human=True


  0%|          | 0/4 [00:00<?, ?it/s]

Expecting ',' delimiter: line 4 column 20 (char 154)
Error decoding JSON in block: 
[
  {
    "correction": "Well!' thought Alice to herself, `after such a fall as this, I shall think nothing of tumbling down stairs!",
    "zh-tw": "好極了！"愛麗絲心想，「經歷過這種墜落後，我以後滾下樓梯根本不算什麼！大家一定會覺得我超勇敢！就算從屋頂上摔下來，我都不會說半個字！」（這很可能是真的。）"
  },
  {
    "correction": "How brave they'll all think me at home!  Why, I wouldn't say anything about it, even if I fell off the top of the house!'",
    "zh-tw": "大家一定會覺得我超勇敢！就算從屋頂上摔下來，我都不會說半個字！」"
  },
  {
    "correction": "(Which was very likely true.)",
    "zh-tw": "(這很可能是真的。)"
  },
  {
    "correction": "Down, down, down.",
    "zh-tw": "往下墜，往下墜，一直往下墜。"
  },
  {
    "correction": "Would the fall NEVER come to an end!",
    "zh-tw": "這場墜落會不會永遠沒有盡頭！"
  },
  {
    "correction": "`I wonder how many miles I've fallen by this time?' she said aloud.",
    "zh-tw": "「我到底已經墜落了多少英里？」她大聲說著。"
  }
]


Error Hint

Expecting ',' delimiter: line 4 column 20 (char 154)

Fixed Json

```jso


You can open the file `alice_in_wonderland.txt.v6.txt` to check the result.  