# V3
In V2 I asked the LLM to fix the redundant '\n' characters, but that did not work well.  
In V3 I replace the plain-text output with JSON.

The JSON schema reminds the LLM which attributes it should output.
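
Because the model is asked to wrap its answer in a fenced JSON block, the response has to be unwrapped before the sentences can be used. The helper below is a minimal sketch of one way to do that with the standard library; the name `extract_corrections` is hypothetical and is not part of `utils` or LangChain. It assumes the response is a list of objects with a `"correction"` key and returns an empty list when the JSON is missing, truncated, or malformed.

```python
import json
import re

# Built at runtime so this snippet does not contain a literal fence marker.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_corrections(response):
    """Return the "correction" strings found in a fenced JSON response."""
    match = FENCED_BLOCK.search(response)
    # Fall back to the raw text if the model forgot the fences.
    payload = match.group(1) if match else response
    try:
        items = json.loads(payload)
    except json.JSONDecodeError:
        # Truncated or otherwise invalid JSON: report nothing rather than guess.
        return []
    if not isinstance(items, list):
        return []
    return [
        item["correction"]
        for item in items
        if isinstance(item, dict) and "correction" in item
    ]


sample = FENCE + 'json\n[{"correction": "Before one girl."}]\n' + FENCE
print(extract_corrections(sample))
```

Returning an empty list on failure keeps a long run going; a stricter pipeline might prefer to log the raw response or retry the request instead.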

## Gemini Pro API and OpenAI API
Check [Bilingual-NovelTranslator-v1.ipynb](Bilingual-NovelTranslator-v1.ipynb) for more information.

**Note**: Remember to fill in your API key in the `.env` file.

In [1]:
%load_ext autoreload
%autoreload 2
    
import utils

geminipro, openai = utils.init_llm()

In [2]:
def split_paragraph(source_path):
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=0,
        separators= ["\n\n\n\n","\n\n\n","\n\n"]
        # length_function=len,
        # is_separator_regex=True,
    )
    content=""
    with open(source_path, 'r') as f:
        content = f.read()
        
    return text_splitter.split_text(content)
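
To make the effect of the blank-line separators easier to see, here is a simplified, standard-library-only illustration. It is not LangChain's actual algorithm: the hypothetical `split_on_blank_lines` treats any run of two or more newlines as a single boundary and then greedily packs the pieces into chunks of at most `chunk_size` characters, whereas `RecursiveCharacterTextSplitter` tries the separators in priority order.

```python
import re


def split_on_blank_lines(text, chunk_size=1024):
    """Split on blank-line runs, then greedily pack pieces into chunks."""
    pieces = [p for p in re.split(r"\n{2,}", text) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + "\n\n" + piece
        if not current or len(candidate) <= chunk_size:
            # Keep growing the current chunk; an oversized single piece
            # is kept whole because there is no blank line to split it on.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "a" * 600 + "\n\n" + "b" * 600 + "\n\n\n\n" + "c" * 10
demo_chunks = split_on_blank_lines(demo)
print([len(chunk) for chunk in demo_chunks])
```

In this sketch a paragraph is never cut in the middle, which matches the intent of listing only blank-line separators in the cell above.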

In [3]:
def fix_paragraph(llm, paragraphs, target_file=None):
    from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    from langchain.callbacks.tracers import ConsoleCallbackHandler
    prompt = PromptTemplate.from_template(
    """
You are an article assistant.
There will be an OCR document with many layout issues.
Your task is to fix the issues.
You should first try to find out every complete sentence in the article,
and check and fix it if the sentence got split by mistake.
You should also check and fix misspellings and typos.

If a sentence is nonsense, make the output identical to original source.

The output should be a markdown code snippet formatted in the following json schema, including the leading and trailing "```json" and "```":

```json
[
    {{
     "correction": string  // the sentence after error correction
    }},
    {{
     "correction": string  // the sentence after error correction
    }},
    .......
    ........
    {{
     "correction": string  // the sentence after error correction
    }}
]
```

Example:
---
input:

Eefore one girl and another even younger one stood a figure in 
full p1ate armor brandishing a 5word.Lhe blade swung, sparkl-
ing in the sun1ight as if to say that taking their lives in a 
single stroke would be an act of mercy.

---

output:
```json
[
    {{
        "correction": "Before one girl and another even younger one stood a figure in full plate armor brandishing a sword."
    }},
    {{
        "correction": "The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy."
    }}
]
```
---


Here's the article we are going to check:

{input}

""")
    
    
    chain = prompt | llm | StrOutputParser()
    
    fixed_paragraphs = []
    from tqdm.notebook import tqdm, trange
    with tqdm(total=len(paragraphs)) as progress_bar:
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph.strip()) > 0:
                # temp = ""
                # for chunk in chain.stream({"input": paragraph}):
                #     print(chunk, end="", flush=True)
                #     temp += chunk
                # print("\n------\n")    
                temp = chain.invoke({"input": paragraph})
                print(temp)
                print("\n------\n")    
                fixed_paragraphs.append(temp)
                if target_file:
                    with open(target_file, 'a+') as f:
                        f.write(temp)
                        f.write('\n')
            progress_bar.update(1)
    return fixed_paragraphs

In [4]:
def translate_paragraph(llm, paragraphs, target_file=None):
    from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
    from langchain_core.output_parsers.string import StrOutputParser
    prompt = PromptTemplate.from_template(
        """You are a professional translation assistant. Your task is to translate a paragraph from English to Traditional Chinese.

You should first separate each paragraph into several complete sentences.
Each sentence should be prepended with a newline (\\n).
Then you should translate each sentence into zh-tw, followed by the original sentence, separated by a newline (\\n).
Add one more newline (\\n) at the end before processing the next sentence.
Make sure no original text or translation is skipped.

Example:
---
input:
Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.
The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.
---
output:
在一名少女以及比她更年輕的少女面前，站著一位身穿全身板甲、揮舞著劍的男子。
Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.

刀鋒揮動，在陽光下閃爍，彷彿在說一刀奪命是仁慈的作為。
The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.


---

The original paragraph is as follows:

{input}

""")
    
    chain = prompt | llm | StrOutputParser()
    translated=[]
    from tqdm.notebook import tqdm, trange
    with tqdm(total=len(paragraphs)) as progress_bar:
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph.strip()) > 0:
                temp = ""
                for chunk in chain.stream({"input": paragraph}):
                    print(chunk, end="", flush=True)
                    temp += chunk
                print("\n------\n")    
                translated.append(temp)
                if target_file:
                    # Write as UTF-8 so the Chinese text is saved correctly on any platform.
                    with open(target_file, 'a+', encoding='utf-8') as f:
                        f.write(temp)
                        f.write('\n')
            progress_bar.update(1)
    return translated
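
The translation prompt asks for each translated sentence to be followed by its original, with a blank line between pairs. As a rough sketch, assuming the model follows that layout, the output can be split back into (translation, original) pairs as shown below. The function name `split_bilingual_pairs` is hypothetical, and blocks that do not contain at least two lines are skipped rather than repaired.

```python
import re


def split_bilingual_pairs(output):
    """Split bilingual output into (translation, original) tuples."""
    pairs = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", output.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) >= 2:
            # First line is the translation; the rest is the original text.
            pairs.append((lines[0], " ".join(lines[1:])))
    return pairs


example_output = (
    "在一名少女以及比她更年輕的少女面前，站著一位身穿全身板甲、揮舞著劍的男子。\n"
    "Before one girl and another even younger one stood a figure in full plate armor brandishing a sword.\n"
    "\n"
    "刀鋒揮動，在陽光下閃爍，彷彿在說一刀奪命是仁慈的作為。\n"
    "The blade swung, sparkling in the sunlight as if to say that taking their lives in a single stroke would be an act of mercy.\n"
    "\n\n"
)
for translation, original in split_bilingual_pairs(example_output):
    print(translation)
    print(original)
    print()
```

Checking that the number of pairs matches the number of input sentences is a cheap way to notice skipped text.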

In [5]:
source_filenames = ["alice_in_wonderland.txt"]
for i, source_file in enumerate(source_filenames):
    print(source_file)
    path = source_file
    paragraphs = split_paragraph(path)[1:5]
    fixed_paragraphs = fix_paragraph(geminipro, paragraphs)
fixed_paragraphs

alice_in_wonderland.txt


  0%|          | 0/4 [00:00<?, ?it/s]

```json
[]
```

------

```json
[
  {
    "correction": "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'"
  },
  {
    "correction": "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
  }
]
```

------

```json
[
  {
    "correction": "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!  Oh dear!  I shall be late!'"
  },
  {
    "correction": "(when she thought it over afterwards, it occurred to her that she ough

['```json\n[]\n```',
 '```json\n[\n  {\n    "correction": "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,\' thought Alice `without pictures or conversation?\'"\n  },\n  {\n    "correction": "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."\n  }\n]\n```',
 '```json\n[\n  {\n    "correction": "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, \'Oh dear!  Oh dear!  I shall be late!\'"\n  },\n  {\n    "correction": "(when she thought it over afterwards, it occurred to her 