# Ascendance of a Bookworm MTL

Version 0.2

This notebook uses a custom machine translation model to translate the Ascendance of a Bookworm web novel (WN) into English.

This model is in BETA. Pronouns are not fixed yet, new characters' names may be wrong, and sentence splitting isn't implemented yet, so the model tends to produce a single long sentence. These issues will be fixed in the future.

If you encounter any poorly translated sentences and want to help improve the model, see the note at the bottom of the page.

To run this notebook, make sure you are using a GPU runtime and then go to Runtime > Run all. Once that is done, you can change the text in the translation cell and rerun it as often as you like by clicking the run button to the left of the cell.

In [None]:
#@title Run this once to set up the environment

!pip install transformers
!pip install accelerate
!pip install unidecode

In [None]:
#@title Run this once to import Python packages

from functools import partial
import torch
from torch.cuda.amp import autocast
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM, NllbTokenizerFast
import re
import unidecode
import unicodedata
import base64
import json

In [None]:
#@title Run this once to set the output language
#@markdown This model is multilingual! Here you can set the output language.
#@markdown It works best with English, but it can translate into other
#@markdown languages too. A couple are listed here, but you can enter a different
#@markdown one if you want. See pages 13-16 in [this pdf](https://arxiv.org/pdf/2207.04672.pdf)
#@markdown for a full list of supported languages.

target_language = 'eng_Latn' #@param ["eng_Latn", "spa_Latn", "fra_Latn", "deu_Latn"] {allow-input: true}
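
The language codes used here follow the FLORES-200 convention: a three-letter lowercase language code, an underscore, and a four-letter script code (for example `jpn_Jpan`). As a minimal sketch, a format check like the one below could catch typos in a hand-entered value before the model is loaded. The helper name `looks_like_flores_code` is hypothetical, and the check only tests the shape of the string, not whether the model actually supports that language.

```python
import re

# Shape of a FLORES-200 style code, e.g. "eng_Latn" or "jpn_Jpan".
# Assumption: this validates the format only, not membership in the supported list.
FLORES_CODE_SHAPE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_code(code: str) -> bool:
    """Return True if `code` has the xxx_Xxxx shape (hypothetical helper)."""
    return FLORES_CODE_SHAPE.fullmatch(code) is not None


for candidate in ["eng_Latn", "jpn_Jpan", "english", "eng-Latn"]:
    print(candidate, looks_like_flores_code(candidate))
```

A value that fails this check is almost certainly a typo; a value that passes still needs to appear in the list linked above.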

In [None]:
#@title Run this once to initialize the model

DEVICE = 'cuda:0'
model_checkpoint = "thefrigidliquidation/nllb-200-distilled-1.3B-bookworm"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, src_lang="jpn_Jpan", tgt_lang=target_language)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16).to(DEVICE)

In [None]:
#@title Run this once to set up the code to do the translating

JA_PUNCTUATION_REGEX = re.compile(r"([？。！?.!][？。！?.!\]」]*)")


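def char_filter(string):
    # Characters whose transliteration starts with a Latin letter (kana, kanji,
    # accented letters) are kept unchanged; everything else (punctuation,
    # symbols, full-width digits) is replaced by its ASCII transliteration.
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded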


def clean_string(string):
    s = "".join(char_filter(string))
    s = "\n".join((x.rstrip() for x in s.splitlines()))
    return s


def split_ja_sentences(text: str):
    for line in text.splitlines():
        splits = JA_PUNCTUATION_REGEX.split(line)
        if len(splits) == 1:
            yield line
            continue
        current = ""
        for split in splits:
            current += split
            if JA_PUNCTUATION_REGEX.fullmatch(split):
                yield current
                current = ""
        if current != "":
            yield current


def translate_m2m(translator, tokenizer: NllbTokenizerFast, device, pars, verbose: bool = False):
    en_pars = []
    pars_it = pars
    for line in pars_it:
        if line.strip() == "":
            en_pars.append("")
            continue
        inputs = tokenizer(line, return_tensors="pt")
        inputs = {k: v.to(device) for (k, v) in inputs.items()}
        generated_tokens = translator.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
            max_new_tokens=512,
            no_repeat_ngram_size=4,
        ).cpu()
        # Decoding needs no target-tokenizer context; as_target_tokenizer() is deprecated.
        outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        en_pars.append(outputs[0])
    return en_pars


translate = partial(translate_m2m, model, tokenizer, DEVICE)


def translate_long_text(text: str):
    lines = text.splitlines()
    with torch.no_grad():
        with autocast(dtype=torch.float16):
            for line in lines:
                sents = [clean_string(x).strip() for x in split_ja_sentences(line)]
                en_sents = translate(sents)
                print(" ".join(en_sents))
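
To see what the splitting step does to a line before it reaches the model, here is a small standalone sketch. It repeats the regular expression and generator from this cell so that it runs on its own, without the model or GPU; the sample sentence is made up for illustration.

```python
import re

# Same pattern as in the cell above: one sentence-ending mark followed by
# any further marks or closing brackets/quotes.
JA_PUNCTUATION_REGEX = re.compile(r"([？。！?.!][？。！?.!\]」]*)")


def split_ja_sentences(text: str):
    for line in text.splitlines():
        # The capture group makes re.split keep the punctuation as separate items.
        splits = JA_PUNCTUATION_REGEX.split(line)
        if len(splits) == 1:
            yield line  # no sentence-ending punctuation on this line
            continue
        current = ""
        for split in splits:
            current += split
            if JA_PUNCTUATION_REGEX.fullmatch(split):
                yield current  # punctuation closes the current sentence
                current = ""
        if current != "":
            yield current  # trailing text without closing punctuation


sample = "麗乃は本が好きだ。本当に？はい！"
sentences = list(split_ja_sentences(sample))
print(sentences)
```

Each yielded item keeps its closing punctuation, so joining the translated pieces with spaces, as `translate_long_text` does, preserves the sentence order of the original line.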

In [None]:
#@title Run this multiple times to translate the text

#@markdown Enter the Japanese text into the box on the left between the three quotation marks (""").
#@markdown Make sure there is no text on the lines containing the three quotes.
#@markdown See the example text for an idea of the formatting required.

text = """
本須もとす麗乃うらのは本が好きだ。

心理学、宗教、歴史、地理、教育学、民俗学、数学、物理、地学、化学、生物学、芸術、体育、言語、物語……人類の知識がぎっちり詰め込まれた本を心の底から愛している。

様々な知識が一冊にまとめられている本を読むと、とても得をした気分になれるし、自分がこの目で見たことがない世界を、本屋や図書館に並ぶ写真集を通して見るのも、世界が広がっていくようで陶酔できる。

外国の古い物語だって、違う時代の、違う国の風習が垣間見えて趣深いし、あらゆる分野において歴史があり、それを紐解いていけば、時間を忘れるなんていつものことである。

麗乃は、図書館の古い本が集められている書庫の、古い本独特の少々黴かび臭い匂いや埃っぽい匂いが好きで、図書館に行くとわざわざ書庫に入り込む。そこでゆっくりと古い匂いのする空気を吸い込み、年を経た本を見回せば、麗乃はそれだけで嬉しくなって、興奮してしまう。
"""[1:-1]

translate_long_text(text)

In [None]:
#@title Submit corrected sentences to improve the model!
#@markdown If you encounter poorly translated sentences with the wrong name or term, please correct them!
#@markdown You can use other translation sites (like [DeepL](https://www.deepl.com/translator))
#@markdown to make sure the Japanese and English sentences match.

#@markdown Then run this cell and message [u/thefrigidliquidation](https://www.reddit.com/user/thefrigidliquidation/)
#@markdown on Reddit with this cell's output. The output will look like
#@markdown gibberish; this is intentional, to prevent spoilers.

ja_sent = 'The Japanese sentence.' #@param {type:"string"}
en_sent = 'The corrected English sentence.' #@param {type:"string"}

df = {'translation': {'en': en_sent, 'ja': ja_sent}}
df_json = json.dumps(df)

print(base64.b64encode(df_json.encode('ascii')).decode('ascii'))
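
For reference, the string printed by this cell is plain Base64 over JSON, so it can be turned back into the original sentence pair. The sketch below shows the round trip with placeholder sentences; the helper names `encode_correction` and `decode_correction` are hypothetical and not part of the notebook.

```python
import base64
import json


def encode_correction(ja_sent: str, en_sent: str) -> str:
    """Pack a sentence pair the same way the cell above does (hypothetical helper)."""
    payload = {'translation': {'en': en_sent, 'ja': ja_sent}}
    # json.dumps escapes non-ASCII characters by default (\uXXXX),
    # so encoding the result as ASCII is safe even for Japanese text.
    return base64.b64encode(json.dumps(payload).encode('ascii')).decode('ascii')


def decode_correction(blob: str) -> dict:
    """Recover the sentence pair from the Base64 string (hypothetical helper)."""
    return json.loads(base64.b64decode(blob).decode('ascii'))


blob = encode_correction('本が好きだ。', 'I love books.')
print(blob)
print(decode_correction(blob))
```

The encoding is only there to hide the text from casual reading; it is not encryption, and anyone can decode it.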
