# Notebook 1: Text Preprocessing

In this Jupyter Notebook, we demonstrate the basic preprocessing of the OCR'd texts in the Atomic Bomb Literature Corpus.

## OCR

**Book source**: 15-volume *Anthology of Japanese Atomic Bomb Literature* (日本の原爆文学, 東京: ほるぷ出版, 1983)\
**Number of volumes**: 13 (the last two volumes, which contain non-fiction, are excluded)\
**Number of works**: 106
\
\
The texts were OCR'd with ABBYY FineReader 15. Because the program can validate the OCR output against the raw page image and flag characters it recognized with low confidence, a researcher reviewed each page immediately after OCR and manually replaced any misrecognized characters. By the researcher's subjective estimate, this amounted to about one or two corrections per page, not counting the common misrecognition patterns (see below).

## Cleaning Texts

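The cleaning step needs a regular-expression library. The code below uses the third-party `regex` package, because it supports Unicode script properties such as `\p{Script=Han}`, which the standard `re` module does not.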

In [None]:
import os
import regex

Japanese text uses a dedicated full-width (ideographic) space, "　" (U+3000). Ordinary half-width spaces do not normally occur in Japanese, except inside embedded foreign-language text, so any that appear between Japanese characters are OCR artifacts. These misrecognized spaces were deleted only between Japanese characters, which preserves the spaces in foreign-language text.
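
To see which space characters a line actually contains, a short check with the standard `unicodedata` module can help. This is only an illustrative sketch; the `sample` string is made up and is not taken from the corpus.

```python
import unicodedata

# Hypothetical sample line: half-width spaces plus one full-width space
sample = "広島 の　空 over Hiroshima"

# Report every distinct whitespace character with its code point, name, and count
for ch in sorted(set(sample)):
    if ch.isspace():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {sample.count(ch)}")
```

The half-width space is reported as `SPACE` and the Japanese one as `IDEOGRAPHIC SPACE`, which makes the two easy to tell apart even though they look similar on screen.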

In [25]:
def is_japanese(char):
    """Check whether a character is hiragana, katakana, or a common kanji."""
    unicode_point = ord(char)
    return (0x3040 <= unicode_point <= 0x309F or    # hiragana
            0x30A0 <= unicode_point <= 0x30FF or    # katakana
            0x4E00 <= unicode_point <= 0x9FAF)      # CJK unified ideographs (kanji)

def clear_extra_spaces(input_text: str) -> str:
    """Remove stray half-width spaces; full-width Japanese spaces are left untouched."""
    text = list(input_text)
    extra_spaces_index = set()  # a set keeps the membership test below O(1)

    for i in range(1, len(text) - 1):
        if text[i] == " ":
            # Candidate for deletion: a space between two Japanese characters, or part of a run of spaces
            if (is_japanese(text[i-1]) and is_japanese(text[i+1])) or (text[i-1] == " " or text[i+1] == " "):
                # Keep the first space of a run that follows a non-Japanese character
                if (not is_japanese(text[i-1]) and text[i-1] != " ") and text[i+1] == " ":
                    continue
                extra_spaces_index.add(i)

    no_extra_spaces = [text[i] for i in range(len(text)) if i not in extra_spaces_index]
    return "".join(no_extra_spaces)
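
The central rule, dropping a half-width space only when both of its neighbours are Japanese, can also be sketched as a single regular expression using the standard `re` module. This is a simplified illustration and not a replacement for the function above: it does not collapse runs of spaces in foreign-language text, and `drop_spaces_between_japanese` is a hypothetical name.

```python
import re

# Hiragana, katakana, and CJK unified ideographs (the same ranges as is_japanese)
JP = r"\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF"

# One or more half-width spaces with a Japanese character on both sides.
# The lookarounds check the neighbours without consuming them.
SPACE_BETWEEN_JP = re.compile(rf"(?<=[{JP}]) +(?=[{JP}])")

def drop_spaces_between_japanese(text: str) -> str:
    """Hypothetical helper: delete half-width spaces between two Japanese characters."""
    return SPACE_BETWEEN_JP.sub("", text)

sample = "原 爆の図 and the Hiroshima Panels"
print(drop_spaces_between_japanese(sample))
```

The space inside "原 爆" is removed, while the spaces around the English words and any full-width spaces stay as they are.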


In [26]:
def remove_newline_between_japanese(text):
    """Join lines that OCR broke in the middle of running Japanese text."""
    japanese = r'[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]'
    # Match a newline that has a Japanese character on both sides. Lookarounds do not
    # consume the neighbours, so consecutive one-character lines are joined as well.
    pattern = rf'(?<={japanese})\n(?={japanese})'
    return regex.sub(pattern, '', text)
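
One subtlety when joining lines with a regular expression: a pattern that consumes the characters on both sides of the newline cannot reuse a character for the next match, so consecutive one-character lines are only partly joined. The sketch below compares a consuming pattern with a lookaround pattern. It uses the standard `re` module with explicit code-point ranges as a rough stand-in for the script properties, and the names `JP`, `consuming`, and `lookaround` are illustrative only.

```python
import re

# Approximation of the script properties: kana plus CJK unified ideographs
JP = "[\u3040-\u30FF\u4E00-\u9FFF]"

consuming = re.compile(f"({JP})\n({JP})")       # captures both neighbours
lookaround = re.compile(f"(?<={JP})\n(?={JP})")  # only inspects the neighbours

text = "あ\nい\nう"

joined_consuming = consuming.sub(r"\1\2", text)
joined_lookaround = lookaround.sub("", text)

print(repr(joined_consuming))   # the second newline survives
print(repr(joined_lookaround))  # both newlines are removed
```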
    

Reviewing the OCR output across the corpus revealed a set of recurring recognition errors. Manual inspection during the OCR process showed that each pattern listed in the code below consistently stands for one specific character or character sequence, mostly punctuation marks.

In [27]:
def correct_ocr_errors(input_text: str) -> str:
    """Replace the common OCR errors observed in this corpus."""
    circle = clear_extra_spaces(input_text)
    circle = circle.replace(":::", "……")
    circle = circle.replace(":：:", "……")
    circle = circle.replace("	", "……")
    circle = circle.replace("・て", "で")
    circle = circle.replace("•て", "で")
    circle = circle.replace("•", "・")
    # ":：:" has already become "……" at this point, so match the rewritten form
    circle = circle.replace("……〇", "……。")
    circle = circle.replace("^", "。")
    circle = circle.replace("た〇", "た。")
    circle = circle.replace("た0", "た。")
    circle = remove_newline_between_japanese(circle)
    return circle
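
Because `str.replace` calls run in sequence, each rule sees the output of the earlier ones, so a longer pattern has to be applied before any shorter pattern it contains. A minimal sketch of that idea as an ordered table follows. `OCR_FIXES` and `apply_fixes` are hypothetical names, and the table holds only a subset of the rules from the cell above.

```python
# Ordered (misrecognized, corrected) pairs; longer patterns come before their prefixes
OCR_FIXES = [
    (":：:〇", "……。"),
    (":：:", "……"),
    (":::", "……"),
    ("•て", "で"),
    ("•", "・"),
    ("た〇", "た。"),
]

def apply_fixes(text: str, fixes=OCR_FIXES) -> str:
    """Hypothetical helper: apply each replacement in table order."""
    for wrong, right in fixes:
        text = text.replace(wrong, right)
    return text

print(apply_fixes("そうだった:：:〇"))
print(apply_fixes("広島•てある"))
```

Keeping the rules in one list makes the order explicit and lets new patterns be added without touching the function body.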

In [28]:
# This cell is an example: adjust the folder names to your own layout before running it.
folders = os.listdir("texts")
for subfolder in folders:
    # os.path.join keeps the paths portable across Windows, macOS, and Linux
    source_dir = os.path.join("texts", subfolder)
    target_dir = os.path.join("preprocessed_texts", subfolder)
    os.makedirs(target_dir, exist_ok=True)

    for doc in os.listdir(source_dir):
        with open(os.path.join(source_dir, doc), encoding="utf-8") as file:
            text = file.read()
        cleaned_text = correct_ocr_errors(text)
        with open(os.path.join(target_dir, doc), encoding="utf-8", mode="w") as file:
            file.write(cleaned_text)
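
The same folder-mirroring step can be sketched with `pathlib`, which may be easier to adapt to a different directory layout. This is an illustration under assumptions, not the notebook's own code: `process_tree` is a hypothetical helper, the demo runs on a throwaway temporary directory with made-up file names, and `str.strip` stands in for `correct_ocr_errors`.

```python
import tempfile
from pathlib import Path

def process_tree(src: Path, dst: Path, transform) -> int:
    """Hypothetical helper: apply `transform` to every file in src/<subfolder>/ and mirror it under dst."""
    written = 0
    for path in sorted(src.glob("*/*")):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(transform(path.read_text(encoding="utf-8")), encoding="utf-8")
        written += 1
    return written

# Demo on a temporary directory; str.strip stands in for the real cleaning function
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "texts" / "author_a").mkdir(parents=True)
    (root / "texts" / "author_a" / "work1.txt").write_text("  原爆  ", encoding="utf-8")
    count = process_tree(root / "texts", root / "preprocessed_texts", str.strip)
    result = (root / "preprocessed_texts" / "author_a" / "work1.txt").read_text(encoding="utf-8")
    print(count, repr(result))
```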
    


In [4]:
# Merge all preprocessed works in each folder into a single file per author
os.makedirs("texts per author", exist_ok=True)
folders = os.listdir("preprocessed_texts")
for folder in folders:
    folder_path = os.path.join("preprocessed_texts", folder)
    compound_text = ""
    for doc in sorted(os.listdir(folder_path)):  # sorted for a reproducible order
        with open(os.path.join(folder_path, doc), encoding="utf-8") as file:
            compound_text += file.read()
    with open(os.path.join("texts per author", f"{folder}.txt"), encoding="utf-8", mode="w") as file:
        file.write(compound_text)