# Notebook 2: Direct Speech Extraction

In this notebook, we extract direct speech. We assume that in Japanese texts direct speech is enclosed in corner brackets 「」, which matches the conventions of modern written Japanese. Other marks, such as 『』 and 〈〉, are used for emphasis, proper names, or book titles, among other things.

For European languages, Byszuk et al. evaluated rule-based direct speech extraction. For languages with clear typographic conventions, such as English, the regex-based system achieved high precision and recall (around 0.98 and 0.99, respectively). We can therefore expect robust results for Japanese, whose direct speech marking is well conventionalized.
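
As a reminder of what the two metrics measure, here is a minimal sketch of how precision and recall are computed from hand-checked counts. The function name `precision_recall` and the counts are hypothetical and chosen only for illustration; they are not results from this corpus.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Hypothetical helper: precision and recall from raw counts."""
    # Precision: share of extracted spans that really are direct speech
    precision = true_pos / (true_pos + false_pos)
    # Recall: share of real direct speech spans that were extracted
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Illustrative counts only (assumed, not measured)
p, r = precision_recall(true_pos=98, false_pos=2, false_neg=1)
print(f"precision={p:.2f}, recall={r:.2f}")
```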


In [20]:
import os
import re
import random

This regex follows the convention for marking direct speech in modern Japanese.

In [21]:
def ds_extract(input_text:str):
    """extracts direct speech from an input text"""
    pattern = r'「(.*?)」'
    direct_speech = re.findall(pattern, input_text, re.DOTALL)
    return "\n".join(direct_speech)
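
A quick check of the pattern on made-up sentences shows both what it captures and where it could fall short. The sample strings and the helper `extract_balanced` are hypothetical. Nested 「」 are uncommon in modern Japanese, where 『』 usually marks a quote inside a quote, so the depth-counting variant is only a sketch of one possible safeguard, not part of the pipeline.

```python
import re

PATTERN = r'「(.*?)」'  # same pattern as in ds_extract

# Book title in 『』 is ignored; both 「」 utterances are captured
sample = "彼は『吾輩は猫である』を閉じて、「もう寝よう」と言った。\n「そうね」と彼女は答えた。"
simple_hits = re.findall(PATTERN, sample, re.DOTALL)

# With nested 「」 the non-greedy match stops at the first closing bracket
nested = "「彼は「行く」と言った」"
nested_hits = re.findall(PATTERN, nested, re.DOTALL)


def extract_balanced(text: str) -> list:
    """Hypothetical helper: return outermost 「…」 spans by tracking depth."""
    spans, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "「":
            if depth == 0:
                start = i + 1  # content begins after the opening bracket
            depth += 1
        elif ch == "」" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i])
    return spans


balanced_hits = extract_balanced(nested)
print(simple_hits, nested_hits, balanced_hits)
```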

Here, we build the corpus of extracted direct speech.

In [None]:
# Mirror the folder structure of preprocessed_texts under "direct speech"
folders = os.listdir("preprocessed_texts")
for subfolder in folders:
    # os.path.join keeps the paths portable across operating systems
    out_dir = os.path.join("direct speech", subfolder)
    os.makedirs(out_dir, exist_ok=True)

    in_dir = os.path.join("preprocessed_texts", subfolder)
    for doc in os.listdir(in_dir):
        with open(os.path.join(in_dir, doc), encoding="utf-8") as file:
            text = file.read()
        direct_speech = ds_extract(text)
        with open(os.path.join(out_dir, doc), encoding="utf-8", mode="w") as file:
            file.write(direct_speech)

Random sampling of direct speech from the entire corpus: up to 4 lines from each text.

In [23]:
def random_ds(input_text: str):
    """returns four random direct speech lines, or all lines if there are four or fewer"""
    sents = input_text.split("\n")
    if len(sents) > 4:
        return random.sample(sents, 4)
    else:
        return sents
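
Because `random.sample` draws differently on every run, the sampled file changes each time the notebook is executed. If a repeatable sample is wanted, one option is to pass a seed, as in the sketch below. The function `sample_lines` and its `k` and `seed` parameters are hypothetical and not part of the original notebook.

```python
import random


def sample_lines(input_text: str, k: int = 4, seed=None):
    """Hypothetical variant of random_ds: up to k lines, repeatable when seed is set."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    sents = input_text.split("\n")
    if len(sents) > k:
        return rng.sample(sents, k)
    return sents


demo = "\n".join(f"line {i}" for i in range(10))
first = sample_lines(demo, seed=42)
second = sample_lines(demo, seed=42)
print(first == second)  # same seed, same sample
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the seeding local to this function.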
 

In [24]:
random_sampling = []
folders = os.listdir("direct speech")
for folder in folders:
    folder_path = os.path.join("direct speech", folder)
    for doc in os.listdir(folder_path):
        with open(os.path.join(folder_path, doc), encoding="utf-8") as file:
            text = file.read()
        random_sampling += random_ds(text)
# Write all sampled lines to a single file for manual inspection
with open("direct speech sampling JA new.txt", encoding="utf-8", mode="w") as file:
    file.write("\n".join(random_sampling))

In [26]:
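def aus_extract(input_text: str):
    """extracts author's speech from an input text"""
    pattern = r'「(.*?)」'
    # Use the function argument (not the global `text`) and match across
    # line breaks, consistent with ds_extract
    pre_author_speech = re.sub(pattern, "", input_text, flags=re.DOTALL)
    # Collapse runs of blank lines left behind by removed quotations
    author_speech = re.sub(r"\n{2,}", "\n", pre_author_speech)
    return author_speech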
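
To see what the narration looks like once quotations are stripped, the sketch below applies the same two substitutions to a made-up passage. The sample text is hypothetical. Note that when a quotation is embedded in a sentence, the framing clause (here と花子は答えた。) remains in the author's speech.

```python
import re

QUOTE = r'「(.*?)」'  # same pattern as used for direct speech

sample = "太郎は笑った。\n「行こう」\n「そうね」と花子は答えた。\n二人は歩き出した。"

# Remove quoted spans, then collapse the blank lines they leave behind
narration = re.sub(QUOTE, "", sample, flags=re.DOTALL)
narration = re.sub(r"\n{2,}", "\n", narration)
print(narration)
```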
    

In [27]:
# Mirror the folder structure of preprocessed_texts under "author's speech"
folders = os.listdir("preprocessed_texts")
for subfolder in folders:
    out_dir = os.path.join("author's speech", subfolder)
    os.makedirs(out_dir, exist_ok=True)

    in_dir = os.path.join("preprocessed_texts", subfolder)
    for doc in os.listdir(in_dir):
        with open(os.path.join(in_dir, doc), encoding="utf-8") as file:
            text = file.read()
        author_speech = aus_extract(text)
        with open(os.path.join(out_dir, doc), encoding="utf-8", mode="w") as file:
            file.write(author_speech)


Random sampling of author's speech from the entire corpus: up to 4 lines from each text.

In [28]:
def random_aus(input_text: str):
    """returns four random author's speech lines, or all lines if there are four or fewer"""
    sents = input_text.split("\n")
    if len(sents) > 4:
        return random.sample(sents, 4)
    else:
        return sents

In [29]:
random_sampling = []
folders = os.listdir("author's speech")
for folder in folders:
    folder_path = os.path.join("author's speech", folder)
    for doc in os.listdir(folder_path):
        with open(os.path.join(folder_path, doc), encoding="utf-8") as file:
            text = file.read()
        random_sampling += random_aus(text)
# Write all sampled lines to a single file for manual inspection
with open("author's speech sampling JA new.txt", encoding="utf-8", mode="w") as file:
    file.write("\n".join(random_sampling))

### References
Byszuk, Joanna, et al. “Detecting Direct Speech in Multilingual Collection of 19th-Century Novels.” Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages, European Language Resources Association, 2020, pp. 100–04.
