# Direct Speech Extraction
This notebook extracts direct speech from the corpus. We rely on the basic assumption that in Japanese texts direct speech is enclosed in the quotation marks 「」. We are not aware of research on the accuracy of this approach for Japanese, but research by Maciej Eder on European languages shows that it is rather accurate there. For Japanese it should be even more accurate, because separate, more specific marks are used for book titles and other purposes, such as 『』 and 〈〉.

In [20]:
import os
import re
import random

In [21]:
def ds_extract(input_text:str):
    """extracts direct speech from an input text"""
    pattern = r'「(.*?)」'
    direct_speech = re.findall(pattern, input_text, re.DOTALL)
    return "\n".join(direct_speech)
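
To illustrate the assumption above, here is a minimal sketch that runs the same pattern on a made-up sample. The sample string is hypothetical and not taken from the corpus, and the function is repeated only so that the example runs on its own. The title in 『』 is ignored, while a quotation that spans a line break is kept but ends up on two output lines.

```python
import re

def ds_extract(input_text: str) -> str:
    """Same logic as the notebook's ds_extract, repeated so the example is self-contained."""
    return "\n".join(re.findall(r'「(.*?)」', input_text, re.DOTALL))

# Hypothetical sample: one short quote, one book title in 『』,
# and one quote that contains a line break.
sample = "彼は「おはよう」と言った。『雪国』を読んだ。彼女は「行き\nましょう」と答えた。"

extracted = ds_extract(sample)
print(extracted)
```

Because the output is joined with newlines, a quotation containing a line break occupies more than one line in the result, which matters for any later step that treats each line as one utterance.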



Here we compose the corpus of extracted direct speech.

In [22]:
folders = os.listdir("preprocessed_texts")
for subfolder in folders:
    print(subfolder)
    # os.path.join keeps the paths portable across operating systems
    out_dir = os.path.join("direct speech", subfolder)
    os.makedirs(out_dir, exist_ok=True)

    in_dir = os.path.join("preprocessed_texts", subfolder)
    for doc in os.listdir(in_dir):
        with open(os.path.join(in_dir, doc), encoding="utf-8") as file:
            text = file.read()
        direct_speech = ds_extract(text)
        with open(os.path.join(out_dir, doc), encoding="utf-8", mode="w") as file:
            file.write(direct_speech)




Abe Kazue - M
Ariyoshi Sawako - F
Fujimoto Hitoshi - M
Fumizawa Ryuichi - M
Hara Tamiki - M
Hashioka Takeshi - M
Hayashi Kyoko - F
Hironaka Toshio - F
Hosoda Tamiki - M
Hotta Yoshie - M
Iida Momo - M
Ikuguchi Juro - M
Inada Mihoko - F
Inoue Mitsuhara - M
Ishida Koji - M
Iwasaki Seiichiro - M
Kajiyama Tohiyuki - M
Kamezawa Miyuki - M
Kanai Toshihiro - M
Katsura Yoshihisa - M
Kawakami Sokun - M
Kokubo Hitoshi - M
Kora Chihoko - F
Kurita Tohei - M
Kyo Kusao - M
Mikawa Kiyo - F
Nakai Masafumi - M
Nakamoto Takako - F
Nakayama Shiro - M
Nakazato Kisho - M
Natsubori Masamoto - M
Nishihara Kei - M
Ochi Michio - M
Oda Katsuzo - M
Oda Makoto - M
Oe Kenzaburo - M
Ota Yoko - F
Saiki Hisao - M
Sata Ineko - F
Takeda Taijun - M
Takenishi Hiroko - F
Tsukuda Jitsuo - M


Random sampling of direct speech from the entire corpus: up to 4 lines from each text.

In [23]:
def random_ds(input_text:str):
    "returns four random direct speech lines if the text has more than four lines, otherwise all lines"
    sents = input_text.split("\n")
    if len(sents) > 4:
        return random.sample(sents,4)
    else:
        return sents
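
Since `random.sample` draws differently on each run, the sampled file changes every time the notebook is executed. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the study. The names `random_ds_seeded` and `rng`, the seed value 0, and the filtering of empty lines are hypothetical choices and not part of the original notebook.

```python
import random

def random_ds_seeded(input_text: str, rng: random.Random, k: int = 4) -> list:
    """Return up to k random non-empty lines, drawn from a caller-supplied generator."""
    # Assumption: empty lines (e.g. from a text with no direct speech) are dropped.
    sents = [s for s in input_text.split("\n") if s]
    if len(sents) > k:
        return rng.sample(sents, k)
    return sents

# Two generators created with the same seed yield the same sample.
text = "\n".join(f"line {i}" for i in range(10))
first = random_ds_seeded(text, random.Random(0))
second = random_ds_seeded(text, random.Random(0))
print(first == second)
```

Passing one shared `random.Random` instance through the whole loop would keep the complete sampling file stable between runs, provided the files are read in the same order.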
 

In [24]:
random_sampling = []
folders = os.listdir("direct speech")
for folder in folders:
    files = os.listdir(os.path.join("direct speech", folder))
    for doc in files:
        with open(os.path.join("direct speech", folder, doc), encoding="utf-8") as file:
            text = file.read()
        random_sampling += random_ds(text)
with open("direct speech sampling JA new.txt", encoding="utf-8", mode="w") as file:
    file.write("\n".join(random_sampling))

In [25]:
#later fix when punctuation marks are removed

In [26]:
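def aus_extract(input_text:str):
    """extracts author's speech from an input text"""
    pattern = r'「(.*?)」'
    # remove the quoted spans from the argument itself (not the global `text`);
    # re.DOTALL matches ds_extract, so quotes spanning several lines are removed too
    pre_author_speech = re.sub(pattern, "", input_text, flags=re.DOTALL)
    author_speech = re.sub("\n\n","\n",pre_author_speech)
    return author_speech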
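
As a quick check of the complementary extraction, the sketch below removes the quoted spans from a made-up sample. The sample string is hypothetical. The function is restated so the example runs alone, and it assumes the function operates on its own argument with `re.DOTALL`, as `ds_extract` does.

```python
import re

def aus_extract(input_text: str) -> str:
    """Remove every 「...」 span, then collapse doubled newlines left behind."""
    without_quotes = re.sub(r'「(.*?)」', "", input_text, flags=re.DOTALL)
    return re.sub("\n\n", "\n", without_quotes)

# Hypothetical sample: an inline quote and a line that consists only of a quote.
sample = "彼は「おはよう」と言った。\n「行きましょう」\n彼女は笑った。"

author_part = aus_extract(sample)
print(author_part)
```

Note that removing an inline quotation leaves the surrounding narration joined without a gap (here `彼はと言った。`), so the author's speech is not always a grammatical sentence on its own.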
    

In [27]:
folders = os.listdir("preprocessed_texts")
for subfolder in folders:
    # os.path.join keeps the paths portable across operating systems
    out_dir = os.path.join("author's speech", subfolder)
    os.makedirs(out_dir, exist_ok=True)

    in_dir = os.path.join("preprocessed_texts", subfolder)
    for doc in os.listdir(in_dir):
        with open(os.path.join(in_dir, doc), encoding="utf-8") as file:
            text = file.read()
        author_speech = aus_extract(text)
        with open(os.path.join(out_dir, doc), encoding="utf-8", mode="w") as file:
            file.write(author_speech)


Random sampling of author's speech from the entire corpus.

In [28]:
def random_aus(input_text:str):
    "returns four random author's speech lines if the text has more than four lines, otherwise all lines"
    sents = input_text.split("\n")
    if len(sents) > 4:
        return random.sample(sents, 4)
    else:
        return sents

In [29]:
random_sampling = []
folders = os.listdir("author's speech")
for folder in folders:
    files = os.listdir(os.path.join("author's speech", folder))
    for doc in files:
        with open(os.path.join("author's speech", folder, doc), encoding="utf-8") as file:
            text = file.read()
        random_sampling += random_aus(text)
with open("author's speech sampling JA new.txt", encoding="utf-8", mode="w") as file:
    file.write("\n".join(random_sampling))