# **Data Collection**

I collected most of the sentences from the `KeiCo Corpus`, which was previously used in the project **Construction and Validation of a Japanese Honorific Corpus Based on Systemic Functional Linguistics** (https://aclanthology.org/2022.dclrl-1.3/). I chose this corpus because it is organized in order of politeness and every sentence is already labeled with a politeness level. However, its labeling scheme differs slightly from the one used in this project, so I modified the corpus in the following ways.

* Although the corpus contains around 10,000 sentences, only 4,000 sentences in total were collected, to reduce the annotation workload.

* The corpus assigns four levels. Since this project divides politeness into three levels (Polite = Level 1 / Neutral = Level 2 / Impolite = Level 3), corpus sentences at level 1 (the highest honorific level) and level 2 (the secondary honorific level) are merged into level 1 in the project dataset.

* Corpus sentences at level 3 were excluded from the project dataset because level 3 contains both polite and impolite expressions, which blurs the boundary between politeness levels.

* Corpus sentences at level 4, which contains only impolite expressions, are used as level 3 in the project dataset.

* For neutral sentences (level 2), text from Wikipedia is used.
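
The relabeling described in the list above can be sketched as a small lookup table. This is only an illustration: the name `CORPUS_TO_PROJECT_LEVEL`, the `relabel` helper, and the `(sentence, corpus_level)` row layout are assumptions, not the actual structure of the corpus file.

```python
# Hypothetical mapping from the corpus's four levels to this project's three levels.
# None marks corpus levels that are dropped from the project dataset.
CORPUS_TO_PROJECT_LEVEL = {1: 1, 2: 1, 3: None, 4: 3}

# Project level 2 (Neutral) is filled from Wikipedia text, not from the corpus.
NEUTRAL_LEVEL = 2


def relabel(rows):
    """Yield (sentence, project_level) pairs, skipping dropped corpus levels."""
    for sentence, corpus_level in rows:
        project_level = CORPUS_TO_PROJECT_LEVEL[corpus_level]
        if project_level is not None:
            yield sentence, project_level


# Made-up placeholder rows: (sentence, corpus_level)
sample = [("文A", 1), ("文B", 2), ("文C", 3), ("文D", 4)]
relabeled = list(relabel(sample))
print(relabeled)
```

With these placeholder rows, the level 3 sentence is dropped, and the level 4 sentence ends up at project level 3.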


### **Collecting Data from Wikipedia**

In [None]:
!pip install wikipedia

In [2]:
import wikipedia

The code below randomly chooses a Wikipedia page written in Japanese.
The page text is split into sentences at `。`, and the sentences that pass the filter are appended to a txt file, one per line.

During extraction, the following kinds of text are filtered out with regular expressions.

Anything starting with:

* `0-9` 

* `=`

* `ISBN` 

* `http`

* `www`

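Text starting with these is unlikely to be a sentence. The code also skips the first sentence of each page and any segment shorter than 10 characters.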
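
As a quick illustration of the prefix filter, the sketch below applies one combined pattern to a few made-up sample strings. The names `SKIP_PATTERN` and `looks_like_sentence` are hypothetical, and the samples are placeholders rather than real Wikipedia text.

```python
import re

# Matches a leading digit, "=", "ISBN", "http", or "www" (case-insensitive).
# re.match only matches at the start of the string, so no "^" is needed.
SKIP_PATTERN = re.compile(r"(\d|=|ISBN|http|www)", re.IGNORECASE)


def looks_like_sentence(text: str) -> bool:
    """Return True if the stripped text does not start with a filtered prefix."""
    return SKIP_PATTERN.match(text.strip()) is None


samples = [
    "2020年に設立された",
    "== 概要 ==",
    "ISBN 000-0-00-000000-0",
    "https://example.com",
    "www.example.com",
    "東京都は日本の首都である",
]
kept = [s for s in samples if looks_like_sentence(s)]
print(kept)
```

Only the last sample survives. Note that in Python 3, `\d` on a `str` pattern matches any Unicode decimal digit, so text starting with full-width digits such as `１９９０年` is filtered as well, not just ASCII `0-9`.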

In [None]:
import re

# Set language to Japanese
wikipedia.set_lang("ja") 

random_title = wikipedia.random()

try:
    page = wikipedia.page(random_title)

    # Read the input text
    input_text = page.content

    # Split text into sentences based on "。" (full stop)
    sentences = input_text.split("。")

    # Skip the first sentence (treated as the title), drop segments shorter than 10 characters,
    # and filter out sentences starting with a digit or "=|ISBN|http|www".
    filtered_sentences = [
        sentence.strip() for i, sentence in enumerate(sentences) if i > 0 and len(sentence) >= 10
        # Remove sentences starting with numbers
        and not re.match(r"^\d+", sentence.strip()) 
        # IGNORECASE: Case-insensitive URL/ISBN check
        and not re.match(r'^(=|ISBN|http|www)', sentence.strip(), re.IGNORECASE)]

    if filtered_sentences:
        # Put one sentence per line and restore the "。" that split() removed
        filtered_text = "。\n".join(filtered_sentences) + "。"
        print(filtered_text)

        # Append to file
        # with open("wiki_appended.txt", "a", encoding="utf-8") as file:
        #     file.write("\n" + filtered_text)

        # print("Filtered text appended to 'wiki_appended.txt'")
    else:
        print("No sentences left after filtering.")

except wikipedia.exceptions.PageError:
    print("Page not found.")
except wikipedia.exceptions.DisambiguationError as e:
    print(f"Disambiguation Error. Options: {e.options}")
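
The cell above handles a single random page. To reach a target number of sentences, the same filtering can be repeated over many pages. The sketch below is one possible way to do that, not the procedure actually used here: `filter_page`, `collect_sentences`, `fetch_content`, and `max_pages` are hypothetical names, and the demonstration uses a stand-in function instead of a real Wikipedia call so that it runs offline.

```python
import re

SKIP_PATTERN = re.compile(r"(\d|=|ISBN|http|www)", re.IGNORECASE)


def filter_page(content, min_length=10):
    """Split one page's text on "。" and apply the same filters as the cell above."""
    segments = content.split("。")
    return [
        s.strip() for i, s in enumerate(segments)
        if i > 0 and len(s) >= min_length and not SKIP_PATTERN.match(s.strip())
    ]


def collect_sentences(fetch_content, target=2000, max_pages=1000):
    """Call fetch_content() repeatedly until `target` sentences are collected.

    fetch_content is a hypothetical callable that returns one page's text,
    or None when the page could not be loaded.
    """
    collected = []
    for _ in range(max_pages):
        if len(collected) >= target:
            break
        content = fetch_content()
        if content:
            collected.extend(filter_page(content))
    return collected[:target]


# Offline stand-in for a Wikipedia fetch (made-up text)
def fake_fetch():
    return "見出しの文。これは十文字以上ある一つ目の文です。短い文。これも十文字以上ある二つ目の文です。"


result = collect_sentences(fake_fetch, target=3)
print(len(result))
```

With the `wikipedia` package, `fetch_content` could wrap `wikipedia.page(wikipedia.random()).content` and return `None` when `PageError` or `DisambiguationError` is raised. Because random pages can repeat, duplicate sentences may need to be removed afterwards.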

    


After up to 2,000 sentences have been extracted, the txt file is appended to the CSV file, which already contains 4,000 sentences (level 1 and 2).

In [None]:
import csv

txt_file = "wiki_appended.txt"
csv_file = "keico_corpus(forLREC)-OldVersion_all.csv"

# Append each non-empty line of the txt file to the CSV as a single-column row
with open(txt_file, "r", encoding="utf-8") as txt, open(csv_file, "a", encoding="utf-8", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)

    for line in txt:
        cleaned_line = line.strip()
        # Skip blank lines
        if cleaned_line:
            csv_writer.writerow([cleaned_line])