# **Data Collection**

#### All datasets: https://github.com/shuhashi0352/Japanese-Politeness-Classification/tree/main/Datasets

### **Collecting Data from KeiCO Corpus**

I collected most of the sentences from the KeiCO Corpus, which was originally developed for the project titled "Construction and Validation of a Japanese Honorific Corpus Based on Systemic Functional Linguistics" (https://aclanthology.org/2022.dclrl-1.3/). This corpus was selected because it is systematically organised by levels of politeness, with each sentence already annotated for its degree of honorific usage. However, since the labeling scheme used in the corpus differs slightly from the one adopted in this project, several modifications were made:

* Although the original corpus includes approximately 10,000 sentences, only 4,000 were used in this project to limit annotation effort and keep the dataset manageable.

* The original four-level scale in the KeiCO Corpus was mapped onto a three-class system for this project: sentences labeled as Level 1 (the highest honorific level) and Level 2 (secondary honorific level) were merged and assigned to the Polite category (Level 1 in this project’s terms).

* Sentences from Level 3 were excluded, as they often include both polite and impolite expressions, making the classification boundary too ambiguous for consistent annotation.

* Sentences from Level 4, which contain primarily impolite expressions, were retained and relabeled as Impolite (Level 3).

* For the Neutral (Level 2) category, additional sentences were independently sampled from Japanese Wikipedia to ensure domain separation and to represent descriptive, non-stylised language.
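The relabeling rules above can be summarised in a few lines of Python. This is only a minimal sketch: the function name `map_keico_level` and the integer encoding of the KeiCO levels are assumptions for illustration, not part of the corpus or of the project code.

```python
from typing import Optional

# Project labels as described above: 1 = Polite, 2 = Neutral, 3 = Impolite.
POLITE, NEUTRAL, IMPOLITE = 1, 2, 3


def map_keico_level(keico_level: int) -> Optional[int]:
    """Map a KeiCO level (1-4) to this project's label, or None if excluded.

    Hypothetical helper: assumes KeiCO levels are stored as the integers 1-4.
    """
    if keico_level in (1, 2):
        return POLITE      # Levels 1 and 2 are merged into Polite
    if keico_level == 3:
        return None        # Level 3 is excluded as too ambiguous
    if keico_level == 4:
        return IMPOLITE    # Level 4 is relabeled as Impolite
    raise ValueError(f"Unexpected KeiCO level: {keico_level}")


project_labels = [map_keico_level(level) for level in (1, 2, 3, 4)]
print(project_labels)  # [1, 1, None, 3]
```

The Neutral label never comes out of this mapping, because Neutral sentences are drawn from Wikipedia rather than from KeiCO.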


### **Collecting Data from Wikipedia**

The following code selects a random Wikipedia page written in Japanese. The page content is split into sentences at the full stop "。", and the sentences that pass the filter are appended to a text file for storage.

During this extraction process, specific types of sentences are excluded using regular expressions to ensure the quality and relevance of the collected data.

Sentences that begin with any of the following are excluded:

* `0-9` 

* `=`

* `ISBN` 

* `http`

* `www`

Text starting with any of these is unlikely to be a natural-language sentence.
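The exclusion rules can be tried offline on a few hand-written strings before running them against live Wikipedia pages. The sketch below is illustrative only: the helper name `is_candidate_sentence` and the sample strings are made up, and the 10-character minimum is taken from the notebook code rather than from the list above.

```python
import re

# Prefixes from the list above: a digit, "=", "ISBN", "http", or "www".
EXCLUDED_PREFIX = re.compile(r"(\d|=|ISBN|http|www)", re.IGNORECASE)


def is_candidate_sentence(text: str, min_length: int = 10) -> bool:
    """Return True if the text is long enough and has no excluded prefix."""
    stripped = text.strip()
    return len(stripped) >= min_length and not EXCLUDED_PREFIX.match(stripped)


# Made-up sample strings, one per rule.
samples = [
    "東京都は日本の首都であり、人口が最も多い都市である",  # ordinary sentence
    "2020年に開催された大会である",                        # starts with a digit
    "== 歴史 ==\nこの町は江戸時代に成立した",              # starts with "="
    "isbn 000-0-00-000000-0 を参照のこと",                 # starts with "ISBN" (any case)
    "https://example.com/ を参照のこと",                   # starts with "http"
    "短い文",                                              # shorter than 10 characters
]

results = [is_candidate_sentence(sample) for sample in samples]
print(results)  # [True, False, False, False, False, False]
```

Note that a sentence directly following a section heading is dropped as well, since the heading marker `=` ends up at the start of the split segment.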

In [None]:
!pip install wikipedia

In [2]:
import wikipedia

In [None]:
import re

# Set language to Japanese
wikipedia.set_lang("ja")

random_title = wikipedia.random()

try:
    page = wikipedia.page(random_title)

    # Read the input text
    input_text = page.content

    # Split text into sentences based on "。" (full stop) and trim whitespace
    sentences = [s.strip() for s in input_text.split("。")]

    # Skip the first segment (intended to drop the title), then keep only
    # segments of at least 10 characters that do not start with a digit
    # or with "=", "ISBN", "http", or "www" (case-insensitive).
    filtered_sentences = [
        sentence for sentence in sentences[1:]
        if len(sentence) >= 10
        and not re.match(r"\d", sentence)
        and not re.match(r"(=|ISBN|http|www)", sentence, re.IGNORECASE)]

    if filtered_sentences:
        # Put one sentence per line, restoring the "。" that split() removed
        filtered_text = "。\n".join(filtered_sentences) + "。"
        print(filtered_text)

        # Append to file
        # with open("wiki_appended.txt", "a", encoding="utf-8") as file:
        #     file.write("\n" + filtered_text)

        # print("Filtered text appended to 'wiki_appended.txt'")
    else:
        # Avoid emitting a lone "。" when every sentence was filtered out
        print("No usable sentences found on this page.")

except wikipedia.exceptions.PageError:
    print("Page not found.")
except wikipedia.exceptions.DisambiguationError as e:
    print(f"Disambiguation Error. Options: {e.options}")

    


Once up to 2,000 sentences have been extracted from Wikipedia, the lines of the resulting text file are appended to an existing CSV file containing 4,000 sentences from Levels 1 and 2 of the KeiCO Corpus, forming a unified dataset for politeness classification.

In [None]:
import csv

txt_file = "wiki_appended.txt"
csv_file = "keico_corpus(forLREC)-OldVersion_all.csv"

# Open the CSV in append mode so the existing KeiCO rows are preserved
with open(txt_file, "r", encoding="utf-8") as txt, open(csv_file, "a", encoding="utf-8", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)

    for line in txt:
        cleaned_line = line.strip()
        # Skip blank lines and write each sentence as a single-column row
        if cleaned_line:
            csv_writer.writerow([cleaned_line])
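
The cell above writes only the sentence text for each Wikipedia line. If the label should be stored alongside the sentence, one possible variant is sketched below. The `(sentence, label)` column layout and the helper name `append_labeled_rows` are assumptions, since the actual column structure of the KeiCO CSV is not shown here. The example writes to an in-memory buffer so it can be run without touching the real files.

```python
import csv
import io

NEUTRAL_LABEL = 2  # Neutral (Level 2) in this project's scheme


def append_labeled_rows(lines, csv_handle, label=NEUTRAL_LABEL):
    """Write each non-empty line as a (sentence, label) row.

    Hypothetical helper: the two-column layout is an assumption.
    Returns the number of rows written.
    """
    writer = csv.writer(csv_handle)
    written = 0
    for line in lines:
        sentence = line.strip()
        if sentence:  # skip blank lines
            writer.writerow([sentence, label])
            written += 1
    return written


# Made-up input lines standing in for the contents of the text file.
sample_lines = ["これは説明的な文である。\n", "\n", "別の文がここに続く。\n"]

buffer = io.StringIO()
count = append_labeled_rows(sample_lines, buffer)
print(count)
print(buffer.getvalue())
```

To use it on the real data, the buffer would be replaced by the CSV file opened in append mode with `newline=""`, as in the cell above.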