# An Introduction to Japanese Text Mining: Part One

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The example notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part2.html) using Python instead of R. The quoted text below is taken directly from Ravina's article, with minor word changes for Python syntax.

## Imports

In [None]:
import re
import requests
import pandas as pd
import plotly.express as px  # Plotly Express now ships inside the plotly package

## Regex

> Regex is short for “regular expressions.” Think of regex as an extreme version of searching in a word processor using “wild cards.” We can search not only for specific strings, but types of strings, such as lowercase letters or kanji or kana, and narrow our search based on position and the surrounding text. There are entire books dedicated to regex, but we’ll cover the core concepts to get you started.

> Let’s begin with a simple example: we’ll search a few characters before and after a given string. In regex, the “period” character “.” means “any character, including whitespace” (in Python, any character except a newline, unless the `re.DOTALL` flag is set).

In [None]:
string = "これはペンです"
pattern = "は"
re.findall(pattern, string)

In [None]:
string = "これはペンです"
pattern = ".は."
re.findall(pattern, string)

In [None]:
string = "これはペンです"
pattern = "..は.."
re.findall(pattern, string)

> The function `re.findall`, as the name suggests, finds all the strings matching the pattern argument. (Argument is the technical term for the details of a function or command.) More interesting is the role of the period in that pattern argument. Note how the argument `pattern = "..は.."` gets two characters on either side of “は”.
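
One Python-specific detail is worth checking before moving to longer texts: by default the period in Python's `re` module does not match a newline character. The sketch below uses a made-up two-line string (not part of the tutorial data) to show the difference the `re.DOTALL` flag makes.

```python
import re

# Toy two-line string (hypothetical, not from the tutorial data)
two_lines = "これは\nペンです"

# By default "." does not match "\n", so "は." finds nothing here
default_hits = re.findall("は.", two_lines)

# With re.DOTALL the period also matches the newline
dotall_hits = re.findall("は.", two_lines, flags=re.DOTALL)

print(default_hits)
print(dotall_hits)
```

This matters later in the notebook, once the lines of a text are joined into one long string with `\n` between them.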

> Let’s try looking at something more substantial than “これはペンです”. We’ll use the 1890 Imperial Rescript on Education.

In [None]:
rescript = "朕惟フニ我カ皇祖皇宗國ヲ肇ムルコト宏遠ニ德ヲ樹ツルコト深厚ナリ我カ臣民克ク忠ニ克ク孝ニ億兆心ヲ一ニシテ世世厥ノ美ヲ濟セルハ此レ我カ國體ノ精華ニシテ敎育ノ淵源亦實ニ此ニ存ス爾臣民父母ニ孝ニ兄弟ニ友ニ夫婦相和シ朋友相信シ恭儉己レヲ持シ博愛衆ニ及ホシ學ヲ修メ業ヲ習ヒ以テ智能ヲ啓發シ德器ヲ成就シ進テ公益ヲ廣メ世務ヲ開キ常ニ國憲ヲ重シ國法ニ遵ヒ一旦緩急アレハ義勇公ニ奉シ以テ天壤無窮ノ皇運ヲ扶翼スヘシ是ノ如キハ獨リ朕カ忠良ノ臣民タルノミナラス又以テ爾祖先ノ遺風ヲ顯彰スルニ足ラン斯ノ道ハ實ニ我カ皇祖皇宗ノ遺訓ニシテ子孫臣民ノ俱ニ遵守スヘキ所之ヲ古今ニ通シテ謬ラス之ヲ中外ニ施シテ悖ラス朕爾臣民ト俱ニ拳々服膺シテ咸其德ヲ一ニセンコトヲ庶幾フ"
pattern = "..皇.."
re.findall(pattern, rescript)

> We can use square brackets to search for more than one character at a time: [皇朕] means the characters 朕 OR 皇.

In [None]:
pattern = "..[皇朕].."
re.findall(pattern, rescript)

> This is a rudimentary form of KWIC, or “key words in context.” Take a moment to experiment with the command above, changing the kanji and the number of characters. Rather than adding periods, you can use a number in “curly brackets” to specify repetition.

In [None]:
pattern = ".{4}民.{4}"
re.findall(pattern, rescript)
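
The KWIC pattern above can be wrapped in a small helper so that the keyword and the window width become parameters. This is only a sketch: the function name `kwic` and its arguments are made up for illustration, and it uses `{0,width}` instead of `{width}` so that hits close to the start or end of the text are not dropped.

```python
import re

def kwic(text, keyword, width=4):
    """Return each occurrence of keyword with up to `width` characters on either side."""
    # re.escape protects keywords that contain regex metacharacters
    pattern = ".{0,%d}%s.{0,%d}" % (width, re.escape(keyword), width)
    return re.findall(pattern, text)

sample = "これはペンです"
middle_hit = kwic(sample, "は", width=2)
edge_hit = kwic(sample, "こ", width=2)
print(middle_hit)
print(edge_hit)
```

Because `re.findall` returns non-overlapping matches, two keywords that sit closer together than the window width will share one hit rather than produce two.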

> Regex is an enormously powerful tool with a wide range of expressions. In this lesson we’re going to focus on using regex to find chapter or section breaks in texts. But before we move on, here are two examples of more powerful regex searches. What do you suppose this regex finds, and why?

In [None]:
pattern = "民[ァ-ン]."
re.findall(pattern, rescript)

In [None]:
print(rescript)

> Regex can also be used to find anything between two characters. The expression “皇.*?民” will find everything between “皇” and “民”. Remember that the period means “any character.” The asterisk allows for repetition, and the question mark tells Python to stop at the first instance of “民” after “皇”.

In [None]:
pattern = "皇.*?民"
re.findall(pattern, rescript)
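
To see what the question mark contributes, it can help to compare the lazy pattern with its greedy counterpart on a short string. The string below is a made-up example, not a quotation from the Rescript.

```python
import re

# Short toy string (hypothetical) with one 皇 and two 民
toy = "皇祖ノ臣民ト子孫臣民"

lazy = re.findall("皇.*?民", toy)    # stops at the first 民 after 皇
greedy = re.findall("皇.*民", toy)   # runs on to the last 民

print(lazy)
print(greedy)
```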

## Raw texts

> In the first lesson, we used a pre-processed text, the Meiroku zasshi, which was already in a tidy dataframe. In this section we’ll use something less tidy: a plain text file of Hayashi Fumiko’s Ukigumo. This is just the text of a [web page](http://jti.lib.virginia.edu/japanese/hayashi/ukigumo/HayUkig.html) copied and pasted as a plain text file. Unlike the data frame we used earlier, this .txt is not carefully structured, so we’ll read it in as lines of text.

In [None]:
url_ukigumo = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/ukigumo.txt'
response = requests.get(url_ukigumo)

In [None]:
response.encoding = 'utf-8'
Ukigumo_lines = [t.split('" "') for t in response.text.split('\n')]

In [None]:
Ukigumo_lines[1:40]

> The metadata runs to 23, then there are two blank lines, and then the text begins. As for the tail . . . let’s get sophisticated. We’ll use `len(Ukigumo_lines)` to get the number of lines and then subtract ~~10~~ 12.

In [None]:
Ukigumo_lines[(len(Ukigumo_lines)-12):len(Ukigumo_lines)]

> It looks like the text ends with the line “（完）” and then has two blank lines and nine lines of metadata. Let’s get the exact line numbers . . .

In [None]:
[n for n,l in enumerate(Ukigumo_lines) if "（完） " in l]

In [None]:
[n for n,l in enumerate(Ukigumo_lines) if "Japanese Text Initiative" in l]

> So let’s just break Ukigumo into text and metadata.

In [None]:
Ukigumo_head = Ukigumo_lines[1:23]
Ukigumo_tail = Ukigumo_lines[5220:len(Ukigumo_lines)]
Ukigumo_metadata = Ukigumo_head + Ukigumo_tail
Ukigumo_text = Ukigumo_lines[25:5218]
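
The slice boundaries above are hard-coded from inspecting the file. As a hedged alternative, the same split can be driven by the marker lines themselves. The sketch below runs on a tiny made-up list of plain strings (the real `Ukigumo_lines` holds one-element lists, so it would need `line[0]`), and it assumes that the text begins after the first run of blank lines and ends with “（完）”.

```python
# Toy stand-in for the downloaded file (hypothetical lines, not the real data)
lines = [
    "Title: sample",
    "Source: sample",
    "",
    "",
    "first line of text",
    "second line of text",
    "（完）",
    "",
    "",
    "About the electronic version",
]

# Index of the line that closes the main text
end_index = next(n for n, line in enumerate(lines) if line.strip() == "（完）")

# The main text starts after the first run of blank lines
first_blank = lines.index("")
start_index = first_blank
while lines[start_index] == "":
    start_index += 1

text = lines[start_index:end_index + 1]
# Everything before the first blank line and every non-empty line after the end marker
metadata = lines[:first_blank] + [line for line in lines[end_index + 1:] if line]

print(text)
print(metadata)
```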

In [None]:
Ukigumo_metadata

In [None]:
Ukigumo_text[0:5]

In [None]:
Ukigumo_text[len(Ukigumo_text)-5:len(Ukigumo_text)]

In [None]:
len(Ukigumo_text)

> We read in the text of Ukigumo as a series of lines with line breaks, and that was useful for finding and pulling out the metadata. Now that we have isolated the main text, we might want to collapse those ~~5,194~~ 5193 lines into one long string. The command is:

In [None]:
Ukigumo_collapsed = '\n'.join([t[0] for t in Ukigumo_text])

> Now let’s do a little regex searching on Ukigumo. We can search for all the terms that appear around the name of the protagonist Yukiko. The list is long, so we’ll just peek at the first 10 hits.

In [None]:
Yukiko_kwic = re.findall( ".{5}ゆき子.{5}", Ukigumo_collapsed)

In [None]:
Yukiko_kwic[:10]  # Python is 0-indexed, so [:10] gives the first 10 hits

> While this certainly isn’t a “summary” of the novel, the phrases “孤獨な心” and “汚れた手” do get at key themes in the work: isolation and postwar privation.

## Chunking

> In order to explore the internal structure of a text, we often need to break it into parts. Sometimes these parts are inherent to the text itself, such as sections or chapters. Sometimes we will want to impose arbitrary breaks on a text. In either case it is necessary to “chunk” the text, that is, break it into parts.

> If there are explicit markers within the text, the first step is to find those. For example, we might want to find “第一課,” “第二課,” and “第三課,” etc. As you can see, the first and last characters are consistent, but the middle character(s) change. So we need to find every instance of “第” followed by one or two digits and then “課.” On the off chance that this is part of the text (e.g., 第三課に叙述した例文), we can specify that the text appear on a line by itself.

> Let’s try finding the breaks in Hayashi’s Ukigumo. If you glance at Ukigumo either in print or up on [Aozora bunko](http://www.aozora.gr.jp/cards/000291/files/52236_58934.html), you’ll see that the chapter breaks are marked with simple numbers ranging from “一” to “六十七.” This is a fairly simple challenge for regex.

> We will want to search for:
> * A character in the sequence 一二三四五六七八九十
> * That character repeated between one and three times
> * That pattern starting at the start of a line
> * That pattern ending at the end of a line

> Working this through step by step
> * A character in the sequence 一二三四五六七八九十 is [一二三四五六七八九十]. The “square brackets” in regex have an implicit “or” so we are searching for ANY of these characters
> * Repeated between one and three times is expressed with curly brackets: {1,3}. It seems as though the concept “one through three” should be expressed with a colon as [1:3] or maybe {1:3}, but regex has its own rules.
> * Starting at the start of a line is marked with the “caret”: ^
> * Ending at the end of a line is expressed with the “dollar sign”: $

> The complete regex is `“^[一二三四五六七八九十]{1,3}$”`
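
In Python, `^` and `$` match at every line only when the `re.MULTILINE` flag is set; without it they anchor to the start and end of the whole string. A minimal sketch on a made-up text (not the Ukigumo file) shows the difference:

```python
import re

# Toy text (hypothetical): two chapter headings plus a line that merely contains a numeral
toy_text = "一\n本文の一行目\n第三課に叙述した例文\n十二\n本文の続き"

chapter_pattern = r"^[一二三四五六七八九十]{1,3}$"

# Without re.MULTILINE, ^ and $ refer to the whole string
whole_string_hits = re.findall(chapter_pattern, toy_text)
# With re.MULTILINE, they refer to each line
per_line_hits = re.findall(chapter_pattern, toy_text, flags=re.MULTILINE)

print(whole_string_hits)
print(per_line_hits)
```

The line “第三課に叙述した例文” is correctly skipped because the numeral does not stand on a line by itself.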

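> We can use the familiar function ~~str_count~~ `re.finditer` to count those instances . . .

Since the text is now one long string, the pattern below matches the newline on either side of the numeral instead of using ^ and $. The text is first normalized to Unicode form NFKC.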

In [None]:
import unicodedata

In [None]:
Ukigumo_collapsed = unicodedata.normalize('NFKC', Ukigumo_collapsed)

In [None]:
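# Avoid the name `iter`, which would shadow Python's built-in function
matches = re.finditer("\n[一二三四五六七八九十]{1,3}\n", Ukigumo_collapsed)
# Each span is a (start, end) pair of character offsets into Ukigumo_collapsed
breaks = [m.span() for m in matches]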

In [None]:
print(Ukigumo_collapsed[2709:2712])

In [None]:
breaks[0]

> In order to check whether we have the right number of breaks, we can ask Python for the length of that list of hits:

In [None]:
len(breaks)

> It’s indeed the same as the number of chapters, so we’ve specified the regex correctly. Let’s check one more thing. We’ll look at the actual lines where the search results say we should find chapter markers.

In [None]:
Ukigumo_collapsed[80:92]

In [None]:
Ukigumo_collapsed[87:90]

In [None]:
for n,s in enumerate(Ukigumo_collapsed[80:92].split('\n'), start=80):
    print(n, '|'+s+'|')

In [None]:
breaks[-1]

In [None]:
Ukigumo_collapsed[234352:234357]

In [None]:
for n,s in enumerate(Ukigumo_collapsed[234352:234357+10].split('\n'), start=234352):
    print(n, '|'+s+'|')

In [None]:
print(breaks)

In [None]:
breaks[66]

> Again, the results are good. Now we can chunk Ukigumo. First let’s mark all the breakpoints with the arbitrary but distinctive term “BREAKPOINT!!!” We can use any string that does not occur in the original text, but “BREAKPOINT!!!” seems clear. We’ll make a copy of the text called Ukigumo_text_new and change the content of the relevant lines.

We deviate from the tutorial here. Having checked above that the break points are parsed correctly, we use a list comprehension to build a list of all the chapters, handling the last chapter separately.

In [None]:
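# Each chapter runs from the start of one break to the start of the next
Ukigumo_text_new = [Ukigumo_collapsed[breaks[n][0]:breaks[n+1][0]] for n in range(len(breaks)-1)]
# The last chapter runs from the final break to the end of the text
Ukigumo_text_new.append(Ukigumo_collapsed[breaks[-1][0]:])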
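
The slicing logic is easier to follow on a text small enough to read in full. The sketch below applies the same idea to a made-up three-chapter string; the names `toy_text`, `toy_breaks`, and `toy_chapters` are hypothetical. Pairing each start offset with the next one handles the last chapter without a special case.

```python
import re

# Toy text (hypothetical) with three chapter headings
toy_text = "\n一\nあいう\nえお\n二\nかきく\n三\nけこ"
toy_breaks = [m.span() for m in re.finditer("\n[一二三四五六七八九十]{1,3}\n", toy_text)]

# Pair each break start with the next one; the final chunk runs to the end of the text
starts = [start for start, _ in toy_breaks]
ends = starts[1:] + [len(toy_text)]
toy_chapters = [toy_text[s:e] for s, e in zip(starts, ends)]

for chapter in toy_chapters:
    print(repr(chapter))
```

Anything that precedes the first heading is left out of the chapter list, just as in the cell above.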

In [None]:
Ukigumo_split_df = pd.DataFrame(list(range(1, len(Ukigumo_text_new)+1)), columns=['chapter_number'])
Ukigumo_split_df['text'] = Ukigumo_text_new

> We now have the text of Ukigumo in a data frame, roughly parallel to the format of the Meiroku zasshi. If we do a regex search on this data frame, the results will be by chapter. For example, how often does the protagonist appear by name?

In [None]:
Ukigumo_split_df['ゆき子'] = Ukigumo_split_df.text.str.count('ゆき子')

> We also already know how to extract all the quotes in each chapter. Let’s capture everything between “「” and “」” and reuse the regex we used above to search between 皇 and 民. Then we’ll count the number of characters in the quotes to find the “quotiest” chapters.

In [None]:
def quote_length(text):
    total_length = 0
    for quote in re.findall('「(.*?)」', text):
        total_length += len(quote)
    return total_length
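
One thing to keep in mind with this pattern: because “.” does not match a newline by default, a quotation that continues over a line break is not captured. The sketch below illustrates this on a made-up two-quote string. Whether it affects the Ukigumo counts depends on how quotations are laid out in the file, which is not checked here.

```python
import re

# Toy chapter (hypothetical): the second quotation spans two lines
toy_chapter = "彼は「行こう」と言つた。\n「さうね、\nもう遅いわ」"

# Default behaviour: "." stops at line breaks
single_line_quotes = re.findall("「(.*?)」", toy_chapter)
# re.DOTALL lets "." cross line breaks
all_quotes = re.findall("「(.*?)」", toy_chapter, flags=re.DOTALL)

print(single_line_quotes)
print(all_quotes)
```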

In [None]:
quote_length(Ukigumo_text_new[1])

In [None]:
Ukigumo_split_df['quotes'] = Ukigumo_split_df.text.apply(quote_length)

In [None]:
print(list(Ukigumo_split_df.quotes))

**Note:** These counts appear to differ from Ravina's systematically, by a constant factor.

> Is there a pattern? Let’s graph it and see. We’ll use the ~~rownames of the dataframe~~ `chapter_number` column as a proxy for chapter names.

In [None]:
px.scatter(Ukigumo_split_df, x='chapter_number', y='quotes')

> Counting characters is useful, but we might want the text split into words (or tokens) so that we can create a document term matrix. That’s our next step.

## Tokenizing

<img src="images/Sudachi.png" width="200" align="right" style="margin: 0px 20px"/>

> Tokenizing Japanese is a difficult but often necessary step. The details of installing the MeCab tokenizer and the related R packages (RMeCab and RMeCabUni) are relegated to [another page](http://history.emory.edu/RAVINA/JF_text_mining/Guides/MeCab_RMeCab.html). Let’s assume that you either have RMeCab successfully installed, or are using the server. We’ll start with a simple string.

Instead of using the MeCab tokenizer, we will use the [Sudachi Japanese morphological analyzer](https://github.com/WorksApplications/Sudachi).

In [None]:
from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config
import json

In [None]:
with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)
tokenizer_obj = dictionary.Dictionary(settings).create()

In [None]:
s = 'これはペンです'
mode = tokenizer.Tokenizer.SplitMode.C
# Tokenize once and print each (surface, part-of-speech) pair
for m in tokenizer_obj.tokenize(mode, s):
    print((m.surface(), m.part_of_speech()))

> As you can see, ~~RMeCab~~ Sudachi returns the results of its tokenization in a somewhat dense form, with the words combined with their POS (“part of speech”) tags. To get just the neatly tokenized words, we can keep only the surface form of each token and collect the results in a plain Python list.

In [None]:
s = 'これはペンです'
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize(mode, s)]

> If you want the POS tags, we can grab the “attributes” of the list.

In [None]:
s = 'これはペンです'
mode = tokenizer.Tokenizer.SplitMode.C
[m.part_of_speech()[0] for m in tokenizer_obj.tokenize(mode, s)]

> If we want to work with longer strings or vectors of strings, we can direct ~~RMeCab~~ Sudachi to tokenize part of a dataframe. For example, we can have ~~RMeCab~~ Sudachi tokenize either all of Ukigumo or just the extracted quotes. Since we are looking at Hayashi’s Ukigumo from 1951 and not Futabatei Shimei’s 1887 novel by the same name, we’ll use the tokenizer for 現代語. RMeCab’s somewhat strange syntax refers to the dataframe as dataf and the column as coln.

In [None]:
Ukigumo_chapters_tokenized = []
mode = tokenizer.Tokenizer.SplitMode.C
text_length = 0  # running total of tokens, printed as a progress indicator
for chapter in Ukigumo_text_new:
    # Reduce every token to its dictionary form
    wordlist = [m.dictionary_form() for m in tokenizer_obj.tokenize(mode, chapter)]
    text_length += len(wordlist)
    # Store each chapter as one space-separated string of tokens
    Ukigumo_chapters_tokenized.append(' '.join(wordlist))
    print('\r{}'.format(text_length), end='')

In [None]:
Ukigumo_split_df['token'] = Ukigumo_chapters_tokenized

In [None]:
print(Ukigumo_split_df.token.loc[0])

> Since we now have Ukigumo tokenized, we can reuse our earlier code to create a document-term matrix. First we’ll get a list of all the unique words in the complete text. We’ll combine the tokenized chapters together with join so that Python gives us one list of unique words, not ~~68~~ 67 lists, one for each chapter.

For some reason, our data does not contain the preface but just the 67 chapters.

In [None]:
Ukigumo_complete = '\n'.join(Ukigumo_split_df.token)
Ukigumo_unique_words = set(Ukigumo_complete.split())
print(len(Ukigumo_unique_words))

> Now we can reuse the code from the previous chapter, but with one improvement. Now that we know some regex, we can tell Python to distinguish between 女 as a word and 女 as part of a compound such as 女性 or 少女. The extra regex tag is \b for “word boundary.” That tells Python we only want 女 either with whitespace on both sides, or at the start or end of a string. We’ll paste \b before and after every unique word.

In [None]:
for w in Ukigumo_unique_words:
    if '女' in w:
        print(w)

In [None]:
pattern = r"\b女史\b"
re.findall(pattern, Ukigumo_complete)
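
The effect of `\b` is easiest to see on a short, space-separated string of tokens. The string below is a made-up example, not taken from the tokenized novel.

```python
import re

# Toy tokenized string (hypothetical): 女 occurs twice alone and twice inside compounds
tokens = "女 は 女性 と 少女 を 見る 女"

standalone = re.findall(r"\b女\b", tokens)  # 女 as a whole token only
anywhere = re.findall("女", tokens)         # every 女, including those inside 女性 and 少女

print(len(standalone), len(anywhere))
```

This works because Python treats kanji as word characters, so there is no boundary between 女 and 性 in 女性, while the spaces inserted by the tokenizer do create one.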

> Now let’s reuse our code from the last section to get a document-term matrix.

In [None]:
from collections import Counter

In [None]:
complete_ukigumo_split = Ukigumo_complete.split()

In [None]:
len(complete_ukigumo_split)

In [None]:
ukigumo_unique_words = set(complete_ukigumo_split)
len(ukigumo_unique_words)

In [None]:
counts = Counter(complete_ukigumo_split)
Ukigumo_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
Ukigumo_frequency_df.columns = ['word', 'count']
Ukigumo_frequency_df = Ukigumo_frequency_df.sort_values(by='count', ascending=False)
Ukigumo_frequency_df['term index'] = list(range(1,len(Ukigumo_frequency_df)+1))

In [None]:
fig = px.scatter(Ukigumo_frequency_df, x='term index', y='count', 
                 hover_name='word', log_x=True, log_y=True)
fig.layout.title = 'Total Vocabulary {}'.format(len(set(complete_ukigumo_split)))
fig

In [None]:
def text_length(text):
    return len(text.split())

Ukigumo_split_df['text_length'] = Ukigumo_split_df.token.map(text_length)

In [None]:
def text_frequency(text):
    counts = Counter({word:0 for word in Ukigumo_frequency_df.word})
    counts.update(text.split())
    return counts

In [None]:
Ukigumo_split_df['word_counts'] = Ukigumo_split_df.token.map(text_frequency)

In [None]:
dtm = pd.DataFrame.from_dict(list(Ukigumo_split_df.word_counts.values))
dtm = dtm[Ukigumo_frequency_df.word]
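
For readers who want to check the shape of a document-term matrix by hand, here is a minimal sketch on three made-up “chapters”. It uses only the standard library so that it runs without the notebook's data; the names `toy_docs`, `vocabulary`, and `dtm_rows` are hypothetical.

```python
from collections import Counter

# Three toy documents (hypothetical), already tokenized and space-separated
toy_docs = ["雨 が 降る", "雨 雨 と 風", "風 が 吹く"]

# Vocabulary ordered by overall frequency, most frequent first
total = Counter(" ".join(toy_docs).split())
vocabulary = [word for word, _ in total.most_common()]

# One row per document, one column per vocabulary word
dtm_rows = []
for doc in toy_docs:
    counts = Counter(doc.split())
    # A Counter returns 0 for words the document does not contain
    dtm_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm_rows:
    print(row)
```

Each row sums to the token count of its document, which is a quick sanity check that also applies to the full matrix.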

In [None]:
mask = (Ukigumo_frequency_df['count'] == 10)
Ukigumo_frequency_df[mask].tail()

## Appendix: Key regex expressions

### Japanese-specific regex expressions

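Note that Python's built-in `re` module does not support the `\p{...}` property classes; they are available in the third-party `regex` package. In Python, the code-point ranges are written with `\u` escapes.

| Expression | Meaning | Example |
|---|---|---|
| `\p{Hiragana}` | Hiragana | ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く |
| `\p{Katakana}` | Katakana (Full Width) | ァ ア ィ イ ゥ ウ ェ エ ォ オ |
| `\p{Han}` | Kanji | 漢字 日本語 文字 言語 言葉 |
| `[\u3000-\u303F]` | Japanese Symbols and Punctuation | 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 |
| `[\uFF5F-\uFF9F]` | Katakana and Punctuation (Half Width) | ｟ ｠ ｡ ｢ ｣ ､ ･ ｦ ｧ ｨ ｩ ｪ ｫ ｬ |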