# An Introduction to Japanese Text Mining: Part One

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The example notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part1.html) using Python instead of R. The quoted text below is taken directly from Ravina's article, with minor word changes for Python syntax.

## Imports

In [None]:
import re
import requests
import pandas as pd
import plotly.express as px  # bundled with plotly 4+; replaces the standalone plotly_express package

## Regex

> Regex is short for “regular expressions.” Think of regex as an extreme version of searching in a word processor using “wild cards.” We can search not only for specific strings, but types of strings, such as lowercase letters or kanji or kana, and narrow our search based on position and the surrounding text. There are entire books dedicated to regex, but we’ll cover the core concepts to get you started.

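> Let’s begin with a simple example: we’ll search a few characters before and after a given string. In regex, the “period” character “.” means “any character, including whitespace.” (In Python’s `re` module, the period matches any character except a newline unless the `re.DOTALL` flag is set.)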

In [None]:
string = "これはペンです"
pattern = "は"
re.findall(pattern, string)

In [None]:
string = "これはペンです"
pattern = ".は."
re.findall(pattern, string)

In [None]:
string = "これはペンです"
pattern = "..は.."
re.findall(pattern, string)

> The function `re.findall`, as the name suggests, finds all the strings matching the pattern argument. (Argument is the technical term for the details of a function or command.) More interesting is the role of the period in that pattern argument. Note how the argument pattern = “..は..” gets two characters on either side of “は”.
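
One Python-specific detail is worth checking: how the period treats line breaks. As a minimal sketch (the two-line `text` string is made up for illustration), the period does not match a newline in Python's `re` module unless the `re.DOTALL` flag is passed:

```python
import re

# Made-up example: the same sentence with a line break right after は
text = "これは\nペンです"

default_hits = re.findall("は.", text)            # "." will not match "\n"
dotall_hits = re.findall("は.", text, re.DOTALL)  # with DOTALL, "." matches "\n" too

print(default_hits)  # []
print(dotall_hits)   # ['は\n']
```

This only matters when the string being searched contains line breaks.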

> Let’s try looking at something more substantial than “これはペンです”. We’ll use the 1890 Imperial Rescript on Education.

In [None]:
rescript = "朕惟フニ我カ皇祖皇宗國ヲ肇ムルコト宏遠ニ德ヲ樹ツルコト深厚ナリ我カ臣民克ク忠ニ克ク孝ニ億兆心ヲ一ニシテ世世厥ノ美ヲ濟セルハ此レ我カ國體ノ精華ニシテ敎育ノ淵源亦實ニ此ニ存ス爾臣民父母ニ孝ニ兄弟ニ友ニ夫婦相和シ朋友相信シ恭儉己レヲ持シ博愛衆ニ及ホシ學ヲ修メ業ヲ習ヒ以テ智能ヲ啓發シ德器ヲ成就シ進テ公益ヲ廣メ世務ヲ開キ常ニ國憲ヲ重シ國法ニ遵ヒ一旦緩急アレハ義勇公ニ奉シ以テ天壤無窮ノ皇運ヲ扶翼スヘシ是ノ如キハ獨リ朕カ忠良ノ臣民タルノミナラス又以テ爾祖先ノ遺風ヲ顯彰スルニ足ラン斯ノ道ハ實ニ我カ皇祖皇宗ノ遺訓ニシテ子孫臣民ノ俱ニ遵守スヘキ所之ヲ古今ニ通シテ謬ラス之ヲ中外ニ施シテ悖ラス朕爾臣民ト俱ニ拳々服膺シテ咸其德ヲ一ニセンコトヲ庶幾フ"
pattern = "..皇.."
re.findall(pattern, rescript)

> We can use square brackets to search for more than one character at a time: [皇朕] means the characters 朕 OR 皇.

In [None]:
pattern = "..[皇朕].."
re.findall(pattern, rescript)
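
One caveat with this kind of search: `re.findall` returns non-overlapping matches, so a second 皇 that falls inside the context window of the first is consumed as context and does not get its own hit. The sketch below uses a short phrase from the rescript. Wrapping the pattern in a lookahead, `(?=(...))`, is one common workaround; it is not part of Ravina's tutorial.

```python
import re

# A short phrase from the rescript containing 皇 twice, two characters apart
phrase = "我カ皇祖皇宗國ヲ肇ムル"

# Plain search: the second 皇 is swallowed by the first match
plain_hits = re.findall("..[皇朕]..", phrase)

# Lookahead search: matches may overlap because the lookahead consumes no characters
overlapping_hits = re.findall("(?=(..[皇朕]..))", phrase)

print(plain_hits)        # ['我カ皇祖皇']
print(overlapping_hits)  # ['我カ皇祖皇', '皇祖皇宗國']
```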

> This is a rudimentary form of KWIC, or “key words in context.” Take a moment to experiment with the command above, changing the kanji and the number of characters. Rather than adding periods, you can use a number in “curly brackets” to specify repetition.

In [None]:
pattern = ".{4}民.{4}"
re.findall(pattern, rescript)
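
A fixed count such as `.{4}` silently drops any hit that sits fewer than four characters from the start or end of the string. As a sketch, the hypothetical helper `kwic` below (not from Ravina's tutorial) uses the range form `.{0,n}` so that hits near the edges are kept with whatever context is available, and `re.escape` so that the keyword is treated literally.

```python
import re

def kwic(text, keyword, width=4):
    """Return each occurrence of keyword with up to `width` characters on either side."""
    # {0,width} means "between 0 and width characters", so edge hits survive
    pattern = ".{0,%d}%s.{0,%d}" % (width, re.escape(keyword), width)
    return re.findall(pattern, text)

sentence = "これはペンです"
fixed_hits = re.findall(".{4}は.{4}", sentence)  # needs exactly 4 characters before は
ranged_hits = kwic(sentence, "は", width=4)      # accepts fewer than 4

print(fixed_hits)   # []
print(ranged_hits)  # ['これはペンです']
```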

> Regex is an enormously powerful tool with a wide range of expressions. In this lesson we’re going to focus on using regex to find chapter or section breaks in texts. But before we move on, here are two examples of more powerful regex searches. What do you suppose this regex finds, and why?

In [None]:
pattern = "民[ァ-ン]."
re.findall(pattern, rescript)
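
A range inside square brackets, such as `[ァ-ン]`, is a span of consecutive Unicode code points. The sketch below applies that idea to a short phrase from the rescript. The kanji range `\u4e00-\u9fff` (the CJK Unified Ideographs block) is an assumption added here for illustration and does not cover rarer characters in the extension blocks.

```python
import re

phrase = "我カ臣民克ク忠ニ"

# ァ (U+30A1) through ン (U+30F3): the basic katakana syllabary
katakana = re.findall("[ァ-ン]", phrase)

# CJK Unified Ideographs block, U+4E00 through U+9FFF
kanji = re.findall("[\u4e00-\u9fff]", phrase)

print(katakana)  # ['カ', 'ク', 'ニ']
print(kanji)     # ['我', '臣', '民', '克', '忠']
```

Note that a few katakana-block characters, such as the long-vowel mark ー (U+30FC), fall outside `[ァ-ン]`.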

In [None]:
print(rescript)

> Regex can also be used to find anything between two characters. The expression “皇.*?民” will find everything between “皇” and “民”. Remember that the period means “any character.” The asterisk allows for repetition, and the question mark tells Python to stop at the first instance of “民” after “皇”.

In [None]:
pattern = "皇.*?民"
re.findall(pattern, rescript)
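
To see what the question mark changes, it helps to compare the greedy form `.*` with the lazy form `.*?` on a string that contains 民 more than once. The phrase below is made up for illustration and is not a quotation from the rescript.

```python
import re

# Made-up phrase with one 皇 and two 民
phrase = "皇祖ノ臣民ト子孫ノ臣民"

greedy = re.findall("皇.*民", phrase)   # .* runs on to the LAST 民
lazy = re.findall("皇.*?民", phrase)    # .*? stops at the FIRST 民

print(greedy)  # ['皇祖ノ臣民ト子孫ノ臣民']
print(lazy)    # ['皇祖ノ臣民']
```

The lazy form only shortens the right-hand end: the match still begins at the first 皇 it encounters.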

## Raw texts

> In the first lesson, we used a pre-processed text, the Meiroku zasshi, which was already in a tidy dataframe. In this section we’ll use something less tidy: a plain text file of Hayashi Fumiko’s Ukigumo. This is just the text of a [web page](http://jti.lib.virginia.edu/japanese/hayashi/ukigumo/HayUkig.html) copied and pasted as a plain text file. Unlike the data frame we used earlier, this .txt is not carefully structured, so we’ll read it in as lines of text.

In [None]:
url_ukigumo = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/ukigumo.txt'
response = requests.get(url_ukigumo)

In [None]:
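response.encoding = 'utf-8'  # decode the response body as UTF-8
# One entry per line of the file; each entry is itself a list
# (a single item, unless the line happens to contain the separator '" "')
Ukigumo_lines = [t.split('" "') for t in response.text.split('\n')]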
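
Splitting on `'\n'` works cleanly when a file uses Unix line endings. As a hedged aside, if a file uses Windows line endings (`\r\n`), `str.split('\n')` leaves a trailing `\r` on every line and a final empty string, whereas `str.splitlines()` handles both conventions. The sketch below uses a made-up two-line string rather than the Ukigumo file, whose line endings are not checked here.

```python
raw = "一行目\r\n二行目\r\n"  # made-up text with Windows line endings

split_lines = raw.split("\n")
clean_lines = raw.splitlines()

print(split_lines)  # ['一行目\r', '二行目\r', '']
print(clean_lines)  # ['一行目', '二行目']
```

Switching methods can change the number of list entries, so the line numbers used in this notebook assume the `split('\n')` approach.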

In [None]:
Ukigumo_lines[1:40]

> The metadata runs to 23, then there are two blank lines, and then the text begins. As for the tail . . . let’s get sophisticated. We’ll use `len(Ukigumo_lines)` to get the number of lines and then subtract 16.

In [None]:
Ukigumo_lines[(len(Ukigumo_lines)-16):len(Ukigumo_lines)]

In [None]:
[n for n,l in enumerate(Ukigumo_lines) if "（完） " in l]

In [None]:
[n for n,l in enumerate(Ukigumo_lines) if "Japanese Text Initiative" in l]
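
Because each entry of `Ukigumo_lines` is itself a list, `in` here tests whether some item of that list equals the search string exactly; it is not a substring search. The sketch below, on made-up stand-in lines, shows the difference and one way to search within the line's text instead.

```python
# Made-up stand-ins for three entries of Ukigumo_lines (each entry is a one-item list)
lines = [["本文"], ["（完） "], ["Japanese Text Initiative, University of Virginia"]]

# List membership: the whole item must equal the search string
exact = [n for n, l in enumerate(lines) if "（完） " in l]
whole_item = [n for n, l in enumerate(lines) if "Japanese Text Initiative" in l]

# Substring search: look inside the first item of each entry
substring = [n for n, l in enumerate(lines) if "Japanese Text Initiative" in l[0]]

print(exact)       # [1]
print(whole_item)  # []
print(substring)   # [2]
```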

> So let’s just break Ukigumo into text and metadata.

In [None]:
Ukigumo_head = Ukigumo_lines[1:23]
Ukigumo_tail = Ukigumo_lines[5220:len(Ukigumo_lines)]
Ukigumo_metadata = Ukigumo_head + Ukigumo_tail
Ukigumo_text = Ukigumo_lines[25:5218]
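
Python slices are zero-indexed and exclude the end position, so `a[start:stop]` holds `stop - start` items when the list is long enough. A toy sketch with a made-up ten-item list:

```python
lines = ["line %d" % i for i in range(10)]  # made-up list: 'line 0' ... 'line 9'

body = lines[2:7]  # items at positions 2, 3, 4, 5, 6
tail = lines[7:]   # from position 7 to the end

print(body[0], body[-1])  # line 2 line 6
print(len(body))          # 5, i.e. 7 - 2
print(len(tail))          # 3
```

By the same arithmetic, `Ukigumo_lines[25:5218]` holds 5218 - 25 = 5,193 entries.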

In [None]:
Ukigumo_metadata

> We read in the text of Ukigumo as a series of lines with line breaks, and that was useful for finding and pulling out the metadata. Now that we have isolated the main text, we might want to collapse those 5,193 lines into one long string. The command is:

In [None]:
Ukigumo_collapsed = ' '.join([t[0] for t in Ukigumo_text])
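
Joining with `' '` places a space at every former line break. Japanese text has no spaces between words, so a hedged alternative is to join with the empty string; otherwise a word that straddles a line break cannot be matched. The sketch uses two made-up lines, and whether this matters for the Ukigumo file depends on where its line breaks fall.

```python
import re

# Made-up lines in the same list-of-lists shape, with a name split across the break
lines = [["ゆき"], ["子は歩いた"]]

with_space = " ".join(t[0] for t in lines)  # 'ゆき 子は歩いた'
no_space = "".join(t[0] for t in lines)     # 'ゆき子は歩いた'

hits_with_space = re.findall("ゆき子", with_space)
hits_no_space = re.findall("ゆき子", no_space)

print(hits_with_space)  # []
print(hits_no_space)    # ['ゆき子']
```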

> Now let’s do a little regex searching on Ukigumo. We can search for all the terms that appear around the name of the protagonist Yukiko. The list is long, so we’ll just peek at the first 10 hits.

In [None]:
Yukiko_kwic = re.findall( ".{5}ゆき子.{5}", Ukigumo_collapsed)

In [None]:
Yukiko_kwic[:10]  # first 10 hits (Python slices start at index 0)
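
To separate the keyword from its context, the pattern can be split into capture groups; `re.findall` then returns one tuple per hit. The sketch below uses a made-up passage rather than the novel's text, and `{0,3}` rather than a fixed count so that hits near the edges are kept.

```python
import re

# Made-up passage for illustration, not a quotation from Ukigumo
passage = "雨の中をゆき子は歩いてゐた。富岡はゆき子の顔を見た。"

# Three groups: left context, keyword, right context
hits = re.findall("(.{0,3})(ゆき子)(.{0,3})", passage)

for left, keyword, right in hits:
    print(left, "|", keyword, "|", right)

print(hits)  # [('の中を', 'ゆき子', 'は歩い'), ('富岡は', 'ゆき子', 'の顔を')]
```

Tuples in this shape can be passed straight to `pd.DataFrame(hits, columns=[...])` if a table is more convenient.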

> While this certainly isn’t a “summary” of the novel, the phrases “孤獨な心” and “汚れた手” do get at key themes in the work: isolation and postwar privation.