# An Introduction to Japanese Text Mining: Part One

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part1.html), using Python instead of R.

## Imports

In [None]:
import pandas as pd

## Data Structures

The pandas `DataFrame` is the main Python analogue of R's `data.frame`.

In [None]:
meiroku_zasshi_url = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/meiroku_zasshi.txt'
Meiroku_df = pd.read_csv(meiroku_zasshi_url, sep=' ')

In [None]:
Meiroku_df.head()

In [None]:
Meiroku_df.author.tail()

In [None]:
Meiroku_df.author.unique()

In [None]:
Meiroku_df.author[2]

In [None]:
Meiroku_df.author[1:5]

In [None]:
Meiroku_df.loc[2, 'author']

In [None]:
from itertools import chain
rows = list(chain([2,10], range(6,9)))
Meiroku_df.loc[rows, ['title', 'author']]

In [None]:
Meiroku_df.loc[:, 'author']

In [None]:
Meiroku_df.loc[1:6, 'year']
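
One difference between the two slicing styles used above is worth spelling out: a plain slice such as `author[1:5]` is positional and excludes the end point, while a `.loc` slice such as `.loc[1:6, 'year']` works on index labels and includes both ends. The following is a minimal sketch on a small made-up frame; `toy_df` and its contents are hypothetical and are not taken from the Meiroku zasshi data.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself loads Meiroku_df from a URL.
toy_df = pd.DataFrame({
    'author': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'year': [1874, 1874, 1874, 1875, 1875, 1875, 1875],
})

# Plain slice on a Series: positions 1 through 4, end point excluded.
positional = toy_df.author[1:5]

# Label-based slice with .loc: labels 1 through 5, end point included.
by_label = toy_df.loc[1:5, 'author']

print(positional.tolist())  # ['B', 'C', 'D', 'E']
print(by_label.tolist())    # ['B', 'C', 'D', 'E', 'F']
```

With the default integer index the labels and positions coincide, so the only visible difference is the extra row at the end of the `.loc` result.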

## Assignment and Subsetting

In [None]:
# Author Nishi Amane
mask = Meiroku_df.author == '西周'
Nishi_articles_df = Meiroku_df[mask]

In [None]:
Nishi_articles_df.title

In [None]:
Nishi_articles_df[Nishi_articles_df.year == 1874]
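
It may help to look at what the mask itself contains. Comparing a column with a value produces a boolean Series with one entry per row, and indexing the frame with that Series keeps the rows marked `True`. A small sketch, assuming a made-up frame (`toy_df`, its titles, and `nishi_df` are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Hypothetical toy data standing in for Meiroku_df.
toy_df = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4'],
    'author': ['西周', '森有礼', '西周', '西周'],
    'year': [1874, 1874, 1875, 1874],
})

# Comparing a column with a value yields one True/False per row.
mask = toy_df.author == '西周'
print(mask.tolist())  # [True, False, True, True]
print(mask.sum())     # 3, because True counts as 1

# Indexing with the mask keeps only the rows where it is True.
nishi_df = toy_df[mask]
print(nishi_df[nishi_df.year == 1874].title.tolist())  # ['t1', 't4']
```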

## Functions and more Subsetting

In [None]:
# Count occurrences of 女 in each article, then keep articles with at least one.
Meiroku_df['女_count'] = Meiroku_df.text.str.count('女')
mask = Meiroku_df['女_count'] != 0
Meiroku_df[mask].head()

In [None]:
# Drop our new column.
Meiroku_df = Meiroku_df.drop(['女_count'], axis=1)

In [None]:
# Match 女 only as a standalone, space-delimited token.
mask = Meiroku_df.text.str.count(' 女 ') != 0
Meiroku_女_df = Meiroku_df[mask]

In [None]:
Meiroku_女_df
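
The two patterns used above behave differently: `'女'` counts every occurrence of the character, including inside compounds, while `' 女 '` counts only occurrences with a space on both sides. The sketch below illustrates this on a few made-up, space-tokenized strings; `toy_text` is hypothetical and only stands in for the `text` column. Note that `str.count` interprets its argument as a regular expression, which makes no difference for these literal patterns.

```python
import pandas as pd

# Hypothetical, space-tokenized strings standing in for the `text` column.
toy_text = pd.Series([
    '女 の 権利',        # standalone token, but at the very start of the string
    '男 と 女 の 自由',   # standalone token with a space on both sides
    '女子 教育 の 自由',  # 女 appears only inside the compound 女子
    '政府 の 自由',       # no 女 at all
])

# Substring count: every occurrence of the character.
substring_counts = toy_text.str.count('女')

# Token count: only occurrences surrounded by spaces.
token_counts = toy_text.str.count(' 女 ')

# Padding each string with spaces also catches tokens at the start or end.
padded_counts = (' ' + toy_text + ' ').str.count(' 女 ')

print(substring_counts.tolist())  # [1, 1, 1, 0]
print(token_counts.tolist())      # [0, 1, 0, 0]
print(padded_counts.tolist())     # [1, 1, 0, 0]
```

Whether the padded variant is needed depends on how the source text was tokenized, which this sketch does not assume anything about.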

We can now use the same tricks as before to subset a data frame. To compare how often essays in the Meiroku zasshi use 女 and 自由, first select every essay that contains 自由.

In [None]:
mask = Meiroku_df.text.str.count('自由') != 0
Meiroku_自由_df = Meiroku_df[mask]

In [None]:
print('There are {} articles containing the string " 女 " and {} articles containing "自由".'.format(
    len(Meiroku_女_df), len(Meiroku_自由_df)))

We can, of course, add additional criteria, such as choosing only works by Mori Arinori that use 女 more often than 自由. We can build the mask in several steps:

In [None]:
mask = (Meiroku_df.text.str.count('女') > Meiroku_df.text.str.count('自由'))
mask = mask & (Meiroku_df.author == '森有礼')
Meiroku_df[mask].title
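
The same selection can also be written as a single expression. A minimal sketch on made-up data follows (`toy_df` and its rows are hypothetical); the point to note is that each comparison needs its own parentheses, because `&` binds more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Hypothetical toy data standing in for Meiroku_df.
toy_df = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'author': ['森有礼', '森有礼', '西周'],
    'text': ['女 女 自由', '自由 自由 女', '女 の 権利'],
})

# Several steps: build the mask, then narrow it.
stepwise = toy_df.text.str.count('女') > toy_df.text.str.count('自由')
stepwise = stepwise & (toy_df.author == '森有礼')

# One step: both conditions combined in a single expression.
one_step = (
    (toy_df.text.str.count('女') > toy_df.text.str.count('自由'))
    & (toy_df.author == '森有礼')
)

print(toy_df[one_step].title.tolist())  # ['t1']
```

Both masks select the same rows; which form reads better is largely a matter of taste.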

You can also combine conditions with the “or” operator `|`, the shifted character on the backslash key. For example, to get the titles of essays written by either Mori Arinori or Katō Hiroyuki:

In [None]:
mask = (Meiroku_df.author == '森有礼') | (Meiroku_df.author == '加藤弘之')
Meiroku_df[mask].title
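
When the "or" is over several values of the same column, `Series.isin` is a common alternative to chaining `|`. A short sketch, assuming a made-up frame (`toy_df` and its titles are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for Meiroku_df.
toy_df = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4'],
    'author': ['森有礼', '西周', '加藤弘之', '森有礼'],
})

# Explicit "or" of two comparisons.
either = (toy_df.author == '森有礼') | (toy_df.author == '加藤弘之')

# Membership test against a list of authors.
with_isin = toy_df.author.isin(['森有礼', '加藤弘之'])

print(either.equals(with_isin))          # True
print(toy_df[with_isin].title.tolist())  # ['t1', 't3', 't4']
```

The `isin` form scales more comfortably as the list of authors grows.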