# An Introduction to Japanese Text Mining: Part One

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The example notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part1.html) using Python instead of R. The quoted text below comes directly from Ravina's article, with minor word changes for Python syntax.

## Imports

In [None]:
import pandas as pd
import plotly.express as px  # the standalone plotly_express package is now part of plotly

## Data Structures

The pandas `DataFrame` is the main Python analogue of R's `data.frame`.

> Let’s start with a tidy, pre-processed text, the famous nineteenth-century journal Meiroku zasshi 明六雑誌。We’ll come back to the demanding task of inputting, cleaning, and “chunking” texts, but for now, let’s build on the wonderful work of NINJAL (the National Institute for Japanese Language and Linguistics), who compiled and proofed this corpus. To load the Meiroku zasshi (technically to read it into your local environment) run the code below. 

To run a cell of code, just get the cursor in that cell and hit the run icon in the top pane of Jupyter, or use the shortcut keys: Shift+Enter.

Pandas has a number of ways to get data (local or remote) into a data frame. Reading CSV (comma-separated values) files is very common. The cell below reads a CSV-style file stored on a website hosted by the original author (Mark Ravina) at Emory University. Note that the field separator is a single space instead of a comma.

In [None]:
meiroku_zasshi_url = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/meiroku_zasshi.txt'
Meiroku_df = pd.read_csv(meiroku_zasshi_url, sep=' ')

This command pulls the whole Meiroku zasshi into the Jupyter notebook. 
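
To see what `sep=' '` does without touching the network, here is a minimal sketch that reads a small space-separated table from an in-memory string. The two sample rows are made up for illustration and are not taken from the real corpus file.

```python
import io
import pandas as pd

# Hypothetical two-row sample (not from the real Meiroku zasshi file).
sample = 'title author year\n"Essay one" A 1874\n"Essay two" B 1875\n'

# sep=' ' splits fields on single spaces; double quotes protect
# the spaces that occur inside a field such as the title.
sample_df = pd.read_csv(io.StringIO(sample), sep=' ')
print(sample_df.shape)
print(list(sample_df.columns))
```

The same call works on a URL or a local path in place of the `io.StringIO` object.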

> As you can see, we now have the Meiroku zasshi in a data frame called `Meiroku_df`. A data frame is similar to a spreadsheet, although a software engineer might cringe at that statement. In a data frame, each row is a case or observation and each column is a variable. We’ll come back to those terms many times. For now, just understand that the Meiroku zasshi articles have been put into a grid, with each article and related metadata in its own row. Metadata is information about the text, such as the author.

> We identify columns of a data frame by combining the name of the data frame and the name of the column, joined by the `.` mark. The `.` mark tells Python that a given column (or vector) is associated with a specific dataframe. `Meiroku_df.author` means the column `author` in the dataframe `Meiroku_df`.

Common ways to inspect a dataframe are to look at the `head` or `tail` of the underlying table. 

In [None]:
Meiroku_df.head()

Actually, when you select a column, pandas returns a `Series` object instead of a `DataFrame`, so beware. 

In [None]:
Meiroku_df.author.tail()
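
The difference between a `Series` and a `DataFrame` can be checked directly. This small sketch uses a made-up two-row frame (the names `toy_df`, `single`, and `double` are illustrative only).

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({'author': ['A', 'B'], 'year': [1874, 1875]})

single = toy_df['author']    # one column name -> Series
double = toy_df[['author']]  # list of column names -> DataFrame

print(type(single).__name__, type(double).__name__)
```

Use the double-bracket form when a later step expects a two-dimensional table.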

> You’ll note that the output is a list of Meiroku zasshi authors, with multiple appearances if they wrote more than one article. The full column is 155 elements long because there are 155 articles; `tail()` shows only the last five.

Extract a `Series` of author names from the dataframe, then reduce it to its unique values.

In [None]:
Meiroku_df.author.unique()

> The vector `Meiroku_df.author` is a one dimensional data object, so if we want to grab a single element we just need one number. We indicate the element’s location using square brackets. Thus, for the author of the second Meiroku zasshi article:

In [None]:
Meiroku_df.author[2]

Select the second through fifth rows by position with `Meiroku_df.author.iloc[1:5]`. Positions start at 0 and the end of the slice is excluded.

In [None]:
Meiroku_df.author.iloc[1:5]

> Data frames are two-dimensional objects, so identifying an element requires two markers, first the row number(s), then the column name(s). The author information is in the 4th column of the data frame Meiroku_df, so to get the author of the second article:

In [None]:
Meiroku_df.loc[2, 'author']
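
Label-based (`loc`) and position-based (`iloc`) selection behave differently at the end of a slice, which is easy to trip over. The sketch below uses a toy series with a 1-based index. That index is an assumption made for illustration, and the real frame's index may differ.

```python
import pandas as pd

# Toy series with labels 1..5 (illustrative assumption).
toy = pd.Series(['a', 'b', 'c', 'd', 'e'], index=range(1, 6))

by_label = toy.loc[1:3]      # labels 1, 2, 3: the end label is included
by_position = toy.iloc[1:3]  # positions 1, 2: the end position is excluded

print(list(by_label))     # ['a', 'b', 'c']
print(list(by_position))  # ['b', 'c']
```

Checking `Meiroku_df.index` shows which labels `loc` will match.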

Odd-shaped selections of rows can be built up as a union (`|`) of sets and used as part of the dataframe selection argument.

In [None]:
rows = {2} | set(range(6, 9)) | {10, 12}
print(rows)
# Recent pandas versions reject a set as an indexer, so pass a sorted list instead.
Meiroku_df.loc[sorted(rows), ['title', 'author']]

In [None]:
Meiroku_df.loc[:, 'author']

In [None]:
Meiroku_df.loc[1:6, 'year']

## Assignment and Subsetting

> Let’s build on that basic knowledge of vectors by asking Python about `Meiroku_df.author`. For example, which elements of `Meiroku_df.author` are equal to Nishi Amane 西周? Note that the interrogative requires a double equals sign. (As an aside, a single equals sign is more of a command, telling Python to make `Meiroku_df.author` equal to Nishi Amane. That would actually overwrite all the author values!) So . . .

In [None]:
# Author Nishi Amane
special_subset = Meiroku_df.author == '西周'

> Python answers our query with a logical vector: a series of `True`/`False` responses. Nishi Amane is the author of the first element, but not the second, etc. This is accurate, but not especially useful. We can, however, use this information to get the subset of the dataframe elements for which the answer is `True`.

> Now we can use that vector to get the titles of all the Meiroku zasshi articles written by Nishi Amane.

In [None]:
Meiroku_df[special_subset].title

In [None]:
Meiroku_df.year[special_subset]
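
The mask-then-select pattern is easier to follow on a tiny table. This sketch uses three made-up rows, and the column values are placeholders.

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({'author': ['X', 'Y', 'X'],
                       'title': ['t1', 't2', 't3']})

mask = toy_df.author == 'X'           # boolean Series, one value per row
selected = toy_df[mask].title.tolist()

print(mask.tolist())  # [True, False, True]
print(selected)       # ['t1', 't3']
```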

> Remember that the syntax for selecting from a data frame is

> `Name_of_data_frame.loc[row_label, column_name]`

> and that a bare `:` in either position means “everything.” A boolean mask on its own inside the brackets, as in `Meiroku_df[special_subset]`, selects rows and keeps every column. So we just took all of the columns of `Meiroku_df` but just some of the rows. If you want a denser syntax, you can skip the intermediate step of creating the vector `special_subset`. Just put the selection criteria right in the brackets:

In [None]:
Nishi_articles_df = Meiroku_df[Meiroku_df.author == '西周']

> Programmers love dense code like that and they esteem “one-liners,” extremely compact, powerful code snippets. But, at least at first, it can be much easier to code in small incremental steps.

## Functions and more Subsetting

> In order to do more sophisticated text mining, we’ll rely on some packages and their functions. Setting aside the technicalities, functions are commands and packages are bundles of related functions. In order to use a package we need to install it once, but load it each time we restart Python or otherwise clear the Python environment. By way of extended metaphor, installing a package is like having the library buy a book. By contrast, `import` gets the book and opens it on your desk. The string class is built into Python, so it needs neither step:

The cell below lists the methods available on Python's built-in `str` class, ignoring "dunder" (double underscore) methods.

In [None]:
function = str
print([s for s in dir(function) if '__' not in s])

> The string class has a series of logically named methods for handling strings, a technical term for alphanumeric text. A good example of such a simple, logical function is `str.count`. What do you suppose this command does?

In [None]:
mask = Meiroku_df.text.str.count('女') != 0
Meiroku_df['女'] = Meiroku_df.text.str.count('女')
Meiroku_df[mask].head()

> Note that this command counts the character 女 both alone and in longer compounds such as 男女 and 女性. We’ll explore methods for refining that search soon. For now, as an interim method, you can add whitespace and search for `" 女 "`. That will miss the occasional cases of 女 at the beginning or end of a sentence, or (if there’s punctuation) before a period or comma. So we’ll cover a more refined method of search in the next session.
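
The three ways of counting can be compared side by side on made-up strings. The whole-token variant below is one possible approach and is an assumption of this sketch, not the method from the tutorial's next session.

```python
import pandas as pd

# Toy, pre-segmented strings for illustration only.
texts = pd.Series(['男女 同権 女', '女性 と 女', '自由'])

anywhere = texts.str.count('女')   # also counts 女 inside 男女 and 女性
padded = texts.str.count(' 女 ')   # misses 女 at the very start or end
whole_token = texts.str.split().apply(lambda words: words.count('女'))

print(anywhere.tolist())     # [2, 2, 0]
print(padded.tolist())       # [0, 0, 0]
print(whole_token.tolist())  # [1, 1, 0]
```

Note that `Series.str.count` interprets its argument as a regular expression, so characters such as `.` or `(` in a search term need escaping.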

> Rather than just let the results of `str.count` hang loose, we can add them to the data frame `Meiroku_df`, creating a new column called 女. Use bracket assignment (`Meiroku_df['女'] = ...`) to put the vector in the data frame; pandas does not create new columns through the `.` operator.

In [None]:
mask = Meiroku_df.text.str.count('女') != 0
Meiroku_df['女'] = Meiroku_df.text.str.count(' 女 ')

In [None]:
Meiroku_df[mask]

> We can now use the same tricks as before to subset a data frame. Let’s select every essay in the Meiroku zasshi that used the character 女 more often than the word 自由.

In [None]:
mask = Meiroku_df.text.str.count('自由') != 0
Meiroku_df['自由'] = Meiroku_df.text.str.count('自由')

In [None]:
print('There are {} articles containing the string " 女 " and {} articles containing "自由".'.format(
    len(Meiroku_df[Meiroku_df['女'] > 0]), len(Meiroku_df[Meiroku_df['自由'] >  0])))

In [None]:
Meiroku_subset_df = Meiroku_df[Meiroku_df.女 > Meiroku_df.自由]

In [None]:
len(Meiroku_subset_df)

We can, of course, add additional criteria, such as choosing only works by Mori Arinori that use 女 more than 自由. Here we build the mask in several steps:

In [None]:
mask = (Meiroku_df.text.str.count('女') > Meiroku_df.text.str.count('自由'))
mask = mask & (Meiroku_df.author == '森有礼')
Meiroku_df[mask].title

> You can also combine conditions with the “or” operator `|`, typed with Shift and the backslash key. For example, to get the titles of essays written by either Mori Arinori or Katō Hiroyuki:

In [None]:
mask = (Meiroku_df.author == '森有礼') | (Meiroku_df.author == '加藤弘之')
Meiroku_df[mask].title
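
When combining conditions, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`. A minimal sketch on made-up data (the column `n` is a placeholder for any count column):

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({'author': ['X', 'Y', 'Z', 'X'], 'n': [3, 0, 5, 0]})

both = toy_df[(toy_df.author == 'X') & (toy_df.n > 0)]            # "and"
either = toy_df[(toy_df.author == 'X') | (toy_df.author == 'Y')]  # "or"
same_as_either = toy_df[toy_df.author.isin(['X', 'Y'])]           # shorthand for many "or"s

print(both.index.tolist(), either.index.tolist())
```

`isin` is convenient once the list of authors grows beyond two or three names.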

> Take a moment to experiment with subsetting, creating new variables, and specifying multiple criteria.

## Collocation: A basic data visualization

> We’re now going to shift from straightforward, simple code to some dense, advanced commands. That means, for now, just focusing on a few key arguments and ignoring other parts of the command. Python has some wonderful packages for visualizing data. We’ll use plotly express, a great interactive graphing package. As with all packages, you’ll need to install them once, but only once.

In [None]:
mask = Meiroku_df.text.str.count(' 女 ') != 0
Meiroku_df['女'] = Meiroku_df.text.str.count('女')

In [None]:
px.scatter(Meiroku_df, x='女', y='自由', hover_name='author')

> Note that if you want to visualize a term, you need to first get the word count. If, for example, we want to plot 女 against 男, we need the count for 男.

In [None]:
mask = Meiroku_df.text.str.count('男') != 0
Meiroku_df['男'] = Meiroku_df.text.str.count('男')

> Now let’s reuse the plotting code, but replacing 自由 with 男

In [None]:
px.scatter(Meiroku_df, x='女', y='男', hover_name='author')

> As you may have noticed, these graphs do not contain 155 points because some of the articles have the exact same values. This problem is called overplotting: we can’t see some of the observations because they are underneath other observations with the same value.
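
One way to see how much stacking there is, sketched here on made-up numbers rather than the Meiroku counts, is to group on the plotted pair of values and count the rows in each group.

```python
import pandas as pd

# Toy coordinates for illustration only: five observations, three distinct points.
toy_df = pd.DataFrame({'x': [0, 0, 1, 1, 2], 'y': [0, 0, 3, 3, 1]})

# One row per distinct (x, y) pair, with the number of observations stacked there.
stacked = toy_df.groupby(['x', 'y']).size().reset_index(name='n_points')

print(len(toy_df), 'observations,', len(stacked), 'visible points')
print(stacked)
```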

> In this case, we can fix the problem by counting the words as percentages of the total characters in each article. That’s sometimes called “normalizing.”

In [None]:
def nchar(text):
    '''Count the characters in the string, excluding spaces.'''
    return len(text.replace(' ', ''))

In [None]:
Meiroku_df['女'] = Meiroku_df.text.apply(lambda x: x.count('女')/nchar(x)*100)
Meiroku_df['男'] = Meiroku_df.text.apply(lambda x: x.count('男')/nchar(x)*100)
Meiroku_df['自由'] = Meiroku_df.text.apply(lambda x: x.count('自由')/nchar(x)*100)
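
The same normalization can also be written with pandas string methods instead of `apply`. This is an alternative sketch on toy strings, not a change to the approach above.

```python
import pandas as pd

# Toy, pre-segmented strings for illustration only.
texts = pd.Series(['女 と 男', '自由 自由'])

# Characters per text, ignoring the spaces between tokens.
n_chars = texts.str.replace(' ', '', regex=False).str.len()

# Occurrences of the term as a percentage of all characters.
pct = texts.str.count('女') / n_chars * 100

print(n_chars.tolist())  # [3, 4]
print(pct.round(2).tolist())
```

Either form divides by zero for an empty text, so it may be worth checking for empty rows first.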

In [None]:
px.scatter(Meiroku_df, x='女', y='男', hover_name='author')

> Note that any numeric variable can be used for x and y in `plotly`, so here’s how the usage of 女 varied over time.

In [None]:
px.scatter(Meiroku_df, x='issue', y='女', hover_name='author')

## Creating a DTM: Document-term matrix

> We now have some fairly powerful tools, but these methods are somewhat labor-intensive. There are over 15,000 unique words in the Meiroku zasshi and it would be cumbersome to write 15,000 lines of code by hand, one for each term.

> Fortunately, Python loves to help with repetitive tasks so we can write 7 or 8 lines of code instead of 15,000. Unfortunately, some of that code is rather advanced, so, for now at least, you’ll just have to use the commands without fully understanding the details. Much of the complexity below involves turning lists and matrices into data frames. We’ll get to those more conceptual issues later.

> For now we’ll need a list of all the unique terms in the Meiroku zasshi. To get that, we’ll need to smash all the individual articles together into one long string. We’ll use the command `join`, the text equivalent of addition.

In [None]:
from collections import Counter

> Because the individual articles are elements of a vector, we need to use the join function. Note how this “joins” all the articles into one long string.

In [None]:
complete_meiroku = ' '.join(Meiroku_df.text)

> Now we can split the string into individual words, separating on whitespace. The command str.split is appropriately named: it splits strings. The first line below should therefore be obvious. 

In [None]:
complete_meiroku_split = complete_meiroku.split()

In [None]:
len(complete_meiroku_split)

> The object `complete_meiroku_split` is now a vector with ~~173,197~~ 172,875 elements, the total word count for all 155 articles. To get a list of unique words:

In [None]:
meiroku_unique_words = set(complete_meiroku_split)
len(meiroku_unique_words)

> Note that `meiroku_unique_words` is much smaller: only ~~15,603~~ 15,601 elements. Another handy class is `Counter`, which quickly and easily calculates the frequency of every word in the Meiroku zasshi. Note that you can easily sort this data frame by frequency.
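
The join, split, unique, and count steps can be traced on two made-up "articles". Note that a no-argument `split()` treats any Unicode whitespace, including the ideographic space `\u3000`, as a separator.

```python
from collections import Counter

# Toy, pre-segmented articles for illustration only.
articles = ['自由 の 権', '女 の 権\u3000自由']

joined = ' '.join(articles)  # one long string
words = joined.split()       # split on any whitespace, including \u3000
unique = set(words)          # distinct words
counts = Counter(words)      # word -> frequency

print(len(words), len(unique))  # 7 4
print(counts['自由'], counts['女'])  # 2 1
```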

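Note: the counts here differ slightly from those in the original tutorial, and the cells below are an attempt to resolve the difference. The ideographic space `\u3000` shows up in the text, so the text is passed through Unicode normalization (NFKC). Other whitespace characters may also need to be accounted for.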

In [None]:
import unicodedata

In [None]:
def normalize(text):
    return unicodedata.normalize('NFKC', text)

In [None]:
Meiroku_df['text'] = Meiroku_df.text.map(normalize)

In [None]:
all_text = ' '.join(Meiroku_df.text)
all_words = all_text.split()
print(len(all_words), len(set(all_words)))
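
To make the effect of NFKC concrete, this sketch normalizes a short made-up string containing an ideographic space and full-width Latin letters and digits.

```python
import unicodedata

# Toy string: \u3000 is the ideographic (full-width) space.
raw = '明六\u3000雑誌 ＡＢＣ１２３'
clean = unicodedata.normalize('NFKC', raw)

print(repr(raw))
print(repr(clean))  # '明六 雑誌 ABC123'
```

NFKC folds compatibility characters into their ordinary forms, so it can also merge tokens that differ only in width.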

In [None]:
counts = Counter(all_words)
Meiroku_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
Meiroku_frequency_df.columns = ['word', 'count']
Meiroku_frequency_df = Meiroku_frequency_df.sort_values(by='count', ascending=False)
Meiroku_frequency_df['term index'] = list(range(1,len(Meiroku_frequency_df)+1))
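
A shorter route to the same kind of table, offered as an alternative sketch, uses `Counter.most_common()`, which returns (word, count) pairs already sorted by descending count. The word list here is made up.

```python
from collections import Counter
import pandas as pd

# Toy word list for illustration only.
counts = Counter(['の', '自由', 'の', '権', 'の', '自由'])

# most_common() yields (word, count) pairs, most frequent first.
freq_df = pd.DataFrame(counts.most_common(), columns=['word', 'count'])
freq_df['term index'] = range(1, len(freq_df) + 1)  # rank, starting at 1

print(freq_df)
```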

In [None]:
fig = px.scatter(Meiroku_frequency_df, x='term index', y='count', 
                 hover_name='word', log_x=True, log_y=True)
fig.layout.title = 'Total Vocabulary {}'.format(len(set(all_words)))
fig

In [None]:
def text_length(text):
    return len(text.split())

Meiroku_df['text_length'] = Meiroku_df.text.map(text_length)

In [None]:
def text_frequency(text):
    counts = Counter({word:0 for word in Meiroku_frequency_df.word})
    counts.update(text.split())
    return counts

In [None]:
Meiroku_df['word_counts'] = Meiroku_df.text.map(text_frequency)

In [None]:
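# Build one row per article from the per-article Counter objects.
dtm = pd.DataFrame(list(Meiroku_df.word_counts))
# Order the columns from most to least frequent word.
dtm = dtm[Meiroku_frequency_df.word]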

In [None]:
dtm
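
The shape of a document-term matrix is easier to see at toy scale. In this sketch, built on two made-up documents, each row is a document, each column is a term, and words missing from a document are filled with 0 after the fact rather than pre-seeded as in `text_frequency` above.

```python
from collections import Counter
import pandas as pd

# Toy, pre-segmented documents for illustration only.
docs = ['自由 の 権', '女 の 自由 自由']

# One Counter per document; pandas aligns the keys into columns.
toy_dtm = pd.DataFrame([Counter(doc.split()) for doc in docs])
toy_dtm = toy_dtm.fillna(0).astype(int)  # absent words become 0

print(toy_dtm)
```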

In [None]:
mask = (Meiroku_frequency_df['count'] == 10)
Meiroku_frequency_df[mask].tail()

In [None]:
dtm[dtm.朝鮮 > 0]

In [None]:
# document_index = 46
# Meiroku_df.text.loc[document_index + 1]

## Aggregating and Sorting

Let’s do some final manipulation of the document-term matrix, aggregating by author. Which authors favored which words? First, let’s see which authors wrote for the Meiroku zasshi.

In [None]:
set(Meiroku_df.author)

Now let’s aggregate the word frequencies by author. We’ll get the total word count for each author, and then “renormalize” the dtm. One catch is that the names of the authors are non-numeric, so we must not do math on them. We’ll add the author names to the dtm as a column and group by it, which keeps that column out of the sums.

In [None]:
dtm.shape

In [None]:
Meiroku_df.shape

In [None]:
dtm['author'] = list(Meiroku_df.author)

In [None]:
dtm.head()

In [None]:
# GroupBy.sum() takes no axis argument; it sums each column within each author.
df = dtm.groupby('author').sum()
df

In [None]:
df_normalized = df.apply(lambda x: x/x.sum(), axis=1)*100

In [None]:
df_normalized.自由.sort_values(ascending=False)
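
The aggregate-then-normalize step can be checked by hand on a tiny table. This sketch uses made-up counts and `DataFrame.div` in place of `apply`, which should give the same row percentages.

```python
import pandas as pd

# Toy document-term counts with an author column (illustration only).
toy_dtm = pd.DataFrame({'自由': [2, 0, 1],
                        '女': [0, 3, 1],
                        'author': ['X', 'Y', 'X']})

by_author = toy_dtm.groupby('author').sum()  # total count per author and term

# Divide each row by its own total so every author's row sums to 100.
shares = by_author.div(by_author.sum(axis=1), axis=0) * 100

print(shares)
```

Author X has 3 of 4 tokens as 自由, so that cell comes out as 75.0.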