# An Introduction to Japanese Text Mining: Part Three

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The example notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part2.html) using Python instead of R. The quoted text below is taken directly from Ravina's article, with minor wording changes for Python syntax.

## Imports

In [None]:
import re
import requests
import pandas as pd
import numpy as np
import plotly.express as px
import unicodedata
from collections import Counter

## Overview of Basic Statistical Techniques

> In the following session we will learn several techniques for comparing large numbers of documents. All of these techniques rely on the vector space model of documents, which sees each text as a vector of values corresponding to the words in that text. We will learn how to manipulate word vectors, how to preprocess them for analysis, and how to visualize their distances from one another using a variety of clustering algorithms. Finally, we will explore one approach for identifying words that distinguish between two groups of texts. This workbook assumes that you’ve already completed Parts 1 and 2 and builds on the programming skills you developed there.

## Loading the Data

> To begin, we will load the Meiroku Zasshi data as we did in Part 2 of this tutorial. This means grabbing the texts and their associated metadata and producing a document-term-matrix (DTM) with raw counts of every word across all 155 texts.

In [None]:
def text_frequency(text):
    counts = Counter({word:0 for word in Meiroku_frequency_df.word})
    counts.update(text.split())
    return counts

In [None]:
def bad_text_frequency(text):
    '''
    Illustrate the error in counting character matches. Use this
    frequency count to reproduce the exact results of the tutorial.
    
    Below essentially reproduces the computation in the str_count R code (counting regex matches):
      dtm.matrix <- sapply(X = Meiroku.unique.words, 
                           FUN = function(x) str_count(string = Meiroku.df$text, pattern = x)
                        )
      dtm.df <- as.data.frame(dtm.matrix)
    '''
    counts = {word:0 for word in Meiroku_frequency_df.word}
    for word in counts:
        matches = re.findall('{}'.format(word), text)
        counts[word] = len(matches)
    return counts

In [None]:
meiroku_zasshi_url = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/meiroku_zasshi.txt'
Meiroku_df = pd.read_csv(meiroku_zasshi_url, sep=' ')
complete_meiroku = ' '.join(Meiroku_df.text)
complete_meiroku_split = complete_meiroku.split()
meiroku_unique_words = set(complete_meiroku_split)
all_words = complete_meiroku.split()

counts = Counter(all_words)
Meiroku_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
Meiroku_frequency_df.columns = ['word', 'count']
Meiroku_frequency_df = Meiroku_frequency_df.sort_values(by='count', ascending=False)
Meiroku_frequency_df['term index'] = list(range(1,len(Meiroku_frequency_df)+1))

# TODO: Track down if R automatically normalized text (esp. \u3000 codes).
Meiroku_df['text'] = Meiroku_df.text.map(lambda x: unicodedata.normalize('NFKC', x))

# Note: Use the bad_text_frequency to reproduce tutorial results.
Meiroku_df['word_counts'] = Meiroku_df.text.map(text_frequency)
dtm_df = pd.DataFrame.from_dict(list(Meiroku_df.word_counts.values))
dtm_df = dtm_df[Meiroku_frequency_df.word]
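
The NFKC step above matters for the `\u3000` question in the TODO: NFKC maps the ideographic space (U+3000) to an ordinary space, and Python's `str.split()` already treats both as whitespace. A minimal sketch with a made-up two-token string (not taken from the corpus):

```python
import unicodedata

# Hypothetical sample: two tokens separated by an ideographic space (U+3000).
raw = "洋字\u3000を"

# NFKC replaces compatibility characters such as U+3000 with their plain equivalents.
normalized = unicodedata.normalize("NFKC", raw)

# Whitespace splitting yields the same tokens before and after normalization,
# but a regex search for "\u3000" only matches the raw string.
raw_tokens = raw.split()
normalized_tokens = normalized.split()
print(repr(normalized), raw_tokens, normalized_tokens)
```

This is one plausible reason the tutorial's table has a column for the `　` character while the whitespace-split counts do not.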

### Discrepancy in Numeric Values Between the Python and R Notebooks

Results from the tutorial for the document term counts:
```
##   洋字  を 以  て 國語 書する  の 論 　 西
## 1    7 282 29 188    6      2 276  9 22  1
```

The Python results below come from counting already tokenized text by splitting on whitespace. The tutorial's results can be reproduced with the `bad_text_frequency` routine.

The basic issue is that the tutorial counts regex matches rather than whole tokens, so a word that also occurs inside longer tokens gets overcounted.
```
洋字       7
を      281
以       24
て      129
國語       6
書する      2
の      249
論        4
西        1
Name: 0, dtype: int64
```

It appears that the R code was finding substring matches instead of whole-word matches.
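
The difference between the two counting strategies can be seen on a small made-up string. This is only an illustrative sketch; `sample_text` is a hypothetical example, not a line from the Meiroku Zasshi corpus:

```python
import re
from collections import Counter

# Hypothetical pre-tokenized text: tokens are separated by single spaces.
sample_text = "て に 以て て 於て"

# Regex matching counts every occurrence of the character,
# including those inside the longer tokens 以て and 於て.
substring_count = len(re.findall("て", sample_text))

# Whole-token counting only counts tokens that are exactly the word.
token_count = Counter(sample_text.split())["て"]

print(substring_count, token_count)  # 4 2
```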

In [None]:
text = Meiroku_df.text.iloc[0]

In [None]:
for word in ['洋字','を','以','て','國語','書する','の','論','西']:
    matches = re.findall('.{}.'.format(word), text)
    # print(len(matches), matches)

    discrepancy = len([m.strip() for m in matches if (len(m.strip()) - len(word)) > 0])
    print(word, discrepancy, ' + ', dtm_df.loc[0, word], ' = ', discrepancy + dtm_df.loc[0, word])

In [None]:
matches_split = []
word = 'て'
for w in text.split():
    if w == word:
        matches_split.append(w)

len(matches_split)

In [None]:
matches_split = []
for w in re.split(r'\W+', text):
    if w == word:
        matches_split.append(w)
        
len(matches_split)

In [None]:
dtm_df.loc[0, ['洋字','を','以','て','國語','書する','の','論','西']]

## Word Vectors

> Each row in our document-term-matrix can be treated as a word vector. A vector is simply a sequence of numbers wherein each number represents the value of some variable (or attribute) associated with a particular data point. In this case, the variable is the raw frequency of a particular word. Seen this way, our document-term-matrix is a collection of 155 vectors and 15,603 variables (i.e., the number of unique words in the corpus). What’s great about these vector representations of each text is that we now have a way to compare many texts via mathematical operations. For each text is now an arrow pointing out into a high dimensional space (a space with 15,603 dimensions!) and there are easy ways to compare the distances between arrows based on where they point in this space. This is known as the vector space model.
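
As a small sketch of what "each row is a word vector" means, the snippet below turns two made-up tokenized sentences into count vectors over a shared vocabulary. The texts and the helper name `to_vector` are assumptions for illustration only:

```python
from collections import Counter

# Hypothetical tokenized texts (tokens separated by spaces).
texts = ["猫 が 鳴く", "犬 が 鳴く 犬 が 走る"]

# The shared vocabulary defines the dimensions of the vector space.
vocabulary = sorted({word for text in texts for word in text.split()})

def to_vector(text, vocabulary):
    """Return raw counts of each vocabulary word in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

# One vector per text; every vector has len(vocabulary) entries.
vectors = [to_vector(text, vocabulary) for text in texts]
for text, vector in zip(texts, vectors):
    print(text, vector)
```

Each list plays the role of one row of `dtm_df`, with one position per unique word.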

> Within the vector space model, there are multiple ways of manipulating the vectors in order to compare them. We will look at the following methods in this part of the tutorial:
* Euclidean Distance
* Cosine Similarity
* K-Means Clustering

> The first two are closely related and rely on the geometrical properties inherent to vectors. If we think of these vectors as pointing out into n-dimensional space, then Euclidean Distance and Cosine Similarity are two different ways of measuring how close any two vectors are in this space. They differ, however, in that Euclidean is more sensitive to length than Cosine because the former measures distance between the endpoints of the vectors. Thus one must be careful to normalize one’s vectors before using Euclidean distance, otherwise the differences between texts may only reflect their size (e.g., the more words in a text, the further its vector will point).

> Cosine similarity, in contrast, measures the cosine of the angle between two vectors and is thus not susceptible to differences in length. It is a similarity measure, not a distance measure, and ranges from 0 to 1. The closer to 1 it is, the more similar two word vectors are. To express as a distance, we subtract the cosine similarity from 1.
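
To make the length sensitivity concrete, here is a minimal sketch comparing two made-up count vectors with identical proportions but different lengths. The vectors and function names are hypothetical and use only the standard library:

```python
import math

def euclidean_distance(u, v):
    """Distance between the endpoints of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]  # hypothetical raw counts
long_doc = [3, 6, 0]   # same word proportions, three times as long

# Euclidean distance is large even though the proportions are identical.
print(euclidean_distance(short_doc, long_doc))
# Cosine similarity is 1 (up to floating-point rounding): same direction.
print(cosine_similarity(short_doc, long_doc))
```

The two texts point in the same direction, so only the Euclidean measure separates them.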

> This brings us to an important point about the importance of normalizing vectors. We want to be sure that we’re comparing texts on the same scale. We did this previously when we normalized our word counts by document length, producing relative frequencies.

In [None]:
dtm_norm_df = dtm_df.apply(lambda x: x/x.sum(), axis=1)*100

In [None]:
dtm_norm_df.loc[0, ['洋字','を','以','て','國語','書する','の','論','西']]

> But there are other ways to normalize our texts depending on the kind of information we want to extract from them. One problem with relative frequencies is that they give a lot of weight to high frequency terms (e.g., like the particle を). A lot of high frequency words, however, are not distinguishing features between texts. If lots of documents use the particle を, then it doesn’t tell us a great deal about differences between documents. One method for down weighting such terms is the term-frequency inverse-document frequency (tf-idf) method. This method re-weights words according to how often they occur across a corpus, giving less weight to terms that appear in many documents and more weight to terms that appear in only a few documents. Here’s the formula:
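
The weighting can be written as follows. This is a reconstruction inferred from the Python cell that follows, not a copy of the tutorial's own figure; it assumes $N$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents that contain $t$:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
$$

The base of the logarithm only rescales every weight by the same constant, so it does not change which terms rank higher.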

> The logarithm in the denominator ensures that rarer terms will be given a higher weight. Here’s code to re-produce our DTM with the tf-idf weights instead of the relative frequencies. We should see that the values for some of the common, high frequency words is now lower than the values for rarer, less frequent words.

In [None]:
tf = dtm_df.copy()
dtm_norm_matrix = tf.apply(lambda x: x/x.sum(), axis=1)

# Note: The tutorial's R-based code uses a base-2 log; np.log here is the natural log (np.log2 would match the tutorial).
idf = np.log(len(tf)/tf.astype(bool).sum(axis=0))
# tfidf = dtm_norm_matrix*idf
tfidf = tf*idf

In [None]:
tfidf.loc[0,['洋字','を','以','て','國語','書する','の','論','西']]

### Toy TFIDF Example

From the [GitHub project](https://github.com/mayank408/TFIDF/blob/master/TFIDF.ipynb) of Mayank Tripathi. (Love Google search.) It differs from the tutorial by using a base-10 log and by multiplying the inverse document frequency by the __normalized__ term frequencies.

In [None]:
docA = "The cat sat on my face"
docB = "The dog sat on my bed"

In [None]:
# TODO: include doc_frequency_df as a parameter.
def toy_text_frequency(text):
    counts = Counter({word:0 for word in toy_frequency_df.word})
    counts.update(text.split())
    return counts

In [None]:
complete_doc = ' '.join([docA, docB])
complete_doc_split = complete_doc.split()
doc_unique_words = set(complete_doc_split)
toy_all_words = complete_doc.split()

toy_counts = Counter(toy_all_words)
toy_frequency_df = pd.DataFrame.from_dict(toy_counts, orient='index').reset_index()
toy_frequency_df.columns = ['word', 'count']
toy_frequency_df = toy_frequency_df.sort_values(by='count', ascending=False)
toy_frequency_df['term index'] = list(range(1,len(toy_frequency_df)+1))

doc_df = pd.DataFrame([docA, docB], columns=['text'])
doc_df['word_counts'] = doc_df.text.map(toy_text_frequency)
toy_dtm_df = pd.DataFrame.from_dict(list(doc_df.word_counts.values))
# dtm_df = dtm[Meiroku_frequency_df.word]

In [None]:
toy_dtm_df

In [None]:
toy_dtm_norm_matrix = toy_dtm_df.apply(lambda x: x/x.sum(), axis=1)
toy_dtm_norm_matrix

Check that the rows are normalized to 1:

In [None]:
toy_dtm_norm_matrix.sum(axis=1)

The document frequency is how many documents each term shows up in. It should be a number between 1 and the number of documents.

In [None]:
# Favorite way to count nonzero elements in column (axis 0).
toy_dtm_df.astype(bool).sum(axis=0)

The inverse document frequency, `idf`, normalizes the document frequency by dividing it by the total number of documents, inverts that ratio, and then takes the logarithm.

In [None]:
toy_idf = np.log10(len(toy_dtm_df)/toy_dtm_df.astype(bool).sum(axis=0))

In [None]:
toy_idf

In [None]:
toy_dtm_norm_matrix*toy_idf
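
As a cross-check on the toy cells above, the same arithmetic can be worked through in plain Python. This is a sketch under the same assumptions as the toy example (base-10 log, normalized term frequency); the helper name `tf_idf` is hypothetical:

```python
import math

doc_a = "The cat sat on my face".split()
doc_b = "The dog sat on my bed".split()
docs = [doc_a, doc_b]

def tf_idf(term, doc, docs):
    # Normalized term frequency: share of the document's tokens equal to `term`.
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing `term`
    # (assumed to be at least 1, i.e. the term occurs somewhere in `docs`).
    df = sum(1 for d in docs if term in d)
    # Base-10 log, as in the toy example.
    return tf * math.log10(len(docs) / df)

# "The" appears in both documents, so log10(2/2) = 0 and its weight vanishes.
print(tf_idf("The", doc_a, docs))
# "cat" appears in one document: (1/6) * log10(2), roughly 0.05.
print(tf_idf("cat", doc_a, docs))
```

Words shared by both documents drop to zero, which is the down-weighting effect described for high-frequency particles.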

## Feature Selection

> Once you’ve normalized your texts into comparable units, the next question to consider is whether you need all the dimensions that are available to you. Or to put it another way, is there a way to reduce the dimensions to filter out some of the noise introduced by having so many words? All of the words may not be important to the question you are trying to answer. This process of reducing the dimensions is called feature selection. Here are some common methods for reducing features:
* Retain only those words with an average frequency above some value (i.e., keep most frequent terms).
* Filter out the grammatical function words (or stopwords). These can be good indicators of authorial style, but they are less useful for identifying differences in content. Lists of stopwords should be tailored to whatever corpus you are working with and are typically created using the most frequent words in the corpus.
* Lemmatize or stem the words in your corpus, reducing them to their base forms. In this way, one can collapse all of the variants of a word into a single term. Lemma can easily be extracted from the MeCab/Unidic output, as discussed in a previous session.

> Here, we will perform two kinds of feature selection on the corpus before proceeding with our analysis. First we will filter out stopwords from the corpus.

In [None]:
# To build a list of stopwords, let's first get the highest frequency words.
# Build a frequency table from the Meiroku data.
all_words = complete_meiroku.split()
counts = Counter(all_words)
Meiroku_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
Meiroku_frequency_df.columns = ['word', 'count']
Meiroku_frequency_df = Meiroku_frequency_df.sort_values(by='count', ascending=False)
Meiroku_frequency_df['term index'] = list(range(1,len(Meiroku_frequency_df)+1))
Meiroku_frequency_df = Meiroku_frequency_df.reset_index(drop=True)

In [None]:
Meiroku_frequency_df.head(20)

In [None]:
len(Meiroku_frequency_df)

In [None]:
stopwords = list(Meiroku_frequency_df.word.loc[0:15])

In [None]:
print(stopwords)

In [None]:
nostop_norm_df = dtm_norm_df.copy()
nostop_norm_df.drop(stopwords, axis=1, inplace=True)

In [None]:
nostop_tfidf = tfidf.copy()
nostop_tfidf.drop(stopwords, axis=1, inplace=True)

In [None]:
print(nostop_norm_df.shape, nostop_tfidf.shape)

> Next, let’s reduce the features by keeping only those words that occur more than some threshold for mean relative frequency or mean tf-idf score.

`#keep only those columns where the mean relative frequency of a word is >= .05`

`reduced.norm.df <- nostop.norm.df[,apply(nostop.norm.df, 2, mean) >= .05]`

`#check to see how many features you have; this still seems like a lot so we will reduce further`

`dim(reduced.norm.df)`

In [None]:
mask = (nostop_norm_df.mean(axis=0) > 0.05)
columns = nostop_norm_df.columns[mask]
nostop_norm_df[columns].shape

In [None]:
mask = (nostop_norm_df.mean(axis=0) > 0.01)
columns = nostop_norm_df.columns[mask]
nostop_norm_df[columns].shape

In [None]:
mask = (nostop_norm_df.mean(axis=0) > 0.003)
columns = nostop_norm_df.columns[mask]
reduced_norm_df = nostop_norm_df[columns]
reduced_norm_df.shape

In [None]:
mask = (nostop_tfidf.mean(axis=0) > 0.3)
columns = nostop_tfidf.columns[mask]
reduced_tfidf_df = nostop_tfidf[columns]
reduced_tfidf_df.shape

## Visualizing Distances

> We are finally ready to start comparing our texts. Since 155 texts are a lot to visualize all at once, we will just look at the first 50 titles. Let’s also grab the corresponding metadata for these works since we will need them to label our visualizations.

In [None]:
authors = Meiroku_df.author
titles = Meiroku_df.title

In [None]:
# Just to make things interesting, we'll obscure one of the authors names
authors[12] = "Mystery Author"

> Now we create a distance matrix using the function dist(). This function will do a pairwise comparison of all the texts in our dataset (every text against every other) and store the results as a large matrix. This function will use Euclidean Distance to calculate distances. Let’s create a second distance matrix using Cosine Similarity. For this, we need to borrow a function from another R library.
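
The pairwise comparison described above can be sketched in plain Python on made-up data. The rows below are hypothetical relative frequencies, and the function names are illustrative; in practice `scipy.spatial.distance.pdist` accepts both `metric='euclidean'` and `metric='cosine'` and does the same work much faster:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # One minus the cosine similarity, so identical directions give 0.
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def distance_matrix(rows, metric):
    # Compare every row against every other row (including itself).
    return [[metric(u, v) for v in rows] for u in rows]

# Hypothetical relative-frequency rows for three short documents.
toy_rows = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.0, 1.0]]

euclidean_dm = distance_matrix(toy_rows, euclidean)
cosine_dm = distance_matrix(toy_rows, cosine_distance)

for row in cosine_dm:
    print([round(value, 3) for value in row])
```

Both matrices are square and symmetric, with zeros (up to rounding) on the diagonal.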

In [None]:
# Create a distance matrix using the relative frequency DTM.
# Each row of the reduced DTM is one document vector.
xy_list = reduced_norm_df.to_numpy()
dist = lambda p1, p2: np.sqrt(((p1 - p2)**2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])

In [None]:
reduced_norm_df.head()

In [None]:
from scipy.spatial.distance import pdist
import umap

In [None]:
reduced_norm_matrix = reduced_norm_df.to_numpy()

In [None]:
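# pdist returns a condensed vector of pairwise distances; squareform expands it into a full matrix.
from scipy.spatial.distance import squareform
euc_dist_matrix = squareform(pdist(reduced_norm_matrix, metric='euclidean'))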

In [None]:
reducer = umap.UMAP()
embedding = reducer.fit_transform(reduced_norm_matrix)
print(embedding.shape)

embedding_df = pd.DataFrame(embedding, columns=['x', 'y'])
embedding_df['authors'] = authors
embedding_df['title'] = titles

In [None]:
px.scatter(embedding_df, x='x', y='y', hover_name='authors')

In [None]:
import plotly.graph_objs as go

In [None]:
# Take the labels from embedding_df so the obscured "Mystery Author" label is always included.
unique_authors = list(embedding_df.authors.unique())

In [None]:
mask = (embedding_df.authors == unique_authors[0])
embedding_df[mask].head()

In [None]:
fig = go.FigureWidget()
for author in unique_authors:
    mask = (embedding_df.authors == author)
    scatter = fig.add_scatter(x=embedding_df[mask].x, y=embedding_df[mask].y)
    scatter.name = author
    scatter.mode = 'markers'
    scatter.hovertext = embedding_df[mask].title
    scatter.hoverinfo = 'x+y+text'

In [None]:
fig

In [None]:
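# UMAP is stochastic, so these cutoffs fit one particular run; adjust them after inspecting the plot.
mask = (embedding_df.x < 3.5) & (embedding_df.y > 1)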
embedding_df[mask]

In [None]:
reducer = umap.UMAP()
reduced_tfidf_matrix = reduced_tfidf_df.to_numpy()
embedding = reducer.fit_transform(reduced_tfidf_matrix)
print(embedding.shape)

embedding_df = pd.DataFrame(embedding, columns=['x', 'y'])
embedding_df['authors'] = authors
embedding_df['title'] = titles

In [None]:
fig_tfidf = go.FigureWidget()
for author in unique_authors:
    mask = (embedding_df.authors == author)
    scatter = fig_tfidf.add_scatter(x=embedding_df[mask].x, y=embedding_df[mask].y)
    scatter.name = author
    scatter.mode = 'markers'
    scatter.hovertext = embedding_df[mask].title
    scatter.hoverinfo = 'x+y+text'

In [None]:
fig_tfidf

In [None]:
# This cutoff also depends on the particular UMAP run; adjust it after inspecting the plot.
mask = (embedding_df.y < -7.3)
embedding_df[mask]