# An Introduction to Japanese Text Mining: Part Three

![Japanese Text Mining](images/japanese_text_mining.jpg)
Check out the [Emory University workshop blog](https://scholarblogs.emory.edu/japanese-text-mining/) on Japanese Text Mining. The notebook cells below repeat the steps in Mark Ravina's [tutorial](http://history.emory.edu/RAVINA/JF_text_mining/Guides/Jtextmining_intro_part2.html) using Python instead of R. The quoted text is taken directly from Ravina's article, with minor wording changes to match Python syntax.

## Imports

In [None]:
import re
import requests
import pandas as pd
import numpy as np
import plotly.express as px  # plotly_express is the legacy standalone package name
import unicodedata
from collections import Counter

## Overview of Basic Statistical Techniques

> In the following session we will learn several techniques for comparing large numbers of documents. All of these techniques rely on the vector space model of documents, which sees each text as a vector of values corresponding to the words in that text. We will learn how to manipulate word vectors, how to preprocess them for analysis, and how to visualize their distances from one another using a variety of clustering algorithms. Finally, we will explore one approach for identifying words that distinguish between two groups of texts. This workbook assumes that you’ve already completed Parts 1 and 2 and builds on the programming skills you developed there.

## Loading the Data

> To begin, we will load the Meiroku Zasshi data as we did in Part 2 of this tutorial. This means grabbing the texts and their associated metadata and producing a document-term-matrix (DTM) with raw counts of every word across all 155 texts.

In [None]:
def text_frequency(text):
    counts = Counter({word:0 for word in Meiroku_frequency_df.word})
    counts.update(text.split())
    return counts

In [None]:
meiroku_zasshi_url = 'http://history.emory.edu/RAVINA/JF_text_mining/Guides/data/meiroku_zasshi.txt'
Meiroku_df = pd.read_csv(meiroku_zasshi_url, sep=' ')

# Normalize before counting so the vocabulary and the per-text counts use the same word forms.
Meiroku_df['text'] = Meiroku_df.text.map(lambda x: unicodedata.normalize('NFKC', x))
complete_meiroku = ' '.join(Meiroku_df.text)
complete_meiroku_split = complete_meiroku.split()
meiroku_unique_words = set(complete_meiroku_split)
all_words = complete_meiroku_split

# Corpus-wide frequency table, most frequent word first.
counts = Counter(all_words)
Meiroku_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
Meiroku_frequency_df.columns = ['word', 'count']
Meiroku_frequency_df = Meiroku_frequency_df.sort_values(by='count', ascending=False)
Meiroku_frequency_df['term index'] = list(range(1,len(Meiroku_frequency_df)+1))

# One row per text, one column per word, with columns ordered by corpus frequency.
Meiroku_df['word_counts'] = Meiroku_df.text.map(text_frequency)
dtm_df = pd.DataFrame(list(Meiroku_df.word_counts.values))
dtm_df = dtm_df[Meiroku_frequency_df.word]

In [None]:
dtm_df.loc[0, ['洋字','を','以','て','國語','書する','の','論','西']]

## Word Vectors

> Each row in our document-term-matrix can be treated as a word vector. A vector is simply a sequence of numbers wherein each number represents the value of some variable (or attribute) associated with a particular data point. In this case, the variable is the raw frequency of a particular word. Seen this way, our document-term-matrix is a collection of 155 vectors and 15,603 variables (i.e., the number of unique words in the corpus). What’s great about these vector representations of each text is that we now have a way to compare many texts via mathematical operations. For each text is now an arrow pointing out into a high dimensional space (a space with 15,603 dimensions!) and there are easy ways to compare the distances between arrows based on where they point in this space. This is known as the vector space model.
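
To make the row-as-vector idea concrete, here is a minimal sketch that builds a tiny document-term matrix. The two pre-tokenized texts and the names `toy_texts`, `toy_dtm`, and `first_vector` are made up for illustration and are not part of the Meiroku Zasshi data.

```python
from collections import Counter

import pandas as pd

# Two hypothetical pre-tokenized texts (tokens separated by spaces).
toy_texts = ["を の を", "の 論"]

# One row per text, one column per unique word; words absent from a text count as 0.
toy_dtm = pd.DataFrame([Counter(t.split()) for t in toy_texts]).fillna(0).astype(int)

# Each row is the word vector for one text.
first_vector = toy_dtm.loc[0].to_numpy()

print(toy_dtm)
print(toy_dtm.shape)   # (number of texts, number of unique words)
print(first_vector)
```

With the full corpus the same structure simply has many more rows and columns.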

> Within the vector space model, there are multiple ways of manipulating the vectors in order to compare them. We will look at the following methods in this part of the tutorial:
* Euclidean Distance
* Cosine Similarity
* K-Means Clustering

> The first two are closely related and rely on the geometrical properties inherent to vectors. If we think of these vectors as pointing out into n-dimensional space, then Euclidean Distance and Cosine Similarity are two different ways of measuring how close any two vectors are in this space. They differ, however, in that Euclidean is more sensitive to length than Cosine because the former measures distance between the endpoints of the vectors. Thus one must be careful to normalize one’s vectors before using Euclidean distance, otherwise the differences between texts may only reflect their size (e.g., the more words in a text, the further its vector will point).
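
The sensitivity of Euclidean distance to text length can be sketched with two invented vectors that have identical word proportions but different lengths. The vectors and the helper `euclidean_distance` are assumptions made for this illustration only.

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance between the endpoints of two vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Hypothetical raw word counts: long_doc repeats short_doc's words ten times over.
short_doc = np.array([2.0, 1.0, 1.0])
long_doc = short_doc * 10

# On raw counts the distance mostly reflects the difference in length.
raw_distance = euclidean_distance(short_doc, long_doc)

# After dividing by text length (relative frequencies) the two texts coincide.
short_rel = short_doc / short_doc.sum()
long_rel = long_doc / long_doc.sum()
normalized_distance = euclidean_distance(short_rel, long_rel)

print(raw_distance)
print(normalized_distance)
```

The raw distance is large even though the two texts use words in exactly the same proportions, which is why the vectors are normalized before Euclidean distance is applied.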

> Cosine similarity, in contrast, measures the cosine of the angle between two vectors and is thus not susceptible to differences in length. It is a similarity measure, not a distance measure, and for non-negative vectors such as word counts it ranges from 0 to 1. The closer to 1 it is, the more similar two word vectors are. To express it as a distance, we subtract the cosine similarity from 1.
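
A minimal sketch of cosine similarity and the corresponding distance, using NumPy only. The three vectors and the helper `cosine_similarity` are invented for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word-count vectors.
doc_a = np.array([2.0, 1.0, 0.0])
doc_b = doc_a * 10                  # same proportions, ten times longer
doc_c = np.array([0.0, 0.0, 3.0])   # shares no words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)   # vectors point the same way
sim_ac = cosine_similarity(doc_a, doc_c)   # vectors are orthogonal

# Cosine distance is one minus the similarity.
dist_ab = 1 - sim_ab
dist_ac = 1 - sim_ac

print(sim_ab, dist_ab)
print(sim_ac, dist_ac)
```

Because only the angle matters, scaling `doc_a` by ten leaves its similarity to the original essentially at 1, while texts with no words in common have similarity 0.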

> This brings us to an important point about normalizing vectors. We want to be sure that we’re comparing texts on the same scale. We did this previously when we normalized our word counts by document length, producing relative frequencies.

In [None]:
# Relative frequencies: each count as a percentage of its text's total word count.
dtm_norm_matrix = dtm_df.div(dtm_df.sum(axis=1), axis=0)*100

In [None]:
# Ad hoc scaling constant (about 4.204) reused in the next two cells.
scale = 0.166985/0.03972082
scale

In [None]:
dtm_norm_matrix.loc[0,['洋字','を','以','て','國語','書する','の','論','西']]/scale

In [None]:
dtm_norm_matrix = dtm_norm_matrix/scale

> But there are other ways to normalize our texts depending on the kind of information we want to extract from them. One problem with relative frequencies is that they give a lot of weight to high-frequency terms (e.g., the particle を). Many high-frequency words, however, are not distinguishing features between texts. If lots of documents use the particle を, then it doesn’t tell us a great deal about differences between documents. One method for down-weighting such terms is the term frequency-inverse document frequency (tf-idf) method. This method re-weights words according to how often they occur across a corpus, giving less weight to terms that appear in many documents and more weight to terms that appear in only a few documents. Here’s the formula:

Written to match the tf-idf code cell below:

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \log_{10}\left(\frac{N}{\text{df}_t}\right)$$

where $\text{tf}_{t,d}$ is the relative frequency of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}_t$ is the number of documents that contain $t$.

> Because the document frequency sits in the denominator inside the logarithm, rarer terms are given a higher weight. Here’s code to reproduce our DTM with tf-idf weights instead of relative frequencies. We should see that the values for some of the common, high-frequency words are now lower than the values for rarer, less frequent words.

In [None]:
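tf = dtm_df.copy()  # raw counts per text
# Term frequency: each count divided by the total word count of its text.
dtm_norm_matrix = tf.div(tf.sum(axis=1), axis=0)
# Inverse document frequency: log10(number of texts / number of texts containing the word).
idf = np.log10(len(tf)/tf.astype(bool).sum(axis=0))
tfidf = dtm_norm_matrix*idf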

In [None]:
tfidf.loc[0,['洋字','を','以','て','國語','書する','の','論','西']]/4.20396

### Toy TF-IDF Example

In [None]:
docA = "The cat sat on my face"
docB = "The dog sat on my bed"

In [None]:
def text_frequency2(text, vocabulary=None):
    # Fall back to the toy vocabulary so existing one-argument calls keep working.
    if vocabulary is None:
        vocabulary = doc_frequency_df.word
    counts = Counter({word:0 for word in vocabulary})
    counts.update(text.split())
    return counts

In [None]:
complete_doc = ' '.join([docA, docB])
complete_doc_split = complete_doc.split()
doc_unique_words = set(complete_doc_split)
all_words = complete_doc.split()

counts = Counter(all_words)
doc_frequency_df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
doc_frequency_df.columns = ['word', 'count']
doc_frequency_df = doc_frequency_df.sort_values(by='count', ascending=False)
doc_frequency_df['term index'] = list(range(1,len(doc_frequency_df)+1))

doc_df = pd.DataFrame([docA, docB], columns=['text'])
doc_df['word_counts'] = doc_df.text.map(text_frequency2)
# Note: this overwrites the Meiroku dtm_df built earlier in the notebook.
dtm_df = pd.DataFrame(list(doc_df.word_counts.values))

In [None]:
dtm_df

In [None]:
dtm_norm_matrix = dtm_df.apply(lambda x: x/x.sum(), axis=1)
dtm_norm_matrix

In [None]:
dtm_norm_matrix.sum(axis=1)

The document frequency is how many documents each term shows up in. It should be a number between 1 and the number of documents.

In [None]:
dtm_df.astype(bool).sum(axis=0)

The inverse document frequency, `idf`, normalizes the document frequency by dividing it by the total number of documents, inverts that ratio, and then takes the logarithm.

In [None]:
idf = np.log10(len(dtm_df)/dtm_df.astype(bool).sum(axis=0))

In [None]:
idf

In [None]:
dtm_norm_matrix*idf
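
As a cross-check on the toy example, the following self-contained sketch recomputes the same weights from scratch. The names `toy_counts`, `toy_tf`, `toy_df`, `toy_idf`, and `toy_tfidf` are introduced here for illustration so that they do not collide with the variables above.

```python
from collections import Counter

import numpy as np
import pandas as pd

docs = ["The cat sat on my face", "The dog sat on my bed"]

# Raw counts: one row per document, one column per word.
toy_counts = pd.DataFrame([Counter(d.split()) for d in docs]).fillna(0)

# Term frequency: counts divided by document length.
toy_tf = toy_counts.div(toy_counts.sum(axis=1), axis=0)

# Document frequency and its log-scaled inverse.
toy_df = toy_counts.astype(bool).sum(axis=0)
toy_idf = np.log10(len(toy_counts) / toy_df)

toy_tfidf = toy_tf * toy_idf
print(toy_tfidf.round(4))
```

Words that appear in both documents ("The", "sat", "on", "my") have an `idf` of log10(2/2) = 0, so their tf-idf weight vanishes. Only the words unique to one document ("cat", "face", "dog", "bed") keep a non-zero weight, equal to (1/6) × log10(2).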