# Japanese text segmenter

version 1.0

For Japanese text to be usable for various language-agnostic digital humanities tools and methodologies (e.g. word counts, topic modeling, or word vectors), it needs to be *segmented* -- i.e. spaces need to be artificially inserted.

[MeCab](https://taku910.github.io/mecab/) is widely used as a tool for Japanese segmentation and part-of-speech tagging, but if you run into problems installing or using MeCab, you can try this Jupyter notebook, which uses [RakutenMA Python](https://github.com/ikegami-yukino/rakutenma-python) as its segmenter. If you want to see how well RakutenMA works for your text before using the notebook, you can [use the web-based interface for RakutenMA here](http://rakuten-nlp.github.io/rakutenma/).

## Suggested citation
If you use this notebook as part of your project workflow, you can cite it with something to the effect of:

Dombrowski, Quinn. *Japanese text segmenter* Jupyter notebook. https://github.com/quinnanya/japanese-segmenter. 2019.

## Text preparation
This notebook has two versions of the segmenter code: one that converts a single file, and one that converts everything in a folder.

Before using this notebook, you should make sure your texts are saved as .txt files with UTF-8 encoding (**not** Shift-JIS, which many Japanese websites use).

If you're not sure whether your text file uses UTF-8 encoding, you can open it with the free cross-platform [Atom](https://atom.io/) text editor. In the bottom right corner, it will show your file encoding. If you save the text file in Atom, it will convert it to UTF-8.
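
If you would rather check and convert the encoding from Python, the sketch below is one way to do it. It tries to read the file as UTF-8, and only if that fails does it re-read the file with a fallback encoding and overwrite it as UTF-8. The function name `ensure_utf8` and the `fallback_encoding` parameter are made up for this example, and the fallback is an assumption rather than a detection: if your file is in some other encoding (for instance `cp932` or `euc_jp`), pass that name instead. Keep a copy of the original file before trying it.

```python
from pathlib import Path


def ensure_utf8(path, fallback_encoding="shift_jis"):
    """Rewrite the file at `path` as UTF-8 if it is not already valid UTF-8.

    Returns the name of the encoding the file was successfully read with.
    """
    raw = Path(path).read_bytes()
    try:
        # If this succeeds, the file is already valid UTF-8 and is left alone
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: assume the fallback encoding (an assumption, not a detection)
        text = raw.decode(fallback_encoding)
        # Overwrite the file with the same text encoded as UTF-8
        Path(path).write_bytes(text.encode("utf-8"))
        return fallback_encoding


# Example use (replace with the path to your own file):
# ensure_utf8('泉鏡花_海異記.txt')
```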

## Install segmenter module
The Japanese segmenter Python module isn't installed automatically with Anaconda (if you're using that to run Jupyter notebooks). It also isn't available through conda. Instead, the code block below installs it in your Anaconda directory so the notebook can use it.

You only have to run the code block below the first time you use this notebook. If you run it again, you'll just get a message saying "Requirement already satisfied".

In [11]:
!pip install rakutenma
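
If you are not sure whether you have already run the install step in this environment, a small check like the one below can tell you before you run it again. This is only a sketch: the helper name `is_installed` is made up for this example, and it checks whether Python can find the module, not whether the module works correctly.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Prints True if the segmenter module is available, False otherwise
print("rakutenma installed:", is_installed("rakutenma"))
```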



## Import modules
The code block below imports a few modules that are used by the code blocks later in this notebook.

In [12]:
# os is used for things like changing directories and listing files
import os

# io is used for opening and writing files
import io

# itertools is not needed by the current code, so its import is commented out
#from itertools import chain

# glob is used to find all the pathnames matching a specified pattern
# (here, all text files)
import glob

# rakutenma is the Japanese segmenter
from rakutenma import RakutenMA

## Define the source directory
Within the single quotes, put the full path to the folder that contains the .txt file(s) you want to segment. If you want to segment multiple files, it's easiest if you put them in a folder that only contains those files. If you want to segment a single file, you can put it anywhere as long as you can get the full path to that directory.

For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

- On Mac: '/Users/YOUR-USER-NAME/Documents'
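- On Windows: r'C:\Users\YOUR-USER-NAME\Documents' (the r before the opening quote is needed so that Python does not treat the backslashes as escape characters)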
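
On Windows, backslashes inside an ordinary quoted string can be read by Python as the start of an escape sequence (`\U` in particular causes an error), so it is safer to write the path as a raw string or with forward slashes. The sketch below shows both spellings and uses `PureWindowsPath` from the standard library only to confirm that they point to the same place; the user name is a placeholder.

```python
from pathlib import PureWindowsPath

# A raw string (note the r before the quote) keeps the backslashes literal
raw_path = r'C:\Users\YOUR-USER-NAME\Documents'

# Forward slashes are also accepted by Python on Windows
slash_path = 'C:/Users/YOUR-USER-NAME/Documents'

# Both spellings refer to the same location
same = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
print(same)
```

Either form can be used as the value of *sourcefiledirectory*.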

In [None]:
# This is the full path to the directory
# where you have the plain text file(s)
sourcefiledirectory = '/Users/qad/Documents/jpnlp'

# Changing the directory to where you've stored the source texts,
# so you can open them in later code blocks
os.chdir(sourcefiledirectory)

## Load RakutenMA segmenter code & model
The code block below loads the default model for Japanese and a required hash function used to map the data.

In [13]:
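# Create the segmenter; phi and c are hyperparameters for SCW (for demonstration purposes)
rma = RakutenMA(phi=1024, c=0.007812)
# Load the pre-trained Japanese model
rma.load("model_ja.json")
# Set the feature hash function (15-bit)
rma.hash_func = rma.create_hash_func(15)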
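
The segmenting code later in this notebook assumes that the tokenizer returns a list of items in which the word itself is at index 0 and the linguistic annotation follows it. The sketch below shows how such a list turns into spaced text. It does not call RakutenMA: `sample_tokens` is a hand-written list in that shape (the tags are placeholders, not real output), and `join_words` is a made-up helper name.

```python
def join_words(tokens, separator=" "):
    """Join the first element of each token into one spaced string."""
    return separator.join(token[0] for token in tokens)


# Hand-written sample in the assumed shape (not real RakutenMA output)
sample_tokens = [["彼", "N-pn"], ["は", "P-k"], ["学生", "N-nc"]]

print(join_words(sample_tokens))
```

Unlike the loops later in this notebook, which write a space after every word, joining this way leaves no trailing space at the end of the text.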

## Segmenting a single file
Use the code block below if you have one single file that you need to segment. Put the filename within the single quotes after *filename*, replacing *泉鏡花_海異記.txt*.

Make sure that you've indicated the path to the directory where the file is located for *sourcefiledirectory* in the code block above, and run that code block first.

This code block will create a new plain text file with *_segmented* appended to the end of the source filename.

If you want to see the output within the Jupyter notebook, you can delete the # character before the word *print* in the last line before you run the code block.

In [33]:
# Define your input filename here
filename = '泉鏡花_海異記.txt'
# The name of the output file appends _segmented to the end of the source file
outname = filename.replace('.txt', '_segmented.txt')
# Open the input file as UTF-8 and read it; the with block closes the file afterwards
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()
# Create and open the output file, also as UTF-8
with open(outname, 'w', encoding='utf-8') as out:
    # Use the rma.tokenize function from RakutenMA to create a list of segmented words
    segmentedwords = rma.tokenize(text)
    # For each word in the segmented words list...
    for segmentedword in segmentedwords:
        # Grab just the segmented word, not the linguistic annotations that RakutenMA creates
        word = segmentedword[0]
        # Write the word to the output file
        out.write(word)
        # Put a space between each word
        out.write(" ")
        # Delete the # character below to also print out the words within the Jupyter notebook
        #print(word, end=" ")
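
The output filename above is built by replacing *.txt* with *_segmented.txt*, which would also change any other *.txt* that happens to appear in the middle of a filename. If that matters for your files, one alternative is to split off just the final extension, as in the sketch below. The helper name `segmented_name` and its `suffix` parameter are made up for this example.

```python
import os


def segmented_name(filename, suffix="_segmented"):
    """Insert `suffix` before the final extension of `filename`."""
    # splitext separates only the last extension, e.g. ('name', '.txt')
    root, ext = os.path.splitext(filename)
    return root + suffix + ext


print(segmented_name('泉鏡花_海異記.txt'))
```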

## Segmenting multiple files
If you just need to segment a single .txt file, you don't need the following code block.

If you want to segment multiple .txt files, make sure they're all in the directory that you specified at the top of this notebook -- and make sure there aren't other text files that you *don't* want segmented in the same directory.

Running the code block below will generate an output file (with *_segmented* appended to the filename) for each .txt file in the directory.

If you want to have the output displayed in this Jupyter notebook as well, you can delete the # character before *print* in the last line.

In [35]:
# Check the source file directory you indicated in a code block above for files
for filename in os.listdir(sourcefiledirectory):
    # Only use files that end in .txt but not in _segmented.txt (i.e. previous output files)
    if filename.endswith(".txt") and not filename.endswith("_segmented.txt"):
        # One at a time, open the .txt files as UTF-8 and read the contents
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
        # The name of the output file appends _segmented to the end of the source file
        outname = filename.replace('.txt', '_segmented.txt')
        # Create and open the output file, also as UTF-8
        with open(outname, 'w', encoding='utf-8') as out:
            # Use the rma.tokenize function from RakutenMA to create a list of segmented words
            segmentedwords = rma.tokenize(text)
            # For each word in the segmented words list...
            for segmentedword in segmentedwords:
                # Grab just the segmented word, not the linguistic annotations that RakutenMA creates
                word = segmentedword[0]
                # Write the word to the output file
                out.write(word)
                # Put a space between each word
                out.write(" ")
                # Delete the # character below to also print out the words within the Jupyter notebook
                #print(word, end=" ")
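
Before running the loop on a whole folder, it can help to list which files it would pick up. The sketch below does that with the `glob` module, skipping earlier *_segmented.txt* output files in the same way. The function name `files_to_segment` is made up for this example, and unlike the loop above it returns full paths, not bare filenames.

```python
import glob
import os


def files_to_segment(directory):
    """Return a sorted list of .txt paths in `directory`, skipping earlier output files."""
    pattern = os.path.join(directory, "*.txt")
    return sorted(
        path for path in glob.glob(pattern)
        if not path.endswith("_segmented.txt")
    )


# Example use (replace with your own source directory):
# for path in files_to_segment('/Users/YOUR-USER-NAME/Documents'):
#     print(path)
```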