# Tokenizing Text
This utility takes a folder of text files as input, tokenizes each file, and writes the results to a new folder, with spaces inserted between the words. It is designed to be as simple as possible: all you need to specify is the language of your texts and the name of the folder you want to tokenize.

## Setting variables
You need to set three variables to make sure the code runs properly.

### Language
Below, inside the quotation marks, specify the language you are working with (this script is compatible with Chinese, Japanese, and Korean).

In [19]:
language = "Chinese"
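
As a minimal sketch of how this setting could be checked before any tokenizer is loaded, the snippet below normalizes the value and rejects anything unsupported. The names `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical helpers introduced for illustration; they are not part of the script.

```python
# Hypothetical helper, not part of the original script.
SUPPORTED_LANGUAGES = {"chinese", "japanese", "korean"}


def normalize_language(name):
    """Return the lowercase language name, or raise if it is unsupported."""
    cleaned = name.strip().lower()
    if cleaned not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {name!r}; expected Chinese, Japanese, or Korean"
        )
    return cleaned


print(normalize_language("Chinese"))  # prints: chinese
```

Checking the value up front means a typo in the language name is reported immediately rather than after libraries have been imported.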

### Corpus folder
Inside the quotation marks below, specify the name of the folder containing the files you want tokenized.

In [20]:
corpus_dir = "chinese_corpus"

### Output folder
You can set the name of the folder where your results will be saved. By default, I have set it to "tokenized_" plus the name of your corpus directory. Note that existing files in this folder may be overwritten when you run the script.

In [21]:
output_dir = f"tokenized_{corpus_dir}"
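
Because results are written into this folder, it may be worth checking beforehand which files already there share a name with a corpus file. The sketch below is one way to do that; `files_that_would_be_overwritten` is a hypothetical helper, and it assumes that output files keep the names of the input files.

```python
import os


def files_that_would_be_overwritten(corpus_dir, output_dir):
    """Return the sorted names of corpus files that already exist in output_dir."""
    # Collect every file name found anywhere under the corpus folder.
    corpus_names = set()
    for _root, _dirs, files in os.walk(corpus_dir):
        corpus_names.update(files)

    # Nothing can be overwritten if the output folder does not exist yet.
    if not os.path.isdir(output_dir):
        return []

    existing = set(os.listdir(output_dir))
    return sorted(corpus_names & existing)
```

Calling `files_that_would_be_overwritten(corpus_dir, output_dir)` returns an empty list when it is safe to run the script without losing earlier results.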

## The Code Itself
That's it! Just run the code block below and it will tokenize your files for you. Note that it will try to install a tokenization library if your system does not already have it installed.

In [22]:
import os

# make language lowercase for ease of entry
language = language.lower()

# depending on the language, import a tokenizer
if language == "chinese":
    
    # try to import the library. install it if cannot find module
    try:
        import jieba
    except ModuleNotFoundError:
        import sys
        !{sys.executable} -m pip install jieba
        import jieba
        
elif language == "japanese":
    try:
        from fugashi import Tagger
    except ModuleNotFoundError:
        import sys
        !{sys.executable} -m pip install fugashi
        !{sys.executable} -m pip install unidic-lite
        from fugashi import Tagger
        
    tagger = Tagger('-Owakati')
    
elif language == "korean":
    try:
        from konlpy.tag import Kkma
    except ModuleNotFoundError:
        import sys
        !{sys.executable} -m pip install konlpy
        from konlpy.tag import Kkma
        
    kkma = Kkma()
    
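else:
    # the language was lowercased above, so capitalization does not matter here;
    # raise an error instead of calling exit(), which can shut down the notebook kernel
    raise ValueError("please set language to Chinese, Japanese, or Korean")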


# check if the output directory exists. If it does not, make it
if not os.path.isdir(output_dir):
    os.mkdir(output_dir)


# iterate through all files in the corpus directory
for root, dirs, files in os.walk(corpus_dir):
    for fname in files:
        
        # read file:
        with open(os.path.join(root, fname), 'r', encoding='utf8') as rf:
            text = rf.read()
            
        # tokenize the document and join by spaces if need be
        if language == "chinese":
            res = jieba.cut(text)
            res = " ".join(res)
        elif language == "japanese":
            res = ' '.join([w.surface for w in tagger(text)])
            
        elif language == "korean":
            res = kkma.morphs(text)
            res = " ".join(res)
        
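        # write the tokenized text to the output directory under the same file name
        # (files from subfolders all land directly in output_dir, so files that
        # share a name will overwrite one another)
        with open(os.path.join(output_dir, fname), 'w', encoding='utf8') as wf:
            wf.write(res)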
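
The loop above writes every result directly into the output folder, so two files with the same name in different subfolders would collide. As a rough sketch of an alternative (not the script's own behavior), the function below mirrors the corpus folder layout in the output folder. The name `tokenize_folder` and its `tokenize` parameter are hypothetical: `tokenize` is assumed to be any callable that takes a string and returns an iterable of tokens.

```python
import os


def tokenize_folder(corpus_dir, output_dir, tokenize):
    """Tokenize every file under corpus_dir, mirroring its layout in output_dir.

    Returns the list of output paths that were written.
    """
    written = []
    for root, _dirs, files in os.walk(corpus_dir):
        # Recreate the current subfolder (relative to corpus_dir) in output_dir.
        rel_root = os.path.relpath(root, corpus_dir)
        target_root = os.path.normpath(os.path.join(output_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)

        for fname in files:
            with open(os.path.join(root, fname), "r", encoding="utf8") as rf:
                text = rf.read()

            # Join the tokens with single spaces, as in the loop above.
            result = " ".join(tokenize(text))

            out_path = os.path.join(target_root, fname)
            with open(out_path, "w", encoding="utf8") as wf:
                wf.write(result)
            written.append(out_path)
    return written
```

With jieba imported, a call such as `tokenize_folder(corpus_dir, output_dir, jieba.cut)` would cover the Chinese case. Passing the tokenizer in as an argument also makes it possible to try the function with a stand-in such as `list`, which simply splits the text into individual characters.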