# Modern Chinese segmenter

This Jupyter notebook takes a Chinese text file, or a folder of Chinese text files, and creates a set of segmented derivative files (in which all words are separated by spaces). These segmented files can then be used for searching or for other computational text analysis methods.

Please only use this for **modern** Chinese. Classical Chinese works better with per-character segmenting (putting a space in between each character).
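
For comparison, per-character segmenting needs no segmentation library at all. The cell below is a minimal sketch and not part of the original notebook: the function name `segment_per_character` is hypothetical, and it simply drops any existing whitespace and puts a single space between the remaining characters (punctuation marks are treated like any other character).

```python
def segment_per_character(text):
    """Return text with a single space between every non-whitespace character."""
    #Skips existing whitespace so the output never contains double spaces
    return " ".join(ch for ch in text if not ch.isspace())

#Example: a short Classical Chinese phrase
print(segment_per_character("學而時習之"))
```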

## First-time setup
The first code cell below installs the [*jieba* package](https://github.com/fxsjy/jieba), which does the actual segmenting. You only need to run it the first time you use this notebook in a particular environment (laptop, virtual machine, etc.). You can skip it the next time you use the notebook, but nothing bad will happen if you re-run it.

In [None]:
#Imports the sys module which you can use to perform system-level things 
#like installing new modules
import sys

#Installs jieba
!{sys.executable} -m pip install jieba

## Importing modules

The next code cell imports the modules you need to run this notebook. Run it every time you open this notebook. 

In [None]:
#os is used for navigating your filesystem and getting to the files you want to segment
import os

#jieba is the module that does the actual segmenting
import jieba

## Segmenting a single file
Put the full path to your text file in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents\\YOUR-TEXT-FILE.txt'

In [None]:
#Put the full path to your file between the single quotes here
filepath = '/Users/qad/Documents/chinese.txt'

#The outname is the name of the segmented file that this notebook creates
#If you want it to be named something other than the original file name + -segmented
#you can change that here
outname = filepath.replace('.txt', '-segmented.txt')
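
Note that `str.replace` substitutes every occurrence of `.txt` in the path, so a folder name that happens to contain `.txt` would be altered too. A hedged alternative, assuming you only want to change the end of the file name, is to split off the extension with `os.path.splitext`. The helper name `segmented_name` and its `suffix` parameter are hypothetical, not part of the original notebook.

```python
import os

def segmented_name(path, suffix='-segmented'):
    """Insert suffix between the file name and its final extension."""
    #Splits the path into everything before the final extension, and the extension itself
    root, ext = os.path.splitext(path)
    return root + suffix + ext

print(segmented_name('/Users/qad/Documents/chinese.txt'))
```

With this helper, `outname = segmented_name(filepath)` could stand in for the `replace` call above.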

In [None]:
#Opens the file you specified (assumes it is UTF-8 encoded; change the encoding if yours is not)
with open(filepath, 'r', encoding='utf-8') as f:
    #Creates an empty text file with -segmented appended to the name
    with open(outname, 'w', encoding='utf-8') as out:
        #Reads the text of the file you specified
        text = f.read()
        #Segments the text
        tokens = jieba.cut(text)
        #Joins the tokens with a space between each one
        segmented = " ".join(tokens)
        #Writes the segmented text to the file with -segmented appended to the name
        out.write(segmented)

## Segmenting a folder of text files
Put the full path to your folder of text files in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a folder called "chinese" in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/chinese'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents\\chinese'

In [None]:
#Put the full path to your folder between single quotes here
textfolder = '/Users/qad/Documents/chinese'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [None]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the already-segmented text files
    if filename.endswith('.txt') and not filename.endswith('-segmented.txt'):
        #The outname is the name of the segmented file that this notebook creates
        #If you want it to be named something other than the original file name + -segmented
        #you can change that here
        outname = filename.replace('.txt', '-segmented.txt')
        #Opens the current file (assumes it is UTF-8 encoded; change the encoding if yours is not)
        with open(filename, 'r', encoding='utf-8') as f:
            #Creates an empty text file with -segmented appended to the name
            with open(outname, 'w', encoding='utf-8') as out:
                #Reads the text of the current file
                text = f.read()
                #Segments the text
                tokens = jieba.cut(text)
                #Joins the tokens with a space between each one
                segmented = " ".join(tokens)
                #Writes the segmented text to the file with -segmented appended to the name
                out.write(segmented)
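
If you would rather not change the working directory, the same loop can be wrapped in a function that builds full paths with `os.path.join`. This is only a sketch under stated assumptions: the function name `segment_folder` and its parameters are hypothetical, the files are assumed to be UTF-8 encoded, and the segmenter is passed in as an argument so that any function returning an iterable of tokens will work.

```python
import os

def segment_folder(folder, segment, suffix='-segmented.txt'):
    """Write a segmented copy of every .txt file in folder; return the new file names."""
    written = []
    for filename in sorted(os.listdir(folder)):
        #Skips non-text files and files that are already segmented output
        if not filename.endswith('.txt') or filename.endswith(suffix):
            continue
        #Removes the trailing .txt and adds the suffix in its place
        outname = filename[:-len('.txt')] + suffix
        #Assumes the input files are UTF-8 encoded
        with open(os.path.join(folder, filename), 'r', encoding='utf-8') as f:
            text = f.read()
        #Joins the tokens with a space between each one and writes them out
        with open(os.path.join(folder, outname), 'w', encoding='utf-8') as out:
            out.write(" ".join(segment(text)))
        written.append(outname)
    return written
```

In this notebook it could be called as `segment_folder(textfolder, jieba.cut)`, which should produce the same files as the loop above without relying on `os.chdir`.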

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 