# Tokenize JSON Documents

This notebook generates token counts for each of your documents and saves them to a `bag_of_words` field in each document. This can speed up processing for downstream tasks.
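As a rough illustration of what a `bag_of_words` field might hold, the sketch below counts tokens and attaches the counts to a document dictionary. The `add_bag_of_words` helper and the `name` and `content` keys are hypothetical names chosen for this example; they are not taken from the WE1S scripts, which may store the counts differently.

```python
import json
from collections import Counter

def add_bag_of_words(doc, tokens):
    """Return a copy of doc with token counts stored under 'bag_of_words'."""
    # Hypothetical helper: the actual tokenizer script may work differently.
    updated = dict(doc)
    updated["bag_of_words"] = dict(Counter(tokens))
    return updated

# Assumed document shape; the 'content' key is an illustration only.
doc = {"name": "example", "content": "the cat saw the dog"}
updated = add_bag_of_words(doc, doc["content"].split())
print(json.dumps(updated["bag_of_words"], sort_keys=True))
```

With the counts stored this way, a downstream task can read term frequencies directly instead of tokenizing the text again.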

### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.0'  
__email__     = 'scott.kleinman@csun.edu'

## Configuration

If your data already has a `features` table and you would like a `bag_of_words` field to be generated, set `bagify_features=True`. For more information on features tables, see the <a href="README.md" target="_blank">README</a> file.

The default tokenization method strips all non-alphanumeric characters and splits the text into tokens on white space. If you would like to use the WE1S tokenizer instead, set `method='we1s'`. Note that this method takes longer. The WE1S tokenizer leverages <a href="https://spacy.io/" target="_blank">spaCy</a> and its language models. The default language model is `en_core_web_sm`. You can change it, but you will first have to download the other model into your environment. See the <a href="README.md" target="_blank">README</a> file for instructions.
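The default method described above can be approximated in a few lines. This is a minimal sketch, not the code in `scripts/import_tokenizer.py`: the function name `default_tokenize` is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that case is left unchanged.

```python
import re

def default_tokenize(text):
    """Strip non-alphanumeric characters, then split on white space."""
    # Assumption: keep only ASCII letters, digits, and white space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # str.split() with no arguments splits on any run of white space.
    return cleaned.split()

tokens = default_tokenize("Hello, world! It's 2020.")
print(tokens)
```

Because punctuation is removed rather than treated as a boundary, a contraction such as "It's" collapses into the single token "Its". A spaCy-based tokenizer handles such cases differently, which is one reason it takes longer.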

Errors will be logged to the path you set for the `log_file`.

In [None]:
bagify_features  = True
method           = 'default'
language_model   = 'en_core_web_sm'
log_file         = 'tokenizer_log.txt'

In [None]:
# Python imports
from pathlib import Path
from IPython.display import display, HTML

# Get path to project_dir
current_dir            = %pwd
project_dir            = str(Path(current_dir).parent.parent)
json_dir               = project_dir + '/project_data/json'
tokenizer_script_path  = 'scripts/import_tokenizer.py'

# Run the tokenization script, passing the values set in the Configuration cell
%run {tokenizer_script_path}
tokenizer = ImportTokenizer(json_dir, language_model=language_model,
                            log_file=log_file)
tokenizer.start(bagify_features=bagify_features, method=method)