# Export JSON to TXT + CSV

The WE1S workflows use JSON format internally for manipulating data. However, you may wish to export JSON data from a project to plain text files with a CSV metadata file for use with other external tools.

This notebook uses JSON project data to export a collection of plain text files, one per JSON document, each containing only the document's content field or bag of words. Each file is named after its JSON document, with a `.txt` extension.

It also produces a `metadata.csv` file. This file contains a header row and one row per document, giving the document filename plus the metadata fields selected for export (by default `pub_date`, `title`, and `author`).
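
To make the output layout concrete, here is a minimal sketch of an export of this kind. It is not the code in `scripts/json_to_txt_csv.py`; the function name `export_docs` and the `filename` column header are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def export_docs(json_dir, txt_dir, metafile, content_fields, csv_fields):
    """Write one .txt file per JSON document plus a metadata CSV (illustrative only)."""
    txt_dir = Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    exported = 0
    with open(metafile, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Header row; the 'filename' column name is an assumption
        writer.writerow(['filename'] + list(csv_fields))
        for json_path in sorted(Path(json_dir).glob('*.json')):
            doc = json.loads(json_path.read_text(encoding='utf-8'))
            # Use the first listed content field that the document has
            field = next((k for k in content_fields if k in doc), None)
            if field is None:
                continue  # no exportable content, so skip the document
            txt_name = json_path.stem + '.txt'
            (txt_dir / txt_name).write_text(str(doc[field]), encoding='utf-8')
            # Missing metadata fields are written as empty cells
            writer.writerow([txt_name] + [doc.get(k, '') for k in csv_fields])
            exported += 1
    return exported
```

The function returns the number of documents written, which makes it easy to compare against the number of JSON files in the source folder.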

Output from this notebook can be imported using the import module by copying the `txt.zip` and `metadata.csv` from `project_data/txt` to `project_data/import`. However, it is generally not recommended to export and then reimport data, as you may lose metadata in the process.
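
If you do need to reimport, the copy step described above could be scripted roughly as follows. This is a sketch only: `stage_for_import` is a hypothetical helper, not part of the WE1S modules, and it assumes the default `project_data/txt` and `project_data/import` locations.

```python
import shutil
from pathlib import Path


def stage_for_import(project_dir):
    """Copy txt.zip and metadata.csv from project_data/txt to project_data/import."""
    src = Path(project_dir) / 'project_data' / 'txt'
    dst = Path(project_dir) / 'project_data' / 'import'
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ('txt.zip', 'metadata.csv'):
        if (src / name).exists():
            shutil.copy2(src / name, dst / name)  # copy2 also preserves timestamps
            copied.append(name)
    return copied  # names of the files that were actually copied
```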


## Info

__authors__    = 'Jeremy Douglass, Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.6'  
__email__     = 'jeremydouglass@gmail.com'


## Setup

This cell imports Python modules and defines the file paths used by the notebook.

In [None]:
# Python imports
from pathlib import Path
from IPython.display import display, HTML

# Get path to project_dir
current_dir            = %pwd
project_dir            = str(Path(current_dir).parent.parent)
json_dir               = project_dir + '/project_data/json'
config_path            = project_dir + '/config/config.py'
export_script_path     = 'scripts/json_to_txt_csv.py'
# Import the project configuration and classes
%run {config_path}
%run {export_script_path}
display(HTML('Ready!'))

## Configuration

The default configuration assumes:

1. There are JSON files in `project_data/json`.
2. Each JSON file has the required fields `pub_date`, `title`, and `author`.
3. Each JSON file has either:
   - a `content` field, or
   - a `bag_of_words` field created using the `import` module tokenizer (see the "Export Features Tables" section below to export text from the `features` field).

By default, the notebook will export to `project_data/txt`.
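
Before exporting, it can help to check your JSON files against the assumptions listed above. The following is a minimal sketch; `check_json_dir` is a hypothetical helper that is not provided by the notebook's scripts.

```python
import json
from pathlib import Path


def check_json_dir(json_dir, required_fields, content_fields):
    """Return {filename: [problems]} for JSON files that break the assumptions."""
    problems = {}
    for json_path in sorted(Path(json_dir).glob('*.json')):
        doc = json.loads(json_path.read_text(encoding='utf-8'))
        # Report each required metadata field that is absent
        issues = ['missing ' + k for k in required_fields if k not in doc]
        # Report documents with none of the accepted content fields
        if not any(k in doc for k in content_fields):
            issues.append('no content field')
        if issues:
            problems[json_path.name] = issues
    return problems
```

For the default configuration, a call such as `check_json_dir(json_dir, ['pub_date', 'title', 'author'], ['content', 'bag_of_words'])` returns an empty dictionary when every file meets the assumptions.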

In [None]:
limit = 10  # Maximum number of files to export (0 = unlimited)

txt_dir  = project_dir + '/project_data/txt'
metafile = project_dir + '/project_data/txt/metadata.csv'
zipfile  = project_dir + '/project_data/txt/txt.zip'

# The listed fields will be checked in order.
# The first one encountered will be the export content.
# Documents with no listed field will be excluded from export.
txt_content_fields = ['content', 'bag_of_words']

# The listed fields will be copied from json to metadata.csv columns
csv_export_fields = ['pub_date', 'title', 'author']

# Set to True to zip the exported text files and remove the originals
zip_output = True

# Delete any previous export contents in the `txt` directory, including the `metadata.csv` file and the zip file
clear_cache = True

## Export

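Run the cell below to start the export. It optionally clears any previous output, exports the text files and `metadata.csv`, reports the results, and optionally zips the text files.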

In [None]:
# Optionally, clear the cache
if clear_cache:
    clear_txt(txt_dir, metafile=metafile, zipfile=zipfile)
    
# Perform the export
json_to_txt_csv(json_dir=json_dir,
                txt_dir=txt_dir,
                txt_content_fields=txt_content_fields,
                csv_export_fields=csv_export_fields,
                metafile=metafile,
                limit=limit)

# Inspect results
report_results(txt_dir, metafile)

# Optionally, zip the output
if zip_output:
    zip_txt(txt_dir=txt_dir, zipfile=zipfile)    

## Export Features Tables

If your data contains features tables (lists of lists containing linguistic features), use the cell below to export them as CSV files, one for each document in your JSON folder. This may apply to you if you are using WE1S public data. Set `save_path` to the directory where you wish to save the CSV files.
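
As a rough illustration of what such an export involves, the sketch below writes one document's features table to a CSV file. It is not the implementation of `export_features_tables`; the helper name `features_to_csv` is hypothetical, and it assumes the table is stored as a list of lists under the `features` key.

```python
import csv
import json
from pathlib import Path


def features_to_csv(json_path, save_path):
    """Write a document's `features` list of lists to <save_path>/<name>.csv."""
    json_path = Path(json_path)
    doc = json.loads(json_path.read_text(encoding='utf-8'))
    rows = doc.get('features')  # assumed: a list of lists, one inner list per row
    if not rows:
        return None  # nothing to export for this document
    out_dir = Path(save_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (json_path.stem + '.csv')
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)  # rows are written exactly as stored
    return out_path
```

Documents without a features table are skipped and the function returns `None` for them.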

In [None]:
# Configuration: directory where the CSV files will be saved
save_path = ''

# Run the export
export_features_tables(save_path, json_dir)