# Import Data

This notebook enables you to import data to a project directory or a MongoDB database. Valid data sources are:

- A zip archive of plain text files with an accompanying csv-formatted metadata file.
- A zip archive of json files.
- A zipped Frictionless Data data package containing json files.

Metadata fields can be mapped onto those required by the WhatEvery1Says Workspace.
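
The mapping can be pictured as copying values from your own field names to the keys that the mapping options (`title_field`, `author_field`, `pub_date_field`, `content_field`) target. The sketch below is only an illustration, not the import script's actual code: the `map_fields` helper and the source field names (`headline`, `byline`, `date`, `body`) are hypothetical.

```python
import json

# Keys targeted by the *_field mapping options described under Configuration.
TARGET_FIELDS = ("title", "author", "pub_date", "content")

def map_fields(doc, mapping):
    """Return a copy of doc with source fields copied to the target keys.

    mapping is a dict such as {"title": "headline"}; entries whose value
    is None, or whose source field is absent from doc, are skipped.
    """
    mapped = dict(doc)
    for target, source in mapping.items():
        if source is not None and source in doc:
            mapped[target] = doc[source]
    return mapped

# Hypothetical document whose field names differ from the target keys.
doc = {"headline": "A Title", "byline": "A. Writer",
       "date": "2020-01-15", "body": "Some text."}
mapping = {"title": "headline", "author": "byline",
           "pub_date": "date", "content": "body"}

mapped = map_fields(doc, mapping)
missing = [field for field in TARGET_FIELDS if field not in mapped]
print(json.dumps(mapped, indent=2))
print("Missing fields:", missing)
```

The original document is left unchanged, and the source fields are kept alongside the mapped keys; whether the real importer keeps or drops them is not something this sketch assumes.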

By default, the notebook will import to the project directory in which it is located.

It is also possible to import data directly from a MongoDB database to a project directory. For this functionality, see the **Import from MongoDB** cell below.

## Info

__authors__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'GPL'  
__version__   = '2.5'  
__email__     = 'scott.kleinman@csun.edu'



## Setup

In [None]:
# Python imports
from pathlib import Path
from IPython.display import display, HTML

# Get path to project_dir
current_dir            = %pwd
project_dir            = str(Path(current_dir).parent.parent)
json_dir               = project_dir + '/project_data/json'
config_path            = project_dir + '/config/config.py'
import_script_path     = 'scripts/import.py'
tokenizer_script_path  = 'scripts/import_tokenizer.py'

# Import the project configuration and classes
%run {config_path}
%run {import_script_path}
display_setup_message()

## Configuration

Configuration options are explained briefly below. For more information, please see this module's <a href="README.md" target="_blank">README</a> file.

- `zip_file`: The name of the zip archive containing your data. By default, the archive is called `import.zip`, but you can modify the filename. If the data is in plain text format, you must also prepare a `metadata.csv` file. Does not apply when importing from MongoDB (you can set it to `None`).
- `metadata_file`: The name of your metadata file if you are importing plain text data. By default, it is called `metadata.csv`, but you can change the name. <span style="color:red;">Important:</span> The metadata file must have `filename`, `pub_date`, `title`, and `author` as its first four headers. You can include additional metadata fields _after_ the `author` field. Does not apply when importing directly from JSON files or from MongoDB (you can set it to `None`).
- `remove_existing_json`: Empty the json folder before importing. The default is `False`, so you can add data over multiple runs.
- `delete_imports_dir`: If set to `True`, the folder containing your `zip_file` and `metadata.csv` file will be deleted when the import is complete. Does not apply when importing from MongoDB (you can set it to `None`).
- `delete_text_dir`: If set to `True`, the folder containing your imported plain text files will be deleted after they are converted to json format. Does not apply when importing directly from JSON files or from MongoDB (you can set it to `None`).
- `data_dirs`: If you are importing data already in json format, you can specify a list of paths in your zip archive or Frictionless Data data package where the json files are located. Does not apply when importing from MongoDB (you can set it to `None`).
- `title_field`: If you are importing data already in json format that does not contain a field named `title` you can map an existing field to this key by providing the name of the existing field here.
- `author_field`: If you are importing data already in json format that does not contain a field named `author` you can map an existing field to this key by providing the name of the existing field here.
- `pub_date_field`: If you are importing data already in json format that does not contain a field named `pub_date` you can map an existing field to this key by providing the name of the existing field here.
- `content_field`: If you are importing data already in json format that does not contain a field named `content` you can map an existing field to this key by providing the name of the existing field here.
- `dedupe`: If set to `True`, the script will check for duplicate files within the project that may have been created by importing data from multiple zip archives. Duplicate files will be given the extension `.dupe`. This option also changes the extension of json files containing empty `content` fields to `.empty`. <span style="color:red;">Warning:</span> For very large projects (~100,000 or more documents), duplicate detection may take up to several hours to run and, depending on other traffic on the server, may cause a server error.
- `random_sample`: If you wish to import a random sample of the data in your `zip_file`, specify the number of documents you wish to import.
- `random_seed`: Specify a number to initialize the random sampling. This ensures reproducibility if you have to run the import multiple times. In most cases, the setting can be left as `1`.
- `required_phrase`: A word or phrase which will be used to filter the imported data. Only documents that contain the `required_phrase` value will be imported to your project.
- `logfile`: The path to the file where errors and deduping results are logged. The default is `import_log.txt` in the same folder as this notebook.
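
For plain text imports, the header requirement on the metadata file can be checked before you build the zip archive. The following is a minimal sketch using only the standard library; the `check_headers` helper, the extra `source` column, and the sample row (including its date format) are made up for illustration.

```python
import csv
import io

# The first four headers required by the notebook, in order.
REQUIRED_HEADERS = ["filename", "pub_date", "title", "author"]

def check_headers(csv_text):
    """Return True if the first four CSV headers match the required ones."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])  # empty list if the file has no rows
    return headers[:4] == REQUIRED_HEADERS

# Build a small metadata file in memory; extra fields go after `author`.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(REQUIRED_HEADERS + ["source"])
writer.writerow(["doc1.txt", "2020-01-15", "A Title", "A. Writer", "Example News"])
metadata_csv = buffer.getvalue()

print(metadata_csv)
print("Headers valid:", check_headers(metadata_csv))
```

To write a real file, open `metadata.csv` with `newline=''` and pass the file handle to `csv.writer` in place of the in-memory buffer.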

If you are importing your data directly to MongoDB, rather than a project folder, set `save_mode` to `'db'` and configure your MongoDB `client`, your database as `db`, and the name of your `collection`. For the `client` setting you can simply enter `MONGODB_CLIENT` to use your project's configuration. If importing from MongoDB, the `query` setting should be a valid MongoDB query. Since MongoDB syntax can be difficult, especially for complex queries, you may wish to use the <a href="query-builder/index.html" target="_blank">WE1S QueryBuilder</a> to construct your query and then paste it into the configuration cell. For information on customizing the QueryBuilder for your data, see the <a href="README.md" target="_blank">README</a> file.
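
As an illustration of what a `query` value might look like, the sketch below combines a date range with a case-insensitive text match using standard MongoDB operators (`$and`, `$gte`, `$lt`, `$regex`, `$options`). The field names and values are assumptions about your collection, and the date comparison only behaves as intended if `pub_date` is stored as an ISO-formatted string.

```python
# Hypothetical query: documents from 2019 whose content mentions "humanities".
# Field names (`pub_date`, `content`) and values are illustrative only.
query = {
    "$and": [
        {"pub_date": {"$gte": "2019-01-01", "$lt": "2020-01-01"}},
        {"content": {"$regex": "humanities", "$options": "i"}},
    ]
}
```

An empty dictionary (`{}`), the default in the configuration cell, matches every document in the collection.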

In [None]:
# Import Configuration
zip_file              = 'import.zip' # The name of the zip archive containing your data files
metadata_file         = 'metadata.csv' # The name of the metadata file (only required for plain text data)
remove_existing_json  = False # Clear an existing json folder before importing
delete_imports_dir    = False # Delete the imports folder after the data has been extracted
delete_text_dir       = False # Delete the plain text files folder after the data has been imported
data_dirs             = None # For zipped json files and data packages, list of directories in which data is located
title_field           = None
author_field          = None
pub_date_field        = None
content_field         = None
dedupe                = False  
random_sample         = None
random_seed           = 1
required_phrase       = None
save_mode             = 'project' # Set to 'db' to import data directly to MongoDB
logfile               = 'import_log.txt' # The name of the error log file

# MongoDB Configuration (required only if importing from MongoDB or saving imports to MongoDB)
client                = 'mongodb://mongo:27017'
db                    = ''
collection            = ''
query                 = {} # The query to perform if importing from MongoDB
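
The combined effect of `required_phrase`, `random_sample`, and `random_seed` can be sketched as follows. This is not the import script's code: the `select_documents` helper is hypothetical, and both the order of operations (filter first, then sample) and the case-sensitive match are assumptions made for the example.

```python
import random

def select_documents(docs, required_phrase=None, random_sample=None, random_seed=1):
    """Filter docs by phrase, then draw a reproducible random sample."""
    if required_phrase is not None:
        # Keep only documents whose content contains the phrase (case-sensitive here).
        docs = [d for d in docs if required_phrase in d.get("content", "")]
    if random_sample is not None and random_sample < len(docs):
        # A dedicated generator seeded with random_seed returns the same
        # sample on every run with the same input.
        rng = random.Random(random_seed)
        docs = rng.sample(docs, random_sample)
    return docs

# Hypothetical documents: every even-numbered one mentions "humanities".
docs = [{"name": f"doc{i}", "content": "humanities" if i % 2 == 0 else "science"}
        for i in range(20)]

first = select_documents(docs, required_phrase="humanities", random_sample=3)
second = select_documents(docs, required_phrase="humanities", random_sample=3)
print([d["name"] for d in first])
print("Same sample on both runs:", first == second)
```

Changing `random_seed` yields a different sample, which is why the same seed should be kept when an import has to be repeated.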

## Prepare the Workspace for File Import

In [None]:
# Initialise the Import object
task = Import(zip_file=zip_file, metadata=metadata_file, delete_imports_dir=delete_imports_dir,
              delete_text_dir=delete_text_dir, title_field=title_field, author_field=author_field,
              pub_date_field=pub_date_field, content_field=content_field, dedupe=dedupe,
              random_sample=random_sample, random_seed=random_seed, required_phrase=required_phrase,
              logfile=logfile, client=client, db=db, collection=collection, project_dir=project_dir,
              json_dir=json_dir, save_mode=save_mode, environment='jupyter')

# Create the import directories
task.setup()

## Perform the Import

In [None]:
# Start the import
task.start_import(remove_existing_json=remove_existing_json)

# Tokenization message
display(HTML('<p>If you would like to tokenize your imported data, proceed to the <strong>Tokenize the Data</strong> section below.</p>'))

## Import from MongoDB

Make sure that you have configured your database information and query in the **Configuration** cell above.

In [None]:
# Initialise the MongoDBImport object
task = MongoDBImport(query, client=client, db=db, collection=collection, project_dir=project_dir, json_dir=json_dir,
                     title_field=title_field, author_field=author_field, pub_date_field=pub_date_field,
                     content_field=content_field, dedupe=dedupe, random_sample=random_sample, random_seed=random_seed,
                     required_phrase=required_phrase, logfile=logfile, environment='jupyter')

# Start the import
task.start_import(remove_existing_json=remove_existing_json)

# Tokenization message
display(HTML('<p>If you would like to tokenize your imported data, proceed to the <strong>Tokenize the Data</strong> section below.</p>'))

## Tokenize the Data

The cell below will generate token counts for each of your documents and save them to a `bag_of_words` field in each document. This can speed up processing for downstream tasks.

### Configuration

You do not have to reconfigure the `json_dir` if you have already run the first cell of this notebook. Errors will be logged to the path you set for the `log_file`.

If you would like to save your tokens as a bag of words, set `bagify_features=True`. If your data has the tokens in a `features` table, tokens will be counted from that table; otherwise, the `content` field will be tokenized first. If your data does not have a `features` table, and you would like to save one to your json documents, set `save_features_table=True` and `method='we1s'`. For more information on features tables, see the <a href="README.md" target="_blank">README</a> file.

The default tokenization method strips all non-alphanumeric characters and splits the text into tokens on white space. If you would like to use the WE1S tokenizer, set `method='we1s'`. Note that this method takes longer. The WE1S tokenizer leverages <a href="https://spacy.io/" target="_blank">spaCy</a> and its language models. The default language model is `en_core_web_sm`; to use a different one, you will first have to download it into your environment. See the <a href="README.md" target="_blank">README</a> file for instructions.
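
The default method described above can be approximated in a few lines. This is a rough sketch of the idea, not the `ImportTokenizer` implementation: the helper names are hypothetical, and leaving case unchanged is an assumption, since the description does not mention lowercasing.

```python
from collections import Counter

def simple_tokenize(text):
    """Drop non-alphanumeric characters, then split on white space."""
    cleaned = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return cleaned.split()

def bag_of_words(text):
    """Count how often each token occurs in the text."""
    return dict(Counter(simple_tokenize(text)))

# Hypothetical document with a `content` field.
doc = {"content": "The cat, the hat!"}
doc["bag_of_words"] = bag_of_words(doc["content"])
print(doc["bag_of_words"])  # {'The': 1, 'cat': 1, 'the': 1, 'hat': 1}
```

Because punctuation is removed rather than replaced with a space, a hyphenated word such as `well-known` becomes the single token `wellknown` in this sketch.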

In [None]:
json_dir             = json_dir
log_file             = 'tokenizer_log.txt'
bagify_features      = True
save_features_table  = False
method               = 'we1s'
language_model       = 'en_core_web_sm'

### Start the Tokenizer

In [None]:
%run {tokenizer_script_path}
# Use the values set in the tokenizer configuration cell above
tokenizer = ImportTokenizer(json_dir, language_model=language_model,
                            log_file=log_file)
tokenizer.start(bagify_features=bagify_features, save_features_table=save_features_table, method=method)