# JSON Utilities

This notebook provides a method of accessing the contents of a project's `json` folder. These folders can be quite large, and they will cause the browser to freeze if they are opened using the Jupyter notebook file browser. This notebook creates a `Documents` object with which you can call methods that list or read the contents of the files in the `json` folder. It also allows you to perform database-like queries on the contents to filter your results and to export the results to a zip archive.

### Info

__authors__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '1.0'  
__email__     = 'scott.kleinman@csun.edu'

## Setup

In [None]:
# Import the Documents class
%run scripts/json_utilities.py

# Python imports
from pathlib import Path 
from IPython.display import display, HTML

# Get the project directory
current_dir = %pwd
project_dir = str(Path(current_dir).parent.parent)

display(HTML('<p style="color: green;">Setup complete.</p>'))

## Create a `Documents` Object

The cell below shows you how to create a `Documents` object and use it to list or read files.

The `get_file_list()` method can optionally take `start` and `end` values, as shown in the example below.

In [None]:
# Create a Documents object for the project folder
docs = Documents(project_dir)

# Get the number of documents
num_docs = docs.count
display(HTML('<p><strong>Number of documents:</strong> {0}</p> '.format(num_docs)))

# Get a list of the first 5 documents in the json folder
result = docs.get_file_list(0, 5)
# Build the list as one HTML string so that it renders as a single list
items = ''.join('<li>{0}</li>'.format(file) for file in result)
display(HTML('<p><strong>First Five Files:</strong></p><ul>{0}</ul>'.format(items)))

# Count the number of documents in a result
num_results = docs.count_docs(result)
display(HTML('<p><strong>Number of documents in result:</strong> {0}</p> '.format(num_results)))
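
As a rough illustration of what listing a slice of a large folder involves, here is a minimal sketch that does not use the `Documents` class. The `list_json_files()` function and the `doc_N.json` filenames are hypothetical, and the sketch assumes that `start` and `end` behave like Python slice indices; the real `get_file_list()` method may work differently.

```python
import json
import tempfile
from pathlib import Path


def list_json_files(json_dir, start=None, end=None):
    """Return sorted .json filenames in json_dir, optionally sliced by start/end."""
    names = sorted(p.name for p in Path(json_dir).iterdir() if p.suffix == '.json')
    # Assumption: start and end act like ordinary slice indices
    return names[start:end]


# Demonstrate with a throwaway folder of hypothetical files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(8):
        sample = {'name': 'Doc {0}'.format(i)}
        (Path(tmp) / 'doc_{0}.json'.format(i)).write_text(json.dumps(sample), encoding='utf-8')
    first_five = list_json_files(tmp, 0, 5)
    all_files = list_json_files(tmp)

print(first_five)
print(len(all_files))
```

Only the filenames are read here, never the file contents, which is why a listing like this stays fast even for very large folders.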

### Preview a Document

To preview a document, run the cell below after configuring the `filename` (just the filename; the path is assumed to be the project's `json` folder) and the `preview_length` (number of characters to display).

In [None]:
# Configure filename
filename        = ''
preview_length  = 300

# Read a document by filename
doc = docs.read(filename)
display(HTML('<p><strong>First {0} characters of the document:</strong></p>'.format(preview_length)))
print(doc['content'][0:preview_length] + '...')

## View Metadata Fields (Optional)

If you wish to perform a query on your documents, it can be helpful to know what metadata fields are available. The cell below will read the first 100 documents and extract the keys for each metadata field. Note that listed keys may not be available in all documents. If you think that your metadata is very inconsistent, you may want to run `docs.get_metadata_keys()` without start and end values. However, this can take a long time, so it is not recommended unless you have reason to think that there are large discrepancies across your collection.

It is also possible to get the keys for specific files with `docs.get_metadata_keys(filelist=['file1', 'file2'])`. If you have already run something like `result = docs.get_file_list(0, 5)`, you can simply run `docs.get_metadata_keys(filelist=result)`.

In [None]:
fields = docs.get_metadata_keys(0, 100)
print(fields)
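
To make clear why some listed keys may be missing from individual documents, the sketch below gathers the union of top-level keys across a few sample files. It does not use the `Documents` class: `collect_metadata_keys()` is a hypothetical helper, the sample documents and their `source` field are invented for illustration, and the real `get_metadata_keys()` method may be implemented differently.

```python
import json
import tempfile
from pathlib import Path


def collect_metadata_keys(json_dir, start=None, end=None):
    """Return the sorted union of top-level keys across a slice of JSON files."""
    paths = sorted(Path(json_dir).glob('*.json'))[start:end]
    keys = set()
    for path in paths:
        with path.open(encoding='utf-8') as f:
            keys.update(json.load(f).keys())
    return sorted(keys)


# Two hypothetical documents with partly different metadata
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        {'name': 'A', 'content': 'text', 'pub_date': '2019-01-01'},
        {'name': 'B', 'content': 'text', 'source': 'hypothetical'},
    ]
    for i, sample in enumerate(samples):
        (Path(tmp) / 'doc_{0}.json'.format(i)).write_text(json.dumps(sample), encoding='utf-8')
    sample_fields = collect_metadata_keys(tmp, 0, 100)

print(sample_fields)
```

Because the result is a union, `pub_date` and `source` both appear even though each occurs in only one of the two files.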

You can generate a table of your documents with `get_table()`. It takes a list of files and a list of fields as its arguments, as in the example below. Columns can be re-ordered, sorted, and filtered. However, it is recommended that you only supply a small number of columns. The bigger the table, the longer the lag time when you scroll.

In [None]:
# Configure columns
columns = [] # E.g. ['name', 'pub_date']

file_list = docs.get_file_list()
table = docs.get_table(file_list, columns)
table

If you wish to save the table after you have sorted and/or filtered it, set the `filename` in the cell below and run the cell.

In [None]:
# Configuration
filename = 'table.csv'

# Save the table
table.get_changed_df().to_csv(filename)

## Performing Queries

Although many questions about your data can be answered by working with the table above, sometimes you may need to perform more sophisticated database-like queries to filter the data in your project's `json` folder. The cells below provide an interface for performing these queries.

A basic query is given in the form of a tuple with the syntax `(fieldname, operator, value)`. The `fieldname` is the name of the metadata field you wish to search. The `value` is the value you are looking for in the field, and the `operator` is the method by which you will evaluate the value. Here are the possible operators: `<`, `<=`, `=` (or `==`), `!=` (meaning "not equal to"), `>`, `>=`, `contains`. The last will match any value anywhere in the field. For greater power, you can use `regex` as the `operator` and a regex pattern as the `value`.

**Important:** The `fieldname` and `operator` must be enclosed in single quotes. The `value` must also be enclosed in single quotes unless it is a number or Boolean (`True` or `False`).
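
The following is a minimal sketch of how a `(fieldname, operator, value)` tuple could be evaluated against a single document's metadata. It is not the code used by `find()`: the `OPERATORS` table, the `matches()` function, and the `pub_year` and `reviewed` fields are hypothetical, and the sketch assumes that a document lacking the field never matches.

```python
import operator
import re

# Hypothetical mapping from operator strings to comparison functions
OPERATORS = {
    '<': operator.lt,
    '<=': operator.le,
    '=': operator.eq,
    '==': operator.eq,
    '!=': operator.ne,
    '>': operator.gt,
    '>=': operator.ge,
    'contains': lambda field_value, value: value in field_value,
    'regex': lambda field_value, pattern: re.search(pattern, field_value) is not None,
}


def matches(doc, query):
    """Return True if the document dict satisfies a (fieldname, operator, value) tuple."""
    fieldname, op, value = query
    if fieldname not in doc:
        return False  # Assumption: a missing field never matches
    return OPERATORS[op](doc[fieldname], value)


sample_doc = {'name': 'Politics and opinion', 'pub_year': 2019, 'reviewed': True}
print(matches(sample_doc, ('name', 'contains', 'Politics')))
print(matches(sample_doc, ('pub_year', '>=', 2020)))
print(matches(sample_doc, ('name', 'regex', r'^Pol\w+')))
print(matches(sample_doc, ('reviewed', '=', True)))
```

Note that the number `2020` and the Boolean `True` are written without quotes, while the string values are quoted.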

The `find()` method takes three arguments: a list of filenames, a query, and, optionally, a Boolean `lower_case` value. If `lower_case=True`, the `value` data will be converted to lower case before it is evaluated. The default is `False`.

In the cell below, we will get a list of the first 5 files (to keep things quick) and search for the ones that contain "Politics" in the document's `name` field.

Note: There is a built-in timer class that can be used to time queries of long file lists. Its use is illustrated in the cell below, but it can be used to time any of the methods.

In [None]:
# Find all docs where the name contains Politics
file_list = docs.get_file_list(0, 5)
timer = Timer()
result = docs.find(file_list, ('name', 'contains', 'Politics'))
print(result)
print('Time elapsed: %s' % timer.get_time_elapsed())
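
For readers curious about what such a timer involves, here is a minimal sketch of a class with the same `get_time_elapsed()` method name. `SimpleTimer` is a hypothetical stand-in, not the project's `Timer` class, and the format of the returned string is an assumption.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for a timer: records a start time on creation."""

    def __init__(self):
        self.start = time.perf_counter()

    def get_time_elapsed(self):
        """Return the time since creation as a string (assumed format)."""
        return '{0:.4f} seconds'.format(time.perf_counter() - self.start)


sketch_timer = SimpleTimer()
total = sum(range(100000))  # Stand-in for a slow query
elapsed = sketch_timer.get_time_elapsed()
print('Time elapsed: %s' % elapsed)
```

Creating the timer immediately before the call you want to measure keeps unrelated setup work out of the reported time.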

## Performing Multiple Queries

You can pass multiple queries to the `find()` method by using a list of tuples, as you can see from the example below. The result will be every document that matches any of the queries in the list.

In [None]:
result = docs.find(file_list, [('name', 'contains', 'Politics'), ('name', 'contains', 'opinion')])
print(result)

## Adding Boolean Logic

It is possible to add more complex Boolean logic by passing a dictionary as the query with `'and'` or `'or'` as the key. The value should be a list of one or more tuples. 
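
The logic can be pictured with Python's built-in `all()` and `any()` functions. The sketch below is not the code used by `find()`: `evaluate()` is a hypothetical function, it handles only the `contains` operator, and it assumes that each dictionary has exactly one key and that a list of dictionaries matches when any one of them matches.

```python
def evaluate(doc, query):
    """Evaluate a tuple, a list of queries, or an {'and': [...]} / {'or': [...]} dict."""
    if isinstance(query, tuple):
        fieldname, _operator, value = query  # Only 'contains' is handled in this sketch
        return value in doc.get(fieldname, '')
    if isinstance(query, list):
        # A list matches if any of its items matches
        return any(evaluate(doc, q) for q in query)
    if isinstance(query, dict):
        key, subqueries = next(iter(query.items()))  # Assumption: exactly one key
        if key not in ('and', 'or'):
            raise ValueError('The key must be "and" or "or"')
        combine = all if key == 'and' else any
        return combine(evaluate(doc, q) for q in subqueries)
    raise TypeError('Unsupported query type')


sample_doc = {'name': 'Politics today'}
conditions = [('name', 'contains', 'Politics'), ('name', 'contains', 'opinion')]
print(evaluate(sample_doc, {'and': conditions}))
print(evaluate(sample_doc, {'or': conditions}))
print(evaluate(sample_doc, conditions))
```

The sample document's name contains "Politics" but not "opinion", so it fails the `and` query and passes both the `or` query and the plain list.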

#### Example with `and`

In [None]:
result = docs.find(file_list, {'and': [('name', 'contains', 'Politics'), ('name', 'contains', 'opinion')]})
print(result)

#### Example with `or`

In [None]:
result = docs.find(file_list, {'or': [('name', 'contains', 'Politics'), ('name', 'contains', 'opinion')]})
print(result)

#### Example with a list of dictionaries

In [None]:
result = docs.find(file_list, [
    {'and': [('name', 'contains', 'Politics'), ('name', 'contains', 'opinion')]},
    {'or': [('name', 'contains', 'Jump')]},
])
print(result)

## Exporting the Results of a Query

You can save the documents found by your query to a zip file with the `export()` method. It takes a list of filenames and a path where you wish to save the zip file. A filename is sufficient if you wish to save it in the current folder.

The `export()` method takes an optional `text_only` argument. Setting `text_only=True` will export only the `content` fields as plain text files.
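
To show roughly what the difference between the two modes amounts to, here is a minimal sketch built on Python's standard `zipfile` module. It is not the project's `export()` implementation: `export_to_zip()` is a hypothetical function, and naming the plain text files after the JSON files with a `.txt` extension is an assumption.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def export_to_zip(json_dir, filenames, zip_filepath, text_only=False):
    """Write the named JSON files, or only their content as .txt files, to a zip archive."""
    with zipfile.ZipFile(zip_filepath, 'w', zipfile.ZIP_DEFLATED) as archive:
        for filename in filenames:
            path = Path(json_dir) / filename
            if text_only:
                with path.open(encoding='utf-8') as f:
                    content = json.load(f).get('content', '')
                # Assumption: text files reuse the JSON filename with a .txt extension
                archive.writestr(Path(filename).stem + '.txt', content)
            else:
                archive.write(path, arcname=filename)


# Demonstrate with one hypothetical document in a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    sample = {'name': 'A', 'content': 'Hello'}
    (Path(tmp) / 'doc_0.json').write_text(json.dumps(sample), encoding='utf-8')
    zip_path = Path(tmp) / 'export.zip'
    export_to_zip(tmp, ['doc_0.json'], zip_path, text_only=True)
    with zipfile.ZipFile(zip_path) as archive:
        exported_names = archive.namelist()
        exported_text = archive.read('doc_0.txt').decode('utf-8')

print(exported_names)
print(exported_text)
```

With `text_only=False`, the sketch copies each JSON file into the archive unchanged, so all metadata fields are preserved.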

Here is an example in which you create a `Documents` object, get a file list, find files in the list that match your query, and export the results to a zip archive.

The timer class is automatically applied to exports.

In [None]:
docs = Documents(project_dir)
file_list = docs.get_file_list(0, 5)
result = docs.find(file_list,
    [
        ('name', 'contains', 'Politics'),
        ('name', 'contains', 'opinion')
    ]
)
docs.export(result, zip_filepath='export.zip')