## Vocab

This notebook builds a single JSON file (the vocab file) containing term counts for all the documents in a directory of JSON files.

The vocab file can be loaded into a `Vocab` object called `vocab`. You can view the file's contents by calling `vocab.vocab`. However, because this is a large list, displaying it in full may freeze the notebook. It is recommended that you view a slice instead, such as `vocab.vocab[0:100]` (the first 100 terms).

The `Vocab` object also has a number of methods for obtaining various types of information from the vocab file. These methods are listed below: 

- `vocab.get_filenames()`: Returns a list of filenames in the vocab.
- `vocab.get_names()`: Returns a list of names of json documents in the vocab.
- `vocab.get_document(value)`: Returns a dict containing a single document. The value can be either a filename or a `name` field value.
- `vocab.get_documents(value)`: Returns a list of dicts containing documents. The value can be either a list of filenames or a list of `name` field values.
- `vocab.get_num_docs()`: Returns the number of documents in the vocab.
- `vocab.get_num_terms(documents=None)`:  Returns the number of terms in the entire vocab or a list of documents. If using a list of documents, call `vocab.get_num_terms(documents=['document1', 'document2'])`. If you are unsure of the names of your documents, you can get a list with `vocab.get_names()`.
- `vocab.get_num_tokens(documents=None)`:  Returns the total number of tokens in the entire vocab or a list of documents. If using a list of documents, call `vocab.get_num_tokens(documents=['document1', 'document2'])`. If you are unsure of the names of your documents, you can get a list with `vocab.get_names()`.
- `vocab.get_terms(documents=None, sortby=['TERM', 'COUNT'], ascending=[True, False], as_dict=False)`: Returns a dataframe containing the terms and counts in the vocab or a list of documents specified by filenames or `name` field values. By default, the data is sorted in ascending order of terms and descending order of counts. These can be modified using the `sortby` and `ascending` parameters. If you choose to include only one `sortby` criterion in the list, make sure that the `ascending` parameter also has one value (and vice versa). Setting `as_dict=True` will return a plain dict.
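
The difference between terms (counted by `get_num_terms()`) and tokens (counted by `get_num_tokens()`) can be illustrated with a small stand-alone sketch. It does not use the `Vocab` class. It assumes that a term is a distinct word form and a token is a single occurrence of one, and the sample counts are invented for illustration. The last step mimics a sort by descending count, then ascending term.

```python
from collections import Counter

# Hypothetical term counts for two documents (invented data, not from a real vocab file)
doc_counts = {
    "document1": Counter({"the": 3, "humanities": 2, "crisis": 1}),
    "document2": Counter({"the": 2, "science": 2}),
}

# Merge the per-document counts into a single vocabulary-wide counter
total = Counter()
for counts in doc_counts.values():
    total.update(counts)

num_terms = len(total)            # number of distinct terms
num_tokens = sum(total.values())  # total number of occurrences

# Sort by descending count, breaking ties by ascending term
rows = sorted(total.items(), key=lambda item: (-item[1], item[0]))

print(num_terms, num_tokens)
print(rows)
```

Here the word "the" contributes one term but five tokens, which is why the token count is always at least as large as the term count.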

### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.0'  
__email__     = 'scott.kleinman@csun.edu'

## Setup and Configuration

In [None]:
# Python imports
import os
from pathlib import Path
from IPython.display import display, HTML

# Configuration (only needs to be changed in rare circumstances)
json_dir = '/project_data/json'
vocab_file = '/project_data/vocab.json'

# Define paths
current_dir = %pwd
current_pathobj = Path(current_dir)
project_dir = str(current_pathobj.parent.parent)
json_dir = project_dir + json_dir
vocab_file = project_dir + vocab_file

# Import scripts
%run scripts/vocab.py

# Display the project directory
display(HTML('<p><strong>Project Directory:</strong> ' + project_dir + '</p>'))
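
The path logic in the cell above relies on the IPython `%pwd` magic, so it only runs inside a notebook. A plain-Python sketch of the same derivation follows. The directory `/home/user/project/modules/vocab` is hypothetical and chosen only so that the result is predictable; in practice `Path.cwd()` would supply the current directory.

```python
from pathlib import PurePosixPath

# Hypothetical notebook location, two levels below the project directory
current_dir = "/home/user/project/modules/vocab"

# Climb two levels to reach the project directory, as the setup cell does
project_dir = str(PurePosixPath(current_dir).parent.parent)

# The configured values start with '/', so string concatenation yields valid paths
json_dir = project_dir + "/project_data/json"
vocab_file = project_dir + "/project_data/vocab.json"

print(project_dir)
print(json_dir)
print(vocab_file)
```

Because the configured values begin with a slash, they are joined by concatenation rather than with the `/` operator of `pathlib`, which would treat them as absolute paths and discard the project directory.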

## Build the Vocab File

You only need to run this cell once.

In [None]:
# Build the vocab file
build_vocab(json_dir, vocab_file)
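
As a rough picture of what building a vocab file involves, the sketch below counts the terms in every JSON file in a directory and writes the result to one JSON file. It is not the `build_vocab()` implementation from `scripts/vocab.py`. The `content` field, the lowercasing and whitespace tokenization, the output layout, and the function name `sketch_build_vocab` are all assumptions made for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def sketch_build_vocab(json_dir, vocab_file, text_field="content"):
    """Count terms per document and save the counts to one JSON file.

    Hypothetical stand-in for illustration; the real build_vocab() may differ.
    """
    vocab = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        # Assumption: lowercase the text and split on whitespace
        tokens = doc.get(text_field, "").lower().split()
        vocab[path.name] = dict(Counter(tokens))
    with open(vocab_file, "w", encoding="utf-8") as f:
        json.dump(vocab, f)
    return vocab


# Demonstrate on a temporary directory containing two tiny documents
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    docs_dir = tmp_path / "json"
    docs_dir.mkdir()
    (docs_dir / "a.json").write_text(
        json.dumps({"content": "The cat and the hat"}), encoding="utf-8"
    )
    (docs_dir / "b.json").write_text(
        json.dumps({"content": "A hat"}), encoding="utf-8"
    )
    result = sketch_build_vocab(docs_dir, tmp_path / "vocab.json")

print(result)
```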

## Create the `Vocab` Object

In [None]:
# Create a Vocab object
vocab = Vocab(vocab_file)

## Do Stuff

You can now do stuff by calling the `Vocab` object's methods. The example below gets a table of term counts for the entire vocabulary. For convenience, here is a copy of the available methods (described in the first cell of this notebook):

- `vocab.get_filenames()`: Returns a list of filenames in the vocab.
- `vocab.get_names()`: Returns a list of names of json documents in the vocab.
- `vocab.get_document(value)`: Returns a dict containing a single document. The value can be either a filename or a `name` field value.
- `vocab.get_documents(value)`: Returns a list of dicts containing documents. The value can be either a list of filenames or a list of `name` field values.
- `vocab.get_num_docs()`: Returns the number of documents in the vocab.
- `vocab.get_num_terms(documents=None)`:  Returns the number of terms in the entire vocab or a list of documents. If using a list of documents, call `vocab.get_num_terms(documents=['document1', 'document2'])`. If you are unsure of the names of your documents, you can get a list with `vocab.get_names()`.
- `vocab.get_num_tokens(documents=None)`:  Returns the total number of tokens in the entire vocab or a list of documents. If using a list of documents, call `vocab.get_num_tokens(documents=['document1', 'document2'])`. If you are unsure of the names of your documents, you can get a list with `vocab.get_names()`.
- `vocab.get_terms(documents=None, sortby=['TERM', 'COUNT'], ascending=[True, False], as_dict=False)`: Returns a dataframe containing the terms and counts in the vocab or a list of documents specified by filenames or `name` field values. By default, the data is sorted in ascending order of terms and descending order of counts. These can be modified using the `sortby` and `ascending` parameters. If you choose to include only one `sortby` criterion in the list, make sure that the `ascending` parameter also has one value (and vice versa). Setting `as_dict=True` will return a plain dict.

## Get Term Counts Table

In [None]:
table = vocab.get_terms(sortby=['COUNT', 'TERM'], ascending=[False, True])
table

## Perform a Query on the Term Counts Table

This cell provides a means of querying the table produced in the previous cell (you will get an error if you do not first run that cell). It uses the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html" target="_blank">pandas query method</a>, and users should consult the pandas documentation for information about how to construct specific queries.

The example below filters the table so that it contains only terms not in the list of terms provided.

In [None]:
query = table.query('TERM not in ["the", "a", "and", "of", "to", "in", "is", "for", "that", "was", "at", "i"]')
query
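
A few other query patterns may be useful. The sketch below runs on a small invented dataframe that stands in for the table returned by `vocab.get_terms()`; it assumes only the `TERM` and `COUNT` column names used above. It shows a numeric comparison and a compound condition that refers to a Python list with the `@` prefix.

```python
import pandas as pd

# Invented term counts standing in for the table returned by vocab.get_terms()
table = pd.DataFrame({
    "TERM": ["the", "humanities", "science", "crisis"],
    "COUNT": [5, 2, 2, 1],
})

# Keep terms that occur at least twice
frequent = table.query("COUNT >= 2")

# Combine conditions and reference a Python variable with the @ prefix
excluded = ["the", "a", "and"]
filtered = table.query("COUNT >= 2 and TERM not in @excluded")

print(frequent)
print(filtered)
```

Referring to a list with `@` avoids having to write a long list of terms inside the query string itself.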

## Filter Stop Words

The cells below provide an example of how to use a list of stop words to filter the table. You must have already run the **Get Term Counts Table** cell above.

### Configure Stop Word List

Add words to the list below to create a stop word list. If you want to load the text from a text file (one word per line), replace the list with the path to the file (e.g. `stopwords = 'stoplist.txt'`).

In [2]:
stopwords = ['the', 'and']
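
If the stop words live in a text file, they need to be read into a Python list before they can be used in a query. The helper below is one way to do that. The name `load_stopwords` is hypothetical and not part of the notebook's scripts, and the file is assumed to be UTF-8 text with one word per line.

```python
import tempfile
from pathlib import Path


def load_stopwords(stopwords):
    """Return a list of stop words from either a list or a path to a text file.

    Hypothetical helper for illustration; not part of scripts/vocab.py.
    """
    if isinstance(stopwords, (str, Path)):
        text = Path(stopwords).read_text(encoding="utf-8")
        # One word per line; skip blank lines and surrounding whitespace
        return [line.strip() for line in text.splitlines() if line.strip()]
    return list(stopwords)


# Demonstrate with a temporary stop word file
with tempfile.TemporaryDirectory() as tmp:
    stoplist_path = Path(tmp) / "stoplist.txt"
    stoplist_path.write_text("the\nand\n\nof\n", encoding="utf-8")
    from_file = load_stopwords(stoplist_path)

from_list = load_stopwords(["the", "and"])
print(from_file, from_list)
```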

### Example: Download a List of Stop Words

This cell provides an example of how you might download a list of stop words (in this case the standard WE1S stop word list). It also shows how to add further stop words, such as "'s", so that the query in the next cell can also filter the possessive "'s".

If the stop word file is downloaded successfully, the first five stop words are displayed.

In [None]:
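# Get the stopwords
import requests
stoplist_url = 'https://raw.githubusercontent.com/whatevery1says/preprocessing/master/libs/vectors/we1s_standard_stoplist.txt'
response = requests.get(stoplist_url)
response.raise_for_status()  # Stop here if the download failed
# One stop word per line; skip blank lines
stopwords = [word.strip() for word in response.text.splitlines() if word.strip()]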

# We'll filter possessive 's as well
stopwords.append("'s")

# Display the first five words of the stop word list
stopwords[0:5]

### Perform the Query to Filter the Stop Words

In [None]:
# The @ prefix lets the query refer to the Python variable `stopwords`
query = table.query('TERM not in @stopwords')
query

## Filter Punctuation Marks

There are many ways that you could filter punctuation from your table. The method below keeps only terms that contain at least one character defined by Unicode as a "word" character (a letter, digit, or underscore), so terms consisting entirely of punctuation marks are removed.

In [None]:
query = table[table.TERM.str.contains(r'\w', regex=True, na=False)]
query
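
A regular expression filter of this kind can either keep terms that contain at least one word character or keep only terms made up entirely of word characters. The sketch below contrasts the two using the standard `re` module on an invented list of terms. The first filter mirrors the search semantics of pandas `str.contains`; the second is a stricter alternative, offered only as an illustration.

```python
import re

# Invented sample terms, including pure punctuation and mixed forms
terms = ["humanities", "don't", "--", "!", "e-mail", "2020", "..."]

# Keep terms with at least one word character (drops pure punctuation only)
has_word_char = [t for t in terms if re.search(r"\w", t)]

# Stricter: keep terms made up entirely of word characters
only_word_chars = [t for t in terms if re.fullmatch(r"\w+", t)]

print(has_word_char)
print(only_word_chars)
```

The stricter filter also drops contractions and hyphenated words, which may or may not be what you want for your data.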