# Scattertext

This notebook uses Jason Kessler's <a href="https://github.com/JasonKessler/scattertext" target="_blank">Scattertext</a> library to allow you to explore key terms in your corpus in relation to your documents' metadata.

Note that for the purpose of working with Scattertext, the original text is re-tokenized using slightly different rules from the WE1S preprocessor, so there may be some small discrepancies. By default, the WE1S standard stoplist in your project's MALLET module is applied.

### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '0.9.1'  
__email__     = 'scott.kleinman@csun.edu'

## Settings

In [None]:
# Python imports
import os
from IPython.display import display, HTML
from pathlib import Path
import pandas as pd
try:
    import scattertext as st
except ImportError:
    !pip install scattertext
    import scattertext as st

# Get paths
current_dir                = %pwd
project_dir                = str(Path(current_dir).parent.parent)
data_dir                   = project_dir + '/project_data'
json_dir                   = data_dir + '/json'
module_data_dir            = current_dir + '/data'
topic_weights_script_path  = current_dir + '/' + 'scripts/topic_weights.py'
scattertext_script_path    = current_dir + '/' + 'scripts/scattertext.py'
stoplist_path              = project_dir + '/modules/topic_modeling/scripts/we1s_standard_stoplist.txt'

# Import scripts
%run {scattertext_script_path}

# Output message
display(HTML('<p style="color:green;font-weight:bold;">Setup complete.</p>'))

## Load Documents

This cell loads the json documents for the entire collection. This can take a while. For experimentation, it is best to set `end` to a smaller number. For an explanation of the various configuration settings, see the <a href="README.md" target="_blank">README</a> file.

If you have run this cell previously and wish to use the same settings, you can skip this cell. The next cell will load your Documents dataframe from a stored copy called `documents_df.parquet`. If you run this cell again, this file will be overwritten with the new dataframe, so make a backup if you wish to keep the old one.
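
As a rough illustration of how `start`, `end`, and `random_sampling` might narrow the collection, here is a minimal sketch. The helper `select_documents` and the file names are hypothetical; the real logic lives in `build_document_dataframe` in `scripts/scattertext.py` and may differ.

```python
import random

def select_documents(paths, start=0, end=None, random_sampling=None, seed=0):
    """Hypothetical helper: slice a list of paths, then optionally sample a percentage."""
    selected = paths[start:end]  # end=None keeps everything from `start` onwards
    if random_sampling is not None:
        # Interpret random_sampling as a percentage of the sliced collection
        k = min(len(selected), max(1, round(len(selected) * random_sampling / 100)))
        selected = random.Random(seed).sample(selected, k)
    return selected

# Made-up file names, for illustration only
paths = [f"doc_{i}.json" for i in range(100)]
first_twenty = select_documents(paths, start=0, end=20)
ten_percent = select_documents(paths, random_sampling=10)
```

Slicing first and sampling second keeps experimental runs small and, with a fixed seed, repeatable.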

In [None]:
# Configuration
start            = 0
end              = None # E.g. 2000
extra_fields     = {} # E.g. {'date': 'pub_date', 'tags': 'tags'}
random_sampling  = None # The percentage of the collection to sample or None
use_file         = False

if use_file:
    try:
        table = load_documents_df(module_data_dir, to_qgrid = True)
    except IOError:
        table = build_document_dataframe(json_dir, start, end, extra_fields, random_sampling)
else:
    table = build_document_dataframe(json_dir, start, end, extra_fields, random_sampling)
    
table

### Save the Dataframe to CSV

If you wish to save the dataframe to a CSV file, configure a filename and then uncomment one of the lines below. The first will save the original dataframe and the second will save the dataframe after any sorting or filtering you have done.
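
By default, pandas writes the dataframe index as the first column of the CSV. If you do not want that column, pass `index=False`. The sketch below uses a small made-up dataframe and an in-memory buffer in place of a file to show the round trip.

```python
import io
import pandas as pd

# Made-up data standing in for the Documents dataframe
df = pd.DataFrame({
    "filename": ["a.json", "b.json"],
    "funding": ["public", "private"],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row index column

# Read the CSV text back to confirm that the columns are unchanged
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
```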

In [None]:
# Configuration
filename = 'table.csv'

# Save the original dataframe
# table.df.to_csv(filename)

# Save the changed dataframe
# table.get_changed_df().to_csv(filename)

## Generate Document Counts Report

This cell provides a table of document counts for each field column beginning with the one configured for the `start_column` value. The rows provide the document counts by _value_. This information will be used for configuring the cells below.
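
To show the kind of information the report contains, here is a minimal pandas sketch that counts documents per value for every column from `start_column` onwards. It is not the implementation of `generate_counts_report`, and the data is made up.

```python
import pandas as pd

# Made-up Documents dataframe
df = pd.DataFrame({
    "filename": ["a.json", "b.json", "c.json", "d.json"],
    "funding": ["public", "private", "public", "public"],
    "country": ["US", "US", "UK", "US"],
})

start_column = 1  # skip the filename column
# One Series of document counts per field column, indexed by value
counts = {col: df[col].value_counts() for col in df.columns[start_column:]}
```

Here `counts["funding"]` maps `public` to 3 and `private` to 1, so `funding` has two values that could serve as Scattertext categories.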

In [None]:
# Configure start_column
start_column = 4
preview      = None

generate_counts_report(table.df, start_column, preview)

## Build a Corpus

When the corpus is built, each document is parsed using spaCy, so this can take a while. For that reason, it is a good idea to set the limit to around 2000 documents or fewer.

Before generating the corpus, the cell will automatically look for a previously-saved corpus file to speed loading time. If you have changed your `limit` or `field` settings, change the name of the `corpus_file` or set `from_file=False`. If you do not change the name of `corpus_file`, any previous corpus with that filename will be overwritten. 

<p style="color:red;">Important: A Scattertext corpus requires a <code>field</code> category corresponding to one of the column headings in the Document Counts Report above. The column must contain at least two non-zero counts. If you get an error, you may not have chosen a valid field.</p>
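
Assuming the requirement is that the chosen field has at least two distinct values (so that Scattertext has one category and something to contrast it with), a check along these lines can catch an invalid field before the corpus is built. The helper `is_valid_field` is hypothetical and not part of the project scripts.

```python
import pandas as pd

def is_valid_field(df, field):
    """Return True if `field` is a column with at least two distinct non-null values."""
    if field not in df.columns:
        return False
    return bool(df[field].nunique(dropna=True) >= 2)

# Made-up data: `funding` has two values, `country` only one
docs = pd.DataFrame({
    "funding": ["public", "private", "public"],
    "country": ["US", "US", "US"],
})
funding_ok = is_valid_field(docs, "funding")
country_ok = is_valid_field(docs, "country")
```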

Results seem to be improved by using lemmas rather than the original tokens, but this can be changed with `use_lemmas=False`. The other options are more unpredictable. The `entity_types_to_use` and `tag_types_to_use` lists allow you to specify entity and part-of-speech categories that should be retained in the analysis. Tokens not belonging to the types you specify will be excluded from the corpus. A list of the category abbreviations can be found in the spaCy documentation for <a href="https://spacy.io/api/annotation#named-entities" target="_blank">named entities</a> and <a href="https://spacy.io/api/annotation#pos-universal" target="_blank">part-of-speech tags</a>. If you wish to use all the categories, set these values to `All`.

You can also "censor" certain types, which replaces the original token with its entity or part-of-speech abbreviation. Lastly, periods at the ends of tokens can be stripped if they have escaped spaCy's tokenizer.
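
The difference between keeping and censoring types can be sketched on a hand-tagged sentence. The function `filter_and_censor` is hypothetical and only illustrates the idea. It treats `None` as "keep every type", which is an assumption, and the project's `generate_corpus` may behave differently.

```python
def filter_and_censor(tagged_tokens, types_to_use=None, types_to_censor=()):
    """Keep tokens whose tag is allowed; replace censored tokens with their tag."""
    result = []
    for text, tag in tagged_tokens:
        if types_to_use is not None and tag not in types_to_use:
            continue  # tokens of other types are excluded entirely
        result.append(tag if tag in types_to_censor else text)
    return result

# Hand-tagged example sentence (token, part-of-speech tag)
tagged = [("Maria", "NNP"), ("teaches", "VBZ"), ("history", "NN"),
          ("in", "IN"), ("Ohio", "NNP")]

kept = filter_and_censor(tagged,
                         types_to_use=["NNP", "VBZ", "NN"],
                         types_to_censor=["NNP"])
# kept == ["NNP", "teaches", "history", "NNP"]
```

Censoring proper nouns in this way lets names count as a single generic term instead of appearing as many rare individual tokens.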

For convenience, here is a list of entity and part-of-speech abbreviations, which you can copy and paste into the cell below.

### Named Entities

<code>"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"</code>

### Parts of Speech

<code>"$", "``", "''", ",", "-LRB-", "-RRB-", ".", ":", "ADD", "CC", "CD", "DT", "EX", "IN", "LS", "NFP", "NIL", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RP", "SYM", "TO", "UH", "WDT", "WP", "WP$", "WRB"</code>

In [None]:
# Configuration
limit                   = 2000 # Less than or equal to the `end` value in Load Documents
field                   = '' # E.g. 'funding'
corpus_file             = '' # E.g. 'corpus' -- No extension necessary
from_file               = True
stoplist_path           = stoplist_path
use_lemmas              = True
entity_types_to_use     = None # E.g. ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW"]
entity_types_to_censor  = [] # E.g. ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW"]
tag_types_to_use        = None # E.g. ["AFX", "NN", "NNS", "RB", "RBR", "RBS", "RP", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
tag_types_to_censor     = [] # E.g. ["NNP", "NNPS"]
strip_final_period      = False

# Use the dataframe behind the table loaded above; fall back to the stored copy
try:
    df = table.df
except NameError:
    df = load_documents_df(module_data_dir)

# Try the saved corpus first, then build a new one if none was loaded
corpus = None
if from_file:
    corpus = load_corpus(module_data_dir, corpus_file)
    if corpus is not None:
        display(HTML('<p style="color:green;">Corpus loaded from file.</p>'))
if corpus is None:
    corpus = generate_corpus(module_data_dir, corpus_file, nlp, df.head(limit), field, stoplist_path=stoplist_path,
                             use_lemmas=use_lemmas, entity_types_to_use=entity_types_to_use, tag_types_to_use=tag_types_to_use,
                             entity_types_to_censor=entity_types_to_censor, tag_types_to_censor=tag_types_to_censor,
                             strip_final_period=strip_final_period)


## Generate terms that differentiate the collection from a general English corpus

Note: We _think_ that Scattertext is using the Brown Corpus for comparison, but we have not been able to confirm this.

In [None]:
# Configuration
limit = 20 # The number of key terms to display

display(HTML('<h4>Terms Characteristic of This Corpus:</h4>'))
terms = ', '.join(list(corpus.get_scaled_f_scores_vs_background().index[:limit]))
display(HTML('<p>' + terms + '</p>'))

## Generate terms associated with a field value

For the `score_query` configuration, supply one of the row values in the Counts Report above. The `score_label` can be a more human-readable or descriptive label for the value.
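
The cell below ranks terms by their score for the queried value and keeps the top `limit` terms. The ranking step on its own looks like this. The scores are invented for illustration and are not real Scattertext output.

```python
import pandas as pd

# Invented scores for illustration only
term_scores = pd.Series(
    {"tuition": 0.91, "campus": 0.75, "the": 0.10, "endowment": 0.88},
    name="US private college",
)

limit = 3
top_terms = term_scores.sort_values(ascending=False).index[:limit].tolist()
# top_terms == ["tuition", "endowment", "campus"]
```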

In [None]:
# Configuration
limit = 20 # The number of key terms to display
score_query = '' # The column value to query, e.g. 'US private college'
score_label = '' # The label to give the query results -- can be the same or a more descriptive label

term_freq_df = corpus.get_term_freq_df()
term_freq_df[score_label] = corpus.get_scaled_f_scores(score_query)
display(HTML('<h4>Key Terms Associated with "' + score_label + '":</h4>'))
terms = ', '.join(list(term_freq_df.sort_values(by=score_label, ascending=False).index[:limit]))
display(HTML('<p>' + terms + '</p>'))

## Generate a Scattertext Visualization of Term Associations

This cell generates a Scattertext visualization, which is saved at the location you specify for `filename`. Be sure to set `limit` to the same number you used for generating the corpus.

The `field` value should be taken from one of the row values in the Counts Report above. In the graph, the axis for this field will be labelled with the value you provide for `field_label`. The other axis will be labelled with the value you provide for `non_field_label`. You can also modify the width of the graph and supply an extra metadata category, which will be the name of a column in the Documents table above. The values for that category will be displayed above sample documents in the graph.

The results can be filtered by minimum term frequency and pointwise mutual information (the higher the number the greater the requirement that terms co-occur in the same document).
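
For reference, pointwise mutual information compares how often two items occur together with how often they would co-occur if they were independent. The sketch below computes the standard definition from raw counts. How Scattertext applies `pmi_threshold_coefficient` internally is not shown here, and the counts are made up.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI in bits: log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Two items that only ever appear together: strongly associated
always_together = pmi(count_xy=10, count_x=10, count_y=10, total=1000)

# Two items that co-occur about as often as chance predicts: PMI near 0
near_chance = pmi(count_xy=10, count_x=100, count_y=100, total=1000)
```

A higher threshold therefore keeps only pairs whose co-occurrence is well above chance.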

In [None]:
# Configuration
filename                   = '' # E.g. 'US-private-college_test.html'
limit                      = 2000 # Less than or equal to the `limit` value used to build the corpus
field                      = '' # E.g. 'US private college'
field_label                = '' # A more descriptive label for the field
non_field_label            = '' # E.g. 'non-US private college'
width_in_pixels            = 1000
extra_metadata             = 'date'
minimum_term_frequency     = 0
pmi_threshold_coefficient  = 0

# Generate and save the html file
html = st.produce_scattertext_explorer(corpus, category=field, category_name=field_label, not_category_name=non_field_label,
                                       width_in_pixels=width_in_pixels, metadata=corpus.get_df()[extra_metadata].head(limit),
                                       minimum_term_frequency=minimum_term_frequency,
                                       pmi_threshold_coefficient=pmi_threshold_coefficient)
with open(filename, 'w', encoding='utf-8') as f:
    f.write(html)

# Display the link
current_dir = %pwd
project_dir = str(Path(current_dir).parent.parent)
config_path = project_dir + '/config/config.py'
%run {config_path}
%run {scattertext_script_path}
display_link(filename, project_dir, WRITE_DIR, PORT)