# Topic Statistics by Metadata

This notebook extracts information from a model's `topic-docs.txt` file and combines it with the documents' metadata fields to count the documents associated with specific metadata values in each topic. Since MALLET automatically selects the top 100 documents in each topic, these documents are the basis for the data. The results can be viewed in pandas dataframes and saved to CSV files.

The results may be visualised in a static bar chart (stacked or unstacked) or an interactive Plotly bar chart for any metadata field. The visualisations may be saved to static PNG files.

### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.5'  
__email__     = 'scott.kleinman@csun.edu'

## Settings

In [None]:
# Python imports
from IPython.display import display, HTML
from pathlib import Path

# Get paths
current_dir                = %pwd
project_dir                = str(Path(current_dir).parent.parent)
data_dir                   = project_dir + '/project_data'
model_dir                  = data_dir + '/models'
json_dir                   = data_dir + '/json'
topic_weights_script_path  = current_dir + '/' + 'scripts/topic_stats.py'
topic_name                 = 'topic'

# Import scripts
%run {topic_weights_script_path}

# Output message
display(HTML('<p style="color:green;font-weight:bold;">Setup complete. Please set the configuration values in the next cell.</p>'))

## Configuration

Select **one** model to explore. **Please run the next cell regardless of whether you change anything.**

If you are unsure of the name of your model, navigate to the `your_project_name/project_data/models` directory in your project, and choose the name of one of the subdirectories in that folder. Each subdirectory should be called `topicsn1`, where `n1` is the number of topics you chose to model, for example: `selection = 'topics100'`. Please follow this format exactly.

**In most cases, you should not need to change the `data_path` and `from_file` configurations.** The `data_path` variable specifies the folder where the notebook's assets will be saved. Some cells below attempt to load assets from this data folder so that you do not need to re-run procedures if you have already run the cell once. If for some reason you wish to bypass loading data from a saved file, set `from_file=False` in the cell's configuration section.

In [None]:
selection   = '' # e.g. 'topics25'
data_path  = 'data'
from_file   = True

# Output message
display(HTML('<p style="color:green;font-weight:bold;">Configuration complete.</p>'))
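
If you would rather list the available model names than browse the folder, a small helper along the following lines may help. This is a minimal sketch, not part of the notebook's scripts: `list_models` is a hypothetical name, and it assumes only that model folders follow the `topicsN` naming described above.

```python
import re
from pathlib import Path


def list_models(model_dir):
    """Return the names of subdirectories called 'topics<N>', sorted by N."""
    pattern = re.compile(r'^topics(\d+)$')
    found = []
    for child in Path(model_dir).iterdir():
        match = pattern.match(child.name)
        # Keep only directories whose names match the expected format
        if child.is_dir() and match:
            found.append((int(match.group(1)), child.name))
    return [name for _, name in sorted(found)]
```

Calling `list_models(model_dir)` after running the Settings cell should return values that can be pasted into `selection`.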

## Read Topic-Docs File

In [None]:
df = pd.read_csv(os.path.join(model_dir, selection + '/' + selection.replace('topics', 'topic-docs') + '.txt'), sep=' ').drop('...', axis=1)
topics = df['#topic'].values.tolist()
topics = [int(i) + 1 for i in topics]
topics = pd.DataFrame(topics, columns=['topic'])
df = pd.concat([topics, df], axis=1)
df.columns = ['#topic', 'delete', 'doc', 'name', 'proportion']
df = df.drop(columns=['delete'])
to_qgrid(df)
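
The cell above can be hard to follow because it renames and drops columns in several steps. The sketch below shows the same idea on a small in-memory sample: read the space-delimited file, drop the trailing `...` column, and convert MALLET's zero-based topic index to a one-based topic number. The sample rows and the `read_topic_docs` name are invented for illustration, and the header line is an assumption about the file layout based on the column names used in the cell.

```python
import io

import pandas as pd

# Invented sample in the assumed topic-docs layout
sample = (
    "#topic doc name proportion ...\n"
    "0 12 file:/docs/a.json 0.41\n"
    "0 7 file:/docs/b.json 0.39\n"
    "1 3 file:/docs/c.json 0.52\n"
)


def read_topic_docs(buffer):
    """Read topic-docs data and return it with one-based topic numbers."""
    raw = pd.read_csv(buffer, sep=' ')
    # The '...' header has no data beneath it, so the column is empty
    raw = raw.drop(columns=['...'], errors='ignore')
    # Convert zero-based topic indices to one-based topic numbers
    raw['#topic'] = raw['#topic'].astype(int) + 1
    return raw[['#topic', 'doc', 'name', 'proportion']]


topic_docs = read_topic_docs(io.StringIO(sample))
print(topic_docs)
```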

## Export Data from Top Documents (Optional)

Run the next cell to export the contents of the top documents to plain text files. Before running the cell, configure the `save_path` with the path to the directory where you want to save the text files. If `topic_num` is set to `All`, the content of all documents will be exported. If you set it to a topic number, only the documents associated with that topic will be exported.

In [None]:
# Configuration
topic_num = '1' # Make sure to keep the quotation marks
save_path = '' #'data' # Path to save directory -- leave as '' for the current directory

# Start the export
start_export(df, json_dir, topic_num=topic_num, save_path=save_path)
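
For readers curious about what an export of this kind involves, here is a minimal sketch. It is not the implementation of `start_export`; the function name `export_top_documents` is hypothetical, and it assumes that each document is a JSON file in `json_dir` whose filename matches the last part of the `name` column and whose text is stored in a `content` field.

```python
import json
from pathlib import Path

import pandas as pd


def export_top_documents(topic_docs, json_dir, topic_num='All', save_path=''):
    """Write the assumed `content` field of each top document to a .txt file."""
    out_dir = Path(save_path or '.')
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = topic_docs
    if str(topic_num).lower() != 'all':
        # Keep only the documents associated with the requested topic
        rows = topic_docs[topic_docs['#topic'] == int(topic_num)]
    written = []
    for name in rows['name'].drop_duplicates():
        source = Path(json_dir) / Path(name).name
        if not source.is_file():
            continue  # Skip documents whose JSON file cannot be found
        with open(source, encoding='utf-8') as f:
            doc = json.load(f)
        target = out_dir / (source.stem + '.txt')
        target.write_text(doc.get('content', ''), encoding='utf-8')
        written.append(target)
    return written
```

The function returns the list of files it wrote, which makes it easy to check how many documents were exported.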

## Gather Collection Metadata

The cell below gathers metadata for a list of JSON fields from the documents in the collection. Before running the cell, configure the list of metadata fields to collect using the `fields` variable.

In the returned dataframe (called `topic_docs_metadata`), tag attributes that exist but have no subattributes are given a value of `1`; missing tag attributes are given a value of `0`. Run the second cell below to display the dataframe.

Note: This cell can take some time to run. If you have already run it once, it should read the metadata from a saved file. Set `from_file=False` to re-generate the data.

In [None]:
# Configuration
fields     = [] # The metadata fields to collect from JSON files, e.g. ['tags']
from_file  = True # Set to False if you want the script to ignore previously created files

# Read the json files, get the metadata, and combine it with the topic-doc proportions
display(HTML('<p>Getting metadata...</p>'))
try:
    metadata = get_metadata(df, fields, selection, json_dir, from_file=from_file, data_path=data_path)
    topic_docs_metadata = pd.concat([df, metadata], axis=1)
    table = to_qgrid(topic_docs_metadata)
    display(HTML('<p style="color: green;">Done!</p>'))
except Exception: # Avoid a bare except, which would also swallow keyboard interrupts
    display(HTML('<p style="color: red;">An error occurred. Please double-check your configuration.</p>'))
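
The general pattern of combining topic-doc rows with metadata can be sketched as follows. This is an illustration only, not the logic of `get_metadata`: `collect_fields` is a hypothetical name, the sketch assumes one JSON file per document named after the last part of the `name` column, and it leaves missing fields empty rather than reproducing the `1`/`0` handling of tag attributes described above.

```python
import json
from pathlib import Path

import pandas as pd


def collect_fields(topic_docs, fields, json_dir):
    """Return one row of metadata per topic-doc row, aligned on the same index."""
    records = []
    for name in topic_docs['name']:
        source = Path(json_dir) / Path(name).name
        doc = {}
        if source.is_file():
            with open(source, encoding='utf-8') as f:
                doc = json.load(f)
        # Missing fields are recorded as None
        records.append({field: doc.get(field) for field in fields})
    return pd.DataFrame(records, index=topic_docs.index, columns=fields)
```

Because the result shares its index with the topic-docs dataframe, the two can be joined side by side with `pd.concat([topic_docs, metadata], axis=1)`, as in the cell above.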

### Display the Dataframe

You can drag column boundaries to change their width or column labels to re-order the columns. To sort the columns, click the column label (and click again to sort in reverse order). Click the filter icon to filter your data by column values.

In [None]:
table

## Save to CSV

If you wish to save a copy of the output to a CSV file, set `save_path` to a relative path to the location where you wish to save the file. The path should include the filename. It is recommended that you include a topic number in the filename so that you do not accidentally overwrite a file from a different model.

By default, the CSV will reflect the table above _after_ any modifications you make by filtering or sorting. If you wish to use the original table, set `use_original=True`.

In [None]:
# Configuration
save_path     = '' # Filename with relative path where CSV file will be saved
use_original  = False

# Save the file
save_to_csv(table, save_path, use_original)

## Get Counts by Column Values

This cell calculates document counts for each topic by column value using the `topic_docs_metadata` dataframe. Before running the cell, make sure that you configure the `column` variable with the name of the dataframe column from which you wish to count values. 

If you wish to save a copy of the output to a CSV file, set `save_path` to a relative path to the location where you wish to save the file. The path should include the filename. It is recommended that you include a topic number in the filename so that you do not accidentally overwrite a file from a different model.

By default, the CSV will reflect the counts table _after_ any modifications you make by filtering or sorting. If you wish to use the original table, set `use_original=True`.

In [None]:
# Configuration
column        = '' # The column from which to count values (e.g. 'region')
save_path     = '' # Filename with relative path where CSV file will be saved, e.g. 'data/media-counts.csv'
use_original  = False


# Get the counts table
counts = get_counts(topic_docs_metadata, column)
counts_table = to_qgrid(counts)

# Save to csv
if save_path: # Skip saving when save_path is empty or None
    save_to_csv(counts_table, save_path, use_original)

# Display the table
display(counts_table)
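
Counting documents per topic by column value is essentially a cross-tabulation. The sketch below shows one way to do this with `pd.crosstab` on invented sample data. It is not the implementation of `get_counts`, and the column names it produces (one per distinct value) may differ from those in the real counts table; `count_by_column` is a hypothetical name.

```python
import pandas as pd

# Invented sample: five top documents across two topics
sample = pd.DataFrame({
    '#topic': [1, 1, 1, 2, 2],
    'region': ['US', 'US', 'UK', 'UK', 'UK'],
})


def count_by_column(frame, column, topic_column='#topic'):
    """Count documents per topic for each distinct value in `column`."""
    counts = pd.crosstab(frame[topic_column], frame[column])
    counts.columns.name = None  # Remove the leftover axis label
    return counts.reset_index()


counts_sketch = count_by_column(sample, 'region')
print(counts_sketch)
```

Values that never occur in a topic appear as `0`, so every topic row has an entry for every category.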

## Visualise Metadata with a Simple Bar Plot (Static Version)

Set `fields` to a list of column headings in the `counts_table` above. If you wish to display different names in the legend, provide a list of corresponding names for `legend_labels` (in the same order). You may adjust the `title`, `xlabel` (for the x-axis) and `ylabel` (for the y-axis) to describe the content of your data accurately.

Since plots can be very cramped, you may want to look at a limited range of topics. To do this, modify the `start_topic` and `end_topic` values. You can also save space by creating a stacked plot with `stacked=True`. (The interactive plot in the next cell provides another option with pan and zoom features.)

To save the plot to a file, set `save_path` to a full file path, including the filename. The type of file is inferred from the extension. For instance, files ending in `.png` will be saved as PNG files and files ending in `.pdf` will be saved as PDF files. SVG format is also available.

In [None]:
# Configuration
title          = '' # E.g. 'Top Document Counts by Classification Label for Topics 1-50'
start_topic    = 1
end_topic      = None
xlabel         = '' # E.g. 'Topic'
ylabel         = '' # E.g. 'Count'
stacked        = True
save_path      = None # Or supply a file path if you wish to save the plot to a file
fields         = [] # E.g. ['top_humanities_count', 'top_science_count']
legend_labels  = None # Optional, e.g. ['Humanities', 'Science']

# Create the plot
bar_plot(counts, start_topic, end_topic, fields, title, xlabel=xlabel, ylabel=ylabel, legend_labels=legend_labels, stacked=stacked, save_path=save_path)
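
The settings above map closely onto pandas' built-in plotting. The following sketch shows how a topic range, a stacked mode, and custom legend labels might be applied to a counts table. It is an approximation under stated assumptions, not the code behind `bar_plot`: `simple_bar_plot` is a hypothetical name, the sample data are invented, and the table is assumed to have a `#topic` column.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; remove this line inside a notebook
import pandas as pd


def simple_bar_plot(counts, fields, start_topic=1, end_topic=None, stacked=True,
                    title='', xlabel='', ylabel='', legend_labels=None,
                    save_path=None, topic_column='#topic'):
    """Plot the selected count columns for a range of topics."""
    data = counts[counts[topic_column] >= start_topic]
    if end_topic is not None:
        data = data[data[topic_column] <= end_topic]
    ax = data.set_index(topic_column)[fields].plot(kind='bar', stacked=stacked, title=title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if legend_labels:
        ax.legend(legend_labels)  # Replace column names with display names
    if save_path:
        # The file type is inferred from the extension (.png, .pdf, .svg)
        ax.figure.savefig(save_path, bbox_inches='tight')
    return ax


counts_demo = pd.DataFrame({'#topic': [1, 2, 3], 'UK': [1, 2, 0], 'US': [2, 0, 4]})
ax = simple_bar_plot(counts_demo, ['UK', 'US'], end_topic=2,
                     legend_labels=['Britain', 'United States'],
                     title='Sample counts', xlabel='Topic', ylabel='Count')
```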

## Visualise Metadata with a Plotly Bar Plot (Interactive Version)

The interactive plot (using Plotly) takes the same settings as the static plot above, except the stacked mode is not available. However, because zoom and pan features are available, it is possible to display the entire range of topics in a single graph. Click and drag over the graph to zoom in on a location. Click the home icon in the Plotly toolbar to restore the default zoom level. Click on the boxes in the legend to show and hide specific categories. Double-click to restore the default display.

You can download the plot as a PNG file by clicking the camera icon in the Plotly toolbar. If you wish to save the interactive plot as a standalone web page, set the `save_path` to a full file path, including the filename ending in `.html`.

In [None]:
# Configuration
title          = '' # E.g. 'Document Counts by Region for Topics 1-100 (Based on the Top 100 Documents)'
start_topic    = 1
end_topic      = None
xlabel         = '' # E.g. 'Topic'
ylabel         = '' # E.g. 'Count'
save_path      = None # Supply a file path if you wish to save the plot to a file
fields         = [] # E.g. ['Humanities', 'Science']
legend_labels  = None # Optional, e.g. ['Humanities', 'Science']

# Create the plot
plotly_bar_plot(counts, start_topic, end_topic, fields, title, xlabel=xlabel, ylabel=ylabel, legend_labels=legend_labels, save_path=save_path)

## Generate Topic-Doc Dictionary (Optional Utility)

This cell generates a dictionary with topic numbers as keys and a list of the filenames in each topic as values. Individual topics can be inspected with `topic_docs_dict[1]`, where `1` is the desired topic number. The dictionary can be saved as a JSON file by setting `save_path` to the location where you would like to save the file. The path should include the filename, ending in `.json`.

In [None]:
# Configuration
save_path = None

topic_docs_dict = generate_topic_doc_dict(df, save_path=save_path)
print(json.dumps(topic_docs_dict, indent=2))
# topic_docs_dict[1]
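
A dictionary of this shape can be built with a pandas `groupby`. The sketch below is one possible approach rather than the code behind `generate_topic_doc_dict`; `build_topic_doc_dict` is a hypothetical name, and the dataframe is assumed to have `#topic` and `name` columns.

```python
import json

import pandas as pd


def build_topic_doc_dict(topic_docs, save_path=None):
    """Map each topic number to the list of document names in that topic."""
    grouped = topic_docs.groupby('#topic')['name'].apply(list)
    # Cast keys to plain ints so the dictionary can be serialised to JSON
    result = {int(topic): names for topic, names in grouped.items()}
    if save_path:
        with open(save_path, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=2)
    return result
```

Note that JSON object keys are always strings, so a dictionary reloaded from the saved file would be indexed with `'1'` rather than `1`.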