# Create PyLDAvis

<a href="https://github.com/bmabey/pyLDAvis" target="_blank">pyLDAvis</a> is a port of the R LDAvis package for interactive topic model visualization by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model by examining the relevance and salience of terms in topics. Once a pyLDAvis object has been generated, many of its properties can be inspected in tabular form as a way to examine the model. However, the main output is a visualization of the relevance and salience of key terms to the topics.
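To make the idea of relevance more concrete, here is a minimal sketch of the relevance measure as defined by Sievert and Shirley for LDAvis: a weighted sum of a term's log probability within a topic and its log lift (its in-topic probability divided by its overall probability). The function name and the numbers are made up for illustration; pyLDAvis computes these values internally.

```python
import math

def relevance(phi_tw, p_w, lam):
    """Relevance of a term to a topic.

    phi_tw: probability of the term within the topic
    p_w:    marginal probability of the term across the corpus
    lam:    weight between 0 and 1 (the slider in the visualization)
    """
    return lam * math.log(phi_tw) + (1 - lam) * math.log(phi_tw / p_w)

# Made-up numbers: a term that is frequent in the topic but rare overall
print(relevance(0.05, 0.001, lam=1.0))  # ranks by in-topic probability only
print(relevance(0.05, 0.001, lam=0.0))  # ranks by lift only
```

Lower values of `lam` favor terms that are distinctive to the topic; higher values favor terms that are simply frequent in it.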

pyLDAvis is not designed to use MALLET data out of the box. This notebook transforms the MALLET state file into the appropriate data formats before generating the visualization. The code is based on Jeri Wieringa's blog post <a href="http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/" target="_blank">Using pyLDAvis with Mallet</a> and has been slightly altered and commented.
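As a rough illustration of what that transformation starts from, the sketch below reads a gzipped MALLET state file. It assumes the usual layout: a column header line, an `#alpha` line, a `#beta` line, and then one whitespace-separated row per token. The `read_state` function and the sample data are hypothetical, and the sketch assumes source names contain no spaces; the notebook's own `scripts/PyLDAvis.py` may work differently.

```python
import gzip
import os
import tempfile
from collections import Counter

def read_state(path):
    """Return (alpha, beta, counts) from a gzipped MALLET state file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        f.readline()  # column header: #doc source pos typeindex type topic
        alpha = [float(x) for x in f.readline().split(':')[1].split()]
        beta = float(f.readline().split(':')[1])
        counts = Counter()
        for line in f:
            doc, source, pos, typeindex, term, topic = line.split()
            # Count how often each term is assigned to each topic
            counts[(int(topic), term)] += 1
    return alpha, beta, counts

# Tiny hand-made state file, for illustration only
sample = (
    "#doc source pos typeindex type topic\n"
    "#alpha : 0.5 0.5\n"
    "#beta : 0.01\n"
    "0 doc0.txt 0 0 whale 0\n"
    "0 doc0.txt 1 1 ship 0\n"
    "1 doc1.txt 0 0 whale 1\n"
)
path = os.path.join(tempfile.mkdtemp(), 'topic-state.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write(sample)

alpha, beta, counts = read_state(path)
print(alpha, beta, dict(counts))
```

The topic-term counts gathered this way are the raw material for the topic-term and document-topic distributions that pyLDAvis expects.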

### INFO

__author__    = 'Scott Kleinman, Lindsay Thomas'  
__copyright__ = 'copyright 2019-, The WE1S Project'  
__license__   = 'GPL'  
__version__   = '2.5'  
__email__     = 'scott.kleinman@csun.edu'

## Settings

In most cases, you can simply run this cell without modifying any of the settings.

In [None]:
# Python imports
import gzip
import json
import os
from IPython.display import display, HTML
from pathlib import Path

# Get module paths
current_dir                 = %pwd
project_dir                 = str(Path(current_dir).parent.parent)
model_dir                   = project_dir + '/project_data/models'
json_dir                    = project_dir + '/project_data/json'
config_path                 = project_dir + '/config/config.py'
output_path                 = current_dir
output_file                 = 'index.html'

# Import scripts
%run scripts/PyLDAvis.py
%run {config_path}

display(HTML('<p style="color: green;">Setup complete.</p>'))

## Configuration

Select models to create pyLDAvis visualizations for. **Please run the next cell regardless of whether you change anything.**

By default, this notebook is set to create a pyLDAvis for all of the models in your project `models` directory. If you would like to select only certain models to produce a pyLDAvis for, make those selections in the next cell (see next paragraph). Otherwise leave the value for `selection` as `All`, which is the default. 

**To produce pyLDAvis for a selection of the models you created, but not all:** Navigate to the `your_project_name/project_data/models` directory in your project. Note the name of each subdirectory in that folder. Each subdirectory should be called `topicsn1`, where `n1` is the number of topics you chose to model. You should see a subdirectory for each model you produced. To choose which subdirectories you would like to produce visualizations for, change the value of `selection` in the cell below to a list of subdirectory names. For example, if you wanted to produce visualizations for only the 50- and 75-topic models you created, change the value of `selection` below to this:

Example:

`selection = ['topics50', 'topics75']`

Please follow this format exactly.
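The sketch below shows one way a `selection` value could be matched against the subdirectories of a `models` folder. It is only an illustration under stated assumptions: `resolve_selection` is a hypothetical helper, not the notebook's `get_models()` function, and it accepts either `'All'` or a list of folder names.

```python
import tempfile
from pathlib import Path

def resolve_selection(model_dir, selection='All'):
    """Return the sorted model subdirectory names matching the selection."""
    available = sorted(p.name for p in Path(model_dir).iterdir() if p.is_dir())
    if selection == 'All':
        return available
    # Report misspelled or missing folder names instead of skipping them silently
    missing = [name for name in selection if name not in available]
    if missing:
        raise ValueError(f'No such model folder(s): {missing}')
    return [name for name in available if name in selection]

# Temporary stand-in for a project's models directory
demo_dir = Path(tempfile.mkdtemp())
for name in ('topics25', 'topics50', 'topics75'):
    (demo_dir / name).mkdir()

print(resolve_selection(demo_dir))
print(resolve_selection(demo_dir, ['topics50', 'topics75']))
```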

In [None]:
# Configuration
selection = 'All' # Or e.g. ['topics50', 'topics75']

Get names of model subdirectories to visualize and their state files.

In [None]:
models = get_models(model_dir, selection)

### Add Metadata and Labels for the User Interface (Optional)

The pyLDAvis plot can be customized to display metadata information in your project's json files. Skip this cell if you just want to generate a basic pyLDAvis plot.

By default, pyLDAvis displays circles representing the topics in the visualization's left panel and tokens (words) in the right panel. Setting the `metadata` property below to another field in your project's json files will cause pyLDAvis to display the contents of that field in the right panel. For instance, if you had a `publication` field, the title of the publication would be displayed.

To do this, identify the index number for each model in the list above, and add the necessary information using the following lines in the next cell.

```python
models[0]['metadata'] = 'publication'
models[0]['ui_labels'] = [
                'Intertopic Distance Map (via multidimensional scaling)',
                'topic',
                'publication',
                'publications',
                'tokens'
            ]
```
Additional models would be `models[1]`, `models[2]`, etc.

The `ui_labels` must be given in the following order:

1. The title of the multidimensional scaling graph
2. The type of unit represented by the graph circles
3. The singular form of the unit represented in the bar graphs on the right
4. The plural form of the unit represented in the bar graphs on the right
5. The unit represented by the percentage in the Relevance display

The example above indicates that the model will represent a map of intertopic distances in which each topic will show the distribution of publications, as represented by the percentage of topic tokens in the publication.

**If you are unsure what to put, you do not have to assign `ui_labels`. A visualization will still be generated but may not have appropriate labels for the type of metadata you are using.**
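Because the order of `ui_labels` matters, it may help to check a list before assigning it. The following is a minimal sketch; the role names and the `label_roles` helper are hypothetical and are not part of the notebook's scripts.

```python
# Hypothetical names for the five positions, in the required order
UI_LABEL_ROLES = (
    'graph_title',
    'circle_unit',
    'bar_unit_singular',
    'bar_unit_plural',
    'relevance_unit',
)

def label_roles(ui_labels):
    """Pair each label with its role, or fail if the list has the wrong length."""
    if len(ui_labels) != len(UI_LABEL_ROLES):
        raise ValueError(
            f'Expected {len(UI_LABEL_ROLES)} labels, got {len(ui_labels)}'
        )
    return {role: ui_labels[i] for i, role in enumerate(UI_LABEL_ROLES)}

labels = [
    'Intertopic Distance Map (via multidimensional scaling)',
    'topic',
    'publication',
    'publications',
    'tokens',
]
for role, label in label_roles(labels).items():
    print(f'{role}: {label}')
```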

In [None]:
# Uncomment and modify these lines to run the cell

# models[0]['metadata'] = 'publication'
# models[0]['ui_labels'] = [
#                 'Intertopic Distance Map (via multidimensional scaling)',
#                 'topic',
#                 'publication',
#                 'publications',
#                 'tokens'
#             ]

# models[1]['metadata'] = 'publication'
# models[1]['ui_labels'] = [
#                 'Intertopic Distance Map (via multidimensional scaling)',
#                 'topic',
#                 'publication',
#                 'publications',
#                 'tokens'
#             ]

display(HTML('<p style="color: green;">Here is a summary of the information you will be using to generate your visualization(s).</p>'))
print(json.dumps(models, indent=2))

## Generate the Visualizations

Generate visualizations and links to all pyLDAvis visualizations in your project folder. See the next cell if you wish to make them public.

Since this cell can take some time to run (hours for many thousands of documents and multiple models), the output is captured instead of shown as the script is processing. Run `output.show()` in the following cell when it is finished to check that everything ran as expected.

**Note:** You may receive the following warning:

```
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
```

It is OK to ignore this warning.
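If you would rather not see the warning at all, Python's standard `warnings` module can silence it for a single call. This is only a sketch: `run_quietly` and `noisy` are hypothetical stand-ins, and the notebook itself does not suppress the warning.

```python
import warnings

def run_quietly(func, *args, **kwargs):
    """Call func while suppressing FutureWarning messages only."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=FutureWarning)
        return func(*args, **kwargs)

def noisy():
    # Stand-in for a function that triggers the pandas warning
    warnings.warn(
        'Sorting because non-concatenation axis is not aligned.',
        FutureWarning,
    )
    return 'done'

print(run_quietly(noisy))
```

Other warning categories are still shown, and the filter is removed again once the call returns.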

In [None]:
%%capture output

msg = """<p><strong>Note:</strong> You may receive the following warning:</p><p><code>FutureWarning: Sorting because non-concatenation axis is not aligned. 
A future version of pandas will change to not sort by default.</code></p><p>It is OK to ignore this warning.</p>"""
display(HTML(msg))
%run scripts/PyLDAvis.py
result, vis = generate(model_dir, models, output_path, output_file, json_dir)
display_links(project_dir, models, WRITE_DIR, PORT)

In [None]:
output.show()

## Create Zipped Copies of your Visualizations for Export (Optional)

By default, visualizations for all available models will be zipped. If you wish to zip only one model, change the `models` setting to indicate the name of the model folder (e.g. `'topics25'`). If you wish to zip more than one model, but not all, provide a list in square brackets (e.g. `['topics25', 'topics50']`).
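For readers curious about what zipping a model folder involves, here is a minimal sketch using the standard library's `shutil.make_archive`. The `zip_model_folders` helper and the folder layout are assumptions for illustration; the notebook's `scripts/zip.py` may be implemented differently.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_model_folders(parent_dir, models='All'):
    """Create one zip archive per selected model folder and return their paths."""
    parent = Path(parent_dir)
    if models == 'All':
        names = sorted(p.name for p in parent.iterdir() if p.is_dir())
    elif isinstance(models, str):
        names = [models]  # a single folder name such as 'topics25'
    else:
        names = list(models)
    archives = []
    for name in names:
        # Writes parent/name.zip containing the folder "name" and its files
        archive = shutil.make_archive(
            str(parent / name), 'zip', root_dir=str(parent), base_dir=name
        )
        archives.append(archive)
    return archives

# Temporary stand-in for a folder of generated visualizations
demo = Path(tempfile.mkdtemp())
for name in ('topics25', 'topics50'):
    (demo / name).mkdir()
    (demo / name / 'index.html').write_text('<html></html>')

archives = zip_model_folders(demo, ['topics25'])
print(archives)
print(zipfile.ZipFile(archives[0]).namelist())
```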

In [None]:
# Configuration
models = 'All' # You can also select models with values like 'topics25' or ['topics25', 'topics50']

# Zip the models
%run scripts/zip.py
zip(models)

## Access pyLDAvis Data Attributes (Optional)

pyLDAvis generates a number of useful variables that can help you understand the data underlying the visualization or that can be used in other applications. These variables can be accessed via the `vis` object created in the **Generate the Visualizations** section. You can view the data by calling `show_attribute()` with the appropriate attribute configured. Possible attributes are `alpha`, `beta`, `doc_lengths`, `hyperparameters`, `model_state`, `phi`, `phi_df`, `theta`, `theta_df`. Further information about these attributes can be found in the discussion in Jeri Wieringa's blog post <a href="http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/" target="_blank">Using pyLDAvis with Mallet</a>.

You can restrict the number of lines shown by modifying the `start` and `end` settings. If you wish to save the result to a file, set the `save_path`. Tabular data should be saved to a csv file; everything else can be plain text.

In [None]:
# Configuration
attribute  = 'theta'
start      = 0
end        = None
save_path  = None


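import pandas as pd

def show_attribute(vis, attribute, start=None, end=None, save_path=None):
    """Show a pyLDAvis attribute by name and optionally save it to a file."""
    result = getattr(vis, attribute)
    # Restrict the output to the requested range of rows or items
    result = result[start:end]
    if save_path is not None:
        if isinstance(result, pd.DataFrame):
            # Tabular data is saved as csv
            result.to_csv(save_path)
        else:
            # Everything else is saved as plain text
            with open(save_path, 'w') as f:
                f.write(str(result))
    display(result)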

# Show the pyLDAvis attribute
show_attribute(vis, attribute, start=start, end=end, save_path=save_path)