# 5. CREATE PYLDAVIS BROWSER

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a Python port of LDAvis, the R package for interactive topic model visualization by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model by examining the relevance and salience of terms in topics. Along the way, it displays tabular data which can be used to examine the model.

pyLDAvis is not designed to use Mallet data out of the box. This notebook transforms the Mallet state file into the data formats pyLDAvis expects before generating the visualization. The code is adapted, with minor changes and added comments, from Jeri Wieringa's blog post [Using pyLDAvis with Mallet](http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/).
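The transformation described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the code in `pyldavis_scripts/PyLDAvis.py`: the function names `read_state` and `build_inputs` are hypothetical, the sample data is invented, and the sketch assumes the state file begins with a column header followed by `#alpha` and `#beta` lines, with one whitespace-separated token per row after that. The keys of the returned dictionary mirror the five required arguments of `pyLDAvis.prepare`.

```python
import gzip
import os
import tempfile
from collections import Counter


def read_state(path):
    """Read hyperparameters and (doc, term, topic) rows from a gzipped state file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        handle.readline()  # assumed header: #doc source pos typeindex type topic
        alpha = [float(x) for x in handle.readline().split(":")[1].split()]
        beta = float(handle.readline().split(":")[1])
        rows = []
        for line in handle:
            if not line.strip():
                continue
            # Assumes no field contains whitespace.
            doc, _source, _pos, _typeindex, term, topic = line.split()
            rows.append((int(doc), term, int(topic)))
    return alpha, beta, rows


def build_inputs(alpha, beta, rows):
    """Turn token rows into smoothed, row-normalized matrices and counts."""
    vocab = sorted({term for _, term, _ in rows})
    docs = sorted({doc for doc, _, _ in rows})
    doc_counts = Counter(doc for doc, _, _ in rows)
    term_counts = Counter(term for _, term, _ in rows)
    topic_term = Counter((topic, term) for _, term, topic in rows)
    doc_topic = Counter((doc, topic) for doc, _, topic in rows)
    n_topics = len(alpha)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    return {
        # Topic-term counts are smoothed with beta before normalizing.
        "topic_term_dists": [
            normalize([topic_term[(k, t)] + beta for t in vocab])
            for k in range(n_topics)
        ],
        # Document-topic counts are smoothed with the per-topic alpha.
        "doc_topic_dists": [
            normalize([doc_topic[(d, k)] + alpha[k] for k in range(n_topics)])
            for d in docs
        ],
        "doc_lengths": [doc_counts[d] for d in docs],
        "vocab": vocab,
        "term_frequency": [term_counts[t] for t in vocab],
    }


# Invented two-topic, two-document sample used only for illustration.
sample = "\n".join([
    "#doc source pos typeindex type topic",
    "#alpha : 0.5 0.5",
    "#beta : 0.01",
    "0 NA 0 0 whale 0",
    "0 NA 1 1 ship 0",
    "0 NA 2 0 whale 1",
    "1 NA 0 2 garden 1",
    "1 NA 1 2 garden 1",
]) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    state_path = os.path.join(tmp, "topic-state.gz")
    with gzip.open(state_path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    inputs = build_inputs(*read_state(state_path))

print(inputs["vocab"])
print(inputs["doc_lengths"])
```

With pyLDAvis installed, a dictionary shaped like `inputs` could be passed on as `pyLDAvis.prepare(**inputs)`; for real corpora, a pandas-based approach like the one in the blog post is likely to be more practical than plain lists.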

### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2019, The WE1S Project'  
__license__   = 'GPL'  
__version__   = '2.0'  
__email__     = 'scott.kleinman@csun.edu'

## Settings

In [140]:
import gzip
import json
import os
from IPython.display import display, HTML
from pathlib import Path

current_dir = %pwd
current_pathobj = Path(current_dir)
project_dir = str(current_pathobj.parent.parent)
print(project_dir)
published_site_folder_name = os.path.basename(project_dir)

data_dir = project_dir + '/project_data'
model_dir = data_dir + '/models'
pyldavis_script_path   = current_dir + '/' + 'pyldavis_scripts/PyLDAvis.py'

output_path = current_dir
output_file = 'index.html'
json_dir = data_dir + '/json'

%run {pyldavis_script_path}

/home/jovyan/write/templates/multiple_topics_template


## Configuration

Select the models for which to create pyLDAvis visualizations. **Please run the next cell regardless of whether you change anything.**

By default, this notebook creates a pyLDAvis visualization for every model you produced in Notebook 2 (`02_model_topics.ipynb`). To visualize only certain models, make those selections in the next cell (see next paragraph). Otherwise, leave `selection` set to `None`, which is the default.

**To produce pyLDAvis visualizations for some, but not all, of the models you created:** Navigate to the `your_project_name/project_data/models` directory in your project. You should see one subdirectory for each model you produced, named `topicsn1`, where `n1` is the number of topics you chose to model. Note the names of the subdirectories you want to visualize, then change the value of `selection` in the cell below to a list of those names. For example, to produce visualizations for only the 50- and 75-topic models, set `selection` as follows:

`selection = ['topics50','topics75']`

Please follow this format exactly.
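To check which names are valid before filling in `selection`, something like the following sketch may help. The helper `find_model_dirs` is hypothetical and is not the notebook's `get_models` function; it only assumes that model subdirectories are named `topics` followed by a number, as described above. The demonstration uses a temporary directory in place of your real `models` folder.

```python
import os
import re
import tempfile


def find_model_dirs(model_dir, selection=None):
    """List `topicsN` subdirectories, optionally restricted to `selection`."""
    pattern = re.compile(r"^topics(\d+)$")
    found = [
        name for name in os.listdir(model_dir)
        if pattern.match(name) and os.path.isdir(os.path.join(model_dir, name))
    ]
    # Sort numerically so that topics10 comes before topics50.
    found.sort(key=lambda name: int(pattern.match(name).group(1)))
    if selection is None:
        return found
    missing = [name for name in selection if name not in found]
    if missing:
        raise ValueError(f"No such model subdirectories: {missing}")
    return [name for name in found if name in selection]


# Stand-in for project_data/models, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("topics75", "topics10", "topics50", "notes"):
        os.mkdir(os.path.join(tmp, name))
    all_models = find_model_dirs(tmp)
    chosen = find_model_dirs(tmp, ["topics50", "topics75"])

print(all_models)
print(chosen)
```

Raising an error for a misspelled name is a design choice made for this sketch; the notebook's own script may handle that case differently.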

In [141]:
selection = ['topics10']

Get names of model subdirectories to visualize and their state files.

In [142]:
models = get_models(model_dir, selection)

# Display all model sub-directories with index numbers
for index, item in enumerate(models):
    print(str(index) + ': ' + item['model'])

0: topics10


### Add metadata and labels for the user interface (optional)

To do this, identify the index number of each model in the list above, then add the information in the next cell using lines like the following.

```python
models[0]['metadata'] = 'pub'
models[0]['ui_labels'] = [
                'Intertopic Distance Map (via multidimensional scaling)',
                'topic',
                'publication',
                'publications',
                'tokens'
            ]
```
Additional models would be `models[1]`, `models[2]`, etc.

The `ui_labels` must be given in the following order:

1. The title of the multidimensional scaling graph.
2. The type of unit represented by the graph circles.
3. The singular form of the unit represented in the bar graph on the right.
4. The plural form of the unit represented in the bar graph on the right.
5. The unit represented by the percentage in the Relevance display.

The example above indicates that the model will represent a map of intertopic distances in which each topic will show the distribution of publications, as represented by the percentage of topic tokens in the publication.

**If you are unsure what to put, you do not have to assign `ui_labels`. A visualization will still be generated but may not have appropriate labels for the type of metadata you are using.**
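Because the order of `ui_labels` matters, a quick check like the sketch below may catch a missing or extra entry before you generate the visualization. The role names in `UI_LABEL_ROLES` and the helper `describe_ui_labels` are hypothetical conveniences for this illustration and are not part of the notebook's script.

```python
# Hypothetical role names, listed in the order the notebook requires.
UI_LABEL_ROLES = (
    "map_title",          # 1. title of the multidimensional scaling graph
    "circle_unit",        # 2. unit represented by the graph circles
    "bar_unit_singular",  # 3. singular unit in the bar graph
    "bar_unit_plural",    # 4. plural unit in the bar graph
    "relevance_unit",     # 5. unit behind the Relevance percentage
)


def describe_ui_labels(ui_labels):
    """Pair each label with its role, failing if the count is wrong."""
    if len(ui_labels) != len(UI_LABEL_ROLES):
        raise ValueError(
            f"Expected {len(UI_LABEL_ROLES)} labels, got {len(ui_labels)}"
        )
    return dict(zip(UI_LABEL_ROLES, ui_labels))


labels = describe_ui_labels([
    "Intertopic Distance Map (via multidimensional scaling)",
    "topic",
    "publication",
    "publications",
    "tokens",
])

for role, label in labels.items():
    print(f"{role}: {label}")
```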

In [143]:
# Modify these lines as needed; uncomment the models[1] block below if you have a second model

models[0]['metadata'] = 'pub'
models[0]['ui_labels'] = [
                'Intertopic Distance Map (AKA the 10 Circles of Hell)',
                'topic',
                'publication',
                'publications',
                'tokens'
            ]

# models[1]['metadata'] = 'pub'
# models[1]['ui_labels'] = [
#                 'Intertopic Distance Map (via multidimensional scaling)',
#                 'topic',
#                 'publication',
#                 'publications',
#                 'tokens'
#             ]

display(HTML('<h4>Here is a summary of the information you will be using to generate your visualization(s).</h4>'))
print(json.dumps(models, indent=2))

[
  {
    "model": "topics10",
    "state_file": "/home/jovyan/write/templates/multiple_topics_template/project_data/models/topics10/topic-state10.gz",
    "metadata": "pub",
    "ui_labels": [
      "Intertopic Distance Map (AKA the 10 Circles of Hell)",
      "topic",
      "publication",
      "publications",
      "tokens"
    ]
  }
]


## Generate the Visualizations

In [144]:
%run {pyldavis_script_path}
generate(model_dir, models, output_path, output_file, json_dir)

Creating metadata state...
Processing /home/jovyan/write/templates/multiple_topics_template/project_data/models/topics10/topic-state-pub.gz...
    Getting hyperparameters...
    Creating dataframe...
    Getting document lengths...
    Getting term frequencies...
    Getting topic-word-assignments...
    Getting topic-term-matrix...
    Saving... to index-pub.html
call customise labels
custom labels called
Done!


