## Visualise our Topic Model with pyLDAvis

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a Python port of LDAvis, the R package for interactive topic model visualisation by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model by examining the relevance and salience of terms in topics. Along the way, it displays tabular data which can be used to examine the model.

pyLDAvis is not designed to use Mallet data out of the box. This notebook transforms the Mallet state file into the appropriate data formats before generating the visualisation. The code is based on Jeri Wieringa's blog post [Using pyLDAvis with Mallet](http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/) and has been slightly altered and commented.


### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2019, The WE1S Project'  
__license__   = 'GPL'  
__version__   = '1.2.1'  
__email__     = 'scott.kleinman@csun.edu'

## Imports

In [None]:
import gzip
import os
import pandas as pd
from IPython.display import display, HTML

## Configuration

**IMPORTANT:** You can select `Cell > Run All` to generate the link to your visualisation automatically. If you wish to view individual data displays, you can set their values to `True` in the configuration, or you can enable individual cells below.

In [None]:
published_site_folder_name = os.path.basename(os.getcwd())
# print('Project name:', published_site_folder_name)

data_dir               = 'caches/model'
topic_state_file       = 'topic-state.gz'
pyldavis_script_path   = 'scripts/pyldavis/pyldavis.py'

# Data Display Configuration
display_hyperparameters = False
display_topic_state_format = False
display_document_lengths = False
display_term_frequencies = False
display_word_topic_assignments = False
display_doc_topic_matrix = False
sort_by = 'type' # You can change this to 'topic' or 'token_count'
ascending = True # Change this to False to reverse sort
phi_smoothing = False
theta_smoothing = False

## Generate pyLDAvis

In [None]:
# Run pyLDAvis and save viewer to the project pyldavis folder
%run {pyldavis_script_path} 

# Set the publish_path on the server
publish_path = '/home/jovyan/view/pyldavis/' + published_site_folder_name + '/'

# Copy the pyldavis folder to the publish_path
# (-p creates parent folders and does not fail if the folder already exists)
!mkdir -p {publish_path}
!cp -rf pyldavis/index.html {publish_path}
!ls -la {publish_path}

# Generate the published link
path_parts = publish_path.split('view')
publish_url = 'http://harbor.english.ucsb.edu:10002' + path_parts[1]
output = '<h2>Your pyLDAvis is now available to view at the link below:</h2>'
output += '<h2><a href="' + publish_url + '" target="_blank">' + publish_url + '</a></h2>'
output += "<p>To download a copy, go to your project's <code>pyldavis</code> folder and download the <code>index.html</code> file.</p>"
browser_link_html = HTML(output)
display(browser_link_html)
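
The publishing step above can also be sketched in plain Python, without shell commands. This is a minimal illustration, not the project's own code: `build_publish_url` and `publish_file` are hypothetical helper names, and the demonstration copies into a temporary folder rather than the real server path.

```python
import os
import shutil
import tempfile

def build_publish_url(publish_path, base_url):
    """Return the public URL for a path located under a 'view' folder."""
    # partition() splits on the first '/view/' only, so a project folder
    # whose name happens to contain 'view' does not break the result.
    _, _, relative = publish_path.partition('/view/')
    return base_url.rstrip('/') + '/' + relative

def publish_file(source_file, publish_path):
    """Copy source_file into publish_path, creating the folder if needed."""
    os.makedirs(publish_path, exist_ok=True)  # no error if it already exists
    return shutil.copy(source_file, publish_path)

# Demonstration in a throwaway folder (hypothetical project name 'my-project')
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'index.html')
    with open(source, 'w') as f:
        f.write('<html></html>')
    target_dir = os.path.join(tmp, 'view', 'pyldavis', 'my-project')
    copied = publish_file(source, target_dir)
    copied_exists = os.path.exists(copied)

demo_url = build_publish_url('/home/jovyan/view/pyldavis/my-project/',
                             'http://harbor.english.ucsb.edu:10002')
print(demo_url)
```

Splitting on `'/view/'` rather than on the bare string `'view'` is a deliberate choice here: a project folder called, say, `review-corpus` would otherwise be cut in the wrong place.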

# Data Display Functions

The cells below provide tabular data generated in the course of building the pyLDAvis visualisation. They are only executed if you have set their display parameters to `True` in the configuration above. However, you can also uncomment the configuration lines in each cell to run the cell individually.

Each cell (apart from the hyperparameters cell) displays the first 10 rows of a table, selected with the slice `[:10]`. Change this to `[10:20]` to view the next ten rows, and so on.

## Display Hyperparameters

In [None]:
# display_hyperparameters = True # Uncomment this line if running this cell individually
if display_hyperparameters:
    print("Hyperparameters:\n")
    print("{}, {}".format(alpha, beta))
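
For orientation, the sketch below shows one way the two hyperparameters could be read from the state file. It assumes the layout described in the linked blog post, in which the second and third lines of the file hold `#alpha : ...` and `#beta : ...`. The function name `extract_params` and the sample data are illustrative, and the project's `pyldavis.py` script may do this differently.

```python
import gzip
import os
import tempfile

def extract_params(statefile):
    """Return (alpha, beta) read from the header of a gzipped state file."""
    with gzip.open(statefile, 'rt', encoding='utf8') as state:
        state.readline()                        # line 1: column names
        alpha_line = state.readline().strip()   # line 2: '#alpha : a1 a2 ...'
        beta_line = state.readline().strip()    # line 3: '#beta : b'
    # alpha has one value per topic; beta is a single value
    alpha = [float(x) for x in alpha_line.split(':', 1)[1].split()]
    beta = float(beta_line.split(':', 1)[1])
    return alpha, beta

# A tiny made-up state file, written to a temporary folder
sample_state = (
    '#doc source pos typeindex type topic\n'
    '#alpha : 0.5 0.25\n'
    '#beta : 0.01\n'
    '0 NA 0 0 apple 1\n'
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'topic-state.gz')
    with gzip.open(sample_path, 'wt', encoding='utf8') as f:
        f.write(sample_state)
    demo_alpha, demo_beta = extract_params(sample_path)

print(demo_alpha, demo_beta)
```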

## Display Topic-State Format

Show the first 10 rows of the topic-state file. Modify `df[:10]` to change the number of rows displayed.

In [None]:
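# display_topic_state_format = True # Uncomment this line if running this cell individually
if display_topic_state_format:
    df = state_to_df(os.path.join(data_dir, topic_state_file))
    # Ensure every token in the type column is a string
    df['type'] = df.type.astype(str)
    display(df[:10])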
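
To make the table above easier to read, here is a standard-library sketch of how the rows of a state file could be parsed. It assumes the six space-separated columns named in the file's first header line (`doc source pos typeindex type topic`) and source names without spaces. `read_state_rows` is a hypothetical helper and is not the `state_to_df` function used in this notebook.

```python
import gzip
import os
import tempfile

# Column names assumed from the state file's first header line
COLUMNS = ['doc', 'source', 'pos', 'typeindex', 'type', 'topic']

def read_state_rows(statefile):
    """Return one dict per token row, skipping the '#' header lines."""
    rows = []
    with gzip.open(statefile, 'rt', encoding='utf8') as state:
        for line in state:
            if line.startswith('#') or not line.strip():
                continue  # header line or blank line
            row = dict(zip(COLUMNS, line.split()))
            # Convert the numeric columns; 'source' and 'type' stay strings
            for key in ('doc', 'pos', 'typeindex', 'topic'):
                row[key] = int(row[key])
            rows.append(row)
    return rows

# A tiny made-up state file, written to a temporary folder
sample_state = (
    '#doc source pos typeindex type topic\n'
    '#alpha : 0.5 0.25\n'
    '#beta : 0.01\n'
    '0 NA 0 0 apple 1\n'
    '0 NA 1 1 pear 0\n'
    '1 NA 0 0 apple 1\n'
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'topic-state.gz')
    with gzip.open(sample_path, 'wt', encoding='utf8') as f:
        f.write(sample_state)
    demo_rows = read_state_rows(sample_path)

for row in demo_rows:
    print(row)
```

Each row records one token: the document it came from, its position, the word itself (`type`), and the topic it was assigned to.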

## Display the Document Lengths from the State File

Shows the first 10 documents. Modify `docs[:10]` to change the number of rows displayed.

In [None]:
# display_document_lengths = True # Uncomment this line if running this cell individually
if display_document_lengths:
    display(docs[:10])

## Display the Vocabulary and Term Frequencies from the State File

Shows the first 10 terms in alphabetical order. Modify `vocab[:10]` to change the number of rows displayed.

In [None]:
# display_term_frequencies = True # Uncomment this line if running this cell individually
if display_term_frequencies:
    display(vocab[:10])
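
Both the document lengths and the term frequencies are simple counts over the token rows of the state file. The sketch below illustrates the idea with made-up rows and the standard library only. The variable names are illustrative, and the notebook's own `docs` and `vocab` tables are built by the `pyldavis.py` script, possibly in a different way.

```python
from collections import Counter

# Made-up token rows as (doc_id, type, topic) tuples
demo_tokens = [
    (0, 'pear', 0), (0, 'apple', 1), (0, 'apple', 1),
    (1, 'plum', 0), (1, 'pear', 0),
]

# Document length = number of tokens in each document
doc_counts = Counter(doc_id for doc_id, _, _ in demo_tokens)
demo_doc_lengths = sorted(doc_counts.items())

# Term frequency = number of times each word occurs in the whole corpus,
# listed alphabetically to match the ordering described above
term_counts = Counter(word for _, word, _ in demo_tokens)
demo_vocab = sorted(term_counts.items())

print(demo_doc_lengths)
print(demo_vocab)
```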

## Display Word-Topic Assignments

Builds `phi`, the topic-term matrix. The state data is aggregated by topic and word to count the number of times each word was assigned to each topic, and the resulting dataframe is sorted alphabetically by word so that it matches the order of the vocabulary frame. The beta hyperparameter is used as the smoothing value. The first 10 words are shown. Modify `phi_df[:10]` to change the number of rows displayed.

In [None]:
# Uncomment these lines if running this cell individually
# display_word_topic_assignments = True
# sort_by = 'type' # You can change this to 'topic' or 'token_count'
# ascending = True # Change this to False to reverse sort
# phi_df = phi_df.sort_values(by=sort_by, ascending=ascending)
# phi_smoothing = False # Change to True to show smoothed values
if display_word_topic_assignments:
    display(phi_df[:10])
if phi_smoothing:
    print('=======================================')
    display(phi[:10])
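
The difference between the raw counts in `phi_df` and the smoothed values in `phi` can be illustrated as follows. This is a sketch under an assumption: that smoothing means adding beta to every word-topic count and then normalising each topic row so that it sums to 1, as in the linked blog post. The helper `topic_term_matrix` and the sample data are hypothetical.

```python
from collections import Counter

def topic_term_matrix(rows, vocab, n_topics, beta):
    """Return one smoothed, normalised row of word probabilities per topic."""
    # Count how often each word was assigned to each topic
    counts = Counter((topic, word) for word, topic in rows)
    matrix = []
    for topic in range(n_topics):
        # Add beta to every count, including words never seen in this topic
        smoothed = [counts[(topic, word)] + beta for word in vocab]
        total = sum(smoothed)
        matrix.append([value / total for value in smoothed])
    return matrix

# Made-up (type, topic) assignments
demo_assignments = [('apple', 1), ('pear', 0), ('apple', 1),
                    ('pear', 0), ('plum', 0)]
demo_terms = sorted({word for word, _ in demo_assignments})  # alphabetical
demo_phi = topic_term_matrix(demo_assignments, demo_terms,
                             n_topics=2, beta=0.01)

for topic, row in enumerate(demo_phi):
    print(topic, [round(p, 4) for p in row])
```

Because of the smoothing, a word that was never assigned to a topic still receives a small non-zero probability in that topic's row.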

## Display Document-Topic Matrix

Repeats the process for documents and topics to generate `theta`, the document-topic matrix. The alpha hyperparameter is used as the smoothing value. The first 10 documents are shown. Modify `theta_df[:10]` to change the number of rows displayed.

In [None]:
# display_doc_topic_matrix = True # Uncomment if running this cell individually
# theta_smoothing = False
if display_doc_topic_matrix:
    display(theta_df[:10])
if theta_smoothing:
    print('=======================================')
    display(theta[:10])
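
The smoothing step for `theta` can be written as a formula. This is a sketch of one reading, assuming that smoothing adds the alpha value for each topic to the raw counts before each document row is normalised to sum to 1; the project's `pyldavis.py` script may differ in detail. The notation is introduced here for illustration: $n_{d,k}$ is the number of tokens in document $d$ assigned to topic $k$, $\alpha_k$ is the alpha value for topic $k$, and $K$ is the number of topics.

$$\theta_{d,k} = \frac{n_{d,k} + \alpha_k}{\sum_{j=1}^{K} \left( n_{d,j} + \alpha_j \right)}$$

For example, with two topics, counts $n_{d,1} = 3$ and $n_{d,2} = 1$, and $\alpha_1 = \alpha_2 = 0.5$, the row becomes $\theta_{d,1} = 3.5 / 5 = 0.7$ and $\theta_{d,2} = 1.5 / 5 = 0.3$.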

In [None]:
display(browser_link_html)

If you wish to use any of the Data Display Functions, run the cells individually, starting [here](#Data-Display-Functions).