# Visualise our Topic Model with pyLDAvis

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a Python port of LDAvis, the R package for interactive topic model visualisation by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model by examining the relevance and salience of terms in topics. Along the way, this notebook displays tabular data which can be used to examine the model.

pyLDAvis is not designed to use MALLET data out of the box. This notebook transforms the MALLET state file into the data formats pyLDAvis requires before generating the visualisation. The code is based on Jeri Wieringa's blog post [Using pyLDAvis with Mallet](http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/) and has been lightly altered and commented.


### INFO

__author__    = 'Scott Kleinman'  
__copyright__ = 'copyright 2018, The WE1S Project'  
__license__   = 'GPL'  
__version__   = '1.0'  
__email__     = 'scott.kleinman@csun.edu'

## Imports

In [None]:
import gzip
import os
import pandas as pd
from IPython.display import display, HTML

## Configuration

**IMPORTANT:** You can select `Cell > Run All` to generate the link to your visualisation automatically. If you wish to run the individual cells and see the tabular data, set `run_automatically` to `False`.

In [None]:
data_dir               = 'caches/model'
topic_state_file       = 'topic-state.gz'
pyldavis_script_path   = 'pyldavis.py' # Change to '../scripts/pyldavis.py'
run_automatically      = True

# Only for local use -- Do not modify on the server
save                   = True
new_window             = False

In [None]:
if run_automatically:
    %run {pyldavis_script_path}
    # project_reldir is assumed to be defined by the script run above
    browser_link_html = HTML('<h2><a href="http://mirrormask.english.ucsb.edu:10001' + project_reldir + '/browser/pyldavis/pyLDAvis.html" target="_blank">View pyLDAvis</a></h2>')
    display(browser_link_html)

In [None]:
if run_automatically:
    # Build the link outside the template so that project_reldir is interpolated
    vis_url = 'http://mirrormask.english.ucsb.edu:10001' + project_reldir + '/browser/pyldavis/pyLDAvis.html'
    html = """
    <h2 style="color: red;">Stopping Processes...</h2>
    <p style="color: red;">If you used <code>Run All</code>, the error below prevents 
    the rest of the cells from running. 
    Click <a href="{}" target="_blank">here</a> to view the pyLDAvis you have generated.</p>""".format(vis_url)
    display(HTML(html))
    
    # Do not proceed if the user selects Run All
    assert False

## State File Functions

In [None]:
def extract_params(statefile):
    """Extract the alpha and beta values from the statefile.

    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
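        tuple: alpha (list of str; the first element is an empty string), beta (float)
    """
    # Read only the first three lines instead of loading the whole file
    with gzip.open(statefile, 'rt', encoding='utf8') as state:
        next(state)  # skip the column header line
        params = [next(state).strip() for _ in range(2)]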
    return (list(params[0].split(":")[1].split(" ")), float(params[1].split(":")[1]))


def state_to_df(statefile):
    """Transform state file into pandas dataframe.
    The MALLET statefile is space-separated. The first row holds the column names,
    and the second and third rows hold the alpha and beta hyperparameters.
    
    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
        dataframe: topic assignment for each token in each document of the model
    """
    return pd.read_csv(statefile,
                       compression='gzip',
                       sep=' ',
                       skiprows=[1,2]
                       )
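
For orientation, the sketch below writes a tiny synthetic state file in the layout these functions expect (a header row, an `#alpha` row, a `#beta` row, then one space-separated row per token) and reads the two hyperparameter rows back. The file contents, the temporary path, and the helper name `read_hyperparameters` are invented for illustration; a real file comes from MALLET's `--output-state` option.

```python
import gzip
import os
import tempfile

# Synthetic example data: two topics, two documents, five tokens.
SAMPLE_STATE = (
    "#doc source pos typeindex type topic\n"
    "#alpha : 0.5 0.25 \n"
    "#beta : 0.01\n"
    "0 doc0.txt 0 0 whale 1\n"
    "0 doc0.txt 1 1 ship 1\n"
    "0 doc0.txt 2 0 whale 0\n"
    "1 doc1.txt 0 2 garden 0\n"
    "1 doc1.txt 1 1 ship 1\n"
)


def read_hyperparameters(path):
    """Hypothetical helper: return (alpha, beta) from lines 2 and 3."""
    with gzip.open(path, "rt", encoding="utf8") as state:
        next(state)  # skip the column header
        alpha_line = next(state)
        beta_line = next(state)
    # str.split() with no argument discards the empty strings that
    # leading or trailing spaces would otherwise produce.
    alpha = [float(x) for x in alpha_line.split(":")[1].split()]
    beta = float(beta_line.split(":")[1])
    return alpha, beta


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "topic-state.gz")
    with gzip.open(sample_path, "wt", encoding="utf8") as handle:
        handle.write(SAMPLE_STATE)
    sample_alpha, sample_beta = read_hyperparameters(sample_path)

print(sample_alpha, sample_beta)
```

Unlike `extract_params`, this variant converts the alpha values to floats directly, so no empty first element needs to be sliced off afterwards.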

## Extract Hyperparameters

In [None]:
params = extract_params(os.path.join(data_dir, topic_state_file))
alpha = [float(x) for x in params[0][1:]]
beta = params[1]
print("Hyperparameters:\n")
print("{}, {}".format(alpha, beta))

## Show Topic-State Format

Show the first 10 rows of the topic-state file. Modify `df[:10]` to change the number of rows displayed.

In [None]:
df = state_to_df(os.path.join(data_dir, topic_state_file))
df['type'] = df.type.astype(str)

df[:10]

## Get the Document Lengths from the State File

Counts the tokens in each document to get its length. The first 10 documents are shown. Modify `docs[:10]` to change the number of rows displayed.

In [None]:
docs = df.groupby('#doc')['type'].count().reset_index(name ='doc_length')

docs[:10]

## Get the Vocabulary and Term Frequencies from the State File

Shows the first 10 terms in alphabetical order. Modify `vocab[:10]` to change the number of rows displayed.

In [None]:
# Get vocab and term frequencies from statefile
vocab = df['type'].value_counts().reset_index()
vocab.columns = ['type', 'term_freq']
vocab = vocab.sort_values(by='type', ascending=True)

vocab[:10]
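
As a sanity check on the two aggregations above, the same numbers can be recomputed with the standard library. This sketch uses a made-up list of `(doc, type)` pairs rather than the real state file, and `collections.Counter` in place of pandas: document lengths are token counts per document, and term frequencies are token counts per word type.

```python
from collections import Counter

# Made-up tokens standing in for the '#doc' and 'type' columns of df.
toy_tokens = [
    (0, "whale"), (0, "ship"), (0, "whale"),
    (1, "garden"), (1, "ship"),
]

# Document length = number of token rows per document.
toy_doc_lengths = Counter(doc for doc, _ in toy_tokens)

# Term frequency = number of token rows per word type, listed in
# alphabetical order to mirror vocab.sort_values(by='type').
toy_term_freq = sorted(Counter(word for _, word in toy_tokens).items())

print(dict(toy_doc_lengths))
print(toy_term_freq)
```

The document lengths sum to the same total as the term frequencies, because both count every token row exactly once.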

## Create a Topic-Term Matrix from the State File

In [None]:
# Topic-term matrix from state file
# https://ldavis.cpsievert.me/reviews/reviews.html

import sklearn.preprocessing

def pivot_and_smooth(df, smooth_value, rows_variable, cols_variable, values_variable):
    """
    Turns the pandas dataframe into a data matrix.
    Args:
        df (dataframe): aggregated dataframe 
        smooth_value (float or list): value(s) to add to the matrix to account for the priors;
            a list must contain one value per column
        rows_variable (str): name of dataframe column to use as the rows in the matrix
        cols_variable (str): name of dataframe column to use as the columns in the matrix
        values_variable (str): name of the dataframe column to use as the values in the matrix
    Returns:
        dataframe: pandas matrix that has been normalized on the rows.
    """
    matrix = df.pivot(index=rows_variable, columns=cols_variable, values=values_variable).fillna(value=0)
    matrix = matrix.values + smooth_value
    
    normed = sklearn.preprocessing.normalize(matrix, norm='l1', axis=1)
    
    return pd.DataFrame(normed)
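
To make the smoothing step concrete, here is a small worked example with invented counts. It uses plain NumPy instead of `sklearn.preprocessing.normalize`; for non-negative values, L1 normalisation of a row is simply division by the row sum. The variable names and the smoothing value are illustrative only.

```python
import numpy as np

# Invented topic-by-word counts: 2 topics (rows), 3 word types (columns).
toy_counts = np.array([[3.0, 0.0, 1.0],
                       [0.0, 2.0, 2.0]])
toy_beta = 0.5  # illustrative smoothing value, not a recommended setting

# Add the prior to every cell so that no word has zero probability
# in any topic.
toy_smoothed = toy_counts + toy_beta

# Divide each row by its sum so that every topic becomes a
# probability distribution over the vocabulary.
toy_phi = toy_smoothed / toy_smoothed.sum(axis=1, keepdims=True)

print(toy_phi.round(3))
```

After this step every row sums to 1 and every cell is strictly positive, which is the form pyLDAvis expects for its distribution matrices.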

## Get Word-Topic Assignments

To build `phi`, the topic-term matrix, this cell aggregates by topic and word, counting the number of times each word was assigned to each topic. The resulting dataframe is then sorted alphabetically by word so that it matches the order of the vocabulary frame. The beta hyperparameter is used as the smoothing value. The first 10 rows are shown. Modify `phi_df[:10]` to change the number of rows displayed.

In [None]:
phi_df = df.groupby(['topic', 'type'])['type'].count().reset_index(name ='token_count')
phi_df = phi_df.sort_values(by='type', ascending=True)

phi_df[:10]

In [None]:
phi = pivot_and_smooth(phi_df, beta, 'topic', 'type', 'token_count')

# phi[:10]

## Get Document-Topic Matrix

Repeat the process, this time aggregating by document and topic, to generate `theta`, the document-topic matrix. The alpha hyperparameter is used as the smoothing value. The first 10 rows are shown. Modify `theta_df[:10]` to change the number of rows displayed.

In [None]:
theta_df = df.groupby(['#doc', 'topic'])['topic'].count().reset_index(name ='topic_count')

theta_df[:10]

In [None]:
theta = pivot_and_smooth(theta_df, alpha, '#doc', 'topic', 'topic_count')

# theta[:10]
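
Note that `alpha` is a list with one value per topic, whereas `beta` is a single number. When a list is added to the count matrix, NumPy broadcasts it across the rows, so each topic column receives its own prior. A minimal illustration with invented numbers:

```python
import numpy as np

# Invented document-by-topic counts: 2 documents (rows), 3 topics (columns).
toy_doc_topic = np.array([[4.0, 0.0, 1.0],
                          [0.0, 2.0, 0.0]])
toy_alpha = [0.5, 0.1, 0.2]  # one illustrative prior per topic

# The list is treated as an array of shape (3,) and added to every row.
toy_smoothed_theta = toy_doc_topic + toy_alpha

# Normalise each document row so that it sums to 1.
toy_theta = toy_smoothed_theta / toy_smoothed_theta.sum(axis=1, keepdims=True)

print(toy_smoothed_theta)
```

This only works when the number of alpha values equals the number of topic columns. If a topic were never assigned to any token, the pivot would have one column fewer and the addition would raise a broadcasting error.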

## Generate the Visualisation

This cell saves the visualisation file to your project's `browser/pyldavis` folder.

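If you are using this notebook on your local computer, you can change the `save` and `new_window` options in the configuration section above. With `save = True` and `new_window = False` (the defaults), the file is saved and a link to it is displayed. With both set to `False`, the visualisation is displayed inside the notebook. With `new_window = True`, it opens in a new browser window.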

In [None]:
import pyLDAvis

data = {'topic_term_dists': phi, 
        'doc_topic_dists': theta,
        'doc_lengths': list(docs['doc_length']),
        'vocab': list(vocab['type']),
        'term_frequency': list(vocab['term_freq'])
       }

display(HTML('<h3 style="color: red;">The warning below is expected. The link to your visualisation will appear below the warning.</h3>'))

vis_data = pyLDAvis.prepare(**data)

# Save the visualisation HTML
# Note: output_dir and output_file are not set in the configuration cell above;
# define them before running this cell.
if save and not new_window:
    pwd = %pwd
    project_reldir = pwd.split('/projects')[-1]
    pyLDAvis.save_html(vis_data, os.path.join(output_dir, output_file))
    browser_link_html = HTML('<hr><h2><a href="http://mirrormask.english.ucsb.edu:10001' + project_reldir + '/browser/pyldavis/pyLDAvis.html" target="_blank">View pyLDAvis</a></h2><hr>')
    display(browser_link_html)

# Display the visualisation inside the notebook (local use only)
if not save and not new_window:
    # Inside an if block the result is not rendered unless passed to display()
    display(pyLDAvis.display(vis_data))
    
# Open the visualisation in a new window (local use only)
if new_window:
    pyLDAvis.show(vis_data, port=8889)
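
The five inputs passed to `pyLDAvis.prepare` have to agree with one another: one column of `topic_term_dists` per vocabulary entry, one row of `doc_topic_dists` per document, and rows that sum to 1. The helper below is a hypothetical pre-flight check (`check_prepare_inputs` is not part of pyLDAvis) that reports mismatches in plain language. It is a sketch of the kind of checks that can be run, not a reproduction of the library's own validation.

```python
import numpy as np


def check_prepare_inputs(topic_term_dists, doc_topic_dists,
                         doc_lengths, vocab, term_frequency):
    """Hypothetical helper: return a list of problems found in the inputs."""
    phi = np.asarray(topic_term_dists, dtype=float)
    theta = np.asarray(doc_topic_dists, dtype=float)
    problems = []
    if phi.shape[1] != len(vocab):
        problems.append("topic_term_dists columns != len(vocab)")
    if len(vocab) != len(term_frequency):
        problems.append("len(vocab) != len(term_frequency)")
    if theta.shape[0] != len(doc_lengths):
        problems.append("doc_topic_dists rows != len(doc_lengths)")
    if theta.shape[1] != phi.shape[0]:
        problems.append("topic counts differ between the two matrices")
    if not np.allclose(phi.sum(axis=1), 1.0):
        problems.append("topic_term_dists rows do not sum to 1")
    if not np.allclose(theta.sum(axis=1), 1.0):
        problems.append("doc_topic_dists rows do not sum to 1")
    return problems


# Invented inputs: 2 topics, 3 words, 2 documents.
toy_problems = check_prepare_inputs(
    topic_term_dists=[[0.5, 0.25, 0.25], [0.1, 0.2, 0.7]],
    doc_topic_dists=[[0.9, 0.1], [0.3, 0.7]],
    doc_lengths=[3, 2],
    vocab=["garden", "ship", "whale"],
    term_frequency=[1, 2, 2],
)
print(toy_problems)
```

Because the parameter names match the keys of the `data` dictionary, the check could be run in this notebook as `check_prepare_inputs(**data)` just before `pyLDAvis.prepare(**data)`.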