# pyLDAvis

[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization.
It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, IPython notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.

This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.


## BYOM - Bring your own model

`pyLDAvis` is agnostic to how your model was trained. To visualize it, you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on. The main function is [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare), which transforms your data into the format needed for the visualization.
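
To make the expected inputs concrete, here is a minimal sketch with a made-up model of two topics, three documents, and three vocabulary terms. The toy numbers and the `check_prepare_inputs` helper are hypothetical and not part of `pyLDAvis`; the dictionary keys are the keyword names used by the loader in this notebook.

```python
import math

# Toy inputs (made-up numbers): 2 topics, 3 documents, 3 vocabulary terms.
toy_model_data = {
    # One row per topic, one column per term; each row sums to 1.
    'topic_term_dists': [[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]],
    # One row per document, one column per topic; each row sums to 1.
    'doc_topic_dists': [[0.9, 0.1],
                        [0.2, 0.8],
                        [0.5, 0.5]],
    # Number of tokens in each document.
    'doc_lengths': [4, 5, 3],
    # Terms, in the same order as the columns of topic_term_dists.
    'vocab': ['film', 'plot', 'actor'],
    # Corpus-wide count of each term, in vocab order.
    'term_frequency': [5, 3, 4],
}

def check_prepare_inputs(data):
    """Hypothetical helper: verify that the five inputs agree in shape."""
    n_topics = len(data['topic_term_dists'])
    n_terms = len(data['vocab'])
    n_docs = len(data['doc_lengths'])
    assert len(data['term_frequency']) == n_terms
    assert len(data['doc_topic_dists']) == n_docs
    for row in data['topic_term_dists']:
        assert len(row) == n_terms
        assert math.isclose(sum(row), 1.0, abs_tol=1e-9)
    for row in data['doc_topic_dists']:
        assert len(row) == n_topics
        assert math.isclose(sum(row), 1.0, abs_tol=1e-9)
    return n_topics, n_docs, n_terms

shape = check_prepare_inputs(toy_model_data)
print('topics, docs, terms:', shape)
```

A real model passes the same five keys to `prepare`. This toy one only illustrates the layout and is not meant to be visualized.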

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [1]:
import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'], 
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

ValueError: Expecting value: line 1 column 1 (char 0)
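
For reference, the loader expects a JSON object with the keys `phi`, `theta`, `doc.length`, `vocab`, and `term.frequency`. The sketch below writes a tiny made-up file in that layout and reads it back with the same key mapping. The toy values, the `KEY_MAP` name, and the temporary file are assumptions for illustration, not the real movie-review data. It also shows that `json` raises `ValueError: Expecting value: line 1 column 1 (char 0)` when the input is empty or does not start with valid JSON, so that message usually points at the input file rather than at the loader.

```python
import json
import os
import tempfile

# Made-up export in the layout the loader reads (not the real movie data).
toy_export = {
    'phi': [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],      # topic-term rows
    'theta': [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],  # document-topic rows
    'doc.length': [4, 5, 3],
    'vocab': ['film', 'plot', 'actor'],
    'term.frequency': [5, 3, 4],
}

# Map from the R-style key names to the keyword names used by the loader.
KEY_MAP = {
    'phi': 'topic_term_dists',
    'theta': 'doc_topic_dists',
    'doc.length': 'doc_lengths',
    'vocab': 'vocab',
    'term.frequency': 'term_frequency',
}

# Write the toy export to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_input.json')
    with open(path, 'w') as f:
        json.dump(toy_export, f)
    with open(path, 'r') as f:
        loaded = json.load(f)

toy_data = {new_key: loaded[old_key] for old_key, new_key in KEY_MAP.items()}
print(sorted(toy_data))

# An empty or non-JSON input fails before any key is read.
try:
    json.loads('')
except ValueError as err:
    empty_error = str(err)
    print(empty_error)
```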

Now that we have the data loaded we use the `prepare` function:

In [2]:
import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)

Once you have the visualization data prepared you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:

In [3]:
pyLDAvis.display(movies_vis_data)

Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the IPython integration though. :) Aside from being aesthetically pleasing, this visualization more importantly represents a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model, see [this documentation](http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) from the original R project and this presentation ([slides](https://speakerdeck.com/bmabey/visualizing-topic-models), [video](https://www.youtube.com/watch?v=tGxW2BzC_DU)).


To see other models visualized check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).
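
If you would rather keep the result as a file than display it inline, a small wrapper around `pyLDAvis.save_html` is enough. This is only a sketch: the helper name `export_visualization` and the default file name are assumptions made for illustration, and the call relies on `save_html` accepting the prepared data plus a filename or file object.

```python
def export_visualization(vis_data, path='lda_vis.html'):
    """Hypothetical helper: write prepared data to a stand-alone HTML file."""
    # Imported inside the function so this definition loads without pyLDAvis.
    import pyLDAvis
    pyLDAvis.save_html(vis_data, path)
    return path

# Usage, assuming `movies_vis_data` comes from `pyLDAvis.prepare`:
# export_visualization(movies_vis_data, 'movie_reviews.html')
```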

*ProTip:* To avoid tediously typing in `display` all the time use:

In [4]:
pyLDAvis.enable_notebook()

By default the topics are projected to the 2D plane using [PCoA](https://en.wikipedia.org/wiki/PCoA) on a distance matrix created using the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence) on the topic-term distributions. You can pass in a different multidimensional scaling function via the `mds` parameter. In addition to `pcoa`, another provided option is `tsne`, which operates on the same JS-divergence distance matrix. `tsne` requires that you have sklearn installed. Here is `tsne` in action:

In [9]:
pyLDAvis.prepare(mds='tsne', **movies_model_data)
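
To give a feel for the distance matrix that both `pcoa` and `tsne` start from, here is a pure-Python sketch of pairwise Jensen-Shannon divergence between topic-term rows. It is an illustration on made-up distributions, not the code `pyLDAvis` runs internally; the function names are hypothetical and natural logarithms are assumed.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_distance_matrix(dists):
    """Square matrix of pairwise JS divergences between the rows of dists."""
    n = len(dists)
    return [[jensen_shannon(dists[i], dists[j]) for j in range(n)]
            for i in range(n)]

# Three made-up topics over a three-term vocabulary; the first and last are identical.
toy_topics = [[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7],
              [0.7, 0.2, 0.1]]

matrix = js_distance_matrix(toy_topics)
for row in matrix:
    print(['%.3f' % value for value in row])
```

Identical topics end up at distance zero and the matrix is symmetric, which is what a scaling method needs in order to place similar topics close together on the plane.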

## Making the common case easy - Gensim and others!

Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/) and [GraphLab Create](https://dato.com/products/create/). To demonstrate, below I am loading up a trained gensim model and the corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):

In [6]:
import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')


In the dark ages in order to inspect our topics all we had was `show_topics` and friends:

In [7]:
lda.show_topics()

[(37,
  u'0.018*food + 0.018*msg + 0.014*disease + 0.011*henry + 0.010*patients + 0.010*cancer + 0.010*health + 0.010*doctor + 0.007*eat + 0.007*effects'),
 (18,
  u'0.038*men + 0.034*constitution + 0.021*male + 0.014*purdue + 0.013*partners + 0.012*study + 0.012*women + 0.011*cross + 0.009*female + 0.009*weaver'),
 (0,
  u'0.028*host + 0.027*nntp + 0.027*posting + 0.025*newsreader + 0.021*tin + 0.020*technology + 0.019*institute + 0.017*michael + 0.017*edu + 0.016*keith'),
 (17,
  u'0.013*states + 0.011*new + 0.010*american + 0.009*state + 0.008*national + 0.008*united + 0.008*congress + 0.008*drugs + 0.007*press + 0.007*washington'),
 (33,
  u"0.025*year + 0.024*players + 0.021*league + 0.018*team + 0.015*division + 0.015*boston + 0.013*season + 0.013*hit + 0.011*fan + 0.011*he's"),
 (7,
  u'0.087*car + 0.033*cars + 0.030*andrew + 0.012*mellon + 0.012*carnegie + 0.011*mph + 0.011*ford + 0.009*cmu + 0.009*pittsburgh + 0.008*fpu'),
 (12,
  u'0.026*colorado + 0.020*fbi + 0.015*jim + 0.0

Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:

In [8]:
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

## GraphLab

As I mentioned above, you can also easily visualize GraphLab TopicModels. Check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=) if you are interested in that.


## Go forth and visualize!

What are you waiting for? Go ahead and `pip install pyldavis`.