# PLSA
_Probabilistic latent semantic analysis_

## Preliminaries
#### Import dependencies

In [1]:
import sys
import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'matplotlib'

#### Set the plotting environment

In [17]:
%matplotlib notebook

#### Put the actual `plsa` package onto the _python path_

In [18]:
sys.path.append('..')

#### Import main classes from the `plsa` package

In [2]:
from plsa import Corpus, Pipeline, Visualize
from plsa.pipeline import DEFAULT_PIPELINE
from plsa.algorithms import PLSA

ModuleNotFoundError: No module named 'plsa'

## Data Sources


In [20]:
directory = '../data/blogs'

## Set Up the Corpus
#### Define pre-processing pipeline


In [21]:
pipeline = Pipeline(*DEFAULT_PIPELINE)
pipeline

Pipeline:
0: remove_non_ascii
1: to_lower
2: remove_numbers
3: tag_remover
4: punctuation_remover
5: tokenize
6: LemmatizeWords
7: RemoveStopwords
8: short_word_remover
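
To make the idea of a pre-processing pipeline concrete, here is a minimal sketch of how such a chain of steps could be composed in plain Python. It is not the `plsa` package's implementation: the functions below are hypothetical stand-ins for a few of the listed steps, and the minimum word length of 3 is an assumption chosen for illustration.

```python
import re
import string
from functools import reduce


def to_lower(text):
    return text.lower()


def remove_numbers(text):
    return re.sub(r'\d+', '', text)


def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))


def tokenize(text):
    return text.split()


def remove_short_words(tokens, min_length=3):
    # min_length is an assumed threshold, not taken from the plsa package.
    return [token for token in tokens if len(token) >= min_length]


def run_pipeline(text, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, text)


STEPS = (to_lower, remove_numbers, remove_punctuation, tokenize, remove_short_words)
tokens = run_pipeline('In 2019, the Agency funded 3 new programs!', STEPS)
print(tokens)
```

The order matters: string-level steps have to run before tokenization, and token-level steps after it.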

#### Load corpus


In [36]:
# Uncomment on the first run if the NLTK data used by the pipeline is missing:
# import nltk
# nltk.download()
corpus = Corpus.from_xml(directory, pipeline)
corpus

Corpus:
Number of documents: 6316
Number of words:     19425

## Run PLSA

#### Choose the number of topics

In [37]:
n_topics = 5

#### Instantiate a PLSA model

In [38]:
plsa = PLSA(corpus, n_topics, True)
plsa

PLSA:
====
Number of topics:     5
Number of documents:  6316
Number of words:      19425
Number of iterations: 0

#### Fit a PLSA model

In [39]:
result = plsa.fit()
plsa


KeyboardInterrupt



#### Find the best PLSA model of many
Like any iterative algorithm, PLSA needs its probabilities to be (randomly) initialized before the first iteration step. Calling the `fit` method of two different `PLSA` instances that operate on the _same_ corpus with the _same_ number of topics can therefore lead to (slightly) different results, corresponding to different local minima of the Kullback-Leibler divergence between the true document-word probability and its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.


In [None]:
result = plsa.best_of(5)
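
The random-restart idea can be illustrated with a small, self-contained sketch. It is not the `plsa` package's code: it assumes the asymmetric factorization P(d, w) ≈ P(d) Σ_t P(t|d) P(w|t), a fixed number of EM iterations, and a toy count matrix. All function names (`fit_once`, `best_of`, `kl_divergence`) are hypothetical.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [value / total for value in values]


def kl_divergence(counts, topic_given_doc, word_given_topic):
    """KL divergence between empirical P(d, w) and its factorized approximation."""
    total = sum(map(sum, counts))
    kl = 0.0
    for d, row in enumerate(counts):
        p_doc = sum(row) / total
        for w, n in enumerate(row):
            if n == 0:
                continue
            p = n / total
            q = p_doc * sum(td * wt[w]
                            for td, wt in zip(topic_given_doc[d], word_given_topic))
            kl += p * math.log(p / q)
    return kl


def fit_once(counts, n_topics, n_iter, rng):
    """One EM run from a random starting point."""
    n_docs, n_words = len(counts), len(counts[0])
    topic_given_doc = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
                       for _ in range(n_docs)]
    word_given_topic = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
                        for _ in range(n_topics)]
    for _ in range(n_iter):
        # Small floor keeps every probability strictly positive.
        new_wt = [[1e-12] * n_words for _ in range(n_topics)]
        new_td = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: unnormalized P(t | d, w)
                joint = [topic_given_doc[d][t] * word_given_topic[t][w]
                         for t in range(n_topics)]
                norm = sum(joint)
                # M-step accumulation, weighted by the observed count
                for t in range(n_topics):
                    weight = counts[d][w] * joint[t] / norm
                    new_wt[t][w] += weight
                    new_td[d][t] += weight
        word_given_topic = [normalize(row) for row in new_wt]
        topic_given_doc = [normalize(row) for row in new_td]
    kl = kl_divergence(counts, topic_given_doc, word_given_topic)
    return kl, topic_given_doc, word_given_topic


def best_of(counts, n_topics, n_runs, n_iter=50, seed=0):
    """Run EM n_runs times and keep the run with the lowest KL divergence."""
    rng = random.Random(seed)
    runs = [fit_once(counts, n_topics, n_iter, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]


# Toy document-word counts (4 documents, 4 words), made up for illustration.
counts = [[4, 3, 0, 0], [3, 5, 1, 0], [0, 0, 4, 4], [0, 1, 3, 5]]
best, all_kl = best_of(counts, n_topics=2, n_runs=5)
print('KL divergence per run:', all_kl)
print('Best KL divergence:   ', best[0])
```

On a corpus this small the runs may well end up in the same place. The point of the sketch is only the selection logic: each run starts from a different random initialization, and the run with the lowest divergence wins.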

#### Examine the results


In [28]:
result.topic

NameError: name 'result' is not defined

In [29]:
new_doc = 'Hello! This is the federal humpty dumpty agency for state funding.'

topic_components, number_of_new_words, new_words = result.predict(new_doc)

print('Relative topic importance in new document:', topic_components)
print('Number of previously unseen words in new document:', number_of_new_words)
print('Previously unseen words in new document:', new_words)

NameError: name 'result' is not defined

And, of course, we can look at individual topics, that is, at how important each word is for each topic. Let's look at the top-10 words of the first topic.

In [None]:
result.word_given_topic[0][:10] 
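
For readers who want to see what "top-10 words of a topic" amounts to, the following sketch ranks words by their probability within one topic. The helper `top_words` and the toy probabilities are hypothetical and only illustrate the ranking. The sketch makes no claim about how the `plsa` package stores or sorts its results.

```python
def top_words(word_probabilities, n=10):
    """Return the n (word, probability) pairs with the highest probability."""
    ranked = sorted(word_probabilities.items(),
                    key=lambda item: item[1],
                    reverse=True)
    return ranked[:n]


# Made-up P(word | topic) values for a single topic.
toy_topic = {'federal': 0.3, 'state': 0.4, 'agency': 0.2, 'funding': 0.1}
print(top_words(toy_topic, n=2))
```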

## Visualize the Results

In [None]:
visualize = Visualize(result)
visualize

#### Convergence
Since PLSA uses an iterative expectation-maximization (EM) style algorithm, let's make sure we have achieved reasonable convergence.

In [None]:
fig, ax = plt.subplots()
_ = visualize.convergence(ax)
fig.tight_layout()
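
Besides inspecting the plot, convergence can also be judged numerically. The sketch below shows one common criterion, assuming the divergence after each iteration is available as a plain list: stop once the relative improvement between two consecutive iterations falls below a tolerance. The function name, the tolerance, and the numbers in `history` are made up for illustration and are not produced by the `plsa` package.

```python
def has_converged(kl_history, rel_tol=1e-3):
    """True once the last step changed the divergence by less than rel_tol (relative)."""
    if len(kl_history) < 2:
        return False
    previous, current = kl_history[-2], kl_history[-1]
    return abs(previous - current) <= rel_tol * abs(previous)


# Hypothetical divergence values after successive EM iterations.
history = [0.90, 0.52, 0.41, 0.4052, 0.4051]
print(has_converged(history[:3]))  # still improving noticeably
print(has_converged(history))      # last step barely changed the value
```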

#### Relative topic importance
How important are the topics we found in the corpus?

In [None]:
fig, ax = plt.subplots()
_ = visualize.topics(ax)
fig.tight_layout()

#### The topics
The most interesting part is probably the topics themselves. We can visualize them as word clouds.

In [None]:
fig = plt.figure(figsize=(9.4, 10))
_ = visualize.wordclouds(fig)