Example Notebook for Persistent Topics
====================

This notebook shows how we can **(1)** run LDA on one dataset, **(2)** save some of the topics from that first LDA run and **(3)** reuse the saved topics when running LDA again on a new dataset.

Import the required modules.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

1. Initial LDA on Beer 3
---------------------------

In [None]:
# Number of topics, around 300-400 seems to be good for the beer/urine data from cross-validation results
n_topics = 300 

# How many samples to get during Gibbs sampling, recommended >500 for analysis
n_samples = 500

# No. of burn-in samples to discard before we start averaging over the samples. 
# If 0, then we'll use only the last sample for the results.
n_burn = 0 

# Thinning parameter when averaging over the samples. 
# If n_burn is 0 then this doesn't matter.
n_thin = 1 

# Follow the recommendation from Griffiths & Steyvers
alpha = 50.0/n_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions
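
The way `n_samples`, `n_burn` and `n_thin` interact can be sketched as follows. This is only an illustration of the behaviour described in the comments above: `kept_sample_indices` is a hypothetical helper, not part of `Ms2Lda`, and the exact indexing used inside the library is an assumption.

```python
def kept_sample_indices(n_samples, n_burn, n_thin):
    """Return the 0-based indices of the Gibbs samples used for the results.

    Hypothetical helper that mirrors the parameter comments above;
    it is not taken from the Ms2Lda code.
    """
    if n_burn == 0:
        # No burn-in: only the last sample is used.
        return [n_samples - 1]
    # Otherwise discard the first n_burn samples and keep every n_thin-th one.
    return list(range(n_burn, n_samples, n_thin))


last_only = kept_sample_indices(500, 0, 1)   # the settings used in this notebook
averaged = kept_sample_indices(500, 250, 5)  # an alternative with burn-in and thinning
print(len(last_only))
print(len(averaged))
```

With the settings in this notebook (`n_burn = 0`), only one sample contributes, which is why `n_thin` has no effect.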

In [None]:
# The input files
fragment_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_fragments_rel.csv'
neutral_loss_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_losses_rel.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_ms1_rel.csv'
ms2_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_ms2_rel.csv'

In [None]:
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                                 ms1_filename, ms2_filename)

In [None]:
ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta)

Next, we save the results of this LDA on beer3 and produce the output matrices etc.

In [None]:
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.05)
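
As a rough picture of what thresholding does, the sketch below zeroes out every probability below a cut-off in a small made-up document-topic matrix. `threshold_matrix` is a hypothetical helper; whether `do_thresholding` uses `>=` or `>`, and whether it renormalises afterwards, is not shown in this notebook and is assumed here.

```python
import numpy as np


def threshold_matrix(probs, threshold):
    """Zero out entries below `threshold` (hypothetical helper, not Ms2Lda code)."""
    probs = np.asarray(probs, dtype=float)
    return np.where(probs >= threshold, probs, 0.0)


# Made-up document-topic probabilities: 2 documents, 3 topics.
doc_topic = np.array([[0.90, 0.07, 0.03],
                      [0.02, 0.48, 0.50]])
thresholded = threshold_matrix(doc_topic, 0.05)
print(thresholded)
```

Entries such as 0.03 and 0.02 fall below the 0.05 cut-off and are dropped, so each document keeps only the topics it is meaningfully associated with.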

In [None]:
ms2lda.write_results('beer3_pos_rel')

In [None]:
ms2lda.print_topic_words()

In [None]:
ms2lda.plot_log_likelihood()

Next, we show the ranking of the top-10 topics by their h-indices. Change the *sort_by* parameter to rank by either h-index or in-degree, and remove the *top_N* parameter to show the ranking of all topics.

In [None]:
# topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='in_degree')
topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='h_index', top_N=10)
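
For readers unfamiliar with the h-index, here is a minimal sketch of the general definition: the largest `h` such that at least `h` items have a count of at least `h`. Which per-topic counts `rank_topics` feeds into this calculation is not shown in this notebook, so the counts below are made up and the function is a hypothetical illustration rather than the library's implementation.

```python
def h_index(counts):
    """Largest h such that at least h of the counts are >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


example_counts = [5, 4, 3, 1, 0]  # made-up counts for one topic
example_h = h_index(example_counts)
print(example_h)
```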

Plot the fragments of these topics.

In [None]:
persisted_topics = [57, 87, 16, 50, 101, 0, 5, 20, 21, 27]

# Non-interactive visualisation
# Remove the 'selected_topics' parameter to make the plots for all topics.
# You can sort by either 'h_index' or 'in_degree'.
ms2lda.plot_lda_fragments(consistency=0.50, sort_by="h_index", selected_topics=persisted_topics)

# uncomment below for interactive visualisation
# ms2lda.plot_lda_fragments(consistency=0.50, sort_by="h_index", interactive=True)

Let's say we want to keep all the top-10 topics from beer3pos above (or whatever is defined in the *persisted_topics* variable) fixed and run LDA with them again on beer2.

First we need to save the state of the LDA model that we just ran on beer3pos. The line below creates two more files: the dumped model state (*beer3pos.model*) and the vocabulary list of the 'words' used for the persisted topics (*beer3pos.vocab*). These files are written to the same location as the output matrices, i.e. the *results/beer3_pos_rel* folder relative to this notebook.

In [None]:
model_filename = 'results/beer3_pos_rel/beer3pos.model'
vocab_filename = 'results/beer3_pos_rel/beer3pos.vocab'
ms2lda.persist_topics(model_filename, vocab_filename, persisted_topics)

2. LDA on Beer2 with persistent topics from Beer3
------------------------------------------------------

First we load the previously saved model of beer3.

In [None]:
from lda_cgs import CollapseGibbsLda
beer3_model = CollapseGibbsLda.load(model_filename)
if hasattr(beer3_model, 'selected_topics'):
    print("Persistent topics = " + str(beer3_model.selected_topics))

Now we have to go to R and run the feature extraction script (*MS1MS2_MatrixGeneration.R*) on the Beer2pos data. <font color='red'>**This step has to be done manually for now, although we should automate it as part of the pipeline later. If you change any of the initial LDA parameters in section 1, you should also do this step again before proceeding further (because it might result in different probabilities of the words).**</font>

Specifically in that R script, make sure to set the following parameter (that specifies the vocabulary list of the persistent topics)

    prev_words_file <- '/home/joewandy/git/metabolomics_tools/justin/notebooks/results/beer3_pos_rel/beer3pos.vocab'


<hr/>

When running LDA on beer2pos, there is now an additional parameter, *previous_model*, that needs to be passed in. Also, the total number of topics is now 310. The persistent topics (10) come first, and the remaining new topics (300) are appended after them.

In [None]:
# Number of NEW topics. The total number of topics is actually 10 (previous) + 300 (new) = 310. 
n_topics = 300 

# How many samples to get during Gibbs sampling, recommended >500 for analysis
n_samples = 500

# No. of burn-in samples to discard before we start averaging over the samples. 
# If 0, then we'll use only the last sample for the results.
n_burn = 0 

# Thinning parameter when averaging over the samples. 
# If n_burn is 0 then this doesn't matter.
n_thin = 1 

# Follow the recommendation from Griffiths & Steyvers
total_no_of_topics = n_topics + len(beer3_model.selected_topics)
alpha = 50.0/total_no_of_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions
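
The topic layout for the second run can be sketched with plain Python. The list of persisted topics is copied from section 1; `old_to_new` and `new_topic_indices` are illustrative names introduced here, not attributes of the model.

```python
# Old topic indices persisted from the beer3 run (copied from section 1).
persisted_topics = [57, 87, 16, 50, 101, 0, 5, 20, 21, 27]
n_new_topics = 300

n_persisted = len(persisted_topics)
total_topics = n_persisted + n_new_topics

# Persisted topics occupy the first positions, in the order they were saved.
old_to_new = {old: new for new, old in enumerate(persisted_topics)}

# The new topics are appended after them.
new_topic_indices = list(range(n_persisted, total_topics))

# alpha is computed from the total, not just from the number of new topics.
alpha_example = 50.0 / total_topics
print(total_topics)
```

So old topic 57 is found at index 0 in the new run, and the first genuinely new topic sits at index 10.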

In [None]:
fragment_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_fragments_rel.csv'
neutral_loss_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_losses_rel.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_ms1_rel.csv'
ms2_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_ms2_rel.csv'

ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                                 ms1_filename, ms2_filename)
ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta, previous_model=beer3_model)

In [None]:
ms2lda.write_results('beer2_pos_rel')

3. Beer2 Results
------------------

The persisted topics from the previous LDA run are placed first in the list of topics of the new LDA run, so old topic 57 becomes new topic 0, old topic 87 becomes new topic 1, etc.

In [None]:
old_persisted_index = beer3_model.selected_topics
new_persisted_index = list(range(len(beer3_model.selected_topics)))

print("in beer3pos = " + str(old_persisted_index))
print("in beer2pos = " + str(new_persisted_index))

If we show the ranking of the top-10 topics in beer2pos by the H-index, we see that the persisted topics from beer3 aren't very high up the list, i.e. we don't see topics 0 - 9 there.

In [None]:
topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='h_index', top_N=10)

We can also plot the fragments in topics 0 - 9 in beer2pos.

In [None]:
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.05)

In [None]:
# non-interactive
ms2lda.plot_lda_fragments(consistency=0.0, sort_by="h_index", selected_topics=new_persisted_index)

# interactive
# ms2lda.plot_lda_fragments(consistency=0.0, sort_by="h_index", interactive=True)

We can also plot the predictive distribution of the persisted topics (theta) in the old and new LDA results.

In [None]:
pred_old = beer3_model.posterior_alpha
pred_old = pred_old / np.sum(pred_old)
pred_old = pred_old[old_persisted_index]
print(pred_old)

beer2_model = ms2lda.model
pred_new = beer2_model.posterior_alpha
pred_new = pred_new / np.sum(pred_new)
pred_new = pred_new[new_persisted_index]
print(pred_new)
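
The normalisation used above can be illustrated on a toy vector. The values are made up and `selected` stands in for the persisted topic indices; only the arithmetic mirrors the cell above.

```python
import numpy as np

posterior_alpha = np.array([2.0, 6.0, 1.0, 1.0])  # made-up values for 4 topics
selected = [1, 3]                                 # stand-in for the persisted indices

# Normalise over ALL topics first, then pick out the selected ones.
pred = posterior_alpha / np.sum(posterior_alpha)
pred_selected = pred[selected]
print(pred_selected)
```

Because the normalisation runs over all topics before the selection, the selected values generally do not sum to 1. The bars in the plot therefore show each persisted topic's share of the whole model, not its share among the persisted topics only.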

In [None]:
K = len(old_persisted_index)
ind = np.arange(K)  # the x locations for the groups
width = 0.35       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, pred_old, width, color='g')
rects2 = ax.bar(ind+width, pred_new, width, color='b')

# add some text for labels, title and axes ticks
ax.set_ylabel('Predictive Probability')
ax.set_xlabel('Persistent Topic')
ax.set_title('Predictive distributions of persistent topics')
ax.set_xticks(ind+width)
ax.set_xticklabels(old_persisted_index)

ax.legend( (rects1[0], rects2[0]), ('Beer3pos', 'Beer2pos') )
plt.show()