Example Notebook for Persistent Topics
====================

This notebook shows how to **(1)** run LDA on one dataset, **(2)** save some of the topics from that first run, and **(3)** reuse the saved topics when running LDA again on a new dataset.

Import the required modules.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

1. Initial LDA on Beer3
---------------------------

In [2]:
n_topics = 125 # number of topics
n_samples = 400 # how many samples to get during Gibbs sampling
n_burn = 200 # no. of burn-in samples to discard
n_thin = 5 # thinning parameter
alpha = 0.1 # Dirichlet parameter for document-topic distributions
beta = 0.01 # Dirichlet parameter for topic-word distributions

fragment_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_fragments_rel.csv'
neutral_loss_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_losses_rel.csv'
mzdiff_filename = None

ms1_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_ms1_rel.csv'
ms2_filename = basedir + 'input/relative_intensities/Beer_3_T10_POS_ms2_rel.csv'
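
As a rough check on the sampler settings above, the sketch below counts how many Gibbs samples would be retained after burn-in and thinning. It assumes that `n_samples` counts every sweep including burn-in, and that every `n_thin`-th sweep after burn-in is kept; the sampler inside `Ms2Lda` may count differently. `retained_samples` is a hypothetical helper, not part of the library.

```python
def retained_samples(n_samples, n_burn, n_thin):
    """Count sweeps kept after discarding burn-in and thinning.

    Assumption: n_samples includes the burn-in sweeps, and every
    n_thin-th sweep after burn-in is kept.
    """
    if n_thin < 1:
        raise ValueError("n_thin must be at least 1")
    # Sweeps left once the burn-in period has been discarded
    post_burn = max(n_samples - n_burn, 0)
    return post_burn // n_thin


kept = retained_samples(n_samples=400, n_burn=200, n_thin=5)
print(kept)
```

Under these assumptions the settings in this notebook would leave 40 samples for estimating the model.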

In [None]:
ms2lda = Ms2Lda(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                ms1_filename, ms2_filename, relative_intensity=True)
df, vocab = ms2lda.preprocess()

Data shape (856, 1664)


In [None]:
ms2lda.run_lda(df, vocab, n_topics, n_samples, n_burn, n_thin, 
               alpha, beta, use_own_model=True, use_native=True)

Fitting model...
CGS LDA initialising
......................................................................................
Using Numba for LDA sampling
Preparing words
Preparing Z matrix
DONE
Burn-in 1 
Burn-in 2 
Burn-in 3 
...
Burn-in 68 

Next, we save the results of this LDA run on beer3pos, which writes the output matrices and other result files.

In [None]:
ms2lda.write_results('beer3_pos_rel')

We then show the top-10 topics ranked by their h-indices. Change the *sort_by* parameter to rank by either h-index or in-degree, and remove the *top_N* parameter to show the ranking of all topics.

In [None]:
# topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='in_degree')
topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='h_index', top_N=10)
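
For readers unfamiliar with the h-index, here is a minimal sketch of the generic definition: the largest h such that at least h items each have a count of at least h. How `rank_topics` derives the per-topic counts is not shown in this notebook, so the input list and the `h_index` helper below are illustrative assumptions only.

```python
def h_index(counts):
    """Return the largest h such that at least h counts are >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            # Counts are sorted, so no later position can qualify
            break
    return h


# Hypothetical counts for the features of a single topic
example_counts = [10, 8, 5, 4, 3, 1]
print(h_index(example_counts))
```

With these made-up counts, four features have a count of at least 4, but there are not five features with a count of at least 5, so the result is 4.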

Plot the fragments of these topics.

In [None]:
persisted_topics = [57, 87, 16, 50, 101, 0, 5, 20, 21, 27]

# remove the selected_topics parameter to plot for all topics
ms2lda.plot_lda_fragments(consistency=0.50, sort_by="h_index", selected_topics=persisted_topics)

Suppose we want to keep the top-10 topics from beer3pos above (or whichever topics are listed in the *persisted_topics* variable) and reuse them when running LDA on beer2pos.

First we need to save the state of the LDA model that we just ran on beer3pos. The line below creates two more files: the dumped model state (*beer3pos.model*) and the vocabulary of 'words' used by the persisted topics (*beer3pos.vocab*). These files are written to the same location as the output matrices, i.e. the *results/beer3_pos_rel* folder relative to this notebook.

In [None]:
model_filename = 'results/beer3_pos_rel/beer3pos.model'
vocab_filename = 'results/beer3_pos_rel/beer3pos.vocab'
ms2lda.save_model(persisted_topics, model_filename, vocab_filename)

2. LDA on Beer2 with persistent topics from Beer3
------------------------------------------------------

First we load the previously saved model of beer3.

In [None]:
from lda_cgs import CollapseGibbsLda
beer3_model = CollapseGibbsLda.load(model_filename)
if hasattr(beer3_model, 'selected_topics'):
    print("Persistent topics = " + str(beer3_model.selected_topics))

Now we have to go to R and run the feature extraction script (*MS1MS2_MatrixGeneration_default_7ppm_specPeaks.R*) on the beer2pos data. <font color='red'>**This step has to be done manually for now, although it should be automated as part of the pipeline later.**</font>

Specifically, in the R script, set the following parameter, which specifies the vocabulary list of the persistent topics:

    prev_words_file <- '/home/joewandy/git/metabolomics_tools/justin/notebooks/results/beer3_pos_rel/beer3pos.vocab'

and re-run the sections of the R script that perform feature extraction, from the "Data filtering" part onwards.

<hr/>

When running LDA on beer2pos, an additional parameter, *previous_model*, needs to be passed in. The total number of topics is now 135: the 10 persistent topics come first, and the 125 new topics are appended after them.
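
The index layout described above can be sketched as follows. `topic_layout` is a hypothetical helper written for illustration; it only reproduces the arithmetic and does not call into `Ms2Lda`.

```python
def topic_layout(persisted_topics, n_new_topics):
    """Map old topic ids to new positions and list the new-topic ids.

    Assumption (taken from the text above): persisted topics occupy the
    first positions in the order given, and new topics follow them.
    """
    n_persisted = len(persisted_topics)
    old_to_new = {old: new for new, old in enumerate(persisted_topics)}
    new_topic_ids = list(range(n_persisted, n_persisted + n_new_topics))
    return old_to_new, new_topic_ids


persisted = [57, 87, 16, 50, 101, 0, 5, 20, 21, 27]
old_to_new, new_ids = topic_layout(persisted, n_new_topics=125)

total_topics = len(old_to_new) + len(new_ids)
print(total_topics)                    # persisted plus new topics
print(old_to_new[57], old_to_new[87])  # new positions of old topics 57 and 87
print(new_ids[0], new_ids[-1])         # first and last id of the new topics
```

Under this layout, the persisted topics take ids 0 to 9 and the new topics take ids 10 to 134.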

In [None]:
fragment_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_fragments_rel.csv'
neutral_loss_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_losses_rel.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_ms1_rel.csv'
ms2_filename = basedir + 'input/relative_intensities/Beer_2_T10_POS_ms2_rel.csv'

ms2lda = Ms2Lda(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                ms1_filename, ms2_filename, relative_intensity=True)
df, vocab = ms2lda.preprocess()
ms2lda.run_lda(df, vocab, n_topics, n_samples, n_burn, n_thin, 
               alpha, beta, use_own_model=True, use_native=True, previous_model=beer3_model)

In [None]:
ms2lda.write_results('beer2_pos_rel')

3. Beer2 Results
------------------

The persisted topics from the previous LDA run are placed first in the list of topics of the new run, so old topic 57 becomes new topic 0, old topic 87 becomes new topic 1, etc.

In [None]:
old_persisted_index = beer3_model.selected_topics
# list(...) keeps this an explicit list under both Python 2 and Python 3
new_persisted_index = list(range(len(beer3_model.selected_topics)))

print("in beer3pos = " + str(old_persisted_index))
print("in beer2pos = " + str(new_persisted_index))

If we show the top-10 topics in beer2pos ranked by h-index, we see that the persisted topics from beer3pos are not very high up the list, i.e. topics 0 - 9 do not appear there.

In [None]:
topic_ranking, sorted_topic_counts = ms2lda.rank_topics(sort_by='h_index', top_N=10)

We can also plot the fragments in topics 0 - 9 in beer2pos.

In [None]:
ms2lda.plot_lda_fragments(consistency=0.0, sort_by="h_index", selected_topics=new_persisted_index)

We can also plot the predictive distribution of the persisted topics (theta) in the old and new LDA results.

In [None]:
pred_old = np.sum(beer3_model.doc_topic_, axis=0)
pred_old = pred_old / np.sum(pred_old)
pred_old = pred_old[old_persisted_index]

beer2_model = ms2lda.model
pred_new = np.sum(beer2_model.doc_topic_, axis=0)
pred_new = pred_new / np.sum(pred_new)
pred_new = pred_new[new_persisted_index]
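
To make the normalisation step concrete, here is a small sketch on made-up numbers. The 3x4 `doc_topic` matrix is invented for illustration and stands in for `doc_topic_`, with one row per document and one column per topic; that orientation is inferred from the `axis=0` sum above. `topic_share` is a hypothetical helper name.

```python
import numpy as np

# Made-up document-topic matrix: 3 documents (rows) x 4 topics (columns)
doc_topic = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.3, 0.3, 0.2, 0.2],
])


def topic_share(doc_topic, topic_ids):
    """Sum topic weights over documents, normalise, then select topics."""
    totals = np.sum(doc_topic, axis=0)   # one total per topic
    shares = totals / np.sum(totals)     # shares over all topics sum to 1
    return shares[list(topic_ids)]


selected = topic_share(doc_topic, [0, 2])
print(selected)
```

Because the selection happens after normalising over all topics, the selected values need not sum to 1. The same holds for `pred_old` and `pred_new`, which makes them comparable as shares of the whole model rather than as distributions over the persisted topics alone.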

In [None]:
K = len(old_persisted_index)
ind = np.arange(K)  # the x locations for the groups
width = 0.35       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, pred_old, width, color='g')
rects2 = ax.bar(ind+width, pred_new, width, color='b')

# add some text for labels, title and axes ticks
ax.set_ylabel('Predictive Probability')
ax.set_xlabel('Persistent Topic')
ax.set_title('Predictive distributions of persistent topics')
ax.set_xticks(ind+width)
ax.set_xticklabels(old_persisted_index)

ax.legend( (rects1[0], rects2[0]), ('Beer3pos', 'Beer2pos') )
plt.show()