Example Notebook
================

<h2>Import the MS2LDA package</h2>

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

<h2>1. Prepare the input matrices and run LDA on them</h2>

LDA requires us to produce the input matrices of the counts of words in the documents.
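
To make the shape of such an input matrix concrete, here is a minimal sketch with made-up data. It treats each MS1 peak as a document and each fragment or neutral-loss feature as a word. The document names, feature names and the use of plain occurrence counts are assumptions for illustration only; the real matrices come from the feature extraction pipeline, whose exact encoding is not described here.

```python
from collections import Counter

# Hypothetical toy data: each "document" is one MS1 peak, each "word" is a
# fragment or neutral-loss feature seen in its fragmentation spectrum.
documents = {
    'doc_A': ['fragment_85.03', 'fragment_127.04', 'fragment_85.03', 'loss_18.01'],
    'doc_B': ['fragment_127.04', 'loss_18.01', 'loss_18.01'],
}

# Fix the row and column order by sorting the names.
vocabulary = sorted(set(word for words in documents.values() for word in words))
doc_names = sorted(documents)

# One row per document, one column per word, each entry a count.
count_matrix = []
for name in doc_names:
    counts = Counter(documents[name])
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for name, row in zip(doc_names, count_matrix):
    print(name, row)
```

Each row of `count_matrix` is what LDA sees for one document: how often every word in the vocabulary occurs in it.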

There are two ways to generate these input matrices. You can run the feature extraction pipeline script (written in R) separately and then specify the resulting input matrices below:

In [None]:
fragment_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_fragments.csv'
neutral_loss_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_losses.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms1.csv'
ms2_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms2.csv'
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                             ms1_filename, ms2_filename)

Alternatively, call the following wrapper method, which runs the pipeline for you. It takes as input the full-scan and fragmentation files (defined in config_filename) and produces the count matrices used as input to LDA, which are then loaded automatically into the ms2lda object.

Note: The main entry point for feature extraction is <i>script_folder/startFeatureExtraction.R</i>, which requires RMassBank. RMassBank relies on rJava, which can be fiddly to configure. A common problem when configuring rJava is described at http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now.

In [None]:
# path to the folder containing the R scripts used for feature extraction 
script_folder = '/home/joewandy/git/metabolomics_tools/justin/R'

# path to the configuration file for feature extraction
config_filename = os.path.join(script_folder, 'config.yml')

# R prints too many warning messages, so suppress all warnings
import warnings
warnings.filterwarnings("ignore")

# run the feature extraction pipeline; this can take a long time
ms2lda = Ms2Lda.run_feature_extraction(script_folder, config_filename)

Now we're ready to run LDA.

In [None]:
### all the parameters you need to specify to run LDA ###

n_topics = 300 # 300 - 400 topics from cross-validation
n_samples = 100 # 100 is probably okay for testing; for manuscript results, use at least 500-1000
n_burn = 0 # if 0 then we only use the last sample
n_thin = 1 # every n-th sample to use for averaging after burn-in. Ignored if n_burn = 0
alpha = 50.0/n_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions

ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta)
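
The interaction between n_samples, n_burn and n_thin can be sketched as below. This is only an illustration of the convention suggested by the parameter comments above, not the actual Ms2Lda implementation; the helper name `samples_to_average` and the exact indexing (whether the sample at position n_burn itself is kept) are assumptions.

```python
def samples_to_average(n_samples, n_burn, n_thin):
    """Return the indices of the Gibbs samples that would be kept.

    Assumed convention (read off the parameter comments, not the Ms2Lda
    source): with n_burn == 0 only the final sample is used; otherwise
    every n_thin-th sample after burn-in is averaged.
    """
    if n_burn == 0:
        return [n_samples - 1]
    return list(range(n_burn, n_samples, n_thin))


n_topics = 300
alpha = 50.0 / n_topics  # about 0.167 for 300 topics

last_only = samples_to_average(n_samples=100, n_burn=0, n_thin=1)
thinned = samples_to_average(n_samples=1000, n_burn=500, n_thin=10)
print(round(alpha, 3), last_only, len(thinned))
```

Under these assumptions, the test settings above keep a single sample, while 1000 samples with a burn-in of 500 and thinning of 10 would average over 50 samples.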

In [None]:
ms2lda.save_project('results/beer3pos.project')

<h2>2. Resuming from Previous Run</h2>

If you called save_project() above, you can resume directly from this step the next time you load the notebook.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

In [None]:
ms2lda = Ms2Lda.resume_from('results/beer3pos.project')

<h2>3. Results</h2>

We need to threshold the document-topic and topic-word distributions produced by LDA, so we can say which topics are used in which documents, and which words 'belong' to a topic.

In [None]:
# Thresholding the doc_topic and topic_word matrices
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.01)
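
As a rough picture of what thresholding does, the sketch below applies a cut-off to a small, made-up document-topic matrix. The helper `threshold_rows` is hypothetical, and whether Ms2Lda compares with `>=` or `>` is an assumption.

```python
def threshold_rows(matrix, threshold):
    """Map each row index to the column indices whose value reaches the threshold."""
    return {
        row_idx: [col for col, value in enumerate(row) if value >= threshold]
        for row_idx, row in enumerate(matrix)
    }


# Hypothetical document-topic distribution: 2 documents over 4 topics.
doc_topic = [
    [0.90, 0.06, 0.03, 0.01],
    [0.02, 0.48, 0.50, 0.00],
]

topics_per_doc = threshold_rows(doc_topic, threshold=0.05)
print(topics_per_doc)
```

With a threshold of 0.05, the first document is assigned topics 0 and 1, and the second document topics 1 and 2. The same idea applied to the topic-word matrix decides which words belong to each topic.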

Print the words in each topic.

In [None]:
ms2lda.print_topic_words()

Save the output CSV files.

In [None]:
ms2lda.write_results('beer3_test_method3')

In the list below, specify which nodes (MS1 peaks or topics) should be coloured differently on the graph page. Node names can be read from the node labels on the graph page or from the CSV matrices written above.

For each node, add its label to the list together with its colour, given either as a colour name or as a hex code (see http://www.w3schools.com/cssref/css_colornames.asp for the list of colour names/codes).

In [None]:
# a default red colour for highlighted nodes in the graph
default_colour = '#CC0000'

# Specify each node and its respective colour here
# If you don't need this, set 
# special_nodes = None
special_nodes = [
    ('doc_372.18877_540.996', default_colour),
    ('doc_291.66504_547_239', 'gold'),
    ('doc_308.17029_289.13', '#C71585'),
    ('topic_244', default_colour),
    ('topic_98', 'aqua'),
    ('topic_293', '#ff1493')
]
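
Because a mistyped label or colour is easy to miss, a small sanity check before plotting may help. The helper below is a hypothetical sketch, not part of Ms2Lda. It assumes that labels start with `doc_` or `topic_`, as in the list above, and that hex codes have 3 or 6 digits; colour names are not validated.

```python
import re

# Accepts 3- or 6-digit hex codes such as '#c00' or '#CC0000'.
HEX_COLOUR = re.compile(r'^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$')


def check_special_nodes(nodes):
    """Return a list of problems found in a special_nodes-style list."""
    problems = []
    for label, colour in nodes or []:  # None means nothing to highlight
        if not (label.startswith('doc_') or label.startswith('topic_')):
            problems.append('unexpected label: ' + label)
        if colour.startswith('#') and not HEX_COLOUR.match(colour):
            problems.append('malformed hex colour: ' + colour)
    return problems


example_nodes = [
    ('topic_98', 'aqua'),
    ('topic_293', '#ff1493'),
    ('topic_1', '#ff149'),  # deliberately broken: 5 hex digits
]
problems = check_special_nodes(example_nodes)
print(problems)
```

An empty result means every entry passed these basic checks; it does not guarantee that the label exists in the graph.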

If the 'interactive' parameter below is True, we will show an interactive visualisation of the results in a separate tab. You need to interrupt the kernel to stop it once you're done with it (from the menu above, Kernel > Interrupt).

In [None]:
ms2lda.plot_lda_fragments(consistency=0.0, interactive=True, to_highlight=special_nodes)

# e.g. for non-interactive plot
# ms2lda.plot_lda_fragments(consistency=0.50, sort_by="in_degree")