Example Notebook
============

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

<h2>1. Feature Extraction</h2>

There are two ways to generate the input matrices, described below in options (a) and (b).

<h3>a. Loading Existing Input Matrices</h3>

The first option is to run the feature extraction pipeline separately in R, producing the input matrices loaded below.

In [None]:
fragment_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_fragments.csv'
neutral_loss_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_losses.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms1.csv'
ms2_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms2.csv'
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                             ms1_filename, ms2_filename)
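
The four filenames above share one dataset prefix and differ only in their suffix. As a minimal sketch of that naming convention (the helper `input_paths` is hypothetical and not part of MS2-LDA), the paths could be built and checked before loading like this:

```python
import os

def input_paths(basedir, prefix):
    """Build the four CSV paths that share one dataset prefix.

    Hypothetical helper: it only mirrors the naming pattern used above
    (<prefix>_fragments.csv, <prefix>_losses.csv, <prefix>_ms1.csv, <prefix>_ms2.csv).
    """
    folder = os.path.join(basedir, 'input', 'final')
    suffixes = ['fragments', 'losses', 'ms1', 'ms2']
    return dict((s, os.path.join(folder, '%s_%s.csv' % (prefix, s))) for s in suffixes)

paths = input_paths('../', 'Beer_3_full1_5_2E5_pos')

# List any expected input file that is not on disk yet
missing = [p for p in paths.values() if not os.path.exists(p)]
for p in sorted(missing):
    print('missing input file: %s' % p)
```

Checking for missing files first gives a clearer message than a failure part-way through loading.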

<h3>b. Running the Feature Extraction Pipeline</h3>

Alternatively, you can call run_feature_extraction in MS2-LDA. This is a wrapper method around the feature extraction pipeline, which is written in R. It takes as input the full-scan and fragmentation files (defined in config_filename) and produces the count matrices used as input to LDA.

Note: On the R side, the main entry point for the feature extraction pipeline is <i>script_folder/startFeatureExtraction.R</i>, which loads RMassBank as a requirement. RMassBank relies on rJava, which can be fiddly to configure. A common problem when configuring rJava is described here: http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now. 

Note 2: This step also depends on http://rpy.sourceforge.net/, which doesn't seem to be well supported on Windows.

In [None]:
# path to the folder containing the R scripts used for feature extraction 
script_folder = '/home/joewandy/git/metabolomics_tools/justin/R'

# path to the configuration file for feature extraction
config_filename = os.path.join(script_folder, 'config.yml')

# too many warning messages printed from R
import warnings
warnings.filterwarnings("ignore")

# run the feature extraction pipeline, this will run for a long time!!
ms2lda = Ms2Lda.run_feature_extraction(script_folder, config_filename)

<hr/>

<h2>2. Analysis</h2>

<h3>a. Run LDA</h3>

Once the data has been loaded by performing either step 1(a) or 1(b), we're now ready to run LDA.

In [None]:
### all the parameters you need to specify to run LDA ###

n_topics = 300 # 300 - 400 topics from cross-validation
n_samples = 1000 # 100 is probably okay for testing. For manuscript, use > 500-1000.
n_burn = 0 # if 0 then we only use the last sample
n_thin = 1 # every n-th sample to use for averaging after burn-in. Ignored if n_burn = 0
alpha = 50.0/n_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions

ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta)
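
To make the roles of n_burn and n_thin concrete, here is a rough sketch of which Gibbs samples would be averaged under the comments above. The function `samples_to_average` is hypothetical, and the exact index convention (for example, whether the first sample after burn-in is kept) is an assumption rather than the documented behaviour of run_lda.

```python
def samples_to_average(n_samples, n_burn, n_thin):
    """Return the 0-based indices of the samples that would be averaged.

    Assumed convention, following the parameter comments only.
    """
    if n_burn == 0:
        # With no burn-in, only the last sample is used
        return [n_samples - 1]
    # Otherwise keep every n_thin-th sample after the burn-in period
    return list(range(n_burn, n_samples, n_thin))

n_topics = 300
alpha = 50.0 / n_topics
print(round(alpha, 4))                         # 0.1667

print(samples_to_average(1000, 0, 1))          # [999]
print(len(samples_to_average(1000, 500, 10)))  # 50
```

With the settings used in this notebook (n_burn = 0), a single sample is kept, so n_thin has no effect.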

<h3>b. (Optional) In-silico Annotation using SIRIUS</h3>

For the purpose of visualisation in step 3(c), we can annotate the MS1 and MS2 peaks using [SIRIUS](http://bio.informatik.uni-jena.de/software/sirius/), an in-silico fragmentation tool written in Java. At the moment, each parent MS1 peak and its associated MS2 spectra are run through SIRIUS separately. Isotopic information, which can be used to improve annotation, is not used yet.

If you run this annotation step before saving the project in step (c) below, the annotation information will be saved into the ms1 and ms2 peak info too.

In [None]:
sirius_platform = 'orbitrap'
ms2lda.annotate_with_sirius(sirius_platform, mode='pos') # mode is either 'pos' or 'neg'

<h3>c. (Optional) Saving Project</h3>

Save the whole project so we don't have to re-run everything the next time.

In [None]:
# leave the message parameter out if nothing to say
ms2lda.save_project('results/beer3posxx.project', message="Type any message you want here")

<hr/>

<h2>3. Results</h2>

<h3>(Optional) Resuming Project</h3>

If you saved the project in step (2c), you can resume from here the next time you load this notebook.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

In [2]:
ms2lda = Ms2Lda.resume_from('results/beer3pos.project')

Project loaded from results/beer3pos.project time taken = 17.5967371464
 - input_filenames = 
	../input/final/Beer_3_full1_5_2E5_pos_fragments.csv
	../input/final/Beer_3_full1_5_2E5_pos_losses.csv
	../input/final/Beer_3_full1_5_2E5_pos_ms1.csv
	../input/final/Beer_3_full1_5_2E5_pos_ms2.csv
 - df.shape = (1588, 3171)
 - K = 300
 - alpha = 0.166666666667
 - beta = 0.1
 - number of samples stored = 1
 - last_saved_timestamp = Thu Sep  3 14:54:21 2015


<h3>a. Thresholding</h3>

For the purpose of visualisation only, we threshold the document-topic and topic-word distributions produced by LDA, so we can say which topics are used in which documents, and which words 'belong' to a topic. This needs to be done before steps (b) and (c) below.

In [3]:
# Thresholding the doc_topic and topic_word matrices
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.01)
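
As an illustration of what thresholding means here, the sketch below applies the same two cut-offs to small made-up matrices. It uses plain Python lists and a hypothetical helper `threshold_matrix`. Whether MS2-LDA compares with >= or >, and how it stores the result, are assumptions.

```python
def threshold_matrix(matrix, threshold):
    """Set entries below the threshold to 0.0 and keep the others unchanged."""
    return [[p if p >= threshold else 0.0 for p in row] for row in matrix]

# Toy distributions (made-up numbers): 2 documents x 3 topics, 3 topics x 3 words
doc_topic = [[0.90, 0.07, 0.03],
             [0.02, 0.48, 0.50]]
topic_word = [[0.600, 0.395, 0.005],
              [0.008, 0.002, 0.990],
              [0.300, 0.300, 0.400]]

doc_topic_th = threshold_matrix(doc_topic, 0.05)
topic_word_th = threshold_matrix(topic_word, 0.01)

# Topics 'used' by each document, and words that 'belong' to each topic
topics_per_doc = [[k for k, p in enumerate(row) if p > 0] for row in doc_topic_th]
words_per_topic = [[w for w, p in enumerate(row) if p > 0] for row in topic_word_th]

print(topics_per_doc)   # [[0, 1], [1, 2]]
print(words_per_topic)  # [[0, 1], [2], [0, 1, 2]]
```

Raising either threshold gives sparser assignments. Lowering it towards 0 links every document to every topic.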

<h3>b. Print Results</h3>

Print which fragment/loss words occur with probability above the threshold in each topic.

In [None]:
ms2lda.print_topic_words()

We can also save the output to CSV files.

In [None]:
ms2lda.write_results('beer3_test_method3')

<h3>c. Visualisation</h3>

A visualisation module is provided to explore the results. It can be run in either interactive mode (in the browser) or non-interactive mode (plotting all results directly in this notebook, which can produce a lot of output!).

<h4>Set Visualisation Parameters</h4>

In [4]:
# If True, an interactive visualisation is shown in a separate tab. 
# You need to interrupt the kernel to stop it once you're done with it (from the menu above, Kernel > Interrupt).
interactive=True

In [5]:
# Used for highlighting 'consistent' topic fragments/losses during plotting. 
# A value of 0.50 means the fragment (or loss) word occurs in at least half of the plotted documents of the topic.
consistency=0.0
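
The consistency value can be read as a minimum fraction of documents. The sketch below shows one way such a filter could work. The function `consistent_words` and the word labels are hypothetical, and the actual plotting code may compute this differently.

```python
def consistent_words(docs, consistency):
    """Return the words found in at least a `consistency` fraction of the documents."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    counts = {}
    for words in docs:
        # Count each word at most once per document
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
    return sorted(w for w, c in counts.items() if c / float(n_docs) >= consistency)

# Made-up fragment/loss words for four plotted documents of one topic
plotted_docs = [
    ['fragment_85.03', 'loss_18.01'],
    ['fragment_85.03', 'loss_46.01'],
    ['fragment_85.03', 'loss_18.01', 'fragment_127.04'],
    ['fragment_69.03'],
]

print(consistent_words(plotted_docs, 0.50))      # ['fragment_85.03', 'loss_18.01']
print(len(consistent_words(plotted_docs, 0.0)))  # 5
```

With consistency set to 0.0, as in the cell above, every word passes the filter, so nothing is singled out.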

In [None]:
# Used for graph visualisation in the interactive mode only. 
# Specifies the 'special' nodes to be coloured differently.
# special_nodes = [
#     ('doc_372.18877_540.996', '#CC0000'), # maroon
#     ('doc_291.66504_547_239', 'gold'),
#     ('doc_308.17029_289.13', 'green'),
#     ('topic_244', '#CC0000'), # maroon
#     ('topic_98', 'aqua'),
#     ('topic_293', '#ff1493') # deep pink
# ]
special_nodes = None

<h4>Run Visualisation</h4>

In [None]:
ms2lda.plot_lda_fragments(consistency=consistency, interactive=interactive, to_highlight=special_nodes)

Ranking topics ...
 - topic 0 h-index = 2
 - topic 1 h-index = 2
 - topic 2 h-index = 2
 - topic 3 h-index = 4
 - topic 4 h-index = 1
 - topic 5 h-index = 2
 - topic 6 h-index = 2
 - topic 7 h-index = 3
 - topic 8 h-index = 2
 - topic 9 h-index = 2
 - topic 10 h-index = 2
 - topic 11 h-index = 4
 - topic 12 h-index = 3
 - topic 13 h-index = 3
 - topic 14 h-index = 1
 - topic 15 h-index = 2
 - topic 16 h-index = 2
 - topic 17 h-index = 2
 - topic 18 h-index = 1
 - topic 19 h-index = 3
 - topic 20 h-index = 3
 - topic 21 h-index = 3
 - topic 22 h-index = 2
 - topic 23 h-index = 3
 - topic 24 h-index = 1
 - topic 25 h-index = 1
 - topic 26 h-index = 2
 - topic 27 h-index = 2
 - topic 28 h-index = 3
 - topic 29 h-index = 3
 - topic 30 h-index = 1
 - topic 31 h-index = 2
 - topic 32 h-index = 3
 - topic 33 h-index = 2
 - topic 34 h-index = 2
 - topic 35 h-index = 2
 - topic 36 h-index = 2
 - topic 37 h-index = 3
 - topic 38 h-index = 4
 - topic 39 h-index = 2
 - topic 40 h-index = 1
 - topi

127.0.0.1 - - [05/Oct/2015 02:09:16] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:16] "GET /LDAvis.css HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:16] "GET /d3.js HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:16] "GET /LDAvis.js HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:16] "GET /images/graph_example.jpg HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:16] "GET /images/default_logo.png HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:18] "GET /topic?circle_id=ldavis_el17953139836431633272829522518-topic241&action=load HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:19] "GET /topic?circle_id=ldavis_el17953139836431633272829522518-topic241&action=set HTTP/1.1" 200 -
127.0.0.1 - - [05/Oct/2015 02:09:21] "GET /topic?circle_id=ldavis_el17953139836431633272829522518-topic241&action=load HTTP/1.1" 200 -
