Initial run
============

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

<h2>1. Feature Extraction</h2>

There are two ways to generate the input matrices, described below in options (a) and (b).

<h3>a. Loading Existing Input Matrices</h3>

Either run the feature extraction pipeline separately in R, producing the input matrices that are loaded below:

In [2]:
fragment_filename = '/home/joewandy/isabel/isabelpos_fragments.csv'
neutral_loss_filename = '/home/joewandy/isabel/isabelpos_losses.csv'
mzdiff_filename = None
ms1_filename = '/home/joewandy/isabel/isabelpos_ms1.csv'
ms2_filename = '/home/joewandy/isabel/isabelpos_ms2.csv'
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename,
                                 ms1_filename, ms2_filename)

Loading input files
Data shape (61, 2985)
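
Because the paths in option (a) are hard-coded, it can help to check that they exist before loading. The following is a minimal sketch, not part of MS2-LDA: `missing_files` is a hypothetical helper, and it skips entries set to `None` (as `mzdiff_filename` is above).

```python
import os


def missing_files(paths):
    """Return the paths that are set (not None) but do not point to an existing file."""
    return [p for p in paths if p is not None and not os.path.isfile(p)]


# Hypothetical usage with the variables defined in the cell above:
# missing = missing_files([fragment_filename, neutral_loss_filename,
#                          mzdiff_filename, ms1_filename, ms2_filename])
# if missing:
#     raise IOError('Missing input files: %s' % ', '.join(missing))
```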


<h3>b. Running the Feature Extraction Pipeline</h3>

Alternatively, call Ms2Lda.run_feature_extraction. This is a wrapper around the feature extraction pipeline, which is written in R. It takes as input the full-scan and fragmentation files (defined in config_filename) and produces the count matrices used as input to LDA.

Note: On the R side, the main entry point for the feature extraction pipeline is <i>script_folder/startFeatureExtraction.R</i>, which requires RMassBank. RMassBank relies on rJava, which can be fiddly to configure. A common problem when configuring rJava is described at http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now.

Note 2: This step also depends on http://rpy.sourceforge.net/, which doesn't seem to be well supported on Windows.

In [None]:
# path to the folder containing the R scripts used for feature extraction 
script_folder = '/Users/simon/git/metabolomics_tools/justin/R'

# path to the configuration file for feature extraction
# config_filename = os.path.join(script_folder, 'config.yml')
config_filename = '/Users/simon/git/metabolomics_tools/justin/isabel/config.yml'

# too many warning messages printed from R
import warnings
warnings.filterwarnings("ignore")

# run the feature extraction pipeline, this will run for a long time!!
ms2lda = Ms2Lda.run_feature_extraction(script_folder, config_filename)

<hr/>

<h2>2. Analysis</h2>

<h3>a. Run LDA</h3>

Once the data has been loaded in either step 1(a) or 1(b), we're ready to run LDA.

In [3]:
### all the parameters you need to specify to run LDA ###

n_topics = 10 # 300 - 400 topics from cross-validation
n_samples = 500 # 100 is probably okay for testing. For manuscript, use > 500-1000.
n_burn = 0 # if 0 then we only use the last sample
n_thin = 1 # every n-th sample to use for averaging after burn-in. Ignored if n_burn = 0
alpha = 50.0/n_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions

ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta)

Fitting model...
CGS LDA initialising
.......
Using Numba for LDA sampling
Preparing words
Preparing Z matrix
DONE
Sample 1   Log likelihood = -72493.065 
Sample 2   Log likelihood = -66294.165 
Sample 3   Log likelihood = -60491.404 
Sample 4   Log likelihood = -56909.086 
Sample 5   Log likelihood = -54987.119 
Sample 6   Log likelihood = -53772.525 
Sample 7   Log likelihood = -53165.675 
Sample 8   Log likelihood = -52743.725 
Sample 9   Log likelihood = -52427.084 
Sample 10   Log likelihood = -52038.428 
Sample 11   Log likelihood = -51671.552 
Sample 12   Log likelihood = -51538.521 
Sample 13   Log likelihood = -51306.650 
Sample 14   Log likelihood = -51106.953 
Sample 15   Log likelihood = -50816.168 
Sample 16   Log likelihood = -50754.881 
Sample 17   Log likelihood = -50690.395 
Sample 18   Log likelihood = -50584.156 
Sample 19   Log likelihood = -50363.528 
Sample 20   Log likelihood = -50257.572 
Sample 21   Log likelihood = -50412.248 
Sample 22   Log likelihood = -501
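
The comments on `n_burn` and `n_thin` above describe which Gibbs samples contribute to the final estimate. As a rough sketch of that selection rule (the helper `samples_to_average` is hypothetical, and the zero-based indexing and exact start of the retained range are assumptions rather than the sampler's actual code):

```python
def samples_to_average(n_samples, n_burn, n_thin):
    """Return the zero-based indices of the samples that would be used.

    Assumed rule, following the parameter comments:
    - n_burn == 0: only the last sample is used (n_thin is ignored);
    - otherwise: every n_thin-th sample after the burn-in period is averaged.
    """
    if n_burn == 0:
        return [n_samples - 1]
    return list(range(n_burn, n_samples, n_thin))


# With the settings used above (n_samples=500, n_burn=0), only one sample is kept.
print(samples_to_average(500, 0, 1))
# With a burn-in of 250 and thinning of 5, 50 samples would be averaged.
print(len(samples_to_average(500, 250, 5)))
```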

<h3>b. (Optional) In-silico Annotation using SIRIUS</h3>

For the purpose of visualisation in step 3(c), we can annotate the MS1 and MS2 peaks using [SIRIUS](http://bio.informatik.uni-jena.de/software/sirius/), an in-silico fragmentation tool written in Java. At the moment, each parent MS1 peak and its associated MS2 spectra are run through SIRIUS separately. Isotopic information, which can be used to improve annotation, is not used yet.

If you run this annotation step before saving the project in step (c) below, the annotation information will be saved into the ms1 and ms2 peak info too.

In [None]:
sirius_platform = 'orbitrap'
ms2lda.annotate_with_sirius(sirius_platform, mode='pos') # mode is either 'pos' or 'neg'

<h3>c. (Optional) Saving Project</h3>

Save the whole project so we don't have to re-run everything next time:

In [None]:
# leave the message parameter out if nothing to say
ms2lda.save_project('/home/joewandy/isabel/isabel.project', message="Initial run on Isabel data by Joe")

<hr/>

<h2>3. Results</h2>

<h3>(Optional) Resuming Project</h3>

If you saved the project in step 2(c), you can resume from here the next time you load this notebook:

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

In [None]:
ms2lda = Ms2Lda.resume_from('/home/joewandy/isabel/isabel.project')

<h3>a. Thresholding</h3>

For the purpose of visualisation only, we threshold the document-topic and topic-word distributions produced by LDA, so we can say which topics are used in which documents, and which words 'belong' to a topic. This needs to be done before steps (b) and (c) below.

In [5]:
# Thresholding the doc_topic and topic_word matrices
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.01)
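
To illustrate what thresholding does, here is a minimal sketch on a toy document-topic matrix. It is not the MS2-LDA implementation: `threshold_rows` is a hypothetical helper, the probabilities are made up, and whether the real comparison is strict at the boundary is an assumption.

```python
def threshold_rows(matrix, threshold):
    """For each row, return the column indices whose probability exceeds the threshold."""
    return [[j for j, p in enumerate(row) if p > threshold] for row in matrix]


# Toy document-topic distribution: 2 documents x 3 topics (made-up values).
doc_topic = [
    [0.90, 0.07, 0.03],
    [0.02, 0.48, 0.50],
]

# With th_doc_topic=0.05, document 0 uses topics 0 and 1; document 1 uses topics 1 and 2.
topics_per_doc = threshold_rows(doc_topic, 0.05)
print(topics_per_doc)  # prints [[0, 1], [1, 2]]
```

The same operation applied to each row of the topic-word matrix with `th_topic_word` would give the words that 'belong' to each topic.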

<h3>b. Print Results</h3>

Print which fragment/loss words occur with probability above the threshold in each topic.

In [6]:
ms2lda.print_topic_words()

Topic 0: fragment_124.99979 (0.159640203363), fragment_86.09653 (0.0509190457567), fragment_124.9977 (0.0485725459523), fragment_98.98431 (0.038404380133), fragment_104.10691 (0.03684004693), fragment_86.09506 (0.0344935471255), fragment_125.00208 (0.0321470473211), fragment_184.07376 (0.0313648807196), loss_58.06617 (0.0305827141181), fragment_104.10883 (0.0196323816973), fragment_60.08105 (0.0157215486899), fragment_98.98251 (0.0157215486899), fragment_71.07309 (0.0118107156824), 

Topic 1: loss_16.9656 (0.0677408959072), loss_1.98299 (0.0664518208186), loss_18.97982 (0.0361585562359), loss_17.99744 (0.0335804060587), loss_1.99306 (0.0284241057042), loss_0.93298 (0.0258459555269), fragment_126.05476 (0.021978730261), loss_19.96557 (0.0213341927167), fragment_72.08081 (0.0174669674509), fragment_126.05678 (0.0168224299065), loss_45.9923 (0.0161778923622), fragment_80.04946 (0.0148888172736), loss_0.97097 (0.013599742185), loss_45.9938 (0.0129552046407), loss_1.95394 (0.0129552046407),

We can also save the output to CSV files:

In [None]:
ms2lda.write_results('beer3_test_method3')

<h3>c. Visualisation</h3>

A visualisation module is provided to explore the results. It can be run in either interactive mode (in the browser) or non-interactive mode (plotting all results directly in this notebook, which can be a lot of output!).

<h4>Set Visualisation Parameters</h4>

In [7]:
# If True, an interactive visualisation is shown in a separate tab. 
# You need to interrupt the kernel to stop it once you're done with it (from the menu above, Kernel > Interrupt).
interactive=True

In [8]:
# Used for highlighting 'consistent' topic fragments/losses during plotting. 
# A value of 0.50 means the fragment (or loss) word occurs in at least half of the plotted documents of the topic.
consistency=0.0
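
As a sketch of how a consistency value could be applied, following the comment above: a word is 'consistent' if the fraction of a topic's plotted documents that contain it is at least `consistency`. The helper `consistent_words` and the toy documents are hypothetical and are not taken from the MS2-LDA plotting code.

```python
def consistent_words(docs, consistency):
    """Return the words that occur in at least `consistency` (a fraction) of the documents.

    docs: list of sets of words, one set per plotted document of a topic.
    """
    if not docs:
        return set()
    counts = {}
    for words in docs:
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
    n_docs = float(len(docs))
    return set(w for w, c in counts.items() if c / n_docs >= consistency)


# Toy example: four documents plotted for one topic (made-up word names).
toy_docs = [
    {'fragment_A', 'loss_B'},
    {'fragment_A'},
    {'fragment_A', 'fragment_C'},
    {'loss_B'},
]

# fragment_A is in 3/4 of the documents, loss_B in 2/4, fragment_C in 1/4.
print(sorted(consistent_words(toy_docs, 0.5)))  # prints ['fragment_A', 'loss_B']
```

Under this reading, the value `consistency=0.0` used above would mark every word as consistent.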

In [9]:
# Used for graph visualisation in the interactive mode only. 
# Specifies the 'special' nodes to be coloured differently.
# special_nodes = [
#     ('doc_372.18877_540.996', '#CC0000'), # maroon
#     ('doc_291.66504_547_239', 'gold'),
#     ('doc_308.17029_289.13', 'green'),
#     ('topic_244', '#CC0000'), # maroon
#     ('topic_98', 'aqua'),
#     ('topic_293', '#ff1493') # deep pink
# ]
special_nodes = None

<h4>Run Visualisation</h4>

In [None]:
ms2lda.plot_lda_fragments(consistency=consistency, interactive=interactive, to_highlight=special_nodes)

Ranking topics ...
 - topic 0 h-index = 7
 - topic 1 h-index = 5
 - topic 2 h-index = 6
 - topic 3 h-index = 5
 - topic 4 h-index = 4
 - topic 5 h-index = 5
 - topic 6 h-index = 2
 - topic 7 h-index = 4
 - topic 8 h-index = 4
 - topic 9 h-index = 2
DONE!

Generating plots for topic 0 h-index=7, degree=26
Generating plots for topic 2 h-index=6, degree=29
Generating plots for topic 1 h-index=5, degree=31
Generating plots for topic 3 h-index=5, degree=30
Generating plots for topic 5 h-index=5, degree=30
Generating plots for topic 4 h-index=4, degree=22
Generating plots for topic 7 h-index=4, degree=27
Generating plots for topic 8 h-index=4, degree=21
Generating plots for topic 6 h-index=2, degree=23
Generating plots for topic 9 h-index=2, degree=26
Using visualisation script defined in /LDAvis.js

Note: if you're in the IPython notebook, pyLDAvis.show() is not the best command
      to use. Consider using pyLDAvis.display(), or pyLDAvis.enable_notebook().
      See more information at htt

127.0.0.1 - - [01/Oct/2015 17:26:48] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:48] "GET /LDAvis.css HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:48] "GET /d3.js HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:48] "GET /LDAvis.js HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:48] "GET /images/graph_example.jpg HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:48] "GET /images/default_logo.png HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:50] "GET /topic?circle_id=ldavis_el124141400914198526481705957299-topic2&action=load HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:52] "GET /topic?circle_id=ldavis_el124141400914198526481705957299-topic2&action=set HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:53] "GET /topic?circle_id=ldavis_el124141400914198526481705957299-topic2&action=load HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:53] "GET /images/default_logo.png HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2015 17:26:54] "GET /topic?circle_id=ldavis_el12414140091419852