Example Notebook 2: Loading MS2LDA Analysis
==============================================

This notebook demonstrates loading an existing MS2LDA analysis containing discovered Mass2Motifs, alongside the list of MS1 and MS2 peaks with putative elemental formula annotations.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
import pandas as pd
from IPython.display import display

In [2]:
import os
import sys
basedir = '../MS2LDA/python'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

If there is any error above, ensure that basedir correctly points to the location of the MS2LDA Python code.

Save the whole project so that we don't have to re-run everything next time. The message parameter can be omitted.

In [None]:
ms2lda.save_project('beer3test.project', 
                    message="Beer3Pos analysis for the manuscript with SIRIUS EF Annotation")

<hr/>

<h2>3. Results</h2>

<h3>a. Resuming Project (Optional)</h3>

If you saved the project in step (2c), you can resume from here the next time you load this notebook.

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

import numpy as np
import pandas as pd
import pylab as plt
from IPython.display import display
from lda_for_fragments import Ms2Lda

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
ms2lda = Ms2Lda.resume_from('results/beer3pos.project')

Project loaded from results/beer3pos.project time taken = 21.2584190369
 - input_filenames = 
	../input/manuscript/Beer3pos_MS1filter_Method3_fragments.csv
	../input/manuscript/Beer3pos_MS1filter_Method3_losses.csv
	../input/manuscript/Beer3pos_MS1filter_Method3_ms1.csv
	../input/manuscript/Beer3pos_MS1filter_Method3_ms2.csv
 - df.shape = (1422, 4496)
 - K = 300
 - alpha = 0.166666666667
 - beta = 0.1
 - number of samples stored = 1
 - last_saved_timestamp = Fri Oct 16 02:56:11 2015
 - message = Beer3Pos analysis for the manuscript with SIRIUS EF Annotation


In [3]:
display(ms2lda.ms1)

Unnamed: 0,peakID,MSnParentPeakID,msLevel,rt,mz,intensity,Sample,GroupPeakMSn,CollisionEnergy,annotation
1,1,0,1,578.503,70.065165,16431632.0000,1,0,0,C4H7N
5,5,0,1,566.043,72.080799,1067314.7500,1,0,0,C4H9N
12,12,0,1,1210.110,72.080818,1025769.1250,1,0,0,C4H9N
15,15,0,1,468.470,73.064788,925079.4375,1,0,0,C4H8O
19,19,0,1,656.240,76.039316,1047793.6875,1,0,0,
24,24,0,1,1027.660,76.075699,1636355.7500,1,0,0,C3H9NO
33,33,0,1,562.308,81.033514,766053.0000,1,0,0,C5H4O
37,37,0,1,632.557,83.060245,705789.6250,1,0,0,C4H6N2
42,42,0,1,486.476,84.044375,2827970.7500,1,0,0,C4H5NO
46,46,0,1,566.043,84.080711,978819.3750,1,0,0,C5H9N


In [4]:
display(ms2lda.ms2)

Unnamed: 0,peakID,MSnParentPeakID,msLevel,rt,mz,intensity,Sample,GroupPeakMSn,CollisionEnergy,fragment_bin_id,loss_bin_id,annotation
792,792,789,2,577.268,70.065056,1.000000,1,0,0,70.06514,46.00539,C4H7N
942,942,932,2,531.059,118.086075,1.000000,1,0,0,118.08614,,C5H11NO2
17512,17512,17496,2,604.701,104.107330,1.000000,1,0,0,104.10738,154.0027,C5H13NO
307,307,298,2,933.275,104.107353,1.000000,1,0,0,104.10738,,C5H13NO
2648,2648,2632,2,495.514,136.062347,1.000000,1,0,0,136.06239,,C5H5N5
936,936,932,2,531.059,58.065453,0.533076,1,0,0,58.06552,60.02094,C3H7N
14567,14567,14559,2,531.059,118.086113,1.000000,1,0,0,118.08614,117.07884,C5H11NO2
303,303,298,2,933.275,60.080959,0.485118,1,0,0,60.08102,44.02604,C3H9N
2916,2916,2888,2,545.999,138.054489,1.000000,1,0,0,138.05452,,C7H7NO2
937,937,932,2,531.059,59.073425,0.344497,1,0,0,59.07323,59.01322,C3H8N


<h3>b. Thresholding</h3>

For the purpose of visualisation only, we threshold the document-topic and topic-word distributions produced by LDA, so that we can say which topics are used in which documents, and which words 'belong' to a topic. This needs to be done before the steps that follow.

In [5]:
# Thresholding the doc_topic and topic_word matrices
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.01)
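To illustrate what thresholding means here, the following sketch applies a cut-off of 0.05 to a small, made-up document-topic matrix. It is not the code inside `do_thresholding`: the matrix values, the `>=` comparison and the choice to zero out small entries are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 4 topics, each row sums to 1
doc_topic = np.array([
    [0.90, 0.06, 0.03, 0.01],
    [0.02, 0.48, 0.48, 0.02],
    [0.25, 0.25, 0.25, 0.25],
])
th_doc_topic = 0.05

# Keep entries at or above the threshold and set the rest to zero
thresholded = np.where(doc_topic >= th_doc_topic, doc_topic, 0.0)

# For each document, list the indices of the topics it is said to use
topics_per_doc = [np.flatnonzero(row).tolist() for row in thresholded]
print(topics_per_doc)  # [[0, 1], [1, 2], [0, 1, 2, 3]]
```

The topic-word matrix would be handled in the same way with its own threshold (0.01 in the call above).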

From this point onwards, we will refer to an LDA topic as **Mass2Motif** when interpreting the results.

<h3>c. Print Results</h3>

Print which fragment/loss features occur with probability above the thresholds defined above in each Mass2Motif.

In [None]:
ms2lda.print_motif_features()

We can also save the output to CSV files.

In [None]:
ms2lda.write_results('beer3pos_csv_out')

<h3>d. Cosine Clustering (optional)</h3>

We plot the cosine clustering of the parent ions to investigate how far the groups found by cosine clustering agree with the motifs found by LDA. First we need to construct the clustering. To do this, we can use either hierarchical clustering with Euclidean distance from scipy or an alternative greedy clustering method that we devised, which works as follows: **(1)** find the unprocessed parent ion with the highest intensity, **(2)** group it with the other parents whose cosine similarity to it is over the threshold (0.55), **(3)** repeat until all parents have been processed.
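The three greedy steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code behind `run_cosine_clustering`: the function name `greedy_cosine_clustering`, the toy spectra and the rule that only still-unprocessed parents can join a new cluster are hypothetical choices.

```python
import numpy as np

def greedy_cosine_clustering(spectra, intensities, threshold=0.55):
    """Return a cluster label (1, 2, ...) for each row of `spectra`."""
    spectra = np.asarray(spectra, dtype=float)
    # Normalise rows to unit length so that a dot product is a cosine similarity
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    unit = spectra / np.where(norms == 0, 1.0, norms)

    labels = np.zeros(len(spectra), dtype=int)  # 0 means "not yet processed"
    next_label = 1
    # Step 1: visit parents from the most to the least intense
    for seed in np.argsort(-np.asarray(intensities, dtype=float)):
        if labels[seed] != 0:
            continue
        # Step 2: group the seed with unprocessed parents above the threshold
        sims = unit.dot(unit[seed])
        members = (sims > threshold) & (labels == 0)
        members[seed] = True
        labels[members] = next_label
        next_label += 1  # Step 3: continue until every parent has a label
    return labels

# Toy data: 4 parent ions described by 3 fragment features each
toy_spectra = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.6]]
toy_intensities = [10.0, 5.0, 8.0, 2.0]
labels = greedy_cosine_clustering(toy_spectra, toy_intensities)
print(labels.tolist())  # [1, 1, 2, 2]
```

With labels numbered from 1, the largest label equals the number of clusters.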


In [None]:
# method is either 'hierarchical' or 'greedy'
# peak_names, clustering = ms2lda.run_cosine_clustering(method='hierarchical', th_clustering=0.80)
peak_names, clustering = ms2lda.run_cosine_clustering(method='greedy', th_clustering=0.55)

In [None]:
print("Found {} clusters".format(np.max(clustering)))

Then we pick some Mass2Motifs, e.g. 102 and 220 below, and plot how all the parent ions cluster based on their cosine similarity. The parent ions of interest that have been assigned to the motif (above the threshold) are indicated as red dots in the plot.

In [None]:
G, cluster_interests = ms2lda.plot_cosine_clustering(102, clustering, peak_names)
for cluster in sorted(cluster_interests):
    interest_members = cluster_interests[cluster]
    print("Cluster %d with %d parent ion(s) of interest:" % (cluster, len(interest_members)))
    display(ms2lda.ms1[ms2lda.ms1['peakID'].isin(interest_members)])

In [None]:
G, cluster_interests = ms2lda.plot_cosine_clustering(220, clustering, peak_names)
for cluster in sorted(cluster_interests):
    interest_members = cluster_interests[cluster]
    print("Cluster %d with %d parent ion(s) of interest:" % (cluster, len(interest_members)))
    display(ms2lda.ms1[ms2lda.ms1['peakID'].isin(interest_members)])

<h3>e. Visualisation</h3>

A visualisation module is provided to explore the results. It can be run either interactively in the browser or non-interactively by plotting all results in this notebook (which can be a lot of plots!).

<h4>Set Visualisation Parameters</h4>

In [6]:
# If True, an interactive visualisation is shown in a separate tab. 
# You need to interrupt the kernel to stop it once you're done with it (from the menu above, Kernel > Interrupt).
interactive=True

In [7]:
# Used for graph visualisation in the interactive mode only. 
# Specifies the 'special' nodes to be coloured differently.
special_nodes = [
    # you can colour the MS1 peak in the graph
    # 'doc_peakid', where peakid is the peak ID of the MS1 peak    
    ('doc_21758', 'gold'),
    # you can also colour the Mass2Motif in the graph
    ('motif_0', '#ff1493')
]

# If nothing ..
# special_nodes = None

In [8]:
# read the annotation assigned to each Mass2Motif from a CSV file
# this could also be stored in e.g. a database
import csv
motif_annotation = {}
# each row holds the motif ID followed by its annotation text
with open("results/beer3pos_annotation.csv") as f:
    for item in csv.reader(f, skipinitialspace=True):
        key = int(item[0])
        val = item[1]
        print(str(key) + '\t' + val)
        motif_annotation[key] = val

# here we set all the motifs having annotations as special nodes too
motif_colour = '#CC0000'
to_add_list = ['motif_' + str(x) for x in motif_annotation.keys()]
for item in to_add_list:
    special_nodes.append((item, motif_colour))

# If nothing ..
# motif_annotation = {} # or just leave the 'additional_info' parameter out when calling plot_lda_fragments below

260	Water loss indicative of a free hydroxyl group - often seen in sugary structures
262	Carboxylic acid group (COOH) - generic substructure in amino acids and organic acids
226	Loss of [hexose-H2O] - suggests hexose conjugation (e.g. glucose) substructure
158	Leucine related substructure
243	Conjugation of a phosphate group (H4O4P) substructure
127	Conjugation of a phosphate group (H4O4P) substructure
174	Pyroglutamic acid (pyroglutamate) substructure
59	Pyroglutamic acid (pyroglutamate) substructure
214	Amine loss,  suggests free NH2 group in fragmented molecule
60	Double water loss for metabolites containing OH groups + aliphatic chain,  e.g. sugars
151	[proline-H2O],  suggests conjugated proline substructure.
40	Imidazole group linked to a carboxylgroup through one CH2 group
284	Suggests dihydroxylated benzene ring substructure
276	Alkyl aromatic substructure,  suggests aromatic ring with 2-carbon alkyl chain attached
45	Pipecolic acid (pipecolate) substructure
78	Trimethylated ami
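Several of the annotations printed above contain commas, so the CSV file presumably quotes the text field. The sketch below shows why `skipinitialspace=True` matters in that case: it lets the reader recognise a quote that follows the space after the delimiter. The two-column, quoted layout of the in-memory sample is an assumption about the real file, and the first description is shortened.

```python
import csv
import io

# In-memory stand-in for the annotation file (assumed layout: motif ID,
# then a quoted free-text description)
sample = io.StringIO(
    '260, "Water loss indicative of a free hydroxyl group"\n'
    '214, "Amine loss, suggests free NH2 group in fragmented molecule"\n'
)

motif_annotation = {}
for row in csv.reader(sample, skipinitialspace=True):
    motif_annotation[int(row[0])] = row[1]

# Build (node name, colour) pairs in the same form as special_nodes
motif_colour = '#CC0000'
annotated_nodes = [('motif_%d' % k, motif_colour) for k in sorted(motif_annotation)]
print(annotated_nodes)
```

Without `skipinitialspace=True`, the leading space would prevent the quote from being treated as a field delimiter, and the second description would be split at its comma.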

<h4>Run Visualisation</h4>

In [None]:
ms2lda.plot_lda_fragments(interactive=interactive, to_highlight=special_nodes, additional_info=motif_annotation)

Ranking motifs ...
 - Mass2Motif 0 h-index = 2
 - Mass2Motif 1 h-index = 2
 - Mass2Motif 2 h-index = 1
 - Mass2Motif 3 h-index = 2
 - Mass2Motif 4 h-index = 3
 - Mass2Motif 5 h-index = 5
 - Mass2Motif 6 h-index = 4
 - Mass2Motif 7 h-index = 5
 - Mass2Motif 8 h-index = 3
 - Mass2Motif 9 h-index = 5
 - Mass2Motif 10 h-index = 4
 - Mass2Motif 11 h-index = 4
 - Mass2Motif 12 h-index = 4
 - Mass2Motif 13 h-index = 5
 - Mass2Motif 14 h-index = 2
 - Mass2Motif 15 h-index = 3
 - Mass2Motif 16 h-index = 4
 - Mass2Motif 17 h-index = 4
 - Mass2Motif 18 h-index = 2
 - Mass2Motif 19 h-index = 6
 - Mass2Motif 20 h-index = 3
 - Mass2Motif 21 h-index = 2
 - Mass2Motif 22 h-index = 5
 - Mass2Motif 23 h-index = 7
 - Mass2Motif 24 h-index = 3
 - Mass2Motif 25 h-index = 3
 - Mass2Motif 26 h-index = 6
 - Mass2Motif 27 h-index = 3
 - Mass2Motif 28 h-index = 3
 - Mass2Motif 29 h-index = 1
 - Mass2Motif 30 h-index = 2
 - Mass2Motif 31 h-index = 2
 - Mass2Motif 32 h-index = 2
 - Mass2Motif 33 h-index = 3
 - Ma