Example Notebook 1: Performing MS2LDA Analysis
==============================================

This notebook demonstrates the Mass2Motif discovery stage of MS2LDA. We describe how to create the necessary count (MS1-MS2) matrices in R, load these count matrices into the Python workflow, run LDA, optionally perform in-silico elemental formula assignment, and save the results.

If the R feature extraction in step 1 has already been run, skip ahead to the cell at the end of step 1 that loads the resulting CSV files.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
import pandas as pd
from IPython.display import display

In [None]:
import os
import sys
basedir = '../MS2LDA/python'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

If the import above fails, check that basedir points to the location of the MS2LDA Python code.

<h2>1. Creating the count matrices: Feature Extraction using R</h2>

Run the R feature extraction pipeline to produce the input matrices for MS2LDA. All the R scripts needed for feature extraction can be found in the "R" folder. The entry point to the pipeline is **R/MS1MS2_MatrixGeneration.R**. Load that file in e.g. RStudio, set the working directory to the "R" folder, and configure the pipeline parameters in the **config.yml** file. The location of the mzXML and/or mzML files needs to be specified in that file. Raw vendor formats can be converted into the required open formats using MSconvert (ProteoWizard, http://proteowizard.sourceforge.net/tools.shtml). The default parameter settings provided in config.yml are for the example data, which was generated from pHILIC-MS runs in positive ionisation mode. You are advised to modify the parameter settings according to the platform that was used.

Note: RMassBank is one of the dependencies of the pipeline, and it relies on rJava. A common problem that you might encounter when configuring rJava is described here: http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now

Below is an example of how to load the CSV files produced for the Beer3 positive-mode data after running the feature extraction pipeline in R.

In [None]:
input_dir = 'input_files'

fragment_filename = os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_fragments.csv')
neutral_loss_filename = os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_losses.csv')
mzdiff_filename = None # mz differences feature -- unused

ms1_filename = os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_ms1.csv')
ms2_filename = os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_ms2.csv')

In [None]:
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename, 
                             ms1_filename, ms2_filename)
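
Before loading, it can help to confirm that the CSV files produced by the R pipeline are actually present. The following is a minimal sketch, not part of MS2LDA: `missing_input_files` is a hypothetical helper, and the file names are simply those used in the cell above.

```python
import os

def missing_input_files(filenames):
    """Return the paths that do not exist on disk; None entries (unused features) are skipped."""
    return [f for f in filenames if f is not None and not os.path.isfile(f)]

input_dir = 'input_files'
expected = [
    os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_fragments.csv'),
    os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_losses.csv'),
    None,  # mz differences feature, unused
    os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_ms1.csv'),
    os.path.join(input_dir, 'Beer3pos_MS1filter_Method3_ms2.csv'),
]

missing = missing_input_files(expected)
if missing:
    print('Missing input files: %s' % ', '.join(missing))
else:
    print('All input files found.')
```

If any file is reported missing, re-run the R pipeline or correct `input_dir` before loading the data.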

<h2>2. Run LDA</h2>

Run LDA inference using either Gibbs sampling or variational inference. The following are the common LDA parameters for both methods:
- **n_topics** is the number of Mass2Motifs to discover
- **alpha** is the hyper-parameter for document-topic distributions
- **beta** is the hyper-parameter for topic-word distributions

In [None]:
n_topics = 300
alpha = 50.0/n_topics
beta = 0.1
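
To build intuition for **alpha**, the sketch below draws one document-topic distribution from a symmetric Dirichlet prior using NumPy. It is only an illustration of the prior and does not call any MS2LDA code; the contrasting value of 10.0 and the 1% threshold are arbitrary choices made for this example.

```python
import numpy as np

n_topics = 300
alpha = 50.0 / n_topics  # same heuristic as in the cell above

rng = np.random.RandomState(0)  # fixed seed, for reproducibility only

# Each draw is one hypothetical document's distribution over Mass2Motifs
theta_small_alpha = rng.dirichlet(np.full(n_topics, alpha))
theta_large_alpha = rng.dirichlet(np.full(n_topics, 10.0))  # arbitrary larger value, for contrast

def n_active(theta, threshold=0.01):
    """Count the topics whose probability exceeds `threshold`."""
    return int((theta > threshold).sum())

print('alpha = %.3f: %d topics above 1%%' % (alpha, n_active(theta_small_alpha)))
print('alpha = 10.0: %d topics above 1%%' % n_active(theta_large_alpha))
```

A small alpha concentrates each document's probability mass on relatively few Mass2Motifs, whereas a large alpha spreads it almost evenly across all of them.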

For collapsed Gibbs sampling, we also have the following parameters:
- **n_samples** is the number of posterior samples to obtain during Gibbs sampling
- **n_burn** is the number of burn-in samples to discard; n_burn should be < n_samples. If 0, then only the last sample is used.
- **n_thin** is the thinning interval: every n_thin-th sample after burn-in is used for averaging. Ignored if n_burn = 0.

In [None]:
n_samples = 1000
n_burn = 0
n_thin = 1

ms2lda.run_lda_gibbs(n_topics, n_samples, n_burn, n_thin, alpha, beta)
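
The interaction between the burn-in and thinning parameters can be sketched as follows. This is an assumption about the convention, based only on the parameter descriptions in this notebook; `kept_sample_indices` is a hypothetical helper, and the exact indexing inside MS2LDA may differ.

```python
def kept_sample_indices(n_samples, n_burn, n_thin):
    """Indices of the Gibbs samples that would contribute to the final estimate.

    Assumed convention: with no burn-in only the last sample is used;
    otherwise every n_thin-th sample after the burn-in period is averaged.
    """
    if n_burn == 0:
        return [n_samples - 1]
    return list(range(n_burn, n_samples, n_thin))

# Settings used in the cell above: only the final sample is kept
print(kept_sample_indices(1000, 0, 1))

# Hypothetical alternative: discard 250 samples, then keep every 5th one
print(len(kept_sample_indices(1000, 250, 5)))
```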

Alternatively we can run LDA using variational inference. The following parameter is to be specified:
- **n_its** is the number of variational Bayes (VB) iterations

In [None]:
n_its = 1000

ms2lda.run_lda_vb(n_topics, n_its, alpha, beta)

<h2>3a. In-silico Annotation using EF Assigner (Optional)</h2>

For visualisation, a simple in-silico annotation method (EF Assigner) is provided. The method works by combinatorially enumerating all candidate formulae that can be produced by the precursor mass, applying the 7 golden rules to reduce the candidate set and returning the formula closest in mass to the observed precursor mass as the 'top hit'. This method does not assign formulae to the losses, only the fragments.

Adjust the parameter settings below to perform MS1 peak annotation:

- **mode** is either 'pos' or 'neg'
- **target** is either 'ms1', 'ms2_fragment' or 'ms2_loss'
- **ppm** is the mass tolerance, in parts per million, relative to the theoretical mass
- **max_mass** is the maximum mass to process from the input
- On top of the 7 golden rules, we implement a heuristic rule 8 to specify the maximum number of occurrences of each atom (via the **max_occurrences** parameter).
- **n_stages** defines the number of stages in the annotation. If n_stages is 2, then in the first stage, the elemental formula search is performed for the atoms CHNOPS. In the second stage, unannotated masses are processed with C13, F and Cl added to the list of atoms to search.

In [None]:
max_occurrences = {'N':6, 'S': 2, 'P': 2, 'C13':1, 'F':0, 'Cl':0}
n_stages = 2
tol = 3

ms2lda.annotate_peaks(mode='pos', target='ms1', ppm=5, max_mass=250, 
                      rule_8_max_occurrences=max_occurrences, n_stages=n_stages)
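
The ppm tolerance and the selection of the 'top hit' can be illustrated with a minimal sketch. It is not the EF Assigner implementation: the candidate list is hand-picked rather than enumerated, the 7 golden rules are not applied, masses are treated as neutral (no adduct correction for the ionisation mode), and all function names are hypothetical.

```python
# Monoisotopic masses (Da) of the most abundant isotopes
ATOM_MASS = {'C': 12.0, 'H': 1.00782503207, 'N': 14.0030740048,
             'O': 15.99491461956, 'P': 30.97376163, 'S': 31.97207100}

def formula_mass(counts):
    """Theoretical monoisotopic mass of a formula given as {atom: count}."""
    return sum(ATOM_MASS[atom] * n for atom, n in counts.items())

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million, relative to the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6

def top_hit(observed_mass, candidates, ppm):
    """Name of the candidate closest in mass within the tolerance, or None."""
    best_name, best_err = None, None
    for name, counts in candidates.items():
        err = abs(ppm_error(observed_mass, formula_mass(counts)))
        if err <= ppm and (best_err is None or err < best_err):
            best_name, best_err = name, err
    return best_name

# Hand-picked candidates with similar nominal mass (illustration only)
candidates = {
    'C6H12O6': {'C': 6, 'H': 12, 'O': 6},
    'C9H8O4': {'C': 9, 'H': 8, 'O': 4},
    'C5H12N2O5': {'C': 5, 'H': 12, 'N': 2, 'O': 5},
}

print(top_hit(180.0634, candidates, ppm=5))
```

Only candidates whose absolute ppm error is within the tolerance are considered, so a mass with no sufficiently close formula remains unannotated.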

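MS2 fragments/losses can also be annotated by specifying a different *target* parameter. The *ppm* parameter can also be given as a list of (mass, ppm) pairs to make the tolerance depend on mass. In the example below, for mass <= 70, a tolerance of 10 ppm is used, whilst for 70 < mass <= 200, a tolerance of 5 ppm is used.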

In [None]:
max_occurrences = {'N':6, 'S': 2, 'P': 2, 'C13':1, 'F':0, 'Cl':0}
n_stages = 1
tol = [(70, 10), (200, 5)]

# annotate the elemental formulae of MS2 fragments
ms2lda.annotate_peaks(mode='pos', target='ms2_fragment', ppm=tol, max_mass=200, 
                      rule_8_max_occurrences=max_occurrences, n_stages=n_stages)
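
A mass-dependent tolerance such as `tol` above can be looked up as sketched below. This assumes that each tuple is read as (upper mass bound, ppm tolerance); that reading is an interpretation of the example, and `ppm_for_mass` is a hypothetical helper rather than an MS2LDA function.

```python
def ppm_for_mass(mass, tol):
    """Return the ppm tolerance for `mass`, or None if it exceeds every bound.

    Assumption: `tol` is a list of (upper_mass_bound, ppm) tuples, and the
    first bound that is >= mass determines the tolerance.
    """
    for upper_bound, ppm in sorted(tol):
        if mass <= upper_bound:
            return ppm
    return None

tol = [(70, 10), (200, 5)]

for mass in (55.0, 150.0, 250.0):
    print('%.1f -> %s' % (mass, ppm_for_mass(mass, tol)))
```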

Neutral losses can also be annotated by setting the *target* parameter to 'ms2_loss' and *mode* to 'none'.

In [None]:
ms2lda.annotate_peaks(mode='none', target='ms2_loss', ppm=5, max_mass=200, n_stages=1)

Display the MS1 and MS2 dataframes after annotation.

In [None]:
display(ms2lda.ms1)

In [None]:
display(ms2lda.ms2)

<h2>3b. In-silico Annotation using SIRIUS (Optional)</h2>

MS1 and MS2 peaks can also be annotated using [SIRIUS](http://bio.informatik.uni-jena.de/software/sirius/), an in-silico fragmentation tool written in Java. At the moment, each parent MS1 peak and its associated MS2 spectra are run through SIRIUS separately. Isotopic information, which can be used to improve annotation, is not used yet.

It might be tricky to get SIRIUS running, especially on the latest Mac OS X. If you have not installed SIRIUS yet, we recommend using the EF Assigner above.

Parameters:
- **sirius_platform** specifies the profile used by SIRIUS. Refer to the SIRIUS manual for more details.
- **ppm_max** is the mass tolerance (in ppm) used by SIRIUS when assigning elemental formulae
- **mode** is either 'pos' or 'neg'
- **max_ms1** is the m/z cut-off for annotation: any MS1 peak with m/z above this value (400 in the example below) is excluded, since larger masses take too long to process

In [None]:
ms2lda.annotate_with_sirius(sirius_platform='orbitrap', ppm_max=5, mode='pos', max_ms1=400)

In [None]:
display(ms2lda.ms1)

In [None]:
display(ms2lda.ms2)

<h2>4. Saving MS2LDA Project</h2>

Finally, the entire MS2LDA project, including the LDA topics (Mass2Motifs) and elemental formula assignments, can be saved and reloaded later, for example to load the generated LDA model into the visualisation module (see example_notebook_2). Saving the project is highly recommended. The message parameter can be omitted.

In [None]:
ms2lda.save_project('projects/beer3test.project', 
                    message="Beer3Pos analysis for the manuscript with SIRIUS EF Annotation")
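
Whether `save_project` creates the output folder itself is not stated here, so it may be safer to make sure the folder exists before saving. A minimal sketch, in which `ensure_parent_dir` is a hypothetical helper and not part of MS2LDA:

```python
import os

def ensure_parent_dir(path):
    """Create the parent folder of `path` if it does not exist yet; return that folder."""
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return parent

# Creates the 'projects' folder in the current working directory if needed
ensure_parent_dir(os.path.join('projects', 'beer3test.project'))
```

Run this cell before the `save_project` call above if the 'projects' folder might be missing.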