Extending mzR

Laurent edited this page Apr 14, 2014 · 1 revision

Project description

The mzR R/Bioconductor package provides a unified API to the common open, community-driven file formats and parsers for mass spectrometry data, namely mzXML, mzML and mzData (see the vignette for details). It uses C and C++ code from third-party open-source projects and relies heavily on the Rcpp package, notably to provide a direct mapping from R to the C++ infrastructure. Currently, mzR provides two backends to read raw mass spectrometry data:

  1. netCDF, which reads, as the name implies, netCDF data.
  2. RAMP, which reads mzData and mzXML via the ISB RAMP parser. This backend can also read mzML through the RAMPadapter wrapper around the proteowizard infrastructure, but this interface is limited to the lowest common denominator between the mzXML, mzData and mzML formats.

This project is intended to add several related backends to mzR by providing a direct wrapper around -- and full access to -- the proteowizard msdata object. The candidate will interact closely with Laurent Gatto and Steffen Neumann, and with the proteowizard and Rcpp communities.
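The backend arrangement described above can be pictured with a small C++ sketch. It is only an illustration under stated assumptions: `MsBackend`, `StubBackend`, `openBackend` and `totalIonCurrent` are hypothetical names invented here, not mzR or proteowizard classes, and the stub holds made-up peaks in memory instead of parsing a file.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout: none of these are mzR or proteowizard classes.
struct Peak {
    double mz;
    double intensity;
};

// Common interface that every backend implements, so that calling code
// never depends on which parser sits underneath.
class MsBackend {
public:
    virtual ~MsBackend() = default;
    virtual std::string name() const = 0;
    virtual std::size_t spectrumCount() const = 0;
    virtual std::vector<Peak> peaks(std::size_t scan) const = 0;
};

// Stand-in backend that keeps spectra in memory. A real backend would
// delegate to a parser (for example RAMP or a netCDF reader) instead.
class StubBackend : public MsBackend {
public:
    StubBackend(std::string name, std::vector<std::vector<Peak>> spectra)
        : name_(std::move(name)), spectra_(std::move(spectra)) {}

    std::string name() const override { return name_; }
    std::size_t spectrumCount() const override { return spectra_.size(); }
    std::vector<Peak> peaks(std::size_t scan) const override {
        if (scan >= spectra_.size()) {
            throw std::out_of_range("scan index out of range");
        }
        return spectra_[scan];
    }

private:
    std::string name_;
    std::vector<std::vector<Peak>> spectra_;
};

// Factory that selects a backend by name. Only the two backends named in
// the text are accepted; both return the same illustrative data here.
std::unique_ptr<MsBackend> openBackend(const std::string& backend) {
    if (backend == "netCDF" || backend == "Ramp") {
        std::vector<std::vector<Peak>> spectra;
        spectra.push_back({Peak{100.0, 10.0}, Peak{200.0, 50.0}});
        spectra.push_back({Peak{150.0, 5.0}});
        return std::make_unique<StubBackend>(backend, std::move(spectra));
    }
    throw std::invalid_argument("unknown backend: " + backend);
}

// Backend-agnostic helper: written once against the interface, it works
// unchanged for any backend added later.
double totalIonCurrent(const MsBackend& ms, std::size_t scan) {
    double tic = 0.0;
    for (const Peak& p : ms.peaks(scan)) {
        tic += p.intensity;
    }
    return tic;
}
```

In this sketch, adding a backend means writing one more class that implements `MsBackend` and one more branch in the factory; code such as `totalIonCurrent` stays untouched. How mzR actually organises its backends may well differ.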

The pwiz/mzML backend

The pwiz/mzML backend should be a drop-in replacement and should also pass the unit tests of the Bioconductor XCMS and MSnbase packages. Any modifications required in XCMS and MSnbase will be made by Steffen Neumann and Laurent Gatto, respectively. Secondly, the pwiz/mzML backend should provide access to the <chromatogram> elements stored in an mzML file (Martens et al. 2011).

The mzIdentML backend/format

The project also aims to facilitate access to identification data in the mzIdentML data format (Jones et al. 2012) through the proteowizard framework. A backend similar to the one currently available for raw mass spectrometry files (mzXML, mzML, mzData) will be developed for mzIdentML files.

At the end of the project, the candidate will be familiar with the major mass-spectrometry data formats and main MS toolkits used in proteomics and metabolomics. After successful completion of the project, the candidate will be added to the list of mzR contributors.

Project attributes and estimates:

  • Difficulty: medium to difficult, depending on experience and C++ fluency.
  • Skills needed: intermediate R programming, knowledge of package development helpful, good knowledge of C and especially C++ essential. The candidate will have to familiarise herself with the mass-spectrometry data, the respective data formats and the proteowizard code base.
  • Deliverable: pwiz and identification backends to be added to the mzR package.
  • Mentors: Laurent Gatto and Steffen Neumann, with additional Rcpp support from Dirk Eddelbuettel.
  • References: see project description.