Depthcharge v0.4.0

@wfondrie wfondrie released this 17 Apr 20:22
98035ec

We have completely reworked the data module.
Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.

We now also have full support for small molecules, with the MoleculeTokenizer,
AnalyteTransformerEncoder, and AnalyteTransformerDecoder classes.

Breaking Changes

  • PeptideTransformer* are now AnalyteTransformer*, providing full support for small molecule analytes. Additionally, the interface has been completely reworked.
  • Mass spectrometry data parsers now function as iterators, yielding batches of spectra as pyarrow.RecordBatch objects.
  • Parsers can now be told to read arbitrary fields from their respective file formats with the custom_fields parameter.
  • The parsing functionality of SpectrumDataset and its subclasses has been moved to the spectra_to_* functions in the data module.
  • SpectrumDataset and its subclasses now return dictionaries of data rather than tuples. This allows us to incorporate arbitrary additional data.
  • SpectrumDataset and its subclasses are now lance.torch.data.LanceDataset subclasses, providing native PyTorch integration.
  • Dataset classes no longer have a loader() method.
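
As a rough illustration of why batches keyed by field name leave room for arbitrary extra data, here is a minimal plain-Python sketch. It does not use Depthcharge or PyArrow: `iter_batches` and the field names (`precursor_mz`, `mz_array`, `intensity_array`, `scan_id`) are hypothetical, and a dict of lists stands in for a pyarrow.RecordBatch.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Spectrum = Dict[str, Any]


def iter_batches(
    spectra: Iterable[Spectrum], batch_size: int
) -> Iterator[Dict[str, List[Any]]]:
    """Yield column-oriented batches (dict of lists).

    Hypothetical helper: a plain-Python stand-in for a parser that
    yields record batches. It is not part of Depthcharge.
    """
    iterator = iter(spectra)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        # Every key in the first row becomes a column, so any extra
        # field (here "scan_id") is carried along without code changes.
        yield {key: [row[key] for row in rows] for key in rows[0]}


# Made-up example spectra; "scan_id" plays the role of a custom field.
spectra = [
    {"precursor_mz": 500.25, "mz_array": [100.0, 200.0],
     "intensity_array": [1.0, 0.5], "scan_id": "scan-1"},
    {"precursor_mz": 433.1, "mz_array": [150.0],
     "intensity_array": [0.8], "scan_id": "scan-2"},
    {"precursor_mz": 612.8, "mz_array": [90.0, 310.0, 420.0],
     "intensity_array": [0.2, 1.0, 0.4], "scan_id": "scan-3"},
]

batches = list(iter_batches(spectra, batch_size=2))
```

Downstream code looks fields up by name (`batch["precursor_mz"]`), so adding a new field does not shift positions the way it would in a tuple.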

Added

  • Support for small molecules.
  • Added the StreamingSpectrumDataset for fast inference.
  • Added spectra_to_df, spectra_to_parquet, spectra_to_stream to the depthcharge.data module.

Changed

  • Determining the mass spectrometry data file format is now less fragile.
    It now looks for known line contents, rather than relying on the extension.
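
The idea of detecting a format from file contents rather than the extension can be sketched as follows. This is a simplified illustration, not Depthcharge's actual implementation: `sniff_format`, the marker table, and the `max_lines` limit are all assumptions made for the example.

```python
import io
from typing import Optional, TextIO

# Assumed markers: strings that typically appear near the top of each
# file type. The real detection logic may look for different content.
_MARKERS = (
    ("mzml", "<mzML"),
    ("mzxml", "<mzXML"),
    ("mgf", "BEGIN IONS"),
)


def sniff_format(handle: TextIO, max_lines: int = 50) -> Optional[str]:
    """Guess a file format from its first lines (hypothetical helper).

    Returns the format name, or None if no known marker is found
    within the first `max_lines` lines.
    """
    for _, line in zip(range(max_lines), handle):
        for name, marker in _MARKERS:
            if marker in line:
                return name
    return None


# In-memory examples; a real caller would pass an open file handle.
mzml_text = '<?xml version="1.0" encoding="utf-8"?>\n<mzML version="1.1.0">\n'
mgf_text = "BEGIN IONS\nTITLE=example\nPEPMASS=500.25\nEND IONS\n"

mzml_guess = sniff_format(io.StringIO(mzml_text))
mgf_guess = sniff_format(io.StringIO(mgf_text))
unknown_guess = sniff_format(io.StringIO("just some text\n"))
```

Because the check reads the contents, a file with a misleading or missing extension would still be recognized under this approach.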