Depthcharge v0.4.0
We have completely reworked the data module.
Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.
We now also have full support for small molecules, with the MoleculeTokenizer,
AnalyteTransformerEncoder, and AnalyteTransformerDecoder classes.
Breaking Changes
- PeptideTransformer* are now AnalyteTransformer*, providing full support for small molecule analytes. Additionally, the interface has been completely reworked.
- Mass spectrometry data parsers now function as iterators, yielding batches of spectra as pyarrow.RecordBatch objects.
- Parsers can now be told to read arbitrary fields from their respective file formats with the custom_fields parameter.
- The parsing functionality of SpectrumDataset and its subclasses has been moved to the spectra_to_* functions in the data module.
- SpectrumDataset and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data.
- SpectrumDataset and its subclasses are now lance.torch.data.LanceDataset subclasses, providing native PyTorch integration.
- Dataset classes no longer have a loader() method.
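To illustrate the batch-iterator and dictionary-based pattern described in the list above, here is a minimal, standard-library-only sketch. It is not Depthcharge's implementation: the iter_batches function, the DEFAULT_FIELDS names, and the scan_id field are all hypothetical, and a plain dict of lists stands in for a pyarrow.RecordBatch.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Sequence

# Hypothetical default columns for this sketch (not Depthcharge's schema).
DEFAULT_FIELDS = ("precursor_mz", "mz_array", "intensity_array")


def iter_batches(
    spectra: Iterable[Dict[str, Any]],
    batch_size: int,
    custom_fields: Sequence[str] = (),
) -> Iterator[Dict[str, List[Any]]]:
    """Yield column-oriented batches, one dict of lists per batch."""
    fields = list(DEFAULT_FIELDS) + list(custom_fields)
    spectra = iter(spectra)
    while True:
        # Pull at most batch_size spectra without loading the whole input.
        chunk = list(islice(spectra, batch_size))
        if not chunk:
            return
        # Fields missing from a spectrum become None instead of raising.
        yield {name: [s.get(name) for s in chunk] for name in fields}


# Five made-up spectra, each with one extra field the caller asks for.
records = [
    {
        "precursor_mz": 500.0 + i,
        "mz_array": [100.0, 200.0],
        "intensity_array": [1.0, 0.5],
        "scan_id": i,
    }
    for i in range(5)
]

batches = list(iter_batches(records, batch_size=2, custom_fields=["scan_id"]))
for batch in batches:
    # Keyed access keeps working no matter how many extra fields are added.
    print(batch["scan_id"], batch["precursor_mz"])
```

Because each batch is keyed by field name, requesting an additional field does not change how existing code reads the fields it already uses, which is the practical benefit of returning dictionaries rather than tuples.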
Added
- Support for small molecules.
- Added the StreamingSpectrumDataset for fast inference.
- Added spectra_to_df, spectra_to_parquet, and spectra_to_stream to the depthcharge.data module.
Changed
- Determining the mass spectrometry data file format is now less fragile.
It now looks for known line contents, rather than relying on the extension.
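
As a rough illustration of content-based format detection, the sketch below scans the first lines of a file for strings characteristic of each format. It is an assumption about how such a check could work, not Depthcharge's actual code: the sniff_format function and the MARKERS table are hypothetical, and the markers are simply well-known features of the mzML, mzXML, and MGF formats.

```python
import tempfile
from pathlib import Path

# Strings that appear near the top of each format. These come from the
# file formats themselves, not from Depthcharge internals.
MARKERS = {
    "mzml": "<mzML",
    "mzxml": "<mzXML",
    "mgf": "BEGIN IONS",
}


def sniff_format(path, max_lines: int = 200) -> str:
    """Guess a peak file's format from its first lines, ignoring the extension."""
    with open(path, "r", errors="replace") as handle:
        # zip stops after max_lines, so large files are never read in full.
        for _, line in zip(range(max_lines), handle):
            for fmt, marker in MARKERS.items():
                if marker in line:
                    return fmt
    raise ValueError(f"Unrecognized file format: {path}")


# An MGF file saved with a misleading .txt extension is still recognized.
with tempfile.TemporaryDirectory() as tmp:
    mislabeled = Path(tmp) / "spectra.txt"
    mislabeled.write_text(
        "BEGIN IONS\nTITLE=example\nPEPMASS=500.25\n100.0 1.0\nEND IONS\n"
    )
    detected = sniff_format(mislabeled)

print(detected)
```

Limiting the scan to the first max_lines lines keeps the check cheap for large files, at the cost of missing a marker that appears unusually late.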