Export to pandas DataFrame

NOTE: This feature requires pyOpenMS >= 3.0. At the time of writing, this means using one of the nightly builds as described in the Installation Instructions.

In pyOpenMS some data structures can be converted to a tabular format as a pandas.DataFrame. This allows convenient access to data and meta values of spectra, features and identifications.

Required imports for the examples:

from pyopenms import *
import pandas as pd
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/'

MSExperiment

pyopenms.MSExperiment.get_df( long=False )

Generates a pandas DataFrame with all peaks in the MSExperiment

Parameters:

long : default False

Set to True to get a long (expanded/melted) DataFrame with one row per peak. This is faster, but the RT value is repeated for every peak of a spectrum. If False, each row holds one spectrum in the style: rt, np.array(mz), np.array(int)

Returns:

pandas.DataFrame

peak map information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1.mzML', 'BSA1.mzML')
exp = MSExperiment()
MzMLFile().load('BSA1.mzML', exp)

df = exp.get_df() # default: long = False
df.head(2)

exp.get_df()

   RT          mzarray                                            intarray
0  1501.41394  [300.0897645621494, 300.18132740129533, 300.20...  [3431.0261, 1181.809, 1516.1746, 1719.8547, 11...
1  1503.03125  [300.06577092599525, 300.08932376441896, 300.2...  [914.79034, 1842.2311, 2395.1025, 851.4738, 16...

df = exp.get_df(long=True)
df.head(2)

exp.get_df(long=True)

   RT          mz          inty
0  1501.41394  300.089752  3431.026123
1  1501.41394  300.181335  1181.808960
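
The relationship between the two layouts can be sketched with pandas and NumPy alone. The snippet below builds a small stand-in for the wide DataFrame and expands it into the long layout. Only the column names (RT, mzarray, intarray, mz, inty) are taken from the output above; the values are made up for illustration, and this is not necessarily how get_df works internally.

```python
import numpy as np
import pandas as pd

# Toy stand-in for exp.get_df(): one row per spectrum (illustrative values).
wide_df = pd.DataFrame({
    "RT": [1501.4, 1503.0],
    "mzarray": [np.array([300.09, 300.18, 300.20]), np.array([300.07, 300.09])],
    "intarray": [np.array([3431.0, 1181.8, 1516.2]), np.array([914.8, 1842.2])],
})

# Number of peaks per spectrum; each RT is repeated that many times.
peaks_per_spectrum = wide_df["mzarray"].map(len).to_numpy()

# Long layout: one row per peak, RT replicated for every peak of a spectrum.
long_df = pd.DataFrame({
    "RT": np.repeat(wide_df["RT"].to_numpy(), peaks_per_spectrum),
    "mz": np.concatenate(wide_df["mzarray"].tolist()),
    "inty": np.concatenate(wide_df["intarray"].tolist()),
})

# In the long layout, peak-level filtering is a plain boolean selection.
window = long_df[(long_df["mz"] >= 300.08) & (long_df["mz"] <= 300.19)]
```

The long layout trades memory (repeated RT values) for simpler peak-level operations such as filtering, grouping or plotting.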

PeptideIdentifications

pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ''}, export_unidentified=True )

Generates a pandas DataFrame from a list of PeptideIdentification objects

Parameters:

peps :

list of PeptideIdentification objects

decode_ontology : default True

if meta values contain a CV identifier (e.g., from PSI-MS), it is automatically decoded into the human-readable CV term name.

default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ''}

default value for missing values for each data type

export_unidentified : default True

also export PeptideIdentifications that have no PeptideHit

Returns:

pandas.DataFrame

peptide identifications in a DataFrame

Example:

urlretrieve(url+'small.idXML', 'small.idXML')
prot_ids = []
pep_ids = []
IdXMLFile().load('small.idXML', prot_ids, pep_ids)

df = peptide_identifications_to_df(pep_ids)
df.head(2)
peptide_identifications_to_df(pep_ids)
  id RT mz q-value charge protein_accession start end NuXL:z2 mass NuXL:z3 mass ... isotope_error NuXL:peptide_mass_z0 NuXL:XL_U NuXL:sequence_score
0 OpenNuXL_2019-12-04T16:39:43_1021782429466859437 900.425415 414.730865 0.368649 4 DECOY_sp|Q86UQ0|ZN589_HUMAN 255 267 828.458069 552.641113 ... 0 1654.901611 0 0.173912
1 OpenNuXL_2019-12-04T16:39:43_7293634134684008928 903.565186 506.259521 0.422779 2 sp|P61313|RL15_HUMAN 179 187 0.0 0.0 ... 0 1010.504639 0 0.290786
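
Once the identifications are in a DataFrame, typical post-processing is ordinary pandas. The following is a minimal sketch on a hand-made table, not on the real output: the column names and the DECOY_ accession prefix are taken from the example above, while the values, the 0.01 q-value cutoff and the assumption that an unidentified row carries the default missing values (-9999, NaN, '') are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for peptide_identifications_to_df(pep_ids) (illustrative values).
pep_df = pd.DataFrame({
    "id": ["run_A", "run_A", "run_A", "run_A"],
    "RT": [900.4, 903.6, 910.2, 915.8],
    "mz": [414.73, 506.26, 612.31, 455.12],
    "q-value": [0.368, 0.004, 0.002, np.nan],
    "charge": [4, 2, 3, -9999],  # -9999 is assumed to mark a missing int
    "protein_accession": [
        "DECOY_sp|Q86UQ0|ZN589_HUMAN",
        "sp|P61313|RL15_HUMAN",
        "DECOY_sp|P61313|RL15_HUMAN",
        "",  # assumed: row exported without a PeptideHit
    ],
})

# Replace the integer sentinel by NaN (the column becomes float).
charge_clean = pep_df["charge"].where(pep_df["charge"] != -9999)

# Keep identified, non-decoy rows below the (arbitrary) q-value cutoff.
identified = pep_df[pep_df["protein_accession"] != ""]
is_decoy = identified["protein_accession"].str.startswith("DECOY_")
confident_targets = identified[~is_decoy & (identified["q-value"] <= 0.01)]
```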

FeatureMap

pyopenms.FeatureMap.get_df( meta_values=None, export_peptide_identifications=True )

Generates a pandas DataFrame with information contained in the FeatureMap.

Optionally the feature meta values and information for the assigned PeptideHit can be exported.

Parameters:

meta_values : default None

meta values to include (None, [custom list of meta value names] or 'all')

export_peptide_identifications (bool): default True

Export sequence and score for the best PeptideHit assigned to a feature. Additionally, the ID_filename (file name of the corresponding ProteinIdentification) and the ID_native_id (spectrum ID of the corresponding Feature) are exported. Both are also annotated as meta values when collecting all assigned PeptideIdentifications from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame generated from the assigned peptides with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

pandas.DataFrame

feature information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1_F1_idmapped.featureXML', 'BSA1_F1_idmapped.featureXML')
feature_map = FeatureMap()
FeatureXMLFile().load('BSA1_F1_idmapped.featureXML', feature_map)

df = feature_map.get_df() # default: meta_values = None
df.head(2)
feature_map.get_df()
id peptide_sequence peptide_score ID_filename ID_native_id charge RT mz RTstart RTend mzstart mzend quality intensity
9650885788371886430 LVTDLTK 0.000000 unknown spectrum=1270 2 1942.600083 395.239277 1932.484009 1950.834351 395.239199 397.245758 0.808494 157572000.0
18416216708636999474 DDSPDLPK 0.034483 unknown spectrum=1167 2 1749.138335 443.711224 1735.693115 1763.343506 443.711122 445.717531 0.893553 54069300.0
df = feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
df.head(2)
feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
id charge RT mz RTstart RTend mzstart mzend quality intensity FWHM spectrum_index spectrum_native_id label score_correlation score_fit
9650885788371886430 2 1942.600083 395.239277 1932.484009 1950.834351 395.239199 397.245758 0.808494 157572000.0 10.061090 259 spectrum=1270 168 0.989969 0.660286
18416216708636999474 2 1749.138335 443.711224 1735.693115 1763.343506 443.71112 445.717531 0.893553 54069300.0 14.156094 156 spectrum=1167 169 0.999002 0.799234
df = feature_map.get_df(meta_values = [b'FWHM', b'label'])
df.head(2)
feature_map.get_df(meta_values = [b'FWHM', b'label'])
id charge RT mz RTstart RTend mzstart mzend quality intensity FWHM label
9650885788371886430 2 1942.600083 395.239277 1932.484009 1950.834351 395.239199 397.245758 0.808494 157572000.0 10.061090 168
18416216708636999474 2 1749.138335 443.711224 1735.693115 1763.343506 443.71112 445.717531 0.893553 54069300.0 14.156094 169

Extract assigned peptide identifications from a feature map

Peptide identifications can be mapped to their corresponding features in a FeatureMap. They can be extracted with pyopenms.FeatureMap.get_assigned_peptide_identifications(), which returns a list of PeptideIdentification objects.

pyopenms.FeatureMap.get_assigned_peptide_identifications()

Generates a list with peptide identifications assigned to a feature.

Adds 'ID_native_id' (feature spectrum id), 'ID_filename' (primary MS run path of corresponding ProteinIdentification) and 'feature_id' (unique ID of corresponding Feature) as meta values to the peptide hits. A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

[PeptideIdentification]

list of PeptideIdentification objects

A DataFrame can be created from the resulting list of PeptideIdentification objects using pyopenms.peptide_identifications_to_df(assigned_peptides). The feature map and peptide DataFrames share columns on which they can be merged, so that the complete information for peptides and features is available in a single DataFrame.

The columns for unambiguously merging the data frames:

  • feature_id: the unique feature identifier
  • ID_native_id: the feature spectrum native identifier
  • ID_filename: the filename (primary MS run path) of the corresponding ProteinIdentification

Example:

feature_df = feature_map.get_df()
assigned_peptides = feature_map.get_assigned_peptide_identifications()
assigned_peptide_df = peptide_identifications_to_df(assigned_peptides)

merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
merged_df.head(2)
pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
feature_id peptide_sequence peptide_score ID_filename ID_native_id charge_x RT_x mz_x RTstart RTend ... id RT_y mz_y q-value charge_y protein_accession start end OMSSA_score target_decoy
9650885788371886430 LVTDLTK 0.000000 unknown spectrum=1270 2 1942.600083 395.239277 1932.484009 1950.834351 ... OMSSA_2009-11-17T11:11:11_4731105163044641872 1933.405151 395.239349 0.000000 2 P02769|ALBU_BOVIN -1 -1 0.001084 True
18416216708636999474 DDSPDLPK 0.034483 unknown spectrum=1167 2 1749.138335 443.711224 1735.693115 1763.343506 ... OMSSA_2009-11-17T11:11:11_4731105163044641872 1738.033447 443.711243 0.034483 2 P02769|ALBU_BOVIN -1 -1 0.003951 True
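
The _x and _y suffixes in the merged output come from pandas: columns that exist in both frames but are not merge keys (here charge, RT and mz) are disambiguated with the default suffixes. The sketch below uses two hand-made frames that share only the three key columns with the real ones; the feature IDs and values are hypothetical. It shows how explicit suffixes, a left merge, a uniqueness check and a merge indicator can make the result easier to read and to verify.

```python
import pandas as pd

# Hand-made stand-ins for feature_df and assigned_peptide_df (hypothetical values).
feature_df = pd.DataFrame({
    "feature_id": ["f1", "f2", "f3"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167", "spectrum=980"],
    "ID_filename": ["unknown", "unknown", "unknown"],
    "charge": [2, 2, 3],
    "RT": [1942.6, 1749.1, 1600.0],
})
assigned_peptide_df = pd.DataFrame({
    "feature_id": ["f1", "f2"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "charge": [2, 2],
    "RT": [1933.4, 1738.0],
    "q-value": [0.0, 0.034483],
})

keys = ["feature_id", "ID_native_id", "ID_filename"]
merged_df = pd.merge(
    feature_df,
    assigned_peptide_df,
    on=keys,
    how="left",                          # keep features without an identification
    suffixes=("_feature", "_peptide"),   # instead of the default _x / _y
    validate="one_to_one",               # raise if the keys are not unique
    indicator=True,                      # adds a _merge column: both / left_only
)
```

With how="left", features without an assigned identification stay in the result and get NaN in the peptide columns; the default inner merge used in the example above drops them instead.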

ConsensusMap

pyopenms.ConsensusMap.get_df()

Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.

Returns:

pandas.DataFrame

consensus map meta data and intensity stored in pandas DataFrame

pyopenms.ConsensusMap.get_intensity_df()

Generates a pandas DataFrame with feature intensities from each sample in long format (over files).

For labelled analyses, channel intensities will be in one row, resulting in a semi-long/block format. The resulting DataFrame can be joined with the result of get_metadata_df on their shared index 'id'.

Returns:

pandas.DataFrame

intensity DataFrame

pyopenms.ConsensusMap.get_metadata_df()

Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).

The resulting DataFrame can be joined with the result of get_intensity_df on their shared index 'id'.

Returns:

pandas.DataFrame

DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)

Examples:

urlretrieve(url+'ProteomicsLFQ_1_out.consensusXML', 'ProteomicsLFQ_1_out.consensusXML')
consensus_map = ConsensusMap()
ConsensusXMLFile().load('ProteomicsLFQ_1_out.consensusXML', consensus_map)

df = consensus_map.get_df()
df.head(2)
df = consensus_map.get_intensity_df()
df.head(2)

consensus_map.get_intensity_df()

id                    BSA1_F1.mzML  ...  BSA1_F2.mzML
2935923263525422257   0.0           ...  0.0
10409195546240342212  1358151.0     ...  0.0

df = consensus_map.get_metadata_df()
df.head(2)

consensus_map.get_metadata_df()

id                    sequence                    charge  RT           mz          quality
2935923263525422257   DGDIEAEISR                  3       1523.370634  368.843773  0.000000
10409195546240342212  SHC(Carbamidomethyl)IAEVEK  3       1552.032973  358.174576  0.491247
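
Joining the two frames on their shared index 'id' can be sketched as follows. The frames are rebuilt by hand from the two outputs above: the columns elided by "..." are left out, and the IDs are stored as strings for simplicity, which is an assumption about the index type rather than a statement about what pyOpenMS returns.

```python
import pandas as pd

ids = pd.Index(["2935923263525422257", "10409195546240342212"], name="id")

# Hand-made stand-in for consensus_map.get_metadata_df().
metadata_df = pd.DataFrame(
    {
        "sequence": ["DGDIEAEISR", "SHC(Carbamidomethyl)IAEVEK"],
        "charge": [3, 3],
        "RT": [1523.370634, 1552.032973],
        "mz": [368.843773, 358.174576],
        "quality": [0.000000, 0.491247],
    },
    index=ids,
)

# Hand-made stand-in for consensus_map.get_intensity_df() (elided columns omitted).
intensity_df = pd.DataFrame(
    {"BSA1_F1.mzML": [0.0, 1358151.0], "BSA1_F2.mzML": [0.0, 0.0]},
    index=ids,
)

# DataFrame.join aligns on the index, so no key column has to be named.
combined_df = metadata_df.join(intensity_df)

# Example follow-up: keep consensus features quantified in at least one file.
sample_columns = list(intensity_df.columns)
quantified_df = combined_df[(combined_df[sample_columns] > 0).any(axis=1)]
```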