**mzTab**

_General_

mzTab is a lightweight, tab-delimited file format for proteomics data. Its target audience is primarily researchers outside of proteomics, so it should be easy to parse and contain only the minimal information required to evaluate the results of a proteomics experiment. The aim is to present those results as a computationally accessible overview, not to provide the detailed evidence for them or to allow the process that led to them to be recreated. Both of those functions are provided through links to more detailed representations in other formats, in particular mzIdentML and mzQuantML [Ref1](https://code.google.com/archive/p/mztab/). mzTab can be used on its own or together with those formats [Ref2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189001/).

**Warning**: Although mzTab can be used to report a detailed view of the data, it explicitly does not aim to capture the whole complexity and evidence trail of a proteomics study. Even the most complex mzTab files still include simplifications and assumptions about the experimental results. This is the case, for instance, in identification results (e.g. protein inference/grouping is only supported to a limited extent) and quantification results (e.g. the coordinates for isotope patterns in quantified two-dimensional “features” cannot be fully reported). This missing information can be reported using the existing PSI standard formats mzIdentML and mzQuantML.

_File content_

Sections (identified by the three-letter prefix at the start of each line):
- MTD: metadata - was deliberately kept flexible, and the majority of fields are optional. Therefore, it is possible to report different levels of experimental annotation depending on the interest of the producer of the files, ranging from basic annotations to the complete
- PRH: protein header
- PRT: protein identifications
- PEH: peptide header
- PEP: peptide identifications - used to report aggregated quantification data based on several PSMs.
- PSH: peptide-spectrum match header
- PSM: peptide-spectrum matches - the `unique` column indicates whether the peptide was unambiguously assigned to a given protein.
- SMH: small molecule header
- SML: small molecules identifications
- COM: comments 
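
The line prefixes listed above make the format straightforward to split with plain string handling. The following is a minimal sketch, not the pyteomics implementation: it assumes every line starts with its three-letter prefix followed by a tab, and both the `SAMPLE` text and the `split_sections` helper are hypothetical names written only for illustration.

```python
from collections import defaultdict

# Hypothetical, heavily shortened mzTab text used only for illustration.
SAMPLE = "\n".join([
    "COM\tExample file",
    "MTD\tmzTab-version\t1.0",
    "MTD\tmzTab-mode\tComplete",
    "",
    "PSH\tsequence\tPSM_ID\taccession\tunique",
    "PSM\tVFGPHQWEILR\t1\tA0A068U1Z5_COFCA\t1",
    "PSM\tYLEDKTSVPYEPVYSDEQAR\t2\tA0A068TTF0_COFCA\t1",
])


def split_sections(text):
    """Group the tab-separated fields of each line by its prefix."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sections
        prefix, *fields = line.split("\t")
        sections[prefix].append(fields)
    return dict(sections)


sections = split_sections(SAMPLE)

# Pair every PSM row with the column names from the PSH header row.
header = sections["PSH"][0]
psms = [dict(zip(header, row)) for row in sections["PSM"]]
print(psms[0]["accession"])
```

Keeping the header (`PSH`) and data (`PSM`) rows under separate keys mirrors how the format distinguishes them; a real parser would also need to handle optional columns and type conversion.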


<!-- ![Fig1](/home/tiago/documents/lncRNA/Study/Fig1_mztabe_content.jpg) -->


In [2]:
from pyteomics import mztab

In [3]:
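# Note: this assignment rebinds the name `mztab` from the imported module to the parsed file object.
mztab = mztab.MzTab("/home/tiagoborelli/Documentos/lncRNA/coffie/generated/2experimentos.pride.mztab")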

In [13]:
mztab.keys()

odict_keys(['PRT', 'PEP', 'PSM', 'SML'])

In [4]:
mztab.metadata

OrderedDict([('mzTab-version', 1.0),
             ('mzTab-mode', 'Complete'),
             ('mzTab-type', 'Identification'),
             ('mzTab-ID', 'PXD002963-58016'),
             ('title', 'no assay title provided (mzIdentML)'),
             ('description',
              'Coffee is one of the most important commodities cultivated worldwide and has great economic impact in producing countries. Although 130 different species belonging to the coffea gender have been described, only two of them are commercially exploited: Coffea arabica and Coffea canephora. C. arabica is responsible for 61% of the world production (Van der Vossen et al., 2015). However, due to the narrow genetic back ground, classical genetic breeding is time consuming and takes around 30 years (Santana-Buzzy et al., 2007; Hendre et al., 2014). Several genetic engineering and biotechnological tools have been successfully applied in coffee breeding. Somatic embryogenesis (SE) is a process in which new viable embryos a

In [6]:
mztab.spectrum_match_table.columns

Index(['sequence', 'PSM_ID', 'accession', 'unique', 'database',
       'database_version', 'search_engine', 'search_engine_score[1]',
       'search_engine_score[2]', 'search_engine_score[3]',
       'search_engine_score[4]', 'modifications', 'retention_time', 'charge',
       'exp_mass_to_charge', 'calc_mass_to_charge', 'spectra_ref', 'pre',
       'post', 'start', 'end', 'opt_global_mzidentml_original_ID',
       'opt_global_cv_MS:1002217_decoy_peptide',
       'opt_global_cv_PRIDE:0000091_rank'],
      dtype='object')

In [8]:
mztab.spectrum_match_table[["accession","sequence","database",'pre', 'post', 'start', 'end',"spectra_ref"]]

| PSM_ID | accession | sequence | database | pre | post | start | end | spectra_ref |
|---|---|---|---|---|---|---|---|---|
| 1 | A0A068U1Z5_COFCA | VFGPHQWEILR | uniprot-taxonomy%3Acoffee.fasta | | | 364 | 374 | ms_run[3]:index=262 |
| 2 | A0A068TTF0_COFCA | YLEDKTSVPYEPVYSDEQAR | uniprot-taxonomy%3Acoffee.fasta | | | 312 | 331 | ms_run[8]:index=1954 |
| 3 | A0A068TTF0_COFCA | YLEDKTSVPYEPVYSDEQAR | uniprot-taxonomy%3Acoffee.fasta | | | 312 | 331 | ms_run[20]:index=1775 |
| 4 | A0A068TTF0_COFCA | YLEDKTSVPYEPVYSDEQAR | uniprot-taxonomy%3Acoffee.fasta | | | 312 | 331 | ms_run[17]:index=1996 |
| 5 | A0A068TTF0_COFCA | YLEDKTSVPYEPVYSDEQAR | uniprot-taxonomy%3Acoffee.fasta | | | 312 | 331 | ms_run[15]:index=1870 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16925 | A0A068UYE2_COFCA | LFILDYHDMLLPFIEGMNSLPGR | uniprot-taxonomy%3Acoffee.fasta | | | 504 | 526 | ms_run[2]:index=654 |
| 16926 | A0A068VF06_COFCA | IVNKWNTALIGLMTYFR | uniprot-taxonomy%3Acoffee.fasta | | | 1298 | 1314 | ms_run[2]:index=308 |
| 16927 | A0A068VF06_COFCA | IVNKWNTALIGLMTYFR | uniprot-taxonomy%3Acoffee.fasta | | | 1298 | 1314 | ms_run[4]:index=226 |
| 16926 | A0A068V8A1_COFCA | IVNKWNTALIGLMTYFR | uniprot-taxonomy%3Acoffee.fasta | | | 1299 | 1315 | ms_run[2]:index=308 |
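
The `spectra_ref` values in the output above follow the pattern `ms_run[N]:index=M`. A minimal sketch of splitting such a value into the run number and the spectrum index is shown here; the `parse_spectra_ref` helper is a hypothetical name, and the regular expression assumes only the `index=` form seen in this output, not any other spectrum identifier style a file might use.

```python
import re

# Matches only references of the form "ms_run[N]:index=M" (assumption:
# other identifier styles are out of scope for this sketch).
_SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:index=(\d+)$")


def parse_spectra_ref(ref):
    """Return (ms_run number, spectrum index) for an index-style reference."""
    match = _SPECTRA_REF.match(ref)
    if match is None:
        raise ValueError(f"unsupported spectra_ref: {ref!r}")
    return int(match.group(1)), int(match.group(2))


run, index = parse_spectra_ref("ms_run[3]:index=262")
print(run, index)
```

Raising an error for unrecognised references, rather than returning `None`, keeps silently malformed rows from slipping into later analysis.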


In [9]:
mztab.protein_table[['accession', 'description', 'taxid', 'species', 'database',
       'database_version', 'search_engine']]

| accession (index) | accession | description | taxid | species | database | database_version | search_engine |
|---|---|---|---|---|---|---|---|
| A0A068U1Z5_COFCA | A0A068U1Z5_COFCA | Eukaryotic translation initiation factor 3 sub... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068TTF0_COFCA | A0A068TTF0_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068U9U8_COFCA | A0A068U9U8_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068URX2_COFCA | A0A068URX2_COFCA | ATP-dependent Clp protease proteolytic subunit... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068TYH5_COFCA | A0A068TYH5_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| A0A068UX04_COFCA | A0A068UX04_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068ULJ5_COFCA | A0A068ULJ5_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068UYE2_COFCA | A0A068UYE2_COFCA | Lipoxygenase OS=Coffea canephora GN=GSCOC_T000... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
| A0A068VF06_COFCA | A0A068VF06_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] |
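
The `database` values in the output above contain the percent-encoded sequence `%3A`. As a small sketch, the standard library can decode such a value for display; whether the original FASTA file name actually contained a colon is an assumption, since the output only shows the encoded string.

```python
from urllib.parse import unquote

# Value copied from the `database` column in the output above.
encoded = "uniprot-taxonomy%3Acoffee.fasta"

# %3A is the percent-encoding of ":".
decoded = unquote(encoded)
print(decoded)
```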
