**mzTab**

_General_

mzTab is meant to be a lightweight, tab-delimited file format for proteomics data. Its target audience is primarily researchers outside of proteomics. It should be easy to parse and contain only the minimal information required to evaluate the results of a proteomics experiment. The aim of the format is to present those results in a computationally accessible overview, not to provide the detailed evidence behind them or to allow the process that led to them to be recreated. Both of these functions are provided through links to more detailed representations in other formats, in particular mzIdentML and mzQuantML [Ref1](https://code.google.com/archive/p/mztab/). mzTab can be used on its own or together with those formats [Ref2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189001/).

**Warning**: Although mzTab can be used to report a detailed view of the data, it explicitly does not aim to capture the whole complexity and evidence trail of a proteomics study. Even the most complex mzTab files still include simplifications and assumptions about the experimental results. This is the case, for instance, for identification results (e.g. protein inference/grouping is supported only to a limited extent) and quantification results (e.g. the coordinates for isotope patterns in quantified two-dimensional “features” cannot be fully reported). This missing information can be reported using the existing PSI standard formats mzIdentML and mzQuantML.

_File content_

Sections (each line starts with a three-letter prefix):
- MTD: metadata. This section was deliberately kept flexible, and most fields are optional, so the producer of a file can report different levels of experimental annotation, ranging from basic to complete.
- PRH: protein header
- PRT: protein identifications
- PEH: peptide header
- PEP: peptide identifications. This section is used to report aggregated quantification data based on several PSMs.
- PSH: peptide-spectrum match header
- PSM: peptide-spectrum matches. This section indicates whether the peptides were unambiguously assigned to a given protein (see the `unique` column).
- SMH: small molecule header
- SML: small molecule identifications
- COM: comments
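
Because every line carries one of the prefixes listed above, a file can be inspected without any dedicated library. The following is a minimal sketch, using only the Python standard library, that counts lines per prefix. The function name `count_line_prefixes` and the miniature `EXAMPLE` file are hypothetical and made up for illustration; they are not taken from a real dataset.

```python
import io
from collections import Counter

# Three-letter line prefixes from the list above.
SECTION_PREFIXES = {
    "MTD", "PRH", "PRT", "PEH", "PEP", "PSH", "PSM", "SMH", "SML", "COM",
}


def count_line_prefixes(handle):
    """Count lines per mzTab prefix in an open text handle."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        prefix = line.split("\t", 1)[0]
        counts[prefix if prefix in SECTION_PREFIXES else "unknown"] += 1
    return counts


# Hypothetical miniature file, for illustration only.
EXAMPLE = (
    "COM\tillustrative example, not real data\n"
    "MTD\tmzTab-version\t1.0\n"
    "MTD\tmzTab-mode\tComplete\n"
    "\n"
    "PSH\tsequence\tPSM_ID\taccession\n"
    "PSM\tFAMPLSR\t2\tAT1G12410.1\n"
    "PSM\tLTIPR\t3\tAT1G34220.2\n"
)

prefix_counts = count_line_prefixes(io.StringIO(EXAMPLE))
print(dict(prefix_counts))
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8")` in place of the `io.StringIO` wrapper.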


<!-- ![Fig1](/home/tiago/documents/lncRNA/Study/Fig1_mztabe_content.jpg) -->


```python
from pyteomics import mztab

# Load a local mzTab result file (path as used in the original session)
mztab_path = "/home/tiago/documents/lncRNA/massive/ccms_result/Dataset A_Replicate 1_Mudpit_ft2010122202 (F020529).mzTab"
MassIVE_mztab = mztab.MzTab(mztab_path)
```

```python
# Metadata (MTD) section as an ordered mapping
MassIVE_mztab.metadata
```

Output (truncated):

```text
OrderedDict([('mzTab-version', 1.0),
             ('mzTab-mode', 'Complete'),
             ('mzTab-type', 'Identification'),
             ('mzTab-ID', 'RESULT-00003'),
             ('title', 'no assay title provided (mzIdentML)'),
             ('description',
              'Spectrum Identification Protocol: Enzymes - Trypsin; Database Filters -'),
             ('software[1]', ('Mascot', '2.2.07')),
             ('software[2]', ('Scaffold', 'Scaffold_4.11.0')),
             ('software[2]-setting[1]', 'Scaffold: Minimum Peptide Count = 2'),
             ('software[2]-setting[2]',
              'Scaffold: Minimum Protein Probability = 0.99'),
             ('software[2]-setting[3]',
              'Scaffold: Minimum Peptide Probability = 0.95'),
             ('software[2]-setting[4]',
              'minimum number of enzymatic termini = 0'),
             ('software[2]-setting[5]',
              'Scaffold:Samples In Mudpit Mode = Mudpit_ft2010122202 (F020529)'),
             ('software[2]-se
```
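
The metadata keys follow a visible pattern (`software[n]` and `software[n]-setting[m]`), so the settings can be grouped by the software entry they belong to. The sketch below assumes the mapping looks like the output shown above and copies only a few of its entries into a literal; `settings_by_software` is a hypothetical helper name, not part of pyteomics.

```python
import re
from collections import OrderedDict

# A few entries copied from the metadata output shown above.
metadata = OrderedDict([
    ("mzTab-version", 1.0),
    ("software[1]", ("Mascot", "2.2.07")),
    ("software[2]", ("Scaffold", "Scaffold_4.11.0")),
    ("software[2]-setting[1]", "Scaffold: Minimum Peptide Count = 2"),
    ("software[2]-setting[2]", "Scaffold: Minimum Protein Probability = 0.99"),
    ("software[2]-setting[3]", "Scaffold: Minimum Peptide Probability = 0.95"),
])

# Matches keys such as "software[2]-setting[3]" and captures the software index.
SETTING_KEY = re.compile(r"^software\[(\d+)\]-setting\[\d+\]$")


def settings_by_software(meta):
    """Group setting strings by the index of the software entry they belong to."""
    grouped = {}
    for key, value in meta.items():
        match = SETTING_KEY.match(key)
        if match:
            grouped.setdefault(int(match.group(1)), []).append(value)
    return grouped


software_settings = settings_by_software(metadata)
for index, settings in software_settings.items():
    name, version = metadata[f"software[{index}]"]
    print(name, version, settings)
```

With a loaded file, passing `MassIVE_mztab.metadata` instead of the literal should work the same way, provided the keys have the form shown in the output.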

```python
# Column names of the PSM section
MassIVE_mztab.spectrum_match_table.columns
```

Output:

```text
Index(['sequence', 'PSM_ID', 'accession', 'unique', 'database',
       'database_version', 'search_engine', 'search_engine_score[1]',
       'search_engine_score[2]', 'search_engine_score[3]', 'modifications',
       'retention_time', 'charge', 'exp_mass_to_charge', 'calc_mass_to_charge',
       'spectra_ref', 'pre', 'post', 'start', 'end',
       'opt_global_mzidentml_original_ID',
       'opt_global_cv_MS:1002217_decoy_peptide',
       'opt_global_cv_PRIDE:0000091_rank', 'opt_global_pass_threshold',
       'opt_global_valid', 'opt_global_invalid_reason',
       'opt_global_cv_MS:1002354_PSM-level_q-value'],
      dtype='object')
```
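
The column list mixes standard PSM columns with optional ones, which carry an `opt_` prefix. The sketch below separates the two groups and, for optional columns that embed a controlled-vocabulary accession, pulls out the accession and the parameter name. The `opt_global_cv_<accession>_<name>` pattern used by the regular expression is an assumption inferred from the column names shown above.

```python
import re

# Column names copied from the output above.
columns = [
    "sequence", "PSM_ID", "accession", "unique", "database",
    "database_version", "search_engine", "search_engine_score[1]",
    "search_engine_score[2]", "search_engine_score[3]", "modifications",
    "retention_time", "charge", "exp_mass_to_charge", "calc_mass_to_charge",
    "spectra_ref", "pre", "post", "start", "end",
    "opt_global_mzidentml_original_ID",
    "opt_global_cv_MS:1002217_decoy_peptide",
    "opt_global_cv_PRIDE:0000091_rank", "opt_global_pass_threshold",
    "opt_global_valid", "opt_global_invalid_reason",
    "opt_global_cv_MS:1002354_PSM-level_q-value",
]

# Optional columns start with "opt_"; everything else is a standard column.
optional = [name for name in columns if name.startswith("opt_")]
standard = [name for name in columns if not name.startswith("opt_")]

# Assumed naming pattern: opt_global_cv_<CV accession>_<parameter name>.
CV_COLUMN = re.compile(r"^opt_global_cv_([A-Za-z]+:\d+)_(.+)$")
cv_terms = [m.groups() for m in map(CV_COLUMN.match, optional) if m]

print(len(standard), "standard columns,", len(optional), "optional columns")
print(cv_terms)
```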

```python
# Select the columns that link each PSM to a protein and a spectrum
MassIVE_mztab.spectrum_match_table[
    ["accession", "sequence", "database", "pre", "post", "start", "end", "spectra_ref"]
]
```

| PSM_ID | accession | sequence | database | pre | post | start | end | spectra_ref |
|---|---|---|---|---|---|---|---|---|
| 1 | AT1G22110.1 | CLKSCFVGMR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 264 | 273 | ms_run[1]:index=10831 |
| 2 | AT1G12410.1 | FAMPLSR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 183 | 189 | ms_run[1]:index=1283 |
| 3 | AT1G34220.2 | LTIPR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 23 | 27 | ms_run[1]:index=10192 |
| 4 | AT1G56030.1 | YDQGSSIRKEK | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 265 | 275 | ms_run[1]:index=13680 |
| 5 | AT1G56030.1 | NLLETCFTGQKNLK | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 242 | 255 | ms_run[1]:index=21687 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14808 | REV_AT3G58030.3 | LSEIR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 217 | 221 | ms_run[1]:index=27 |
| 14809 | REV_AT3G58030.3 | LSEIR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 217 | 221 | ms_run[1]:index=30 |
| 14808 | REV_AT3G58030.4 | LSEIR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 217 | 221 | ms_run[1]:index=27 |
| 14809 | REV_AT3G58030.4 | LSEIR | TAIR10_pep_20101214_FR_CP_BSA.fasta | | | 217 | 221 | ms_run[1]:index=30 |
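
Each `spectra_ref` value in the table points back to a spectrum in a run, in the form `ms_run[n]:index=m`. The sketch below splits such a reference into its two numbers. It handles only the `index=` form seen in this table; other files may use other reference formats, which this pattern would reject. `parse_spectra_ref` is a hypothetical helper name.

```python
import re

# Matches references of the form "ms_run[1]:index=10831", as seen in the table above.
SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:index=(\d+)$")


def parse_spectra_ref(ref):
    """Return (ms_run number, spectrum index) for an 'ms_run[n]:index=m' reference."""
    match = SPECTRA_REF.match(ref)
    if match is None:
        raise ValueError(f"unsupported spectra_ref format: {ref!r}")
    return int(match.group(1)), int(match.group(2))


# Values copied from the first three rows of the table.
refs = ["ms_run[1]:index=10831", "ms_run[1]:index=1283", "ms_run[1]:index=10192"]
parsed = [parse_spectra_ref(ref) for ref in refs]
print(parsed)
```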


```python
# Select the descriptive columns of the protein (PRT) section
MassIVE_mztab.protein_table[
    ["accession", "description", "taxid", "species", "database",
     "database_version", "search_engine"]
]
```

| accession (index) | accession | description | taxid | species | database | database_version | search_engine |
|---|---|---|---|---|---|---|---|
| AT1G22110.1 | AT1G22110.1 | structural constituent of ribosome | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| AT1G12410.1 | AT1G12410.1 | CLP protease proteolytic subunit 2 | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| AT1G34220.2 | AT1G34220.2 | Regulator of Vps4 activity in the MVB pathway ... | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| AT1G56030.1 | AT1G56030.1 | RING/U-box superfamily protein | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| REV_AT2G28150.1 | REV_AT2G28150.1 | Domain of unknown function (DUF966) | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| REV_AT1G31790.1 | REV_AT1G31790.1 | Tetratricopeptide repeat (TPR)-like superfamil... | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| AT2G25280.1 | AT2G25280.1 | CONTAINS InterPro DOMAIN/s: UPF0103/Mediator o... | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| REV_AT5G44560.2 | REV_AT5G44560.2 | SNF7 family protein | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
| REV_AT3G58030.1 | REV_AT3G58030.1 | RING/U-box superfamily protein | | | TAIR10_pep_20101214_FR_CP_BSA.fasta | Unknown | [Scaffold, Mascot] |
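
Several accessions in the protein table carry a `REV_` prefix. Assuming this prefix marks decoy (reversed-sequence) entries, which is an interpretation of the naming and not something the file states in the rows shown, target and decoy entries can be separated as sketched below. `DECOY_PREFIX` and `split_targets_and_decoys` are hypothetical names, and the list holds only the rows visible in the table.

```python
# Assumption: accessions starting with "REV_" are decoy entries.
DECOY_PREFIX = "REV_"

# Accessions copied from the rows visible in the protein table above.
accessions = [
    "AT1G22110.1", "AT1G12410.1", "AT1G34220.2", "AT1G56030.1",
    "REV_AT2G28150.1", "REV_AT1G31790.1", "AT2G25280.1",
    "REV_AT5G44560.2", "REV_AT3G58030.1",
]


def split_targets_and_decoys(names, decoy_prefix=DECOY_PREFIX):
    """Partition accessions into (targets, decoys) by prefix."""
    targets = [name for name in names if not name.startswith(decoy_prefix)]
    decoys = [name for name in names if name.startswith(decoy_prefix)]
    return targets, decoys


targets, decoys = split_targets_and_decoys(accessions)
print(len(targets), "target entries,", len(decoys), "decoy entries")
```

On the full table, the same split could be applied to `MassIVE_mztab.protein_table["accession"]`; the `opt_global_cv_MS:1002217_decoy_peptide` column of the PSM table may offer a more direct decoy flag, if it is populated.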
